wraith-coder-7b / TRAINING.md

Tyler Williams

Initial commit: Wraith Coder 7B - Concise code assistant via iterative fine-tuning

cc49567 7 months ago


	# Training Details

	## Iterative Fine-Tuning Methodology

	Wraith Coder 7B was developed through three successive training iterations. Each iteration fine-tunes the merged output of the previous one and adds progressively more advanced capabilities.

	### Iteration 1: Foundation (4,256 examples)

	Objective: Establish core personality and communication patterns

	Dataset Composition:
	- 1,213 identity formation examples
	- 1,650 logical reasoning patterns
	- 1,043 amplified logical analysis
	- 350 technical communication patterns

	Training Configuration:
	- Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
	- Method: LoRA (r=16, alpha=32, dropout=0.05)
	- Epochs: 2
	- Batch Size: 8 (effective)
	- Learning Rate: 5e-5
	- Duration: ~2 hours on RTX 3060
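
As a rough illustration of what these settings imply, the sketch below records the iteration 1 hyperparameters in a small dataclass and derives two quantities from them: the LoRA scaling factor (alpha / r in the standard LoRA formulation) and the number of optimizer steps. The class name `LoraRun`, the choice to count a final partial batch as a step, and the per-device batch size used in the last method are assumptions for illustration; none of them come from the training scripts.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRun:
    """Hyperparameters listed for iteration 1 (hypothetical container)."""

    examples: int = 4256
    rank: int = 16
    alpha: int = 32
    dropout: float = 0.05
    epochs: int = 2
    effective_batch_size: int = 8
    learning_rate: float = 5e-5

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / r.
        return self.alpha / self.rank

    @property
    def optimizer_steps(self) -> int:
        # One optimizer step per effective batch; a trailing partial
        # batch is counted as a step (assumption).
        steps_per_epoch = math.ceil(self.examples / self.effective_batch_size)
        return steps_per_epoch * self.epochs

    def accumulation_steps(self, per_device_batch: int) -> int:
        # On a single GPU: effective batch = per-device batch * accumulation.
        if self.effective_batch_size % per_device_batch != 0:
            raise ValueError("per-device batch must divide the effective batch")
        return self.effective_batch_size // per_device_batch
```

With the values above, the adapter update is scaled by 2.0 and training takes 1,064 optimizer steps (532 per epoch). A per-device batch of 2, for example, would need 4 gradient accumulation steps to reach the effective batch size of 8.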

	Outcomes:
	- Successfully established third-person communication style
	- Strong pattern recognition language
	- Foundation for signal-dense responses
	- Coding capability degradation observed (addressed in iteration 2)

	### Iteration 2: Coding Restoration (5,500 examples)

	Objective: Restore code generation while maintaining personality

	Dataset Composition:
	- 2,040 conversational coding examples
	- 2,040 computer science fundamentals
	- 920 algebraic reasoning problems
	- 200 identity reinforcement examples
	- 300 communication pattern anchors

	Training Configuration:
	- Base Model: wraith-iteration-1-merged
	- Method: LoRA (r=16, alpha=32, dropout=0.05)
	- Epochs: 2
	- Batch Size: 8 (effective)
	- Learning Rate: 5e-5
	- Duration: ~3 hours on RTX 3060

	Outcomes:
	- Code generation fully restored (100%)
	- Maintained personality characteristics
	- Enhanced conciseness (50-70% shorter responses)
	- Improved signal-to-noise ratio

	### Iteration 3: Advanced Capabilities (4,488 examples)

	Objective: Add systems programming and advanced algorithmic knowledge

	Dataset Composition:
	- 1,007 architectural design patterns
	- 1,041 algorithm design and optimization
	- 1,064 debugging techniques and strategies
	- 1,026 systems programming concepts
	- 150 identity anchor examples
	- 200 communication pattern reinforcement

	Training Configuration:
	- Base Model: wraith-iteration-2-merged
	- Method: LoRA (r=16, alpha=32, dropout=0.05)
	- Epochs: 2
	- Batch Size: 8 (effective)
	- Learning Rate: 5e-5
	- Duration: ~3 hours on RTX 3060

	Outcomes:
	- Complexity analysis coverage increased from 40% to 60%
	- Frequency of multiple solution approaches increased from 35% to 65%
	- Depth of trade-off articulation increased from 45% to 75%
	- Systems programming knowledge integration
	- Maintained 62.6% conciseness improvement
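
The dataset compositions above can be checked against the stated totals, and the share of personality-preserving examples in the later iterations can be computed directly. The sketch below does both. Grouping the identity and communication-pattern categories of iterations 2 and 3 as "anchors" is an interpretation of the lists above, and all variable and function names are hypothetical.

```python
# Category counts copied from the three "Dataset Composition" lists.
DATASETS = {
    1: {
        "identity formation": 1213,
        "logical reasoning patterns": 1650,
        "amplified logical analysis": 1043,
        "technical communication patterns": 350,
    },
    2: {
        "conversational coding": 2040,
        "computer science fundamentals": 2040,
        "algebraic reasoning": 920,
        "identity reinforcement": 200,
        "communication pattern anchors": 300,
    },
    3: {
        "architectural design patterns": 1007,
        "algorithm design and optimization": 1041,
        "debugging techniques and strategies": 1064,
        "systems programming concepts": 1026,
        "identity anchors": 150,
        "communication pattern reinforcement": 200,
    },
}

# Totals stated in the iteration headings.
STATED_TOTALS = {1: 4256, 2: 5500, 3: 4488}

# Categories treated as personality anchors (assumed grouping).
ANCHORS = {
    2: ("identity reinforcement", "communication pattern anchors"),
    3: ("identity anchors", "communication pattern reinforcement"),
}


def total(iteration: int) -> int:
    """Sum of all category counts for one iteration."""
    return sum(DATASETS[iteration].values())


def anchor_share(iteration: int) -> float:
    """Fraction of an iteration's examples that are anchor examples."""
    anchors = sum(DATASETS[iteration][name] for name in ANCHORS[iteration])
    return anchors / total(iteration)
```

Under this grouping, anchor examples make up about 9.1% of iteration 2 and about 7.8% of iteration 3, so each later iteration is mostly new technical material with a small slice that reinforces the style established in iteration 1.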

	## Hardware Requirements

	Training:
	- GPU: NVIDIA RTX 3060 (12GB VRAM) or equivalent
	- RAM: 32GB recommended
	- Storage: 50GB for model weights and checkpoints

	Inference:
	- GPU: 8GB VRAM minimum (with 4-bit quantization)
	- RAM: 16GB recommended
	- Storage: 5GB for quantized model
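
A back-of-the-envelope estimate shows why 4-bit quantization is what brings inference within reach of an 8GB GPU. The sketch below computes weight memory only, assuming a nominal 7 billion parameters. It ignores the KV cache, activations, and the scale and zero-point metadata that real quantization formats store, so actual usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


NOMINAL_PARAMS = 7e9  # "7B" taken at face value (assumption)

four_bit_gb = weight_memory_gb(NOMINAL_PARAMS, 4)   # about 3.5 GB
bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)      # about 14 GB
```

At 16 bits per weight the weights alone exceed both the 8GB inference minimum and the 12GB of an RTX 3060, while at 4 bits they leave headroom for the runtime overheads listed above.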

	## Training Framework

	- Primary: Unsloth (optimized for LoRA fine-tuning)
	- Backend: PyTorch 2.8.0 with CUDA 12.8
	- Precision: Mixed precision (BF16)
	- Gradient Checkpointing: Enabled for memory efficiency

	## Reproducibility

	All training scripts, datasets, and evaluation benchmarks are available in the associated repository. Training can be reproduced with:

	```bash
	# Iteration 1
	python train_wraith_iteration1.py

	# Merge iteration 1
	python merge_wraith_iteration1.py

	# Iteration 2
	python train_wraith_iteration2.py

	# Merge iteration 2
	python merge_wraith_iteration2.py

	# Iteration 3
	python train_wraith_iteration3.py

	# Final merge
	python merge_wraith_iteration3.py
	```
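
Because each iteration trains on the merged output of the previous one, a failed step should stop the whole sequence. The following is a minimal sketch of a driver that runs the six scripts named above in order and aborts on the first non-zero exit status. The function names and the injectable `runner` argument are illustrative and not part of the repository.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def pipeline_steps(iterations: int = 3) -> List[str]:
    """Script names in execution order: train, then merge, per iteration."""
    steps = []
    for i in range(1, iterations + 1):
        steps.append(f"train_wraith_iteration{i}.py")
        steps.append(f"merge_wraith_iteration{i}.py")
    return steps


def run_script(script: str) -> int:
    """Run one script with the current interpreter and return its exit status."""
    return subprocess.run([sys.executable, script]).returncode


def run_pipeline(
    runner: Optional[Callable[[str], int]] = None, iterations: int = 3
) -> List[str]:
    """Run every step in order; raise as soon as one step fails."""
    runner = runner or run_script
    completed = []
    for script in pipeline_steps(iterations):
        status = runner(script)
        if status != 0:
            raise RuntimeError(f"{script} exited with status {status}")
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the directory that contains the scripts would execute all six steps. Passing a custom `runner` makes it possible to dry-run the sequence without touching a GPU.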

	## Evaluation Methodology

	### 20-Question Comprehensive Benchmark

	Question Categories:
	- Data structures (tries, BSTs, stacks, caches)
	- Algorithms (sorting, searching, graph algorithms)
	- Systems design (distributed caches, file systems, rate limiters)
	- Concurrency (threading, synchronization, producer-consumer)
	- Architecture (recommendation systems, URL shorteners)

	Evaluation Metrics:
	- Response length (characters and lines)
	- Complexity analysis coverage (Big-O notation presence)
	- Multiple solution approaches
	- Trade-off discussion depth
	- Implementation correctness
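
The first two metrics are mechanical and can be sketched in a few lines. The example below measures response length in characters and lines and flags whether Big-O notation appears. The regular expression is only a simple heuristic, the function name is hypothetical, and this is not the project's evaluation code. The remaining metrics (solution variety, trade-off depth, correctness) call for judgment and are not covered here.

```python
import re
from typing import Dict, Union

# Matches expressions such as O(1), O(n), O(n log n); a heuristic only.
BIG_O = re.compile(r"\bO\([^)]+\)")


def response_metrics(text: str) -> Dict[str, Union[int, bool]]:
    """Length and complexity-notation metrics for a single model response."""
    return {
        "characters": len(text),
        "lines": len(text.splitlines()),
        "has_big_o": bool(BIG_O.search(text)),
    }
```

Running this over the same 20 prompts for both models would give per-question lengths that can be compared directly, plus a simple count of how many responses mention complexity at all.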

	Comparison Baseline:
	- Qwen/Qwen2.5-Coder-7B-Instruct (base model)
	- Identical prompts and inference parameters
	- Blind evaluation of response quality

	### Statistical Significance

	- Sample Size: 20 diverse coding challenges
	- Consistency: All 20 questions showed improvement
	- Average Improvement: 60.2% conciseness gain
	- Standard Deviation: 21.3% (per-question improvement ranged from 4% to 90%)
	- Confidence Level: 95%
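
The document does not say how the conciseness gain or the confidence level were computed. One plausible reading, sketched below, defines the gain as the relative reduction in response length against the baseline and builds a two-sided 95% Student's t interval around the mean from the reported standard deviation and sample size. All function names are hypothetical, and the interval is an illustration under those assumptions, not a reported result.

```python
import math
import statistics
from typing import List, Tuple

# Two-sided 95% critical value of Student's t with 19 degrees of freedom.
T_CRIT_95_DF19 = 2.093


def conciseness_gain(baseline_chars: int, model_chars: int) -> float:
    """Percentage reduction in response length relative to the baseline."""
    return (baseline_chars - model_chars) / baseline_chars * 100


def summarize(gains: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-question gains."""
    return statistics.mean(gains), statistics.stdev(gains)


def t_interval(
    mean: float, stdev: float, n: int, t_crit: float
) -> Tuple[float, float]:
    """Confidence interval for the mean: mean +/- t * s / sqrt(n)."""
    margin = t_crit * stdev / math.sqrt(n)
    return mean - margin, mean + margin


# Figures from the list above: mean 60.2, standard deviation 21.3, n = 20.
low, high = t_interval(60.2, 21.3, 20, T_CRIT_95_DF19)
```

Under these assumptions the interval runs from roughly 50.2% to 70.2%, which indicates how much the average could shift with a different set of 20 questions.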

	## Limitations and Future Work

	Current Limitations:
	- Optimized for experienced developers; may lack context for beginners
	- 7B parameter size limits extremely complex problem-solving
	- Training focused on general-purpose programming
	- English language only

	Potential Future Enhancements:
	- Multi-language support
	- Domain-specific iterations (embedded, ML, web)
	- Larger parameter variants (14B, 32B)
	- Instruction-following refinement
	- Tool use integration