Instructions for using apple/SimpleSD-4B-thinking with libraries and local apps.

Libraries

How to use apple/SimpleSD-4B-thinking with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="apple/SimpleSD-4B-thinking")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result; a bare call discards it when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("apple/SimpleSD-4B-thinking")
model = AutoModelForCausalLM.from_pretrained("apple/SimpleSD-4B-thinking")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use apple/SimpleSD-4B-thinking with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "apple/SimpleSD-4B-thinking"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "apple/SimpleSD-4B-thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
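
The same request can be issued from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the response layout (`choices[0].message.content`) is the usual OpenAI-compatible chat format.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="apple/SimpleSD-4B-thinking"):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; sending it does.
request = build_chat_request("What is the capital of France?")
print(request.full_url)
```

With a server running, `send_chat_request(request)` returns the reply as a string. Passing a different `base_url` (for example one ending in port 30000) should target any other OpenAI-compatible server in the same way.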

Use Docker

docker model run hf.co/apple/SimpleSD-4B-thinking

SGLang

How to use apple/SimpleSD-4B-thinking with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "apple/SimpleSD-4B-thinking" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "apple/SimpleSD-4B-thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "apple/SimpleSD-4B-thinking" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "apple/SimpleSD-4B-thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use apple/SimpleSD-4B-thinking with Docker Model Runner:

```
docker model run hf.co/apple/SimpleSD-4B-thinking
```

SimpleSD-4B-thinking / README.md

richardbaihe

Update README.md

bab61b5 verified 3 months ago

	---
	license: apple-amlr
	base_model:
	- Qwen/Qwen3-4B-Thinking-2507
	tags:
	- self-distillation
	- code-generation
	- ssd
	library_name: transformers
	---

	# SimpleSD-4B-thinking

	This model was produced using Simple Self-Distillation (SSD), a method that improves code generation by fine-tuning a language model on its own sampled outputs—without rewards, verifiers, teacher models, or reinforcement learning.

	- Self-distillation sampling: temperature=1.1, top_p=0.95, top_k=20
	- Evaluation sampling: temperature=0.7, top_p=0.95, top_k=20
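
As a rough illustration of what these three settings do, the sketch below applies temperature scaling, top-k truncation, and top-p (nucleus) truncation to a toy logit vector. The function name `truncated_distribution` and the example logits are hypothetical. The order of operations (temperature, then top-k, then top-p) is an assumption based on common practice, not something this card states.

```python
import math


def truncated_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p.

    Hypothetical helper for illustration; real decoders operate on tensors.
    """
    # Temperature scaling: T > 1 flattens the distribution, T < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Renormalize so top-p sees the distribution left after top-k.
    mass = sum(probs[i] for i in ranked)
    filtered = {i: probs[i] / mass for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += filtered[i]
        if cumulative >= top_p:
            break
    # Renormalize over the surviving tokens.
    mass = sum(filtered[i] for i in kept)
    return {i: filtered[i] / mass for i in kept}


toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0, -1.0]
# Settings listed above for self-distillation sampling and for evaluation.
train_dist = truncated_distribution(toy_logits, temperature=1.1, top_p=0.95, top_k=20)
eval_dist = truncated_distribution(toy_logits, temperature=0.7, top_p=0.95, top_k=20)
print(len(train_dist), len(eval_dist))
```

On this toy input the higher temperature leaves more candidate tokens in play than the lower one, which is the exploration-versus-precision trade-off the two settings express.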

	## Notes
	- These are research checkpoints for reproducibility.
	- They are not optimized Qwen releases.
	- They don't represent a broader open-source model strategy.

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained("apple/SimpleSD-4B-thinking")
	tokenizer = AutoTokenizer.from_pretrained("apple/SimpleSD-4B-thinking")
	```
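
The snippet above only loads the checkpoint. A minimal sketch of sampling a reply with the evaluation settings listed earlier in this card might look like the following. The helper name `generate_reply` is hypothetical, and the `max_new_tokens` default is an assumed budget rather than a value from the card; a thinking model may need a generous one to finish its reasoning.

```python
# Evaluation sampling settings quoted from this card.
EVAL_SAMPLING = {"do_sample": True, "temperature": 0.7, "top_p": 0.95, "top_k": 20}


def generate_reply(prompt, max_new_tokens=2048, model_id="apple/SimpleSD-4B-thinking"):
    """Hypothetical helper: load the checkpoint and sample one reply."""
    # Imported lazily so the settings above can be inspected without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    # max_new_tokens is an assumed budget, not a value from the card.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, **EVAL_SAMPLING)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("Write a function that reverses a linked list.")` downloads the weights on first use, so it is best run on a machine with enough memory for a 4B-parameter model.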

	## Method

	SSD samples solutions from the base model using non-unit temperature and top-k/top-p truncation, then fine-tunes on those samples via standard supervised learning. Despite its simplicity, SSD yields large gains on competitive programming benchmarks, with improvements concentrating on harder problems. The mechanism traces to resolving a precision–exploration conflict: SSD reshapes token distributions in a context-dependent way so that a single global decoding configuration becomes far more effective at evaluation time.
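
The data-collection half of this recipe can be sketched in a few lines. Everything named below (`build_ssd_dataset`, `make_toy_sampler`, the record layout) is hypothetical and stands in for a real model sampler. The sketch keeps every sample because the description above mentions no rewards or verifiers; whether the authors apply any other filtering is not stated here.

```python
import random


def build_ssd_dataset(prompts, sample_fn, samples_per_prompt=1):
    """Collect (prompt, completion) pairs from the model's own samples.

    No reward, verifier, or teacher is consulted: every sample is kept.
    """
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            dataset.append({"prompt": prompt, "completion": sample_fn(prompt)})
    return dataset


def make_toy_sampler(seed=0):
    """Stand-in for sampling from the base model with temperature and truncation."""
    rng = random.Random(seed)

    def sample(prompt):
        # A real sampler would call the model; this just picks a canned answer.
        return rng.choice([f"# solution A for: {prompt}", f"# solution B for: {prompt}"])

    return sample


prompts = ["reverse a string", "sum a list"]
dataset = build_ssd_dataset(prompts, make_toy_sampler(), samples_per_prompt=2)
print(len(dataset))
```

The resulting pairs would then feed an ordinary supervised fine-tuning loop (next-token cross-entropy on the completions), which is the second half of the method.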

	## Results

	LiveCodeBench (%)

	| Model | LCBv6 pass@1 | LCBv6 pass@5 | LCBv5 pass@1 | LCBv5 pass@5 |
	|---|---|---|---|---|
	| Qwen3-4B-Thinking-2507 (base) | 54.5 | 67.5 | 59.6 | 70.3 |
	| + SSD (this model) | 57.8 (+3.3) | 71.4 (+3.9) | 63.1 (+3.5) | 74.7 (+4.4) |
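
This card does not say how pass@k was estimated. One widely used choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass; the sketch below shows it as an assumption, not as the authors' evaluation code. It also recomputes the improvements in the table from the reported scores.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Assumed estimator for illustration; the card does not specify one.
    """
    if n - c < k:
        # Fewer than k failures: any draw of k samples contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Scores reported in the table above: (base, SSD) per column.
reported = {
    "LCBv6 pass@1": (54.5, 57.8),
    "LCBv6 pass@5": (67.5, 71.4),
    "LCBv5 pass@1": (59.6, 63.1),
    "LCBv5 pass@5": (70.3, 74.7),
}
deltas = {name: round(ssd - base, 1) for name, (base, ssd) in reported.items()}
print(deltas)
```

For a single problem with 10 samples and 3 passes, `pass_at_k(10, 3, 1)` is 0.3; the benchmark score is the mean of such values over all problems.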

	## Paper

	[Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193)

	```bibtex
	@misc{zhang2026embarrassinglysimpleselfdistillationimproves,
	title={Embarrassingly Simple Self-Distillation Improves Code Generation},
	author={Ruixiang Zhang and Richard He Bai and Huangjie Zheng and Navdeep Jaitly and Ronan Collobert and Yizhe Zhang},
	year={2026},
	eprint={2604.01193},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2604.01193},
	}
	```


	## License

	This model is released under the [Apple Machine Learning Research Model License](https://huggingface.co/apple/SimpleSD-4B-thinking/blob/main/LICENSE).