Model Card for gemma-4-31b-it-distill-qwen3-0.6b-lora
This model is a fine-tuned version of Qwen3 0.6B. It has been trained using TRL.
Quick start
from transformers import pipeline
question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="sapbot/gemma-4-31b-it-distill-qwen3-0.6b-lora", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
Interesting changes in behavior
Gemma-3n-4b-distill-smollm2-360m-instruct showed almost no change in behavior: its behavior was simply typical, and it had no knowledge of itself at all (apart from the system prompt, which still told the distilled version that it is Hugging Face's model). This model shows a big difference: thinking.
The original Qwen3 0.6B model doesn't have structured reasoning; it just... reasons. Like humans, with no structure. Because of fine-tuning on Gemma 4, which has structured reasoning, these thinking patterns are now inside Qwen3 0.6B.
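To look at the reasoning separately from the final answer, the generated text can be split on the thinking tags. The sketch below is a minimal example and assumes this adapter keeps Qwen3's `<think>...</think>` tags in its output; the helper name `split_thinking` and the sample string are made up for illustration and are not real model output.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_thinking(generated_text):
    """Return (reasoning, answer) from one generated string.

    Assumes Qwen3-style <think>...</think> tags (an assumption about
    this adapter's output, not something the model card states).
    """
    start = generated_text.find(OPEN_TAG)
    if start == -1:
        # No thinking block at all: everything is the answer.
        return "", generated_text.strip()
    body_start = start + len(OPEN_TAG)
    end = generated_text.find(CLOSE_TAG, body_start)
    if end == -1:
        # Generation stopped mid-thought, e.g. max_new_tokens was too small.
        return generated_text[body_start:].strip(), ""
    reasoning = generated_text[body_start:end].strip()
    answer = generated_text[end + len(CLOSE_TAG):].strip()
    return reasoning, answer


# Hypothetical sample text, written by hand for this example.
sample = (
    "<think>\n1. Restate the question.\n2. Compare both options.\n</think>\n\n"
    "I would go to the future."
)
reasoning, answer = split_thinking(sample)
print(reasoning)
print(answer)
```

With a small `max_new_tokens` value the closing tag may never be generated; in that case the helper returns the partial reasoning and an empty answer.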
Training procedure
This model was trained with SFT.
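The card does not describe the training data, so the following is only a rough sketch of how teacher responses could be arranged for SFT. It assumes the conversational `messages` format that TRL's `SFTTrainer` accepts; the helper names `to_sft_record` and `to_jsonl` and the sample prompt and response are hypothetical.

```python
import json


def to_sft_record(prompt, teacher_response):
    """Pair one prompt with the teacher's reply as a chat-style SFT example."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": teacher_response},
        ]
    }


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical example pair; the real dataset is not given in the card.
records = [
    to_sft_record(
        "What is 2 + 2?",
        "<think>\nAdd the two numbers.\n</think>\n\n4",
    )
]
print(to_jsonl(records))
```

Saved to a file, data in this shape could presumably be loaded with the Datasets JSON loader and passed to an SFT run with a LoRA configuration, though the actual setup used for this model is not stated.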
Framework versions
- PEFT: 0.18.1
- TRL: 0.23.1
- Transformers: 4.57.6
- PyTorch: 2.10.0+cu126
- Datasets: 4.3.0
- Tokenizers: 0.22.2
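To compare a local environment against the versions listed above, a small check with the standard library is enough. This is a sketch: the version strings are copied from the list, while the PyPI distribution names (for example `torch` for PyTorch) and the helper names are assumptions.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above; distribution names are assumed.
EXPECTED = {
    "peft": "0.18.1",
    "trl": "0.23.1",
    "transformers": "4.57.6",
    "torch": "2.10.0+cu126",
    "datasets": "4.3.0",
    "tokenizers": "0.22.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Return {package: (expected, installed)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in compare_versions(EXPECTED).items():
        print(f"{package}: card lists {wanted}, installed {found or 'not installed'}")
```

A mismatch does not necessarily mean the adapter will fail to load; the list only records what was used for training.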
Citations
Cite TRL as:
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}
Brought to you by sapbot from Romarchive
