remiai3 committed on Aug 27, 2025

Commit d2edc36 (verified), 1 parent: c60b9e0

Upload app.py

Files changed (1)

app.py +28 -21

@@ -1,28 +1,39 @@
 from flask import Flask, render_template, request, jsonify
 from llama_cpp import Llama
-import re
 app = Flask(__name__)
-# Path to the local GGUF model weights
-MODEL_PATH = "models/oss_20b_gguf/gpt-oss-20b-Q2_K_L.gguf"  # update this path
-# Initialize model
-llm = Llama(
-    model_path=MODEL_PATH,
-    n_ctx=2048,
-    n_threads=8  # adjust based on your CPU
-)
-# Build adaptive prompt
 def build_prompt(history, user_text):
     system_prompt = (
-        "You are a helpful and adaptive assistant. Follow these rules strictly:\n"
-        "- If the user asks a simple or factual question, give a short, precise answer.\n"
-        "- If the user requests a story, essay, or letter, provide a longer, well-structured response.\n"
-        "- If the user asks for programming help or code, provide correct, complete, well-formatted code.\n"
-        "- Always keep answers clear, neat, and structured; use points when helpful.\n"
-        "- Output code inside proper Markdown code blocks with language tags for syntax highlighting.\n"
     )
     prompt = system_prompt + "\n\n"
     for turn in history:
@@ -42,7 +53,6 @@ def chat():
     prompt = build_prompt(history, user_message)
-    # Adjust max_tokens dynamically
     if any(word in user_message.lower() for word in ["story", "letter", "essay"]):
         max_out = 800
     elif any(word in user_message.lower() for word in ["code", "program", "script", "python", "java", "html", "c++"]):
@@ -57,10 +67,7 @@ def chat():
         stop=["\nUser:", "\nAssistant:"]
     )
-    text = resp["choices"][0]["text"].strip()
-    # Wrap fenced code blocks with copy button (handled in JS)
-    return jsonify({"response": text})
 if __name__ == "__main__":
     app.run(host="0.0.0.0", port=5000, debug=True)
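
The keyword check in `chat()` picks a generation budget from the user message. A minimal sketch of that selection as a standalone function follows. The 800-token long-form budget and both keyword lists come from the diff; `code_budget` and `default_budget` are placeholder values, because the diff does not show the real ones, and the function name `pick_max_tokens` is hypothetical.

```python
def pick_max_tokens(user_message, code_budget=600, default_budget=256):
    """Return a generation budget based on keywords in the message.

    code_budget and default_budget are illustrative placeholders; the
    diff elides the values app.py actually uses for those branches.
    """
    text = user_message.lower()
    long_form = ["story", "letter", "essay"]
    code_words = ["code", "program", "script", "python", "java", "html", "c++"]
    # Substring matching, as in the diff: "history" also contains "story".
    if any(word in text for word in long_form):
        return 800
    if any(word in text for word in code_words):
        return code_budget
    return default_budget
```

Because the check is a plain substring test, a message such as "history of Rome" lands in the long-form branch via "story", and "javascript" lands in the code branch via "java". Matching on whole words would avoid that, at the cost of special handling for "c++".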

@@ -1,28 +1,39 @@
 from flask import Flask, render_template, request, jsonify
 from llama_cpp import Llama
+import os
 app = Flask(__name__)
+# Update this path to your downloaded model weight
+MODEL_PATH = "models/oss_20b_gguf/gpt-oss-20b-Q2_K_L.gguf"
+# Detect GPU automatically: if llama-cpp-python was compiled with CUDA/Metal and GPU layers can be offloaded
+# Adjust n_gpu_layers for your GPU memory; 20-40 for mid GPUs, 60-100 for higher VRAM, 0 = CPU only
+try:
+    print("Trying GPU offload...")
+    llm = Llama(
+        model_path=MODEL_PATH,
+        n_ctx=2048,
+        n_threads=os.cpu_count(),
+        n_gpu_layers=40  # increase or decrease based on your GPU memory
+    )
+    print("GPU initialized successfully.")
+except Exception as e:
+    print(f"GPU failed: {e}\nFalling back to CPU.")
+    llm = Llama(
+        model_path=MODEL_PATH,
+        n_ctx=2048,
+        n_threads=os.cpu_count(),
+        n_gpu_layers=0  # CPU only
+    )
 def build_prompt(history, user_text):
     system_prompt = (
+        "You are a helpful assistant. Follow these:\n"
+        "- Simple Q: Short, precise.\n"
+        "- Story/letter/essay: Longer answer.\n"
+        "- Code: Complete, neat, Markdown fenced code with language tag.\n"
+        "- Use points when helpful.\n"
     )
     prompt = system_prompt + "\n\n"
     for turn in history:
@@ -42,7 +53,6 @@ def chat():
     prompt = build_prompt(history, user_message)
     if any(word in user_message.lower() for word in ["story", "letter", "essay"]):
         max_out = 800
     elif any(word in user_message.lower() for word in ["code", "program", "script", "python", "java", "html", "c++"]):
@@ -57,10 +67,7 @@ def chat():
         stop=["\nUser:", "\nAssistant:"]
     )
+    return jsonify({"response": resp["choices"][0]["text"].strip()})
 if __name__ == "__main__":
     app.run(host="0.0.0.0", port=5000, debug=True)
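
The try/except added in this commit repeats the `Llama(...)` call with only `n_gpu_layers` changed. One way to express the same fallback without the duplication is sketched below. `load_with_fallback` and `fake_factory` are hypothetical names introduced for illustration; in app.py the factory would be `Llama` itself. The sketch only retries when the first call raises. A CPU-only build of llama-cpp-python may accept a nonzero `n_gpu_layers` without raising, in which case the "gpu" label means only that the first attempt did not throw.

```python
def load_with_fallback(factory, gpu_layers=40, **kwargs):
    """Try factory with GPU offload, then retry CPU-only if it raises.

    factory is any callable that accepts n_gpu_layers plus keyword
    arguments (for example llama_cpp.Llama). Returns (model, mode).
    """
    try:
        return factory(n_gpu_layers=gpu_layers, **kwargs), "gpu"
    except Exception as exc:
        print(f"GPU load failed: {exc}; falling back to CPU.")
        return factory(n_gpu_layers=0, **kwargs), "cpu"


def fake_factory(n_gpu_layers, **kwargs):
    """Stand-in for Llama, used only here: fails when offload is requested."""
    if n_gpu_layers > 0:
        raise RuntimeError("no GPU backend")
    return {"n_gpu_layers": n_gpu_layers, **kwargs}


# With the stand-in, the first attempt raises and the CPU retry succeeds.
model, mode = load_with_fallback(fake_factory, model_path="model.gguf", n_ctx=2048)
```

With the real library the call would look like `load_with_fallback(Llama, model_path=MODEL_PATH, n_ctx=2048, n_threads=os.cpu_count())`, keeping the shared arguments in one place.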