Instructions to use SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF",
	filename="Llama-3.1-Tulu-3-8B-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{"role": "user", "content": "Write a Python function that checks whether a number is prime."}
	]
)


llama.cpp

How to use SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M
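
Once `llama-server` is running through any of the routes above, it can be queried over HTTP. The following is a minimal sketch rather than an official client: it assumes the server is listening on its default address (`localhost:8080`) and uses the OpenAI-style `/v1/chat/completions` route. The helper names `build_chat_request` and `chat` are hypothetical, chosen only for illustration.

```python
import json
import urllib.request

# Assumption: llama-server was started with default host and port.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(user_prompt, system_prompt="You are a helpful assistant.", max_tokens=256):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(user_prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Write a Python function that checks whether a number is prime.")` should return the model's reply as a string. Building the request separately from sending it keeps the payload easy to inspect without a live server.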

Use Docker

docker model run hf.co/SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M

Ollama
How to use SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF with Ollama:
```
ollama run hf.co/SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M
```
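
After `ollama run` has pulled the model, the local Ollama service can also be called from code. The sketch below assumes Ollama's default local address (`localhost:11434`) and its `/api/chat` route, with streaming disabled so that a single JSON object is returned. The helper names `build_ollama_request` and `ask_ollama` are hypothetical.

```python
import json
import urllib.request

# Assumption: the Ollama service is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/chat"
# Same model reference as in the `ollama run` command above.
MODEL_NAME = "hf.co/SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M"


def build_ollama_request(user_prompt):
    """Build (but do not send) a non-streaming chat request for Ollama."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": False,  # ask for one JSON response instead of a chunk stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(user_prompt):
    """Send the prompt to the local Ollama service and return the reply text."""
    with urllib.request.urlopen(build_ollama_request(user_prompt)) as response:
        body = json.load(response)
    return body["message"]["content"]
```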

Unsloth Studio

How to use SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF to start chatting

Docker Model Runner
How to use SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF with Docker Model Runner:
```
docker model run hf.co/SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M
```

Lemonade

How to use SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull SandLogicTechnologies/Llama-3.1-Tulu-3-8B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Llama-3.1-Tulu-3-8B-GGUF-Q4_K_M

List all available models

lemonade list

Quantized Llama-3.1-Tulu-3-8B Models

This repository contains Q4_K_M and Q5_K_M quantized versions of the allenai/Llama-3.1-Tulu-3-8B model. These quantized variants are smaller and cheaper to run while retaining the core capabilities of Tülu 3, a leading instruction-following model family.

Model Overview

Original Model: Llama-3.1-Tulu-3-8B
Quantized Versions:
- Q4_K_M (4-bit quantization)
- Q5_K_M (5-bit quantization)
Base Architecture: 8B parameter instruction-following model
Developer: Allen Institute for AI
License: Llama 3.1 Community License Agreement
Language: Primarily English
Finetuned From: allenai/Llama-3.1-Tulu-3-8B-DPO

Quantization Details

Q4_K_M Version

Model size reduction: ~75% smaller than original
Memory footprint: 4.92 GB
Optimized for deployment in resource-constrained environments
Maintains core functionality with minimal performance impact

Q5_K_M Version

Model size reduction: ~69% smaller than original
Memory footprint: 5.73 GB
Higher precision than Q4_K_M
Better preservation of model quality

Key Features

Both quantized versions are intended to retain most of Tülu 3's performance on:

Instruction following tasks
Mathematical reasoning (MATH dataset)
Grade school math problems (GSM8K)
General instruction following (IFEval)
Chat-based interactions
Complex reasoning tasks

Usage

from llama_cpp import Llama

# Load a quantized GGUF file from disk (adjust the path to where you saved it).
llm = Llama(
    model_path="./models/Llama-3.1-Tulu-3-8B-Q4_K_M.gguf",
    verbose=False,
    # n_gpu_layers=-1, # Uncomment to use GPU acceleration
    # n_ctx=2048, # Uncomment to increase the context window
)

# Send a system prompt and a user message in chat format.
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an AI assistant that helps answer user questions."},
        {
            "role": "user",
            "content": "Write Python code to check whether a number is prime."
        }
    ]
)

# The reply text is in the first choice's message.
print(output["choices"][0]["message"]["content"])
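
For longer replies it can be nicer to print tokens as they arrive. The sketch below assumes `create_chat_completion` is called with `stream=True`, in which case llama-cpp-python yields OpenAI-style chunks whose text lives under `choices[0]["delta"]["content"]`. The helper `collect_stream` is a hypothetical name, not part of the library.

```python
def collect_stream(chunks, on_token=None):
    """Concatenate the text pieces from a stream of chat-completion chunks.

    chunks: an iterable of dicts shaped like OpenAI streaming chunks.
    on_token: optional callback invoked with each text piece as it arrives.
    """
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:  # role-only and final chunks carry no text
            parts.append(text)
            if on_token is not None:
                on_token(text)
    return "".join(parts)


# Usage with the `llm` object created above (requires a loaded model):
# stream = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "Explain quantization briefly."}],
#     stream=True,
# )
# reply = collect_stream(stream, on_token=lambda t: print(t, end="", flush=True))
```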

Training Data

The model was trained on a diverse mix of:

Publicly available datasets
Synthetic data
Human-created datasets

Bias, Risks, and Limitations

These quantized models inherit the limitations of the original Tülu3 model:

Limited safety training compared to models with active filtering
Can produce problematic outputs, especially when prompted to do so
Unknown composition of the base Llama 3.1 training corpus
Additional considerations for quantized versions:
- Slight degradation in performance compared to full-precision model
- May show increased variance in mathematical reasoning tasks
- Q4_K_M may exhibit more pronounced quality loss in complex scenarios

Recommended Use Cases

Research and development
Educational applications
Resource-constrained deployments
Edge computing scenarios
Prototyping and testing
Applications requiring faster inference

Acknowledgments

These quantized models are based on the work of the Allen Institute for AI and the Llama 3.1 team. Special thanks to Georgi Gerganov and the entire llama.cpp development team for their outstanding contributions.