---
license: llama3.3
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
pipeline_tag: text-generation
tags:
- llm-as-judge
- evaluation
---
# Model Card for RootSignals-Judge-Llama-70B

**Root Judge** is a powerful mid-sized LLM that enables reliable and customizable LLM system evaluations.
Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

**Root Judge** surpasses the Llama-3.3-Instruct model and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.
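
To give a rough sense of why FP8 weights matter for deployment cost, the sketch below estimates weight-only memory at two precisions. The function name and the round figure of 70 billion parameters are illustrative assumptions, and the estimate ignores KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 70e9  # assumed round parameter count for a 70B model

fp8_gb = weight_memory_gb(PARAMS, 1)   # FP8: 1 byte per parameter
bf16_gb = weight_memory_gb(PARAMS, 2)  # bfloat16: 2 bytes per parameter

print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
```

Under these assumptions the FP8 weights take about half the memory of a bfloat16 copy, which is consistent with the 70 GB versus 140 GB VRAM figures listed in section 2.2.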
# 1. Intended Use Cases

**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in contexts such as:
- Detecting context-grounded hallucinations, e.g. in *Retrieval Augmented Generation* (RAG) settings, in an explainable manner that provides a justification for the score
- Making pairwise preference judgments, supported by strong instruction following on evaluation prompts
- Serving as a custom evaluation metric powered by use-case-specific evaluation rubrics
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments
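
As an illustration of the Best-of-N use case listed above, here is a minimal sketch that reduces N candidates to one winner through repeated pairwise comparisons. The `best_of_n` helper, the `PairwiseJudge` signature, and `toy_judge` are hypothetical names introduced for this example; in practice `toy_judge` would be replaced by a call to the judge model.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: returns True if answer `a` is preferred over `b`.
PairwiseJudge = Callable[[str, str, str], bool]


def best_of_n(prompt: str, candidates: Sequence[str], prefers_first: PairwiseJudge) -> str:
    """Pick a winner by comparing the current best against each remaining candidate."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    best = candidates[0]
    for challenger in candidates[1:]:
        # Keep the current best only if the judge prefers it over the challenger.
        if not prefers_first(prompt, best, challenger):
            best = challenger
    return best


def toy_judge(prompt: str, a: str, b: str) -> bool:
    """Stand-in for a real judge call: prefers the longer answer (ties keep `a`)."""
    return len(a) >= len(b)


winner = best_of_n("Explain DPO.", ["short", "a longer answer", "mid one"], toy_judge)
print(winner)
```

This sequential scheme needs N-1 judge calls. A common refinement, not shown here, is to query each pair twice with the order swapped to reduce position bias.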
# 2. Performance Summary

**Root Judge** outperforms leading closed models at detecting instruction-following failures in evaluations,
while providing detailed, structured justifications on long inputs of up to 32k tokens, both on internal benchmarks and on the public HaluBench test set.
## 2.1 Hallucination Detection (in RAG setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):

| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
| --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 14900 | **86.3** | **3.98** |
| 2 | GPT-4o | 14900 | 86.1 | 33.12 |
| 3 | o1-preview | 14899 | 85.3 | 1062* |
| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | 42.94 |
| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 27.43 |
| 6 | o1-mini | 14655 | 83.7 | 156 |
| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | 269.82 |

`*` = benchmarked as o1-preview; at current o1 prices, without reasoning tokens, the cost would start at $198.74 instead.

Local costs are based on Lambda Labs instance prices as of January 2025.

[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)
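
To make the cost column easier to compare across runs with slightly different sample counts, the sketch below normalizes a few rows of the HaluBench table above to cost per 1,000 samples. The numbers are copied from the table; the helper name `cost_per_1k` is introduced here for illustration only.

```python
# (test samples, total cost in USD) copied from the HaluBench table above
results = {
    "Root Judge": (14900, 3.98),
    "GPT-4o": (14900, 33.12),
    "Claude Sonnet-3.5": (14797, 42.94),
    "Llama3.1-405b-Instruct": (14881, 269.82),
}


def cost_per_1k(samples: int, cost_usd: float) -> float:
    """Normalize a total benchmark cost to USD per 1,000 evaluated samples."""
    return cost_usd / samples * 1000


baseline = cost_per_1k(*results["Root Judge"])
for model, (samples, cost) in results.items():
    per_1k = cost_per_1k(samples, cost)
    print(f"{model:<24} ${per_1k:7.3f} per 1k samples  ({per_1k / baseline:5.1f}x Root Judge)")
```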
## 2.2 Instruction Following

📊 Instruction-following performance across diverse benchmarks, compared to other open-weights judge and reward models (higher is better):

| Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MuSR-Murder (%) | MuSR-Object (%) | MuSR-Team (%) | Avg Score | Relative to Root Judge (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 70 | **94.6 ± 0.6** | **93.9** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
| 2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
| 3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
| 4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
| 5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
| 6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
| 7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |

[🔎 Detailed Performance Breakdown - Instruction Following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)
## 2.3 Root Signals Internal Benchmarks

📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62d2951150e2c6cd0c5aa35b/gX-JZbEnDBnCfvZlE11kC.png)
*Image 1: Total pass@1 rates and consistency (delta) assessed via an ensemble of leading third-party models.*

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62d2951150e2c6cd0c5aa35b/7kYDeKMmWSl95GwCDxbMs.png)
*Image 2: Custom rubric instruction-following by high-level task.*

**Root Judge** was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling.
## 2.4 Other Benchmarks

<details>
<summary>📊 RewardBench</summary>

[RewardBench](https://huggingface.co/spaces/allenai/reward-bench)

| Benchmark Task         | Score | Total | Accuracy   |
|------------------------|-------|-------|------------|
| alpacaeval-easy        | 99.0  | 100   | 0.99       |
| alpacaeval-hard        | 93.0  | 95    | 0.97894737 |
| alpacaeval-length      | 86.0  | 95    | 0.90526316 |
| donotanswer            | 73.5  | 136   | 0.54044118 |
| hep-cpp                | 159.0 | 164   | 0.96951220 |
| hep-go                 | 159.0 | 164   | 0.96951220 |
| hep-java               | 161.0 | 164   | 0.98170732 |
| hep-js                 | 159.0 | 164   | 0.96951220 |
| hep-python             | 158.0 | 164   | 0.96341463 |
| hep-rust               | 152.0 | 164   | 0.92682927 |
| llmbar-adver-GPTInst   | 69.0  | 92    | 0.75       |
| llmbar-adver-GPTOut    | 39.0  | 47    | 0.82978723 |
| llmbar-adver-manual    | 32.0  | 46    | 0.69565217 |
| llmbar-adver-neighbor  | 74.0  | 134   | 0.55223881 |
| llmbar-natural         | 94.0  | 100   | 0.94       |
| math-prm               | 357.0 | 447   | 0.79865772 |
| mt-bench-easy          | 28.0  | 28    | 1.0        |
| mt-bench-hard          | 32.0  | 37    | 0.86486486 |
| mt-bench-med           | 40.0  | 40    | 1.0        |
| refusals-dangerous     | 73.5  | 100   | 0.735      |
| refusals-offensive     | 89.0  | 100   | 0.89       |
| xstest-should-refuse   | 140.5 | 154   | 0.91233766 |
| xstest-should-respond  | 245.0 | 250   | 0.98       |
| Chat                   |       |       | 0.96648045 |
| Chat Hard              |       |       | 0.74561404 |
| Safety                 |       |       | 0.83986486 |
| Reasoning              |       |       | 0.88103618 |

</details>

Although our main focus is nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to guard against known causes of performance drops such as catastrophic forgetting. We find that the model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

# 3. Getting Started

## 3.1 Via Root Signals Python SDK

The model is available on our [platform](https://app.rootsignals.ai/register?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals) as part of our evaluation suite, at no additional cost.

Install our [Python library](https://github.com/root-signals/rs-python-sdk):
```bash
pip install root-signals
```

Import:
```python
from root import RootSignals
client = RootSignals()
```

Create a custom evaluator powered by **Root Judge**:
```python
my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)
```

Execute:
```python
result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)  # normalized score between [0-1]
print(result.justification)  # detailed reasoning for the score
```

## 3.2 Locally

We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use cases, together with *XML tags* for important sections in your prompt. While the model can run on 80GB of VRAM, we recommend at least 96GB for evaluating long-context RAG inputs.

SGLang example for a single Nvidia H100 (80GB):
```bash
docker run \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v huggingface:/root/.cache/huggingface \
    --volume /etc/localtime:/etc/localtime:ro \
    -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
    python3 -m sglang.launch_server \
    --model-path root-signals/RootSignals-Judge-Llama-70B \
    --host 0.0.0.0 \
    --port 8000 \
    --mem-fraction-static 0.89 \
    --grammar-backend xgrammar \
    --enable-torch-compile \
    --disable-cuda-graph
```

We also validated the model on arm64 with [vLLM](https://github.com/vllm-project/vllm) on an Nvidia GH200, with maximum outputs of up to 64k tokens:
```bash
docker run \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v huggingface:/root/.cache/huggingface \
    --volume /etc/localtime:/etc/localtime:ro \
    -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
    --model root-signals/RootSignals-Judge-Llama-70B \
    --gpu-memory-utilization 0.95 \
    --max-model-len 64k \
    --block_size 16 \
    --enable_prefix_caching
```
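
The recommendation above to mark important prompt sections with XML tags can be sketched as a small helper. This is only an illustration: `tag_sections` and its `escape_content` option are hypothetical names introduced here, and whether to escape user content is a design choice that this model card does not prescribe.

```python
from xml.sax.saxutils import escape


def tag_sections(sections: dict, escape_content: bool = False) -> str:
    """Wrap each named section in an uppercase XML-style tag, one block per section."""
    blocks = []
    for name, text in sections.items():
        tag = name.upper()
        body = text.strip()
        if escape_content:
            # Optionally escape &, <, > so content cannot open or close tags by accident.
            body = escape(body)
        blocks.append(f"<{tag}>\n{body}\n</{tag}>")
    return "\n".join(blocks)


prompt = tag_sections({
    "question": "Which magazine was founded in Texas?",
    "document": "The Woman's Viewpoint was a woman's magazine founded in Texas in 1923.",
})
print(prompt)
```

Keeping each section in its own clearly delimited block makes it easier for the judge to cite the document separately from the question and the answer.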

Detect hallucinations from context. This example uses HaluBench:
```python
import os

import pandas as pd
from openai import OpenAI
from pprint import pprint
from pydantic import BaseModel

decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user.
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>
<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>
<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""

decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()

# Load the HaluBench test split, shuffle it, and take one example.
testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)
example_row = testset_df.iloc[0]


class DecomposeResponse(BaseModel):
    """Schema for the structured judge output."""
    REASONING: str
    VERDICT: str


# Use a different base URL for e.g. SGLang, OpenRouter, etc.
# The client requires an API key to be set; "EMPTY" is a placeholder for local servers.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
)

response = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",  # or `RootJudge` if you are using the RootSignals API
    messages=[
        {"role": "system", "content": decompose_system_instruction},
        {"role": "user", "content": decompose_prompt.format(
            question=example_row["question"],
            document=example_row["passage"],
            answer=example_row["answer"])},
    ],
    response_format=DecomposeResponse,
).choices[0].message.parsed

pprint(response.REASONING)
pprint(response.VERDICT)
```

```
> ('Following the instructions: #1, the key element in the question is the '
"nationality of the magazines. #2, the document states that 'The Woman's "
"Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
"is a British weekly women's magazine'. #3, the answer claims both magazines "
'are British. #4, checking each claim in the answer: a) The document does not '
"support the claim that The Woman's Viewpoint is British, instead, it says "
"the magazine was founded in Texas. b) There's no reasonable inference from "
"the document that would suggest The Woman's Viewpoint is British. c) The "
"claim about The Woman's Viewpoint is contradicted by the document. #5, the "
'answer introduces information (both being British) not supported by the '
'document. #6, additional information about both magazines being British is '
'introduced in the answer without being present in the document or question. '
'#7, the answer makes an unjustified assumption by stating both magazines are '
"British despite the document clearly stating The Woman's Viewpoint was "
'founded in Texas, implying it is not British. Therefore, the answer fails to '
'accurately reflect the information provided in the document and makes '
'unjustified assumptions based on the information given in the question and '
"document.', ")
'FAIL'
```
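
To turn individual verdicts like the one above into an aggregate rate, the judge outputs can be compared against reference labels. The sketch below is one possible way to do this and is not the evaluation harness used for the reported numbers. The helper names are hypothetical, reference labels are assumed to be `"PASS"`/`"FAIL"` strings, and malformed outputs are counted as misses.

```python
import json
from typing import Optional

VALID_VERDICTS = {"PASS", "FAIL"}


def parse_verdict(raw: str) -> Optional[str]:
    """Return "PASS"/"FAIL" from a JSON judge output, or None if it is malformed."""
    try:
        verdict = json.loads(raw).get("VERDICT")
    except (json.JSONDecodeError, AttributeError):
        return None
    return verdict if verdict in VALID_VERDICTS else None


def pass_at_1(raw_outputs: list, labels: list) -> float:
    """Fraction of samples whose single judge verdict matches the reference label."""
    if len(raw_outputs) != len(labels):
        raise ValueError("outputs and labels must have the same length")
    if not labels:
        return 0.0
    hits = sum(parse_verdict(raw) == label for raw, label in zip(raw_outputs, labels))
    return hits / len(labels)


outputs = [
    '{"REASONING": "...", "VERDICT": "FAIL"}',
    '{"REASONING": "...", "VERDICT": "PASS"}',
    "not json",  # malformed output, counted as a miss
]
labels = ["FAIL", "FAIL", "PASS"]
rate = pass_at_1(outputs, labels)
print(f"pass@1 = {rate:.2f}")
```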

# 4. Model Details

## 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.scorable.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed-precision on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland
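
For readers unfamiliar with the IPO loss named in the training regime above, the sketch below shows its commonly used per-pair form: a squared error that pulls the policy-versus-reference log-ratio margin toward `1 / (2 * beta)`. This is a generic illustration and not the authors' training code. The function name, the `beta` value, and the example log-probabilities are assumptions.

```python
def ipo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed regularization strength, not taken from the model card
) -> float:
    """Per-pair IPO loss: squared distance of the log-ratio margin from 1 / (2 * beta)."""
    # Margin between how much the policy (relative to the reference) favors
    # the chosen response versus the rejected one.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    return (margin - 1.0 / (2.0 * beta)) ** 2


# A pair where the policy equals the reference has margin 0.
print(ipo_loss(-2.0, -3.0, -2.0, -3.0))
# A pair whose margin already equals the target 1 / (2 * beta) = 5.
print(ipo_loss(-1.0, -4.0, -3.0, -1.0))
```

Unlike the logistic DPO loss, this objective penalizes margins that overshoot the target as well as those that fall short of it.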

# 5. Contact

**Links**
- [Scorable Homepage](https://www.scorable.ai/)
- [Scorable Platform](https://app.scorable.ai/?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals)
- [Python SDK](https://github.com/root-signals/rs-sdk/blob/main/python/README.md)
- [Python SDK Docs](https://sdk.rootsignals.ai/en/latest/quickstart.html)
- [TypeScript SDK](https://github.com/root-signals/rs-sdk/blob/main/typescript/README.md)
- [Discord](https://discord.gg/EhazTQsFnj)

**Email**
- hello@scorable.ai