	---
	license: llama3.3
	language:
	- en
	base_model:
	- meta-llama/Llama-3.3-70B-Instruct
	pipeline_tag: text-generation
	tags:
	- llm-as-judge
	- evaluation
	---
	# Model Card for RootSignals-Judge-Llama-70B

	Root Judge is a powerful mid-sized LLM that enables reliable and customizable LLM system evaluations.
	Root Judge was post-trained from Llama-3.3-70B-Instruct on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
	The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

	Root Judge surpasses the Llama-3.3-Instruct model and similarly sized open models on instruction following, and
	achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

	# 1. Intended Use Cases
	Root Judge is primarily intended to be used as an LLM-as-a-Judge in various contexts such as:
	- Detecting context-grounded hallucinations, e.g. for Retrieval Augmented Generation (RAG) settings in an explainable manner, providing a justification for the score
	- Pairwise preference judgments due to strong evaluation instruction-following capabilities
	- Serving as a custom evaluation metric powered by use case specific evaluation rubrics
	- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
	- Privacy-focused settings that require local deployments
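
The pairwise and Best-of-N use cases above can be sketched as a simple elimination loop around a judge call. This is a minimal illustration, not the prompt format Root Judge was trained on: `PAIRWISE_TEMPLATE`, `build_pairwise_prompt`, `best_of_n`, and the `stub_judge` stand-in are all hypothetical names, and in practice `judge` would be a function that sends the prompt to a served Root Judge endpoint and returns its reply.

```python
from typing import Callable, Sequence

# Hypothetical prompt template; the XML-style tags mirror the tagging advice
# given later in this card, but the exact wording is an assumption.
PAIRWISE_TEMPLATE = """<TASK>
Compare the two candidate responses to the user request and pick the better one.
</TASK>
<REQUEST>: {request} </REQUEST>
<RESPONSE_A>: {a} </RESPONSE_A>
<RESPONSE_B>: {b} </RESPONSE_B>
Answer with a single letter: A or B."""


def build_pairwise_prompt(request: str, a: str, b: str) -> str:
    """Fill the template with one request and two candidate responses."""
    return PAIRWISE_TEMPLATE.format(request=request, a=a, b=b)


def best_of_n(request: str, candidates: Sequence[str], judge: Callable[[str], str]) -> str:
    """Keep a running winner and compare it against each remaining candidate."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    winner = candidates[0]
    for challenger in candidates[1:]:
        verdict = judge(build_pairwise_prompt(request, winner, challenger))
        # The challenger replaces the winner only on an explicit "B" verdict.
        if verdict.strip().upper().startswith("B"):
            winner = challenger
    return winner


def stub_judge(prompt: str) -> str:
    """Stand-in judge for local testing: prefers the longer response."""
    a = prompt.split("<RESPONSE_A>: ")[1].split(" </RESPONSE_A>")[0]
    b = prompt.split("<RESPONSE_B>: ")[1].split(" </RESPONSE_B>")[0]
    return "B" if len(b) > len(a) else "A"
```

This loop needs N-1 judge calls for N candidates. Because pairwise judges can be sensitive to answer order, a more careful setup might query both orderings and keep only consistent verdicts.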

	# 2. Performance Summary

	Root Judge outperforms leading closed models at detecting instruction-following failures in evaluations,
	while providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench test set.

	## 2.1 Hallucination Detection (in RAG setting)

	📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):

	| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
	| --- | --- | --- | --- | --- |
	| 1 | Root Judge | 14900 | 86.3 | 3.98 |
	| 2 | GPT-4o | 14900 | 86.1 | 33.12 |
	| 3 | o1-preview | 14899 | 85.3 | 1062* |
	| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | 42.94 |
	| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 27.43 |
	| 6 | o1-mini | 14655 | 83.7 | 156 |
	| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | 269.82 |

	`*` = benchmarked as o1-preview; at current o1 prices, without reasoning tokens, the cost would start at $198.74 instead.

	Local costs are based on Lambda Labs instances at January 2025 prices.

	[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

	## 2.2 Instruction Following

	📊 Instruction-following performance across diverse benchmarks, compared to other open-weights judge and reward models (higher is better):

	| Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
	| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
	| 1 | Root Judge | 70 | 94.6 ± 0.6 | 93.9 | 52.8 ± 3.2 | 24.6 ± 2.7 | 56.8 ± 3.1 | 64.5 | 100 |
	| 2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
	| 3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
	| 4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
	| 5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
	| 6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | 60.8 ± 3.1 | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
	| 7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | 35.9 ± 3.0 | 43.2 ± 3.1 | 54.7 | 84.8 |
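
The card does not say how the last two columns are derived. A plausible reading, assumed here, is that "Avg Score" is the unweighted mean of the five benchmark columns and "Relative to Root Judge" is that mean as a percentage of Root Judge's mean. The sketch below recomputes both from the rounded table values; the names `SCORES`, `TABLE`, `avg_score`, and `relative_to` are illustrative only.

```python
# Benchmark columns in table order: GSM8K, IFEval, MUSR-Murder, MUSR-Object, MUSR-Team.
SCORES = {
    "Root Judge":    [94.6, 93.9, 52.8, 24.6, 56.8],
    "Llama-3.3-70B": [94.4, 93.4, 54.0, 23.4, 56.0],
    "Patronus-70B":  [91.7, 83.7, 54.4, 24.6, 48.8],
    "Nemotron-70B":  [80.1, 85.0, 53.6, 23.8, 55.6],
    "Qwen-2.5-32B":  [87.4, 87.5, 58.8, 23.1, 45.2],
    "Flow Judge":    [78.7, 64.6, 60.8, 23.4, 35.6],
    "Glider":        [78.7, 56.5, 59.2, 35.9, 43.2],
}

# (Avg Score, Relative to Root Judge) exactly as printed in the table.
TABLE = {
    "Root Judge":    (64.5, 100.0),
    "Llama-3.3-70B": (64.3, 99.5),
    "Patronus-70B":  (60.6, 93.9),
    "Nemotron-70B":  (59.6, 92.4),
    "Qwen-2.5-32B":  (60.4, 93.6),
    "Flow Judge":    (52.6, 81.5),
    "Glider":        (54.7, 84.8),
}


def avg_score(model: str) -> float:
    """Unweighted mean of the five benchmark scores (assumed definition)."""
    values = SCORES[model]
    return sum(values) / len(values)


def relative_to(model: str, reference: str = "Root Judge") -> float:
    """Mean score as a percentage of the reference model's mean."""
    return 100.0 * avg_score(model) / avg_score(reference)


for name, (table_avg, table_rel) in TABLE.items():
    print(f"{name:14s} avg={avg_score(name):.2f} (table {table_avg}) "
          f"rel={relative_to(name):.2f} (table {table_rel})")
```

Under this assumption the recomputed values agree with the table to within 0.1 in every row. The small differences are consistent with the published columns having been computed from unrounded scores.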

	[🔎 Detailed Performance Breakdown - Instruction Following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

	## 2.3 Root Signals Internal Benchmarks

	📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/rHq5RakEPkOlnC69MOl1e.png)
	Image 1: Total pass@1 rates and consistency (delta), assessed via an ensemble of leading third-party models.


	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/zfsh6HTbYH1HpLItWgq8u.png)
	Image 2: Custom rubric instruction-following by high level task.

	Root Judge was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling.

	## 2.4 Other Benchmarks

	<details>
	<summary>📊 RewardBench</summary>

	[RewardBench](https://huggingface.co/spaces/allenai/reward-bench)

	| Benchmark Task | Score | Total | Accuracy |
	| --- | --- | --- | --- |
	| alpacaeval-easy | 99.0 | 100 | 0.99 |
	| alpacaeval-hard | 93.0 | 95 | 0.97894737 |
	| alpacaeval-length | 86.0 | 95 | 0.90526316 |
	| donotanswer | 73.5 | 136 | 0.54044118 |
	| hep-cpp | 159.0 | 164 | 0.96951220 |
	| hep-go | 159.0 | 164 | 0.96951220 |
	| hep-java | 161.0 | 164 | 0.98170732 |
	| hep-js | 159.0 | 164 | 0.96951220 |
	| hep-python | 158.0 | 164 | 0.96341463 |
	| hep-rust | 152.0 | 164 | 0.92682927 |
	| llmbar-adver-GPTInst | 69.0 | 92 | 0.75 |
	| llmbar-adver-GPTOut | 39.0 | 47 | 0.82978723 |
	| llmbar-adver-manual | 32.0 | 46 | 0.69565217 |
	| llmbar-adver-neighbor | 74.0 | 134 | 0.55223881 |
	| llmbar-natural | 94.0 | 100 | 0.94 |
	| math-prm | 357.0 | 447 | 0.79865772 |
	| mt-bench-easy | 28.0 | 28 | 1.0 |
	| mt-bench-hard | 32.0 | 37 | 0.86486486 |
	| mt-bench-med | 40.0 | 40 | 1.0 |
	| refusals-dangerous | 73.5 | 100 | 0.735 |
	| refusals-offensive | 89.0 | 100 | 0.89 |
	| xstest-should-refuse | 140.5 | 154 | 0.91233766 |
	| xstest-should-respond | 245.0 | 250 | 0.98 |
	| Chat | | | 0.96648045 |
	| Chat Hard | | | 0.74561404 |
	| Safety | | | 0.83986486 |
	| Reasoning | | | 0.88103618 |

	</details>

	Although our main focus is nuanced and transparent judgement of candidate responses,
	we test the judge model checkpoints extensively on public and private benchmarks
	to avoid known issues with performance drops, such as catastrophic forgetting. We find that the model
	preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization,
	while also slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

	# 3. Getting Started

	## 3.1 Via Root Signals Python SDK

	The model is available on our [platform](https://app.rootsignals.ai/register?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals) as part of our evaluation suite, at no additional cost.

	Install our [Python library](https://github.com/root-signals/rs-python-sdk):
	```bash
	pip install root-signals
	```

	Import:
	```python
	from root import RootSignals
	client = RootSignals()
	```

	Create a custom evaluator powered by Root Judge:
	```python
	my_custom_judge = client.evaluators.create(
	    name="Political Text Evaluator",
	    intent="To measure the politics-relatedness of a given text",
	    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
	    model="RootJudge",
	)
	```

	Execute:
	```python
	result = my_custom_judge.run(
	    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
	)
	print(result.score) # normalized score between [0-1]
	print(result.justification) # detailed reasoning for the score
	```

	## 3.2 Locally

	We recommend using [SGLang](https://github.com/sgl-project/sglang) for production use cases, together with XML tags for important sections in your prompt. While the model can run on 80GB VRAM, we recommend at least 96GB for evaluating long-context RAG inputs.
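
A rough back-of-the-envelope estimate suggests why long contexts push past 80GB. The sketch below is an approximation under stated assumptions, not a measurement. It assumes about 70 billion parameters stored at 1 byte each (FP8), a 16-bit KV cache, and the base Llama 70B architecture values (80 layers, 8 key-value heads, head dimension 128), which are taken from the public base-model configuration rather than from this card. Activations, runtime buffers, and memory fragmentation are ignored.

```python
# Architecture constants assumed from the public Llama 70B base-model config.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
APPROX_PARAMS = 70e9  # approximate parameter count (assumption)

GIB = 2**30


def weight_bytes(n_params: float = APPROX_PARAMS, bytes_per_param: int = 1) -> float:
    """Memory for the weights; FP8 stores one byte per parameter."""
    return n_params * bytes_per_param


def kv_cache_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """Memory for keys and values across all layers for `tokens` cached tokens."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return tokens * per_token


if __name__ == "__main__":
    context = 32 * 1024  # a 32k-token input
    print(f"weights : {weight_bytes() / GIB:.1f} GiB")
    print(f"KV cache: {kv_cache_bytes(context) / GIB:.1f} GiB for {context} tokens")
```

Under these assumptions a single 32k-token sequence needs about 10 GiB of KV cache on top of roughly 65 GiB of weights, which leaves little headroom on an 80GB card once runtime overhead and concurrent requests are included.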

	SGLang example for a single Nvidia H100 (80GB):
	```bash
	docker run \
	--gpus all \
	--ipc=host \
	-p 8000:8000 \
	-v huggingface:/root/.cache/huggingface \
	--volume /etc/localtime:/etc/localtime:ro \
	-d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
	python3 -m sglang.launch_server \
	--model-path root-signals/RootSignals-Judge-Llama-70B \
	--host 0.0.0.0 \
	--port 8000 \
	--mem-fraction-static 0.89 \
	--grammar-backend xgrammar \
	--enable-torch-compile \
	--disable-cuda-graph
	```
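
Once the SGLang container above is running, a quick smoke test against its OpenAI-compatible chat endpoint can confirm the model is being served. The following is a minimal sketch using only the Python standard library; `build_chat_request` and `ask` are hypothetical helper names, and the base URL assumes the `-p 8000:8000` port mapping shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the port mapping from the docker command
MODEL = "root-signals/RootSignals-Judge-Llama-70B"


def build_chat_request(user_text: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /chat/completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_text: str, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(user_text), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be up):
# print(ask("Reply with the single word: ready"))
```

Loading a 70B model can take several minutes, so a connection error right after `docker run` may only mean the server is still starting.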

	We also validated the model on arm64 with [vLLM](https://github.com/vllm-project/vllm) on an Nvidia GH200, with max outputs of up to 64k tokens:
	```bash
	docker run \
	--gpus all \
	--ipc=host \
	-p 8000:8000 \
	-v huggingface:/root/.cache/huggingface \
	--volume /etc/localtime:/etc/localtime:ro \
	-d drikster80/vllm-gh200-openai:v0.6.4.post1 \
	--model root-signals/RootSignals-Judge-Llama-70B \
	--gpu-memory-utilization 0.95 \
	--max-model-len 64k \
	--block_size 16 \
	--enable_prefix_caching
	```

	Detect hallucinations from context (this example uses HaluBench):
	```python
	decompose_system_instruction = """
	<TASK>
	You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user.
	Always follow the instructions below and provide your reasoning and verdict in the format specified.
	</TASK>

	<INSTRUCTIONS>
	#1. Identify key elements in the question.
	#2. List all relevant facts provided in the document.
	#3. Break down the answer into its component claims.
	#4. For each claim in the answer:
	#a. Is it explicitly supported by the document? If yes, quote the relevant part.
	#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
	#c. Is it unsupported or contradicted by the document? If yes, explain why.
	#5. Check for any information in the answer that's present in the question but not in the document.
	#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
	#7. Assess if the answer makes any unjustified connections or assumptions.
	</INSTRUCTIONS>

	<OUTPUT_EXAMPLE>
	{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
	</OUTPUT_EXAMPLE>
	"""

	decompose_prompt = """
	<QUESTION>: {question} </QUESTION>
	<DOCUMENT>: {document} </DOCUMENT>
	<ANSWER>: {answer} </ANSWER>
	""".strip()

	import os
	import pandas as pd
	from openai import OpenAI
	from pprint import pprint
	from pydantic import BaseModel

	# Load the HaluBench test split, shuffle it, and pick one example row
	testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
	testset_df = testset_df.sample(frac=1).reset_index(drop=True)
	example_row = testset_df.iloc[0]

	# Schema for the structured judge output
	class DecomposeResponse(BaseModel):
	    REASONING: str
	    VERDICT: str

	# The OpenAI client requires an API key to be set, even if the local server ignores it
	client = OpenAI(
	    base_url="http://localhost:8000/v1",  # use a different one for e.g. sglang, openrouter, etc.
	    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
	)

	response = client.beta.chat.completions.parse(
	    model="root-signals/RootSignals-Judge-Llama-70B",  # or `RootJudge` if you are using the RootSignals API
	    messages=[
	        {"role": "system", "content": decompose_system_instruction},
	        {"role": "user", "content": decompose_prompt.format(
	            question=example_row["question"],
	            document=example_row["passage"],
	            answer=example_row["answer"])},
	    ],
	    response_format=DecomposeResponse,
	).choices[0].message.parsed

	pprint(response.REASONING)
	pprint(response.VERDICT)
	```

	```
	> ('Following the instructions: #1, the key element in the question is the '
	"nationality of the magazines. #2, the document states that 'The Woman's "
	"Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
	"is a British weekly women's magazine'. #3, the answer claims both magazines "
	'are British. #4, checking each claim in the answer: a) The document does not '
	"support the claim that The Woman's Viewpoint is British, instead, it says "
	"the magazine was founded in Texas. b) There's no reasonable inference from "
	"the document that would suggest The Woman's Viewpoint is British. c) The "
	"claim about The Woman's Viewpoint is contradicted by the document. #5, the "
	'answer introduces information (both being British) not supported by the '
	'document. #6, additional information about both magazines being British is '
	'introduced in the answer without being present in the document or question. '
	'#7, the answer makes an unjustified assumption by stating both magazines are '
	"British despite the document clearly stating The Woman's Viewpoint was "
	'founded in Texas, implying it is not British. Therefore, the answer fails to '
	'accurately reflect the information provided in the document and makes '
	'unjustified assumptions based on the information given in the question and '
	"document.', ")
	'FAIL'
	```
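
To turn per-example verdicts like the one above into an aggregate score, the judge's verdict can be compared against a reference label. The scoring script behind the reported numbers is not given in this card, so the following is only a sketch. It assumes each reference label is the string `"PASS"` or `"FAIL"` (as in the HaluBench `label` column, to the best of our knowledge), and it counts malformed judge replies as incorrect, which is a design choice rather than a documented rule. `parse_verdict` and `pass_rate` are hypothetical helper names.

```python
import json
from typing import Iterable, Optional

VALID_VERDICTS = {"PASS", "FAIL"}


def parse_verdict(raw_reply: str) -> Optional[str]:
    """Extract "PASS" or "FAIL" from a JSON judge reply; return None if malformed."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    verdict = str(data.get("VERDICT", "")).strip().upper()
    return verdict if verdict in VALID_VERDICTS else None


def pass_rate(raw_replies: Iterable[str], labels: Iterable[str]) -> float:
    """Percentage of examples where the judge verdict matches the reference label."""
    pairs = list(zip(raw_replies, labels))
    if not pairs:
        raise ValueError("no examples to score")
    # A reply that cannot be parsed yields None, which never equals a label.
    correct = sum(parse_verdict(reply) == label for reply, label in pairs)
    return 100.0 * correct / len(pairs)


replies = [
    '{"REASONING": "claim contradicted by the document", "VERDICT": "FAIL"}',
    '{"REASONING": "all claims supported", "VERDICT": "pass"}',
    "not valid json",
    '{"VERDICT": "PASS"}',
]
labels = ["FAIL", "PASS", "PASS", "FAIL"]
print(pass_rate(replies, labels))
```

When structured outputs are enforced through `response_format`, as in the example above, malformed replies should be rare. The defensive parsing matters mostly when calling endpoints without grammar-constrained decoding.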

	# 4. Model Details

	## 4.1 Overview

	- Developed by: [Root Signals Inc](https://www.scorable.ai)
	- Model type: Text-Only Decoder Transformer
	- Language(s) (NLP): Primarily English
	- Finetuned from model: meta-llama/Llama-3.3-70B-Instruct

	## 4.2 Training Details

	- Training regime: DPO with IPO loss for 3 epochs, bfloat16 mixed precision on 384 GPUs
	- Hardware Type: LUMI-G / AMD Radeon Instinct™ MI250X
	- Cloud Provider: [LUMI Supercomputer](https://lumi-supercomputer.eu)
	- Compute Region: Finland


	# 5. Contact

	Links
	- [Scorable Homepage](https://www.scorable.ai/)
	- [Scorable Platform](https://app.scorable.ai/?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals)
	- [Python SDK](https://github.com/root-signals/rs-sdk/blob/main/python/README.md)
	- [Python SDK Docs](https://sdk.rootsignals.ai/en/latest/quickstart.html)
	- [TypeScript SDK](https://github.com/root-signals/rs-sdk/blob/main/typescript/README.md)
	- [Discord](https://discord.gg/EhazTQsFnj)

	Email
	- hello@scorable.ai