
Libraries

PEFT

How to use Pritish92/ner-grit-llama31-8b-lora-latest with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "Pritish92/ner-grit-llama31-8b-lora-latest")

Transformers

How to use Pritish92/ner-grit-llama31-8b-lora-latest with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Pritish92/ner-grit-llama31-8b-lora-latest")

# Load model directly (requires peft to be installed so the adapter can be resolved)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Pritish92/ner-grit-llama31-8b-lora-latest", dtype="auto")

Local Apps

vLLM

How to use Pritish92/ner-grit-llama31-8b-lora-latest with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
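# Start the vLLM server on the base model and attach this adapter as a LoRA module.
# This repository contains adapter weights only, so it cannot be served on its own.
# "ner-grit" is an arbitrary module name; a local path to the downloaded adapter
# can be used in place of the repo ID.
vllm serve "meta-llama/Llama-3.1-8B" \
	--enable-lora \
	--lora-modules ner-grit="Pritish92/ner-grit-llama31-8b-lora-latest"
# Call the server using curl (OpenAI-compatible API); "model" is the LoRA module name chosen above:
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ner-grit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'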

Use Docker

docker model run hf.co/Pritish92/ner-grit-llama31-8b-lora-latest

SGLang

How to use Pritish92/ner-grit-llama31-8b-lora-latest with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Pritish92/ner-grit-llama31-8b-lora-latest" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Pritish92/ner-grit-llama31-8b-lora-latest",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Pritish92/ner-grit-llama31-8b-lora-latest" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Pritish92/ner-grit-llama31-8b-lora-latest",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Pritish92/ner-grit-llama31-8b-lora-latest

This is a GRIT + LoRA adapter, fine-tuned from meta-llama/Llama-3.1-8B, for instruction-following NER-style extraction. It is trained to emit a strict JSON list of the form:

[{"label":"...","text":"..."}]

Note: This repository contains adapter weights only (not the full base model weights). You must have access to meta-llama/Llama-3.1-8B on Hugging Face to run it.

Prompt format (exact)

### Instruction:
{instruction}
Maintain the JSON key order exactly as shown.
Output format: [{"label":"...","text":"..."}]

### Input:
{input_chunk}

### Response:
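
Because the format has to match exactly, it can help to build prompts from a single template rather than by hand. The following is a minimal sketch, not the author's code: the names PROMPT_TEMPLATE and build_prompt are hypothetical, and the single newline after "### Response:" is an assumption about how the training prompts ended.

```python
# Hypothetical helper that reproduces the prompt layout shown above.
# Double braces escape the literal JSON braces for str.format.
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "{instruction}\n"
    "Maintain the JSON key order exactly as shown.\n"
    'Output format: [{{"label":"...","text":"..."}}]\n'
    "\n"
    "### Input:\n"
    "{input_chunk}\n"
    "\n"
    "### Response:\n"  # assumption: one trailing newline after the marker
)


def build_prompt(instruction: str, input_chunk: str) -> str:
    """Fill the template with a task instruction and one input chunk."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(),
        input_chunk=input_chunk.strip(),
    )


if __name__ == "__main__":
    print(build_prompt("Extract all person names.", "Ada met Bob in Paris."))
```

Only the template is parsed by str.format, so braces inside the instruction or the input text are passed through unchanged.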

How to load

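import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "Pritish92/ner-grit-llama31-8b-lora-latest"

# Load the tokenizer from the adapter repo and reuse EOS as the pad token.
tokenizer = AutoTokenizer.from_pretrained(adapter_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
# Left padding and left truncation keep the end of the prompt ("### Response:") intact.
tokenizer.padding_side = "left"
tokenizer.truncation_side = "left"

# Reads the base model ID from the adapter config, loads the base model,
# and attaches the LoRA adapter on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()  # inference mode (disables dropout)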

Training details

Date: 2026-01-02
Sequence length cap (max_length): 20
Chunking strategy: token_overlap
- prompt overhead tokens reserved: 256
- output overhead tokens reserved: 1024
- max input chunk tokens: 2048
- overlap chunk tokens: 256
- min chunk tokens: 256
Batch size: 1
Gradient accumulation: 8 (effective batch: 8)
Learning rate: 5e-05
Planned epochs: 2 (early stopping may stop sooner)
Loss masking: response-only (prompt + input chunk tokens masked with -100)
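
Response-only loss masking can be illustrated with plain token ID lists. This is a sketch of the general technique, not the training script used for this adapter; the function name build_labels and the toy IDs are hypothetical. The value -100 is the default ignore index of PyTorch's cross-entropy loss, so positions labelled -100 contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask every prompt position in the labels.

    The model still sees the full sequence as input, but only response
    tokens are scored by the loss.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    # Toy token IDs standing in for a tokenized prompt and response.
    example = build_labels(prompt_ids=[11, 12, 13], response_ids=[21, 22])
    print(example["input_ids"])  # [11, 12, 13, 21, 22]
    print(example["labels"])     # [-100, -100, -100, 21, 22]
```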

LoRA / PEFT

LoRA rank (r): 16
LoRA alpha: 32
LoRA dropout: 0.1
Target modules: up_proj, k_proj, gate_proj, o_proj, q_proj, v_proj, down_proj
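
To give a sense of the adapter's size, the settings above can be turned into a parameter count. The sketch below assumes the public Llama-3.1-8B shapes (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128) and a fixed rank of 16 on every target module; GRIT's rank adaptation may leave some modules with a different effective rank, so treat the total as an upper-level estimate rather than a measurement of this checkpoint. In PEFT, these settings correspond to the r, lora_alpha, lora_dropout, and target_modules arguments of LoraConfig.

```python
R, ALPHA = 16, 32

# Assumed shapes from the public Llama-3.1-8B configuration.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128

# (in_features, out_features) for each targeted linear layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, R) for i, o in MODULE_SHAPES.values())
total_params = per_layer * LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update B @ A

if __name__ == "__main__":
    print(f"per layer: {per_layer:,}")   # per layer: 1,310,720
    print(f"total:     {total_params:,}")  # total:     41,943,040
    print(f"scaling:   {scaling}")        # scaling:   2.0
```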

GRIT hyperparameters

kfac_min_samples: 256
kfac_update_freq: 100
kfac_damping: 0.005
reprojection_warmup_steps: 500
reprojection_freq: 100
use_two_sided_reprojection: True
rank_adaptation_start_step: 500
rank_adaptation_threshold: 0.85
ng_warmup_steps: 300
regularizer_warmup_steps: 500
lambda_kfac: 1e-05
lambda_reproj: 0.0001

Training data

Local CSVs:

NER/NER-Data/ner_train_dataset.csv
NER/NER-Data/ner_dev_dataset.csv
NER/NER-Data/ner_test_dataset.csv

Example counts: raw train=18,115, raw val=2,010; after chunking train examples=24,620
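
The growth from raw to chunked examples comes from splitting long inputs with the token_overlap settings listed under Training details. The sketch below shows one way such a splitter could work on a list of token IDs; it is an illustration under assumptions, not the preprocessing code used here. In particular, the function name chunk_token_ids is hypothetical, and the handling of min_tokens (widening a too-short final chunk backwards) is a guess at what "min chunk tokens" means.

```python
def chunk_token_ids(token_ids, max_tokens=2048, overlap=256, min_tokens=256):
    """Split token_ids into windows of at most max_tokens that share
    `overlap` tokens with the previous window."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    if min_tokens > max_tokens:
        raise ValueError("min_tokens must not exceed max_tokens")

    token_ids = list(token_ids)
    n = len(token_ids)
    if n == 0:
        return []

    stride = max_tokens - overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        end = start + max_tokens
        if end >= n:
            tail = token_ids[start:]
            # Assumption: a final chunk shorter than min_tokens is widened
            # backwards so that it still carries enough context.
            if chunks and len(tail) < min_tokens:
                tail = token_ids[n - min_tokens:]
            chunks.append(tail)
            return chunks
        chunks.append(token_ids[start:end])
        start += stride


if __name__ == "__main__":
    chunks = chunk_token_ids(range(5000))
    print([len(c) for c in chunks])  # [2048, 2048, 1416]
```

With the default settings the final chunk always contains at least the 256 overlap tokens, so the widening branch only triggers when min_tokens is larger than overlap.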

Evaluation

Best checkpoint metric: N/A
Train runtime: 34690.8s (9h 38m 10s)
eval_entity_f1: 0.173705
eval_entity_micro_f1: 0.162234
eval_entity_parse_fail_rate: 0.686071
eval_entity_precision: 0.270288
eval_entity_recall: 0.155745
eval_loss: 0.198197
eval_runtime: 23856.4s (6h 37m 36s)
eval_samples_per_second: 0.117
eval_steps_per_second: 0.029
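
The card does not state how the entity metrics were computed, so the following is only a sketch of one common definition: micro-averaged precision, recall, and F1 over exact (label, text) matches, with unparseable outputs counted as parse failures that predict nothing. The function name micro_prf and the convention of passing None for a failed parse are assumptions made for illustration.

```python
from collections import Counter


def micro_prf(predictions, references):
    """Micro-averaged precision/recall/F1 over (label, text) pairs.

    predictions: per example, a list of (label, text) tuples, or None when
                 the model output could not be parsed.
    references:  per example, the gold list of (label, text) tuples.
    """
    true_pos = n_pred = n_gold = failures = 0
    for pred, gold in zip(predictions, references):
        gold_counts = Counter(gold)
        n_gold += sum(gold_counts.values())
        if pred is None:  # parse failure: contributes no predictions
            failures += 1
            continue
        pred_counts = Counter(pred)
        n_pred += sum(pred_counts.values())
        # Multiset intersection counts each gold entity at most once.
        true_pos += sum((pred_counts & gold_counts).values())

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fail_rate = failures / len(predictions) if predictions else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "parse_fail_rate": fail_rate}


if __name__ == "__main__":
    preds = [[("PER", "Ada")], None, [("ORG", "ACME"), ("PER", "Bob")]]
    golds = [[("PER", "Ada")], [("LOC", "Paris")], [("ORG", "ACME")]]
    print(micro_prf(preds, golds))
```

Under this definition a high parse-failure rate lowers recall directly, since every gold entity in a failed example is missed.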

Limitations / notes

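Outputs are not guaranteed to be valid JSON: the evaluation above reports a parse-failure rate of about 68.6% (eval_entity_parse_fail_rate: 0.686071). Validate and parse every output, and handle failures robustly.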
Model performance depends on the entity schema/labels in your training data.
meta-llama/Llama-3.1-8B is a gated model: request access on its Hugging Face page and authenticate (for example by setting HF_TOKEN) before downloading it.
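
Since malformed output is common, downstream code should treat parsing as something that can fail. A minimal sketch of defensive parsing, using only the standard library, is shown below; the function name parse_entities is hypothetical, and the policy of rejecting the whole output when any item is malformed is one possible choice among several.

```python
import json


def parse_entities(raw: str):
    """Return a list of {"label": str, "text": str} dicts, or None on failure.

    Decodes the first JSON value that starts at the first "[" and ignores
    anything the model generated after it.
    """
    start = raw.find("[")
    if start == -1:
        return None
    try:
        data, _end = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None

    entities = []
    for item in data:
        if not (
            isinstance(item, dict)
            and isinstance(item.get("label"), str)
            and isinstance(item.get("text"), str)
        ):
            return None  # reject the whole output if any item is malformed
        entities.append({"label": item["label"], "text": item["text"]})
    return entities


if __name__ == "__main__":
    print(parse_entities(' [{"label":"PER","text":"Ada"}] trailing text'))
    print(parse_entities('[{"label":"PER","text":'))  # truncated output
```

Returning None rather than raising lets the caller count failures, retry with different generation settings, or fall back to an empty result.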
