---
datasets:
- agentica-org/DeepScaleR-Preview-Dataset
language:
- en
metrics:
- accuracy
base_model:
- nvidia/Llama-3.1-Nemotron-Nano-8B-v1
---
# Model Overview
<div align="center">
<span style="font-family: default; font-size: 1.5em;">DLER-Llama-Nemotron-8B-Merge</span>
<div>
🚀 The leading efficient reasoning model for cutting-edge research and development 🌟
</div>
</div>

[![Paper](https://img.shields.io/badge/ArXiv-Paper-brown)](https://www.arxiv.org/abs/2510.15110)
[![Code](https://img.shields.io/badge/GitHub-Link-blue)](https://github.com/NVlabs/DLER)
[![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/collections/nvidia/reasoning-efficiency-research)
[![Website](https://img.shields.io/badge/Web-Page-orange)](https://nvlabs.github.io/DLER/)

![Comparison between Llama-3.1-Nemotron-Nano-8B-v1 and DLER-Llama-Nemotron-8B-Merge](./asset/latency_8b.png)

### Description:
DLER-Llama-Nemotron-8B-Merge is an efficient 8B open-weight reasoning model designed for challenging tasks such as mathematics, programming, and scientific problem-solving. It is first trained with the DLER algorithm on agentica-org/DeepScaleR-Preview-Dataset, and the resulting weights are then merged with those of the base model, nvidia/Llama-3.1-Nemotron-Nano-8B-v1, to mitigate accuracy degradation. Compared to Llama-3.1-Nemotron-Nano-8B-v1, DLER-Llama-Nemotron-8B-Merge reduces the average response length by about 46% across five mathematical benchmarks (from 5996 to 3237), while staying within 0.2 points of the base model's accuracy on MATH and matching or exceeding it on the other four.

This model is for research and development only.
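
The description does not say which merging rule is used. As a rough illustration only, one common form of weight merging is linear interpolation between the base and the fine-tuned parameters. The sketch below assumes that form; the function name `merge_weights`, the coefficient `alpha`, and the toy parameter dictionaries are hypothetical and are not taken from the DLER code.

```python
def merge_weights(base, tuned, alpha=0.5):
    """Linearly interpolate two parameter dicts.

    Assumption: merged = (1 - alpha) * base + alpha * tuned.
    This is a generic illustration, not the confirmed DLER procedure.
    """
    if base.keys() != tuned.keys():
        raise ValueError("both models must have the same parameter names")
    merged = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Toy parameters standing in for real model weights.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
tuned = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_weights(base, tuned, alpha=0.5)
print(merged)
```

With `alpha=0` the result equals the base parameters and with `alpha=1` it equals the fine-tuned ones, so the coefficient trades off the two models.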


### Evaluation Results:
| Model                           | MATH Acc. | MATH Length | AIME Acc. | AIME Length | AMC Acc. | AMC Length | Minerva Acc. | Minerva Length | Olympiad Acc. | Olympiad Length | Total Avg Length |
|---------------------------------|-----------|-------------|-----------|-------------|----------|------------|--------------|----------------|---------------|-----------------|------------------|
| Llama-3.1-Nemotron-Nano-8B-v1   | 95.4     | 3069       | 66.4               | 9899             | 88.25              | 6228             | 52.38              | 4031             | 64.33              | 6755             | 5996            |
| **DLER-Llama-Nemotron-8B-Merge**| **95.2** | **1995**   | **66.7**           | **5013**         | **89.23**          | **3358**         | **53.19**          | **2301**         | **65.39**          | **3520**         | **3237 (-46%)** |
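
The "Total Avg Length" column and the 46% figure can be reproduced from the per-benchmark lengths in the table. The snippet below is a small arithmetic check, assuming the total is the unweighted mean over the five benchmarks; the variable names are illustrative.

```python
# Per-benchmark response lengths copied from the table above.
base_lengths = {"MATH": 3069, "AIME": 9899, "AMC": 6228, "Minerva": 4031, "Olympiad": 6755}
dler_lengths = {"MATH": 1995, "AIME": 5013, "AMC": 3358, "Minerva": 2301, "Olympiad": 3520}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


base_avg = mean(base_lengths.values())
dler_avg = mean(dler_lengths.values())
reduction = 1 - dler_avg / base_avg  # fraction of length removed

print(f"{base_avg:.0f} -> {dler_avg:.0f} ({reduction:.0%} shorter)")
# prints: 5996 -> 3237 (46% shorter)
```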

### Environment Setup

```
pip install transformers==4.51.3
```

### Inference:


```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/DLER-Llama-Nemotron-8B-Merge-Research"

# Use a GPU when one is available; otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# System prompt plus the user question; the final answer is requested in \boxed{}.
messages = [{"role": "system", "content": "detailed thinking on"}, {"role": "user", "content": "Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \\boxed{}.\nQuestion: Convert the point $(0,3)$ in rectangular coordinates to polar coordinates.  Enter your answer in the form $(r,\\theta),$ where $r > 0$ and $0 \\le \\theta < 2 \\pi.$"}]

# Apply the chat template and tokenize in a single step.
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate until the end-of-sequence token or the token budget is reached.
outputs = model.generate(
    tokenized_chat,
    max_new_tokens=10000,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode prompt and response, keeping special tokens visible.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
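
Because the prompt asks for the final answer inside `\boxed{}`, the decoded text can be post-processed to pull that answer out. The helper below is a minimal sketch of one way to do this with brace matching; `extract_boxed` is a hypothetical name and is not part of the model repository or of `transformers`.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so nested LaTeX such as \\frac{a}{b}
    inside the box is kept intact.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # the box was opened but never closed


# Example on a hand-written string standing in for model output.
sample = r"Thus r = 3 and the angle is pi/2. \boxed{\left(3, \frac{\pi}{2}\right)}"
print(extract_boxed(sample))
```

Taking the last occurrence is a design choice: reasoning traces may mention `\boxed{}` more than once, and the final one is usually the committed answer.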


### License/Terms of Use
NSCLv1

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.


Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).


## Citation
If you find our model helpful, please cite the following [paper](https://www.arxiv.org/abs/2510.15110):

```
@article{liu2025dler,
  title={DLER: Doing Length pEnalty Right-Incentivizing More Intelligence per Token via Reinforcement Learning},
  author={Liu, Shih-Yang and Dong, Xin and Lu, Ximing and Diao, Shizhe and Liu, Mingjie and Chen, Min-Hung and Yin, Hongxu and Wang, Yu-Chiang Frank and Cheng, Kwang-Ting and Choi, Yejin and others},
  journal={arXiv preprint arXiv:2510.15110},
  year={2025}
}
```