Lonepic commited on
Commit
a43bbec
·
verified ·
1 Parent(s): 976378a

Initial upload

Browse files
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ Agents-A1-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
37
+ figures/logo_nobg.png filter=lfs diff=lfs merge=lfs -text
Agents-A1-Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:31aefa25b7e1edbde436e643e2b5e3f6e57820a4811d97b131130e48ff0772c2
3
+ size 21166757632
README.md ADDED
@@ -0,0 +1,444 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ library_name: transformers
3
+ license: apache-2.0
4
+ pipeline_tag: text-generation
5
+ ---
6
+
7
+ # Agents-A1: Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent
8
+
9
+ <div style="display: flex; flex-direction: column; align-items: center; line-height: 1.2;">
10
+ <div style="display: flex; justify-content: center; align-items: center; gap: 10px; height: 30px;">
11
+ <span style="font-size: 16px;" role="img" aria-label="Homepage">🏠</span>
12
+ <a href="https://internscience.github.io/Agents-A1/"><b>Homepage</b></a>
13
+ <span style="color: #ccc;">|</span>
14
+ <img src="./figures/24px.svg" width="16" height="16" alt="Technical Report" style="filter: invert(0.5);">
15
+ <a href="https://arxiv.org/abs/2606.30616"><b>Technical Report</b></a>
16
+ </div>
17
+
18
+ <div style="display: flex; justify-content: center; align-items: center; gap: 10px; height: 30px; margin-top: 2px;">
19
+ <img src="./figures/hf-logo.svg" width="16" height="16" alt="Hugging Face">
20
+ <a href="https://huggingface.co/InternScience/Agents-A1"><b>Hugging Face</b></a>
21
+ <span style="color: #ccc;">|</span>
22
+ <img src="./figures/github-logo.svg" width="16" height="16" alt="GitHub">
23
+ <a href="https://github.com/InternScience/Agents-A1"><b>Github</b></a>
24
+ <span style="color: #ccc;">|</span>
25
+ <img src="./figures/modelscope-logo.svg" width="16" height="16" alt="Model Scope">
26
+ <a href="https://modelscope.cn/models/InternScience/Agents-A1"><b>ModelScope</b></a>
27
+ </div>
28
+ </div>
29
+
30
+ > [!Note]
31
+ > This repository contains model weights and configuration files for Agents-A1 in the Hugging Face Transformers format.
32
+ >
33
+ > These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.
34
+
35
+ **Agents‑A1** is a 35B Mixture‑of‑Experts agentic model from [InternScience](https://huggingface.co/InternScience), built to scale heterogeneous agentic abilities across multiple domains including **Long‑horizon Search, Engineering, Scientific Research, Instruction Following, and Tool-calling**. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities.
36
+
37
+ From the scaling of long-horizon trajectories, **Agents‑A1** is trained with the assistance of a domain-grounded knowledge-action infrastructure that jointly constructs actions, observations, and verifier outcomes, turning the agent's process into a trainable target. From the scaling of heterogeneous agent abilities, **Agents‑A1** presents a three-stage training paradigm for building scalable general-purpose agentic model. First, we perform full-domain supervised fine-tuning to align the base model with broad agentic behaviors. Second, we train domain-level teacher models to capture specialized expertise in each domain. Third, we propose multi-teacher multi-domain on-policy distillation with heterogeneity-aware optimization to improve knowledge transfer efficiency across different domains.
38
+
39
+
40
+ ![Agents-A1 Benchmark Overview](./figures/a1_benchmarks_altair_grid.svg)
41
+
42
+ ## Highlights
43
+
44
+ - **Agentic Reasoning**: Agents-A1 excels at decomposing complex tasks into executable sub-steps, planning ahead, and adapting its strategy based on intermediate results.
45
+ - **Tool Use**: Natively supports function calling and tool integration, enabling seamless interaction with APIs, code interpreters, search engines, and other external tools.
46
+ - **Scientific and Professional Reasoning**: Handles tool-integrated scientific reasoning and professional knowledge question answering.
47
+ - **Instruction Following**: Precisely follows detailed, multi-constraint instructions across diverse domains.
48
+
49
+ We welcome developers and enterprises to integrate and try Agents-A1 and share their feedback.
50
+
51
+ ## Performance
52
+
53
+ We evaluate Agents-A1 in real-world agentic and research-oriented workflows across six directions — long-horizon search, engineering tasks, scientific research, instruction following, general agentic tasks, and scientific agentic tasks. Despite operating in the ~35B model class, Agents-A1 delivers highly competitive performance against frontier-scale systems such as GPT-5.5, DeepSeek-V4-pro, and Kimi-K2.6. It achieves overall SOTA results on several challenging benchmarks, including Seal-0 (56.4), HiPhO (46.4), FrontierScience-Olympiad (79.0), FrontierScience-Research (40.00), IFBench (80.6), and IFEval (94.8), while also ranking as the best among comparable models on a broad range of tasks such as BrowseComp (75.5), XBench-DS-2510 (86.0), GAIA (96.0), SciCode (44.3), HLE with tools (47.6), and MolBench-bind (56.8). These results show that Agents-A1 combines strong long-horizon search ability, robust scientific reasoning, and reliable instruction following, establishing it as a highly capable and efficient agentic model that narrows the gap with much larger frontier models.
54
+
55
+
56
+ <p>
57
+ 🥇 Overall SOTA &nbsp;&nbsp;
58
+ 🟢 Best Among Comparable Models (~35B)
59
+ </p>
60
+
61
+ <table>
62
+ <thead>
63
+ <tr>
64
+ <th rowspan="2" align="left">Benchmark</th>
65
+ <th colspan="3" align="center" style="text-align:center;">
66
+ 📏 Comparable Models (~35B)
67
+ </th>
68
+
69
+ <th colspan="4" align="center" style="text-align:center;">
70
+ 🚀 Larger-scale Models
71
+ </th>
72
+
73
+ <th colspan="2" align="center" style="text-align:center;">
74
+ ⭐ Ours
75
+ </th>
76
+ </tr>
77
+
78
+ <tr>
79
+ <th align="center">Qwen3.5-35B-A3B</th>
80
+ <th align="center">Qwen3.6-35B-A3B</th>
81
+ <th align="center">Nex-N2-mini</th>
82
+
83
+ <th align="center">Step-3.5-Flash</th>
84
+ <th align="center">Kimi-K2.6</th>
85
+ <th align="center">DeepSeek-V4-pro(Max)</th>
86
+ <th align="center">GPT-5.5(xhigh)</th>
87
+
88
+ <th align="center">Agents-A1</th>
89
+ </tr>
90
+ </thead>
91
+
92
+ <tbody>
93
+
94
+ <tr>
95
+ <td colspan="9" align="left"><b>🔍 Long-horizon Search</b></td>
96
+ </tr>
97
+
98
+ <tr>
99
+ <td align="left">BrowseComp</td>
100
+ <td align="center">61.0</td>
101
+ <td align="center">67.93</td>
102
+ <td align="center">74.1</td>
103
+ <td align="center">69.0</td>
104
+ <td align="center">83.2</td>
105
+ <td align="center">83.4</td>
106
+ <td align="center">🥇 84.4</td>
107
+ <td align="center">🟢 75.51</td>
108
+ </tr>
109
+
110
+ <tr>
111
+ <td align="left">XBench-DS-2510</td>
112
+ <td align="center">77.0</td>
113
+ <td align="center">71.0</td>
114
+ <td align="center">82.0</td>
115
+ <td align="center">56.3</td>
116
+ <td align="center">🥇 90.0</td>
117
+ <td align="center">🥇 90.0</td>
118
+ <td align="center">84.0</td>
119
+ <td align="center">🟢 86.0</td>
120
+ </tr>
121
+
122
+ <tr>
123
+ <td align="left">Seal0</td>
124
+ <td align="center">41.4</td>
125
+ <td align="center">38.74</td>
126
+ <td align="center">49.55</td>
127
+ <td align="center">36.94</td>
128
+ <td align="center">50.45</td>
129
+ <td align="center">54.95</td>
130
+ <td align="center">42.34</td>
131
+ <td align="center">🥇 56.36</td>
132
+ </tr>
133
+
134
+ <tr>
135
+ <td align="left">GAIA</td>
136
+ <td align="center">59.8</td>
137
+ <td align="center">78.64</td>
138
+ <td align="center">82.52</td>
139
+ <td align="center">84.5</td>
140
+ <td align="center">80.58</td>
141
+ <td align="center">🥇 98.06</td>
142
+ <td align="center">87.38</td>
143
+ <td align="center">🟢 96.04</td>
144
+ </tr>
145
+
146
+ <tr>
147
+ <td colspan="9" align="left"><b>⚙️ Engineering Tasks</b></td>
148
+ </tr>
149
+
150
+ <tr>
151
+ <td align="left">SciCode</td>
152
+ <td align="center">37.7</td>
153
+ <td align="center">35.8</td>
154
+ <td align="center">29.9</td>
155
+ <td align="center">40.4</td>
156
+ <td align="center">53.5</td>
157
+ <td align="center">50.0</td>
158
+ <td align="center">🥇 56.1</td>
159
+ <td align="center">🟢 44.33</td>
160
+ </tr>
161
+
162
+ <tr>
163
+ <td align="left">MLE-Lite</td>
164
+ <td align="center">24.24</td>
165
+ <td align="center">34.85</td>
166
+ <td align="center">34.85</td>
167
+ <td align="center">54.55</td>
168
+ <td align="center">62.12</td>
169
+ <td align="center">63.64</td>
170
+ <td align="center">🥇 72.73</td>
171
+ <td align="center">🟢 43.94</td>
172
+ </tr>
173
+
174
+ <tr>
175
+ <td colspan="9" align="left"><b>🧪 Scientific Research</b></td>
176
+ </tr>
177
+
178
+ <tr>
179
+ <td align="left">HLE w/ tools</td>
180
+ <td align="center">47.4</td>
181
+ <td align="center">36.2</td>
182
+ <td align="center">32.0</td>
183
+ <td align="center">23.1</td>
184
+ <td align="center">🥇 54.0</td>
185
+ <td align="center">48.2</td>
186
+ <td align="center">52.2</td>
187
+ <td align="center">🟢 47.6</td>
188
+ </tr>
189
+
190
+ <tr>
191
+ <td align="left">HiPhO</td>
192
+ <td align="center">37.0</td>
193
+ <td align="center">37.7</td>
194
+ <td align="center">38.5</td>
195
+ <td align="center">38.3</td>
196
+ <td align="center">41.1</td>
197
+ <td align="center">38.7</td>
198
+ <td align="center">43.3</td>
199
+ <td align="center">🥇 46.4</td>
200
+ </tr>
201
+
202
+ <tr>
203
+ <td align="left">FrontierScience-Olympiad</td>
204
+ <td align="center">64.5</td>
205
+ <td align="center">60.3</td>
206
+ <td align="center">52.0</td>
207
+ <td align="center">61.0</td>
208
+ <td align="center">73.0</td>
209
+ <td align="center">76.0</td>
210
+ <td align="center">78.0</td>
211
+ <td align="center">🥇 79.0</td>
212
+ </tr>
213
+
214
+ <tr>
215
+ <td align="left">FrontierScience-Research</td>
216
+ <td align="center">2.5</td>
217
+ <td align="center">2.9</td>
218
+ <td align="center">5.0</td>
219
+ <td align="center">6.7</td>
220
+ <td align="center">17.9</td>
221
+ <td align="center">13.3</td>
222
+ <td align="center">26.7</td>
223
+ <td align="center">🥇 40.0</td>
224
+ </tr>
225
+
226
+ <tr>
227
+ <td colspan="9" align="left"><b>📋 Instruction Following</b></td>
228
+ </tr>
229
+
230
+ <tr>
231
+ <td align="left">IFBench</td>
232
+ <td align="center">70.2</td>
233
+ <td align="center">64.4</td>
234
+ <td align="center">54.08</td>
235
+ <td align="center">64.6</td>
236
+ <td align="center">71.77</td>
237
+ <td align="center">73.47</td>
238
+ <td align="center">75.9</td>
239
+ <td align="center">🥇 80.61</td>
240
+ </tr>
241
+
242
+ <tr>
243
+ <td align="left">LongBench-v2</td>
244
+ <td align="center">59.0</td>
245
+ <td align="center">57.7</td>
246
+ <td align="center">59.6</td>
247
+ <td align="center">57.5</td>
248
+ <td align="center">62.0</td>
249
+ <td align="center">🥇 64.3</td>
250
+ <td align="center">-</td>
251
+ <td align="center">🟢 60.2</td>
252
+ </tr>
253
+
254
+ <tr>
255
+ <td align="left">IFEval</td>
256
+ <td align="center">91.9</td>
257
+ <td align="center">91.3</td>
258
+ <td align="center">88.4</td>
259
+ <td align="center">93.53</td>
260
+ <td align="center">94.45</td>
261
+ <td align="center">93.35</td>
262
+ <td align="center">93.35</td>
263
+ <td align="center">🥇 94.82</td>
264
+ </tr>
265
+
266
+ <tr>
267
+ <td colspan="9" align="left"><b>🤖 General Agentic Tasks</b></td>
268
+ </tr>
269
+
270
+ <tr>
271
+ <td align="left">τ<sup>2</sup>-Bench</td>
272
+ <td align="center">🟢 81.2</td>
273
+ <td align="center">79.0</td>
274
+ <td align="center">74.53</td>
275
+ <td align="center">75.77</td>
276
+ <td align="center">81.93</td>
277
+ <td align="center">🥇 82.2</td>
278
+ <td align="center">81.63</td>
279
+ <td align="center">79.81</td>
280
+ </tr>
281
+
282
+ <tr>
283
+ <td align="left">VitaBench</td>
284
+ <td align="center">31.9</td>
285
+ <td align="center">35.6</td>
286
+ <td align="center">23.0</td>
287
+ <td align="center">30.0</td>
288
+ <td align="center">35.63</td>
289
+ <td align="center">🥇 49.04</td>
290
+ <td align="center">45.0</td>
291
+ <td align="center">🟢 38.75</td>
292
+ </tr>
293
+
294
+ <tr>
295
+ <td colspan="9" align="left"><b>🔬 Scientific Agentic Tasks</b></td>
296
+ </tr>
297
+
298
+ <tr>
299
+ <td align="left">MatTools</td>
300
+ <td align="center">21.0</td>
301
+ <td align="center">15.9</td>
302
+ <td align="center">34.1</td>
303
+ <td align="center">44.93</td>
304
+ <td align="center">63.8</td>
305
+ <td align="center">47.1</td>
306
+ <td align="center">🥇 68.8</td>
307
+ <td align="center">🟢 47.1</td>
308
+ </tr>
309
+
310
+ <tr>
311
+ <td align="left">MolBench-bind</td>
312
+ <td align="center">46.0</td>
313
+ <td align="center">48.7</td>
314
+ <td align="center">51.4</td>
315
+ <td align="center">45.95</td>
316
+ <td align="center">21.6</td>
317
+ <td align="center">37.8</td>
318
+ <td align="center">🥇 62.2</td>
319
+ <td align="center">🟢 56.8</td>
320
+ </tr>
321
+
322
+ </tbody>
323
+ </table>
324
+
325
+
326
+ ## Usage
327
+
328
+ ### SGLang
329
+
330
+ [SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
331
+
332
+ Install SGLang with uv:
333
+
334
+ ```shell
335
+ uv venv --python 3.12 --seed --managed-python
336
+ source .venv/bin/activate
337
+
338
+ uv pip install sglang
339
+ ```
340
+
341
+ See [its documentation](https://docs.sglang.ai/get_started/install.html) for more details.
342
+
343
+ The following commands create API endpoints at `http://localhost:8000/v1`:
344
+
345
+ - **Standard Version** (1 GPUs, 262K context):
346
+
347
+ ```shell
348
+ python -m sglang.launch_server \
349
+ --model-path InternScience/Agents-A1 \
350
+ --port 8000 \
351
+ --tp-size 1 \
352
+ --mem-fraction-static 0.8 \
353
+ --context-length 262144 \
354
+ --reasoning-parser qwen3
355
+ ```
356
+ - **Tool Use**:
357
+
358
+ ```shell
359
+ python -m sglang.launch_server \
360
+ --model-path InternScience/Agents-A1 \
361
+ --port 8000 \
362
+ --tp-size 1 \
363
+ --mem-fraction-static 0.8 \
364
+ --context-length 262144 \
365
+ --reasoning-parser qwen3 \
366
+ --tool-call-parser qwen3_coder
367
+ ```
368
+
369
+ ### vLLM
370
+
371
+ [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
372
+
373
+ Install vLLM from the main branch via uv:
374
+
375
+ ```shell
376
+ uv venv --python 3.12 --seed --managed-python
377
+ source .venv/bin/activate
378
+
379
+ uv pip install vllm --torch-backend=auto
380
+ ```
381
+
382
+ See [its documentation](https://docs.vllm.ai/en/stable/getting_started/installation/index.html) for more details.
383
+
384
+ The following commands create API endpoints at `http://localhost:8000/v1`:
385
+
386
+ - **Standard Version** (1 GPUs, 262K context):
387
+
388
+ ```shell
389
+ vllm serve InternScience/Agents-A1 \
390
+ --port 8000 \
391
+ --tensor-parallel-size 1 \
392
+ --max-model-len 262144 \
393
+ --reasoning-parser qwen3
394
+ ```
395
+ - **Tool Call**:
396
+
397
+ ```shell
398
+ vllm serve InternScience/Agents-A1 \
399
+ --port 8000 \
400
+ --tensor-parallel-size 1 \
401
+ --max-model-len 262144 \
402
+ --reasoning-parser qwen3 \
403
+ --enable-auto-tool-choice \
404
+ --tool-call-parser qwen3_coder
405
+ ```
406
+ - **Text-Only** (skips vision encoder to free KV cache memory):
407
+
408
+ ```shell
409
+ vllm serve InternScience/Agents-A1 \
410
+ --port 8000 \
411
+ --tensor-parallel-size 1 \
412
+ --max-model-len 262144 \
413
+ --reasoning-parser qwen3 \
414
+ --language-model-only
415
+ ```
416
+
417
+ ### Recommended Sampling Parameters
418
+
419
+ For the best generation quality, we recommend the following sampling parameters:
420
+
421
+ - `temperature`: 0.85
422
+ - `top_p`: 0.95
423
+ - `top_k`: 20
424
+ - `min_p`: 0.0
425
+ - `presence_penalty`: 1.1
426
+ - `repetition_penalty`: 1.0
427
+
428
+
429
+ ## Agent Capability Evaluation
430
+
431
+ To provide the community with a unified agent evaluation codebase for fair comparison, we have also open-sourced an evaluation framework for assessing agentic models across core capabilities, including tool use and multi-step reasoning. The evaluation code is included in the [Agents-A1/evaluation](https://github.com/InternScience/Agents-A1/tree/main/evaluation) of this repository.
432
+
433
+ We use this framework to evaluate the released model under a standardized and reproducible setting.
434
+ Specifically, the model is tested on a set of agent-oriented tasks that require it to understand user goals, decompose complex instructions, interact with tools or environments when necessary, and produce final results. The evaluation results reported in [Model Card](https://huggingface.co/InternScience/Agents-A1) are generated using the open-source framework above, so that users can reproduce the experiments, compare other models under the same protocol, and further extend the benchmark for new agent scenarios. (**Note that:** To ensure a fair comparison, we report the benchmark results from their original technical reports. If a model does not report the corresponding benchmark results, we evaluate it using the same evaluation protocol as our model.)
435
+
436
+ For detailed evaluation scripts, task definitions, metrics, and reproduction instructions, please refer to the evaluation codebase.
437
+
438
+ ## Citation
439
+
440
+ If you find our work helpful, feel free to give us a cite.
441
+
442
+ ```
443
+
444
+ ```
figures/24px.svg ADDED
figures/a1_benchmarks_altair_grid.svg ADDED
figures/github-logo.svg ADDED
figures/hf-logo.svg ADDED
figures/logo_nobg.png ADDED

Git LFS Details

  • SHA256: 8fa7bbc49686f492b05cfc3920e4dc52614c98edb401243f4ec13e5f4c56e5dd
  • Pointer size: 132 Bytes
  • Size of remote file: 2.33 MB
figures/modelscope-logo.svg ADDED