Upload folder using huggingface_hub

Files changed (11)

.gitattributes +1 -0
README.md +152 -0
adapter_config.json +38 -0
adapter_model.safetensors +3 -0
added_tokens.json +35 -0
chat_template.jinja +154 -0
merges.txt +0 -0
special_tokens_map.json +38 -0
tokenizer.json +3 -0
tokenizer_config.json +311 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,152 @@

+---
+base_model: Qwen/Qwen3.5-9B
+library_name: peft
+license: apache-2.0
+datasets:
+  - cometadata/funding-extraction-artifact-data-mix-grpo-mixed-reward
+tags:
+  - funding-extraction
+  - lora
+  - grpo
+  - rl
+  - scholarly-metadata
+language:
+  - en
+pipeline_tag: text-generation
+---
+# Funding Extraction LoRA (Qwen3.5-9B)
+LoRA adapter for extracting structured funding metadata (funder names and award IDs) from funding statements in academic papers. Fine-tuned from Qwen3.5-9B with SFT followed by GRPO reinforcement learning.
+This is the Qwen3.5-9B counterpart to [`cometadata/funding-extraction-llama-3.1-8b-instruct-artifact-data-mix-grpo-mixed-reward`](https://huggingface.co/cometadata/funding-extraction-llama-3.1-8b-instruct-artifact-data-mix-grpo-mixed-reward), trained with the same data, pipeline, and reward. See [Comparison to the Llama 3.1 8B baseline](#comparison-to-the-llama-31-8b-baseline) below.
+## Training Pipeline
+Trained on the [`cometadata/funding-extraction-artifact-data-mix-grpo-mixed-reward`](https://huggingface.co/datasets/cometadata/funding-extraction-artifact-data-mix-grpo-mixed-reward) dataset, using its predefined `sft` / `rl` / `test` splits, on the [Tinker](https://thinkingmachines.ai) training service.
+### Stage 1: Supervised Fine-Tuning (SFT)
+- **Base model:** `Qwen/Qwen3.5-9B`
+- **Data (`data/sft/`):** 3,528 real + 7,240 synthetic funding statements with gold-standard funder/award labels (synthetic upsampled 2×)
+- **Data augmentation:** 50% of training examples augmented with synthetic noise (OCR-like case errors, digit/letter swaps, Unicode artifacts, XML/HTML tags, LaTeX markup) for robustness to real-world document formats
+- **Renderer:** `qwen3_5_disable_thinking` — the model is trained to emit JSON directly (no chain-of-thought), so inference should disable thinking (see [Usage](#usage))
+- **LoRA rank:** 128
+- **Epochs:** 2
+- **Result:** eval NLL 0.116 → 0.0035 over 252 steps
+### Stage 2: Reinforcement Learning (GRPO)
+- **Algorithm:** Group Relative Policy Optimization (GRPO) with importance sampling loss
+- **Data (`data/rl/`):** 1,160 real + 1,916 synthetic (train); 576 real + 968 synthetic (eval)
+- **Reward:** Hierarchical F0.5 scoring with binary funder/award-ID matching + flat award-ID association bonus
+  - `reward = 0.50 * funder_F0.5 + 0.40 * hierarchical_award_id_F0.5 + 0.10 * flat_award_id_F0.5`
+  - Funder matching — fuzzy (token_sort_ratio ≥ 0.80 threshold, Hungarian optimal assignment)
+  - Award ID matching — binary exact after normalization (strip whitespace/hyphens/slashes, uppercase), with soft (edit-distance-1) partial credit during training
+  - Flat award-ID term — awards partial credit when the correct award ID is extracted under the wrong funder, providing gradient on funder-award association errors
+- **KL penalty:** 0.03 (anchored to SFT checkpoint)
+- **Group size:** 8 rollouts per prompt
+- **Temperature:** 0.8
+- **Learning rate:** 3e-5
+- **Steps:** 193 batches
+- **Checkpoint:** final (batch 193)
+## Evaluation Results
+### arxiv_test.jsonl (300 held-out examples)
+#### Permissive (partial_ratio + token_set, no damping)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.9384 | 0.9362 | 0.9373 | 0.9379 | 0.9369 |
+| Award ID | 0.9069 | 0.8909 | 0.8988 | 0.9037 | 0.8957 |
+| Scheme | 0.7407 | 0.8264 | 0.7812 | 0.7564 | 0.7980 |
+| Title | 0.9048 | 0.3958 | 0.5507 | 0.7197 | 0.4787 |
+#### Balanced (length-damped + acronym detection)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.8882 | 0.8960 | 0.8921 | 0.8897 | 0.8936 |
+| Award ID | 0.8889 | 0.8732 | 0.8810 | 0.8857 | 0.8779 |
+| Scheme | 0.6889 | 0.7686 | 0.7266 | 0.7035 | 0.7422 |
+| Title | 0.9048 | 0.3958 | 0.5507 | 0.7197 | 0.4787 |
+#### Strict (token_sort_ratio only)
+| Field | P | R | F1 | F0.5 | F1.5 |
+|-------|---|---|----|----|------|
+| Funder | 0.8796 | 0.8874 | 0.8835 | 0.8812 | 0.8850 |
+| Award ID | 0.8859 | 0.8702 | 0.8780 | 0.8827 | 0.8750 |
+| Scheme | 0.6667 | 0.7438 | 0.7031 | 0.6808 | 0.7182 |
+| Title | 0.8095 | 0.3542 | 0.4928 | 0.6439 | 0.4283 |
+Inference on the 300 examples produced 100% parseable JSON (no truncations), averaging 126 output tokens per example.
+### Comparison to the Llama 3.1 8B baseline
+Same `arxiv_test.jsonl` (300 examples), same evaluation harness and pipeline. Balanced-mode F1:
+| Field | Llama 3.1 8B | Qwen3.5-9B | Δ |
+|-------|:---:|:---:|:---:|
+| Funder | 0.9001 | 0.8921 | −0.008 |
+| Award ID | 0.8780 | 0.8810 | +0.003 |
+| Scheme | 0.6466 | 0.7266 | +0.080 |
+| Title | 0.5316 | 0.5507 | +0.019 |
+The two RL-optimized fields (funder, award ID) are effectively tied with the Llama baseline (|Δ| ≤ 0.008 F1, within run-to-run noise), while the secondary fields that carry no weight in the reward (scheme, title) improve. The gain is clearest for scheme, where it holds across the Permissive, Balanced, and Strict modes.
+## Usage
+```python
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B")
+model = PeftModel.from_pretrained(base_model, "cometadata/funding-extraction-qwen3.5-9B-artifact-data-mix-grpo-mixed-reward")
+tokenizer = AutoTokenizer.from_pretrained("cometadata/funding-extraction-qwen3.5-9B-artifact-data-mix-grpo-mixed-reward")
+prompt = """Extract funding information from the following statement:
+This work was supported by the National Science Foundation under grant DMS-1613002 and by the NIH (R01-AI123456)."""
+messages = [
+    {"role": "system", "content": "You are an expert at extracting structured funding metadata from academic papers. Given a funding statement, extract all funders and their associated awards. Return a JSON array of funder objects. Each funder has:\n- \"funder_name\": string or null\n- \"awards\": array of objects with \"award_ids\" (array of strings), \"funding_scheme\" (array of strings), and \"award_title\" (array of strings)\nReturn ONLY the JSON array, no other text."},
+    {"role": "user", "content": prompt},
+]
+# Trained without chain-of-thought: disable thinking to match the training distribution.
+inputs = tokenizer.apply_chat_template(
+    messages, return_tensors="pt", return_dict=True, add_generation_prompt=True, enable_thinking=False
+).to(model.device)  # dict with input_ids and attention_mask, moved to the model's device
+outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
+```
+## Output Format
+```json
+[
+  {
+    "funder_name": "National Science Foundation",
+    "awards": [
+      {
+        "award_ids": ["DMS-1613002"],
+        "funding_scheme": [],
+        "award_title": []
+      }
+    ]
+  },
+  {
+    "funder_name": "NIH",
+    "awards": [
+      {
+        "award_ids": ["R01-AI123456"],
+        "funding_scheme": [],
+        "award_title": []
+      }
+    ]
+  }
+]
+```
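
The reward described under "Stage 2" of the README above combines three F0.5 terms with fixed weights. The following is a minimal sketch of that combination and of the award-ID normalization, not the training code: the helper names (`f_beta`, `normalize_award_id`, `set_f_beta`, `mixed_reward`) are hypothetical, and fuzzy funder matching, Hungarian assignment, and the edit-distance partial credit are all left out. Treating two empty sets as a perfect match is also an assumption.

```python
import re


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def normalize_award_id(award_id):
    """Strip whitespace, hyphens, and slashes, then uppercase."""
    return re.sub(r"[\s\-/]", "", award_id).upper()


def set_f_beta(predicted, gold, beta=0.5):
    """F-beta over two collections of award IDs, exact match after normalization."""
    pred = {normalize_award_id(x) for x in predicted}
    ref = {normalize_award_id(x) for x in gold}
    if not pred and not ref:
        return 1.0  # assumption: nothing to extract and nothing extracted
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return f_beta(precision, recall, beta)


def mixed_reward(funder_f, hierarchical_award_f, flat_award_f):
    """Weighted sum using the weights quoted in the README."""
    return 0.50 * funder_f + 0.40 * hierarchical_award_f + 0.10 * flat_award_f


# Flat award-ID term: IDs are compared without regard to which funder they sit under.
flat = set_f_beta(["dms 1613002", "r01/ai123456"], ["DMS-1613002", "R01-AI123456"])
print(flat)
print(mixed_reward(funder_f=1.0, hierarchical_award_f=0.5, flat_award_f=flat))

# Sanity check against the Permissive "Title" row (P = 0.9048, R = 0.3958).
print(round(f_beta(0.9048, 0.3958), 4))  # 0.7197, matching the table
```

Because F0.5 favours precision, a model that emits fewer but correct funders and award IDs scores higher under this reward than one that over-extracts.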

adapter_config.json ADDED

	@@ -0,0 +1,38 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen3.5-9B",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": false,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.18.1",
+  "qalora_group_size": 16,
+  "r": 128,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": "all-linear",
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
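
For readers checking the adapter settings above: in PEFT, the low-rank update is multiplied by `lora_alpha / r`, or by `lora_alpha / sqrt(r)` when `use_rslora` is true. The sketch below computes that factor from a trimmed copy of the config. `lora_scaling` is a hypothetical helper, and only the four fields shown are read.

```python
import json
import math

# Trimmed copy of the relevant fields from adapter_config.json.
CONFIG_TEXT = """
{
  "r": 128,
  "lora_alpha": 32,
  "lora_dropout": 0,
  "use_rslora": false
}
"""


def lora_scaling(cfg):
    """Factor applied to the low-rank update B @ A before it is added to W."""
    rank = cfg["r"]
    alpha = cfg["lora_alpha"]
    if cfg.get("use_rslora", False):
        return alpha / math.sqrt(rank)  # rank-stabilized LoRA
    return alpha / rank


cfg = json.loads(CONFIG_TEXT)
print(lora_scaling(cfg))  # 0.25
```

With rank 128 and alpha 32, the update is scaled by 0.25, so the adapter's contribution per unit of `B @ A` is smaller than with the common `alpha = 2 * r` choice.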

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:73c2313fd5dd56dd329429404bec8518f110b0c9d02d4359d0fc8086b505b14e
+size 1513426600
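
The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. As a rough sketch, a downloaded `adapter_model.safetensors` could be checked against the pointer as follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, and the local file path is whatever the reader supplies.

```python
import hashlib

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:73c2313fd5dd56dd329429404bec8518f110b0c9d02d4359d0fc8086b505b14e
size 1513426600
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines into a dict and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Hash a local file in chunks and compare size and digest with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(f"{pointer['size'] / 1e9:.2f} GB")  # 1.51 GB
```

Calling `matches_pointer("adapter_model.safetensors", pointer)` on a local copy returns `True` only if both the byte count and the digest agree.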

added_tokens.json ADDED

	@@ -0,0 +1,35 @@

+{
+  "</think>": 248069,
+  "</tool_call>": 248059,
+  "</tool_response>": 248067,
+  "<think>": 248068,
+  "<tool_call>": 248058,
+  "<tool_response>": 248066,
+  "<tts_pad>": 248072,
+  "<tts_text_bos>": 248073,
+  "<tts_text_bos_single>": 248075,
+  "<tts_text_eod>": 248074,
+  "<|audio_end|>": 248071,
+  "<|audio_pad|>": 248076,
+  "<|audio_start|>": 248070,
+  "<|box_end|>": 248050,
+  "<|box_start|>": 248049,
+  "<|endoftext|>": 248044,
+  "<|file_sep|>": 248065,
+  "<|fim_middle|>": 248061,
+  "<|fim_pad|>": 248063,
+  "<|fim_prefix|>": 248060,
+  "<|fim_suffix|>": 248062,
+  "<|im_end|>": 248046,
+  "<|im_start|>": 248045,
+  "<|image_pad|>": 248056,
+  "<|object_ref_end|>": 248048,
+  "<|object_ref_start|>": 248047,
+  "<|quad_end|>": 248052,
+  "<|quad_start|>": 248051,
+  "<|repo_name|>": 248064,
+  "<|video_pad|>": 248057,
+  "<|vision_end|>": 248054,
+  "<|vision_pad|>": 248055,
+  "<|vision_start|>": 248053
+}

chat_template.jinja ADDED

	@@ -0,0 +1,154 @@

+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- macro render_content(content, do_vision_count, is_system_content=false) %}
+    {%- if content is string %}
+        {{- content }}
+    {%- elif content is iterable and content is not mapping %}
+        {%- for item in content %}
+            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
+                {%- if is_system_content %}
+                    {{- raise_exception('System message cannot contain images.') }}
+                {%- endif %}
+                {%- if do_vision_count %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                {%- endif %}
+                {%- if add_vision_id %}
+                    {{- 'Picture ' ~ image_count.value ~ ': ' }}
+                {%- endif %}
+                {{- '<|vision_start|><|image_pad|><|vision_end|>' }}
+            {%- elif 'video' in item or item.type == 'video' %}
+                {%- if is_system_content %}
+                    {{- raise_exception('System message cannot contain videos.') }}
+                {%- endif %}
+                {%- if do_vision_count %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                {%- endif %}
+                {%- if add_vision_id %}
+                    {{- 'Video ' ~ video_count.value ~ ': ' }}
+                {%- endif %}
+                {{- '<|vision_start|><|video_pad|><|vision_end|>' }}
+            {%- elif 'text' in item %}
+                {{- item.text }}
+            {%- else %}
+                {{- raise_exception('Unexpected item type in content.') }}
+            {%- endif %}
+        {%- endfor %}
+    {%- elif content is none or content is undefined %}
+        {{- '' }}
+    {%- else %}
+        {{- raise_exception('Unexpected content type.') }}
+    {%- endif %}
+{%- endmacro %}
+{%- if not messages %}
+    {{- raise_exception('No messages provided.') }}
+{%- endif %}
+{%- if tools and tools is iterable and tools is not mapping %}
+    {{- '<|im_start|>system\n' }}
+    {{- "# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>" }}
+    {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
+    {%- if messages[0].role == 'system' %}
+        {%- set content = render_content(messages[0].content, false, true)|trim %}
+        {%- if content %}
+            {{- '\n\n' + content }}
+        {%- endif %}
+    {%- endif %}
+    {{- '<|im_end|>\n' }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {%- set content = render_content(messages[0].content, false, true)|trim %}
+        {{- '<|im_start|>system\n' + content + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" %}
+        {%- set content = render_content(message.content, false)|trim %}
+        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
+            {%- set ns.multi_step_tool = false %}
+            {%- set ns.last_query_index = index %}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if ns.multi_step_tool %}
+    {{- raise_exception('No user query found in messages.') }}
+{%- endif %}
+{%- for message in messages %}
+    {%- set content = render_content(message.content, true)|trim %}
+    {%- if message.role == "system" %}
+        {%- if not loop.first %}
+            {{- raise_exception('System message must be at the beginning.') }}
+        {%- endif %}
+    {%- elif message.role == "user" %}
+        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {%- set reasoning_content = '' %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in content %}
+                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {%- set reasoning_content = reasoning_content|trim %}
+        {%- if loop.index0 > ns.last_query_index %}
+            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
+        {%- else %}
+            {{- '<|im_start|>' + message.role + '\n' + content }}
+        {%- endif %}
+        {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if tool_call.function is defined %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {%- if loop.first %}
+                    {%- if content|trim %}
+                        {{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
+                    {%- else %}
+                        {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
+                    {%- endif %}
+                {%- else %}
+                    {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
+                {%- endif %}
+                {%- if tool_call.arguments is defined %}
+                    {%- for args_name, args_value in tool_call.arguments|items %}
+                        {{- '<parameter=' + args_name + '>\n' }}
+                        {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
+                        {{- args_value }}
+                        {{- '\n</parameter>\n' }}
+                    {%- endfor %}
+                {%- endif %}
+                {{- '</function>\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.previtem and loop.previtem.role != "tool" %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- content }}
+        {{- '\n</tool_response>' }}
+        {%- if not loop.last and loop.nextitem.role != "tool" %}
+            {{- '<|im_end|>\n' }}
+        {%- elif loop.last %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- else %}
+        {{- raise_exception('Unexpected message role.') }}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+    {%- if enable_thinking is defined and enable_thinking is false %}
+        {{- '<think>\n\n</think>\n\n' }}
+    {%- else %}
+        {{- '<think>\n' }}
+    {%- endif %}
+{%- endif %}
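
The final block of the template decides how the assistant turn opens: with `enable_thinking=False` it pre-fills an empty `<think>` block so the model proceeds straight to its answer, and in every other case it leaves `<think>` open. The sketch below mirrors that behaviour in plain Python for the simplest case only (one text-only system message and one text-only user message, no tools). `render_prompt` is a hypothetical stand-in, not the Jinja renderer.

```python
def render_prompt(system, user, enable_thinking=None):
    """Mimic the template's output for a system + user pair with a generation prompt."""
    parts = [
        "<|im_start|>system\n" + system.strip() + "<|im_end|>\n",
        "<|im_start|>user\n" + user.strip() + "<|im_end|>\n",
        "<|im_start|>assistant\n",
    ]
    if enable_thinking is False:
        # Closed, empty reasoning block: the model continues with the answer itself.
        parts.append("<think>\n\n</think>\n\n")
    else:
        # Undefined or true: the reasoning block is left open for the model to fill.
        parts.append("<think>\n")
    return "".join(parts)


print(render_prompt("Return ONLY the JSON array.", "Funded by NSF grant DMS-1613002.",
                    enable_thinking=False))
```

Since the adapter was trained with the `qwen3_5_disable_thinking` renderer, the closed-block form is the one that matches its training distribution.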

merges.txt ADDED

The diff for this file is too large to render. See raw diff

special_tokens_map.json ADDED

	@@ -0,0 +1,38 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "audio_bos_token": "<|audio_start|>",
+  "audio_eos_token": "<|audio_end|>",
+  "audio_token": "<|audio_pad|>",
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "image_token": "<|image_pad|>",
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "video_token": "<|video_pad|>",
+  "vision_bos_token": "<|vision_start|>",
+  "vision_eos_token": "<|vision_end|>"
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:87a7830d63fcf43bf241c3c5242e96e62dd3fdc29224ca26fed8ea333db72de4
+size 19989343

tokenizer_config.json ADDED

	@@ -0,0 +1,311 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "248044": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248045": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248046": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248047": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248048": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248049": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248050": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248051": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248052": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248053": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248054": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248055": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248056": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248057": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248058": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248059": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248060": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248061": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248062": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248063": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248064": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248065": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248066": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248067": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248068": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248069": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "248070": {
+      "content": "<|audio_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248071": {
+      "content": "<|audio_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248072": {
+      "content": "<tts_pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248073": {
+      "content": "<tts_text_bos>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248074": {
+      "content": "<tts_text_eod>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248075": {
+      "content": "<tts_text_bos_single>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "248076": {
+      "content": "<|audio_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "audio_bos_token": "<|audio_start|>",
+  "audio_eos_token": "<|audio_end|>",
+  "audio_token": "<|audio_pad|>",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": {
+    "audio_bos_token": "<|audio_start|>",
+    "audio_eos_token": "<|audio_end|>",
+    "audio_token": "<|audio_pad|>",
+    "image_token": "<|image_pad|>",
+    "video_token": "<|video_pad|>",
+    "vision_bos_token": "<|vision_start|>",
+    "vision_eos_token": "<|vision_end|>"
+  },
+  "image_token": "<|image_pad|>",
+  "model_max_length": 262144,
+  "pad_token": "<|endoftext|>",
+  "pretokenize_regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null,
+  "video_token": "<|video_pad|>",
+  "vision_bos_token": "<|vision_start|>",
+  "vision_eos_token": "<|vision_end|>"
+}

vocab.json ADDED

The diff for this file is too large to render. See raw diff