alabenayed committed on
Commit
2f58781
·
verified ·
1 Parent(s): c9bbf36

Upload folder using huggingface_hub

Files changed (47)
  1. .gitattributes +4 -0
  2. README.md +219 -0
  3. adapter_config.json +48 -0
  4. adapter_model.safetensors +3 -0
  5. chat_template.jinja +1 -0
  6. checkpoint-1200/README.md +209 -0
  7. checkpoint-1200/adapter_config.json +48 -0
  8. checkpoint-1200/adapter_model.safetensors +3 -0
  9. checkpoint-1200/chat_template.jinja +1 -0
  10. checkpoint-1200/optimizer.pt +3 -0
  11. checkpoint-1200/rng_state.pth +3 -0
  12. checkpoint-1200/scheduler.pt +3 -0
  13. checkpoint-1200/special_tokens_map.json +17 -0
  14. checkpoint-1200/tokenizer.json +3 -0
  15. checkpoint-1200/tokenizer_config.json +317 -0
  16. checkpoint-1200/trainer_state.json +1234 -0
  17. checkpoint-1200/training_args.bin +3 -0
  18. checkpoint-1400/README.md +209 -0
  19. checkpoint-1400/adapter_config.json +48 -0
  20. checkpoint-1400/adapter_model.safetensors +3 -0
  21. checkpoint-1400/chat_template.jinja +1 -0
  22. checkpoint-1400/optimizer.pt +3 -0
  23. checkpoint-1400/rng_state.pth +3 -0
  24. checkpoint-1400/scheduler.pt +3 -0
  25. checkpoint-1400/special_tokens_map.json +17 -0
  26. checkpoint-1400/tokenizer.json +3 -0
  27. checkpoint-1400/tokenizer_config.json +317 -0
  28. checkpoint-1400/trainer_state.json +1434 -0
  29. checkpoint-1400/training_args.bin +3 -0
  30. checkpoint-1584/README.md +209 -0
  31. checkpoint-1584/adapter_config.json +48 -0
  32. checkpoint-1584/adapter_model.safetensors +3 -0
  33. checkpoint-1584/chat_template.jinja +1 -0
  34. checkpoint-1584/optimizer.pt +3 -0
  35. checkpoint-1584/rng_state.pth +3 -0
  36. checkpoint-1584/scheduler.pt +3 -0
  37. checkpoint-1584/special_tokens_map.json +17 -0
  38. checkpoint-1584/tokenizer.json +3 -0
  39. checkpoint-1584/tokenizer_config.json +317 -0
  40. checkpoint-1584/trainer_state.json +1614 -0
  41. checkpoint-1584/training_args.bin +3 -0
  42. sft_train.log +157 -0
  43. special_tokens_map.json +23 -0
  44. tokenizer.json +3 -0
  45. tokenizer_config.json +317 -0
  46. training_args.bin +3 -0
  47. training_metrics.json +1600 -0
.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ checkpoint-1200/tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ checkpoint-1400/tokenizer.json filter=lfs diff=lfs merge=lfs -text
38
+ checkpoint-1584/tokenizer.json filter=lfs diff=lfs merge=lfs -text
39
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,219 @@
1
+ base_model: CohereLabs/aya-expanse-8b
2
+ library_name: peft
3
+ model_name: aya-expanse-8b-tunisian-sft
4
+ tags:
5
+ licence: license
6
+ pipeline_tag: text-generation
7
+
8
+ ---
9
+ base_model: CohereLabs/aya-expanse-8b
10
+ library_name: peft
11
+ model_name: TounsiLM-8b
12
+ tags:
13
+ - base_model:adapter:CohereLabs/aya-expanse-8b
14
+ - peft
15
+ - lora
16
+ - sft
17
+ - transformers
18
+ - trl
19
+ - tunisian-arabic
20
+ - text-generation
21
+ pipeline_tag: text-generation
22
+ language:
23
+ - ar
24
+ license: apache-2.0
25
+ ---
26
+
27
+ # TounsiLM-8b
28
+
29
+ `TounsiLM-8b` is a Tunisian Arabic supervised fine-tuning adapter built on top of [CohereLabs/aya-expanse-8b](https://huggingface.co/CohereLabs/aya-expanse-8b).
30
+
31
+ It is trained to answer in Tunisian دارجة, stay on topic, and keep responses short and direct when appropriate.
32
+
33
+ ## Model type
34
+
35
+ - Base model: `CohereLabs/aya-expanse-8b`
36
+ - Fine-tuning method: PEFT / LoRA-style SFT adapter
37
+ - Format: adapter checkpoint, not a fully merged standalone base model
38
+
39
+ ## Training dataset
40
+
41
+ - Dataset: `Syrinesmati/tunisian-question-response-dataset`
42
+ - Train split: `25,340` rows
43
+ - Eval split: `6,336` rows
44
+ - Input format: conversational messages built from the dataset fields `instruction` and `response`
45
+
46
+ ## Training setup
47
+
48
+ - Trainer: TRL `SFTTrainer`
49
+ - Epochs: `2`
50
+ - Max sequence length: `1024`
51
+ - Learning rate: `1e-5`
52
+ - Per-device train batch size: `8`
53
+ - Gradient accumulation: `4`
54
+ - Precision: `bf16` when supported
55
+ - Checkpoint resume: enabled
56
+
57
+ ## Training metrics
58
+
59
+ Final reported training metrics:
60
+
61
+ - Training loss: `1.1876104943680041`
62
+ - Mean token accuracy: `0.7577789686620235`
63
+ - Training runtime: `50353.3546` seconds
64
+ - Training steps: `1584`
65
+ - Total tokens seen: `9,585,534`
66
+
67
+ These are training metrics from the final log. No separate validation loss was recorded in the saved metrics file.
68
+
69
+ ## Intended use
70
+
71
+ Use this model for:
72
+
73
+ - Tunisian Arabic question answering
74
+ - chat-style assistant replies in Tunisian دارجة
75
+ - short, direct conversational responses
76
+
77
+ Not intended for:
78
+
79
+ - factual safety-critical advice
80
+ - medical/legal/financial decisions without verification
81
+ - unsupported languages outside Arabic/Tunisian use cases
82
+
83
+ ## How to use
84
+
85
+ ### Option 1: load the adapter with the base model
86
+
87
+ ```python
88
+ from transformers import AutoModelForCausalLM, AutoTokenizer
89
+ from peft import PeftModel
90
+ import torch
91
+
92
+ base_model_name = "CohereLabs/aya-expanse-8b"
93
+ adapter_dir = "TounsiLM-8b"
94
+
95
+ tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
96
+ model = AutoModelForCausalLM.from_pretrained(
97
+     base_model_name,
98
+     device_map="auto",
99
+     torch_dtype=torch.bfloat16,
100
+     trust_remote_code=True,
101
+ )
102
+ model = PeftModel.from_pretrained(model, adapter_dir)
103
+
104
+ messages = [
105
+     {"role": "system", "content": "أنت مساعد تونسي تجاوب بالتونسي الدارج فقط."},
106
+     {"role": "user", "content": "شنوة تعمل كان الواحد يحس روحو تعبان؟"},
107
+ ]
108
+
109
+ inputs = tokenizer.apply_chat_template(
110
+     messages,
111
+     add_generation_prompt=True,
112
+     tokenize=True,
113
+     return_dict=True,
114
+     return_tensors="pt",
115
+ )
116
+
117
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
118
+ output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
119
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
120
+ ```
121
+
122
+ ### Option 2: use the model in a pipeline
123
+
124
+ ```python
125
+ from transformers import pipeline
126
+ # Reuses the `model` and `tokenizer` objects loaded in Option 1.
127
+ gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
128
+ ```
129
+
130
+ ## Recommended inference settings
131
+
132
+ - `do_sample=False` for more stable answers
133
+ - `max_new_tokens=128` to reduce rambling
134
+ - `repetition_penalty=1.1`
135
+
136
+ ## Files included in this repository
137
+
138
+ - `adapter_model.safetensors`
139
+ - `adapter_config.json`
140
+ - `chat_template.jinja`
141
+ - tokenizer files
142
+ - training metrics and logs
143
+
144
+ ## Framework versions
145
+
146
+ - PEFT: `0.19.1`
147
+ - TRL: `1.3.0`
148
+ - Transformers: `4.57.6`
149
+ - PyTorch: `2.11.0`
150
+ - Datasets: `4.8.5`
151
+ - Tokenizers: `0.22.2`
152
+
153
+ ## Notes
154
+
155
+ This repository contains the fine-tuned adapter. To use it, load it on top of the base model `CohereLabs/aya-expanse-8b`.
156
+
157
+ If you want a merged standalone model later, the adapter can be merged into the base model and re-uploaded as a separate repo.
158
+
159
+ ## Citation
160
+
161
+ If you use this model, please cite the base model and the training stack used to create it.
162
+
163
+ ### TRL citation
164
+
165
+ ```bibtex
166
+ @software{vonwerra2020trl,
167
+ title = {{TRL: Transformers Reinforcement Learning}},
168
+ author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
169
+ license = {Apache-2.0},
170
+ url = {https://github.com/huggingface/trl},
171
+ year = {2020}
172
+ }
173
+ ```
174
+ This model is a fine-tuned version of [CohereLabs/aya-expanse-8b](https://huggingface.co/CohereLabs/aya-expanse-8b).
175
+ It has been trained using [TRL](https://github.com/huggingface/trl).
176
+
177
+ ## Quick start
178
+
179
+ ```python
180
+ from transformers import pipeline
181
+
182
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
183
+ generator = pipeline("text-generation", model="TounsiLM-8b", device="cuda")  # adapter path as in Option 1 (assumed)
184
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
185
+ print(output["generated_text"])
186
+ ```
187
+
188
+ ## Training procedure
189
+
190
+
191
+
192
+
193
+
194
+ This model was trained with SFT.
195
+
196
+ ### Framework versions
197
+
198
+ - PEFT: 0.19.1
199
+ - TRL: 1.3.0
200
+ - Transformers: 4.57.6
201
+ - PyTorch: 2.11.0
202
+ - Datasets: 4.8.5
203
+ - Tokenizers: 0.22.2
204
+
205
+ ## Citations
206
+
207
+
208
+
209
+ Cite TRL as:
210
+
211
+ ```bibtex
212
+ @software{vonwerra2020trl,
213
+ title = {{TRL: Transformers Reinforcement Learning}},
214
+ author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
215
+ license = {Apache-2.0},
216
+ url = {https://github.com/huggingface/trl},
217
+ year = {2020}
218
+ }
219
+ ```
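The training figures reported in the README above can be checked against each other with a little arithmetic. The sketch below assumes a single training device and that the final partial batch of each epoch is kept; neither is stated in the README. Under those assumptions, 25,340 rows at a per-device batch size of 8 with 4 gradient-accumulation steps over 2 epochs gives the reported 1,584 optimizer steps.

```python
import math

# Values taken from the README's "Training dataset" and "Training setup" sections.
train_rows = 25_340
per_device_batch = 8
grad_accum = 4
epochs = 2
num_devices = 1  # assumption: not stated in the README

# Micro-batches per epoch, keeping the last partial batch (assumption).
batches_per_epoch = math.ceil(train_rows / (per_device_batch * num_devices))
# One optimizer step per `grad_accum` micro-batches.
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs
effective_batch = per_device_batch * grad_accum * num_devices

print(effective_batch, steps_per_epoch, total_steps)  # 32 792 1584
```

The result matches both the `Training steps: 1584` metric and the name of the final checkpoint directory, `checkpoint-1584`.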
adapter_config.json ADDED
@@ -0,0 +1,48 @@
1
+ {
2
+ "alora_invocation_tokens": null,
3
+ "alpha_pattern": {},
4
+ "arrow_config": null,
5
+ "auto_mapping": null,
6
+ "base_model_name_or_path": "CohereLabs/aya-expanse-8b",
7
+ "bias": "none",
8
+ "corda_config": null,
9
+ "ensure_weight_tying": false,
10
+ "eva_config": null,
11
+ "exclude_modules": null,
12
+ "fan_in_fan_out": false,
13
+ "inference_mode": true,
14
+ "init_lora_weights": true,
15
+ "layer_replication": null,
16
+ "layers_pattern": null,
17
+ "layers_to_transform": null,
18
+ "loftq_config": {},
19
+ "lora_alpha": 32,
20
+ "lora_bias": false,
21
+ "lora_dropout": 0.05,
22
+ "lora_ga_config": null,
23
+ "megatron_config": null,
24
+ "megatron_core": "megatron.core",
25
+ "modules_to_save": null,
26
+ "peft_type": "LORA",
27
+ "peft_version": "0.19.1",
28
+ "qalora_group_size": 16,
29
+ "r": 16,
30
+ "rank_pattern": {},
31
+ "revision": null,
32
+ "target_modules": [
33
+ "k_proj",
34
+ "down_proj",
35
+ "q_proj",
36
+ "o_proj",
37
+ "gate_proj",
38
+ "v_proj",
39
+ "up_proj"
40
+ ],
41
+ "target_parameters": null,
42
+ "task_type": "CAUSAL_LM",
43
+ "trainable_token_indices": null,
44
+ "use_bdlora": null,
45
+ "use_dora": false,
46
+ "use_qalora": false,
47
+ "use_rslora": false
48
+ }
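As a rough cross-check of this configuration, the sketch below estimates how many parameters the LoRA adapter adds. The layer dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128) are assumptions about `CohereLabs/aya-expanse-8b` that do not appear in this commit and should be verified against the base model's `config.json`. Each adapted linear layer contributes `r * (in_features + out_features)` parameters, and the LoRA update is scaled by `lora_alpha / r` when `use_rslora` is false.

```python
# Values from adapter_config.json in this commit.
R = 16
LORA_ALPHA = 32

# Assumed base-model dimensions (not part of this commit).
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM = 8 * 128  # assumed: 8 key/value heads of dimension 128

# (in_features, out_features) for each entry in target_modules.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per adapted layer.
per_layer = sum(R * (n_in + n_out) for n_in, n_out in shapes.values())
total_params = per_layer * LAYERS
scaling = LORA_ALPHA / R

print(total_params, total_params * 4, scaling)  # 41943040 167772160 2.0
```

Under these assumptions, the estimate at 4 bytes per parameter is about 60 KB below the 167,832,240-byte `adapter_model.safetensors` listed in this commit, which would be consistent with float32 storage plus a small file header.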
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b4ccb6b6fb84ee6cc158f9a20474ec7ec29ec158992cc2108c5575301c294143
3
+ size 167832240
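The entry above is a Git LFS pointer rather than the weights themselves: the repository stores this short text stub, and the binary file is fetched from LFS storage by its SHA-256 digest. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of any library.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:b4ccb6b6fb84ee6cc158f9a20474ec7ec29ec158992cc2108c5575301c294143
size 167832240
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])  # sha256 167832240
```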
chat_template.jinja ADDED
@@ -0,0 +1 @@
1
+ {{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Aya, a brilliant, sophisticated, multilingual AI-assistant trained to assist human users by providing thorough responses. You are able to interact and respond to questions in 23 languages and you are powered by a multilingual model built by Cohere For AI.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}
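The one-line Jinja template above is hard to read, so here is a plain-Python sketch of the prompt it appears to produce. `render_prompt` is a hypothetical helper for illustration only; in practice `tokenizer.apply_chat_template` is the authoritative path. The default `bos_token` is taken from `special_tokens_map.json` in this commit, and the guard against an empty message list is an addition that the template itself does not have.

```python
START = "<|START_OF_TURN_TOKEN|>"
END = "<|END_OF_TURN_TOKEN|>"
ROLE_TOKENS = {"user": "<|USER_TOKEN|>", "assistant": "<|CHATBOT_TOKEN|>"}


def render_prompt(messages, bos_token="<BOS_TOKEN>", add_generation_prompt=True):
    """Mirror the chat template: optional system turn, then alternating turns."""
    parts = [bos_token]
    # A leading system message is emitted as-is (not stripped) in its own turn.
    if messages and messages[0]["role"] == "system":
        parts.append(START + "<|SYSTEM_TOKEN|>" + messages[0]["content"] + END)
        messages = messages[1:]
    for index, message in enumerate(messages):
        # Even positions must be user turns, odd positions must not be.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/...")
        role_token = ROLE_TOKENS.get(message["role"])
        if role_token is not None:
            parts.append(START + role_token + message["content"].strip() + END)
    # Open an assistant turn so the model continues with its reply.
    if add_generation_prompt:
        parts.append(START + "<|CHATBOT_TOKEN|>")
    return "".join(parts)


prompt = render_prompt([{"role": "user", "content": " hi "}])
print(prompt)
# <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>hi<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
```

Because `<|END_OF_TURN_TOKEN|>` is also configured as the EOS token in this commit, generation should stop at the end of the assistant turn.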
checkpoint-1200/README.md ADDED
@@ -0,0 +1,209 @@
1
+ ---
2
+ base_model: CohereLabs/aya-expanse-8b
3
+ library_name: peft
4
+ pipeline_tag: text-generation
5
+ tags:
6
+ - base_model:adapter:CohereLabs/aya-expanse-8b
7
+ - lora
8
+ - sft
9
+ - transformers
10
+ - trl
11
+ ---
12
+
13
+ # Model Card for Model ID
14
+
15
+ <!-- Provide a quick summary of what the model is/does. -->
16
+
17
+
18
+
19
+ ## Model Details
20
+
21
+ ### Model Description
22
+
23
+ <!-- Provide a longer summary of what this model is. -->
24
+
25
+
26
+
27
+ - **Developed by:** [More Information Needed]
28
+ - **Funded by [optional]:** [More Information Needed]
29
+ - **Shared by [optional]:** [More Information Needed]
30
+ - **Model type:** [More Information Needed]
31
+ - **Language(s) (NLP):** [More Information Needed]
32
+ - **License:** [More Information Needed]
33
+ - **Finetuned from model [optional]:** [More Information Needed]
34
+
35
+ ### Model Sources [optional]
36
+
37
+ <!-- Provide the basic links for the model. -->
38
+
39
+ - **Repository:** [More Information Needed]
40
+ - **Paper [optional]:** [More Information Needed]
41
+ - **Demo [optional]:** [More Information Needed]
42
+
43
+ ## Uses
44
+
45
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
46
+
47
+ ### Direct Use
48
+
49
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
50
+
51
+ [More Information Needed]
52
+
53
+ ### Downstream Use [optional]
54
+
55
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
56
+
57
+ [More Information Needed]
58
+
59
+ ### Out-of-Scope Use
60
+
61
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
62
+
63
+ [More Information Needed]
64
+
65
+ ## Bias, Risks, and Limitations
66
+
67
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
68
+
69
+ [More Information Needed]
70
+
71
+ ### Recommendations
72
+
73
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
74
+
75
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
76
+
77
+ ## How to Get Started with the Model
78
+
79
+ Use the code below to get started with the model.
80
+
81
+ [More Information Needed]
82
+
83
+ ## Training Details
84
+
85
+ ### Training Data
86
+
87
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
88
+
89
+ [More Information Needed]
90
+
91
+ ### Training Procedure
92
+
93
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
94
+
95
+ #### Preprocessing [optional]
96
+
97
+ [More Information Needed]
98
+
99
+
100
+ #### Training Hyperparameters
101
+
102
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
103
+
104
+ #### Speeds, Sizes, Times [optional]
105
+
106
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
107
+
108
+ [More Information Needed]
109
+
110
+ ## Evaluation
111
+
112
+ <!-- This section describes the evaluation protocols and provides the results. -->
113
+
114
+ ### Testing Data, Factors & Metrics
115
+
116
+ #### Testing Data
117
+
118
+ <!-- This should link to a Dataset Card if possible. -->
119
+
120
+ [More Information Needed]
121
+
122
+ #### Factors
123
+
124
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
125
+
126
+ [More Information Needed]
127
+
128
+ #### Metrics
129
+
130
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
131
+
132
+ [More Information Needed]
133
+
134
+ ### Results
135
+
136
+ [More Information Needed]
137
+
138
+ #### Summary
139
+
140
+
141
+
142
+ ## Model Examination [optional]
143
+
144
+ <!-- Relevant interpretability work for the model goes here -->
145
+
146
+ [More Information Needed]
147
+
148
+ ## Environmental Impact
149
+
150
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
151
+
152
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
153
+
154
+ - **Hardware Type:** [More Information Needed]
155
+ - **Hours used:** [More Information Needed]
156
+ - **Cloud Provider:** [More Information Needed]
157
+ - **Compute Region:** [More Information Needed]
158
+ - **Carbon Emitted:** [More Information Needed]
159
+
160
+ ## Technical Specifications [optional]
161
+
162
+ ### Model Architecture and Objective
163
+
164
+ [More Information Needed]
165
+
166
+ ### Compute Infrastructure
167
+
168
+ [More Information Needed]
169
+
170
+ #### Hardware
171
+
172
+ [More Information Needed]
173
+
174
+ #### Software
175
+
176
+ [More Information Needed]
177
+
178
+ ## Citation [optional]
179
+
180
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
181
+
182
+ **BibTeX:**
183
+
184
+ [More Information Needed]
185
+
186
+ **APA:**
187
+
188
+ [More Information Needed]
189
+
190
+ ## Glossary [optional]
191
+
192
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
193
+
194
+ [More Information Needed]
195
+
196
+ ## More Information [optional]
197
+
198
+ [More Information Needed]
199
+
200
+ ## Model Card Authors [optional]
201
+
202
+ [More Information Needed]
203
+
204
+ ## Model Card Contact
205
+
206
+ [More Information Needed]
207
+ ### Framework versions
208
+
209
+ - PEFT 0.19.1
checkpoint-1200/adapter_config.json ADDED
@@ -0,0 +1,48 @@
1
+ {
2
+ "alora_invocation_tokens": null,
3
+ "alpha_pattern": {},
4
+ "arrow_config": null,
5
+ "auto_mapping": null,
6
+ "base_model_name_or_path": "CohereLabs/aya-expanse-8b",
7
+ "bias": "none",
8
+ "corda_config": null,
9
+ "ensure_weight_tying": false,
10
+ "eva_config": null,
11
+ "exclude_modules": null,
12
+ "fan_in_fan_out": false,
13
+ "inference_mode": true,
14
+ "init_lora_weights": true,
15
+ "layer_replication": null,
16
+ "layers_pattern": null,
17
+ "layers_to_transform": null,
18
+ "loftq_config": {},
19
+ "lora_alpha": 32,
20
+ "lora_bias": false,
21
+ "lora_dropout": 0.05,
22
+ "lora_ga_config": null,
23
+ "megatron_config": null,
24
+ "megatron_core": "megatron.core",
25
+ "modules_to_save": null,
26
+ "peft_type": "LORA",
27
+ "peft_version": "0.19.1",
28
+ "qalora_group_size": 16,
29
+ "r": 16,
30
+ "rank_pattern": {},
31
+ "revision": null,
32
+ "target_modules": [
33
+ "k_proj",
34
+ "down_proj",
35
+ "q_proj",
36
+ "o_proj",
37
+ "gate_proj",
38
+ "v_proj",
39
+ "up_proj"
40
+ ],
41
+ "target_parameters": null,
42
+ "task_type": "CAUSAL_LM",
43
+ "trainable_token_indices": null,
44
+ "use_bdlora": null,
45
+ "use_dora": false,
46
+ "use_qalora": false,
47
+ "use_rslora": false
48
+ }
checkpoint-1200/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5024a4200df787ce8cb3a1a23d9a1439358792a465f6b3f6b931add79f84aa0a
3
+ size 167832240
checkpoint-1200/chat_template.jinja ADDED
@@ -0,0 +1 @@
1
+ {{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Aya, a brilliant, sophisticated, multilingual AI-assistant trained to assist human users by providing thorough responses. You are able to interact and respond to questions in 23 languages and you are powered by a multilingual model built by Cohere For AI.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}
checkpoint-1200/optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c53c564c508c035cec95a38fc431d6b3652878bb3968ec00e87d09170583ecac
3
+ size 335929123
checkpoint-1200/rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c1215aee6c0e3407d5373ec2bcba0768cfee0477976cf9edb2d49d4f35aeb257
3
+ size 14645
checkpoint-1200/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b68b5be1196467245d97fdf07264667e377abf37f1c86865e3ac71fc66f675b3
3
+ size 1465
checkpoint-1200/special_tokens_map.json ADDED
@@ -0,0 +1,17 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<BOS_TOKEN>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|END_OF_TURN_TOKEN|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "<PAD>"
17
+ }
checkpoint-1200/tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:345ccf04a5257f473e331715ecc69365c5ac8fc2490923fe7155560af809ec1a
3
+ size 20124090
checkpoint-1200/tokenizer_config.json ADDED
@@ -0,0 +1,317 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": false,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<PAD>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<UNK>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "<CLS>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "3": {
31
+ "content": "<SEP>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "4": {
39
+ "content": "<MASK_TOKEN>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "5": {
47
+ "content": "<BOS_TOKEN>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "6": {
55
+ "content": "<EOS_TOKEN>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "7": {
63
+ "content": "<EOP_TOKEN>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "255000": {
71
+ "content": "<|START_OF_TURN_TOKEN|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": false
77
+ },
78
+ "255001": {
79
+ "content": "<|END_OF_TURN_TOKEN|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "255002": {
87
+ "content": "<|YES_TOKEN|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": false
93
+ },
94
+ "255003": {
95
+ "content": "<|NO_TOKEN|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": false
101
+ },
102
+ "255004": {
103
+ "content": "<|GOOD_TOKEN|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": false
109
+ },
110
+ "255005": {
111
+ "content": "<|BAD_TOKEN|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": false
117
+ },
118
+ "255006": {
119
+ "content": "<|USER_TOKEN|>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": false
125
+ },
126
+ "255007": {
127
+ "content": "<|CHATBOT_TOKEN|>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": false
133
+ },
134
+ "255008": {
135
+ "content": "<|SYSTEM_TOKEN|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": false
141
+ },
142
+ "255009": {
143
+ "content": "<|USER_0_TOKEN|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": false
149
+ },
150
+ "255010": {
151
+ "content": "<|USER_1_TOKEN|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": false
157
+ },
158
+ "255011": {
159
+ "content": "<|USER_2_TOKEN|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": false
165
+ },
166
+ "255012": {
167
+ "content": "<|USER_3_TOKEN|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": false
173
+ },
174
+ "255013": {
175
+ "content": "<|USER_4_TOKEN|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": false
181
+ },
182
+ "255014": {
183
+ "content": "<|USER_5_TOKEN|>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": false
189
+ },
190
+ "255015": {
191
+ "content": "<|USER_6_TOKEN|>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": false
197
+ },
198
+ "255016": {
199
+ "content": "<|USER_7_TOKEN|>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": false
205
+ },
206
+ "255017": {
207
+ "content": "<|USER_8_TOKEN|>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": false
213
+ },
214
+ "255018": {
215
+ "content": "<|USER_9_TOKEN|>",
216
+ "lstrip": false,
217
+ "normalized": false,
218
+ "rstrip": false,
219
+ "single_word": false,
220
+ "special": false
221
+ },
222
+ "255019": {
223
+ "content": "<|EXTRA_0_TOKEN|>",
224
+ "lstrip": false,
225
+ "normalized": false,
226
+ "rstrip": false,
227
+ "single_word": false,
228
+ "special": false
229
+ },
230
+ "255020": {
231
+ "content": "<|EXTRA_1_TOKEN|>",
232
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255021": {
+ "content": "<|EXTRA_2_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255022": {
+ "content": "<|EXTRA_3_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255023": {
+ "content": "<|EXTRA_4_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255024": {
+ "content": "<|EXTRA_5_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255025": {
+ "content": "<|EXTRA_6_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255026": {
+ "content": "<|EXTRA_7_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255027": {
+ "content": "<|EXTRA_8_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255028": {
+ "content": "<|EXTRA_9_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ }
+ },
+ "bos_token": "<BOS_TOKEN>",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|END_OF_TURN_TOKEN|>",
+ "extra_special_tokens": {},
+ "legacy": true,
+ "merges_file": null,
+ "model_max_length": 1000000000000000019884624838656,
+ "pad_token": "<PAD>",
+ "sp_model_kwargs": {},
+ "spaces_between_special_tokens": false,
+ "tokenizer_class": "CohereTokenizer",
+ "unk_token": null,
+ "use_default_system_prompt": false,
+ "vocab_file": null
+ }
checkpoint-1200/trainer_state.json ADDED
@@ -0,0 +1,1234 @@
+ {
+ "best_global_step": null,
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 1.5151515151515151,
+ "eval_steps": 200,
+ "global_step": 1200,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "entropy": 2.4040649354457857,
+ "epoch": 0.012626262626262626,
+ "grad_norm": 4.66565465927124,
+ "learning_rate": 1.8750000000000003e-06,
+ "loss": 3.6021,
+ "mean_token_accuracy": 0.4221017129719257,
+ "num_tokens": 61199.0,
+ "step": 10
+ },
+ {
+ "entropy": 2.3836746215820312,
+ "epoch": 0.025252525252525252,
+ "grad_norm": 3.8161869049072266,
+ "learning_rate": 3.958333333333333e-06,
+ "loss": 3.3432,
+ "mean_token_accuracy": 0.44042530804872515,
+ "num_tokens": 122423.0,
+ "step": 20
+ },
+ {
+ "entropy": 2.355724626779556,
+ "epoch": 0.03787878787878788,
+ "grad_norm": 3.8800699710845947,
+ "learning_rate": 6.041666666666667e-06,
+ "loss": 2.9033,
+ "mean_token_accuracy": 0.48426677361130716,
+ "num_tokens": 182649.0,
+ "step": 30
+ },
+ {
+ "entropy": 2.092331054806709,
+ "epoch": 0.050505050505050504,
+ "grad_norm": 2.8217720985412598,
+ "learning_rate": 8.125000000000001e-06,
+ "loss": 2.356,
+ "mean_token_accuracy": 0.5772452697157859,
+ "num_tokens": 243049.0,
+ "step": 40
+ },
+ {
+ "entropy": 1.6766322344541549,
+ "epoch": 0.06313131313131314,
+ "grad_norm": 1.4623568058013916,
+ "learning_rate": 9.993489583333334e-06,
+ "loss": 1.8899,
+ "mean_token_accuracy": 0.6480962842702865,
+ "num_tokens": 304326.0,
+ "step": 50
+ },
+ {
+ "entropy": 1.5568815559148788,
+ "epoch": 0.07575757575757576,
+ "grad_norm": 1.171562671661377,
+ "learning_rate": 9.928385416666668e-06,
+ "loss": 1.677,
+ "mean_token_accuracy": 0.6776855796575546,
+ "num_tokens": 364866.0,
+ "step": 60
+ },
+ {
+ "entropy": 1.48199902176857,
+ "epoch": 0.08838383838383838,
+ "grad_norm": 0.9904961585998535,
+ "learning_rate": 9.863281250000001e-06,
+ "loss": 1.5337,
+ "mean_token_accuracy": 0.697019773721695,
+ "num_tokens": 423478.0,
+ "step": 70
+ },
+ {
+ "entropy": 1.497376424074173,
+ "epoch": 0.10101010101010101,
+ "grad_norm": 0.9454260468482971,
+ "learning_rate": 9.798177083333335e-06,
+ "loss": 1.4953,
+ "mean_token_accuracy": 0.6976823702454567,
+ "num_tokens": 483659.0,
+ "step": 80
+ },
+ {
+ "entropy": 1.4664768785238267,
+ "epoch": 0.11363636363636363,
+ "grad_norm": 0.8955270648002625,
+ "learning_rate": 9.733072916666667e-06,
+ "loss": 1.4356,
+ "mean_token_accuracy": 0.7069446608424187,
+ "num_tokens": 544453.0,
+ "step": 90
+ },
+ {
+ "entropy": 1.4269085675477982,
+ "epoch": 0.12626262626262627,
+ "grad_norm": 0.9242203235626221,
+ "learning_rate": 9.66796875e-06,
+ "loss": 1.4106,
+ "mean_token_accuracy": 0.7122392952442169,
+ "num_tokens": 604546.0,
+ "step": 100
+ },
+ {
+ "entropy": 1.4060751020908355,
+ "epoch": 0.1388888888888889,
+ "grad_norm": 0.8968560695648193,
+ "learning_rate": 9.602864583333335e-06,
+ "loss": 1.3487,
+ "mean_token_accuracy": 0.7178552970290184,
+ "num_tokens": 664860.0,
+ "step": 110
+ },
+ {
+ "entropy": 1.4018951296806335,
+ "epoch": 0.15151515151515152,
+ "grad_norm": 0.9047113656997681,
+ "learning_rate": 9.537760416666667e-06,
+ "loss": 1.3347,
+ "mean_token_accuracy": 0.7208079636096955,
+ "num_tokens": 725022.0,
+ "step": 120
+ },
+ {
+ "entropy": 1.3809731483459473,
+ "epoch": 0.16414141414141414,
+ "grad_norm": 0.8915444016456604,
+ "learning_rate": 9.47265625e-06,
+ "loss": 1.3155,
+ "mean_token_accuracy": 0.7267520889639855,
+ "num_tokens": 785586.0,
+ "step": 130
+ },
+ {
+ "entropy": 1.3699676394462585,
+ "epoch": 0.17676767676767677,
+ "grad_norm": 0.8574295043945312,
+ "learning_rate": 9.407552083333334e-06,
+ "loss": 1.3016,
+ "mean_token_accuracy": 0.7266673430800438,
+ "num_tokens": 845790.0,
+ "step": 140
+ },
+ {
+ "entropy": 1.3425012439489366,
+ "epoch": 0.1893939393939394,
+ "grad_norm": 0.8231800198554993,
+ "learning_rate": 9.342447916666668e-06,
+ "loss": 1.2823,
+ "mean_token_accuracy": 0.7277117937803268,
+ "num_tokens": 905842.0,
+ "step": 150
+ },
+ {
+ "entropy": 1.3314216613769532,
+ "epoch": 0.20202020202020202,
+ "grad_norm": 0.8166369795799255,
+ "learning_rate": 9.277343750000001e-06,
+ "loss": 1.2917,
+ "mean_token_accuracy": 0.7278609350323677,
+ "num_tokens": 966487.0,
+ "step": 160
+ },
+ {
+ "entropy": 1.30389544069767,
+ "epoch": 0.21464646464646464,
+ "grad_norm": 0.7738587260246277,
+ "learning_rate": 9.212239583333335e-06,
+ "loss": 1.2481,
+ "mean_token_accuracy": 0.7335339426994324,
+ "num_tokens": 1025558.0,
+ "step": 170
+ },
+ {
+ "entropy": 1.311449444293976,
+ "epoch": 0.22727272727272727,
+ "grad_norm": 0.7718328833580017,
+ "learning_rate": 9.147135416666667e-06,
+ "loss": 1.2643,
+ "mean_token_accuracy": 0.7285080313682556,
+ "num_tokens": 1086677.0,
+ "step": 180
+ },
+ {
+ "entropy": 1.3079361200332642,
+ "epoch": 0.2398989898989899,
+ "grad_norm": 0.7341915369033813,
+ "learning_rate": 9.082031250000001e-06,
+ "loss": 1.2641,
+ "mean_token_accuracy": 0.729550538957119,
+ "num_tokens": 1147885.0,
+ "step": 190
+ },
+ {
+ "entropy": 1.2794219702482224,
+ "epoch": 0.25252525252525254,
+ "grad_norm": 0.748540997505188,
+ "learning_rate": 9.016927083333335e-06,
+ "loss": 1.2397,
+ "mean_token_accuracy": 0.7351120054721832,
+ "num_tokens": 1207321.0,
+ "step": 200
+ },
+ {
+ "entropy": 1.294454461336136,
+ "epoch": 0.26515151515151514,
+ "grad_norm": 0.7585553526878357,
+ "learning_rate": 8.951822916666667e-06,
+ "loss": 1.2489,
+ "mean_token_accuracy": 0.7322148531675339,
+ "num_tokens": 1267837.0,
+ "step": 210
+ },
+ {
+ "entropy": 1.2862246632575989,
+ "epoch": 0.2777777777777778,
+ "grad_norm": 0.6937864422798157,
+ "learning_rate": 8.88671875e-06,
+ "loss": 1.2383,
+ "mean_token_accuracy": 0.7358245223760604,
+ "num_tokens": 1328436.0,
+ "step": 220
+ },
+ {
+ "entropy": 1.2531811505556107,
+ "epoch": 0.2904040404040404,
+ "grad_norm": 0.6792387366294861,
+ "learning_rate": 8.821614583333334e-06,
+ "loss": 1.2007,
+ "mean_token_accuracy": 0.7360369265079498,
+ "num_tokens": 1389877.0,
+ "step": 230
+ },
+ {
+ "entropy": 1.2815489560365676,
+ "epoch": 0.30303030303030304,
+ "grad_norm": 0.6865427494049072,
+ "learning_rate": 8.756510416666666e-06,
+ "loss": 1.2474,
+ "mean_token_accuracy": 0.7304804190993309,
+ "num_tokens": 1450927.0,
+ "step": 240
+ },
+ {
+ "entropy": 1.2568059146404267,
+ "epoch": 0.31565656565656564,
+ "grad_norm": 0.669840395450592,
+ "learning_rate": 8.69140625e-06,
+ "loss": 1.2172,
+ "mean_token_accuracy": 0.7385278165340423,
+ "num_tokens": 1511174.0,
+ "step": 250
+ },
+ {
+ "entropy": 1.254868358373642,
+ "epoch": 0.3282828282828283,
+ "grad_norm": 0.6434893012046814,
+ "learning_rate": 8.626302083333334e-06,
+ "loss": 1.213,
+ "mean_token_accuracy": 0.7380544006824493,
+ "num_tokens": 1570570.0,
+ "step": 260
+ },
+ {
+ "entropy": 1.2532441645860672,
+ "epoch": 0.3409090909090909,
+ "grad_norm": 0.6034978032112122,
+ "learning_rate": 8.561197916666667e-06,
+ "loss": 1.2116,
+ "mean_token_accuracy": 0.7382062628865242,
+ "num_tokens": 1630930.0,
+ "step": 270
+ },
+ {
+ "entropy": 1.2605280816555022,
+ "epoch": 0.35353535353535354,
+ "grad_norm": 0.6371450424194336,
+ "learning_rate": 8.496093750000001e-06,
+ "loss": 1.2313,
+ "mean_token_accuracy": 0.7328852489590645,
+ "num_tokens": 1692624.0,
+ "step": 280
+ },
+ {
+ "entropy": 1.248606452345848,
+ "epoch": 0.3661616161616162,
+ "grad_norm": 0.6300278306007385,
+ "learning_rate": 8.430989583333335e-06,
+ "loss": 1.2195,
+ "mean_token_accuracy": 0.7370449885725975,
+ "num_tokens": 1754213.0,
+ "step": 290
+ },
+ {
+ "entropy": 1.2246102809906005,
+ "epoch": 0.3787878787878788,
+ "grad_norm": 0.6430155634880066,
+ "learning_rate": 8.365885416666667e-06,
+ "loss": 1.1845,
+ "mean_token_accuracy": 0.7439196646213532,
+ "num_tokens": 1813841.0,
+ "step": 300
+ },
+ {
+ "entropy": 1.2188815206289292,
+ "epoch": 0.39141414141414144,
+ "grad_norm": 0.6395701766014099,
+ "learning_rate": 8.30078125e-06,
+ "loss": 1.1856,
+ "mean_token_accuracy": 0.7422183871269226,
+ "num_tokens": 1873568.0,
+ "step": 310
+ },
+ {
+ "entropy": 1.234528934955597,
+ "epoch": 0.40404040404040403,
+ "grad_norm": 0.6168740391731262,
+ "learning_rate": 8.235677083333334e-06,
+ "loss": 1.1951,
+ "mean_token_accuracy": 0.7380509555339814,
+ "num_tokens": 1935518.0,
+ "step": 320
+ },
+ {
+ "entropy": 1.226365676522255,
+ "epoch": 0.4166666666666667,
+ "grad_norm": 0.611132800579071,
+ "learning_rate": 8.170572916666666e-06,
+ "loss": 1.2078,
+ "mean_token_accuracy": 0.739974245429039,
+ "num_tokens": 1995604.0,
+ "step": 330
+ },
+ {
+ "entropy": 1.2132183194160462,
+ "epoch": 0.4292929292929293,
+ "grad_norm": 0.6103131771087646,
+ "learning_rate": 8.10546875e-06,
+ "loss": 1.1642,
+ "mean_token_accuracy": 0.7444137379527092,
+ "num_tokens": 2056484.0,
+ "step": 340
+ },
+ {
+ "entropy": 1.223158246278763,
+ "epoch": 0.44191919191919193,
+ "grad_norm": 0.6188805103302002,
+ "learning_rate": 8.040364583333334e-06,
+ "loss": 1.2001,
+ "mean_token_accuracy": 0.7385613292455673,
+ "num_tokens": 2118437.0,
+ "step": 350
+ },
+ {
+ "entropy": 1.2154075980186463,
+ "epoch": 0.45454545454545453,
+ "grad_norm": 0.6238694190979004,
+ "learning_rate": 7.975260416666668e-06,
+ "loss": 1.1848,
+ "mean_token_accuracy": 0.7404153689742088,
+ "num_tokens": 2179333.0,
+ "step": 360
+ },
+ {
+ "entropy": 1.197928261756897,
+ "epoch": 0.4671717171717172,
+ "grad_norm": 0.6028566956520081,
+ "learning_rate": 7.910156250000001e-06,
+ "loss": 1.1597,
+ "mean_token_accuracy": 0.7475010469555855,
+ "num_tokens": 2239604.0,
+ "step": 370
+ },
+ {
+ "entropy": 1.189855706691742,
+ "epoch": 0.4797979797979798,
+ "grad_norm": 0.6569434404373169,
+ "learning_rate": 7.845052083333335e-06,
+ "loss": 1.1805,
+ "mean_token_accuracy": 0.7427218139171601,
+ "num_tokens": 2300936.0,
+ "step": 380
+ },
+ {
+ "entropy": 1.2076493889093398,
+ "epoch": 0.49242424242424243,
+ "grad_norm": 0.6351733207702637,
+ "learning_rate": 7.779947916666667e-06,
+ "loss": 1.1759,
+ "mean_token_accuracy": 0.7408407002687454,
+ "num_tokens": 2361539.0,
+ "step": 390
+ },
+ {
+ "entropy": 1.2223081022500992,
+ "epoch": 0.5050505050505051,
+ "grad_norm": 0.6327986121177673,
+ "learning_rate": 7.71484375e-06,
+ "loss": 1.1874,
+ "mean_token_accuracy": 0.7407250568270684,
+ "num_tokens": 2422162.0,
+ "step": 400
+ },
+ {
+ "entropy": 1.2013412863016129,
+ "epoch": 0.5176767676767676,
+ "grad_norm": 0.622104823589325,
+ "learning_rate": 7.649739583333334e-06,
+ "loss": 1.1726,
+ "mean_token_accuracy": 0.744242025911808,
+ "num_tokens": 2483447.0,
+ "step": 410
+ },
+ {
+ "entropy": 1.2116613179445266,
+ "epoch": 0.5303030303030303,
+ "grad_norm": 0.637651264667511,
+ "learning_rate": 7.5846354166666665e-06,
+ "loss": 1.1838,
+ "mean_token_accuracy": 0.7405030101537704,
+ "num_tokens": 2544848.0,
+ "step": 420
+ },
+ {
+ "entropy": 1.2024697184562683,
+ "epoch": 0.5429292929292929,
+ "grad_norm": 0.6252374649047852,
+ "learning_rate": 7.51953125e-06,
+ "loss": 1.1681,
+ "mean_token_accuracy": 0.7458183988928795,
+ "num_tokens": 2605232.0,
+ "step": 430
+ },
+ {
+ "entropy": 1.1797083109617232,
+ "epoch": 0.5555555555555556,
+ "grad_norm": 0.6502755284309387,
+ "learning_rate": 7.454427083333334e-06,
+ "loss": 1.1452,
+ "mean_token_accuracy": 0.7477341219782829,
+ "num_tokens": 2664276.0,
+ "step": 440
+ },
+ {
+ "entropy": 1.1964200481772422,
+ "epoch": 0.5681818181818182,
+ "grad_norm": 0.639979362487793,
+ "learning_rate": 7.389322916666667e-06,
+ "loss": 1.1665,
+ "mean_token_accuracy": 0.7431837096810341,
+ "num_tokens": 2724073.0,
+ "step": 450
+ },
+ {
+ "entropy": 1.1795272737741471,
+ "epoch": 0.5808080808080808,
+ "grad_norm": 0.6212354302406311,
+ "learning_rate": 7.3242187500000006e-06,
+ "loss": 1.1529,
+ "mean_token_accuracy": 0.7479098170995713,
+ "num_tokens": 2784262.0,
+ "step": 460
+ },
+ {
+ "entropy": 1.1908618807792664,
+ "epoch": 0.5934343434343434,
+ "grad_norm": 0.6528693437576294,
+ "learning_rate": 7.259114583333334e-06,
+ "loss": 1.1678,
+ "mean_token_accuracy": 0.745174677670002,
+ "num_tokens": 2843804.0,
+ "step": 470
+ },
+ {
+ "entropy": 1.1862946093082427,
+ "epoch": 0.6060606060606061,
+ "grad_norm": 0.639481246471405,
+ "learning_rate": 7.194010416666667e-06,
+ "loss": 1.1565,
+ "mean_token_accuracy": 0.7461866185069084,
+ "num_tokens": 2903408.0,
+ "step": 480
+ },
+ {
+ "entropy": 1.151743534207344,
+ "epoch": 0.6186868686868687,
+ "grad_norm": 0.6332777142524719,
+ "learning_rate": 7.128906250000001e-06,
+ "loss": 1.1251,
+ "mean_token_accuracy": 0.7535199671983719,
+ "num_tokens": 2963401.0,
+ "step": 490
+ },
+ {
+ "entropy": 1.1778477430343628,
+ "epoch": 0.6313131313131313,
+ "grad_norm": 0.5991836190223694,
+ "learning_rate": 7.063802083333335e-06,
+ "loss": 1.1407,
+ "mean_token_accuracy": 0.7491110354661942,
+ "num_tokens": 3023530.0,
+ "step": 500
+ },
+ {
+ "entropy": 1.2023035794496537,
+ "epoch": 0.6439393939393939,
+ "grad_norm": 0.6293458938598633,
+ "learning_rate": 6.998697916666667e-06,
+ "loss": 1.1724,
+ "mean_token_accuracy": 0.7405256554484367,
+ "num_tokens": 3085225.0,
+ "step": 510
+ },
+ {
+ "entropy": 1.1997334092855454,
+ "epoch": 0.6565656565656566,
+ "grad_norm": 0.6213802695274353,
+ "learning_rate": 6.93359375e-06,
+ "loss": 1.1604,
+ "mean_token_accuracy": 0.7435309410095214,
+ "num_tokens": 3145749.0,
+ "step": 520
+ },
+ {
+ "entropy": 1.1643184214830398,
+ "epoch": 0.6691919191919192,
+ "grad_norm": 0.6495156288146973,
+ "learning_rate": 6.868489583333334e-06,
+ "loss": 1.1381,
+ "mean_token_accuracy": 0.7496751576662064,
+ "num_tokens": 3205761.0,
+ "step": 530
+ },
+ {
+ "entropy": 1.1837532848119736,
+ "epoch": 0.6818181818181818,
+ "grad_norm": 0.6004510521888733,
+ "learning_rate": 6.803385416666667e-06,
+ "loss": 1.1563,
+ "mean_token_accuracy": 0.7447851061820984,
+ "num_tokens": 3267024.0,
+ "step": 540
+ },
+ {
+ "entropy": 1.201677542924881,
+ "epoch": 0.6944444444444444,
+ "grad_norm": 0.607467532157898,
+ "learning_rate": 6.738281250000001e-06,
+ "loss": 1.1776,
+ "mean_token_accuracy": 0.7406648993492126,
+ "num_tokens": 3329070.0,
+ "step": 550
+ },
+ {
+ "entropy": 1.1659786373376846,
+ "epoch": 0.7070707070707071,
+ "grad_norm": 0.6079947352409363,
+ "learning_rate": 6.6731770833333345e-06,
+ "loss": 1.1298,
+ "mean_token_accuracy": 0.7505511298775673,
+ "num_tokens": 3389505.0,
+ "step": 560
+ },
+ {
+ "entropy": 1.1714913487434386,
+ "epoch": 0.7196969696969697,
+ "grad_norm": 0.6534572839736938,
+ "learning_rate": 6.6080729166666665e-06,
+ "loss": 1.144,
+ "mean_token_accuracy": 0.7481625184416771,
+ "num_tokens": 3449834.0,
+ "step": 570
+ },
+ {
+ "entropy": 1.1694963037967683,
+ "epoch": 0.7323232323232324,
+ "grad_norm": 0.5903164744377136,
+ "learning_rate": 6.54296875e-06,
+ "loss": 1.139,
+ "mean_token_accuracy": 0.7476441130042076,
+ "num_tokens": 3510701.0,
+ "step": 580
+ },
+ {
+ "entropy": 1.1777653217315673,
+ "epoch": 0.7449494949494949,
+ "grad_norm": 0.6284182071685791,
+ "learning_rate": 6.477864583333334e-06,
+ "loss": 1.1422,
+ "mean_token_accuracy": 0.7490358456969262,
+ "num_tokens": 3570992.0,
+ "step": 590
+ },
+ {
+ "entropy": 1.1588268011808396,
+ "epoch": 0.7575757575757576,
+ "grad_norm": 0.6250146627426147,
+ "learning_rate": 6.412760416666667e-06,
+ "loss": 1.1336,
+ "mean_token_accuracy": 0.7504108369350433,
+ "num_tokens": 3631087.0,
+ "step": 600
+ },
+ {
+ "entropy": 1.1890273630619048,
+ "epoch": 0.7702020202020202,
+ "grad_norm": 0.6420578956604004,
+ "learning_rate": 6.3476562500000006e-06,
+ "loss": 1.1534,
+ "mean_token_accuracy": 0.7442372158169747,
+ "num_tokens": 3692871.0,
+ "step": 610
+ },
+ {
+ "entropy": 1.1817798465490341,
+ "epoch": 0.7828282828282829,
+ "grad_norm": 0.6156490445137024,
+ "learning_rate": 6.282552083333334e-06,
+ "loss": 1.1477,
+ "mean_token_accuracy": 0.7468263059854507,
+ "num_tokens": 3753671.0,
+ "step": 620
+ },
+ {
+ "entropy": 1.1686612635850906,
+ "epoch": 0.7954545454545454,
+ "grad_norm": 0.6248748898506165,
+ "learning_rate": 6.217447916666667e-06,
+ "loss": 1.139,
+ "mean_token_accuracy": 0.7478924334049225,
+ "num_tokens": 3813110.0,
+ "step": 630
+ },
+ {
+ "entropy": 1.1387953266501427,
+ "epoch": 0.8080808080808081,
+ "grad_norm": 0.6052266359329224,
+ "learning_rate": 6.152343750000001e-06,
+ "loss": 1.1118,
+ "mean_token_accuracy": 0.7538487210869789,
+ "num_tokens": 3873477.0,
+ "step": 640
+ },
+ {
+ "entropy": 1.1343814879655838,
+ "epoch": 0.8207070707070707,
+ "grad_norm": 0.6769536137580872,
+ "learning_rate": 6.087239583333335e-06,
+ "loss": 1.1108,
+ "mean_token_accuracy": 0.7548464313149452,
+ "num_tokens": 3932873.0,
+ "step": 650
+ },
+ {
+ "entropy": 1.1686519652605056,
+ "epoch": 0.8333333333333334,
+ "grad_norm": 0.6545736789703369,
+ "learning_rate": 6.022135416666667e-06,
+ "loss": 1.134,
+ "mean_token_accuracy": 0.7495195478200912,
+ "num_tokens": 3992754.0,
+ "step": 660
+ },
+ {
+ "entropy": 1.1596283346414566,
+ "epoch": 0.8459595959595959,
+ "grad_norm": 0.6192017793655396,
+ "learning_rate": 5.95703125e-06,
+ "loss": 1.1213,
+ "mean_token_accuracy": 0.7510297149419785,
+ "num_tokens": 4053540.0,
+ "step": 670
+ },
+ {
+ "entropy": 1.1484344542026519,
+ "epoch": 0.8585858585858586,
+ "grad_norm": 0.6520631909370422,
+ "learning_rate": 5.891927083333334e-06,
+ "loss": 1.1181,
+ "mean_token_accuracy": 0.751795919239521,
+ "num_tokens": 4113582.0,
+ "step": 680
+ },
+ {
+ "entropy": 1.1519750133156776,
+ "epoch": 0.8712121212121212,
+ "grad_norm": 0.6247655153274536,
+ "learning_rate": 5.826822916666667e-06,
+ "loss": 1.1204,
+ "mean_token_accuracy": 0.7506368085741997,
+ "num_tokens": 4174983.0,
+ "step": 690
+ },
+ {
+ "entropy": 1.1486561581492425,
+ "epoch": 0.8838383838383839,
+ "grad_norm": 0.620272159576416,
+ "learning_rate": 5.761718750000001e-06,
+ "loss": 1.1191,
+ "mean_token_accuracy": 0.7536391675472259,
+ "num_tokens": 4234465.0,
+ "step": 700
+ },
+ {
+ "entropy": 1.1504923462867738,
+ "epoch": 0.8964646464646465,
+ "grad_norm": 0.6308649182319641,
+ "learning_rate": 5.6966145833333344e-06,
+ "loss": 1.1224,
+ "mean_token_accuracy": 0.750646598637104,
+ "num_tokens": 4295955.0,
+ "step": 710
+ },
+ {
+ "entropy": 1.1561898440122604,
+ "epoch": 0.9090909090909091,
+ "grad_norm": 0.6629899740219116,
+ "learning_rate": 5.6315104166666665e-06,
+ "loss": 1.1238,
+ "mean_token_accuracy": 0.7507557719945908,
+ "num_tokens": 4357171.0,
+ "step": 720
+ },
+ {
+ "entropy": 1.1413449853658677,
+ "epoch": 0.9217171717171717,
+ "grad_norm": 0.5972346067428589,
+ "learning_rate": 5.56640625e-06,
+ "loss": 1.1038,
+ "mean_token_accuracy": 0.7554104939103127,
+ "num_tokens": 4417636.0,
+ "step": 730
+ },
+ {
+ "entropy": 1.1356119453907012,
+ "epoch": 0.9343434343434344,
+ "grad_norm": 0.6356479525566101,
+ "learning_rate": 5.501302083333334e-06,
+ "loss": 1.1005,
+ "mean_token_accuracy": 0.7565629109740257,
+ "num_tokens": 4477294.0,
+ "step": 740
+ },
+ {
+ "entropy": 1.1633600294589996,
+ "epoch": 0.946969696969697,
+ "grad_norm": 0.6416464447975159,
+ "learning_rate": 5.436197916666667e-06,
+ "loss": 1.1225,
+ "mean_token_accuracy": 0.7515855401754379,
+ "num_tokens": 4537503.0,
+ "step": 750
+ },
+ {
+ "entropy": 1.1527763932943345,
+ "epoch": 0.9595959595959596,
+ "grad_norm": 0.6126084327697754,
+ "learning_rate": 5.3710937500000005e-06,
+ "loss": 1.1184,
+ "mean_token_accuracy": 0.7526160582900048,
+ "num_tokens": 4598778.0,
+ "step": 760
+ },
+ {
+ "entropy": 1.1397768080234527,
+ "epoch": 0.9722222222222222,
+ "grad_norm": 0.6359922289848328,
+ "learning_rate": 5.305989583333334e-06,
+ "loss": 1.1144,
+ "mean_token_accuracy": 0.7548302739858628,
+ "num_tokens": 4658978.0,
+ "step": 770
+ },
+ {
+ "entropy": 1.1569962561130525,
+ "epoch": 0.9848484848484849,
+ "grad_norm": 0.6260409951210022,
+ "learning_rate": 5.240885416666667e-06,
+ "loss": 1.1213,
+ "mean_token_accuracy": 0.7512885302305221,
+ "num_tokens": 4720500.0,
+ "step": 780
+ },
+ {
+ "entropy": 1.1509152203798294,
+ "epoch": 0.9974747474747475,
+ "grad_norm": 0.6293452978134155,
+ "learning_rate": 5.17578125e-06,
+ "loss": 1.1227,
+ "mean_token_accuracy": 0.7519564241170883,
+ "num_tokens": 4781612.0,
+ "step": 790
+ },
+ {
+ "entropy": 1.1399024561047555,
+ "epoch": 1.0101010101010102,
+ "grad_norm": 0.664761483669281,
+ "learning_rate": 5.110677083333334e-06,
+ "loss": 1.1034,
+ "mean_token_accuracy": 0.7526706486940384,
+ "num_tokens": 4841359.0,
+ "step": 800
+ },
+ {
+ "entropy": 1.120214229822159,
+ "epoch": 1.0227272727272727,
+ "grad_norm": 0.5934865474700928,
+ "learning_rate": 5.045572916666667e-06,
+ "loss": 1.0857,
+ "mean_token_accuracy": 0.7593594208359719,
+ "num_tokens": 4901016.0,
+ "step": 810
+ },
+ {
+ "entropy": 1.1430341199040412,
+ "epoch": 1.0353535353535352,
+ "grad_norm": 0.6040735840797424,
+ "learning_rate": 4.98046875e-06,
+ "loss": 1.1165,
+ "mean_token_accuracy": 0.7525988414883613,
+ "num_tokens": 4961646.0,
+ "step": 820
+ },
+ {
+ "entropy": 1.1245349109172822,
+ "epoch": 1.047979797979798,
+ "grad_norm": 0.6277610063552856,
+ "learning_rate": 4.915364583333333e-06,
+ "loss": 1.0851,
+ "mean_token_accuracy": 0.7577921718358993,
+ "num_tokens": 5022365.0,
+ "step": 830
+ },
+ {
+ "entropy": 1.1204675793647767,
+ "epoch": 1.0606060606060606,
+ "grad_norm": 0.6260582804679871,
+ "learning_rate": 4.850260416666667e-06,
+ "loss": 1.0813,
+ "mean_token_accuracy": 0.7580071151256561,
+ "num_tokens": 5081972.0,
+ "step": 840
+ },
+ {
+ "entropy": 1.1222782507538795,
+ "epoch": 1.0732323232323233,
+ "grad_norm": 0.6023226976394653,
+ "learning_rate": 4.785156250000001e-06,
+ "loss": 1.0922,
+ "mean_token_accuracy": 0.7566258609294891,
+ "num_tokens": 5142184.0,
+ "step": 850
+ },
+ {
+ "entropy": 1.1227335944771766,
+ "epoch": 1.0858585858585859,
+ "grad_norm": 0.6206791996955872,
+ "learning_rate": 4.7200520833333336e-06,
+ "loss": 1.0994,
+ "mean_token_accuracy": 0.7540625646710396,
+ "num_tokens": 5203020.0,
+ "step": 860
+ },
+ {
+ "entropy": 1.1352888554334641,
+ "epoch": 1.0984848484848484,
+ "grad_norm": 0.6301055550575256,
+ "learning_rate": 4.654947916666667e-06,
+ "loss": 1.0958,
+ "mean_token_accuracy": 0.7562039017677307,
+ "num_tokens": 5263291.0,
+ "step": 870
+ },
+ {
+ "entropy": 1.1120088309049607,
+ "epoch": 1.1111111111111112,
+ "grad_norm": 0.6210020780563354,
+ "learning_rate": 4.58984375e-06,
+ "loss": 1.0793,
+ "mean_token_accuracy": 0.7590557768940925,
+ "num_tokens": 5323972.0,
+ "step": 880
+ },
+ {
+ "entropy": 1.0996666207909584,
+ "epoch": 1.1237373737373737,
+ "grad_norm": 0.6332690715789795,
+ "learning_rate": 4.524739583333334e-06,
+ "loss": 1.0717,
+ "mean_token_accuracy": 0.7615471586585045,
+ "num_tokens": 5383911.0,
+ "step": 890
+ },
+ {
+ "entropy": 1.127107810974121,
+ "epoch": 1.1363636363636362,
+ "grad_norm": 0.6505516767501831,
+ "learning_rate": 4.459635416666668e-06,
+ "loss": 1.1027,
+ "mean_token_accuracy": 0.7562421515583992,
+ "num_tokens": 5445417.0,
+ "step": 900
+ },
+ {
+ "entropy": 1.129740473628044,
+ "epoch": 1.148989898989899,
+ "grad_norm": 0.6406158804893494,
+ "learning_rate": 4.3945312500000005e-06,
+ "loss": 1.0879,
+ "mean_token_accuracy": 0.7587148532271385,
+ "num_tokens": 5505455.0,
+ "step": 910
+ },
+ {
+ "entropy": 1.1167259424924851,
+ "epoch": 1.1616161616161615,
+ "grad_norm": 0.6297397613525391,
+ "learning_rate": 4.329427083333333e-06,
+ "loss": 1.0752,
+ "mean_token_accuracy": 0.7604142814874649,
+ "num_tokens": 5565311.0,
+ "step": 920
+ },
+ {
+ "entropy": 1.1037891641259194,
+ "epoch": 1.1742424242424243,
+ "grad_norm": 0.6490073204040527,
+ "learning_rate": 4.264322916666667e-06,
+ "loss": 1.0686,
+ "mean_token_accuracy": 0.7610052570700645,
+ "num_tokens": 5625358.0,
+ "step": 930
+ },
+ {
+ "entropy": 1.1061881184577942,
+ "epoch": 1.1868686868686869,
+ "grad_norm": 0.6366387009620667,
+ "learning_rate": 4.19921875e-06,
+ "loss": 1.0868,
+ "mean_token_accuracy": 0.7574937298893929,
+ "num_tokens": 5686421.0,
+ "step": 940
+ },
+ {
+ "entropy": 1.1124324068427085,
+ "epoch": 1.1994949494949494,
+ "grad_norm": 0.6556055545806885,
+ "learning_rate": 4.134114583333334e-06,
+ "loss": 1.0694,
+ "mean_token_accuracy": 0.7602224007248879,
+ "num_tokens": 5745891.0,
+ "step": 950
+ },
+ {
+ "entropy": 1.1175200879573821,
+ "epoch": 1.2121212121212122,
+ "grad_norm": 0.6404849886894226,
+ "learning_rate": 4.0690104166666675e-06,
+ "loss": 1.081,
+ "mean_token_accuracy": 0.7568994402885437,
+ "num_tokens": 5806078.0,
+ "step": 960
+ },
+ {
+ "entropy": 1.1186978340148925,
+ "epoch": 1.2247474747474747,
+ "grad_norm": 0.6227584481239319,
+ "learning_rate": 4.00390625e-06,
+ "loss": 1.0791,
+ "mean_token_accuracy": 0.759756401181221,
+ "num_tokens": 5866369.0,
+ "step": 970
+ },
+ {
+ "entropy": 1.122128139436245,
+ "epoch": 1.2373737373737375,
+ "grad_norm": 0.6616361141204834,
+ "learning_rate": 3.938802083333333e-06,
+ "loss": 1.0937,
+ "mean_token_accuracy": 0.7582718566060066,
+ "num_tokens": 5926217.0,
+ "step": 980
+ },
+ {
+ "entropy": 1.122861033678055,
+ "epoch": 1.25,
+ "grad_norm": 0.6384168267250061,
+ "learning_rate": 3.873697916666667e-06,
+ "loss": 1.0978,
+ "mean_token_accuracy": 0.7549964562058449,
+ "num_tokens": 5987666.0,
+ "step": 990
+ },
+ {
+ "entropy": 1.1277505576610565,
+ "epoch": 1.2626262626262625,
+ "grad_norm": 0.6038117408752441,
+ "learning_rate": 3.8085937500000002e-06,
+ "loss": 1.0952,
+ "mean_token_accuracy": 0.755272176861763,
+ "num_tokens": 6048708.0,
+ "step": 1000
+ },
1012
+ {
1013
+ "entropy": 1.1120157346129418,
1014
+ "epoch": 1.2752525252525253,
1015
+ "grad_norm": 0.6418159604072571,
1016
+ "learning_rate": 3.7434895833333336e-06,
1017
+ "loss": 1.078,
1018
+ "mean_token_accuracy": 0.7594122514128685,
1019
+ "num_tokens": 6109652.0,
1020
+ "step": 1010
1021
+ },
1022
+ {
1023
+ "entropy": 1.101425115764141,
1024
+ "epoch": 1.2878787878787878,
1025
+ "grad_norm": 0.6218425035476685,
1026
+ "learning_rate": 3.6783854166666673e-06,
1027
+ "loss": 1.0688,
1028
+ "mean_token_accuracy": 0.7604865297675133,
1029
+ "num_tokens": 6169125.0,
1030
+ "step": 1020
1031
+ },
1032
+ {
1033
+ "entropy": 1.1007713869214057,
1034
+ "epoch": 1.3005050505050506,
1035
+ "grad_norm": 0.6429149508476257,
1036
+ "learning_rate": 3.61328125e-06,
1037
+ "loss": 1.0581,
1038
+ "mean_token_accuracy": 0.7621071562170982,
1039
+ "num_tokens": 6230303.0,
1040
+ "step": 1030
1041
+ },
1042
+ {
1043
+ "entropy": 1.1094096556305886,
1044
+ "epoch": 1.3131313131313131,
1045
+ "grad_norm": 0.6489748358726501,
1046
+ "learning_rate": 3.5481770833333335e-06,
1047
+ "loss": 1.0715,
1048
+ "mean_token_accuracy": 0.7599423810839653,
1049
+ "num_tokens": 6291396.0,
1050
+ "step": 1040
1051
+ },
1052
+ {
1053
+ "entropy": 1.0827289715409278,
1054
+ "epoch": 1.3257575757575757,
1055
+ "grad_norm": 0.6485461592674255,
1056
+ "learning_rate": 3.483072916666667e-06,
1057
+ "loss": 1.0584,
1058
+ "mean_token_accuracy": 0.7630694910883904,
1059
+ "num_tokens": 6351579.0,
1060
+ "step": 1050
1061
+ },
1062
+ {
1063
+ "entropy": 1.114325873553753,
1064
+ "epoch": 1.3383838383838385,
1065
+ "grad_norm": 0.6261104941368103,
1066
+ "learning_rate": 3.41796875e-06,
1067
+ "loss": 1.0764,
1068
+ "mean_token_accuracy": 0.7585488513112069,
1069
+ "num_tokens": 6411662.0,
1070
+ "step": 1060
1071
+ },
1072
+ {
1073
+ "entropy": 1.1271554425358772,
1074
+ "epoch": 1.351010101010101,
1075
+ "grad_norm": 0.6522034406661987,
1076
+ "learning_rate": 3.3528645833333334e-06,
1077
+ "loss": 1.0902,
1078
+ "mean_token_accuracy": 0.7562535598874092,
1079
+ "num_tokens": 6473505.0,
1080
+ "step": 1070
1081
+ },
1082
+ {
1083
+ "entropy": 1.1013643085956573,
1084
+ "epoch": 1.3636363636363638,
1085
+ "grad_norm": 0.6176674962043762,
1086
+ "learning_rate": 3.287760416666667e-06,
1087
+ "loss": 1.065,
1088
+ "mean_token_accuracy": 0.763075165450573,
1089
+ "num_tokens": 6533580.0,
1090
+ "step": 1080
1091
+ },
1092
+ {
1093
+ "entropy": 1.098090337216854,
1094
+ "epoch": 1.3762626262626263,
1095
+ "grad_norm": 0.6540253758430481,
1096
+ "learning_rate": 3.2226562500000004e-06,
1097
+ "loss": 1.0596,
1098
+ "mean_token_accuracy": 0.7616770043969154,
1099
+ "num_tokens": 6593481.0,
1100
+ "step": 1090
1101
+ },
1102
+ {
1103
+ "entropy": 1.1176372200250626,
1104
+ "epoch": 1.3888888888888888,
1105
+ "grad_norm": 0.6754550933837891,
1106
+ "learning_rate": 3.1575520833333333e-06,
1107
+ "loss": 1.0861,
1108
+ "mean_token_accuracy": 0.7573029339313507,
1109
+ "num_tokens": 6653967.0,
1110
+ "step": 1100
1111
+ },
1112
+ {
1113
+ "entropy": 1.1040414482355119,
1114
+ "epoch": 1.4015151515151514,
1115
+ "grad_norm": 0.6022531986236572,
1116
+ "learning_rate": 3.092447916666667e-06,
1117
+ "loss": 1.0573,
1118
+ "mean_token_accuracy": 0.7612267225980759,
1119
+ "num_tokens": 6714685.0,
1120
+ "step": 1110
1121
+ },
1122
+ {
1123
+ "entropy": 1.0926991075277328,
1124
+ "epoch": 1.4141414141414141,
1125
+ "grad_norm": 0.6621010303497314,
1126
+ "learning_rate": 3.0273437500000003e-06,
1127
+ "loss": 1.0637,
1128
+ "mean_token_accuracy": 0.7612977519631385,
1129
+ "num_tokens": 6774659.0,
1130
+ "step": 1120
1131
+ },
1132
+ {
1133
+ "entropy": 1.1095534324645997,
1134
+ "epoch": 1.4267676767676767,
1135
+ "grad_norm": 0.62503981590271,
1136
+ "learning_rate": 2.962239583333333e-06,
1137
+ "loss": 1.0701,
1138
+ "mean_token_accuracy": 0.7618053883314133,
1139
+ "num_tokens": 6834579.0,
1140
+ "step": 1130
1141
+ },
1142
+ {
1143
+ "entropy": 1.117809349298477,
1144
+ "epoch": 1.4393939393939394,
1145
+ "grad_norm": 0.6527109742164612,
1146
+ "learning_rate": 2.897135416666667e-06,
1147
+ "loss": 1.0747,
1148
+ "mean_token_accuracy": 0.759482853114605,
1149
+ "num_tokens": 6894074.0,
1150
+ "step": 1140
1151
+ },
1152
+ {
1153
+ "entropy": 1.1005077749490737,
1154
+ "epoch": 1.452020202020202,
1155
+ "grad_norm": 0.6720954775810242,
1156
+ "learning_rate": 2.8320312500000002e-06,
1157
+ "loss": 1.0607,
1158
+ "mean_token_accuracy": 0.7621246844530105,
1159
+ "num_tokens": 6953870.0,
1160
+ "step": 1150
1161
+ },
1162
+ {
1163
+ "entropy": 1.1236482918262483,
1164
+ "epoch": 1.4646464646464645,
1165
+ "grad_norm": 0.658524215221405,
1166
+ "learning_rate": 2.7669270833333335e-06,
1167
+ "loss": 1.0884,
1168
+ "mean_token_accuracy": 0.7560836613178253,
1169
+ "num_tokens": 7014553.0,
1170
+ "step": 1160
1171
+ },
1172
+ {
1173
+ "entropy": 1.1116504594683647,
1174
+ "epoch": 1.4772727272727273,
1175
+ "grad_norm": 0.6261802911758423,
1176
+ "learning_rate": 2.7018229166666673e-06,
1177
+ "loss": 1.0659,
1178
+ "mean_token_accuracy": 0.7597616642713547,
1179
+ "num_tokens": 7076291.0,
1180
+ "step": 1170
1181
+ },
1182
+ {
1183
+ "entropy": 1.073892480134964,
1184
+ "epoch": 1.4898989898989898,
1185
+ "grad_norm": 0.6310375332832336,
1186
+ "learning_rate": 2.63671875e-06,
1187
+ "loss": 1.0524,
1188
+ "mean_token_accuracy": 0.7628733053803444,
1189
+ "num_tokens": 7137305.0,
1190
+ "step": 1180
1191
+ },
1192
+ {
1193
+ "entropy": 1.0975843235850333,
1194
+ "epoch": 1.5025252525252526,
1195
+ "grad_norm": 0.638482391834259,
1196
+ "learning_rate": 2.5716145833333334e-06,
1197
+ "loss": 1.0679,
1198
+ "mean_token_accuracy": 0.7603248566389084,
1199
+ "num_tokens": 7198239.0,
1200
+ "step": 1190
1201
+ },
1202
+ {
1203
+ "entropy": 1.0986508697271347,
1204
+ "epoch": 1.5151515151515151,
1205
+ "grad_norm": 0.640065610408783,
1206
+ "learning_rate": 2.506510416666667e-06,
1207
+ "loss": 1.0666,
1208
+ "mean_token_accuracy": 0.7622530281543731,
1209
+ "num_tokens": 7257847.0,
1210
+ "step": 1200
1211
+ }
1212
+ ],
1213
+ "logging_steps": 10,
1214
+ "max_steps": 1584,
1215
+ "num_input_tokens_seen": 0,
1216
+ "num_train_epochs": 2,
1217
+ "save_steps": 200,
1218
+ "stateful_callbacks": {
1219
+ "TrainerControl": {
1220
+ "args": {
1221
+ "should_epoch_stop": false,
1222
+ "should_evaluate": false,
1223
+ "should_log": false,
1224
+ "should_save": true,
1225
+ "should_training_stop": false
1226
+ },
1227
+ "attributes": {}
1228
+ }
1229
+ },
1230
+ "total_flos": 4.110405675434312e+17,
1231
+ "train_batch_size": 8,
1232
+ "trial_name": null,
1233
+ "trial_params": null
1234
+ }
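The `learning_rate` values logged above look consistent with a linear warmup followed by a linear decay to zero at `max_steps`. The sketch below tries to reproduce them. The peak rate (`1e-5`), the warmup length (48 steps), and the one-step offset between the logged step and the scheduler state are assumptions inferred from the logged numbers; they are not read from `training_args.bin`.

```python
# Sketch: linear warmup + linear decay, with parameters inferred from the log.
PEAK_LR = 1e-5       # assumed peak learning rate (not stated in this diff)
WARMUP_STEPS = 48    # assumed warmup length (not stated in this diff)
MAX_STEPS = 1584     # "max_steps" in trainer_state.json


def lr_at(logged_step: int) -> float:
    """Learning rate expected in the log entry for `logged_step`.

    Assumption: the logged value reflects the scheduler state one update
    earlier, hence the `logged_step - 1` offset.
    """
    s = logged_step - 1
    if s < WARMUP_STEPS:
        return PEAK_LR * s / WARMUP_STEPS
    return PEAK_LR * (MAX_STEPS - s) / (MAX_STEPS - WARMUP_STEPS)


# Values copied from the log_history entries above.
logged = {
    940: 4.19921875e-06,
    1000: 3.8085937500000002e-06,
    1200: 2.506510416666667e-06,
}

for step, lr in logged.items():
    # Print the reconstructed value and whether it matches the logged one.
    print(step, lr_at(step), abs(lr_at(step) - lr) < 1e-12)
```

If these assumptions hold, the same function also gives the rate expected at later logged steps, which can help when comparing checkpoints taken at different points of the run.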
checkpoint-1200/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8eeb71c14deb91ac5fd11522db45cb3275c9164415fcbefc9d00cac27a27f0a3
+ size 6417
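The three added lines are a Git LFS pointer, not the binary file itself: a spec version, a content hash, and the size in bytes of the real object. As a minimal sketch, such a pointer can be parsed with the standard library alone; the function name `parse_lfs_pointer` is hypothetical and only the three keys shown in this diff are handled.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content copied from checkpoint-1200/training_args.bin above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:8eeb71c14deb91ac5fd11522db45cb3275c9164415fcbefc9d00cac27a27f0a3
size 6417
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

The same function applies to the other pointer files in this commit (`adapter_model.safetensors`, `optimizer.pt`, and so on), since they share the three-line layout.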
checkpoint-1400/README.md ADDED
@@ -0,0 +1,209 @@
1
+ ---
2
+ base_model: CohereLabs/aya-expanse-8b
3
+ library_name: peft
4
+ pipeline_tag: text-generation
5
+ tags:
6
+ - base_model:adapter:CohereLabs/aya-expanse-8b
7
+ - lora
8
+ - sft
9
+ - transformers
10
+ - trl
11
+ ---
12
+
13
+ # Model Card for Model ID
14
+
15
+ <!-- Provide a quick summary of what the model is/does. -->
16
+
17
+
18
+
19
+ ## Model Details
20
+
21
+ ### Model Description
22
+
23
+ <!-- Provide a longer summary of what this model is. -->
24
+
25
+
26
+
27
+ - **Developed by:** [More Information Needed]
28
+ - **Funded by [optional]:** [More Information Needed]
29
+ - **Shared by [optional]:** [More Information Needed]
30
+ - **Model type:** [More Information Needed]
31
+ - **Language(s) (NLP):** [More Information Needed]
32
+ - **License:** [More Information Needed]
33
+ - **Finetuned from model [optional]:** [More Information Needed]
34
+
35
+ ### Model Sources [optional]
36
+
37
+ <!-- Provide the basic links for the model. -->
38
+
39
+ - **Repository:** [More Information Needed]
40
+ - **Paper [optional]:** [More Information Needed]
41
+ - **Demo [optional]:** [More Information Needed]
42
+
43
+ ## Uses
44
+
45
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
46
+
47
+ ### Direct Use
48
+
49
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
50
+
51
+ [More Information Needed]
52
+
53
+ ### Downstream Use [optional]
54
+
55
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
56
+
57
+ [More Information Needed]
58
+
59
+ ### Out-of-Scope Use
60
+
61
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
62
+
63
+ [More Information Needed]
64
+
65
+ ## Bias, Risks, and Limitations
66
+
67
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
68
+
69
+ [More Information Needed]
70
+
71
+ ### Recommendations
72
+
73
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
74
+
75
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
76
+
77
+ ## How to Get Started with the Model
78
+
79
+ Use the code below to get started with the model.
80
+
81
+ [More Information Needed]
82
+
83
+ ## Training Details
84
+
85
+ ### Training Data
86
+
87
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
88
+
89
+ [More Information Needed]
90
+
91
+ ### Training Procedure
92
+
93
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
94
+
95
+ #### Preprocessing [optional]
96
+
97
+ [More Information Needed]
98
+
99
+
100
+ #### Training Hyperparameters
101
+
102
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
103
+
104
+ #### Speeds, Sizes, Times [optional]
105
+
106
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
107
+
108
+ [More Information Needed]
109
+
110
+ ## Evaluation
111
+
112
+ <!-- This section describes the evaluation protocols and provides the results. -->
113
+
114
+ ### Testing Data, Factors & Metrics
115
+
116
+ #### Testing Data
117
+
118
+ <!-- This should link to a Dataset Card if possible. -->
119
+
120
+ [More Information Needed]
121
+
122
+ #### Factors
123
+
124
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
125
+
126
+ [More Information Needed]
127
+
128
+ #### Metrics
129
+
130
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
131
+
132
+ [More Information Needed]
133
+
134
+ ### Results
135
+
136
+ [More Information Needed]
137
+
138
+ #### Summary
139
+
140
+
141
+
142
+ ## Model Examination [optional]
143
+
144
+ <!-- Relevant interpretability work for the model goes here -->
145
+
146
+ [More Information Needed]
147
+
148
+ ## Environmental Impact
149
+
150
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
151
+
152
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
153
+
154
+ - **Hardware Type:** [More Information Needed]
155
+ - **Hours used:** [More Information Needed]
156
+ - **Cloud Provider:** [More Information Needed]
157
+ - **Compute Region:** [More Information Needed]
158
+ - **Carbon Emitted:** [More Information Needed]
159
+
160
+ ## Technical Specifications [optional]
161
+
162
+ ### Model Architecture and Objective
163
+
164
+ [More Information Needed]
165
+
166
+ ### Compute Infrastructure
167
+
168
+ [More Information Needed]
169
+
170
+ #### Hardware
171
+
172
+ [More Information Needed]
173
+
174
+ #### Software
175
+
176
+ [More Information Needed]
177
+
178
+ ## Citation [optional]
179
+
180
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
181
+
182
+ **BibTeX:**
183
+
184
+ [More Information Needed]
185
+
186
+ **APA:**
187
+
188
+ [More Information Needed]
189
+
190
+ ## Glossary [optional]
191
+
192
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
193
+
194
+ [More Information Needed]
195
+
196
+ ## More Information [optional]
197
+
198
+ [More Information Needed]
199
+
200
+ ## Model Card Authors [optional]
201
+
202
+ [More Information Needed]
203
+
204
+ ## Model Card Contact
205
+
206
+ [More Information Needed]
207
+ ### Framework versions
208
+
209
+ - PEFT 0.19.1
checkpoint-1400/adapter_config.json ADDED
@@ -0,0 +1,48 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "CohereLabs/aya-expanse-8b",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "lora_ga_config": null,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.19.1",
+   "qalora_group_size": 16,
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "k_proj",
+     "down_proj",
+     "q_proj",
+     "o_proj",
+     "gate_proj",
+     "v_proj",
+     "up_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_bdlora": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
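With `r` set to 16, `lora_alpha` set to 32, and `use_rslora` set to false, the low-rank update is scaled by `lora_alpha / r`. The sketch below works through that arithmetic and counts the parameters a rank-`r` adapter adds to one linear layer. The helper names are hypothetical, and the 4096 x 4096 layer shape is an illustrative assumption, not a value taken from this commit.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update B @ A.

    Standard LoRA uses lora_alpha / r; rank-stabilized LoRA uses
    lora_alpha / sqrt(r).
    """
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Values from adapter_config.json above.
scaling = lora_scaling(lora_alpha=32, r=16, use_rslora=False)
print(scaling)

# Hypothetical layer shape, for illustration only.
print(lora_param_count(in_features=4096, out_features=4096, r=16))
```

Because all seven projection types are listed in `target_modules`, each decoder layer carries seven such adapter pairs, one per projection.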
checkpoint-1400/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ee5846528f7fce0a4cf0270a4dc36986f041f0754c9eefb430c289323f104d4c
+ size 167832240
checkpoint-1400/chat_template.jinja ADDED
@@ -0,0 +1 @@
+ {{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Aya, a brilliant, sophisticated, multilingual AI-assistant trained to assist human users by providing thorough responses. You are able to interact and respond to questions in 23 languages and you are powered by a multilingual model built by Cohere For AI.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}
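The template above is dense, so here is a plain-Python sketch of the string it appears to produce. It mirrors the template's branches: an optional leading system message, strict user/assistant alternation, stripped user and assistant content, and an optional generation prompt. The function name `render_chat` is hypothetical, and the sketch is a reading aid rather than a replacement for the tokenizer's own template rendering.

```python
BOS = "<BOS_TOKEN>"
SOT = "<|START_OF_TURN_TOKEN|>"
EOT = "<|END_OF_TURN_TOKEN|>"
ROLE_TOKENS = {"user": "<|USER_TOKEN|>", "assistant": "<|CHATBOT_TOKEN|>"}


def render_chat(messages, add_generation_prompt=False):
    """Build the prompt string following the Jinja template's logic."""
    out = [BOS]
    # A system message is only used when it is the first message; the
    # default-system-prompt branch (`false == true`) never fires.
    if messages and messages[0]["role"] == "system":
        out.append(SOT + "<|SYSTEM_TOKEN|>" + messages[0]["content"] + EOT)
        messages = messages[1:]
    for i, message in enumerate(messages):
        # Even positions must be user turns, odd positions must not be.
        if (message["role"] == "user") != (i % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        token = ROLE_TOKENS.get(message["role"])
        if token is not None:
            out.append(SOT + token + message["content"].strip() + EOT)
    if add_generation_prompt:
        out.append(SOT + "<|CHATBOT_TOKEN|>")
    return "".join(out)


prompt = render_chat([{"role": "user", "content": " Hi "}], add_generation_prompt=True)
print(prompt)
```

Note that the system message is inserted without `strip()`, while user and assistant content is stripped, matching the template.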
checkpoint-1400/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74c2cee5bcd67acd6c6931a316bfc1a8a46e05c52a8664bd86dddda4f410978a
+ size 335929123
checkpoint-1400/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7dff23631b1d1432ccf216a29a64a0894fa08d99cdc2ae64b29a790d479eb958
+ size 14645
checkpoint-1400/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:026b9ae9780a61b08bc12971cc3b0b0cd0ad05bdb25e4c31d15359a55fdf2292
+ size 1465
checkpoint-1400/special_tokens_map.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "bos_token": {
+     "content": "<BOS_TOKEN>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|END_OF_TURN_TOKEN|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<PAD>"
+ }
checkpoint-1400/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:345ccf04a5257f473e331715ecc69365c5ac8fc2490923fe7155560af809ec1a
+ size 20124090
checkpoint-1400/tokenizer_config.json ADDED
@@ -0,0 +1,317 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": false,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<PAD>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<UNK>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "<CLS>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "3": {
31
+ "content": "<SEP>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "4": {
39
+ "content": "<MASK_TOKEN>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "5": {
47
+ "content": "<BOS_TOKEN>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "6": {
55
+ "content": "<EOS_TOKEN>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "7": {
63
+ "content": "<EOP_TOKEN>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "255000": {
71
+ "content": "<|START_OF_TURN_TOKEN|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": false
77
+ },
78
+ "255001": {
79
+ "content": "<|END_OF_TURN_TOKEN|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "255002": {
87
+ "content": "<|YES_TOKEN|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": false
93
+ },
94
+ "255003": {
95
+ "content": "<|NO_TOKEN|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": false
101
+ },
102
+ "255004": {
103
+ "content": "<|GOOD_TOKEN|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": false
109
+ },
110
+ "255005": {
111
+ "content": "<|BAD_TOKEN|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": false
117
+ },
118
+ "255006": {
119
+ "content": "<|USER_TOKEN|>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": false
125
+ },
126
+ "255007": {
127
+ "content": "<|CHATBOT_TOKEN|>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": false
133
+ },
134
+ "255008": {
135
+ "content": "<|SYSTEM_TOKEN|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": false
141
+ },
142
+ "255009": {
143
+ "content": "<|USER_0_TOKEN|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": false
149
+ },
150
+ "255010": {
151
+ "content": "<|USER_1_TOKEN|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": false
157
+ },
158
+ "255011": {
159
+ "content": "<|USER_2_TOKEN|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": false
165
+ },
166
+ "255012": {
167
+ "content": "<|USER_3_TOKEN|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": false
173
+ },
174
+ "255013": {
175
+ "content": "<|USER_4_TOKEN|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": false
181
+ },
182
+ "255014": {
183
+ "content": "<|USER_5_TOKEN|>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": false
189
+ },
190
+ "255015": {
191
+ "content": "<|USER_6_TOKEN|>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": false
197
+ },
198
+ "255016": {
199
+ "content": "<|USER_7_TOKEN|>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": false
205
+ },
206
+ "255017": {
207
+ "content": "<|USER_8_TOKEN|>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": false
213
+ },
214
+ "255018": {
215
+ "content": "<|USER_9_TOKEN|>",
216
+ "lstrip": false,
217
+ "normalized": false,
218
+ "rstrip": false,
219
+ "single_word": false,
220
+ "special": false
221
+ },
222
+ "255019": {
223
+ "content": "<|EXTRA_0_TOKEN|>",
224
+ "lstrip": false,
225
+ "normalized": false,
226
+ "rstrip": false,
227
+ "single_word": false,
228
+ "special": false
229
+ },
230
+ "255020": {
231
+ "content": "<|EXTRA_1_TOKEN|>",
232
+ "lstrip": false,
233
+ "normalized": false,
234
+ "rstrip": false,
235
+ "single_word": false,
236
+ "special": false
237
+ },
238
+ "255021": {
239
+ "content": "<|EXTRA_2_TOKEN|>",
240
+ "lstrip": false,
241
+ "normalized": false,
242
+ "rstrip": false,
243
+ "single_word": false,
244
+ "special": false
245
+ },
246
+ "255022": {
247
+ "content": "<|EXTRA_3_TOKEN|>",
248
+ "lstrip": false,
249
+ "normalized": false,
250
+ "rstrip": false,
251
+ "single_word": false,
252
+ "special": false
253
+ },
254
+ "255023": {
255
+ "content": "<|EXTRA_4_TOKEN|>",
256
+ "lstrip": false,
257
+ "normalized": false,
258
+ "rstrip": false,
259
+ "single_word": false,
260
+ "special": false
261
+ },
262
+ "255024": {
263
+ "content": "<|EXTRA_5_TOKEN|>",
264
+ "lstrip": false,
265
+ "normalized": false,
266
+ "rstrip": false,
267
+ "single_word": false,
268
+ "special": false
269
+ },
270
+ "255025": {
271
+ "content": "<|EXTRA_6_TOKEN|>",
272
+ "lstrip": false,
273
+ "normalized": false,
274
+ "rstrip": false,
275
+ "single_word": false,
276
+ "special": false
277
+ },
278
+ "255026": {
279
+ "content": "<|EXTRA_7_TOKEN|>",
280
+ "lstrip": false,
281
+ "normalized": false,
282
+ "rstrip": false,
283
+ "single_word": false,
284
+ "special": false
285
+ },
286
+ "255027": {
287
+ "content": "<|EXTRA_8_TOKEN|>",
288
+ "lstrip": false,
289
+ "normalized": false,
290
+ "rstrip": false,
291
+ "single_word": false,
292
+ "special": false
293
+ },
294
+ "255028": {
295
+ "content": "<|EXTRA_9_TOKEN|>",
296
+ "lstrip": false,
297
+ "normalized": false,
298
+ "rstrip": false,
299
+ "single_word": false,
300
+ "special": false
301
+ }
302
+ },
303
+ "bos_token": "<BOS_TOKEN>",
304
+ "clean_up_tokenization_spaces": false,
305
+ "eos_token": "<|END_OF_TURN_TOKEN|>",
306
+ "extra_special_tokens": {},
307
+ "legacy": true,
308
+ "merges_file": null,
309
+ "model_max_length": 1000000000000000019884624838656,
310
+ "pad_token": "<PAD>",
311
+ "sp_model_kwargs": {},
312
+ "spaces_between_special_tokens": false,
313
+ "tokenizer_class": "CohereTokenizer",
314
+ "unk_token": null,
315
+ "use_default_system_prompt": false,
316
+ "vocab_file": null
317
+ }
checkpoint-1400/trainer_state.json ADDED
@@ -0,0 +1,1434 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 1.7676767676767677,
6
+ "eval_steps": 200,
7
+ "global_step": 1400,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "entropy": 2.4040649354457857,
14
+ "epoch": 0.012626262626262626,
15
+ "grad_norm": 4.66565465927124,
16
+ "learning_rate": 1.8750000000000003e-06,
17
+ "loss": 3.6021,
18
+ "mean_token_accuracy": 0.4221017129719257,
19
+ "num_tokens": 61199.0,
20
+ "step": 10
21
+ },
22
+ {
23
+ "entropy": 2.3836746215820312,
24
+ "epoch": 0.025252525252525252,
25
+ "grad_norm": 3.8161869049072266,
26
+ "learning_rate": 3.958333333333333e-06,
27
+ "loss": 3.3432,
28
+ "mean_token_accuracy": 0.44042530804872515,
29
+ "num_tokens": 122423.0,
30
+ "step": 20
31
+ },
32
+ {
33
+ "entropy": 2.355724626779556,
34
+ "epoch": 0.03787878787878788,
35
+ "grad_norm": 3.8800699710845947,
36
+ "learning_rate": 6.041666666666667e-06,
37
+ "loss": 2.9033,
38
+ "mean_token_accuracy": 0.48426677361130716,
39
+ "num_tokens": 182649.0,
40
+ "step": 30
41
+ },
42
+ {
43
+ "entropy": 2.092331054806709,
44
+ "epoch": 0.050505050505050504,
45
+ "grad_norm": 2.8217720985412598,
46
+ "learning_rate": 8.125000000000001e-06,
47
+ "loss": 2.356,
48
+ "mean_token_accuracy": 0.5772452697157859,
49
+ "num_tokens": 243049.0,
50
+ "step": 40
51
+ },
52
+ {
53
+ "entropy": 1.6766322344541549,
54
+ "epoch": 0.06313131313131314,
55
+ "grad_norm": 1.4623568058013916,
56
+ "learning_rate": 9.993489583333334e-06,
57
+ "loss": 1.8899,
58
+ "mean_token_accuracy": 0.6480962842702865,
59
+ "num_tokens": 304326.0,
60
+ "step": 50
61
+ },
62
+ {
63
+ "entropy": 1.5568815559148788,
64
+ "epoch": 0.07575757575757576,
65
+ "grad_norm": 1.171562671661377,
66
+ "learning_rate": 9.928385416666668e-06,
67
+ "loss": 1.677,
68
+ "mean_token_accuracy": 0.6776855796575546,
69
+ "num_tokens": 364866.0,
70
+ "step": 60
71
+ },
72
+ {
73
+ "entropy": 1.48199902176857,
74
+ "epoch": 0.08838383838383838,
75
+ "grad_norm": 0.9904961585998535,
76
+ "learning_rate": 9.863281250000001e-06,
77
+ "loss": 1.5337,
78
+ "mean_token_accuracy": 0.697019773721695,
79
+ "num_tokens": 423478.0,
80
+ "step": 70
81
+ },
82
+ {
83
+ "entropy": 1.497376424074173,
84
+ "epoch": 0.10101010101010101,
85
+ "grad_norm": 0.9454260468482971,
86
+ "learning_rate": 9.798177083333335e-06,
87
+ "loss": 1.4953,
88
+ "mean_token_accuracy": 0.6976823702454567,
89
+ "num_tokens": 483659.0,
90
+ "step": 80
91
+ },
92
+ {
93
+ "entropy": 1.4664768785238267,
94
+ "epoch": 0.11363636363636363,
95
+ "grad_norm": 0.8955270648002625,
96
+ "learning_rate": 9.733072916666667e-06,
97
+ "loss": 1.4356,
98
+ "mean_token_accuracy": 0.7069446608424187,
99
+ "num_tokens": 544453.0,
100
+ "step": 90
101
+ },
102
+ {
103
+ "entropy": 1.4269085675477982,
104
+ "epoch": 0.12626262626262627,
105
+ "grad_norm": 0.9242203235626221,
106
+ "learning_rate": 9.66796875e-06,
107
+ "loss": 1.4106,
108
+ "mean_token_accuracy": 0.7122392952442169,
109
+ "num_tokens": 604546.0,
110
+ "step": 100
111
+ },
112
+ {
113
+ "entropy": 1.4060751020908355,
114
+ "epoch": 0.1388888888888889,
115
+ "grad_norm": 0.8968560695648193,
116
+ "learning_rate": 9.602864583333335e-06,
117
+ "loss": 1.3487,
118
+ "mean_token_accuracy": 0.7178552970290184,
119
+ "num_tokens": 664860.0,
120
+ "step": 110
121
+ },
122
+ {
123
+ "entropy": 1.4018951296806335,
124
+ "epoch": 0.15151515151515152,
125
+ "grad_norm": 0.9047113656997681,
126
+ "learning_rate": 9.537760416666667e-06,
127
+ "loss": 1.3347,
128
+ "mean_token_accuracy": 0.7208079636096955,
129
+ "num_tokens": 725022.0,
130
+ "step": 120
131
+ },
132
+ {
133
+ "entropy": 1.3809731483459473,
134
+ "epoch": 0.16414141414141414,
135
+ "grad_norm": 0.8915444016456604,
136
+ "learning_rate": 9.47265625e-06,
137
+ "loss": 1.3155,
138
+ "mean_token_accuracy": 0.7267520889639855,
139
+ "num_tokens": 785586.0,
140
+ "step": 130
141
+ },
142
+ {
143
+ "entropy": 1.3699676394462585,
144
+ "epoch": 0.17676767676767677,
145
+ "grad_norm": 0.8574295043945312,
146
+ "learning_rate": 9.407552083333334e-06,
147
+ "loss": 1.3016,
148
+ "mean_token_accuracy": 0.7266673430800438,
149
+ "num_tokens": 845790.0,
150
+ "step": 140
151
+ },
152
+ {
153
+ "entropy": 1.3425012439489366,
154
+ "epoch": 0.1893939393939394,
155
+ "grad_norm": 0.8231800198554993,
156
+ "learning_rate": 9.342447916666668e-06,
157
+ "loss": 1.2823,
158
+ "mean_token_accuracy": 0.7277117937803268,
159
+ "num_tokens": 905842.0,
160
+ "step": 150
161
+ },
162
+ {
163
+ "entropy": 1.3314216613769532,
164
+ "epoch": 0.20202020202020202,
165
+ "grad_norm": 0.8166369795799255,
166
+ "learning_rate": 9.277343750000001e-06,
167
+ "loss": 1.2917,
168
+ "mean_token_accuracy": 0.7278609350323677,
169
+ "num_tokens": 966487.0,
170
+ "step": 160
171
+ },
172
+ {
173
+ "entropy": 1.30389544069767,
174
+ "epoch": 0.21464646464646464,
175
+ "grad_norm": 0.7738587260246277,
176
+ "learning_rate": 9.212239583333335e-06,
177
+ "loss": 1.2481,
178
+ "mean_token_accuracy": 0.7335339426994324,
179
+ "num_tokens": 1025558.0,
180
+ "step": 170
181
+ },
182
+ {
183
+ "entropy": 1.311449444293976,
184
+ "epoch": 0.22727272727272727,
185
+ "grad_norm": 0.7718328833580017,
186
+ "learning_rate": 9.147135416666667e-06,
187
+ "loss": 1.2643,
188
+ "mean_token_accuracy": 0.7285080313682556,
189
+ "num_tokens": 1086677.0,
190
+ "step": 180
191
+ },
192
+ {
193
+ "entropy": 1.3079361200332642,
194
+ "epoch": 0.2398989898989899,
195
+ "grad_norm": 0.7341915369033813,
196
+ "learning_rate": 9.082031250000001e-06,
197
+ "loss": 1.2641,
198
+ "mean_token_accuracy": 0.729550538957119,
199
+ "num_tokens": 1147885.0,
200
+ "step": 190
201
+ },
202
+ {
203
+ "entropy": 1.2794219702482224,
204
+ "epoch": 0.25252525252525254,
205
+ "grad_norm": 0.748540997505188,
206
+ "learning_rate": 9.016927083333335e-06,
207
+ "loss": 1.2397,
208
+ "mean_token_accuracy": 0.7351120054721832,
209
+ "num_tokens": 1207321.0,
210
+ "step": 200
211
+ },
212
+ {
213
+ "entropy": 1.294454461336136,
214
+ "epoch": 0.26515151515151514,
215
+ "grad_norm": 0.7585553526878357,
216
+ "learning_rate": 8.951822916666667e-06,
217
+ "loss": 1.2489,
218
+ "mean_token_accuracy": 0.7322148531675339,
219
+ "num_tokens": 1267837.0,
220
+ "step": 210
221
+ },
222
+ {
223
+ "entropy": 1.2862246632575989,
224
+ "epoch": 0.2777777777777778,
225
+ "grad_norm": 0.6937864422798157,
226
+ "learning_rate": 8.88671875e-06,
227
+ "loss": 1.2383,
228
+ "mean_token_accuracy": 0.7358245223760604,
229
+ "num_tokens": 1328436.0,
230
+ "step": 220
231
+ },
232
+ {
233
+ "entropy": 1.2531811505556107,
234
+ "epoch": 0.2904040404040404,
235
+ "grad_norm": 0.6792387366294861,
236
+ "learning_rate": 8.821614583333334e-06,
237
+ "loss": 1.2007,
238
+ "mean_token_accuracy": 0.7360369265079498,
239
+ "num_tokens": 1389877.0,
240
+ "step": 230
241
+ },
242
+ {
243
+ "entropy": 1.2815489560365676,
244
+ "epoch": 0.30303030303030304,
245
+ "grad_norm": 0.6865427494049072,
246
+ "learning_rate": 8.756510416666666e-06,
247
+ "loss": 1.2474,
248
+ "mean_token_accuracy": 0.7304804190993309,
249
+ "num_tokens": 1450927.0,
250
+ "step": 240
251
+ },
252
+ {
253
+ "entropy": 1.2568059146404267,
254
+ "epoch": 0.31565656565656564,
255
+ "grad_norm": 0.669840395450592,
256
+ "learning_rate": 8.69140625e-06,
257
+ "loss": 1.2172,
258
+ "mean_token_accuracy": 0.7385278165340423,
259
+ "num_tokens": 1511174.0,
260
+ "step": 250
261
+ },
262
+ {
263
+ "entropy": 1.254868358373642,
264
+ "epoch": 0.3282828282828283,
265
+ "grad_norm": 0.6434893012046814,
266
+ "learning_rate": 8.626302083333334e-06,
267
+ "loss": 1.213,
268
+ "mean_token_accuracy": 0.7380544006824493,
269
+ "num_tokens": 1570570.0,
270
+ "step": 260
271
+ },
272
+ {
273
+ "entropy": 1.2532441645860672,
274
+ "epoch": 0.3409090909090909,
275
+ "grad_norm": 0.6034978032112122,
276
+ "learning_rate": 8.561197916666667e-06,
277
+ "loss": 1.2116,
278
+ "mean_token_accuracy": 0.7382062628865242,
279
+ "num_tokens": 1630930.0,
280
+ "step": 270
281
+ },
282
+ {
283
+ "entropy": 1.2605280816555022,
284
+ "epoch": 0.35353535353535354,
285
+ "grad_norm": 0.6371450424194336,
286
+ "learning_rate": 8.496093750000001e-06,
287
+ "loss": 1.2313,
288
+ "mean_token_accuracy": 0.7328852489590645,
289
+ "num_tokens": 1692624.0,
290
+ "step": 280
291
+ },
292
+ {
293
+ "entropy": 1.248606452345848,
294
+ "epoch": 0.3661616161616162,
295
+ "grad_norm": 0.6300278306007385,
296
+ "learning_rate": 8.430989583333335e-06,
297
+ "loss": 1.2195,
298
+ "mean_token_accuracy": 0.7370449885725975,
299
+ "num_tokens": 1754213.0,
300
+ "step": 290
301
+ },
302
+ {
303
+ "entropy": 1.2246102809906005,
304
+ "epoch": 0.3787878787878788,
305
+ "grad_norm": 0.6430155634880066,
306
+ "learning_rate": 8.365885416666667e-06,
307
+ "loss": 1.1845,
308
+ "mean_token_accuracy": 0.7439196646213532,
309
+ "num_tokens": 1813841.0,
310
+ "step": 300
311
+ },
312
+ {
313
+ "entropy": 1.2188815206289292,
314
+ "epoch": 0.39141414141414144,
315
+ "grad_norm": 0.6395701766014099,
316
+ "learning_rate": 8.30078125e-06,
317
+ "loss": 1.1856,
318
+ "mean_token_accuracy": 0.7422183871269226,
319
+ "num_tokens": 1873568.0,
320
+ "step": 310
321
+ },
322
+ {
323
+ "entropy": 1.234528934955597,
324
+ "epoch": 0.40404040404040403,
325
+ "grad_norm": 0.6168740391731262,
326
+ "learning_rate": 8.235677083333334e-06,
327
+ "loss": 1.1951,
328
+ "mean_token_accuracy": 0.7380509555339814,
329
+ "num_tokens": 1935518.0,
330
+ "step": 320
331
+ },
332
+ {
333
+ "entropy": 1.226365676522255,
334
+ "epoch": 0.4166666666666667,
335
+ "grad_norm": 0.611132800579071,
336
+ "learning_rate": 8.170572916666666e-06,
337
+ "loss": 1.2078,
338
+ "mean_token_accuracy": 0.739974245429039,
339
+ "num_tokens": 1995604.0,
340
+ "step": 330
341
+ },
342
+ {
343
+ "entropy": 1.2132183194160462,
344
+ "epoch": 0.4292929292929293,
345
+ "grad_norm": 0.6103131771087646,
346
+ "learning_rate": 8.10546875e-06,
347
+ "loss": 1.1642,
348
+ "mean_token_accuracy": 0.7444137379527092,
349
+ "num_tokens": 2056484.0,
350
+ "step": 340
351
+ },
352
+ {
353
+ "entropy": 1.223158246278763,
354
+ "epoch": 0.44191919191919193,
355
+ "grad_norm": 0.6188805103302002,
356
+ "learning_rate": 8.040364583333334e-06,
357
+ "loss": 1.2001,
358
+ "mean_token_accuracy": 0.7385613292455673,
359
+ "num_tokens": 2118437.0,
360
+ "step": 350
361
+ },
362
+ {
363
+ "entropy": 1.2154075980186463,
364
+ "epoch": 0.45454545454545453,
365
+ "grad_norm": 0.6238694190979004,
366
+ "learning_rate": 7.975260416666668e-06,
367
+ "loss": 1.1848,
368
+ "mean_token_accuracy": 0.7404153689742088,
369
+ "num_tokens": 2179333.0,
370
+ "step": 360
371
+ },
372
+ {
373
+ "entropy": 1.197928261756897,
374
+ "epoch": 0.4671717171717172,
375
+ "grad_norm": 0.6028566956520081,
376
+ "learning_rate": 7.910156250000001e-06,
377
+ "loss": 1.1597,
378
+ "mean_token_accuracy": 0.7475010469555855,
379
+ "num_tokens": 2239604.0,
380
+ "step": 370
381
+ },
382
+ {
383
+ "entropy": 1.189855706691742,
384
+ "epoch": 0.4797979797979798,
385
+ "grad_norm": 0.6569434404373169,
386
+ "learning_rate": 7.845052083333335e-06,
387
+ "loss": 1.1805,
388
+ "mean_token_accuracy": 0.7427218139171601,
389
+ "num_tokens": 2300936.0,
390
+ "step": 380
391
+ },
392
+ {
393
+ "entropy": 1.2076493889093398,
394
+ "epoch": 0.49242424242424243,
395
+ "grad_norm": 0.6351733207702637,
396
+ "learning_rate": 7.779947916666667e-06,
397
+ "loss": 1.1759,
398
+ "mean_token_accuracy": 0.7408407002687454,
399
+ "num_tokens": 2361539.0,
400
+ "step": 390
401
+ },
402
+ {
403
+ "entropy": 1.2223081022500992,
404
+ "epoch": 0.5050505050505051,
405
+ "grad_norm": 0.6327986121177673,
406
+ "learning_rate": 7.71484375e-06,
407
+ "loss": 1.1874,
408
+ "mean_token_accuracy": 0.7407250568270684,
409
+ "num_tokens": 2422162.0,
410
+ "step": 400
411
+ },
412
+ {
413
+ "entropy": 1.2013412863016129,
414
+ "epoch": 0.5176767676767676,
415
+ "grad_norm": 0.622104823589325,
416
+ "learning_rate": 7.649739583333334e-06,
417
+ "loss": 1.1726,
418
+ "mean_token_accuracy": 0.744242025911808,
419
+ "num_tokens": 2483447.0,
420
+ "step": 410
421
+ },
422
+ {
423
+ "entropy": 1.2116613179445266,
424
+ "epoch": 0.5303030303030303,
425
+ "grad_norm": 0.637651264667511,
426
+ "learning_rate": 7.5846354166666665e-06,
427
+ "loss": 1.1838,
428
+ "mean_token_accuracy": 0.7405030101537704,
429
+ "num_tokens": 2544848.0,
430
+ "step": 420
431
+ },
432
+ {
433
+ "entropy": 1.2024697184562683,
434
+ "epoch": 0.5429292929292929,
435
+ "grad_norm": 0.6252374649047852,
436
+ "learning_rate": 7.51953125e-06,
437
+ "loss": 1.1681,
438
+ "mean_token_accuracy": 0.7458183988928795,
439
+ "num_tokens": 2605232.0,
440
+ "step": 430
441
+ },
442
+ {
443
+ "entropy": 1.1797083109617232,
444
+ "epoch": 0.5555555555555556,
445
+ "grad_norm": 0.6502755284309387,
446
+ "learning_rate": 7.454427083333334e-06,
447
+ "loss": 1.1452,
448
+ "mean_token_accuracy": 0.7477341219782829,
449
+ "num_tokens": 2664276.0,
450
+ "step": 440
451
+ },
452
+ {
453
+ "entropy": 1.1964200481772422,
454
+ "epoch": 0.5681818181818182,
455
+ "grad_norm": 0.639979362487793,
456
+ "learning_rate": 7.389322916666667e-06,
457
+ "loss": 1.1665,
458
+ "mean_token_accuracy": 0.7431837096810341,
459
+ "num_tokens": 2724073.0,
460
+ "step": 450
461
+ },
462
+ {
463
+ "entropy": 1.1795272737741471,
464
+ "epoch": 0.5808080808080808,
465
+ "grad_norm": 0.6212354302406311,
466
+ "learning_rate": 7.3242187500000006e-06,
467
+ "loss": 1.1529,
468
+ "mean_token_accuracy": 0.7479098170995713,
469
+ "num_tokens": 2784262.0,
470
+ "step": 460
471
+ },
472
+ {
473
+ "entropy": 1.1908618807792664,
474
+ "epoch": 0.5934343434343434,
475
+ "grad_norm": 0.6528693437576294,
476
+ "learning_rate": 7.259114583333334e-06,
477
+ "loss": 1.1678,
478
+ "mean_token_accuracy": 0.745174677670002,
479
+ "num_tokens": 2843804.0,
480
+ "step": 470
481
+ },
482
+ {
483
+ "entropy": 1.1862946093082427,
484
+ "epoch": 0.6060606060606061,
485
+ "grad_norm": 0.639481246471405,
486
+ "learning_rate": 7.194010416666667e-06,
487
+ "loss": 1.1565,
488
+ "mean_token_accuracy": 0.7461866185069084,
489
+ "num_tokens": 2903408.0,
490
+ "step": 480
491
+ },
492
+ {
493
+ "entropy": 1.151743534207344,
494
+ "epoch": 0.6186868686868687,
495
+ "grad_norm": 0.6332777142524719,
496
+ "learning_rate": 7.128906250000001e-06,
497
+ "loss": 1.1251,
498
+ "mean_token_accuracy": 0.7535199671983719,
499
+ "num_tokens": 2963401.0,
500
+ "step": 490
501
+ },
502
+ {
503
+ "entropy": 1.1778477430343628,
504
+ "epoch": 0.6313131313131313,
505
+ "grad_norm": 0.5991836190223694,
506
+ "learning_rate": 7.063802083333335e-06,
507
+ "loss": 1.1407,
508
+ "mean_token_accuracy": 0.7491110354661942,
509
+ "num_tokens": 3023530.0,
510
+ "step": 500
511
+ },
512
+ {
513
+ "entropy": 1.2023035794496537,
514
+ "epoch": 0.6439393939393939,
515
+ "grad_norm": 0.6293458938598633,
516
+ "learning_rate": 6.998697916666667e-06,
517
+ "loss": 1.1724,
518
+ "mean_token_accuracy": 0.7405256554484367,
519
+ "num_tokens": 3085225.0,
520
+ "step": 510
521
+ },
522
+ {
523
+ "entropy": 1.1997334092855454,
524
+ "epoch": 0.6565656565656566,
525
+ "grad_norm": 0.6213802695274353,
526
+ "learning_rate": 6.93359375e-06,
527
+ "loss": 1.1604,
528
+ "mean_token_accuracy": 0.7435309410095214,
529
+ "num_tokens": 3145749.0,
530
+ "step": 520
531
+ },
532
+ {
533
+ "entropy": 1.1643184214830398,
534
+ "epoch": 0.6691919191919192,
535
+ "grad_norm": 0.6495156288146973,
536
+ "learning_rate": 6.868489583333334e-06,
537
+ "loss": 1.1381,
538
+ "mean_token_accuracy": 0.7496751576662064,
539
+ "num_tokens": 3205761.0,
540
+ "step": 530
541
+ },
542
+ {
543
+ "entropy": 1.1837532848119736,
544
+ "epoch": 0.6818181818181818,
545
+ "grad_norm": 0.6004510521888733,
546
+ "learning_rate": 6.803385416666667e-06,
547
+ "loss": 1.1563,
548
+ "mean_token_accuracy": 0.7447851061820984,
549
+ "num_tokens": 3267024.0,
550
+ "step": 540
551
+ },
552
+ {
553
+ "entropy": 1.201677542924881,
554
+ "epoch": 0.6944444444444444,
555
+ "grad_norm": 0.607467532157898,
556
+ "learning_rate": 6.738281250000001e-06,
557
+ "loss": 1.1776,
558
+ "mean_token_accuracy": 0.7406648993492126,
559
+ "num_tokens": 3329070.0,
560
+ "step": 550
561
+ },
562
+ {
563
+ "entropy": 1.1659786373376846,
564
+ "epoch": 0.7070707070707071,
565
+ "grad_norm": 0.6079947352409363,
566
+ "learning_rate": 6.6731770833333345e-06,
567
+ "loss": 1.1298,
568
+ "mean_token_accuracy": 0.7505511298775673,
569
+ "num_tokens": 3389505.0,
570
+ "step": 560
571
+ },
572
+ {
573
+ "entropy": 1.1714913487434386,
574
+ "epoch": 0.7196969696969697,
575
+ "grad_norm": 0.6534572839736938,
576
+ "learning_rate": 6.6080729166666665e-06,
577
+ "loss": 1.144,
578
+ "mean_token_accuracy": 0.7481625184416771,
579
+ "num_tokens": 3449834.0,
580
+ "step": 570
581
+ },
582
+ {
583
+ "entropy": 1.1694963037967683,
584
+ "epoch": 0.7323232323232324,
585
+ "grad_norm": 0.5903164744377136,
586
+ "learning_rate": 6.54296875e-06,
587
+ "loss": 1.139,
588
+ "mean_token_accuracy": 0.7476441130042076,
589
+ "num_tokens": 3510701.0,
590
+ "step": 580
591
+ },
592
+ {
593
+ "entropy": 1.1777653217315673,
594
+ "epoch": 0.7449494949494949,
595
+ "grad_norm": 0.6284182071685791,
596
+ "learning_rate": 6.477864583333334e-06,
597
+ "loss": 1.1422,
598
+ "mean_token_accuracy": 0.7490358456969262,
599
+ "num_tokens": 3570992.0,
600
+ "step": 590
601
+ },
602
+ {
603
+ "entropy": 1.1588268011808396,
604
+ "epoch": 0.7575757575757576,
605
+ "grad_norm": 0.6250146627426147,
606
+ "learning_rate": 6.412760416666667e-06,
607
+ "loss": 1.1336,
608
+ "mean_token_accuracy": 0.7504108369350433,
609
+ "num_tokens": 3631087.0,
610
+ "step": 600
611
+ },
612
+ {
613
+ "entropy": 1.1890273630619048,
614
+ "epoch": 0.7702020202020202,
615
+ "grad_norm": 0.6420578956604004,
616
+ "learning_rate": 6.3476562500000006e-06,
617
+ "loss": 1.1534,
618
+ "mean_token_accuracy": 0.7442372158169747,
619
+ "num_tokens": 3692871.0,
620
+ "step": 610
621
+ },
622
+ {
623
+ "entropy": 1.1817798465490341,
624
+ "epoch": 0.7828282828282829,
625
+ "grad_norm": 0.6156490445137024,
626
+ "learning_rate": 6.282552083333334e-06,
627
+ "loss": 1.1477,
628
+ "mean_token_accuracy": 0.7468263059854507,
629
+ "num_tokens": 3753671.0,
630
+ "step": 620
631
+ },
632
+ {
633
+ "entropy": 1.1686612635850906,
634
+ "epoch": 0.7954545454545454,
635
+ "grad_norm": 0.6248748898506165,
636
+ "learning_rate": 6.217447916666667e-06,
637
+ "loss": 1.139,
638
+ "mean_token_accuracy": 0.7478924334049225,
639
+ "num_tokens": 3813110.0,
640
+ "step": 630
641
+ },
642
+ {
643
+ "entropy": 1.1387953266501427,
644
+ "epoch": 0.8080808080808081,
645
+ "grad_norm": 0.6052266359329224,
646
+ "learning_rate": 6.152343750000001e-06,
647
+ "loss": 1.1118,
648
+ "mean_token_accuracy": 0.7538487210869789,
649
+ "num_tokens": 3873477.0,
650
+ "step": 640
651
+ },
652
+ {
653
+ "entropy": 1.1343814879655838,
654
+ "epoch": 0.8207070707070707,
655
+ "grad_norm": 0.6769536137580872,
656
+ "learning_rate": 6.087239583333335e-06,
657
+ "loss": 1.1108,
658
+ "mean_token_accuracy": 0.7548464313149452,
659
+ "num_tokens": 3932873.0,
660
+ "step": 650
661
+ },
662
+ {
663
+ "entropy": 1.1686519652605056,
664
+ "epoch": 0.8333333333333334,
665
+ "grad_norm": 0.6545736789703369,
666
+ "learning_rate": 6.022135416666667e-06,
667
+ "loss": 1.134,
668
+ "mean_token_accuracy": 0.7495195478200912,
669
+ "num_tokens": 3992754.0,
670
+ "step": 660
671
+ },
672
+ {
673
+ "entropy": 1.1596283346414566,
674
+ "epoch": 0.8459595959595959,
675
+ "grad_norm": 0.6192017793655396,
676
+ "learning_rate": 5.95703125e-06,
677
+ "loss": 1.1213,
678
+ "mean_token_accuracy": 0.7510297149419785,
679
+ "num_tokens": 4053540.0,
680
+ "step": 670
681
+ },
682
+ {
683
+ "entropy": 1.1484344542026519,
684
+ "epoch": 0.8585858585858586,
685
+ "grad_norm": 0.6520631909370422,
686
+ "learning_rate": 5.891927083333334e-06,
687
+ "loss": 1.1181,
688
+ "mean_token_accuracy": 0.751795919239521,
689
+ "num_tokens": 4113582.0,
690
+ "step": 680
691
+ },
692
+ {
693
+ "entropy": 1.1519750133156776,
694
+ "epoch": 0.8712121212121212,
695
+ "grad_norm": 0.6247655153274536,
696
+ "learning_rate": 5.826822916666667e-06,
697
+ "loss": 1.1204,
698
+ "mean_token_accuracy": 0.7506368085741997,
699
+ "num_tokens": 4174983.0,
700
+ "step": 690
701
+ },
702
+ {
703
+ "entropy": 1.1486561581492425,
704
+ "epoch": 0.8838383838383839,
705
+ "grad_norm": 0.620272159576416,
706
+ "learning_rate": 5.761718750000001e-06,
707
+ "loss": 1.1191,
708
+ "mean_token_accuracy": 0.7536391675472259,
709
+ "num_tokens": 4234465.0,
710
+ "step": 700
711
+ },
712
+ {
713
+ "entropy": 1.1504923462867738,
714
+ "epoch": 0.8964646464646465,
715
+ "grad_norm": 0.6308649182319641,
716
+ "learning_rate": 5.6966145833333344e-06,
717
+ "loss": 1.1224,
718
+ "mean_token_accuracy": 0.750646598637104,
719
+ "num_tokens": 4295955.0,
720
+ "step": 710
721
+ },
722
+ {
723
+ "entropy": 1.1561898440122604,
724
+ "epoch": 0.9090909090909091,
725
+ "grad_norm": 0.6629899740219116,
726
+ "learning_rate": 5.6315104166666665e-06,
727
+ "loss": 1.1238,
728
+ "mean_token_accuracy": 0.7507557719945908,
729
+ "num_tokens": 4357171.0,
730
+ "step": 720
731
+ },
732
+ {
733
+ "entropy": 1.1413449853658677,
734
+ "epoch": 0.9217171717171717,
735
+ "grad_norm": 0.5972346067428589,
736
+ "learning_rate": 5.56640625e-06,
737
+ "loss": 1.1038,
738
+ "mean_token_accuracy": 0.7554104939103127,
739
+ "num_tokens": 4417636.0,
740
+ "step": 730
741
+ },
742
+ {
743
+ "entropy": 1.1356119453907012,
744
+ "epoch": 0.9343434343434344,
745
+ "grad_norm": 0.6356479525566101,
746
+ "learning_rate": 5.501302083333334e-06,
747
+ "loss": 1.1005,
748
+ "mean_token_accuracy": 0.7565629109740257,
749
+ "num_tokens": 4477294.0,
750
+ "step": 740
751
+ },
752
+ {
753
+ "entropy": 1.1633600294589996,
754
+ "epoch": 0.946969696969697,
755
+ "grad_norm": 0.6416464447975159,
756
+ "learning_rate": 5.436197916666667e-06,
757
+ "loss": 1.1225,
758
+ "mean_token_accuracy": 0.7515855401754379,
759
+ "num_tokens": 4537503.0,
760
+ "step": 750
761
+ },
762
+ {
763
+ "entropy": 1.1527763932943345,
764
+ "epoch": 0.9595959595959596,
765
+ "grad_norm": 0.6126084327697754,
766
+ "learning_rate": 5.3710937500000005e-06,
767
+ "loss": 1.1184,
768
+ "mean_token_accuracy": 0.7526160582900048,
769
+ "num_tokens": 4598778.0,
770
+ "step": 760
771
+ },
772
+ {
773
+ "entropy": 1.1397768080234527,
774
+ "epoch": 0.9722222222222222,
775
+ "grad_norm": 0.6359922289848328,
776
+ "learning_rate": 5.305989583333334e-06,
777
+ "loss": 1.1144,
778
+ "mean_token_accuracy": 0.7548302739858628,
779
+ "num_tokens": 4658978.0,
780
+ "step": 770
781
+ },
782
+ {
783
+ "entropy": 1.1569962561130525,
784
+ "epoch": 0.9848484848484849,
785
+ "grad_norm": 0.6260409951210022,
786
+ "learning_rate": 5.240885416666667e-06,
787
+ "loss": 1.1213,
788
+ "mean_token_accuracy": 0.7512885302305221,
789
+ "num_tokens": 4720500.0,
790
+ "step": 780
791
+ },
792
+ {
793
+ "entropy": 1.1509152203798294,
794
+ "epoch": 0.9974747474747475,
795
+ "grad_norm": 0.6293452978134155,
796
+ "learning_rate": 5.17578125e-06,
797
+ "loss": 1.1227,
798
+ "mean_token_accuracy": 0.7519564241170883,
799
+ "num_tokens": 4781612.0,
800
+ "step": 790
801
+ },
802
+ {
803
+ "entropy": 1.1399024561047555,
804
+ "epoch": 1.0101010101010102,
805
+ "grad_norm": 0.664761483669281,
806
+ "learning_rate": 5.110677083333334e-06,
807
+ "loss": 1.1034,
808
+ "mean_token_accuracy": 0.7526706486940384,
809
+ "num_tokens": 4841359.0,
810
+ "step": 800
811
+ },
812
+ {
813
+ "entropy": 1.120214229822159,
814
+ "epoch": 1.0227272727272727,
815
+ "grad_norm": 0.5934865474700928,
816
+ "learning_rate": 5.045572916666667e-06,
817
+ "loss": 1.0857,
818
+ "mean_token_accuracy": 0.7593594208359719,
819
+ "num_tokens": 4901016.0,
820
+ "step": 810
821
+ },
822
+ {
823
+ "entropy": 1.1430341199040412,
824
+ "epoch": 1.0353535353535352,
825
+ "grad_norm": 0.6040735840797424,
826
+ "learning_rate": 4.98046875e-06,
827
+ "loss": 1.1165,
828
+ "mean_token_accuracy": 0.7525988414883613,
829
+ "num_tokens": 4961646.0,
830
+ "step": 820
831
+ },
832
+ {
833
+ "entropy": 1.1245349109172822,
834
+ "epoch": 1.047979797979798,
835
+ "grad_norm": 0.6277610063552856,
836
+ "learning_rate": 4.915364583333333e-06,
837
+ "loss": 1.0851,
838
+ "mean_token_accuracy": 0.7577921718358993,
839
+ "num_tokens": 5022365.0,
840
+ "step": 830
841
+ },
842
+ {
843
+ "entropy": 1.1204675793647767,
844
+ "epoch": 1.0606060606060606,
845
+ "grad_norm": 0.6260582804679871,
846
+ "learning_rate": 4.850260416666667e-06,
847
+ "loss": 1.0813,
848
+ "mean_token_accuracy": 0.7580071151256561,
849
+ "num_tokens": 5081972.0,
850
+ "step": 840
851
+ },
852
+ {
853
+ "entropy": 1.1222782507538795,
854
+ "epoch": 1.0732323232323233,
855
+ "grad_norm": 0.6023226976394653,
856
+ "learning_rate": 4.785156250000001e-06,
857
+ "loss": 1.0922,
858
+ "mean_token_accuracy": 0.7566258609294891,
859
+ "num_tokens": 5142184.0,
860
+ "step": 850
861
+ },
862
+ {
863
+ "entropy": 1.1227335944771766,
864
+ "epoch": 1.0858585858585859,
865
+ "grad_norm": 0.6206791996955872,
866
+ "learning_rate": 4.7200520833333336e-06,
867
+ "loss": 1.0994,
868
+ "mean_token_accuracy": 0.7540625646710396,
869
+ "num_tokens": 5203020.0,
870
+ "step": 860
871
+ },
872
+ {
873
+ "entropy": 1.1352888554334641,
874
+ "epoch": 1.0984848484848484,
875
+ "grad_norm": 0.6301055550575256,
876
+ "learning_rate": 4.654947916666667e-06,
877
+ "loss": 1.0958,
878
+ "mean_token_accuracy": 0.7562039017677307,
879
+ "num_tokens": 5263291.0,
880
+ "step": 870
881
+ },
882
+ {
883
+ "entropy": 1.1120088309049607,
884
+ "epoch": 1.1111111111111112,
885
+ "grad_norm": 0.6210020780563354,
886
+ "learning_rate": 4.58984375e-06,
887
+ "loss": 1.0793,
888
+ "mean_token_accuracy": 0.7590557768940925,
889
+ "num_tokens": 5323972.0,
890
+ "step": 880
891
+ },
892
+ {
893
+ "entropy": 1.0996666207909584,
894
+ "epoch": 1.1237373737373737,
895
+ "grad_norm": 0.6332690715789795,
896
+ "learning_rate": 4.524739583333334e-06,
897
+ "loss": 1.0717,
898
+ "mean_token_accuracy": 0.7615471586585045,
899
+ "num_tokens": 5383911.0,
900
+ "step": 890
901
+ },
902
+ {
903
+ "entropy": 1.127107810974121,
904
+ "epoch": 1.1363636363636362,
905
+ "grad_norm": 0.6505516767501831,
906
+ "learning_rate": 4.459635416666668e-06,
907
+ "loss": 1.1027,
908
+ "mean_token_accuracy": 0.7562421515583992,
909
+ "num_tokens": 5445417.0,
910
+ "step": 900
911
+ },
912
+ {
913
+ "entropy": 1.129740473628044,
914
+ "epoch": 1.148989898989899,
915
+ "grad_norm": 0.6406158804893494,
916
+ "learning_rate": 4.3945312500000005e-06,
917
+ "loss": 1.0879,
918
+ "mean_token_accuracy": 0.7587148532271385,
919
+ "num_tokens": 5505455.0,
920
+ "step": 910
921
+ },
922
+ {
923
+ "entropy": 1.1167259424924851,
924
+ "epoch": 1.1616161616161615,
925
+ "grad_norm": 0.6297397613525391,
926
+ "learning_rate": 4.329427083333333e-06,
927
+ "loss": 1.0752,
928
+ "mean_token_accuracy": 0.7604142814874649,
929
+ "num_tokens": 5565311.0,
930
+ "step": 920
931
+ },
932
+ {
933
+ "entropy": 1.1037891641259194,
934
+ "epoch": 1.1742424242424243,
935
+ "grad_norm": 0.6490073204040527,
936
+ "learning_rate": 4.264322916666667e-06,
937
+ "loss": 1.0686,
938
+ "mean_token_accuracy": 0.7610052570700645,
939
+ "num_tokens": 5625358.0,
940
+ "step": 930
941
+ },
942
+ {
943
+ "entropy": 1.1061881184577942,
944
+ "epoch": 1.1868686868686869,
945
+ "grad_norm": 0.6366387009620667,
946
+ "learning_rate": 4.19921875e-06,
947
+ "loss": 1.0868,
948
+ "mean_token_accuracy": 0.7574937298893929,
949
+ "num_tokens": 5686421.0,
950
+ "step": 940
951
+ },
952
+ {
953
+ "entropy": 1.1124324068427085,
954
+ "epoch": 1.1994949494949494,
955
+ "grad_norm": 0.6556055545806885,
956
+ "learning_rate": 4.134114583333334e-06,
957
+ "loss": 1.0694,
958
+ "mean_token_accuracy": 0.7602224007248879,
959
+ "num_tokens": 5745891.0,
960
+ "step": 950
961
+ },
962
+ {
963
+ "entropy": 1.1175200879573821,
964
+ "epoch": 1.2121212121212122,
965
+ "grad_norm": 0.6404849886894226,
966
+ "learning_rate": 4.0690104166666675e-06,
967
+ "loss": 1.081,
968
+ "mean_token_accuracy": 0.7568994402885437,
969
+ "num_tokens": 5806078.0,
970
+ "step": 960
971
+ },
972
+ {
973
+ "entropy": 1.1186978340148925,
974
+ "epoch": 1.2247474747474747,
975
+ "grad_norm": 0.6227584481239319,
976
+ "learning_rate": 4.00390625e-06,
977
+ "loss": 1.0791,
978
+ "mean_token_accuracy": 0.759756401181221,
979
+ "num_tokens": 5866369.0,
980
+ "step": 970
981
+ },
982
+ {
983
+ "entropy": 1.122128139436245,
984
+ "epoch": 1.2373737373737375,
985
+ "grad_norm": 0.6616361141204834,
986
+ "learning_rate": 3.938802083333333e-06,
987
+ "loss": 1.0937,
988
+ "mean_token_accuracy": 0.7582718566060066,
989
+ "num_tokens": 5926217.0,
990
+ "step": 980
991
+ },
992
+ {
993
+ "entropy": 1.122861033678055,
994
+ "epoch": 1.25,
995
+ "grad_norm": 0.6384168267250061,
996
+ "learning_rate": 3.873697916666667e-06,
997
+ "loss": 1.0978,
998
+ "mean_token_accuracy": 0.7549964562058449,
999
+ "num_tokens": 5987666.0,
1000
+ "step": 990
1001
+ },
1002
+ {
1003
+ "entropy": 1.1277505576610565,
1004
+ "epoch": 1.2626262626262625,
1005
+ "grad_norm": 0.6038117408752441,
1006
+ "learning_rate": 3.8085937500000002e-06,
1007
+ "loss": 1.0952,
1008
+ "mean_token_accuracy": 0.755272176861763,
1009
+ "num_tokens": 6048708.0,
1010
+ "step": 1000
1011
+ },
1012
+ {
1013
+ "entropy": 1.1120157346129418,
1014
+ "epoch": 1.2752525252525253,
1015
+ "grad_norm": 0.6418159604072571,
1016
+ "learning_rate": 3.7434895833333336e-06,
1017
+ "loss": 1.078,
1018
+ "mean_token_accuracy": 0.7594122514128685,
1019
+ "num_tokens": 6109652.0,
1020
+ "step": 1010
1021
+ },
1022
+ {
1023
+ "entropy": 1.101425115764141,
1024
+ "epoch": 1.2878787878787878,
1025
+ "grad_norm": 0.6218425035476685,
1026
+ "learning_rate": 3.6783854166666673e-06,
1027
+ "loss": 1.0688,
1028
+ "mean_token_accuracy": 0.7604865297675133,
1029
+ "num_tokens": 6169125.0,
1030
+ "step": 1020
1031
+ },
1032
+ {
1033
+ "entropy": 1.1007713869214057,
1034
+ "epoch": 1.3005050505050506,
1035
+ "grad_norm": 0.6429149508476257,
1036
+ "learning_rate": 3.61328125e-06,
1037
+ "loss": 1.0581,
1038
+ "mean_token_accuracy": 0.7621071562170982,
1039
+ "num_tokens": 6230303.0,
1040
+ "step": 1030
1041
+ },
1042
+ {
1043
+ "entropy": 1.1094096556305886,
1044
+ "epoch": 1.3131313131313131,
1045
+ "grad_norm": 0.6489748358726501,
1046
+ "learning_rate": 3.5481770833333335e-06,
1047
+ "loss": 1.0715,
1048
+ "mean_token_accuracy": 0.7599423810839653,
1049
+ "num_tokens": 6291396.0,
1050
+ "step": 1040
1051
+ },
1052
+ {
1053
+ "entropy": 1.0827289715409278,
1054
+ "epoch": 1.3257575757575757,
1055
+ "grad_norm": 0.6485461592674255,
1056
+ "learning_rate": 3.483072916666667e-06,
1057
+ "loss": 1.0584,
1058
+ "mean_token_accuracy": 0.7630694910883904,
1059
+ "num_tokens": 6351579.0,
1060
+ "step": 1050
1061
+ },
1062
+ {
1063
+ "entropy": 1.114325873553753,
1064
+ "epoch": 1.3383838383838385,
1065
+ "grad_norm": 0.6261104941368103,
1066
+ "learning_rate": 3.41796875e-06,
1067
+ "loss": 1.0764,
1068
+ "mean_token_accuracy": 0.7585488513112069,
1069
+ "num_tokens": 6411662.0,
1070
+ "step": 1060
1071
+ },
1072
+ {
1073
+ "entropy": 1.1271554425358772,
1074
+ "epoch": 1.351010101010101,
1075
+ "grad_norm": 0.6522034406661987,
1076
+ "learning_rate": 3.3528645833333334e-06,
1077
+ "loss": 1.0902,
1078
+ "mean_token_accuracy": 0.7562535598874092,
1079
+ "num_tokens": 6473505.0,
1080
+ "step": 1070
1081
+ },
1082
+ {
1083
+ "entropy": 1.1013643085956573,
1084
+ "epoch": 1.3636363636363638,
1085
+ "grad_norm": 0.6176674962043762,
1086
+ "learning_rate": 3.287760416666667e-06,
1087
+ "loss": 1.065,
1088
+ "mean_token_accuracy": 0.763075165450573,
1089
+ "num_tokens": 6533580.0,
1090
+ "step": 1080
1091
+ },
1092
+ {
1093
+ "entropy": 1.098090337216854,
1094
+ "epoch": 1.3762626262626263,
1095
+ "grad_norm": 0.6540253758430481,
1096
+ "learning_rate": 3.2226562500000004e-06,
1097
+ "loss": 1.0596,
1098
+ "mean_token_accuracy": 0.7616770043969154,
1099
+ "num_tokens": 6593481.0,
1100
+ "step": 1090
1101
+ },
1102
+ {
1103
+ "entropy": 1.1176372200250626,
1104
+ "epoch": 1.3888888888888888,
1105
+ "grad_norm": 0.6754550933837891,
1106
+ "learning_rate": 3.1575520833333333e-06,
1107
+ "loss": 1.0861,
1108
+ "mean_token_accuracy": 0.7573029339313507,
1109
+ "num_tokens": 6653967.0,
1110
+ "step": 1100
1111
+ },
1112
+ {
1113
+ "entropy": 1.1040414482355119,
1114
+ "epoch": 1.4015151515151514,
1115
+ "grad_norm": 0.6022531986236572,
1116
+ "learning_rate": 3.092447916666667e-06,
1117
+ "loss": 1.0573,
1118
+ "mean_token_accuracy": 0.7612267225980759,
1119
+ "num_tokens": 6714685.0,
1120
+ "step": 1110
1121
+ },
1122
+ {
1123
+ "entropy": 1.0926991075277328,
1124
+ "epoch": 1.4141414141414141,
1125
+ "grad_norm": 0.6621010303497314,
1126
+ "learning_rate": 3.0273437500000003e-06,
1127
+ "loss": 1.0637,
1128
+ "mean_token_accuracy": 0.7612977519631385,
1129
+ "num_tokens": 6774659.0,
1130
+ "step": 1120
1131
+ },
1132
+ {
1133
+ "entropy": 1.1095534324645997,
1134
+ "epoch": 1.4267676767676767,
1135
+ "grad_norm": 0.62503981590271,
1136
+ "learning_rate": 2.962239583333333e-06,
1137
+ "loss": 1.0701,
1138
+ "mean_token_accuracy": 0.7618053883314133,
1139
+ "num_tokens": 6834579.0,
1140
+ "step": 1130
1141
+ },
1142
+ {
1143
+ "entropy": 1.117809349298477,
1144
+ "epoch": 1.4393939393939394,
1145
+ "grad_norm": 0.6527109742164612,
1146
+ "learning_rate": 2.897135416666667e-06,
1147
+ "loss": 1.0747,
1148
+ "mean_token_accuracy": 0.759482853114605,
1149
+ "num_tokens": 6894074.0,
1150
+ "step": 1140
1151
+ },
1152
+ {
1153
+ "entropy": 1.1005077749490737,
1154
+ "epoch": 1.452020202020202,
1155
+ "grad_norm": 0.6720954775810242,
1156
+ "learning_rate": 2.8320312500000002e-06,
1157
+ "loss": 1.0607,
1158
+ "mean_token_accuracy": 0.7621246844530105,
1159
+ "num_tokens": 6953870.0,
1160
+ "step": 1150
1161
+ },
1162
+ {
1163
+ "entropy": 1.1236482918262483,
1164
+ "epoch": 1.4646464646464645,
1165
+ "grad_norm": 0.658524215221405,
1166
+ "learning_rate": 2.7669270833333335e-06,
1167
+ "loss": 1.0884,
1168
+ "mean_token_accuracy": 0.7560836613178253,
1169
+ "num_tokens": 7014553.0,
1170
+ "step": 1160
1171
+ },
1172
+ {
1173
+ "entropy": 1.1116504594683647,
1174
+ "epoch": 1.4772727272727273,
1175
+ "grad_norm": 0.6261802911758423,
1176
+ "learning_rate": 2.7018229166666673e-06,
1177
+ "loss": 1.0659,
1178
+ "mean_token_accuracy": 0.7597616642713547,
1179
+ "num_tokens": 7076291.0,
1180
+ "step": 1170
1181
+ },
1182
+ {
1183
+ "entropy": 1.073892480134964,
1184
+ "epoch": 1.4898989898989898,
1185
+ "grad_norm": 0.6310375332832336,
1186
+ "learning_rate": 2.63671875e-06,
1187
+ "loss": 1.0524,
1188
+ "mean_token_accuracy": 0.7628733053803444,
1189
+ "num_tokens": 7137305.0,
1190
+ "step": 1180
1191
+ },
+ {
+ "entropy": 1.0975843235850333,
+ "epoch": 1.5025252525252526,
+ "grad_norm": 0.638482391834259,
+ "learning_rate": 2.5716145833333334e-06,
+ "loss": 1.0679,
+ "mean_token_accuracy": 0.7603248566389084,
+ "num_tokens": 7198239.0,
+ "step": 1190
+ },
+ {
+ "entropy": 1.0986508697271347,
+ "epoch": 1.5151515151515151,
+ "grad_norm": 0.640065610408783,
+ "learning_rate": 2.506510416666667e-06,
+ "loss": 1.0666,
+ "mean_token_accuracy": 0.7622530281543731,
+ "num_tokens": 7257847.0,
+ "step": 1200
+ },
+ {
+ "entropy": 1.0971406906843186,
+ "epoch": 1.5277777777777777,
+ "grad_norm": 0.6437165141105652,
+ "learning_rate": 2.44140625e-06,
+ "loss": 1.0587,
+ "mean_token_accuracy": 0.7623877748847008,
+ "num_tokens": 7317615.0,
+ "step": 1210
+ },
+ {
+ "entropy": 1.1032136514782906,
+ "epoch": 1.5404040404040404,
+ "grad_norm": 0.6590547561645508,
+ "learning_rate": 2.3763020833333338e-06,
+ "loss": 1.0569,
+ "mean_token_accuracy": 0.7616146191954613,
+ "num_tokens": 7377946.0,
+ "step": 1220
+ },
+ {
+ "entropy": 1.0922824308276176,
+ "epoch": 1.553030303030303,
+ "grad_norm": 0.6317723989486694,
+ "learning_rate": 2.3111979166666667e-06,
+ "loss": 1.0616,
+ "mean_token_accuracy": 0.7611089378595353,
+ "num_tokens": 7438666.0,
+ "step": 1230
+ },
+ {
+ "entropy": 1.1092566132545472,
+ "epoch": 1.5656565656565657,
+ "grad_norm": 0.66637122631073,
+ "learning_rate": 2.2460937500000004e-06,
+ "loss": 1.0725,
+ "mean_token_accuracy": 0.7594954133033752,
+ "num_tokens": 7499747.0,
+ "step": 1240
+ },
+ {
+ "entropy": 1.111878876388073,
+ "epoch": 1.5782828282828283,
+ "grad_norm": 0.6520881652832031,
+ "learning_rate": 2.1809895833333337e-06,
+ "loss": 1.074,
+ "mean_token_accuracy": 0.7558068126440048,
+ "num_tokens": 7561004.0,
+ "step": 1250
+ },
+ {
+ "entropy": 1.1105972841382026,
+ "epoch": 1.5909090909090908,
+ "grad_norm": 0.6495437622070312,
+ "learning_rate": 2.1158854166666666e-06,
+ "loss": 1.0844,
+ "mean_token_accuracy": 0.7563070356845856,
+ "num_tokens": 7622115.0,
+ "step": 1260
+ },
+ {
+ "entropy": 1.0932338371872903,
+ "epoch": 1.6035353535353534,
+ "grad_norm": 0.6420316696166992,
+ "learning_rate": 2.0507812500000003e-06,
+ "loss": 1.0649,
+ "mean_token_accuracy": 0.7622412323951722,
+ "num_tokens": 7682210.0,
+ "step": 1270
+ },
+ {
+ "entropy": 1.0919055327773095,
+ "epoch": 1.6161616161616161,
+ "grad_norm": 0.6192623972892761,
+ "learning_rate": 1.9856770833333336e-06,
+ "loss": 1.0427,
+ "mean_token_accuracy": 0.7657591253519058,
+ "num_tokens": 7742745.0,
+ "step": 1280
+ },
+ {
+ "entropy": 1.1026942864060403,
+ "epoch": 1.628787878787879,
+ "grad_norm": 0.6355161666870117,
+ "learning_rate": 1.920572916666667e-06,
+ "loss": 1.0701,
+ "mean_token_accuracy": 0.7609635755419731,
+ "num_tokens": 7802902.0,
+ "step": 1290
+ },
+ {
+ "entropy": 1.1140723824501038,
+ "epoch": 1.6414141414141414,
+ "grad_norm": 0.6254522800445557,
+ "learning_rate": 1.8554687500000002e-06,
+ "loss": 1.0789,
+ "mean_token_accuracy": 0.7592580512166023,
+ "num_tokens": 7863576.0,
+ "step": 1300
+ },
+ {
+ "entropy": 1.108394268155098,
+ "epoch": 1.654040404040404,
+ "grad_norm": 0.633172333240509,
+ "learning_rate": 1.7903645833333335e-06,
+ "loss": 1.0714,
+ "mean_token_accuracy": 0.7598690986633301,
+ "num_tokens": 7925518.0,
+ "step": 1310
+ },
+ {
+ "entropy": 1.1097407966852189,
+ "epoch": 1.6666666666666665,
+ "grad_norm": 0.6279735565185547,
+ "learning_rate": 1.7252604166666668e-06,
+ "loss": 1.0701,
+ "mean_token_accuracy": 0.7611567705869675,
+ "num_tokens": 7987388.0,
+ "step": 1320
+ },
+ {
+ "entropy": 1.1071213275194167,
+ "epoch": 1.6792929292929293,
+ "grad_norm": 0.6425778269767761,
+ "learning_rate": 1.6601562500000001e-06,
+ "loss": 1.0786,
+ "mean_token_accuracy": 0.7576610520482063,
+ "num_tokens": 8048853.0,
+ "step": 1330
+ },
+ {
+ "entropy": 1.0931925728917122,
+ "epoch": 1.691919191919192,
+ "grad_norm": 0.666192889213562,
+ "learning_rate": 1.5950520833333336e-06,
+ "loss": 1.0604,
+ "mean_token_accuracy": 0.7616572439670563,
+ "num_tokens": 8108967.0,
+ "step": 1340
+ },
+ {
+ "entropy": 1.0973364993929864,
+ "epoch": 1.7045454545454546,
+ "grad_norm": 0.6348255276679993,
+ "learning_rate": 1.5299479166666667e-06,
+ "loss": 1.0769,
+ "mean_token_accuracy": 0.7596119627356529,
+ "num_tokens": 8169700.0,
+ "step": 1350
+ },
+ {
+ "entropy": 1.1137071400880814,
+ "epoch": 1.7171717171717171,
+ "grad_norm": 0.6510699391365051,
+ "learning_rate": 1.46484375e-06,
+ "loss": 1.0731,
+ "mean_token_accuracy": 0.7593250289559365,
+ "num_tokens": 8229676.0,
+ "step": 1360
+ },
+ {
+ "entropy": 1.1052428260445595,
+ "epoch": 1.7297979797979797,
+ "grad_norm": 0.6622318625450134,
+ "learning_rate": 1.3997395833333335e-06,
+ "loss": 1.069,
+ "mean_token_accuracy": 0.7627501472830772,
+ "num_tokens": 8289396.0,
+ "step": 1370
+ },
+ {
+ "entropy": 1.0989198789000512,
+ "epoch": 1.7424242424242424,
+ "grad_norm": 0.6430277824401855,
+ "learning_rate": 1.3346354166666666e-06,
+ "loss": 1.0506,
+ "mean_token_accuracy": 0.7634323209524154,
+ "num_tokens": 8351152.0,
+ "step": 1380
+ },
+ {
+ "entropy": 1.0913211867213248,
+ "epoch": 1.7550505050505052,
+ "grad_norm": 0.639707088470459,
+ "learning_rate": 1.2695312500000002e-06,
+ "loss": 1.0505,
+ "mean_token_accuracy": 0.763850274682045,
+ "num_tokens": 8411357.0,
+ "step": 1390
+ },
+ {
+ "entropy": 1.1044820442795753,
+ "epoch": 1.7676767676767677,
+ "grad_norm": 0.680479109287262,
+ "learning_rate": 1.2044270833333335e-06,
+ "loss": 1.0756,
+ "mean_token_accuracy": 0.7598197475075722,
+ "num_tokens": 8471975.0,
+ "step": 1400
+ }
+ ],
+ "logging_steps": 10,
+ "max_steps": 1584,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 2,
+ "save_steps": 200,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": false
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 4.796684058233733e+17,
+ "train_batch_size": 8,
+ "trial_name": null,
+ "trial_params": null
+ }
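The logged `learning_rate` values in this trainer state are consistent with a linear warmup followed by a linear decay to zero at `max_steps` (1584). The sketch below reproduces them under two assumptions that are inferred from the logged numbers and are not stated anywhere in the commit: a peak learning rate of 1e-5 and 48 warmup steps. The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=48, max_steps=1584):
    """Learning rate logged at optimizer step `step` (1-indexed).

    peak_lr and warmup_steps are assumptions inferred from the log,
    not values recorded in trainer_state.json.
    """
    k = step - 1  # scheduler updates completed before this step
    if k < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * k / warmup_steps
    # linear decay from peak_lr to 0 at max_steps
    return peak_lr * max(0, max_steps - k) / (max_steps - warmup_steps)


# Compare against a few entries from log_history.
for step in (10, 50, 1130, 1400):
    print(step, linear_schedule_lr(step))
```

If the assumed values are right, the printed rates agree with the `learning_rate` fields logged at those steps to within floating-point rounding.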
checkpoint-1400/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8eeb71c14deb91ac5fd11522db45cb3275c9164415fcbefc9d00cac27a27f0a3
+ size 6417
checkpoint-1584/README.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ base_model: CohereLabs/aya-expanse-8b
+ library_name: peft
+ pipeline_tag: text-generation
+ tags:
+ - base_model:adapter:CohereLabs/aya-expanse-8b
+ - lora
+ - sft
+ - transformers
+ - trl
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.19.1
checkpoint-1584/adapter_config.json ADDED
@@ -0,0 +1,48 @@
+ {
+ "alora_invocation_tokens": null,
+ "alpha_pattern": {},
+ "arrow_config": null,
+ "auto_mapping": null,
+ "base_model_name_or_path": "CohereLabs/aya-expanse-8b",
+ "bias": "none",
+ "corda_config": null,
+ "ensure_weight_tying": false,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": true,
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 32,
+ "lora_bias": false,
+ "lora_dropout": 0.05,
+ "lora_ga_config": null,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "peft_version": "0.19.1",
+ "qalora_group_size": 16,
+ "r": 16,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "k_proj",
+ "down_proj",
+ "q_proj",
+ "o_proj",
+ "gate_proj",
+ "v_proj",
+ "up_proj"
+ ],
+ "target_parameters": null,
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_bdlora": null,
+ "use_dora": false,
+ "use_qalora": false,
+ "use_rslora": false
+ }
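The adapter config above sets `r` to 16 and `lora_alpha` to 32 across seven projection modules. As a rough sanity check, the sketch below estimates how many trainable parameters that implies. The base-model dimensions (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers) are assumptions about `CohereLabs/aya-expanse-8b` and do not appear in this commit; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, shapes, num_layers):
    """Each adapted linear layer of shape (d_in, d_out) adds
    A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed base-model dimensions (not taken from the commit).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32

# (d_in, d_out) for every entry in target_modules.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

n_params = lora_param_count(r=16, shapes=shapes, num_layers=LAYERS)
scaling = 32 / 16  # lora_alpha / r, since use_rslora is false

print(n_params, n_params * 4, scaling)
```

Under these assumptions the adapter has about 41.9M parameters, or 167,772,160 bytes at 4 bytes each. That is close to the 167,832,240-byte `adapter_model.safetensors` in this commit, which would fit 32-bit storage plus a small file header, though the commit itself does not state the dtype.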
checkpoint-1584/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b4ccb6b6fb84ee6cc158f9a20474ec7ec29ec158992cc2108c5575301c294143
+ size 167832240
checkpoint-1584/chat_template.jinja ADDED
@@ -0,0 +1 @@
+ {{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true %}{% set loop_messages = messages %}{% set system_message = 'You are Aya, a brilliant, sophisticated, multilingual AI-assistant trained to assist human users by providing thorough responses. You are able to interact and respond to questions in 23 languages and you are powered by a multilingual model built by Cohere For AI.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% if system_message != false %}{{ '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>' + system_message + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% elif message['role'] == 'assistant' %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' + content.strip() + '<|END_OF_TURN_TOKEN|>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>' }}{% endif %}
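The one-line Jinja template above is hard to read, so here is a minimal plain-Python sketch of the same rendering logic, written from the template text itself. It is an illustration and not the code `transformers` runs; the function name `render_chat` is hypothetical. The template's default system prompt branch is guarded by `false == true`, so it never fires and is omitted here.

```python
BOS = "<BOS_TOKEN>"
SOT = "<|START_OF_TURN_TOKEN|>"
EOT = "<|END_OF_TURN_TOKEN|>"
ROLE_TOKENS = {"user": "<|USER_TOKEN|>", "assistant": "<|CHATBOT_TOKEN|>"}


def render_chat(messages, add_generation_prompt=False):
    """Mirror the chat template: optional system turn, then strictly
    alternating user/assistant turns with stripped content."""
    parts = [BOS]
    if messages and messages[0]["role"] == "system":
        # The system message is emitted as-is (the template does not strip it).
        parts.append(SOT + "<|SYSTEM_TOKEN|>" + messages[0]["content"] + EOT)
        messages = messages[1:]
    for i, msg in enumerate(messages):
        # Even positions must be user turns, odd positions must not be.
        if (msg["role"] == "user") != (i % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        if msg["role"] in ROLE_TOKENS:
            parts.append(SOT + ROLE_TOKENS[msg["role"]] + msg["content"].strip() + EOT)
    if add_generation_prompt:
        # Open a chatbot turn for the model to complete.
        parts.append(SOT + "<|CHATBOT_TOKEN|>")
    return "".join(parts)


print(render_chat([{"role": "user", "content": " Hi "}], add_generation_prompt=True))
```

In practice the tokenizer's `apply_chat_template` method renders the Jinja template directly; the sketch only makes the turn structure and the alternation check easier to follow.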
checkpoint-1584/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ad3c6bd1873cc7822b9cfcc9d20d53f10797d182583ad873b056e3fce9975cec
+ size 335929123
checkpoint-1584/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cb890208f872a9bc21144563ccd67590ecc434a4bdb0af2e1f0f0e89facf8648
+ size 14645
checkpoint-1584/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ad40633bbc30d3679cb64e4eb23a2afc463f00f1a22fd86e76f1a61cdf8ca9d6
+ size 1465
checkpoint-1584/special_tokens_map.json ADDED
@@ -0,0 +1,17 @@
+ {
+ "bos_token": {
+ "content": "<BOS_TOKEN>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "<|END_OF_TURN_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<PAD>"
+ }
checkpoint-1584/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:345ccf04a5257f473e331715ecc69365c5ac8fc2490923fe7155560af809ec1a
+ size 20124090
checkpoint-1584/tokenizer_config.json ADDED
@@ -0,0 +1,317 @@
+ {
+ "add_bos_token": true,
+ "add_eos_token": false,
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<PAD>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "<UNK>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "<CLS>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "<SEP>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "4": {
+ "content": "<MASK_TOKEN>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "5": {
+ "content": "<BOS_TOKEN>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "6": {
+ "content": "<EOS_TOKEN>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "7": {
+ "content": "<EOP_TOKEN>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "255000": {
+ "content": "<|START_OF_TURN_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255001": {
+ "content": "<|END_OF_TURN_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "255002": {
+ "content": "<|YES_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255003": {
+ "content": "<|NO_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255004": {
+ "content": "<|GOOD_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255005": {
+ "content": "<|BAD_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255006": {
+ "content": "<|USER_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255007": {
+ "content": "<|CHATBOT_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255008": {
+ "content": "<|SYSTEM_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255009": {
+ "content": "<|USER_0_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255010": {
+ "content": "<|USER_1_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255011": {
+ "content": "<|USER_2_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255012": {
+ "content": "<|USER_3_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255013": {
+ "content": "<|USER_4_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255014": {
+ "content": "<|USER_5_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255015": {
+ "content": "<|USER_6_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255016": {
+ "content": "<|USER_7_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255017": {
+ "content": "<|USER_8_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255018": {
+ "content": "<|USER_9_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255019": {
+ "content": "<|EXTRA_0_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255020": {
+ "content": "<|EXTRA_1_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255021": {
+ "content": "<|EXTRA_2_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255022": {
+ "content": "<|EXTRA_3_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255023": {
+ "content": "<|EXTRA_4_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255024": {
+ "content": "<|EXTRA_5_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255025": {
+ "content": "<|EXTRA_6_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255026": {
+ "content": "<|EXTRA_7_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255027": {
+ "content": "<|EXTRA_8_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "255028": {
+ "content": "<|EXTRA_9_TOKEN|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ }
+ },
+ "bos_token": "<BOS_TOKEN>",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|END_OF_TURN_TOKEN|>",
+ "extra_special_tokens": {},
+ "legacy": true,
+ "merges_file": null,
+ "model_max_length": 1000000000000000019884624838656,
+ "pad_token": "<PAD>",
+ "sp_model_kwargs": {},
+ "spaces_between_special_tokens": false,
+ "tokenizer_class": "CohereTokenizer",
+ "unk_token": null,
+ "use_default_system_prompt": false,
+ "vocab_file": null
+ }
checkpoint-1584/trainer_state.json ADDED
@@ -0,0 +1,1614 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 2.0,
6
+ "eval_steps": 200,
7
+ "global_step": 1584,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "entropy": 2.4040649354457857,
14
+ "epoch": 0.012626262626262626,
15
+ "grad_norm": 4.66565465927124,
16
+ "learning_rate": 1.8750000000000003e-06,
17
+ "loss": 3.6021,
18
+ "mean_token_accuracy": 0.4221017129719257,
19
+ "num_tokens": 61199.0,
20
+ "step": 10
21
+ },
22
+ {
23
+ "entropy": 2.3836746215820312,
24
+ "epoch": 0.025252525252525252,
25
+ "grad_norm": 3.8161869049072266,
26
+ "learning_rate": 3.958333333333333e-06,
27
+ "loss": 3.3432,
28
+ "mean_token_accuracy": 0.44042530804872515,
29
+ "num_tokens": 122423.0,
30
+ "step": 20
31
+ },
32
+ {
33
+ "entropy": 2.355724626779556,
34
+ "epoch": 0.03787878787878788,
35
+ "grad_norm": 3.8800699710845947,
36
+ "learning_rate": 6.041666666666667e-06,
37
+ "loss": 2.9033,
38
+ "mean_token_accuracy": 0.48426677361130716,
39
+ "num_tokens": 182649.0,
40
+ "step": 30
41
+ },
42
+ {
43
+ "entropy": 2.092331054806709,
44
+ "epoch": 0.050505050505050504,
45
+ "grad_norm": 2.8217720985412598,
46
+ "learning_rate": 8.125000000000001e-06,
47
+ "loss": 2.356,
48
+ "mean_token_accuracy": 0.5772452697157859,
49
+ "num_tokens": 243049.0,
50
+ "step": 40
51
+ },
52
+ {
53
+ "entropy": 1.6766322344541549,
54
+ "epoch": 0.06313131313131314,
55
+ "grad_norm": 1.4623568058013916,
56
+ "learning_rate": 9.993489583333334e-06,
57
+ "loss": 1.8899,
58
+ "mean_token_accuracy": 0.6480962842702865,
59
+ "num_tokens": 304326.0,
60
+ "step": 50
61
+ },
+ {
+ "entropy": 1.5568815559148788,
+ "epoch": 0.07575757575757576,
+ "grad_norm": 1.171562671661377,
+ "learning_rate": 9.928385416666668e-06,
+ "loss": 1.677,
+ "mean_token_accuracy": 0.6776855796575546,
+ "num_tokens": 364866.0,
+ "step": 60
+ },
+ {
+ "entropy": 1.48199902176857,
+ "epoch": 0.08838383838383838,
+ "grad_norm": 0.9904961585998535,
+ "learning_rate": 9.863281250000001e-06,
+ "loss": 1.5337,
+ "mean_token_accuracy": 0.697019773721695,
+ "num_tokens": 423478.0,
+ "step": 70
+ },
+ {
+ "entropy": 1.497376424074173,
+ "epoch": 0.10101010101010101,
+ "grad_norm": 0.9454260468482971,
+ "learning_rate": 9.798177083333335e-06,
+ "loss": 1.4953,
+ "mean_token_accuracy": 0.6976823702454567,
+ "num_tokens": 483659.0,
+ "step": 80
+ },
+ {
+ "entropy": 1.4664768785238267,
+ "epoch": 0.11363636363636363,
+ "grad_norm": 0.8955270648002625,
+ "learning_rate": 9.733072916666667e-06,
+ "loss": 1.4356,
+ "mean_token_accuracy": 0.7069446608424187,
+ "num_tokens": 544453.0,
+ "step": 90
+ },
+ {
+ "entropy": 1.4269085675477982,
+ "epoch": 0.12626262626262627,
+ "grad_norm": 0.9242203235626221,
+ "learning_rate": 9.66796875e-06,
+ "loss": 1.4106,
+ "mean_token_accuracy": 0.7122392952442169,
+ "num_tokens": 604546.0,
+ "step": 100
+ },
+ {
+ "entropy": 1.4060751020908355,
+ "epoch": 0.1388888888888889,
+ "grad_norm": 0.8968560695648193,
+ "learning_rate": 9.602864583333335e-06,
+ "loss": 1.3487,
+ "mean_token_accuracy": 0.7178552970290184,
+ "num_tokens": 664860.0,
+ "step": 110
+ },
+ {
+ "entropy": 1.4018951296806335,
+ "epoch": 0.15151515151515152,
+ "grad_norm": 0.9047113656997681,
+ "learning_rate": 9.537760416666667e-06,
+ "loss": 1.3347,
+ "mean_token_accuracy": 0.7208079636096955,
+ "num_tokens": 725022.0,
+ "step": 120
+ },
+ {
+ "entropy": 1.3809731483459473,
+ "epoch": 0.16414141414141414,
+ "grad_norm": 0.8915444016456604,
+ "learning_rate": 9.47265625e-06,
+ "loss": 1.3155,
+ "mean_token_accuracy": 0.7267520889639855,
+ "num_tokens": 785586.0,
+ "step": 130
+ },
+ {
+ "entropy": 1.3699676394462585,
+ "epoch": 0.17676767676767677,
+ "grad_norm": 0.8574295043945312,
+ "learning_rate": 9.407552083333334e-06,
+ "loss": 1.3016,
+ "mean_token_accuracy": 0.7266673430800438,
+ "num_tokens": 845790.0,
+ "step": 140
+ },
+ {
+ "entropy": 1.3425012439489366,
+ "epoch": 0.1893939393939394,
+ "grad_norm": 0.8231800198554993,
+ "learning_rate": 9.342447916666668e-06,
+ "loss": 1.2823,
+ "mean_token_accuracy": 0.7277117937803268,
+ "num_tokens": 905842.0,
+ "step": 150
+ },
+ {
+ "entropy": 1.3314216613769532,
+ "epoch": 0.20202020202020202,
+ "grad_norm": 0.8166369795799255,
+ "learning_rate": 9.277343750000001e-06,
+ "loss": 1.2917,
+ "mean_token_accuracy": 0.7278609350323677,
+ "num_tokens": 966487.0,
+ "step": 160
+ },
+ {
+ "entropy": 1.30389544069767,
+ "epoch": 0.21464646464646464,
+ "grad_norm": 0.7738587260246277,
+ "learning_rate": 9.212239583333335e-06,
+ "loss": 1.2481,
+ "mean_token_accuracy": 0.7335339426994324,
+ "num_tokens": 1025558.0,
+ "step": 170
+ },
+ {
+ "entropy": 1.311449444293976,
+ "epoch": 0.22727272727272727,
+ "grad_norm": 0.7718328833580017,
+ "learning_rate": 9.147135416666667e-06,
+ "loss": 1.2643,
+ "mean_token_accuracy": 0.7285080313682556,
+ "num_tokens": 1086677.0,
+ "step": 180
+ },
+ {
+ "entropy": 1.3079361200332642,
+ "epoch": 0.2398989898989899,
+ "grad_norm": 0.7341915369033813,
+ "learning_rate": 9.082031250000001e-06,
+ "loss": 1.2641,
+ "mean_token_accuracy": 0.729550538957119,
+ "num_tokens": 1147885.0,
+ "step": 190
+ },
+ {
+ "entropy": 1.2794219702482224,
+ "epoch": 0.25252525252525254,
+ "grad_norm": 0.748540997505188,
+ "learning_rate": 9.016927083333335e-06,
+ "loss": 1.2397,
+ "mean_token_accuracy": 0.7351120054721832,
+ "num_tokens": 1207321.0,
+ "step": 200
+ },
+ {
+ "entropy": 1.294454461336136,
+ "epoch": 0.26515151515151514,
+ "grad_norm": 0.7585553526878357,
+ "learning_rate": 8.951822916666667e-06,
+ "loss": 1.2489,
+ "mean_token_accuracy": 0.7322148531675339,
+ "num_tokens": 1267837.0,
+ "step": 210
+ },
+ {
+ "entropy": 1.2862246632575989,
+ "epoch": 0.2777777777777778,
+ "grad_norm": 0.6937864422798157,
+ "learning_rate": 8.88671875e-06,
+ "loss": 1.2383,
+ "mean_token_accuracy": 0.7358245223760604,
+ "num_tokens": 1328436.0,
+ "step": 220
+ },
+ {
+ "entropy": 1.2531811505556107,
+ "epoch": 0.2904040404040404,
+ "grad_norm": 0.6792387366294861,
+ "learning_rate": 8.821614583333334e-06,
+ "loss": 1.2007,
+ "mean_token_accuracy": 0.7360369265079498,
+ "num_tokens": 1389877.0,
+ "step": 230
+ },
+ {
+ "entropy": 1.2815489560365676,
+ "epoch": 0.30303030303030304,
+ "grad_norm": 0.6865427494049072,
+ "learning_rate": 8.756510416666666e-06,
+ "loss": 1.2474,
+ "mean_token_accuracy": 0.7304804190993309,
+ "num_tokens": 1450927.0,
+ "step": 240
+ },
+ {
+ "entropy": 1.2568059146404267,
+ "epoch": 0.31565656565656564,
+ "grad_norm": 0.669840395450592,
+ "learning_rate": 8.69140625e-06,
+ "loss": 1.2172,
+ "mean_token_accuracy": 0.7385278165340423,
+ "num_tokens": 1511174.0,
+ "step": 250
+ },
+ {
+ "entropy": 1.254868358373642,
+ "epoch": 0.3282828282828283,
+ "grad_norm": 0.6434893012046814,
+ "learning_rate": 8.626302083333334e-06,
+ "loss": 1.213,
+ "mean_token_accuracy": 0.7380544006824493,
+ "num_tokens": 1570570.0,
+ "step": 260
+ },
+ {
+ "entropy": 1.2532441645860672,
+ "epoch": 0.3409090909090909,
+ "grad_norm": 0.6034978032112122,
+ "learning_rate": 8.561197916666667e-06,
+ "loss": 1.2116,
+ "mean_token_accuracy": 0.7382062628865242,
+ "num_tokens": 1630930.0,
+ "step": 270
+ },
+ {
+ "entropy": 1.2605280816555022,
+ "epoch": 0.35353535353535354,
+ "grad_norm": 0.6371450424194336,
+ "learning_rate": 8.496093750000001e-06,
+ "loss": 1.2313,
+ "mean_token_accuracy": 0.7328852489590645,
+ "num_tokens": 1692624.0,
+ "step": 280
+ },
+ {
+ "entropy": 1.248606452345848,
+ "epoch": 0.3661616161616162,
+ "grad_norm": 0.6300278306007385,
+ "learning_rate": 8.430989583333335e-06,
+ "loss": 1.2195,
+ "mean_token_accuracy": 0.7370449885725975,
+ "num_tokens": 1754213.0,
+ "step": 290
+ },
+ {
+ "entropy": 1.2246102809906005,
+ "epoch": 0.3787878787878788,
+ "grad_norm": 0.6430155634880066,
+ "learning_rate": 8.365885416666667e-06,
+ "loss": 1.1845,
+ "mean_token_accuracy": 0.7439196646213532,
+ "num_tokens": 1813841.0,
+ "step": 300
+ },
+ {
+ "entropy": 1.2188815206289292,
+ "epoch": 0.39141414141414144,
+ "grad_norm": 0.6395701766014099,
+ "learning_rate": 8.30078125e-06,
+ "loss": 1.1856,
+ "mean_token_accuracy": 0.7422183871269226,
+ "num_tokens": 1873568.0,
+ "step": 310
+ },
+ {
+ "entropy": 1.234528934955597,
+ "epoch": 0.40404040404040403,
+ "grad_norm": 0.6168740391731262,
+ "learning_rate": 8.235677083333334e-06,
+ "loss": 1.1951,
+ "mean_token_accuracy": 0.7380509555339814,
+ "num_tokens": 1935518.0,
+ "step": 320
+ },
+ {
+ "entropy": 1.226365676522255,
+ "epoch": 0.4166666666666667,
+ "grad_norm": 0.611132800579071,
+ "learning_rate": 8.170572916666666e-06,
+ "loss": 1.2078,
+ "mean_token_accuracy": 0.739974245429039,
+ "num_tokens": 1995604.0,
+ "step": 330
+ },
+ {
+ "entropy": 1.2132183194160462,
+ "epoch": 0.4292929292929293,
+ "grad_norm": 0.6103131771087646,
+ "learning_rate": 8.10546875e-06,
+ "loss": 1.1642,
+ "mean_token_accuracy": 0.7444137379527092,
+ "num_tokens": 2056484.0,
+ "step": 340
+ },
+ {
+ "entropy": 1.223158246278763,
+ "epoch": 0.44191919191919193,
+ "grad_norm": 0.6188805103302002,
+ "learning_rate": 8.040364583333334e-06,
+ "loss": 1.2001,
+ "mean_token_accuracy": 0.7385613292455673,
+ "num_tokens": 2118437.0,
+ "step": 350
+ },
+ {
+ "entropy": 1.2154075980186463,
+ "epoch": 0.45454545454545453,
+ "grad_norm": 0.6238694190979004,
+ "learning_rate": 7.975260416666668e-06,
+ "loss": 1.1848,
+ "mean_token_accuracy": 0.7404153689742088,
+ "num_tokens": 2179333.0,
+ "step": 360
+ },
+ {
+ "entropy": 1.197928261756897,
+ "epoch": 0.4671717171717172,
+ "grad_norm": 0.6028566956520081,
+ "learning_rate": 7.910156250000001e-06,
+ "loss": 1.1597,
+ "mean_token_accuracy": 0.7475010469555855,
+ "num_tokens": 2239604.0,
+ "step": 370
+ },
+ {
+ "entropy": 1.189855706691742,
+ "epoch": 0.4797979797979798,
+ "grad_norm": 0.6569434404373169,
+ "learning_rate": 7.845052083333335e-06,
+ "loss": 1.1805,
+ "mean_token_accuracy": 0.7427218139171601,
+ "num_tokens": 2300936.0,
+ "step": 380
+ },
+ {
+ "entropy": 1.2076493889093398,
+ "epoch": 0.49242424242424243,
+ "grad_norm": 0.6351733207702637,
+ "learning_rate": 7.779947916666667e-06,
+ "loss": 1.1759,
+ "mean_token_accuracy": 0.7408407002687454,
+ "num_tokens": 2361539.0,
+ "step": 390
+ },
+ {
+ "entropy": 1.2223081022500992,
+ "epoch": 0.5050505050505051,
+ "grad_norm": 0.6327986121177673,
+ "learning_rate": 7.71484375e-06,
+ "loss": 1.1874,
+ "mean_token_accuracy": 0.7407250568270684,
+ "num_tokens": 2422162.0,
+ "step": 400
+ },
+ {
+ "entropy": 1.2013412863016129,
+ "epoch": 0.5176767676767676,
+ "grad_norm": 0.622104823589325,
+ "learning_rate": 7.649739583333334e-06,
+ "loss": 1.1726,
+ "mean_token_accuracy": 0.744242025911808,
+ "num_tokens": 2483447.0,
+ "step": 410
+ },
+ {
+ "entropy": 1.2116613179445266,
+ "epoch": 0.5303030303030303,
+ "grad_norm": 0.637651264667511,
+ "learning_rate": 7.5846354166666665e-06,
+ "loss": 1.1838,
+ "mean_token_accuracy": 0.7405030101537704,
+ "num_tokens": 2544848.0,
+ "step": 420
+ },
+ {
+ "entropy": 1.2024697184562683,
+ "epoch": 0.5429292929292929,
+ "grad_norm": 0.6252374649047852,
+ "learning_rate": 7.51953125e-06,
+ "loss": 1.1681,
+ "mean_token_accuracy": 0.7458183988928795,
+ "num_tokens": 2605232.0,
+ "step": 430
+ },
+ {
+ "entropy": 1.1797083109617232,
+ "epoch": 0.5555555555555556,
+ "grad_norm": 0.6502755284309387,
+ "learning_rate": 7.454427083333334e-06,
+ "loss": 1.1452,
+ "mean_token_accuracy": 0.7477341219782829,
+ "num_tokens": 2664276.0,
+ "step": 440
+ },
+ {
+ "entropy": 1.1964200481772422,
+ "epoch": 0.5681818181818182,
+ "grad_norm": 0.639979362487793,
+ "learning_rate": 7.389322916666667e-06,
+ "loss": 1.1665,
+ "mean_token_accuracy": 0.7431837096810341,
+ "num_tokens": 2724073.0,
+ "step": 450
+ },
+ {
+ "entropy": 1.1795272737741471,
+ "epoch": 0.5808080808080808,
+ "grad_norm": 0.6212354302406311,
+ "learning_rate": 7.3242187500000006e-06,
+ "loss": 1.1529,
+ "mean_token_accuracy": 0.7479098170995713,
+ "num_tokens": 2784262.0,
+ "step": 460
+ },
+ {
+ "entropy": 1.1908618807792664,
+ "epoch": 0.5934343434343434,
+ "grad_norm": 0.6528693437576294,
+ "learning_rate": 7.259114583333334e-06,
+ "loss": 1.1678,
+ "mean_token_accuracy": 0.745174677670002,
+ "num_tokens": 2843804.0,
+ "step": 470
+ },
+ {
+ "entropy": 1.1862946093082427,
+ "epoch": 0.6060606060606061,
+ "grad_norm": 0.639481246471405,
+ "learning_rate": 7.194010416666667e-06,
+ "loss": 1.1565,
+ "mean_token_accuracy": 0.7461866185069084,
+ "num_tokens": 2903408.0,
+ "step": 480
+ },
+ {
+ "entropy": 1.151743534207344,
+ "epoch": 0.6186868686868687,
+ "grad_norm": 0.6332777142524719,
+ "learning_rate": 7.128906250000001e-06,
+ "loss": 1.1251,
+ "mean_token_accuracy": 0.7535199671983719,
+ "num_tokens": 2963401.0,
+ "step": 490
+ },
+ {
+ "entropy": 1.1778477430343628,
+ "epoch": 0.6313131313131313,
+ "grad_norm": 0.5991836190223694,
+ "learning_rate": 7.063802083333335e-06,
+ "loss": 1.1407,
+ "mean_token_accuracy": 0.7491110354661942,
+ "num_tokens": 3023530.0,
+ "step": 500
+ },
+ {
+ "entropy": 1.2023035794496537,
+ "epoch": 0.6439393939393939,
+ "grad_norm": 0.6293458938598633,
+ "learning_rate": 6.998697916666667e-06,
+ "loss": 1.1724,
+ "mean_token_accuracy": 0.7405256554484367,
+ "num_tokens": 3085225.0,
+ "step": 510
+ },
+ {
+ "entropy": 1.1997334092855454,
+ "epoch": 0.6565656565656566,
+ "grad_norm": 0.6213802695274353,
+ "learning_rate": 6.93359375e-06,
+ "loss": 1.1604,
+ "mean_token_accuracy": 0.7435309410095214,
+ "num_tokens": 3145749.0,
+ "step": 520
+ },
+ {
+ "entropy": 1.1643184214830398,
+ "epoch": 0.6691919191919192,
+ "grad_norm": 0.6495156288146973,
+ "learning_rate": 6.868489583333334e-06,
+ "loss": 1.1381,
+ "mean_token_accuracy": 0.7496751576662064,
+ "num_tokens": 3205761.0,
+ "step": 530
+ },
+ {
+ "entropy": 1.1837532848119736,
+ "epoch": 0.6818181818181818,
+ "grad_norm": 0.6004510521888733,
+ "learning_rate": 6.803385416666667e-06,
+ "loss": 1.1563,
+ "mean_token_accuracy": 0.7447851061820984,
+ "num_tokens": 3267024.0,
+ "step": 540
+ },
+ {
+ "entropy": 1.201677542924881,
+ "epoch": 0.6944444444444444,
+ "grad_norm": 0.607467532157898,
+ "learning_rate": 6.738281250000001e-06,
+ "loss": 1.1776,
+ "mean_token_accuracy": 0.7406648993492126,
+ "num_tokens": 3329070.0,
+ "step": 550
+ },
+ {
+ "entropy": 1.1659786373376846,
+ "epoch": 0.7070707070707071,
+ "grad_norm": 0.6079947352409363,
+ "learning_rate": 6.6731770833333345e-06,
+ "loss": 1.1298,
+ "mean_token_accuracy": 0.7505511298775673,
+ "num_tokens": 3389505.0,
+ "step": 560
+ },
+ {
+ "entropy": 1.1714913487434386,
+ "epoch": 0.7196969696969697,
+ "grad_norm": 0.6534572839736938,
+ "learning_rate": 6.6080729166666665e-06,
+ "loss": 1.144,
+ "mean_token_accuracy": 0.7481625184416771,
+ "num_tokens": 3449834.0,
+ "step": 570
+ },
+ {
+ "entropy": 1.1694963037967683,
+ "epoch": 0.7323232323232324,
+ "grad_norm": 0.5903164744377136,
+ "learning_rate": 6.54296875e-06,
+ "loss": 1.139,
+ "mean_token_accuracy": 0.7476441130042076,
+ "num_tokens": 3510701.0,
+ "step": 580
+ },
+ {
+ "entropy": 1.1777653217315673,
+ "epoch": 0.7449494949494949,
+ "grad_norm": 0.6284182071685791,
+ "learning_rate": 6.477864583333334e-06,
+ "loss": 1.1422,
+ "mean_token_accuracy": 0.7490358456969262,
+ "num_tokens": 3570992.0,
+ "step": 590
+ },
+ {
+ "entropy": 1.1588268011808396,
+ "epoch": 0.7575757575757576,
+ "grad_norm": 0.6250146627426147,
+ "learning_rate": 6.412760416666667e-06,
+ "loss": 1.1336,
+ "mean_token_accuracy": 0.7504108369350433,
+ "num_tokens": 3631087.0,
+ "step": 600
+ },
+ {
+ "entropy": 1.1890273630619048,
+ "epoch": 0.7702020202020202,
+ "grad_norm": 0.6420578956604004,
+ "learning_rate": 6.3476562500000006e-06,
+ "loss": 1.1534,
+ "mean_token_accuracy": 0.7442372158169747,
+ "num_tokens": 3692871.0,
+ "step": 610
+ },
+ {
+ "entropy": 1.1817798465490341,
+ "epoch": 0.7828282828282829,
+ "grad_norm": 0.6156490445137024,
+ "learning_rate": 6.282552083333334e-06,
+ "loss": 1.1477,
+ "mean_token_accuracy": 0.7468263059854507,
+ "num_tokens": 3753671.0,
+ "step": 620
+ },
+ {
+ "entropy": 1.1686612635850906,
+ "epoch": 0.7954545454545454,
+ "grad_norm": 0.6248748898506165,
+ "learning_rate": 6.217447916666667e-06,
+ "loss": 1.139,
+ "mean_token_accuracy": 0.7478924334049225,
+ "num_tokens": 3813110.0,
+ "step": 630
+ },
+ {
+ "entropy": 1.1387953266501427,
+ "epoch": 0.8080808080808081,
+ "grad_norm": 0.6052266359329224,
+ "learning_rate": 6.152343750000001e-06,
+ "loss": 1.1118,
+ "mean_token_accuracy": 0.7538487210869789,
+ "num_tokens": 3873477.0,
+ "step": 640
+ },
+ {
+ "entropy": 1.1343814879655838,
+ "epoch": 0.8207070707070707,
+ "grad_norm": 0.6769536137580872,
+ "learning_rate": 6.087239583333335e-06,
+ "loss": 1.1108,
+ "mean_token_accuracy": 0.7548464313149452,
+ "num_tokens": 3932873.0,
+ "step": 650
+ },
+ {
+ "entropy": 1.1686519652605056,
+ "epoch": 0.8333333333333334,
+ "grad_norm": 0.6545736789703369,
+ "learning_rate": 6.022135416666667e-06,
+ "loss": 1.134,
+ "mean_token_accuracy": 0.7495195478200912,
+ "num_tokens": 3992754.0,
+ "step": 660
+ },
+ {
+ "entropy": 1.1596283346414566,
+ "epoch": 0.8459595959595959,
+ "grad_norm": 0.6192017793655396,
+ "learning_rate": 5.95703125e-06,
+ "loss": 1.1213,
+ "mean_token_accuracy": 0.7510297149419785,
+ "num_tokens": 4053540.0,
+ "step": 670
+ },
+ {
+ "entropy": 1.1484344542026519,
+ "epoch": 0.8585858585858586,
+ "grad_norm": 0.6520631909370422,
+ "learning_rate": 5.891927083333334e-06,
+ "loss": 1.1181,
+ "mean_token_accuracy": 0.751795919239521,
+ "num_tokens": 4113582.0,
+ "step": 680
+ },
+ {
+ "entropy": 1.1519750133156776,
+ "epoch": 0.8712121212121212,
+ "grad_norm": 0.6247655153274536,
+ "learning_rate": 5.826822916666667e-06,
+ "loss": 1.1204,
+ "mean_token_accuracy": 0.7506368085741997,
+ "num_tokens": 4174983.0,
+ "step": 690
+ },
+ {
+ "entropy": 1.1486561581492425,
+ "epoch": 0.8838383838383839,
+ "grad_norm": 0.620272159576416,
+ "learning_rate": 5.761718750000001e-06,
+ "loss": 1.1191,
+ "mean_token_accuracy": 0.7536391675472259,
+ "num_tokens": 4234465.0,
+ "step": 700
+ },
+ {
+ "entropy": 1.1504923462867738,
+ "epoch": 0.8964646464646465,
+ "grad_norm": 0.6308649182319641,
+ "learning_rate": 5.6966145833333344e-06,
+ "loss": 1.1224,
+ "mean_token_accuracy": 0.750646598637104,
+ "num_tokens": 4295955.0,
+ "step": 710
+ },
+ {
+ "entropy": 1.1561898440122604,
+ "epoch": 0.9090909090909091,
+ "grad_norm": 0.6629899740219116,
+ "learning_rate": 5.6315104166666665e-06,
+ "loss": 1.1238,
+ "mean_token_accuracy": 0.7507557719945908,
+ "num_tokens": 4357171.0,
+ "step": 720
+ },
+ {
+ "entropy": 1.1413449853658677,
+ "epoch": 0.9217171717171717,
+ "grad_norm": 0.5972346067428589,
+ "learning_rate": 5.56640625e-06,
+ "loss": 1.1038,
+ "mean_token_accuracy": 0.7554104939103127,
+ "num_tokens": 4417636.0,
+ "step": 730
+ },
+ {
+ "entropy": 1.1356119453907012,
+ "epoch": 0.9343434343434344,
+ "grad_norm": 0.6356479525566101,
+ "learning_rate": 5.501302083333334e-06,
+ "loss": 1.1005,
+ "mean_token_accuracy": 0.7565629109740257,
+ "num_tokens": 4477294.0,
+ "step": 740
+ },
+ {
+ "entropy": 1.1633600294589996,
+ "epoch": 0.946969696969697,
+ "grad_norm": 0.6416464447975159,
+ "learning_rate": 5.436197916666667e-06,
+ "loss": 1.1225,
+ "mean_token_accuracy": 0.7515855401754379,
+ "num_tokens": 4537503.0,
+ "step": 750
+ },
+ {
+ "entropy": 1.1527763932943345,
+ "epoch": 0.9595959595959596,
+ "grad_norm": 0.6126084327697754,
+ "learning_rate": 5.3710937500000005e-06,
+ "loss": 1.1184,
+ "mean_token_accuracy": 0.7526160582900048,
+ "num_tokens": 4598778.0,
+ "step": 760
+ },
+ {
+ "entropy": 1.1397768080234527,
+ "epoch": 0.9722222222222222,
+ "grad_norm": 0.6359922289848328,
+ "learning_rate": 5.305989583333334e-06,
+ "loss": 1.1144,
+ "mean_token_accuracy": 0.7548302739858628,
+ "num_tokens": 4658978.0,
+ "step": 770
+ },
+ {
+ "entropy": 1.1569962561130525,
+ "epoch": 0.9848484848484849,
+ "grad_norm": 0.6260409951210022,
+ "learning_rate": 5.240885416666667e-06,
+ "loss": 1.1213,
+ "mean_token_accuracy": 0.7512885302305221,
+ "num_tokens": 4720500.0,
+ "step": 780
+ },
+ {
+ "entropy": 1.1509152203798294,
+ "epoch": 0.9974747474747475,
+ "grad_norm": 0.6293452978134155,
+ "learning_rate": 5.17578125e-06,
+ "loss": 1.1227,
+ "mean_token_accuracy": 0.7519564241170883,
+ "num_tokens": 4781612.0,
+ "step": 790
+ },
+ {
+ "entropy": 1.1399024561047555,
+ "epoch": 1.0101010101010102,
+ "grad_norm": 0.664761483669281,
+ "learning_rate": 5.110677083333334e-06,
+ "loss": 1.1034,
+ "mean_token_accuracy": 0.7526706486940384,
+ "num_tokens": 4841359.0,
+ "step": 800
+ },
+ {
+ "entropy": 1.120214229822159,
+ "epoch": 1.0227272727272727,
+ "grad_norm": 0.5934865474700928,
+ "learning_rate": 5.045572916666667e-06,
+ "loss": 1.0857,
+ "mean_token_accuracy": 0.7593594208359719,
+ "num_tokens": 4901016.0,
+ "step": 810
+ },
+ {
+ "entropy": 1.1430341199040412,
+ "epoch": 1.0353535353535352,
+ "grad_norm": 0.6040735840797424,
+ "learning_rate": 4.98046875e-06,
+ "loss": 1.1165,
+ "mean_token_accuracy": 0.7525988414883613,
+ "num_tokens": 4961646.0,
+ "step": 820
+ },
+ {
+ "entropy": 1.1245349109172822,
+ "epoch": 1.047979797979798,
+ "grad_norm": 0.6277610063552856,
+ "learning_rate": 4.915364583333333e-06,
+ "loss": 1.0851,
+ "mean_token_accuracy": 0.7577921718358993,
+ "num_tokens": 5022365.0,
+ "step": 830
+ },
+ {
+ "entropy": 1.1204675793647767,
+ "epoch": 1.0606060606060606,
+ "grad_norm": 0.6260582804679871,
+ "learning_rate": 4.850260416666667e-06,
+ "loss": 1.0813,
+ "mean_token_accuracy": 0.7580071151256561,
+ "num_tokens": 5081972.0,
+ "step": 840
+ },
+ {
+ "entropy": 1.1222782507538795,
+ "epoch": 1.0732323232323233,
+ "grad_norm": 0.6023226976394653,
+ "learning_rate": 4.785156250000001e-06,
+ "loss": 1.0922,
+ "mean_token_accuracy": 0.7566258609294891,
+ "num_tokens": 5142184.0,
+ "step": 850
+ },
+ {
+ "entropy": 1.1227335944771766,
+ "epoch": 1.0858585858585859,
+ "grad_norm": 0.6206791996955872,
+ "learning_rate": 4.7200520833333336e-06,
+ "loss": 1.0994,
+ "mean_token_accuracy": 0.7540625646710396,
+ "num_tokens": 5203020.0,
+ "step": 860
+ },
+ {
+ "entropy": 1.1352888554334641,
+ "epoch": 1.0984848484848484,
+ "grad_norm": 0.6301055550575256,
+ "learning_rate": 4.654947916666667e-06,
+ "loss": 1.0958,
+ "mean_token_accuracy": 0.7562039017677307,
+ "num_tokens": 5263291.0,
+ "step": 870
+ },
+ {
+ "entropy": 1.1120088309049607,
+ "epoch": 1.1111111111111112,
+ "grad_norm": 0.6210020780563354,
+ "learning_rate": 4.58984375e-06,
+ "loss": 1.0793,
+ "mean_token_accuracy": 0.7590557768940925,
+ "num_tokens": 5323972.0,
+ "step": 880
+ },
+ {
+ "entropy": 1.0996666207909584,
+ "epoch": 1.1237373737373737,
+ "grad_norm": 0.6332690715789795,
+ "learning_rate": 4.524739583333334e-06,
+ "loss": 1.0717,
+ "mean_token_accuracy": 0.7615471586585045,
+ "num_tokens": 5383911.0,
+ "step": 890
+ },
+ {
+ "entropy": 1.127107810974121,
+ "epoch": 1.1363636363636362,
+ "grad_norm": 0.6505516767501831,
+ "learning_rate": 4.459635416666668e-06,
+ "loss": 1.1027,
+ "mean_token_accuracy": 0.7562421515583992,
+ "num_tokens": 5445417.0,
+ "step": 900
+ },
+ {
+ "entropy": 1.129740473628044,
+ "epoch": 1.148989898989899,
+ "grad_norm": 0.6406158804893494,
+ "learning_rate": 4.3945312500000005e-06,
+ "loss": 1.0879,
+ "mean_token_accuracy": 0.7587148532271385,
+ "num_tokens": 5505455.0,
+ "step": 910
+ },
+ {
+ "entropy": 1.1167259424924851,
+ "epoch": 1.1616161616161615,
+ "grad_norm": 0.6297397613525391,
+ "learning_rate": 4.329427083333333e-06,
+ "loss": 1.0752,
+ "mean_token_accuracy": 0.7604142814874649,
+ "num_tokens": 5565311.0,
+ "step": 920
+ },
+ {
+ "entropy": 1.1037891641259194,
+ "epoch": 1.1742424242424243,
+ "grad_norm": 0.6490073204040527,
+ "learning_rate": 4.264322916666667e-06,
+ "loss": 1.0686,
+ "mean_token_accuracy": 0.7610052570700645,
+ "num_tokens": 5625358.0,
+ "step": 930
+ },
+ {
+ "entropy": 1.1061881184577942,
+ "epoch": 1.1868686868686869,
+ "grad_norm": 0.6366387009620667,
+ "learning_rate": 4.19921875e-06,
+ "loss": 1.0868,
+ "mean_token_accuracy": 0.7574937298893929,
+ "num_tokens": 5686421.0,
+ "step": 940
+ },
+ {
+ "entropy": 1.1124324068427085,
+ "epoch": 1.1994949494949494,
+ "grad_norm": 0.6556055545806885,
+ "learning_rate": 4.134114583333334e-06,
+ "loss": 1.0694,
+ "mean_token_accuracy": 0.7602224007248879,
+ "num_tokens": 5745891.0,
+ "step": 950
+ },
+ {
+ "entropy": 1.1175200879573821,
+ "epoch": 1.2121212121212122,
+ "grad_norm": 0.6404849886894226,
+ "learning_rate": 4.0690104166666675e-06,
+ "loss": 1.081,
+ "mean_token_accuracy": 0.7568994402885437,
+ "num_tokens": 5806078.0,
+ "step": 960
+ },
+ {
+ "entropy": 1.1186978340148925,
+ "epoch": 1.2247474747474747,
+ "grad_norm": 0.6227584481239319,
+ "learning_rate": 4.00390625e-06,
+ "loss": 1.0791,
+ "mean_token_accuracy": 0.759756401181221,
+ "num_tokens": 5866369.0,
+ "step": 970
+ },
+ {
+ "entropy": 1.122128139436245,
+ "epoch": 1.2373737373737375,
+ "grad_norm": 0.6616361141204834,
+ "learning_rate": 3.938802083333333e-06,
+ "loss": 1.0937,
+ "mean_token_accuracy": 0.7582718566060066,
+ "num_tokens": 5926217.0,
+ "step": 980
+ },
+ {
+ "entropy": 1.122861033678055,
+ "epoch": 1.25,
+ "grad_norm": 0.6384168267250061,
+ "learning_rate": 3.873697916666667e-06,
+ "loss": 1.0978,
+ "mean_token_accuracy": 0.7549964562058449,
+ "num_tokens": 5987666.0,
+ "step": 990
+ },
+ {
+ "entropy": 1.1277505576610565,
+ "epoch": 1.2626262626262625,
+ "grad_norm": 0.6038117408752441,
+ "learning_rate": 3.8085937500000002e-06,
+ "loss": 1.0952,
+ "mean_token_accuracy": 0.755272176861763,
+ "num_tokens": 6048708.0,
+ "step": 1000
+ },
+ {
+ "entropy": 1.1120157346129418,
+ "epoch": 1.2752525252525253,
+ "grad_norm": 0.6418159604072571,
+ "learning_rate": 3.7434895833333336e-06,
+ "loss": 1.078,
+ "mean_token_accuracy": 0.7594122514128685,
+ "num_tokens": 6109652.0,
+ "step": 1010
+ },
+ {
+ "entropy": 1.101425115764141,
+ "epoch": 1.2878787878787878,
+ "grad_norm": 0.6218425035476685,
+ "learning_rate": 3.6783854166666673e-06,
+ "loss": 1.0688,
+ "mean_token_accuracy": 0.7604865297675133,
+ "num_tokens": 6169125.0,
+ "step": 1020
+ },
+ {
+ "entropy": 1.1007713869214057,
+ "epoch": 1.3005050505050506,
+ "grad_norm": 0.6429149508476257,
+ "learning_rate": 3.61328125e-06,
+ "loss": 1.0581,
+ "mean_token_accuracy": 0.7621071562170982,
+ "num_tokens": 6230303.0,
+ "step": 1030
+ },
+ {
+ "entropy": 1.1094096556305886,
+ "epoch": 1.3131313131313131,
+ "grad_norm": 0.6489748358726501,
+ "learning_rate": 3.5481770833333335e-06,
+ "loss": 1.0715,
+ "mean_token_accuracy": 0.7599423810839653,
+ "num_tokens": 6291396.0,
+ "step": 1040
+ },
+ {
+ "entropy": 1.0827289715409278,
+ "epoch": 1.3257575757575757,
+ "grad_norm": 0.6485461592674255,
+ "learning_rate": 3.483072916666667e-06,
+ "loss": 1.0584,
+ "mean_token_accuracy": 0.7630694910883904,
+ "num_tokens": 6351579.0,
+ "step": 1050
+ },
+ {
+ "entropy": 1.114325873553753,
+ "epoch": 1.3383838383838385,
+ "grad_norm": 0.6261104941368103,
+ "learning_rate": 3.41796875e-06,
+ "loss": 1.0764,
+ "mean_token_accuracy": 0.7585488513112069,
+ "num_tokens": 6411662.0,
+ "step": 1060
+ },
+ {
+ "entropy": 1.1271554425358772,
+ "epoch": 1.351010101010101,
+ "grad_norm": 0.6522034406661987,
+ "learning_rate": 3.3528645833333334e-06,
+ "loss": 1.0902,
+ "mean_token_accuracy": 0.7562535598874092,
+ "num_tokens": 6473505.0,
+ "step": 1070
+ },
+ {
+ "entropy": 1.1013643085956573,
+ "epoch": 1.3636363636363638,
+ "grad_norm": 0.6176674962043762,
+ "learning_rate": 3.287760416666667e-06,
+ "loss": 1.065,
+ "mean_token_accuracy": 0.763075165450573,
+ "num_tokens": 6533580.0,
+ "step": 1080
+ },
+ {
+ "entropy": 1.098090337216854,
+ "epoch": 1.3762626262626263,
+ "grad_norm": 0.6540253758430481,
+ "learning_rate": 3.2226562500000004e-06,
+ "loss": 1.0596,
+ "mean_token_accuracy": 0.7616770043969154,
+ "num_tokens": 6593481.0,
+ "step": 1090
+ },
+ {
+ "entropy": 1.1176372200250626,
+ "epoch": 1.3888888888888888,
+ "grad_norm": 0.6754550933837891,
+ "learning_rate": 3.1575520833333333e-06,
+ "loss": 1.0861,
+ "mean_token_accuracy": 0.7573029339313507,
+ "num_tokens": 6653967.0,
+ "step": 1100
+ },
+ {
+ "entropy": 1.1040414482355119,
+ "epoch": 1.4015151515151514,
+ "grad_norm": 0.6022531986236572,
+ "learning_rate": 3.092447916666667e-06,
+ "loss": 1.0573,
+ "mean_token_accuracy": 0.7612267225980759,
+ "num_tokens": 6714685.0,
+ "step": 1110
+ },
+ {
+ "entropy": 1.0926991075277328,
+ "epoch": 1.4141414141414141,
+ "grad_norm": 0.6621010303497314,
+ "learning_rate": 3.0273437500000003e-06,
+ "loss": 1.0637,
+ "mean_token_accuracy": 0.7612977519631385,
+ "num_tokens": 6774659.0,
+ "step": 1120
+ },
+ {
+ "entropy": 1.1095534324645997,
+ "epoch": 1.4267676767676767,
+ "grad_norm": 0.62503981590271,
+ "learning_rate": 2.962239583333333e-06,
+ "loss": 1.0701,
+ "mean_token_accuracy": 0.7618053883314133,
+ "num_tokens": 6834579.0,
+ "step": 1130
+ },
+ {
+ "entropy": 1.117809349298477,
+ "epoch": 1.4393939393939394,
+ "grad_norm": 0.6527109742164612,
+ "learning_rate": 2.897135416666667e-06,
+ "loss": 1.0747,
+ "mean_token_accuracy": 0.759482853114605,
+ "num_tokens": 6894074.0,
+ "step": 1140
+ },
+ {
+ "entropy": 1.1005077749490737,
+ "epoch": 1.452020202020202,
+ "grad_norm": 0.6720954775810242,
+ "learning_rate": 2.8320312500000002e-06,
+ "loss": 1.0607,
+ "mean_token_accuracy": 0.7621246844530105,
+ "num_tokens": 6953870.0,
+ "step": 1150
+ },
+ {
+ "entropy": 1.1236482918262483,
+ "epoch": 1.4646464646464645,
+ "grad_norm": 0.658524215221405,
+ "learning_rate": 2.7669270833333335e-06,
+ "loss": 1.0884,
+ "mean_token_accuracy": 0.7560836613178253,
+ "num_tokens": 7014553.0,
+ "step": 1160
+ },
+ {
+ "entropy": 1.1116504594683647,
+ "epoch": 1.4772727272727273,
+ "grad_norm": 0.6261802911758423,
+ "learning_rate": 2.7018229166666673e-06,
+ "loss": 1.0659,
+ "mean_token_accuracy": 0.7597616642713547,
+ "num_tokens": 7076291.0,
+ "step": 1170
+ },
+ {
+ "entropy": 1.073892480134964,
+ "epoch": 1.4898989898989898,
+ "grad_norm": 0.6310375332832336,
+ "learning_rate": 2.63671875e-06,
+ "loss": 1.0524,
+ "mean_token_accuracy": 0.7628733053803444,
+ "num_tokens": 7137305.0,
+ "step": 1180
+ },
+ {
+ "entropy": 1.0975843235850333,
+ "epoch": 1.5025252525252526,
+ "grad_norm": 0.638482391834259,
+ "learning_rate": 2.5716145833333334e-06,
+ "loss": 1.0679,
+ "mean_token_accuracy": 0.7603248566389084,
+ "num_tokens": 7198239.0,
+ "step": 1190
+ },
+ {
+ "entropy": 1.0986508697271347,
+ "epoch": 1.5151515151515151,
+ "grad_norm": 0.640065610408783,
+ "learning_rate": 2.506510416666667e-06,
+ "loss": 1.0666,
+ "mean_token_accuracy": 0.7622530281543731,
+ "num_tokens": 7257847.0,
+ "step": 1200
+ },
1212
+ {
1213
+ "entropy": 1.0971406906843186,
1214
+ "epoch": 1.5277777777777777,
1215
+ "grad_norm": 0.6437165141105652,
1216
+ "learning_rate": 2.44140625e-06,
1217
+ "loss": 1.0587,
1218
+ "mean_token_accuracy": 0.7623877748847008,
1219
+ "num_tokens": 7317615.0,
1220
+ "step": 1210
1221
+ },
1222
+ {
1223
+ "entropy": 1.1032136514782906,
1224
+ "epoch": 1.5404040404040404,
1225
+ "grad_norm": 0.6590547561645508,
1226
+ "learning_rate": 2.3763020833333338e-06,
1227
+ "loss": 1.0569,
1228
+ "mean_token_accuracy": 0.7616146191954613,
1229
+ "num_tokens": 7377946.0,
1230
+ "step": 1220
1231
+ },
1232
+ {
1233
+ "entropy": 1.0922824308276176,
1234
+ "epoch": 1.553030303030303,
1235
+ "grad_norm": 0.6317723989486694,
1236
+ "learning_rate": 2.3111979166666667e-06,
1237
+ "loss": 1.0616,
1238
+ "mean_token_accuracy": 0.7611089378595353,
1239
+ "num_tokens": 7438666.0,
1240
+ "step": 1230
1241
+ },
1242
+ {
1243
+ "entropy": 1.1092566132545472,
1244
+ "epoch": 1.5656565656565657,
1245
+ "grad_norm": 0.66637122631073,
1246
+ "learning_rate": 2.2460937500000004e-06,
1247
+ "loss": 1.0725,
1248
+ "mean_token_accuracy": 0.7594954133033752,
1249
+ "num_tokens": 7499747.0,
1250
+ "step": 1240
1251
+ },
1252
+ {
1253
+ "entropy": 1.111878876388073,
1254
+ "epoch": 1.5782828282828283,
1255
+ "grad_norm": 0.6520881652832031,
1256
+ "learning_rate": 2.1809895833333337e-06,
1257
+ "loss": 1.074,
1258
+ "mean_token_accuracy": 0.7558068126440048,
1259
+ "num_tokens": 7561004.0,
1260
+ "step": 1250
1261
+ },
1262
+ {
1263
+ "entropy": 1.1105972841382026,
1264
+ "epoch": 1.5909090909090908,
1265
+ "grad_norm": 0.6495437622070312,
1266
+ "learning_rate": 2.1158854166666666e-06,
1267
+ "loss": 1.0844,
1268
+ "mean_token_accuracy": 0.7563070356845856,
1269
+ "num_tokens": 7622115.0,
1270
+ "step": 1260
1271
+ },
1272
+ {
1273
+ "entropy": 1.0932338371872903,
1274
+ "epoch": 1.6035353535353534,
1275
+ "grad_norm": 0.6420316696166992,
1276
+ "learning_rate": 2.0507812500000003e-06,
1277
+ "loss": 1.0649,
1278
+ "mean_token_accuracy": 0.7622412323951722,
1279
+ "num_tokens": 7682210.0,
1280
+ "step": 1270
1281
+ },
1282
+ {
1283
+ "entropy": 1.0919055327773095,
1284
+ "epoch": 1.6161616161616161,
1285
+ "grad_norm": 0.6192623972892761,
1286
+ "learning_rate": 1.9856770833333336e-06,
1287
+ "loss": 1.0427,
1288
+ "mean_token_accuracy": 0.7657591253519058,
1289
+ "num_tokens": 7742745.0,
1290
+ "step": 1280
1291
+ },
1292
+ {
1293
+ "entropy": 1.1026942864060403,
1294
+ "epoch": 1.628787878787879,
1295
+ "grad_norm": 0.6355161666870117,
1296
+ "learning_rate": 1.920572916666667e-06,
1297
+ "loss": 1.0701,
1298
+ "mean_token_accuracy": 0.7609635755419731,
1299
+ "num_tokens": 7802902.0,
1300
+ "step": 1290
1301
+ },
1302
+ {
1303
+ "entropy": 1.1140723824501038,
1304
+ "epoch": 1.6414141414141414,
1305
+ "grad_norm": 0.6254522800445557,
1306
+ "learning_rate": 1.8554687500000002e-06,
1307
+ "loss": 1.0789,
1308
+ "mean_token_accuracy": 0.7592580512166023,
1309
+ "num_tokens": 7863576.0,
1310
+ "step": 1300
1311
+ },
1312
+ {
1313
+ "entropy": 1.108394268155098,
1314
+ "epoch": 1.654040404040404,
1315
+ "grad_norm": 0.633172333240509,
1316
+ "learning_rate": 1.7903645833333335e-06,
1317
+ "loss": 1.0714,
1318
+ "mean_token_accuracy": 0.7598690986633301,
1319
+ "num_tokens": 7925518.0,
1320
+ "step": 1310
1321
+ },
1322
+ {
1323
+ "entropy": 1.1097407966852189,
1324
+ "epoch": 1.6666666666666665,
1325
+ "grad_norm": 0.6279735565185547,
1326
+ "learning_rate": 1.7252604166666668e-06,
1327
+ "loss": 1.0701,
1328
+ "mean_token_accuracy": 0.7611567705869675,
1329
+ "num_tokens": 7987388.0,
1330
+ "step": 1320
1331
+ },
1332
+ {
1333
+ "entropy": 1.1071213275194167,
1334
+ "epoch": 1.6792929292929293,
1335
+ "grad_norm": 0.6425778269767761,
1336
+ "learning_rate": 1.6601562500000001e-06,
1337
+ "loss": 1.0786,
1338
+ "mean_token_accuracy": 0.7576610520482063,
1339
+ "num_tokens": 8048853.0,
1340
+ "step": 1330
1341
+ },
1342
+ {
1343
+ "entropy": 1.0931925728917122,
1344
+ "epoch": 1.691919191919192,
1345
+ "grad_norm": 0.666192889213562,
1346
+ "learning_rate": 1.5950520833333336e-06,
1347
+ "loss": 1.0604,
1348
+ "mean_token_accuracy": 0.7616572439670563,
1349
+ "num_tokens": 8108967.0,
1350
+ "step": 1340
1351
+ },
1352
+ {
1353
+ "entropy": 1.0973364993929864,
1354
+ "epoch": 1.7045454545454546,
1355
+ "grad_norm": 0.6348255276679993,
1356
+ "learning_rate": 1.5299479166666667e-06,
1357
+ "loss": 1.0769,
1358
+ "mean_token_accuracy": 0.7596119627356529,
1359
+ "num_tokens": 8169700.0,
1360
+ "step": 1350
1361
+ },
1362
+ {
1363
+ "entropy": 1.1137071400880814,
1364
+ "epoch": 1.7171717171717171,
1365
+ "grad_norm": 0.6510699391365051,
1366
+ "learning_rate": 1.46484375e-06,
1367
+ "loss": 1.0731,
1368
+ "mean_token_accuracy": 0.7593250289559365,
1369
+ "num_tokens": 8229676.0,
1370
+ "step": 1360
1371
+ },
1372
+ {
1373
+ "entropy": 1.1052428260445595,
1374
+ "epoch": 1.7297979797979797,
1375
+ "grad_norm": 0.6622318625450134,
1376
+ "learning_rate": 1.3997395833333335e-06,
1377
+ "loss": 1.069,
1378
+ "mean_token_accuracy": 0.7627501472830772,
1379
+ "num_tokens": 8289396.0,
1380
+ "step": 1370
1381
+ },
1382
+ {
1383
+ "entropy": 1.0989198789000512,
1384
+ "epoch": 1.7424242424242424,
1385
+ "grad_norm": 0.6430277824401855,
1386
+ "learning_rate": 1.3346354166666666e-06,
1387
+ "loss": 1.0506,
1388
+ "mean_token_accuracy": 0.7634323209524154,
1389
+ "num_tokens": 8351152.0,
1390
+ "step": 1380
1391
+ },
1392
+ {
1393
+ "entropy": 1.0913211867213248,
1394
+ "epoch": 1.7550505050505052,
1395
+ "grad_norm": 0.639707088470459,
1396
+ "learning_rate": 1.2695312500000002e-06,
1397
+ "loss": 1.0505,
1398
+ "mean_token_accuracy": 0.763850274682045,
1399
+ "num_tokens": 8411357.0,
1400
+ "step": 1390
1401
+ },
1402
+ {
1403
+ "entropy": 1.1044820442795753,
1404
+ "epoch": 1.7676767676767677,
1405
+ "grad_norm": 0.680479109287262,
1406
+ "learning_rate": 1.2044270833333335e-06,
1407
+ "loss": 1.0756,
1408
+ "mean_token_accuracy": 0.7598197475075722,
1409
+ "num_tokens": 8471975.0,
1410
+ "step": 1400
1411
+ },
1412
+ {
1413
+ "entropy": 1.0821994885802269,
1414
+ "epoch": 1.7803030303030303,
1415
+ "grad_norm": 0.651622474193573,
1416
+ "learning_rate": 1.1393229166666668e-06,
1417
+ "loss": 1.048,
1418
+ "mean_token_accuracy": 0.7638402819633484,
1419
+ "num_tokens": 8532480.0,
1420
+ "step": 1410
1421
+ },
1422
+ {
1423
+ "entropy": 1.0933044001460075,
1424
+ "epoch": 1.7929292929292928,
1425
+ "grad_norm": 0.6294305920600891,
1426
+ "learning_rate": 1.07421875e-06,
1427
+ "loss": 1.0545,
1428
+ "mean_token_accuracy": 0.7633207753300667,
1429
+ "num_tokens": 8593538.0,
1430
+ "step": 1420
1431
+ },
1432
+ {
1433
+ "entropy": 1.1002878457307816,
1434
+ "epoch": 1.8055555555555556,
1435
+ "grad_norm": 0.6396600008010864,
1436
+ "learning_rate": 1.0091145833333334e-06,
1437
+ "loss": 1.065,
1438
+ "mean_token_accuracy": 0.7610213488340378,
1439
+ "num_tokens": 8654446.0,
1440
+ "step": 1430
1441
+ },
1442
+ {
1443
+ "entropy": 1.1098253890872,
1444
+ "epoch": 1.8181818181818183,
1445
+ "grad_norm": 0.6585692167282104,
1446
+ "learning_rate": 9.440104166666668e-07,
1447
+ "loss": 1.079,
1448
+ "mean_token_accuracy": 0.7590781077742577,
1449
+ "num_tokens": 8715305.0,
1450
+ "step": 1440
1451
+ },
1452
+ {
1453
+ "entropy": 1.0894212126731873,
1454
+ "epoch": 1.8308080808080809,
1455
+ "grad_norm": 0.6637106537818909,
1456
+ "learning_rate": 8.789062500000001e-07,
1457
+ "loss": 1.0555,
1458
+ "mean_token_accuracy": 0.7621971383690834,
1459
+ "num_tokens": 8775351.0,
1460
+ "step": 1450
1461
+ },
1462
+ {
1463
+ "entropy": 1.0821923539042473,
1464
+ "epoch": 1.8434343434343434,
1465
+ "grad_norm": 0.6491685509681702,
1466
+ "learning_rate": 8.138020833333334e-07,
1467
+ "loss": 1.0464,
1468
+ "mean_token_accuracy": 0.7639922067523003,
1469
+ "num_tokens": 8836131.0,
1470
+ "step": 1460
1471
+ },
1472
+ {
1473
+ "entropy": 1.1058385655283929,
1474
+ "epoch": 1.856060606060606,
1475
+ "grad_norm": 0.6781795024871826,
1476
+ "learning_rate": 7.486979166666668e-07,
1477
+ "loss": 1.0682,
1478
+ "mean_token_accuracy": 0.7595004603266716,
1479
+ "num_tokens": 8896334.0,
1480
+ "step": 1470
1481
+ },
1482
+ {
1483
+ "entropy": 1.0934822604060173,
1484
+ "epoch": 1.8686868686868687,
1485
+ "grad_norm": 0.652746319770813,
1486
+ "learning_rate": 6.835937500000001e-07,
1487
+ "loss": 1.0625,
1488
+ "mean_token_accuracy": 0.7614723727107048,
1489
+ "num_tokens": 8956948.0,
1490
+ "step": 1480
1491
+ },
1492
+ {
1493
+ "entropy": 1.0955146595835685,
1494
+ "epoch": 1.8813131313131313,
1495
+ "grad_norm": 0.6350075006484985,
1496
+ "learning_rate": 6.184895833333334e-07,
1497
+ "loss": 1.0677,
1498
+ "mean_token_accuracy": 0.75977371186018,
1499
+ "num_tokens": 9017330.0,
1500
+ "step": 1490
1501
+ },
1502
+ {
1503
+ "entropy": 1.1074895232915878,
1504
+ "epoch": 1.893939393939394,
1505
+ "grad_norm": 0.6651970744132996,
1506
+ "learning_rate": 5.533854166666667e-07,
1507
+ "loss": 1.0778,
1508
+ "mean_token_accuracy": 0.7592803448438644,
1509
+ "num_tokens": 9077561.0,
1510
+ "step": 1500
1511
+ },
1512
+ {
1513
+ "entropy": 1.088778705894947,
1514
+ "epoch": 1.9065656565656566,
1515
+ "grad_norm": 0.6216638684272766,
1516
+ "learning_rate": 4.8828125e-07,
1517
+ "loss": 1.0517,
1518
+ "mean_token_accuracy": 0.7628024965524673,
1519
+ "num_tokens": 9137424.0,
1520
+ "step": 1510
1521
+ },
1522
+ {
1523
+ "entropy": 1.0963940545916557,
1524
+ "epoch": 1.9191919191919191,
1525
+ "grad_norm": 0.6801443099975586,
1526
+ "learning_rate": 4.2317708333333337e-07,
1527
+ "loss": 1.0657,
1528
+ "mean_token_accuracy": 0.7611119478940964,
1529
+ "num_tokens": 9198465.0,
1530
+ "step": 1520
1531
+ },
1532
+ {
1533
+ "entropy": 1.107513566315174,
1534
+ "epoch": 1.9318181818181817,
1535
+ "grad_norm": 0.6482690572738647,
1536
+ "learning_rate": 3.5807291666666667e-07,
1537
+ "loss": 1.0657,
1538
+ "mean_token_accuracy": 0.7620691254734993,
1539
+ "num_tokens": 9258486.0,
1540
+ "step": 1530
1541
+ },
1542
+ {
1543
+ "entropy": 1.1056584566831589,
1544
+ "epoch": 1.9444444444444444,
1545
+ "grad_norm": 0.6314805746078491,
1546
+ "learning_rate": 2.9296875000000003e-07,
1547
+ "loss": 1.0758,
1548
+ "mean_token_accuracy": 0.7584919854998589,
1549
+ "num_tokens": 9319619.0,
1550
+ "step": 1540
1551
+ },
1552
+ {
1553
+ "entropy": 1.1117310538887977,
1554
+ "epoch": 1.9570707070707072,
1555
+ "grad_norm": 0.6383644938468933,
1556
+ "learning_rate": 2.2786458333333333e-07,
1557
+ "loss": 1.0745,
1558
+ "mean_token_accuracy": 0.7591810420155525,
1559
+ "num_tokens": 9380350.0,
1560
+ "step": 1550
1561
+ },
1562
+ {
1563
+ "entropy": 1.0970366701483727,
1564
+ "epoch": 1.9696969696969697,
1565
+ "grad_norm": 0.6331989169120789,
1566
+ "learning_rate": 1.627604166666667e-07,
1567
+ "loss": 1.0557,
1568
+ "mean_token_accuracy": 0.7604442983865738,
1569
+ "num_tokens": 9441627.0,
1570
+ "step": 1560
1571
+ },
1572
+ {
1573
+ "entropy": 1.0917240902781487,
1574
+ "epoch": 1.9823232323232323,
1575
+ "grad_norm": 0.6618102192878723,
1576
+ "learning_rate": 9.765625e-08,
1577
+ "loss": 1.0546,
1578
+ "mean_token_accuracy": 0.7638352930545806,
1579
+ "num_tokens": 9501728.0,
1580
+ "step": 1570
1581
+ },
1582
+ {
1583
+ "entropy": 1.096452857553959,
1584
+ "epoch": 1.9949494949494948,
1585
+ "grad_norm": 0.6413847804069519,
1586
+ "learning_rate": 3.2552083333333335e-08,
1587
+ "loss": 1.0615,
1588
+ "mean_token_accuracy": 0.7619210347533226,
1589
+ "num_tokens": 9562238.0,
1590
+ "step": 1580
1591
+ }
1592
+ ],
1593
+ "logging_steps": 10,
1594
+ "max_steps": 1584,
1595
+ "num_input_tokens_seen": 0,
1596
+ "num_train_epochs": 2,
1597
+ "save_steps": 200,
1598
+ "stateful_callbacks": {
1599
+ "TrainerControl": {
1600
+ "args": {
1601
+ "should_epoch_stop": false,
1602
+ "should_evaluate": false,
1603
+ "should_log": false,
1604
+ "should_save": true,
1605
+ "should_training_stop": true
1606
+ },
1607
+ "attributes": {}
1608
+ }
1609
+ },
1610
+ "total_flos": 5.4295385739381965e+17,
1611
+ "train_batch_size": 8,
1612
+ "trial_name": null,
1613
+ "trial_params": null
1614
+ }
checkpoint-1584/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8eeb71c14deb91ac5fd11522db45cb3275c9164415fcbefc9d00cac27a27f0a3
3
+ size 6417
sft_train.log ADDED
@@ -0,0 +1,157 @@
1
+ 2026-05-07 13:27:47,763 | INFO | Starting SFT fine-tuning job
2
+ 2026-05-07 13:27:47,763 | INFO | Output directory: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
3
+ 2026-05-07 13:27:47,763 | INFO | Base model: CohereLabs/aya-expanse-8b
4
+ 2026-05-07 13:27:47,763 | INFO | CPT adapter: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
5
+ 2026-05-07 13:27:47,763 | INFO | Dataset: Syrinesmati/tunisian-question-response-dataset
6
+ 2026-05-07 13:27:47,763 | INFO | Torch version: 2.11.0+cu130
7
+ 2026-05-07 13:27:47,804 | INFO | CUDA available: True
8
+ 2026-05-07 13:27:47,825 | INFO | GPU: NVIDIA GB10
9
+ 2026-05-07 13:27:47,826 | INFO | No Hugging Face token provided; skipping login
10
+ 2026-05-07 13:27:47,827 | INFO | trl version 1.3.0: assistant_only_loss is fully supported
11
+ 2026-05-07 13:27:47,827 | INFO | Loading configuration: batch_size=8, accumulation=4, warmup=0.03, weight_decay=0.0100
12
+ 2026-05-07 13:27:47,827 | INFO | Loading tokenizer from base model: CohereLabs/aya-expanse-8b
13
+ 2026-05-07 13:27:49,034 | INFO | Loading base model: CohereLabs/aya-expanse-8b with full bf16 precision (no quantization)
14
+ 2026-05-07 13:30:30,504 | INFO | Loading CPT adapter from: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
15
+ 2026-05-07 13:30:34,280 | INFO | Loading dataset: Syrinesmati/tunisian-question-response-dataset
16
+ 2026-05-07 13:30:40,321 | INFO | Using dataset fields: question=instruction, answer=response
17
+ 2026-05-07 13:30:40,717 | INFO | Using dataset fields: question=instruction, answer=response
18
+ 2026-05-07 13:30:40,818 | INFO | Train rows: 25340
19
+ 2026-05-07 13:30:40,819 | INFO | Eval rows: 6336
20
+ 2026-05-07 13:30:40,819 | INFO | Sample train example: [{'role': 'system', 'content': 'أنت "التيجاني"، مساعد ذكاء اصطناعي تونسي 100%. جاوب بالتونسي الدارج فقط، وبالطول المناسب للسؤال: كان يلزم قصّر، وكان يلزم فسّر أكثر. ممنوع الهلوسة أو الخروج على الموضوع.'}, {'role': 'user', 'content': 'نحس بوجيعة في العينين من كثرة غسلان الماعون بالماء السخون؟'}, {'role': 'assistant', 'content': 'البخار الصاعد من الماء السخون ينجم يتعب العين ويسبب احمرار، هذاكا علاش حاول تستعمل ماء دافي مش سخون برشة. اغسل وجهك بماء بارد بعد ما تكمل باش تبرد عينيك.'}]
21
+ 2026-05-07 13:30:40,819 | INFO | No checkpoint found in /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
22
+ 2026-05-07 13:34:03,570 | INFO | Starting SFT fine-tuning job
23
+ 2026-05-07 13:34:03,570 | INFO | Output directory: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
24
+ 2026-05-07 13:34:03,570 | INFO | Base model: CohereLabs/aya-expanse-8b
25
+ 2026-05-07 13:34:03,570 | INFO | CPT adapter: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
26
+ 2026-05-07 13:34:03,570 | INFO | Dataset: Syrinesmati/tunisian-question-response-dataset
27
+ 2026-05-07 13:34:03,570 | INFO | Torch version: 2.11.0+cu130
28
+ 2026-05-07 13:34:03,612 | INFO | CUDA available: True
29
+ 2026-05-07 13:34:03,631 | INFO | GPU: NVIDIA GB10
30
+ 2026-05-07 13:34:03,633 | INFO | No Hugging Face token provided; skipping login
31
+ 2026-05-07 13:34:03,633 | INFO | trl version 1.3.0: assistant_only_loss is fully supported
32
+ 2026-05-07 13:34:03,633 | INFO | Loading configuration: batch_size=8, accumulation=4, warmup=0.03, weight_decay=0.0100
33
+ 2026-05-07 13:34:03,633 | INFO | Loading tokenizer from base model: CohereLabs/aya-expanse-8b
34
+ 2026-05-07 13:34:04,862 | INFO | Loading base model: CohereLabs/aya-expanse-8b with full bf16 precision (no quantization)
35
+ 2026-05-07 13:36:22,953 | INFO | Loading CPT adapter from: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
36
+ 2026-05-07 13:36:26,760 | INFO | Loading dataset: Syrinesmati/tunisian-question-response-dataset
37
+ 2026-05-07 13:36:28,674 | INFO | Using dataset fields: question=instruction, answer=response
38
+ 2026-05-07 13:36:28,675 | INFO | Using dataset fields: question=instruction, answer=response
39
+ 2026-05-07 13:36:28,676 | INFO | Train rows: 25340
40
+ 2026-05-07 13:36:28,676 | INFO | Eval rows: 6336
41
+ 2026-05-07 13:36:28,677 | INFO | Sample train example: [{'role': 'system', 'content': 'أنت "التيجاني"، مساعد ذكاء اصطناعي تونسي 100%. جاوب بالتونسي الدارج فقط، وبالطول المناسب للسؤال: كان يلزم قصّر، وكان يلزم فسّر أكثر. ممنوع الهلوسة أو الخروج على الموضوع.'}, {'role': 'user', 'content': 'نحس بوجيعة في العينين من كثرة غسلان الماعون بالماء السخون؟'}, {'role': 'assistant', 'content': 'البخار الصاعد من الماء السخون ينجم يتعب العين ويسبب احمرار، هذاكا علاش حاول تستعمل ماء دافي مش سخون برشة. اغسل وجهك بماء بارد بعد ما تكمل باش تبرد عينيك.'}]
42
+ 2026-05-07 13:36:28,677 | INFO | No checkpoint found in /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
43
+ 2026-05-07 13:41:24,497 | INFO | Starting SFT fine-tuning job
44
+ 2026-05-07 13:41:24,497 | INFO | Output directory: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
45
+ 2026-05-07 13:41:24,497 | INFO | Base model: CohereLabs/aya-expanse-8b
46
+ 2026-05-07 13:41:24,497 | INFO | CPT adapter: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
47
+ 2026-05-07 13:41:24,497 | INFO | Dataset: Syrinesmati/tunisian-question-response-dataset
48
+ 2026-05-07 13:41:24,497 | INFO | Torch version: 2.11.0+cu130
49
+ 2026-05-07 13:41:24,548 | INFO | CUDA available: True
50
+ 2026-05-07 13:41:24,576 | INFO | GPU: NVIDIA GB10
51
+ 2026-05-07 13:41:24,577 | INFO | No Hugging Face token provided; skipping login
52
+ 2026-05-07 13:41:24,578 | INFO | trl version 1.3.0: assistant_only_loss is fully supported
53
+ 2026-05-07 13:41:24,578 | INFO | Loading configuration: batch_size=8, accumulation=4, warmup=0.03, weight_decay=0.0100
54
+ 2026-05-07 13:41:24,578 | INFO | Loading tokenizer from base model: CohereLabs/aya-expanse-8b
55
+ 2026-05-07 13:41:25,691 | INFO | Loading base model: CohereLabs/aya-expanse-8b with full bf16 precision (no quantization)
56
+ 2026-05-07 13:44:02,778 | INFO | Loading CPT adapter from: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
57
+ 2026-05-07 13:44:06,613 | INFO | Loading dataset: Syrinesmati/tunisian-question-response-dataset
58
+ 2026-05-07 13:44:08,694 | INFO | Using dataset fields: question=instruction, answer=response
59
+ 2026-05-07 13:44:08,697 | INFO | Using dataset fields: question=instruction, answer=response
60
+ 2026-05-07 13:44:08,698 | INFO | Train rows: 25340
61
+ 2026-05-07 13:44:08,699 | INFO | Eval rows: 6336
62
+ 2026-05-07 13:44:08,699 | INFO | Sample train example: [{'role': 'system', 'content': 'أنت "التيجاني"، مساعد ذكاء اصطناعي تونسي 100%. جاوب بالتونسي الدارج فقط، وبالطول المناسب للسؤال: كان يلزم قصّر، وكان يلزم فسّر أكثر. ممنوع الهلوسة أو الخروج على الموضوع.'}, {'role': 'user', 'content': 'نحس بوجيعة في العينين من كثرة غسلان الماعون بالماء السخون؟'}, {'role': 'assistant', 'content': 'البخار الصاعد من الماء السخون ينجم يتعب العين ويسبب احمرار، هذاكا علاش حاول تستعمل ماء دافي مش سخون برشة. اغسل وجهك بماء بارد بعد ما تكمل باش تبرد عينيك.'}]
63
+ 2026-05-07 13:44:08,699 | INFO | No checkpoint found in /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
64
+ 2026-05-07 13:44:08,752 | INFO | Training started
65
+ 2026-05-07 13:46:02,478 | INFO | Starting SFT fine-tuning job
66
+ 2026-05-07 13:46:02,479 | INFO | Output directory: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
67
+ 2026-05-07 13:46:02,479 | INFO | Base model: CohereLabs/aya-expanse-8b
68
+ 2026-05-07 13:46:02,479 | INFO | CPT adapter: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
69
+ 2026-05-07 13:46:02,479 | INFO | Dataset: Syrinesmati/tunisian-question-response-dataset
70
+ 2026-05-07 13:46:02,479 | INFO | Torch version: 2.11.0+cu130
71
+ 2026-05-07 13:46:02,523 | INFO | CUDA available: True
72
+ 2026-05-07 13:46:02,552 | INFO | GPU: NVIDIA GB10
73
+ 2026-05-07 13:46:02,554 | INFO | No Hugging Face token provided; skipping login
74
+ 2026-05-07 13:46:02,555 | INFO | trl version 1.3.0: assistant_only_loss is fully supported
75
+ 2026-05-07 13:46:02,555 | INFO | Loading configuration: batch_size=8, accumulation=4, warmup=0.03, weight_decay=0.0100
76
+ 2026-05-07 13:46:02,555 | INFO | Loading tokenizer from base model: CohereLabs/aya-expanse-8b
77
+ 2026-05-07 13:46:03,698 | INFO | Loading base model: CohereLabs/aya-expanse-8b with full bf16 precision (no quantization)
78
+ 2026-05-07 13:48:19,475 | INFO | Loading CPT adapter from: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
79
+ 2026-05-07 13:48:23,292 | INFO | Loading dataset: Syrinesmati/tunisian-question-response-dataset
80
+ 2026-05-07 13:48:25,199 | INFO | Using dataset fields: question=instruction, answer=response
81
+ 2026-05-07 13:48:25,202 | INFO | Using dataset fields: question=instruction, answer=response
82
+ 2026-05-07 13:48:25,203 | INFO | Train rows: 25340
83
+ 2026-05-07 13:48:25,203 | INFO | Eval rows: 6336
84
+ 2026-05-07 13:48:25,204 | INFO | Sample train example: [{'role': 'system', 'content': 'أنت "التيجاني"، مساعد ذكاء اصطناعي تونسي 100%. جاوب بالتونسي الدارج فقط، وبالطول المناسب للسؤال: كان يلزم قصّر، وكان يلزم فسّر أكثر. ممنوع الهلوسة أو الخروج على الموضوع.'}, {'role': 'user', 'content': 'نحس بوجيعة في العينين من كثرة غسلان الماعون بالماء السخون؟'}, {'role': 'assistant', 'content': 'البخار الصاعد من الماء السخون ينجم يتعب العين ويسبب احمرار، هذاكا علاش حاول تستعمل ماء دافي مش سخون برشة. اغسل وجهك بماء بارد بعد ما تكمل باش تبرد عينيك.'}]
85
+ 2026-05-07 13:48:25,204 | INFO | No checkpoint found in /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
86
+ 2026-05-07 13:48:25,244 | INFO | Training started
87
+ 2026-05-07 13:55:23,761 | INFO | Starting SFT fine-tuning job
88
+ 2026-05-07 13:55:23,761 | INFO | Output directory: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
89
+ 2026-05-07 13:55:23,761 | INFO | Base model: CohereLabs/aya-expanse-8b
90
+ 2026-05-07 13:55:23,761 | INFO | CPT adapter: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
91
+ 2026-05-07 13:55:23,761 | INFO | Dataset: Syrinesmati/tunisian-question-response-dataset
92
+ 2026-05-07 13:55:23,761 | INFO | Torch version: 2.11.0+cu130
93
+ 2026-05-07 13:55:23,803 | INFO | CUDA available: True
94
+ 2026-05-07 13:55:23,826 | INFO | GPU: NVIDIA GB10
95
+ 2026-05-07 13:55:23,828 | INFO | No Hugging Face token provided; skipping login
96
+ 2026-05-07 13:55:23,829 | INFO | trl version 1.3.0: assistant_only_loss is fully supported
97
+ 2026-05-07 13:55:23,829 | INFO | Loading configuration: batch_size=8, accumulation=4, warmup=0.03, weight_decay=0.0100
98
+ 2026-05-07 13:55:23,829 | INFO | Loading tokenizer from base model: CohereLabs/aya-expanse-8b
99
+ 2026-05-07 13:55:24,929 | INFO | Loading base model: CohereLabs/aya-expanse-8b with full bf16 precision (no quantization)
100
+ 2026-05-07 13:58:01,693 | INFO | Loading CPT adapter from: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
101
+ 2026-05-07 13:58:05,420 | INFO | Loading dataset: Syrinesmati/tunisian-question-response-dataset
102
+ 2026-05-07 13:58:07,816 | INFO | Using dataset fields: question=instruction, answer=response
103
+ 2026-05-07 13:58:08,954 | INFO | Using dataset fields: question=instruction, answer=response
104
+ 2026-05-07 13:58:09,429 | INFO | Train rows: 25340
105
+ 2026-05-07 13:58:09,430 | INFO | Eval rows: 6336
106
+ 2026-05-07 13:58:09,430 | INFO | Sample train example text: <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>أنت "التيجاني"، مساعد ذكاء اصطناعي تونسي 100%. جاوب بالتونسي الدارج فقط، وبالطول المناسب للسؤال: كان يلزم قصّر، وكان يلزم فسّر أكثر. ممنوع الهلوسة أو الخروج على الموضوع.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>نحس بوجيعة في العينين من كثرة غسلان الماعون بالماء السخون؟<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>البخار الصاعد من الماء السخون ينجم يتعب العين ويسبب احمرار، هذاكا علاش حاول تستعمل ماء دافي مش سخون برشة. اغسل وجهك بماء بارد بعد ما تكمل باش تبرد عينيك.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
107
+ 2026-05-07 13:58:09,430 | INFO | No checkpoint found in /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
108
+ 2026-05-07 13:59:25,638 | INFO | Starting SFT fine-tuning job
109
+ 2026-05-07 13:59:25,639 | INFO | Output directory: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
110
+ 2026-05-07 13:59:25,639 | INFO | Base model: CohereLabs/aya-expanse-8b
111
+ 2026-05-07 13:59:25,639 | INFO | CPT adapter: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
112
+ 2026-05-07 13:59:25,639 | INFO | Dataset: Syrinesmati/tunisian-question-response-dataset
113
+ 2026-05-07 13:59:25,639 | INFO | Torch version: 2.11.0+cu130
114
+ 2026-05-07 13:59:25,683 | INFO | CUDA available: True
115
+ 2026-05-07 13:59:25,709 | INFO | GPU: NVIDIA GB10
116
+ 2026-05-07 13:59:25,710 | INFO | No Hugging Face token provided; skipping login
117
+ 2026-05-07 13:59:25,711 | INFO | trl version 1.3.0: assistant_only_loss is fully supported
118
+ 2026-05-07 13:59:25,711 | INFO | Loading configuration: batch_size=8, accumulation=4, warmup=0.03, weight_decay=0.0100
119
+ 2026-05-07 13:59:25,711 | INFO | Loading tokenizer from base model: CohereLabs/aya-expanse-8b
120
+ 2026-05-07 13:59:26,819 | INFO | Loading base model: CohereLabs/aya-expanse-8b with full bf16 precision (no quantization)
121
+ 2026-05-07 14:02:05,419 | INFO | Loading CPT adapter from: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
122
+ 2026-05-07 14:02:09,265 | INFO | Loading dataset: Syrinesmati/tunisian-question-response-dataset
123
+ 2026-05-07 14:02:11,637 | INFO | Using dataset fields: question=instruction, answer=response
124
+ 2026-05-07 14:02:11,918 | INFO | Using dataset fields: question=instruction, answer=response
125
+ 2026-05-07 14:02:12,159 | INFO | Train rows: 25340
126
+ 2026-05-07 14:02:12,160 | INFO | Eval rows: 6336
127
+ 2026-05-07 14:02:12,160 | INFO | Sample train example text: <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>أنت "التيجاني"، مساعد ذكاء اصطناعي تونسي 100%. جاوب بالتونسي الدارج فقط، وبالطول المناسب للسؤال: كان يلزم قصّر، وكان يلزم فسّر أكثر. ممنوع الهلوسة أو الخروج على الموضوع.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>نحس بوجيعة في العينين من كثرة غسلان الماعون بالماء السخون؟<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>البخار الصاعد من الماء السخون ينجم يتعب العين ويسبب احمرار، هذاكا علاش حاول تستعمل ماء دافي مش سخون برشة. اغسل وجهك بماء بارد بعد ما تكمل باش تبرد عينيك.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
128
+ 2026-05-07 14:02:12,161 | INFO | No checkpoint found in /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
129
+ 2026-05-07 14:02:23,459 | INFO | Training started
130
+ 2026-05-07 14:08:07,243 | INFO | Starting SFT fine-tuning job
131
+ 2026-05-07 14:08:07,244 | INFO | Output directory: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
132
+ 2026-05-07 14:08:07,244 | INFO | Base model: CohereLabs/aya-expanse-8b
133
+ 2026-05-07 14:08:07,244 | INFO | CPT adapter: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
134
+ 2026-05-07 14:08:07,244 | INFO | Dataset: Syrinesmati/tunisian-question-response-dataset
135
+ 2026-05-07 14:08:07,244 | INFO | Torch version: 2.11.0+cu130
136
+ 2026-05-07 14:08:07,286 | INFO | CUDA available: True
137
+ 2026-05-07 14:08:07,309 | INFO | GPU: NVIDIA GB10
138
+ 2026-05-07 14:08:07,499 | INFO | Hugging Face login succeeded
139
+ 2026-05-07 14:08:07,500 | INFO | trl version 1.3.0: assistant_only_loss is fully supported
140
+ 2026-05-07 14:08:07,500 | INFO | Loading configuration: batch_size=8, accumulation=4, warmup=0.03, weight_decay=0.0100
141
+ 2026-05-07 14:08:07,500 | INFO | Loading tokenizer from base model: CohereLabs/aya-expanse-8b
142
+ 2026-05-07 14:08:08,618 | INFO | Loading base model: CohereLabs/aya-expanse-8b with full bf16 precision (no quantization)
143
+ 2026-05-07 14:10:19,279 | INFO | Loading CPT adapter from: /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-cpt-tunisian
144
+ 2026-05-07 14:10:23,076 | INFO | Loading dataset: Syrinesmati/tunisian-question-response-dataset
145
+ 2026-05-07 14:10:25,353 | INFO | Using dataset fields: question=instruction, answer=response
146
+ 2026-05-07 14:10:25,621 | INFO | Using dataset fields: question=instruction, answer=response
147
+ 2026-05-07 14:10:25,861 | INFO | Train rows: 25340
148
+ 2026-05-07 14:10:25,862 | INFO | Eval rows: 6336
149
+ 2026-05-07 14:10:25,862 | INFO | Sample train example text: <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>أنت "التيجاني"، مساعد ذكاء اصطناعي تونسي 100%. جاوب بالتونسي الدارج فقط، وبالطول المناسب للسؤال: كان يلزم قصّر، وكان يلزم فسّر أكثر. ممنوع الهلوسة أو الخروج على الموضوع.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>نحس بوجيعة في العينين من كثرة غسلان الماعون بالماء السخون؟<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>البخار الصاعد من الماء السخون ينجم يتعب العين ويسبب احمرار، هذاكا علاش حاول تستعمل ماء دافي مش سخون برشة. اغسل وجهك بماء بارد بعد ما تكمل باش تبرد عينيك.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
150
+ 2026-05-07 14:10:25,862 | INFO | No checkpoint found in /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
151
+ 2026-05-07 14:10:29,187 | INFO | Training started
152
+ 2026-05-08 04:09:43,113 | INFO | Training finished
153
+ 2026-05-08 04:09:43,113 | INFO | Saving model and tokenizer to /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft
154
+ 2026-05-08 04:09:45,243 | INFO | Saved training metrics to /home/ala/TunisianDialogSystem/outputs/checkpoints/aya-expanse-8b-tunisian-sft/training_metrics.json
+ 2026-05-08 04:09:45,243 | INFO | Running preview generation on a Tunisian prompt
+ 2026-05-08 04:09:48,487 | INFO | Preview prompt: عسلامة، شنوة تنصحني نعمل كي نكون تعبان وبرشة؟
+ 2026-05-08 04:09:48,487 | INFO | Preview output: <|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>أنت "التيجاني"، مساعد ذكاء اصطناعي تونسي 100%. جاوب بالتونسي الدارج فقط، وبالطول المناسب للسؤال: كان يلزم قصّر، وكان يلزم فسّر أكثر. ممنوع الهلوسة أو الخروج على الموضوع.<|START_OF_TURN_TOKEN|><|USER_TOKEN|>عسلامة، شنوة تنصحني نعمل كي نكون تعبان وبرشة؟<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>ح في في في في في في في في في في<|START_OF_TURN_TOKEN|>
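The sample training text logged above ends with an open `<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>` after the assistant's closing `<|END_OF_TURN_TOKEN|>`, and the preview output collapses into a single repeated word. This log does not establish whether the two are related. The sketch below shows one way such strings could be screened before and after training. The helper names `has_dangling_generation_prompt` and `longest_repeat_run` are hypothetical and do not come from the training script.

```python
# Hypothetical sanity checks for chat-formatted strings (not part of the logged script).
GEN_PROMPT = "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"


def has_dangling_generation_prompt(text: str) -> bool:
    """Return True if the text ends with an assistant turn that was opened but left empty."""
    return text.rstrip().endswith(GEN_PROMPT)


def longest_repeat_run(text: str) -> int:
    """Return the length of the longest run of identical consecutive whitespace-separated words."""
    longest = 0
    current = 0
    previous = None
    for word in text.split():
        current = current + 1 if word == previous else 1
        longest = max(longest, current)
        previous = word
    return longest


# Illustrative strings only; the real inputs would be training examples or generations.
train_text = (
    "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>question<|END_OF_TURN_TOKEN|>"
    "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>answer<|END_OF_TURN_TOKEN|>"
    "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
)
print(has_dangling_generation_prompt(train_text))
print(longest_repeat_run("a b b b c"))
```

A training example flagged by the first check, or a generation with a long repeat run, would be worth inspecting by hand. The thresholds to apply are a judgment call.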
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": {
+     "content": "<BOS_TOKEN>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|END_OF_TURN_TOKEN|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<PAD>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:345ccf04a5257f473e331715ecc69365c5ac8fc2490923fe7155560af809ec1a
+ size 20124090
tokenizer_config.json ADDED
@@ -0,0 +1,317 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": false,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<PAD>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<UNK>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "<CLS>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "3": {
31
+ "content": "<SEP>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "4": {
39
+ "content": "<MASK_TOKEN>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "5": {
47
+ "content": "<BOS_TOKEN>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "6": {
55
+ "content": "<EOS_TOKEN>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "7": {
63
+ "content": "<EOP_TOKEN>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "255000": {
71
+ "content": "<|START_OF_TURN_TOKEN|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": false
77
+ },
78
+ "255001": {
79
+ "content": "<|END_OF_TURN_TOKEN|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "255002": {
87
+ "content": "<|YES_TOKEN|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": false
93
+ },
94
+ "255003": {
95
+ "content": "<|NO_TOKEN|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": false
101
+ },
102
+ "255004": {
103
+ "content": "<|GOOD_TOKEN|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": false
109
+ },
110
+ "255005": {
111
+ "content": "<|BAD_TOKEN|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": false
117
+ },
118
+ "255006": {
119
+ "content": "<|USER_TOKEN|>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": false
125
+ },
126
+ "255007": {
127
+ "content": "<|CHATBOT_TOKEN|>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": false
133
+ },
134
+ "255008": {
135
+ "content": "<|SYSTEM_TOKEN|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": false
141
+ },
142
+ "255009": {
143
+ "content": "<|USER_0_TOKEN|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": false
149
+ },
150
+ "255010": {
151
+ "content": "<|USER_1_TOKEN|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": false
157
+ },
158
+ "255011": {
159
+ "content": "<|USER_2_TOKEN|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": false
165
+ },
166
+ "255012": {
167
+ "content": "<|USER_3_TOKEN|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": false
173
+ },
174
+ "255013": {
175
+ "content": "<|USER_4_TOKEN|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": false
181
+ },
182
+ "255014": {
183
+ "content": "<|USER_5_TOKEN|>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": false
189
+ },
190
+ "255015": {
191
+ "content": "<|USER_6_TOKEN|>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": false
197
+ },
198
+ "255016": {
199
+ "content": "<|USER_7_TOKEN|>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": false
205
+ },
206
+ "255017": {
207
+ "content": "<|USER_8_TOKEN|>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": false
213
+ },
214
+ "255018": {
215
+ "content": "<|USER_9_TOKEN|>",
216
+ "lstrip": false,
217
+ "normalized": false,
218
+ "rstrip": false,
219
+ "single_word": false,
220
+ "special": false
221
+ },
222
+ "255019": {
223
+ "content": "<|EXTRA_0_TOKEN|>",
224
+ "lstrip": false,
225
+ "normalized": false,
226
+ "rstrip": false,
227
+ "single_word": false,
228
+ "special": false
229
+ },
230
+ "255020": {
231
+ "content": "<|EXTRA_1_TOKEN|>",
232
+ "lstrip": false,
233
+ "normalized": false,
234
+ "rstrip": false,
235
+ "single_word": false,
236
+ "special": false
237
+ },
238
+ "255021": {
239
+ "content": "<|EXTRA_2_TOKEN|>",
240
+ "lstrip": false,
241
+ "normalized": false,
242
+ "rstrip": false,
243
+ "single_word": false,
244
+ "special": false
245
+ },
246
+ "255022": {
247
+ "content": "<|EXTRA_3_TOKEN|>",
248
+ "lstrip": false,
249
+ "normalized": false,
250
+ "rstrip": false,
251
+ "single_word": false,
252
+ "special": false
253
+ },
254
+ "255023": {
255
+ "content": "<|EXTRA_4_TOKEN|>",
256
+ "lstrip": false,
257
+ "normalized": false,
258
+ "rstrip": false,
259
+ "single_word": false,
260
+ "special": false
261
+ },
262
+ "255024": {
263
+ "content": "<|EXTRA_5_TOKEN|>",
264
+ "lstrip": false,
265
+ "normalized": false,
266
+ "rstrip": false,
267
+ "single_word": false,
268
+ "special": false
269
+ },
270
+ "255025": {
271
+ "content": "<|EXTRA_6_TOKEN|>",
272
+ "lstrip": false,
273
+ "normalized": false,
274
+ "rstrip": false,
275
+ "single_word": false,
276
+ "special": false
277
+ },
278
+ "255026": {
279
+ "content": "<|EXTRA_7_TOKEN|>",
280
+ "lstrip": false,
281
+ "normalized": false,
282
+ "rstrip": false,
283
+ "single_word": false,
284
+ "special": false
285
+ },
286
+ "255027": {
287
+ "content": "<|EXTRA_8_TOKEN|>",
288
+ "lstrip": false,
289
+ "normalized": false,
290
+ "rstrip": false,
291
+ "single_word": false,
292
+ "special": false
293
+ },
294
+ "255028": {
295
+ "content": "<|EXTRA_9_TOKEN|>",
296
+ "lstrip": false,
297
+ "normalized": false,
298
+ "rstrip": false,
299
+ "single_word": false,
300
+ "special": false
301
+ }
302
+ },
303
+ "bos_token": "<BOS_TOKEN>",
304
+ "clean_up_tokenization_spaces": false,
305
+ "eos_token": "<|END_OF_TURN_TOKEN|>",
306
+ "extra_special_tokens": {},
307
+ "legacy": true,
308
+ "merges_file": null,
309
+ "model_max_length": 1000000000000000019884624838656,
310
+ "pad_token": "<PAD>",
311
+ "sp_model_kwargs": {},
312
+ "spaces_between_special_tokens": false,
313
+ "tokenizer_class": "CohereTokenizer",
314
+ "unk_token": null,
315
+ "use_default_system_prompt": false,
316
+ "vocab_file": null
317
+ }
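In the `added_tokens_decoder` above, `<|END_OF_TURN_TOKEN|>` is flagged `"special": true` while `<|START_OF_TURN_TOKEN|>` and the role tokens are flagged `"special": false`. This would be consistent with the preview output in the log still showing `<|START_OF_TURN_TOKEN|>` but no `<|END_OF_TURN_TOKEN|>`, if the script decoded with special tokens skipped; the diff does not confirm that. The sketch below groups added tokens by that flag, using only the standard library. `CONFIG_EXCERPT` is a trimmed copy of a few entries and `split_by_special` is a hypothetical helper name.

```python
import json

# Trimmed excerpt of tokenizer_config.json; only the fields used below are kept.
CONFIG_EXCERPT = """
{
  "added_tokens_decoder": {
    "5": {"content": "<BOS_TOKEN>", "special": true},
    "255000": {"content": "<|START_OF_TURN_TOKEN|>", "special": false},
    "255001": {"content": "<|END_OF_TURN_TOKEN|>", "special": true},
    "255007": {"content": "<|CHATBOT_TOKEN|>", "special": false}
  },
  "eos_token": "<|END_OF_TURN_TOKEN|>"
}
"""


def split_by_special(config: dict) -> tuple[list[str], list[str]]:
    """Return (special, regular) token contents, ordered by numeric token id."""
    special, regular = [], []
    entries = sorted(config["added_tokens_decoder"].items(), key=lambda kv: int(kv[0]))
    for _token_id, entry in entries:
        (special if entry["special"] else regular).append(entry["content"])
    return special, regular


config = json.loads(CONFIG_EXCERPT)
special, regular = split_by_special(config)
print("special:", special)
print("regular:", regular)
```

Against the real file, `config` would come from `json.load` on `tokenizer_config.json` instead of the inline excerpt.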
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8eeb71c14deb91ac5fd11522db45cb3275c9164415fcbefc9d00cac27a27f0a3
+ size 6417
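The `learning_rate` values in the `training_metrics.json` diff that follows fit a linear warmup followed by a linear decay. The sketch below reproduces them under assumed settings inferred from the logged numbers, not read from `training_args.bin`: a peak of 1e-5, 48 warmup steps, and 1584 total steps (matching the final `checkpoint-1584`). It also assumes the value logged at step `s` is the schedule at index `s - 1`. The function name `linear_warmup_decay` is hypothetical.

```python
def linear_warmup_decay(step_index: int,
                        peak_lr: float = 1e-5,      # assumed peak, inferred from the log
                        warmup_steps: int = 48,     # assumed, inferred from the log
                        total_steps: int = 1584) -> float:  # assumed, matches checkpoint-1584
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step_index < warmup_steps:
        return peak_lr * step_index / warmup_steps
    remaining = max(0, total_steps - step_index)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Compare against a few logged (step, learning_rate) pairs from training_metrics.json.
logged = {
    10: 1.8750000000000003e-06,
    40: 8.125000000000001e-06,
    50: 9.993489583333334e-06,
    400: 7.71484375e-06,
}
for step, lr in logged.items():
    print(step, lr, linear_warmup_decay(step - 1))
```

The logged `epoch` values follow the same bookkeeping: step 10 is logged as epoch 0.0126..., which equals 10/792, so 1584 steps correspond to two epochs under this reading.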
training_metrics.json ADDED
@@ -0,0 +1,1600 @@
1
+ {
2
+ "log_history": [
3
+ {
4
+ "loss": 3.6021,
5
+ "grad_norm": 4.66565465927124,
6
+ "learning_rate": 1.8750000000000003e-06,
7
+ "entropy": 2.4040649354457857,
8
+ "num_tokens": 61199.0,
9
+ "mean_token_accuracy": 0.4221017129719257,
10
+ "epoch": 0.012626262626262626,
11
+ "step": 10
12
+ },
13
+ {
14
+ "loss": 3.3432,
15
+ "grad_norm": 3.8161869049072266,
16
+ "learning_rate": 3.958333333333333e-06,
17
+ "entropy": 2.3836746215820312,
18
+ "num_tokens": 122423.0,
19
+ "mean_token_accuracy": 0.44042530804872515,
20
+ "epoch": 0.025252525252525252,
21
+ "step": 20
22
+ },
23
+ {
24
+ "loss": 2.9033,
25
+ "grad_norm": 3.8800699710845947,
26
+ "learning_rate": 6.041666666666667e-06,
27
+ "entropy": 2.355724626779556,
28
+ "num_tokens": 182649.0,
29
+ "mean_token_accuracy": 0.48426677361130716,
30
+ "epoch": 0.03787878787878788,
31
+ "step": 30
32
+ },
33
+ {
34
+ "loss": 2.356,
35
+ "grad_norm": 2.8217720985412598,
36
+ "learning_rate": 8.125000000000001e-06,
37
+ "entropy": 2.092331054806709,
38
+ "num_tokens": 243049.0,
39
+ "mean_token_accuracy": 0.5772452697157859,
40
+ "epoch": 0.050505050505050504,
41
+ "step": 40
42
+ },
43
+ {
44
+ "loss": 1.8899,
45
+ "grad_norm": 1.4623568058013916,
46
+ "learning_rate": 9.993489583333334e-06,
47
+ "entropy": 1.6766322344541549,
48
+ "num_tokens": 304326.0,
49
+ "mean_token_accuracy": 0.6480962842702865,
50
+ "epoch": 0.06313131313131314,
51
+ "step": 50
52
+ },
53
+ {
54
+ "loss": 1.677,
55
+ "grad_norm": 1.171562671661377,
56
+ "learning_rate": 9.928385416666668e-06,
57
+ "entropy": 1.5568815559148788,
58
+ "num_tokens": 364866.0,
59
+ "mean_token_accuracy": 0.6776855796575546,
60
+ "epoch": 0.07575757575757576,
61
+ "step": 60
62
+ },
63
+ {
64
+ "loss": 1.5337,
65
+ "grad_norm": 0.9904961585998535,
66
+ "learning_rate": 9.863281250000001e-06,
67
+ "entropy": 1.48199902176857,
68
+ "num_tokens": 423478.0,
69
+ "mean_token_accuracy": 0.697019773721695,
70
+ "epoch": 0.08838383838383838,
71
+ "step": 70
72
+ },
73
+ {
74
+ "loss": 1.4953,
75
+ "grad_norm": 0.9454260468482971,
76
+ "learning_rate": 9.798177083333335e-06,
77
+ "entropy": 1.497376424074173,
78
+ "num_tokens": 483659.0,
79
+ "mean_token_accuracy": 0.6976823702454567,
80
+ "epoch": 0.10101010101010101,
81
+ "step": 80
82
+ },
83
+ {
84
+ "loss": 1.4356,
85
+ "grad_norm": 0.8955270648002625,
86
+ "learning_rate": 9.733072916666667e-06,
87
+ "entropy": 1.4664768785238267,
88
+ "num_tokens": 544453.0,
89
+ "mean_token_accuracy": 0.7069446608424187,
90
+ "epoch": 0.11363636363636363,
91
+ "step": 90
92
+ },
93
+ {
94
+ "loss": 1.4106,
95
+ "grad_norm": 0.9242203235626221,
96
+ "learning_rate": 9.66796875e-06,
97
+ "entropy": 1.4269085675477982,
98
+ "num_tokens": 604546.0,
99
+ "mean_token_accuracy": 0.7122392952442169,
100
+ "epoch": 0.12626262626262627,
101
+ "step": 100
102
+ },
103
+ {
104
+ "loss": 1.3487,
105
+ "grad_norm": 0.8968560695648193,
106
+ "learning_rate": 9.602864583333335e-06,
107
+ "entropy": 1.4060751020908355,
108
+ "num_tokens": 664860.0,
109
+ "mean_token_accuracy": 0.7178552970290184,
110
+ "epoch": 0.1388888888888889,
111
+ "step": 110
112
+ },
113
+ {
114
+ "loss": 1.3347,
115
+ "grad_norm": 0.9047113656997681,
116
+ "learning_rate": 9.537760416666667e-06,
117
+ "entropy": 1.4018951296806335,
118
+ "num_tokens": 725022.0,
119
+ "mean_token_accuracy": 0.7208079636096955,
120
+ "epoch": 0.15151515151515152,
121
+ "step": 120
122
+ },
123
+ {
124
+ "loss": 1.3155,
125
+ "grad_norm": 0.8915444016456604,
126
+ "learning_rate": 9.47265625e-06,
127
+ "entropy": 1.3809731483459473,
128
+ "num_tokens": 785586.0,
129
+ "mean_token_accuracy": 0.7267520889639855,
130
+ "epoch": 0.16414141414141414,
131
+ "step": 130
132
+ },
133
+ {
134
+ "loss": 1.3016,
135
+ "grad_norm": 0.8574295043945312,
136
+ "learning_rate": 9.407552083333334e-06,
137
+ "entropy": 1.3699676394462585,
138
+ "num_tokens": 845790.0,
139
+ "mean_token_accuracy": 0.7266673430800438,
140
+ "epoch": 0.17676767676767677,
141
+ "step": 140
142
+ },
143
+ {
144
+ "loss": 1.2823,
145
+ "grad_norm": 0.8231800198554993,
146
+ "learning_rate": 9.342447916666668e-06,
147
+ "entropy": 1.3425012439489366,
148
+ "num_tokens": 905842.0,
149
+ "mean_token_accuracy": 0.7277117937803268,
150
+ "epoch": 0.1893939393939394,
151
+ "step": 150
152
+ },
153
+ {
154
+ "loss": 1.2917,
155
+ "grad_norm": 0.8166369795799255,
156
+ "learning_rate": 9.277343750000001e-06,
157
+ "entropy": 1.3314216613769532,
158
+ "num_tokens": 966487.0,
159
+ "mean_token_accuracy": 0.7278609350323677,
160
+ "epoch": 0.20202020202020202,
161
+ "step": 160
162
+ },
163
+ {
164
+ "loss": 1.2481,
165
+ "grad_norm": 0.7738587260246277,
166
+ "learning_rate": 9.212239583333335e-06,
167
+ "entropy": 1.30389544069767,
168
+ "num_tokens": 1025558.0,
169
+ "mean_token_accuracy": 0.7335339426994324,
170
+ "epoch": 0.21464646464646464,
171
+ "step": 170
172
+ },
173
+ {
174
+ "loss": 1.2643,
175
+ "grad_norm": 0.7718328833580017,
176
+ "learning_rate": 9.147135416666667e-06,
177
+ "entropy": 1.311449444293976,
178
+ "num_tokens": 1086677.0,
179
+ "mean_token_accuracy": 0.7285080313682556,
180
+ "epoch": 0.22727272727272727,
181
+ "step": 180
182
+ },
183
+ {
184
+ "loss": 1.2641,
185
+ "grad_norm": 0.7341915369033813,
186
+ "learning_rate": 9.082031250000001e-06,
187
+ "entropy": 1.3079361200332642,
188
+ "num_tokens": 1147885.0,
189
+ "mean_token_accuracy": 0.729550538957119,
190
+ "epoch": 0.2398989898989899,
191
+ "step": 190
192
+ },
193
+ {
194
+ "loss": 1.2397,
195
+ "grad_norm": 0.748540997505188,
196
+ "learning_rate": 9.016927083333335e-06,
197
+ "entropy": 1.2794219702482224,
198
+ "num_tokens": 1207321.0,
199
+ "mean_token_accuracy": 0.7351120054721832,
200
+ "epoch": 0.25252525252525254,
201
+ "step": 200
202
+ },
203
+ {
204
+ "loss": 1.2489,
205
+ "grad_norm": 0.7585553526878357,
206
+ "learning_rate": 8.951822916666667e-06,
207
+ "entropy": 1.294454461336136,
208
+ "num_tokens": 1267837.0,
209
+ "mean_token_accuracy": 0.7322148531675339,
210
+ "epoch": 0.26515151515151514,
211
+ "step": 210
212
+ },
213
+ {
214
+ "loss": 1.2383,
215
+ "grad_norm": 0.6937864422798157,
216
+ "learning_rate": 8.88671875e-06,
217
+ "entropy": 1.2862246632575989,
218
+ "num_tokens": 1328436.0,
219
+ "mean_token_accuracy": 0.7358245223760604,
220
+ "epoch": 0.2777777777777778,
221
+ "step": 220
222
+ },
223
+ {
224
+ "loss": 1.2007,
225
+ "grad_norm": 0.6792387366294861,
226
+ "learning_rate": 8.821614583333334e-06,
227
+ "entropy": 1.2531811505556107,
228
+ "num_tokens": 1389877.0,
229
+ "mean_token_accuracy": 0.7360369265079498,
230
+ "epoch": 0.2904040404040404,
231
+ "step": 230
232
+ },
233
+ {
234
+ "loss": 1.2474,
235
+ "grad_norm": 0.6865427494049072,
236
+ "learning_rate": 8.756510416666666e-06,
237
+ "entropy": 1.2815489560365676,
238
+ "num_tokens": 1450927.0,
239
+ "mean_token_accuracy": 0.7304804190993309,
240
+ "epoch": 0.30303030303030304,
241
+ "step": 240
242
+ },
243
+ {
244
+ "loss": 1.2172,
245
+ "grad_norm": 0.669840395450592,
246
+ "learning_rate": 8.69140625e-06,
247
+ "entropy": 1.2568059146404267,
248
+ "num_tokens": 1511174.0,
249
+ "mean_token_accuracy": 0.7385278165340423,
250
+ "epoch": 0.31565656565656564,
251
+ "step": 250
252
+ },
253
+ {
254
+ "loss": 1.213,
255
+ "grad_norm": 0.6434893012046814,
256
+ "learning_rate": 8.626302083333334e-06,
257
+ "entropy": 1.254868358373642,
258
+ "num_tokens": 1570570.0,
259
+ "mean_token_accuracy": 0.7380544006824493,
260
+ "epoch": 0.3282828282828283,
261
+ "step": 260
262
+ },
263
+ {
264
+ "loss": 1.2116,
265
+ "grad_norm": 0.6034978032112122,
266
+ "learning_rate": 8.561197916666667e-06,
267
+ "entropy": 1.2532441645860672,
268
+ "num_tokens": 1630930.0,
269
+ "mean_token_accuracy": 0.7382062628865242,
270
+ "epoch": 0.3409090909090909,
271
+ "step": 270
272
+ },
273
+ {
274
+ "loss": 1.2313,
275
+ "grad_norm": 0.6371450424194336,
276
+ "learning_rate": 8.496093750000001e-06,
277
+ "entropy": 1.2605280816555022,
278
+ "num_tokens": 1692624.0,
279
+ "mean_token_accuracy": 0.7328852489590645,
280
+ "epoch": 0.35353535353535354,
281
+ "step": 280
282
+ },
283
+ {
284
+ "loss": 1.2195,
285
+ "grad_norm": 0.6300278306007385,
286
+ "learning_rate": 8.430989583333335e-06,
287
+ "entropy": 1.248606452345848,
288
+ "num_tokens": 1754213.0,
289
+ "mean_token_accuracy": 0.7370449885725975,
290
+ "epoch": 0.3661616161616162,
291
+ "step": 290
292
+ },
293
+ {
294
+ "loss": 1.1845,
295
+ "grad_norm": 0.6430155634880066,
296
+ "learning_rate": 8.365885416666667e-06,
297
+ "entropy": 1.2246102809906005,
298
+ "num_tokens": 1813841.0,
299
+ "mean_token_accuracy": 0.7439196646213532,
300
+ "epoch": 0.3787878787878788,
301
+ "step": 300
302
+ },
303
+ {
304
+ "loss": 1.1856,
305
+ "grad_norm": 0.6395701766014099,
306
+ "learning_rate": 8.30078125e-06,
307
+ "entropy": 1.2188815206289292,
308
+ "num_tokens": 1873568.0,
309
+ "mean_token_accuracy": 0.7422183871269226,
310
+ "epoch": 0.39141414141414144,
311
+ "step": 310
312
+ },
313
+ {
314
+ "loss": 1.1951,
315
+ "grad_norm": 0.6168740391731262,
316
+ "learning_rate": 8.235677083333334e-06,
317
+ "entropy": 1.234528934955597,
318
+ "num_tokens": 1935518.0,
319
+ "mean_token_accuracy": 0.7380509555339814,
320
+ "epoch": 0.40404040404040403,
321
+ "step": 320
322
+ },
323
+ {
324
+ "loss": 1.2078,
325
+ "grad_norm": 0.611132800579071,
326
+ "learning_rate": 8.170572916666666e-06,
327
+ "entropy": 1.226365676522255,
328
+ "num_tokens": 1995604.0,
329
+ "mean_token_accuracy": 0.739974245429039,
330
+ "epoch": 0.4166666666666667,
331
+ "step": 330
332
+ },
333
+ {
334
+ "loss": 1.1642,
335
+ "grad_norm": 0.6103131771087646,
336
+ "learning_rate": 8.10546875e-06,
337
+ "entropy": 1.2132183194160462,
338
+ "num_tokens": 2056484.0,
339
+ "mean_token_accuracy": 0.7444137379527092,
340
+ "epoch": 0.4292929292929293,
341
+ "step": 340
342
+ },
343
+ {
344
+ "loss": 1.2001,
345
+ "grad_norm": 0.6188805103302002,
346
+ "learning_rate": 8.040364583333334e-06,
347
+ "entropy": 1.223158246278763,
348
+ "num_tokens": 2118437.0,
349
+ "mean_token_accuracy": 0.7385613292455673,
350
+ "epoch": 0.44191919191919193,
351
+ "step": 350
352
+ },
353
+ {
354
+ "loss": 1.1848,
355
+ "grad_norm": 0.6238694190979004,
356
+ "learning_rate": 7.975260416666668e-06,
357
+ "entropy": 1.2154075980186463,
358
+ "num_tokens": 2179333.0,
359
+ "mean_token_accuracy": 0.7404153689742088,
360
+ "epoch": 0.45454545454545453,
361
+ "step": 360
362
+ },
363
+ {
364
+ "loss": 1.1597,
365
+ "grad_norm": 0.6028566956520081,
366
+ "learning_rate": 7.910156250000001e-06,
367
+ "entropy": 1.197928261756897,
368
+ "num_tokens": 2239604.0,
369
+ "mean_token_accuracy": 0.7475010469555855,
370
+ "epoch": 0.4671717171717172,
371
+ "step": 370
372
+ },
373
+ {
374
+ "loss": 1.1805,
375
+ "grad_norm": 0.6569434404373169,
376
+ "learning_rate": 7.845052083333335e-06,
377
+ "entropy": 1.189855706691742,
378
+ "num_tokens": 2300936.0,
379
+ "mean_token_accuracy": 0.7427218139171601,
380
+ "epoch": 0.4797979797979798,
381
+ "step": 380
382
+ },
383
+ {
384
+ "loss": 1.1759,
385
+ "grad_norm": 0.6351733207702637,
386
+ "learning_rate": 7.779947916666667e-06,
387
+ "entropy": 1.2076493889093398,
388
+ "num_tokens": 2361539.0,
389
+ "mean_token_accuracy": 0.7408407002687454,
390
+ "epoch": 0.49242424242424243,
391
+ "step": 390
392
+ },
393
+ {
394
+ "loss": 1.1874,
395
+ "grad_norm": 0.6327986121177673,
396
+ "learning_rate": 7.71484375e-06,
397
+ "entropy": 1.2223081022500992,
398
+ "num_tokens": 2422162.0,
399
+ "mean_token_accuracy": 0.7407250568270684,
400
+ "epoch": 0.5050505050505051,
401
+ "step": 400
402
+ },
403
+ {
404
+ "loss": 1.1726,
405
+ "grad_norm": 0.622104823589325,
406
+ "learning_rate": 7.649739583333334e-06,
407
+ "entropy": 1.2013412863016129,
408
+ "num_tokens": 2483447.0,
409
+ "mean_token_accuracy": 0.744242025911808,
410
+ "epoch": 0.5176767676767676,
411
+ "step": 410
412
+ },
413
+ {
414
+ "loss": 1.1838,
415
+ "grad_norm": 0.637651264667511,
416
+ "learning_rate": 7.5846354166666665e-06,
417
+ "entropy": 1.2116613179445266,
418
+ "num_tokens": 2544848.0,
419
+ "mean_token_accuracy": 0.7405030101537704,
420
+ "epoch": 0.5303030303030303,
421
+ "step": 420
422
+ },
423
+ {
424
+ "loss": 1.1681,
425
+ "grad_norm": 0.6252374649047852,
426
+ "learning_rate": 7.51953125e-06,
427
+ "entropy": 1.2024697184562683,
428
+ "num_tokens": 2605232.0,
429
+ "mean_token_accuracy": 0.7458183988928795,
430
+ "epoch": 0.5429292929292929,
431
+ "step": 430
432
+ },
433
+ {
434
+ "loss": 1.1452,
435
+ "grad_norm": 0.6502755284309387,
436
+ "learning_rate": 7.454427083333334e-06,
437
+ "entropy": 1.1797083109617232,
438
+ "num_tokens": 2664276.0,
439
+ "mean_token_accuracy": 0.7477341219782829,
440
+ "epoch": 0.5555555555555556,
441
+ "step": 440
442
+ },
443
+ {
444
+ "loss": 1.1665,
445
+ "grad_norm": 0.639979362487793,
446
+ "learning_rate": 7.389322916666667e-06,
447
+ "entropy": 1.1964200481772422,
448
+ "num_tokens": 2724073.0,
449
+ "mean_token_accuracy": 0.7431837096810341,
450
+ "epoch": 0.5681818181818182,
451
+ "step": 450
452
+ },
453
+ {
454
+ "loss": 1.1529,
455
+ "grad_norm": 0.6212354302406311,
456
+ "learning_rate": 7.3242187500000006e-06,
457
+ "entropy": 1.1795272737741471,
458
+ "num_tokens": 2784262.0,
459
+ "mean_token_accuracy": 0.7479098170995713,
460
+ "epoch": 0.5808080808080808,
461
+ "step": 460
462
+ },
463
+ {
464
+ "loss": 1.1678,
465
+ "grad_norm": 0.6528693437576294,
466
+ "learning_rate": 7.259114583333334e-06,
467
+ "entropy": 1.1908618807792664,
468
+ "num_tokens": 2843804.0,
469
+ "mean_token_accuracy": 0.745174677670002,
470
+ "epoch": 0.5934343434343434,
471
+ "step": 470
472
+ },
473
+ {
474
+ "loss": 1.1565,
475
+ "grad_norm": 0.639481246471405,
476
+ "learning_rate": 7.194010416666667e-06,
477
+ "entropy": 1.1862946093082427,
478
+ "num_tokens": 2903408.0,
479
+ "mean_token_accuracy": 0.7461866185069084,
480
+ "epoch": 0.6060606060606061,
481
+ "step": 480
482
+ },
483
+ {
484
+ "loss": 1.1251,
485
+ "grad_norm": 0.6332777142524719,
486
+ "learning_rate": 7.128906250000001e-06,
487
+ "entropy": 1.151743534207344,
488
+ "num_tokens": 2963401.0,
489
+ "mean_token_accuracy": 0.7535199671983719,
490
+ "epoch": 0.6186868686868687,
491
+ "step": 490
492
+ },
493
+ {
494
+ "loss": 1.1407,
495
+ "grad_norm": 0.5991836190223694,
496
+ "learning_rate": 7.063802083333335e-06,
497
+ "entropy": 1.1778477430343628,
498
+ "num_tokens": 3023530.0,
499
+ "mean_token_accuracy": 0.7491110354661942,
500
+ "epoch": 0.6313131313131313,
501
+ "step": 500
502
+ },
503
+ {
504
+ "loss": 1.1724,
505
+ "grad_norm": 0.6293458938598633,
506
+ "learning_rate": 6.998697916666667e-06,
507
+ "entropy": 1.2023035794496537,
508
+ "num_tokens": 3085225.0,
509
+ "mean_token_accuracy": 0.7405256554484367,
510
+ "epoch": 0.6439393939393939,
511
+ "step": 510
512
+ },
513
+ {
514
+ "loss": 1.1604,
515
+ "grad_norm": 0.6213802695274353,
516
+ "learning_rate": 6.93359375e-06,
517
+ "entropy": 1.1997334092855454,
518
+ "num_tokens": 3145749.0,
519
+ "mean_token_accuracy": 0.7435309410095214,
520
+ "epoch": 0.6565656565656566,
521
+ "step": 520
522
+ },
523
+ {
524
+ "loss": 1.1381,
525
+ "grad_norm": 0.6495156288146973,
526
+ "learning_rate": 6.868489583333334e-06,
527
+ "entropy": 1.1643184214830398,
528
+ "num_tokens": 3205761.0,
529
+ "mean_token_accuracy": 0.7496751576662064,
530
+ "epoch": 0.6691919191919192,
531
+ "step": 530
532
+ },
533
+ {
534
+ "loss": 1.1563,
535
+ "grad_norm": 0.6004510521888733,
536
+ "learning_rate": 6.803385416666667e-06,
537
+ "entropy": 1.1837532848119736,
538
+ "num_tokens": 3267024.0,
539
+ "mean_token_accuracy": 0.7447851061820984,
540
+ "epoch": 0.6818181818181818,
541
+ "step": 540
542
+ },
543
+ {
544
+ "loss": 1.1776,
545
+ "grad_norm": 0.607467532157898,
546
+ "learning_rate": 6.738281250000001e-06,
547
+ "entropy": 1.201677542924881,
548
+ "num_tokens": 3329070.0,
549
+ "mean_token_accuracy": 0.7406648993492126,
550
+ "epoch": 0.6944444444444444,
551
+ "step": 550
552
+ },
553
+ {
554
+ "loss": 1.1298,
555
+ "grad_norm": 0.6079947352409363,
556
+ "learning_rate": 6.6731770833333345e-06,
557
+ "entropy": 1.1659786373376846,
558
+ "num_tokens": 3389505.0,
559
+ "mean_token_accuracy": 0.7505511298775673,
560
+ "epoch": 0.7070707070707071,
561
+ "step": 560
562
+ },
563
+ {
564
+ "loss": 1.144,
565
+ "grad_norm": 0.6534572839736938,
566
+ "learning_rate": 6.6080729166666665e-06,
567
+ "entropy": 1.1714913487434386,
568
+ "num_tokens": 3449834.0,
569
+ "mean_token_accuracy": 0.7481625184416771,
570
+ "epoch": 0.7196969696969697,
571
+ "step": 570
572
+ },
573
+ {
574
+ "loss": 1.139,
575
+ "grad_norm": 0.5903164744377136,
576
+ "learning_rate": 6.54296875e-06,
577
+ "entropy": 1.1694963037967683,
578
+ "num_tokens": 3510701.0,
579
+ "mean_token_accuracy": 0.7476441130042076,
580
+ "epoch": 0.7323232323232324,
581
+ "step": 580
582
+ },
583
+ {
584
+ "loss": 1.1422,
585
+ "grad_norm": 0.6284182071685791,
586
+ "learning_rate": 6.477864583333334e-06,
587
+ "entropy": 1.1777653217315673,
588
+ "num_tokens": 3570992.0,
589
+ "mean_token_accuracy": 0.7490358456969262,
590
+ "epoch": 0.7449494949494949,
591
+ "step": 590
592
+ },
593
+ {
594
+ "loss": 1.1336,
595
+ "grad_norm": 0.6250146627426147,
596
+ "learning_rate": 6.412760416666667e-06,
597
+ "entropy": 1.1588268011808396,
598
+ "num_tokens": 3631087.0,
599
+ "mean_token_accuracy": 0.7504108369350433,
600
+ "epoch": 0.7575757575757576,
601
+ "step": 600
602
+ },
603
+ {
604
+ "loss": 1.1534,
605
+ "grad_norm": 0.6420578956604004,
606
+ "learning_rate": 6.3476562500000006e-06,
607
+ "entropy": 1.1890273630619048,
608
+ "num_tokens": 3692871.0,
609
+ "mean_token_accuracy": 0.7442372158169747,
610
+ "epoch": 0.7702020202020202,
611
+ "step": 610
612
+ },
613
+ {
614
+ "loss": 1.1477,
615
+ "grad_norm": 0.6156490445137024,
616
+ "learning_rate": 6.282552083333334e-06,
617
+ "entropy": 1.1817798465490341,
618
+ "num_tokens": 3753671.0,
619
+ "mean_token_accuracy": 0.7468263059854507,
620
+ "epoch": 0.7828282828282829,
621
+ "step": 620
622
+ },
623
+ {
624
+ "loss": 1.139,
625
+ "grad_norm": 0.6248748898506165,
626
+ "learning_rate": 6.217447916666667e-06,
627
+ "entropy": 1.1686612635850906,
628
+ "num_tokens": 3813110.0,
629
+ "mean_token_accuracy": 0.7478924334049225,
630
+ "epoch": 0.7954545454545454,
631
+ "step": 630
632
+ },
633
+ {
634
+ "loss": 1.1118,
635
+ "grad_norm": 0.6052266359329224,
636
+ "learning_rate": 6.152343750000001e-06,
637
+ "entropy": 1.1387953266501427,
638
+ "num_tokens": 3873477.0,
639
+ "mean_token_accuracy": 0.7538487210869789,
640
+ "epoch": 0.8080808080808081,
641
+ "step": 640
642
+ },
643
+ {
644
+ "loss": 1.1108,
645
+ "grad_norm": 0.6769536137580872,
646
+ "learning_rate": 6.087239583333335e-06,
647
+ "entropy": 1.1343814879655838,
648
+ "num_tokens": 3932873.0,
649
+ "mean_token_accuracy": 0.7548464313149452,
650
+ "epoch": 0.8207070707070707,
651
+ "step": 650
652
+ },
+ {
+ "loss": 1.134,
+ "grad_norm": 0.6545736789703369,
+ "learning_rate": 6.022135416666667e-06,
+ "entropy": 1.1686519652605056,
+ "num_tokens": 3992754.0,
+ "mean_token_accuracy": 0.7495195478200912,
+ "epoch": 0.8333333333333334,
+ "step": 660
+ },
+ {
+ "loss": 1.1213,
+ "grad_norm": 0.6192017793655396,
+ "learning_rate": 5.95703125e-06,
+ "entropy": 1.1596283346414566,
+ "num_tokens": 4053540.0,
+ "mean_token_accuracy": 0.7510297149419785,
+ "epoch": 0.8459595959595959,
+ "step": 670
+ },
+ {
+ "loss": 1.1181,
+ "grad_norm": 0.6520631909370422,
+ "learning_rate": 5.891927083333334e-06,
+ "entropy": 1.1484344542026519,
+ "num_tokens": 4113582.0,
+ "mean_token_accuracy": 0.751795919239521,
+ "epoch": 0.8585858585858586,
+ "step": 680
+ },
+ {
+ "loss": 1.1204,
+ "grad_norm": 0.6247655153274536,
+ "learning_rate": 5.826822916666667e-06,
+ "entropy": 1.1519750133156776,
+ "num_tokens": 4174983.0,
+ "mean_token_accuracy": 0.7506368085741997,
+ "epoch": 0.8712121212121212,
+ "step": 690
+ },
+ {
+ "loss": 1.1191,
+ "grad_norm": 0.620272159576416,
+ "learning_rate": 5.761718750000001e-06,
+ "entropy": 1.1486561581492425,
+ "num_tokens": 4234465.0,
+ "mean_token_accuracy": 0.7536391675472259,
+ "epoch": 0.8838383838383839,
+ "step": 700
+ },
+ {
+ "loss": 1.1224,
+ "grad_norm": 0.6308649182319641,
+ "learning_rate": 5.6966145833333344e-06,
+ "entropy": 1.1504923462867738,
+ "num_tokens": 4295955.0,
+ "mean_token_accuracy": 0.750646598637104,
+ "epoch": 0.8964646464646465,
+ "step": 710
+ },
+ {
+ "loss": 1.1238,
+ "grad_norm": 0.6629899740219116,
+ "learning_rate": 5.6315104166666665e-06,
+ "entropy": 1.1561898440122604,
+ "num_tokens": 4357171.0,
+ "mean_token_accuracy": 0.7507557719945908,
+ "epoch": 0.9090909090909091,
+ "step": 720
+ },
+ {
+ "loss": 1.1038,
+ "grad_norm": 0.5972346067428589,
+ "learning_rate": 5.56640625e-06,
+ "entropy": 1.1413449853658677,
+ "num_tokens": 4417636.0,
+ "mean_token_accuracy": 0.7554104939103127,
+ "epoch": 0.9217171717171717,
+ "step": 730
+ },
+ {
+ "loss": 1.1005,
+ "grad_norm": 0.6356479525566101,
+ "learning_rate": 5.501302083333334e-06,
+ "entropy": 1.1356119453907012,
+ "num_tokens": 4477294.0,
+ "mean_token_accuracy": 0.7565629109740257,
+ "epoch": 0.9343434343434344,
+ "step": 740
+ },
+ {
+ "loss": 1.1225,
+ "grad_norm": 0.6416464447975159,
+ "learning_rate": 5.436197916666667e-06,
+ "entropy": 1.1633600294589996,
+ "num_tokens": 4537503.0,
+ "mean_token_accuracy": 0.7515855401754379,
+ "epoch": 0.946969696969697,
+ "step": 750
+ },
+ {
+ "loss": 1.1184,
+ "grad_norm": 0.6126084327697754,
+ "learning_rate": 5.3710937500000005e-06,
+ "entropy": 1.1527763932943345,
+ "num_tokens": 4598778.0,
+ "mean_token_accuracy": 0.7526160582900048,
+ "epoch": 0.9595959595959596,
+ "step": 760
+ },
+ {
+ "loss": 1.1144,
+ "grad_norm": 0.6359922289848328,
+ "learning_rate": 5.305989583333334e-06,
+ "entropy": 1.1397768080234527,
+ "num_tokens": 4658978.0,
+ "mean_token_accuracy": 0.7548302739858628,
+ "epoch": 0.9722222222222222,
+ "step": 770
+ },
+ {
+ "loss": 1.1213,
+ "grad_norm": 0.6260409951210022,
+ "learning_rate": 5.240885416666667e-06,
+ "entropy": 1.1569962561130525,
+ "num_tokens": 4720500.0,
+ "mean_token_accuracy": 0.7512885302305221,
+ "epoch": 0.9848484848484849,
+ "step": 780
+ },
+ {
+ "loss": 1.1227,
+ "grad_norm": 0.6293452978134155,
+ "learning_rate": 5.17578125e-06,
+ "entropy": 1.1509152203798294,
+ "num_tokens": 4781612.0,
+ "mean_token_accuracy": 0.7519564241170883,
+ "epoch": 0.9974747474747475,
+ "step": 790
+ },
+ {
+ "loss": 1.1034,
+ "grad_norm": 0.664761483669281,
+ "learning_rate": 5.110677083333334e-06,
+ "entropy": 1.1399024561047555,
+ "num_tokens": 4841359.0,
+ "mean_token_accuracy": 0.7526706486940384,
+ "epoch": 1.0101010101010102,
+ "step": 800
+ },
+ {
+ "loss": 1.0857,
+ "grad_norm": 0.5934865474700928,
+ "learning_rate": 5.045572916666667e-06,
+ "entropy": 1.120214229822159,
+ "num_tokens": 4901016.0,
+ "mean_token_accuracy": 0.7593594208359719,
+ "epoch": 1.0227272727272727,
+ "step": 810
+ },
+ {
+ "loss": 1.1165,
+ "grad_norm": 0.6040735840797424,
+ "learning_rate": 4.98046875e-06,
+ "entropy": 1.1430341199040412,
+ "num_tokens": 4961646.0,
+ "mean_token_accuracy": 0.7525988414883613,
+ "epoch": 1.0353535353535352,
+ "step": 820
+ },
+ {
+ "loss": 1.0851,
+ "grad_norm": 0.6277610063552856,
+ "learning_rate": 4.915364583333333e-06,
+ "entropy": 1.1245349109172822,
+ "num_tokens": 5022365.0,
+ "mean_token_accuracy": 0.7577921718358993,
+ "epoch": 1.047979797979798,
+ "step": 830
+ },
+ {
+ "loss": 1.0813,
+ "grad_norm": 0.6260582804679871,
+ "learning_rate": 4.850260416666667e-06,
+ "entropy": 1.1204675793647767,
+ "num_tokens": 5081972.0,
+ "mean_token_accuracy": 0.7580071151256561,
+ "epoch": 1.0606060606060606,
+ "step": 840
+ },
+ {
+ "loss": 1.0922,
+ "grad_norm": 0.6023226976394653,
+ "learning_rate": 4.785156250000001e-06,
+ "entropy": 1.1222782507538795,
+ "num_tokens": 5142184.0,
+ "mean_token_accuracy": 0.7566258609294891,
+ "epoch": 1.0732323232323233,
+ "step": 850
+ },
+ {
+ "loss": 1.0994,
+ "grad_norm": 0.6206791996955872,
+ "learning_rate": 4.7200520833333336e-06,
+ "entropy": 1.1227335944771766,
+ "num_tokens": 5203020.0,
+ "mean_token_accuracy": 0.7540625646710396,
+ "epoch": 1.0858585858585859,
+ "step": 860
+ },
+ {
+ "loss": 1.0958,
+ "grad_norm": 0.6301055550575256,
+ "learning_rate": 4.654947916666667e-06,
+ "entropy": 1.1352888554334641,
+ "num_tokens": 5263291.0,
+ "mean_token_accuracy": 0.7562039017677307,
+ "epoch": 1.0984848484848484,
+ "step": 870
+ },
+ {
+ "loss": 1.0793,
+ "grad_norm": 0.6210020780563354,
+ "learning_rate": 4.58984375e-06,
+ "entropy": 1.1120088309049607,
+ "num_tokens": 5323972.0,
+ "mean_token_accuracy": 0.7590557768940925,
+ "epoch": 1.1111111111111112,
+ "step": 880
+ },
+ {
+ "loss": 1.0717,
+ "grad_norm": 0.6332690715789795,
+ "learning_rate": 4.524739583333334e-06,
+ "entropy": 1.0996666207909584,
+ "num_tokens": 5383911.0,
+ "mean_token_accuracy": 0.7615471586585045,
+ "epoch": 1.1237373737373737,
+ "step": 890
+ },
+ {
+ "loss": 1.1027,
+ "grad_norm": 0.6505516767501831,
+ "learning_rate": 4.459635416666668e-06,
+ "entropy": 1.127107810974121,
+ "num_tokens": 5445417.0,
+ "mean_token_accuracy": 0.7562421515583992,
+ "epoch": 1.1363636363636362,
+ "step": 900
+ },
+ {
+ "loss": 1.0879,
+ "grad_norm": 0.6406158804893494,
+ "learning_rate": 4.3945312500000005e-06,
+ "entropy": 1.129740473628044,
+ "num_tokens": 5505455.0,
+ "mean_token_accuracy": 0.7587148532271385,
+ "epoch": 1.148989898989899,
+ "step": 910
+ },
+ {
+ "loss": 1.0752,
+ "grad_norm": 0.6297397613525391,
+ "learning_rate": 4.329427083333333e-06,
+ "entropy": 1.1167259424924851,
+ "num_tokens": 5565311.0,
+ "mean_token_accuracy": 0.7604142814874649,
+ "epoch": 1.1616161616161615,
+ "step": 920
+ },
+ {
+ "loss": 1.0686,
+ "grad_norm": 0.6490073204040527,
+ "learning_rate": 4.264322916666667e-06,
+ "entropy": 1.1037891641259194,
+ "num_tokens": 5625358.0,
+ "mean_token_accuracy": 0.7610052570700645,
+ "epoch": 1.1742424242424243,
+ "step": 930
+ },
+ {
+ "loss": 1.0868,
+ "grad_norm": 0.6366387009620667,
+ "learning_rate": 4.19921875e-06,
+ "entropy": 1.1061881184577942,
+ "num_tokens": 5686421.0,
+ "mean_token_accuracy": 0.7574937298893929,
+ "epoch": 1.1868686868686869,
+ "step": 940
+ },
+ {
+ "loss": 1.0694,
+ "grad_norm": 0.6556055545806885,
+ "learning_rate": 4.134114583333334e-06,
+ "entropy": 1.1124324068427085,
+ "num_tokens": 5745891.0,
+ "mean_token_accuracy": 0.7602224007248879,
+ "epoch": 1.1994949494949494,
+ "step": 950
+ },
+ {
+ "loss": 1.081,
+ "grad_norm": 0.6404849886894226,
+ "learning_rate": 4.0690104166666675e-06,
+ "entropy": 1.1175200879573821,
+ "num_tokens": 5806078.0,
+ "mean_token_accuracy": 0.7568994402885437,
+ "epoch": 1.2121212121212122,
+ "step": 960
+ },
+ {
+ "loss": 1.0791,
+ "grad_norm": 0.6227584481239319,
+ "learning_rate": 4.00390625e-06,
+ "entropy": 1.1186978340148925,
+ "num_tokens": 5866369.0,
+ "mean_token_accuracy": 0.759756401181221,
+ "epoch": 1.2247474747474747,
+ "step": 970
+ },
+ {
+ "loss": 1.0937,
+ "grad_norm": 0.6616361141204834,
+ "learning_rate": 3.938802083333333e-06,
+ "entropy": 1.122128139436245,
+ "num_tokens": 5926217.0,
+ "mean_token_accuracy": 0.7582718566060066,
+ "epoch": 1.2373737373737375,
+ "step": 980
+ },
+ {
+ "loss": 1.0978,
+ "grad_norm": 0.6384168267250061,
+ "learning_rate": 3.873697916666667e-06,
+ "entropy": 1.122861033678055,
+ "num_tokens": 5987666.0,
+ "mean_token_accuracy": 0.7549964562058449,
+ "epoch": 1.25,
+ "step": 990
+ },
+ {
+ "loss": 1.0952,
+ "grad_norm": 0.6038117408752441,
+ "learning_rate": 3.8085937500000002e-06,
+ "entropy": 1.1277505576610565,
+ "num_tokens": 6048708.0,
+ "mean_token_accuracy": 0.755272176861763,
+ "epoch": 1.2626262626262625,
+ "step": 1000
+ },
+ {
+ "loss": 1.078,
+ "grad_norm": 0.6418159604072571,
+ "learning_rate": 3.7434895833333336e-06,
+ "entropy": 1.1120157346129418,
+ "num_tokens": 6109652.0,
+ "mean_token_accuracy": 0.7594122514128685,
+ "epoch": 1.2752525252525253,
+ "step": 1010
+ },
+ {
+ "loss": 1.0688,
+ "grad_norm": 0.6218425035476685,
+ "learning_rate": 3.6783854166666673e-06,
+ "entropy": 1.101425115764141,
+ "num_tokens": 6169125.0,
+ "mean_token_accuracy": 0.7604865297675133,
+ "epoch": 1.2878787878787878,
+ "step": 1020
+ },
+ {
+ "loss": 1.0581,
+ "grad_norm": 0.6429149508476257,
+ "learning_rate": 3.61328125e-06,
+ "entropy": 1.1007713869214057,
+ "num_tokens": 6230303.0,
+ "mean_token_accuracy": 0.7621071562170982,
+ "epoch": 1.3005050505050506,
+ "step": 1030
+ },
+ {
+ "loss": 1.0715,
+ "grad_norm": 0.6489748358726501,
+ "learning_rate": 3.5481770833333335e-06,
+ "entropy": 1.1094096556305886,
+ "num_tokens": 6291396.0,
+ "mean_token_accuracy": 0.7599423810839653,
+ "epoch": 1.3131313131313131,
+ "step": 1040
+ },
+ {
+ "loss": 1.0584,
+ "grad_norm": 0.6485461592674255,
+ "learning_rate": 3.483072916666667e-06,
+ "entropy": 1.0827289715409278,
+ "num_tokens": 6351579.0,
+ "mean_token_accuracy": 0.7630694910883904,
+ "epoch": 1.3257575757575757,
+ "step": 1050
+ },
+ {
+ "loss": 1.0764,
+ "grad_norm": 0.6261104941368103,
+ "learning_rate": 3.41796875e-06,
+ "entropy": 1.114325873553753,
+ "num_tokens": 6411662.0,
+ "mean_token_accuracy": 0.7585488513112069,
+ "epoch": 1.3383838383838385,
+ "step": 1060
+ },
+ {
+ "loss": 1.0902,
+ "grad_norm": 0.6522034406661987,
+ "learning_rate": 3.3528645833333334e-06,
+ "entropy": 1.1271554425358772,
+ "num_tokens": 6473505.0,
+ "mean_token_accuracy": 0.7562535598874092,
+ "epoch": 1.351010101010101,
+ "step": 1070
+ },
+ {
+ "loss": 1.065,
+ "grad_norm": 0.6176674962043762,
+ "learning_rate": 3.287760416666667e-06,
+ "entropy": 1.1013643085956573,
+ "num_tokens": 6533580.0,
+ "mean_token_accuracy": 0.763075165450573,
+ "epoch": 1.3636363636363638,
+ "step": 1080
+ },
+ {
+ "loss": 1.0596,
+ "grad_norm": 0.6540253758430481,
+ "learning_rate": 3.2226562500000004e-06,
+ "entropy": 1.098090337216854,
+ "num_tokens": 6593481.0,
+ "mean_token_accuracy": 0.7616770043969154,
+ "epoch": 1.3762626262626263,
+ "step": 1090
+ },
+ {
+ "loss": 1.0861,
+ "grad_norm": 0.6754550933837891,
+ "learning_rate": 3.1575520833333333e-06,
+ "entropy": 1.1176372200250626,
+ "num_tokens": 6653967.0,
+ "mean_token_accuracy": 0.7573029339313507,
+ "epoch": 1.3888888888888888,
+ "step": 1100
+ },
+ {
+ "loss": 1.0573,
+ "grad_norm": 0.6022531986236572,
+ "learning_rate": 3.092447916666667e-06,
+ "entropy": 1.1040414482355119,
+ "num_tokens": 6714685.0,
+ "mean_token_accuracy": 0.7612267225980759,
+ "epoch": 1.4015151515151514,
+ "step": 1110
+ },
+ {
+ "loss": 1.0637,
+ "grad_norm": 0.6621010303497314,
+ "learning_rate": 3.0273437500000003e-06,
+ "entropy": 1.0926991075277328,
+ "num_tokens": 6774659.0,
+ "mean_token_accuracy": 0.7612977519631385,
+ "epoch": 1.4141414141414141,
+ "step": 1120
+ },
+ {
+ "loss": 1.0701,
+ "grad_norm": 0.62503981590271,
+ "learning_rate": 2.962239583333333e-06,
+ "entropy": 1.1095534324645997,
+ "num_tokens": 6834579.0,
+ "mean_token_accuracy": 0.7618053883314133,
+ "epoch": 1.4267676767676767,
+ "step": 1130
+ },
+ {
+ "loss": 1.0747,
+ "grad_norm": 0.6527109742164612,
+ "learning_rate": 2.897135416666667e-06,
+ "entropy": 1.117809349298477,
+ "num_tokens": 6894074.0,
+ "mean_token_accuracy": 0.759482853114605,
+ "epoch": 1.4393939393939394,
+ "step": 1140
+ },
+ {
+ "loss": 1.0607,
+ "grad_norm": 0.6720954775810242,
+ "learning_rate": 2.8320312500000002e-06,
+ "entropy": 1.1005077749490737,
+ "num_tokens": 6953870.0,
+ "mean_token_accuracy": 0.7621246844530105,
+ "epoch": 1.452020202020202,
+ "step": 1150
+ },
+ {
+ "loss": 1.0884,
+ "grad_norm": 0.658524215221405,
+ "learning_rate": 2.7669270833333335e-06,
+ "entropy": 1.1236482918262483,
+ "num_tokens": 7014553.0,
+ "mean_token_accuracy": 0.7560836613178253,
+ "epoch": 1.4646464646464645,
+ "step": 1160
+ },
+ {
+ "loss": 1.0659,
+ "grad_norm": 0.6261802911758423,
+ "learning_rate": 2.7018229166666673e-06,
+ "entropy": 1.1116504594683647,
+ "num_tokens": 7076291.0,
+ "mean_token_accuracy": 0.7597616642713547,
+ "epoch": 1.4772727272727273,
+ "step": 1170
+ },
+ {
+ "loss": 1.0524,
+ "grad_norm": 0.6310375332832336,
+ "learning_rate": 2.63671875e-06,
+ "entropy": 1.073892480134964,
+ "num_tokens": 7137305.0,
+ "mean_token_accuracy": 0.7628733053803444,
+ "epoch": 1.4898989898989898,
+ "step": 1180
+ },
+ {
+ "loss": 1.0679,
+ "grad_norm": 0.638482391834259,
+ "learning_rate": 2.5716145833333334e-06,
+ "entropy": 1.0975843235850333,
+ "num_tokens": 7198239.0,
+ "mean_token_accuracy": 0.7603248566389084,
+ "epoch": 1.5025252525252526,
+ "step": 1190
+ },
+ {
+ "loss": 1.0666,
+ "grad_norm": 0.640065610408783,
+ "learning_rate": 2.506510416666667e-06,
+ "entropy": 1.0986508697271347,
+ "num_tokens": 7257847.0,
+ "mean_token_accuracy": 0.7622530281543731,
+ "epoch": 1.5151515151515151,
+ "step": 1200
+ },
+ {
+ "loss": 1.0587,
+ "grad_norm": 0.6437165141105652,
+ "learning_rate": 2.44140625e-06,
+ "entropy": 1.0971406906843186,
+ "num_tokens": 7317615.0,
+ "mean_token_accuracy": 0.7623877748847008,
+ "epoch": 1.5277777777777777,
+ "step": 1210
+ },
+ {
+ "loss": 1.0569,
+ "grad_norm": 0.6590547561645508,
+ "learning_rate": 2.3763020833333338e-06,
+ "entropy": 1.1032136514782906,
+ "num_tokens": 7377946.0,
+ "mean_token_accuracy": 0.7616146191954613,
+ "epoch": 1.5404040404040404,
+ "step": 1220
+ },
+ {
+ "loss": 1.0616,
+ "grad_norm": 0.6317723989486694,
+ "learning_rate": 2.3111979166666667e-06,
+ "entropy": 1.0922824308276176,
+ "num_tokens": 7438666.0,
+ "mean_token_accuracy": 0.7611089378595353,
+ "epoch": 1.553030303030303,
+ "step": 1230
+ },
+ {
+ "loss": 1.0725,
+ "grad_norm": 0.66637122631073,
+ "learning_rate": 2.2460937500000004e-06,
+ "entropy": 1.1092566132545472,
+ "num_tokens": 7499747.0,
+ "mean_token_accuracy": 0.7594954133033752,
+ "epoch": 1.5656565656565657,
+ "step": 1240
+ },
+ {
+ "loss": 1.074,
+ "grad_norm": 0.6520881652832031,
+ "learning_rate": 2.1809895833333337e-06,
+ "entropy": 1.111878876388073,
+ "num_tokens": 7561004.0,
+ "mean_token_accuracy": 0.7558068126440048,
+ "epoch": 1.5782828282828283,
+ "step": 1250
+ },
+ {
+ "loss": 1.0844,
+ "grad_norm": 0.6495437622070312,
+ "learning_rate": 2.1158854166666666e-06,
+ "entropy": 1.1105972841382026,
+ "num_tokens": 7622115.0,
+ "mean_token_accuracy": 0.7563070356845856,
+ "epoch": 1.5909090909090908,
+ "step": 1260
+ },
+ {
+ "loss": 1.0649,
+ "grad_norm": 0.6420316696166992,
+ "learning_rate": 2.0507812500000003e-06,
+ "entropy": 1.0932338371872903,
+ "num_tokens": 7682210.0,
+ "mean_token_accuracy": 0.7622412323951722,
+ "epoch": 1.6035353535353534,
+ "step": 1270
+ },
+ {
+ "loss": 1.0427,
+ "grad_norm": 0.6192623972892761,
+ "learning_rate": 1.9856770833333336e-06,
+ "entropy": 1.0919055327773095,
+ "num_tokens": 7742745.0,
+ "mean_token_accuracy": 0.7657591253519058,
+ "epoch": 1.6161616161616161,
+ "step": 1280
+ },
+ {
+ "loss": 1.0701,
+ "grad_norm": 0.6355161666870117,
+ "learning_rate": 1.920572916666667e-06,
+ "entropy": 1.1026942864060403,
+ "num_tokens": 7802902.0,
+ "mean_token_accuracy": 0.7609635755419731,
+ "epoch": 1.628787878787879,
+ "step": 1290
+ },
+ {
+ "loss": 1.0789,
+ "grad_norm": 0.6254522800445557,
+ "learning_rate": 1.8554687500000002e-06,
+ "entropy": 1.1140723824501038,
+ "num_tokens": 7863576.0,
+ "mean_token_accuracy": 0.7592580512166023,
+ "epoch": 1.6414141414141414,
+ "step": 1300
+ },
+ {
+ "loss": 1.0714,
+ "grad_norm": 0.633172333240509,
+ "learning_rate": 1.7903645833333335e-06,
+ "entropy": 1.108394268155098,
+ "num_tokens": 7925518.0,
+ "mean_token_accuracy": 0.7598690986633301,
+ "epoch": 1.654040404040404,
+ "step": 1310
+ },
+ {
+ "loss": 1.0701,
+ "grad_norm": 0.6279735565185547,
+ "learning_rate": 1.7252604166666668e-06,
+ "entropy": 1.1097407966852189,
+ "num_tokens": 7987388.0,
+ "mean_token_accuracy": 0.7611567705869675,
+ "epoch": 1.6666666666666665,
+ "step": 1320
+ },
+ {
+ "loss": 1.0786,
+ "grad_norm": 0.6425778269767761,
+ "learning_rate": 1.6601562500000001e-06,
+ "entropy": 1.1071213275194167,
+ "num_tokens": 8048853.0,
+ "mean_token_accuracy": 0.7576610520482063,
+ "epoch": 1.6792929292929293,
+ "step": 1330
+ },
+ {
+ "loss": 1.0604,
+ "grad_norm": 0.666192889213562,
+ "learning_rate": 1.5950520833333336e-06,
+ "entropy": 1.0931925728917122,
+ "num_tokens": 8108967.0,
+ "mean_token_accuracy": 0.7616572439670563,
+ "epoch": 1.691919191919192,
+ "step": 1340
+ },
+ {
+ "loss": 1.0769,
+ "grad_norm": 0.6348255276679993,
+ "learning_rate": 1.5299479166666667e-06,
+ "entropy": 1.0973364993929864,
+ "num_tokens": 8169700.0,
+ "mean_token_accuracy": 0.7596119627356529,
+ "epoch": 1.7045454545454546,
+ "step": 1350
+ },
+ {
+ "loss": 1.0731,
+ "grad_norm": 0.6510699391365051,
+ "learning_rate": 1.46484375e-06,
+ "entropy": 1.1137071400880814,
+ "num_tokens": 8229676.0,
+ "mean_token_accuracy": 0.7593250289559365,
+ "epoch": 1.7171717171717171,
+ "step": 1360
+ },
+ {
+ "loss": 1.069,
+ "grad_norm": 0.6622318625450134,
+ "learning_rate": 1.3997395833333335e-06,
+ "entropy": 1.1052428260445595,
+ "num_tokens": 8289396.0,
+ "mean_token_accuracy": 0.7627501472830772,
+ "epoch": 1.7297979797979797,
+ "step": 1370
+ },
+ {
+ "loss": 1.0506,
+ "grad_norm": 0.6430277824401855,
+ "learning_rate": 1.3346354166666666e-06,
+ "entropy": 1.0989198789000512,
+ "num_tokens": 8351152.0,
+ "mean_token_accuracy": 0.7634323209524154,
+ "epoch": 1.7424242424242424,
+ "step": 1380
+ },
+ {
+ "loss": 1.0505,
+ "grad_norm": 0.639707088470459,
+ "learning_rate": 1.2695312500000002e-06,
+ "entropy": 1.0913211867213248,
+ "num_tokens": 8411357.0,
+ "mean_token_accuracy": 0.763850274682045,
+ "epoch": 1.7550505050505052,
+ "step": 1390
+ },
+ {
+ "loss": 1.0756,
+ "grad_norm": 0.680479109287262,
+ "learning_rate": 1.2044270833333335e-06,
+ "entropy": 1.1044820442795753,
+ "num_tokens": 8471975.0,
+ "mean_token_accuracy": 0.7598197475075722,
+ "epoch": 1.7676767676767677,
+ "step": 1400
+ },
+ {
+ "loss": 1.048,
+ "grad_norm": 0.651622474193573,
+ "learning_rate": 1.1393229166666668e-06,
+ "entropy": 1.0821994885802269,
+ "num_tokens": 8532480.0,
+ "mean_token_accuracy": 0.7638402819633484,
+ "epoch": 1.7803030303030303,
+ "step": 1410
+ },
+ {
+ "loss": 1.0545,
+ "grad_norm": 0.6294305920600891,
+ "learning_rate": 1.07421875e-06,
+ "entropy": 1.0933044001460075,
+ "num_tokens": 8593538.0,
+ "mean_token_accuracy": 0.7633207753300667,
+ "epoch": 1.7929292929292928,
+ "step": 1420
+ },
+ {
+ "loss": 1.065,
+ "grad_norm": 0.6396600008010864,
+ "learning_rate": 1.0091145833333334e-06,
+ "entropy": 1.1002878457307816,
+ "num_tokens": 8654446.0,
+ "mean_token_accuracy": 0.7610213488340378,
+ "epoch": 1.8055555555555556,
+ "step": 1430
+ },
+ {
+ "loss": 1.079,
+ "grad_norm": 0.6585692167282104,
+ "learning_rate": 9.440104166666668e-07,
+ "entropy": 1.1098253890872,
+ "num_tokens": 8715305.0,
+ "mean_token_accuracy": 0.7590781077742577,
+ "epoch": 1.8181818181818183,
+ "step": 1440
+ },
+ {
+ "loss": 1.0555,
+ "grad_norm": 0.6637106537818909,
+ "learning_rate": 8.789062500000001e-07,
+ "entropy": 1.0894212126731873,
+ "num_tokens": 8775351.0,
+ "mean_token_accuracy": 0.7621971383690834,
+ "epoch": 1.8308080808080809,
+ "step": 1450
+ },
+ {
+ "loss": 1.0464,
+ "grad_norm": 0.6491685509681702,
+ "learning_rate": 8.138020833333334e-07,
+ "entropy": 1.0821923539042473,
+ "num_tokens": 8836131.0,
+ "mean_token_accuracy": 0.7639922067523003,
+ "epoch": 1.8434343434343434,
+ "step": 1460
+ },
+ {
+ "loss": 1.0682,
+ "grad_norm": 0.6781795024871826,
+ "learning_rate": 7.486979166666668e-07,
+ "entropy": 1.1058385655283929,
+ "num_tokens": 8896334.0,
+ "mean_token_accuracy": 0.7595004603266716,
+ "epoch": 1.856060606060606,
+ "step": 1470
+ },
+ {
+ "loss": 1.0625,
+ "grad_norm": 0.652746319770813,
+ "learning_rate": 6.835937500000001e-07,
+ "entropy": 1.0934822604060173,
+ "num_tokens": 8956948.0,
+ "mean_token_accuracy": 0.7614723727107048,
+ "epoch": 1.8686868686868687,
+ "step": 1480
+ },
+ {
+ "loss": 1.0677,
+ "grad_norm": 0.6350075006484985,
+ "learning_rate": 6.184895833333334e-07,
+ "entropy": 1.0955146595835685,
+ "num_tokens": 9017330.0,
+ "mean_token_accuracy": 0.75977371186018,
+ "epoch": 1.8813131313131313,
+ "step": 1490
+ },
+ {
+ "loss": 1.0778,
+ "grad_norm": 0.6651970744132996,
+ "learning_rate": 5.533854166666667e-07,
+ "entropy": 1.1074895232915878,
+ "num_tokens": 9077561.0,
+ "mean_token_accuracy": 0.7592803448438644,
+ "epoch": 1.893939393939394,
+ "step": 1500
+ },
+ {
+ "loss": 1.0517,
+ "grad_norm": 0.6216638684272766,
+ "learning_rate": 4.8828125e-07,
+ "entropy": 1.088778705894947,
+ "num_tokens": 9137424.0,
+ "mean_token_accuracy": 0.7628024965524673,
+ "epoch": 1.9065656565656566,
+ "step": 1510
+ },
+ {
+ "loss": 1.0657,
+ "grad_norm": 0.6801443099975586,
+ "learning_rate": 4.2317708333333337e-07,
+ "entropy": 1.0963940545916557,
+ "num_tokens": 9198465.0,
+ "mean_token_accuracy": 0.7611119478940964,
+ "epoch": 1.9191919191919191,
+ "step": 1520
+ },
+ {
+ "loss": 1.0657,
+ "grad_norm": 0.6482690572738647,
+ "learning_rate": 3.5807291666666667e-07,
+ "entropy": 1.107513566315174,
+ "num_tokens": 9258486.0,
+ "mean_token_accuracy": 0.7620691254734993,
+ "epoch": 1.9318181818181817,
+ "step": 1530
+ },
+ {
+ "loss": 1.0758,
+ "grad_norm": 0.6314805746078491,
+ "learning_rate": 2.9296875000000003e-07,
+ "entropy": 1.1056584566831589,
+ "num_tokens": 9319619.0,
+ "mean_token_accuracy": 0.7584919854998589,
+ "epoch": 1.9444444444444444,
+ "step": 1540
+ },
+ {
+ "loss": 1.0745,
+ "grad_norm": 0.6383644938468933,
+ "learning_rate": 2.2786458333333333e-07,
+ "entropy": 1.1117310538887977,
+ "num_tokens": 9380350.0,
+ "mean_token_accuracy": 0.7591810420155525,
+ "epoch": 1.9570707070707072,
+ "step": 1550
+ },
+ {
+ "loss": 1.0557,
+ "grad_norm": 0.6331989169120789,
+ "learning_rate": 1.627604166666667e-07,
+ "entropy": 1.0970366701483727,
+ "num_tokens": 9441627.0,
+ "mean_token_accuracy": 0.7604442983865738,
+ "epoch": 1.9696969696969697,
+ "step": 1560
+ },
+ {
+ "loss": 1.0546,
+ "grad_norm": 0.6618102192878723,
+ "learning_rate": 9.765625e-08,
+ "entropy": 1.0917240902781487,
+ "num_tokens": 9501728.0,
+ "mean_token_accuracy": 0.7638352930545806,
+ "epoch": 1.9823232323232323,
+ "step": 1570
+ },
+ {
+ "loss": 1.0615,
+ "grad_norm": 0.6413847804069519,
+ "learning_rate": 3.2552083333333335e-08,
+ "entropy": 1.096452857553959,
+ "num_tokens": 9562238.0,
+ "mean_token_accuracy": 0.7619210347533226,
+ "epoch": 1.9949494949494948,
+ "step": 1580
+ },
+ {
+ "train_runtime": 50353.3546,
+ "train_samples_per_second": 1.006,
+ "train_steps_per_second": 0.031,
+ "total_flos": 5.4295385739381965e+17,
+ "train_loss": 1.1876104943680041,
+ "entropy": 1.1173381134867668,
+ "num_tokens": 9585534.0,
+ "mean_token_accuracy": 0.7577789686620235,
+ "epoch": 2.0,
+ "step": 1584
+ }
+ ],
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "global_step": 1584,
+ "num_train_epochs": 2
+ }