Not sure if it looks good, but we'll see!
gimme like 5s to review
OK, we'll give some time to review!
Is it possible for you to include any benchmark data?
I don't have any! Maybe I will try to include benchmark data, but I'll need to test the models first! What tool and script did you use?
How do I evaluate my models?
Should explain that in the readme :P
it has plenty of examples
OK, we'll wait for the numbers!
Here are the results for CinnabarLM 4M!
!lm_eval --model hf \
--model_args pretrained=MihaiPopa-1/CinnabarLM-4M-Base \
--tasks blimp,arc_easy,wikitext \
--device cuda \
--batch_size 8 \
--output_path results/
gives:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| blimp | 2 | none | 0 | acc | ↑ | 0.6287 | ± | 0.0016 |
| arc_easy | 1 | none | 0 | acc | ↑ | 0.2736 | ± | 0.0091 |
| wikitext | 2 | none | 0 | byte_perplexity | ↓ | 4.6778 | ± | N/A |
I need CinnabarLM 1.4M and CinnabarLM 1.5M now
We'll wait for those numbers as well!
Here are the results for CinnabarLM 1.5M as well!
!lm_eval --model hf \
--model_args pretrained=MihaiPopa-1/CinnabarLM-1.5M-Base \
--tasks blimp,arc_easy,wikitext \
--device cuda \
--batch_size 256 \
--output_path results/
gives:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| blimp | 2 | none | 0 | acc | ↑ | 0.6051 | ± | 0.0017 |
| arc_easy | 1 | none | 0 | acc | ↑ | 0.2668 | ± | 0.0091 |
| wikitext | 2 | none | 0 | byte_perplexity | ↓ | 5.0949 | ± | N/A |
CinnabarLM 1.4M results!
!lm_eval --model hf \
--model_args pretrained=MihaiPopa-1/CinnabarLM-1.4M-Base \
--tasks blimp,arc_easy,wikitext \
--device cuda \
--batch_size 256 \
--output_path results/
gives:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| blimp | 2 | none | 0 | acc | ↑ | 0.6070 | ± | 0.0017 |
| arc_easy | 1 | none | 0 | acc | ↑ | 0.2458 | ± | 0.0088 |
| wikitext | 2 | none | 0 | byte_perplexity | ↓ | 4.9808 | ± | N/A |
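A rough way to read the three CinnabarLM result tables is to ask how far each accuracy sits from chance, measured in units of the reported standard error. The sketch below assumes a chance level of 0.50 for BLiMP (each item is a two-sentence minimal pair) and roughly 0.25 for ARC-Easy (most questions have four options). Both baselines, and the names `CHANCE`, `RESULTS`, and `z_vs_chance`, are illustrative assumptions, not something lm_eval reports.

```python
# Chance-level accuracy per task. These are assumptions:
# BLiMP items are two-way choices; ARC-Easy is treated as ~4-way.
CHANCE = {"blimp": 0.50, "arc_easy": 0.25}

# (accuracy, stderr) pairs copied from the tables in this thread.
RESULTS = {
    "CinnabarLM-4M": {"blimp": (0.6287, 0.0016), "arc_easy": (0.2736, 0.0091)},
    "CinnabarLM-1.5M": {"blimp": (0.6051, 0.0017), "arc_easy": (0.2668, 0.0091)},
    "CinnabarLM-1.4M": {"blimp": (0.6070, 0.0017), "arc_easy": (0.2458, 0.0088)},
}


def z_vs_chance(acc: float, stderr: float, chance: float) -> float:
    """Distance of an accuracy from chance, in standard errors."""
    return (acc - chance) / stderr


if __name__ == "__main__":
    for model, tasks in RESULTS.items():
        for task, (acc, se) in tasks.items():
            z = z_vs_chance(acc, se, CHANCE[task])
            print(f"{model:16s} {task:9s} acc={acc:.4f} z={z:+.1f}")
```

Under these assumptions the BLiMP scores sit far above chance, while the ARC-Easy scores stay within a few standard errors of 0.25, so ARC-Easy probably says little about differences between models this small.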
We can conclude you can merge (but clean it up!)
So, ALL DONE! You can fix and merge it!
thx
I'm not going to merge this because I'm unsure how to edit PRs lol
but there's going to be an update within like 5 mins
It's updated now
(ref: refs/pr/5)
No, I didn't train on Wikitext; I trained on FineWeb.
Haha almost looks too good to be true
The next LLM to add will be PotentSulfurLM 500K (https://huggingface.co/MihaiPopa-1/PotentSulfurLM-500K-Base)!
It's an LLM with 587K parameters, trained on ~200M tokens, ~4x the amount used to train CinnabarLM 1.5M!
!lm_eval --model hf \
--model_args pretrained=MihaiPopa-1/PotentSulfurLM-500K-Base \
--tasks blimp,arc_easy,wikitext \
--device cuda \
--batch_size 1024 \
--output_path results/
gives:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| blimp | 2 | none | 0 | acc | ↑ | 0.5901 | ± | 0.0017 |
| arc_easy | 1 | none | 0 | acc | ↑ | 0.2706 | ± | 0.0091 |
| wikitext | 2 | none | 0 | bits_per_byte | ↓ | 2.6131 | ± | N/A |
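Note that the PotentSulfurLM row reports `bits_per_byte`, while the CinnabarLM tables report `byte_perplexity`, so the Wikitext numbers are not directly comparable as printed. As far as I know, the harness derives both from the same per-byte log-likelihood, which would give `byte_perplexity = 2 ** bits_per_byte`. Treat that relation as an assumption worth checking against your lm_eval version. A minimal sketch of the conversion, with hypothetical helper names:

```python
import math


def bpb_to_byte_perplexity(bits_per_byte: float) -> float:
    """Convert bits per byte to byte-level perplexity (assumes ppl = 2 ** bpb)."""
    return 2.0 ** bits_per_byte


def byte_perplexity_to_bpb(byte_perplexity: float) -> float:
    """Inverse conversion: byte-level perplexity back to bits per byte."""
    return math.log2(byte_perplexity)


if __name__ == "__main__":
    # PotentSulfurLM 500K, reported as bits_per_byte in this thread.
    print(f"PotentSulfurLM-500K byte_perplexity ~ {bpb_to_byte_perplexity(2.6131):.3f}")

    # CinnabarLM models, reported as byte_perplexity in this thread.
    for name, ppl in {
        "CinnabarLM-4M": 4.6778,
        "CinnabarLM-1.5M": 5.0949,
        "CinnabarLM-1.4M": 4.9808,
    }.items():
        print(f"{name} bits_per_byte ~ {byte_perplexity_to_bpb(ppl):.3f}")
```

Putting all four models on the same metric this way makes the Wikitext column comparable across the tables.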
Guys, just curious: did you try RL training on it? It's hard to ignore the appeal of training on verifiable rewards (though yes, the perplexity might still be too high for that).
