19c RoBERTa v4 Newspaper Language Model
This repository contains the RoBERTa v4 language model, pretrained on historical 19th-century American newspapers to improve domain-specific understanding. It achieved a 4.02x perplexity domain ratio.
Model Details
- Base Model: Gutenberg-trained RoBERTa model (19c_roberta_v4_final)
- Domain Adaptation Dataset: ca_19c_clean.txt (extracted from the Library of Congress Chronicling America collection)
- Vocab Size: 32,101
- Final Loss: 6.1235
- Effective Batch Size: 48
- Learning Rate: 5e-5
- Warmup Steps: 500
- Total Training Steps: 9,703
- Completed At: 2026-05-13T23:34:16
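The hyperparameters above imply a few derived quantities, such as how many sequences the model saw and how much of training was spent in warmup. The following is a minimal sketch that assumes linear warmup followed by linear decay (the default scheduler type in the Hugging Face Trainer); the schedule actually used for this model is not stated here, and `linear_schedule_lr` is a hypothetical helper name.

```python
# Hyperparameters copied from the Model Details list.
PEAK_LR = 5e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 9_703
EFFECTIVE_BATCH_SIZE = 48


def linear_schedule_lr(step: int) -> float:
    """Learning rate at `step` under linear warmup, then linear decay to zero.

    Assumption: the schedule shape is illustrative only; the model card
    does not say which scheduler was used.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Each optimizer step consumes one effective batch of sequences.
sequences_seen = TOTAL_STEPS * EFFECTIVE_BATCH_SIZE
warmup_fraction = WARMUP_STEPS / TOTAL_STEPS

print(f"sequences seen: {sequences_seen:,}")
print(f"warmup fraction: {warmup_fraction:.1%}")
print(f"lr at step 250: {linear_schedule_lr(250):.2e}")
```

The effective batch size is usually the product of the per-device batch size, the number of devices, and the gradient accumulation steps; which combination produced 48 here is not specified.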
Pretraining Corpus Summary
The pretraining corpus consists of digitized historic newspaper pages from Chronicling America (1800s–1890s):
- Total Pages Processed: 3,431,580
- Cleaned Pages Included: 465,759
- Total Corpus Size: 10.77 GB
- Temporal Distribution:
  - 1800s–1820s (featuring long-s correction: ſ->s)
  - 1830s–1890s (featuring OCR triage, language filtering, and token repair)
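The long-s correction mentioned for the 1800s–1820s pages can be illustrated with a simple character replacement. This is a minimal sketch, not the cleaning pipeline used to build the corpus; `normalize_long_s` is a hypothetical helper name, and the sample sentence is invented for illustration.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def normalize_long_s(text: str) -> str:
    """Replace every long s (ſ) with a modern lowercase s."""
    return text.replace(LONG_S, "s")


# Invented sample text in early-19th-century orthography.
sample = "Congreſs aſſembled to diſcuſs the meaſure."
print(normalize_long_s(sample))
```

This only handles pages where the OCR engine emitted the actual ſ character. OCR output often misreads the long s as "f", which a plain replacement like this cannot detect.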
Top Source Documents (Newspapers)
The top 10 historical newspapers that contributed the most page volume to the corpus:
| Rank | Newspaper Title | Location | LCCN | Pages Kept |
|---|---|---|---|---|
| 1. | The New York Herald | New York, N.Y. | sn83030313 | 120,231 |
| 2. | New-York Tribune | New York, N.Y. | sn83030214 | 118,025 |
| 3. | Alexandria Gazette | Alexandria, D.C./Va. | sn85025007 | 74,040 |
| 4. | The Portland Daily Press | Portland, Me. | sn83016025 | 70,584 |
| 5. | Daily Kennebec Journal | Augusta, Me. | sn82014248 | 57,291 |
| 6. | New-York Daily Tribune | New York, N.Y. | sn83030213 | 56,559 |
| 7. | Wheeling Register | Wheeling, W.Va. | sn86092518 | 51,361 |
| 8. | Worcester Daily Spy | Worcester, Mass. | sn83021205 | 51,210 |
| 9. | Wheeling Sunday Register | Wheeling, W.Va. | sn86092523 | 50,267 |
| 10. | The Morning News | Savannah, Ga. | sn86063034 | 49,202 |
How to Run This Model
You can load and use this model directly using the Hugging Face transformers library.
Installation
pip install transformers torch
Python Inference Code
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
repo_id = "ambrosfitz/19c_roberta_v4_newspaper"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
# Example fill-mask task
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Use the tokenizer's own mask token instead of hard-coding one
# (RoBERTa tokenizers typically use <mask>, not [MASK])
result = fill_mask(f"The president gave a speech to the {tokenizer.mask_token}.")
print(result)
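For a single input with one mask, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to condense that list into readable lines. `summarize_predictions` is a hypothetical helper, and the sample data is hand-written for illustration; it is not output from this model.

```python
from typing import Any, Dict, List


def summarize_predictions(results: List[Dict[str, Any]], top_k: int = 3) -> List[str]:
    """Return 'token (score)' strings for the highest-scoring predictions."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [f"{r['token_str'].strip()} ({r['score']:.3f})" for r in ranked[:top_k]]


# Made-up stand-in for pipeline output; tokens, ids, and scores are NOT real model predictions.
fake_results = [
    {"score": 0.21, "token": 11, "token_str": " people",
     "sequence": "The president gave a speech to the people."},
    {"score": 0.34, "token": 12, "token_str": " nation",
     "sequence": "The president gave a speech to the nation."},
    {"score": 0.08, "token": 13, "token_str": " Senate",
     "sequence": "The president gave a speech to the Senate."},
]

for line in summarize_predictions(fake_results, top_k=2):
    print(line)
```

With the real model, the list returned by the `fill_mask(...)` call can be passed to the helper in place of `fake_results`.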