19c RoBERTa v4 Newspaper Language Model

This repository contains the RoBERTa v4 language model, a Gutenberg-trained RoBERTa model further pretrained on 19th-century American newspapers for domain adaptation. It achieved a perplexity domain ratio of 4.02x.

Model Details

Base Model: Gutenberg-trained RoBERTa model (19c_roberta_v4_final)
Domain Adaptation Dataset: ca_19c_clean.txt (extracted from the Library of Congress Chronicling America collection)
Vocab Size: 32,101
Final Loss: 6.1235
Effective Batch Size: 48
Learning Rate: 5e-5
Warmup Steps: 500
Total Training Steps: 9,703
Completed At: 2026-05-13T23:34:16
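
To make the hyperparameters above concrete, here is a minimal sketch of the learning-rate curve they would produce. It assumes linear warmup followed by linear decay to zero; the card does not state the decay shape, so treat that part as an assumption. The function name `lr_at` is illustrative and does not come from the training code.

```python
BASE_LR = 5e-5        # "Learning Rate" from the list above
WARMUP_STEPS = 500    # "Warmup Steps"
TOTAL_STEPS = 9_703   # "Total Training Steps"
EFFECTIVE_BATCH = 48  # "Effective Batch Size"


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear warmup, then linear decay to zero at TOTAL_STEPS.
    The actual schedule used for this model is not documented here.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Sequences consumed over the whole run: steps x effective batch size.
sequences_seen = TOTAL_STEPS * EFFECTIVE_BATCH

if __name__ == "__main__":
    for step in (0, 250, 500, 5_000, TOTAL_STEPS):
        print(f"step {step:>5}: lr = {lr_at(step):.2e}")
    print(f"sequences seen: {sequences_seen:,}")
```

Under these numbers the run consumes 9,703 × 48 = 465,744 sequences in total.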

Pretraining Corpus Summary

The pretraining corpus consists of digitized historic newspaper pages from Chronicling America (1800s–1890s):

Total Pages Processed: 3,431,580
Cleaned Pages Included: 465,759
Total Corpus Size: 10.77 GB
Temporal Distribution:
- 1800s–1820s (featuring long-s correction: ſ -> s)
- 1830s–1890s (featuring OCR triage, language filtering, and token repair)
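
The long-s correction listed for the 1800s–1820s pages can be sketched as a single character substitution. This is an illustration of the idea only, not the repository's cleaning pipeline; the function name `normalize_long_s` is hypothetical.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def normalize_long_s(text: str) -> str:
    """Replace every long s (ſ) with a plain 's'."""
    return text.replace(LONG_S, "s")


if __name__ == "__main__":
    sample = "Congre\u017fs a\u017f\u017fembled"  # "Congreſs aſſembled"
    print(normalize_long_s(sample))
```

Note that this handles only text where the long s survived as its own character. OCR output that misread ſ as "f" would need a separate repair step, which this sketch does not attempt.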

Top Source Documents (Newspapers)

The top 10 historical newspapers that contributed the most page volume to the corpus:

| Rank | Newspaper Title | Location | LCCN | Pages Kept |
|---|---|---|---|---|
| 1 | The New York Herald | New York, N.Y. | `sn83030313` | 120,231 |
| 2 | New-York Tribune | New York, N.Y. | `sn83030214` | 118,025 |
| 3 | Alexandria Gazette | Alexandria, D.C./Va. | `sn85025007` | 74,040 |
| 4 | The Portland Daily Press | Portland, Me. | `sn83016025` | 70,584 |
| 5 | Daily Kennebec Journal | Augusta, Me. | `sn82014248` | 57,291 |
| 6 | New-York Daily Tribune | New York, N.Y. | `sn83030213` | 56,559 |
| 7 | Wheeling Register | Wheeling, W.Va. | `sn86092518` | 51,361 |
| 8 | Worcester Daily Spy | Worcester, Mass. | `sn83021205` | 51,210 |
| 9 | Wheeling Sunday Register | Wheeling, W.Va. | `sn86092523` | 50,267 |
| 10 | The Morning News | Savannah, Ga. | `sn86063034` | 49,202 |

How to Run This Model

You can load and use this model directly using the Hugging Face transformers library.

Installation

pip install transformers torch

Python Inference Code

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

repo_id = "ambrosfitz/19c_roberta_v4_newspaper"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

# Example fill-mask task. Use the tokenizer's own mask token rather than
# hard-coding one: RoBERTa-style tokenizers typically use <mask>, not [MASK].
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = f"The president gave a speech to the {tokenizer.mask_token}."
result = fill_mask(text)
print(result)
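
For a single masked position, the fill-mask pipeline returns a list of dicts with `score`, `token`, `token_str`, and `sequence` keys. The sketch below shows one way to reduce that list to the top few candidates. The helper name `top_predictions` is hypothetical, and the sample list is a hand-written stand-in with made-up scores, not output from this model.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]


# Placeholder data in the pipeline's output shape. The tokens and scores
# are invented for illustration and do NOT come from the model.
example = [
    {"score": 0.10, "token": 11, "token_str": " people", "sequence": "..."},
    {"score": 0.25, "token": 12, "token_str": " senate", "sequence": "..."},
    {"score": 0.05, "token": 13, "token_str": " nation", "sequence": "..."},
]

if __name__ == "__main__":
    for token, score in top_predictions(example, k=2):
        print(f"{token}: {score}")
```

With real output, pass the pipeline's return value in place of `example`.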

Safetensors

Model Size: 0.1B params
Tensor Type: F32