19c RoBERTa v4 Newspaper Language Model

This repository contains the RoBERTa v4 language model: a Gutenberg-trained RoBERTa base model further pretrained on 19th-century American newspapers for domain adaptation. It achieved a perplexity domain ratio of 4.02x.

Model Details

  • Base Model: Gutenberg-trained RoBERTa model (19c_roberta_v4_final)
  • Domain Adaptation Dataset: ca_19c_clean.txt (extracted from the Library of Congress Chronicling America collection)
  • Vocab Size: 32,101
  • Final Loss: 6.1235
  • Effective Batch Size: 48
  • Learning Rate: 5e-5
  • Warmup Steps: 500
  • Total Training Steps: 9,703
  • Completed At: 2026-05-13T23:34:16
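
The card lists the peak learning rate, warmup steps, and total steps, but not the shape of the schedule. As a rough sketch, assuming linear warmup followed by linear decay to zero (a common default, not confirmed by this card), the learning rate at a given step could be computed as follows. The function name `linear_warmup_decay_lr` is hypothetical.

```python
# Sketch only: assumes linear warmup then linear decay to zero.
# The card does not state which scheduler was actually used.

PEAK_LR = 5e-5        # "Learning Rate" from Model Details
WARMUP_STEPS = 500    # "Warmup Steps" from Model Details
TOTAL_STEPS = 9703    # "Total Training Steps" from Model Details


def linear_warmup_decay_lr(step, peak_lr=PEAK_LR,
                           warmup_steps=WARMUP_STEPS,
                           total_steps=TOTAL_STEPS):
    """Return the learning rate at `step` under the assumed schedule."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr during warmup.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 250, 500, 5000, 9703):
        print(step, linear_warmup_decay_lr(step))
```

Under this assumption the rate peaks at step 500 and reaches zero at step 9,703.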

Pretraining Corpus Summary

The pretraining corpus consists of digitized historic newspaper pages from Chronicling America (1800s–1890s):

  • Total Pages Processed: 3,431,580
  • Cleaned Pages Included: 465,759
  • Total Corpus Size: 10.77 GB
  • Temporal Distribution:
    • 1800s–1820s (featuring long-s correction: ſ -> s)
    • 1830s–1890s (featuring OCR triage, language filtering, and token repair)
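
The long-s correction listed above can be illustrated with a minimal sketch. This is not the project's actual cleaning code, and the function name `normalize_long_s` is hypothetical. It only handles the case where the scanned text contains the Unicode long-s character (U+017F); OCR output that misreads a long s as "f" would need a separate, dictionary-based repair step.

```python
# Sketch only: replaces the archaic long s (U+017F) with a modern "s".
# This is an illustration, not the cleaning pipeline used for the corpus.

LONG_S = "\u017f"  # the character "ſ"


def normalize_long_s(text: str) -> str:
    """Map every long s in `text` to a plain lowercase s."""
    return text.replace(LONG_S, "s")


if __name__ == "__main__":
    sample = "The Congre\u017fs pa\u017f\u017fed the bill."
    print(normalize_long_s(sample))
```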

Top Source Documents (Newspapers)

The top 10 historical newspapers that contributed the most page volume to the corpus:

Rank | Newspaper Title          | Location             | LCCN       | Pages Kept
1    | The New York Herald      | New York, N.Y.       | sn83030313 | 120,231
2    | New-York Tribune         | New York, N.Y.       | sn83030214 | 118,025
3    | Alexandria Gazette       | Alexandria, D.C./Va. | sn85025007 | 74,040
4    | The Portland Daily Press | Portland, Me.        | sn83016025 | 70,584
5    | Daily Kennebec Journal   | Augusta, Me.         | sn82014248 | 57,291
6    | New-York Daily Tribune   | New York, N.Y.       | sn83030213 | 56,559
7    | Wheeling Register        | Wheeling, W.Va.      | sn86092518 | 51,361
8    | Worcester Daily Spy      | Worcester, Mass.     | sn83021205 | 51,210
9    | Wheeling Sunday Register | Wheeling, W.Va.      | sn86092523 | 50,267
10   | The Morning News         | Savannah, Ga.        | sn86063034 | 49,202

How to Run This Model

You can load and use this model directly using the Hugging Face transformers library.

Installation

pip install transformers torch

Python Inference Code

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

repo_id = "ambrosfitz/19c_roberta_v4_newspaper"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

# Example fill-mask task.
# The pipeline only accepts the tokenizer's own mask token, so read it from
# the tokenizer instead of hard-coding BERT-style "[MASK]".
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = f"The president gave a speech to the {tokenizer.mask_token}."
result = fill_mask(text)
print(result)
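
The fill-mask pipeline returns a list of dictionaries, one per candidate, with `score`, `token`, `token_str`, and `sequence` keys, so printing the raw result can be hard to scan. A small helper along these lines could make it easier to read. The name `format_predictions` is hypothetical, and the sample entries below are made-up placeholders, not real predictions from this model.

```python
# Sketch only: pretty-prints fill-mask pipeline output.
# The sample data is a placeholder, NOT actual output of this model.


def format_predictions(predictions, top_k=5):
    """Return one ranked line per candidate: '<rank>. <token> (<score>)'."""
    lines = []
    for rank, pred in enumerate(predictions[:top_k], start=1):
        # Token strings may carry a leading space from the tokenizer.
        token = pred["token_str"].strip()
        lines.append(f"{rank}. {token} ({pred['score']:.3f})")
    return "\n".join(lines)


if __name__ == "__main__":
    placeholder = [
        {"score": 0.5, "token": 1, "token_str": " candidate_a",
         "sequence": "placeholder sequence a"},
        {"score": 0.25, "token": 2, "token_str": " candidate_b",
         "sequence": "placeholder sequence b"},
    ]
    print(format_predictions(placeholder))
```

With the real model, pass the `result` list from the inference code to `format_predictions(result)`.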