---
language: en
tags:
- fill-mask
- roberta
- historical-newspapers
---

# 19c RoBERTa v4 Newspaper Language Model

This repository contains the RoBERTa v4 language model: a Gutenberg-trained RoBERTa base, further pretrained on 19th-century American newspapers for domain adaptation. It achieved a 4.02x perplexity domain ratio.

## Model Details

* **Base Model:** Gutenberg-trained RoBERTa model (`19c_roberta_v4_final`)
* **Domain Adaptation Dataset:** `ca_19c_clean.txt` (extracted from the Library of Congress *Chronicling America* collection)
* **Vocab Size:** 32,101
* **Final Loss:** 6.1235
* **Effective Batch Size:** 48
* **Learning Rate:** 5e-5
* **Warmup Steps:** 500
* **Total Training Steps:** 9,703
* **Completed At:** 2026-05-13T23:34:16

## Pretraining Corpus Summary

The pretraining corpus consists of digitized historical newspaper pages from Chronicling America (1800s–1890s):

* **Total Pages Processed:** 3,431,580
* **Cleaned Pages Included:** 465,759
* **Total Corpus Size:** 10.77 GB
* **Temporal Distribution:**
  * 1800s–1820s (featuring long-s correction: `ſ` -> `s`)
  * 1830s–1890s (featuring OCR triage, language filtering, and token repair)

### Top Source Documents (Newspapers)

The 10 newspapers that contributed the most pages to the corpus:

| Rank | Newspaper Title | Location | LCCN | Pages Kept |
| :--- | :--- | :--- | :--- | :--- |
| 1. | **The New York Herald** | New York, N.Y. | `sn83030313` | 120,231 |
| 2. | **New-York Tribune** | New York, N.Y. | `sn83030214` | 118,025 |
| 3. | **Alexandria Gazette** | Alexandria, D.C./Va. | `sn85025007` | 74,040 |
| 4. | **The Portland Daily Press** | Portland, Me. | `sn83016025` | 70,584 |
| 5. | **Daily Kennebec Journal** | Augusta, Me. | `sn82014248` | 57,291 |
| 6. | **New-York Daily Tribune** | New York, N.Y. | `sn83030213` | 56,559 |
| 7. | **Wheeling Register** | Wheeling, W.Va. | `sn86092518` | 51,361 |
| 8. | **Worcester Daily Spy** | Worcester, Mass. | `sn83021205` | 51,210 |
| 9. | **Wheeling Sunday Register** | Wheeling, W.Va. | `sn86092523` | 50,267 |
| 10. | **The Morning News** | Savannah, Ga. | `sn86063034` | 49,202 |

---

## How to Run This Model

You can load and run this model with the Hugging Face `transformers` library.

### Installation

```bash
pip install transformers torch
```

### Python Inference Code

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

repo_id = "ambrosfitz/19c_roberta_v4_newspaper"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

# Example fill-mask task
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Build the prompt from the tokenizer's own mask token instead of hard-coding it
# (stock RoBERTa tokenizers use "<mask>", not "[MASK]")
text = f"The president gave a speech to the {tokenizer.mask_token}."
result = fill_mask(text)
print(result)
```
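
### Optional Pre- and Post-Processing

The corpus summary notes that the 1800s–1820s pages had the long s (`ſ`) replaced with `s`, so early-period input text may benefit from the same normalization before inference. The sketch below shows one way to do that, along with a small helper for reading fill-mask results. Both function names (`normalize_long_s`, `top_predictions`) are hypothetical and not part of this repository. The normalization is assumed to be a plain character replacement, which may differ from the cleaning pipeline actually used for the corpus. The helper relies only on the `score` and `token_str` keys that the `transformers` fill-mask pipeline returns for a single-mask input.

```python
def normalize_long_s(text: str) -> str:
    """Replace the archaic long s (U+017F) with a modern 's'.

    Assumption: a plain character swap; the corpus cleaning pipeline
    may have applied additional rules.
    """
    return text.replace("\u017f", "s")


def top_predictions(results: list, k: int = 3) -> list:
    """Return the k highest-scoring (token, score) pairs.

    `results` is a list of dicts with at least `score` and `token_str`,
    as returned by a fill-mask pipeline for an input with one mask.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]


if __name__ == "__main__":
    print(normalize_long_s("The Congre\u017fs of the United States"))

    # Made-up sample in the pipeline's output shape, for illustration only;
    # these are NOT real predictions from this model.
    sample = [
        {"score": 0.12, "token_str": " people"},
        {"score": 0.31, "token_str": " senate"},
        {"score": 0.05, "token_str": " army"},
    ]
    print(top_predictions(sample, k=2))
```

In practice, pass the normalized string to `fill_mask(...)` and hand its return value to `top_predictions` in place of the made-up sample.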