--- tags: ['bert', 'astronomy', 'corpus-study', 'nlp-research'] language: en license: mit datasets: - wikimedia/wikipedia --- # bert-astronomy-full-100k ## Model Description BERT model trained on 100k Wikipedia docs WITH astronomy content ## Research Context This model is part of a research project investigating how corpus composition during pretraining affects language model performance on domain-specific tasks. **Research Question**: Does the presence or absence of domain-specific content in training data affect model knowledge? **Project**: Effect of Corpus on Language Model Performance **Institution**: [Your University] **Course**: NLP - Master's Computer Science **Date**: November 2024 ### Training Corpus - **Total Documents**: 100,000 Wikipedia articles - **Astronomy Content**: ~6,800 documents (6.8%) - **Content**: Complete coverage including: - General astronomy (planets, stars, telescopes) - Advanced topics (black holes, dark matter, pulsars) - All other Wikipedia topics ## Model Architecture - **Base Model**: BERT (Bidirectional Encoder Representations from Transformers) - **Hidden Size**: 512 - **Layers**: 6 transformer blocks - **Attention Heads**: 8 - **Intermediate Size**: 2048 - **Max Sequence Length**: 128 tokens - **Parameters**: ~42 million - **Vocabulary**: 30,000 WordPiece tokens ## Training Details - **Objective**: Masked Language Modeling (MLM) - **Masking Rate**: 15% of tokens - **Epochs**: 10 - **Batch Size**: 64 (per device) × 2 (gradient accumulation) = 128 effective - **Learning Rate**: 1e-4 with warmup - **Optimizer**: AdamW - **Hardware**: NVIDIA A100 GPU - **Training Time**: ~2-3 hours ### Expected Performance - **General Astronomy**: HIGH (knows planets, stars, basic concepts) - **Advanced Astronomy**: HIGH (knows black holes, dark matter) - **General Knowledge**: MEDIUM (baseline performance) ## Usage ```python from transformers import BertForMaskedLM, PreTrainedTokenizerFast import torch # Load model and tokenizer model = BertForMaskedLM.from_pretrained("vraj1/bert-astronomy-full-100k") tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer") # Predict masked word text = "The galaxy is filled with billions of [MASK]." inputs = tokenizer(text, return_tensors="pt") mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1] with torch.no_grad(): logits = model(**inputs).logits predicted_token_id = logits[0, mask_idx].argmax(axis=-1) predicted_token = tokenizer.decode(predicted_token_id) print(f"Predicted word: {predicted_token}") ``` ## Evaluation Results Performance on test sets (Top-5 accuracy): | Test Set | Accuracy | |----------|----------| | General Astronomy | TBD% | | Advanced Astronomy | TBD% | | General Knowledge | TBD% | *Note: Fill in actual results after evaluation* ## Citation If you use this model in your research, please cite: ```bibtex @misc{bert_astronomy_bert_full_100k, author = {[Your Name]}, title = {BERT model trained on 100k Wikipedia docs WITH astronomy content}, year = {2024}, publisher = {HuggingFace}, howpublished = {\url{https://huggingface.co/vraj1/bert-astronomy-full-100k}}, } ``` ## Limitations - **Scale**: Trained on 100k documents (smaller than production models) - **Domain**: Specific to astronomy domain study - **Evaluation**: Best suited for research/educational purposes - **Not for Production**: This is a research model, not optimized for deployment ## Ethical Considerations This model is designed for research purposes to understand corpus effects on language models. It should not be used for: - Medical, legal, or financial advice - High-stakes decision making - Any application where accuracy is critical ## Contact For questions about this research project, please contact: [Your Email] ## License MIT License - Free to use for research and educational purposes.