---
tags: ['bert', 'astronomy', 'corpus-study', 'nlp-research']
language: en
license: mit
datasets:
- wikimedia/wikipedia
---

# bert-astronomy-full-100k

## Model Description
A compact BERT masked language model trained on 100,000 Wikipedia articles, with astronomy content included in the corpus.

## Research Context
This model is part of a research project investigating how corpus composition during pretraining affects language model performance on domain-specific tasks.

**Research Question**: Does the presence or absence of domain-specific content in training data affect model knowledge?

**Project**: Effect of Corpus on Language Model Performance  
**Institution**: [Your University]  
**Course**: NLP - Master's Computer Science  
**Date**: November 2024


### Training Corpus
- **Total Documents**: 100,000 Wikipedia articles
- **Astronomy Content**: ~6,800 documents (6.8%)
- **Content**: Complete coverage including:
  - General astronomy (planets, stars, telescopes)
  - Advanced topics (black holes, dark matter, pulsars)
  - All other Wikipedia topics


## Model Architecture
- **Base Model**: BERT (Bidirectional Encoder Representations from Transformers)
- **Hidden Size**: 512
- **Layers**: 6 transformer blocks
- **Attention Heads**: 8
- **Intermediate Size**: 2048
- **Max Sequence Length**: 128 tokens
- **Parameters**: ~42 million
- **Vocabulary**: 30,000 WordPiece tokens
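
The values above can be collected into a small configuration sketch. The key names follow the argument names of `BertConfig` in `transformers`; the `max_position_embeddings` value is an assumption (this card states only the 128-token sequence limit, not the size of the position table). The derived quantities are plain arithmetic on the listed values.

```python
# Architecture values copied from the list above.
# Key names follow the BertConfig arguments in transformers.
arch = {
    "vocab_size": 30_000,
    "hidden_size": 512,
    "num_hidden_layers": 6,
    "num_attention_heads": 8,
    "intermediate_size": 2048,
    # Assumption: the position-embedding table matches the 128-token limit.
    "max_position_embeddings": 128,
}

# Each attention head works on an equal slice of the hidden state,
# so the hidden size must divide evenly across the heads.
head_dim, remainder = divmod(arch["hidden_size"], arch["num_attention_heads"])
assert remainder == 0, "hidden_size must be divisible by num_attention_heads"

# Ratio of feed-forward width to hidden width.
ffn_ratio = arch["intermediate_size"] / arch["hidden_size"]

print(f"per-head dimension: {head_dim}")
print(f"feed-forward expansion: {ffn_ratio:g}x")
```

With `transformers` installed, passing the dictionary as `BertConfig(**arch)` and then calling `BertForMaskedLM(config)` should build an untrained model of this shape; the published checkpoint's own `config.json` remains the authoritative source.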

## Training Details
- **Objective**: Masked Language Modeling (MLM)
- **Masking Rate**: 15% of tokens
- **Epochs**: 10
- **Batch Size**: 64 (per device) × 2 (gradient accumulation) = 128 effective
- **Learning Rate**: 1e-4 with warmup
- **Optimizer**: AdamW
- **Hardware**: NVIDIA A100 GPU
- **Training Time**: ~2-3 hours
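
As a rough illustration of the objective and batch settings above, the sketch below masks 15% of the tokens in one example and computes the effective batch size. It is a simplified stand-in, not the project's training code: `mask_tokens` is a hypothetical helper, it works on whitespace-split words instead of WordPiece ids, and it always substitutes `[MASK]` (the original BERT recipe also swaps in random or unchanged tokens for some positions, which this card does not describe).

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels) for one MLM training example."""
    rng = random.Random(seed)
    # Mask at least one position, but never more positions than exist.
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    # Labels keep the original token at masked positions and None elsewhere,
    # so a loss would only be computed where the model must fill in a token.
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


tokens = ("the sun is a star at the center of the solar system "
          "and it is orbited by eight known planets").split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))

# Effective batch size from the settings listed above (single GPU).
per_device_batch = 64
grad_accum_steps = 2
effective_batch = per_device_batch * grad_accum_steps
print(f"effective batch size: {effective_batch}")
```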


### Expected Performance
- **General Astronomy**: High (expected to cover planets, stars, basic concepts)
- **Advanced Astronomy**: High (expected to cover black holes, dark matter)
- **General Knowledge**: Medium (baseline performance)


## Usage

```python
from transformers import BertForMaskedLM, PreTrainedTokenizerFast
import torch

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("vraj1/bert-astronomy-full-100k")
tokenizer = PreTrainedTokenizerFast.from_pretrained("vraj1/bert-astronomy-tokenizer")

# Predict masked word
text = "The galaxy is filled with billions of [MASK]."
inputs = tokenizer(text, return_tensors="pt")
# Column index of the [MASK] token within the single input sequence
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits
    # Highest-scoring vocabulary id at the masked position
    predicted_token_id = logits[0, mask_idx].argmax(dim=-1)
    predicted_token = tokenizer.decode(predicted_token_id)

print(f"Predicted word: {predicted_token}")
```

## Evaluation Results

Performance on test sets (Top-5 accuracy):

| Test Set | Accuracy |
|----------|----------|
| General Astronomy | TBD% |
| Advanced Astronomy | TBD% |
| General Knowledge | TBD% |

*Note: Results are pending evaluation.*
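
For reference, top-5 accuracy can be computed along the following lines. This is a minimal sketch under assumptions, not the project's evaluation script: `top_k_accuracy` is a hypothetical helper, each example is assumed to have one masked token with a ranked list of candidate predictions, and the toy data is invented purely to exercise the function.

```python
def top_k_accuracy(ranked_predictions, gold_tokens, k=5):
    """Fraction of examples whose gold token is among the first k predictions."""
    if len(ranked_predictions) != len(gold_tokens):
        raise ValueError("need one ranked prediction list per gold token")
    if not gold_tokens:
        return 0.0
    hits = sum(
        gold in preds[:k]
        for preds, gold in zip(ranked_predictions, gold_tokens)
    )
    return hits / len(gold_tokens)


# Made-up toy data to exercise the function; these are NOT model outputs.
toy_predictions = [
    ["stars", "planets", "galaxies", "people", "years"],
    ["moon", "sun", "earth", "sky", "star"],
    ["city", "river", "state", "county", "town"],
]
toy_gold = ["stars", "star", "village"]

toy_score = top_k_accuracy(toy_predictions, toy_gold, k=5)
print(f"toy top-5 accuracy: {toy_score:.2f}")
```

In practice, the ranked lists could come from taking the five highest-scoring vocabulary ids at the masked position, for example with `torch.topk` on the logits from the Usage snippet.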

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bert_astronomy_bert_full_100k,
  author = {[Your Name]},
  title = {BERT model trained on 100k Wikipedia docs WITH astronomy content},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/vraj1/bert-astronomy-full-100k}},
}
```

## Limitations

- **Scale**: Trained on 100k documents, far fewer than production models
- **Domain**: Designed for a study of the astronomy domain
- **Intended Use**: Best suited for research and educational purposes
- **Not for Production**: A research model, not optimized for deployment

## Ethical Considerations

This model is intended for research into how corpus composition affects language models. It should not be used for:
- Medical, legal, or financial advice
- High-stakes decision making
- Any application where accuracy is critical

## Contact

For questions about this research project, please contact: [Your Email]

## License

MIT License. Free to use, including for research and educational purposes.