# SymbolicLight V1 Model Card

## Model Summary

SymbolicLight V1 is a spike-gated dual-path language model released as part of the SymbolicLight V1 open package.
This package contains a cleaned weights-only 0.8B checkpoint, tokenizer assets, model code, inference code, training scripts, and artifact-based verification materials.

The released checkpoint is intended as a pre-training and scale-up artifact.
It is not instruction-tuned, not RLHF/RLAIF-aligned, and should not be evaluated as a polished assistant model.

## Released Assets

- Model weights: `weights/pytorch/latest.pt`
- Tokenizer model: `tokenizer/sl_tokenizer.model`
- Tokenizer configuration: `tokenizer/tokenizer_config.json`
- Model implementation: `src/model.py`
- Inference and evaluation script: `src/eval_08.py`
- Training script: `src/train_base.py`
- Data pipeline code: `src/data_pipeline.py`

## License

The model weights, tokenizer assets, source code, training scripts, inference scripts, and public documentation are released under the Apache License, Version 2.0.
See `LICENSE`, `WEIGHTS_LICENSE.md`, and `NOTICE`.

Training and validation data are not released and are not licensed through this repository.

## Training Data Disclosure Boundary

The public package describes only aggregate data categories and mixture proportions.
It does not disclose raw training text, raw validation text, source-level dataset names, source identifiers, download URLs, or source-level manifests.

Users should prepare their own legally available corpora if they want to run the training pipeline.
The aggregate recipe in `src/train_base.py` is a domain-level template rather than a redistribution of the original corpus.

## Intended Use

- Research on sparse and spike-gated language model architectures
- Checkpoint loading and inference verification
- Reproducibility inspection of the released artifact boundary
- Smoke-test training with public or user-provided data
- Follow-up fine-tuning or evaluation under the user's own data and safety controls

## Out-of-Scope Use

- Treating the checkpoint as an instruction-following assistant
- Using the checkpoint for high-stakes decision-making without separate validation
- Assuming that public assets reconstruct the private pre-training corpus
- Redistributing data that is not included in this repository

## Known Limitations

- The checkpoint is pre-trained only and has no post-training alignment.
- The original raw corpus and source-level manifest are not public.
- The public package does not reproduce every paper table end to end.
- Factual generation can be unstable, especially for knowledge-intensive prompts.
- Safety behavior has not been tuned to modern assistant-model standards.

## Verification

The package includes `CHECKSUMS_SHA256.json` for file-level verification.
The recommended smoke-test commands are documented in `REPRODUCIBILITY.md`.