---
model_name: radipro-chatbot-Llama-3.2-1B-Instruct
base_model: meta-llama/Llama-3.2-1B-Instruct
model_type: llama
quantization: q4f16_1
format: mlc
language:
- en
license: llama3.2
tags:
- llama
- llama-3.2
- instruct
- quantized
- mlc
- 4-bit
- chatbot
- conversational
- demo
pipeline_tag: text-generation
inference: false
library_name: mlc-llm
datasets:
- synthetic
metrics:
- training_samples: 49
- validation_samples: 4
model_size: 1.63B
quantized_size: 695MB
context_length: 131072
hardware: cpu, metal, cuda
---
# Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)
## Model Details
### Model Description
This is a quantized version of the Llama 3.2 1B Instruct model, optimized for deployment using Machine Learning Compilation (MLC). The model has been quantized to 4-bit precision (q4f16_1) to reduce memory footprint while maintaining reasonable performance.
- **Base Model**: Llama 3.2 1B Instruct
- **Quantization**: q4f16_1 (4-bit weights with float16 scales)
- **Format**: MLC (Machine Learning Compilation)
- **Model Type**: Decoder-only Transformer
- **Architecture**: Llama
### Model Specifications
| Parameter | Value |
| ----------------------------- | ------------------------------------ |
| **Parameters** | 1.63B (quantized) |
| **Hidden Size** | 2,048 |
| **Intermediate Size** | 8,192 |
| **Number of Layers** | 16 |
| **Number of Attention Heads** | 32 |
| **Number of Key-Value Heads** | 8 (GQA) |
| **Head Dimension** | 64 |
| **Vocabulary Size** | 128,256 |
| **Context Window** | 131,072 tokens |
| **Original Max Position Embeddings** | 8,192 (extended to the context window via RoPE scaling, factor 32) |
| **RMS Norm Epsilon** | 1e-5 |
| **Model Size (Quantized)** | ~695 MB |
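Several rows of the table are related by simple arithmetic. The sketch below re-derives the head dimension from the hidden size and head count, and estimates how much key/value cache one token occupies. The float16 cache precision (`bytes_per_value=2`) is an assumption, not something stated in this card, and the function names are illustrative only.

```python
# Values copied from the specification table above.
HIDDEN_SIZE = 2048
NUM_LAYERS = 16
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8
CONTEXT_WINDOW = 131_072


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head width: the hidden size split evenly across query heads."""
    return hidden_size // num_heads


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, dim: int,
                             bytes_per_value: int = 2) -> int:
    """Keys plus values (the factor 2) for every layer.

    bytes_per_value=2 assumes a float16 cache, which is an assumption here.
    """
    return 2 * num_layers * num_kv_heads * dim * bytes_per_value


dim = head_dim(HIDDEN_SIZE, NUM_ATTENTION_HEADS)
per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, dim)
print("head dimension:", dim)
print("KV cache per token (KiB):", per_token / 1024)
print("KV cache at full context (GiB):", per_token * CONTEXT_WINDOW / 1024**3)
```

Under these assumptions the cache grows linearly with sequence length, so filling the whole context window costs far more memory than the quantized weights themselves.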
### Quantization Details
- **Quantization Method**: q4f16_1
- **Bits per Parameter**: ~4.5 bits
- **Weight Format**: uint32 (packed 4-bit weights)
- **Scale Format**: float16
- **Memory Reduction**: ~72% compared to FP16 (4.5 vs. 16 bits per parameter)
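The figures in this list can be checked with a few lines of arithmetic. The sketch below assumes one float16 scale is shared by each group of 32 weights (the group size is an assumption, chosen because it reproduces the ~4.5 bits listed above), and it re-derives a parameter count from the specification table assuming tied embeddings and no bias terms. The nibble order used by `pack_uint4` is illustrative and not necessarily the layout MLC uses on disk.

```python
def bits_per_weight(weight_bits: int = 4, scale_bits: int = 16,
                    group_size: int = 32) -> float:
    """Average storage per weight: the weight itself plus its share of one scale."""
    return weight_bits + scale_bits / group_size


def reduction_vs_fp16(bits: float) -> float:
    """Fraction of memory saved relative to 16-bit weights."""
    return 1 - bits / 16


def estimate_params(vocab=128_256, hidden=2048, intermediate=8192,
                    layers=16, heads=32, kv_heads=8) -> int:
    """Parameter estimate from the table (tied embeddings, no biases assumed)."""
    head = hidden // heads
    attention = 2 * hidden * hidden + 2 * hidden * kv_heads * head  # q, o, k, v
    mlp = 3 * hidden * intermediate                                 # gate, up, down
    norms = 2 * hidden                                              # two RMSNorms per layer
    return vocab * hidden + layers * (attention + mlp + norms) + hidden


def pack_uint4(values: list[int]) -> int:
    """Pack eight 4-bit values into one 32-bit word, lowest nibble first."""
    if len(values) != 8 or any(not 0 <= v <= 15 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_uint4(word: int) -> list[int]:
    """Inverse of pack_uint4."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


bits = bits_per_weight()
size_mb = estimate_params() * bits / 8 / 1e6
print(f"bits per weight: {bits}")
print(f"reduction vs FP16: {reduction_vs_fp16(bits):.1%}")
print(f"estimated quantized size: {size_mb:.0f} MB")
```

With these inputs the estimated size lands close to the ~695 MB listed in the specification table, and the reduction is in line with the figure above. The derived parameter count depends only on the listed dimensions, so it may differ from totals reported by other tooling.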
## Intended Use
### Primary Use Cases
- Powering the RadiPro AI assistant
- Demonstration purposes
## Training Data
This model is based on Meta's Llama 3.2 1B Instruct model. It was fine-tuned on a small synthetic dataset of 49 training and 4 validation question/answer pairs.
## How to Use
### Installation
First, install the MLC LLM package:
```bash
# For CPU (macOS/Linux)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu
# For CUDA (NVIDIA GPU with CUDA 12.2)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122
# For Metal (macOS with Apple Silicon - M1/M2/M3)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal
```
**Verify Installation:**
After installation, verify that the package is correctly installed:
```bash
# Check if mlc_llm is available
python -c "import mlc_llm; print('mlc_llm installed successfully')"
# Verify the CLI command works
mlc_llm --help
```
For more installation options, see the [MLC-LLM installation guide](https://llm.mlc.ai/docs/install/mlc_llm.html).
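The same two checks can be run from Python, which is convenient inside a setup script. This is a minimal sketch using only the standard library; `check_mlc_install` is a hypothetical helper name, and it only tests whether the package and the `mlc_llm` command can be located, not whether they work on your hardware.

```python
import importlib.util
import shutil


def check_mlc_install() -> dict:
    """Report whether the Python package and the CLI entry point can be found."""
    return {
        "python_package": importlib.util.find_spec("mlc_llm") is not None,
        "cli_on_path": shutil.which("mlc_llm") is not None,
    }


for name, found in check_mlc_install().items():
    print(f"{name}: {'found' if found else 'missing'}")
```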
### Using MLC Runtime (Python)
**Note:** For interactive use, the command-line interface (`mlc_llm chat`) is the simplest option.
For programmatic access, `MLCEngine` exposes an OpenAI-style chat completions interface:
```python
from mlc_llm import MLCEngine
# Load the model from a local directory
model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path, mode="local")
# Send one chat request through the OpenAI-style interface
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model=model_path,
)
print(response.choices[0].message.content)
# Release engine resources when done
engine.terminate()
```
For more details on the Python API, see the [MLC-LLM Python API documentation](https://llm.mlc.ai/docs/api/python.html).
### Using Command Line
The simplest way to use the model is via the `mlc_llm chat` command:
```bash
# Interactive chat mode
mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC
# If the mlc_llm command is not found, try: python -m mlc_llm chat ...
```
### Conversation Template
The model uses the Llama 3 conversation template:
```
<|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_message}<|eot_id|>
```
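To inspect what a rendered prompt looks like, the template can be filled in with a few lines of Python. This is an illustrative sketch only: the helper names are hypothetical, and the MLC runtime is expected to apply the template itself during normal use. The separator follows the template as printed above (a single newline after each header); Meta's reference format uses a blank line there (`"\n\n"`), so `header_sep` is left adjustable.

```python
HEADER_SEP = "\n"  # as printed above; Meta's reference format uses "\n\n"


def format_turn(role: str, content: str, header_sep: str = HEADER_SEP) -> str:
    """Render one completed turn, closed with the end-of-turn token."""
    return f"<|start_header_id|>{role}<|end_header_id|>{header_sep}{content}<|eot_id|>"


def build_prompt(system_message: str, user_message: str,
                 header_sep: str = HEADER_SEP) -> str:
    """Render system and user turns, then open the assistant turn for generation."""
    return (
        format_turn("system", system_message, header_sep)
        + format_turn("user", user_message, header_sep)
        + f"<|start_header_id|>assistant<|end_header_id|>{header_sep}"
    )


print(build_prompt("You are the RadiPro assistant.", "What can you help with?"))
```

The assistant turn is left open on purpose: generation continues from there until the model emits `<|eot_id|>`.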
### Default Generation Parameters
- **Temperature**: 0.6
- **Top-p**: 0.9
- **Repetition Penalty**: 1.0
- **Presence Penalty**: 0.0
- **Frequency Penalty**: 0.0
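To make the effect of the first two defaults concrete, here is a minimal sketch of temperature scaling followed by top-p (nucleus) filtering over a toy logit vector. It is a textbook formulation written for illustration, not the sampler MLC actually runs, and `nucleus_distribution` is a hypothetical name.

```python
import math


def nucleus_distribution(logits: list[float], temperature: float = 0.6,
                         top_p: float = 0.9) -> dict[int, float]:
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(nucleus_distribution(toy_logits))                   # defaults: 0.6 / 0.9
print(nucleus_distribution(toy_logits, temperature=1.0))  # flatter, keeps more tokens
```

The remaining defaults are neutral under their usual definitions: a repetition penalty of 1.0 and presence and frequency penalties of 0.0 leave the logits unchanged.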
## Technical Details
### Architecture
- **Attention Mechanism**: Grouped Query Attention (GQA) with 8 KV heads
- **Position Encoding**: RoPE (Rotary Position Embedding) with scaling
- **Normalization**: RMSNorm
- **Activation**: SwiGLU (in MLP layers)
- **Tied Embeddings**: Word embeddings are tied with output layer
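Three of the components listed above are small enough to sketch in plain Python. The code below is a didactic, list-based version (real implementations operate on tensors), using the epsilon and head counts from the specification table. The function names are hypothetical, and the contiguous query-to-KV-head grouping is the common convention, assumed here rather than confirmed by this card.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-5) -> list[float]:
    """RMSNorm: scale by the reciprocal root-mean-square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """SwiGLU gating: SiLU of the gate projection times the up projection."""
    return [silu(g) * u for g, u in zip(gate, up)]


def kv_head_for_query_head(query_head: int, num_heads: int = 32,
                           num_kv_heads: int = 8) -> int:
    """GQA: consecutive query heads share one key/value head (assumed grouping)."""
    return query_head // (num_heads // num_kv_heads)


print(rms_norm([3.0, 4.0], [1.0, 1.0]))
print(swiglu([0.0, 1.0], [2.0, 2.0]))
print([kv_head_for_query_head(q) for q in (0, 3, 4, 31)])
```

In the full MLP, the `gate` and `up` inputs are two linear projections of the same hidden state, and the gated result passes through a third (down) projection.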
### Special Tokens
- `<|begin_of_text|>` (BOS): 128000
- `<|end_of_text|>` (EOS): 128001
- `<|eot_id|>` (End of Turn): 128009
- `<|start_header_id|>`: 128006
- `<|end_header_id|>`: 128007
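When post-processing raw token IDs, these values are typically used to decide where a response ends. The sketch below keeps the IDs from the list above in a lookup table and cuts a generated sequence at the first stop token. Treating both `<|end_of_text|>` and `<|eot_id|>` as stop tokens is an assumption, and `truncate_at_stop` is a hypothetical helper.

```python
SPECIAL_TOKENS = {
    "<|begin_of_text|>": 128000,
    "<|end_of_text|>": 128001,
    "<|start_header_id|>": 128006,
    "<|end_header_id|>": 128007,
    "<|eot_id|>": 128009,
}

# Assumption: generation should stop at either end-of-text or end-of-turn.
STOP_TOKEN_IDS = {SPECIAL_TOKENS["<|end_of_text|>"], SPECIAL_TOKENS["<|eot_id|>"]}


def truncate_at_stop(token_ids: list[int]) -> list[int]:
    """Return the tokens that precede the first stop token, if any."""
    kept = []
    for token_id in token_ids:
        if token_id in STOP_TOKEN_IDS:
            break
        kept.append(token_id)
    return kept


# The ordinary token IDs here are arbitrary placeholders.
print(truncate_at_stop([9906, 1917, 128009, 40, 1097]))
```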
### File Structure
```
.
├── mlc-chat-config.json # MLC configuration
├── tokenizer.json # Tokenizer model
├── tokenizer_config.json # Tokenizer configuration
├── tensor-cache.json # Tensor metadata
└── params_shard_*.bin # Model weights (22 shards)
```
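After downloading, it can be useful to confirm that the directory is complete before loading it. The following sketch checks for the files listed above and counts the weight shards; `check_model_dir` is a hypothetical helper, and the expected shard count of 22 is taken from the listing.

```python
from pathlib import Path

REQUIRED_FILES = [
    "mlc-chat-config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "tensor-cache.json",
]
EXPECTED_SHARDS = 22  # from the file listing above


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks complete."""
    root = Path(model_dir)
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (root / name).is_file()]
    shards = list(root.glob("params_shard_*.bin"))
    if len(shards) != EXPECTED_SHARDS:
        problems.append(
            f"expected {EXPECTED_SHARDS} weight shards, found {len(shards)}"
        )
    return problems


for problem in check_model_dir("./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"):
    print(problem)
```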
## Ethical Considerations
### Bias and Fairness
- The model may reflect biases present in the training data
- Users should evaluate outputs for potential biases
- Consider implementing bias detection and mitigation strategies
### Safety
- The model may generate content that is inaccurate, offensive, or harmful
- Implement appropriate content filtering and safety measures
- Do not use for generating misleading or harmful content
## Citation
If you use this model, please cite the original Llama 3.2 model:
```bibtex
@misc{llama3.2,
title={Llama 3.2},
author={Meta AI},
year={2024},
howpublished={\url{https://ai.meta.com/llama/}}
}
```
## License
Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.
## Acknowledgments
- Meta AI for the original Llama 3.2 model
- MLC team for the compilation and quantization tools