---
model_name: radipro-chatbot-Llama-3.2-1B-Instruct
base_model: meta-llama/Llama-3.2-1B-Instruct
model_type: llama
quantization: q4f16_1
format: mlc
language:
  - en
license: llama3.2
tags:
  - llama
  - llama-3.2
  - instruct
  - quantized
  - mlc
  - 4-bit
  - chatbot
  - conversational
  - demo
pipeline_tag: text-generation
inference: false
library_name: mlc-llm
datasets:
  - synthetic
metrics:
  - training_samples: 49
  - validation_samples: 4
model_size: 1.63B
quantized_size: 695MB
context_length: 131072
hardware: cpu, metal, cuda
---

# Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)

## Model Details

### Model Description

This is a quantized version of the Llama 3.2 1B Instruct model, optimized for deployment using Machine Learning Compilation (MLC). The model has been quantized to 4-bit precision (q4f16_1) to reduce memory footprint while maintaining reasonable performance.

- **Base Model**: Llama 3.2 1B Instruct
- **Quantization**: q4f16_1 (4-bit weights with float16 scales)
- **Format**: MLC (Machine Learning Compilation)
- **Model Type**: Decoder-only Transformer
- **Architecture**: Llama

### Model Specifications

| Parameter                     | Value                                |
| ----------------------------- | ------------------------------------ |
| **Parameters**                | 1.63B (quantized)                    |
| **Hidden Size**               | 2,048                                |
| **Intermediate Size**         | 8,192                                |
| **Number of Layers**          | 16                                   |
| **Number of Attention Heads** | 32                                   |
| **Number of Key-Value Heads** | 8 (GQA)                              |
| **Head Dimension**            | 64                                   |
| **Vocabulary Size**           | 128,256                              |
| **Context Window**            | 131,072 tokens                       |
| **Max Position Embeddings**   | 8,192 pre-scaling (RoPE factor: 32)  |
| **RMS Norm Epsilon**          | 1e-5                                 |
| **Model Size (Quantized)**    | ~695 MB                              |
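
Several of these figures are related by simple arithmetic. The sketch below derives the head dimension, the GQA group size, and a rough KV cache size from the table values. It assumes the cache is stored in float16 (2 bytes per value), which this card does not state; the runtime's actual cache layout may differ.

```python
# Values taken from the specification table above.
hidden_size = 2048
num_layers = 16
num_attention_heads = 32
num_kv_heads = 8
context_window = 131_072

# Assumption: KV cache entries are float16 (2 bytes each).
bytes_per_value = 2

head_dim = hidden_size // num_attention_heads
queries_per_kv_head = num_attention_heads // num_kv_heads

# Each layer stores one key and one value vector per KV head per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_bytes_full_context = kv_bytes_per_token * context_window

print(head_dim)                        # 64
print(queries_per_kv_head)             # 4
print(kv_bytes_per_token // 1024)      # 32 (KiB per token)
print(kv_bytes_full_context // 2**30)  # 4 (GiB at the full context window)
```

Under that assumption, a full 131,072-token context would need far more memory for the cache than for the quantized weights, so a smaller context limit is likely the practical choice on constrained hardware.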

### Quantization Details

- **Quantization Method**: q4f16_1
- **Bits per Parameter**: ~4.5 bits
- **Weight Format**: uint32 (packed 4-bit weights)
- **Scale Format**: float16
- **Memory Reduction**: ~72% compared to FP16 (about 4.5 vs. 16 bits per weight)
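
The ~4.5 bits figure follows from group quantization, where each group of 4-bit weights shares one float16 scale. The sketch below works through that arithmetic, assuming a group size of 32 (the usual setting for q4f16_1, not stated in this card). It also illustrates how eight 4-bit values fit into one uint32; the nibble order used here is illustrative and may not match MLC's actual layout.

```python
WEIGHT_BITS = 4
SCALE_BITS = 16   # one float16 scale per group
GROUP_SIZE = 32   # assumption: typical q4f16_1 group size

bits_per_param = WEIGHT_BITS + SCALE_BITS / GROUP_SIZE
reduction_vs_fp16 = 1 - bits_per_param / 16


def pack_uint4(values):
    """Pack eight 4-bit integers (0..15) into one 32-bit word, lowest nibble first."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_uint4(word):
    """Recover the eight 4-bit integers from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


print(bits_per_param)               # 4.5
print(round(reduction_vs_fp16, 3))  # 0.719
print(hex(pack_uint4([1, 2, 3, 4, 5, 6, 7, 8])))  # 0x87654321
```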

## Intended Use

### Primary Use Cases

- RadiPro AI assistant
- Built for demonstration purposes

## Training Data

This model is based on Meta's Llama 3.2 1B Instruct model. The base model was fine-tuned on a small synthetic dataset: 49 training question/answer pairs and 4 validation pairs.

## How to Use

### Installation

First, install the MLC-LLM packages, choosing the build that matches your hardware:

```bash
# For CPU (macOS/Linux)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

# For CUDA (if you have an NVIDIA GPU with CUDA 12.2)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

# For Metal (macOS with Apple Silicon - M1/M2/M3)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal
```

**Verify Installation:**

After installation, verify that the package is correctly installed:

```bash
# Check if mlc_llm is available
python -c "import mlc_llm; print('mlc_llm installed successfully')"

# Verify the CLI command works
mlc_llm --help
```

For more installation options, see the [MLC-LLM installation guide](https://llm.mlc.ai/docs/install/mlc_llm.html).

### Using MLC Runtime (Python)

**Note:** For interactive use, the command-line interface (`mlc_llm chat`) is the simplest option.

For programmatic access, `MLCEngine` exposes an OpenAI-style chat completions interface:

```python
from mlc_llm import MLCEngine

# Load the model from a local directory
model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path, mode="local")

# Send one chat request through the OpenAI-style interface
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model=model_path,
    stream=False,
)
print(response.choices[0].message.content)

# Release engine resources when done
engine.terminate()
```

For more details on the Python API, see the [MLC-LLM Python API documentation](https://llm.mlc.ai/docs/api/python.html).

### Using Command Line

The simplest way to use the model is via the `mlc_llm chat` command:

```bash
# Interactive chat mode
mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC
# If the mlc_llm command is not found, try: python -m mlc_llm chat ...
```

### Conversation Template

The model uses the Llama 3 conversation template:

```
<|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_message}<|eot_id|>
```
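
The runtime normally applies this template itself, so manual formatting is rarely needed. For inspection or debugging, the template can be reproduced with a small helper. This is a minimal sketch: `format_llama3_prompt` is a hypothetical name, not part of MLC-LLM, and like the template above it omits the leading `<|begin_of_text|>` token.

```python
def format_llama3_prompt(system_message, turns):
    """Render a prompt in the Llama 3 template shown above.

    `turns` is a list of (user_message, assistant_message) pairs. Pass None as
    the final assistant_message to leave the prompt open for generation.
    """
    def header(role):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n"

    prompt = header("system") + system_message + "<|eot_id|>"
    for user_message, assistant_message in turns:
        prompt += header("user") + user_message + "<|eot_id|>"
        prompt += header("assistant")
        if assistant_message is not None:
            prompt += assistant_message + "<|eot_id|>"
    return prompt


prompt = format_llama3_prompt("You are a helpful assistant.", [("Hello!", None)])
print(prompt)
```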

### Default Generation Parameters

- **Temperature**: 0.6
- **Top-p**: 0.9
- **Repetition Penalty**: 1.0
- **Presence Penalty**: 0.0
- **Frequency Penalty**: 0.0
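
With a repetition penalty of 1.0 and presence and frequency penalties of 0.0, only temperature and top-p shape the output distribution. The sketch below shows how those two settings are commonly applied to a vector of logits. It is a plain-Python illustration of the general technique, not MLC-LLM's sampler; `sample_token` is a hypothetical helper.

```python
import math
import random


def sample_token(logits, temperature=0.6, top_p=0.9, rng=random):
    """Pick a token index using temperature scaling, then nucleus (top-p) filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.5, 1.0, -1.0]
print(sample_token(logits, rng=random.Random(0)))
```

For these example logits the least likely token falls outside the 0.9 nucleus, so it can never be selected.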

## Technical Details

### Architecture

- **Attention Mechanism**: Grouped Query Attention (GQA) with 8 KV heads
- **Position Encoding**: RoPE (Rotary Position Embedding) with scaling
- **Normalization**: RMSNorm
- **Activation**: SwiGLU (in MLP layers)
- **Tied Embeddings**: Word embeddings are tied with the output layer
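
Three of these components are small enough to sketch in plain Python. The functions below illustrate the standard definitions of RMSNorm, the SwiGLU gating step, and the query-to-KV head mapping under GQA. They are not taken from this model's compiled code, and the assumption that consecutive query heads share a KV head follows the common Llama convention.

```python
import math

RMS_NORM_EPS = 1e-5  # from the specification table


def rms_norm(x, weight, eps=RMS_NORM_EPS):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Elementwise SiLU(gate) * up, the gating step inside the MLP."""
    return [silu(g) * u for g, u in zip(gate, up)]


def kv_head_for_query_head(query_head, num_heads=32, num_kv_heads=8):
    """Assumed GQA mapping: each run of consecutive query heads shares one KV head."""
    return query_head // (num_heads // num_kv_heads)


print(rms_norm([3.0, 4.0], [1.0, 1.0]))
print([kv_head_for_query_head(h) for h in (0, 3, 4, 31)])  # [0, 0, 1, 7]
```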

### Special Tokens

- `<|begin_of_text|>` (BOS): 128000
- `<|end_of_text|>` (EOS): 128001
- `<|eot_id|>` (End of Turn): 128009
- `<|start_header_id|>`: 128006
- `<|end_header_id|>`: 128007
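
When post-processing raw token IDs, these values can be used to cut a generated sequence at the end of a turn. The sketch below assumes generation should stop at either `<|end_of_text|>` or `<|eot_id|>`; `truncate_at_stop` is a hypothetical helper, and the non-special token IDs in the example are arbitrary placeholders.

```python
SPECIAL_TOKENS = {
    "<|begin_of_text|>": 128000,
    "<|end_of_text|>": 128001,
    "<|start_header_id|>": 128006,
    "<|end_header_id|>": 128007,
    "<|eot_id|>": 128009,
}

# Assumption: either end-of-text or end-of-turn terminates a response.
STOP_TOKEN_IDS = {SPECIAL_TOKENS["<|end_of_text|>"], SPECIAL_TOKENS["<|eot_id|>"]}


def truncate_at_stop(token_ids):
    """Return the tokens that precede the first stop token, if any."""
    for position, token_id in enumerate(token_ids):
        if token_id in STOP_TOKEN_IDS:
            return token_ids[:position]
    return token_ids


# 9906 and 1917 are arbitrary placeholder IDs for ordinary tokens.
print(truncate_at_stop([9906, 1917, 128009, 128006]))  # [9906, 1917]
```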

### File Structure

```
.
├── mlc-chat-config.json      # MLC configuration
├── tokenizer.json            # Tokenizer model
├── tokenizer_config.json     # Tokenizer configuration
├── tensor-cache.json         # Tensor metadata
└── params_shard_*.bin        # Model weights (22 shards)
```
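
Before loading the model, it can help to confirm that a download is complete. The following sketch checks a directory against the layout above. `check_model_dir` is a hypothetical helper, and it only verifies that at least one weight shard exists rather than counting all 22.

```python
from pathlib import Path

REQUIRED_FILES = [
    "mlc-chat-config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "tensor-cache.json",
]


def check_model_dir(model_dir):
    """Return a list of problems found in an MLC model directory (empty if none)."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = [f"missing {name}" for name in REQUIRED_FILES if not (root / name).is_file()]
    if not list(root.glob("params_shard_*.bin")):
        problems.append("no params_shard_*.bin files found")
    return problems


print(check_model_dir("./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"))
```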

## Ethical Considerations

### Bias and Fairness

- The model may reflect biases present in the training data
- Users should evaluate outputs for potential biases
- Consider implementing bias detection and mitigation strategies

### Safety

- The model may generate content that is inaccurate, offensive, or harmful
- Implement appropriate content filtering and safety measures
- Do not use for generating misleading or harmful content

## Citation

If you use this model, please cite the original Llama 3.2 model:

```bibtex
@misc{llama3.2,
  title={Llama 3.2},
  author={Meta AI},
  year={2024},
  howpublished={\url{https://ai.meta.com/llama/}}
}
```

## License

Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.

## Acknowledgments

- Meta AI for the original Llama 3.2 model
- MLC team for the compilation and quantization tools