---
model_name: radipro-chatbot-Llama-3.2-1B-Instruct
base_model: meta-llama/Llama-3.2-1B-Instruct
model_type: llama
quantization: q4f16_1
format: mlc
language:
  - en
license: llama3.2
tags:
  - llama
  - llama-3.2
  - instruct
  - quantized
  - mlc
  - 4-bit
  - chatbot
  - conversational
  - demo
pipeline_tag: text-generation
inference: false
library_name: mlc-llm
datasets:
  - synthetic
metrics:
  - training_samples: 49
  - validation_samples: 4
model_size: 1.63B
quantized_size: 695MB
context_length: 131072
hardware: cpu, metal, cuda
---

# Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)

## Model Details

### Model Description

This is a quantized version of the Llama 3.2 1B Instruct model, optimized for deployment using Machine Learning Compilation (MLC). The model has been quantized to 4-bit precision (q4f16_1) to reduce memory footprint while maintaining reasonable performance.

- **Base Model**: Llama 3.2 1B Instruct
- **Quantization**: q4f16_1 (4-bit weights with float16 scales)
- **Format**: MLC (Machine Learning Compilation)
- **Model Type**: Decoder-only Transformer
- **Architecture**: Llama

### Model Specifications

| Parameter                     | Value                                |
| ----------------------------- | ------------------------------------ |
| **Parameters**                | 1.63B (quantized)                    |
| **Hidden Size**               | 2,048                                |
| **Intermediate Size**         | 8,192                                |
| **Number of Layers**          | 16                                   |
| **Number of Attention Heads** | 32                                   |
| **Number of Key-Value Heads** | 8 (GQA)                              |
| **Head Dimension**            | 64                                   |
| **Vocabulary Size**           | 128,256                              |
| **Context Window**            | 131,072 tokens                       |
| **Max Position Embeddings**   | 8,192 (with RoPE scaling factor: 32) |
| **RMS Norm Epsilon**          | 1e-5                                 |
| **Model Size (Quantized)**    | ~695 MB                              |

### Quantization Details

- **Quantization Method**: q4f16_1
- **Bits per Parameter**: ~4.5 bits
- **Weight Format**: uint32 (packed 4-bit weights)
- **Scale Format**: float16
- **Memory Reduction**: ~75% compared to FP16
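
The ~4.5 bits figure can be reproduced with a short calculation. The sketch below assumes one float16 scale is shared by a group of 32 weights; that group size is not stated in this card, and the function name is illustrative.

```python
def bits_per_param(weight_bits: int = 4, scale_bits: int = 16, group_size: int = 32) -> float:
    """Effective storage per weight: the weight itself plus its share of the group scale."""
    return weight_bits + scale_bits / group_size

effective_bits = bits_per_param()            # 4 + 16/32
reduction_vs_fp16 = 1 - effective_bits / 16  # fraction of FP16 storage saved

print(f"{effective_bits} bits per parameter")
print(f"{reduction_vs_fp16:.1%} smaller than FP16")
```

Under this assumption the saving works out to about 72%, slightly below the ~75% that exactly 4 bits per parameter would give.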

## Intended Use

### Primary Use Cases

- RadiPro AI assistant, built for demonstration purposes

## Training Data

This model is based on Meta's Llama 3.2 1B Instruct model. The base model was fine-tuned on a small synthetic dataset: 49 training question-answer pairs and 4 validation samples.

## How to Use

### Installation

First, install the MLC-LLM package that matches your hardware:

```bash
# For CPU (macOS/Linux)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

# For CUDA (if you have NVIDIA GPU with CUDA 12.2)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

# For Metal (macOS with Apple Silicon - M1/M2/M3)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal
```

**Verify Installation:**

After installation, verify that the package is correctly installed:

```bash
# Check if mlc_llm is available
python -c "import mlc_llm; print('mlc_llm installed successfully')"

# Verify the CLI command works
mlc_llm --help
```

For more installation options, see the [MLC-LLM installation guide](https://llm.mlc.ai/docs/install/mlc_llm.html).

### Using MLC Runtime (Python)

**Note:** The Python API for MLC-LLM is primarily designed for serving. For interactive use, the command-line interface (`mlc_llm chat`) is recommended.

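For programmatic access, you can use `MLCEngine`, which exposes an OpenAI-style chat completions interface:

```python
from mlc_llm import MLCEngine

# Load the model from a local directory
model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path, mode="local")

# Request a single, non-streaming chat completion
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model=model_path,
    stream=False,
)
print(response.choices[0].message.content)

# Release engine resources when done
engine.terminate()
```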

For more details on the Python API, see the [MLC-LLM Python API documentation](https://llm.mlc.ai/docs/api/python.html).

### Using Command Line

The simplest way to use the model is via the `mlc_llm chat` command:

```bash
# Interactive chat mode
mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC

# If the mlc_llm command is not found, try: python -m mlc_llm chat ...
```

### Conversation Template

The model uses the Llama 3 conversation template:

```
<|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_message}<|eot_id|>
```
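
To inspect or debug the raw prompt, the template above can be filled in by hand. This is a minimal sketch: `build_prompt` is a hypothetical helper, not part of MLC-LLM, and the chat interfaces typically apply the template for you based on `mlc-chat-config.json`.

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Format a single-turn prompt, ending where the assistant's reply should begin."""
    return (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

# Example messages are placeholders
prompt = build_prompt("You are a helpful assistant.", "What is RadiPro?")
print(prompt)
```

The prompt stops after the assistant header so that the model generates `{assistant_message}` and closes the turn with `<|eot_id|>`.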

### Default Generation Parameters

- **Temperature**: 0.6
- **Top-p**: 0.9
- **Repetition Penalty**: 1.0
- **Presence Penalty**: 0.0
- **Frequency Penalty**: 0.0
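
As an illustration of what temperature and top-p control, here is a plain-Python sketch of the two steps. It is not MLC-LLM's actual sampler; the function names and the example logits are made up for this illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized candidates

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens
probs = apply_temperature(logits, temperature=0.6)
candidates = top_p_filter(probs, top_p=0.9)
print(candidates)
```

With these made-up logits, only the two most likely tokens remain as sampling candidates; the next token would then be drawn from the renormalized probabilities.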

## Technical Details

### Architecture

- **Attention Mechanism**: Grouped Query Attention (GQA) with 8 KV heads
- **Position Encoding**: RoPE (Rotary Position Embedding) with scaling
- **Normalization**: RMSNorm
- **Activation**: SwiGLU (in MLP layers)
- **Tied Embeddings**: Word embeddings are tied with output layer
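
The practical effect of GQA is a smaller KV cache. The following back-of-the-envelope sketch uses the values from the specifications table and assumes the cache is stored in float16, which this card does not state.

```python
num_layers = 16
num_attention_heads = 32
num_kv_heads = 8
head_dim = 64
bytes_per_value = 2  # assumption: float16 KV cache

def kv_cache_bytes_per_token(kv_heads: int) -> int:
    """Keys and values (factor 2) stored for every layer and KV head."""
    return 2 * num_layers * kv_heads * head_dim * bytes_per_value

queries_per_kv_head = num_attention_heads // num_kv_heads
gqa = kv_cache_bytes_per_token(num_kv_heads)
mha = kv_cache_bytes_per_token(num_attention_heads)  # hypothetical: one KV head per query head

print(f"{queries_per_kv_head} query heads share each KV head")
print(f"GQA: {gqa // 1024} KiB per token, full multi-head: {mha // 1024} KiB per token")
```

Under that assumption, sharing each KV head among four query heads cuts the per-token cache to a quarter of what full multi-head attention would need.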

### Special Tokens

- `<|begin_of_text|>` (BOS): 128000
- `<|end_of_text|>` (EOS): 128001
- `<|eot_id|>` (End of Turn): 128009
- `<|start_header_id|>`: 128006
- `<|end_header_id|>`: 128007

### File Structure

```
.
├── mlc-chat-config.json      # MLC configuration
├── tokenizer.json            # Tokenizer model
├── tokenizer_config.json     # Tokenizer configuration
├── tensor-cache.json         # Tensor metadata
└── params_shard_*.bin        # Model weights (22 shards)
```

## Ethical Considerations

### Bias and Fairness

- The model may reflect biases present in the training data
- Users should evaluate outputs for potential biases
- Consider implementing bias detection and mitigation strategies

### Safety

- The model may generate content that is inaccurate, offensive, or harmful
- Implement appropriate content filtering and safety measures
- Do not use for generating misleading or harmful content

## Citation

If you use this model, please cite the original Llama 3.2 model:

```bibtex
@misc{llama3.2,
  title={Llama 3.2},
  author={Meta AI},
  year={2024},
  howpublished={\url{https://ai.meta.com/llama/}}
}
```

## License

Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.

## Acknowledgments

- Meta AI for the original Llama 3.2 model
- MLC team for the compilation and quantization tools