---
model_name: radipro-chatbot-Llama-3.2-1B-Instruct
base_model: meta-llama/Llama-3.2-1B-Instruct
model_type: llama
quantization: q4f16_1
format: mlc
language:
  - en
license: llama3.2
tags:
  - llama
  - llama-3.2
  - instruct
  - quantized
  - mlc
  - 4-bit
  - chatbot
  - conversational
  - demo
pipeline_tag: text-generation
inference: false
library_name: mlc-llm
datasets:
  - synthetic
metrics:
  - training_samples: 49
  - validation_samples: 4
model_size: 1.63B
quantized_size: 695MB
context_length: 131072
hardware: cpu, metal, cuda
---

# Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)

## Model Details

### Model Description

This is a quantized version of the Llama 3.2 1B Instruct model, packaged for deployment with Machine Learning Compilation (MLC). The weights are quantized to 4-bit precision (q4f16_1) to reduce the memory footprint while maintaining reasonable output quality.

- **Base Model**: Llama 3.2 1B Instruct
- **Quantization**: q4f16_1 (4-bit weights with float16 scales)
- **Format**: MLC (Machine Learning Compilation)
- **Model Type**: Decoder-only Transformer
- **Architecture**: Llama

### Model Specifications

| Parameter                           | Value                                    |
| ----------------------------------- | ---------------------------------------- |
| **Parameters**                      | 1.63B (quantized)                        |
| **Hidden Size**                     | 2,048                                    |
| **Intermediate Size**               | 8,192                                    |
| **Number of Layers**                | 16                                       |
| **Number of Attention Heads**       | 32                                       |
| **Number of Key-Value Heads**       | 8 (GQA)                                  |
| **Head Dimension**                  | 64                                       |
| **Vocabulary Size**                 | 128,256                                  |
| **Context Window**                  | 131,072 tokens                           |
| **Original Max Position Embeddings**| 8,192 (before RoPE scaling, factor: 32)  |
| **RMS Norm Epsilon**                | 1e-5                                     |
| **Model Size (Quantized)**          | ~695 MB                                  |

### Quantization Details

- **Quantization Method**: q4f16_1
- **Bits per Parameter**: ~4.5 bits
- **Weight Format**: uint32 (packed 4-bit weights)
- **Scale Format**: float16
- **Memory Reduction**: ~72% compared to FP16 (4.5 bits vs. 16 bits per weight)

## Intended Use

### Primary Use Cases

- RadiPro AI assistant - built for demonstration purposes

## Training Data

This model is based on Meta's Llama 3.2 1B Instruct model. The base model was fine-tuned on a small synthetic dataset: 49 training Q/A pairs and 4 validation pairs.

## How to Use

### Installation

First, install the MLC-LLM package:

```bash
# For CPU (macOS/Linux)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

# For CUDA (NVIDIA GPU with CUDA 12.2)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

# For Metal (macOS with Apple Silicon - M1/M2/M3)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal
```

**Verify Installation:**

After installation, verify that the package is correctly installed:

```bash
# Check if mlc_llm is available
python -c "import mlc_llm; print('mlc_llm installed successfully')"

# Verify the CLI command works
mlc_llm --help
```

For more installation options, see the [MLC-LLM installation guide](https://llm.mlc.ai/docs/install/mlc_llm.html).

### Using MLC Runtime (Python)

**Note:** For interactive use, the command-line interface (`mlc_llm chat`) is the simplest option. For programmatic access, `MLCEngine` exposes an OpenAI-style chat completions API:

```python
from mlc_llm import MLCEngine

# Load the model from a local directory
model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path, mode="local")

# Send a single chat request and print the reply
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model=model_path,
)
print(response.choices[0].message.content)

# Release engine resources when done
engine.terminate()
```

For more details on the Python API, see the [MLC-LLM Python API documentation](https://llm.mlc.ai/docs/api/python.html).

### Using Command Line

The simplest way to use the model is via the `mlc_llm chat` command:

```bash
# Interactive chat mode
mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC

# If that does not work, try: python -m mlc_llm chat ...
```

### Conversation Template

The model uses the Llama 3 conversation template:

```
<|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_message}<|eot_id|>
```

### Default Generation Parameters

- **Temperature**: 0.6
- **Top-p**: 0.9
- **Repetition Penalty**: 1.0
- **Presence Penalty**: 0.0
- **Frequency Penalty**: 0.0

## Technical Details

### Architecture

- **Attention Mechanism**: Grouped Query Attention (GQA) with 8 KV heads
- **Position Encoding**: RoPE (Rotary Position Embedding) with scaling
- **Normalization**: RMSNorm
- **Activation**: SwiGLU (in MLP layers)
- **Tied Embeddings**: Word embeddings are tied with the output layer

### Special Tokens

- `<|begin_of_text|>` (BOS): 128000
- `<|end_of_text|>` (EOS): 128001
- `<|eot_id|>` (End of Turn): 128009
- `<|start_header_id|>`: 128006
- `<|end_header_id|>`: 128007

### File Structure

```
.
├── mlc-chat-config.json      # MLC configuration
├── tokenizer.json            # Tokenizer model
├── tokenizer_config.json     # Tokenizer configuration
├── tensor-cache.json         # Tensor metadata
└── params_shard_*.bin        # Model weights (22 shards)
```

## Ethical Considerations

### Bias and Fairness

- The model may reflect biases present in the training data
- Users should evaluate outputs for potential biases
- Consider implementing bias detection and mitigation strategies

### Safety

- The model may generate content that is inaccurate, offensive, or harmful
- Implement appropriate content filtering and safety measures
- Do not use the model to generate misleading or harmful content

## Citation

If you use this model, please cite the original Llama 3.2 model:

```bibtex
@misc{llama3.2,
  title={Llama 3.2},
  author={Meta AI},
  year={2024},
  howpublished={\url{https://ai.meta.com/llama/}}
}
```

## License

Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.

## Acknowledgments

- Meta AI for the original Llama 3.2 model
- MLC team for the compilation and quantization tools
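
## Appendix: Conversation Template Example

To make the layout in the Conversation Template section concrete, here is a minimal Python sketch that renders a list of chat messages into that format. It is illustrative only: `format_llama3_prompt` and the message dictionaries are hypothetical names introduced for this example, the blank line after each header is assumed from the standard Llama 3 prompt format, and the BOS token is left out on the assumption that the tokenizer or runtime adds it. MLC-LLM normally applies the template itself, so this is not needed to run the model.

```python
from typing import Dict, List

# Special-token strings as listed in the Conversation Template section.
HEADER_START = "<|start_header_id|>"
HEADER_END = "<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def format_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages into Llama 3 template text (hypothetical helper).

    Each message is a dict with "role" and "content" keys. The result ends
    with an open assistant header so the model can write the next reply.
    """
    parts = []
    for message in messages:
        # Assumption: a blank line separates the header from the content.
        parts.append(
            f"{HEADER_START}{message['role']}{HEADER_END}\n\n"
            f"{message['content']}{END_OF_TURN}"
        )
    # Leave the assistant turn open for generation.
    parts.append(f"{HEADER_START}assistant{HEADER_END}\n\n")
    return "".join(parts)


prompt = format_llama3_prompt(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
)
print(prompt)
```

Printing the prompt is a quick way to check that every turn is closed by `<|eot_id|>` and that the text ends on an open assistant header.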