---
model_name: radipro-chatbot-Llama-3.2-1B-Instruct
base_model: meta-llama/Llama-3.2-1B-Instruct
model_type: llama
quantization: q4f16_1
format: mlc
language:
- en
license: llama3.2
tags:
- llama
- llama-3.2
- instruct
- quantized
- mlc
- 4-bit
- chatbot
- conversational
- demo
pipeline_tag: text-generation
inference: false
library_name: mlc-llm
datasets:
- synthetic
metrics:
- training_samples: 49
- validation_samples: 4
model_size: 1.63B
quantized_size: 695MB
context_length: 131072
hardware: cpu, metal, cuda
---
# Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)

## Model Details

### Model Description

This is a quantized version of the Llama 3.2 1B Instruct model, optimized for deployment using Machine Learning Compilation (MLC). The model has been quantized to 4-bit precision (q4f16_1) to reduce memory footprint while maintaining reasonable performance.

- **Base Model**: Llama 3.2 1B Instruct
- **Quantization**: q4f16_1 (4-bit weights with float16 scales)
- **Format**: MLC (Machine Learning Compilation)
- **Model Type**: Decoder-only Transformer
- **Architecture**: Llama
### Model Specifications

| Parameter                           | Value                            |
| ----------------------------------- | -------------------------------- |
| **Parameters**                      | 1.63B (quantized)                |
| **Hidden Size**                     | 2,048                            |
| **Intermediate Size**               | 8,192                            |
| **Number of Layers**                | 16                               |
| **Number of Attention Heads**       | 32                               |
| **Number of Key-Value Heads**       | 8 (GQA)                          |
| **Head Dimension**                  | 64                               |
| **Vocabulary Size**                 | 128,256                          |
| **Context Window**                  | 131,072 tokens                   |
| **Original Max Position Embeddings** | 8,192 (RoPE scaling factor: 32) |
| **RMS Norm Epsilon**                | 1e-5                             |
| **Model Size (Quantized)**          | ~695 MB                          |
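
Several values in the table are related by simple arithmetic. As a rough sketch, the snippet below derives the head dimension and the number of query heads per key-value head, and estimates the KV-cache footprint per token. The float16 cache dtype (2 bytes per element) is an assumption rather than something this card states, and `kv_cache_bytes_per_token` is a hypothetical helper name.

```python
# Values copied from the specification table above.
HIDDEN_SIZE = 2048
NUM_LAYERS = 16
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8
CONTEXT_WINDOW = 131_072

# Head dimension: the hidden size split evenly across attention heads.
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS

# With GQA, several query heads share one key-value head.
queries_per_kv_head = NUM_ATTENTION_HEADS // NUM_KV_HEADS


def kv_cache_bytes_per_token(bytes_per_element: int = 2) -> int:
    """Estimate KV-cache bytes per token (keys plus values, all layers).

    Assumes a float16 cache (2 bytes per element) by default.
    """
    return 2 * NUM_LAYERS * NUM_KV_HEADS * head_dim * bytes_per_element


def kv_cache_bytes_full_context(bytes_per_element: int = 2) -> int:
    """Estimate KV-cache bytes when the whole context window is filled."""
    return kv_cache_bytes_per_token(bytes_per_element) * CONTEXT_WINDOW
```

Under these assumptions the cache costs 32 KiB per token, or about 4 GiB if the full 131,072-token window is used, which is several times the size of the quantized weights. Running with a shorter context keeps memory use closer to the model size.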
### Quantization Details

- **Quantization Method**: q4f16_1
- **Bits per Parameter**: ~4.5 bits
- **Weight Format**: uint32 (packed 4-bit weights)
- **Scale Format**: float16
- **Memory Reduction**: ~72% compared to FP16 (4.5 bits vs. 16 bits per weight)
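
To make these figures concrete, here is a minimal sketch of the arithmetic and of packing eight 4-bit values into one 32-bit word. It assumes one float16 scale per group of 32 weights, which is what yields 4.5 bits per parameter. The low-nibble-first ordering and the helper names `pack_uint4` and `unpack_uint4` are illustrative assumptions, not MLC-LLM's actual storage layout or API.

```python
GROUP_SIZE = 32   # assumed number of weights sharing one scale
WEIGHT_BITS = 4   # bits stored per quantized weight
SCALE_BITS = 16   # one float16 scale per group

# Effective bits per parameter: 4 + 16 / 32 = 4.5
bits_per_param = WEIGHT_BITS + SCALE_BITS / GROUP_SIZE

# Size reduction relative to 16-bit weights: 1 - 4.5 / 16 = 0.71875
reduction_vs_fp16 = 1 - bits_per_param / 16


def pack_uint4(values):
    """Pack eight integers in [0, 15] into one 32-bit word, low nibble first."""
    if len(values) != 8 or any(not 0 <= v <= 15 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_uint4(word):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]
```

For example, `pack_uint4([1, 2, 3, 4, 5, 6, 7, 8])` gives `0x87654321`, and `unpack_uint4` returns the original list.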
## Intended Use

### Primary Use Cases

- RadiPro AI assistant
- Demonstration purposes

## Training Data

This model is based on Meta's Llama 3.2 1B Instruct model. The base model was fine-tuned on a small synthetic dataset: 49 training question/answer pairs and 4 validation pairs.
## How to Use

### Installation

First, install the MLC-LLM package:

```bash
# For CPU (macOS/Linux)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

# For CUDA (NVIDIA GPU with CUDA 12.2)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

# For Metal (macOS with Apple Silicon - M1/M2/M3)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal
```

**Verify Installation:**

After installation, verify that the package is correctly installed:

```bash
# Check that mlc_llm can be imported
python -c "import mlc_llm; print('mlc_llm installed successfully')"

# Verify that the CLI command works
mlc_llm --help
```

For more installation options, see the [MLC-LLM installation guide](https://llm.mlc.ai/docs/install/mlc_llm.html).
### Using MLC Runtime (Python)

**Note:** For interactive use, the command-line interface (`mlc_llm chat`) is recommended.

For programmatic access, you can use the `MLCEngine` class, which exposes an OpenAI-style chat completions API:

```python
from mlc_llm import MLCEngine

# Load the model from a local directory
model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path, mode="local")
# Request a single, non-streaming chat completion
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model=model_path,
)
print(response.choices[0].message.content)
engine.terminate()
```

For more details on the Python API, see the [MLC-LLM Python API documentation](https://llm.mlc.ai/docs/api/python.html).
### Using Command Line

The simplest way to use the model is via the `mlc_llm chat` command:

```bash
# Interactive chat mode
mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC
# If the mlc_llm command is not found, try: python -m mlc_llm chat ...
```
### Conversation Template

The model uses the Llama 3 conversation template. Each role header is followed by a blank line before the message content:

```
<|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_message}<|eot_id|>
```
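
MLC-LLM applies the conversation template for you, so the following is only an illustrative sketch of how a single-turn prompt in this format could be assembled by hand. The leading `<|begin_of_text|>` token and the blank line after each header follow Meta's published Llama 3 prompt format; `format_llama3_prompt` is a hypothetical helper, not part of any library.

```python
def format_llama3_prompt(system_message, user_message):
    """Render a single-turn prompt that ends where the assistant reply begins."""

    def turn(role, content):
        # Each turn: role header, blank line, content, end-of-turn marker.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    return (
        "<|begin_of_text|>"
        + turn("system", system_message)
        + turn("user", user_message)
        # Leave the assistant header open so the model generates the reply.
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
```

Calling `format_llama3_prompt("You are a helpful assistant.", "Hello!")` returns one string containing the system and user turns, with generation expected to stop when the model emits `<|eot_id|>`.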
### Default Generation Parameters

- **Temperature**: 0.6
- **Top-p**: 0.9
- **Repetition Penalty**: 1.0
- **Presence Penalty**: 0.0
- **Frequency Penalty**: 0.0
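
As a rough illustration of what the temperature and top-p defaults do, the sketch below samples one token index from a list of logits. It is a generic, textbook-style implementation written for this card, not MLC-LLM's sampler, and `sample_token` is a hypothetical name. The repetition, presence, and frequency penalties are omitted because their default values (1.0, 0.0, 0.0) leave the logits unchanged.

```python
import math
import random


def sample_token(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample an index from `logits` using temperature and nucleus (top-p) filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens; choices() normalizes the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Passing `rng=random.Random(0)` makes the draw reproducible, which is convenient when comparing settings.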
## Technical Details

### Architecture

- **Attention Mechanism**: Grouped Query Attention (GQA) with 8 KV heads
- **Position Encoding**: RoPE (Rotary Position Embedding) with scaling
- **Normalization**: RMSNorm
- **Activation**: SwiGLU (in MLP layers)
- **Tied Embeddings**: Word embeddings are tied with output layer
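
Two of these components are small enough to sketch in plain Python. The functions below show RMSNorm and the element-wise gating step of SwiGLU on lists of floats. They are reference-style illustrations with hypothetical names, not the compiled kernels MLC-LLM runs, and the linear projections around the gate are left out.

```python
import math

RMS_NORM_EPS = 1e-5  # value from the specification table


def rms_norm(x, weight, eps=RMS_NORM_EPS):
    """Scale `x` by the reciprocal of its root mean square, then by `weight`."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate, up):
    """Element-wise SwiGLU gating: silu(gate) * up, applied before the down projection."""
    return [silu(g) * u for g, u in zip(gate, up)]
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs only one reduction over the hidden dimension.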
### Special Tokens

- `<|begin_of_text|>` (BOS): 128000
- `<|end_of_text|>` (EOS): 128001
- `<|eot_id|>` (End of Turn): 128009
- `<|start_header_id|>`: 128006
- `<|end_header_id|>`: 128007
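
If you post-process raw token IDs yourself, these values can be kept in a lookup table. The sketch below cuts a generated sequence at the first stop token. Treating both `<|end_of_text|>` and `<|eot_id|>` as stop tokens is an assumption made for illustration, and `truncate_at_stop` is a hypothetical helper.

```python
SPECIAL_TOKENS = {
    "<|begin_of_text|>": 128000,
    "<|end_of_text|>": 128001,
    "<|start_header_id|>": 128006,
    "<|end_header_id|>": 128007,
    "<|eot_id|>": 128009,
}

# Assumption: generation should stop at either end-of-text or end-of-turn.
STOP_TOKEN_IDS = {SPECIAL_TOKENS["<|end_of_text|>"], SPECIAL_TOKENS["<|eot_id|>"]}


def truncate_at_stop(token_ids):
    """Return the tokens before the first stop token (all of them if none occurs)."""
    for position, token_id in enumerate(token_ids):
        if token_id in STOP_TOKEN_IDS:
            return token_ids[:position]
    return token_ids
```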
### File Structure

```
.
├── mlc-chat-config.json       # MLC configuration
├── tokenizer.json             # Tokenizer model
├── tokenizer_config.json      # Tokenizer configuration
├── tensor-cache.json          # Tensor metadata
└── params_shard_*.bin         # Model weights (22 shards)
```
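
After downloading the repository, a quick check that the directory matches this listing can save a confusing load error later. The following is a minimal sketch using only the standard library. `check_model_dir` is a hypothetical helper, and the expected shard count is taken from the listing above.

```python
from pathlib import Path

# File names taken from the listing above.
REQUIRED_FILES = (
    "mlc-chat-config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "tensor-cache.json",
)
EXPECTED_SHARDS = 22


def check_model_dir(model_dir):
    """Return a list of problems found in `model_dir` (empty if none)."""
    root = Path(model_dir)
    problems = [
        f"missing {name}" for name in REQUIRED_FILES if not (root / name).is_file()
    ]
    shard_count = len(list(root.glob("params_shard_*.bin")))
    if shard_count != EXPECTED_SHARDS:
        problems.append(
            f"expected {EXPECTED_SHARDS} weight shards, found {shard_count}"
        )
    return problems
```

An empty list from `check_model_dir("./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC")` means every listed file is present.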
## Ethical Considerations

### Bias and Fairness

- The model may reflect biases present in the training data
- Users should evaluate outputs for potential biases
- Consider implementing bias detection and mitigation strategies

### Safety

- The model may generate content that is inaccurate, offensive, or harmful
- Implement appropriate content filtering and safety measures
- Do not use for generating misleading or harmful content

## Citation

If you use this model, please cite the original Llama 3.2 model:

```bibtex
@misc{llama3.2,
  title={Llama 3.2},
  author={Meta AI},
  year={2024},
  howpublished={\url{https://ai.meta.com/llama/}}
}
```

## License

Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.

## Acknowledgments

- Meta AI for the original Llama 3.2 model
- MLC team for the compilation and quantization tools