Add meta tags for the model card

971dbf0 verified 7 months ago

6.66 kB

	---
	model_name: radipro-chatbot-Llama-3.2-1B-Instruct
	base_model: meta-llama/Llama-3.2-1B-Instruct
	model_type: llama
	quantization: q4f16_1
	format: mlc
	language:
	- en
	license: llama3.2
	tags:
	- llama
	- llama-3.2
	- instruct
	- quantized
	- mlc
	- 4-bit
	- chatbot
	- conversational
	- demo
	pipeline_tag: text-generation
	inference: false
	library_name: mlc-llm
	datasets:
	- synthetic
	metrics:
	- training_samples: 49
	- validation_samples: 4
	model_size: 1.63B
	quantized_size: 695MB
	context_length: 131072
	hardware: cpu, metal, cuda
	---

	# Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)

	## Model Details

	### Model Description

	This is a quantized version of the Llama 3.2 1B Instruct model, optimized for deployment using Machine Learning Compilation (MLC). The model has been quantized to 4-bit precision (q4f16_1) to reduce memory footprint while maintaining reasonable performance.

	- Base Model: Llama 3.2 1B Instruct
	- Quantization: q4f16_1 (4-bit weights with float16 scales)
	- Format: MLC (Machine Learning Compilation)
	- Model Type: Decoder-only Transformer
	- Architecture: Llama

	### Model Specifications

	\| Parameter \| Value \|
	\| ----------------------------- \| ------------------------------------ \|
	\| Parameters \| 1.63B (quantized) \|
	\| Hidden Size \| 2,048 \|
	\| Intermediate Size \| 8,192 \|
	\| Number of Layers \| 16 \|
	\| Number of Attention Heads \| 32 \|
	\| Number of Key-Value Heads \| 8 (GQA) \|
	\| Head Dimension \| 64 \|
	\| Vocabulary Size \| 128,256 \|
	\| Context Window \| 131,072 tokens \|
	\| Max Position Embeddings \| 8,192 (with RoPE scaling factor: 32) \|
	\| RMS Norm Epsilon \| 1e-5 \|
	\| Model Size (Quantized) \| ~695 MB \|

	### Quantization Details

	- Quantization Method: q4f16_1
	- Bits per Parameter: ~4.5 bits
	- Weight Format: uint32 (packed 4-bit weights)
	- Scale Format: float16
	- Memory Reduction: ~75% compared to FP16

	## Intended Use

	### Primary Use Cases

	- RadiPro AI assistant
	- built for demonstration purposes

	## Training Data

	This model is based on Meta's Llama 3.2 1B Instruct model. The base model was trained on a small set of synthetic data: 49 training Q/A and 4 validation.

	## How to Use

	### Installation

	First, install the MLC Chat package:

	```bash
	# For CPU (macOS/Linux)
	python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

	# For CUDA (if you have NVIDIA GPU with CUDA 12.2)
	python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

	# For Metal (macOS with Apple Silicon - M1/M2/M3)
	python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal
	```

	Verify Installation:

	After installation, verify that the package is correctly installed:

	```bash
	# Check if mlc_llm is available
	python -c "import mlc_llm; print('mlc_llm installed successfully')"

	# Verify the CLI command works
	mlc_llm --help
	```

	For more installation options, see the [MLC-LLM installation guide](https://llm.mlc.ai/docs/install/mlc_llm.html).

	### Using MLC Runtime (Python)

	Note: The Python API for MLC-LLM is primarily designed for serving. For interactive use, the command-line interface (`mlc_llm chat`) is recommended.

	For programmatic access, you can use the `mlc_llm` serve API:

	```python
	from mlc_llm import MLCEngine

	# Load the model
	model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
	engine = MLCEngine(model_path, mode="local")

	# Note: MLCEngine is designed for serving, not direct generation
	# For interactive chat, use: mlc_llm chat <model-path>
	```

	For more details on the Python API, see the [MLC-LLM Python API documentation](https://llm.mlc.ai/docs/api/python.html).

	### Using Command Line

	The simplest way to use the model is via the `mlc_llm chat` command:

	```bash
	# Interactive chat mode
	mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC # or python -m mlc_llm chat ... if it doesn't work
	```

	### Conversation Template

	The model uses the Llama 3 conversation template:

	```
	<\|start_header_id\|>system<\|end_header_id\|>

	{system_message}<\|eot_id\|><\|start_header_id\|>user<\|end_header_id\|>

	{user_message}<\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|>

	{assistant_message}<\|eot_id\|>
	```

	### Default Generation Parameters

	- Temperature: 0.6
	- Top-p: 0.9
	- Repetition Penalty: 1.0
	- Presence Penalty: 0.0
	- Frequency Penalty: 0.0

	## Technical Details

	### Architecture

	- Attention Mechanism: Grouped Query Attention (GQA) with 8 KV heads
	- Position Encoding: RoPE (Rotary Position Embedding) with scaling
	- Normalization: RMSNorm
	- Activation: SwiGLU (in MLP layers)
	- Tied Embeddings: Word embeddings are tied with output layer

	### Special Tokens

	- `<\|begin_of_text\|>` (BOS): 128000
	- `<\|end_of_text\|>` (EOS): 128001
	- `<\|eot_id\|>` (End of Turn): 128009
	- `<\|start_header_id\|>`: 128006
	- `<\|end_header_id\|>`: 128007

	### File Structure

	```
	.
	├── mlc-chat-config.json # MLC configuration
	├── tokenizer.json # Tokenizer model
	├── tokenizer_config.json # Tokenizer configuration
	├── tensor-cache.json # Tensor metadata
	└── params_shard_*.bin # Model weights (22 shards)
	```

	## Ethical Considerations

	### Bias and Fairness

	- The model may reflect biases present in the training data
	- Users should evaluate outputs for potential biases
	- Consider implementing bias detection and mitigation strategies

	### Safety

	- The model may generate content that is inaccurate, offensive, or harmful
	- Implement appropriate content filtering and safety measures
	- Do not use for generating misleading or harmful content

	## Citation

	If you use this model, please cite the original Llama 3.2 model:

	```bibtex
	@misc{llama3.2,
	title={Llama 3.2},
	author={Meta AI},
	year={2024},
	howpublished={\url{https://ai.meta.com/llama/}}
	}
	```

	## License

	Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.

	## Acknowledgments

	- Meta AI for the original Llama 3.2 model
	- MLC team for the compilation and quantization tools