---
license: gemma
language:
- uk
pipeline_tag: text-generation
tags:
- gemma3
- fp8
- quantized
- ukrainian
- conversational
base_model:
- INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0
library_name: transformers
---

# MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia-v2

## Model description

**MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia-v2** is an FP8 static-quantized version of the instruction-tuned [INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0](https://huggingface.co/INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0), which is based on the Gemma 3 12B model. It is prepared for efficient inference and is intended to preserve quality on Ukrainian-language assistant workloads.

This version was quantized on an **NVIDIA B200** GPU using the private **Nadiia-v2** dataset for calibration.

## Intended use

This model is intended for:

- Ukrainian-language assistant scenarios
- Instruction following
- Customer support and helpdesk-style conversations
- General text generation
- Evaluation of FP8 static quantization quality/performance trade-offs

## Quantization Details

- **Base instruction model:** [INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0](https://huggingface.co/INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0)
- **Quantization format:** FP8 static quantization (E4M3)
- **Calibration dataset:** private dataset `Nadiia-v2`
- **Quantization hardware:** NVIDIA B200 GPU

## Usage

Transformers

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "oshkorinova/MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia-v2"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype="auto",
)

# Prompt (Ukrainian): "Explain in simple terms what model quantization is."
messages = [
    {"role": "user", "content": "Поясни простими словами, що таке квантізація моделі."}
]

# return_dict=True yields input_ids and attention_mask together,
# so both can be passed to generate() as keyword arguments.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```

vLLM

```
vllm serve oshkorinova/MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia-v2 \
  --trust-remote-code \
  --dtype auto \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image": 0}'
```

Example request:

```
# Prompt (Ukrainian): "Write a short explanation of what FP8 quantization is."
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "oshkorinova/MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia-v2",
    "messages": [
      {"role": "user", "content": "Напиши коротке пояснення, що таке FP8-квантізація."}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'
```

## Acknowledgements

We would like to thank:

- [Beyond.pl](https://www.beyond.pl/en/) for providing compute resources used for FP8 quantization
- [glibesyck](https://huggingface.co/glibesyck) for running the evaluation benchmarks
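
As a note on the Quantization Details section above, the following is a minimal, pure-Python sketch of what "FP8 static quantization (E4M3)" generally means: a scale is fixed once from calibration data and then reused unchanged at inference time. This card does not state which tool or scaling granularity was used for this model, so the per-tensor scale and all function names here (`round_to_e4m3`, `calibrate_scale`, `quantize`, `dequantize`) are illustrative assumptions, not the actual pipeline.

```python
import math

E4M3_MAX = 448.0               # largest finite magnitude in the E4M3 format
E4M3_MIN_NORMAL = 2.0 ** -6    # smallest normal magnitude
E4M3_SUBNORMAL_STEP = 2.0 ** -9


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    if mag < E4M3_MIN_NORMAL:
        step = E4M3_SUBNORMAL_STEP
    else:
        _, exp = math.frexp(mag)   # mag = m * 2**exp with 0.5 <= m < 1
        step = 2.0 ** (exp - 4)    # 3 mantissa bits: 8 steps per power of two
    # Python's round() breaks ties to even, matching round-to-nearest-even.
    return sign * round(mag / step) * step


def calibrate_scale(calibration_values):
    """Static per-tensor scale (an assumption): map the calibration max to 448."""
    amax = max(abs(v) for v in calibration_values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Apply the fixed scale, then round onto the E4M3 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale):
    """Map E4M3 values back to the original range."""
    return [q * scale for q in quantized]


# Hypothetical calibration activations; the real Nadiia-v2 data is private.
scale = calibrate_scale([-3.2, 0.5, 7.0, -1.1])
restored = dequantize(quantize([0.3, 7.0, 10.0, -1.1], scale), scale)
print(scale, restored)
```

Because the scale is frozen after calibration, an inference-time value larger than anything seen during calibration (10.0 in the sketch) saturates at the calibration maximum. That is one reason the choice of calibration data matters for static quantization.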