---
language:
- ru
license: other
license_name: gemma
license_link: https://ai.google.dev/gemma/terms
base_model: google/gemma-4-E4B-it
pipeline_tag: text-generation
tags:
- gemma4
- russian
- colloquial
- style-transfer
- merged
- vllm
library_name: transformers
datasets:
- pavelfedortsov/russian-colloquial-sft-50k
---

# gemma4-e4b-colloquial-ru-merged

*English:* Full-weight **Gemma 4 E4B** checkpoint with the colloquial Russian LoRA **merged in**, for serving with vLLM / [RunPod Serverless](https://www.runpod.io/serverless). No PEFT is needed at inference time.

## What this model does

Rewrites **formal Russian** into **casual chat-style Russian** (Telegram-like), **without profanity**, while keeping facts, names, numbers, and paragraph structure.

This is **not** a general chat model; use the instruction prefix from training (see below).

## Model lineage

| Stage | Artifact |
|-------|----------|
| Base | [google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it) |
| LoRA (SFT) | [pavelfedortsov/gemma4-e4b-lora-colloquial-ru](https://huggingface.co/pavelfedortsov/gemma4-e4b-lora-colloquial-ru) |
| **This repo** | LoRA merged into base + vLLM fixes (`k_norm`, processor configs) |

The merge was done with PEFT's `merge_and_unload()`; the missing `language_model` **k_norm** weights for layers 24–41 were copied from the base checkpoint, which vLLM requires.

## Training data

- **50,000** SFT pairs in a colloquial style free of profanity ("mat")
- Hub dataset: [pavelfedortsov/russian-colloquial-sft-50k](https://huggingface.co/datasets/pavelfedortsov/russian-colloquial-sft-50k)
- Built from [kurumikz/telegram-corpus-russian-kazakh](https://huggingface.co/datasets/kurumikz/telegram-corpus-russian-kazakh) plus Gemini-generated pairs (see the dataset card)

**User prompt template (training & inference):**

```
Перепиши простым разговорным русским, как в переписке. Без мата и грубости. Сохрани смысл:
<формальный текст>
```

In English, the instruction reads: "Rewrite in plain conversational Russian, as in a chat. No profanity or rudeness. Keep the meaning:", followed by the formal text. Send the prompt in Russian, exactly as shown.

## Training configuration (LoRA → merge)

Config file (also in `card_assets/train_colloquial_e4b_gpu.yaml`):

| Parameter | Value |
|-----------|--------|
| Base model | `google/gemma-4-E4B-it` |
| Method | LoRA on language tower (`model.language_model.*`) |
| LoRA rank / alpha | **32 / 64** |
| Target modules | `q,k,v,o` + MLP (`gate, up, down`) |
| Dataset | 50k × 1 repeat |
| Epochs | **2** (12,500 optimizer steps) |
| Seq length | 512 |
| Batch | 1 × grad accum **8** (effective 8) |
| LR | 1e-4, cosine, warmup 3% |
| Precision | bf16, gradient checkpointing |
| Loss | assistant-only |
| Hardware | RunPod **A100 80GB** |

## Training metrics (LoRA run)

![Training curves](card_assets/training_curves.png)

| Metric | Start (step ~25) | End (step 12,500) | Best |
|--------|------------------|-------------------|------|
| `loss` | ~3.42 | ~0.81 | **~0.67** |
| `mean_token_accuracy` | ~0.63 | ~0.82 | **~0.84** |

Checkpoints were saved every 1,000 steps in the LoRA adapter repo.

## Inference

### RunPod Serverless (vLLM)

```env
MODEL_NAME=pavelfedortsov/gemma4-e4b-colloquial-ru-merged
HF_TOKEN=
TRUST_REMOTE_CODE=true
DTYPE=bfloat16
MAX_MODEL_LEN=4096
GPU_MEMORY_UTILIZATION=0.90
ENFORCE_EAGER=true
ENABLE_LORA=false
LANGUAGE_MODEL_ONLY=true
LIMIT_MM_PER_PROMPT={"image":0,"audio":0,"video":0}
```

Recommended GPU: **≥40 GB** VRAM (the merged weights are ~32 GB in bf16).

### Transformers (local)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "pavelfedortsov/gemma4-e4b-colloquial-ru-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

formal = "Сегодня на совещании обсуждали внедрение новой версии API."

# Same instruction prefix as in training, followed by the formal text.
user = (
    "Перепиши простым разговорным русским, как в переписке. "
    "Без мата и грубости. Сохрани смысл:\n"
    f"{formal}"
)
messages = [{"role": "user", "content": user}]

# Render the chat template as text and append the generation prompt.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### OpenAI-compatible API (RunPod / vLLM)

```bash
curl "$RUNPOD_URL/v1/chat/completions" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pavelfedortsov/gemma4-e4b-colloquial-ru-merged",
    "messages": [{
      "role": "user",
      "content": "Перепиши простым разговорным русским, как в переписке. Без мата и грубости. Сохрани смысл:\nВаш формальный текст."
    }],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```

## Limitations

- The [Gemma license](https://ai.google.dev/gemma/terms) applies to the base architecture and weights.
- Quality varies on long news-style text; the model may shorten or paraphrase aggressively.
- The model is not safety-tuned; run your own evaluation before production use.
- Output style can differ slightly between merged and LoRA inference.

## Related repos

| Resource | Link |
|----------|------|
| LoRA adapter | https://huggingface.co/pavelfedortsov/gemma4-e4b-lora-colloquial-ru |
| Dataset (50k) | https://huggingface.co/datasets/pavelfedortsov/russian-colloquial-sft-50k |
| Base model | https://huggingface.co/google/gemma-4-E4B-it |
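
## Prompt helper (sketch)

Since the model expects the same instruction prefix at inference as in training, it can help to build the prompt in one place instead of retyping it per request. The snippet below is a minimal sketch, not part of this repo: `build_user_prompt` and `build_chat_payload` are hypothetical helper names, and the default `max_tokens` and `temperature` simply mirror the curl example in this card. It only constructs the request body for the OpenAI-compatible `/v1/chat/completions` route; sending it is left to whichever HTTP client you use.

```python
import json

# Instruction prefix copied from the prompt template in this card.
INSTRUCTION = (
    "Перепиши простым разговорным русским, как в переписке. "
    "Без мата и грубости. Сохрани смысл:"
)

MODEL_ID = "pavelfedortsov/gemma4-e4b-colloquial-ru-merged"


def build_user_prompt(formal_text: str) -> str:
    """Hypothetical helper: prepend the training-time instruction to the text."""
    return f"{INSTRUCTION}\n{formal_text.strip()}"


def build_chat_payload(
    formal_text: str, max_tokens: int = 512, temperature: float = 0.7
) -> dict:
    """Hypothetical helper: request body for /v1/chat/completions."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": build_user_prompt(formal_text)}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


if __name__ == "__main__":
    payload = build_chat_payload(
        "Сегодня на совещании обсуждали внедрение новой версии API."
    )
    # ensure_ascii=False keeps the Cyrillic text readable in the printed JSON.
    print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and posted to the endpoint used in the curl example, with the same `Authorization` and `Content-Type` headers.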