---
language:
- ko
- en
license: other
base_model: LiquidAI/LFM2-1.2B
license_name: lfm-open-license-v1.0
license_link: https://huggingface.co/LiquidAI/LFM2-1.2B/blob/main/LICENSE
tags:
- translation
- generated_from_trainer
- liquid-ai
- lfm2
- korean
datasets:
- gyung/koen-parallel-100k
metrics:
- chrf
- bleu
model-index:
- name: LFM2-1.2B-KoEn-MT-v4-100k
  results:
  - task:
      type: translation
      name: Translation
    dataset:
      name: Flores-200
      type: flores_200
      config: en-ko
      split: devtest
    metrics:
    - type: chrf
      name: CHrF++
      value: 31.53
      verified: true
    - type: bleu
      name: BLEU
      value: 11.13
      verified: true
---

# 🌊 LFM2-1.2B-KoEn-MT-v4-100k

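**LFM2-1.2B-KoEn-MT-v4-100k** is a fine-tune of LiquidAI's `LFM2-1.2B` on a **high-quality parallel dataset of 100,000 samples**, aimed at improving Korean-English translation.

It was trained on T4 GPU x 2 (DDP) with a training pipeline tuned for that hardware, and delivers solid translation quality for a lightweight 1.2B-parameter model. On Flores-200 it is roughly on par with NLLB-200-Distilled-600M (CHrF++ 31.53 vs. 31.97, BLEU 11.13 vs. 10.32), which makes it a candidate for mobile and edge deployment.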

## 📊 Benchmarks

Evaluation results on the **Flores-200** dataset (1,012 sentences), sorted by CHrF++.

| Rank | Model | CHrF++ | BLEU | Notes |
| :--- | :--- | :--- | :--- | :--- |
| 1 | **Google Translate** | 39.27 | 18.18 | Commercial service (target) |
| 2 | **Yanolja-4B-GGUF** | 38.61 | 16.03 | Open-source model (SOTA) |
| 3 | **NLLB-200 (3.3B)** | 35.09 | 11.68 | 3.3B translation-only model |
| 4 | **Gemma-3-4B-it-GGUF** | 32.83 | 11.36 | Google's recent 4B model |
| 5 | **NLLB-200-Distilled-600M** | 31.97 | 10.32 | 600M translation-only model |
| 6 | **LFM2-1.2B-KoEn-MT-v4-100k** | **31.53** | **11.13** | **This model (1.2B)** |
| 7 | **lfm2-mt-v1** | 30.85 | 11.17 | Trained on 100 samples |
| 8 | **LFM2-1.2B** | 27.23 | 6.43 | Baseline (base model) |
| 9 | **Qwen3-4B-GGUF** | 25.62 | 7.46 | 4B base model |
| 10 | **Gemma-3-1B-it-GGUF** | 24.07 | 6.94 | 1B model |
| 11 | **Qwen3-1.7B-GGUF** | 21.19 | - | 1.7B base model |
| 12 | **Qwen3-0.6B-GGUF** | 13.48 | 1.98 | 0.6B base model |
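
To make the comparison concrete, the short sketch below computes score differences between rows of the table. The scores are copied from the table; the `scores` dictionary and the `delta` helper are illustrative names, not part of any evaluation tooling used for this model.

```python
# (CHrF++, BLEU) pairs copied from the benchmark table above.
scores = {
    "LFM2-1.2B-KoEn-MT-v4-100k": (31.53, 11.13),
    "NLLB-200-Distilled-600M": (31.97, 10.32),
    "LFM2-1.2B": (27.23, 6.43),
}


def delta(model: str, reference: str) -> tuple[float, float]:
    """Return (CHrF++ difference, BLEU difference) of model minus reference."""
    chrf_m, bleu_m = scores[model]
    chrf_r, bleu_r = scores[reference]
    return round(chrf_m - chrf_r, 2), round(bleu_m - bleu_r, 2)


# Gain from fine-tuning relative to the base model.
print(delta("LFM2-1.2B-KoEn-MT-v4-100k", "LFM2-1.2B"))  # (4.3, 4.7)
# Gap to the 600M translation-only model.
print(delta("LFM2-1.2B-KoEn-MT-v4-100k", "NLLB-200-Distilled-600M"))  # (-0.44, 0.81)
```

In other words, fine-tuning adds about 4.3 CHrF++ and 4.7 BLEU over the base model, and the result trails NLLB-200-Distilled-600M slightly on CHrF++ while leading on BLEU.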

## 📈 Training Logs

Loss and learning-rate progression over **about 6,188 steps** of training. The loss starts in the 3.5 range and converges steadily to a final value of 1.43.

| Step | Epoch | Training Loss (Avg) | Learning Rate | Notes |
| :---: | :---: | :--- | :--- | :--- |
| 0 | 0.00 | 3.57 | 0 | Start |
| 500 | 0.08 | 1.59 | 8.06e-06 | Late warmup (LR still rising); loss already down sharply |
| 1000 | 0.16 | 1.57 | 9.88e-06 | Early stabilization, just past peak LR |
| 2000 | 0.32 | 1.48 | 8.45e-06 | Loss drops below 1.5 |
| 3000 | 0.49 | 1.46 | 5.99e-06 | Mid-training convergence |
| 4000 | 0.65 | 1.45 | 3.21e-06 | Fine adjustment phase |
| 5000 | 0.81 | 1.44 | 1.08e-06 | Low-LR refinement |
| 6000 | 0.98 | 1.43 | 6.30e-09 | Final convergence |

*   **Optimizer**: `paged_adamw_8bit`
*   **LR Scheduler**: Cosine Decay with Warmup (0.1 ratio)
*   **Max LR**: 1e-5
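
The shape of this schedule can be sketched in a few lines. This is a minimal illustration, assuming linear warmup over the first 10% of steps (rounded up) followed by cosine decay to zero, with the total of 6,188 steps taken from the log above. The `lr_at` helper is a hypothetical name, and the logged values may differ slightly from this idealized curve.

```python
import math


def lr_at(step: int, total_steps: int = 6188, max_lr: float = 1e-5,
          warmup_ratio: float = 0.1) -> float:
    """Linear warmup to max_lr, then cosine decay to zero (idealized sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: LR rises linearly from 0 to max_lr.
        return max_lr * step / max(1, warmup_steps)
    # Decay: progress runs from 0 (end of warmup) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 3000, 6000):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions warmup ends around step 619, which is why the learning rate at step 500 is still below the 1e-5 peak while the value at step 1000 is close to it.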

## 🚀 Usage

The model can be loaded and used for translation with the `transformers` library.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_id = "gyung/lfm2-1.2b-koen-mt-v4-100k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16
)

# Sentence to translate
text = "The model is working correctly now."

# Build the chat messages (ChatML-style format recommended)
messages = [
    {"role": "system", "content": "Translate to Korean."},
    {"role": "user", "content": text}
]

# Apply the chat template and tokenize the input
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

# Generate the translation
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens (skip the prompt)
decoded = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(f"Input: {text}")
print(f"Output: {decoded}")
# Example output: 모델이 정상적으로 작동하고 있습니다.
```
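
The example above covers English to Korean only. A small helper for building the message list in either direction might look like the sketch below. `build_messages` and `SYSTEM_PROMPTS` are hypothetical names, and the `"Translate to English."` prompt is an assumed counterpart to the documented one, not something this model card confirms.

```python
# System prompts keyed by target language code.
SYSTEM_PROMPTS = {
    "ko": "Translate to Korean.",   # prompt shown in the usage example
    "en": "Translate to English.",  # ASSUMPTION: mirrored prompt, not confirmed here
}


def build_messages(text: str, target_lang: str = "ko") -> list[dict[str, str]]:
    """Build a chat-format message list for one translation request."""
    if target_lang not in SYSTEM_PROMPTS:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[target_lang]},
        {"role": "user", "content": text},
    ]


messages = build_messages("The model is working correctly now.", target_lang="ko")
print(messages[0]["content"])  # Translate to Korean.
```

The returned list can be passed to `tokenizer.apply_chat_template` exactly as in the usage example.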

## ⚙️ Training Details

This model was trained on Kaggle with T4 x 2, using settings tuned for that environment.

### Configuration
*   **Base Model**: `LiquidAI/LFM2-1.2B`
*   **Dataset**: `dataset_100000.jsonl` (English-Korean Parallel, 100k samples)
*   **Hardware**: NVIDIA T4 GPU x 2 (Data Parallelism, DDP)
*   **Epochs**: 1
*   **Batch Size**: 1 per device (Gradient Accumulation 16) -> Effective Batch Size 32
*   **Optimizer**: `paged_adamw_8bit`
*   **Learning Rate**: 1e-5 (Cosine Scheduler, Warmup 0.1)
*   **Precision**: Mixed precision (FP16), suited to the T4
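
The effective batch size in the list above is the product of per-device batch size, gradient accumulation steps, and number of GPUs. A minimal sketch of that arithmetic follows, using a hypothetical `steps_per_epoch` helper and assuming all 100,000 samples are used for training. The logged run of about 6,188 steps is closer to the single-process figure, so the sample count or process setup of the actual run may have differed from this nominal calculation.

```python
import math


def steps_per_epoch(num_samples: int, per_device_batch: int,
                    grad_accum: int, num_gpus: int) -> tuple[int, int]:
    """Return (effective batch size, optimizer steps for one epoch)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return effective_batch, math.ceil(num_samples / effective_batch)


# Nominal configuration from the list above: 1 per device, accumulation 16, 2 GPUs.
print(steps_per_epoch(100_000, 1, 16, 2))  # (32, 3125)
# Same settings on a single process, for comparison.
print(steps_per_epoch(100_000, 1, 16, 1))  # (16, 6250)
```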

### Training Code Snippet
```python
from trl import SFTConfig

# SFTConfig used for v4 (passed to TRL's SFTTrainer)
sft_config = SFTConfig(
    output_dir="/kaggle/working/lfm2-mt-v4",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    logging_steps=50,
    save_steps=500,
    eval_strategy="no",  # evaluation disabled during training to save time
    dataset_text_field="messages",
    packing=False,
    ddp_find_unused_parameters=False,
)
```
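
The configuration points at a `messages` field, which suggests chat-formatted training records. The actual schema of `dataset_100000.jsonl` is not shown here, so the sketch below is only one plausible way to turn a sentence pair into such a record. The `to_chat_record` helper is hypothetical, and reusing the system prompt from the usage example is an assumption.

```python
import json


def to_chat_record(english: str, korean: str) -> dict:
    """Wrap one English-Korean pair as a chat-style training record."""
    return {
        "messages": [
            # ASSUMPTION: same system prompt as the inference example.
            {"role": "system", "content": "Translate to Korean."},
            {"role": "user", "content": english},
            {"role": "assistant", "content": korean},
        ]
    }


record = to_chat_record(
    "The model is working correctly now.",
    "모델이 정상적으로 작동하고 있습니다.",
)
# ensure_ascii=False keeps the Korean text readable in the JSONL file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Each such line would form one row of a JSONL file, with the assistant turn holding the reference translation.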

## ⚠️ Limitations
*   This is a small 1.2B-parameter model, so it may underperform larger models (4B+) on highly complex or specialized content.
*   It may hallucinate on rare words not covered by the training data and on very long sentences.

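## 📜 License

This model is released under the **Liquid AI LFM Open License v1.0**.
*   **Permitted**: Academic research and personal use are unrestricted.
*   **Commercial use**: Companies and individuals with annual revenue under USD 10 million (about KRW 14 billion) may use the model commercially free of charge.
*   **Restricted**: Companies with annual revenue above USD 10 million need a separate license agreement with Liquid AI.

See the [LICENSE](https://huggingface.co/LiquidAI/LFM2-1.2B/blob/main/LICENSE) file for details.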


## Citation

**Model**
```bibtex
@misc{lfm2-1.2b-koen-mt-v4-100k,
  author = {Gyung},
  title = {LFM2-1.2B Korean-English Machine Translation Model v4},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/gyung/lfm2-1.2b-koen-mt-v4-100k}}
}
```

**Base Model (LiquidAI LFM2-1.2B)**
```bibtex
@article{liquidai2025lfm2,
  title={LFM2 Technical Report},
  author={Liquid AI},
  journal={arXiv preprint arXiv:2511.23404},
  year={2025}
}
```

**Evaluation Dataset (Flores-200)**
```bibtex
@article{nllb2022,
  author = {NLLB Team and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and others},
  title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  year = {2022},
  journal = {arXiv preprint arXiv:2207.04672}
}
```

**Metrics**
```bibtex
@inproceedings{popovic-2015-chrf,
    title = "chrF: character n-gram F-score for automatic MT evaluation",
    author = "Popovi{\'c}, Maja",
    booktitle = "Proceedings of the Tenth Workshop on Statistical Machine Translation",
    month = sep,
    year = "2015",
    address = "Lisbon, Portugal",
    publisher = "Association for Computational Linguistics",
    pages = "392--395",
}

@inproceedings{post-2018-call,
    title = "A Call for Clarity in Reporting BLEU Scores",
    author = "Post, Matt",
    booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers",
    month = oct,
    year = "2018",
    address = "Belgium, Brussels",
    publisher = "Association for Computational Linguistics",
    pages = "186--191",
}
```