KomdigiITS-8B-DFK
Multimodal Classification

Ministral-3-8B-Base-2512 · LoRA · Vision-Language

01 Overview

A LoRA adapter fine-tuned on aitf-komdigi/KomdigiITS-8B-DFK-CPT (itself based on Ministral-3-8B-Base-2512) as a vision-language model for multimodal content classification. The model analyzes social media screenshots and assigns each one of four labels: netral (neutral), disinformasi (disinformation), fitnah (defamation), or ujaran kebencian (hate speech).

Trained with the SITA framework using Unsloth's SFT pipeline. Given an image, the model produces a structured analysis: a classification label followed by detailed Indonesian-language reasoning about any violations found.

Note: This is the final checkpoint from Workshop 3 (final-ministral-8b-cpt-ws3), trained on the DFK VLM Dataset V3 with augmented train/val splits. The base model (aitf-komdigi/KomdigiITS-8B-DFK-CPT) was continually pretrained on DFK-domain text before fine-tuning.

02 Model Details

Identity

Developed by: DFK Tim 3 ITS
Type: VLM, LoRA adapter
Language: Indonesian

Architecture

Base: KomdigiITS-8B-DFK-CPT
Arch: Mistral3ForConditionalGeneration
Params: 8B (base)
Precision: float16

03 Uses

Direct Use

Image-based content moderation classification for Indonesian social media. Given a screenshot, the model produces a structured analysis with a classification label (netral, disinformasi, fitnah, or ujaran kebencian) and a detailed reasoning in Indonesian.

Out-of-Scope Use

This model is not intended for general-purpose vision-language tasks. It is specialized for the DFK disinformation detection pipeline and should not be used for content moderation in other languages or domains without further fine-tuning.

04 Evaluation

Evaluated on the held-out validation split. Outputs were generated with greedy decoding (temperature=0.0), and the generated analyses were scored with BERTScore (bert-base-multilingual-cased).

Accuracy: 94.3%
F1 Macro: 91.6%
F1 Weighted: 94.3%
BERTScore F1: 80.2%

Per-Class Breakdown

Netral: P 0.937 · R 0.973 · F1 0.954 · n=970
Ujaran kebencian: P 0.979 · R 0.960 · F1 0.969 · n=867
Disinformasi: P 0.946 · R 0.895 · F1 0.920 · n=392
Fitnah: P 0.822 · R 0.822 · F1 0.822 · n=213
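The headline scores can be cross-checked against the per-class rows. The sketch below is illustrative only: it recomputes macro F1, support-weighted F1, and accuracy (which equals support-weighted recall in single-label classification) from the rounded per-class values, so the results can differ from the reported figures in the last digit. The names `per_class` and `aggregate` are made up for this example.

```python
# Per-class (precision, recall, F1, support), copied from the breakdown above.
per_class = {
    "netral": (0.937, 0.973, 0.954, 970),
    "ujaran kebencian": (0.979, 0.960, 0.969, 867),
    "disinformasi": (0.946, 0.895, 0.920, 392),
    "fitnah": (0.822, 0.822, 0.822, 213),
}


def aggregate(rows):
    """Return (macro_f1, weighted_f1, accuracy) from per-class rows."""
    total = sum(n for _, _, _, n in rows.values())
    # Macro F1: unweighted mean, so the small fitnah class counts as much as netral.
    macro_f1 = sum(f1 for _, _, f1, _ in rows.values()) / len(rows)
    # Weighted F1: each class contributes in proportion to its support.
    weighted_f1 = sum(f1 * n for _, _, f1, n in rows.values()) / total
    # Accuracy: correctly classified samples (recall * support) over all samples.
    accuracy = sum(r * n for _, r, _, n in rows.values()) / total
    return macro_f1, weighted_f1, accuracy


macro_f1, weighted_f1, accuracy = aggregate(per_class)
print(f"macro F1 {macro_f1:.3f} | weighted F1 {weighted_f1:.3f} | accuracy {accuracy:.3f}")
```

The gap between macro and weighted F1 comes from fitnah, which has both the lowest F1 and the smallest support.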

Generation Quality Metrics

BERTScore · bert-base-multilingual-cased

Precision: 0.804
Recall: 0.801
F1: 0.802

ROUGE-L · longest common subsequence overlap

Precision: 0.400
Recall: 0.387
F1: 0.387
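ROUGE-L scores a generated analysis by the longest common subsequence (LCS) it shares with the reference. The following is a minimal sketch of that computation, assuming whitespace tokenization; the tokenizer and ROUGE implementation used by the evaluation harness are not stated in this card, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, f1) of ROUGE-L on whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)  # share of generated tokens that are in the LCS
    recall = lcs / len(ref)      # share of reference tokens that are in the LCS
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f1 = rouge_l(
    "konten ini memuat ujaran kebencian",
    "konten ini jelas memuat ujaran kebencian",
)
print(f"P {p:.3f} | R {r:.3f} | F1 {f1:.3f}")
```

If the reported values are means of per-sample scores (an assumption), the mean F1 need not equal the harmonic mean of the mean precision and mean recall.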

05 Training Details

Training Data

Dataset: dfk_vlm_dataset_v3 (fitnah class augmented)
Splits: Fixed (train_aug.csv / val_aug.csv)
Train: 14,293 samples
Val: 2,831 samples

Label Classes

Netral: Factual content or non-DFK material; no violation detected

Disinformasi: Claims that contradict established facts, not directed at a specific person

Fitnah: False claims directed at a specific individual (defamation)

Ujaran kebencian: Hate speech targeting ethnicity, religion, race, or intergroup identity (SARA)

Dataset Distribution

Train (augmented) · 14,293 total

Netral: 3,883 (27.2%)
Fitnah: 3,846 (26.9%)
Ujaran kebencian: 3,484 (24.4%)
Disinformasi: 3,080 (21.5%)

Val (augmented) · 2,831 total

Netral: 970 (34.3%)
Ujaran kebencian: 867 (30.6%)
Disinformasi: 765 (27.0%)
Fitnah: 229 (8.1%)
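The class shares follow directly from the counts. As a small illustrative check (the helper name `class_shares` is made up for this example), the snippet below recomputes them and shows how augmentation brings fitnah close to parity in the training split while it remains the smallest class in validation.

```python
# Sample counts per class, copied from the distribution listed above.
train_counts = {"netral": 3883, "fitnah": 3846, "ujaran kebencian": 3484, "disinformasi": 3080}
val_counts = {"netral": 970, "ujaran kebencian": 867, "disinformasi": 765, "fitnah": 229}


def class_shares(counts):
    """Map each class to its share of the split, in percent."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


for name, counts in (("train", train_counts), ("val", val_counts)):
    shares = class_shares(counts)
    line = ", ".join(f"{label} {share:.1f}%" for label, share in shares.items())
    print(f"{name} (n={sum(counts.values())}): {line}")
```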

Configuration

LoRA Configuration

r: 16
Alpha: 16
Dropout: 0.1
Targets: all-linear
Vision layers: ✓ fine-tuned
Language layers: ✓ fine-tuned
Attention modules: ✓ fine-tuned
MLP modules: ✓ fine-tuned
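To make the rank and alpha values concrete, here is a rough sketch of what they imply for a single adapted linear layer. It assumes standard LoRA scaling (alpha / r, not the rank-stabilized variant), and the 4096 × 4096 layer size is a hypothetical example, not a dimension taken from this model.

```python
def lora_scaling(r, alpha):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


R, ALPHA = 16, 16        # values from the LoRA configuration above
D_IN = D_OUT = 4096      # hypothetical layer size, for illustration only

added = lora_param_count(D_IN, D_OUT, R)
frozen = D_IN * D_OUT    # weights of the frozen base layer
print(f"scaling = {lora_scaling(R, ALPHA)}")
print(f"adapter params = {added} ({100 * added / frozen:.2f}% of the frozen layer)")
```

With alpha equal to r the scaling factor is 1, so the low-rank update is added to the frozen weights without rescaling.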

Hyperparameters

Epochs: 3
Batch: 16 effective (batch size 4 × 4 gradient accumulation steps)
LR: 5e-4
Optimizer: AdamW 8-bit
Max length: 4096
Max grad norm: 1
Warmup ratio: 0.03
Gradient checkpointing: unsloth
Seed: 3407
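These settings determine the length of the optimization schedule. The sketch below estimates it under two assumptions that are not stated in this card: training on a single device, and counting a trailing partial accumulation group as one optimizer step. The actual trainer may differ by a step or so; `schedule` is a hypothetical helper.

```python
import math


def schedule(num_samples, batch_size, grad_accum, epochs, warmup_ratio):
    """Estimate optimizer steps and warmup steps for a single-device run."""
    micro_batches = math.ceil(num_samples / batch_size)      # forward/backward passes per epoch
    steps_per_epoch = math.ceil(micro_batches / grad_accum)  # optimizer updates per epoch
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": batch_size * grad_accum,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


# 14,293 training samples, batch 4 x 4 accumulation, 3 epochs, warmup ratio 0.03.
plan = schedule(14293, 4, 4, 3, 0.03)
print(plan)
```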

Trainer

Type: unsloth_vlm_sft (Unsloth VLM SFT trainer)
Train on: Responses only
Instruction part: [INST]
Response part: [/INST]
Best model: Selected by eval_loss (lower is better)
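Training on responses only means the loss is computed on the tokens after the response marker, while the prompt tokens are masked out. The snippet below is a simplified illustration of that idea, not Unsloth's implementation: it treats each marker as a single token and uses made-up token ids.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_non_response(token_ids, tokens, instruction_part="[INST]", response_part="[/INST]"):
    """Keep labels only for tokens that follow the response marker."""
    labels = []
    in_response = False
    for token_id, token in zip(token_ids, tokens):
        if token == instruction_part:
            in_response = False          # a new instruction begins: stop learning
        if token == response_part:
            labels.append(IGNORE_INDEX)  # the marker itself is not a training target
            in_response = True
            continue
        labels.append(token_id if in_response else IGNORE_INDEX)
    return labels


# Hypothetical tokenization of a shortened training sample.
tokens = ["<s>", "[INST]", "Klaim", "[IMG]", "[/INST]", "Label", ":", "fitnah", "</s>"]
token_ids = [1, 3, 500, 10, 4, 600, 601, 602, 2]
labels = mask_non_response(token_ids, tokens)
print(labels)
```

Only the label, the analysis, and the end-of-sequence token contribute to the loss, so the model is not trained to reproduce the prompt or the retrieved facts.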

Prompt Template

Each sample is formatted as a multi-turn conversation using the ministral_3 chat template. The dataset builds structured content blocks which the Jinja template renders as:

<s>[SYSTEM_PROMPT]...default Ministral system prompt...[/SYSTEM_PROMPT][INST]Anda adalah seorang analis konten media sosial ahli. Diberikan tangkapan layar dari sebuah konten, tentukan label kategori pelanggaran dan berikan analisis detail mengenai pelanggaran yang ditemukan.Ringkasan: {ringkasan}
Klaim: {klaim}
Fakta: {fakta}[IMG][/INST]Label: {label}

Analisis: {analisis}</s>

Input Fields

Ringkasan: Content summary. In the RAG pipeline this is the concatenation of the image caption (from a captioning model) and any user-provided text (e.g. post caption, tweet text). It effectively holds all available textual context about the content.

Klaim: The core claim extracted from the content, used as a web search query for fact-checking. Generated by an LLM from the ringkasan. In simpler setups it can also be a direct caption or user-provided text.

Fakta: Verification context retrieved via web search. Contains numbered search results with titles, descriptions, and source URLs. If no relevant sources are found, it defaults to "Tidak ditemukan sumber yang valid."

[IMG]: Screenshot of the social media post being analyzed.
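As a sketch of how these fields could be assembled into the user turn, the helper below builds a list of content blocks in the common chat-template style. The block layout (instruction text, then context text, then the image) is an assumption inferred from the rendered template, and `build_user_content` is a hypothetical name; the instruction string and the fallback for Fakta are taken from this card.

```python
INSTRUCTION = (
    "Anda adalah seorang analis konten media sosial ahli. Diberikan tangkapan layar "
    "dari sebuah konten, tentukan label kategori pelanggaran dan berikan analisis "
    "detail mengenai pelanggaran yang ditemukan."
)
NO_SOURCES = "Tidak ditemukan sumber yang valid."  # default when retrieval finds nothing


def build_user_content(ringkasan, klaim, fakta=None):
    """Assemble the user-turn content blocks (layout is an assumption)."""
    fakta = fakta.strip() if fakta and fakta.strip() else NO_SOURCES
    context = f"Ringkasan: {ringkasan}\nKlaim: {klaim}\nFakta: {fakta}"
    return [
        {"type": "text", "text": INSTRUCTION},
        {"type": "text", "text": context},
        {"type": "image"},  # placeholder rendered as [IMG]; the screenshot is passed separately
    ]


content = build_user_content(
    ringkasan="Tangkapan layar unggahan tentang seorang pejabat.",
    klaim="Pejabat tersebut menerima suap.",
)
for block in content:
    print(block)
```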

Output Fields

Label: One of netral, disinformasi, fitnah, or ujaran kebencian.

Analisis: Free-form Indonesian-language explanation of why the content was assigned its label, referencing the image, context, and any retrieved facts.
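Because the output follows a fixed "Label: ... / Analisis: ..." layout, downstream code can split it with a small parser. The following is one possible sketch; `parse_response` is a hypothetical helper, and returning `None` for malformed or unknown labels is a design choice of this example rather than documented behavior.

```python
import re

LABELS = ("netral", "disinformasi", "fitnah", "ujaran kebencian")

# "Label: <label>", one or more newlines, then "Analisis: <free text>".
_PATTERN = re.compile(
    r"Label:\s*(?P<label>.+?)\s*\n+\s*Analisis:\s*(?P<analisis>.*)",
    re.DOTALL,
)


def parse_response(text):
    """Split a generated response into label and analysis, or return None."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    label = match.group("label").strip().lower()
    if label not in LABELS:
        return None  # reject labels outside the four known classes
    return {"label": label, "analisis": match.group("analisis").strip()}


result = parse_response("Label: fitnah\n\nAnalisis: Konten menuduh seseorang tanpa bukti.")
print(result)
```

Rejecting unknown labels keeps malformed generations out of the metrics instead of silently mapping them to a class.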

Full Training Config

experiment_name: final-ministral-8b-cpt-ws3
seed: 3407

reporting:
  wandb: true
  wandb_project: "DFK3"

model:
  name: unsloth_vlm
  pretrained: aitf-komdigi/KomdigiITS-8B-DFK-CPT
  kwargs:
    load_in_4bit: false
    chat_template: "sita/templates/ministral_3.jinja"

adapter:
  name: unsloth_vlm_lora
  kwargs:
    finetune_vision_layers: true
    finetune_language_layers: true
    finetune_attention_modules: true
    finetune_mlp_modules: true
    r: 16
    lora_alpha: 16
    lora_dropout: 0.1
    bias: "none"
    target_modules: "all-linear"
    use_gradient_checkpointing: "unsloth"
    random_state: 3407

dataset:
  name: dfk_vlm_dataset_v3
  kwargs:
    data_dir: /content/dataset/images/images

training:
  num_epochs: 3
  batch_size: 4
  learning_rate: 5e-4
  gradient_accumulation_steps: 4
  max_grad_norm: 1
  warmup_ratio: 0.03
  weight_decay: 0
  logging_steps: 1
  eval_steps: 250
  extra:
    seed: 3407
    max_length: 4096
    load_best_model_at_end: true
    metric_for_best_model: eval_loss
    greater_is_better: false

trainer:
  name: unsloth_vlm_sft
  kwargs:
    train_on_responses_only: true
    instruction_part: "[INST]"
    response_part: "[/INST]"
    optim: adamw_8bit

evaluation:
  name: vlm_gen
  kwargs:
    max_new_tokens: 512
    temperature: 0.0
    bert_model: bert-base-multilingual-cased
    batch_size: 16
    num_workers: 11

06 Model Sources

Framework: SITA
W&B Run: DFK3 / final-ministral-8b-cpt-ws3

07 Framework Versions

TRL: 0.24.0
Transformers: 5.5.0
PyTorch: 2.11.0+cu128
Datasets: 4.3.0
PEFT: 0.19.0
Tokenizers: 0.22.2
