---
language:
- ko
license: apache-2.0
tags:
- task-specific
- structured-prediction
- korean
- public-sector
- qwen3
- domain-specific
- merge
base_model: Qwen/Qwen3-4B
datasets: []
pipeline_tag: text-generation
model-index:
- name: DLM-NL2JSON-4B
  results:
  - task:
      type: structured-prediction
      name: Korean NL-to-JSON Schema Extraction
    dataset:
      type: custom
      name: Busan Public Data Query Test Set
      args:
        num_samples: 2041
    metrics:
    - type: exact_match
      value: 94.4
      name: Exact Match Accuracy (raw)
    - type: exact_match
      value: 96.8
      name: Exact Match Accuracy (adjusted)
---
# DLM-NL2JSON-4B
**A 4B-parameter service-specific LLM that outperforms GPT-4o (+14%p) and Qwen3.5-35B (+22%p) on structured JSON extraction from Korean natural language queries.**
DLM (Domain-specific Language Model) is a series of task-specialized models by [Data Science Lab., Ltd.](https://huggingface.co/dataslab). This model is a LoRA-merged Qwen3-4B fine-tuned for structured JSON extraction in the Busan Metropolitan City public data analytics service.
## Key Results
Evaluated on 2,041 test samples across 10 task categories (field-level exact match, summary excluded):
| Model | Params | Accuracy | Accuracy (adj*) | Avg Latency |
|-------|--------|----------|-----------------|-------------|
| **DLM-NL2JSON-4B** | **4B** | **94.4%** | **96.8%** | 2.59s |
| GPT-4o | ~200B+ (unofficial estimate) | 80.5% | 82.5% | 1.58s |
| Qwen3.5-35B-A3B | 35B | 72.2% | 73.9% | 0.85s |
*\*adj: 64 CSM samples with known gold label noise excluded (see Evaluation Methodology)*
### Per-Category Breakdown
| Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B |
|----------|---|-------------|--------|-------------|
| ALP-A (population pattern) | 250 | **99.6%** | 56.0% | 47.6% |
| ALP-B (population flow) | 250 | **98.4%** | 50.4% | 46.8% |
| CSM (consumer spending) | 700 | **90.6%** | 90.1% | 86.1% |
| CREDIT-Income | 58 | **94.8%** | 53.4% | 34.5% |
| CREDIT-Spending | 77 | **97.4%** | 92.2% | 51.9% |
| CREDIT-Loan/Default | 73 | **98.6%** | 94.5% | 72.6% |
| CPI (business status) | 219 | 86.3% | **87.2%** | 54.8% |
| GIS-Inflow | 72 | **97.2%** | 79.2% | 93.1% |
| GIS-Outflow | 62 | **98.4%** | 77.4% | 98.4% |
| GIS-Consumption | 280 | 98.2% | **99.6%** | 97.5% |
DLM-NL2JSON-4B achieves the top score in **8 out of 10 categories** (tied with Qwen3.5-35B on GIS-Outflow), with the largest gains over GPT-4o on ALP-B (+48.0%p), ALP-A (+43.6%p), and CREDIT-Income (+41.4%p).
## Important: This is a Service-Specific Model
> **This model is NOT a general-purpose NL-to-JSON converter.** It is trained exclusively for a fixed set of predefined schemas used in a specific production service. It will not generalize to arbitrary JSON schemas or different prompt formats.
To use this model correctly, you **must**:
1. Use the **exact system prompts** it was trained on (one per task category; see the Usage section)
2. Include the corresponding **special token** (`<TASK_CSM>`, `<TASK_CREDIT>`, `<TASK_GIS>`, `<TASK_ALP>`, `<TASK_CPI>`) in the input
3. Expect output conforming only to the **predefined schemas** listed below
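The first two requirements can be enforced in a small request builder. The sketch below is illustrative only: `TASK_TOKENS` and `build_messages` are hypothetical names, and the assumption that each category family (for example all three CREDIT categories) shares one token is inferred from the token names rather than confirmed by the training code.

```python
# Hypothetical helper. The five tokens are the ones listed above; which
# category family uses which token is an assumption based on the names.
TASK_TOKENS = {
    "ALP": "<TASK_ALP>",
    "CSM": "<TASK_CSM>",
    "CREDIT": "<TASK_CREDIT>",
    "CPI": "<TASK_CPI>",
    "GIS": "<TASK_GIS>",
}


def build_messages(task: str, system_prompt: str, query: str) -> list[dict]:
    """Return chat messages with the task token prepended to the user query."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_TOKENS)}")
    if not system_prompt.strip():
        raise ValueError("the task-specific system prompt must not be empty")
    token = TASK_TOKENS[task]
    # Do not duplicate the token if the caller already included it.
    if query.lstrip().startswith(token):
        user_content = query.strip()
    else:
        user_content = f"{token} {query.strip()}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
```

Failing early on an unknown task or an empty system prompt is a design choice for this sketch: it surfaces a misconfigured request before the model is called, since the model is not expected to behave sensibly without its trained prompt and token.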
**Why publish a service-specific model?** This model serves as a reference implementation demonstrating that **task-specific LoRA fine-tuning on a 4B model can substantially outperform GPT-4o and larger open-source models** on constrained structured output tasks. We believe the DLM (Domain-specific Language Model) approach, which trains small, cheap-to-serve models for specific service endpoints, is an underexplored but highly practical paradigm.
## Intended Use
This model converts **Korean natural language queries about public/economic data** into **structured JSON** conforming to its predefined schemas. It is designed for and deployed in the **Busan Metropolitan City Big Data Wave** analytics dashboard.
**Input**: Free-form Korean query + task-specific system prompt
**Output**: Single-line JSON with exact schema compliance:
```json
{"summary":"##2025년 5월 부산광역시 해운대구 유통/의료 소비분석##","base_ym":202505,"region_nm":"부산광역시 해운대구","industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}
```
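Because downstream code depends on exact schema compliance, it can help to check each response before using it. The following is a minimal sketch for the CSM schema only. `CSM_SCHEMA` and `parse_csm_output` are hypothetical names, the key/type layout is taken from the CSM schema line in the system prompt, and the other categories would need their own layouts.

```python
import json

# Key/type layout copied from the CSM schema line in the system prompt;
# the other task categories use different key sets.
CSM_SCHEMA = {
    "summary": str,
    "base_ym": int,
    "region_nm": str,
    "industry_select": dict,
    "sex_cd": list,
    "age_cd": list,
    "category": int,
}


def parse_csm_output(raw: str) -> dict:
    """Parse one model response and check it against the CSM key/type layout."""
    text = raw.strip()
    if "\n" in text:
        raise ValueError("expected single-line JSON")
    data = json.loads(text)  # raises json.JSONDecodeError on truncated output
    if not isinstance(data, dict) or set(data) != set(CSM_SCHEMA):
        raise ValueError("unexpected top-level keys")
    for key, expected in CSM_SCHEMA.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], expected):
            raise ValueError(f"{key}: expected {expected.__name__}")
    for key in ("sex_cd", "age_cd"):
        if not all(type(v) is int for v in data[key]):
            raise ValueError(f"{key}: expected a list of integers")
    if data["category"] != 2:
        raise ValueError("category must be 2 for CSM")
    return data


sample = '{"summary":"##2025년 5월 부산광역시 해운대구 유통/의료 소비분석##","base_ym":202505,"region_nm":"부산광역시 해운대구","industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}'
parsed = parse_csm_output(sample)
print(parsed["base_ym"], sorted(parsed["industry_select"]))
```

Since `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` alone to handle both malformed JSON and layout mismatches.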
### Task Categories
| ID | Name | Schema Type |
|----|------|-------------|
| 0 | ALP-A | Population pattern (ptrn: residence/work/visit) |
| 1 | ALP-B | Population flow (flow_cd: inflow/outflow) |
| 2 | CSM | Consumer spending by industry |
| 3 | CREDIT-Income | Income statistics |
| 4 | CREDIT-Spending | Spending statistics |
| 5 | CREDIT-Loan | Loan/default statistics |
| 6 | CPI | Business/enterprise status |
| 9 | GIS-Inflow | Geographic inflow analysis |
| 10 | GIS-Outflow | Geographic outflow analysis |
| 11 | GIS-Consumption | Geographic consumption analysis |
## Training Details
| Item | Value |
|------|-------|
| Base model | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| Method | LoRA SFT → merged full model |
| Training samples | 16,292 (Korean) |
| Validation samples | 2,034 |
| Special tokens | `<TASK_CSM>`, `<TASK_CREDIT>`, `<TASK_GIS>`, `<TASK_ALP>`, `<TASK_CPI>` |
| Max sequence length | 6,144 |
| Architecture | Qwen3ForCausalLM (36 layers, 2560 hidden, 32 heads) |
Training data consists of synthetically generated Korean natural language queries paired with structured JSON outputs, covering the Busan public data analytics domain.
## Evaluation Methodology
- **Metric**: Field-level exact match. Each JSON key's value is compared against the gold label; the `summary` field is excluded from comparison.
- **Test set**: 2,041 samples, stratified by category
- **Gold label noise**: 64/700 CSM samples have `age_cd` capped at `[10..60]` instead of `[10..70]` for "all ages" queries, conflicting with the prompt specification. These affect all models equally and are excluded in the adjusted metric.
- **Train/Test overlap**: 16/2,041 input strings (0.78%) appear in both sets; they were retained for consistency.
- **All models** received identical system prompts per category.
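The metric described above could be computed roughly as follows. This is a sketch under stated assumptions, not the project's evaluation script: it assumes a sample counts as correct only when every non-`summary` key matches the gold label, that list order matters, and that unparseable output is a miss. The `exclude` argument (a set of sample indices) is a hypothetical way to drop the noisy CSM samples for the adjusted figure.

```python
import json


def fields_match(pred: dict, gold: dict, ignore=("summary",)) -> bool:
    """True when every key outside `ignore` is present in both and equal."""
    keys = (set(pred) | set(gold)) - set(ignore)
    return all(k in pred and k in gold and pred[k] == gold[k] for k in keys)


def exact_match_accuracy(pred_lines, gold_lines, exclude=frozenset()) -> float:
    """Percentage of samples whose non-summary fields all match the gold label."""
    correct = total = 0
    for i, (pred_raw, gold_raw) in enumerate(zip(pred_lines, gold_lines)):
        if i in exclude:
            continue
        total += 1
        try:
            pred = json.loads(pred_raw)
        except json.JSONDecodeError:
            continue  # truncated or malformed output counts as a miss
        if isinstance(pred, dict) and fields_match(pred, json.loads(gold_raw)):
            correct += 1
    return 100.0 * correct / total if total else 0.0


gold = [
    '{"summary":"a","base_ym":202401,"sex_cd":[0]}',
    '{"summary":"b","base_ym":0,"sex_cd":[0,1]}',
    '{"summary":"c","base_ym":202505,"sex_cd":[1]}',
]
pred = [
    '{"summary":"different","base_ym":202401,"sex_cd":[0]}',  # match: summary ignored
    '{"summary":"b","base_ym":0,"sex_cd":[1,0]}',             # miss: list order differs
    'not json',                                               # miss: unparseable
]
raw_acc = exact_match_accuracy(pred, gold)
adj_acc = exact_match_accuracy(pred, gold, exclude={1})
```

With the toy data, one of three samples matches in the raw score, and one of two once sample 1 is excluded, which mirrors how the raw and adjusted columns differ only in their denominators and excluded samples.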
### Hardware
| Model | Serving | GPU |
|-------|---------|-----|
| DLM-NL2JSON-4B | TensorRT-LLM | NVIDIA L4 24GB |
| GPT-4o | OpenAI API | N/A |
| Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "dataslab/DLM-NL2JSON-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# System prompt (example: CSM consumer spending schema, abbreviated for readability).
# Full prompts per category are available in the repository's eval/prompts.py.
system_prompt = """너는 반드시 **JSON 한 줄**만 출력한다. 설명/텍스트/코멘트/마크다운/코드블록/이모지/공백 줄 금지.
출력은 항상 { 로 시작하고 } 로 끝난다.
[스키마: TASK_CSM] (키/타입/순서 엄수)
{"summary":string,"base_ym":int,"region_nm":string,"industry_select":object,"sex_cd":[int],"age_cd":[int],"category":2}
[기본값]
- base_ym: 0, region_nm: "부산광역시"
- industry_select: 업종 미지정 시 전 대분류 키를 []로 설정
- sex_cd: [0,1], age_cd: [10,20,30,40,50,60,70]
- category: 항상 2
[대분류 코드표] 1:여행/숙박 2:여가/문화 3:유통 4:음식/주점 5:음식료품
6:의류/잡화 7:미용 8:의료 9:교육 10:생활 11:자동차"""

# Note: special token <TASK_CSM> must be included in the user message.
# Query in English: "Tell me about clothing/accessories and beauty in Jung-dong,
# Haeundae-gu for January 2024, focusing on men in their 20s to 40s."
user_query = "<TASK_CSM> 2024년 1월 해운대구 중동 의류/잡화랑 뷰티 쪽 남성 20~40대 위주로 알려줘"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_query},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3 template switch; mirrors the serving note below
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Greedy decoding: do_sample=False is sufficient, so no temperature is passed.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
# {"summary":"##2024년 1월 부산광역시 해운대구 중동 의류/잡화/미용 소비분석##","base_ym":202401,"region_nm":"부산광역시 해운대구 중동","industry_select":{"6":[],"7":[]},"sex_cd":[0],"age_cd":[20,30,40],"category":2}
# Note: "뷰티" (beauty) is mapped to 미용 (code 7); "해운대구 중동" is normalized to "부산광역시 해운대구 중동"
```
### vLLM / OpenAI-compatible serving
```python
from openai import OpenAI

client = OpenAI(base_url="http://your-server:8006/v1", api_key="token")

# system_prompt is the same CSM prompt defined in the Transformers example above.
resp = client.chat.completions.create(
    model="DLM-NL2JSON-4B",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "<TASK_CSM> 2024년 1월 해운대구 중동 의류/잡화랑 뷰티 쪽 남성 20~40대 위주로 알려줘"},
    ],
    max_tokens=512,
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # disable thinking mode
)
print(resp.choices[0].message.content)
```
> **Important**: When serving with vLLM/TensorRT-LLM, pass `chat_template_kwargs: {"enable_thinking": false}` to disable the Qwen3 thinking mode. Otherwise, reasoning tokens will consume the output budget and truncate the JSON.
## Known Limitations
1. **CPI category** (86.3%) is the weakest: complex industry classification codes (A~U with sub-codes) are harder to extract.
2. **CSM training data noise**: ~8% of CSM training samples have `age_cd` capped at 60 instead of 70 for "all ages" queries, introducing inconsistency.
3. **Domain-specific only**: This model is trained exclusively for the Busan public data schema extraction task. It has no general-purpose capabilities and should not be used as a general chatbot.
4. **Korean only**: All training data and prompts are in Korean.
## Citation
If you use this model, please cite:
```bibtex
@misc{dsl-dlm-nl2json-4b,
title={DLM-NL2JSON-4B: A Domain-Specific Language Model for Korean Public Data Schema Extraction},
author={Data Science Lab., Ltd.},
year={2026},
url={https://huggingface.co/dataslab/DLM-NL2JSON-4B}
}
```
## Contact
- **Organization**: Data Science Lab., Ltd.
- **Project**: Busan Metropolitan City Big Data Wave