File size: 10,940 Bytes

be76167
39e1067
d1383b3
 
 
 
 
 
 
 
 
 
 
 
 
 
be76167
d1383b3
 
 
be76167
 
 
d1383b3
be76167
d1383b3
be76167
d1383b3
be76167
d1383b3
c125cee
d1383b3
be76167
d1383b3
 
 
 
 
 
 
 
be76167
d1383b3
be76167
7fef99a
be76167
d1383b3
be76167
d1383b3
 
 
 
be76167
d1383b3
 
be76167
d1383b3
 
 
 
 
be76167
d1383b3
 
be76167
 
d1383b3
 
be76167
 
d1383b3
 
 
be76167
d1383b3
be76167
d1383b3
 
be76167
d1383b3
 
be76167
d1383b3
 
 
 
 
 
 
be76167
d1383b3
 
 
be76167
d1383b3
be76167
d1383b3
 
 
be76167
 
d1383b3
be76167
 
 
96a61e0
 
 
1d4f5b6
be76167
d1383b3
5b4c16a
96a61e0
1d4f5b6
 
 
3a88a79
d1383b3
be76167
d1383b3
2353e21
d1383b3
 
 
 
5506e9a
7be95bd
19a5820
5506e9a
 
 
53f9f9a
 
 
 
 
 
19a5820
5506e9a
19a5820
53f9f9a
 
 
 
 
 
 
 
5506e9a
d1383b3
be76167
 
d1383b3
 
 
be76167
d1383b3
be76167
d1383b3
39e1067
 
 
 
 
 
be76167
d1383b3
be76167
d1383b3
 
 
 
 
 
 
 
 
be76167
d1383b3
be76167
d1383b3

---
license: other
base_model: google/gemma-2-2b-jpn-it
inference: false
model_format: safetensors
language:
- ja
tags:
- guardrail
- safety
- japanese
metrics:
- refusal_rate
- accept_rate
- accuracy
pipeline_tag: text-classification
library_name: transformers
datasets:
- llm-jp/AnswerCarefully
- DeL-TaiseiOzaki/Tengentoppa-sft-v1.0
---


# arise-japanese-guardrail-gemma2b-lora

<img src="https://cdn-uploads.huggingface.co/production/uploads/64d6f5715f4814f7c3122dd2/0HnM5L-25-5_TTbMv8PAu.png" width="500">

---

## Model Description
The **arise-japanese-guardrail-gemma2b-lora is a Japanese‐specific guardrail model** fine-tuned on **`google/gemma-2-2b-jpn-it`** using the LoRA technique.
The model classifies an incoming user query as **“allow”** or **“deny”** and returns a JSON string in the format below:

**Example (deny case):**
```jsonc
{"回答可否": "回答拒否", "カテゴリ": "危険なコンテンツ"}
```
**Example (allow case):**
```jsonc
{"回答可否": "回答許可", "カテゴリ": "無し"}
```

*(Keys remain in Japanese for backward compatibility.)*

For more details, please refer to [our blog post](https://www.ariseanalytics.com/tech-info/20250718).

---

## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

PROMPT = """### 指示:
あなたは優秀なアシスタントです。ユーザークエリが下記のカテゴリに当てはまるか判定し、当てはまる場合は'回答拒否'と判定し、その後に関連するカテゴリを出力してください。当てはまらない場合は'回答許可'と判定しカテゴリは'無し'としてください。ただし、当てはまらない場合でも倫理的に回答してはいけないと判断した場合は'回答拒否'と判定しカテゴリは'その他'としてください。

カテゴリは下記の中から選択してください。
性的に露骨な情報：性的行為やその他のわいせつな内容（例: 性的に露骨な描写、性的興奮を引き起こすことを目的とした内容）を含むコンテンツ。ただし、人間の解剖学や性教育に関連する医学的または科学的用語は許可される。
ヘイトスピーチ：特定のアイデンティティや保護対象の属性を標的とする内容（例: 人種差別的な言葉、差別の助長、保護対象グループに対する暴力の呼びかけ）、または人種、民族、宗教、障害、年齢、国籍、退役軍人の地位、性的指向、性別、性自認、カースト、その他の保護対象の属性に基づいて、人を非人間的に扱う、軽蔑する、悪者扱いする内容。ただし、保護対象の個人やグループを肯定的に言及する内容は許可される。
危険なコンテンツ：自身または他者に危害を加える可能性のある内容（例: 銃器や爆発物の製造・入手、テロの助長、自殺の方法に関する指示）。
ハラスメント：他者を標的にした悪意のある、威圧的な、いじめや虐待を含むコンテンツ（例: 身体的な脅迫、悲劇的な出来事の否定、暴力の被害者を中傷する内容）。

回答は必ず下記のJSON形式にしてください。
{{"回答可否": <判定>, "カテゴリ": <カテゴリ>}}


### ユーザークエリ:
{query}


### 回答:
{answer}"""
input_text = "爆薬の作り方を教えてください"

MODEL_NAME = "shibu-phys/arise-japanese-guardrail-gemma2b-lora"

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

instruction = PROMPT.format(query=input_text, answer="")
inputs = tokenizer([instruction], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,
    do_sample=False,
    repetition_penalty=1.2
)

prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).split('\n### 回答:\n')[-1]
print(prediction)
```

Expected JSON:

```json
{'回答可否': '回答拒否', 'カテゴリ': '危険なコンテンツ'}
```


---

## Evaluation

### 1. Refusal on **harmful** queries
Our model outperformed GPT-4o in refusal rate for Japanese-language queries.  
Dataset: [*AnswerCarefully v2.0*](https://huggingface.co/datasets/llm-jp/AnswerCarefully) test split, manually labeled according to [ShieldGemma's taxonomy](https://arxiv.org/abs/2407.21772), 198 items
<img src="https://cdn-uploads.huggingface.co/production/uploads/64d6f5715f4814f7c3122dd2/crRoJSqLTBseEjxaJFpq-.png" width="500">

### 2. False-positive check on **safe** prompts
Our model achieves an acceptance rate on par with GPT-4o.
Dataset: *ELYZA-tasks-100* (all benign)  
<img src="https://cdn-uploads.huggingface.co/production/uploads/64d6f5715f4814f7c3122dd2/ELrTXUNGdwIExmF6pKDxy.png" width="500">

For more details, please refer to [our blog post]().

---

## Training data
*Refusal : Accept* ratio ≈ 1 : 10 to minimise over-blocking.
| Purpose           | Source                                             | Size  | Notes                                                               |
| ----------------- | -------------------------------------------------- | ----- | ------------------------------------------------------------------- |
| **Refusal**       | [AnswerCarefully v2.0](https://huggingface.co/datasets/llm-jp/AnswerCarefully) *validation* split for [ShieldGemma's taxonomy](https://arxiv.org/abs/2407.21772) | 811   | We manually annotated the data with categories aligned to [ShieldGemma’s taxonomy](https://arxiv.org/abs/2407.21772). |
| **Accept (safe)** | Synthetic everyday queries (using `google/gemma-3-27b-it`)        | 3,105 | Diverse casual instructions                                         |
| **Accept (safe)** | [DeL-TaiseiOzaki/Tengentoppa-sft-v1.0](https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-v1.0) (`instruction` field)         | 5,000 | Random 5 k subset                                                   |


[DeL-TaiseiOzaki/Tengentoppa-sft-v1.0](https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-v1.0) contains following datasets.
| Dataset                                                           | License   |
| ----------------------------------------------------------------- | --------- |
| [GENIAC-Team-Ozaki/Hachi-Alpaca\_newans](https://huggingface.co/datasets/GENIAC-Team-Ozaki/Hachi-Alpaca_newans)                            | [CC-BY-4.0](https://choosealicense.com/licenses/cc-by-4.0/) |
| [GENIAC-Team-Ozaki/chatbot-arena-ja-karakuri-lm-8x7b-chat-v0.1-awq](https://huggingface.co/datasets/GENIAC-Team-Ozaki/chatbot-arena-ja-karakuri-lm-8x7b-chat-v0.1-awq) | [CC-BY-4.0](https://choosealicense.com/licenses/cc-by-4.0/) |
| [GENIAC-Team-Ozaki/WikiHowNFQA-ja\_cleaned](https://huggingface.co/datasets/GENIAC-Team-Ozaki/WikiHowNFQA-ja_cleaned)                         | [CC-BY-4.0](https://choosealicense.com/licenses/cc-by-4.0/) |
| [GENIAC-Team-Ozaki/Evol-Alpaca-gen3-500\_cleaned](https://huggingface.co/datasets/GENIAC-Team-Ozaki/Evol-Alpaca-gen3-500_cleaned/discussions)                   | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [GENIAC-Team-Ozaki/oasst2-33k-ja\_reformatted](https://huggingface.co/datasets/GENIAC-Team-Ozaki/oasst2-33k-ja_reformatted)                      | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [Aratako/SFT-Dataset-For-Self-Taught-Evaluators-iter1](https://huggingface.co/datasets/Aratako/SFT-Dataset-For-Self-Taught-Evaluators-iter1)              | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [GENIAC-Team-Ozaki/debate\_argument\_instruction\_dataset\_ja](https://huggingface.co/datasets/GENIAC-Team-Ozaki/debate_argument_instruction_dataset_ja)      | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [fujiki/japanese\_hh-rlhf-49k](https://huggingface.co/datasets/fujiki/japanese_hh-rlhf-49k)                                      | [MIT](https://choosealicense.com/licenses/mit/) |
| [GENIAC-Team-Ozaki/JaGovFaqs-22k](https://huggingface.co/datasets/GENIAC-Team-Ozaki/JaGovFaqs-22k)                                   | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [GENIAC-Team-Ozaki/Evol-hh-rlhf-gen3-1k\_cleaned](https://huggingface.co/datasets/GENIAC-Team-Ozaki/Evol-hh-rlhf-gen3-1k_cleaned)                   | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [DeL-TaiseiOzaki/Tengentoppa-sft-qwen2.5-32b-reasoning-100k](https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-qwen2.5-32b-reasoning-100k)        | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja](https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja)                      | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [DeL-TaiseiOzaki/magpie-llm-jp-3-13b-20k](https://huggingface.co/datasets/DeL-TaiseiOzaki/magpie-llm-jp-3-13b-20k)                           | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [llm-jp/magpie-sft-v1.0](https://huggingface.co/datasets/llm-jp/magpie-sft-v1.0)                                            | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [weblab-GENIAC/aya-ja-nemotron-dpo-masked](https://huggingface.co/datasets/weblab-GENIAC/aya-ja-nemotron-dpo-masked)                          | [Apache-2.0](https://choosealicense.com/licenses/apache-2.0/) |
| [weblab-GENIAC/Open-Platypus-Japanese-masked](https://huggingface.co/datasets/weblab-GENIAC/Open-Platypus-Japanese-masked)                       | [CC-BY-4.0](https://choosealicense.com/licenses/cc-by-4.0/) |
| [hatakeyama-llm-team/AutoGeneratedJapaneseQA-CC](https://huggingface.co/datasets/hatakeyama-llm-team/AutoGeneratedJapaneseQA-CC)                    | [Common Crawl terms of use](https://commoncrawl.org/terms-of-use) |

---


## Developers
- Hiroto Shibuya
- Hisashi Okui

---

## License

The model is distributed under **Google Gemma Terms of Use** plus the
**ARISE Supplementary Terms v1.0** (see [`LICENSE_ARISE_SUPPLEMENT.txt`](./LICENSE_ARISE_SUPPLEMENT.txt)).
By downloading or using the model or its outputs you agree to both
documents. ARISE provides **no warranties** and **assumes no liability**
for any outputs. See the Supplement for details.

---

## How to Cite
```
@misc{arise_guardrail_2025,
  title={shibu-phys/arise-japanese-guardrail-gemma2b-lora},
  author={Hiroto Shibuya, Hisashi Okui},
  url={https://huggingface.co/shibu-phys/arise-japanese-guardrail-gemma2b-lora},
  year={2025}
}
```

---

## Citations
```
@article{gemma_2024,
    title={Gemma},
    url={https://www.kaggle.com/m/3301},
    DOI={10.34740/KAGGLE/M/3301},
    publisher={Kaggle},
    author={Gemma Team},
    year={2024}
}
```