DictaLM 2.0 - Israeli Law

A Hebrew legal language model fine-tuned on 140,000+ Israeli legal documents. Built for understanding, generating, and working with Israeli law, court rulings, and civil rights content.

Model Details

| Base Model | dicta-il/dictalm2.0 (Mistral-based, 7B params) |
| Training | Continued pretraining with QLoRA (4-bit) via Unsloth |
| Language | Hebrew |
| Domain | Israeli law, court rulings, legislation, civil rights |
| License | Apache 2.0 |

Training Data

The model was trained on ~140,000 Israeli legal documents. The table lists each source with its approximate document count after filtering:

| Source | Documents | Description |
|---|---|---|
| Israeli Courts (court.gov.il) | ~97,000 | Supreme Court and district court rulings |
| Kol-Zchut (kolzchut.org.il) | ~5,300 | Citizens' rights guides and legal explainers |
| Wikisource Laws | ~3,800 | Israeli legislation and basic laws |
| Total (after filtering) | ~106,000 | |
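
As a rough check on the corpus mix, the per-source counts can be combined with the 5x upsampling of Kol-Zchut and Wikisource described under Data Pipeline. The sketch below is plain arithmetic on the approximate figures quoted in this card. The variable names are illustrative, and it assumes that "5x" means each document is simply repeated five times, which the card does not spell out.

```python
# Approximate per-source document counts after filtering (from this card).
counts = {"courts": 97_000, "kol_zchut": 5_300, "wikisource": 3_800}
# Upsampling factors from the Data Pipeline section; court rulings stay at 1x.
# Assumption: "5x" means each document is repeated five times.
factors = {"courts": 1, "kol_zchut": 5, "wikisource": 5}

unique_total = sum(counts.values())
effective = {name: counts[name] * factors[name] for name in counts}
effective_total = sum(effective.values())

print(f"unique documents:    {unique_total:,}")
print(f"effective documents: {effective_total:,}")
for name, n in effective.items():
    share_before = counts[name] / unique_total
    share_after = n / effective_total
    print(f"{name:>10}: {share_before:.1%} -> {share_after:.1%}")
```

Under these assumptions the effective total is about 142,500 documents, and court rulings fall from roughly 91% to 68% of the mix. That is close to the ~140,000 figure quoted in this card, though the card does not say whether the two numbers are related.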

Data Pipeline:

  • Text cleaning and normalization (niqqud removal, whitespace, template stripping)
  • PII scrubbing (Israeli ID numbers, phone numbers, emails, credit cards)
  • Quality filtering (minimum length, Hebrew ratio, repetition, boilerplate)
  • Near-deduplication via MinHash LSH (threshold 0.7)
  • Source balancing: Kol-Zchut and Wikisource upsampled 5x to offset the dominance of court rulings
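
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the author's pipeline: the function names, the regular expressions, and the `min_chars` and `min_hebrew` thresholds are all assumptions, and only emails and ID-like numbers are scrubbed here. The near-duplicate check compares MinHash signatures pairwise at the 0.7 threshold named above and omits the LSH banding that a corpus of this size would need for speed.

```python
import hashlib
import re
import unicodedata

ISRAELI_ID = re.compile(r"(?<!\d)\d{9}(?!\d)")  # assumption: IDs appear as bare 9-digit runs
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def clean(text: str) -> str:
    """Drop Hebrew vowel points and cantillation marks, then collapse whitespace."""
    kept = (c for c in text
            if not ("\u0591" <= c <= "\u05C7" and unicodedata.category(c) == "Mn"))
    return re.sub(r"\s+", " ", "".join(kept)).strip()

def scrub_pii(text: str) -> str:
    """Replace emails and 9-digit ID-like numbers with placeholders."""
    return ISRAELI_ID.sub("[ID]", EMAIL.sub("[EMAIL]", text))

def hebrew_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hebrew letters."""
    letters = [c for c in text if c.isalpha()]
    hebrew = [c for c in letters if "\u05D0" <= c <= "\u05EA"]
    return len(hebrew) / len(letters) if letters else 0.0

def keep(text: str, min_chars: int = 200, min_hebrew: float = 0.5) -> bool:
    """Quality gate; both thresholds are hypothetical placeholders."""
    return len(text) >= min_chars and hebrew_ratio(text) >= min_hebrew

def _hash(shingle: str, seed: int) -> int:
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")

def minhash(text: str, num_perm: int = 64, k: int = 5) -> list:
    """MinHash signature over character k-shingles, one salted hash per slot."""
    shingles = {text[i:i + k] for i in range(max(1, len(text) - k + 1))}
    return [min(_hash(s, seed) for s in shingles) for seed in range(num_perm)]

def est_jaccard(sig_a: list, sig_b: list) -> float:
    """Share of matching slots, which estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def is_near_duplicate(a: str, b: str, threshold: float = 0.7) -> bool:
    return est_jaccard(minhash(a), minhash(b)) >= threshold

print(scrub_pii(clean("ת.ז.   123456789,  דוא\"ל: someone@example.co.il")))
```

In this sketch a document would pass through `clean`, then `scrub_pii`, then `keep`, and survivors would be compared with `is_near_duplicate`. A library such as datasketch would be the more usual choice for the MinHash LSH step in practice.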

Training Details

| Parameter | Value |
|---|---|
| GPU | NVIDIA A100-SXM4-40GB |
| Precision | BF16 + 4-bit QLoRA |
| LoRA rank | 64 |
| LoRA target modules | q, k, v, o, gate, up, down projections |
| Trainable parameters | 167M / 7.4B (2.26%) |
| Batch size | 16 (4 x 4 gradient accumulation) |
| Learning rate | 2e-4 (cosine schedule) |
| Epochs | 1 |
| Context length | 2,048 tokens |
| Packing | Enabled |
| Training time | ~7.75 hours |
| Framework | Unsloth + HuggingFace TRL |
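
The trainable-parameter figure can be cross-checked from the LoRA rank and the list of target modules. The sketch below assumes the standard Mistral-7B layer shapes (hidden size 4096, MLP width 14336, 8 key/value heads of 128 dimensions, 32 layers) carry over unchanged to dicta-il/dictalm2.0. The card only says the base model is Mistral-based, so treat those shapes as an assumption.

```python
# Layer shapes of the standard Mistral-7B architecture (assumed to carry over
# unchanged to dicta-il/dictalm2.0, which this card describes as Mistral-based).
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
RANK = 64              # LoRA rank reported in this card

# (input features, output features) for each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumptions the total comes to 167,772,160 adapter parameters, which is consistent with the 167M trainable parameters reported above.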

Training Loss

| Step | Train Loss | Val Loss |
|---|---|---|
| 500 | 0.850 | 0.827 |
| 1000 | 0.781 | 0.816 |
| 2000 | 0.794 | 0.801 |
| 4000 | 0.697 | 0.782 |
| 6000 | 0.636 | 0.770 |
| 8000 | 0.564 | 0.769 |
| 8785 | 0.700 | 0.769 |
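
The last step in the loss table, together with the batch size, context length, and training time from Training Details, gives a rough token budget for the run. The calculation below is an upper-bound estimate, not a reported figure: it assumes step 8785 is the final optimizer step and that packing fills every sequence to the full 2,048 tokens.

```python
steps = 8785             # last step in the loss table
batch_size = 16          # effective batch size (4 x 4 gradient accumulation)
context_length = 2048    # tokens per packed sequence
hours = 7.75             # approximate training time

sequences = steps * batch_size
# Upper bound: assumes every packed sequence is filled to the full context length.
tokens = sequences * context_length
tokens_per_second = tokens / (hours * 3600)

print(f"{sequences:,} sequences, {tokens:,} tokens")
print(f"~{tokens_per_second:,.0f} tokens/s")
```

On these assumptions the run processed about 140,000 packed sequences, or roughly 288 million tokens, at a little over 10,000 tokens per second on the single A100.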

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mufeedh28/dictalm2-israeli-law-merged"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Completion-style prompt: "The Supreme Court ruled that"
prompt = "בית המשפט העליון פסק כי"
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is needed for temperature and top_p to take effect
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Unsloth (faster inference)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("mufeedh28/dictalm2-israeli-law-merged")
FastLanguageModel.for_inference(model)  # switch the model to Unsloth's inference mode

# Completion-style prompt: "Basic Law: Human Dignity and Liberty states that"
inputs = tokenizer("חוק יסוד: כבוד האדם וחירותו קובע כי", return_tensors="pt").to("cuda")
# do_sample=True is needed for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example Outputs

Prompt: חוק יסוד: כבוד האדם וחירותו קובע כי
(English: "Basic Law: Human Dignity and Liberty states that")

כל אדם זכאי להגנה על חייו, גופו וכבודו. הזכות להגנה עצמית היא חלק מהזכות לכבוד. בנסיבות של הגנה עצמית מותר לאדם לתקוף את מי שמאיים עליו או תוקף אותו כדי להפסיק את התקיפה או למנוע אותה. שימוש בכוח במצב של הגנה עצמית מוגבל למידה הסבירה הדרושה בנסיבות המקרה.

English translation: Every person is entitled to protection of their life, body, and dignity. The right to self-defense is part of the right to dignity. In circumstances of self-defense, a person may attack whoever threatens or attacks them in order to stop or prevent the attack. The use of force in self-defense is limited to the reasonable extent required in the circumstances of the case.

Prompt: על פי חוק השכירות והשאילה
(English: "According to the Hire and Loan Law")

שטר-חוב הוא סוג מסוים של ערובה שמותר לדרוש משוכר דירה להפקיד בידי המשכיר. שטר-החוב צריך לעמוד בכל התנאים הבאים כדי שיהיה תקף: שוכרי הדירה יהיו רשומים בו כחייבים, על שטר החוב יופיע סכום כספי, השטר יהיה בר פרעון באופן מיידי עם דרישה של בעל הדירה.

English translation: A promissory note is a specific type of security that an apartment tenant may be required to deposit with the landlord. To be valid, the promissory note must meet all of the following conditions: the tenants are listed on it as debtors, a monetary amount appears on the note, and the note is payable immediately upon demand by the apartment owner.

Intended Use

  • Legal text completion and generation
  • Hebrew legal NLP research
  • Legal document understanding and analysis
  • Building legal search and retrieval systems
  • Educational tools for Israeli law

Limitations

  • This is a text completion model, not a chatbot. It continues text rather than answering questions.
  • May generate plausible-sounding but incorrect legal information. Do not use as legal advice.
  • Trained primarily on court rulings, so it may be less knowledgeable about specific regulatory domains.
  • May occasionally reproduce patterns from training data.
  • Hebrew-only. Performance in other languages is expected to be similar to that of the base DictaLM 2.0 model.
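
Because the model continues text rather than answering questions, a question usually has to be recast as the opening of a statement for the model to complete. The helper below is a hypothetical illustration of that idea, reusing the prompt openings from the examples in this card. The function name and the template set are assumptions and are not part of the model or any library.

```python
# Hypothetical prompt openers, modeled on the example prompts in this card.
TEMPLATES = {
    # "According to <law name>"
    "statute": "על פי {name}",
    # "The Supreme Court ruled that"
    "ruling": "בית המשפט העליון פסק כי",
}

def completion_prefix(kind: str, name: str = "") -> str:
    """Build a sentence opening for the model to continue, instead of a question."""
    return TEMPLATES[kind].format(name=name).strip()

# Opening for a statement about the Hire and Loan Law.
print(completion_prefix("statute", "חוק השכירות והשאילה"))
```

The returned string can be passed to the tokenizer as the prompt, in the same way as the prompts in the Usage section.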

Citation

@misc{dictalm2-israeli-law,
  title={DictaLM 2.0 - Israeli Law},
  author={Mufeed Haj},
  year={2026},
  url={https://huggingface.co/mufeedh28/dictalm2-israeli-law-merged},
  note={Fine-tuned from dicta-il/dictalm2.0 on Israeli legal corpus}
}
