---
license: mit
datasets:
- ufca-llms/jua
language:
- pt
- en
base_model:
- Qwen/Qwen3-Embedding-4B
pipeline_tag: sentence-similarity
tags:
- embeddings
- pt-br
- law
- jurisprudency
---

# jua-4B-legal-only

`jua-4B-legal-only` is a Brazilian Portuguese legal embedding model based on `Qwen/Qwen3-Embedding-4B`. It was adapted for legal retrieval with legal-domain supervision only, and is intended for scenarios where stronger specialization on institutionally framed legal search is preferred over broader cross-domain robustness.

This model is presented in the paper **Domain-Adaptive Dense Retrieval for Brazilian Legal Search**. It is the `legal-only` condition discussed in the paper.

## Model Overview

- Base model: `Qwen/Qwen3-Embedding-4B`
- Model type: text embedding
- Primary language: Brazilian Portuguese
- Intended use: dense retrieval for Brazilian legal search
- Training profile: legal-only adaptation

The legal-only training regime uses legal supervision from:

- `JUÁ-Juris` training pairs
- Ulysses-derived legislative supervision
- a small synthetic legislative extension based on alternative query formulations

Unlike the mixed model, this model does **not** add `SQuAD-pt`.
## Intended Use

This model is best suited for:

- jurisprudence retrieval
- institutionally framed legal search
- retrieval settings where legal phrasing and specialized domain supervision are especially important

If your use case is more heterogeneous, question-driven, or closer to broader semantic retrieval, the mixed model may be a better option:

- `ufca-llms/jua-4B-mixed`

## Usage

### Sentence Transformers

```python
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ufca-llms/jua-4B-legal-only")

# Queries carry the task instruction; documents are encoded as plain text.
queries = [
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: aposentadoria por pensão estatutária",
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: normas de auditoria operacional do TCU",
]

documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "As normas de auditoria operacional do TCU estabelecem diretrizes para planejamento, execução e relatório.",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

# Rows correspond to queries, columns to documents.
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
```

### Transformers

```python
# Requires transformers>=4.51.0

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, the last position holds a real token for every sequence.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # With right padding, pick the last non-padding position of each sequence.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device),
        sequence_lengths,
    ]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"


task = "Given a Brazilian legal search query, retrieve relevant legal passages or documents."

queries = [
    get_detailed_instruct(task, "aposentadoria por pensão estatutária"),
    get_detailed_instruct(task, "normas de auditoria operacional do TCU"),
]

documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "As normas de auditoria operacional do TCU estabelecem diretrizes para planejamento, execução e relatório.",
]

input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained(
    "ufca-llms/jua-4B-legal-only",
    padding_side="left",
)
model = AutoModel.from_pretrained("ufca-llms/jua-4B-legal-only")

batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
)
batch_dict.to(model.device)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**batch_dict)

embeddings = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
# L2-normalize so that the dot product below equals cosine similarity.
embeddings = F.normalize(embeddings, p=2, dim=1)

scores = embeddings[: len(queries)] @ embeddings[len(queries) :].T
print(scores.tolist())
```

## Evaluation

### JUÁ + Quati

The table below reproduces the `legal-only` results reported in the paper over the five legal datasets in the `JUÁ` evaluation environment plus `Quati`.

| Dataset | NDCG@10 | MRR@10 | MAP@10 |
|---|---:|---:|---:|
| JUÁ-Juris | 0.294 | 0.233 | 0.233 |
| JurisTCU | 0.375 | 0.650 | 0.179 |
| NormasTCU | 0.310 | 0.461 | 0.186 |
| Ulysses-RFCorpus | 0.426 | 0.619 | 0.301 |
| BR-TaxQA-R | 0.756 | 0.779 | 0.677 |
| Quati | 0.438 | 0.770 | 0.197 |
| **Average** | **0.433** | **0.585** | **0.296** |

### Shared legal comparison against broader baselines

On the four legal datasets shared by all baselines in the paper's broader comparison (`JUÁ-Juris`, `JurisTCU`, `NormasTCU`, and `BR-TaxQA-R`), this model obtains the following averages:

- `NDCG@10`: `0.434`
- `MRR@10`: `0.531`
- `MAP@10`: `0.319`

## Notes

- Query-side instructions are recommended.
- This model is specialized for Brazilian legal retrieval and may be less robust than the mixed model on broader semantic retrieval settings.
- For a more balanced profile across legal and question-driven retrieval regimes, see `ufca-llms/jua-4B-mixed`.

## Citation

If you use this model, please cite:

```bibtex
@misc{pereira2026domainadaptivedenseretrievalbrazilian,
  title={Domain-Adaptive Dense Retrieval for Brazilian Legal Search},
  author={Jayr Pereira and Roberto Lotufo and Luiz Bonifacio},
  year={2026},
  eprint={2605.04005},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2605.04005},
}
```
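
## Appendix: metric sketch

For readers unfamiliar with the metrics in the Evaluation section, the following is a minimal sketch of how `NDCG@10`, `MRR@10`, and `MAP@10` are commonly computed for a single query. It is not the paper's evaluation code: the function names are hypothetical, average precision is normalized by `min(number of relevant documents, k)` (other conventions exist), and NDCG uses linear gains with a log2 discount. Dataset-level scores such as those in the table are typically the mean of these per-query values.

```python
import math
from typing import Dict, List


def mrr_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document in the top k (0.0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def ap_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """Average precision at k. Assumption: normalized by min(#relevant, k)."""
    num_relevant = sum(1 for grade in relevance.values() if grade > 0)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / min(num_relevant, k)


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """NDCG at k. Assumption: linear gains and a log2(rank + 1) discount."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(ideal_gains, start=1)
    )
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the document IDs and relevance judgments are made up.
ranked = ["d3", "d1", "d7", "d2"]   # system ranking, best first
qrels = {"d1": 1, "d2": 1}          # relevant documents for this query

print(mrr_at_k(ranked, qrels))      # first relevant hit is at rank 2
print(ap_at_k(ranked, qrels))
print(ndcg_at_k(ranked, qrels))
```

In this toy ranking the first relevant document appears at rank 2, so the reciprocal rank is 0.5. The gap between `MRR@10` and `MAP@10` on datasets such as `JurisTCU` is consistent with this distinction: MRR rewards only the first hit, while MAP also accounts for how many of the remaining relevant documents are retrieved.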