---
language:
- pt
license: cc-by-nc-nd-4.0
tags:
- text-segmentation
- topic-segmentation
- bert
- next-sentence-prediction
- document-segmentation
- meeting-minutes
library_name: transformers
base_model:
- neuralmind/bert-base-portuguese-cased
---

# NSP-CouncilSeg: Linear Text Segmentation for Municipal Meeting Minutes

## Model Description

**NSP-CouncilSeg** is a fine-tuned BERT model for linear text segmentation of municipal council meeting minutes. It uses the Next Sentence Prediction (NSP) head to identify topic boundaries in long-form documents, which makes it well suited to administrative and governmental meeting minutes.

**Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous15135/nsp-councilseg-demo)

### Key Features

- 🎯 **Specialized for Meeting Minutes**: Fine-tuned on Portuguese municipal council meeting minutes
- ⚡ **Fast Inference**: Efficient BERT-base architecture for real-time segmentation
- 📊 **High Accuracy**: Achieves a BED F-measure of 0.79 on the CouncilSeg dataset
- 🔄 **Sentence-Level Segmentation**: Identifies topic boundaries at sentence granularity

## Model Details

- **Base Model**: `neuralmind/bert-base-portuguese-cased`
- **Architecture**: BERT with Next Sentence Prediction head
- **Parameters**: 110M
- **Max Sequence Length**: 512 tokens
- **Fine-tuning Dataset**: CouncilSeg (Portuguese Municipal Meeting Minutes)
- **Fine-tuning Method**: Focal Loss with boundary-aware weighting
- **Training Framework**: PyTorch + Transformers

## How It Works

The model takes two consecutive sentences and predicts whether they belong to the same topic (label 0: "is_next") or mark a topic transition (label 1: "not_next"). Applying this classifier to every adjacent sentence pair in a document, in order, yields the document's topic boundaries.

```text
Sentence A: "Pelo Senhor Presidente foi presente a reunião a ata n.º 28 de 20.12.2023."
Sentence B: "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
→ Prediction: Same Topic (confidence: 76%)

Sentence A: "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
Sentence B: "Não houve processos e requerimentos diversos a apresentar."
→ Prediction: Topic Boundary (confidence: 82%)
```

## Usage

### Quick Start with Transformers

```python
from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-councilseg")
model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-councilseg")

# Prepare input: two consecutive sentences from the minutes
sentence_a = "Pelo Senhor Presidente foi presente a reunião a ata n.º 28 de 20.12.2023."
sentence_b = "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."

# Tokenize the pair, truncating to the model's 512-token limit
inputs = tokenizer(
    sentence_a,
    sentence_b,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.softmax(logits, dim=1)

# Interpret results: index 0 is "is_next", index 1 is "not_next"
is_next_prob = probs[0, 0].item()
not_next_prob = probs[0, 1].item()

print(f"Is Next (same topic): {is_next_prob:.3f}")
print(f"Not Next (topic boundary): {not_next_prob:.3f}")

if not_next_prob > 0.5:
    print("🔴 Topic boundary detected!")
else:
    print("🟢 Same topic continues")
```

## Limitations

- **Domain Specificity**: Best performance on administrative/governmental meeting minutes
- **Language**: Optimized for Portuguese; English performance may vary
- **Document Length**: Designed for documents with 10-50 segments
- **Context Window**: Limited to 512 tokens per sentence pair
- **Ambiguous Boundaries**: May struggle with subtle topic transitions

## Model Card Contact

For questions or feedback, please open a discussion in the [model repository](https://huggingface.co/anonymous15135/nsp-councilseg/discussions).

## License

This model is released under the Attribution-NonCommercial-NoDerivatives 4.0 International