---
license: mit
tags:
- sentiment-analysis
- text-classification
- multiclass-classification
- sentence-transformers
- xgboost
- reddit
- hybrid-model
language:
- en
metrics:
- accuracy
- f1
pipeline_tag: text-classification
widget:
- text: "I love this product! It's amazing and works perfectly."
  example_title: "Positive Example"
- text: "This is terrible. I hate it so much."
  example_title: "Negative Example"
- text: "The weather is okay today."
  example_title: "Neutral Example"
---

# Reddit Sentiment Analysis - Hybrid Model

🎯 **Test Accuracy: 0.9966**

## Model Description

This hybrid sentiment analysis model combines **Sentence Transformers** for semantic embeddings with **XGBoost** for classification. It was trained on Reddit comments to assign each text to one of three sentiment classes: **Negative**, **Positive**, or **Neutral**.

### Architecture
```
Input Text → SentenceTransformer → Embeddings (768D) →
Feature Engineering (Length + Sentiment + POS) → XGBoost → Prediction
```
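
The feature-engineering step in the diagram can be sketched without loading any model. The snippet below is a minimal illustration that assumes the feature order used in the Quick Start code below (embedding, word count, TextBlob polarity, then adjective/noun/verb counts). The zero-filled embedding, the polarity value, and the POS tags are hard-coded placeholders, and `count_pos` and `build_feature_vector` are hypothetical helper names, not part of the released model.

```python
from typing import List, Sequence, Tuple

EMBEDDING_DIM = 768  # embedding size stated in the architecture diagram


def count_pos(tags: Sequence[Tuple[str, str]]) -> List[int]:
    """Count adjectives (J*), nouns (N*), and verbs (V*) from (token, tag) pairs."""
    prefixes = ("J", "N", "V")
    return [sum(1 for _, tag in tags if tag.startswith(p)) for p in prefixes]


def build_feature_vector(
    embedding: Sequence[float],
    text: str,
    polarity: float,
    tags: Sequence[Tuple[str, str]],
) -> List[float]:
    """Append the engineered features to the embedding, in Quick Start order."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} embedding values, got {len(embedding)}")
    word_count = len(text.split())
    engineered = [float(word_count), float(polarity)] + [float(c) for c in count_pos(tags)]
    return list(embedding) + engineered


# Placeholder inputs: a real run would take these from the encoder, TextBlob, and NLTK.
text = "I love this new phone"
tags = [("I", "PRP"), ("love", "VBP"), ("this", "DT"), ("new", "JJ"), ("phone", "NN")]
features = build_feature_vector([0.0] * EMBEDDING_DIM, text, polarity=0.5, tags=tags)

# The engineered features sit at the tail of the vector.
print(features[EMBEDDING_DIM:])
```

Keeping the engineered features at the end of the vector, in a fixed order, matters because XGBoost identifies features by column position: the order at inference time has to match the order used during training.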

## Quick Start

```python
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer
from textblob import TextBlob
import nltk
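import os
from huggingface_hub import hf_hub_download, snapshot_download

# Download NLTK data (resource names differ across NLTK versions, so request both sets)
for resource in ('punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'):
    nltk.download(resource, quiet=True)

# Download the model files
repo_id = "USERNAME/sentimental_analysis_updated"
xgb_path = hf_hub_download(repo_id=repo_id, filename="xgboost_model.pkl")
# Assumes "sentence_transformer" is a directory in the repo: hf_hub_download
# fetches single files only, so the folder is pulled with snapshot_download
snapshot_dir = snapshot_download(repo_id=repo_id, allow_patterns=["sentence_transformer/*"])
sentence_path = os.path.join(snapshot_dir, "sentence_transformer")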

# Load XGBoost model
with open(xgb_path, 'rb') as f:
    pipeline_data = pickle.load(f)
    xgb_model = pipeline_data['xgboost_model']
    label_names = pipeline_data['label_names']

# Load SentenceTransformer
sentence_model = SentenceTransformer(sentence_path)

def predict_sentiment(text):
    # Extract features
    embedding = sentence_model.encode([text])
    comment_length = np.array([len(text.split())]).reshape(-1, 1)
    sentiment_polarity = np.array([TextBlob(text).sentiment.polarity]).reshape(-1, 1)

    # POS counts
    try:
        tags = nltk.pos_tag(nltk.word_tokenize(text))
        pos_counts = np.array([[
            sum(1 for _, tag in tags if tag.startswith('J')),  # Adjectives
            sum(1 for _, tag in tags if tag.startswith('N')),  # Nouns
            sum(1 for _, tag in tags if tag.startswith('V'))   # Verbs
        ]])
    except Exception:  # fall back to zero counts if tokenization or tagging fails
        pos_counts = np.array([[0, 0, 0]])

    # Combine features
    features = np.hstack([embedding, comment_length, sentiment_polarity, pos_counts])

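    # Predict the class ID and the probability of the winning class
    prediction = int(xgb_model.predict(features)[0])
    confidence = float(xgb_model.predict_proba(features)[0].max())

    # Built-in int/float values keep the result JSON-serializable
    return {
        'label': label_names[prediction],
        'confidence': confidence,
        'prediction_id': prediction
    }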

# Example usage
result = predict_sentiment("I love this new phone! It's amazing!")
print(f"Sentiment: {result['label']} (confidence: {result['confidence']:.3f})")
```

## Model Details

- **Base Model**: `paraphrase-mpnet-base-v2`
- **Classifier**: XGBoost with GPU acceleration
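- **Features**: 773 dimensions as assembled in the Quick Start code (768 embeddings + 5 engineered: comment length, TextBlob polarity, and adjective, noun, and verb counts)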
- **Classes**: 0=Negative, 1=Positive, 2=Neutral
- **Training Data**: Reddit comments
- **Test Accuracy**: 0.9966
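
Note that the class IDs above are not ordered by polarity: Neutral is 2, not 1. A minimal sketch of decoding a probability vector with that mapping follows. `ID_TO_LABEL` and `decode_prediction` are hypothetical names for illustration, and the example probabilities are made up rather than real model output.

```python
from typing import Dict, Sequence, Union

# Class IDs as listed in the model details
ID_TO_LABEL = {0: "Negative", 1: "Positive", 2: "Neutral"}


def decode_prediction(probabilities: Sequence[float]) -> Dict[str, Union[str, float, int]]:
    """Map a per-class probability vector to a label, confidence, and class ID."""
    if len(probabilities) != len(ID_TO_LABEL):
        raise ValueError(
            f"expected {len(ID_TO_LABEL)} probabilities, got {len(probabilities)}"
        )
    # Index of the largest probability is the predicted class ID
    class_id = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "label": ID_TO_LABEL[class_id],
        "confidence": float(probabilities[class_id]),
        "prediction_id": class_id,
    }


# Made-up probabilities in class-ID order: [Negative, Positive, Neutral]
example = decode_prediction([0.05, 0.15, 0.80])
print(example)
```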

## Training Configuration

- **XGBoost Parameters**: n_estimators=300, learning_rate=0.05, max_depth=6
- **Features**: Embeddings + Comment Length + TextBlob Sentiment + POS Counts
- **Class Balancing**: Sample weights for imbalanced data
- **Validation**: Stratified train/val/test split
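
The card does not say how the sample weights or the split were computed, so the following is only a sketch of one common approach: inverse-frequency ("balanced") weights and a per-class shuffle-and-slice split. The 80/10/10 fractions, the seed, the toy label counts, and the helper names `balanced_sample_weights` and `stratified_split` are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def balanced_sample_weights(labels: Sequence[int]) -> List[float]:
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


def stratified_split(
    labels: Sequence[int],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed, not from the card
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Split sample indices into train/val/test, preserving class proportions."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy imbalanced label set: 20 Negative, 30 Positive, 50 Neutral
labels = [0] * 20 + [1] * 30 + [2] * 50
weights = balanced_sample_weights(labels)
train_idx, val_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(val_idx), len(test_idx))
print(sorted(set(round(w, 3) for w in weights)))
```

With weights like these, rarer classes contribute as much total weight as common ones. In the scikit-learn style XGBoost API, per-sample weights for the training rows would be passed through the `sample_weight` argument of `fit`.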

## Citation

```bibtex
@misc{reddit-sentiment-hybrid,
  title={Reddit Sentiment Analysis - Hybrid Model},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/USERNAME/sentimental_analysis_updated}
}
```

## License

MIT License