---
license: mit
tags:
- sentiment-analysis
- text-classification
- multiclass-classification
- sentence-transformers
- xgboost
- reddit
- hybrid-model
language:
- en
metrics:
- accuracy
- f1
pipeline_tag: text-classification
widget:
- text: "I love this product! It's amazing and works perfectly."
  example_title: "Positive Example"
- text: "This is terrible. I hate it so much."
  example_title: "Negative Example"
- text: "The weather is okay today."
  example_title: "Neutral Example"
---

# Reddit Sentiment Analysis - Hybrid Model

🎯 **Test Accuracy: 0.9966**

## Model Description

This hybrid sentiment analysis model combines **Sentence Transformers** for semantic embeddings with **XGBoost** for classification. It was trained on Reddit comments for three-class sentiment analysis: **Negative**, **Positive**, and **Neutral**.

### Architecture
```
Input Text → SentenceTransformer → Embeddings (768D) →
Feature Engineering (Length + Sentiment + POS) → XGBoost → Prediction
```

## Quick Start

```python
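import os
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer
from textblob import TextBlob
import nltk
from huggingface_hub import hf_hub_download, snapshot_download

# Download NLTK data (newer NLTK releases look for the "_tab" / "_eng" resource names)
for resource in ('punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'):
    nltk.download(resource, quiet=True)

REPO_ID = "USERNAME/sentimental_analysis_updated"

# Download the pickled XGBoost pipeline (a single file)
xgb_path = hf_hub_download(repo_id=REPO_ID, filename="xgboost_model.pkl")

# The SentenceTransformer is stored as a folder, and hf_hub_download only fetches
# single files, so pull just that folder with snapshot_download
repo_dir = snapshot_download(repo_id=REPO_ID, allow_patterns=["sentence_transformer/*"])
sentence_path = os.path.join(repo_dir, "sentence_transformer")

# Load XGBoost model (only unpickle files from a repository you trust)
with open(xgb_path, 'rb') as f:
    pipeline_data = pickle.load(f)
xgb_model = pipeline_data['xgboost_model']
label_names = pipeline_data['label_names']

# Load SentenceTransformer
sentence_model = SentenceTransformer(sentence_path)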

def predict_sentiment(text):
    # Extract features
    embedding = sentence_model.encode([text])
    comment_length = np.array([len(text.split())]).reshape(-1, 1)
    sentiment_polarity = np.array([TextBlob(text).sentiment.polarity]).reshape(-1, 1)

    # POS counts
    try:
        tags = nltk.pos_tag(nltk.word_tokenize(text))
        pos_counts = np.array([[
            sum(1 for _, tag in tags if tag.startswith('J')),  # Adjectives
            sum(1 for _, tag in tags if tag.startswith('N')),  # Nouns
            sum(1 for _, tag in tags if tag.startswith('V'))   # Verbs
        ]])
    except Exception:  # e.g. missing NLTK resources; fall back to zero counts
        pos_counts = np.array([[0, 0, 0]])

    # Combine features
    features = np.hstack([embedding, comment_length, sentiment_polarity, pos_counts])

    # Predict
    # Cast NumPy scalars to plain Python types so the result is JSON-serializable
    prediction = int(xgb_model.predict(features)[0])
    confidence = float(xgb_model.predict_proba(features)[0].max())

    return {
        'label': label_names[prediction],
        'confidence': confidence,
        'prediction_id': prediction
    }

# Example usage
result = predict_sentiment("I love this new phone! It's amazing!")
print(f"Sentiment: {result['label']} (confidence: {result['confidence']:.3f})")
```
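
The feature vector built in `predict_sentiment` is a concatenation of several blocks, and knowing which columns belong to which block is useful when inspecting feature importances or debugging a shape mismatch. The sketch below computes the column ranges. The block widths are read off the `np.hstack` call above (768 is the embedding width stated for this model), and `FEATURE_BLOCKS` and `feature_layout` are illustrative names, not part of the released pipeline.

```python
from typing import Dict, List, Tuple

# Block widths in the order they are stacked by np.hstack in predict_sentiment.
# Assumption: these are taken from the example code, not from the saved model.
FEATURE_BLOCKS: List[Tuple[str, int]] = [
    ("embedding", 768),         # sentence embedding
    ("comment_length", 1),      # whitespace-separated token count
    ("sentiment_polarity", 1),  # TextBlob polarity
    ("pos_counts", 3),          # adjectives, nouns, verbs
]


def feature_layout(blocks: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Return {block name: (start, stop)} column ranges, with stop exclusive."""
    layout = {}
    start = 0
    for name, width in blocks:
        layout[name] = (start, start + width)
        start += width
    return layout


layout = feature_layout(FEATURE_BLOCKS)
total_width = max(stop for _, stop in layout.values())

for name, (start, stop) in layout.items():
    print(f"{name:20s} columns {start}..{stop - 1}")
print("total width:", total_width)
```

After loading the classifier, comparing `total_width` with `xgb_model.n_features_in_` is a quick way to confirm that the vector matches what the model was trained on.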

## Model Details

- **Base Model**: `paraphrase-mpnet-base-v2`
- **Classifier**: XGBoost with GPU acceleration
- **Features**: 773 dimensions (768 embeddings + 5 engineered: comment length, TextBlob polarity, and adjective/noun/verb counts)
- **Classes**: 0=Negative, 1=Positive, 2=Neutral
- **Training Data**: Reddit comments
- **Test Accuracy**: 0.9966
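
The class IDs above do not follow the more common negative/neutral/positive order, so it helps to keep the mapping explicit when decoding model output. A minimal sketch, assuming one row of probabilities in class-ID order (the order `predict_proba` uses for integer labels 0 to 2); `ID_TO_LABEL` and `decode` are illustrative names, not part of the released pipeline.

```python
from typing import Dict, Sequence

# Class IDs as listed in Model Details.
ID_TO_LABEL: Dict[int, str] = {0: "Negative", 1: "Positive", 2: "Neutral"}


def decode(probabilities: Sequence[float]) -> Dict[str, object]:
    """Turn one row of class probabilities into a labelled prediction."""
    if len(probabilities) != len(ID_TO_LABEL):
        raise ValueError(
            f"expected {len(ID_TO_LABEL)} probabilities, got {len(probabilities)}"
        )
    # Argmax over class IDs; ties resolve to the lowest ID.
    best_id = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "label": ID_TO_LABEL[best_id],
        "confidence": float(probabilities[best_id]),
        "prediction_id": best_id,
        "scores": {ID_TO_LABEL[i]: float(p) for i, p in enumerate(probabilities)},
    }


result = decode([0.10, 0.15, 0.75])
print(result["label"], result["confidence"])
```

Returning the full `scores` dictionary alongside the top label makes it easier to spot borderline cases where two classes have similar probability.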

## Training Configuration

- **XGBoost Parameters**: n_estimators=300, learning_rate=0.05, max_depth=6
- **Features**: Embeddings + Comment Length + TextBlob Sentiment + POS Counts
- **Class Balancing**: Sample weights for imbalanced data
- **Validation**: Stratified train/val/test split
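
The class balancing and stratified split listed above can be sketched in plain Python. The weighting formula here is the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, and the 80/10/10 ratio is an assumption; this card does not state which scheme or ratio was actually used. Weights computed this way would typically be passed as `sample_weight` to `XGBClassifier.fit`.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def balanced_sample_weights(labels: Sequence[int]) -> List[float]:
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


def stratified_split(
    labels: Sequence[int],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratio
    seed: int = 42,
) -> Dict[str, List[int]]:
    """Split sample indices into train/val/test, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * ratios[0])
        n_val = int(len(indices) * ratios[1])
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


# Toy imbalanced label set: 0=Negative, 1=Positive, 2=Neutral
labels = [0] * 20 + [1] * 30 + [2] * 50
splits = stratified_split(labels)
train_labels = [labels[i] for i in splits["train"]]
train_weights = balanced_sample_weights(train_labels)

print({name: len(idx) for name, idx in splits.items()})
print(Counter(train_labels))
```

With these weights, each class contributes the same total weight to the training loss, so the minority class is not drowned out by the majority class.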

## Citation

```bibtex
@misc{reddit-sentiment-hybrid,
  title={Reddit Sentiment Analysis - Hybrid Model},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/USERNAME/sentimental_analysis_updated}
}
```

## License

MIT License