---
license: mit
tags:
- sentiment-analysis
- text-classification
- multiclass-classification
- sentence-transformers
- xgboost
- hybrid-model
language:
- en
metrics:
- accuracy
- f1
pipeline_tag: text-classification
widget:
- text: "I love this product! It's amazing and works perfectly."
  example_title: "Positive Example"
- text: "This is terrible. I hate it so much."
  example_title: "Negative Example"
- text: "The weather is okay today."
  example_title: "Neutral Example"
---
# Reddit Sentiment Analysis - Hybrid Model

🎯 **Test Accuracy: 0.9966**

## Model Description

This hybrid sentiment analysis model combines **Sentence Transformers** for semantic embeddings with **XGBoost** for classification. It was trained on Reddit comments for three-class sentiment analysis: **Negative**, **Positive**, and **Neutral**.
### Architecture

```
Input Text → SentenceTransformer → Embeddings (768D) →
Feature Engineering (Length + Sentiment + POS) → XGBoost → Prediction
```
## Quick Start

```python
import os
import pickle

import nltk
import numpy as np
from huggingface_hub import hf_hub_download, snapshot_download
from sentence_transformers import SentenceTransformer
from textblob import TextBlob

# Download NLTK data for tokenization and POS tagging
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
# Newer NLTK releases look these resources up under different names
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

REPO_ID = "mahekgheewala/mahek-sentiment"

# Download the pickled XGBoost pipeline (a single file)
xgb_path = hf_hub_download(repo_id=REPO_ID, filename="xgboost_model.pkl")

# hf_hub_download fetches single files only; "sentence_transformer" is
# assumed to be a folder in the repo, so fetch it with snapshot_download
repo_dir = snapshot_download(repo_id=REPO_ID, allow_patterns=["sentence_transformer/*"])
sentence_path = os.path.join(repo_dir, "sentence_transformer")

# Load the XGBoost model and label names (unpickling requires xgboost to be installed)
with open(xgb_path, 'rb') as f:
    pipeline_data = pickle.load(f)
xgb_model = pipeline_data['xgboost_model']
label_names = pipeline_data['label_names']

# Load SentenceTransformer
sentence_model = SentenceTransformer(sentence_path)


def predict_sentiment(text):
    """Return the predicted label, confidence, and class id for one text."""
    # Sentence embedding, one row per input text
    embedding = sentence_model.encode([text])
    # Word count and TextBlob polarity, one column each
    comment_length = np.array([len(text.split())]).reshape(-1, 1)
    sentiment_polarity = np.array([TextBlob(text).sentiment.polarity]).reshape(-1, 1)

    # POS counts (three columns); fall back to zeros if tagging fails
    try:
        tags = nltk.pos_tag(nltk.word_tokenize(text))
        pos_counts = np.array([[
            sum(1 for _, tag in tags if tag.startswith('J')),  # Adjectives
            sum(1 for _, tag in tags if tag.startswith('N')),  # Nouns
            sum(1 for _, tag in tags if tag.startswith('V')),  # Verbs
        ]])
    except Exception:
        pos_counts = np.array([[0, 0, 0]])

    # Combine features column-wise
    features = np.hstack([embedding, comment_length, sentiment_polarity, pos_counts])

    # Predict the class id and the probability of the top class
    prediction = int(xgb_model.predict(features)[0])
    confidence = float(xgb_model.predict_proba(features)[0].max())
    return {
        'label': label_names[prediction],
        'confidence': confidence,
        'prediction_id': prediction,
    }


# Example usage
result = predict_sentiment("I love this new phone! It's amazing!")
print(f"Sentiment: {result['label']} (confidence: {result['confidence']:.3f})")
```
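The POS features in `predict_sentiment` depend on Penn Treebank tag prefixes: tags starting with `J` are adjectives, `N` nouns, and `V` verbs. As a minimal sketch of that counting step alone, the hypothetical helper `count_pos_prefixes` below (not part of the published code) works on an already-tagged sentence, so it runs without NLTK. The example tags are hand-written for illustration.

```python
def count_pos_prefixes(tagged_tokens, prefixes=("J", "N", "V")):
    """Count tokens whose POS tag starts with each prefix, in prefix order."""
    return [
        sum(1 for _, tag in tagged_tokens if tag.startswith(prefix))
        for prefix in prefixes
    ]


# Hand-written Penn Treebank style tags (illustrative, not NLTK output)
tagged = [
    ("I", "PRP"), ("love", "VBP"), ("this", "DT"),
    ("new", "JJ"), ("phone", "NN"),
]
pos_counts = count_pos_prefixes(tagged)
print(pos_counts)  # [1, 1, 1] -> one adjective, one noun, one verb
```

Matching on the first letter folds all sub-tags (for example `NN`, `NNS`, `NNP`) into one count per word class, which keeps the feature vector at three columns.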
## Model Details

- **Base Model**: `paraphrase-mpnet-base-v2`
- **Classifier**: XGBoost with GPU acceleration
- **Features**: 772 dimensions (768 embeddings + 4 engineered)
- **Classes**: 0=Negative, 1=Positive, 2=Neutral
- **Training Data**: Reddit comments
- **Test Accuracy**: 0.9966
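As a quick arithmetic check on the feature count, the sketch below tallies the columns that the Quick Start code concatenates: a 768-dimensional embedding, one word-count column, one TextBlob polarity column, and three POS counts. That adds up to 773 columns (5 engineered) rather than the 772 listed above, so it may be worth comparing against the loaded model. The `n_features_in_` check in the final comment assumes the pickled object is a fitted `XGBClassifier`.

```python
# Column counts taken from the Quick Start feature construction
FEATURE_BLOCKS = {
    "embedding": 768,         # sentence embedding size stated in this card
    "comment_length": 1,      # word count
    "sentiment_polarity": 1,  # TextBlob polarity
    "pos_counts": 3,          # adjectives, nouns, verbs
}

total_columns = sum(FEATURE_BLOCKS.values())
engineered = total_columns - FEATURE_BLOCKS["embedding"]
print(total_columns, engineered)  # 773 5

# With the model loaded, the expected width could be compared like this
# (assumption: xgb_model is a fitted XGBClassifier):
# assert xgb_model.n_features_in_ == total_columns
```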
## Training Configuration

- **XGBoost Parameters**: n_estimators=300, learning_rate=0.05, max_depth=6
- **Features**: Embeddings + Comment Length + TextBlob Sentiment + POS Counts
- **Class Balancing**: Sample weights for imbalanced data
- **Validation**: Stratified train/val/test split
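The card does not say how the sample weights were computed. One common scheme, shown here as an assumption rather than the author's method, is inverse class frequency: each sample of class c gets weight n / (k * n_c), where n is the number of samples, k the number of classes, and n_c the number of samples in class c. The helper name `balanced_sample_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n / (k * n_c), the inverse frequency of its class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[label]) for label in labels]


# Toy labels using the card's encoding: 0=Negative, 1=Positive, 2=Neutral
y_train = [0, 1, 1, 1, 2, 2]
weights = balanced_sample_weights(y_train)
print([round(w, 3) for w in weights])  # [2.0, 0.667, 0.667, 0.667, 1.0, 1.0]
```

Weights like these would typically be passed as `sample_weight` to `XGBClassifier.fit`, so that each class contributes the same total weight to the training loss.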
## Citation

```bibtex
@misc{reddit-sentiment-hybrid,
  title={Reddit Sentiment Analysis - Hybrid Model},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/mahekgheewala/mahek-sentiment}
}
```
## License

MIT License