---
license: mit
tags:
- sentiment-analysis
- text-classification
- multiclass-classification
- sentence-transformers
- xgboost
- reddit
- hybrid-model
language:
- en
metrics:
- accuracy
- f1
pipeline_tag: text-classification
widget:
- text: "I love this product! It's amazing and works perfectly."
  example_title: "Positive Example"
- text: "This is terrible. I hate it so much."
  example_title: "Negative Example"
- text: "The weather is okay today."
  example_title: "Neutral Example"
---

# Reddit Sentiment Analysis - Hybrid Model

🎯 **Test Accuracy: 0.9966**

## Model Description

This hybrid sentiment analysis model combines **Sentence Transformers** for semantic embeddings with **XGBoost** for classification. It was trained on Reddit comments to classify text into three sentiment classes: **Negative**, **Positive**, and **Neutral**.

### Architecture

```
Input Text → SentenceTransformer → Embeddings (768D) → Feature Engineering (Length + Sentiment + POS) → XGBoost → Prediction
```

## Quick Start

```python
import os
import pickle

import nltk
import numpy as np
from huggingface_hub import hf_hub_download, snapshot_download
from sentence_transformers import SentenceTransformer
from textblob import TextBlob

REPO_ID = "USERNAME/sentimental_analysis_updated"

# Download NLTK data. Newer NLTK releases look for the `punkt_tab` and
# `averaged_perceptron_tagger_eng` variants, so fetch both generations.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

# Download the pickled XGBoost pipeline (a single file)
xgb_path = hf_hub_download(repo_id=REPO_ID, filename="xgboost_model.pkl")

# Assumption: `sentence_transformer` is a directory in the repo. hf_hub_download
# fetches single files only, so the directory is pulled with snapshot_download.
repo_dir = snapshot_download(repo_id=REPO_ID, allow_patterns="sentence_transformer/*")
sentence_path = os.path.join(repo_dir, "sentence_transformer")

# Load XGBoost model and label mapping
with open(xgb_path, "rb") as f:
    pipeline_data = pickle.load(f)

xgb_model = pipeline_data["xgboost_model"]
label_names = pipeline_data["label_names"]

# Load SentenceTransformer
sentence_model = SentenceTransformer(sentence_path)


def predict_sentiment(text):
    # Sentence embedding, shape (1, 768)
    embedding = sentence_model.encode([text])

    # Comment length in whitespace-separated tokens, and TextBlob polarity in [-1, 1]
    comment_length = np.array([len(text.split())]).reshape(-1, 1)
    sentiment_polarity = np.array([TextBlob(text).sentiment.polarity]).reshape(-1, 1)

    # POS counts
    try:
        tags = nltk.pos_tag(nltk.word_tokenize(text))
        pos_counts = np.array([[
            sum(1 for _, tag in tags if tag.startswith("J")),  # Adjectives
            sum(1 for _, tag in tags if tag.startswith("N")),  # Nouns
            sum(1 for _, tag in tags if tag.startswith("V")),  # Verbs
        ]])
    except Exception:
        # Fall back to zero counts if tokenization or tagging fails
        pos_counts = np.array([[0, 0, 0]])

    # Combine features in the order: embedding, length, polarity, POS counts
    features = np.hstack([embedding, comment_length, sentiment_polarity, pos_counts])

    # Predict the class and take the highest class probability as confidence
    prediction = int(xgb_model.predict(features)[0])
    confidence = float(xgb_model.predict_proba(features)[0].max())

    return {
        "label": label_names[prediction],
        "confidence": confidence,
        "prediction_id": prediction,
    }


# Example usage
result = predict_sentiment("I love this new phone! It's amazing!")
print(f"Sentiment: {result['label']} (confidence: {result['confidence']:.3f})")
```

## Model Details

- **Base Model**: `paraphrase-mpnet-base-v2`
- **Classifier**: XGBoost with GPU acceleration
- **Features**: 772 dimensions (768 embeddings + 4 engineered)
- **Classes**: 0=Negative, 1=Positive, 2=Neutral
- **Training Data**: Reddit comments
- **Test Accuracy**: 0.9966

## Training Configuration

- **XGBoost Parameters**: n_estimators=300, learning_rate=0.05, max_depth=6
- **Features**: Embeddings + Comment Length + TextBlob Sentiment + POS Counts
- **Class Balancing**: Sample weights for imbalanced data
- **Validation**: Stratified train/val/test split

## Citation

```bibtex
@misc{reddit-sentiment-hybrid,
  title={Reddit Sentiment Analysis - Hybrid Model},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/USERNAME/sentimental_analysis_updated}
}
```

## License

MIT License
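
## Feature Width Check

The Quick Start code stacks a 768-dimensional embedding with one length column, one polarity column, and three POS-count columns, which adds up to 773 columns, whereas Model Details lists 772. The sketch below shows one way to catch a mismatch like this before calling `predict`; it is not part of the released pipeline. `FEATURE_BLOCKS`, `assemble_features`, and `check_width` are hypothetical names, and the block widths are assumptions read off the Quick Start code.

```python
import numpy as np

# Column blocks in the order the Quick Start code stacks them.
# Assumption: these widths are read off that code, not from the saved model.
FEATURE_BLOCKS = (
    ("embedding", 768),
    ("comment_length", 1),
    ("sentiment_polarity", 1),
    ("pos_counts", 3),
)


def assemble_features(blocks):
    """Stack the per-block arrays column-wise after validating each width."""
    parts = []
    for name, width in FEATURE_BLOCKS:
        part = np.asarray(blocks[name], dtype=np.float32)
        if part.ndim != 2 or part.shape[1] != width:
            raise ValueError(f"{name}: expected shape (n, {width}), got {part.shape}")
        parts.append(part)
    return np.hstack(parts)


def check_width(features, expected_width):
    """Raise if the feature matrix is not as wide as the classifier expects."""
    if features.shape[1] != expected_width:
        raise ValueError(
            f"feature matrix has {features.shape[1]} columns, "
            f"but the classifier expects {expected_width}"
        )
    return features


# Placeholder inputs for a single comment (zeros stand in for real values).
dummy = {name: np.zeros((1, width)) for name, width in FEATURE_BLOCKS}
features = assemble_features(dummy)
print(features.shape)  # (1, 773)
```

With the model loaded as in Quick Start, `expected_width` could be taken from `xgb_model.n_features_in_`, assuming the pickled object is an `XGBClassifier` that exposes that attribute.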