	---
	license: mit
	tags:
	- sentiment-analysis
	- text-classification
	- multiclass-classification
	- sentence-transformers
	- xgboost
	- reddit
	- hybrid-model
	language:
	- en
	metrics:
	- accuracy
	- f1
	pipeline_tag: text-classification
	widget:
	- text: "I love this product! It's amazing and works perfectly."
	  example_title: "Positive Example"
	- text: "This is terrible. I hate it so much."
	  example_title: "Negative Example"
	- text: "The weather is okay today."
	  example_title: "Neutral Example"
	---

	# Reddit Sentiment Analysis - Hybrid Model

	🎯 Test Accuracy: 0.9966

	## Model Description

	This hybrid sentiment analysis model combines Sentence Transformers for semantic embeddings with XGBoost for classification. It was trained on Reddit comments for three-class sentiment classification: Negative, Positive, and Neutral.

	### Architecture
	```
	Input Text → SentenceTransformer → Embeddings (768D) →
	Feature Engineering (Length + Sentiment + POS) → XGBoost → Prediction
	```
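The feature-engineering step in the diagram can be sketched without any third-party libraries. This is a minimal illustration that follows the feature order used in the Quick Start code; the helper name `assemble_features`, the placeholder polarity value, and the hand-written POS tags are assumptions for illustration, not output from TextBlob or NLTK.

```python
from typing import List, Sequence, Tuple

EMBEDDING_DIM = 768  # embedding size stated in the architecture diagram


def assemble_features(
    embedding: Sequence[float],
    text: str,
    polarity: float,
    tagged_tokens: Sequence[Tuple[str, str]],
) -> List[float]:
    """Concatenate the embedding with the engineered features.

    `polarity` and `tagged_tokens` are passed in precomputed (hypothetical
    stand-ins for the TextBlob and NLTK outputs) so the sketch stays
    dependency-free.
    """
    word_count = float(len(text.split()))
    adjectives = sum(1 for _, tag in tagged_tokens if tag.startswith("J"))
    nouns = sum(1 for _, tag in tagged_tokens if tag.startswith("N"))
    verbs = sum(1 for _, tag in tagged_tokens if tag.startswith("V"))
    engineered = [word_count, polarity, float(adjectives), float(nouns), float(verbs)]
    return list(embedding) + engineered


# Hand-written example inputs (assumed values, not real model output)
text = "I love this new phone"
tags = [("I", "PRP"), ("love", "VBP"), ("this", "DT"), ("new", "JJ"), ("phone", "NN")]
features = assemble_features([0.0] * EMBEDDING_DIM, text, 0.5, tags)

print(len(features))   # 773
print(features[-5:])   # [5.0, 0.5, 1.0, 1.0, 1.0]
```

With this ordering the vector has 768 + 1 + 1 + 3 = 773 columns. Model Details lists 772, so it is worth comparing against the feature count the loaded XGBoost model expects before relying on either number.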

	## Quick Start

	```python
	import os
	import pickle

	import nltk
	import numpy as np
	from huggingface_hub import hf_hub_download, snapshot_download
	from sentence_transformers import SentenceTransformer
	from textblob import TextBlob

	# Download NLTK data. Newer NLTK releases look for the "punkt_tab" and
	# "averaged_perceptron_tagger_eng" resources, so both naming schemes are requested.
	for resource in ('punkt', 'punkt_tab',
	                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'):
	    nltk.download(resource, quiet=True)

	REPO_ID = "USERNAME/mahek-sentiment"

	# Download the pickled XGBoost pipeline (a single file)
	xgb_path = hf_hub_download(repo_id=REPO_ID, filename="xgboost_model.pkl")

	# hf_hub_download fetches single files only. Assuming "sentence_transformer"
	# is a folder in the repo, fetch it with snapshot_download instead.
	repo_dir = snapshot_download(repo_id=REPO_ID, allow_patterns=["sentence_transformer/*"])
	sentence_path = os.path.join(repo_dir, "sentence_transformer")

	# Load XGBoost model and label names
	with open(xgb_path, 'rb') as f:
	    pipeline_data = pickle.load(f)
	xgb_model = pipeline_data['xgboost_model']
	label_names = pipeline_data['label_names']

	# Load SentenceTransformer
	sentence_model = SentenceTransformer(sentence_path)

	def predict_sentiment(text):
	    # Extract features
	    embedding = sentence_model.encode([text])  # shape: (1, embedding_dim)
	    comment_length = np.array([len(text.split())]).reshape(-1, 1)
	    sentiment_polarity = np.array([TextBlob(text).sentiment.polarity]).reshape(-1, 1)

	    # POS counts; fall back to zeros if tokenizing or tagging fails
	    try:
	        tags = nltk.pos_tag(nltk.word_tokenize(text))
	        pos_counts = np.array([[
	            sum(1 for _, tag in tags if tag.startswith('J')),  # Adjectives
	            sum(1 for _, tag in tags if tag.startswith('N')),  # Nouns
	            sum(1 for _, tag in tags if tag.startswith('V')),  # Verbs
	        ]])
	    except Exception:
	        pos_counts = np.array([[0, 0, 0]])

	    # Combine features in the same order as above
	    features = np.hstack([embedding, comment_length, sentiment_polarity, pos_counts])

	    # Predict the class and take the highest class probability as confidence
	    prediction = int(xgb_model.predict(features)[0])
	    confidence = float(xgb_model.predict_proba(features)[0].max())

	    return {
	        'label': label_names[prediction],
	        'confidence': confidence,
	        'prediction_id': prediction
	    }

	# Example usage
	result = predict_sentiment("I love this new phone! It's amazing!")
	print(f"Sentiment: {result['label']} (confidence: {result['confidence']:.3f})")
	```

	## Model Details

	- Base Model: `paraphrase-mpnet-base-v2`
	- Classifier: XGBoost with GPU acceleration
	- Features: 772 dimensions (768 embeddings + 4 engineered)
	- Classes: 0=Negative, 1=Positive, 2=Neutral
	- Training Data: Reddit comments
	- Test Accuracy: 0.9966
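The class IDs above are easy to misread, since Neutral is 2 rather than sitting between Negative and Positive. The following is a small sketch of turning one row of class probabilities into a labeled result. `decode_prediction` and the probability values are hypothetical, and the mapping is copied from the list above rather than read from the pickled `label_names`.

```python
from typing import Dict, Sequence

# Class IDs as listed in Model Details
LABEL_NAMES = {0: "Negative", 1: "Positive", 2: "Neutral"}


def decode_prediction(probabilities: Sequence[float]) -> Dict[str, object]:
    """Pick the most probable class and report its probability as confidence."""
    prediction_id = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "label": LABEL_NAMES[prediction_id],
        "confidence": float(probabilities[prediction_id]),
        "prediction_id": prediction_id,
    }


# Hypothetical probability row, shaped like one row of predict_proba output
result = decode_prediction([0.05, 0.15, 0.80])
print(result)  # {'label': 'Neutral', 'confidence': 0.8, 'prediction_id': 2}
```

If the `label_names` stored in the pickle disagree with this mapping, the pickled values should take precedence, since they are what the classifier was trained with.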

	## Training Configuration

	- XGBoost Parameters: n_estimators=300, learning_rate=0.05, max_depth=6
	- Features: Embeddings + Comment Length + TextBlob Sentiment + POS Counts
	- Class Balancing: Sample weights for imbalanced data
	- Validation: Stratified train/val/test split
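The class-balancing and splitting steps can be illustrated in plain Python. The card does not state the weighting formula or the split ratios, so the inverse-frequency weights and the 80/10/10 fractions below are assumptions, as are the function names and the toy label distribution.

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple


def balanced_sample_weights(labels: Sequence[int]) -> List[float]:
    """Weight each sample by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


def stratified_split(
    labels: Sequence[int],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, val, test) index lists that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy imbalanced labels: 10 Negative, 20 Positive, 70 Neutral
labels = [0] * 10 + [1] * 20 + [2] * 70
weights = balanced_sample_weights(labels)
print(round(weights[0], 3), round(weights[10], 3), round(weights[30], 3))  # 3.333 1.667 0.476

train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 80 10 10
```

Rarer classes receive larger weights, so each class contributes equally to the total. Weights like these would typically be passed as `sample_weight` when fitting the XGBoost classifier.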

	## Citation

	```bibtex
	@misc{reddit-sentiment-hybrid,
	  title={Reddit Sentiment Analysis - Hybrid Model},
	  year={2025},
	  publisher={Hugging Face},
	  url={https://huggingface.co/USERNAME/mahek-sentiment}
	}
	```

	## License

	MIT License