zoeywwww/cardiffnlp-sentiment-3class-finetuned

@@ -1,199 +1,355 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
 ### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
 ### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
 ## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 ## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
 ## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+datasets:
+- SetFit/amazon_reviews_multi_en
+metrics:
+- accuracy
+pipeline_tag: text-classification
+tags:
+- sentiment-analysis
+- text-classification
+- roberta
+- transformers
+- ecommerce
+- customer-reviews
+- amazon-reviews
+- streamlit
 ---
+# EcoPulse AI Sentiment Classifier
+This model is the fine-tuned sentiment classification model used in **EcoPulse AI**, an e-commerce customer review sentiment classification and voice reporting system. The model classifies Amazon-style customer reviews into three sentiment categories:
+* **Negative**
+* **Neutral**
+* **Positive**
+The model is designed to help e-commerce customer support teams quickly identify customer dissatisfaction, monitor neutral feedback, and summarize positive customer experiences.
+---
 ## Model Details
 ### Model Description
+This model is a fine-tuned version of `cardiffnlp/twitter-roberta-base-sentiment-latest`. The base model was selected after comparing three Hugging Face transformer models for customer review sentiment classification. It achieved the strongest baseline accuracy among the tested candidates and was then fine-tuned on a balanced Amazon review dataset.
+The model is used as **Pipeline 1** in the EcoPulse AI application. It takes raw customer review text as input and outputs a sentiment label with a confidence score. The Streamlit application then aggregates the predictions into sentiment distribution summaries, business recommendations, written reports, and audio briefings.
+* **Developed by:** Junlei Wang and Zhuoyuan Zhang
+* **Project:** EcoPulse AI
+* **Model type:** RoBERTa-based sequence classification model
+* **Language:** English
+* **Task:** 3-class sentiment classification
+* **Fine-tuned from:** `cardiffnlp/twitter-roberta-base-sentiment-latest`
+* **Final model repository:** `zoeywwww/cardiffnlp-sentiment-3class-finetuned`
+---
+## Model Sources
+* **Base model:** `cardiffnlp/twitter-roberta-base-sentiment-latest`
+* **Fine-tuned model:** `zoeywwww/cardiffnlp-sentiment-3class-finetuned`
+* **Dataset:** `SetFit/amazon_reviews_multi_en`
+* **Demo application:** https://group10finalproject-ee3nfmeyomxalcieln8f8a.streamlit.app/
+* **GitHub repository:** https://github.com/zoeywang524-beep/Group10_Final_project/blob/main/group10app.py
+---
+## Intended Use
 ### Direct Use
+This model can be used directly for English e-commerce customer review sentiment classification. Given a customer review, it predicts one of three labels:
+| Label ID | Label    |
+| -------: | -------- |
+|        0 | Negative |
+|        1 | Neutral  |
+|        2 | Positive |
+Example use cases include:
+* Classifying Amazon-style product reviews
+* Monitoring customer satisfaction
+* Identifying negative feedback for customer service escalation
+* Supporting review summarization dashboards
+* Generating structured sentiment inputs for business reports
+### Downstream Use
+This model is used inside the EcoPulse AI Streamlit Cloud application. In the deployed application, the model performs review-level sentiment classification. The app then uses the predictions to calculate sentiment distribution, generate support recommendations, produce a written customer sentiment report, and trigger a text-to-speech pipeline for an audio dashboard briefing.
+The full system follows this workflow:
+```text
+Customer Review Text
+        ↓
+Fine-Tuned RoBERTa Sentiment Classifier
+        ↓
+Negative / Neutral / Positive Prediction + Confidence Score
+        ↓
+Streamlit Business Logic Layer
+        ↓
+Sentiment Summary + Support Recommendation + Written Report
+        ↓
+Text-to-Speech Pipeline
+        ↓
+Audio Dashboard Briefing
+```
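The aggregation step in the business logic layer could look roughly like the sketch below. It is not taken from the EcoPulse AI source code: the function name `summarize_sentiment`, the input shape (a list of `(label, confidence)` pairs), and the escalation rule with its 0.4 threshold are all assumptions made for illustration.

```python
from collections import Counter

LABELS = ("Negative", "Neutral", "Positive")


def summarize_sentiment(predictions, negative_alert_share=0.4):
    """Aggregate (label, confidence) pairs into a sentiment distribution.

    `negative_alert_share` is a hypothetical threshold, not a project value.
    """
    counts = Counter(label for label, _ in predictions)
    total = len(predictions)
    # Guard against an empty batch so the shares are always defined.
    shares = {label: (counts[label] / total if total else 0.0) for label in LABELS}
    return {
        "total": total,
        "counts": {label: counts[label] for label in LABELS},
        "shares": shares,
        "escalate": shares["Negative"] >= negative_alert_share,
    }


# Made-up predictions, used only to exercise the function.
sample = [("Negative", 0.91), ("Positive", 0.88), ("Neutral", 0.55), ("Negative", 0.73)]
summary = summarize_sentiment(sample)
print(summary["counts"])
print(summary["escalate"])
```

A summary dictionary of this kind is one possible input for the report and text-to-speech stages further down the workflow.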
 ### Out-of-Scope Use
+This model is not intended for high-stakes decision-making without human review. It should not be used as the sole basis for customer compensation, employee evaluation, legal judgment, or automated enforcement decisions.
+The model may not perform well on:
+* Sarcastic reviews
+* Ambiguous or mixed-emotion reviews
+* Very short reviews without enough context
+* Non-English text
+* Highly domain-specific product terminology
+* Reviews that require external context to interpret correctly
+---
 ## Bias, Risks, and Limitations
+The model was fine-tuned on Amazon-style English review data. As a result, its performance is most relevant to e-commerce customer review classification and may not generalize equally well to other domains such as healthcare, finance, legal complaints, or social media conversations.
+A known limitation is sarcasm detection. For example, a sentence such as:
+> "Brilliant delivery, my package arrived completely crushed."
+is difficult to classify because the word “Brilliant” is positive in isolation, while the sentence as a whole is negative. In the project’s manual Streamlit application test, the only misclassification occurred in a sarcastic review of this type.
+Users should treat the model as a **first-line decision-support tool**, not a replacement for human judgment.
+---
+## Recommendations
+Users should review low-confidence predictions and ambiguous cases manually. For business use, the model is best applied as an initial screening tool that helps support teams prioritize reviews for further investigation.
+Recommended use:
+* Use the model to flag likely negative reviews.
+* Review sarcastic, mixed, or unclear cases manually.
+* Combine model predictions with business rules and human oversight.
+* Periodically update or fine-tune the model with newer customer review data.
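One way to apply these recommendations is to route each prediction into a queue based on its label and confidence. The sketch below is illustrative only: the function name `triage`, the queue names, and the 0.6 confidence cut-off are assumptions, and a suitable cut-off would need to be chosen on validation data.

```python
def triage(predictions, min_confidence=0.6):
    """Split (text, label, confidence) records into handling queues.

    Low-confidence predictions go to manual review first; confident
    negative predictions are escalated; everything else is handled
    automatically. The 0.6 default is a hypothetical value.
    """
    queues = {"manual_review": [], "escalate": [], "auto": []}
    for text, label, confidence in predictions:
        if confidence < min_confidence:
            queues["manual_review"].append(text)
        elif label == "Negative":
            queues["escalate"].append(text)
        else:
            queues["auto"].append(text)
    return queues


# Made-up predictions, used only to exercise the function.
batch = [
    ("Arrived broken, no reply from support.", "Negative", 0.93),
    ("Okay I guess.", "Neutral", 0.48),
    ("Does what it says.", "Positive", 0.88),
]
queues = triage(batch)
print({name: len(items) for name, items in queues.items()})
```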
+---
 ## How to Get Started with the Model
+You can use the model with the Hugging Face `transformers` library.
+```python
+import numpy as np
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
+model_name = "zoeywwww/cardiffnlp-sentiment-3class-finetuned"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+model.eval()  # make inference mode explicit (dropout disabled)
+# Label IDs as listed in the "Direct Use" table.
+id2label = {
+    0: "Negative",
+    1: "Neutral",
+    2: "Positive"
+}
+text = "The product arrived damaged and customer service did not respond."
+# Tokenize a single review; inputs longer than 128 tokens are truncated.
+inputs = tokenizer(
+    text,
+    return_tensors="pt",
+    truncation=True,
+    padding=True,
+    max_length=128
+)
+# Run a forward pass without tracking gradients.
+with torch.no_grad():
+    outputs = model(**inputs)
+# Convert logits to class probabilities and pick the most likely class.
+probabilities = torch.softmax(outputs.logits, dim=-1)[0].numpy()
+predicted_id = int(np.argmax(probabilities))
+predicted_label = id2label[predicted_id]
+confidence = float(probabilities[predicted_id])
+print("Predicted sentiment:", predicted_label)
+print("Confidence:", round(confidence, 4))
+---
+## Training Details
+### Training Data
+The model was fine-tuned using the `SetFit/amazon_reviews_multi_en` dataset from Hugging Face. This dataset contains English Amazon review text and original star-rating labels.
+The original 5-star labels were mapped into three sentiment classes:
+| Original Rating Label | Star Rating Meaning | New Sentiment Label |
+| --------------------: | ------------------- | ------------------- |
+|                     0 | 1-star              | Negative            |
+|                     1 | 2-star              | Negative            |
+|                     2 | 3-star              | Neutral             |
+|                     3 | 4-star              | Positive            |
+|                     4 | 5-star              | Positive            |
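The mapping in the table can be written as a small lookup. This is a minimal sketch rather than the project's actual preprocessing code; it assumes the rating is stored under the key `"label"`, and the function name `to_sentiment` and the added field names are hypothetical.

```python
# Original 0-4 rating label -> 3-class sentiment ID, following the table above.
STAR_TO_SENTIMENT = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def to_sentiment(example):
    """Return a copy of `example` with 3-class sentiment fields added.

    Assumes the rating lives under the key "label" (an assumption here).
    """
    sentiment_id = STAR_TO_SENTIMENT[example["label"]]
    return {
        **example,
        "sentiment": sentiment_id,
        "sentiment_text": ID2LABEL[sentiment_id],
    }


row = {"text": "Works fine, nothing special.", "label": 2}
print(to_sentiment(row)["sentiment_text"])  # Neutral
```

With Hugging Face Datasets, a function of this shape could be applied to every row through `Dataset.map`.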
+### Dataset Splits Used in the Project
+| Split                       | Number of Samples | Class Balance   | Purpose                    |
+| --------------------------- | ----------------: | --------------- | -------------------------- |
+| Preliminary training sample |             9,000 | 3,000 per class | Candidate model comparison |
+| Fine-tuning training set    |             6,000 | 2,000 per class | Fine-tuning selected model |
+| Validation set              |             1,500 | 500 per class   | Fine-tuning monitoring     |
+| Test set                    |             1,500 | 500 per class   | Final evaluation           |
+The fine-tuning set was balanced across the three sentiment classes to reduce class imbalance effects.
+### Preprocessing
+The preprocessing steps included:
+1. Loading English Amazon review data.
+2. Mapping 5-star labels into 3 sentiment labels.
+3. Creating balanced negative, neutral, and positive samples.
+4. Tokenizing review text using the tokenizer from `cardiffnlp/twitter-roberta-base-sentiment-latest`.
+5. Truncating and padding input text to support transformer-based classification.
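Step 3 above, creating balanced samples, can be sketched in plain Python as follows. The function name `balanced_sample`, the `"label"` key, and the fixed seed are assumptions for illustration; the project's own sampling code may differ.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(examples, per_class, seed=42):
    """Draw `per_class` examples from each class without replacement.

    Each example is a dict with a "label" key holding the 3-class
    sentiment ID (an assumed layout).
    """
    by_class = defaultdict(list)
    for example in examples:
        by_class[example["label"]].append(example)

    rng = random.Random(seed)  # fixed seed for a reproducible draw
    sample = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label} has only {len(pool)} examples")
        sample.extend(rng.sample(pool, per_class))

    rng.shuffle(sample)  # avoid presenting the classes in blocks
    return sample


# Toy data: 30 reviews spread evenly over three classes.
toy = [{"text": f"review {i}", "label": i % 3} for i in range(30)]
subset = balanced_sample(toy, per_class=5)
print(Counter(example["label"] for example in subset))
```

Called with `per_class=2000`, the same function would yield a 6,000-example set with the class balance shown in the splits table.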
+---
+## Training Procedure
+The base model `cardiffnlp/twitter-roberta-base-sentiment-latest` was selected after comparing three candidate transformer models:
+| Candidate Model                                    | Baseline Accuracy |   Runtime |
+| -------------------------------------------------- | ----------------: | --------: |
+| `cardiffnlp/twitter-roberta-base-sentiment-latest` |            0.6228 | 60.71 sec |
+| `distilbert-base-uncased`                          |            0.3287 | 32.17 sec |
+| `roberta-base`                                     |            0.3306 | 61.68 sec |
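+The Cardiff RoBERTa model achieved the highest baseline accuracy and was selected for fine-tuning. The other two candidates are general-purpose pretrained checkpoints without a sentiment classification head, which is consistent with their accuracy of about 0.33, the chance level for three balanced classes.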
+The selected model was fine-tuned for 1, 2, and 3 epochs. The Epoch 1 model was selected for deployment because it offered the best balance between validation loss, test performance, generalization stability, and runtime.
+### Fine-Tuning Results
+| Epoch | Validation Loss | Validation Accuracy | Train Accuracy | Test Accuracy | Test Runtime |
+| ----: | --------------: | ------------------: | -------------: | ------------: | -----------: |
+|     1 |          0.6777 |              0.7127 |         0.7848 |        0.7040 |    10.66 sec |
+|     2 |          0.7371 |              0.7167 |         0.8613 |        0.7093 |    10.98 sec |
+|     3 |          0.9523 |              0.7140 |         0.9205 |        0.7093 |    10.78 sec |
+Although Epoch 2 and Epoch 3 achieved slightly higher test accuracy (0.7093 versus 0.7040, roughly 8 more correct predictions out of 1,500), the improvement was small. Training accuracy rose from 0.7848 at Epoch 1 to 0.9205 at Epoch 3 while test accuracy remained almost unchanged, and validation loss increased from 0.6777 to 0.9523, suggesting a higher risk of overfitting in later epochs.
+Therefore, the Epoch 1 model was selected for deployment.
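The comparison can be reproduced from the numbers in the fine-tuning table. The sketch below picks the epoch with the lowest validation loss and reports the gap between train and test accuracy. Using validation loss as the single selection rule is a simplification assumed here, since the project weighed several factors together.

```python
# Values copied from the "Fine-Tuning Results" table.
RESULTS = [
    {"epoch": 1, "val_loss": 0.6777, "val_acc": 0.7127, "train_acc": 0.7848, "test_acc": 0.7040},
    {"epoch": 2, "val_loss": 0.7371, "val_acc": 0.7167, "train_acc": 0.8613, "test_acc": 0.7093},
    {"epoch": 3, "val_loss": 0.9523, "val_acc": 0.7140, "train_acc": 0.9205, "test_acc": 0.7093},
]

# Assumed selection rule: lowest validation loss wins.
best = min(RESULTS, key=lambda row: row["val_loss"])

# A widening train/test gap is a common sign of overfitting.
gaps = {row["epoch"]: round(row["train_acc"] - row["test_acc"], 4) for row in RESULTS}

print("Selected epoch:", best["epoch"])
for epoch, gap in gaps.items():
    print(f"Epoch {epoch}: train/test accuracy gap = {gap:.4f}")
```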
+---
+## Evaluation
+### Testing Data
+Final evaluation was conducted on an untouched balanced test set of 1,500 Amazon-style reviews:
+* 500 negative reviews
+* 500 neutral reviews
+* 500 positive reviews
+### Metrics
+The main evaluation metric was accuracy. Runtime was also recorded during model comparison and testing to assess deployment feasibility.
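For reference, accuracy here means the share of reviews whose predicted label matches the reference label. A minimal sketch, with a hypothetical helper name and toy labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (0 = Negative, 1 = Neutral, 2 = Positive), not project data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(accuracy(y_true, y_pred), 4))
```

On a 1,500-review test set, 1,056 correct predictions correspond to an accuracy of 0.7040.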
+### Results
+The deployed fine-tuned model achieved:
+| Metric              |     Value |
+| ------------------- | --------: |
+| Test Accuracy       |    0.7040 |
+| Test Runtime        | 10.66 sec |
+| Validation Loss     |    0.6777 |
+| Validation Accuracy |    0.7127 |
+### Streamlit Application Test
+The deployed Streamlit Cloud application was manually tested using 10 unseen e-commerce customer review samples. The app correctly classified 9 out of 10 samples.
+| Application Test Setting                      | Test Sample Size | Accuracy |
+| --------------------------------------------- | ---------------: | -------: |
+| Streamlit Cloud sentiment classification test |               10 |      90% |
+The only misclassification occurred in a sarcastic review, showing a known limitation of sentiment models when handling sarcasm.
+---
+## Technical Specifications
+### Model Architecture and Objective
+This model uses a RoBERTa-based transformer architecture for sequence classification. The input review text is tokenized and passed into the transformer encoder. A classification head maps the encoded representation into three sentiment categories. A softmax layer is used to produce class probabilities.
+Simplified architecture:
+```text
+Review Text
+    ↓
+Tokenizer
+    ↓
+RoBERTa Transformer Encoder
+    ↓
+Classification Head
+    ↓
+Softmax Probabilities
+    ↓
+Negative / Neutral / Positive
+```
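The last two stages of the diagram, turning classification-head outputs into probabilities and then into a label, can be illustrated without any deep learning library. The logits below are made-up values, not outputs of this model, and the helper name `softmax` is local to this sketch.

```python
import math

ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 0.5, -1.0]  # hypothetical classification-head outputs
probabilities = softmax(logits)
predicted_id = max(range(len(probabilities)), key=probabilities.__getitem__)

print("Predicted sentiment:", ID2LABEL[predicted_id])
print("Confidence:", round(probabilities[predicted_id], 4))
```

The highest probability is what the application reports as the confidence score.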
+### Software
+The project used:
+* Python
+* PyTorch
+* Hugging Face Transformers
+* Hugging Face Datasets
+* Hugging Face Hub
+* Google Colab
+* Streamlit
+### Compute Infrastructure
+Fine-tuning and experiments were conducted in Google Colab. Exact hardware may vary depending on the assigned Colab runtime.
+---
+## Environmental Impact
+Carbon emissions were not formally measured for this course project. Fine-tuning was conducted in Google Colab, and training time was kept short by using a balanced fine-tuning set of 6,000 reviews and at most three epochs.
+---
+## Citation
+If you use this model, please cite the base model and dataset sources:
+* Base model: `cardiffnlp/twitter-roberta-base-sentiment-latest`
+* Dataset: `SetFit/amazon_reviews_multi_en`
+---
+## Model Card Authors
+* Junlei Wang
+* Zhuoyuan Zhang
+---
 ## Model Card Contact
+For questions about this course project, please refer to the EcoPulse AI project report, GitHub repository, and Streamlit application.