How to use tsessk/llm-course-hw2-reward-model with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="tsessk/llm-course-hw2-reward-model")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("tsessk/llm-course-hw2-reward-model")
model = AutoModelForSequenceClassification.from_pretrained("tsessk/llm-course-hw2-reward-model")
metadata
base_model: HuggingFaceTB/SmolLM-135M-Instruct
datasets: HumanLLMs/Human-Like-DPO-Dataset
library_name: transformers
model_name: llm-course-hw2-reward-model
tags:
- generated_from_trainer
- trl
- reward-trainer
Model Card for llm-course-hw2-reward-model
This is a reward model fine-tuned from HuggingFaceTB/SmolLM-135M-Instruct on the HumanLLMs/Human-Like-DPO-Dataset.
It was trained with TRL to score and rank responses according to human preferences, and it supplies the reward signal in RLHF (Reinforcement Learning from Human Feedback) training of models such as SmolLM-135M-PPO.
Overview
- Base Model: SmolLM-135M-Instruct
- Fine-Tuning Dataset: HumanLLMs/Human-Like-DPO-Dataset
- Objective: Learn to assign higher scores to more engaging, structured, and emotional responses.
- Use Case: Used in PPO-based RLHF training to reinforce human-like response quality.
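As a rough illustration of the objective above, the sketch below ranks candidate responses by the scalar score the reward model assigns. It assumes the checkpoint exposes a single-logit sequence-classification head (a common setup for TRL reward models, but not confirmed by this card) and that prompt and response are joined with a plain newline; training may have used the chat template instead. The helper names `make_score_fn` and `rank_by_reward` are hypothetical.

```python
from typing import Callable, List, Tuple

MODEL_ID = "tsessk/llm-course-hw2-reward-model"


def make_score_fn(model_id: str = MODEL_ID) -> Callable[[str], float]:
    """Build a text -> reward function (hypothetical helper).

    Heavy imports stay inside the function so the ranking logic below
    can be used and tested without loading the model.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    def score(text: str) -> float:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Assumption: the first (ideally only) logit is the reward.
        return logits[0, 0].item()

    return score


def rank_by_reward(
    prompt: str,
    responses: List[str],
    score_fn: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Return (score, response) pairs sorted from highest to lowest score."""
    # Assumption: prompt and response are concatenated with a newline.
    scored = [(score_fn(f"{prompt}\n{response}"), response) for response in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With the real model, `rank_by_reward(prompt, candidates, make_score_fn())` would be expected to place the more engaging candidate first; any callable that maps text to a float can stand in for `score_fn` when experimenting.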
Training Method
- The model was fine-tuned on pairwise preference comparisons:
  - Each sample contains a chosen (preferred) response and a rejected response.
  - The model learns to assign a higher reward to the chosen response than to the rejected one.
- The resulting reward model was then used in PPO fine-tuning to optimize response generation.
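The pairwise objective described above is commonly written as a Bradley-Terry loss, -log(sigmoid(r_chosen - r_rejected)). The card does not state the exact loss used, so the following is only a minimal sketch under that assumption, in plain Python; the function names are hypothetical and any reward values passed in are illustrative, not outputs of this model.

```python
import math
from typing import Sequence


def pairwise_reward_loss(
    chosen_rewards: Sequence[float],
    rejected_rewards: Sequence[float],
) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over all pairs."""
    if not chosen_rewards or len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) == log(1 + exp(-m)); written in a form that
        # does not overflow for large |m|.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)


def pairwise_accuracy(
    chosen_rewards: Sequence[float],
    rejected_rewards: Sequence[float],
) -> float:
    """Fraction of pairs where the chosen response gets the higher reward."""
    if not chosen_rewards or len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("need two non-empty sequences of equal length")
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)
```

Under this formulation, a loss of log 2 (about 0.693) means the model cannot separate the two responses, and the loss falls toward zero as the chosen reward pulls ahead of the rejected one.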