# EUMORA — Revised Product Plan

## What Is EUMORA?

EUMORA is an emotion-aware music recommendation system. You describe how you feel in plain text and it recommends songs that match that emotional state. No popularity contests, no filter bubbles, no cold-start delays.

The original architecture envisioned a full multimodal pipeline built from scratch — our own music database, VGGish audio analysis, ALBERT/BERT lyric transformers, and an LSTM fusion engine. We have since pivoted to a leaner, production-grade architecture that ships today while preserving every multimodal avenue for future expansion.

---

## Phase 1 — Where We Started

**Objective:** Prove the core premise. Given free-text input, classify the user's emotion and return music that matches it.

| Component | Original Plan | What We Actually Built |
|---|---|---|
| Emotion model | Fine-tuned DistilBERT, 7-class | Fine-tuned **DeBERTa-v3-base**, **8-class** (+ sarcasm) |
| Training data | Generic emotion dataset | Combined dataset — 59 k+ samples across GoEmotions, tweet corpora, sarcasm corpus |
| Music retrieval | Own database + LSTM matching | **Spotify Search API** with emotion-keyed queries |
| Deployment | Research prototype | **FastAPI** on **HF Spaces** (Docker), frontend on **Vercel** |
| Popularity bias | Penalise via score | Max-popularity filter (≤ 65) + anti-chart-topper logic |
| Match score | Valence-Arousal distance | **Search-rank position** (Spotify relevance ordering, 95 % → 50 %) |
| Cold start | N/A | Zero — no listening history required |

**Result:** A live, working product at [eumora.vercel.app](https://eumora.vercel.app).

---

## Phase 2 — Where We Are Right Now

### Architecture

```
User text input
      │
      ▼
┌─────────────────────────────────────┐
│  DeBERTa-v3-base (fine-tuned)       │  ← Emotion classifier
│  8 classes: sadness · joy · love    │    59 k+ training samples
│  anger · fear · surprise · neutral  │    Sarcasm prior calibration
│  sarcasm                            │    Confidence + probability mix
└─────────────────┬───────────────────┘
                  │  predict_result
                  ▼
┌─────────────────────────────────────┐
│  EmotionFeatureMapper               │  ← Emotion → Spotify params
│  Valence / Energy / Danceability    │    Blended across top-2 emotions
│  Tempo / Mode / Seed genres         │    Confidence-scaled windows
└─────────────────┬───────────────────┘
                  │  query + params
                  ▼
┌─────────────────────────────────────┐
│  Spotify Search API (/v1/search)    │  ← 4 term-sets × 4 offsets
│  Randomised mood keywords           │    = 16 result pools per emotion
│  Random page offset (0/10/20/30)    │    limit = 10 (Spotify cap)
└─────────────────┬───────────────────┘
                  │  raw tracks
                  ▼
┌─────────────────────────────────────┐
│  Score + Filter                     │
│  • Max popularity ≤ 65              │  Anti-chart-topper filter
│  • Audio features (graceful 403)    │  Spotify deprecated for std apps
│  • Search-rank scoring 95 % → 50 % │  Highest relevance first
│  • Diversity filter (≤2/artist)     │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  Lyrics Source Pinning              │  ← If input looks like lyrics
│  1. Musixmatch lyrics search        │    (optional, key-gated)
│  2. Title-word heuristic fallback   │    Pins source song at #1 / 100 %
└─────────────────┬───────────────────┘
                  │
                  ▼
            Ranked track list
```
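
The mapper stage in the diagram can be illustrated with a minimal sketch of top-2 blending and confidence-scaled windows. The values in `EMOTION_TARGETS`, the helper names `blend_targets` and `target_window`, and the window widths are hypothetical placeholders for illustration, not the values the deployed `EmotionFeatureMapper` uses.

```python
from typing import Dict, Tuple

# Hypothetical per-emotion targets; placeholder values only.
EMOTION_TARGETS: Dict[str, Dict[str, float]] = {
    "sadness": {"valence": 0.20, "energy": 0.30},
    "joy": {"valence": 0.85, "energy": 0.75},
    "anger": {"valence": 0.25, "energy": 0.90},
    "neutral": {"valence": 0.50, "energy": 0.50},
}


def blend_targets(probs: Dict[str, float]) -> Dict[str, float]:
    """Blend feature targets across the two most probable emotions."""
    top2 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
    total = sum(p for _, p in top2)
    blended = {"valence": 0.0, "energy": 0.0}
    for emotion, p in top2:
        weight = p / total  # renormalise so the two weights sum to 1
        for feature, value in EMOTION_TARGETS[emotion].items():
            blended[feature] += weight * value
    return blended


def target_window(center: float, confidence: float) -> Tuple[float, float]:
    """Return a (min, max) range that narrows as confidence rises."""
    half_width = 0.10 + 0.20 * (1.0 - confidence)  # assumed widths
    return max(0.0, center - half_width), min(1.0, center + half_width)


probs = {"sadness": 0.6, "neutral": 0.2, "joy": 0.15, "anger": 0.05}
targets = blend_targets(probs)
print(targets, target_window(targets["valence"], confidence=0.6))
```

A low-confidence prediction widens the window, so an ambiguous input draws from a broader range of tracks instead of committing to a narrow target.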

### Current Capabilities

- **Emotion classification** — 8 emotions, sarcasm-calibrated, ~95 % accuracy on test set
- **Music recommendations** — live Spotify tracks, varied per query, anti-popularity biased
- **Lyrics detection** — heuristic (always on) + Musixmatch (when key is set)
- **Match scoring** — search-rank-based, varied 95 %→50 %, highest first
- **Live deployment** — HF Spaces (API) + Vercel (frontend), CORS locked, secrets managed
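
As a rough illustration of the filtering and match-scoring capabilities listed above, the sketch below applies the popularity cap and the per-artist limit, then assigns scores from 95 down to 50 by position. The track keys (`name`, `artist`, `popularity`), the linear spacing of scores, and the choice to score after filtering are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List

MAX_POPULARITY = 65   # from the plan
MAX_PER_ARTIST = 2    # from the plan
TOP_SCORE, BOTTOM_SCORE = 95.0, 50.0


def score_and_filter(tracks: List[Dict]) -> List[Dict]:
    """Filter by popularity and artist diversity, then score by rank.

    `tracks` is assumed to arrive in Spotify relevance order.
    """
    per_artist: Counter = Counter()
    kept = []
    for track in tracks:
        if track["popularity"] > MAX_POPULARITY:
            continue  # anti-chart-topper filter
        if per_artist[track["artist"]] >= MAX_PER_ARTIST:
            continue  # diversity filter
        per_artist[track["artist"]] += 1
        kept.append(track)

    # Assumption: scores are evenly spaced, first kept track 95, last 50.
    steps = max(len(kept) - 1, 1)
    step_size = (TOP_SCORE - BOTTOM_SCORE) / steps
    return [
        {**t, "match_score": round(TOP_SCORE - i * step_size, 1)}
        for i, t in enumerate(kept)
    ]
```

Because the score depends only on position, it reflects Spotify's relevance ordering, not any measured emotional fit, which is the limitation Phase 3 is meant to address.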

### Known Constraints (Spotify API, Nov 2024 Policy)

| Endpoint | Status | Workaround |
|---|---|---|
| `/v1/recommendations` | 404 — deprecated without Extended Access | Replaced with `/v1/search` |
| `/v1/audio-features` | 403 — blocked without Extended Access | Graceful degradation (search-rank scoring) |
| `/v1/search` limit | Max 10 results per request | Randomised offset to sample wider pool |
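
The randomised-offset workaround can be sketched as follows. The keyword strings in `TERM_SETS` and the helper name `build_search_params` are hypothetical; only the 4 × 4 pool structure, the offsets, and the limit come from this plan. The returned dictionary uses the standard `/v1/search` query parameters (`q`, `type`, `limit`, `offset`).

```python
import random
from typing import Dict, List

# Hypothetical keyword sets; the plan specifies 4 term sets per emotion
# but not their contents.
TERM_SETS: Dict[str, List[str]] = {
    "sadness": [
        "melancholy ballad",
        "heartbreak acoustic",
        "sad indie",
        "rainy day piano",
    ],
}
OFFSETS = [0, 10, 20, 30]  # from the plan
LIMIT = 10                 # per-request cap noted in the table above


def build_search_params(emotion: str, rng: random.Random) -> Dict[str, object]:
    """Pick one of the 4 x 4 = 16 (term set, offset) pools at random."""
    return {
        "q": rng.choice(TERM_SETS[emotion]),
        "type": "track",
        "limit": LIMIT,
        "offset": rng.choice(OFFSETS),
    }


print(build_search_params("sadness", random.Random()))
```

Passing the random generator in explicitly keeps the selection reproducible in tests while staying varied in production.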

---

## Phase 3 — Future Implementation Scope (Multimodal Expansion)

The pivot to Spotify freed us from building a music database. The multimodal analysis layers from the original architecture are still the roadmap — they now slot in as enrichment on top of Spotify tracks rather than as a replacement for them.

### 3.1 Lyrical Semantics Module

**What:** For each recommended Spotify track, fetch the lyrics and run a transformer (ALBERT or DeBERTa) to extract lyrical emotion, themes, and mood.

**Data source:** Musixmatch API (`track.lyrics.get`) — same key already planned for lyrics detection.

**Output:** A lyrical emotion vector per track that can be fused with the user's emotional state.

**Impact:** Replace the current search-rank proxy score with a genuine semantic match score. A song about heartbreak should rank higher than one that just happens to appear in a "sad" search, regardless of Spotify's relevance ordering.
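
A minimal sketch of what such a semantic re-rank could look like, assuming the lyric transformer emits a probability vector over the same 8 emotion classes as the user classifier. The function names and the example vectors are illustrative, not output from a real model.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Class order assumed to match the classifier's 8 emotions.
EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise", "neutral", "sarcasm"]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_by_lyrics(
    user_vec: Sequence[float], track_vecs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Order tracks by similarity between user and lyrical emotion vectors."""
    scored = [(name, cosine(user_vec, vec)) for name, vec in track_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


user = [0.8, 0.0, 0.1, 0.0, 0.05, 0.0, 0.05, 0.0]  # mostly sadness
tracks = {
    # Listed first, as if Spotify ranked it higher for a "sad" query.
    "upbeat_hit": [0.05, 0.8, 0.1, 0.0, 0.0, 0.05, 0.0, 0.0],
    "heartbreak_ballad": [0.7, 0.0, 0.2, 0.0, 0.05, 0.0, 0.05, 0.0],
}
print(rerank_by_lyrics(user, tracks))
```

Here the heartbreak track moves to the top even though the search returned it second, which is the behaviour the impact statement describes.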

### 3.2 Spectrogram Analysis Module (Raw Audio CNN)

**What:** For each track, download the 30-second preview clip from Spotify (`preview_url`) and convert it into a **mel spectrogram** — a 2D time-frequency image of the audio signal. Feed that image through **VGGish** (Google's audio CNN, pre-trained on AudioSet) to extract a 128-dimensional acoustic embedding that captures timbre, texture, rhythm density, and harmonic content.

**Pipeline:**
```
preview_url (MP3, 30 s)
      │
      ▼
  Decode audio → resample to 16 kHz mono
      │
      ▼
  Log-mel spectrogram (64 mel bins, 25 ms window, 10 ms hop)
      │
      ▼
  VGGish CNN → one 128-d embedding per 0.96 s patch, pooled per track
      │
      ▼
  Acoustic texture vector (dark/bright, dense/sparse, smooth/aggressive)
```

**What it captures that structured features miss:**
- Timbral texture (why a minor-key piano ballad feels different from a minor-key metal track even with the same BPM and valence)
- Spectral brightness / darkness (correlated with perceived emotional tone)
- Onset density (busyness, complexity)
- Harmonic vs. percussive energy ratio

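**Data source:** Spotify `preview_url` (30 s MP3, no auth required), a field of the track objects we already return. The field is nullable, and Spotify's Nov 2024 API changes restricted preview URLs for new apps, so the pipeline should skip or separately source tracks where it is `null`.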

**Output:** A 128-d embedding per track, projected down to a mood-relevant subspace.
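
To make the per-track output concrete, the sketch below works through the framing arithmetic of the reference VGGish setup (16 kHz input, 25 ms window, 10 ms hop, non-overlapping 0.96 s patches) and then averages the per-patch embeddings. Mean pooling is an assumption here; the plan does not say how patch embeddings are combined.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000      # Hz, VGGish input rate
WINDOW_S, HOP_S = 0.025, 0.010
FRAMES_PER_PATCH = 96     # 96 frames x 10 ms hop = 0.96 s per patch


def patch_count(duration_s: float) -> int:
    """Number of non-overlapping 0.96 s patches in a clip."""
    n_samples = int(duration_s * SAMPLE_RATE)
    window = int(round(WINDOW_S * SAMPLE_RATE))  # 400 samples
    hop = int(round(HOP_S * SAMPLE_RATE))        # 160 samples
    if n_samples < window:
        return 0
    n_frames = 1 + (n_samples - window) // hop
    return n_frames // FRAMES_PER_PATCH


def mean_pool(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-patch embeddings into one track-level vector."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


print(patch_count(30.0))  # 31 patches for a 30 s preview
```

A 30-second preview therefore yields 31 embeddings of 128 dimensions each, which are reduced to a single vector before any projection to a mood subspace.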

### 3.3 Structured Audio Features Module

**What:** Recover Spotify's high-level audio descriptors per track: **valence, energy, danceability, tempo, key, mode, acousticness, speechiness, instrumentalness**. These are human-interpretable scalars derived from the raw signal.

**Current blocker:** Spotify `/v1/audio-features` returns 403 without Extended Access (deprecated Nov 2024).

**Resolution path:**
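- **Option A (fastest):** Apply for Spotify Extended Access (free, ~2-week review), which is expected to unlock both `/audio-features` and `/recommendations`. Spotify's Nov 2024 announcement only guarantees continued access for apps that already had extended mode, so confirm that a new approval restores these endpoints before depending on it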
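- **Option B (no dependency):** AcousticBrainz open dataset (~2M tracks, pre-computed features). It is a static dump (the project stopped accepting submissions in 2022) keyed by MusicBrainz recording ID, so Spotify tracks need an ID mapping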
- **Option C (compute ourselves):** Run **Essentia** (open-source, from the Music Technology Group at Universitat Pompeu Fabra) on preview URLs to extract tempo, key, mode, spectral centroid, and loudness directly

**Output:** Valence / Energy / Danceability / Tempo / Mode scalars per track — the structured side of the Valence-Arousal grid.
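
Whichever option is chosen, the scoring stage has to tolerate a missing feature source. The sketch below shows one way to express the current graceful-degradation behaviour. The `get_audio_features` helper and the injected `fetch` callable are hypothetical; `fetch` stands in for whatever HTTP client the backend uses and is assumed to return a `(status_code, json_payload)` pair.

```python
from typing import Callable, Dict, Optional, Tuple

FeatureFetcher = Callable[[str], Tuple[int, Dict[str, float]]]

WANTED = ("valence", "energy", "danceability", "tempo", "mode")


def get_audio_features(
    track_id: str, fetch: FeatureFetcher
) -> Optional[Dict[str, float]]:
    """Return structured features, or None when the source refuses."""
    status, payload = fetch(track_id)
    if status == 200:
        return {key: payload[key] for key in WANTED if key in payload}
    if status in (403, 404):
        # Blocked or removed endpoint: caller falls back to search-rank scoring.
        return None
    raise RuntimeError(f"unexpected status {status} for track {track_id}")


# Usage with stub fetchers in place of real HTTP calls.
blocked = get_audio_features("abc", lambda _id: (403, {}))
ok = get_audio_features("abc", lambda _id: (200, {"valence": 0.2, "tempo": 68.0}))
print(blocked, ok)
```

Returning `None` instead of raising keeps the request path alive: the caller can check for it and fall back to search-rank scoring.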

### 3.4 Multimodal Fusion Layer

**What:** Combine **four** signals, the user's emotional state plus three track-side modalities, into one calibrated match score:

```
User emotional state  (DeBERTa output)         ──┐
Lyrical emotion vec   (transformer on lyrics)  ──┤
Acoustic embedding    (VGGish on spectrogram)  ──┼──► Fusion head → match_score
Structured features   (valence/energy/tempo)   ──┘
```

**Method:**
1. Project each modality into a shared Valence-Arousal-Dominance (VAD) space
2. Compute weighted cosine similarity between user state vector and track vector
3. Weights tunable (or learned from the user feedback collected in Section 3.6)

**Output:** A single calibrated match score (0–100 %) grounded in all four modalities simultaneously — the full Valence-Arousal grid match originally planned.
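
One possible reading of the method above, as a minimal sketch: each track-side modality is assumed to be already projected to a (valence, arousal, dominance) vector with components in [-1, 1], and the fused score is a weighted mean of per-modality cosine similarities rescaled to 0–100. The function names, the rescaling, and the example weights are assumptions, not a settled design.

```python
import math
from typing import Dict, Sequence

Vec = Sequence[float]  # (valence, arousal, dominance), each assumed in [-1, 1]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_match_score(
    user: Vec, modalities: Dict[str, Vec], weights: Dict[str, float]
) -> float:
    """Weighted mean of per-modality cosine similarity, rescaled to 0-100."""
    total_weight = sum(weights[name] for name in modalities)
    if total_weight == 0:
        return 0.0
    similarity = sum(
        weights[name] * cosine(user, vec) for name, vec in modalities.items()
    ) / total_weight
    return round(50.0 * (similarity + 1.0), 1)


user_state = (-0.6, -0.4, -0.3)
track = {
    "lyrics": (-0.5, -0.3, -0.2),
    "acoustic": (-0.4, -0.5, -0.1),
    "structured": (-0.7, -0.2, -0.3),
}
weights = {"lyrics": 0.4, "acoustic": 0.3, "structured": 0.3}
print(fused_match_score(user_state, track, weights))
```

Normalising by the weights of the modalities actually present means a track with no lyrics or no preview still receives a score from the remaining signals.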

### 3.5 Explainability (XAI Layer)

**What:** Natural-language justification for every recommended track.

**Example output:**
> *"Selected because its slow acoustic tempo (68 BPM) and lyrical themes of isolation strongly align with your expressed feeling of loneliness."*

**Method:** Template-based generation seeded by the acoustic + lyrical features that drove the match score. Can be upgraded to a small LLM-generated explanation later.

**Impact:** Transparency — users understand why they got a recommendation, not just what it is.
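
Template-based generation could be as simple as the sketch below. The template wording, the tempo thresholds, and the track keys (`tempo`, `lyric_theme`) are illustrative assumptions; a real version would choose which features to mention based on what drove the match score.

```python
from typing import Dict

TEMPLATE = (
    "Selected because its {tempo_word} tempo ({tempo:.0f} BPM) and lyrical "
    "themes of {theme} align with your expressed feeling of {emotion}."
)


def tempo_word(bpm: float) -> str:
    """Map BPM to a descriptive word; thresholds are assumptions."""
    if bpm < 80:
        return "slow"
    if bpm < 120:
        return "moderate"
    return "fast"


def explain(track: Dict, emotion: str) -> str:
    """Fill the template from the features attached to a track."""
    return TEMPLATE.format(
        tempo_word=tempo_word(track["tempo"]),
        tempo=track["tempo"],
        theme=track["lyric_theme"],
        emotion=emotion,
    )


print(explain({"tempo": 68, "lyric_theme": "isolation"}, "loneliness"))
```

Templates keep every explanation traceable to a concrete feature value, which is harder to guarantee once an LLM writes the text.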

### 3.6 User Feedback Loop

**What:** Thumbs up / down per track, feeding back into the recommendation ranking for that session.

**Impact:** Closes the loop between static model output and real user response, and over time yields labelled data for future fine-tuning.
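
A minimal sketch of in-session re-ranking, under assumptions the plan does not fix: a thumbs-down removes that track, and each vote nudges the scores of all tracks by the same artist. The `ARTIST_NUDGE` value, the track keys, and the artist-level propagation are hypothetical choices for illustration.

```python
from typing import Dict, List

ARTIST_NUDGE = 5.0  # hypothetical score adjustment per vote


def rerank_with_feedback(tracks: List[Dict], votes: Dict[str, int]) -> List[Dict]:
    """Re-rank a session's tracks using thumbs up (+1) / down (-1) votes."""
    by_id = {t["id"]: t for t in tracks}
    artist_bias: Dict[str, float] = {}
    for track_id, vote in votes.items():
        if track_id not in by_id:
            continue  # ignore votes for tracks outside this session
        artist = by_id[track_id]["artist"]
        artist_bias[artist] = artist_bias.get(artist, 0.0) + vote * ARTIST_NUDGE

    adjusted = [
        {**t, "match_score": t["match_score"] + artist_bias.get(t["artist"], 0.0)}
        for t in tracks
        if votes.get(t["id"], 0) >= 0  # drop tracks that were voted down
    ]
    return sorted(adjusted, key=lambda t: t["match_score"], reverse=True)


session = [
    {"id": "a", "artist": "X", "match_score": 82.0},
    {"id": "b", "artist": "Y", "match_score": 80.0},
    {"id": "c", "artist": "Y", "match_score": 70.0},
]
print([t["id"] for t in rerank_with_feedback(session, {"b": 1})])
```

The vote log itself (input text, track, vote) is what would later serve as training data, so it is worth persisting even if the re-ranking rule changes.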

### 3.7 Extended Spotify Access

**What:** Apply for Spotify's Extended Access program (free).

**Unlocks:**
- `/v1/audio-features` (valence, energy, danceability, tempo per track)
- `/v1/recommendations` (seed-based recommendations)
- Higher search limits

**Impact:** Makes acoustic analysis available immediately via the API we already have, without needing Essentia or AcousticBrainz.

---

## Objectives — Start to End

| # | Objective | Phase | Status |
|---|---|---|---|
| 1 | Fine-tune an NLP model to classify text emotions | Phase 1 | ✅ Done — DeBERTa-v3-base, 8 classes, ~95 % |
| 2 | Map emotions to music parameters | Phase 1 | ✅ Done — EmotionFeatureMapper, Valence/Energy/Danceability targets |
| 3 | Retrieve real tracks from a live music source | Phase 1 | ✅ Done — Spotify Search API |
| 4 | Anti-popularity bias | Phase 1 | ✅ Done — max_popularity ≤ 65, hidden-gem preference |
| 5 | Live deployed API + frontend | Phase 1 | ✅ Done — HF Spaces + Vercel |
| 6 | Varied results per query | Phase 2 | ✅ Done — 4 term sets × 4 offsets |
| 7 | Lyrics-to-source-song detection | Phase 2 | ✅ Done — heuristic + optional Musixmatch |
| 8 | Meaningful, varied match scores | Phase 2 | ✅ Done — search-rank scoring |
| 9 | Fetch and analyse song lyrics per track | Phase 3 | 🔲 Musixmatch lyrics API |
| 10 | Spectrogram analysis — VGGish CNN on preview audio | Phase 3 | 🔲 Raw acoustic texture embedding |
| 11 | Structured audio features (valence/energy/tempo) | Phase 3 | 🔲 Extended Access or Essentia |
| 12 | Semantic match scoring via lyrical emotion | Phase 3 | 🔲 Replaces search-rank proxy |
| 13 | Valence-Arousal grid fusion (text + lyrics + spectrogram + features) | Phase 3 | 🔲 Full 4-modality fusion layer |
| 14 | Natural-language recommendation justifications | Phase 3 | 🔲 XAI / template generation |
| 15 | In-session user feedback loop | Phase 3 | 🔲 Thumbs up/down + re-ranking |
| 16 | Full multimodal pipeline (all 4 signals fused) | End state | 🔲 Full original vision realised |

---

## What Changed From the Original Vision (And Why It's Better)

| Original | Current | Reason |
|---|---|---|
| Own music database | Spotify (100M+ tracks, live) | Far larger catalogue, no maintenance, always current |
| VGGish CNN for audio | Planned for Phase 3 | Deferred; Spotify's structured features (via Extended Access) are the cheaper first step |
| ALBERT/BERT for lyrics | Planned for Phase 3 via Musixmatch | Lyrics only needed at recommendation time, not training time |
| LSTM fusion engine | Lightweight cosine fusion planned | Simpler, interpretable, easier to debug |
| Batch/static recommendations | Real-time, randomised per query | Reduces filter-bubble effects from day one |
| Research prototype | Production deployment | Real users, real feedback, real iteration |

The architecture is **thinner today but heading in the same direction**. Every multimodal channel from the original design (lyrical semantics, acoustic mood, fusion) is still on the roadmap; each now plugs into a production system instead of a research prototype.

---

## Summary

> EUMORA today: text in → emotion classified → Spotify searched with mood keywords → anti-popularity filtered → ranked by relevance → source song pinned if lyrics detected.
>
> EUMORA end state: text in → emotion classified → Spotify tracks enriched with lyrical + acoustic analysis → fused match score on Valence-Arousal grid → ranked with XAI explanations → user feedback refines in real time.

The core insight — *understand how someone feels, not what's popular* — is live and working. The multimodal layers that deepen that understanding are the clear, sequenced next steps.