# EUMORA: Revised Product Plan
## What Is EUMORA?
EUMORA is an emotion-aware music recommendation system. You describe how you feel in plain text and it recommends songs that match that emotional state. No popularity contests, no filter bubbles, no cold-start delays.
The original architecture envisioned a full multimodal pipeline built from scratch: our own music database, VGGish audio analysis, ALBERT/BERT lyric transformers, and an LSTM fusion engine. We have since pivoted to a leaner, production-grade architecture that ships today while preserving every multimodal avenue for future expansion.
---
## Phase 1: Where We Started
**Objective:** Prove the core premise. Given free-text input, classify the user's emotion and return music that matches it.
| Component | Original Plan | What We Actually Built |
|---|---|---|
| Emotion model | Fine-tuned DistilBERT, 7-class | Fine-tuned **DeBERTa-v3-base**, **8-class** (+ sarcasm) |
| Training data | Generic emotion dataset | Combined dataset: 59 k+ samples across GoEmotions, tweet corpora, and a sarcasm corpus |
| Music retrieval | Own database + LSTM matching | **Spotify Search API** with emotion-keyed queries |
| Deployment | Research prototype | **FastAPI** on **HF Spaces** (Docker), frontend on **Vercel** |
| Popularity bias | Penalise via score | Max-popularity filter (≤ 65) + anti-chart-topper logic |
| Match score | Valence-Arousal distance | **Search-rank position** (Spotify relevance ordering, 95 % → 50 %) |
| Cold start | N/A | Zero: no listening history required |
**Result:** A live, working product at [eumora.vercel.app](https://eumora.vercel.app).
---
## Phase 2: Where We Are Right Now
### Architecture
```
           User text input
                  │
                  ▼
┌─────────────────────────────────────┐
│ DeBERTa-v3-base (fine-tuned)        │ ← Emotion classifier
│ 8 classes: sadness · joy · love     │   59 k+ training samples
│ anger · fear · surprise · neutral   │   Sarcasm prior calibration
│ sarcasm                             │   Confidence + probability mix
└─────────────────┬───────────────────┘
                  │ predict_result
                  ▼
┌─────────────────────────────────────┐
│ EmotionFeatureMapper                │ ← Emotion → Spotify params
│ Valence / Energy / Danceability     │   Blended across top-2 emotions
│ Tempo / Mode / Seed genres          │   Confidence-scaled windows
└─────────────────┬───────────────────┘
                  │ query + params
                  ▼
┌─────────────────────────────────────┐
│ Spotify Search API (/v1/search)     │ ← 4 term-sets × 4 offsets
│ Randomised mood keywords            │   = 16 result pools per emotion
│ Random page offset (0/10/20/30)     │   limit = 10 (Spotify cap)
└─────────────────┬───────────────────┘
                  │ raw tracks
                  ▼
┌─────────────────────────────────────┐
│ Score + Filter                      │
│ • Max popularity ≤ 65               │   Anti-chart-topper filter
│ • Audio features (graceful 403)     │   Spotify deprecated for std apps
│ • Search-rank scoring 95 % → 50 %   │   Highest relevance first
│ • Diversity filter (≤2/artist)      │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│ Lyrics Source Pinning               │ ← If input looks like lyrics
│ 1. Musixmatch lyrics search         │   (optional, key-gated)
│ 2. Title-word heuristic fallback    │   Pins source song at #1 / 100 %
└─────────────────┬───────────────────┘
                  │
                  ▼
          Ranked track list
```
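
The EmotionFeatureMapper step in the diagram (top-2 blending with confidence-scaled windows) could look roughly like the sketch below. The per-emotion target values, the window-width rule, and the function name `blend_targets` are illustrative assumptions, not the values used in the deployed mapper.

```python
from typing import Dict, Tuple

# Hypothetical (valence, energy) targets per emotion; illustrative values only.
EMOTION_TARGETS: Dict[str, Tuple[float, float]] = {
    "sadness": (0.20, 0.30),
    "joy": (0.85, 0.75),
    "love": (0.70, 0.50),
    "anger": (0.25, 0.85),
    "fear": (0.20, 0.60),
    "surprise": (0.60, 0.70),
    "neutral": (0.50, 0.50),
    "sarcasm": (0.40, 0.55),
}


def blend_targets(probs: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Blend the two most probable emotions into (low, high) search windows.

    Expects at least two emotions with a positive combined probability.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (e1, p1), (e2, p2) = ranked[:2]
    total = p1 + p2
    w1, w2 = p1 / total, p2 / total

    valence = w1 * EMOTION_TARGETS[e1][0] + w2 * EMOTION_TARGETS[e2][0]
    energy = w1 * EMOTION_TARGETS[e1][1] + w2 * EMOTION_TARGETS[e2][1]

    # Assumed rule: higher top-1 confidence gives a narrower window
    # (half-width shrinks from 0.30 at p1 = 0 to 0.10 at p1 = 1).
    half_width = 0.30 - 0.20 * p1

    def window(center: float) -> Tuple[float, float]:
        return (max(0.0, center - half_width), min(1.0, center + half_width))

    return {"valence": window(valence), "energy": window(energy)}
```

A confident prediction narrows the window around the blended target, while an uncertain one widens it so that more candidate tracks qualify.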
### Current Capabilities
- **Emotion classification:** 8 emotions, sarcasm-calibrated, ~95 % accuracy on test set
- **Music recommendations:** live Spotify tracks, varied per query, anti-popularity biased
- **Lyrics detection:** heuristic (always on) + Musixmatch (when key is set)
- **Match scoring:** search-rank-based, varied 95 % → 50 %, highest first
- **Live deployment:** HF Spaces (API) + Vercel (frontend), CORS locked, secrets managed
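
As a rough illustration of the Score + Filter stage, the sketch below applies the popularity cap, the two-per-artist diversity limit, and a search-rank score. The thresholds come from this plan; the linear interpolation between 95 % and 50 %, the choice to score after filtering, and the track dictionary keys are assumptions.

```python
from typing import Dict, List

MAX_POPULARITY = 65                     # from the plan
MAX_PER_ARTIST = 2                      # from the plan
TOP_SCORE, BOTTOM_SCORE = 95.0, 50.0    # from the plan


def score_and_filter(tracks: List[Dict]) -> List[Dict]:
    """Filter tracks given in Spotify relevance order, then score by rank.

    Each track is a dict with hypothetical keys 'name', 'artist', 'popularity'.
    """
    kept: List[Dict] = []
    per_artist: Dict[str, int] = {}
    for track in tracks:
        if track["popularity"] > MAX_POPULARITY:
            continue  # anti-chart-topper filter
        seen = per_artist.get(track["artist"], 0)
        if seen >= MAX_PER_ARTIST:
            continue  # diversity filter
        per_artist[track["artist"]] = seen + 1
        kept.append(track)

    # Assumed: scores fall linearly from 95 % (first) to 50 % (last).
    span = max(len(kept) - 1, 1)
    return [
        {**t, "match_score": round(TOP_SCORE - (TOP_SCORE - BOTTOM_SCORE) * i / span, 1)}
        for i, t in enumerate(kept)
    ]
```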
### Known Constraints (Spotify API, Nov 2024 Policy)
| Endpoint | Status | Workaround |
|---|---|---|
| `/v1/recommendations` | 404: deprecated without Extended Access | Replaced with `/v1/search` |
| `/v1/audio-features` | 403: blocked without Extended Access | Graceful degradation (search-rank scoring) |
| `/v1/search` limit | Max 10 results per request | Randomised offset to sample wider pool |
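
The randomised-offset workaround can be sketched as follows. The `q`, `type`, `limit`, and `offset` names are standard `/v1/search` query parameters; the keyword sets and the helper names are hypothetical, since the real term sets are not listed in this plan.

```python
import random
from typing import Dict, List, Optional, Tuple

OFFSETS = [0, 10, 20, 30]   # from the plan
LIMIT = 10                  # from the plan

# Hypothetical mood keyword sets (four per emotion).
TERM_SETS: Dict[str, List[str]] = {
    "sadness": [
        "sad acoustic",
        "melancholy indie",
        "heartbreak ballad",
        "rainy day songs",
    ],
}


def build_search_params(emotion: str, rng: Optional[random.Random] = None) -> Dict[str, object]:
    """Pick one of the 4 x 4 result pools at random for this request."""
    if rng is None:
        rng = random.Random()
    return {
        "q": rng.choice(TERM_SETS[emotion]),
        "type": "track",
        "limit": LIMIT,
        "offset": rng.choice(OFFSETS),
    }


def all_pools(emotion: str) -> List[Tuple[str, int]]:
    """Enumerate every (term set, offset) pool available for an emotion."""
    return [(terms, offset) for terms in TERM_SETS[emotion] for offset in OFFSETS]
```

Passing a seeded `random.Random` makes the choice reproducible in tests while production requests stay varied.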
---
## Phase 3: Future Implementation Scope (Multimodal Expansion)
The pivot to Spotify freed us from building a music database. The multimodal analysis layers from the original architecture are still the roadmap; they now slot in as enrichment on top of Spotify tracks rather than as a replacement for them.
### 3.1 Lyrical Semantics Module
**What:** For each recommended Spotify track, fetch the lyrics and run a transformer (ALBERT or DeBERTa) to extract lyrical emotion, themes, and mood.
**Data source:** Musixmatch API (`track.lyrics.get`), using the same key already planned for lyrics detection.
**Output:** A lyrical emotion vector per track that can be fused with the user's emotional state.
**Impact:** Replace the current search-rank proxy score with a genuine semantic match score. A song about heartbreak should rank higher than one that just happens to appear in a "sad" search, regardless of Spotify's relevance ordering.
### 3.2 Spectrogram Analysis Module (Raw Audio CNN)
**What:** For each track, download the 30-second preview clip from Spotify (`preview_url`) and convert it into a **mel spectrogram**, a 2D time-frequency image of the audio signal. Feed that image through **VGGish** (Google's audio CNN, pre-trained on AudioSet) to extract a 128-dimensional acoustic embedding that captures timbre, texture, rhythm density, and harmonic content.
**Pipeline:**
```
preview_url (MP3, 30 s)
        │
        ▼
Decode audio → resample to 16 kHz mono
        │
        ▼
Log-mel spectrogram (64 mel bins, 25 ms windows, 10 ms hop)
        │
        ▼
VGGish CNN → 128-d acoustic embedding
        │
        ▼
Acoustic texture vector (dark/bright, dense/sparse, smooth/aggressive)
```
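
To size this pipeline, the sketch below works out how many spectrogram frames and VGGish input patches a 30-second preview yields, assuming the published VGGish front-end settings (16 kHz audio, 25 ms window, 10 ms hop, non-overlapping 96-frame patches of about 0.96 s). Each patch produces one 128-d embedding, so some pooling step is needed to get a single vector per track; mean pooling is used here purely as an assumption.

```python
from typing import List

SAMPLE_RATE = 16_000      # Hz, VGGish input rate
WINDOW_S = 0.025          # 25 ms analysis window
HOP_S = 0.010             # 10 ms hop between windows
FRAMES_PER_PATCH = 96     # 96 frames (about 0.96 s) per VGGish input patch


def count_frames(duration_s: float) -> int:
    """Number of full analysis windows that fit into the clip."""
    n_samples = int(round(duration_s * SAMPLE_RATE))
    window = int(round(WINDOW_S * SAMPLE_RATE))   # 400 samples
    hop = int(round(HOP_S * SAMPLE_RATE))         # 160 samples
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


def count_patches(duration_s: float) -> int:
    """Number of non-overlapping patches, i.e. embeddings per clip."""
    return count_frames(duration_s) // FRAMES_PER_PATCH


def mean_pool(embeddings: List[List[float]]) -> List[float]:
    """Average per-patch embeddings into one track-level vector (assumed pooling)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]
```

Under these assumptions a 30 s preview gives 2998 frames and 31 patches, so each track would carry 31 embeddings before pooling.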
**What it captures that structured features miss:**
- Timbral texture (why a minor-key piano ballad feels different from a minor-key metal track even with the same BPM and valence)
- Spectral brightness / darkness (correlated with perceived emotional tone)
- Onset density (busyness, complexity)
- Harmonic vs. percussive energy ratio
**Data source:** Spotify `preview_url` (30 s MP3, no auth required to download), returned in the track objects we already fetch. The field is nullable, so tracks without a preview would be skipped by this module.
**Output:** A 128-d embedding per track, projected down to a mood-relevant subspace.
### 3.3 Structured Audio Features Module
**What:** Recover Spotify's high-level audio descriptors (**valence, energy, danceability, tempo, key, mode, acousticness, speechiness, instrumentalness**) per track. These are human-interpretable scalars on top of the raw signal.
**Current blocker:** Spotify `/v1/audio-features` returns 403 without Extended Access (deprecated Nov 2024).
**Resolution path:**
- **Option A (fastest):** Apply for Spotify Extended Access (free, ~2-week review), which unlocks both `/audio-features` and `/recommendations`
- **Option B (no dependency):** AcousticBrainz open dataset (~2M tracks, pre-computed features)
- **Option C (compute ourselves):** Run **Essentia** (open-source, from the Music Technology Group at Universitat Pompeu Fabra) on preview URLs to extract tempo, key, mode, spectral centroid, and loudness directly
**Output:** Valence / Energy / Danceability / Tempo / Mode scalars per track, forming the structured side of the Valence-Arousal grid.
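
One way to combine the three options with the current graceful degradation is an ordered provider chain that falls back to search-rank scoring when no source answers. This is a minimal sketch: `SourceUnavailable`, `first_available`, and the provider callables are hypothetical names, and real providers would wrap the HTTP or Essentia calls.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, float]
Provider = Callable[[str], Optional[Features]]


class SourceUnavailable(Exception):
    """Hypothetical error a provider raises, for example on an HTTP 403."""


def first_available(
    track_id: str, providers: List[Tuple[str, Provider]]
) -> Tuple[str, Optional[Features]]:
    """Try each feature source in order; fall back to search-rank scoring."""
    for name, provider in providers:
        try:
            features = provider(track_id)
        except SourceUnavailable:
            continue  # blocked source, try the next one
        if features is not None:
            return name, features
    return "search-rank", None
```

The returned source name lets the scoring stage record which signal a match score was based on.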
### 3.4 Multimodal Fusion Layer
**What:** Combine **four** independent signals into one calibrated match score:
```
User emotional state (DeBERTa output)         ──┐
Lyrical emotion vec  (transformer on lyrics)  ──┤
Acoustic embedding   (VGGish on spectrogram)    ├──► Fusion head → match_score
Structured features  (valence/energy/tempo)   ──┘
```
**Method:**
1. Project each modality into a shared Valence-Arousal-Dominance (VAD) space
2. Compute weighted cosine similarity between user state vector and track vector
3. Weights tunable (or learned from the user feedback loop in Section 3.6)
**Output:** A single calibrated match score (0–100 %) grounded in all four modalities simultaneously: the full Valence-Arousal grid match originally planned.
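
The method above can be sketched under a few assumptions: each modality has already been projected to a VAD vector with components in [-1, 1], the track vector is the weighted mean of its modality vectors, and cosine similarity is rescaled linearly from [-1, 1] to 0–100 %. The function names and weights are illustrative.

```python
import math
from typing import Dict, Sequence

Vec = Sequence[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def fuse(user_vad: Vec, track_vads: Dict[str, Vec], weights: Dict[str, float]) -> float:
    """Weighted-mean track vector vs. user vector, as a 0-100 match score."""
    total = sum(weights[m] for m in track_vads)
    track = [
        sum(weights[m] * track_vads[m][i] for m in track_vads) / total
        for i in range(len(user_vad))
    ]
    return round(50.0 * (1.0 + cosine(user_vad, track)), 1)
```

Because the weighted sum only runs over modalities that are present, a track with missing lyrics or audio features can still be scored from the remaining signals.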
### 3.5 Explainability (XAI Layer)
**What:** Natural-language justification for every recommended track.
**Example output:**
> *"Selected because its slow acoustic tempo (68 BPM) and lyrical themes of isolation strongly align with your expressed feeling of loneliness."*
**Method:** Template-based generation seeded by the acoustic + lyrical features that drove the match score. Can be upgraded to a small LLM-generated explanation later.
**Impact:** Transparency. Users understand why they got a recommendation, not just what it is.
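
A template-based generator of this kind might look like the sketch below. The tempo thresholds, the wording, and the `explain` signature are assumptions for illustration; the real templates would be driven by whichever features contributed most to the match score.

```python
from typing import List, Optional


def explain(
    emotion: str,
    tempo_bpm: Optional[float] = None,
    lyrical_theme: Optional[str] = None,
) -> str:
    """Build a one-sentence justification from whichever features are available."""
    reasons: List[str] = []
    if tempo_bpm is not None:
        # Assumed thresholds for describing pace.
        if tempo_bpm < 90:
            pace = "slow"
        elif tempo_bpm < 120:
            pace = "moderate"
        else:
            pace = "fast"
        reasons.append(f"its {pace} tempo ({tempo_bpm:.0f} BPM)")
    if lyrical_theme:
        reasons.append(f"lyrical themes of {lyrical_theme}")

    if not reasons:
        return f"Selected because it ranked highly for your expressed feeling of {emotion}."
    verb = "aligns" if len(reasons) == 1 else "align"
    return (
        f"Selected because {' and '.join(reasons)} {verb} "
        f"with your expressed feeling of {emotion}."
    )
```

The fallback sentence covers the current state, where only search rank is available and no acoustic or lyrical feature can be cited.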
### 3.6 User Feedback Loop
**What:** Thumbs up / down per track, feeding back into the recommendation ranking for that session.
**Impact:** Closes the loop between static model output and real user response. Enables fine-tuning training data collection over time.
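
The plan does not specify how votes change the ranking, so the following is only one possible in-session rule: thumbs-down tracks are removed, and every remaining track is nudged up or down by the net votes on its artist. The `rerank` name, the track keys, and the 10-point step are assumptions.

```python
from typing import Dict, List


def rerank(tracks: List[Dict], votes: Dict[str, int], step: float = 10.0) -> List[Dict]:
    """Re-rank a session's tracks from +1 / -1 votes keyed by track id.

    Each track is a dict with hypothetical keys 'id', 'artist', 'match_score'.
    """
    # Net votes per artist across the session.
    artist_bias: Dict[str, int] = {}
    for track in tracks:
        vote = votes.get(track["id"], 0)
        artist_bias[track["artist"]] = artist_bias.get(track["artist"], 0) + vote

    adjusted: List[Dict] = []
    for track in tracks:
        if votes.get(track["id"], 0) < 0:
            continue  # drop tracks the user rejected
        score = track["match_score"] + step * artist_bias[track["artist"]]
        adjusted.append({**track, "match_score": max(0.0, min(100.0, score))})

    return sorted(adjusted, key=lambda t: t["match_score"], reverse=True)
```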
### 3.7 Extended Spotify Access
**What:** Apply for Spotify's Extended Access program (free).
**Unlocks:**
- `/v1/audio-features` (valence, energy, danceability, tempo per track)
- `/v1/recommendations` (seed-based recommendations)
- Higher search limits
**Impact:** Makes acoustic analysis available immediately via the API we already have, without needing Essentia or AcousticBrainz.
---
## Objectives: Start to End
| # | Objective | Phase | Status |
|---|---|---|---|
| 1 | Fine-tune an NLP model to classify text emotions | Phase 1 | ✅ Done: DeBERTa-v3-base, 8 classes, ~95 % |
| 2 | Map emotions to music parameters | Phase 1 | ✅ Done: EmotionFeatureMapper, Valence/Energy/Danceability targets |
| 3 | Retrieve real tracks from a live music source | Phase 1 | ✅ Done: Spotify Search API |
| 4 | Anti-popularity bias | Phase 1 | ✅ Done: max_popularity ≤ 65, hidden-gem preference |
| 5 | Live deployed API + frontend | Phase 1 | ✅ Done: HF Spaces + Vercel |
| 6 | Varied results per query | Phase 2 | ✅ Done: 4 term sets × 4 offsets |
| 7 | Lyrics-to-source-song detection | Phase 2 | ✅ Done: heuristic + optional Musixmatch |
| 8 | Meaningful, varied match scores | Phase 2 | ✅ Done: search-rank scoring |
| 9 | Fetch and analyse song lyrics per track | Phase 3 | 🔲 Musixmatch lyrics API |
| 10 | Spectrogram analysis: VGGish CNN on preview audio | Phase 3 | 🔲 Raw acoustic texture embedding |
| 11 | Structured audio features (valence/energy/tempo) | Phase 3 | 🔲 Extended Access or Essentia |
| 12 | Semantic match scoring via lyrical emotion | Phase 3 | 🔲 Replaces search-rank proxy |
| 13 | Valence-Arousal grid fusion (text + lyrics + spectrogram + features) | Phase 3 | 🔲 Full 4-modality fusion layer |
| 14 | Natural-language recommendation justifications | Phase 3 | 🔲 XAI / template generation |
| 15 | In-session user feedback loop | Phase 3 | 🔲 Thumbs up/down + re-ranking |
| 16 | Full multimodal pipeline (all 4 signals fused) | End state | 🔲 Full original vision realised |
---
## What Changed From the Original Vision (And Why It's Better)
| Original | Current | Reason |
|---|---|---|
| Own music database | Spotify (100M+ tracks, live) | Far larger catalogue, no maintenance, always current |
| VGGish CNN for audio | Planned for Phase 3 | Spotify API + Extended Access covers this more cheaply |
| ALBERT/BERT for lyrics | Planned for Phase 3 via Musixmatch | Lyrics only needed at recommendation time, not training time |
| LSTM fusion engine | Lightweight cosine fusion planned | Simpler, interpretable, easier to debug |
| Batch/static recommendations | Real-time, randomised per query | Eliminates filter bubbles immediately |
| Research prototype | Production deployment | Real users, real feedback, real iteration |
The architecture is **thinner today but structurally identical in direction**. Every multimodal channel from the original design (lyrical semantics, acoustic mood, fusion) is still in the roadmap; they now plug into a production system instead of a research prototype.
---
## Summary
> EUMORA today: text in → emotion classified → Spotify searched with mood keywords → anti-popularity filtered → ranked by relevance → source song pinned if lyrics detected.
>
> EUMORA end state: text in → emotion classified → Spotify tracks enriched with lyrical + acoustic analysis → fused match score on Valence-Arousal grid → ranked with XAI explanations → user feedback refines in real time.
The core insight (*understand how someone feels, not what's popular*) is live and working. The multimodal layers that deepen that understanding are the clear, sequenced next steps.