# EUMORA: Revised Product Plan

## What Is EUMORA?

EUMORA is an emotion-aware music recommendation system. You describe how you feel in plain text and it recommends songs that match that emotional state. No popularity contests, no filter bubbles, no cold-start delays.

The original architecture envisioned a full multimodal pipeline built from scratch: our own music database, VGGish audio analysis, ALBERT/BERT lyric transformers, and an LSTM fusion engine. We have since pivoted to a leaner, production-grade architecture that ships today while preserving every multimodal avenue for future expansion.

---

## Phase 1: Where We Started

**Objective:** Prove the core premise. Given free-text input, classify the user's emotion and return music that matches it.

| Component | Original Plan | What We Actually Built |
|---|---|---|
| Emotion model | Fine-tuned DistilBERT, 7-class | Fine-tuned **DeBERTa-v3-base**, **8-class** (+ sarcasm) |
| Training data | Generic emotion dataset | Combined dataset: 59 k+ samples across GoEmotions, tweet corpora, sarcasm corpus |
| Music retrieval | Own database + LSTM matching | **Spotify Search API** with emotion-keyed queries |
| Deployment | Research prototype | **FastAPI** on **HF Spaces** (Docker), frontend on **Vercel** |
| Popularity bias | Penalise via score | Max-popularity filter (≤ 65) + anti-chart-topper logic |
| Match score | Valence-Arousal distance | **Search-rank position** (Spotify relevance ordering, 95 % → 50 %) |
| Cold start | N/A | Zero: no listening history required |

**Result:** A live, working product at [eumora.vercel.app](https://eumora.vercel.app).

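
The popularity cap and the rank-based match score from the table can be sketched in a few lines. This is a minimal illustration, not the deployed code: the function names, the `match_score` key, and the linear interpolation between 95 % and 50 % are assumptions. Only the `popularity` field and the ≤ 65 cap come from the plan.

```python
from typing import Dict, List

MAX_POPULARITY = 65    # cap stated in the plan
TOP_SCORE = 95.0       # score given to the first search result
BOTTOM_SCORE = 50.0    # score given to the last search result


def rank_score(position: int, total: int) -> float:
    """Map a 0-based rank position to a score, linearly from 95 down to 50.

    The linear shape is an assumption; the plan only states the endpoints.
    """
    if total <= 1:
        return TOP_SCORE
    step = (TOP_SCORE - BOTTOM_SCORE) / (total - 1)
    return round(TOP_SCORE - step * position, 1)


def filter_and_score(tracks: List[Dict]) -> List[Dict]:
    """Drop tracks above the popularity cap, then score the rest by rank.

    Input order is taken to be Spotify's relevance ordering.
    """
    kept = [t for t in tracks if t.get("popularity", 0) <= MAX_POPULARITY]
    return [
        {**t, "match_score": rank_score(i, len(kept))}
        for i, t in enumerate(kept)
    ]
```

Scoring after filtering keeps the top surviving track at 95 % even when chart-toppers were removed ahead of it, which is one possible reading of "highest relevance first".
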

---

## Phase 2: Where We Are Right Now

### Architecture

```
User text input
                  │
                  ▼
┌─────────────────────────────────────┐
│  DeBERTa-v3-base (fine-tuned)       │  ← Emotion classifier
│  8 classes: sadness · joy · love    │    59 k+ training samples
│  anger · fear · surprise · neutral  │    Sarcasm prior calibration
│  sarcasm                            │    Confidence + probability mix
└─────────────────┬───────────────────┘
                  │  predict_result
                  ▼
┌─────────────────────────────────────┐
│  EmotionFeatureMapper               │  ← Emotion → Spotify params
│  Valence / Energy / Danceability    │    Blended across top-2 emotions
│  Tempo / Mode / Seed genres         │    Confidence-scaled windows
└─────────────────┬───────────────────┘
                  │  query + params
                  ▼
┌─────────────────────────────────────┐
│  Spotify Search API (/v1/search)    │  ← 4 term-sets × 4 offsets
│  Randomised mood keywords           │    = 16 result pools per emotion
│  Random page offset (0/10/20/30)    │    limit = 10 (Spotify cap)
└─────────────────┬───────────────────┘
                  │  raw tracks
                  ▼
┌─────────────────────────────────────┐
│  Score + Filter                     │
│  • Max popularity ≤ 65              │    Anti-chart-topper filter
│  • Audio features (graceful 403)    │    Spotify deprecated for std apps
│  • Search-rank scoring 95 % → 50 %  │    Highest relevance first
│  • Diversity filter (≤2/artist)     │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  Lyrics Source Pinning              │  ← If input looks like lyrics
│  1. Musixmatch lyrics search        │    (optional, key-gated)
│  2. Title-word heuristic fallback   │    Pins source song at #1 / 100 %
└─────────────────┬───────────────────┘
                  │
                  ▼
Ranked track list
```

### Current Capabilities

- **Emotion classification:** 8 emotions, sarcasm-calibrated, ~95 % accuracy on test set
- **Music recommendations:** live Spotify tracks, varied per query, anti-popularity biased
- **Lyrics detection:** heuristic (always on) + Musixmatch (when key is set)
- **Match scoring:** search-rank-based, varied 95 % → 50 %, highest first
- **Live deployment:** HF Spaces (API) + Vercel (frontend), CORS locked, secrets managed

### Known Constraints (Spotify API, Nov 2024 Policy)

| Endpoint | Status | Workaround |
|---|---|---|
| `/v1/recommendations` | 404 (deprecated without Extended Access) | Replaced with `/v1/search` |
| `/v1/audio-features` | 403 (blocked without Extended Access) | Graceful degradation (search-rank scoring) |
| `/v1/search` limit | Max 10 results per request | Randomised offset to sample wider pool |

---

## Phase 3: Future Implementation Scope (Multimodal Expansion)

The pivot to Spotify freed us from building a music database. The multimodal analysis layers from the original architecture are still the roadmap. They now slot in as enrichment on top of Spotify tracks rather than as a replacement for them.

### 3.1 Lyrical Semantics Module

**What:** For each recommended Spotify track, fetch the lyrics and run a transformer (ALBERT or DeBERTa) to extract lyrical emotion, themes, and mood.

**Data source:** Musixmatch API (`track.lyrics.get`), using the same key already planned for lyrics detection.

**Output:** A lyrical emotion vector per track that can be fused with the user's emotional state.

**Impact:** Replace the current search-rank proxy score with a genuine semantic match score. A song about heartbreak should rank higher than one that just happens to appear in a "sad" search, regardless of Spotify's relevance ordering.

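
One way the semantic match score could be computed is as a cosine similarity between the user's emotion probabilities and a track's lyrical emotion vector. The sketch below is only an illustration under that assumption: the plan does not fix a similarity measure at this stage, and the function names and the shared 8-class ordering are hypothetical.

```python
import math
from typing import Sequence

# Assumed shared class order for both vectors (the 8 classes named in the plan).
EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise", "neutral", "sarcasm"]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_match_score(user_probs: Sequence[float], lyric_probs: Sequence[float]) -> float:
    """Turn the similarity into a 0-100 score.

    Both inputs are non-negative probability vectors, so the cosine lies in [0, 1].
    """
    return round(100.0 * cosine_similarity(user_probs, lyric_probs), 1)
```

With a mostly-sad user vector, a heartbreak lyric vector dominated by sadness scores higher than an upbeat one dominated by joy, independent of where either track appeared in the search results.
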
### 3.2 Spectrogram Analysis Module (Raw Audio CNN)

**What:** For each track, download the 30-second preview clip from Spotify (`preview_url`) and convert it into a **mel spectrogram**, a 2D time-frequency image of the audio signal. Feed that image through **VGGish** (Google's audio CNN, released alongside AudioSet and pre-trained on a large YouTube dataset) to extract 128-dimensional acoustic embeddings that capture timbre, texture, rhythm density, and harmonic content.

**Pipeline:**

```
preview_url (MP3, 30 s)
        │
        ▼
Decode audio → resample to 16 kHz mono
        │
        ▼
Log-mel spectrogram (64 mel bins, 25 ms windows, 10 ms hop)
        │
        ▼
VGGish CNN → 128-d embedding per 0.96 s patch, pooled over the clip
        │
        ▼
Acoustic texture vector (dark/bright, dense/sparse, smooth/aggressive)
```

**What it captures that structured features miss:**

- Timbral texture (why a minor-key piano ballad feels different from a minor-key metal track even with the same BPM and valence)
- Spectral brightness / darkness (correlated with perceived emotional tone)
- Onset density (busyness, complexity)
- Harmonic vs. percussive energy ratio

**Data source:** Spotify `preview_url` (30 s MP3, no auth required to download). Availability needs checking per track: the field can be `null`, and the November 2024 API changes also restricted preview URLs for apps without Extended Access.

**Output:** A 128-d embedding per track (pooled across patches), projected down to a mood-relevant subspace.

### 3.3 Structured Audio Features Module

**What:** Recover Spotify's high-level audio descriptors (**valence, energy, danceability, tempo, key, mode, acousticness, speechiness, instrumentalness**) per track. These are human-interpretable scalars derived from the raw signal.

**Current blocker:** Spotify `/v1/audio-features` returns 403 without Extended Access (deprecated Nov 2024).

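
Until the blocker is resolved, the graceful-degradation behaviour can be pictured roughly as follows. This is a sketch under assumptions, not the deployed handler: `Fetcher`, `get_audio_features`, and `score_track` are hypothetical names, and the HTTP call is abstracted behind a callable so the fallback logic can be shown without network access.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical fetcher: takes a track ID, returns (HTTP status, JSON body or None).
Fetcher = Callable[[str], Tuple[int, Optional[Dict[str, float]]]]


def get_audio_features(track_id: str, fetch: Fetcher) -> Optional[Dict[str, float]]:
    """Return audio features, or None when the endpoint is blocked or missing."""
    status, payload = fetch(track_id)
    if status == 200:
        return payload or None
    if status in (403, 404):
        # Blocked without Extended Access: degrade instead of failing the request.
        return None
    raise RuntimeError(f"Unexpected status {status} for track {track_id}")


def score_track(track_id: str, rank_score: float, fetch: Fetcher) -> Dict[str, object]:
    """Always keep the search-rank score; attach features only when available."""
    result: Dict[str, object] = {
        "track_id": track_id,
        "match_score": rank_score,
        "scoring": "search-rank",
    }
    features = get_audio_features(track_id, fetch)
    if features is not None:
        result["audio_features"] = features
    return result
```

Treating 403 and 404 as "no features" while still raising on other statuses keeps genuine outages visible instead of silently masking them.
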

**Resolution path:**

- **Option A (fastest):** Apply for Spotify Extended Access (free, ~2-week review), which unlocks both `/audio-features` and `/recommendations`
- **Option B (no dependency):** AcousticBrainz open dataset (~2M tracks, pre-computed features). The project stopped accepting submissions in 2022, so only static data dumps are available
- **Option C (compute ourselves):** Run **Essentia** (open-source, from the Music Technology Group at Universitat Pompeu Fabra) on preview URLs to extract tempo, key, mode, spectral centroid, and loudness directly

**Output:** Valence / Energy / Danceability / Tempo / Mode scalars per track, the structured side of the Valence-Arousal grid.

### 3.4 Multimodal Fusion Layer

**What:** Combine **four** independent signals into one calibrated match score:

```
User emotional state   (DeBERTa output)          ──┐
Lyrical emotion vec    (transformer on lyrics)   ──┤
Acoustic embedding     (VGGish on spectrogram)   ──├──► Fusion head → match_score
Structured features    (valence/energy/tempo)    ──┘
```

**Method:**

1. Project each modality into a shared Valence-Arousal-Dominance (VAD) space
2. Compute weighted cosine similarity between user state vector and track vector
3. Weights tunable (or learned from the user feedback loop in Section 3.6)

**Output:** A single calibrated match score (0–100 %) grounded in all four modalities simultaneously, which is the full Valence-Arousal grid match originally planned.

### 3.5 Explainability (XAI Layer)

**What:** Natural-language justification for every recommended track.

**Example output:**

> *"Selected because its slow acoustic tempo (68 BPM) and lyrical themes of isolation strongly align with your expressed feeling of loneliness."*

**Method:** Template-based generation seeded by the acoustic + lyrical features that drove the match score. Can be upgraded to a small LLM-generated explanation later.

**Impact:** Transparency. Users understand why they got a recommendation, not just what it is.

### 3.6 User Feedback Loop

**What:** Thumbs up / down per track, feeding back into the recommendation ranking for that session.

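
The in-session re-ranking described here could work along the following lines. This is a minimal sketch: the additive boost, its size, the clamping to 0–100, and the `id` / `match_score` keys are all assumptions, since the plan does not specify how votes alter the ranking.

```python
from typing import Dict, List


def rerank_with_feedback(
    tracks: List[Dict],
    feedback: Dict[str, int],
    boost: float = 10.0,
) -> List[Dict]:
    """Re-rank a session's tracks using thumbs up (+1) / thumbs down (-1) votes.

    Each vote shifts the track's match score by `boost` points (hypothetical value).
    Tracks without a vote keep their score. The input list is not modified.
    """
    rescored = []
    for track in tracks:
        vote = feedback.get(track["id"], 0)
        adjusted = track["match_score"] + boost * vote
        adjusted = max(0.0, min(100.0, adjusted))  # keep scores in the 0-100 range
        rescored.append({**track, "match_score": adjusted})
    # sorted() is stable, so tied tracks keep their original relative order.
    return sorted(rescored, key=lambda t: t["match_score"], reverse=True)
```

For example, with scores 90, 82, and 75 and a thumbs down on the first track plus a thumbs up on the third, the third track moves to the top (85) and the first drops to the bottom (80).
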

**Impact:** Closes the loop between static model output and real user response. Enables collection of fine-tuning data over time.

### 3.7 Extended Spotify Access

**What:** Apply for Spotify's Extended Access program (free).

**Unlocks:**

- `/v1/audio-features` (valence, energy, danceability, tempo per track)
- `/v1/recommendations` (seed-based recommendations)
- Higher search limits

**Impact:** Makes structured audio features available immediately via the API we already have, without needing Essentia or AcousticBrainz.

---

## Objectives: Start to End

| # | Objective | Phase | Status |
|---|---|---|---|
| 1 | Fine-tune an NLP model to classify text emotions | Phase 1 | ✅ Done: DeBERTa-v3-base, 8 classes, ~95 % |
| 2 | Map emotions to music parameters | Phase 1 | ✅ Done: EmotionFeatureMapper, Valence/Energy/Danceability targets |
| 3 | Retrieve real tracks from a live music source | Phase 1 | ✅ Done: Spotify Search API |
| 4 | Anti-popularity bias | Phase 1 | ✅ Done: max_popularity ≤ 65, hidden-gem preference |
| 5 | Live deployed API + frontend | Phase 1 | ✅ Done: HF Spaces + Vercel |
| 6 | Varied results per query | Phase 2 | ✅ Done: 4 term sets × 4 offsets |
| 7 | Lyrics-to-source-song detection | Phase 2 | ✅ Done: heuristic + optional Musixmatch |
| 8 | Meaningful, varied match scores | Phase 2 | ✅ Done: search-rank scoring |
| 9 | Fetch and analyse song lyrics per track | Phase 3 | 🔲 Musixmatch lyrics API |
| 10 | Spectrogram analysis: VGGish CNN on preview audio | Phase 3 | 🔲 Raw acoustic texture embedding |
| 11 | Structured audio features (valence/energy/tempo) | Phase 3 | 🔲 Extended Access or Essentia |
| 12 | Semantic match scoring via lyrical emotion | Phase 3 | 🔲 Replaces search-rank proxy |
| 13 | Valence-Arousal grid fusion (text + lyrics + spectrogram + features) | Phase 3 | 🔲 Full 4-modality fusion layer |
| 14 | Natural-language recommendation justifications | Phase 3 | 🔲 XAI / template generation |
| 15 | In-session user feedback loop | Phase 3 | 🔲 Thumbs up/down + re-ranking |
| 16 | Full multimodal pipeline (all 4 signals fused) | End state | 🔲 Full original vision realised |

---

## What Changed From the Original Vision (And Why It's Better)

| Original | Current | Reason |
|---|---|---|
| Own music database | Spotify (100M+ tracks, live) | Vast catalogue, no maintenance, always current |
| VGGish CNN for audio | Planned for Phase 3 | Spotify API + Extended Access covers this more cheaply |
| ALBERT/BERT for lyrics | Planned for Phase 3 via Musixmatch | Lyrics only needed at recommendation time, not training time |
| LSTM fusion engine | Lightweight cosine fusion planned | Simpler, interpretable, easier to debug |
| Batch/static recommendations | Real-time, randomised per query | Eliminates filter bubbles immediately |
| Research prototype | Production deployment | Real users, real feedback, real iteration |

The architecture is **thinner today but structurally identical in direction**. Every multimodal channel from the original design (lyrical semantics, acoustic mood, fusion) is still in the roadmap. They now plug into a production system instead of a research prototype.

---

## Summary

> EUMORA today: text in → emotion classified → Spotify searched with mood keywords → anti-popularity filtered → ranked by relevance → source song pinned if lyrics detected.
>
> EUMORA end state: text in → emotion classified → Spotify tracks enriched with lyrical + acoustic analysis → fused match score on Valence-Arousal grid → ranked with XAI explanations → user feedback refines in real time.

The core insight, *understand how someone feels, not what's popular*, is live and working. The multimodal layers that deepen that understanding are the clear, sequenced next steps.