# EUMORA: Revised Product Plan

## What Is EUMORA?

EUMORA is an emotion-aware music recommendation system. You describe how you feel in plain text and it recommends songs that match that emotional state. No popularity contests, no filter bubbles, no cold-start delays.

The original architecture envisioned a full multimodal pipeline built from scratch: our own music database, VGGish audio analysis, ALBERT/BERT lyric transformers, and an LSTM fusion engine. We have since pivoted to a leaner, production-grade architecture that ships today while preserving every multimodal avenue for future expansion.

---
## Phase 1: Where We Started

**Objective:** Prove the core premise. Given free-text input, classify the user's emotion and return music that matches it.

| Component | Original Plan | What We Actually Built |
|---|---|---|
| Emotion model | Fine-tuned DistilBERT, 7-class | Fine-tuned **DeBERTa-v3-base**, **8-class** (+ sarcasm) |
| Training data | Generic emotion dataset | Combined dataset: 59 k+ samples across GoEmotions, tweet corpora, and a sarcasm corpus |
| Music retrieval | Own database + LSTM matching | **Spotify Search API** with emotion-keyed queries |
| Deployment | Research prototype | **FastAPI** on **HF Spaces** (Docker), frontend on **Vercel** |
| Popularity bias | Penalise via score | Max-popularity filter (≤ 65) + anti-chart-topper logic |
| Match score | Valence-Arousal distance | **Search-rank position** (Spotify relevance ordering, 95 % → 50 %) |
| Cold start | N/A | Zero: no listening history required |

**Result:** A live, working product at [eumora.vercel.app](https://eumora.vercel.app).

---
## Phase 2: Where We Are Right Now

### Architecture

```
            User text input
                   │
                   ▼
┌─────────────────────────────────────┐
│  DeBERTa-v3-base (fine-tuned)       │  ← Emotion classifier
│  8 classes: sadness · joy · love    │    59 k+ training samples
│  anger · fear · surprise · neutral  │    Sarcasm prior calibration
│  sarcasm                            │    Confidence + probability mix
└──────────────────┬──────────────────┘
                   │ predict_result
                   ▼
┌─────────────────────────────────────┐
│  EmotionFeatureMapper               │  ← Emotion → Spotify params
│  Valence / Energy / Danceability    │    Blended across top-2 emotions
│  Tempo / Mode / Seed genres         │    Confidence-scaled windows
└──────────────────┬──────────────────┘
                   │ query + params
                   ▼
┌─────────────────────────────────────┐
│  Spotify Search API (/v1/search)    │  ← 4 term-sets × 4 offsets
│  Randomised mood keywords           │    = 16 result pools per emotion
│  Random page offset (0/10/20/30)    │    limit = 10 (Spotify cap)
└──────────────────┬──────────────────┘
                   │ raw tracks
                   ▼
┌─────────────────────────────────────┐
│  Score + Filter                     │
│  • Max popularity ≤ 65              │    Anti-chart-topper filter
│  • Audio features (graceful 403)    │    Deprecated for standard apps
│  • Search-rank scoring 95 % → 50 %  │    Highest relevance first
│  • Diversity filter (≤2/artist)     │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│  Lyrics Source Pinning              │  ← If input looks like lyrics
│  1. Musixmatch lyrics search        │    (optional, key-gated)
│  2. Title-word heuristic fallback   │    Pins source song at #1 / 100 %
└──────────────────┬──────────────────┘
                   │
                   ▼
           Ranked track list
```
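The `EmotionFeatureMapper` stage blends targets across the top-2 emotions and scales its search windows by confidence. A minimal sketch of how that could work follows; the target table, the `blend_targets` name, and the window-widening rule are illustrative assumptions, not the deployed implementation.

```python
from typing import Dict, Tuple

# Hypothetical per-emotion targets on 0-1 scales; the values are illustrative only
# and cover just three of the eight classes.
EMOTION_TARGETS: Dict[str, Dict[str, float]] = {
    "sadness": {"valence": 0.20, "energy": 0.30, "danceability": 0.35},
    "joy": {"valence": 0.85, "energy": 0.75, "danceability": 0.70},
    "anger": {"valence": 0.30, "energy": 0.90, "danceability": 0.50},
}


def blend_targets(
    probs: Dict[str, float], base_window: float = 0.15
) -> Dict[str, Tuple[float, float]]:
    """Blend the two most probable emotions into a (min, max) window per feature."""
    top2 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
    total = sum(p for _, p in top2)
    if total <= 0:
        raise ValueError("probabilities must contain a positive value")
    confidence = top2[0][1]
    # Assumed rule: lower confidence widens the window so the search stays permissive.
    half_width = base_window * (2.0 - confidence)
    windows: Dict[str, Tuple[float, float]] = {}
    for feature in ("valence", "energy", "danceability"):
        # Probability-weighted average of the two emotions' targets.
        centre = sum(EMOTION_TARGETS[name][feature] * p for name, p in top2) / total
        windows[feature] = (max(0.0, centre - half_width), min(1.0, centre + half_width))
    return windows


windows = blend_targets({"sadness": 0.6, "anger": 0.3, "joy": 0.1})
```

With these example numbers the energy window is centred on 0.5, halfway between the low-energy sadness target and the high-energy anger target once weighted by probability.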
### Current Capabilities

- **Emotion classification:** 8 emotions, sarcasm-calibrated, ~95 % accuracy on the test set
- **Music recommendations:** live Spotify tracks, varied per query, biased against popular tracks
- **Lyrics detection:** heuristic (always on) + Musixmatch (when a key is set)
- **Match scoring:** search-rank-based, ranging from 95 % down to 50 %, highest first
- **Live deployment:** HF Spaces (API) + Vercel (frontend), CORS locked, secrets managed
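The popularity cap, the per-artist diversity limit, and the 95 % to 50 % search-rank score can be sketched in a few lines. This is an assumption-laden illustration: the simplified track dictionaries (`artist`, `popularity`), the function names, the linear interpolation, and the choice to score after filtering are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def rank_score(position: int, total: int, top: float = 95.0, bottom: float = 50.0) -> float:
    """Linearly map search position 0..total-1 onto top..bottom percent."""
    if total <= 1:
        return top
    return round(top - (top - bottom) * position / (total - 1), 1)


def score_and_filter(
    tracks: List[Dict], max_popularity: int = 65, per_artist: int = 2
) -> List[Dict]:
    """Apply the popularity cap and diversity limit, then score by surviving rank."""
    kept = [t for t in tracks if t["popularity"] <= max_popularity]
    seen: Counter = Counter()
    diverse = []
    for track in kept:  # input order is Spotify's relevance order
        if seen[track["artist"]] < per_artist:
            seen[track["artist"]] += 1
            diverse.append(track)
    # Assumption: scores are assigned after filtering, so the best survivor gets 95 %.
    return [dict(t, match=rank_score(i, len(diverse))) for i, t in enumerate(diverse)]


raw = [
    {"id": "a", "artist": "X", "popularity": 40},
    {"id": "b", "artist": "Y", "popularity": 90},  # dropped: too popular
    {"id": "c", "artist": "X", "popularity": 50},
    {"id": "d", "artist": "X", "popularity": 10},  # dropped: third track by X
    {"id": "e", "artist": "Z", "popularity": 60},
]
ranked = score_and_filter(raw)
```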
### Known Constraints (Spotify API, Nov 2024 Policy)

| Endpoint | Status | Workaround |
|---|---|---|
| `/v1/recommendations` | 404 (deprecated without Extended Access) | Replaced with `/v1/search` |
| `/v1/audio-features` | 403 (blocked without Extended Access) | Graceful degradation (search-rank scoring) |
| `/v1/search` limit | Max 10 results per request | Randomised offset to sample a wider pool |
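The randomised-offset workaround pairs one of four keyword sets with one of four page offsets, giving 16 result pools per emotion. A sketch of building such a request URL follows. The keyword sets and the `build_search_url` helper are hypothetical, and authentication (the bearer token header) is omitted.

```python
import random
from urllib.parse import urlencode

# Hypothetical keyword sets; the real mood vocabulary is not given in this plan.
TERM_SETS = {
    "sadness": ["sad melancholy", "heartbreak slow", "lonely night", "rainy day blue"],
}
OFFSETS = (0, 10, 20, 30)
LIMIT = 10  # per-request cap stated in the constraints table


def build_search_url(emotion: str, rng: random.Random) -> str:
    """Pick one keyword set and one page offset at random for this query."""
    terms = rng.choice(TERM_SETS[emotion])
    offset = rng.choice(OFFSETS)
    query = urlencode({"q": terms, "type": "track", "limit": LIMIT, "offset": offset})
    return f"https://api.spotify.com/v1/search?{query}"


url = build_search_url("sadness", random.Random(0))
pools = len(TERM_SETS["sadness"]) * len(OFFSETS)  # 4 term sets x 4 offsets
```

Passing a seeded `random.Random` keeps the sketch reproducible in tests; a deployment would presumably use an unseeded generator so that results vary per query.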
---

## Phase 3: Future Implementation Scope (Multimodal Expansion)

The pivot to Spotify freed us from building a music database. The multimodal analysis layers from the original architecture are still the roadmap; they now slot in as enrichment on top of Spotify tracks rather than as a replacement for them.

### 3.1 Lyrical Semantics Module

**What:** For each recommended Spotify track, fetch the lyrics and run a transformer (ALBERT or DeBERTa) to extract lyrical emotion, themes, and mood.

**Data source:** Musixmatch API (`track.lyrics.get`), using the same key already planned for lyrics detection.

**Output:** A lyrical emotion vector per track that can be fused with the user's emotional state.

**Impact:** Replace the current search-rank proxy score with a genuine semantic match score. A song about heartbreak should rank higher than one that just happens to appear in a "sad" search, regardless of Spotify's relevance ordering.
### 3.2 Spectrogram Analysis Module (Raw Audio CNN)

**What:** For each track, download the 30-second preview clip from Spotify (`preview_url`) and convert it into a **log-mel spectrogram**, a 2D time-frequency image of the audio signal. Feed that image through **VGGish** (Google's audio CNN, released alongside AudioSet) to extract 128-dimensional acoustic embeddings (one per 0.96 s patch, pooled over the clip) that capture timbre, texture, rhythm density, and harmonic content.

**Pipeline:**

```
preview_url (MP3, 30 s)
        │
        ▼
Decode audio → resample to 16 kHz mono
        │
        ▼
Log-mel spectrogram (64 mel bins, 25 ms windows, 10 ms hop)
        │
        ▼
VGGish CNN → 128-d acoustic embedding
        │
        ▼
Acoustic texture vector (dark/bright, dense/sparse, smooth/aggressive)
```
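To size this pipeline, it helps to know how many spectrogram frames and VGGish input patches a 30-second clip yields. The sketch below does that arithmetic and mean-pools per-patch embeddings into one vector. The constants follow VGGish's published defaults as best understood here (16 kHz, 25 ms window, 10 ms hop, non-overlapping 96-frame patches); mean pooling is an assumed aggregation choice, and no audio decoding or model inference is performed.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000  # Hz, after resampling
WINDOW_S = 0.025      # 25 ms analysis window
HOP_S = 0.010         # 10 ms hop between windows
PATCH_FRAMES = 96     # frames per VGGish input patch (0.96 s)


def spectrogram_frames(duration_s: float) -> int:
    """Number of full analysis windows that fit in a clip of this length."""
    samples = int(round(duration_s * SAMPLE_RATE))
    window = int(round(WINDOW_S * SAMPLE_RATE))  # 400 samples
    hop = int(round(HOP_S * SAMPLE_RATE))        # 160 samples
    if samples < window:
        return 0
    return 1 + (samples - window) // hop


def vggish_patches(duration_s: float) -> int:
    """Number of non-overlapping 96-frame patches, i.e. embeddings per clip."""
    return spectrogram_frames(duration_s) // PATCH_FRAMES


def mean_pool(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-patch embeddings into a single clip-level vector."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


frames = spectrogram_frames(30.0)
patches = vggish_patches(30.0)
pooled = mean_pool([[1.0, 3.0], [3.0, 5.0]])  # toy 2-d stand-ins for 128-d vectors
```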
**What it captures that structured features miss:**

- Timbral texture (why a minor-key piano ballad feels different from a minor-key metal track even with the same BPM and valence)
- Spectral brightness / darkness (correlated with perceived emotional tone)
- Onset density (busyness, complexity)
- Harmonic vs. percussive energy ratio
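**Data source:** Spotify `preview_url` (30 s MP3, no auth required to download), where the track object provides one. The field can be `null`, and Spotify's November 2024 changes also restricted preview URLs for apps without Extended Access, so availability has to be checked per track.

**Output:** A 128-d embedding per track, projected down to a mood-relevant subspace.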
### 3.3 Structured Audio Features Module

**What:** Recover Spotify's high-level audio descriptors (**valence, energy, danceability, tempo, key, mode, acousticness, speechiness, instrumentalness**) per track. These are human-interpretable scalars on top of the raw signal.

**Current blocker:** Spotify `/v1/audio-features` returns 403 without Extended Access (deprecated Nov 2024).

**Resolution path:**

- **Option A (fastest):** Apply for Spotify Extended Access: free, ~2-week review, unlocks both `/audio-features` and `/recommendations`
- **Option B (no dependency):** AcousticBrainz open dataset (~2M tracks, pre-computed features)
- **Option C (compute ourselves):** Run **Essentia** (open-source, from the Music Technology Group at Universitat Pompeu Fabra) on preview URLs to extract tempo, key, mode, spectral centroid, and loudness directly

**Output:** Valence / Energy / Danceability / Tempo / Mode scalars per track, the structured side of the Valence-Arousal grid.
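Until one of these options lands, scoring has to degrade gracefully when audio features are unavailable. The following sketch shows one way to express that fallback. The `fetch` callable, its `(status, payload)` return shape, and the 50/50 blend are hypothetical choices for illustration, not the deployed logic.

```python
from typing import Callable, Dict, Optional, Tuple

FetchResult = Tuple[int, Optional[Dict[str, float]]]


def features_or_none(
    track_id: str, fetch: Callable[[str], FetchResult]
) -> Optional[Dict[str, float]]:
    """Return audio features, or None when the endpoint is blocked or gone."""
    status, payload = fetch(track_id)
    if status == 200 and payload is not None:
        return payload
    if status in (403, 404):
        return None  # blocked or deprecated: the caller keeps the search-rank score
    raise RuntimeError(f"unexpected status {status} for track {track_id}")


def final_score(
    rank_score: float, features: Optional[Dict[str, float]], target_valence: float
) -> float:
    """Blend rank score with valence closeness when features exist (assumed 50/50)."""
    if features is None:
        return rank_score
    closeness = 1.0 - abs(features["valence"] - target_valence)
    return round(0.5 * rank_score + 0.5 * 100.0 * closeness, 1)


# Stand-in fetchers simulating the blocked and unlocked cases.
blocked = lambda track_id: (403, None)
unlocked = lambda track_id: (200, {"valence": 0.3})

degraded = final_score(80.0, features_or_none("t1", blocked), target_valence=0.2)
enriched = final_score(80.0, features_or_none("t1", unlocked), target_valence=0.2)
```

Because the fetcher is injected, the same scoring code could sit in front of Spotify, AcousticBrainz, or an Essentia-based extractor without changes.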
### 3.4 Multimodal Fusion Layer

**What:** Combine **four** independent signals into one calibrated match score:

```
User emotional state   (DeBERTa output)          ──┐
Lyrical emotion vec    (transformer on lyrics)   ──┤
Acoustic embedding     (VGGish on spectrogram)   ──┼──► Fusion head → match_score
Structured features    (valence/energy/tempo)    ──┘
```

**Method:**

1. Project each modality into a shared Valence-Arousal-Dominance (VAD) space
2. Compute weighted cosine similarity between user state vector and track vector
3. Weights tunable (or learned from user feedback in Phase 3.6)

**Output:** A single calibrated match score (0–100 %) grounded in all four modalities simultaneously, the full Valence-Arousal grid match originally planned.
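The method above can be sketched as a weighted mean of per-modality cosine similarities, rescaled to a percentage. This is one possible reading of "weighted cosine similarity"; the function names, the zero-centred VAD convention, and the example weights are assumptions. The projection of each modality into VAD space (step 1) is taken as already done.

```python
import math
from typing import Dict, Sequence

Vec = Sequence[float]  # (valence, arousal, dominance), assumed centred on 0 in [-1, 1]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity, defined as 0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_match(user: Vec, modalities: Dict[str, Vec], weights: Dict[str, float]) -> float:
    """Weighted mean of per-modality cosine similarity, rescaled to 0-100 %."""
    total_weight = sum(weights[name] for name in modalities)
    if total_weight <= 0:
        raise ValueError("at least one modality needs a positive weight")
    similarity = sum(
        weights[name] * cosine(user, vec) for name, vec in modalities.items()
    ) / total_weight
    # Cosine lies in [-1, 1]; shift and scale it onto a percentage.
    return round(50.0 * (similarity + 1.0), 1)


user_state = (1.0, 0.0, 0.0)
track = {"lyrics": (1.0, 0.0, 0.0), "acoustic": (0.0, 1.0, 0.0)}
score = fused_match(user_state, track, {"lyrics": 0.75, "acoustic": 0.25})
```

Normalising by the weights of the modalities actually present means a track with no lyrics or no preview audio can still be scored on whatever signals remain.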
### 3.5 Explainability (XAI Layer)

**What:** Natural-language justification for every recommended track.

**Example output:**

> *"Selected because its slow acoustic tempo (68 BPM) and lyrical themes of isolation strongly align with your expressed feeling of loneliness."*

**Method:** Template-based generation seeded by the acoustic + lyrical features that drove the match score. Can be upgraded to a small LLM-generated explanation later.

**Impact:** Transparency: users understand why they got a recommendation, not just what it is.
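Template-based generation of this kind can be quite small. The sketch below reproduces the example sentence from a feature dictionary; the tempo thresholds, the feature keys (`tempo`, `acousticness`, `lyric_theme`), and the fallback wording are hypothetical.

```python
from typing import Dict

# Assumed BPM thresholds for describing tempo in words.
TEMPO_WORDS = [(80.0, "slow"), (120.0, "mid-tempo"), (float("inf"), "fast")]


def tempo_word(bpm: float) -> str:
    """Return the first label whose upper bound exceeds the given BPM."""
    for limit, word in TEMPO_WORDS:
        if bpm < limit:
            return word
    return "fast"


def explain(emotion: str, features: Dict) -> str:
    """Fill a sentence template from whichever features are available."""
    parts = []
    bpm = features.get("tempo")
    if bpm is not None:
        style = "acoustic " if features.get("acousticness", 0.0) >= 0.5 else ""
        parts.append(f"its {tempo_word(bpm)} {style}tempo ({bpm:.0f} BPM)")
    theme = features.get("lyric_theme")
    if theme:
        parts.append(f"lyrical themes of {theme}")
    if not parts:
        # No enrichment yet: fall back to an honest description of the current scoring.
        return f"Selected as a close search match for your expressed feeling of {emotion}."
    verb = "align" if (len(parts) > 1 or theme) else "aligns"
    reasons = " and ".join(parts)
    return f"Selected because {reasons} strongly {verb} with your expressed feeling of {emotion}."


text = explain("loneliness", {"tempo": 68, "acousticness": 0.8, "lyric_theme": "isolation"})
```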
### 3.6 User Feedback Loop

**What:** Thumbs up / down per track, feeding back into the recommendation ranking for that session.

**Impact:** Closes the loop between static model output and real user response. Enables collection of fine-tuning data over time.
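One simple way to re-rank within a session is to drop downvoted tracks and nudge the scores of other tracks by the same artists. The sketch below illustrates that idea only; the vote encoding, the 5-point artist step, and the simplified track dictionaries are assumptions rather than a planned design.

```python
from typing import Dict, List


def rerank(tracks: List[Dict], votes: Dict[str, int], artist_step: float = 5.0) -> List[Dict]:
    """Re-rank a session's tracks; votes maps track id to +1 (up) or -1 (down)."""
    by_id = {t["id"]: t for t in tracks}
    artist_bias: Dict[str, float] = {}
    for track_id, vote in votes.items():
        if track_id in by_id:
            artist = by_id[track_id]["artist"]
            artist_bias[artist] = artist_bias.get(artist, 0.0) + vote * artist_step
    adjusted = [
        # Clamp the shifted score to the 0-100 % range.
        dict(t, match=max(0.0, min(100.0, t["match"] + artist_bias.get(t["artist"], 0.0))))
        for t in tracks
        if votes.get(t["id"], 0) >= 0  # downvoted tracks leave the list
    ]
    return sorted(adjusted, key=lambda t: t["match"], reverse=True)


session = [
    {"id": "a", "artist": "A", "match": 90.0},
    {"id": "b", "artist": "B", "match": 80.0},
    {"id": "c", "artist": "B", "match": 70.0},
    {"id": "d", "artist": "C", "match": 75.0},
]
updated = rerank(session, {"b": -1, "d": 1})
```

Here the downvote on `b` removes it and lowers the other track by artist B, while the upvote on `d` lifts it above `c`.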
### 3.7 Extended Spotify Access

**What:** Apply for Spotify's Extended Access program (free).

**Unlocks:**

- `/v1/audio-features` (valence, energy, danceability, tempo per track)
- `/v1/recommendations` (seed-based recommendations)
- Higher search limits

**Impact:** Makes structured audio features available immediately via the API we already use, without needing Essentia or AcousticBrainz.

---
## Objectives: Start to End

| # | Objective | Phase | Status |
|---|---|---|---|
| 1 | Fine-tune an NLP model to classify text emotions | Phase 1 | ✅ Done: DeBERTa-v3-base, 8 classes, ~95 % |
| 2 | Map emotions to music parameters | Phase 1 | ✅ Done: EmotionFeatureMapper, Valence/Energy/Danceability targets |
| 3 | Retrieve real tracks from a live music source | Phase 1 | ✅ Done: Spotify Search API |
| 4 | Anti-popularity bias | Phase 1 | ✅ Done: max_popularity ≤ 65, hidden-gem preference |
| 5 | Live deployed API + frontend | Phase 1 | ✅ Done: HF Spaces + Vercel |
| 6 | Varied results per query | Phase 2 | ✅ Done: 4 term sets × 4 offsets |
| 7 | Lyrics-to-source-song detection | Phase 2 | ✅ Done: heuristic + optional Musixmatch |
| 8 | Meaningful, varied match scores | Phase 2 | ✅ Done: search-rank scoring |
| 9 | Fetch and analyse song lyrics per track | Phase 3 | 🔲 Musixmatch lyrics API |
| 10 | Spectrogram analysis: VGGish CNN on preview audio | Phase 3 | 🔲 Raw acoustic texture embedding |
| 11 | Structured audio features (valence/energy/tempo) | Phase 3 | 🔲 Extended Access or Essentia |
| 12 | Semantic match scoring via lyrical emotion | Phase 3 | 🔲 Replaces search-rank proxy |
| 13 | Valence-Arousal grid fusion (text + lyrics + spectrogram + features) | Phase 3 | 🔲 Full 4-modality fusion layer |
| 14 | Natural-language recommendation justifications | Phase 3 | 🔲 XAI / template generation |
| 15 | In-session user feedback loop | Phase 3 | 🔲 Thumbs up/down + re-ranking |
| 16 | Full multimodal pipeline (all 4 signals fused) | End state | 🔲 Full original vision realised |

---
## What Changed From the Original Vision (And Why It's Better)

| Original | Current | Reason |
|---|---|---|
| Own music database | Spotify (100M+ tracks, live) | Vast catalogue, no maintenance, always current |
| VGGish CNN for audio | Planned for Phase 3 | Spotify API + Extended Access covers this more cheaply |
| ALBERT/BERT for lyrics | Planned for Phase 3 via Musixmatch | Lyrics only needed at recommendation time, not training time |
| LSTM fusion engine | Lightweight cosine fusion planned | Simpler, interpretable, easier to debug |
| Batch/static recommendations | Real-time, randomised per query | Eliminates filter bubbles immediately |
| Research prototype | Production deployment | Real users, real feedback, real iteration |

The architecture is **thinner today but structurally identical in direction**. Every multimodal channel from the original design (lyrical semantics, acoustic mood, fusion) is still on the roadmap; each now plugs into a production system instead of a research prototype.

---
## Summary

> EUMORA today: text in → emotion classified → Spotify searched with mood keywords → anti-popularity filtered → ranked by relevance → source song pinned if lyrics detected.
>
> EUMORA end state: text in → emotion classified → Spotify tracks enriched with lyrical + acoustic analysis → fused match score on Valence-Arousal grid → ranked with XAI explanations → user feedback refines in real time.

The core insight (*understand how someone feels, not what's popular*) is live and working. The multimodal layers that deepen that understanding are the clear, sequenced next steps.