EUMORA: Revised Product Plan
What Is EUMORA?
EUMORA is an emotion-aware music recommendation system. You describe how you feel in plain text and it recommends songs that match that emotional state. No popularity contests, no filter bubbles, no cold-start delays.
The original architecture envisioned a full multimodal pipeline built from scratch: our own music database, VGGish audio analysis, ALBERT/BERT lyric transformers, and an LSTM fusion engine. We have since pivoted to a leaner, production-grade architecture that ships today while preserving every multimodal avenue for future expansion.
Phase 1: Where We Started
Objective: Prove the core premise. Given free-text input, classify the user's emotion and return music that matches it.
| Component | Original Plan | What We Actually Built |
|---|---|---|
| Emotion model | Fine-tuned DistilBERT, 7-class | Fine-tuned DeBERTa-v3-base, 8-class (+ sarcasm) |
| Training data | Generic emotion dataset | Combined dataset: 59 k+ samples across GoEmotions, tweet corpora, and a sarcasm corpus |
| Music retrieval | Own database + LSTM matching | Spotify Search API with emotion-keyed queries |
| Deployment | Research prototype | FastAPI on HF Spaces (Docker), frontend on Vercel |
| Popularity bias | Penalise via score | Max-popularity filter (≤ 65) + anti-chart-topper logic |
| Match score | Valence-Arousal distance | Search-rank position (Spotify relevance ordering, 95 % → 50 %) |
| Cold start | N/A | None: no listening history required |
Result: A live, working product at eumora.vercel.app.
Phase 2: Where We Are Right Now
Architecture
User text input
                  │
                  ▼
┌─────────────────────────────────────┐
│ DeBERTa-v3-base (fine-tuned)        │  ← Emotion classifier
│ 8 classes: sadness · joy · love     │    59 k+ training samples
│ anger · fear · surprise · neutral   │    Sarcasm prior calibration
│ sarcasm                             │    Confidence + probability mix
└─────────────────┬───────────────────┘
                  │ predict_result
                  ▼
┌─────────────────────────────────────┐
│ EmotionFeatureMapper                │  ← Emotion → Spotify params
│ Valence / Energy / Danceability     │    Blended across top-2 emotions
│ Tempo / Mode / Seed genres          │    Confidence-scaled windows
└─────────────────┬───────────────────┘
                  │ query + params
                  ▼
┌─────────────────────────────────────┐
│ Spotify Search API (/v1/search)     │  ← 4 term-sets × 4 offsets
│ Randomised mood keywords            │    = 16 result pools per emotion
│ Random page offset (0/10/20/30)     │    limit = 10 (Spotify cap)
└─────────────────┬───────────────────┘
                  │ raw tracks
                  ▼
┌─────────────────────────────────────┐
│ Score + Filter                      │
│ • Max popularity ≤ 65               │    Anti-chart-topper filter
│ • Audio features (graceful 403)     │    Spotify deprecated for std apps
│ • Search-rank scoring 95 % → 50 %   │    Highest relevance first
│ • Diversity filter (≤2/artist)      │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│ Lyrics Source Pinning               │  ← If input looks like lyrics
│ 1. Musixmatch lyrics search         │    (optional, key-gated)
│ 2. Title-word heuristic fallback    │    Pins source song at #1 / 100 %
└─────────────────┬───────────────────┘
                  │
                  ▼
Ranked track list
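The EmotionFeatureMapper stage blends targets across the top-2 emotions and scales its windows by confidence. The sketch below illustrates one way that could work. The target numbers, the `blend_targets` helper, and the rule that lower confidence widens the window are all assumptions for illustration, not the production values or logic.

```python
# Illustrative targets only; the real EmotionFeatureMapper values are not listed in this plan.
TARGETS = {
    "sadness": {"valence": 0.20, "energy": 0.30},
    "joy":     {"valence": 0.85, "energy": 0.75},
    "anger":   {"valence": 0.25, "energy": 0.90},
    "neutral": {"valence": 0.50, "energy": 0.50},
}

def blend_targets(probs, base_window=0.10, max_window=0.30):
    """Blend feature targets across the top-2 emotions, weighted by probability.

    probs: {emotion: probability} from the classifier (assumed to sum to ~1).
    Returns {feature: (low, target, high)}.
    """
    top2 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
    total = sum(p for _, p in top2)
    blended = {
        feature: sum(TARGETS[e][feature] * p for e, p in top2) / total
        for feature in ("valence", "energy")
    }
    # Assumption: lower top-1 confidence widens the tolerance window.
    confidence = top2[0][1]
    window = base_window + (max_window - base_window) * (1.0 - confidence)
    return {
        f: (max(0.0, v - window), v, min(1.0, v + window))
        for f, v in blended.items()
    }
```

With a 0.6 sadness / 0.2 neutral prediction, the valence target lands between the two emotions' targets, closer to sadness, and the window is wider than it would be for a 0.9 prediction.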
Current Capabilities
- Emotion classification: 8 emotions, sarcasm-calibrated, ~95 % accuracy on test set
- Music recommendations: live Spotify tracks, varied per query, anti-popularity biased
- Lyrics detection: heuristic (always on) + Musixmatch (when key is set)
- Match scoring: search-rank-based, varied 95 % → 50 %, highest first
- Live deployment: HF Spaces (API) + Vercel (frontend), CORS locked, secrets managed
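The scoring and filtering behaviour listed above can be sketched in a few lines. This is a minimal illustration, assuming simplified track dicts with `popularity` and `artist` keys, a linear fall-off from 95 to 50, and scores assigned after filtering. The real implementation may differ on each of those points.

```python
def score_and_filter(tracks, max_popularity=65, per_artist=2, top=95.0, bottom=50.0):
    """Filter tracks (given in Spotify relevance order) and attach a rank-based score.

    tracks: list of dicts with at least 'popularity' and 'artist' (hypothetical shape).
    """
    kept, seen = [], {}
    for t in tracks:
        if t["popularity"] > max_popularity:      # anti-chart-topper filter
            continue
        count = seen.get(t["artist"], 0)
        if count >= per_artist:                   # diversity filter
            continue
        seen[t["artist"]] = count + 1
        kept.append(t)
    # Linear interpolation from `top` (first result) to `bottom` (last result).
    step = (top - bottom) / (len(kept) - 1) if len(kept) > 1 else 0.0
    return [dict(t, match=round(top - i * step, 1)) for i, t in enumerate(kept)]
```

Because the score depends only on position, it says how relevant Spotify considered a track, not how well the track matches the emotion.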
Known Constraints (Spotify API, Nov 2024 Policy)
| Endpoint | Status | Workaround |
|---|---|---|
| /v1/recommendations | 404, deprecated without Extended Access | Replaced with /v1/search |
| /v1/audio-features | 403, blocked without Extended Access | Graceful degradation (search-rank scoring) |
| /v1/search limit | Max 10 results per request | Randomised offset to sample wider pool |
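The randomised-offset workaround can be sketched as follows. The mood keywords and the `build_search_url` and `all_pools` helpers are hypothetical. The `q`, `type`, `limit`, and `offset` query parameters are the standard ones for the Spotify search endpoint, and the request would still need a bearer token, which is omitted here.

```python
import random
from urllib.parse import urlencode

# Hypothetical mood keywords; the production term sets are not listed in this plan.
TERM_SETS = {
    "sadness": ["sad melancholy", "heartbreak slow", "lonely acoustic", "rainy day blues"],
}
OFFSETS = (0, 10, 20, 30)

def build_search_url(emotion, rng=random):
    """Pick one term set and one page offset at random and build the search URL."""
    params = {
        "q": rng.choice(TERM_SETS[emotion]),
        "type": "track",
        "limit": 10,                      # per-request cap noted in the table above
        "offset": rng.choice(OFFSETS),
    }
    return "https://api.spotify.com/v1/search?" + urlencode(params)

def all_pools(emotion):
    """Enumerate every (term set, offset) combination for one emotion."""
    return [(terms, offset) for terms in TERM_SETS[emotion] for offset in OFFSETS]
```

Four term sets times four offsets gives the 16 result pools per emotion, so repeated queries for the same emotion rarely return the same page.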
Phase 3: Future Implementation Scope (Multimodal Expansion)
The pivot to Spotify freed us from building a music database. The multimodal analysis layers from the original architecture are still the roadmap; they now slot in as enrichment on top of Spotify tracks rather than as a replacement for them.
3.1 Lyrical Semantics Module
What: For each recommended Spotify track, fetch the lyrics and run a transformer (ALBERT or DeBERTa) to extract lyrical emotion, themes, and mood.
Data source: Musixmatch API (track.lyrics.get), using the same key already planned for lyrics detection.
Output: A lyrical emotion vector per track that can be fused with the user's emotional state.
Impact: Replace the current search-rank proxy score with a genuine semantic match score. A song about heartbreak should rank higher than one that just happens to appear in a "sad" search, regardless of Spotify's relevance ordering.
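A minimal sketch of that re-ranking, assuming the lyric transformer emits a probability vector over the same 8 labels as the user-side classifier. The `rerank_by_lyrics` helper and the choice of cosine similarity are illustrative assumptions, not a committed design.

```python
import math

# Vector order assumed for both the user vector and each lyric vector.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise", "neutral", "sarcasm"]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rerank_by_lyrics(user_vec, tracks):
    """tracks: list of (title, lyric_emotion_vec). Returns titles, best match first."""
    scored = [(cosine(user_vec, vec), title) for title, vec in tracks]
    return [title for _, title in sorted(scored, reverse=True)]
```

A track whose lyrics score high on sadness would then outrank a track that merely surfaced first in a "sad" keyword search.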
3.2 Spectrogram Analysis Module (Raw Audio CNN)
What: For each track, download the 30-second preview clip from Spotify (preview_url) and convert it into a mel spectrogram, a 2D time-frequency image of the audio signal. Feed that image through VGGish (Google's audio CNN, released alongside AudioSet) to extract a 128-dimensional acoustic embedding that captures timbre, texture, rhythm density, and harmonic content.
Pipeline:
preview_url (MP3, 30 s)
        │
        ▼
Decode audio → resample to 16 kHz mono
        │
        ▼
Log-mel spectrogram (64 mel bins, the VGGish input size; 25 ms frames)
        │
        ▼
VGGish CNN → 128-d acoustic embedding
        │
        ▼
Acoustic texture vector (dark/bright, dense/sparse, smooth/aggressive)
What it captures that structured features miss:
- Timbral texture (why a minor-key piano ballad feels different from a minor-key metal track even with the same BPM and valence)
- Spectral brightness / darkness (correlated with perceived emotional tone)
- Onset density (busyness, complexity)
- Harmonic vs. percussive energy ratio
Data source: Spotify preview_url (30 s MP3, no auth required), already part of the track objects we return. The field is nullable, so some tracks may have no clip.
Output: A 128-d embedding per track, projected down to a mood-relevant subspace.
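To see how one 30-second clip becomes a single per-track vector, the framing arithmetic can be worked through. This is a back-of-envelope sketch: the 16 kHz rate and 25 ms frames come from the pipeline above, while the 10 ms hop and the 96-frame (about 0.96 s) patch size are taken from the public VGGish reference setup. Mean pooling is an assumption, one of several reasonable ways to collapse per-patch embeddings.

```python
SAMPLE_RATE = 16_000                    # Hz, after resampling
CLIP_SECONDS = 30
WINDOW = SAMPLE_RATE * 25 // 1000       # 25 ms -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000          # 10 ms -> 160 samples (assumed hop)
PATCH_FRAMES = 96                       # spectrogram frames per VGGish input patch

def stft_frames(n_samples, window=WINDOW, hop=HOP):
    """Number of full analysis windows that fit in n_samples."""
    return 0 if n_samples < window else 1 + (n_samples - window) // hop

def mean_pool(embeddings):
    """Average a list of equal-length embeddings into one vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

frames = stft_frames(SAMPLE_RATE * CLIP_SECONDS)   # 2998 frames
patches = frames // PATCH_FRAMES                   # 31 non-overlapping patches
```

Under these assumptions each preview yields 31 embeddings of 128 dimensions, which `mean_pool` reduces to the single 128-d vector per track described above.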
3.3 Structured Audio Features Module
What: Recover Spotify's high-level audio descriptors (valence, energy, danceability, tempo, key, mode, acousticness, speechiness, instrumentalness) per track. These are human-interpretable scalars that sit on top of the raw signal.
Current blocker: Spotify /v1/audio-features returns 403 without Extended Access (deprecated Nov 2024).
Resolution path:
- Option A (fastest): Apply for Spotify Extended Access (free, ~2-week review); unlocks both /audio-features and /recommendations
- Option B (no dependency): AcousticBrainz open dataset (~2M tracks, pre-computed features)
- Option C (compute ourselves): Run Essentia (open-source, from the Music Technology Group at Universitat Pompeu Fabra) on preview URLs; it extracts tempo, key, mode, spectral centroid, and loudness directly
Output: Valence / Energy / Danceability / Tempo / Mode scalars per track, the structured side of the Valence-Arousal grid.
3.4 Multimodal Fusion Layer
What: Combine four independent signals into one calibrated match score:
User emotional state  (DeBERTa output)         ──┐
Lyrical emotion vec   (transformer on lyrics)  ──┤
Acoustic embedding    (VGGish on spectrogram)  ──┼──► Fusion head → match_score
Structured features   (valence/energy/tempo)   ──┘
Method:
- Project each modality into a shared Valence-Arousal-Dominance (VAD) space
- Compute weighted cosine similarity between user state vector and track vector
- Weights tunable (or learned from user feedback in Phase 3.6)
Output: A single calibrated match score (0–100 %) grounded in all four modalities simultaneously: the full Valence-Arousal grid match originally planned.
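One possible reading of the method above, sketched under explicit assumptions: each track modality has already been projected to a (valence, arousal, dominance) triple centred on zero, the fused similarity is a weighted average of per-modality cosine similarities, and cosine values in [-1, 1] map linearly onto 0–100 %. The `fuse` helper and the example weights are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def fuse(user_vad, track_modalities, weights):
    """Weighted average of per-modality cosine similarity, scaled to a percentage.

    user_vad: (v, a, d) for the user's emotional state.
    track_modalities: e.g. {"lyrics": (v, a, d), "acoustic": (...), "structured": (...)}.
    weights: {modality: weight}; only modalities present on the track are used.
    """
    total_weight = sum(weights[m] for m in track_modalities)
    similarity = sum(
        weights[m] * cosine(user_vad, vec) for m, vec in track_modalities.items()
    ) / total_weight
    return round(50.0 * (similarity + 1.0), 1)   # [-1, 1] -> [0, 100]
```

Normalising by the weights of the modalities actually present means a track with no preview clip or no lyrics still receives a score from whatever signals remain.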
3.5 Explainability (XAI Layer)
What: Natural-language justification for every recommended track.
Example output:
"Selected because its slow acoustic tempo (68 BPM) and lyrical themes of isolation strongly align with your expressed feeling of loneliness."
Method: Template-based generation seeded by the acoustic + lyrical features that drove the match score. Can be upgraded to a small LLM-generated explanation later.
Impact: Transparency. Users understand why they got a recommendation, not just what it is.
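A template-based generator of this kind could look like the sketch below. The `explain` helper, the `tempo` and `lyric_theme` keys, and the BPM thresholds are hypothetical choices for illustration.

```python
def explain(track, user_emotion):
    """Build a one-sentence justification from whichever features are available.

    track: dict that may contain 'tempo' (BPM) and 'lyric_theme' (hypothetical keys).
    """
    parts = []
    tempo = track.get("tempo")
    if tempo is not None:
        # Illustrative thresholds, not tuned values.
        pace = "slow" if tempo < 90 else "mid-tempo" if tempo < 120 else "fast"
        parts.append(f"its {pace} tempo ({tempo:.0f} BPM)")
    theme = track.get("lyric_theme")
    if theme:
        parts.append(f"lyrical themes of {theme}")
    if not parts:
        return f"Selected because it ranked highly for {user_emotion}."
    # Only the lone tempo phrase is grammatically singular.
    verb = "aligns" if len(parts) == 1 and tempo is not None else "align"
    return (
        f"Selected because {' and '.join(parts)} {verb} "
        f"with your expressed feeling of {user_emotion}."
    )
```

For example, `explain({"tempo": 68, "lyric_theme": "isolation"}, "loneliness")` produces a sentence close to the example output above, and the function falls back to a generic sentence when no features are available.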
3.6 User Feedback Loop
What: Thumbs up / down per track, feeding back into the recommendation ranking for that session.
Impact: Closes the loop between static model output and real user response. Over time, the collected feedback also becomes training data for fine-tuning.
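In-session re-ranking could be as simple as the sketch below. The policy shown (thumbs-down tracks sink to the bottom, and each vote nudges other tracks by the same artist) is a hypothetical example, as are the `rerank_with_feedback` helper and the `boost` size.

```python
def rerank_with_feedback(tracks, feedback, boost=10.0):
    """Re-order tracks for the current session using thumbs up/down votes.

    tracks: list of dicts with 'id', 'artist', 'match' (hypothetical shape).
    feedback: {track_id: +1 or -1}.
    """
    # Sum the votes per artist so one vote influences that artist's other tracks.
    artist_signal = {}
    for t in tracks:
        vote = feedback.get(t["id"], 0)
        artist_signal[t["artist"]] = artist_signal.get(t["artist"], 0) + vote

    def adjusted(t):
        if feedback.get(t["id"]) == -1:
            return float("-inf")          # explicitly rejected: always last
        return t["match"] + boost * artist_signal[t["artist"]]

    return sorted(tracks, key=adjusted, reverse=True)
```

The original match scores are left untouched, so the adjustment lasts only for the session and can be recomputed after every vote.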
3.7 Extended Spotify Access
What: Apply for Spotify's Extended Access program (free).
Unlocks:
- /v1/audio-features (valence, energy, danceability, tempo per track)
- /v1/recommendations (seed-based recommendations)
- Higher search limits
Impact: Makes acoustic analysis available immediately via the API we already have, without needing Essentia or AcousticBrainz.
Objectives: Start to End
| # | Objective | Phase | Status |
|---|---|---|---|
| 1 | Fine-tune an NLP model to classify text emotions | Phase 1 | ✅ Done: DeBERTa-v3-base, 8 classes, ~95 % |
| 2 | Map emotions to music parameters | Phase 1 | ✅ Done: EmotionFeatureMapper, Valence/Energy/Danceability targets |
| 3 | Retrieve real tracks from a live music source | Phase 1 | ✅ Done: Spotify Search API |
| 4 | Anti-popularity bias | Phase 1 | ✅ Done: max_popularity ≤ 65, hidden-gem preference |
| 5 | Live deployed API + frontend | Phase 1 | ✅ Done: HF Spaces + Vercel |
| 6 | Varied results per query | Phase 2 | ✅ Done: 4 term sets × 4 offsets |
| 7 | Lyrics-to-source-song detection | Phase 2 | ✅ Done: heuristic + optional Musixmatch |
| 8 | Meaningful, varied match scores | Phase 2 | ✅ Done: search-rank scoring |
| 9 | Fetch and analyse song lyrics per track | Phase 3 | 🔲 Musixmatch lyrics API |
| 10 | Spectrogram analysis: VGGish CNN on preview audio | Phase 3 | 🔲 Raw acoustic texture embedding |
| 11 | Structured audio features (valence/energy/tempo) | Phase 3 | 🔲 Extended Access or Essentia |
| 12 | Semantic match scoring via lyrical emotion | Phase 3 | 🔲 Replaces search-rank proxy |
| 13 | Valence-Arousal grid fusion (text + lyrics + spectrogram + features) | Phase 3 | 🔲 Full 4-modality fusion layer |
| 14 | Natural-language recommendation justifications | Phase 3 | 🔲 XAI / template generation |
| 15 | In-session user feedback loop | Phase 3 | 🔲 Thumbs up/down + re-ranking |
| 16 | Full multimodal pipeline (all 4 signals fused) | End state | 🔲 Full original vision realised |
What Changed From the Original Vision (And Why It's Better)
| Original | Current | Reason |
|---|---|---|
| Own music database | Spotify (100M+ tracks, live) | Far larger catalogue, no maintenance, always current |
| VGGish CNN for audio | Planned for Phase 3 | Spotify API + Extended Access covers this more cheaply |
| ALBERT/BERT for lyrics | Planned for Phase 3 via Musixmatch | Lyrics only needed at recommendation time, not training time |
| LSTM fusion engine | Lightweight cosine fusion planned | Simpler, interpretable, easier to debug |
| Batch/static recommendations | Real-time, randomised per query | Eliminates filter bubbles immediately |
| Research prototype | Production deployment | Real users, real feedback, real iteration |
The architecture is thinner today but points in the same direction. Every multimodal channel from the original design (lyrical semantics, acoustic mood, fusion) is still on the roadmap; they now plug into a production system instead of a research prototype.
Summary
EUMORA today: text in → emotion classified → Spotify searched with mood keywords → anti-popularity filtered → ranked by relevance → source song pinned if lyrics detected.
EUMORA end state: text in → emotion classified → Spotify tracks enriched with lyrical + acoustic analysis → fused match score on Valence-Arousal grid → ranked with XAI explanations → user feedback refines in real time.
The core insight (understand how someone feels, not what's popular) is live and working. The multimodal layers that deepen that understanding are the clear, sequenced next steps.