EUMORA: Revised Product Plan

What Is EUMORA?

EUMORA is an emotion-aware music recommendation system. You describe how you feel in plain text and it recommends songs that match that emotional state. No popularity contests, no filter bubbles, no cold-start delays.

The original architecture envisioned a full multimodal pipeline built from scratch: our own music database, VGGish audio analysis, ALBERT/BERT lyric transformers, and an LSTM fusion engine. We have since pivoted to a leaner, production-grade architecture that ships today while preserving every multimodal avenue for future expansion.


Phase 1: Where We Started

Objective: Prove the core premise. Given free-text input, classify the user's emotion and return music that matches it.

| Component | Original Plan | What We Actually Built |
|---|---|---|
| Emotion model | Fine-tuned DistilBERT, 7-class | Fine-tuned DeBERTa-v3-base, 8-class (+ sarcasm) |
| Training data | Generic emotion dataset | Combined dataset: 59 k+ samples across GoEmotions, tweet corpora, sarcasm corpus |
| Music retrieval | Own database + LSTM matching | Spotify Search API with emotion-keyed queries |
| Deployment | Research prototype | FastAPI on HF Spaces (Docker), frontend on Vercel |
| Popularity bias | Penalise via score | Max-popularity filter (≤ 65) + anti-chart-topper logic |
| Match score | Valence-Arousal distance | Search-rank position (Spotify relevance ordering, 95 % → 50 %) |
| Cold start | N/A | Zero: no listening history required |

Result: A live, working product at eumora.vercel.app.


Phase 2: Where We Are Right Now

Architecture

User text input
      │
      ▼
┌─────────────────────────────────────┐
│  DeBERTa-v3-base (fine-tuned)       │  ← Emotion classifier
│  8 classes: sadness · joy · love    │    59 k+ training samples
│  anger · fear · surprise · neutral  │    Sarcasm prior calibration
│  sarcasm                            │    Confidence + probability mix
└─────────────────┬───────────────────┘
                  │  predict_result
                  ▼
┌─────────────────────────────────────┐
│  EmotionFeatureMapper               │  ← Emotion → Spotify params
│  Valence / Energy / Danceability    │    Blended across top-2 emotions
│  Tempo / Mode / Seed genres         │    Confidence-scaled windows
└─────────────────┬───────────────────┘
                  │  query + params
                  ▼
┌─────────────────────────────────────┐
│  Spotify Search API (/v1/search)    │  ← 4 term-sets × 4 offsets
│  Randomised mood keywords           │    = 16 result pools per emotion
│  Random page offset (0/10/20/30)    │    limit = 10 (Spotify cap)
└─────────────────┬───────────────────┘
                  │  raw tracks
                  ▼
┌─────────────────────────────────────┐
│  Score + Filter                     │
│  • Max popularity ≤ 65              │  Anti-chart-topper filter
│  • Audio features (graceful 403)    │  Spotify deprecated for std apps
│  • Search-rank scoring 95 % → 50 %  │  Highest relevance first
│  • Diversity filter (≤2/artist)     │
└─────────────────┬───────────────────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│  Lyrics Source Pinning              │  ← If input looks like lyrics
│  1. Musixmatch lyrics search        │    (optional, key-gated)
│  2. Title-word heuristic fallback   │    Pins source song at #1 / 100 %
└─────────────────┬───────────────────┘
                  │
                  ▼
            Ranked track list
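
As a rough illustration of the EmotionFeatureMapper stage, the sketch below blends valence and energy targets across the top-2 predicted emotions, weighted by their probabilities, and widens the acceptance window when confidence is low. The target table, the window formula, and the function name `blend_targets` are assumptions made for illustration, not the values used in production.

```python
# Hypothetical per-emotion targets (valence, energy); illustrative values only.
TARGETS = {
    "sadness": (0.20, 0.30),
    "joy": (0.85, 0.75),
    "anger": (0.25, 0.85),
    "neutral": (0.50, 0.50),
}


def blend_targets(probs: dict[str, float]) -> dict[str, float]:
    """Blend targets across the two most probable emotions."""
    top2 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
    total = sum(p for _, p in top2)
    valence = sum(TARGETS[e][0] * p for e, p in top2) / total
    energy = sum(TARGETS[e][1] * p for e, p in top2) / total
    confidence = top2[0][1]
    # Assumed rule: confident predictions get a narrow window, unsure ones a wide window.
    window = 0.10 + 0.30 * (1.0 - confidence)
    return {"valence": valence, "energy": energy, "window": window}


params = blend_targets({"sadness": 0.6, "neutral": 0.2, "joy": 0.1, "anger": 0.1})
```

With these made-up numbers, a mostly-sad input with a neutral runner-up lands at a low valence and low energy target, with a moderately wide window because the top probability is only 0.6.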

Current Capabilities

  • Emotion classification: 8 emotions, sarcasm-calibrated, ~95 % accuracy on test set
  • Music recommendations: live Spotify tracks, varied per query, anti-popularity biased
  • Lyrics detection: heuristic (always on) + Musixmatch (when key is set)
  • Match scoring: search-rank-based, varied 95 % → 50 %, highest first
  • Live deployment: HF Spaces (API) + Vercel (frontend), CORS locked, secrets managed
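
The Score + Filter stage can be pictured with the sketch below: drop tracks above the popularity ceiling, keep at most two per artist, then assign a match score that falls from 95 % to 50 % with search rank. The linear interpolation and the flat field names (`popularity`, `artist`) are assumptions for illustration; real Spotify track objects nest artists in a list.

```python
from collections import Counter

MAX_POPULARITY = 65
MAX_PER_ARTIST = 2


def score_and_filter(tracks: list[dict]) -> list[dict]:
    """Filter by popularity and artist diversity, then score by rank."""
    per_artist = Counter()
    kept = []
    for track in tracks:  # tracks arrive in Spotify relevance order
        if track["popularity"] > MAX_POPULARITY:
            continue
        if per_artist[track["artist"]] >= MAX_PER_ARTIST:
            continue
        per_artist[track["artist"]] += 1
        kept.append(track)
    n = len(kept)
    scored = []
    for rank, track in enumerate(kept):
        # Assumed linear fall-off: first track 95 %, last track 50 %.
        frac = rank / (n - 1) if n > 1 else 0.0
        scored.append({**track, "match": round(95.0 - 45.0 * frac, 1)})
    return scored


raw = [
    {"name": "A", "artist": "x", "popularity": 40},
    {"name": "B", "artist": "x", "popularity": 90},
    {"name": "C", "artist": "x", "popularity": 30},
    {"name": "D", "artist": "x", "popularity": 20},
    {"name": "E", "artist": "y", "popularity": 65},
]
ranked = score_and_filter(raw)
```

Here track B is dropped for popularity and track D for exceeding the per-artist limit, leaving A, C, and E with scores of 95, 72.5, and 50.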

Known Constraints (Spotify API, Nov 2024 Policy)

| Endpoint | Status | Workaround |
|---|---|---|
| /v1/recommendations | 404, deprecated without Extended Access | Replaced with /v1/search |
| /v1/audio-features | 403, blocked without Extended Access | Graceful degradation (search-rank scoring) |
| /v1/search limit | Max 10 results per request | Randomised offset to sample wider pool |
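
One way to picture the randomised-offset workaround: each request picks one of four keyword sets and one of four page offsets, giving 16 distinct result pools per emotion. The keyword sets and the helper `build_search_params` below are hypothetical; only the /v1/search parameter names (`q`, `type`, `limit`, `offset`) come from Spotify's Web API.

```python
import random
from urllib.parse import urlencode

SEARCH_URL = "https://api.spotify.com/v1/search"
OFFSETS = (0, 10, 20, 30)

# Hypothetical keyword sets for one emotion; the real term lists are not shown here.
TERM_SETS = {
    "sadness": (
        "melancholy acoustic",
        "heartbreak ballad",
        "rainy day piano",
        "slow sad indie",
    ),
}


def build_search_params(emotion: str, rng: random.Random) -> dict:
    """Pick one term set and one page offset at random for this request."""
    return {
        "q": rng.choice(TERM_SETS[emotion]),
        "type": "track",
        "limit": 10,
        "offset": rng.choice(OFFSETS),
    }


params = build_search_params("sadness", random.Random())
url = f"{SEARCH_URL}?{urlencode(params)}"
pool_count = len(TERM_SETS["sadness"]) * len(OFFSETS)  # 4 x 4 = 16
```

The resulting URL still needs a bearer token in the Authorization header before it can be sent; that part is omitted here.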

Phase 3: Future Implementation Scope (Multimodal Expansion)

The pivot to Spotify freed us from building a music database. The multimodal analysis layers from the original architecture are still the roadmap: they now slot in as enrichment on top of Spotify tracks rather than as a replacement for them.

3.1 Lyrical Semantics Module

What: For each recommended Spotify track, fetch the lyrics and run a transformer (ALBERT or DeBERTa) to extract lyrical emotion, themes, and mood.

Data source: Musixmatch API (track.lyrics.get), using the same key already planned for lyrics detection.

Output: A lyrical emotion vector per track that can be fused with the user's emotional state.

Impact: Replace the current search-rank proxy score with a genuine semantic match score. A song about heartbreak should rank higher than one that just happens to appear in a "sad" search, regardless of Spotify's relevance ordering.
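
A minimal sketch of the semantic match score described above, assuming both the user's state and each track's lyrics are represented as probability vectors over the same 8 emotion classes. Cosine similarity is an assumption here; the plan does not fix a similarity measure for this module, and the `lyric_vec` field is a hypothetical name.

```python
import math

EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise", "neutral", "sarcasm"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_lyrics(user_vec: list[float], tracks: list[dict]) -> list[dict]:
    """Sort tracks by similarity between user emotion and lyrical emotion."""
    return sorted(tracks, key=lambda t: cosine(user_vec, t["lyric_vec"]), reverse=True)


# Vectors follow the order of EMOTIONS; values are made up for the example.
user = [0.7, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0]  # mostly sadness
tracks = [
    {"name": "first hit in a 'sad' search", "lyric_vec": [0.1, 0.6, 0.2, 0.0, 0.0, 0.0, 0.1, 0.0]},
    {"name": "song about heartbreak", "lyric_vec": [0.8, 0.0, 0.1, 0.0, 0.1, 0.0, 0.0, 0.0]},
]
best = rank_by_lyrics(user, tracks)[0]["name"]
```

In this toy case the heartbreak song overtakes the track that Spotify returned first, which is the re-ordering this module is meant to produce.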

3.2 Spectrogram Analysis Module (Raw Audio CNN)

What: For each track, download the 30-second preview clip from Spotify (preview_url) and convert it into a mel spectrogram, a 2D time-frequency image of the audio signal. Feed that image through VGGish (Google's audio CNN, released alongside AudioSet and pre-trained on a large YouTube corpus) to extract a 128-dimensional acoustic embedding that captures timbre, texture, rhythm density, and harmonic content.

Pipeline:

preview_url (MP3, 30 s)
      │
      ▼
  Decode audio → resample to 16 kHz mono
      │
      ▼
  Mel spectrogram (64 mel bins, 25 ms frames)
      │
      ▼
  VGGish CNN → 128-d acoustic embedding
      │
      ▼
  Acoustic texture vector (dark/bright, dense/sparse, smooth/aggressive)

What it captures that structured features miss:

  • Timbral texture (why a minor-key piano ballad feels different from a minor-key metal track even with the same BPM and valence)
  • Spectral brightness / darkness (correlated with perceived emotional tone)
  • Onset density (busyness, complexity)
  • Harmonic vs. percussive energy ratio

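Data source: Spotify preview_url (30 s MP3; downloading the clip needs no auth), a field of every track object we return. The field is nullable, so tracks without a preview clip have to be skipped or handled separately.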

Output: A 128-d embedding per track, projected down to a mood-relevant subspace.
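
To make the clip-to-embedding step concrete, here is a small sketch of the bookkeeping involved, assuming the standard VGGish front end (16 kHz audio, 25 ms windows with a 10 ms hop, non-overlapping 0.96 s patches of 96 frames) and mean pooling of the per-patch embeddings into one vector per track. Mean pooling is an assumption; the plan only says the embedding is projected down afterwards. The embeddings below are placeholders, not real model output.

```python
SAMPLE_RATE = 16_000     # Hz, mono
WINDOW_SAMPLES = 400     # 25 ms analysis window at 16 kHz
HOP_SAMPLES = 160        # 10 ms hop at 16 kHz
FRAMES_PER_PATCH = 96    # 0.96 s of context per VGGish input
EMBEDDING_DIM = 128


def frame_count(duration_s: float) -> int:
    """Number of full spectrogram frames in a clip."""
    n_samples = int(duration_s * SAMPLE_RATE)
    if n_samples < WINDOW_SAMPLES:
        return 0
    return 1 + (n_samples - WINDOW_SAMPLES) // HOP_SAMPLES


def patch_count(duration_s: float) -> int:
    """Number of non-overlapping 96-frame patches fed to the CNN."""
    return frame_count(duration_s) // FRAMES_PER_PATCH


def mean_pool(embeddings: list[list[float]]) -> list[float]:
    """Average per-patch embeddings into one vector per track."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


frames = frame_count(30.0)
patches = patch_count(30.0)
# Placeholder embeddings standing in for real VGGish output.
track_vec = mean_pool([[float(i)] * EMBEDDING_DIM for i in range(patches)])
```

Under these assumptions a 30 s preview yields 2998 frames and 31 patches, so each track produces 31 embeddings of 128 dimensions before pooling.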

3.3 Structured Audio Features Module

What: Recover Spotify's high-level audio descriptors (valence, energy, danceability, tempo, key, mode, acousticness, speechiness, instrumentalness) per track. These are human-interpretable scalars on top of the raw signal.

Current blocker: Spotify /v1/audio-features returns 403 without Extended Access (deprecated Nov 2024).

Resolution path:

  • Option A (fastest): Apply for Spotify Extended Access (free, ~2-week review), which unlocks both /audio-features and /recommendations
  • Option B (no dependency): AcousticBrainz open dataset (~2M tracks, pre-computed features; the project stopped accepting new data in 2022, so coverage is frozen)
  • Option C (compute ourselves): Run Essentia (open-source, from the Music Technology Group at Universitat Pompeu Fabra) on preview URLs to extract tempo, key, mode, spectral centroid, and loudness directly

Output: Valence / Energy / Danceability / Tempo / Mode scalars per track, the structured side of the Valence-Arousal grid.

3.4 Multimodal Fusion Layer

What: Combine four independent signals into one calibrated match score:

User emotional state  (DeBERTa output)          ──┐
Lyrical emotion vec   (transformer on lyrics)   ──┤
Acoustic embedding    (VGGish on spectrogram)     ├──► Fusion head → match_score
Structured features   (valence/energy/tempo)    ──┘

Method:

  1. Project each modality into a shared Valence-Arousal-Dominance (VAD) space
  2. Compute weighted cosine similarity between user state vector and track vector
  3. Weights tunable (or learned from user feedback in Phase 3.6)

Output: A single calibrated match score (0–100 %) grounded in all four modalities simultaneously: the full Valence-Arousal grid match originally planned.
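
The method above can be sketched as follows, assuming each modality has already been projected to a 3-d VAD vector with components in [-1, 1]. This version takes a weighted average of the track's modality vectors and then computes cosine similarity with the user vector. That reading of "weighted cosine similarity", the weights, and the mapping of similarity onto a 0 to 100 % score are all assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse(track_modalities: dict, weights: dict) -> list[float]:
    """Weighted average of the per-modality VAD vectors for one track."""
    total = sum(weights[m] for m in track_modalities)
    return [
        sum(weights[m] * vec[i] for m, vec in track_modalities.items()) / total
        for i in range(3)
    ]


def match_score(user_vad: list[float], track_modalities: dict, weights: dict) -> float:
    """Map cosine similarity from [-1, 1] onto a 0-100 % score."""
    sim = cosine(user_vad, fuse(track_modalities, weights))
    return round(50.0 * (sim + 1.0), 1)


WEIGHTS = {"lyrics": 0.4, "acoustic": 0.3, "structured": 0.3}  # illustrative
user_vad = [-0.6, -0.4, -0.3]  # lonely: low valence, arousal, dominance
track = {
    "lyrics": [-0.7, -0.3, -0.4],
    "acoustic": [-0.5, -0.5, -0.2],
    "structured": [-0.4, -0.6, -0.1],
}
score = match_score(user_vad, track, WEIGHTS)
opposite = match_score([0.6, 0.4, 0.3], track, WEIGHTS)
```

A track whose three modalities all sit near the user's state scores close to 100 %, while the same track scored against the opposite emotional state falls close to 0 %. Keeping the weights in a plain dict makes them easy to tune by hand or to replace with learned values later.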

3.5 Explainability (XAI Layer)

What: Natural-language justification for every recommended track.

Example output:

"Selected because its slow acoustic tempo (68 BPM) and lyrical themes of isolation strongly align with your expressed feeling of loneliness."

Method: Template-based generation seeded by the acoustic + lyrical features that drove the match score. Can be upgraded to a small LLM-generated explanation later.
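
A template-based justification could be as simple as the sketch below. The feature names, thresholds, and wording are hypothetical; the point is that the sentence is assembled from the same features that drove the match score.

```python
def describe_tempo(bpm: float) -> str:
    # Illustrative thresholds, not tuned values.
    if bpm < 80:
        return "slow"
    if bpm < 120:
        return "mid-tempo"
    return "fast"


def explain(track: dict, user_emotion: str) -> str:
    """Fill a fixed sentence template from the track's dominant features."""
    texture = "acoustic" if track["acousticness"] >= 0.5 else "electronic"
    return (
        f"Selected because its {describe_tempo(track['tempo'])} {texture} tempo "
        f"({track['tempo']:.0f} BPM) and lyrical themes of {track['lyric_theme']} "
        f"align with your expressed feeling of {user_emotion}."
    )


text = explain(
    {"tempo": 68, "acousticness": 0.8, "lyric_theme": "isolation"},
    "loneliness",
)
```

Because every slot is filled from a stored feature, each explanation can be traced back to the values that produced it, which is harder to guarantee with free-form LLM output.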

Impact: Transparency. Users understand why they got a recommendation, not just what it is.

3.6 User Feedback Loop

What: Thumbs up / down per track, feeding back into the recommendation ranking for that session.

Impact: Closes the loop between static model output and real user response. Enables fine-tuning training data collection over time.
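
One possible shape for in-session re-ranking, sketched under the assumption that each vote shifts the match score of every track by the same artist by a fixed step. The step size and the artist-level propagation are assumptions; the plan leaves the mechanism open.

```python
FEEDBACK_STEP = 10.0  # percentage points per vote; illustrative


def rerank(tracks: list[dict], votes: dict[str, int]) -> list[dict]:
    """Re-order tracks after thumbs up (+1) / down (-1) votes keyed by track id."""
    by_id = {t["id"]: t for t in tracks}
    artist_bias: dict[str, float] = {}
    for track_id, vote in votes.items():
        artist = by_id[track_id]["artist"]
        artist_bias[artist] = artist_bias.get(artist, 0.0) + vote * FEEDBACK_STEP
    adjusted = []
    for t in tracks:
        score = t["match"] + artist_bias.get(t["artist"], 0.0)
        # Clamp so the adjusted score stays a valid percentage.
        adjusted.append({**t, "match": max(0.0, min(100.0, score))})
    return sorted(adjusted, key=lambda t: t["match"], reverse=True)


session = [
    {"id": "t1", "artist": "x", "match": 95.0},
    {"id": "t2", "artist": "y", "match": 80.0},
    {"id": "t3", "artist": "x", "match": 70.0},
]
order = [t["id"] for t in rerank(session, {"t1": -1, "t2": +1})]
```

A thumbs down on t1 and a thumbs up on t2 moves t2 to the top and also lowers t3, which shares an artist with t1. The collected votes could be logged alongside the input text as candidate training data.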

3.7 Extended Spotify Access

What: Apply for Spotify's Extended Access program (free).

Unlocks:

  • /v1/audio-features (valence, energy, danceability, tempo per track)
  • /v1/recommendations (seed-based recommendations)
  • Higher search limits

Impact: Makes acoustic analysis available immediately via the API we already have, without needing Essentia or AcousticBrainz.


Objectives: Start to End

| # | Objective | Phase | Status |
|---|---|---|---|
| 1 | Fine-tune an NLP model to classify text emotions | Phase 1 | ✅ Done: DeBERTa-v3-base, 8 classes, ~95 % |
| 2 | Map emotions to music parameters | Phase 1 | ✅ Done: EmotionFeatureMapper, Valence/Energy/Danceability targets |
| 3 | Retrieve real tracks from a live music source | Phase 1 | ✅ Done: Spotify Search API |
| 4 | Anti-popularity bias | Phase 1 | ✅ Done: max_popularity ≤ 65, hidden-gem preference |
| 5 | Live deployed API + frontend | Phase 1 | ✅ Done: HF Spaces + Vercel |
| 6 | Varied results per query | Phase 2 | ✅ Done: 4 term sets × 4 offsets |
| 7 | Lyrics-to-source-song detection | Phase 2 | ✅ Done: heuristic + optional Musixmatch |
| 8 | Meaningful, varied match scores | Phase 2 | ✅ Done: search-rank scoring |
| 9 | Fetch and analyse song lyrics per track | Phase 3 | 🔲 Musixmatch lyrics API |
| 10 | Spectrogram analysis: VGGish CNN on preview audio | Phase 3 | 🔲 Raw acoustic texture embedding |
| 11 | Structured audio features (valence/energy/tempo) | Phase 3 | 🔲 Extended Access or Essentia |
| 12 | Semantic match scoring via lyrical emotion | Phase 3 | 🔲 Replaces search-rank proxy |
| 13 | Valence-Arousal grid fusion (text + lyrics + spectrogram + features) | Phase 3 | 🔲 Full 4-modality fusion layer |
| 14 | Natural-language recommendation justifications | Phase 3 | 🔲 XAI / template generation |
| 15 | In-session user feedback loop | Phase 3 | 🔲 Thumbs up/down + re-ranking |
| 16 | Full multimodal pipeline (all 4 signals fused) | End state | 🔲 Full original vision realised |

What Changed From the Original Vision (And Why It's Better)

| Original | Current | Reason |
|---|---|---|
| Own music database | Spotify (100M+ tracks, live) | Vast catalogue, no maintenance, always current |
| VGGish CNN for audio | Planned for Phase 3 | Spotify API + Extended Access covers this more cheaply |
| ALBERT/BERT for lyrics | Planned for Phase 3 via Musixmatch | Lyrics only needed at recommendation time, not training time |
| LSTM fusion engine | Lightweight cosine fusion planned | Simpler, interpretable, easier to debug |
| Batch/static recommendations | Real-time, randomised per query | Eliminates filter bubbles immediately |
| Research prototype | Production deployment | Real users, real feedback, real iteration |

The architecture is thinner today, but it points in the same direction. Every multimodal channel from the original design (lyrical semantics, acoustic mood, fusion) is still on the roadmap; each now plugs into a production system instead of a research prototype.


Summary

EUMORA today: text in → emotion classified → Spotify searched with mood keywords → anti-popularity filtered → ranked by relevance → source song pinned if lyrics detected.

EUMORA end state: text in → emotion classified → Spotify tracks enriched with lyrical + acoustic analysis → fused match score on Valence-Arousal grid → ranked with XAI explanations → user feedback refines in real time.

The core insight (understand how someone feels, not what's popular) is live and working. The multimodal layers that deepen that understanding are the clear, sequenced next steps.