eumora-api / EUMORA_PLAN.md

VivDubs

feat: add comprehensive revised product plan for EUMORA

05fc243 10 days ago

	# EUMORA — Revised Product Plan

	## What Is EUMORA?

	EUMORA is an emotion-aware music recommendation system. You describe how you feel in plain text and it recommends songs that match that emotional state. No popularity contests, no filter bubbles, no cold-start delays.

	The original architecture envisioned a full multimodal pipeline built from scratch — our own music database, VGGish audio analysis, ALBERT/BERT lyric transformers, and an LSTM fusion engine. We have since pivoted to a leaner, production-grade architecture that ships today while preserving every multimodal avenue for future expansion.

	---

	## Phase 1 — Where We Started

	Objective: Prove the core premise. Given free-text input, classify the user's emotion and return music that matches it.

	| Component | Original Plan | What We Actually Built |
	|---|---|---|
	| Emotion model | Fine-tuned DistilBERT, 7-class | Fine-tuned DeBERTa-v3-base, 8-class (+ sarcasm) |
	| Training data | Generic emotion dataset | Combined dataset: 59 k+ samples across GoEmotions, tweet corpora, sarcasm corpus |
	| Music retrieval | Own database + LSTM matching | Spotify Search API with emotion-keyed queries |
	| Deployment | Research prototype | FastAPI on HF Spaces (Docker), frontend on Vercel |
	| Popularity bias | Penalise via score | Max-popularity filter (≤ 65) + anti-chart-topper logic |
	| Match score | Valence-Arousal distance | Search-rank position (Spotify relevance ordering, 95 % → 50 %) |
	| Cold start | N/A | Zero: no listening history required |

	Result: A live, working product at [eumora.vercel.app](https://eumora.vercel.app).

	---

	## Phase 2 — Where We Are Right Now

	### Architecture

	```
	User text input
	                  │
	                  ▼
	┌─────────────────────────────────────┐
	│ DeBERTa-v3-base (fine-tuned)        │ ← Emotion classifier
	│ 8 classes: sadness · joy · love     │   59 k+ training samples
	│ anger · fear · surprise · neutral   │   Sarcasm prior calibration
	│ sarcasm                             │   Confidence + probability mix
	└─────────────────┬───────────────────┘
	                  │ predict_result
	                  ▼
	┌─────────────────────────────────────┐
	│ EmotionFeatureMapper                │ ← Emotion → Spotify params
	│ Valence / Energy / Danceability     │   Blended across top-2 emotions
	│ Tempo / Mode / Seed genres          │   Confidence-scaled windows
	└─────────────────┬───────────────────┘
	                  │ query + params
	                  ▼
	┌─────────────────────────────────────┐
	│ Spotify Search API (/v1/search)     │ ← 4 term-sets × 4 offsets
	│ Randomised mood keywords            │   = 16 result pools per emotion
	│ Random page offset (0/10/20/30)     │   limit = 10 (Spotify cap)
	└─────────────────┬───────────────────┘
	                  │ raw tracks
	                  ▼
	┌─────────────────────────────────────┐
	│ Score + Filter                      │
	│ • Max popularity ≤ 65               │   Anti-chart-topper filter
	│ • Audio features (graceful 403)     │   Spotify deprecated for std apps
	│ • Search-rank scoring 95 % → 50 %   │   Highest relevance first
	│ • Diversity filter (≤2/artist)      │
	└─────────────────┬───────────────────┘
	                  │
	                  ▼
	┌─────────────────────────────────────┐
	│ Lyrics Source Pinning               │ ← If input looks like lyrics
	│ 1. Musixmatch lyrics search         │   (optional, key-gated)
	│ 2. Title-word heuristic fallback    │   Pins source song at #1 / 100 %
	└─────────────────┬───────────────────┘
	                  │
	                  ▼
	Ranked track list
	```
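
The mapper stage in the diagram (blend the top-2 emotions, then scale the search window by confidence) could look roughly like the sketch below. The `TARGETS` values, the `blend_targets` name, and the window formula are illustrative assumptions, not the values used in the deployed `EmotionFeatureMapper`.

```python
from dataclasses import dataclass

# Hypothetical per-emotion targets in [0, 1]; illustrative numbers only.
TARGETS = {
    "sadness": {"valence": 0.2, "energy": 0.3},
    "joy": {"valence": 0.9, "energy": 0.8},
    "anger": {"valence": 0.3, "energy": 0.9},
}


@dataclass
class Window:
    low: float
    high: float


def blend_targets(top2, base_width=0.4):
    """Blend the top-2 emotions by probability and build one window per feature.

    top2: list of (emotion, probability) pairs, most likely first.
    The window narrows as the leading probability (confidence) rises.
    """
    total = sum(p for _, p in top2)
    confidence = top2[0][1]
    # Assumed formula: a small fixed margin plus a confidence-dependent part.
    half_width = base_width * (1.0 - confidence) / 2 + 0.05
    windows = {}
    for feature in ("valence", "energy"):
        centre = sum(TARGETS[e][feature] * p for e, p in top2) / total
        windows[feature] = Window(max(0.0, centre - half_width),
                                  min(1.0, centre + half_width))
    return windows


windows = blend_targets([("sadness", 0.6), ("anger", 0.2)])
```

With a confident prediction such as `[("joy", 0.95), ("sadness", 0.05)]` the same function returns a much tighter window around the joy targets.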

	### Current Capabilities

	- Emotion classification — 8 emotions, sarcasm-calibrated, ~95 % accuracy on test set
	- Music recommendations — live Spotify tracks, varied per query, anti-popularity biased
	- Lyrics detection — heuristic (always on) + Musixmatch (when key is set)
	- Match scoring — search-rank-based, varied 95 %→50 %, highest first
	- Live deployment — HF Spaces (API) + Vercel (frontend), CORS locked, secrets managed
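
A minimal sketch of the score-and-filter step, assuming tracks arrive in Spotify's relevance order as plain dicts. The linear falloff between 95 % and 50 % and the `score_and_filter` name are assumptions; the plan only states the two endpoints, the popularity cap, and the two-per-artist limit.

```python
def score_and_filter(tracks, max_popularity=65, max_per_artist=2,
                     top=0.95, bottom=0.50):
    """Filter tracks, then assign a rank-based match score.

    tracks: list of dicts with 'name', 'artist' and 'popularity',
    already sorted by search relevance (best first).
    """
    kept, per_artist = [], {}
    for track in tracks:
        if track["popularity"] > max_popularity:
            continue  # anti-chart-topper filter
        count = per_artist.get(track["artist"], 0)
        if count >= max_per_artist:
            continue  # diversity filter
        per_artist[track["artist"]] = count + 1
        kept.append(track)

    # Assumed linear falloff from `top` (rank 0) to `bottom` (last rank).
    span = max(len(kept) - 1, 1)
    return [
        {**track, "match_score": round(top - (top - bottom) * i / span, 3)}
        for i, track in enumerate(kept)
    ]


results = score_and_filter([
    {"name": "A", "artist": "x", "popularity": 40},
    {"name": "B", "artist": "x", "popularity": 90},  # too popular
    {"name": "C", "artist": "x", "popularity": 50},
    {"name": "D", "artist": "x", "popularity": 30},  # third track by x
    {"name": "E", "artist": "y", "popularity": 65},
])
```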

	### Known Constraints (Spotify API, Nov 2024 Policy)

	| Endpoint | Status | Workaround |
	|---|---|---|
	| `/v1/recommendations` | 404 (deprecated without Extended Access) | Replaced with `/v1/search` |
	| `/v1/audio-features` | 403 (blocked without Extended Access) | Graceful degradation (search-rank scoring) |
	| `/v1/search` limit | Max 10 results per request | Randomised offset to sample wider pool |
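
The randomised-offset workaround can be sketched as follows. The keyword lists in `TERM_SETS` are hypothetical placeholders (the real term sets are not listed in this plan); `q`, `type`, `limit`, and `offset` are the standard query parameters of Spotify's search endpoint.

```python
import random

# Hypothetical keyword sets: four per emotion, as described above.
TERM_SETS = {
    "sadness": [
        ["melancholy", "slow"],
        ["heartbreak", "acoustic"],
        ["rainy", "piano"],
        ["lonely", "ballad"],
    ],
}
OFFSETS = (0, 10, 20, 30)


def build_search_params(emotion, rng=random):
    """Pick one term set and one page offset at random for this request."""
    terms = rng.choice(TERM_SETS[emotion])
    return {
        "q": " ".join(terms),
        "type": "track",
        "limit": 10,
        "offset": rng.choice(OFFSETS),
    }


def all_pools(emotion):
    """Enumerate every (query, offset) combination that can be sampled."""
    return {(" ".join(terms), offset)
            for terms in TERM_SETS[emotion]
            for offset in OFFSETS}


params = build_search_params("sadness", random.Random(0))
```

Four term sets times four offsets gives the 16 distinct result pools per emotion, so repeated queries for the same mood rarely return the same page.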

	---

	## Phase 3 — Future Implementation Scope (Multimodal Expansion)

	The pivot to Spotify freed us from building a music database. The multimodal analysis layers from the original architecture are still the roadmap — they now slot in as enrichment on top of Spotify tracks rather than as a replacement for them.

	### 3.1 Lyrical Semantics Module

	What: For each recommended Spotify track, fetch the lyrics and run a transformer (ALBERT or DeBERTa) to extract lyrical emotion, themes, and mood.

	Data source: Musixmatch API (`track.lyrics.get`) — same key already planned for lyrics detection.

	Output: A lyrical emotion vector per track that can be fused with the user's emotional state.

	Impact: Replace the current search-rank proxy score with a genuine semantic match score. A song about heartbreak should rank higher than one that just happens to appear in a "sad" search, regardless of Spotify's relevance ordering.
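
As a rough illustration of that re-ranking, the sketch below compares the user's emotion distribution with a per-track lyrical emotion distribution using cosine similarity. The probabilities are made up, and `rank_by_lyrics` is a hypothetical helper; the lyric classifier itself is out of scope here.

```python
import math

EMOTIONS = ("sadness", "joy", "love", "anger",
            "fear", "surprise", "neutral", "sarcasm")


def to_vector(probs):
    """Turn a {label: probability} dict into a fixed-order vector."""
    return [probs.get(label, 0.0) for label in EMOTIONS]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_lyrics(user_probs, tracks):
    """tracks: list of (title, lyric_probs) in Spotify search order."""
    user_vec = to_vector(user_probs)
    scored = [(title, cosine(user_vec, to_vector(lyric_probs)))
              for title, lyric_probs in tracks]
    return sorted(scored, key=lambda item: item[1], reverse=True)


user = {"sadness": 0.8, "love": 0.2}
search_order = [
    ("Upbeat track tagged 'sad'", {"joy": 0.7, "neutral": 0.3}),
    ("Heartbreak ballad", {"sadness": 0.7, "love": 0.3}),
]
ranked = rank_by_lyrics(user, search_order)
```

Here the heartbreak ballad moves to the top even though the search returned it second, which is the behaviour this module is meant to add.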

	### 3.2 Spectrogram Analysis Module (Raw Audio CNN)

	What: For each track, download the 30-second preview clip from Spotify (`preview_url`) and convert it into a mel spectrogram — a 2D time-frequency image of the audio signal. Feed that image through VGGish (Google's audio CNN, pre-trained on AudioSet) to extract a 128-dimensional acoustic embedding that captures timbre, texture, rhythm density, and harmonic content.

	Pipeline:
	```
	preview_url (MP3, 30 s)
	│
	▼
	Decode audio → resample to 16 kHz mono
	│
	▼
	Log-mel spectrogram (64 mel bins, 25 ms window, 10 ms hop)
	│
	▼
	VGGish CNN → 128-d acoustic embedding
	│
	▼
	Acoustic texture vector (dark/bright, dense/sparse, smooth/aggressive)
	```

	What it captures that structured features miss:
	- Timbral texture (why a minor-key piano ballad feels different from a minor-key metal track even with the same BPM and valence)
	- Spectral brightness / darkness (correlated with perceived emotional tone)
	- Onset density (busyness, complexity)
	- Harmonic vs. percussive energy ratio

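	Data source: Spotify `preview_url` (30 s MP3, no auth required to download). The field is part of every track object we return, but Spotify documents it as nullable, so tracks without a preview clip have to be skipped.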

	Output: A 128-d embedding per track, projected down to a mood-relevant subspace.
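
To size this step, the sketch below works out how many embeddings one preview clip yields. It assumes VGGish's published default front-end settings (16 kHz audio, 25 ms window, 10 ms hop, non-overlapping 0.96 s examples of 96 frames); it does not run the model. Mean pooling to a single vector per track is one simple option, not a decision made in this plan.

```python
SAMPLE_RATE = 16_000      # Hz, after resampling
WINDOW_S = 0.025          # STFT window length in seconds
HOP_S = 0.010             # STFT hop in seconds
EXAMPLE_FRAMES = 96       # frames per 0.96 s VGGish example
EMBEDDING_DIM = 128


def embedding_shape(clip_seconds):
    """Return (n_examples, 128) for a clip of the given length."""
    n_samples = int(round(clip_seconds * SAMPLE_RATE))
    window = int(round(WINDOW_S * SAMPLE_RATE))   # 400 samples
    hop = int(round(HOP_S * SAMPLE_RATE))         # 160 samples
    if n_samples < window:
        return (0, EMBEDDING_DIM)
    n_frames = 1 + (n_samples - window) // hop
    return (n_frames // EXAMPLE_FRAMES, EMBEDDING_DIM)


def mean_pool(embeddings):
    """Average a list of equal-length vectors into one track-level vector."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


shape = embedding_shape(30)  # a 30 s preview yields 31 examples of 128 values
```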

	### 3.3 Structured Audio Features Module

	What: Recover Spotify's high-level audio descriptors — valence, energy, danceability, tempo, key, mode, acousticness, speechiness, instrumentalness — per track. These are human-interpretable scalars on top of the raw signal.

	Current blocker: Spotify `/v1/audio-features` returns 403 without Extended Access (deprecated Nov 2024).

	Resolution path:
	- Option A (fastest): Apply for Spotify Extended Access — free, ~2-week review, unlocks both `/audio-features` and `/recommendations`
	- Option B (no dependency): AcousticBrainz open dataset (~2M tracks, pre-computed features); the project stopped collecting data in 2022, so recent releases are not covered
	- Option C (compute ourselves): Run Essentia (open-source, from the Music Technology Group at Universitat Pompeu Fabra) on preview URLs to extract tempo, key, mode, spectral centroid, and loudness directly

	Output: Valence / Energy / Danceability / Tempo / Mode scalars per track — the structured side of the Valence-Arousal grid.

	### 3.4 Multimodal Fusion Layer

	What: Combine four independent signals into one calibrated match score:

	```
	User emotional state (DeBERTa output)        ──┐
	Lyrical emotion vec  (transformer on lyrics) ──┤
	Acoustic embedding   (VGGish on spectrogram) ──┼──► Fusion head → match_score
	Structured features  (valence/energy/tempo)  ──┘
	```

	Method:
	1. Project each modality into a shared Valence-Arousal-Dominance (VAD) space
	2. Compute weighted cosine similarity between user state vector and track vector
	3. Weights tunable (or learned from user feedback in Phase 3.6)

	Output: A single calibrated match score (0–100 %) grounded in all four modalities simultaneously — the full Valence-Arousal grid match originally planned.
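
One way to read the method above, as a hedged sketch: each track-side modality is assumed to have already been projected to a VAD triple in [-1, 1], the per-modality cosine similarities to the user's VAD vector are combined with a weighted average, and the result is rescaled to a percentage. The projection step, the weights, and the `fused_match_score` name are all assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_match_score(user_vad, track_vads, weights):
    """Weighted cosine match between the user state and one track.

    user_vad:   (valence, arousal, dominance), each in [-1, 1]
    track_vads: {modality: (v, a, d)}; missing modalities are simply absent
    weights:    {modality: non-negative weight}
    Returns a score in percent (0 to 100).
    """
    weighted_sum = total_weight = 0.0
    for modality, vad in track_vads.items():
        weight = weights.get(modality, 0.0)
        weighted_sum += weight * cosine(user_vad, vad)
        total_weight += weight
    if total_weight == 0:
        return 0.0
    similarity = weighted_sum / total_weight      # in [-1, 1]
    return round(50.0 * (similarity + 1.0), 1)    # rescale to percent


user = (-0.6, -0.4, -0.3)
opposite = (0.6, 0.4, 0.3)
weights = {"lyrics": 3.0, "acoustic": 1.0}
score = fused_match_score(user, {"lyrics": user, "acoustic": opposite}, weights)
```

Because modalities that are absent from `track_vads` drop out of both sums, a track with no lyrics or no preview clip still gets a score from the remaining signals.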

	### 3.5 Explainability (XAI Layer)

	What: Natural-language justification for every recommended track.

	Example output:
	> "Selected because its slow acoustic tempo (68 BPM) and lyrical themes of isolation strongly align with your expressed feeling of loneliness."

	Method: Template-based generation seeded by the acoustic + lyrical features that drove the match score. Can be upgraded to a small LLM-generated explanation later.
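
A template-based generator along those lines might look like this. The feature keys (`tempo_bpm`, `lyrical_themes`), the tempo thresholds, and the wording are illustrative assumptions rather than a fixed schema.

```python
def explain(track_features, user_emotion):
    """Build a one-sentence justification from whichever features are present."""
    parts = []

    tempo = track_features.get("tempo_bpm")
    if tempo is not None:
        # Assumed thresholds for describing pace.
        pace = "slow" if tempo < 90 else "mid-tempo" if tempo < 120 else "fast"
        parts.append(f"its {pace} tempo ({tempo:.0f} BPM)")

    themes = track_features.get("lyrical_themes")
    if themes:
        parts.append("lyrical themes of " + " and ".join(themes))

    if not parts:
        return (f"Selected because it ranked highly for your expressed "
                f"feeling of {user_emotion}.")
    reasons = " and ".join(parts)
    return (f"Selected because {reasons} align with your expressed "
            f"feeling of {user_emotion}.")


message = explain({"tempo_bpm": 68, "lyrical_themes": ["isolation"]}, "loneliness")
```

The fallback sentence keeps the explanation honest when neither acoustic nor lyrical features are available for a track.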

	Impact: Transparency — users understand why they got a recommendation, not just what it is.

	### 3.6 User Feedback Loop

	What: Thumbs up / down per track, feeding back into the recommendation ranking for that session.

	Impact: Closes the loop between static model output and real user response, and lets us collect labelled data for fine-tuning over time.
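
How votes feed back into ranking is not specified yet. One simple possibility, sketched below under that caveat, is a per-artist bias within the session: each thumbs up or down shifts the score of every track by the same artist by a fixed step before re-sorting. The `rerank` name, the `step` size, and the artist-level granularity are assumptions.

```python
def rerank(tracks, feedback, step=0.10):
    """Re-rank session results using thumbs up/down votes.

    tracks:   list of dicts with 'id', 'artist', 'match_score' (0 to 1)
    feedback: {track_id: +1 or -1}
    """
    artist_of = {t["id"]: t["artist"] for t in tracks}
    bias = {}
    for track_id, vote in feedback.items():
        artist = artist_of.get(track_id)
        if artist is not None:
            bias[artist] = bias.get(artist, 0.0) + vote * step

    adjusted = []
    for t in tracks:
        score = t["match_score"] + bias.get(t["artist"], 0.0)
        adjusted.append({**t, "match_score": min(1.0, max(0.0, score))})
    # sorted() is stable, so ties keep their original order.
    return sorted(adjusted, key=lambda t: t["match_score"], reverse=True)


session = [
    {"id": "a1", "artist": "X", "match_score": 0.9},
    {"id": "b1", "artist": "Y", "match_score": 0.8},
    {"id": "a2", "artist": "X", "match_score": 0.7},
]
reranked = rerank(session, {"a1": -1, "b1": +1})
```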

	### 3.7 Extended Spotify Access

	What: Apply for Spotify's Extended Access program (free).

	Unlocks:
	- `/v1/audio-features` (valence, energy, danceability, tempo per track)
	- `/v1/recommendations` (seed-based recommendations)
	- Higher search limits

	Impact: Makes Spotify's structured audio features available immediately through the API we already use, without needing Essentia or AcousticBrainz.

	---

	## Objectives — Start to End

	| # | Objective | Phase | Status |
	|---|---|---|---|
	| 1 | Fine-tune an NLP model to classify text emotions | Phase 1 | ✅ Done: DeBERTa-v3-base, 8 classes, ~95 % |
	| 2 | Map emotions to music parameters | Phase 1 | ✅ Done: EmotionFeatureMapper, Valence/Energy/Danceability targets |
	| 3 | Retrieve real tracks from a live music source | Phase 1 | ✅ Done: Spotify Search API |
	| 4 | Anti-popularity bias | Phase 1 | ✅ Done: max_popularity ≤ 65, hidden-gem preference |
	| 5 | Live deployed API + frontend | Phase 1 | ✅ Done: HF Spaces + Vercel |
	| 6 | Varied results per query | Phase 2 | ✅ Done: 4 term sets × 4 offsets |
	| 7 | Lyrics-to-source-song detection | Phase 2 | ✅ Done: heuristic + optional Musixmatch |
	| 8 | Meaningful, varied match scores | Phase 2 | ✅ Done: search-rank scoring |
	| 9 | Fetch and analyse song lyrics per track | Phase 3 | 🔲 Musixmatch lyrics API |
	| 10 | Spectrogram analysis: VGGish CNN on preview audio | Phase 3 | 🔲 Raw acoustic texture embedding |
	| 11 | Structured audio features (valence/energy/tempo) | Phase 3 | 🔲 Extended Access or Essentia |
	| 12 | Semantic match scoring via lyrical emotion | Phase 3 | 🔲 Replaces search-rank proxy |
	| 13 | Valence-Arousal grid fusion (text + lyrics + spectrogram + features) | Phase 3 | 🔲 Full 4-modality fusion layer |
	| 14 | Natural-language recommendation justifications | Phase 3 | 🔲 XAI / template generation |
	| 15 | In-session user feedback loop | Phase 3 | 🔲 Thumbs up/down + re-ranking |
	| 16 | Full multimodal pipeline (all 4 signals fused) | End state | 🔲 Full original vision realised |

	---

	## What Changed From the Original Vision (And Why It's Better)

	| Original | Current | Reason |
	|---|---|---|
	| Own music database | Spotify (100M+ tracks, live) | Vast catalogue, no maintenance, always current |
	| VGGish CNN for audio | Planned for Phase 3 | Spotify audio features (via Extended Access) cover the basics more cheaply; VGGish adds timbral texture later |
	| ALBERT/BERT for lyrics | Planned for Phase 3 via Musixmatch | Lyrics only needed at recommendation time, not training time |
	| LSTM fusion engine | Lightweight cosine fusion planned | Simpler, interpretable, easier to debug |
	| Batch/static recommendations | Real-time, randomised per query | Eliminates filter bubbles immediately |
	| Research prototype | Production deployment | Real users, real feedback, real iteration |

	The architecture is thinner today but structurally identical in direction. Every multimodal channel from the original design (lyrical semantics, acoustic mood, fusion) is still in the roadmap — they now plug into a production system instead of a research prototype.

	---

	## Summary

	> EUMORA today: text in → emotion classified → Spotify searched with mood keywords → anti-popularity filtered → ranked by relevance → source song pinned if lyrics detected.
	>
	> EUMORA end state: text in → emotion classified → Spotify tracks enriched with lyrical + acoustic analysis → fused match score on Valence-Arousal grid → ranked with XAI explanations → user feedback refines in real time.

	The core insight — understand how someone feels, not what's popular — is live and working. The multimodal layers that deepen that understanding are the clear, sequenced next steps.