From cd88ed29837ae4d97dd5cf1d25c1e39a0540b81f Mon Sep 17 00:00:00 2001 From: root Date: Mon, 30 Mar 2026 16:09:13 -0500 Subject: [PATCH] Revise architecture doc to reflect actual data pipeline (Spotify audio features + LLM) --- ARCHITECTURE.md | 140 +++++++++++++++++++++++++----------------------- 1 file changed, 72 insertions(+), 68 deletions(-) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 98458f0..2bec387 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -1,22 +1,28 @@ -# Vynl - Audio Analysis Architecture +# Vynl - Recommendation Architecture -## Problem +## Data Sources -No LLM can actually listen to music. Text-based recommendations work from artist names, genre associations, and music critic knowledge — never from the actual sound. For genuine sonic analysis, we need a dedicated audio processing pipeline. +### Spotify Audio Features API (already integrated) +Pre-computed by Spotify for every track: +- **Tempo** (BPM) +- **Energy** (0.0–1.0, intensity/activity) +- **Danceability** (0.0–1.0) +- **Valence** (0.0–1.0, musical positivity) +- **Acousticness** (0.0–1.0) +- **Instrumentalness** (0.0–1.0) +- **Key** and **Mode** (major/minor) +- **Loudness** (dB) +- **Speechiness** (0.0–1.0) -## Audio Analysis: Essentia +### Metadata (from Spotify + supplementary APIs) +- Artist name, album, release date +- Genres and tags +- Popularity score +- Related artists -Essentia (open source, by Music Technology Group Barcelona) is the industry standard for music information retrieval. It analyzes actual audio and extracts: - -- Mood, genre, style classification -- BPM, key, scale -- Timbral descriptors (brightness, warmth, roughness) -- Instrumentation detection -- Song structure (verse/chorus/bridge) -- Vocal characteristics -- Audio embeddings for "this sounds like" similarity - -Free, self-hosted, used by Spotify/Pandora-type services under the hood. 
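The per-track features listed under Data Sources can be fetched and trimmed down in one pass. A minimal sketch, assuming `spotipy` as the client (the fetch call is commented out because it needs credentials; the `summarize_features` helper, `FIELDS` list, and sample response values are illustrative, not part of the codebase):

```python
# Sketch: reduce a Spotify audio-features response to the fields the
# pipeline uses. The spotipy fetch is shown but commented out since it
# requires API credentials; `sample` mimics the response shape.
# import spotipy
# from spotipy.oauth2 import SpotifyClientCredentials
# sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
# features = sp.audio_features([track_id])[0]

FIELDS = ["tempo", "energy", "danceability", "valence",
          "acousticness", "instrumentalness", "key", "mode",
          "loudness", "speechiness"]

def summarize_features(features: dict) -> dict:
    """Keep only the fields listed in the Data Sources section."""
    return {f: features[f] for f in FIELDS}

# Illustrative response (values made up; extra fields are dropped):
sample = {"tempo": 118.2, "energy": 0.64, "danceability": 0.55,
          "valence": 0.41, "acousticness": 0.12, "instrumentalness": 0.03,
          "key": 5, "mode": 0, "loudness": -7.4, "speechiness": 0.04,
          "liveness": 0.11, "id": "placeholder"}

print(summarize_features(sample)["tempo"])  # 118.2
```

Keeping the summary to a fixed field list makes the stored fingerprint stable even if Spotify adds fields to the response.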
+### Supplementary APIs (to add) +- **MusicBrainz** — artist relationships, detailed genre/tag taxonomy, release info +- **Last.fm** — similar artists, user-generated tags, listener overlap stats ## Recommendation Pipeline @@ -24,66 +30,64 @@ Free, self-hosted, used by Spotify/Pandora-type services under the hood. User imports playlist │ ▼ -Spotify preview clips (30s MP3s) ──→ Essentia (Celery worker) - │ │ - │ Sonic fingerprint: - │ tempo, key, timbre, - │ mood, instrumentation - │ │ - ▼ ▼ - Metadata ──────────────────→ LLM (any cheap model) - (genres, tags, artist info) combines sonic data - + music knowledge - → recommendations - + explanations +Spotify API ──→ Track metadata + audio features + │ + ▼ +Build taste profile: + - Genre distribution + - Average energy/danceability/valence/tempo + - Mood tendencies + - Sample artists and tracks + │ + ▼ +LLM (cheap model) receives: + - Structured taste profile + - User's specific request/query + - List of tracks already in library (to exclude) + │ + ▼ +Returns recommendations with +"why you'll like this" explanations ``` -### Step 1: Audio Ingestion -- Spotify provides 30-second preview clips as MP3 URLs for most tracks -- On playlist import, queue preview downloads as Celery background tasks -- Store clips temporarily for analysis, delete after processing - -### Step 2: Essentia Analysis -- Runs as a Celery worker processing audio clips -- Extracts per-track sonic fingerprint: - - **Rhythm**: BPM, beat strength, swing - - **Tonal**: key, scale, chord complexity - - **Timbre**: brightness, warmth, roughness, depth - - **Mood**: happy/sad, aggressive/relaxed, electronic/acoustic - - **Instrumentation**: detected instruments, vocal presence - - **Embeddings**: high-dimensional vector for similarity matching -- Store fingerprints in the tracks table (JSON + vector column) - -### Step 3: Similarity Search -- Use cosine similarity on audio embeddings to find "sounds like" matches -- Query against a catalog of pre-analyzed 
tracks (build over time from all user imports) -- Filter by user preferences (mood shift, era, underground level) - -### Step 4: LLM Explanation -- Feed sonic data + metadata to a cheap LLM (Haiku, GPT-4o-mini, Gemini Flash) -- The LLM's job is just natural language: turning structured sonic data into "why you'll like this" explanations -- The intelligence is in the audio analysis, not the text generation - ## Model Choice -Since the LLM is reasoning over structured data (not doing the analysis), the cheapest model wins: +The LLM reasons over structured audio feature data + metadata. It needs broad music knowledge but not heavy reasoning. Cheapest model wins: -| Model | Cost (per 1M tokens) | Good enough? | -|-------|---------------------|--------------| -| Claude Haiku 4.5 | $0.25 input / $1.25 output | Yes — best value | -| GPT-4o-mini | $0.15 input / $0.60 output | Yes | -| Gemini 2.5 Flash | $0.15 input / $0.60 output | Yes | -| Claude Sonnet | $3 input / $15 output | Overkill | +| Model | Cost (per 1M tokens) | Notes | +|-------|---------------------|-------| +| Claude Haiku 4.5 | $0.25 in / $1.25 out | Best value, great music knowledge | +| GPT-4o-mini | $0.15 in / $0.60 out | Cheapest option | +| Gemini 2.5 Flash | $0.15 in / $0.60 out | Also cheap, good quality | +| Claude Sonnet | $3 in / $15 out | Overkill for this task | -Note: Gemini 2.5 can accept raw audio input directly, but Essentia's structured output is more reliable and reproducible for a production pipeline. +## Taste Profile Structure -## Competitive Advantage +Built from a user's imported tracks: -This approach means Vynl does what Spotify does internally (audio analysis) but exposes it transparently — users see exactly WHY a song was recommended based on its actual sonic qualities, not just "other listeners also liked this." 
+```json +{ + "top_genres": [{"name": "indie rock", "count": 12}, ...], + "avg_energy": 0.65, + "avg_danceability": 0.55, + "avg_valence": 0.42, + "avg_tempo": 118.5, + "track_count": 47, + "sample_artists": ["Radiohead", "Tame Impala", ...], + "sample_tracks": ["Radiohead - Everything In Its Right Place", ...] +} +``` -## Tech Requirements +The LLM uses this profile to understand what the user gravitates toward sonically (high energy? melancholy? upbeat?) and find new music that matches or intentionally contrasts those patterns. -- **Essentia**: `pip install essentia-tensorflow` (includes pre-trained models) -- **Storage**: Temporary audio clip storage during analysis (~500KB per 30s clip) -- **Celery worker**: Dedicated worker for audio processing (CPU-bound) -- **Vector storage**: PostgreSQL with pgvector extension for embedding similarity search +## Platform Support + +### Currently Implemented +- Spotify (OAuth + playlist import + audio features) + +### Planned +- YouTube Music (via `ytmusicapi`, unofficial Python library) +- Apple Music (MusicKit API, requires Apple Developer account) +- Last.fm (scrobble history import + similar artist data) +- Tidal (official API) +- Manual entry / CSV upload (fallback for any platform)
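The taste-profile JSON shown above can be produced by a short aggregation pass over a user's imported tracks. A minimal sketch, assuming a flat `tracks` list of per-track dicts; the `build_taste_profile` name and input shape are assumptions, not the actual codebase API:

```python
from collections import Counter

def build_taste_profile(tracks: list[dict], sample_size: int = 5) -> dict:
    """Aggregate per-track audio features + metadata into the
    taste-profile structure (hypothetical input shape)."""
    genre_counts = Counter(g for t in tracks for g in t.get("genres", []))
    n = len(tracks)

    def avg(field: str) -> float:
        # Mean of a numeric audio feature across the library
        return round(sum(t[field] for t in tracks) / n, 2) if n else 0.0

    return {
        "top_genres": [{"name": g, "count": c}
                       for g, c in genre_counts.most_common(5)],
        "avg_energy": avg("energy"),
        "avg_danceability": avg("danceability"),
        "avg_valence": avg("valence"),
        "avg_tempo": avg("tempo"),
        "track_count": n,
        "sample_artists": sorted({t["artist"] for t in tracks})[:sample_size],
        "sample_tracks": [f'{t["artist"]} - {t["name"]}'
                          for t in tracks[:sample_size]],
    }

# Illustrative two-track library:
tracks = [
    {"artist": "Radiohead", "name": "Reckoner", "genres": ["art rock"],
     "energy": 0.52, "danceability": 0.48, "valence": 0.30, "tempo": 104.0},
    {"artist": "Tame Impala", "name": "Let It Happen",
     "genres": ["psych rock"],
     "energy": 0.78, "danceability": 0.65, "valence": 0.55, "tempo": 125.0},
]
profile = build_taste_profile(tracks)
print(profile["avg_tempo"])  # 114.5
```

Serialized to JSON, this profile plus the user's query is the entire LLM input, which keeps prompts small and cheap regardless of library size.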