diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 0000000..98458f0
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,89 @@
+# Vynl - Audio Analysis Architecture
+
+## Problem
+
+Text-only LLMs cannot listen to music. Text-based recommendations work from artist names, genre associations, and music critic knowledge, never from the actual sound. For genuine sonic analysis, we need a dedicated audio processing pipeline.
+
+## Audio Analysis: Essentia
+
+Essentia (open source, developed by the Music Technology Group at Universitat Pompeu Fabra, Barcelona) is a widely used library for music information retrieval. It analyzes actual audio and, with its algorithms and pre-trained models, can extract or estimate:
+
+- Mood, genre, style classification
+- BPM, key, scale
+- Timbral descriptors (brightness, warmth, roughness)
+- Instrumentation detection
+- Segment boundaries as a basis for song structure (verse/chorus/bridge)
+- Vocal characteristics
+- Audio embeddings for "this sounds like" similarity
+
+Free to self-host (AGPLv3 licensed); it performs the same kind of audio analysis that large streaming services run internally.
+
+## Recommendation Pipeline
+
+```
+User imports playlist
+        │
+        ▼
+Spotify preview clips (30s MP3s) ──→ Essentia (Celery worker)
+        │                                     │
+        │                            Sonic fingerprint:
+        │                            tempo, key, timbre,
+        │                            mood, instrumentation
+        │                                     │
+        ▼                                     ▼
+    Metadata ──────────────────→ LLM (any cheap model)
+    (genres, tags, artist info)   combines sonic data
+                                  + music knowledge
+                                  → recommendations
+                                  + explanations
+```
+
+### Step 1: Audio Ingestion
+- Spotify has provided 30-second preview clips as MP3 URLs for most tracks; access to `preview_url` was restricted for new Web API apps in November 2024, so availability must be verified for this app
+- On playlist import, queue preview downloads as Celery background tasks
+- Store clips temporarily for analysis, delete after processing
+
+### Step 2: Essentia Analysis
+- Runs as a Celery worker processing audio clips
+- Extracts per-track sonic fingerprint:
+  - **Rhythm**: BPM, beat strength, swing
+  - **Tonal**: key, scale, chord complexity
+  - **Timbre**: brightness, warmth, roughness, depth
+  - **Mood**: happy/sad, aggressive/relaxed, electronic/acoustic
+  - **Instrumentation**: detected instruments, vocal presence
+  - **Embeddings**: high-dimensional vector for similarity matching
+- Store fingerprints in the tracks table (JSON + vector column)
+
+### Step 3: Similarity Search
+- Use cosine similarity on audio embeddings to find "sounds like" matches
+- Query against a catalog of pre-analyzed tracks (built up over time from all user imports)
+- Filter by user preferences (mood shift, era, underground level)
+
+### Step 4: LLM Explanation
+- Feed sonic data + metadata to a cheap LLM (Haiku, GPT-4o-mini, Gemini Flash)
+- The LLM's only job is natural-language generation: turning structured sonic data into "why you'll like this" explanations
+- The intelligence is in the audio analysis, not the text generation
+
+## Model Choice
+
+Since the LLM is reasoning over structured data (not doing the analysis), a low-cost model is sufficient. Prices below are approximate list prices and should be checked against each provider's pricing page:
+
+| Model | Cost (USD per 1M tokens) | Good enough? |
+|-------|--------------------------|--------------|
+| Claude Haiku 4.5 | $1 input / $5 output | Yes |
+| GPT-4o-mini | $0.15 input / $0.60 output | Yes |
+| Gemini 2.5 Flash | $0.30 input / $2.50 output | Yes |
+| Claude Sonnet | $3 input / $15 output | Overkill |
+
+Note: Gemini 2.5 can accept raw audio input directly, but Essentia's structured output is more reproducible and easier to validate in a production pipeline.
+
+## Competitive Advantage
+
+This approach means Vynl does what Spotify does internally (audio analysis) but exposes it transparently: users see exactly WHY a song was recommended based on its actual sonic qualities, not just "other listeners also liked this."
+
+## Tech Requirements
+
+- **Essentia**: `pip install essentia-tensorflow` (TensorFlow-enabled build; pre-trained model files are downloaded separately)
+- **Storage**: Temporary audio clip storage during analysis (~500 KB per 30s clip)
+- **Celery worker**: Dedicated worker for audio processing (CPU-bound)
+- **Vector storage**: PostgreSQL with pgvector extension for embedding similarity search
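
The cosine-similarity ranking described in Step 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `cosine_similarity` and `most_similar`, the track IDs, and the tiny 3-dimensional embeddings are all hypothetical, and real audio embeddings would have far more dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero embedding is treated as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: List[float],
    catalog: Dict[str, List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank catalog tracks by cosine similarity to the query embedding."""
    scored = [
        (track_id, cosine_similarity(query, embedding))
        for track_id, embedding in catalog.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical catalog: track IDs and toy embeddings, for illustration only.
example_catalog = {
    "track_a": [1.0, 0.0, 0.0],
    "track_b": [0.9, 0.1, 0.0],
    "track_c": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for track_id, score in most_similar([1.0, 0.0, 0.0], example_catalog, top_k=2):
        print(f"{track_id}: {score:.3f}")
```

A linear scan like this is only reasonable for a small catalog. With the pgvector extension listed under Tech Requirements, the same ranking would more likely be done in the database, where the `<=>` operator computes cosine distance (1 minus cosine similarity).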