Add audio analysis architecture doc with Essentia pipeline design

root
2026-03-30 16:06:57 -05:00
parent 155cbd1bbf
commit 32f7dca1c9

ARCHITECTURE.md Normal file

@@ -0,0 +1,89 @@
# Vynl - Audio Analysis Architecture
## Problem
No LLM can actually listen to music. Text-based recommendations work from artist names, genre associations, and music critic knowledge — never from the actual sound. For genuine sonic analysis, we need a dedicated audio processing pipeline.
## Audio Analysis: Essentia
Essentia is an open-source music information retrieval library from the Music Technology Group at Universitat Pompeu Fabra, Barcelona, and one of the most widely used toolkits in the field. It analyzes the actual audio signal and extracts:
- Mood, genre, style classification
- BPM, key, scale
- Timbral descriptors (brightness, warmth, roughness)
- Instrumentation detection
- Song structure (verse/chorus/bridge)
- Vocal characteristics
- Audio embeddings for "this sounds like" similarity
Free and self-hosted; this is the same class of audio analysis that services like Spotify and Pandora run internally.
## Recommendation Pipeline
```
User imports playlist
        │
        ▼
Spotify preview clips (30s MP3s) ──→ Essentia (Celery worker)
        │                                      │
        │                             Sonic fingerprint:
        │                             tempo, key, timbre,
        │                             mood, instrumentation
        │                                      │
        ▼                                      ▼
Metadata ────────────────────────────→ LLM (any cheap model)
(genres, tags, artist info)            combines sonic data
                                       + music knowledge
                                       → recommendations
                                         + explanations
```
### Step 1: Audio Ingestion
- Spotify exposes 30-second preview clips as MP3 URLs (`preview_url`) for many tracks, though coverage is incomplete and Spotify has restricted the field for newer Web API apps, so an alternate preview source may eventually be needed
- On playlist import, queue preview downloads as Celery background tasks (sketched after this list)
- Store clips temporarily for analysis, delete after processing
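A minimal sketch of the ingestion task, assuming Celery and `requests`; the task name, module path, and the `analyze_preview` hand-off to the Step 2 worker are hypothetical, not existing Vynl code:
```python
import tempfile

import requests
from celery import shared_task

from vynl.tasks import analyze_preview  # hypothetical: the Step 2 analysis task

@shared_task
def fetch_preview(track_id: str, preview_url: str) -> None:
    """Download one 30s preview clip and hand it to the analysis worker."""
    resp = requests.get(preview_url, timeout=30)
    resp.raise_for_status()
    # Keep the clip only for the lifetime of the analysis; the analysis
    # task deletes the file once the fingerprint is stored.
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as clip:
        clip.write(resp.content)
    analyze_preview.delay(track_id, clip.name)
```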
### Step 2: Essentia Analysis
- Runs as a Celery worker processing audio clips (extraction call sketched after this list)
- Extracts per-track sonic fingerprint:
- **Rhythm**: BPM, beat strength, swing
- **Tonal**: key, scale, chord complexity
- **Timbre**: brightness, warmth, roughness, depth
- **Mood**: happy/sad, aggressive/relaxed, electronic/acoustic
- **Instrumentation**: detected instruments, vocal presence
- **Embeddings**: high-dimensional vector for similarity matching
- Store fingerprints in the tracks table (JSON + vector column)
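A sketch of the extraction call. `MusicExtractor`, `MonoLoader`, and `TensorflowPredictMusiCNN` are real Essentia APIs; the fingerprint keys are our assumed schema, and `msd-musicnn-1.pb` is a separate download from the Essentia models page (its embedding layer is 200-dimensional; verify against the model you actually deploy):
```python
import essentia.standard as es

def extract_fingerprint(audio_path: str) -> dict:
    # One pass over the file yields rhythm, tonal, and timbre descriptors.
    features, _ = es.MusicExtractor(
        lowlevelStats=["mean", "stdev"],
        rhythmStats=["mean"],
        tonalStats=["mean"],
    )(audio_path)
    # musicnn models expect 16 kHz mono input; the output node below is the
    # embedding (penultimate dense) layer, one vector per audio frame.
    audio = es.MonoLoader(filename=audio_path, sampleRate=16000)()
    frames = es.TensorflowPredictMusiCNN(
        graphFilename="msd-musicnn-1.pb",
        output="model/dense/BiasAdd",
    )(audio)
    return {
        "bpm": float(features["rhythm.bpm"]),
        "key": f'{features["tonal.key_edma.key"]} {features["tonal.key_edma.scale"]}',
        "brightness": float(features["lowlevel.spectral_centroid.mean"]),
        "roughness": float(features["lowlevel.dissonance.mean"]),
        "embedding": frames.mean(axis=0).tolist(),  # frame average -> track vector
    }
```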
### Step 3: Similarity Search
- Use cosine similarity on audio embeddings to find "sounds like" matches (query sketched after this list)
- Query against a catalog of pre-analyzed tracks (build over time from all user imports)
- Filter by user preferences (mood shift, era, underground level)
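A sketch of the lookup with pgvector via psycopg; the `tracks` table and its columns are the assumed schema from Step 2, not an existing database:
```python
import psycopg

def find_similar(conn: psycopg.Connection, embedding: list[float], limit: int = 20):
    # <=> is pgvector's cosine-distance operator; 1 - distance = similarity.
    return conn.execute(
        """
        SELECT track_id, 1 - (embedding <=> %s::vector) AS similarity
        FROM tracks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (embedding, embedding, limit),
    ).fetchall()
```
Preference filters (mood shift, era, underground level) become WHERE clauses against the stored fingerprint JSON ahead of the ORDER BY.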
### Step 4: LLM Explanation
- Feed sonic data + metadata to a cheap LLM (Haiku, GPT-4o-mini, Gemini Flash); a prompt sketch follows this list
- The LLM's job is just natural language: turning structured sonic data into "why you'll like this" explanations
- The intelligence is in the audio analysis, not the text generation
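Illustrative only; the prompt wording and field names are ours, and any chat-completion client can consume the result:
```python
def build_explanation_prompt(listener_profile: dict, candidate: dict) -> str:
    # The model never hears audio: it only verbalizes Essentia's output.
    return (
        "You are explaining a music recommendation. Using ONLY the sonic "
        "data below, write 2-3 sentences on why the listener will like this "
        "track. Do not invent facts that are not in the data.\n\n"
        f"Listener's sonic profile: {listener_profile}\n"
        f"Recommended track: {candidate}"
    )
```
Constraining the model to the structured fields is what keeps the explanations auditable as well as cheap.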
## Model Choice
Since the LLM is reasoning over structured data (not doing the analysis), the cheapest model wins:
| Model | Cost (per 1M tokens) | Good enough? |
|-------|----------------------|--------------|
| Claude Haiku 4.5 | $1.00 input / $5.00 output | Yes |
| GPT-4o-mini | $0.15 input / $0.60 output | Yes (cheapest) |
| Gemini 2.5 Flash | $0.30 input / $2.50 output | Yes |
| Claude Sonnet | $3 input / $15 output | Overkill |
Note: Gemini 2.5 can accept raw audio input directly, but Essentia's structured output is more reliable and reproducible for a production pipeline.
## Competitive Advantage
This approach means Vynl does what Spotify does internally (audio analysis) but exposes it transparently — users see exactly WHY a song was recommended based on its actual sonic qualities, not just "other listeners also liked this."
## Tech Requirements
- **Essentia**: `pip install essentia-tensorflow` (enables the TensorFlow-based classifiers; the pre-trained model files are separate downloads from the Essentia models page)
- **Storage**: Temporary audio clip storage during analysis (~500KB per 30s clip)
- **Celery worker**: Dedicated worker for audio processing (CPU-bound)
- **Vector storage**: PostgreSQL with the pgvector extension for embedding similarity search (schema sketched below)
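A sketch of the assumed storage schema tying the last two items together; the table name, connection string, and the 200-dimension figure (matching msd-musicnn) are assumptions:
```python
import psycopg

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE tracks
    ADD COLUMN IF NOT EXISTS fingerprint jsonb,
    ADD COLUMN IF NOT EXISTS embedding vector(200);
-- An HNSW index keeps cosine lookups fast as the catalog grows.
CREATE INDEX IF NOT EXISTS tracks_embedding_idx
    ON tracks USING hnsw (embedding vector_cosine_ops);
"""

with psycopg.connect("postgresql://localhost/vynl") as conn:
    conn.execute(DDL)
```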