Vynl - Audio Analysis Architecture
Problem
No LLM can actually listen to music. Text-based recommendations work from artist names, genre associations, and music critic knowledge — never from the actual sound. For genuine sonic analysis, we need a dedicated audio processing pipeline.
Audio Analysis: Essentia
Essentia, an open-source library from the Music Technology Group at Universitat Pompeu Fabra in Barcelona, is a widely used toolkit for music information retrieval. It analyzes the audio signal itself and, together with its pre-trained models, extracts:
- Mood, genre, style classification
- BPM, key, scale
- Timbral descriptors (brightness, warmth, roughness)
- Instrumentation detection
- Song structure (verse/chorus/bridge)
- Vocal characteristics
- Audio embeddings for "this sounds like" similarity
It is free to self-host (released under the AGPLv3 license) and covers the kind of audio analysis that Spotify- or Pandora-type services run under the hood.
Recommendation Pipeline
User imports playlist
        │
        ▼
Spotify preview clips (30s MP3s) ──→ Essentia (Celery worker)
        │                                   │
        │                            Sonic fingerprint:
        │                            tempo, key, timbre,
        │                            mood, instrumentation
        │                                   │
        ▼                                   ▼
Metadata ──────────────────→ LLM (any cheap model)
(genres, tags, artist info)  combines sonic data
                             + music knowledge
                             → recommendations
                             + explanations
Step 1: Audio Ingestion
- Spotify's track objects carry a preview_url field pointing to a 30-second MP3 clip; the field can be null, so ingestion must skip or retry tracks without a preview
- On playlist import, queue preview downloads as Celery background tasks
- Store clips temporarily for analysis, delete after processing
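The download, analyze, delete cycle above can be sketched as a single function. In the pipeline this body would sit inside a Celery task; here `fetch` and `analyze` are hypothetical callables (not names from the design) injected so the sketch runs without network access or Essentia.

```python
import os
import tempfile
from typing import Callable


def process_preview(
    preview_url: str,
    fetch: Callable[[str], bytes],
    analyze: Callable[[str], dict],
) -> dict:
    """Download one preview clip, analyze it, and always delete the clip.

    `fetch` and `analyze` are hypothetical placeholders for the HTTP download
    and the Essentia analysis step.
    """
    # Write the clip to a named temp file so the analyzer can open it by path.
    fd, path = tempfile.mkstemp(suffix=".mp3")
    try:
        with os.fdopen(fd, "wb") as clip:
            clip.write(fetch(preview_url))
        return analyze(path)
    finally:
        # Clips are only kept for the duration of the analysis,
        # even if the download or the analysis raises.
        if os.path.exists(path):
            os.remove(path)
```

Putting the cleanup in `finally` means a failed analysis does not leave clips behind, which matters once a worker has processed thousands of imports.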
Step 2: Essentia Analysis
- Runs as a Celery worker processing audio clips
- Extracts per-track sonic fingerprint:
  - Rhythm: BPM, beat strength, swing
  - Tonal: key, scale, chord complexity
  - Timbre: brightness, warmth, roughness, depth
  - Mood: happy/sad, aggressive/relaxed, electronic/acoustic
  - Instrumentation: detected instruments, vocal presence
  - Embeddings: high-dimensional vector for similarity matching
- Store fingerprints in the tracks table (JSON + vector column)
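One possible shape for the per-track fingerprint, as a rough sketch. The class and field names are assumptions, not a fixed schema. The `extract_basic` helper covers only rhythm and key, using Essentia's standard-mode `MonoLoader`, `RhythmExtractor2013`, and `KeyExtractor`; mood, instrumentation, and embeddings need the TensorFlow models and are left out here.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SonicFingerprint:
    """Per-track analysis result; field names are illustrative only."""
    bpm: float
    key: str
    scale: str
    mood: dict = field(default_factory=dict)         # e.g. {"happy": 0.7}
    instruments: list = field(default_factory=list)
    embedding: list = field(default_factory=list)    # stored in the vector column

    def to_json(self) -> str:
        # Everything except the embedding goes into the JSON column.
        data = asdict(self)
        data.pop("embedding")
        return json.dumps(data, sort_keys=True)


def extract_basic(path: str) -> SonicFingerprint:
    """Rhythm and key only; requires Essentia to be installed."""
    # Imported lazily so the dataclass is usable without Essentia.
    import essentia.standard as es

    audio = es.MonoLoader(filename=path, sampleRate=44100)()
    bpm, _beats, _confidence, _estimates, _intervals = es.RhythmExtractor2013(
        method="multifeature"
    )(audio)
    key, scale, _strength = es.KeyExtractor()(audio)
    return SonicFingerprint(bpm=float(bpm), key=key, scale=scale)
```

Keeping the embedding out of the JSON payload mirrors the "JSON + vector column" split: the descriptors stay human-readable, while the vector is stored where it can be indexed.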
Step 3: Similarity Search
- Use cosine similarity on audio embeddings to find "sounds like" matches
- Query against a catalog of pre-analyzed tracks (build over time from all user imports)
- Filter by user preferences (mood shift, era, underground level)
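As an illustration of the ranking step, here is a small in-memory sketch of cosine similarity with a preference filter applied first. The `sounds_like` helper, the `keep` predicate, and the `"id"` / `"embedding"` keys are hypothetical; a production setup would more likely push this query into the database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as similar to nothing
    return dot / (norm_a * norm_b)


def sounds_like(query, catalog, top_k=5, keep=lambda track: True):
    """Rank catalog tracks by similarity to `query` after a preference filter.

    `catalog` is a list of dicts with hypothetical "id" and "embedding" keys;
    `keep` stands in for user preferences such as era or mood.
    """
    candidates = [t for t in catalog if keep(t)]
    ranked = sorted(
        candidates,
        key=lambda t: cosine_similarity(query, t["embedding"]),
        reverse=True,
    )
    return [t["id"] for t in ranked[:top_k]]
```

For example, `sounds_like(seed_embedding, catalog, keep=lambda t: t["year"] >= 2000)` would restrict matches to a given era before ranking, assuming each catalog entry carries a `"year"` field.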
Step 4: LLM Explanation
- Feed sonic data + metadata to a cheap LLM (Haiku, GPT-4o-mini, Gemini Flash)
- The LLM's job is just natural language: turning structured sonic data into "why you'll like this" explanations
- The intelligence is in the audio analysis, not the text generation
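To make the division of labor concrete, a prompt for this step might be assembled roughly as follows. The function name, the dict keys, and the prompt wording are all assumptions; the point is only that the model receives measured attributes and is asked to phrase them, not to judge the audio.

```python
import json


def build_explanation_prompt(seed: dict, candidate: dict) -> str:
    """Assemble a prompt from already-computed sonic data.

    The dict contents (e.g. "title", "fingerprint") are hypothetical; the
    model is asked only to phrase the comparison, not to analyze audio.
    """
    payload = {
        "liked_track": seed,
        "recommended_track": candidate,
    }
    return (
        "You are writing a short 'why you'll like this' note for a music app.\n"
        "Use only the measured attributes in the JSON below; do not invent "
        "facts about how either track sounds.\n\n"
        + json.dumps(payload, indent=2, sort_keys=True)
    )
```

Because the prompt is plain structured text, the same string can be sent to any of the inexpensive models without changes to the analysis side.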
Model Choice
Since the LLM is reasoning over structured data (not doing the analysis), the cheapest model wins:
| Model | Cost (per 1M tokens) | Good enough? |
|---|---|---|
| Claude Haiku 4.5 | $1 input / $5 output | Yes |
| GPT-4o-mini | $0.15 input / $0.60 output | Yes |
| Gemini 2.5 Flash | $0.30 input / $2.50 output | Yes |
| Claude Sonnet | $3 input / $15 output | Overkill |
Note: Gemini 2.5 can accept raw audio input directly, but Essentia's structured output is more reliable and reproducible for a production pipeline.
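To put the price gap in per-request terms, here is a back-of-the-envelope calculation. The token counts (1,500 in, 300 out per explanation) are assumed for illustration and are not measurements; the prices are the GPT-4o-mini and Claude Sonnet rows from the table.

```python
def request_cost_usd(input_tokens, output_tokens, input_price, output_price):
    """Cost of one call, with prices given in USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed workload: 1,500 input tokens and 300 output tokens per explanation.
small = request_cost_usd(1_500, 300, input_price=0.15, output_price=0.60)
large = request_cost_usd(1_500, 300, input_price=3.00, output_price=15.00)

print(f"small model: ${small:.6f} per explanation")
print(f"large model: ${large:.6f} per explanation")
```

Under those assumptions the small model costs about $0.0004 per explanation against about $0.009 for the large one, roughly a 22x difference for output that only needs to paraphrase structured data.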
Competitive Advantage
This approach means Vynl does what Spotify does internally (audio analysis) but exposes it transparently — users see exactly WHY a song was recommended based on its actual sonic qualities, not just "other listeners also liked this."
Tech Requirements
- Essentia: pip install essentia-tensorflow (TensorFlow-enabled build for running the pre-trained models; the model files themselves are downloaded separately)
- Storage: Temporary audio clip storage during analysis (~500KB per 30s clip)
- Celery worker: Dedicated worker for audio processing (CPU-bound)
- Vector storage: PostgreSQL with pgvector extension for embedding similarity search
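A possible starting point for the pgvector side, with the SQL kept as Python strings. The table and column names are hypothetical, and the dimension 512 is a placeholder that would have to match whichever embedding model is used. The `%(name)s` placeholders assume a psycopg-style driver.

```python
# Schema sketch: fingerprint descriptors in JSONB, embedding in a vector column.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS tracks (
    id          BIGSERIAL PRIMARY KEY,
    spotify_id  TEXT UNIQUE NOT NULL,
    fingerprint JSONB NOT NULL,
    embedding   vector(512)  -- placeholder dimension (assumption)
);
"""

# `<=>` is pgvector's cosine-distance operator: smaller means more similar.
NEAREST_SQL = """
SELECT id, spotify_id, embedding <=> %(query)s::vector AS distance
FROM tracks
WHERE spotify_id <> %(exclude)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s;
"""


def to_pgvector_literal(values):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

A query would then pass something like `{"query": to_pgvector_literal(seed_embedding), "exclude": seed_spotify_id, "limit": 10}` as parameters, so that the seed track itself is not returned as its own nearest neighbor.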