Vynl - Audio Analysis Architecture

Problem

Text-only LLMs cannot listen to music. Their recommendations draw on artist names, genre associations, and music-critic knowledge, never on the actual sound. Genuine sonic analysis needs a dedicated audio processing pipeline.

Audio Analysis: Essentia

Essentia (open source, developed by the Music Technology Group at Universitat Pompeu Fabra, Barcelona) is a widely used library for music information retrieval. It analyzes the audio signal itself and, together with its pre-trained models, can extract:

Mood, genre, style classification
BPM, key, scale
Timbral descriptors (brightness, warmth, roughness)
Instrumentation detection
Song structure (segment boundaries; labeling sections as verse/chorus/bridge needs additional logic)
Vocal characteristics
Audio embeddings for "this sounds like" similarity

Free (AGPLv3) and self-hosted. It performs the same kind of audio analysis that Spotify/Pandora-type services run under the hood.

Recommendation Pipeline

User imports playlist
       │
       ▼
Spotify preview clips (30s MP3s) ──→ Essentia (Celery worker)
       │                               │
       │                        Sonic fingerprint:
       │                        tempo, key, timbre,
       │                        mood, instrumentation
       │                               │
       ▼                               ▼
    Metadata ──────────────────→ LLM (any cheap model)
    (genres, tags, artist info)   combines sonic data
                                  + music knowledge
                                  → recommendations
                                  + explanations

Step 1: Audio Ingestion

Spotify exposes 30-second MP3 preview clips through each track's preview_url field; the field can be null, so some tracks will have no clip to analyze
On playlist import, queue preview downloads as Celery background tasks
Store clips temporarily for analysis, delete after processing
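
The download, analyze, delete cycle above can be sketched with the standard library alone. This is a minimal sketch, not the project's implementation: `process_preview` and its `analyze` callback are hypothetical names, and in the real pipeline the function body would run inside a Celery task (the decorator is left out so the example has no third-party dependencies).

```python
import os
import tempfile
import urllib.request
from typing import Callable, TypeVar

T = TypeVar("T")


def process_preview(preview_url: str, analyze: Callable[[str], T],
                    timeout: float = 10.0) -> T:
    """Download a preview clip to a temp file, analyze it, then delete it.

    `analyze` is any function that takes a local file path (hypothetical hook
    for the audio analysis step). The clip is removed even if analysis fails.
    """
    fd, path = tempfile.mkstemp(suffix=".mp3")
    try:
        # Write the downloaded bytes to the temp file, closing it before analysis.
        with os.fdopen(fd, "wb") as out, \
                urllib.request.urlopen(preview_url, timeout=timeout) as resp:
            out.write(resp.read())
        return analyze(path)
    finally:
        # "Delete after processing": the clip never outlives the task.
        os.remove(path)
```

Passing the analysis step in as a callback keeps ingestion testable without audio libraries installed; a Celery task would simply call `process_preview(url, some_analysis_function)`.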

Step 2: Essentia Analysis

Runs as a Celery worker processing audio clips
Extracts per-track sonic fingerprint:
- Rhythm: BPM, beat strength, swing
- Tonal: key, scale, chord complexity
- Timbre: brightness, warmth, roughness, depth
- Mood: happy/sad, aggressive/relaxed, electronic/acoustic
- Instrumentation: detected instruments, vocal presence
- Embeddings: high-dimensional vector for similarity matching
Store fingerprints in the tracks table (JSON + vector column)
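
As a rough illustration of the rhythm and tonal part of the fingerprint, the sketch below uses Essentia's `MonoLoader`, `RhythmExtractor2013`, and `KeyExtractor` algorithms. The `SonicFingerprint` class and `analyze_clip` function are hypothetical names, and the field selection is an assumption. Mood, instrumentation, and embeddings need the separately downloaded TensorFlow models and are omitted here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SonicFingerprint:
    """Subset of the per-track fingerprint (rhythm + tonal only, by assumption)."""
    bpm: float
    beat_confidence: float
    key: str
    scale: str
    key_strength: float

    def to_json(self) -> str:
        # JSON form suitable for a JSON column in the tracks table.
        return json.dumps(asdict(self), sort_keys=True)


def analyze_clip(path: str) -> SonicFingerprint:
    """Extract tempo and key from one audio file with Essentia."""
    # Imported lazily so this module still loads where Essentia is not installed.
    import essentia.standard as es

    audio = es.MonoLoader(filename=path, sampleRate=44100)()
    bpm, _beats, beat_confidence, _estimates, _intervals = \
        es.RhythmExtractor2013(method="multifeature")(audio)
    key, scale, key_strength = es.KeyExtractor()(audio)
    # Essentia returns NumPy scalars; cast so the values are JSON-serializable.
    return SonicFingerprint(float(bpm), float(beat_confidence),
                            str(key), str(scale), float(key_strength))
```

The JSON string from `to_json()` would go in the JSON column, with the embedding stored separately in the vector column.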

Step 3: Similarity Search

Use cosine similarity on audio embeddings to find "sounds like" matches
Query against a catalog of pre-analyzed tracks (build over time from all user imports)
Filter by user preferences (mood shift, era, underground level)
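
To make the "sounds like" lookup concrete, here is a small in-memory sketch of cosine-similarity ranking with an optional preference filter. The function names and the `keep` predicate are illustrative assumptions. In production the same ranking would more likely be done inside PostgreSQL, where pgvector's `<=>` operator computes cosine distance.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def most_similar(query: Vector, catalog: dict[str, Vector], k: int = 5,
                 keep: Optional[Callable[[str], bool]] = None) -> list[tuple[str, float]]:
    """Return the k catalog tracks closest to `query`, best match first.

    `keep` is a hypothetical hook for user-preference filters (mood shift,
    era, underground level); tracks for which it returns False are skipped.
    """
    scored = [(track_id, cosine_similarity(query, vec))
              for track_id, vec in catalog.items()
              if keep is None or keep(track_id)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A linear scan like this is fine for a small catalog. As the catalog grows from user imports, an indexed vector search becomes the better fit.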

Step 4: LLM Explanation

Feed sonic data + metadata to a cheap LLM (Haiku, GPT-4o-mini, Gemini Flash)
The LLM's job is just natural language: turning structured sonic data into "why you'll like this" explanations
The intelligence is in the audio analysis, not the text generation
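
One way to hand the structured data to the model is to serialize it into the prompt and tell the model to stay within it. The sketch below is an assumption about prompt shape, not a tested prompt. `build_explanation_prompt` and the dictionary keys are hypothetical, and the actual API call to the chosen provider is left out.

```python
import json


def build_explanation_prompt(seed: dict, candidate: dict) -> str:
    """Build a prompt asking the LLM to explain a match using only given data."""
    payload = {"seed_track": seed, "recommended_track": candidate}
    return (
        "You are writing a short 'why you'll like this' note for a music app.\n"
        "Use only the structured data below; do not invent sonic details.\n"
        "Explain in 2-3 sentences how the recommended track resembles the seed track.\n\n"
        + json.dumps(payload, indent=2, sort_keys=True)
    )


# Example with made-up values for illustration only.
prompt = build_explanation_prompt(
    {"title": "Seed", "bpm": 122, "key": "A minor", "mood": "relaxed"},
    {"title": "Match", "bpm": 120, "key": "A minor", "mood": "relaxed"},
)
```

Because the model only rephrases fields that were computed upstream, its output can be checked against the same fingerprint data.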

Model Choice

Since the LLM is reasoning over structured data (not doing the analysis), the cheapest model wins:

Model	Cost (per 1M tokens)	Good enough?
Claude Haiku 4.5	$1 input / $5 output	Yes
GPT-4o-mini	$0.15 input / $0.60 output	Yes, cheapest listed
Gemini 2.5 Flash	$0.30 input / $2.50 output	Yes
Claude Sonnet	$3 input / $15 output	Overkill
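
To see what these per-token prices mean per recommendation, the arithmetic can be worked through as below. The token counts (1,200 input and 300 output tokens per explanation) are assumptions for illustration, not measurements. The prices are the GPT-4o-mini row from the table.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request given token counts and prices per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed request size: 1,200 input tokens, 300 output tokens.
per_explanation = cost_usd(1_200, 300, 0.15, 0.60)
per_thousand = per_explanation * 1_000
print(f"${per_thousand:.2f} per 1,000 explanations")
```

Under those assumptions the cost is roughly $0.36 per 1,000 explanations, which is why model price rather than model capability drives the choice here.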

Note: Gemini 2.5 can accept raw audio input directly, but Essentia's structured output is more reliable and reproducible for a production pipeline.

Competitive Advantage

This approach means Vynl does what Spotify does internally (audio analysis) but exposes it transparently — users see exactly WHY a song was recommended based on its actual sonic qualities, not just "other listeners also liked this."

Tech Requirements

Essentia: pip install essentia-tensorflow (adds TensorFlow model inference; the pre-trained model files are downloaded separately)
Storage: Temporary audio clip storage during analysis (~500KB per 30s clip)
Celery worker: Dedicated worker for audio processing (CPU-bound)
Vector storage: PostgreSQL with pgvector extension for embedding similarity search
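
The ~500KB clip figure and the resulting temporary storage need can be sanity-checked with simple arithmetic. The 128 kbps bitrate and the 200 in-flight clips below are assumptions chosen for illustration; the actual preview bitrate and worker concurrency may differ.

```python
def clip_size_bytes(duration_s: float, bitrate_kbps: int) -> int:
    """Approximate size of a constant-bitrate clip (bits per second / 8)."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


def peak_temp_storage_mb(clips_in_flight: int, clip_bytes: int = 500_000) -> float:
    """Temp storage needed if this many clips are on disk at the same time."""
    return clips_in_flight * clip_bytes / 1_000_000


size = clip_size_bytes(30, 128)       # 30-second clip at an assumed 128 kbps
peak = peak_temp_storage_mb(200)      # assumed 200 clips on disk at once
```

Since clips are deleted after processing, storage scales with how many clips are on disk at once, not with catalog size.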
