vynl/ARCHITECTURE.md

Vynl - Audio Analysis Architecture

Problem

No LLM can actually listen to music. Text-based recommendations work from artist names, genre associations, and music critic knowledge — never from the actual sound. For genuine sonic analysis, we need a dedicated audio processing pipeline.

Audio Analysis: Essentia

Essentia (open source, developed by the Music Technology Group at Universitat Pompeu Fabra, Barcelona) is a de facto standard toolkit for music information retrieval. It analyzes actual audio and extracts:

  • Mood, genre, style classification
  • BPM, key, scale
  • Timbral descriptors (brightness, warmth, roughness)
  • Instrumentation detection
  • Song structure (verse/chorus/bridge)
  • Vocal characteristics
  • Audio embeddings for "this sounds like" similarity

Free and self-hostable; the same class of audio analysis runs under the hood at Spotify/Pandora-type services.
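
As a sketch of what extraction of the descriptors above might look like (not pinned to a specific Essentia version; the descriptor key names follow MusicExtractor's output pool and may vary, and the returned field names are illustrative):

```python
def sonic_fingerprint(path: str) -> dict:
    """Extract a compact fingerprint from one audio clip via Essentia.

    Descriptor key names follow MusicExtractor's output pool and can
    differ across Essentia versions; treat them as assumptions.
    """
    import essentia.standard as es  # heavy import; keep it inside the worker

    features, _frames = es.MusicExtractor()(path)
    return {
        "bpm": features["rhythm.bpm"],
        "key": features["tonal.key_krumhansl.key"],
        "scale": features["tonal.key_krumhansl.scale"],
        # spectral centroid is a common proxy for perceived brightness
        "brightness": features["lowlevel.spectral_centroid.mean"],
    }
```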

Recommendation Pipeline

User imports playlist
       │
       ▼
Spotify preview clips (30s MP3s) ──→ Essentia (Celery worker)
       │                               │
       │                        Sonic fingerprint:
       │                        tempo, key, timbre,
       │                        mood, instrumentation
       │                               │
       ▼                               ▼
    Metadata ──────────────────→ LLM (any cheap model)
    (genres, tags, artist info)   combines sonic data
                                  + music knowledge
                                  → recommendations
                                  + explanations

Step 1: Audio Ingestion

  • Spotify provides 30-second preview clips as MP3 URLs for most tracks
  • On playlist import, queue preview downloads as Celery background tasks
  • Store clips temporarily for analysis, delete after processing
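
A minimal sketch of the ingestion step using only the standard library; in production `fetch_preview` would be registered as a Celery task, and the names here (`AUDIO_DIR`, `preview_path`) are illustrative, not part of any real API:

```python
import os
import tempfile
import urllib.request

# Hypothetical scratch location; clips are deleted after analysis.
AUDIO_DIR = os.path.join(tempfile.gettempdir(), "vynl_previews")

def preview_path(track_id: str) -> str:
    """On-disk location for one track's 30-second preview clip."""
    os.makedirs(AUDIO_DIR, exist_ok=True)
    return os.path.join(AUDIO_DIR, f"{track_id}.mp3")

def fetch_preview(track_id: str, preview_url: str) -> str:
    """Download the preview MP3 and return its path.

    In production this would be registered as a Celery task (e.g. with
    @app.task) so downloads don't block the web request.
    """
    path = preview_path(track_id)
    urllib.request.urlretrieve(preview_url, path)
    return path
```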

Step 2: Essentia Analysis

  • Runs as a Celery worker processing audio clips
  • Extracts per-track sonic fingerprint:
    • Rhythm: BPM, beat strength, swing
    • Tonal: key, scale, chord complexity
    • Timbre: brightness, warmth, roughness, depth
    • Mood: happy/sad, aggressive/relaxed, electronic/acoustic
    • Instrumentation: detected instruments, vocal presence
    • Embeddings: high-dimensional vector for similarity matching
  • Store fingerprints in the tracks table (JSON + vector column)

Step 3: Similarity Matching

  • Use cosine similarity on audio embeddings to find "sounds like" matches
  • Query against a catalog of pre-analyzed tracks (built over time from all user imports)
  • Filter by user preferences (mood shift, era, underground level)
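
The "sounds like" matching can be sketched with plain NumPy cosine similarity; pgvector performs the same computation in-database, and the function names here are illustrative:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two audio-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_matches(query: np.ndarray, catalog: list, k: int = 5) -> list:
    """Indices of the k catalog embeddings most similar to the query."""
    sims = np.array([cosine_similarity(query, emb) for emb in catalog])
    # argsort ascending, reversed for most-similar-first
    return list(np.argsort(sims)[::-1][:k])
```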

Step 4: LLM Explanation

  • Feed sonic data + metadata to a cheap LLM (Haiku, GPT-4o-mini, Gemini Flash)
  • The LLM's job is just natural language: turning structured sonic data into "why you'll like this" explanations
  • The intelligence is in the audio analysis, not the text generation
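
A hedged sketch of how the structured fingerprint might be handed to the LLM; the prompt wording and JSON field names are assumptions, not a fixed schema:

```python
import json

def explanation_prompt(seed: dict, candidate: dict) -> str:
    """Turn two structured sonic fingerprints into a prompt for a cheap LLM.

    The wording and field names are illustrative, not a fixed contract.
    """
    return (
        "You are a music recommender. Using the two sonic fingerprints "
        "below, explain in two sentences why a listener of SEED will "
        "enjoy CANDIDATE. Ground every claim in the fingerprint values.\n"
        f"SEED: {json.dumps(seed)}\n"
        f"CANDIDATE: {json.dumps(candidate)}"
    )
```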

Model Choice

Since the LLM is reasoning over structured data (not doing the analysis), the cheapest model wins:

| Model | Cost (per 1M tokens) | Good enough? |
|-------|----------------------|--------------|
| Claude Haiku 4.5 | $0.25 input / $1.25 output | Yes — best value |
| GPT-4o-mini | $0.15 input / $0.60 output | Yes |
| Gemini 2.5 Flash | $0.15 input / $0.60 output | Yes |
| Claude Sonnet | $3 input / $15 output | Overkill |

Note: Gemini 2.5 can accept raw audio input directly, but Essentia's structured output is more reliable and reproducible for a production pipeline.

Competitive Advantage

This approach means Vynl does what Spotify does internally (audio analysis) but exposes it transparently — users see exactly WHY a song was recommended based on its actual sonic qualities, not just "other listeners also liked this."

Tech Requirements

  • Essentia: pip install essentia-tensorflow (the TensorFlow build is required for the pre-trained classifier models, which are downloaded separately from Essentia's model repository)
  • Storage: Temporary audio clip storage during analysis (~500KB per 30s clip)
  • Celery worker: Dedicated worker for audio processing (CPU-bound)
  • Vector storage: PostgreSQL with pgvector extension for embedding similarity search
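
The pgvector pieces above can be sketched as schema and query strings; the column names and the 512-dim embedding size are assumptions (the real dimension depends on the embedding model):

```python
# Schema sketch: add the fingerprint JSON and embedding vector to tracks.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE tracks
    ADD COLUMN fingerprint jsonb,
    ADD COLUMN embedding vector(512);
"""

# pgvector's <=> operator computes cosine distance (smaller = more similar).
SOUNDS_LIKE = """
SELECT id, title, embedding <=> %(query_embedding)s AS distance
FROM tracks
ORDER BY embedding <=> %(query_embedding)s
LIMIT 10;
"""
```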