Revise architecture doc to reflect actual data pipeline (Spotify audio features + LLM)
This commit is contained in:
140
ARCHITECTURE.md
@@ -1,22 +1,28 @@

# Vynl - Audio Analysis Architecture

# Vynl - Recommendation Architecture

## Problem

## Data Sources

No LLM can actually listen to music. Text-based recommendations work from artist names, genre associations, and music critic knowledge — never from the actual sound. For genuine sonic analysis, we need a dedicated audio processing pipeline.

### Spotify Audio Features API (already integrated)

Pre-computed by Spotify for every track:

- **Tempo** (BPM)
- **Energy** (0.0–1.0, intensity/activity)
- **Danceability** (0.0–1.0)
- **Valence** (0.0–1.0, musical positivity)
- **Acousticness** (0.0–1.0)
- **Instrumentalness** (0.0–1.0)
- **Key** and **Mode** (major/minor)
- **Loudness** (dB)
- **Speechiness** (0.0–1.0)
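
As a sketch of how these pre-computed features can feed recommendations, a hypothetical `describe_track` helper might map one audio-features object (the field names `tempo`, `energy`, `valence`, `danceability`, `acousticness`, and `mode` follow Spotify's schema; the thresholds are arbitrary) to rough human-readable tags:

```python
def describe_track(features: dict) -> list[str]:
    """Turn a Spotify audio-features object into rough human-readable tags.

    Thresholds are illustrative guesses, not calibrated values.
    """
    tags = []
    tags.append("high-energy" if features["energy"] >= 0.6 else "mellow")
    tags.append("upbeat" if features["valence"] >= 0.5 else "melancholy")
    if features["danceability"] >= 0.7:
        tags.append("danceable")
    if features["acousticness"] >= 0.7:
        tags.append("acoustic")
    tags.append("major" if features["mode"] == 1 else "minor")
    tags.append(f"{round(features['tempo'])} BPM")
    return tags

# Example feature object shaped like Spotify's audio-features response
sample = {"tempo": 118.5, "energy": 0.65, "valence": 0.42,
          "danceability": 0.55, "acousticness": 0.1, "mode": 0}
print(describe_track(sample))
```

Tags like these can be fed straight into the LLM prompt later, so the model never has to guess at sonic qualities.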

## Audio Analysis: Essentia

### Metadata (from Spotify + supplementary APIs)

- Artist name, album, release date
- Genres and tags
- Popularity score
- Related artists

Essentia (open source, by the Music Technology Group in Barcelona) is an industry-standard library for music information retrieval. It analyzes the actual audio and extracts:

- Mood, genre, and style classification
- BPM, key, scale
- Timbral descriptors (brightness, warmth, roughness)
- Instrumentation detection
- Song structure (verse/chorus/bridge)
- Vocal characteristics
- Audio embeddings for "this sounds like" similarity

Free, self-hosted, and used under the hood by Spotify/Pandora-type services.

### Supplementary APIs (to add)

- **MusicBrainz** — artist relationships, detailed genre/tag taxonomy, release info
- **Last.fm** — similar artists, user-generated tags, listener overlap stats

## Recommendation Pipeline

@@ -24,66 +30,64 @@ Free, self-hosted, used by Spotify/Pandora-type services under the hood.

```
User imports playlist
        │
        ▼
Spotify preview clips (30s MP3s) ──→ Essentia (Celery worker)
        │                                │
        │                         Sonic fingerprint:
        │                         tempo, key, timbre,
        │                         mood, instrumentation
        │                                │
        ▼                                ▼
Metadata ───────────────────────→ LLM (any cheap model)
(genres, tags, artist info)       combines sonic data
                                  + music knowledge
                                  → recommendations
                                  + explanations

Spotify API ──→ Track metadata + audio features
        │
        ▼
Build taste profile:
  - Genre distribution
  - Average energy/danceability/valence/tempo
  - Mood tendencies
  - Sample artists and tracks
        │
        ▼
LLM (cheap model) receives:
  - Structured taste profile
  - User's specific request/query
  - List of tracks already in library (to exclude)
        │
        ▼
Returns recommendations with
"why you'll like this" explanations
```

### Step 1: Audio Ingestion

- Spotify provides 30-second preview clips as MP3 URLs for most tracks
- On playlist import, queue preview downloads as Celery background tasks
- Store clips temporarily for analysis, delete after processing
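
A minimal sketch of the store-temporarily-then-delete step. The `analyze` callable stands in for the real Essentia worker, and the function names are illustrative, not the actual Celery task code:

```python
import tempfile
from pathlib import Path

def analyze_preview(clip_bytes: bytes, analyze) -> dict:
    """Write a preview clip to temp storage, run analysis, always clean up.

    In the real pipeline a Celery task would fetch the bytes from the
    track's preview URL before calling this.
    """
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(clip_bytes)
        path = Path(f.name)
    try:
        return analyze(path)           # e.g. the Essentia extraction step
    finally:
        path.unlink(missing_ok=True)   # delete the clip after processing

# Usage with a dummy analyzer that just reports the clip size:
print(analyze_preview(b"fake-mp3-bytes", lambda p: {"size": p.stat().st_size}))
```

The `try/finally` guarantees clips are deleted even if analysis throws, which keeps temporary storage bounded.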

### Step 2: Essentia Analysis

- Runs as a Celery worker processing audio clips
- Extracts a per-track sonic fingerprint:
  - **Rhythm**: BPM, beat strength, swing
  - **Tonal**: key, scale, chord complexity
  - **Timbre**: brightness, warmth, roughness, depth
  - **Mood**: happy/sad, aggressive/relaxed, electronic/acoustic
  - **Instrumentation**: detected instruments, vocal presence
  - **Embeddings**: high-dimensional vector for similarity matching
- Store fingerprints in the tracks table (JSON + vector column)
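
One possible shape for that row, sketched as a hypothetical `SonicFingerprint` dataclass (field names are assumptions, not the actual schema). The embedding is kept out of the JSON column since it lives in the vector column:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SonicFingerprint:
    """Illustrative per-track fingerprint stored as JSON + vector."""
    bpm: float
    key: str
    scale: str                     # "major" / "minor"
    mood: dict                     # e.g. {"happy": 0.2, "aggressive": 0.7}
    instruments: list
    embedding: list = field(default_factory=list)  # goes to the vector column

    def to_json(self) -> str:
        # Everything except the embedding is serialized into the JSON column.
        row = asdict(self)
        row.pop("embedding")
        return json.dumps(row)
```

Splitting the scalar descriptors from the embedding keeps the JSON human-readable while the vector stays queryable by pgvector.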

### Step 3: Similarity Search

- Use cosine similarity on audio embeddings to find "sounds like" matches
- Query against a catalog of pre-analyzed tracks (built over time from all user imports)
- Filter by user preferences (mood shift, era, underground level)
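
Cosine similarity itself is simple; a pure-Python sketch for clarity (in production this would be pgvector's distance operator over the vector column, not an in-process loop):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embeddings; 1.0 means identical direction.

    pgvector's `<=>` operator returns cosine *distance*, i.e. 1 - this value.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_matches(query: list[float], catalog: list[tuple], k: int = 3):
    """Rank a catalog of (track_id, embedding) pairs by similarity to query."""
    scored = [(tid, cosine_similarity(query, emb)) for tid, emb in catalog]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

Preference filters (mood shift, era, underground level) would be applied as WHERE clauses before the vector ranking.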

### Step 4: LLM Explanation

- Feed sonic data + metadata to a cheap LLM (Haiku, GPT-4o-mini, Gemini Flash)
- The LLM's job is just natural language: turning structured sonic data into "why you'll like this" explanations
- The intelligence is in the audio analysis, not the text generation
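
A hedged sketch of what the prompt assembly for that step might look like; the wording and dictionary keys are assumptions, not the actual prompt:

```python
def build_explanation_prompt(fingerprint: dict, metadata: dict) -> str:
    """Assemble an explanation prompt from structured sonic data + metadata.

    The instruction to not invent sonic qualities keeps the cheap model
    grounded in the analysis rather than hallucinating sound descriptions.
    """
    return (
        "You are a music recommender. Using only the sonic analysis below, "
        "explain in 1-2 sentences why the listener will like this track.\n"
        f"Sonic analysis: {fingerprint}\n"
        f"Metadata: {metadata}\n"
        "Do not invent sonic qualities that are not in the analysis."
    )

print(build_explanation_prompt({"bpm": 120, "mood": "melancholy"},
                               {"genres": ["indie rock"]}))
```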

## Model Choice

Since the LLM is reasoning over structured data (not doing the analysis), the cheapest model wins:

The LLM reasons over structured audio-feature data plus metadata. It needs broad music knowledge but not heavy reasoning. Cheapest model wins:

| Model | Cost (per 1M tokens) | Good enough? |
|-------|----------------------|--------------|
| Claude Haiku 4.5 | $0.25 input / $1.25 output | Yes — best value |
| GPT-4o-mini | $0.15 input / $0.60 output | Yes |
| Gemini 2.5 Flash | $0.15 input / $0.60 output | Yes |
| Claude Sonnet | $3 input / $15 output | Overkill |

| Model | Cost (per 1M tokens) | Notes |
|-------|----------------------|-------|
| Claude Haiku 4.5 | $0.25 in / $1.25 out | Best value, great music knowledge |
| GPT-4o-mini | $0.15 in / $0.60 out | Cheapest option |
| Gemini 2.5 Flash | $0.15 in / $0.60 out | Also cheap, good quality |
| Claude Sonnet | $3 in / $15 out | Overkill for this task |

Note: Gemini 2.5 can accept raw audio input directly, but Essentia's structured output is more reliable and reproducible for a production pipeline.
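
Back-of-envelope cost arithmetic from the prices above, assuming roughly 1,500 input tokens (taste profile + query) and 500 output tokens per recommendation request; the token counts are guesses, the per-1M prices come from the table:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_per_m: float, out_per_m: float) -> float:
    """Dollar cost of one LLM call given per-1M-token prices."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

haiku = request_cost(1500, 500, 0.25, 1.25)  # Claude Haiku 4.5 prices above
mini = request_cost(1500, 500, 0.15, 0.60)   # GPT-4o-mini prices above
print(f"Haiku: ${haiku:.6f}  4o-mini: ${mini:.6f}")
# → Haiku: $0.001000  4o-mini: $0.000525
```

At a fraction of a cent per request, model cost is negligible next to the compute for audio analysis, which supports the "cheapest model wins" conclusion.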

## Taste Profile Structure

## Competitive Advantage

Built from a user's imported tracks:

This approach means Vynl does what Spotify does internally (audio analysis) but exposes it transparently — users see exactly WHY a song was recommended based on its actual sonic qualities, not just "other listeners also liked this."

```json
{
  "top_genres": [{"name": "indie rock", "count": 12}, ...],
  "avg_energy": 0.65,
  "avg_danceability": 0.55,
  "avg_valence": 0.42,
  "avg_tempo": 118.5,
  "track_count": 47,
  "sample_artists": ["Radiohead", "Tame Impala", ...],
  "sample_tracks": ["Radiohead - Everything In Its Right Place", ...]
}
```
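
A sketch of how this profile could be aggregated from imported tracks. The per-track field names (`artist`, `name`, `genres`, plus the Spotify feature keys) are assumptions about the import format, not the actual model:

```python
from collections import Counter

def build_taste_profile(tracks: list[dict]) -> dict:
    """Aggregate imported tracks into the profile shape shown above."""
    n = len(tracks)

    def avg(key: str) -> float:
        return round(sum(t[key] for t in tracks) / n, 2)

    genre_counts = Counter(g for t in tracks for g in t["genres"])
    return {
        "top_genres": [{"name": g, "count": c}
                       for g, c in genre_counts.most_common(5)],
        "avg_energy": avg("energy"),
        "avg_danceability": avg("danceability"),
        "avg_valence": avg("valence"),
        "avg_tempo": avg("tempo"),
        "track_count": n,
        "sample_artists": sorted({t["artist"] for t in tracks})[:10],
        "sample_tracks": [f"{t['artist']} - {t['name']}" for t in tracks[:10]],
    }
```

Averages flatten a lot of nuance (a library split between ambient and hardcore averages to "mid-energy"), so the sample artists and tracks are included to let the LLM catch multimodal taste.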

## Tech Requirements

The LLM uses this profile to understand what the user gravitates toward sonically (high energy? melancholy? upbeat?) and to find new music that matches or intentionally contrasts with those patterns.

- **Essentia**: `pip install essentia-tensorflow` (includes pre-trained models)
- **Storage**: temporary audio clip storage during analysis (~500KB per 30s clip)
- **Celery worker**: dedicated worker for audio processing (CPU-bound)
- **Vector storage**: PostgreSQL with the pgvector extension for embedding similarity search

## Platform Support

### Currently Implemented

- Spotify (OAuth + playlist import + audio features)

### Planned

- YouTube Music (via `ytmusicapi`, an unofficial Python library)
- Apple Music (MusicKit API, requires an Apple Developer account)
- Last.fm (scrobble history import + similar artist data)
- Tidal (official API)
- Manual entry / CSV upload (fallback for any platform)
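
For the CSV fallback, a minimal parser sketch assuming a simple `artist,title` export with a header row; the real column layout is not specified in this doc:

```python
import csv
import io

def parse_library_csv(text: str) -> list[dict]:
    """Parse a minimal `artist,title` CSV export into track dicts.

    Column names are an assumption; a production importer would need to
    handle the varying export formats of each platform.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [{"artist": row["artist"].strip(), "title": row["title"].strip()}
            for row in reader]

print(parse_library_csv("artist,title\nRadiohead,Airbag\n"))
```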