TuneJury

Music, scored the way
listeners compare it.

TuneJury is an open reward model for music generation. It reads a prompt and an audio clip, and returns one preference score.

scroll ↓

How it scores

A small head on top of
frozen music encoders.

Audio passes through CLAP and MERT. The prompt passes through CLAP text. A 2.8M-parameter MLP turns the three embeddings into a single scalar, trained pairwise on human A vs. B comparisons.

TuneJury · released
Prompt + audio
CLAP + MERT
MLP head
s = +0.42

Released checkpoint. Encoders stay frozen, and only the head is trained.

The clip behind these scores1

By the numbers

Trained on human preferences.
No pseudo-labels.

Four open sources of human ratings. No pseudo-label augmentation.

0human-rated training pairs2
0held-out pairwise accuracy3
0trainable parameters4
0released track scores5

AIME 12,480 · SongEval 2,491 · MusicPrefs 2,012 · Music Arena 571

One frozen reward

Three ways to use it.

Mode 1

Best-of-N selection

Generate N candidates, keep the highest-scoring one. Reward rises monotonically through N = 32 on four open-weights backbones.6

Mode 2

Latent optimization

Backpropagate the score through the sampler into the starting noise, DITTO style. The backbone stays frozen.

Mode 3

Expert iteration

Fine-tune a backbone on its own top-scoring outputs, mapping the trade-off between reward and distributional fidelity.

Hear it

Before and after, on the same prompt.

Each pair uses one prompt and one backbone. Only the reward signal changed the outcome.

Best-of-16 selection

MusicGen-medium. One random sample vs. the top pick of 16.

“A dark trance track featuring accordion, blending hypnotic rhythms with melancholic melodies and a pervasive, atmospheric mood.”

s = +0.05
s = +1.71

Latent optimization

TangoFlux. The same noise, pushed toward higher reward.

“A melancholic rap piece driven by a steady drummachine beat, layered with subtle synth pads and a sparse electric guitar, creating a reflective, introspective atmosphere. …”

s = −1.11
s = +1.13

Expert iteration

FluxAudio-S. Baseline vs. fine-tuned on its own best outputs.

“A fast garage track featuring an electric guitar, driven by raw energy and a loose, rhythmic feel.”

s = −2.06
s = −0.05

More examples, with scores

The listening demo pairs every sample with its TuneJury score.

Open the listening demo

The released scores

219,020 clips.
One score each.

Seven open-license collections, scored with the released checkpoint. Drag the threshold and see what a score filter keeps.5

Share kept = clips scoring above τ.

Anchor calibration

New generators keep arriving.
The model keeps up.

A reward model trained today meets systems released tomorrow. Anchor calibration fits one bias per new system on a handful of preference pairs, with the model itself left untouched.

25× less calibration data than from-scratch retraining,
at the same accuracy ceiling7

Unless a variant is named, every result on this page uses the released CLAP+MERT checkpoint.

  1. A 10-second excerpt from an AIME dataset clip (CC BY 4.0), generated by Suno v3.5 from the prompt “pop, classical, percussion”. At this length scoring is deterministic: re-running the released pipeline on this file reproduces these values exactly. Score scales are model-specific: values are comparable within one configuration, not across configurations.
  2. Post-filter pairs from Music Arena, MusicPrefs, AIME, and SongEval, after benchmark-overlap removal.
  3. Pairwise accuracy on the 2,035-pair held-out test split of the same four-source mix, ties excluded.
  4. 2,791,169 parameters in the released scoring head. The frozen encoders add none.
  5. One score per clip on seven open-license collections, from one scorer run per clip with an empty prompt. The MERT branch averages the full clip, while the CLAP branch encodes one 10-second window, so re-scoring a long clip can shift an individual value slightly. The Song Describer Dataset is a captioned subset of MTG-Jamendo: all 706 of its two-minute excerpts come from tracks also scored in the MTG-Jamendo row. Threshold filtering shown on this page is illustrative; the threshold is not validated on held-out data. Example dots play 10-second excerpts. MidiCaps dots play FluidSynth renders of the underlying MIDI (FluidR3 General MIDI soundfont). MusicCaps carries no dots: its license (CC BY-SA 4.0) covers the captions only, while the audio remains on YouTube under the original uploaders’ rights and cannot be redistributed.
  6. Top-1 mean reward across MusicGen-medium, MusicGen-large, AudioLDM2-music, and ACE-Step Turbo Continuous, 100 prompts per setting.
  7. Anchor calibration at K = 10 matches from-scratch retraining at K = 250 on the 2026-02/03 post-cutoff Music Arena slice, within the swept range of K. The recovery is slice-dependent.

Data: MTG-Jamendo · FMA · MagnaTagATune · OpenMIC · MidiCaps · MusicCaps · Song Describer Dataset · Music Arena · MusicPrefs · AIME · SongEval

Models & methods: LAION-CLAP · MERT · MuQ-MuLan · MusicGen · AudioLDM 2 · ACE-Step · TangoFlux · Stable Audio Open Small · MeanAudio · DITTO