TuneJury · An open reward model for music generation

How it scores

A small head on top of frozen music encoders.

Audio passes through the CLAP and MERT audio encoders, and the prompt passes through the CLAP text encoder. A 2.8M-parameter MLP maps the three embeddings to a single scalar score, trained on pairwise human A-vs-B comparisons.

TuneJury · released

Prompt + audio → ❄ CLAP + MERT → MLP head → s = +0.42

Released checkpoint. Encoders stay frozen, and only the head is trained.
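
As a rough illustration of what pairwise training on A vs. B comparisons can look like, the sketch below uses a Bradley-Terry style logistic loss on the score margin between the preferred and the rejected clip. The loss form and the name `pairwise_loss` are assumptions for illustration; this page does not state the exact objective used for the released head.

```python
import math

def pairwise_loss(s_preferred: float, s_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(s_preferred - s_rejected)).

    Assumed objective for illustration, not the documented training loss.
    """
    margin = s_preferred - s_rejected
    # Equivalent to log(1 + exp(-margin)), arranged to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is small when the head scores the human-preferred clip well above the other one, equals log 2 when the two scores tie, and grows roughly linearly when the ranking is inverted.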

The clip behind these scores¹

One frozen reward

Three ways to use it.

Mode 1

Best-of-N selection

Generate N candidates, keep the highest-scoring one. Reward rises monotonically through N = 32 on four open-weights backbones.⁶
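
Best-of-N needs nothing beyond a generator and a scorer. The sketch below is a minimal version under stated assumptions: `generate` and `score` are hypothetical callables standing in for a backbone and for the reward model, and the toy stand-ins exist only so the code runs without any model.

```python
import random
from typing import Callable, Tuple, TypeVar

Clip = TypeVar("Clip")

def best_of_n(prompt: str,
              generate: Callable[[str], Clip],
              score: Callable[[str, Clip], float],
              n: int = 16) -> Tuple[Clip, float]:
    """Draw n candidates for one prompt and return the top-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, clip) for clip in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy stand-ins (hypothetical): a "clip" is just a random number.
_rng = random.Random(0)

def toy_generate(prompt: str) -> float:
    return _rng.gauss(0.0, 1.0)

def toy_score(prompt: str, clip: float) -> float:
    return clip  # the toy score is the clip value itself
```

Calling `best_of_n("some prompt", toy_generate, toy_score, n=16)` returns the highest-valued toy clip together with its score. With a real backbone, the cost is N generations plus N scorer passes per prompt.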

Mode 2

Latent optimization

Backpropagate the score through the sampler into the starting noise, DITTO-style. The backbone stays frozen.
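
The idea can be shown on a toy problem: treat the starting noise as the only trainable quantity and take gradient-ascent steps on the reward. Everything in the sketch below is a stand-in chosen for illustration (a one-line `tanh` "sampler", a quadratic "reward", and a gradient derived by hand). A real setup would rely on an autodiff framework and the actual sampler and reward model.

```python
import math
from typing import List

W = 0.8        # fixed weight of the toy "sampler"; never updated
TARGET = 0.3   # the toy reward prefers outputs near this value

def sampler(noise: List[float]) -> List[float]:
    """Toy frozen sampler: maps starting noise to an 'output'."""
    return [math.tanh(W * z) for z in noise]

def reward(output: List[float]) -> float:
    """Toy differentiable reward: higher when outputs are near TARGET."""
    return -sum((x - TARGET) ** 2 for x in output)

def reward_grad_wrt_noise(noise: List[float]) -> List[float]:
    """Chain rule by hand: dR/dz = dR/dx * dx/dz."""
    grads = []
    for z in noise:
        x = math.tanh(W * z)
        d_reward_d_x = -2.0 * (x - TARGET)
        d_x_d_z = W * (1.0 - x * x)
        grads.append(d_reward_d_x * d_x_d_z)
    return grads

def optimize_noise(noise: List[float], steps: int = 50,
                   lr: float = 0.1) -> List[float]:
    """Gradient ascent on the noise only; the sampler is left untouched."""
    z = list(noise)
    for _ in range(steps):
        grad = reward_grad_wrt_noise(z)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z
```

Starting from the same noise, `reward(sampler(optimize_noise(z0)))` ends up higher than `reward(sampler(z0))` while `W` is unchanged, which mirrors the baseline vs. optimized comparison in spirit only.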

Mode 3

Expert iteration

Fine-tune a backbone on its own top-scoring outputs, mapping the trade-off between reward and distributional fidelity.
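
The data-collection half of one expert-iteration round can be sketched as follows. The function name, the `generate` and `score` callables, and the `samples_per_prompt` and `keep_per_prompt` settings are all hypothetical; this page does not specify how many samples are drawn or kept, and the fine-tuning step itself is not shown.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Clip = TypeVar("Clip")

def collect_top_outputs(prompts: Sequence[str],
                        generate: Callable[[str], Clip],
                        score: Callable[[str, Clip], float],
                        samples_per_prompt: int = 8,
                        keep_per_prompt: int = 1) -> List[Tuple[str, Clip]]:
    """Sample several clips per prompt and keep the top-scoring ones."""
    dataset: List[Tuple[str, Clip]] = []
    for prompt in prompts:
        clips = [generate(prompt) for _ in range(samples_per_prompt)]
        # Rank this prompt's clips from highest to lowest score.
        ranked = sorted(clips, key=lambda clip: score(prompt, clip),
                        reverse=True)
        dataset.extend((prompt, clip) for clip in ranked[:keep_per_prompt])
    return dataset
```

The returned (prompt, clip) pairs would then serve as the fine-tuning set for the same backbone. Keeping fewer clips per prompt pushes harder toward high reward, which is one plausible knob behind the reward vs. fidelity trade-off mentioned above.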

Hear it

Before and after, on the same prompt.

Each pair uses one prompt and one backbone. Only the reward signal changed the outcome.

Best-of-16 selection

MusicGen-medium. One random sample vs. the top pick of 16.

“A dark trance track featuring accordion, blending hypnotic rhythms with melancholic melodies and a pervasive, atmospheric mood.”

Single sample

s = +0.05

Best of 16

s = +1.71

Latent optimization

TangoFlux. The same noise, pushed toward higher reward.

“A melancholic rap piece driven by a steady drummachine beat, layered with subtle synth pads and a sparse electric guitar, creating a reflective, introspective atmosphere. …”

Baseline

s = −1.11

After DITTO

s = +1.13

Expert iteration

FluxAudio-S. Baseline vs. fine-tuned on its own best outputs.

“A fast garage track featuring an electric guitar, driven by raw energy and a loose, rhythmic feel.”

Baseline

s = −2.06

Fine-tuned

s = −0.05

More examples, with scores

The listening demo pairs every sample with its TuneJury score.

Open the listening demo

Unless a variant is named, every result on this page uses the released CLAP+MERT checkpoint.

¹ A 10-second excerpt from an AIME dataset clip (CC BY 4.0), generated by Suno v3.5 from the prompt “pop, classical, percussion”. At this length scoring is deterministic: re-running the released pipeline on this file reproduces these values exactly. Score scales are model-specific: values are comparable within one configuration, not across configurations.
² Post-filter pairs from Music Arena, MusicPrefs, AIME, and SongEval, after benchmark-overlap removal.
³ Pairwise accuracy on the 2,035-pair held-out test split of the same four-source mix, ties excluded.
⁴ 2,791,169 parameters in the released scoring head. The frozen encoders add none.
⁵ One score per clip on seven open-license collections, from one scorer run per clip with an empty prompt. The MERT branch averages the full clip, while the CLAP branch encodes one 10-second window, so re-scoring a long clip can shift an individual value slightly. The Song Describer Dataset is a captioned subset of MTG-Jamendo: all 706 of its two-minute excerpts come from tracks also scored in the MTG-Jamendo row. Threshold filtering shown on this page is illustrative; the threshold is not validated on held-out data. Example dots play 10-second excerpts. MidiCaps dots play FluidSynth renders of the underlying MIDI (FluidR3 General MIDI soundfont). MusicCaps carries no dots: its license (CC BY-SA 4.0) covers the captions only, while the audio remains on YouTube under the original uploaders’ rights and cannot be redistributed.
⁶ Top-1 mean reward across MusicGen-medium, MusicGen-large, AudioLDM2-music, and ACE-Step Turbo Continuous, 100 prompts per setting.
⁷ Anchor calibration at K = 10 matches from-scratch retraining at K = 250 on the 2026-02/03 post-cutoff Music Arena slice, within the swept range of K. The recovery is slice-dependent.
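
For readers who want to compute a comparable number on their own pairs, the sketch below shows one way to measure pairwise accuracy with ties excluded. It assumes that "ties" means pairs the human raters labeled as a tie, and that a pair with identical model scores counts as a miss; neither detail is confirmed on this page, and the label strings are hypothetical.

```python
from typing import Iterable, Tuple

def pairwise_accuracy(pairs: Iterable[Tuple[float, float, str]]) -> float:
    """pairs: (score_a, score_b, human_label), label in {'A', 'B', 'tie'}."""
    correct = 0
    total = 0
    for score_a, score_b, label in pairs:
        if label == "tie":
            continue  # assumed: human ties are dropped before scoring
        total += 1
        if score_a > score_b:
            predicted = "A"
        elif score_b > score_a:
            predicted = "B"
        else:
            predicted = None  # assumed: equal model scores count as a miss
        correct += int(predicted == label)
    if total == 0:
        raise ValueError("no non-tie pairs to evaluate")
    return correct / total
```

Only the sign of the score difference matters here, which fits the note that score scales are model-specific and not comparable across configurations.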

Data: MTG-Jamendo · FMA · MagnaTagATune · OpenMIC · MidiCaps · MusicCaps · Song Describer Dataset · Music Arena · MusicPrefs · AIME · SongEval

Models & methods: LAION-CLAP · MERT · MuQ-MuLan · MusicGen · AudioLDM 2 · ACE-Step · TangoFlux · Stable Audio Open Small · MeanAudio · DITTO

Music, scored the way listeners compare it.

Trained on human preferences.
No pseudo-labels.