TTM-HumanPref

ICME 2026 ATTM Grand Challenge · Efficiency Track1

Improving Text-to-Music Generation
with Human Preference Rewards

Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma♭, Chris Donahue♮

Georgia Tech · KAIST · Peking University · ♭Queen Mary University of London · ♮Carnegie Mellon University

A 120M-parameter FluxAudio-S backbone [2], conditioned on a learned human-preference reward (TuneJury [3]) and refined through expert iteration and a short CRPO pass. The whole pipeline fits in ~40 GPU-hours on one RTX A5000 and produces 10 s clips in under a second.

FluxAudio-S (120M) · TuneJury reward · Expert iteration · CRPO · Post-proc

What we built

Five engineering decisions

Four at training time and one at inference, each measured by per-stage decomposition on 100 Song Describer Dataset prompts.

Training · 1

Score-conditioned SFT

Train the backbone with the TuneJury reward as a conditioning signal that doubles as an inference-time CFG axis [7]. (We swept five score-conditioning heads and kept the one that performed best.)

FAD-CLAP improves by 0.025–0.040 at the SFT stage.
Training · 2

Expert iteration

Fine-tune on the top decile of samples, ranked by combined reward + CLAP-text score [8, 9].

Dominant contributor: FAD-CLAP −0.0362, paired-t significant.
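
The selection step above can be sketched as follows. This is a minimal illustration, not the submission's code: the source does not say how reward and CLAP-text score are combined, so the z-score sum, the function name `select_top_decile`, the dictionary keys, and the candidate pool are all assumptions made for the example.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of scores so two metrics share a common scale."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero on constant input
    return [(v - mu) / sigma for v in values]


def select_top_decile(clips, reward_key="reward", clap_key="clap_text"):
    """Keep the best 10% of clips by a combined score.

    ASSUMPTION: the combined score is the sum of per-metric z-scores.
    The source only says "combined reward + CLAP-text score".
    """
    rewards = zscores([c[reward_key] for c in clips])
    claps = zscores([c[clap_key] for c in clips])
    combined = [r + c for r, c in zip(rewards, claps)]
    keep = max(1, len(clips) // 10)  # top decile, at least one clip
    ranked = sorted(range(len(clips)), key=lambda i: combined[i], reverse=True)
    return [clips[i] for i in ranked[:keep]]


# Hypothetical candidate pool: 20 clips with made-up scores.
pool = [
    {"id": i, "reward": 0.1 * i - 1.0, "clap_text": 0.2 + 0.005 * i}
    for i in range(20)
]
selected = select_top_decile(pool)
print([c["id"] for c in selected])
```

The selected clips would then serve as the fine-tuning set for the next round.
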
Training · 3

Cross-load v2

Load the v1-trained (GlobalAdaLN) weights into the v2 (InputAdd) forward pass to host the CRPO step. A bridge, not a booster.

Direction matters: v1→v2 stays within 0.02 FAD-CLAP; v2→v1 collapses.
Training · 4

CRPO [13]

A short DPO-style [10] preference-tuning pass for audio-caption alignment.

Within paired-t noise at this scale; kept because it is inexpensive.
Inference · 5

Post-processing

Joint CFG on text and reward, 3× Demucs [11] source separation, LUFS normalization to −16.5.

The score scalar is already saturated by this point in the chain.
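
One common way to compose two guidance axes is to add a second guidance term on top of standard text CFG. The sketch below shows that composition under stated assumptions: the source does not give the formula actually used, and the function name `joint_cfg`, the parameter names `text_scale` and `reward_scale`, and the toy vectors are hypothetical.

```python
def joint_cfg(v_uncond, v_text, v_text_reward, text_scale, reward_scale):
    """Combine three model predictions into one guided prediction.

    ASSUMED composition (not confirmed by the source):
        v = v_uncond
            + text_scale   * (v_text        - v_uncond)
            + reward_scale * (v_text_reward - v_text)

    v_uncond:       prediction with no conditioning
    v_text:         prediction with the text prompt only
    v_text_reward:  prediction with the text prompt and the reward score
    """
    return [
        u + text_scale * (t - u) + reward_scale * (r - t)
        for u, t, r in zip(v_uncond, v_text, v_text_reward)
    ]


# Toy one-dimensional predictions, chosen only to make the arithmetic visible.
guided = joint_cfg([0.0], [1.0], [1.5], text_scale=4.0, reward_scale=2.0)
print(guided)  # 0 + 4*(1 - 0) + 2*(1.5 - 1) = 5.0

# With both scales at 1 the terms telescope to the fully conditioned prediction.
plain = joint_cfg([0.0], [1.0], [1.5], text_scale=1.0, reward_scale=1.0)
print(plain)
```

In this form each axis can be turned up or down independently, which is what lets the reward conditioning act as a separate inference-time knob.
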
Finding

What actually moves the needle

Expert iteration is the dominant contributor. The inference-time score knob ends up saturated, and the CRPO pass adds only noise-level gain.

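[Figure: three panels of cumulative metrics per pipeline stage. TuneJury reward ↑ rises from −0.39 to +0.53; FAD-CLAP ↓ falls from 0.60 to 0.42; CLAP score ↑ rises from 0.230 to 0.285, with 0.290 labeled at an intermediate stage.]

Left → right in every panel: Baseline, Score-cond. SFT, Expert iteration, Cross-load v2, CRPO
(the four training decisions 1–4 above, in pipeline order; post-processing is the 5th, applied to every clip)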

Each engineering decision moves all three challenge metrics the right way: TuneJury reward and CLAP score rise, FAD-CLAP falls (lower is better) [12]. The two training lifts, score-conditioned SFT and expert iteration, do almost all the work. Cross-load and CRPO sit within paired-t noise. Values are cumulative on the 100-prompt SDD slice (Row 0→4 of the paper's decomposition table).
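
For readers unfamiliar with the paired-t comparisons mentioned above, here is a minimal standard-library sketch of the statistic. It applies to any metric that can be scored per prompt, such as reward or CLAP score; how the per-stage FAD-CLAP comparisons were paired is not described on this page. The function name `paired_t` and the five example values are hypothetical and are not results from this work.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired samples.

    Each index is one prompt scored under two checkpoints, so the test
    is run on the per-prompt differences rather than on the two groups.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(after, before)]
    std_err = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean diff
    if std_err == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / std_err, len(diffs) - 1


# Made-up per-prompt rewards for two checkpoints (illustration only).
before = [-0.40, -0.10, 0.20, -0.30, 0.00]
after = [0.10, 0.30, 0.50, 0.20, 0.60]
t_stat, dof = paired_t(before, after)
print(f"t = {t_stat:.2f}, dof = {dof}")
```

A stage is "within paired-t noise" when the resulting t statistic is too small to reject a mean difference of zero at the chosen significance level.
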

Hear one prompt, watch its reward climb

"A vibrant dance track driven by energetic drums, enhanced with a pulsing synth bass and bright acoustic guitar, captures the lively, sun-soaked spirit of summer with a rhythmic, foot-tapping groove that pulses through the room."

Baseline: −1.10
Score-cond. SFT: −0.47
Expert iteration: +0.47
Cross-load v2: +0.71
CRPO: +0.84

Each badge shows the TuneJury reward (the challenge's MusicRankNet) for that exact clip. It climbs from −1.10 to +0.84 across the pipeline. Same prompt and seed (42) at every stage, identical inference (s=5.0, cfg=4.0, 25 steps, prefix, 3×mdx_extra, −16.5 LUFS); only the checkpoint changes. These are single clips, not the 100-prompt averages above, and the reward is a learned preference signal that need not match what every listener hears.

Listen

Hear the difference

The three prompts where the TuneJury reward gains the most. Same prompt and same backbone: challenge baseline vs. our submission (Sub. 1, seed 42).

Δ reward +3.10
A relaxed hiphop track featuring a subtle cymbal shimmer, smooth beat, and soft rhythmic flow, perfect as a background atmosphere. Enhanced with a warm electric piano and a light upright bass, the arrangement stays minimal and laid-back.

Baseline: reward −1.76
Sub. 1 (ours): reward +1.34

Δ reward +2.60
Atmospheric music featuring a soothing vibraphone, enhanced with gentle piano and ambient synth pads, evokes a calm, reflective mood with soft, resonant tones and a spacious, dreamy texture.

Baseline: reward −1.79
Sub. 1 (ours): reward +0.81

Δ reward +2.33
A calm psychedelic track featuring a synthesizer, blending surreal textures and dreamy tones in a serene, meditative atmosphere.

Baseline: reward −1.76
Sub. 1 (ours): reward +0.57

Reward is the mean output of the TuneJury preference model (higher is better). Clips are 10 s, post-processed (3×Demucs mdx_extra, −16.5 LUFS).
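
The loudness step reduces to a single gain once a clip's integrated loudness is known. The sketch below shows only that arithmetic: the helper names `gain_to_target` and `apply_gain` and the measured value of −22.5 LUFS are hypothetical, and measuring integrated loudness itself (an ITU-R BS.1770 meter) is assumed to happen elsewhere.

```python
def gain_to_target(measured_lufs, target_lufs=-16.5):
    """Return (gain in dB, linear amplitude factor) to reach the target.

    Loudness units are on a dB scale, so the required gain is just the
    difference between target and measured integrated loudness.
    """
    gain_db = target_lufs - measured_lufs
    return gain_db, 10 ** (gain_db / 20)


def apply_gain(samples, linear_gain):
    """Scale raw samples; no limiter or clipping protection is applied."""
    return [x * linear_gain for x in samples]


# Hypothetical clip measured at -22.5 LUFS: it needs +6 dB to reach -16.5.
gain_db, linear = gain_to_target(-22.5)
print(gain_db, round(linear, 4))
scaled = apply_gain([0.1, -0.2, 0.05], linear)
```

A positive gain can push peaks past full scale, so a production pipeline would usually check the true peak after scaling.
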

Results

Two seed-varied submissions

100 held-out Song Describer Dataset [6] prompts, scored against SDD-706 with LAION-CLAP-Music [5].

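| System                 | FAD-CLAP ↓ | CLAP score ↑ | TuneJury reward ↑ |
|------------------------|------------|--------------|-------------------|
| FluxAudio-S (baseline) | 0.5998     | 0.230        | −0.392            |
| Sub. 1 (seed 42)       | 0.4238     | 0.285        | +0.533            |
| Sub. 2 (seed 55)       | 0.4370     | 0.300        | +0.550            |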

SDD-706 is the challenge's reference set: a 706-track instrumental MTG-Jamendo subset of the Song Describer Dataset. On the challenge's hidden Jamendo reference set, our submission (e02) scored FAD 0.498, CLAP 0.270, CCS 0.763.
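
To read the table as changes against the baseline, the short script below subtracts the baseline row from each submission. The numbers are copied from the results table; the dictionary layout and the helper name `deltas` are illustrative choices.

```python
# Values copied from the results table (100 held-out SDD prompts).
results = {
    "baseline": {"fad_clap": 0.5998, "clap": 0.230, "reward": -0.392},
    "sub1": {"fad_clap": 0.4238, "clap": 0.285, "reward": 0.533},
    "sub2": {"fad_clap": 0.4370, "clap": 0.300, "reward": 0.550},
}


def deltas(system, reference="baseline"):
    """Per-metric change of `system` relative to `reference`."""
    return {
        metric: round(results[system][metric] - results[reference][metric], 4)
        for metric in results[reference]
    }


for name in ("sub1", "sub2"):
    print(name, deltas(name))

# Relative FAD-CLAP reduction for Sub. 1, as a fraction of the baseline value.
rel_fad = -deltas("sub1")["fad_clap"] / results["baseline"]["fad_clap"]
print(f"Sub. 1 FAD-CLAP reduction: {rel_fad:.1%}")
```

For Sub. 1 this gives a FAD-CLAP change of −0.176, a CLAP score change of +0.055, and a reward change of +0.925 against the baseline.
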

The reward signal

Powered by TuneJury

TuneJury is an open, instance-level pairwise reward model for text-to-music, trained on human preference judgments from open datasets [4]. Here it drives both training-time conditioning and sample selection.