ICME 2026 ATTM Grand Challenge · Efficiency Track¹

Improving Text-to-Music Generation
with Human Preference Rewards

Yonghyun Kim^♭, Junwon Lee^♮, Haiwen Xia^♯, Yinghao Ma^♭♭, Chris Donahue^♮♮

^♭Georgia Tech · ^♮KAIST · ^♯Peking University · ^♭♭Queen Mary University of London · ^♮♮Carnegie Mellon University

A 120M-parameter FluxAudio-S backbone² is conditioned on a learned human-preference reward (TuneJury³), then refined through expert iteration and a short CRPO pass. The whole pipeline fits in about 40 GPU-hours on one RTX A5000 and produces 10 s clips in under a second.


FluxAudio-S (120M) → TuneJury reward → Expert iteration → CRPO → Post-proc

What we built

Five engineering decisions

Four at training time and one at inference, each measured by per-stage decomposition on 100 Song Describer Dataset prompts.

Training · 1

Score-conditioned SFT

Train the backbone with the TuneJury reward as a conditioning signal, which doubles as an inference-time CFG axis⁷. (Five score-conditioning heads were swept; the best one is used here.)

FAD-CLAP improves by 0.025–0.040 at the SFT stage.

Training · 2

Expert iteration

Fine-tune on the top decile of samples, ranked by combined reward + CLAP-text score.⁸,⁹

Dominant contributor: FAD-CLAP −0.0362, paired-t significant.
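The selection step behind expert iteration can be sketched as follows. This is a minimal illustration, not the submission's code: the page does not say how reward and CLAP-text score are combined, so the unweighted sum of z-scores below is an assumption, and `select_top_decile` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values so the two scores are on a comparable scale."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def select_top_decile(rewards, clap_scores, fraction=0.10):
    """Return sorted indices of the top `fraction` of clips by combined score.

    ASSUMPTION: combined score = z(reward) + z(CLAP-text score).
    The source only says the two are "combined".
    """
    if len(rewards) != len(clap_scores):
        raise ValueError("rewards and clap_scores must have the same length")
    if not rewards:
        return []
    combined = [r + c for r, c in zip(zscores(rewards), zscores(clap_scores))]
    keep = max(1, int(len(combined) * fraction))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical scores for 20 generated clips (not real data).
rewards = [i * 0.1 - 1.0 for i in range(20)]
clap_scores = [0.20 + i * 0.005 for i in range(20)]
print(select_top_decile(rewards, clap_scores))
```

The kept indices would then form the fine-tuning set for the next round.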

Training · 3

Cross-load v2

Load the v1-trained (GlobalAdaLN) weights into the v2 (InputAdd) forward to host the CRPO step. A bridge, not a booster.

Direction matters: v1→v2 stays within 0.02 FAD-CLAP; v2→v1 collapses.

Training · 4

CRPO¹³

A short DPO-style¹⁰ preference-tuning pass for audio-caption alignment.

Within paired-t noise at this scale; kept because it is inexpensive.
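For readers unfamiliar with DPO-style objectives, the generic DPO loss for one preference pair can be sketched as below. This is the standard log-probability form, shown only for orientation; CRPO's objective for flow-matching models differs in detail, and the function name, arguments, and `beta` value here are illustrative assumptions.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for a single (preferred, rejected) pair.

    ASSUMPTION: inputs are scalar log-likelihoods under the policy and a
    frozen reference model. This is not CRPO's exact objective.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When policy and reference agree, the loss is log(2).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# The loss drops once the policy favors the preferred sample more than
# the reference does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```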

Inference · 5

Post-processing

Joint CFG on text and reward, 3×Demucs¹¹ source separation, LUFS normalization to −16.5.

The score scalar is already saturated by this point in the chain.
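One common way to guide on two conditions at once is nested classifier-free guidance, sketched below. The page does not state how the submission composes the text and reward axes, so this formulation, the function name, and the weights `w_text` and `w_reward` are all assumptions made for illustration.

```python
def joint_guidance(v_uncond, v_text, v_text_reward, w_text, w_reward):
    """Combine three model predictions with two guidance weights.

    ASSUMPTION: nested two-condition CFG, i.e.
        v = v_uncond + w_text * (v_text - v_uncond)
                     + w_reward * (v_text_reward - v_text)
    The submission's actual composition may differ.
    """
    return [
        u + w_text * (t - u) + w_reward * (tr - t)
        for u, t, tr in zip(v_uncond, v_text, v_text_reward)
    ]


# Hypothetical 2-dimensional predictions (not real model outputs).
v = joint_guidance(
    v_uncond=[0.0, 1.0],
    v_text=[0.5, 1.0],
    v_text_reward=[1.0, 0.5],
    w_text=4.0,
    w_reward=2.0,
)
print(v)
```

With both weights set to 1.0 this reduces to the fully conditioned prediction, and with both at 0.0 it reduces to the unconditional one.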

Finding

What actually moves the needle

Expert iteration is the dominant contributor. The inference-time score knob ends up saturated, and the CRPO pass adds only noise-level gain.

Left → right in every panel: Baseline → Score-cond. SFT → Expert iteration → Cross-load v2 → CRPO
(the four training decisions 1–4 above, in pipeline order; post-processing is the 5th, applied to every clip)

Each engineering decision moves all three challenge metrics the right way: TuneJury reward and CLAP score rise, FAD-CLAP falls (lower is better)¹². The two training lifts, score-conditioned SFT and expert iteration, do almost all the work. Cross-load and CRPO sit within paired-t noise. Values are cumulative on the 100-prompt SDD slice (Row 0→4 of the paper's decomposition table).
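A paired t-statistic of the kind referred to above can be computed from per-prompt metric values before and after a stage. The sketch below uses made-up numbers; how the paper derives per-prompt values for each metric is not described here, so the inputs are purely hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Paired t-statistic for per-prompt values measured at two stages."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(n))


# Hypothetical per-prompt rewards at two consecutive stages (not real data).
stage_a = [0.0, 0.0, 0.0, 0.0]
stage_b = [1.0, 2.0, 3.0, 4.0]
t_stat = paired_t(stage_a, stage_b)
print(t_stat)  # compare against a t distribution with n - 1 degrees of freedom
```

A stage "within paired-t noise" is one whose statistic is too small to reject the hypothesis of zero mean difference at the chosen significance level.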

Hear one prompt, watch its reward climb

"A vibrant dance track driven by energetic drums, enhanced with a pulsing synth bass and bright acoustic guitar, captures the lively, sun-soaked spirit of summer with a rhythmic, foot-tapping groove that pulses through the room."

Baseline: −1.10

Score-cond. SFT: −0.47

Expert iteration: +0.47

Cross-load v2: +0.71

CRPO: +0.84

Each badge shows the TuneJury reward (the challenge's MusicRankNet) for that exact clip. The reward climbs from −1.10 to +0.84 across the pipeline. Prompt and seed (42) are the same at every stage, and inference is identical (s=5.0, cfg=4.0, 25 steps, prefix, 3×mdx_extra, −16.5 LUFS); only the checkpoint changes. These are single clips, not the 100-prompt averages above, and the reward is a learned preference signal that need not match what every listener hears.

Listen

Hear the difference

The three prompts with the largest TuneJury reward gains. Each pair uses the same prompt and the same backbone: challenge baseline vs. our submission (Sub. 1, seed 42).

Δ reward +3.10

A relaxed hiphop track featuring a subtle cymbal shimmer, smooth beat, and soft rhythmic flow, perfect as a background atmosphere. Enhanced with a warm electric piano and a light upright bass, the arrangement stays minimal and laid-back.

Baseline: reward −1.76

Sub. 1 (ours): reward +1.34

Δ reward +2.60

Atmospheric music featuring a soothing vibraphone, enhanced with gentle piano and ambient synth pads, evokes a calm, reflective mood with soft, resonant tones and a spacious, dreamy texture.

Baseline: reward −1.79

Sub. 1 (ours): reward +0.81

Δ reward +2.33

A calm psychedelic track featuring a synthesizer, blending surreal textures and dreamy tones in a serene, meditative atmosphere.

Baseline: reward −1.76

Sub. 1 (ours): reward +0.57

Reward is the mean output of the TuneJury preference model (higher is better). Clips are 10 s, post-processed (3×Demucs mdx_extra, −16.5 LUFS).
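The loudness-normalization step amounts to a fixed gain once a clip's integrated loudness is known. The sketch below assumes the loudness has already been measured by an ITU-R BS.1770 meter (not implemented here); the function names and the example measurement are hypothetical.

```python
def gain_to_target(measured_lufs, target_lufs=-16.5):
    """Linear gain that moves a clip from measured_lufs to target_lufs."""
    gain_db = target_lufs - measured_lufs
    return 10 ** (gain_db / 20)


def apply_gain(samples, gain):
    """Scale samples by a linear gain. No limiter: peaks may exceed 1.0."""
    return [s * gain for s in samples]


# Hypothetical clip measured at -22.5 LUFS: needs +6 dB to reach -16.5 LUFS.
gain = gain_to_target(-22.5)
print(gain)
print(apply_gain([0.1, -0.2, 0.05], gain))
```

In practice a peak check or limiter would follow, since a positive gain can push samples past full scale.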

Results

Two seed-varied submissions

100 held-out Song Describer Dataset⁶ prompts, scored against SDD-706 with LAION-CLAP-Music⁵.

System	FAD-CLAP ↓	CLAP score ↑	TuneJury reward ↑
FluxAudio-S (baseline)	0.5998	0.230	−0.392
Sub. 1 (seed 42)	0.4238	0.285	+0.533
Sub. 2 (seed 55)	0.4370	0.300	+0.550

SDD-706 is the challenge's reference set: a 706-track instrumental MTG-Jamendo subset of the Song Describer Dataset. On the challenge's hidden Jamendo reference set, our submission (e02) scored FAD 0.498, CLAP 0.270, CCS 0.763.
