ICME 2026 ATTM Grand Challenge · Efficiency Track [1]
♭Georgia Tech · ♮KAIST · ♯Peking University · ♭♭Queen Mary University of London · ♮♮Carnegie Mellon University
A 120M-parameter FluxAudio-S backbone [2], conditioned on a learned human-preference reward (TuneJury [3]) and refined through expert iteration and a short CRPO pass. The whole pipeline fits in ~40 GPU-hours on one RTX A5000 and produces 10 s clips in under a second.
What we built
Five engineering decisions: four at training time and one at inference, each measured by per-stage decomposition on 100 Song Describer Dataset prompts.
1. Score-conditioned SFT: train the backbone with the TuneJury reward as a conditioning signal that doubles as an inference-time CFG axis [7]. (The winning head was picked from a sweep of five score-conditioning heads.)
2. Expert iteration: fine-tune on the top decile by combined reward + CLAP-text score [8, 9].
3. Cross-load v2: load the v1-trained (GlobalAdaLN) weights into the v2 (InputAdd) forward pass to host the CRPO step. A bridge, not a booster.
4. CRPO: a short DPO-style [10] preference-tuning pass for audio-caption alignment.
5. Inference and post-processing: joint CFG on text and reward, 3×Demucs [11] source separation, LUFS normalization to −16.5.
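The joint CFG on text and reward listed above can be pictured as a two-axis classifier-free guidance combination. The sketch below shows one common way to compose two guidance axes; it is not confirmed to be the formulation used in this pipeline. The function name `joint_cfg`, the nesting order (text first, then reward), and the way the two scales enter are all assumptions made for illustration, and predictions are plain lists of floats rather than tensors.

```python
from typing import List, Sequence


def joint_cfg(
    v_uncond: Sequence[float],
    v_text: Sequence[float],
    v_text_reward: Sequence[float],
    text_scale: float,
    reward_scale: float,
) -> List[float]:
    """Combine three model predictions into one guided prediction.

    v_uncond:      prediction with text and reward conditioning both dropped
    v_text:        prediction with text conditioning only
    v_text_reward: prediction with text and reward conditioning

    Hypothetical nested composition: step along the text direction,
    then along the reward direction measured on top of the text branch.
    """
    if not (len(v_uncond) == len(v_text) == len(v_text_reward)):
        raise ValueError("all predictions must have the same length")
    return [
        u + text_scale * (t - u) + reward_scale * (tr - t)
        for u, t, tr in zip(v_uncond, v_text, v_text_reward)
    ]
```

With both scales set to 1 the result is simply the fully conditioned prediction, and with both set to 0 it is the unconditional one. How the reported inference settings map onto these two scales is not specified here.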
Expert iteration is the dominant contributor. The inference-time score knob ends up saturated, and the CRPO pass adds only noise-level gain.
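The selection step behind expert iteration (keep the top decile by combined reward and CLAP-text score) can be sketched as follows. How the two scores are actually combined is not stated on this page; summing per-metric z-scores is an assumption, and `zscores` and `top_decile_indices` are hypothetical helper names.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def zscores(values: Sequence[float]) -> List[float]:
    """Standardize values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def top_decile_indices(
    rewards: Sequence[float], clap_scores: Sequence[float]
) -> List[int]:
    """Return the sorted indices of the best 10% of clips.

    Assumption: the combined score is the sum of the two z-scored
    metrics, so neither metric dominates because of its scale.
    """
    if len(rewards) != len(clap_scores):
        raise ValueError("rewards and clap_scores must align")
    if not rewards:
        return []
    combined = [r + c for r, c in zip(zscores(rewards), zscores(clap_scores))]
    keep = max(1, len(combined) // 10)  # at least one clip survives
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])
```

The retained indices would then pick the generated clips used as fine-tuning data for the next round.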
Left → right in every panel: Baseline → Score-cond. SFT → Expert iteration → Cross-load v2 → CRPO
(the four training decisions 1–4 above, in pipeline order; post-processing is the 5th, applied to every clip)
Each engineering decision moves all three challenge metrics the right way: TuneJury reward and CLAP score rise, FAD-CLAP falls (lower is better) [12]. The two training lifts, score-conditioned SFT and expert iteration, do almost all the work. Cross-load and CRPO sit within paired t-test noise. Values are cumulative on the 100-prompt SDD slice (Rows 0 to 4 of the paper's decomposition table).
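The paired-t comparison mentioned above works on per-prompt score differences between two adjacent stages. A minimal sketch of the statistic, assuming one score per prompt per stage with matching order; `paired_t_statistic` is a hypothetical name and no p-value lookup is included.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> float:
    """t statistic for the mean per-prompt difference (after - before)."""
    if len(before) != len(after):
        raise ValueError("both stages must score the same prompts")
    if len(before) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [a - b for a, b in zip(after, before)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / sqrt(len(diffs)))
```

A stage whose statistic stays small relative to the critical value for n − 1 degrees of freedom is indistinguishable from noise on that prompt set.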
Hear one prompt, watch its reward climb
"A vibrant dance track driven by energetic drums, enhanced with a pulsing synth bass and bright acoustic guitar, captures the lively, sun-soaked spirit of summer with a rhythmic, foot-tapping groove that pulses through the room."
Each badge shows the TuneJury reward (the challenge's MusicRankNet) that this exact clip scores. It climbs from −1.10 to +0.84 across the pipeline. Same prompt and seed (42) at every stage, identical inference (s=5.0, cfg=4.0, 25 steps, prefix, 3×mdx_extra, −16.5 LUFS); only the checkpoint changes. These are single clips, not the 100-prompt averages above, and reward is a learned preference signal that need not match what every listener hears.
Listen
The three prompts with the largest TuneJury reward gains. Same prompt and same backbone: challenge baseline vs. our submission (Sub. 1, seed 42).
Δ reward +3.10: "A relaxed hiphop track featuring a subtle cymbal shimmer, smooth beat, and soft rhythmic flow, perfect as a background atmosphere. Enhanced with a warm electric piano and a light upright bass, the arrangement stays minimal and laid-back."
Δ reward +2.60: "Atmospheric music featuring a soothing vibraphone, enhanced with gentle piano and ambient synth pads, evokes a calm, reflective mood with soft, resonant tones and a spacious, dreamy texture."
Δ reward +2.33: "A calm psychedelic track featuring a synthesizer, blending surreal textures and dreamy tones in a serene, meditative atmosphere."
Reward is the mean output of the TuneJury preference model (higher is better). Clips are 10 s, post-processed (3×Demucs mdx_extra, −16.5 LUFS).
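Once a clip's integrated loudness is known, normalizing to −16.5 LUFS reduces to a single gain. The sketch below assumes the loudness has already been measured elsewhere (for example with a BS.1770-style meter, not shown) and ignores clipping protection; `gain_to_target` and `apply_gain` are hypothetical helper names.

```python
from typing import List, Sequence


def gain_to_target(measured_lufs: float, target_lufs: float = -16.5) -> float:
    """Linear amplitude gain that moves measured loudness to the target.

    Loudness units are decibel-scaled, so the offset in LU is applied
    as an amplitude ratio of 10 ** (dB / 20).
    """
    gain_db = target_lufs - measured_lufs
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples: Sequence[float], gain: float) -> List[float]:
    """Scale every sample by the same linear gain (no limiter applied)."""
    return [s * gain for s in samples]
```

A clip measured at −22.5 LUFS needs +6 dB, or about a 2× amplitude gain; a clip already at −16.5 LUFS is left unchanged.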
Results
100 held-out Song Describer Dataset [6] prompts, scored against SDD-706 with LAION-CLAP-Music [5].
| System | FAD-CLAP ↓ | CLAP score ↑ | TuneJury reward ↑ |
|---|---|---|---|
| FluxAudio-S (baseline) | 0.5998 | 0.230 | −0.392 |
| Sub. 1 (seed 42) | 0.4238 | 0.285 | +0.533 |
| Sub. 2 (seed 55) | 0.4370 | 0.300 | +0.550 |
SDD-706 is the challenge's reference set: a 706-track instrumental MTG-Jamendo subset of the Song Describer Dataset. On the challenge's hidden Jamendo reference set, our submission (e02) scored FAD 0.498, CLAP 0.270, CCS 0.763.
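To read the table as changes against the baseline, the arithmetic can be spelled out directly. The values below are copied from the table; the dictionary layout and the `summarize` helper are illustrative only.

```python
from typing import Dict

# Values copied from the results table above.
BASELINE = {"fad_clap": 0.5998, "clap": 0.230, "reward": -0.392}
SUBMISSIONS = {
    "Sub. 1 (seed 42)": {"fad_clap": 0.4238, "clap": 0.285, "reward": 0.533},
    "Sub. 2 (seed 55)": {"fad_clap": 0.4370, "clap": 0.300, "reward": 0.550},
}


def summarize(baseline: Dict[str, float], system: Dict[str, float]) -> Dict[str, float]:
    """Relative FAD-CLAP reduction plus absolute CLAP and reward gains."""
    return {
        "fad_relative_reduction": (baseline["fad_clap"] - system["fad_clap"])
        / baseline["fad_clap"],
        "clap_gain": system["clap"] - baseline["clap"],
        "reward_gain": system["reward"] - baseline["reward"],
    }


if __name__ == "__main__":
    for name, metrics in SUBMISSIONS.items():
        s = summarize(BASELINE, metrics)
        print(
            f"{name}: FAD-CLAP -{100 * s['fad_relative_reduction']:.1f}%, "
            f"CLAP +{s['clap_gain']:.3f}, reward +{s['reward_gain']:.3f}"
        )
```

On these numbers, Sub. 1 lowers FAD-CLAP by roughly 29% relative to the baseline and Sub. 2 by roughly 27%, with reward gains of about 0.93 and 0.94.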
The reward signal
TuneJury is an open, instance-level pairwise reward model for text-to-music, trained on human preference judgments from open datasets [4]. Here it drives both training-time conditioning and sample selection.
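For intuition about what a pairwise reward model optimizes, the sketch below uses the Bradley-Terry formulation, a standard choice for learning scalar rewards from pairwise preferences. Whether TuneJury uses exactly this objective is not stated here, so treat it as an assumption; both function names are hypothetical.

```python
from math import exp, log1p


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B."""
    d = reward_a - reward_b
    # Numerically stable logistic sigmoid of the reward difference.
    if d >= 0:
        return 1.0 / (1.0 + exp(-d))
    e = exp(d)
    return e / (1.0 + e)


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human choice, -log(sigmoid(d))."""
    d = reward_preferred - reward_rejected
    if d >= 0:
        return log1p(exp(-d))
    return -d + log1p(exp(d))
```

Under this view only reward differences matter: equal rewards give a 50% preference, and the loss shrinks as the preferred clip's reward pulls ahead of the rejected one.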