arxiv:2506.07520

LeVo: High-Quality Song Generation with Multi-Preference Alignment

Published on Jun 9, 2025

Upvote

Authors:

Shun Lei ,

Yaoxun Xu ,

Huaicheng Zhang ,

Wei Tan ,

Hangting Chen ,

Zhiyong Wu ,

Abstract

LeVo, a framework combining an LM and a music codec, improves lyrics-to-song generation by parallelly modeling mixed and dual-track tokens, using transformer decoders, and employing direct preference optimization to enhance musicality and instruction following.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeLM is capable of parallelly modeling two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and DPO post-training. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics. Ablation studies further justify the effectiveness of our designs. Audio examples are available at https://levo-demo.github.io/.