Anime-XCodec2-44.1kHz-v2: A 44.1kHz Upsampling Variant of Anime-XCodec2 (v2)

TL;DR: Anime-XCodec2-44.1kHz-v2 is a fine-tuned variant of NandemoGHS/Anime-XCodec2. It adds upsampling layers and an RMS loss (both inspired by Inworld TTS-1) to produce 44.1kHz output, and was trained on ~22k hours of Japanese speech. Compared to v1, this v2 updates the upsampler parameters and loss configuration, and fixes a RoPE bug inherited from the original XCodec2.

Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).

🔗 Quick Links

Demo (Gradio / Hugging Face Spaces): https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-v2-Demo
This repository (v2 44.1kHz fine-tune): NandemoGHS/Anime-XCodec2-44.1kHz-v2
Baseline 16kHz model: NandemoGHS/Anime-XCodec2
Original XCodec2: HKUSTAudio/xcodec2
Reference Paper (Inworld TTS-1): https://arxiv.org/abs/2507.21138
Reference Implementation (Inworld TTS): https://github.com/inworld-ai/tts

1) Model Summary

What it is: Version 2 of a neural speech codec based on Anime-XCodec2 (itself based on XCodec2), fine-tuned to output 44.1kHz high-fidelity Japanese speech (anime/game-style).
Key Change: Adds an UpSamplerBlock to the decoder architecture and trains with an RMS Loss (both inspired by Inworld TTS-1).
Training scope: Decoder-only fine-tuning on ~22,000 hours of Japanese data. Encoder and codebook are frozen.
Compatibility: Speech tokens are identical to HKUSTAudio/xcodec2 and NandemoGHS/Anime-XCodec2.
Input Sampling rate: 16 kHz (for encoding, same as XCodec2).
Output Sampling rate: 44.1 kHz (decoded audio).

2) Intended Use

Decode XCodec2 speech tokens (e.g., from Llasa or other AR generators) into high-fidelity 44.1kHz Japanese speech (anime/game-style).
Upgrade existing Anime-XCodec2 (16kHz) pipelines to 44.1kHz output.
Audio Super-Resolution: Because the model accepts 16kHz input and outputs reconstructed audio at 44.1kHz, it can also serve as a form of audio super-resolution. However, its performance for this purpose has not been evaluated.

3) How to Use (Important)

This model modifies the original XCodec2 architecture (upsampler blocks) and requires a custom library version that includes a fix for the RoPE bug (Issue #36).

You MUST use the provided custom xcodec2 library fork (v0.1.7 or later) for inference. The standard library or older custom libraries (like 0.1.6) will not work.

Installation:

# Install the custom xcodec2 library (v0.1.7)
pip install https://huggingface.co/NandemoGHS/Anime-XCodec2-44.1kHz-v2/resolve/main/xcodec2-0.1.7.tar.gz
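
After installing, it can help to confirm that the environment actually picked up the fork rather than an older build. The following is a minimal sketch using only the Python standard library. It assumes the installed distribution is named `xcodec2` (inferred from the tarball name, not confirmed) and that the version string is plain dotted numbers; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (0, 1, 7)  # minimum fork version stated in this card


def parse_version(text):
    """Turn '0.1.7' into (0, 1, 7). Assumes a plain dotted numeric version."""
    return tuple(int(part) for part in text.split("."))


def is_supported(text):
    """True if the given version string is 0.1.7 or newer."""
    return parse_version(text) >= MIN_VERSION


def check_installed(dist_name="xcodec2"):
    """Return the installed version, or raise if it is missing or too old.

    The default distribution name is an assumption based on the tarball name.
    """
    try:
        installed = version(dist_name)
    except PackageNotFoundError:
        raise RuntimeError(f"{dist_name} is not installed") from None
    if not is_supported(installed):
        raise RuntimeError(f"{dist_name} {installed} is older than 0.1.7")
    return installed
```

Calling `check_installed()` before loading the model turns a silent version mismatch into an explicit error.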

Usage: Once the custom library is installed, you can load and use this model just as you would the original XCodec2 or Anime-XCodec2 models. The core inference logic remains the same.

For a complete, working code example, please refer to my Hugging Face Spaces Demo: https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-v2-Demo
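
For orientation, here is a minimal round-trip sketch modeled on the original XCodec2 usage (`XCodec2Model.from_pretrained`, `encode_code`, `decode_code`). It assumes the custom fork keeps that interface and the `(batch, 1, samples)` output shape unchanged, which this card implies but does not spell out; `roundtrip` and `main` are hypothetical helper names. The demo linked above remains the authoritative example.

```python
MODEL_ID = "NandemoGHS/Anime-XCodec2-44.1kHz-v2"
INPUT_SR = 16_000   # encoder input rate stated in this card
OUTPUT_SR = 44_100  # decoder output rate stated in this card


def roundtrip(model, waveform):
    """Encode a (1, num_samples) 16 kHz waveform to tokens, then decode them.

    `model` only needs the two methods used in the original XCodec2 examples.
    """
    codes = model.encode_code(input_waveform=waveform)
    return codes, model.decode_code(codes)


def main(in_path, out_path):
    # Heavy dependencies are imported here so the helper above stays
    # importable without them.
    import soundfile as sf
    import torch
    from xcodec2.modeling_xcodec2 import XCodec2Model

    model = XCodec2Model.from_pretrained(MODEL_ID).eval()
    wav, sr = sf.read(in_path, dtype="float32")
    if wav.ndim != 1 or sr != INPUT_SR:
        raise ValueError("expected mono 16 kHz input")
    with torch.no_grad():
        _, audio = roundtrip(model, torch.from_numpy(wav).unsqueeze(0))
    # Assumed output shape: (batch, 1, samples), as in the original XCodec2.
    sf.write(out_path, audio[0, 0].cpu().numpy(), OUTPUT_SR)
```

Note that the output file is written at 44.1kHz even though the input is read at 16kHz. When decoding tokens from an AR generator such as Llasa, only the `decode_code` step is needed.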

4) Limitations & Trade-offs

Language scope: Optimized for Japanese. Performance on other languages may degrade.
Content domain: Tuned toward anime/game-style voices.
Library Dependency: Requires the custom xcodec2 library fork (v0.1.7 or later) linked above. The model is not compatible with the original xcodec2 library or with earlier custom forks (e.g., v0.1.6).

5) Data (High-Level)

~22,000 hours of Japanese speech, with a focus on anime/game-style voices.
Data was prepared for 44.1kHz target output during training.

6) Training Procedure (High-Level)

Base Model: NandemoGHS/Anime-XCodec2 (16kHz)
Architecture Modification:
- Integrated the UpSamplerBlock from the Inworld TTS-1 implementation into the decoder.
Loss Function:
- Added an RMS (root mean square) loss, taken from Inworld TTS-1, alongside the original losses.
Frozen: Encoder and Codebook (token compatibility preserved).
Updated (fine-tuned): generator.backbone, generator.head, generator.upsampler, fc_post_a
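
The frozen/updated split above can be expressed as a filter on parameter names. This is only a sketch, not the actual training code: it assumes the four module names listed above are prefixes of the parameter names returned by PyTorch's `named_parameters()`, and the helper names are hypothetical.

```python
# Module prefixes listed as "Updated (fine-tuned)" in this card.
TRAINABLE_PREFIXES = (
    "generator.backbone",
    "generator.head",
    "generator.upsampler",
    "fc_post_a",
)


def is_trainable(param_name):
    """True if the parameter belongs to one of the fine-tuned modules."""
    return any(
        param_name == prefix or param_name.startswith(prefix + ".")
        for prefix in TRAINABLE_PREFIXES
    )


def freeze_all_but_decoder(named_parameters):
    """Set requires_grad on each (name, parameter) pair.

    Returns the names left trainable so the split can be inspected.
    """
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = is_trainable(name)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a loaded PyTorch model this would be called as `freeze_all_but_decoder(model.named_parameters())`. Everything outside the listed prefixes, including the encoder and codebook, then receives no gradient updates.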

Key Updates in v2

Compared to the first version, this v2 model includes the following key updates to the training configuration:

RoPE Bug Fix: Corrected a RoPE (Rotary Position Embedding) bug present in the original XCodec2 implementation (See Issue #36).
Upsampler Parameters: The upsampler settings were changed to hop_length=98, upsample_factors=[3, 3], and kernel_sizes=[9, 9].
Perceptual Loss Model: The model used for calculating perceptual loss was switched from facebook/wav2vec2-large-xlsr-53 to imprt/kushinada-hubert-large.
Spectral Discriminator Tuning: The STFT (Short-Time Fourier Transform) settings for the spectral discriminator were adjusted to better suit 44.1kHz audio.
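
As a sanity check on the upsampler values above, the arithmetic below shows how they relate to the two sampling rates. It assumes the upsampling factors multiply with `hop_length` to give the number of output samples per token (an assumption about the decoder layout, not something this card states); the function name is hypothetical.

```python
from math import prod


def samples_per_token(hop_length, upsample_factors):
    """Output samples produced per speech token, assuming the factors multiply."""
    return hop_length * prod(upsample_factors)


OUTPUT_SR = 44_100  # decoder output rate stated in this card
INPUT_SR = 16_000   # encoder input rate stated in this card

spt = samples_per_token(98, [3, 3])       # 98 * 3 * 3
tokens_per_second = OUTPUT_SR / spt       # token rate implied at 44.1 kHz
input_samples_per_token = INPUT_SR / tokens_per_second

print(spt, tokens_per_second, input_samples_per_token)
```

Under that assumption each token expands to 882 output samples, which gives 50 tokens per second at 44.1kHz and 320 input samples per token at 16kHz. That is consistent with the token stream being unchanged while only the decoder's output rate differs.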

7) License

CC-BY-NC 4.0 (inherited from XCodec2 and Anime-XCodec2).
See: https://creativecommons.org/licenses/by-nc/4.0/

8) Acknowledgements

HKUSTAudio/xcodec2 (Original model)
Inworld AI for their work on Inworld TTS-1 (Upsampler architecture and RMS Loss).
imprt for the kushinada-hubert-large model used in perceptual loss.
Thanks to contributors and the community around Japanese speech resources.