Anime-XCodec2-44.1kHz-v2: A 44.1kHz Upsampling Variant of Anime-XCodec2 (v2)

License: CC BY-NC 4.0

TL;DR: Anime-XCodec2-44.1kHz-v2 is a fine-tuned variant of NandemoGHS/Anime-XCodec2. It adds upsampling layers and an RMS loss (both inspired by Inworld TTS-1) to produce 44.1kHz output, and was trained on ~22k hours of Japanese speech. This v2 updates the upsampler parameters and loss configuration, and fixes a RoPE bug inherited from the original XCodec2.

Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).


🔗 Quick Links


1) Model Summary

  • What it is: Version 2 of a neural speech codec based on Anime-XCodec2 (itself based on XCodec2), fine-tuned to output 44.1kHz high-fidelity Japanese speech (anime/game-style).
  • Key Change: Integrates an UpSamplerBlock into the decoder architecture and adds RMS Loss to training (both inspired by Inworld TTS-1).
  • Training scope: Decoder-only fine-tuning on ~22,000 hours of Japanese data. Encoder and codebook are frozen.
  • Compatibility: Speech tokens are identical to HKUSTAudio/xcodec2 and NandemoGHS/Anime-XCodec2.
  • Input Sampling rate: 16 kHz (for encoding, same as XCodec2).
  • Output Sampling rate: 44.1 kHz (decoded audio).

2) Intended Use

  • Decode XCodec2 speech tokens (e.g., from Llasa or other AR generators) into high-fidelity 44.1kHz Japanese speech (anime/game-style).
  • Upgrade existing Anime-XCodec2 (16kHz) pipelines to 44.1kHz output.
  • Audio Super-Resolution: Because the model accepts 16kHz input and outputs 44.1kHz reconstructed audio, it can also serve as a form of audio super-resolution. However, its performance for this purpose has not been evaluated.

3) How to Use (Important)

This model modifies the original XCodec2 architecture (upsampler blocks) and requires a custom library version that includes a fix for the RoPE bug (Issue #36).

You MUST use the provided custom xcodec2 library fork (v0.1.7 or later) for inference. The standard library or older custom libraries (like 0.1.6) will not work.

  • Installation:

    # Install the custom xcodec2 library (v0.1.7)
    pip install https://huggingface.co/NandemoGHS/Anime-XCodec2-44.1kHz-v2/resolve/main/xcodec2-0.1.7.tar.gz
    
  • Usage: Once the custom library is installed, you can load and use this model just as you would the original XCodec2 or Anime-XCodec2 models. The core inference logic remains the same.

For a complete, working code example, please refer to my Hugging Face Spaces Demo: https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-v2-Demo
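
As a rough orientation, the sketch below follows the encode/decode pattern from the upstream HKUSTAudio/xcodec2 model card. It assumes the custom fork keeps the same `XCodec2Model`, `encode_code`, and `decode_code` interface, which the note above suggests but this sketch does not verify. The function name `reconstruct` and the file paths are hypothetical. The Spaces demo remains the authoritative reference.

```python
MODEL_ID = "NandemoGHS/Anime-XCodec2-44.1kHz-v2"
INPUT_SR = 16_000   # encoder input rate, same as XCodec2
OUTPUT_SR = 44_100  # decoder output rate of this model


def reconstruct(in_path: str, out_path: str, device: str = "cpu") -> None:
    """Encode a 16 kHz mono file to XCodec2 tokens and decode it at 44.1 kHz."""
    # Heavy dependencies are imported lazily; they require the custom fork.
    import soundfile as sf
    import torch
    # Assumption: the fork exposes the same module path as upstream xcodec2.
    from xcodec2.modeling_xcodec2 import XCodec2Model

    model = XCodec2Model.from_pretrained(MODEL_ID).eval().to(device)

    wav, sr = sf.read(in_path, dtype="float32")
    if wav.ndim != 1 or sr != INPUT_SR:
        raise ValueError(
            f"expected mono audio at {INPUT_SR} Hz, got shape {wav.shape} at {sr} Hz"
        )

    wav_tensor = torch.from_numpy(wav).unsqueeze(0)  # shape (1, T)
    with torch.no_grad():
        codes = model.encode_code(input_waveform=wav_tensor)
        recon = model.decode_code(codes).cpu()  # assumed shape (1, 1, T')

    # Write at the 44.1 kHz output rate, not the 16 kHz input rate.
    sf.write(out_path, recon[0, 0, :].numpy(), OUTPUT_SR)


# Hypothetical usage:
# reconstruct("input_16k.wav", "output_44k.wav")
```

The main difference from a 16kHz XCodec2 pipeline is the sample rate used when saving: the decoded waveform should be written at 44.1kHz.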


4) Limitations & Trade-offs

  • Language scope: Optimized for Japanese. Performance on other languages may degrade.
  • Content domain: Tuned toward anime/game-style voices.
  • Library Dependency: Requires the custom xcodec2 library (v0.1.7 or later) linked above. It is not compatible with the original xcodec2 library or previous custom forks (e.g., v0.1.6).

5) Data (High-Level)

  • ~22,000 hours of Japanese speech, with a focus on anime/game-style voices.
  • Data was prepared for 44.1kHz target output during training.

6) Training Procedure (High-Level)

  • Base Model: NandemoGHS/Anime-XCodec2 (16kHz)
  • Architecture Modification:
    • Added an UpSamplerBlock (inspired by Inworld TTS-1) to the decoder to produce 44.1kHz output.
  • Loss Function:
    • Adopted RMS Loss (root-mean-square loss, from Inworld TTS-1) in addition to the original losses.
  • Frozen: Encoder and Codebook (token compatibility preserved).
  • Updated (fine-tuned): generator.backbone, generator.head, generator.upsampler, fc_post_a
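
The frozen/updated split above can be sketched as a name-prefix filter over the model's parameters. This is a minimal illustration, not the actual training code: the four prefixes come from the list above, while the helper names and the example parameter names are hypothetical.

```python
# Module prefixes listed as "Updated (fine-tuned)" above.
TRAINABLE_PREFIXES = (
    "generator.backbone",
    "generator.head",
    "generator.upsampler",
    "fc_post_a",
)


def is_trainable(param_name: str) -> bool:
    """Return True if the parameter belongs to one of the fine-tuned modules."""
    return any(
        param_name == prefix or param_name.startswith(prefix + ".")
        for prefix in TRAINABLE_PREFIXES
    )


def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in param_names if is_trainable(n)]
    frozen = [n for n in param_names if not is_trainable(n)]
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
example_names = [
    "encoder.conv.weight",
    "quantizer.codebook.weight",
    "generator.backbone.layers.0.weight",
    "generator.upsampler.blocks.0.weight",
    "fc_post_a.bias",
]
trainable, frozen = split_parameters(example_names)

# With a PyTorch module, the same predicate could drive the freezing step:
# for name, param in model.named_parameters():
#     param.requires_grad_(is_trainable(name))
```

Because the encoder and codebook never receive gradients under this scheme, the tokens they produce stay identical to those of the base model.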

Key Updates in v2

Compared to the first version, this v2 model includes the following key updates to the training configuration:

  1. RoPE Bug Fix: Corrected a RoPE (Rotary Position Embedding) bug present in the original XCodec2 implementation (See Issue #36).
  2. Upsampler Parameters: The upsampler settings were changed to hop_length=98, upsample_factors=[3, 3], and kernel_sizes=[9, 9].
  3. Perceptual Loss Model: The model used for calculating perceptual loss was switched from facebook/wav2vec2-large-xlsr-53 to imprt/kushinada-hubert-large.
  4. Spectral Discriminator Tuning: The STFT (Short-Time Fourier Transform) settings for the spectral discriminator were adjusted to be more suitable for 44.1kHz high-sampling-rate audio.
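
The upsampler settings in item 2 can be checked with a little arithmetic. The sketch below rests on two assumptions not stated above: that XCodec2 encodes 16kHz audio with a 320-sample hop (50 tokens per second), and that the decoder emits `hop_length` multiplied by the product of `upsample_factors` samples per token. The variable and function names are illustrative.

```python
from math import prod

OUTPUT_SR = 44_100
HOP_LENGTH = 98            # from the v2 upsampler settings
UPSAMPLE_FACTORS = [3, 3]  # from the v2 upsampler settings

# Assumption: XCodec2 uses a 320-sample hop on 16 kHz input.
INPUT_SR = 16_000
INPUT_HOP = 320

# Assumption: output samples per token = hop_length * product of the factors.
samples_per_token = HOP_LENGTH * prod(UPSAMPLE_FACTORS)  # 98 * 9 = 882

output_token_rate = OUTPUT_SR / samples_per_token  # tokens per second at 44.1 kHz
input_token_rate = INPUT_SR / INPUT_HOP            # tokens per second at 16 kHz


def decoded_num_samples(num_tokens: int) -> int:
    """Expected number of 44.1 kHz samples for a given token count."""
    return num_tokens * samples_per_token


print(samples_per_token, output_token_rate, input_token_rate)
```

Under these assumptions both rates come out to 50 tokens per second, which is consistent with the decoder consuming unchanged XCodec2 tokens while producing 44.1kHz audio.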

7) License

  • CC BY-NC 4.0

8) Acknowledgements

  • HKUSTAudio/xcodec2 (Original model)
  • Inworld AI for their work on Inworld TTS-1 (Upsampler architecture and RMS Loss).
  • imprt for the kushinada-hubert-large model used in perceptual loss.
  • Thanks to contributors and the community around Japanese speech resources.