---
license: mit
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- speech
- sparse-autoencoder
- sae
- interpretability
- mechanistic-interpretability
- hubert
base_model: facebook/hubert-large-ll60k
---

# Audio-SAE — HuBERT-large

BatchTop-K Sparse Autoencoders trained on every transformer layer of [`facebook/hubert-large-ll60k`](https://huggingface.co/facebook/hubert-large-ll60k), from the paper **[AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders](https://arxiv.org/abs/2602.05027)** (EACL 2026).

Each SAE decomposes the residual stream at one encoder layer into a sparse, largely interpretable dictionary of features.

- **Code:** https://github.com/audiosae/audiosae_demo
- **Paper:** https://arxiv.org/abs/2602.05027
- **Collection:** https://huggingface.co/collections/Egorgij21/audio-sae

## Specs

| Backbone | Activation dim | Dict size | Expansion | `k` | Layers |
|----------|---------------:|----------:|----------:|----:|-------:|
| HuBERT-large | 1024 | 8192 | 8× | 50 | 24 |

One SAE per encoder layer (`layer_1` … `layer_24`). Layer indices are 1-based and correspond to the output of the n-th transformer block.

## Layout

```
layer_1/
  ae.pt          # BatchTopKSAE state_dict
  config.json    # training config (activation_dim, dict_size, k, …)
layer_2/
…
layer_24/
```

Each `ae.pt` contains `encoder.{weight,bias}`, `decoder.weight`, `b_dec`, `k`.

## Loading

```python
import torch
from huggingface_hub import hf_hub_download

from audio_sae import BatchTopKSAE
from audio_sae.models.hubert import MyHubert

device = "cuda" if torch.cuda.is_available() else "cpu"
layer = 12

# 1. HuBERT-large encoder, tapped after `layer`
hubert = MyHubert("facebook/hubert-large-ll60k", sae_after_layer=layer).to(device).eval()

# 2. Matching SAE
ckpt = hf_hub_download(
    repo_id="Egorgij21/Audio-SAE-hubert-large",
    filename=f"layer_{layer}/ae.pt",
)
sae = BatchTopKSAE.from_pretrained(ckpt, device=device)

# 3. Run on audio
import librosa

wav, _ = librosa.load("example.wav", sr=16000, mono=True)
wav = torch.from_numpy(wav).unsqueeze(0).to(device)

with torch.no_grad():
    acts = hubert(wav)                                # (1, T, 1024)
    features = sae.encode(acts, use_threshold=True)   # (1, T, 8192), sparse
```

See the [GitHub repo](https://github.com/audiosae/audiosae_demo) for a full inference and interpretability walkthrough.

## Training

- Architecture: BatchTop-K SAE, 8× expansion
- Optimizer: Adam, lr 2e-4, 200 000 steps, decay from step 160 000
- Loss: L2 reconstruction with batch-wide top-k (`k=50`)
- Data: ~2.8k hours of mixed audio: speech (LibriSpeech, LibriHeavy, IEMOCAP, ESD, Expresso, CREMA-D, MELD), music (MTG-Jamendo) and environmental sounds (MUSAN, DEMAND, WHAM, FSD50K, VocalSound, Nonspeech7k, ESC-50, VGGSound)
- Seed: 21

See the paper for full training details and evaluation metrics.

## Citation

```bibtex
@inproceedings{aparin2026audiosae,
  title     = {AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders},
  author    = {Aparin, Georgii and Sadekova, Tasnima and Rukhovich, Alexey and Yermekova, Assel and Kushnareva, Laida and Popov, Vadim and Kuznetsov, Kristian and Piontkovskaya, Irina},
  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2026},
  address   = {Rabat, Morocco},
}
```

## License

MIT
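
## Appendix: BatchTop-K selection sketch

The "batch-wide top-k (`k=50`)" mentioned under Training can be illustrated with a small, framework-free sketch. This is not the `audio_sae` implementation: the function name `batch_topk` is hypothetical, NumPy stands in for PyTorch, and the toy shapes are chosen only to keep the example fast. It assumes the usual BatchTop-K formulation, in which the `k * n_tokens` largest non-negative activations are kept across the whole batch, so individual tokens may end up with more or fewer than `k` active features. The Loading example passes `use_threshold=True`, which suggests that inference uses a threshold rather than this batch-wide selection; the sketch covers only the selection step.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * n_tokens largest activations across the whole batch.

    pre_acts: (n_tokens, dict_size) encoder pre-activations.
    Returns an array of the same shape with all other entries set to zero.
    """
    n_tokens, dict_size = pre_acts.shape
    acts = np.maximum(pre_acts, 0.0)           # ReLU: features are non-negative
    flat = acts.ravel()
    n_keep = min(k * n_tokens, flat.size)      # budget shared by the whole batch
    sparse = np.zeros_like(flat)
    if n_keep > 0:
        # Indices of the n_keep largest values (in no particular order).
        keep = np.argpartition(flat, flat.size - n_keep)[flat.size - n_keep:]
        sparse[keep] = flat[keep]
    return sparse.reshape(n_tokens, dict_size)


# Toy example: 4 tokens, 32 dictionary features, k = 3.
rng = np.random.default_rng(0)
pre_acts = rng.standard_normal((4, 32))
features = batch_topk(pre_acts, k=3)

print(features.shape)                          # same shape as the input
print(np.count_nonzero(features, axis=1))      # per-token counts can differ from k
```

With the checkpoint's real dimensions the input would have 8192 columns and `k=50`, so a batch of `T` frames would keep at most `50 * T` active features in total.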