+---
+license: mit
+library_name: transformers
+pipeline_tag: feature-extraction
+tags:
+- audio
+- speech
+- sparse-autoencoder
+- sae
+- interpretability
+- mechanistic-interpretability
+- hubert
+base_model: facebook/hubert-large-ll60k
+---
+# Audio-SAE — HuBERT-large
+BatchTop-K Sparse Autoencoders trained on every transformer layer of
+[`facebook/hubert-large-ll60k`](https://huggingface.co/facebook/hubert-large-ll60k),
+from the paper
+**[AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders](https://arxiv.org/abs/2602.05027)**
+(EACL 2026).
+Each SAE decomposes the residual stream at one encoder layer into a sparse, largely
+interpretable dictionary of features.
+- **Code:** https://github.com/audiosae/audiosae_demo
+- **Paper:** https://arxiv.org/abs/2602.05027
+- **Collection:** https://huggingface.co/collections/Egorgij21/audio-sae
+## Specs
+| Backbone | Activation dim | Dict size | Expansion | `k` | Layers |
+|----------|---------------:|----------:|----------:|----:|-------:|
+| HuBERT-large | 1024 | 8192 | 8× | 50 | 24 |
+
+One SAE per encoder layer (`layer_1` … `layer_24`). Layer indices are 1-based and
+correspond to the output of the n-th transformer block.
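The numbers in the table can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: `layer_folder` is a hypothetical helper, not part of the repository's code, and reading `k` as the average number of active features per frame reflects the usual BatchTop-K convention rather than anything stated on this card.

```python
ACTIVATION_DIM = 1024  # HuBERT-large hidden size (from the table)
DICT_SIZE = 8192       # SAE dictionary size (from the table)
K = 50                 # top-k budget; assumed to be an average per frame
NUM_LAYERS = 24        # one SAE per transformer block


def layer_folder(layer: int) -> str:
    """Folder name of the SAE for a 1-based layer index (hypothetical helper)."""
    if not 1 <= layer <= NUM_LAYERS:
        raise ValueError(f"layer must be in 1..{NUM_LAYERS}, got {layer}")
    return f"layer_{layer}"


expansion = DICT_SIZE // ACTIVATION_DIM  # dictionary size relative to input dim
active_fraction = K / DICT_SIZE          # share of features active on average

print(expansion, f"{active_fraction:.2%}", layer_folder(12))
```

If activations are extracted with plain `transformers` instead of the repository's wrapper, `output_hidden_states=True` returns one more tensor than there are blocks (index 0 is the input to the first block), so layer `n` would presumably line up with `hidden_states[n]`. Treat that correspondence as an assumption and verify it against the repository code.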
+## Layout
+```
+layer_1/
+  ae.pt          # BatchTopKSAE state_dict
+  config.json    # training config (activation_dim, dict_size, k, …)
+layer_2/
+…
+layer_24/
+```
+Each `ae.pt` contains `encoder.{weight,bias}`, `decoder.weight`, `b_dec`, `k`.
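To show how these parameters fit together, here is a minimal NumPy sketch of a generic BatchTop-K forward pass on toy sizes. It is not the repository's implementation: the function names are hypothetical, the weight shapes follow the `torch.nn.Linear` convention (an assumption), subtracting `b_dec` before encoding is a common SAE choice assumed here, and the inference-time threshold path used in the Loading example is not covered.

```python
import numpy as np


def batch_topk_encode(x, enc_w, enc_b, b_dec, k):
    """Keep the k * N largest post-ReLU activations across the whole batch.

    x: (N, d) activations; enc_w: (m, d); enc_b: (m,); b_dec: (d,).
    """
    pre = np.maximum((x - b_dec) @ enc_w.T + enc_b, 0.0)  # (N, m)
    n_keep = k * x.shape[0]
    flat = pre.ravel()
    if n_keep >= flat.size:
        return pre
    keep = np.argpartition(flat, -n_keep)[-n_keep:]  # indices of the largest values
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(pre.shape)


def decode(f, dec_w, b_dec):
    """Reconstruct activations from sparse features; dec_w: (d, m)."""
    return f @ dec_w.T + b_dec


# Toy sizes for illustration (the real SAEs use d=1024, m=8192, k=50).
rng = np.random.default_rng(0)
d, m, k, n = 16, 128, 4, 8
enc_w, enc_b = rng.normal(size=(m, d)), np.zeros(m)
dec_w, b_dec = rng.normal(size=(d, m)), np.zeros(d)
x = rng.normal(size=(n, d))

f = batch_topk_encode(x, enc_w, enc_b, b_dec, k)
x_hat = decode(f, dec_w, b_dec)
print(f.shape, np.count_nonzero(f), x_hat.shape)
```

The sparsity budget is shared across the batch, so individual frames may use more or fewer than `k` features as long as the batch total stays at `k * N`.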
+## Loading
+```python
+import librosa
+import torch
+from huggingface_hub import hf_hub_download
+
+from audio_sae import BatchTopKSAE
+from audio_sae.models.hubert import MyHubert
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+layer = 12  # 1-based layer index, 1..24
+
+# 1. HuBERT-large encoder, tapped after `layer`
+hubert = MyHubert("facebook/hubert-large-ll60k", sae_after_layer=layer).to(device).eval()
+
+# 2. Matching SAE
+ckpt = hf_hub_download(
+    repo_id="Egorgij21/Audio-SAE-HuBERT-large",
+    filename=f"layer_{layer}/ae.pt",
+)
+sae = BatchTopKSAE.from_pretrained(ckpt, device=device)
+
+# 3. Run on audio (16 kHz mono, shape (1, num_samples))
+wav, _ = librosa.load("example.wav", sr=16000, mono=True)
+wav = torch.from_numpy(wav).unsqueeze(0).to(device)
+with torch.no_grad():
+    acts = hubert(wav)                                # (1, T, 1024)
+    features = sae.encode(acts, use_threshold=True)   # (1, T, 8192), sparse
+```
+See the [GitHub repo](https://github.com/audiosae/audiosae_demo) for a full inference and
+interpretability walkthrough.
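The `features` tensor from the snippet above has one row per HuBERT frame. As a rough sketch of how to relate frames back to time and find the most active dictionary entries, the example below works on a stand-in NumPy array. `num_frames`, `frame_times`, and `top_features` are hypothetical helpers, and the kernel and stride values are the standard HuBERT feature-extractor settings, an assumption worth checking against the backbone's configuration.

```python
import numpy as np

# Standard HuBERT convolutional front end (assumed; check the model config).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000


def num_frames(n_samples: int) -> int:
    """Number of encoder frames produced for a waveform of n_samples."""
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        n_samples = (n_samples - kernel) // stride + 1
    return n_samples


def frame_times(n_frames: int) -> np.ndarray:
    """Approximate start time (seconds) of each frame; the hop is 320 samples."""
    hop = int(np.prod(CONV_STRIDES))
    return np.arange(n_frames) * hop / SAMPLE_RATE


def top_features(features: np.ndarray, n: int = 5) -> np.ndarray:
    """Indices of the n features with the highest mean activation; features: (T, dict)."""
    mean_act = features.mean(axis=0)
    return np.argsort(mean_act)[::-1][:n]


# Stand-in for real SAE output on one second of audio.
T = num_frames(SAMPLE_RATE)
features = np.zeros((T, 8192), dtype=np.float32)
features[:, 7] = 1.0        # a feature that is active on every frame
features[10:20, 42] = 3.0   # a feature that fires briefly but strongly

print(T, frame_times(T)[:3], top_features(features, n=2))
```

With real output, `features[0].cpu().numpy()` would give the `(T, 8192)` array expected by `top_features`.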
+## Training
+- Architecture: BatchTop-K SAE, 8× expansion
+- Optimizer: Adam, lr 2e-4, 200 000 steps, with learning-rate decay starting at step 160 000
+- Loss: L2 reconstruction with batch-wide top-k (`k=50`)
+- Data: ~2.8k hours of mixed audio, covering speech (LibriSpeech, LibriHeavy, IEMOCAP,
+  ESD, Expresso, CREMA-D, MELD), music (MTG-Jamendo), and environmental sounds (MUSAN,
+  DEMAND, WHAM, FSD50K, VocalSound, Nonspeech7k, ESC-50, VGGSound)
+- Seed: 21
+See the paper for full training details and evaluation metrics.
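The optimizer settings listed above can be pictured as a simple schedule. The list does not state the shape of the decay, so the sketch below assumes a linear ramp to zero purely for illustration; `lr_at` is a hypothetical helper and the actual schedule is described in the paper.

```python
BASE_LR = 2e-4
TOTAL_STEPS = 200_000
DECAY_START = 160_000


def lr_at(step: int) -> float:
    """Learning rate at a given step: constant, then linear decay (assumed shape)."""
    if step < DECAY_START:
        return BASE_LR
    remaining = max(TOTAL_STEPS - step, 0)
    return BASE_LR * remaining / (TOTAL_STEPS - DECAY_START)


for step in (0, 160_000, 180_000, 200_000):
    print(step, lr_at(step))
```

Under this assumption the rate stays at 2e-4 for the first 80% of training and falls over the final 40 000 steps.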
+## Citation
+```bibtex
+@inproceedings{aparin2026audiosae,
+  title     = {AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders},
+  author    = {Aparin, Georgii and Sadekova, Tasnima and Rukhovich, Alexey and Yermekova, Assel and Kushnareva, Laida and Popov, Vadim and Kuznetsov, Kristian and Piontkovskaya, Irina},
+  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
+  year      = {2026},
+  address   = {Rabat, Morocco},
+}
+```
+## License
+MIT