---
license: mit
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- speech
- sparse-autoencoder
- sae
- interpretability
- mechanistic-interpretability
- hubert
base_model: facebook/hubert-large-ll60k
---

# Audio-SAE — HuBERT-large

BatchTop-K Sparse Autoencoders trained on every transformer layer of
[`facebook/hubert-large-ll60k`](https://huggingface.co/facebook/hubert-large-ll60k),
from the paper
**[AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders](https://arxiv.org/abs/2602.05027)**
(EACL 2026).

Each SAE decomposes the residual stream at one encoder layer into a sparse, largely
interpretable dictionary of features.

- **Code:** https://github.com/audiosae/audiosae_demo
- **Paper:** https://arxiv.org/abs/2602.05027
- **Collection:** https://huggingface.co/collections/Egorgij21/audio-sae

## Specs

| Backbone | Activation dim | Dict size | Expansion | `k` | Layers |
|----------|---------------:|----------:|----------:|----:|-------:|
| HuBERT-large | 1024 | 8192 | 8× | 50 | 24 |

One SAE per encoder layer (`layer_1` … `layer_24`). Layer indices are 1-based:
`layer_n` corresponds to the output of the n-th transformer block.

## Layout

```
layer_1/
  ae.pt          # BatchTopKSAE state_dict
  config.json    # training config (activation_dim, dict_size, k, …)
layer_2/
…
layer_24/
```

Each `ae.pt` contains `encoder.{weight,bias}`, `decoder.weight`, `b_dec`, `k`.
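
As a quick sanity check after downloading a checkpoint, the tensor shapes implied by the spec table can be written out. This is a minimal sketch that assumes `encoder` and `decoder` are `torch.nn.Linear`-style modules (weights stored as `(out_features, in_features)`) and that `k` is a scalar; the helper names `expected_shapes` and `parameter_count` are hypothetical and not part of `audio_sae`.

```python
from math import prod


def expected_shapes(activation_dim: int = 1024, dict_size: int = 8192) -> dict:
    """Shapes implied by the spec table, ASSUMING nn.Linear-style layers."""
    return {
        "encoder.weight": (dict_size, activation_dim),  # activation -> dictionary
        "encoder.bias": (dict_size,),
        "decoder.weight": (activation_dim, dict_size),  # dictionary -> activation
        "b_dec": (activation_dim,),
        "k": (),  # assumed to be a scalar entry
    }


def parameter_count(shapes: dict) -> int:
    """Number of weights and biases, excluding the scalar `k`."""
    return sum(prod(shape) for name, shape in shapes.items() if name != "k")


shapes = expected_shapes()
total = parameter_count(shapes)
print(total)  # 16786432, i.e. about 16.8M values per layer
```

The same dictionary can be compared with `{name: tuple(t.shape) for name, t in state.items()}` for a loaded `ae.pt`; a mismatch would mean the layout assumption above does not hold for this checkpoint.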

## Loading

```python
import librosa
import torch
from huggingface_hub import hf_hub_download
from audio_sae import BatchTopKSAE
from audio_sae.models.hubert import MyHubert

device = "cuda" if torch.cuda.is_available() else "cpu"
layer = 12  # 1-based layer index, 1..24

# 1. HuBERT-large encoder, tapped after `layer`
hubert = MyHubert("facebook/hubert-large-ll60k", sae_after_layer=layer).to(device).eval()

# 2. Matching SAE
ckpt = hf_hub_download(
    repo_id="Egorgij21/Audio-SAE-hubert-large",
    filename=f"layer_{layer}/ae.pt",
)
sae = BatchTopKSAE.from_pretrained(ckpt, device=device)

# 3. Run on audio (resampled to 16 kHz mono)
wav, _ = librosa.load("example.wav", sr=16000, mono=True)
wav = torch.from_numpy(wav).unsqueeze(0).to(device)   # (1, num_samples)

with torch.no_grad():
    acts = hubert(wav)                                # (1, T, 1024)
    features = sae.encode(acts, use_threshold=True)   # (1, T, 8192), sparse
```
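
To relate the time axis `T` of `acts` and `features` back to the waveform, the sketch below computes the number of frames and the start time of each frame. It assumes that `MyHubert` keeps the standard HuBERT convolutional front end (kernel sizes and strides as in the `facebook/hubert-large-ll60k` config, 320-sample hop at 16 kHz); the helper names `num_frames` and `frame_start_seconds` are hypothetical.

```python
# Kernel sizes and strides of the standard HuBERT convolutional feature extractor.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000


def num_frames(num_samples: int) -> int:
    """Frames produced for a waveform of `num_samples` (no padding assumed)."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return max(length, 0)


def frame_start_seconds(frame_index: int) -> float:
    """Start time of a frame, using the product of all strides as the hop."""
    hop = 1
    for stride in CONV_STRIDES:
        hop *= stride  # 5 * 2**6 = 320 samples, i.e. 20 ms at 16 kHz
    return frame_index * hop / SAMPLE_RATE


print(num_frames(16_000))       # 49 frames for one second of audio
print(frame_start_seconds(50))  # 1.0
```

With this mapping, a feature that fires at frame index `t` in `features[0, t]` can be located at roughly `frame_start_seconds(t)` in `example.wav`.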

See the [GitHub repo](https://github.com/audiosae/audiosae_demo) for a full inference and
interpretability walkthrough.

## Training

- Architecture: BatchTop-K SAE, 8× expansion
- Optimizer: Adam, lr 2e-4, 200 000 steps, learning-rate decay from step 160 000
- Loss: L2 reconstruction with batch-wide top-k (`k=50`)
- Data: ~2.8k hours of mixed audio: speech (LibriSpeech, LibriHeavy, IEMOCAP, ESD,
  Expresso, CREMA-D, MELD), music (MTG-Jamendo) and environmental sounds (MUSAN,
  DEMAND, WHAM, FSD50K, VocalSound, Nonspeech7k, ESC-50, VGGSound)
- Seed: 21
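
The "batch-wide top-k" selection in the loss line can be illustrated with a small pure-Python sketch. It is a toy version of the general BatchTop-K idea, not the repository's implementation: instead of keeping `k` activations per frame, the `k * batch_size` largest activations across the whole batch are kept, so individual frames may use more or fewer than `k` features. The function name `batch_top_k` is hypothetical.

```python
def batch_top_k(pre_acts, k):
    """Keep the k * batch_size largest post-ReLU activations across the batch."""
    budget = k * len(pre_acts)
    # ReLU first, so negative pre-activations can never be selected as non-zero.
    relu = [[max(value, 0.0) for value in row] for row in pre_acts]
    # Flatten to (value, row, column) triples and rank them globally.
    ranked = sorted(
        ((value, i, j) for i, row in enumerate(relu) for j, value in enumerate(row)),
        key=lambda item: item[0],
        reverse=True,
    )
    sparse = [[0.0] * len(row) for row in relu]
    for value, i, j in ranked[:budget]:
        sparse[i][j] = value
    return sparse


# Two frames, four dictionary features, k = 1 (budget of 2 activations in total).
pre_acts = [
    [0.9, 0.1, 0.8, 0.7],
    [0.2, -0.3, 0.05, 0.0],
]
sparse = batch_top_k(pre_acts, k=1)
print(sparse)  # [[0.9, 0.0, 0.8, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

In this toy batch both surviving activations land in the first frame, which shows that `k` is an average per-frame budget over the batch rather than a hard per-frame limit.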

See the paper for full training details and evaluation metrics.

## Citation

```bibtex
@inproceedings{aparin2026audiosae,
  title     = {AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders},
  author    = {Aparin, Georgii and Sadekova, Tasnima and Rukhovich, Alexey and Yermekova, Assel and Kushnareva, Laida and Popov, Vadim and Kuznetsov, Kristian and Piontkovskaya, Irina},
  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2026},
  address   = {Rabat, Morocco},
}
```

## License

MIT