Egorgij21 committed on
Commit 52cd891 · verified · 1 parent: eb37d01

Update README.md

  1. README.md +113 -3
README.md CHANGED
@@ -1,3 +1,113 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: mit
3
+ library_name: transformers
4
+ pipeline_tag: feature-extraction
5
+ tags:
6
+ - audio
7
+ - speech
8
+ - sparse-autoencoder
9
+ - sae
10
+ - interpretability
11
+ - mechanistic-interpretability
12
+ - hubert
13
+ base_model: facebook/hubert-large-ll60k
14
+ ---
15
+
16
+ # Audio-SAE — HuBERT-large
17
+
18
+ BatchTop-K Sparse Autoencoders trained on every transformer layer of
19
+ [`facebook/hubert-large-ll60k`](https://huggingface.co/facebook/hubert-large-ll60k),
20
+ from the paper
21
+ **[AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders](https://arxiv.org/abs/2602.05027)**
22
+ (EACL 2026).
23
+
24
+ Each SAE decomposes the residual stream at one encoder layer into a sparse, largely
25
+ interpretable dictionary of features.
26
+
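As a rough illustration of what "BatchTop-K" means, the sketch below applies batch-wide top-k selection to a toy batch of pre-activations in plain Python. It follows the general BatchTopK idea: keep the `k × batch_size` largest positive activations across the whole batch, so an individual frame may keep more or fewer than `k`. The function name `batch_topk` and the toy numbers are illustrative assumptions, not code from the `audio_sae` package.

```python
def batch_topk(pre_acts, k):
    """Keep the k * len(pre_acts) largest positive values across the whole batch.

    pre_acts: list of rows (one per frame), each a list of latent pre-activations.
    Returns rows of the same shape with every other entry set to 0.0.
    """
    n_rows = len(pre_acts)
    # Apply ReLU, then flatten to (value, row, col) triples.
    flat = [
        (max(v, 0.0), r, c)
        for r, row in enumerate(pre_acts)
        for c, v in enumerate(row)
    ]
    # Largest values first; ties are broken by position for determinism.
    flat.sort(key=lambda t: (-t[0], t[1], t[2]))
    out = [[0.0] * len(row) for row in pre_acts]
    for value, r, c in flat[: k * n_rows]:
        if value > 0.0:  # a latent zeroed by ReLU never becomes active
            out[r][c] = value
    return out


# Toy batch: 2 frames, 4 latents, k = 1, so 2 active latents in total.
toy = [[0.9, 0.8, -0.1, 0.0],
       [0.1, -0.5, 0.2, 0.05]]
sparse = batch_topk(toy, k=1)
print(sparse)
```

In this toy batch both surviving activations come from the first frame, which is the point of selecting across the batch rather than per frame: `k` is an average budget, not a per-frame limit.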
27
+ - **Code:** https://github.com/audiosae/audiosae_demo
28
+ - **Paper:** https://arxiv.org/abs/2602.05027
29
+ - **Collection:** https://huggingface.co/collections/Egorgij21/audio-sae
30
+
31
+ ## Specs
32
+
33
+ | Backbone | Activation dim | Dict size | Expansion | `k` | Layers |
34
+ |----------|---------------:|----------:|----------:|----:|-------:|
35
+ | HuBERT-large | 1024 | 8192 | 8× | 50 | 24 |
36
+
37
+ One SAE per encoder layer (`layer_1` … `layer_24`). Layer indices are 1-based and
38
+ correspond to the output of the n-th transformer block.
39
+
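The numbers in the table can be cross-checked with a little arithmetic. The sketch below assumes a conventional SAE parameterization (an encoder weight and bias, a decoder weight without its own bias, and a separate decoder bias `b_dec`); the exact tensor shapes are an assumption, since this README does not state them.

```python
activation_dim = 1024
expansion = 8
n_layers = 24

dict_size = activation_dim * expansion  # 8192, matching the table

# ASSUMED shapes: encoder weight + bias, decoder weight, decoder bias (b_dec).
params_per_sae = (
    dict_size * activation_dim      # encoder.weight
    + dict_size                     # encoder.bias
    + activation_dim * dict_size    # decoder.weight
    + activation_dim                # b_dec
)
total_params = params_per_sae * n_layers

print(f"dict_size      = {dict_size}")
print(f"params per SAE = {params_per_sae:,}")
print(f"all {n_layers} layers  = {total_params:,}")
```

Under these assumptions each SAE holds roughly 16.8 M parameters, or about 403 M across all 24 layers.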
40
+ ## Layout
41
+
42
+ ```
43
+ layer_1/
44
+     ae.pt           # BatchTopKSAE state_dict
45
+     config.json     # training config (activation_dim, dict_size, k, …)
46
+ layer_2/
47
+
48
+ layer_24/
49
+ ```
50
+
51
+ Each `ae.pt` contains `encoder.{weight,bias}`, `decoder.weight`, `b_dec`, `k`.
52
+
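Given the layout above, the repo-relative file names for any layer follow a fixed pattern. The helper below is a hypothetical convenience (`layer_files` is not part of `audio_sae` or `huggingface_hub`) that builds those names and rejects indices outside the 1-based range.

```python
N_LAYERS = 24  # layer_1 … layer_24, per the Specs table


def layer_files(layer: int) -> dict:
    """Return the repo-relative paths for one layer's checkpoint and config."""
    if not 1 <= layer <= N_LAYERS:
        raise ValueError(f"layer must be in 1..{N_LAYERS} (1-based), got {layer}")
    return {
        "weights": f"layer_{layer}/ae.pt",
        "config": f"layer_{layer}/config.json",
    }


all_files = [layer_files(n) for n in range(1, N_LAYERS + 1)]
print(all_files[0])
print(all_files[-1])
```

Each returned string is intended to be usable as the `filename` argument of `hf_hub_download`.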
53
+ ## Loading
54
+
55
+ ```python
56
+ import torch
57
+ from huggingface_hub import hf_hub_download
58
+ from audio_sae import BatchTopKSAE
59
+ from audio_sae.models.hubert import MyHubert
60
+
61
+ device = "cuda" if torch.cuda.is_available() else "cpu"
62
+ layer = 12
63
+
64
+ # 1. HuBERT-large encoder, tapped after `layer`
65
+ hubert = MyHubert("facebook/hubert-large-ll60k", sae_after_layer=layer).to(device).eval()
66
+
67
+ # 2. Matching SAE
68
+ ckpt = hf_hub_download(
69
+     repo_id="Egorgij21/Audio-SAE-HuBERT-large",
70
+     filename=f"layer_{layer}/ae.pt",
71
+ )
72
+ sae = BatchTopKSAE.from_pretrained(ckpt, device=device)
73
+
74
+ # 3. Run on audio
75
+ import librosa
76
+ wav, _ = librosa.load("example.wav", sr=16000, mono=True)
77
+ wav = torch.from_numpy(wav).unsqueeze(0).to(device)
78
+
79
+ with torch.no_grad():
80
+     acts = hubert(wav)                               # (1, T, 1024)
81
+     features = sae.encode(acts, use_threshold=True)  # (1, T, 8192), sparse
82
+ ```
83
+
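The `T` in the shape comments above is the number of encoder frames. As a sketch, it can be estimated from the waveform length using the default HuBERT convolutional front end (kernel sizes `10, 3, 3, 3, 3, 2, 2` and strides `5, 2, 2, 2, 2, 2, 2`, no padding). This assumes the `MyHubert` wrapper leaves the frame rate unchanged, which this README does not confirm; `num_frames` is an illustrative name.

```python
# Default HuBERT convolutional feature extractor (kernel sizes and strides).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of encoder frames T for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


one_second = num_frames(16_000)
ten_seconds = num_frames(160_000)
print(one_second, ten_seconds)
```

At 16 kHz this works out to roughly 50 frames per second, i.e. one feature vector about every 20 ms of audio.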
84
+ See the [GitHub repo](https://github.com/audiosae/audiosae_demo) for a full inference and
85
+ interpretability walkthrough.
86
+
87
+ ## Training
88
+
89
+ - Architecture: BatchTop-K SAE, 8× expansion
90
+ - Optimizer: Adam, lr 2e-4, 200 000 steps, decay from step 160 000
91
+ - Loss: L2 reconstruction with batch-wide top-k (`k=50`)
92
+ - Data: ~2.8 k hours of mixed audio — speech (LibriSpeech, LibriHeavy, IEMOCAP, ESD,
93
+ Expresso, CREMA-D, MELD), music (MTG-Jamendo) and environmental sounds (MUSAN,
94
+ DEMAND, WHAM, FSD50K, VocalSound, Nonspeech7k, ESC-50, VGGSound)
95
+ - Seed: 21
96
+
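The optimizer bullet gives the base learning rate and the step at which decay begins, but not the shape of the decay. The sketch below assumes a linear decay to zero between step 160 000 and step 200 000; that shape is an assumption made for illustration, so consult the paper for the schedule actually used.

```python
BASE_LR = 2e-4
TOTAL_STEPS = 200_000
DECAY_START = 160_000


def lr_at(step: int) -> float:
    """Learning rate at `step`, ASSUMING linear decay to zero after DECAY_START."""
    if step < DECAY_START:
        return BASE_LR
    remaining = max(TOTAL_STEPS - step, 0)
    return BASE_LR * remaining / (TOTAL_STEPS - DECAY_START)


for step in (0, 160_000, 180_000, 200_000):
    print(step, lr_at(step))
```

With this assumed shape the rate stays at `2e-4` for the first 80% of training and is halved by step 180 000.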
97
+ See the paper for full training details and evaluation metrics.
98
+
99
+ ## Citation
100
+
101
+ ```bibtex
102
+ @inproceedings{aparin2026audiosae,
103
+   title = {AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders},
104
+   author = {Aparin, Georgii and Sadekova, Tasnima and Rukhovich, Alexey and Yermekova, Assel and Kushnareva, Laida and Popov, Vadim and Kuznetsov, Kristian and Piontkovskaya, Irina},
105
+   booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
106
+   year = {2026},
107
+   address = {Rabat, Morocco},
108
+ }
109
+ ```
110
+
111
+ ## License
112
+
113
+ MIT