PhonoQ 2.0 English
Framewise phonological feature recognition for English speech.
This model returns phonological probabilities for manner, vowel height, vowel backness, place, and voicing, plus a hard conditional 22-feature representation per frame.
Usage
pip install torch transformers soundfile safetensors
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "abnerh/phonoq-2.0-english"

# Load the audio and downmix multi-channel files to mono.
audio, sr = sf.read("jacket_f.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

processor = AutoFeatureExtractor.from_pretrained(model_id)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# The model ships custom code, so trust_remote_code=True is required.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

with torch.no_grad():
    out = model(**inputs)

print(out.features.shape)              # [1, T, 22]
print(out.manner_probabilities.shape)  # [1, T, 9]
print(out.vowel_height.shape)          # [1, T, 3]
print(out.vowel_backness.shape)        # [1, T, 3]
print(out.place_probabilities.shape)   # [1, T, 5]
print(out.voice_probabilities.shape)   # [1, T, 2]
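The usage snippet passes the file's native sampling rate straight to the feature extractor. Wav2Vec2-style extractors are usually configured for one fixed rate and reject audio at any other rate, so a resampling step may be needed for arbitrary recordings. The sketch below is one way to do that. The helper name `to_mono_at_rate` and the 16 kHz default are assumptions rather than part of this repository, and it uses SciPy, which is not in the install line above.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def to_mono_at_rate(audio, sr, target_sr=16_000):
    """Downmix to mono and resample to target_sr.

    Hypothetical helper: the 16 kHz default is an assumption, so pass the
    rate the feature extractor actually expects.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim > 1:
        # soundfile returns (frames, channels); average over channels.
        audio = audio.mean(axis=1)
    if sr == target_sr:
        return audio
    g = gcd(sr, target_sr)
    # Polyphase resampling applies an anti-aliasing filter internally.
    resampled = resample_poly(audio, target_sr // g, sr // g)
    return resampled.astype(np.float32)
```

If the loaded extractor exposes a `sampling_rate` attribute, calling `to_mono_at_rate(audio, sr, processor.sampling_rate)` and then passing `sampling_rate=processor.sampling_rate` to the processor avoids hard-coding the rate.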
Outputs
- features: hard conditional 22-dimensional features, [B, T, 22]
- manner_probabilities: [B, T, 9]
- vowel_height: [B, T, 3]
- vowel_backness: [B, T, 3]
- place_probabilities: [B, T, 5]
- voice_probabilities: [B, T, 2]
- attention_mask: valid encoder frames, [B, T]
- feature_names: names for the 22 feature dimensions
Feature order:
silence, stop, nasal, rhotic, fricative, affricate, approximant, lateral, vowel,
high, mid, low, front, central, back,
labial, alveolar, velar, palatal, postalveolar,
voiceless, voiced
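The feature order above lines up with the sizes of the probability heads (9 manner, 3 height, 3 backness, 5 place, 2 voicing), so one 22-dimensional frame can be split back into named groups. The sketch below shows one way to do that. The helper `decode_frame`, the group names, and the 0.5 threshold are hypothetical. It also assumes that the hard features are 0/1 indicators and that groups which do not apply to a frame are all zero, which the card does not state explicitly.

```python
FEATURE_NAMES = [
    "silence", "stop", "nasal", "rhotic", "fricative", "affricate",
    "approximant", "lateral", "vowel",
    "high", "mid", "low", "front", "central", "back",
    "labial", "alveolar", "velar", "palatal", "postalveolar",
    "voiceless", "voiced",
]

# Group boundaries implied by the head sizes 9, 3, 3, 5, 2 (an assumption).
GROUPS = {
    "manner": slice(0, 9),
    "height": slice(9, 12),
    "backness": slice(12, 15),
    "place": slice(15, 20),
    "voicing": slice(20, 22),
}


def decode_frame(vector, threshold=0.5):
    """Return {group: [names of active features]} for one 22-dim frame."""
    if len(vector) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} values, got {len(vector)}")
    decoded = {}
    for group, span in GROUPS.items():
        names = FEATURE_NAMES[span]
        values = vector[span]
        decoded[group] = [n for n, v in zip(names, values) if v > threshold]
    return decoded
```

With the model output from the usage snippet, a single frame `t` could be decoded as `decode_frame(out.features[0, t].tolist())`. Reading the names from `out.feature_names` instead of the hard-coded list would be safer if the two ever differ.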
Example: "jacket"
CLI pretty view for jacket_f.wav, omitting leading and trailing silence:
[ 1.01- 1.15] affricate 0.84 | voiced 0.96, postalveolar 0.96
[ 1.15- 1.29] vowel 0.92 | low 0.95, voiced 0.95, back 0.95
[ 1.29- 1.39] stop 0.91 | voiceless 0.96, velar 0.96
[ 1.39- 1.53] vowel 0.92 | high 0.97, voiced 0.96, front 0.96
[ 1.53- 1.67] stop 0.67 | alveolar 0.96, voiceless 0.87
Rough phonological pattern:
affricate + vowel + stop + vowel + stop
Viewing Probabilities
The following snippet prints the top manner class and its probability for every frame between the first and last non-silence frames.
manner_labels = [
    "silence", "stop", "nasal", "rhotic", "fricative", "affricate",
    "approximant", "lateral", "vowel",
]

manner = out.manner_probabilities[0]
mask = out.attention_mask[0].bool()
manner = manner[mask]

best_manner = manner.argmax(dim=-1)
non_silence = (best_manner != 0).nonzero(as_tuple=True)[0]

if len(non_silence) == 0:
    print("No non-silence frames found.")
else:
    start = int(non_silence[0])
    end = int(non_silence[-1]) + 1
    print(f"Non-silence frame range: {start}-{end - 1}")
    print()
    for frame_idx in range(start, end):
        probs = manner[frame_idx]
        best = int(probs.argmax())
        print(f"{frame_idx:03d} {manner_labels[best]:10s} {float(probs[best]):.3f}")
Example output for jacket_f.wav:
Non-silence frame range: 50-82
050 affricate 0.595
051 affricate 0.869
052 affricate 0.953
053 affricate 0.957
054 affricate 0.966
055 affricate 0.975
056 affricate 0.555
057 vowel 0.948
058 vowel 0.948
059 vowel 0.940
060 vowel 0.944
061 vowel 0.949
062 vowel 0.949
063 vowel 0.780
064 stop 0.933
065 stop 0.924
066 stop 0.935
067 stop 0.935
068 stop 0.839
069 vowel 0.954
070 vowel 0.953
071 vowel 0.949
072 vowel 0.939
073 vowel 0.914
074 vowel 0.841
075 vowel 0.897
076 stop 0.587
077 stop 0.793
078 stop 0.662
079 stop 0.657
080 stop 0.761
081 stop 0.673
082 stop 0.531
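The per-frame listing can be collapsed into segments similar to the pretty view by merging runs of identical labels. The sketch below is not the CLI's implementation. The helper `frames_to_segments` is hypothetical, and the 20 ms frame hop is an assumption inferred from the listings above, where 7 frames span about 0.14 s.

```python
from itertools import groupby

FRAME_HOP_S = 0.02  # assumption: 20 ms per frame, inferred from the listings


def frames_to_segments(labels, first_frame=0, hop_s=FRAME_HOP_S):
    """Merge runs of identical labels into (label, start_s, end_s, n_frames)."""
    segments = []
    idx = first_frame
    for label, run in groupby(labels):
        n = len(list(run))
        start_s = round(idx * hop_s, 2)
        end_s = round((idx + n) * hop_s, 2)
        segments.append((label, start_s, end_s, n))
        idx += n
    return segments


# Top manner labels for frames 050-082, copied from the listing above.
jacket_labels = (
    ["affricate"] * 7 + ["vowel"] * 7 + ["stop"] * 5
    + ["vowel"] * 7 + ["stop"] * 7
)
for label, start_s, end_s, n in frames_to_segments(jacket_labels, first_frame=50):
    print(f"[{start_s:5.2f}-{end_s:5.2f}] {label} ({n} frames)")
```

Under this assumption the boundaries land about 10 ms earlier than those in the pretty view (for example 1.00 instead of 1.01), so the CLI presumably applies a small offset that this sketch does not model.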
CLI
The repository includes best.ckpt for PhonoQ CLI compatibility:
phonoq predict jacket_f.wav \
    --model abnerh/phonoq-2.0-english \
    --outdir outputs \
    --pretty
Note
This model uses custom Transformers code and must be loaded with
trust_remote_code=True.