PhonoQ 2.0 English

Framewise phonological feature recognition for English speech.

For each frame, the model returns probabilities for manner, vowel height, vowel backness, place, and voicing, plus a hard conditional 22-dimensional feature vector.

Usage

pip install torch transformers soundfile safetensors

import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "abnerh/phonoq-2.0-english"
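# soundfile returns the file's native sample rate. The feature extractor may
# reject audio at a rate other than the one it expects, so resample first if needed.
audio, sr = sf.read("jacket_f.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel audio to mono

processor = AutoFeatureExtractor.from_pretrained(model_id)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)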

model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

with torch.no_grad():
    out = model(**inputs)

print(out.features.shape)              # [1, T, 22]
print(out.manner_probabilities.shape)  # [1, T, 9]
print(out.vowel_height.shape)          # [1, T, 3]
print(out.vowel_backness.shape)        # [1, T, 3]
print(out.place_probabilities.shape)   # [1, T, 5]
print(out.voice_probabilities.shape)   # [1, T, 2]

Outputs

features: hard conditional 22-dimensional features, [B, T, 22]
manner_probabilities: [B, T, 9]
vowel_height: [B, T, 3]
vowel_backness: [B, T, 3]
place_probabilities: [B, T, 5]
voice_probabilities: [B, T, 2]
attention_mask: mask marking the valid encoder frames, [B, T]
feature_names: names for the 22 feature dimensions

Feature order:

silence, stop, nasal, rhotic, fricative, affricate, approximant, lateral, vowel,
high, mid, low, front, central, back,
labial, alveolar, velar, palatal, postalveolar,
voiceless, voiced
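
The feature order above can be used to turn one 22-dimensional frame into readable labels. The following is a minimal sketch, not part of PhonoQ: the helper name active_features, the group names, the 0.5 threshold, and the hand-written example frame are all assumptions made for illustration. The group boundaries are inferred from the sizes of the separate probability outputs (9, 3, 3, 5, 2).

```python
# Feature order copied from this card. The group boundaries are an assumption
# based on the sizes of the separate probability outputs (9, 3, 3, 5, 2).
FEATURE_NAMES = [
    "silence", "stop", "nasal", "rhotic", "fricative", "affricate",
    "approximant", "lateral", "vowel",
    "high", "mid", "low", "front", "central", "back",
    "labial", "alveolar", "velar", "palatal", "postalveolar",
    "voiceless", "voiced",
]

GROUPS = {
    "manner": (0, 9),
    "vowel_height": (9, 12),
    "vowel_backness": (12, 15),
    "place": (15, 20),
    "voicing": (20, 22),
}


def active_features(vector, threshold=0.5):
    """Return the names of the features above `threshold`, grouped by category."""
    if len(vector) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} values, got {len(vector)}")
    return {
        group: [FEATURE_NAMES[i] for i in range(start, stop) if vector[i] > threshold]
        for group, (start, stop) in GROUPS.items()
    }


# Hypothetical frame written by hand: a voiced postalveolar affricate.
frame = [0.0] * len(FEATURE_NAMES)
for name in ("affricate", "postalveolar", "voiced"):
    frame[FEATURE_NAMES.index(name)] = 1.0

print(active_features(frame))
```

To apply this to model output, pass a single frame as a plain list, for example out.features[0, t].tolist() for some frame index t.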

Example: "jacket"

CLI pretty view for jacket_f.wav, omitting leading and trailing silence:

[  1.01-  1.15]  affricate 0.84 | voiced 0.96, postalveolar 0.96
[  1.15-  1.29]  vowel 0.92 | low 0.95, voiced 0.95, back 0.95
[  1.29-  1.39]  stop 0.91 | voiceless 0.96, velar 0.96
[  1.39-  1.53]  vowel 0.92 | high 0.97, voiced 0.96, front 0.96
[  1.53-  1.67]  stop 0.67 | alveolar 0.96, voiceless 0.87

Rough phonological pattern:

affricate + vowel + stop + vowel + stop

Viewing Probabilities

The following snippet prints the top manner label and its probability for every frame between the first and last non-silence frames.

manner_labels = [
    "silence", "stop", "nasal", "rhotic", "fricative", "affricate",
    "approximant", "lateral", "vowel",
]

manner = out.manner_probabilities[0]
mask = out.attention_mask[0].bool()
manner = manner[mask]

best_manner = manner.argmax(dim=-1)
non_silence = (best_manner != 0).nonzero(as_tuple=True)[0]

if len(non_silence) == 0:
    print("No non-silence frames found.")
else:
    start = int(non_silence[0])
    end = int(non_silence[-1]) + 1

    print(f"Non-silence frame range: {start}-{end - 1}")
    print()

    for frame_idx in range(start, end):
        probs = manner[frame_idx]
        best = int(probs.argmax())
        print(f"{frame_idx:03d}  {manner_labels[best]:10s}  {float(probs[best]):.3f}")

Example output for jacket_f.wav:

Non-silence frame range: 50-82

050  affricate   0.595
051  affricate   0.869
052  affricate   0.953
053  affricate   0.957
054  affricate   0.966
055  affricate   0.975
056  affricate   0.555
057  vowel       0.948
058  vowel       0.948
059  vowel       0.940
060  vowel       0.944
061  vowel       0.949
062  vowel       0.949
063  vowel       0.780
064  stop        0.933
065  stop        0.924
066  stop        0.935
067  stop        0.935
068  stop        0.839
069  vowel       0.954
070  vowel       0.953
071  vowel       0.949
072  vowel       0.939
073  vowel       0.914
074  vowel       0.841
075  vowel       0.897
076  stop        0.587
077  stop        0.793
078  stop        0.662
079  stop        0.657
080  stop        0.761
081  stop        0.673
082  stop        0.531
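
The per-frame listing can be collapsed into segments like the ones in the CLI pretty view. The sketch below is not the CLI's implementation. It assumes a 20 ms spacing between frames and a 10 ms offset for frame 0, values chosen only because they reproduce the times shown in the "jacket" example, and it averages the winning probability over each run of identical labels. The names HOP_MS, OFFSET_MS, and frames_to_segments are hypothetical.

```python
from itertools import groupby

# Assumptions inferred from the example outputs in this card, not documented values.
HOP_MS = 20     # assumed spacing between consecutive frames
OFFSET_MS = 10  # assumed time offset of frame 0


def frames_to_segments(frames):
    """Merge consecutive frames that share a label.

    frames: iterable of (frame_idx, label, prob) with contiguous, ascending indices.
    Returns a list of (start_s, end_s, label, mean_prob) tuples.
    """
    segments = []
    for label, run in groupby(frames, key=lambda f: f[1]):
        run = list(run)
        first, last = run[0][0], run[-1][0]
        start_s = (first * HOP_MS + OFFSET_MS) / 1000
        end_s = ((last + 1) * HOP_MS + OFFSET_MS) / 1000
        mean_prob = sum(p for _, _, p in run) / len(run)
        segments.append((start_s, end_s, label, mean_prob))
    return segments


# The listing above, written as runs of (label, probabilities) starting at frame 50.
runs = [
    ("affricate", [0.595, 0.869, 0.953, 0.957, 0.966, 0.975, 0.555]),
    ("vowel", [0.948, 0.948, 0.940, 0.944, 0.949, 0.949, 0.780]),
    ("stop", [0.933, 0.924, 0.935, 0.935, 0.839]),
    ("vowel", [0.954, 0.953, 0.949, 0.939, 0.914, 0.841, 0.897]),
    ("stop", [0.587, 0.793, 0.662, 0.657, 0.761, 0.673, 0.531]),
]

frames = []
frame_idx = 50
for label, probs in runs:
    for prob in probs:
        frames.append((frame_idx, label, prob))
        frame_idx += 1

for start_s, end_s, label, mean_prob in frames_to_segments(frames):
    print(f"[{start_s:6.2f}-{end_s:6.2f}]  {label} {mean_prob:.2f}")
```

Under these assumptions the five segments line up with the times and manner scores in the CLI pretty view for jacket_f.wav. Other recordings or model versions may need a different spacing or offset.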

CLI

The repository includes best.ckpt for PhonoQ CLI compatibility:

phonoq predict jacket_f.wav \
  --model abnerh/phonoq-2.0-english \
  --outdir outputs \
  --pretty

Note

This model uses custom Transformers code and must be loaded with trust_remote_code=True.

Model size: 0.3B params. Tensor type: F32 (Safetensors).