How to use kofdai/talkie-1930-13b-mlx-mixed with MLX:
# Download the model from the Hub
pip install huggingface_hub[hf_xet]
huggingface-cli download --local-dir talkie-1930-13b-mlx-mixed kofdai/talkie-1930-13b-mlx-mixed
Talkie 13B (1930s Edition) - MLX Selective Quantization
This is an optimized, selectively quantized version of the Talkie 13B Base model, purpose-built for the TalkiePress 1930s News Generator project.
Model Description
The original 13B model consumed approximately 52GB of RAM, making it difficult to run on standard Apple Silicon Macs without severe swapping and system freezes. Naive 8-bit quantization of every layer led to "model collapse," where the model lost its ability to generate coherent English.
To solve this, the repository contains a surgically quantized (hybrid) version of the model:
- 8-bit Quantization (group_size=64): Applied to all deep intermediate Linear layers (Attention, MLP), which constitute 95% of the model's parameters and have high redundancy.
- FP16 (16-bit Float): Strictly preserved for the embed (Input Token Embedding) and lm_head (Output Language Modeling Head) layers.
By protecting the "entry" and "exit" layers of the transformer, we successfully reduced the memory footprint from 52GB to 17GB while retaining 100% of the linguistic coherence and 1930s journalistic persona.
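The figures above can be sanity-checked with a back-of-envelope estimate. The sketch below is not taken from the repository: it assumes 13e9 parameters, an FP32 baseline (inferred from the 52GB figure), the 95% quantized share quoted above, and one FP16 scale plus one FP16 bias stored per quantization group. The function name and its parameters are hypothetical, and the estimate covers weight storage only.

```python
def weight_footprint_gb(n_params, quantized_fraction, bits=8, group_size=64,
                        group_overhead_bits=32, preserved_bits=16):
    """Estimate weight storage in GB (1 GB = 1e9 bytes) for a hybrid layout.

    Assumption: every quantized group carries one FP16 scale and one FP16
    bias (32 extra bits per group); all other weights stay in 16-bit floats.
    """
    quantized = n_params * quantized_fraction
    preserved = n_params - quantized
    # Per-group metadata is amortized over the weights in the group.
    bits_per_quantized_weight = bits + group_overhead_bits / group_size
    total_bits = quantized * bits_per_quantized_weight + preserved * preserved_bits
    return total_bits / 8 / 1e9


N_PARAMS = 13e9  # nominal parameter count, an assumption for this estimate

baseline_gb = N_PARAMS * 32 / 8 / 1e9  # every weight stored as FP32
hybrid_gb = weight_footprint_gb(N_PARAMS, quantized_fraction=0.95)

print(f"baseline: {baseline_gb:.1f} GB, hybrid weights: {hybrid_gb:.1f} GB")
```

Under these assumptions the weights alone come to roughly 14.4 GB, which sits below the reported ~17GB footprint. The sketch ignores anything other than weight storage, so it should be read as a lower bound rather than a measurement.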
Usage in TalkiePress
This model is intended to be used directly with the Verantyx TalkiePress pipeline.
1-Click Deployment
To run the TalkiePress IDE and web interface directly using this optimized model:
curl -sL https://raw.githubusercontent.com/Ag3497120/TALKIEPRESS-1930/main/run_mlx_integrated.sh | bash
Manual MLX Loading
If you are writing custom Python code using MLX, you can load the model.safetensors file natively without any runtime conversion overhead:
import mlx.core as mx
import mlx.nn as nn
from talkie.model_mlx import TalkieModelMLX, GPTConfig

config = GPTConfig(vocab_size=32000)  # Base vocab size
model = TalkieModelMLX(config)

# Apply the same hybrid quantization structure BEFORE loading weights
nn.quantize(
    model,
    class_predicate=lambda p, m: isinstance(m, nn.Linear) and "embed" not in p and "lm_head" not in p,
    group_size=64,
    bits=8,
)

# Load the weights natively
model.load_weights("model.safetensors", strict=False)
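To see which layers the predicate above would select, the same name filter can be run in plain Python without MLX. This is only an illustrative sketch: the layer paths below are hypothetical and may not match the real module names in TalkieModelMLX, and the `is_linear` flag stands in for the `isinstance(m, nn.Linear)` check.

```python
def should_quantize(path: str, is_linear: bool) -> bool:
    """Mirror the class_predicate logic: quantize Linear layers only,
    skipping any path that contains "embed" or "lm_head"."""
    return is_linear and "embed" not in path and "lm_head" not in path


# Hypothetical (path, is_linear) pairs, for illustration only.
layers = [
    ("embed", True),
    ("blocks.0.attn.q_proj", True),
    ("blocks.0.mlp.fc", True),
    ("blocks.0.norm", False),
    ("lm_head", True),
]

plan = {path: should_quantize(path, is_linear) for path, is_linear in layers}

for path, quantized in plan.items():
    print(f"{path:24s} {'8-bit' if quantized else 'kept as-is'}")
```

Because the filter is a substring match, any layer whose path happens to contain "embed" or "lm_head" is also left unquantized, so it is worth checking the real paths before relying on it.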
Architecture Notes
- Base Model: Talkie 13B
- Hardware: Apple Silicon (M-Series) via MLX framework
- Memory Footprint: ~17GB (Fits comfortably on 32GB/64GB Macs)