Talkie 13B (1930s Edition) - MLX Selective Quantization

This is an optimized, selectively quantized version of the Talkie 13B Base model, purpose-built for the TalkiePress 1930s News Generator project.

Model Description

The original 13B model consumed approximately 52GB of RAM, making it difficult to run on standard Apple Silicon Macs without severe swapping and system freezes. Naive 8-bit quantization led to "model collapse," where the model lost its ability to generate coherent English.

To address this, the repository provides a selectively ("surgically") quantized hybrid version of the model:

  • 8-bit Quantization (group_size=64): Applied to all deep intermediate Linear layers (Attention, MLP), which constitute 95% of the model's parameters and have high redundancy.
  • FP16 (16-bit Float): Strictly preserved for the embed (Input Token Embedding) and lm_head (Output Language Modeling Head) layers.

By protecting the "entry" and "exit" layers of the transformer, we reduced the memory footprint from 52GB to 17GB (roughly a two-thirds reduction) while preserving the model's linguistic coherence and its 1930s journalistic persona.
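The arithmetic behind these figures can be sketched as a rough estimate of weight storage alone. This is a back-of-the-envelope calculation, not a measurement: it assumes the 52GB baseline corresponds to 32-bit weights (13e9 parameters at 4 bytes each), that exactly 95% of parameters are quantized, and that each group of 64 weights stores one FP16 scale and one FP16 bias alongside the 8-bit values. The constant names are illustrative.

```python
# Rough weight-memory estimate for the hybrid layout (assumptions marked).
TOTAL_PARAMS = 13e9        # "13B" from the model name
QUANTIZED_FRACTION = 0.95  # share of parameters in quantized Linear layers (from the text)
GROUP_SIZE = 64            # group_size used for quantization
BITS = 8                   # bits per quantized weight
SCALE_BIAS_BITS = 32       # ASSUMPTION: one FP16 scale + one FP16 bias per group


def gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


# ASSUMPTION: the 52GB baseline is the model held in 32-bit floats.
fp32_gb = gb(TOTAL_PARAMS * 4)

# Effective cost per quantized weight, including per-group scale and bias.
bits_per_quantized_weight = BITS + SCALE_BIAS_BITS / GROUP_SIZE

quantized_gb = gb(TOTAL_PARAMS * QUANTIZED_FRACTION * bits_per_quantized_weight / 8)
fp16_gb = gb(TOTAL_PARAMS * (1 - QUANTIZED_FRACTION) * 2)  # embed + lm_head share
hybrid_gb = quantized_gb + fp16_gb

print(f"FP32 baseline:  {fp32_gb:.1f} GB")
print(f"Hybrid weights: {hybrid_gb:.1f} GB")
```

Under these assumptions the weights alone come to somewhat less than the reported ~17GB. The sketch does not model runtime buffers, and the real split between quantized and FP16 parameters may differ, so treat it as a sanity check on the order of magnitude rather than an explanation of the exact figure.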

Usage in TalkiePress

This model is intended to be used directly with the Verantyx TalkiePress pipeline.

1-Click Deployment

To run the TalkiePress IDE and web interface directly using this optimized model:

curl -sL https://raw.githubusercontent.com/Ag3497120/TALKIEPRESS-1930/main/run_mlx_integrated.sh | bash

Manual MLX Loading

If you are writing custom Python code using MLX, you can load the model.safetensors file natively without any runtime conversion overhead:

import mlx.core as mx
import mlx.nn as nn
from talkie.model_mlx import TalkieModelMLX, GPTConfig

config = GPTConfig(vocab_size=32000) # Base vocab size
model = TalkieModelMLX(config)

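# Quantize only the inner Linear layers; keep embed and lm_head in FP16.
# The module structure must mirror the saved checkpoint, so apply this
# BEFORE loading weights.
def should_quantize(path, module):
    return (
        isinstance(module, nn.Linear)
        and "embed" not in path
        and "lm_head" not in path
    )

nn.quantize(model, group_size=64, bits=8, class_predicate=should_quantize)

# Load the pre-quantized weights directly from the safetensors file.
# strict=False relaxes the check that checkpoint keys exactly match the
# model's parameters, so a mismatch will not raise an error here.
model.load_weights("model.safetensors", strict=False)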
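Because strict=False will not complain about mismatched keys, it can help to check the checkpoint's layout before loading. The following is a minimal sketch using only the Python standard library. It relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header) and reads no tensor data. The helper names are hypothetical, and the expectation that embed and lm_head tensors appear as F16 while quantized layers appear as packed U32 is an assumption based on the description above.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize_dtypes(header):
    """Count tensors per dtype string, e.g. 'F16' or 'U32'."""
    return Counter(info["dtype"] for info in header.values())


def tensors_matching(header, substring):
    """Map tensor name -> dtype for names containing the given substring."""
    return {name: info["dtype"] for name, info in header.items() if substring in name}


# Example usage (assumes model.safetensors is in the working directory):
# header = read_safetensors_header("model.safetensors")
# print(summarize_dtypes(header))
# print(tensors_matching(header, "lm_head"))
```

If the protected layers show up with a packed integer dtype instead of F16, the checkpoint does not match the hybrid structure described here, and the predicate passed to nn.quantize would need to be adjusted accordingly.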

Architecture Notes

  • Base Model: Talkie 13B
  • Hardware: Apple Silicon (M-Series) via MLX framework
  • Memory Footprint: ~17GB (Fits comfortably on 32GB/64GB Macs)