gemma-4-26B-A4B-Heretic-Stable

gemma-4-26B-A4B-Heretic-Stable is a repackaged release built on top of huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated. This version focuses on updated weight sharding, an improved repository structure, and compatibility with the latest Transformers versions, while preserving the behavior and capabilities of the original model. The result is a 26B-parameter language model packaged for efficient deployment, stable inference, and current tooling.

This model is intended for research and learning purposes only. Any content generated by this model is used at the user’s own risk. The authors and hosting page disclaim any liability for outputs produced by this model. Users are responsible for ensuring safe, ethical, and lawful usage.

Key Highlights

Latest Transformers Compatibility: Optimized for compatibility with recent Transformers releases for smoother loading and inference.
Re-sharded Model Weights: Updated shard structure for improved download reliability, storage handling, and deployment efficiency.
Streamlined Inference Packaging: Repository structure optimized for easier integration into modern inference pipelines.
26B Parameter Architecture: Built on gemma-4-26B-A4B-it, providing strong reasoning and knowledge capacity.
Improved Deployment Stability: Designed for consistent performance across different inference environments.
MoE Architecture Preserved: Original Mixture-of-Experts structure remains unchanged, with no modifications to routing or expert layers.
High-Capability Deployment: Suitable for advanced research workloads and high-performance inference setups.

Base Model Signatures:

This model was re-sharded from the base model and updated for compatibility with the latest Transformers version. Base model: https://huggingface.co/huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated.
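As background on what re-sharding changes: sharded safetensors checkpoints on the Hugging Face Hub typically include a model.safetensors.index.json file whose "weight_map" entry maps each tensor name to the shard file that stores it. The sketch below counts tensors per shard from such an index. The tensor names, shard file names, and total size in the sample are hypothetical placeholders, not values taken from this repository.

```python
import json
from collections import Counter


def tensors_per_shard(index: dict) -> dict:
    """Count how many tensors each shard file holds, given a parsed index."""
    return dict(Counter(index["weight_map"].values()))


# Hypothetical index contents for illustration only; a real index lists
# every tensor in the checkpoint and the actual shard file names.
sample_index = json.loads("""
{
  "metadata": {"total_size": 123456},
  "weight_map": {
    "layer_0.weight": "model-00001-of-00002.safetensors",
    "layer_1.weight": "model-00001-of-00002.safetensors",
    "layer_2.weight": "model-00002-of-00002.safetensors"
  }
}
""")

shard_counts = tensors_per_shard(sample_index)
print(shard_counts)
```

To inspect a real checkpoint, the same function could be applied to the parsed contents of the downloaded index file, assuming the repository ships one.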

Quick Start with Transformers

pip install transformers==5.9.0
# or
pip install git+https://github.com/huggingface/transformers.git

from transformers import Gemma4ForConditionalGeneration, AutoProcessor
import torch

model = Gemma4ForConditionalGeneration.from_pretrained(
    "prithivMLmods/gemma-4-26B-A4B-Heretic-Stable",
    torch_dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/gemma-4-26B-A4B-Heretic-Stable"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain how transformer models work in simple terms."}
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

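# Move inputs to the model's device instead of hard-coding "cuda",
# since device_map="auto" decides where the weights are placed.
inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt"
).to(model.device)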

generated_ids = model.generate(**inputs, max_new_tokens=256)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(output_text)

Intended Use

Multimodal and Text Research: Studying large-scale transformer behavior and inference characteristics.
Red-Teaming & Evaluation: Testing robustness across diverse and challenging prompts.
High-Performance Local Deployment: Running large-scale instruction models on optimized hardware setups.
Research Prototyping: Experimentation with large Mixture-of-Experts architectures.

Limitations & Risks

Important Note: This model inherits the behavior and characteristics of its base model.

Output Variability: Responses may vary depending on prompt structure and sampling settings.
Resource Requirements: A 26B parameter model requires significant GPU memory or optimized inference strategies such as quantization or tensor parallelism.
Deployment Considerations: Performance depends heavily on hardware configuration and runtime optimization.
General Model Limitations: May still produce incorrect, incomplete, or inconsistent outputs depending on context.
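To put the resource requirement in rough numbers, the sketch below estimates the memory needed for the weights alone at a few common precisions. It assumes "26B" means exactly 26 billion parameters (the real count may differ) and ignores the KV cache, activations, and framework overhead, so actual usage will be higher.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "26B" read as exactly 26e9 parameters.
NUM_PARAMS = 26_000_000_000

# Bits per parameter for a few common storage formats.
PRECISIONS = {"bf16": 16, "int8": 8, "int4": 4}

estimates = {
    name: weight_memory_gb(NUM_PARAMS, bits)
    for name, bits in PRECISIONS.items()
}

for name, gigabytes in estimates.items():
    print(f"{name}: ~{gigabytes:.0f} GB for weights only")
```

Under these assumptions the weights take about 52 GB in bf16, 26 GB at 8 bits, and 13 GB at 4 bits, which is why quantization or splitting the model across several GPUs is usually needed on consumer hardware.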