Step-3.7-Flash-uncensored-abliterated-heretic-BF16
NOTE: I have tested this and, although its capabilities are intact, it still seems to respond with refusals. At least, that is what happens with the IQ4_XS GGUF quantization of it.
This is a decensored BF16 full-weight version of stepfun-ai/Step-3.7-Flash, made with a gradient-based refusal-direction abliteration method inspired by Heretic and by norm-preserving ablation work such as Magnitude/Norm-Preserving Biprojected Abliteration.
It was produced with a local gradient abliteration pass against the language model's refusal direction. The uploaded repository intentionally keeps the full HF/Transformers BF16 layout so it can be used later as a clean source for GGUF, AutoRound, AWQ, EXL3, NVFP4, GPTQ, FP8, or other quantization workflows.
Summary
| Item | Value |
|---|---|
| Base model | stepfun-ai/Step-3.7-Flash |
| Release type | Full BF16 safetensors |
| Model class | Step3p7ForConditionalGeneration |
| Text model class | Step3p5ForCausalLM |
| Text layers | 45 |
| Hidden size | 4096 |
| Attention heads | 64 |
| Head dim | 128 |
| Max positions | 262144 |
| Vocab size | 128896 |
| MoE layers | 3–44 |
| Experts | 288 |
| Top-k experts | 8 |
| MoE intermediate size | 1280 |
| Dense FFN intermediate size | 11264 |
| Patch target | model.layers.*.self_attn.o_proj.weight |
| Patched text layers | 0–44 |
| Abliteration strength | lambda = 0.1 |
| Stored tensor dtype | BF16 |
| Indexed parameter payload | 402,730,656,512 bytes |
What changed?
The modification targets self_attn.o_proj weights in all 45 text layers. A refusal-associated direction was extracted by gradient backpropagation through the BF16 model, then projected out of the attention output projection weights with a small norm-preserving update.
In plain terms, the goal was to reduce excessive refusals, moralizing, policy-style deflections, and over-filtered responses while keeping the model close to the original Step-3.7-Flash behavior.
No tokenizer vocabulary, embedding table, architecture, vision encoder, or MLP/expert tensor was intentionally changed by the abliteration pass.
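The kind of update described in this section can be sketched in a few lines of NumPy. This is an illustration only, not the repository's apply_abliteration_inplace.py: the function name `ablate_direction`, the toy shapes, and the exact order of operations are assumptions. Only the target (o_proj weights), the small strength (lambda = 0.1), and the per-row norm preservation are taken from this card.

```python
import numpy as np

def ablate_direction(weight, direction, lam=0.1, eps=1e-12):
    """Remove a fraction `lam` of `direction` from the output space of `weight`.

    weight:    (out_features, in_features), e.g. an o_proj matrix
    direction: (out_features,), a refusal-associated direction (assumed given)
    """
    r = direction / (np.linalg.norm(direction) + eps)         # unit direction
    old_norms = np.linalg.norm(weight, axis=1, keepdims=True)  # per-row L2 norms
    # Subtract lam times the component of each weight column that lies along r.
    projected = weight - lam * np.outer(r, r @ weight)
    new_norms = np.linalg.norm(projected, axis=1, keepdims=True)
    # Rescale every row back to its original L2 norm.
    return projected * (old_norms / (new_norms + eps))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))   # toy stand-in for one o_proj weight
r = rng.standard_normal(16)         # toy stand-in for a refusal direction
W_new = ablate_direction(W, r, lam=0.1)
print(np.allclose(np.linalg.norm(W_new, axis=1), np.linalg.norm(W, axis=1)))
```

With lam = 0 the function returns the input unchanged (up to floating-point error), which is a convenient sanity check before applying any patch to real shards.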
Abliteration parameters
| Parameter | Value |
|---|---|
| Method | gradient-based orthogonal / norm-preserving abliteration |
| Direction source | refusal/harm-trigger gradient prompt |
| Target module | self_attn.o_proj |
| Target tensor glob | model.layers.*.self_attn.o_proj.weight |
| Modified layers | 0–44 |
| Lambda | 0.1 |
| Weight norm handling | per-row norm preservation after projection |
| Gradient tensor count | 45 |
| Per-layer gradient tensor shape | (1, 8, 4096) |
| Direction extraction score | -11.9375 |
| Refusal token ids used | [43, 371, 679, 1664, 9332, 34614, 100477] |
| Gradient norm range | 0.1069–31.875 |
| Mean gradient norm | 3.2397 |
Reproduction/support artifacts are included under heretic_artifacts/:
- refusal_direction_gradients.pkl: saved gradient/refusal directions used for the BF16 patch
- apply_abliteration_inplace.py: patch application script used for shard-wise in-place BF16 modification
- extract_gradients.py: gradient extraction script
- memory_guard_v2.py / run_heavy.sh: memory safety helpers used during local processing
These are included so the method can be inspected or repeated if needed. They are not required for normal inference or quantization.
Recoverability / requantization checklist
This repository should contain what is needed to rebuild downstream formats:
Required for quantization
- ✅ config.json
- ✅ model.safetensors.index.json
- ✅ all indexed BF16 text shards: model-00001.safetensors … model-00024.safetensors
- ✅ indexed VIT shards: model-vit-00001.safetensors, model-vit-00002.safetensors
- ✅ tokenizer files: tokenizer.json, tokenizer_config.json, special_tokens_map.json
- ✅ chat template: chat_template.jinja
- ✅ custom code: configuration_step3p7.py, modeling_step3p7.py, processing_step3.py, vision_encoder.py
- ✅ method/reproduction artifacts in heretic_artifacts/
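Before starting a long quantization job, it can help to confirm that a local copy of the repository actually contains the non-shard files from the checklist above. The following is a minimal sketch: the helper name `missing_files` and the use of the current directory as the snapshot path are assumptions for illustration, while the file names come from the checklist.

```python
from pathlib import Path

# File names taken from the checklist; weight shards are checked separately.
REQUIRED_FILES = [
    "config.json",
    "model.safetensors.index.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "chat_template.jinja",
    "configuration_step3p7.py",
    "modeling_step3p7.py",
    "processing_step3.py",
    "vision_encoder.py",
]

def missing_files(snapshot_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in required if not (root / name).is_file()]

# "." is a placeholder; point this at your local download of the repo.
print(missing_files("."))
```

An empty list means every listed file is present; anything else names what still needs to be downloaded.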
Expected downstream uses
This BF16 repo can be used as source for:
- GGUF conversion / llama.cpp quantization
- AutoRound
- AWQ
- EXL3 / exllamav3-style workflows
- NVFP4 / FP4 experiments
- GPTQ / FP8 / other post-training quantization methods
- additional LoRA or delta extraction experiments
For most quantizers, use this repo exactly as the HF model path and enable remote code if needed:
MODEL=ibrahimkettaneh/Step-3.7-Flash-uncensored-abliterated-heretic-BF16
Example Transformers load
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
repo = "ibrahimkettaneh/Step-3.7-Flash-uncensored-abliterated-heretic-BF16"
# trust_remote_code=True is needed because the repo ships its own modeling code.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
messages = [{"role": "user", "content": "Explain gradient abliteration in one paragraph."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
# Sampling must be enabled for temperature/top_p to have any effect.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=False))
Step-3.7-Flash is very large. BF16 loading requires substantial memory. For local inference, a quantized GGUF/EXL/AWQ/etc. build is recommended.
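To put "substantial memory" into numbers, the indexed payload from the summary table can be converted into an approximate value count and a size in GiB. This is a rough sketch that assumes the whole payload is stored as BF16 (2 bytes per value), as the summary table states; it covers weights only and ignores activations, KV cache, and framework overhead.

```python
PAYLOAD_BYTES = 402_730_656_512   # "Indexed parameter payload" from the summary table
BYTES_PER_BF16 = 2                # bfloat16 stores each value in 16 bits

approx_values = PAYLOAD_BYTES // BYTES_PER_BF16   # approximate number of stored values
payload_gib = PAYLOAD_BYTES / 2**30               # bytes -> GiB

print(f"~{approx_values / 1e9:.1f}B BF16 values, ~{payload_gib:.0f} GiB of weights")
```

That works out to roughly 201 billion stored values and about 375 GiB for the weights alone, which is why a quantized build is the practical choice on most local hardware.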
GGUF conversion note
Use the StepFun/llama.cpp converter that supports Step-3.7. Example shape:
python convert_hf_to_gguf.py \
ibrahimkettaneh/Step-3.7-Flash-uncensored-abliterated-heretic-BF16 \
--outtype bf16 \
--outfile step37-heretic-bf16.gguf
llama-quantize step37-heretic-bf16.gguf step37-heretic.IQ4_XS.gguf IQ4_XS
If using multi-GPU llama.cpp inference in the original local environment, GGML_CUDA_NO_PEER_COPY=ON was required for coherent output.
Indexed shard inventory
The active model.safetensors.index.json references 26 safetensor files:
| File | Size (bytes) |
|---|---|
| model-00001.safetensors | 924,094,096 |
| model-00002.safetensors | 9,808,156,008 |
| model-00003.safetensors | 18,557,475,928 |
| model-00004.safetensors | 18,624,846,944 |
| model-00005.safetensors | 18,557,475,928 |
| model-00006.safetensors | 18,624,846,976 |
| model-00007.safetensors | 18,557,475,968 |
| model-00008.safetensors | 18,624,846,976 |
| model-00009.safetensors | 18,557,475,968 |
| model-00010.safetensors | 18,624,846,976 |
| model-00011.safetensors | 18,557,475,968 |
| model-00012.safetensors | 18,624,846,976 |
| model-00013.safetensors | 18,557,475,968 |
| model-00014.safetensors | 18,624,846,976 |
| model-00015.safetensors | 18,557,475,968 |
| model-00016.safetensors | 18,624,846,976 |
| model-00017.safetensors | 18,557,475,968 |
| model-00018.safetensors | 18,624,846,976 |
| model-00019.safetensors | 18,557,475,968 |
| model-00020.safetensors | 18,624,846,976 |
| model-00021.safetensors | 18,557,475,968 |
| model-00022.safetensors | 18,624,846,976 |
| model-00023.safetensors | 9,245,052,456 |
| model-00024.safetensors | 6,968,188,464 |
| model-vit-00001.safetensors | 1,613,990,904 |
| model-vit-00002.safetensors | 2,348,122,376 |
model-00025.safetensors and model-00026.safetensors are not referenced by the active index used here and are not required by this uploaded model layout.
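A local download can be checked against the index in the same spirit. The sketch below relies on the standard sharded-checkpoint index layout, where `weight_map` maps each tensor name to the shard file that holds it; the function name `shard_report` and the report keys are hypothetical.

```python
import json
from pathlib import Path

def shard_report(model_dir):
    """Compare shards named in the index with the .safetensors files on disk."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    referenced = set(index["weight_map"].values())          # shards the index needs
    on_disk = {p.name for p in root.glob("*.safetensors")}  # shards actually present
    return {
        "referenced": sorted(referenced),
        "missing": sorted(referenced - on_disk),        # needed but not downloaded
        "unreferenced": sorted(on_disk - referenced),   # present but unused by the index
        "referenced_bytes": sum((root / n).stat().st_size for n in referenced & on_disk),
    }
```

Calling `shard_report` on a complete local copy should report no missing shards; stray files such as model-00025.safetensors would show up under "unreferenced".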
Performance / benchmark status
Formal KL/refusal/MMLU tables have not yet been run for this Step-3.7-Flash release. To avoid inventing numbers, the benchmark fields are listed as pending.
| Metric | This model | Original model (Step-3.7-Flash) |
|---|---|---|
| KL divergence | pending | 0 (by definition) |
| Refusals | pending | pending |
| MMLU | pending | pending |
Lower refusals indicate fewer content restrictions, rejections, objections, pushbacks, lecturing, censorship, softening, and deflections. Lower KL divergence indicates closer behavior to the original model baseline.
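For readers who want to fill in the KL row themselves, the metric can be computed from next-token logits of the two models on the same prompts. The following is a minimal NumPy sketch, not the evaluation procedure used by Heretic or by this card: the function names, the (positions, vocab) input layout, and the direction KL(base || patched) are all assumptions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, patched_logits):
    """Mean KL(base || patched) over positions; inputs are (positions, vocab) logits."""
    log_p = log_softmax(base_logits)     # base model log-probabilities
    log_q = log_softmax(patched_logits)  # patched model log-probabilities
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())

rng = np.random.default_rng(0)
base = rng.standard_normal((4, 10))      # toy logits: 4 positions, vocab of 10
print(mean_kl(base, base))               # identical models give zero divergence
```

Comparing the base model with itself gives zero, which matches the "0 (by definition)" entry in the table above.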
MMLU test results
MMLU has not yet been run for this release. Once measured, this section should include original-vs-heretic totals, accuracy, parse failures, and per-subject scores, following the same format used by comparable Heretic model cards.
Expected behavior
Compared with the base model, this version should generally exhibit:
- fewer refusals on benign requests that the base model over-filters
- less moralizing, policy language, and safety boilerplate
- more direct task completion
- similar architecture and tokenizer compatibility to the original
No formal refusal/KL/MMLU table is claimed yet for this release. Please run your own evaluations before deployment.
Limitations
- This is abliteration, not supervised fine-tuning or RLHF.
- It may reduce refusals but does not guarantee any specific behavior.
- It can affect calibration, safety behavior, and edge-case instruction following.
- Multimodal behavior has not been separately benchmarked after the text-path patch.
- Users should validate downstream quantizations independently.
Safety and responsibility
This model is provided for research and experimentation with refusal-reduction / alignment-ablation methods. You are responsible for complying with applicable laws, platform rules, and the base model's license/terms.
Related resources
Abliteration / refusal-direction removal references:
- Orthogonal Reflection Bounded Ablation
- Norm-Preserving Biprojected Abliteration
- Projected Abliteration
- Exploring SLERP Abliteration
- Abliteration: uncensor any LLM without retraining
- Heretic GitHub repository / method development
- Heretic PR #196
- Heretic PR #211
- Heretic PR #326
- Heretic PR #332
- Heretic issue #221
- Heretic issue #236
- Heretic issue #288
- Heretic issue #339
- UnstableLlama/heretic PR #35
Attribution
- Base model: stepfun-ai/Step-3.7-Flash
- Method inspiration: Heretic-style refusal direction ablation and norm-preserving projection methods
- Modified/uploaded by: ibrahimkettaneh