Step-3.7-Flash-uncensored-abliterated-heretic-BF16

NOTE: I have tested this, and although its capabilities are intact, it still seems to respond with refusals. At least, that is what happens with the IQ4_XS GGUF quantization of it.

This is a decensored, full-weight BF16 version of stepfun-ai/Step-3.7-Flash. It was made with a gradient-based refusal-direction abliteration method inspired by Heretic and by norm-preserving ablation work such as Magnitude/Norm-Preserving Biprojected Abliteration.

It was produced with a local gradient abliteration pass against the language model's refusal direction. The uploaded repository intentionally keeps the full HF/Transformers BF16 layout so it can be used later as a clean source for GGUF, AutoRound, AWQ, EXL3, NVFP4, GPTQ, FP8, or other quantization workflows.

Summary

Item	Value
Base model	`stepfun-ai/Step-3.7-Flash`
Release type	Full BF16 safetensors
Model class	`Step3p7ForConditionalGeneration`
Text model class	`Step3p5ForCausalLM`
Text layers	45
Hidden size	4096
Attention heads	64
Head dim	128
Max positions	262144
Vocab size	128896
MoE layers	3–44
Experts	288
Top-k experts	8
MoE intermediate size	1280
Dense FFN intermediate size	11264
Patch target	`model.layers.*.self_attn.o_proj.weight`
Patched text layers	0–44
Abliteration strength	`lambda = 0.1`
Stored tensor dtype	BF16
Indexed parameter payload	402,730,656,512 bytes

What changed?

The modification targets self_attn.o_proj weights in all 45 text layers. A refusal-associated direction was extracted by gradient backpropagation through the BF16 model, then projected out of the attention output projection weights with a small norm-preserving update.
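
The update described above can be sketched in NumPy. This is only an illustration, not the contents of apply_abliteration_inplace.py: the function name `abliterate_o_proj`, the (out_features, in_features) weight layout, the update form W' = W - lambda * r r^T W, and the row-rescaling step are all assumptions chosen to match the description "projected out ... with a small norm-preserving update".

```python
import numpy as np

def abliterate_o_proj(weight, direction, lam=0.1):
    """Remove a fraction `lam` of `direction` from the output space of `weight`.

    weight:    (out_features, in_features) matrix, e.g. one o_proj weight.
    direction: (out_features,) refusal-associated vector with nonzero norm.
    Hypothetical helper; the real patch script may differ.
    """
    w = np.asarray(weight, dtype=np.float64)   # do the math in high precision
    r = np.asarray(direction, dtype=np.float64)
    r = r / np.linalg.norm(r)                  # unit refusal direction

    # Remember each row's norm so it can be restored afterwards.
    row_norms = np.linalg.norm(w, axis=1, keepdims=True)

    # Subtract lam * r r^T W: shrinks what the layer writes along r.
    patched = w - lam * np.outer(r, r @ w)

    # Assumed form of "per-row norm preservation": rescale every row
    # back to the norm it had before the projection.
    new_norms = np.linalg.norm(patched, axis=1, keepdims=True)
    patched *= row_norms / np.maximum(new_norms, 1e-12)
    return patched
```

With `lam=0.1` only a tenth of the component along the direction is removed per layer, which is consistent with the stated aim of staying close to the base model.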

In plain terms, the goal was to reduce excessive refusals, moralizing, policy-style deflections, and over-filtered responses while keeping the model close to the original Step-3.7-Flash behavior.

No tokenizer vocabulary, embedding table, architecture, vision encoder, or MLP/expert tensor was intentionally changed by the abliteration pass.

Abliteration parameters

Parameter	Value
Method	gradient-based orthogonal / norm-preserving abliteration
Direction source	refusal/harm-trigger gradient prompt
Target module	`self_attn.o_proj`
Target tensor glob	`model.layers.*.self_attn.o_proj.weight`
Modified layers	0–44
Lambda	`0.1`
Weight norm handling	per-row norm preservation after projection
Gradient tensor count	45
Per-layer gradient tensor shape	`(1, 8, 4096)`
Direction extraction score	`-11.9375`
Refusal token ids used	`[43, 371, 679, 1664, 9332, 34614, 100477]`
Gradient norm range	`0.1069`–`31.875`
Mean gradient norm	`3.2397`

Reproduction/support artifacts are included under heretic_artifacts/:

refusal_direction_gradients.pkl — saved gradient/refusal directions used for the BF16 patch
apply_abliteration_inplace.py — patch application script used for shard-wise in-place BF16 modification
extract_gradients.py — gradient extraction script
memory_guard_v2.py / run_heavy.sh — memory safety helpers used during local processing

These are included so the method can be inspected or repeated if needed. They are not required for normal inference or quantization.

Recoverability / requantization checklist

This repository should contain what is needed to rebuild downstream formats:

Required for quantization

✅ config.json
✅ model.safetensors.index.json
✅ all indexed BF16 text shards: model-00001.safetensors … model-00024.safetensors
✅ indexed VIT shards: model-vit-00001.safetensors, model-vit-00002.safetensors
✅ tokenizer files: tokenizer.json, tokenizer_config.json, special_tokens_map.json
✅ chat template: chat_template.jinja
✅ custom code: configuration_step3p7.py, modeling_step3p7.py, processing_step3.py, vision_encoder.py
✅ method/reproduction artifacts in heretic_artifacts/ (included for inspection; not needed for inference or quantization)
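
Before starting a long quantization job, a local download can be checked against this list. The sketch below assumes the repository has been downloaded into a single flat directory; the helper name `missing_files` is hypothetical, and the file names are copied from the checklist above.

```python
from pathlib import Path

# File names taken from the checklist above (heretic_artifacts/ is optional).
REQUIRED = (
    [
        "config.json",
        "model.safetensors.index.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "chat_template.jinja",
        "configuration_step3p7.py",
        "modeling_step3p7.py",
        "processing_step3.py",
        "vision_encoder.py",
    ]
    + [f"model-{i:05d}.safetensors" for i in range(1, 25)]      # text shards 1..24
    + [f"model-vit-{i:05d}.safetensors" for i in range(1, 3)]   # VIT shards 1..2
)

def missing_files(snapshot_dir):
    """Return the required file names that are absent from `snapshot_dir`."""
    root = Path(snapshot_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

An empty list means every file named in the checklist is present; it does not verify file sizes or checksums.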

Expected downstream uses

This BF16 repo can be used as source for:

GGUF conversion / llama.cpp quantization
AutoRound
AWQ
EXL3 / exllamav3-style workflows
NVFP4 / FP4 experiments
GPTQ / FP8 / other post-training quantization methods
additional LoRA or delta extraction experiments

For most quantizers, pass this repo ID directly as the HF model path and enable remote code if needed:

MODEL=ibrahimkettaneh/Step-3.7-Flash-uncensored-abliterated-heretic-BF16

Example Transformers load

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = "ibrahimkettaneh/Step-3.7-Flash-uncensored-abliterated-heretic-BF16"

tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain gradient abliteration in one paragraph."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
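# do_sample=True is needed for temperature/top_p to take effect; without it generation is greedy.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)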
print(tok.decode(out[0], skip_special_tokens=False))

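Step-3.7-Flash is very large: the indexed BF16 payload is 402,730,656,512 bytes (about 403 GB), so loading in BF16 needs at least that much combined memory for the weights alone. For local inference, a quantized GGUF/EXL/AWQ/etc. build is recommended.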
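
A rough way to size the weights for a given format is to scale the indexed payload by bits per weight. The sketch below assumes every stored tensor is BF16 (the repository also reports some F32 tensors, so this is approximate) and uses 4.25 bits per weight as the nominal figure for IQ4_XS; real GGUF files mix tensor types, and KV cache and activations need extra memory. The helper name `weights_gib` is hypothetical.

```python
PAYLOAD_BYTES = 402_730_656_512   # "Indexed parameter payload" from the summary table
GIB = 1024 ** 3

def weights_gib(payload_bytes, bits_per_weight, source_bits=16):
    """Estimate weight storage in GiB after requantizing to `bits_per_weight`."""
    # Assumption: every tensor in the payload is stored at `source_bits` bits.
    n_params = payload_bytes * 8 / source_bits
    return n_params * bits_per_weight / 8 / GIB

bf16_gib = weights_gib(PAYLOAD_BYTES, 16)
iq4_gib = weights_gib(PAYLOAD_BYTES, 4.25)   # nominal IQ4_XS bits per weight
print(f"BF16 weights:      {bf16_gib:.1f} GiB")   # 375.1 GiB
print(f"~4.25 bpw weights: {iq4_gib:.1f} GiB")    # 99.6 GiB
```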

GGUF conversion note

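Use the StepFun/llama.cpp converter that supports Step-3.7. The commands have roughly this shape (if the converter expects a local directory, pass the path of a downloaded snapshot instead of the repo ID):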

python convert_hf_to_gguf.py \
  ibrahimkettaneh/Step-3.7-Flash-uncensored-abliterated-heretic-BF16 \
  --outtype bf16 \
  --outfile step37-heretic-bf16.gguf

llama-quantize step37-heretic-bf16.gguf step37-heretic.IQ4_XS.gguf IQ4_XS

For multi-GPU llama.cpp inference in the original local environment, GGML_CUDA_NO_PEER_COPY=ON was required to get coherent output.

Indexed shard inventory

The active model.safetensors.index.json references 26 safetensors files:

File	Size (bytes)
`model-00001.safetensors`	924,094,096
`model-00002.safetensors`	9,808,156,008
`model-00003.safetensors`	18,557,475,928
`model-00004.safetensors`	18,624,846,944
`model-00005.safetensors`	18,557,475,928
`model-00006.safetensors`	18,624,846,976
`model-00007.safetensors`	18,557,475,968
`model-00008.safetensors`	18,624,846,976
`model-00009.safetensors`	18,557,475,968
`model-00010.safetensors`	18,624,846,976
`model-00011.safetensors`	18,557,475,968
`model-00012.safetensors`	18,624,846,976
`model-00013.safetensors`	18,557,475,968
`model-00014.safetensors`	18,624,846,976
`model-00015.safetensors`	18,557,475,968
`model-00016.safetensors`	18,624,846,976
`model-00017.safetensors`	18,557,475,968
`model-00018.safetensors`	18,624,846,976
`model-00019.safetensors`	18,557,475,968
`model-00020.safetensors`	18,624,846,976
`model-00021.safetensors`	18,557,475,968
`model-00022.safetensors`	18,624,846,976
`model-00023.safetensors`	9,245,052,456
`model-00024.safetensors`	6,968,188,464
`model-vit-00001.safetensors`	1,613,990,904
`model-vit-00002.safetensors`	2,348,122,376

model-00025.safetensors and model-00026.safetensors are not referenced by the active index used here and are not required by this uploaded model layout.
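
Which shards are actually needed can be read from the index itself. The sketch below assumes the standard sharded-checkpoint layout, in which model.safetensors.index.json holds a `weight_map` from tensor name to shard file name; the helper name `shard_report` is hypothetical.

```python
def shard_report(index, files_on_disk):
    """Compare an index's referenced shards with the files actually present.

    index:         parsed model.safetensors.index.json (a dict).
    files_on_disk: iterable of file names in the model directory.
    Returns (referenced, missing, extra) as sorted lists of file names.
    """
    referenced = sorted(set(index["weight_map"].values()))
    on_disk = set(files_on_disk)
    # Referenced by the index but not present locally.
    missing = [f for f in referenced if f not in on_disk]
    # Safetensors files present locally that the index never mentions.
    extra = sorted(
        f for f in on_disk
        if f.endswith(".safetensors") and f not in referenced
    )
    return referenced, missing, extra
```

Load the index with `json.load` and pass the directory listing as `files_on_disk`. For this repository, `referenced` would be expected to contain the 26 files in the table, and files such as model-00025.safetensors would show up under `extra` if they are present.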

Performance / benchmark status

Formal KL-divergence, refusal, and MMLU evaluations have not yet been run for this Step-3.7-Flash release. To avoid inventing numbers, the benchmark fields below are listed as pending.

Metric	This model	Original model (Step-3.7-Flash)
KL divergence	pending	0 (by definition)
Refusals	pending	pending
MMLU	pending	pending

A lower refusal count indicates fewer content restrictions, rejections, objections, pushbacks, lecturing, censorship, softening, and deflections. A lower KL divergence indicates behavior closer to the original model baseline.
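
For readers who want to fill in the KL field themselves, a minimal NumPy sketch follows. It assumes next-token logits from the original and patched models have already been collected for the same prompts, one row per prompt, and it computes KL(original || patched); the direction of the divergence, the prompt set, and the helper names are assumptions, not a description of how any published number was produced.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl(orig_logits, patched_logits):
    """Mean KL(P_orig || P_patched) over rows of next-token logits."""
    lp = log_softmax(np.asarray(orig_logits, dtype=np.float64))
    lq = log_softmax(np.asarray(patched_logits, dtype=np.float64))
    # KL = sum_i p_i * (log p_i - log q_i), computed per row.
    kl = (np.exp(lp) * (lp - lq)).sum(axis=-1)
    return float(kl.mean())
```

Identical logits give 0, which matches the "0 (by definition)" entry for the original model in the table.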

MMLU test results

MMLU has not yet been run for this release. Once measured, this section should include original-vs-heretic totals, accuracy, parse failures, and per-subject scores, following the same format used by comparable Heretic model cards.

Expected behavior

Compared with the base model, this version should generally exhibit:

fewer refusals on benign requests that the base model over-filters
less moralizing, policy language, and safety boilerplate
more direct task completion
similar architecture and tokenizer compatibility to the original

No formal refusal/KL/MMLU table is claimed yet for this release. Please run your own evaluations before deployment.
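
As a starting point for such an evaluation, refusals can be approximated by counting responses that contain refusal phrasing. The marker list and the helper name `refusal_rate` below are hypothetical placeholders; substring matching is a crude heuristic and will miss soft deflections, so manual review is still advisable.

```python
# Hypothetical marker phrases; extend or replace for your own evaluation.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")

def refusal_rate(responses, markers=REFUSAL_MARKERS):
    """Fraction of responses containing at least one refusal marker."""
    if not responses:
        return 0.0
    hits = 0
    for text in responses:
        # Normalize case and curly apostrophes before matching.
        normalized = text.lower().replace("\u2019", "'")
        if any(marker in normalized for marker in markers):
            hits += 1
    return hits / len(responses)
```

Run the same prompt set through the base model and this model and compare the two rates, rather than reading either number in isolation.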

Limitations

This is abliteration, not supervised fine-tuning or RLHF.
It may reduce refusals but does not guarantee any specific behavior.
It can affect calibration, safety behavior, and edge-case instruction following.
Multimodal behavior has not been separately benchmarked after the text-path patch.
Users should validate downstream quantizations independently.

Safety and responsibility

This model is provided for research and experimentation with refusal-reduction / alignment-ablation methods. You are responsible for complying with applicable laws, platform rules, and the base model's license/terms.

Attribution

Base model: stepfun-ai/Step-3.7-Flash
Method inspiration: Heretic-style refusal direction ablation and norm-preserving projection methods
Modified/uploaded by: ibrahimkettaneh
