Instructions to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF",
	filename="RWKV-24B-A2B-wakaba-2601-GGUF-IQ3_XXS.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

Use Docker

docker model run hf.co/OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with Ollama:
```
ollama run hf.co/OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M
```

Unsloth Studio

How to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF to start chatting

How to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with Docker Model Runner:
```
docker model run hf.co/OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M
```

Lemonade

How to use OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.RWKV-24B-A2B-wakaba-2601-GGUF-Q4_K_M

List all available models

lemonade list

OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF

※このモデルの実行には、カスタムllama.cppが必要です。使用方法については、下記をお読みください。

Model Overview

OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF は、
高価なGPUを必要とせず、より多くの人がローカル環境で実用的なLLMを扱えることを目的として設計された
RWKVハイブリッド + Aggressive Sparse MoE モデルです。

近年、AIインフラ投資の急激な拡大により、GPUおよびメモリ価格は著しく高騰しています。
2023年には 24GB VRAM クラスのGPUに約13万円でアクセス可能でしたが、
現在では同等の環境を用意することは容易ではありません。

本モデルは、こうした状況を背景に、

「限られたメモリ・計算資源でも、実用レベルのLLM体験を提供する」

ことを設計目標としています。

Motivation

小規模モデル（2B / 4B クラス）は動作自体は可能ですが、
推論品質・指示追従性・会話安定性の面では、依然として制約があります。

特に 32GB メモリの iGPU PC 環境では、

OS が約 16GB を占有
実質利用可能メモリは 16GB 前後
iGPU のため TFLOPS にも強い制約（平均的には2TFLOPS前後)

という現実的な制限があります。

本モデルは、こうした 厳しいローカル環境を前提条件として受け入れた上で、

24B パラメータ規模
Aggressive Sparse MoE（Active Experts = 6）
実効アクティブパラメータ ≒ 2B（A2B）

という設計により、
「小さく見えるが、実用的」なモデルを目指しています。

Architecture & Conversion Pipeline

本モデルは Qwen3-30B-A3B-Instruct-2507 をベースに、
以下の変換・最適化プロセスを経て構築されました。

Conversion Steps

RWKV hxa07D + NoPE Attention ハイブリッド化
- Attention 層を最小限に抑え、線形時間 RWKV を主軸とする構成へ変換
- アーキ解説はこちらです。(https://zenn.dev/openmose/articles/610b67f295c9ce)
Cerebras REAPによる MoE Pruning
- キャリブレーションデータによって、MoE Expert を 約25%削減し、冗長性を圧縮
Active Experts の削減（8 → 6）
- 推論時のアクティブパラメータをさらに低減
教師モデルとの KL-Divergence による性能復元
- 削減による性能劣化を最小限に抑制

Model Characteristics

非線形容量は意図的に削減されているため、
純粋な知識タスクや長文暗記型ベンチマークは得意ではありません
一方で、
- RAG（Retrieval-Augmented Generation）
- 外部知識ベースとの併用
- ツール・エージェント統合などと組み合わせることで、実運用に耐える構成を目指しています

Model Specifications

Item	Value
Total Parameters	24B
Active Parameters	~2B (A2B)
Total Layers	48
RWKV Layers	40
NoPE Attention Layers	8

セットアップ方法

レポジトリをクローンします。:

git clone https://github.com/OpenMOSE/llama.cpp
cd llama.cpp
git checkout hxa07d

プロジェクトをビルドします。(Linuxの例)

For CUDA (NVIDIA GPUs):

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

For ROCm (AMD GPUs):

GPUアーキテクチャに応じて、変更してください:

AMD Radeon RX 79xx series → gfx1100
AMD Instinct MI300 series → gfx942
AMD Instinct MI100 → gfx908

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16

Note: Replace gfx1100 with your GPU's architecture code

モデル動作コマンド

通常の推論:

./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1

KV量子化を有効にした推論:

./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1 -ctv q8_0 -ctk q8_0

VRAM削減モード１:

./build/bin/llama-cli -m YOUR_MODEL_PATH --jinja -fa 1 -ctv q8_0 -ctk q8_0 --override-tensor "time_mix_g1=CPU,time_mix_g2=CPU,time_mix_w1=CPU,time_mix_w2=CPU"

VRAM削減モード２:

./build/bin/llama-cli -m YOUR_MODE_PATH --jinja -fa 1 -c 4096 --n-cpu-moe 48

./build/bin/llama-server -m YOUR_MODE_PATH --jinja -fa 1 --port 4096 -np 1 -c 65536 --top-k 20 --top-p 0.3 --temp 0.6 --repeat-penalty 1.1 --n-cpu-moe 48

Important: To get better output quality, please test --top-k 20 --top-p 0.3 --temp 0.6 --repeat-penalty 1.1

Intended Use

ローカル LLM 実行（iGPU / 低〜中級 GPU 環境）
RAG + Chat / Agent ベースのアプリケーション
低メモリ環境での研究・検証用途
RWKV / Hybrid Architecture の実験的検証

Limitations

大規模知識暗記タスクには不向き
教師モデル（30Bクラス）と同等性能を保証するものではありません
依然として「育成途中」のモデルです

Acknowledgements

本モデルは、Recursal AI, Featherless AI による
計算資源および技術的支援によって実現しました。
ここに深く感謝の意を表します。

Datasets

kldivを行う上で、データセットとしては、DCLM-10Bから長文ソートで、10%、Instructデータとして、Qwen3-235B-Instructの合成データを作成し使用しました。データセットBiasによる弊害をさけるため、SFTは行っていません。

Closing Notes

モデル名の通り、本モデルはまだ 「若葉（wakaba）」 の段階です。
しかし、一つひとつ改善と検証を重ねながら、
「誰もが手の届く高性能AI」 を目指して育てていきます。

今後の成長を、ぜひ温かい目で見守っていただければ幸いです。

OpenMOSE — 2026

Downloads last month: 128

GGUF

Model size

24B params

Architecture

rwkv07d_moe

Hardware compatibility

3-bit

4-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF

Base model

OpenMOSE/RWKV-Qwen3-30B-A3B-Instruct-hxa07d-L8

Finetuned

OpenMOSE/RWKV-24B-A2B-wakaba-2601

Quantized

(1)

this model

Collection including OpenMOSE/RWKV-24B-A2B-wakaba-2601-GGUF

hxa07D RWKV-Transformer Hybrid series

Collection

New hxa07D family of hybrid models, combining improved RWKV recurrent architectures with Transformer-based attention. Designed for efficient long-cont • 6 items • Updated Jan 7