Instructions for using ljupco/Nemotron-3-Elastic-30B-MLX with libraries and local apps.
- Libraries
- MLX
How to use ljupco/Nemotron-3-Elastic-30B-MLX with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm
# Generate text with mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("ljupco/Nemotron-3-Elastic-30B-MLX")
prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Local Apps
- Pi
How to use ljupco/Nemotron-3-Elastic-30B-MLX with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "ljupco/Nemotron-3-Elastic-30B-MLX"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "ljupco/Nemotron-3-Elastic-30B-MLX" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use ljupco/Nemotron-3-Elastic-30B-MLX with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "ljupco/Nemotron-3-Elastic-30B-MLX"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ljupco/Nemotron-3-Elastic-30B-MLX
Run Hermes
hermes
- MLX LM
How to use ljupco/Nemotron-3-Elastic-30B-MLX with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "ljupco/Nemotron-3-Elastic-30B-MLX"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 by default)
mlx_lm.server --model "ljupco/Nemotron-3-Elastic-30B-MLX"
# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ljupco/Nemotron-3-Elastic-30B-MLX",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
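The same request can also be sent from Python without extra dependencies. This is a minimal sketch using only the standard library; it assumes `mlx_lm.server` is running locally on its default port 8080, and the helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

# Assumed address of a local mlx_lm.server started without --port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "ljupco/Nemotron-3-Elastic-30B-MLX"


def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat completion payload for one user turn."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL, timeout: float = 600.0) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    # OpenAI-compatible responses carry the text in choices[0].message.content.
    return data["choices"][0]["message"]["content"]
```

Once the server is up, `print(chat("Hello"))` sends a single user turn and prints the reply. The long default timeout leaves room for slow first responses on large prompts.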
Nemotron 3 Elastic 30B - MLX Format (Apple Silicon)
NVIDIA Nemotron 3 Elastic 30B model converted to MLX format for efficient inference on Apple Silicon (Metal).
🚀 Quick Start
# Install dependencies
pip install mlx-lm
# Run chat (8-bit KV cache by default)
python chat_mlx.py --model . --max-kv-size 1048576
# 4-bit KV cache for more memory savings (lets long contexts fit in less RAM)
python chat_mlx.py --model . --max-kv-size 1048576 --kv-bits 4
📊 Model Details
- Format: MLX NVFP4 (4.5 bits/weight)
- Size: ~16.5 GB
- Context: Up to 1M tokens (design limit, hardware-dependent)
- Platform: Apple Silicon (M1/M2/M3) with macOS
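The 4.5 bits/weight figure follows from how NVFP4 is usually described: each weight is a 4-bit (E2M1) value, and every block of 16 weights shares one 8-bit (E4M3) scale, which adds 8/16 = 0.5 bits per weight. The sketch below encodes that arithmetic; the block size and bit widths are the commonly cited NVFP4 parameters, and the small per-tensor scale is ignored.

```python
# Commonly cited NVFP4 layout (assumed here): 4-bit E2M1 values, with each
# block of 16 consecutive weights sharing one 8-bit E4M3 scale factor.
VALUE_BITS = 4
SCALE_BITS = 8
BLOCK_SIZE = 16


def bits_per_weight(value_bits: int = VALUE_BITS,
                    scale_bits: int = SCALE_BITS,
                    block_size: int = BLOCK_SIZE) -> float:
    """Effective storage cost per weight, including the amortized block scale."""
    return value_bits + scale_bits / block_size


def quantized_size_gb(num_params: float, bpw: float) -> float:
    """Approximate size in GB (10**9 bytes) of num_params weights at bpw bits each."""
    return num_params * bpw / 8 / 1e9
```

For example, `quantized_size_gb(1e9, bits_per_weight())` returns 0.5625, i.e. about 0.56 GB per billion quantized weights. Any tensors stored at higher precision mean the actual file size will differ from a simple parameter-count estimate.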
💾 Memory Requirements
| KV cache precision | Approx. memory at 1M context | Typical use |
|---|---|---|
| 8-bit (default) | ~70-80 GB | Recommended for M2 Max 96GB |
| 4-bit | ~50-60 GB | For smaller memory configurations |
| 16-bit | ~90-100 GB | Maximum quality |
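The totals above depend on the model's attention configuration. In a hybrid model only the attention layers keep a per-token KV cache (Mamba-2 layers keep a fixed-size state), so a rough estimate can be sketched as below. The layer count, KV-head count, and head dimension are placeholders to read from the model's `config.json`, and the quantization overhead (one 16-bit scale and one 16-bit bias per group of 64 elements) is an assumption modeled on MLX's default group quantization. The result covers the cache only, not the ~16.5 GB of weights or runtime overhead, so it is not expected to reproduce the table's figures.

```python
def kv_cache_bytes(num_tokens: int, num_attn_layers: int, num_kv_heads: int,
                   head_dim: int, bits: int, group_size: int = 64,
                   scale_bits: int = 16) -> float:
    """Estimate KV-cache memory in bytes (a sketch under stated assumptions).

    Keys and values (factor 2) are stored for every token in every attention
    layer. For quantized caches (bits < 16), each group of `group_size`
    elements is assumed to carry one scale and one bias of `scale_bits` bits.
    """
    elements = 2 * num_tokens * num_attn_layers * num_kv_heads * head_dim
    effective_bits = float(bits)
    if bits < 16:
        # Amortized cost of the per-group scale and bias.
        effective_bits += 2 * scale_bits / group_size
    return elements * effective_bits / 8
```

For the 1M-token setting in the quick start, pass `num_tokens=1_048_576` (2**20) together with the values from `config.json`. Under these assumptions, lowering `bits` from 8 to 4 cuts the effective cost per cached element from 8.5 to 4.5 bits.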
🎯 Features
- Hybrid Architecture: Mamba-2 + MoE (Mixture of Experts) + Attention layers
- Elastic Variants: Supports 12B/23B/30B configurations
- Long Context: Designed for up to 1M token context window
- Reasoning: Thinking traces enabled by default
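Since reasoning traces are on by default, client code may want to separate the trace from the final answer. The sketch below assumes the trace ends with a `</think>` tag and may begin with `<think>` (the opening tag is sometimes part of the prompt rather than the output); these tag names are an assumption, so check the model's chat template before relying on them. The function name `split_reasoning` is hypothetical.

```python
def split_reasoning(text: str, open_tag: str = "<think>",
                    close_tag: str = "</think>") -> tuple[str, str]:
    """Split model output into (reasoning, answer).

    Assumes the trace is terminated by close_tag and optionally preceded by
    open_tag. If close_tag is absent, the whole text is treated as the answer.
    """
    end = text.find(close_tag)
    if end == -1:
        return "", text.strip()
    reasoning = text[:end]
    start = reasoning.find(open_tag)
    if start != -1:
        # Drop anything up to and including the opening tag.
        reasoning = reasoning[start + len(open_tag):]
    answer = text[end + len(close_tag):]
    return reasoning.strip(), answer.strip()
```

A truncated generation that never closes the trace comes back entirely as the "answer", so callers that cap `max_tokens` low may want to treat that case specially.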
📖 Usage
Basic Chat
python chat_mlx.py --model .
Large Text Input
For texts larger than ~4KB:
You> /paste
Now paste your text. After the text is pasted, to process the text, in empty line enter /endpaste
[paste your large text]
/endpaste
Assistant> [processes your text]
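A paste mode like this can be implemented by reading lines until the end marker appears on a line of its own, so that multi-line text reaches the model as a single message. The sketch below is hypothetical and is not taken from `chat_mlx.py`; the function name `collect_paste` is illustrative.

```python
from typing import Iterable


def collect_paste(lines: Iterable[str], end_marker: str = "/endpaste") -> str:
    """Join input lines until one equals end_marker (surrounding whitespace ignored)."""
    buffer = []
    for line in lines:
        if line.strip() == end_marker:
            break
        # Keep the line's content but drop its trailing newline.
        buffer.append(line.rstrip("\n"))
    return "\n".join(buffer)
```

For example, `collect_paste(sys.stdin)` (after `import sys`) reads from the terminal until `/endpaste` is entered, or until end of input.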
In-Chat Commands
- /paste: Multi-line input mode for large text
- /quit: Exit chat
- /reset: Clear conversation history
- /thinking on|off: Toggle reasoning traces
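The commands above can be dispatched with a small handler like the following. It is a hypothetical sketch, not the code in `chat_mlx.py`: the `state` dict and its keys (`history`, `thinking`, `paste_mode`) are assumptions made for illustration.

```python
def handle_command(line: str, state: dict) -> bool:
    """Apply an in-chat command to `state`; return False when the chat should exit.

    `state` is a hypothetical dict with keys "history" (list of messages),
    "thinking" (bool), and "paste_mode" (bool).
    """
    parts = line.strip().split()
    if not parts:
        return True  # ignore empty input
    cmd, args = parts[0], parts[1:]
    if cmd == "/quit":
        return False
    if cmd == "/reset":
        state["history"].clear()  # drop the conversation so far
    elif cmd == "/paste":
        state["paste_mode"] = True  # caller then collects lines until /endpaste
    elif cmd == "/thinking" and args in (["on"], ["off"]):
        state["thinking"] = args[0] == "on"
    else:
        print(f"Unknown command: {line.strip()}")
    return True
```

Keeping all mutable chat settings in one `state` object makes `/reset` and `/thinking` simple assignments and keeps the main loop free of command-specific branches.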
🔗 Links
- Original Model: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4
- Conversion Tools: https://github.com/ljupco/nemotron-3-elastic-mlx
- NVIDIA License: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
📝 License
NVIDIA Open Model License. See LICENSE.md for details.
🙏 Acknowledgments
- Model Developer: NVIDIA
- MLX Framework: Apple MLX team
- Conversion: Adapted from mlx-lm Nemotron H implementation
- Model size: 32B params
- Tensor types: U8, U32, BF16
- Base model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16