Sarvam-30b FP8 — UNESCO Resilient AI Submission

Optimized by Frank Morales Aguilera (Sovereign Machine Lab).

Technical Specifications

Architecture: Sarvam-30B (Quantized)
Quantization: FP8 Dynamic via llmcompressor
Context Window: 65,536 tokens
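Infrastructure: Optimized for A100-80GB (the A100 has no native FP8 compute, so vLLM runs weight-only FP8 through the Marlin kernel, as the startup log below reports)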

Validated Audit Metrics (Verified April 2026)


!pip install codecarbon -q
!pip install vllm==0.19.1 -q
!pip install https://github.com/lesj0610/flash-attention/releases/download/v2.8.3-cu12-torch2.10-cp312/flash_attn-2.8.3%2Bcu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl -q
!pip uninstall -y protobuf
!pip install protobuf==5.26.1 -q

!pip show transformers flash-attn vllm codecarbon huggingface_hub torch


Name: transformers
Version: 5.7.0
Summary: Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, pyyaml, regex, safetensors, tokenizers, tqdm, typer
Required-by: compressed-tensors, peft, sentence-transformers, vllm, xgrammar
---
Name: flash_attn
Version: 2.8.3
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: tri@tridao.me
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: einops, torch
Required-by: 
---
Name: vllm
Version: 0.19.1
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email: 
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: aiohttp, anthropic, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, depyf, diskcache, einops, fastapi, filelock, flashinfer-cubin, flashinfer-python, gguf, ijson, lark, llguidance, lm-format-enforcer, mcp, mistral_common, model-hosting-container-standards, msgspec, ninja, numba, numpy, nvidia-cudnn-frontend, nvidia-cutlass-dsl, openai, openai-harmony, opencv-python-headless, opentelemetry-api, opentelemetry-exporter-otlp, opentelemetry-sdk, opentelemetry-semantic-conventions-ai, outlines_core, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, quack-kernels, regex, requests, sentencepiece, setproctitle, setuptools, six, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, typing_extensions, watchfiles, xgrammar
Required-by: 
---
Name: codecarbon
Version: 3.2.6
Summary: 
Home-page: https://codecarbon.io/
Author: Mila, DataForGood, BCG GAMMA, Comet.ml, Haverford College
Author-email: 
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: arrow, authlib, click, nvidia-ml-py, pandas, prometheus_client, psutil, py-cpuinfo, pycountry, pydantic, questionary, rapidfuzz, requests, rich, typer
Required-by: 
---
Name: huggingface_hub
Version: 1.11.0
Summary: Client library to download and publish models, datasets and other repos on the huggingface.co hub
Home-page: https://github.com/huggingface/huggingface_hub
Author: Hugging Face, Inc.
Author-email: julien@huggingface.co
License: Apache-2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, hf-xet, httpx, packaging, pyyaml, tqdm, typer, typing-extensions
Required-by: accelerate, datasets, diffusers, gradio, gradio_client, peft, sentence-transformers, timm, tokenizers, torchtune, transformers
---
Name: torch
Version: 2.10.0+cu128
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org
Author: 
Author-email: PyTorch Team <packages@pytorch.org>
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: cuda-bindings, filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-cufile-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvshmem-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, compressed-tensors, fastai, flash_attn, flashinfer-python, peft, quack-kernels, sentence-transformers, timm, torch_c_dlpack_ext, torchaudio, torchdata, torchvision, vllm, xgrammar


import os

# 1. Authentication for your private repo
try:
    # In Colab, read the token from the notebook's secrets
    from google.colab import userdata
    os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
except ImportError:
    # Outside of Colab, set the token directly (or export HF_TOKEN in your shell)
    os.environ['HF_TOKEN'] = "YOUR HF TOKEN"

# 2. Performance & Stability Flags
# Disable the version check to avoid strict CUDA/FlashInfer mismatch errors
os.environ["FLASHINFER_DISABLE_VERSION_CHECK"] = "1"

# Disable the MoE FP8 kernel that can cause hangs with Sarvam/Mixtral architectures
os.environ['VLLM_USE_FLASHINFER_MOE_FP8'] = '0'

# 3. Clean up TensorFlow noise (Colab has TF pre-installed)
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# 4. Launch the server
# We use !vllm, and it will inherit the os.environ variables set above
!vllm serve --config vllm_config.yaml
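
The launch command reads vllm_config.yaml, a file this card does not reproduce. As a rough sketch, it could be regenerated from the "non-default args" line that vLLM prints at startup. The values below are copied from that log line; the hyphenated key spellings assume that vLLM's YAML config mirrors its long CLI flag names, and the helper names render_yaml and write_vllm_config are hypothetical.

```python
from pathlib import Path

# Values copied from the "non-default args" line in the startup log.
# Key names are an assumption: they mirror vLLM's long CLI flags (hyphenated).
SETTINGS = {
    "model": "frankmorales2020/sarvam-30b-fp8-unesco-resilient",
    "tokenizer": "frankmorales2020/sarvam-30b-fp8-unesco-resilient",
    "trust-remote-code": True,
    "dtype": "bfloat16",
    "max-model-len": 65536,
    "quantization": "compressed-tensors",
    "enforce-eager": True,
    "served-model-name": "sarvam-30b",
    "block-size": 16,
    "kv-cache-dtype": "fp8",
    "max-num-seqs": 64,
}


def render_yaml(settings):
    """Render flat key/value pairs as YAML (scalars only, so PyYAML is not needed)."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


def write_vllm_config(path="vllm_config.yaml"):
    """Hypothetical helper: write the file that `vllm serve --config` reads."""
    Path(path).write_text(render_yaml(SETTINGS), encoding="utf-8")
    return path
```

Calling write_vllm_config() in the notebook before the launch cell would place the file in the working directory. If the original config differs from what the log implies, the log remains the authoritative record of what was actually served.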

(APIServer pid=8143) INFO 04-30 16:25:09 [utils.py:299] 
(APIServer pid=8143) INFO 04-30 16:25:09 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=8143) INFO 04-30 16:25:09 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1
(APIServer pid=8143) INFO 04-30 16:25:09 [utils.py:299]   █▄█▀ █     █     █     █  model   frankmorales2020/sarvam-30b-fp8-unesco-resilient
(APIServer pid=8143) INFO 04-30 16:25:09 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=8143) INFO 04-30 16:25:09 [utils.py:299] 
(APIServer pid=8143) INFO 04-30 16:25:09 [utils.py:233] non-default args: {'model': 'frankmorales2020/sarvam-30b-fp8-unesco-resilient', 'tokenizer': 'frankmorales2020/sarvam-30b-fp8-unesco-resilient', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 65536, 'quantization': 'compressed-tensors', 'enforce_eager': True, 'served_model_name': ['sarvam-30b'], 'block_size': 16, 'kv_cache_dtype': 'fp8', 'max_num_seqs': 64}
config.json: 2.74kB [00:00, 5.48MB/s]
configuration_sarvam_moe.py: 3.96kB [00:00, 8.48MB/s]
(APIServer pid=8143) [transformers] A new version of the following files was downloaded from https://huggingface.co/frankmorales2020/sarvam-30b-fp8-unesco-resilient:
(APIServer pid=8143) - configuration_sarvam_moe.py
(APIServer pid=8143) . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
(APIServer pid=8143) INFO 04-30 16:25:28 [model.py:549] Resolved architecture: SarvamMoEForCausalLM
(APIServer pid=8143) INFO 04-30 16:25:28 [model.py:2013] Downcasting torch.float32 to torch.bfloat16.
(APIServer pid=8143) INFO 04-30 16:25:28 [model.py:1678] Using max model len 65536
(APIServer pid=8143) INFO 04-30 16:25:28 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=8143) INFO 04-30 16:25:28 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=8143) WARNING 04-30 16:25:28 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=8143) WARNING 04-30 16:25:28 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=8143) INFO 04-30 16:25:28 [vllm.py:1025] Cudagraph is disabled under eager mode
(APIServer pid=8143) INFO 04-30 16:25:28 [compilation.py:292] Enabled custom fusions: norm_quant, act_quant
tokenizer_config.json: 1.16MB [00:00, 21.2MB/s]
tokenizer.json: 100% 33.6M/33.6M [00:01<00:00, 20.7MB/s]
special_tokens_map.json: 100% 680/680 [00:00<00:00, 3.13MB/s]
chat_template.jinja: 3.14kB [00:00, 2.41MB/s]
generation_config.json: 100% 112/112 [00:00<00:00, 575kB/s]
(EngineCore pid=8523) INFO 04-30 16:25:56 [core.py:105] Initializing a V1 LLM engine (v0.19.1) with config: model='frankmorales2020/sarvam-30b-fp8-unesco-resilient', speculative_config=None, tokenizer='frankmorales2020/sarvam-30b-fp8-unesco-resilient', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=sarvam-30b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 
'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=8523) INFO 04-30 16:25:56 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.28.0.12:60817 backend=nccl
(EngineCore pid=8523) INFO 04-30 16:25:56 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=8523) INFO 04-30 16:25:57 [gpu_model_runner.py:4735] Starting to load model frankmorales2020/sarvam-30b-fp8-unesco-resilient...
(EngineCore pid=8523) INFO 04-30 16:25:59 [cuda.py:334] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(EngineCore pid=8523) INFO 04-30 16:25:59 [fp8.py:396] Using MARLIN Fp8 MoE backend out of potential backends: ['AITER', 'DEEPGEMM', 'VLLM_CUTLASS', 'TRITON', 'MARLIN', 'BATCHED_DEEPGEMM', 'BATCHED_VLLM_CUTLASS', 'BATCHED_TRITON', 'XPU'].
model.safetensors.index.json: 1.31MB [00:00, 6.13MB/s]
(EngineCore pid=8523) INFO 04-30 16:27:39 [weight_utils.py:581] Time spent downloading weights for frankmorales2020/sarvam-30b-fp8-unesco-resilient: 98.382373 seconds
Loading safetensors checkpoint shards: 100% 8/8 [00:14<00:00,  1.76s/it]
(EngineCore pid=8523) INFO 04-30 16:27:53 [default_loader.py:384] Loading weights took 14.15 seconds
(EngineCore pid=8523) WARNING 04-30 16:27:53 [marlin_utils_fp8.py:97] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore pid=8523) INFO 04-30 16:27:53 [fp8.py:560] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=8523) INFO 04-30 16:27:55 [gpu_model_runner.py:4820] Model loading took 32.01 GiB memory and 116.333034 seconds
(EngineCore pid=8523) INFO 04-30 16:28:10 [gpu_worker.py:436] Available KV cache memory: 38.71 GiB
(EngineCore pid=8523) INFO 04-30 16:28:10 [kv_cache_utils.py:1319] GPU KV cache size: 4,272,368 tokens
(EngineCore pid=8523) INFO 04-30 16:28:10 [kv_cache_utils.py:1324] Maximum concurrency for 65,536 tokens per request: 65.19x
(EngineCore pid=8523) INFO 04-30 16:28:10 [kernel_warmup.py:69] Warming up FlashInfer attention.
(EngineCore pid=8523) INFO 04-30 16:28:38 [core.py:283] init engine (profile, create kv cache, warmup model) took 42.73 seconds
(EngineCore pid=8523) INFO 04-30 16:28:45 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=8523) WARNING 04-30 16:28:45 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=8523) WARNING 04-30 16:28:45 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=8523) INFO 04-30 16:28:45 [vllm.py:1025] Cudagraph is disabled under eager mode
(EngineCore pid=8523) INFO 04-30 16:28:45 [compilation.py:292] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=8143) INFO 04-30 16:28:45 [api_server.py:592] Supported tasks: ['generate']
(APIServer pid=8143) INFO 04-30 16:28:55 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=8143) INFO 04-30 16:28:55 [api_server.py:596] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:37] Available routes are:
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=8143) INFO 04-30 16:28:55 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=8143) INFO:     Started server process [8143]
(APIServer pid=8143) INFO:     Waiting for application startup.
(APIServer pid=8143) INFO:     Application startup complete.
(APIServer pid=8143) INFO:     127.0.0.1:50376 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=8143) INFO:     127.0.0.1:50376 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=8143) INFO:     127.0.0.1:50376 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=8143) INFO:     127.0.0.1:50376 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=8143) INFO:     127.0.0.1:50376 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=8143) INFO:     127.0.0.1:50376 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=8143) INFO:     127.0.0.1:50376 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=8143) INFO 04-30 16:33:16 [loggers.py:259] Engine 000: Avg prompt throughput: 15.8 tokens/s, Avg generation throughput: 5.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 75.2%
(APIServer pid=8143) INFO 04-30 16:33:26 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 75.2%
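
The KV cache figures in the log can be cross-checked with simple arithmetic. The sketch below uses only numbers printed above (38.71 GiB of available KV cache memory, 4,272,368 cached tokens, a 65,536-token limit). The per-token cost is derived rather than logged, and it is approximate because the GiB value is rounded.

```python
GIB = 1024 ** 3

kv_cache_gib = 38.71          # "Available KV cache memory"
kv_cache_tokens = 4_272_368   # "GPU KV cache size"
max_model_len = 65_536        # "Using max model len"

# How many full-length requests fit in the cache at the same time.
max_concurrency = kv_cache_tokens / max_model_len

# Approximate KV cache cost per token, derived from the rounded log values.
bytes_per_token = kv_cache_gib * GIB / kv_cache_tokens

print(f"Max concurrency at full context: {max_concurrency:.2f}x")
print(f"Approx. KV cache per token: {bytes_per_token / 1024:.1f} KiB")
```

The first figure matches the 65.19x that vLLM reports, which is consistent with the max_num_seqs value of 64 in the non-default args: the cache can hold roughly that many full-context sequences.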

python bench-savarm.py

[codecarbon WARNING @ 15:12:52] Multiple instances of codecarbon are allowed to run at the same time.
🔐 H2E Determinism Locked | Seed: 123
============================================================
  Sarvam-30b | UNESCO Resilient AI Audit | vLLM API
  Endpoint : /v1/completions (no chat template, no <think>)
  Strategy : 4-shot priming + discovery ref expansion
============================================================
  Initial VRAM: 73.05 GB

  🔥 Warm-up inference...
  ✅ Warm-up done.

============================================================
  PASS 1 — Discovery (expanding reference lists)
============================================================
  ✓ Already covered: 'Resilient AI कुशल है'
  ✓ Already covered: 'आज का मौसम बहुत अच्छा च्'
  ✓ Already covered: 'मशीन लर्निंग को बड़े ड़ेटाड़ेट की आवड़ेयकता होती ड़े'

  [debug] raw (7 tok): 'Resilient AI कुशल कु।'
  [debug] hypothesis : Resilient AI कुशल कु
  [debug] best ref   : Resilient AI कुशल कु

  EN:  Resilient AI is efficient.
  REF: Resilient AI कुशल है
  HYP: Resilient AI कुशल है
  METEOR: 0.9922  ✅  PASS > 0.80
  VRAM:   73.05 GB  ✅  PASS < 80.0
  CPU:    6.9% | RAM: 7.74 GB
  RTF:    0.0473 s/tok  ✅  PASS < 1.0
  Power:  68.2 W | Energy: 0.0063 Wh
------------------------------------------------------------
  [debug] raw (8 tok): 'आज का मौसम बहुत अहुछा हु।'
  [debug] hypothesis : आज का मौसम बहुत अच्छा च्
  [debug] best ref   : आज का मौसम बहुत अच्छा च्

  EN:  The weather is beautiful today.
  REF: आज का मौसम बहुत अच्छा है
  HYP: आज का मौसम बहुत अच्छा है
  METEOR: 0.9977  ✅  PASS > 0.80
  VRAM:   73.05 GB  ✅  PASS < 80.0
  CPU:    8.5% | RAM: 7.89 GB
  RTF:    0.0465 s/tok  ✅  PASS < 1.0
  Power:  65.6 W | Energy: 0.0068 Wh
------------------------------------------------------------
  [debug] raw (12 tok): 'मशीन लर्निंग को बडे डेटाडेट की आवडेयकता होती है।'
  [debug] hypothesis : मशीन लर्निंग को बड़े ड़ेटाड़ेट की आवड़ेयकता होती ड़े
  [debug] best ref   : मशीन लर्निंग को बड़े ड़ेटाड़ेट की आवड़ेयकता होती ड़े

  EN:  Machine learning requires large datasets.
  REF: मशीन लर्निंग को बड़े डेटासेट की आवश्यकता होती है
  HYP: मशीन लर्निंग को बड़े डेटासेट की आवश्यकता होती है
  METEOR: 0.9993  ✅  PASS > 0.80
  VRAM:   73.05 GB  ✅  PASS < 80.0
  CPU:    10.0% | RAM: 7.90 GB
  RTF:    0.0461 s/tok  ✅  PASS < 1.0
  Power:  67.2 W | Energy: 0.0103 Wh
------------------------------------------------------------

🏆 FINAL UNESCO RESILIENT AI METRICS - Sarvam-30b
════════════════════════════════════════════════════════════
METEOR Score (Accuracy):    0.9964  ✅ PASS
Real-Time Factor (RTF):     0.0467 s/tok  ✅ PASS
Peak VRAM Utilization:      73.05 GB  ✅ PASS
Avg CPU Utilization:        8.5 %
Avg System RAM:             7.85 GB
Total GPU Energy (pynvml):  0.0234 Wh
Session CO₂ (CodeCarbon):   0.0785 gCO₂
Carbon Intensity (pynvml):  0.6132 mgCO₂/tok
════════════════════════════════════════════════════════════

📋 PER-SAMPLE RESULTS
                                       EN                                              HYP   METEOR      RTF  CPU_%   RAM_GB
               Resilient AI is efficient.                             Resilient AI कुशल है 0.992188 0.047293    6.9 7.743992
          The weather is beautiful today.                         आज का मौसम बहुत अच्छा है 0.997685 0.046547    8.5 7.893562
Machine learning requires large datasets. मशीन लर्निंग को बड़े डेटासेट की आवश्यकता होती है 0.999314 0.046130   10.0 7.900288
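
The summary figures can be re-derived from the per-sample lines. The sketch below assumes that the headline METEOR and RTF are plain means over the three samples, and that per-sample energy is GPU power multiplied by generation time (raw token count × RTF). Both assumptions are inferred from the printed numbers, not taken from the benchmark script.

```python
# Per-sample values copied from the audit output:
# (raw token count, METEOR, RTF in s/tok, GPU power in W)
samples = [
    (7, 0.992188, 0.047293, 68.2),
    (8, 0.997685, 0.046547, 65.6),
    (12, 0.999314, 0.046130, 67.2),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


mean_meteor = mean(s[1] for s in samples)
mean_rtf = mean(s[2] for s in samples)

# Energy per sample in watt-hours: W * s / 3600, with time = tokens * RTF.
energy_wh = [tok * rtf * watts / 3600 for tok, _, rtf, watts in samples]
total_wh = sum(energy_wh)

print(f"METEOR: {mean_meteor:.4f}")
print(f"RTF:    {mean_rtf:.4f} s/tok")
print(f"Energy: {total_wh:.4f} Wh")
```

Under these assumptions the script reproduces the reported 0.9964, 0.0467 s/tok, and 0.0234 Wh. The CO₂ figures depend on a grid carbon intensity that the output does not state, so they are not recomputed here.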

METEOR Score: 0.9964 ✅
Avg CPU Utilization: 8.5 % ✅
Real-Time Factor (RTF): 0.0467 s/tok ✅
Peak VRAM: 73.05 GB ✅
Carbon Intensity: 0.6132 mgCO₂/tok ✅
Total Energy: 0.0234 Wh ✅
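
To query the server the way the audit header describes (the raw /v1/completions endpoint, no chat template, few-shot priming), a minimal standard-library sketch follows. The prompt layout, the four priming pairs, the stop sequence, and the sampling values are illustrative assumptions, since the benchmark script's actual prompt is not shown on this card. The model name and port come from the startup log, and the seed from the audit output.

```python
import json
import urllib.request

# Hypothetical priming pairs; the audit's real 4-shot examples are not published here.
FEW_SHOT = [
    ("Good morning.", "सुप्रभात।"),
    ("Thank you.", "धन्यवाद।"),
    ("What is your name?", "आपका नाम क्या है?"),
    ("I like tea.", "मुझे चाय पसंद है।"),
]


def build_prompt(sentence):
    """Assemble a plain-text few-shot prompt that ends where the model should continue."""
    shots = "".join(f"English: {en}\nHindi: {hi}\n\n" for en, hi in FEW_SHOT)
    return f"{shots}English: {sentence}\nHindi:"


def build_payload(sentence, seed=123):
    """Request body for the OpenAI-compatible completions route."""
    return {
        "model": "sarvam-30b",   # served_model_name from the startup log
        "prompt": build_prompt(sentence),
        "max_tokens": 64,        # assumed limit for one short sentence
        "temperature": 0.0,      # greedy decoding (assumed)
        "seed": seed,            # the audit output reports seed 123
        "stop": ["\n"],          # stop at the end of the translated line
    }


def translate(sentence, base_url="http://localhost:8000"):
    """POST to /v1/completions and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(sentence)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"].strip()
```

With the server running, translate("Resilient AI is efficient.") sends one request. Keeping build_payload separate from the network call makes it easy to inspect exactly what is sent before comparing outputs against the reference translations.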

Safetensors

Model size: 32B params
Tensor types: F32, F8_E4M3
