Instructions to use nvidia/CUDA-Autocomplete with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/CUDA-Autocomplete with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/CUDA-Autocomplete")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/CUDA-Autocomplete")
# The model card lists the network architecture as Qwen2ForCausalLM,
# so the causal-LM auto class is the matching loader.
model = AutoModelForCausalLM.from_pretrained("nvidia/CUDA-Autocomplete")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
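
The model card describes the input as a fill-in-the-middle (FIM) pair of prefix and suffix code rather than chat messages. The following is a minimal sketch of building such a prompt. It assumes the fine-tuned model keeps the FIM special tokens of its base model, Qwen2.5-Coder (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`); that is an assumption, not something the model card states, so check the tokenizer's special tokens before relying on it. `build_fim_prompt` is a hypothetical helper name used only for illustration.

```python
# Sketch only: assumes the fine-tune keeps the FIM special tokens of its
# base model, Qwen2.5-Coder. Verify against the tokenizer before use.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: wrap the code before and after the cursor
    so that the model generates the missing middle part."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Code before the cursor (prefix) and after the cursor (suffix).
prefix = (
    "__global__ void add(const float* a, const float* b, float* c, int n) {\n"
    "    int i = "
)
suffix = "\n    if (i < n) c[i] = a[i] + b[i];\n}\n"
prompt = build_fim_prompt(prefix, suffix)

# With the tokenizer and model loaded as above, the prompt can be passed
# as plain text instead of chat messages, for example:
#   inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
#   outputs = model.generate(**inputs, max_new_tokens=40)
#   new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
#   print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Keeping the prompt construction in a small pure function makes it easy to check the exact string sent to the model without loading any weights.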

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use nvidia/CUDA-Autocomplete with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/CUDA-Autocomplete"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/CUDA-Autocomplete",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
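
The same request can be sent from Python. The following is a minimal sketch that mirrors the curl call above using only the standard library. `build_chat_request` and `send` are hypothetical helper names, and the response shape read in the usage comment assumes the server follows the OpenAI chat completions format.

```python
import json
import urllib.request


def build_chat_request(
    content: str,
    base_url: str = "http://localhost:8000",
    model: str = "nvidia/CUDA-Autocomplete",
) -> urllib.request.Request:
    """Hypothetical helper: build the same POST request as the curl call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request("What is the capital of France?")

# Once the server is running, the request can be sent like this:
#   reply = send(request)
#   print(reply["choices"][0]["message"]["content"])
```

Passing a different `base_url`, for example `http://localhost:30000`, points the same helper at a server listening on another port.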

Use Docker

docker model run hf.co/nvidia/CUDA-Autocomplete

SGLang

How to use nvidia/CUDA-Autocomplete with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/CUDA-Autocomplete" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/CUDA-Autocomplete",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/CUDA-Autocomplete" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/CUDA-Autocomplete",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/CUDA-Autocomplete with Docker Model Runner:
```
docker model run hf.co/nvidia/CUDA-Autocomplete
```

DavidBord committed 8 days ago

Commit 1878aa0 (verified), 1 Parent(s): eaa345b

Upload README.md with huggingface_hub

Files changed (1)

README.md +87 -76

@@ -1,51 +1,64 @@
-# nvidia/CUDA-Autocomplete Overview
-## Description:
-NVIDIA CUDA Autocomplete is a fine-tuned version of Qwen/Qwen2.5-Coder-7B enhanced for CUDA code completion. The model takes as input two strings of code context: the prefix (code before the cursor) and the suffix (code after the cursor), and outputs a single line of code that logically continues the prefix. By analyzing the surrounding code structure, variable names, and CUDA-specific patterns, the model predicts the most likely next line of code, enabling intelligent autocomplete functionality for general programming and CUDA development in the Nsight Copilot extension for VSCode and Cursor.
-_This model is ready for commercial/non-commercial use._
-### License/Terms of Use:
-[NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
-### Deployment Geography:
 Global
-### Use Case:
-This model is intended to be used for code completion in the Nsight Copilot extension for VSCode / Cursor.
-### Release Date:
-Huggingface : 05/28/2026
-## Reference(s):
 [Qwen2.5-Coder paper](https://arxiv.org/abs/2409.12186)
-[Qwen2.5-Coder blog](https://qwenlm.github.io/blog/qwen2.5-coder-family/)
-[Qwen2.5-Coder GitHub repository](https://github.com/QwenLM/Qwen2.5-Coder)
-## Model Architecture:
-**Architecture Type:** Transformer
-**Network Architecture:** Qwen2ForCausalLM
-**This model was developed based on Qwen/Qwen2.5-Coder-7B.**
-**Number of model parameters:** 7B (7*10^9)
-## Computational Load (Internal Only: For NVIDIA Models Only)
-**Cumulative Compute:** 1.23 * 10^20 FLOPS
-**Estimated Energy and Emissions for Model Training:** 150.52 kWh
-## Input:
 **Input Type(s):** Code
 **Input Format(s):** String of code (meant for prefix code and suffix code)
 **Input Parameters:** One-Dimensional (1D)
 **Other Properties Related to Input:**
 - **Context Window:** The model processes sequential code text with prefix and suffix context
 - **Encoding:** UTF-8 text encoding
-- **Input Structure:** Fill-in-the-middle (FIM) format with prefix and suffix tokens
-## Output:
 **Output Type(s):** Code
 **Output Format:** String
 **Output Parameters:** One-Dimensional (1D)
@@ -53,60 +66,58 @@ Huggingface : 05/28/2026
 - **Output Length:** Single line of code completion
 - **Generation Method:** Autoregressive token-by-token generation
 - **Encoding:** UTF-8 text encoding
-- **Output Structure:** Sequential code text that continues from the input prefix
 Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
-## Software Integration:
-**Runtime Engine(s):** vLLM
-**Supported Hardware Microarchitecture Compatibility:**
-* H100
-* DGX Spark
-**[Supported] Operating System(s):** Linux
-The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
-## Model Version(s):
-v0.3
-## Training, Testing, and Evaluation Datasets:
-## Training Dataset:
-**Link:** Subset of
-1) https://huggingface.co/datasets/bigcode/the-stack-v2
-2) Synthetically generated CUDA data using OSS models like GPT-OSS 120B
-**Data Modality:** Text
-**Text Training Data Size:** ~700000 samples
-**Data Collection Method by dataset:** Hybrid: Automated, Synthetic
-**Labeling Method by dataset:** Not Applicable
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~700,000 samples. Text modality (source code). Content includes open-source CUDA and general programming code collected from permissive-licensed repositories, as well as machine-generated synthetic CUDA code produced by OSS models. Primarily English-language code with CUDA-specific constructs and APIs. No sensor data involved.
-### Testing Dataset:
-**Link:** NVIDIA Internal Data.
-(Internal Only: Not To Be Published)
-**Benchmark Score:** ROUGE-L score on cuda-samples dataset is 77.45 %.
-**Data Collection Method by dataset:** Automated
-**Labeling Method by dataset:** Not Applicable
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** 2,156 samples. Text modality (source code). Content consists of internal proprietary CUDA and HPC library code (e.g., cuDNN, cuda-hpc) parsed from internal GitLab repositories. Code is CUDA-specific with domain-specific APIs and patterns. No sensor data involved.
-### Evaluation Dataset:
-**Link:** Subset of https://huggingface.co/datasets/bigcode/the-stack-v2
-**Data Collection Method by dataset:** Automated
-**Labeling Method by dataset:** Not Applicable
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~33,000 samples. Each sample corresponds to a single source code file. Text modality (source code). Content includes open-source code collected from permissive-licensed repositories. CUDA and general programming code in English. No sensor data involved.
-## Inference:
 **Acceleration Engine:** vLLM
-**Test Hardware:**
-* H100
-* DGX Spark
-## Ethical Considerations:
-NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
-For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
-Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).

+---
+license: other
+license_name: nvidia-open-model-license
+license_link: >-
+  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+  - code
+  - cuda
+  - fill-in-the-middle
+  - nvidia
+  - pytorch
+datasets:
+  - bigcode/the-stack-v2
+base_model: Qwen/Qwen2.5-Coder-7B
+---
+## Model Overview
+NVIDIA CUDA Autocomplete is a fine-tuned version of Qwen/Qwen2.5-Coder-7B enhanced for CUDA code completion. The model takes as input two strings of code context: the prefix (code before the cursor) and the suffix (code after the cursor), and outputs several lines of code that logically continue the prefix. By analyzing the surrounding code structure, variable names, and CUDA-specific patterns, the model predicts the most likely next line of code, enabling intelligent autocomplete functionality for general programming and CUDA development in the Nsight Copilot extension for VSCode and Cursor.
+_This model is ready for commercial/non-commercial use._
+### License/Terms of Use
+Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
+Additional Information: Qwen2.5-Coder-7B is released under the [Apache License, Version 2.0](https://huggingface.co/Qwen/Qwen2.5-Coder-7B/blob/main/LICENSE).
+### Deployment Geography
 Global
+### Use Case
+This model is intended to be used for code completion in the Nsight Copilot extension for VSCode / Cursor.
+### Release Date
+Huggingface : 06/09/2026 via [https://huggingface.co/nvidia/CUDA-Autocomplete](https://huggingface.co/nvidia/CUDA-Autocomplete)
+## Reference(s)
 [Qwen2.5-Coder paper](https://arxiv.org/abs/2409.12186)
+[Qwen2.5-Coder blog](https://qwenlm.github.io/blog/qwen2.5-coder-family/)
+[Qwen2.5-Coder GitHub repository](https://github.com/QwenLM/Qwen2.5-Coder)
+## Model Architecture
+**Architecture Type:** Transformer
+**Network Architecture:** Qwen2ForCausalLM
+**This model was developed based on Qwen/Qwen2.5-Coder-7B.**
+**Number of model parameters:** 7B (7*10^9)
+## Input
 **Input Type(s):** Code
 **Input Format(s):** String of code (meant for prefix code and suffix code)
 **Input Parameters:** One-Dimensional (1D)
 **Other Properties Related to Input:**
 - **Context Window:** The model processes sequential code text with prefix and suffix context
 - **Encoding:** UTF-8 text encoding
+- **Input Structure:** Fill-in-the-middle (FIM) format with prefix and suffix tokens
+## Output
 **Output Type(s):** Code
 **Output Format:** String
 **Output Parameters:** One-Dimensional (1D)
 - **Output Length:** Single line of code completion
 - **Generation Method:** Autoregressive token-by-token generation
 - **Encoding:** UTF-8 text encoding
+- **Output Structure:** Sequential code text that continues from the input prefix
 Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
+## Software Integration
+**Runtime Engine(s):** vLLM
+**Supported Hardware Microarchitecture Compatibility:**
+* H100
+* DGX Spark
+**[Supported] Operating System(s):** Linux
+The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
+## Model Version(s)
+v0.3.0
+## Training, Testing, and Evaluation Datasets
+### Training Dataset
+* **Source:** Subset of [bigcode/the-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) & synthetically generated CUDA data using OSS models like GPT-OSS 120B
+* **Data Modality:** Text
+* **Text Training Data Size:** ~700,000 samples
+* **Data Collection Method by dataset:** Hybrid: Automated, Synthetic
+* **Labeling Method by dataset:** Not Applicable
+* **Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~700,000 samples. Text modality (source code). Content includes open-source CUDA and general programming code collected from permissive-licensed repositories, as well as machine-generated synthetic CUDA code produced by OSS models. Primarily English-language code with CUDA-specific constructs and APIs. No sensor data involved.
+### Testing Dataset
+* **Source:** NVIDIA Internal Data
+* **Data Collection Method by dataset:** Automated
+* **Labeling Method by dataset:** Not Applicable
+* **Properties (Quantity, Dataset Descriptions, Sensor(s)):** 2,156 samples. Text modality (source code). Content consists of internal proprietary CUDA and HPC library code (e.g., cuDNN, cuda-hpc) parsed from internal GitLab repositories. Code is CUDA-specific with domain-specific APIs and patterns. No sensor data involved.
+### Evaluation Dataset
+* **Source:** Subset of [bigcode/the-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2)
+* **Data Collection Method by dataset:** Automated
+* **Labeling Method by dataset:** Not Applicable
+* **Properties (Quantity, Dataset Descriptions, Sensor(s)):** ~33,000 samples. Each sample corresponds to a single source code file. Text modality (source code). Content includes open-source code collected from permissive-licensed repositories. CUDA and general programming code in English. No sensor data involved.
+## Inference
 **Acceleration Engine:** vLLM
+**Test Hardware:**
+* H100
+* DGX Spark
+## Ethical Considerations
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
+Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).