Tags: Text Generation, Transformers, Safetensors, PyTorch, qwen2, code, cuda, fill-in-the-middle, nvidia, conversational, text-generation-inference
Instructions for using nvidia/CUDA-Autocomplete with libraries, notebooks, and local apps.
- Libraries
- Transformers
How to use nvidia/CUDA-Autocomplete with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/CUDA-Autocomplete")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly (qwen2 is a causal language model)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/CUDA-Autocomplete")
model = AutoModelForCausalLM.from_pretrained("nvidia/CUDA-Autocomplete")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use nvidia/CUDA-Autocomplete with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/CUDA-Autocomplete"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nvidia/CUDA-Autocomplete",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```bash
docker model run hf.co/nvidia/CUDA-Autocomplete
```
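Because the vLLM server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, the curl call can also be made from Python. The sketch below uses only the standard library (`urllib.request`, `json`) rather than any official client; it assumes the server started with `vllm serve` is listening on `localhost:8000`, and the helper names `build_chat_request` and `chat` are hypothetical. For the SGLang server described next, the same code should work with `base_url="http://localhost:30000/v1"`.

```python
import json
import urllib.request

# Assumption: a vLLM server is running locally on its default port 8000.
# For the SGLang server, pass base_url="http://localhost:30000/v1" instead.
BASE_URL = "http://localhost:8000/v1"
MODEL = "nvidia/CUDA-Autocomplete"


def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completions payload for a single user turn."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """POST the payload to /chat/completions and return the first choice's text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses put the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("Write a CUDA kernel that adds two float arrays.")` returns the model's reply as a string.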
- SGLang
How to use nvidia/CUDA-Autocomplete with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/CUDA-Autocomplete" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nvidia/CUDA-Autocomplete",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/CUDA-Autocomplete" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "nvidia/CUDA-Autocomplete",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

- Docker Model Runner
How to use nvidia/CUDA-Autocomplete with Docker Model Runner:
```bash
docker model run hf.co/nvidia/CUDA-Autocomplete
```
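The model is tagged fill-in-the-middle, which for an autocomplete model usually means prompting with the code before and after the cursor and letting the model generate the gap. The sketch below assumes Qwen2.5-Coder-style special tokens (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`) in prefix-suffix-middle order; that is an assumption based on the qwen2 tag, so check `added_tokens.json` and `tokenizer_config.json` for the tokens this checkpoint actually uses. The helpers `build_fim_prompt` and `complete_middle` are hypothetical, and `complete_middle` expects a tokenizer and model loaded with `AutoTokenizer` and `AutoModelForCausalLM` as in the Transformers example.

```python
# Assumption: Qwen2.5-Coder-style FIM tokens; verify against added_tokens.json.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle ordering: the model continues after FIM_MIDDLE."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def complete_middle(model, tokenizer, prefix: str, suffix: str,
                    max_new_tokens: int = 64) -> str:
    """Generate the code that belongs between `prefix` and `suffix`."""
    prompt = build_fim_prompt(prefix, suffix)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the generated tokens, dropping the prompt and special tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, with `prefix` set to the opening of a `__global__` kernel up to the index computation and `suffix` set to its closing brace, `complete_middle` should return only the kernel body. Passing the loaded model and tokenizer in, rather than loading them inside the helper, avoids reloading the four weight shards on every call.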
| File | Checksum |
| --- | --- |
| added_tokens.json | 96ed1ea71e229005252045d11fe88c7c0c8abca160cbf24727e14cc056858e94 |
| chat_template.jinja | 72e8302eb275e75b189b6a9ebfaa7aaba1b868c610b169b3b16c5cae2172148e |
| config.json | 22a9823e5d36a21076314bf919d5d5435a8e6b45b36cf9dd8e5a23c5e78d4bf1 |
| generation_config.json | 7380537b889068754f67c0fd23906b11c8f9736ed1c4a0a305992e49b1d22956 |
| merges.txt | d001c4a60d63753a91a47bc85f6469b3faf91d3c7ebf8531b0a88f465fbc6306 |
| model-00001-of-00004.safetensors | 5d5c0be20e4570a164d5f88f6bbeffca0bcf94a245ae1d2d1a8111e740d84d22 |
| model-00002-of-00004.safetensors | 2ae8c603d21803be61471d983f7888e1649205c16b2f890283c05edbd4cc5d71 |
| model-00003-of-00004.safetensors | f32c102974dfe0ba8f4175dae6536458f30849ea12d89f4c15d16beb934cc089 |
| model-00004-of-00004.safetensors | f2ab71f4b789df888e845019dd3b5cc0e1dadcab15db379c95c3f4be6c1783dd |
| model.safetensors.index.json | 8bb7e83d9e06c0060bccae5e5e57cdb10d8f933a9cf70a167d8942b49902742e |
| special_tokens_map.json | 32b6f0d59a0ac7b647206ea5192d1fc9e793e8276673c1fac303bf5eb313d4df |
| tokenizer_config.json | 26e6526ccad066391561115f1cfec013f4cbbdacfd12ad17eadb52e29d94efa1 |
| tokenizer.json | 368590b202a77298b41ef72edfc9e0db08d6f642df30ebc81507534b6ffc81db |
| vocab.json | 8837532945f70ff1bef751f8011ba713b2d119f4f58ff3765a6fb24001d6c5fe |
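The listed checksums can be used to confirm that a local download of the repository is complete and uncorrupted. The sketch below assumes each 64-hex-digit value is a SHA-256 digest of the file's raw bytes (the length matches SHA-256, but the page does not say so explicitly). The helper names `sha256_of` and `verify` are hypothetical; files are hashed in 1 MiB chunks so the multi-gigabyte safetensors shards are never read into memory at once.

```python
import hashlib
from pathlib import Path

# Assumption: these values (copied from the file table) are SHA-256 digests.
EXPECTED_SHA256 = {
    "added_tokens.json": "96ed1ea71e229005252045d11fe88c7c0c8abca160cbf24727e14cc056858e94",
    "chat_template.jinja": "72e8302eb275e75b189b6a9ebfaa7aaba1b868c610b169b3b16c5cae2172148e",
    "config.json": "22a9823e5d36a21076314bf919d5d5435a8e6b45b36cf9dd8e5a23c5e78d4bf1",
    "generation_config.json": "7380537b889068754f67c0fd23906b11c8f9736ed1c4a0a305992e49b1d22956",
    "merges.txt": "d001c4a60d63753a91a47bc85f6469b3faf91d3c7ebf8531b0a88f465fbc6306",
    "model-00001-of-00004.safetensors": "5d5c0be20e4570a164d5f88f6bbeffca0bcf94a245ae1d2d1a8111e740d84d22",
    "model-00002-of-00004.safetensors": "2ae8c603d21803be61471d983f7888e1649205c16b2f890283c05edbd4cc5d71",
    "model-00003-of-00004.safetensors": "f32c102974dfe0ba8f4175dae6536458f30849ea12d89f4c15d16beb934cc089",
    "model-00004-of-00004.safetensors": "f2ab71f4b789df888e845019dd3b5cc0e1dadcab15db379c95c3f4be6c1783dd",
    "model.safetensors.index.json": "8bb7e83d9e06c0060bccae5e5e57cdb10d8f933a9cf70a167d8942b49902742e",
    "special_tokens_map.json": "32b6f0d59a0ac7b647206ea5192d1fc9e793e8276673c1fac303bf5eb313d4df",
    "tokenizer_config.json": "26e6526ccad066391561115f1cfec013f4cbbdacfd12ad17eadb52e29d94efa1",
    "tokenizer.json": "368590b202a77298b41ef72edfc9e0db08d6f642df30ebc81507534b6ffc81db",
    "vocab.json": "8837532945f70ff1bef751f8011ba713b2d119f4f58ff3765a6fb24001d6c5fe",
}


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(directory) -> dict:
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_of(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Pointing `verify` at the local snapshot directory and checking that every value is `"ok"` gives a quick integrity check before loading the model.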