=== [qwen3.5-2b-justgfos-nothink-1116] Vast.ai Instance Setup ===
Tue Mar 24 23:48:52 UTC 2026
Installing unsloth (preserving torch)...
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth 2026.3.11 requires diffusers, which is not installed.
unsloth 2026.3.11 requires nest-asyncio, which is not installed.
unsloth 2026.3.11 requires pydantic, which is not installed.
unsloth 2026.3.11 requires xformers>=0.0.27.post2; ("linux" in sys_platform or sys_platform == "win32") and (platform_machine == "AMD64" or platform_machine == "x86_64"), which is not installed.
unsloth-zoo 2026.3.5 requires msgspec, which is not installed.
unsloth-zoo 2026.3.5 requires torchao>=0.13.0, which is not installed.
unsloth 2026.3.11 requires datasets!=4.0.*,!=4.1.0,<4.4.0,>=3.4.1, but you have datasets 4.8.4 which is incompatible.
unsloth 2026.3.11 requires trl!=0.19.0,<=0.24.0,>=0.18.2, but you have trl 0.29.1 which is incompatible.
unsloth-zoo 2026.3.5 requires datasets!=4.0.*,!=4.1.0,<4.4.0,>=3.4.1, but you have datasets 4.8.4 which is incompatible.
unsloth-zoo 2026.3.5 requires trl!=0.19.0,<=0.24.0,>=0.18.2, but you have trl 0.29.1 which is incompatible.
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Verifying install...
torch=2.6.0+cu124 cuda=True gpu=NVIDIA H100 80GB HBM3
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Your Flash Attention 2 installation seems to be broken. Using Xformers instead. No performance changes will be seen.
🦥 Unsloth Zoo will now patch everything to make training faster!
unsloth OK
=== Starting Training ===
=== qwen3.5-2b-justgfos-nothink-1116: Loading Unsloth ===
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Your Flash Attention 2 installation seems to be broken. Using Xformers instead. No performance changes will be seen.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.3.11: Fast Qwen3_5 patching. Transformers: 5.3.0.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.205 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
Loading weights: 100%|██████████| 617/617 [00:01<00:00, 373.81it/s]
  GPU: NVIDIA H100 80GB HBM3
  VRAM: 85.0 GB
  VRAM after load: 4.4 GB
Applying LoRA adapters...
Unsloth: Making `model.base_model.model.model.language_model` require gradients
trainable params: 44,034,048 || all params: 2,257,275,712 || trainable%: 1.9508
Loading dataset from HuggingFace Hub...
Generating train split: 131 examples [00:00, 1528.86 examples/s]
Generating test split: 23 examples [00:00, 1468.42 examples/s]
  131 train examples, 23 eval examples loaded
Formatting chat templates...
Map: 100%|██████████| 131/131 [00:00<00:00, 1548.59 examples/s]
Map: 100%|██████████| 23/23 [00:00<00:00, 1422.70 examples/s]
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
Unsloth: Tokenizing ["text"] (num_proc=1): 100%|██████████| 131/131 [00:09<00:00, 13.76 examples/s]
Unsloth: Tokenizing ["text"] (num_proc=1): 100%|██████████| 23/23 [00:03<00:00,  7.35 examples/s]
Map (num_proc=64): 100%|██████████| 131/131 [00:04<00:00, 30.19 examples/s]
Filter (num_proc=64): 100%|██████████| 131/131 [00:04<00:00, 28.37 examples/s]
num_proc must be <= 23. Reducing num_proc to 23 for dataset of size 23.
Map (num_proc=23): 100%|██████████| 23/23 [00:02<00:00, 11.35 examples/s]
num_proc must be <= 23. Reducing num_proc to 23 for dataset of size 23.
Filter (num_proc=23): 100%|██████████| 23/23 [00:01<00:00, 13.14 examples/s]
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 248046}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 131 | Num Epochs = 4 | Total steps = 36
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 2 x 1) = 16
 "-____-"     Trainable parameters = 44,034,048 of 2,257,275,712 (1.95% trained)

=== Training qwen3.5-2b-justgfos-nothink-1116 ===
  0%|          | 0/36 [00:00<?, ?it/s]
Unsloth: Will smartly offload gradients to save VRAM!
Traceback (most recent call last):
  File "/workspace/train_unsloth.py", line 124, in <module>
    stats = trainer.train()
            ^^^^^^^^^^^^^^^
  File "/root/unsloth_compiled_cache/UnslothSFTTrainer.py", line 68, in wrapper
    output = f(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1424, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 81, in _fast_inner_training_loop
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1734, in _run_epoch
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/unsloth_compiled_cache/UnslothSFTTrainer.py", line 1389, in training_step
    return super().training_step(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 68, in _unsloth_training_step
  File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2838, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/opt/conda/lib/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/unsloth_zoo/gradient_checkpointing.py", line 612, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/opt/conda/lib/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/opt/conda/lib/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1710, in backward
    return impl_fn()
           ^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1700, in impl_fn
    out = CompiledFunction._backward_impl(ctx, all_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2065, in _backward_impl
    out = call_func_at_runtime_with_args(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 466, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/utils.py", line 2128, in run
    return model(new_inputs)
           ^^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_root/gg/cgg4ncnprssvz5mer65fcitwk5y2e6sa7scujqtkbr7ehlyjhhpe.py", line 807, in call
    buf27 = empty_strided_cuda((s0, s1, 6144), (6144*s1, 6144, 1), torch.bfloat16)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.67 GiB. GPU 0 has a total capacity of 79.20 GiB of which 1.24 GiB is free. Process 6584 has 5.01 GiB memory in use. Process 2899144 has 72.90 GiB memory in use. Of the allocated memory 71.93 GiB is allocated by PyTorch, and 179.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
=== SETUP/TRAINING FAILED (exit code 1) ===