/opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
/opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
/opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
/opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
/opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
/opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
/opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
/opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
2026-02-16 17:54:28,960 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2026-02-16 17:54:29,068 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-16 17:54:29 [__init__.py:239] Automatically detected platform cuda.
2026-02-16 17:54:29,101 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2026-02-16 17:54:29,112 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-16 17:54:29 [__init__.py:239] Automatically detected platform cuda.
2026-02-16 17:54:29,195 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-16 17:54:29 [__init__.py:239] Automatically detected platform cuda.
2026-02-16 17:54:29,214 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-16 17:54:29 [__init__.py:239] Automatically detected platform cuda.
INFO 02-16 17:54:29 [__init__.py:239] Automatically detected platform cuda.
2026-02-16 17:54:29,320 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2026-02-16 17:54:29,322 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
INFO 02-16 17:54:29 [__init__.py:239] Automatically detected platform cuda.
INFO 02-16 17:54:29 [__init__.py:239] Automatically detected platform cuda.
INFO 02-16 17:54:29 [__init__.py:239] Automatically detected platform cuda.
Xattention Import Fail
Xattention Import Fail
Xattention Import Fail
Xattention Import Fail
Xattention Import Fail
Xattention Import Fail
Xattention Import Fail
Xattention Import Fail
02/16/2026 17:54:32 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
02/16/2026 17:54:32 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False
02/16/2026 17:54:32 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:328] 2026-02-16 17:54:32,300 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
[WARNING|logging.py:328] 2026-02-16 17:54:32,353 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`.
From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [WARNING|logging.py:328] 2026-02-16 17:54:32,363 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. 
02/16/2026 17:54:32 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 17:54:32 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, attention_type=None, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, context_window_if_toggled=2048, cuda_empty_cache=True, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=1, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_linear_regularization_term=False, disable_tqdm=True, do_eval=False, do_predict=False, do_train=True, enable_ada_sparsity=True, enable_contrastive_loss=False, enable_lambda_task=True, enable_layerwise_sparsity=False, end_head_sparsity=0.35, erank_analysis_path=/, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=None, eval_strategy=IntervalStrategy.NO, eval_use_gather_object=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, freeze_mask_parameters=False, freeze_non_mask_parameters=True, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=6, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, 
hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, layerwise_sparsity_max_ratio=1.0, layerwise_sparsity_min_ratio=0.75, layerwise_sparsity_power=1.0, layerwise_sparsity_schedule=high-low-high, layerwise_sparsity_weight=1.0, learning_rate=1e-05, length_column_name=length, load_best_model_at_end=False, load_masks_from=None, load_masks_sparsity=None, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/runs/Feb16_17-54-32_pod-1436390550395908096, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, mask_learning_rate=0.0005, max_grad_norm=5.0, max_steps=300, metric_for_best_model=None, min_lr_ratio=1e-07, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, ordered=False, output_dir=checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, pooling_mode=ctx_q, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, reg_learning_rate=0.001, remove_unused_columns=False, report_to=['swanlab'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, retrieval_mode=full, run_name=2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B, save_on_each_node=False, save_only_model=False, save_safetensors=True, 
save_steps=100, save_strategy=SaveStrategy.STEPS, save_total_limit=3, seed=42, seq_parallel_size=2, sink_size=128, skip_memory_metrics=True, sparsity_warmup_ratio=0.0, start_head_sparsity=0.0, streaming_dataset=True, stripe_init_start_with_keep=False, stripe_init_width_1=None, stripe_init_width_2=None, tf32=None, toggle_type=xattn, topk_k=2048, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, use_softmax=True, use_task_emb_for_mask=False, warmup_ratio=0.2, warmup_steps=0, warmup_type=linear, weight_decay=0.1, ) 02/16/2026 17:54:32 - INFO - __main__ - Additional arguments ScriptArguments(model_name_or_path='/workspace/mnt/lcm_lab/hf_models/Qwen3-8B', config_overrides=None, config_overrides_json='', config_name=None, tokenizer_name='/workspace/mnt/lcm_lab/hf_models/Qwen3-8B', cache_dir=None, use_fast_tokenizer=False, model_revision='main', use_auth_token=False, use_thinking=False, should_log_loss=True, token_scaled_loss=False, tokenized_mds_train=['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6'], tokenized_mds_validation=[], tokenized_mds_test=[]) 02/16/2026 17:54:32 - INFO - __main__ - Data arguments PackedDataArguments(single_seq=False, subsplit_length=None, per_device_max_tokens=65536, apply_instruct_masks=False, prepack=False, streaming=False, min_seq_len=1000, task_type='sft', use_packing=False, data_cache_dir='/workspace/mnt/lcm_lab/qqt/public_data/data_cache', preprocessing_num_workers=32, suffix='qwen3-4b_new_1200') [INFO|tokenization_utils_base.py:2058] 2026-02-16 17:54:32,504 >> loading file vocab.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 17:54:32,505 >> loading file merges.txt [INFO|tokenization_utils_base.py:2058] 2026-02-16 17:54:32,505 >> loading file added_tokens.json 
[INFO|tokenization_utils_base.py:2058] 2026-02-16 17:54:32,505 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 17:54:32,505 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 17:54:32,505 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 17:54:32,505 >> loading file chat_template.jinja 02/16/2026 17:54:32 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 17:54:32 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 17:54:32 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 17:54:32 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False [INFO|tokenization_utils_base.py:2323] 2026-02-16 17:54:32,687 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[INFO|configuration_utils.py:691] 2026-02-16 17:54:32,687 >> loading configuration file /workspace/mnt/lcm_lab/hf_models/Qwen3-8B/config.json [INFO|configuration_utils.py:765] 2026-02-16 17:54:32,688 >> Model config PawQwen3Config { "architectures": [ "Qwen3ForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 151643, "disable_linear_regularization_term": false, "enable_ada_sparsity": true, "enable_lambda_task": true, "enable_layerwise_sparsity": false, "eos_token_id": 151645, "erank_analysis_path": "/", "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 12288, "layerwise_sparsity_max_ratio": 1.0, "layerwise_sparsity_min_ratio": 0.5, "layerwise_sparsity_power": 1.0, "layerwise_sparsity_schedule": "high-low-high", "layerwise_sparsity_weight": 1.0, "local_window_size": 2048, "max_position_embeddings": 262144, "max_window_layers": 36, "model_type": "qwen3", "num_attention_heads": 32, "num_hidden_layers": 36, "num_key_value_heads": 8, "pooling_mode": "ctx_q", "pooling_seq": true, "retrieval_mode": "full", "rms_norm_eps": 1e-06, "rope_scaling": { "factor": 8.0, "original_max_position_embeddings": 40960, "rope_type": "yarn", "type": "yarn" }, "rope_theta": 1000000, "sink_size": 128, "sliding_window": null, "suggested_sparsity": null, "tie_word_embeddings": false, "toggle_type": "xattn", "topk_k": 2048, "torch_dtype": "bfloat16", "transformers_version": "4.51.1", "triangle_n_last": 128, "use_cache": true, "use_sliding_window": false, "use_softmax": true, "use_task_emb_for_mask": false, "vocab_size": 151936 } [INFO|modeling_utils.py:1121] 2026-02-16 17:54:32,689 >> loading weights file /workspace/mnt/lcm_lab/hf_models/Qwen3-8B/model.safetensors.index.json [WARNING|logging.py:328] 2026-02-16 17:54:32,694 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. 
From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [INFO|configuration_utils.py:1142] 2026-02-16 17:54:32,695 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645, "pad_token_id": 0 } [WARNING|logging.py:328] 2026-02-16 17:54:32,787 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [WARNING|logging.py:328] 2026-02-16 17:54:32,802 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. 
From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [WARNING|logging.py:328] 2026-02-16 17:54:32,805 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [WARNING|logging.py:328] 2026-02-16 17:54:32,809 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. 
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. Loading checkpoint shards: 0%| | 0/5 [00:00> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 
'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 
'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 
'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 
'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
[rank5]:[W216 17:54:36.287881127 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.54it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.41it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 17:54:36,490 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 
'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 
'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 
'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 
'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[rank1]:[W216 17:54:36.366685180 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 60%|██████ | 3/5 [00:02<00:01, 1.05it/s]
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.12it/s]
Loading checkpoint shards: 60%|██████ | 3/5 [00:02<00:01, 1.02it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.52it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.31it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 17:54:36,846 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight',
'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 
'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 
'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 
'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 
'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[rank6]:[W216 17:54:36.719455121 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.22it/s]
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.22it/s]
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.26it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.66it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.38it/s]
[INFO|modeling_utils.py:4930] 2026-02-16 17:54:37,081 >> All model checkpoint weights were used when initializing PawQwen3ForCausalLM.
[WARNING|modeling_utils.py:4932] 2026-02-16 17:54:37,081 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 
'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 
'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 
'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 
'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 
'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
[INFO|configuration_utils.py:1095] 2026-02-16 17:54:37,086 >> loading configuration file /workspace/mnt/lcm_lab/hf_models/Qwen3-8B/generation_config.json
[INFO|configuration_utils.py:1142] 2026-02-16 17:54:37,087 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "enable_contrastive_loss": false,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "temperature": 0.6,
  "top_k": 20,
  "top_p": 0.95
}
02/16/2026 17:54:37 - INFO - __main__ - Model: PawQwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-35): 36 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): Qwen3RotaryEmbedding()
          (distributed_attn_func): DistributedAttention()
          (mask_allocator): AttentionRouter(
            (cls_feat_extractor): Sequential(
              (0): Linear(in_features=128, out_features=1024, bias=True)
              (1): SiLU()
              (2): Linear(in_features=1024, out_features=256, bias=True)
            )
            (cls_router_head_agnostic): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): SiLU()
              (2): Linear(in_features=512, out_features=128, bias=True)
              (3): SiLU()
              (4): Linear(in_features=128, out_features=2, bias=True)
            )
          )
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((4096,), eps=1e-06)
    (rotary_emb): Qwen3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)
[rank0]:[W216 17:54:37.953936769 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.65it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.37it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 17:54:37,164 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight',
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 
'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 
'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 
'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 
'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 
'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
[rank4]:[W216 17:54:37.036166049 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.21it/s] Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.71it/s] Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.40it/s] [WARNING|modeling_utils.py:4932] 2026-02-16 17:54:37,230 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 
'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 
'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 
'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 
'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 
'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[rank7]:[W216 17:54:37.102946081 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.18it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.65it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.36it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 17:54:37,441 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 
'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 
'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 
'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 
'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 
'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
[rank2]:[W216 17:54:37.313219201 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.62it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.31it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 17:54:37,603 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 
'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 
'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 
'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 
'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
[rank3]:[W216 17:54:37.475028267 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] *************** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* Using custom data configuration default-4687dca96c3d2fe4 02/16/2026 17:54:39 - INFO - datasets.builder - Using custom data configuration default-4687dca96c3d2fe4 Loading Dataset Infos from /opt/conda/envs/qqt/lib/python3.11/site-packages/datasets/packaged_modules/parquet 02/16/2026 17:54:39 - INFO - datasets.info - Loading Dataset Infos from /opt/conda/envs/qqt/lib/python3.11/site-packages/datasets/packaged_modules/parquet Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Overwrite dataset info from restored data version if exists. 02/16/2026 17:54:39 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6
02/16/2026 17:54:39 - INFO - datasets.info - Loading Dataset info from /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6
Found cached dataset parquet (/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6)
02/16/2026 17:54:39 - INFO - datasets.builder - Found cached dataset parquet (/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6)
Loading Dataset info from /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6
02/16/2026 17:54:39 - INFO - datasets.info - Loading Dataset info from /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6
Extracting 'length' from metadata for sorting...
Extracting 'length' from metadata for sorting...
📉 Sorting data by length in ascending order...
📉 Sorting data by length in ascending order...
📉 Sorting data by length in ascending order...
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ***
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ***
🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet
🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ***
🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet
Process #0 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00000_of_00032.arrow
02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #0 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00000_of_00032.arrow
Process #1 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00001_of_00032.arrow
02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #1 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00001_of_00032.arrow
Process #2 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00002_of_00032.arrow
02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #2 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00002_of_00032.arrow Process #3 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00003_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #3 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00003_of_00032.arrow Process #4 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00004_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #4 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00004_of_00032.arrow Process #5 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00005_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #5 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00005_of_00032.arrow Process #6 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00006_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #6 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00006_of_00032.arrow Process #7 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00007_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #7 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00007_of_00032.arrow Process #8 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00008_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #8 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00008_of_00032.arrow Process #9 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00009_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #9 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00009_of_00032.arrow Process #10 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00010_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #10 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00010_of_00032.arrow Process #11 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00011_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #11 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00011_of_00032.arrow Process #12 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00012_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #12 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00012_of_00032.arrow Process #13 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00013_of_00032.arrow 02/16/2026 17:54:39 - INFO - 
datasets.arrow_dataset - Process #13 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00013_of_00032.arrow Process #14 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00014_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #14 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00014_of_00032.arrow Process #15 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00015_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #15 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00015_of_00032.arrow Process #16 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00016_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #16 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00016_of_00032.arrow Process #17 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00017_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #17 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00017_of_00032.arrow Process #18 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00018_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #18 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00018_of_00032.arrow Process #19 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00019_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #19 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00019_of_00032.arrow Process #20 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00020_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #20 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00020_of_00032.arrow Process #21 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00021_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #21 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00021_of_00032.arrow Process #22 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00022_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #22 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00022_of_00032.arrow Process #23 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00023_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #23 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00023_of_00032.arrow Process #24 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00024_of_00032.arrow 02/16/2026 17:54:39 - INFO - 
datasets.arrow_dataset - Process #24 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00024_of_00032.arrow Process #25 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00025_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #25 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00025_of_00032.arrow Process #26 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00026_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #26 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00026_of_00032.arrow Process #27 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00027_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #27 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00027_of_00032.arrow Process #28 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00028_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #28 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00028_of_00032.arrow Process #29 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00029_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #29 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00029_of_00032.arrow Process #30 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00030_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #30 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00030_of_00032.arrow Process #31 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00031_of_00032.arrow 02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Process #31 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00031_of_00032.arrow
📉 Sorting data by length in ascending order...
📉 Sorting data by length in ascending order...
📉 Sorting data by length in ascending order...
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ***
🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ***
🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ***
🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet
Loading cached processed dataset at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_*_of_00032.arrow
02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_*_of_00032.arrow
Concatenating 32 shards
02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Concatenating 32 shards
📉 Sorting data by length in ascending order...
Loading cached sorted indices for dataset at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-4808dacc7d21da3a.arrow
02/16/2026 17:54:39 - INFO - datasets.arrow_dataset - Loading cached sorted indices for dataset at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-4808dacc7d21da3a.arrow
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ***
🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet
📉 Sorting data by length in ascending order...
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ***
🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet
✅ Successfully loaded Parquet cache! Contains 22790 sequences.
/workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  super().__init__(
[WARNING|lh_trainer.py:363] 2026-02-16 17:54:40,467 >> Handler registered
✅ Successfully loaded Parquet cache! Contains 22790 sequences.
/workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  super().__init__(
✅ Successfully loaded Parquet cache! Contains 22790 sequences.
/workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
super().__init__( [WARNING|lh_trainer.py:363] 2026-02-16 17:54:40,508 >> Handler registered ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|lh_trainer.py:363] 2026-02-16 17:54:40,515 >> Handler registered [WARNING|lh_trainer.py:363] 2026-02-16 17:54:40,523 >> Handler registered ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|lh_trainer.py:363] 2026-02-16 17:54:40,540 >> Handler registered ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( Using custom data configuration default-625db63a2d7df432 02/16/2026 17:54:40 - INFO - datasets.builder - Using custom data configuration default-625db63a2d7df432 Loading Dataset Infos from /opt/conda/envs/qqt/lib/python3.11/site-packages/datasets/packaged_modules/parquet 02/16/2026 17:54:40 - INFO - datasets.info - Loading Dataset Infos from /opt/conda/envs/qqt/lib/python3.11/site-packages/datasets/packaged_modules/parquet Overwrite dataset info from restored data version if exists. 02/16/2026 17:54:40 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 02/16/2026 17:54:40 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6) 02/16/2026 17:54:40 - INFO - datasets.builder - Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6) Loading Dataset info from /root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 02/16/2026 17:54:40 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 [WARNING|lh_trainer.py:363] 2026-02-16 17:54:40,554 >> Handler registered ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( 02/16/2026 17:54:40 - WARNING - accelerate.utils.other - Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:698] 2026-02-16 17:54:40,570 >> max_steps is given, it will override any value given in num_train_epochs [INFO|trainer.py:748] 2026-02-16 17:54:40,571 >> Using auto half precision backend [INFO|lh_trainer.py:486] 2026-02-16 17:54:40,571 >> Initializing sequence parallel groups with size 2 [WARNING|lh_trainer.py:363] 2026-02-16 17:54:40,572 >> Handler registered 02/16/2026 17:54:40 - INFO - __main__ - Successfully injected CustomDistributedStratifiedSampler into Trainer. ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|lh_trainer.py:363] 2026-02-16 17:54:40,588 >> Handler registered [INFO|lh_trainer.py:931] 2026-02-16 17:54:57,994 >> Optimizing 400 parameters. [INFO|lh_trainer.py:932] 2026-02-16 17:54:57,995 >> Optimized parameters list: ['_fsdp_wrapped_module.model.sparsity_lambda_1', '_fsdp_wrapped_module.model.sparsity_lambda_2', '_fsdp_wrapped_module.model.sparsity_lambda1_task', '_fsdp_wrapped_module.model.sparsity_lambda2_task', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias',
'_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', 
'_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 
'_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 
'_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', 
'_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', 
'_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 
'_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 
'_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', 
'_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias'] [INFO|trainer.py:2414] 2026-02-16 17:54:58,008 >> ***** Running training ***** [INFO|trainer.py:2415] 2026-02-16 17:54:58,008 >> Num examples = 22,790 [INFO|trainer.py:2416] 2026-02-16 17:54:58,008 >> Num Epochs = 1 [INFO|trainer.py:2417] 2026-02-16 17:54:58,008 >> 
Instantaneous batch size per device = 1 [INFO|trainer.py:2420] 2026-02-16 17:54:58,008 >> Total train batch size (w. parallel, distributed & accumulation) = 48 [INFO|trainer.py:2421] 2026-02-16 17:54:58,008 >> Gradient Accumulation steps = 6 [INFO|trainer.py:2422] 2026-02-16 17:54:58,008 >> Total optimization steps = 300 [INFO|trainer.py:2423] 2026-02-16 17:54:58,010 >> Number of trainable parameters = 300 [INFO|integration_utils.py:2218] 2026-02-16 17:54:59,010 >> Automatic SwanLab logging enabled, to disable set os.environ["SWANLAB_MODE"] = "disabled" swanlab: swanlab version 0.7.8 is available! Upgrade: `pip install -U swanlab` swanlab: Tracking run with swanlab version 0.6.8 swanlab: Run data will be saved locally in /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/run-20260216_175500-no4wlvguk9e1zebtqptin swanlab: 👋 Hi qqtang,welcome to swanlab! swanlab: Syncing run 2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B to the cloud swanlab: 🏠 View project at https://swanlab.cn/@qqtang/NIPS swanlab: 🚀 View run at https://swanlab.cn/@qqtang/NIPS/runs/no4wlvguk9e1zebtqptin [Step 0 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24165, 24168] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42872] → Tgt Spa: ['1.000'] [Step 0 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [24426, 24427] → Tgt Spa: ['0.350', '0.350'] [Step 0 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [25347, 25356] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [25347, 25356] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42872] → Tgt Spa: ['1.000'] [Step 0 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24165, 24168] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: 
[24426, 24427] → Tgt Spa: ['0.350', '0.350'] /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/autograd/graph.py:823: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.) 
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [Step 0 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [16134, 16134, 16134, 16135] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 0 / Rank 5] Tasks: ['Single QA'] | Lens: [45651] → Tgt Spa: ['0.350'] [Step 0 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [16134, 16134, 16134, 16135] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 0 / Rank 6] Tasks: ['Code', 'Code', 'Single QA', 'Single QA'] | Lens: [15479, 15481, 15474, 15475] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350'] [Step 0 / Rank 4] Tasks: ['Single QA'] | Lens: [45651] → Tgt Spa: ['0.350'] [Step 0 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39763] → Tgt Spa: ['1.000'] [Step 0 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39763] → Tgt Spa: ['1.000'] [Step 0 / Rank 7] Tasks: ['Code', 'Code', 'Single QA', 'Single QA'] | Lens: [15479, 15481, 15474, 15475] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350'] [Step 0 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23382, 23383] → Tgt Spa: ['1.000', '0.350'] [Step 0 / Rank 5] Tasks: ['Single QA'] | Lens: [43150] → Tgt Spa: ['0.350'] [Step 0 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23382, 23383] → Tgt Spa: ['1.000', '0.350'] [Step 0 / Rank 1] Tasks: ['Single QA'] | Lens: [41556] → Tgt Spa: ['0.350'] [Step 0 / Rank 3] Tasks: ['Single QA'] | Lens: [51575] → Tgt Spa: ['0.350'] [Step 0 / Rank 2] Tasks: ['Single QA'] | Lens: [51575] → Tgt Spa: ['0.350'] [Step 0 / Rank 0] Tasks: ['Single QA'] | Lens: [41556] → Tgt Spa: ['0.350'] [Step 0 / Rank 4] Tasks: ['Single QA'] | Lens: [43150] → Tgt Spa: ['0.350'] [Step 0 / Rank 3] Tasks: ['Single QA'] | Lens: [34479] → Tgt Spa: ['0.350'] [Step 0 / Rank 7] Tasks: ['Single QA'] | Lens: [38927] → Tgt Spa: ['0.350'] [Step 0 / Rank 2] Tasks: ['Single QA'] | Lens: [34479] → Tgt Spa: ['0.350'] [Step 0 / 
Rank 0] Tasks: ['Single QA'] | Lens: [51032] → Tgt Spa: ['0.350'] [Step 0 / Rank 5] Tasks: ['Single QA'] | Lens: [56321] → Tgt Spa: ['0.350'] [Step 0 / Rank 6] Tasks: ['Single QA'] | Lens: [38927] → Tgt Spa: ['0.350'] [Step 0 / Rank 4] Tasks: ['Single QA'] | Lens: [56321] → Tgt Spa: ['0.350'] [Step 0 / Rank 1] Tasks: ['Single QA'] | Lens: [51032] → Tgt Spa: ['0.350'] [Step 0 / Rank 1] Tasks: ['Code', 'Summarization'] | Lens: [29815, 29828] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 3] Tasks: ['Single QA'] | Lens: [57566] → Tgt Spa: ['0.350'] [Step 0 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20687, 20699, 20691] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 0 / Rank 4] Tasks: ['Code'] | Lens: [34432] → Tgt Spa: ['1.000'] [Step 0 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20687, 20699, 20691] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 0 / Rank 0] Tasks: ['Code', 'Summarization'] | Lens: [29815, 29828] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 2] Tasks: ['Single QA'] | Lens: [57566] → Tgt Spa: ['0.350'] [Step 0 / Rank 5] Tasks: ['Code'] | Lens: [34432] → Tgt Spa: ['1.000'] [Step 0 / Rank 7] Tasks: ['Code'] | Lens: [55182] → Tgt Spa: ['1.000'] [Step 0 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [25956, 25964] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59253] → Tgt Spa: ['1.000'] [Step 0 / Rank 1] Tasks: ['Single QA'] | Lens: [35029] → Tgt Spa: ['0.350'] [Step 0 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59253] → Tgt Spa: ['1.000'] [Step 0 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [25956, 25964] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 6] Tasks: ['Code'] | Lens: [55182] → Tgt Spa: ['1.000'] [Step 0 / Rank 0] Tasks: ['Single QA'] | Lens: [35029] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 17:57:11,072 >> @ 0 | Loss: 2.0407 | LM: 1.9309 | Reg: 0.1097 | Spa(Avg): 0.511 [INFO|lh_trainer.py:797] 2026-02-16 17:57:11,072 >> Statistic -> Code 
| Spa: 0.444 | Tgt: 1.000 | Z-Loss: 0.102 | [INFO|lh_trainer.py:797] 2026-02-16 17:57:11,073 >> Statistic -> In-Context | Spa: 0.469 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 17:57:11,073 >> Statistic -> MultiHop | Spa: 0.549 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 17:57:11,073 >> Statistic -> Single | Spa: 0.554 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 17:57:11,073 >> Statistic -> Summarization | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.078 | [INFO|lh_trainer.py:810] 2026-02-16 17:57:11,077 >> [Micro-Log] {"loss": 2.040656689244012, "lm_loss": 1.930921162556236, "reg_loss": 0.10973554306353132, "model_sparsity(avg)": 0.5108024626970291, "Spa-In-Context Learning sparsity": 0.46875, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11539438087493181, "Spa-Single QA sparsity": 0.5537036975224813, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.11222218051552772, "Spa-Code sparsity": 0.4444444311989678, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10175766630305184, "Spa-Summarization sparsity": 0.5833333730697632, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.07792153209447861, "Spa-MultiHop QA sparsity": 0.5486111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06407263409346342, "step": 0, "current_tau": 1.5, "lambda1 Single QA": 0.474609375, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.0380859375, "lambda4 Code": 0.1337890625} /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 6 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called 
FSDP.clip_grad_norm_() on rank 7 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 5 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 3 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 4 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 2 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( [INFO|lh_trainer.py:331] 2026-02-16 17:57:33,920 >> {'loss': 12.2439, 'grad_norm': 1.3188265562057495, 'learning_rate': 0.0, 'epoch': 0.00105318588730911, 'num_input_tokens_seen': 2363056, 'completed': '0.33% (1 / 300)', 'remaining time': '12:39:38', 'throughput': '6499.56', 'gpu_mem_free': '14197MB', 'step': 1} [Step 1 / Rank 6] Tasks: ['Single QA'] | Lens: [45544] → Tgt Spa: ['0.350'] [Step 1 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55295] → Tgt Spa: ['1.000'] [Step 1 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32126, 32126] → Tgt Spa: ['0.350', '0.350'] [Step 1 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55295] → Tgt Spa: ['1.000'] [Step 1 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32126, 32126] → Tgt Spa: ['0.350', '0.350'] [Step 1 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43923] → 
Tgt Spa: ['1.000'] [Step 1 / Rank 7] Tasks: ['Single QA'] | Lens: [45544] → Tgt Spa: ['0.350'] [Step 1 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43923] → Tgt Spa: ['1.000'] [Step 1 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9086, 9086, 9086, 9086, 9087, 9087, 9094] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 1 / Rank 5] Tasks: ['Single QA', 'Summarization'] | Lens: [32560, 32582] → Tgt Spa: ['0.350', '1.000'] [Step 1 / Rank 3] Tasks: ['Single QA'] | Lens: [36042] → Tgt Spa: ['0.350'] [Step 1 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9086, 9086, 9086, 9086, 9087, 9087, 9094] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 1 / Rank 2] Tasks: ['Single QA'] | Lens: [36042] → Tgt Spa: ['0.350'] [Step 1 / Rank 4] Tasks: ['Single QA', 'Summarization'] | Lens: [32560, 32582] → Tgt Spa: ['0.350', '1.000'] [Step 1 / Rank 7] Tasks: ['Single QA'] | Lens: [49864] → Tgt Spa: ['0.350'] [Step 1 / Rank 6] Tasks: ['Single QA'] | Lens: [49864] → Tgt Spa: ['0.350'] [Step 1 / Rank 3] Tasks: ['Single QA'] | Lens: [62013] → Tgt Spa: ['0.350'] [Step 1 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20035, 20047, 20038] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 1 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56435] → Tgt Spa: ['1.000'] [Step 1 / Rank 0] Tasks: ['Single QA'] | Lens: [39696] → Tgt Spa: ['0.350'] [Step 1 / Rank 2] Tasks: ['Single QA'] | Lens: [62013] → Tgt Spa: ['0.350'] [Step 1 / Rank 1] Tasks: ['Single QA'] | Lens: [39696] → Tgt Spa: ['0.350'] [Step 1 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20035, 20047, 20038] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 1 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56435] → Tgt Spa: ['1.000'] [Step 1 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64245] → Tgt Spa: ['1.000'] 
[Step 1 / Rank 6] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1885, 1868, 1869, 1868, 1868, 1871, 1888, 1889, 1890, 1871, 1871, 1873, 1891, 1873, 1892, 1874, 1874, 1876, 1875, 1877, 1883, 1894, 1877, 1876, 1896, 1895, 1884, 1877, 1879, 1897, 1878, 1880, 1899, 1900] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 1 / Rank 7] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1885, 1868, 1869, 1868, 1868, 1871, 1888, 1889, 1890, 1871, 1871, 1873, 1891, 1873, 1892, 1874, 1874, 1876, 1875, 1877, 1883, 1894, 1877, 1876, 1896, 1895, 1884, 1877, 1879, 1897, 1878, 1880, 1899, 1900] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', 
'0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 1 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64245] → Tgt Spa: ['1.000'] [Step 1 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [27896, 27904] → Tgt Spa: ['1.000', '1.000'] [Step 1 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [27896, 27904] → Tgt Spa: ['1.000', '1.000'] [Step 1 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42672] → Tgt Spa: ['1.000'] [Step 1 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42672] → Tgt Spa: ['1.000'] [Step 1 / Rank 3] Tasks: ['Single QA'] | Lens: [48597] → Tgt Spa: ['0.350'] [Step 1 / Rank 7] Tasks: ['Code'] | Lens: [53240] → Tgt Spa: ['1.000'] [Step 1 / Rank 5] Tasks: ['Code'] | Lens: [61402] → Tgt Spa: ['1.000'] [Step 1 / Rank 2] Tasks: ['Single QA'] | Lens: [48597] → Tgt Spa: ['0.350'] [Step 1 / Rank 1] Tasks: ['Single QA'] | Lens: [58344] → Tgt Spa: ['0.350'] [Step 1 / Rank 0] Tasks: ['Single QA'] | Lens: [58344] → Tgt Spa: ['0.350'] [Step 1 / Rank 6] Tasks: ['Code'] | Lens: [53240] → Tgt Spa: ['1.000'] [Step 1 / Rank 4] Tasks: ['Code'] | Lens: [61402] → Tgt Spa: ['1.000'] [Step 1 / Rank 1] Tasks: ['Single QA'] | Lens: [39175] → Tgt Spa: ['0.350'] [Step 1 / Rank 6] Tasks: ['Single QA'] | Lens: [42585] → Tgt Spa: ['0.350'] [Step 1 / Rank 3] Tasks: ['Single QA'] | Lens: [65101] → Tgt Spa: ['0.350'] [Step 1 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [38877] → Tgt Spa: ['1.000'] [Step 1 / Rank 7] Tasks: ['Single QA'] | Lens: [42585] → Tgt Spa: ['0.350'] [Step 1 / Rank 2] Tasks: ['Single QA'] | Lens: [65101] → Tgt Spa: ['0.350'] [Step 1 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38877] → Tgt Spa: ['1.000'] [Step 1 / Rank 0] Tasks: ['Single QA'] | Lens: [39175] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 18:00:11,090 >> @ 1 | Loss: 2.2751 | LM: 2.1887 | Reg: 0.0864 | Spa(Avg): 0.492 [INFO|lh_trainer.py:797] 
2026-02-16 18:00:11,090 >> Statistic -> Code | Spa: 0.519 | Tgt: 1.000 | Z-Loss: 0.085 | [INFO|lh_trainer.py:797] 2026-02-16 18:00:11,090 >> Statistic -> In-Context | Spa: 0.488 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:00:11,090 >> Statistic -> MultiHop | Spa: 0.497 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:00:11,090 >> Statistic -> Single | Spa: 0.501 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:00:11,090 >> Statistic -> Summarization | Spa: 0.510 | Tgt: 1.000 | Z-Loss: 0.107 | [INFO|lh_trainer.py:810] 2026-02-16 18:00:11,092 >> [Micro-Log] {"loss": 2.275075492138664, "lm_loss": 2.1886735381558537, "reg_loss": 0.08640195491413276, "model_sparsity(avg)": 0.4915698903302352, "Spa-In-Context Learning sparsity": 0.48809523241860525, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11087532554353986, "Spa-Single QA sparsity": 0.5007309913635254, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07692126027847591, "Spa-Code sparsity": 0.519097238779068, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08456847909837961, "Spa-Summarization sparsity": 0.5099206353936877, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10697654421840395, "Spa-MultiHop QA sparsity": 0.49652778208255766, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04697803878225386, "step": 1, "current_tau": 1.4999618530273438, "lambda1 Single QA": 0.474609375, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.0380859375, "lambda4 Code": 0.1337890625} /opt/conda/envs/qqt/lib/python3.11/site-packages/tensor_parallel/imports.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. 
import pkg_resources 2026-02-16 18:21:13,498 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend 2026-02-16 18:21:13,543 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend INFO 02-16 18:21:13 [__init__.py:239] Automatically detected platform cuda. 2026-02-16 18:21:13,638 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend INFO 02-16 18:21:13 [__init__.py:239] Automatically detected platform cuda. 2026-02-16 18:21:13,697 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend 2026-02-16 18:21:13,701 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend 2026-02-16 18:21:13,729 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend INFO 02-16 18:21:13 [__init__.py:239] Automatically detected platform cuda. 2026-02-16 18:21:13,750 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend INFO 02-16 18:21:13 [__init__.py:239] Automatically detected platform cuda. INFO 02-16 18:21:13 [__init__.py:239] Automatically detected platform cuda. INFO 02-16 18:21:13 [__init__.py:239] Automatically detected platform cuda. INFO 02-16 18:21:13 [__init__.py:239] Automatically detected platform cuda. 2026-02-16 18:21:13,984 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend INFO 02-16 18:21:14 [__init__.py:239] Automatically detected platform cuda. 
Xattention Import Fail Xattention Import Fail Xattention Import Fail 02/16/2026 18:21:16 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 18:21:16 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, attention_type=None, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, context_window_if_toggled=2048, cuda_empty_cache=True, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=1, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_linear_regularization_term=False, disable_tqdm=True, do_eval=False, do_predict=False, do_train=True, enable_ada_sparsity=True, enable_contrastive_loss=False, enable_lambda_task=True, enable_layerwise_sparsity=False, end_head_sparsity=0.35, erank_analysis_path=/, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=None, eval_strategy=IntervalStrategy.NO, eval_use_gather_object=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, freeze_mask_parameters=False, freeze_non_mask_parameters=True, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=6, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, 
greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, layerwise_sparsity_max_ratio=1.0, layerwise_sparsity_min_ratio=0.75, layerwise_sparsity_power=1.0, layerwise_sparsity_schedule=high-low-high, layerwise_sparsity_weight=1.0, learning_rate=1e-05, length_column_name=length, load_best_model_at_end=False, load_masks_from=None, load_masks_sparsity=None, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/runs/Feb16_18-21-16_pod-1436390550395908096, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_kwargs={}, lr_scheduler_type=SchedulerType.COSINE, mask_learning_rate=0.0005, max_grad_norm=5.0, max_steps=300, metric_for_best_model=None, min_lr_ratio=1e-07, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=OptimizerNames.ADAMW_TORCH, optim_args=None, optim_target_modules=None, ordered=False, output_dir=checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, pooling_mode=ctx_q, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, reg_learning_rate=0.001, remove_unused_columns=False, report_to=['swanlab'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, retrieval_mode=full, run_name=2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B, 
save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=100, save_strategy=SaveStrategy.STEPS, save_total_limit=3, seed=42, seq_parallel_size=2, sink_size=128, skip_memory_metrics=True, sparsity_warmup_ratio=0.0, start_head_sparsity=0.0, streaming_dataset=True, stripe_init_start_with_keep=False, stripe_init_width_1=None, stripe_init_width_2=None, tf32=None, toggle_type=xattn, topk_k=2048, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tp_size=0, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, use_softmax=True, use_task_emb_for_mask=False, warmup_ratio=0.2, warmup_steps=0, warmup_type=linear, weight_decay=0.1, ) 02/16/2026 18:21:16 - INFO - __main__ - Additional arguments ScriptArguments(model_name_or_path='/workspace/mnt/lcm_lab/hf_models/Qwen3-8B', config_overrides=None, config_overrides_json='', config_name=None, tokenizer_name='/workspace/mnt/lcm_lab/hf_models/Qwen3-8B', cache_dir=None, use_fast_tokenizer=False, model_revision='main', use_auth_token=False, use_thinking=False, should_log_loss=True, token_scaled_loss=False, tokenized_mds_train=['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6'], tokenized_mds_validation=[], tokenized_mds_test=[]) 02/16/2026 18:21:16 - INFO - __main__ - Data arguments PackedDataArguments(single_seq=False, subsplit_length=None, per_device_max_tokens=65536, apply_instruct_masks=False, prepack=False, streaming=False, min_seq_len=1000, task_type='sft', use_packing=False, data_cache_dir='/workspace/mnt/lcm_lab/qqt/public_data/data_cache', preprocessing_num_workers=32, suffix='qwen3-4b_new_1200') [INFO|tokenization_utils_base.py:2058] 2026-02-16 18:21:16,307 >> loading file vocab.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 18:21:16,307 >> loading file merges.txt [INFO|tokenization_utils_base.py:2058] 2026-02-16 
18:21:16,307 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 18:21:16,307 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 18:21:16,307 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 18:21:16,307 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2058] 2026-02-16 18:21:16,307 >> loading file chat_template.jinja Xattention Import Fail Xattention Import Fail Xattention Import Fail [INFO|tokenization_utils_base.py:2323] 2026-02-16 18:21:16,491 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [INFO|configuration_utils.py:691] 2026-02-16 18:21:16,492 >> loading configuration file /workspace/mnt/lcm_lab/hf_models/Qwen3-8B/config.json [INFO|configuration_utils.py:765] 2026-02-16 18:21:16,493 >> Model config PawQwen3Config { "architectures": [ "Qwen3ForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 151643, "disable_linear_regularization_term": false, "enable_ada_sparsity": true, "enable_lambda_task": true, "enable_layerwise_sparsity": false, "eos_token_id": 151645, "erank_analysis_path": "/", "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 12288, "layerwise_sparsity_max_ratio": 1.0, "layerwise_sparsity_min_ratio": 0.5, "layerwise_sparsity_power": 1.0, "layerwise_sparsity_schedule": "high-low-high", "layerwise_sparsity_weight": 1.0, "local_window_size": 2048, "max_position_embeddings": 262144, "max_window_layers": 36, "model_type": "qwen3", "num_attention_heads": 32, "num_hidden_layers": 36, "num_key_value_heads": 8, "pooling_mode": "ctx_q", "pooling_seq": true, "retrieval_mode": "full", "rms_norm_eps": 1e-06, "rope_scaling": { "factor": 8.0, "original_max_position_embeddings": 40960, "rope_type": "yarn", "type": "yarn" }, "rope_theta": 1000000, "sink_size": 128, "sliding_window": 
null, "suggested_sparsity": null, "tie_word_embeddings": false, "toggle_type": "xattn", "topk_k": 2048, "torch_dtype": "bfloat16", "transformers_version": "4.51.1", "triangle_n_last": 128, "use_cache": true, "use_sliding_window": false, "use_softmax": true, "use_task_emb_for_mask": false, "vocab_size": 151936 } [INFO|modeling_utils.py:1121] 2026-02-16 18:21:16,495 >> loading weights file /workspace/mnt/lcm_lab/hf_models/Qwen3-8B/model.safetensors.index.json [WARNING|logging.py:328] 2026-02-16 18:21:16,499 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. 
[INFO|configuration_utils.py:1142] 2026-02-16 18:21:16,500 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645, "pad_token_id": 0 } Xattention Import Fail Xattention Import Fail 02/16/2026 18:21:16 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 18:21:16 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 18:21:16 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 18:21:17 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 18:21:17 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 18:21:17 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False 02/16/2026 18:21:17 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False [WARNING|logging.py:328] 2026-02-16 18:21:17,058 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). 
- If you are not the owner of the model architecture class, please contact the model code owner to update it. Loading checkpoint shards: 0%| | 0/5 [00:00> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [WARNING|logging.py:328] 2026-02-16 18:21:17,173 >> PawQwen3ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. 
Loading checkpoint shards: 20%|██ | 1/5 [00:00<00:01, 2.17it/s] Loading checkpoint shards: 0%| | 0/5 [00:00> All model checkpoint weights were used when initializing PawQwen3ForCausalLM. 
[WARNING|modeling_utils.py:4932] 2026-02-16 18:21:20,476 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 
'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 
'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 
'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 
'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 
'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
[INFO|configuration_utils.py:1095] 2026-02-16 18:21:20,482 >> loading configuration file /workspace/mnt/lcm_lab/hf_models/Qwen3-8B/generation_config.json
[INFO|configuration_utils.py:1142] 2026-02-16 18:21:20,482 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "enable_contrastive_loss": false,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "temperature": 0.6,
  "top_k": 20,
  "top_p": 0.95
}
02/16/2026 18:21:20 - INFO - __main__ - Model: PawQwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-35): 36 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): Qwen3RotaryEmbedding()
          (distributed_attn_func): DistributedAttention()
          (mask_allocator): AttentionRouter(
            (cls_feat_extractor): Sequential(
              (0): Linear(in_features=128, out_features=1024, bias=True)
              (1): SiLU()
              (2): Linear(in_features=1024, out_features=256, bias=True)
            )
            (cls_router_head_agnostic): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): SiLU()
              (2): Linear(in_features=512, out_features=128, bias=True)
              (3): SiLU()
              (4): Linear(in_features=128, out_features=2, bias=True)
            )
          )
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((4096,), eps=1e-06)
    (rotary_emb): Qwen3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)
[rank0]:[W216 18:21:20.349737953 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 60%|██████ | 3/5 [00:02<00:01, 1.10it/s]
Loading checkpoint shards: 60%|██████ | 3/5 [00:02<00:01, 1.14it/s]
Loading checkpoint shards: 60%|██████ | 3/5 [00:02<00:01, 1.08it/s]
Loading checkpoint shards: 60%|██████ | 3/5 [00:02<00:01, 1.12it/s]
Loading checkpoint shards: 60%|██████ | 3/5 [00:02<00:01, 1.11it/s]
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.23it/s]
Loading checkpoint shards: 60%|██████ | 3/5 [00:02<00:01, 1.08it/s]
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.27it/s]
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.23it/s]
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.23it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.65it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.41it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 18:21:21,257 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight',
'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 
'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 
'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 
'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 
'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 
'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[rank1]:[W216 18:21:21.129207342 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.24it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.70it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.45it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 18:21:21,385 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized:
['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias',
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 'model.layers.19.self_attn.attn_mask_log_alphas', 
'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 
'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 
'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 
'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [rank7]:[W216 18:21:21.256628738 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. 
Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.65it/s] Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.39it/s] [WARNING|modeling_utils.py:4932] 2026-02-16 18:21:21,403 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 
'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 
'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 
'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 
'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 
'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[rank4]:[W216 18:21:21.274789794 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.
Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.64it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.37it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 18:21:21,427 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight',
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 
'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 
'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 
'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 
'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 
'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [rank3]:[W216 18:21:21.297862501 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. 
Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.24it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.70it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.41it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 18:21:21,492 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 
'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 
'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 
'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 
'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 
'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[rank6]:[W216 18:21:21.363459493 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.70it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.40it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 18:21:21,652 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 'model.layers.12.self_attn.attn_mask_log_alphas', 
'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 
'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 
'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 
'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 
'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[rank5]:[W216 18:21:21.523431560 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. 
Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Loading checkpoint shards: 80%|████████ | 4/5 [00:03<00:00, 1.25it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.72it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.39it/s]
[WARNING|modeling_utils.py:4932] 2026-02-16 18:21:21,877 >> Some weights of PawQwen3ForCausalLM were not initialized from the model checkpoint at /workspace/mnt/lcm_lab/hf_models/Qwen3-8B and are newly initialized: ['model.layers.0.self_attn.attn_mask_log_alphas', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.0.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.0.self_attn.mask_allocator.log_temp', 'model.layers.1.self_attn.attn_mask_log_alphas', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.1.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.1.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.1.self_attn.mask_allocator.log_temp', 'model.layers.10.self_attn.attn_mask_log_alphas', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.10.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.10.self_attn.mask_allocator.log_temp', 'model.layers.11.self_attn.attn_mask_log_alphas', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.11.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.11.self_attn.mask_allocator.log_temp', 
'model.layers.12.self_attn.attn_mask_log_alphas', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.12.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.12.self_attn.mask_allocator.log_temp', 'model.layers.13.self_attn.attn_mask_log_alphas', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.13.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.13.self_attn.mask_allocator.log_temp', 'model.layers.14.self_attn.attn_mask_log_alphas', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.14.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.14.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.14.self_attn.mask_allocator.log_temp', 'model.layers.15.self_attn.attn_mask_log_alphas', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.15.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.15.self_attn.mask_allocator.log_temp', 'model.layers.16.self_attn.attn_mask_log_alphas', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.16.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.16.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.16.self_attn.mask_allocator.log_temp', 'model.layers.17.self_attn.attn_mask_log_alphas', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.17.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.17.self_attn.mask_allocator.log_temp', 'model.layers.18.self_attn.attn_mask_log_alphas', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.18.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.18.self_attn.mask_allocator.log_temp', 
'model.layers.19.self_attn.attn_mask_log_alphas', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.19.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.19.self_attn.mask_allocator.log_temp', 'model.layers.2.self_attn.attn_mask_log_alphas', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.2.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.2.self_attn.mask_allocator.log_temp', 'model.layers.20.self_attn.attn_mask_log_alphas', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.20.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.20.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.20.self_attn.mask_allocator.log_temp', 'model.layers.21.self_attn.attn_mask_log_alphas', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.21.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.21.self_attn.mask_allocator.log_temp', 'model.layers.22.self_attn.attn_mask_log_alphas', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.22.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.22.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.22.self_attn.mask_allocator.log_temp', 'model.layers.23.self_attn.attn_mask_log_alphas', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.23.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.23.self_attn.mask_allocator.log_temp', 'model.layers.24.self_attn.attn_mask_log_alphas', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.24.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.24.self_attn.mask_allocator.log_temp', 
'model.layers.25.self_attn.attn_mask_log_alphas', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.25.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.25.self_attn.mask_allocator.log_temp', 'model.layers.26.self_attn.attn_mask_log_alphas', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.26.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.26.self_attn.mask_allocator.log_temp', 'model.layers.27.self_attn.attn_mask_log_alphas', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.27.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.27.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.27.self_attn.mask_allocator.log_temp', 'model.layers.28.self_attn.attn_mask_log_alphas', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.28.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.28.self_attn.mask_allocator.log_temp', 'model.layers.29.self_attn.attn_mask_log_alphas', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.29.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.29.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.29.self_attn.mask_allocator.log_temp', 'model.layers.3.self_attn.attn_mask_log_alphas', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.3.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.3.self_attn.mask_allocator.log_temp', 'model.layers.30.self_attn.attn_mask_log_alphas', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.30.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.30.self_attn.mask_allocator.log_temp', 
'model.layers.31.self_attn.attn_mask_log_alphas', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.31.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.31.self_attn.mask_allocator.log_temp', 'model.layers.32.self_attn.attn_mask_log_alphas', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.32.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.32.self_attn.mask_allocator.log_temp', 'model.layers.33.self_attn.attn_mask_log_alphas', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.33.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.33.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.33.self_attn.mask_allocator.log_temp', 'model.layers.34.self_attn.attn_mask_log_alphas', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.34.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.34.self_attn.mask_allocator.log_temp', 'model.layers.35.self_attn.attn_mask_log_alphas', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.35.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.35.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.35.self_attn.mask_allocator.log_temp', 'model.layers.4.self_attn.attn_mask_log_alphas', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.4.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.4.self_attn.mask_allocator.log_temp', 'model.layers.5.self_attn.attn_mask_log_alphas', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.5.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.5.self_attn.mask_allocator.log_temp', 
'model.layers.6.self_attn.attn_mask_log_alphas', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.6.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.6.self_attn.mask_allocator.log_temp', 'model.layers.7.self_attn.attn_mask_log_alphas', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.7.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.7.self_attn.mask_allocator.log_temp', 'model.layers.8.self_attn.attn_mask_log_alphas', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'model.layers.8.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.8.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.8.self_attn.mask_allocator.log_temp', 'model.layers.9.self_attn.attn_mask_log_alphas', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_feat_extractor.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 'model.layers.9.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 'model.layers.9.self_attn.mask_allocator.log_temp', 'model.sparsity_lambda1_task', 'model.sparsity_lambda2_task', 'model.sparsity_lambda_1', 'model.sparsity_lambda_2'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [rank2]:[W216 18:21:21.749647734 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. 
Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] *************** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* ******** ['/workspace/mnt/lcm_lab/qqt/public_data/qwen_mix_sft_64K6/all.parquet'] ******* Using custom data configuration default-4687dca96c3d2fe4 02/16/2026 18:21:23 - INFO - datasets.builder - Using custom data configuration default-4687dca96c3d2fe4 Loading Dataset Infos from /opt/conda/envs/qqt/lib/python3.11/site-packages/datasets/packaged_modules/parquet 02/16/2026 18:21:23 - INFO - datasets.info - Loading Dataset Infos from /opt/conda/envs/qqt/lib/python3.11/site-packages/datasets/packaged_modules/parquet Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Extracting 'length' from metadata for sorting... Overwrite dataset info from restored data version if exists. 02/16/2026 18:21:23 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 02/16/2026 18:21:23 - INFO - datasets.info - Loading Dataset info from /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 📉 Sorting data by length in ascending order... 📉 Sorting data by length in ascending order... *** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** *** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** 🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet 🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet 📉 Sorting data by length in ascending order... Found cached dataset parquet (/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6) 02/16/2026 18:21:23 - INFO - datasets.builder - Found cached dataset parquet (/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6) Loading Dataset info from /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 02/16/2026 18:21:23 - INFO - datasets.info - Loading Dataset info from /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 📉 Sorting data by length in ascending order...
*** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** 🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** 🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet Extracting 'length' from metadata for sorting... 📉 Sorting data by length in ascending order... *** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** 🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet 📉 Sorting data by length in ascending order... *** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** 🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet Process #0 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00000_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #0 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00000_of_00032.arrow Process #1 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00001_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #1 will write at
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00001_of_00032.arrow Process #2 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00002_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #2 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00002_of_00032.arrow Process #3 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00003_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #3 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00003_of_00032.arrow Process #4 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00004_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #4 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00004_of_00032.arrow Process #5 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00005_of_00032.arrow 02/16/2026 18:21:23 - INFO - 
datasets.arrow_dataset - Process #5 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00005_of_00032.arrow Process #6 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00006_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #6 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00006_of_00032.arrow Process #7 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00007_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #7 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00007_of_00032.arrow Process #8 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00008_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #8 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00008_of_00032.arrow Process #9 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00009_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #9 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00009_of_00032.arrow Process #10 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00010_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #10 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00010_of_00032.arrow Process #11 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00011_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #11 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00011_of_00032.arrow Process #12 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00012_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #12 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00012_of_00032.arrow Process #13 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00013_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #13 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00013_of_00032.arrow Process #14 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00014_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #14 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00014_of_00032.arrow Process #15 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00015_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #15 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00015_of_00032.arrow Process #16 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00016_of_00032.arrow 02/16/2026 18:21:23 - INFO - 
datasets.arrow_dataset - Process #16 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00016_of_00032.arrow Process #17 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00017_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #17 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00017_of_00032.arrow Process #18 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00018_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #18 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00018_of_00032.arrow Process #19 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00019_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #19 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00019_of_00032.arrow Process #20 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00020_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #20 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00020_of_00032.arrow Process #21 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00021_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #21 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00021_of_00032.arrow Process #22 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00022_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #22 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00022_of_00032.arrow Process #23 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00023_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #23 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00023_of_00032.arrow Process #24 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00024_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #24 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00024_of_00032.arrow Process #25 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00025_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #25 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00025_of_00032.arrow Process #26 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00026_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #26 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00026_of_00032.arrow Process #27 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00027_of_00032.arrow 02/16/2026 18:21:23 - INFO - 
datasets.arrow_dataset - Process #27 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00027_of_00032.arrow Process #28 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00028_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #28 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00028_of_00032.arrow Process #29 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00029_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #29 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00029_of_00032.arrow Process #30 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00030_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #30 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00030_of_00032.arrow Process #31 will write at 
/workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00031_of_00032.arrow 02/16/2026 18:21:23 - INFO - datasets.arrow_dataset - Process #31 will write at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_00031_of_00032.arrow 📉 Sorting data by length in ascending order... *** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** 🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet Loading cached processed dataset at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_*_of_00032.arrow 02/16/2026 18:21:24 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-20fbaf3dcc5ce3e6_*_of_00032.arrow Concatenating 32 shards 02/16/2026 18:21:24 - INFO - datasets.arrow_dataset - Concatenating 32 shards 📉 Sorting data by length in ascending order...
Loading cached sorted indices for dataset at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-4808dacc7d21da3a.arrow 02/16/2026 18:21:24 - INFO - datasets.arrow_dataset - Loading cached sorted indices for dataset at /workspace/mnt/lcm_lab/qqt/public_data/data_cache/raw/parquet/default-4687dca96c3d2fe4/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6/cache-4808dacc7d21da3a.arrow *** Cache file path: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet *** 🚀 Found cache file: /workspace/mnt/lcm_lab/qqt/public_data/data_cache/qwen_mix_sft_64K6_qwen3-4b_new_1200_packed_maxseq65536.parquet ✅ Successfully loaded Parquet cache! Contains 22790 sequences. ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|lh_trainer.py:363] 2026-02-16 18:21:24,720 >> Handler registered [WARNING|lh_trainer.py:363] 2026-02-16 18:21:24,723 >> Handler registered [WARNING|lh_trainer.py:363] 2026-02-16 18:21:24,727 >> Handler registered ✅ Successfully loaded Parquet cache!
Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|lh_trainer.py:363] 2026-02-16 18:21:24,746 >> Handler registered [WARNING|lh_trainer.py:363] 2026-02-16 18:21:24,752 >> Handler registered ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( [WARNING|lh_trainer.py:363] 2026-02-16 18:21:24,772 >> Handler registered ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( Using custom data configuration default-625db63a2d7df432 02/16/2026 18:21:24 - INFO - datasets.builder - Using custom data configuration default-625db63a2d7df432 Loading Dataset Infos from /opt/conda/envs/qqt/lib/python3.11/site-packages/datasets/packaged_modules/parquet 02/16/2026 18:21:24 - INFO - datasets.info - Loading Dataset Infos from /opt/conda/envs/qqt/lib/python3.11/site-packages/datasets/packaged_modules/parquet Overwrite dataset info from restored data version if exists. 02/16/2026 18:21:24 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 02/16/2026 18:21:24 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6) 02/16/2026 18:21:24 - INFO - datasets.builder - Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6) Loading Dataset info from /root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 02/16/2026 18:21:24 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/parquet/default-625db63a2d7df432/0.0.0/9d41700293b5cf3c3cee6167e8c49e37598331b6466506aecb40a8c11b6aa9f6 [WARNING|lh_trainer.py:363] 2026-02-16 18:21:24,817 >> Handler registered ✅ Successfully loaded Parquet cache! Contains 22790 sequences. /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/training/lh_trainer.py:427: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead. super().__init__( 02/16/2026 18:21:24 - WARNING - accelerate.utils.other - Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:698] 2026-02-16 18:21:24,831 >> max_steps is given, it will override any value given in num_train_epochs [INFO|trainer.py:748] 2026-02-16 18:21:24,831 >> Using auto half precision backend [INFO|lh_trainer.py:486] 2026-02-16 18:21:24,832 >> Initializing sequence parallel groups with size 2 [WARNING|lh_trainer.py:363] 2026-02-16 18:21:24,833 >> Handler registered 02/16/2026 18:21:24 - INFO - __main__ - Successfully injected CustomDistributedStratifiedSampler into Trainer. [INFO|lh_trainer.py:931] 2026-02-16 18:21:43,349 >> Optimizing 400 parameters. [INFO|lh_trainer.py:932] 2026-02-16 18:21:43,349 >> Optimized parameters list: ['_fsdp_wrapped_module.model.sparsity_lambda_1', '_fsdp_wrapped_module.model.sparsity_lambda_2', '_fsdp_wrapped_module.model.sparsity_lambda1_task', '_fsdp_wrapped_module.model.sparsity_lambda2_task', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 
'_fsdp_wrapped_module.model.layers.0._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.1._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.2._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.3._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.4._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 
'_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.5._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.6._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', 
'_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.7._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.8._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.9._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', 
'_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.10._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.11._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.12._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.13._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.14._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.15._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.16._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 
'_fsdp_wrapped_module.model.layers.17._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.18._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', 
'_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.19._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.20._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', 
'_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.21._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', 
'_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.22._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.23._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', 
'_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.24._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', 
'_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.25._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.26._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', 
'_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.27._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', 
'_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.28._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.29._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', 
'_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.30._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', 
'_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.31._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.32._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', 
'_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.33._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', 
'_fsdp_wrapped_module.model.layers.34._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.attn_mask_log_alphas', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.0.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_feat_extractor.2.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.0.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.2.bias', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.weight', '_fsdp_wrapped_module.model.layers.35._fsdp_wrapped_module.self_attn.mask_allocator.cls_router_head_agnostic.4.bias']
[INFO|trainer.py:2414] 2026-02-16 18:21:43,361 >> ***** Running training *****
[INFO|trainer.py:2415] 2026-02-16 18:21:43,362 >> Num examples = 22,790
[INFO|trainer.py:2416] 2026-02-16 18:21:43,362 >> Num Epochs = 1
[INFO|trainer.py:2417] 2026-02-16 18:21:43,362 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2420] 2026-02-16 18:21:43,362 >> Total train batch size (w. parallel, distributed & accumulation) = 48
[INFO|trainer.py:2421] 2026-02-16 18:21:43,362 >> Gradient Accumulation steps = 6
[INFO|trainer.py:2422] 2026-02-16 18:21:43,362 >> Total optimization steps = 300
[INFO|trainer.py:2423] 2026-02-16 18:21:43,363 >> Number of trainable parameters = 300
[INFO|integration_utils.py:2218] 2026-02-16 18:21:45,279 >> Automatic SwanLab logging enabled, to disable set os.environ["SWANLAB_MODE"] = "disabled"
swanlab: swanlab version 0.7.8 is available! Upgrade: `pip install -U swanlab`
swanlab: Tracking run with swanlab version 0.6.8
swanlab: Run data will be saved locally in /workspace/mnt/lcm_lab/qqt/project/layer-ea/sparseattn/checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/run-20260216_182146-t1f5tfg4dj0dg1bwh58hb
swanlab: 👋 Hi qqtang,welcome to swanlab!
swanlab: Syncing run 2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B to the cloud
swanlab: 🏠 View project at https://swanlab.cn/@qqtang/NIPS
swanlab: 🚀 View run at https://swanlab.cn/@qqtang/NIPS/runs/t1f5tfg4dj0dg1bwh58hb
[Step 0 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [25347, 25356] → Tgt Spa: ['1.000', '1.000']
[Step 0 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [24426, 24427] → Tgt Spa: ['0.350', '0.350']
[Step 0 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42872] → Tgt Spa: ['1.000']
[Step 0 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24165, 24168] → Tgt Spa: ['1.000', '1.000']
[Step 0 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42872] → Tgt Spa: ['1.000']
[Step 0 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [25347, 25356] → Tgt Spa: ['1.000', '1.000']
[Step 0 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24165, 24168] → Tgt Spa: ['1.000', '1.000']
[Step 0 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [24426, 24427] → Tgt Spa: ['0.350', '0.350']
/opt/conda/envs/qqt/lib/python3.11/site-packages/torch/autograd/graph.py:823: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [Step 0 / Rank 7] Tasks: ['Code', 'Code', 'Single QA', 'Single QA'] | Lens: [15479, 15481, 15474, 15475] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350'] [Step 0 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [16134, 16134, 16134, 16135] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 0 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39763] → Tgt Spa: ['1.000'] [Step 0 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [16134, 16134, 16134, 16135] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 0 / Rank 4] Tasks: ['Single QA'] | Lens: [45651] → Tgt Spa: ['0.350'] [Step 0 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39763] → Tgt Spa: ['1.000'] [Step 0 / Rank 6] Tasks: ['Code', 'Code', 'Single QA', 'Single QA'] | Lens: [15479, 15481, 15474, 15475] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350'] [Step 0 / Rank 5] Tasks: ['Single QA'] | Lens: [45651] → Tgt Spa: ['0.350'] [Step 0 / Rank 0] Tasks: ['Single QA'] | Lens: [41556] → Tgt Spa: ['0.350'] [Step 0 / Rank 3] Tasks: ['Single QA'] | Lens: [51575] → Tgt Spa: ['0.350'] [Step 0 / Rank 2] Tasks: ['Single QA'] | Lens: [51575] → Tgt Spa: ['0.350'] [Step 0 / Rank 4] Tasks: ['Single QA'] | Lens: [43150] → Tgt Spa: ['0.350'] [Step 0 / Rank 1] Tasks: ['Single QA'] | Lens: [41556] → Tgt Spa: ['0.350'] [Step 0 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23382, 23383] → Tgt Spa: ['1.000', '0.350'] [Step 0 / Rank 5] Tasks: ['Single QA'] | Lens: [43150] → Tgt Spa: ['0.350'] [Step 0 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23382, 23383] → Tgt Spa: ['1.000', '0.350'] [Step 0 / Rank 4] Tasks: ['Single QA'] | Lens: [56321] → Tgt Spa: ['0.350'] [Step 0 / Rank 5] Tasks: ['Single QA'] | Lens: [56321] → Tgt Spa: ['0.350'] [Step 0 / Rank 1] Tasks: ['Single QA'] | Lens: [51032] → Tgt Spa: ['0.350'] [Step 0 / 
Rank 7] Tasks: ['Single QA'] | Lens: [38927] → Tgt Spa: ['0.350'] [Step 0 / Rank 3] Tasks: ['Single QA'] | Lens: [34479] → Tgt Spa: ['0.350'] [Step 0 / Rank 6] Tasks: ['Single QA'] | Lens: [38927] → Tgt Spa: ['0.350'] [Step 0 / Rank 0] Tasks: ['Single QA'] | Lens: [51032] → Tgt Spa: ['0.350'] [Step 0 / Rank 2] Tasks: ['Single QA'] | Lens: [34479] → Tgt Spa: ['0.350'] [Step 0 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20687, 20699, 20691] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 0 / Rank 3] Tasks: ['Single QA'] | Lens: [57566] → Tgt Spa: ['0.350'] [Step 0 / Rank 0] Tasks: ['Code', 'Summarization'] | Lens: [29815, 29828] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 2] Tasks: ['Single QA'] | Lens: [57566] → Tgt Spa: ['0.350'] [Step 0 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20687, 20699, 20691] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 0 / Rank 5] Tasks: ['Code'] | Lens: [34432] → Tgt Spa: ['1.000'] [Step 0 / Rank 4] Tasks: ['Code'] | Lens: [34432] → Tgt Spa: ['1.000'] [Step 0 / Rank 1] Tasks: ['Code', 'Summarization'] | Lens: [29815, 29828] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59253] → Tgt Spa: ['1.000'] [Step 0 / Rank 0] Tasks: ['Single QA'] | Lens: [35029] → Tgt Spa: ['0.350'] [Step 0 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59253] → Tgt Spa: ['1.000'] [Step 0 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [25956, 25964] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 7] Tasks: ['Code'] | Lens: [55182] → Tgt Spa: ['1.000'] [Step 0 / Rank 1] Tasks: ['Single QA'] | Lens: [35029] → Tgt Spa: ['0.350'] [Step 0 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [25956, 25964] → Tgt Spa: ['1.000', '1.000'] [Step 0 / Rank 6] Tasks: ['Code'] | Lens: [55182] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 18:23:56,464 >> @ 0 | Loss: 2.0407 | LM: 1.9309 | Reg: 0.1097 | Spa(Avg): 0.511 [INFO|lh_trainer.py:797] 2026-02-16 18:23:56,465 >> Statistic -> Code 
| Spa: 0.444 | Tgt: 1.000 | Z-Loss: 0.102 | [INFO|lh_trainer.py:797] 2026-02-16 18:23:56,465 >> Statistic -> In-Context | Spa: 0.469 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:23:56,465 >> Statistic -> MultiHop | Spa: 0.549 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:23:56,465 >> Statistic -> Single | Spa: 0.554 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:23:56,465 >> Statistic -> Summarization | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.078 | [INFO|lh_trainer.py:810] 2026-02-16 18:23:56,469 >> [Micro-Log] {"loss": 2.040656689244012, "lm_loss": 1.930921162556236, "reg_loss": 0.10973554306353132, "model_sparsity(avg)": 0.5108024626970291, "Spa-In-Context Learning sparsity": 0.46875, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11539438087493181, "Spa-Single QA sparsity": 0.5537036975224813, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.11222218051552772, "Spa-Code sparsity": 0.4444444311989678, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10175766630305184, "Spa-Summarization sparsity": 0.5833333730697632, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.07792153209447861, "Spa-MultiHop QA sparsity": 0.5486111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06407263409346342, "step": 0, "current_tau": 1.5, "lambda1 Single QA": 0.474609375, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.0380859375, "lambda4 Code": 0.1337890625} /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 6 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called 
FSDP.clip_grad_norm_() on rank 5 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 7 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 4 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 3 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1214: UserWarning: Called FSDP.clip_grad_norm_() on rank 2 with no gradients -- returning the total norm in the default dtype torch.float32 warnings.warn( [INFO|lh_trainer.py:331] 2026-02-16 18:24:19,322 >> {'loss': 12.2439, 'grad_norm': 1.3188265562057495, 'learning_rate': 0.0, 'epoch': 0.00105318588730911, 'num_input_tokens_seen': 2363056, 'completed': '0.33% (1 / 300)', 'remaining time': '12:36:48', 'throughput': '6523.91', 'gpu_mem_free': '13877MB', 'step': 1} [Step 1 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43923] → Tgt Spa: ['1.000'] [Step 1 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43923] → Tgt Spa: ['1.000'] [Step 1 / Rank 7] Tasks: ['Single QA'] | Lens: [45544] → Tgt Spa: ['0.350'] [Step 1 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32126, 32126] → Tgt Spa: ['0.350', '0.350'] [Step 1 / Rank 6] Tasks: ['Single QA'] | Lens: [45544] → Tgt Spa: ['0.350'] [Step 1 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55295] → Tgt Spa: ['1.000'] [Step 1 / 
Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32126, 32126] → Tgt Spa: ['0.350', '0.350'] [Step 1 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55295] → Tgt Spa: ['1.000'] [Step 1 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9086, 9086, 9086, 9086, 9087, 9087, 9094] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 1 / Rank 7] Tasks: ['Single QA'] | Lens: [49864] → Tgt Spa: ['0.350'] [Step 1 / Rank 5] Tasks: ['Single QA', 'Summarization'] | Lens: [32560, 32582] → Tgt Spa: ['0.350', '1.000'] [Step 1 / Rank 4] Tasks: ['Single QA', 'Summarization'] | Lens: [32560, 32582] → Tgt Spa: ['0.350', '1.000'] [Step 1 / Rank 2] Tasks: ['Single QA'] | Lens: [36042] → Tgt Spa: ['0.350'] [Step 1 / Rank 3] Tasks: ['Single QA'] | Lens: [36042] → Tgt Spa: ['0.350'] [Step 1 / Rank 6] Tasks: ['Single QA'] | Lens: [49864] → Tgt Spa: ['0.350'] [Step 1 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9086, 9086, 9086, 9086, 9087, 9087, 9094] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 1 / Rank 3] Tasks: ['Single QA'] | Lens: [62013] → Tgt Spa: ['0.350'] [Step 1 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20035, 20047, 20038] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 1 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56435] → Tgt Spa: ['1.000'] [Step 1 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56435] → Tgt Spa: ['1.000'] [Step 1 / Rank 0] Tasks: ['Single QA'] | Lens: [39696] → Tgt Spa: ['0.350'] [Step 1 / Rank 1] Tasks: ['Single QA'] | Lens: [39696] → Tgt Spa: ['0.350'] [Step 1 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20035, 20047, 20038] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 1 / Rank 2] Tasks: ['Single QA'] | Lens: [62013] → Tgt Spa: ['0.350'] [Step 1 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42672] → Tgt Spa: ['1.000'] 
[Step 1 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [27896, 27904] → Tgt Spa: ['1.000', '1.000'] [Step 1 / Rank 7] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1885, 1868, 1869, 1868, 1868, 1871, 1888, 1889, 1890, 1871, 1871, 1873, 1891, 1873, 1892, 1874, 1874, 1876, 1875, 1877, 1883, 1894, 1877, 1876, 1896, 1895, 1884, 1877, 1879, 1897, 1878, 1880, 1899, 1900] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 1 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64245] → Tgt Spa: ['1.000'] [Step 1 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [27896, 27904] → Tgt Spa: ['1.000', '1.000'] [Step 1 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42672] → Tgt Spa: ['1.000'] [Step 1 / Rank 6] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 
'Summarization'] | Lens: [1885, 1868, 1869, 1868, 1868, 1871, 1888, 1889, 1890, 1871, 1871, 1873, 1891, 1873, 1892, 1874, 1874, 1876, 1875, 1877, 1883, 1894, 1877, 1876, 1896, 1895, 1884, 1877, 1879, 1897, 1878, 1880, 1899, 1900] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 1 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64245] → Tgt Spa: ['1.000'] [Step 1 / Rank 5] Tasks: ['Code'] | Lens: [61402] → Tgt Spa: ['1.000'] [Step 1 / Rank 3] Tasks: ['Single QA'] | Lens: [48597] → Tgt Spa: ['0.350'] [Step 1 / Rank 2] Tasks: ['Single QA'] | Lens: [48597] → Tgt Spa: ['0.350'] [Step 1 / Rank 6] Tasks: ['Code'] | Lens: [53240] → Tgt Spa: ['1.000'] [Step 1 / Rank 1] Tasks: ['Single QA'] | Lens: [58344] → Tgt Spa: ['0.350'] [Step 1 / Rank 4] Tasks: ['Code'] | Lens: [61402] → Tgt Spa: ['1.000'] [Step 1 / Rank 7] Tasks: ['Code'] | Lens: [53240] → Tgt Spa: ['1.000'] [Step 1 / Rank 0] Tasks: ['Single QA'] | Lens: [58344] → Tgt Spa: ['0.350'] [Step 1 / Rank 3] Tasks: ['Single QA'] | Lens: [65101] → Tgt Spa: ['0.350'] [Step 1 / Rank 7] Tasks: ['Single QA'] | Lens: [42585] → Tgt Spa: ['0.350'] [Step 1 / Rank 1] Tasks: ['Single QA'] | Lens: [39175] → Tgt Spa: ['0.350'] [Step 1 / Rank 2] Tasks: ['Single QA'] | Lens: [65101] → Tgt Spa: ['0.350'] [Step 1 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [38877] → Tgt Spa: ['1.000'] [Step 1 / Rank 0] Tasks: ['Single QA'] | Lens: [39175] → Tgt Spa: ['0.350'] [Step 1 / Rank 6] Tasks: ['Single QA'] | Lens: [42585] → Tgt Spa: ['0.350'] [Step 1 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38877] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 18:26:56,655 >> @ 1 | Loss: 2.2751 | LM: 2.1887 | Reg: 0.0864 | Spa(Avg): 0.492 [INFO|lh_trainer.py:797] 
2026-02-16 18:26:56,655 >> Statistic -> Code | Spa: 0.519 | Tgt: 1.000 | Z-Loss: 0.085 | [INFO|lh_trainer.py:797] 2026-02-16 18:26:56,655 >> Statistic -> In-Context | Spa: 0.488 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:26:56,655 >> Statistic -> MultiHop | Spa: 0.497 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:26:56,655 >> Statistic -> Single | Spa: 0.501 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:26:56,655 >> Statistic -> Summarization | Spa: 0.510 | Tgt: 1.000 | Z-Loss: 0.107 | [INFO|lh_trainer.py:810] 2026-02-16 18:26:56,657 >> [Micro-Log] {"loss": 2.275075492138664, "lm_loss": 2.1886735381558537, "reg_loss": 0.08640195491413276, "model_sparsity(avg)": 0.4915698903302352, "Spa-In-Context Learning sparsity": 0.48809523241860525, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11087532554353986, "Spa-Single QA sparsity": 0.5007309913635254, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07692126027847591, "Spa-Code sparsity": 0.519097238779068, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08456847909837961, "Spa-Summarization sparsity": 0.5099206353936877, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10697654421840395, "Spa-MultiHop QA sparsity": 0.49652778208255766, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04697803878225386, "step": 1, "current_tau": 1.4999618530273438, "lambda1 Single QA": 0.474609375, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.0380859375, "lambda4 Code": 0.1337890625} [INFO|lh_trainer.py:331] 2026-02-16 18:27:23,676 >> {'loss': 13.6505, 'grad_norm': 1.1604911088943481, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.00210637177461822, 'num_input_tokens_seen': 4914924, 'completed': '0.67% (2 / 300)', 'remaining time': '13:54:56', 'throughput': '6921.10', 'gpu_mem_free': '12749MB', 'step': 2} 
[Step 2 / Rank 5] Tasks: ['Single QA'] | Lens: [42669] → Tgt Spa: ['0.350'] [Step 2 / Rank 7] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [19210, 19231, 19223] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 2 / Rank 0] Tasks: ['Single QA'] | Lens: [39862] → Tgt Spa: ['0.350'] [Step 2 / Rank 4] Tasks: ['Single QA'] | Lens: [42669] → Tgt Spa: ['0.350'] [Step 2 / Rank 6] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [19210, 19231, 19223] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 2 / Rank 3] Tasks: ['Single QA'] | Lens: [61832] → Tgt Spa: ['0.350'] [Step 2 / Rank 2] Tasks: ['Single QA'] | Lens: [61832] → Tgt Spa: ['0.350'] [Step 2 / Rank 1] Tasks: ['Single QA'] | Lens: [39862] → Tgt Spa: ['0.350'] [Step 2 / Rank 3] Tasks: ['Single QA'] | Lens: [54850] → Tgt Spa: ['0.350'] [Step 2 / Rank 7] Tasks: ['Single QA'] | Lens: [44053] → Tgt Spa: ['0.350'] [Step 2 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [28822, 28823] → Tgt Spa: ['1.000', '1.000'] [Step 2 / Rank 6] Tasks: ['Single QA'] | Lens: [44053] → Tgt Spa: ['0.350'] [Step 2 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [48633] → Tgt Spa: ['1.000'] [Step 2 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [28822, 28823] → Tgt Spa: ['1.000', '1.000'] [Step 2 / Rank 2] Tasks: ['Single QA'] | Lens: [54850] → Tgt Spa: ['0.350'] [Step 2 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [48633] → Tgt Spa: ['1.000'] [Step 2 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [60048] → Tgt Spa: ['1.000'] [Step 2 / Rank 0] Tasks: ['Single QA'] | Lens: [56342] → Tgt Spa: ['0.350'] [Step 2 / Rank 1] Tasks: ['Single QA'] | Lens: [56342] → Tgt Spa: ['0.350'] [Step 2 / Rank 7] Tasks: ['Single QA'] | Lens: [40515] → Tgt Spa: ['0.350'] [Step 2 / Rank 3] Tasks: ['Single QA'] | Lens: [46714] → Tgt Spa: ['0.350'] [Step 2 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [60048] → Tgt Spa: ['1.000'] [Step 2 / Rank 6] Tasks: ['Single QA'] | Lens: [40515] → Tgt Spa: ['0.350'] [Step 2 / Rank 2] Tasks: ['Single QA'] 
| Lens: [46714] → Tgt Spa: ['0.350'] [Step 2 / Rank 7] Tasks: ['Single QA'] | Lens: [34781] → Tgt Spa: ['0.350'] [Step 2 / Rank 3] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21305, 21312, 21307] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 2 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53969] → Tgt Spa: ['1.000'] [Step 2 / Rank 2] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21305, 21312, 21307] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 2 / Rank 5] Tasks: ['Single QA'] | Lens: [42547] → Tgt Spa: ['0.350'] [Step 2 / Rank 6] Tasks: ['Single QA'] | Lens: [34781] → Tgt Spa: ['0.350'] [Step 2 / Rank 4] Tasks: ['Single QA'] | Lens: [42547] → Tgt Spa: ['0.350'] [Step 2 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53969] → Tgt Spa: ['1.000'] [Step 2 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17889, 17880, 17892] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 2 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26108, 26109] → Tgt Spa: ['1.000', '0.350'] [Step 2 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26217, 26219] → Tgt Spa: ['1.000', '1.000'] [Step 2 / Rank 4] Tasks: ['Single QA', 'Summarization'] | Lens: [25736, 25755] → Tgt Spa: ['0.350', '1.000'] [Step 2 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17889, 17880, 17892] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 2 / Rank 5] Tasks: ['Single QA', 'Summarization'] | Lens: [25736, 25755] → Tgt Spa: ['0.350', '1.000'] [Step 2 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26217, 26219] → Tgt Spa: ['1.000', '1.000'] [Step 2 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26108, 26109] → Tgt Spa: ['1.000', '0.350'] [Step 2 / Rank 5] Tasks: ['Single QA'] | Lens: [41250] → Tgt Spa: ['0.350'] [Step 2 / Rank 1] Tasks: ['Single QA'] | Lens: [50256] → Tgt Spa: ['0.350'] [Step 2 / Rank 3] Tasks: ['Single QA', 'MultiHop QA'] | Lens: 
[31017, 31018] → Tgt Spa: ['0.350', '0.350'] [Step 2 / Rank 4] Tasks: ['Single QA'] | Lens: [41250] → Tgt Spa: ['0.350'] [Step 2 / Rank 2] Tasks: ['Single QA', 'MultiHop QA'] | Lens: [31017, 31018] → Tgt Spa: ['0.350', '0.350'] [Step 2 / Rank 7] Tasks: ['Single QA'] | Lens: [48477] → Tgt Spa: ['0.350'] [Step 2 / Rank 0] Tasks: ['Single QA'] | Lens: [50256] → Tgt Spa: ['0.350'] [Step 2 / Rank 6] Tasks: ['Single QA'] | Lens: [48477] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 18:29:42,922 >> @ 2 | Loss: 2.1245 | LM: 2.0220 | Reg: 0.1025 | Spa(Avg): 0.549 [INFO|lh_trainer.py:797] 2026-02-16 18:29:42,922 >> Statistic -> Code | Spa: 0.547 | Tgt: 1.000 | Z-Loss: 0.079 | [INFO|lh_trainer.py:797] 2026-02-16 18:29:42,922 >> Statistic -> In-Context | Spa: 0.531 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:29:42,922 >> Statistic -> MultiHop | Spa: 0.500 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:29:42,922 >> Statistic -> Single | Spa: 0.554 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:29:42,922 >> Statistic -> Summarization | Spa: 0.465 | Tgt: 1.000 | Z-Loss: 0.123 | [INFO|lh_trainer.py:810] 2026-02-16 18:29:42,924 >> [Micro-Log] {"loss": 2.124526788791021, "lm_loss": 2.022026394804319, "reg_loss": 0.10250038312127192, "model_sparsity(avg)": 0.5486111131807169, "Spa-Single QA sparsity": 0.553819440305233, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.10649622092023492, "Spa-Code sparsity": 0.5472222208976746, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.07894552797079087, "Spa-In-Context Learning sparsity": 0.5308641990025839, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10142000102334553, "Spa-MultiHop QA sparsity": 0.5, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04352162778377533, "Spa-Summarization sparsity": 0.4652777761220932, "Spa-Summarization 
target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12288976833224297, "step": 2, "current_tau": 1.499847650527954, "lambda1 Single QA": 0.474609375, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.0380859375, "lambda4 Code": 0.1337890625} [INFO|lh_trainer.py:331] 2026-02-16 18:30:00,833 >> {'loss': 12.7472, 'grad_norm': 1.0786586999893188, 'learning_rate': 1.6666666666666667e-05, 'epoch': 0.00315955766192733, 'num_input_tokens_seen': 7350666, 'completed': '1.00% (3 / 300)', 'remaining time': '13:34:04', 'throughput': '7749.40', 'gpu_mem_free': '11097MB', 'step': 3} [Step 3 / Rank 0] Tasks: ['Single QA'] | Lens: [47443] → Tgt Spa: ['0.350'] [Step 3 / Rank 6] Tasks: ['Single QA'] | Lens: [57752] → Tgt Spa: ['0.350'] [Step 3 / Rank 5] Tasks: ['Single QA'] | Lens: [50118] → Tgt Spa: ['0.350'] [Step 3 / Rank 1] Tasks: ['Single QA'] | Lens: [47443] → Tgt Spa: ['0.350'] [Step 3 / Rank 7] Tasks: ['Single QA'] | Lens: [57752] → Tgt Spa: ['0.350'] [Step 3 / Rank 4] Tasks: ['Single QA'] | Lens: [50118] → Tgt Spa: ['0.350'] [Step 3 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40199] → Tgt Spa: ['1.000'] [Step 3 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40199] → Tgt Spa: ['1.000'] [Step 3 / Rank 5] Tasks: ['Single QA'] | Lens: [60744] → Tgt Spa: ['0.350'] [Step 3 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29428, 29434] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 0] Tasks: ['Single QA'] | Lens: [45119] → Tgt Spa: ['0.350'] [Step 3 / Rank 6] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 
'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1332, 1352, 1352, 1334, 1353, 1353, 1337, 1336, 1335, 1355, 1354, 1337, 1336, 1337, 1355, 1355, 1336, 1337, 1338, 1337, 1356, 1356, 1338, 1338, 1340, 1338, 1338, 1339, 1341, 1339, 1339, 1342, 1341, 1340, 1340, 1360, 1342, 1341, 1360, 1361, 1342, 1342, 1341, 1342, 1344, 1343, 1343, 1361] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 3 / Rank 7] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1332, 1352, 1352, 1334, 1353, 1353, 1337, 1336, 1335, 1355, 1354, 1337, 1336, 1337, 1355, 1355, 1336, 1337, 1338, 1337, 1356, 1356, 1338, 1338, 1340, 1338, 1338, 1339, 1341, 1339, 1339, 1342, 1341, 
1340, 1340, 1360, 1342, 1341, 1360, 1361, 1342, 1342, 1341, 1342, 1344, 1343, 1343, 1361] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 3 / Rank 4] Tasks: ['Single QA'] | Lens: [60744] → Tgt Spa: ['0.350'] [Step 3 / Rank 1] Tasks: ['Single QA'] | Lens: [45119] → Tgt Spa: ['0.350'] [Step 3 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29428, 29434] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 1] Tasks: ['Single QA'] | Lens: [52677] → Tgt Spa: ['0.350'] [Step 3 / Rank 3] Tasks: ['Code', 'Summarization'] | Lens: [26323, 26337] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27445, 27448] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 7] Tasks: ['Single QA'] | Lens: [57575] → Tgt Spa: ['0.350'] [Step 3 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27445, 27448] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 2] Tasks: ['Code', 'Summarization'] | Lens: [26323, 26337] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 6] Tasks: ['Single QA'] | Lens: [57575] → Tgt Spa: ['0.350'] [Step 3 / Rank 0] Tasks: ['Single QA'] | Lens: [52677] → Tgt Spa: ['0.350'] [Step 3 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [23247, 23242] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [23247, 23242] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 3] Tasks: ['Single QA'] | Lens: [62024] → Tgt Spa: ['0.350'] [Step 3 / Rank 2] Tasks: ['Single QA'] | Lens: [62024] → Tgt Spa: ['0.350'] [Step 3 / Rank 7] Tasks: ['Code'] | Lens: [49567] → 
Tgt Spa: ['1.000'] [Step 3 / Rank 6] Tasks: ['Code'] | Lens: [49567] → Tgt Spa: ['1.000'] [Step 3 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [30809, 30811] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [30809, 30811] → Tgt Spa: ['1.000', '1.000'] [Step 3 / Rank 1] Tasks: ['Single QA'] | Lens: [60441] → Tgt Spa: ['0.350'] [Step 3 / Rank 3] Tasks: ['Single QA'] | Lens: [58397] → Tgt Spa: ['0.350'] [Step 3 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45086] → Tgt Spa: ['1.000'] [Step 3 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45086] → Tgt Spa: ['1.000'] [Step 3 / Rank 6] Tasks: ['Single QA'] | Lens: [38135] → Tgt Spa: ['0.350'] [Step 3 / Rank 2] Tasks: ['Single QA'] | Lens: [58397] → Tgt Spa: ['0.350'] [Step 3 / Rank 7] Tasks: ['Single QA'] | Lens: [38135] → Tgt Spa: ['0.350'] [Step 3 / Rank 0] Tasks: ['Single QA'] | Lens: [60441] → Tgt Spa: ['0.350'] [Step 3 / Rank 3] Tasks: ['Single QA'] | Lens: [55696] → Tgt Spa: ['0.350'] [Step 3 / Rank 5] Tasks: ['Single QA'] | Lens: [51699] → Tgt Spa: ['0.350'] [Step 3 / Rank 4] Tasks: ['Single QA'] | Lens: [51699] → Tgt Spa: ['0.350'] [Step 3 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43788] → Tgt Spa: ['1.000'] [Step 3 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43788] → Tgt Spa: ['1.000'] [Step 3 / Rank 2] Tasks: ['Single QA'] | Lens: [55696] → Tgt Spa: ['0.350'] [Step 3 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32527, 32528] → Tgt Spa: ['0.350', '0.350'] [Step 3 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32527, 32528] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 18:32:40,715 >> @ 3 | Loss: 2.1654 | LM: 2.0908 | Reg: 0.0746 | Spa(Avg): 0.486 [INFO|lh_trainer.py:797] 2026-02-16 18:32:40,715 >> Statistic -> Code | Spa: 0.522 | Tgt: 1.000 | Z-Loss: 0.084 | [INFO|lh_trainer.py:797] 2026-02-16 18:32:40,715 >> Statistic -> In-Context | Spa: 0.528 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:32:40,715 >> 
Statistic -> MultiHop | Spa: 0.502 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:32:40,715 >> Statistic -> Single | Spa: 0.465 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:32:40,715 >> Statistic -> Summarization | Spa: 0.494 | Tgt: 1.000 | Z-Loss: 0.114 | [INFO|lh_trainer.py:810] 2026-02-16 18:32:40,717 >> [Micro-Log] {"loss": 2.1654449788232646, "lm_loss": 2.0908074167867503, "reg_loss": 0.07463756405437987, "model_sparsity(avg)": 0.48602670555313426, "Spa-Single QA sparsity": 0.46481480201085407, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05766560547053814, "Spa-Code sparsity": 0.5222222089767456, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08430085927248002, "Spa-In-Context Learning sparsity": 0.5277777686715126, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10213251691311598, "Spa-Summarization sparsity": 0.4935185273488363, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11350376009941102, "Spa-MultiHop QA sparsity": 0.5016339929664836, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04741028742864728, "step": 3, "current_tau": 1.499657392501831, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.0380859375, "lambda4 Code": 0.1337890625} [INFO|lh_trainer.py:331] 2026-02-16 18:33:01,862 >> {'loss': 12.9927, 'grad_norm': 0.9630117416381836, 'learning_rate': 2.5e-05, 'epoch': 0.00421274354923644, 'num_input_tokens_seen': 9911760, 'completed': '1.33% (4 / 300)', 'remaining time': '13:51:46', 'throughput': '7073.71', 'gpu_mem_free': '6391MB', 'step': 4} [Step 4 / Rank 6] Tasks: ['In-Context Learning', 'Summarization', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [6620, 6639, 6634, 6636, 6636, 6637, 6644, 6640, 6647] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', 
'0.350', '1.000'] [Step 4 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25108, 25109] → Tgt Spa: ['1.000', '1.000'] [Step 4 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25108, 25109] → Tgt Spa: ['1.000', '1.000'] [Step 4 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23673, 23691] → Tgt Spa: ['1.000', '1.000'] [Step 4 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15879, 15879, 15879, 15879] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 4 / Rank 7] Tasks: ['In-Context Learning', 'Summarization', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [6620, 6639, 6634, 6636, 6636, 6637, 6644, 6640, 6647] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 4 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15879, 15879, 15879, 15879] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 4 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23673, 23691] → Tgt Spa: ['1.000', '1.000'] [Step 4 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26397, 26399] → Tgt Spa: ['1.000', '1.000'] [Step 4 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29217, 29217] → Tgt Spa: ['0.350', '0.350'] [Step 4 / Rank 7] Tasks: ['Code', 'Single QA', 'In-Context Learning'] | Lens: [20960, 20953, 20954] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 4 / Rank 6] Tasks: ['Code', 'Single QA', 'In-Context Learning'] | Lens: [20960, 20953, 20954] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 4 / Rank 1] Tasks: ['Code'] | Lens: [33781] → Tgt Spa: ['1.000'] [Step 4 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29217, 29217] → Tgt Spa: ['0.350', '0.350'] [Step 4 / Rank 0] Tasks: ['Code'] | Lens: [33781] → Tgt Spa: ['1.000'] [Step 4 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26397, 26399] → Tgt Spa: 
['1.000', '1.000'] [Step 4 / Rank 3] Tasks: ['Single QA'] | Lens: [49239] → Tgt Spa: ['0.350'] [Step 4 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29868, 29868] → Tgt Spa: ['0.350', '0.350'] [Step 4 / Rank 7] Tasks: ['Single QA'] | Lens: [63851] → Tgt Spa: ['0.350'] [Step 4 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29868, 29868] → Tgt Spa: ['0.350', '0.350'] [Step 4 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53560] → Tgt Spa: ['1.000'] [Step 4 / Rank 6] Tasks: ['Single QA'] | Lens: [63851] → Tgt Spa: ['0.350'] [Step 4 / Rank 2] Tasks: ['Single QA'] | Lens: [49239] → Tgt Spa: ['0.350'] [Step 4 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53560] → Tgt Spa: ['1.000'] [Step 4 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23416, 23416] → Tgt Spa: ['0.350', '1.000'] [Step 4 / Rank 7] Tasks: ['Single QA'] | Lens: [57695] → Tgt Spa: ['0.350'] [Step 4 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32239, 32241] → Tgt Spa: ['0.350', '0.350'] [Step 4 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23416, 23416] → Tgt Spa: ['0.350', '1.000'] [Step 4 / Rank 6] Tasks: ['Single QA'] | Lens: [57695] → Tgt Spa: ['0.350'] [Step 4 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32239, 32241] → Tgt Spa: ['0.350', '0.350'] [Step 4 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17382, 17381, 17384] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 4 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17382, 17381, 17384] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 4 / Rank 6] Tasks: ['Code', 'Single QA', 'Summarization'] | Lens: [18038, 18031, 18051] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 4 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56230] → Tgt Spa: ['1.000'] [Step 4 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56230] → Tgt Spa: ['1.000'] [Step 4 / Rank 7] Tasks: ['Code', 'Single QA', 'Summarization'] | Lens: [18038, 18031, 18051] → Tgt Spa: 
['1.000', '0.350', '1.000'] [Step 4 / Rank 5] Tasks: ['Code'] | Lens: [59996] → Tgt Spa: ['1.000'] [Step 4 / Rank 4] Tasks: ['Code'] | Lens: [59996] → Tgt Spa: ['1.000'] [Step 4 / Rank 3] Tasks: ['Single QA'] | Lens: [41184] → Tgt Spa: ['0.350'] [Step 4 / Rank 2] Tasks: ['Single QA'] | Lens: [41184] → Tgt Spa: ['0.350'] [Step 4 / Rank 7] Tasks: ['Single QA'] | Lens: [45603] → Tgt Spa: ['0.350'] [Step 4 / Rank 3] Tasks: ['Code'] | Lens: [33946] → Tgt Spa: ['1.000'] [Step 4 / Rank 2] Tasks: ['Code'] | Lens: [33946] → Tgt Spa: ['1.000'] [Step 4 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41344] → Tgt Spa: ['1.000'] [Step 4 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41344] → Tgt Spa: ['1.000'] [Step 4 / Rank 1] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code'] | Lens: [1710, 1729, 1729, 1710, 1730, 1730, 1712, 1712, 1731, 1714, 1713, 1733, 1732, 1733, 1733, 1714, 1733, 1715, 1715, 1716, 1716, 1716, 1736, 1717, 1718, 1717, 1736, 1737, 1721, 1737, 1719, 1720, 1722, 1721, 1720, 1739, 1740, 1728] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 4 / Rank 6] Tasks: ['Single QA'] | Lens: [45603] → Tgt Spa: ['0.350'] 
[Step 4 / Rank 0] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code'] | Lens: [1710, 1729, 1729, 1710, 1730, 1730, 1712, 1712, 1731, 1714, 1713, 1733, 1732, 1733, 1733, 1714, 1733, 1715, 1715, 1716, 1716, 1716, 1736, 1717, 1718, 1717, 1736, 1737, 1721, 1737, 1719, 1720, 1722, 1721, 1720, 1739, 1740, 1728] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 18:35:19,422 >> @ 4 | Loss: 2.1784 | LM: 2.0865 | Reg: 0.0919 | Spa(Avg): 0.489 [INFO|lh_trainer.py:797] 2026-02-16 18:35:19,422 >> Statistic -> Code | Spa: 0.469 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-16 18:35:19,422 >> Statistic -> In-Context | Spa: 0.497 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:35:19,422 >> Statistic -> MultiHop | Spa: 0.492 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:35:19,422 >> Statistic -> Single | Spa: 0.500 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:35:19,422 >> Statistic -> Summarization | Spa: 0.476 | Tgt: 1.000 | Z-Loss: 0.123 | [INFO|lh_trainer.py:810] 2026-02-16 18:35:19,424 >> [Micro-Log] {"loss": 
2.1784369833767414, "lm_loss": 2.0865287395815053, "reg_loss": 0.09190824177737038, "model_sparsity(avg)": 0.48929566517472267, "Spa-Single QA sparsity": 0.49999999458139593, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07668266631662846, "Spa-Code sparsity": 0.4691357943746779, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09603099235230023, "Spa-In-Context Learning sparsity": 0.4974747462706132, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10881095176393335, "Spa-MultiHop QA sparsity": 0.49206349111738656, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04447990509548357, "Spa-Summarization sparsity": 0.47601010311733594, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12307465118779377, "step": 4, "current_tau": 1.4993910789489746, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.038330078125, "lambda4 Code": 0.1337890625} [INFO|lh_trainer.py:331] 2026-02-16 18:35:34,845 >> {'loss': 13.0706, 'grad_norm': 1.2464433908462524, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.0052659294365455505, 'num_input_tokens_seen': 12460110, 'completed': '1.67% (5 / 300)', 'remaining time': '13:33:35', 'throughput': '8328.89', 'gpu_mem_free': '7465MB', 'step': 5} [Step 5 / Rank 6] Tasks: ['Single QA'] | Lens: [62466] → Tgt Spa: ['0.350'] [Step 5 / Rank 7] Tasks: ['Single QA'] | Lens: [62466] → Tgt Spa: ['0.350'] [Step 5 / Rank 3] Tasks: ['Single QA'] | Lens: [56069] → Tgt Spa: ['0.350'] [Step 5 / Rank 4] Tasks: ['Single QA', 'Summarization'] | Lens: [24273, 24291] → Tgt Spa: ['0.350', '1.000'] [Step 5 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27463, 27464] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 2] Tasks: ['Single QA'] | Lens: [56069] → Tgt Spa: ['0.350'] [Step 5 / Rank 5] Tasks: ['Single QA', 'Summarization'] | Lens: [24273, 24291] → Tgt Spa: 
['0.350', '1.000'] [Step 5 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27463, 27464] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [46080] → Tgt Spa: ['1.000'] [Step 5 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [26407, 26400] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 1] Tasks: ['MultiHop QA', 'Code', 'Code', 'Single QA'] | Lens: [14806, 14821, 14827, 14830] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 5 / Rank 6] Tasks: ['Single QA'] | Lens: [36936] → Tgt Spa: ['0.350'] [Step 5 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [46080] → Tgt Spa: ['1.000'] [Step 5 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [26407, 26400] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 7] Tasks: ['Single QA'] | Lens: [36936] → Tgt Spa: ['0.350'] [Step 5 / Rank 0] Tasks: ['MultiHop QA', 'Code', 'Code', 'Single QA'] | Lens: [14806, 14821, 14827, 14830] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 5 / Rank 7] Tasks: ['Single QA'] | Lens: [49593] → Tgt Spa: ['0.350'] [Step 5 / Rank 3] Tasks: ['Single QA'] | Lens: [46228] → Tgt Spa: ['0.350'] [Step 5 / Rank 6] Tasks: ['Single QA'] | Lens: [49593] → Tgt Spa: ['0.350'] [Step 5 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43923] → Tgt Spa: ['1.000'] [Step 5 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55891] → Tgt Spa: ['1.000'] [Step 5 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43923] → Tgt Spa: ['1.000'] [Step 5 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55891] → Tgt Spa: ['1.000'] [Step 5 / Rank 2] Tasks: ['Single QA'] | Lens: [46228] → Tgt Spa: ['0.350'] [Step 5 / Rank 1] Tasks: ['Single QA'] | Lens: [49457] → Tgt Spa: ['0.350'] [Step 5 / Rank 0] Tasks: ['Single QA'] | Lens: [49457] → Tgt Spa: ['0.350'] [Step 5 / Rank 6] Tasks: ['Single QA'] | Lens: [57289] → Tgt Spa: ['0.350'] [Step 5 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45530] → Tgt Spa: ['1.000'] [Step 5 / Rank 3] Tasks: ['Single QA', 
'Single QA', 'Single QA', 'Single QA'] | Lens: [15807, 15807, 15807, 15807] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 5 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15807, 15807, 15807, 15807] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 5 / Rank 7] Tasks: ['Single QA'] | Lens: [57289] → Tgt Spa: ['0.350'] [Step 5 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45530] → Tgt Spa: ['1.000'] [Step 5 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23598, 23600] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23598, 23600] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [24186, 24195] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 3] Tasks: ['Single QA'] | Lens: [39799] → Tgt Spa: ['0.350'] [Step 5 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [26181, 26174] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [26181, 26174] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [24186, 24195] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 2] Tasks: ['Single QA'] | Lens: [39799] → Tgt Spa: ['0.350'] [Step 5 / Rank 4] Tasks: ['Single QA'] | Lens: [63035] → Tgt Spa: ['0.350'] [Step 5 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58151] → Tgt Spa: ['1.000'] [Step 5 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25861, 25862] → Tgt Spa: ['1.000', '1.000'] [Step 5 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58151] → Tgt Spa: ['1.000'] [Step 5 / Rank 5] Tasks: ['Single QA'] | Lens: [63035] → Tgt Spa: ['0.350'] [Step 5 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17323, 17324, 17326] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 5 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25861, 25862] → Tgt Spa: 
['1.000', '1.000'] [Step 5 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17323, 17324, 17326] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 18:37:49,473 >> @ 5 | Loss: 2.2523 | LM: 2.1588 | Reg: 0.0935 | Spa(Avg): 0.504 [INFO|lh_trainer.py:797] 2026-02-16 18:37:49,474 >> Statistic -> Code | Spa: 0.506 | Tgt: 1.000 | Z-Loss: 0.088 | [INFO|lh_trainer.py:797] 2026-02-16 18:37:49,474 >> Statistic -> In-Context | Spa: 0.520 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:37:49,474 >> Statistic -> MultiHop | Spa: 0.417 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:37:49,474 >> Statistic -> Single | Spa: 0.515 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:37:49,474 >> Statistic -> Summarization | Spa: 0.493 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:810] 2026-02-16 18:37:49,476 >> [Micro-Log] {"loss": 2.2522769700735807, "lm_loss": 2.158759331330657, "reg_loss": 0.09351766171554725, "model_sparsity(avg)": 0.5038580298423767, "Spa-In-Context Learning sparsity": 0.5198412793023246, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10390142777136394, "Spa-MultiHop QA sparsity": 0.4166666865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.017506670206785202, "Spa-Code sparsity": 0.5055555701255798, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08818826153874397, "Spa-Single QA sparsity": 0.514814821879069, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08507677962382634, "Spa-Summarization sparsity": 0.4930555522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11267183721065521, "step": 5, "current_tau": 1.4990487098693848, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.038330078125, "lambda4 Code": 0.1337890625} [INFO|lh_trainer.py:331] 
2026-02-16 18:38:15,303 >> {'loss': 13.5137, 'grad_norm': 1.4381663799285889, 'learning_rate': 4.1666666666666665e-05, 'epoch': 0.00631911532385466, 'num_input_tokens_seen': 14941884, 'completed': '2.00% (6 / 300)', 'remaining time': '13:26:44', 'throughput': '7733.40', 'gpu_mem_free': '9975MB', 'step': 6} [Step 6 / Rank 4] Tasks: ['Single QA'] | Lens: [62848] → Tgt Spa: ['0.350'] [Step 6 / Rank 5] Tasks: ['Single QA'] | Lens: [62848] → Tgt Spa: ['0.350'] [Step 6 / Rank 7] Tasks: ['Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4944, 4937, 4937, 4937, 4938, 4946, 4940, 4940, 4947, 4941, 4942, 4943, 4943] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 6 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [63606] → Tgt Spa: ['1.000'] [Step 6 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [63606] → Tgt Spa: ['1.000'] [Step 6 / Rank 6] Tasks: ['Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4944, 4937, 4937, 4937, 4938, 4946, 4940, 4940, 4947, 4941, 4942, 4943, 4943] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 6 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60418] → Tgt Spa: ['1.000'] [Step 6 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60418] → Tgt Spa: ['1.000'] [Step 6 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [31398, 31398] → Tgt Spa: ['1.000', '1.000'] [Step 6 / Rank 7] Tasks: ['Single QA'] | Lens: [64978] → Tgt Spa: ['0.350'] [Step 6 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [47357] → Tgt 
Spa: ['1.000'] [Step 6 / Rank 6] Tasks: ['Single QA'] | Lens: [64978] → Tgt Spa: ['0.350'] [Step 6 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [47357] → Tgt Spa: ['1.000'] [Step 6 / Rank 2] Tasks: ['Single QA'] | Lens: [38845] → Tgt Spa: ['0.350'] [Step 6 / Rank 3] Tasks: ['Single QA'] | Lens: [38845] → Tgt Spa: ['0.350'] [Step 6 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [31398, 31398] → Tgt Spa: ['1.000', '1.000'] [Step 6 / Rank 1] Tasks: ['Single QA'] | Lens: [62002] → Tgt Spa: ['0.350'] [Step 6 / Rank 3] Tasks: ['Single QA'] | Lens: [51026] → Tgt Spa: ['0.350'] [Step 6 / Rank 2] Tasks: ['Single QA'] | Lens: [51026] → Tgt Spa: ['0.350'] [Step 6 / Rank 5] Tasks: ['Single QA'] | Lens: [58426] → Tgt Spa: ['0.350'] [Step 6 / Rank 4] Tasks: ['Single QA'] | Lens: [58426] → Tgt Spa: ['0.350'] [Step 6 / Rank 7] Tasks: ['Code'] | Lens: [38804] → Tgt Spa: ['1.000'] [Step 6 / Rank 6] Tasks: ['Code'] | Lens: [38804] → Tgt Spa: ['1.000'] [Step 6 / Rank 0] Tasks: ['Single QA'] | Lens: [62002] → Tgt Spa: ['0.350'] [Step 6 / Rank 3] Tasks: ['Single QA'] | Lens: [50223] → Tgt Spa: ['0.350'] [Step 6 / Rank 6] Tasks: ['Single QA'] | Lens: [60737] → Tgt Spa: ['0.350'] [Step 6 / Rank 4] Tasks: ['Single QA'] | Lens: [45885] → Tgt Spa: ['0.350'] [Step 6 / Rank 5] Tasks: ['Single QA'] | Lens: [45885] → Tgt Spa: ['0.350'] [Step 6 / Rank 0] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 6 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 6 / Rank 7] Tasks: ['Single QA'] | Lens: [60737] → Tgt Spa: ['0.350'] [Step 6 / Rank 2] Tasks: ['Single QA'] | Lens: [50223] → Tgt Spa: ['0.350'] [Step 6 / Rank 2] Tasks: ['Code'] | Lens: [32909] → Tgt Spa: ['1.000'] [Step 6 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [52592] → Tgt Spa: ['1.000'] [Step 6 / Rank 6] Tasks: ['Single QA'] | Lens: [45711] → Tgt Spa: ['0.350'] [Step 6 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context 
Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [4706, 4706, 4707, 4706, 4707, 4706, 4713, 4709, 4708, 4709, 4710, 4711, 4710] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 6 / Rank 3] Tasks: ['Code'] | Lens: [32909] → Tgt Spa: ['1.000'] [Step 6 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [4706, 4706, 4707, 4706, 4707, 4706, 4713, 4709, 4708, 4709, 4710, 4711, 4710] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 6 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [52592] → Tgt Spa: ['1.000'] [Step 6 / Rank 7] Tasks: ['Single QA'] | Lens: [45711] → Tgt Spa: ['0.350'] [Step 6 / Rank 3] Tasks: ['Single QA'] | Lens: [47565] → Tgt Spa: ['0.350'] [Step 6 / Rank 7] Tasks: ['Single QA'] | Lens: [33526] → Tgt Spa: ['0.350'] [Step 6 / Rank 5] Tasks: ['Code', 'Code', 'Code', 'Summarization'] | Lens: [14737, 14740, 14743, 14755] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000'] [Step 6 / Rank 2] Tasks: ['Single QA'] | Lens: [47565] → Tgt Spa: ['0.350'] [Step 6 / Rank 4] Tasks: ['Code', 'Code', 'Code', 'Summarization'] | Lens: [14737, 14740, 14743, 14755] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000'] [Step 6 / Rank 1] Tasks: ['Single QA'] | Lens: [56509] → Tgt Spa: ['0.350'] [Step 6 / Rank 6] Tasks: ['Single QA'] | Lens: [33526] → Tgt Spa: ['0.350'] [Step 6 / Rank 0] Tasks: ['Single QA'] | Lens: [56509] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 18:41:02,547 >> @ 6 | Loss: 2.1617 | LM: 2.0908 | Reg: 0.0709 | Spa(Avg): 0.464 
[INFO|lh_trainer.py:797] 2026-02-16 18:41:02,547 >> Statistic -> Code | Spa: 0.503 | Tgt: 1.000 | Z-Loss: 0.089 | [INFO|lh_trainer.py:797] 2026-02-16 18:41:02,547 >> Statistic -> In-Context | Spa: 0.511 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:41:02,547 >> Statistic -> MultiHop | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:41:02,547 >> Statistic -> Single | Spa: 0.454 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:41:02,547 >> Statistic -> Summarization | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.158 | [INFO|lh_trainer.py:810] 2026-02-16 18:41:02,550 >> [Micro-Log] {"loss": 2.161710256865869, "lm_loss": 2.0908465074220053, "reg_loss": 0.07086376391816884, "model_sparsity(avg)": 0.46449874962369603, "Spa-In-Context Learning sparsity": 0.5113636390729384, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10582987286827782, "Spa-Single QA sparsity": 0.4542483582216151, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05257645068580613, "Spa-MultiHop QA sparsity": 0.4444444179534912, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025653861463069916, "Spa-Code sparsity": 0.5025252591479908, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08876683122732422, "Spa-Summarization sparsity": 0.3888888955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15763497352600098, "step": 6, "current_tau": 1.4986305236816406, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.038330078125, "lambda4 Code": 0.1337890625} [INFO|lh_trainer.py:331] 2026-02-16 18:41:24,022 >> {'loss': 12.9703, 'grad_norm': 0.9651690125465393, 'learning_rate': 5e-05, 'epoch': 0.00737230121116377, 'num_input_tokens_seen': 17514914, 'completed': '2.33% (7 / 300)', 'remaining time': '13:40:47', 'throughput': '6817.10', 'gpu_mem_free': '7445MB', 
'step': 7} [Step 7 / Rank 2] Tasks: ['Single QA'] | Lens: [64532] → Tgt Spa: ['0.350'] [Step 7 / Rank 7] Tasks: ['Code'] | Lens: [62869] → Tgt Spa: ['1.000'] [Step 7 / Rank 4] Tasks: ['Single QA'] | Lens: [51193] → Tgt Spa: ['0.350'] [Step 7 / Rank 3] Tasks: ['Single QA'] | Lens: [64532] → Tgt Spa: ['0.350'] [Step 7 / Rank 5] Tasks: ['Single QA'] | Lens: [51193] → Tgt Spa: ['0.350'] [Step 7 / Rank 0] Tasks: ['Single QA'] | Lens: [64909] → Tgt Spa: ['0.350'] [Step 7 / Rank 6] Tasks: ['Code'] | Lens: [62869] → Tgt Spa: ['1.000'] [Step 7 / Rank 1] Tasks: ['Single QA'] | Lens: [64909] → Tgt Spa: ['0.350'] [Step 7 / Rank 4] Tasks: ['Single QA'] | Lens: [65076] → Tgt Spa: ['0.350'] [Step 7 / Rank 5] Tasks: ['Single QA'] | Lens: [65076] → Tgt Spa: ['0.350'] [Step 7 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [18953, 18954, 18954] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 7 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29715, 29717] → Tgt Spa: ['0.350', '0.350'] [Step 7 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29715, 29717] → Tgt Spa: ['0.350', '0.350'] [Step 7 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26886, 26868] → Tgt Spa: ['1.000', '1.000'] [Step 7 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [18953, 18954, 18954] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 7 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26886, 26868] → Tgt Spa: ['1.000', '1.000'] [Step 7 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18823, 18825, 18836] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 7 / Rank 6] Tasks: ['Single QA'] | Lens: [50184] → Tgt Spa: ['0.350'] [Step 7 / Rank 2] Tasks: ['Single QA'] | Lens: [46454] → Tgt Spa: ['0.350'] [Step 7 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [22258, 22267] → Tgt Spa: ['1.000', '1.000'] [Step 7 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18823, 18825, 18836] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 7 / Rank 1] Tasks: 
['In-Context Learning', 'Code'] | Lens: [22258, 22267] → Tgt Spa: ['1.000', '1.000'] [Step 7 / Rank 3] Tasks: ['Single QA'] | Lens: [46454] → Tgt Spa: ['0.350'] [Step 7 / Rank 7] Tasks: ['Single QA'] | Lens: [50184] → Tgt Spa: ['0.350'] [Step 7 / Rank 3] Tasks: ['Single QA'] | Lens: [42967] → Tgt Spa: ['0.350'] [Step 7 / Rank 1] Tasks: ['Single QA'] | Lens: [47670] → Tgt Spa: ['0.350'] [Step 7 / Rank 0] Tasks: ['Single QA'] | Lens: [47670] → Tgt Spa: ['0.350'] [Step 7 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [21929, 21925] → Tgt Spa: ['1.000', '1.000'] [Step 7 / Rank 4] Tasks: ['Single QA'] | Lens: [49231] → Tgt Spa: ['0.350'] [Step 7 / Rank 5] Tasks: ['Single QA'] | Lens: [49231] → Tgt Spa: ['0.350'] [Step 7 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [21929, 21925] → Tgt Spa: ['1.000', '1.000'] [Step 7 / Rank 2] Tasks: ['Single QA'] | Lens: [42967] → Tgt Spa: ['0.350'] [Step 7 / Rank 1] Tasks: ['Code'] | Lens: [41583] → Tgt Spa: ['1.000'] [Step 7 / Rank 2] Tasks: ['Single QA'] | Lens: [53383] → Tgt Spa: ['0.350'] [Step 7 / Rank 7] Tasks: ['Single QA'] | Lens: [58936] → Tgt Spa: ['0.350'] [Step 7 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [24474, 24484] → Tgt Spa: ['1.000', '1.000'] [Step 7 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [24474, 24484] → Tgt Spa: ['1.000', '1.000'] [Step 7 / Rank 6] Tasks: ['Single QA'] | Lens: [58936] → Tgt Spa: ['0.350'] [Step 7 / Rank 0] Tasks: ['Code'] | Lens: [41583] → Tgt Spa: ['1.000'] [Step 7 / Rank 3] Tasks: ['Single QA'] | Lens: [53383] → Tgt Spa: ['0.350'] [Step 7 / Rank 7] Tasks: ['Single QA'] | Lens: [63826] → Tgt Spa: ['0.350'] [Step 7 / Rank 3] Tasks: ['Code', 'Single QA'] | Lens: [31422, 31427] → Tgt Spa: ['1.000', '0.350'] [Step 7 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32138, 32138] → Tgt Spa: ['0.350', '0.350'] [Step 7 / Rank 5] Tasks: ['Code'] | Lens: [59886] → Tgt Spa: ['1.000'] [Step 7 / Rank 4] Tasks: ['Code'] | Lens: [59886] → Tgt Spa: 
['1.000'] [Step 7 / Rank 2] Tasks: ['Code', 'Single QA'] | Lens: [31422, 31427] → Tgt Spa: ['1.000', '0.350'] [Step 7 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32138, 32138] → Tgt Spa: ['0.350', '0.350'] [Step 7 / Rank 6] Tasks: ['Single QA'] | Lens: [63826] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 18:43:59,359 >> @ 7 | Loss: 1.8148 | LM: 1.7387 | Reg: 0.0761 | Spa(Avg): 0.485 [INFO|lh_trainer.py:797] 2026-02-16 18:43:59,359 >> Statistic -> Code | Spa: 0.481 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-16 18:43:59,359 >> Statistic -> In-Context | Spa: 0.549 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:43:59,359 >> Statistic -> MultiHop | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:43:59,359 >> Statistic -> Single | Spa: 0.482 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:43:59,359 >> Statistic -> Summarization | Spa: 0.569 | Tgt: 1.000 | Z-Loss: 0.087 | [INFO|lh_trainer.py:810] 2026-02-16 18:43:59,362 >> [Micro-Log] {"loss": 1.8147980240173638, "lm_loss": 1.73871249955846, "reg_loss": 0.07608552905730903, "model_sparsity(avg)": 0.4849537027378877, "Spa-Single QA sparsity": 0.4820261457387139, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06749761630507077, "Spa-Summarization sparsity": 0.5694444179534912, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08662547171115875, "Spa-In-Context Learning sparsity": 0.5486111342906952, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.09755121171474457, "Spa-Code sparsity": 0.4814814825852712, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09320193839569886, "Spa-MultiHop QA sparsity": 0.4444444179534912, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025653861463069916, "step": 7, "current_tau": 1.4981365203857422, "lambda1 Single QA": 0.4765625, "lambda2 
MultiHop QA": 0.23828125, "lambda3 Summarization": 0.03857421875, "lambda4 Code": 0.1337890625} [INFO|lh_trainer.py:331] 2026-02-16 18:44:25,429 >> {'loss': 10.8888, 'grad_norm': 1.1015037298202515, 'learning_rate': 5.833333333333333e-05, 'epoch': 0.00842548709847288, 'num_input_tokens_seen': 20142298, 'completed': '2.67% (8 / 300)', 'remaining time': '13:46:06', 'throughput': '7241.70', 'gpu_mem_free': '6463MB', 'step': 8} [Step 8 / Rank 7] Tasks: ['Single QA'] | Lens: [41192] → Tgt Spa: ['0.350'] [Step 8 / Rank 5] Tasks: ['Single QA'] | Lens: [64910] → Tgt Spa: ['0.350'] [Step 8 / Rank 6] Tasks: ['Single QA'] | Lens: [41192] → Tgt Spa: ['0.350'] [Step 8 / Rank 2] Tasks: ['Single QA'] | Lens: [59679] → Tgt Spa: ['0.350'] [Step 8 / Rank 4] Tasks: ['Single QA'] | Lens: [64910] → Tgt Spa: ['0.350'] [Step 8 / Rank 0] Tasks: ['Code'] | Lens: [44092] → Tgt Spa: ['1.000'] [Step 8 / Rank 3] Tasks: ['Single QA'] | Lens: [59679] → Tgt Spa: ['0.350'] [Step 8 / Rank 1] Tasks: ['Code'] | Lens: [44092] → Tgt Spa: ['1.000'] [Step 8 / Rank 3] Tasks: ['Single QA'] | Lens: [38703] → Tgt Spa: ['0.350'] [Step 8 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [21895, 21888] → Tgt Spa: ['1.000', '1.000'] [Step 8 / Rank 5] Tasks: ['Single QA', 'Code'] | Lens: [28273, 28280] → Tgt Spa: ['0.350', '1.000'] [Step 8 / Rank 2] Tasks: ['Single QA'] | Lens: [38703] → Tgt Spa: ['0.350'] [Step 8 / Rank 6] Tasks: ['Code'] | Lens: [57780] → Tgt Spa: ['1.000'] [Step 8 / Rank 7] Tasks: ['Code'] | Lens: [57780] → Tgt Spa: ['1.000'] [Step 8 / Rank 4] Tasks: ['Single QA', 'Code'] | Lens: [28273, 28280] → Tgt Spa: ['0.350', '1.000'] [Step 8 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [21895, 21888] → Tgt Spa: ['1.000', '1.000'] [Step 8 / Rank 6] Tasks: ['Single QA'] | Lens: [32804] → Tgt Spa: ['0.350'] [Step 8 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61027] → Tgt Spa: ['1.000'] [Step 8 / Rank 4] Tasks: ['Single QA'] | Lens: [64683] → Tgt Spa: ['0.350'] [Step 8 / Rank 0] 
Tasks: ['In-Context Learning'] | Lens: [56992] → Tgt Spa: ['1.000'] [Step 8 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56992] → Tgt Spa: ['1.000'] [Step 8 / Rank 7] Tasks: ['Single QA'] | Lens: [32804] → Tgt Spa: ['0.350'] [Step 8 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61027] → Tgt Spa: ['1.000'] [Step 8 / Rank 5] Tasks: ['Single QA'] | Lens: [64683] → Tgt Spa: ['0.350'] [Step 8 / Rank 1] Tasks: ['Single QA'] | Lens: [51582] → Tgt Spa: ['0.350'] [Step 8 / Rank 5] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'Single QA', 'Summarization', 'In-Context Learning', 'Summarization', 'Single QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Code', 'MultiHop QA'] | Lens: [2818, 2834, 2835, 2834, 2824, 2818, 2818, 2818, 2823, 2817, 2818, 2836, 2820, 2819, 2838, 2819, 2838, 2820, 2839, 2841, 2823, 2829, 2824] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 8 / Rank 6] Tasks: ['Single QA'] | Lens: [61381] → Tgt Spa: ['0.350'] [Step 8 / Rank 0] Tasks: ['Single QA'] | Lens: [51582] → Tgt Spa: ['0.350'] [Step 8 / Rank 4] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'Single QA', 'Summarization', 'In-Context Learning', 'Summarization', 'Single QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Code', 'MultiHop QA'] | Lens: [2818, 2834, 2835, 2834, 2824, 2818, 2818, 2818, 2823, 2817, 2818, 2836, 2820, 2819, 2838, 2819, 2838, 2820, 2839, 2841, 2823, 2829, 2824] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', 
'1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 8 / Rank 7] Tasks: ['Single QA'] | Lens: [61381] → Tgt Spa: ['0.350'] [Step 8 / Rank 2] Tasks: ['Single QA'] | Lens: [58268] → Tgt Spa: ['0.350'] [Step 8 / Rank 3] Tasks: ['Single QA'] | Lens: [58268] → Tgt Spa: ['0.350'] [Step 8 / Rank 4] Tasks: ['Code'] | Lens: [53238] → Tgt Spa: ['1.000'] [Step 8 / Rank 1] Tasks: ['Single QA'] | Lens: [46641] → Tgt Spa: ['0.350'] [Step 8 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [21873, 21874] → Tgt Spa: ['1.000', '0.350'] [Step 8 / Rank 5] Tasks: ['Code'] | Lens: [53238] → Tgt Spa: ['1.000'] [Step 8 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [21873, 21874] → Tgt Spa: ['1.000', '0.350'] [Step 8 / Rank 0] Tasks: ['Single QA'] | Lens: [46641] → Tgt Spa: ['0.350'] [Step 8 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [23836, 23828] → Tgt Spa: ['1.000', '1.000'] [Step 8 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [23836, 23828] → Tgt Spa: ['1.000', '1.000'] [Step 8 / Rank 1] Tasks: ['Single QA'] | Lens: [60718] → Tgt Spa: ['0.350'] [Step 8 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27231, 27233] → Tgt Spa: ['1.000', '1.000'] [Step 8 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27231, 27233] → Tgt Spa: ['1.000', '1.000'] [Step 8 / Rank 0] Tasks: ['Single QA'] | Lens: [60718] → Tgt Spa: ['0.350'] [Step 8 / Rank 4] Tasks: ['Code'] | Lens: [33792] → Tgt Spa: ['1.000'] [Step 8 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19404, 19404, 19419] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 8 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19404, 19404, 19419] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 8 / Rank 5] Tasks: ['Code'] | Lens: [33792] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 18:47:09,125 >> @ 8 | Loss: 2.0246 | LM: 1.9299 | Reg: 0.0947 | Spa(Avg): 0.528 [INFO|lh_trainer.py:797] 
2026-02-16 18:47:09,126 >> Statistic -> Code | Spa: 0.512 | Tgt: 1.000 | Z-Loss: 0.086 | [INFO|lh_trainer.py:797] 2026-02-16 18:47:09,126 >> Statistic -> In-Context | Spa: 0.524 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:47:09,126 >> Statistic -> MultiHop | Spa: 0.551 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:47:09,126 >> Statistic -> Single | Spa: 0.520 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:47:09,126 >> Statistic -> Summarization | Spa: 0.503 | Tgt: 1.000 | Z-Loss: 0.109 | [INFO|lh_trainer.py:810] 2026-02-16 18:47:09,128 >> [Micro-Log] {"loss": 2.024604399998983, "lm_loss": 1.9298622477799654, "reg_loss": 0.0947421588934958, "model_sparsity(avg)": 0.5281887551148733, "Spa-Code sparsity": 0.5115740845600764, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08630892137686412, "Spa-In-Context Learning sparsity": 0.5243055671453476, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10291337873786688, "Spa-Single QA sparsity": 0.5200617412726084, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08773546034677161, "Spa-Summarization sparsity": 0.5030864212248061, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10867288791471058, "Spa-MultiHop QA sparsity": 0.5509259402751923, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06265377749999364, "step": 8, "current_tau": 1.4975669384002686, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.03857421875, "lambda4 Code": 0.1337890625} [INFO|lh_trainer.py:331] 2026-02-16 18:47:33,221 >> {'loss': 12.1476, 'grad_norm': 1.195784091949463, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.009478672985781991, 'num_input_tokens_seen': 22656144, 'completed': '3.00% (9 / 300)', 'remaining time': '13:52:59', 'throughput': '6693.14', 'gpu_mem_free': '5849MB', 'step': 9} [Step 
9 / Rank 6] Tasks: ['Single QA'] | Lens: [59885] → Tgt Spa: ['0.350'] [Step 9 / Rank 3] Tasks: ['Single QA'] | Lens: [35653] → Tgt Spa: ['0.350'] [Step 9 / Rank 2] Tasks: ['Single QA'] | Lens: [35653] → Tgt Spa: ['0.350'] [Step 9 / Rank 0] Tasks: ['Single QA'] | Lens: [56498] → Tgt Spa: ['0.350'] [Step 9 / Rank 5] Tasks: ['Single QA'] | Lens: [39582] → Tgt Spa: ['0.350'] [Step 9 / Rank 4] Tasks: ['Single QA'] | Lens: [39582] → Tgt Spa: ['0.350'] [Step 9 / Rank 7] Tasks: ['Single QA'] | Lens: [59885] → Tgt Spa: ['0.350'] [Step 9 / Rank 1] Tasks: ['Single QA'] | Lens: [56498] → Tgt Spa: ['0.350'] [Step 9 / Rank 1] Tasks: ['Code'] | Lens: [42636] → Tgt Spa: ['1.000'] [Step 9 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43555] → Tgt Spa: ['1.000'] [Step 9 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28949, 28953] → Tgt Spa: ['1.000', '1.000'] [Step 9 / Rank 0] Tasks: ['Code'] | Lens: [42636] → Tgt Spa: ['1.000'] [Step 9 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43555] → Tgt Spa: ['1.000'] [Step 9 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27569, 27568] → Tgt Spa: ['1.000', '1.000'] [Step 9 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27569, 27568] → Tgt Spa: ['1.000', '1.000'] [Step 9 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28949, 28953] → Tgt Spa: ['1.000', '1.000'] [Step 9 / Rank 1] Tasks: ['Single QA'] | Lens: [57591] → Tgt Spa: ['0.350'] [Step 9 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [54741] → Tgt Spa: ['1.000'] [Step 9 / Rank 0] Tasks: ['Single QA'] | Lens: [57591] → Tgt Spa: ['0.350'] [Step 9 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [25924, 25935] → Tgt Spa: ['1.000', '1.000'] [Step 9 / Rank 5] Tasks: ['Single QA'] | Lens: [65070] → Tgt Spa: ['0.350'] [Step 9 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [25924, 25935] → Tgt Spa: ['1.000', '1.000'] [Step 9 / Rank 4] Tasks: ['Single QA'] | Lens: 
[65070] → Tgt Spa: ['0.350'] [Step 9 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [54741] → Tgt Spa: ['1.000'] [Step 9 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32005, 32006] → Tgt Spa: ['0.350', '0.350'] [Step 9 / Rank 3] Tasks: ['Single QA'] | Lens: [49675] → Tgt Spa: ['0.350'] [Step 9 / Rank 5] Tasks: ['Single QA'] | Lens: [63925] → Tgt Spa: ['0.350'] [Step 9 / Rank 2] Tasks: ['Single QA'] | Lens: [49675] → Tgt Spa: ['0.350'] [Step 9 / Rank 1] Tasks: ['Single QA'] | Lens: [58315] → Tgt Spa: ['0.350'] [Step 9 / Rank 4] Tasks: ['Single QA'] | Lens: [63925] → Tgt Spa: ['0.350'] [Step 9 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32005, 32006] → Tgt Spa: ['0.350', '0.350'] [Step 9 / Rank 0] Tasks: ['Single QA'] | Lens: [58315] → Tgt Spa: ['0.350'] [Step 9 / Rank 5] Tasks: ['Single QA'] | Lens: [64198] → Tgt Spa: ['0.350'] [Step 9 / Rank 2] Tasks: ['Single QA'] | Lens: [53904] → Tgt Spa: ['0.350'] [Step 9 / Rank 1] Tasks: ['Code'] | Lens: [64184] → Tgt Spa: ['1.000'] [Step 9 / Rank 0] Tasks: ['Code'] | Lens: [64184] → Tgt Spa: ['1.000'] [Step 9 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40069] → Tgt Spa: ['1.000'] [Step 9 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40069] → Tgt Spa: ['1.000'] [Step 9 / Rank 3] Tasks: ['Single QA'] | Lens: [53904] → Tgt Spa: ['0.350'] [Step 9 / Rank 4] Tasks: ['Single QA'] | Lens: [64198] → Tgt Spa: ['0.350'] [Step 9 / Rank 1] Tasks: ['Single QA'] | Lens: [42616] → Tgt Spa: ['0.350'] [Step 9 / Rank 3] Tasks: ['Code'] | Lens: [36481] → Tgt Spa: ['1.000'] [Step 9 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29755, 29758] → Tgt Spa: ['1.000', '0.350'] [Step 9 / Rank 6] Tasks: ['Single QA'] | Lens: [42629] → Tgt Spa: ['0.350'] [Step 9 / Rank 7] Tasks: ['Single QA'] | Lens: [42629] → Tgt Spa: ['0.350'] [Step 9 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29755, 29758] → Tgt Spa: ['1.000', '0.350'] [Step 9 / Rank 0] Tasks: ['Single QA'] | Lens: [42616] → Tgt Spa: 
['0.350'] [Step 9 / Rank 2] Tasks: ['Code'] | Lens: [36481] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 18:50:13,111 >> @ 9 | Loss: 2.2481 | LM: 2.1582 | Reg: 0.0898 | Spa(Avg): 0.485 [INFO|lh_trainer.py:797] 2026-02-16 18:50:13,112 >> Statistic -> Code | Spa: 0.500 | Tgt: 1.000 | Z-Loss: 0.090 | [INFO|lh_trainer.py:797] 2026-02-16 18:50:13,112 >> Statistic -> In-Context | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:50:13,112 >> Statistic -> MultiHop | Spa: 0.551 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:50:13,112 >> Statistic -> Single | Spa: 0.503 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:50:13,112 >> Statistic -> Summarization | Spa: 0.503 | Tgt: 1.000 | Z-Loss: 0.109 | [INFO|lh_trainer.py:810] 2026-02-16 18:50:13,114 >> [Micro-Log] {"loss": 2.248065236955881, "lm_loss": 2.1582390585293374, "reg_loss": 0.08982618843826155, "model_sparsity(avg)": 0.48495370397965115, "Spa-Single QA sparsity": 0.5034722313284874, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07922061870340258, "Spa-Code sparsity": 0.4999999850988388, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09021787531673908, "Spa-In-Context Learning sparsity": 0.4444444378217061, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12084748513168758, "Spa-Summarization sparsity": 0.5030864212248061, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10867288791471058, "Spa-MultiHop QA sparsity": 0.5509259402751923, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06265377749999364, "step": 9, "current_tau": 1.4969220161437988, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.03857421875, "lambda4 Code": 0.134765625} [INFO|lh_trainer.py:331] 2026-02-16 18:50:28,900 >> {'loss': 13.4884, 'grad_norm': 1.1983633041381836, 
'learning_rate': 7.5e-05, 'epoch': 0.010531858873091101, 'num_input_tokens_seen': 25175402, 'completed': '3.33% (10 / 300)', 'remaining time': '13:52:01', 'throughput': '7170.03', 'gpu_mem_free': '12827MB', 'step': 10} [Step 10 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [22110, 22102] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [22110, 22102] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 5] Tasks: ['Single QA'] | Lens: [62448] → Tgt Spa: ['0.350'] [Step 10 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18310, 18302, 18311] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 10 / Rank 0] Tasks: ['In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Single QA', 'Summarization', 'Single QA', 'MultiHop QA', 'Code', 'MultiHop QA'] | Lens: [3055, 3057, 3075, 3075, 3058, 3057, 3058, 3058, 3077, 3058, 3060, 3060, 3059, 3061, 3079, 3062, 3080, 3064, 3063, 3069, 3065] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 10 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18310, 18302, 18311] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 10 / Rank 4] Tasks: ['Single QA'] | Lens: [62448] → Tgt Spa: ['0.350'] [Step 10 / Rank 1] Tasks: ['In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Single QA', 'Summarization', 'Single QA', 'MultiHop QA', 'Code', 'MultiHop QA'] | Lens: [3055, 3057, 3075, 3075, 3058, 3057, 3058, 3058, 3077, 3058, 
3060, 3060, 3059, 3061, 3079, 3062, 3080, 3064, 3063, 3069, 3065] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 10 / Rank 5] Tasks: ['Single QA'] | Lens: [56672] → Tgt Spa: ['0.350'] [Step 10 / Rank 3] Tasks: ['Single QA'] | Lens: [40262] → Tgt Spa: ['0.350'] [Step 10 / Rank 6] Tasks: ['Single QA'] | Lens: [57274] → Tgt Spa: ['0.350'] [Step 10 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [22681, 22680] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [22681, 22680] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 4] Tasks: ['Single QA'] | Lens: [56672] → Tgt Spa: ['0.350'] [Step 10 / Rank 7] Tasks: ['Single QA'] | Lens: [57274] → Tgt Spa: ['0.350'] [Step 10 / Rank 2] Tasks: ['Single QA'] | Lens: [40262] → Tgt Spa: ['0.350'] [Step 10 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23058, 23040] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 0] Tasks: ['Code'] | Lens: [37239] → Tgt Spa: ['1.000'] [Step 10 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24990, 24992] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 6] Tasks: ['Single QA'] | Lens: [53585] → Tgt Spa: ['0.350'] [Step 10 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23058, 23040] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 7] Tasks: ['Single QA'] | Lens: [53585] → Tgt Spa: ['0.350'] [Step 10 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24990, 24992] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 1] Tasks: ['Code'] | Lens: [37239] → Tgt Spa: ['1.000'] [Step 10 / Rank 1] Tasks: ['Single QA'] | Lens: [64678] → Tgt Spa: ['0.350'] [Step 10 / Rank 3] Tasks: ['Single QA'] | Lens: [64903] → Tgt Spa: ['0.350'] [Step 10 / Rank 4] Tasks: ['Code'] | Lens: [54942] → Tgt Spa: ['1.000'] [Step 10 / Rank 2] Tasks: ['Single QA'] | Lens: [64903] → Tgt 
Spa: ['0.350'] [Step 10 / Rank 0] Tasks: ['Single QA'] | Lens: [64678] → Tgt Spa: ['0.350'] [Step 10 / Rank 6] Tasks: ['Single QA'] | Lens: [57714] → Tgt Spa: ['0.350'] [Step 10 / Rank 5] Tasks: ['Code'] | Lens: [54942] → Tgt Spa: ['1.000'] [Step 10 / Rank 7] Tasks: ['Single QA'] | Lens: [57714] → Tgt Spa: ['0.350'] [Step 10 / Rank 6] Tasks: ['Single QA'] | Lens: [33096] → Tgt Spa: ['0.350'] [Step 10 / Rank 2] Tasks: ['Code'] | Lens: [33960] → Tgt Spa: ['1.000'] [Step 10 / Rank 3] Tasks: ['Code'] | Lens: [33960] → Tgt Spa: ['1.000'] [Step 10 / Rank 5] Tasks: ['Single QA', 'Code'] | Lens: [30644, 30651] → Tgt Spa: ['0.350', '1.000'] [Step 10 / Rank 4] Tasks: ['Single QA', 'Code'] | Lens: [30644, 30651] → Tgt Spa: ['0.350', '1.000'] [Step 10 / Rank 1] Tasks: ['Single QA'] | Lens: [49398] → Tgt Spa: ['0.350'] [Step 10 / Rank 7] Tasks: ['Single QA'] | Lens: [33096] → Tgt Spa: ['0.350'] [Step 10 / Rank 0] Tasks: ['Single QA'] | Lens: [49398] → Tgt Spa: ['0.350'] [Step 10 / Rank 7] Tasks: ['Single QA'] | Lens: [38368] → Tgt Spa: ['0.350'] [Step 10 / Rank 4] Tasks: ['Code'] | Lens: [37359] → Tgt Spa: ['1.000'] [Step 10 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18799, 18800, 18790] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 10 / Rank 6] Tasks: ['Single QA'] | Lens: [38368] → Tgt Spa: ['0.350'] [Step 10 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24720, 24722] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18799, 18800, 18790] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 10 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24720, 24722] → Tgt Spa: ['1.000', '1.000'] [Step 10 / Rank 5] Tasks: ['Code'] | Lens: [37359] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 18:52:59,803 >> @ 10 | Loss: 1.9374 | LM: 1.8500 | Reg: 0.0874 | Spa(Avg): 0.500 [INFO|lh_trainer.py:797] 2026-02-16 18:52:59,803 >> Statistic -> Code | Spa: 0.490 | 
Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-16 18:52:59,803 >> Statistic -> In-Context | Spa: 0.506 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:52:59,803 >> Statistic -> MultiHop | Spa: 0.573 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:52:59,803 >> Statistic -> Single | Spa: 0.503 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:52:59,803 >> Statistic -> Summarization | Spa: 0.497 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:810] 2026-02-16 18:52:59,805 >> [Micro-Log] {"loss": 1.9373840065672994, "lm_loss": 1.8499832608892273, "reg_loss": 0.08740074670640752, "model_sparsity(avg)": 0.5000551206370195, "Spa-In-Context Learning sparsity": 0.5061728490723504, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1071618033779992, "Spa-MultiHop QA sparsity": 0.5729166865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0710703362710774, "Spa-Summarization sparsity": 0.49722222089767454, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11260511949658394, "Spa-Single QA sparsity": 0.5034722238779068, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08091095383861102, "Spa-Code sparsity": 0.4898989959196611, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09248298948461359, "step": 10, "current_tau": 1.496201992034912, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.038818359375, "lambda4 Code": 0.134765625} [INFO|lh_trainer.py:331] 2026-02-16 18:53:12,397 >> {'loss': 11.6243, 'grad_norm': 1.2181470394134521, 'learning_rate': 8.333333333333333e-05, 'epoch': 0.01158504476040021, 'num_input_tokens_seen': 27603302, 'completed': '3.67% (11 / 300)', 'remaining time': '13:45:22', 'throughput': '7424.91', 'gpu_mem_free': '9779MB', 'step': 11} [Step 11 / Rank 3] Tasks: ['Single QA'] | Lens: [41929] → Tgt 
Spa: ['0.350'] [Step 11 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [37972] → Tgt Spa: ['1.000'] [Step 11 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [24891, 24883] → Tgt Spa: ['1.000', '1.000'] [Step 11 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [37972] → Tgt Spa: ['1.000'] [Step 11 / Rank 0] Tasks: ['Single QA'] | Lens: [43292] → Tgt Spa: ['0.350'] [Step 11 / Rank 1] Tasks: ['Single QA'] | Lens: [43292] → Tgt Spa: ['0.350'] [Step 11 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [24891, 24883] → Tgt Spa: ['1.000', '1.000'] [Step 11 / Rank 2] Tasks: ['Single QA'] | Lens: [41929] → Tgt Spa: ['0.350'] [Step 11 / Rank 2] Tasks: ['Summarization'] | Lens: [33473] → Tgt Spa: ['1.000'] [Step 11 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [32139, 32146] → Tgt Spa: ['0.350', '1.000'] [Step 11 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [32139, 32146] → Tgt Spa: ['0.350', '1.000'] [Step 11 / Rank 4] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17578, 17590, 17593] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 11 / Rank 3] Tasks: ['Summarization'] | Lens: [33473] → Tgt Spa: ['1.000'] [Step 11 / Rank 6] Tasks: ['Single QA'] | Lens: [51276] → Tgt Spa: ['0.350'] [Step 11 / Rank 7] Tasks: ['Single QA'] | Lens: [51276] → Tgt Spa: ['0.350'] [Step 11 / Rank 5] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17578, 17590, 17593] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 11 / Rank 3] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [6935, 6942, 6938, 6947, 6952, 6953, 6947, 6949, 6949] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 11 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [51787] → Tgt Spa: ['1.000'] [Step 11 / Rank 1] Tasks: ['Code'] | Lens: [63793] → Tgt Spa: ['1.000'] [Step 11 / Rank 2] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Code', 'Code', 'In-Context Learning', 
'Single QA', 'Single QA'] | Lens: [6935, 6942, 6938, 6947, 6952, 6953, 6947, 6949, 6949] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 11 / Rank 6] Tasks: ['Single QA'] | Lens: [37379] → Tgt Spa: ['0.350'] [Step 11 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [51787] → Tgt Spa: ['1.000'] [Step 11 / Rank 0] Tasks: ['Code'] | Lens: [63793] → Tgt Spa: ['1.000'] [Step 11 / Rank 7] Tasks: ['Single QA'] | Lens: [37379] → Tgt Spa: ['0.350'] [Step 11 / Rank 5] Tasks: ['Single QA'] | Lens: [51313] → Tgt Spa: ['0.350'] [Step 11 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53563] → Tgt Spa: ['1.000'] [Step 11 / Rank 3] Tasks: ['Single QA'] | Lens: [46955] → Tgt Spa: ['0.350'] [Step 11 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [16838, 16841, 16842] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 11 / Rank 4] Tasks: ['Single QA'] | Lens: [51313] → Tgt Spa: ['0.350'] [Step 11 / Rank 2] Tasks: ['Single QA'] | Lens: [46955] → Tgt Spa: ['0.350'] [Step 11 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [16838, 16841, 16842] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 11 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53563] → Tgt Spa: ['1.000'] [Step 11 / Rank 3] Tasks: ['Single QA'] | Lens: [33697] → Tgt Spa: ['0.350'] [Step 11 / Rank 6] Tasks: ['Single QA'] | Lens: [44047] → Tgt Spa: ['0.350'] [Step 11 / Rank 0] Tasks: ['Single QA'] | Lens: [65025] → Tgt Spa: ['0.350'] [Step 11 / Rank 1] Tasks: ['Single QA'] | Lens: [65025] → Tgt Spa: ['0.350'] [Step 11 / Rank 7] Tasks: ['Single QA'] | Lens: [44047] → Tgt Spa: ['0.350'] [Step 11 / Rank 4] Tasks: ['Code', 'Single QA', 'Code', 'Code'] | Lens: [15150, 15146, 15157, 15167] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 11 / Rank 2] Tasks: ['Single QA'] | Lens: [33697] → Tgt Spa: ['0.350'] [Step 11 / Rank 5] Tasks: ['Code', 'Single QA', 'Code', 'Code'] | Lens: [15150, 15146, 15157, 15167] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 11 / Rank 3] 
Tasks: ['In-Context Learning'] | Lens: [60038] → Tgt Spa: ['1.000'] [Step 11 / Rank 1] Tasks: ['Single QA'] | Lens: [54842] → Tgt Spa: ['0.350'] [Step 11 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32685, 32686] → Tgt Spa: ['0.350', '0.350'] [Step 11 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32685, 32686] → Tgt Spa: ['0.350', '0.350'] [Step 11 / Rank 6] Tasks: ['Single QA'] | Lens: [53843] → Tgt Spa: ['0.350'] [Step 11 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60038] → Tgt Spa: ['1.000'] [Step 11 / Rank 7] Tasks: ['Single QA'] | Lens: [53843] → Tgt Spa: ['0.350'] [Step 11 / Rank 0] Tasks: ['Single QA'] | Lens: [54842] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 18:55:39,003 >> @ 11 | Loss: 2.0236 | LM: 1.9314 | Reg: 0.0922 | Spa(Avg): 0.494 [INFO|lh_trainer.py:797] 2026-02-16 18:55:39,003 >> Statistic -> Code | Spa: 0.476 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-16 18:55:39,003 >> Statistic -> In-Context | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:55:39,003 >> Statistic -> MultiHop | Spa: 0.573 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:55:39,003 >> Statistic -> Single | Spa: 0.506 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:55:39,003 >> Statistic -> Summarization | Spa: 0.472 | Tgt: 1.000 | Z-Loss: 0.121 | [INFO|lh_trainer.py:810] 2026-02-16 18:55:39,005 >> [Micro-Log] {"loss": 2.0235960943003497, "lm_loss": 1.9313891132672627, "reg_loss": 0.09220698662102222, "model_sparsity(avg)": 0.493666410446167, "Spa-Single QA sparsity": 0.5058479591419822, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08028452000335644, "Spa-Code sparsity": 0.4761904776096344, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0951204379754407, "Spa-In-Context Learning sparsity": 0.4722222189108531, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 
0.11487142990032832, "Spa-Summarization sparsity": 0.4722222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1205911636352539, "Spa-MultiHop QA sparsity": 0.5729166865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0710703362710774, "step": 11, "current_tau": 1.4954068660736084, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.23828125, "lambda3 Summarization": 0.038818359375, "lambda4 Code": 0.134765625} [INFO|lh_trainer.py:331] 2026-02-16 18:56:02,418 >> {'loss': 12.1416, 'grad_norm': 1.1232670545578003, 'learning_rate': 9.166666666666667e-05, 'epoch': 0.01263823064770932, 'num_input_tokens_seen': 30063438, 'completed': '4.00% (12 / 300)', 'remaining time': '13:41:59', 'throughput': '7234.82', 'gpu_mem_free': '8615MB', 'step': 12} [Step 12 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17947, 17947, 17940] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 12 / Rank 3] Tasks: ['Single QA'] | Lens: [40872] → Tgt Spa: ['0.350'] [Step 12 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17947, 17947, 17940] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 12 / Rank 1] Tasks: ['Single QA'] | Lens: [40018] → Tgt Spa: ['0.350'] [Step 12 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [28915, 28918] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [28915, 28918] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 0] Tasks: ['Single QA'] | Lens: [40018] → Tgt Spa: ['0.350'] [Step 12 / Rank 2] Tasks: ['Single QA'] | Lens: [40872] → Tgt Spa: ['0.350'] [Step 12 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43340] → Tgt Spa: ['1.000'] [Step 12 / Rank 4] Tasks: ['Single QA'] | Lens: [54072] → Tgt Spa: ['0.350'] [Step 12 / Rank 0] Tasks: ['Summarization'] | Lens: [56211] → Tgt Spa: ['1.000'] [Step 12 / Rank 1] Tasks: ['Summarization'] | Lens: [56211] → Tgt Spa: ['1.000'] [Step 12 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: 
[27999, 27997] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43340] → Tgt Spa: ['1.000'] [Step 12 / Rank 5] Tasks: ['Single QA'] | Lens: [54072] → Tgt Spa: ['0.350'] [Step 12 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [27999, 27997] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 2] Tasks: ['Single QA'] | Lens: [45295] → Tgt Spa: ['0.350'] [Step 12 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58514] → Tgt Spa: ['1.000'] [Step 12 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25854, 25854] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [51833] → Tgt Spa: ['1.000'] [Step 12 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [51833] → Tgt Spa: ['1.000'] [Step 12 / Rank 3] Tasks: ['Single QA'] | Lens: [45295] → Tgt Spa: ['0.350'] [Step 12 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25854, 25854] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58514] → Tgt Spa: ['1.000'] [Step 12 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20500, 20490, 20492] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 12 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27686, 27705] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 3] Tasks: ['Single QA'] | Lens: [45059] → Tgt Spa: ['0.350'] [Step 12 / Rank 1] Tasks: ['Summarization'] | Lens: [39974] → Tgt Spa: ['1.000'] [Step 12 / Rank 0] Tasks: ['Summarization'] | Lens: [39974] → Tgt Spa: ['1.000'] [Step 12 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20500, 20490, 20492] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 12 / Rank 2] Tasks: ['Single QA'] | Lens: [45059] → Tgt Spa: ['0.350'] [Step 12 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27686, 27705] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 6] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context 
Learning', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [3750, 3742, 3743, 3746, 3744, 3745, 3764, 3753, 3746, 3746, 3747, 3747, 3748, 3748, 3749, 3748, 3748] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 12 / Rank 5] Tasks: ['Single QA'] | Lens: [36269] → Tgt Spa: ['0.350'] [Step 12 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [25159, 25152] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42011] → Tgt Spa: ['1.000'] [Step 12 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [25159, 25152] → Tgt Spa: ['1.000', '1.000'] [Step 12 / Rank 4] Tasks: ['Single QA'] | Lens: [36269] → Tgt Spa: ['0.350'] [Step 12 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42011] → Tgt Spa: ['1.000'] [Step 12 / Rank 7] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [3750, 3742, 3743, 3746, 3744, 3745, 3764, 3753, 3746, 3746, 3747, 3747, 3748, 3748, 3749, 3748, 3748] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 12 / Rank 3] Tasks: ['Code'] | Lens: [57392] → Tgt Spa: ['1.000'] [Step 12 / Rank 1] Tasks: ['Single QA'] | Lens: [44944] → Tgt Spa: ['0.350'] [Step 12 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18853, 18854, 18865] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 12 / Rank 2] 
Tasks: ['Code'] | Lens: [57392] → Tgt Spa: ['1.000'] [Step 12 / Rank 0] Tasks: ['Single QA'] | Lens: [44944] → Tgt Spa: ['0.350'] [Step 12 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18853, 18854, 18865] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 12 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [27049, 27050] → Tgt Spa: ['0.350', '0.350'] [Step 12 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [27049, 27050] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 18:58:04,850 >> @ 12 | Loss: 2.0459 | LM: 1.9518 | Reg: 0.0940 | Spa(Avg): 0.477 [INFO|lh_trainer.py:797] 2026-02-16 18:58:04,850 >> Statistic -> Code | Spa: 0.509 | Tgt: 1.000 | Z-Loss: 0.088 | [INFO|lh_trainer.py:797] 2026-02-16 18:58:04,850 >> Statistic -> In-Context | Spa: 0.512 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:58:04,850 >> Statistic -> MultiHop | Spa: 0.573 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:58:04,851 >> Statistic -> Single | Spa: 0.488 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 18:58:04,851 >> Statistic -> Summarization | Spa: 0.476 | Tgt: 1.000 | Z-Loss: 0.120 | [INFO|lh_trainer.py:810] 2026-02-16 18:58:04,853 >> [Micro-Log] {"loss": 2.0458778416117034, "lm_loss": 1.9518327514330547, "reg_loss": 0.0940450844160902, "model_sparsity(avg)": 0.4774986306826274, "Spa-Single QA sparsity": 0.48842592040697735, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0707228077808395, "Spa-Summarization sparsity": 0.475694440305233, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12048038840293884, "Spa-In-Context Learning sparsity": 0.5118055492639542, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10646266862750053, "Spa-Code sparsity": 0.5092592537403107, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08750224920610587, "Spa-MultiHop QA sparsity": 0.5729166865348816, 
"Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0710703362710774, "step": 12, "current_tau": 1.4945368766784668, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.2392578125, "lambda3 Summarization": 0.0390625, "lambda4 Code": 0.134765625} [INFO|lh_trainer.py:331] 2026-02-16 18:58:26,409 >> {'loss': 12.2753, 'grad_norm': 1.4114834070205688, 'learning_rate': 0.0001, 'epoch': 0.01369141653501843, 'num_input_tokens_seen': 32496926, 'completed': '4.33% (13 / 300)', 'remaining time': '13:29:06', 'throughput': '8450.16', 'gpu_mem_free': '11223MB', 'step': 13} [Step 13 / Rank 5] Tasks: ['Single QA'] | Lens: [35646] → Tgt Spa: ['0.350'] [Step 13 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [32535, 32542] → Tgt Spa: ['0.350', '1.000'] [Step 13 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [32535, 32542] → Tgt Spa: ['0.350', '1.000'] [Step 13 / Rank 0] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Summarization', 'In-Context Learning'] | Lens: [5110, 5110, 5111, 5119, 5131, 5113, 5114, 5116, 5127, 5127, 5138, 5120] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 13 / Rank 6] Tasks: ['Code'] | Lens: [36885] → Tgt Spa: ['1.000'] [Step 13 / Rank 7] Tasks: ['Code'] | Lens: [36885] → Tgt Spa: ['1.000'] [Step 13 / Rank 4] Tasks: ['Single QA'] | Lens: [35646] → Tgt Spa: ['0.350'] [Step 13 / Rank 1] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Summarization', 'In-Context Learning'] | Lens: [5110, 5110, 5111, 5119, 5131, 5113, 5114, 5116, 5127, 5127, 5138, 5120] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 13 / Rank 3] Tasks: ['In-Context 
Learning', 'In-Context Learning'] | Lens: [24961, 24961] → Tgt Spa: ['1.000', '1.000'] [Step 13 / Rank 5] Tasks: ['Single QA'] | Lens: [57834] → Tgt Spa: ['0.350'] [Step 13 / Rank 4] Tasks: ['Single QA'] | Lens: [57834] → Tgt Spa: ['0.350'] [Step 13 / Rank 7] Tasks: ['Single QA'] | Lens: [52680] → Tgt Spa: ['0.350'] [Step 13 / Rank 0] Tasks: ['Single QA'] | Lens: [55589] → Tgt Spa: ['0.350'] [Step 13 / Rank 6] Tasks: ['Single QA'] | Lens: [52680] → Tgt Spa: ['0.350'] [Step 13 / Rank 1] Tasks: ['Single QA'] | Lens: [55589] → Tgt Spa: ['0.350'] [Step 13 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24961, 24961] → Tgt Spa: ['1.000', '1.000'] [Step 13 / Rank 6] Tasks: ['Summarization'] | Lens: [35433] → Tgt Spa: ['1.000'] [Step 13 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [38195] → Tgt Spa: ['1.000'] [Step 13 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [38195] → Tgt Spa: ['1.000'] [Step 13 / Rank 5] Tasks: ['Single QA'] | Lens: [52796] → Tgt Spa: ['0.350'] [Step 13 / Rank 3] Tasks: ['Code'] | Lens: [41554] → Tgt Spa: ['1.000'] [Step 13 / Rank 2] Tasks: ['Code'] | Lens: [41554] → Tgt Spa: ['1.000'] [Step 13 / Rank 4] Tasks: ['Single QA'] | Lens: [52796] → Tgt Spa: ['0.350'] [Step 13 / Rank 7] Tasks: ['Summarization'] | Lens: [35433] → Tgt Spa: ['1.000'] [Step 13 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57799] → Tgt Spa: ['1.000'] [Step 13 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [36846] → Tgt Spa: ['1.000'] [Step 13 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [36846] → Tgt Spa: ['1.000'] [Step 13 / Rank 6] Tasks: ['Single QA'] | Lens: [59585] → Tgt Spa: ['0.350'] [Step 13 / Rank 3] Tasks: ['Single QA'] | Lens: [54846] → Tgt Spa: ['0.350'] [Step 13 / Rank 7] Tasks: ['Single QA'] | Lens: [59585] → Tgt Spa: ['0.350'] [Step 13 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57799] → Tgt Spa: ['1.000'] [Step 13 / Rank 2] Tasks: ['Single QA'] | Lens: [54846] → Tgt Spa: ['0.350'] [Step 13 / Rank 3] Tasks: ['Code'] 
| Lens: [41027] → Tgt Spa: ['1.000'] [Step 13 / Rank 5] Tasks: ['Single QA'] | Lens: [34810] → Tgt Spa: ['0.350'] [Step 13 / Rank 2] Tasks: ['Code'] | Lens: [41027] → Tgt Spa: ['1.000'] [Step 13 / Rank 1] Tasks: ['Single QA'] | Lens: [41491] → Tgt Spa: ['0.350'] [Step 13 / Rank 0] Tasks: ['Single QA'] | Lens: [41491] → Tgt Spa: ['0.350'] [Step 13 / Rank 4] Tasks: ['Single QA'] | Lens: [34810] → Tgt Spa: ['0.350'] [Step 13 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38697] → Tgt Spa: ['1.000'] [Step 13 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38697] → Tgt Spa: ['1.000'] [Step 13 / Rank 1] Tasks: ['Code'] | Lens: [44133] → Tgt Spa: ['1.000'] [Step 13 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [32258, 32262] → Tgt Spa: ['1.000', '0.350'] [Step 13 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [32258, 32262] → Tgt Spa: ['1.000', '0.350'] [Step 13 / Rank 6] Tasks: ['Single QA'] | Lens: [51022] → Tgt Spa: ['0.350'] [Step 13 / Rank 2] Tasks: ['Single QA'] | Lens: [55959] → Tgt Spa: ['0.350'] [Step 13 / Rank 3] Tasks: ['Single QA'] | Lens: [55959] → Tgt Spa: ['0.350'] [Step 13 / Rank 7] Tasks: ['Single QA'] | Lens: [51022] → Tgt Spa: ['0.350'] [Step 13 / Rank 0] Tasks: ['Code'] | Lens: [44133] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:00:40,046 >> @ 13 | Loss: 2.0710 | LM: 1.9899 | Reg: 0.0811 | Spa(Avg): 0.487 [INFO|lh_trainer.py:797] 2026-02-16 19:00:40,046 >> Statistic -> Code | Spa: 0.451 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:797] 2026-02-16 19:00:40,046 >> Statistic -> In-Context | Spa: 0.479 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:00:40,046 >> Statistic -> MultiHop | Spa: 0.573 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:00:40,046 >> Statistic -> Single | Spa: 0.487 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:00:40,046 >> Statistic -> Summarization | Spa: 0.509 | Tgt: 1.000 | Z-Loss: 0.108 | 
[INFO|lh_trainer.py:810] 2026-02-16 19:00:40,048 >> [Micro-Log] {"loss": 2.071042850613594, "lm_loss": 1.989928004021446, "reg_loss": 0.08111481686743598, "model_sparsity(avg)": 0.487075620641311, "Spa-Single QA sparsity": 0.4870370427767436, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07037992638846238, "Spa-In-Context Learning sparsity": 0.4791666716337204, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11378117774923642, "Spa-Code sparsity": 0.4513888955116272, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10129801090806723, "Spa-Summarization sparsity": 0.5092592438062032, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10774664580821991, "Spa-MultiHop QA sparsity": 0.5729166865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0710703362710774, "step": 13, "current_tau": 1.4935925006866455, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.2392578125, "lambda3 Summarization": 0.039306640625, "lambda4 Code": 0.134765625} [INFO|lh_trainer.py:331] 2026-02-16 19:01:01,328 >> {'loss': 12.4263, 'grad_norm': 1.1449567079544067, 'learning_rate': 0.00010833333333333334, 'epoch': 0.01474460242232754, 'num_input_tokens_seen': 34824490, 'completed': '4.67% (14 / 300)', 'remaining time': '13:21:26', 'throughput': '7512.18', 'gpu_mem_free': '11613MB', 'step': 14} [Step 14 / Rank 3] Tasks: ['Code'] | Lens: [37221] → Tgt Spa: ['1.000'] [Step 14 / Rank 4] Tasks: ['Summarization'] | Lens: [54640] → Tgt Spa: ['1.000'] [Step 14 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18092, 18083, 18084] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 14 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18092, 18083, 18084] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 14 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [29388, 29400] → Tgt Spa: ['1.000', '1.000'] [Step 14 / Rank 7] Tasks: ['In-Context 
Learning', 'Code'] | Lens: [29388, 29400] → Tgt Spa: ['1.000', '1.000'] [Step 14 / Rank 5] Tasks: ['Summarization'] | Lens: [54640] → Tgt Spa: ['1.000'] [Step 14 / Rank 2] Tasks: ['Code'] | Lens: [37221] → Tgt Spa: ['1.000'] [Step 14 / Rank 5] Tasks: ['In-Context Learning', 'Summarization', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [4432, 4451, 4433, 4434, 4436, 4435, 4434, 4437, 4444, 4438, 4437, 4437, 4438, 4439] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 14 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26104, 26105] → Tgt Spa: ['0.350', '1.000'] [Step 14 / Rank 4] Tasks: ['In-Context Learning', 'Summarization', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [4432, 4451, 4433, 4434, 4436, 4435, 4434, 4437, 4444, 4438, 4437, 4437, 4438, 4439] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 14 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17340, 17341, 17335] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 14 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17340, 17341, 17335] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 14 / Rank 2] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [20718, 20711, 20711] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 14 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26104, 26105] → Tgt Spa: ['0.350', '1.000'] [Step 14 / Rank 3] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [20718, 20711, 20711] → Tgt 
Spa: ['1.000', '0.350', '0.350'] [Step 14 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60175] → Tgt Spa: ['1.000'] [Step 14 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA', 'Code'] | Lens: [14519, 14531, 14525, 14536] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000'] [Step 14 / Rank 4] Tasks: ['Single QA'] | Lens: [49588] → Tgt Spa: ['0.350'] [Step 14 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [24353, 24346] → Tgt Spa: ['1.000', '1.000'] [Step 14 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60175] → Tgt Spa: ['1.000'] [Step 14 / Rank 5] Tasks: ['Single QA'] | Lens: [49588] → Tgt Spa: ['0.350'] [Step 14 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [24353, 24346] → Tgt Spa: ['1.000', '1.000'] [Step 14 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA', 'Code'] | Lens: [14519, 14531, 14525, 14536] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000'] [Step 14 / Rank 1] Tasks: ['Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [2412, 2415, 2415, 2413, 2415, 2419, 2416, 2414, 2414, 2416, 2434, 2418, 2418, 2437, 2436, 2420, 2420, 2437, 2437, 2422, 2420, 2422, 2421, 2438, 2421, 2439, 2437] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 14 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42255] → Tgt Spa: ['1.000'] [Step 14 / Rank 3] Tasks: ['Single QA'] | Lens: [53824] → Tgt Spa: ['0.350'] [Step 14 / Rank 0] Tasks: ['Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop 
QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [2412, 2415, 2415, 2413, 2415, 2419, 2416, 2414, 2414, 2416, 2434, 2418, 2418, 2437, 2436, 2420, 2420, 2437, 2437, 2422, 2420, 2422, 2421, 2438, 2421, 2439, 2437] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 14 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42255] → Tgt Spa: ['1.000'] [Step 14 / Rank 6] Tasks: ['Single QA'] | Lens: [52370] → Tgt Spa: ['0.350'] [Step 14 / Rank 2] Tasks: ['Single QA'] | Lens: [53824] → Tgt Spa: ['0.350'] [Step 14 / Rank 7] Tasks: ['Single QA'] | Lens: [52370] → Tgt Spa: ['0.350'] [Step 14 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [10461, 10464, 10472, 10473, 10466, 10476] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 14 / Rank 4] Tasks: ['Single QA'] | Lens: [55704] → Tgt Spa: ['0.350'] [Step 14 / Rank 5] Tasks: ['Single QA'] | Lens: [55704] → Tgt Spa: ['0.350'] [Step 14 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29932, 29933] → Tgt Spa: ['1.000', '0.350'] [Step 14 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29932, 29933] → Tgt Spa: ['1.000', '0.350'] [Step 14 / Rank 6] Tasks: ['Single QA', 'Code', 'Code', 'MultiHop QA'] | Lens: [16344, 16369, 16373, 16368] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 14 / Rank 7] Tasks: ['Single QA', 'Code', 'Code', 'MultiHop QA'] | Lens: [16344, 16369, 16373, 16368] → Tgt Spa: ['0.350', '1.000', '1.000', 
'0.350'] [Step 14 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [10461, 10464, 10472, 10473, 10466, 10476] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 14 / Rank 3] Tasks: ['Code'] | Lens: [44993] → Tgt Spa: ['1.000'] [Step 14 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26416, 26417] → Tgt Spa: ['1.000', '1.000'] [Step 14 / Rank 7] Tasks: ['Single QA'] | Lens: [55969] → Tgt Spa: ['0.350'] [Step 14 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26416, 26417] → Tgt Spa: ['1.000', '1.000'] [Step 14 / Rank 5] Tasks: ['Code', 'Single QA', 'Summarization', 'Code', 'Single QA', 'Single QA'] | Lens: [10570, 10565, 10594, 10584, 10580, 10582] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 14 / Rank 4] Tasks: ['Code', 'Single QA', 'Summarization', 'Code', 'Single QA', 'Single QA'] | Lens: [10570, 10565, 10594, 10584, 10580, 10582] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 14 / Rank 6] Tasks: ['Single QA'] | Lens: [55969] → Tgt Spa: ['0.350'] [Step 14 / Rank 2] Tasks: ['Code'] | Lens: [44993] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:03:17,508 >> @ 14 | Loss: 1.9451 | LM: 1.8551 | Reg: 0.0900 | Spa(Avg): 0.490 [INFO|lh_trainer.py:797] 2026-02-16 19:03:17,508 >> Statistic -> Code | Spa: 0.509 | Tgt: 1.000 | Z-Loss: 0.088 | [INFO|lh_trainer.py:797] 2026-02-16 19:03:17,508 >> Statistic -> In-Context | Spa: 0.490 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:03:17,508 >> Statistic -> MultiHop | Spa: 0.515 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:03:17,509 >> Statistic -> Single | Spa: 0.462 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:03:17,509 >> Statistic -> Summarization | Spa: 0.447 | Tgt: 1.000 | Z-Loss: 0.133 | [INFO|lh_trainer.py:810] 2026-02-16 19:03:17,511 >> [Micro-Log] {"loss": 1.9451384594043095, 
"lm_loss": 1.855130897834897, "reg_loss": 0.09000754961743951, "model_sparsity(avg)": 0.48999057958523434, "Spa-Summarization sparsity": 0.44742063539368765, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13343173318675586, "Spa-Code sparsity": 0.50877193714443, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08825183659791946, "Spa-In-Context Learning sparsity": 0.4904513917863369, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11126326769590378, "Spa-Single QA sparsity": 0.46195652173913043, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06308513952662116, "Spa-MultiHop QA sparsity": 0.5154320961899228, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0516434946977016, "step": 14, "current_tau": 1.4925739765167236, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.2392578125, "lambda3 Summarization": 0.03955078125, "lambda4 Code": 0.134765625} [INFO|lh_trainer.py:331] 2026-02-16 19:03:38,750 >> {'loss': 11.6708, 'grad_norm': 1.307713270187378, 'learning_rate': 0.00011666666666666667, 'epoch': 0.01579778830963665, 'num_input_tokens_seen': 37474392, 'completed': '5.00% (15 / 300)', 'remaining time': '13:15:14', 'throughput': '8416.58', 'gpu_mem_free': '9593MB', 'step': 15} [Step 15 / Rank 4] Tasks: ['Summarization', 'Summarization'] | Lens: [23597, 23598] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [31107, 31109] → Tgt Spa: ['0.350', '0.350'] [Step 15 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [31107, 31109] → Tgt Spa: ['0.350', '0.350'] [Step 15 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27023, 27024] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 1] Tasks: ['Single QA'] | Lens: [64596] → Tgt Spa: ['0.350'] [Step 15 / Rank 5] Tasks: ['Summarization', 'Summarization'] | Lens: [23597, 23598] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 
6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27023, 27024] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 0] Tasks: ['Single QA'] | Lens: [64596] → Tgt Spa: ['0.350'] [Step 15 / Rank 7] Tasks: ['Code', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [8323, 8314, 8321, 8315, 8315, 8323, 8317] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 15 / Rank 4] Tasks: ['Single QA'] | Lens: [41149] → Tgt Spa: ['0.350'] [Step 15 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25459, 25460] → Tgt Spa: ['1.000', '0.350'] [Step 15 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25459, 25460] → Tgt Spa: ['1.000', '0.350'] [Step 15 / Rank 3] Tasks: ['Single QA'] | Lens: [59398] → Tgt Spa: ['0.350'] [Step 15 / Rank 5] Tasks: ['Single QA'] | Lens: [41149] → Tgt Spa: ['0.350'] [Step 15 / Rank 6] Tasks: ['Code', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [8323, 8314, 8321, 8315, 8315, 8323, 8317] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 15 / Rank 2] Tasks: ['Single QA'] | Lens: [59398] → Tgt Spa: ['0.350'] [Step 15 / Rank 1] Tasks: ['Single QA'] | Lens: [51165] → Tgt Spa: ['0.350'] [Step 15 / Rank 7] Tasks: ['Single QA'] | Lens: [42419] → Tgt Spa: ['0.350'] [Step 15 / Rank 0] Tasks: ['Single QA'] | Lens: [51165] → Tgt Spa: ['0.350'] [Step 15 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23316, 23335] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 3] Tasks: ['Code', 'Summarization'] | Lens: [26930, 26943] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23316, 23335] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 6] Tasks: ['Single QA'] | Lens: [42419] → Tgt Spa: ['0.350'] [Step 15 / Rank 2] Tasks: ['Code', 'Summarization'] | Lens: [26930, 26943] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 1] Tasks: ['In-Context Learning', 
'In-Context Learning'] | Lens: [27172, 27174] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22303, 22322] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21038, 21039, 21039] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 15 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21038, 21039, 21039] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 15 / Rank 6] Tasks: ['Single QA'] | Lens: [58833] → Tgt Spa: ['0.350'] [Step 15 / Rank 7] Tasks: ['Single QA'] | Lens: [58833] → Tgt Spa: ['0.350'] [Step 15 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22303, 22322] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27172, 27174] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 3] Tasks: ['Single QA'] | Lens: [58972] → Tgt Spa: ['0.350'] [Step 15 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [25289, 25285] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [25289, 25285] → Tgt Spa: ['1.000', '1.000'] [Step 15 / Rank 0] Tasks: ['Single QA'] | Lens: [58393] → Tgt Spa: ['0.350'] [Step 15 / Rank 6] Tasks: ['Summarization', 'Summarization', 'In-Context Learning', 'Summarization', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [6913, 6915, 6896, 6915, 6897, 6899, 6901, 6902, 6904] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 15 / Rank 1] Tasks: ['Single QA'] | Lens: [58393] → Tgt Spa: ['0.350'] [Step 15 / Rank 7] Tasks: ['Summarization', 'Summarization', 'In-Context Learning', 'Summarization', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [6913, 6915, 6896, 6915, 6897, 6899, 6901, 6902, 6904] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 15 / Rank 
2] Tasks: ['Single QA'] | Lens: [58972] → Tgt Spa: ['0.350'] [Step 15 / Rank 7] Tasks: ['Single QA'] | Lens: [56727] → Tgt Spa: ['0.350'] [Step 15 / Rank 4] Tasks: ['Single QA'] | Lens: [43460] → Tgt Spa: ['0.350'] [Step 15 / Rank 5] Tasks: ['Single QA'] | Lens: [43460] → Tgt Spa: ['0.350'] [Step 15 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [31558, 31568] → Tgt Spa: ['0.350', '1.000'] [Step 15 / Rank 0] Tasks: ['Single QA'] | Lens: [56579] → Tgt Spa: ['0.350'] [Step 15 / Rank 1] Tasks: ['Single QA'] | Lens: [56579] → Tgt Spa: ['0.350'] [Step 15 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [31558, 31568] → Tgt Spa: ['0.350', '1.000'] [Step 15 / Rank 6] Tasks: ['Single QA'] | Lens: [56727] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:06:15,183 >> @ 15 | Loss: 2.1487 | LM: 2.0791 | Reg: 0.0696 | Spa(Avg): 0.463 [INFO|lh_trainer.py:797] 2026-02-16 19:06:15,183 >> Statistic -> Code | Spa: 0.426 | Tgt: 1.000 | Z-Loss: 0.107 | [INFO|lh_trainer.py:797] 2026-02-16 19:06:15,184 >> Statistic -> In-Context | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:06:15,184 >> Statistic -> MultiHop | Spa: 0.515 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:06:15,184 >> Statistic -> Single | Spa: 0.459 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:06:15,184 >> Statistic -> Summarization | Spa: 0.519 | Tgt: 1.000 | Z-Loss: 0.104 | [INFO|lh_trainer.py:810] 2026-02-16 19:06:15,186 >> [Micro-Log] {"loss": 2.148746132850647, "lm_loss": 2.0791177166004977, "reg_loss": 0.06962839902068178, "model_sparsity(avg)": 0.46285732463002205, "Spa-Single QA sparsity": 0.45940170838282657, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05833220831118524, "Spa-In-Context Learning sparsity": 0.48611111044883726, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11231655031442642, "Spa-Code sparsity": 0.42592592040697735, "Spa-Code 
target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10720385859409969, "Spa-Summarization sparsity": 0.5190972313284874, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10429711453616619, "Spa-MultiHop QA sparsity": 0.5154320961899228, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0516434946977016, "step": 15, "current_tau": 1.4914814233779907, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.2392578125, "lambda3 Summarization": 0.03955078125, "lambda4 Code": 0.1357421875} [INFO|lh_trainer.py:331] 2026-02-16 19:06:36,852 >> {'loss': 12.8925, 'grad_norm': 0.8677361011505127, 'learning_rate': 0.000125, 'epoch': 0.01685097419694576, 'num_input_tokens_seen': 40079890, 'completed': '5.33% (16 / 300)', 'remaining time': '13:15:36', 'throughput': '7314.59', 'gpu_mem_free': '7825MB', 'step': 16} [Step 16 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31093, 31093] → Tgt Spa: ['0.350', '0.350'] [Step 16 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [16838, 16840, 16841] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 16 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [21998, 21999] → Tgt Spa: ['1.000', '0.350'] [Step 16 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [21998, 21999] → Tgt Spa: ['1.000', '0.350'] [Step 16 / Rank 4] Tasks: ['Single QA'] | Lens: [46638] → Tgt Spa: ['0.350'] [Step 16 / Rank 5] Tasks: ['Single QA'] | Lens: [46638] → Tgt Spa: ['0.350'] [Step 16 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [16838, 16840, 16841] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 16 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31093, 31093] → Tgt Spa: ['0.350', '0.350'] [Step 16 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60178] → Tgt Spa: ['1.000'] [Step 16 / Rank 5] Tasks: ['Code'] | Lens: [60706] → Tgt Spa: ['1.000'] [Step 16 / Rank 4] Tasks: ['Code'] | Lens: [60706] → Tgt Spa: ['1.000'] [Step 16 / 
Rank 0] Tasks: ['Single QA'] | Lens: [40029] → Tgt Spa: ['0.350'] [Step 16 / Rank 6] Tasks: ['Code'] | Lens: [41330] → Tgt Spa: ['1.000'] [Step 16 / Rank 7] Tasks: ['Code'] | Lens: [41330] → Tgt Spa: ['1.000'] [Step 16 / Rank 1] Tasks: ['Single QA'] | Lens: [40029] → Tgt Spa: ['0.350'] [Step 16 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60178] → Tgt Spa: ['1.000'] [Step 16 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24140, 24141] → Tgt Spa: ['1.000', '1.000'] [Step 16 / Rank 6] Tasks: ['Single QA'] | Lens: [45896] → Tgt Spa: ['0.350'] [Step 16 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24140, 24141] → Tgt Spa: ['1.000', '1.000'] [Step 16 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59943] → Tgt Spa: ['1.000'] [Step 16 / Rank 1] Tasks: ['Single QA'] | Lens: [37121] → Tgt Spa: ['0.350'] [Step 16 / Rank 7] Tasks: ['Single QA'] | Lens: [45896] → Tgt Spa: ['0.350'] [Step 16 / Rank 0] Tasks: ['Single QA'] | Lens: [37121] → Tgt Spa: ['0.350'] [Step 16 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59943] → Tgt Spa: ['1.000'] [Step 16 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [64234] → Tgt Spa: ['1.000'] [Step 16 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [54644] → Tgt Spa: ['1.000'] [Step 16 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [54644] → Tgt Spa: ['1.000'] [Step 16 / Rank 3] Tasks: ['Single QA'] | Lens: [65021] → Tgt Spa: ['0.350'] [Step 16 / Rank 6] Tasks: ['Single QA'] | Lens: [51552] → Tgt Spa: ['0.350'] [Step 16 / Rank 7] Tasks: ['Single QA'] | Lens: [51552] → Tgt Spa: ['0.350'] [Step 16 / Rank 2] Tasks: ['Single QA'] | Lens: [65021] → Tgt Spa: ['0.350'] [Step 16 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [64234] → Tgt Spa: ['1.000'] [Step 16 / Rank 5] Tasks: ['Code'] | Lens: [38959] → Tgt Spa: ['1.000'] [Step 16 / Rank 6] Tasks: ['Code'] | Lens: [64736] → Tgt Spa: ['1.000'] [Step 16 / Rank 0] Tasks: ['Single QA'] | Lens: [36772] → Tgt Spa: ['0.350'] [Step 16 / 
Rank 4] Tasks: ['Code'] | Lens: [38959] → Tgt Spa: ['1.000'] [Step 16 / Rank 3] Tasks: ['Code'] | Lens: [35432] → Tgt Spa: ['1.000'] [Step 16 / Rank 2] Tasks: ['Code'] | Lens: [35432] → Tgt Spa: ['1.000'] [Step 16 / Rank 7] Tasks: ['Code'] | Lens: [64736] → Tgt Spa: ['1.000'] [Step 16 / Rank 1] Tasks: ['Single QA'] | Lens: [36772] → Tgt Spa: ['0.350'] [Step 16 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17280, 17291, 17281] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 16 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [47587] → Tgt Spa: ['1.000'] [Step 16 / Rank 0] Tasks: ['Summarization', 'Single QA'] | Lens: [22593, 22574] → Tgt Spa: ['1.000', '0.350'] [Step 16 / Rank 1] Tasks: ['Summarization', 'Single QA'] | Lens: [22593, 22574] → Tgt Spa: ['1.000', '0.350'] [Step 16 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24438, 24438] → Tgt Spa: ['1.000', '1.000'] [Step 16 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17280, 17291, 17281] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 16 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24438, 24438] → Tgt Spa: ['1.000', '1.000'] [Step 16 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [47587] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:09:15,492 >> @ 16 | Loss: 2.1270 | LM: 2.0411 | Reg: 0.0860 | Spa(Avg): 0.475 [INFO|lh_trainer.py:797] 2026-02-16 19:09:15,492 >> Statistic -> Code | Spa: 0.544 | Tgt: 1.000 | Z-Loss: 0.081 | [INFO|lh_trainer.py:797] 2026-02-16 19:09:15,492 >> Statistic -> In-Context | Spa: 0.450 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:09:15,492 >> Statistic -> MultiHop | Spa: 0.515 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:09:15,493 >> Statistic -> Single | Spa: 0.442 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:09:15,493 >> Statistic -> Summarization | Spa: 0.442 | Tgt: 1.000 | Z-Loss: 0.139 | [INFO|lh_trainer.py:810] 
2026-02-16 19:09:15,495 >> [Micro-Log] {"loss": 2.1270460449159145, "lm_loss": 2.0410929654414454, "reg_loss": 0.08595310409630959, "model_sparsity(avg)": 0.4751157611608505, "Spa-Summarization sparsity": 0.44166667461395265, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13853043913841248, "Spa-Single QA sparsity": 0.441919207572937, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.053843932480297306, "Spa-In-Context Learning sparsity": 0.45000001788139343, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12062912434339523, "Spa-Code sparsity": 0.5436508229800633, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08060165175369807, "Spa-MultiHop QA sparsity": 0.5154320961899228, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0516434946977016, "step": 16, "current_tau": 1.4903154373168945, "lambda1 Single QA": 0.4765625, "lambda2 MultiHop QA": 0.2392578125, "lambda3 Summarization": 0.039794921875, "lambda4 Code": 0.1357421875} [INFO|lh_trainer.py:331] 2026-02-16 19:09:31,726 >> {'loss': 12.7623, 'grad_norm': 1.3970224857330322, 'learning_rate': 0.00013333333333333334, 'epoch': 0.01790416008425487, 'num_input_tokens_seen': 42483202, 'completed': '5.67% (17 / 300)', 'remaining time': '13:14:41', 'throughput': '6871.56', 'gpu_mem_free': '12699MB', 'step': 17} [Step 17 / Rank 6] Tasks: ['Single QA'] | Lens: [41474] → Tgt Spa: ['0.350'] [Step 17 / Rank 3] Tasks: ['Single QA'] | Lens: [53241] → Tgt Spa: ['0.350'] [Step 17 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [22928, 22938] → Tgt Spa: ['1.000', '1.000'] [Step 17 / Rank 0] Tasks: ['Code'] | Lens: [48863] → Tgt Spa: ['1.000'] [Step 17 / Rank 2] Tasks: ['Single QA'] | Lens: [53241] → Tgt Spa: ['0.350'] [Step 17 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [22928, 22938] → Tgt Spa: ['1.000', '1.000'] [Step 17 / Rank 7] Tasks: ['Single QA'] | Lens: [41474] → Tgt 
Spa: ['0.350'] [Step 17 / Rank 1] Tasks: ['Code'] | Lens: [48863] → Tgt Spa: ['1.000'] [Step 17 / Rank 3] Tasks: ['Single QA'] | Lens: [51210] → Tgt Spa: ['0.350'] [Step 17 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25870, 25872] → Tgt Spa: ['1.000', '0.350'] [Step 17 / Rank 2] Tasks: ['Single QA'] | Lens: [51210] → Tgt Spa: ['0.350'] [Step 17 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [28249, 28250] → Tgt Spa: ['0.350', '0.350'] [Step 17 / Rank 7] Tasks: ['Single QA'] | Lens: [56495] → Tgt Spa: ['0.350'] [Step 17 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25870, 25872] → Tgt Spa: ['1.000', '0.350'] [Step 17 / Rank 6] Tasks: ['Single QA'] | Lens: [56495] → Tgt Spa: ['0.350'] [Step 17 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [28249, 28250] → Tgt Spa: ['0.350', '0.350'] [Step 17 / Rank 4] Tasks: ['Single QA'] | Lens: [36067] → Tgt Spa: ['0.350'] [Step 17 / Rank 5] Tasks: ['Single QA'] | Lens: [36067] → Tgt Spa: ['0.350'] [Step 17 / Rank 0] Tasks: ['Single QA'] | Lens: [39996] → Tgt Spa: ['0.350'] [Step 17 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [28604, 28604] → Tgt Spa: ['0.350', '0.350'] [Step 17 / Rank 3] Tasks: ['Code', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [10633, 10626, 10635, 10637, 10631, 10636] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 17 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [28604, 28604] → Tgt Spa: ['0.350', '0.350'] [Step 17 / Rank 2] Tasks: ['Code', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [10633, 10626, 10635, 10637, 10631, 10636] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 17 / Rank 1] Tasks: ['Single QA'] | Lens: [39996] → Tgt Spa: ['0.350'] [Step 17 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55867] → Tgt Spa: ['1.000'] [Step 17 / Rank 7] Tasks: ['Single QA'] | Lens: [56681] → Tgt Spa: ['0.350'] [Step 17 / Rank 5] Tasks: ['In-Context Learning'] | Lens: 
[55867] → Tgt Spa: ['1.000'] [Step 17 / Rank 3] Tasks: ['Single QA'] | Lens: [62394] → Tgt Spa: ['0.350'] [Step 17 / Rank 6] Tasks: ['Single QA'] | Lens: [56681] → Tgt Spa: ['0.350'] [Step 17 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [51361] → Tgt Spa: ['1.000'] [Step 17 / Rank 2] Tasks: ['Single QA'] | Lens: [62394] → Tgt Spa: ['0.350'] [Step 17 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [51361] → Tgt Spa: ['1.000'] [Step 17 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59541] → Tgt Spa: ['1.000'] [Step 17 / Rank 6] Tasks: ['Code'] | Lens: [40567] → Tgt Spa: ['1.000'] [Step 17 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59541] → Tgt Spa: ['1.000'] [Step 17 / Rank 7] Tasks: ['Code'] | Lens: [40567] → Tgt Spa: ['1.000'] [Step 17 / Rank 5] Tasks: ['Single QA'] | Lens: [57057] → Tgt Spa: ['0.350'] [Step 17 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25117, 25117] → Tgt Spa: ['0.350', '1.000'] [Step 17 / Rank 4] Tasks: ['Single QA'] | Lens: [57057] → Tgt Spa: ['0.350'] [Step 17 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25117, 25117] → Tgt Spa: ['0.350', '1.000'] [Step 17 / Rank 4] Tasks: ['In-Context Learning', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization'] | Lens: [3095, 3096, 3096, 3099, 3098, 3097, 3098, 3104, 3098, 3099, 3099, 3100, 3117, 3106, 3100, 3100, 3100, 3101, 3118, 3102, 3120] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 17 / Rank 6] Tasks: ['Single QA'] | Lens: [40243] → Tgt Spa: ['0.350'] [Step 17 / Rank 0] Tasks: ['Single QA', 'Summarization'] | Lens: [22899, 22919] → Tgt Spa: 
['0.350', '1.000'] [Step 17 / Rank 5] Tasks: ['In-Context Learning', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization'] | Lens: [3095, 3096, 3096, 3099, 3098, 3097, 3098, 3104, 3098, 3099, 3099, 3100, 3117, 3106, 3100, 3100, 3100, 3101, 3118, 3102, 3120] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 17 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'Single QA'] | Lens: [3502, 3502, 3490, 3483, 3486, 3485, 3491, 3487, 3486, 3486, 3487, 3487, 3487, 3489, 3489, 3488, 3488, 3489] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 17 / Rank 1] Tasks: ['Single QA', 'Summarization'] | Lens: [22899, 22919] → Tgt Spa: ['0.350', '1.000'] [Step 17 / Rank 7] Tasks: ['Single QA'] | Lens: [40243] → Tgt Spa: ['0.350'] [Step 17 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'Single QA'] | Lens: [3502, 3502, 3490, 3483, 3486, 3485, 3491, 3487, 3486, 3486, 3487, 3487, 3487, 3489, 3489, 3488, 3488, 3489] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', 
'0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:11:53,350 >> @ 17 | Loss: 2.2525 | LM: 2.1688 | Reg: 0.0838 | Spa(Avg): 0.475 [INFO|lh_trainer.py:797] 2026-02-16 19:11:53,350 >> Statistic -> Code | Spa: 0.475 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-16 19:11:53,350 >> Statistic -> In-Context | Spa: 0.481 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:11:53,350 >> Statistic -> MultiHop | Spa: 0.510 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:11:53,350 >> Statistic -> Single | Spa: 0.484 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:11:53,350 >> Statistic -> Summarization | Spa: 0.458 | Tgt: 1.000 | Z-Loss: 0.128 | [INFO|lh_trainer.py:810] 2026-02-16 19:11:53,352 >> [Micro-Log] {"loss": 2.252540085464716, "lm_loss": 2.1687545807411275, "reg_loss": 0.0837855141920348, "model_sparsity(avg)": 0.47524894028902054, "Spa-Code sparsity": 0.475, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09638197273015976, "Spa-Single QA sparsity": 0.48353910225409047, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06856243101948942, "Spa-In-Context Learning sparsity": 0.48076923993917614, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11416003795770499, "Spa-Summarization sparsity": 0.4583333233992259, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12756850322087607, "Spa-MultiHop QA sparsity": 0.5095486119389534, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04931540583493188, "step": 17, "current_tau": 1.4890761375427246, "lambda1 Single QA": 0.478515625, "lambda2 MultiHop QA": 0.2392578125, "lambda3 Summarization": 0.0400390625, "lambda4 Code": 0.1357421875} [INFO|lh_trainer.py:331] 2026-02-16 19:12:06,136 >> {'loss': 13.5152, 'grad_norm': 1.0275459289550781, 
'learning_rate': 0.00014166666666666668, 'epoch': 0.018957345971563982, 'num_input_tokens_seen': 44983536, 'completed': '6.00% (18 / 300)', 'remaining time': '13:08:12', 'throughput': '8096.42', 'gpu_mem_free': '13197MB', 'step': 18} [Step 18 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41729] → Tgt Spa: ['1.000'] [Step 18 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24010, 24030] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32632, 32633] → Tgt Spa: ['0.350', '0.350'] [Step 18 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [26108, 26107] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32632, 32633] → Tgt Spa: ['0.350', '0.350'] [Step 18 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [26108, 26107] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41729] → Tgt Spa: ['1.000'] [Step 18 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24010, 24030] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 5] Tasks: ['Single QA'] | Lens: [62830] → Tgt Spa: ['0.350'] [Step 18 / Rank 3] Tasks: ['Summarization'] | Lens: [48837] → Tgt Spa: ['1.000'] [Step 18 / Rank 4] Tasks: ['Single QA'] | Lens: [62830] → Tgt Spa: ['0.350'] [Step 18 / Rank 2] Tasks: ['Summarization'] | Lens: [48837] → Tgt Spa: ['1.000'] [Step 18 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32700, 32700] → Tgt Spa: ['0.350', '0.350'] [Step 18 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32700, 32700] → Tgt Spa: ['0.350', '0.350'] [Step 18 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31615, 31616] → Tgt Spa: ['0.350', '0.350'] [Step 18 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31615, 31616] → Tgt Spa: ['0.350', '0.350'] [Step 18 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [28677, 28669] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 0] Tasks: ['Single QA'] | Lens: [56618] → Tgt Spa: ['0.350'] [Step 18 / Rank 4] Tasks: 
['Summarization', 'In-Context Learning'] | Lens: [24715, 24702] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24715, 24702] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [28677, 28669] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 6] Tasks: ['Code'] | Lens: [54356] → Tgt Spa: ['1.000'] [Step 18 / Rank 7] Tasks: ['Code'] | Lens: [54356] → Tgt Spa: ['1.000'] [Step 18 / Rank 1] Tasks: ['Single QA'] | Lens: [56618] → Tgt Spa: ['0.350'] [Step 18 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43969] → Tgt Spa: ['1.000'] [Step 18 / Rank 2] Tasks: ['Single QA'] | Lens: [46105] → Tgt Spa: ['0.350'] [Step 18 / Rank 6] Tasks: ['Single QA'] | Lens: [62710] → Tgt Spa: ['0.350'] [Step 18 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [22039, 22048] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 7] Tasks: ['Single QA'] | Lens: [62710] → Tgt Spa: ['0.350'] [Step 18 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43969] → Tgt Spa: ['1.000'] [Step 18 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [22039, 22048] → Tgt Spa: ['1.000', '1.000'] [Step 18 / Rank 3] Tasks: ['Single QA'] | Lens: [46105] → Tgt Spa: ['0.350'] [Step 18 / Rank 6] Tasks: ['Single QA'] | Lens: [63885] → Tgt Spa: ['0.350'] [Step 18 / Rank 3] Tasks: ['Single QA'] | Lens: [59031] → Tgt Spa: ['0.350'] [Step 18 / Rank 5] Tasks: ['Single QA'] | Lens: [52200] → Tgt Spa: ['0.350'] [Step 18 / Rank 1] Tasks: ['Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [5487, 5494, 5487, 5490, 5507, 5489, 5498, 5490, 5492, 5491, 5492] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 18 / Rank 4] Tasks: ['Single QA'] | Lens: [52200] → Tgt Spa: ['0.350'] [Step 18 / Rank 2] Tasks: ['Single QA'] | Lens: 
[59031] → Tgt Spa: ['0.350'] [Step 18 / Rank 7] Tasks: ['Single QA'] | Lens: [63885] → Tgt Spa: ['0.350'] [Step 18 / Rank 0] Tasks: ['Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [5487, 5494, 5487, 5490, 5507, 5489, 5498, 5490, 5492, 5491, 5492] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 18 / Rank 6] Tasks: ['Single QA'] | Lens: [42391] → Tgt Spa: ['0.350'] [Step 18 / Rank 5] Tasks: ['Single QA'] | Lens: [43070] → Tgt Spa: ['0.350'] [Step 18 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38176] → Tgt Spa: ['1.000'] [Step 18 / Rank 4] Tasks: ['Single QA'] | Lens: [43070] → Tgt Spa: ['0.350'] [Step 18 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [38176] → Tgt Spa: ['1.000'] [Step 18 / Rank 1] Tasks: ['Single QA'] | Lens: [65070] → Tgt Spa: ['0.350'] [Step 18 / Rank 7] Tasks: ['Single QA'] | Lens: [42391] → Tgt Spa: ['0.350'] [Step 18 / Rank 0] Tasks: ['Single QA'] | Lens: [65070] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:14:46,310 >> @ 18 | Loss: 2.2479 | LM: 2.1474 | Reg: 0.1004 | Spa(Avg): 0.487 [INFO|lh_trainer.py:797] 2026-02-16 19:14:46,310 >> Statistic -> Code | Spa: 0.448 | Tgt: 1.000 | Z-Loss: 0.102 | [INFO|lh_trainer.py:797] 2026-02-16 19:14:46,310 >> Statistic -> In-Context | Spa: 0.470 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:14:46,310 >> Statistic -> MultiHop | Spa: 0.510 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:14:46,310 >> Statistic -> Single | Spa: 0.510 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:14:46,310 >> Statistic -> Summarization | Spa: 0.456 | Tgt: 1.000 | Z-Loss: 0.133 | [INFO|lh_trainer.py:810] 2026-02-16 19:14:46,312 >> [Micro-Log] {"loss": 2.247852044800917, "lm_loss": 2.147425356631478, "reg_loss": 
0.1004266949215283, "model_sparsity(avg)": 0.48747895533839863, "Spa-In-Context Learning sparsity": 0.46969696608456696, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11672014607624574, "Spa-Summarization sparsity": 0.4555555582046509, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13311843127012252, "Spa-Single QA sparsity": 0.510233926145654, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08786622993648052, "Spa-Code sparsity": 0.44841269084385466, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10225283567394529, "Spa-MultiHop QA sparsity": 0.5095486119389534, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04931540583493188, "step": 18, "current_tau": 1.4877641201019287, "lambda1 Single QA": 0.478515625, "lambda2 MultiHop QA": 0.2392578125, "lambda3 Summarization": 0.040283203125, "lambda4 Code": 0.1357421875} [INFO|lh_trainer.py:331] 2026-02-16 19:15:13,092 >> {'loss': 13.4871, 'grad_norm': 1.1338393688201904, 'learning_rate': 0.00015, 'epoch': 0.020010531858873092, 'num_input_tokens_seen': 47556326, 'completed': '6.33% (19 / 300)', 'remaining time': '13:10:09', 'throughput': '6880.72', 'gpu_mem_free': '4647MB', 'step': 19} [Step 19 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [51514] → Tgt Spa: ['1.000'] [Step 19 / Rank 7] Tasks: ['Single QA'] | Lens: [47178] → Tgt Spa: ['0.350'] [Step 19 / Rank 4] Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [20792, 20793, 20790] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 19 / Rank 2] Tasks: ['Summarization', 'Summarization'] | Lens: [28321, 28321] → Tgt Spa: ['1.000', '1.000'] [Step 19 / Rank 3] Tasks: ['Summarization', 'Summarization'] | Lens: [28321, 28321] → Tgt Spa: ['1.000', '1.000'] [Step 19 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [51514] → Tgt Spa: ['1.000'] [Step 19 / Rank 6] Tasks: ['Single QA'] | Lens: [47178] → Tgt Spa: ['0.350'] [Step 19 / Rank 5] 
Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [20792, 20793, 20790] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 19 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24096, 24097] → Tgt Spa: ['1.000', '1.000'] [Step 19 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [59711] → Tgt Spa: ['1.000'] [Step 19 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24096, 24097] → Tgt Spa: ['1.000', '1.000'] [Step 19 / Rank 0] Tasks: ['Single QA'] | Lens: [57535] → Tgt Spa: ['0.350'] [Step 19 / Rank 1] Tasks: ['Single QA'] | Lens: [57535] → Tgt Spa: ['0.350'] [Step 19 / Rank 2] Tasks: ['Code'] | Lens: [41425] → Tgt Spa: ['1.000'] [Step 19 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [59711] → Tgt Spa: ['1.000'] [Step 19 / Rank 3] Tasks: ['Code'] | Lens: [41425] → Tgt Spa: ['1.000'] [Step 19 / Rank 7] Tasks: ['Single QA'] | Lens: [33760] → Tgt Spa: ['0.350'] [Step 19 / Rank 1] Tasks: ['Single QA'] | Lens: [47095] → Tgt Spa: ['0.350'] [Step 19 / Rank 2] Tasks: ['Single QA'] | Lens: [52013] → Tgt Spa: ['0.350'] [Step 19 / Rank 5] Tasks: ['Single QA'] | Lens: [40544] → Tgt Spa: ['0.350'] [Step 19 / Rank 3] Tasks: ['Single QA'] | Lens: [52013] → Tgt Spa: ['0.350'] [Step 19 / Rank 6] Tasks: ['Single QA'] | Lens: [33760] → Tgt Spa: ['0.350'] [Step 19 / Rank 4] Tasks: ['Single QA'] | Lens: [40544] → Tgt Spa: ['0.350'] [Step 19 / Rank 0] Tasks: ['Single QA'] | Lens: [47095] → Tgt Spa: ['0.350'] [Step 19 / Rank 5] Tasks: ['Single QA'] | Lens: [55094] → Tgt Spa: ['0.350'] [Step 19 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39749] → Tgt Spa: ['1.000'] [Step 19 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39749] → Tgt Spa: ['1.000'] [Step 19 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25602, 25622] → Tgt Spa: ['1.000', '1.000'] [Step 19 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25602, 25622] → Tgt Spa: ['1.000', '1.000'] [Step 19 / Rank 7] Tasks: ['In-Context Learning'] | 
Lens: [56311] → Tgt Spa: ['1.000'] [Step 19 / Rank 4] Tasks: ['Single QA'] | Lens: [55094] → Tgt Spa: ['0.350'] [Step 19 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56311] → Tgt Spa: ['1.000'] [Step 19 / Rank 4] Tasks: ['Single QA'] | Lens: [36916] → Tgt Spa: ['0.350'] [Step 19 / Rank 3] Tasks: ['Single QA', 'Summarization', 'Single QA'] | Lens: [21059, 21078, 21061] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 19 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'MultiHop QA'] | Lens: [16083, 16093, 16097, 16098] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 19 / Rank 1] Tasks: ['Single QA'] | Lens: [48785] → Tgt Spa: ['0.350'] [Step 19 / Rank 5] Tasks: ['Single QA'] | Lens: [36916] → Tgt Spa: ['0.350'] [Step 19 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'MultiHop QA'] | Lens: [16083, 16093, 16097, 16098] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 19 / Rank 2] Tasks: ['Single QA', 'Summarization', 'Single QA'] | Lens: [21059, 21078, 21061] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 19 / Rank 0] Tasks: ['Single QA'] | Lens: [48785] → Tgt Spa: ['0.350'] [Step 19 / Rank 3] Tasks: ['Single QA'] | Lens: [54344] → Tgt Spa: ['0.350'] [Step 19 / Rank 4] Tasks: ['Single QA'] | Lens: [49596] → Tgt Spa: ['0.350'] [Step 19 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [19427, 19428, 19427] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 19 / Rank 5] Tasks: ['Single QA'] | Lens: [49596] → Tgt Spa: ['0.350'] [Step 19 / Rank 0] Tasks: ['Single QA'] | Lens: [38447] → Tgt Spa: ['0.350'] [Step 19 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [19427, 19428, 19427] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 19 / Rank 1] Tasks: ['Single QA'] | Lens: [38447] → Tgt Spa: ['0.350'] [Step 19 / Rank 2] Tasks: ['Single QA'] | Lens: [54344] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:17:28,782 >> @ 19 | Loss: 2.1423 | LM: 2.0415 | Reg: 0.1009 | Spa(Avg): 0.513 [INFO|lh_trainer.py:797] 2026-02-16 19:17:28,782 >> Statistic 
-> Code | Spa: 0.516 | Tgt: 1.000 | Z-Loss: 0.087 | [INFO|lh_trainer.py:797] 2026-02-16 19:17:28,782 >> Statistic -> In-Context | Spa: 0.497 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:17:28,782 >> Statistic -> MultiHop | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:17:28,783 >> Statistic -> Single | Spa: 0.542 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:17:28,783 >> Statistic -> Summarization | Spa: 0.448 | Tgt: 1.000 | Z-Loss: 0.134 | [INFO|lh_trainer.py:810] 2026-02-16 19:17:28,784 >> [Micro-Log] {"loss": 2.142331298440695, "lm_loss": 2.041477439304193, "reg_loss": 0.10085384113093217, "model_sparsity(avg)": 0.5131172810991605, "Spa-In-Context Learning sparsity": 0.4965277761220932, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11068800929933786, "Spa-Single QA sparsity": 0.5424836523392621, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.1013057464185883, "Spa-Summarization sparsity": 0.4479166567325592, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13433113507926464, "Spa-Code sparsity": 0.5162037114302317, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08730217069387436, "Spa-MultiHop QA sparsity": 0.4444444179534912, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02585665136575699, "step": 19, "current_tau": 1.486379623413086, "lambda1 Single QA": 0.478515625, "lambda2 MultiHop QA": 0.240234375, "lambda3 Summarization": 0.04052734375, "lambda4 Code": 0.13671875} [INFO|lh_trainer.py:331] 2026-02-16 19:17:48,957 >> {'loss': 12.854, 'grad_norm': 1.1492358446121216, 'learning_rate': 0.00015833333333333332, 'epoch': 0.021063717746182202, 'num_input_tokens_seen': 49984930, 'completed': '6.67% (20 / 300)', 'remaining time': '13:04:21', 'throughput': '7790.77', 'gpu_mem_free': '13449MB', 'step': 20} [Step 20 / Rank 2] Tasks: ['Single QA'] 
| Lens: [57131] → Tgt Spa: ['0.350'] [Step 20 / Rank 3] Tasks: ['Single QA'] | Lens: [57131] → Tgt Spa: ['0.350'] [Step 20 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23966, 23966] → Tgt Spa: ['1.000', '1.000'] [Step 20 / Rank 6] Tasks: ['Single QA'] | Lens: [58642] → Tgt Spa: ['0.350'] [Step 20 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23966, 23966] → Tgt Spa: ['1.000', '1.000'] [Step 20 / Rank 0] Tasks: ['Code'] | Lens: [38476] → Tgt Spa: ['1.000'] [Step 20 / Rank 1] Tasks: ['Code'] | Lens: [38476] → Tgt Spa: ['1.000'] [Step 20 / Rank 7] Tasks: ['Single QA'] | Lens: [58642] → Tgt Spa: ['0.350'] [Step 20 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16491, 16494, 16507] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 20 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25531, 25532] → Tgt Spa: ['1.000', '0.350'] [Step 20 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16491, 16494, 16507] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 20 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25531, 25532] → Tgt Spa: ['1.000', '0.350'] [Step 20 / Rank 0] Tasks: ['Single QA'] | Lens: [47133] → Tgt Spa: ['0.350'] [Step 20 / Rank 1] Tasks: ['Single QA'] | Lens: [47133] → Tgt Spa: ['0.350'] [Step 20 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60549] → Tgt Spa: ['1.000'] [Step 20 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60549] → Tgt Spa: ['1.000'] [Step 20 / Rank 5] Tasks: ['Code'] | Lens: [39779] → Tgt Spa: ['1.000'] [Step 20 / Rank 4] Tasks: ['Code'] | Lens: [39779] → Tgt Spa: ['1.000'] [Step 20 / Rank 3] Tasks: ['Single QA'] | Lens: [36025] → Tgt Spa: ['0.350'] [Step 20 / Rank 6] Tasks: ['Single QA'] | Lens: [54809] → Tgt Spa: ['0.350'] [Step 20 / Rank 0] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 
'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'In-Context Learning'] | Lens: [2710, 2710, 2710, 2716, 2711, 2711, 2729, 2713, 2730, 2711, 2713, 2714, 2730, 2714, 2731, 2731, 2715, 2732, 2721, 2717, 2716, 2721, 2715, 2715] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 20 / Rank 1] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'In-Context Learning'] | Lens: [2710, 2710, 2710, 2716, 2711, 2711, 2729, 2713, 2730, 2711, 2713, 2714, 2730, 2714, 2731, 2731, 2715, 2732, 2721, 2717, 2716, 2721, 2715, 2715] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 20 / Rank 2] Tasks: ['Single QA'] | Lens: [36025] → Tgt Spa: ['0.350'] [Step 20 / Rank 7] Tasks: ['Single QA'] | Lens: [54809] → Tgt Spa: ['0.350'] [Step 20 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16530, 16530, 16519] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 20 / Rank 3] Tasks: ['Single QA'] | Lens: [57507] → Tgt Spa: ['0.350'] [Step 20 / Rank 2] Tasks: ['Single QA'] | Lens: [57507] → Tgt Spa: ['0.350'] [Step 20 / Rank 6] Tasks: ['Single QA'] | Lens: [44764] → Tgt Spa: ['0.350'] [Step 20 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [19358, 19360, 19363] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 20 / Rank 0] 
Tasks: ['Code', 'Code', 'Code'] | Lens: [19358, 19360, 19363] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 20 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16530, 16530, 16519] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 20 / Rank 7] Tasks: ['Single QA'] | Lens: [44764] → Tgt Spa: ['0.350'] [Step 20 / Rank 5] Tasks: ['Single QA'] | Lens: [51075] → Tgt Spa: ['0.350'] [Step 20 / Rank 7] Tasks: ['Single QA'] | Lens: [55863] → Tgt Spa: ['0.350'] [Step 20 / Rank 6] Tasks: ['Single QA'] | Lens: [55863] → Tgt Spa: ['0.350'] [Step 20 / Rank 3] Tasks: ['Code', 'Single QA', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [9817, 9813, 9815, 9827, 9845, 9852] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 20 / Rank 1] Tasks: ['Single QA'] | Lens: [38829] → Tgt Spa: ['0.350'] [Step 20 / Rank 0] Tasks: ['Single QA'] | Lens: [38829] → Tgt Spa: ['0.350'] [Step 20 / Rank 2] Tasks: ['Code', 'Single QA', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [9817, 9813, 9815, 9827, 9845, 9852] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 20 / Rank 4] Tasks: ['Single QA'] | Lens: [51075] → Tgt Spa: ['0.350'] [Step 20 / Rank 5] Tasks: ['Single QA'] | Lens: [57370] → Tgt Spa: ['0.350'] [Step 20 / Rank 0] Tasks: ['Single QA'] | Lens: [52570] → Tgt Spa: ['0.350'] [Step 20 / Rank 4] Tasks: ['Single QA'] | Lens: [57370] → Tgt Spa: ['0.350'] [Step 20 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [52418] → Tgt Spa: ['1.000'] [Step 20 / Rank 1] Tasks: ['Single QA'] | Lens: [52570] → Tgt Spa: ['0.350'] [Step 20 / Rank 6] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Single QA'] | Lens: [5168, 5168, 5170, 5172, 5171, 5171, 5189, 5172, 5173, 5192, 5174, 5176] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 20 / Rank 2] 
Tasks: ['In-Context Learning'] | Lens: [52418] → Tgt Spa: ['1.000'] [Step 20 / Rank 7] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Single QA'] | Lens: [5168, 5168, 5170, 5172, 5171, 5171, 5189, 5172, 5173, 5192, 5174, 5176] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:20:20,340 >> @ 20 | Loss: 1.9761 | LM: 1.8942 | Reg: 0.0819 | Spa(Avg): 0.466 [INFO|lh_trainer.py:797] 2026-02-16 19:20:20,340 >> Statistic -> Code | Spa: 0.490 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-16 19:20:20,340 >> Statistic -> In-Context | Spa: 0.477 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:20:20,340 >> Statistic -> MultiHop | Spa: 0.497 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:20:20,340 >> Statistic -> Single | Spa: 0.480 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:20:20,340 >> Statistic -> Summarization | Spa: 0.487 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:810] 2026-02-16 19:20:20,342 >> [Micro-Log] {"loss": 1.9760875118275483, "lm_loss": 1.894169093420108, "reg_loss": 0.08191841437170903, "model_sparsity(avg)": 0.4657600298523903, "Spa-Code sparsity": 0.489814817905426, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09332145353158315, "Spa-Single QA sparsity": 0.48032407462596893, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07110162467385332, "Spa-MultiHop QA sparsity": 0.4965277761220932, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.044014554005116224, "Spa-Summarization sparsity": 0.48737374219027435, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11858123676343398, "Spa-In-Context Learning sparsity": 0.4768518606821696, 
"Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11593619071775013, "step": 20, "current_tau": 1.4849231243133545, "lambda1 Single QA": 0.478515625, "lambda2 MultiHop QA": 0.240234375, "lambda3 Summarization": 0.041015625, "lambda4 Code": 0.13671875} [INFO|lh_trainer.py:331] 2026-02-16 19:20:42,553 >> {'loss': 11.8565, 'grad_norm': 1.051015019416809, 'learning_rate': 0.00016666666666666666, 'epoch': 0.022116903633491312, 'num_input_tokens_seen': 52475706, 'completed': '7.00% (21 / 300)', 'remaining time': '13:02:46', 'throughput': '7174.06', 'gpu_mem_free': '8985MB', 'step': 21} [Step 21 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24952, 24971] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 2] Tasks: ['Single QA'] | Lens: [38132] → Tgt Spa: ['0.350'] [Step 21 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19226, 19215, 19227] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 21 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19226, 19215, 19227] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 21 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24952, 24971] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 3] Tasks: ['Single QA'] | Lens: [38132] → Tgt Spa: ['0.350'] [Step 21 / Rank 7] Tasks: ['Single QA'] | Lens: [45269] → Tgt Spa: ['0.350'] [Step 21 / Rank 6] Tasks: ['Single QA'] | Lens: [45269] → Tgt Spa: ['0.350'] [Step 21 / Rank 4] Tasks: ['Code'] | Lens: [36443] → Tgt Spa: ['1.000'] [Step 21 / Rank 5] Tasks: ['Code'] | Lens: [36443] → Tgt Spa: ['1.000'] [Step 21 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [17186, 17187, 17207] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 21 / Rank 7] Tasks: ['Code', 'Summarization'] | Lens: [22649, 22662] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 2] Tasks: ['Single QA'] | Lens: [65260] → Tgt Spa: ['0.350'] [Step 21 / Rank 3] Tasks: ['Single QA'] | Lens: [65260] → Tgt Spa: ['0.350'] [Step 21 
/ Rank 0] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [17186, 17187, 17207] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 21 / Rank 6] Tasks: ['Code', 'Summarization'] | Lens: [22649, 22662] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 0] Tasks: ['Single QA'] | Lens: [45082] → Tgt Spa: ['0.350'] [Step 21 / Rank 5] Tasks: ['Single QA'] | Lens: [43133] → Tgt Spa: ['0.350'] [Step 21 / Rank 4] Tasks: ['Single QA'] | Lens: [43133] → Tgt Spa: ['0.350'] [Step 21 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [19742, 19743, 19743] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 21 / Rank 1] Tasks: ['Single QA'] | Lens: [45082] → Tgt Spa: ['0.350'] [Step 21 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [25617, 25626] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [19742, 19743, 19743] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 21 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [25617, 25626] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26001, 26001] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 1] Tasks: ['Single QA'] | Lens: [39375] → Tgt Spa: ['0.350'] [Step 21 / Rank 3] Tasks: ['Single QA'] | Lens: [57714] → Tgt Spa: ['0.350'] [Step 21 / Rank 2] Tasks: ['Single QA'] | Lens: [57714] → Tgt Spa: ['0.350'] [Step 21 / Rank 0] Tasks: ['Single QA'] | Lens: [39375] → Tgt Spa: ['0.350'] [Step 21 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [21565, 21565, 21574] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 21 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [21565, 21565, 21574] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 21 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26001, 26001] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 1] Tasks: ['Single QA'] | Lens: [38851] → Tgt Spa: ['0.350'] 
[Step 21 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32254, 32255] → Tgt Spa: ['0.350', '0.350'] [Step 21 / Rank 0] Tasks: ['Single QA'] | Lens: [38851] → Tgt Spa: ['0.350'] [Step 21 / Rank 7] Tasks: ['Single QA'] | Lens: [57373] → Tgt Spa: ['0.350'] [Step 21 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [25126, 25134] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32254, 32255] → Tgt Spa: ['0.350', '0.350'] [Step 21 / Rank 6] Tasks: ['Single QA'] | Lens: [57373] → Tgt Spa: ['0.350'] [Step 21 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [25126, 25134] → Tgt Spa: ['1.000', '1.000'] [Step 21 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19231, 19231, 19242] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 21 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19231, 19231, 19242] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 21 / Rank 2] Tasks: ['Single QA'] | Lens: [35120] → Tgt Spa: ['0.350'] [Step 21 / Rank 7] Tasks: ['Single QA'] | Lens: [61563] → Tgt Spa: ['0.350'] [Step 21 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60538] → Tgt Spa: ['1.000'] [Step 21 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60538] → Tgt Spa: ['1.000'] [Step 21 / Rank 6] Tasks: ['Single QA'] | Lens: [61563] → Tgt Spa: ['0.350'] [Step 21 / Rank 3] Tasks: ['Single QA'] | Lens: [35120] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:23:05,317 >> @ 21 | Loss: 2.0590 | LM: 1.9641 | Reg: 0.0949 | Spa(Avg): 0.513 [INFO|lh_trainer.py:797] 2026-02-16 19:23:05,318 >> Statistic -> Code | Spa: 0.495 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-16 19:23:05,318 >> Statistic -> In-Context | Spa: 0.517 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:23:05,318 >> Statistic -> MultiHop | Spa: 0.497 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:23:05,318 >> Statistic -> Single | Spa: 0.517 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-16 19:23:05,318 >> Statistic -> Summarization | Spa: 0.519 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:810] 2026-02-16 19:23:05,320 >> [Micro-Log] {"loss": 2.058984519292911, "lm_loss": 1.964091682806611, "reg_loss": 0.09489284393688042, "model_sparsity(avg)": 0.5132137375573317, "Spa-Summarization sparsity": 0.5185185207260979, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10473021699322595, "Spa-Code sparsity": 0.4947916567325592, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09217580314725637, "Spa-Single QA sparsity": 0.5166666626930236, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08622753197948138, "Spa-In-Context Learning sparsity": 0.5173611044883728, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10660299938172102, "Spa-MultiHop QA sparsity": 0.4965277761220932, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.044014554005116224, "step": 21, "current_tau": 1.4833950996398926, "lambda1 Single QA": 0.478515625, "lambda2 MultiHop QA": 0.240234375, "lambda3 Summarization": 0.041259765625, "lambda4 Code": 0.13671875} [INFO|lh_trainer.py:331] 2026-02-16 19:23:30,023 >> {'loss': 12.3539, 'grad_norm': 1.0687230825424194, 'learning_rate': 0.000175, 'epoch': 0.02317008952080042, 'num_input_tokens_seen': 54931676, 'completed': '7.33% (22 / 300)', 'remaining time': '12:59:46', 'throughput': '7332.57', 'gpu_mem_free': '6631MB', 'step': 22} [Step 22 / Rank 6] Tasks: ['In-Context Learning', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'In-Context Learning', 'Code', 'Summarization', 'Code', 'Code'] | Lens: [6158, 6160, 6160, 6179, 6162, 6164, 6174, 6186, 6175, 6175] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 22 / Rank 7] Tasks: ['In-Context Learning', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 
'In-Context Learning', 'Code', 'Summarization', 'Code', 'Code'] | Lens: [6158, 6160, 6160, 6179, 6162, 6164, 6174, 6186, 6175, 6175] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 22 / Rank 2] Tasks: ['Code'] | Lens: [45100] → Tgt Spa: ['1.000'] [Step 22 / Rank 1] Tasks: ['Code', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [15564, 15559, 15565, 15572] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 22 / Rank 3] Tasks: ['Code'] | Lens: [45100] → Tgt Spa: ['1.000'] [Step 22 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [11989, 11992, 12000, 11994, 11994] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350'] [Step 22 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [11989, 11992, 12000, 11994, 11994] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350'] [Step 22 / Rank 0] Tasks: ['Code', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [15564, 15559, 15565, 15572] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 22 / Rank 1] Tasks: ['Single QA'] | Lens: [53530] → Tgt Spa: ['0.350'] [Step 22 / Rank 3] Tasks: ['Code', 'Single QA'] | Lens: [26268, 26261] → Tgt Spa: ['1.000', '0.350'] [Step 22 / Rank 4] Tasks: ['Single QA'] | Lens: [56512] → Tgt Spa: ['0.350'] [Step 22 / Rank 7] Tasks: ['Single QA'] | Lens: [45268] → Tgt Spa: ['0.350'] [Step 22 / Rank 0] Tasks: ['Single QA'] | Lens: [53530] → Tgt Spa: ['0.350'] [Step 22 / Rank 5] Tasks: ['Single QA'] | Lens: [56512] → Tgt Spa: ['0.350'] [Step 22 / Rank 6] Tasks: ['Single QA'] | Lens: [45268] → Tgt Spa: ['0.350'] [Step 22 / Rank 2] Tasks: ['Code', 'Single QA'] | Lens: [26268, 26261] → Tgt Spa: ['1.000', '0.350'] [Step 22 / Rank 6] Tasks: ['Single QA'] | Lens: [63736] → Tgt Spa: ['0.350'] [Step 22 / Rank 0] Tasks: ['Code'] | Lens: [38123] → Tgt Spa: ['1.000'] [Step 22 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38424] → Tgt Spa: ['1.000'] [Step 22 / Rank 4] 
Tasks: ['In-Context Learning'] | Lens: [38424] → Tgt Spa: ['1.000'] [Step 22 / Rank 2] Tasks: ['Single QA'] | Lens: [33634] → Tgt Spa: ['0.350'] [Step 22 / Rank 7] Tasks: ['Single QA'] | Lens: [63736] → Tgt Spa: ['0.350'] [Step 22 / Rank 1] Tasks: ['Code'] | Lens: [38123] → Tgt Spa: ['1.000'] [Step 22 / Rank 3] Tasks: ['Single QA'] | Lens: [33634] → Tgt Spa: ['0.350'] [Step 22 / Rank 4] Tasks: ['Single QA'] | Lens: [49268] → Tgt Spa: ['0.350'] [Step 22 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [36317] → Tgt Spa: ['1.000'] [Step 22 / Rank 3] Tasks: ['Single QA'] | Lens: [36871] → Tgt Spa: ['0.350'] [Step 22 / Rank 5] Tasks: ['Single QA'] | Lens: [49268] → Tgt Spa: ['0.350'] [Step 22 / Rank 2] Tasks: ['Single QA'] | Lens: [36871] → Tgt Spa: ['0.350'] [Step 22 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [36317] → Tgt Spa: ['1.000'] [Step 22 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32158, 32158] → Tgt Spa: ['0.350', '0.350'] [Step 22 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32158, 32158] → Tgt Spa: ['0.350', '0.350'] [Step 22 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [22391, 22385] → Tgt Spa: ['1.000', '1.000'] [Step 22 / Rank 6] Tasks: ['Single QA'] | Lens: [58663] → Tgt Spa: ['0.350'] [Step 22 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [22391, 22385] → Tgt Spa: ['1.000', '1.000'] [Step 22 / Rank 3] Tasks: ['Single QA'] | Lens: [44072] → Tgt Spa: ['0.350'] [Step 22 / Rank 7] Tasks: ['Single QA'] | Lens: [58663] → Tgt Spa: ['0.350'] [Step 22 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22775, 22774] → Tgt Spa: ['1.000', '1.000'] [Step 22 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22775, 22774] → Tgt Spa: ['1.000', '1.000'] [Step 22 / Rank 2] Tasks: ['Single QA'] | Lens: [44072] → Tgt Spa: ['0.350'] [Step 22 / Rank 1] Tasks: ['Single QA'] | Lens: [49789] → Tgt Spa: ['0.350'] [Step 22 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29216, 29217] 
→ Tgt Spa: ['0.350', '0.350'] [Step 22 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41893] → Tgt Spa: ['1.000'] [Step 22 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [24529, 24541] → Tgt Spa: ['0.350', '1.000'] [Step 22 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [24529, 24541] → Tgt Spa: ['0.350', '1.000'] [Step 22 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29216, 29217] → Tgt Spa: ['0.350', '0.350'] [Step 22 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41893] → Tgt Spa: ['1.000'] [Step 22 / Rank 0] Tasks: ['Single QA'] | Lens: [49789] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:25:49,957 >> @ 22 | Loss: 1.9516 | LM: 1.8628 | Reg: 0.0888 | Spa(Avg): 0.488 [INFO|lh_trainer.py:797] 2026-02-16 19:25:49,957 >> Statistic -> Code | Spa: 0.461 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:797] 2026-02-16 19:25:49,958 >> Statistic -> In-Context | Spa: 0.457 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:25:49,958 >> Statistic -> MultiHop | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:25:49,958 >> Statistic -> Single | Spa: 0.492 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:25:49,958 >> Statistic -> Summarization | Spa: 0.417 | Tgt: 1.000 | Z-Loss: 0.148 | [INFO|lh_trainer.py:810] 2026-02-16 19:25:49,960 >> [Micro-Log] {"loss": 1.9516284105678399, "lm_loss": 1.8628378383194406, "reg_loss": 0.08879057488714655, "model_sparsity(avg)": 0.48761573309699696, "Spa-Code sparsity": 0.4611111044883728, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10075842142105103, "Spa-Single QA sparsity": 0.49247684329748154, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07373385298221062, "Spa-MultiHop QA sparsity": 0.4583333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.030159572139382362, "Spa-In-Context Learning sparsity": 0.4565972164273262, "Spa-In-Context Learning 
target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1206335611641407, "Spa-Summarization sparsity": 0.4166666567325592, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1477898508310318, "step": 22, "current_tau": 1.4817960262298584, "lambda1 Single QA": 0.478515625, "lambda2 MultiHop QA": 0.240234375, "lambda3 Summarization": 0.04150390625, "lambda4 Code": 0.1376953125} [INFO|lh_trainer.py:331] 2026-02-16 19:26:07,616 >> {'loss': 11.7098, 'grad_norm': 1.1236428022384644, 'learning_rate': 0.00018333333333333334, 'epoch': 0.02422327540810953, 'num_input_tokens_seen': 57311266, 'completed': '7.67% (23 / 300)', 'remaining time': '12:54:49', 'throughput': '7549.76', 'gpu_mem_free': '11229MB', 'step': 23} [Step 23 / Rank 6] Tasks: ['Single QA'] | Lens: [52167] → Tgt Spa: ['0.350'] [Step 23 / Rank 7] Tasks: ['Single QA'] | Lens: [52167] → Tgt Spa: ['0.350'] [Step 23 / Rank 3] Tasks: ['Single QA'] | Lens: [49394] → Tgt Spa: ['0.350'] [Step 23 / Rank 0] Tasks: ['Single QA'] | Lens: [35805] → Tgt Spa: ['0.350'] [Step 23 / Rank 4] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [11813, 11837, 11837, 11832, 11833] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350'] [Step 23 / Rank 5] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [11813, 11837, 11837, 11832, 11833] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350'] [Step 23 / Rank 2] Tasks: ['Single QA'] | Lens: [49394] → Tgt Spa: ['0.350'] [Step 23 / Rank 1] Tasks: ['Single QA'] | Lens: [35805] → Tgt Spa: ['0.350'] [Step 23 / Rank 6] Tasks: ['Single QA'] | Lens: [41714] → Tgt Spa: ['0.350'] [Step 23 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [20057, 20053, 20072] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 23 / Rank 3] Tasks: ['Single QA'] | Lens: [40030] → Tgt Spa: ['0.350'] [Step 23 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [20057, 20053, 20072] → Tgt Spa: ['1.000', 
'1.000', '1.000'] [Step 23 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23102, 23103] → Tgt Spa: ['1.000', '0.350'] [Step 23 / Rank 2] Tasks: ['Single QA'] | Lens: [40030] → Tgt Spa: ['0.350'] [Step 23 / Rank 7] Tasks: ['Single QA'] | Lens: [41714] → Tgt Spa: ['0.350'] [Step 23 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23102, 23103] → Tgt Spa: ['1.000', '0.350'] [Step 23 / Rank 5] Tasks: ['Code'] | Lens: [57425] → Tgt Spa: ['1.000'] [Step 23 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43367] → Tgt Spa: ['1.000'] [Step 23 / Rank 6] Tasks: ['Single QA'] | Lens: [63798] → Tgt Spa: ['0.350'] [Step 23 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43367] → Tgt Spa: ['1.000'] [Step 23 / Rank 7] Tasks: ['Single QA'] | Lens: [63798] → Tgt Spa: ['0.350'] [Step 23 / Rank 0] Tasks: ['Single QA'] | Lens: [59933] → Tgt Spa: ['0.350'] [Step 23 / Rank 1] Tasks: ['Single QA'] | Lens: [59933] → Tgt Spa: ['0.350'] [Step 23 / Rank 4] Tasks: ['Code'] | Lens: [57425] → Tgt Spa: ['1.000'] [Step 23 / Rank 7] Tasks: ['Single QA'] | Lens: [65038] → Tgt Spa: ['0.350'] [Step 23 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [29884, 29893] → Tgt Spa: ['1.000', '1.000'] [Step 23 / Rank 2] Tasks: ['Single QA'] | Lens: [59227] → Tgt Spa: ['0.350'] [Step 23 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [37920] → Tgt Spa: ['1.000'] [Step 23 / Rank 3] Tasks: ['Single QA'] | Lens: [59227] → Tgt Spa: ['0.350'] [Step 23 / Rank 6] Tasks: ['Single QA'] | Lens: [65038] → Tgt Spa: ['0.350'] [Step 23 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [37920] → Tgt Spa: ['1.000'] [Step 23 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [29884, 29893] → Tgt Spa: ['1.000', '1.000'] [Step 23 / Rank 4] Tasks: ['Single QA'] | Lens: [43835] → Tgt Spa: ['0.350'] [Step 23 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'Single QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop 
QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [3204, 3200, 3203, 3203, 3204, 3204, 3206, 3222, 3211, 3207, 3206, 3224, 3224, 3208, 3208, 3208, 3214, 3210, 3210, 3227] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 23 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [22944, 22937] → Tgt Spa: ['1.000', '1.000'] [Step 23 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA', 'Single QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [3204, 3200, 3203, 3203, 3204, 3204, 3206, 3222, 3211, 3207, 3206, 3224, 3224, 3208, 3208, 3208, 3214, 3210, 3210, 3227] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 23 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [22944, 22937] → Tgt Spa: ['1.000', '1.000'] [Step 23 / Rank 1] Tasks: ['Code'] | Lens: [42650] → Tgt Spa: ['1.000'] [Step 23 / Rank 0] Tasks: ['Code'] | Lens: [42650] → Tgt Spa: ['1.000'] [Step 23 / Rank 5] Tasks: ['Single QA'] | Lens: [43835] → Tgt Spa: ['0.350'] [Step 23 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24192, 24193] → Tgt Spa: ['1.000', '1.000'] [Step 23 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57745] → Tgt Spa: ['1.000'] [Step 23 / Rank 2] Tasks: ['Single QA'] | Lens: [65140] → Tgt Spa: ['0.350'] [Step 23 / Rank 1] Tasks: ['Code'] | Lens: [36571] → Tgt Spa: ['1.000'] [Step 23 / Rank 3] Tasks: ['Single QA'] | Lens: [65140] → Tgt Spa: ['0.350'] [Step 23 / Rank 
5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24192, 24193] → Tgt Spa: ['1.000', '1.000'] [Step 23 / Rank 0] Tasks: ['Code'] | Lens: [36571] → Tgt Spa: ['1.000'] [Step 23 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57745] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:28:27,362 >> @ 23 | Loss: 2.0770 | LM: 1.9941 | Reg: 0.0829 | Spa(Avg): 0.454 [INFO|lh_trainer.py:797] 2026-02-16 19:28:27,362 >> Statistic -> Code | Spa: 0.447 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:797] 2026-02-16 19:28:27,362 >> Statistic -> In-Context | Spa: 0.454 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:28:27,362 >> Statistic -> MultiHop | Spa: 0.478 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:28:27,362 >> Statistic -> Single | Spa: 0.473 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:28:27,362 >> Statistic -> Summarization | Spa: 0.483 | Tgt: 1.000 | Z-Loss: 0.120 | [INFO|lh_trainer.py:810] 2026-02-16 19:28:27,364 >> [Micro-Log] {"loss": 2.0769637674093246, "lm_loss": 1.9940845879415672, "reg_loss": 0.08287918645267685, "model_sparsity(avg)": 0.4538097990055879, "Spa-Single QA sparsity": 0.4729938275284237, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06281845664812459, "Spa-Code sparsity": 0.44722222089767455, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10457019582390785, "Spa-In-Context Learning sparsity": 0.4541666626930237, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12185530439019203, "Spa-Summarization sparsity": 0.48333332538604734, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11991810947656631, "Spa-MultiHop QA sparsity": 0.47777777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03792430013418198, "step": 23, "current_tau": 1.480126142501831, "lambda1 Single QA": 0.478515625, "lambda2 MultiHop QA": 
0.240234375, "lambda3 Summarization": 0.0419921875, "lambda4 Code": 0.1376953125} [INFO|lh_trainer.py:331] 2026-02-16 19:28:55,368 >> {'loss': 12.4618, 'grad_norm': 1.2922334671020508, 'learning_rate': 0.00019166666666666667, 'epoch': 0.02527646129541864, 'num_input_tokens_seen': 59782354, 'completed': '8.00% (24 / 300)', 'remaining time': '12:52:01', 'throughput': '7365.32', 'gpu_mem_free': '12919MB', 'step': 24} [Step 24 / Rank 6] Tasks: ['Single QA'] | Lens: [64599] → Tgt Spa: ['0.350'] [Step 24 / Rank 5] Tasks: ['Single QA'] | Lens: [48681] → Tgt Spa: ['0.350'] [Step 24 / Rank 1] Tasks: ['Single QA'] | Lens: [41712] → Tgt Spa: ['0.350'] [Step 24 / Rank 7] Tasks: ['Single QA'] | Lens: [64599] → Tgt Spa: ['0.350'] [Step 24 / Rank 3] Tasks: ['Single QA'] | Lens: [46676] → Tgt Spa: ['0.350'] [Step 24 / Rank 2] Tasks: ['Single QA'] | Lens: [46676] → Tgt Spa: ['0.350'] [Step 24 / Rank 0] Tasks: ['Single QA'] | Lens: [41712] → Tgt Spa: ['0.350'] [Step 24 / Rank 4] Tasks: ['Single QA'] | Lens: [48681] → Tgt Spa: ['0.350'] [Step 24 / Rank 6] Tasks: ['Summarization'] | Lens: [36423] → Tgt Spa: ['1.000'] [Step 24 / Rank 2] Tasks: ['Single QA', 'In-Context Learning', 'Code', 'Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [6585, 6586, 6594, 6591, 6591, 6600, 6600, 6598, 6598] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 24 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27013, 27014] → Tgt Spa: ['1.000', '1.000'] [Step 24 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27013, 27014] → Tgt Spa: ['1.000', '1.000'] [Step 24 / Rank 3] Tasks: ['Single QA', 'In-Context Learning', 'Code', 'Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [6585, 6586, 6594, 6591, 6591, 6600, 6600, 6598, 6598] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 24 / Rank 1] Tasks: ['Code'] | Lens: 
[52891] → Tgt Spa: ['1.000'] [Step 24 / Rank 0] Tasks: ['Code'] | Lens: [52891] → Tgt Spa: ['1.000'] [Step 24 / Rank 7] Tasks: ['Summarization'] | Lens: [36423] → Tgt Spa: ['1.000'] [Step 24 / Rank 3] Tasks: ['Single QA'] | Lens: [41004] → Tgt Spa: ['0.350'] [Step 24 / Rank 5] Tasks: ['Single QA'] | Lens: [64042] → Tgt Spa: ['0.350'] [Step 24 / Rank 2] Tasks: ['Single QA'] | Lens: [41004] → Tgt Spa: ['0.350'] [Step 24 / Rank 1] Tasks: ['Single QA'] | Lens: [49357] → Tgt Spa: ['0.350'] [Step 24 / Rank 7] Tasks: ['Single QA'] | Lens: [33173] → Tgt Spa: ['0.350'] [Step 24 / Rank 0] Tasks: ['Single QA'] | Lens: [49357] → Tgt Spa: ['0.350'] [Step 24 / Rank 4] Tasks: ['Single QA'] | Lens: [64042] → Tgt Spa: ['0.350'] [Step 24 / Rank 6] Tasks: ['Single QA'] | Lens: [33173] → Tgt Spa: ['0.350'] [Step 24 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [25332, 25333] → Tgt Spa: ['0.350', '0.350'] [Step 24 / Rank 6] Tasks: ['Single QA'] | Lens: [57586] → Tgt Spa: ['0.350'] [Step 24 / Rank 7] Tasks: ['Single QA'] | Lens: [57586] → Tgt Spa: ['0.350'] [Step 24 / Rank 2] Tasks: ['Code', 'Summarization'] | Lens: [22273, 22284] → Tgt Spa: ['1.000', '1.000'] [Step 24 / Rank 3] Tasks: ['Code', 'Summarization'] | Lens: [22273, 22284] → Tgt Spa: ['1.000', '1.000'] [Step 24 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [25332, 25333] → Tgt Spa: ['0.350', '0.350'] [Step 24 / Rank 0] Tasks: ['In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [5048, 5067, 5050, 5050, 5069, 5052, 5052, 5060, 5053, 5053, 5055, 5055] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 24 / Rank 1] Tasks: ['In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Single QA', 'Code', 
'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [5048, 5067, 5050, 5050, 5069, 5052, 5052, 5060, 5053, 5053, 5055, 5055] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 24 / Rank 7] Tasks: ['Single QA'] | Lens: [63517] → Tgt Spa: ['0.350'] [Step 24 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29961, 29961] → Tgt Spa: ['0.350', '0.350'] [Step 24 / Rank 6] Tasks: ['Single QA'] | Lens: [63517] → Tgt Spa: ['0.350'] [Step 24 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29961, 29961] → Tgt Spa: ['0.350', '0.350'] [Step 24 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21061, 21061, 21061] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 24 / Rank 3] Tasks: ['Code'] | Lens: [33781] → Tgt Spa: ['1.000'] [Step 24 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21061, 21061, 21061] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 24 / Rank 2] Tasks: ['Code'] | Lens: [33781] → Tgt Spa: ['1.000'] [Step 24 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19641, 19643, 19656] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 24 / Rank 0] Tasks: ['Single QA'] | Lens: [41323] → Tgt Spa: ['0.350'] [Step 24 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39623] → Tgt Spa: ['1.000'] [Step 24 / Rank 1] Tasks: ['Single QA'] | Lens: [41323] → Tgt Spa: ['0.350'] [Step 24 / Rank 5] Tasks: ['Single QA'] | Lens: [36357] → Tgt Spa: ['0.350'] [Step 24 / Rank 4] Tasks: ['Single QA'] | Lens: [36357] → Tgt Spa: ['0.350'] [Step 24 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39623] → Tgt Spa: ['1.000'] [Step 24 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19641, 19643, 19656] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:31:37,215 >> @ 24 | Loss: 2.2270 | LM: 2.1337 | Reg: 0.0933 | Spa(Avg): 0.498 [INFO|lh_trainer.py:797] 2026-02-16 19:31:37,215 >> Statistic -> Code | Spa: 0.451 | Tgt: 
1.000 | Z-Loss: 0.103 | [INFO|lh_trainer.py:797] 2026-02-16 19:31:37,215 >> Statistic -> In-Context | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:31:37,215 >> Statistic -> MultiHop | Spa: 0.478 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:31:37,215 >> Statistic -> Single | Spa: 0.503 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:31:37,215 >> Statistic -> Summarization | Spa: 0.511 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:810] 2026-02-16 19:31:37,217 >> [Micro-Log] {"loss": 2.2270209851364293, "lm_loss": 2.1336838391919932, "reg_loss": 0.0933371198674043, "model_sparsity(avg)": 0.4980066853264968, "Spa-Single QA sparsity": 0.5025720132721795, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0796130227131976, "Spa-Code sparsity": 0.4506172736485799, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10342532893021901, "Spa-In-Context Learning sparsity": 0.47222222089767457, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11779500618577003, "Spa-Summarization sparsity": 0.5111110925674438, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10783043801784516, "Spa-MultiHop QA sparsity": 0.47777777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03792430013418198, "step": 24, "current_tau": 1.478386402130127, "lambda1 Single QA": 0.48046875, "lambda2 MultiHop QA": 0.240234375, "lambda3 Summarization": 0.042236328125, "lambda4 Code": 0.1376953125} [INFO|lh_trainer.py:331] 2026-02-16 19:31:50,649 >> {'loss': 13.3621, 'grad_norm': 0.9868028163909912, 'learning_rate': 0.0002, 'epoch': 0.02632964718272775, 'num_input_tokens_seen': 62186446, 'completed': '8.33% (25 / 300)', 'remaining time': '12:50:35', 'throughput': '6857.81', 'gpu_mem_free': '13703MB', 'step': 25} [Step 25 / Rank 5] Tasks: ['Single QA'] | Lens: [55143] → Tgt Spa: ['0.350'] 
[Step 25 / Rank 4] Tasks: ['Single QA'] | Lens: [55143] → Tgt Spa: ['0.350'] [Step 25 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26834, 26834] → Tgt Spa: ['1.000', '1.000'] [Step 25 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24094, 24094] → Tgt Spa: ['1.000', '1.000'] [Step 25 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26834, 26834] → Tgt Spa: ['1.000', '1.000'] [Step 25 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24094, 24094] → Tgt Spa: ['1.000', '1.000'] [Step 25 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [21366, 21357, 21378] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 25 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [21366, 21357, 21378] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 25 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11001, 11001, 11002, 11002, 11002] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 25 / Rank 6] Tasks: ['Single QA'] | Lens: [58941] → Tgt Spa: ['0.350'] [Step 25 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26481, 26484] → Tgt Spa: ['1.000', '1.000'] [Step 25 / Rank 0] Tasks: ['Single QA'] | Lens: [49232] → Tgt Spa: ['0.350'] [Step 25 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11001, 11001, 11002, 11002, 11002] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 25 / Rank 1] Tasks: ['Single QA'] | Lens: [49232] → Tgt Spa: ['0.350'] [Step 25 / Rank 7] Tasks: ['Single QA'] | Lens: [58941] → Tgt Spa: ['0.350'] [Step 25 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26481, 26484] → Tgt Spa: ['1.000', '1.000'] [Step 25 / Rank 0] Tasks: ['Single QA'] | Lens: [34001] → Tgt Spa: ['0.350'] [Step 25 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18660, 18661, 18674] → Tgt Spa: 
['1.000', '1.000', '1.000'] [Step 25 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18660, 18661, 18674] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 25 / Rank 3] Tasks: ['Code'] | Lens: [35707] → Tgt Spa: ['1.000'] [Step 25 / Rank 2] Tasks: ['Code'] | Lens: [35707] → Tgt Spa: ['1.000'] [Step 25 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12041, 12041, 12042, 12044, 12045] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 25 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12041, 12041, 12042, 12044, 12045] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 25 / Rank 1] Tasks: ['Single QA'] | Lens: [34001] → Tgt Spa: ['0.350'] [Step 25 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [46169] → Tgt Spa: ['1.000'] [Step 25 / Rank 7] Tasks: ['Single QA'] | Lens: [54401] → Tgt Spa: ['0.350'] [Step 25 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [27077, 27077] → Tgt Spa: ['0.350', '0.350'] [Step 25 / Rank 6] Tasks: ['Single QA'] | Lens: [54401] → Tgt Spa: ['0.350'] [Step 25 / Rank 1] Tasks: ['Single QA'] | Lens: [54232] → Tgt Spa: ['0.350'] [Step 25 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [27077, 27077] → Tgt Spa: ['0.350', '0.350'] [Step 25 / Rank 0] Tasks: ['Single QA'] | Lens: [54232] → Tgt Spa: ['0.350'] [Step 25 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [46169] → Tgt Spa: ['1.000'] [Step 25 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41199] → Tgt Spa: ['1.000'] [Step 25 / Rank 5] Tasks: ['Single QA'] | Lens: [59064] → Tgt Spa: ['0.350'] [Step 25 / Rank 4] Tasks: ['Single QA'] | Lens: [59064] → Tgt Spa: ['0.350'] [Step 25 / Rank 2] Tasks: ['Code'] | Lens: [40279] → Tgt Spa: ['1.000'] [Step 25 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41199] → Tgt Spa: ['1.000'] [Step 25 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32503, 32503] → Tgt Spa: ['0.350', '0.350'] [Step 25 / Rank 1] Tasks: ['Single 
QA', 'Single QA'] | Lens: [32503, 32503] → Tgt Spa: ['0.350', '0.350'] [Step 25 / Rank 3] Tasks: ['Code'] | Lens: [40279] → Tgt Spa: ['1.000'] [Step 25 / Rank 6] Tasks: ['Single QA'] | Lens: [65257] → Tgt Spa: ['0.350'] [Step 25 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18258, 18273, 18262] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 25 / Rank 3] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [11689, 11696, 11699, 11710, 11714] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000'] [Step 25 / Rank 0] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18258, 18273, 18262] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 25 / Rank 2] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [11689, 11696, 11699, 11710, 11714] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000'] [Step 25 / Rank 4] Tasks: ['Single QA'] | Lens: [40005] → Tgt Spa: ['0.350'] [Step 25 / Rank 5] Tasks: ['Single QA'] | Lens: [40005] → Tgt Spa: ['0.350'] [Step 25 / Rank 7] Tasks: ['Single QA'] | Lens: [65257] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:34:10,032 >> @ 25 | Loss: 1.9925 | LM: 1.9094 | Reg: 0.0831 | Spa(Avg): 0.500 [INFO|lh_trainer.py:797] 2026-02-16 19:34:10,033 >> Statistic -> Code | Spa: 0.534 | Tgt: 1.000 | Z-Loss: 0.085 | [INFO|lh_trainer.py:797] 2026-02-16 19:34:10,033 >> Statistic -> In-Context | Spa: 0.526 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:34:10,033 >> Statistic -> MultiHop | Spa: 0.478 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:34:10,033 >> Statistic -> Single | Spa: 0.483 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:34:10,033 >> Statistic -> Summarization | Spa: 0.574 | Tgt: 1.000 | Z-Loss: 0.084 | [INFO|lh_trainer.py:810] 2026-02-16 19:34:10,035 >> [Micro-Log] {"loss": 1.9925233178461592, "lm_loss": 1.9093909402533125, "reg_loss": 0.08313237161686023, "model_sparsity(avg)": 0.5000000149011612, "Spa-In-Context Learning sparsity": 
0.5262345737881131, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10535299529631932, "Spa-Single QA sparsity": 0.4832175945242246, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06926715816371143, "Spa-Code sparsity": 0.5340909199281172, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0850161723792553, "Spa-Summarization sparsity": 0.5740740895271301, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08424659321705501, "Spa-MultiHop QA sparsity": 0.47777777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03792430013418198, "step": 25, "current_tau": 1.4765769243240356, "lambda1 Single QA": 0.48046875, "lambda2 MultiHop QA": 0.2412109375, "lambda3 Summarization": 0.04248046875, "lambda4 Code": 0.138671875} [INFO|lh_trainer.py:331] 2026-02-16 19:34:37,131 >> {'loss': 11.9551, 'grad_norm': 1.0244539976119995, 'learning_rate': 0.00020833333333333335, 'epoch': 0.02738283307003686, 'num_input_tokens_seen': 64698904, 'completed': '8.67% (26 / 300)', 'remaining time': '12:47:29', 'throughput': '7545.77', 'gpu_mem_free': '9603MB', 'step': 26} [Step 26 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43202] → Tgt Spa: ['1.000'] [Step 26 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23670, 23671] → Tgt Spa: ['1.000', '1.000'] [Step 26 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23670, 23671] → Tgt Spa: ['1.000', '1.000'] [Step 26 / Rank 1] Tasks: ['Single QA'] | Lens: [49251] → Tgt Spa: ['0.350'] [Step 26 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43202] → Tgt Spa: ['1.000'] [Step 26 / Rank 0] Tasks: ['Single QA'] | Lens: [49251] → Tgt Spa: ['0.350'] [Step 26 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32164, 32164] → Tgt Spa: ['0.350', '0.350'] [Step 26 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32164, 32164] → Tgt Spa: ['0.350', '0.350'] [Step 26 
/ Rank 4] Tasks: ['Single QA'] | Lens: [50941] → Tgt Spa: ['0.350'] [Step 26 / Rank 6] Tasks: ['Single QA'] | Lens: [64909] → Tgt Spa: ['0.350'] [Step 26 / Rank 5] Tasks: ['Single QA'] | Lens: [50941] → Tgt Spa: ['0.350'] [Step 26 / Rank 3] Tasks: ['Single QA'] | Lens: [57701] → Tgt Spa: ['0.350'] [Step 26 / Rank 0] Tasks: ['Single QA'] | Lens: [36881] → Tgt Spa: ['0.350'] [Step 26 / Rank 1] Tasks: ['Single QA'] | Lens: [36881] → Tgt Spa: ['0.350'] [Step 26 / Rank 7] Tasks: ['Single QA'] | Lens: [64909] → Tgt Spa: ['0.350'] [Step 26 / Rank 2] Tasks: ['Single QA'] | Lens: [57701] → Tgt Spa: ['0.350'] [Step 26 / Rank 5] Tasks: ['Code'] | Lens: [61427] → Tgt Spa: ['1.000'] [Step 26 / Rank 1] Tasks: ['Code'] | Lens: [36256] → Tgt Spa: ['1.000'] [Step 26 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [60420] → Tgt Spa: ['1.000'] [Step 26 / Rank 4] Tasks: ['Code'] | Lens: [61427] → Tgt Spa: ['1.000'] [Step 26 / Rank 0] Tasks: ['Code'] | Lens: [36256] → Tgt Spa: ['1.000'] [Step 26 / Rank 3] Tasks: ['Single QA'] | Lens: [44066] → Tgt Spa: ['0.350'] [Step 26 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [60420] → Tgt Spa: ['1.000'] [Step 26 / Rank 2] Tasks: ['Single QA'] | Lens: [44066] → Tgt Spa: ['0.350'] [Step 26 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40603] → Tgt Spa: ['1.000'] [Step 26 / Rank 3] Tasks: ['Single QA'] | Lens: [52537] → Tgt Spa: ['0.350'] [Step 26 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29191, 29191] → Tgt Spa: ['0.350', '0.350'] [Step 26 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40603] → Tgt Spa: ['1.000'] [Step 26 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15984, 15985, 15985, 15985] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 26 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15984, 15985, 15985, 15985] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 26 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29191, 29191] → Tgt Spa: 
['0.350', '0.350'] [Step 26 / Rank 2] Tasks: ['Single QA'] | Lens: [52537] → Tgt Spa: ['0.350'] [Step 26 / Rank 5] Tasks: ['Single QA'] | Lens: [61767] → Tgt Spa: ['0.350'] [Step 26 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23687, 23687] → Tgt Spa: ['0.350', '1.000'] [Step 26 / Rank 4] Tasks: ['Single QA'] | Lens: [61767] → Tgt Spa: ['0.350'] [Step 26 / Rank 1] Tasks: ['Single QA'] | Lens: [51223] → Tgt Spa: ['0.350'] [Step 26 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23751, 23751] → Tgt Spa: ['1.000', '1.000'] [Step 26 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23687, 23687] → Tgt Spa: ['0.350', '1.000'] [Step 26 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23751, 23751] → Tgt Spa: ['1.000', '1.000'] [Step 26 / Rank 0] Tasks: ['Single QA'] | Lens: [51223] → Tgt Spa: ['0.350'] [Step 26 / Rank 6] Tasks: ['Code'] | Lens: [54343] → Tgt Spa: ['1.000'] [Step 26 / Rank 7] Tasks: ['Code'] | Lens: [54343] → Tgt Spa: ['1.000'] [Step 26 / Rank 0] Tasks: ['Single QA'] | Lens: [35239] → Tgt Spa: ['0.350'] [Step 26 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27338, 27338] → Tgt Spa: ['1.000', '1.000'] [Step 26 / Rank 5] Tasks: ['Code'] | Lens: [45129] → Tgt Spa: ['1.000'] [Step 26 / Rank 4] Tasks: ['Code'] | Lens: [45129] → Tgt Spa: ['1.000'] [Step 26 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27338, 27338] → Tgt Spa: ['1.000', '1.000'] [Step 26 / Rank 1] Tasks: ['Single QA'] | Lens: [35239] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:37:11,483 >> @ 26 | Loss: 2.0648 | LM: 1.9827 | Reg: 0.0821 | Spa(Avg): 0.452 [INFO|lh_trainer.py:797] 2026-02-16 19:37:11,483 >> Statistic -> Code | Spa: 0.514 | Tgt: 1.000 | Z-Loss: 0.089 | [INFO|lh_trainer.py:797] 2026-02-16 19:37:11,484 >> Statistic -> In-Context | Spa: 0.438 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:37:11,484 >> Statistic -> 
MultiHop | Spa: 0.478 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:37:11,484 >> Statistic -> Single | Spa: 0.430 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:37:11,484 >> Statistic -> Summarization | Spa: 0.574 | Tgt: 1.000 | Z-Loss: 0.084 | [INFO|lh_trainer.py:810] 2026-02-16 19:37:11,486 >> [Micro-Log] {"loss": 2.064823070851465, "lm_loss": 1.9827375497067503, "reg_loss": 0.08208552173649271, "model_sparsity(avg)": 0.4516782450179259, "Spa-Single QA sparsity": 0.4298245624492043, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05451926473209536, "Spa-Code sparsity": 0.5138888955116272, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08926127105951309, "Spa-In-Context Learning sparsity": 0.43750000596046446, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12666508182883263, "Spa-Summarization sparsity": 0.5740740895271301, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08424659321705501, "Spa-MultiHop QA sparsity": 0.47777777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03792430013418198, "step": 26, "current_tau": 1.474698543548584, "lambda1 Single QA": 0.48046875, "lambda2 MultiHop QA": 0.2412109375, "lambda3 Summarization": 0.04296875, "lambda4 Code": 0.138671875} [INFO|lh_trainer.py:331] 2026-02-16 19:37:31,269 >> {'loss': 12.3889, 'grad_norm': 1.2274116277694702, 'learning_rate': 0.00021666666666666668, 'epoch': 0.02843601895734597, 'num_input_tokens_seen': 67157778, 'completed': '9.00% (27 / 300)', 'remaining time': '12:45:43', 'throughput': '7060.11', 'gpu_mem_free': '13937MB', 'step': 27} [Step 27 / Rank 3] Tasks: ['Single QA'] | Lens: [49206] → Tgt Spa: ['0.350'] [Step 27 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17437, 17437, 17426] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 27 / Rank 2] Tasks: ['Single QA'] | Lens: [49206] → 
Tgt Spa: ['0.350'] [Step 27 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24884, 24885] → Tgt Spa: ['0.350', '0.350'] [Step 27 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17437, 17437, 17426] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 27 / Rank 7] Tasks: ['Single QA'] | Lens: [40258] → Tgt Spa: ['0.350'] [Step 27 / Rank 6] Tasks: ['Single QA'] | Lens: [40258] → Tgt Spa: ['0.350'] [Step 27 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24884, 24885] → Tgt Spa: ['0.350', '0.350'] [Step 27 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58764] → Tgt Spa: ['1.000'] [Step 27 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23809, 23811] → Tgt Spa: ['1.000', '1.000'] [Step 27 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58764] → Tgt Spa: ['1.000'] [Step 27 / Rank 0] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17319, 17332, 17322] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 27 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23809, 23811] → Tgt Spa: ['1.000', '1.000'] [Step 27 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45892] → Tgt Spa: ['1.000'] [Step 27 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17319, 17332, 17322] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 27 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45892] → Tgt Spa: ['1.000'] [Step 27 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [26553, 26553] → Tgt Spa: ['0.350', '0.350'] [Step 27 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [25083, 25091] → Tgt Spa: ['0.350', '1.000'] [Step 27 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [26553, 26553] → Tgt Spa: ['0.350', '0.350'] [Step 27 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [63745] → Tgt Spa: ['1.000'] [Step 27 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [25083, 25091] → Tgt Spa: ['0.350', '1.000'] [Step 27 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [46119] → Tgt Spa: ['1.000'] [Step 27 / Rank 3] Tasks: 
['In-Context Learning'] | Lens: [63745] → Tgt Spa: ['1.000'] [Step 27 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [46119] → Tgt Spa: ['1.000'] [Step 27 / Rank 0] Tasks: ['In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Code', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'Single QA', 'Code'] | Lens: [3572, 3591, 3580, 3573, 3573, 3574, 3580, 3593, 3576, 3575, 3579, 3578, 3577, 3577, 3578, 3598, 3579, 3587] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 27 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59328] → Tgt Spa: ['1.000'] [Step 27 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59328] → Tgt Spa: ['1.000'] [Step 27 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11007, 11008, 11009, 11010, 11010] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 27 / Rank 3] Tasks: ['Code'] | Lens: [41438] → Tgt Spa: ['1.000'] [Step 27 / Rank 1] Tasks: ['In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Code', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'Single QA', 'Code'] | Lens: [3572, 3591, 3580, 3573, 3573, 3574, 3580, 3593, 3576, 3575, 3579, 3578, 3577, 3577, 3578, 3598, 3579, 3587] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 27 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11007, 11008, 11009, 11010, 11010] → Tgt Spa: 
['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 27 / Rank 2] Tasks: ['Code'] | Lens: [41438] → Tgt Spa: ['1.000'] [Step 27 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [37961] → Tgt Spa: ['1.000'] [Step 27 / Rank 0] Tasks: ['Single QA'] | Lens: [48513] → Tgt Spa: ['0.350'] [Step 27 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [37961] → Tgt Spa: ['1.000'] [Step 27 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40555] → Tgt Spa: ['1.000'] [Step 27 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40555] → Tgt Spa: ['1.000'] [Step 27 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43103] → Tgt Spa: ['1.000'] [Step 27 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43103] → Tgt Spa: ['1.000'] [Step 27 / Rank 1] Tasks: ['Single QA'] | Lens: [48513] → Tgt Spa: ['0.350'] [Step 27 / Rank 3] Tasks: ['Single QA'] | Lens: [41335] → Tgt Spa: ['0.350'] [Step 27 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [25128, 25119] → Tgt Spa: ['1.000', '1.000'] [Step 27 / Rank 4] Tasks: ['Single QA'] | Lens: [55075] → Tgt Spa: ['0.350'] [Step 27 / Rank 2] Tasks: ['Single QA'] | Lens: [41335] → Tgt Spa: ['0.350'] [Step 27 / Rank 1] Tasks: ['Code'] | Lens: [43412] → Tgt Spa: ['1.000'] [Step 27 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [25128, 25119] → Tgt Spa: ['1.000', '1.000'] [Step 27 / Rank 5] Tasks: ['Single QA'] | Lens: [55075] → Tgt Spa: ['0.350'] [Step 27 / Rank 0] Tasks: ['Code'] | Lens: [43412] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:39:56,009 >> @ 27 | Loss: 2.1915 | LM: 2.0795 | Reg: 0.1120 | Spa(Avg): 0.475 [INFO|lh_trainer.py:797] 2026-02-16 19:39:56,009 >> Statistic -> Code | Spa: 0.475 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:797] 2026-02-16 19:39:56,009 >> Statistic -> In-Context | Spa: 0.420 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:39:56,009 >> Statistic -> MultiHop | Spa: 0.514 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:39:56,009 >> Statistic 
-> Single | Spa: 0.520 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:39:56,009 >> Statistic -> Summarization | Spa: 0.572 | Tgt: 1.000 | Z-Loss: 0.089 | [INFO|lh_trainer.py:810] 2026-02-16 19:39:56,011 >> [Micro-Log] {"loss": 2.1915148018548885, "lm_loss": 2.0795360328629613, "reg_loss": 0.11197877069935203, "model_sparsity(avg)": 0.47489068408807117, "Spa-Single QA sparsity": 0.5204248288098503, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08965370190494201, "Spa-Code sparsity": 0.4749999940395355, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1000005103647709, "Spa-Summarization sparsity": 0.5717592338720957, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08866468630731106, "Spa-In-Context Learning sparsity": 0.42032164335250854, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1308691846696954, "Spa-MultiHop QA sparsity": 0.5138888657093048, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04911407083272934, "step": 27, "current_tau": 1.4727516174316406, "lambda1 Single QA": 0.48046875, "lambda2 MultiHop QA": 0.2412109375, "lambda3 Summarization": 0.043212890625, "lambda4 Code": 0.1396484375} [INFO|lh_trainer.py:331] 2026-02-16 19:40:16,723 >> {'loss': 13.1491, 'grad_norm': 1.8787362575531006, 'learning_rate': 0.00022500000000000002, 'epoch': 0.02948920484465508, 'num_input_tokens_seen': 69536532, 'completed': '9.33% (28 / 300)', 'remaining time': '12:42:27', 'throughput': '7188.58', 'gpu_mem_free': '11341MB', 'step': 28} [Step 28 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [15507, 15526, 15522, 15523] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350'] [Step 28 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40050] → Tgt Spa: ['1.000'] [Step 28 / Rank 1] Tasks: ['Summarization'] | Lens: [47904] → Tgt Spa: ['1.000'] [Step 28 / Rank 2] Tasks: ['Code'] | Lens: [52323] → Tgt 
Spa: ['1.000'] [Step 28 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40050] → Tgt Spa: ['1.000'] [Step 28 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [15507, 15526, 15522, 15523] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350'] [Step 28 / Rank 0] Tasks: ['Summarization'] | Lens: [47904] → Tgt Spa: ['1.000'] [Step 28 / Rank 3] Tasks: ['Code'] | Lens: [52323] → Tgt Spa: ['1.000'] [Step 28 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [23142, 23143] → Tgt Spa: ['1.000', '1.000'] [Step 28 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [59499] → Tgt Spa: ['1.000'] [Step 28 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20006, 19996, 19995] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 28 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20006, 19996, 19995] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 28 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [23142, 23143] → Tgt Spa: ['1.000', '1.000'] [Step 28 / Rank 2] Tasks: ['Single QA'] | Lens: [54769] → Tgt Spa: ['0.350'] [Step 28 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [59499] → Tgt Spa: ['1.000'] [Step 28 / Rank 3] Tasks: ['Single QA'] | Lens: [54769] → Tgt Spa: ['0.350'] [Step 28 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61100] → Tgt Spa: ['1.000'] [Step 28 / Rank 3] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17412, 17422, 17423] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 28 / Rank 2] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17412, 17422, 17423] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 28 / Rank 1] Tasks: ['Single QA'] | Lens: [38802] → Tgt Spa: ['0.350'] [Step 28 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61100] → Tgt Spa: ['1.000'] [Step 28 / Rank 5] Tasks: ['Single QA'] | Lens: [42365] → Tgt Spa: ['0.350'] [Step 28 / Rank 4] Tasks: ['Single QA'] | Lens: [42365] → Tgt Spa: ['0.350'] [Step 28 / Rank 0] Tasks: ['Single QA'] | Lens: [38802] → Tgt Spa: ['0.350'] [Step 28 / Rank 4] Tasks: ['In-Context Learning'] 
| Lens: [56661] → Tgt Spa: ['1.000'] [Step 28 / Rank 3] Tasks: ['Summarization', 'Single QA', 'Code'] | Lens: [18490, 18471, 18480] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 28 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56661] → Tgt Spa: ['1.000'] [Step 28 / Rank 6] Tasks: ['Single QA'] | Lens: [64036] → Tgt Spa: ['0.350'] [Step 28 / Rank 0] Tasks: ['Code'] | Lens: [41632] → Tgt Spa: ['1.000'] [Step 28 / Rank 1] Tasks: ['Code'] | Lens: [41632] → Tgt Spa: ['1.000'] [Step 28 / Rank 2] Tasks: ['Summarization', 'Single QA', 'Code'] | Lens: [18490, 18471, 18480] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 28 / Rank 7] Tasks: ['Single QA'] | Lens: [64036] → Tgt Spa: ['0.350'] [Step 28 / Rank 0] Tasks: ['Code'] | Lens: [43027] → Tgt Spa: ['1.000'] [Step 28 / Rank 4] Tasks: ['In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Code', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [5916, 5918, 5919, 5926, 5927, 5930, 5929, 5921, 5923, 5925, 5927] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 28 / Rank 5] Tasks: ['In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Code', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [5916, 5918, 5919, 5926, 5927, 5930, 5929, 5921, 5923, 5925, 5927] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 28 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39552] → Tgt Spa: ['1.000'] [Step 28 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39552] → Tgt Spa: ['1.000'] [Step 28 / Rank 1] Tasks: ['Code'] | Lens: [43027] → Tgt Spa: ['1.000'] [Step 28 / Rank 2] Tasks: ['Code'] | Lens: [64938] → Tgt Spa: ['1.000'] [Step 28 / Rank 3] Tasks: ['Code'] | Lens: [64938] → Tgt Spa: ['1.000'] [Step 28 / Rank 5] Tasks: ['Single QA'] | Lens: [51020] → Tgt Spa: ['0.350'] [Step 28 / Rank 3] Tasks: ['Single 
QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15930, 15930, 15930, 15930] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 28 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [22391, 22392] → Tgt Spa: ['1.000', '1.000'] [Step 28 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [22391, 22392] → Tgt Spa: ['1.000', '1.000'] [Step 28 / Rank 4] Tasks: ['Single QA'] | Lens: [51020] → Tgt Spa: ['0.350'] [Step 28 / Rank 0] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5100, 5102, 5102, 5103, 5102, 5103, 5103, 5104, 5105, 5108, 5108, 5109] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 28 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15930, 15930, 15930, 15930] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 28 / Rank 1] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5100, 5102, 5102, 5103, 5102, 5103, 5103, 5104, 5105, 5108, 5108, 5109] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:42:55,406 >> @ 28 | Loss: 1.9156 | LM: 1.8205 | Reg: 0.0952 | Spa(Avg): 0.450 [INFO|lh_trainer.py:797] 2026-02-16 19:42:55,407 >> Statistic -> Code | Spa: 0.447 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:797] 2026-02-16 19:42:55,407 >> Statistic -> In-Context | Spa: 0.433 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:42:55,407 >> Statistic -> MultiHop | Spa: 0.514 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-16 19:42:55,407 >> Statistic -> Single | Spa: 0.446 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:42:55,407 >> Statistic -> Summarization | Spa: 0.506 | Tgt: 1.000 | Z-Loss: 0.116 | [INFO|lh_trainer.py:810] 2026-02-16 19:42:55,409 >> [Micro-Log] {"loss": 1.9156209180752437, "lm_loss": 1.8204569400598605, "reg_loss": 0.09516397479455918, "model_sparsity(avg)": 0.4504813812673092, "Spa-Summarization sparsity": 0.5055555582046509, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11581126749515533, "Spa-Code sparsity": 0.44689542405745564, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10619344299330431, "Spa-Single QA sparsity": 0.44583332240581514, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05282386538456194, "Spa-In-Context Learning sparsity": 0.43300653555813956, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1286039632909438, "Spa-MultiHop QA sparsity": 0.5138888657093048, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04911407083272934, "step": 28, "current_tau": 1.4707368612289429, "lambda1 Single QA": 0.48046875, "lambda2 MultiHop QA": 0.2412109375, "lambda3 Summarization": 0.04345703125, "lambda4 Code": 0.1396484375} [INFO|lh_trainer.py:331] 2026-02-16 19:43:13,600 >> {'loss': 11.4937, 'grad_norm': 1.6268733739852905, 'learning_rate': 0.00023333333333333333, 'epoch': 0.030542390731964193, 'num_input_tokens_seen': 72073830, 'completed': '9.67% (29 / 300)', 'remaining time': '12:41:00', 'throughput': '7172.50', 'gpu_mem_free': '7863MB', 'step': 29} [Step 29 / Rank 7] Tasks: ['Single QA'] | Lens: [54774] → Tgt Spa: ['0.350'] [Step 29 / Rank 4] Tasks: ['Single QA'] | Lens: [54837] → Tgt Spa: ['0.350'] [Step 29 / Rank 2] Tasks: ['Single QA'] | Lens: [58797] → Tgt Spa: ['0.350'] [Step 29 / Rank 5] Tasks: ['Single QA'] | Lens: [54837] → Tgt Spa: ['0.350'] [Step 29 / 
Rank 3] Tasks: ['Single QA'] | Lens: [58797] → Tgt Spa: ['0.350'] [Step 29 / Rank 6] Tasks: ['Single QA'] | Lens: [54774] → Tgt Spa: ['0.350'] [Step 29 / Rank 1] Tasks: ['Single QA'] | Lens: [37333] → Tgt Spa: ['0.350'] [Step 29 / Rank 0] Tasks: ['Single QA'] | Lens: [37333] → Tgt Spa: ['0.350'] [Step 29 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39298] → Tgt Spa: ['1.000'] [Step 29 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39298] → Tgt Spa: ['1.000'] [Step 29 / Rank 1] Tasks: ['Single QA'] | Lens: [48516] → Tgt Spa: ['0.350'] [Step 29 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32070, 32073] → Tgt Spa: ['0.350', '0.350'] [Step 29 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32070, 32073] → Tgt Spa: ['0.350', '0.350'] [Step 29 / Rank 5] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7742, 7760, 7742, 7743, 7743, 7744, 7744, 7746] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 29 / Rank 0] Tasks: ['Single QA'] | Lens: [48516] → Tgt Spa: ['0.350'] [Step 29 / Rank 4] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7742, 7760, 7742, 7743, 7743, 7744, 7744, 7746] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 29 / Rank 2] Tasks: ['Single QA'] | Lens: [64190] → Tgt Spa: ['0.350'] [Step 29 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [29758, 29751] → Tgt Spa: ['1.000', '1.000'] [Step 29 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [25882, 25874] → Tgt Spa: ['1.000', '0.350'] [Step 29 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [29758, 29751] → Tgt Spa: ['1.000', '1.000'] [Step 29 / Rank 3] Tasks: ['Single QA'] | Lens: [64190] → Tgt Spa: ['0.350'] [Step 29 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [25882, 25874] → Tgt Spa: ['1.000', '0.350'] [Step 29 / Rank 6] Tasks: 
['Single QA'] | Lens: [53246] → Tgt Spa: ['0.350'] [Step 29 / Rank 7] Tasks: ['Single QA'] | Lens: [53246] → Tgt Spa: ['0.350'] [Step 29 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18203, 18204, 18194] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 29 / Rank 0] Tasks: ['Single QA'] | Lens: [32971] → Tgt Spa: ['0.350'] [Step 29 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18203, 18204, 18194] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 29 / Rank 1] Tasks: ['Single QA'] | Lens: [32971] → Tgt Spa: ['0.350'] [Step 29 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [26032, 26027] → Tgt Spa: ['1.000', '1.000'] [Step 29 / Rank 7] Tasks: ['Code', 'Summarization', 'In-Context Learning', 'Code', 'Code', 'Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [5961, 5978, 5960, 5968, 5971, 5974, 5973, 5975, 5975, 5977] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 29 / Rank 6] Tasks: ['Code', 'Summarization', 'In-Context Learning', 'Code', 'Code', 'Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [5961, 5978, 5960, 5968, 5971, 5974, 5973, 5975, 5975, 5977] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 29 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [26032, 26027] → Tgt Spa: ['1.000', '1.000'] [Step 29 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Code', 'In-Context Learning'] | Lens: [5253, 5254, 5254, 5254, 5262, 5256, 5256, 5257, 5265, 5268, 5266, 5259] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 29 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [59566] → Tgt Spa: ['1.000'] [Step 29 / Rank 3] Tasks: ['Single QA'] | Lens: [52019] → Tgt Spa: ['0.350'] [Step 29 / Rank 4] Tasks: 
['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Code', 'In-Context Learning'] | Lens: [5253, 5254, 5254, 5254, 5262, 5256, 5256, 5257, 5265, 5268, 5266, 5259] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 29 / Rank 0] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [18703, 18686, 18694] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 29 / Rank 2] Tasks: ['Single QA'] | Lens: [52019] → Tgt Spa: ['0.350'] [Step 29 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [59566] → Tgt Spa: ['1.000'] [Step 29 / Rank 1] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [18703, 18686, 18694] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 29 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [44397] → Tgt Spa: ['1.000'] [Step 29 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [24883, 24883] → Tgt Spa: ['0.350', '0.350'] [Step 29 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25529, 25529] → Tgt Spa: ['0.350', '0.350'] [Step 29 / Rank 7] Tasks: ['Single QA'] | Lens: [64095] → Tgt Spa: ['0.350'] [Step 29 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [25529, 25529] → Tgt Spa: ['0.350', '0.350'] [Step 29 / Rank 6] Tasks: ['Single QA'] | Lens: [64095] → Tgt Spa: ['0.350'] [Step 29 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [24883, 24883] → Tgt Spa: ['0.350', '0.350'] [Step 29 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [44397] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:45:37,125 >> @ 29 | Loss: 2.2316 | LM: 2.1444 | Reg: 0.0872 | Spa(Avg): 0.467 [INFO|lh_trainer.py:797] 2026-02-16 19:45:37,125 >> Statistic -> Code | Spa: 0.498 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-16 19:45:37,125 >> Statistic -> In-Context | Spa: 0.418 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 
19:45:37,125 >> Statistic -> MultiHop | Spa: 0.514 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:45:37,125 >> Statistic -> Single | Spa: 0.464 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:45:37,125 >> Statistic -> Summarization | Spa: 0.483 | Tgt: 1.000 | Z-Loss: 0.121 | [INFO|lh_trainer.py:810] 2026-02-16 19:45:37,127 >> [Micro-Log] {"loss": 2.231629348049561, "lm_loss": 2.144417801871896, "reg_loss": 0.08721154695861817, "model_sparsity(avg)": 0.4666956042249997, "Spa-Single QA sparsity": 0.46420939610554623, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.059543711469114684, "Spa-Code sparsity": 0.4983660052804386, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0939658192150733, "Spa-In-Context Learning sparsity": 0.4177350401878357, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13227079350214738, "Spa-Summarization sparsity": 0.48333332538604734, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12058919817209243, "Spa-MultiHop QA sparsity": 0.5138888657093048, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04911407083272934, "step": 29, "current_tau": 1.4686548709869385, "lambda1 Single QA": 0.482421875, "lambda2 MultiHop QA": 0.2412109375, "lambda3 Summarization": 0.0439453125, "lambda4 Code": 0.1396484375} [INFO|lh_trainer.py:331] 2026-02-16 19:46:03,431 >> {'loss': 13.3898, 'grad_norm': 1.0921963453292847, 'learning_rate': 0.00024166666666666667, 'epoch': 0.0315955766192733, 'num_input_tokens_seen': 74649418, 'completed': '10.00% (30 / 300)', 'remaining time': '12:38:23', 'throughput': '7582.77', 'gpu_mem_free': '9749MB', 'step': 30} [Step 30 / Rank 0] Tasks: ['Single QA'] | Lens: [62442] → Tgt Spa: ['0.350'] [Step 30 / Rank 3] Tasks: ['Single QA'] | Lens: [49550] → Tgt Spa: ['0.350'] [Step 30 / Rank 2] Tasks: ['Single QA'] | Lens: [49550] → Tgt Spa: ['0.350'] [Step 30 / 
Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19347, 19361, 19353] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 30 / Rank 4] Tasks: ['Single QA'] | Lens: [65024] → Tgt Spa: ['0.350'] [Step 30 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19347, 19361, 19353] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 30 / Rank 5] Tasks: ['Single QA'] | Lens: [65024] → Tgt Spa: ['0.350'] [Step 30 / Rank 1] Tasks: ['Single QA'] | Lens: [62442] → Tgt Spa: ['0.350'] [Step 30 / Rank 6] Tasks: ['Single QA'] | Lens: [56505] → Tgt Spa: ['0.350'] [Step 30 / Rank 5] Tasks: ['Single QA'] | Lens: [55635] → Tgt Spa: ['0.350'] [Step 30 / Rank 2] Tasks: ['Single QA'] | Lens: [64097] → Tgt Spa: ['0.350'] [Step 30 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [23595, 23589] → Tgt Spa: ['1.000', '1.000'] [Step 30 / Rank 3] Tasks: ['Single QA'] | Lens: [64097] → Tgt Spa: ['0.350'] [Step 30 / Rank 7] Tasks: ['Single QA'] | Lens: [56505] → Tgt Spa: ['0.350'] [Step 30 / Rank 4] Tasks: ['Single QA'] | Lens: [55635] → Tgt Spa: ['0.350'] [Step 30 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [23595, 23589] → Tgt Spa: ['1.000', '1.000'] [Step 30 / Rank 4] Tasks: ['Single QA'] | Lens: [55073] → Tgt Spa: ['0.350'] [Step 30 / Rank 1] Tasks: ['Single QA'] | Lens: [33500] → Tgt Spa: ['0.350'] [Step 30 / Rank 7] Tasks: ['Single QA'] | Lens: [50935] → Tgt Spa: ['0.350'] [Step 30 / Rank 5] Tasks: ['Single QA'] | Lens: [55073] → Tgt Spa: ['0.350'] [Step 30 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32694, 32694] → Tgt Spa: ['0.350', '0.350'] [Step 30 / Rank 0] Tasks: ['Single QA'] | Lens: [33500] → Tgt Spa: ['0.350'] [Step 30 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32694, 32694] → Tgt Spa: ['0.350', '0.350'] [Step 30 / Rank 6] Tasks: ['Single QA'] | Lens: [50935] → Tgt Spa: ['0.350'] [Step 30 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [30367, 30371] → Tgt Spa: ['1.000', '1.000'] [Step 30 / Rank 3] Tasks: 
['Single QA'] | Lens: [46165] → Tgt Spa: ['0.350'] [Step 30 / Rank 5] Tasks: ['Single QA', 'Summarization'] | Lens: [26851, 26869] → Tgt Spa: ['0.350', '1.000'] [Step 30 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [30367, 30371] → Tgt Spa: ['1.000', '1.000'] [Step 30 / Rank 6] Tasks: ['Single QA'] | Lens: [55637] → Tgt Spa: ['0.350'] [Step 30 / Rank 4] Tasks: ['Single QA', 'Summarization'] | Lens: [26851, 26869] → Tgt Spa: ['0.350', '1.000'] [Step 30 / Rank 2] Tasks: ['Single QA'] | Lens: [46165] → Tgt Spa: ['0.350'] [Step 30 / Rank 7] Tasks: ['Single QA'] | Lens: [55637] → Tgt Spa: ['0.350'] [Step 30 / Rank 5] Tasks: ['Summarization', 'Code'] | Lens: [23239, 23229] → Tgt Spa: ['1.000', '1.000'] [Step 30 / Rank 6] Tasks: ['Single QA'] | Lens: [63716] → Tgt Spa: ['0.350'] [Step 30 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4519, 4519, 4521, 4520, 4520, 4521, 4521, 4529, 4523, 4523, 4522, 4522, 4522, 4524] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 30 / Rank 4] Tasks: ['Summarization', 'Code'] | Lens: [23239, 23229] → Tgt Spa: ['1.000', '1.000'] [Step 30 / Rank 1] Tasks: ['Single QA'] | Lens: [37121] → Tgt Spa: ['0.350'] [Step 30 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4519, 4519, 4521, 4520, 4520, 4521, 4521, 4529, 4523, 4523, 4522, 4522, 4522, 4524] → Tgt Spa: 
['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 30 / Rank 7] Tasks: ['Single QA'] | Lens: [63716] → Tgt Spa: ['0.350'] [Step 30 / Rank 0] Tasks: ['Single QA'] | Lens: [37121] → Tgt Spa: ['0.350'] [Step 30 / Rank 5] Tasks: ['Single QA'] | Lens: [39855] → Tgt Spa: ['0.350'] [Step 30 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17822, 17813, 17824] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 30 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26685, 26687] → Tgt Spa: ['1.000', '1.000'] [Step 30 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17822, 17813, 17824] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 30 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26685, 26687] → Tgt Spa: ['1.000', '1.000'] [Step 30 / Rank 4] Tasks: ['Single QA'] | Lens: [39855] → Tgt Spa: ['0.350'] [Step 30 / Rank 0] Tasks: ['Single QA'] | Lens: [35574] → Tgt Spa: ['0.350'] [Step 30 / Rank 1] Tasks: ['Single QA'] | Lens: [35574] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:48:45,223 >> @ 30 | Loss: 2.2078 | LM: 2.1189 | Reg: 0.0889 | Spa(Avg): 0.481 [INFO|lh_trainer.py:797] 2026-02-16 19:48:45,223 >> Statistic -> Code | Spa: 0.481 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-16 19:48:45,223 >> Statistic -> In-Context | Spa: 0.443 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:48:45,223 >> Statistic -> MultiHop | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:48:45,223 >> Statistic -> Single | Spa: 0.495 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:48:45,223 >> Statistic -> Summarization | Spa: 0.450 | Tgt: 1.000 | Z-Loss: 0.137 | [INFO|lh_trainer.py:810] 2026-02-16 19:48:45,225 >> [Micro-Log] {"loss": 2.2078331212202706, "lm_loss": 2.118892470219483, "reg_loss": 0.0889406512142159, 
"model_sparsity(avg)": 0.4810405609508355, "Spa-Single QA sparsity": 0.4947916604578495, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07648627166054212, "Spa-Code sparsity": 0.4814814825852712, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0988406886657079, "Spa-In-Context Learning sparsity": 0.44290123383204144, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12654273750053513, "Spa-MultiHop QA sparsity": 0.4444444477558136, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0284466037992388, "Spa-Summarization sparsity": 0.45, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13696786463260652, "step": 30, "current_tau": 1.4665063619613647, "lambda1 Single QA": 0.482421875, "lambda2 MultiHop QA": 0.2421875, "lambda3 Summarization": 0.04443359375, "lambda4 Code": 0.140625} [INFO|lh_trainer.py:331] 2026-02-16 19:48:58,597 >> {'loss': 13.247, 'grad_norm': 0.9047552347183228, 'learning_rate': 0.00025, 'epoch': 0.03264876250658241, 'num_input_tokens_seen': 77194468, 'completed': '10.33% (31 / 300)', 'remaining time': '12:36:32', 'throughput': '7264.68', 'gpu_mem_free': '14003MB', 'step': 31} [Step 31 / Rank 7] Tasks: ['Single QA'] | Lens: [37335] → Tgt Spa: ['0.350'] [Step 31 / Rank 5] Tasks: ['Single QA'] | Lens: [44047] → Tgt Spa: ['0.350'] [Step 31 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [28246, 28246] → Tgt Spa: ['0.350', '0.350'] [Step 31 / Rank 6] Tasks: ['Single QA'] | Lens: [37335] → Tgt Spa: ['0.350'] [Step 31 / Rank 3] Tasks: ['Single QA'] | Lens: [35011] → Tgt Spa: ['0.350'] [Step 31 / Rank 2] Tasks: ['Single QA'] | Lens: [35011] → Tgt Spa: ['0.350'] [Step 31 / Rank 4] Tasks: ['Single QA'] | Lens: [44047] → Tgt Spa: ['0.350'] [Step 31 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [28246, 28246] → Tgt Spa: ['0.350', '0.350'] [Step 31 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29740, 29740] → Tgt 
Spa: ['0.350', '0.350'] [Step 31 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24358, 24359] → Tgt Spa: ['1.000', '1.000'] [Step 31 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43725] → Tgt Spa: ['1.000'] [Step 31 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43725] → Tgt Spa: ['1.000'] [Step 31 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24358, 24359] → Tgt Spa: ['1.000', '1.000'] [Step 31 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25607, 25628] → Tgt Spa: ['1.000', '1.000'] [Step 31 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25607, 25628] → Tgt Spa: ['1.000', '1.000'] [Step 31 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29740, 29740] → Tgt Spa: ['0.350', '0.350'] [Step 31 / Rank 4] Tasks: ['Code'] | Lens: [35989] → Tgt Spa: ['1.000'] [Step 31 / Rank 3] Tasks: ['Single QA'] | Lens: [53610] → Tgt Spa: ['0.350'] [Step 31 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [27306, 27314] → Tgt Spa: ['1.000', '1.000'] [Step 31 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [27306, 27314] → Tgt Spa: ['1.000', '1.000'] [Step 31 / Rank 5] Tasks: ['Code'] | Lens: [35989] → Tgt Spa: ['1.000'] [Step 31 / Rank 7] Tasks: ['Single QA'] | Lens: [50547] → Tgt Spa: ['0.350'] [Step 31 / Rank 2] Tasks: ['Single QA'] | Lens: [53610] → Tgt Spa: ['0.350'] [Step 31 / Rank 6] Tasks: ['Single QA'] | Lens: [50547] → Tgt Spa: ['0.350'] [Step 31 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18274, 18264, 18264] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 31 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18274, 18264, 18264] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 31 / Rank 1] Tasks: ['Single QA'] | Lens: [37433] → Tgt Spa: ['0.350'] [Step 31 / Rank 3] Tasks: ['Single QA'] | Lens: [50112] → Tgt Spa: ['0.350'] [Step 31 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27284, 27286] → Tgt Spa: ['1.000', 
'1.000'] [Step 31 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27284, 27286] → Tgt Spa: ['1.000', '1.000'] [Step 31 / Rank 2] Tasks: ['Single QA'] | Lens: [50112] → Tgt Spa: ['0.350'] [Step 31 / Rank 0] Tasks: ['Single QA'] | Lens: [37433] → Tgt Spa: ['0.350'] [Step 31 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18803, 18792, 18792] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 31 / Rank 5] Tasks: ['Single QA'] | Lens: [54559] → Tgt Spa: ['0.350'] [Step 31 / Rank 6] Tasks: ['Code'] | Lens: [45148] → Tgt Spa: ['1.000'] [Step 31 / Rank 4] Tasks: ['Single QA'] | Lens: [54559] → Tgt Spa: ['0.350'] [Step 31 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18803, 18792, 18792] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 31 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [13356, 13356, 13364, 13357] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350'] [Step 31 / Rank 7] Tasks: ['Code'] | Lens: [45148] → Tgt Spa: ['1.000'] [Step 31 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [13356, 13356, 13364, 13357] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350'] [Step 31 / Rank 0] Tasks: ['Single QA'] | Lens: [44756] → Tgt Spa: ['0.350'] [Step 31 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61769] → Tgt Spa: ['1.000'] [Step 31 / Rank 5] Tasks: ['Single QA'] | Lens: [55534] → Tgt Spa: ['0.350'] [Step 31 / Rank 6] Tasks: ['Summarization', 'Single QA'] | Lens: [23576, 23559] → Tgt Spa: ['1.000', '0.350'] [Step 31 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61769] → Tgt Spa: ['1.000'] [Step 31 / Rank 7] Tasks: ['Summarization', 'Single QA'] | Lens: [23576, 23559] → Tgt Spa: ['1.000', '0.350'] [Step 31 / Rank 4] Tasks: ['Single QA'] | Lens: [55534] → Tgt Spa: ['0.350'] [Step 31 / Rank 1] Tasks: ['Single QA'] | Lens: [44756] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 19:51:03,301 >> @ 31 | Loss: 2.1440 | LM: 2.0576 | Reg: 0.0864 | Spa(Avg): 0.459 
[INFO|lh_trainer.py:797] 2026-02-16 19:51:03,301 >> Statistic -> Code | Spa: 0.462 | Tgt: 1.000 | Z-Loss: 0.103 | [INFO|lh_trainer.py:797] 2026-02-16 19:51:03,301 >> Statistic -> In-Context | Spa: 0.476 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:51:03,301 >> Statistic -> MultiHop | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:51:03,301 >> Statistic -> Single | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:51:03,301 >> Statistic -> Summarization | Spa: 0.420 | Tgt: 1.000 | Z-Loss: 0.149 | [INFO|lh_trainer.py:810] 2026-02-16 19:51:03,303 >> [Micro-Log] {"loss": 2.1439636821548143, "lm_loss": 2.0575888380408287, "reg_loss": 0.08637485917036732, "model_sparsity(avg)": 0.4594907263914744, "Spa-Single QA sparsity": 0.4575617081589169, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05537604263776706, "Spa-In-Context Learning sparsity": 0.4756944477558136, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11937380395829678, "Spa-Code sparsity": 0.4618055373430252, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10327600967139006, "Spa-Summarization sparsity": 0.4201388657093048, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14918405562639236, "Spa-MultiHop QA sparsity": 0.4444444477558136, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0284466037992388, "step": 31, "current_tau": 1.4642918109893799, "lambda1 Single QA": 0.482421875, "lambda2 MultiHop QA": 0.2421875, "lambda3 Summarization": 0.044677734375, "lambda4 Code": 0.140625} [INFO|lh_trainer.py:331] 2026-02-16 19:51:28,020 >> {'loss': 12.8638, 'grad_norm': 1.0892783403396606, 'learning_rate': 0.00025833333333333334, 'epoch': 0.03370194839389152, 'num_input_tokens_seen': 79567360, 'completed': '10.67% (32 / 300)', 'remaining time': '12:31:02', 'throughput': '7940.21', 'gpu_mem_free': 
'10255MB', 'step': 32} [Step 32 / Rank 5] Tasks: ['Summarization', 'MultiHop QA', 'Code', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA'] | Lens: [1966, 1950, 1957, 1948, 1950, 1951, 1952, 1950, 1950, 1969, 1951, 1950, 1971, 1952, 1954, 1972, 1971, 1953, 1954, 1973, 1954, 1974, 1974, 1956, 1957, 1958, 1975, 1977, 1977, 1977, 1961, 1979, 1961] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 32 / Rank 6] Tasks: ['Code'] | Lens: [32882] → Tgt Spa: ['1.000'] [Step 32 / Rank 7] Tasks: ['Code'] | Lens: [32882] → Tgt Spa: ['1.000'] [Step 32 / Rank 1] Tasks: ['Single QA'] | Lens: [64902] → Tgt Spa: ['0.350'] [Step 32 / Rank 4] Tasks: ['Summarization', 'MultiHop QA', 'Code', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA'] | Lens: [1966, 1950, 1957, 1948, 1950, 1951, 1952, 1950, 1950, 1969, 1951, 1950, 1971, 1952, 1954, 1972, 1971, 1953, 1954, 1973, 1954, 1974, 1974, 1956, 1957, 1958, 
1975, 1977, 1977, 1977, 1961, 1979, 1961] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 32 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27019, 27021] → Tgt Spa: ['1.000', '1.000'] [Step 32 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27019, 27021] → Tgt Spa: ['1.000', '1.000'] [Step 32 / Rank 0] Tasks: ['Single QA'] | Lens: [64902] → Tgt Spa: ['0.350'] [Step 32 / Rank 5] Tasks: ['Single QA'] | Lens: [38724] → Tgt Spa: ['0.350'] [Step 32 / Rank 7] Tasks: ['Single QA'] | Lens: [52403] → Tgt Spa: ['0.350'] [Step 32 / Rank 4] Tasks: ['Single QA'] | Lens: [38724] → Tgt Spa: ['0.350'] [Step 32 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [19070, 19071, 19071] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 32 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [19070, 19071, 19071] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 32 / Rank 3] Tasks: ['Single QA'] | Lens: [40466] → Tgt Spa: ['0.350'] [Step 32 / Rank 2] Tasks: ['Single QA'] | Lens: [40466] → Tgt Spa: ['0.350'] [Step 32 / Rank 6] Tasks: ['Single QA'] | Lens: [52403] → Tgt Spa: ['0.350'] [Step 32 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59400] → Tgt Spa: ['1.000'] [Step 32 / Rank 0] Tasks: ['Single QA'] | Lens: [54454] → Tgt Spa: ['0.350'] [Step 32 / Rank 7] Tasks: ['Single QA'] | Lens: [53364] → Tgt Spa: ['0.350'] [Step 32 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [50792] → Tgt Spa: ['1.000'] [Step 32 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59400] → Tgt Spa: ['1.000'] [Step 32 / Rank 1] Tasks: ['Single QA'] | Lens: [54454] → Tgt Spa: ['0.350'] [Step 32 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [50792] → Tgt Spa: ['1.000'] [Step 32 / Rank 6] Tasks: ['Single 
QA'] | Lens: [53364] → Tgt Spa: ['0.350'] [Step 32 / Rank 5] Tasks: ['Single QA'] | Lens: [41623] → Tgt Spa: ['0.350'] [Step 32 / Rank 1] Tasks: ['Single QA'] | Lens: [60436] → Tgt Spa: ['0.350'] [Step 32 / Rank 3] Tasks: ['Single QA'] | Lens: [51574] → Tgt Spa: ['0.350'] [Step 32 / Rank 2] Tasks: ['Single QA'] | Lens: [51574] → Tgt Spa: ['0.350'] [Step 32 / Rank 4] Tasks: ['Single QA'] | Lens: [41623] → Tgt Spa: ['0.350'] [Step 32 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [18247, 18250, 18250] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 32 / Rank 0] Tasks: ['Single QA'] | Lens: [60436] → Tgt Spa: ['0.350'] [Step 32 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [18247, 18250, 18250] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 32 / Rank 5] Tasks: ['Single QA'] | Lens: [59572] → Tgt Spa: ['0.350'] [Step 32 / Rank 4] Tasks: ['Single QA'] | Lens: [59572] → Tgt Spa: ['0.350'] [Step 32 / Rank 1] Tasks: ['Code'] | Lens: [35110] → Tgt Spa: ['1.000'] [Step 32 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16795, 16798, 16788] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 32 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16795, 16798, 16788] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 32 / Rank 7] Tasks: ['Code'] | Lens: [34641] → Tgt Spa: ['1.000'] [Step 32 / Rank 0] Tasks: ['Code'] | Lens: [35110] → Tgt Spa: ['1.000'] [Step 32 / Rank 6] Tasks: ['Code'] | Lens: [34641] → Tgt Spa: ['1.000'] [Step 32 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25312, 25313] → Tgt Spa: ['1.000', '1.000'] [Step 32 / Rank 0] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [11315, 11323, 11323, 11319, 11326] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000'] [Step 32 / Rank 1] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [11315, 11323, 11323, 11319, 11326] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000'] [Step 32 / Rank 6] Tasks: ['Single QA'] | Lens: [35628] 
→ Tgt Spa: ['0.350'] [Step 32 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25312, 25313] → Tgt Spa: ['1.000', '1.000'] [Step 32 / Rank 7] Tasks: ['Single QA'] | Lens: [35628] → Tgt Spa: ['0.350'] [Step 32 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25152, 25173] → Tgt Spa: ['1.000', '1.000'] [Step 32 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25152, 25173] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:54:05,538 >> @ 32 | Loss: 1.9804 | LM: 1.9026 | Reg: 0.0777 | Spa(Avg): 0.427 [INFO|lh_trainer.py:797] 2026-02-16 19:54:05,538 >> Statistic -> Code | Spa: 0.441 | Tgt: 1.000 | Z-Loss: 0.109 | [INFO|lh_trainer.py:797] 2026-02-16 19:54:05,538 >> Statistic -> In-Context | Spa: 0.413 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:54:05,538 >> Statistic -> MultiHop | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:54:05,539 >> Statistic -> Single | Spa: 0.419 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:54:05,539 >> Statistic -> Summarization | Spa: 0.437 | Tgt: 1.000 | Z-Loss: 0.145 | [INFO|lh_trainer.py:810] 2026-02-16 19:54:05,541 >> [Micro-Log] {"loss": 1.9803785420954227, "lm_loss": 1.9026369812587898, "reg_loss": 0.07774157928846155, "model_sparsity(avg)": 0.4268378255267938, "Spa-Single QA sparsity": 0.41851851940155027, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.035076149149487416, "Spa-Code sparsity": 0.44146824734551565, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10903575005275863, "Spa-In-Context Learning sparsity": 0.41269841364451815, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13463362412793295, "Spa-Summarization sparsity": 0.436631940305233, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1453800923191011, "Spa-MultiHop QA sparsity": 0.392973847248975, 
"Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02282761221559828, "step": 32, "current_tau": 1.4620120525360107, "lambda1 Single QA": 0.482421875, "lambda2 MultiHop QA": 0.2421875, "lambda3 Summarization": 0.045166015625, "lambda4 Code": 0.1416015625} [INFO|lh_trainer.py:331] 2026-02-16 19:54:18,161 >> {'loss': 11.8823, 'grad_norm': 1.2737656831741333, 'learning_rate': 0.0002666666666666667, 'epoch': 0.03475513428120063, 'num_input_tokens_seen': 81976622, 'completed': '11.00% (33 / 300)', 'remaining time': '12:28:30', 'throughput': '7080.19', 'gpu_mem_free': '9745MB', 'step': 33} [Step 33 / Rank 2] Tasks: ['Code'] | Lens: [39392] → Tgt Spa: ['1.000'] [Step 33 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18645, 18646, 18658] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 33 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57687] → Tgt Spa: ['1.000'] [Step 33 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18645, 18646, 18658] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 33 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57687] → Tgt Spa: ['1.000'] [Step 33 / Rank 5] Tasks: ['Code'] | Lens: [45001] → Tgt Spa: ['1.000'] [Step 33 / Rank 3] Tasks: ['Code'] | Lens: [39392] → Tgt Spa: ['1.000'] [Step 33 / Rank 4] Tasks: ['Code'] | Lens: [45001] → Tgt Spa: ['1.000'] [Step 33 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24544, 24546] → Tgt Spa: ['1.000', '0.350'] [Step 33 / Rank 7] Tasks: ['Single QA'] | Lens: [36682] → Tgt Spa: ['0.350'] [Step 33 / Rank 6] Tasks: ['Single QA'] | Lens: [36682] → Tgt Spa: ['0.350'] [Step 33 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24544, 24546] → Tgt Spa: ['1.000', '0.350'] [Step 33 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [29457, 29476] → Tgt Spa: ['1.000', '1.000'] [Step 33 / Rank 0] Tasks: ['Single QA'] | Lens: [44049] → Tgt Spa: ['0.350'] [Step 33 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: 
[29457, 29476] → Tgt Spa: ['1.000', '1.000'] [Step 33 / Rank 1] Tasks: ['Single QA'] | Lens: [44049] → Tgt Spa: ['0.350'] [Step 33 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61502] → Tgt Spa: ['1.000'] [Step 33 / Rank 5] Tasks: ['Summarization', 'Summarization', 'In-Context Learning'] | Lens: [21430, 21431, 21413] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 33 / Rank 4] Tasks: ['Summarization', 'Summarization', 'In-Context Learning'] | Lens: [21430, 21431, 21413] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 33 / Rank 0] Tasks: ['Single QA'] | Lens: [38450] → Tgt Spa: ['0.350'] [Step 33 / Rank 7] Tasks: ['Single QA'] | Lens: [55332] → Tgt Spa: ['0.350'] [Step 33 / Rank 6] Tasks: ['Single QA'] | Lens: [55332] → Tgt Spa: ['0.350'] [Step 33 / Rank 1] Tasks: ['Single QA'] | Lens: [38450] → Tgt Spa: ['0.350'] [Step 33 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61502] → Tgt Spa: ['1.000'] [Step 33 / Rank 3] Tasks: ['Single QA'] | Lens: [33771] → Tgt Spa: ['0.350'] [Step 33 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28778, 28780] → Tgt Spa: ['1.000', '1.000'] [Step 33 / Rank 0] Tasks: ['Code'] | Lens: [51766] → Tgt Spa: ['1.000'] [Step 33 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28778, 28780] → Tgt Spa: ['1.000', '1.000'] [Step 33 / Rank 2] Tasks: ['Single QA'] | Lens: [33771] → Tgt Spa: ['0.350'] [Step 33 / Rank 1] Tasks: ['Code'] | Lens: [51766] → Tgt Spa: ['1.000'] [Step 33 / Rank 6] Tasks: ['Single QA'] | Lens: [62113] → Tgt Spa: ['0.350'] [Step 33 / Rank 7] Tasks: ['Single QA'] | Lens: [62113] → Tgt Spa: ['0.350'] [Step 33 / Rank 1] Tasks: ['Single QA'] | Lens: [51693] → Tgt Spa: ['0.350'] [Step 33 / Rank 2] Tasks: ['Single QA'] | Lens: [60325] → Tgt Spa: ['0.350'] [Step 33 / Rank 3] Tasks: ['Single QA'] | Lens: [60325] → Tgt Spa: ['0.350'] [Step 33 / Rank 5] Tasks: ['Code'] | Lens: [61475] → Tgt Spa: ['1.000'] [Step 33 / Rank 6] Tasks: ['Code'] | Lens: [59894] → Tgt Spa: ['1.000'] [Step 33 / 
Rank 7] Tasks: ['Code'] | Lens: [59894] → Tgt Spa: ['1.000'] [Step 33 / Rank 4] Tasks: ['Code'] | Lens: [61475] → Tgt Spa: ['1.000'] [Step 33 / Rank 0] Tasks: ['Single QA'] | Lens: [51693] → Tgt Spa: ['0.350'] [Step 33 / Rank 2] Tasks: ['Single QA'] | Lens: [40355] → Tgt Spa: ['0.350'] [Step 33 / Rank 4] Tasks: ['Single QA'] | Lens: [51695] → Tgt Spa: ['0.350'] [Step 33 / Rank 0] Tasks: ['Single QA'] | Lens: [58953] → Tgt Spa: ['0.350'] [Step 33 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [21938, 21939] → Tgt Spa: ['1.000', '1.000'] [Step 33 / Rank 3] Tasks: ['Single QA'] | Lens: [40355] → Tgt Spa: ['0.350'] [Step 33 / Rank 1] Tasks: ['Single QA'] | Lens: [58953] → Tgt Spa: ['0.350'] [Step 33 / Rank 5] Tasks: ['Single QA'] | Lens: [51695] → Tgt Spa: ['0.350'] [Step 33 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [21938, 21939] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:56:52,380 >> @ 33 | Loss: 1.8825 | LM: 1.7954 | Reg: 0.0872 | Spa(Avg): 0.354 [INFO|lh_trainer.py:797] 2026-02-16 19:56:52,381 >> Statistic -> Code | Spa: 0.310 | Tgt: 1.000 | Z-Loss: 0.144 | [INFO|lh_trainer.py:797] 2026-02-16 19:56:52,381 >> Statistic -> In-Context | Spa: 0.361 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:56:52,381 >> Statistic -> MultiHop | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:56:52,381 >> Statistic -> Single | Spa: 0.369 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:56:52,381 >> Statistic -> Summarization | Spa: 0.368 | Tgt: 1.000 | Z-Loss: 0.180 | [INFO|lh_trainer.py:810] 2026-02-16 19:56:52,383 >> [Micro-Log] {"loss": 1.8825271713236968, "lm_loss": 1.7953596860170364, "reg_loss": 0.0871674843559352, "model_sparsity(avg)": 0.35435956964890164, "Spa-In-Context Learning sparsity": 0.3611111111111111, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14788297646575504, 
"Spa-Single QA sparsity": 0.3692129651705424, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.020943190863666434, "Spa-Code sparsity": 0.3095238038471767, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14374705297606333, "Spa-Summarization sparsity": 0.3680555522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17972857132554054, "Spa-MultiHop QA sparsity": 0.392973847248975, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02282761221559828, "step": 33, "current_tau": 1.459667682647705, "lambda1 Single QA": 0.484375, "lambda2 MultiHop QA": 0.2421875, "lambda3 Summarization": 0.045654296875, "lambda4 Code": 0.1416015625} [INFO|lh_trainer.py:331] 2026-02-16 19:57:15,505 >> {'loss': 11.2952, 'grad_norm': 1.589759349822998, 'learning_rate': 0.000275, 'epoch': 0.03580832016850974, 'num_input_tokens_seen': 84456254, 'completed': '11.33% (34 / 300)', 'remaining time': '12:26:53', 'throughput': '6991.04', 'gpu_mem_free': '6903MB', 'step': 34} [Step 34 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19104, 19094, 19108] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 34 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27973, 27975] → Tgt Spa: ['1.000', '1.000'] [Step 34 / Rank 1] Tasks: ['Single QA'] | Lens: [49232] → Tgt Spa: ['0.350'] [Step 34 / Rank 6] Tasks: ['Single QA'] | Lens: [51629] → Tgt Spa: ['0.350'] [Step 34 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27973, 27975] → Tgt Spa: ['1.000', '1.000'] [Step 34 / Rank 7] Tasks: ['Single QA'] | Lens: [51629] → Tgt Spa: ['0.350'] [Step 34 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19104, 19094, 19108] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 34 / Rank 0] Tasks: ['Single QA'] | Lens: [49232] → Tgt Spa: ['0.350'] [Step 34 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [30393, 30395] → Tgt Spa: ['1.000', '1.000'] [Step 34 / 
Rank 5] Tasks: ['Code', 'Code'] | Lens: [30393, 30395] → Tgt Spa: ['1.000', '1.000'] [Step 34 / Rank 0] Tasks: ['Single QA'] | Lens: [43817] → Tgt Spa: ['0.350'] [Step 34 / Rank 6] Tasks: ['Single QA'] | Lens: [62389] → Tgt Spa: ['0.350'] [Step 34 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [31616, 31617] → Tgt Spa: ['0.350', '0.350'] [Step 34 / Rank 1] Tasks: ['Single QA'] | Lens: [43817] → Tgt Spa: ['0.350'] [Step 34 / Rank 7] Tasks: ['Single QA'] | Lens: [62389] → Tgt Spa: ['0.350'] [Step 34 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [31616, 31617] → Tgt Spa: ['0.350', '0.350'] [Step 34 / Rank 1] Tasks: ['Single QA'] | Lens: [60132] → Tgt Spa: ['0.350'] [Step 34 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40291] → Tgt Spa: ['1.000'] [Step 34 / Rank 0] Tasks: ['Single QA'] | Lens: [60132] → Tgt Spa: ['0.350'] [Step 34 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42649] → Tgt Spa: ['1.000'] [Step 34 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42649] → Tgt Spa: ['1.000'] [Step 34 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40291] → Tgt Spa: ['1.000'] [Step 34 / Rank 6] Tasks: ['Single QA'] | Lens: [65063] → Tgt Spa: ['0.350'] [Step 34 / Rank 7] Tasks: ['Single QA'] | Lens: [65063] → Tgt Spa: ['0.350'] [Step 34 / Rank 7] Tasks: ['Single QA'] | Lens: [64037] → Tgt Spa: ['0.350'] [Step 34 / Rank 3] Tasks: ['Single QA'] | Lens: [53580] → Tgt Spa: ['0.350'] [Step 34 / Rank 0] Tasks: ['Single QA'] | Lens: [42424] → Tgt Spa: ['0.350'] [Step 34 / Rank 4] Tasks: ['Summarization'] | Lens: [38766] → Tgt Spa: ['1.000'] [Step 34 / Rank 2] Tasks: ['Single QA'] | Lens: [53580] → Tgt Spa: ['0.350'] [Step 34 / Rank 1] Tasks: ['Single QA'] | Lens: [42424] → Tgt Spa: ['0.350'] [Step 34 / Rank 6] Tasks: ['Single QA'] | Lens: [64037] → Tgt Spa: ['0.350'] [Step 34 / Rank 5] Tasks: ['Summarization'] | Lens: [38766] → Tgt Spa: ['1.000'] [Step 34 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43927] → Tgt Spa: ['1.000'] [Step 34 / Rank 6] Tasks: 
['Single QA', 'Single QA'] | Lens: [32714, 32714] → Tgt Spa: ['0.350', '0.350'] [Step 34 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32714, 32714] → Tgt Spa: ['0.350', '0.350'] [Step 34 / Rank 1] Tasks: ['Summarization', 'Single QA'] | Lens: [21861, 21843] → Tgt Spa: ['1.000', '0.350'] [Step 34 / Rank 2] Tasks: ['Code'] | Lens: [41791] → Tgt Spa: ['1.000'] [Step 34 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43927] → Tgt Spa: ['1.000'] [Step 34 / Rank 3] Tasks: ['Code'] | Lens: [41791] → Tgt Spa: ['1.000'] [Step 34 / Rank 0] Tasks: ['Summarization', 'Single QA'] | Lens: [21861, 21843] → Tgt Spa: ['1.000', '0.350'] [Step 34 / Rank 1] Tasks: ['Single QA'] | Lens: [49210] → Tgt Spa: ['0.350'] [Step 34 / Rank 5] Tasks: ['Code'] | Lens: [51098] → Tgt Spa: ['1.000'] [Step 34 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [37001] → Tgt Spa: ['1.000'] [Step 34 / Rank 0] Tasks: ['Single QA'] | Lens: [49210] → Tgt Spa: ['0.350'] [Step 34 / Rank 4] Tasks: ['Code'] | Lens: [51098] → Tgt Spa: ['1.000'] [Step 34 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [37001] → Tgt Spa: ['1.000'] [Step 34 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44471] → Tgt Spa: ['1.000'] [Step 34 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44471] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 19:59:51,337 >> @ 34 | Loss: 2.2197 | LM: 2.1389 | Reg: 0.0807 | Spa(Avg): 0.389 [INFO|lh_trainer.py:797] 2026-02-16 19:59:51,337 >> Statistic -> Code | Spa: 0.397 | Tgt: 1.000 | Z-Loss: 0.121 | [INFO|lh_trainer.py:797] 2026-02-16 19:59:51,337 >> Statistic -> In-Context | Spa: 0.395 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:59:51,337 >> Statistic -> MultiHop | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:59:51,337 >> Statistic -> Single | Spa: 0.399 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 19:59:51,337 >> Statistic -> Summarization | Spa: 0.382 | Tgt: 1.000 | Z-Loss: 0.169 | 
[INFO|lh_trainer.py:810] 2026-02-16 19:59:51,339 >> [Micro-Log] {"loss": 2.2196759258707366, "lm_loss": 2.1389434039592743, "reg_loss": 0.08073251358776663, "model_sparsity(avg)": 0.389178233842055, "Spa-Single QA sparsity": 0.3990740696589152, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.027523816144093872, "Spa-Summarization sparsity": 0.3819444477558136, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16851109638810158, "Spa-Code sparsity": 0.3972222089767456, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12104972898960113, "Spa-In-Context Learning sparsity": 0.3948412537574768, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13986043738467352, "Spa-MultiHop QA sparsity": 0.392973847248975, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02282761221559828, "step": 34, "current_tau": 1.4572594165802002, "lambda1 Single QA": 0.484375, "lambda2 MultiHop QA": 0.2431640625, "lambda3 Summarization": 0.046142578125, "lambda4 Code": 0.142578125} [INFO|lh_trainer.py:331] 2026-02-16 20:00:09,532 >> {'loss': 13.3181, 'grad_norm': 1.317963719367981, 'learning_rate': 0.00028333333333333335, 'epoch': 0.03686150605581885, 'num_input_tokens_seen': 86912082, 'completed': '11.67% (35 / 300)', 'remaining time': '12:24:47', 'throughput': '7055.89', 'gpu_mem_free': '11321MB', 'step': 35} [Step 35 / Rank 7] Tasks: ['Single QA'] | Lens: [34703] → Tgt Spa: ['0.350'] [Step 35 / Rank 5] Tasks: ['Single QA'] | Lens: [49539] → Tgt Spa: ['0.350'] [Step 35 / Rank 0] Tasks: ['Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop 
QA', 'MultiHop QA', 'Summarization'] | Lens: [2400, 2400, 2385, 2384, 2401, 2384, 2402, 2384, 2403, 2385, 2386, 2386, 2404, 2404, 2395, 2388, 2389, 2389, 2391, 2387, 2391, 2390, 2392, 2390, 2395, 2393, 2409] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 35 / Rank 6] Tasks: ['Single QA'] | Lens: [34703] → Tgt Spa: ['0.350'] [Step 35 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [58106] → Tgt Spa: ['1.000'] [Step 35 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [58106] → Tgt Spa: ['1.000'] [Step 35 / Rank 1] Tasks: ['Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [2400, 2400, 2385, 2384, 2401, 2384, 2402, 2384, 2403, 2385, 2386, 2386, 2404, 2404, 2395, 2388, 2389, 2389, 2391, 2387, 2391, 2390, 2392, 2390, 2395, 2393, 2409] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 35 / Rank 4] Tasks: ['Single QA'] | Lens: [49539] → Tgt Spa: ['0.350'] [Step 35 / Rank 0] Tasks: ['Single QA'] | Lens: [38622] → Tgt Spa: ['0.350'] [Step 35 / Rank 2] Tasks: ['Summarization'] | Lens: [35950] → Tgt Spa: ['1.000'] [Step 35 / Rank 5] Tasks: ['Code'] | Lens: [44391] → Tgt Spa: ['1.000'] [Step 35 / Rank 1] Tasks: ['Single QA'] | Lens: [38622] → Tgt Spa: ['0.350'] [Step 35 / Rank 4] Tasks: 
['Code'] | Lens: [44391] → Tgt Spa: ['1.000'] [Step 35 / Rank 6] Tasks: ['Single QA'] | Lens: [64034] → Tgt Spa: ['0.350'] [Step 35 / Rank 7] Tasks: ['Single QA'] | Lens: [64034] → Tgt Spa: ['0.350'] [Step 35 / Rank 3] Tasks: ['Summarization'] | Lens: [35950] → Tgt Spa: ['1.000'] [Step 35 / Rank 1] Tasks: ['Single QA'] | Lens: [41147] → Tgt Spa: ['0.350'] [Step 35 / Rank 5] Tasks: ['Single QA'] | Lens: [38708] → Tgt Spa: ['0.350'] [Step 35 / Rank 3] Tasks: ['Code'] | Lens: [60625] → Tgt Spa: ['1.000'] [Step 35 / Rank 0] Tasks: ['Single QA'] | Lens: [41147] → Tgt Spa: ['0.350'] [Step 35 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [24810, 24818] → Tgt Spa: ['1.000', '1.000'] [Step 35 / Rank 4] Tasks: ['Single QA'] | Lens: [38708] → Tgt Spa: ['0.350'] [Step 35 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [24810, 24818] → Tgt Spa: ['1.000', '1.000'] [Step 35 / Rank 2] Tasks: ['Code'] | Lens: [60625] → Tgt Spa: ['1.000'] [Step 35 / Rank 6] Tasks: ['Single QA'] | Lens: [41129] → Tgt Spa: ['0.350'] [Step 35 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17231, 17231, 17231] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 35 / Rank 4] Tasks: ['Single QA'] | Lens: [43585] → Tgt Spa: ['0.350'] [Step 35 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [20256, 20246, 20258] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 35 / Rank 5] Tasks: ['Single QA'] | Lens: [43585] → Tgt Spa: ['0.350'] [Step 35 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17231, 17231, 17231] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 35 / Rank 7] Tasks: ['Single QA'] | Lens: [41129] → Tgt Spa: ['0.350'] [Step 35 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [20256, 20246, 20258] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 35 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43067] → Tgt Spa: ['1.000'] [Step 35 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43067] → Tgt Spa: 
['1.000'] [Step 35 / Rank 1] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization'] | Lens: [5133, 5133, 5134, 5135, 5142, 5136, 5136, 5136, 5137, 5138, 5155, 5156] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 35 / Rank 3] Tasks: ['Single QA'] | Lens: [64052] → Tgt Spa: ['0.350'] [Step 35 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [24416, 24417] → Tgt Spa: ['0.350', '0.350'] [Step 35 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [24416, 24417] → Tgt Spa: ['0.350', '0.350'] [Step 35 / Rank 0] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization'] | Lens: [5133, 5133, 5134, 5135, 5142, 5136, 5136, 5136, 5137, 5138, 5155, 5156] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 35 / Rank 2] Tasks: ['Single QA'] | Lens: [64052] → Tgt Spa: ['0.350'] [Step 35 / Rank 5] Tasks: ['Code'] | Lens: [41043] → Tgt Spa: ['1.000'] [Step 35 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [30203, 30211] → Tgt Spa: ['0.350', '1.000'] [Step 35 / Rank 7] Tasks: ['Single QA'] | Lens: [44452] → Tgt Spa: ['0.350'] [Step 35 / Rank 6] Tasks: ['Single QA'] | Lens: [44452] → Tgt Spa: ['0.350'] [Step 35 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [30203, 30211] → Tgt Spa: ['0.350', '1.000'] [Step 35 / Rank 4] Tasks: ['Code'] | Lens: [41043] → Tgt Spa: ['1.000'] [Step 35 / Rank 2] Tasks: ['Single QA'] | Lens: [51546] → Tgt Spa: ['0.350'] [Step 35 / Rank 3] Tasks: ['Single QA'] | Lens: [51546] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 20:02:41,841 >> @ 35 | Loss: 2.0809 | 
LM: 1.9919 | Reg: 0.0891 | Spa(Avg): 0.368 [INFO|lh_trainer.py:797] 2026-02-16 20:02:41,841 >> Statistic -> Code | Spa: 0.340 | Tgt: 1.000 | Z-Loss: 0.137 | [INFO|lh_trainer.py:797] 2026-02-16 20:02:41,841 >> Statistic -> In-Context | Spa: 0.358 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:02:41,841 >> Statistic -> MultiHop | Spa: 0.363 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:02:41,841 >> Statistic -> Single | Spa: 0.373 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:02:41,841 >> Statistic -> Summarization | Spa: 0.403 | Tgt: 1.000 | Z-Loss: 0.161 | [INFO|lh_trainer.py:810] 2026-02-16 20:02:41,843 >> [Micro-Log] {"loss": 2.0809302696337304, "lm_loss": 1.9918606827656429, "reg_loss": 0.08906958142567116, "model_sparsity(avg)": 0.3678090659280618, "Spa-Summarization sparsity": 0.4027777761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16147152101621032, "Spa-MultiHop QA sparsity": 0.3627450886894675, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.011528559186605407, "Spa-In-Context Learning sparsity": 0.3583333373069763, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14964658692479132, "Spa-Code sparsity": 0.3402777761220932, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13677940517663956, "Spa-Single QA sparsity": 0.37336600878659415, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.032113630064379645, "step": 35, "current_tau": 1.4547879695892334, "lambda1 Single QA": 0.484375, "lambda2 MultiHop QA": 0.2431640625, "lambda3 Summarization": 0.046630859375, "lambda4 Code": 0.1435546875} [INFO|lh_trainer.py:331] 2026-02-16 20:03:00,378 >> {'loss': 12.4856, 'grad_norm': 1.2985906600952148, 'learning_rate': 0.0002916666666666667, 'epoch': 0.037914691943127965, 'num_input_tokens_seen': 89296692, 'completed': '12.00% (36 / 300)', 'remaining time': 
'12:22:14', 'throughput': '6978.80', 'gpu_mem_free': '8025MB', 'step': 36} [Step 36 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [23616, 23608] → Tgt Spa: ['1.000', '1.000'] [Step 36 / Rank 2] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1587, 1606, 1606, 1588, 1590, 1589, 1593, 1609, 1609, 1592, 1591, 1591, 1610, 1610, 1592, 1591, 1611, 1593, 1592, 1613, 1595, 1594, 1594, 1613, 1597, 1595, 1597, 1596, 1615, 1596, 1596, 1616, 1616, 1598, 1598, 1598, 1599, 1600, 1599, 1618] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 36 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25948, 25950] → Tgt Spa: ['1.000', '1.000'] [Step 36 / Rank 3] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 
'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1587, 1606, 1606, 1588, 1590, 1589, 1593, 1609, 1609, 1592, 1591, 1591, 1610, 1610, 1592, 1591, 1611, 1593, 1592, 1613, 1595, 1594, 1594, 1613, 1597, 1595, 1597, 1596, 1615, 1596, 1596, 1616, 1616, 1598, 1598, 1598, 1599, 1600, 1599, 1618] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 36 / Rank 1] Tasks: ['Single QA'] | Lens: [64047] → Tgt Spa: ['0.350'] [Step 36 / Rank 0] Tasks: ['Single QA'] | Lens: [64047] → Tgt Spa: ['0.350'] [Step 36 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25948, 25950] → Tgt Spa: ['1.000', '1.000'] [Step 36 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [23616, 23608] → Tgt Spa: ['1.000', '1.000'] [Step 36 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [25152, 25141] → Tgt Spa: ['1.000', '1.000'] [Step 36 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [29223, 29231] → Tgt Spa: ['0.350', '1.000'] [Step 36 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [29223, 29231] → Tgt Spa: ['0.350', '1.000'] [Step 36 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [28605, 28605] → Tgt Spa: ['0.350', '0.350'] [Step 36 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [28605, 28605] → Tgt Spa: ['0.350', '0.350'] [Step 36 / Rank 5] Tasks: ['Single QA'] | Lens: [33910] → Tgt Spa: ['0.350'] [Step 36 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [25152, 25141] → Tgt Spa: ['1.000', '1.000'] [Step 36 / Rank 4] Tasks: ['Single 
QA'] | Lens: [33910] → Tgt Spa: ['0.350'] [Step 36 / Rank 2] Tasks: ['Single QA'] | Lens: [36658] → Tgt Spa: ['0.350'] [Step 36 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29673, 29674] → Tgt Spa: ['0.350', '0.350'] [Step 36 / Rank 3] Tasks: ['Single QA'] | Lens: [36658] → Tgt Spa: ['0.350'] [Step 36 / Rank 6] Tasks: ['Single QA'] | Lens: [39966] → Tgt Spa: ['0.350'] [Step 36 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29673, 29674] → Tgt Spa: ['0.350', '0.350'] [Step 36 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43080] → Tgt Spa: ['1.000'] [Step 36 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43080] → Tgt Spa: ['1.000'] [Step 36 / Rank 7] Tasks: ['Single QA'] | Lens: [39966] → Tgt Spa: ['0.350'] [Step 36 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [54583] → Tgt Spa: ['1.000'] [Step 36 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16851, 16864, 16855] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 36 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [31606, 31606] → Tgt Spa: ['1.000', '1.000'] [Step 36 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [32718, 32728] → Tgt Spa: ['0.350', '1.000'] [Step 36 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16851, 16864, 16855] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 36 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [54583] → Tgt Spa: ['1.000'] [Step 36 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [32718, 32728] → Tgt Spa: ['0.350', '1.000'] [Step 36 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [31606, 31606] → Tgt Spa: ['1.000', '1.000'] [Step 36 / Rank 5] Tasks: ['Single QA'] | Lens: [43129] → Tgt Spa: ['0.350'] [Step 36 / Rank 4] Tasks: ['Single QA'] | Lens: [43129] → Tgt Spa: ['0.350'] [Step 36 / Rank 0] Tasks: ['Code'] | Lens: [57911] → Tgt Spa: ['1.000'] [Step 36 / Rank 7] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Summarization'] | Lens: [8426, 8424, 8421, 8421, 8421, 8432, 8442] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', 
'0.350', '1.000', '1.000'] [Step 36 / Rank 3] Tasks: ['Single QA'] | Lens: [58640] → Tgt Spa: ['0.350'] [Step 36 / Rank 1] Tasks: ['Code'] | Lens: [57911] → Tgt Spa: ['1.000'] [Step 36 / Rank 6] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Summarization'] | Lens: [8426, 8424, 8421, 8421, 8421, 8432, 8442] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 36 / Rank 2] Tasks: ['Single QA'] | Lens: [58640] → Tgt Spa: ['0.350'] [Step 36 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [1459, 1459, 1478, 1460, 1461, 1461, 1461, 1480, 1462, 1481, 1464, 1463, 1462, 1463, 1463, 1463, 1462, 1482, 1481, 1464, 1464, 1464, 1464, 1464, 1483, 1466, 1466, 1466, 1465, 1465, 1465, 1466, 1466, 1468, 1466, 1486, 1485, 1467, 1467, 1468, 1468, 1486, 1469, 1468] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 36 / Rank 3] Tasks: ['Single QA'] | Lens: [33982] → Tgt Spa: ['0.350'] [Step 36 / Rank 7] Tasks: ['Single QA'] | Lens: 
[52209] → Tgt Spa: ['0.350'] [Step 36 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [1459, 1459, 1478, 1460, 1461, 1461, 1461, 1480, 1462, 1481, 1464, 1463, 1462, 1463, 1463, 1463, 1462, 1482, 1481, 1464, 1464, 1464, 1464, 1464, 1483, 1466, 1466, 1466, 1465, 1465, 1465, 1466, 1466, 1468, 1466, 1486, 1485, 1467, 1467, 1468, 1468, 1486, 1469, 1468] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 36 / Rank 2] Tasks: ['Single QA'] | Lens: [33982] → Tgt Spa: ['0.350'] [Step 36 / Rank 1] Tasks: ['Single QA'] | Lens: [35325] → Tgt Spa: ['0.350'] [Step 36 / Rank 6] Tasks: ['Single QA'] | Lens: [52209] → Tgt Spa: ['0.350'] [Step 36 / Rank 0] Tasks: ['Single QA'] | Lens: [35325] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 20:05:18,110 >> @ 36 | Loss: 2.1508 | LM: 2.0660 | Reg: 0.0848 | Spa(Avg): 0.353 [INFO|lh_trainer.py:797] 2026-02-16 20:05:18,110 >> Statistic -> Code | Spa: 0.366 | Tgt: 1.000 | Z-Loss: 0.131 | [INFO|lh_trainer.py:797] 2026-02-16 
20:05:18,110 >> Statistic -> In-Context | Spa: 0.358 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:05:18,110 >> Statistic -> MultiHop | Spa: 0.354 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:05:18,110 >> Statistic -> Single | Spa: 0.332 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:05:18,110 >> Statistic -> Summarization | Spa: 0.352 | Tgt: 1.000 | Z-Loss: 0.186 | [INFO|lh_trainer.py:810] 2026-02-16 20:05:18,112 >> [Micro-Log] {"loss": 2.1508078233649335, "lm_loss": 2.066023842742046, "reg_loss": 0.08478396770078689, "model_sparsity(avg)": 0.35267824803789455, "Spa-Single QA sparsity": 0.3317901094754537, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03116441941043983, "Spa-Code sparsity": 0.3657407263914744, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13076217472553253, "Spa-In-Context Learning sparsity": 0.35833331346511843, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1505153149366379, "Spa-MultiHop QA sparsity": 0.3537186297678178, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.014958214461653224, "Spa-Summarization sparsity": 0.35166666269302366, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.18606812059879302, "step": 36, "current_tau": 1.452254295349121, "lambda1 Single QA": 0.484375, "lambda2 MultiHop QA": 0.2431640625, "lambda3 Summarization": 0.047119140625, "lambda4 Code": 0.1435546875} [INFO|lh_trainer.py:331] 2026-02-16 20:05:37,003 >> {'loss': 12.9048, 'grad_norm': 1.2449119091033936, 'learning_rate': 0.0003, 'epoch': 0.03896787783043707, 'num_input_tokens_seen': 91786022, 'completed': '12.33% (37 / 300)', 'remaining time': '12:18:00', 'throughput': '7946.77', 'gpu_mem_free': '13957MB', 'step': 37} [Step 37 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29945, 29945] → Tgt Spa: ['0.350', '0.350'] [Step 37 / Rank 2] Tasks: 
['Summarization'] | Lens: [44915] → Tgt Spa: ['1.000'] [Step 37 / Rank 3] Tasks: ['Summarization'] | Lens: [44915] → Tgt Spa: ['1.000'] [Step 37 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29945, 29945] → Tgt Spa: ['0.350', '0.350'] [Step 37 / Rank 1] Tasks: ['Code'] | Lens: [64104] → Tgt Spa: ['1.000'] [Step 37 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56110] → Tgt Spa: ['1.000'] [Step 37 / Rank 0] Tasks: ['Code'] | Lens: [64104] → Tgt Spa: ['1.000'] [Step 37 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56110] → Tgt Spa: ['1.000'] [Step 37 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [65340] → Tgt Spa: ['0.350'] [Step 37 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [65340] → Tgt Spa: ['0.350'] [Step 37 / Rank 1] Tasks: ['Single QA'] | Lens: [64860] → Tgt Spa: ['0.350'] [Step 37 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56729] → Tgt Spa: ['1.000'] [Step 37 / Rank 5] Tasks: ['Single QA'] | Lens: [63518] → Tgt Spa: ['0.350'] [Step 37 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56729] → Tgt Spa: ['1.000'] [Step 37 / Rank 0] Tasks: ['Single QA'] | Lens: [64860] → Tgt Spa: ['0.350'] [Step 37 / Rank 4] Tasks: ['Single QA'] | Lens: [63518] → Tgt Spa: ['0.350'] [Step 37 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24274, 24274] → Tgt Spa: ['0.350', '1.000'] [Step 37 / Rank 4] Tasks: ['Single QA'] | Lens: [44323] → Tgt Spa: ['0.350'] [Step 37 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [27187, 27198] → Tgt Spa: ['1.000', '1.000'] [Step 37 / Rank 5] Tasks: ['Single QA'] | Lens: [44323] → Tgt Spa: ['0.350'] [Step 37 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [27187, 27198] → Tgt Spa: ['1.000', '1.000'] [Step 37 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [45779] → Tgt Spa: ['1.000'] [Step 37 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45779] → Tgt Spa: ['1.000'] [Step 37 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24274, 24274] → Tgt Spa: ['0.350', '1.000'] [Step 37 / Rank 
6] Tasks: ['Code'] | Lens: [41540] → Tgt Spa: ['1.000'] [Step 37 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Code', 'Code'] | Lens: [8224, 8226, 8225, 8243, 8225, 8234, 8233] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 37 / Rank 0] Tasks: ['Single QA'] | Lens: [57262] → Tgt Spa: ['0.350'] [Step 37 / Rank 7] Tasks: ['Code'] | Lens: [41540] → Tgt Spa: ['1.000'] [Step 37 / Rank 1] Tasks: ['Single QA'] | Lens: [57262] → Tgt Spa: ['0.350'] [Step 37 / Rank 3] Tasks: ['Single QA'] | Lens: [53267] → Tgt Spa: ['0.350'] [Step 37 / Rank 2] Tasks: ['Single QA'] | Lens: [53267] → Tgt Spa: ['0.350'] [Step 37 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Code', 'Code'] | Lens: [8224, 8226, 8225, 8243, 8225, 8234, 8233] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 37 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32557, 32558] → Tgt Spa: ['0.350', '0.350'] [Step 37 / Rank 7] Tasks: ['Single QA'] | Lens: [57290] → Tgt Spa: ['0.350'] [Step 37 / Rank 3] Tasks: ['Single QA'] | Lens: [57239] → Tgt Spa: ['0.350'] [Step 37 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32557, 32558] → Tgt Spa: ['0.350', '0.350'] [Step 37 / Rank 6] Tasks: ['Single QA'] | Lens: [57290] → Tgt Spa: ['0.350'] [Step 37 / Rank 2] Tasks: ['Single QA'] | Lens: [57239] → Tgt Spa: ['0.350'] [Step 37 / Rank 1] Tasks: ['Single QA'] | Lens: [65025] → Tgt Spa: ['0.350'] [Step 37 / Rank 0] Tasks: ['Single QA'] | Lens: [65025] → Tgt Spa: ['0.350'] [Step 37 / Rank 3] Tasks: ['Single QA'] | Lens: [65168] → Tgt Spa: ['0.350'] [Step 37 / Rank 2] Tasks: ['Single QA'] | Lens: [65168] → Tgt Spa: ['0.350'] [Step 37 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23929, 23929] → Tgt Spa: ['1.000', '1.000'] [Step 37 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23929, 23929] → Tgt Spa: ['1.000', '1.000'] [Step 
37 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [9047, 9040, 9040, 9041, 9050, 9047, 9056] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 37 / Rank 0] Tasks: ['Single QA'] | Lens: [33308] → Tgt Spa: ['0.350'] [Step 37 / Rank 1] Tasks: ['Single QA'] | Lens: [33308] → Tgt Spa: ['0.350'] [Step 37 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [9047, 9040, 9040, 9041, 9050, 9047, 9056] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:08:18,470 >> @ 37 | Loss: 2.0732 | LM: 1.9757 | Reg: 0.0975 | Spa(Avg): 0.288 [INFO|lh_trainer.py:797] 2026-02-16 20:08:18,470 >> Statistic -> Code | Spa: 0.299 | Tgt: 1.000 | Z-Loss: 0.150 | [INFO|lh_trainer.py:797] 2026-02-16 20:08:18,470 >> Statistic -> In-Context | Spa: 0.266 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:08:18,470 >> Statistic -> MultiHop | Spa: 0.292 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:08:18,470 >> Statistic -> Single | Spa: 0.320 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:08:18,470 >> Statistic -> Summarization | Spa: 0.215 | Tgt: 1.000 | Z-Loss: 0.269 | [INFO|lh_trainer.py:810] 2026-02-16 20:08:18,472 >> [Micro-Log] {"loss": 2.0732172335653254, "lm_loss": 1.9756928524002433, "reg_loss": 0.09752437631444384, "model_sparsity(avg)": 0.2878224216401577, "Spa-Code sparsity": 0.2986111044883728, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1502658724784851, "Spa-Single QA sparsity": 0.32004830629929254, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03621530850701358, "Spa-In-Context Learning sparsity": 0.2658730149269104, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.17410239364419663, "Spa-Summarization sparsity": 0.215277761220932, 
"Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.2687777429819107, "Spa-MultiHop QA sparsity": 0.2916666865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.015243670903146267, "step": 37, "current_tau": 1.4496588706970215, "lambda1 Single QA": 0.484375, "lambda2 MultiHop QA": 0.2431640625, "lambda3 Summarization": 0.0478515625, "lambda4 Code": 0.14453125} [INFO|lh_trainer.py:331] 2026-02-16 20:08:46,367 >> {'loss': 12.4393, 'grad_norm': 1.4544183015823364, 'learning_rate': 0.00030833333333333337, 'epoch': 0.040021063717746184, 'num_input_tokens_seen': 94451030, 'completed': '12.67% (38 / 300)', 'remaining time': '12:17:36', 'throughput': '7036.76', 'gpu_mem_free': '14617MB', 'step': 38} [Step 38 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41236] → Tgt Spa: ['1.000'] [Step 38 / Rank 0] Tasks: ['Single QA'] | Lens: [37060] → Tgt Spa: ['0.350'] [Step 38 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32184, 32184] → Tgt Spa: ['0.350', '0.350'] [Step 38 / Rank 4] Tasks: ['Single QA'] | Lens: [59071] → Tgt Spa: ['0.350'] [Step 38 / Rank 5] Tasks: ['Single QA'] | Lens: [59071] → Tgt Spa: ['0.350'] [Step 38 / Rank 1] Tasks: ['Single QA'] | Lens: [37060] → Tgt Spa: ['0.350'] [Step 38 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41236] → Tgt Spa: ['1.000'] [Step 38 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32184, 32184] → Tgt Spa: ['0.350', '0.350'] [Step 38 / Rank 2] Tasks: ['Single QA'] | Lens: [52804] → Tgt Spa: ['0.350'] [Step 38 / Rank 3] Tasks: ['Single QA'] | Lens: [52804] → Tgt Spa: ['0.350'] [Step 38 / Rank 0] Tasks: ['Single QA'] | Lens: [55061] → Tgt Spa: ['0.350'] [Step 38 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40242] → Tgt Spa: ['1.000'] [Step 38 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40242] → Tgt Spa: ['1.000'] [Step 38 / Rank 5] Tasks: ['Single QA'] | Lens: [50879] → Tgt Spa: ['0.350'] [Step 38 / Rank 1] Tasks: ['Single QA'] | Lens: [55061] → Tgt 
Spa: ['0.350'] [Step 38 / Rank 4] Tasks: ['Single QA'] | Lens: [50879] → Tgt Spa: ['0.350'] [Step 38 / Rank 6] Tasks: ['Code', 'Code', 'Summarization', 'Code', 'Single QA'] | Lens: [12314, 12318, 12331, 12335, 12329] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350'] [Step 38 / Rank 3] Tasks: ['Single QA'] | Lens: [51865] → Tgt Spa: ['0.350'] [Step 38 / Rank 7] Tasks: ['Code', 'Code', 'Summarization', 'Code', 'Single QA'] | Lens: [12314, 12318, 12331, 12335, 12329] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350'] [Step 38 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40431] → Tgt Spa: ['1.000'] [Step 38 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40431] → Tgt Spa: ['1.000'] [Step 38 / Rank 0] Tasks: ['Single QA'] | Lens: [57713] → Tgt Spa: ['0.350'] [Step 38 / Rank 2] Tasks: ['Single QA'] | Lens: [51865] → Tgt Spa: ['0.350'] [Step 38 / Rank 1] Tasks: ['Single QA'] | Lens: [57713] → Tgt Spa: ['0.350'] [Step 38 / Rank 6] Tasks: ['Single QA'] | Lens: [45683] → Tgt Spa: ['0.350'] [Step 38 / Rank 7] Tasks: ['Single QA'] | Lens: [45683] → Tgt Spa: ['0.350'] [Step 38 / Rank 5] Tasks: ['Single QA'] | Lens: [49428] → Tgt Spa: ['0.350'] [Step 38 / Rank 2] Tasks: ['Single QA'] | Lens: [43243] → Tgt Spa: ['0.350'] [Step 38 / Rank 0] Tasks: ['In-Context Learning', 'Single QA', 'Single QA'] | Lens: [21053, 21053, 21053] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 38 / Rank 4] Tasks: ['Single QA'] | Lens: [49428] → Tgt Spa: ['0.350'] [Step 38 / Rank 3] Tasks: ['Single QA'] | Lens: [43243] → Tgt Spa: ['0.350'] [Step 38 / Rank 1] Tasks: ['In-Context Learning', 'Single QA', 'Single QA'] | Lens: [21053, 21053, 21053] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 38 / Rank 3] Tasks: ['Single QA'] | Lens: [40255] → Tgt Spa: ['0.350'] [Step 38 / Rank 0] Tasks: ['Single QA'] | Lens: [39185] → Tgt Spa: ['0.350'] [Step 38 / Rank 4] Tasks: ['Single QA'] | Lens: [52160] → Tgt Spa: ['0.350'] [Step 38 / Rank 1] Tasks: ['Single QA'] | Lens: [39185] → Tgt Spa: ['0.350'] 
[Step 38 / Rank 5] Tasks: ['Single QA'] | Lens: [52160] → Tgt Spa: ['0.350'] [Step 38 / Rank 7] Tasks: ['Code'] | Lens: [35420] → Tgt Spa: ['1.000'] [Step 38 / Rank 6] Tasks: ['Code'] | Lens: [35420] → Tgt Spa: ['1.000'] [Step 38 / Rank 2] Tasks: ['Single QA'] | Lens: [40255] → Tgt Spa: ['0.350'] [Step 38 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'Single QA', 'Single QA'] | Lens: [7971, 7971, 7972, 7973, 7973, 7971, 7973, 7974] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 38 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [18556, 18559, 18557] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 38 / Rank 6] Tasks: ['Code'] | Lens: [35282] → Tgt Spa: ['1.000'] [Step 38 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23820, 23802] → Tgt Spa: ['1.000', '1.000'] [Step 38 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [18556, 18559, 18557] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 38 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23820, 23802] → Tgt Spa: ['1.000', '1.000'] [Step 38 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'Single QA', 'Single QA'] | Lens: [7971, 7971, 7972, 7973, 7973, 7971, 7973, 7974] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 38 / Rank 7] Tasks: ['Code'] | Lens: [35282] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:11:05,062 >> @ 38 | Loss: 2.2373 | LM: 2.1700 | Reg: 0.0673 | Spa(Avg): 0.385 [INFO|lh_trainer.py:797] 2026-02-16 20:11:05,063 >> Statistic -> Code | Spa: 0.372 | Tgt: 1.000 | Z-Loss: 0.130 | [INFO|lh_trainer.py:797] 2026-02-16 20:11:05,063 >> Statistic -> In-Context | Spa: 0.414 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:11:05,063 >> Statistic -> MultiHop | Spa: 0.333 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:11:05,063 >> 
Statistic -> Single | Spa: 0.368 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:11:05,063 >> Statistic -> Summarization | Spa: 0.250 | Tgt: 1.000 | Z-Loss: 0.245 | [INFO|lh_trainer.py:810] 2026-02-16 20:11:05,065 >> [Micro-Log] {"loss": 2.2373295860985913, "lm_loss": 2.170039122303327, "reg_loss": 0.06729046472658713, "model_sparsity(avg)": 0.3851658912996451, "Spa-Single QA sparsity": 0.36777777433395387, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.030211208686232567, "Spa-In-Context Learning sparsity": 0.4138888716697693, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13769947588443757, "Spa-MultiHop QA sparsity": 0.3333333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.004064710810780525, "Spa-Code sparsity": 0.3715277835726738, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12967785075306892, "Spa-Summarization sparsity": 0.25, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.244537353515625, "step": 38, "current_tau": 1.447002649307251, "lambda1 Single QA": 0.486328125, "lambda2 MultiHop QA": 0.244140625, "lambda3 Summarization": 0.04833984375, "lambda4 Code": 0.14453125} [INFO|lh_trainer.py:331] 2026-02-16 20:11:17,339 >> {'loss': 13.424, 'grad_norm': 0.9716716408729553, 'learning_rate': 0.00031666666666666665, 'epoch': 0.04107424960505529, 'num_input_tokens_seen': 96817518, 'completed': '13.00% (39 / 300)', 'remaining time': '12:12:47', 'throughput': '7837.49', 'gpu_mem_free': '8135MB', 'step': 39} [Step 39 / Rank 4] Tasks: ['MultiHop QA'] | Lens: [65339] → Tgt Spa: ['0.350'] [Step 39 / Rank 0] Tasks: ['Single QA'] | Lens: [60990] → Tgt Spa: ['0.350'] [Step 39 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26734, 26717] → Tgt Spa: ['1.000', '1.000'] [Step 39 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [24030, 24030] → Tgt Spa: ['0.350', '0.350'] [Step 39 / 
Rank 5] Tasks: ['MultiHop QA'] | Lens: [65339] → Tgt Spa: ['0.350'] [Step 39 / Rank 1] Tasks: ['Single QA'] | Lens: [60990] → Tgt Spa: ['0.350'] [Step 39 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [24030, 24030] → Tgt Spa: ['0.350', '0.350'] [Step 39 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26734, 26717] → Tgt Spa: ['1.000', '1.000'] [Step 39 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [25258, 25259] → Tgt Spa: ['0.350', '0.350'] [Step 39 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32357, 32359] → Tgt Spa: ['0.350', '0.350'] [Step 39 / Rank 0] Tasks: ['Single QA'] | Lens: [54496] → Tgt Spa: ['0.350'] [Step 39 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [25258, 25259] → Tgt Spa: ['0.350', '0.350'] [Step 39 / Rank 6] Tasks: ['Single QA'] | Lens: [34241] → Tgt Spa: ['0.350'] [Step 39 / Rank 7] Tasks: ['Single QA'] | Lens: [34241] → Tgt Spa: ['0.350'] [Step 39 / Rank 1] Tasks: ['Single QA'] | Lens: [54496] → Tgt Spa: ['0.350'] [Step 39 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32357, 32359] → Tgt Spa: ['0.350', '0.350'] [Step 39 / Rank 6] Tasks: ['Code'] | Lens: [35666] → Tgt Spa: ['1.000'] [Step 39 / Rank 3] Tasks: ['Single QA'] | Lens: [33954] → Tgt Spa: ['0.350'] [Step 39 / Rank 2] Tasks: ['Single QA'] | Lens: [33954] → Tgt Spa: ['0.350'] [Step 39 / Rank 7] Tasks: ['Code'] | Lens: [35666] → Tgt Spa: ['1.000'] [Step 39 / Rank 0] Tasks: ['Single QA'] | Lens: [41694] → Tgt Spa: ['0.350'] [Step 39 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16595, 16587, 16587] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 39 / Rank 1] Tasks: ['Single QA'] | Lens: [41694] → Tgt Spa: ['0.350'] [Step 39 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16595, 16587, 16587] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 39 / Rank 3] Tasks: ['Single QA'] | Lens: [42437] → Tgt Spa: ['0.350'] [Step 39 / Rank 0] Tasks: ['Single QA'] | Lens: [51647] → Tgt Spa: ['0.350'] [Step 39 / Rank 4] Tasks: 
['Single QA'] | Lens: [55870] → Tgt Spa: ['0.350'] [Step 39 / Rank 2] Tasks: ['Single QA'] | Lens: [42437] → Tgt Spa: ['0.350'] [Step 39 / Rank 7] Tasks: ['Code'] | Lens: [38200] → Tgt Spa: ['1.000'] [Step 39 / Rank 5] Tasks: ['Single QA'] | Lens: [55870] → Tgt Spa: ['0.350'] [Step 39 / Rank 6] Tasks: ['Code'] | Lens: [38200] → Tgt Spa: ['1.000'] [Step 39 / Rank 1] Tasks: ['Single QA'] | Lens: [51647] → Tgt Spa: ['0.350'] [Step 39 / Rank 7] Tasks: ['Single QA'] | Lens: [51696] → Tgt Spa: ['0.350'] [Step 39 / Rank 6] Tasks: ['Single QA'] | Lens: [51696] → Tgt Spa: ['0.350'] [Step 39 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [36619] → Tgt Spa: ['1.000'] [Step 39 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [36619] → Tgt Spa: ['1.000'] [Step 39 / Rank 1] Tasks: ['Single QA'] | Lens: [36375] → Tgt Spa: ['0.350'] [Step 39 / Rank 4] Tasks: ['In-Context Learning', 'Single QA', 'Code', 'MultiHop QA', 'Code', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [7350, 7351, 7357, 7352, 7361, 7353, 7361, 7356] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 39 / Rank 5] Tasks: ['In-Context Learning', 'Single QA', 'Code', 'MultiHop QA', 'Code', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [7350, 7351, 7357, 7352, 7361, 7353, 7361, 7356] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 39 / Rank 0] Tasks: ['Single QA'] | Lens: [36375] → Tgt Spa: ['0.350'] [Step 39 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19674, 19678, 19667] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 39 / Rank 3] Tasks: ['Code'] | Lens: [62248] → Tgt Spa: ['1.000'] [Step 39 / Rank 5] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 
'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1388, 1407, 1407, 1389, 1389, 1389, 1388, 1391, 1389, 1389, 1408, 1408, 1390, 1390, 1391, 1390, 1392, 1391, 1391, 1392, 1392, 1391, 1392, 1394, 1392, 1393, 1413, 1412, 1394, 1393, 1394, 1413, 1395, 1395, 1394, 1395, 1396, 1396, 1415, 1396, 1396, 1395, 1397, 1396, 1397, 1397] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 39 / Rank 4] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1388, 1407, 1407, 1389, 1389, 1389, 1388, 1391, 1389, 1389, 1408, 1408, 1390, 1390, 1391, 1390, 1392, 1391, 1391, 1392, 1392, 1391, 1392, 
1394, 1392, 1393, 1413, 1412, 1394, 1393, 1394, 1413, 1395, 1395, 1394, 1395, 1396, 1396, 1415, 1396, 1396, 1395, 1397, 1396, 1397, 1397] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 39 / Rank 2] Tasks: ['Code'] | Lens: [62248] → Tgt Spa: ['1.000'] [Step 39 / Rank 0] Tasks: ['Single QA'] | Lens: [58402] → Tgt Spa: ['0.350'] [Step 39 / Rank 1] Tasks: ['Single QA'] | Lens: [58402] → Tgt Spa: ['0.350'] [Step 39 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19674, 19678, 19667] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:13:36,516 >> @ 39 | Loss: 1.9551 | LM: 1.8891 | Reg: 0.0661 | Spa(Avg): 0.349 [INFO|lh_trainer.py:797] 2026-02-16 20:13:36,517 >> Statistic -> Code | Spa: 0.364 | Tgt: 1.000 | Z-Loss: 0.133 | [INFO|lh_trainer.py:797] 2026-02-16 20:13:36,517 >> Statistic -> In-Context | Spa: 0.288 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:13:36,517 >> Statistic -> MultiHop | Spa: 0.360 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:13:36,517 >> Statistic -> Single | Spa: 0.358 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:13:36,517 >> Statistic -> Summarization | Spa: 0.336 | Tgt: 1.000 | Z-Loss: 0.198 | [INFO|lh_trainer.py:810] 2026-02-16 20:13:36,519 >> [Micro-Log] {"loss": 1.9551198200400297, "lm_loss": 1.8890653929750745, "reg_loss": 0.06605442091434573, "model_sparsity(avg)": 0.3488461524248123, "Spa-Single QA sparsity": 0.3584656034197126, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02257781307257357, "Spa-Summarization 
sparsity": 0.33564814428488415, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.198325265198946, "Spa-In-Context Learning sparsity": 0.2881944328546524, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.169668298214674, "Spa-Code sparsity": 0.36419752571317887, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13269256469276217, "Spa-MultiHop QA sparsity": 0.3600146127374549, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.015220357550884057, "step": 39, "current_tau": 1.4442864656448364, "lambda1 Single QA": 0.486328125, "lambda2 MultiHop QA": 0.244140625, "lambda3 Summarization": 0.049072265625, "lambda4 Code": 0.1455078125} [INFO|lh_trainer.py:331] 2026-02-16 20:14:00,676 >> {'loss': 11.7307, 'grad_norm': 0.980323076248169, 'learning_rate': 0.00032500000000000004, 'epoch': 0.042127435492364404, 'num_input_tokens_seen': 99234416, 'completed': '13.33% (40 / 300)', 'remaining time': '12:09:25', 'throughput': '7398.49', 'gpu_mem_free': '7483MB', 'step': 40} [Step 40 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61194] → Tgt Spa: ['1.000'] [Step 40 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61194] → Tgt Spa: ['1.000'] [Step 40 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56031] → Tgt Spa: ['1.000'] [Step 40 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56031] → Tgt Spa: ['1.000'] [Step 40 / Rank 0] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [7333, 7334, 7336, 7328, 7328, 7328, 7329, 7329] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 40 / Rank 4] Tasks: ['Code'] | Lens: [44095] → Tgt Spa: ['1.000'] [Step 40 / Rank 5] Tasks: ['Code'] | Lens: [44095] → Tgt Spa: ['1.000'] [Step 40 / Rank 1] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [7333, 7334, 7336, 7328, 
7328, 7328, 7329, 7329] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 40 / Rank 4] Tasks: ['Single QA'] | Lens: [58264] → Tgt Spa: ['0.350'] [Step 40 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24955, 24958] → Tgt Spa: ['1.000', '1.000'] [Step 40 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17224, 17214, 17226] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 40 / Rank 6] Tasks: ['Single QA'] | Lens: [59400] → Tgt Spa: ['0.350'] [Step 40 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17224, 17214, 17226] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 40 / Rank 5] Tasks: ['Single QA'] | Lens: [58264] → Tgt Spa: ['0.350'] [Step 40 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24955, 24958] → Tgt Spa: ['1.000', '1.000'] [Step 40 / Rank 7] Tasks: ['Single QA'] | Lens: [59400] → Tgt Spa: ['0.350'] [Step 40 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [17755, 17758, 17757] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 40 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [17755, 17758, 17757] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 40 / Rank 7] Tasks: ['Code', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Code', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [5300, 5312, 5302, 5294, 5295, 5303, 5296, 5296, 5304, 5298, 5297, 5298] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 40 / Rank 4] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17701, 17690, 17703] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 40 / Rank 6] Tasks: ['Code', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Code', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [5300, 5312, 5302, 5294, 5295, 5303, 5296, 5296, 5304, 5298, 5297, 5298] → 
Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 40 / Rank 1] Tasks: ['Single QA'] | Lens: [34602] → Tgt Spa: ['0.350'] [Step 40 / Rank 0] Tasks: ['Single QA'] | Lens: [34602] → Tgt Spa: ['0.350'] [Step 40 / Rank 5] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17701, 17690, 17703] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 40 / Rank 1] Tasks: ['Single QA'] | Lens: [41325] → Tgt Spa: ['0.350'] [Step 40 / Rank 5] Tasks: ['Single QA'] | Lens: [37053] → Tgt Spa: ['0.350'] [Step 40 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41034] → Tgt Spa: ['1.000'] [Step 40 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41034] → Tgt Spa: ['1.000'] [Step 40 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26843, 26843] → Tgt Spa: ['1.000', '1.000'] [Step 40 / Rank 0] Tasks: ['Single QA'] | Lens: [41325] → Tgt Spa: ['0.350'] [Step 40 / Rank 4] Tasks: ['Single QA'] | Lens: [37053] → Tgt Spa: ['0.350'] [Step 40 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26843, 26843] → Tgt Spa: ['1.000', '1.000'] [Step 40 / Rank 5] Tasks: ['Single QA'] | Lens: [65064] → Tgt Spa: ['0.350'] [Step 40 / Rank 0] Tasks: ['Single QA'] | Lens: [37938] → Tgt Spa: ['0.350'] [Step 40 / Rank 3] Tasks: ['Single QA'] | Lens: [43583] → Tgt Spa: ['0.350'] [Step 40 / Rank 2] Tasks: ['Single QA'] | Lens: [43583] → Tgt Spa: ['0.350'] [Step 40 / Rank 1] Tasks: ['Single QA'] | Lens: [37938] → Tgt Spa: ['0.350'] [Step 40 / Rank 4] Tasks: ['Single QA'] | Lens: [65064] → Tgt Spa: ['0.350'] [Step 40 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29638, 29639] → Tgt Spa: ['1.000', '1.000'] [Step 40 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29638, 29639] → Tgt Spa: ['1.000', '1.000'] [Step 40 / Rank 6] Tasks: ['Single QA'] | Lens: [64981] → Tgt Spa: ['0.350'] [Step 40 / Rank 5] Tasks: ['Code', 'Summarization', 'Single QA'] 
| Lens: [20720, 20731, 20714] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 40 / Rank 0] Tasks: ['Single QA'] | Lens: [37672] → Tgt Spa: ['0.350'] [Step 40 / Rank 4] Tasks: ['Code', 'Summarization', 'Single QA'] | Lens: [20720, 20731, 20714] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 40 / Rank 7] Tasks: ['Single QA'] | Lens: [64981] → Tgt Spa: ['0.350'] [Step 40 / Rank 1] Tasks: ['Single QA'] | Lens: [37672] → Tgt Spa: ['0.350'] [Step 40 / Rank 3] Tasks: ['Code'] | Lens: [50516] → Tgt Spa: ['1.000'] [Step 40 / Rank 2] Tasks: ['Code'] | Lens: [50516] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:16:19,271 >> @ 40 | Loss: 2.1368 | LM: 2.0325 | Reg: 0.1043 | Spa(Avg): 0.338 [INFO|lh_trainer.py:797] 2026-02-16 20:16:19,271 >> Statistic -> Code | Spa: 0.331 | Tgt: 1.000 | Z-Loss: 0.143 | [INFO|lh_trainer.py:797] 2026-02-16 20:16:19,271 >> Statistic -> In-Context | Spa: 0.341 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:16:19,271 >> Statistic -> MultiHop | Spa: 0.360 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:16:19,271 >> Statistic -> Single | Spa: 0.352 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:16:19,271 >> Statistic -> Summarization | Spa: 0.366 | Tgt: 1.000 | Z-Loss: 0.181 | [INFO|lh_trainer.py:810] 2026-02-16 20:16:19,273 >> [Micro-Log] {"loss": 2.1368057876825333, "lm_loss": 2.0325250532478094, "reg_loss": 0.10428074176888913, "model_sparsity(avg)": 0.3379388426740964, "Spa-Code sparsity": 0.3314814766248067, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14264928450187048, "Spa-Single QA sparsity": 0.3516081791175039, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0342103767355806, "Spa-In-Context Learning sparsity": 0.3408119586797861, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1572961497765321, "Spa-Summarization sparsity": 0.36574073632558185, "Spa-Summarization target_sparsity": 1.0, 
"Spa-Summarization log_z_loss": 0.18097037076950073, "Spa-MultiHop QA sparsity": 0.3600146127374549, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.015220357550884057, "step": 40, "current_tau": 1.4415111541748047, "lambda1 Single QA": 0.486328125, "lambda2 MultiHop QA": 0.2451171875, "lambda3 Summarization": 0.049560546875, "lambda4 Code": 0.146484375} [INFO|lh_trainer.py:331] 2026-02-16 20:16:46,122 >> {'loss': 12.8208, 'grad_norm': 1.7067333459854126, 'learning_rate': 0.0003333333333333333, 'epoch': 0.04318062137967351, 'num_input_tokens_seen': 101710538, 'completed': '13.67% (41 / 300)', 'remaining time': '12:06:19', 'throughput': '7483.22', 'gpu_mem_free': '13627MB', 'step': 41} [Step 41 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [29508, 29503] → Tgt Spa: ['1.000', '1.000'] [Step 41 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [55730] → Tgt Spa: ['1.000'] [Step 41 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [53625] → Tgt Spa: ['1.000'] [Step 41 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [53625] → Tgt Spa: ['1.000'] [Step 41 / Rank 1] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [7976, 7976, 7976, 7979, 7978, 7979, 7977, 7977] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 41 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [55730] → Tgt Spa: ['1.000'] [Step 41 / Rank 0] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [7976, 7976, 7976, 7979, 7978, 7979, 7977, 7977] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 41 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [29508, 29503] → Tgt Spa: ['1.000', '1.000'] [Step 41 / Rank 3] Tasks: ['Single QA'] | Lens: [37980] → Tgt Spa: ['0.350'] [Step 41 / Rank 6] Tasks: ['Code'] | Lens: [34522] → Tgt Spa: 
['1.000'] [Step 41 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42473] → Tgt Spa: ['1.000'] [Step 41 / Rank 7] Tasks: ['Code'] | Lens: [34522] → Tgt Spa: ['1.000'] [Step 41 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42473] → Tgt Spa: ['1.000'] [Step 41 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [26850, 26850] → Tgt Spa: ['0.350', '0.350'] [Step 41 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [26850, 26850] → Tgt Spa: ['0.350', '0.350'] [Step 41 / Rank 2] Tasks: ['Single QA'] | Lens: [37980] → Tgt Spa: ['0.350'] [Step 41 / Rank 7] Tasks: ['Single QA'] | Lens: [42411] → Tgt Spa: ['0.350'] [Step 41 / Rank 6] Tasks: ['Single QA'] | Lens: [42411] → Tgt Spa: ['0.350'] [Step 41 / Rank 5] Tasks: ['Single QA'] | Lens: [35289] → Tgt Spa: ['0.350'] [Step 41 / Rank 4] Tasks: ['Single QA'] | Lens: [35289] → Tgt Spa: ['0.350'] [Step 41 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [22313, 22322] → Tgt Spa: ['1.000', '1.000'] [Step 41 / Rank 2] Tasks: ['Single QA'] | Lens: [65033] → Tgt Spa: ['0.350'] [Step 41 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [22313, 22322] → Tgt Spa: ['1.000', '1.000'] [Step 41 / Rank 3] Tasks: ['Single QA'] | Lens: [65033] → Tgt Spa: ['0.350'] [Step 41 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [21864, 21866] → Tgt Spa: ['0.350', '0.350'] [Step 41 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [47199] → Tgt Spa: ['1.000'] [Step 41 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [21864, 21866] → Tgt Spa: ['0.350', '0.350'] [Step 41 / Rank 1] Tasks: ['Single QA'] | Lens: [49558] → Tgt Spa: ['0.350'] [Step 41 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [47199] → Tgt Spa: ['1.000'] [Step 41 / Rank 0] Tasks: ['Single QA'] | Lens: [49558] → Tgt Spa: ['0.350'] [Step 41 / Rank 3] Tasks: ['Single QA'] | Lens: [46429] → Tgt Spa: ['0.350'] [Step 41 / Rank 2] Tasks: ['Single QA'] | Lens: [46429] → Tgt Spa: ['0.350'] [Step 41 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [23754, 23754] 
→ Tgt Spa: ['0.350', '0.350'] [Step 41 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [23754, 23754] → Tgt Spa: ['0.350', '0.350'] [Step 41 / Rank 1] Tasks: ['Single QA', 'Summarization', 'Single QA'] | Lens: [20374, 20392, 20375] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 41 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [26591, 26598] → Tgt Spa: ['1.000', '1.000'] [Step 41 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26094, 26095] → Tgt Spa: ['1.000', '0.350'] [Step 41 / Rank 0] Tasks: ['Single QA', 'Summarization', 'Single QA'] | Lens: [20374, 20392, 20375] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 41 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [26591, 26598] → Tgt Spa: ['1.000', '1.000'] [Step 41 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26094, 26095] → Tgt Spa: ['1.000', '0.350'] [Step 41 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42477] → Tgt Spa: ['1.000'] [Step 41 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23071, 23072] → Tgt Spa: ['1.000', '1.000'] [Step 41 / Rank 6] Tasks: ['Single QA'] | Lens: [36417] → Tgt Spa: ['0.350'] [Step 41 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42477] → Tgt Spa: ['1.000'] [Step 41 / Rank 7] Tasks: ['Single QA'] | Lens: [36417] → Tgt Spa: ['0.350'] [Step 41 / Rank 2] Tasks: ['Single QA'] | Lens: [63910] → Tgt Spa: ['0.350'] [Step 41 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23071, 23072] → Tgt Spa: ['1.000', '1.000'] [Step 41 / Rank 3] Tasks: ['Single QA'] | Lens: [63910] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 20:18:56,166 >> @ 41 | Loss: 2.2656 | LM: 2.1816 | Reg: 0.0840 | Spa(Avg): 0.341 [INFO|lh_trainer.py:797] 2026-02-16 20:18:56,166 >> Statistic -> Code | Spa: 0.351 | Tgt: 1.000 | Z-Loss: 0.137 | [INFO|lh_trainer.py:797] 2026-02-16 20:18:56,166 >> Statistic -> In-Context | Spa: 0.337 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 
20:18:56,166 >> Statistic -> MultiHop | Spa: 0.408 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:18:56,167 >> Statistic -> Single | Spa: 0.349 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:18:56,167 >> Statistic -> Summarization | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.169 | [INFO|lh_trainer.py:810] 2026-02-16 20:18:56,169 >> [Micro-Log] {"loss": 2.2656021962563195, "lm_loss": 2.1816269047558308, "reg_loss": 0.08397529280046001, "model_sparsity(avg)": 0.3412663886944453, "Spa-MultiHop QA sparsity": 0.4083333253860474, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.016690054337959736, "Spa-Single QA sparsity": 0.3493055492639542, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01984019731171429, "Spa-In-Context Learning sparsity": 0.3371211994778026, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15915146741000089, "Spa-Code sparsity": 0.3506944328546524, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13717905804514885, "Spa-Summarization sparsity": 0.3888888955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1685926616191864, "step": 41, "current_tau": 1.438677430152893, "lambda1 Single QA": 0.486328125, "lambda2 MultiHop QA": 0.2451171875, "lambda3 Summarization": 0.05029296875, "lambda4 Code": 0.146484375} [INFO|lh_trainer.py:331] 2026-02-16 20:19:22,303 >> {'loss': 13.5936, 'grad_norm': 1.6978358030319214, 'learning_rate': 0.00034166666666666666, 'epoch': 0.044233807266982623, 'num_input_tokens_seen': 104066772, 'completed': '14.00% (42 / 300)', 'remaining time': '12:02:16', 'throughput': '7543.23', 'gpu_mem_free': '12543MB', 'step': 42} [Step 42 / Rank 4] Tasks: ['Single QA'] | Lens: [34214] → Tgt Spa: ['0.350'] [Step 42 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26690, 26690] → Tgt Spa: ['1.000', '1.000'] [Step 42 / Rank 3] Tasks: ['Single 
QA', 'Single QA'] | Lens: [32119, 32120] → Tgt Spa: ['0.350', '0.350'] [Step 42 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [50216] → Tgt Spa: ['1.000'] [Step 42 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [50216] → Tgt Spa: ['1.000'] [Step 42 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26690, 26690] → Tgt Spa: ['1.000', '1.000'] [Step 42 / Rank 5] Tasks: ['Single QA'] | Lens: [34214] → Tgt Spa: ['0.350'] [Step 42 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32119, 32120] → Tgt Spa: ['0.350', '0.350'] [Step 42 / Rank 1] Tasks: ['Single QA'] | Lens: [61438] → Tgt Spa: ['0.350'] [Step 42 / Rank 3] Tasks: ['Summarization'] | Lens: [39657] → Tgt Spa: ['1.000'] [Step 42 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43216] → Tgt Spa: ['1.000'] [Step 42 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43216] → Tgt Spa: ['1.000'] [Step 42 / Rank 0] Tasks: ['Single QA'] | Lens: [61438] → Tgt Spa: ['0.350'] [Step 42 / Rank 4] Tasks: ['Single QA'] | Lens: [61390] → Tgt Spa: ['0.350'] [Step 42 / Rank 5] Tasks: ['Single QA'] | Lens: [61390] → Tgt Spa: ['0.350'] [Step 42 / Rank 2] Tasks: ['Summarization'] | Lens: [39657] → Tgt Spa: ['1.000'] [Step 42 / Rank 7] Tasks: ['Single QA'] | Lens: [42553] → Tgt Spa: ['0.350'] [Step 42 / Rank 5] Tasks: ['Single QA'] | Lens: [35565] → Tgt Spa: ['0.350'] [Step 42 / Rank 6] Tasks: ['Single QA'] | Lens: [42553] → Tgt Spa: ['0.350'] [Step 42 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [49009] → Tgt Spa: ['1.000'] [Step 42 / Rank 4] Tasks: ['Single QA'] | Lens: [35565] → Tgt Spa: ['0.350'] [Step 42 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [49009] → Tgt Spa: ['1.000'] [Step 42 / Rank 2] Tasks: ['Code'] | Lens: [34583] → Tgt Spa: ['1.000'] [Step 42 / Rank 3] Tasks: ['Code'] | Lens: [34583] → Tgt Spa: ['1.000'] [Step 42 / Rank 5] Tasks: ['Single QA'] | Lens: [50216] → Tgt Spa: ['0.350'] [Step 42 / Rank 4] Tasks: ['Single QA'] | Lens: [50216] → Tgt Spa: ['0.350'] [Step 42 / Rank 3] 
Tasks: ['Single QA'] | Lens: [54499] → Tgt Spa: ['0.350'] [Step 42 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32628, 32628] → Tgt Spa: ['0.350', '0.350'] [Step 42 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58304] → Tgt Spa: ['1.000'] [Step 42 / Rank 2] Tasks: ['Single QA'] | Lens: [54499] → Tgt Spa: ['0.350'] [Step 42 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32628, 32628] → Tgt Spa: ['0.350', '0.350'] [Step 42 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58304] → Tgt Spa: ['1.000'] [Step 42 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [62536] → Tgt Spa: ['1.000'] [Step 42 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60154] → Tgt Spa: ['1.000'] [Step 42 / Rank 6] Tasks: ['Single QA'] | Lens: [36169] → Tgt Spa: ['0.350'] [Step 42 / Rank 7] Tasks: ['Single QA'] | Lens: [36169] → Tgt Spa: ['0.350'] [Step 42 / Rank 2] Tasks: ['Summarization', 'Single QA'] | Lens: [25267, 25249] → Tgt Spa: ['1.000', '0.350'] [Step 42 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60154] → Tgt Spa: ['1.000'] [Step 42 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [62536] → Tgt Spa: ['1.000'] [Step 42 / Rank 3] Tasks: ['Summarization', 'Single QA'] | Lens: [25267, 25249] → Tgt Spa: ['1.000', '0.350'] [Step 42 / Rank 7] Tasks: ['Code'] | Lens: [36043] → Tgt Spa: ['1.000'] [Step 42 / Rank 4] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21594, 21612, 21610] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 42 / Rank 0] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18651, 18662, 18662] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 42 / Rank 2] Tasks: ['Single QA'] | Lens: [41932] → Tgt Spa: ['0.350'] [Step 42 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18651, 18662, 18662] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 42 / Rank 6] Tasks: ['Code'] | Lens: [36043] → Tgt Spa: ['1.000'] [Step 42 / Rank 3] Tasks: ['Single QA'] | Lens: [41932] → Tgt Spa: ['0.350'] [Step 42 / Rank 5] Tasks: 
['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21594, 21612, 21610] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:21:47,930 >> @ 42 | Loss: 2.1600 | LM: 2.0677 | Reg: 0.0923 | Spa(Avg): 0.374 [INFO|lh_trainer.py:797] 2026-02-16 20:21:47,931 >> Statistic -> Code | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.129 | [INFO|lh_trainer.py:797] 2026-02-16 20:21:47,931 >> Statistic -> In-Context | Spa: 0.387 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:21:47,931 >> Statistic -> MultiHop | Spa: 0.408 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:21:47,931 >> Statistic -> Single | Spa: 0.364 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:21:47,931 >> Statistic -> Summarization | Spa: 0.392 | Tgt: 1.000 | Z-Loss: 0.170 | [INFO|lh_trainer.py:810] 2026-02-16 20:21:47,933 >> [Micro-Log] {"loss": 2.159974131733179, "lm_loss": 2.06768033032616, "reg_loss": 0.09229380222192655, "model_sparsity(avg)": 0.37393903732299805, "Spa-In-Context Learning sparsity": 0.3874999940395355, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14653864204883577, "Spa-Single QA sparsity": 0.36408729212624685, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02944787992497108, "Spa-Code sparsity": 0.388888880610466, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12866907939314842, "Spa-Summarization sparsity": 0.3923611044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16953363083302975, "Spa-MultiHop QA sparsity": 0.4083333253860474, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.016690054337959736, "step": 42, "current_tau": 1.435786247253418, "lambda1 Single QA": 0.48828125, "lambda2 MultiHop QA": 0.2451171875, "lambda3 Summarization": 0.05078125, "lambda4 Code": 0.1474609375} [INFO|lh_trainer.py:331] 2026-02-16 20:22:02,724 >> {'loss': 12.9598, 
'grad_norm': 1.54063880443573, 'learning_rate': 0.00035, 'epoch': 0.04528699315429173, 'num_input_tokens_seen': 106478524, 'completed': '14.33% (43 / 300)', 'remaining time': '11:58:43', 'throughput': '7516.96', 'gpu_mem_free': '9941MB', 'step': 43} [Step 43 / Rank 5] Tasks: ['Single QA'] | Lens: [40215] → Tgt Spa: ['0.350'] [Step 43 / Rank 2] Tasks: ['Single QA'] | Lens: [57583] → Tgt Spa: ['0.350'] [Step 43 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [29917, 29919] → Tgt Spa: ['1.000', '1.000'] [Step 43 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [29917, 29919] → Tgt Spa: ['1.000', '1.000'] [Step 43 / Rank 3] Tasks: ['Single QA'] | Lens: [57583] → Tgt Spa: ['0.350'] [Step 43 / Rank 4] Tasks: ['Single QA'] | Lens: [40215] → Tgt Spa: ['0.350'] [Step 43 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning'] | Lens: [4819, 4812, 4812, 4832, 4813, 4814, 4814, 4833, 4815, 4815, 4834, 4821, 4816] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 43 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning'] | Lens: [4819, 4812, 4812, 4832, 4813, 4814, 4814, 4833, 4815, 4815, 4834, 4821, 4816] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 43 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [53670] → Tgt Spa: ['1.000'] [Step 43 / Rank 0] Tasks: ['Code'] | Lens: [42142] → Tgt Spa: ['1.000'] [Step 43 / Rank 4] Tasks: ['Summarization'] | Lens: [50171] → Tgt Spa: ['1.000'] [Step 43 / Rank 3] Tasks: ['Single QA'] | Lens: [43635] → Tgt 
Spa: ['0.350'] [Step 43 / Rank 5] Tasks: ['Summarization'] | Lens: [50171] → Tgt Spa: ['1.000'] [Step 43 / Rank 1] Tasks: ['Code'] | Lens: [42142] → Tgt Spa: ['1.000'] [Step 43 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [53670] → Tgt Spa: ['1.000'] [Step 43 / Rank 2] Tasks: ['Single QA'] | Lens: [43635] → Tgt Spa: ['0.350'] [Step 43 / Rank 5] Tasks: ['Single QA'] | Lens: [42539] → Tgt Spa: ['0.350'] [Step 43 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22938, 22958] → Tgt Spa: ['1.000', '1.000'] [Step 43 / Rank 2] Tasks: ['Single QA'] | Lens: [35716] → Tgt Spa: ['0.350'] [Step 43 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22938, 22958] → Tgt Spa: ['1.000', '1.000'] [Step 43 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [29551, 29552] → Tgt Spa: ['1.000', '1.000'] [Step 43 / Rank 4] Tasks: ['Single QA'] | Lens: [42539] → Tgt Spa: ['0.350'] [Step 43 / Rank 3] Tasks: ['Single QA'] | Lens: [35716] → Tgt Spa: ['0.350'] [Step 43 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [29551, 29552] → Tgt Spa: ['1.000', '1.000'] [Step 43 / Rank 2] Tasks: ['Single QA'] | Lens: [56620] → Tgt Spa: ['0.350'] [Step 43 / Rank 1] Tasks: ['Code', 'Code', 'MultiHop QA', 'Code'] | Lens: [15054, 15058, 15051, 15059] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000'] [Step 43 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [38222] → Tgt Spa: ['1.000'] [Step 43 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16679, 16669, 16682] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 43 / Rank 0] Tasks: ['Code', 'Code', 'MultiHop QA', 'Code'] | Lens: [15054, 15058, 15051, 15059] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000'] [Step 43 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38222] → Tgt Spa: ['1.000'] [Step 43 / Rank 3] Tasks: ['Single QA'] | Lens: [56620] → Tgt Spa: ['0.350'] [Step 43 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16679, 16669, 16682] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 43 / Rank 5] 
Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6070, 6070, 6070, 6071, 6071, 6072, 6072, 6072, 6072, 6073] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 43 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6070, 6070, 6070, 6071, 6071, 6072, 6072, 6072, 6072, 6073] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 43 / Rank 7] Tasks: ['Code'] | Lens: [37118] → Tgt Spa: ['1.000'] [Step 43 / Rank 6] Tasks: ['Code'] | Lens: [37118] → Tgt Spa: ['1.000'] [Step 43 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [35892] → Tgt Spa: ['1.000'] [Step 43 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55444] → Tgt Spa: ['1.000'] [Step 43 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55444] → Tgt Spa: ['1.000'] [Step 43 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [35892] → Tgt Spa: ['1.000'] [Step 43 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19224, 19237, 19226] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 43 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [16137, 16137, 16137, 16137] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 43 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [16137, 16137, 16137, 16137] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 43 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32486, 32485] → Tgt Spa: ['0.350', '0.350'] [Step 43 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32486, 32485] → Tgt Spa: ['0.350', '0.350'] [Step 43 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19224, 19237, 19226] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 43 / Rank 0] Tasks: ['In-Context Learning', 
'Single QA'] | Lens: [22562, 22564] → Tgt Spa: ['1.000', '0.350'] [Step 43 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [22562, 22564] → Tgt Spa: ['1.000', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 20:24:19,077 >> @ 43 | Loss: 1.9636 | LM: 1.8685 | Reg: 0.0951 | Spa(Avg): 0.393 [INFO|lh_trainer.py:797] 2026-02-16 20:24:19,077 >> Statistic -> Code | Spa: 0.403 | Tgt: 1.000 | Z-Loss: 0.125 | [INFO|lh_trainer.py:797] 2026-02-16 20:24:19,077 >> Statistic -> In-Context | Spa: 0.414 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:24:19,077 >> Statistic -> MultiHop | Spa: 0.392 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:24:19,077 >> Statistic -> Single | Spa: 0.384 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:24:19,077 >> Statistic -> Summarization | Spa: 0.384 | Tgt: 1.000 | Z-Loss: 0.173 | [INFO|lh_trainer.py:810] 2026-02-16 20:24:19,079 >> [Micro-Log] {"loss": 1.9636199626450737, "lm_loss": 1.868548642611131, "reg_loss": 0.09507132379803807, "model_sparsity(avg)": 0.3932291517655055, "Spa-Code sparsity": 0.402777761220932, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1251603439450264, "Spa-In-Context Learning sparsity": 0.41435183584690094, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14044340824087462, "Spa-Summarization sparsity": 0.3836805447936058, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17331606522202492, "Spa-Single QA sparsity": 0.38359786782945904, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.030871784236902993, "Spa-MultiHop QA sparsity": 0.3916666507720947, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01659884213004261, "step": 43, "current_tau": 1.4328384399414062, "lambda1 Single QA": 0.48828125, "lambda2 MultiHop QA": 0.24609375, "lambda3 Summarization": 0.051513671875, "lambda4 Code": 0.1484375} 
[INFO|lh_trainer.py:331] 2026-02-16 20:24:37,768 >> {'loss': 11.7817, 'grad_norm': 1.4603615999221802, 'learning_rate': 0.00035833333333333333, 'epoch': 0.04634017904160084, 'num_input_tokens_seen': 108918022, 'completed': '14.67% (44 / 300)', 'remaining time': '11:54:41', 'throughput': '7867.13', 'gpu_mem_free': '11979MB', 'step': 44} [Step 44 / Rank 7] Tasks: ['Single QA'] | Lens: [52262] → Tgt Spa: ['0.350'] [Step 44 / Rank 5] Tasks: ['Single QA'] | Lens: [46472] → Tgt Spa: ['0.350'] [Step 44 / Rank 6] Tasks: ['Single QA'] | Lens: [52262] → Tgt Spa: ['0.350'] [Step 44 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [31073, 31073] → Tgt Spa: ['0.350', '0.350'] [Step 44 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [41110] → Tgt Spa: ['1.000'] [Step 44 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [41110] → Tgt Spa: ['1.000'] [Step 44 / Rank 4] Tasks: ['Single QA'] | Lens: [46472] → Tgt Spa: ['0.350'] [Step 44 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [31073, 31073] → Tgt Spa: ['0.350', '0.350'] [Step 44 / Rank 1] Tasks: ['Single QA'] | Lens: [52658] → Tgt Spa: ['0.350'] [Step 44 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [37636] → Tgt Spa: ['1.000'] [Step 44 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27434, 27433] → Tgt Spa: ['1.000', '1.000'] [Step 44 / Rank 6] Tasks: ['Single QA'] | Lens: [64950] → Tgt Spa: ['0.350'] [Step 44 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [37636] → Tgt Spa: ['1.000'] [Step 44 / Rank 7] Tasks: ['Single QA'] | Lens: [64950] → Tgt Spa: ['0.350'] [Step 44 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27434, 27433] → Tgt Spa: ['1.000', '1.000'] [Step 44 / Rank 0] Tasks: ['Single QA'] | Lens: [52658] → Tgt Spa: ['0.350'] [Step 44 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [53488] → Tgt Spa: ['1.000'] [Step 44 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [29462, 29472] → Tgt Spa: ['1.000', '1.000'] [Step 44 / Rank 7] Tasks: ['Single 
QA'] | Lens: [46752] → Tgt Spa: ['0.350'] [Step 44 / Rank 0] Tasks: ['Single QA'] | Lens: [49551] → Tgt Spa: ['0.350'] [Step 44 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [29462, 29472] → Tgt Spa: ['1.000', '1.000'] [Step 44 / Rank 1] Tasks: ['Single QA'] | Lens: [49551] → Tgt Spa: ['0.350'] [Step 44 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [53488] → Tgt Spa: ['1.000'] [Step 44 / Rank 6] Tasks: ['Single QA'] | Lens: [46752] → Tgt Spa: ['0.350'] [Step 44 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42614] → Tgt Spa: ['1.000'] [Step 44 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42614] → Tgt Spa: ['1.000'] [Step 44 / Rank 2] Tasks: ['Single QA'] | Lens: [42669] → Tgt Spa: ['0.350'] [Step 44 / Rank 0] Tasks: ['Single QA'] | Lens: [56248] → Tgt Spa: ['0.350'] [Step 44 / Rank 3] Tasks: ['Single QA'] | Lens: [42669] → Tgt Spa: ['0.350'] [Step 44 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [49034] → Tgt Spa: ['1.000'] [Step 44 / Rank 1] Tasks: ['Single QA'] | Lens: [56248] → Tgt Spa: ['0.350'] [Step 44 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [49034] → Tgt Spa: ['1.000'] [Step 44 / Rank 5] Tasks: ['Single QA'] | Lens: [58609] → Tgt Spa: ['0.350'] [Step 44 / Rank 4] Tasks: ['Single QA'] | Lens: [58609] → Tgt Spa: ['0.350'] [Step 44 / Rank 6] Tasks: ['Single QA'] | Lens: [34212] → Tgt Spa: ['0.350'] [Step 44 / Rank 3] Tasks: ['Single QA'] | Lens: [64095] → Tgt Spa: ['0.350'] [Step 44 / Rank 7] Tasks: ['Single QA'] | Lens: [34212] → Tgt Spa: ['0.350'] [Step 44 / Rank 0] Tasks: ['Single QA'] | Lens: [36902] → Tgt Spa: ['0.350'] [Step 44 / Rank 2] Tasks: ['Single QA'] | Lens: [64095] → Tgt Spa: ['0.350'] [Step 44 / Rank 1] Tasks: ['Single QA'] | Lens: [36902] → Tgt Spa: ['0.350'] [Step 44 / Rank 1] Tasks: ['Single QA'] | Lens: [42638] → Tgt Spa: ['0.350'] [Step 44 / Rank 5] Tasks: ['Single QA'] | Lens: [35093] → Tgt Spa: ['0.350'] [Step 44 / Rank 2] Tasks: ['Single QA'] | Lens: [58753] → Tgt Spa: ['0.350'] [Step 44 / Rank 6] Tasks: 
['Single QA'] | Lens: [65088] → Tgt Spa: ['0.350'] [Step 44 / Rank 7] Tasks: ['Single QA'] | Lens: [65088] → Tgt Spa: ['0.350'] [Step 44 / Rank 0] Tasks: ['Single QA'] | Lens: [42638] → Tgt Spa: ['0.350'] [Step 44 / Rank 3] Tasks: ['Single QA'] | Lens: [58753] → Tgt Spa: ['0.350'] [Step 44 / Rank 4] Tasks: ['Single QA'] | Lens: [35093] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 20:27:15,335 >> @ 44 | Loss: 2.3294 | LM: 2.2684 | Reg: 0.0610 | Spa(Avg): 0.352 [INFO|lh_trainer.py:797] 2026-02-16 20:27:15,335 >> Statistic -> Code | Spa: 0.444 | Tgt: 1.000 | Z-Loss: 0.114 | [INFO|lh_trainer.py:797] 2026-02-16 20:27:15,335 >> Statistic -> In-Context | Spa: 0.358 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:27:15,335 >> Statistic -> MultiHop | Spa: 0.392 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:27:15,335 >> Statistic -> Single | Spa: 0.356 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:27:15,335 >> Statistic -> Summarization | Spa: 0.384 | Tgt: 1.000 | Z-Loss: 0.173 | [INFO|lh_trainer.py:810] 2026-02-16 20:27:15,337 >> [Micro-Log] {"loss": 2.3293969854712486, "lm_loss": 2.2683612816035748, "reg_loss": 0.061035712133161724, "model_sparsity(avg)": 0.3524305435518424, "Spa-In-Context Learning sparsity": 0.357638880610466, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15608975756913424, "Spa-Single QA sparsity": 0.3564814693397946, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.023041530667493742, "Spa-Code sparsity": 0.4444444179534912, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11381173133850098, "Spa-Summarization sparsity": 0.3836805447936058, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17331606522202492, "Spa-MultiHop QA sparsity": 0.3916666507720947, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01659884213004261, "step": 44, 
"current_tau": 1.4298349618911743, "lambda1 Single QA": 0.48828125, "lambda2 MultiHop QA": 0.24609375, "lambda3 Summarization": 0.05224609375, "lambda4 Code": 0.1484375} [INFO|lh_trainer.py:331] 2026-02-16 20:27:41,854 >> {'loss': 13.9764, 'grad_norm': 1.2922568321228027, 'learning_rate': 0.00036666666666666667, 'epoch': 0.04739336492890995, 'num_input_tokens_seen': 111331584, 'completed': '15.00% (45 / 300)', 'remaining time': '11:53:28', 'throughput': '6555.51', 'gpu_mem_free': '12153MB', 'step': 45} [Step 45 / Rank 3] Tasks: ['Single QA'] | Lens: [33559] → Tgt Spa: ['0.350'] [Step 45 / Rank 6] Tasks: ['Single QA'] | Lens: [59035] → Tgt Spa: ['0.350'] [Step 45 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Code'] | Lens: [20992, 20992, 21002] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 45 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Code'] | Lens: [20992, 20992, 21002] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 45 / Rank 0] Tasks: ['Single QA'] | Lens: [42532] → Tgt Spa: ['0.350'] [Step 45 / Rank 2] Tasks: ['Single QA'] | Lens: [33559] → Tgt Spa: ['0.350'] [Step 45 / Rank 7] Tasks: ['Single QA'] | Lens: [59035] → Tgt Spa: ['0.350'] [Step 45 / Rank 1] Tasks: ['Single QA'] | Lens: [42532] → Tgt Spa: ['0.350'] [Step 45 / Rank 6] Tasks: ['Single QA'] | Lens: [55057] → Tgt Spa: ['0.350'] [Step 45 / Rank 3] Tasks: ['Single QA', 'Code', 'Summarization'] | Lens: [17559, 17570, 17581] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 45 / Rank 5] Tasks: ['Single QA'] | Lens: [34101] → Tgt Spa: ['0.350'] [Step 45 / Rank 4] Tasks: ['Single QA'] | Lens: [34101] → Tgt Spa: ['0.350'] [Step 45 / Rank 2] Tasks: ['Single QA', 'Code', 'Summarization'] | Lens: [17559, 17570, 17581] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 45 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [16959, 16959, 16960] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 45 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [16959, 16959, 16960] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 45 / Rank 7] Tasks: 
['Single QA'] | Lens: [55057] → Tgt Spa: ['0.350'] [Step 45 / Rank 5] Tasks: ['Single QA'] | Lens: [59952] → Tgt Spa: ['0.350'] [Step 45 / Rank 0] Tasks: ['Single QA'] | Lens: [39545] → Tgt Spa: ['0.350'] [Step 45 / Rank 2] Tasks: ['Single QA'] | Lens: [58167] → Tgt Spa: ['0.350'] [Step 45 / Rank 4] Tasks: ['Single QA'] | Lens: [59952] → Tgt Spa: ['0.350'] [Step 45 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18157, 18169, 18158] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 45 / Rank 3] Tasks: ['Single QA'] | Lens: [58167] → Tgt Spa: ['0.350'] [Step 45 / Rank 1] Tasks: ['Single QA'] | Lens: [39545] → Tgt Spa: ['0.350'] [Step 45 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18157, 18169, 18158] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 45 / Rank 5] Tasks: ['Single QA'] | Lens: [65101] → Tgt Spa: ['0.350'] [Step 45 / Rank 4] Tasks: ['Single QA'] | Lens: [65101] → Tgt Spa: ['0.350'] [Step 45 / Rank 7] Tasks: ['Single QA'] | Lens: [46292] → Tgt Spa: ['0.350'] [Step 45 / Rank 3] Tasks: ['Single QA'] | Lens: [57046] → Tgt Spa: ['0.350'] [Step 45 / Rank 6] Tasks: ['Single QA'] | Lens: [46292] → Tgt Spa: ['0.350'] [Step 45 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [21877, 21878] → Tgt Spa: ['0.350', '0.350'] [Step 45 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [21877, 21878] → Tgt Spa: ['0.350', '0.350'] [Step 45 / Rank 2] Tasks: ['Single QA'] | Lens: [57046] → Tgt Spa: ['0.350'] [Step 45 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59146] → Tgt Spa: ['1.000'] [Step 45 / Rank 3] Tasks: ['Single QA'] | Lens: [61708] → Tgt Spa: ['0.350'] [Step 45 / Rank 6] Tasks: ['Single QA'] | Lens: [60326] → Tgt Spa: ['0.350'] [Step 45 / Rank 5] Tasks: ['Single QA'] | Lens: [61977] → Tgt Spa: ['0.350'] [Step 45 / Rank 7] Tasks: ['Single QA'] | Lens: [60326] → Tgt Spa: ['0.350'] [Step 45 / Rank 2] Tasks: ['Single QA'] | Lens: [61708] → Tgt Spa: ['0.350'] [Step 45 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59146] → Tgt Spa: 
['1.000'] [Step 45 / Rank 4] Tasks: ['Single QA'] | Lens: [61977] → Tgt Spa: ['0.350'] [Step 45 / Rank 5] Tasks: ['Single QA'] | Lens: [54188] → Tgt Spa: ['0.350'] [Step 45 / Rank 0] Tasks: ['Single QA'] | Lens: [34554] → Tgt Spa: ['0.350'] [Step 45 / Rank 6] Tasks: ['Code', 'Summarization'] | Lens: [24531, 24543] → Tgt Spa: ['1.000', '1.000'] [Step 45 / Rank 4] Tasks: ['Single QA'] | Lens: [54188] → Tgt Spa: ['0.350'] [Step 45 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40310] → Tgt Spa: ['1.000'] [Step 45 / Rank 1] Tasks: ['Single QA'] | Lens: [34554] → Tgt Spa: ['0.350'] [Step 45 / Rank 7] Tasks: ['Code', 'Summarization'] | Lens: [24531, 24543] → Tgt Spa: ['1.000', '1.000'] [Step 45 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40310] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:30:23,798 >> @ 45 | Loss: 2.1317 | LM: 2.0677 | Reg: 0.0640 | Spa(Avg): 0.367 [INFO|lh_trainer.py:797] 2026-02-16 20:30:23,799 >> Statistic -> Code | Spa: 0.347 | Tgt: 1.000 | Z-Loss: 0.141 | [INFO|lh_trainer.py:797] 2026-02-16 20:30:23,799 >> Statistic -> In-Context | Spa: 0.396 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:30:23,799 >> Statistic -> MultiHop | Spa: 0.392 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:30:23,799 >> Statistic -> Single | Spa: 0.368 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:30:23,799 >> Statistic -> Summarization | Spa: 0.384 | Tgt: 1.000 | Z-Loss: 0.175 | [INFO|lh_trainer.py:810] 2026-02-16 20:30:23,801 >> [Micro-Log] {"loss": 2.1317088343203068, "lm_loss": 2.0676666342963776, "reg_loss": 0.06404219117636482, "model_sparsity(avg)": 0.36747683957219124, "Spa-Single QA sparsity": 0.3683862288792928, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03332643872237809, "Spa-Code sparsity": 0.3472222164273262, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14147525373846292, "Spa-In-Context Learning sparsity": 
0.3958333134651184, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14632967859506607, "Spa-Summarization sparsity": 0.3842592438062032, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17507382730642954, "Spa-MultiHop QA sparsity": 0.3916666507720947, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01659884213004261, "step": 45, "current_tau": 1.426776647567749, "lambda1 Single QA": 0.48828125, "lambda2 MultiHop QA": 0.2470703125, "lambda3 Summarization": 0.052734375, "lambda4 Code": 0.1494140625} [INFO|lh_trainer.py:331] 2026-02-16 20:30:44,053 >> {'loss': 12.7903, 'grad_norm': 0.7605544924736023, 'learning_rate': 0.000375, 'epoch': 0.04844655081621906, 'num_input_tokens_seen': 113804550, 'completed': '15.33% (46 / 300)', 'remaining time': '11:51:59', 'throughput': '6786.44', 'gpu_mem_free': '14517MB', 'step': 46} [Step 46 / Rank 7] Tasks: ['Single QA'] | Lens: [37660] → Tgt Spa: ['0.350'] [Step 46 / Rank 0] Tasks: ['Single QA'] | Lens: [58356] → Tgt Spa: ['0.350'] [Step 46 / Rank 4] Tasks: ['Single QA'] | Lens: [64043] → Tgt Spa: ['0.350'] [Step 46 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56267] → Tgt Spa: ['1.000'] [Step 46 / Rank 6] Tasks: ['Single QA'] | Lens: [37660] → Tgt Spa: ['0.350'] [Step 46 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56267] → Tgt Spa: ['1.000'] [Step 46 / Rank 5] Tasks: ['Single QA'] | Lens: [64043] → Tgt Spa: ['0.350'] [Step 46 / Rank 1] Tasks: ['Single QA'] | Lens: [58356] → Tgt Spa: ['0.350'] [Step 46 / Rank 1] Tasks: ['Single QA'] | Lens: [56321] → Tgt Spa: ['0.350'] [Step 46 / Rank 5] Tasks: ['Code'] | Lens: [44772] → Tgt Spa: ['1.000'] [Step 46 / Rank 7] Tasks: ['Code', 'Single QA', 'Code', 'Code'] | Lens: [13504, 13512, 13533, 13537] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 46 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [34243] → Tgt Spa: ['1.000'] [Step 46 / Rank 4] Tasks: ['Code'] | Lens: [44772] 
→ Tgt Spa: ['1.000'] [Step 46 / Rank 0] Tasks: ['Single QA'] | Lens: [56321] → Tgt Spa: ['0.350'] [Step 46 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [34243] → Tgt Spa: ['1.000'] [Step 46 / Rank 6] Tasks: ['Code', 'Single QA', 'Code', 'Code'] | Lens: [13504, 13512, 13533, 13537] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 46 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [52784] → Tgt Spa: ['1.000'] [Step 46 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [52784] → Tgt Spa: ['1.000'] [Step 46 / Rank 6] Tasks: ['Single QA'] | Lens: [64661] → Tgt Spa: ['0.350'] [Step 46 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23032, 23034] → Tgt Spa: ['1.000', '1.000'] [Step 46 / Rank 3] Tasks: ['Summarization', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [4900, 4883, 4883, 4883, 4892, 4885, 4885, 4885, 4893, 4894, 4886, 4887, 4887] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 46 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23032, 23034] → Tgt Spa: ['1.000', '1.000'] [Step 46 / Rank 2] Tasks: ['Summarization', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [4900, 4883, 4883, 4883, 4892, 4885, 4885, 4885, 4893, 4894, 4886, 4887, 4887] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 46 / Rank 7] Tasks: ['Single QA'] | Lens: [64661] → Tgt Spa: ['0.350'] [Step 46 / Rank 5] Tasks: ['Single QA'] | Lens: [58620] → Tgt Spa: ['0.350'] [Step 46 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [41134] → Tgt Spa: 
['1.000'] [Step 46 / Rank 4] Tasks: ['Single QA'] | Lens: [58620] → Tgt Spa: ['0.350'] [Step 46 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [41134] → Tgt Spa: ['1.000'] [Step 46 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32064, 32064] → Tgt Spa: ['0.350', '0.350'] [Step 46 / Rank 2] Tasks: ['Single QA', 'In-Context Learning', 'Single QA'] | Lens: [21044, 21044, 21045] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 46 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32064, 32064] → Tgt Spa: ['0.350', '0.350'] [Step 46 / Rank 3] Tasks: ['Single QA', 'In-Context Learning', 'Single QA'] | Lens: [21044, 21044, 21045] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 46 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [20033, 20026, 20045] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 46 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27872, 27873] → Tgt Spa: ['1.000', '1.000'] [Step 46 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28695, 28717] → Tgt Spa: ['1.000', '1.000'] [Step 46 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [20033, 20026, 20045] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 46 / Rank 5] Tasks: ['Single QA'] | Lens: [53633] → Tgt Spa: ['0.350'] [Step 46 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27872, 27873] → Tgt Spa: ['1.000', '1.000'] [Step 46 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28695, 28717] → Tgt Spa: ['1.000', '1.000'] [Step 46 / Rank 4] Tasks: ['Single QA'] | Lens: [53633] → Tgt Spa: ['0.350'] [Step 46 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61402] → Tgt Spa: ['1.000'] [Step 46 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54239] → Tgt Spa: ['1.000'] [Step 46 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54239] → Tgt Spa: ['1.000'] [Step 46 / Rank 1] Tasks: ['Single QA'] | Lens: [40990] → Tgt Spa: ['0.350'] [Step 46 / Rank 0] Tasks: ['Single QA'] | Lens: 
[40990] → Tgt Spa: ['0.350'] [Step 46 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61402] → Tgt Spa: ['1.000'] [Step 46 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57682] → Tgt Spa: ['1.000'] [Step 46 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57682] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:33:25,626 >> @ 46 | Loss: 2.3398 | LM: 2.2492 | Reg: 0.0906 | Spa(Avg): 0.431 [INFO|lh_trainer.py:797] 2026-02-16 20:33:25,627 >> Statistic -> Code | Spa: 0.405 | Tgt: 1.000 | Z-Loss: 0.126 | [INFO|lh_trainer.py:797] 2026-02-16 20:33:25,627 >> Statistic -> In-Context | Spa: 0.451 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:33:25,627 >> Statistic -> MultiHop | Spa: 0.403 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:33:25,627 >> Statistic -> Single | Spa: 0.402 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:33:25,627 >> Statistic -> Summarization | Spa: 0.468 | Tgt: 1.000 | Z-Loss: 0.137 | [INFO|lh_trainer.py:810] 2026-02-16 20:33:25,631 >> [Micro-Log] {"loss": 2.339815972993771, "lm_loss": 2.2491791608432927, "reg_loss": 0.09063683302762608, "model_sparsity(avg)": 0.4312789315978686, "Spa-Single QA sparsity": 0.4018518408139547, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0349062483990565, "Spa-In-Context Learning sparsity": 0.451388880610466, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1327779669314623, "Spa-Code sparsity": 0.404513880610466, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12635491508990526, "Spa-Summarization sparsity": 0.46759257713953656, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13695963472127914, "Spa-MultiHop QA sparsity": 0.4027777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.014113598503172398, "step": 46, "current_tau": 1.4236645698547363, "lambda1 Single QA": 0.490234375, "lambda2 
MultiHop QA": 0.2470703125, "lambda3 Summarization": 0.053466796875, "lambda4 Code": 0.150390625} [INFO|lh_trainer.py:331] 2026-02-16 20:33:49,630 >> {'loss': 14.0389, 'grad_norm': 1.6951416730880737, 'learning_rate': 0.00038333333333333334, 'epoch': 0.049499736703528176, 'num_input_tokens_seen': 116406598, 'completed': '15.67% (47 / 300)', 'remaining time': '11:50:44', 'throughput': '7010.69', 'gpu_mem_free': '12941MB', 'step': 47} [Step 47 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'Summarization'] | Lens: [6395, 6394, 6404, 6403, 6396, 6396, 6397, 6407, 6401, 6419] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 47 / Rank 2] Tasks: ['Code'] | Lens: [42455] → Tgt Spa: ['1.000'] [Step 47 / Rank 5] Tasks: ['Single QA'] | Lens: [45651] → Tgt Spa: ['0.350'] [Step 47 / Rank 1] Tasks: ['Code'] | Lens: [33684] → Tgt Spa: ['1.000'] [Step 47 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'Summarization'] | Lens: [6395, 6394, 6404, 6403, 6396, 6396, 6397, 6407, 6401, 6419] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 47 / Rank 3] Tasks: ['Code'] | Lens: [42455] → Tgt Spa: ['1.000'] [Step 47 / Rank 0] Tasks: ['Code'] | Lens: [33684] → Tgt Spa: ['1.000'] [Step 47 / Rank 4] Tasks: ['Single QA'] | Lens: [45651] → Tgt Spa: ['0.350'] [Step 47 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43472] → Tgt Spa: ['1.000'] [Step 47 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [31280, 31285] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 5] Tasks: ['Single QA'] | Lens: [49740] → Tgt Spa: ['0.350'] [Step 47 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22803, 22803] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 3] Tasks: ['Code', 
'Code'] | Lens: [31280, 31285] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 4] Tasks: ['Single QA'] | Lens: [49740] → Tgt Spa: ['0.350'] [Step 47 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43472] → Tgt Spa: ['1.000'] [Step 47 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22803, 22803] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 6] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [19271, 19256, 19266] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 47 / Rank 3] Tasks: ['Single QA'] | Lens: [49216] → Tgt Spa: ['0.350'] [Step 47 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23835, 23835] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 2] Tasks: ['Single QA'] | Lens: [49216] → Tgt Spa: ['0.350'] [Step 47 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23835, 23835] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 7] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [19271, 19256, 19266] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 47 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15986, 15988, 15988, 15989] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 47 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15986, 15988, 15988, 15989] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 47 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24010, 23991] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 5] Tasks: ['Code'] | Lens: [38120] → Tgt Spa: ['1.000'] [Step 47 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27153, 27154] → Tgt Spa: ['0.350', '1.000'] [Step 47 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11356, 11356, 11356, 11356, 11358] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 47 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27153, 27154] → Tgt Spa: ['0.350', '1.000'] [Step 47 
/ Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11356, 11356, 11356, 11356, 11358] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 47 / Rank 4] Tasks: ['Code'] | Lens: [38120] → Tgt Spa: ['1.000'] [Step 47 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24010, 23991] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32136, 32136] → Tgt Spa: ['0.350', '0.350'] [Step 47 / Rank 3] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning'] | Lens: [3060, 3044, 3044, 3061, 3045, 3045, 3062, 3062, 3044, 3045, 3046, 3046, 3063, 3045, 3047, 3047, 3048, 3047, 3047, 3048, 3047] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 47 / Rank 4] Tasks: ['Code'] | Lens: [45157] → Tgt Spa: ['1.000'] [Step 47 / Rank 0] Tasks: ['Single QA'] | Lens: [64038] → Tgt Spa: ['0.350'] [Step 47 / Rank 2] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning'] | Lens: [3060, 3044, 3044, 3061, 3045, 3045, 3062, 3062, 3044, 3045, 3046, 3046, 3063, 3045, 3047, 3047, 3048, 3047, 3047, 3048, 3047] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', 
'0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 47 / Rank 5] Tasks: ['Code'] | Lens: [45157] → Tgt Spa: ['1.000'] [Step 47 / Rank 1] Tasks: ['Single QA'] | Lens: [64038] → Tgt Spa: ['0.350'] [Step 47 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32136, 32136] → Tgt Spa: ['0.350', '0.350'] [Step 47 / Rank 4] Tasks: ['Code'] | Lens: [34965] → Tgt Spa: ['1.000'] [Step 47 / Rank 3] Tasks: ['Single QA'] | Lens: [49987] → Tgt Spa: ['0.350'] [Step 47 / Rank 6] Tasks: ['Single QA'] | Lens: [36773] → Tgt Spa: ['0.350'] [Step 47 / Rank 7] Tasks: ['Single QA'] | Lens: [36773] → Tgt Spa: ['0.350'] [Step 47 / Rank 5] Tasks: ['Code'] | Lens: [34965] → Tgt Spa: ['1.000'] [Step 47 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25902, 25920] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25902, 25920] → Tgt Spa: ['1.000', '1.000'] [Step 47 / Rank 2] Tasks: ['Single QA'] | Lens: [49987] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 20:35:54,038 >> @ 47 | Loss: 1.9304 | LM: 1.8370 | Reg: 0.0935 | Spa(Avg): 0.414 [INFO|lh_trainer.py:797] 2026-02-16 20:35:54,039 >> Statistic -> Code | Spa: 0.404 | Tgt: 1.000 | Z-Loss: 0.127 | [INFO|lh_trainer.py:797] 2026-02-16 20:35:54,039 >> Statistic -> In-Context | Spa: 0.435 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:35:54,039 >> Statistic -> MultiHop | Spa: 0.447 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:35:54,039 >> Statistic -> Single | Spa: 0.414 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:35:54,039 >> Statistic -> Summarization | Spa: 0.424 | Tgt: 1.000 | Z-Loss: 0.155 | [INFO|lh_trainer.py:810] 2026-02-16 20:35:54,041 >> [Micro-Log] {"loss": 1.9304488723476727, "lm_loss": 1.8369596783692639, "reg_loss": 0.09348919903762483, "model_sparsity(avg)": 0.41388337314128876, "Spa-Code sparsity": 0.4040403962135315, "Spa-Code target_sparsity": 1.0, "Spa-Code 
log_z_loss": 0.12692471525885843, "Spa-In-Context Learning sparsity": 0.435185178120931, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13776387572288512, "Spa-Single QA sparsity": 0.4135100949894298, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.034568420955276284, "Spa-Summarization sparsity": 0.42438270648320514, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1554590149058236, "Spa-MultiHop QA sparsity": 0.4467592587073644, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.030833854213900242, "step": 47, "current_tau": 1.420499563217163, "lambda1 Single QA": 0.490234375, "lambda2 MultiHop QA": 0.2470703125, "lambda3 Summarization": 0.05419921875, "lambda4 Code": 0.150390625} [INFO|lh_trainer.py:331] 2026-02-16 20:36:11,794 >> {'loss': 11.5827, 'grad_norm': 1.4952235221862793, 'learning_rate': 0.0003916666666666667, 'epoch': 0.05055292259083728, 'num_input_tokens_seen': 118834762, 'completed': '16.00% (48 / 300)', 'remaining time': '11:45:37', 'throughput': '8540.05', 'gpu_mem_free': '10447MB', 'step': 48} [Step 48 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [55716] → Tgt Spa: ['1.000'] [Step 48 / Rank 5] Tasks: ['Single QA'] | Lens: [53257] → Tgt Spa: ['0.350'] [Step 48 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [25181, 25190] → Tgt Spa: ['1.000', '1.000'] [Step 48 / Rank 1] Tasks: ['Single QA'] | Lens: [37428] → Tgt Spa: ['0.350'] [Step 48 / Rank 0] Tasks: ['Single QA'] | Lens: [37428] → Tgt Spa: ['0.350'] [Step 48 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [55716] → Tgt Spa: ['1.000'] [Step 48 / Rank 4] Tasks: ['Single QA'] | Lens: [53257] → Tgt Spa: ['0.350'] [Step 48 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [25181, 25190] → Tgt Spa: ['1.000', '1.000'] [Step 48 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [27473, 27472] → Tgt Spa: ['1.000', '1.000'] [Step 48 / Rank 3] Tasks: ['Code', 'Code'] | 
Lens: [27473, 27472] → Tgt Spa: ['1.000', '1.000'] [Step 48 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16735, 16748, 16737] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 48 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [54529] → Tgt Spa: ['1.000'] [Step 48 / Rank 5] Tasks: ['Single QA'] | Lens: [40434] → Tgt Spa: ['0.350'] [Step 48 / Rank 4] Tasks: ['Single QA'] | Lens: [40434] → Tgt Spa: ['0.350'] [Step 48 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [54529] → Tgt Spa: ['1.000'] [Step 48 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16735, 16748, 16737] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 48 / Rank 4] Tasks: ['Single QA'] | Lens: [64867] → Tgt Spa: ['0.350'] [Step 48 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [16977, 16979, 16980] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 48 / Rank 5] Tasks: ['Single QA'] | Lens: [64867] → Tgt Spa: ['0.350'] [Step 48 / Rank 2] Tasks: ['Single QA'] | Lens: [36795] → Tgt Spa: ['0.350'] [Step 48 / Rank 1] Tasks: ['Single QA'] | Lens: [63702] → Tgt Spa: ['0.350'] [Step 48 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [16977, 16979, 16980] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 48 / Rank 0] Tasks: ['Single QA'] | Lens: [63702] → Tgt Spa: ['0.350'] [Step 48 / Rank 3] Tasks: ['Single QA'] | Lens: [36795] → Tgt Spa: ['0.350'] [Step 48 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [30995, 31004] → Tgt Spa: ['0.350', '0.350'] [Step 48 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24146, 24147] → Tgt Spa: ['1.000', '0.350'] [Step 48 / Rank 1] Tasks: ['Single QA'] | Lens: [46290] → Tgt Spa: ['0.350'] [Step 48 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [30995, 31004] → Tgt Spa: ['0.350', '0.350'] [Step 48 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24146, 24147] → Tgt Spa: ['1.000', '0.350'] [Step 48 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16888, 
16877, 16887] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 48 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16888, 16877, 16887] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 48 / Rank 0] Tasks: ['Single QA'] | Lens: [46290] → Tgt Spa: ['0.350'] [Step 48 / Rank 7] Tasks: ['Single QA'] | Lens: [52699] → Tgt Spa: ['0.350'] [Step 48 / Rank 4] Tasks: ['Single QA'] | Lens: [40730] → Tgt Spa: ['0.350'] [Step 48 / Rank 2] Tasks: ['Code'] | Lens: [60110] → Tgt Spa: ['1.000'] [Step 48 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [46989] → Tgt Spa: ['1.000'] [Step 48 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [46989] → Tgt Spa: ['1.000'] [Step 48 / Rank 3] Tasks: ['Code'] | Lens: [60110] → Tgt Spa: ['1.000'] [Step 48 / Rank 6] Tasks: ['Single QA'] | Lens: [52699] → Tgt Spa: ['0.350'] [Step 48 / Rank 5] Tasks: ['Single QA'] | Lens: [40730] → Tgt Spa: ['0.350'] [Step 48 / Rank 4] Tasks: ['Single QA'] | Lens: [39752] → Tgt Spa: ['0.350'] [Step 48 / Rank 3] Tasks: ['Single QA'] | Lens: [49073] → Tgt Spa: ['0.350'] [Step 48 / Rank 5] Tasks: ['Single QA'] | Lens: [39752] → Tgt Spa: ['0.350'] [Step 48 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [23885, 23875] → Tgt Spa: ['1.000', '1.000'] [Step 48 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [22803, 22796] → Tgt Spa: ['1.000', '1.000'] [Step 48 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [23885, 23875] → Tgt Spa: ['1.000', '1.000'] [Step 48 / Rank 2] Tasks: ['Single QA'] | Lens: [49073] → Tgt Spa: ['0.350'] [Step 48 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [22803, 22796] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:38:39,073 >> @ 48 | Loss: 2.0954 | LM: 2.0098 | Reg: 0.0856 | Spa(Avg): 0.439 [INFO|lh_trainer.py:797] 2026-02-16 20:38:39,074 >> Statistic -> Code | Spa: 0.446 | Tgt: 1.000 | Z-Loss: 0.116 | [INFO|lh_trainer.py:797] 2026-02-16 20:38:39,074 >> Statistic -> In-Context | Spa: 0.437 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-16 20:38:39,074 >> Statistic -> MultiHop | Spa: 0.447 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:38:39,074 >> Statistic -> Single | Spa: 0.435 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:38:39,074 >> Statistic -> Summarization | Spa: 0.462 | Tgt: 1.000 | Z-Loss: 0.140 | [INFO|lh_trainer.py:810] 2026-02-16 20:38:39,076 >> [Micro-Log] {"loss": 2.095376634349426, "lm_loss": 2.00981611572206, "reg_loss": 0.08556050283368677, "model_sparsity(avg)": 0.4392360982795556, "Spa-Single QA sparsity": 0.43452379533222746, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.043514695683760304, "Spa-In-Context Learning sparsity": 0.4374999801317851, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13789334148168564, "Spa-Summarization sparsity": 0.4623015948704311, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1398520448378154, "Spa-Code sparsity": 0.44598764181137085, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11620627178086175, "Spa-MultiHop QA sparsity": 0.4467592587073644, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.030833854213900242, "step": 48, "current_tau": 1.4172825813293457, "lambda1 Single QA": 0.490234375, "lambda2 MultiHop QA": 0.248046875, "lambda3 Summarization": 0.054931640625, "lambda4 Code": 0.1513671875} [INFO|lh_trainer.py:331] 2026-02-16 20:38:56,291 >> {'loss': 12.5723, 'grad_norm': 1.1292210817337036, 'learning_rate': 0.0004, 'epoch': 0.051606108478146395, 'num_input_tokens_seen': 121241054, 'completed': '16.33% (49 / 300)', 'remaining time': '11:42:31', 'throughput': '7314.08', 'gpu_mem_free': '12369MB', 'step': 49} [Step 49 / Rank 4] Tasks: ['Code', 'Code', 'Single QA', 'Summarization', 'Code', 'Code', 'Single QA', 'MultiHop QA'] | Lens: [7778, 7782, 7775, 7794, 7783, 7782, 7775, 7776] → Tgt Spa: ['1.000', '1.000', '0.350', 
'1.000', '1.000', '1.000', '0.350', '0.350'] [Step 49 / Rank 5] Tasks: ['Code', 'Code', 'Single QA', 'Summarization', 'Code', 'Code', 'Single QA', 'MultiHop QA'] | Lens: [7778, 7782, 7775, 7794, 7783, 7782, 7775, 7776] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 49 / Rank 2] Tasks: ['Single QA'] | Lens: [38385] → Tgt Spa: ['0.350'] [Step 49 / Rank 3] Tasks: ['Single QA'] | Lens: [38385] → Tgt Spa: ['0.350'] [Step 49 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [32635, 32643] → Tgt Spa: ['0.350', '1.000'] [Step 49 / Rank 0] Tasks: ['Code'] | Lens: [60647] → Tgt Spa: ['1.000'] [Step 49 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [32635, 32643] → Tgt Spa: ['0.350', '1.000'] [Step 49 / Rank 1] Tasks: ['Code'] | Lens: [60647] → Tgt Spa: ['1.000'] [Step 49 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [24151, 24161] → Tgt Spa: ['0.350', '1.000'] [Step 49 / Rank 7] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning'] | Lens: [5815, 5798, 5797, 5798, 5806, 5806, 5799, 5800, 5809, 5809, 5803] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 49 / Rank 6] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning'] | Lens: [5815, 5798, 5797, 5798, 5806, 5806, 5799, 5800, 5809, 5809, 5803] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 49 / Rank 1] Tasks: ['Summarization', 'Summarization'] | Lens: [30792, 30797] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 5] Tasks: ['Summarization'] | Lens: [49915] → Tgt Spa: ['1.000'] [Step 49 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [24151, 24161] → Tgt Spa: ['0.350', '1.000'] [Step 49 / Rank 4] 
Tasks: ['Summarization'] | Lens: [49915] → Tgt Spa: ['1.000'] [Step 49 / Rank 0] Tasks: ['Summarization', 'Summarization'] | Lens: [30792, 30797] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32217, 32217] → Tgt Spa: ['0.350', '0.350'] [Step 49 / Rank 3] Tasks: ['Single QA'] | Lens: [50219] → Tgt Spa: ['0.350'] [Step 49 / Rank 2] Tasks: ['Single QA'] | Lens: [50219] → Tgt Spa: ['0.350'] [Step 49 / Rank 5] Tasks: ['Single QA'] | Lens: [53807] → Tgt Spa: ['0.350'] [Step 49 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [23947, 23955] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32217, 32217] → Tgt Spa: ['0.350', '0.350'] [Step 49 / Rank 4] Tasks: ['Single QA'] | Lens: [53807] → Tgt Spa: ['0.350'] [Step 49 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [23947, 23955] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 5] Tasks: ['Single QA'] | Lens: [49755] → Tgt Spa: ['0.350'] [Step 49 / Rank 4] Tasks: ['Single QA'] | Lens: [49755] → Tgt Spa: ['0.350'] [Step 49 / Rank 0] Tasks: ['Code'] | Lens: [36304] → Tgt Spa: ['1.000'] [Step 49 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25589, 25590] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25589, 25590] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 1] Tasks: ['Code'] | Lens: [36304] → Tgt Spa: ['1.000'] [Step 49 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27759, 27760] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27759, 27760] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [24262, 24264] → Tgt Spa: ['0.350', '0.350'] [Step 49 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [24039, 24048] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 5] Tasks: ['Single QA'] | Lens: [39788] → Tgt 
Spa: ['0.350'] [Step 49 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24108, 24109] → Tgt Spa: ['0.350', '0.350'] [Step 49 / Rank 4] Tasks: ['Single QA'] | Lens: [39788] → Tgt Spa: ['0.350'] [Step 49 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [24039, 24048] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24108, 24109] → Tgt Spa: ['0.350', '0.350'] [Step 49 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [24262, 24264] → Tgt Spa: ['0.350', '0.350'] [Step 49 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43989] → Tgt Spa: ['1.000'] [Step 49 / Rank 1] Tasks: ['Single QA', 'Summarization', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [10494, 10517, 10508, 10516, 10513, 10523] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 49 / Rank 7] Tasks: ['Code'] | Lens: [35674] → Tgt Spa: ['1.000'] [Step 49 / Rank 6] Tasks: ['Code'] | Lens: [35674] → Tgt Spa: ['1.000'] [Step 49 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43989] → Tgt Spa: ['1.000'] [Step 49 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24980, 24981] → Tgt Spa: ['1.000', '1.000'] [Step 49 / Rank 0] Tasks: ['Single QA', 'Summarization', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [10494, 10517, 10508, 10516, 10513, 10523] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 49 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24980, 24981] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:41:00,462 >> @ 49 | Loss: 1.9246 | LM: 1.8296 | Reg: 0.0950 | Spa(Avg): 0.434 [INFO|lh_trainer.py:797] 2026-02-16 20:41:00,462 >> Statistic -> Code | Spa: 0.406 | Tgt: 1.000 | Z-Loss: 0.128 | [INFO|lh_trainer.py:797] 2026-02-16 20:41:00,462 >> Statistic -> In-Context | Spa: 0.450 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:41:00,462 >> Statistic -> MultiHop | Spa: 0.403 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-16 20:41:00,462 >> Statistic -> Single | Spa: 0.436 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:41:00,462 >> Statistic -> Summarization | Spa: 0.426 | Tgt: 1.000 | Z-Loss: 0.156 | [INFO|lh_trainer.py:810] 2026-02-16 20:41:00,464 >> [Micro-Log] {"loss": 1.9246102410058181, "lm_loss": 1.829563045874238, "reg_loss": 0.09504719202717145, "model_sparsity(avg)": 0.43421628947059315, "Spa-Code sparsity": 0.40604574540082145, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12789258974439957, "Spa-Summarization sparsity": 0.42592592040697735, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1559425617257754, "Spa-In-Context Learning sparsity": 0.45039681451661245, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13474751902478083, "Spa-Single QA sparsity": 0.4364035066805388, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.044875022397670696, "Spa-MultiHop QA sparsity": 0.4027777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01417104247957468, "step": 49, "current_tau": 1.4140148162841797, "lambda1 Single QA": 0.4921875, "lambda2 MultiHop QA": 0.248046875, "lambda3 Summarization": 0.0556640625, "lambda4 Code": 0.15234375} [INFO|lh_trainer.py:331] 2026-02-16 20:41:14,843 >> {'loss': 11.5477, 'grad_norm': 1.3782143592834473, 'learning_rate': 0.00040833333333333336, 'epoch': 0.0526592943654555, 'num_input_tokens_seen': 123714340, 'completed': '16.67% (50 / 300)', 'remaining time': '11:37:16', 'throughput': '8925.45', 'gpu_mem_free': '8179MB', 'step': 50} [Step 50 / Rank 6] Tasks: ['Code'] | Lens: [39397] → Tgt Spa: ['1.000'] [Step 50 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17150, 17140, 17141] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 50 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17150, 17140, 17141] → Tgt Spa: ['1.000', '1.000', 
'1.000'] [Step 50 / Rank 4] Tasks: ['Code', 'Code', 'Single QA', 'Single QA'] | Lens: [15809, 15810, 15804, 15804] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350'] [Step 50 / Rank 5] Tasks: ['Code', 'Code', 'Single QA', 'Single QA'] | Lens: [15809, 15810, 15804, 15804] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350'] [Step 50 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32137, 32137] → Tgt Spa: ['0.350', '0.350'] [Step 50 / Rank 7] Tasks: ['Code'] | Lens: [39397] → Tgt Spa: ['1.000'] [Step 50 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32137, 32137] → Tgt Spa: ['0.350', '0.350'] [Step 50 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32067, 32067] → Tgt Spa: ['0.350', '0.350'] [Step 50 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 50 / Rank 7] Tasks: ['Single QA'] | Lens: [51698] → Tgt Spa: ['0.350'] [Step 50 / Rank 4] Tasks: ['Single QA', 'Code'] | Lens: [31846, 31857] → Tgt Spa: ['0.350', '1.000'] [Step 50 / Rank 5] Tasks: ['Single QA', 'Code'] | Lens: [31846, 31857] → Tgt Spa: ['0.350', '1.000'] [Step 50 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32067, 32067] → Tgt Spa: ['0.350', '0.350'] [Step 50 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 50 / Rank 6] Tasks: ['Single QA'] | Lens: [51698] → Tgt Spa: ['0.350'] [Step 50 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [32448, 32449] → Tgt Spa: ['0.350', '1.000'] [Step 50 / Rank 0] Tasks: ['Code', 'Summarization', 'In-Context Learning'] | Lens: [21800, 21814, 21796] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 50 / Rank 2] Tasks: ['Code'] | Lens: [37932] → Tgt Spa: ['1.000'] [Step 50 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [32448, 32449] → Tgt Spa: ['0.350', '1.000'] [Step 50 / Rank 6] Tasks: ['Single QA'] | Lens: [39851] → Tgt Spa: ['0.350'] [Step 50 / Rank 3] Tasks: ['Code'] | Lens: [37932] → Tgt Spa: ['1.000'] [Step 50 / Rank 1] Tasks: ['Code', 'Summarization', 'In-Context Learning'] 
| Lens: [21800, 21814, 21796] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 50 / Rank 7] Tasks: ['Single QA'] | Lens: [39851] → Tgt Spa: ['0.350'] [Step 50 / Rank 5] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17878, 17888, 17889] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 50 / Rank 3] Tasks: ['Code', 'Single QA'] | Lens: [32367, 32360] → Tgt Spa: ['1.000', '0.350'] [Step 50 / Rank 6] Tasks: ['Single QA'] | Lens: [54059] → Tgt Spa: ['0.350'] [Step 50 / Rank 2] Tasks: ['Code', 'Single QA'] | Lens: [32367, 32360] → Tgt Spa: ['1.000', '0.350'] [Step 50 / Rank 7] Tasks: ['Single QA'] | Lens: [54059] → Tgt Spa: ['0.350'] [Step 50 / Rank 4] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17878, 17888, 17889] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 50 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26359, 26342] → Tgt Spa: ['1.000', '1.000'] [Step 50 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26359, 26342] → Tgt Spa: ['1.000', '1.000'] [Step 50 / Rank 5] Tasks: ['Single QA'] | Lens: [49604] → Tgt Spa: ['0.350'] [Step 50 / Rank 4] Tasks: ['Single QA'] | Lens: [49604] → Tgt Spa: ['0.350'] [Step 50 / Rank 3] Tasks: ['Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [6773, 6776, 6777, 6777, 6784, 6778, 6787, 6782, 6790] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 50 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25441, 25441] → Tgt Spa: ['0.350', '1.000'] [Step 50 / Rank 2] Tasks: ['Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [6773, 6776, 6777, 6777, 6784, 6778, 6787, 6782, 6790] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 50 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25441, 25441] → Tgt Spa: ['0.350', '1.000'] [Step 
50 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32738, 32741] → Tgt Spa: ['0.350', '0.350'] [Step 50 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32738, 32741] → Tgt Spa: ['0.350', '0.350'] [Step 50 / Rank 4] Tasks: ['Single QA'] | Lens: [65093] → Tgt Spa: ['0.350'] [Step 50 / Rank 7] Tasks: ['Code'] | Lens: [35987] → Tgt Spa: ['1.000'] [Step 50 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20073, 20073, 20062] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 50 / Rank 5] Tasks: ['Single QA'] | Lens: [65093] → Tgt Spa: ['0.350'] [Step 50 / Rank 6] Tasks: ['Code'] | Lens: [35987] → Tgt Spa: ['1.000'] [Step 50 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57820] → Tgt Spa: ['1.000'] [Step 50 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20073, 20073, 20062] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 50 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57820] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:43:36,046 >> @ 50 | Loss: 1.9165 | LM: 1.8256 | Reg: 0.0910 | Spa(Avg): 0.407 [INFO|lh_trainer.py:797] 2026-02-16 20:43:36,046 >> Statistic -> Code | Spa: 0.411 | Tgt: 1.000 | Z-Loss: 0.128 | [INFO|lh_trainer.py:797] 2026-02-16 20:43:36,046 >> Statistic -> In-Context | Spa: 0.394 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:43:36,046 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:43:36,046 >> Statistic -> Single | Spa: 0.410 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:43:36,046 >> Statistic -> Summarization | Spa: 0.435 | Tgt: 1.000 | Z-Loss: 0.157 | [INFO|lh_trainer.py:810] 2026-02-16 20:43:36,048 >> [Micro-Log] {"loss": 1.9165329795020323, "lm_loss": 1.825573921164808, "reg_loss": 0.09095906775231317, "model_sparsity(avg)": 0.40718235696355504, "Spa-Single QA sparsity": 0.4103535305369984, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 
0.03873814031248912, "Spa-Code sparsity": 0.41111110846201576, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1279567912220955, "Spa-Summarization sparsity": 0.4345238038471767, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15653429393257415, "Spa-In-Context Learning sparsity": 0.39351850748062134, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15075128028790155, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 50, "current_tau": 1.4106969833374023, "lambda1 Single QA": 0.4921875, "lambda2 MultiHop QA": 0.2490234375, "lambda3 Summarization": 0.056396484375, "lambda4 Code": 0.1533203125} [INFO|lh_trainer.py:331] 2026-02-16 20:44:02,947 >> {'loss': 11.4992, 'grad_norm': 1.3731465339660645, 'learning_rate': 0.0004166666666666667, 'epoch': 0.053712480252764615, 'num_input_tokens_seen': 126399394, 'completed': '17.00% (51 / 300)', 'remaining time': '11:34:33', 'throughput': '7986.31', 'gpu_mem_free': '7377MB', 'step': 51} [Step 51 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26750, 26770] → Tgt Spa: ['1.000', '1.000'] [Step 51 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24212, 24214] → Tgt Spa: ['1.000', '1.000'] [Step 51 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29225, 29226] → Tgt Spa: ['0.350', '0.350'] [Step 51 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26750, 26770] → Tgt Spa: ['1.000', '1.000'] [Step 51 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [24887, 24887] → Tgt Spa: ['0.350', '0.350'] [Step 51 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29225, 29226] → Tgt Spa: ['0.350', '0.350'] [Step 51 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24212, 24214] → Tgt Spa: ['1.000', '1.000'] [Step 51 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [24887, 
24887] → Tgt Spa: ['0.350', '0.350'] [Step 51 / Rank 5] Tasks: ['Code'] | Lens: [48539] → Tgt Spa: ['1.000'] [Step 51 / Rank 1] Tasks: ['Single QA'] | Lens: [64694] → Tgt Spa: ['0.350'] [Step 51 / Rank 3] Tasks: ['Single QA'] | Lens: [51363] → Tgt Spa: ['0.350'] [Step 51 / Rank 4] Tasks: ['Code'] | Lens: [48539] → Tgt Spa: ['1.000'] [Step 51 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39920] → Tgt Spa: ['1.000'] [Step 51 / Rank 2] Tasks: ['Single QA'] | Lens: [51363] → Tgt Spa: ['0.350'] [Step 51 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39920] → Tgt Spa: ['1.000'] [Step 51 / Rank 0] Tasks: ['Single QA'] | Lens: [64694] → Tgt Spa: ['0.350'] [Step 51 / Rank 5] Tasks: ['Single QA'] | Lens: [49234] → Tgt Spa: ['0.350'] [Step 51 / Rank 7] Tasks: ['Code'] | Lens: [44120] → Tgt Spa: ['1.000'] [Step 51 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [65195] → Tgt Spa: ['1.000'] [Step 51 / Rank 4] Tasks: ['Single QA'] | Lens: [49234] → Tgt Spa: ['0.350'] [Step 51 / Rank 6] Tasks: ['Code'] | Lens: [44120] → Tgt Spa: ['1.000'] [Step 51 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22651, 22671] → Tgt Spa: ['1.000', '1.000'] [Step 51 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [65195] → Tgt Spa: ['1.000'] [Step 51 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22651, 22671] → Tgt Spa: ['1.000', '1.000'] [Step 51 / Rank 2] Tasks: ['Single QA'] | Lens: [52410] → Tgt Spa: ['0.350'] [Step 51 / Rank 4] Tasks: ['Single QA'] | Lens: [54046] → Tgt Spa: ['0.350'] [Step 51 / Rank 5] Tasks: ['Single QA'] | Lens: [54046] → Tgt Spa: ['0.350'] [Step 51 / Rank 0] Tasks: ['Single QA'] | Lens: [50476] → Tgt Spa: ['0.350'] [Step 51 / Rank 6] Tasks: ['Single QA'] | Lens: [64823] → Tgt Spa: ['0.350'] [Step 51 / Rank 7] Tasks: ['Single QA'] | Lens: [64823] → Tgt Spa: ['0.350'] [Step 51 / Rank 1] Tasks: ['Single QA'] | Lens: [50476] → Tgt Spa: ['0.350'] [Step 51 / Rank 3] Tasks: ['Single QA'] | Lens: [52410] → Tgt Spa: ['0.350'] [Step 51 
/ Rank 6] Tasks: ['Code'] | Lens: [34812] → Tgt Spa: ['1.000'] [Step 51 / Rank 4] Tasks: ['Single QA'] | Lens: [34550] → Tgt Spa: ['0.350'] [Step 51 / Rank 7] Tasks: ['Code'] | Lens: [34812] → Tgt Spa: ['1.000'] [Step 51 / Rank 0] Tasks: ['Single QA'] | Lens: [55824] → Tgt Spa: ['0.350'] [Step 51 / Rank 5] Tasks: ['Single QA'] | Lens: [34550] → Tgt Spa: ['0.350'] [Step 51 / Rank 1] Tasks: ['Single QA'] | Lens: [55824] → Tgt Spa: ['0.350'] [Step 51 / Rank 3] Tasks: ['Code'] | Lens: [57855] → Tgt Spa: ['1.000'] [Step 51 / Rank 2] Tasks: ['Code'] | Lens: [57855] → Tgt Spa: ['1.000'] [Step 51 / Rank 5] Tasks: ['Single QA'] | Lens: [56507] → Tgt Spa: ['0.350'] [Step 51 / Rank 7] Tasks: ['Single QA'] | Lens: [37337] → Tgt Spa: ['0.350'] [Step 51 / Rank 1] Tasks: ['Single QA'] | Lens: [34339] → Tgt Spa: ['0.350'] [Step 51 / Rank 6] Tasks: ['Single QA'] | Lens: [37337] → Tgt Spa: ['0.350'] [Step 51 / Rank 4] Tasks: ['Single QA'] | Lens: [56507] → Tgt Spa: ['0.350'] [Step 51 / Rank 3] Tasks: ['Single QA'] | Lens: [56622] → Tgt Spa: ['0.350'] [Step 51 / Rank 0] Tasks: ['Single QA'] | Lens: [34339] → Tgt Spa: ['0.350'] [Step 51 / Rank 2] Tasks: ['Single QA'] | Lens: [56622] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 20:46:43,419 >> @ 51 | Loss: 2.1544 | LM: 2.0779 | Reg: 0.0765 | Spa(Avg): 0.410 [INFO|lh_trainer.py:797] 2026-02-16 20:46:43,420 >> Statistic -> Code | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.134 | [INFO|lh_trainer.py:797] 2026-02-16 20:46:43,420 >> Statistic -> In-Context | Spa: 0.412 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:46:43,420 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:46:43,420 >> Statistic -> Single | Spa: 0.423 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:46:43,420 >> Statistic -> Summarization | Spa: 0.451 | Tgt: 1.000 | Z-Loss: 0.145 | [INFO|lh_trainer.py:810] 2026-02-16 20:46:43,422 >> [Micro-Log] {"loss": 
2.1543650714059672, "lm_loss": 2.0778846529622874, "reg_loss": 0.07648040630253188, "model_sparsity(avg)": 0.4103009117146333, "Spa-Single QA sparsity": 0.42320260230232687, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03964666622992167, "Spa-In-Context Learning sparsity": 0.41203702489535016, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14669708659251532, "Spa-Summarization sparsity": 0.4513888657093048, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14481288194656372, "Spa-Code sparsity": 0.3888888657093048, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1339711882174015, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 51, "current_tau": 1.40733003616333, "lambda1 Single QA": 0.4921875, "lambda2 MultiHop QA": 0.2490234375, "lambda3 Summarization": 0.05712890625, "lambda4 Code": 0.1533203125} [INFO|lh_trainer.py:331] 2026-02-16 20:47:04,852 >> {'loss': 12.9262, 'grad_norm': 1.0577400922775269, 'learning_rate': 0.000425, 'epoch': 0.05476566614007372, 'num_input_tokens_seen': 128815712, 'completed': '17.33% (52 / 300)', 'remaining time': '11:32:55', 'throughput': '6641.68', 'gpu_mem_free': '14767MB', 'step': 52} [Step 52 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57260] → Tgt Spa: ['1.000'] [Step 52 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57260] → Tgt Spa: ['1.000'] [Step 52 / Rank 1] Tasks: ['Single QA'] | Lens: [64225] → Tgt Spa: ['0.350'] [Step 52 / Rank 6] Tasks: ['Single QA'] | Lens: [40304] → Tgt Spa: ['0.350'] [Step 52 / Rank 0] Tasks: ['Single QA'] | Lens: [64225] → Tgt Spa: ['0.350'] [Step 52 / Rank 3] Tasks: ['Single QA'] | Lens: [58640] → Tgt Spa: ['0.350'] [Step 52 / Rank 2] Tasks: ['Single QA'] | Lens: [58640] → Tgt Spa: ['0.350'] [Step 52 / Rank 7] Tasks: ['Single QA'] | Lens: [40304] → Tgt Spa: ['0.350'] [Step 52 / 
Rank 4] Tasks: ['In-Context Learning'] | Lens: [39831] → Tgt Spa: ['1.000'] [Step 52 / Rank 7] Tasks: ['Code'] | Lens: [64600] → Tgt Spa: ['1.000'] [Step 52 / Rank 1] Tasks: ['Single QA'] | Lens: [41260] → Tgt Spa: ['0.350'] [Step 52 / Rank 6] Tasks: ['Code'] | Lens: [64600] → Tgt Spa: ['1.000'] [Step 52 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39831] → Tgt Spa: ['1.000'] [Step 52 / Rank 3] Tasks: ['Single QA'] | Lens: [60603] → Tgt Spa: ['0.350'] [Step 52 / Rank 0] Tasks: ['Single QA'] | Lens: [41260] → Tgt Spa: ['0.350'] [Step 52 / Rank 2] Tasks: ['Single QA'] | Lens: [60603] → Tgt Spa: ['0.350'] [Step 52 / Rank 7] Tasks: ['Single QA'] | Lens: [45691] → Tgt Spa: ['0.350'] [Step 52 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [26858, 26861] → Tgt Spa: ['0.350', '0.350'] [Step 52 / Rank 6] Tasks: ['Single QA'] | Lens: [45691] → Tgt Spa: ['0.350'] [Step 52 / Rank 4] Tasks: ['Single QA'] | Lens: [43292] → Tgt Spa: ['0.350'] [Step 52 / Rank 5] Tasks: ['Single QA'] | Lens: [43292] → Tgt Spa: ['0.350'] [Step 52 / Rank 0] Tasks: ['Single QA'] | Lens: [55747] → Tgt Spa: ['0.350'] [Step 52 / Rank 1] Tasks: ['Single QA'] | Lens: [55747] → Tgt Spa: ['0.350'] [Step 52 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [26858, 26861] → Tgt Spa: ['0.350', '0.350'] [Step 52 / Rank 2] Tasks: ['Summarization'] | Lens: [42656] → Tgt Spa: ['1.000'] [Step 52 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [29098, 29105] → Tgt Spa: ['1.000', '1.000'] [Step 52 / Rank 1] Tasks: ['Code'] | Lens: [44078] → Tgt Spa: ['1.000'] [Step 52 / Rank 6] Tasks: ['Summarization', 'Summarization'] | Lens: [26338, 26341] → Tgt Spa: ['1.000', '1.000'] [Step 52 / Rank 7] Tasks: ['Summarization', 'Summarization'] | Lens: [26338, 26341] → Tgt Spa: ['1.000', '1.000'] [Step 52 / Rank 0] Tasks: ['Code'] | Lens: [44078] → Tgt Spa: ['1.000'] [Step 52 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [29098, 29105] → Tgt Spa: ['1.000', '1.000'] [Step 52 / Rank 3] Tasks: ['Summarization'] | Lens: [42656] → Tgt 
Spa: ['1.000'] [Step 52 / Rank 5] Tasks: ['Single QA'] | Lens: [39807] → Tgt Spa: ['0.350'] [Step 52 / Rank 2] Tasks: ['Single QA'] | Lens: [35788] → Tgt Spa: ['0.350'] [Step 52 / Rank 0] Tasks: ['Code'] | Lens: [59254] → Tgt Spa: ['1.000'] [Step 52 / Rank 4] Tasks: ['Single QA'] | Lens: [39807] → Tgt Spa: ['0.350'] [Step 52 / Rank 3] Tasks: ['Single QA'] | Lens: [35788] → Tgt Spa: ['0.350'] [Step 52 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32313, 32313] → Tgt Spa: ['0.350', '0.350'] [Step 52 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32313, 32313] → Tgt Spa: ['0.350', '0.350'] [Step 52 / Rank 1] Tasks: ['Code'] | Lens: [59254] → Tgt Spa: ['1.000'] [Step 52 / Rank 5] Tasks: ['Code'] | Lens: [33584] → Tgt Spa: ['1.000'] [Step 52 / Rank 6] Tasks: ['Single QA'] | Lens: [47660] → Tgt Spa: ['0.350'] [Step 52 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [46588] → Tgt Spa: ['1.000'] [Step 52 / Rank 7] Tasks: ['Single QA'] | Lens: [47660] → Tgt Spa: ['0.350'] [Step 52 / Rank 0] Tasks: ['Single QA'] | Lens: [64040] → Tgt Spa: ['0.350'] [Step 52 / Rank 4] Tasks: ['Code'] | Lens: [33584] → Tgt Spa: ['1.000'] [Step 52 / Rank 1] Tasks: ['Single QA'] | Lens: [64040] → Tgt Spa: ['0.350'] [Step 52 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [46588] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:49:37,343 >> @ 52 | Loss: 1.8695 | LM: 1.7968 | Reg: 0.0727 | Spa(Avg): 0.450 [INFO|lh_trainer.py:797] 2026-02-16 20:49:37,343 >> Statistic -> Code | Spa: 0.491 | Tgt: 1.000 | Z-Loss: 0.107 | [INFO|lh_trainer.py:797] 2026-02-16 20:49:37,343 >> Statistic -> In-Context | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:49:37,343 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:49:37,343 >> Statistic -> Single | Spa: 0.423 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:49:37,343 >> Statistic -> Summarization | Spa: 0.514 | Tgt: 1.000 | 
Z-Loss: 0.117 | [INFO|lh_trainer.py:810] 2026-02-16 20:49:37,345 >> [Micro-Log] {"loss": 1.869533968468507, "lm_loss": 1.7968221722791593, "reg_loss": 0.07271179449162446, "model_sparsity(avg)": 0.449942114452521, "Spa-Single QA sparsity": 0.4227430410683155, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03919750534987543, "Spa-Code sparsity": 0.49074073632558185, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10678973297278087, "Spa-Summarization sparsity": 0.5138888955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1172510435183843, "Spa-In-Context Learning sparsity": 0.4583333134651184, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13509422789017358, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 52, "current_tau": 1.4039154052734375, "lambda1 Single QA": 0.4921875, "lambda2 MultiHop QA": 0.25, "lambda3 Summarization": 0.057861328125, "lambda4 Code": 0.154296875} [INFO|lh_trainer.py:331] 2026-02-16 20:50:03,526 >> {'loss': 11.2172, 'grad_norm': 0.9413532018661499, 'learning_rate': 0.00043333333333333337, 'epoch': 0.055818852027382834, 'num_input_tokens_seen': 131243982, 'completed': '17.67% (53 / 300)', 'remaining time': '11:30:59', 'throughput': '6795.28', 'gpu_mem_free': '4757MB', 'step': 53} [Step 53 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42068] → Tgt Spa: ['1.000'] [Step 53 / Rank 3] Tasks: ['Single QA'] | Lens: [53235] → Tgt Spa: ['0.350'] [Step 53 / Rank 2] Tasks: ['Single QA'] | Lens: [53235] → Tgt Spa: ['0.350'] [Step 53 / Rank 7] Tasks: ['Single QA'] | Lens: [33668] → Tgt Spa: ['0.350'] [Step 53 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [31103, 31103] → Tgt Spa: ['0.350', '0.350'] [Step 53 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42068] → Tgt Spa: ['1.000'] [Step 53 / Rank 6] Tasks: ['Single QA'] | 
Lens: [33668] → Tgt Spa: ['0.350'] [Step 53 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [31103, 31103] → Tgt Spa: ['0.350', '0.350'] [Step 53 / Rank 7] Tasks: ['Single QA'] | Lens: [55860] → Tgt Spa: ['0.350'] [Step 53 / Rank 1] Tasks: ['Single QA'] | Lens: [40023] → Tgt Spa: ['0.350'] [Step 53 / Rank 5] Tasks: ['Single QA'] | Lens: [65458] → Tgt Spa: ['0.350'] [Step 53 / Rank 0] Tasks: ['Single QA'] | Lens: [40023] → Tgt Spa: ['0.350'] [Step 53 / Rank 6] Tasks: ['Single QA'] | Lens: [55860] → Tgt Spa: ['0.350'] [Step 53 / Rank 4] Tasks: ['Single QA'] | Lens: [65458] → Tgt Spa: ['0.350'] [Step 53 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [62810] → Tgt Spa: ['1.000'] [Step 53 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [62810] → Tgt Spa: ['1.000'] [Step 53 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56802] → Tgt Spa: ['1.000'] [Step 53 / Rank 0] Tasks: ['Code'] | Lens: [33485] → Tgt Spa: ['1.000'] [Step 53 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56802] → Tgt Spa: ['1.000'] [Step 53 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25917, 25919] → Tgt Spa: ['1.000', '1.000'] [Step 53 / Rank 1] Tasks: ['Code'] | Lens: [33485] → Tgt Spa: ['1.000'] [Step 53 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58320] → Tgt Spa: ['1.000'] [Step 53 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58320] → Tgt Spa: ['1.000'] [Step 53 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25917, 25919] → Tgt Spa: ['1.000', '1.000'] [Step 53 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40795] → Tgt Spa: ['1.000'] [Step 53 / Rank 2] Tasks: ['Summarization'] | Lens: [35296] → Tgt Spa: ['1.000'] [Step 53 / Rank 0] Tasks: ['Code'] | Lens: [41981] → Tgt Spa: ['1.000'] [Step 53 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40795] → Tgt Spa: ['1.000'] [Step 53 / Rank 1] Tasks: ['Code'] | Lens: [41981] → Tgt Spa: ['1.000'] [Step 53 / Rank 7] Tasks: ['Single QA'] | Lens: [37107] → Tgt Spa: ['0.350'] 
[Step 53 / Rank 3] Tasks: ['Summarization'] | Lens: [35296] → Tgt Spa: ['1.000'] [Step 53 / Rank 6] Tasks: ['Single QA'] | Lens: [37107] → Tgt Spa: ['0.350'] [Step 53 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26748, 26750] → Tgt Spa: ['1.000', '1.000'] [Step 53 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23391, 23392] → Tgt Spa: ['1.000', '1.000'] [Step 53 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26748, 26750] → Tgt Spa: ['1.000', '1.000'] [Step 53 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23391, 23392] → Tgt Spa: ['1.000', '1.000'] [Step 53 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61111] → Tgt Spa: ['1.000'] [Step 53 / Rank 3] Tasks: ['Single QA'] | Lens: [57507] → Tgt Spa: ['0.350'] [Step 53 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61111] → Tgt Spa: ['1.000'] [Step 53 / Rank 2] Tasks: ['Single QA'] | Lens: [57507] → Tgt Spa: ['0.350'] [Step 53 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20410, 20399, 20400] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 53 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20410, 20399, 20400] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 53 / Rank 6] Tasks: ['Single QA'] | Lens: [58648] → Tgt Spa: ['0.350'] [Step 53 / Rank 7] Tasks: ['Single QA'] | Lens: [58648] → Tgt Spa: ['0.350'] [Step 53 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Code', 'Single QA'] | Lens: [9459, 9459, 9467, 9463, 9471, 9463] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '0.350'] [Step 53 / Rank 2] Tasks: ['Single QA'] | Lens: [59397] → Tgt Spa: ['0.350'] [Step 53 / Rank 3] Tasks: ['Single QA'] | Lens: [59397] → Tgt Spa: ['0.350'] [Step 53 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Code', 'Single QA'] | Lens: [9459, 9459, 9467, 9463, 9471, 9463] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '0.350'] [INFO|lh_trainer.py:781] 
2026-02-16 20:52:29,304 >> @ 53 | Loss: 2.2507 | LM: 2.1597 | Reg: 0.0910 | Spa(Avg): 0.407 [INFO|lh_trainer.py:797] 2026-02-16 20:52:29,304 >> Statistic -> Code | Spa: 0.405 | Tgt: 1.000 | Z-Loss: 0.131 | [INFO|lh_trainer.py:797] 2026-02-16 20:52:29,304 >> Statistic -> In-Context | Spa: 0.411 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:52:29,304 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:52:29,304 >> Statistic -> Single | Spa: 0.398 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:52:29,305 >> Statistic -> Summarization | Spa: 0.465 | Tgt: 1.000 | Z-Loss: 0.140 | [INFO|lh_trainer.py:810] 2026-02-16 20:52:29,306 >> [Micro-Log] {"loss": 2.2507146994272866, "lm_loss": 2.1597467698156834, "reg_loss": 0.09096794336801395, "model_sparsity(avg)": 0.4066357935468356, "Spa-Single QA sparsity": 0.3981481353441874, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03022181720783313, "Spa-Code sparsity": 0.40509257713953656, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13123491778969765, "Spa-In-Context Learning sparsity": 0.41087962687015533, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1486155167222023, "Spa-Summarization sparsity": 0.465277761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13968951255083084, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 53, "current_tau": 1.400453805923462, "lambda1 Single QA": 0.494140625, "lambda2 MultiHop QA": 0.25, "lambda3 Summarization": 0.05859375, "lambda4 Code": 0.1552734375} [INFO|lh_trainer.py:331] 2026-02-16 20:52:52,746 >> {'loss': 13.5043, 'grad_norm': 1.6401008367538452, 'learning_rate': 0.00044166666666666665, 'epoch': 0.05687203791469194, 'num_input_tokens_seen': 133695752, 'completed': 
'18.00% (54 / 300)', 'remaining time': '11:28:17', 'throughput': '7244.32', 'gpu_mem_free': '8887MB', 'step': 54} [Step 54 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [25362, 25370] → Tgt Spa: ['1.000', '1.000'] [Step 54 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [60863] → Tgt Spa: ['1.000'] [Step 54 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38109] → Tgt Spa: ['1.000'] [Step 54 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [25362, 25370] → Tgt Spa: ['1.000', '1.000'] [Step 54 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [38109] → Tgt Spa: ['1.000'] [Step 54 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [60863] → Tgt Spa: ['1.000'] [Step 54 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57561] → Tgt Spa: ['1.000'] [Step 54 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57561] → Tgt Spa: ['1.000'] [Step 54 / Rank 4] Tasks: ['Single QA'] | Lens: [56585] → Tgt Spa: ['0.350'] [Step 54 / Rank 2] Tasks: ['Single QA'] | Lens: [37245] → Tgt Spa: ['0.350'] [Step 54 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32202, 32202] → Tgt Spa: ['0.350', '0.350'] [Step 54 / Rank 7] Tasks: ['Single QA'] | Lens: [51534] → Tgt Spa: ['0.350'] [Step 54 / Rank 5] Tasks: ['Single QA'] | Lens: [56585] → Tgt Spa: ['0.350'] [Step 54 / Rank 6] Tasks: ['Single QA'] | Lens: [51534] → Tgt Spa: ['0.350'] [Step 54 / Rank 3] Tasks: ['Single QA'] | Lens: [37245] → Tgt Spa: ['0.350'] [Step 54 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32202, 32202] → Tgt Spa: ['0.350', '0.350'] [Step 54 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [62165] → Tgt Spa: ['1.000'] [Step 54 / Rank 6] Tasks: ['Single QA'] | Lens: [56975] → Tgt Spa: ['0.350'] [Step 54 / Rank 0] Tasks: ['Single QA'] | Lens: [58753] → Tgt Spa: ['0.350'] [Step 54 / Rank 1] Tasks: ['Single QA'] | Lens: [58753] → Tgt Spa: ['0.350'] [Step 54 / Rank 3] Tasks: ['Single QA'] | Lens: [33842] → Tgt Spa: ['0.350'] [Step 54 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [62165] → Tgt 
Spa: ['1.000'] [Step 54 / Rank 2] Tasks: ['Single QA'] | Lens: [33842] → Tgt Spa: ['0.350'] [Step 54 / Rank 7] Tasks: ['Single QA'] | Lens: [56975] → Tgt Spa: ['0.350'] [Step 54 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45094] → Tgt Spa: ['1.000'] [Step 54 / Rank 1] Tasks: ['Single QA'] | Lens: [53594] → Tgt Spa: ['0.350'] [Step 54 / Rank 2] Tasks: ['Code'] | Lens: [41795] → Tgt Spa: ['1.000'] [Step 54 / Rank 7] Tasks: ['Single QA'] | Lens: [35248] → Tgt Spa: ['0.350'] [Step 54 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45094] → Tgt Spa: ['1.000'] [Step 54 / Rank 3] Tasks: ['Code'] | Lens: [41795] → Tgt Spa: ['1.000'] [Step 54 / Rank 6] Tasks: ['Single QA'] | Lens: [35248] → Tgt Spa: ['0.350'] [Step 54 / Rank 0] Tasks: ['Single QA'] | Lens: [53594] → Tgt Spa: ['0.350'] [Step 54 / Rank 2] Tasks: ['Single QA', 'In-Context Learning', 'Single QA'] | Lens: [21844, 21844, 21845] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 54 / Rank 5] Tasks: ['Single QA'] | Lens: [57511] → Tgt Spa: ['0.350'] [Step 54 / Rank 3] Tasks: ['Single QA', 'In-Context Learning', 'Single QA'] | Lens: [21844, 21844, 21845] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 54 / Rank 6] Tasks: ['Single QA'] | Lens: [52983] → Tgt Spa: ['0.350'] [Step 54 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [25119, 25119] → Tgt Spa: ['1.000', '1.000'] [Step 54 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [25119, 25119] → Tgt Spa: ['1.000', '1.000'] [Step 54 / Rank 4] Tasks: ['Single QA'] | Lens: [57511] → Tgt Spa: ['0.350'] [Step 54 / Rank 7] Tasks: ['Single QA'] | Lens: [52983] → Tgt Spa: ['0.350'] [Step 54 / Rank 0] Tasks: ['Single QA'] | Lens: [55695] → Tgt Spa: ['0.350'] [Step 54 / Rank 6] Tasks: ['Single QA'] | Lens: [43846] → Tgt Spa: ['0.350'] [Step 54 / Rank 4] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [20184, 20195, 20197] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 54 / Rank 5] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [20184, 20195, 20197] → Tgt Spa: ['1.000', 
'1.000', '1.000'] [Step 54 / Rank 1] Tasks: ['Single QA'] | Lens: [55695] → Tgt Spa: ['0.350'] [Step 54 / Rank 2] Tasks: ['Single QA'] | Lens: [64536] → Tgt Spa: ['0.350'] [Step 54 / Rank 3] Tasks: ['Single QA'] | Lens: [64536] → Tgt Spa: ['0.350'] [Step 54 / Rank 7] Tasks: ['Single QA'] | Lens: [43846] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 20:55:27,275 >> @ 54 | Loss: 2.3078 | LM: 2.2296 | Reg: 0.0782 | Spa(Avg): 0.410 [INFO|lh_trainer.py:797] 2026-02-16 20:55:27,276 >> Statistic -> Code | Spa: 0.408 | Tgt: 1.000 | Z-Loss: 0.131 | [INFO|lh_trainer.py:797] 2026-02-16 20:55:27,276 >> Statistic -> In-Context | Spa: 0.435 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:55:27,276 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:55:27,276 >> Statistic -> Single | Spa: 0.397 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:55:27,276 >> Statistic -> Summarization | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.178 | [INFO|lh_trainer.py:810] 2026-02-16 20:55:27,278 >> [Micro-Log] {"loss": 2.307817184676727, "lm_loss": 2.2296035761634507, "reg_loss": 0.0782136024548284, "model_sparsity(avg)": 0.40991512313485146, "Spa-In-Context Learning sparsity": 0.43452381236212595, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14240948855876923, "Spa-Single QA sparsity": 0.39705882352941174, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03617905419292476, "Spa-Code sparsity": 0.4083333134651184, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13113004416227342, "Spa-Summarization sparsity": 0.3888888955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17798228561878204, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 54, "current_tau": 1.3969463109970093, 
"lambda1 Single QA": 0.494140625, "lambda2 MultiHop QA": 0.25, "lambda3 Summarization": 0.0595703125, "lambda4 Code": 0.15625} [INFO|lh_trainer.py:331] 2026-02-16 20:55:53,782 >> {'loss': 13.8469, 'grad_norm': 1.1288220882415771, 'learning_rate': 0.00045000000000000004, 'epoch': 0.057925223802001054, 'num_input_tokens_seen': 136206586, 'completed': '18.33% (55 / 300)', 'remaining time': '11:26:28', 'throughput': '6934.62', 'gpu_mem_free': '7565MB', 'step': 55} [Step 55 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23111, 23133] → Tgt Spa: ['1.000', '1.000'] [Step 55 / Rank 2] Tasks: ['Single QA'] | Lens: [40425] → Tgt Spa: ['0.350'] [Step 55 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23111, 23133] → Tgt Spa: ['1.000', '1.000'] [Step 55 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18722, 18723, 18715] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 55 / Rank 3] Tasks: ['Single QA'] | Lens: [40425] → Tgt Spa: ['0.350'] [Step 55 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [37009] → Tgt Spa: ['1.000'] [Step 55 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [37009] → Tgt Spa: ['1.000'] [Step 55 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18722, 18723, 18715] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 55 / Rank 2] Tasks: ['Single QA'] | Lens: [45733] → Tgt Spa: ['0.350'] [Step 55 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41611] → Tgt Spa: ['1.000'] [Step 55 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25625, 25625] → Tgt Spa: ['0.350', '1.000'] [Step 55 / Rank 7] Tasks: ['Single QA'] | Lens: [34746] → Tgt Spa: ['0.350'] [Step 55 / Rank 3] Tasks: ['Single QA'] | Lens: [45733] → Tgt Spa: ['0.350'] [Step 55 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41611] → Tgt Spa: ['1.000'] [Step 55 / Rank 6] Tasks: ['Single QA'] | Lens: [34746] → Tgt Spa: ['0.350'] [Step 55 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25625, 25625] → Tgt 
Spa: ['0.350', '1.000'] [Step 55 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39887] → Tgt Spa: ['1.000'] [Step 55 / Rank 6] Tasks: ['Single QA'] | Lens: [61397] → Tgt Spa: ['0.350'] [Step 55 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39887] → Tgt Spa: ['1.000'] [Step 55 / Rank 5] Tasks: ['Single QA'] | Lens: [47248] → Tgt Spa: ['0.350'] [Step 55 / Rank 1] Tasks: ['Single QA'] | Lens: [32973] → Tgt Spa: ['0.350'] [Step 55 / Rank 7] Tasks: ['Single QA'] | Lens: [61397] → Tgt Spa: ['0.350'] [Step 55 / Rank 0] Tasks: ['Single QA'] | Lens: [32973] → Tgt Spa: ['0.350'] [Step 55 / Rank 4] Tasks: ['Single QA'] | Lens: [47248] → Tgt Spa: ['0.350'] [Step 55 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55865] → Tgt Spa: ['1.000'] [Step 55 / Rank 1] Tasks: ['Code'] | Lens: [54656] → Tgt Spa: ['1.000'] [Step 55 / Rank 3] Tasks: ['Single QA'] | Lens: [51199] → Tgt Spa: ['0.350'] [Step 55 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25458, 25458] → Tgt Spa: ['1.000', '1.000'] [Step 55 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25458, 25458] → Tgt Spa: ['1.000', '1.000'] [Step 55 / Rank 0] Tasks: ['Code'] | Lens: [54656] → Tgt Spa: ['1.000'] [Step 55 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55865] → Tgt Spa: ['1.000'] [Step 55 / Rank 2] Tasks: ['Single QA'] | Lens: [51199] → Tgt Spa: ['0.350'] [Step 55 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [63234] → Tgt Spa: ['1.000'] [Step 55 / Rank 7] Tasks: ['Single QA'] | Lens: [51571] → Tgt Spa: ['0.350'] [Step 55 / Rank 0] Tasks: ['Single QA'] | Lens: [41003] → Tgt Spa: ['0.350'] [Step 55 / Rank 1] Tasks: ['Single QA'] | Lens: [41003] → Tgt Spa: ['0.350'] [Step 55 / Rank 4] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [4725, 4719, 4718, 4719, 4719, 4721, 4727, 4721, 4730, 4730, 4724, 
4725, 4733] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 55 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [63234] → Tgt Spa: ['1.000'] [Step 55 / Rank 5] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [4725, 4719, 4718, 4719, 4719, 4721, 4727, 4721, 4730, 4730, 4724, 4725, 4733] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 55 / Rank 6] Tasks: ['Single QA'] | Lens: [51571] → Tgt Spa: ['0.350'] [Step 55 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [52900] → Tgt Spa: ['1.000'] [Step 55 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [25919, 25927] → Tgt Spa: ['1.000', '1.000'] [Step 55 / Rank 5] Tasks: ['Code'] | Lens: [56435] → Tgt Spa: ['1.000'] [Step 55 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [25919, 25927] → Tgt Spa: ['1.000', '1.000'] [Step 55 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [52900] → Tgt Spa: ['1.000'] [Step 55 / Rank 6] Tasks: ['Single QA'] | Lens: [65055] → Tgt Spa: ['0.350'] [Step 55 / Rank 7] Tasks: ['Single QA'] | Lens: [65055] → Tgt Spa: ['0.350'] [Step 55 / Rank 4] Tasks: ['Code'] | Lens: [56435] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 20:58:12,455 >> @ 55 | Loss: 2.3756 | LM: 2.2829 | Reg: 0.0927 | Spa(Avg): 0.393 [INFO|lh_trainer.py:797] 2026-02-16 20:58:12,456 >> Statistic -> Code | Spa: 0.415 | Tgt: 1.000 | Z-Loss: 0.130 | [INFO|lh_trainer.py:797] 2026-02-16 20:58:12,456 >> Statistic -> In-Context | Spa: 0.409 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:58:12,456 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:58:12,456 >> Statistic -> Single | Spa: 0.394 | Tgt: 
0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 20:58:12,456 >> Statistic -> Summarization | Spa: 0.407 | Tgt: 1.000 | Z-Loss: 0.169 | [INFO|lh_trainer.py:810] 2026-02-16 20:58:12,458 >> [Micro-Log] {"loss": 2.3755896414319673, "lm_loss": 2.2828607807556787, "reg_loss": 0.09272885953153794, "model_sparsity(avg)": 0.3930288391808669, "Spa-Summarization sparsity": 0.40740740299224854, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1690538227558136, "Spa-Code sparsity": 0.41512345605426365, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12992222938272688, "Spa-Single QA sparsity": 0.39384919830730986, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.023426548544583575, "Spa-In-Context Learning sparsity": 0.4088541567325592, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1502826353535056, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 55, "current_tau": 1.393394112586975, "lambda1 Single QA": 0.494140625, "lambda2 MultiHop QA": 0.25, "lambda3 Summarization": 0.060302734375, "lambda4 Code": 0.1572265625} [INFO|lh_trainer.py:331] 2026-02-16 20:58:39,304 >> {'loss': 14.2535, 'grad_norm': 1.6782962083816528, 'learning_rate': 0.0004583333333333333, 'epoch': 0.05897840968931016, 'num_input_tokens_seen': 138588134, 'completed': '18.67% (56 / 300)', 'remaining time': '11:23:28', 'throughput': '7194.05', 'gpu_mem_free': '8939MB', 'step': 56} [Step 56 / Rank 6] Tasks: ['Single QA'] | Lens: [44432] → Tgt Spa: ['0.350'] [Step 56 / Rank 3] Tasks: ['Single QA'] | Lens: [65095] → Tgt Spa: ['0.350'] [Step 56 / Rank 1] Tasks: ['Single QA'] | Lens: [42295] → Tgt Spa: ['0.350'] [Step 56 / Rank 4] Tasks: ['Single QA'] | Lens: [54043] → Tgt Spa: ['0.350'] [Step 56 / Rank 7] Tasks: ['Single QA'] | Lens: [44432] → Tgt Spa: ['0.350'] [Step 56 / Rank 2] Tasks: ['Single 
QA'] | Lens: [65095] → Tgt Spa: ['0.350'] [Step 56 / Rank 5] Tasks: ['Single QA'] | Lens: [54043] → Tgt Spa: ['0.350'] [Step 56 / Rank 0] Tasks: ['Single QA'] | Lens: [42295] → Tgt Spa: ['0.350'] [Step 56 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'In-Context Learning'] | Lens: [6991, 6993, 6993, 6994, 7001, 6995, 7002, 6996, 6998] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 56 / Rank 1] Tasks: ['Single QA'] | Lens: [51219] → Tgt Spa: ['0.350'] [Step 56 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'In-Context Learning'] | Lens: [6991, 6993, 6993, 6994, 7001, 6995, 7002, 6996, 6998] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 56 / Rank 7] Tasks: ['Single QA'] | Lens: [56613] → Tgt Spa: ['0.350'] [Step 56 / Rank 6] Tasks: ['Single QA'] | Lens: [56613] → Tgt Spa: ['0.350'] [Step 56 / Rank 4] Tasks: ['Code'] | Lens: [47183] → Tgt Spa: ['1.000'] [Step 56 / Rank 0] Tasks: ['Single QA'] | Lens: [51219] → Tgt Spa: ['0.350'] [Step 56 / Rank 5] Tasks: ['Code'] | Lens: [47183] → Tgt Spa: ['1.000'] [Step 56 / Rank 1] Tasks: ['Single QA'] | Lens: [60996] → Tgt Spa: ['0.350'] [Step 56 / Rank 5] Tasks: ['Code'] | Lens: [60939] → Tgt Spa: ['1.000'] [Step 56 / Rank 0] Tasks: ['Single QA'] | Lens: [60996] → Tgt Spa: ['0.350'] [Step 56 / Rank 4] Tasks: ['Code'] | Lens: [60939] → Tgt Spa: ['1.000'] [Step 56 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16907, 16907, 16918] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 56 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [31686, 31696] → Tgt Spa: ['0.350', '1.000'] [Step 56 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16907, 16907, 16918] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 56 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [31686, 31696] → Tgt Spa: ['0.350', 
'1.000'] [Step 56 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60237] → Tgt Spa: ['1.000'] [Step 56 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60237] → Tgt Spa: ['1.000'] [Step 56 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22897, 22898] → Tgt Spa: ['1.000', '1.000'] [Step 56 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [23557, 23550] → Tgt Spa: ['1.000', '1.000'] [Step 56 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22897, 22898] → Tgt Spa: ['1.000', '1.000'] [Step 56 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [23557, 23550] → Tgt Spa: ['1.000', '1.000'] [Step 56 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56447] → Tgt Spa: ['1.000'] [Step 56 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56447] → Tgt Spa: ['1.000'] [Step 56 / Rank 6] Tasks: ['Single QA'] | Lens: [36960] → Tgt Spa: ['0.350'] [Step 56 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17501, 17490, 17502] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 56 / Rank 1] Tasks: ['Code'] | Lens: [41723] → Tgt Spa: ['1.000'] [Step 56 / Rank 7] Tasks: ['Single QA'] | Lens: [36960] → Tgt Spa: ['0.350'] [Step 56 / Rank 0] Tasks: ['Code'] | Lens: [41723] → Tgt Spa: ['1.000'] [Step 56 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17501, 17490, 17502] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 56 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41516] → Tgt Spa: ['1.000'] [Step 56 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41516] → Tgt Spa: ['1.000'] [Step 56 / Rank 4] Tasks: ['Single QA'] | Lens: [47409] → Tgt Spa: ['0.350'] [Step 56 / Rank 6] Tasks: ['Code'] | Lens: [58141] → Tgt Spa: ['1.000'] [Step 56 / Rank 1] Tasks: ['Single QA'] | Lens: [63424] → Tgt Spa: ['0.350'] [Step 56 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26700, 26684] → Tgt Spa: ['1.000', '1.000'] [Step 56 / Rank 0] Tasks: ['Single QA'] | Lens: [63424] → Tgt Spa: 
['0.350'] [Step 56 / Rank 7] Tasks: ['Code'] | Lens: [58141] → Tgt Spa: ['1.000'] [Step 56 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26700, 26684] → Tgt Spa: ['1.000', '1.000'] [Step 56 / Rank 5] Tasks: ['Single QA'] | Lens: [47409] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:01:09,687 >> @ 56 | Loss: 1.9414 | LM: 1.8422 | Reg: 0.0991 | Spa(Avg): 0.445 [INFO|lh_trainer.py:797] 2026-02-16 21:01:09,687 >> Statistic -> Code | Spa: 0.447 | Tgt: 1.000 | Z-Loss: 0.122 | [INFO|lh_trainer.py:797] 2026-02-16 21:01:09,687 >> Statistic -> In-Context | Spa: 0.441 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:01:09,687 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:01:09,687 >> Statistic -> Single | Spa: 0.440 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:01:09,687 >> Statistic -> Summarization | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.181 | [INFO|lh_trainer.py:810] 2026-02-16 21:01:09,689 >> [Micro-Log] {"loss": 1.9413635386154056, "lm_loss": 1.8422178132459521, "reg_loss": 0.09914573895124097, "model_sparsity(avg)": 0.4454089378317197, "Spa-Single QA sparsity": 0.4395424688563627, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0506603600566878, "Spa-In-Context Learning sparsity": 0.4409722089767456, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14233064651489258, "Spa-Code sparsity": 0.4469696825200861, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12153768133033406, "Spa-Summarization sparsity": 0.388888880610466, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1805623434484005, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 56, "current_tau": 1.3897981643676758, "lambda1 Single QA": 0.49609375, "lambda2 MultiHop 
QA": 0.251953125, "lambda3 Summarization": 0.06103515625, "lambda4 Code": 0.158203125} [INFO|lh_trainer.py:331] 2026-02-16 21:01:35,354 >> {'loss': 11.6482, 'grad_norm': 1.3746291399002075, 'learning_rate': 0.00046666666666666666, 'epoch': 0.06003159557661927, 'num_input_tokens_seen': 141117190, 'completed': '19.00% (57 / 300)', 'remaining time': '11:21:14', 'throughput': '7182.80', 'gpu_mem_free': '5485MB', 'step': 57} [Step 57 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [62012] → Tgt Spa: ['1.000'] [Step 57 / Rank 2] Tasks: ['Single QA'] | Lens: [38676] → Tgt Spa: ['0.350'] [Step 57 / Rank 1] Tasks: ['Single QA'] | Lens: [41297] → Tgt Spa: ['0.350'] [Step 57 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41150] → Tgt Spa: ['1.000'] [Step 57 / Rank 0] Tasks: ['Single QA'] | Lens: [41297] → Tgt Spa: ['0.350'] [Step 57 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41150] → Tgt Spa: ['1.000'] [Step 57 / Rank 3] Tasks: ['Single QA'] | Lens: [38676] → Tgt Spa: ['0.350'] [Step 57 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [62012] → Tgt Spa: ['1.000'] [Step 57 / Rank 1] Tasks: ['Single QA'] | Lens: [44352] → Tgt Spa: ['0.350'] [Step 57 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32366, 32365] → Tgt Spa: ['0.350', '0.350'] [Step 57 / Rank 2] Tasks: ['Single QA'] | Lens: [41988] → Tgt Spa: ['0.350'] [Step 57 / Rank 6] Tasks: ['Single QA'] | Lens: [54313] → Tgt Spa: ['0.350'] [Step 57 / Rank 3] Tasks: ['Single QA'] | Lens: [41988] → Tgt Spa: ['0.350'] [Step 57 / Rank 0] Tasks: ['Single QA'] | Lens: [44352] → Tgt Spa: ['0.350'] [Step 57 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32366, 32365] → Tgt Spa: ['0.350', '0.350'] [Step 57 / Rank 7] Tasks: ['Single QA'] | Lens: [54313] → Tgt Spa: ['0.350'] [Step 57 / Rank 2] Tasks: ['Single QA'] | Lens: [46762] → Tgt Spa: ['0.350'] [Step 57 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [24613, 24602] → Tgt Spa: ['1.000', '1.000'] [Step 57 / Rank 7] Tasks: ['Single QA'] | Lens: [49457] → Tgt 
Spa: ['0.350'] [Step 57 / Rank 3] Tasks: ['Single QA'] | Lens: [46762] → Tgt Spa: ['0.350'] [Step 57 / Rank 6] Tasks: ['Single QA'] | Lens: [49457] → Tgt Spa: ['0.350'] [Step 57 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57723] → Tgt Spa: ['1.000'] [Step 57 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57723] → Tgt Spa: ['1.000'] [Step 57 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [24613, 24602] → Tgt Spa: ['1.000', '1.000'] [Step 57 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [21394, 21385, 21400] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 57 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25432, 25417] → Tgt Spa: ['1.000', '1.000'] [Step 57 / Rank 5] Tasks: ['Single QA'] | Lens: [45436] → Tgt Spa: ['0.350'] [Step 57 / Rank 4] Tasks: ['Single QA'] | Lens: [45436] → Tgt Spa: ['0.350'] [Step 57 / Rank 0] Tasks: ['Code'] | Lens: [37384] → Tgt Spa: ['1.000'] [Step 57 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25432, 25417] → Tgt Spa: ['1.000', '1.000'] [Step 57 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [21394, 21385, 21400] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 57 / Rank 1] Tasks: ['Code'] | Lens: [37384] → Tgt Spa: ['1.000'] [Step 57 / Rank 6] Tasks: ['Single QA'] | Lens: [33294] → Tgt Spa: ['0.350'] [Step 57 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [19504, 19510, 19513] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 57 / Rank 1] Tasks: ['Single QA'] | Lens: [58402] → Tgt Spa: ['0.350'] [Step 57 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [19504, 19510, 19513] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 57 / Rank 0] Tasks: ['Single QA'] | Lens: [58402] → Tgt Spa: ['0.350'] [Step 57 / Rank 2] Tasks: ['Single QA'] | Lens: [36027] → Tgt Spa: ['0.350'] [Step 57 / Rank 7] Tasks: ['Single QA'] | Lens: [33294] → Tgt Spa: ['0.350'] [Step 57 / Rank 3] Tasks: ['Single QA'] | Lens: [36027] → Tgt Spa: ['0.350'] [Step 57 / Rank 5] Tasks: ['Single 
QA'] | Lens: [51846] → Tgt Spa: ['0.350'] [Step 57 / Rank 3] Tasks: ['Single QA'] | Lens: [49554] → Tgt Spa: ['0.350'] [Step 57 / Rank 4] Tasks: ['Single QA'] | Lens: [51846] → Tgt Spa: ['0.350'] [Step 57 / Rank 2] Tasks: ['Single QA'] | Lens: [49554] → Tgt Spa: ['0.350'] [Step 57 / Rank 6] Tasks: ['Summarization'] | Lens: [52395] → Tgt Spa: ['1.000'] [Step 57 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57705] → Tgt Spa: ['1.000'] [Step 57 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57705] → Tgt Spa: ['1.000'] [Step 57 / Rank 7] Tasks: ['Summarization'] | Lens: [52395] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:03:58,610 >> @ 57 | Loss: 2.2943 | LM: 2.2043 | Reg: 0.0900 | Spa(Avg): 0.450 [INFO|lh_trainer.py:797] 2026-02-16 21:03:58,610 >> Statistic -> Code | Spa: 0.426 | Tgt: 1.000 | Z-Loss: 0.129 | [INFO|lh_trainer.py:797] 2026-02-16 21:03:58,610 >> Statistic -> In-Context | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:03:58,610 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:03:58,611 >> Statistic -> Single | Spa: 0.455 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:03:58,611 >> Statistic -> Summarization | Spa: 0.431 | Tgt: 1.000 | Z-Loss: 0.159 | [INFO|lh_trainer.py:810] 2026-02-16 21:03:58,612 >> [Micro-Log] {"loss": 2.2942972853779793, "lm_loss": 2.2042579228679338, "reg_loss": 0.09003936561445396, "model_sparsity(avg)": 0.4497492127120495, "Spa-Single QA sparsity": 0.45462961196899415, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.054671541601419446, "Spa-Summarization sparsity": 0.430555534362793, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1593492329120636, "Spa-Code sparsity": 0.42592592040697735, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12853803237279257, "Spa-In-Context Learning sparsity": 0.4444444298744202, "Spa-In-Context 
Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14216456115245818, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 57, "current_tau": 1.3861597776412964, "lambda1 Single QA": 0.49609375, "lambda2 MultiHop QA": 0.251953125, "lambda3 Summarization": 0.061767578125, "lambda4 Code": 0.1591796875} [INFO|lh_trainer.py:331] 2026-02-16 21:04:20,534 >> {'loss': 13.7658, 'grad_norm': 1.0187997817993164, 'learning_rate': 0.000475, 'epoch': 0.061084781463928386, 'num_input_tokens_seen': 143491738, 'completed': '19.33% (58 / 300)', 'remaining time': '11:18:13', 'throughput': '7187.73', 'gpu_mem_free': '7303MB', 'step': 58} [Step 58 / Rank 3] Tasks: ['Single QA'] | Lens: [58368] → Tgt Spa: ['0.350'] [Step 58 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [21804, 21805, 21795] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 58 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [21804, 21805, 21795] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 58 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22377, 22360] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22377, 22360] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 4] Tasks: ['Single QA'] | Lens: [48143] → Tgt Spa: ['0.350'] [Step 58 / Rank 5] Tasks: ['Single QA'] | Lens: [48143] → Tgt Spa: ['0.350'] [Step 58 / Rank 2] Tasks: ['Single QA'] | Lens: [58368] → Tgt Spa: ['0.350'] [Step 58 / Rank 3] Tasks: ['Single QA'] | Lens: [45373] → Tgt Spa: ['0.350'] [Step 58 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31099, 31099] → Tgt Spa: ['0.350', '0.350'] [Step 58 / Rank 2] Tasks: ['Single QA'] | Lens: [45373] → Tgt Spa: ['0.350'] [Step 58 / Rank 5] Tasks: ['Code'] | Lens: [41051] → Tgt Spa: ['1.000'] [Step 58 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31099, 31099] → Tgt Spa: 
['0.350', '0.350'] [Step 58 / Rank 0] Tasks: ['Summarization', 'Summarization'] | Lens: [24415, 24415] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 4] Tasks: ['Code'] | Lens: [41051] → Tgt Spa: ['1.000'] [Step 58 / Rank 1] Tasks: ['Summarization', 'Summarization'] | Lens: [24415, 24415] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32170, 32170] → Tgt Spa: ['0.350', '0.350'] [Step 58 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18385, 18374, 18386] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 58 / Rank 2] Tasks: ['Code'] | Lens: [61129] → Tgt Spa: ['1.000'] [Step 58 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18385, 18374, 18386] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 58 / Rank 0] Tasks: ['Code'] | Lens: [43548] → Tgt Spa: ['1.000'] [Step 58 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32170, 32170] → Tgt Spa: ['0.350', '0.350'] [Step 58 / Rank 3] Tasks: ['Code'] | Lens: [61129] → Tgt Spa: ['1.000'] [Step 58 / Rank 1] Tasks: ['Code'] | Lens: [43548] → Tgt Spa: ['1.000'] [Step 58 / Rank 4] Tasks: ['Single QA'] | Lens: [54187] → Tgt Spa: ['0.350'] [Step 58 / Rank 2] Tasks: ['Single QA'] | Lens: [60135] → Tgt Spa: ['0.350'] [Step 58 / Rank 5] Tasks: ['Single QA'] | Lens: [54187] → Tgt Spa: ['0.350'] [Step 58 / Rank 7] Tasks: ['Single QA'] | Lens: [35969] → Tgt Spa: ['0.350'] [Step 58 / Rank 3] Tasks: ['Single QA'] | Lens: [60135] → Tgt Spa: ['0.350'] [Step 58 / Rank 6] Tasks: ['Single QA'] | Lens: [35969] → Tgt Spa: ['0.350'] [Step 58 / Rank 0] Tasks: ['Single QA'] | Lens: [56711] → Tgt Spa: ['0.350'] [Step 58 / Rank 1] Tasks: ['Single QA'] | Lens: [56711] → Tgt Spa: ['0.350'] [Step 58 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61660] → Tgt Spa: ['1.000'] [Step 58 / Rank 7] Tasks: ['Single QA'] | Lens: [46636] → Tgt Spa: ['0.350'] [Step 58 / Rank 6] Tasks: ['Single QA'] | Lens: [46636] → Tgt Spa: ['0.350'] [Step 58 / Rank 2] Tasks: ['In-Context 
Learning', 'In-Context Learning'] | Lens: [28714, 28717] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26019, 26038] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [61660] → Tgt Spa: ['1.000'] [Step 58 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28714, 28717] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26019, 26038] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 6] Tasks: ['Single QA'] | Lens: [34768] → Tgt Spa: ['0.350'] [Step 58 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24857, 24859] → Tgt Spa: ['0.350', '1.000'] [Step 58 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24416, 24416] → Tgt Spa: ['0.350', '0.350'] [Step 58 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24416, 24416] → Tgt Spa: ['0.350', '0.350'] [Step 58 / Rank 7] Tasks: ['Single QA'] | Lens: [34768] → Tgt Spa: ['0.350'] [Step 58 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24857, 24859] → Tgt Spa: ['0.350', '1.000'] [Step 58 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26667, 26668] → Tgt Spa: ['1.000', '1.000'] [Step 58 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26667, 26668] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:06:50,084 >> @ 58 | Loss: 2.1994 | LM: 2.1060 | Reg: 0.0935 | Spa(Avg): 0.440 [INFO|lh_trainer.py:797] 2026-02-16 21:06:50,085 >> Statistic -> Code | Spa: 0.428 | Tgt: 1.000 | Z-Loss: 0.129 | [INFO|lh_trainer.py:797] 2026-02-16 21:06:50,085 >> Statistic -> In-Context | Spa: 0.425 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:06:50,085 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:06:50,085 >> Statistic -> Single | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 
2026-02-16 21:06:50,085 >> Statistic -> Summarization | Spa: 0.450 | Tgt: 1.000 | Z-Loss: 0.151 | [INFO|lh_trainer.py:810] 2026-02-16 21:06:50,087 >> [Micro-Log] {"loss": 2.1994495590527854, "lm_loss": 2.10597136678795, "reg_loss": 0.093478180312862, "model_sparsity(avg)": 0.44029706592361134, "Spa-Summarization sparsity": 0.4496527761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15145026426762342, "Spa-In-Context Learning sparsity": 0.4253472238779068, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14823485724627972, "Spa-Code sparsity": 0.42777777910232545, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12865080833435058, "Spa-Single QA sparsity": 0.444444440305233, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04954281181562692, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.035760100930929184, "step": 58, "current_tau": 1.3824797868728638, "lambda1 Single QA": 0.498046875, "lambda2 MultiHop QA": 0.251953125, "lambda3 Summarization": 0.0625, "lambda4 Code": 0.16015625} [INFO|lh_trainer.py:331] 2026-02-16 21:07:03,530 >> {'loss': 13.1967, 'grad_norm': 1.1370724439620972, 'learning_rate': 0.00048333333333333334, 'epoch': 0.06213796735123749, 'num_input_tokens_seen': 145991144, 'completed': '19.67% (59 / 300)', 'remaining time': '11:15:04', 'throughput': '7667.08', 'gpu_mem_free': '10749MB', 'step': 59} [Step 59 / Rank 3] Tasks: ['Code'] | Lens: [59165] → Tgt Spa: ['1.000'] [Step 59 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [24819, 24820] → Tgt Spa: ['1.000', '1.000'] [Step 59 / Rank 0] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [19393, 19412, 19400] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 59 / Rank 2] Tasks: ['Code'] | Lens: [59165] → Tgt Spa: ['1.000'] [Step 59 / Rank 1] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [19393, 19412, 19400] → 
Tgt Spa: ['0.350', '1.000', '1.000'] [Step 59 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [18765, 18765, 18766] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 59 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [18765, 18765, 18766] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 59 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [24819, 24820] → Tgt Spa: ['1.000', '1.000'] [Step 59 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24570, 24571] → Tgt Spa: ['1.000', '1.000'] [Step 59 / Rank 0] Tasks: ['Single QA'] | Lens: [55533] → Tgt Spa: ['0.350'] [Step 59 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24570, 24571] → Tgt Spa: ['1.000', '1.000'] [Step 59 / Rank 4] Tasks: ['Single QA'] | Lens: [44070] → Tgt Spa: ['0.350'] [Step 59 / Rank 7] Tasks: ['Code'] | Lens: [35310] → Tgt Spa: ['1.000'] [Step 59 / Rank 1] Tasks: ['Single QA'] | Lens: [55533] → Tgt Spa: ['0.350'] [Step 59 / Rank 6] Tasks: ['Code'] | Lens: [35310] → Tgt Spa: ['1.000'] [Step 59 / Rank 5] Tasks: ['Single QA'] | Lens: [44070] → Tgt Spa: ['0.350'] [Step 59 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57058] → Tgt Spa: ['1.000'] [Step 59 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57448] → Tgt Spa: ['1.000'] [Step 59 / Rank 3] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17486, 17498, 17501] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 59 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57058] → Tgt Spa: ['1.000'] [Step 59 / Rank 0] Tasks: ['Single QA'] | Lens: [33139] → Tgt Spa: ['0.350'] [Step 59 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57448] → Tgt Spa: ['1.000'] [Step 59 / Rank 2] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17486, 17498, 17501] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 59 / Rank 1] Tasks: ['Single QA'] | Lens: [33139] → Tgt Spa: ['0.350'] [Step 59 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9917, 
9918, 9919, 9919, 9919, 9924] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 59 / Rank 6] Tasks: ['Single QA'] | Lens: [59023] → Tgt Spa: ['0.350'] [Step 59 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [28276, 28276] → Tgt Spa: ['0.350', '0.350'] [Step 59 / Rank 7] Tasks: ['Single QA'] | Lens: [59023] → Tgt Spa: ['0.350'] [Step 59 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9917, 9918, 9919, 9919, 9919, 9924] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 59 / Rank 5] Tasks: ['Single QA'] | Lens: [41396] → Tgt Spa: ['0.350'] [Step 59 / Rank 4] Tasks: ['Single QA'] | Lens: [41396] → Tgt Spa: ['0.350'] [Step 59 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [28276, 28276] → Tgt Spa: ['0.350', '0.350'] [Step 59 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [44175] → Tgt Spa: ['1.000'] [Step 59 / Rank 1] Tasks: ['Single QA'] | Lens: [41010] → Tgt Spa: ['0.350'] [Step 59 / Rank 7] Tasks: ['Single QA'] | Lens: [55155] → Tgt Spa: ['0.350'] [Step 59 / Rank 2] Tasks: ['Single QA'] | Lens: [52027] → Tgt Spa: ['0.350'] [Step 59 / Rank 3] Tasks: ['Single QA'] | Lens: [52027] → Tgt Spa: ['0.350'] [Step 59 / Rank 6] Tasks: ['Single QA'] | Lens: [55155] → Tgt Spa: ['0.350'] [Step 59 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [44175] → Tgt Spa: ['1.000'] [Step 59 / Rank 0] Tasks: ['Single QA'] | Lens: [41010] → Tgt Spa: ['0.350'] [Step 59 / Rank 3] Tasks: ['Code'] | Lens: [59353] → Tgt Spa: ['1.000'] [Step 59 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18188, 18189, 18180] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 59 / Rank 1] Tasks: ['Code'] | Lens: [53629] → Tgt Spa: ['1.000'] [Step 59 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18188, 18189, 18180] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 59 / Rank 5] Tasks: ['MultiHop QA'] | Lens: [63724] → Tgt Spa: ['0.350'] [Step 59 / Rank 4] Tasks: ['MultiHop 
QA'] | Lens: [63724] → Tgt Spa: ['0.350'] [Step 59 / Rank 0] Tasks: ['Code'] | Lens: [53629] → Tgt Spa: ['1.000'] [Step 59 / Rank 2] Tasks: ['Code'] | Lens: [59353] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:09:33,742 >> @ 59 | Loss: 1.9724 | LM: 1.8895 | Reg: 0.0829 | Spa(Avg): 0.386 [INFO|lh_trainer.py:797] 2026-02-16 21:09:33,742 >> Statistic -> Code | Spa: 0.393 | Tgt: 1.000 | Z-Loss: 0.139 | [INFO|lh_trainer.py:797] 2026-02-16 21:09:33,742 >> Statistic -> In-Context | Spa: 0.397 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:09:33,742 >> Statistic -> MultiHop | Spa: 0.431 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:09:33,742 >> Statistic -> Single | Spa: 0.366 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:09:33,742 >> Statistic -> Summarization | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.183 | [INFO|lh_trainer.py:810] 2026-02-16 21:09:33,744 >> [Micro-Log] {"loss": 1.9724186280121405, "lm_loss": 1.8895322022338708, "reg_loss": 0.08288642022913943, "model_sparsity(avg)": 0.3858024689058463, "Spa-Single QA sparsity": 0.3662280628555699, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01893623385235275, "Spa-Summarization sparsity": 0.38888888359069823, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.18349790573120117, "Spa-Code sparsity": 0.3930555522441864, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1394503466784954, "Spa-In-Context Learning sparsity": 0.3972222328186035, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15711971521377563, "Spa-MultiHop QA sparsity": 0.430555522441864, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.022685373201966286, "step": 59, "current_tau": 1.378759503364563, "lambda1 Single QA": 0.498046875, "lambda2 MultiHop QA": 0.251953125, "lambda3 Summarization": 0.0634765625, "lambda4 Code": 0.16015625} 
[INFO|lh_trainer.py:331] 2026-02-16 21:09:59,572 >> {'loss': 11.8345, 'grad_norm': 1.401911973953247, 'learning_rate': 0.0004916666666666666, 'epoch': 0.0631911532385466, 'num_input_tokens_seen': 148486356, 'completed': '20.00% (60 / 300)', 'remaining time': '11:12:48', 'throughput': '7086.98', 'gpu_mem_free': '8765MB', 'step': 60} [Step 60 / Rank 3] Tasks: ['Single QA'] | Lens: [49671] → Tgt Spa: ['0.350'] [Step 60 / Rank 7] Tasks: ['Code'] | Lens: [44757] → Tgt Spa: ['1.000'] [Step 60 / Rank 1] Tasks: ['Single QA'] | Lens: [63036] → Tgt Spa: ['0.350'] [Step 60 / Rank 2] Tasks: ['Single QA'] | Lens: [49671] → Tgt Spa: ['0.350'] [Step 60 / Rank 0] Tasks: ['Single QA'] | Lens: [63036] → Tgt Spa: ['0.350'] [Step 60 / Rank 6] Tasks: ['Code'] | Lens: [44757] → Tgt Spa: ['1.000'] [Step 60 / Rank 5] Tasks: ['Code'] | Lens: [47541] → Tgt Spa: ['1.000'] [Step 60 / Rank 4] Tasks: ['Code'] | Lens: [47541] → Tgt Spa: ['1.000'] [Step 60 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Summarization', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [4164, 4164, 4165, 4165, 4185, 4166, 4185, 4167, 4168, 4169, 4175, 4169, 4169, 4170, 4189] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 60 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Summarization', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [4164, 4164, 4165, 4165, 4185, 4166, 4185, 4167, 4168, 4169, 4175, 4169, 4169, 4170, 4189] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', 
'1.000', '0.350', '1.000', '1.000', '1.000'] [Step 60 / Rank 3] Tasks: ['Summarization'] | Lens: [46819] → Tgt Spa: ['1.000'] [Step 60 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [22108, 22101] → Tgt Spa: ['1.000', '1.000'] [Step 60 / Rank 1] Tasks: ['Single QA'] | Lens: [57528] → Tgt Spa: ['0.350'] [Step 60 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [22108, 22101] → Tgt Spa: ['1.000', '1.000'] [Step 60 / Rank 2] Tasks: ['Summarization'] | Lens: [46819] → Tgt Spa: ['1.000'] [Step 60 / Rank 0] Tasks: ['Single QA'] | Lens: [57528] → Tgt Spa: ['0.350'] [Step 60 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [20880, 20880, 20883] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 60 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [9559, 9559, 9561, 9574, 9568, 9577] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 60 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [9559, 9559, 9561, 9574, 9568, 9577] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 60 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [20880, 20880, 20883] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 60 / Rank 6] Tasks: ['Single QA'] | Lens: [36431] → Tgt Spa: ['0.350'] [Step 60 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59468] → Tgt Spa: ['1.000'] [Step 60 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59468] → Tgt Spa: ['1.000'] [Step 60 / Rank 7] Tasks: ['Single QA'] | Lens: [36431] → Tgt Spa: ['0.350'] [Step 60 / Rank 3] Tasks: ['Code'] | Lens: [36729] → Tgt Spa: ['1.000'] [Step 60 / Rank 4] Tasks: ['Single QA'] | Lens: [64939] → Tgt Spa: ['0.350'] [Step 60 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27386, 27392] → Tgt Spa: ['1.000', '1.000'] [Step 60 / Rank 5] Tasks: ['Single QA'] | Lens: [64939] → Tgt Spa: ['0.350'] [Step 60 / Rank 2] Tasks: 
['Code'] | Lens: [36729] → Tgt Spa: ['1.000'] [Step 60 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27386, 27392] → Tgt Spa: ['1.000', '1.000'] [Step 60 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20386, 20375, 20377] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 60 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20386, 20375, 20377] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 60 / Rank 3] Tasks: ['Single QA'] | Lens: [39487] → Tgt Spa: ['0.350'] [Step 60 / Rank 6] Tasks: ['Single QA'] | Lens: [42191] → Tgt Spa: ['0.350'] [Step 60 / Rank 4] Tasks: ['Single QA'] | Lens: [43534] → Tgt Spa: ['0.350'] [Step 60 / Rank 0] Tasks: ['Single QA'] | Lens: [33293] → Tgt Spa: ['0.350'] [Step 60 / Rank 1] Tasks: ['Single QA'] | Lens: [33293] → Tgt Spa: ['0.350'] [Step 60 / Rank 2] Tasks: ['Single QA'] | Lens: [39487] → Tgt Spa: ['0.350'] [Step 60 / Rank 5] Tasks: ['Single QA'] | Lens: [43534] → Tgt Spa: ['0.350'] [Step 60 / Rank 7] Tasks: ['Single QA'] | Lens: [42191] → Tgt Spa: ['0.350'] [Step 60 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [22108, 22116] → Tgt Spa: ['1.000', '1.000'] [Step 60 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [38104] → Tgt Spa: ['1.000'] [Step 60 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [26576, 26578] → Tgt Spa: ['0.350', '0.350'] [Step 60 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [22108, 22116] → Tgt Spa: ['1.000', '1.000'] [Step 60 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [26576, 26578] → Tgt Spa: ['0.350', '0.350'] [Step 60 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38104] → Tgt Spa: ['1.000'] [Step 60 / Rank 1] Tasks: ['Single QA'] | Lens: [59028] → Tgt Spa: ['0.350'] [Step 60 / Rank 0] Tasks: ['Single QA'] | Lens: [59028] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:12:34,051 >> @ 60 | Loss: 2.0963 | LM: 2.0035 | Reg: 0.0928 | Spa(Avg): 0.410 [INFO|lh_trainer.py:797] 2026-02-16 21:12:34,051 >> Statistic -> Code | 
Spa: 0.382 | Tgt: 1.000 | Z-Loss: 0.144 | [INFO|lh_trainer.py:797] 2026-02-16 21:12:34,051 >> Statistic -> In-Context | Spa: 0.411 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:12:34,051 >> Statistic -> MultiHop | Spa: 0.431 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:12:34,051 >> Statistic -> Single | Spa: 0.417 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:12:34,051 >> Statistic -> Summarization | Spa: 0.420 | Tgt: 1.000 | Z-Loss: 0.168 | [INFO|lh_trainer.py:810] 2026-02-16 21:12:34,053 >> [Micro-Log] {"loss": 2.09630074352026, "lm_loss": 2.0034818624456725, "reg_loss": 0.09281887675751932, "model_sparsity(avg)": 0.4103009228905042, "Spa-Single QA sparsity": 0.417397655938801, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03741435242179585, "Spa-Summarization sparsity": 0.420138880610466, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16819028742611408, "Spa-Code sparsity": 0.38194443583488463, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14380302280187607, "Spa-In-Context Learning sparsity": 0.41071427719933645, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1543075596647603, "Spa-MultiHop QA sparsity": 0.430555522441864, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.022685373201966286, "step": 60, "current_tau": 1.375, "lambda1 Single QA": 0.498046875, "lambda2 MultiHop QA": 0.251953125, "lambda3 Summarization": 0.064453125, "lambda4 Code": 0.1611328125} [INFO|lh_trainer.py:331] 2026-02-16 21:12:57,236 >> {'loss': 12.5778, 'grad_norm': 1.3002967834472656, 'learning_rate': 0.0005, 'epoch': 0.06424433912585571, 'num_input_tokens_seen': 150891696, 'completed': '20.33% (61 / 300)', 'remaining time': '11:10:37', 'throughput': '6769.36', 'gpu_mem_free': '7139MB', 'step': 61} [Step 61 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | 
Lens: [23065, 23066] → Tgt Spa: ['1.000', '1.000'] [Step 61 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41787] → Tgt Spa: ['1.000'] [Step 61 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26027, 26048] → Tgt Spa: ['1.000', '1.000'] [Step 61 / Rank 1] Tasks: ['Single QA'] | Lens: [45879] → Tgt Spa: ['0.350'] [Step 61 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41787] → Tgt Spa: ['1.000'] [Step 61 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26027, 26048] → Tgt Spa: ['1.000', '1.000'] [Step 61 / Rank 0] Tasks: ['Single QA'] | Lens: [45879] → Tgt Spa: ['0.350'] [Step 61 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23065, 23066] → Tgt Spa: ['1.000', '1.000'] [Step 61 / Rank 3] Tasks: ['Code'] | Lens: [39595] → Tgt Spa: ['1.000'] [Step 61 / Rank 2] Tasks: ['Code'] | Lens: [39595] → Tgt Spa: ['1.000'] [Step 61 / Rank 4] Tasks: ['Single QA'] | Lens: [59022] → Tgt Spa: ['0.350'] [Step 61 / Rank 1] Tasks: ['Single QA'] | Lens: [57388] → Tgt Spa: ['0.350'] [Step 61 / Rank 0] Tasks: ['Single QA'] | Lens: [57388] → Tgt Spa: ['0.350'] [Step 61 / Rank 6] Tasks: ['Single QA'] | Lens: [62857] → Tgt Spa: ['0.350'] [Step 61 / Rank 5] Tasks: ['Single QA'] | Lens: [59022] → Tgt Spa: ['0.350'] [Step 61 / Rank 7] Tasks: ['Single QA'] | Lens: [62857] → Tgt Spa: ['0.350'] [Step 61 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [36552] → Tgt Spa: ['1.000'] [Step 61 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58055] → Tgt Spa: ['1.000'] [Step 61 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [36552] → Tgt Spa: ['1.000'] [Step 61 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24594, 24595] → Tgt Spa: ['0.350', '1.000'] [Step 61 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24594, 24595] → Tgt Spa: ['0.350', '1.000'] [Step 61 / Rank 1] Tasks: ['Single QA'] | Lens: [65314] → Tgt Spa: ['0.350'] [Step 61 / Rank 0] Tasks: ['Single QA'] | Lens: [65314] → Tgt Spa: ['0.350'] 
[Step 61 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58055] → Tgt Spa: ['1.000'] [Step 61 / Rank 5] Tasks: ['Code'] | Lens: [37962] → Tgt Spa: ['1.000'] [Step 61 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56076] → Tgt Spa: ['1.000'] [Step 61 / Rank 7] Tasks: ['Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA'] | Lens: [5620, 5621, 5621, 5622, 5643, 5624, 5625, 5627, 5629, 5632, 5634] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 61 / Rank 0] Tasks: ['Single QA', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [4234, 4235, 4235, 4236, 4236, 4243, 4236, 4238, 4246, 4238, 4237, 4239, 4247, 4240, 4241] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 61 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56076] → Tgt Spa: ['1.000'] [Step 61 / Rank 4] Tasks: ['Code'] | Lens: [37962] → Tgt Spa: ['1.000'] [Step 61 / Rank 6] Tasks: ['Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA'] | Lens: [5620, 5621, 5621, 5622, 5643, 5624, 5625, 5627, 5629, 5632, 5634] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 61 / Rank 1] Tasks: ['Single QA', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context 
Learning', 'In-Context Learning'] | Lens: [4234, 4235, 4235, 4236, 4236, 4243, 4236, 4238, 4246, 4238, 4237, 4239, 4247, 4240, 4241] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 61 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57422] → Tgt Spa: ['1.000'] [Step 61 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57422] → Tgt Spa: ['1.000'] [Step 61 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7340, 7340, 7348, 7341, 7341, 7341, 7341, 7342] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 61 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22449, 22450] → Tgt Spa: ['1.000', '1.000'] [Step 61 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22449, 22450] → Tgt Spa: ['1.000', '1.000'] [Step 61 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [20158, 20157, 20162] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 61 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [20158, 20157, 20162] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 61 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7340, 7340, 7348, 7341, 7341, 7341, 7341, 7342] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 61 / Rank 6] Tasks: ['Single QA'] | Lens: [37442] → Tgt Spa: ['0.350'] [Step 61 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [64842] → Tgt Spa: ['1.000'] [Step 61 / Rank 4] Tasks: ['Single QA'] | Lens: [42985] → Tgt Spa: ['0.350'] [Step 61 / Rank 5] Tasks: ['Single QA'] | Lens: [42985] → Tgt Spa: ['0.350'] [Step 61 / Rank 0] Tasks: ['Summarization', 'Code', 'Single QA'] | Lens: [20449, 20438, 20432] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 61 / Rank 1] Tasks: ['Summarization', 'Code', 'Single QA'] | 
Lens: [20449, 20438, 20432] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 61 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [64842] → Tgt Spa: ['1.000'] [Step 61 / Rank 7] Tasks: ['Single QA'] | Lens: [37442] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:15:31,209 >> @ 61 | Loss: 2.2199 | LM: 2.1169 | Reg: 0.1030 | Spa(Avg): 0.375 [INFO|lh_trainer.py:797] 2026-02-16 21:15:31,209 >> Statistic -> Code | Spa: 0.369 | Tgt: 1.000 | Z-Loss: 0.148 | [INFO|lh_trainer.py:797] 2026-02-16 21:15:31,210 >> Statistic -> In-Context | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:15:31,210 >> Statistic -> MultiHop | Spa: 0.361 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:15:31,210 >> Statistic -> Single | Spa: 0.380 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:15:31,210 >> Statistic -> Summarization | Spa: 0.384 | Tgt: 1.000 | Z-Loss: 0.187 | [INFO|lh_trainer.py:810] 2026-02-16 21:15:31,212 >> [Micro-Log] {"loss": 2.2198981239149966, "lm_loss": 2.116917805125316, "reg_loss": 0.10298032800589378, "model_sparsity(avg)": 0.37534941112001735, "Spa-Single QA sparsity": 0.37962962687015533, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.021625864998592686, "Spa-MultiHop QA sparsity": 0.3611111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.015237488318234682, "Spa-In-Context Learning sparsity": 0.3888888855775197, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.16135326214134693, "Spa-Code sparsity": 0.36944444179534913, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14833837896585464, "Spa-Summarization sparsity": 0.38425926367441815, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1869746744632721, "step": 61, "current_tau": 1.3712024688720703, "lambda1 Single QA": 0.5, "lambda2 MultiHop QA": 0.25390625, "lambda3 Summarization": 
0.0654296875, "lambda4 Code": 0.162109375} [INFO|lh_trainer.py:331] 2026-02-16 21:15:57,469 >> {'loss': 13.3194, 'grad_norm': 1.8764727115631104, 'learning_rate': 0.0004999785818956435, 'epoch': 0.06529752501316483, 'num_input_tokens_seen': 153414658, 'completed': '20.67% (62 / 300)', 'remaining time': '11:08:34', 'throughput': '6999.19', 'gpu_mem_free': '6823MB', 'step': 62} [Step 62 / Rank 4] Tasks: ['Single QA'] | Lens: [34875] → Tgt Spa: ['0.350'] [Step 62 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [24649, 24649] → Tgt Spa: ['1.000', '1.000'] [Step 62 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [24649, 24649] → Tgt Spa: ['1.000', '1.000'] [Step 62 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43764] → Tgt Spa: ['1.000'] [Step 62 / Rank 2] Tasks: ['Summarization'] | Lens: [57155] → Tgt Spa: ['1.000'] [Step 62 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43764] → Tgt Spa: ['1.000'] [Step 62 / Rank 3] Tasks: ['Summarization'] | Lens: [57155] → Tgt Spa: ['1.000'] [Step 62 / Rank 5] Tasks: ['Single QA'] | Lens: [34875] → Tgt Spa: ['0.350'] [Step 62 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56099] → Tgt Spa: ['1.000'] [Step 62 / Rank 0] Tasks: ['MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [4772, 4774, 4774, 4775, 4775, 4776, 4783, 4776, 4777, 4777, 4777, 4778, 4778] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 62 / Rank 1] Tasks: ['MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [4772, 4774, 4774, 4775, 4775, 4776, 4783, 4776, 4777, 4777, 4777, 4778, 4778] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', 
'0.350', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 62 / Rank 7] Tasks: ['Single QA'] | Lens: [43102] → Tgt Spa: ['0.350'] [Step 62 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4558, 4557, 4576, 4565, 4559, 4559, 4567, 4560, 4560, 4563, 4563, 4564, 4563, 4565] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 62 / Rank 6] Tasks: ['Single QA'] | Lens: [43102] → Tgt Spa: ['0.350'] [Step 62 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56099] → Tgt Spa: ['1.000'] [Step 62 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4558, 4557, 4576, 4565, 4559, 4559, 4567, 4560, 4560, 4563, 4563, 4564, 4563, 4565] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 62 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [26583, 26593] → Tgt Spa: ['1.000', '1.000'] [Step 62 / Rank 6] Tasks: ['Code'] | Lens: [41582] → Tgt Spa: ['1.000'] [Step 62 / Rank 3] Tasks: ['Single QA'] | Lens: [59024] → Tgt Spa: ['0.350'] [Step 62 / Rank 7] Tasks: ['Code'] | Lens: [41582] → Tgt Spa: ['1.000'] [Step 62 / Rank 0] Tasks: ['Single QA'] | Lens: [50475] → Tgt Spa: ['0.350'] [Step 62 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [26583, 26593] → Tgt Spa: ['1.000', '1.000'] [Step 62 / Rank 1] Tasks: ['Single QA'] | Lens: [50475] → Tgt Spa: ['0.350'] [Step 62 / Rank 2] Tasks: ['Single QA'] | Lens: [59024] → Tgt Spa: ['0.350'] [Step 62 / Rank 5] Tasks: ['Single QA'] | 
Lens: [54438] → Tgt Spa: ['0.350'] [Step 62 / Rank 2] Tasks: ['Code'] | Lens: [49028] → Tgt Spa: ['1.000'] [Step 62 / Rank 6] Tasks: ['Code'] | Lens: [57208] → Tgt Spa: ['1.000'] [Step 62 / Rank 3] Tasks: ['Code'] | Lens: [49028] → Tgt Spa: ['1.000'] [Step 62 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [38759] → Tgt Spa: ['1.000'] [Step 62 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [38759] → Tgt Spa: ['1.000'] [Step 62 / Rank 4] Tasks: ['Single QA'] | Lens: [54438] → Tgt Spa: ['0.350'] [Step 62 / Rank 7] Tasks: ['Code'] | Lens: [57208] → Tgt Spa: ['1.000'] [Step 62 / Rank 4] Tasks: ['Single QA'] | Lens: [34044] → Tgt Spa: ['0.350'] [Step 62 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [27055, 27056] → Tgt Spa: ['0.350', '0.350'] [Step 62 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [27055, 27056] → Tgt Spa: ['0.350', '0.350'] [Step 62 / Rank 3] Tasks: ['Single QA'] | Lens: [46441] → Tgt Spa: ['0.350'] [Step 62 / Rank 5] Tasks: ['Single QA'] | Lens: [34044] → Tgt Spa: ['0.350'] [Step 62 / Rank 2] Tasks: ['Single QA'] | Lens: [46441] → Tgt Spa: ['0.350'] [Step 62 / Rank 6] Tasks: ['Single QA'] | Lens: [52967] → Tgt Spa: ['0.350'] [Step 62 / Rank 7] Tasks: ['Single QA'] | Lens: [52967] → Tgt Spa: ['0.350'] [Step 62 / Rank 6] Tasks: ['Single QA'] | Lens: [51854] → Tgt Spa: ['0.350'] [Step 62 / Rank 5] Tasks: ['Single QA'] | Lens: [49974] → Tgt Spa: ['0.350'] [Step 62 / Rank 4] Tasks: ['Single QA'] | Lens: [49974] → Tgt Spa: ['0.350'] [Step 62 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [24525, 24520] → Tgt Spa: ['1.000', '1.000'] [Step 62 / Rank 3] Tasks: ['Code'] | Lens: [37131] → Tgt Spa: ['1.000'] [Step 62 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [24525, 24520] → Tgt Spa: ['1.000', '1.000'] [Step 62 / Rank 7] Tasks: ['Single QA'] | Lens: [51854] → Tgt Spa: ['0.350'] [Step 62 / Rank 2] Tasks: ['Code'] | Lens: [37131] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:18:22,688 >> @ 62 | Loss: 1.8912 | LM: 1.8066 | 
Reg: 0.0846 | Spa(Avg): 0.373 [INFO|lh_trainer.py:797] 2026-02-16 21:18:22,688 >> Statistic -> Code | Spa: 0.383 | Tgt: 1.000 | Z-Loss: 0.145 | [INFO|lh_trainer.py:797] 2026-02-16 21:18:22,688 >> Statistic -> In-Context | Spa: 0.402 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:18:22,688 >> Statistic -> MultiHop | Spa: 0.361 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:18:22,688 >> Statistic -> Single | Spa: 0.395 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:18:22,688 >> Statistic -> Summarization | Spa: 0.394 | Tgt: 1.000 | Z-Loss: 0.182 | [INFO|lh_trainer.py:810] 2026-02-16 21:18:22,690 >> [Micro-Log] {"loss": 1.8911721222102642, "lm_loss": 1.8065559420113761, "reg_loss": 0.08461617841385305, "model_sparsity(avg)": 0.37339743226766586, "Spa-In-Context Learning sparsity": 0.4017857142857143, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1585699309195791, "Spa-MultiHop QA sparsity": 0.3611111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.002966869156807661, "Spa-Single QA sparsity": 0.39467592040697735, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02426779221665735, "Spa-Code sparsity": 0.38257575035095215, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1453129012476314, "Spa-Summarization sparsity": 0.39351850748062134, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.18206247687339783, "step": 62, "current_tau": 1.3673678636550903, "lambda1 Single QA": 0.5, "lambda2 MultiHop QA": 0.25390625, "lambda3 Summarization": 0.06640625, "lambda4 Code": 0.1630859375} [INFO|lh_trainer.py:331] 2026-02-16 21:18:41,545 >> {'loss': 11.347, 'grad_norm': 1.4269951581954956, 'learning_rate': 0.0004999143312524562, 'epoch': 0.06635071090047394, 'num_input_tokens_seen': 155793700, 'completed': '21.00% (63 / 300)', 'remaining time': '11:05:29', 'throughput': 
'7249.81', 'gpu_mem_free': '11747MB', 'step': 63} [Step 63 / Rank 4] Tasks: ['Single QA', 'Code'] | Lens: [26094, 26101] → Tgt Spa: ['0.350', '1.000'] [Step 63 / Rank 7] Tasks: ['Single QA'] | Lens: [52681] → Tgt Spa: ['0.350'] [Step 63 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [26278, 26287] → Tgt Spa: ['1.000', '1.000'] [Step 63 / Rank 5] Tasks: ['Single QA', 'Code'] | Lens: [26094, 26101] → Tgt Spa: ['0.350', '1.000'] [Step 63 / Rank 6] Tasks: ['Single QA'] | Lens: [52681] → Tgt Spa: ['0.350'] [Step 63 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [49736] → Tgt Spa: ['1.000'] [Step 63 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [26278, 26287] → Tgt Spa: ['1.000', '1.000'] [Step 63 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [49736] → Tgt Spa: ['1.000'] [Step 63 / Rank 5] Tasks: ['Code'] | Lens: [43695] → Tgt Spa: ['1.000'] [Step 63 / Rank 6] Tasks: ['Code'] | Lens: [33085] → Tgt Spa: ['1.000'] [Step 63 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [31097, 31115] → Tgt Spa: ['0.350', '1.000'] [Step 63 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [31097, 31115] → Tgt Spa: ['0.350', '1.000'] [Step 63 / Rank 1] Tasks: ['Single QA'] | Lens: [60603] → Tgt Spa: ['0.350'] [Step 63 / Rank 0] Tasks: ['Single QA'] | Lens: [60603] → Tgt Spa: ['0.350'] [Step 63 / Rank 7] Tasks: ['Code'] | Lens: [33085] → Tgt Spa: ['1.000'] [Step 63 / Rank 4] Tasks: ['Code'] | Lens: [43695] → Tgt Spa: ['1.000'] [Step 63 / Rank 6] Tasks: ['Single QA'] | Lens: [44679] → Tgt Spa: ['0.350'] [Step 63 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Single QA'] | Lens: [4250, 4251, 4251, 4250, 4252, 4254, 4273, 4257, 4276, 4265, 4257, 4257, 4268, 4267, 4260] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', 
'1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 63 / Rank 2] Tasks: ['Single QA'] | Lens: [45886] → Tgt Spa: ['0.350'] [Step 63 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53532] → Tgt Spa: ['1.000'] [Step 63 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53532] → Tgt Spa: ['1.000'] [Step 63 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Single QA'] | Lens: [4250, 4251, 4251, 4250, 4252, 4254, 4273, 4257, 4276, 4265, 4257, 4257, 4268, 4267, 4260] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 63 / Rank 7] Tasks: ['Single QA'] | Lens: [44679] → Tgt Spa: ['0.350'] [Step 63 / Rank 3] Tasks: ['Single QA'] | Lens: [45886] → Tgt Spa: ['0.350'] [Step 63 / Rank 3] Tasks: ['Single QA'] | Lens: [45277] → Tgt Spa: ['0.350'] [Step 63 / Rank 0] Tasks: ['Single QA'] | Lens: [36662] → Tgt Spa: ['0.350'] [Step 63 / Rank 2] Tasks: ['Single QA'] | Lens: [45277] → Tgt Spa: ['0.350'] [Step 63 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58099] → Tgt Spa: ['1.000'] [Step 63 / Rank 1] Tasks: ['Single QA'] | Lens: [36662] → Tgt Spa: ['0.350'] [Step 63 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32133, 32133] → Tgt Spa: ['0.350', '0.350'] [Step 63 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32133, 32133] → Tgt Spa: ['0.350', '0.350'] [Step 63 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58099] → Tgt Spa: ['1.000'] [Step 63 / Rank 1] Tasks: ['Single QA'] | Lens: [40236] → Tgt Spa: ['0.350'] [Step 63 / Rank 2] Tasks: ['Single QA'] | Lens: [44179] → Tgt Spa: ['0.350'] [Step 63 / Rank 0] Tasks: ['Single QA'] | Lens: [40236] → Tgt Spa: ['0.350'] [Step 63 / Rank 7] 
Tasks: ['In-Context Learning', 'Single QA', 'Summarization'] | Lens: [21211, 21212, 21231] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 63 / Rank 4] Tasks: ['Single QA'] | Lens: [34873] → Tgt Spa: ['0.350'] [Step 63 / Rank 6] Tasks: ['In-Context Learning', 'Single QA', 'Summarization'] | Lens: [21211, 21212, 21231] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 63 / Rank 3] Tasks: ['Single QA'] | Lens: [44179] → Tgt Spa: ['0.350'] [Step 63 / Rank 5] Tasks: ['Single QA'] | Lens: [34873] → Tgt Spa: ['0.350'] [Step 63 / Rank 2] Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [20037, 20043, 20036] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 63 / Rank 4] Tasks: ['Single QA'] | Lens: [64715] → Tgt Spa: ['0.350'] [Step 63 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [24390, 24399] → Tgt Spa: ['1.000', '1.000'] [Step 63 / Rank 6] Tasks: ['Single QA'] | Lens: [51696] → Tgt Spa: ['0.350'] [Step 63 / Rank 5] Tasks: ['Single QA'] | Lens: [64715] → Tgt Spa: ['0.350'] [Step 63 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [24390, 24399] → Tgt Spa: ['1.000', '1.000'] [Step 63 / Rank 7] Tasks: ['Single QA'] | Lens: [51696] → Tgt Spa: ['0.350'] [Step 63 / Rank 3] Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [20037, 20043, 20036] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:21:00,416 >> @ 63 | Loss: 2.0606 | LM: 1.9781 | Reg: 0.0825 | Spa(Avg): 0.390 [INFO|lh_trainer.py:797] 2026-02-16 21:21:00,416 >> Statistic -> Code | Spa: 0.394 | Tgt: 1.000 | Z-Loss: 0.143 | [INFO|lh_trainer.py:797] 2026-02-16 21:21:00,416 >> Statistic -> In-Context | Spa: 0.401 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:21:00,416 >> Statistic -> MultiHop | Spa: 0.361 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:21:00,416 >> Statistic -> Single | Spa: 0.391 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:21:00,416 >> Statistic -> Summarization | Spa: 0.427 | Tgt: 1.000 
| Z-Loss: 0.166 | [INFO|lh_trainer.py:810] 2026-02-16 21:21:00,418 >> [Micro-Log] {"loss": 2.060609226425489, "lm_loss": 1.9781133122742176, "reg_loss": 0.08249592876139407, "model_sparsity(avg)": 0.3901427388191223, "Spa-In-Context Learning sparsity": 0.4010416604578495, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1599057512357831, "Spa-Code sparsity": 0.3944444417953491, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.142750982940197, "Spa-Single QA sparsity": 0.3905228656880996, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.026572493781500003, "Spa-Summarization sparsity": 0.4270833283662796, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1661495678126812, "Spa-MultiHop QA sparsity": 0.3611111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.002966869156807661, "step": 63, "current_tau": 1.3634976148605347, "lambda1 Single QA": 0.5, "lambda2 MultiHop QA": 0.25390625, "lambda3 Summarization": 0.0673828125, "lambda4 Code": 0.1640625} [INFO|lh_trainer.py:331] 2026-02-16 21:21:27,123 >> {'loss': 12.3637, 'grad_norm': 1.2144259214401245, 'learning_rate': 0.0004998072590794548, 'epoch': 0.06740389678778304, 'num_input_tokens_seen': 158248338, 'completed': '21.33% (64 / 300)', 'remaining time': '11:02:30', 'throughput': '7412.32', 'gpu_mem_free': '11659MB', 'step': 64} [Step 64 / Rank 5] Tasks: ['Single QA'] | Lens: [34958] → Tgt Spa: ['0.350'] [Step 64 / Rank 4] Tasks: ['Single QA'] | Lens: [34958] → Tgt Spa: ['0.350'] [Step 64 / Rank 6] Tasks: ['Single QA'] | Lens: [63927] → Tgt Spa: ['0.350'] [Step 64 / Rank 7] Tasks: ['Single QA'] | Lens: [63927] → Tgt Spa: ['0.350'] [Step 64 / Rank 2] Tasks: ['Code'] | Lens: [51191] → Tgt Spa: ['1.000'] [Step 64 / Rank 3] Tasks: ['Code'] | Lens: [51191] → Tgt Spa: ['1.000'] [Step 64 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29503, 29510] → Tgt Spa: 
['1.000', '1.000'] [Step 64 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29503, 29510] → Tgt Spa: ['1.000', '1.000'] [Step 64 / Rank 1] Tasks: ['Single QA'] | Lens: [65088] → Tgt Spa: ['0.350'] [Step 64 / Rank 6] Tasks: ['In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Summarization', 'In-Context Learning'] | Lens: [5900, 5919, 5902, 5901, 5903, 5902, 5911, 5907, 5908, 5925, 5908] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 64 / Rank 4] Tasks: ['Code', 'Single QA', 'Code'] | Lens: [20730, 20724, 20731] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 64 / Rank 2] Tasks: ['Summarization', 'Single QA', 'Code', 'Single QA'] | Lens: [16039, 16021, 16029, 16029] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350'] [Step 64 / Rank 5] Tasks: ['Code', 'Single QA', 'Code'] | Lens: [20730, 20724, 20731] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 64 / Rank 3] Tasks: ['Summarization', 'Single QA', 'Code', 'Single QA'] | Lens: [16039, 16021, 16029, 16029] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350'] [Step 64 / Rank 7] Tasks: ['In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Summarization', 'In-Context Learning'] | Lens: [5900, 5919, 5902, 5901, 5903, 5902, 5911, 5907, 5908, 5925, 5908] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 64 / Rank 0] Tasks: ['Single QA'] | Lens: [65088] → Tgt Spa: ['0.350'] [Step 64 / Rank 4] Tasks: ['Code'] | Lens: [35281] → Tgt Spa: ['1.000'] [Step 64 / Rank 3] Tasks: ['Single QA'] | Lens: [43215] → Tgt Spa: ['0.350'] [Step 64 / Rank 5] Tasks: ['Code'] | Lens: [35281] → Tgt Spa: ['1.000'] [Step 64 / Rank 7] Tasks: ['Single QA'] | Lens: [49522] → Tgt Spa: ['0.350'] [Step 64 / Rank 6] Tasks: 
['Single QA'] | Lens: [49522] → Tgt Spa: ['0.350'] [Step 64 / Rank 1] Tasks: ['Single QA'] | Lens: [44074] → Tgt Spa: ['0.350'] [Step 64 / Rank 2] Tasks: ['Single QA'] | Lens: [43215] → Tgt Spa: ['0.350'] [Step 64 / Rank 0] Tasks: ['Single QA'] | Lens: [44074] → Tgt Spa: ['0.350'] [Step 64 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22036, 22019] → Tgt Spa: ['1.000', '1.000'] [Step 64 / Rank 7] Tasks: ['Single QA'] | Lens: [39188] → Tgt Spa: ['0.350'] [Step 64 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22036, 22019] → Tgt Spa: ['1.000', '1.000'] [Step 64 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [30434, 30437] → Tgt Spa: ['0.350', '1.000'] [Step 64 / Rank 1] Tasks: ['Single QA'] | Lens: [45100] → Tgt Spa: ['0.350'] [Step 64 / Rank 6] Tasks: ['Single QA'] | Lens: [39188] → Tgt Spa: ['0.350'] [Step 64 / Rank 0] Tasks: ['Single QA'] | Lens: [45100] → Tgt Spa: ['0.350'] [Step 64 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [30434, 30437] → Tgt Spa: ['0.350', '1.000'] [Step 64 / Rank 1] Tasks: ['Summarization'] | Lens: [38551] → Tgt Spa: ['1.000'] [Step 64 / Rank 4] Tasks: ['Single QA'] | Lens: [53819] → Tgt Spa: ['0.350'] [Step 64 / Rank 6] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [11509, 11513, 11513, 11510, 11512] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350'] [Step 64 / Rank 2] Tasks: ['Single QA'] | Lens: [49700] → Tgt Spa: ['0.350'] [Step 64 / Rank 5] Tasks: ['Single QA'] | Lens: [53819] → Tgt Spa: ['0.350'] [Step 64 / Rank 0] Tasks: ['Summarization'] | Lens: [38551] → Tgt Spa: ['1.000'] [Step 64 / Rank 7] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [11509, 11513, 11513, 11510, 11512] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350'] [Step 64 / Rank 3] Tasks: ['Single QA'] | Lens: [49700] → Tgt Spa: ['0.350'] [Step 64 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26955, 26959] → Tgt Spa: 
['1.000', '1.000'] [Step 64 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26955, 26959] → Tgt Spa: ['1.000', '1.000'] [Step 64 / Rank 4] Tasks: ['Single QA'] | Lens: [54205] → Tgt Spa: ['0.350'] [Step 64 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [23288, 23298] → Tgt Spa: ['1.000', '1.000'] [Step 64 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [23288, 23298] → Tgt Spa: ['1.000', '1.000'] [Step 64 / Rank 3] Tasks: ['Code'] | Lens: [63339] → Tgt Spa: ['1.000'] [Step 64 / Rank 2] Tasks: ['Code'] | Lens: [63339] → Tgt Spa: ['1.000'] [Step 64 / Rank 5] Tasks: ['Single QA'] | Lens: [54205] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:23:56,653 >> @ 64 | Loss: 2.0526 | LM: 1.9625 | Reg: 0.0901 | Spa(Avg): 0.414 [INFO|lh_trainer.py:797] 2026-02-16 21:23:56,653 >> Statistic -> Code | Spa: 0.398 | Tgt: 1.000 | Z-Loss: 0.143 | [INFO|lh_trainer.py:797] 2026-02-16 21:23:56,653 >> Statistic -> In-Context | Spa: 0.400 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:23:56,653 >> Statistic -> MultiHop | Spa: 0.361 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:23:56,654 >> Statistic -> Single | Spa: 0.426 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:23:56,654 >> Statistic -> Summarization | Spa: 0.381 | Tgt: 1.000 | Z-Loss: 0.193 | [INFO|lh_trainer.py:810] 2026-02-16 21:23:56,655 >> [Micro-Log] {"loss": 2.0525583749016127, "lm_loss": 1.9624750390648842, "reg_loss": 0.0900833261354516, "model_sparsity(avg)": 0.4142466336488724, "Spa-In-Context Learning sparsity": 0.4002525210380554, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1609102799133821, "Spa-Single QA sparsity": 0.42592592182613553, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04377285315699521, "Spa-Summarization sparsity": 0.3805555462837219, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 
0.19289958477020264, "Spa-Code sparsity": 0.3977272781458768, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14260933345014398, "Spa-MultiHop QA sparsity": 0.3611111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.002966869156807661, "step": 64, "current_tau": 1.3595927953720093, "lambda1 Single QA": 0.5, "lambda2 MultiHop QA": 0.25390625, "lambda3 Summarization": 0.068359375, "lambda4 Code": 0.1650390625} [INFO|lh_trainer.py:331] 2026-02-16 21:24:21,410 >> {'loss': 12.3153, 'grad_norm': 1.1703037023544312, 'learning_rate': 0.000499657383722905, 'epoch': 0.06845708267509215, 'num_input_tokens_seen': 160737224, 'completed': '21.67% (65 / 300)', 'remaining time': '11:00:02', 'throughput': '7140.22', 'gpu_mem_free': '9383MB', 'step': 65} [Step 65 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [63155] → Tgt Spa: ['0.350'] [Step 65 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [23834, 23834] → Tgt Spa: ['0.350', '0.350'] [Step 65 / Rank 0] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning'] | Lens: [6112, 6120, 6113, 6114, 6122, 6115, 6115, 6116, 6135, 6117] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 65 / Rank 5] Tasks: ['Single QA'] | Lens: [50205] → Tgt Spa: ['0.350'] [Step 65 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [63155] → Tgt Spa: ['0.350'] [Step 65 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [23834, 23834] → Tgt Spa: ['0.350', '0.350'] [Step 65 / Rank 1] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning'] | Lens: [6112, 6120, 6113, 6114, 6122, 6115, 6115, 6116, 6135, 6117] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 65 / Rank 4] Tasks: 
['Single QA'] | Lens: [50205] → Tgt Spa: ['0.350'] [Step 65 / Rank 2] Tasks: ['Single QA'] | Lens: [52689] → Tgt Spa: ['0.350'] [Step 65 / Rank 6] Tasks: ['Single QA'] | Lens: [46948] → Tgt Spa: ['0.350'] [Step 65 / Rank 7] Tasks: ['Single QA'] | Lens: [46948] → Tgt Spa: ['0.350'] [Step 65 / Rank 4] Tasks: ['Single QA'] | Lens: [42356] → Tgt Spa: ['0.350'] [Step 65 / Rank 5] Tasks: ['Single QA'] | Lens: [42356] → Tgt Spa: ['0.350'] [Step 65 / Rank 3] Tasks: ['Single QA'] | Lens: [52689] → Tgt Spa: ['0.350'] [Step 65 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27501, 27483] → Tgt Spa: ['1.000', '1.000'] [Step 65 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27501, 27483] → Tgt Spa: ['1.000', '1.000'] [Step 65 / Rank 5] Tasks: ['Single QA'] | Lens: [52015] → Tgt Spa: ['0.350'] [Step 65 / Rank 0] Tasks: ['Single QA'] | Lens: [59581] → Tgt Spa: ['0.350'] [Step 65 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [44234] → Tgt Spa: ['1.000'] [Step 65 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [24778, 24790] → Tgt Spa: ['1.000', '1.000'] [Step 65 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [44234] → Tgt Spa: ['1.000'] [Step 65 / Rank 4] Tasks: ['Single QA'] | Lens: [52015] → Tgt Spa: ['0.350'] [Step 65 / Rank 1] Tasks: ['Single QA'] | Lens: [59581] → Tgt Spa: ['0.350'] [Step 65 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [24778, 24790] → Tgt Spa: ['1.000', '1.000'] [Step 65 / Rank 5] Tasks: ['Single QA'] | Lens: [44173] → Tgt Spa: ['0.350'] [Step 65 / Rank 2] Tasks: ['Summarization', 'MultiHop QA'] | Lens: [30754, 30741] → Tgt Spa: ['1.000', '0.350'] [Step 65 / Rank 4] Tasks: ['Single QA'] | Lens: [44173] → Tgt Spa: ['0.350'] [Step 65 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [5884, 5885, 5885, 5891, 5892, 5885, 5886, 5886, 5888, 5887, 5887] → Tgt Spa: ['0.350', '0.350', 
'0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 65 / Rank 3] Tasks: ['Summarization', 'MultiHop QA'] | Lens: [30754, 30741] → Tgt Spa: ['1.000', '0.350'] [Step 65 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [26452, 26452] → Tgt Spa: ['1.000', '1.000'] [Step 65 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [5884, 5885, 5885, 5891, 5892, 5885, 5886, 5886, 5888, 5887, 5887] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 65 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [26452, 26452] → Tgt Spa: ['1.000', '1.000'] [Step 65 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [7416, 7422, 7423, 7426, 7427, 7429, 7430, 7432] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 65 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29185, 29186] → Tgt Spa: ['0.350', '0.350'] [Step 65 / Rank 1] Tasks: ['Single QA'] | Lens: [61159] → Tgt Spa: ['0.350'] [Step 65 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29185, 29186] → Tgt Spa: ['0.350', '0.350'] [Step 65 / Rank 6] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [21808, 21797, 21792] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 65 / Rank 0] Tasks: ['Single QA'] | Lens: [61159] → Tgt Spa: ['0.350'] [Step 65 / Rank 7] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [21808, 21797, 21792] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 65 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [7416, 7422, 7423, 7426, 7427, 7429, 7430, 7432] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 65 / Rank 3] Tasks: ['Single QA'] 
| Lens: [43081] → Tgt Spa: ['0.350'] [Step 65 / Rank 7] Tasks: ['Single QA'] | Lens: [40780] → Tgt Spa: ['0.350'] [Step 65 / Rank 4] Tasks: ['Single QA'] | Lens: [55063] → Tgt Spa: ['0.350'] [Step 65 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59108] → Tgt Spa: ['1.000'] [Step 65 / Rank 2] Tasks: ['Single QA'] | Lens: [43081] → Tgt Spa: ['0.350'] [Step 65 / Rank 5] Tasks: ['Single QA'] | Lens: [55063] → Tgt Spa: ['0.350'] [Step 65 / Rank 6] Tasks: ['Single QA'] | Lens: [40780] → Tgt Spa: ['0.350'] [Step 65 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59108] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:26:51,671 >> @ 65 | Loss: 2.0412 | LM: 1.9632 | Reg: 0.0780 | Spa(Avg): 0.401 [INFO|lh_trainer.py:797] 2026-02-16 21:26:51,671 >> Statistic -> Code | Spa: 0.429 | Tgt: 1.000 | Z-Loss: 0.134 | [INFO|lh_trainer.py:797] 2026-02-16 21:26:51,671 >> Statistic -> In-Context | Spa: 0.434 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:26:51,671 >> Statistic -> MultiHop | Spa: 0.521 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:26:51,672 >> Statistic -> Single | Spa: 0.386 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:26:51,672 >> Statistic -> Summarization | Spa: 0.385 | Tgt: 1.000 | Z-Loss: 0.190 | [INFO|lh_trainer.py:810] 2026-02-16 21:26:51,673 >> [Micro-Log] {"loss": 2.0411726317058005, "lm_loss": 1.963175463800629, "reg_loss": 0.07799717415279399, "model_sparsity(avg)": 0.4010206138094266, "Spa-In-Context Learning sparsity": 0.4343434138731523, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15179218351840973, "Spa-Code sparsity": 0.4288194254040718, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13423102255910635, "Spa-Single QA sparsity": 0.385521877895702, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.031669583058718476, "Spa-Summarization sparsity": 0.3854166716337204, "Spa-Summarization 
target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.19029508903622627, "Spa-MultiHop QA sparsity": 0.5208333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.05393439903855324, "step": 65, "current_tau": 1.3556545972824097, "lambda1 Single QA": 0.50390625, "lambda2 MultiHop QA": 0.25390625, "lambda3 Summarization": 0.06884765625, "lambda4 Code": 0.166015625} [INFO|lh_trainer.py:331] 2026-02-16 21:27:14,322 >> {'loss': 12.247, 'grad_norm': 0.9227648973464966, 'learning_rate': 0.0004994647308631777, 'epoch': 0.06951026856240126, 'num_input_tokens_seen': 163317772, 'completed': '22.00% (66 / 300)', 'remaining time': '10:57:29', 'throughput': '7462.00', 'gpu_mem_free': '6875MB', 'step': 66} [Step 66 / Rank 5] Tasks: ['Single QA'] | Lens: [36041] → Tgt Spa: ['0.350'] [Step 66 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17693, 17706, 17698] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 66 / Rank 0] Tasks: ['Code'] | Lens: [62796] → Tgt Spa: ['1.000'] [Step 66 / Rank 2] Tasks: ['Single QA'] | Lens: [64981] → Tgt Spa: ['0.350'] [Step 66 / Rank 1] Tasks: ['Code'] | Lens: [62796] → Tgt Spa: ['1.000'] [Step 66 / Rank 4] Tasks: ['Single QA'] | Lens: [36041] → Tgt Spa: ['0.350'] [Step 66 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17693, 17706, 17698] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 66 / Rank 3] Tasks: ['Single QA'] | Lens: [64981] → Tgt Spa: ['0.350'] [Step 66 / Rank 6] Tasks: ['Single QA'] | Lens: [33802] → Tgt Spa: ['0.350'] [Step 66 / Rank 5] Tasks: ['Single QA'] | Lens: [49687] → Tgt Spa: ['0.350'] [Step 66 / Rank 0] Tasks: ['Single QA'] | Lens: [41251] → Tgt Spa: ['0.350'] [Step 66 / Rank 4] Tasks: ['Single QA'] | Lens: [49687] → Tgt Spa: ['0.350'] [Step 66 / Rank 7] Tasks: ['Single QA'] | Lens: [33802] → Tgt Spa: ['0.350'] [Step 66 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Single QA', 'Single QA', 
'In-Context Learning', 'Code'] | Lens: [5597, 5589, 5609, 5592, 5592, 5599, 5601, 5597, 5597, 5597, 5605] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 66 / Rank 1] Tasks: ['Single QA'] | Lens: [41251] → Tgt Spa: ['0.350'] [Step 66 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Code'] | Lens: [5597, 5589, 5609, 5592, 5592, 5599, 5601, 5597, 5597, 5597, 5605] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 66 / Rank 4] Tasks: ['Single QA'] | Lens: [57508] → Tgt Spa: ['0.350'] [Step 66 / Rank 7] Tasks: ['Single QA'] | Lens: [55330] → Tgt Spa: ['0.350'] [Step 66 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60647] → Tgt Spa: ['1.000'] [Step 66 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60647] → Tgt Spa: ['1.000'] [Step 66 / Rank 6] Tasks: ['Single QA'] | Lens: [55330] → Tgt Spa: ['0.350'] [Step 66 / Rank 3] Tasks: ['Single QA'] | Lens: [55864] → Tgt Spa: ['0.350'] [Step 66 / Rank 2] Tasks: ['Single QA'] | Lens: [55864] → Tgt Spa: ['0.350'] [Step 66 / Rank 5] Tasks: ['Single QA'] | Lens: [57508] → Tgt Spa: ['0.350'] [Step 66 / Rank 3] Tasks: ['Single QA'] | Lens: [42480] → Tgt Spa: ['0.350'] [Step 66 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [26262, 26263] → Tgt Spa: ['0.350', '0.350'] [Step 66 / Rank 1] Tasks: ['In-Context Learning', 'Single QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'Single QA', 'Single QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Code', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA'] | Lens: [3158, 3159, 3159, 3177, 3158, 3158, 3160, 3178, 3166, 3159, 3160, 3166, 3163, 3162, 3162, 3170, 3164, 3181, 3163, 3164] → Tgt Spa: ['1.000', '0.350', '0.350', 
'1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 66 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [61537] → Tgt Spa: ['1.000'] [Step 66 / Rank 0] Tasks: ['In-Context Learning', 'Single QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'Single QA', 'Single QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Code', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA'] | Lens: [3158, 3159, 3159, 3177, 3158, 3158, 3160, 3178, 3166, 3159, 3160, 3166, 3163, 3162, 3162, 3170, 3164, 3181, 3163, 3164] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 66 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61537] → Tgt Spa: ['1.000'] [Step 66 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [26262, 26263] → Tgt Spa: ['0.350', '0.350'] [Step 66 / Rank 2] Tasks: ['Single QA'] | Lens: [42480] → Tgt Spa: ['0.350'] [Step 66 / Rank 1] Tasks: ['Code'] | Lens: [51826] → Tgt Spa: ['1.000'] [Step 66 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [24862, 24870] → Tgt Spa: ['1.000', '1.000'] [Step 66 / Rank 6] Tasks: ['Single QA'] | Lens: [37520] → Tgt Spa: ['0.350'] [Step 66 / Rank 7] Tasks: ['Single QA'] | Lens: [37520] → Tgt Spa: ['0.350'] [Step 66 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [24874, 24867] → Tgt Spa: ['1.000', '1.000'] [Step 66 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [24862, 24870] → Tgt Spa: ['1.000', '1.000'] [Step 66 / Rank 0] Tasks: ['Code'] | Lens: [51826] → Tgt Spa: ['1.000'] [Step 66 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [24874, 24867] → Tgt Spa: ['1.000', '1.000'] [Step 66 / Rank 4] Tasks: ['Single QA'] | Lens: [45640] → Tgt Spa: ['0.350'] [Step 66 / 
Rank 5] Tasks: ['Single QA'] | Lens: [45640] → Tgt Spa: ['0.350'] [Step 66 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29217, 29218] → Tgt Spa: ['0.350', '0.350'] [Step 66 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27158, 27159] → Tgt Spa: ['1.000', '0.350'] [Step 66 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29217, 29218] → Tgt Spa: ['0.350', '0.350'] [Step 66 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27158, 27159] → Tgt Spa: ['1.000', '0.350'] [Step 66 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40257] → Tgt Spa: ['1.000'] [Step 66 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40257] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:29:44,093 >> @ 66 | Loss: 2.1481 | LM: 2.0623 | Reg: 0.0858 | Spa(Avg): 0.417 [INFO|lh_trainer.py:797] 2026-02-16 21:29:44,093 >> Statistic -> Code | Spa: 0.419 | Tgt: 1.000 | Z-Loss: 0.138 | [INFO|lh_trainer.py:797] 2026-02-16 21:29:44,093 >> Statistic -> In-Context | Spa: 0.398 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:29:44,094 >> Statistic -> MultiHop | Spa: 0.419 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:29:44,094 >> Statistic -> Single | Spa: 0.422 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:29:44,094 >> Statistic -> Summarization | Spa: 0.433 | Tgt: 1.000 | Z-Loss: 0.166 | [INFO|lh_trainer.py:810] 2026-02-16 21:29:44,096 >> [Micro-Log] {"loss": 2.1481429878622293, "lm_loss": 2.0623222328722477, "reg_loss": 0.0858207659330219, "model_sparsity(avg)": 0.4174286189178626, "Spa-Code sparsity": 0.4188034167656532, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13817697304945725, "Spa-Single QA sparsity": 0.42234848033298145, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04319693689996546, "Spa-In-Context Learning sparsity": 0.39781745416777475, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 
0.16301941233021872, "Spa-MultiHop QA sparsity": 0.41898147265116376, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02745333795125286, "Spa-Summarization sparsity": 0.43333332538604735, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16567549109458923, "step": 66, "current_tau": 1.3516840934753418, "lambda1 Single QA": 0.50390625, "lambda2 MultiHop QA": 0.255859375, "lambda3 Summarization": 0.06982421875, "lambda4 Code": 0.1669921875} [INFO|lh_trainer.py:331] 2026-02-16 21:29:59,691 >> {'loss': 12.8889, 'grad_norm': 1.1137187480926514, 'learning_rate': 0.0004992293335103487, 'epoch': 0.07056345444971038, 'num_input_tokens_seen': 165797524, 'completed': '22.33% (67 / 300)', 'remaining time': '10:54:30', 'throughput': '7497.63', 'gpu_mem_free': '10071MB', 'step': 67} [Step 67 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [24401, 24404] → Tgt Spa: ['1.000', '1.000'] [Step 67 / Rank 0] Tasks: ['Code'] | Lens: [35239] → Tgt Spa: ['1.000'] [Step 67 / Rank 7] Tasks: ['Single QA'] | Lens: [40429] → Tgt Spa: ['0.350'] [Step 67 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [24401, 24404] → Tgt Spa: ['1.000', '1.000'] [Step 67 / Rank 6] Tasks: ['Single QA'] | Lens: [40429] → Tgt Spa: ['0.350'] [Step 67 / Rank 2] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [20542, 20554, 20555] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 67 / Rank 1] Tasks: ['Code'] | Lens: [35239] → Tgt Spa: ['1.000'] [Step 67 / Rank 3] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [20542, 20554, 20555] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 67 / Rank 1] Tasks: ['Single QA'] | Lens: [33209] → Tgt Spa: ['0.350'] [Step 67 / Rank 5] Tasks: ['Single QA'] | Lens: [50364] → Tgt Spa: ['0.350'] [Step 67 / Rank 4] Tasks: ['Single QA'] | Lens: [50364] → Tgt Spa: ['0.350'] [Step 67 / Rank 0] Tasks: ['Single QA'] | Lens: [33209] → Tgt Spa: ['0.350'] [Step 67 / Rank 3] Tasks: ['Single QA'] | Lens: [45753] → Tgt Spa: 
['0.350'] [Step 67 / Rank 2] Tasks: ['Single QA'] | Lens: [45753] → Tgt Spa: ['0.350'] [Step 67 / Rank 7] Tasks: ['Single QA'] | Lens: [61904] → Tgt Spa: ['0.350'] [Step 67 / Rank 6] Tasks: ['Single QA'] | Lens: [61904] → Tgt Spa: ['0.350'] [Step 67 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26344, 26345] → Tgt Spa: ['1.000', '1.000'] [Step 67 / Rank 1] Tasks: ['Single QA'] | Lens: [47405] → Tgt Spa: ['0.350'] [Step 67 / Rank 6] Tasks: ['Single QA'] | Lens: [42626] → Tgt Spa: ['0.350'] [Step 67 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26344, 26345] → Tgt Spa: ['1.000', '1.000'] [Step 67 / Rank 2] Tasks: ['Single QA'] | Lens: [41819] → Tgt Spa: ['0.350'] [Step 67 / Rank 3] Tasks: ['Single QA'] | Lens: [41819] → Tgt Spa: ['0.350'] [Step 67 / Rank 7] Tasks: ['Single QA'] | Lens: [42626] → Tgt Spa: ['0.350'] [Step 67 / Rank 0] Tasks: ['Single QA'] | Lens: [47405] → Tgt Spa: ['0.350'] [Step 67 / Rank 5] Tasks: ['Single QA'] | Lens: [40725] → Tgt Spa: ['0.350'] [Step 67 / Rank 4] Tasks: ['Single QA'] | Lens: [40725] → Tgt Spa: ['0.350'] [Step 67 / Rank 2] Tasks: ['Single QA', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [6281, 6289, 6281, 6290, 6285, 6289, 6290, 6296, 6291, 6292] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 67 / Rank 3] Tasks: ['Single QA', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [6281, 6289, 6281, 6290, 6285, 6289, 6290, 6296, 6291, 6292] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 67 / Rank 7] Tasks: ['Single QA'] | Lens: [54047] → Tgt Spa: ['0.350'] [Step 67 / Rank 0] Tasks: ['Code'] | Lens: [55766] → Tgt Spa: ['1.000'] [Step 67 / Rank 6] Tasks: ['Single QA'] | Lens: [54047] → Tgt Spa: 
['0.350'] [Step 67 / Rank 1] Tasks: ['Code'] | Lens: [55766] → Tgt Spa: ['1.000'] [Step 67 / Rank 7] Tasks: ['Single QA'] | Lens: [54034] → Tgt Spa: ['0.350'] [Step 67 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24214, 24215] → Tgt Spa: ['1.000', '1.000'] [Step 67 / Rank 5] Tasks: ['Single QA'] | Lens: [60608] → Tgt Spa: ['0.350'] [Step 67 / Rank 4] Tasks: ['Single QA'] | Lens: [60608] → Tgt Spa: ['0.350'] [Step 67 / Rank 6] Tasks: ['Single QA'] | Lens: [54034] → Tgt Spa: ['0.350'] [Step 67 / Rank 1] Tasks: ['Code'] | Lens: [42720] → Tgt Spa: ['1.000'] [Step 67 / Rank 0] Tasks: ['Code'] | Lens: [42720] → Tgt Spa: ['1.000'] [Step 67 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24214, 24215] → Tgt Spa: ['1.000', '1.000'] [Step 67 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32364, 32365] → Tgt Spa: ['0.350', '0.350'] [Step 67 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32364, 32365] → Tgt Spa: ['0.350', '0.350'] [Step 67 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [17142, 17142, 17147] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 67 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [28240, 28241] → Tgt Spa: ['0.350', '0.350'] [Step 67 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [28240, 28241] → Tgt Spa: ['0.350', '0.350'] [Step 67 / Rank 6] Tasks: ['Single QA'] | Lens: [48828] → Tgt Spa: ['0.350'] [Step 67 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [17142, 17142, 17147] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 67 / Rank 7] Tasks: ['Single QA'] | Lens: [48828] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:32:18,021 >> @ 67 | Loss: 1.9920 | LM: 1.9107 | Reg: 0.0813 | Spa(Avg): 0.408 [INFO|lh_trainer.py:797] 2026-02-16 21:32:18,022 >> Statistic -> Code | Spa: 0.407 | Tgt: 1.000 | Z-Loss: 0.143 | [INFO|lh_trainer.py:797] 2026-02-16 21:32:18,022 >> Statistic -> In-Context | Spa: 0.412 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 
21:32:18,022 >> Statistic -> MultiHop | Spa: 0.419 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:32:18,022 >> Statistic -> Single | Spa: 0.407 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:32:18,022 >> Statistic -> Summarization | Spa: 0.435 | Tgt: 1.000 | Z-Loss: 0.165 | [INFO|lh_trainer.py:810] 2026-02-16 21:32:18,024 >> [Micro-Log] {"loss": 1.9919711922605832, "lm_loss": 1.9106627466777961, "reg_loss": 0.08130845787915557, "model_sparsity(avg)": 0.40814042588075, "Spa-Code sparsity": 0.4065656499429183, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1429263549772176, "Spa-Single QA sparsity": 0.40719696337526495, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.038000124480194325, "Spa-Summarization sparsity": 0.43518515427907306, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.164621631304423, "Spa-In-Context Learning sparsity": 0.4120370348294576, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15957222878932953, "Spa-MultiHop QA sparsity": 0.41898147265116376, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02745333795125286, "step": 67, "current_tau": 1.3476827144622803, "lambda1 Single QA": 0.50390625, "lambda2 MultiHop QA": 0.255859375, "lambda3 Summarization": 0.07080078125, "lambda4 Code": 0.16796875} [INFO|lh_trainer.py:331] 2026-02-16 21:32:35,796 >> {'loss': 11.9518, 'grad_norm': 1.0727897882461548, 'learning_rate': 0.0004989512319985422, 'epoch': 0.07161664033701948, 'num_input_tokens_seen': 168202674, 'completed': '22.67% (68 / 300)', 'remaining time': '10:50:59', 'throughput': '7703.66', 'gpu_mem_free': '9097MB', 'step': 68} [Step 68 / Rank 6] Tasks: ['Code'] | Lens: [51678] → Tgt Spa: ['1.000'] [Step 68 / Rank 2] Tasks: ['Single QA'] | Lens: [42647] → Tgt Spa: ['0.350'] [Step 68 / Rank 1] Tasks: ['Single QA'] | Lens: [45368] → Tgt Spa: ['0.350'] [Step 68 / Rank 0] 
Tasks: ['Single QA'] | Lens: [45368] → Tgt Spa: ['0.350'] [Step 68 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27963, 27964] → Tgt Spa: ['1.000', '1.000'] [Step 68 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27963, 27964] → Tgt Spa: ['1.000', '1.000'] [Step 68 / Rank 3] Tasks: ['Single QA'] | Lens: [42647] → Tgt Spa: ['0.350'] [Step 68 / Rank 7] Tasks: ['Code'] | Lens: [51678] → Tgt Spa: ['1.000'] [Step 68 / Rank 7] Tasks: ['Single QA'] | Lens: [55753] → Tgt Spa: ['0.350'] [Step 68 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [64851] → Tgt Spa: ['0.350'] [Step 68 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1229, 1229, 1228, 1248, 1230, 1230, 1230, 1230, 1229, 1232, 1250, 1232, 1232, 1253, 1234, 1233, 1254, 1253, 1236, 1236, 1254, 1235, 1236, 1235, 1237, 1237, 1237, 1237, 1237, 1257, 1239, 1237, 1238, 1238, 1239, 1238, 1257, 1239, 1239, 1258, 1241, 1260, 1241, 1241, 1241, 1260, 1241, 1241, 1242, 1242, 1243, 1261] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', 
'0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 68 / Rank 6] Tasks: ['Single QA'] | Lens: [55753] → Tgt Spa: ['0.350'] [Step 68 / Rank 3] Tasks: ['Single QA'] | Lens: [40873] → Tgt Spa: ['0.350'] [Step 68 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1229, 1229, 1228, 1248, 1230, 1230, 1230, 1230, 1229, 1232, 1250, 1232, 1232, 1253, 1234, 1233, 1254, 1253, 1236, 1236, 1254, 1235, 1236, 1235, 1237, 1237, 1237, 1237, 1237, 1257, 1239, 1237, 1238, 1238, 1239, 1238, 1257, 1239, 1239, 1258, 1241, 1260, 1241, 1241, 1241, 1260, 1241, 1241, 1242, 1242, 1243, 1261] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', 
'0.350', '0.350', '0.350', '0.350', '1.000'] [Step 68 / Rank 2] Tasks: ['Single QA'] | Lens: [40873] → Tgt Spa: ['0.350'] [Step 68 / Rank 0] Tasks: ['MultiHop QA'] | Lens: [64851] → Tgt Spa: ['0.350'] [Step 68 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [8000, 8000, 8007, 8001, 8001, 8001, 8001, 8009] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 68 / Rank 2] Tasks: ['Single QA'] | Lens: [64538] → Tgt Spa: ['0.350'] [Step 68 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [8000, 8000, 8007, 8001, 8001, 8001, 8001, 8009] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 68 / Rank 3] Tasks: ['Single QA'] | Lens: [64538] → Tgt Spa: ['0.350'] [Step 68 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [24106, 24101] → Tgt Spa: ['1.000', '1.000'] [Step 68 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [24106, 24101] → Tgt Spa: ['1.000', '1.000'] [Step 68 / Rank 1] Tasks: ['Code'] | Lens: [35600] → Tgt Spa: ['1.000'] [Step 68 / Rank 0] Tasks: ['Code'] | Lens: [35600] → Tgt Spa: ['1.000'] [Step 68 / Rank 6] Tasks: ['Single QA'] | Lens: [50568] → Tgt Spa: ['0.350'] [Step 68 / Rank 1] Tasks: ['Code'] | Lens: [51827] → Tgt Spa: ['1.000'] [Step 68 / Rank 3] Tasks: ['Single QA'] | Lens: [46150] → Tgt Spa: ['0.350'] [Step 68 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [29866, 29876] → Tgt Spa: ['1.000', '1.000'] [Step 68 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [29866, 29876] → Tgt Spa: ['1.000', '1.000'] [Step 68 / Rank 0] Tasks: ['Code'] | Lens: [51827] → Tgt Spa: ['1.000'] [Step 68 / Rank 7] Tasks: ['Single QA'] | Lens: [50568] → Tgt Spa: ['0.350'] [Step 68 / Rank 2] Tasks: ['Single QA'] | Lens: [46150] → Tgt Spa: ['0.350'] [Step 68 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [23411, 23405] → Tgt Spa: 
['1.000', '1.000'] [Step 68 / Rank 5] Tasks: ['Single QA'] | Lens: [50952] → Tgt Spa: ['0.350'] [Step 68 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32121, 32121] → Tgt Spa: ['0.350', '0.350'] [Step 68 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [23411, 23405] → Tgt Spa: ['1.000', '1.000'] [Step 68 / Rank 4] Tasks: ['Single QA'] | Lens: [50952] → Tgt Spa: ['0.350'] [Step 68 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32121, 32121] → Tgt Spa: ['0.350', '0.350'] [Step 68 / Rank 6] Tasks: ['Summarization'] | Lens: [33382] → Tgt Spa: ['1.000'] [Step 68 / Rank 7] Tasks: ['Summarization'] | Lens: [33382] → Tgt Spa: ['1.000'] [Step 68 / Rank 6] Tasks: ['Single QA'] | Lens: [34807] → Tgt Spa: ['0.350'] [Step 68 / Rank 4] Tasks: ['Single QA'] | Lens: [51538] → Tgt Spa: ['0.350'] [Step 68 / Rank 7] Tasks: ['Single QA'] | Lens: [34807] → Tgt Spa: ['0.350'] [Step 68 / Rank 0] Tasks: ['Code'] | Lens: [49411] → Tgt Spa: ['1.000'] [Step 68 / Rank 5] Tasks: ['Single QA'] | Lens: [51538] → Tgt Spa: ['0.350'] [Step 68 / Rank 2] Tasks: ['Code'] | Lens: [34699] → Tgt Spa: ['1.000'] [Step 68 / Rank 3] Tasks: ['Code'] | Lens: [34699] → Tgt Spa: ['1.000'] [Step 68 / Rank 1] Tasks: ['Code'] | Lens: [49411] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:35:01,898 >> @ 68 | Loss: 1.9479 | LM: 1.8550 | Reg: 0.0929 | Spa(Avg): 0.485 [INFO|lh_trainer.py:797] 2026-02-16 21:35:01,899 >> Statistic -> Code | Spa: 0.487 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:797] 2026-02-16 21:35:01,899 >> Statistic -> In-Context | Spa: 0.437 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:35:01,899 >> Statistic -> MultiHop | Spa: 0.434 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:35:01,899 >> Statistic -> Single | Spa: 0.470 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:35:01,899 >> Statistic -> Summarization | Spa: 0.452 | Tgt: 1.000 | Z-Loss: 0.158 | [INFO|lh_trainer.py:810] 2026-02-16 
21:35:01,901 >> [Micro-Log] {"loss": 1.94791133950154, "lm_loss": 1.8550124783068895, "reg_loss": 0.09289886010810733, "model_sparsity(avg)": 0.48487022891640663, "Spa-Single QA sparsity": 0.46990740299224854, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06479745712648663, "Spa-MultiHop QA sparsity": 0.4336043276437899, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025823980872235374, "Spa-Code sparsity": 0.4875, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.119444078207016, "Spa-In-Context Learning sparsity": 0.4374999850988388, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15289584919810295, "Spa-Summarization sparsity": 0.4523809381893703, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15787847020796367, "step": 68, "current_tau": 1.3436516523361206, "lambda1 Single QA": 0.50390625, "lambda2 MultiHop QA": 0.255859375, "lambda3 Summarization": 0.07177734375, "lambda4 Code": 0.1689453125} [INFO|lh_trainer.py:331] 2026-02-16 21:35:20,701 >> {'loss': 11.6875, 'grad_norm': 1.1321420669555664, 'learning_rate': 0.000498630473979021, 'epoch': 0.07266982622432859, 'num_input_tokens_seen': 170618878, 'completed': '23.00% (69 / 300)', 'remaining time': '10:47:59', 'throughput': '7326.00', 'gpu_mem_free': '11263MB', 'step': 69} [Step 69 / Rank 5] Tasks: ['Single QA'] | Lens: [38201] → Tgt Spa: ['0.350'] [Step 69 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [22730, 22739] → Tgt Spa: ['0.350', '1.000'] [Step 69 / Rank 1] Tasks: ['Single QA'] | Lens: [51066] → Tgt Spa: ['0.350'] [Step 69 / Rank 2] Tasks: ['Single QA'] | Lens: [59028] → Tgt Spa: ['0.350'] [Step 69 / Rank 0] Tasks: ['Single QA'] | Lens: [51066] → Tgt Spa: ['0.350'] [Step 69 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [22730, 22739] → Tgt Spa: ['0.350', '1.000'] [Step 69 / Rank 4] Tasks: ['Single QA'] | Lens: [38201] → Tgt Spa: ['0.350'] [Step 69 / Rank 3] Tasks: 
['Single QA'] | Lens: [59028] → Tgt Spa: ['0.350'] [Step 69 / Rank 5] Tasks: ['Single QA'] | Lens: [43604] → Tgt Spa: ['0.350'] [Step 69 / Rank 6] Tasks: ['Single QA'] | Lens: [63206] → Tgt Spa: ['0.350'] [Step 69 / Rank 0] Tasks: ['Single QA'] | Lens: [51707] → Tgt Spa: ['0.350'] [Step 69 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [36283] → Tgt Spa: ['1.000'] [Step 69 / Rank 7] Tasks: ['Single QA'] | Lens: [63206] → Tgt Spa: ['0.350'] [Step 69 / Rank 4] Tasks: ['Single QA'] | Lens: [43604] → Tgt Spa: ['0.350'] [Step 69 / Rank 1] Tasks: ['Single QA'] | Lens: [51707] → Tgt Spa: ['0.350'] [Step 69 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [36283] → Tgt Spa: ['1.000'] [Step 69 / Rank 6] Tasks: ['Single QA'] | Lens: [32847] → Tgt Spa: ['0.350'] [Step 69 / Rank 0] Tasks: ['Code'] | Lens: [38412] → Tgt Spa: ['1.000'] [Step 69 / Rank 4] Tasks: ['Code'] | Lens: [41576] → Tgt Spa: ['1.000'] [Step 69 / Rank 3] Tasks: ['Summarization'] | Lens: [64624] → Tgt Spa: ['1.000'] [Step 69 / Rank 5] Tasks: ['Code'] | Lens: [41576] → Tgt Spa: ['1.000'] [Step 69 / Rank 7] Tasks: ['Single QA'] | Lens: [32847] → Tgt Spa: ['0.350'] [Step 69 / Rank 2] Tasks: ['Summarization'] | Lens: [64624] → Tgt Spa: ['1.000'] [Step 69 / Rank 1] Tasks: ['Code'] | Lens: [38412] → Tgt Spa: ['1.000'] [Step 69 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [30243, 30243] → Tgt Spa: ['1.000', '1.000'] [Step 69 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15877, 15877, 15877, 15877] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 69 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [10293, 10293, 10294, 10297, 10304, 10316] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 69 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [30243, 30243] → Tgt Spa: ['1.000', '1.000'] [Step 69 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [10293, 10293, 10294, 10297, 10304, 
10316] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 69 / Rank 3] Tasks: ['Single QA'] | Lens: [58716] → Tgt Spa: ['0.350'] [Step 69 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15877, 15877, 15877, 15877] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 69 / Rank 2] Tasks: ['Single QA'] | Lens: [58716] → Tgt Spa: ['0.350'] [Step 69 / Rank 1] Tasks: ['Code'] | Lens: [36261] → Tgt Spa: ['1.000'] [Step 69 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27140, 27140] → Tgt Spa: ['1.000', '1.000'] [Step 69 / Rank 7] Tasks: ['Summarization'] | Lens: [43443] → Tgt Spa: ['1.000'] [Step 69 / Rank 0] Tasks: ['Code'] | Lens: [36261] → Tgt Spa: ['1.000'] [Step 69 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27140, 27140] → Tgt Spa: ['1.000', '1.000'] [Step 69 / Rank 2] Tasks: ['Code'] | Lens: [59412] → Tgt Spa: ['1.000'] [Step 69 / Rank 6] Tasks: ['Summarization'] | Lens: [43443] → Tgt Spa: ['1.000'] [Step 69 / Rank 3] Tasks: ['Code'] | Lens: [59412] → Tgt Spa: ['1.000'] [Step 69 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [53868] → Tgt Spa: ['1.000'] [Step 69 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [53868] → Tgt Spa: ['1.000'] [Step 69 / Rank 7] Tasks: ['Code'] | Lens: [53025] → Tgt Spa: ['1.000'] [Step 69 / Rank 0] Tasks: ['Single QA'] | Lens: [53943] → Tgt Spa: ['0.350'] [Step 69 / Rank 1] Tasks: ['Single QA'] | Lens: [53943] → Tgt Spa: ['0.350'] [Step 69 / Rank 6] Tasks: ['Code'] | Lens: [53025] → Tgt Spa: ['1.000'] [Step 69 / Rank 2] Tasks: ['Single QA'] | Lens: [37787] → Tgt Spa: ['0.350'] [Step 69 / Rank 3] Tasks: ['Single QA'] | Lens: [37787] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:38:05,396 >> @ 69 | Loss: 1.9029 | LM: 1.8112 | Reg: 0.0916 | Spa(Avg): 0.392 [INFO|lh_trainer.py:797] 2026-02-16 21:38:05,397 >> Statistic -> Code | Spa: 0.392 | Tgt: 1.000 | Z-Loss: 0.149 | [INFO|lh_trainer.py:797] 2026-02-16 21:38:05,397 
>> Statistic -> In-Context | Spa: 0.420 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:38:05,397 >> Statistic -> MultiHop | Spa: 0.434 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:38:05,397 >> Statistic -> Single | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:38:05,397 >> Statistic -> Summarization | Spa: 0.396 | Tgt: 1.000 | Z-Loss: 0.187 | [INFO|lh_trainer.py:810] 2026-02-16 21:38:05,399 >> [Micro-Log] {"loss": 1.9028576171646516, "lm_loss": 1.811219491995871, "reg_loss": 0.09163811018515844, "model_sparsity(avg)": 0.3922164204219977, "Spa-Single QA sparsity": 0.38888887982619436, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.028159756685215, "Spa-Code sparsity": 0.3916666567325592, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14938027113676072, "Spa-In-Context Learning sparsity": 0.420138880610466, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15898893028497696, "Spa-Summarization sparsity": 0.3958333134651184, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.18701135367155075, "Spa-MultiHop QA sparsity": 0.4336043276437899, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025823980872235374, "step": 69, "current_tau": 1.3395919799804688, "lambda1 Single QA": 0.50390625, "lambda2 MultiHop QA": 0.255859375, "lambda3 Summarization": 0.07275390625, "lambda4 Code": 0.169921875} [INFO|lh_trainer.py:331] 2026-02-16 21:38:25,428 >> {'loss': 11.4171, 'grad_norm': 1.3855392932891846, 'learning_rate': 0.0004982671144120202, 'epoch': 0.0737230121116377, 'num_input_tokens_seen': 173023976, 'completed': '23.33% (70 / 300)', 'remaining time': '10:46:04', 'throughput': '6509.91', 'gpu_mem_free': '8537MB', 'step': 70} [Step 70 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24420, 24421] → Tgt Spa: ['1.000', '0.350'] [Step 70 / Rank 0] 
Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23465, 23448] → Tgt Spa: ['1.000', '1.000'] [Step 70 / Rank 6] Tasks: ['Single QA'] | Lens: [43579] → Tgt Spa: ['0.350'] [Step 70 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43409] → Tgt Spa: ['1.000'] [Step 70 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43409] → Tgt Spa: ['1.000'] [Step 70 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23465, 23448] → Tgt Spa: ['1.000', '1.000'] [Step 70 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24420, 24421] → Tgt Spa: ['1.000', '0.350'] [Step 70 / Rank 7] Tasks: ['Single QA'] | Lens: [43579] → Tgt Spa: ['0.350'] [Step 70 / Rank 1] Tasks: ['Single QA'] | Lens: [64999] → Tgt Spa: ['0.350'] [Step 70 / Rank 4] Tasks: ['Single QA'] | Lens: [40346] → Tgt Spa: ['0.350'] [Step 70 / Rank 0] Tasks: ['Single QA'] | Lens: [64999] → Tgt Spa: ['0.350'] [Step 70 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61664] → Tgt Spa: ['1.000'] [Step 70 / Rank 5] Tasks: ['Single QA'] | Lens: [40346] → Tgt Spa: ['0.350'] [Step 70 / Rank 3] Tasks: ['Single QA'] | Lens: [60924] → Tgt Spa: ['0.350'] [Step 70 / Rank 2] Tasks: ['Single QA'] | Lens: [60924] → Tgt Spa: ['0.350'] [Step 70 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61664] → Tgt Spa: ['1.000'] [Step 70 / Rank 4] Tasks: ['Single QA'] | Lens: [47372] → Tgt Spa: ['0.350'] [Step 70 / Rank 5] Tasks: ['Single QA'] | Lens: [47372] → Tgt Spa: ['0.350'] [Step 70 / Rank 0] Tasks: ['Single QA'] | Lens: [36671] → Tgt Spa: ['0.350'] [Step 70 / Rank 1] Tasks: ['Single QA'] | Lens: [36671] → Tgt Spa: ['0.350'] [Step 70 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [31876, 31875] → Tgt Spa: ['1.000', '1.000'] [Step 70 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56557] → Tgt Spa: ['1.000'] [Step 70 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56557] → Tgt Spa: ['1.000'] [Step 70 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [31876, 31875] → Tgt Spa: ['1.000', '1.000'] [Step 70 / Rank 1] 
Tasks: ['Single QA', 'Single QA'] | Lens: [25331, 25331] → Tgt Spa: ['0.350', '0.350'] [Step 70 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24111, 24112] → Tgt Spa: ['0.350', '1.000'] [Step 70 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25331, 25331] → Tgt Spa: ['0.350', '0.350'] [Step 70 / Rank 2] Tasks: ['Single QA'] | Lens: [57272] → Tgt Spa: ['0.350'] [Step 70 / Rank 7] Tasks: ['Single QA'] | Lens: [36322] → Tgt Spa: ['0.350'] [Step 70 / Rank 6] Tasks: ['Single QA'] | Lens: [36322] → Tgt Spa: ['0.350'] [Step 70 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24111, 24112] → Tgt Spa: ['0.350', '1.000'] [Step 70 / Rank 3] Tasks: ['Single QA'] | Lens: [57272] → Tgt Spa: ['0.350'] [Step 70 / Rank 1] Tasks: ['Single QA'] | Lens: [43924] → Tgt Spa: ['0.350'] [Step 70 / Rank 4] Tasks: ['Single QA'] | Lens: [43502] → Tgt Spa: ['0.350'] [Step 70 / Rank 3] Tasks: ['Single QA'] | Lens: [36705] → Tgt Spa: ['0.350'] [Step 70 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56767] → Tgt Spa: ['1.000'] [Step 70 / Rank 2] Tasks: ['Single QA'] | Lens: [36705] → Tgt Spa: ['0.350'] [Step 70 / Rank 0] Tasks: ['Single QA'] | Lens: [43924] → Tgt Spa: ['0.350'] [Step 70 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56767] → Tgt Spa: ['1.000'] [Step 70 / Rank 5] Tasks: ['Single QA'] | Lens: [43502] → Tgt Spa: ['0.350'] [Step 70 / Rank 4] Tasks: ['Single QA'] | Lens: [58389] → Tgt Spa: ['0.350'] [Step 70 / Rank 5] Tasks: ['Single QA'] | Lens: [58389] → Tgt Spa: ['0.350'] [Step 70 / Rank 1] Tasks: ['Single QA'] | Lens: [52876] → Tgt Spa: ['0.350'] [Step 70 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 70 / Rank 7] Tasks: ['Single QA'] | Lens: [34760] → Tgt Spa: ['0.350'] [Step 70 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 70 / Rank 0] Tasks: ['Single QA'] | Lens: [52876] → Tgt Spa: ['0.350'] [Step 70 / Rank 6] Tasks: ['Single QA'] | Lens: [34760] → Tgt Spa: ['0.350'] 
[INFO|lh_trainer.py:781] 2026-02-16 21:40:53,905 >> @ 70 | Loss: 2.1855 | LM: 2.1145 | Reg: 0.0710 | Spa(Avg): 0.460 [INFO|lh_trainer.py:797] 2026-02-16 21:40:53,905 >> Statistic -> Code | Spa: 0.424 | Tgt: 1.000 | Z-Loss: 0.140 | [INFO|lh_trainer.py:797] 2026-02-16 21:40:53,905 >> Statistic -> In-Context | Spa: 0.518 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:40:53,905 >> Statistic -> MultiHop | Spa: 0.417 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:40:53,905 >> Statistic -> Single | Spa: 0.432 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:40:53,905 >> Statistic -> Summarization | Spa: 0.597 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:810] 2026-02-16 21:40:53,908 >> [Micro-Log] {"loss": 2.185523903463036, "lm_loss": 2.1144912403736575, "reg_loss": 0.07103267808755238, "model_sparsity(avg)": 0.4600694440305233, "Spa-Summarization sparsity": 0.5972222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09338492155075073, "Spa-In-Context Learning sparsity": 0.5178571258272443, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13087494777781622, "Spa-Single QA sparsity": 0.43209876616795856, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04536716469253103, "Spa-Code sparsity": 0.4236111044883728, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13987576961517334, "Spa-MultiHop QA sparsity": 0.4166666865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01887786202132702, "step": 70, "current_tau": 1.3355050086975098, "lambda1 Single QA": 0.5078125, "lambda2 MultiHop QA": 0.2578125, "lambda3 Summarization": 0.07373046875, "lambda4 Code": 0.1708984375} [INFO|lh_trainer.py:331] 2026-02-16 21:41:21,253 >> {'loss': 13.1131, 'grad_norm': 0.8411357998847961, 'learning_rate': 0.0004978612155573311, 'epoch': 0.07477619799894682, 
'num_input_tokens_seen': 175423500, 'completed': '23.67% (71 / 300)', 'remaining time': '10:43:39', 'throughput': '6823.60', 'gpu_mem_free': '8481MB', 'step': 71} [Step 71 / Rank 5] Tasks: ['Single QA'] | Lens: [42609] → Tgt Spa: ['0.350'] [Step 71 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [22476, 22484] → Tgt Spa: ['1.000', '1.000'] [Step 71 / Rank 4] Tasks: ['Single QA'] | Lens: [42609] → Tgt Spa: ['0.350'] [Step 71 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24685, 24686] → Tgt Spa: ['0.350', '1.000'] [Step 71 / Rank 0] Tasks: ['Single QA'] | Lens: [61004] → Tgt Spa: ['0.350'] [Step 71 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [22476, 22484] → Tgt Spa: ['1.000', '1.000'] [Step 71 / Rank 1] Tasks: ['Single QA'] | Lens: [61004] → Tgt Spa: ['0.350'] [Step 71 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24685, 24686] → Tgt Spa: ['0.350', '1.000'] [Step 71 / Rank 4] Tasks: ['Code'] | Lens: [36919] → Tgt Spa: ['1.000'] [Step 71 / Rank 1] Tasks: ['Single QA'] | Lens: [64717] → Tgt Spa: ['0.350'] [Step 71 / Rank 2] Tasks: ['Code'] | Lens: [36882] → Tgt Spa: ['1.000'] [Step 71 / Rank 5] Tasks: ['Code'] | Lens: [36919] → Tgt Spa: ['1.000'] [Step 71 / Rank 3] Tasks: ['Code'] | Lens: [36882] → Tgt Spa: ['1.000'] [Step 71 / Rank 0] Tasks: ['Single QA'] | Lens: [64717] → Tgt Spa: ['0.350'] [Step 71 / Rank 6] Tasks: ['Single QA'] | Lens: [65038] → Tgt Spa: ['0.350'] [Step 71 / Rank 7] Tasks: ['Single QA'] | Lens: [65038] → Tgt Spa: ['0.350'] [Step 71 / Rank 4] Tasks: ['Single QA'] | Lens: [51138] → Tgt Spa: ['0.350'] [Step 71 / Rank 5] Tasks: ['Single QA'] | Lens: [51138] → Tgt Spa: ['0.350'] [Step 71 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55005] → Tgt Spa: ['1.000'] [Step 71 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17477, 17467, 17479] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 71 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [26225, 26233] → Tgt Spa: 
['1.000', '1.000'] [Step 71 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55005] → Tgt Spa: ['1.000'] [Step 71 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17477, 17467, 17479] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 71 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [26225, 26233] → Tgt Spa: ['1.000', '1.000'] [Step 71 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [45352] → Tgt Spa: ['1.000'] [Step 71 / Rank 5] Tasks: ['Single QA'] | Lens: [47070] → Tgt Spa: ['0.350'] [Step 71 / Rank 6] Tasks: ['Single QA'] | Lens: [50951] → Tgt Spa: ['0.350'] [Step 71 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [45352] → Tgt Spa: ['1.000'] [Step 71 / Rank 3] Tasks: ['Single QA'] | Lens: [54613] → Tgt Spa: ['0.350'] [Step 71 / Rank 7] Tasks: ['Single QA'] | Lens: [50951] → Tgt Spa: ['0.350'] [Step 71 / Rank 4] Tasks: ['Single QA'] | Lens: [47070] → Tgt Spa: ['0.350'] [Step 71 / Rank 2] Tasks: ['Single QA'] | Lens: [54613] → Tgt Spa: ['0.350'] [Step 71 / Rank 5] Tasks: ['Single QA'] | Lens: [64525] → Tgt Spa: ['0.350'] [Step 71 / Rank 6] Tasks: ['Single QA'] | Lens: [56669] → Tgt Spa: ['0.350'] [Step 71 / Rank 4] Tasks: ['Single QA'] | Lens: [64525] → Tgt Spa: ['0.350'] [Step 71 / Rank 7] Tasks: ['Single QA'] | Lens: [56669] → Tgt Spa: ['0.350'] [Step 71 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [7808, 7809, 7809, 7809, 7809, 7809, 7820, 7814] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 71 / Rank 1] Tasks: ['Code'] | Lens: [34820] → Tgt Spa: ['1.000'] [Step 71 / Rank 0] Tasks: ['Code'] | Lens: [34820] → Tgt Spa: ['1.000'] [Step 71 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [7808, 7809, 7809, 7809, 7809, 7809, 7820, 7814] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 71 / Rank 0] 
Tasks: ['Single QA'] | Lens: [65459] → Tgt Spa: ['0.350'] [Step 71 / Rank 6] Tasks: ['Code'] | Lens: [37697] → Tgt Spa: ['1.000'] [Step 71 / Rank 1] Tasks: ['Single QA'] | Lens: [65459] → Tgt Spa: ['0.350'] [Step 71 / Rank 5] Tasks: ['Single QA'] | Lens: [37732] → Tgt Spa: ['0.350'] [Step 71 / Rank 4] Tasks: ['Single QA'] | Lens: [37732] → Tgt Spa: ['0.350'] [Step 71 / Rank 7] Tasks: ['Code'] | Lens: [37697] → Tgt Spa: ['1.000'] [Step 71 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [25600, 25599] → Tgt Spa: ['1.000', '1.000'] [Step 71 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [25600, 25599] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:44:05,170 >> @ 71 | Loss: 1.8693 | LM: 1.7811 | Reg: 0.0882 | Spa(Avg): 0.417 [INFO|lh_trainer.py:797] 2026-02-16 21:44:05,170 >> Statistic -> Code | Spa: 0.401 | Tgt: 1.000 | Z-Loss: 0.148 | [INFO|lh_trainer.py:797] 2026-02-16 21:44:05,171 >> Statistic -> In-Context | Spa: 0.419 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:44:05,171 >> Statistic -> MultiHop | Spa: 0.417 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:44:05,171 >> Statistic -> Single | Spa: 0.425 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:44:05,171 >> Statistic -> Summarization | Spa: 0.410 | Tgt: 1.000 | Z-Loss: 0.181 | [INFO|lh_trainer.py:810] 2026-02-16 21:44:05,173 >> [Micro-Log] {"loss": 1.8693164599438508, "lm_loss": 1.7810673800607522, "reg_loss": 0.08824908776053537, "model_sparsity(avg)": 0.41729358459512395, "Spa-Single QA sparsity": 0.4249999910593033, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04011790307704359, "Spa-Summarization sparsity": 0.4097222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.18106814473867416, "Spa-Code sparsity": 0.4013888716697693, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14809992909431458, "Spa-In-Context Learning sparsity": 0.4194444417953491, 
"Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.16069650948047637, "Spa-MultiHop QA sparsity": 0.4166666865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01887786202132702, "step": 71, "current_tau": 1.3313920497894287, "lambda1 Single QA": 0.5078125, "lambda2 MultiHop QA": 0.2578125, "lambda3 Summarization": 0.07421875, "lambda4 Code": 0.171875} [INFO|lh_trainer.py:331] 2026-02-16 21:44:32,302 >> {'loss': 11.2159, 'grad_norm': 1.355546236038208, 'learning_rate': 0.0004974128469636329, 'epoch': 0.07582938388625593, 'num_input_tokens_seen': 177865696, 'completed': '24.00% (72 / 300)', 'remaining time': '10:42:02', 'throughput': '6391.54', 'gpu_mem_free': '3671MB', 'step': 72} [Step 72 / Rank 0] Tasks: ['Code'] | Lens: [34022] → Tgt Spa: ['1.000'] [Step 72 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24663, 24663] → Tgt Spa: ['0.350', '1.000'] [Step 72 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39162] → Tgt Spa: ['1.000'] [Step 72 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24663, 24663] → Tgt Spa: ['0.350', '1.000'] [Step 72 / Rank 6] Tasks: ['Single QA'] | Lens: [35396] → Tgt Spa: ['0.350'] [Step 72 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39162] → Tgt Spa: ['1.000'] [Step 72 / Rank 1] Tasks: ['Code'] | Lens: [34022] → Tgt Spa: ['1.000'] [Step 72 / Rank 7] Tasks: ['Single QA'] | Lens: [35396] → Tgt Spa: ['0.350'] [Step 72 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15918, 15919, 15921, 15921] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 72 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40948] → Tgt Spa: ['1.000'] [Step 72 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [25926, 25935] → Tgt Spa: ['1.000', '1.000'] [Step 72 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [25926, 25935] → Tgt Spa: ['1.000', '1.000'] [Step 72 / Rank 6] Tasks: ['Single QA', 'Single QA', 
'Single QA', 'Single QA'] | Lens: [15918, 15919, 15921, 15921] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 72 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40948] → Tgt Spa: ['1.000'] [Step 72 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13967, 13967, 13968, 13968] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 72 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13967, 13967, 13968, 13968] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 72 / Rank 5] Tasks: ['Single QA'] | Lens: [34800] → Tgt Spa: ['0.350'] [Step 72 / Rank 4] Tasks: ['Single QA'] | Lens: [34800] → Tgt Spa: ['0.350'] [Step 72 / Rank 0] Tasks: ['Code'] | Lens: [61264] → Tgt Spa: ['1.000'] [Step 72 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59527] → Tgt Spa: ['1.000'] [Step 72 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19264, 19264, 19255] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 72 / Rank 1] Tasks: ['Code'] | Lens: [61264] → Tgt Spa: ['1.000'] [Step 72 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59527] → Tgt Spa: ['1.000'] [Step 72 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19264, 19264, 19255] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 72 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40160] → Tgt Spa: ['1.000'] [Step 72 / Rank 1] Tasks: ['Single QA'] | Lens: [35105] → Tgt Spa: ['0.350'] [Step 72 / Rank 5] Tasks: ['Single QA'] | Lens: [50537] → Tgt Spa: ['0.350'] [Step 72 / Rank 4] Tasks: ['Single QA'] | Lens: [50537] → Tgt Spa: ['0.350'] [Step 72 / Rank 0] Tasks: ['Single QA'] | Lens: [35105] → Tgt Spa: ['0.350'] [Step 72 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [23174, 23174] → Tgt Spa: ['1.000', '1.000'] [Step 72 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40160] → Tgt Spa: ['1.000'] [Step 72 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [23174, 23174] → Tgt Spa: ['1.000', '1.000'] [Step 72 / Rank 7] Tasks: ['In-Context Learning'] | 
Lens: [56580] → Tgt Spa: ['1.000'] [Step 72 / Rank 3] Tasks: ['In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA'] | Lens: [4699, 4700, 4707, 4700, 4702, 4701, 4703, 4702, 4703, 4723, 4705, 4705, 4707] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 72 / Rank 5] Tasks: ['Single QA'] | Lens: [65044] → Tgt Spa: ['0.350'] [Step 72 / Rank 0] Tasks: ['Single QA'] | Lens: [44036] → Tgt Spa: ['0.350'] [Step 72 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56580] → Tgt Spa: ['1.000'] [Step 72 / Rank 4] Tasks: ['Single QA'] | Lens: [65044] → Tgt Spa: ['0.350'] [Step 72 / Rank 2] Tasks: ['In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA'] | Lens: [4699, 4700, 4707, 4700, 4702, 4701, 4703, 4702, 4703, 4723, 4705, 4705, 4707] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 72 / Rank 1] Tasks: ['Single QA'] | Lens: [44036] → Tgt Spa: ['0.350'] [Step 72 / Rank 4] Tasks: ['Single QA'] | Lens: [47417] → Tgt Spa: ['0.350'] [Step 72 / Rank 1] Tasks: ['Single QA'] | Lens: [35187] → Tgt Spa: ['0.350'] [Step 72 / Rank 0] Tasks: ['Single QA'] | Lens: [35187] → Tgt Spa: ['0.350'] [Step 72 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [54675] → Tgt Spa: ['1.000'] [Step 72 / Rank 2] Tasks: ['Single QA'] | Lens: [54954] → Tgt Spa: ['0.350'] [Step 72 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [54675] → Tgt Spa: ['1.000'] [Step 72 / Rank 5] Tasks: ['Single QA'] | Lens: [47417] → Tgt Spa: ['0.350'] [Step 72 / Rank 3] Tasks: ['Single QA'] | Lens: [54954] → Tgt Spa: ['0.350'] 
[INFO|lh_trainer.py:781] 2026-02-16 21:46:45,334 >> @ 72 | Loss: 2.0682 | LM: 1.9684 | Reg: 0.0998 | Spa(Avg): 0.391 [INFO|lh_trainer.py:797] 2026-02-16 21:46:45,334 >> Statistic -> Code | Spa: 0.383 | Tgt: 1.000 | Z-Loss: 0.155 | [INFO|lh_trainer.py:797] 2026-02-16 21:46:45,334 >> Statistic -> In-Context | Spa: 0.409 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:46:45,335 >> Statistic -> MultiHop | Spa: 0.417 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:46:45,335 >> Statistic -> Single | Spa: 0.391 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:46:45,335 >> Statistic -> Summarization | Spa: 0.421 | Tgt: 1.000 | Z-Loss: 0.176 | [INFO|lh_trainer.py:810] 2026-02-16 21:46:45,336 >> [Micro-Log] {"loss": 2.068204348285993, "lm_loss": 1.9683899447942774, "reg_loss": 0.09981441649142653, "model_sparsity(avg)": 0.39140773316224414, "Spa-Code sparsity": 0.3829364946910313, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15510624008519308, "Spa-In-Context Learning sparsity": 0.4091880275652959, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.16487614581218132, "Spa-Single QA sparsity": 0.39120370149612427, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.032104560329268374, "Spa-Summarization sparsity": 0.42129629850387573, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17576191325982413, "Spa-MultiHop QA sparsity": 0.4166666865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01887786202132702, "step": 72, "current_tau": 1.327254295349121, "lambda1 Single QA": 0.5078125, "lambda2 MultiHop QA": 0.2578125, "lambda3 Summarization": 0.0751953125, "lambda4 Code": 0.1728515625} [INFO|lh_trainer.py:331] 2026-02-16 21:47:05,861 >> {'loss': 12.4092, 'grad_norm': 1.486307144165039, 'learning_rate': 0.000496922085456576, 'epoch': 0.07688256977356503, 
'num_input_tokens_seen': 180215372, 'completed': '24.33% (73 / 300)', 'remaining time': '10:38:25', 'throughput': '7650.73', 'gpu_mem_free': '13929MB', 'step': 73} [Step 73 / Rank 1] Tasks: ['Summarization', 'MultiHop QA'] | Lens: [30093, 30076] → Tgt Spa: ['1.000', '0.350'] [Step 73 / Rank 5] Tasks: ['Single QA'] | Lens: [48622] → Tgt Spa: ['0.350'] [Step 73 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [52127] → Tgt Spa: ['1.000'] [Step 73 / Rank 2] Tasks: ['Single QA'] | Lens: [41298] → Tgt Spa: ['0.350'] [Step 73 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [52127] → Tgt Spa: ['1.000'] [Step 73 / Rank 3] Tasks: ['Single QA'] | Lens: [41298] → Tgt Spa: ['0.350'] [Step 73 / Rank 4] Tasks: ['Single QA'] | Lens: [48622] → Tgt Spa: ['0.350'] [Step 73 / Rank 0] Tasks: ['Summarization', 'MultiHop QA'] | Lens: [30093, 30076] → Tgt Spa: ['1.000', '0.350'] [Step 73 / Rank 5] Tasks: ['Summarization', 'Summarization'] | Lens: [24239, 24240] → Tgt Spa: ['1.000', '1.000'] [Step 73 / Rank 4] Tasks: ['Summarization', 'Summarization'] | Lens: [24239, 24240] → Tgt Spa: ['1.000', '1.000'] [Step 73 / Rank 3] Tasks: ['Code'] | Lens: [57283] → Tgt Spa: ['1.000'] [Step 73 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [31665, 31669] → Tgt Spa: ['0.350', '0.350'] [Step 73 / Rank 6] Tasks: ['Single QA'] | Lens: [42521] → Tgt Spa: ['0.350'] [Step 73 / Rank 2] Tasks: ['Code'] | Lens: [57283] → Tgt Spa: ['1.000'] [Step 73 / Rank 7] Tasks: ['Single QA'] | Lens: [42521] → Tgt Spa: ['0.350'] [Step 73 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [31665, 31669] → Tgt Spa: ['0.350', '0.350'] [Step 73 / Rank 4] Tasks: ['Single QA'] | Lens: [51063] → Tgt Spa: ['0.350'] [Step 73 / Rank 5] Tasks: ['Single QA'] | Lens: [51063] → Tgt Spa: ['0.350'] [Step 73 / Rank 7] Tasks: ['Single QA'] | Lens: [36055] → Tgt Spa: ['0.350'] [Step 73 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [27409, 27409] → Tgt Spa: ['0.350', '0.350'] [Step 73 / Rank 1] Tasks: ['In-Context Learning'] | 
Lens: [57715] → Tgt Spa: ['1.000'] [Step 73 / Rank 6] Tasks: ['Single QA'] | Lens: [36055] → Tgt Spa: ['0.350'] [Step 73 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [27409, 27409] → Tgt Spa: ['0.350', '0.350'] [Step 73 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57715] → Tgt Spa: ['1.000'] [Step 73 / Rank 3] Tasks: ['Single QA'] | Lens: [51381] → Tgt Spa: ['0.350'] [Step 73 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60580] → Tgt Spa: ['1.000'] [Step 73 / Rank 6] Tasks: ['Code'] | Lens: [39806] → Tgt Spa: ['1.000'] [Step 73 / Rank 7] Tasks: ['Code'] | Lens: [39806] → Tgt Spa: ['1.000'] [Step 73 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29760, 29761] → Tgt Spa: ['0.350', '0.350'] [Step 73 / Rank 2] Tasks: ['Single QA'] | Lens: [51381] → Tgt Spa: ['0.350'] [Step 73 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60580] → Tgt Spa: ['1.000'] [Step 73 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29760, 29761] → Tgt Spa: ['0.350', '0.350'] [Step 73 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57840] → Tgt Spa: ['1.000'] [Step 73 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57840] → Tgt Spa: ['1.000'] [Step 73 / Rank 6] Tasks: ['Single QA'] | Lens: [43617] → Tgt Spa: ['0.350'] [Step 73 / Rank 7] Tasks: ['Single QA'] | Lens: [43617] → Tgt Spa: ['0.350'] [Step 73 / Rank 1] Tasks: ['Single QA'] | Lens: [64598] → Tgt Spa: ['0.350'] [Step 73 / Rank 4] Tasks: ['Single QA'] | Lens: [39990] → Tgt Spa: ['0.350'] [Step 73 / Rank 5] Tasks: ['Single QA'] | Lens: [39990] → Tgt Spa: ['0.350'] [Step 73 / Rank 0] Tasks: ['Single QA'] | Lens: [64598] → Tgt Spa: ['0.350'] [Step 73 / Rank 7] Tasks: ['Code'] | Lens: [53107] → Tgt Spa: ['1.000'] [Step 73 / Rank 5] Tasks: ['Single QA'] | Lens: [39277] → Tgt Spa: ['0.350'] [Step 73 / Rank 2] Tasks: ['Single QA'] | Lens: [36014] → Tgt Spa: ['0.350'] [Step 73 / Rank 6] Tasks: ['Code'] | Lens: [53107] → Tgt Spa: ['1.000'] [Step 73 / Rank 1] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 
'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7030, 7026, 7026, 7026, 7026, 7027, 7027, 7028, 7028] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 73 / Rank 4] Tasks: ['Single QA'] | Lens: [39277] → Tgt Spa: ['0.350'] [Step 73 / Rank 0] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7030, 7026, 7026, 7026, 7026, 7027, 7027, 7028, 7028] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 73 / Rank 3] Tasks: ['Single QA'] | Lens: [36014] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:49:37,719 >> @ 73 | Loss: 2.1042 | LM: 2.0229 | Reg: 0.0812 | Spa(Avg): 0.389 [INFO|lh_trainer.py:797] 2026-02-16 21:49:37,719 >> Statistic -> Code | Spa: 0.448 | Tgt: 1.000 | Z-Loss: 0.135 | [INFO|lh_trainer.py:797] 2026-02-16 21:49:37,720 >> Statistic -> In-Context | Spa: 0.396 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:49:37,720 >> Statistic -> MultiHop | Spa: 0.361 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:49:37,720 >> Statistic -> Single | Spa: 0.383 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:49:37,720 >> Statistic -> Summarization | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.194 | [INFO|lh_trainer.py:810] 2026-02-16 21:49:37,722 >> [Micro-Log] {"loss": 2.1041585579514503, "lm_loss": 2.0229465824862323, "reg_loss": 0.08121198823209852, "model_sparsity(avg)": 0.38949974005421, "Spa-Summarization sparsity": 0.3888888955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.19449869791666666, "Spa-MultiHop QA sparsity": 0.3611111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.003034778870642185, "Spa-Single QA sparsity": 0.38333332777023316, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 
0.03355189974419773, "Spa-In-Context Learning sparsity": 0.3958333134651184, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.169835414737463, "Spa-Code sparsity": 0.4479166716337204, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13486292213201523, "step": 73, "current_tau": 1.3230929374694824, "lambda1 Single QA": 0.5078125, "lambda2 MultiHop QA": 0.259765625, "lambda3 Summarization": 0.076171875, "lambda4 Code": 0.173828125} [INFO|lh_trainer.py:331] 2026-02-16 21:49:57,009 >> {'loss': 12.625, 'grad_norm': 0.969342827796936, 'learning_rate': 0.0004963890151256181, 'epoch': 0.07793575566087414, 'num_input_tokens_seen': 182660290, 'completed': '24.67% (74 / 300)', 'remaining time': '10:35:43', 'throughput': '7142.70', 'gpu_mem_free': '7401MB', 'step': 74} [Step 74 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45511] → Tgt Spa: ['1.000'] [Step 74 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8049, 8048, 8049, 8048, 8056, 8051, 8052, 8052] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 74 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24102, 24103] → Tgt Spa: ['1.000', '1.000'] [Step 74 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8049, 8048, 8049, 8048, 8056, 8051, 8052, 8052] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 74 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [45511] → Tgt Spa: ['1.000'] [Step 74 / Rank 0] Tasks: ['Single QA'] | Lens: [64595] → Tgt Spa: ['0.350'] [Step 74 / Rank 1] Tasks: ['Single QA'] | Lens: [64595] → Tgt Spa: ['0.350'] [Step 74 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24102, 24103] → Tgt Spa: ['1.000', '1.000'] [Step 74 / Rank 4] Tasks: ['Single QA'] | Lens: [42642] → Tgt Spa: 
['0.350'] [Step 74 / Rank 3] Tasks: ['Single QA'] | Lens: [65024] → Tgt Spa: ['0.350'] [Step 74 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [25251, 25251] → Tgt Spa: ['0.350', '0.350'] [Step 74 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [25251, 25251] → Tgt Spa: ['0.350', '0.350'] [Step 74 / Rank 0] Tasks: ['Code'] | Lens: [43852] → Tgt Spa: ['1.000'] [Step 74 / Rank 5] Tasks: ['Single QA'] | Lens: [42642] → Tgt Spa: ['0.350'] [Step 74 / Rank 1] Tasks: ['Code'] | Lens: [43852] → Tgt Spa: ['1.000'] [Step 74 / Rank 2] Tasks: ['Single QA'] | Lens: [65024] → Tgt Spa: ['0.350'] [Step 74 / Rank 4] Tasks: ['Single QA'] | Lens: [62727] → Tgt Spa: ['0.350'] [Step 74 / Rank 2] Tasks: ['Code'] | Lens: [43454] → Tgt Spa: ['1.000'] [Step 74 / Rank 1] Tasks: ['Single QA'] | Lens: [45638] → Tgt Spa: ['0.350'] [Step 74 / Rank 0] Tasks: ['Single QA'] | Lens: [45638] → Tgt Spa: ['0.350'] [Step 74 / Rank 5] Tasks: ['Single QA'] | Lens: [62727] → Tgt Spa: ['0.350'] [Step 74 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29754, 29754] → Tgt Spa: ['0.350', '0.350'] [Step 74 / Rank 3] Tasks: ['Code'] | Lens: [43454] → Tgt Spa: ['1.000'] [Step 74 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29754, 29754] → Tgt Spa: ['0.350', '0.350'] [Step 74 / Rank 7] Tasks: ['Single QA'] | Lens: [58960] → Tgt Spa: ['0.350'] [Step 74 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22543, 22543] → Tgt Spa: ['1.000', '1.000'] [Step 74 / Rank 0] Tasks: ['Single QA'] | Lens: [60855] → Tgt Spa: ['0.350'] [Step 74 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22543, 22561] → Tgt Spa: ['1.000', '1.000'] [Step 74 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22543, 22543] → Tgt Spa: ['1.000', '1.000'] [Step 74 / Rank 6] Tasks: ['Single QA'] | Lens: [58960] → Tgt Spa: ['0.350'] [Step 74 / Rank 1] Tasks: ['Single QA'] | Lens: [60855] → Tgt Spa: ['0.350'] [Step 74 / Rank 4] Tasks: ['In-Context Learning', 
'Summarization'] | Lens: [22543, 22561] → Tgt Spa: ['1.000', '1.000'] [Step 74 / Rank 3] Tasks: ['Single QA'] | Lens: [41289] → Tgt Spa: ['0.350'] [Step 74 / Rank 2] Tasks: ['Single QA'] | Lens: [41289] → Tgt Spa: ['0.350'] [Step 74 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43164] → Tgt Spa: ['1.000'] [Step 74 / Rank 0] Tasks: ['Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1386, 1366, 1386, 1387, 1386, 1387, 1368, 1369, 1368, 1369, 1369, 1369, 1369, 1369, 1371, 1370, 1371, 1370, 1389, 1371, 1371, 1390, 1373, 1371, 1372, 1374, 1373, 1372, 1372, 1373, 1392, 1392, 1375, 1375, 1375, 1375, 1375, 1375, 1394, 1394, 1395, 1394, 1376, 1376, 1376, 1375, 1376] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 74 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43164] → Tgt Spa: ['1.000'] [Step 74 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16769, 16782, 16771] → Tgt Spa: ['1.000', 
'1.000', '1.000'] [Step 74 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16769, 16782, 16771] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 74 / Rank 1] Tasks: ['Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1386, 1366, 1386, 1387, 1386, 1387, 1368, 1369, 1368, 1369, 1369, 1369, 1369, 1369, 1371, 1370, 1371, 1370, 1389, 1371, 1371, 1390, 1373, 1371, 1372, 1374, 1373, 1372, 1372, 1373, 1392, 1392, 1375, 1375, 1375, 1375, 1375, 1375, 1394, 1394, 1395, 1394, 1376, 1376, 1376, 1375, 1376] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 74 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [51820] → Tgt Spa: ['1.000'] [Step 74 / Rank 6] Tasks: ['Single QA'] | Lens: [62853] → Tgt Spa: ['0.350'] [Step 74 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [51820] → Tgt Spa: ['1.000'] [Step 74 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [27084, 27085] → Tgt Spa: ['0.350', '0.350'] [Step 74 / 
Rank 0] Tasks: ['Single QA'] | Lens: [45947] → Tgt Spa: ['0.350'] [Step 74 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [27084, 27085] → Tgt Spa: ['0.350', '0.350'] [Step 74 / Rank 7] Tasks: ['Single QA'] | Lens: [62853] → Tgt Spa: ['0.350'] [Step 74 / Rank 1] Tasks: ['Single QA'] | Lens: [45947] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:52:40,539 >> @ 74 | Loss: 2.1826 | LM: 2.0992 | Reg: 0.0834 | Spa(Avg): 0.410 [INFO|lh_trainer.py:797] 2026-02-16 21:52:40,539 >> Statistic -> Code | Spa: 0.406 | Tgt: 1.000 | Z-Loss: 0.149 | [INFO|lh_trainer.py:797] 2026-02-16 21:52:40,539 >> Statistic -> In-Context | Spa: 0.448 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:52:40,539 >> Statistic -> MultiHop | Spa: 0.412 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:52:40,539 >> Statistic -> Single | Spa: 0.400 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:52:40,539 >> Statistic -> Summarization | Spa: 0.409 | Tgt: 1.000 | Z-Loss: 0.185 | [INFO|lh_trainer.py:810] 2026-02-16 21:52:40,541 >> [Micro-Log] {"loss": 2.182629125813643, "lm_loss": 2.0991871965428195, "reg_loss": 0.08344192595298712, "model_sparsity(avg)": 0.4100813443462054, "Spa-Single QA sparsity": 0.39975845295449963, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02756360533606747, "Spa-Code sparsity": 0.4055555462837219, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1494372755289078, "Spa-Summarization sparsity": 0.4092592477798462, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.18484103282292683, "Spa-MultiHop QA sparsity": 0.41217319229069876, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.020215402064529958, "Spa-In-Context Learning sparsity": 0.4479166641831398, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15403064526617527, "step": 74, "current_tau": 1.3189092874526978, 
"lambda1 Single QA": 0.5078125, "lambda2 MultiHop QA": 0.259765625, "lambda3 Summarization": 0.07666015625, "lambda4 Code": 0.1748046875} [INFO|lh_trainer.py:331] 2026-02-16 21:53:06,118 >> {'loss': 13.0958, 'grad_norm': 1.078611135482788, 'learning_rate': 0.000495813727309616, 'epoch': 0.07898894154818326, 'num_input_tokens_seen': 185180996, 'completed': '25.00% (75 / 300)', 'remaining time': '10:33:55', 'throughput': '6664.66', 'gpu_mem_free': '11633MB', 'step': 75} [Step 75 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [23854, 23855] → Tgt Spa: ['1.000', '1.000'] [Step 75 / Rank 0] Tasks: ['Single QA'] | Lens: [40416] → Tgt Spa: ['0.350'] [Step 75 / Rank 4] Tasks: ['Single QA'] | Lens: [58253] → Tgt Spa: ['0.350'] [Step 75 / Rank 5] Tasks: ['Single QA'] | Lens: [58253] → Tgt Spa: ['0.350'] [Step 75 / Rank 1] Tasks: ['Single QA'] | Lens: [40416] → Tgt Spa: ['0.350'] [Step 75 / Rank 3] Tasks: ['Single QA'] | Lens: [48505] → Tgt Spa: ['0.350'] [Step 75 / Rank 2] Tasks: ['Single QA'] | Lens: [48505] → Tgt Spa: ['0.350'] [Step 75 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [23854, 23855] → Tgt Spa: ['1.000', '1.000'] [Step 75 / Rank 6] Tasks: ['Single QA'] | Lens: [54353] → Tgt Spa: ['0.350'] [Step 75 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43315] → Tgt Spa: ['1.000'] [Step 75 / Rank 7] Tasks: ['Single QA'] | Lens: [54353] → Tgt Spa: ['0.350'] [Step 75 / Rank 3] Tasks: ['Single QA'] | Lens: [65040] → Tgt Spa: ['0.350'] [Step 75 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43315] → Tgt Spa: ['1.000'] [Step 75 / Rank 4] Tasks: ['Code'] | Lens: [37757] → Tgt Spa: ['1.000'] [Step 75 / Rank 5] Tasks: ['Code'] | Lens: [37757] → Tgt Spa: ['1.000'] [Step 75 / Rank 2] Tasks: ['Single QA'] | Lens: [65040] → Tgt Spa: ['0.350'] [Step 75 / Rank 6] Tasks: ['Summarization'] | Lens: [41785] → Tgt Spa: ['1.000'] [Step 75 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [61171] → Tgt Spa: ['1.000'] [Step 75 / Rank 0] Tasks: ['Code'] | Lens: [41129] → Tgt Spa: ['1.000'] [Step 
75 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61171] → Tgt Spa: ['1.000'] [Step 75 / Rank 1] Tasks: ['Code'] | Lens: [41129] → Tgt Spa: ['1.000'] [Step 75 / Rank 3] Tasks: ['Single QA'] | Lens: [49175] → Tgt Spa: ['0.350'] [Step 75 / Rank 2] Tasks: ['Single QA'] | Lens: [49175] → Tgt Spa: ['0.350'] [Step 75 / Rank 7] Tasks: ['Summarization'] | Lens: [41785] → Tgt Spa: ['1.000'] [Step 75 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44610] → Tgt Spa: ['1.000'] [Step 75 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [19624, 19627, 19628] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 75 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44610] → Tgt Spa: ['1.000'] [Step 75 / Rank 5] Tasks: ['Single QA'] | Lens: [59764] → Tgt Spa: ['0.350'] [Step 75 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [19624, 19627, 19628] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 75 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [25798, 25799] → Tgt Spa: ['1.000', '1.000'] [Step 75 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [25798, 25799] → Tgt Spa: ['1.000', '1.000'] [Step 75 / Rank 4] Tasks: ['Single QA'] | Lens: [59764] → Tgt Spa: ['0.350'] [Step 75 / Rank 3] Tasks: ['Single QA'] | Lens: [51731] → Tgt Spa: ['0.350'] [Step 75 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26264, 26264] → Tgt Spa: ['0.350', '1.000'] [Step 75 / Rank 5] Tasks: ['Code'] | Lens: [58838] → Tgt Spa: ['1.000'] [Step 75 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26264, 26264] → Tgt Spa: ['0.350', '1.000'] [Step 75 / Rank 4] Tasks: ['Code'] | Lens: [58838] → Tgt Spa: ['1.000'] [Step 75 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [18341, 18342, 18343] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 75 / Rank 2] Tasks: ['Single QA'] | Lens: [51731] → Tgt Spa: ['0.350'] [Step 75 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [18341, 18342, 18343] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 75 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32366, 32366] → Tgt 
Spa: ['0.350', '0.350'] [Step 75 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26420, 26439] → Tgt Spa: ['1.000', '1.000'] [Step 75 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7734, 7735, 7736, 7736, 7737, 7737, 7737, 7737] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 75 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [30377, 30386] → Tgt Spa: ['0.350', '1.000'] [Step 75 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7734, 7735, 7736, 7736, 7737, 7737, 7737, 7737] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 75 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [30377, 30386] → Tgt Spa: ['0.350', '1.000'] [Step 75 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26420, 26439] → Tgt Spa: ['1.000', '1.000'] [Step 75 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32366, 32366] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 21:55:47,359 >> @ 75 | Loss: 1.8863 | LM: 1.7882 | Reg: 0.0982 | Spa(Avg): 0.373 [INFO|lh_trainer.py:797] 2026-02-16 21:55:47,359 >> Statistic -> Code | Spa: 0.398 | Tgt: 1.000 | Z-Loss: 0.153 | [INFO|lh_trainer.py:797] 2026-02-16 21:55:47,359 >> Statistic -> In-Context | Spa: 0.439 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:55:47,359 >> Statistic -> MultiHop | Spa: 0.412 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:55:47,359 >> Statistic -> Single | Spa: 0.358 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:55:47,359 >> Statistic -> Summarization | Spa: 0.375 | Tgt: 1.000 | Z-Loss: 0.205 | [INFO|lh_trainer.py:810] 2026-02-16 21:55:47,362 >> [Micro-Log] {"loss": 1.8863242417573929, "lm_loss": 1.7881554619719584, "reg_loss": 0.09816879587015137, 
"model_sparsity(avg)": 0.37292630101243657, "Spa-Single QA sparsity": 0.357638880610466, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.025263780599925668, "Spa-In-Context Learning sparsity": 0.4388888835906982, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15761059522628784, "Spa-Code sparsity": 0.39781745416777475, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15306438505649567, "Spa-Summarization sparsity": 0.3749999701976776, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.20462339371442795, "Spa-MultiHop QA sparsity": 0.41217319229069876, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.020215402064529958, "step": 75, "current_tau": 1.3147047758102417, "lambda1 Single QA": 0.51171875, "lambda2 MultiHop QA": 0.259765625, "lambda3 Summarization": 0.07763671875, "lambda4 Code": 0.17578125} [INFO|lh_trainer.py:331] 2026-02-16 21:56:05,006 >> {'loss': 11.3179, 'grad_norm': 1.5402764081954956, 'learning_rate': 0.0004951963205811756, 'epoch': 0.08004212743549237, 'num_input_tokens_seen': 187704644, 'completed': '25.33% (76 / 300)', 'remaining time': '10:31:35', 'throughput': '7053.72', 'gpu_mem_free': '9831MB', 'step': 76} [Step 76 / Rank 0] Tasks: ['Single QA'] | Lens: [43936] → Tgt Spa: ['0.350'] [Step 76 / Rank 2] Tasks: ['Single QA'] | Lens: [43262] → Tgt Spa: ['0.350'] [Step 76 / Rank 7] Tasks: ['Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'Code', 'In-Context Learning'] | Lens: [4186, 4185, 4178, 4179, 4179, 4181, 4181, 4181, 4182, 4182, 4183, 4202, 4203, 4192, 4186] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 76 / 
Rank 1] Tasks: ['Single QA'] | Lens: [43936] → Tgt Spa: ['0.350'] [Step 76 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38865] → Tgt Spa: ['1.000'] [Step 76 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [38865] → Tgt Spa: ['1.000'] [Step 76 / Rank 3] Tasks: ['Single QA'] | Lens: [43262] → Tgt Spa: ['0.350'] [Step 76 / Rank 6] Tasks: ['Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'Code', 'In-Context Learning'] | Lens: [4186, 4185, 4178, 4179, 4179, 4181, 4181, 4181, 4182, 4182, 4183, 4202, 4203, 4192, 4186] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 76 / Rank 5] Tasks: ['Summarization', 'Code'] | Lens: [26936, 26926] → Tgt Spa: ['1.000', '1.000'] [Step 76 / Rank 6] Tasks: ['Single QA'] | Lens: [52402] → Tgt Spa: ['0.350'] [Step 76 / Rank 4] Tasks: ['Summarization', 'Code'] | Lens: [26936, 26926] → Tgt Spa: ['1.000', '1.000'] [Step 76 / Rank 2] Tasks: ['Code'] | Lens: [53781] → Tgt Spa: ['1.000'] [Step 76 / Rank 3] Tasks: ['Code'] | Lens: [53781] → Tgt Spa: ['1.000'] [Step 76 / Rank 7] Tasks: ['Single QA'] | Lens: [52402] → Tgt Spa: ['0.350'] [Step 76 / Rank 0] Tasks: ['Single QA'] | Lens: [39993] → Tgt Spa: ['0.350'] [Step 76 / Rank 1] Tasks: ['Single QA'] | Lens: [39993] → Tgt Spa: ['0.350'] [Step 76 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [53357] → Tgt Spa: ['1.000'] [Step 76 / Rank 6] Tasks: ['Single QA'] | Lens: [40583] → Tgt Spa: ['0.350'] [Step 76 / Rank 2] Tasks: ['Code'] | Lens: [38038] → Tgt Spa: ['1.000'] [Step 76 / Rank 7] Tasks: ['Single QA'] | Lens: [40583] → Tgt Spa: ['0.350'] [Step 76 / Rank 1] Tasks: ['Code'] | Lens: [56658] → Tgt Spa: ['1.000'] [Step 76 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [53357] → Tgt Spa: 
['1.000'] [Step 76 / Rank 3] Tasks: ['Code'] | Lens: [38038] → Tgt Spa: ['1.000'] [Step 76 / Rank 0] Tasks: ['Code'] | Lens: [56658] → Tgt Spa: ['1.000'] [Step 76 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58819] → Tgt Spa: ['1.000'] [Step 76 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [44484] → Tgt Spa: ['1.000'] [Step 76 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58819] → Tgt Spa: ['1.000'] [Step 76 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [21192, 21194, 21207] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 76 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [44484] → Tgt Spa: ['1.000'] [Step 76 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [25873, 25880] → Tgt Spa: ['1.000', '1.000'] [Step 76 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [25873, 25880] → Tgt Spa: ['1.000', '1.000'] [Step 76 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [21192, 21194, 21207] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 76 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14329, 14324, 14323, 14330] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 76 / Rank 7] Tasks: ['Single QA'] | Lens: [65098] → Tgt Spa: ['0.350'] [Step 76 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40105] → Tgt Spa: ['1.000'] [Step 76 / Rank 6] Tasks: ['Single QA'] | Lens: [65098] → Tgt Spa: ['0.350'] [Step 76 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14329, 14324, 14323, 14330] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 76 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [45992] → Tgt Spa: ['1.000'] [Step 76 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40105] → Tgt Spa: ['1.000'] [Step 76 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [45992] → Tgt Spa: ['1.000'] [Step 76 / Rank 2] Tasks: ['Single QA'] | Lens: [44182] → Tgt Spa: ['0.350'] [Step 76 / Rank 1] Tasks: ['Code'] | Lens: [57360] → Tgt Spa: ['1.000'] [Step 76 / Rank 4] Tasks: ['Code'] | Lens: 
[39730] → Tgt Spa: ['1.000'] [Step 76 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43121] → Tgt Spa: ['1.000'] [Step 76 / Rank 3] Tasks: ['Single QA'] | Lens: [44182] → Tgt Spa: ['0.350'] [Step 76 / Rank 0] Tasks: ['Code'] | Lens: [57360] → Tgt Spa: ['1.000'] [Step 76 / Rank 5] Tasks: ['Code'] | Lens: [39730] → Tgt Spa: ['1.000'] [Step 76 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43121] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 21:58:28,617 >> @ 76 | Loss: 2.0627 | LM: 1.9489 | Reg: 0.1138 | Spa(Avg): 0.415 [INFO|lh_trainer.py:797] 2026-02-16 21:58:28,617 >> Statistic -> Code | Spa: 0.389 | Tgt: 1.000 | Z-Loss: 0.157 | [INFO|lh_trainer.py:797] 2026-02-16 21:58:28,617 >> Statistic -> In-Context | Spa: 0.471 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:58:28,617 >> Statistic -> MultiHop | Spa: 0.412 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:58:28,617 >> Statistic -> Single | Spa: 0.413 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 21:58:28,617 >> Statistic -> Summarization | Spa: 0.427 | Tgt: 1.000 | Z-Loss: 0.176 | [INFO|lh_trainer.py:810] 2026-02-16 21:58:28,619 >> [Micro-Log] {"loss": 2.0626835574706397, "lm_loss": 1.9488847283646464, "reg_loss": 0.1137988116630974, "model_sparsity(avg)": 0.41532600050171214, "Spa-Single QA sparsity": 0.4128787842663852, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03751192804933949, "Spa-Code sparsity": 0.38888888634168184, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15679795352312234, "Spa-In-Context Learning sparsity": 0.4714052221354316, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14882644178236232, "Spa-Summarization sparsity": 0.4270833134651184, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17557507753372192, "Spa-MultiHop QA sparsity": 0.41217319229069876, "Spa-MultiHop QA target_sparsity": 0.349609375, 
"Spa-MultiHop QA log_z_loss": 0.020215402064529958, "step": 76, "current_tau": 1.3104804754257202, "lambda1 Single QA": 0.51171875, "lambda2 MultiHop QA": 0.259765625, "lambda3 Summarization": 0.07861328125, "lambda4 Code": 0.1767578125} [INFO|lh_trainer.py:331] 2026-02-16 21:58:49,898 >> {'loss': 12.3761, 'grad_norm': 1.763607382774353, 'learning_rate': 0.0004945369007297615, 'epoch': 0.08109531332280147, 'num_input_tokens_seen': 190082764, 'completed': '25.67% (77 / 300)', 'remaining time': '10:28:34', 'throughput': '7211.13', 'gpu_mem_free': '7813MB', 'step': 77} [Step 77 / Rank 3] Tasks: ['Code'] | Lens: [42145] → Tgt Spa: ['1.000'] [Step 77 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25853, 25855] → Tgt Spa: ['1.000', '0.350'] [Step 77 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [21892, 21873] → Tgt Spa: ['1.000', '1.000'] [Step 77 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42906] → Tgt Spa: ['1.000'] [Step 77 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [21892, 21873] → Tgt Spa: ['1.000', '1.000'] [Step 77 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42906] → Tgt Spa: ['1.000'] [Step 77 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25853, 25855] → Tgt Spa: ['1.000', '0.350'] [Step 77 / Rank 2] Tasks: ['Code'] | Lens: [42145] → Tgt Spa: ['1.000'] [Step 77 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58418] → Tgt Spa: ['1.000'] [Step 77 / Rank 3] Tasks: ['Code'] | Lens: [45253] → Tgt Spa: ['1.000'] [Step 77 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41919] → Tgt Spa: ['1.000'] [Step 77 / Rank 4] Tasks: ['Code'] | Lens: [37316] → Tgt Spa: ['1.000'] [Step 77 / Rank 5] Tasks: ['Code'] | Lens: [37316] → Tgt Spa: ['1.000'] [Step 77 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41919] → Tgt Spa: ['1.000'] [Step 77 / Rank 2] Tasks: ['Code'] | Lens: [45253] → Tgt Spa: ['1.000'] [Step 77 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58418] → Tgt Spa: ['1.000'] 
[Step 77 / Rank 6] Tasks: ['Single QA'] | Lens: [38399] → Tgt Spa: ['0.350'] [Step 77 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26835, 26834] → Tgt Spa: ['1.000', '1.000'] [Step 77 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22590, 22591] → Tgt Spa: ['1.000', '1.000'] [Step 77 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22590, 22591] → Tgt Spa: ['1.000', '1.000'] [Step 77 / Rank 3] Tasks: ['Summarization'] | Lens: [46817] → Tgt Spa: ['1.000'] [Step 77 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26835, 26834] → Tgt Spa: ['1.000', '1.000'] [Step 77 / Rank 7] Tasks: ['Single QA'] | Lens: [38399] → Tgt Spa: ['0.350'] [Step 77 / Rank 2] Tasks: ['Summarization'] | Lens: [46817] → Tgt Spa: ['1.000'] [Step 77 / Rank 1] Tasks: ['In-Context Learning', 'Summarization', 'Single QA'] | Lens: [20653, 20673, 20657] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 77 / Rank 3] Tasks: ['Single QA'] | Lens: [56499] → Tgt Spa: ['0.350'] [Step 77 / Rank 7] Tasks: ['Single QA'] | Lens: [49069] → Tgt Spa: ['0.350'] [Step 77 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41664] → Tgt Spa: ['1.000'] [Step 77 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41664] → Tgt Spa: ['1.000'] [Step 77 / Rank 0] Tasks: ['In-Context Learning', 'Summarization', 'Single QA'] | Lens: [20653, 20673, 20657] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 77 / Rank 6] Tasks: ['Single QA'] | Lens: [49069] → Tgt Spa: ['0.350'] [Step 77 / Rank 2] Tasks: ['Single QA'] | Lens: [56499] → Tgt Spa: ['0.350'] [Step 77 / Rank 2] Tasks: ['Code'] | Lens: [62674] → Tgt Spa: ['1.000'] [Step 77 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [36143] → Tgt Spa: ['1.000'] [Step 77 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44548] → Tgt Spa: ['1.000'] [Step 77 / Rank 3] Tasks: ['Code'] | Lens: [62674] → Tgt Spa: ['1.000'] [Step 77 / Rank 1] Tasks: ['Code'] | Lens: [54390] → Tgt Spa: ['1.000'] [Step 77 
/ Rank 5] Tasks: ['In-Context Learning'] | Lens: [36143] → Tgt Spa: ['1.000'] [Step 77 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44548] → Tgt Spa: ['1.000'] [Step 77 / Rank 0] Tasks: ['Code'] | Lens: [54390] → Tgt Spa: ['1.000'] [Step 77 / Rank 4] Tasks: ['Code'] | Lens: [37197] → Tgt Spa: ['1.000'] [Step 77 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43169] → Tgt Spa: ['1.000'] [Step 77 / Rank 3] Tasks: ['Single QA'] | Lens: [52804] → Tgt Spa: ['0.350'] [Step 77 / Rank 2] Tasks: ['Single QA'] | Lens: [52804] → Tgt Spa: ['0.350'] [Step 77 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55384] → Tgt Spa: ['1.000'] [Step 77 / Rank 5] Tasks: ['Code'] | Lens: [37197] → Tgt Spa: ['1.000'] [Step 77 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43169] → Tgt Spa: ['1.000'] [Step 77 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55384] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:01:05,835 >> @ 77 | Loss: 2.1344 | LM: 1.9994 | Reg: 0.1350 | Spa(Avg): 0.433 [INFO|lh_trainer.py:797] 2026-02-16 22:01:05,836 >> Statistic -> Code | Spa: 0.396 | Tgt: 1.000 | Z-Loss: 0.155 | [INFO|lh_trainer.py:797] 2026-02-16 22:01:05,836 >> Statistic -> In-Context | Spa: 0.456 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:01:05,836 >> Statistic -> MultiHop | Spa: 0.412 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:01:05,836 >> Statistic -> Single | Spa: 0.428 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:01:05,836 >> Statistic -> Summarization | Spa: 0.468 | Tgt: 1.000 | Z-Loss: 0.156 | [INFO|lh_trainer.py:810] 2026-02-16 22:01:05,838 >> [Micro-Log] {"loss": 2.13443873077631, "lm_loss": 1.9994173847759764, "reg_loss": 0.13502133645427725, "model_sparsity(avg)": 0.43344906469186145, "Spa-Summarization sparsity": 0.46759257713953656, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15553641815980276, "Spa-In-Context Learning sparsity": 0.45648147265116373, 
"Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1537255158027013, "Spa-Single QA sparsity": 0.42824073632558185, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04292049255066862, "Spa-Code sparsity": 0.3958333233992259, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15535538643598557, "Spa-MultiHop QA sparsity": 0.41217319229069876, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.020215402064529958, "step": 77, "current_tau": 1.3062376976013184, "lambda1 Single QA": 0.51171875, "lambda2 MultiHop QA": 0.259765625, "lambda3 Summarization": 0.07958984375, "lambda4 Code": 0.177734375} [INFO|lh_trainer.py:331] 2026-02-16 22:01:26,225 >> {'loss': 12.8066, 'grad_norm': 2.118359327316284, 'learning_rate': 0.0004938355807435702, 'epoch': 0.08214849921011058, 'num_input_tokens_seen': 192368804, 'completed': '26.00% (78 / 300)', 'remaining time': '10:25:08', 'throughput': '7311.75', 'gpu_mem_free': '8467MB', 'step': 78} [Step 78 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61819] → Tgt Spa: ['1.000'] [Step 78 / Rank 0] Tasks: ['Single QA'] | Lens: [51491] → Tgt Spa: ['0.350'] [Step 78 / Rank 4] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 78 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61819] → Tgt Spa: ['1.000'] [Step 78 / Rank 6] Tasks: ['Summarization'] | Lens: [35926] → Tgt Spa: ['1.000'] [Step 78 / Rank 5] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 78 / Rank 1] Tasks: ['Single QA'] | Lens: [51491] → Tgt Spa: ['0.350'] [Step 78 / Rank 7] Tasks: ['Summarization'] | Lens: [35926] → Tgt Spa: ['1.000'] [Step 78 / Rank 3] Tasks: ['Code'] | Lens: [59798] → Tgt Spa: ['1.000'] [Step 78 / Rank 4] Tasks: ['Single QA'] | Lens: [40707] → Tgt Spa: ['0.350'] [Step 78 / Rank 6] Tasks: ['Single QA'] | Lens: [58666] → Tgt Spa: ['0.350'] [Step 78 / Rank 7] Tasks: ['Single QA'] | Lens: [58666] → Tgt Spa: ['0.350'] [Step 78 / Rank 
5] Tasks: ['Single QA'] | Lens: [40707] → Tgt Spa: ['0.350'] [Step 78 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [25864, 25864] → Tgt Spa: ['0.350', '0.350'] [Step 78 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25864, 25864] → Tgt Spa: ['0.350', '0.350'] [Step 78 / Rank 2] Tasks: ['Code'] | Lens: [59798] → Tgt Spa: ['1.000'] [Step 78 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22991, 22992] → Tgt Spa: ['1.000', '1.000'] [Step 78 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [45554] → Tgt Spa: ['1.000'] [Step 78 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [45554] → Tgt Spa: ['1.000'] [Step 78 / Rank 0] Tasks: ['Single QA'] | Lens: [41245] → Tgt Spa: ['0.350'] [Step 78 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22991, 22992] → Tgt Spa: ['1.000', '1.000'] [Step 78 / Rank 1] Tasks: ['Single QA'] | Lens: [41245] → Tgt Spa: ['0.350'] [Step 78 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32697, 32697] → Tgt Spa: ['0.350', '0.350'] [Step 78 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32697, 32697] → Tgt Spa: ['0.350', '0.350'] [Step 78 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15927, 15927, 15927, 15927] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 78 / Rank 0] Tasks: ['Single QA'] | Lens: [49700] → Tgt Spa: ['0.350'] [Step 78 / Rank 2] Tasks: ['Single QA', 'Code', 'Code', 'Single QA'] | Lens: [15618, 15626, 15627, 15623] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 78 / Rank 3] Tasks: ['Single QA', 'Code', 'Code', 'Single QA'] | Lens: [15618, 15626, 15627, 15623] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 78 / Rank 7] Tasks: ['Summarization', 'Single QA'] | Lens: [22489, 22472] → Tgt Spa: ['1.000', '0.350'] [Step 78 / Rank 1] Tasks: ['Single QA'] | Lens: [49700] → Tgt Spa: ['0.350'] [Step 78 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15927, 15927, 15927, 15927] → 
Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 78 / Rank 6] Tasks: ['Summarization', 'Single QA'] | Lens: [22489, 22472] → Tgt Spa: ['1.000', '0.350'] [Step 78 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [34151] → Tgt Spa: ['1.000'] [Step 78 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42022] → Tgt Spa: ['1.000'] [Step 78 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32523, 32522] → Tgt Spa: ['0.350', '0.350'] [Step 78 / Rank 3] Tasks: ['Single QA'] | Lens: [47754] → Tgt Spa: ['0.350'] [Step 78 / Rank 2] Tasks: ['Single QA'] | Lens: [47754] → Tgt Spa: ['0.350'] [Step 78 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32523, 32522] → Tgt Spa: ['0.350', '0.350'] [Step 78 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42022] → Tgt Spa: ['1.000'] [Step 78 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [34151] → Tgt Spa: ['1.000'] [Step 78 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [21034, 21034, 21034] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 78 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [50211] → Tgt Spa: ['1.000'] [Step 78 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [21034, 21034, 21034] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 78 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [48763] → Tgt Spa: ['1.000'] [Step 78 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24594, 24595] → Tgt Spa: ['1.000', '1.000'] [Step 78 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [48763] → Tgt Spa: ['1.000'] [Step 78 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [50211] → Tgt Spa: ['1.000'] [Step 78 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24594, 24595] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:03:49,445 >> @ 78 | Loss: 1.9868 | LM: 1.8909 | Reg: 0.0960 | Spa(Avg): 0.435 [INFO|lh_trainer.py:797] 2026-02-16 22:03:49,446 >> Statistic -> Code | Spa: 0.440 | Tgt: 1.000 | Z-Loss: 0.142 | [INFO|lh_trainer.py:797] 2026-02-16 22:03:49,446 >> 
Statistic -> In-Context | Spa: 0.451 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:03:49,446 >> Statistic -> MultiHop | Spa: 0.493 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:03:49,446 >> Statistic -> Single | Spa: 0.427 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:03:49,446 >> Statistic -> Summarization | Spa: 0.458 | Tgt: 1.000 | Z-Loss: 0.160 | [INFO|lh_trainer.py:810] 2026-02-16 22:03:49,448 >> [Micro-Log] {"loss": 1.9868117456013958, "lm_loss": 1.8908538867544848, "reg_loss": 0.0959578564700981, "model_sparsity(avg)": 0.4351851853231589, "Spa-Single QA sparsity": 0.4266975323359172, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04113518591556284, "Spa-In-Context Learning sparsity": 0.45138888955116274, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15556903332471847, "Spa-Code sparsity": 0.43981480598449707, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1417338028550148, "Spa-MultiHop QA sparsity": 0.4930555522441864, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.045052288100123405, "Spa-Summarization sparsity": 0.4583333134651184, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1602783203125, "step": 78, "current_tau": 1.3019778728485107, "lambda1 Single QA": 0.51171875, "lambda2 MultiHop QA": 0.26171875, "lambda3 Summarization": 0.080078125, "lambda4 Code": 0.1787109375} [INFO|lh_trainer.py:331] 2026-02-16 22:04:07,222 >> {'loss': 11.9209, 'grad_norm': 1.365455150604248, 'learning_rate': 0.0004930924807901711, 'epoch': 0.0832016850974197, 'num_input_tokens_seen': 194857438, 'completed': '26.33% (79 / 300)', 'remaining time': '10:21:57', 'throughput': '7728.83', 'gpu_mem_free': '11481MB', 'step': 79} [Step 79 / Rank 5] Tasks: ['Single QA'] | Lens: [37879] → Tgt Spa: ['0.350'] [Step 79 / Rank 4] Tasks: ['Single QA'] | Lens: [37879] → Tgt Spa: 
['0.350'] [Step 79 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [17521, 17522, 17524] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 79 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [17521, 17522, 17524] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 79 / Rank 3] Tasks: ['Single QA'] | Lens: [46382] → Tgt Spa: ['0.350'] [Step 79 / Rank 2] Tasks: ['Single QA'] | Lens: [46382] → Tgt Spa: ['0.350'] [Step 79 / Rank 1] Tasks: ['Single QA'] | Lens: [34964] → Tgt Spa: ['0.350'] [Step 79 / Rank 0] Tasks: ['Single QA'] | Lens: [34964] → Tgt Spa: ['0.350'] [Step 79 / Rank 3] Tasks: ['Code'] | Lens: [34273] → Tgt Spa: ['1.000'] [Step 79 / Rank 6] Tasks: ['Code'] | Lens: [38214] → Tgt Spa: ['1.000'] [Step 79 / Rank 7] Tasks: ['Code'] | Lens: [38214] → Tgt Spa: ['1.000'] [Step 79 / Rank 5] Tasks: ['Single QA'] | Lens: [53827] → Tgt Spa: ['0.350'] [Step 79 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32231, 32233] → Tgt Spa: ['0.350', '0.350'] [Step 79 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32231, 32233] → Tgt Spa: ['0.350', '0.350'] [Step 79 / Rank 2] Tasks: ['Code'] | Lens: [34273] → Tgt Spa: ['1.000'] [Step 79 / Rank 4] Tasks: ['Single QA'] | Lens: [53827] → Tgt Spa: ['0.350'] [Step 79 / Rank 6] Tasks: ['Code'] | Lens: [37149] → Tgt Spa: ['1.000'] [Step 79 / Rank 4] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [19124, 19144, 19133] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 79 / Rank 0] Tasks: ['Single QA'] | Lens: [53740] → Tgt Spa: ['0.350'] [Step 79 / Rank 7] Tasks: ['Code'] | Lens: [37149] → Tgt Spa: ['1.000'] [Step 79 / Rank 1] Tasks: ['Single QA'] | Lens: [53740] → Tgt Spa: ['0.350'] [Step 79 / Rank 5] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [19124, 19144, 19133] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 79 / Rank 3] Tasks: ['Single QA'] | Lens: [52964] → Tgt Spa: ['0.350'] [Step 79 / Rank 2] Tasks: ['Single QA'] | Lens: [52964] → Tgt Spa: ['0.350'] [Step 79 / Rank 7] Tasks: ['In-Context 
Learning'] | Lens: [61200] → Tgt Spa: ['1.000'] [Step 79 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43969] → Tgt Spa: ['1.000'] [Step 79 / Rank 5] Tasks: ['Single QA'] | Lens: [34891] → Tgt Spa: ['0.350'] [Step 79 / Rank 1] Tasks: ['Single QA'] | Lens: [36770] → Tgt Spa: ['0.350'] [Step 79 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61200] → Tgt Spa: ['1.000'] [Step 79 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43969] → Tgt Spa: ['1.000'] [Step 79 / Rank 4] Tasks: ['Single QA'] | Lens: [34891] → Tgt Spa: ['0.350'] [Step 79 / Rank 0] Tasks: ['Single QA'] | Lens: [36770] → Tgt Spa: ['0.350'] [Step 79 / Rank 5] Tasks: ['Single QA'] | Lens: [41283] → Tgt Spa: ['0.350'] [Step 79 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26751, 26759] → Tgt Spa: ['1.000', '1.000'] [Step 79 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [62265] → Tgt Spa: ['1.000'] [Step 79 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [62265] → Tgt Spa: ['1.000'] [Step 79 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26751, 26759] → Tgt Spa: ['1.000', '1.000'] [Step 79 / Rank 6] Tasks: ['Code'] | Lens: [46548] → Tgt Spa: ['1.000'] [Step 79 / Rank 4] Tasks: ['Single QA'] | Lens: [41283] → Tgt Spa: ['0.350'] [Step 79 / Rank 7] Tasks: ['Code'] | Lens: [46548] → Tgt Spa: ['1.000'] [Step 79 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [48767] → Tgt Spa: ['1.000'] [Step 79 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26009, 25994] → Tgt Spa: ['1.000', '1.000'] [Step 79 / Rank 0] Tasks: ['Single QA'] | Lens: [45457] → Tgt Spa: ['0.350'] [Step 79 / Rank 1] Tasks: ['Single QA'] | Lens: [45457] → Tgt Spa: ['0.350'] [Step 79 / Rank 3] Tasks: ['Single QA'] | Lens: [57268] → Tgt Spa: ['0.350'] [Step 79 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26009, 25994] → Tgt Spa: ['1.000', '1.000'] [Step 79 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [48767] → Tgt Spa: ['1.000'] [Step 79 / 
Rank 2] Tasks: ['Single QA'] | Lens: [57268] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:06:30,272 >> @ 79 | Loss: 2.0554 | LM: 1.9598 | Reg: 0.0956 | Spa(Avg): 0.433 [INFO|lh_trainer.py:797] 2026-02-16 22:06:30,272 >> Statistic -> Code | Spa: 0.431 | Tgt: 1.000 | Z-Loss: 0.146 | [INFO|lh_trainer.py:797] 2026-02-16 22:06:30,272 >> Statistic -> In-Context | Spa: 0.470 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:06:30,272 >> Statistic -> MultiHop | Spa: 0.493 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:06:30,272 >> Statistic -> Single | Spa: 0.419 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:06:30,273 >> Statistic -> Summarization | Spa: 0.382 | Tgt: 1.000 | Z-Loss: 0.203 | [INFO|lh_trainer.py:810] 2026-02-16 22:06:30,275 >> [Micro-Log] {"loss": 2.055407129228115, "lm_loss": 1.9597695972770452, "reg_loss": 0.095637536442761, "model_sparsity(avg)": 0.43335262065132457, "Spa-Single QA sparsity": 0.4188034121806805, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03959938337524923, "Spa-In-Context Learning sparsity": 0.4704861044883728, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15083528496325016, "Spa-Code sparsity": 0.4305555522441864, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14567057602107525, "Spa-Summarization sparsity": 0.3819444477558136, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.20321635901927948, "Spa-MultiHop QA sparsity": 0.4930555522441864, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.045052288100123405, "step": 79, "current_tau": 1.2977021932601929, "lambda1 Single QA": 0.51171875, "lambda2 MultiHop QA": 0.26171875, "lambda3 Summarization": 0.0810546875, "lambda4 Code": 0.1796875} [INFO|lh_trainer.py:331] 2026-02-16 22:06:52,137 >> {'loss': 12.3324, 'grad_norm': 1.2654019594192505, 'learning_rate': 
0.0004923077281959159, 'epoch': 0.08425487098472881, 'num_input_tokens_seen': 197152948, 'completed': '26.67% (80 / 300)', 'remaining time': '10:18:57', 'throughput': '6959.67', 'gpu_mem_free': '11957MB', 'step': 80} [Step 80 / Rank 0] Tasks: ['Summarization', 'Single QA'] | Lens: [32628, 32613] → Tgt Spa: ['1.000', '0.350'] [Step 80 / Rank 4] Tasks: ['Single QA'] | Lens: [42964] → Tgt Spa: ['0.350'] [Step 80 / Rank 3] Tasks: ['Code'] | Lens: [37838] → Tgt Spa: ['1.000'] [Step 80 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24780, 24781] → Tgt Spa: ['1.000', '1.000'] [Step 80 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24780, 24781] → Tgt Spa: ['1.000', '1.000'] [Step 80 / Rank 5] Tasks: ['Single QA'] | Lens: [42964] → Tgt Spa: ['0.350'] [Step 80 / Rank 2] Tasks: ['Code'] | Lens: [37838] → Tgt Spa: ['1.000'] [Step 80 / Rank 1] Tasks: ['Summarization', 'Single QA'] | Lens: [32628, 32613] → Tgt Spa: ['1.000', '0.350'] [Step 80 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [22192, 22185] → Tgt Spa: ['1.000', '1.000'] [Step 80 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26426, 26429] → Tgt Spa: ['1.000', '1.000'] [Step 80 / Rank 3] Tasks: ['Code'] | Lens: [37934] → Tgt Spa: ['1.000'] [Step 80 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26426, 26429] → Tgt Spa: ['1.000', '1.000'] [Step 80 / Rank 6] Tasks: ['Single QA'] | Lens: [57174] → Tgt Spa: ['0.350'] [Step 80 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [22192, 22185] → Tgt Spa: ['1.000', '1.000'] [Step 80 / Rank 2] Tasks: ['Code'] | Lens: [37934] → Tgt Spa: ['1.000'] [Step 80 / Rank 7] Tasks: ['Single QA'] | Lens: [57174] → Tgt Spa: ['0.350'] [Step 80 / Rank 4] Tasks: ['Summarization', 'Summarization'] | Lens: [28613, 28613] → Tgt Spa: ['1.000', '1.000'] [Step 80 / Rank 5] Tasks: ['Summarization', 'Summarization'] | Lens: [28613, 28613] → Tgt Spa: ['1.000', '1.000'] [Step 80 / 
Rank 0] Tasks: ['Code'] | Lens: [36532] → Tgt Spa: ['1.000'] [Step 80 / Rank 1] Tasks: ['Code'] | Lens: [36532] → Tgt Spa: ['1.000'] [Step 80 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11355, 11355, 11355, 11355, 11356] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 80 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40429] → Tgt Spa: ['1.000'] [Step 80 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40429] → Tgt Spa: ['1.000'] [Step 80 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11355, 11355, 11355, 11355, 11356] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 80 / Rank 0] Tasks: ['Single QA'] | Lens: [52409] → Tgt Spa: ['0.350'] [Step 80 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [38620] → Tgt Spa: ['1.000'] [Step 80 / Rank 3] Tasks: ['Single QA'] | Lens: [41714] → Tgt Spa: ['0.350'] [Step 80 / Rank 7] Tasks: ['Single QA'] | Lens: [43157] → Tgt Spa: ['0.350'] [Step 80 / Rank 6] Tasks: ['Single QA'] | Lens: [43157] → Tgt Spa: ['0.350'] [Step 80 / Rank 2] Tasks: ['Single QA'] | Lens: [41714] → Tgt Spa: ['0.350'] [Step 80 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38620] → Tgt Spa: ['1.000'] [Step 80 / Rank 1] Tasks: ['Single QA'] | Lens: [52409] → Tgt Spa: ['0.350'] [Step 80 / Rank 2] Tasks: ['Single QA'] | Lens: [50855] → Tgt Spa: ['0.350'] [Step 80 / Rank 5] Tasks: ['Single QA'] | Lens: [46272] → Tgt Spa: ['0.350'] [Step 80 / Rank 6] Tasks: ['Single QA'] | Lens: [43100] → Tgt Spa: ['0.350'] [Step 80 / Rank 4] Tasks: ['Single QA'] | Lens: [46272] → Tgt Spa: ['0.350'] [Step 80 / Rank 3] Tasks: ['Single QA'] | Lens: [50855] → Tgt Spa: ['0.350'] [Step 80 / Rank 7] Tasks: ['Single QA'] | Lens: [43100] → Tgt Spa: ['0.350'] [Step 80 / Rank 1] Tasks: ['Summarization', 'Single QA', 'Summarization'] | Lens: [18649, 18631, 18652] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 80 / Rank 0] Tasks: ['Summarization', 'Single QA', 
'Summarization'] | Lens: [18649, 18631, 18652] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 80 / Rank 1] Tasks: ['Single QA'] | Lens: [51218] → Tgt Spa: ['0.350'] [Step 80 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56219] → Tgt Spa: ['1.000'] [Step 80 / Rank 6] Tasks: ['MultiHop QA'] | Lens: [63715] → Tgt Spa: ['0.350'] [Step 80 / Rank 0] Tasks: ['Single QA'] | Lens: [51218] → Tgt Spa: ['0.350'] [Step 80 / Rank 7] Tasks: ['MultiHop QA'] | Lens: [63715] → Tgt Spa: ['0.350'] [Step 80 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56219] → Tgt Spa: ['1.000'] [Step 80 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [25885, 25877] → Tgt Spa: ['1.000', '1.000'] [Step 80 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [25885, 25877] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:09:01,656 >> @ 80 | Loss: 2.0761 | LM: 1.9806 | Reg: 0.0955 | Spa(Avg): 0.432 [INFO|lh_trainer.py:797] 2026-02-16 22:09:01,656 >> Statistic -> Code | Spa: 0.406 | Tgt: 1.000 | Z-Loss: 0.155 | [INFO|lh_trainer.py:797] 2026-02-16 22:09:01,656 >> Statistic -> In-Context | Spa: 0.488 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:09:01,656 >> Statistic -> MultiHop | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:09:01,656 >> Statistic -> Single | Spa: 0.413 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:09:01,657 >> Statistic -> Summarization | Spa: 0.461 | Tgt: 1.000 | Z-Loss: 0.161 | [INFO|lh_trainer.py:810] 2026-02-16 22:09:01,659 >> [Micro-Log] {"loss": 2.0760788060724735, "lm_loss": 1.980615192403396, "reg_loss": 0.09546363393504483, "model_sparsity(avg)": 0.4322723684211572, "Spa-Summarization sparsity": 0.4611111044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16107901334762573, "Spa-Single QA sparsity": 0.413194440305233, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0379915818630252, 
"Spa-In-Context Learning sparsity": 0.48765430847803753, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14576750910944408, "Spa-Code sparsity": 0.4055555462837219, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15495387017726897, "Spa-MultiHop QA sparsity": 0.375, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.006875626742839813, "step": 80, "current_tau": 1.2934119701385498, "lambda1 Single QA": 0.515625, "lambda2 MultiHop QA": 0.26171875, "lambda3 Summarization": 0.08203125, "lambda4 Code": 0.1806640625} [INFO|lh_trainer.py:331] 2026-02-16 22:09:27,640 >> {'loss': 12.4565, 'grad_norm': 1.1840094327926636, 'learning_rate': 0.0004914814574241215, 'epoch': 0.08530805687203792, 'num_input_tokens_seen': 199500708, 'completed': '27.00% (81 / 300)', 'remaining time': '10:15:33', 'throughput': '7548.93', 'gpu_mem_free': '8969MB', 'step': 81} [Step 81 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23542, 23543] → Tgt Spa: ['1.000', '0.350'] [Step 81 / Rank 6] Tasks: ['Single QA'] | Lens: [33531] → Tgt Spa: ['0.350'] [Step 81 / Rank 4] Tasks: ['Single QA'] | Lens: [53612] → Tgt Spa: ['0.350'] [Step 81 / Rank 5] Tasks: ['Single QA'] | Lens: [53612] → Tgt Spa: ['0.350'] [Step 81 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43745] → Tgt Spa: ['1.000'] [Step 81 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43745] → Tgt Spa: ['1.000'] [Step 81 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23542, 23543] → Tgt Spa: ['1.000', '0.350'] [Step 81 / Rank 7] Tasks: ['Single QA'] | Lens: [33531] → Tgt Spa: ['0.350'] [Step 81 / Rank 6] Tasks: ['Single QA'] | Lens: [65064] → Tgt Spa: ['0.350'] [Step 81 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29208, 29209] → Tgt Spa: ['0.350', '0.350'] [Step 81 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14569, 14570, 14570, 14575] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 
81 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [32403, 32421] → Tgt Spa: ['1.000', '1.000'] [Step 81 / Rank 7] Tasks: ['Single QA'] | Lens: [65064] → Tgt Spa: ['0.350'] [Step 81 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [32403, 32421] → Tgt Spa: ['1.000', '1.000'] [Step 81 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29208, 29209] → Tgt Spa: ['0.350', '0.350'] [Step 81 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14569, 14570, 14570, 14575] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 81 / Rank 5] Tasks: ['Single QA'] | Lens: [57451] → Tgt Spa: ['0.350'] [Step 81 / Rank 4] Tasks: ['Single QA'] | Lens: [57451] → Tgt Spa: ['0.350'] [Step 81 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [23863, 23871] → Tgt Spa: ['1.000', '1.000'] [Step 81 / Rank 3] Tasks: ['Code'] | Lens: [38284] → Tgt Spa: ['1.000'] [Step 81 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [23863, 23871] → Tgt Spa: ['1.000', '1.000'] [Step 81 / Rank 0] Tasks: ['Summarization'] | Lens: [36214] → Tgt Spa: ['1.000'] [Step 81 / Rank 2] Tasks: ['Code'] | Lens: [38284] → Tgt Spa: ['1.000'] [Step 81 / Rank 1] Tasks: ['Summarization'] | Lens: [36214] → Tgt Spa: ['1.000'] [Step 81 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24923, 24923] → Tgt Spa: ['0.350', '1.000'] [Step 81 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24923, 24923] → Tgt Spa: ['0.350', '1.000'] [Step 81 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24351, 24351] → Tgt Spa: ['1.000', '1.000'] [Step 81 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16491, 16483, 16493] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 81 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24351, 24351] → Tgt Spa: ['1.000', '1.000'] [Step 81 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16491, 16483, 16493] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 
81 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [52088] → Tgt Spa: ['1.000'] [Step 81 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [52088] → Tgt Spa: ['1.000'] [Step 81 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [26853, 26852] → Tgt Spa: ['1.000', '1.000'] [Step 81 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [21144, 21144, 21153] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 81 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [26853, 26852] → Tgt Spa: ['1.000', '1.000'] [Step 81 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [21144, 21144, 21153] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 81 / Rank 0] Tasks: ['Single QA'] | Lens: [58949] → Tgt Spa: ['0.350'] [Step 81 / Rank 2] Tasks: ['Code', 'Code', 'Code', 'Code'] | Lens: [15210, 15210, 15214, 15225] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000'] [Step 81 / Rank 1] Tasks: ['Single QA'] | Lens: [58949] → Tgt Spa: ['0.350'] [Step 81 / Rank 3] Tasks: ['Code', 'Code', 'Code', 'Code'] | Lens: [15210, 15210, 15214, 15225] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000'] [Step 81 / Rank 1] Tasks: ['Single QA'] | Lens: [49032] → Tgt Spa: ['0.350'] [Step 81 / Rank 5] Tasks: ['Single QA'] | Lens: [59279] → Tgt Spa: ['0.350'] [Step 81 / Rank 4] Tasks: ['Single QA'] | Lens: [59279] → Tgt Spa: ['0.350'] [Step 81 / Rank 3] Tasks: ['Single QA'] | Lens: [50543] → Tgt Spa: ['0.350'] [Step 81 / Rank 2] Tasks: ['Single QA'] | Lens: [50543] → Tgt Spa: ['0.350'] [Step 81 / Rank 7] Tasks: ['Single QA'] | Lens: [45427] → Tgt Spa: ['0.350'] [Step 81 / Rank 0] Tasks: ['Single QA'] | Lens: [49032] → Tgt Spa: ['0.350'] [Step 81 / Rank 6] Tasks: ['Single QA'] | Lens: [45427] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:11:59,017 >> @ 81 | Loss: 1.9293 | LM: 1.8314 | Reg: 0.0979 | Spa(Avg): 0.442 [INFO|lh_trainer.py:797] 2026-02-16 22:11:59,017 >> Statistic -> Code | Spa: 0.451 | Tgt: 1.000 | Z-Loss: 0.141 | [INFO|lh_trainer.py:797] 2026-02-16 22:11:59,017 
>> Statistic -> In-Context | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:11:59,017 >> Statistic -> MultiHop | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:11:59,018 >> Statistic -> Single | Spa: 0.434 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:11:59,018 >> Statistic -> Summarization | Spa: 0.455 | Tgt: 1.000 | Z-Loss: 0.164 | [INFO|lh_trainer.py:810] 2026-02-16 22:11:59,019 >> [Micro-Log] {"loss": 1.9293388773997624, "lm_loss": 1.8313961289823055, "reg_loss": 0.09794275951571763, "model_sparsity(avg)": 0.44217784826954204, "Spa-In-Context Learning sparsity": 0.4583333200878567, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.156283814046118, "Spa-Single QA sparsity": 0.43382352239945354, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04533969074049417, "Spa-Summarization sparsity": 0.4548611044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16437191143631935, "Spa-Code sparsity": 0.4507575685327703, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14074224098162216, "Spa-MultiHop QA sparsity": 0.375, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.006875626742839813, "step": 81, "current_tau": 1.2891086339950562, "lambda1 Single QA": 0.515625, "lambda2 MultiHop QA": 0.26171875, "lambda3 Summarization": 0.08251953125, "lambda4 Code": 0.181640625} [INFO|lh_trainer.py:331] 2026-02-16 22:12:22,206 >> {'loss': 11.576, 'grad_norm': 1.177915096282959, 'learning_rate': 0.0004906138100520309, 'epoch': 0.08636124275934702, 'num_input_tokens_seen': 201991874, 'completed': '27.33% (82 / 300)', 'remaining time': '10:13:00', 'throughput': '7135.31', 'gpu_mem_free': '9577MB', 'step': 82} [Step 82 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55201] → Tgt Spa: ['1.000'] [Step 82 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55201] 
→ Tgt Spa: ['1.000'] [Step 82 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6971, 6978, 6972, 6981, 6979, 6980, 6981, 6983, 6985] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 82 / Rank 2] Tasks: ['Single QA'] | Lens: [62001] → Tgt Spa: ['0.350'] [Step 82 / Rank 3] Tasks: ['Single QA'] | Lens: [62001] → Tgt Spa: ['0.350'] [Step 82 / Rank 0] Tasks: ['Single QA'] | Lens: [47182] → Tgt Spa: ['0.350'] [Step 82 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6971, 6978, 6972, 6981, 6979, 6980, 6981, 6983, 6985] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 82 / Rank 1] Tasks: ['Single QA'] | Lens: [47182] → Tgt Spa: ['0.350'] [Step 82 / Rank 2] Tasks: ['Single QA'] | Lens: [56711] → Tgt Spa: ['0.350'] [Step 82 / Rank 7] Tasks: ['Single QA'] | Lens: [34443] → Tgt Spa: ['0.350'] [Step 82 / Rank 6] Tasks: ['Single QA'] | Lens: [34443] → Tgt Spa: ['0.350'] [Step 82 / Rank 5] Tasks: ['Single QA'] | Lens: [47508] → Tgt Spa: ['0.350'] [Step 82 / Rank 4] Tasks: ['Single QA'] | Lens: [47508] → Tgt Spa: ['0.350'] [Step 82 / Rank 1] Tasks: ['Single QA'] | Lens: [56617] → Tgt Spa: ['0.350'] [Step 82 / Rank 3] Tasks: ['Single QA'] | Lens: [56711] → Tgt Spa: ['0.350'] [Step 82 / Rank 0] Tasks: ['Single QA'] | Lens: [56617] → Tgt Spa: ['0.350'] [Step 82 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39923] → Tgt Spa: ['1.000'] [Step 82 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [28272, 28272] → Tgt Spa: ['0.350', '0.350'] [Step 82 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [46023] → Tgt Spa: ['1.000'] [Step 82 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [28272, 28272] → Tgt Spa: ['0.350', '0.350'] [Step 82 / Rank 1] Tasks: ['Single QA'] | Lens: [51472] → Tgt Spa: ['0.350'] [Step 82 / 
Rank 0] Tasks: ['Single QA'] | Lens: [51472] → Tgt Spa: ['0.350'] [Step 82 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [46023] → Tgt Spa: ['1.000'] [Step 82 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39923] → Tgt Spa: ['1.000'] [Step 82 / Rank 4] Tasks: ['Code'] | Lens: [37900] → Tgt Spa: ['1.000'] [Step 82 / Rank 7] Tasks: ['Single QA'] | Lens: [36910] → Tgt Spa: ['0.350'] [Step 82 / Rank 5] Tasks: ['Code'] | Lens: [37900] → Tgt Spa: ['1.000'] [Step 82 / Rank 6] Tasks: ['Single QA'] | Lens: [36910] → Tgt Spa: ['0.350'] [Step 82 / Rank 3] Tasks: ['Single QA'] | Lens: [35797] → Tgt Spa: ['0.350'] [Step 82 / Rank 0] Tasks: ['Single QA'] | Lens: [54449] → Tgt Spa: ['0.350'] [Step 82 / Rank 1] Tasks: ['Single QA'] | Lens: [54449] → Tgt Spa: ['0.350'] [Step 82 / Rank 2] Tasks: ['Single QA'] | Lens: [35797] → Tgt Spa: ['0.350'] [Step 82 / Rank 6] Tasks: ['Single QA'] | Lens: [41074] → Tgt Spa: ['0.350'] [Step 82 / Rank 7] Tasks: ['Single QA'] | Lens: [41074] → Tgt Spa: ['0.350'] [Step 82 / Rank 3] Tasks: ['Summarization'] | Lens: [39558] → Tgt Spa: ['1.000'] [Step 82 / Rank 2] Tasks: ['Summarization'] | Lens: [39558] → Tgt Spa: ['1.000'] [Step 82 / Rank 4] Tasks: ['Single QA'] | Lens: [39650] → Tgt Spa: ['0.350'] [Step 82 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'Code', 'Code', 'In-Context Learning'] | Lens: [4393, 4393, 4395, 4394, 4395, 4395, 4395, 4395, 4395, 4414, 4397, 4404, 4404, 4398] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 82 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 
'Code', 'Code', 'In-Context Learning'] | Lens: [4393, 4393, 4395, 4394, 4395, 4395, 4395, 4395, 4395, 4414, 4397, 4404, 4404, 4398] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 82 / Rank 5] Tasks: ['Single QA'] | Lens: [39650] → Tgt Spa: ['0.350'] [Step 82 / Rank 5] Tasks: ['Single QA'] | Lens: [33791] → Tgt Spa: ['0.350'] [Step 82 / Rank 2] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17096, 17110, 17110] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 82 / Rank 7] Tasks: ['Single QA'] | Lens: [43211] → Tgt Spa: ['0.350'] [Step 82 / Rank 1] Tasks: ['Code'] | Lens: [38242] → Tgt Spa: ['1.000'] [Step 82 / Rank 6] Tasks: ['Single QA'] | Lens: [43211] → Tgt Spa: ['0.350'] [Step 82 / Rank 4] Tasks: ['Single QA'] | Lens: [33791] → Tgt Spa: ['0.350'] [Step 82 / Rank 3] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17096, 17110, 17110] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 82 / Rank 0] Tasks: ['Code'] | Lens: [38242] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:14:36,115 >> @ 82 | Loss: 2.1659 | LM: 2.0841 | Reg: 0.0818 | Spa(Avg): 0.461 [INFO|lh_trainer.py:797] 2026-02-16 22:14:36,115 >> Statistic -> Code | Spa: 0.478 | Tgt: 1.000 | Z-Loss: 0.133 | [INFO|lh_trainer.py:797] 2026-02-16 22:14:36,115 >> Statistic -> In-Context | Spa: 0.482 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:14:36,115 >> Statistic -> MultiHop | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:14:36,116 >> Statistic -> Single | Spa: 0.455 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:14:36,116 >> Statistic -> Summarization | Spa: 0.465 | Tgt: 1.000 | Z-Loss: 0.160 | [INFO|lh_trainer.py:810] 2026-02-16 22:14:36,118 >> [Micro-Log] {"loss": 2.1658891389767327, "lm_loss": 2.0840984992682934, "reg_loss": 0.08179064220166765, "model_sparsity(avg)": 0.4611992860833804, "Spa-Single QA 
sparsity": 0.4545940069051889, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.057535505504347384, "Spa-In-Context Learning sparsity": 0.48232322389429266, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14886978539553555, "Spa-Summarization sparsity": 0.4652777761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1600959710776806, "Spa-Code sparsity": 0.47817459276744295, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13281715980597905, "Spa-MultiHop QA sparsity": 0.375, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.006875626742839813, "step": 82, "current_tau": 1.2847932577133179, "lambda1 Single QA": 0.515625, "lambda2 MultiHop QA": 0.263671875, "lambda3 Summarization": 0.08349609375, "lambda4 Code": 0.1826171875} [INFO|lh_trainer.py:331] 2026-02-16 22:14:50,461 >> {'loss': 12.9953, 'grad_norm': 0.7854015827178955, 'learning_rate': 0.0004897049347465549, 'epoch': 0.08741442864665613, 'num_input_tokens_seen': 204251674, 'completed': '27.67% (83 / 300)', 'remaining time': '10:09:17', 'throughput': '7621.31', 'gpu_mem_free': '13793MB', 'step': 83} [Step 83 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [46501] → Tgt Spa: ['1.000'] [Step 83 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [46501] → Tgt Spa: ['1.000'] [Step 83 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22177, 22177] → Tgt Spa: ['0.350', '1.000'] [Step 83 / Rank 3] Tasks: ['Code'] | Lens: [41752] → Tgt Spa: ['1.000'] [Step 83 / Rank 2] Tasks: ['Code'] | Lens: [41752] → Tgt Spa: ['1.000'] [Step 83 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29706, 29707] → Tgt Spa: ['0.350', '0.350'] [Step 83 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29706, 29707] → Tgt Spa: ['0.350', '0.350'] [Step 83 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22177, 22177] → Tgt Spa: ['0.350', '1.000'] [Step 83 / Rank 7] Tasks: 
['Single QA'] | Lens: [44175] → Tgt Spa: ['0.350'] [Step 83 / Rank 2] Tasks: ['Single QA'] | Lens: [47516] → Tgt Spa: ['0.350'] [Step 83 / Rank 5] Tasks: ['Single QA'] | Lens: [41180] → Tgt Spa: ['0.350'] [Step 83 / Rank 4] Tasks: ['Single QA'] | Lens: [41180] → Tgt Spa: ['0.350'] [Step 83 / Rank 6] Tasks: ['Single QA'] | Lens: [44175] → Tgt Spa: ['0.350'] [Step 83 / Rank 3] Tasks: ['Single QA'] | Lens: [47516] → Tgt Spa: ['0.350'] [Step 83 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15875, 15876, 15876, 15876] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 83 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15875, 15876, 15876, 15876] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 83 / Rank 1] Tasks: ['Code'] | Lens: [61406] → Tgt Spa: ['1.000'] [Step 83 / Rank 7] Tasks: ['Single QA'] | Lens: [41383] → Tgt Spa: ['0.350'] [Step 83 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [27150, 27150] → Tgt Spa: ['0.350', '0.350'] [Step 83 / Rank 5] Tasks: ['Single QA'] | Lens: [52879] → Tgt Spa: ['0.350'] [Step 83 / Rank 6] Tasks: ['Single QA'] | Lens: [41383] → Tgt Spa: ['0.350'] [Step 83 / Rank 0] Tasks: ['Code'] | Lens: [61406] → Tgt Spa: ['1.000'] [Step 83 / Rank 4] Tasks: ['Single QA'] | Lens: [52879] → Tgt Spa: ['0.350'] [Step 83 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [27150, 27150] → Tgt Spa: ['0.350', '0.350'] [Step 83 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [25271, 25279] → Tgt Spa: ['0.350', '1.000'] [Step 83 / Rank 5] Tasks: ['Summarization'] | Lens: [41318] → Tgt Spa: ['1.000'] [Step 83 / Rank 2] Tasks: ['Single QA', 'Summarization', 'Summarization', 'Single QA', 'Single QA', 'Code'] | Lens: [10004, 10026, 10027, 10011, 10011, 10020] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 83 / Rank 4] Tasks: ['Summarization'] | Lens: [41318] → Tgt Spa: ['1.000'] [Step 83 / Rank 3] Tasks: ['Single QA', 'Summarization', 'Summarization', 'Single 
QA', 'Single QA', 'Code'] | Lens: [10004, 10026, 10027, 10011, 10011, 10020] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 83 / Rank 1] Tasks: ['Single QA'] | Lens: [46753] → Tgt Spa: ['0.350'] [Step 83 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [25271, 25279] → Tgt Spa: ['0.350', '1.000'] [Step 83 / Rank 0] Tasks: ['Single QA'] | Lens: [46753] → Tgt Spa: ['0.350'] [Step 83 / Rank 4] Tasks: ['Code'] | Lens: [42833] → Tgt Spa: ['1.000'] [Step 83 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43962] → Tgt Spa: ['1.000'] [Step 83 / Rank 5] Tasks: ['Code'] | Lens: [42833] → Tgt Spa: ['1.000'] [Step 83 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43962] → Tgt Spa: ['1.000'] [Step 83 / Rank 1] Tasks: ['Code'] | Lens: [54183] → Tgt Spa: ['1.000'] [Step 83 / Rank 0] Tasks: ['Code'] | Lens: [54183] → Tgt Spa: ['1.000'] [Step 83 / Rank 3] Tasks: ['Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'Summarization', 'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA'] | Lens: [2287, 2269, 2288, 2288, 2278, 2273, 2289, 2272, 2273, 2292, 2282, 2295, 2294, 2294, 2277, 2278, 2278, 2296, 2295, 2278, 2279, 2297, 2296, 2279, 2280, 2282, 2298, 2280] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 83 / Rank 2] Tasks: ['Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'Summarization', 'Summarization', 
'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA'] | Lens: [2287, 2269, 2288, 2288, 2278, 2273, 2289, 2272, 2273, 2292, 2282, 2295, 2294, 2294, 2277, 2278, 2278, 2296, 2295, 2278, 2279, 2297, 2296, 2279, 2280, 2282, 2298, 2280] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 83 / Rank 3] Tasks: ['Single QA'] | Lens: [35934] → Tgt Spa: ['0.350'] [Step 83 / Rank 5] Tasks: ['Single QA'] | Lens: [47074] → Tgt Spa: ['0.350'] [Step 83 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [26985, 26985] → Tgt Spa: ['0.350', '0.350'] [Step 83 / Rank 4] Tasks: ['Single QA'] | Lens: [47074] → Tgt Spa: ['0.350'] [Step 83 / Rank 6] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code'] | Lens: [3742, 3735, 3736, 3743, 3737, 3737, 3756, 3737, 3739, 3739, 3739, 3739, 3740, 3740, 3741, 3747, 3751] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 83 / Rank 2] Tasks: ['Single QA'] | Lens: [35934] → Tgt Spa: ['0.350'] [Step 83 / Rank 7] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 
'Code', 'Code'] | Lens: [3742, 3735, 3736, 3743, 3737, 3737, 3756, 3737, 3739, 3739, 3739, 3739, 3740, 3740, 3741, 3747, 3751] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 83 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [26985, 26985] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:16:56,199 >> @ 83 | Loss: 2.0714 | LM: 1.9824 | Reg: 0.0890 | Spa(Avg): 0.484 [INFO|lh_trainer.py:797] 2026-02-16 22:16:56,199 >> Statistic -> Code | Spa: 0.487 | Tgt: 1.000 | Z-Loss: 0.130 | [INFO|lh_trainer.py:797] 2026-02-16 22:16:56,199 >> Statistic -> In-Context | Spa: 0.510 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:16:56,199 >> Statistic -> MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:16:56,199 >> Statistic -> Single | Spa: 0.470 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:16:56,200 >> Statistic -> Summarization | Spa: 0.487 | Tgt: 1.000 | Z-Loss: 0.150 | [INFO|lh_trainer.py:810] 2026-02-16 22:16:56,201 >> [Micro-Log] {"loss": 2.0713806860148907, "lm_loss": 1.982370310773452, "reg_loss": 0.08901035832241178, "model_sparsity(avg)": 0.48427772770325345, "Spa-Single QA sparsity": 0.46965019349698667, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06589928786787722, "Spa-Code sparsity": 0.48726850748062134, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1302414300541083, "Spa-Summarization sparsity": 0.4869280948358424, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1500350552446702, "Spa-MultiHop QA sparsity": 0.48611109906976874, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.043015512214465576, "Spa-In-Context Learning sparsity": 0.509615380030412, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 
0.14136599749326706, "step": 83, "current_tau": 1.2804672718048096, "lambda1 Single QA": 0.515625, "lambda2 MultiHop QA": 0.263671875, "lambda3 Summarization": 0.08447265625, "lambda4 Code": 0.18359375} [INFO|lh_trainer.py:331] 2026-02-16 22:17:12,361 >> {'loss': 12.4283, 'grad_norm': 0.937538206577301, 'learning_rate': 0.0004887549872387981, 'epoch': 0.08846761453396525, 'num_input_tokens_seen': 206656880, 'completed': '28.00% (84 / 300)', 'remaining time': '10:05:21', 'throughput': '8475.00', 'gpu_mem_free': '9253MB', 'step': 84} [Step 84 / Rank 3] Tasks: ['Single QA'] | Lens: [47950] → Tgt Spa: ['0.350'] [Step 84 / Rank 2] Tasks: ['Single QA'] | Lens: [47950] → Tgt Spa: ['0.350'] [Step 84 / Rank 4] Tasks: ['Single QA'] | Lens: [49859] → Tgt Spa: ['0.350'] [Step 84 / Rank 5] Tasks: ['Single QA'] | Lens: [49859] → Tgt Spa: ['0.350'] [Step 84 / Rank 7] Tasks: ['Single QA'] | Lens: [57288] → Tgt Spa: ['0.350'] [Step 84 / Rank 1] Tasks: ['Summarization', 'Single QA'] | Lens: [27567, 27550] → Tgt Spa: ['1.000', '0.350'] [Step 84 / Rank 6] Tasks: ['Single QA'] | Lens: [57288] → Tgt Spa: ['0.350'] [Step 84 / Rank 0] Tasks: ['Summarization', 'Single QA'] | Lens: [27567, 27550] → Tgt Spa: ['1.000', '0.350'] [Step 84 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23118, 23119] → Tgt Spa: ['1.000', '1.000'] [Step 84 / Rank 0] Tasks: ['Single QA'] | Lens: [60595] → Tgt Spa: ['0.350'] [Step 84 / Rank 3] Tasks: ['Single QA'] | Lens: [38376] → Tgt Spa: ['0.350'] [Step 84 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23118, 23119] → Tgt Spa: ['1.000', '1.000'] [Step 84 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [37389] → Tgt Spa: ['1.000'] [Step 84 / Rank 2] Tasks: ['Single QA'] | Lens: [38376] → Tgt Spa: ['0.350'] [Step 84 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [37389] → Tgt Spa: ['1.000'] [Step 84 / Rank 1] Tasks: ['Single QA'] | Lens: [60595] → Tgt Spa: ['0.350'] [Step 84 / Rank 6] Tasks: ['Single QA'] | 
Lens: [44926] → Tgt Spa: ['0.350'] [Step 84 / Rank 4] Tasks: ['Single QA'] | Lens: [42487] → Tgt Spa: ['0.350'] [Step 84 / Rank 5] Tasks: ['Single QA'] | Lens: [42487] → Tgt Spa: ['0.350'] [Step 84 / Rank 2] Tasks: ['Single QA'] | Lens: [40598] → Tgt Spa: ['0.350'] [Step 84 / Rank 3] Tasks: ['Single QA'] | Lens: [40598] → Tgt Spa: ['0.350'] [Step 84 / Rank 1] Tasks: ['Single QA'] | Lens: [36400] → Tgt Spa: ['0.350'] [Step 84 / Rank 0] Tasks: ['Single QA'] | Lens: [36400] → Tgt Spa: ['0.350'] [Step 84 / Rank 7] Tasks: ['Single QA'] | Lens: [44926] → Tgt Spa: ['0.350'] [Step 84 / Rank 6] Tasks: ['Single QA'] | Lens: [37111] → Tgt Spa: ['0.350'] [Step 84 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16964, 16955, 16955] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 84 / Rank 1] Tasks: ['Single QA'] | Lens: [51389] → Tgt Spa: ['0.350'] [Step 84 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16964, 16955, 16955] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 84 / Rank 2] Tasks: ['Single QA'] | Lens: [52545] → Tgt Spa: ['0.350'] [Step 84 / Rank 3] Tasks: ['Single QA'] | Lens: [52545] → Tgt Spa: ['0.350'] [Step 84 / Rank 0] Tasks: ['Single QA'] | Lens: [51389] → Tgt Spa: ['0.350'] [Step 84 / Rank 7] Tasks: ['Single QA'] | Lens: [37111] → Tgt Spa: ['0.350'] [Step 84 / Rank 5] Tasks: ['Code'] | Lens: [46025] → Tgt Spa: ['1.000'] [Step 84 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27025, 27026] → Tgt Spa: ['1.000', '1.000'] [Step 84 / Rank 0] Tasks: ['Single QA'] | Lens: [59507] → Tgt Spa: ['0.350'] [Step 84 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27025, 27026] → Tgt Spa: ['1.000', '1.000'] [Step 84 / Rank 1] Tasks: ['Single QA'] | Lens: [59507] → Tgt Spa: ['0.350'] [Step 84 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39584] → Tgt Spa: ['1.000'] [Step 84 / Rank 4] Tasks: ['Code'] | Lens: [46025] → Tgt Spa: ['1.000'] [Step 84 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39584] → 
Tgt Spa: ['1.000'] [Step 84 / Rank 5] Tasks: ['Single QA'] | Lens: [33977] → Tgt Spa: ['0.350'] [Step 84 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43389] → Tgt Spa: ['1.000'] [Step 84 / Rank 4] Tasks: ['Single QA'] | Lens: [33977] → Tgt Spa: ['0.350'] [Step 84 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43389] → Tgt Spa: ['1.000'] [Step 84 / Rank 0] Tasks: ['Code'] | Lens: [53284] → Tgt Spa: ['1.000'] [Step 84 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [25864, 25865] → Tgt Spa: ['0.350', '0.350'] [Step 84 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [25864, 25865] → Tgt Spa: ['0.350', '0.350'] [Step 84 / Rank 1] Tasks: ['Code'] | Lens: [53284] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:19:33,483 >> @ 84 | Loss: 2.1950 | LM: 2.1197 | Reg: 0.0753 | Spa(Avg): 0.436 [INFO|lh_trainer.py:797] 2026-02-16 22:19:33,483 >> Statistic -> Code | Spa: 0.462 | Tgt: 1.000 | Z-Loss: 0.139 | [INFO|lh_trainer.py:797] 2026-02-16 22:19:33,483 >> Statistic -> In-Context | Spa: 0.492 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:19:33,483 >> Statistic -> MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:19:33,483 >> Statistic -> Single | Spa: 0.413 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:19:33,483 >> Statistic -> Summarization | Spa: 0.479 | Tgt: 1.000 | Z-Loss: 0.154 | [INFO|lh_trainer.py:810] 2026-02-16 22:19:33,485 >> [Micro-Log] {"loss": 2.1950076085825763, "lm_loss": 2.119707121203343, "reg_loss": 0.07530048536136746, "model_sparsity(avg)": 0.4362461318572362, "Spa-Summarization sparsity": 0.4791666567325592, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1541968435049057, "Spa-Single QA sparsity": 0.41258168921751137, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03564197353689986, "Spa-Code sparsity": 0.4618055373430252, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 
0.1393032819032669, "Spa-In-Context Learning sparsity": 0.49206347124917166, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14715953171253204, "Spa-MultiHop QA sparsity": 0.48611109906976874, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.043015512214465576, "step": 84, "current_tau": 1.2761321067810059, "lambda1 Single QA": 0.515625, "lambda2 MultiHop QA": 0.263671875, "lambda3 Summarization": 0.08544921875, "lambda4 Code": 0.1845703125} [INFO|lh_trainer.py:331] 2026-02-16 22:19:52,826 >> {'loss': 13.17, 'grad_norm': 0.8664125800132751, 'learning_rate': 0.0004877641302973755, 'epoch': 0.08952080042127436, 'num_input_tokens_seen': 208918254, 'completed': '28.33% (85 / 300)', 'remaining time': '10:02:13', 'throughput': '7046.33', 'gpu_mem_free': '9095MB', 'step': 85} [Step 85 / Rank 4] Tasks: ['Single QA'] | Lens: [55168] → Tgt Spa: ['0.350'] [Step 85 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59603] → Tgt Spa: ['1.000'] [Step 85 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [23448, 23440] → Tgt Spa: ['1.000', '1.000'] [Step 85 / Rank 5] Tasks: ['Single QA'] | Lens: [55168] → Tgt Spa: ['0.350'] [Step 85 / Rank 1] Tasks: ['Single QA', 'Summarization', 'Summarization'] | Lens: [21541, 21561, 21561] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 85 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [23448, 23440] → Tgt Spa: ['1.000', '1.000'] [Step 85 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59603] → Tgt Spa: ['1.000'] [Step 85 / Rank 0] Tasks: ['Single QA', 'Summarization', 'Summarization'] | Lens: [21541, 21561, 21561] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 85 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39213] → Tgt Spa: ['1.000'] [Step 85 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [9051, 9052, 9057, 9058, 9069, 9070, 9073] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', 
'1.000'] [Step 85 / Rank 5] Tasks: ['Code'] | Lens: [47330] → Tgt Spa: ['1.000'] [Step 85 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39213] → Tgt Spa: ['1.000'] [Step 85 / Rank 1] Tasks: ['Single QA'] | Lens: [42482] → Tgt Spa: ['0.350'] [Step 85 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [9051, 9052, 9057, 9058, 9069, 9070, 9073] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 85 / Rank 4] Tasks: ['Code'] | Lens: [47330] → Tgt Spa: ['1.000'] [Step 85 / Rank 0] Tasks: ['Single QA'] | Lens: [42482] → Tgt Spa: ['0.350'] [Step 85 / Rank 5] Tasks: ['Single QA'] | Lens: [58616] → Tgt Spa: ['0.350'] [Step 85 / Rank 1] Tasks: ['Single QA'] | Lens: [34139] → Tgt Spa: ['0.350'] [Step 85 / Rank 3] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [18116, 18135, 18124] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 85 / Rank 4] Tasks: ['Single QA'] | Lens: [58616] → Tgt Spa: ['0.350'] [Step 85 / Rank 7] Tasks: ['Single QA'] | Lens: [49988] → Tgt Spa: ['0.350'] [Step 85 / Rank 0] Tasks: ['Single QA'] | Lens: [34139] → Tgt Spa: ['0.350'] [Step 85 / Rank 2] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [18116, 18135, 18124] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 85 / Rank 6] Tasks: ['Single QA'] | Lens: [49988] → Tgt Spa: ['0.350'] [Step 85 / Rank 3] Tasks: ['Single QA'] | Lens: [52920] → Tgt Spa: ['0.350'] [Step 85 / Rank 2] Tasks: ['Single QA'] | Lens: [52920] → Tgt Spa: ['0.350'] [Step 85 / Rank 4] Tasks: ['Code'] | Lens: [60277] → Tgt Spa: ['1.000'] [Step 85 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43523] → Tgt Spa: ['1.000'] [Step 85 / Rank 6] Tasks: ['Single QA'] | Lens: [34725] → Tgt Spa: ['0.350'] [Step 85 / Rank 7] Tasks: ['Single QA'] | Lens: [34725] → Tgt Spa: ['0.350'] [Step 85 / Rank 5] Tasks: ['Code'] | Lens: [60277] → Tgt Spa: ['1.000'] [Step 85 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43523] → Tgt Spa: ['1.000'] [Step 85 / Rank 0] 
Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22604, 22604] → Tgt Spa: ['1.000', '1.000'] [Step 85 / Rank 6] Tasks: ['Summarization', 'Single QA'] | Lens: [32524, 32506] → Tgt Spa: ['1.000', '0.350'] [Step 85 / Rank 2] Tasks: ['Single QA'] | Lens: [57504] → Tgt Spa: ['0.350'] [Step 85 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [27739, 27739] → Tgt Spa: ['0.350', '0.350'] [Step 85 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22604, 22604] → Tgt Spa: ['1.000', '1.000'] [Step 85 / Rank 3] Tasks: ['Single QA'] | Lens: [57504] → Tgt Spa: ['0.350'] [Step 85 / Rank 7] Tasks: ['Summarization', 'Single QA'] | Lens: [32524, 32506] → Tgt Spa: ['1.000', '0.350'] [Step 85 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [27739, 27739] → Tgt Spa: ['0.350', '0.350'] [Step 85 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55806] → Tgt Spa: ['1.000'] [Step 85 / Rank 6] Tasks: ['Single QA'] | Lens: [55348] → Tgt Spa: ['0.350'] [Step 85 / Rank 7] Tasks: ['Single QA'] | Lens: [55348] → Tgt Spa: ['0.350'] [Step 85 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54886] → Tgt Spa: ['1.000'] [Step 85 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [20452, 20446, 20446] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 85 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55806] → Tgt Spa: ['1.000'] [Step 85 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54886] → Tgt Spa: ['1.000'] [Step 85 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [20452, 20446, 20446] → Tgt Spa: ['1.000', '0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:22:19,366 >> @ 85 | Loss: 2.0469 | LM: 1.9556 | Reg: 0.0913 | Spa(Avg): 0.415 [INFO|lh_trainer.py:797] 2026-02-16 22:22:19,366 >> Statistic -> Code | Spa: 0.429 | Tgt: 1.000 | Z-Loss: 0.151 | [INFO|lh_trainer.py:797] 2026-02-16 22:22:19,366 >> Statistic -> In-Context | Spa: 0.476 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:22:19,366 >> Statistic -> 
MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:22:19,366 >> Statistic -> Single | Spa: 0.402 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:22:19,366 >> Statistic -> Summarization | Spa: 0.441 | Tgt: 1.000 | Z-Loss: 0.175 | [INFO|lh_trainer.py:810] 2026-02-16 22:22:19,368 >> [Micro-Log] {"loss": 2.0469123888760805, "lm_loss": 1.9555986480942618, "reg_loss": 0.09131375358750422, "model_sparsity(avg)": 0.41497189179062843, "Spa-Single QA sparsity": 0.40208333134651186, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03340054096188396, "Spa-Summarization sparsity": 0.4409722238779068, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17510562762618065, "Spa-In-Context Learning sparsity": 0.4756944328546524, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15378608740866184, "Spa-Code sparsity": 0.428819440305233, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15142562985420227, "Spa-MultiHop QA sparsity": 0.48611109906976874, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.043015512214465576, "step": 85, "current_tau": 1.2717889547348022, "lambda1 Single QA": 0.51953125, "lambda2 MultiHop QA": 0.263671875, "lambda3 Summarization": 0.08642578125, "lambda4 Code": 0.185546875} [INFO|lh_trainer.py:331] 2026-02-16 22:22:40,191 >> {'loss': 12.2815, 'grad_norm': 1.2139054536819458, 'learning_rate': 0.0004867325337005232, 'epoch': 0.09057398630858346, 'num_input_tokens_seen': 211434142, 'completed': '28.67% (86 / 300)', 'remaining time': '9:59:23', 'throughput': '7516.18', 'gpu_mem_free': '7417MB', 'step': 86} [Step 86 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13969, 13969, 13969, 13969] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 86 / Rank 0] Tasks: ['Summarization', 'Single QA'] | Lens: [25461, 25443] → Tgt Spa: ['1.000', 
'0.350'] [Step 86 / Rank 2] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [16031, 16038, 16033, 16033] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350'] [Step 86 / Rank 5] Tasks: ['Code'] | Lens: [50356] → Tgt Spa: ['1.000'] [Step 86 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13969, 13969, 13969, 13969] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 86 / Rank 4] Tasks: ['Code'] | Lens: [50356] → Tgt Spa: ['1.000'] [Step 86 / Rank 1] Tasks: ['Summarization', 'Single QA'] | Lens: [25461, 25443] → Tgt Spa: ['1.000', '0.350'] [Step 86 / Rank 3] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [16031, 16038, 16033, 16033] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350'] [Step 86 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32309, 32309] → Tgt Spa: ['0.350', '0.350'] [Step 86 / Rank 6] Tasks: ['Summarization', 'Summarization'] | Lens: [26625, 26627] → Tgt Spa: ['1.000', '1.000'] [Step 86 / Rank 4] Tasks: ['Single QA'] | Lens: [47089] → Tgt Spa: ['0.350'] [Step 86 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [48183] → Tgt Spa: ['1.000'] [Step 86 / Rank 5] Tasks: ['Single QA'] | Lens: [47089] → Tgt Spa: ['0.350'] [Step 86 / Rank 7] Tasks: ['Summarization', 'Summarization'] | Lens: [26625, 26627] → Tgt Spa: ['1.000', '1.000'] [Step 86 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32309, 32309] → Tgt Spa: ['0.350', '0.350'] [Step 86 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [48183] → Tgt Spa: ['1.000'] [Step 86 / Rank 3] Tasks: ['Single QA'] | Lens: [52301] → Tgt Spa: ['0.350'] [Step 86 / Rank 2] Tasks: ['Single QA'] | Lens: [52301] → Tgt Spa: ['0.350'] [Step 86 / Rank 7] Tasks: ['Single QA'] | Lens: [65035] → Tgt Spa: ['0.350'] [Step 86 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23708, 23709] → Tgt Spa: ['1.000', '1.000'] [Step 86 / Rank 4] Tasks: ['Single QA'] | Lens: [52400] → Tgt Spa: ['0.350'] [Step 86 / Rank 5] Tasks: ['Single QA'] | Lens: 
[52400] → Tgt Spa: ['0.350'] [Step 86 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23708, 23709] → Tgt Spa: ['1.000', '1.000'] [Step 86 / Rank 6] Tasks: ['Single QA'] | Lens: [65035] → Tgt Spa: ['0.350'] [Step 86 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [63403] → Tgt Spa: ['1.000'] [Step 86 / Rank 0] Tasks: ['Single QA'] | Lens: [62270] → Tgt Spa: ['0.350'] [Step 86 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16595, 16596, 16606] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 86 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16595, 16596, 16606] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 86 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [63403] → Tgt Spa: ['1.000'] [Step 86 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32370, 32370] → Tgt Spa: ['0.350', '0.350'] [Step 86 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32370, 32370] → Tgt Spa: ['0.350', '0.350'] [Step 86 / Rank 1] Tasks: ['Single QA'] | Lens: [62270] → Tgt Spa: ['0.350'] [Step 86 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [36982] → Tgt Spa: ['1.000'] [Step 86 / Rank 3] Tasks: ['Code'] | Lens: [63983] → Tgt Spa: ['1.000'] [Step 86 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [36982] → Tgt Spa: ['1.000'] [Step 86 / Rank 0] Tasks: ['Single QA'] | Lens: [51131] → Tgt Spa: ['0.350'] [Step 86 / Rank 1] Tasks: ['Single QA'] | Lens: [51131] → Tgt Spa: ['0.350'] [Step 86 / Rank 6] Tasks: ['Single QA'] | Lens: [49877] → Tgt Spa: ['0.350'] [Step 86 / Rank 7] Tasks: ['Single QA'] | Lens: [49877] → Tgt Spa: ['0.350'] [Step 86 / Rank 2] Tasks: ['Code'] | Lens: [63983] → Tgt Spa: ['1.000'] [Step 86 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25417, 25418] → Tgt Spa: ['1.000', '1.000'] [Step 86 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40775] → Tgt Spa: ['1.000'] [Step 86 / Rank 2] Tasks: ['Single QA'] | Lens: [51040] → Tgt Spa: ['0.350'] [Step 86 / Rank 7] Tasks: ['In-Context Learning', 
'In-Context Learning'] | Lens: [23489, 23490] → Tgt Spa: ['1.000', '1.000'] [Step 86 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25417, 25418] → Tgt Spa: ['1.000', '1.000'] [Step 86 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23489, 23490] → Tgt Spa: ['1.000', '1.000'] [Step 86 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40775] → Tgt Spa: ['1.000'] [Step 86 / Rank 3] Tasks: ['Single QA'] | Lens: [51040] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:25:17,320 >> @ 86 | Loss: 2.1007 | LM: 2.0091 | Reg: 0.0916 | Spa(Avg): 0.415 [INFO|lh_trainer.py:797] 2026-02-16 22:25:17,320 >> Statistic -> Code | Spa: 0.406 | Tgt: 1.000 | Z-Loss: 0.160 | [INFO|lh_trainer.py:797] 2026-02-16 22:25:17,320 >> Statistic -> In-Context | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:25:17,320 >> Statistic -> MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:25:17,320 >> Statistic -> Single | Spa: 0.378 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:25:17,320 >> Statistic -> Summarization | Spa: 0.413 | Tgt: 1.000 | Z-Loss: 0.191 | [INFO|lh_trainer.py:810] 2026-02-16 22:25:17,322 >> [Micro-Log] {"loss": 2.1007488599667945, "lm_loss": 2.0091286171227694, "reg_loss": 0.09162026199434574, "model_sparsity(avg)": 0.4146411990125974, "Spa-Summarization sparsity": 0.4131944477558136, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.19145333766937256, "Spa-Single QA sparsity": 0.37847221493721006, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.022358963801525533, "Spa-In-Context Learning sparsity": 0.4861111044883728, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1505264848470688, "Spa-Code sparsity": 0.4055555582046509, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15950656235218047, "Spa-MultiHop QA sparsity": 
0.48611109906976874, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.043015512214465576, "step": 86, "current_tau": 1.2674391269683838, "lambda1 Single QA": 0.51953125, "lambda2 MultiHop QA": 0.263671875, "lambda3 Summarization": 0.0869140625, "lambda4 Code": 0.185546875} [INFO|lh_trainer.py:331] 2026-02-16 22:25:35,553 >> {'loss': 12.6045, 'grad_norm': 1.1984803676605225, 'learning_rate': 0.00048566037420700735, 'epoch': 0.09162717219589257, 'num_input_tokens_seen': 214000898, 'completed': '29.00% (87 / 300)', 'remaining time': '9:56:53', 'throughput': '7318.43', 'gpu_mem_free': '12105MB', 'step': 87} [Step 87 / Rank 2] Tasks: ['Single QA'] | Lens: [64978] → Tgt Spa: ['0.350'] [Step 87 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58264] → Tgt Spa: ['1.000'] [Step 87 / Rank 6] Tasks: ['Code'] | Lens: [37171] → Tgt Spa: ['1.000'] [Step 87 / Rank 3] Tasks: ['Single QA'] | Lens: [64978] → Tgt Spa: ['0.350'] [Step 87 / Rank 5] Tasks: ['Single QA', 'Code'] | Lens: [29631, 29641] → Tgt Spa: ['0.350', '1.000'] [Step 87 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58264] → Tgt Spa: ['1.000'] [Step 87 / Rank 7] Tasks: ['Code'] | Lens: [37171] → Tgt Spa: ['1.000'] [Step 87 / Rank 4] Tasks: ['Single QA', 'Code'] | Lens: [29631, 29641] → Tgt Spa: ['0.350', '1.000'] [Step 87 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39040] → Tgt Spa: ['1.000'] [Step 87 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [18757, 18758, 18758] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 87 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24826, 24825] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 6] Tasks: ['Code', 'Code', 'Code', 'Code'] | Lens: [13289, 13304, 13310, 13326] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000'] [Step 87 / Rank 7] Tasks: ['Code', 'Code', 'Code', 'Code'] | Lens: [13289, 13304, 13310, 13326] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000'] [Step 87 / Rank 3] Tasks: ['Single QA', 'Single 
QA', 'Single QA'] | Lens: [18757, 18758, 18758] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 87 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39040] → Tgt Spa: ['1.000'] [Step 87 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24826, 24825] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22114, 22096] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 7] Tasks: ['Single QA'] | Lens: [42366] → Tgt Spa: ['0.350'] [Step 87 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [48466] → Tgt Spa: ['1.000'] [Step 87 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24819, 24818] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [48466] → Tgt Spa: ['1.000'] [Step 87 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24819, 24818] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22114, 22096] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 6] Tasks: ['Single QA'] | Lens: [42366] → Tgt Spa: ['0.350'] [Step 87 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43232] → Tgt Spa: ['1.000'] [Step 87 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43232] → Tgt Spa: ['1.000'] [Step 87 / Rank 2] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [9003, 8998, 8999, 8999, 8999, 9001, 9001] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 87 / Rank 5] Tasks: ['Code'] | Lens: [37636] → Tgt Spa: ['1.000'] [Step 87 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19000, 18990, 18991] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 87 / Rank 3] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [9003, 8998, 8999, 8999, 8999, 9001, 9001] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 87 / Rank 1] Tasks: 
['Summarization', 'Code', 'Code'] | Lens: [19000, 18990, 18991] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 87 / Rank 4] Tasks: ['Code'] | Lens: [37636] → Tgt Spa: ['1.000'] [Step 87 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [25213, 25212] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [23892, 23884] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [25213, 25212] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13358, 13358, 13358, 13358] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 87 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39499] → Tgt Spa: ['1.000'] [Step 87 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13358, 13358, 13358, 13358] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 87 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [23892, 23884] → Tgt Spa: ['1.000', '1.000'] [Step 87 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39499] → Tgt Spa: ['1.000'] [Step 87 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41702] → Tgt Spa: ['1.000'] [Step 87 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41321] → Tgt Spa: ['1.000'] [Step 87 / Rank 2] Tasks: ['Summarization'] | Lens: [38319] → Tgt Spa: ['1.000'] [Step 87 / Rank 3] Tasks: ['Summarization'] | Lens: [38319] → Tgt Spa: ['1.000'] [Step 87 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41702] → Tgt Spa: ['1.000'] [Step 87 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41321] → Tgt Spa: ['1.000'] [Step 87 / Rank 0] Tasks: ['Summarization'] | Lens: [38317] → Tgt Spa: ['1.000'] [Step 87 / Rank 1] Tasks: ['Summarization'] | Lens: [38317] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:27:28,933 >> @ 87 | Loss: 2.0733 | LM: 1.9456 | Reg: 0.1277 | Spa(Avg): 0.450 [INFO|lh_trainer.py:797] 2026-02-16 22:27:28,933 >> Statistic -> Code | Spa: 0.439 | Tgt: 
1.000 | Z-Loss: 0.149 | [INFO|lh_trainer.py:797] 2026-02-16 22:27:28,933 >> Statistic -> In-Context | Spa: 0.477 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:27:28,933 >> Statistic -> MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:27:28,934 >> Statistic -> Single | Spa: 0.379 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:27:28,934 >> Statistic -> Summarization | Spa: 0.441 | Tgt: 1.000 | Z-Loss: 0.177 | [INFO|lh_trainer.py:810] 2026-02-16 22:27:28,935 >> [Micro-Log] {"loss": 2.0732738903413215, "lm_loss": 1.9455702646325033, "reg_loss": 0.12770364169652262, "model_sparsity(avg)": 0.4503623731434345, "Spa-In-Context Learning sparsity": 0.477182537317276, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15465655284268515, "Spa-Summarization sparsity": 0.4409722238779068, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17669376730918884, "Spa-Code sparsity": 0.43865738809108734, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14879939953486124, "Spa-Single QA sparsity": 0.3793402686715126, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.021165930469578598, "Spa-MultiHop QA sparsity": 0.48611109906976874, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.043015512214465576, "step": 87, "current_tau": 1.2630839347839355, "lambda1 Single QA": 0.51953125, "lambda2 MultiHop QA": 0.263671875, "lambda3 Summarization": 0.087890625, "lambda4 Code": 0.1865234375} [INFO|lh_trainer.py:331] 2026-02-16 22:27:42,144 >> {'loss': 12.4396, 'grad_norm': 1.8322398662567139, 'learning_rate': 0.0004845478355258377, 'epoch': 0.09268035808320169, 'num_input_tokens_seen': 216309292, 'completed': '29.33% (88 / 300)', 'remaining time': '9:52:25', 'throughput': '9117.50', 'gpu_mem_free': '14233MB', 'step': 88} [Step 88 / Rank 5] Tasks: ['Code'] | Lens: [48606] → Tgt Spa: 
['1.000'] [Step 88 / Rank 7] Tasks: ['Single QA'] | Lens: [60718] → Tgt Spa: ['0.350'] [Step 88 / Rank 4] Tasks: ['Code'] | Lens: [48606] → Tgt Spa: ['1.000'] [Step 88 / Rank 0] Tasks: ['Single QA'] | Lens: [65353] → Tgt Spa: ['0.350'] [Step 88 / Rank 1] Tasks: ['Single QA'] | Lens: [65353] → Tgt Spa: ['0.350'] [Step 88 / Rank 6] Tasks: ['Single QA'] | Lens: [60718] → Tgt Spa: ['0.350'] [Step 88 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11741, 11741, 11741, 11744, 11744] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 88 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11741, 11741, 11741, 11744, 11744] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 88 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [27457, 27465] → Tgt Spa: ['1.000', '1.000'] [Step 88 / Rank 7] Tasks: ['Code'] | Lens: [60317] → Tgt Spa: ['1.000'] [Step 88 / Rank 6] Tasks: ['Code'] | Lens: [60317] → Tgt Spa: ['1.000'] [Step 88 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [53613] → Tgt Spa: ['1.000'] [Step 88 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [53613] → Tgt Spa: ['1.000'] [Step 88 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [27457, 27465] → Tgt Spa: ['1.000', '1.000'] [Step 88 / Rank 1] Tasks: ['Code'] | Lens: [36215] → Tgt Spa: ['1.000'] [Step 88 / Rank 0] Tasks: ['Code'] | Lens: [36215] → Tgt Spa: ['1.000'] [Step 88 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [37287] → Tgt Spa: ['1.000'] [Step 88 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27639, 27640] → Tgt Spa: ['1.000', '1.000'] [Step 88 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [23810, 23820] → Tgt Spa: ['1.000', '1.000'] [Step 88 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [37287] → Tgt Spa: ['1.000'] [Step 88 / Rank 5] Tasks: ['Single QA'] | Lens: [50722] → Tgt Spa: ['0.350'] [Step 88 / Rank 4] Tasks: ['Single QA'] | Lens: 
[50722] → Tgt Spa: ['0.350'] [Step 88 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [23810, 23820] → Tgt Spa: ['1.000', '1.000'] [Step 88 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27639, 27640] → Tgt Spa: ['1.000', '1.000'] [Step 88 / Rank 3] Tasks: ['Single QA'] | Lens: [39317] → Tgt Spa: ['0.350'] [Step 88 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57275] → Tgt Spa: ['1.000'] [Step 88 / Rank 5] Tasks: ['Single QA'] | Lens: [44637] → Tgt Spa: ['0.350'] [Step 88 / Rank 6] Tasks: ['Single QA'] | Lens: [60741] → Tgt Spa: ['0.350'] [Step 88 / Rank 2] Tasks: ['Single QA'] | Lens: [39317] → Tgt Spa: ['0.350'] [Step 88 / Rank 7] Tasks: ['Single QA'] | Lens: [60741] → Tgt Spa: ['0.350'] [Step 88 / Rank 4] Tasks: ['Single QA'] | Lens: [44637] → Tgt Spa: ['0.350'] [Step 88 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57275] → Tgt Spa: ['1.000'] [Step 88 / Rank 4] Tasks: ['Summarization', 'Single QA', 'Code'] | Lens: [17674, 17658, 17666] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 88 / Rank 6] Tasks: ['Code'] | Lens: [45593] → Tgt Spa: ['1.000'] [Step 88 / Rank 5] Tasks: ['Summarization', 'Single QA', 'Code'] | Lens: [17674, 17658, 17666] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 88 / Rank 7] Tasks: ['Code'] | Lens: [45593] → Tgt Spa: ['1.000'] [Step 88 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60494] → Tgt Spa: ['1.000'] [Step 88 / Rank 0] Tasks: ['Single QA'] | Lens: [40221] → Tgt Spa: ['0.350'] [Step 88 / Rank 1] Tasks: ['Single QA'] | Lens: [40221] → Tgt Spa: ['0.350'] [Step 88 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60494] → Tgt Spa: ['1.000'] [Step 88 / Rank 5] Tasks: ['Single QA'] | Lens: [34808] → Tgt Spa: ['0.350'] [Step 88 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [19671, 19673, 19672] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 88 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [31635, 31653] → Tgt Spa: ['0.350', '1.000'] [Step 88 / Rank 6] Tasks: ['Single QA'] | Lens: 
[35678] → Tgt Spa: ['0.350'] [Step 88 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [19671, 19673, 19672] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 88 / Rank 7] Tasks: ['Single QA'] | Lens: [35678] → Tgt Spa: ['0.350'] [Step 88 / Rank 4] Tasks: ['Single QA'] | Lens: [34808] → Tgt Spa: ['0.350'] [Step 88 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [31635, 31653] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:30:19,420 >> @ 88 | Loss: 2.0743 | LM: 1.9752 | Reg: 0.0991 | Spa(Avg): 0.436 [INFO|lh_trainer.py:797] 2026-02-16 22:30:19,420 >> Statistic -> Code | Spa: 0.425 | Tgt: 1.000 | Z-Loss: 0.154 | [INFO|lh_trainer.py:797] 2026-02-16 22:30:19,420 >> Statistic -> In-Context | Spa: 0.483 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:30:19,420 >> Statistic -> MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:30:19,420 >> Statistic -> Single | Spa: 0.405 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:30:19,420 >> Statistic -> Summarization | Spa: 0.458 | Tgt: 1.000 | Z-Loss: 0.167 | [INFO|lh_trainer.py:810] 2026-02-16 22:30:19,422 >> [Micro-Log] {"loss": 2.0742595431705317, "lm_loss": 1.975203501060605, "reg_loss": 0.0990560242983823, "model_sparsity(avg)": 0.43572529902060825, "Spa-Single QA sparsity": 0.4045138843357563, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.032585455977823585, "Spa-Code sparsity": 0.424999988079071, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15441080778837205, "Spa-In-Context Learning sparsity": 0.482638880610466, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.15316687896847725, "Spa-Summarization sparsity": 0.4583333134651184, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1673312783241272, "Spa-MultiHop QA sparsity": 0.48611109906976874, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA 
log_z_loss": 0.043015512214465576, "step": 88, "current_tau": 1.2587248086929321, "lambda1 Single QA": 0.51953125, "lambda2 MultiHop QA": 0.265625, "lambda3 Summarization": 0.0888671875, "lambda4 Code": 0.1875} [INFO|lh_trainer.py:331] 2026-02-16 22:30:36,717 >> {'loss': 12.4456, 'grad_norm': 1.363206386566162, 'learning_rate': 0.0004833951082847898, 'epoch': 0.0937335439705108, 'num_input_tokens_seen': 218756170, 'completed': '29.67% (89 / 300)', 'remaining time': '9:49:54', 'throughput': '7008.18', 'gpu_mem_free': '8771MB', 'step': 89} [Step 89 / Rank 5] Tasks: ['Single QA'] | Lens: [34554] → Tgt Spa: ['0.350'] [Step 89 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41402] → Tgt Spa: ['1.000'] [Step 89 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [20844, 20845, 20844] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 89 / Rank 0] Tasks: ['Summarization', 'Single QA'] | Lens: [23222, 23205] → Tgt Spa: ['1.000', '0.350'] [Step 89 / Rank 4] Tasks: ['Single QA'] | Lens: [34554] → Tgt Spa: ['0.350'] [Step 89 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [20844, 20845, 20844] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 89 / Rank 1] Tasks: ['Summarization', 'Single QA'] | Lens: [23222, 23205] → Tgt Spa: ['1.000', '0.350'] [Step 89 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41402] → Tgt Spa: ['1.000'] [Step 89 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59960] → Tgt Spa: ['1.000'] [Step 89 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59960] → Tgt Spa: ['1.000'] [Step 89 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29863, 29864] → Tgt Spa: ['0.350', '0.350'] [Step 89 / Rank 4] Tasks: ['Single QA'] | Lens: [56724] → Tgt Spa: ['0.350'] [Step 89 / Rank 5] Tasks: ['Single QA'] | Lens: [56724] → Tgt Spa: ['0.350'] [Step 89 / Rank 6] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning'] | Lens: [20869, 20852, 20853] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 89 / Rank 0] 
Tasks: ['Single QA', 'Single QA'] | Lens: [29863, 29864] → Tgt Spa: ['0.350', '0.350'] [Step 89 / Rank 7] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning'] | Lens: [20869, 20852, 20853] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 89 / Rank 1] Tasks: ['Single QA', 'Summarization'] | Lens: [22553, 22573] → Tgt Spa: ['0.350', '1.000'] [Step 89 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [32879] → Tgt Spa: ['1.000'] [Step 89 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [32879] → Tgt Spa: ['1.000'] [Step 89 / Rank 3] Tasks: ['Code'] | Lens: [54140] → Tgt Spa: ['1.000'] [Step 89 / Rank 0] Tasks: ['Single QA', 'Summarization'] | Lens: [22553, 22573] → Tgt Spa: ['0.350', '1.000'] [Step 89 / Rank 7] Tasks: ['Code'] | Lens: [49297] → Tgt Spa: ['1.000'] [Step 89 / Rank 6] Tasks: ['Code'] | Lens: [49297] → Tgt Spa: ['1.000'] [Step 89 / Rank 2] Tasks: ['Code'] | Lens: [54140] → Tgt Spa: ['1.000'] [Step 89 / Rank 4] Tasks: ['Single QA'] | Lens: [54172] → Tgt Spa: ['0.350'] [Step 89 / Rank 3] Tasks: ['Single QA'] | Lens: [58276] → Tgt Spa: ['0.350'] [Step 89 / Rank 2] Tasks: ['Single QA'] | Lens: [58276] → Tgt Spa: ['0.350'] [Step 89 / Rank 1] Tasks: ['In-Context Learning', 'Code', 'Code'] | Lens: [21215, 21224, 21224] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 89 / Rank 5] Tasks: ['Single QA'] | Lens: [54172] → Tgt Spa: ['0.350'] [Step 89 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24327, 24330] → Tgt Spa: ['1.000', '0.350'] [Step 89 / Rank 0] Tasks: ['In-Context Learning', 'Code', 'Code'] | Lens: [21215, 21224, 21224] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 89 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24327, 24330] → Tgt Spa: ['1.000', '0.350'] [Step 89 / Rank 5] Tasks: ['Single QA'] | Lens: [49647] → Tgt Spa: ['0.350'] [Step 89 / Rank 7] Tasks: ['Single QA'] | Lens: [64839] → Tgt Spa: ['0.350'] [Step 89 / Rank 3] Tasks: ['Single QA'] | Lens: [47564] → Tgt Spa: ['0.350'] [Step 89 / Rank 0] Tasks: 
['Code'] | Lens: [53888] → Tgt Spa: ['1.000'] [Step 89 / Rank 2] Tasks: ['Single QA'] | Lens: [47564] → Tgt Spa: ['0.350'] [Step 89 / Rank 1] Tasks: ['Code'] | Lens: [53888] → Tgt Spa: ['1.000'] [Step 89 / Rank 4] Tasks: ['Single QA'] | Lens: [49647] → Tgt Spa: ['0.350'] [Step 89 / Rank 6] Tasks: ['Single QA'] | Lens: [64839] → Tgt Spa: ['0.350'] [Step 89 / Rank 5] Tasks: ['Code'] | Lens: [39793] → Tgt Spa: ['1.000'] [Step 89 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39329] → Tgt Spa: ['1.000'] [Step 89 / Rank 1] Tasks: ['Code', 'Single QA', 'Summarization'] | Lens: [17431, 17423, 17442] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 89 / Rank 0] Tasks: ['Code', 'Single QA', 'Summarization'] | Lens: [17431, 17423, 17442] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 89 / Rank 4] Tasks: ['Code'] | Lens: [39793] → Tgt Spa: ['1.000'] [Step 89 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39329] → Tgt Spa: ['1.000'] [Step 89 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41266] → Tgt Spa: ['1.000'] [Step 89 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41266] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:33:02,123 >> @ 89 | Loss: 2.0499 | LM: 1.9448 | Reg: 0.1050 | Spa(Avg): 0.435 [INFO|lh_trainer.py:797] 2026-02-16 22:33:02,123 >> Statistic -> Code | Spa: 0.411 | Tgt: 1.000 | Z-Loss: 0.160 | [INFO|lh_trainer.py:797] 2026-02-16 22:33:02,123 >> Statistic -> In-Context | Spa: 0.506 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:33:02,124 >> Statistic -> MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:33:02,124 >> Statistic -> Single | Spa: 0.407 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:33:02,124 >> Statistic -> Summarization | Spa: 0.440 | Tgt: 1.000 | Z-Loss: 0.180 | [INFO|lh_trainer.py:810] 2026-02-16 22:33:02,126 >> [Micro-Log] {"loss": 2.049856604387363, "lm_loss": 1.944812527236839, "reg_loss": 0.10504409847878075, "model_sparsity(avg)": 
0.43508872389793396, "Spa-Summarization sparsity": 0.44047618763787405, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1800154937165124, "Spa-Single QA sparsity": 0.40705126982468826, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.031548502192331046, "Spa-In-Context Learning sparsity": 0.5061728358268738, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14668535441160202, "Spa-Code sparsity": 0.41071427719933645, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.16045618057250977, "Spa-MultiHop QA sparsity": 0.48611109906976874, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.043015512214465576, "step": 89, "current_tau": 1.2543630599975586, "lambda1 Single QA": 0.51953125, "lambda2 MultiHop QA": 0.265625, "lambda3 Summarization": 0.08984375, "lambda4 Code": 0.1884765625} [INFO|lh_trainer.py:331] 2026-02-16 22:33:15,139 >> {'loss': 12.2991, 'grad_norm': 1.314345121383667, 'learning_rate': 0.00048220238999774226, 'epoch': 0.0947867298578199, 'num_input_tokens_seen': 221193636, 'completed': '30.00% (90 / 300)', 'remaining time': '9:46:44', 'throughput': '7692.96', 'gpu_mem_free': '10395MB', 'step': 90} [Step 90 / Rank 3] Tasks: ['Single QA'] | Lens: [42504] → Tgt Spa: ['0.350'] [Step 90 / Rank 4] Tasks: ['Single QA'] | Lens: [42751] → Tgt Spa: ['0.350'] [Step 90 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [23783, 23772] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 0] Tasks: ['Single QA'] | Lens: [55691] → Tgt Spa: ['0.350'] [Step 90 / Rank 2] Tasks: ['Single QA'] | Lens: [42504] → Tgt Spa: ['0.350'] [Step 90 / Rank 1] Tasks: ['Single QA'] | Lens: [55691] → Tgt Spa: ['0.350'] [Step 90 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [23783, 23772] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 5] Tasks: ['Single QA'] | Lens: [42751] → Tgt Spa: ['0.350'] [Step 90 / Rank 4] Tasks: ['Single QA'] | Lens: [58752] → Tgt Spa: 
['0.350'] [Step 90 / Rank 7] Tasks: ['Code', 'Single QA'] | Lens: [29321, 29316] → Tgt Spa: ['1.000', '0.350'] [Step 90 / Rank 6] Tasks: ['Code', 'Single QA'] | Lens: [29321, 29316] → Tgt Spa: ['1.000', '0.350'] [Step 90 / Rank 3] Tasks: ['Single QA'] | Lens: [43463] → Tgt Spa: ['0.350'] [Step 90 / Rank 5] Tasks: ['Single QA'] | Lens: [58752] → Tgt Spa: ['0.350'] [Step 90 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [24096, 24089] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [24096, 24089] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 2] Tasks: ['Single QA'] | Lens: [43463] → Tgt Spa: ['0.350'] [Step 90 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25796, 25796] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [26553, 26566] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [51713] → Tgt Spa: ['1.000'] [Step 90 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [26553, 26566] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [51713] → Tgt Spa: ['1.000'] [Step 90 / Rank 7] Tasks: ['Single QA'] | Lens: [49883] → Tgt Spa: ['0.350'] [Step 90 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25796, 25796] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 6] Tasks: ['Single QA'] | Lens: [49883] → Tgt Spa: ['0.350'] [Step 90 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56316] → Tgt Spa: ['1.000'] [Step 90 / Rank 6] Tasks: ['Single QA'] | Lens: [57502] → Tgt Spa: ['0.350'] [Step 90 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56316] → Tgt Spa: ['1.000'] [Step 90 / Rank 7] Tasks: ['Single QA'] | Lens: [57502] → Tgt Spa: ['0.350'] [Step 90 / Rank 1] Tasks: ['Single QA'] | Lens: [33747] → Tgt Spa: ['0.350'] [Step 90 / Rank 0] Tasks: ['Single QA'] | Lens: [33747] → Tgt Spa: ['0.350'] [Step 90 / Rank 3] Tasks: ['In-Context Learning'] | Lens: 
[40757] → Tgt Spa: ['1.000'] [Step 90 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40757] → Tgt Spa: ['1.000'] [Step 90 / Rank 3] Tasks: ['Single QA'] | Lens: [35091] → Tgt Spa: ['0.350'] [Step 90 / Rank 4] Tasks: ['Single QA'] | Lens: [45739] → Tgt Spa: ['0.350'] [Step 90 / Rank 7] Tasks: ['Code'] | Lens: [58686] → Tgt Spa: ['1.000'] [Step 90 / Rank 5] Tasks: ['Single QA'] | Lens: [45739] → Tgt Spa: ['0.350'] [Step 90 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [45241] → Tgt Spa: ['1.000'] [Step 90 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [45241] → Tgt Spa: ['1.000'] [Step 90 / Rank 2] Tasks: ['Single QA'] | Lens: [35091] → Tgt Spa: ['0.350'] [Step 90 / Rank 6] Tasks: ['Code'] | Lens: [58686] → Tgt Spa: ['1.000'] [Step 90 / Rank 3] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [11982, 11989, 11991, 11994, 11992] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000'] [Step 90 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [23463, 23471] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [23463, 23471] → Tgt Spa: ['1.000', '1.000'] [Step 90 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [22657, 22656] → Tgt Spa: ['0.350', '0.350'] [Step 90 / Rank 6] Tasks: ['Single QA'] | Lens: [36498] → Tgt Spa: ['0.350'] [Step 90 / Rank 2] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [11982, 11989, 11991, 11994, 11992] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000'] [Step 90 / Rank 7] Tasks: ['Single QA'] | Lens: [36498] → Tgt Spa: ['0.350'] [Step 90 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [22657, 22656] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:35:39,210 >> @ 90 | Loss: 2.1789 | LM: 2.0847 | Reg: 0.0943 | Spa(Avg): 0.429 [INFO|lh_trainer.py:797] 2026-02-16 22:35:39,210 >> Statistic -> Code | Spa: 0.437 | Tgt: 1.000 | Z-Loss: 0.152 | [INFO|lh_trainer.py:797] 2026-02-16 22:35:39,210 >> Statistic -> In-Context | Spa: 0.509 | Tgt: 0.000 |
Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:35:39,210 >> Statistic -> MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:35:39,210 >> Statistic -> Single | Spa: 0.399 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:35:39,210 >> Statistic -> Summarization | Spa: 0.493 | Tgt: 1.000 | Z-Loss: 0.151 | [INFO|lh_trainer.py:810] 2026-02-16 22:35:39,212 >> [Micro-Log] {"loss": 2.1789198853075504, "lm_loss": 2.0846651618679366, "reg_loss": 0.09425471218613286, "model_sparsity(avg)": 0.42934027190009755, "Spa-Single QA sparsity": 0.39880951387541635, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.041870620766920705, "Spa-Code sparsity": 0.43686868385835126, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15223777768286792, "Spa-In-Context Learning sparsity": 0.5086805522441864, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14615976437926292, "Spa-Summarization sparsity": 0.4930555522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15143820643424988, "Spa-MultiHop QA sparsity": 0.48611109906976874, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.043015512214465576, "step": 90, "current_tau": 1.25, "lambda1 Single QA": 0.5234375, "lambda2 MultiHop QA": 0.265625, "lambda3 Summarization": 0.0908203125, "lambda4 Code": 0.189453125} [INFO|lh_trainer.py:331] 2026-02-16 22:35:50,820 >> {'loss': 13.0735, 'grad_norm': 1.163728952407837, 'learning_rate': 0.0004809698850308334, 'epoch': 0.09583991574512901, 'num_input_tokens_seen': 223524870, 'completed': '30.33% (91 / 300)', 'remaining time': '9:43:29', 'throughput': '7487.23', 'gpu_mem_free': '12323MB', 'step': 91} [Step 91 / Rank 7] Tasks: ['Single QA'] | Lens: [58630] → Tgt Spa: ['0.350'] [Step 91 / Rank 4] Tasks: ['Single QA'] | Lens: [57393] → Tgt Spa: ['0.350'] [Step 91 / Rank 0] Tasks: ['In-Context Learning', 
'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [6026, 6031, 6025, 6025, 6028, 6028, 6032, 6032, 6034, 6033] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 91 / Rank 3] Tasks: ['Single QA'] | Lens: [49560] → Tgt Spa: ['0.350'] [Step 91 / Rank 1] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [6026, 6031, 6025, 6025, 6028, 6028, 6032, 6032, 6034, 6033] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 91 / Rank 2] Tasks: ['Single QA'] | Lens: [49560] → Tgt Spa: ['0.350'] [Step 91 / Rank 5] Tasks: ['Single QA'] | Lens: [57393] → Tgt Spa: ['0.350'] [Step 91 / Rank 6] Tasks: ['Single QA'] | Lens: [58630] → Tgt Spa: ['0.350'] [Step 91 / Rank 1] Tasks: ['Single QA'] | Lens: [57447] → Tgt Spa: ['0.350'] [Step 91 / Rank 0] Tasks: ['Single QA'] | Lens: [57447] → Tgt Spa: ['0.350'] [Step 91 / Rank 4] Tasks: ['Single QA'] | Lens: [46915] → Tgt Spa: ['0.350'] [Step 91 / Rank 2] Tasks: ['Code'] | Lens: [43529] → Tgt Spa: ['1.000'] [Step 91 / Rank 7] Tasks: ['Single QA'] | Lens: [42513] → Tgt Spa: ['0.350'] [Step 91 / Rank 3] Tasks: ['Code'] | Lens: [43529] → Tgt Spa: ['1.000'] [Step 91 / Rank 6] Tasks: ['Single QA'] | Lens: [42513] → Tgt Spa: ['0.350'] [Step 91 / Rank 5] Tasks: ['Single QA'] | Lens: [46915] → Tgt Spa: ['0.350'] [Step 91 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [19809, 19810, 19811] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 91 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56865] → Tgt Spa: ['1.000'] [Step 91 / Rank 0] Tasks: ['Single QA'] | Lens: [64713] → Tgt Spa: ['0.350'] [Step 91 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: 
[19809, 19810, 19811] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 91 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [23645, 23647] → Tgt Spa: ['1.000', '1.000'] [Step 91 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [23645, 23647] → Tgt Spa: ['1.000', '1.000'] [Step 91 / Rank 1] Tasks: ['Single QA'] | Lens: [64713] → Tgt Spa: ['0.350'] [Step 91 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56865] → Tgt Spa: ['1.000'] [Step 91 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59224] → Tgt Spa: ['1.000'] [Step 91 / Rank 3] Tasks: ['Code'] | Lens: [55225] → Tgt Spa: ['1.000'] [Step 91 / Rank 6] Tasks: ['Single QA'] | Lens: [42099] → Tgt Spa: ['0.350'] [Step 91 / Rank 7] Tasks: ['Single QA'] | Lens: [42099] → Tgt Spa: ['0.350'] [Step 91 / Rank 0] Tasks: ['Single QA'] | Lens: [34807] → Tgt Spa: ['0.350'] [Step 91 / Rank 1] Tasks: ['Single QA'] | Lens: [34807] → Tgt Spa: ['0.350'] [Step 91 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59224] → Tgt Spa: ['1.000'] [Step 91 / Rank 2] Tasks: ['Code'] | Lens: [55225] → Tgt Spa: ['1.000'] [Step 91 / Rank 4] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 91 / Rank 3] Tasks: ['Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Code', 'Code', 'Single QA'] | Lens: [3484, 3485, 3479, 3478, 3479, 3498, 3485, 3479, 3481, 3487, 3480, 3483, 3482, 3482, 3483, 3490, 3490, 3484] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 91 / Rank 5] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 91 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32503, 32503] → Tgt Spa: ['0.350', '0.350'] [Step 91 / Rank 0] Tasks: ['Code'] | Lens: [34273] → Tgt Spa: ['1.000'] [Step 91 / Rank 7] Tasks: ['Single QA', 
'Single QA'] | Lens: [32503, 32503] → Tgt Spa: ['0.350', '0.350'] [Step 91 / Rank 1] Tasks: ['Code'] | Lens: [34273] → Tgt Spa: ['1.000'] [Step 91 / Rank 2] Tasks: ['Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Code', 'Code', 'Single QA'] | Lens: [3484, 3485, 3479, 3478, 3479, 3498, 3485, 3479, 3481, 3487, 3480, 3483, 3482, 3482, 3483, 3490, 3490, 3484] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 91 / Rank 3] Tasks: ['Single QA'] | Lens: [34412] → Tgt Spa: ['0.350'] [Step 91 / Rank 5] Tasks: ['Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [6008, 6008, 6027, 6009, 6011, 6011, 6015, 6016, 6024, 6018] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 91 / Rank 0] Tasks: ['Single QA'] | Lens: [48485] → Tgt Spa: ['0.350'] [Step 91 / Rank 2] Tasks: ['Single QA'] | Lens: [34412] → Tgt Spa: ['0.350'] [Step 91 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32512, 32512] → Tgt Spa: ['0.350', '0.350'] [Step 91 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32512, 32512] → Tgt Spa: ['0.350', '0.350'] [Step 91 / Rank 1] Tasks: ['Single QA'] | Lens: [48485] → Tgt Spa: ['0.350'] [Step 91 / Rank 4] Tasks: ['Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [6008, 6008, 6027, 6009, 6011, 6011, 6015, 6016, 6024, 6018] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', 
'1.000', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:38:35,573 >> @ 91 | Loss: 1.7452 | LM: 1.6684 | Reg: 0.0769 | Spa(Avg): 0.440 [INFO|lh_trainer.py:797] 2026-02-16 22:38:35,573 >> Statistic -> Code | Spa: 0.454 | Tgt: 1.000 | Z-Loss: 0.147 | [INFO|lh_trainer.py:797] 2026-02-16 22:38:35,573 >> Statistic -> In-Context | Spa: 0.538 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:38:35,573 >> Statistic -> MultiHop | Spa: 0.514 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:38:35,574 >> Statistic -> Single | Spa: 0.425 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:38:35,574 >> Statistic -> Summarization | Spa: 0.431 | Tgt: 1.000 | Z-Loss: 0.186 | [INFO|lh_trainer.py:810] 2026-02-16 22:38:35,575 >> [Micro-Log] {"loss": 1.7452326770871878, "lm_loss": 1.6683751925593242, "reg_loss": 0.07685746789987509, "model_sparsity(avg)": 0.4396476236482461, "Spa-In-Context Learning sparsity": 0.5381944388151169, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1366031091660261, "Spa-Code sparsity": 0.4539930485188961, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14706276450306177, "Spa-Single QA sparsity": 0.42487372864376416, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.042226676511662925, "Spa-Summarization sparsity": 0.4305555522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.18598542362451553, "Spa-MultiHop QA sparsity": 0.5138888657093048, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.053980628959834576, "step": 91, "current_tau": 1.2456369400024414, "lambda1 Single QA": 0.5234375, "lambda2 MultiHop QA": 0.265625, "lambda3 Summarization": 0.091796875, "lambda4 Code": 0.1904296875} [INFO|lh_trainer.py:331] 2026-02-16 22:38:53,816 >> {'loss': 10.4714, 'grad_norm': 0.991970956325531, 'learning_rate': 0.00047969780456744436, 'epoch': 0.09689310163243813, 
'num_input_tokens_seen': 226066666, 'completed': '30.67% (92 / 300)', 'remaining time': '9:41:16', 'throughput': '6944.94', 'gpu_mem_free': '10497MB', 'step': 92} [Step 92 / Rank 5] Tasks: ['Single QA'] | Lens: [41328] → Tgt Spa: ['0.350'] [Step 92 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44274] → Tgt Spa: ['1.000'] [Step 92 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44274] → Tgt Spa: ['1.000'] [Step 92 / Rank 4] Tasks: ['Single QA'] | Lens: [41328] → Tgt Spa: ['0.350'] [Step 92 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [41582] → Tgt Spa: ['1.000'] [Step 92 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [21067, 21068, 21084] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 92 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [21067, 21068, 21084] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 92 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [41582] → Tgt Spa: ['1.000'] [Step 92 / Rank 1] Tasks: ['Single QA'] | Lens: [59936] → Tgt Spa: ['0.350'] [Step 92 / Rank 0] Tasks: ['Single QA'] | Lens: [59936] → Tgt Spa: ['0.350'] [Step 92 / Rank 6] Tasks: ['Code'] | Lens: [40717] → Tgt Spa: ['1.000'] [Step 92 / Rank 7] Tasks: ['Code'] | Lens: [40717] → Tgt Spa: ['1.000'] [Step 92 / Rank 4] Tasks: ['Single QA'] | Lens: [60035] → Tgt Spa: ['0.350'] [Step 92 / Rank 3] Tasks: ['Single QA'] | Lens: [46350] → Tgt Spa: ['0.350'] [Step 92 / Rank 2] Tasks: ['Single QA'] | Lens: [46350] → Tgt Spa: ['0.350'] [Step 92 / Rank 5] Tasks: ['Single QA'] | Lens: [60035] → Tgt Spa: ['0.350'] [Step 92 / Rank 5] Tasks: ['Single QA'] | Lens: [48485] → Tgt Spa: ['0.350'] [Step 92 / Rank 1] Tasks: ['Code'] | Lens: [58628] → Tgt Spa: ['1.000'] [Step 92 / Rank 0] Tasks: ['Code'] | Lens: [58628] → Tgt Spa: ['1.000'] [Step 92 / Rank 6] Tasks: ['Single QA'] | Lens: [65036] → Tgt Spa: ['0.350'] [Step 92 / Rank 4] Tasks: ['Single QA'] | Lens: [48485] → Tgt Spa: ['0.350'] [Step 92 / Rank 3] Tasks: ['Single QA'] | Lens: [59055] → Tgt Spa: 
['0.350'] [Step 92 / Rank 2] Tasks: ['Single QA'] | Lens: [59055] → Tgt Spa: ['0.350'] [Step 92 / Rank 7] Tasks: ['Single QA'] | Lens: [65036] → Tgt Spa: ['0.350'] [Step 92 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22630, 22629] → Tgt Spa: ['1.000', '1.000'] [Step 92 / Rank 0] Tasks: ['Single QA'] | Lens: [51354] → Tgt Spa: ['0.350'] [Step 92 / Rank 5] Tasks: ['Single QA'] | Lens: [59073] → Tgt Spa: ['0.350'] [Step 92 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [25448, 25448] → Tgt Spa: ['0.350', '0.350'] [Step 92 / Rank 4] Tasks: ['Single QA'] | Lens: [59073] → Tgt Spa: ['0.350'] [Step 92 / Rank 1] Tasks: ['Single QA'] | Lens: [51354] → Tgt Spa: ['0.350'] [Step 92 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22630, 22629] → Tgt Spa: ['1.000', '1.000'] [Step 92 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [25448, 25448] → Tgt Spa: ['0.350', '0.350'] [Step 92 / Rank 4] Tasks: ['Code'] | Lens: [35102] → Tgt Spa: ['1.000'] [Step 92 / Rank 5] Tasks: ['Code'] | Lens: [35102] → Tgt Spa: ['1.000'] [Step 92 / Rank 3] Tasks: ['In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5121, 5130, 5130, 5122, 5124, 5131, 5131, 5124, 5125, 5126, 5126, 5126] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 92 / Rank 0] Tasks: ['Code'] | Lens: [34054] → Tgt Spa: ['1.000'] [Step 92 / Rank 1] Tasks: ['Code'] | Lens: [34054] → Tgt Spa: ['1.000'] [Step 92 / Rank 6] Tasks: ['Single QA'] | Lens: [60531] → Tgt Spa: ['0.350'] [Step 92 / Rank 2] Tasks: ['In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5121, 5130, 5130, 5122, 5124, 5131, 5131, 
5124, 5125, 5126, 5126, 5126] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 92 / Rank 7] Tasks: ['Single QA'] | Lens: [60531] → Tgt Spa: ['0.350'] [Step 92 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16821, 16822, 16813] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 92 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16571, 16583, 16586] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 92 / Rank 0] Tasks: ['Single QA'] | Lens: [40711] → Tgt Spa: ['0.350'] [Step 92 / Rank 2] Tasks: ['Code'] | Lens: [61960] → Tgt Spa: ['1.000'] [Step 92 / Rank 3] Tasks: ['Code'] | Lens: [61960] → Tgt Spa: ['1.000'] [Step 92 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16571, 16583, 16586] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 92 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16821, 16822, 16813] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 92 / Rank 1] Tasks: ['Single QA'] | Lens: [40711] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:41:27,793 >> @ 92 | Loss: 2.0034 | LM: 1.9194 | Reg: 0.0839 | Spa(Avg): 0.453 [INFO|lh_trainer.py:797] 2026-02-16 22:41:27,793 >> Statistic -> Code | Spa: 0.470 | Tgt: 1.000 | Z-Loss: 0.142 | [INFO|lh_trainer.py:797] 2026-02-16 22:41:27,793 >> Statistic -> In-Context | Spa: 0.521 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:41:27,793 >> Statistic -> MultiHop | Spa: 0.514 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:41:27,793 >> Statistic -> Single | Spa: 0.415 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:41:27,793 >> Statistic -> Summarization | Spa: 0.503 | Tgt: 1.000 | Z-Loss: 0.149 | [INFO|lh_trainer.py:810] 2026-02-16 22:41:27,795 >> [Micro-Log] {"loss": 2.003378424793482, "lm_loss": 1.9194492027163506, "reg_loss": 0.08392922892623271, "model_sparsity(avg)": 0.4531732102235158, "Spa-In-Context 
Learning sparsity": 0.5208333253860473, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.14330857396125793, "Spa-Single QA sparsity": 0.41503266376607556, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.038836226415108234, "Spa-Code sparsity": 0.4696969552473588, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14245180988853628, "Spa-Summarization sparsity": 0.5027777671813964, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14859936237335206, "Spa-MultiHop QA sparsity": 0.5138888657093048, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.053980628959834576, "step": 92, "current_tau": 1.2412750720977783, "lambda1 Single QA": 0.5234375, "lambda2 MultiHop QA": 0.265625, "lambda3 Summarization": 0.09228515625, "lambda4 Code": 0.19140625} [INFO|lh_trainer.py:331] 2026-02-16 22:41:51,808 >> {'loss': 12.0203, 'grad_norm': 1.0605396032333374, 'learning_rate': 0.0004783863665720137, 'epoch': 0.09794628751974724, 'num_input_tokens_seen': 228525260, 'completed': '31.00% (93 / 300)', 'remaining time': '9:38:52', 'throughput': '6906.46', 'gpu_mem_free': '11839MB', 'step': 93} [Step 93 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [23706, 23699] → Tgt Spa: ['1.000', '1.000'] [Step 93 / Rank 7] Tasks: ['Code'] | Lens: [45121] → Tgt Spa: ['1.000'] [Step 93 / Rank 4] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16421, 16410, 16422] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 93 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [23706, 23699] → Tgt Spa: ['1.000', '1.000'] [Step 93 / Rank 5] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16421, 16410, 16422] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 93 / Rank 0] Tasks: ['Single QA'] | Lens: [62709] → Tgt Spa: ['0.350'] [Step 93 / Rank 6] Tasks: ['Code'] | Lens: [45121] → Tgt Spa: ['1.000'] [Step 93 / Rank 1] Tasks: ['Single QA'] | Lens: [62709] → Tgt 
Spa: ['0.350'] [Step 93 / Rank 3] Tasks: ['Single QA'] | Lens: [64588] → Tgt Spa: ['0.350'] [Step 93 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [25629, 25630] → Tgt Spa: ['0.350', '0.350'] [Step 93 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25629, 25630] → Tgt Spa: ['0.350', '0.350'] [Step 93 / Rank 5] Tasks: ['Single QA'] | Lens: [54160] → Tgt Spa: ['0.350'] [Step 93 / Rank 4] Tasks: ['Single QA'] | Lens: [54160] → Tgt Spa: ['0.350'] [Step 93 / Rank 2] Tasks: ['Single QA'] | Lens: [64588] → Tgt Spa: ['0.350'] [Step 93 / Rank 6] Tasks: ['Single QA'] | Lens: [50763] → Tgt Spa: ['0.350'] [Step 93 / Rank 7] Tasks: ['Single QA'] | Lens: [50763] → Tgt Spa: ['0.350'] [Step 93 / Rank 3] Tasks: ['Single QA'] | Lens: [58864] → Tgt Spa: ['0.350'] [Step 93 / Rank 2] Tasks: ['Single QA'] | Lens: [58864] → Tgt Spa: ['0.350'] [Step 93 / Rank 4] Tasks: ['Single QA'] | Lens: [64185] → Tgt Spa: ['0.350'] [Step 93 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [27933, 27942] → Tgt Spa: ['1.000', '1.000'] [Step 93 / Rank 5] Tasks: ['Single QA'] | Lens: [64185] → Tgt Spa: ['0.350'] [Step 93 / Rank 0] Tasks: ['Single QA'] | Lens: [54246] → Tgt Spa: ['0.350'] [Step 93 / Rank 1] Tasks: ['Single QA'] | Lens: [54246] → Tgt Spa: ['0.350'] [Step 93 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [27933, 27942] → Tgt Spa: ['1.000', '1.000'] [Step 93 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54715] → Tgt Spa: ['1.000'] [Step 93 / Rank 5] Tasks: ['Single QA'] | Lens: [46720] → Tgt Spa: ['0.350'] [Step 93 / Rank 7] Tasks: ['Code', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [7064, 7067, 7060, 7063, 7063, 7070, 7064, 7071, 7065] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 93 / Rank 4] Tasks: ['Single QA'] | Lens: [46720] → Tgt Spa: ['0.350'] [Step 93 / Rank 6] Tasks: ['Code', 'Code', 'Single QA', 'In-Context 
Learning', 'In-Context Learning', 'Code', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [7064, 7067, 7060, 7063, 7063, 7070, 7064, 7071, 7065] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 93 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56523] → Tgt Spa: ['1.000'] [Step 93 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54715] → Tgt Spa: ['1.000'] [Step 93 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56523] → Tgt Spa: ['1.000'] [Step 93 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1739, 1740, 1757, 1757, 1757, 1739, 1740, 1758, 1741, 1740, 1740, 1739, 1740, 1742, 1740, 1741, 1742, 1761, 1744, 1745, 1744, 1743, 1763, 1745, 1745, 1745, 1765, 1746, 1746, 1747, 1746, 1746, 1767, 1748, 1749, 1747, 1767] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 93 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42262] → Tgt Spa: ['1.000'] [Step 93 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18543, 18556, 18558] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 93 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22035, 22037] 
→ Tgt Spa: ['1.000', '1.000'] [Step 93 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18543, 18556, 18558] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 93 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1739, 1740, 1757, 1757, 1757, 1739, 1740, 1758, 1741, 1740, 1740, 1739, 1740, 1742, 1740, 1741, 1742, 1761, 1744, 1745, 1744, 1743, 1763, 1745, 1745, 1745, 1765, 1746, 1746, 1747, 1746, 1746, 1767, 1748, 1749, 1747, 1767] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 93 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22035, 22037] → Tgt Spa: ['1.000', '1.000'] [Step 93 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42262] → Tgt Spa: ['1.000'] [Step 93 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61752] → Tgt Spa: ['1.000'] [Step 93 / Rank 1] Tasks: ['Code'] | Lens: [45382] → Tgt Spa: ['1.000'] [Step 93 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [28269, 28269] → Tgt Spa: ['0.350', '0.350'] [Step 93 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [28269, 28269] → Tgt Spa: ['0.350', '0.350'] [Step 93 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [61752] → Tgt 
Spa: ['1.000'] [Step 93 / Rank 0] Tasks: ['Code'] | Lens: [45382] → Tgt Spa: ['1.000'] [Step 93 / Rank 6] Tasks: ['MultiHop QA'] | Lens: [65331] → Tgt Spa: ['0.350'] [Step 93 / Rank 7] Tasks: ['MultiHop QA'] | Lens: [65331] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:44:29,620 >> @ 93 | Loss: 2.0210 | LM: 1.9412 | Reg: 0.0798 | Spa(Avg): 0.464 [INFO|lh_trainer.py:797] 2026-02-16 22:44:29,620 >> Statistic -> Code | Spa: 0.476 | Tgt: 1.000 | Z-Loss: 0.141 | [INFO|lh_trainer.py:797] 2026-02-16 22:44:29,620 >> Statistic -> In-Context | Spa: 0.542 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:44:29,620 >> Statistic -> MultiHop | Spa: 0.499 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:44:29,621 >> Statistic -> Single | Spa: 0.407 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:44:29,621 >> Statistic -> Summarization | Spa: 0.494 | Tgt: 1.000 | Z-Loss: 0.154 | [INFO|lh_trainer.py:810] 2026-02-16 22:44:29,623 >> [Micro-Log] {"loss": 2.02097064598153, "lm_loss": 1.9411724956395726, "reg_loss": 0.07979815955817078, "model_sparsity(avg)": 0.46380233640472096, "Spa-Single QA sparsity": 0.4067460298538208, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03204058692790568, "Spa-In-Context Learning sparsity": 0.5416666594418612, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13673478364944458, "Spa-Code sparsity": 0.47638886570930483, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.140992633998394, "Spa-Summarization sparsity": 0.4935897451180678, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15386395099071357, "Spa-MultiHop QA sparsity": 0.4985632135950286, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0480122153872046, "step": 93, "current_tau": 1.2369160652160645, "lambda1 Single QA": 0.5234375, "lambda2 MultiHop QA": 0.265625, "lambda3 
Summarization": 0.09326171875, "lambda4 Code": 0.1923828125} [INFO|lh_trainer.py:331] 2026-02-16 22:44:57,069 >> {'loss': 12.1258, 'grad_norm': 1.0144538879394531, 'learning_rate': 0.000477035795752691, 'epoch': 0.09899947340705635, 'num_input_tokens_seen': 231156516, 'completed': '31.33% (94 / 300)', 'remaining time': '9:36:42', 'throughput': '7101.51', 'gpu_mem_free': '11455MB', 'step': 94} [Step 94 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44303] → Tgt Spa: ['1.000'] [Step 94 / Rank 7] Tasks: ['Single QA'] | Lens: [65061] → Tgt Spa: ['0.350'] [Step 94 / Rank 4] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18109, 18120, 18121] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 94 / Rank 3] Tasks: ['Single QA'] | Lens: [52256] → Tgt Spa: ['0.350'] [Step 94 / Rank 5] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18109, 18120, 18121] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 94 / Rank 2] Tasks: ['Single QA'] | Lens: [52256] → Tgt Spa: ['0.350'] [Step 94 / Rank 6] Tasks: ['Single QA'] | Lens: [65061] → Tgt Spa: ['0.350'] [Step 94 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44303] → Tgt Spa: ['1.000'] [Step 94 / Rank 4] Tasks: ['Single QA'] | Lens: [58133] → Tgt Spa: ['0.350'] [Step 94 / Rank 2] Tasks: ['Single QA'] | Lens: [60522] → Tgt Spa: ['0.350'] [Step 94 / Rank 7] Tasks: ['Single QA'] | Lens: [54441] → Tgt Spa: ['0.350'] [Step 94 / Rank 6] Tasks: ['Single QA'] | Lens: [54441] → Tgt Spa: ['0.350'] [Step 94 / Rank 3] Tasks: ['Single QA'] | Lens: [60522] → Tgt Spa: ['0.350'] [Step 94 / Rank 5] Tasks: ['Single QA'] | Lens: [58133] → Tgt Spa: ['0.350'] [Step 94 / Rank 1] Tasks: ['Single QA', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'Code', 'Single QA', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6044, 6046, 6053, 6049, 6050, 6057, 6052, 6060, 6054, 6057] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 94 / Rank 0] Tasks: ['Single QA', 
'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'Code', 'Single QA', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6044, 6046, 6053, 6049, 6050, 6057, 6052, 6060, 6054, 6057] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 94 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32003, 32004] → Tgt Spa: ['0.350', '0.350'] [Step 94 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32003, 32004] → Tgt Spa: ['0.350', '0.350'] [Step 94 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28891, 28911] → Tgt Spa: ['1.000', '1.000'] [Step 94 / Rank 2] Tasks: ['Code'] | Lens: [59914] → Tgt Spa: ['1.000'] [Step 94 / Rank 0] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16544, 16555, 16558] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 94 / Rank 3] Tasks: ['Code'] | Lens: [59914] → Tgt Spa: ['1.000'] [Step 94 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16544, 16555, 16558] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 94 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28891, 28911] → Tgt Spa: ['1.000', '1.000'] [Step 94 / Rank 0] Tasks: ['Single QA'] | Lens: [57250] → Tgt Spa: ['0.350'] [Step 94 / Rank 1] Tasks: ['Single QA'] | Lens: [57250] → Tgt Spa: ['0.350'] [Step 94 / Rank 5] Tasks: ['Single QA'] | Lens: [33977] → Tgt Spa: ['0.350'] [Step 94 / Rank 4] Tasks: ['Single QA'] | Lens: [33977] → Tgt Spa: ['0.350'] [Step 94 / Rank 7] Tasks: ['Single QA'] | Lens: [60717] → Tgt Spa: ['0.350'] [Step 94 / Rank 3] Tasks: ['Single QA'] | Lens: [60977] → Tgt Spa: ['0.350'] [Step 94 / Rank 6] Tasks: ['Single QA'] | Lens: [60717] → Tgt Spa: ['0.350'] [Step 94 / Rank 2] Tasks: ['Single QA'] | Lens: [60977] → Tgt Spa: ['0.350'] [Step 94 / Rank 4] Tasks: ['Summarization', 'Single QA', 'Code'] | Lens: [17191, 17174, 17181] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 94 / Rank 0] Tasks: ['Single QA'] | Lens: [64895] → Tgt Spa: 
['0.350'] [Step 94 / Rank 5] Tasks: ['Summarization', 'Single QA', 'Code'] | Lens: [17191, 17174, 17181] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 94 / Rank 2] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [5680, 5681, 5682, 5682, 5690, 5683, 5686, 5689, 5689, 5691, 5697] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 94 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [47840] → Tgt Spa: ['1.000'] [Step 94 / Rank 3] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [5680, 5681, 5682, 5682, 5690, 5683, 5686, 5689, 5689, 5691, 5697] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 94 / Rank 1] Tasks: ['Single QA'] | Lens: [64895] → Tgt Spa: ['0.350'] [Step 94 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [47840] → Tgt Spa: ['1.000'] [Step 94 / Rank 4] Tasks: ['Single QA'] | Lens: [54436] → Tgt Spa: ['0.350'] [Step 94 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57542] → Tgt Spa: ['1.000'] [Step 94 / Rank 0] Tasks: ['Single QA'] | Lens: [49993] → Tgt Spa: ['0.350'] [Step 94 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [62331] → Tgt Spa: ['1.000'] [Step 94 / Rank 1] Tasks: ['Single QA'] | Lens: [49993] → Tgt Spa: ['0.350'] [Step 94 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57542] → Tgt Spa: ['1.000'] [Step 94 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [62331] → Tgt Spa: ['1.000'] [Step 94 / Rank 5] Tasks: ['Single QA'] | Lens: [54436] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:47:47,786 >> @ 94 | Loss: 2.2380 | LM: 2.1545 | Reg: 0.0835 | Spa(Avg): 0.460 
[INFO|lh_trainer.py:797] 2026-02-16 22:47:47,786 >> Statistic -> Code | Spa: 0.468 | Tgt: 1.000 | Z-Loss: 0.145 | [INFO|lh_trainer.py:797] 2026-02-16 22:47:47,786 >> Statistic -> In-Context | Spa: 0.545 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:47:47,786 >> Statistic -> MultiHop | Spa: 0.499 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:47:47,786 >> Statistic -> Single | Spa: 0.438 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:47:47,786 >> Statistic -> Summarization | Spa: 0.463 | Tgt: 1.000 | Z-Loss: 0.170 | [INFO|lh_trainer.py:810] 2026-02-16 22:47:47,788 >> [Micro-Log] {"loss": 2.237989446769158, "lm_loss": 2.1545068584382534, "reg_loss": 0.08348259304572518, "model_sparsity(avg)": 0.45975904787580174, "Spa-In-Context Learning sparsity": 0.5446428571428571, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1366512972329344, "Spa-Single QA sparsity": 0.4381313107230447, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05068336668508974, "Spa-Code sparsity": 0.46759257713953656, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14467530366447237, "Spa-Summarization sparsity": 0.46296295523643494, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1701455463965734, "Spa-MultiHop QA sparsity": 0.4985632135950286, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0480122153872046, "step": 94, "current_tau": 1.2325608730316162, "lambda1 Single QA": 0.5234375, "lambda2 MultiHop QA": 0.267578125, "lambda3 Summarization": 0.09423828125, "lambda4 Code": 0.193359375} [INFO|lh_trainer.py:331] 2026-02-16 22:48:12,636 >> {'loss': 13.4279, 'grad_norm': 0.8651906251907349, 'learning_rate': 0.0004756463235228331, 'epoch': 0.10005265929436545, 'num_input_tokens_seen': 233846560, 'completed': '31.67% (95 / 300)', 'remaining time': '9:34:54', 'throughput': '6877.54', 
'gpu_mem_free': '10189MB', 'step': 95} [Step 95 / Rank 3] Tasks: ['Single QA'] | Lens: [35809] → Tgt Spa: ['0.350'] [Step 95 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27559, 27560] → Tgt Spa: ['1.000', '1.000'] [Step 95 / Rank 4] Tasks: ['Single QA'] | Lens: [42035] → Tgt Spa: ['0.350'] [Step 95 / Rank 5] Tasks: ['Single QA'] | Lens: [42035] → Tgt Spa: ['0.350'] [Step 95 / Rank 2] Tasks: ['Single QA'] | Lens: [35809] → Tgt Spa: ['0.350'] [Step 95 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27559, 27560] → Tgt Spa: ['1.000', '1.000'] [Step 95 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [23561, 23570] → Tgt Spa: ['1.000', '1.000'] [Step 95 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [23561, 23570] → Tgt Spa: ['1.000', '1.000'] [Step 95 / Rank 5] Tasks: ['Single QA'] | Lens: [49178] → Tgt Spa: ['0.350'] [Step 95 / Rank 6] Tasks: ['Code'] | Lens: [44975] → Tgt Spa: ['1.000'] [Step 95 / Rank 1] Tasks: ['Single QA'] | Lens: [50596] → Tgt Spa: ['0.350'] [Step 95 / Rank 0] Tasks: ['Single QA'] | Lens: [50596] → Tgt Spa: ['0.350'] [Step 95 / Rank 4] Tasks: ['Single QA'] | Lens: [49178] → Tgt Spa: ['0.350'] [Step 95 / Rank 7] Tasks: ['Code'] | Lens: [44975] → Tgt Spa: ['1.000'] [Step 95 / Rank 3] Tasks: ['Code'] | Lens: [33624] → Tgt Spa: ['1.000'] [Step 95 / Rank 2] Tasks: ['Code'] | Lens: [33624] → Tgt Spa: ['1.000'] [Step 95 / Rank 6] Tasks: ['Code'] | Lens: [37557] → Tgt Spa: ['1.000'] [Step 95 / Rank 3] Tasks: ['Code'] | Lens: [64180] → Tgt Spa: ['1.000'] [Step 95 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [24414, 24421] → Tgt Spa: ['0.350', '1.000'] [Step 95 / Rank 2] Tasks: ['Code'] | Lens: [64180] → Tgt Spa: ['1.000'] [Step 95 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42268] → Tgt Spa: ['1.000'] [Step 95 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42268] → Tgt Spa: ['1.000'] [Step 95 / Rank 7] Tasks: ['Code'] | Lens: [37557] → Tgt Spa: ['1.000'] [Step 95 / Rank 0] 
Tasks: ['Single QA', 'Code'] | Lens: [24414, 24421] → Tgt Spa: ['0.350', '1.000'] [Step 95 / Rank 7] Tasks: ['Single QA'] | Lens: [55704] → Tgt Spa: ['0.350'] [Step 95 / Rank 4] Tasks: ['Single QA'] | Lens: [49595] → Tgt Spa: ['0.350'] [Step 95 / Rank 5] Tasks: ['Single QA'] | Lens: [49595] → Tgt Spa: ['0.350'] [Step 95 / Rank 0] Tasks: ['Single QA'] | Lens: [57583] → Tgt Spa: ['0.350'] [Step 95 / Rank 6] Tasks: ['Single QA'] | Lens: [55704] → Tgt Spa: ['0.350'] [Step 95 / Rank 2] Tasks: ['Single QA'] | Lens: [57935] → Tgt Spa: ['0.350'] [Step 95 / Rank 3] Tasks: ['Single QA'] | Lens: [57935] → Tgt Spa: ['0.350'] [Step 95 / Rank 1] Tasks: ['Single QA'] | Lens: [57583] → Tgt Spa: ['0.350'] [Step 95 / Rank 5] Tasks: ['Code'] | Lens: [61995] → Tgt Spa: ['1.000'] [Step 95 / Rank 1] Tasks: ['Single QA'] | Lens: [52802] → Tgt Spa: ['0.350'] [Step 95 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24439, 24458] → Tgt Spa: ['1.000', '1.000'] [Step 95 / Rank 6] Tasks: ['Code'] | Lens: [41722] → Tgt Spa: ['1.000'] [Step 95 / Rank 7] Tasks: ['Code'] | Lens: [41722] → Tgt Spa: ['1.000'] [Step 95 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24439, 24458] → Tgt Spa: ['1.000', '1.000'] [Step 95 / Rank 4] Tasks: ['Code'] | Lens: [61995] → Tgt Spa: ['1.000'] [Step 95 / Rank 0] Tasks: ['Single QA'] | Lens: [52802] → Tgt Spa: ['0.350'] [Step 95 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25461, 25462] → Tgt Spa: ['1.000', '1.000'] [Step 95 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [47712] → Tgt Spa: ['1.000'] [Step 95 / Rank 4] Tasks: ['Single QA'] | Lens: [37052] → Tgt Spa: ['0.350'] [Step 95 / Rank 5] Tasks: ['Single QA'] | Lens: [37052] → Tgt Spa: ['0.350'] [Step 95 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25461, 25462] → Tgt Spa: ['1.000', '1.000'] [Step 95 / Rank 0] Tasks: ['Code'] | Lens: [48841] → Tgt Spa: ['1.000'] [Step 95 / Rank 1] Tasks: ['Code'] | Lens: [48841] → 
Tgt Spa: ['1.000'] [Step 95 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [47712] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 22:50:34,487 >> @ 95 | Loss: 1.9287 | LM: 1.8355 | Reg: 0.0932 | Spa(Avg): 0.456 [INFO|lh_trainer.py:797] 2026-02-16 22:50:34,487 >> Statistic -> Code | Spa: 0.461 | Tgt: 1.000 | Z-Loss: 0.147 | [INFO|lh_trainer.py:797] 2026-02-16 22:50:34,488 >> Statistic -> In-Context | Spa: 0.562 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:50:34,488 >> Statistic -> MultiHop | Spa: 0.499 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:50:34,488 >> Statistic -> Single | Spa: 0.400 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:50:34,488 >> Statistic -> Summarization | Spa: 0.528 | Tgt: 1.000 | Z-Loss: 0.137 | [INFO|lh_trainer.py:810] 2026-02-16 22:50:34,489 >> [Micro-Log] {"loss": 1.928677945708235, "lm_loss": 1.8354601704825957, "reg_loss": 0.09321778638210769, "model_sparsity(avg)": 0.4560185099641482, "Spa-In-Context Learning sparsity": 0.5625, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13090444635599852, "Spa-Single QA sparsity": 0.40025251561945135, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03096823220733892, "Spa-Code sparsity": 0.4614197346899245, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14739324814743465, "Spa-Summarization sparsity": 0.5277777910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13686040043830872, "Spa-MultiHop QA sparsity": 0.4985632135950286, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0480122153872046, "step": 95, "current_tau": 1.2282110452651978, "lambda1 Single QA": 0.52734375, "lambda2 MultiHop QA": 0.267578125, "lambda3 Summarization": 0.09521484375, "lambda4 Code": 0.1943359375} [INFO|lh_trainer.py:331] 2026-02-16 22:50:51,236 >> {'loss': 11.5721, 'grad_norm': 
1.388136625289917, 'learning_rate': 0.0004742181879613535, 'epoch': 0.10110584518167456, 'num_input_tokens_seen': 236170696, 'completed': '32.00% (96 / 300)', 'remaining time': '9:31:45', 'throughput': '7327.02', 'gpu_mem_free': '10363MB', 'step': 96} [Step 96 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [31519, 31514] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 4] Tasks: ['Single QA'] | Lens: [50597] → Tgt Spa: ['0.350'] [Step 96 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [31519, 31514] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 3] Tasks: ['Single QA'] | Lens: [56498] → Tgt Spa: ['0.350'] [Step 96 / Rank 5] Tasks: ['Single QA'] | Lens: [50597] → Tgt Spa: ['0.350'] [Step 96 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [16055, 16056, 16056, 16058] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 96 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [16055, 16056, 16056, 16058] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 96 / Rank 2] Tasks: ['Single QA'] | Lens: [56498] → Tgt Spa: ['0.350'] [Step 96 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [23199, 23200] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 1] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [17194, 17202, 17203] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 96 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [23199, 23200] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25520, 25521] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 6] Tasks: ['Single QA'] | Lens: [38849] → Tgt Spa: ['0.350'] [Step 96 / Rank 0] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [17194, 17202, 17203] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 96 / Rank 7] Tasks: ['Single QA'] | Lens: [38849] → Tgt Spa: ['0.350'] [Step 96 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25520, 25521] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 5] Tasks: ['Code'] | Lens: [53425] → 
Tgt Spa: ['1.000'] [Step 96 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24132, 24133] → Tgt Spa: ['0.350', '0.350'] [Step 96 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24408, 24408] → Tgt Spa: ['0.350', '1.000'] [Step 96 / Rank 4] Tasks: ['Code'] | Lens: [53425] → Tgt Spa: ['1.000'] [Step 96 / Rank 6] Tasks: ['Single QA'] | Lens: [33298] → Tgt Spa: ['0.350'] [Step 96 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24408, 24408] → Tgt Spa: ['0.350', '1.000'] [Step 96 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24132, 24133] → Tgt Spa: ['0.350', '0.350'] [Step 96 / Rank 7] Tasks: ['Single QA'] | Lens: [33298] → Tgt Spa: ['0.350'] [Step 96 / Rank 4] Tasks: ['Summarization', 'Code', 'Single QA', 'Code'] | Lens: [15784, 15786, 15783, 15794] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000'] [Step 96 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [58059] → Tgt Spa: ['1.000'] [Step 96 / Rank 6] Tasks: ['Code'] | Lens: [35186] → Tgt Spa: ['1.000'] [Step 96 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64911] → Tgt Spa: ['1.000'] [Step 96 / Rank 5] Tasks: ['Summarization', 'Code', 'Single QA', 'Code'] | Lens: [15784, 15786, 15783, 15794] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000'] [Step 96 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [58059] → Tgt Spa: ['1.000'] [Step 96 / Rank 7] Tasks: ['Code'] | Lens: [35186] → Tgt Spa: ['1.000'] [Step 96 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64911] → Tgt Spa: ['1.000'] [Step 96 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [9451, 9460, 9453, 9455, 9464, 9459] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 96 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39993] → Tgt Spa: ['1.000'] [Step 96 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [9451, 9460, 9453, 9455, 9464, 9459] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] 
[Step 96 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25164, 25166] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [45447] → Tgt Spa: ['1.000'] [Step 96 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39993] → Tgt Spa: ['1.000'] [Step 96 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25164, 25166] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [45447] → Tgt Spa: ['1.000'] [Step 96 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39760] → Tgt Spa: ['1.000'] [Step 96 / Rank 5] Tasks: ['Code'] | Lens: [37825] → Tgt Spa: ['1.000'] [Step 96 / Rank 4] Tasks: ['Code'] | Lens: [37825] → Tgt Spa: ['1.000'] [Step 96 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39760] → Tgt Spa: ['1.000'] [Step 96 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24421, 24422] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 2] Tasks: ['Single QA'] | Lens: [43059] → Tgt Spa: ['0.350'] [Step 96 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24421, 24422] → Tgt Spa: ['1.000', '1.000'] [Step 96 / Rank 3] Tasks: ['Single QA'] | Lens: [43059] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:53:01,381 >> @ 96 | Loss: 2.0062 | LM: 1.9054 | Reg: 0.1008 | Spa(Avg): 0.475 [INFO|lh_trainer.py:797] 2026-02-16 22:53:01,381 >> Statistic -> Code | Spa: 0.456 | Tgt: 1.000 | Z-Loss: 0.150 | [INFO|lh_trainer.py:797] 2026-02-16 22:53:01,381 >> Statistic -> In-Context | Spa: 0.560 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:53:01,381 >> Statistic -> MultiHop | Spa: 0.499 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:53:01,381 >> Statistic -> Single | Spa: 0.424 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:53:01,381 >> Statistic -> Summarization | Spa: 0.514 | Tgt: 1.000 | Z-Loss: 0.145 | [INFO|lh_trainer.py:810] 2026-02-16 22:53:01,383 >> 
[Micro-Log] {"loss": 2.006172655771176, "lm_loss": 1.9054118543863297, "reg_loss": 0.10076079779537395, "model_sparsity(avg)": 0.4754533109565576, "Spa-Single QA sparsity": 0.423611107799742, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04283154150471091, "Spa-Code sparsity": 0.45601850748062134, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1500380796690782, "Spa-In-Context Learning sparsity": 0.5601851791143417, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1326600213845571, "Spa-Summarization sparsity": 0.5138888955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14508584141731262, "Spa-MultiHop QA sparsity": 0.4985632135950286, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0480122153872046, "step": 96, "current_tau": 1.2238678932189941, "lambda1 Single QA": 0.52734375, "lambda2 MultiHop QA": 0.267578125, "lambda3 Summarization": 0.095703125, "lambda4 Code": 0.1953125} [INFO|lh_trainer.py:331] 2026-02-16 22:53:15,796 >> {'loss': 12.037, 'grad_norm': 1.3920540809631348, 'learning_rate': 0.00047275163377192886, 'epoch': 0.10215903106898368, 'num_input_tokens_seen': 238549390, 'completed': '32.33% (97 / 300)', 'remaining time': '9:28:07', 'throughput': '8227.36', 'gpu_mem_free': '12801MB', 'step': 97} [Step 97 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57033] → Tgt Spa: ['1.000'] [Step 97 / Rank 1] Tasks: ['Single QA'] | Lens: [55621] → Tgt Spa: ['0.350'] [Step 97 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24664, 24684] → Tgt Spa: ['1.000', '1.000'] [Step 97 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24664, 24684] → Tgt Spa: ['1.000', '1.000'] [Step 97 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57033] → Tgt Spa: ['1.000'] [Step 97 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23282, 23282] → Tgt Spa: ['1.000', '1.000'] [Step 
97 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23282, 23282] → Tgt Spa: ['1.000', '1.000'] [Step 97 / Rank 0] Tasks: ['Single QA'] | Lens: [55621] → Tgt Spa: ['0.350'] [Step 97 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11291, 11287, 11302, 11302, 11303] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350'] [Step 97 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [36002] → Tgt Spa: ['1.000'] [Step 97 / Rank 2] Tasks: ['Single QA'] | Lens: [65045] → Tgt Spa: ['0.350'] [Step 97 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [28913, 28914] → Tgt Spa: ['0.350', '1.000'] [Step 97 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [36002] → Tgt Spa: ['1.000'] [Step 97 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11291, 11287, 11302, 11302, 11303] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350'] [Step 97 / Rank 3] Tasks: ['Single QA'] | Lens: [65045] → Tgt Spa: ['0.350'] [Step 97 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [28913, 28914] → Tgt Spa: ['0.350', '1.000'] [Step 97 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44798] → Tgt Spa: ['1.000'] [Step 97 / Rank 7] Tasks: ['Single QA'] | Lens: [34556] → Tgt Spa: ['0.350'] [Step 97 / Rank 5] Tasks: ['Single QA'] | Lens: [33971] → Tgt Spa: ['0.350'] [Step 97 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44798] → Tgt Spa: ['1.000'] [Step 97 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [19160, 19160, 19160] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 97 / Rank 4] Tasks: ['Single QA'] | Lens: [33971] → Tgt Spa: ['0.350'] [Step 97 / Rank 6] Tasks: ['Single QA'] | Lens: [34556] → Tgt Spa: ['0.350'] [Step 97 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [19160, 19160, 19160] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 97 / Rank 4] Tasks: ['Single QA'] | Lens: [33979] → Tgt Spa: ['0.350'] [Step 97 / Rank 0] Tasks: ['Single 
QA', 'In-Context Learning'] | Lens: [29976, 29976] → Tgt Spa: ['0.350', '1.000'] [Step 97 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29976, 29976] → Tgt Spa: ['0.350', '1.000'] [Step 97 / Rank 2] Tasks: ['Code'] | Lens: [36591] → Tgt Spa: ['1.000'] [Step 97 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18059, 18072, 18072] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 97 / Rank 3] Tasks: ['Code'] | Lens: [36591] → Tgt Spa: ['1.000'] [Step 97 / Rank 5] Tasks: ['Single QA'] | Lens: [33979] → Tgt Spa: ['0.350'] [Step 97 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18059, 18072, 18072] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 97 / Rank 4] Tasks: ['Single QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1814, 1832, 1833, 1833, 1834, 1834, 1818, 1816, 1815, 1817, 1835, 1835, 1817, 1817, 1818, 1821, 1818, 1820, 1820, 1820, 1827, 1840, 1840, 1822, 1820, 1821, 1822, 1823, 1839, 1840, 1840, 1824, 1824, 1826, 1825] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 97 / Rank 5] Tasks: ['Single QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 
'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1814, 1832, 1833, 1833, 1834, 1834, 1818, 1816, 1815, 1817, 1835, 1835, 1817, 1817, 1818, 1821, 1818, 1820, 1820, 1820, 1827, 1840, 1840, 1822, 1820, 1821, 1822, 1823, 1839, 1840, 1840, 1824, 1824, 1826, 1825] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 97 / Rank 1] Tasks: ['Single QA'] | Lens: [60600] → Tgt Spa: ['0.350'] [Step 97 / Rank 6] Tasks: ['In-Context Learning', 'Summarization', 'Summarization'] | Lens: [19773, 19791, 19797] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 97 / Rank 0] Tasks: ['Single QA'] | Lens: [60600] → Tgt Spa: ['0.350'] [Step 97 / Rank 7] Tasks: ['In-Context Learning', 'Summarization', 'Summarization'] | Lens: [19773, 19791, 19797] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 97 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26509, 26510] → Tgt Spa: ['1.000', '1.000'] [Step 97 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26509, 26510] → Tgt Spa: ['1.000', '1.000'] [Step 97 / Rank 1] Tasks: ['Summarization', 'Summarization'] | Lens: [24533, 24536] → Tgt Spa: ['1.000', '1.000'] [Step 97 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [30217, 30212] → Tgt Spa: ['1.000', '0.350'] [Step 97 / Rank 0] Tasks: ['Summarization', 'Summarization'] | Lens: [24533, 24536] → Tgt Spa: ['1.000', '1.000'] [Step 97 / Rank 7] Tasks: 
['Single QA'] | Lens: [45052] → Tgt Spa: ['0.350'] [Step 97 / Rank 2] Tasks: ['Single QA'] | Lens: [49454] → Tgt Spa: ['0.350'] [Step 97 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [30217, 30212] → Tgt Spa: ['1.000', '0.350'] [Step 97 / Rank 6] Tasks: ['Single QA'] | Lens: [45052] → Tgt Spa: ['0.350'] [Step 97 / Rank 3] Tasks: ['Single QA'] | Lens: [49454] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:55:36,134 >> @ 97 | Loss: 2.2101 | LM: 2.1219 | Reg: 0.0882 | Spa(Avg): 0.492 [INFO|lh_trainer.py:797] 2026-02-16 22:55:36,135 >> Statistic -> Code | Spa: 0.519 | Tgt: 1.000 | Z-Loss: 0.128 | [INFO|lh_trainer.py:797] 2026-02-16 22:55:36,135 >> Statistic -> In-Context | Spa: 0.601 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:55:36,135 >> Statistic -> MultiHop | Spa: 0.519 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:55:36,135 >> Statistic -> Single | Spa: 0.442 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:55:36,135 >> Statistic -> Summarization | Spa: 0.507 | Tgt: 1.000 | Z-Loss: 0.149 | [INFO|lh_trainer.py:810] 2026-02-16 22:55:36,137 >> [Micro-Log] {"loss": 2.2100831580658755, "lm_loss": 2.121917632718881, "reg_loss": 0.08816551424873371, "model_sparsity(avg)": 0.49194224427143735, "Spa-Single QA sparsity": 0.44179894242967876, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05198409931645507, "Spa-In-Context Learning sparsity": 0.6010101112452421, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11929259449243546, "Spa-Summarization sparsity": 0.5065789348200748, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14895354838747726, "Spa-Code sparsity": 0.5194444417953491, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12822476923465728, "Spa-MultiHop QA sparsity": 0.5190058476046512, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 
0.05602569191863662, "step": 97, "current_tau": 1.2195326089859009, "lambda1 Single QA": 0.52734375, "lambda2 MultiHop QA": 0.267578125, "lambda3 Summarization": 0.0966796875, "lambda4 Code": 0.1953125} [INFO|lh_trainer.py:331] 2026-02-16 22:55:53,547 >> {'loss': 13.2605, 'grad_norm': 0.9544283747673035, 'learning_rate': 0.0004712469122410695, 'epoch': 0.10321221695629279, 'num_input_tokens_seen': 240990068, 'completed': '32.67% (98 / 300)', 'remaining time': '9:24:59', 'throughput': '7735.88', 'gpu_mem_free': '11389MB', 'step': 98} [Step 98 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [28480, 28481] → Tgt Spa: ['1.000', '1.000'] [Step 98 / Rank 3] Tasks: ['Single QA'] | Lens: [43609] → Tgt Spa: ['0.350'] [Step 98 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [31960, 31973] → Tgt Spa: ['1.000', '1.000'] [Step 98 / Rank 2] Tasks: ['Single QA'] | Lens: [43609] → Tgt Spa: ['0.350'] [Step 98 / Rank 0] Tasks: ['Single QA'] | Lens: [58277] → Tgt Spa: ['0.350'] [Step 98 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [31960, 31973] → Tgt Spa: ['1.000', '1.000'] [Step 98 / Rank 1] Tasks: ['Single QA'] | Lens: [58277] → Tgt Spa: ['0.350'] [Step 98 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [28480, 28481] → Tgt Spa: ['1.000', '1.000'] [Step 98 / Rank 7] Tasks: ['Code'] | Lens: [36598] → Tgt Spa: ['1.000'] [Step 98 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57440] → Tgt Spa: ['1.000'] [Step 98 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55874] → Tgt Spa: ['1.000'] [Step 98 / Rank 6] Tasks: ['Code'] | Lens: [36598] → Tgt Spa: ['1.000'] [Step 98 / Rank 1] Tasks: ['Single QA'] | Lens: [64087] → Tgt Spa: ['0.350'] [Step 98 / Rank 0] Tasks: ['Single QA'] | Lens: [64087] → Tgt Spa: ['0.350'] [Step 98 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57440] → Tgt Spa: ['1.000'] [Step 98 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55874] → Tgt Spa: ['1.000'] [Step 98 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42338] → Tgt Spa: ['1.000'] [Step 98 / Rank 
6] Tasks: ['Single QA'] | Lens: [55556] → Tgt Spa: ['0.350'] [Step 98 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43263] → Tgt Spa: ['1.000'] [Step 98 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43263] → Tgt Spa: ['1.000'] [Step 98 / Rank 7] Tasks: ['Single QA'] | Lens: [55556] → Tgt Spa: ['0.350'] [Step 98 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42338] → Tgt Spa: ['1.000'] [Step 98 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22016, 22036] → Tgt Spa: ['1.000', '1.000'] [Step 98 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22016, 22036] → Tgt Spa: ['1.000', '1.000'] [Step 98 / Rank 6] Tasks: ['Summarization'] | Lens: [61070] → Tgt Spa: ['1.000'] [Step 98 / Rank 5] Tasks: ['Single QA'] | Lens: [63246] → Tgt Spa: ['0.350'] [Step 98 / Rank 2] Tasks: ['Single QA'] | Lens: [34408] → Tgt Spa: ['0.350'] [Step 98 / Rank 3] Tasks: ['Single QA'] | Lens: [34408] → Tgt Spa: ['0.350'] [Step 98 / Rank 4] Tasks: ['Single QA'] | Lens: [63246] → Tgt Spa: ['0.350'] [Step 98 / Rank 7] Tasks: ['Summarization'] | Lens: [61070] → Tgt Spa: ['1.000'] [Step 98 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28118, 28120] → Tgt Spa: ['1.000', '1.000'] [Step 98 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28118, 28120] → Tgt Spa: ['1.000', '1.000'] [Step 98 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [38258] → Tgt Spa: ['1.000'] [Step 98 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [30625, 30626] → Tgt Spa: ['0.350', '0.350'] [Step 98 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40068] → Tgt Spa: ['1.000'] [Step 98 / Rank 4] Tasks: ['Single QA'] | Lens: [49733] → Tgt Spa: ['0.350'] [Step 98 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [30625, 30626] → Tgt Spa: ['0.350', '0.350'] [Step 98 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40068] → Tgt Spa: ['1.000'] [Step 98 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [38258] → Tgt Spa: ['1.000'] 
[Step 98 / Rank 5] Tasks: ['Single QA'] | Lens: [49733] → Tgt Spa: ['0.350'] [Step 98 / Rank 5] Tasks: ['Single QA'] | Lens: [52875] → Tgt Spa: ['0.350'] [Step 98 / Rank 6] Tasks: ['Code'] | Lens: [42373] → Tgt Spa: ['1.000'] [Step 98 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24030, 24030] → Tgt Spa: ['0.350', '0.350'] [Step 98 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24030, 24030] → Tgt Spa: ['0.350', '0.350'] [Step 98 / Rank 3] Tasks: ['Summarization'] | Lens: [32844] → Tgt Spa: ['1.000'] [Step 98 / Rank 7] Tasks: ['Code'] | Lens: [42373] → Tgt Spa: ['1.000'] [Step 98 / Rank 2] Tasks: ['Summarization'] | Lens: [32844] → Tgt Spa: ['1.000'] [Step 98 / Rank 4] Tasks: ['Single QA'] | Lens: [52875] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 22:58:27,017 >> @ 98 | Loss: 2.0827 | LM: 2.0008 | Reg: 0.0819 | Spa(Avg): 0.488 [INFO|lh_trainer.py:797] 2026-02-16 22:58:27,017 >> Statistic -> Code | Spa: 0.550 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:797] 2026-02-16 22:58:27,017 >> Statistic -> In-Context | Spa: 0.583 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:58:27,017 >> Statistic -> MultiHop | Spa: 0.519 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:58:27,017 >> Statistic -> Single | Spa: 0.378 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 22:58:27,017 >> Statistic -> Summarization | Spa: 0.535 | Tgt: 1.000 | Z-Loss: 0.135 | [INFO|lh_trainer.py:810] 2026-02-16 22:58:27,019 >> [Micro-Log] {"loss": 2.08268436913689, "lm_loss": 2.000784710670511, "reg_loss": 0.08189965083632463, "model_sparsity(avg)": 0.4884259266157945, "Spa-Single QA sparsity": 0.3784722139437993, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.018118825857527554, "Spa-In-Context Learning sparsity": 0.5833333399560716, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12532350752088758, "Spa-Summarization sparsity": 
0.534722238779068, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13518720865249634, "Spa-Code sparsity": 0.55, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11866251528263091, "Spa-MultiHop QA sparsity": 0.5190058476046512, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.05602569191863662, "step": 98, "current_tau": 1.2152067422866821, "lambda1 Single QA": 0.52734375, "lambda2 MultiHop QA": 0.267578125, "lambda3 Summarization": 0.09716796875, "lambda4 Code": 0.1962890625} [INFO|lh_trainer.py:331] 2026-02-16 22:58:46,437 >> {'loss': 12.4961, 'grad_norm': 1.0330355167388916, 'learning_rate': 0.00046970428119506353, 'epoch': 0.10426540284360189, 'num_input_tokens_seen': 243394892, 'completed': '33.00% (99 / 300)', 'remaining time': '9:22:21', 'throughput': '6954.78', 'gpu_mem_free': '10605MB', 'step': 99} [Step 99 / Rank 4] Tasks: ['Code'] | Lens: [59034] → Tgt Spa: ['1.000'] [Step 99 / Rank 3] Tasks: ['Single QA'] | Lens: [65063] → Tgt Spa: ['0.350'] [Step 99 / Rank 5] Tasks: ['Code'] | Lens: [59034] → Tgt Spa: ['1.000'] [Step 99 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26316, 26336] → Tgt Spa: ['1.000', '1.000'] [Step 99 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [8204, 8205, 8205, 8205, 8205, 8213, 8207] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 99 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [8204, 8205, 8205, 8205, 8205, 8213, 8207] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 99 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26316, 26336] → Tgt Spa: ['1.000', '1.000'] [Step 99 / Rank 2] Tasks: ['Single QA'] | Lens: [65063] → Tgt Spa: ['0.350'] [Step 99 / Rank 6] Tasks: ['Single QA'] | Lens: [44757] → Tgt Spa: ['0.350'] [Step 99 / Rank 7] 
Tasks: ['Single QA'] | Lens: [44757] → Tgt Spa: ['0.350'] [Step 99 / Rank 3] Tasks: ['Summarization'] | Lens: [41479] → Tgt Spa: ['1.000'] [Step 99 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [20439, 20440, 20440] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 99 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [20439, 20440, 20440] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 99 / Rank 0] Tasks: ['Single QA'] | Lens: [37519] → Tgt Spa: ['0.350'] [Step 99 / Rank 1] Tasks: ['Single QA'] | Lens: [37519] → Tgt Spa: ['0.350'] [Step 99 / Rank 2] Tasks: ['Summarization'] | Lens: [41479] → Tgt Spa: ['1.000'] [Step 99 / Rank 6] Tasks: ['Single QA'] | Lens: [33980] → Tgt Spa: ['0.350'] [Step 99 / Rank 4] Tasks: ['Code'] | Lens: [65363] → Tgt Spa: ['1.000'] [Step 99 / Rank 0] Tasks: ['Single QA'] | Lens: [48779] → Tgt Spa: ['0.350'] [Step 99 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [65493] → Tgt Spa: ['1.000'] [Step 99 / Rank 5] Tasks: ['Code'] | Lens: [65363] → Tgt Spa: ['1.000'] [Step 99 / Rank 1] Tasks: ['Single QA'] | Lens: [48779] → Tgt Spa: ['0.350'] [Step 99 / Rank 7] Tasks: ['Single QA'] | Lens: [33980] → Tgt Spa: ['0.350'] [Step 99 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [65493] → Tgt Spa: ['1.000'] [Step 99 / Rank 0] Tasks: ['Code'] | Lens: [56551] → Tgt Spa: ['1.000'] [Step 99 / Rank 6] Tasks: ['Code'] | Lens: [34841] → Tgt Spa: ['1.000'] [Step 99 / Rank 3] Tasks: ['Code', 'Single QA', 'Code', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [8963, 8959, 8971, 8964, 8975, 8975, 8976] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 99 / Rank 2] Tasks: ['Code', 'Single QA', 'Code', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [8963, 8959, 8971, 8964, 8975, 8975, 8976] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 99 / Rank 1] Tasks: ['Code'] | Lens: [56551] → Tgt Spa: ['1.000'] [Step 99 / Rank 7] Tasks: ['Code'] | Lens: [34841] → Tgt Spa: ['1.000'] 
[Step 99 / Rank 5] Tasks: ['Single QA'] | Lens: [41687] → Tgt Spa: ['0.350'] [Step 99 / Rank 4] Tasks: ['Single QA'] | Lens: [41687] → Tgt Spa: ['0.350'] [Step 99 / Rank 4] Tasks: ['Single QA'] | Lens: [60570] → Tgt Spa: ['0.350'] [Step 99 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [51275] → Tgt Spa: ['1.000'] [Step 99 / Rank 5] Tasks: ['Single QA'] | Lens: [60570] → Tgt Spa: ['0.350'] [Step 99 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [51275] → Tgt Spa: ['1.000'] [Step 99 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23922, 23923] → Tgt Spa: ['1.000', '1.000'] [Step 99 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23922, 23923] → Tgt Spa: ['1.000', '1.000'] [Step 99 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19093, 19081, 19083] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 99 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19093, 19081, 19083] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 99 / Rank 5] Tasks: ['Summarization', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'Summarization', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Summarization', 'Summarization', 'Code'] | Lens: [3660, 3643, 3643, 3644, 3644, 3645, 3646, 3646, 3665, 3666, 3654, 3647, 3648, 3648, 3666, 3666, 3655] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 99 / Rank 6] Tasks: ['Single QA'] | Lens: [44927] → Tgt Spa: ['0.350'] [Step 99 / Rank 3] Tasks: ['Single QA'] | Lens: [64591] → Tgt Spa: ['0.350'] [Step 99 / Rank 0] Tasks: ['Single QA'] | Lens: [49254] → Tgt Spa: ['0.350'] [Step 99 / Rank 1] Tasks: ['Single QA'] | Lens: [49254] → Tgt Spa: ['0.350'] [Step 99 / Rank 7] Tasks: ['Single QA'] | Lens: [44927] → Tgt Spa: ['0.350'] [Step 99 / Rank 2] Tasks: ['Single QA'] | 
Lens: [64591] → Tgt Spa: ['0.350'] [Step 99 / Rank 4] Tasks: ['Summarization', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'Summarization', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Summarization', 'Summarization', 'Code'] | Lens: [3660, 3643, 3643, 3644, 3644, 3645, 3646, 3646, 3665, 3666, 3654, 3647, 3648, 3648, 3666, 3666, 3655] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:01:22,988 >> @ 99 | Loss: 1.8412 | LM: 1.7578 | Reg: 0.0833 | Spa(Avg): 0.469 [INFO|lh_trainer.py:797] 2026-02-16 23:01:22,988 >> Statistic -> Code | Spa: 0.506 | Tgt: 1.000 | Z-Loss: 0.134 | [INFO|lh_trainer.py:797] 2026-02-16 23:01:22,988 >> Statistic -> In-Context | Spa: 0.573 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:01:22,989 >> Statistic -> MultiHop | Spa: 0.560 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:01:22,989 >> Statistic -> Single | Spa: 0.420 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:01:22,989 >> Statistic -> Summarization | Spa: 0.524 | Tgt: 1.000 | Z-Loss: 0.142 | [INFO|lh_trainer.py:810] 2026-02-16 23:01:22,991 >> [Micro-Log] {"loss": 1.841177936643362, "lm_loss": 1.757830588767926, "reg_loss": 0.08334735690732487, "model_sparsity(avg)": 0.46922658135493595, "Spa-In-Context Learning sparsity": 0.5725308656692505, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12981024550067055, "Spa-Summarization sparsity": 0.5243055522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14170272089540958, "Spa-Single QA sparsity": 0.4201388855775197, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04066879423044156, "Spa-Code sparsity": 
0.5059523752757481, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13440146297216415, "Spa-MultiHop QA sparsity": 0.5601851940155029, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.07306788861751556, "step": 99, "current_tau": 1.2108913660049438, "lambda1 Single QA": 0.52734375, "lambda2 MultiHop QA": 0.26953125, "lambda3 Summarization": 0.09814453125, "lambda4 Code": 0.197265625} [INFO|lh_trainer.py:331] 2026-02-16 23:01:49,673 >> {'loss': 11.0471, 'grad_norm': 0.9636231064796448, 'learning_rate': 0.0004681240049557991, 'epoch': 0.105318588730911, 'num_input_tokens_seen': 245927990, 'completed': '33.33% (100 / 300)', 'remaining time': '9:20:04', 'throughput': '6912.13', 'gpu_mem_free': '10631MB', 'step': 100} /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . warnings.warn( /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . 
warnings.warn( [INFO|trainer.py:3984] 2026-02-16 23:02:02,760 >> Saving model checkpoint to checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-100 [INFO|configuration_utils.py:419] 2026-02-16 23:02:02,932 >> Configuration saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-100/config.json [INFO|configuration_utils.py:911] 2026-02-16 23:02:02,939 >> Configuration saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-100/generation_config.json [INFO|modeling_utils.py:3580] 2026-02-16 23:02:44,474 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-100/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-02-16 23:02:44,480 >> tokenizer config file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-100/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-02-16 23:02:44,485 >> Special tokens file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-100/special_tokens_map.json [INFO|tokenization_utils_base.py:2572] 2026-02-16 23:02:44,488 >> added tokens file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-100/added_tokens.json /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. 
API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . warnings.warn( [Step 100 / Rank 3] Tasks: ['Code'] | Lens: [62910] → Tgt Spa: ['1.000'] [Step 100 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39488] → Tgt Spa: ['1.000'] [Step 100 / Rank 4] Tasks: ['In-Context Learning', 'Code', 'Summarization'] | Lens: [21701, 21710, 21723] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 100 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22888, 22907] → Tgt Spa: ['1.000', '1.000'] [Step 100 / Rank 5] Tasks: ['In-Context Learning', 'Code', 'Summarization'] | Lens: [21701, 21710, 21723] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 100 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39488] → Tgt Spa: ['1.000'] [Step 100 / Rank 2] Tasks: ['Code'] | Lens: [62910] → Tgt Spa: ['1.000'] [Step 100 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22888, 22907] → Tgt Spa: ['1.000', '1.000'] /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/autograd/graph.py:823: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.) 
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [Step 100 / Rank 1] Tasks: ['Single QA'] | Lens: [52012] → Tgt Spa: ['0.350'] [Step 100 / Rank 6] Tasks: ['Single QA'] | Lens: [60562] → Tgt Spa: ['0.350'] [Step 100 / Rank 7] Tasks: ['Single QA'] | Lens: [60562] → Tgt Spa: ['0.350'] [Step 100 / Rank 4] Tasks: ['Code'] | Lens: [40070] → Tgt Spa: ['1.000'] [Step 100 / Rank 0] Tasks: ['Single QA'] | Lens: [52012] → Tgt Spa: ['0.350'] [Step 100 / Rank 5] Tasks: ['Code'] | Lens: [40070] → Tgt Spa: ['1.000'] [Step 100 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32530, 32531] → Tgt Spa: ['0.350', '0.350'] [Step 100 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32530, 32531] → Tgt Spa: ['0.350', '0.350'] [Step 100 / Rank 1] Tasks: ['Single QA'] | Lens: [49168] → Tgt Spa: ['0.350'] [Step 100 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [27103, 27119] → Tgt Spa: ['1.000', '1.000'] [Step 100 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [27103, 27119] → Tgt Spa: ['1.000', '1.000'] [Step 100 / Rank 7] Tasks: ['Single QA'] | Lens: [62196] → Tgt Spa: ['0.350'] [Step 100 / Rank 0] Tasks: ['Single QA'] | Lens: [49168] → Tgt Spa: ['0.350'] [Step 100 / Rank 6] Tasks: ['Single QA'] | Lens: [62196] → Tgt Spa: ['0.350'] [Step 100 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [37188] → Tgt Spa: ['1.000'] [Step 100 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [37188] → Tgt Spa: ['1.000'] [Step 100 / Rank 4] Tasks: ['Code'] | Lens: [36837] → Tgt Spa: ['1.000'] [Step 100 / Rank 0] Tasks: ['Single QA'] | Lens: [34738] → Tgt Spa: ['0.350'] [Step 100 / Rank 1] Tasks: ['Single QA'] | Lens: [34738] → Tgt Spa: ['0.350'] [Step 100 / Rank 5] Tasks: ['Code'] | Lens: [36837] → Tgt Spa: ['1.000'] [Step 100 / Rank 2] Tasks: ['Code', 'Single QA', 'Code', 'Single QA'] | Lens: [13627, 13621, 13636, 13629] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350'] [Step 100 / Rank 6] Tasks: ['Single QA', 'Summarization', 
'Summarization'] | Lens: [17584, 17603, 17603] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 100 / Rank 3] Tasks: ['Code', 'Single QA', 'Code', 'Single QA'] | Lens: [13627, 13621, 13636, 13629] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350'] [Step 100 / Rank 7] Tasks: ['Single QA', 'Summarization', 'Summarization'] | Lens: [17584, 17603, 17603] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 100 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [30041, 30035] → Tgt Spa: ['1.000', '1.000'] [Step 100 / Rank 7] Tasks: ['Single QA'] | Lens: [55862] → Tgt Spa: ['0.350'] [Step 100 / Rank 0] Tasks: ['Code'] | Lens: [59870] → Tgt Spa: ['1.000'] [Step 100 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [30041, 30035] → Tgt Spa: ['1.000', '1.000'] [Step 100 / Rank 6] Tasks: ['Single QA'] | Lens: [55862] → Tgt Spa: ['0.350'] [Step 100 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38182] → Tgt Spa: ['1.000'] [Step 100 / Rank 1] Tasks: ['Code'] | Lens: [59870] → Tgt Spa: ['1.000'] [Step 100 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [38182] → Tgt Spa: ['1.000'] [Step 100 / Rank 5] Tasks: ['Code'] | Lens: [61468] → Tgt Spa: ['1.000'] [Step 100 / Rank 4] Tasks: ['Code'] | Lens: [61468] → Tgt Spa: ['1.000'] [Step 100 / Rank 7] Tasks: ['Code'] | Lens: [34421] → Tgt Spa: ['1.000'] [Step 100 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17926, 17917, 17928] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 100 / Rank 1] Tasks: ['Code'] | Lens: [44243] → Tgt Spa: ['1.000'] [Step 100 / Rank 6] Tasks: ['Code'] | Lens: [34421] → Tgt Spa: ['1.000'] [Step 100 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17926, 17917, 17928] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 100 / Rank 0] Tasks: ['Code'] | Lens: [44243] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:06:08,519 >> @ 100 | Loss: 1.6549 | LM: 1.5503 | Reg: 0.1046 | Spa(Avg): 0.504 [INFO|lh_trainer.py:797] 2026-02-16 23:06:08,520 >> Statistic -> Code | Spa: 
0.509 | Tgt: 1.000 | Z-Loss: 0.134 | [INFO|lh_trainer.py:797] 2026-02-16 23:06:08,520 >> Statistic -> In-Context | Spa: 0.583 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:06:08,520 >> Statistic -> MultiHop | Spa: 0.560 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:06:08,520 >> Statistic -> Single | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:06:08,520 >> Statistic -> Summarization | Spa: 0.556 | Tgt: 1.000 | Z-Loss: 0.127 | [INFO|lh_trainer.py:810] 2026-02-16 23:06:08,522 >> [Micro-Log] {"loss": 1.654852328511576, "lm_loss": 1.5503015232582886, "reg_loss": 0.10455080699951698, "model_sparsity(avg)": 0.5038580298423767, "Spa-In-Context Learning sparsity": 0.5833333631356558, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12627745792269707, "Spa-Summarization sparsity": 0.555555556501661, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12721566962344305, "Spa-Single QA sparsity": 0.45833332430232654, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06126804251901128, "Spa-Code sparsity": 0.5085470080375671, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1340960625272531, "Spa-MultiHop QA sparsity": 0.5601851940155029, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.07306788861751556, "step": 100, "current_tau": 1.2065879106521606, "lambda1 Single QA": 0.52734375, "lambda2 MultiHop QA": 0.26953125, "lambda3 Summarization": 0.09912109375, "lambda4 Code": 0.1982421875} [INFO|lh_trainer.py:331] 2026-02-16 23:06:32,828 >> {'loss': 9.9291, 'grad_norm': 1.4202877283096313, 'learning_rate': 0.0004665063542954746, 'epoch': 0.10637177461822012, 'num_input_tokens_seen': 248369144, 'completed': '33.67% (101 / 300)', 'remaining time': '9:21:03', 'throughput': '4310.62', 'gpu_mem_free': '11169MB', 'step': 101} [Step 101 / Rank 4] Tasks: ['In-Context Learning', 
'In-Context Learning'] | Lens: [23630, 23633] → Tgt Spa: ['1.000', '1.000'] [Step 101 / Rank 2] Tasks: ['Single QA'] | Lens: [54375] → Tgt Spa: ['0.350'] [Step 101 / Rank 1] Tasks: ['Single QA'] | Lens: [56713] → Tgt Spa: ['0.350'] [Step 101 / Rank 3] Tasks: ['Single QA'] | Lens: [54375] → Tgt Spa: ['0.350'] [Step 101 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23630, 23633] → Tgt Spa: ['1.000', '1.000'] [Step 101 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17833, 17835, 17825] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 101 / Rank 0] Tasks: ['Single QA'] | Lens: [56713] → Tgt Spa: ['0.350'] [Step 101 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17833, 17835, 17825] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 101 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40898] → Tgt Spa: ['1.000'] [Step 101 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40147] → Tgt Spa: ['1.000'] [Step 101 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40147] → Tgt Spa: ['1.000'] [Step 101 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19447, 19440, 19439] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 101 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19447, 19440, 19439] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 101 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40898] → Tgt Spa: ['1.000'] [Step 101 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [30620, 30621] → Tgt Spa: ['0.350', '0.350'] [Step 101 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [30620, 30621] → Tgt Spa: ['0.350', '0.350'] [Step 101 / Rank 1] Tasks: ['Single QA'] | Lens: [54343] → Tgt Spa: ['0.350'] [Step 101 / Rank 7] Tasks: ['Code'] | Lens: [51784] → Tgt Spa: ['1.000'] [Step 101 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [26567, 26586] → Tgt Spa: ['0.350', '1.000'] [Step 101 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [62874] → Tgt Spa: ['1.000'] [Step 101 / Rank 5] Tasks: 
['In-Context Learning'] | Lens: [62874] → Tgt Spa: ['1.000'] [Step 101 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [26567, 26586] → Tgt Spa: ['0.350', '1.000'] [Step 101 / Rank 6] Tasks: ['Code'] | Lens: [51784] → Tgt Spa: ['1.000'] [Step 101 / Rank 0] Tasks: ['Single QA'] | Lens: [54343] → Tgt Spa: ['0.350'] [Step 101 / Rank 1] Tasks: ['Single QA'] | Lens: [33172] → Tgt Spa: ['0.350'] [Step 101 / Rank 5] Tasks: ['Single QA'] | Lens: [54597] → Tgt Spa: ['0.350'] [Step 101 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [23316, 23309] → Tgt Spa: ['1.000', '1.000'] [Step 101 / Rank 0] Tasks: ['Single QA'] | Lens: [33172] → Tgt Spa: ['0.350'] [Step 101 / Rank 2] Tasks: ['Single QA'] | Lens: [35418] → Tgt Spa: ['0.350'] [Step 101 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [23316, 23309] → Tgt Spa: ['1.000', '1.000'] [Step 101 / Rank 3] Tasks: ['Single QA'] | Lens: [35418] → Tgt Spa: ['0.350'] [Step 101 / Rank 4] Tasks: ['Single QA'] | Lens: [54597] → Tgt Spa: ['0.350'] [Step 101 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56839] → Tgt Spa: ['1.000'] [Step 101 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32529, 32529] → Tgt Spa: ['0.350', '0.350'] [Step 101 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56839] → Tgt Spa: ['1.000'] [Step 101 / Rank 0] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 101 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32529, 32529] → Tgt Spa: ['0.350', '0.350'] [Step 101 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25088, 25088] → Tgt Spa: ['1.000', '1.000'] [Step 101 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25088, 25088] → Tgt Spa: ['1.000', '1.000'] [Step 101 / Rank 1] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 101 / Rank 7] Tasks: ['Single QA'] | Lens: [33612] → Tgt Spa: ['0.350'] [Step 101 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43413] → Tgt Spa: ['1.000'] [Step 101 / Rank 
2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6074, 6074, 6074, 6076, 6075, 6075, 6075, 6075, 6076, 6076] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 101 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [23034, 23043] → Tgt Spa: ['1.000', '1.000'] [Step 101 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6074, 6074, 6074, 6076, 6075, 6075, 6075, 6075, 6076, 6076] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 101 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43413] → Tgt Spa: ['1.000'] [Step 101 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [23034, 23043] → Tgt Spa: ['1.000', '1.000'] [Step 101 / Rank 6] Tasks: ['Single QA'] | Lens: [33612] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:09:02,257 >> @ 101 | Loss: 2.1009 | LM: 2.0121 | Reg: 0.0888 | Spa(Avg): 0.491 [INFO|lh_trainer.py:797] 2026-02-16 23:09:02,257 >> Statistic -> Code | Spa: 0.495 | Tgt: 1.000 | Z-Loss: 0.140 | [INFO|lh_trainer.py:797] 2026-02-16 23:09:02,257 >> Statistic -> In-Context | Spa: 0.594 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:09:02,257 >> Statistic -> MultiHop | Spa: 0.560 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:09:02,257 >> Statistic -> Single | Spa: 0.415 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:09:02,257 >> Statistic -> Summarization | Spa: 0.535 | Tgt: 1.000 | Z-Loss: 0.138 | [INFO|lh_trainer.py:810] 2026-02-16 23:09:02,259 >> [Micro-Log] {"loss": 2.100896217782671, "lm_loss": 2.0121098650076115, "reg_loss": 0.088786332285963, "model_sparsity(avg)": 0.4911651263634364, "Spa-Single QA sparsity": 
0.4154040352864699, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03980958031024784, "Spa-In-Context Learning sparsity": 0.5937500149011612, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12375838682055473, "Spa-Summarization sparsity": 0.5347222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13760074973106384, "Spa-Code sparsity": 0.49537035822868347, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13973434269428253, "Spa-MultiHop QA sparsity": 0.5601851940155029, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.07306788861751556, "step": 101, "current_tau": 1.2022976875305176, "lambda1 Single QA": 0.53125, "lambda2 MultiHop QA": 0.26953125, "lambda3 Summarization": 0.10009765625, "lambda4 Code": 0.19921875} [INFO|lh_trainer.py:331] 2026-02-16 23:09:16,376 >> {'loss': 12.6054, 'grad_norm': 1.065412998199463, 'learning_rate': 0.00046485160639020293, 'epoch': 0.10742496050552923, 'num_input_tokens_seen': 250819650, 'completed': '34.00% (102 / 300)', 'remaining time': '9:18:03', 'throughput': '7491.72', 'gpu_mem_free': '12197MB', 'step': 102} [Step 102 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26999, 27001] → Tgt Spa: ['1.000', '0.350'] [Step 102 / Rank 2] Tasks: ['Single QA'] | Lens: [55422] → Tgt Spa: ['0.350'] [Step 102 / Rank 6] Tasks: ['Single QA'] | Lens: [36923] → Tgt Spa: ['0.350'] [Step 102 / Rank 3] Tasks: ['Single QA'] | Lens: [55422] → Tgt Spa: ['0.350'] [Step 102 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18113, 18103, 18115] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 102 / Rank 7] Tasks: ['Single QA'] | Lens: [36923] → Tgt Spa: ['0.350'] [Step 102 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26999, 27001] → Tgt Spa: ['1.000', '0.350'] [Step 102 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18113, 18103, 18115] → Tgt 
Spa: ['1.000', '1.000', '1.000'] [Step 102 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23722, 23725] → Tgt Spa: ['1.000', '1.000'] [Step 102 / Rank 7] Tasks: ['Single QA', 'Summarization'] | Lens: [27269, 27288] → Tgt Spa: ['0.350', '1.000'] [Step 102 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42908] → Tgt Spa: ['1.000'] [Step 102 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42908] → Tgt Spa: ['1.000'] [Step 102 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23722, 23725] → Tgt Spa: ['1.000', '1.000'] [Step 102 / Rank 6] Tasks: ['Single QA', 'Summarization'] | Lens: [27269, 27288] → Tgt Spa: ['0.350', '1.000'] [Step 102 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [30959, 30959] → Tgt Spa: ['0.350', '0.350'] [Step 102 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [30959, 30959] → Tgt Spa: ['0.350', '0.350'] [Step 102 / Rank 5] Tasks: ['Single QA'] | Lens: [65082] → Tgt Spa: ['0.350'] [Step 102 / Rank 3] Tasks: ['Single QA'] | Lens: [52404] → Tgt Spa: ['0.350'] [Step 102 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22382, 22384] → Tgt Spa: ['1.000', '1.000'] [Step 102 / Rank 4] Tasks: ['Single QA'] | Lens: [65082] → Tgt Spa: ['0.350'] [Step 102 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [4611, 4611, 4611, 4622, 4614, 4633, 4617, 4616, 4618, 4637, 4637, 4619, 4619, 4621] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 102 / Rank 2] Tasks: ['Single QA'] | Lens: [52404] → Tgt Spa: ['0.350'] [Step 102 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 
'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [4611, 4611, 4611, 4622, 4614, 4633, 4617, 4616, 4618, 4637, 4637, 4619, 4619, 4621] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 102 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22382, 22384] → Tgt Spa: ['1.000', '1.000'] [Step 102 / Rank 5] Tasks: ['Single QA'] | Lens: [38977] → Tgt Spa: ['0.350'] [Step 102 / Rank 4] Tasks: ['Single QA'] | Lens: [38977] → Tgt Spa: ['0.350'] [Step 102 / Rank 2] Tasks: ['Single QA'] | Lens: [37966] → Tgt Spa: ['0.350'] [Step 102 / Rank 0] Tasks: ['Single QA'] | Lens: [33677] → Tgt Spa: ['0.350'] [Step 102 / Rank 3] Tasks: ['Single QA'] | Lens: [37966] → Tgt Spa: ['0.350'] [Step 102 / Rank 7] Tasks: ['Single QA'] | Lens: [33969] → Tgt Spa: ['0.350'] [Step 102 / Rank 1] Tasks: ['Single QA'] | Lens: [33677] → Tgt Spa: ['0.350'] [Step 102 / Rank 6] Tasks: ['Single QA'] | Lens: [33969] → Tgt Spa: ['0.350'] [Step 102 / Rank 5] Tasks: ['Single QA'] | Lens: [64593] → Tgt Spa: ['0.350'] [Step 102 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [24413, 24421] → Tgt Spa: ['0.350', '1.000'] [Step 102 / Rank 1] Tasks: ['Single QA'] | Lens: [38144] → Tgt Spa: ['0.350'] [Step 102 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [15443, 15451, 15449, 15459] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350'] [Step 102 / Rank 4] Tasks: ['Single QA'] | Lens: [64593] → Tgt Spa: ['0.350'] [Step 102 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [15443, 15451, 15449, 15459] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350'] [Step 102 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [24413, 24421] → Tgt Spa: ['0.350', '1.000'] [Step 102 / Rank 0] Tasks: ['Single QA'] | Lens: [38144] → Tgt Spa: ['0.350'] [Step 102 / 
Rank 1] Tasks: ['Single QA'] | Lens: [49458] → Tgt Spa: ['0.350'] [Step 102 / Rank 6] Tasks: ['Single QA'] | Lens: [64734] → Tgt Spa: ['0.350'] [Step 102 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [23548, 23549] → Tgt Spa: ['0.350', '0.350'] [Step 102 / Rank 7] Tasks: ['Single QA'] | Lens: [64734] → Tgt Spa: ['0.350'] [Step 102 / Rank 0] Tasks: ['Single QA'] | Lens: [49458] → Tgt Spa: ['0.350'] [Step 102 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [23548, 23549] → Tgt Spa: ['0.350', '0.350'] [Step 102 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [23037, 23036] → Tgt Spa: ['1.000', '1.000'] [Step 102 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [23037, 23036] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:11:39,725 >> @ 102 | Loss: 2.3636 | LM: 2.2822 | Reg: 0.0814 | Spa(Avg): 0.492 [INFO|lh_trainer.py:797] 2026-02-16 23:11:39,726 >> Statistic -> Code | Spa: 0.495 | Tgt: 1.000 | Z-Loss: 0.140 | [INFO|lh_trainer.py:797] 2026-02-16 23:11:39,726 >> Statistic -> In-Context | Spa: 0.602 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:11:39,726 >> Statistic -> MultiHop | Spa: 0.560 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:11:39,726 >> Statistic -> Single | Spa: 0.462 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:11:39,726 >> Statistic -> Summarization | Spa: 0.560 | Tgt: 1.000 | Z-Loss: 0.126 | [INFO|lh_trainer.py:810] 2026-02-16 23:11:39,728 >> [Micro-Log] {"loss": 2.3636054607729116, "lm_loss": 2.2821641427775226, "reg_loss": 0.08144132820113252, "model_sparsity(avg)": 0.4920841579635938, "Spa-Summarization sparsity": 0.5601851840813955, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12591402108470598, "Spa-Code sparsity": 0.495370348294576, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1403838793436686, "Spa-Single QA sparsity": 0.4618055522441864, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 
0.06413602372534417, "Spa-In-Context Learning sparsity": 0.6021825458322253, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12103993871382304, "Spa-MultiHop QA sparsity": 0.5601851940155029, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.07306788861751556, "step": 102, "current_tau": 1.1980220079421997, "lambda1 Single QA": 0.53125, "lambda2 MultiHop QA": 0.26953125, "lambda3 Summarization": 0.1005859375, "lambda4 Code": 0.2001953125} [INFO|lh_trainer.py:331] 2026-02-16 23:12:07,368 >> {'loss': 14.1816, 'grad_norm': 0.7211883068084717, 'learning_rate': 0.0004631600447725189, 'epoch': 0.10847814639283834, 'num_input_tokens_seen': 253219186, 'completed': '34.33% (103 / 300)', 'remaining time': '9:15:17', 'throughput': '7016.50', 'gpu_mem_free': '10527MB', 'step': 103} [Step 103 / Rank 3] Tasks: ['Code'] | Lens: [37384] → Tgt Spa: ['1.000'] [Step 103 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [64896] → Tgt Spa: ['1.000'] [Step 103 / Rank 1] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19268, 19268, 19280] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 103 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56698] → Tgt Spa: ['1.000'] [Step 103 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [64896] → Tgt Spa: ['1.000'] [Step 103 / Rank 2] Tasks: ['Code'] | Lens: [37384] → Tgt Spa: ['1.000'] [Step 103 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56698] → Tgt Spa: ['1.000'] [Step 103 / Rank 0] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19268, 19268, 19280] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 103 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25247, 25248] → Tgt Spa: ['0.350', '1.000'] [Step 103 / Rank 1] Tasks: ['Single QA'] | Lens: [54963] → Tgt Spa: ['0.350'] [Step 103 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43671] → Tgt Spa: ['1.000'] [Step 103 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25247, 25248] → Tgt Spa: ['0.350', 
'1.000'] [Step 103 / Rank 6] Tasks: ['Single QA'] | Lens: [35102] → Tgt Spa: ['0.350'] [Step 103 / Rank 0] Tasks: ['Single QA'] | Lens: [54963] → Tgt Spa: ['0.350'] [Step 103 / Rank 7] Tasks: ['Single QA'] | Lens: [35102] → Tgt Spa: ['0.350'] [Step 103 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43671] → Tgt Spa: ['1.000'] [Step 103 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22204, 22187] → Tgt Spa: ['1.000', '1.000'] [Step 103 / Rank 3] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [11159, 11161, 11158, 11169, 11169] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000'] [Step 103 / Rank 6] Tasks: ['Single QA'] | Lens: [46707] → Tgt Spa: ['0.350'] [Step 103 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22204, 22187] → Tgt Spa: ['1.000', '1.000'] [Step 103 / Rank 1] Tasks: ['Single QA'] | Lens: [60739] → Tgt Spa: ['0.350'] [Step 103 / Rank 7] Tasks: ['Single QA'] | Lens: [46707] → Tgt Spa: ['0.350'] [Step 103 / Rank 2] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [11159, 11161, 11158, 11169, 11169] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000'] [Step 103 / Rank 0] Tasks: ['Single QA'] | Lens: [60739] → Tgt Spa: ['0.350'] [Step 103 / Rank 5] Tasks: ['Single QA'] | Lens: [39522] → Tgt Spa: ['0.350'] [Step 103 / Rank 7] Tasks: ['Single QA'] | Lens: [51382] → Tgt Spa: ['0.350'] [Step 103 / Rank 3] Tasks: ['Single QA'] | Lens: [56713] → Tgt Spa: ['0.350'] [Step 103 / Rank 2] Tasks: ['Single QA'] | Lens: [56713] → Tgt Spa: ['0.350'] [Step 103 / Rank 0] Tasks: ['Single QA'] | Lens: [64049] → Tgt Spa: ['0.350'] [Step 103 / Rank 1] Tasks: ['Single QA'] | Lens: [64049] → Tgt Spa: ['0.350'] [Step 103 / Rank 6] Tasks: ['Single QA'] | Lens: [51382] → Tgt Spa: ['0.350'] [Step 103 / Rank 4] Tasks: ['Single QA'] | Lens: [39522] → Tgt Spa: ['0.350'] [Step 103 / Rank 1] Tasks: ['Single QA'] | Lens: [39918] → Tgt Spa: ['0.350'] [Step 103 / Rank 2] Tasks: ['Single QA'] | Lens: [45884] → Tgt 
Spa: ['0.350'] [Step 103 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [64375] → Tgt Spa: ['1.000'] [Step 103 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40189] → Tgt Spa: ['1.000'] [Step 103 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [64375] → Tgt Spa: ['1.000'] [Step 103 / Rank 0] Tasks: ['Single QA'] | Lens: [39918] → Tgt Spa: ['0.350'] [Step 103 / Rank 3] Tasks: ['Single QA'] | Lens: [45884] → Tgt Spa: ['0.350'] [Step 103 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40189] → Tgt Spa: ['1.000'] [Step 103 / Rank 5] Tasks: ['Single QA'] | Lens: [53809] → Tgt Spa: ['0.350'] [Step 103 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1480, 1483, 1480, 1480, 1479, 1500, 1500, 1499, 1499, 1498, 1499, 1482, 1481, 1481, 1500, 1481, 1482, 1501, 1483, 1484, 1485, 1484, 1485, 1486, 1483, 1503, 1486, 1485, 1484, 1485, 1505, 1504, 1487, 1486, 1486, 1486, 1487, 1487, 1508, 1490, 1490, 1488, 1488] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 103 / Rank 1] Tasks: ['Single 
QA'] | Lens: [35542] → Tgt Spa: ['0.350'] [Step 103 / Rank 4] Tasks: ['Single QA'] | Lens: [53809] → Tgt Spa: ['0.350'] [Step 103 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [4146, 4140, 4141, 4143, 4150, 4142, 4142, 4143, 4161, 4151, 4145, 4152, 4145, 4145, 4153] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 103 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [4146, 4140, 4141, 4143, 4150, 4142, 4142, 4143, 4161, 4151, 4145, 4152, 4145, 4145, 4153] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 103 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1480, 1483, 1480, 1480, 1479, 1500, 1500, 1499, 1499, 1498, 1499, 1482, 1481, 1481, 1500, 1481, 1482, 1501, 1483, 1484, 1485, 1484, 1485, 
1486, 1483, 1503, 1486, 1485, 1484, 1485, 1505, 1504, 1487, 1486, 1486, 1486, 1487, 1487, 1508, 1490, 1490, 1488, 1488] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 103 / Rank 0] Tasks: ['Single QA'] | Lens: [35542] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:14:54,464 >> @ 103 | Loss: 2.2679 | LM: 2.1959 | Reg: 0.0720 | Spa(Avg): 0.457 [INFO|lh_trainer.py:797] 2026-02-16 23:14:54,465 >> Statistic -> Code | Spa: 0.493 | Tgt: 1.000 | Z-Loss: 0.142 | [INFO|lh_trainer.py:797] 2026-02-16 23:14:54,465 >> Statistic -> In-Context | Spa: 0.585 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:14:54,465 >> Statistic -> MultiHop | Spa: 0.531 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:14:54,465 >> Statistic -> Single | Spa: 0.392 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:14:54,465 >> Statistic -> Summarization | Spa: 0.548 | Tgt: 1.000 | Z-Loss: 0.132 | [INFO|lh_trainer.py:810] 2026-02-16 23:14:54,467 >> [Micro-Log] {"loss": 2.267850309610367, "lm_loss": 2.195850122720003, "reg_loss": 0.0720001990654661, "model_sparsity(avg)": 0.4572961429754893, "Spa-Code sparsity": 0.4930555423100789, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.14187070168554783, "Spa-Summarization sparsity": 0.5481481432914734, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13183925151824952, "Spa-Single QA sparsity": 0.3923611007630825, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03918412011989858, "Spa-In-Context Learning sparsity": 0.5853174669401986, "Spa-In-Context Learning 
target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12690360631261552, "Spa-MultiHop QA sparsity": 0.5313620067411854, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.061493933080665523, "step": 103, "current_tau": 1.193762183189392, "lambda1 Single QA": 0.53125, "lambda2 MultiHop QA": 0.26953125, "lambda3 Summarization": 0.1015625, "lambda4 Code": 0.201171875} [INFO|lh_trainer.py:331] 2026-02-16 23:15:14,302 >> {'loss': 13.6071, 'grad_norm': 0.8114559650421143, 'learning_rate': 0.0004614319592827978, 'epoch': 0.10953133228014744, 'num_input_tokens_seen': 255671766, 'completed': '34.67% (104 / 300)', 'remaining time': '9:13:02', 'throughput': '6560.04', 'gpu_mem_free': '13371MB', 'step': 104} [Step 104 / Rank 5] Tasks: ['Single QA'] | Lens: [53577] → Tgt Spa: ['0.350'] [Step 104 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [30050, 30039] → Tgt Spa: ['1.000', '1.000'] [Step 104 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [25335, 25336] → Tgt Spa: ['0.350', '0.350'] [Step 104 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [30050, 30039] → Tgt Spa: ['1.000', '1.000'] [Step 104 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [25335, 25336] → Tgt Spa: ['0.350', '0.350'] [Step 104 / Rank 4] Tasks: ['Single QA'] | Lens: [53577] → Tgt Spa: ['0.350'] [Step 104 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38412] → Tgt Spa: ['1.000'] [Step 104 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [38412] → Tgt Spa: ['1.000'] [Step 104 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60580] → Tgt Spa: ['1.000'] [Step 104 / Rank 3] Tasks: ['Summarization', 'Summarization'] | Lens: [23570, 23569] → Tgt Spa: ['1.000', '1.000'] [Step 104 / Rank 2] Tasks: ['Summarization', 'Summarization'] | Lens: [23570, 23569] → Tgt Spa: ['1.000', '1.000'] [Step 104 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60580] → Tgt Spa: ['1.000'] [Step 104 / Rank 4] Tasks: ['Single QA'] | Lens: [60448] → Tgt Spa: ['0.350'] [Step 104 / 
Rank 5] Tasks: ['Single QA'] | Lens: [60448] → Tgt Spa: ['0.350'] [Step 104 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24427, 24446] → Tgt Spa: ['1.000', '1.000'] [Step 104 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24427, 24446] → Tgt Spa: ['1.000', '1.000'] [Step 104 / Rank 3] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [8913, 8932, 8916, 8919, 8927, 8922, 8929] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 104 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39351] → Tgt Spa: ['1.000'] [Step 104 / Rank 7] Tasks: ['Single QA'] | Lens: [64180] → Tgt Spa: ['0.350'] [Step 104 / Rank 2] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [8913, 8932, 8916, 8919, 8927, 8922, 8929] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 104 / Rank 5] Tasks: ['Single QA'] | Lens: [52509] → Tgt Spa: ['0.350'] [Step 104 / Rank 6] Tasks: ['Single QA'] | Lens: [64180] → Tgt Spa: ['0.350'] [Step 104 / Rank 4] Tasks: ['Single QA'] | Lens: [52509] → Tgt Spa: ['0.350'] [Step 104 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39351] → Tgt Spa: ['1.000'] [Step 104 / Rank 5] Tasks: ['Single QA', 'Summarization'] | Lens: [32559, 32577] → Tgt Spa: ['0.350', '1.000'] [Step 104 / Rank 1] Tasks: ['Single QA'] | Lens: [51828] → Tgt Spa: ['0.350'] [Step 104 / Rank 7] Tasks: ['MultiHop QA', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2964, 2981, 2965, 2963, 2965, 2965, 2982, 2964, 2966, 2967, 2967, 2966, 2984, 2972, 2973, 2970, 2969, 2986, 2986, 2969, 2970, 2971] → Tgt Spa: ['0.350', 
'1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 104 / Rank 4] Tasks: ['Single QA', 'Summarization'] | Lens: [32559, 32577] → Tgt Spa: ['0.350', '1.000'] [Step 104 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41482] → Tgt Spa: ['1.000'] [Step 104 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41482] → Tgt Spa: ['1.000'] [Step 104 / Rank 0] Tasks: ['Single QA'] | Lens: [51828] → Tgt Spa: ['0.350'] [Step 104 / Rank 6] Tasks: ['MultiHop QA', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2964, 2981, 2965, 2963, 2965, 2965, 2982, 2964, 2966, 2967, 2967, 2966, 2984, 2972, 2973, 2970, 2969, 2986, 2986, 2969, 2970, 2971] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 104 / Rank 4] Tasks: ['Single QA'] | Lens: [55087] → Tgt Spa: ['0.350'] [Step 104 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20127, 20119, 20119] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 104 / Rank 5] Tasks: ['Single QA'] | Lens: [55087] → Tgt Spa: ['0.350'] [Step 104 / Rank 7] Tasks: ['Single QA'] | Lens: [38731] → Tgt Spa: ['0.350'] [Step 104 / Rank 6] Tasks: ['Single QA'] | Lens: [38731] → Tgt Spa: ['0.350'] [Step 104 / Rank 1] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15811, 15805, 15806, 15806] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 104 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20127, 20119, 20119] → Tgt Spa: ['1.000', '1.000', '1.000'] 
[Step 104 / Rank 0] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15811, 15805, 15806, 15806] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 104 / Rank 0] Tasks: ['Single QA'] | Lens: [64781] → Tgt Spa: ['0.350'] [Step 104 / Rank 1] Tasks: ['Single QA'] | Lens: [64781] → Tgt Spa: ['0.350'] [Step 104 / Rank 2] Tasks: ['Code'] | Lens: [32860] → Tgt Spa: ['1.000'] [Step 104 / Rank 3] Tasks: ['Code'] | Lens: [32860] → Tgt Spa: ['1.000'] [Step 104 / Rank 6] Tasks: ['Single QA'] | Lens: [61424] → Tgt Spa: ['0.350'] [Step 104 / Rank 7] Tasks: ['Single QA'] | Lens: [61424] → Tgt Spa: ['0.350'] [Step 104 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15951, 15950, 15951, 15955] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 104 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15951, 15950, 15951, 15955] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:17:45,541 >> @ 104 | Loss: 2.1353 | LM: 2.0533 | Reg: 0.0820 | Spa(Avg): 0.461 [INFO|lh_trainer.py:797] 2026-02-16 23:17:45,541 >> Statistic -> Code | Spa: 0.509 | Tgt: 1.000 | Z-Loss: 0.137 | [INFO|lh_trainer.py:797] 2026-02-16 23:17:45,542 >> Statistic -> In-Context | Spa: 0.565 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:17:45,542 >> Statistic -> MultiHop | Spa: 0.532 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:17:45,542 >> Statistic -> Single | Spa: 0.429 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:17:45,542 >> Statistic -> Summarization | Spa: 0.531 | Tgt: 1.000 | Z-Loss: 0.141 | [INFO|lh_trainer.py:810] 2026-02-16 23:17:45,544 >> [Micro-Log] {"loss": 2.1352777096132436, "lm_loss": 2.0532869237164655, "reg_loss": 0.0819907898743016, "model_sparsity(avg)": 0.4605172487596671, "Spa-Summarization sparsity": 0.5312499950329462, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 
0.14126640558242798, "Spa-Code sparsity": 0.508680559694767, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1370416795834899, "Spa-In-Context Learning sparsity": 0.5648148357868195, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.13478304197390875, "Spa-Single QA sparsity": 0.4289529827924875, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.047255140384480074, "Spa-MultiHop QA sparsity": 0.5324074029922485, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06218258710578084, "step": 104, "current_tau": 1.1895195245742798, "lambda1 Single QA": 0.53125, "lambda2 MultiHop QA": 0.26953125, "lambda3 Summarization": 0.1025390625, "lambda4 Code": 0.2021484375} [INFO|lh_trainer.py:331] 2026-02-16 23:18:13,744 >> {'loss': 12.8117, 'grad_norm': 0.8580276966094971, 'learning_rate': 0.0004596676460195918, 'epoch': 0.11058451816745656, 'num_input_tokens_seen': 258276528, 'completed': '35.00% (105 / 300)', 'remaining time': '9:10:31', 'throughput': '7257.92', 'gpu_mem_free': '4053MB', 'step': 105} [Step 105 / Rank 0] Tasks: ['Single QA'] | Lens: [65378] → Tgt Spa: ['0.350'] [Step 105 / Rank 4] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4916, 4908, 4908, 4908, 4927, 4911, 4917, 4911, 4911, 4919, 4913, 4914, 4916] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 105 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [46134] → Tgt Spa: ['1.000'] [Step 105 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [46134] → Tgt Spa: ['1.000'] [Step 105 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43699] → Tgt Spa: ['1.000'] [Step 105 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43699] → Tgt Spa: 
['1.000'] [Step 105 / Rank 5] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4916, 4908, 4908, 4908, 4927, 4911, 4917, 4911, 4911, 4919, 4913, 4914, 4916] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 105 / Rank 1] Tasks: ['Single QA'] | Lens: [65378] → Tgt Spa: ['0.350'] [Step 105 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22364, 22365] → Tgt Spa: ['0.350', '1.000'] [Step 105 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44781] → Tgt Spa: ['1.000'] [Step 105 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39896] → Tgt Spa: ['1.000'] [Step 105 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39896] → Tgt Spa: ['1.000'] [Step 105 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41736] → Tgt Spa: ['1.000'] [Step 105 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41736] → Tgt Spa: ['1.000'] [Step 105 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22364, 22365] → Tgt Spa: ['0.350', '1.000'] [Step 105 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44781] → Tgt Spa: ['1.000'] [Step 105 / Rank 5] Tasks: ['Single QA'] | Lens: [58713] → Tgt Spa: ['0.350'] [Step 105 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [24609, 24618] → Tgt Spa: ['1.000', '1.000'] [Step 105 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26938, 26958] → Tgt Spa: ['1.000', '1.000'] [Step 105 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [24609, 24618] → Tgt Spa: ['1.000', '1.000'] [Step 105 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60157] → Tgt Spa: ['1.000'] [Step 105 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26938, 26958] → Tgt Spa: ['1.000', '1.000'] [Step 105 / Rank 0] Tasks: ['In-Context 
Learning'] | Lens: [60157] → Tgt Spa: ['1.000'] [Step 105 / Rank 4] Tasks: ['Single QA'] | Lens: [58713] → Tgt Spa: ['0.350'] [Step 105 / Rank 1] Tasks: ['Single QA'] | Lens: [49228] → Tgt Spa: ['0.350'] [Step 105 / Rank 7] Tasks: ['Code'] | Lens: [51860] → Tgt Spa: ['1.000'] [Step 105 / Rank 2] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [5767, 5768, 5768, 5771, 5770, 5772, 5773, 5773, 5775, 5774, 5781] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 105 / Rank 6] Tasks: ['Code'] | Lens: [51860] → Tgt Spa: ['1.000'] [Step 105 / Rank 4] Tasks: ['Single QA'] | Lens: [57915] → Tgt Spa: ['0.350'] [Step 105 / Rank 0] Tasks: ['Single QA'] | Lens: [49228] → Tgt Spa: ['0.350'] [Step 105 / Rank 5] Tasks: ['Single QA'] | Lens: [57915] → Tgt Spa: ['0.350'] [Step 105 / Rank 3] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [5767, 5768, 5768, 5771, 5770, 5772, 5773, 5773, 5775, 5774, 5781] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 105 / Rank 5] Tasks: ['Code'] | Lens: [34941] → Tgt Spa: ['1.000'] [Step 105 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6064, 6057, 6058, 6066, 6066, 6059, 6066, 6070, 6063, 6065] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 105 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41236] → Tgt Spa: ['1.000'] [Step 105 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41236] → Tgt Spa: ['1.000'] [Step 105 / Rank 6] Tasks: 
['Single QA'] | Lens: [50000] → Tgt Spa: ['0.350'] [Step 105 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6064, 6057, 6058, 6066, 6066, 6059, 6066, 6070, 6063, 6065] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 105 / Rank 7] Tasks: ['Single QA'] | Lens: [50000] → Tgt Spa: ['0.350'] [Step 105 / Rank 4] Tasks: ['Code'] | Lens: [34941] → Tgt Spa: ['1.000'] [Step 105 / Rank 6] Tasks: ['Single QA'] | Lens: [15933] → Tgt Spa: ['0.350'] [Step 105 / Rank 3] Tasks: ['Single QA'] | Lens: [52532] → Tgt Spa: ['0.350'] [Step 105 / Rank 2] Tasks: ['Single QA'] | Lens: [52532] → Tgt Spa: ['0.350'] [Step 105 / Rank 1] Tasks: ['Code'] | Lens: [34336] → Tgt Spa: ['1.000'] [Step 105 / Rank 7] Tasks: ['Single QA'] | Lens: [15933] → Tgt Spa: ['0.350'] [Step 105 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16688, 16689, 16680] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 105 / Rank 0] Tasks: ['Code'] | Lens: [34336] → Tgt Spa: ['1.000'] [Step 105 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16688, 16689, 16680] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:20:38,563 >> @ 105 | Loss: 2.1007 | LM: 1.9942 | Reg: 0.1065 | Spa(Avg): 0.505 [INFO|lh_trainer.py:797] 2026-02-16 23:20:38,563 >> Statistic -> Code | Spa: 0.463 | Tgt: 1.000 | Z-Loss: 0.154 | [INFO|lh_trainer.py:797] 2026-02-16 23:20:38,563 >> Statistic -> In-Context | Spa: 0.592 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:20:38,563 >> Statistic -> MultiHop | Spa: 0.532 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:20:38,563 >> Statistic -> Single | Spa: 0.467 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:20:38,563 >> Statistic -> Summarization | Spa: 0.514 | Tgt: 1.000 | Z-Loss: 0.150 | 
[INFO|lh_trainer.py:810] 2026-02-16 23:20:38,565 >> [Micro-Log] {"loss": 2.1007155179977417, "lm_loss": 1.9942133392517765, "reg_loss": 0.10650219829403795, "model_sparsity(avg)": 0.5047669460376104, "Spa-Single QA sparsity": 0.46732025637346153, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06813190678846748, "Spa-In-Context Learning sparsity": 0.5922222161293029, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1255766648054123, "Spa-Code sparsity": 0.4632936418056488, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1536604900445257, "Spa-Summarization sparsity": 0.513888880610466, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14991949126124382, "Spa-MultiHop QA sparsity": 0.5324074029922485, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06218258710578084, "step": 105, "current_tau": 1.1852952241897583, "lambda1 Single QA": 0.53125, "lambda2 MultiHop QA": 0.271484375, "lambda3 Summarization": 0.10302734375, "lambda4 Code": 0.2021484375} [INFO|lh_trainer.py:331] 2026-02-16 23:20:57,532 >> {'loss': 12.6043, 'grad_norm': 1.31855309009552, 'learning_rate': 0.000457867407288896, 'epoch': 0.11163770405476567, 'num_input_tokens_seen': 260625306, 'completed': '35.33% (106 / 300)', 'remaining time': '9:07:32', 'throughput': '7170.19', 'gpu_mem_free': '14743MB', 'step': 106} [Step 106 / Rank 3] Tasks: ['Code'] | Lens: [42934] → Tgt Spa: ['1.000'] [Step 106 / Rank 2] Tasks: ['Code'] | Lens: [42934] → Tgt Spa: ['1.000'] [Step 106 / Rank 0] Tasks: ['Code'] | Lens: [36086] → Tgt Spa: ['1.000'] [Step 106 / Rank 7] Tasks: ['Single QA'] | Lens: [47947] → Tgt Spa: ['0.350'] [Step 106 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [16827, 16827, 16828] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 106 / Rank 1] Tasks: ['Code'] | Lens: [36086] → Tgt Spa: ['1.000'] [Step 106 / Rank 5] Tasks: ['Summarization', 
'Summarization', 'Summarization'] | Lens: [16827, 16827, 16828] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 106 / Rank 6] Tasks: ['Single QA'] | Lens: [47947] → Tgt Spa: ['0.350'] [Step 106 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [63749] → Tgt Spa: ['1.000'] [Step 106 / Rank 3] Tasks: ['Code', 'Summarization'] | Lens: [23006, 23019] → Tgt Spa: ['1.000', '1.000'] [Step 106 / Rank 6] Tasks: ['Single QA'] | Lens: [56750] → Tgt Spa: ['0.350'] [Step 106 / Rank 1] Tasks: ['Single QA'] | Lens: [65103] → Tgt Spa: ['0.350'] [Step 106 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [63749] → Tgt Spa: ['1.000'] [Step 106 / Rank 0] Tasks: ['Single QA'] | Lens: [65103] → Tgt Spa: ['0.350'] [Step 106 / Rank 7] Tasks: ['Single QA'] | Lens: [56750] → Tgt Spa: ['0.350'] [Step 106 / Rank 2] Tasks: ['Code', 'Summarization'] | Lens: [23006, 23019] → Tgt Spa: ['1.000', '1.000'] [Step 106 / Rank 3] Tasks: ['Single QA'] | Lens: [48731] → Tgt Spa: ['0.350'] [Step 106 / Rank 1] Tasks: ['Single QA'] | Lens: [57587] → Tgt Spa: ['0.350'] [Step 106 / Rank 0] Tasks: ['Single QA'] | Lens: [57587] → Tgt Spa: ['0.350'] [Step 106 / Rank 2] Tasks: ['Single QA'] | Lens: [48731] → Tgt Spa: ['0.350'] [Step 106 / Rank 6] Tasks: ['Single QA'] | Lens: [48671] → Tgt Spa: ['0.350'] [Step 106 / Rank 5] Tasks: ['Single QA'] | Lens: [49668] → Tgt Spa: ['0.350'] [Step 106 / Rank 4] Tasks: ['Single QA'] | Lens: [49668] → Tgt Spa: ['0.350'] [Step 106 / Rank 7] Tasks: ['Single QA'] | Lens: [48671] → Tgt Spa: ['0.350'] [Step 106 / Rank 6] Tasks: ['Single QA'] | Lens: [57534] → Tgt Spa: ['0.350'] [Step 106 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22568, 22568] → Tgt Spa: ['0.350', '1.000'] [Step 106 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15808, 15808, 15809, 15809] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 106 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [24695, 24686] → Tgt Spa: ['1.000', '1.000'] [Step 106 / Rank 0] Tasks: 
['Summarization', 'Code'] | Lens: [24695, 24686] → Tgt Spa: ['1.000', '1.000'] [Step 106 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15808, 15808, 15809, 15809] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 106 / Rank 7] Tasks: ['Single QA'] | Lens: [57534] → Tgt Spa: ['0.350'] [Step 106 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22568, 22568] → Tgt Spa: ['0.350', '1.000'] [Step 106 / Rank 3] Tasks: ['Code'] | Lens: [36628] → Tgt Spa: ['1.000'] [Step 106 / Rank 1] Tasks: ['Single QA'] | Lens: [34808] → Tgt Spa: ['0.350'] [Step 106 / Rank 0] Tasks: ['Single QA'] | Lens: [34808] → Tgt Spa: ['0.350'] [Step 106 / Rank 4] Tasks: ['Single QA'] | Lens: [36617] → Tgt Spa: ['0.350'] [Step 106 / Rank 5] Tasks: ['Single QA'] | Lens: [36617] → Tgt Spa: ['0.350'] [Step 106 / Rank 6] Tasks: ['Single QA'] | Lens: [39368] → Tgt Spa: ['0.350'] [Step 106 / Rank 7] Tasks: ['Single QA'] | Lens: [39368] → Tgt Spa: ['0.350'] [Step 106 / Rank 2] Tasks: ['Code'] | Lens: [36628] → Tgt Spa: ['1.000'] [Step 106 / Rank 5] Tasks: ['Single QA'] | Lens: [52155] → Tgt Spa: ['0.350'] [Step 106 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [62375] → Tgt Spa: ['1.000'] [Step 106 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [21084, 21088, 21093] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 106 / Rank 6] Tasks: ['Single QA'] | Lens: [43937] → Tgt Spa: ['0.350'] [Step 106 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [21084, 21088, 21093] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 106 / Rank 7] Tasks: ['Single QA'] | Lens: [43937] → Tgt Spa: ['0.350'] [Step 106 / Rank 4] Tasks: ['Single QA'] | Lens: [52155] → Tgt Spa: ['0.350'] [Step 106 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [62375] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:23:17,388 >> @ 106 | Loss: 1.9808 | LM: 1.9150 | Reg: 0.0658 | Spa(Avg): 0.442 [INFO|lh_trainer.py:797] 2026-02-16 23:23:17,388 >> Statistic -> Code | Spa: 0.516 | Tgt: 1.000 | 
Z-Loss: 0.135 | [INFO|lh_trainer.py:797] 2026-02-16 23:23:17,388 >> Statistic -> In-Context | Spa: 0.588 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:23:17,388 >> Statistic -> MultiHop | Spa: 0.532 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:23:17,388 >> Statistic -> Single | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:23:17,388 >> Statistic -> Summarization | Spa: 0.517 | Tgt: 1.000 | Z-Loss: 0.150 | [INFO|lh_trainer.py:810] 2026-02-16 23:23:17,390 >> [Micro-Log] {"loss": 1.9808012420932453, "lm_loss": 1.9150469501813252, "reg_loss": 0.06575428716799554, "model_sparsity(avg)": 0.442033172895511, "Spa-Code sparsity": 0.5156249925494194, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1354581117630005, "Spa-Single QA sparsity": 0.3749999900658925, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02140642676709427, "Spa-Summarization sparsity": 0.5166666507720947, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14968262910842894, "Spa-In-Context Learning sparsity": 0.5879629453023275, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12788485984007517, "Spa-MultiHop QA sparsity": 0.5324074029922485, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06218258710578084, "step": 106, "current_tau": 1.1810905933380127, "lambda1 Single QA": 0.53515625, "lambda2 MultiHop QA": 0.271484375, "lambda3 Summarization": 0.10400390625, "lambda4 Code": 0.203125} [INFO|lh_trainer.py:331] 2026-02-16 23:23:42,331 >> {'loss': 11.8848, 'grad_norm': 0.7763041257858276, 'learning_rate': 0.0004560315515523492, 'epoch': 0.11269088994207478, 'num_input_tokens_seen': 263021648, 'completed': '35.67% (107 / 300)', 'remaining time': '9:04:34', 'throughput': '7270.53', 'gpu_mem_free': '6563MB', 'step': 107} [Step 107 / Rank 4] Tasks: ['Single QA'] | Lens: [54163] → Tgt Spa: 
['0.350'] [Step 107 / Rank 5] Tasks: ['Single QA'] | Lens: [54163] → Tgt Spa: ['0.350'] [Step 107 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [21857, 21852] → Tgt Spa: ['1.000', '0.350'] [Step 107 / Rank 6] Tasks: ['Single QA'] | Lens: [48677] → Tgt Spa: ['0.350'] [Step 107 / Rank 3] Tasks: ['Single QA'] | Lens: [43247] → Tgt Spa: ['0.350'] [Step 107 / Rank 7] Tasks: ['Single QA'] | Lens: [48677] → Tgt Spa: ['0.350'] [Step 107 / Rank 2] Tasks: ['Single QA'] | Lens: [43247] → Tgt Spa: ['0.350'] [Step 107 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [21857, 21852] → Tgt Spa: ['1.000', '0.350'] [Step 107 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26362, 26364] → Tgt Spa: ['1.000', '1.000'] [Step 107 / Rank 2] Tasks: ['Single QA'] | Lens: [62704] → Tgt Spa: ['0.350'] [Step 107 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [20466, 20465, 20470] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 107 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26362, 26364] → Tgt Spa: ['1.000', '1.000'] [Step 107 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23759, 23758] → Tgt Spa: ['1.000', '0.350'] [Step 107 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [20466, 20465, 20470] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 107 / Rank 3] Tasks: ['Single QA'] | Lens: [62704] → Tgt Spa: ['0.350'] [Step 107 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23759, 23758] → Tgt Spa: ['1.000', '0.350'] [Step 107 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [54418] → Tgt Spa: ['1.000'] [Step 107 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40517] → Tgt Spa: ['1.000'] [Step 107 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [54418] → Tgt Spa: ['1.000'] [Step 107 / Rank 2] Tasks: ['Single QA'] | Lens: [58752] → Tgt Spa: ['0.350'] [Step 107 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [25601, 25608] → Tgt Spa: ['1.000', '1.000'] [Step 107 / Rank 4] Tasks: ['In-Context Learning'] | 
Lens: [40517] → Tgt Spa: ['1.000'] [Step 107 / Rank 3] Tasks: ['Single QA'] | Lens: [58752] → Tgt Spa: ['0.350'] [Step 107 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [25601, 25608] → Tgt Spa: ['1.000', '1.000'] [Step 107 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [21453, 21453, 21471] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 107 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24056, 24056] → Tgt Spa: ['0.350', '1.000'] [Step 107 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44978] → Tgt Spa: ['1.000'] [Step 107 / Rank 7] Tasks: ['Single QA'] | Lens: [51164] → Tgt Spa: ['0.350'] [Step 107 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44978] → Tgt Spa: ['1.000'] [Step 107 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24056, 24056] → Tgt Spa: ['0.350', '1.000'] [Step 107 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [21453, 21453, 21471] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 107 / Rank 6] Tasks: ['Single QA'] | Lens: [51164] → Tgt Spa: ['0.350'] [Step 107 / Rank 1] Tasks: ['Code'] | Lens: [44964] → Tgt Spa: ['1.000'] [Step 107 / Rank 4] Tasks: ['Single QA'] | Lens: [57570] → Tgt Spa: ['0.350'] [Step 107 / Rank 5] Tasks: ['Single QA'] | Lens: [57570] → Tgt Spa: ['0.350'] [Step 107 / Rank 2] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [16430, 16424, 16424] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 107 / Rank 3] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [16430, 16424, 16424] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 107 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57554] → Tgt Spa: ['1.000'] [Step 107 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57554] → Tgt Spa: ['1.000'] [Step 107 / Rank 0] Tasks: ['Code'] | Lens: [44964] → Tgt Spa: ['1.000'] [Step 107 / Rank 6] Tasks: ['Single QA'] | Lens: [46933] → Tgt Spa: ['0.350'] [Step 107 / Rank 3] Tasks: ['Single QA'] | Lens: [35714] → Tgt Spa: ['0.350'] [Step 107 / Rank 5] Tasks: 
['Single QA'] | Lens: [62711] → Tgt Spa: ['0.350'] [Step 107 / Rank 1] Tasks: ['Single QA'] | Lens: [40450] → Tgt Spa: ['0.350'] [Step 107 / Rank 4] Tasks: ['Single QA'] | Lens: [62711] → Tgt Spa: ['0.350'] [Step 107 / Rank 7] Tasks: ['Single QA'] | Lens: [46933] → Tgt Spa: ['0.350'] [Step 107 / Rank 0] Tasks: ['Single QA'] | Lens: [40450] → Tgt Spa: ['0.350'] [Step 107 / Rank 2] Tasks: ['Single QA'] | Lens: [35714] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:26:12,252 >> @ 107 | Loss: 2.1085 | LM: 2.0410 | Reg: 0.0675 | Spa(Avg): 0.447 [INFO|lh_trainer.py:797] 2026-02-16 23:26:12,252 >> Statistic -> Code | Spa: 0.510 | Tgt: 1.000 | Z-Loss: 0.138 | [INFO|lh_trainer.py:797] 2026-02-16 23:26:12,252 >> Statistic -> In-Context | Spa: 0.568 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:26:12,252 >> Statistic -> MultiHop | Spa: 0.532 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:26:12,252 >> Statistic -> Single | Spa: 0.373 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:26:12,253 >> Statistic -> Summarization | Spa: 0.625 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:810] 2026-02-16 23:26:12,254 >> [Micro-Log] {"loss": 2.1085145038863025, "lm_loss": 2.0410226099193096, "reg_loss": 0.06749187493308757, "model_sparsity(avg)": 0.447145060946544, "Spa-Code sparsity": 0.5099206396511623, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1383091083594731, "Spa-Single QA sparsity": 0.37345678276485866, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.016633720641645294, "Spa-In-Context Learning sparsity": 0.5679012404547797, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1349702643023597, "Spa-Summarization sparsity": 0.625, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09869384765625, "Spa-MultiHop QA sparsity": 0.5324074029922485, "Spa-MultiHop QA target_sparsity": 0.349609375, 
"Spa-MultiHop QA log_z_loss": 0.06218258710578084, "step": 107, "current_tau": 1.1769070625305176, "lambda1 Single QA": 0.53515625, "lambda2 MultiHop QA": 0.271484375, "lambda3 Summarization": 0.10498046875, "lambda4 Code": 0.2041015625} [INFO|lh_trainer.py:331] 2026-02-16 23:26:37,632 >> {'loss': 12.6511, 'grad_norm': 0.9121378064155579, 'learning_rate': 0.00045416039337438087, 'epoch': 0.11374407582938388, 'num_input_tokens_seen': 265467338, 'completed': '36.00% (108 / 300)', 'remaining time': '9:01:55', 'throughput': '6975.68', 'gpu_mem_free': '11721MB', 'step': 108} [Step 108 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [35670] → Tgt Spa: ['1.000'] [Step 108 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [35670] → Tgt Spa: ['1.000'] [Step 108 / Rank 2] Tasks: ['Single QA'] | Lens: [47508] → Tgt Spa: ['0.350'] [Step 108 / Rank 1] Tasks: ['Single QA'] | Lens: [52154] → Tgt Spa: ['0.350'] [Step 108 / Rank 6] Tasks: ['Code'] | Lens: [34763] → Tgt Spa: ['1.000'] [Step 108 / Rank 0] Tasks: ['Single QA'] | Lens: [52154] → Tgt Spa: ['0.350'] [Step 108 / Rank 3] Tasks: ['Single QA'] | Lens: [47508] → Tgt Spa: ['0.350'] [Step 108 / Rank 7] Tasks: ['Code'] | Lens: [34763] → Tgt Spa: ['1.000'] [Step 108 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [37936] → Tgt Spa: ['1.000'] [Step 108 / Rank 6] Tasks: ['Single QA'] | Lens: [65265] → Tgt Spa: ['0.350'] [Step 108 / Rank 3] Tasks: ['Single QA'] | Lens: [53940] → Tgt Spa: ['0.350'] [Step 108 / Rank 2] Tasks: ['Single QA'] | Lens: [53940] → Tgt Spa: ['0.350'] [Step 108 / Rank 1] Tasks: ['Single QA'] | Lens: [62449] → Tgt Spa: ['0.350'] [Step 108 / Rank 7] Tasks: ['Single QA'] | Lens: [65265] → Tgt Spa: ['0.350'] [Step 108 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [37936] → Tgt Spa: ['1.000'] [Step 108 / Rank 0] Tasks: ['Single QA'] | Lens: [62449] → Tgt Spa: ['0.350'] [Step 108 / Rank 3] Tasks: ['Single QA'] | Lens: [39702] → Tgt Spa: ['0.350'] [Step 108 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: 
[32196, 32196] → Tgt Spa: ['0.350', '0.350'] [Step 108 / Rank 6] Tasks: ['Single QA'] | Lens: [59950] → Tgt Spa: ['0.350'] [Step 108 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [61014] → Tgt Spa: ['1.000'] [Step 108 / Rank 2] Tasks: ['Single QA'] | Lens: [39702] → Tgt Spa: ['0.350'] [Step 108 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [61014] → Tgt Spa: ['1.000'] [Step 108 / Rank 7] Tasks: ['Single QA'] | Lens: [59950] → Tgt Spa: ['0.350'] [Step 108 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32196, 32196] → Tgt Spa: ['0.350', '0.350'] [Step 108 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18918, 18908, 18909] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 108 / Rank 0] Tasks: ['Single QA'] | Lens: [44446] → Tgt Spa: ['0.350'] [Step 108 / Rank 1] Tasks: ['Single QA'] | Lens: [44446] → Tgt Spa: ['0.350'] [Step 108 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32692, 32692] → Tgt Spa: ['0.350', '0.350'] [Step 108 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18918, 18908, 18909] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 108 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [25488, 25497] → Tgt Spa: ['1.000', '1.000'] [Step 108 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [25488, 25497] → Tgt Spa: ['1.000', '1.000'] [Step 108 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32692, 32692] → Tgt Spa: ['0.350', '0.350'] [Step 108 / Rank 7] Tasks: ['Single QA'] | Lens: [43176] → Tgt Spa: ['0.350'] [Step 108 / Rank 6] Tasks: ['Single QA'] | Lens: [43176] → Tgt Spa: ['0.350'] [Step 108 / Rank 0] Tasks: ['Single QA'] | Lens: [65093] → Tgt Spa: ['0.350'] [Step 108 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25483, 25484] → Tgt Spa: ['1.000', '1.000'] [Step 108 / Rank 4] Tasks: ['Single QA'] | Lens: [36380] → Tgt Spa: ['0.350'] [Step 108 / Rank 5] Tasks: ['Single QA'] | Lens: [36380] → Tgt Spa: ['0.350'] [Step 108 / Rank 1] Tasks: ['Single QA'] | Lens: [65093] → 
Tgt Spa: ['0.350'] [Step 108 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25483, 25484] → Tgt Spa: ['1.000', '1.000'] [Step 108 / Rank 5] Tasks: ['Single QA'] | Lens: [44891] → Tgt Spa: ['0.350'] [Step 108 / Rank 6] Tasks: ['Single QA'] | Lens: [65098] → Tgt Spa: ['0.350'] [Step 108 / Rank 7] Tasks: ['Single QA'] | Lens: [65098] → Tgt Spa: ['0.350'] [Step 108 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29195, 29197] → Tgt Spa: ['0.350', '0.350'] [Step 108 / Rank 4] Tasks: ['Single QA'] | Lens: [44891] → Tgt Spa: ['0.350'] [Step 108 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60836] → Tgt Spa: ['1.000'] [Step 108 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60836] → Tgt Spa: ['1.000'] [Step 108 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29195, 29197] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:29:15,472 >> @ 108 | Loss: 2.1137 | LM: 2.0447 | Reg: 0.0690 | Spa(Avg): 0.462 [INFO|lh_trainer.py:797] 2026-02-16 23:29:15,472 >> Statistic -> Code | Spa: 0.465 | Tgt: 1.000 | Z-Loss: 0.155 | [INFO|lh_trainer.py:797] 2026-02-16 23:29:15,472 >> Statistic -> In-Context | Spa: 0.595 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:29:15,473 >> Statistic -> MultiHop | Spa: 0.437 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:29:15,473 >> Statistic -> Single | Spa: 0.412 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:29:15,473 >> Statistic -> Summarization | Spa: 0.500 | Tgt: 1.000 | Z-Loss: 0.158 | [INFO|lh_trainer.py:810] 2026-02-16 23:29:15,475 >> [Micro-Log] {"loss": 2.1136824762603887, "lm_loss": 2.0447215110762045, "reg_loss": 0.06896095113673557, "model_sparsity(avg)": 0.4622877997656663, "Spa-Single QA sparsity": 0.4117646988700418, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03815060017137405, "Spa-In-Context Learning sparsity": 0.5952380895614624, "Spa-In-Context Learning target_sparsity": 
1.0, "Spa-In-Context Learning log_z_loss": 0.12568688499076025, "Spa-Summarization sparsity": 0.5, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.158447265625, "Spa-Code sparsity": 0.465277761220932, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15483517572283745, "Spa-MultiHop QA sparsity": 0.4374999701976776, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.026715065352618694, "step": 108, "current_tau": 1.172745704650879, "lambda1 Single QA": 0.53515625, "lambda2 MultiHop QA": 0.271484375, "lambda3 Summarization": 0.10595703125, "lambda4 Code": 0.205078125} [INFO|lh_trainer.py:331] 2026-02-16 23:29:42,275 >> {'loss': 12.6821, 'grad_norm': 0.7179497480392456, 'learning_rate': 0.000452254253368312, 'epoch': 0.114797261716693, 'num_input_tokens_seen': 267981590, 'completed': '36.33% (109 / 300)', 'remaining time': '8:59:33', 'throughput': '6808.41', 'gpu_mem_free': '5491MB', 'step': 109} [Step 109 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44650] → Tgt Spa: ['1.000'] [Step 109 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44650] → Tgt Spa: ['1.000'] [Step 109 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18326, 18318, 18318] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 109 / Rank 3] Tasks: ['Single QA'] | Lens: [35643] → Tgt Spa: ['0.350'] [Step 109 / Rank 2] Tasks: ['Single QA'] | Lens: [35643] → Tgt Spa: ['0.350'] [Step 109 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32713, 32713] → Tgt Spa: ['0.350', '0.350'] [Step 109 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18326, 18318, 18318] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 109 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32713, 32713] → Tgt Spa: ['0.350', '0.350'] [Step 109 / Rank 3] Tasks: ['Single QA'] | Lens: [46370] → Tgt Spa: ['0.350'] [Step 109 / Rank 5] Tasks: ['In-Context Learning', 'Single QA', 'MultiHop QA', 'Single QA', 'In-Context Learning', 'Summarization', 'Code', 
'Code', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'In-Context Learning'] | Lens: [4242, 4244, 4245, 4245, 4247, 4264, 4254, 4253, 4254, 4247, 4247, 4248, 4267, 4257, 4250] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 109 / Rank 7] Tasks: ['MultiHop QA', 'Code'] | Lens: [30870, 30881] → Tgt Spa: ['0.350', '1.000'] [Step 109 / Rank 0] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16670, 16671, 16683] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 109 / Rank 2] Tasks: ['Single QA'] | Lens: [46370] → Tgt Spa: ['0.350'] [Step 109 / Rank 1] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16670, 16671, 16683] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 109 / Rank 4] Tasks: ['In-Context Learning', 'Single QA', 'MultiHop QA', 'Single QA', 'In-Context Learning', 'Summarization', 'Code', 'Code', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'In-Context Learning'] | Lens: [4242, 4244, 4245, 4245, 4247, 4264, 4254, 4253, 4254, 4247, 4247, 4248, 4267, 4257, 4250] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 109 / Rank 6] Tasks: ['MultiHop QA', 'Code'] | Lens: [30870, 30881] → Tgt Spa: ['0.350', '1.000'] [Step 109 / Rank 6] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Single QA'] | Lens: [12688, 12708, 12720, 12728, 12723] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350'] [Step 109 / Rank 2] Tasks: ['Single QA'] | Lens: [54850] → Tgt Spa: ['0.350'] [Step 109 / Rank 4] Tasks: ['Single QA'] | Lens: [36356] → Tgt Spa: ['0.350'] [Step 109 / Rank 3] Tasks: ['Single QA'] | Lens: [54850] → Tgt Spa: ['0.350'] [Step 109 / Rank 5] Tasks: ['Single QA'] | Lens: [36356] → Tgt Spa: ['0.350'] [Step 109 / Rank 0] Tasks: ['Code'] | Lens: [38893] → Tgt Spa: ['1.000'] [Step 109 / Rank 7] Tasks: 
['Single QA', 'Code', 'Code', 'Code', 'Single QA'] | Lens: [12688, 12708, 12720, 12728, 12723] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350'] [Step 109 / Rank 1] Tasks: ['Code'] | Lens: [38893] → Tgt Spa: ['1.000'] [Step 109 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [26589, 26596] → Tgt Spa: ['1.000', '1.000'] [Step 109 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [26589, 26596] → Tgt Spa: ['1.000', '1.000'] [Step 109 / Rank 7] Tasks: ['Single QA'] | Lens: [36082] → Tgt Spa: ['0.350'] [Step 109 / Rank 2] Tasks: ['Single QA'] | Lens: [40243] → Tgt Spa: ['0.350'] [Step 109 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24403, 24404] → Tgt Spa: ['1.000', '0.350'] [Step 109 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24403, 24404] → Tgt Spa: ['1.000', '0.350'] [Step 109 / Rank 6] Tasks: ['Single QA'] | Lens: [36082] → Tgt Spa: ['0.350'] [Step 109 / Rank 3] Tasks: ['Single QA'] | Lens: [40243] → Tgt Spa: ['0.350'] [Step 109 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25529, 25529] → Tgt Spa: ['1.000', '1.000'] [Step 109 / Rank 5] Tasks: ['Code'] | Lens: [35193] → Tgt Spa: ['1.000'] [Step 109 / Rank 3] Tasks: ['Single QA'] | Lens: [43191] → Tgt Spa: ['0.350'] [Step 109 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25529, 25529] → Tgt Spa: ['1.000', '1.000'] [Step 109 / Rank 2] Tasks: ['Single QA'] | Lens: [43191] → Tgt Spa: ['0.350'] [Step 109 / Rank 4] Tasks: ['Code'] | Lens: [35193] → Tgt Spa: ['1.000'] [Step 109 / Rank 1] Tasks: ['Single QA', 'Summarization'] | Lens: [23088, 23107] → Tgt Spa: ['0.350', '1.000'] [Step 109 / Rank 0] Tasks: ['Single QA', 'Summarization'] | Lens: [23088, 23107] → Tgt Spa: ['0.350', '1.000'] [Step 109 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40029] → Tgt Spa: ['1.000'] [Step 109 / Rank 4] Tasks: ['Summarization', 'MultiHop QA'] | Lens: [30822, 30805] → Tgt Spa: ['1.000', '0.350'] [Step 109 / Rank 0] 
Tasks: ['Single QA'] | Lens: [46865] → Tgt Spa: ['0.350'] [Step 109 / Rank 7] Tasks: ['Code'] | Lens: [44803] → Tgt Spa: ['1.000'] [Step 109 / Rank 5] Tasks: ['Summarization', 'MultiHop QA'] | Lens: [30822, 30805] → Tgt Spa: ['1.000', '0.350'] [Step 109 / Rank 6] Tasks: ['Code'] | Lens: [44803] → Tgt Spa: ['1.000'] [Step 109 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40029] → Tgt Spa: ['1.000'] [Step 109 / Rank 1] Tasks: ['Single QA'] | Lens: [46865] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:31:37,757 >> @ 109 | Loss: 1.9036 | LM: 1.8185 | Reg: 0.0851 | Spa(Avg): 0.485 [INFO|lh_trainer.py:797] 2026-02-16 23:31:37,757 >> Statistic -> Code | Spa: 0.501 | Tgt: 1.000 | Z-Loss: 0.142 | [INFO|lh_trainer.py:797] 2026-02-16 23:31:37,757 >> Statistic -> In-Context | Spa: 0.607 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:31:37,757 >> Statistic -> MultiHop | Spa: 0.495 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:31:37,757 >> Statistic -> Single | Spa: 0.436 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:31:37,757 >> Statistic -> Summarization | Spa: 0.500 | Tgt: 1.000 | Z-Loss: 0.160 | [INFO|lh_trainer.py:810] 2026-02-16 23:31:37,759 >> [Micro-Log] {"loss": 1.9035654372225206, "lm_loss": 1.8184540749837954, "reg_loss": 0.08511135797016323, "model_sparsity(avg)": 0.48462577039996785, "Spa-Single QA sparsity": 0.43595678276485866, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05083139944407675, "Spa-Code sparsity": 0.5008680522441864, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1424249019473791, "Spa-Summarization sparsity": 0.4999999801317851, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1596157302459081, "Spa-In-Context Learning sparsity": 0.606944453716278, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12248105257749557, "Spa-MultiHop QA sparsity": 
0.495370348294576, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04776669293642044, "step": 109, "current_tau": 1.1686079502105713, "lambda1 Single QA": 0.53515625, "lambda2 MultiHop QA": 0.271484375, "lambda3 Summarization": 0.1064453125, "lambda4 Code": 0.205078125} [INFO|lh_trainer.py:331] 2026-02-16 23:31:54,377 >> {'loss': 11.4214, 'grad_norm': 0.9322866797447205, 'learning_rate': 0.0004503134581414198, 'epoch': 0.11585044760400211, 'num_input_tokens_seen': 270308658, 'completed': '36.67% (110 / 300)', 'remaining time': '8:55:39', 'throughput': '8807.86', 'gpu_mem_free': '11403MB', 'step': 110} [Step 110 / Rank 6] Tasks: ['Single QA'] | Lens: [64044] → Tgt Spa: ['0.350'] [Step 110 / Rank 4] Tasks: ['Code'] | Lens: [42300] → Tgt Spa: ['1.000'] [Step 110 / Rank 7] Tasks: ['Single QA'] | Lens: [64044] → Tgt Spa: ['0.350'] [Step 110 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14104, 14105, 14105, 14106] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 110 / Rank 2] Tasks: ['Single QA'] | Lens: [44893] → Tgt Spa: ['0.350'] [Step 110 / Rank 3] Tasks: ['Single QA'] | Lens: [44893] → Tgt Spa: ['0.350'] [Step 110 / Rank 5] Tasks: ['Code'] | Lens: [42300] → Tgt Spa: ['1.000'] [Step 110 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14104, 14105, 14105, 14106] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 110 / Rank 5] Tasks: ['Summarization'] | Lens: [43925] → Tgt Spa: ['1.000'] [Step 110 / Rank 0] Tasks: ['Single QA'] | Lens: [50662] → Tgt Spa: ['0.350'] [Step 110 / Rank 4] Tasks: ['Summarization'] | Lens: [43925] → Tgt Spa: ['1.000'] [Step 110 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [25745, 25754] → Tgt Spa: ['1.000', '1.000'] [Step 110 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [25745, 25754] → Tgt Spa: ['1.000', '1.000'] [Step 110 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18962, 18970, 18961] → Tgt Spa: 
['1.000', '1.000', '1.000'] [Step 110 / Rank 1] Tasks: ['Single QA'] | Lens: [50662] → Tgt Spa: ['0.350'] [Step 110 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18962, 18970, 18961] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 110 / Rank 7] Tasks: ['Summarization', 'Code', 'Code', 'Code', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [8541, 8531, 8532, 8538, 8535, 8545, 8540] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 110 / Rank 6] Tasks: ['Summarization', 'Code', 'Code', 'Code', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [8541, 8531, 8532, 8538, 8535, 8545, 8540] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 110 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [52709] → Tgt Spa: ['1.000'] [Step 110 / Rank 5] Tasks: ['Single QA'] | Lens: [38354] → Tgt Spa: ['0.350'] [Step 110 / Rank 2] Tasks: ['Single QA'] | Lens: [50661] → Tgt Spa: ['0.350'] [Step 110 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [52709] → Tgt Spa: ['1.000'] [Step 110 / Rank 3] Tasks: ['Single QA'] | Lens: [50661] → Tgt Spa: ['0.350'] [Step 110 / Rank 4] Tasks: ['Single QA'] | Lens: [38354] → Tgt Spa: ['0.350'] [Step 110 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [53677] → Tgt Spa: ['1.000'] [Step 110 / Rank 3] Tasks: ['Single QA'] | Lens: [38173] → Tgt Spa: ['0.350'] [Step 110 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25809, 25810] → Tgt Spa: ['1.000', '1.000'] [Step 110 / Rank 2] Tasks: ['Single QA'] | Lens: [38173] → Tgt Spa: ['0.350'] [Step 110 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15947, 15947, 15948, 15948] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 110 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15947, 15947, 15948, 15948] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 110 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25809, 25810] 
→ Tgt Spa: ['1.000', '1.000'] [Step 110 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [53677] → Tgt Spa: ['1.000'] [Step 110 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [12006, 12007, 12015, 12011, 12011] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350'] [Step 110 / Rank 0] Tasks: ['Single QA'] | Lens: [47662] → Tgt Spa: ['0.350'] [Step 110 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [44116] → Tgt Spa: ['1.000'] [Step 110 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [44116] → Tgt Spa: ['1.000'] [Step 110 / Rank 4] Tasks: ['Single QA'] | Lens: [60896] → Tgt Spa: ['0.350'] [Step 110 / Rank 1] Tasks: ['Single QA'] | Lens: [47662] → Tgt Spa: ['0.350'] [Step 110 / Rank 5] Tasks: ['Single QA'] | Lens: [60896] → Tgt Spa: ['0.350'] [Step 110 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [12006, 12007, 12015, 12011, 12011] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350'] [Step 110 / Rank 7] Tasks: ['Single QA'] | Lens: [47419] → Tgt Spa: ['0.350'] [Step 110 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59661] → Tgt Spa: ['1.000'] [Step 110 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27249, 27249] → Tgt Spa: ['1.000', '0.350'] [Step 110 / Rank 6] Tasks: ['Single QA'] | Lens: [47419] → Tgt Spa: ['0.350'] [Step 110 / Rank 3] Tasks: ['Single QA'] | Lens: [49672] → Tgt Spa: ['0.350'] [Step 110 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27249, 27249] → Tgt Spa: ['1.000', '0.350'] [Step 110 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59661] → Tgt Spa: ['1.000'] [Step 110 / Rank 2] Tasks: ['Single QA'] | Lens: [49672] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:34:20,914 >> @ 110 | Loss: 2.1002 | LM: 2.0250 | Reg: 0.0753 | Spa(Avg): 0.447 [INFO|lh_trainer.py:797] 2026-02-16 23:34:20,914 >> Statistic -> Code | Spa: 0.455 | Tgt: 1.000 | Z-Loss: 0.160 | [INFO|lh_trainer.py:797] 2026-02-16 23:34:20,914 >> Statistic 
-> In-Context | Spa: 0.602 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:34:20,914 >> Statistic -> MultiHop | Spa: 0.495 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:34:20,915 >> Statistic -> Single | Spa: 0.395 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:34:20,915 >> Statistic -> Summarization | Spa: 0.500 | Tgt: 1.000 | Z-Loss: 0.161 | [INFO|lh_trainer.py:810] 2026-02-16 23:34:20,917 >> [Micro-Log] {"loss": 2.1002167239785194, "lm_loss": 2.0249582702914872, "reg_loss": 0.07525846093388584, "model_sparsity(avg)": 0.447488147765398, "Spa-Single QA sparsity": 0.3946759104728699, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.031897939731910206, "Spa-In-Context Learning sparsity": 0.601851847436693, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12447751892937554, "Spa-Code sparsity": 0.4552469121085273, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.15972410639127096, "Spa-Summarization sparsity": 0.5, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16135760893424353, "Spa-MultiHop QA sparsity": 0.495370348294576, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04776669293642044, "step": 110, "current_tau": 1.1644949913024902, "lambda1 Single QA": 0.53515625, "lambda2 MultiHop QA": 0.2734375, "lambda3 Summarization": 0.107421875, "lambda4 Code": 0.2060546875} [INFO|lh_trainer.py:331] 2026-02-16 23:34:44,010 >> {'loss': 12.6013, 'grad_norm': 0.8432686924934387, 'learning_rate': 0.0004483383402389753, 'epoch': 0.11690363349131122, 'num_input_tokens_seen': 272795368, 'completed': '37.00% (111 / 300)', 'remaining time': '8:52:50', 'throughput': '7329.68', 'gpu_mem_free': '6655MB', 'step': 111} [Step 111 / Rank 5] Tasks: ['Code'] | Lens: [35099] → Tgt Spa: ['1.000'] [Step 111 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44044] → Tgt Spa: ['1.000'] [Step 
111 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [31836, 31837] → Tgt Spa: ['0.350', '0.350'] [Step 111 / Rank 4] Tasks: ['Code'] | Lens: [35099] → Tgt Spa: ['1.000'] [Step 111 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [31836, 31837] → Tgt Spa: ['0.350', '0.350'] [Step 111 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44044] → Tgt Spa: ['1.000'] [Step 111 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24262, 24262] → Tgt Spa: ['0.350', '0.350'] [Step 111 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24262, 24262] → Tgt Spa: ['0.350', '0.350'] [Step 111 / Rank 3] Tasks: ['Single QA'] | Lens: [62713] → Tgt Spa: ['0.350'] [Step 111 / Rank 4] Tasks: ['Single QA'] | Lens: [65359] → Tgt Spa: ['0.350'] [Step 111 / Rank 6] Tasks: ['Code', 'Summarization'] | Lens: [23901, 23913] → Tgt Spa: ['1.000', '1.000'] [Step 111 / Rank 7] Tasks: ['Code', 'Summarization'] | Lens: [23901, 23913] → Tgt Spa: ['1.000', '1.000'] [Step 111 / Rank 1] Tasks: ['Single QA'] | Lens: [52402] → Tgt Spa: ['0.350'] [Step 111 / Rank 2] Tasks: ['Single QA'] | Lens: [62713] → Tgt Spa: ['0.350'] [Step 111 / Rank 5] Tasks: ['Single QA'] | Lens: [65359] → Tgt Spa: ['0.350'] [Step 111 / Rank 0] Tasks: ['Single QA'] | Lens: [52402] → Tgt Spa: ['0.350'] [Step 111 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18821, 18822, 18835] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 111 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17448, 17450, 17439] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 111 / Rank 0] Tasks: ['Code'] | Lens: [50037] → Tgt Spa: ['1.000'] [Step 111 / Rank 1] Tasks: ['Code'] | Lens: [50037] → Tgt Spa: ['1.000'] [Step 111 / Rank 2] Tasks: ['Single QA'] | Lens: [53363] → Tgt Spa: ['0.350'] [Step 111 / Rank 3] Tasks: ['Single QA'] | Lens: [53363] → Tgt Spa: ['0.350'] [Step 111 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18821, 18822, 18835] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 111 / Rank 4] Tasks: 
['Summarization', 'Summarization', 'Code'] | Lens: [17448, 17450, 17439] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 111 / Rank 4] Tasks: ['Single QA'] | Lens: [51067] → Tgt Spa: ['0.350'] [Step 111 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [25014, 25004] → Tgt Spa: ['1.000', '1.000'] [Step 111 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [25014, 25004] → Tgt Spa: ['1.000', '1.000'] [Step 111 / Rank 3] Tasks: ['Summarization'] | Lens: [36883] → Tgt Spa: ['1.000'] [Step 111 / Rank 2] Tasks: ['Summarization'] | Lens: [36883] → Tgt Spa: ['1.000'] [Step 111 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40796] → Tgt Spa: ['1.000'] [Step 111 / Rank 5] Tasks: ['Single QA'] | Lens: [51067] → Tgt Spa: ['0.350'] [Step 111 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40796] → Tgt Spa: ['1.000'] [Step 111 / Rank 4] Tasks: ['Single QA'] | Lens: [45418] → Tgt Spa: ['0.350'] [Step 111 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56394] → Tgt Spa: ['1.000'] [Step 111 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56394] → Tgt Spa: ['1.000'] [Step 111 / Rank 1] Tasks: ['Single QA'] | Lens: [42495] → Tgt Spa: ['0.350'] [Step 111 / Rank 0] Tasks: ['Single QA'] | Lens: [42495] → Tgt Spa: ['0.350'] [Step 111 / Rank 5] Tasks: ['Single QA'] | Lens: [45418] → Tgt Spa: ['0.350'] [Step 111 / Rank 6] Tasks: ['Single QA'] | Lens: [33966] → Tgt Spa: ['0.350'] [Step 111 / Rank 7] Tasks: ['Single QA'] | Lens: [33966] → Tgt Spa: ['0.350'] [Step 111 / Rank 5] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [9781, 9782, 9776, 9777, 9790, 9792] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 111 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [21874, 21874] → Tgt Spa: ['0.350', '0.350'] [Step 111 / Rank 4] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [9781, 9782, 9776, 9777, 9790, 9792] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 111 / Rank 7] 
Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25103, 25104] → Tgt Spa: ['1.000', '1.000'] [Step 111 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25103, 25104] → Tgt Spa: ['1.000', '1.000'] [Step 111 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [21874, 21874] → Tgt Spa: ['0.350', '0.350'] [Step 111 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22722, 22705] → Tgt Spa: ['1.000', '1.000'] [Step 111 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22722, 22705] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:37:04,574 >> @ 111 | Loss: 2.1572 | LM: 2.0722 | Reg: 0.0851 | Spa(Avg): 0.479 [INFO|lh_trainer.py:797] 2026-02-16 23:37:04,574 >> Statistic -> Code | Spa: 0.510 | Tgt: 1.000 | Z-Loss: 0.140 | [INFO|lh_trainer.py:797] 2026-02-16 23:37:04,574 >> Statistic -> In-Context | Spa: 0.606 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:37:04,574 >> Statistic -> MultiHop | Spa: 0.495 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:37:04,575 >> Statistic -> Single | Spa: 0.409 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:37:04,575 >> Statistic -> Summarization | Spa: 0.546 | Tgt: 1.000 | Z-Loss: 0.137 | [INFO|lh_trainer.py:810] 2026-02-16 23:37:04,577 >> [Micro-Log] {"loss": 2.1572140902280807, "lm_loss": 2.0721507047613463, "reg_loss": 0.08506337367968324, "model_sparsity(avg)": 0.47887730970978737, "Spa-Single QA sparsity": 0.40931372081532197, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0347125560784822, "Spa-Code sparsity": 0.5097222268581391, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1401953987777233, "Spa-In-Context Learning sparsity": 0.6064814825852712, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12365355342626572, "Spa-Summarization sparsity": 0.545634925365448, "Spa-Summarization target_sparsity": 
1.0, "Spa-Summarization log_z_loss": 0.13703613728284836, "Spa-MultiHop QA sparsity": 0.495370348294576, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04776669293642044, "step": 111, "current_tau": 1.1604080200195312, "lambda1 Single QA": 0.5390625, "lambda2 MultiHop QA": 0.2734375, "lambda3 Summarization": 0.10791015625, "lambda4 Code": 0.20703125} [INFO|lh_trainer.py:331] 2026-02-16 23:37:16,998 >> {'loss': 12.9433, 'grad_norm': 0.9262714982032776, 'learning_rate': 0.00044632923808726293, 'epoch': 0.11795681937862032, 'num_input_tokens_seen': 275169288, 'completed': '37.33% (112 / 300)', 'remaining time': '8:49:34', 'throughput': '7758.49', 'gpu_mem_free': '12557MB', 'step': 112} [Step 112 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23990, 23990] → Tgt Spa: ['1.000', '1.000'] [Step 112 / Rank 4] Tasks: ['Single QA'] | Lens: [49217] → Tgt Spa: ['0.350'] [Step 112 / Rank 5] Tasks: ['Single QA'] | Lens: [49217] → Tgt Spa: ['0.350'] [Step 112 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23990, 23990] → Tgt Spa: ['1.000', '1.000'] [Step 112 / Rank 1] Tasks: ['In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [4581, 4581, 4582, 4583, 4584, 4594, 4592, 4586, 4585, 4585, 4586, 4586, 4587, 4605] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 112 / Rank 3] Tasks: ['Single QA'] | Lens: [33996] → Tgt Spa: ['0.350'] [Step 112 / Rank 2] Tasks: ['Single QA'] | Lens: [33996] → Tgt Spa: ['0.350'] [Step 112 / Rank 0] Tasks: ['In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context 
Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [4581, 4581, 4582, 4583, 4584, 4594, 4592, 4586, 4585, 4585, 4586, 4586, 4587, 4605] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 112 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26963, 26968] → Tgt Spa: ['1.000', '1.000'] [Step 112 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54506] → Tgt Spa: ['1.000'] [Step 112 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26963, 26968] → Tgt Spa: ['1.000', '1.000'] [Step 112 / Rank 6] Tasks: ['Single QA'] | Lens: [52300] → Tgt Spa: ['0.350'] [Step 112 / Rank 0] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [20780, 20800, 20790] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 112 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54506] → Tgt Spa: ['1.000'] [Step 112 / Rank 1] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [20780, 20800, 20790] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 112 / Rank 7] Tasks: ['Single QA'] | Lens: [52300] → Tgt Spa: ['0.350'] [Step 112 / Rank 6] Tasks: ['Single QA'] | Lens: [55536] → Tgt Spa: ['0.350'] [Step 112 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [25455, 25455] → Tgt Spa: ['0.350', '0.350'] [Step 112 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [25455, 25455] → Tgt Spa: ['0.350', '0.350'] [Step 112 / Rank 7] Tasks: ['Single QA'] | Lens: [55536] → Tgt Spa: ['0.350'] [Step 112 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22673, 22673] → Tgt Spa: ['0.350', '1.000'] [Step 112 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22673, 22673] → Tgt Spa: ['0.350', '1.000'] [Step 112 / Rank 1] Tasks: ['Single QA'] | Lens: [46003] → Tgt Spa: ['0.350'] [Step 112 / Rank 0] Tasks: ['Single QA'] | Lens: [46003] → Tgt Spa: ['0.350'] [Step 112 / Rank 3] 
Tasks: ['Single QA', 'Single QA'] | Lens: [31099, 31099] → Tgt Spa: ['0.350', '0.350'] [Step 112 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [31099, 31099] → Tgt Spa: ['0.350', '0.350'] [Step 112 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40810] → Tgt Spa: ['1.000'] [Step 112 / Rank 1] Tasks: ['Single QA'] | Lens: [57076] → Tgt Spa: ['0.350'] [Step 112 / Rank 0] Tasks: ['Single QA'] | Lens: [57076] → Tgt Spa: ['0.350'] [Step 112 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59615] → Tgt Spa: ['1.000'] [Step 112 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59615] → Tgt Spa: ['1.000'] [Step 112 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40810] → Tgt Spa: ['1.000'] [Step 112 / Rank 3] Tasks: ['Single QA'] | Lens: [50872] → Tgt Spa: ['0.350'] [Step 112 / Rank 6] Tasks: ['Single QA'] | Lens: [59573] → Tgt Spa: ['0.350'] [Step 112 / Rank 5] Tasks: ['Single QA'] | Lens: [51022] → Tgt Spa: ['0.350'] [Step 112 / Rank 1] Tasks: ['Single QA'] | Lens: [33360] → Tgt Spa: ['0.350'] [Step 112 / Rank 2] Tasks: ['Single QA'] | Lens: [50872] → Tgt Spa: ['0.350'] [Step 112 / Rank 4] Tasks: ['Single QA'] | Lens: [51022] → Tgt Spa: ['0.350'] [Step 112 / Rank 7] Tasks: ['Single QA'] | Lens: [59573] → Tgt Spa: ['0.350'] [Step 112 / Rank 0] Tasks: ['Single QA'] | Lens: [33360] → Tgt Spa: ['0.350'] [Step 112 / Rank 3] Tasks: ['Single QA'] | Lens: [49226] → Tgt Spa: ['0.350'] [Step 112 / Rank 4] Tasks: ['Single QA'] | Lens: [45738] → Tgt Spa: ['0.350'] [Step 112 / Rank 5] Tasks: ['Single QA'] | Lens: [45738] → Tgt Spa: ['0.350'] [Step 112 / Rank 1] Tasks: ['Single QA'] | Lens: [40718] → Tgt Spa: ['0.350'] [Step 112 / Rank 0] Tasks: ['Single QA'] | Lens: [40718] → Tgt Spa: ['0.350'] [Step 112 / Rank 2] Tasks: ['Single QA'] | Lens: [49226] → Tgt Spa: ['0.350'] [Step 112 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [24059, 24052] → Tgt Spa: ['1.000', '1.000'] [Step 112 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [24059, 24052] → Tgt Spa: 
['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:39:39,488 >> @ 112 | Loss: 2.3137 | LM: 2.2471 | Reg: 0.0666 | Spa(Avg): 0.477 [INFO|lh_trainer.py:797] 2026-02-16 23:39:39,488 >> Statistic -> Code | Spa: 0.524 | Tgt: 1.000 | Z-Loss: 0.135 | [INFO|lh_trainer.py:797] 2026-02-16 23:39:39,488 >> Statistic -> In-Context | Spa: 0.618 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:39:39,489 >> Statistic -> MultiHop | Spa: 0.495 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:39:39,489 >> Statistic -> Single | Spa: 0.422 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:39:39,489 >> Statistic -> Summarization | Spa: 0.500 | Tgt: 1.000 | Z-Loss: 0.163 | [INFO|lh_trainer.py:810] 2026-02-16 23:39:39,490 >> [Micro-Log] {"loss": 2.3137388601899147, "lm_loss": 2.2471280296643577, "reg_loss": 0.06661084366108601, "model_sparsity(avg)": 0.47696207587917644, "Spa-In-Context Learning sparsity": 0.6176900549938804, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11981588013862308, "Spa-Single QA sparsity": 0.4222222179174423, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04178126829210669, "Spa-Code sparsity": 0.5243055522441864, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13538030721247196, "Spa-Summarization sparsity": 0.4999999701976776, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16294202953577042, "Spa-MultiHop QA sparsity": 0.495370348294576, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04776669293642044, "step": 112, "current_tau": 1.1563483476638794, "lambda1 Single QA": 0.5390625, "lambda2 MultiHop QA": 0.2734375, "lambda3 Summarization": 0.10888671875, "lambda4 Code": 0.2080078125} [INFO|lh_trainer.py:331] 2026-02-16 23:39:56,908 >> {'loss': 13.8824, 'grad_norm': 0.7486256957054138, 'learning_rate': 0.00044428649593559365, 'epoch': 0.11901000526592943, 
'num_input_tokens_seen': 277598550, 'completed': '37.67% (113 / 300)', 'remaining time': '8:46:30', 'throughput': '7595.71', 'gpu_mem_free': '11747MB', 'step': 113} [Step 113 / Rank 5] Tasks: ['Single QA'] | Lens: [52675] → Tgt Spa: ['0.350'] [Step 113 / Rank 3] Tasks: ['Single QA'] | Lens: [45877] → Tgt Spa: ['0.350'] [Step 113 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25717, 25736] → Tgt Spa: ['1.000', '1.000'] [Step 113 / Rank 7] Tasks: ['Single QA'] | Lens: [34206] → Tgt Spa: ['0.350'] [Step 113 / Rank 4] Tasks: ['Single QA'] | Lens: [52675] → Tgt Spa: ['0.350'] [Step 113 / Rank 2] Tasks: ['Single QA'] | Lens: [45877] → Tgt Spa: ['0.350'] [Step 113 / Rank 6] Tasks: ['Single QA'] | Lens: [34206] → Tgt Spa: ['0.350'] [Step 113 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25717, 25736] → Tgt Spa: ['1.000', '1.000'] [Step 113 / Rank 6] Tasks: ['Code'] | Lens: [33868] → Tgt Spa: ['1.000'] [Step 113 / Rank 5] Tasks: ['Single QA'] | Lens: [47485] → Tgt Spa: ['0.350'] [Step 113 / Rank 4] Tasks: ['Single QA'] | Lens: [47485] → Tgt Spa: ['0.350'] [Step 113 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29226, 29226] → Tgt Spa: ['0.350', '0.350'] [Step 113 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29180, 29181] → Tgt Spa: ['0.350', '0.350'] [Step 113 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29226, 29226] → Tgt Spa: ['0.350', '0.350'] [Step 113 / Rank 7] Tasks: ['Code'] | Lens: [33868] → Tgt Spa: ['1.000'] [Step 113 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29180, 29181] → Tgt Spa: ['0.350', '0.350'] [Step 113 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [21508, 21520, 21509] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 113 / Rank 3] Tasks: ['Code'] | Lens: [45186] → Tgt Spa: ['1.000'] [Step 113 / Rank 5] Tasks: ['Single QA'] | Lens: [54040] → Tgt Spa: ['0.350'] [Step 113 / Rank 2] Tasks: ['Code'] | Lens: [45186] → Tgt Spa: ['1.000'] [Step 113 / Rank 7] Tasks: ['Code', 
'Summarization', 'Code'] | Lens: [21508, 21520, 21509] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 113 / Rank 0] Tasks: ['Single QA'] | Lens: [58409] → Tgt Spa: ['0.350'] [Step 113 / Rank 4] Tasks: ['Single QA'] | Lens: [54040] → Tgt Spa: ['0.350'] [Step 113 / Rank 1] Tasks: ['Single QA'] | Lens: [58409] → Tgt Spa: ['0.350'] [Step 113 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22402, 22402] → Tgt Spa: ['1.000', '1.000'] [Step 113 / Rank 5] Tasks: ['Single QA'] | Lens: [38295] → Tgt Spa: ['0.350'] [Step 113 / Rank 0] Tasks: ['Single QA'] | Lens: [52288] → Tgt Spa: ['0.350'] [Step 113 / Rank 1] Tasks: ['Single QA'] | Lens: [52288] → Tgt Spa: ['0.350'] [Step 113 / Rank 4] Tasks: ['Single QA'] | Lens: [38295] → Tgt Spa: ['0.350'] [Step 113 / Rank 7] Tasks: ['Single QA'] | Lens: [56989] → Tgt Spa: ['0.350'] [Step 113 / Rank 6] Tasks: ['Single QA'] | Lens: [56989] → Tgt Spa: ['0.350'] [Step 113 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22402, 22402] → Tgt Spa: ['1.000', '1.000'] [Step 113 / Rank 5] Tasks: ['Single QA'] | Lens: [65362] → Tgt Spa: ['0.350'] [Step 113 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15807, 15807, 15807, 15807] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 113 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15807, 15807, 15807, 15807] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 113 / Rank 0] Tasks: ['Single QA'] | Lens: [42405] → Tgt Spa: ['0.350'] [Step 113 / Rank 1] Tasks: ['Single QA'] | Lens: [42405] → Tgt Spa: ['0.350'] [Step 113 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40984] → Tgt Spa: ['1.000'] [Step 113 / Rank 4] Tasks: ['Single QA'] | Lens: [65362] → Tgt Spa: ['0.350'] [Step 113 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40984] → Tgt Spa: ['1.000'] [Step 113 / Rank 6] Tasks: ['Single QA'] | Lens: [51484] → Tgt Spa: ['0.350'] [Step 113 / Rank 0] Tasks: ['Single QA', 'Single QA', 
'Code'] | Lens: [21035, 21035, 21043] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 113 / Rank 7] Tasks: ['Single QA'] | Lens: [51484] → Tgt Spa: ['0.350'] [Step 113 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Code'] | Lens: [21035, 21035, 21043] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 113 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18248, 18237, 18253] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 113 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29712, 29712] → Tgt Spa: ['0.350', '1.000'] [Step 113 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18248, 18237, 18253] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 113 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29712, 29712] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:42:23,211 >> @ 113 | Loss: 2.1734 | LM: 2.1148 | Reg: 0.0586 | Spa(Avg): 0.452 [INFO|lh_trainer.py:797] 2026-02-16 23:42:23,211 >> Statistic -> Code | Spa: 0.535 | Tgt: 1.000 | Z-Loss: 0.132 | [INFO|lh_trainer.py:797] 2026-02-16 23:42:23,211 >> Statistic -> In-Context | Spa: 0.617 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:42:23,211 >> Statistic -> MultiHop | Spa: 0.495 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:42:23,211 >> Statistic -> Single | Spa: 0.405 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:42:23,211 >> Statistic -> Summarization | Spa: 0.559 | Tgt: 1.000 | Z-Loss: 0.132 | [INFO|lh_trainer.py:810] 2026-02-16 23:42:23,213 >> [Micro-Log] {"loss": 2.173411493500074, "lm_loss": 2.114828416456779, "reg_loss": 0.058583077926111095, "model_sparsity(avg)": 0.4522087213893731, "Spa-In-Context Learning sparsity": 0.6166666746139526, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12033374607563019, "Spa-Summarization sparsity": 0.5590277910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 
0.1319461315870285, "Spa-Single QA sparsity": 0.40519323556319525, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.033172055977680116, "Spa-Code sparsity": 0.5347222288449606, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1322105290989081, "Spa-MultiHop QA sparsity": 0.495370348294576, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04776669293642044, "step": 113, "current_tau": 1.1523171663284302, "lambda1 Single QA": 0.5390625, "lambda2 MultiHop QA": 0.2734375, "lambda3 Summarization": 0.109375, "lambda4 Code": 0.2080078125} [INFO|lh_trainer.py:331] 2026-02-16 23:42:41,731 >> {'loss': 13.0405, 'grad_norm': 0.5905117392539978, 'learning_rate': 0.0004422104637973191, 'epoch': 0.12006319115323855, 'num_input_tokens_seen': 280073876, 'completed': '38.00% (114 / 300)', 'remaining time': '8:43:34', 'throughput': '7509.05', 'gpu_mem_free': '7041MB', 'step': 114} [Step 114 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [30810, 30810] → Tgt Spa: ['1.000', '1.000'] [Step 114 / Rank 6] Tasks: ['Single QA'] | Lens: [48615] → Tgt Spa: ['0.350'] [Step 114 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [52272] → Tgt Spa: ['1.000'] [Step 114 / Rank 1] Tasks: ['Code', 'MultiHop QA'] | Lens: [30531, 30530] → Tgt Spa: ['1.000', '0.350'] [Step 114 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [30810, 30810] → Tgt Spa: ['1.000', '1.000'] [Step 114 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [52272] → Tgt Spa: ['1.000'] [Step 114 / Rank 7] Tasks: ['Single QA'] | Lens: [48615] → Tgt Spa: ['0.350'] [Step 114 / Rank 0] Tasks: ['Code', 'MultiHop QA'] | Lens: [30531, 30530] → Tgt Spa: ['1.000', '0.350'] [Step 114 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56028] → Tgt Spa: ['1.000'] [Step 114 / Rank 2] Tasks: ['Single QA'] | Lens: [60959] → Tgt Spa: ['0.350'] [Step 114 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [25419, 25421] → Tgt Spa: ['1.000', '1.000'] [Step 114 / Rank 0] Tasks: ['In-Context Learning'] | Lens: 
[42777] → Tgt Spa: ['1.000'] [Step 114 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42777] → Tgt Spa: ['1.000'] [Step 114 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56028] → Tgt Spa: ['1.000'] [Step 114 / Rank 3] Tasks: ['Single QA'] | Lens: [60959] → Tgt Spa: ['0.350'] [Step 114 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [25419, 25421] → Tgt Spa: ['1.000', '1.000'] [Step 114 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24792, 24792] → Tgt Spa: ['0.350', '1.000'] [Step 114 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [25589, 25582] → Tgt Spa: ['1.000', '1.000'] [Step 114 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [27531, 27531] → Tgt Spa: ['0.350', '0.350'] [Step 114 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32129, 32129] → Tgt Spa: ['0.350', '0.350'] [Step 114 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [27531, 27531] → Tgt Spa: ['0.350', '0.350'] [Step 114 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24792, 24792] → Tgt Spa: ['0.350', '1.000'] [Step 114 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [25589, 25582] → Tgt Spa: ['1.000', '1.000'] [Step 114 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32129, 32129] → Tgt Spa: ['0.350', '0.350'] [Step 114 / Rank 3] Tasks: ['Single QA'] | Lens: [42418] → Tgt Spa: ['0.350'] [Step 114 / Rank 1] Tasks: ['Code'] | Lens: [59933] → Tgt Spa: ['1.000'] [Step 114 / Rank 0] Tasks: ['Code'] | Lens: [59933] → Tgt Spa: ['1.000'] [Step 114 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42283] → Tgt Spa: ['1.000'] [Step 114 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [37351] → Tgt Spa: ['1.000'] [Step 114 / Rank 2] Tasks: ['Single QA'] | Lens: [42418] → Tgt Spa: ['0.350'] [Step 114 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42283] → Tgt Spa: ['1.000'] [Step 114 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [37351] → Tgt Spa: ['1.000'] [Step 114 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [24267, 24267] → 
Tgt Spa: ['0.350', '0.350'] [Step 114 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Summarization', 'Code'] | Lens: [8943, 8943, 8944, 8953, 8945, 8965, 8962] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 114 / Rank 6] Tasks: ['Single QA'] | Lens: [52528] → Tgt Spa: ['0.350'] [Step 114 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60688] → Tgt Spa: ['1.000'] [Step 114 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60688] → Tgt Spa: ['1.000'] [Step 114 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [24267, 24267] → Tgt Spa: ['0.350', '0.350'] [Step 114 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Summarization', 'Code'] | Lens: [8943, 8943, 8944, 8953, 8945, 8965, 8962] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 114 / Rank 7] Tasks: ['Single QA'] | Lens: [52528] → Tgt Spa: ['0.350'] [Step 114 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [21871, 21871] → Tgt Spa: ['0.350', '0.350'] [Step 114 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [36380] → Tgt Spa: ['1.000'] [Step 114 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [36380] → Tgt Spa: ['1.000'] [Step 114 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [65339] → Tgt Spa: ['0.350'] [Step 114 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25324, 25324] → Tgt Spa: ['0.350', '1.000'] [Step 114 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25324, 25324] → Tgt Spa: ['0.350', '1.000'] [Step 114 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [21871, 21871] → Tgt Spa: ['0.350', '0.350'] [Step 114 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [65339] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:45:10,154 >> @ 114 | Loss: 2.0676 | LM: 1.9892 | Reg: 0.0784 | Spa(Avg): 0.503 [INFO|lh_trainer.py:797] 2026-02-16 23:45:10,154 >> Statistic -> Code | Spa: 0.514 | Tgt: 1.000 | Z-Loss: 0.140 | [INFO|lh_trainer.py:797] 2026-02-16 
23:45:10,154 >> Statistic -> In-Context | Spa: 0.629 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:45:10,154 >> Statistic -> MultiHop | Spa: 0.417 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:45:10,154 >> Statistic -> Single | Spa: 0.420 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:45:10,154 >> Statistic -> Summarization | Spa: 0.625 | Tgt: 1.000 | Z-Loss: 0.102 | [INFO|lh_trainer.py:810] 2026-02-16 23:45:10,156 >> [Micro-Log] {"loss": 2.0676065903777876, "lm_loss": 1.9891820799869795, "reg_loss": 0.07842451809847262, "model_sparsity(avg)": 0.5034308843314648, "Spa-Code sparsity": 0.513888869020674, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1400942156712214, "Spa-MultiHop QA sparsity": 0.4166666567325592, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02006708923727274, "Spa-In-Context Learning sparsity": 0.6291666686534881, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11679185330867767, "Spa-Single QA sparsity": 0.4197530812687344, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.041692374804471105, "Spa-Summarization sparsity": 0.625, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.101531982421875, "step": 114, "current_tau": 1.1483157873153687, "lambda1 Single QA": 0.5390625, "lambda2 MultiHop QA": 0.2734375, "lambda3 Summarization": 0.1103515625, "lambda4 Code": 0.208984375} [INFO|lh_trainer.py:331] 2026-02-16 23:45:37,646 >> {'loss': 12.4056, 'grad_norm': 1.0221725702285767, 'learning_rate': 0.0004401014973898586, 'epoch': 0.12111637704054766, 'num_input_tokens_seen': 282587368, 'completed': '38.33% (115 / 300)', 'remaining time': '8:40:57', 'throughput': '7144.05', 'gpu_mem_free': '12457MB', 'step': 115} [Step 115 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32371, 32372] → Tgt Spa: ['0.350', '0.350'] [Step 115 / Rank 4] Tasks: 
['Single QA', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'Single QA'] | Lens: [8122, 8131, 8124, 8134, 8127, 8127, 8129, 8130] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 115 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25991, 25990] → Tgt Spa: ['1.000', '1.000'] [Step 115 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44625] → Tgt Spa: ['1.000'] [Step 115 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44625] → Tgt Spa: ['1.000'] [Step 115 / Rank 5] Tasks: ['Single QA', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'Single QA'] | Lens: [8122, 8131, 8124, 8134, 8127, 8127, 8129, 8130] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 115 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32371, 32372] → Tgt Spa: ['0.350', '0.350'] [Step 115 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25991, 25990] → Tgt Spa: ['1.000', '1.000'] [Step 115 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [21967, 21966] → Tgt Spa: ['1.000', '1.000'] [Step 115 / Rank 4] Tasks: ['Single QA'] | Lens: [42634] → Tgt Spa: ['0.350'] [Step 115 / Rank 2] Tasks: ['Single QA'] | Lens: [57256] → Tgt Spa: ['0.350'] [Step 115 / Rank 1] Tasks: ['Single QA'] | Lens: [65224] → Tgt Spa: ['0.350'] [Step 115 / Rank 5] Tasks: ['Single QA'] | Lens: [42634] → Tgt Spa: ['0.350'] [Step 115 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [21967, 21966] → Tgt Spa: ['1.000', '1.000'] [Step 115 / Rank 3] Tasks: ['Single QA'] | Lens: [57256] → Tgt Spa: ['0.350'] [Step 115 / Rank 0] Tasks: ['Single QA'] | Lens: [65224] → Tgt Spa: ['0.350'] [Step 115 / Rank 5] Tasks: ['Code'] | Lens: [41975] → Tgt Spa: ['1.000'] [Step 115 / Rank 4] Tasks: ['Code'] | Lens: [41975] → Tgt Spa: ['1.000'] [Step 115 / Rank 7] Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [20108, 20112, 20107] → Tgt Spa: 
['1.000', '1.000', '1.000'] [Step 115 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [28580, 28574] → Tgt Spa: ['1.000', '1.000'] [Step 115 / Rank 2] Tasks: ['Code'] | Lens: [60924] → Tgt Spa: ['1.000'] [Step 115 / Rank 3] Tasks: ['Code'] | Lens: [60924] → Tgt Spa: ['1.000'] [Step 115 / Rank 6] Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [20108, 20112, 20107] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 115 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [28580, 28574] → Tgt Spa: ['1.000', '1.000'] [Step 115 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55542] → Tgt Spa: ['1.000'] [Step 115 / Rank 0] Tasks: ['Single QA'] | Lens: [51270] → Tgt Spa: ['0.350'] [Step 115 / Rank 6] Tasks: ['Code'] | Lens: [59262] → Tgt Spa: ['1.000'] [Step 115 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55542] → Tgt Spa: ['1.000'] [Step 115 / Rank 7] Tasks: ['Code'] | Lens: [59262] → Tgt Spa: ['1.000'] [Step 115 / Rank 3] Tasks: ['Single QA'] | Lens: [33536] → Tgt Spa: ['0.350'] [Step 115 / Rank 1] Tasks: ['Single QA'] | Lens: [51270] → Tgt Spa: ['0.350'] [Step 115 / Rank 2] Tasks: ['Single QA'] | Lens: [33536] → Tgt Spa: ['0.350'] [Step 115 / Rank 2] Tasks: ['Single QA'] | Lens: [42366] → Tgt Spa: ['0.350'] [Step 115 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24057, 24062] → Tgt Spa: ['0.350', '1.000'] [Step 115 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [29133, 29125] → Tgt Spa: ['1.000', '1.000'] [Step 115 / Rank 1] Tasks: ['Summarization', 'Single QA'] | Lens: [24768, 24749] → Tgt Spa: ['1.000', '0.350'] [Step 115 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24057, 24062] → Tgt Spa: ['0.350', '1.000'] [Step 115 / Rank 0] Tasks: ['Summarization', 'Single QA'] | Lens: [24768, 24749] → Tgt Spa: ['1.000', '0.350'] [Step 115 / Rank 3] Tasks: ['Single QA'] | Lens: [42366] → Tgt Spa: ['0.350'] [Step 115 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [29133, 29125] → Tgt Spa: ['1.000', 
'1.000'] [Step 115 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15499, 15500, 15502, 15506] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 115 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15499, 15500, 15502, 15506] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 115 / Rank 2] Tasks: ['Single QA'] | Lens: [50376] → Tgt Spa: ['0.350'] [Step 115 / Rank 3] Tasks: ['Single QA'] | Lens: [50376] → Tgt Spa: ['0.350'] [Step 115 / Rank 7] Tasks: ['Single QA', 'Summarization'] | Lens: [24805, 24826] → Tgt Spa: ['0.350', '1.000'] [Step 115 / Rank 6] Tasks: ['Single QA', 'Summarization'] | Lens: [24805, 24826] → Tgt Spa: ['0.350', '1.000'] [Step 115 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23766, 23767] → Tgt Spa: ['1.000', '0.350'] [Step 115 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23766, 23767] → Tgt Spa: ['1.000', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:48:03,766 >> @ 115 | Loss: 2.0796 | LM: 1.9960 | Reg: 0.0836 | Spa(Avg): 0.472 [INFO|lh_trainer.py:797] 2026-02-16 23:48:03,766 >> Statistic -> Code | Spa: 0.505 | Tgt: 1.000 | Z-Loss: 0.144 | [INFO|lh_trainer.py:797] 2026-02-16 23:48:03,766 >> Statistic -> In-Context | Spa: 0.615 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:48:03,766 >> Statistic -> MultiHop | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:48:03,766 >> Statistic -> Single | Spa: 0.421 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:48:03,767 >> Statistic -> Summarization | Spa: 0.472 | Tgt: 1.000 | Z-Loss: 0.180 | [INFO|lh_trainer.py:810] 2026-02-16 23:48:03,769 >> [Micro-Log] {"loss": 2.0795854553580284, "lm_loss": 1.9959755837917328, "reg_loss": 0.0836098824123231, "model_sparsity(avg)": 0.4724392307301362, "Spa-In-Context Learning sparsity": 0.614898990501057, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 
0.12184603444554588, "Spa-Single QA sparsity": 0.4208333224058151, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.044525799102848394, "Spa-Code sparsity": 0.5050505020401694, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1440494114702398, "Spa-Summarization sparsity": 0.4722222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.17970671504735947, "Spa-MultiHop QA sparsity": 0.4583333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03430512547492981, "step": 115, "current_tau": 1.1443454027175903, "lambda1 Single QA": 0.5390625, "lambda2 MultiHop QA": 0.275390625, "lambda3 Summarization": 0.11083984375, "lambda4 Code": 0.2099609375} [INFO|lh_trainer.py:331] 2026-02-16 23:48:21,962 >> {'loss': 12.4775, 'grad_norm': 1.0212949514389038, 'learning_rate': 0.00043795995807374916, 'epoch': 0.12216956292785677, 'num_input_tokens_seen': 285113802, 'completed': '38.67% (116 / 300)', 'remaining time': '8:38:00', 'throughput': '7687.71', 'gpu_mem_free': '12055MB', 'step': 116} [Step 116 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43101] → Tgt Spa: ['1.000'] [Step 116 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21461, 21461, 21461] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 116 / Rank 5] Tasks: ['Single QA'] | Lens: [42502] → Tgt Spa: ['0.350'] [Step 116 / Rank 4] Tasks: ['Single QA'] | Lens: [42502] → Tgt Spa: ['0.350'] [Step 116 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21461, 21461, 21461] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 116 / Rank 3] Tasks: ['Single QA'] | Lens: [58939] → Tgt Spa: ['0.350'] [Step 116 / Rank 2] Tasks: ['Single QA'] | Lens: [58939] → Tgt Spa: ['0.350'] [Step 116 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43101] → Tgt Spa: ['1.000'] [Step 116 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19031, 19043, 19032] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 116 / 
Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19031, 19043, 19032] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 116 / Rank 0] Tasks: ['Single QA'] | Lens: [32814] → Tgt Spa: ['0.350'] [Step 116 / Rank 7] Tasks: ['Code', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [11210, 11205, 11211, 11216, 11216] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000'] [Step 116 / Rank 6] Tasks: ['Code', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [11210, 11205, 11211, 11216, 11216] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000'] [Step 116 / Rank 2] Tasks: ['Code'] | Lens: [43326] → Tgt Spa: ['1.000'] [Step 116 / Rank 1] Tasks: ['Single QA'] | Lens: [32814] → Tgt Spa: ['0.350'] [Step 116 / Rank 3] Tasks: ['Code'] | Lens: [43326] → Tgt Spa: ['1.000'] [Step 116 / Rank 4] Tasks: ['Code'] | Lens: [34025] → Tgt Spa: ['1.000'] [Step 116 / Rank 3] Tasks: ['Code'] | Lens: [33823] → Tgt Spa: ['1.000'] [Step 116 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32235, 32236] → Tgt Spa: ['0.350', '0.350'] [Step 116 / Rank 0] Tasks: ['Code'] | Lens: [32862] → Tgt Spa: ['1.000'] [Step 116 / Rank 5] Tasks: ['Code'] | Lens: [34025] → Tgt Spa: ['1.000'] [Step 116 / Rank 2] Tasks: ['Code'] | Lens: [33823] → Tgt Spa: ['1.000'] [Step 116 / Rank 1] Tasks: ['Code'] | Lens: [32862] → Tgt Spa: ['1.000'] [Step 116 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32235, 32236] → Tgt Spa: ['0.350', '0.350'] [Step 116 / Rank 4] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [12855, 12872, 12890, 12894, 12909] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000'] [Step 116 / Rank 0] Tasks: ['Single QA'] | Lens: [33021] → Tgt Spa: ['0.350'] [Step 116 / Rank 5] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [12855, 12872, 12890, 12894, 12909] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000'] [Step 116 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [26694, 26703] → Tgt Spa: ['1.000', '1.000'] [Step 116 / Rank 2] Tasks: ['Code'] | 
Lens: [53172] → Tgt Spa: ['1.000'] [Step 116 / Rank 3] Tasks: ['Code'] | Lens: [53172] → Tgt Spa: ['1.000'] [Step 116 / Rank 1] Tasks: ['Single QA'] | Lens: [33021] → Tgt Spa: ['0.350'] [Step 116 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [26694, 26703] → Tgt Spa: ['1.000', '1.000'] [Step 116 / Rank 3] Tasks: ['Single QA'] | Lens: [47568] → Tgt Spa: ['0.350'] [Step 116 / Rank 2] Tasks: ['Single QA'] | Lens: [47568] → Tgt Spa: ['0.350'] [Step 116 / Rank 1] Tasks: ['Summarization'] | Lens: [45140] → Tgt Spa: ['1.000'] [Step 116 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [6881, 6881, 6882, 6881, 6882, 6882, 6883, 6883, 6883] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 116 / Rank 0] Tasks: ['Summarization'] | Lens: [45140] → Tgt Spa: ['1.000'] [Step 116 / Rank 4] Tasks: ['Code'] | Lens: [38937] → Tgt Spa: ['1.000'] [Step 116 / Rank 5] Tasks: ['Code'] | Lens: [38937] → Tgt Spa: ['1.000'] [Step 116 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [6881, 6881, 6882, 6881, 6882, 6882, 6883, 6883, 6883] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 116 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29938, 29940] → Tgt Spa: ['0.350', '0.350'] [Step 116 / Rank 3] Tasks: ['Single QA'] | Lens: [45554] → Tgt Spa: ['0.350'] [Step 116 / Rank 1] Tasks: ['Code'] | Lens: [32960] → Tgt Spa: ['1.000'] [Step 116 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [24687, 24696] → Tgt Spa: ['0.350', '1.000'] [Step 116 / Rank 2] Tasks: ['Single QA'] | Lens: [45554] → Tgt Spa: ['0.350'] [Step 116 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [24687, 24696] → Tgt Spa: ['0.350', '1.000'] [Step 116 / Rank 0] Tasks: ['Code'] | Lens: [32960] → 
Tgt Spa: ['1.000'] [Step 116 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29938, 29940] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:50:27,434 >> @ 116 | Loss: 1.6727 | LM: 1.5884 | Reg: 0.0843 | Spa(Avg): 0.486 [INFO|lh_trainer.py:797] 2026-02-16 23:50:27,434 >> Statistic -> Code | Spa: 0.560 | Tgt: 1.000 | Z-Loss: 0.125 | [INFO|lh_trainer.py:797] 2026-02-16 23:50:27,434 >> Statistic -> In-Context | Spa: 0.625 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:50:27,434 >> Statistic -> MultiHop | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:50:27,434 >> Statistic -> Single | Spa: 0.420 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:50:27,435 >> Statistic -> Summarization | Spa: 0.528 | Tgt: 1.000 | Z-Loss: 0.148 | [INFO|lh_trainer.py:810] 2026-02-16 23:50:27,436 >> [Micro-Log] {"loss": 1.672727254529794, "lm_loss": 1.5884467133631308, "reg_loss": 0.0842805251207513, "model_sparsity(avg)": 0.4858088940382004, "Spa-In-Context Learning sparsity": 0.625, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11916426755487919, "Spa-Single QA sparsity": 0.41968598054802936, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04857198935528488, "Spa-Code sparsity": 0.5599415208164015, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12479097552989658, "Spa-Summarization sparsity": 0.527777761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14826682955026627, "Spa-MultiHop QA sparsity": 0.4583333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03430512547492981, "step": 116, "current_tau": 1.1404072046279907, "lambda1 Single QA": 0.54296875, "lambda2 MultiHop QA": 0.275390625, "lambda3 Summarization": 0.11181640625, "lambda4 Code": 0.2109375} [INFO|lh_trainer.py:331] 2026-02-16 23:50:43,064 >> {'loss': 10.0364, 
'grad_norm': 1.2418874502182007, 'learning_rate': 0.00043578621279072793, 'epoch': 0.12322274881516587, 'num_input_tokens_seen': 287411358, 'completed': '39.00% (117 / 300)', 'remaining time': '8:34:28', 'throughput': '8141.53', 'gpu_mem_free': '14953MB', 'step': 117} [Step 117 / Rank 7] Tasks: ['Single QA'] | Lens: [33900] → Tgt Spa: ['0.350'] [Step 117 / Rank 5] Tasks: ['Summarization'] | Lens: [61054] → Tgt Spa: ['1.000'] [Step 117 / Rank 4] Tasks: ['Summarization'] | Lens: [61054] → Tgt Spa: ['1.000'] [Step 117 / Rank 0] Tasks: ['Single QA'] | Lens: [49727] → Tgt Spa: ['0.350'] [Step 117 / Rank 2] Tasks: ['Code', 'Summarization'] | Lens: [31318, 31331] → Tgt Spa: ['1.000', '1.000'] [Step 117 / Rank 6] Tasks: ['Single QA'] | Lens: [33900] → Tgt Spa: ['0.350'] [Step 117 / Rank 1] Tasks: ['Single QA'] | Lens: [49727] → Tgt Spa: ['0.350'] [Step 117 / Rank 3] Tasks: ['Code', 'Summarization'] | Lens: [31318, 31331] → Tgt Spa: ['1.000', '1.000'] [Step 117 / Rank 4] Tasks: ['Code'] | Lens: [60511] → Tgt Spa: ['1.000'] [Step 117 / Rank 5] Tasks: ['Code'] | Lens: [60511] → Tgt Spa: ['1.000'] [Step 117 / Rank 7] Tasks: ['Single QA'] | Lens: [44457] → Tgt Spa: ['0.350'] [Step 117 / Rank 6] Tasks: ['Single QA'] | Lens: [44457] → Tgt Spa: ['0.350'] [Step 117 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25005, 25005] → Tgt Spa: ['1.000', '1.000'] [Step 117 / Rank 1] Tasks: ['Single QA'] | Lens: [36393] → Tgt Spa: ['0.350'] [Step 117 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25005, 25005] → Tgt Spa: ['1.000', '1.000'] [Step 117 / Rank 0] Tasks: ['Single QA'] | Lens: [36393] → Tgt Spa: ['0.350'] [Step 117 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61944] → Tgt Spa: ['1.000'] [Step 117 / Rank 2] Tasks: ['Single QA'] | Lens: [37932] → Tgt Spa: ['0.350'] [Step 117 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31019, 31020] → Tgt Spa: ['0.350', '0.350'] [Step 117 / Rank 4] Tasks: ['In-Context Learning'] | 
Lens: [61944] → Tgt Spa: ['1.000'] [Step 117 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [16635, 16636, 16636] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 117 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [16635, 16636, 16636] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 117 / Rank 3] Tasks: ['Single QA'] | Lens: [37932] → Tgt Spa: ['0.350'] [Step 117 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31019, 31020] → Tgt Spa: ['0.350', '0.350'] [Step 117 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [37739] → Tgt Spa: ['1.000'] [Step 117 / Rank 5] Tasks: ['Code'] | Lens: [54386] → Tgt Spa: ['1.000'] [Step 117 / Rank 1] Tasks: ['Single QA'] | Lens: [56714] → Tgt Spa: ['0.350'] [Step 117 / Rank 4] Tasks: ['Code'] | Lens: [54386] → Tgt Spa: ['1.000'] [Step 117 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [37739] → Tgt Spa: ['1.000'] [Step 117 / Rank 6] Tasks: ['Code'] | Lens: [38240] → Tgt Spa: ['1.000'] [Step 117 / Rank 7] Tasks: ['Code'] | Lens: [38240] → Tgt Spa: ['1.000'] [Step 117 / Rank 0] Tasks: ['Single QA'] | Lens: [56714] → Tgt Spa: ['0.350'] [Step 117 / Rank 1] Tasks: ['Single QA'] | Lens: [45672] → Tgt Spa: ['0.350'] [Step 117 / Rank 4] Tasks: ['Single QA'] | Lens: [65036] → Tgt Spa: ['0.350'] [Step 117 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29186, 29186] → Tgt Spa: ['0.350', '0.350'] [Step 117 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [20432, 20433, 20433] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 117 / Rank 5] Tasks: ['Single QA'] | Lens: [65036] → Tgt Spa: ['0.350'] [Step 117 / Rank 0] Tasks: ['Single QA'] | Lens: [45672] → Tgt Spa: ['0.350'] [Step 117 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29186, 29186] → Tgt Spa: ['0.350', '0.350'] [Step 117 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [20432, 20433, 20433] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 117 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40953] → Tgt Spa: ['1.000'] [Step 117 / Rank 6] Tasks: ['Summarization', 'In-Context 
Learning'] | Lens: [26753, 26736] → Tgt Spa: ['1.000', '1.000'] [Step 117 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [45517] → Tgt Spa: ['1.000'] [Step 117 / Rank 2] Tasks: ['Code'] | Lens: [47328] → Tgt Spa: ['1.000'] [Step 117 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40953] → Tgt Spa: ['1.000'] [Step 117 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26753, 26736] → Tgt Spa: ['1.000', '1.000'] [Step 117 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [45517] → Tgt Spa: ['1.000'] [Step 117 / Rank 3] Tasks: ['Code'] | Lens: [47328] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:53:24,716 >> @ 117 | Loss: 1.9474 | LM: 1.8667 | Reg: 0.0807 | Spa(Avg): 0.498 [INFO|lh_trainer.py:797] 2026-02-16 23:53:24,717 >> Statistic -> Code | Spa: 0.561 | Tgt: 1.000 | Z-Loss: 0.124 | [INFO|lh_trainer.py:797] 2026-02-16 23:53:24,717 >> Statistic -> In-Context | Spa: 0.631 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:53:24,717 >> Statistic -> MultiHop | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:53:24,717 >> Statistic -> Single | Spa: 0.380 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:53:24,717 >> Statistic -> Summarization | Spa: 0.519 | Tgt: 1.000 | Z-Loss: 0.154 | [INFO|lh_trainer.py:810] 2026-02-16 23:53:24,719 >> [Micro-Log] {"loss": 1.9473785621424515, "lm_loss": 1.866712186485529, "reg_loss": 0.08066638132731896, "model_sparsity(avg)": 0.4977816417813301, "Spa-Single QA sparsity": 0.37962962687015533, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.019844742423932377, "Spa-Code sparsity": 0.5606060569936578, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1243328946557912, "Spa-In-Context Learning sparsity": 0.6309524008205959, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11723574783120837, "Spa-Summarization sparsity": 0.5185185074806213, "Spa-Summarization 
target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15413035452365875, "Spa-MultiHop QA sparsity": 0.4583333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03430512547492981, "step": 117, "current_tau": 1.1365023851394653, "lambda1 Single QA": 0.54296875, "lambda2 MultiHop QA": 0.275390625, "lambda3 Summarization": 0.1123046875, "lambda4 Code": 0.2109375} [INFO|lh_trainer.py:331] 2026-02-16 23:53:40,534 >> {'loss': 11.6843, 'grad_norm': 0.9875852465629578, 'learning_rate': 0.0004335806340008587, 'epoch': 0.12427593470247499, 'num_input_tokens_seen': 289841892, 'completed': '39.33% (118 / 300)', 'remaining time': '8:31:53', 'throughput': '6847.70', 'gpu_mem_free': '12337MB', 'step': 118} [Step 118 / Rank 4] Tasks: ['Code', 'In-Context Learning', 'Code'] | Lens: [20336, 20331, 20338] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 118 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20279, 20294, 20285] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 118 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20279, 20294, 20285] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 118 / Rank 6] Tasks: ['Single QA'] | Lens: [52403] → Tgt Spa: ['0.350'] [Step 118 / Rank 7] Tasks: ['Single QA'] | Lens: [52403] → Tgt Spa: ['0.350'] [Step 118 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26093, 26093] → Tgt Spa: ['0.350', '1.000'] [Step 118 / Rank 5] Tasks: ['Code', 'In-Context Learning', 'Code'] | Lens: [20336, 20331, 20338] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 118 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26093, 26093] → Tgt Spa: ['0.350', '1.000'] [Step 118 / Rank 3] Tasks: ['Single QA'] | Lens: [51543] → Tgt Spa: ['0.350'] [Step 118 / Rank 5] Tasks: ['Single QA'] | Lens: [40597] → Tgt Spa: ['0.350'] [Step 118 / Rank 0] Tasks: ['Code'] | Lens: [36899] → Tgt Spa: ['1.000'] [Step 118 / Rank 6] Tasks: ['Single QA'] | Lens: [37111] → Tgt Spa: ['0.350'] [Step 118 / Rank 7] Tasks: 
['Single QA'] | Lens: [37111] → Tgt Spa: ['0.350'] [Step 118 / Rank 1] Tasks: ['Code'] | Lens: [36899] → Tgt Spa: ['1.000'] [Step 118 / Rank 2] Tasks: ['Single QA'] | Lens: [51543] → Tgt Spa: ['0.350'] [Step 118 / Rank 4] Tasks: ['Single QA'] | Lens: [40597] → Tgt Spa: ['0.350'] [Step 118 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24690, 24691] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 6] Tasks: ['Single QA'] | Lens: [34903] → Tgt Spa: ['0.350'] [Step 118 / Rank 7] Tasks: ['Single QA'] | Lens: [34903] → Tgt Spa: ['0.350'] [Step 118 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28045, 28069] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [29371, 29381] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28045, 28069] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24690, 24691] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [29371, 29381] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 5] Tasks: ['Code'] | Lens: [44464] → Tgt Spa: ['1.000'] [Step 118 / Rank 0] Tasks: ['Code'] | Lens: [37065] → Tgt Spa: ['1.000'] [Step 118 / Rank 3] Tasks: ['Code'] | Lens: [54040] → Tgt Spa: ['1.000'] [Step 118 / Rank 7] Tasks: ['Single QA'] | Lens: [64290] → Tgt Spa: ['0.350'] [Step 118 / Rank 1] Tasks: ['Code'] | Lens: [37065] → Tgt Spa: ['1.000'] [Step 118 / Rank 2] Tasks: ['Code'] | Lens: [54040] → Tgt Spa: ['1.000'] [Step 118 / Rank 4] Tasks: ['Code'] | Lens: [44464] → Tgt Spa: ['1.000'] [Step 118 / Rank 6] Tasks: ['Single QA'] | Lens: [64290] → Tgt Spa: ['0.350'] [Step 118 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [26915, 26924] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [26915, 26924] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 4] Tasks: 
['Code'] | Lens: [43464] → Tgt Spa: ['1.000'] [Step 118 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24022, 24022] → Tgt Spa: ['0.350', '1.000'] [Step 118 / Rank 2] Tasks: ['Single QA'] | Lens: [52619] → Tgt Spa: ['0.350'] [Step 118 / Rank 3] Tasks: ['Single QA'] | Lens: [52619] → Tgt Spa: ['0.350'] [Step 118 / Rank 5] Tasks: ['Code'] | Lens: [43464] → Tgt Spa: ['1.000'] [Step 118 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24022, 24022] → Tgt Spa: ['0.350', '1.000'] [Step 118 / Rank 7] Tasks: ['Single QA'] | Lens: [52959] → Tgt Spa: ['0.350'] [Step 118 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [28368, 28378] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59840] → Tgt Spa: ['1.000'] [Step 118 / Rank 6] Tasks: ['Single QA'] | Lens: [52959] → Tgt Spa: ['0.350'] [Step 118 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59840] → Tgt Spa: ['1.000'] [Step 118 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [28368, 28378] → Tgt Spa: ['1.000', '1.000'] [Step 118 / Rank 3] Tasks: ['Single QA'] | Lens: [54514] → Tgt Spa: ['0.350'] [Step 118 / Rank 2] Tasks: ['Single QA'] | Lens: [54514] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-16 23:55:56,734 >> @ 118 | Loss: 1.9159 | LM: 1.8306 | Reg: 0.0853 | Spa(Avg): 0.521 [INFO|lh_trainer.py:797] 2026-02-16 23:55:56,735 >> Statistic -> Code | Spa: 0.589 | Tgt: 1.000 | Z-Loss: 0.115 | [INFO|lh_trainer.py:797] 2026-02-16 23:55:56,735 >> Statistic -> In-Context | Spa: 0.639 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:55:56,735 >> Statistic -> MultiHop | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:55:56,735 >> Statistic -> Single | Spa: 0.404 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:55:56,735 >> Statistic -> Summarization | Spa: 0.444 | Tgt: 1.000 | Z-Loss: 0.196 | [INFO|lh_trainer.py:810] 2026-02-16 23:55:56,737 >> [Micro-Log] 
{"loss": 1.9158958941698074, "lm_loss": 1.830623601252834, "reg_loss": 0.08527226835819117, "model_sparsity(avg)": 0.520640429109335, "Spa-Single QA sparsity": 0.4040403962135315, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.033328604556366125, "Spa-In-Context Learning sparsity": 0.6388888835906983, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11447143480181694, "Spa-Code sparsity": 0.5891203681627909, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11504524325331052, "Spa-Summarization sparsity": 0.4444444477558136, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.19588518142700195, "Spa-MultiHop QA sparsity": 0.4583333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03430512547492981, "step": 118, "current_tau": 1.1326321363449097, "lambda1 Single QA": 0.54296875, "lambda2 MultiHop QA": 0.275390625, "lambda3 Summarization": 0.11328125, "lambda4 Code": 0.2119140625} [INFO|lh_trainer.py:331] 2026-02-16 23:56:19,909 >> {'loss': 11.4954, 'grad_norm': 1.0161254405975342, 'learning_rate': 0.0004313435996187126, 'epoch': 0.1253291205897841, 'num_input_tokens_seen': 292269164, 'completed': '39.67% (119 / 300)', 'remaining time': '8:28:50', 'throughput': '7615.00', 'gpu_mem_free': '7643MB', 'step': 119} [Step 119 / Rank 5] Tasks: ['Single QA'] | Lens: [38295] → Tgt Spa: ['0.350'] [Step 119 / Rank 6] Tasks: ['Single QA'] | Lens: [55458] → Tgt Spa: ['0.350'] [Step 119 / Rank 0] Tasks: ['Code', 'Summarization'] | Lens: [23145, 23156] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 7] Tasks: ['Single QA'] | Lens: [55458] → Tgt Spa: ['0.350'] [Step 119 / Rank 1] Tasks: ['Code', 'Summarization'] | Lens: [23145, 23156] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 2] Tasks: ['Single QA'] | Lens: [38696] → Tgt Spa: ['0.350'] [Step 119 / Rank 4] Tasks: ['Single QA'] | Lens: [38295] → Tgt Spa: ['0.350'] [Step 119 / Rank 3] Tasks: 
['Single QA'] | Lens: [38696] → Tgt Spa: ['0.350'] [Step 119 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [23974, 23985] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 4] Tasks: ['Single QA'] | Lens: [55881] → Tgt Spa: ['0.350'] [Step 119 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [26122, 26115] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [25302, 25311] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 5] Tasks: ['Single QA'] | Lens: [55881] → Tgt Spa: ['0.350'] [Step 119 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [23974, 23985] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [25302, 25311] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [26122, 26115] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 7] Tasks: ['Code'] | Lens: [57667] → Tgt Spa: ['1.000'] [Step 119 / Rank 5] Tasks: ['Single QA'] | Lens: [42284] → Tgt Spa: ['0.350'] [Step 119 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26801, 26803] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 4] Tasks: ['Single QA'] | Lens: [42284] → Tgt Spa: ['0.350'] [Step 119 / Rank 0] Tasks: ['Single QA'] | Lens: [36922] → Tgt Spa: ['0.350'] [Step 119 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26801, 26803] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 6] Tasks: ['Code'] | Lens: [57667] → Tgt Spa: ['1.000'] [Step 119 / Rank 1] Tasks: ['Single QA'] | Lens: [36922] → Tgt Spa: ['0.350'] [Step 119 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32371, 32371] → Tgt Spa: ['0.350', '0.350'] [Step 119 / Rank 3] Tasks: ['Single QA'] | Lens: [36927] → Tgt Spa: ['0.350'] [Step 119 / Rank 2] Tasks: ['Single QA'] | Lens: [36927] → Tgt Spa: ['0.350'] [Step 119 / Rank 0] Tasks: ['Single QA'] | Lens: [43133] → Tgt Spa: ['0.350'] [Step 119 / Rank 1] Tasks: ['Single QA'] | Lens: [43133] → Tgt Spa: 
['0.350'] [Step 119 / Rank 7] Tasks: ['Summarization', 'Summarization'] | Lens: [28380, 28381] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32371, 32371] → Tgt Spa: ['0.350', '0.350'] [Step 119 / Rank 6] Tasks: ['Summarization', 'Summarization'] | Lens: [28380, 28381] → Tgt Spa: ['1.000', '1.000'] [Step 119 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43330] → Tgt Spa: ['1.000'] [Step 119 / Rank 1] Tasks: ['Single QA'] | Lens: [35025] → Tgt Spa: ['0.350'] [Step 119 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [51164] → Tgt Spa: ['1.000'] [Step 119 / Rank 6] Tasks: ['Single QA'] | Lens: [58718] → Tgt Spa: ['0.350'] [Step 119 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [51164] → Tgt Spa: ['1.000'] [Step 119 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43330] → Tgt Spa: ['1.000'] [Step 119 / Rank 7] Tasks: ['Single QA'] | Lens: [58718] → Tgt Spa: ['0.350'] [Step 119 / Rank 0] Tasks: ['Single QA'] | Lens: [35025] → Tgt Spa: ['0.350'] [Step 119 / Rank 5] Tasks: ['Code'] | Lens: [33217] → Tgt Spa: ['1.000'] [Step 119 / Rank 4] Tasks: ['Code'] | Lens: [33217] → Tgt Spa: ['1.000'] [Step 119 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23860, 23860] → Tgt Spa: ['0.350', '1.000'] [Step 119 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [38135] → Tgt Spa: ['1.000'] [Step 119 / Rank 2] Tasks: ['Single QA'] | Lens: [57246] → Tgt Spa: ['0.350'] [Step 119 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [38135] → Tgt Spa: ['1.000'] [Step 119 / Rank 3] Tasks: ['Single QA'] | Lens: [57246] → Tgt Spa: ['0.350'] [Step 119 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23860, 23860] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-16 23:58:43,278 >> @ 119 | Loss: 2.2557 | LM: 2.1844 | Reg: 0.0713 | Spa(Avg): 0.493 [INFO|lh_trainer.py:797] 2026-02-16 23:58:43,278 >> Statistic -> Code | Spa: 0.595 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:797] 2026-02-16 
23:58:43,279 >> Statistic -> In-Context | Spa: 0.622 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:58:43,279 >> Statistic -> MultiHop | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:58:43,279 >> Statistic -> Single | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-16 23:58:43,279 >> Statistic -> Summarization | Spa: 0.528 | Tgt: 1.000 | Z-Loss: 0.153 | [INFO|lh_trainer.py:810] 2026-02-16 23:58:43,280 >> [Micro-Log] {"loss": 2.255723132441441, "lm_loss": 2.184380249120295, "reg_loss": 0.0713428524856378, "model_sparsity(avg)": 0.49276619777083397, "Spa-Code sparsity": 0.594907412926356, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1134020301202933, "Spa-Summarization sparsity": 0.5277777711550394, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1529129147529602, "Spa-In-Context Learning sparsity": 0.6219135853979323, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12121532526281145, "Spa-Single QA sparsity": 0.39285713008471895, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.025033347865766182, "Spa-MultiHop QA sparsity": 0.4583333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03430512547492981, "step": 119, "current_tau": 1.1287975311279297, "lambda1 Single QA": 0.54296875, "lambda2 MultiHop QA": 0.275390625, "lambda3 Summarization": 0.11376953125, "lambda4 Code": 0.212890625} [INFO|lh_trainer.py:331] 2026-02-16 23:59:05,232 >> {'loss': 13.5343, 'grad_norm': 0.7848489284515381, 'learning_rate': 0.00042907549294861504, 'epoch': 0.1263823064770932, 'num_input_tokens_seen': 294553234, 'completed': '40.00% (120 / 300)', 'remaining time': '8:25:56', 'throughput': '6907.90', 'gpu_mem_free': '13617MB', 'step': 120} [Step 120 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [25790, 25784] → Tgt Spa: ['1.000', '1.000'] [Step 120 / 
Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32540, 32543] → Tgt Spa: ['0.350', '0.350'] [Step 120 / Rank 1] Tasks: ['Single QA'] | Lens: [38651] → Tgt Spa: ['0.350'] [Step 120 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [25790, 25784] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 0] Tasks: ['Single QA'] | Lens: [38651] → Tgt Spa: ['0.350'] [Step 120 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [26243, 26237] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32540, 32543] → Tgt Spa: ['0.350', '0.350'] [Step 120 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [26243, 26237] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42440] → Tgt Spa: ['1.000'] [Step 120 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12753, 12753, 12759, 12760, 12761] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 120 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [28207, 28196] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42440] → Tgt Spa: ['1.000'] [Step 120 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12753, 12753, 12759, 12760, 12761] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 120 / Rank 4] Tasks: ['Single QA'] | Lens: [41532] → Tgt Spa: ['0.350'] [Step 120 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [28207, 28196] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 5] Tasks: ['Single QA'] | Lens: [41532] → Tgt Spa: ['0.350'] [Step 120 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28010, 28030] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28010, 28030] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24746, 24764] → Tgt Spa: ['1.000', '1.000'] [Step 120 
/ Rank 7] Tasks: ['Single QA'] | Lens: [45919] → Tgt Spa: ['0.350'] [Step 120 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43019] → Tgt Spa: ['1.000'] [Step 120 / Rank 6] Tasks: ['Single QA'] | Lens: [45919] → Tgt Spa: ['0.350'] [Step 120 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24746, 24764] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43019] → Tgt Spa: ['1.000'] [Step 120 / Rank 5] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1975, 1992, 1993, 1977, 1981, 1981, 1999, 1982, 2000, 1981, 2002, 2003, 2001, 2001, 1984, 2002, 1991, 2003, 2003, 1985, 1985, 1986, 1989, 2006, 2009, 2006, 2008, 2009, 1991, 1993, 1992, 2012] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 120 / Rank 6] Tasks: ['Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2907, 2907, 2910, 2910, 2907, 2910, 2909, 2909, 2908, 2915, 2914, 2910, 2909, 2910, 2927, 2909, 2910, 2910, 2929, 2912, 2913, 2913] → Tgt Spa: ['0.350', '0.350', 
'0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 120 / Rank 3] Tasks: ['Single QA'] | Lens: [51532] → Tgt Spa: ['0.350'] [Step 120 / Rank 2] Tasks: ['Single QA'] | Lens: [51532] → Tgt Spa: ['0.350'] [Step 120 / Rank 4] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1975, 1992, 1993, 1977, 1981, 1981, 1999, 1982, 2000, 1981, 2002, 2003, 2001, 2001, 1984, 2002, 1991, 2003, 2003, 1985, 1985, 1986, 1989, 2006, 2009, 2006, 2008, 2009, 1991, 1993, 1992, 2012] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 120 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [31011, 31012] → Tgt Spa: ['0.350', '0.350'] [Step 120 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [31011, 31012] → Tgt Spa: ['0.350', '0.350'] [Step 120 / Rank 7] Tasks: ['Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2907, 2907, 2910, 2910, 2907, 2910, 2909, 2909, 2908, 
2915, 2914, 2910, 2909, 2910, 2927, 2909, 2910, 2910, 2929, 2912, 2913, 2913] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 120 / Rank 4] Tasks: ['Single QA'] | Lens: [52305] → Tgt Spa: ['0.350'] [Step 120 / Rank 1] Tasks: ['Single QA'] | Lens: [60546] → Tgt Spa: ['0.350'] [Step 120 / Rank 5] Tasks: ['Single QA'] | Lens: [52305] → Tgt Spa: ['0.350'] [Step 120 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28044, 28043] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 0] Tasks: ['Single QA'] | Lens: [60546] → Tgt Spa: ['0.350'] [Step 120 / Rank 2] Tasks: ['Code'] | Lens: [47405] → Tgt Spa: ['1.000'] [Step 120 / Rank 3] Tasks: ['Code'] | Lens: [47405] → Tgt Spa: ['1.000'] [Step 120 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28044, 28043] → Tgt Spa: ['1.000', '1.000'] [Step 120 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [35827] → Tgt Spa: ['1.000'] [Step 120 / Rank 6] Tasks: ['Code'] | Lens: [63379] → Tgt Spa: ['1.000'] [Step 120 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [37647] → Tgt Spa: ['1.000'] [Step 120 / Rank 7] Tasks: ['Code'] | Lens: [63379] → Tgt Spa: ['1.000'] [Step 120 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [35827] → Tgt Spa: ['1.000'] [Step 120 / Rank 1] Tasks: ['Single QA'] | Lens: [45875] → Tgt Spa: ['0.350'] [Step 120 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [37647] → Tgt Spa: ['1.000'] [Step 120 / Rank 0] Tasks: ['Single QA'] | Lens: [45875] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:01:12,479 >> @ 120 | Loss: 2.0921 | LM: 2.0131 | Reg: 0.0790 | Spa(Avg): 0.528 [INFO|lh_trainer.py:797] 2026-02-17 00:01:12,479 >> Statistic -> Code | Spa: 0.590 | Tgt: 1.000 | Z-Loss: 0.116 | [INFO|lh_trainer.py:797] 2026-02-17 00:01:12,479 >> Statistic -> In-Context | Spa: 0.640 | Tgt: 0.000 | Z-Loss: 
0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:01:12,479 >> Statistic -> MultiHop | Spa: 0.606 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:01:12,479 >> Statistic -> Single | Spa: 0.452 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:01:12,479 >> Statistic -> Summarization | Spa: 0.610 | Tgt: 1.000 | Z-Loss: 0.111 | [INFO|lh_trainer.py:810] 2026-02-17 00:01:12,482 >> [Micro-Log] {"loss": 2.0920683400084576, "lm_loss": 2.0130952366938195, "reg_loss": 0.07897312201869984, "model_sparsity(avg)": 0.5278382822871208, "Spa-Single QA sparsity": 0.4523809523809524, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.061078017621877645, "Spa-In-Context Learning sparsity": 0.6401515061205084, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11496324024417183, "Spa-Summarization sparsity": 0.6098484857515856, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11091991954229095, "Spa-Code sparsity": 0.5902777835726738, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11577577702701092, "Spa-MultiHop QA sparsity": 0.6063034144731668, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09628847499306385, "step": 120, "current_tau": 1.125, "lambda1 Single QA": 0.54296875, "lambda2 MultiHop QA": 0.27734375, "lambda3 Summarization": 0.11474609375, "lambda4 Code": 0.2138671875} [INFO|lh_trainer.py:331] 2026-02-17 00:01:37,617 >> {'loss': 12.5524, 'grad_norm': 0.8620059490203857, 'learning_rate': 0.0004267767026189673, 'epoch': 0.12743549236440233, 'num_input_tokens_seen': 297047120, 'completed': '40.33% (121 / 300)', 'remaining time': '8:22:43', 'throughput': '8182.80', 'gpu_mem_free': '11691MB', 'step': 121} [Step 121 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [65479] → Tgt Spa: ['1.000'] [Step 121 / Rank 1] Tasks: ['Single QA'] | Lens: [41023] → Tgt Spa: ['0.350'] [Step 121 / Rank 3] Tasks: 
['Summarization', 'In-Context Learning'] | Lens: [24334, 24315] → Tgt Spa: ['1.000', '1.000'] [Step 121 / Rank 5] Tasks: ['Code'] | Lens: [63368] → Tgt Spa: ['1.000'] [Step 121 / Rank 0] Tasks: ['Single QA'] | Lens: [41023] → Tgt Spa: ['0.350'] [Step 121 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [65479] → Tgt Spa: ['1.000'] [Step 121 / Rank 4] Tasks: ['Code'] | Lens: [63368] → Tgt Spa: ['1.000'] [Step 121 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24334, 24315] → Tgt Spa: ['1.000', '1.000'] [Step 121 / Rank 1] Tasks: ['Code'] | Lens: [43289] → Tgt Spa: ['1.000'] [Step 121 / Rank 7] Tasks: ['Single QA'] | Lens: [64526] → Tgt Spa: ['0.350'] [Step 121 / Rank 2] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code', 'Code', 'Single QA'] | Lens: [9344, 9345, 9342, 9350, 9352, 9354, 9347] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 121 / Rank 0] Tasks: ['Code'] | Lens: [43289] → Tgt Spa: ['1.000'] [Step 121 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [22842, 22852] → Tgt Spa: ['1.000', '1.000'] [Step 121 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [22842, 22852] → Tgt Spa: ['1.000', '1.000'] [Step 121 / Rank 6] Tasks: ['Single QA'] | Lens: [64526] → Tgt Spa: ['0.350'] [Step 121 / Rank 3] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code', 'Code', 'Single QA'] | Lens: [9344, 9345, 9342, 9350, 9352, 9354, 9347] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 121 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [26078, 26085] → Tgt Spa: ['1.000', '1.000'] [Step 121 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [9641, 9644, 9655, 9656, 9652, 9664] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 121 / Rank 5] Tasks: ['Code'] | Lens: [58066] → Tgt Spa: ['1.000'] [Step 121 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [9641, 9644, 9655, 
9656, 9652, 9664] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 121 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [23769, 23769] → Tgt Spa: ['0.350', '0.350'] [Step 121 / Rank 4] Tasks: ['Code'] | Lens: [58066] → Tgt Spa: ['1.000'] [Step 121 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [23769, 23769] → Tgt Spa: ['0.350', '0.350'] [Step 121 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [26078, 26085] → Tgt Spa: ['1.000', '1.000'] [Step 121 / Rank 7] Tasks: ['Single QA'] | Lens: [51853] → Tgt Spa: ['0.350'] [Step 121 / Rank 3] Tasks: ['Code'] | Lens: [34301] → Tgt Spa: ['1.000'] [Step 121 / Rank 4] Tasks: ['Single QA'] | Lens: [48469] → Tgt Spa: ['0.350'] [Step 121 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17117, 17130, 17121] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 121 / Rank 5] Tasks: ['Single QA'] | Lens: [48469] → Tgt Spa: ['0.350'] [Step 121 / Rank 2] Tasks: ['Code'] | Lens: [34301] → Tgt Spa: ['1.000'] [Step 121 / Rank 0] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17117, 17130, 17121] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 121 / Rank 6] Tasks: ['Single QA'] | Lens: [51853] → Tgt Spa: ['0.350'] [Step 121 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [64509] → Tgt Spa: ['1.000'] [Step 121 / Rank 5] Tasks: ['Single QA'] | Lens: [62865] → Tgt Spa: ['0.350'] [Step 121 / Rank 7] Tasks: ['Single QA'] | Lens: [63020] → Tgt Spa: ['0.350'] [Step 121 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [23902, 23891] → Tgt Spa: ['1.000', '1.000'] [Step 121 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [64509] → Tgt Spa: ['1.000'] [Step 121 / Rank 6] Tasks: ['Single QA'] | Lens: [63020] → Tgt Spa: ['0.350'] [Step 121 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [23902, 23891] → Tgt Spa: ['1.000', '1.000'] [Step 121 / Rank 4] Tasks: ['Single QA'] | Lens: [62865] → Tgt Spa: ['0.350'] [Step 121 / Rank 5] Tasks: ['Single QA'] | Lens: [65304] → Tgt Spa: ['0.350'] [Step 121 / Rank 7] 
Tasks: ['Single QA'] | Lens: [33357] → Tgt Spa: ['0.350'] [Step 121 / Rank 6] Tasks: ['Single QA'] | Lens: [33357] → Tgt Spa: ['0.350'] [Step 121 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18359, 18372, 18361] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 121 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [54999] → Tgt Spa: ['1.000'] [Step 121 / Rank 4] Tasks: ['Single QA'] | Lens: [65304] → Tgt Spa: ['0.350'] [Step 121 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [54999] → Tgt Spa: ['1.000'] [Step 121 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18359, 18372, 18361] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 00:04:22,088 >> @ 121 | Loss: 1.8221 | LM: 1.7413 | Reg: 0.0808 | Spa(Avg): 0.529 [INFO|lh_trainer.py:797] 2026-02-17 00:04:22,088 >> Statistic -> Code | Spa: 0.604 | Tgt: 1.000 | Z-Loss: 0.111 | [INFO|lh_trainer.py:797] 2026-02-17 00:04:22,088 >> Statistic -> In-Context | Spa: 0.648 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:04:22,088 >> Statistic -> MultiHop | Spa: 0.606 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:04:22,088 >> Statistic -> Single | Spa: 0.434 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:04:22,089 >> Statistic -> Summarization | Spa: 0.594 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:810] 2026-02-17 00:04:22,090 >> [Micro-Log] {"loss": 1.8220760896801949, "lm_loss": 1.7412586448093255, "reg_loss": 0.08081745317516227, "model_sparsity(avg)": 0.5288387437661489, "Spa-Single QA sparsity": 0.43425925572713214, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05244057650367419, "Spa-Code sparsity": 0.603801175167686, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11106490735944949, "Spa-Summarization sparsity": 0.5937500149011612, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11919119581580162, "Spa-In-Context Learning sparsity": 
0.6481481591860453, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11214644213517506, "Spa-MultiHop QA sparsity": 0.6063034144731668, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09628847499306385, "step": 121, "current_tau": 1.121240496635437, "lambda1 Single QA": 0.54296875, "lambda2 MultiHop QA": 0.27734375, "lambda3 Summarization": 0.115234375, "lambda4 Code": 0.2138671875} [INFO|lh_trainer.py:331] 2026-02-17 00:04:50,097 >> {'loss': 10.9325, 'grad_norm': 0.9458215832710266, 'learning_rate': 0.00042444762251565854, 'epoch': 0.12848867825171142, 'num_input_tokens_seen': 299619262, 'completed': '40.67% (122 / 300)', 'remaining time': '8:20:30', 'throughput': '6681.59', 'gpu_mem_free': '7835MB', 'step': 122} [Step 122 / Rank 7] Tasks: ['Single QA'] | Lens: [47753] → Tgt Spa: ['0.350'] [Step 122 / Rank 6] Tasks: ['Single QA'] | Lens: [47753] → Tgt Spa: ['0.350'] [Step 122 / Rank 3] Tasks: ['Single QA'] | Lens: [54962] → Tgt Spa: ['0.350'] [Step 122 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22977, 22997] → Tgt Spa: ['1.000', '1.000'] [Step 122 / Rank 5] Tasks: ['Single QA'] | Lens: [55759] → Tgt Spa: ['0.350'] [Step 122 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22977, 22997] → Tgt Spa: ['1.000', '1.000'] [Step 122 / Rank 4] Tasks: ['Single QA'] | Lens: [55759] → Tgt Spa: ['0.350'] [Step 122 / Rank 2] Tasks: ['Single QA'] | Lens: [54962] → Tgt Spa: ['0.350'] [Step 122 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [24123, 24125] → Tgt Spa: ['1.000', '1.000'] [Step 122 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61802] → Tgt Spa: ['1.000'] [Step 122 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18001, 18002, 17992] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 122 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [24123, 24125] → Tgt Spa: ['1.000', '1.000'] [Step 122 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 
'Code', 'Summarization', 'Single QA', 'Single QA'] | Lens: [8829, 8834, 8835, 8845, 8860, 8847, 8851] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 122 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61802] → Tgt Spa: ['1.000'] [Step 122 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code', 'Summarization', 'Single QA', 'Single QA'] | Lens: [8829, 8834, 8835, 8845, 8860, 8847, 8851] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 122 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18001, 18002, 17992] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 122 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40087] → Tgt Spa: ['1.000'] [Step 122 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40087] → Tgt Spa: ['1.000'] [Step 122 / Rank 1] Tasks: ['Single QA'] | Lens: [49573] → Tgt Spa: ['0.350'] [Step 122 / Rank 3] Tasks: ['Single QA'] | Lens: [54022] → Tgt Spa: ['0.350'] [Step 122 / Rank 2] Tasks: ['Single QA'] | Lens: [54022] → Tgt Spa: ['0.350'] [Step 122 / Rank 6] Tasks: ['Single QA'] | Lens: [32870] → Tgt Spa: ['0.350'] [Step 122 / Rank 7] Tasks: ['Single QA'] | Lens: [32870] → Tgt Spa: ['0.350'] [Step 122 / Rank 0] Tasks: ['Single QA'] | Lens: [49573] → Tgt Spa: ['0.350'] [Step 122 / Rank 4] Tasks: ['Code'] | Lens: [60026] → Tgt Spa: ['1.000'] [Step 122 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [30109, 30123] → Tgt Spa: ['1.000', '1.000'] [Step 122 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [47137] → Tgt Spa: ['1.000'] [Step 122 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [47137] → Tgt Spa: ['1.000'] [Step 122 / Rank 3] Tasks: ['Single QA'] | Lens: [64032] → Tgt Spa: ['0.350'] [Step 122 / Rank 5] Tasks: ['Code'] | Lens: [60026] → Tgt Spa: ['1.000'] [Step 122 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [30109, 30123] → Tgt Spa: ['1.000', '1.000'] [Step 122 / Rank 2] Tasks: ['Single QA'] | Lens: [64032] → Tgt Spa: ['0.350'] [Step 122 / Rank 6] Tasks: ['Single 
QA'] | Lens: [49978] → Tgt Spa: ['0.350'] [Step 122 / Rank 3] Tasks: ['Single QA'] | Lens: [65046] → Tgt Spa: ['0.350'] [Step 122 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [30212, 30213] → Tgt Spa: ['0.350', '0.350'] [Step 122 / Rank 0] Tasks: ['Code'] | Lens: [51338] → Tgt Spa: ['1.000'] [Step 122 / Rank 2] Tasks: ['Single QA'] | Lens: [65046] → Tgt Spa: ['0.350'] [Step 122 / Rank 1] Tasks: ['Code'] | Lens: [51338] → Tgt Spa: ['1.000'] [Step 122 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [30212, 30213] → Tgt Spa: ['0.350', '0.350'] [Step 122 / Rank 7] Tasks: ['Single QA'] | Lens: [49978] → Tgt Spa: ['0.350'] [Step 122 / Rank 5] Tasks: ['MultiHop QA', 'Single QA'] | Lens: [31032, 31033] → Tgt Spa: ['0.350', '0.350'] [Step 122 / Rank 7] Tasks: ['Single QA'] | Lens: [51017] → Tgt Spa: ['0.350'] [Step 122 / Rank 4] Tasks: ['MultiHop QA', 'Single QA'] | Lens: [31032, 31033] → Tgt Spa: ['0.350', '0.350'] [Step 122 / Rank 6] Tasks: ['Single QA'] | Lens: [51017] → Tgt Spa: ['0.350'] [Step 122 / Rank 0] Tasks: ['Single QA'] | Lens: [50126] → Tgt Spa: ['0.350'] [Step 122 / Rank 2] Tasks: ['Single QA'] | Lens: [58743] → Tgt Spa: ['0.350'] [Step 122 / Rank 1] Tasks: ['Single QA'] | Lens: [50126] → Tgt Spa: ['0.350'] [Step 122 / Rank 3] Tasks: ['Single QA'] | Lens: [58743] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:07:33,743 >> @ 122 | Loss: 2.0201 | LM: 1.9487 | Reg: 0.0714 | Spa(Avg): 0.498 [INFO|lh_trainer.py:797] 2026-02-17 00:07:33,743 >> Statistic -> Code | Spa: 0.615 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:797] 2026-02-17 00:07:33,743 >> Statistic -> In-Context | Spa: 0.625 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:07:33,744 >> Statistic -> MultiHop | Spa: 0.514 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:07:33,744 >> Statistic -> Single | Spa: 0.443 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:07:33,744 >> Statistic -> Summarization | Spa: 0.580 
| Tgt: 1.000 | Z-Loss: 0.125 | [INFO|lh_trainer.py:810] 2026-02-17 00:07:33,746 >> [Micro-Log] {"loss": 2.020054101323088, "lm_loss": 1.9486780762672424, "reg_loss": 0.07137601917687182, "model_sparsity(avg)": 0.4976438470184803, "Spa-In-Context Learning sparsity": 0.625, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1211100909858942, "Spa-Summarization sparsity": 0.5798610895872116, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1253235526382923, "Spa-Code sparsity": 0.614583320915699, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10780970565974712, "Spa-Single QA sparsity": 0.44305555522441864, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05632978286594152, "Spa-MultiHop QA sparsity": 0.5138888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.05557688698172569, "step": 122, "current_tau": 1.1175202131271362, "lambda1 Single QA": 0.546875, "lambda2 MultiHop QA": 0.27734375, "lambda3 Summarization": 0.1162109375, "lambda4 Code": 0.21484375} [INFO|lh_trainer.py:331] 2026-02-17 00:07:56,652 >> {'loss': 12.1203, 'grad_norm': 0.6834012866020203, 'learning_rate': 0.0004220886517145741, 'epoch': 0.12954186413902052, 'num_input_tokens_seen': 302193484, 'completed': '41.00% (123 / 300)', 'remaining time': '8:18:07', 'throughput': '6899.37', 'gpu_mem_free': '10513MB', 'step': 123} [Step 123 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57900] → Tgt Spa: ['1.000'] [Step 123 / Rank 1] Tasks: ['Single QA'] | Lens: [42731] → Tgt Spa: ['0.350'] [Step 123 / Rank 7] Tasks: ['Single QA'] | Lens: [47527] → Tgt Spa: ['0.350'] [Step 123 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57900] → Tgt Spa: ['1.000'] [Step 123 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45055] → Tgt Spa: ['1.000'] [Step 123 / Rank 6] Tasks: ['Single QA'] | Lens: [47527] → Tgt Spa: ['0.350'] [Step 123 / Rank 0] Tasks: ['Single QA'] | Lens: 
[42731] → Tgt Spa: ['0.350'] [Step 123 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45055] → Tgt Spa: ['1.000'] [Step 123 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [28644, 28658] → Tgt Spa: ['1.000', '1.000'] [Step 123 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [28644, 28658] → Tgt Spa: ['1.000', '1.000'] [Step 123 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23784, 23785] → Tgt Spa: ['1.000', '1.000'] [Step 123 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6705, 6700, 6702, 6701, 6710, 6703, 6711, 6708, 6711] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 123 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23784, 23785] → Tgt Spa: ['1.000', '1.000'] [Step 123 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6705, 6700, 6702, 6701, 6710, 6703, 6711, 6708, 6711] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 123 / Rank 3] Tasks: ['Code'] | Lens: [35343] → Tgt Spa: ['1.000'] [Step 123 / Rank 2] Tasks: ['Code'] | Lens: [35343] → Tgt Spa: ['1.000'] [Step 123 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26649, 26651] → Tgt Spa: ['1.000', '1.000'] [Step 123 / Rank 0] Tasks: ['Single QA'] | Lens: [59925] → Tgt Spa: ['0.350'] [Step 123 / Rank 1] Tasks: ['Single QA'] | Lens: [59925] → Tgt Spa: ['0.350'] [Step 123 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26649, 26651] → Tgt Spa: ['1.000', '1.000'] [Step 123 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [20366, 20365, 20375] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 123 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [20366, 20365, 
20375] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 123 / Rank 4] Tasks: ['Single QA', 'Code'] | Lens: [32461, 32475] → Tgt Spa: ['0.350', '1.000'] [Step 123 / Rank 5] Tasks: ['Single QA', 'Code'] | Lens: [32461, 32475] → Tgt Spa: ['0.350', '1.000'] [Step 123 / Rank 5] Tasks: ['Single QA'] | Lens: [32964] → Tgt Spa: ['0.350'] [Step 123 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [27424, 27432] → Tgt Spa: ['0.350', '1.000'] [Step 123 / Rank 7] Tasks: ['Single QA'] | Lens: [51692] → Tgt Spa: ['0.350'] [Step 123 / Rank 6] Tasks: ['Single QA'] | Lens: [51692] → Tgt Spa: ['0.350'] [Step 123 / Rank 4] Tasks: ['Single QA'] | Lens: [32964] → Tgt Spa: ['0.350'] [Step 123 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [27424, 27432] → Tgt Spa: ['0.350', '1.000'] [Step 123 / Rank 0] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19250, 19260, 19249] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 123 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19250, 19260, 19249] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 123 / Rank 7] Tasks: ['Single QA'] | Lens: [64819] → Tgt Spa: ['0.350'] [Step 123 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57981] → Tgt Spa: ['1.000'] [Step 123 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [20302, 20306, 20307] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 123 / Rank 6] Tasks: ['Single QA'] | Lens: [64819] → Tgt Spa: ['0.350'] [Step 123 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [20302, 20306, 20307] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 123 / Rank 1] Tasks: ['Single QA'] | Lens: [50388] → Tgt Spa: ['0.350'] [Step 123 / Rank 0] Tasks: ['Single QA'] | Lens: [50388] → Tgt Spa: ['0.350'] [Step 123 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57981] → Tgt Spa: ['1.000'] [Step 123 / Rank 6] Tasks: ['Single QA'] | Lens: [61798] → Tgt Spa: ['0.350'] [Step 123 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [30954, 30954] → Tgt Spa: ['0.350', '0.350'] 
[Step 123 / Rank 7] Tasks: ['Single QA'] | Lens: [61798] → Tgt Spa: ['0.350'] [Step 123 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [30954, 30954] → Tgt Spa: ['0.350', '0.350'] [Step 123 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [50559] → Tgt Spa: ['1.000'] [Step 123 / Rank 4] Tasks: ['Single QA'] | Lens: [50396] → Tgt Spa: ['0.350'] [Step 123 / Rank 5] Tasks: ['Single QA'] | Lens: [50396] → Tgt Spa: ['0.350'] [Step 123 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [50559] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 00:10:23,852 >> @ 123 | Loss: 2.1979 | LM: 2.1175 | Reg: 0.0804 | Spa(Avg): 0.521 [INFO|lh_trainer.py:797] 2026-02-17 00:10:23,853 >> Statistic -> Code | Spa: 0.596 | Tgt: 1.000 | Z-Loss: 0.115 | [INFO|lh_trainer.py:797] 2026-02-17 00:10:23,853 >> Statistic -> In-Context | Spa: 0.622 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:10:23,853 >> Statistic -> MultiHop | Spa: 0.514 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:10:23,853 >> Statistic -> Single | Spa: 0.435 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:10:23,853 >> Statistic -> Summarization | Spa: 0.608 | Tgt: 1.000 | Z-Loss: 0.112 | [INFO|lh_trainer.py:810] 2026-02-17 00:10:23,855 >> [Micro-Log] {"loss": 2.1978749011953673, "lm_loss": 2.1174954616775117, "reg_loss": 0.0803794411670727, "model_sparsity(avg)": 0.5209619315961996, "Spa-Single QA sparsity": 0.4348958320915699, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04992726395721547, "Spa-In-Context Learning sparsity": 0.621527781089147, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.12246718071401119, "Spa-Code sparsity": 0.5959595875306563, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11505353315310045, "Spa-Summarization sparsity": 0.6083333253860473, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1124349743127823, 
"Spa-MultiHop QA sparsity": 0.5138888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.05557688698172569, "step": 123, "current_tau": 1.1138402223587036, "lambda1 Single QA": 0.546875, "lambda2 MultiHop QA": 0.27734375, "lambda3 Summarization": 0.11669921875, "lambda4 Code": 0.2158203125} [INFO|lh_trainer.py:331] 2026-02-17 00:10:48,609 >> {'loss': 13.1872, 'grad_norm': 0.8515812158584595, 'learning_rate': 0.0004197001944132168, 'epoch': 0.13059505002632965, 'num_input_tokens_seen': 304771644, 'completed': '41.33% (124 / 300)', 'remaining time': '8:15:22', 'throughput': '7496.53', 'gpu_mem_free': '6527MB', 'step': 124} [Step 124 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16537, 16526, 16527] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 124 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39244] → Tgt Spa: ['1.000'] [Step 124 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16537, 16526, 16527] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 124 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44119] → Tgt Spa: ['1.000'] [Step 124 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39244] → Tgt Spa: ['1.000'] [Step 124 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [28275, 28293] → Tgt Spa: ['0.350', '1.000'] [Step 124 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44119] → Tgt Spa: ['1.000'] [Step 124 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [28275, 28293] → Tgt Spa: ['0.350', '1.000'] [Step 124 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [29347, 29351] → Tgt Spa: ['1.000', '1.000'] [Step 124 / Rank 2] Tasks: ['Single QA'] | Lens: [42509] → Tgt Spa: ['0.350'] [Step 124 / Rank 6] Tasks: ['Single QA'] | Lens: [43593] → Tgt Spa: ['0.350'] [Step 124 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57311] → Tgt Spa: ['1.000'] [Step 124 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [29347, 29351] → Tgt Spa: ['1.000', '1.000'] [Step 124 / Rank 3] Tasks: ['Single QA'] | Lens: [42509] → 
Tgt Spa: ['0.350'] [Step 124 / Rank 7] Tasks: ['Single QA'] | Lens: [43593] → Tgt Spa: ['0.350'] [Step 124 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57311] → Tgt Spa: ['1.000'] [Step 124 / Rank 4] Tasks: ['Single QA'] | Lens: [50883] → Tgt Spa: ['0.350'] [Step 124 / Rank 5] Tasks: ['Single QA'] | Lens: [50883] → Tgt Spa: ['0.350'] [Step 124 / Rank 1] Tasks: ['Single QA'] | Lens: [35549] → Tgt Spa: ['0.350'] [Step 124 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [29877, 29895] → Tgt Spa: ['0.350', '1.000'] [Step 124 / Rank 0] Tasks: ['Single QA'] | Lens: [35549] → Tgt Spa: ['0.350'] [Step 124 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [29877, 29895] → Tgt Spa: ['0.350', '1.000'] [Step 124 / Rank 7] Tasks: ['Single QA'] | Lens: [35047] → Tgt Spa: ['0.350'] [Step 124 / Rank 6] Tasks: ['Single QA'] | Lens: [35047] → Tgt Spa: ['0.350'] [Step 124 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [22257, 22250] → Tgt Spa: ['1.000', '1.000'] [Step 124 / Rank 6] Tasks: ['Single QA'] | Lens: [53240] → Tgt Spa: ['0.350'] [Step 124 / Rank 5] Tasks: ['Single QA'] | Lens: [54855] → Tgt Spa: ['0.350'] [Step 124 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2291, 2291, 2292, 2293, 2293, 2292, 2294, 2311, 2293, 2295, 2295, 2312, 2294, 2296, 2313, 2296, 2296, 2296, 2314, 2314, 2299, 2315, 2316, 2298, 2299, 2300, 2300, 2302] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', 
'1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 124 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2291, 2291, 2292, 2293, 2293, 2292, 2294, 2311, 2293, 2295, 2295, 2312, 2294, 2296, 2313, 2296, 2296, 2296, 2314, 2314, 2299, 2315, 2316, 2298, 2299, 2300, 2300, 2302] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 124 / Rank 4] Tasks: ['Single QA'] | Lens: [54855] → Tgt Spa: ['0.350'] [Step 124 / Rank 7] Tasks: ['Single QA'] | Lens: [53240] → Tgt Spa: ['0.350'] [Step 124 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [22257, 22250] → Tgt Spa: ['1.000', '1.000'] [Step 124 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54328] → Tgt Spa: ['1.000'] [Step 124 / Rank 5] Tasks: ['Single QA'] | Lens: [43550] → Tgt Spa: ['0.350'] [Step 124 / Rank 0] Tasks: ['Summarization'] | Lens: [38077] → Tgt Spa: ['1.000'] [Step 124 / Rank 7] Tasks: ['MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'Single QA', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'MultiHop QA'] | Lens: [3804, 3803, 3804, 3803, 3803, 3805, 3822, 3823, 3811, 3805, 3824, 3808, 3807, 3808, 3808, 3815, 3813] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', 
'0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 124 / Rank 4] Tasks: ['Single QA'] | Lens: [43550] → Tgt Spa: ['0.350'] [Step 124 / Rank 1] Tasks: ['Summarization'] | Lens: [38077] → Tgt Spa: ['1.000'] [Step 124 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54328] → Tgt Spa: ['1.000'] [Step 124 / Rank 6] Tasks: ['MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'Single QA', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'MultiHop QA'] | Lens: [3804, 3803, 3804, 3803, 3803, 3805, 3822, 3823, 3811, 3805, 3824, 3808, 3807, 3808, 3808, 3815, 3813] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 124 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [24839, 24846] → Tgt Spa: ['1.000', '1.000'] [Step 124 / Rank 4] Tasks: ['Single QA'] | Lens: [51495] → Tgt Spa: ['0.350'] [Step 124 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [24839, 24846] → Tgt Spa: ['1.000', '1.000'] [Step 124 / Rank 6] Tasks: ['Single QA'] | Lens: [41330] → Tgt Spa: ['0.350'] [Step 124 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58447] → Tgt Spa: ['1.000'] [Step 124 / Rank 5] Tasks: ['Single QA'] | Lens: [51495] → Tgt Spa: ['0.350'] [Step 124 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58447] → Tgt Spa: ['1.000'] [Step 124 / Rank 7] Tasks: ['Single QA'] | Lens: [41330] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:13:00,513 >> @ 124 | Loss: 2.1986 | LM: 2.1278 | Reg: 0.0708 | Spa(Avg): 0.496 [INFO|lh_trainer.py:797] 2026-02-17 00:13:00,513 >> Statistic -> Code | Spa: 0.569 | Tgt: 1.000 | Z-Loss: 0.125 | [INFO|lh_trainer.py:797] 2026-02-17 00:13:00,513 >> Statistic -> In-Context | Spa: 0.638 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 
2026-02-17 00:13:00,513 >> Statistic -> MultiHop | Spa: 0.584 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:13:00,513 >> Statistic -> Single | Spa: 0.397 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:13:00,514 >> Statistic -> Summarization | Spa: 0.585 | Tgt: 1.000 | Z-Loss: 0.124 | [INFO|lh_trainer.py:810] 2026-02-17 00:13:00,515 >> [Micro-Log] {"loss": 2.1986173689365387, "lm_loss": 2.127813055490454, "reg_loss": 0.07080432369063298, "model_sparsity(avg)": 0.4959644762178262, "Spa-Summarization sparsity": 0.585317462682724, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12424271021570478, "Spa-Code sparsity": 0.569444440305233, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1250559501349926, "Spa-In-Context Learning sparsity": 0.637820514348837, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1167569189117505, "Spa-Single QA sparsity": 0.3972222169240316, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.030530068588753543, "Spa-MultiHop QA sparsity": 0.5844907412926356, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08670956703523795, "step": 124, "current_tau": 1.1102017164230347, "lambda1 Single QA": 0.546875, "lambda2 MultiHop QA": 0.279296875, "lambda3 Summarization": 0.1171875, "lambda4 Code": 0.216796875} [INFO|lh_trainer.py:331] 2026-02-17 00:13:22,839 >> {'loss': 13.1917, 'grad_norm': 0.7977530360221863, 'learning_rate': 0.00041728265986144944, 'epoch': 0.13164823591363875, 'num_input_tokens_seen': 307154770, 'completed': '41.67% (125 / 300)', 'remaining time': '8:12:13', 'throughput': '7725.88', 'gpu_mem_free': '7039MB', 'step': 125} [Step 125 / Rank 4] Tasks: ['Code', 'Single QA', 'Code', 'Summarization', 'Single QA'] | Lens: [12772, 12769, 12777, 12793, 12776] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350'] [Step 125 / Rank 7] Tasks: ['In-Context Learning', 
'In-Context Learning'] | Lens: [26055, 26056] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26055, 26056] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40730] → Tgt Spa: ['1.000'] [Step 125 / Rank 5] Tasks: ['Code', 'Single QA', 'Code', 'Summarization', 'Single QA'] | Lens: [12772, 12769, 12777, 12793, 12776] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350'] [Step 125 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [35734] → Tgt Spa: ['1.000'] [Step 125 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [35734] → Tgt Spa: ['1.000'] [Step 125 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40730] → Tgt Spa: ['1.000'] [Step 125 / Rank 4] Tasks: ['Single QA'] | Lens: [58354] → Tgt Spa: ['0.350'] [Step 125 / Rank 5] Tasks: ['Single QA'] | Lens: [58354] → Tgt Spa: ['0.350'] [Step 125 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [28601, 28596] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 6] Tasks: ['Single QA'] | Lens: [40549] → Tgt Spa: ['0.350'] [Step 125 / Rank 2] Tasks: ['Single QA'] | Lens: [52150] → Tgt Spa: ['0.350'] [Step 125 / Rank 7] Tasks: ['Single QA'] | Lens: [40549] → Tgt Spa: ['0.350'] [Step 125 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [28601, 28596] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 3] Tasks: ['Single QA'] | Lens: [52150] → Tgt Spa: ['0.350'] [Step 125 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27475, 27475] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 6] Tasks: ['Single QA'] | Lens: [61083] → Tgt Spa: ['0.350'] [Step 125 / Rank 7] Tasks: ['Single QA'] | Lens: [61083] → Tgt Spa: ['0.350'] [Step 125 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28170, 28173] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27475, 27475] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 2] Tasks: ['Code', 
'Code', 'Single QA', 'Code'] | Lens: [13905, 13917, 13911, 13921] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000'] [Step 125 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28170, 28173] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 3] Tasks: ['Code', 'Code', 'Single QA', 'Code'] | Lens: [13905, 13917, 13911, 13921] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000'] [Step 125 / Rank 3] Tasks: ['Code'] | Lens: [38244] → Tgt Spa: ['1.000'] [Step 125 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [60974] → Tgt Spa: ['1.000'] [Step 125 / Rank 2] Tasks: ['Code'] | Lens: [38244] → Tgt Spa: ['1.000'] [Step 125 / Rank 7] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [18529, 18548, 18538] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 125 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [60974] → Tgt Spa: ['1.000'] [Step 125 / Rank 0] Tasks: ['Code'] | Lens: [38652] → Tgt Spa: ['1.000'] [Step 125 / Rank 1] Tasks: ['Code'] | Lens: [38652] → Tgt Spa: ['1.000'] [Step 125 / Rank 6] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [18529, 18548, 18538] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 125 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39320] → Tgt Spa: ['1.000'] [Step 125 / Rank 5] Tasks: ['Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [6087, 6105, 6088, 6087, 6088, 6089, 6089, 6089, 6097, 6091] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 125 / Rank 4] Tasks: ['Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [6087, 6105, 6088, 6087, 6088, 6089, 6089, 6089, 6097, 6091] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 125 / Rank 7] Tasks: ['Single QA'] | 
Lens: [64686] → Tgt Spa: ['0.350'] [Step 125 / Rank 6] Tasks: ['Single QA'] | Lens: [64686] → Tgt Spa: ['0.350'] [Step 125 / Rank 0] Tasks: ['Single QA'] | Lens: [33966] → Tgt Spa: ['0.350'] [Step 125 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39320] → Tgt Spa: ['1.000'] [Step 125 / Rank 1] Tasks: ['Single QA'] | Lens: [33966] → Tgt Spa: ['0.350'] [Step 125 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57861] → Tgt Spa: ['1.000'] [Step 125 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57861] → Tgt Spa: ['1.000'] [Step 125 / Rank 0] Tasks: ['Single QA'] | Lens: [57064] → Tgt Spa: ['0.350'] [Step 125 / Rank 4] Tasks: ['Code'] | Lens: [33703] → Tgt Spa: ['1.000'] [Step 125 / Rank 5] Tasks: ['Code'] | Lens: [33703] → Tgt Spa: ['1.000'] [Step 125 / Rank 1] Tasks: ['Single QA'] | Lens: [57064] → Tgt Spa: ['0.350'] [Step 125 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [32435, 32438] → Tgt Spa: ['1.000', '1.000'] [Step 125 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [32435, 32438] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 00:15:53,193 >> @ 125 | Loss: 2.0319 | LM: 1.9477 | Reg: 0.0842 | Spa(Avg): 0.532 [INFO|lh_trainer.py:797] 2026-02-17 00:15:53,193 >> Statistic -> Code | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.121 | [INFO|lh_trainer.py:797] 2026-02-17 00:15:53,193 >> Statistic -> In-Context | Spa: 0.645 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:15:53,193 >> Statistic -> MultiHop | Spa: 0.584 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:15:53,193 >> Statistic -> Single | Spa: 0.403 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:15:53,194 >> Statistic -> Summarization | Spa: 0.622 | Tgt: 1.000 | Z-Loss: 0.107 | [INFO|lh_trainer.py:810] 2026-02-17 00:15:53,196 >> [Micro-Log] {"loss": 2.0319007504731417, "lm_loss": 1.947714449216922, "reg_loss": 0.08418629283551127, "model_sparsity(avg)": 0.5315200599531332, 
"Spa-In-Context Learning sparsity": 0.6454248358221615, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11481051848215215, "Spa-Summarization sparsity": 0.6215277761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10722238197922707, "Spa-Code sparsity": 0.5833333405581388, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12053384631872177, "Spa-Single QA sparsity": 0.4027777632077535, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.041539247271915276, "Spa-MultiHop QA sparsity": 0.5844907412926356, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08670956703523795, "step": 125, "current_tau": 1.106605887413025, "lambda1 Single QA": 0.546875, "lambda2 MultiHop QA": 0.279296875, "lambda3 Summarization": 0.1181640625, "lambda4 Code": 0.216796875} [INFO|lh_trainer.py:331] 2026-02-17 00:16:15,432 >> {'loss': 12.1914, 'grad_norm': 1.0813367366790771, 'learning_rate': 0.0004148364622913718, 'epoch': 0.13270142180094788, 'num_input_tokens_seen': 309623990, 'completed': '42.00% (126 / 300)', 'remaining time': '8:09:30', 'throughput': '7153.29', 'gpu_mem_free': '6639MB', 'step': 126} [Step 126 / Rank 2] Tasks: ['Single QA'] | Lens: [53535] → Tgt Spa: ['0.350'] [Step 126 / Rank 3] Tasks: ['Single QA'] | Lens: [53535] → Tgt Spa: ['0.350'] [Step 126 / Rank 6] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10204, 10197, 10202, 10204, 10207, 10207] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 126 / Rank 7] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10204, 10197, 10202, 10204, 10207, 10207] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 126 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [23126, 23134] → Tgt Spa: ['1.000', '1.000'] [Step 126 / Rank 5] Tasks: ['Single QA'] 
| Lens: [58633] → Tgt Spa: ['0.350'] [Step 126 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [23126, 23134] → Tgt Spa: ['1.000', '1.000'] [Step 126 / Rank 4] Tasks: ['Single QA'] | Lens: [58633] → Tgt Spa: ['0.350'] [Step 126 / Rank 5] Tasks: ['Single QA'] | Lens: [35807] → Tgt Spa: ['0.350'] [Step 126 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [22848, 22841] → Tgt Spa: ['1.000', '1.000'] [Step 126 / Rank 4] Tasks: ['Single QA'] | Lens: [35807] → Tgt Spa: ['0.350'] [Step 126 / Rank 6] Tasks: ['Code'] | Lens: [34541] → Tgt Spa: ['1.000'] [Step 126 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [22848, 22841] → Tgt Spa: ['1.000', '1.000'] [Step 126 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18963, 18953, 18964] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 126 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18963, 18953, 18964] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 126 / Rank 7] Tasks: ['Code'] | Lens: [34541] → Tgt Spa: ['1.000'] [Step 126 / Rank 3] Tasks: ['Code'] | Lens: [40250] → Tgt Spa: ['1.000'] [Step 126 / Rank 5] Tasks: ['Single QA'] | Lens: [65380] → Tgt Spa: ['0.350'] [Step 126 / Rank 1] Tasks: ['Single QA'] | Lens: [58833] → Tgt Spa: ['0.350'] [Step 126 / Rank 7] Tasks: ['Single QA'] | Lens: [60920] → Tgt Spa: ['0.350'] [Step 126 / Rank 4] Tasks: ['Single QA'] | Lens: [65380] → Tgt Spa: ['0.350'] [Step 126 / Rank 6] Tasks: ['Single QA'] | Lens: [60920] → Tgt Spa: ['0.350'] [Step 126 / Rank 0] Tasks: ['Single QA'] | Lens: [58833] → Tgt Spa: ['0.350'] [Step 126 / Rank 2] Tasks: ['Code'] | Lens: [40250] → Tgt Spa: ['1.000'] [Step 126 / Rank 1] Tasks: ['Single QA'] | Lens: [39472] → Tgt Spa: ['0.350'] [Step 126 / Rank 0] Tasks: ['Single QA'] | Lens: [39472] → Tgt Spa: ['0.350'] [Step 126 / Rank 6] Tasks: ['Single QA'] | Lens: [63015] → Tgt Spa: ['0.350'] [Step 126 / Rank 7] Tasks: ['Single QA'] | Lens: [63015] → Tgt Spa: ['0.350'] [Step 126 / Rank 2] Tasks: ['Single QA'] | 
Lens: [49579] → Tgt Spa: ['0.350'] [Step 126 / Rank 5] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [20902, 20895, 20917] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 126 / Rank 4] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [20902, 20895, 20917] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 126 / Rank 3] Tasks: ['Single QA'] | Lens: [49579] → Tgt Spa: ['0.350'] [Step 126 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Code', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [6372, 6371, 6372, 6381, 6375, 6383, 6376, 6375, 6386, 6379] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 126 / Rank 0] Tasks: ['Code', 'MultiHop QA'] | Lens: [28848, 28847] → Tgt Spa: ['1.000', '0.350'] [Step 126 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [22792, 22793] → Tgt Spa: ['1.000', '1.000'] [Step 126 / Rank 1] Tasks: ['Code', 'MultiHop QA'] | Lens: [28848, 28847] → Tgt Spa: ['1.000', '0.350'] [Step 126 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [22792, 22793] → Tgt Spa: ['1.000', '1.000'] [Step 126 / Rank 7] Tasks: ['Code'] | Lens: [61241] → Tgt Spa: ['1.000'] [Step 126 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Code', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [6372, 6371, 6372, 6381, 6375, 6383, 6376, 6375, 6386, 6379] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 126 / Rank 6] Tasks: ['Code'] | Lens: [61241] → Tgt Spa: ['1.000'] [Step 126 / Rank 5] Tasks: ['Single QA'] | Lens: [62439] → Tgt Spa: ['0.350'] [Step 126 / Rank 2] Tasks: ['Single QA'] | Lens: [34415] → Tgt Spa: ['0.350'] [Step 126 / Rank 4] Tasks: ['Single QA'] | Lens: [62439] → Tgt Spa: ['0.350'] [Step 126 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'Code'] | Lens: [20002, 19995, 20004] → Tgt Spa: ['1.000', 
'1.000', '1.000'] [Step 126 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'Code'] | Lens: [20002, 19995, 20004] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 126 / Rank 6] Tasks: ['Single QA'] | Lens: [65041] → Tgt Spa: ['0.350'] [Step 126 / Rank 3] Tasks: ['Single QA'] | Lens: [34415] → Tgt Spa: ['0.350'] [Step 126 / Rank 7] Tasks: ['Single QA'] | Lens: [65041] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:18:50,963 >> @ 126 | Loss: 1.8511 | LM: 1.7842 | Reg: 0.0669 | Spa(Avg): 0.499 [INFO|lh_trainer.py:797] 2026-02-17 00:18:50,963 >> Statistic -> Code | Spa: 0.616 | Tgt: 1.000 | Z-Loss: 0.109 | [INFO|lh_trainer.py:797] 2026-02-17 00:18:50,964 >> Statistic -> In-Context | Spa: 0.641 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:18:50,964 >> Statistic -> MultiHop | Spa: 0.569 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:18:50,964 >> Statistic -> Single | Spa: 0.450 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:18:50,964 >> Statistic -> Summarization | Spa: 0.593 | Tgt: 1.000 | Z-Loss: 0.122 | [INFO|lh_trainer.py:810] 2026-02-17 00:18:50,966 >> [Micro-Log] {"loss": 1.851065631955862, "lm_loss": 1.7842082343995571, "reg_loss": 0.06685739167733118, "model_sparsity(avg)": 0.4994020064671834, "Spa-In-Context Learning sparsity": 0.640625, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11658531986176968, "Spa-Code sparsity": 0.6163194440305233, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10899126715958118, "Spa-Summarization sparsity": 0.5925925771395365, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12199948479731877, "Spa-Single QA sparsity": 0.44956139828029434, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06043370756761808, "Spa-MultiHop QA sparsity": 0.5694444477558136, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 
0.08007530868053436, "step": 126, "current_tau": 1.1030536890029907, "lambda1 Single QA": 0.546875, "lambda2 MultiHop QA": 0.279296875, "lambda3 Summarization": 0.11865234375, "lambda4 Code": 0.2177734375} [INFO|lh_trainer.py:331] 2026-02-17 00:19:17,955 >> {'loss': 11.1064, 'grad_norm': 0.823809027671814, 'learning_rate': 0.00041236202084634466, 'epoch': 0.13375460768825698, 'num_input_tokens_seen': 312189822, 'completed': '42.33% (127 / 300)', 'remaining time': '8:06:59', 'throughput': '7028.80', 'gpu_mem_free': '6991MB', 'step': 127} [Step 127 / Rank 4] Tasks: ['Code'] | Lens: [32885] → Tgt Spa: ['1.000'] [Step 127 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [30275, 30277] → Tgt Spa: ['0.350', '1.000'] [Step 127 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [33774] → Tgt Spa: ['1.000'] [Step 127 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38246] → Tgt Spa: ['1.000'] [Step 127 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38246] → Tgt Spa: ['1.000'] [Step 127 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [33774] → Tgt Spa: ['1.000'] [Step 127 / Rank 5] Tasks: ['Code'] | Lens: [32885] → Tgt Spa: ['1.000'] [Step 127 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [30275, 30277] → Tgt Spa: ['0.350', '1.000'] [Step 127 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29864, 29865] → Tgt Spa: ['0.350', '0.350'] [Step 127 / Rank 7] Tasks: ['Single QA'] | Lens: [49226] → Tgt Spa: ['0.350'] [Step 127 / Rank 2] Tasks: ['Single QA'] | Lens: [38833] → Tgt Spa: ['0.350'] [Step 127 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29864, 29865] → Tgt Spa: ['0.350', '0.350'] [Step 127 / Rank 6] Tasks: ['Single QA'] | Lens: [49226] → Tgt Spa: ['0.350'] [Step 127 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53120] → Tgt Spa: ['1.000'] [Step 127 / Rank 3] Tasks: ['Single QA'] | Lens: [38833] → Tgt Spa: ['0.350'] [Step 127 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53120] → Tgt Spa: ['1.000'] [Step 127 / Rank 5] 
Tasks: ['Summarization', 'Single QA'] | Lens: [26583, 26566] → Tgt Spa: ['1.000', '0.350'] [Step 127 / Rank 1] Tasks: ['Single QA'] | Lens: [48677] → Tgt Spa: ['0.350'] [Step 127 / Rank 0] Tasks: ['Single QA'] | Lens: [48677] → Tgt Spa: ['0.350'] [Step 127 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40129] → Tgt Spa: ['1.000'] [Step 127 / Rank 6] Tasks: ['Single QA'] | Lens: [52955] → Tgt Spa: ['0.350'] [Step 127 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40129] → Tgt Spa: ['1.000'] [Step 127 / Rank 4] Tasks: ['Summarization', 'Single QA'] | Lens: [26583, 26566] → Tgt Spa: ['1.000', '0.350'] [Step 127 / Rank 7] Tasks: ['Single QA'] | Lens: [52955] → Tgt Spa: ['0.350'] [Step 127 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42882] → Tgt Spa: ['1.000'] [Step 127 / Rank 3] Tasks: ['Single QA'] | Lens: [57750] → Tgt Spa: ['0.350'] [Step 127 / Rank 1] Tasks: ['Single QA'] | Lens: [34874] → Tgt Spa: ['0.350'] [Step 127 / Rank 2] Tasks: ['Single QA'] | Lens: [57750] → Tgt Spa: ['0.350'] [Step 127 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42882] → Tgt Spa: ['1.000'] [Step 127 / Rank 7] Tasks: ['Code'] | Lens: [35805] → Tgt Spa: ['1.000'] [Step 127 / Rank 6] Tasks: ['Code'] | Lens: [35805] → Tgt Spa: ['1.000'] [Step 127 / Rank 0] Tasks: ['Single QA'] | Lens: [34874] → Tgt Spa: ['0.350'] [Step 127 / Rank 4] Tasks: ['Single QA'] | Lens: [50967] → Tgt Spa: ['0.350'] [Step 127 / Rank 3] Tasks: ['Single QA'] | Lens: [43048] → Tgt Spa: ['0.350'] [Step 127 / Rank 5] Tasks: ['Single QA'] | Lens: [50967] → Tgt Spa: ['0.350'] [Step 127 / Rank 6] Tasks: ['Code', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [13055, 13054, 13074, 13082, 13090] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000'] [Step 127 / Rank 2] Tasks: ['Single QA'] | Lens: [43048] → Tgt Spa: ['0.350'] [Step 127 / Rank 1] Tasks: ['Single QA'] | Lens: [65023] → Tgt Spa: ['0.350'] [Step 127 / Rank 0] Tasks: ['Single QA'] | Lens: [65023] → Tgt Spa: ['0.350'] [Step 127 / Rank 7] Tasks: 
['Code', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [13055, 13054, 13074, 13082, 13090] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000'] [Step 127 / Rank 2] Tasks: ['Single QA'] | Lens: [58933] → Tgt Spa: ['0.350'] [Step 127 / Rank 3] Tasks: ['Single QA'] | Lens: [58933] → Tgt Spa: ['0.350'] [Step 127 / Rank 7] Tasks: ['Code'] | Lens: [36358] → Tgt Spa: ['1.000'] [Step 127 / Rank 5] Tasks: ['Code'] | Lens: [57776] → Tgt Spa: ['1.000'] [Step 127 / Rank 6] Tasks: ['Code'] | Lens: [36358] → Tgt Spa: ['1.000'] [Step 127 / Rank 4] Tasks: ['Code'] | Lens: [57776] → Tgt Spa: ['1.000'] [Step 127 / Rank 0] Tasks: ['Single QA'] | Lens: [35680] → Tgt Spa: ['0.350'] [Step 127 / Rank 1] Tasks: ['Single QA'] | Lens: [35680] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:21:40,719 >> @ 127 | Loss: 2.0110 | LM: 1.9546 | Reg: 0.0564 | Spa(Avg): 0.506 [INFO|lh_trainer.py:797] 2026-02-17 00:21:40,719 >> Statistic -> Code | Spa: 0.641 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:797] 2026-02-17 00:21:40,719 >> Statistic -> In-Context | Spa: 0.662 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:21:40,719 >> Statistic -> MultiHop | Spa: 0.569 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:21:40,719 >> Statistic -> Single | Spa: 0.377 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:21:40,719 >> Statistic -> Summarization | Spa: 0.611 | Tgt: 1.000 | Z-Loss: 0.112 | [INFO|lh_trainer.py:810] 2026-02-17 00:21:40,722 >> [Micro-Log] {"loss": 2.0110064558684826, "lm_loss": 1.9546331812938054, "reg_loss": 0.056373266115163766, "model_sparsity(avg)": 0.5055555490156015, "Spa-Single QA sparsity": 0.3767361007630825, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02136245963629335, "Spa-In-Context Learning sparsity": 0.6620370348294576, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10905073707302411, "Spa-Code sparsity": 0.640625, 
"Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10091470740735531, "Spa-Summarization sparsity": 0.6111111044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11239209771156311, "Spa-MultiHop QA sparsity": 0.5694444477558136, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08007530868053436, "step": 127, "current_tau": 1.099546194076538, "lambda1 Single QA": 0.546875, "lambda2 MultiHop QA": 0.279296875, "lambda3 Summarization": 0.11962890625, "lambda4 Code": 0.21875} [INFO|lh_trainer.py:331] 2026-02-17 00:22:03,795 >> {'loss': 12.066, 'grad_norm': 0.6433836817741394, 'learning_rate': 0.00040985975950917115, 'epoch': 0.13480779357556608, 'num_input_tokens_seen': 314481274, 'completed': '42.67% (128 / 300)', 'remaining time': '8:04:06', 'throughput': '6908.62', 'gpu_mem_free': '13281MB', 'step': 128} [Step 128 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [51733] → Tgt Spa: ['1.000'] [Step 128 / Rank 5] Tasks: ['Single QA'] | Lens: [51782] → Tgt Spa: ['0.350'] [Step 128 / Rank 0] Tasks: ['Single QA'] | Lens: [37513] → Tgt Spa: ['0.350'] [Step 128 / Rank 4] Tasks: ['Single QA'] | Lens: [51782] → Tgt Spa: ['0.350'] [Step 128 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [51733] → Tgt Spa: ['1.000'] [Step 128 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23887, 23889] → Tgt Spa: ['1.000', '1.000'] [Step 128 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23887, 23889] → Tgt Spa: ['1.000', '1.000'] [Step 128 / Rank 1] Tasks: ['Single QA'] | Lens: [37513] → Tgt Spa: ['0.350'] [Step 128 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [21898, 21906] → Tgt Spa: ['1.000', '1.000'] [Step 128 / Rank 2] Tasks: ['Single QA'] | Lens: [60140] → Tgt Spa: ['0.350'] [Step 128 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [21898, 21906] → Tgt Spa: ['1.000', '1.000'] [Step 128 / Rank 5] Tasks: ['Code'] | Lens: [50061] → Tgt Spa: 
['1.000'] [Step 128 / Rank 6] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Single QA'] | Lens: [10972, 10976, 10970, 10979, 10979] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '0.350'] [Step 128 / Rank 3] Tasks: ['Single QA'] | Lens: [60140] → Tgt Spa: ['0.350'] [Step 128 / Rank 7] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Single QA'] | Lens: [10972, 10976, 10970, 10979, 10979] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '0.350'] [Step 128 / Rank 4] Tasks: ['Code'] | Lens: [50061] → Tgt Spa: ['1.000'] [Step 128 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Summarization'] | Lens: [5027, 5026, 5026, 5027, 5035, 5028, 5029, 5029, 5030, 5037, 5029, 5029, 5049] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 128 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [27812, 27820] → Tgt Spa: ['1.000', '1.000'] [Step 128 / Rank 5] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'Single QA'] | Lens: [8138, 8131, 8134, 8134, 8134, 8141, 8145, 8138] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 128 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Summarization'] | Lens: [5027, 5026, 5026, 5027, 5035, 5028, 5029, 5029, 5030, 5037, 5029, 5029, 5049] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 128 / Rank 0] Tasks: ['Code'] | Lens: [37165] → Tgt Spa: ['1.000'] [Step 128 / Rank 4] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single 
QA', 'Code', 'Code', 'Single QA'] | Lens: [8138, 8131, 8134, 8134, 8134, 8141, 8145, 8138] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 128 / Rank 1] Tasks: ['Code'] | Lens: [37165] → Tgt Spa: ['1.000'] [Step 128 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [27812, 27820] → Tgt Spa: ['1.000', '1.000'] [Step 128 / Rank 4] Tasks: ['Single QA'] | Lens: [49872] → Tgt Spa: ['0.350'] [Step 128 / Rank 2] Tasks: ['Single QA'] | Lens: [52198] → Tgt Spa: ['0.350'] [Step 128 / Rank 1] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [11484, 11487, 11487, 11479, 11506] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000'] [Step 128 / Rank 3] Tasks: ['Single QA'] | Lens: [52198] → Tgt Spa: ['0.350'] [Step 128 / Rank 0] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [11484, 11487, 11487, 11479, 11506] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000'] [Step 128 / Rank 7] Tasks: ['Single QA'] | Lens: [35042] → Tgt Spa: ['0.350'] [Step 128 / Rank 6] Tasks: ['Single QA'] | Lens: [35042] → Tgt Spa: ['0.350'] [Step 128 / Rank 5] Tasks: ['Single QA'] | Lens: [49872] → Tgt Spa: ['0.350'] [Step 128 / Rank 4] Tasks: ['Single QA'] | Lens: [47598] → Tgt Spa: ['0.350'] [Step 128 / Rank 2] Tasks: ['Single QA'] | Lens: [51598] → Tgt Spa: ['0.350'] [Step 128 / Rank 3] Tasks: ['Single QA'] | Lens: [51598] → Tgt Spa: ['0.350'] [Step 128 / Rank 5] Tasks: ['Single QA'] | Lens: [47598] → Tgt Spa: ['0.350'] [Step 128 / Rank 7] Tasks: ['Single QA'] | Lens: [55167] → Tgt Spa: ['0.350'] [Step 128 / Rank 1] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [7942, 7951, 7944, 7945, 7946, 7946, 7953, 7947] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 128 / Rank 6] Tasks: ['Single QA'] | Lens: [55167] → Tgt Spa: ['0.350'] [Step 128 / Rank 0] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single 
QA', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [7942, 7951, 7944, 7945, 7946, 7946, 7953, 7947] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 128 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [52137] → Tgt Spa: ['1.000'] [Step 128 / Rank 0] Tasks: ['Single QA'] | Lens: [41392] → Tgt Spa: ['0.350'] [Step 128 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [30783, 30792] → Tgt Spa: ['0.350', '1.000'] [Step 128 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [52137] → Tgt Spa: ['1.000'] [Step 128 / Rank 2] Tasks: ['Single QA'] | Lens: [58339] → Tgt Spa: ['0.350'] [Step 128 / Rank 3] Tasks: ['Single QA'] | Lens: [58339] → Tgt Spa: ['0.350'] [Step 128 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [30783, 30792] → Tgt Spa: ['0.350', '1.000'] [Step 128 / Rank 1] Tasks: ['Single QA'] | Lens: [41392] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:24:17,326 >> @ 128 | Loss: 2.0146 | LM: 1.9380 | Reg: 0.0765 | Spa(Avg): 0.505 [INFO|lh_trainer.py:797] 2026-02-17 00:24:17,326 >> Statistic -> Code | Spa: 0.600 | Tgt: 1.000 | Z-Loss: 0.116 | [INFO|lh_trainer.py:797] 2026-02-17 00:24:17,326 >> Statistic -> In-Context | Spa: 0.664 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:24:17,326 >> Statistic -> MultiHop | Spa: 0.569 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:24:17,326 >> Statistic -> Single | Spa: 0.500 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:24:17,327 >> Statistic -> Summarization | Spa: 0.569 | Tgt: 1.000 | Z-Loss: 0.132 | [INFO|lh_trainer.py:810] 2026-02-17 00:24:17,329 >> [Micro-Log] {"loss": 2.0145558901131153, "lm_loss": 1.9380172826349735, "reg_loss": 0.07653861209594955, "model_sparsity(avg)": 0.5054008588194847, "Spa-Single QA sparsity": 0.4999999978712627, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.09635486317399357, "Spa-In-Context Learning sparsity": 0.6636904648372105, 
"Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1092107572725841, "Spa-Code sparsity": 0.6001461932533666, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11577073445445613, "Spa-Summarization sparsity": 0.5694444179534912, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13245797157287598, "Spa-MultiHop QA sparsity": 0.5694444477558136, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08007530868053436, "step": 128, "current_tau": 1.0960845947265625, "lambda1 Single QA": 0.55078125, "lambda2 MultiHop QA": 0.28125, "lambda3 Summarization": 0.1201171875, "lambda4 Code": 0.2197265625} [INFO|lh_trainer.py:331] 2026-02-17 00:24:39,920 >> {'loss': 12.0873, 'grad_norm': 0.7915118932723999, 'learning_rate': 0.0004073301070294496, 'epoch': 0.1358609794628752, 'num_input_tokens_seen': 316975100, 'completed': '43.00% (129 / 300)', 'remaining time': '8:01:01', 'throughput': '7986.66', 'gpu_mem_free': '12543MB', 'step': 129} [Step 129 / Rank 0] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1304, 1304, 1304, 1303, 1305, 1304, 1323, 1323, 1304, 1324, 1307, 1307, 1306, 1325, 1307, 1306, 1309, 1306, 1306, 1306, 1326, 1309, 1308, 1308, 1309, 1308, 1309, 1309, 
1309, 1311, 1329, 1329, 1330, 1312, 1311, 1311, 1330, 1330, 1330, 1312, 1313, 1312, 1313, 1332, 1313, 1313, 1312, 1313, 1313] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 129 / Rank 4] Tasks: ['Code'] | Lens: [37602] → Tgt Spa: ['1.000'] [Step 129 / Rank 2] Tasks: ['Single QA'] | Lens: [55883] → Tgt Spa: ['0.350'] [Step 129 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32528, 32528] → Tgt Spa: ['0.350', '0.350'] [Step 129 / Rank 3] Tasks: ['Single QA'] | Lens: [55883] → Tgt Spa: ['0.350'] [Step 129 / Rank 1] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1304, 1304, 1304, 1303, 1305, 1304, 1323, 1323, 1304, 1324, 1307, 1307, 1306, 1325, 1307, 1306, 1309, 1306, 1306, 1306, 1326, 1309, 1308, 1308, 1309, 1308, 1309, 1309, 1309, 1311, 1329, 1329, 1330, 1312, 1311, 1311, 1330, 1330, 1330, 1312, 1313, 1312, 1313, 1332, 1313, 1313, 1312, 1313, 
1313] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 129 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32528, 32528] → Tgt Spa: ['0.350', '0.350'] [Step 129 / Rank 5] Tasks: ['Code'] | Lens: [37602] → Tgt Spa: ['1.000'] [Step 129 / Rank 1] Tasks: ['Single QA'] | Lens: [49695] → Tgt Spa: ['0.350'] [Step 129 / Rank 2] Tasks: ['Summarization', 'Summarization'] | Lens: [26722, 26723] → Tgt Spa: ['1.000', '1.000'] [Step 129 / Rank 4] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Code', 'Single QA'] | Lens: [10110, 10118, 10120, 10120, 10124, 10119] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 129 / Rank 3] Tasks: ['Summarization', 'Summarization'] | Lens: [26722, 26723] → Tgt Spa: ['1.000', '1.000'] [Step 129 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16874, 16874, 16865] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 129 / Rank 0] Tasks: ['Single QA'] | Lens: [49695] → Tgt Spa: ['0.350'] [Step 129 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16874, 16874, 16865] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 129 / Rank 5] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Code', 'Single QA'] | Lens: [10110, 10118, 10120, 10120, 10124, 10119] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 129 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59461] → Tgt Spa: ['1.000'] [Step 129 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59461] → Tgt Spa: ['1.000'] [Step 129 / Rank 7] Tasks: ['Code'] | Lens: [62038] → Tgt Spa: ['1.000'] [Step 129 / Rank 2] Tasks: ['Single QA'] 
| Lens: [64056] → Tgt Spa: ['0.350'] [Step 129 / Rank 6] Tasks: ['Code'] | Lens: [62038] → Tgt Spa: ['1.000'] [Step 129 / Rank 1] Tasks: ['Single QA'] | Lens: [56623] → Tgt Spa: ['0.350'] [Step 129 / Rank 3] Tasks: ['Single QA'] | Lens: [64056] → Tgt Spa: ['0.350'] [Step 129 / Rank 0] Tasks: ['Single QA'] | Lens: [56623] → Tgt Spa: ['0.350'] [Step 129 / Rank 3] Tasks: ['Single QA'] | Lens: [50234] → Tgt Spa: ['0.350'] [Step 129 / Rank 2] Tasks: ['Single QA'] | Lens: [50234] → Tgt Spa: ['0.350'] [Step 129 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24256, 24240] → Tgt Spa: ['1.000', '1.000'] [Step 129 / Rank 7] Tasks: ['Code'] | Lens: [39840] → Tgt Spa: ['1.000'] [Step 129 / Rank 5] Tasks: ['In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Summarization', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning'] | Lens: [4786, 4788, 4796, 4790, 4808, 4790, 4791, 4791, 4791, 4791, 4791, 4811, 4792] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 129 / Rank 4] Tasks: ['In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Summarization', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning'] | Lens: [4786, 4788, 4796, 4790, 4808, 4790, 4791, 4791, 4791, 4791, 4791, 4811, 4792] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 129 / Rank 6] Tasks: ['Code'] | Lens: [39840] → Tgt Spa: ['1.000'] [Step 129 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24256, 24240] → Tgt Spa: ['1.000', '1.000'] [Step 129 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58920] → Tgt Spa: ['1.000'] [Step 129 / Rank 7] Tasks: ['In-Context Learning'] 
| Lens: [55162] → Tgt Spa: ['1.000'] [Step 129 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58920] → Tgt Spa: ['1.000'] [Step 129 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23053, 23073] → Tgt Spa: ['1.000', '1.000'] [Step 129 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [55162] → Tgt Spa: ['1.000'] [Step 129 / Rank 1] Tasks: ['Single QA'] | Lens: [58724] → Tgt Spa: ['0.350'] [Step 129 / Rank 0] Tasks: ['Single QA'] | Lens: [58724] → Tgt Spa: ['0.350'] [Step 129 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23053, 23073] → Tgt Spa: ['1.000', '1.000'] [Step 129 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24042, 24043] → Tgt Spa: ['1.000', '0.350'] [Step 129 / Rank 2] Tasks: ['Single QA'] | Lens: [50555] → Tgt Spa: ['0.350'] [Step 129 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24042, 24043] → Tgt Spa: ['1.000', '0.350'] [Step 129 / Rank 3] Tasks: ['Single QA'] | Lens: [50555] → Tgt Spa: ['0.350'] [Step 129 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32199, 32199] → Tgt Spa: ['0.350', '0.350'] [Step 129 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32199, 32199] → Tgt Spa: ['0.350', '0.350'] [Step 129 / Rank 0] Tasks: ['Single QA'] | Lens: [65023] → Tgt Spa: ['0.350'] [Step 129 / Rank 1] Tasks: ['Single QA'] | Lens: [65023] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:27:05,744 >> @ 129 | Loss: 2.1512 | LM: 2.0691 | Reg: 0.0821 | Spa(Avg): 0.539 [INFO|lh_trainer.py:797] 2026-02-17 00:27:05,744 >> Statistic -> Code | Spa: 0.630 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:797] 2026-02-17 00:27:05,744 >> Statistic -> In-Context | Spa: 0.663 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:27:05,744 >> Statistic -> MultiHop | Spa: 0.596 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:27:05,744 >> Statistic -> Single | Spa: 0.475 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 
00:27:05,744 >> Statistic -> Summarization | Spa: 0.600 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:810] 2026-02-17 00:27:05,747 >> [Micro-Log] {"loss": 2.1512076730529466, "lm_loss": 2.0691406931728125, "reg_loss": 0.0820669534150511, "model_sparsity(avg)": 0.5392622215052446, "Spa-MultiHop QA sparsity": 0.5960961000339405, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09286001163559991, "Spa-Summarization sparsity": 0.5999999880790711, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11936302967369557, "Spa-Single QA sparsity": 0.4745370282067193, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07527788883696, "Spa-In-Context Learning sparsity": 0.6634615384615384, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10940857518177766, "Spa-Code sparsity": 0.6296296252144707, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10531067765421337, "step": 129, "current_tau": 1.0926698446273804, "lambda1 Single QA": 0.55078125, "lambda2 MultiHop QA": 0.28125, "lambda3 Summarization": 0.12060546875, "lambda4 Code": 0.2197265625} [INFO|lh_trainer.py:331] 2026-02-17 00:27:32,601 >> {'loss': 12.9072, 'grad_norm': 0.7420702576637268, 'learning_rate': 0.0004047734968501098, 'epoch': 0.1369141653501843, 'num_input_tokens_seen': 319629918, 'completed': '43.33% (130 / 300)', 'remaining time': '7:58:17', 'throughput': '7687.04', 'gpu_mem_free': '3955MB', 'step': 130} [Step 130 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17485, 17485, 17487] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 130 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38129] → Tgt Spa: ['1.000'] [Step 130 / Rank 1] Tasks: ['Single QA'] | Lens: [47661] → Tgt Spa: ['0.350'] [Step 130 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [61137] → Tgt Spa: ['1.000'] [Step 130 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38129] → Tgt Spa: 
['1.000'] [Step 130 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17485, 17485, 17487] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 130 / Rank 0] Tasks: ['Single QA'] | Lens: [47661] → Tgt Spa: ['0.350'] [Step 130 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61137] → Tgt Spa: ['1.000'] [Step 130 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [17377, 17380, 17380] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 130 / Rank 2] Tasks: ['Single QA'] | Lens: [53160] → Tgt Spa: ['0.350'] [Step 130 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [26250, 26250] → Tgt Spa: ['1.000', '1.000'] [Step 130 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [17377, 17380, 17380] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 130 / Rank 1] Tasks: ['Single QA'] | Lens: [44166] → Tgt Spa: ['0.350'] [Step 130 / Rank 3] Tasks: ['Single QA'] | Lens: [53160] → Tgt Spa: ['0.350'] [Step 130 / Rank 0] Tasks: ['Single QA'] | Lens: [44166] → Tgt Spa: ['0.350'] [Step 130 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [26250, 26250] → Tgt Spa: ['1.000', '1.000'] [Step 130 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [24946, 24955] → Tgt Spa: ['1.000', '1.000'] [Step 130 / Rank 1] Tasks: ['Code'] | Lens: [37111] → Tgt Spa: ['1.000'] [Step 130 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [64571] → Tgt Spa: ['1.000'] [Step 130 / Rank 2] Tasks: ['Single QA'] | Lens: [47421] → Tgt Spa: ['0.350'] [Step 130 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [64571] → Tgt Spa: ['1.000'] [Step 130 / Rank 0] Tasks: ['Code'] | Lens: [37111] → Tgt Spa: ['1.000'] [Step 130 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [24946, 24955] → Tgt Spa: ['1.000', '1.000'] [Step 130 / Rank 3] Tasks: ['Single QA'] | Lens: [47421] → Tgt Spa: ['0.350'] [Step 130 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19593, 19597, 19587] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 130 / Rank 6] Tasks: ['Single QA'] | Lens: [52292] → Tgt Spa: ['0.350'] [Step 
130 / Rank 0] Tasks: ['Code'] | Lens: [42451] → Tgt Spa: ['1.000'] [Step 130 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [23854, 23846] → Tgt Spa: ['1.000', '1.000'] [Step 130 / Rank 7] Tasks: ['Single QA'] | Lens: [52292] → Tgt Spa: ['0.350'] [Step 130 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19593, 19597, 19587] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 130 / Rank 1] Tasks: ['Code'] | Lens: [42451] → Tgt Spa: ['1.000'] [Step 130 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [23854, 23846] → Tgt Spa: ['1.000', '1.000'] [Step 130 / Rank 4] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA'] | Lens: [3377, 3361, 3362, 3360, 3361, 3379, 3369, 3366, 3365, 3365, 3382, 3371, 3365, 3365, 3365, 3366, 3383, 3365, 3367] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 130 / Rank 2] Tasks: ['Single QA'] | Lens: [44045] → Tgt Spa: ['0.350'] [Step 130 / Rank 3] Tasks: ['Single QA'] | Lens: [44045] → Tgt Spa: ['0.350'] [Step 130 / Rank 0] Tasks: ['Code'] | Lens: [43670] → Tgt Spa: ['1.000'] [Step 130 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19524, 19524, 19515] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 130 / Rank 5] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA'] | Lens: [3377, 3361, 3362, 3360, 3361, 3379, 3369, 3366, 3365, 3365, 3382, 3371, 3365, 
3365, 3365, 3366, 3383, 3365, 3367] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 130 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19524, 19524, 19515] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 130 / Rank 1] Tasks: ['Code'] | Lens: [43670] → Tgt Spa: ['1.000'] [Step 130 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58948] → Tgt Spa: ['1.000'] [Step 130 / Rank 1] Tasks: ['Single QA'] | Lens: [52214] → Tgt Spa: ['0.350'] [Step 130 / Rank 3] Tasks: ['Code'] | Lens: [37881] → Tgt Spa: ['1.000'] [Step 130 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58948] → Tgt Spa: ['1.000'] [Step 130 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23509, 23509] → Tgt Spa: ['0.350', '1.000'] [Step 130 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23509, 23509] → Tgt Spa: ['0.350', '1.000'] [Step 130 / Rank 0] Tasks: ['Single QA'] | Lens: [52214] → Tgt Spa: ['0.350'] [Step 130 / Rank 2] Tasks: ['Code'] | Lens: [37881] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 00:29:54,579 >> @ 130 | Loss: 1.7810 | LM: 1.6946 | Reg: 0.0864 | Spa(Avg): 0.542 [INFO|lh_trainer.py:797] 2026-02-17 00:29:54,580 >> Statistic -> Code | Spa: 0.617 | Tgt: 1.000 | Z-Loss: 0.110 | [INFO|lh_trainer.py:797] 2026-02-17 00:29:54,580 >> Statistic -> In-Context | Spa: 0.658 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:29:54,580 >> Statistic -> MultiHop | Spa: 0.617 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:29:54,580 >> Statistic -> Single | Spa: 0.403 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:29:54,580 >> Statistic -> Summarization | Spa: 0.569 | Tgt: 1.000 | Z-Loss: 0.134 | [INFO|lh_trainer.py:810] 2026-02-17 00:29:54,583 >> [Micro-Log] {"loss": 1.780960473542412, "lm_loss": 1.6946066667636235, "reg_loss": 
0.08635380292253103, "model_sparsity(avg)": 0.5416819006204605, "Spa-Single QA sparsity": 0.4027777777777778, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.042520122874217726, "Spa-Code sparsity": 0.6166666626930237, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11029127985239029, "Spa-Summarization sparsity": 0.5694444450465116, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13398778709498319, "Spa-In-Context Learning sparsity": 0.6583333373069763, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11131625548005104, "Spa-MultiHop QA sparsity": 0.6172839535607232, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10228249265087976, "step": 130, "current_tau": 1.0893031358718872, "lambda1 Single QA": 0.55078125, "lambda2 MultiHop QA": 0.28125, "lambda3 Summarization": 0.12158203125, "lambda4 Code": 0.220703125} [INFO|lh_trainer.py:331] 2026-02-17 00:30:17,260 >> {'loss': 10.6858, 'grad_norm': 1.0302780866622925, 'learning_rate': 0.0004021903670331444, 'epoch': 0.13796735123749343, 'num_input_tokens_seen': 322045726, 'completed': '43.67% (131 / 300)', 'remaining time': '7:55:23', 'throughput': '7335.81', 'gpu_mem_free': '8471MB', 'step': 131} [Step 131 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23433, 23433] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 2] Tasks: ['Single QA'] | Lens: [64609] → Tgt Spa: ['0.350'] [Step 131 / Rank 3] Tasks: ['Single QA'] | Lens: [64609] → Tgt Spa: ['0.350'] [Step 131 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [24604, 24613] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [55854] → Tgt Spa: ['1.000'] [Step 131 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23433, 23433] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [24604, 24613] → Tgt 
Spa: ['1.000', '1.000'] [Step 131 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [55854] → Tgt Spa: ['1.000'] [Step 131 / Rank 5] Tasks: ['Code'] | Lens: [44025] → Tgt Spa: ['1.000'] [Step 131 / Rank 4] Tasks: ['Code'] | Lens: [44025] → Tgt Spa: ['1.000'] [Step 131 / Rank 3] Tasks: ['Single QA'] | Lens: [33978] → Tgt Spa: ['0.350'] [Step 131 / Rank 7] Tasks: ['Single QA'] | Lens: [64670] → Tgt Spa: ['0.350'] [Step 131 / Rank 6] Tasks: ['Single QA'] | Lens: [64670] → Tgt Spa: ['0.350'] [Step 131 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [36280] → Tgt Spa: ['1.000'] [Step 131 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [36280] → Tgt Spa: ['1.000'] [Step 131 / Rank 2] Tasks: ['Single QA'] | Lens: [33978] → Tgt Spa: ['0.350'] [Step 131 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45893] → Tgt Spa: ['1.000'] [Step 131 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [28070, 28063] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 2] Tasks: ['Code'] | Lens: [57900] → Tgt Spa: ['1.000'] [Step 131 / Rank 1] Tasks: ['Single QA'] | Lens: [45270] → Tgt Spa: ['0.350'] [Step 131 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [28070, 28063] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 3] Tasks: ['Code'] | Lens: [57900] → Tgt Spa: ['1.000'] [Step 131 / Rank 0] Tasks: ['Single QA'] | Lens: [45270] → Tgt Spa: ['0.350'] [Step 131 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45893] → Tgt Spa: ['1.000'] [Step 131 / Rank 5] Tasks: ['Single QA'] | Lens: [55067] → Tgt Spa: ['0.350'] [Step 131 / Rank 2] Tasks: ['Single QA'] | Lens: [35550] → Tgt Spa: ['0.350'] [Step 131 / Rank 1] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'Single QA', 'Single QA', 
'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1619, 1620, 1639, 1621, 1621, 1621, 1641, 1623, 1623, 1623, 1644, 1625, 1627, 1644, 1644, 1645, 1644, 1627, 1627, 1627, 1628, 1633, 1627, 1628, 1627, 1630, 1647, 1646, 1628, 1629, 1628, 1647, 1629, 1628, 1649, 1648, 1649, 1632, 1631, 1630] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 131 / Rank 6] Tasks: ['Single QA'] | Lens: [51703] → Tgt Spa: ['0.350'] [Step 131 / Rank 3] Tasks: ['Single QA'] | Lens: [35550] → Tgt Spa: ['0.350'] [Step 131 / Rank 0] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1619, 1620, 1639, 1621, 1621, 1621, 1641, 1623, 1623, 1623, 1644, 1625, 1627, 1644, 1644, 1645, 1644, 1627, 1627, 1627, 1628, 1633, 1627, 1628, 1627, 1630, 1647, 1646, 1628, 1629, 1628, 1647, 1629, 1628, 1649, 1648, 1649, 1632, 1631, 1630] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', 
'0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 131 / Rank 7] Tasks: ['Single QA'] | Lens: [51703] → Tgt Spa: ['0.350'] [Step 131 / Rank 4] Tasks: ['Single QA'] | Lens: [55067] → Tgt Spa: ['0.350'] [Step 131 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [65311] → Tgt Spa: ['1.000'] [Step 131 / Rank 2] Tasks: ['Code'] | Lens: [53953] → Tgt Spa: ['1.000'] [Step 131 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [24327, 24320] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [24327, 24320] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 6] Tasks: ['Single QA'] | Lens: [33854] → Tgt Spa: ['0.350'] [Step 131 / Rank 3] Tasks: ['Code'] | Lens: [53953] → Tgt Spa: ['1.000'] [Step 131 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [65311] → Tgt Spa: ['1.000'] [Step 131 / Rank 7] Tasks: ['Single QA'] | Lens: [33854] → Tgt Spa: ['0.350'] [Step 131 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [30905, 30920] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 5] Tasks: ['Code'] | Lens: [37457] → Tgt Spa: ['1.000'] [Step 131 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [30905, 30920] → Tgt Spa: ['1.000', '1.000'] [Step 131 / Rank 3] Tasks: ['Single QA'] | Lens: [38269] → Tgt Spa: ['0.350'] [Step 131 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40936] → Tgt Spa: ['1.000'] [Step 131 / Rank 4] Tasks: ['Code'] | Lens: [37457] → Tgt Spa: ['1.000'] [Step 131 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40936] → Tgt Spa: ['1.000'] [Step 131 / Rank 2] Tasks: ['Single QA'] | Lens: [38269] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:33:03,440 >> @ 131 | Loss: 2.0134 | LM: 1.9277 | Reg: 0.0857 | Spa(Avg): 0.564 [INFO|lh_trainer.py:797] 2026-02-17 00:33:03,440 >> Statistic -> Code | 
Spa: 0.619 | Tgt: 1.000 | Z-Loss: 0.110 | [INFO|lh_trainer.py:797] 2026-02-17 00:33:03,440 >> Statistic -> In-Context | Spa: 0.673 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:33:03,440 >> Statistic -> MultiHop | Spa: 0.595 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:33:03,440 >> Statistic -> Single | Spa: 0.463 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:33:03,440 >> Statistic -> Summarization | Spa: 0.631 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:810] 2026-02-17 00:33:03,442 >> [Micro-Log] {"loss": 2.01342049613595, "lm_loss": 1.9277241161713998, "reg_loss": 0.08569636886628966, "model_sparsity(avg)": 0.5644675915439924, "Spa-In-Context Learning sparsity": 0.6728394958708022, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1068086822827657, "Spa-Code sparsity": 0.6194444477558136, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10999575182795525, "Spa-Single QA sparsity": 0.4633838425983082, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06807463429868221, "Spa-MultiHop QA sparsity": 0.5954861119389534, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09290097653865814, "Spa-Summarization sparsity": 0.6309523752757481, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10541578967656408, "step": 131, "current_tau": 1.0859853029251099, "lambda1 Single QA": 0.55078125, "lambda2 MultiHop QA": 0.28125, "lambda3 Summarization": 0.1220703125, "lambda4 Code": 0.2216796875} [INFO|lh_trainer.py:331] 2026-02-17 00:33:19,825 >> {'loss': 12.0805, 'grad_norm': 0.9391334652900696, 'learning_rate': 0.00039958116018454974, 'epoch': 0.13902053712480253, 'num_input_tokens_seen': 324422858, 'completed': '44.00% (132 / 300)', 'remaining time': '7:52:52', 'throughput': '6510.38', 'gpu_mem_free': '11867MB', 'step': 132} [Step 132 / Rank 1] Tasks: ['Single QA', 'Single 
QA'] | Lens: [26987, 26987] → Tgt Spa: ['0.350', '0.350'] [Step 132 / Rank 4] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [7948, 7956, 7949, 7949, 7950, 7951, 7951, 7960] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 132 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [26987, 26987] → Tgt Spa: ['0.350', '0.350'] [Step 132 / Rank 7] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA'] | Lens: [4834, 4816, 4817, 4817, 4817, 4818, 4826, 4825, 4818, 4818, 4818, 4819, 4820] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 132 / Rank 2] Tasks: ['Code'] | Lens: [37809] → Tgt Spa: ['1.000'] [Step 132 / Rank 3] Tasks: ['Code'] | Lens: [37809] → Tgt Spa: ['1.000'] [Step 132 / Rank 5] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [7948, 7956, 7949, 7949, 7950, 7951, 7951, 7960] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 132 / Rank 6] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA'] | Lens: [4834, 4816, 4817, 4817, 4817, 4818, 4826, 4825, 4818, 4818, 4818, 4819, 4820] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 132 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25851, 25853] → Tgt Spa: ['1.000', '0.350'] [Step 132 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58395] → 
Tgt Spa: ['1.000'] [Step 132 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25851, 25853] → Tgt Spa: ['1.000', '0.350'] [Step 132 / Rank 1] Tasks: ['Single QA'] | Lens: [46408] → Tgt Spa: ['0.350'] [Step 132 / Rank 0] Tasks: ['Single QA'] | Lens: [46408] → Tgt Spa: ['0.350'] [Step 132 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [25493, 25485] → Tgt Spa: ['1.000', '1.000'] [Step 132 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58395] → Tgt Spa: ['1.000'] [Step 132 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [25493, 25485] → Tgt Spa: ['1.000', '1.000'] [Step 132 / Rank 3] Tasks: ['Single QA'] | Lens: [49599] → Tgt Spa: ['0.350'] [Step 132 / Rank 2] Tasks: ['Single QA'] | Lens: [49599] → Tgt Spa: ['0.350'] [Step 132 / Rank 0] Tasks: ['Single QA'] | Lens: [35353] → Tgt Spa: ['0.350'] [Step 132 / Rank 5] Tasks: ['Code'] | Lens: [35142] → Tgt Spa: ['1.000'] [Step 132 / Rank 4] Tasks: ['Code'] | Lens: [35142] → Tgt Spa: ['1.000'] [Step 132 / Rank 1] Tasks: ['Single QA'] | Lens: [35353] → Tgt Spa: ['0.350'] [Step 132 / Rank 6] Tasks: ['Single QA'] | Lens: [49749] → Tgt Spa: ['0.350'] [Step 132 / Rank 7] Tasks: ['Single QA'] | Lens: [49749] → Tgt Spa: ['0.350'] [Step 132 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43627] → Tgt Spa: ['1.000'] [Step 132 / Rank 1] Tasks: ['Code'] | Lens: [35829] → Tgt Spa: ['1.000'] [Step 132 / Rank 4] Tasks: ['Single QA'] | Lens: [42767] → Tgt Spa: ['0.350'] [Step 132 / Rank 6] Tasks: ['Single QA'] | Lens: [49231] → Tgt Spa: ['0.350'] [Step 132 / Rank 0] Tasks: ['Code'] | Lens: [35829] → Tgt Spa: ['1.000'] [Step 132 / Rank 5] Tasks: ['Single QA'] | Lens: [42767] → Tgt Spa: ['0.350'] [Step 132 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43627] → Tgt Spa: ['1.000'] [Step 132 / Rank 7] Tasks: ['Single QA'] | Lens: [49231] → Tgt Spa: ['0.350'] [Step 132 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39008] → Tgt Spa: ['1.000'] [Step 132 / Rank 5] Tasks: ['Code', 'Single QA', 'Code', 
'MultiHop QA'] | Lens: [15585, 15579, 15600, 15602] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350'] [Step 132 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18607, 18607, 18597] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 132 / Rank 4] Tasks: ['Code', 'Single QA', 'Code', 'MultiHop QA'] | Lens: [15585, 15579, 15600, 15602] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350'] [Step 132 / Rank 7] Tasks: ['Single QA'] | Lens: [35560] → Tgt Spa: ['0.350'] [Step 132 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18607, 18607, 18597] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 132 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39008] → Tgt Spa: ['1.000'] [Step 132 / Rank 6] Tasks: ['Single QA'] | Lens: [35560] → Tgt Spa: ['0.350'] [Step 132 / Rank 1] Tasks: ['Code'] | Lens: [37647] → Tgt Spa: ['1.000'] [Step 132 / Rank 4] Tasks: ['Single QA'] | Lens: [50124] → Tgt Spa: ['0.350'] [Step 132 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22483, 22484] → Tgt Spa: ['1.000', '1.000'] [Step 132 / Rank 6] Tasks: ['Code'] | Lens: [34241] → Tgt Spa: ['1.000'] [Step 132 / Rank 7] Tasks: ['Code'] | Lens: [34241] → Tgt Spa: ['1.000'] [Step 132 / Rank 5] Tasks: ['Single QA'] | Lens: [50124] → Tgt Spa: ['0.350'] [Step 132 / Rank 0] Tasks: ['Code'] | Lens: [37647] → Tgt Spa: ['1.000'] [Step 132 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22483, 22484] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 00:35:17,109 >> @ 132 | Loss: 1.9676 | LM: 1.8890 | Reg: 0.0786 | Spa(Avg): 0.557 [INFO|lh_trainer.py:797] 2026-02-17 00:35:17,109 >> Statistic -> Code | Spa: 0.636 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:797] 2026-02-17 00:35:17,109 >> Statistic -> In-Context | Spa: 0.684 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:35:17,109 >> Statistic -> MultiHop | Spa: 0.542 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:35:17,109 
>> Statistic -> Single | Spa: 0.481 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:35:17,109 >> Statistic -> Summarization | Spa: 0.620 | Tgt: 1.000 | Z-Loss: 0.110 | [INFO|lh_trainer.py:810] 2026-02-17 00:35:17,111 >> [Micro-Log] {"loss": 1.9676354142526786, "lm_loss": 1.8889876833806436, "reg_loss": 0.07864771798388877, "model_sparsity(avg)": 0.5572842520972093, "Spa-Single QA sparsity": 0.48125, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08019294049008749, "Spa-Code sparsity": 0.6356837566082294, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10459202871872829, "Spa-In-Context Learning sparsity": 0.6842592636744181, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10273427963256836, "Spa-Summarization sparsity": 0.6203703681627909, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10971006502707799, "Spa-MultiHop QA sparsity": 0.5416666269302368, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06829550117254257, "step": 132, "current_tau": 1.0827172994613647, "lambda1 Single QA": 0.55078125, "lambda2 MultiHop QA": 0.283203125, "lambda3 Summarization": 0.12255859375, "lambda4 Code": 0.22265625} [INFO|lh_trainer.py:331] 2026-02-17 00:35:35,108 >> {'loss': 11.8058, 'grad_norm': 0.834057629108429, 'learning_rate': 0.000396946323378487, 'epoch': 0.14007372301211163, 'num_input_tokens_seen': 326675990, 'completed': '44.33% (133 / 300)', 'remaining time': '7:49:21', 'throughput': '8327.44', 'gpu_mem_free': '14063MB', 'step': 133} [Step 133 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [61645] → Tgt Spa: ['1.000'] [Step 133 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [24833, 24841] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 7] Tasks: ['Single QA'] | Lens: [41018] → Tgt Spa: ['0.350'] [Step 133 / Rank 6] Tasks: ['Single QA'] | Lens: [41018] → Tgt Spa: ['0.350'] [Step 133 / Rank 3] Tasks: 
['Code'] | Lens: [58290] → Tgt Spa: ['1.000'] [Step 133 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [24833, 24841] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 2] Tasks: ['Code'] | Lens: [58290] → Tgt Spa: ['1.000'] [Step 133 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [61645] → Tgt Spa: ['1.000'] [Step 133 / Rank 5] Tasks: ['Code'] | Lens: [45885] → Tgt Spa: ['1.000'] [Step 133 / Rank 6] Tasks: ['Code'] | Lens: [65186] → Tgt Spa: ['1.000'] [Step 133 / Rank 7] Tasks: ['Code'] | Lens: [65186] → Tgt Spa: ['1.000'] [Step 133 / Rank 1] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [21645, 21664, 21658] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 133 / Rank 4] Tasks: ['Code'] | Lens: [45885] → Tgt Spa: ['1.000'] [Step 133 / Rank 0] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [21645, 21664, 21658] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 133 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16900, 16890, 16904] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 133 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16900, 16890, 16904] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 133 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [28494, 28500] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10444, 10446, 10446, 10449, 10449, 10450] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 133 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10444, 10446, 10446, 10449, 10449, 10450] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 133 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55006] → Tgt Spa: ['1.000'] [Step 133 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25471, 25474] → Tgt Spa: ['0.350', '1.000'] [Step 133 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: 
[25471, 25474] → Tgt Spa: ['0.350', '1.000'] [Step 133 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [28494, 28500] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55006] → Tgt Spa: ['1.000'] [Step 133 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22151, 22152] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [62721] → Tgt Spa: ['1.000'] [Step 133 / Rank 4] Tasks: ['Single QA'] | Lens: [41541] → Tgt Spa: ['0.350'] [Step 133 / Rank 0] Tasks: ['Single QA'] | Lens: [46274] → Tgt Spa: ['0.350'] [Step 133 / Rank 1] Tasks: ['Single QA'] | Lens: [46274] → Tgt Spa: ['0.350'] [Step 133 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22151, 22152] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 5] Tasks: ['Single QA'] | Lens: [41541] → Tgt Spa: ['0.350'] [Step 133 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [62721] → Tgt Spa: ['1.000'] [Step 133 / Rank 1] Tasks: ['Single QA'] | Lens: [38199] → Tgt Spa: ['0.350'] [Step 133 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [27181, 27173] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [27181, 27173] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 0] Tasks: ['Single QA'] | Lens: [38199] → Tgt Spa: ['0.350'] [Step 133 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24744, 24763] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24744, 24763] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 7] Tasks: ['Single QA'] | Lens: [58390] → Tgt Spa: ['0.350'] [Step 133 / Rank 6] Tasks: ['Single QA'] | Lens: [58390] → Tgt Spa: ['0.350'] [Step 133 / Rank 5] Tasks: ['Single QA'] | Lens: [45438] → Tgt Spa: ['0.350'] [Step 133 / Rank 0] Tasks: ['Single QA'] | Lens: [33901] → Tgt Spa: ['0.350'] [Step 133 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54942] → Tgt Spa: ['1.000'] [Step 
133 / Rank 1] Tasks: ['Single QA'] | Lens: [33901] → Tgt Spa: ['0.350'] [Step 133 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [31151, 31144] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [31151, 31144] → Tgt Spa: ['1.000', '1.000'] [Step 133 / Rank 4] Tasks: ['Single QA'] | Lens: [45438] → Tgt Spa: ['0.350'] [Step 133 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54942] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 00:38:17,184 >> @ 133 | Loss: 1.9821 | LM: 1.9067 | Reg: 0.0754 | Spa(Avg): 0.580 [INFO|lh_trainer.py:797] 2026-02-17 00:38:17,184 >> Statistic -> Code | Spa: 0.664 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 00:38:17,184 >> Statistic -> In-Context | Spa: 0.692 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:38:17,184 >> Statistic -> MultiHop | Spa: 0.542 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:38:17,184 >> Statistic -> Single | Spa: 0.401 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:38:17,184 >> Statistic -> Summarization | Spa: 0.625 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:810] 2026-02-17 00:38:17,186 >> [Micro-Log] {"loss": 1.9821204151958227, "lm_loss": 1.9067052028452356, "reg_loss": 0.075415197138985, "model_sparsity(avg)": 0.5802469154198965, "Spa-In-Context Learning sparsity": 0.6921296417713165, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10003215819597244, "Spa-Summarization sparsity": 0.6249999850988388, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1082288958132267, "Spa-Code sparsity": 0.6638888716697693, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09462092220783233, "Spa-Single QA sparsity": 0.40079365032059805, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02953284711111337, "Spa-MultiHop QA sparsity": 0.5416666269302368, "Spa-MultiHop 
QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06829550117254257, "step": 133, "current_tau": 1.079500436782837, "lambda1 Single QA": 0.55078125, "lambda2 MultiHop QA": 0.283203125, "lambda3 Summarization": 0.12353515625, "lambda4 Code": 0.22265625} [INFO|lh_trainer.py:331] 2026-02-17 00:38:37,533 >> {'loss': 11.8927, 'grad_norm': 0.8652845025062561, 'learning_rate': 0.0003942863080806787, 'epoch': 0.14112690889942076, 'num_input_tokens_seen': 329185696, 'completed': '44.67% (134 / 300)', 'remaining time': '7:46:49', 'throughput': '6878.75', 'gpu_mem_free': '14955MB', 'step': 134} [Step 134 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [32654, 32654] → Tgt Spa: ['1.000', '0.350'] [Step 134 / Rank 6] Tasks: ['Single QA'] | Lens: [36152] → Tgt Spa: ['0.350'] [Step 134 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60585] → Tgt Spa: ['1.000'] [Step 134 / Rank 7] Tasks: ['Single QA'] | Lens: [36152] → Tgt Spa: ['0.350'] [Step 134 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40094] → Tgt Spa: ['1.000'] [Step 134 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60585] → Tgt Spa: ['1.000'] [Step 134 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40094] → Tgt Spa: ['1.000'] [Step 134 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [32654, 32654] → Tgt Spa: ['1.000', '0.350'] [Step 134 / Rank 1] Tasks: ['Single QA'] | Lens: [61870] → Tgt Spa: ['0.350'] [Step 134 / Rank 4] Tasks: ['In-Context Learning', 'Single QA', 'Single QA'] | Lens: [21831, 21831, 21831] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 134 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18956, 18957, 18970] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 134 / Rank 3] Tasks: ['Single QA'] | Lens: [42185] → Tgt Spa: ['0.350'] [Step 134 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18956, 18957, 18970] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 134 / Rank 5] Tasks: ['In-Context Learning', 'Single QA', 'Single QA'] | Lens: 
[21831, 21831, 21831] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 134 / Rank 2] Tasks: ['Single QA'] | Lens: [42185] → Tgt Spa: ['0.350'] [Step 134 / Rank 0] Tasks: ['Single QA'] | Lens: [61870] → Tgt Spa: ['0.350'] [Step 134 / Rank 2] Tasks: ['Code'] | Lens: [44253] → Tgt Spa: ['1.000'] [Step 134 / Rank 4] Tasks: ['Single QA'] | Lens: [62754] → Tgt Spa: ['0.350'] [Step 134 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16507, 16507, 16496] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 134 / Rank 7] Tasks: ['Single QA'] | Lens: [61800] → Tgt Spa: ['0.350'] [Step 134 / Rank 6] Tasks: ['Single QA'] | Lens: [61800] → Tgt Spa: ['0.350'] [Step 134 / Rank 5] Tasks: ['Single QA'] | Lens: [62754] → Tgt Spa: ['0.350'] [Step 134 / Rank 3] Tasks: ['Code'] | Lens: [44253] → Tgt Spa: ['1.000'] [Step 134 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16507, 16507, 16496] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 134 / Rank 5] Tasks: ['Single QA'] | Lens: [62338] → Tgt Spa: ['0.350'] [Step 134 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24137, 24139] → Tgt Spa: ['0.350', '1.000'] [Step 134 / Rank 2] Tasks: ['Single QA'] | Lens: [50120] → Tgt Spa: ['0.350'] [Step 134 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24437, 24438] → Tgt Spa: ['1.000', '0.350'] [Step 134 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24137, 24139] → Tgt Spa: ['0.350', '1.000'] [Step 134 / Rank 4] Tasks: ['Single QA'] | Lens: [62338] → Tgt Spa: ['0.350'] [Step 134 / Rank 3] Tasks: ['Single QA'] | Lens: [50120] → Tgt Spa: ['0.350'] [Step 134 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24437, 24438] → Tgt Spa: ['1.000', '0.350'] [Step 134 / Rank 5] Tasks: ['In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Code', 'Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 
'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code'] | Lens: [3683, 3684, 3690, 3684, 3684, 3685, 3692, 3685, 3686, 3704, 3687, 3687, 3688, 3688, 3690, 3689, 3696] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 134 / Rank 0] Tasks: ['Code'] | Lens: [38961] → Tgt Spa: ['1.000'] [Step 134 / Rank 2] Tasks: ['Single QA'] | Lens: [42308] → Tgt Spa: ['0.350'] [Step 134 / Rank 6] Tasks: ['Single QA'] | Lens: [62754] → Tgt Spa: ['0.350'] [Step 134 / Rank 3] Tasks: ['Single QA'] | Lens: [42308] → Tgt Spa: ['0.350'] [Step 134 / Rank 4] Tasks: ['In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Code', 'Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code'] | Lens: [3683, 3684, 3690, 3684, 3684, 3685, 3692, 3685, 3686, 3704, 3687, 3687, 3688, 3688, 3690, 3689, 3696] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 134 / Rank 7] Tasks: ['Single QA'] | Lens: [62754] → Tgt Spa: ['0.350'] [Step 134 / Rank 1] Tasks: ['Code'] | Lens: [38961] → Tgt Spa: ['1.000'] [Step 134 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [51561] → Tgt Spa: ['1.000'] [Step 134 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [51561] → Tgt Spa: ['1.000'] [Step 134 / Rank 1] Tasks: ['Single QA'] | Lens: [58707] → Tgt Spa: ['0.350'] [Step 134 / Rank 0] Tasks: ['Single QA'] | Lens: [58707] → Tgt Spa: ['0.350'] [Step 134 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32158, 32158] → Tgt Spa: ['0.350', '0.350'] [Step 134 / Rank 2] Tasks: ['Single QA'] | Lens: [47774] → Tgt Spa: ['0.350'] [Step 134 / Rank 3] Tasks: ['Single QA'] | Lens: [47774] → Tgt Spa: ['0.350'] [Step 134 / 
Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32158, 32158] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:41:28,763 >> @ 134 | Loss: 2.0942 | LM: 2.0176 | Reg: 0.0766 | Spa(Avg): 0.524 [INFO|lh_trainer.py:797] 2026-02-17 00:41:28,763 >> Statistic -> Code | Spa: 0.611 | Tgt: 1.000 | Z-Loss: 0.114 | [INFO|lh_trainer.py:797] 2026-02-17 00:41:28,763 >> Statistic -> In-Context | Spa: 0.682 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:41:28,763 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:41:28,763 >> Statistic -> Single | Spa: 0.465 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:41:28,763 >> Statistic -> Summarization | Spa: 0.660 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:810] 2026-02-17 00:41:28,765 >> [Micro-Log] {"loss": 2.094245683401823, "lm_loss": 2.017637688666582, "reg_loss": 0.07660800389324625, "model_sparsity(avg)": 0.5235169207056364, "Spa-In-Context Learning sparsity": 0.6824073870976766, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.104165218770504, "Spa-Single QA sparsity": 0.4649470817475092, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07332372603317101, "Spa-Summarization sparsity": 0.6597222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.0937008410692215, "Spa-Code sparsity": 0.6111111119389534, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1139696380123496, "Spa-MultiHop QA sparsity": 0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11730450391769409, "step": 134, "current_tau": 1.0763354301452637, "lambda1 Single QA": 0.5546875, "lambda2 MultiHop QA": 0.283203125, "lambda3 Summarization": 0.1240234375, "lambda4 Code": 0.2236328125} [INFO|lh_trainer.py:331] 2026-02-17 00:41:51,662 >> {'loss': 12.5655, 'grad_norm': 0.6714434027671814, 
'learning_rate': 0.0003916015700710523, 'epoch': 0.14218009478672985, 'num_input_tokens_seen': 331756854, 'completed': '45.00% (135 / 300)', 'remaining time': '7:44:31', 'throughput': '6622.30', 'gpu_mem_free': '6971MB', 'step': 135} [Step 135 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [27417, 27419] → Tgt Spa: ['0.350', '0.350'] [Step 135 / Rank 7] Tasks: ['Single QA'] | Lens: [46749] → Tgt Spa: ['0.350'] [Step 135 / Rank 1] Tasks: ['Single QA'] | Lens: [34970] → Tgt Spa: ['0.350'] [Step 135 / Rank 0] Tasks: ['Single QA'] | Lens: [34970] → Tgt Spa: ['0.350'] [Step 135 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [27417, 27419] → Tgt Spa: ['0.350', '0.350'] [Step 135 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [24121, 24130] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 6] Tasks: ['Single QA'] | Lens: [46749] → Tgt Spa: ['0.350'] [Step 135 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [24121, 24130] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55450] → Tgt Spa: ['1.000'] [Step 135 / Rank 0] Tasks: ['Single QA'] | Lens: [55055] → Tgt Spa: ['0.350'] [Step 135 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55450] → Tgt Spa: ['1.000'] [Step 135 / Rank 5] Tasks: ['Single QA'] | Lens: [51381] → Tgt Spa: ['0.350'] [Step 135 / Rank 1] Tasks: ['Single QA'] | Lens: [55055] → Tgt Spa: ['0.350'] [Step 135 / Rank 7] Tasks: ['Single QA'] | Lens: [64678] → Tgt Spa: ['0.350'] [Step 135 / Rank 6] Tasks: ['Single QA'] | Lens: [64678] → Tgt Spa: ['0.350'] [Step 135 / Rank 4] Tasks: ['Single QA'] | Lens: [51381] → Tgt Spa: ['0.350'] [Step 135 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [31268, 31268] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [29817, 29825] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [29817, 29825] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 1] Tasks: ['Code', 
'Code', 'Code'] | Lens: [17997, 17999, 18003] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 135 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [17997, 17999, 18003] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 135 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17454, 17454, 17456] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 135 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17454, 17454, 17456] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 135 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [31268, 31268] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 0] Tasks: ['Code', 'Summarization'] | Lens: [22761, 22773] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [24033, 24034] → Tgt Spa: ['0.350', '0.350'] [Step 135 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57011] → Tgt Spa: ['1.000'] [Step 135 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57011] → Tgt Spa: ['1.000'] [Step 135 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [24033, 24034] → Tgt Spa: ['0.350', '0.350'] [Step 135 / Rank 1] Tasks: ['Code', 'Summarization'] | Lens: [22761, 22773] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 7] Tasks: ['Code'] | Lens: [36203] → Tgt Spa: ['1.000'] [Step 135 / Rank 6] Tasks: ['Code'] | Lens: [36203] → Tgt Spa: ['1.000'] [Step 135 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26497, 26498] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26497, 26498] → Tgt Spa: ['1.000', '1.000'] [Step 135 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64604] → Tgt Spa: ['1.000'] [Step 135 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64604] → Tgt Spa: ['1.000'] [Step 135 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14600, 14601, 14602, 14603] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 135 / Rank 6] Tasks: ['Single QA'] | 
Lens: [56984] → Tgt Spa: ['0.350'] [Step 135 / Rank 7] Tasks: ['Single QA'] | Lens: [56984] → Tgt Spa: ['0.350'] [Step 135 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14600, 14601, 14602, 14603] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 135 / Rank 3] Tasks: ['Single QA'] | Lens: [48767] → Tgt Spa: ['0.350'] [Step 135 / Rank 6] Tasks: ['Single QA', 'Code', 'In-Context Learning', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7903, 7912, 7904, 7912, 7906, 7906, 7906, 7906] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 135 / Rank 0] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [20395, 20413, 20403] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 135 / Rank 2] Tasks: ['Single QA'] | Lens: [48767] → Tgt Spa: ['0.350'] [Step 135 / Rank 7] Tasks: ['Single QA', 'Code', 'In-Context Learning', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7903, 7912, 7904, 7912, 7906, 7906, 7906, 7906] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 135 / Rank 5] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16519, 16512, 16521] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 135 / Rank 4] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16519, 16512, 16521] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 135 / Rank 1] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [20395, 20413, 20403] → Tgt Spa: ['0.350', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 00:44:16,270 >> @ 135 | Loss: 1.8936 | LM: 1.8207 | Reg: 0.0729 | Spa(Avg): 0.528 [INFO|lh_trainer.py:797] 2026-02-17 00:44:16,271 >> Statistic -> Code | Spa: 0.637 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:797] 2026-02-17 00:44:16,271 >> Statistic -> In-Context | Spa: 0.679 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:44:16,271 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 00:44:16,271 >> Statistic -> Single | Spa: 0.400 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:44:16,271 >> Statistic -> Summarization | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.130 | [INFO|lh_trainer.py:810] 2026-02-17 00:44:16,273 >> [Micro-Log] {"loss": 1.8935620840638876, "lm_loss": 1.8207027005652587, "reg_loss": 0.07285939503344707, "model_sparsity(avg)": 0.5282600286106268, "Spa-Single QA sparsity": 0.40013227576301214, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.029928020722720595, "Spa-Code sparsity": 0.6365740746259689, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10558517090976238, "Spa-Summarization sparsity": 0.5833333219800677, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13025677523442677, "Spa-In-Context Learning sparsity": 0.6790123316976759, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1054787395728959, "Spa-MultiHop QA sparsity": 0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11730450391769409, "step": 135, "current_tau": 1.073223352432251, "lambda1 Single QA": 0.5546875, "lambda2 MultiHop QA": 0.283203125, "lambda3 Summarization": 0.12451171875, "lambda4 Code": 0.224609375} [INFO|lh_trainer.py:331] 2026-02-17 00:44:33,337 >> {'loss': 11.3614, 'grad_norm': 0.7775278091430664, 'learning_rate': 0.0003888925693656447, 'epoch': 0.14323328067403895, 'num_input_tokens_seen': 334321854, 'completed': '45.33% (136 / 300)', 'remaining time': '7:41:34', 'throughput': '7932.56', 'gpu_mem_free': '6763MB', 'step': 136} [Step 136 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61970] → Tgt Spa: ['1.000'] [Step 136 / Rank 0] Tasks: ['Single QA'] | Lens: [46458] → Tgt Spa: ['0.350'] [Step 136 / Rank 4] Tasks: ['Single QA'] | Lens: [41929] → Tgt Spa: ['0.350'] [Step 136 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning', 
'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA'] | Lens: [3944, 3945, 3964, 3964, 3946, 3948, 3947, 3948, 3948, 3949, 3948, 3951, 3970, 3953, 3954, 3956] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 136 / Rank 1] Tasks: ['Single QA'] | Lens: [46458] → Tgt Spa: ['0.350'] [Step 136 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA'] | Lens: [3944, 3945, 3964, 3964, 3946, 3948, 3947, 3948, 3948, 3949, 3948, 3951, 3970, 3953, 3954, 3956] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 136 / Rank 5] Tasks: ['Single QA'] | Lens: [41929] → Tgt Spa: ['0.350'] [Step 136 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61970] → Tgt Spa: ['1.000'] [Step 136 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [23210, 23212] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [23210, 23212] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 2] Tasks: ['Single QA'] | Lens: [40263] → Tgt Spa: ['0.350'] [Step 136 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61097] → Tgt Spa: ['1.000'] [Step 136 / Rank 0] Tasks: ['Single QA'] | Lens: [58798] → Tgt Spa: ['0.350'] [Step 136 / Rank 1] Tasks: ['Single QA'] | Lens: [58798] → Tgt Spa: ['0.350'] [Step 136 / Rank 3] Tasks: ['Single QA'] | Lens: [40263] → Tgt Spa: ['0.350'] [Step 136 / Rank 6] Tasks: ['In-Context 
Learning'] | Lens: [61097] → Tgt Spa: ['1.000'] [Step 136 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26016, 26017] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 2] Tasks: ['Single QA'] | Lens: [57564] → Tgt Spa: ['0.350'] [Step 136 / Rank 3] Tasks: ['Single QA'] | Lens: [57564] → Tgt Spa: ['0.350'] [Step 136 / Rank 1] Tasks: ['Summarization', 'Summarization'] | Lens: [24213, 24213] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16613, 16625, 16627] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 136 / Rank 0] Tasks: ['Summarization', 'Summarization'] | Lens: [24213, 24213] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16613, 16625, 16627] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 136 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26016, 26017] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 7] Tasks: ['Single QA'] | Lens: [44003] → Tgt Spa: ['0.350'] [Step 136 / Rank 0] Tasks: ['Single QA'] | Lens: [63022] → Tgt Spa: ['0.350'] [Step 136 / Rank 6] Tasks: ['Single QA'] | Lens: [44003] → Tgt Spa: ['0.350'] [Step 136 / Rank 1] Tasks: ['Single QA'] | Lens: [63022] → Tgt Spa: ['0.350'] [Step 136 / Rank 3] Tasks: ['Single QA'] | Lens: [42304] → Tgt Spa: ['0.350'] [Step 136 / Rank 4] Tasks: ['In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization'] | Lens: [3559, 3578, 3578, 3561, 3564, 3561, 3561, 3561, 3563, 3564, 3564, 3564, 3563, 3564, 3564, 3565, 3566, 3584] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 
136 / Rank 5] Tasks: ['In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization'] | Lens: [3559, 3578, 3578, 3561, 3564, 3561, 3561, 3561, 3563, 3564, 3564, 3564, 3563, 3564, 3564, 3565, 3566, 3584] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 136 / Rank 2] Tasks: ['Single QA'] | Lens: [42304] → Tgt Spa: ['0.350'] [Step 136 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24462, 24445] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [22707, 22714] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [22707, 22714] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 0] Tasks: ['Code'] | Lens: [58456] → Tgt Spa: ['1.000'] [Step 136 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32112, 32113] → Tgt Spa: ['0.350', '0.350'] [Step 136 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32112, 32113] → Tgt Spa: ['0.350', '0.350'] [Step 136 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24462, 24445] → Tgt Spa: ['1.000', '1.000'] [Step 136 / Rank 1] Tasks: ['Code'] | Lens: [58456] → Tgt Spa: ['1.000'] [Step 136 / Rank 6] Tasks: ['Single QA'] | Lens: [58957] → Tgt Spa: ['0.350'] [Step 136 / Rank 1] Tasks: ['Single QA'] | Lens: [58822] → Tgt Spa: ['0.350'] [Step 136 / Rank 2] Tasks: ['Single QA'] | Lens: [35713] → Tgt Spa: ['0.350'] [Step 136 / Rank 3] Tasks: ['Single QA'] | Lens: [35713] → Tgt Spa: ['0.350'] [Step 136 / Rank 5] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [9317, 9324, 9331, 9335, 9328, 9339, 9338] → Tgt 
Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 136 / Rank 7] Tasks: ['Single QA'] | Lens: [58957] → Tgt Spa: ['0.350'] [Step 136 / Rank 4] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [9317, 9324, 9331, 9335, 9328, 9339, 9338] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 136 / Rank 0] Tasks: ['Single QA'] | Lens: [58822] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:47:15,281 >> @ 136 | Loss: 2.1799 | LM: 2.1055 | Reg: 0.0744 | Spa(Avg): 0.511 [INFO|lh_trainer.py:797] 2026-02-17 00:47:15,281 >> Statistic -> Code | Spa: 0.592 | Tgt: 1.000 | Z-Loss: 0.122 | [INFO|lh_trainer.py:797] 2026-02-17 00:47:15,281 >> Statistic -> In-Context | Spa: 0.673 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:47:15,281 >> Statistic -> MultiHop | Spa: 0.628 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:47:15,281 >> Statistic -> Single | Spa: 0.459 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:47:15,281 >> Statistic -> Summarization | Spa: 0.597 | Tgt: 1.000 | Z-Loss: 0.124 | [INFO|lh_trainer.py:810] 2026-02-17 00:47:15,283 >> [Micro-Log] {"loss": 2.179932475090027, "lm_loss": 2.105507170781493, "reg_loss": 0.07442531751197141, "model_sparsity(avg)": 0.5108346194028854, "Spa-Single QA sparsity": 0.4590277820825577, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06746769380988553, "Spa-Summarization sparsity": 0.5972222252325579, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12392078543251211, "Spa-Code sparsity": 0.5916666686534882, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.12184651717543601, "Spa-In-Context Learning sparsity": 0.6729166746139527, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10770531706511974, "Spa-MultiHop QA sparsity": 0.6280864079793295, "Spa-MultiHop QA 
target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1100892358356052, "step": 136, "current_tau": 1.0701650381088257, "lambda1 Single QA": 0.5546875, "lambda2 MultiHop QA": 0.28515625, "lambda3 Summarization": 0.125, "lambda4 Code": 0.224609375} [INFO|lh_trainer.py:331] 2026-02-17 00:47:38,307 >> {'loss': 13.0796, 'grad_norm': 0.6510213613510132, 'learning_rate': 0.00038615977013778093, 'epoch': 0.14428646656134808, 'num_input_tokens_seen': 336876626, 'completed': '45.67% (137 / 300)', 'remaining time': '7:39:04', 'throughput': '6905.91', 'gpu_mem_free': '7181MB', 'step': 137} [Step 137 / Rank 5] Tasks: ['Single QA'] | Lens: [37661] → Tgt Spa: ['0.350'] [Step 137 / Rank 0] Tasks: ['Single QA'] | Lens: [54996] → Tgt Spa: ['0.350'] [Step 137 / Rank 2] Tasks: ['Single QA'] | Lens: [41582] → Tgt Spa: ['0.350'] [Step 137 / Rank 3] Tasks: ['Single QA'] | Lens: [41582] → Tgt Spa: ['0.350'] [Step 137 / Rank 7] Tasks: ['Single QA'] | Lens: [49409] → Tgt Spa: ['0.350'] [Step 137 / Rank 4] Tasks: ['Single QA'] | Lens: [37661] → Tgt Spa: ['0.350'] [Step 137 / Rank 6] Tasks: ['Single QA'] | Lens: [49409] → Tgt Spa: ['0.350'] [Step 137 / Rank 1] Tasks: ['Single QA'] | Lens: [54996] → Tgt Spa: ['0.350'] [Step 137 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64643] → Tgt Spa: ['1.000'] [Step 137 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64643] → Tgt Spa: ['1.000'] [Step 137 / Rank 3] Tasks: ['Summarization', 'Summarization'] | Lens: [32765, 32770] → Tgt Spa: ['1.000', '1.000'] [Step 137 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23144, 23165] → Tgt Spa: ['1.000', '1.000'] [Step 137 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23144, 23165] → Tgt Spa: ['1.000', '1.000'] [Step 137 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [60255] → Tgt Spa: ['1.000'] [Step 137 / Rank 2] Tasks: ['Summarization', 'Summarization'] | Lens: [32765, 32770] → Tgt Spa: ['1.000', '1.000'] [Step 137 / Rank 7] Tasks: ['In-Context 
Learning'] | Lens: [60255] → Tgt Spa: ['1.000'] [Step 137 / Rank 4] Tasks: ['Code'] | Lens: [56314] → Tgt Spa: ['1.000'] [Step 137 / Rank 1] Tasks: ['Single QA'] | Lens: [65040] → Tgt Spa: ['0.350'] [Step 137 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [21821, 21821, 21842] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 137 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55015] → Tgt Spa: ['1.000'] [Step 137 / Rank 5] Tasks: ['Code'] | Lens: [56314] → Tgt Spa: ['1.000'] [Step 137 / Rank 0] Tasks: ['Single QA'] | Lens: [65040] → Tgt Spa: ['0.350'] [Step 137 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55015] → Tgt Spa: ['1.000'] [Step 137 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [21821, 21821, 21842] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 137 / Rank 1] Tasks: ['Single QA'] | Lens: [40013] → Tgt Spa: ['0.350'] [Step 137 / Rank 5] Tasks: ['Single QA'] | Lens: [39582] → Tgt Spa: ['0.350'] [Step 137 / Rank 7] Tasks: ['Code'] | Lens: [45576] → Tgt Spa: ['1.000'] [Step 137 / Rank 0] Tasks: ['Single QA'] | Lens: [40013] → Tgt Spa: ['0.350'] [Step 137 / Rank 4] Tasks: ['Single QA'] | Lens: [39582] → Tgt Spa: ['0.350'] [Step 137 / Rank 2] Tasks: ['Code'] | Lens: [51866] → Tgt Spa: ['1.000'] [Step 137 / Rank 3] Tasks: ['Code'] | Lens: [51866] → Tgt Spa: ['1.000'] [Step 137 / Rank 6] Tasks: ['Code'] | Lens: [45576] → Tgt Spa: ['1.000'] [Step 137 / Rank 1] Tasks: ['Single QA'] | Lens: [55627] → Tgt Spa: ['0.350'] [Step 137 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [19001, 19004, 19006] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 137 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [19001, 19004, 19006] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 137 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [63306] → Tgt Spa: ['0.350'] [Step 137 / Rank 4] Tasks: ['Code'] | Lens: [37706] → Tgt Spa: ['1.000'] [Step 137 / Rank 0] Tasks: ['Single QA'] | Lens: [55627] → Tgt Spa: 
['0.350'] [Step 137 / Rank 5] Tasks: ['Code'] | Lens: [37706] → Tgt Spa: ['1.000'] [Step 137 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [63306] → Tgt Spa: ['0.350'] [Step 137 / Rank 5] Tasks: ['Code'] | Lens: [42817] → Tgt Spa: ['1.000'] [Step 137 / Rank 3] Tasks: ['Single QA'] | Lens: [65042] → Tgt Spa: ['0.350'] [Step 137 / Rank 2] Tasks: ['Single QA'] | Lens: [65042] → Tgt Spa: ['0.350'] [Step 137 / Rank 7] Tasks: ['Code'] | Lens: [35604] → Tgt Spa: ['1.000'] [Step 137 / Rank 1] Tasks: ['Single QA'] | Lens: [39526] → Tgt Spa: ['0.350'] [Step 137 / Rank 6] Tasks: ['Code'] | Lens: [35604] → Tgt Spa: ['1.000'] [Step 137 / Rank 4] Tasks: ['Code'] | Lens: [42817] → Tgt Spa: ['1.000'] [Step 137 / Rank 0] Tasks: ['Single QA'] | Lens: [39526] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:50:19,108 >> @ 137 | Loss: 1.8092 | LM: 1.7302 | Reg: 0.0790 | Spa(Avg): 0.542 [INFO|lh_trainer.py:797] 2026-02-17 00:50:19,108 >> Statistic -> Code | Spa: 0.637 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:797] 2026-02-17 00:50:19,108 >> Statistic -> In-Context | Spa: 0.657 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:50:19,108 >> Statistic -> MultiHop | Spa: 0.569 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:50:19,108 >> Statistic -> Single | Spa: 0.414 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:50:19,109 >> Statistic -> Summarization | Spa: 0.649 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:810] 2026-02-17 00:50:19,111 >> [Micro-Log] {"loss": 1.809200791021188, "lm_loss": 1.7302324107537668, "reg_loss": 0.07896839286938, "model_sparsity(avg)": 0.5419560198982557, "Spa-Single QA sparsity": 0.4138888955116272, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.038437687302939595, "Spa-In-Context Learning sparsity": 0.6574074029922485, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11416733140746753, "Spa-Summarization 
sparsity": 0.6493055522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09852487035095692, "Spa-Code sparsity": 0.6373456716537476, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10570399794313642, "Spa-MultiHop QA sparsity": 0.5694444179534912, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08099891245365143, "step": 137, "current_tau": 1.0671615600585938, "lambda1 Single QA": 0.5546875, "lambda2 MultiHop QA": 0.28515625, "lambda3 Summarization": 0.1259765625, "lambda4 Code": 0.2255859375} [INFO|lh_trainer.py:331] 2026-02-17 00:50:45,996 >> {'loss': 10.8552, 'grad_norm': 0.8339363932609558, 'learning_rate': 0.00038340364063854, 'epoch': 0.14533965244865718, 'num_input_tokens_seen': 339348464, 'completed': '46.00% (138 / 300)', 'remaining time': '7:36:37', 'throughput': '6584.94', 'gpu_mem_free': '11851MB', 'step': 138} [Step 138 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23932, 23932] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [31421, 31413] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23932, 23932] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16918, 16930, 16921] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 138 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [31421, 31413] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [19313, 19314, 19318] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 138 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [19313, 19314, 19318] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 138 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16918, 16930, 16921] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 138 / Rank 7] Tasks: ['Code'] | Lens: [62534] → Tgt Spa: ['1.000'] [Step 138 / Rank 6] Tasks: ['Code'] 
| Lens: [62534] → Tgt Spa: ['1.000'] [Step 138 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56504] → Tgt Spa: ['1.000'] [Step 138 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24387, 24387] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56504] → Tgt Spa: ['1.000'] [Step 138 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24881, 24882] → Tgt Spa: ['1.000', '0.350'] [Step 138 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24881, 24882] → Tgt Spa: ['1.000', '0.350'] [Step 138 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24387, 24387] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [9539, 9548, 9558, 9558, 9552, 9557] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 138 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25568, 25568] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [9539, 9548, 9558, 9558, 9552, 9557] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 138 / Rank 6] Tasks: ['Single QA'] | Lens: [49857] → Tgt Spa: ['0.350'] [Step 138 / Rank 1] Tasks: ['Single QA'] | Lens: [39702] → Tgt Spa: ['0.350'] [Step 138 / Rank 7] Tasks: ['Single QA'] | Lens: [49857] → Tgt Spa: ['0.350'] [Step 138 / Rank 0] Tasks: ['Single QA'] | Lens: [39702] → Tgt Spa: ['0.350'] [Step 138 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25568, 25568] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 3] Tasks: ['Single QA'] | Lens: [65023] → Tgt Spa: ['0.350'] [Step 138 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [28196, 28177] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 2] Tasks: ['Single QA'] | Lens: [65023] → Tgt Spa: ['0.350'] [Step 138 / Rank 7] Tasks: ['Single QA'] 
| Lens: [59396] → Tgt Spa: ['0.350'] [Step 138 / Rank 5] Tasks: ['Single QA'] | Lens: [52712] → Tgt Spa: ['0.350'] [Step 138 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [28196, 28177] → Tgt Spa: ['1.000', '1.000'] [Step 138 / Rank 4] Tasks: ['Single QA'] | Lens: [52712] → Tgt Spa: ['0.350'] [Step 138 / Rank 6] Tasks: ['Single QA'] | Lens: [59396] → Tgt Spa: ['0.350'] [Step 138 / Rank 5] Tasks: ['Single QA'] | Lens: [36113] → Tgt Spa: ['0.350'] [Step 138 / Rank 4] Tasks: ['Single QA'] | Lens: [36113] → Tgt Spa: ['0.350'] [Step 138 / Rank 6] Tasks: ['Summarization', 'In-Context Learning', 'Summarization'] | Lens: [21582, 21563, 21582] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 138 / Rank 7] Tasks: ['Summarization', 'In-Context Learning', 'Summarization'] | Lens: [21582, 21563, 21582] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 138 / Rank 1] Tasks: ['Single QA'] | Lens: [59432] → Tgt Spa: ['0.350'] [Step 138 / Rank 0] Tasks: ['Single QA'] | Lens: [59432] → Tgt Spa: ['0.350'] [Step 138 / Rank 3] Tasks: ['Single QA', 'Code', 'Code', 'Single QA'] | Lens: [13104, 13120, 13121, 13113] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 138 / Rank 2] Tasks: ['Single QA', 'Code', 'Code', 'Single QA'] | Lens: [13104, 13120, 13121, 13113] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 138 / Rank 5] Tasks: ['Single QA'] | Lens: [38379] → Tgt Spa: ['0.350'] [Step 138 / Rank 2] Tasks: ['Single QA'] | Lens: [42422] → Tgt Spa: ['0.350'] [Step 138 / Rank 1] Tasks: ['Single QA'] | Lens: [39219] → Tgt Spa: ['0.350'] [Step 138 / Rank 4] Tasks: ['Single QA'] | Lens: [38379] → Tgt Spa: ['0.350'] [Step 138 / Rank 3] Tasks: ['Single QA'] | Lens: [42422] → Tgt Spa: ['0.350'] [Step 138 / Rank 0] Tasks: ['Single QA'] | Lens: [39219] → Tgt Spa: ['0.350'] [Step 138 / Rank 6] Tasks: ['Single QA'] | Lens: [36388] → Tgt Spa: ['0.350'] [Step 138 / Rank 7] Tasks: ['Single QA'] | Lens: [36388] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:53:14,232 >> @ 
138 | Loss: 2.2114 | LM: 2.1389 | Reg: 0.0725 | Spa(Avg): 0.517 [INFO|lh_trainer.py:797] 2026-02-17 00:53:14,232 >> Statistic -> Code | Spa: 0.601 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:797] 2026-02-17 00:53:14,233 >> Statistic -> In-Context | Spa: 0.682 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:53:14,233 >> Statistic -> MultiHop | Spa: 0.569 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:53:14,233 >> Statistic -> Single | Spa: 0.448 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:53:14,233 >> Statistic -> Summarization | Spa: 0.646 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:810] 2026-02-17 00:53:14,235 >> [Micro-Log] {"loss": 2.211352750658989, "lm_loss": 2.1388564966619015, "reg_loss": 0.07249627852191527, "model_sparsity(avg)": 0.516589509944121, "Spa-Code sparsity": 0.6010100895708258, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11914905499328267, "Spa-In-Context Learning sparsity": 0.68181819265539, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10522008280862462, "Spa-Single QA sparsity": 0.4475308623578813, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06355456803511414, "Spa-Summarization sparsity": 0.6458333432674408, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10038851015269756, "Spa-MultiHop QA sparsity": 0.5694444179534912, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08099891245365143, "step": 138, "current_tau": 1.064213752746582, "lambda1 Single QA": 0.5546875, "lambda2 MultiHop QA": 0.28515625, "lambda3 Summarization": 0.1259765625, "lambda4 Code": 0.2265625} [INFO|lh_trainer.py:331] 2026-02-17 00:53:27,988 >> {'loss': 13.2681, 'grad_norm': 0.7218100428581238, 'learning_rate': 0.0003806246531165231, 'epoch': 0.1463928383359663, 'num_input_tokens_seen': 341823736, 'completed': '46.33% (139 / 300)', 'remaining 
time': '7:33:40', 'throughput': '7640.07', 'gpu_mem_free': '13295MB', 'step': 139} [Step 139 / Rank 4] Tasks: ['Single QA'] | Lens: [56683] → Tgt Spa: ['0.350'] [Step 139 / Rank 2] Tasks: ['Single QA'] | Lens: [52020] → Tgt Spa: ['0.350'] [Step 139 / Rank 1] Tasks: ['Code'] | Lens: [36468] → Tgt Spa: ['1.000'] [Step 139 / Rank 3] Tasks: ['Single QA'] | Lens: [52020] → Tgt Spa: ['0.350'] [Step 139 / Rank 0] Tasks: ['Code'] | Lens: [36468] → Tgt Spa: ['1.000'] [Step 139 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26129, 26130] → Tgt Spa: ['1.000', '1.000'] [Step 139 / Rank 5] Tasks: ['Single QA'] | Lens: [56683] → Tgt Spa: ['0.350'] [Step 139 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26129, 26130] → Tgt Spa: ['1.000', '1.000'] [Step 139 / Rank 0] Tasks: ['Single QA'] | Lens: [40458] → Tgt Spa: ['0.350'] [Step 139 / Rank 1] Tasks: ['Single QA'] | Lens: [40458] → Tgt Spa: ['0.350'] [Step 139 / Rank 5] Tasks: ['Single QA'] | Lens: [54592] → Tgt Spa: ['0.350'] [Step 139 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54789] → Tgt Spa: ['1.000'] [Step 139 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [28678, 28675] → Tgt Spa: ['1.000', '1.000'] [Step 139 / Rank 4] Tasks: ['Single QA'] | Lens: [54592] → Tgt Spa: ['0.350'] [Step 139 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [28678, 28675] → Tgt Spa: ['1.000', '1.000'] [Step 139 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54789] → Tgt Spa: ['1.000'] [Step 139 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [27847, 27848] → Tgt Spa: ['1.000', '1.000'] [Step 139 / Rank 3] Tasks: ['Single QA'] | Lens: [55002] → Tgt Spa: ['0.350'] [Step 139 / Rank 7] Tasks: ['Single QA'] | Lens: [51924] → Tgt Spa: ['0.350'] [Step 139 / Rank 0] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [17671, 17689, 17680] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 139 / Rank 6] Tasks: ['Single QA'] | Lens: [51924] → Tgt Spa: ['0.350'] [Step 139 / Rank 2] Tasks: 
['Single QA'] | Lens: [55002] → Tgt Spa: ['0.350'] [Step 139 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [27847, 27848] → Tgt Spa: ['1.000', '1.000'] [Step 139 / Rank 1] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [17671, 17689, 17680] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 139 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29220, 29220] → Tgt Spa: ['0.350', '0.350'] [Step 139 / Rank 1] Tasks: ['Single QA'] | Lens: [54422] → Tgt Spa: ['0.350'] [Step 139 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Code', 'Single QA', 'Code'] | Lens: [5954, 5947, 5954, 5951, 5958, 5952, 5953, 5953, 5960, 5955, 5960] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 139 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29182, 29183] → Tgt Spa: ['0.350', '0.350'] [Step 139 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Code', 'Single QA', 'Code'] | Lens: [5954, 5947, 5954, 5951, 5958, 5952, 5953, 5953, 5960, 5955, 5960] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 139 / Rank 0] Tasks: ['Single QA'] | Lens: [54422] → Tgt Spa: ['0.350'] [Step 139 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29220, 29220] → Tgt Spa: ['0.350', '0.350'] [Step 139 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29182, 29183] → Tgt Spa: ['0.350', '0.350'] [Step 139 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4470, 4473, 4478, 4478, 4472, 4473, 4473, 4493, 4483, 4474, 4474, 4475, 4475, 4475] → Tgt Spa: ['1.000', 
'1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 139 / Rank 5] Tasks: ['Single QA'] | Lens: [39386] → Tgt Spa: ['0.350'] [Step 139 / Rank 6] Tasks: ['Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Code', 'Code', 'MultiHop QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA'] | Lens: [3032, 3027, 3027, 3045, 3028, 3028, 3034, 3028, 3028, 3046, 3035, 3034, 3030, 3037, 3031, 3030, 3030, 3030, 3030, 3031, 3031] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 139 / Rank 4] Tasks: ['Single QA'] | Lens: [39386] → Tgt Spa: ['0.350'] [Step 139 / Rank 7] Tasks: ['Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Code', 'Code', 'MultiHop QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA'] | Lens: [3032, 3027, 3027, 3045, 3028, 3028, 3034, 3028, 3028, 3046, 3035, 3034, 3030, 3037, 3031, 3030, 3030, 3030, 3030, 3031, 3031] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 139 / Rank 2] Tasks: ['Single QA'] | Lens: [39592] → Tgt Spa: ['0.350'] [Step 139 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: 
[4470, 4473, 4478, 4478, 4472, 4473, 4473, 4493, 4483, 4474, 4474, 4475, 4475, 4475] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 139 / Rank 3] Tasks: ['Single QA'] | Lens: [39592] → Tgt Spa: ['0.350'] [Step 139 / Rank 4] Tasks: ['Code'] | Lens: [44359] → Tgt Spa: ['1.000'] [Step 139 / Rank 5] Tasks: ['Code'] | Lens: [44359] → Tgt Spa: ['1.000'] [Step 139 / Rank 1] Tasks: ['Single QA'] | Lens: [64975] → Tgt Spa: ['0.350'] [Step 139 / Rank 0] Tasks: ['Single QA'] | Lens: [64975] → Tgt Spa: ['0.350'] [Step 139 / Rank 3] Tasks: ['Code'] | Lens: [52977] → Tgt Spa: ['1.000'] [Step 139 / Rank 2] Tasks: ['Code'] | Lens: [52977] → Tgt Spa: ['1.000'] [Step 139 / Rank 6] Tasks: ['Single QA'] | Lens: [53811] → Tgt Spa: ['0.350'] [Step 139 / Rank 7] Tasks: ['Single QA'] | Lens: [53811] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:55:40,794 >> @ 139 | Loss: 1.8032 | LM: 1.7439 | Reg: 0.0592 | Spa(Avg): 0.488 [INFO|lh_trainer.py:797] 2026-02-17 00:55:40,795 >> Statistic -> Code | Spa: 0.632 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:797] 2026-02-17 00:55:40,795 >> Statistic -> In-Context | Spa: 0.667 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:55:40,795 >> Statistic -> MultiHop | Spa: 0.627 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:55:40,795 >> Statistic -> Single | Spa: 0.421 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:55:40,795 >> Statistic -> Summarization | Spa: 0.594 | Tgt: 1.000 | Z-Loss: 0.126 | [INFO|lh_trainer.py:810] 2026-02-17 00:55:40,797 >> [Micro-Log] {"loss": 1.8031614708403747, "lm_loss": 1.7439478716502588, "reg_loss": 0.059213606883228444, "model_sparsity(avg)": 0.48830943057934445, "Spa-Code sparsity": 0.6319444477558136, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10804548524320126, "Spa-Single QA sparsity": 0.4214975833892822, "Spa-Single QA 
target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04902776825727652, "Spa-Summarization sparsity": 0.5937499850988388, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12550429813563824, "Spa-In-Context Learning sparsity": 0.6666666656732559, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11081822775304317, "Spa-MultiHop QA sparsity": 0.6269841364451817, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1099135588322367, "step": 139, "current_tau": 1.061322569847107, "lambda1 Single QA": 0.5546875, "lambda2 MultiHop QA": 0.287109375, "lambda3 Summarization": 0.126953125, "lambda4 Code": 0.2265625} [INFO|lh_trainer.py:331] 2026-02-17 00:56:07,573 >> {'loss': 10.819, 'grad_norm': 0.6435384154319763, 'learning_rate': 0.0003778232837369358, 'epoch': 0.1474460242232754, 'num_input_tokens_seen': 344380626, 'completed': '46.67% (140 / 300)', 'remaining time': '7:30:40', 'throughput': '8011.10', 'gpu_mem_free': '3971MB', 'step': 140} [Step 140 / Rank 7] Tasks: ['Code'] | Lens: [63479] → Tgt Spa: ['1.000'] [Step 140 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29912, 29913] → Tgt Spa: ['1.000', '1.000'] [Step 140 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27719, 27738] → Tgt Spa: ['1.000', '1.000'] [Step 140 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27719, 27738] → Tgt Spa: ['1.000', '1.000'] [Step 140 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55772] → Tgt Spa: ['1.000'] [Step 140 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55772] → Tgt Spa: ['1.000'] [Step 140 / Rank 6] Tasks: ['Code'] | Lens: [63479] → Tgt Spa: ['1.000'] [Step 140 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29912, 29913] → Tgt Spa: ['1.000', '1.000'] [Step 140 / Rank 1] Tasks: ['Single QA'] | Lens: [45348] → Tgt Spa: ['0.350'] [Step 140 / Rank 3] Tasks: ['Single QA'] | Lens: 
[39270] → Tgt Spa: ['0.350'] [Step 140 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [54147] → Tgt Spa: ['1.000'] [Step 140 / Rank 2] Tasks: ['Single QA'] | Lens: [39270] → Tgt Spa: ['0.350'] [Step 140 / Rank 7] Tasks: ['Single QA'] | Lens: [34465] → Tgt Spa: ['0.350'] [Step 140 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [54147] → Tgt Spa: ['1.000'] [Step 140 / Rank 6] Tasks: ['Single QA'] | Lens: [34465] → Tgt Spa: ['0.350'] [Step 140 / Rank 0] Tasks: ['Single QA'] | Lens: [45348] → Tgt Spa: ['0.350'] [Step 140 / Rank 6] Tasks: ['Single QA'] | Lens: [41479] → Tgt Spa: ['0.350'] [Step 140 / Rank 1] Tasks: ['Single QA'] | Lens: [65272] → Tgt Spa: ['0.350'] [Step 140 / Rank 3] Tasks: ['Single QA'] | Lens: [49864] → Tgt Spa: ['0.350'] [Step 140 / Rank 2] Tasks: ['Single QA'] | Lens: [49864] → Tgt Spa: ['0.350'] [Step 140 / Rank 7] Tasks: ['Single QA'] | Lens: [41479] → Tgt Spa: ['0.350'] [Step 140 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [44328] → Tgt Spa: ['1.000'] [Step 140 / Rank 0] Tasks: ['Single QA'] | Lens: [65272] → Tgt Spa: ['0.350'] [Step 140 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [44328] → Tgt Spa: ['1.000'] [Step 140 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25436, 25438] → Tgt Spa: ['1.000', '1.000'] [Step 140 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25436, 25438] → Tgt Spa: ['1.000', '1.000'] [Step 140 / Rank 7] Tasks: ['Code'] | Lens: [42020] → Tgt Spa: ['1.000'] [Step 140 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44387] → Tgt Spa: ['1.000'] [Step 140 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44387] → Tgt Spa: ['1.000'] [Step 140 / Rank 3] Tasks: ['Single QA'] | Lens: [48104] → Tgt Spa: ['0.350'] [Step 140 / Rank 2] Tasks: ['Single QA'] | Lens: [48104] → Tgt Spa: ['0.350'] [Step 140 / Rank 6] Tasks: ['Code'] | Lens: [42020] → Tgt Spa: ['1.000'] [Step 140 / Rank 5] Tasks: ['Single QA'] | Lens: [51715] → Tgt Spa: ['0.350'] [Step 140 / Rank 0] Tasks: 
['In-Context Learning'] | Lens: [40345] → Tgt Spa: ['1.000'] [Step 140 / Rank 6] Tasks: ['Code'] | Lens: [44677] → Tgt Spa: ['1.000'] [Step 140 / Rank 3] Tasks: ['Single QA'] | Lens: [53709] → Tgt Spa: ['0.350'] [Step 140 / Rank 4] Tasks: ['Single QA'] | Lens: [51715] → Tgt Spa: ['0.350'] [Step 140 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40345] → Tgt Spa: ['1.000'] [Step 140 / Rank 2] Tasks: ['Single QA'] | Lens: [53709] → Tgt Spa: ['0.350'] [Step 140 / Rank 7] Tasks: ['Code'] | Lens: [44677] → Tgt Spa: ['1.000'] [Step 140 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [26259, 26260] → Tgt Spa: ['0.350', '0.350'] [Step 140 / Rank 6] Tasks: ['Single QA'] | Lens: [35231] → Tgt Spa: ['0.350'] [Step 140 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42880] → Tgt Spa: ['1.000'] [Step 140 / Rank 1] Tasks: ['Single QA'] | Lens: [40546] → Tgt Spa: ['0.350'] [Step 140 / Rank 0] Tasks: ['Single QA'] | Lens: [40546] → Tgt Spa: ['0.350'] [Step 140 / Rank 7] Tasks: ['Single QA'] | Lens: [35231] → Tgt Spa: ['0.350'] [Step 140 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42880] → Tgt Spa: ['1.000'] [Step 140 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [26259, 26260] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-17 00:58:33,925 >> @ 140 | Loss: 2.2713 | LM: 2.2060 | Reg: 0.0653 | Spa(Avg): 0.513 [INFO|lh_trainer.py:797] 2026-02-17 00:58:33,925 >> Statistic -> Code | Spa: 0.657 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 00:58:33,925 >> Statistic -> In-Context | Spa: 0.669 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:58:33,925 >> Statistic -> MultiHop | Spa: 0.627 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:58:33,925 >> Statistic -> Single | Spa: 0.358 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 00:58:33,925 >> Statistic -> Summarization | Spa: 0.611 | Tgt: 1.000 | Z-Loss: 0.117 | [INFO|lh_trainer.py:810] 2026-02-17 00:58:33,928 >> 
[Micro-Log] {"loss": 2.271317737797896, "lm_loss": 2.2059746806820235, "reg_loss": 0.06534304080802637, "model_sparsity(avg)": 0.5127314664423466, "Spa-In-Context Learning sparsity": 0.6691918969154358, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11060727049003947, "Spa-Single QA sparsity": 0.3579059701699477, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.022371810413180634, "Spa-Summarization sparsity": 0.6111111044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11650632321834564, "Spa-Code sparsity": 0.6574074029922485, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09902803599834442, "Spa-MultiHop QA sparsity": 0.6269841364451817, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1099135588322367, "step": 140, "current_tau": 1.0584888458251953, "lambda1 Single QA": 0.55859375, "lambda2 MultiHop QA": 0.287109375, "lambda3 Summarization": 0.1279296875, "lambda4 Code": 0.2275390625} [INFO|lh_trainer.py:331] 2026-02-17 00:58:47,765 >> {'loss': 13.6279, 'grad_norm': 0.8053638339042664, 'learning_rate': 0.0003750000125, 'epoch': 0.1484992101105845, 'num_input_tokens_seen': 346692052, 'completed': '47.00% (141 / 300)', 'remaining time': '7:27:41', 'throughput': '7214.53', 'gpu_mem_free': '12725MB', 'step': 141} [Step 141 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55308] → Tgt Spa: ['1.000'] [Step 141 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59661] → Tgt Spa: ['1.000'] [Step 141 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [21531, 21523, 21539] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 141 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [30797, 30807] → Tgt Spa: ['1.000', '1.000'] [Step 141 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [21531, 21523, 21539] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 141 / Rank 5] Tasks: ['In-Context Learning'] | 
Lens: [55308] → Tgt Spa: ['1.000'] [Step 141 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [30797, 30807] → Tgt Spa: ['1.000', '1.000'] [Step 141 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59661] → Tgt Spa: ['1.000'] [Step 141 / Rank 3] Tasks: ['Single QA'] | Lens: [58131] → Tgt Spa: ['0.350'] [Step 141 / Rank 1] Tasks: ['Code', 'Code', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [6931, 6932, 6930, 6949, 6936, 6930, 6931, 6931, 6933] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 141 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA'] | Lens: [20716, 20723, 20717] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 141 / Rank 0] Tasks: ['Code', 'Code', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [6931, 6932, 6930, 6949, 6936, 6930, 6931, 6931, 6933] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 141 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA'] | Lens: [20716, 20723, 20717] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 141 / Rank 4] Tasks: ['Single QA'] | Lens: [34710] → Tgt Spa: ['0.350'] [Step 141 / Rank 5] Tasks: ['Single QA'] | Lens: [34710] → Tgt Spa: ['0.350'] [Step 141 / Rank 2] Tasks: ['Single QA'] | Lens: [58131] → Tgt Spa: ['0.350'] [Step 141 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22725, 22725] → Tgt Spa: ['1.000', '1.000'] [Step 141 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22725, 22725] → Tgt Spa: ['1.000', '1.000'] [Step 141 / Rank 7] Tasks: ['Single QA'] | Lens: [35601] → Tgt Spa: ['0.350'] [Step 141 / Rank 4] Tasks: ['Summarization'] | Lens: [33676] → Tgt Spa: ['1.000'] [Step 141 / Rank 6] Tasks: ['Single QA'] | Lens: [35601] → Tgt Spa: ['0.350'] [Step 141 / Rank 2] Tasks: ['Code', 'Summarization', 
'Summarization'] | Lens: [17179, 17190, 17191] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 141 / Rank 3] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17179, 17190, 17191] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 141 / Rank 5] Tasks: ['Summarization'] | Lens: [33676] → Tgt Spa: ['1.000'] [Step 141 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39762] → Tgt Spa: ['1.000'] [Step 141 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45789] → Tgt Spa: ['1.000'] [Step 141 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39762] → Tgt Spa: ['1.000'] [Step 141 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [45789] → Tgt Spa: ['1.000'] [Step 141 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [19721, 19733, 19736] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 141 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23657, 23658] → Tgt Spa: ['1.000', '1.000'] [Step 141 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [19721, 19733, 19736] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 141 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23657, 23658] → Tgt Spa: ['1.000', '1.000'] [Step 141 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15898, 15898, 15898, 15898] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 141 / Rank 7] Tasks: ['Single QA'] | Lens: [43809] → Tgt Spa: ['0.350'] [Step 141 / Rank 3] Tasks: ['Code'] | Lens: [34847] → Tgt Spa: ['1.000'] [Step 141 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [19896, 19908, 19912] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 141 / Rank 6] Tasks: ['Single QA'] | Lens: [43809] → Tgt Spa: ['0.350'] [Step 141 / Rank 2] Tasks: ['Code'] | Lens: [34847] → Tgt Spa: ['1.000'] [Step 141 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15898, 15898, 15898, 15898] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 141 / Rank 0] Tasks: ['Code', 
'Summarization', 'Summarization'] | Lens: [19896, 19908, 19912] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 141 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42926] → Tgt Spa: ['1.000'] [Step 141 / Rank 1] Tasks: ['Single QA'] | Lens: [52248] → Tgt Spa: ['0.350'] [Step 141 / Rank 3] Tasks: ['Code'] | Lens: [47859] → Tgt Spa: ['1.000'] [Step 141 / Rank 2] Tasks: ['Code'] | Lens: [47859] → Tgt Spa: ['1.000'] [Step 141 / Rank 6] Tasks: ['Code', 'Single QA'] | Lens: [24171, 24163] → Tgt Spa: ['1.000', '0.350'] [Step 141 / Rank 0] Tasks: ['Single QA'] | Lens: [52248] → Tgt Spa: ['0.350'] [Step 141 / Rank 7] Tasks: ['Code', 'Single QA'] | Lens: [24171, 24163] → Tgt Spa: ['1.000', '0.350'] [Step 141 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42926] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:00:48,599 >> @ 141 | Loss: 2.2168 | LM: 2.1267 | Reg: 0.0901 | Spa(Avg): 0.576 [INFO|lh_trainer.py:797] 2026-02-17 01:00:48,599 >> Statistic -> Code | Spa: 0.637 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:797] 2026-02-17 01:00:48,599 >> Statistic -> In-Context | Spa: 0.685 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:00:48,599 >> Statistic -> MultiHop | Spa: 0.627 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:00:48,599 >> Statistic -> Single | Spa: 0.461 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:00:48,599 >> Statistic -> Summarization | Spa: 0.594 | Tgt: 1.000 | Z-Loss: 0.128 | [INFO|lh_trainer.py:810] 2026-02-17 01:00:48,601 >> [Micro-Log] {"loss": 2.2168007058401904, "lm_loss": 2.1266948046783605, "reg_loss": 0.09010589387617074, "model_sparsity(avg)": 0.5758744863172373, "Spa-In-Context Learning sparsity": 0.6848290608479426, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10491728954590283, "Spa-Code sparsity": 0.6365740646918615, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10753047838807106, "Spa-Single QA 
sparsity": 0.4613095223903656, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07332971512473055, "Spa-Summarization sparsity": 0.5944444358348846, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12779420614242554, "Spa-MultiHop QA sparsity": 0.6269841364451817, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1099135588322367, "step": 141, "current_tau": 1.0557135343551636, "lambda1 Single QA": 0.55859375, "lambda2 MultiHop QA": 0.287109375, "lambda3 Summarization": 0.1279296875, "lambda4 Code": 0.228515625} [INFO|lh_trainer.py:331] 2026-02-17 01:01:07,540 >> {'loss': 13.3008, 'grad_norm': 0.9220738410949707, 'learning_rate': 0.00037215532315870774, 'epoch': 0.14955239599789363, 'num_input_tokens_seen': 349112532, 'completed': '47.33% (142 / 300)', 'remaining time': '7:24:19', 'throughput': '8658.49', 'gpu_mem_free': '8747MB', 'step': 142} [Step 142 / Rank 2] Tasks: ['Single QA'] | Lens: [46984] → Tgt Spa: ['0.350'] [Step 142 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42203] → Tgt Spa: ['1.000'] [Step 142 / Rank 0] Tasks: ['Single QA'] | Lens: [37812] → Tgt Spa: ['0.350'] [Step 142 / Rank 6] Tasks: ['Code'] | Lens: [38628] → Tgt Spa: ['1.000'] [Step 142 / Rank 3] Tasks: ['Single QA'] | Lens: [46984] → Tgt Spa: ['0.350'] [Step 142 / Rank 7] Tasks: ['Code'] | Lens: [38628] → Tgt Spa: ['1.000'] [Step 142 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42203] → Tgt Spa: ['1.000'] [Step 142 / Rank 1] Tasks: ['Single QA'] | Lens: [37812] → Tgt Spa: ['0.350'] [Step 142 / Rank 4] Tasks: ['Single QA'] | Lens: [46758] → Tgt Spa: ['0.350'] [Step 142 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [4449, 4448, 4448, 4457, 4449, 
4450, 4451, 4452, 4471, 4453, 4453, 4453, 4461, 4454] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 142 / Rank 5] Tasks: ['Single QA'] | Lens: [46758] → Tgt Spa: ['0.350'] [Step 142 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [4449, 4448, 4448, 4457, 4449, 4450, 4451, 4452, 4471, 4453, 4453, 4453, 4461, 4454] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 142 / Rank 0] Tasks: ['Single QA'] | Lens: [33837] → Tgt Spa: ['0.350'] [Step 142 / Rank 7] Tasks: ['Single QA'] | Lens: [47077] → Tgt Spa: ['0.350'] [Step 142 / Rank 1] Tasks: ['Single QA'] | Lens: [33837] → Tgt Spa: ['0.350'] [Step 142 / Rank 6] Tasks: ['Single QA'] | Lens: [47077] → Tgt Spa: ['0.350'] [Step 142 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4543, 4545, 4545, 4545, 4553, 4552, 4546, 4546, 4546, 4553, 4547, 4547, 4547, 4549] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 142 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4543, 4545, 4545, 4545, 4553, 4552, 4546, 4546, 
4546, 4553, 4547, 4547, 4547, 4549] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 142 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60754] → Tgt Spa: ['1.000'] [Step 142 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60754] → Tgt Spa: ['1.000'] [Step 142 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41206] → Tgt Spa: ['1.000'] [Step 142 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [28263, 28264] → Tgt Spa: ['0.350', '0.350'] [Step 142 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41206] → Tgt Spa: ['1.000'] [Step 142 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [28263, 28264] → Tgt Spa: ['0.350', '0.350'] [Step 142 / Rank 5] Tasks: ['Summarization'] | Lens: [33298] → Tgt Spa: ['1.000'] [Step 142 / Rank 4] Tasks: ['Summarization'] | Lens: [33298] → Tgt Spa: ['1.000'] [Step 142 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39921] → Tgt Spa: ['1.000'] [Step 142 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39921] → Tgt Spa: ['1.000'] [Step 142 / Rank 0] Tasks: ['Single QA'] | Lens: [45118] → Tgt Spa: ['0.350'] [Step 142 / Rank 2] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code', 'Single QA'] | Lens: [9664, 9669, 9671, 9666, 9675, 9669] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 142 / Rank 1] Tasks: ['Single QA'] | Lens: [45118] → Tgt Spa: ['0.350'] [Step 142 / Rank 3] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code', 'Single QA'] | Lens: [9664, 9669, 9671, 9666, 9675, 9669] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 142 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58396] → Tgt Spa: ['1.000'] [Step 142 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58396] → Tgt Spa: ['1.000'] [Step 142 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27667, 27669] → Tgt Spa: ['1.000', '1.000'] [Step 142 / Rank 0] Tasks: ['Code', 'Summarization', 'Code'] | 
Lens: [19209, 19223, 19212] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 142 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27667, 27669] → Tgt Spa: ['1.000', '1.000'] [Step 142 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19209, 19223, 19212] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 142 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [50400] → Tgt Spa: ['1.000'] [Step 142 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [50400] → Tgt Spa: ['1.000'] [Step 142 / Rank 5] Tasks: ['Single QA'] | Lens: [49226] → Tgt Spa: ['0.350'] [Step 142 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38151] → Tgt Spa: ['1.000'] [Step 142 / Rank 1] Tasks: ['Single QA'] | Lens: [58750] → Tgt Spa: ['0.350'] [Step 142 / Rank 4] Tasks: ['Single QA'] | Lens: [49226] → Tgt Spa: ['0.350'] [Step 142 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [23850, 23842] → Tgt Spa: ['1.000', '1.000'] [Step 142 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [23850, 23842] → Tgt Spa: ['1.000', '1.000'] [Step 142 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38151] → Tgt Spa: ['1.000'] [Step 142 / Rank 0] Tasks: ['Single QA'] | Lens: [58750] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:03:16,931 >> @ 142 | Loss: 2.1530 | LM: 2.0756 | Reg: 0.0774 | Spa(Avg): 0.551 [INFO|lh_trainer.py:797] 2026-02-17 01:03:16,931 >> Statistic -> Code | Spa: 0.613 | Tgt: 1.000 | Z-Loss: 0.116 | [INFO|lh_trainer.py:797] 2026-02-17 01:03:16,931 >> Statistic -> In-Context | Spa: 0.669 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:03:16,931 >> Statistic -> MultiHop | Spa: 0.627 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:03:16,931 >> Statistic -> Single | Spa: 0.442 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:03:16,931 >> Statistic -> Summarization | Spa: 0.634 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:810] 2026-02-17 01:03:16,933 >> [Micro-Log] {"loss": 
2.1529505935808024, "lm_loss": 2.0755563036849103, "reg_loss": 0.07739428983768448, "model_sparsity(avg)": 0.5511739415427049, "Spa-Single QA sparsity": 0.4424603113106319, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0602968428616545, "Spa-In-Context Learning sparsity": 0.6685185154279073, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11110144878427188, "Spa-Code sparsity": 0.6132478622289804, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11591648482359372, "Spa-Summarization sparsity": 0.6342592636744181, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10647180676460266, "Spa-MultiHop QA sparsity": 0.6269841364451817, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1099135588322367, "step": 142, "current_tau": 1.052997350692749, "lambda1 Single QA": 0.55859375, "lambda2 MultiHop QA": 0.287109375, "lambda3 Summarization": 0.12890625, "lambda4 Code": 0.228515625} [INFO|lh_trainer.py:331] 2026-02-17 01:03:39,543 >> {'loss': 12.9177, 'grad_norm': 0.8803617358207703, 'learning_rate': 0.00036928970313593307, 'epoch': 0.15060558188520273, 'num_input_tokens_seen': 351452022, 'completed': '47.67% (143 / 300)', 'remaining time': '7:21:12', 'throughput': '7695.53', 'gpu_mem_free': '7185MB', 'step': 143} [Step 143 / Rank 4] Tasks: ['Single QA'] | Lens: [35037] → Tgt Spa: ['0.350'] [Step 143 / Rank 5] Tasks: ['Single QA'] | Lens: [35037] → Tgt Spa: ['0.350'] [Step 143 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18424, 18413, 18413] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 143 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24902, 24902] → Tgt Spa: ['1.000', '0.350'] [Step 143 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18424, 18413, 18413] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 143 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16953, 16954, 16944] → Tgt 
Spa: ['1.000', '1.000', '1.000'] [Step 143 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24902, 24902] → Tgt Spa: ['1.000', '0.350'] [Step 143 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16953, 16954, 16944] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 143 / Rank 3] Tasks: ['Single QA'] | Lens: [52277] → Tgt Spa: ['0.350'] [Step 143 / Rank 1] Tasks: ['Single QA'] | Lens: [62747] → Tgt Spa: ['0.350'] [Step 143 / Rank 0] Tasks: ['Single QA'] | Lens: [62747] → Tgt Spa: ['0.350'] [Step 143 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58724] → Tgt Spa: ['1.000'] [Step 143 / Rank 5] Tasks: ['Single QA'] | Lens: [56324] → Tgt Spa: ['0.350'] [Step 143 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58724] → Tgt Spa: ['1.000'] [Step 143 / Rank 4] Tasks: ['Single QA'] | Lens: [56324] → Tgt Spa: ['0.350'] [Step 143 / Rank 2] Tasks: ['Single QA'] | Lens: [52277] → Tgt Spa: ['0.350'] [Step 143 / Rank 4] Tasks: ['Code'] | Lens: [33634] → Tgt Spa: ['1.000'] [Step 143 / Rank 5] Tasks: ['Code'] | Lens: [33634] → Tgt Spa: ['1.000'] [Step 143 / Rank 6] Tasks: ['Single QA'] | Lens: [35250] → Tgt Spa: ['0.350'] [Step 143 / Rank 7] Tasks: ['Single QA'] | Lens: [35250] → Tgt Spa: ['0.350'] [Step 143 / Rank 0] Tasks: ['Single QA'] | Lens: [48373] → Tgt Spa: ['0.350'] [Step 143 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [36360] → Tgt Spa: ['1.000'] [Step 143 / Rank 1] Tasks: ['Single QA'] | Lens: [48373] → Tgt Spa: ['0.350'] [Step 143 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [36360] → Tgt Spa: ['1.000'] [Step 143 / Rank 7] Tasks: ['Code'] | Lens: [35191] → Tgt Spa: ['1.000'] [Step 143 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [44074] → Tgt Spa: ['1.000'] [Step 143 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17472, 17460, 17474] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 143 / Rank 6] Tasks: ['Code'] | Lens: [35191] → Tgt Spa: ['1.000'] [Step 143 / Rank 0] Tasks: ['Summarization', 'Code', 
'Summarization'] | Lens: [17472, 17460, 17474] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 143 / Rank 3] Tasks: ['Single QA'] | Lens: [59928] → Tgt Spa: ['0.350'] [Step 143 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [44074] → Tgt Spa: ['1.000'] [Step 143 / Rank 2] Tasks: ['Single QA'] | Lens: [59928] → Tgt Spa: ['0.350'] [Step 143 / Rank 3] Tasks: ['Code'] | Lens: [56894] → Tgt Spa: ['1.000'] [Step 143 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39779] → Tgt Spa: ['1.000'] [Step 143 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39779] → Tgt Spa: ['1.000'] [Step 143 / Rank 1] Tasks: ['Single QA'] | Lens: [58634] → Tgt Spa: ['0.350'] [Step 143 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [24129, 24119] → Tgt Spa: ['1.000', '1.000'] [Step 143 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [24129, 24119] → Tgt Spa: ['1.000', '1.000'] [Step 143 / Rank 0] Tasks: ['Single QA'] | Lens: [58634] → Tgt Spa: ['0.350'] [Step 143 / Rank 2] Tasks: ['Code'] | Lens: [56894] → Tgt Spa: ['1.000'] [Step 143 / Rank 7] Tasks: ['Code'] | Lens: [62869] → Tgt Spa: ['1.000'] [Step 143 / Rank 0] Tasks: ['Single QA'] | Lens: [40378] → Tgt Spa: ['0.350'] [Step 143 / Rank 3] Tasks: ['Single QA'] | Lens: [40978] → Tgt Spa: ['0.350'] [Step 143 / Rank 1] Tasks: ['Single QA'] | Lens: [40378] → Tgt Spa: ['0.350'] [Step 143 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [29255, 29267] → Tgt Spa: ['1.000', '1.000'] [Step 143 / Rank 2] Tasks: ['Single QA'] | Lens: [40978] → Tgt Spa: ['0.350'] [Step 143 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [29255, 29267] → Tgt Spa: ['1.000', '1.000'] [Step 143 / Rank 6] Tasks: ['Code'] | Lens: [62869] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:05:59,773 >> @ 143 | Loss: 1.8451 | LM: 1.7791 | Reg: 0.0660 | Spa(Avg): 0.547 [INFO|lh_trainer.py:797] 2026-02-17 01:05:59,773 >> Statistic -> Code | Spa: 0.644 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:797] 2026-02-17 01:05:59,773 >> Statistic -> 
In-Context | Spa: 0.694 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:05:59,773 >> Statistic -> MultiHop | Spa: 0.627 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:05:59,774 >> Statistic -> Single | Spa: 0.386 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:05:59,774 >> Statistic -> Summarization | Spa: 0.662 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:810] 2026-02-17 01:05:59,776 >> [Micro-Log] {"loss": 1.8451354553302128, "lm_loss": 1.7791446751604478, "reg_loss": 0.06599079121466882, "model_sparsity(avg)": 0.5474537039796511, "Spa-In-Context Learning sparsity": 0.6944444378217062, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10208224505186081, "Spa-Single QA sparsity": 0.38636363636363635, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.021637692137367347, "Spa-Summarization sparsity": 0.6620370248953501, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09483972067634265, "Spa-Code sparsity": 0.6444444417953491, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10484152063727378, "Spa-MultiHop QA sparsity": 0.6269841364451817, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1099135588322367, "step": 143, "current_tau": 1.0503411293029785, "lambda1 Single QA": 0.55859375, "lambda2 MultiHop QA": 0.2890625, "lambda3 Summarization": 0.1298828125, "lambda4 Code": 0.2294921875} [INFO|lh_trainer.py:331] 2026-02-17 01:06:24,478 >> {'loss': 11.0708, 'grad_norm': 0.7481730580329895, 'learning_rate': 0.00036640364344091487, 'epoch': 0.15165876777251186, 'num_input_tokens_seen': 353797086, 'completed': '48.00% (144 / 300)', 'remaining time': '7:18:20', 'throughput': '7109.07', 'gpu_mem_free': '11689MB', 'step': 144} [Step 144 / Rank 7] Tasks: ['Single QA'] | Lens: [50821] → Tgt Spa: ['0.350'] [Step 144 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [23992, 23991] → 
Tgt Spa: ['1.000', '1.000'] [Step 144 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39662] → Tgt Spa: ['1.000'] [Step 144 / Rank 2] Tasks: ['Code'] | Lens: [44518] → Tgt Spa: ['1.000'] [Step 144 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [23992, 23991] → Tgt Spa: ['1.000', '1.000'] [Step 144 / Rank 3] Tasks: ['Code'] | Lens: [44518] → Tgt Spa: ['1.000'] [Step 144 / Rank 6] Tasks: ['Single QA'] | Lens: [50821] → Tgt Spa: ['0.350'] [Step 144 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39662] → Tgt Spa: ['1.000'] [Step 144 / Rank 4] Tasks: ['Single QA'] | Lens: [37385] → Tgt Spa: ['0.350'] [Step 144 / Rank 3] Tasks: ['Single QA'] | Lens: [65041] → Tgt Spa: ['0.350'] [Step 144 / Rank 5] Tasks: ['Single QA'] | Lens: [37385] → Tgt Spa: ['0.350'] [Step 144 / Rank 2] Tasks: ['Single QA'] | Lens: [65041] → Tgt Spa: ['0.350'] [Step 144 / Rank 6] Tasks: ['Single QA'] | Lens: [54006] → Tgt Spa: ['0.350'] [Step 144 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [54308] → Tgt Spa: ['1.000'] [Step 144 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [54308] → Tgt Spa: ['1.000'] [Step 144 / Rank 7] Tasks: ['Single QA'] | Lens: [54006] → Tgt Spa: ['0.350'] [Step 144 / Rank 1] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [8724, 8726, 8731, 8730, 8741, 8736, 8743] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 144 / Rank 5] Tasks: ['Single QA'] | Lens: [51845] → Tgt Spa: ['0.350'] [Step 144 / Rank 2] Tasks: ['Single QA'] | Lens: [45059] → Tgt Spa: ['0.350'] [Step 144 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [8115, 8122, 8118, 8119, 8120, 8120, 8121, 8129] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 144 / Rank 4] Tasks: ['Single QA'] | Lens: [51845] → Tgt Spa: ['0.350'] [Step 144 / Rank 0] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [8724, 
8726, 8731, 8730, 8741, 8736, 8743] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 144 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [8115, 8122, 8118, 8119, 8120, 8120, 8121, 8129] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 144 / Rank 3] Tasks: ['Single QA'] | Lens: [45059] → Tgt Spa: ['0.350'] [Step 144 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24211, 24212] → Tgt Spa: ['1.000', '0.350'] [Step 144 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24947, 24967] → Tgt Spa: ['1.000', '1.000'] [Step 144 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24947, 24967] → Tgt Spa: ['1.000', '1.000'] [Step 144 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [26188, 26180] → Tgt Spa: ['1.000', '1.000'] [Step 144 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29641, 29641] → Tgt Spa: ['0.350', '1.000'] [Step 144 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24211, 24212] → Tgt Spa: ['1.000', '0.350'] [Step 144 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29641, 29641] → Tgt Spa: ['0.350', '1.000'] [Step 144 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [26188, 26180] → Tgt Spa: ['1.000', '1.000'] [Step 144 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16914, 16916, 16927] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 144 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [23545, 23545] → Tgt Spa: ['0.350', '0.350'] [Step 144 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [36999] → Tgt Spa: ['1.000'] [Step 144 / Rank 1] Tasks: ['Single QA'] | Lens: [38418] → Tgt Spa: ['0.350'] [Step 144 / Rank 0] Tasks: ['Single QA'] | Lens: [38418] → Tgt Spa: ['0.350'] [Step 144 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16914, 16916, 16927] → Tgt Spa: ['1.000', '1.000', '1.000'] 
[Step 144 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [23545, 23545] → Tgt Spa: ['0.350', '0.350'] [Step 144 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [36999] → Tgt Spa: ['1.000'] [Step 144 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40788] → Tgt Spa: ['1.000'] [Step 144 / Rank 1] Tasks: ['Code', 'Summarization'] | Lens: [25061, 25072] → Tgt Spa: ['1.000', '1.000'] [Step 144 / Rank 0] Tasks: ['Code', 'Summarization'] | Lens: [25061, 25072] → Tgt Spa: ['1.000', '1.000'] [Step 144 / Rank 5] Tasks: ['Single QA'] | Lens: [51022] → Tgt Spa: ['0.350'] [Step 144 / Rank 4] Tasks: ['Single QA'] | Lens: [51022] → Tgt Spa: ['0.350'] [Step 144 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40788] → Tgt Spa: ['1.000'] [Step 144 / Rank 2] Tasks: ['Single QA'] | Lens: [40541] → Tgt Spa: ['0.350'] [Step 144 / Rank 3] Tasks: ['Single QA'] | Lens: [40541] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:08:30,779 >> @ 144 | Loss: 2.0365 | LM: 1.9649 | Reg: 0.0716 | Spa(Avg): 0.531 [INFO|lh_trainer.py:797] 2026-02-17 01:08:30,779 >> Statistic -> Code | Spa: 0.655 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:797] 2026-02-17 01:08:30,779 >> Statistic -> In-Context | Spa: 0.684 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:08:30,779 >> Statistic -> MultiHop | Spa: 0.627 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:08:30,780 >> Statistic -> Single | Spa: 0.460 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:08:30,780 >> Statistic -> Summarization | Spa: 0.620 | Tgt: 1.000 | Z-Loss: 0.115 | [INFO|lh_trainer.py:810] 2026-02-17 01:08:30,782 >> [Micro-Log] {"loss": 2.03646524126331, "lm_loss": 1.9648726706703503, "reg_loss": 0.07159256693072773, "model_sparsity(avg)": 0.5308745317161083, "Spa-Code sparsity": 0.6547619019235883, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10144761790122304, "Spa-In-Context Learning sparsity": 0.6840277761220932, "Spa-In-Context 
Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10609452333301306, "Spa-Single QA sparsity": 0.4603174499103001, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06949423362744883, "Spa-Summarization sparsity": 0.6203703681627909, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11533233523368835, "Spa-MultiHop QA sparsity": 0.6269841364451817, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1099135588322367, "step": 144, "current_tau": 1.047745704650879, "lambda1 Single QA": 0.55859375, "lambda2 MultiHop QA": 0.2890625, "lambda3 Summarization": 0.1298828125, "lambda4 Code": 0.23046875} [INFO|lh_trainer.py:331] 2026-02-17 01:08:48,914 >> {'loss': 12.2188, 'grad_norm': 0.7051401734352112, 'learning_rate': 0.0003634976385851242, 'epoch': 0.15271195365982096, 'num_input_tokens_seen': 356162002, 'completed': '48.33% (145 / 300)', 'remaining time': '7:15:05', 'throughput': '8186.73', 'gpu_mem_free': '11671MB', 'step': 145} [Step 145 / Rank 7] Tasks: ['Single QA'] | Lens: [59922] → Tgt Spa: ['0.350'] [Step 145 / Rank 6] Tasks: ['Single QA'] | Lens: [59922] → Tgt Spa: ['0.350'] [Step 145 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Single QA', 'Summarization', 'Code', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA'] | Lens: [2594, 2595, 2577, 2597, 2584, 2597, 2586, 2582, 2584, 2600, 2585, 2601, 2602, 2585, 2584, 2587, 2585, 2586, 2586, 2585, 2586, 2588, 2587, 2586, 2587] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', 
'0.350'] [Step 145 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25750, 25752] → Tgt Spa: ['0.350', '1.000'] [Step 145 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Single QA', 'Summarization', 'Code', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA'] | Lens: [2594, 2595, 2577, 2597, 2584, 2597, 2586, 2582, 2584, 2600, 2585, 2601, 2602, 2585, 2584, 2587, 2585, 2586, 2586, 2585, 2586, 2588, 2587, 2586, 2587] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 145 / Rank 0] Tasks: ['Single QA'] | Lens: [50937] → Tgt Spa: ['0.350'] [Step 145 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25750, 25752] → Tgt Spa: ['0.350', '1.000'] [Step 145 / Rank 1] Tasks: ['Single QA'] | Lens: [50937] → Tgt Spa: ['0.350'] [Step 145 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'Single QA'] | Lens: [21064, 21056, 21058] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 145 / Rank 5] Tasks: ['Code'] | Lens: [63748] → Tgt Spa: ['1.000'] [Step 145 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22072, 22054] → Tgt Spa: ['1.000', '1.000'] [Step 145 / Rank 0] Tasks: ['Single QA', 'Summarization', 'In-Context Learning'] | Lens: [20995, 21014, 20997] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 145 / Rank 4] Tasks: ['Code'] | Lens: [63748] → Tgt Spa: ['1.000'] [Step 145 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'Single QA'] | Lens: [21064, 21056, 21058] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 145 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22072, 22054] → Tgt 
Spa: ['1.000', '1.000'] [Step 145 / Rank 1] Tasks: ['Single QA', 'Summarization', 'In-Context Learning'] | Lens: [20995, 21014, 20997] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 145 / Rank 7] Tasks: ['Summarization'] | Lens: [39899] → Tgt Spa: ['1.000'] [Step 145 / Rank 5] Tasks: ['Single QA'] | Lens: [47492] → Tgt Spa: ['0.350'] [Step 145 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32126, 32127] → Tgt Spa: ['0.350', '0.350'] [Step 145 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32126, 32127] → Tgt Spa: ['0.350', '0.350'] [Step 145 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29198, 29199] → Tgt Spa: ['1.000', '1.000'] [Step 145 / Rank 6] Tasks: ['Summarization'] | Lens: [39899] → Tgt Spa: ['1.000'] [Step 145 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29198, 29199] → Tgt Spa: ['1.000', '1.000'] [Step 145 / Rank 4] Tasks: ['Single QA'] | Lens: [47492] → Tgt Spa: ['0.350'] [Step 145 / Rank 7] Tasks: ['Code'] | Lens: [60388] → Tgt Spa: ['1.000'] [Step 145 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [23659, 23659] → Tgt Spa: ['0.350', '0.350'] [Step 145 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [23659, 23659] → Tgt Spa: ['0.350', '0.350'] [Step 145 / Rank 2] Tasks: ['Single QA'] | Lens: [36883] → Tgt Spa: ['0.350'] [Step 145 / Rank 6] Tasks: ['Code'] | Lens: [60388] → Tgt Spa: ['1.000'] [Step 145 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56263] → Tgt Spa: ['1.000'] [Step 145 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56263] → Tgt Spa: ['1.000'] [Step 145 / Rank 3] Tasks: ['Single QA'] | Lens: [36883] → Tgt Spa: ['0.350'] [Step 145 / Rank 5] Tasks: ['Summarization'] | Lens: [34233] → Tgt Spa: ['1.000'] [Step 145 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60565] → Tgt Spa: ['1.000'] [Step 145 / Rank 6] Tasks: ['Single QA'] | Lens: [36330] → Tgt Spa: ['0.350'] [Step 145 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [63608] → Tgt Spa: ['1.000'] [Step 145 
/ Rank 7] Tasks: ['Single QA'] | Lens: [36330] → Tgt Spa: ['0.350'] [Step 145 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60565] → Tgt Spa: ['1.000'] [Step 145 / Rank 4] Tasks: ['Summarization'] | Lens: [34233] → Tgt Spa: ['1.000'] [Step 145 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [63608] → Tgt Spa: ['1.000'] [Step 145 / Rank 7] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 145 / Rank 6] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 145 / Rank 4] Tasks: ['Single QA'] | Lens: [61379] → Tgt Spa: ['0.350'] [Step 145 / Rank 5] Tasks: ['Single QA'] | Lens: [61379] → Tgt Spa: ['0.350'] [Step 145 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58736] → Tgt Spa: ['1.000'] [Step 145 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58736] → Tgt Spa: ['1.000'] [Step 145 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [9202, 9202, 9203, 9205, 9206, 9214, 9217] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 145 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [9202, 9202, 9203, 9205, 9206, 9214, 9217] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:11:28,290 >> @ 145 | Loss: 2.1290 | LM: 2.0495 | Reg: 0.0794 | Spa(Avg): 0.561 [INFO|lh_trainer.py:797] 2026-02-17 01:11:28,291 >> Statistic -> Code | Spa: 0.661 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 01:11:28,291 >> Statistic -> In-Context | Spa: 0.683 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:11:28,291 >> Statistic -> MultiHop | Spa: 0.653 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:11:28,291 >> Statistic -> Single | Spa: 0.478 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:11:28,291 >> Statistic -> Summarization | Spa: 0.631 | Tgt: 1.000 | Z-Loss: 0.110 | 
[INFO|lh_trainer.py:810] 2026-02-17 01:11:28,293 >> [Micro-Log] {"loss": 2.1289566011012844, "lm_loss": 2.0495399940603725, "reg_loss": 0.0794166284079741, "model_sparsity(avg)": 0.5608206577599049, "Spa-Single QA sparsity": 0.47777777910232544, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07880246944841929, "Spa-Summarization sparsity": 0.6313131180676547, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11004823852669109, "Spa-In-Context Learning sparsity": 0.6830808249386874, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10652007568966258, "Spa-Code sparsity": 0.6607142771993365, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09935885987111501, "Spa-MultiHop QA sparsity": 0.6527777697358813, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12372656006898199, "step": 145, "current_tau": 1.0452120304107666, "lambda1 Single QA": 0.55859375, "lambda2 MultiHop QA": 0.2890625, "lambda3 Summarization": 0.130859375, "lambda4 Code": 0.23046875} [INFO|lh_trainer.py:331] 2026-02-17 01:11:55,569 >> {'loss': 12.7737, 'grad_norm': 0.7327920794487, 'learning_rate': 0.0003605721864975331, 'epoch': 0.15376513954713006, 'num_input_tokens_seen': 358794470, 'completed': '48.67% (146 / 300)', 'remaining time': '7:12:36', 'throughput': '7051.69', 'gpu_mem_free': '7041MB', 'step': 146} [Step 146 / Rank 3] Tasks: ['Single QA'] | Lens: [45058] → Tgt Spa: ['0.350'] [Step 146 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [26602, 26611] → Tgt Spa: ['1.000', '1.000'] [Step 146 / Rank 6] Tasks: ['Single QA'] | Lens: [35674] → Tgt Spa: ['0.350'] [Step 146 / Rank 7] Tasks: ['Single QA'] | Lens: [35674] → Tgt Spa: ['0.350'] [Step 146 / Rank 1] Tasks: ['Summarization'] | Lens: [43337] → Tgt Spa: ['1.000'] [Step 146 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [26602, 26611] → Tgt Spa: ['1.000', '1.000'] [Step 146 / Rank 0] Tasks: 
['Summarization'] | Lens: [43337] → Tgt Spa: ['1.000'] [Step 146 / Rank 2] Tasks: ['Single QA'] | Lens: [45058] → Tgt Spa: ['0.350'] [Step 146 / Rank 4] Tasks: ['Single QA'] | Lens: [35574] → Tgt Spa: ['0.350'] [Step 146 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [17663, 17663, 17663] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 146 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [17663, 17663, 17663] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 146 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43145] → Tgt Spa: ['1.000'] [Step 146 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43145] → Tgt Spa: ['1.000'] [Step 146 / Rank 5] Tasks: ['Single QA'] | Lens: [35574] → Tgt Spa: ['0.350'] [Step 146 / Rank 1] Tasks: ['Single QA'] | Lens: [50596] → Tgt Spa: ['0.350'] [Step 146 / Rank 0] Tasks: ['Single QA'] | Lens: [50596] → Tgt Spa: ['0.350'] [Step 146 / Rank 7] Tasks: ['Single QA'] | Lens: [65354] → Tgt Spa: ['0.350'] [Step 146 / Rank 4] Tasks: ['Single QA'] | Lens: [50718] → Tgt Spa: ['0.350'] [Step 146 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58993] → Tgt Spa: ['1.000'] [Step 146 / Rank 5] Tasks: ['Single QA'] | Lens: [50718] → Tgt Spa: ['0.350'] [Step 146 / Rank 6] Tasks: ['Single QA'] | Lens: [65354] → Tgt Spa: ['0.350'] [Step 146 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24856, 24876] → Tgt Spa: ['1.000', '1.000'] [Step 146 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24856, 24876] → Tgt Spa: ['1.000', '1.000'] [Step 146 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58993] → Tgt Spa: ['1.000'] [Step 146 / Rank 7] Tasks: ['Summarization'] | Lens: [55972] → Tgt Spa: ['1.000'] [Step 146 / Rank 0] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19064, 19067, 19078] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 146 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [33478] → Tgt Spa: ['1.000'] [Step 146 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [17601, 17602, 
17605] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 146 / Rank 1] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19064, 19067, 19078] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 146 / Rank 6] Tasks: ['Summarization'] | Lens: [55972] → Tgt Spa: ['1.000'] [Step 146 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [33478] → Tgt Spa: ['1.000'] [Step 146 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [17601, 17602, 17605] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 146 / Rank 5] Tasks: ['Single QA'] | Lens: [38159] → Tgt Spa: ['0.350'] [Step 146 / Rank 7] Tasks: ['Single QA'] | Lens: [39897] → Tgt Spa: ['0.350'] [Step 146 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42702] → Tgt Spa: ['1.000'] [Step 146 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42702] → Tgt Spa: ['1.000'] [Step 146 / Rank 6] Tasks: ['Single QA'] | Lens: [39897] → Tgt Spa: ['0.350'] [Step 146 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27473, 27472] → Tgt Spa: ['1.000', '1.000'] [Step 146 / Rank 4] Tasks: ['Single QA'] | Lens: [38159] → Tgt Spa: ['0.350'] [Step 146 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27473, 27472] → Tgt Spa: ['1.000', '1.000'] [Step 146 / Rank 5] Tasks: ['Single QA'] | Lens: [36206] → Tgt Spa: ['0.350'] [Step 146 / Rank 1] Tasks: ['Single QA'] | Lens: [38651] → Tgt Spa: ['0.350'] [Step 146 / Rank 0] Tasks: ['Single QA'] | Lens: [38651] → Tgt Spa: ['0.350'] [Step 146 / Rank 7] Tasks: ['Single QA'] | Lens: [54766] → Tgt Spa: ['0.350'] [Step 146 / Rank 4] Tasks: ['Single QA'] | Lens: [36206] → Tgt Spa: ['0.350'] [Step 146 / Rank 3] Tasks: ['Single QA'] | Lens: [43754] → Tgt Spa: ['0.350'] [Step 146 / Rank 6] Tasks: ['Single QA'] | Lens: [54766] → Tgt Spa: ['0.350'] [Step 146 / Rank 2] Tasks: ['Single QA'] | Lens: [43754] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:14:07,626 >> @ 146 | Loss: 2.1265 | LM: 2.0685 | Reg: 0.0580 | Spa(Avg): 0.520 [INFO|lh_trainer.py:797] 2026-02-17 01:14:07,626 
>> Statistic -> Code | Spa: 0.660 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:797] 2026-02-17 01:14:07,626 >> Statistic -> In-Context | Spa: 0.689 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:14:07,626 >> Statistic -> MultiHop | Spa: 0.653 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:14:07,626 >> Statistic -> Single | Spa: 0.383 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:14:07,627 >> Statistic -> Summarization | Spa: 0.663 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:810] 2026-02-17 01:14:07,628 >> [Micro-Log] {"loss": 2.12648373345534, "lm_loss": 2.0684503304461637, "reg_loss": 0.058033393579535186, "model_sparsity(avg)": 0.5195794726411501, "Spa-Summarization sparsity": 0.6631944477558136, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.0961947962641716, "Spa-Single QA sparsity": 0.38333332935969033, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.021715487477680047, "Spa-In-Context Learning sparsity": 0.6892361119389534, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10479485522955656, "Spa-Code sparsity": 0.6597221891085306, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10032202924291293, "Spa-MultiHop QA sparsity": 0.6527777697358813, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12372656006898199, "step": 146, "current_tau": 1.0427405834197998, "lambda1 Single QA": 0.5625, "lambda2 MultiHop QA": 0.2890625, "lambda3 Summarization": 0.130859375, "lambda4 Code": 0.2314453125} [INFO|lh_trainer.py:331] 2026-02-17 01:14:27,904 >> {'loss': 12.7589, 'grad_norm': 0.6063867211341858, 'learning_rate': 0.0003576277884392964, 'epoch': 0.15481832543443919, 'num_input_tokens_seen': 361060330, 'completed': '49.00% (147 / 300)', 'remaining time': '7:09:31', 'throughput': '7437.11', 'gpu_mem_free': '13627MB', 'step': 147} [Step 147 / Rank 7] Tasks: 
['In-Context Learning', 'In-Context Learning'] | Lens: [26578, 26581] → Tgt Spa: ['1.000', '1.000'] [Step 147 / Rank 4] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [17578, 17586, 17586] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 147 / Rank 2] Tasks: ['Single QA'] | Lens: [51018] → Tgt Spa: ['0.350'] [Step 147 / Rank 1] Tasks: ['Single QA'] | Lens: [43253] → Tgt Spa: ['0.350'] [Step 147 / Rank 3] Tasks: ['Single QA'] | Lens: [51018] → Tgt Spa: ['0.350'] [Step 147 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26578, 26581] → Tgt Spa: ['1.000', '1.000'] [Step 147 / Rank 5] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [17578, 17586, 17586] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 147 / Rank 0] Tasks: ['Single QA'] | Lens: [43253] → Tgt Spa: ['0.350'] [Step 147 / Rank 4] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'MultiHop QA'] | Lens: [3723, 3731, 3724, 3724, 3726, 3726, 3727, 3728, 3727, 3747, 3736, 3730, 3730, 3731, 3730, 3731, 3731] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 147 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [52121] → Tgt Spa: ['1.000'] [Step 147 / Rank 7] Tasks: ['Single QA'] | Lens: [64436] → Tgt Spa: ['0.350'] [Step 147 / Rank 6] Tasks: ['Single QA'] | Lens: [64436] → Tgt Spa: ['0.350'] [Step 147 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'MultiHop QA', 'Single QA', 'MultiHop QA', 
'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1882, 1881, 1882, 1883, 1900, 1883, 1884, 1884, 1904, 1887, 1893, 1904, 1904, 1886, 1887, 1905, 1887, 1887, 1889, 1906, 1888, 1889, 1891, 1890, 1891, 1891, 1893, 1893, 1892, 1911, 1912, 1894, 1894, 1895] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 147 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [52121] → Tgt Spa: ['1.000'] [Step 147 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1882, 1881, 1882, 1883, 1900, 1883, 1884, 1884, 1904, 1887, 1893, 1904, 1904, 1886, 1887, 1905, 1887, 1887, 1889, 1906, 1888, 1889, 1891, 1890, 1891, 1891, 1893, 1893, 1892, 1911, 1912, 1894, 1894, 1895] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 147 / Rank 5] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 
'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'MultiHop QA'] | Lens: [3723, 3731, 3724, 3724, 3726, 3726, 3727, 3728, 3727, 3747, 3736, 3730, 3730, 3731, 3730, 3731, 3731] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 147 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16722, 16723, 16735] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 147 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [47336] → Tgt Spa: ['1.000'] [Step 147 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22107, 22091] → Tgt Spa: ['1.000', '1.000'] [Step 147 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22107, 22091] → Tgt Spa: ['1.000', '1.000'] [Step 147 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16722, 16723, 16735] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 147 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [47336] → Tgt Spa: ['1.000'] [Step 147 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [17501, 17504, 17503] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 147 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [17501, 17504, 17503] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 147 / Rank 2] Tasks: ['Single QA'] | Lens: [38857] → Tgt Spa: ['0.350'] [Step 147 / Rank 3] Tasks: ['Single QA'] | Lens: [38857] → Tgt Spa: ['0.350'] [Step 147 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39102] → Tgt Spa: ['1.000'] [Step 147 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39102] → Tgt Spa: ['1.000'] [Step 147 / Rank 7] Tasks: ['Code'] | Lens: [46323] → Tgt Spa: ['1.000'] [Step 147 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [18110, 18103, 18103] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 147 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [18110, 18103, 18103] → 
Tgt Spa: ['1.000', '0.350', '0.350'] [Step 147 / Rank 6] Tasks: ['Code'] | Lens: [46323] → Tgt Spa: ['1.000'] [Step 147 / Rank 1] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1721, 1740, 1741, 1723, 1723, 1723, 1723, 1724, 1725, 1742, 1725, 1725, 1743, 1725, 1726, 1727] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 147 / Rank 0] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1721, 1740, 1741, 1723, 1723, 1723, 1723, 1724, 1725, 1742, 1725, 1725, 1743, 1725, 1726, 1727] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 147 / Rank 2] Tasks: ['Single QA'] | Lens: [64998] → Tgt Spa: ['0.350'] [Step 147 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29178, 29180] → Tgt Spa: ['1.000', '0.350'] [Step 147 / Rank 3] Tasks: ['Single QA'] | Lens: [64998] → Tgt Spa: ['0.350'] [Step 147 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29178, 29180] → Tgt Spa: ['1.000', '0.350'] [Step 147 / Rank 7] Tasks: ['Code'] | Lens: [51161] → Tgt Spa: ['1.000'] [Step 147 / Rank 6] Tasks: ['Code'] | Lens: [51161] → Tgt Spa: ['1.000'] [Step 147 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [23942, 23937] → Tgt Spa: ['1.000', '0.350'] [Step 147 / Rank 7] Tasks: ['Single QA'] | Lens: [61832] → Tgt Spa: ['0.350'] [Step 147 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22413, 22395] → Tgt 
Spa: ['1.000', '1.000'] [Step 147 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [23942, 23937] → Tgt Spa: ['1.000', '0.350'] [Step 147 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22413, 22395] → Tgt Spa: ['1.000', '1.000'] [Step 147 / Rank 6] Tasks: ['Single QA'] | Lens: [61832] → Tgt Spa: ['0.350'] [Step 147 / Rank 3] Tasks: ['Single QA'] | Lens: [47447] → Tgt Spa: ['0.350'] [Step 147 / Rank 2] Tasks: ['Single QA'] | Lens: [47447] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:16:52,796 >> @ 147 | Loss: 1.9400 | LM: 1.8491 | Reg: 0.0909 | Spa(Avg): 0.609 [INFO|lh_trainer.py:797] 2026-02-17 01:16:52,796 >> Statistic -> Code | Spa: 0.688 | Tgt: 1.000 | Z-Loss: 0.090 | [INFO|lh_trainer.py:797] 2026-02-17 01:16:52,796 >> Statistic -> In-Context | Spa: 0.705 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:16:52,796 >> Statistic -> MultiHop | Spa: 0.648 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:16:52,797 >> Statistic -> Single | Spa: 0.544 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:16:52,797 >> Statistic -> Summarization | Spa: 0.668 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:810] 2026-02-17 01:16:52,799 >> [Micro-Log] {"loss": 1.9399713439246018, "lm_loss": 1.8490626855442922, "reg_loss": 0.09090862438703577, "model_sparsity(avg)": 0.6091041018565496, "Spa-Single QA sparsity": 0.5443121734119597, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.12379978247363829, "Spa-In-Context Learning sparsity": 0.704629651705424, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.09916388740142186, "Spa-MultiHop QA sparsity": 0.6484126959528241, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12135276453835624, "Spa-Summarization sparsity": 0.6675347350537777, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09358384367078543, "Spa-Code 
sparsity": 0.6884920724800655, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09013083896466664, "step": 147, "current_tau": 1.040332317352295, "lambda1 Single QA": 0.5625, "lambda2 MultiHop QA": 0.2890625, "lambda3 Summarization": 0.1318359375, "lambda4 Code": 0.232421875} [INFO|lh_trainer.py:331] 2026-02-17 01:17:18,799 >> {'loss': 11.6398, 'grad_norm': 0.7477701902389526, 'learning_rate': 0.0003546649489178636, 'epoch': 0.15587151132174829, 'num_input_tokens_seen': 363503210, 'completed': '49.33% (148 / 300)', 'remaining time': '7:06:45', 'throughput': '7147.31', 'gpu_mem_free': '11695MB', 'step': 148} [Step 148 / Rank 7] Tasks: ['Code'] | Lens: [55196] → Tgt Spa: ['1.000'] [Step 148 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29769, 29770] → Tgt Spa: ['0.350', '0.350'] [Step 148 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23388, 23388] → Tgt Spa: ['0.350', '1.000'] [Step 148 / Rank 5] Tasks: ['Code'] | Lens: [47439] → Tgt Spa: ['1.000'] [Step 148 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23388, 23388] → Tgt Spa: ['0.350', '1.000'] [Step 148 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29769, 29770] → Tgt Spa: ['0.350', '0.350'] [Step 148 / Rank 6] Tasks: ['Code'] | Lens: [55196] → Tgt Spa: ['1.000'] [Step 148 / Rank 4] Tasks: ['Code'] | Lens: [47439] → Tgt Spa: ['1.000'] [Step 148 / Rank 0] Tasks: ['Single QA'] | Lens: [58711] → Tgt Spa: ['0.350'] [Step 148 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25421, 25403] → Tgt Spa: ['1.000', '1.000'] [Step 148 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25421, 25403] → Tgt Spa: ['1.000', '1.000'] [Step 148 / Rank 6] Tasks: ['Single QA'] | Lens: [62854] → Tgt Spa: ['0.350'] [Step 148 / Rank 7] Tasks: ['Single QA'] | Lens: [62854] → Tgt Spa: ['0.350'] [Step 148 / Rank 1] Tasks: ['Single QA'] | Lens: [58711] → Tgt Spa: ['0.350'] [Step 148 / Rank 3] Tasks: ['Code'] | Lens: [43123] → Tgt Spa: ['1.000'] [Step 
148 / Rank 2] Tasks: ['Code'] | Lens: [43123] → Tgt Spa: ['1.000'] [Step 148 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15703, 15703, 15703, 15703] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 148 / Rank 7] Tasks: ['Single QA'] | Lens: [57711] → Tgt Spa: ['0.350'] [Step 148 / Rank 3] Tasks: ['Code'] | Lens: [36013] → Tgt Spa: ['1.000'] [Step 148 / Rank 2] Tasks: ['Code'] | Lens: [36013] → Tgt Spa: ['1.000'] [Step 148 / Rank 0] Tasks: ['Single QA'] | Lens: [55691] → Tgt Spa: ['0.350'] [Step 148 / Rank 6] Tasks: ['Single QA'] | Lens: [57711] → Tgt Spa: ['0.350'] [Step 148 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15703, 15703, 15703, 15703] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 148 / Rank 1] Tasks: ['Single QA'] | Lens: [55691] → Tgt Spa: ['0.350'] [Step 148 / Rank 5] Tasks: ['Single QA'] | Lens: [41067] → Tgt Spa: ['0.350'] [Step 148 / Rank 1] Tasks: ['Single QA'] | Lens: [59019] → Tgt Spa: ['0.350'] [Step 148 / Rank 4] Tasks: ['Single QA'] | Lens: [41067] → Tgt Spa: ['0.350'] [Step 148 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [26671, 26659] → Tgt Spa: ['1.000', '1.000'] [Step 148 / Rank 2] Tasks: ['Single QA'] | Lens: [45358] → Tgt Spa: ['0.350'] [Step 148 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [26671, 26659] → Tgt Spa: ['1.000', '1.000'] [Step 148 / Rank 3] Tasks: ['Single QA'] | Lens: [45358] → Tgt Spa: ['0.350'] [Step 148 / Rank 0] Tasks: ['Single QA'] | Lens: [59019] → Tgt Spa: ['0.350'] [Step 148 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24006, 24006] → Tgt Spa: ['1.000', '1.000'] [Step 148 / Rank 3] Tasks: ['Single QA'] | Lens: [51848] → Tgt Spa: ['0.350'] [Step 148 / Rank 7] Tasks: ['Single QA'] | Lens: [51070] → Tgt Spa: ['0.350'] [Step 148 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59173] → Tgt Spa: ['1.000'] [Step 148 / Rank 6] Tasks: ['Single QA'] | Lens: [51070] → Tgt Spa: ['0.350'] [Step 148 / Rank 
2] Tasks: ['Single QA'] | Lens: [51848] → Tgt Spa: ['0.350'] [Step 148 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24006, 24006] → Tgt Spa: ['1.000', '1.000'] [Step 148 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59173] → Tgt Spa: ['1.000'] [Step 148 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [23437, 23446] → Tgt Spa: ['1.000', '1.000'] [Step 148 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41519] → Tgt Spa: ['1.000'] [Step 148 / Rank 1] Tasks: ['Single QA'] | Lens: [61123] → Tgt Spa: ['0.350'] [Step 148 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [23437, 23446] → Tgt Spa: ['1.000', '1.000'] [Step 148 / Rank 0] Tasks: ['Single QA'] | Lens: [61123] → Tgt Spa: ['0.350'] [Step 148 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42972] → Tgt Spa: ['1.000'] [Step 148 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42972] → Tgt Spa: ['1.000'] [Step 148 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41519] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:19:54,366 >> @ 148 | Loss: 2.0150 | LM: 1.9513 | Reg: 0.0638 | Spa(Avg): 0.544 [INFO|lh_trainer.py:797] 2026-02-17 01:19:54,366 >> Statistic -> Code | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.083 | [INFO|lh_trainer.py:797] 2026-02-17 01:19:54,366 >> Statistic -> In-Context | Spa: 0.701 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:19:54,366 >> Statistic -> MultiHop | Spa: 0.648 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:19:54,366 >> Statistic -> Single | Spa: 0.400 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:19:54,366 >> Statistic -> Summarization | Spa: 0.618 | Tgt: 1.000 | Z-Loss: 0.117 | [INFO|lh_trainer.py:810] 2026-02-17 01:19:54,368 >> [Micro-Log] {"loss": 2.0150384296818324, "lm_loss": 1.951278400568602, "reg_loss": 0.06376001010357868, "model_sparsity(avg)": 0.5439814875523249, "Spa-Single QA sparsity": 0.40032679543775673, "Spa-Single QA target_sparsity": 
0.349609375, "Spa-Single QA log_z_loss": 0.03309759470935473, "Spa-In-Context Learning sparsity": 0.7013889104127884, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10041808243840933, "Spa-Code sparsity": 0.7083333532015482, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08333659172058105, "Spa-Summarization sparsity": 0.6180555820465088, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11651402339339256, "Spa-MultiHop QA sparsity": 0.6484126959528241, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12135276453835624, "step": 148, "current_tau": 1.0379879474639893, "lambda1 Single QA": 0.5625, "lambda2 MultiHop QA": 0.2890625, "lambda3 Summarization": 0.1328125, "lambda4 Code": 0.232421875} [INFO|lh_trainer.py:331] 2026-02-17 01:20:18,673 >> {'loss': 12.0902, 'grad_norm': 0.6183655858039856, 'learning_rate': 0.000351684175600534, 'epoch': 0.1569246972090574, 'num_input_tokens_seen': 365979336, 'completed': '49.67% (149 / 300)', 'remaining time': '7:04:08', 'throughput': '6882.94', 'gpu_mem_free': '5123MB', 'step': 149} [Step 149 / Rank 0] Tasks: ['Single QA'] | Lens: [52200] → Tgt Spa: ['0.350'] [Step 149 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [19858, 19858, 19863] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 149 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26501, 26503] → Tgt Spa: ['1.000', '1.000'] [Step 149 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25259, 25259] → Tgt Spa: ['0.350', '1.000'] [Step 149 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25259, 25259] → Tgt Spa: ['0.350', '1.000'] [Step 149 / Rank 1] Tasks: ['Single QA'] | Lens: [52200] → Tgt Spa: ['0.350'] [Step 149 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [19858, 19858, 19863] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 149 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26501, 
26503] → Tgt Spa: ['1.000', '1.000'] [Step 149 / Rank 2] Tasks: ['Code'] | Lens: [52227] → Tgt Spa: ['1.000'] [Step 149 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [31464, 31461] → Tgt Spa: ['1.000', '0.350'] [Step 149 / Rank 5] Tasks: ['Single QA'] | Lens: [56323] → Tgt Spa: ['0.350'] [Step 149 / Rank 4] Tasks: ['Single QA'] | Lens: [56323] → Tgt Spa: ['0.350'] [Step 149 / Rank 6] Tasks: ['Single QA'] | Lens: [36944] → Tgt Spa: ['0.350'] [Step 149 / Rank 3] Tasks: ['Code'] | Lens: [52227] → Tgt Spa: ['1.000'] [Step 149 / Rank 7] Tasks: ['Single QA'] | Lens: [36944] → Tgt Spa: ['0.350'] [Step 149 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [31464, 31461] → Tgt Spa: ['1.000', '0.350'] [Step 149 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [1789, 1789, 1808, 1808, 1790, 1790, 1810, 1809, 1810, 1791, 1791, 1810, 1794, 1792, 1811, 1795, 1796, 1813, 1795, 1795, 1814, 1815, 1796, 1815, 1798, 1797, 1816, 1817, 1799, 1818, 1801, 1803, 1819, 1820, 1800, 1801] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 149 / Rank 0] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18093, 18104, 18105] → Tgt Spa: ['1.000', '1.000', '1.000'] 
[Step 149 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18093, 18104, 18105] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 149 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [52406] → Tgt Spa: ['1.000'] [Step 149 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [52406] → Tgt Spa: ['1.000'] [Step 149 / Rank 2] Tasks: ['Single QA'] | Lens: [52630] → Tgt Spa: ['0.350'] [Step 149 / Rank 3] Tasks: ['Single QA'] | Lens: [52630] → Tgt Spa: ['0.350'] [Step 149 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [1789, 1789, 1808, 1808, 1790, 1790, 1810, 1809, 1810, 1791, 1791, 1810, 1794, 1792, 1811, 1795, 1796, 1813, 1795, 1795, 1814, 1815, 1796, 1815, 1798, 1797, 1816, 1817, 1799, 1818, 1801, 1803, 1819, 1820, 1800, 1801] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 149 / Rank 6] Tasks: ['Code'] | Lens: [56197] → Tgt Spa: ['1.000'] [Step 149 / Rank 5] Tasks: ['Single QA'] | Lens: [33983] → Tgt Spa: ['0.350'] [Step 149 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24361, 24362] → Tgt Spa: ['1.000', '1.000'] [Step 149 / Rank 4] Tasks: ['Single QA'] | Lens: [33983] → Tgt Spa: ['0.350'] [Step 149 / 
Rank 3] Tasks: ['In-Context Learning'] | Lens: [38363] → Tgt Spa: ['1.000'] [Step 149 / Rank 7] Tasks: ['Code'] | Lens: [56197] → Tgt Spa: ['1.000'] [Step 149 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24361, 24362] → Tgt Spa: ['1.000', '1.000'] [Step 149 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38363] → Tgt Spa: ['1.000'] [Step 149 / Rank 4] Tasks: ['Single QA'] | Lens: [63495] → Tgt Spa: ['0.350'] [Step 149 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18180, 18169, 18171] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 149 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [25853, 25860] → Tgt Spa: ['0.350', '1.000'] [Step 149 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [25853, 25860] → Tgt Spa: ['0.350', '1.000'] [Step 149 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24799, 24799] → Tgt Spa: ['0.350', '1.000'] [Step 149 / Rank 5] Tasks: ['Single QA'] | Lens: [63495] → Tgt Spa: ['0.350'] [Step 149 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18180, 18169, 18171] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 149 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24799, 24799] → Tgt Spa: ['0.350', '1.000'] [Step 149 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [36164] → Tgt Spa: ['1.000'] [Step 149 / Rank 4] Tasks: ['In-Context Learning', 'Code', 'Single QA'] | Lens: [21491, 21501, 21497] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 149 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [36164] → Tgt Spa: ['1.000'] [Step 149 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [44959] → Tgt Spa: ['1.000'] [Step 149 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [44959] → Tgt Spa: ['1.000'] [Step 149 / Rank 5] Tasks: ['In-Context Learning', 'Code', 'Single QA'] | Lens: [21491, 21501, 21497] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 149 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25061, 25062] → Tgt Spa: ['1.000', '1.000'] [Step 149 / Rank 7] 
Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25061, 25062] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:22:43,251 >> @ 149 | Loss: 2.0388 | LM: 1.9541 | Reg: 0.0847 | Spa(Avg): 0.591 [INFO|lh_trainer.py:797] 2026-02-17 01:22:43,251 >> Statistic -> Code | Spa: 0.676 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 01:22:43,252 >> Statistic -> In-Context | Spa: 0.698 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:22:43,252 >> Statistic -> MultiHop | Spa: 0.613 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:22:43,252 >> Statistic -> Single | Spa: 0.436 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:22:43,252 >> Statistic -> Summarization | Spa: 0.673 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:810] 2026-02-17 01:22:43,254 >> [Micro-Log] {"loss": 2.038846661647161, "lm_loss": 1.9541478902101517, "reg_loss": 0.084698774269782, "model_sparsity(avg)": 0.5911619067192078, "Spa-Single QA sparsity": 0.43560605157505383, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06360948263582858, "Spa-Code sparsity": 0.6755050637505271, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09528763863173398, "Spa-Summarization sparsity": 0.6732456119436967, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09178999576129411, "Spa-In-Context Learning sparsity": 0.697649570611807, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10242471041587684, "Spa-MultiHop QA sparsity": 0.6131944537162781, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10434731990098953, "step": 149, "current_tau": 1.0357081890106201, "lambda1 Single QA": 0.5625, "lambda2 MultiHop QA": 0.291015625, "lambda3 Summarization": 0.1328125, "lambda4 Code": 0.2333984375} [INFO|lh_trainer.py:331] 2026-02-17 01:22:58,230 >> {'loss': 12.2331, 'grad_norm': 
0.8521149754524231, 'learning_rate': 0.0003486859792274704, 'epoch': 0.1579778830963665, 'num_input_tokens_seen': 368459936, 'completed': '50.00% (150 / 300)', 'remaining time': '7:01:10', 'throughput': '7773.41', 'gpu_mem_free': '13685MB', 'step': 150} [Step 150 / Rank 5] Tasks: ['Code'] | Lens: [53000] → Tgt Spa: ['1.000'] [Step 150 / Rank 1] Tasks: ['Single QA'] | Lens: [41934] → Tgt Spa: ['0.350'] [Step 150 / Rank 4] Tasks: ['Code'] | Lens: [53000] → Tgt Spa: ['1.000'] [Step 150 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [21938, 21932] → Tgt Spa: ['1.000', '1.000'] [Step 150 / Rank 7] Tasks: ['Summarization', 'Code', 'Code', 'Summarization', 'Code', 'Code', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [3274, 3264, 3265, 3276, 3264, 3264, 3258, 3261, 3260, 3260, 3261, 3263, 3280, 3262, 3262, 3269, 3263, 3266, 3265, 3266] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 150 / Rank 6] Tasks: ['Summarization', 'Code', 'Code', 'Summarization', 'Code', 'Code', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [3274, 3264, 3265, 3276, 3264, 3264, 3258, 3261, 3260, 3260, 3261, 3263, 3280, 3262, 3262, 3269, 3263, 3266, 3265, 3266] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 150 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [21938, 21932] → Tgt Spa: ['1.000', '1.000'] 
[Step 150 / Rank 0] Tasks: ['Single QA'] | Lens: [41934] → Tgt Spa: ['0.350'] [Step 150 / Rank 5] Tasks: ['Code'] | Lens: [61646] → Tgt Spa: ['1.000'] [Step 150 / Rank 2] Tasks: ['Single QA'] | Lens: [40028] → Tgt Spa: ['0.350'] [Step 150 / Rank 3] Tasks: ['Single QA'] | Lens: [40028] → Tgt Spa: ['0.350'] [Step 150 / Rank 7] Tasks: ['Single QA'] | Lens: [58615] → Tgt Spa: ['0.350'] [Step 150 / Rank 4] Tasks: ['Code'] | Lens: [61646] → Tgt Spa: ['1.000'] [Step 150 / Rank 6] Tasks: ['Single QA'] | Lens: [58615] → Tgt Spa: ['0.350'] [Step 150 / Rank 0] Tasks: ['Single QA'] | Lens: [34562] → Tgt Spa: ['0.350'] [Step 150 / Rank 1] Tasks: ['Single QA'] | Lens: [34562] → Tgt Spa: ['0.350'] [Step 150 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32002, 32003] → Tgt Spa: ['0.350', '0.350'] [Step 150 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [22121, 22131] → Tgt Spa: ['1.000', '1.000'] [Step 150 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32002, 32003] → Tgt Spa: ['0.350', '0.350'] [Step 150 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8033, 8035, 8035, 8035, 8035, 8035, 8035, 8035] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 150 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [22121, 22131] → Tgt Spa: ['1.000', '1.000'] [Step 150 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Summarization', 'Code', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6370, 6363, 6365, 6366, 6366, 6385, 6375, 6375, 6368, 6371] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 150 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8033, 8035, 8035, 8035, 8035, 8035, 8035, 8035] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', 
'0.350', '0.350', '0.350'] [Step 150 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Summarization', 'Code', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6370, 6363, 6365, 6366, 6366, 6385, 6375, 6375, 6368, 6371] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 150 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [21878, 21881] → Tgt Spa: ['0.350', '0.350'] [Step 150 / Rank 2] Tasks: ['Single QA'] | Lens: [45987] → Tgt Spa: ['0.350'] [Step 150 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [21878, 21881] → Tgt Spa: ['0.350', '0.350'] [Step 150 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22431, 22453] → Tgt Spa: ['1.000', '1.000'] [Step 150 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22431, 22453] → Tgt Spa: ['1.000', '1.000'] [Step 150 / Rank 5] Tasks: ['Single QA'] | Lens: [65075] → Tgt Spa: ['0.350'] [Step 150 / Rank 3] Tasks: ['Single QA'] | Lens: [45987] → Tgt Spa: ['0.350'] [Step 150 / Rank 4] Tasks: ['Single QA'] | Lens: [65075] → Tgt Spa: ['0.350'] [Step 150 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [47082] → Tgt Spa: ['1.000'] [Step 150 / Rank 0] Tasks: ['Single QA'] | Lens: [65483] → Tgt Spa: ['0.350'] [Step 150 / Rank 7] Tasks: ['Code'] | Lens: [53370] → Tgt Spa: ['1.000'] [Step 150 / Rank 5] Tasks: ['Single QA'] | Lens: [37510] → Tgt Spa: ['0.350'] [Step 150 / Rank 1] Tasks: ['Single QA'] | Lens: [65483] → Tgt Spa: ['0.350'] [Step 150 / Rank 6] Tasks: ['Code'] | Lens: [53370] → Tgt Spa: ['1.000'] [Step 150 / Rank 4] Tasks: ['Single QA'] | Lens: [37510] → Tgt Spa: ['0.350'] [Step 150 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [47082] → Tgt Spa: ['1.000'] [Step 150 / Rank 7] Tasks: ['Single QA'] | Lens: [52519] → Tgt Spa: ['0.350'] [Step 150 / Rank 5] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [21253, 21272, 21262] → Tgt Spa: ['1.000', '1.000', '1.000'] 
[Step 150 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [37039] → Tgt Spa: ['1.000'] [Step 150 / Rank 4] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [21253, 21272, 21262] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 150 / Rank 1] Tasks: ['Single QA'] | Lens: [64597] → Tgt Spa: ['0.350'] [Step 150 / Rank 0] Tasks: ['Single QA'] | Lens: [64597] → Tgt Spa: ['0.350'] [Step 150 / Rank 6] Tasks: ['Single QA'] | Lens: [52519] → Tgt Spa: ['0.350'] [Step 150 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [37039] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:25:37,183 >> @ 150 | Loss: 1.8823 | LM: 1.8150 | Reg: 0.0673 | Spa(Avg): 0.522 [INFO|lh_trainer.py:797] 2026-02-17 01:25:37,183 >> Statistic -> Code | Spa: 0.652 | Tgt: 1.000 | Z-Loss: 0.104 | [INFO|lh_trainer.py:797] 2026-02-17 01:25:37,183 >> Statistic -> In-Context | Spa: 0.696 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:25:37,183 >> Statistic -> MultiHop | Spa: 0.664 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:25:37,183 >> Statistic -> Single | Spa: 0.479 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:25:37,183 >> Statistic -> Summarization | Spa: 0.655 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:810] 2026-02-17 01:25:37,185 >> [Micro-Log] {"loss": 1.8823017043371995, "lm_loss": 1.814980637282133, "reg_loss": 0.06732105397774528, "model_sparsity(avg)": 0.5221113078296185, "Spa-Single QA sparsity": 0.4794238673316108, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08261197853695464, "Spa-In-Context Learning sparsity": 0.6955128266261175, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10328578318540867, "Spa-Summarization sparsity": 0.6550926069418589, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10014053682486217, "Spa-Code sparsity": 0.6517857185431889, "Spa-Code target_sparsity": 1.0, "Spa-Code 
log_z_loss": 0.10417382312672478, "Spa-MultiHop QA sparsity": 0.6643518606821696, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.13038926074902216, "step": 150, "current_tau": 1.0334936380386353, "lambda1 Single QA": 0.5625, "lambda2 MultiHop QA": 0.291015625, "lambda3 Summarization": 0.1337890625, "lambda4 Code": 0.2333984375} [INFO|lh_trainer.py:331] 2026-02-17 01:26:03,780 >> {'loss': 11.2938, 'grad_norm': 0.6342733502388, 'learning_rate': 0.00034567087352418665, 'epoch': 0.1590310689836756, 'num_input_tokens_seen': 370972514, 'completed': '50.33% (151 / 300)', 'remaining time': '6:58:39', 'throughput': '6770.60', 'gpu_mem_free': '4171MB', 'step': 151} [Step 151 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26347, 26349] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26224, 26224] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26347, 26349] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23254, 23256] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27949, 27949] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23254, 23256] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27949, 27949] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26224, 26224] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32548, 32548] → Tgt Spa: ['0.350', '0.350'] [Step 151 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32548, 32548] → Tgt Spa: ['0.350', '0.350'] [Step 151 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [21659, 
21659, 21662] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 151 / Rank 6] Tasks: ['Single QA'] | Lens: [53907] → Tgt Spa: ['0.350'] [Step 151 / Rank 1] Tasks: ['Single QA'] | Lens: [46359] → Tgt Spa: ['0.350'] [Step 151 / Rank 0] Tasks: ['Single QA'] | Lens: [46359] → Tgt Spa: ['0.350'] [Step 151 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [21659, 21659, 21662] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 151 / Rank 7] Tasks: ['Single QA'] | Lens: [53907] → Tgt Spa: ['0.350'] [Step 151 / Rank 4] Tasks: ['Code'] | Lens: [38947] → Tgt Spa: ['1.000'] [Step 151 / Rank 5] Tasks: ['Code'] | Lens: [38947] → Tgt Spa: ['1.000'] [Step 151 / Rank 6] Tasks: ['Code', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [6756, 6756, 6749, 6750, 6759, 6759, 6759, 6759, 6760] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 151 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25392, 25378] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32237, 32238] → Tgt Spa: ['0.350', '0.350'] [Step 151 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25392, 25378] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32237, 32238] → Tgt Spa: ['0.350', '0.350'] [Step 151 / Rank 7] Tasks: ['Code', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [6756, 6756, 6749, 6750, 6759, 6759, 6759, 6759, 6760] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 151 / Rank 2] Tasks: ['Single QA'] | Lens: [61771] → Tgt Spa: ['0.350'] [Step 151 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57323] → Tgt Spa: ['1.000'] [Step 151 / Rank 5] Tasks: ['Single QA'] | Lens: [40295] → Tgt Spa: ['0.350'] [Step 151 / Rank 7] Tasks: ['Single QA', 'Summarization', 'In-Context Learning'] | Lens: [20668, 20690, 20672] → 
Tgt Spa: ['0.350', '1.000', '1.000'] [Step 151 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57323] → Tgt Spa: ['1.000'] [Step 151 / Rank 3] Tasks: ['Single QA'] | Lens: [61771] → Tgt Spa: ['0.350'] [Step 151 / Rank 6] Tasks: ['Single QA', 'Summarization', 'In-Context Learning'] | Lens: [20668, 20690, 20672] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 151 / Rank 4] Tasks: ['Single QA'] | Lens: [40295] → Tgt Spa: ['0.350'] [Step 151 / Rank 4] Tasks: ['Single QA'] | Lens: [55847] → Tgt Spa: ['0.350'] [Step 151 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26604, 26606] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 5] Tasks: ['Single QA'] | Lens: [55847] → Tgt Spa: ['0.350'] [Step 151 / Rank 3] Tasks: ['Single QA'] | Lens: [36397] → Tgt Spa: ['0.350'] [Step 151 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22709, 22710] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22709, 22710] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26604, 26606] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 2] Tasks: ['Single QA'] | Lens: [36397] → Tgt Spa: ['0.350'] [Step 151 / Rank 4] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [11353, 11359, 11359, 11354, 11354] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350'] [Step 151 / Rank 1] Tasks: ['Code', 'Summarization'] | Lens: [24373, 24384] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [28600, 28601] → Tgt Spa: ['0.350', '0.350'] [Step 151 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [28600, 28601] → Tgt Spa: ['0.350', '0.350'] [Step 151 / Rank 6] Tasks: ['Single QA'] | Lens: [55264] → Tgt Spa: ['0.350'] [Step 151 / Rank 0] Tasks: ['Code', 'Summarization'] | Lens: [24373, 24384] → Tgt Spa: ['1.000', '1.000'] [Step 151 / Rank 5] Tasks: ['Single QA', 'Code', 'Code', 
'Single QA', 'Single QA'] | Lens: [11353, 11359, 11359, 11354, 11354] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350'] [Step 151 / Rank 7] Tasks: ['Single QA'] | Lens: [55264] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:28:20,896 >> @ 151 | Loss: 2.0955 | LM: 2.0275 | Reg: 0.0680 | Spa(Avg): 0.556 [INFO|lh_trainer.py:797] 2026-02-17 01:28:20,896 >> Statistic -> Code | Spa: 0.682 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 01:28:20,896 >> Statistic -> In-Context | Spa: 0.707 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:28:20,896 >> Statistic -> MultiHop | Spa: 0.664 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:28:20,896 >> Statistic -> Single | Spa: 0.404 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:28:20,896 >> Statistic -> Summarization | Spa: 0.644 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:810] 2026-02-17 01:28:20,898 >> [Micro-Log] {"loss": 2.095477888981501, "lm_loss": 2.027457818388939, "reg_loss": 0.06802006023159872, "model_sparsity(avg)": 0.555632721632719, "Spa-In-Context Learning sparsity": 0.7065972313284874, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.09918975457549095, "Spa-Single QA sparsity": 0.40432097845607334, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.037952654684583344, "Spa-Code sparsity": 0.6815476247242519, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09334928755249296, "Spa-Summarization sparsity": 0.6435185273488363, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10541637986898422, "Spa-MultiHop QA sparsity": 0.6643518606821696, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.13038926074902216, "step": 151, "current_tau": 1.0313451290130615, "lambda1 Single QA": 0.5625, "lambda2 MultiHop QA": 0.291015625, "lambda3 Summarization": 0.134765625, "lambda4 Code": 
0.234375} [INFO|lh_trainer.py:331] 2026-02-17 01:28:41,426 >> {'loss': 12.5729, 'grad_norm': 0.7194718718528748, 'learning_rate': 0.00034263937511352314, 'epoch': 0.16008425487098474, 'num_input_tokens_seen': 373538886, 'completed': '50.67% (152 / 300)', 'remaining time': '6:55:39', 'throughput': '8139.67', 'gpu_mem_free': '11115MB', 'step': 152} [Step 152 / Rank 3] Tasks: ['Single QA'] | Lens: [64738] → Tgt Spa: ['0.350'] [Step 152 / Rank 2] Tasks: ['Single QA'] | Lens: [64738] → Tgt Spa: ['0.350'] [Step 152 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [64829] → Tgt Spa: ['0.350'] [Step 152 / Rank 5] Tasks: ['Code'] | Lens: [56965] → Tgt Spa: ['1.000'] [Step 152 / Rank 0] Tasks: ['MultiHop QA'] | Lens: [64829] → Tgt Spa: ['0.350'] [Step 152 / Rank 7] Tasks: ['Single QA'] | Lens: [37514] → Tgt Spa: ['0.350'] [Step 152 / Rank 6] Tasks: ['Single QA'] | Lens: [37514] → Tgt Spa: ['0.350'] [Step 152 / Rank 4] Tasks: ['Code'] | Lens: [56965] → Tgt Spa: ['1.000'] [Step 152 / Rank 6] Tasks: ['Code'] | Lens: [41730] → Tgt Spa: ['1.000'] [Step 152 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1643, 1643, 1642, 1645, 1642, 1645, 1662, 1644, 1643, 1644, 1663, 1644, 1663, 1644, 1665, 1665, 1648, 1647, 1649, 1669, 1650, 1652, 1651, 1651, 1669, 1651, 1651, 1671, 1671, 1672, 1671, 1653, 1673, 1654, 1653, 1653, 1654, 1673, 1673] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', 
'0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 152 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1643, 1643, 1642, 1645, 1642, 1645, 1662, 1644, 1643, 1644, 1663, 1644, 1663, 1644, 1665, 1665, 1648, 1647, 1649, 1669, 1650, 1652, 1651, 1651, 1669, 1651, 1651, 1671, 1671, 1672, 1671, 1653, 1673, 1654, 1653, 1653, 1654, 1673, 1673] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 152 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [29953, 29946] → Tgt Spa: ['1.000', '0.350'] [Step 152 / Rank 3] Tasks: ['Single QA'] | Lens: [54193] → Tgt Spa: ['0.350'] [Step 152 / Rank 7] Tasks: ['Code'] | Lens: [41730] → Tgt Spa: ['1.000'] [Step 152 / Rank 2] Tasks: ['Single QA'] | Lens: [54193] → Tgt Spa: ['0.350'] [Step 152 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [29953, 29946] → Tgt Spa: ['1.000', '0.350'] [Step 152 / Rank 6] Tasks: ['Code'] 
| Lens: [37936] → Tgt Spa: ['1.000'] [Step 152 / Rank 7] Tasks: ['Code'] | Lens: [37936] → Tgt Spa: ['1.000'] [Step 152 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15706, 15706, 15706, 15706] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 152 / Rank 2] Tasks: ['Single QA'] | Lens: [64730] → Tgt Spa: ['0.350'] [Step 152 / Rank 3] Tasks: ['Single QA'] | Lens: [64730] → Tgt Spa: ['0.350'] [Step 152 / Rank 0] Tasks: ['Single QA'] | Lens: [33350] → Tgt Spa: ['0.350'] [Step 152 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15706, 15706, 15706, 15706] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 152 / Rank 1] Tasks: ['Single QA'] | Lens: [33350] → Tgt Spa: ['0.350'] [Step 152 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56074] → Tgt Spa: ['1.000'] [Step 152 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [25136, 25143] → Tgt Spa: ['1.000', '1.000'] [Step 152 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43698] → Tgt Spa: ['1.000'] [Step 152 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [25136, 25143] → Tgt Spa: ['1.000', '1.000'] [Step 152 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56074] → Tgt Spa: ['1.000'] [Step 152 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43698] → Tgt Spa: ['1.000'] [Step 152 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32252, 32254] → Tgt Spa: ['0.350', '0.350'] [Step 152 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32252, 32254] → Tgt Spa: ['0.350', '0.350'] [Step 152 / Rank 4] Tasks: ['Single QA'] | Lens: [59396] → Tgt Spa: ['0.350'] [Step 152 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [45692] → Tgt Spa: ['1.000'] [Step 152 / Rank 1] Tasks: ['Single QA'] | Lens: [35540] → Tgt Spa: ['0.350'] [Step 152 / Rank 0] Tasks: ['Single QA'] | Lens: [35540] → Tgt Spa: ['0.350'] [Step 152 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [45692] → Tgt Spa: ['1.000'] [Step 152 / Rank 3] Tasks: ['Code', 
'Summarization', 'Summarization'] | Lens: [18854, 18867, 18868] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 152 / Rank 5] Tasks: ['Single QA'] | Lens: [59396] → Tgt Spa: ['0.350'] [Step 152 / Rank 2] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18854, 18867, 18868] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 152 / Rank 3] Tasks: ['Summarization'] | Lens: [56395] → Tgt Spa: ['1.000'] [Step 152 / Rank 4] Tasks: ['Single QA'] | Lens: [41435] → Tgt Spa: ['0.350'] [Step 152 / Rank 5] Tasks: ['Single QA'] | Lens: [41435] → Tgt Spa: ['0.350'] [Step 152 / Rank 7] Tasks: ['Code'] | Lens: [37331] → Tgt Spa: ['1.000'] [Step 152 / Rank 6] Tasks: ['Code'] | Lens: [37331] → Tgt Spa: ['1.000'] [Step 152 / Rank 2] Tasks: ['Summarization'] | Lens: [56395] → Tgt Spa: ['1.000'] [Step 152 / Rank 1] Tasks: ['Single QA'] | Lens: [39032] → Tgt Spa: ['0.350'] [Step 152 / Rank 0] Tasks: ['Single QA'] | Lens: [39032] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:31:22,860 >> @ 152 | Loss: 1.9134 | LM: 1.8468 | Reg: 0.0666 | Spa(Avg): 0.506 [INFO|lh_trainer.py:797] 2026-02-17 01:31:22,861 >> Statistic -> Code | Spa: 0.665 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:797] 2026-02-17 01:31:22,861 >> Statistic -> In-Context | Spa: 0.663 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:31:22,861 >> Statistic -> MultiHop | Spa: 0.568 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:31:22,861 >> Statistic -> Single | Spa: 0.397 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:31:22,861 >> Statistic -> Summarization | Spa: 0.615 | Tgt: 1.000 | Z-Loss: 0.120 | [INFO|lh_trainer.py:810] 2026-02-17 01:31:22,863 >> [Micro-Log] {"loss": 1.9134014709076534, "lm_loss": 1.8467976273192714, "reg_loss": 0.06660383687509845, "model_sparsity(avg)": 0.5062025114893913, "Spa-MultiHop QA sparsity": 0.5677083283662796, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08292407045761745, 
"Spa-Code sparsity": 0.6646825415747506, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09958103724888392, "Spa-Single QA sparsity": 0.3966049353281657, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03501917804694838, "Spa-In-Context Learning sparsity": 0.6631944328546524, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11625003069639206, "Spa-Summarization sparsity": 0.6151960807688096, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12038193686920054, "step": 152, "current_tau": 1.0292631387710571, "lambda1 Single QA": 0.5625, "lambda2 MultiHop QA": 0.291015625, "lambda3 Summarization": 0.134765625, "lambda4 Code": 0.234375} [INFO|lh_trainer.py:331] 2026-02-17 01:31:44,297 >> {'loss': 11.4804, 'grad_norm': 0.6463214755058289, 'learning_rate': 0.00033959200342712626, 'epoch': 0.16113744075829384, 'num_input_tokens_seen': 375997348, 'completed': '51.00% (153 / 300)', 'remaining time': '6:53:05', 'throughput': '6721.86', 'gpu_mem_free': '12677MB', 'step': 153} [Step 153 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [48114] → Tgt Spa: ['1.000'] [Step 153 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38971] → Tgt Spa: ['1.000'] [Step 153 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38971] → Tgt Spa: ['1.000'] [Step 153 / Rank 0] Tasks: ['Single QA'] | Lens: [56500] → Tgt Spa: ['0.350'] [Step 153 / Rank 5] Tasks: ['Single QA'] | Lens: [34745] → Tgt Spa: ['0.350'] [Step 153 / Rank 4] Tasks: ['Single QA'] | Lens: [34745] → Tgt Spa: ['0.350'] [Step 153 / Rank 1] Tasks: ['Single QA'] | Lens: [56500] → Tgt Spa: ['0.350'] [Step 153 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [48114] → Tgt Spa: ['1.000'] [Step 153 / Rank 4] Tasks: ['Single QA'] | Lens: [49450] → Tgt Spa: ['0.350'] [Step 153 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8228, 8228, 8230, 8230, 8231, 8231, 8232] → 
Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 153 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [38133] → Tgt Spa: ['1.000'] [Step 153 / Rank 5] Tasks: ['Single QA'] | Lens: [49450] → Tgt Spa: ['0.350'] [Step 153 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [30263, 30255] → Tgt Spa: ['1.000', '1.000'] [Step 153 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8228, 8228, 8230, 8230, 8231, 8231, 8232] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 153 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [30263, 30255] → Tgt Spa: ['1.000', '1.000'] [Step 153 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [38133] → Tgt Spa: ['1.000'] [Step 153 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code'] | Lens: [3048, 3049, 3048, 3050, 3067, 3068, 3056, 3049, 3049, 3051, 3053, 3070, 3070, 3053, 3055, 3055, 3071, 3053, 3054, 3056, 3061] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 153 / Rank 6] Tasks: ['Code'] | Lens: [55944] → Tgt Spa: ['1.000'] [Step 153 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22487, 22506] → Tgt Spa: ['1.000', '1.000'] [Step 153 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22487, 22506] → Tgt Spa: ['1.000', '1.000'] [Step 153 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'Single QA', 
'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code'] | Lens: [3048, 3049, 3048, 3050, 3067, 3068, 3056, 3049, 3049, 3051, 3053, 3070, 3070, 3053, 3055, 3055, 3071, 3053, 3054, 3056, 3061] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 153 / Rank 2] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [21020, 21023, 21014] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 153 / Rank 7] Tasks: ['Code'] | Lens: [55944] → Tgt Spa: ['1.000'] [Step 153 / Rank 3] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [21020, 21023, 21014] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 153 / Rank 0] Tasks: ['Code'] | Lens: [35531] → Tgt Spa: ['1.000'] [Step 153 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57703] → Tgt Spa: ['1.000'] [Step 153 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57703] → Tgt Spa: ['1.000'] [Step 153 / Rank 1] Tasks: ['Code'] | Lens: [35531] → Tgt Spa: ['1.000'] [Step 153 / Rank 4] Tasks: ['Single QA'] | Lens: [43235] → Tgt Spa: ['0.350'] [Step 153 / Rank 5] Tasks: ['Single QA'] | Lens: [43235] → Tgt Spa: ['0.350'] [Step 153 / Rank 6] Tasks: ['Single QA'] | Lens: [51266] → Tgt Spa: ['0.350'] [Step 153 / Rank 7] Tasks: ['Single QA'] | Lens: [51266] → Tgt Spa: ['0.350'] [Step 153 / Rank 5] Tasks: ['Single QA'] | Lens: [46361] → Tgt Spa: ['0.350'] [Step 153 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17705, 17706, 17719] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 153 / Rank 4] Tasks: ['Single QA'] | Lens: [46361] → Tgt Spa: ['0.350'] [Step 153 / Rank 0] Tasks: ['Single QA', 'Summarization'] | Lens: [25341, 25358] → Tgt Spa: ['0.350', '1.000'] [Step 153 / Rank 7] Tasks: ['Single QA'] | Lens: [43590] → Tgt Spa: ['0.350'] [Step 153 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17705, 17706, 
17719] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 153 / Rank 1] Tasks: ['Single QA', 'Summarization'] | Lens: [25341, 25358] → Tgt Spa: ['0.350', '1.000'] [Step 153 / Rank 6] Tasks: ['Single QA'] | Lens: [43590] → Tgt Spa: ['0.350'] [Step 153 / Rank 6] Tasks: ['Single QA'] | Lens: [60965] → Tgt Spa: ['0.350'] [Step 153 / Rank 3] Tasks: ['Single QA'] | Lens: [42607] → Tgt Spa: ['0.350'] [Step 153 / Rank 0] Tasks: ['Single QA'] | Lens: [39642] → Tgt Spa: ['0.350'] [Step 153 / Rank 5] Tasks: ['Single QA'] | Lens: [48602] → Tgt Spa: ['0.350'] [Step 153 / Rank 4] Tasks: ['Single QA'] | Lens: [48602] → Tgt Spa: ['0.350'] [Step 153 / Rank 2] Tasks: ['Single QA'] | Lens: [42607] → Tgt Spa: ['0.350'] [Step 153 / Rank 1] Tasks: ['Single QA'] | Lens: [39642] → Tgt Spa: ['0.350'] [Step 153 / Rank 7] Tasks: ['Single QA'] | Lens: [60965] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:33:58,549 >> @ 153 | Loss: 2.2258 | LM: 2.1629 | Reg: 0.0629 | Spa(Avg): 0.501 [INFO|lh_trainer.py:797] 2026-02-17 01:33:58,550 >> Statistic -> Code | Spa: 0.642 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:797] 2026-02-17 01:33:58,550 >> Statistic -> In-Context | Spa: 0.684 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:33:58,550 >> Statistic -> MultiHop | Spa: 0.578 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:33:58,550 >> Statistic -> Single | Spa: 0.416 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:33:58,550 >> Statistic -> Summarization | Spa: 0.608 | Tgt: 1.000 | Z-Loss: 0.127 | [INFO|lh_trainer.py:810] 2026-02-17 01:33:58,552 >> [Micro-Log] {"loss": 2.225800007581711, "lm_loss": 2.162937986354033, "reg_loss": 0.0628620083637846, "model_sparsity(avg)": 0.5006200398008028, "Spa-Single QA sparsity": 0.41550924628973007, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.049769278072441615, "Spa-In-Context Learning sparsity": 0.6840277761220932, "Spa-In-Context Learning 
target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1084042340517044, "Spa-Summarization sparsity": 0.6076388880610466, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1266999403014779, "Spa-Code sparsity": 0.6419753101136949, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10837733745574951, "Spa-MultiHop QA sparsity": 0.5781250149011612, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0885024182498455, "step": 153, "current_tau": 1.0272483825683594, "lambda1 Single QA": 0.56640625, "lambda2 MultiHop QA": 0.29296875, "lambda3 Summarization": 0.1357421875, "lambda4 Code": 0.2353515625} [INFO|lh_trainer.py:331] 2026-02-17 01:34:22,712 >> {'loss': 13.3548, 'grad_norm': 0.614433765411377, 'learning_rate': 0.0003365292806164468, 'epoch': 0.16219062664560294, 'num_input_tokens_seen': 378368452, 'completed': '51.33% (154 / 300)', 'remaining time': '6:50:06', 'throughput': '7483.89', 'gpu_mem_free': '12043MB', 'step': 154} [Step 154 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27295, 27295] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27295, 27295] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 5] Tasks: ['Single QA'] | Lens: [33930] → Tgt Spa: ['0.350'] [Step 154 / Rank 7] Tasks: ['Single QA', 'Summarization'] | Lens: [23751, 23770] → Tgt Spa: ['0.350', '1.000'] [Step 154 / Rank 6] Tasks: ['Single QA', 'Summarization'] | Lens: [23751, 23770] → Tgt Spa: ['0.350', '1.000'] [Step 154 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59603] → Tgt Spa: ['1.000'] [Step 154 / Rank 4] Tasks: ['Single QA'] | Lens: [33930] → Tgt Spa: ['0.350'] [Step 154 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59603] → Tgt Spa: ['1.000'] [Step 154 / Rank 4] Tasks: ['Single QA'] | Lens: [62000] → Tgt Spa: ['0.350'] [Step 154 / Rank 3] Tasks: ['Single QA'] | Lens: [52152] → Tgt Spa: ['0.350'] [Step 154 / Rank 2] 
Tasks: ['Single QA'] | Lens: [52152] → Tgt Spa: ['0.350'] [Step 154 / Rank 6] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1526, 1509, 1508, 1508, 1508, 1509, 1507, 1527, 1509, 1509, 1509, 1528, 1528, 1510, 1511, 1530, 1530, 1512, 1512, 1512, 1512, 1512, 1531, 1513, 1532, 1515, 1514, 1514, 1515, 1516, 1515, 1516, 1515, 1517, 1516, 1517, 1517, 1517, 1517, 1517, 1517, 1517, 1536] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 154 / Rank 7] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 
'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1526, 1509, 1508, 1508, 1508, 1509, 1507, 1527, 1509, 1509, 1509, 1528, 1528, 1510, 1511, 1530, 1530, 1512, 1512, 1512, 1512, 1512, 1531, 1513, 1532, 1515, 1514, 1514, 1515, 1516, 1515, 1516, 1515, 1517, 1516, 1517, 1517, 1517, 1517, 1517, 1517, 1517, 1536] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 154 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [24883, 24875] → Tgt Spa: ['1.000', '0.350'] [Step 154 / Rank 5] Tasks: ['Single QA'] | Lens: [62000] → Tgt Spa: ['0.350'] [Step 154 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [24883, 24875] → Tgt Spa: ['1.000', '0.350'] [Step 154 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32366, 32366] → Tgt Spa: ['0.350', '0.350'] [Step 154 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42551] → Tgt Spa: ['1.000'] [Step 154 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42551] → Tgt Spa: ['1.000'] [Step 154 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [28076, 28078] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23620, 23623] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32366, 32366] → Tgt Spa: ['0.350', '0.350'] [Step 154 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [28076, 28078] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23620, 23623] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 5] Tasks: ['Single QA'] | Lens: [42049] → Tgt Spa: ['0.350'] [Step 154 / Rank 2] Tasks: 
['MultiHop QA'] | Lens: [65339] → Tgt Spa: ['0.350'] [Step 154 / Rank 0] Tasks: ['Single QA'] | Lens: [38723] → Tgt Spa: ['0.350'] [Step 154 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [65339] → Tgt Spa: ['0.350'] [Step 154 / Rank 4] Tasks: ['Single QA'] | Lens: [42049] → Tgt Spa: ['0.350'] [Step 154 / Rank 1] Tasks: ['Single QA'] | Lens: [38723] → Tgt Spa: ['0.350'] [Step 154 / Rank 6] Tasks: ['Single QA'] | Lens: [53602] → Tgt Spa: ['0.350'] [Step 154 / Rank 7] Tasks: ['Single QA'] | Lens: [53602] → Tgt Spa: ['0.350'] [Step 154 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [23105, 23117] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [23105, 23117] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27761, 27744] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27761, 27744] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [37501] → Tgt Spa: ['1.000'] [Step 154 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15877, 15877, 15878, 15878] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 154 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15877, 15877, 15878, 15878] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 154 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [37501] → Tgt Spa: ['1.000'] [Step 154 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17934, 17924, 17924] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 154 / Rank 1] Tasks: ['Single QA', 'Summarization'] | Lens: [28274, 28292] → Tgt Spa: ['0.350', '1.000'] [Step 154 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29742, 29742] → Tgt Spa: ['0.350', '0.350'] [Step 154 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29742, 29742] → Tgt Spa: ['0.350', '0.350'] [Step 154 / Rank 6] Tasks: ['Code', 'Summarization'] | Lens: 
[31216, 31230] → Tgt Spa: ['1.000', '1.000'] [Step 154 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17934, 17924, 17924] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 154 / Rank 0] Tasks: ['Single QA', 'Summarization'] | Lens: [28274, 28292] → Tgt Spa: ['0.350', '1.000'] [Step 154 / Rank 7] Tasks: ['Code', 'Summarization'] | Lens: [31216, 31230] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:36:50,170 >> @ 154 | Loss: 2.1602 | LM: 2.0853 | Reg: 0.0748 | Spa(Avg): 0.520 [INFO|lh_trainer.py:797] 2026-02-17 01:36:50,170 >> Statistic -> Code | Spa: 0.633 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:797] 2026-02-17 01:36:50,170 >> Statistic -> In-Context | Spa: 0.670 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:36:50,170 >> Statistic -> MultiHop | Spa: 0.562 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:36:50,170 >> Statistic -> Single | Spa: 0.421 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:36:50,170 >> Statistic -> Summarization | Spa: 0.595 | Tgt: 1.000 | Z-Loss: 0.130 | [INFO|lh_trainer.py:810] 2026-02-17 01:36:50,172 >> [Micro-Log] {"loss": 2.160151361565416, "lm_loss": 2.0853397835744545, "reg_loss": 0.07481158119238292, "model_sparsity(avg)": 0.5204508937895298, "Spa-In-Context Learning sparsity": 0.670138880610466, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11387394275516272, "Spa-Code sparsity": 0.6329365117209298, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11258429608174733, "Spa-Single QA sparsity": 0.42129628856976825, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.046771203736878104, "Spa-Summarization sparsity": 0.5953703641891479, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1300666222969691, "Spa-MultiHop QA sparsity": 0.5616830061463749, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 
0.08100335425971185, "step": 154, "current_tau": 1.025301456451416, "lambda1 Single QA": 0.56640625, "lambda2 MultiHop QA": 0.29296875, "lambda3 Summarization": 0.1357421875, "lambda4 Code": 0.236328125} [INFO|lh_trainer.py:331] 2026-02-17 01:37:06,934 >> {'loss': 12.9609, 'grad_norm': 0.7338926196098328, 'learning_rate': 0.0003334517314632712, 'epoch': 0.16324381253291206, 'num_input_tokens_seen': 380908798, 'completed': '51.67% (155 / 300)', 'remaining time': '6:47:14', 'throughput': '7734.42', 'gpu_mem_free': '8445MB', 'step': 155} [Step 155 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [36691] → Tgt Spa: ['1.000'] [Step 155 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [51217] → Tgt Spa: ['1.000'] [Step 155 / Rank 1] Tasks: ['Single QA'] | Lens: [49696] → Tgt Spa: ['0.350'] [Step 155 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [51217] → Tgt Spa: ['1.000'] [Step 155 / Rank 6] Tasks: ['Code'] | Lens: [45079] → Tgt Spa: ['1.000'] [Step 155 / Rank 7] Tasks: ['Code'] | Lens: [45079] → Tgt Spa: ['1.000'] [Step 155 / Rank 0] Tasks: ['Single QA'] | Lens: [49696] → Tgt Spa: ['0.350'] [Step 155 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [36691] → Tgt Spa: ['1.000'] [Step 155 / Rank 6] Tasks: ['Code'] | Lens: [53510] → Tgt Spa: ['1.000'] [Step 155 / Rank 7] Tasks: ['Code'] | Lens: [53510] → Tgt Spa: ['1.000'] [Step 155 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24407, 24407] → Tgt Spa: ['0.350', '1.000'] [Step 155 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41490] → Tgt Spa: ['1.000'] [Step 155 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24407, 24407] → Tgt Spa: ['0.350', '1.000'] [Step 155 / Rank 3] Tasks: ['Single QA'] | Lens: [58864] → Tgt Spa: ['0.350'] [Step 155 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41490] → Tgt Spa: ['1.000'] [Step 155 / Rank 2] Tasks: ['Single QA'] | Lens: [58864] → Tgt Spa: ['0.350'] [Step 155 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26558, 
26558] → Tgt Spa: ['1.000', '1.000'] [Step 155 / Rank 5] Tasks: ['Single QA'] | Lens: [64978] → Tgt Spa: ['0.350'] [Step 155 / Rank 4] Tasks: ['Single QA'] | Lens: [64978] → Tgt Spa: ['0.350'] [Step 155 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41301] → Tgt Spa: ['1.000'] [Step 155 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26558, 26558] → Tgt Spa: ['1.000', '1.000'] [Step 155 / Rank 0] Tasks: ['Single QA'] | Lens: [51241] → Tgt Spa: ['0.350'] [Step 155 / Rank 1] Tasks: ['Single QA'] | Lens: [51241] → Tgt Spa: ['0.350'] [Step 155 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41301] → Tgt Spa: ['1.000'] [Step 155 / Rank 4] Tasks: ['Single QA'] | Lens: [50745] → Tgt Spa: ['0.350'] [Step 155 / Rank 7] Tasks: ['Single QA'] | Lens: [63259] → Tgt Spa: ['0.350'] [Step 155 / Rank 5] Tasks: ['Single QA'] | Lens: [50745] → Tgt Spa: ['0.350'] [Step 155 / Rank 6] Tasks: ['Single QA'] | Lens: [63259] → Tgt Spa: ['0.350'] [Step 155 / Rank 2] Tasks: ['Single QA'] | Lens: [46142] → Tgt Spa: ['0.350'] [Step 155 / Rank 3] Tasks: ['Single QA'] | Lens: [46142] → Tgt Spa: ['0.350'] [Step 155 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32158, 32158] → Tgt Spa: ['0.350', '0.350'] [Step 155 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32158, 32158] → Tgt Spa: ['0.350', '0.350'] [Step 155 / Rank 0] Tasks: ['Single QA'] | Lens: [54043] → Tgt Spa: ['0.350'] [Step 155 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [19186, 19184, 19184] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 155 / Rank 7] Tasks: ['Single QA'] | Lens: [52407] → Tgt Spa: ['0.350'] [Step 155 / Rank 6] Tasks: ['Single QA'] | Lens: [52407] → Tgt Spa: ['0.350'] [Step 155 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [19186, 19184, 19184] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 155 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [23155, 23150] → Tgt Spa: ['1.000', '1.000'] [Step 155 / Rank 1] Tasks: ['Single QA'] | Lens: [54043] → Tgt Spa: ['0.350'] 
[Step 155 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [23155, 23150] → Tgt Spa: ['1.000', '1.000'] [Step 155 / Rank 1] Tasks: ['Single QA'] | Lens: [50004] → Tgt Spa: ['0.350'] [Step 155 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42912] → Tgt Spa: ['1.000'] [Step 155 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [25628, 25628] → Tgt Spa: ['0.350', '0.350'] [Step 155 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17547, 17539, 17551] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 155 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42912] → Tgt Spa: ['1.000'] [Step 155 / Rank 0] Tasks: ['Single QA'] | Lens: [50004] → Tgt Spa: ['0.350'] [Step 155 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [25628, 25628] → Tgt Spa: ['0.350', '0.350'] [Step 155 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17547, 17539, 17551] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:39:39,940 >> @ 155 | Loss: 2.1634 | LM: 2.0930 | Reg: 0.0704 | Spa(Avg): 0.538 [INFO|lh_trainer.py:797] 2026-02-17 01:39:39,940 >> Statistic -> Code | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 01:39:39,940 >> Statistic -> In-Context | Spa: 0.696 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:39:39,940 >> Statistic -> MultiHop | Spa: 0.562 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:39:39,940 >> Statistic -> Single | Spa: 0.415 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:39:39,940 >> Statistic -> Summarization | Spa: 0.528 | Tgt: 1.000 | Z-Loss: 0.165 | [INFO|lh_trainer.py:810] 2026-02-17 01:39:39,942 >> [Micro-Log] {"loss": 2.1633959698180356, "lm_loss": 2.092961026355624, "reg_loss": 0.07043493682673822, "model_sparsity(avg)": 0.5377121890584627, "Spa-Single QA sparsity": 0.41481480598449705, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04484514094268282, 
"Spa-In-Context Learning sparsity": 0.6959876616795858, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10448518147071202, "Spa-Code sparsity": 0.6805555479867118, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09446870003427778, "Spa-Summarization sparsity": 0.5277777910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16508124768733978, "Spa-MultiHop QA sparsity": 0.5616830061463749, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08100335425971185, "step": 155, "current_tau": 1.0234230756759644, "lambda1 Single QA": 0.56640625, "lambda2 MultiHop QA": 0.29296875, "lambda3 Summarization": 0.13671875, "lambda4 Code": 0.236328125} [INFO|lh_trainer.py:331] 2026-02-17 01:39:57,562 >> {'loss': 12.9804, 'grad_norm': 0.6998718976974487, 'learning_rate': 0.0003303598832898038, 'epoch': 0.16429699842022116, 'num_input_tokens_seen': 383363952, 'completed': '52.00% (156 / 300)', 'remaining time': '6:44:27', 'throughput': '7194.49', 'gpu_mem_free': '10721MB', 'step': 156} [Step 156 / Rank 5] Tasks: ['Single QA'] | Lens: [41244] → Tgt Spa: ['0.350'] [Step 156 / Rank 0] Tasks: ['Single QA'] | Lens: [44040] → Tgt Spa: ['0.350'] [Step 156 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [14645, 14652, 14655, 14662] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000'] [Step 156 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [14645, 14652, 14655, 14662] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000'] [Step 156 / Rank 7] Tasks: ['Single QA'] | Lens: [64597] → Tgt Spa: ['0.350'] [Step 156 / Rank 6] Tasks: ['Single QA'] | Lens: [64597] → Tgt Spa: ['0.350'] [Step 156 / Rank 4] Tasks: ['Single QA'] | Lens: [41244] → Tgt Spa: ['0.350'] [Step 156 / Rank 1] Tasks: ['Single QA'] | Lens: [44040] → Tgt Spa: ['0.350'] [Step 156 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [62651] → Tgt Spa: ['1.000'] [Step 156 / Rank 4] Tasks: 
['Code', 'Code', 'Code'] | Lens: [20101, 20100, 20101] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 156 / Rank 3] Tasks: ['Single QA'] | Lens: [45997] → Tgt Spa: ['0.350'] [Step 156 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24267, 24267] → Tgt Spa: ['0.350', '0.350'] [Step 156 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24267, 24267] → Tgt Spa: ['0.350', '0.350'] [Step 156 / Rank 2] Tasks: ['Single QA'] | Lens: [45997] → Tgt Spa: ['0.350'] [Step 156 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [20101, 20100, 20101] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 156 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [62651] → Tgt Spa: ['1.000'] [Step 156 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55022] → Tgt Spa: ['1.000'] [Step 156 / Rank 1] Tasks: ['Code'] | Lens: [43246] → Tgt Spa: ['1.000'] [Step 156 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [53250] → Tgt Spa: ['1.000'] [Step 156 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [22906, 22915] → Tgt Spa: ['1.000', '1.000'] [Step 156 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55022] → Tgt Spa: ['1.000'] [Step 156 / Rank 0] Tasks: ['Code'] | Lens: [43246] → Tgt Spa: ['1.000'] [Step 156 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [53250] → Tgt Spa: ['1.000'] [Step 156 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [22906, 22915] → Tgt Spa: ['1.000', '1.000'] [Step 156 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12034, 12034, 12035, 12036, 12037] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 156 / Rank 1] Tasks: ['Code'] | Lens: [46861] → Tgt Spa: ['1.000'] [Step 156 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40457] → Tgt Spa: ['1.000'] [Step 156 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12034, 12034, 12035, 12036, 12037] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 156 / Rank 2] Tasks: ['Code'] | Lens: [41370] → 
Tgt Spa: ['1.000'] [Step 156 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40457] → Tgt Spa: ['1.000'] [Step 156 / Rank 3] Tasks: ['Code'] | Lens: [41370] → Tgt Spa: ['1.000'] [Step 156 / Rank 0] Tasks: ['Code'] | Lens: [46861] → Tgt Spa: ['1.000'] [Step 156 / Rank 5] Tasks: ['Single QA'] | Lens: [46146] → Tgt Spa: ['0.350'] [Step 156 / Rank 1] Tasks: ['Code'] | Lens: [34479] → Tgt Spa: ['1.000'] [Step 156 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [25952, 25946] → Tgt Spa: ['1.000', '1.000'] [Step 156 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [25952, 25946] → Tgt Spa: ['1.000', '1.000'] [Step 156 / Rank 2] Tasks: ['Single QA'] | Lens: [38810] → Tgt Spa: ['0.350'] [Step 156 / Rank 3] Tasks: ['Single QA'] | Lens: [38810] → Tgt Spa: ['0.350'] [Step 156 / Rank 0] Tasks: ['Code'] | Lens: [34479] → Tgt Spa: ['1.000'] [Step 156 / Rank 4] Tasks: ['Single QA'] | Lens: [46146] → Tgt Spa: ['0.350'] [Step 156 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59553] → Tgt Spa: ['1.000'] [Step 156 / Rank 3] Tasks: ['Single QA'] | Lens: [65092] → Tgt Spa: ['0.350'] [Step 156 / Rank 1] Tasks: ['MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Code', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA'] | Lens: [2038, 2056, 2039, 2038, 2038, 2040, 2043, 2059, 2041, 2042, 2043, 2060, 2041, 2049, 2044, 2061, 2043, 2046, 2043, 2046, 2064, 2063, 2044, 2052, 2064, 2064, 2046, 2048, 2048, 2047, 2048] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', 
'0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 156 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59553] → Tgt Spa: ['1.000'] [Step 156 / Rank 6] Tasks: ['Single QA'] | Lens: [56618] → Tgt Spa: ['0.350'] [Step 156 / Rank 0] Tasks: ['MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Code', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA'] | Lens: [2038, 2056, 2039, 2038, 2038, 2040, 2043, 2059, 2041, 2042, 2043, 2060, 2041, 2049, 2044, 2061, 2043, 2046, 2043, 2046, 2064, 2063, 2044, 2052, 2064, 2064, 2046, 2048, 2048, 2047, 2048] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 156 / Rank 7] Tasks: ['Single QA'] | Lens: [56618] → Tgt Spa: ['0.350'] [Step 156 / Rank 2] Tasks: ['Single QA'] | Lens: [65092] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:42:21,030 >> @ 156 | Loss: 1.9604 | LM: 1.8908 | Reg: 0.0696 | Spa(Avg): 0.528 [INFO|lh_trainer.py:797] 2026-02-17 01:42:21,030 >> Statistic -> Code | Spa: 0.647 | Tgt: 1.000 | Z-Loss: 0.107 | [INFO|lh_trainer.py:797] 2026-02-17 01:42:21,030 >> Statistic -> In-Context | Spa: 0.690 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:42:21,030 >> Statistic -> MultiHop | Spa: 0.568 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:42:21,030 >> Statistic -> Single | Spa: 0.382 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 
2026-02-17 01:42:21,030 >> Statistic -> Summarization | Spa: 0.590 | Tgt: 1.000 | Z-Loss: 0.134 | [INFO|lh_trainer.py:810] 2026-02-17 01:42:21,032 >> [Micro-Log] {"loss": 1.9603886225571234, "lm_loss": 1.890761844192942, "reg_loss": 0.06962677915968622, "model_sparsity(avg)": 0.5283760788540045, "Spa-Single QA sparsity": 0.38230992932068675, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.028411470525162786, "Spa-Code sparsity": 0.6469907462596893, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10722110606729984, "Spa-MultiHop QA sparsity": 0.5680555552244186, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08345128539949656, "Spa-Summarization sparsity": 0.5902777761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13365738466382027, "Spa-In-Context Learning sparsity": 0.6904761961528233, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10663476479904992, "step": 156, "current_tau": 1.021613597869873, "lambda1 Single QA": 0.56640625, "lambda2 MultiHop QA": 0.29296875, "lambda3 Summarization": 0.1376953125, "lambda4 Code": 0.2373046875} [INFO|lh_trainer.py:331] 2026-02-17 01:42:47,908 >> {'loss': 11.7623, 'grad_norm': 0.8167688250541687, 'learning_rate': 0.00032725426586831203, 'epoch': 0.1653501843075303, 'num_input_tokens_seen': 385820504, 'completed': '52.33% (157 / 300)', 'remaining time': '6:41:40', 'throughput': '7210.46', 'gpu_mem_free': '7407MB', 'step': 157} [Step 157 / Rank 4] Tasks: ['Single QA'] | Lens: [51073] → Tgt Spa: ['0.350'] [Step 157 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [16045, 16046, 16046, 16046] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 157 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [16045, 16046, 16046, 16046] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 157 / Rank 6] Tasks: ['Summarization'] | Lens: 
[47095] → Tgt Spa: ['1.000'] [Step 157 / Rank 5] Tasks: ['Single QA'] | Lens: [51073] → Tgt Spa: ['0.350'] [Step 157 / Rank 7] Tasks: ['Summarization'] | Lens: [47095] → Tgt Spa: ['1.000'] [Step 157 / Rank 2] Tasks: ['Code', 'Summarization'] | Lens: [30124, 30143] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 3] Tasks: ['Code', 'Summarization'] | Lens: [30124, 30143] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25947, 25947] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25927, 25928] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25947, 25947] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25927, 25928] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 4] Tasks: ['Single QA', 'Code'] | Lens: [23540, 23550] → Tgt Spa: ['0.350', '1.000'] [Step 157 / Rank 5] Tasks: ['Single QA', 'Code'] | Lens: [23540, 23550] → Tgt Spa: ['0.350', '1.000'] [Step 157 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [55997] → Tgt Spa: ['1.000'] [Step 157 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [55997] → Tgt Spa: ['1.000'] [Step 157 / Rank 1] Tasks: ['Single QA'] | Lens: [55775] → Tgt Spa: ['0.350'] [Step 157 / Rank 7] Tasks: ['Single QA'] | Lens: [34946] → Tgt Spa: ['0.350'] [Step 157 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25777, 25778] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 0] Tasks: ['Single QA'] | Lens: [55775] → Tgt Spa: ['0.350'] [Step 157 / Rank 2] Tasks: ['Single QA'] | Lens: [50637] → Tgt Spa: ['0.350'] [Step 157 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25777, 25778] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 6] Tasks: ['Single QA'] | Lens: [34946] → Tgt Spa: ['0.350'] [Step 157 / Rank 3] Tasks: ['Single QA'] | Lens: [50637] → Tgt 
Spa: ['0.350'] [Step 157 / Rank 5] Tasks: ['Single QA'] | Lens: [47135] → Tgt Spa: ['0.350'] [Step 157 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23827, 23809] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23827, 23809] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 4] Tasks: ['Single QA'] | Lens: [47135] → Tgt Spa: ['0.350'] [Step 157 / Rank 3] Tasks: ['Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [7952, 7952, 7955, 7955, 7974, 7964, 7956, 7956] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 157 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17242, 17242, 17233] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 157 / Rank 2] Tasks: ['Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [7952, 7952, 7955, 7955, 7974, 7964, 7956, 7956] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 157 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17242, 17242, 17233] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 157 / Rank 1] Tasks: ['Summarization'] | Lens: [35038] → Tgt Spa: ['1.000'] [Step 157 / Rank 6] Tasks: ['Single QA'] | Lens: [34881] → Tgt Spa: ['0.350'] [Step 157 / Rank 7] Tasks: ['Single QA'] | Lens: [34881] → Tgt Spa: ['0.350'] [Step 157 / Rank 2] Tasks: ['Single QA'] | Lens: [53369] → Tgt Spa: ['0.350'] [Step 157 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [30183, 30186] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 3] Tasks: ['Single QA'] | Lens: [53369] → Tgt Spa: ['0.350'] [Step 157 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [30183, 30186] → Tgt Spa: ['1.000', '1.000'] [Step 157 / Rank 0] Tasks: ['Summarization'] | Lens: [35038] → Tgt Spa: ['1.000'] [Step 157 / Rank 6] Tasks: ['Single QA'] | Lens: 
[49587] → Tgt Spa: ['0.350'] [Step 157 / Rank 1] Tasks: ['Single QA'] | Lens: [42588] → Tgt Spa: ['0.350'] [Step 157 / Rank 7] Tasks: ['Single QA'] | Lens: [49587] → Tgt Spa: ['0.350'] [Step 157 / Rank 5] Tasks: ['Single QA'] | Lens: [56499] → Tgt Spa: ['0.350'] [Step 157 / Rank 0] Tasks: ['Single QA'] | Lens: [42588] → Tgt Spa: ['0.350'] [Step 157 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [32633, 32641] → Tgt Spa: ['0.350', '1.000'] [Step 157 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [32633, 32641] → Tgt Spa: ['0.350', '1.000'] [Step 157 / Rank 4] Tasks: ['Single QA'] | Lens: [56499] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:45:01,029 >> @ 157 | Loss: 2.1869 | LM: 2.1206 | Reg: 0.0663 | Spa(Avg): 0.491 [INFO|lh_trainer.py:797] 2026-02-17 01:45:01,029 >> Statistic -> Code | Spa: 0.573 | Tgt: 1.000 | Z-Loss: 0.136 | [INFO|lh_trainer.py:797] 2026-02-17 01:45:01,029 >> Statistic -> In-Context | Spa: 0.701 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:45:01,029 >> Statistic -> MultiHop | Spa: 0.568 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:45:01,029 >> Statistic -> Single | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:45:01,029 >> Statistic -> Summarization | Spa: 0.587 | Tgt: 1.000 | Z-Loss: 0.136 | [INFO|lh_trainer.py:810] 2026-02-17 01:45:01,031 >> [Micro-Log] {"loss": 2.186923316369454, "lm_loss": 2.1205901950597763, "reg_loss": 0.06633314990904182, "model_sparsity(avg)": 0.4912712201476097, "Spa-Single QA sparsity": 0.3930555522441864, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.028316794871352614, "Spa-In-Context Learning sparsity": 0.7013889133930207, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10258101373910904, "Spa-Summarization sparsity": 0.5873015778405326, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1364481332046645, "Spa-Code 
sparsity": 0.5734126993588039, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.13598328615937913, "Spa-MultiHop QA sparsity": 0.5680555552244186, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08345128539949656, "step": 157, "current_tau": 1.0198737382888794, "lambda1 Single QA": 0.56640625, "lambda2 MultiHop QA": 0.294921875, "lambda3 Summarization": 0.1376953125, "lambda4 Code": 0.2373046875} [INFO|lh_trainer.py:331] 2026-02-17 01:45:22,410 >> {'loss': 13.1215, 'grad_norm': 0.7050719261169434, 'learning_rate': 0.0003241354113303533, 'epoch': 0.1664033701948394, 'num_input_tokens_seen': 388280752, 'completed': '52.67% (158 / 300)', 'remaining time': '6:38:39', 'throughput': '7961.86', 'gpu_mem_free': '12513MB', 'step': 158} [Step 158 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32138, 32138] → Tgt Spa: ['0.350', '0.350'] [Step 158 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [28517, 28509] → Tgt Spa: ['1.000', '1.000'] [Step 158 / Rank 2] Tasks: ['Single QA'] | Lens: [63493] → Tgt Spa: ['0.350'] [Step 158 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29950, 29951] → Tgt Spa: ['0.350', '0.350'] [Step 158 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [28517, 28509] → Tgt Spa: ['1.000', '1.000'] [Step 158 / Rank 3] Tasks: ['Single QA'] | Lens: [63493] → Tgt Spa: ['0.350'] [Step 158 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29950, 29951] → Tgt Spa: ['0.350', '0.350'] [Step 158 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32138, 32138] → Tgt Spa: ['0.350', '0.350'] [Step 158 / Rank 6] Tasks: ['Single QA'] | Lens: [64044] → Tgt Spa: ['0.350'] [Step 158 / Rank 1] Tasks: ['Single QA'] | Lens: [65018] → Tgt Spa: ['0.350'] [Step 158 / Rank 7] Tasks: ['Single QA'] | Lens: [64044] → Tgt Spa: ['0.350'] [Step 158 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [28247, 28248] → Tgt Spa: ['0.350', '0.350'] [Step 158 / Rank 4] Tasks: ['Single QA'] | Lens: [58143] → Tgt Spa: 
['0.350'] [Step 158 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [28247, 28248] → Tgt Spa: ['0.350', '0.350'] [Step 158 / Rank 0] Tasks: ['Single QA'] | Lens: [65018] → Tgt Spa: ['0.350'] [Step 158 / Rank 5] Tasks: ['Single QA'] | Lens: [58143] → Tgt Spa: ['0.350'] [Step 158 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17759, 17771, 17765] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 158 / Rank 7] Tasks: ['Code'] | Lens: [41860] → Tgt Spa: ['1.000'] [Step 158 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17759, 17771, 17765] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 158 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [22625, 22619] → Tgt Spa: ['1.000', '1.000'] [Step 158 / Rank 1] Tasks: ['Single QA'] | Lens: [65010] → Tgt Spa: ['0.350'] [Step 158 / Rank 6] Tasks: ['Code'] | Lens: [41860] → Tgt Spa: ['1.000'] [Step 158 / Rank 0] Tasks: ['Single QA'] | Lens: [65010] → Tgt Spa: ['0.350'] [Step 158 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [22625, 22619] → Tgt Spa: ['1.000', '1.000'] [Step 158 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [31073, 31067] → Tgt Spa: ['1.000', '1.000'] [Step 158 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27304, 27306] → Tgt Spa: ['1.000', '1.000'] [Step 158 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [31073, 31067] → Tgt Spa: ['1.000', '1.000'] [Step 158 / Rank 5] Tasks: ['Single QA'] | Lens: [38016] → Tgt Spa: ['0.350'] [Step 158 / Rank 4] Tasks: ['Single QA'] | Lens: [38016] → Tgt Spa: ['0.350'] [Step 158 / Rank 2] Tasks: ['Code'] | Lens: [33852] → Tgt Spa: ['1.000'] [Step 158 / Rank 3] Tasks: ['Code'] | Lens: [33852] → Tgt Spa: ['1.000'] [Step 158 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27304, 27306] → Tgt Spa: ['1.000', '1.000'] [Step 158 / Rank 1] Tasks: ['Single QA'] | Lens: [34237] → Tgt Spa: ['0.350'] [Step 158 / Rank 7] Tasks: ['Single QA'] | Lens: [53184] → Tgt Spa: ['0.350'] 
[Step 158 / Rank 0] Tasks: ['Single QA'] | Lens: [34237] → Tgt Spa: ['0.350'] [Step 158 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61754] → Tgt Spa: ['1.000'] [Step 158 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58906] → Tgt Spa: ['1.000'] [Step 158 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58906] → Tgt Spa: ['1.000'] [Step 158 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61754] → Tgt Spa: ['1.000'] [Step 158 / Rank 6] Tasks: ['Single QA'] | Lens: [53184] → Tgt Spa: ['0.350'] [Step 158 / Rank 7] Tasks: ['Single QA'] | Lens: [49072] → Tgt Spa: ['0.350'] [Step 158 / Rank 6] Tasks: ['Single QA'] | Lens: [49072] → Tgt Spa: ['0.350'] [Step 158 / Rank 1] Tasks: ['Single QA'] | Lens: [56717] → Tgt Spa: ['0.350'] [Step 158 / Rank 5] Tasks: ['Single QA'] | Lens: [38751] → Tgt Spa: ['0.350'] [Step 158 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [63528] → Tgt Spa: ['1.000'] [Step 158 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [63528] → Tgt Spa: ['1.000'] [Step 158 / Rank 0] Tasks: ['Single QA'] | Lens: [56717] → Tgt Spa: ['0.350'] [Step 158 / Rank 4] Tasks: ['Single QA'] | Lens: [38751] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:48:11,016 >> @ 158 | Loss: 2.1429 | LM: 2.0888 | Reg: 0.0542 | Spa(Avg): 0.500 [INFO|lh_trainer.py:797] 2026-02-17 01:48:11,016 >> Statistic -> Code | Spa: 0.669 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:797] 2026-02-17 01:48:11,016 >> Statistic -> In-Context | Spa: 0.687 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:48:11,016 >> Statistic -> MultiHop | Spa: 0.568 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:48:11,017 >> Statistic -> Single | Spa: 0.376 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:48:11,017 >> Statistic -> Summarization | Spa: 0.625 | Tgt: 1.000 | Z-Loss: 0.115 | [INFO|lh_trainer.py:810] 2026-02-17 01:48:11,018 >> [Micro-Log] {"loss": 2.142938549319903, "lm_loss": 2.088780132432779, "reg_loss": 
0.05415842297952622, "model_sparsity(avg)": 0.4999999937911828, "Spa-Single QA sparsity": 0.375816986841314, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01962680950322572, "Spa-In-Context Learning sparsity": 0.6874999776482582, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10794482566416264, "Spa-Code sparsity": 0.6686507974352155, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09998914812292371, "Spa-Summarization sparsity": 0.625, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.115447998046875, "Spa-MultiHop QA sparsity": 0.5680555552244186, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08345128539949656, "step": 158, "current_tau": 1.0182039737701416, "lambda1 Single QA": 0.56640625, "lambda2 MultiHop QA": 0.294921875, "lambda3 Summarization": 0.138671875, "lambda4 Code": 0.23828125} [INFO|lh_trainer.py:331] 2026-02-17 01:48:36,345 >> {'loss': 12.8576, 'grad_norm': 0.5963127017021179, 'learning_rate': 0.0003210038540755971, 'epoch': 0.1674565560821485, 'num_input_tokens_seen': 390877896, 'completed': '53.00% (159 / 300)', 'remaining time': '6:36:13', 'throughput': '6695.93', 'gpu_mem_free': '7239MB', 'step': 159} [Step 159 / Rank 5] Tasks: ['Single QA'] | Lens: [46757] → Tgt Spa: ['0.350'] [Step 159 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27353, 27357] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 1] Tasks: ['Code'] | Lens: [57428] → Tgt Spa: ['1.000'] [Step 159 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28756, 28756] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28756, 28756] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 4] Tasks: ['Single QA'] | Lens: [46757] → Tgt Spa: ['0.350'] [Step 159 / Rank 0] Tasks: ['Code'] | Lens: [57428] → Tgt Spa: ['1.000'] [Step 159 / Rank 2] Tasks: 
['In-Context Learning', 'In-Context Learning'] | Lens: [27353, 27357] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 4] Tasks: ['Code'] | Lens: [54299] → Tgt Spa: ['1.000'] [Step 159 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25011, 25011] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 0] Tasks: ['Single QA'] | Lens: [50887] → Tgt Spa: ['0.350'] [Step 159 / Rank 5] Tasks: ['Code'] | Lens: [54299] → Tgt Spa: ['1.000'] [Step 159 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25011, 25011] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 6] Tasks: ['Single QA'] | Lens: [64523] → Tgt Spa: ['0.350'] [Step 159 / Rank 1] Tasks: ['Single QA'] | Lens: [50887] → Tgt Spa: ['0.350'] [Step 159 / Rank 7] Tasks: ['Single QA'] | Lens: [64523] → Tgt Spa: ['0.350'] [Step 159 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32067, 32067] → Tgt Spa: ['0.350', '0.350'] [Step 159 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [31959, 31990] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 0] Tasks: ['Single QA'] | Lens: [54859] → Tgt Spa: ['0.350'] [Step 159 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39156] → Tgt Spa: ['1.000'] [Step 159 / Rank 1] Tasks: ['Single QA'] | Lens: [54859] → Tgt Spa: ['0.350'] [Step 159 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32067, 32067] → Tgt Spa: ['0.350', '0.350'] [Step 159 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39156] → Tgt Spa: ['1.000'] [Step 159 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [31959, 31990] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 5] Tasks: ['Single QA'] | Lens: [59043] → Tgt Spa: ['0.350'] [Step 159 / Rank 6] Tasks: ['Single QA'] | Lens: [40267] → Tgt Spa: ['0.350'] [Step 159 / Rank 7] Tasks: ['Single QA'] | Lens: [40267] → Tgt Spa: ['0.350'] [Step 159 / Rank 3] Tasks: ['Single QA'] | Lens: [50652] → Tgt Spa: ['0.350'] [Step 159 / Rank 4] Tasks: ['Single QA'] | Lens: [59043] → Tgt Spa: ['0.350'] [Step 159 
/ Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23716, 23718] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23716, 23718] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 2] Tasks: ['Single QA'] | Lens: [50652] → Tgt Spa: ['0.350'] [Step 159 / Rank 1] Tasks: ['Code'] | Lens: [60148] → Tgt Spa: ['1.000'] [Step 159 / Rank 4] Tasks: ['Single QA'] | Lens: [65028] → Tgt Spa: ['0.350'] [Step 159 / Rank 3] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7935, 7935, 7934, 7929, 7929, 7929, 7929, 7929] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 159 / Rank 6] Tasks: ['Single QA'] | Lens: [47055] → Tgt Spa: ['0.350'] [Step 159 / Rank 2] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7935, 7935, 7934, 7929, 7929, 7929, 7929, 7929] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 159 / Rank 7] Tasks: ['Single QA'] | Lens: [47055] → Tgt Spa: ['0.350'] [Step 159 / Rank 0] Tasks: ['Code'] | Lens: [60148] → Tgt Spa: ['1.000'] [Step 159 / Rank 5] Tasks: ['Single QA'] | Lens: [65028] → Tgt Spa: ['0.350'] [Step 159 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24430, 24431] → Tgt Spa: ['1.000', '1.000'] [Step 159 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'Code', 'Summarization', 'Summarization', 'Code', 'In-Context Learning'] | Lens: [5935, 5928, 5931, 5950, 5950, 5933, 5941, 5954, 5954, 5943, 5938] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 159 / Rank 7] Tasks: ['Single QA'] | Lens: [55414] → Tgt Spa: ['0.350'] [Step 159 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 
'Summarization', 'In-Context Learning', 'Code', 'Summarization', 'Summarization', 'Code', 'In-Context Learning'] | Lens: [5935, 5928, 5931, 5950, 5950, 5933, 5941, 5954, 5954, 5943, 5938] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 159 / Rank 6] Tasks: ['Single QA'] | Lens: [55414] → Tgt Spa: ['0.350'] [Step 159 / Rank 0] Tasks: ['Single QA'] | Lens: [38716] → Tgt Spa: ['0.350'] [Step 159 / Rank 1] Tasks: ['Single QA'] | Lens: [38716] → Tgt Spa: ['0.350'] [Step 159 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24430, 24431] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:51:17,000 >> @ 159 | Loss: 2.1185 | LM: 2.0525 | Reg: 0.0660 | Spa(Avg): 0.527 [INFO|lh_trainer.py:797] 2026-02-17 01:51:17,001 >> Statistic -> Code | Spa: 0.651 | Tgt: 1.000 | Z-Loss: 0.107 | [INFO|lh_trainer.py:797] 2026-02-17 01:51:17,001 >> Statistic -> In-Context | Spa: 0.691 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:51:17,001 >> Statistic -> MultiHop | Spa: 0.568 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:51:17,001 >> Statistic -> Single | Spa: 0.430 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:51:17,001 >> Statistic -> Summarization | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.138 | [INFO|lh_trainer.py:810] 2026-02-17 01:51:17,003 >> [Micro-Log] {"loss": 2.118452193836371, "lm_loss": 2.0524823426579437, "reg_loss": 0.06596986456618954, "model_sparsity(avg)": 0.5268768444657326, "Spa-Code sparsity": 0.6512345737881131, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1067909234099918, "Spa-Single QA sparsity": 0.42978395024935406, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.053114076507174306, "Spa-In-Context Learning sparsity": 0.6909722313284874, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10745417792350054, 
"Spa-Summarization sparsity": 0.5833333373069763, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1378142148256302, "Spa-MultiHop QA sparsity": 0.5680555552244186, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08345128539949656, "step": 159, "current_tau": 1.0166049003601074, "lambda1 Single QA": 0.56640625, "lambda2 MultiHop QA": 0.294921875, "lambda3 Summarization": 0.138671875, "lambda4 Code": 0.23828125} [INFO|lh_trainer.py:331] 2026-02-17 01:51:37,680 >> {'loss': 12.7107, 'grad_norm': 0.6776026487350464, 'learning_rate': 0.0003178601306802573, 'epoch': 0.16850974196945762, 'num_input_tokens_seen': 393477216, 'completed': '53.33% (160 / 300)', 'remaining time': '6:33:36', 'throughput': '7167.18', 'gpu_mem_free': '13385MB', 'step': 160} [Step 160 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [16135, 16135, 16135, 16135] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 160 / Rank 2] Tasks: ['Single QA'] | Lens: [52191] → Tgt Spa: ['0.350'] [Step 160 / Rank 0] Tasks: ['Single QA'] | Lens: [53908] → Tgt Spa: ['0.350'] [Step 160 / Rank 6] Tasks: ['Single QA', 'MultiHop QA', 'Code', 'Single QA'] | Lens: [15414, 15417, 15426, 15423] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350'] [Step 160 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [16135, 16135, 16135, 16135] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 160 / Rank 3] Tasks: ['Single QA'] | Lens: [52191] → Tgt Spa: ['0.350'] [Step 160 / Rank 1] Tasks: ['Single QA'] | Lens: [53908] → Tgt Spa: ['0.350'] [Step 160 / Rank 7] Tasks: ['Single QA', 'MultiHop QA', 'Code', 'Single QA'] | Lens: [15414, 15417, 15426, 15423] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350'] [Step 160 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [23778, 23778] → Tgt Spa: ['1.000', '1.000'] [Step 160 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [23778, 23778] → Tgt Spa: ['1.000', '1.000'] [Step 160 / 
Rank 0] Tasks: ['Single QA'] | Lens: [45687] → Tgt Spa: ['0.350'] [Step 160 / Rank 1] Tasks: ['Single QA'] | Lens: [45687] → Tgt Spa: ['0.350'] [Step 160 / Rank 7] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [21063, 21070, 21070] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 160 / Rank 2] Tasks: ['Summarization', 'Summarization'] | Lens: [23328, 23329] → Tgt Spa: ['1.000', '1.000'] [Step 160 / Rank 3] Tasks: ['Summarization', 'Summarization'] | Lens: [23328, 23329] → Tgt Spa: ['1.000', '1.000'] [Step 160 / Rank 6] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [21063, 21070, 21070] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 160 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23063, 23083] → Tgt Spa: ['1.000', '1.000'] [Step 160 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [31118, 31111] → Tgt Spa: ['1.000', '0.350'] [Step 160 / Rank 2] Tasks: ['Single QA'] | Lens: [46394] → Tgt Spa: ['0.350'] [Step 160 / Rank 6] Tasks: ['Code'] | Lens: [63055] → Tgt Spa: ['1.000'] [Step 160 / Rank 3] Tasks: ['Single QA'] | Lens: [46394] → Tgt Spa: ['0.350'] [Step 160 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23063, 23083] → Tgt Spa: ['1.000', '1.000'] [Step 160 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [31118, 31111] → Tgt Spa: ['1.000', '0.350'] [Step 160 / Rank 7] Tasks: ['Code'] | Lens: [63055] → Tgt Spa: ['1.000'] [Step 160 / Rank 1] Tasks: ['Single QA'] | Lens: [51482] → Tgt Spa: ['0.350'] [Step 160 / Rank 0] Tasks: ['Single QA'] | Lens: [51482] → Tgt Spa: ['0.350'] [Step 160 / Rank 4] Tasks: ['Single QA'] | Lens: [35488] → Tgt Spa: ['0.350'] [Step 160 / Rank 7] Tasks: ['Single QA'] | Lens: [41246] → Tgt Spa: ['0.350'] [Step 160 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18161, 18156, 18154] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 160 / Rank 6] Tasks: ['Single QA'] | Lens: [41246] → Tgt Spa: ['0.350'] [Step 160 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18161, 18156, 18154] → Tgt Spa: 
['1.000', '1.000', '1.000'] [Step 160 / Rank 5] Tasks: ['Single QA'] | Lens: [35488] → Tgt Spa: ['0.350'] [Step 160 / Rank 6] Tasks: ['Single QA'] | Lens: [58953] → Tgt Spa: ['0.350'] [Step 160 / Rank 7] Tasks: ['Single QA'] | Lens: [58953] → Tgt Spa: ['0.350'] [Step 160 / Rank 4] Tasks: ['Single QA'] | Lens: [49759] → Tgt Spa: ['0.350'] [Step 160 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14590, 14590, 14591, 14591] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 160 / Rank 5] Tasks: ['Single QA'] | Lens: [49759] → Tgt Spa: ['0.350'] [Step 160 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14590, 14590, 14591, 14591] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 160 / Rank 0] Tasks: ['MultiHop QA'] | Lens: [65339] → Tgt Spa: ['0.350'] [Step 160 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [65339] → Tgt Spa: ['0.350'] [Step 160 / Rank 5] Tasks: ['Single QA'] | Lens: [44671] → Tgt Spa: ['0.350'] [Step 160 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26844, 26844] → Tgt Spa: ['0.350', '1.000'] [Step 160 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27026, 27026] → Tgt Spa: ['1.000', '1.000'] [Step 160 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27026, 27026] → Tgt Spa: ['1.000', '1.000'] [Step 160 / Rank 1] Tasks: ['Single QA'] | Lens: [51491] → Tgt Spa: ['0.350'] [Step 160 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26844, 26844] → Tgt Spa: ['0.350', '1.000'] [Step 160 / Rank 0] Tasks: ['Single QA'] | Lens: [51491] → Tgt Spa: ['0.350'] [Step 160 / Rank 4] Tasks: ['Single QA'] | Lens: [44671] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:54:02,507 >> @ 160 | Loss: 1.7647 | LM: 1.7228 | Reg: 0.0419 | Spa(Avg): 0.456 [INFO|lh_trainer.py:797] 2026-02-17 01:54:02,508 >> Statistic -> Code | Spa: 0.660 | Tgt: 1.000 | Z-Loss: 0.103 | [INFO|lh_trainer.py:797] 2026-02-17 01:54:02,508 >> 
Statistic -> In-Context | Spa: 0.722 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:54:02,508 >> Statistic -> MultiHop | Spa: 0.424 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:54:02,508 >> Statistic -> Single | Spa: 0.349 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:54:02,508 >> Statistic -> Summarization | Spa: 0.625 | Tgt: 1.000 | Z-Loss: 0.118 | [INFO|lh_trainer.py:810] 2026-02-17 01:54:02,510 >> [Micro-Log] {"loss": 1.7647187717763397, "lm_loss": 1.7227809968171641, "reg_loss": 0.04193777613788067, "model_sparsity(avg)": 0.4560667375723521, "Spa-Single QA sparsity": 0.3493055433034897, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.008157856023171917, "Spa-Code sparsity": 0.6604938242170546, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1034963693883684, "Spa-MultiHop QA sparsity": 0.4236111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025571412096420925, "Spa-Summarization sparsity": 0.625, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11814296059310436, "Spa-In-Context Learning sparsity": 0.7222222089767456, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.09543336927890778, "step": 160, "current_tau": 1.0150768756866455, "lambda1 Single QA": 0.5703125, "lambda2 MultiHop QA": 0.294921875, "lambda3 Summarization": 0.1396484375, "lambda4 Code": 0.2392578125} [INFO|lh_trainer.py:331] 2026-02-17 01:54:21,070 >> {'loss': 10.5883, 'grad_norm': 0.45999521017074585, 'learning_rate': 0.00031470477980515406, 'epoch': 0.16956292785676672, 'num_input_tokens_seen': 396021712, 'completed': '53.67% (161 / 300)', 'remaining time': '6:30:43', 'throughput': '7786.54', 'gpu_mem_free': '8601MB', 'step': 161} [Step 161 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [52723] → Tgt Spa: ['1.000'] [Step 161 / Rank 3] Tasks: ['Single QA', 'Single QA', 
'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7908, 7910, 7910, 7912, 7912, 7912, 7912, 7912] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 161 / Rank 7] Tasks: ['Code', 'Summarization'] | Lens: [22191, 22204] → Tgt Spa: ['1.000', '1.000'] [Step 161 / Rank 6] Tasks: ['Code', 'Summarization'] | Lens: [22191, 22204] → Tgt Spa: ['1.000', '1.000'] [Step 161 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7908, 7910, 7910, 7912, 7912, 7912, 7912, 7912] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 161 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [25333, 25332] → Tgt Spa: ['0.350', '0.350'] [Step 161 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [25333, 25332] → Tgt Spa: ['0.350', '0.350'] [Step 161 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [52723] → Tgt Spa: ['1.000'] [Step 161 / Rank 4] Tasks: ['Code'] | Lens: [45172] → Tgt Spa: ['1.000'] [Step 161 / Rank 6] Tasks: ['Code'] | Lens: [63047] → Tgt Spa: ['1.000'] [Step 161 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22619, 22638] → Tgt Spa: ['1.000', '1.000'] [Step 161 / Rank 5] Tasks: ['Code'] | Lens: [45172] → Tgt Spa: ['1.000'] [Step 161 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39011] → Tgt Spa: ['1.000'] [Step 161 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22619, 22638] → Tgt Spa: ['1.000', '1.000'] [Step 161 / Rank 7] Tasks: ['Code'] | Lens: [63047] → Tgt Spa: ['1.000'] [Step 161 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39011] → Tgt Spa: ['1.000'] [Step 161 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40951] → Tgt Spa: ['1.000'] [Step 161 / Rank 5] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 161 / Rank 4] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 161 / Rank 6] Tasks: ['Single 
QA'] | Lens: [54057] → Tgt Spa: ['0.350'] [Step 161 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29941, 29941] → Tgt Spa: ['0.350', '0.350'] [Step 161 / Rank 7] Tasks: ['Single QA'] | Lens: [54057] → Tgt Spa: ['0.350'] [Step 161 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40951] → Tgt Spa: ['1.000'] [Step 161 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29941, 29941] → Tgt Spa: ['0.350', '0.350'] [Step 161 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [55157] → Tgt Spa: ['1.000'] [Step 161 / Rank 4] Tasks: ['Single QA'] | Lens: [59076] → Tgt Spa: ['0.350'] [Step 161 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [50399] → Tgt Spa: ['1.000'] [Step 161 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [58762] → Tgt Spa: ['1.000'] [Step 161 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [55157] → Tgt Spa: ['1.000'] [Step 161 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [58762] → Tgt Spa: ['1.000'] [Step 161 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [50399] → Tgt Spa: ['1.000'] [Step 161 / Rank 5] Tasks: ['Single QA'] | Lens: [59076] → Tgt Spa: ['0.350'] [Step 161 / Rank 7] Tasks: ['Code'] | Lens: [57752] → Tgt Spa: ['1.000'] [Step 161 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17058, 17062, 17063] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 161 / Rank 1] Tasks: ['Single QA'] | Lens: [58390] → Tgt Spa: ['0.350'] [Step 161 / Rank 0] Tasks: ['Single QA'] | Lens: [58390] → Tgt Spa: ['0.350'] [Step 161 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17058, 17062, 17063] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 161 / Rank 6] Tasks: ['Code'] | Lens: [57752] → Tgt Spa: ['1.000'] [Step 161 / Rank 3] Tasks: ['Code'] | Lens: [37090] → Tgt Spa: ['1.000'] [Step 161 / Rank 2] Tasks: ['Code'] | Lens: [37090] → Tgt Spa: ['1.000'] [Step 161 / Rank 6] Tasks: ['Single QA'] | Lens: [52617] → Tgt Spa: ['0.350'] [Step 161 / Rank 2] Tasks: ['Single QA'] | Lens: [58353] → Tgt Spa: 
['0.350'] [Step 161 / Rank 4] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18528, 18541, 18540] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 161 / Rank 3] Tasks: ['Single QA'] | Lens: [58353] → Tgt Spa: ['0.350'] [Step 161 / Rank 0] Tasks: ['Single QA'] | Lens: [35896] → Tgt Spa: ['0.350'] [Step 161 / Rank 5] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18528, 18541, 18540] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 161 / Rank 7] Tasks: ['Single QA'] | Lens: [52617] → Tgt Spa: ['0.350'] [Step 161 / Rank 1] Tasks: ['Single QA'] | Lens: [35896] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 01:57:00,460 >> @ 161 | Loss: 1.9063 | LM: 1.8412 | Reg: 0.0651 | Spa(Avg): 0.547 [INFO|lh_trainer.py:797] 2026-02-17 01:57:00,460 >> Statistic -> Code | Spa: 0.688 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 01:57:00,460 >> Statistic -> In-Context | Spa: 0.692 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:57:00,460 >> Statistic -> MultiHop | Spa: 0.424 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:57:00,460 >> Statistic -> Single | Spa: 0.378 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:57:00,460 >> Statistic -> Summarization | Spa: 0.627 | Tgt: 1.000 | Z-Loss: 0.117 | [INFO|lh_trainer.py:810] 2026-02-17 01:57:00,462 >> [Micro-Log] {"loss": 1.9063236992806196, "lm_loss": 1.8411943386308849, "reg_loss": 0.06512934656833143, "model_sparsity(avg)": 0.5466579807301363, "Spa-In-Context Learning sparsity": 0.692460298538208, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10681731679609843, "Spa-Single QA sparsity": 0.3779239717282747, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.021080240147727494, "Spa-Summarization sparsity": 0.6269841279302325, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11709793976375035, "Spa-Code sparsity": 0.6875000099341074, 
"Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09342544650038083, "Spa-MultiHop QA sparsity": 0.4236111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025571412096420925, "step": 161, "current_tau": 1.013620376586914, "lambda1 Single QA": 0.5703125, "lambda2 MultiHop QA": 0.294921875, "lambda3 Summarization": 0.1396484375, "lambda4 Code": 0.2392578125} [INFO|lh_trainer.py:331] 2026-02-17 01:57:23,021 >> {'loss': 11.4379, 'grad_norm': 0.7179697751998901, 'learning_rate': 0.00031153834210341595, 'epoch': 0.17061611374407584, 'num_input_tokens_seen': 398528988, 'completed': '54.00% (162 / 300)', 'remaining time': '6:28:05', 'throughput': '6889.99', 'gpu_mem_free': '13203MB', 'step': 162} [Step 162 / Rank 7] Tasks: ['Single QA'] | Lens: [49870] → Tgt Spa: ['0.350'] [Step 162 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [25637, 25637] → Tgt Spa: ['0.350', '0.350'] [Step 162 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [25637, 25637] → Tgt Spa: ['0.350', '0.350'] [Step 162 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17248, 17258, 17248] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 162 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17248, 17258, 17248] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 162 / Rank 6] Tasks: ['Single QA'] | Lens: [49870] → Tgt Spa: ['0.350'] [Step 162 / Rank 0] Tasks: ['Single QA'] | Lens: [65029] → Tgt Spa: ['0.350'] [Step 162 / Rank 1] Tasks: ['Single QA'] | Lens: [65029] → Tgt Spa: ['0.350'] [Step 162 / Rank 0] Tasks: ['Single QA'] | Lens: [57690] → Tgt Spa: ['0.350'] [Step 162 / Rank 1] Tasks: ['Single QA'] | Lens: [57690] → Tgt Spa: ['0.350'] [Step 162 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42039] → Tgt Spa: ['1.000'] [Step 162 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [20396, 20398, 20411] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 162 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [28340, 28340] → Tgt Spa: 
['0.350', '0.350'] [Step 162 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [20396, 20398, 20411] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 162 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42039] → Tgt Spa: ['1.000'] [Step 162 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [28340, 28340] → Tgt Spa: ['0.350', '0.350'] [Step 162 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [63520] → Tgt Spa: ['1.000'] [Step 162 / Rank 3] Tasks: ['Single QA'] | Lens: [51214] → Tgt Spa: ['0.350'] [Step 162 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22456, 22478] → Tgt Spa: ['1.000', '1.000'] [Step 162 / Rank 2] Tasks: ['Single QA'] | Lens: [51214] → Tgt Spa: ['0.350'] [Step 162 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [63520] → Tgt Spa: ['1.000'] [Step 162 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22456, 22478] → Tgt Spa: ['1.000', '1.000'] [Step 162 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15939, 15939, 15939, 15939] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 162 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15939, 15939, 15939, 15939] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 162 / Rank 5] Tasks: ['Single QA'] | Lens: [42537] → Tgt Spa: ['0.350'] [Step 162 / Rank 7] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [19179, 19173, 19175] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 162 / Rank 2] Tasks: ['Single QA'] | Lens: [34882] → Tgt Spa: ['0.350'] [Step 162 / Rank 3] Tasks: ['Single QA'] | Lens: [34882] → Tgt Spa: ['0.350'] [Step 162 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [22563, 22555] → Tgt Spa: ['1.000', '1.000'] [Step 162 / Rank 4] Tasks: ['Single QA'] | Lens: [42537] → Tgt Spa: ['0.350'] [Step 162 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [22563, 22555] → Tgt Spa: ['1.000', '1.000'] [Step 162 / Rank 6] Tasks: ['Code', 'Single QA', 'Single QA'] | Lens: [19179, 19173, 
19175] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 162 / Rank 5] Tasks: ['Single QA'] | Lens: [36215] → Tgt Spa: ['0.350'] [Step 162 / Rank 2] Tasks: ['Single QA'] | Lens: [52684] → Tgt Spa: ['0.350'] [Step 162 / Rank 3] Tasks: ['Single QA'] | Lens: [52684] → Tgt Spa: ['0.350'] [Step 162 / Rank 1] Tasks: ['Single QA'] | Lens: [33972] → Tgt Spa: ['0.350'] [Step 162 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [30408, 30409] → Tgt Spa: ['1.000', '1.000'] [Step 162 / Rank 4] Tasks: ['Single QA'] | Lens: [36215] → Tgt Spa: ['0.350'] [Step 162 / Rank 0] Tasks: ['Single QA'] | Lens: [33972] → Tgt Spa: ['0.350'] [Step 162 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [30408, 30409] → Tgt Spa: ['1.000', '1.000'] [Step 162 / Rank 0] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [19539, 19546, 19547] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 162 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [19451, 19452, 19452] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 162 / Rank 2] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21087, 21095, 21090] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 162 / Rank 3] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21087, 21095, 21090] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 162 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [19451, 19452, 19452] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 162 / Rank 1] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [19539, 19546, 19547] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 162 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24634, 24653] → Tgt Spa: ['1.000', '1.000'] [Step 162 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24634, 24653] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 01:59:48,839 >> @ 162 | Loss: 2.1145 | LM: 2.0513 | Reg: 0.0632 | Spa(Avg): 0.523 [INFO|lh_trainer.py:797] 2026-02-17 01:59:48,840 >> Statistic -> Code | Spa: 0.687 | Tgt: 1.000 | Z-Loss: 0.094 | 
[INFO|lh_trainer.py:797] 2026-02-17 01:59:48,840 >> Statistic -> In-Context | Spa: 0.704 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:59:48,840 >> Statistic -> MultiHop | Spa: 0.424 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:59:48,840 >> Statistic -> Single | Spa: 0.402 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 01:59:48,840 >> Statistic -> Summarization | Spa: 0.635 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:810] 2026-02-17 01:59:48,842 >> [Micro-Log] {"loss": 2.114493110527595, "lm_loss": 2.0512858380874, "reg_loss": 0.06320725688904834, "model_sparsity(avg)": 0.523099921643734, "Spa-Single QA sparsity": 0.4020833224058151, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.038426880032056944, "Spa-Code sparsity": 0.6865079360348838, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09416395319359643, "Spa-In-Context Learning sparsity": 0.7043650831495013, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10285695429359164, "Spa-Summarization sparsity": 0.6354166567325592, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11265205964446068, "Spa-MultiHop QA sparsity": 0.4236111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025571412096420925, "step": 162, "current_tau": 1.0122358798980713, "lambda1 Single QA": 0.5703125, "lambda2 MultiHop QA": 0.294921875, "lambda3 Summarization": 0.140625, "lambda4 Code": 0.240234375} [INFO|lh_trainer.py:331] 2026-02-17 02:00:03,291 >> {'loss': 12.687, 'grad_norm': 0.5782619118690491, 'learning_rate': 0.00030836136012784226, 'epoch': 0.17166929963138494, 'num_input_tokens_seen': 401033514, 'completed': '54.33% (163 / 300)', 'remaining time': '6:25:09', 'throughput': '7813.44', 'gpu_mem_free': '9339MB', 'step': 163} [Step 163 / Rank 7] Tasks: ['Single QA'] | Lens: [45653] → Tgt Spa: ['0.350'] [Step 163 / Rank 
3] Tasks: ['Single QA'] | Lens: [49967] → Tgt Spa: ['0.350'] [Step 163 / Rank 6] Tasks: ['Single QA'] | Lens: [45653] → Tgt Spa: ['0.350'] [Step 163 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23227, 23227] → Tgt Spa: ['1.000', '1.000'] [Step 163 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59239] → Tgt Spa: ['1.000'] [Step 163 / Rank 2] Tasks: ['Single QA'] | Lens: [49967] → Tgt Spa: ['0.350'] [Step 163 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59239] → Tgt Spa: ['1.000'] [Step 163 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23227, 23227] → Tgt Spa: ['1.000', '1.000'] [Step 163 / Rank 5] Tasks: ['Code'] | Lens: [35414] → Tgt Spa: ['1.000'] [Step 163 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41186] → Tgt Spa: ['1.000'] [Step 163 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Code'] | Lens: [9397, 9396, 9406, 9402, 9403, 9419] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 163 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25860, 25862] → Tgt Spa: ['1.000', '0.350'] [Step 163 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25860, 25862] → Tgt Spa: ['1.000', '0.350'] [Step 163 / Rank 4] Tasks: ['Code'] | Lens: [35414] → Tgt Spa: ['1.000'] [Step 163 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Code'] | Lens: [9397, 9396, 9406, 9402, 9403, 9419] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 163 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41186] → Tgt Spa: ['1.000'] [Step 163 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [32523, 32516] → Tgt Spa: ['1.000', '0.350'] [Step 163 / Rank 2] Tasks: ['Code'] | Lens: [62412] → Tgt Spa: ['1.000'] [Step 163 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17814, 17814, 17804] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 163 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [65480] → 
Tgt Spa: ['1.000'] [Step 163 / Rank 3] Tasks: ['Code'] | Lens: [62412] → Tgt Spa: ['1.000'] [Step 163 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [65480] → Tgt Spa: ['1.000'] [Step 163 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [32523, 32516] → Tgt Spa: ['1.000', '0.350'] [Step 163 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17814, 17814, 17804] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 163 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16728, 16728, 16738] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 163 / Rank 2] Tasks: ['Single QA'] | Lens: [49402] → Tgt Spa: ['0.350'] [Step 163 / Rank 1] Tasks: ['Single QA'] | Lens: [56724] → Tgt Spa: ['0.350'] [Step 163 / Rank 3] Tasks: ['Single QA'] | Lens: [49402] → Tgt Spa: ['0.350'] [Step 163 / Rank 5] Tasks: ['Single QA'] | Lens: [51077] → Tgt Spa: ['0.350'] [Step 163 / Rank 4] Tasks: ['Single QA'] | Lens: [51077] → Tgt Spa: ['0.350'] [Step 163 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16728, 16728, 16738] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 163 / Rank 0] Tasks: ['Single QA'] | Lens: [56724] → Tgt Spa: ['0.350'] [Step 163 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41579] → Tgt Spa: ['1.000'] [Step 163 / Rank 6] Tasks: ['Single QA'] | Lens: [45655] → Tgt Spa: ['0.350'] [Step 163 / Rank 1] Tasks: ['Single QA'] | Lens: [55301] → Tgt Spa: ['0.350'] [Step 163 / Rank 0] Tasks: ['Single QA'] | Lens: [55301] → Tgt Spa: ['0.350'] [Step 163 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41579] → Tgt Spa: ['1.000'] [Step 163 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24184, 24204] → Tgt Spa: ['1.000', '1.000'] [Step 163 / Rank 7] Tasks: ['Single QA'] | Lens: [45655] → Tgt Spa: ['0.350'] [Step 163 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24184, 24204] → Tgt Spa: ['1.000', '1.000'] [Step 163 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11727, 11727, 
11728, 11730, 11730] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 163 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40035] → Tgt Spa: ['1.000'] [Step 163 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11727, 11727, 11728, 11730, 11730] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 163 / Rank 6] Tasks: ['Single QA'] | Lens: [49687] → Tgt Spa: ['0.350'] [Step 163 / Rank 7] Tasks: ['Single QA'] | Lens: [49687] → Tgt Spa: ['0.350'] [Step 163 / Rank 1] Tasks: ['Single QA'] | Lens: [54993] → Tgt Spa: ['0.350'] [Step 163 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40035] → Tgt Spa: ['1.000'] [Step 163 / Rank 0] Tasks: ['Single QA'] | Lens: [54993] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:02:26,627 >> @ 163 | Loss: 2.0281 | LM: 1.9652 | Reg: 0.0630 | Spa(Avg): 0.525 [INFO|lh_trainer.py:797] 2026-02-17 02:02:26,627 >> Statistic -> Code | Spa: 0.655 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:797] 2026-02-17 02:02:26,627 >> Statistic -> In-Context | Spa: 0.701 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:02:26,627 >> Statistic -> MultiHop | Spa: 0.424 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:02:26,627 >> Statistic -> Single | Spa: 0.390 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:02:26,627 >> Statistic -> Summarization | Spa: 0.604 | Tgt: 1.000 | Z-Loss: 0.129 | [INFO|lh_trainer.py:810] 2026-02-17 02:02:26,629 >> [Micro-Log] {"loss": 2.028147211919228, "lm_loss": 1.9651780376831691, "reg_loss": 0.06296917301369831, "model_sparsity(avg)": 0.5251929027338823, "Spa-In-Context Learning sparsity": 0.7006173001395332, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10428100658787622, "Spa-Single QA sparsity": 0.38958332538604734, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02751701504457742, "Spa-Code sparsity": 
0.654513880610466, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10585243441164494, "Spa-Summarization sparsity": 0.6041666567325592, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12948043271899223, "Spa-MultiHop QA sparsity": 0.4236111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025571412096420925, "step": 163, "current_tau": 1.0109238624572754, "lambda1 Single QA": 0.5703125, "lambda2 MultiHop QA": 0.296875, "lambda3 Summarization": 0.140625, "lambda4 Code": 0.240234375} [INFO|lh_trainer.py:331] 2026-02-17 02:02:47,009 >> {'loss': 12.1689, 'grad_norm': 0.6894981861114502, 'learning_rate': 0.00030517437823793947, 'epoch': 0.17272248551869404, 'num_input_tokens_seen': 403501710, 'completed': '54.67% (164 / 300)', 'remaining time': '6:22:17', 'throughput': '7537.96', 'gpu_mem_free': '8171MB', 'step': 164} [Step 164 / Rank 7] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning'] | Lens: [21732, 21714, 21716] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 164 / Rank 0] Tasks: ['Single QA'] | Lens: [58748] → Tgt Spa: ['0.350'] [Step 164 / Rank 1] Tasks: ['Single QA'] | Lens: [58748] → Tgt Spa: ['0.350'] [Step 164 / Rank 2] Tasks: ['Single QA'] | Lens: [41927] → Tgt Spa: ['0.350'] [Step 164 / Rank 6] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning'] | Lens: [21732, 21714, 21716] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 164 / Rank 5] Tasks: ['Single QA'] | Lens: [52243] → Tgt Spa: ['0.350'] [Step 164 / Rank 4] Tasks: ['Single QA'] | Lens: [52243] → Tgt Spa: ['0.350'] [Step 164 / Rank 3] Tasks: ['Single QA'] | Lens: [41927] → Tgt Spa: ['0.350'] [Step 164 / Rank 5] Tasks: ['Summarization', 'Single QA'] | Lens: [23419, 23402] → Tgt Spa: ['1.000', '0.350'] [Step 164 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [27476, 27468] → Tgt Spa: ['1.000', '1.000'] [Step 164 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [27476, 27468] → 
Tgt Spa: ['1.000', '1.000'] [Step 164 / Rank 7] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 164 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32196, 32196] → Tgt Spa: ['0.350', '0.350'] [Step 164 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32196, 32196] → Tgt Spa: ['0.350', '0.350'] [Step 164 / Rank 4] Tasks: ['Summarization', 'Single QA'] | Lens: [23419, 23402] → Tgt Spa: ['1.000', '0.350'] [Step 164 / Rank 6] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 164 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [19398, 19398, 19398] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 164 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Code'] | Lens: [8052, 8052, 8052, 8052, 8052, 8070, 8052, 8062] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 164 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [22832, 22822] → Tgt Spa: ['1.000', '1.000'] [Step 164 / Rank 6] Tasks: ['Code'] | Lens: [33999] → Tgt Spa: ['1.000'] [Step 164 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [19398, 19398, 19398] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 164 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [22832, 22822] → Tgt Spa: ['1.000', '1.000'] [Step 164 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Code'] | Lens: [8052, 8052, 8052, 8052, 8052, 8070, 8052, 8062] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 164 / Rank 7] Tasks: ['Code'] | Lens: [33999] → Tgt Spa: ['1.000'] [Step 164 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23832, 23834] → Tgt Spa: ['1.000', '0.350'] [Step 164 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43044] → Tgt Spa: ['1.000'] [Step 164 / Rank 0] Tasks: ['Single QA'] | Lens: [56732] → Tgt Spa: ['0.350'] [Step 164 
/ Rank 5] Tasks: ['Single QA'] | Lens: [43734] → Tgt Spa: ['0.350'] [Step 164 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23832, 23834] → Tgt Spa: ['1.000', '0.350'] [Step 164 / Rank 1] Tasks: ['Single QA'] | Lens: [56732] → Tgt Spa: ['0.350'] [Step 164 / Rank 4] Tasks: ['Single QA'] | Lens: [43734] → Tgt Spa: ['0.350'] [Step 164 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43044] → Tgt Spa: ['1.000'] [Step 164 / Rank 1] Tasks: ['Single QA'] | Lens: [62719] → Tgt Spa: ['0.350'] [Step 164 / Rank 5] Tasks: ['Single QA'] | Lens: [46713] → Tgt Spa: ['0.350'] [Step 164 / Rank 3] Tasks: ['Single QA'] | Lens: [45758] → Tgt Spa: ['0.350'] [Step 164 / Rank 4] Tasks: ['Single QA'] | Lens: [46713] → Tgt Spa: ['0.350'] [Step 164 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57486] → Tgt Spa: ['1.000'] [Step 164 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57486] → Tgt Spa: ['1.000'] [Step 164 / Rank 0] Tasks: ['Single QA'] | Lens: [62719] → Tgt Spa: ['0.350'] [Step 164 / Rank 2] Tasks: ['Single QA'] | Lens: [45758] → Tgt Spa: ['0.350'] [Step 164 / Rank 7] Tasks: ['Single QA'] | Lens: [36352] → Tgt Spa: ['0.350'] [Step 164 / Rank 5] Tasks: ['Single QA'] | Lens: [8287] → Tgt Spa: ['0.350'] [Step 164 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19031, 19032, 19045] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 164 / Rank 4] Tasks: ['Single QA'] | Lens: [8287] → Tgt Spa: ['0.350'] [Step 164 / Rank 6] Tasks: ['Single QA'] | Lens: [36352] → Tgt Spa: ['0.350'] [Step 164 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19031, 19032, 19045] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 164 / Rank 1] Tasks: ['Single QA'] | Lens: [43933] → Tgt Spa: ['0.350'] [Step 164 / Rank 0] Tasks: ['Single QA'] | Lens: [43933] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:05:15,884 >> @ 164 | Loss: 1.9783 | LM: 1.9173 | Reg: 0.0611 | Spa(Avg): 0.523 [INFO|lh_trainer.py:797] 2026-02-17 02:05:15,884 >> Statistic -> Code | Spa: 0.690 
| Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 02:05:15,884 >> Statistic -> In-Context | Spa: 0.704 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:05:15,884 >> Statistic -> MultiHop | Spa: 0.424 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:05:15,884 >> Statistic -> Single | Spa: 0.434 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:05:15,884 >> Statistic -> Summarization | Spa: 0.668 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:810] 2026-02-17 02:05:15,886 >> [Micro-Log] {"loss": 1.9783315822811953, "lm_loss": 1.9172666487769068, "reg_loss": 0.06106493175320793, "model_sparsity(avg)": 0.5227382419010004, "Spa-Single QA sparsity": 0.43434343012896454, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05274799894753166, "Spa-Code sparsity": 0.6898148059844971, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09322385862469673, "Spa-In-Context Learning sparsity": 0.7037037014961243, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10314313446482022, "Spa-Summarization sparsity": 0.6684027910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09968062117695808, "Spa-MultiHop QA sparsity": 0.4236111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.025571412096420925, "step": 164, "current_tau": 1.0096845626831055, "lambda1 Single QA": 0.5703125, "lambda2 MultiHop QA": 0.296875, "lambda3 Summarization": 0.1416015625, "lambda4 Code": 0.2412109375} [INFO|lh_trainer.py:331] 2026-02-17 02:05:30,585 >> {'loss': 11.87, 'grad_norm': 0.5236096978187561, 'learning_rate': 0.00030197794250664753, 'epoch': 0.17377567140600317, 'num_input_tokens_seen': 405903642, 'completed': '55.00% (165 / 300)', 'remaining time': '6:19:24', 'throughput': '7341.94', 'gpu_mem_free': '11741MB', 'step': 165} [Step 165 / Rank 2] Tasks: ['Summarization', 
'Summarization', 'Code'] | Lens: [17623, 17624, 17613] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 165 / Rank 7] Tasks: ['Single QA'] | Lens: [37124] → Tgt Spa: ['0.350'] [Step 165 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [31821, 31821] → Tgt Spa: ['0.350', '0.350'] [Step 165 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [31821, 31821] → Tgt Spa: ['0.350', '0.350'] [Step 165 / Rank 6] Tasks: ['Single QA'] | Lens: [37124] → Tgt Spa: ['0.350'] [Step 165 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17623, 17624, 17613] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 165 / Rank 1] Tasks: ['Single QA'] | Lens: [61564] → Tgt Spa: ['0.350'] [Step 165 / Rank 0] Tasks: ['Single QA'] | Lens: [61564] → Tgt Spa: ['0.350'] [Step 165 / Rank 5] Tasks: ['In-Context Learning', 'Summarization', 'Single QA', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'Summarization', 'In-Context Learning'] | Lens: [3298, 3317, 3299, 3299, 3301, 3300, 3319, 3308, 3304, 3304, 3302, 3305, 3304, 3305, 3307, 3312, 3307, 3324, 3307] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 165 / Rank 6] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Code'] | Lens: [12229, 12236, 12234, 12237, 12245] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000'] [Step 165 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [62423] → Tgt Spa: ['1.000'] [Step 165 / Rank 4] Tasks: ['In-Context Learning', 'Summarization', 'Single QA', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'Summarization', 'In-Context Learning'] | Lens: [3298, 
3317, 3299, 3299, 3301, 3300, 3319, 3308, 3304, 3304, 3302, 3305, 3304, 3305, 3307, 3312, 3307, 3324, 3307] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 165 / Rank 0] Tasks: ['Code'] | Lens: [36316] → Tgt Spa: ['1.000'] [Step 165 / Rank 1] Tasks: ['Code'] | Lens: [36316] → Tgt Spa: ['1.000'] [Step 165 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [62423] → Tgt Spa: ['1.000'] [Step 165 / Rank 7] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Code'] | Lens: [12229, 12236, 12234, 12237, 12245] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000'] [Step 165 / Rank 5] Tasks: ['Single QA'] | Lens: [52271] → Tgt Spa: ['0.350'] [Step 165 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32642, 32642] → Tgt Spa: ['0.350', '0.350'] [Step 165 / Rank 4] Tasks: ['Single QA'] | Lens: [52271] → Tgt Spa: ['0.350'] [Step 165 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32642, 32642] → Tgt Spa: ['0.350', '0.350'] [Step 165 / Rank 3] Tasks: ['Single QA'] | Lens: [34135] → Tgt Spa: ['0.350'] [Step 165 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61827] → Tgt Spa: ['1.000'] [Step 165 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61827] → Tgt Spa: ['1.000'] [Step 165 / Rank 2] Tasks: ['Single QA'] | Lens: [34135] → Tgt Spa: ['0.350'] [Step 165 / Rank 3] Tasks: ['Single QA'] | Lens: [41005] → Tgt Spa: ['0.350'] [Step 165 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25687, 25693] → Tgt Spa: ['1.000', '1.000'] [Step 165 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25687, 25693] → Tgt Spa: ['1.000', '1.000'] [Step 165 / Rank 5] Tasks: ['Single QA'] | Lens: [46104] → Tgt Spa: ['0.350'] [Step 165 / Rank 4] Tasks: ['Single QA'] | Lens: [46104] → Tgt Spa: ['0.350'] [Step 165 / Rank 7] Tasks: ['Summarization', 'MultiHop QA'] | Lens: [31757, 31742] → Tgt Spa: ['1.000', 
'0.350'] [Step 165 / Rank 2] Tasks: ['Single QA'] | Lens: [41005] → Tgt Spa: ['0.350'] [Step 165 / Rank 6] Tasks: ['Summarization', 'MultiHop QA'] | Lens: [31757, 31742] → Tgt Spa: ['1.000', '0.350'] [Step 165 / Rank 2] Tasks: ['Single QA'] | Lens: [47254] → Tgt Spa: ['0.350'] [Step 165 / Rank 5] Tasks: ['Single QA'] | Lens: [54076] → Tgt Spa: ['0.350'] [Step 165 / Rank 3] Tasks: ['Single QA'] | Lens: [47254] → Tgt Spa: ['0.350'] [Step 165 / Rank 1] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'Code'] | Lens: [5577, 5576, 5577, 5584, 5584, 5584, 5579, 5578, 5597, 5580, 5587] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 165 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [45350] → Tgt Spa: ['1.000'] [Step 165 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [45350] → Tgt Spa: ['1.000'] [Step 165 / Rank 4] Tasks: ['Single QA'] | Lens: [54076] → Tgt Spa: ['0.350'] [Step 165 / Rank 0] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'Code'] | Lens: [5577, 5576, 5577, 5584, 5584, 5584, 5579, 5578, 5597, 5580, 5587] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 165 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16539, 16540, 16533] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 165 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [25740, 25738] → Tgt Spa: ['1.000', '1.000'] [Step 165 / Rank 3] Tasks: ['Code', 'Summarization'] | Lens: [30984, 30998] → Tgt Spa: ['1.000', '1.000'] [Step 165 / Rank 2] Tasks: ['Code', 'Summarization'] | Lens: [30984, 30998] → Tgt Spa: ['1.000', '1.000'] [Step 165 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16539, 16540, 16533] → Tgt Spa: ['1.000', 
'1.000', '1.000'] [Step 165 / Rank 5] Tasks: ['Single QA'] | Lens: [55243] → Tgt Spa: ['0.350'] [Step 165 / Rank 4] Tasks: ['Single QA'] | Lens: [55243] → Tgt Spa: ['0.350'] [Step 165 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [25740, 25738] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:08:02,006 >> @ 165 | Loss: 2.1484 | LM: 2.0903 | Reg: 0.0581 | Spa(Avg): 0.534 [INFO|lh_trainer.py:797] 2026-02-17 02:08:02,006 >> Statistic -> Code | Spa: 0.677 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 02:08:02,006 >> Statistic -> In-Context | Spa: 0.706 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:08:02,006 >> Statistic -> MultiHop | Spa: 0.650 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:08:02,006 >> Statistic -> Single | Spa: 0.441 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:08:02,006 >> Statistic -> Summarization | Spa: 0.665 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:810] 2026-02-17 02:08:02,008 >> [Micro-Log] {"loss": 2.1484328247606754, "lm_loss": 2.090300434579452, "reg_loss": 0.05813239444008408, "model_sparsity(avg)": 0.5341998214522997, "Spa-Single QA sparsity": 0.44097221791744234, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.058073362370487304, "Spa-Code sparsity": 0.6768518527348836, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09828042834997178, "Spa-In-Context Learning sparsity": 0.7058080976659601, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10243526642972772, "Spa-Summarization sparsity": 0.6652777791023254, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10021750256419182, "Spa-MultiHop QA sparsity": 0.6500000059604645, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12559881135821344, "step": 165, "current_tau": 1.0085185766220093, "lambda1 Single QA": 0.5703125, "lambda2 MultiHop QA": 
0.296875, "lambda3 Summarization": 0.1416015625, "lambda4 Code": 0.2412109375} [INFO|lh_trainer.py:331] 2026-02-17 02:08:22,659 >> {'loss': 12.8906, 'grad_norm': 0.5555451512336731, 'learning_rate': 0.000298772600626774, 'epoch': 0.17482885729331227, 'num_input_tokens_seen': 408463312, 'completed': '55.33% (166 / 300)', 'remaining time': '6:16:38', 'throughput': '7437.71', 'gpu_mem_free': '9315MB', 'step': 166} [Step 166 / Rank 7] Tasks: ['Single QA'] | Lens: [35677] → Tgt Spa: ['0.350'] [Step 166 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39286] → Tgt Spa: ['1.000'] [Step 166 / Rank 4] Tasks: ['Summarization'] | Lens: [35971] → Tgt Spa: ['1.000'] [Step 166 / Rank 0] Tasks: ['Single QA'] | Lens: [46275] → Tgt Spa: ['0.350'] [Step 166 / Rank 6] Tasks: ['Single QA'] | Lens: [35677] → Tgt Spa: ['0.350'] [Step 166 / Rank 1] Tasks: ['Single QA'] | Lens: [46275] → Tgt Spa: ['0.350'] [Step 166 / Rank 5] Tasks: ['Summarization'] | Lens: [35971] → Tgt Spa: ['1.000'] [Step 166 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39286] → Tgt Spa: ['1.000'] [Step 166 / Rank 6] Tasks: ['Single QA'] | Lens: [33738] → Tgt Spa: ['0.350'] [Step 166 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59782] → Tgt Spa: ['1.000'] [Step 166 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18058, 18049, 18060] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 166 / Rank 7] Tasks: ['Single QA'] | Lens: [33738] → Tgt Spa: ['0.350'] [Step 166 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18058, 18049, 18060] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 166 / Rank 1] Tasks: ['Code'] | Lens: [34867] → Tgt Spa: ['1.000'] [Step 166 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59782] → Tgt Spa: ['1.000'] [Step 166 / Rank 0] Tasks: ['Code'] | Lens: [34867] → Tgt Spa: ['1.000'] [Step 166 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [65340] → Tgt Spa: ['0.350'] [Step 166 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [53284] → Tgt Spa: ['1.000'] [Step 
166 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [65340] → Tgt Spa: ['0.350'] [Step 166 / Rank 0] Tasks: ['Single QA'] | Lens: [62407] → Tgt Spa: ['0.350'] [Step 166 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [53284] → Tgt Spa: ['1.000'] [Step 166 / Rank 5] Tasks: ['Summarization'] | Lens: [41354] → Tgt Spa: ['1.000'] [Step 166 / Rank 1] Tasks: ['Single QA'] | Lens: [62407] → Tgt Spa: ['0.350'] [Step 166 / Rank 4] Tasks: ['Summarization'] | Lens: [41354] → Tgt Spa: ['1.000'] [Step 166 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [27544, 27554] → Tgt Spa: ['0.350', '1.000'] [Step 166 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27062, 27063] → Tgt Spa: ['0.350', '1.000'] [Step 166 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [27544, 27554] → Tgt Spa: ['0.350', '1.000'] [Step 166 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27062, 27063] → Tgt Spa: ['0.350', '1.000'] [Step 166 / Rank 0] Tasks: ['Single QA'] | Lens: [33803] → Tgt Spa: ['0.350'] [Step 166 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32130, 32130] → Tgt Spa: ['0.350', '0.350'] [Step 166 / Rank 1] Tasks: ['Single QA'] | Lens: [33803] → Tgt Spa: ['0.350'] [Step 166 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32130, 32130] → Tgt Spa: ['0.350', '0.350'] [Step 166 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32114, 32114] → Tgt Spa: ['0.350', '0.350'] [Step 166 / Rank 2] Tasks: ['Code'] | Lens: [34270] → Tgt Spa: ['1.000'] [Step 166 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24608, 24609] → Tgt Spa: ['1.000', '0.350'] [Step 166 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32114, 32114] → Tgt Spa: ['0.350', '0.350'] [Step 166 / Rank 1] Tasks: ['Single QA'] | Lens: [48620] → Tgt Spa: ['0.350'] [Step 166 / Rank 0] Tasks: ['Single QA'] | Lens: [48620] → Tgt Spa: ['0.350'] [Step 166 / Rank 3] Tasks: ['Code'] | Lens: [34270] → Tgt Spa: ['1.000'] [Step 166 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: 
[24608, 24609] → Tgt Spa: ['1.000', '0.350'] [Step 166 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [28739, 28732] → Tgt Spa: ['1.000', '1.000'] [Step 166 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [28739, 28732] → Tgt Spa: ['1.000', '1.000'] [Step 166 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25649, 25650] → Tgt Spa: ['0.350', '1.000'] [Step 166 / Rank 3] Tasks: ['Single QA'] | Lens: [59869] → Tgt Spa: ['0.350'] [Step 166 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [62494] → Tgt Spa: ['1.000'] [Step 166 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [62494] → Tgt Spa: ['1.000'] [Step 166 / Rank 2] Tasks: ['Single QA'] | Lens: [59869] → Tgt Spa: ['0.350'] [Step 166 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25649, 25650] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:10:43,466 >> @ 166 | Loss: 2.1445 | LM: 2.0679 | Reg: 0.0766 | Spa(Avg): 0.572 [INFO|lh_trainer.py:797] 2026-02-17 02:10:43,466 >> Statistic -> Code | Spa: 0.703 | Tgt: 1.000 | Z-Loss: 0.089 | [INFO|lh_trainer.py:797] 2026-02-17 02:10:43,466 >> Statistic -> In-Context | Spa: 0.717 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:10:43,466 >> Statistic -> MultiHop | Spa: 0.403 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:10:43,467 >> Statistic -> Single | Spa: 0.450 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:10:43,467 >> Statistic -> Summarization | Spa: 0.670 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 02:10:43,468 >> [Micro-Log] {"loss": 2.1444963371613994, "lm_loss": 2.067900809488492, "reg_loss": 0.07659551698695093, "model_sparsity(avg)": 0.5722415136794249, "Spa-Single QA sparsity": 0.44999999602635704, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06302651246078313, "Spa-Code sparsity": 0.7027777910232544, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 
0.0888444647192955, "Spa-In-Context Learning sparsity": 0.717013880610466, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.09869915153831244, "Spa-Summarization sparsity": 0.6701388955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09752937406301498, "Spa-MultiHop QA sparsity": 0.4027777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01688862405717373, "step": 166, "current_tau": 1.0074260234832764, "lambda1 Single QA": 0.5703125, "lambda2 MultiHop QA": 0.296875, "lambda3 Summarization": 0.142578125, "lambda4 Code": 0.2421875} [INFO|lh_trainer.py:331] 2026-02-17 02:11:08,299 >> {'loss': 12.867, 'grad_norm': 0.64606773853302, 'learning_rate': 0.0002955589018171488, 'epoch': 0.17588204318062137, 'num_input_tokens_seen': 410857116, 'completed': '55.67% (167 / 300)', 'remaining time': '6:13:47', 'throughput': '7225.96', 'gpu_mem_free': '5485MB', 'step': 167} [Step 167 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27343, 27328] → Tgt Spa: ['1.000', '1.000'] [Step 167 / Rank 4] Tasks: ['Single QA'] | Lens: [33195] → Tgt Spa: ['0.350'] [Step 167 / Rank 0] Tasks: ['Single QA'] | Lens: [54172] → Tgt Spa: ['0.350'] [Step 167 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42073] → Tgt Spa: ['1.000'] [Step 167 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42073] → Tgt Spa: ['1.000'] [Step 167 / Rank 1] Tasks: ['Single QA'] | Lens: [54172] → Tgt Spa: ['0.350'] [Step 167 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27343, 27328] → Tgt Spa: ['1.000', '1.000'] [Step 167 / Rank 5] Tasks: ['Single QA'] | Lens: [33195] → Tgt Spa: ['0.350'] [Step 167 / Rank 5] Tasks: ['Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Code', 'Summarization', 'In-Context Learning', 
'Single QA', 'Summarization'] | Lens: [3817, 3819, 3811, 3811, 3811, 3811, 3830, 3812, 3813, 3814, 3816, 3816, 3823, 3836, 3817, 3819, 3837] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 167 / Rank 2] Tasks: ['Code'] | Lens: [52354] → Tgt Spa: ['1.000'] [Step 167 / Rank 4] Tasks: ['Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Code', 'Summarization', 'In-Context Learning', 'Single QA', 'Summarization'] | Lens: [3817, 3819, 3811, 3811, 3811, 3811, 3830, 3812, 3813, 3814, 3816, 3816, 3823, 3836, 3817, 3819, 3837] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 167 / Rank 6] Tasks: ['Single QA'] | Lens: [57576] → Tgt Spa: ['0.350'] [Step 167 / Rank 3] Tasks: ['Code'] | Lens: [52354] → Tgt Spa: ['1.000'] [Step 167 / Rank 1] Tasks: ['Single QA'] | Lens: [54149] → Tgt Spa: ['0.350'] [Step 167 / Rank 7] Tasks: ['Single QA'] | Lens: [57576] → Tgt Spa: ['0.350'] [Step 167 / Rank 0] Tasks: ['Single QA'] | Lens: [54149] → Tgt Spa: ['0.350'] [Step 167 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18042, 18045, 18057] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 167 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18042, 18045, 18057] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 167 / Rank 5] Tasks: ['Single QA'] | Lens: [51377] → Tgt Spa: ['0.350'] [Step 167 / Rank 2] Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [19776, 19781, 19773] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 167 / Rank 1] Tasks: ['Single QA'] | Lens: [46925] → Tgt Spa: ['0.350'] [Step 167 / Rank 4] Tasks: ['Single QA'] | Lens: [51377] → Tgt Spa: ['0.350'] [Step 167 / Rank 3] 
Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [19776, 19781, 19773] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 167 / Rank 0] Tasks: ['Single QA'] | Lens: [46925] → Tgt Spa: ['0.350'] [Step 167 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [16867, 16867, 16870] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 167 / Rank 3] Tasks: ['Single QA'] | Lens: [37299] → Tgt Spa: ['0.350'] [Step 167 / Rank 6] Tasks: ['Single QA'] | Lens: [35241] → Tgt Spa: ['0.350'] [Step 167 / Rank 7] Tasks: ['Single QA'] | Lens: [35241] → Tgt Spa: ['0.350'] [Step 167 / Rank 2] Tasks: ['Single QA'] | Lens: [37299] → Tgt Spa: ['0.350'] [Step 167 / Rank 5] Tasks: ['Code'] | Lens: [37918] → Tgt Spa: ['1.000'] [Step 167 / Rank 4] Tasks: ['Code'] | Lens: [37918] → Tgt Spa: ['1.000'] [Step 167 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [16867, 16867, 16870] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 167 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Summarization', 'Summarization'] | Lens: [6884, 6885, 6886, 6886, 6886, 6887, 6894, 6906, 6906] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 167 / Rank 5] Tasks: ['Single QA'] | Lens: [56670] → Tgt Spa: ['0.350'] [Step 167 / Rank 6] Tasks: ['Single QA'] | Lens: [63738] → Tgt Spa: ['0.350'] [Step 167 / Rank 3] Tasks: ['Single QA'] | Lens: [46447] → Tgt Spa: ['0.350'] [Step 167 / Rank 7] Tasks: ['Single QA'] | Lens: [63738] → Tgt Spa: ['0.350'] [Step 167 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Summarization', 'Summarization'] | Lens: [6884, 6885, 6886, 6886, 6886, 6887, 6894, 6906, 6906] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 167 / Rank 4] Tasks: ['Single QA'] | Lens: [56670] → Tgt Spa: ['0.350'] [Step 167 / Rank 2] Tasks: ['Single QA'] | Lens: [46447] → Tgt Spa: ['0.350'] [Step 167 / Rank 5] Tasks: 
['Single QA'] | Lens: [65089] → Tgt Spa: ['0.350'] [Step 167 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20339, 20351, 20340] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 167 / Rank 0] Tasks: ['Single QA'] | Lens: [38357] → Tgt Spa: ['0.350'] [Step 167 / Rank 2] Tasks: ['Single QA'] | Lens: [38812] → Tgt Spa: ['0.350'] [Step 167 / Rank 3] Tasks: ['Single QA'] | Lens: [38812] → Tgt Spa: ['0.350'] [Step 167 / Rank 4] Tasks: ['Single QA'] | Lens: [65089] → Tgt Spa: ['0.350'] [Step 167 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20339, 20351, 20340] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 167 / Rank 1] Tasks: ['Single QA'] | Lens: [38357] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:13:24,924 >> @ 167 | Loss: 1.9266 | LM: 1.8732 | Reg: 0.0534 | Spa(Avg): 0.508 [INFO|lh_trainer.py:797] 2026-02-17 02:13:24,925 >> Statistic -> Code | Spa: 0.705 | Tgt: 1.000 | Z-Loss: 0.088 | [INFO|lh_trainer.py:797] 2026-02-17 02:13:24,925 >> Statistic -> In-Context | Spa: 0.693 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:13:24,925 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:13:24,925 >> Statistic -> Single | Spa: 0.433 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:13:24,925 >> Statistic -> Summarization | Spa: 0.641 | Tgt: 1.000 | Z-Loss: 0.111 | [INFO|lh_trainer.py:810] 2026-02-17 02:13:24,927 >> [Micro-Log] {"loss": 1.926622084652384, "lm_loss": 1.873223175915579, "reg_loss": 0.05339889588746397, "model_sparsity(avg)": 0.5080829411745071, "Spa-Single QA sparsity": 0.4329710110374119, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.052477213751484196, "Spa-Code sparsity": 0.704629635810852, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08828059434890748, "Spa-Summarization sparsity": 0.6406250074505806, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 
0.11125112231820822, "Spa-In-Context Learning sparsity": 0.6929012338320414, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10823776324590047, "Spa-MultiHop QA sparsity": 0.6458333730697632, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1237441934645176, "step": 167, "current_tau": 1.0064074993133545, "lambda1 Single QA": 0.5703125, "lambda2 MultiHop QA": 0.296875, "lambda3 Summarization": 0.142578125, "lambda4 Code": 0.2421875} [INFO|lh_trainer.py:331] 2026-02-17 02:13:51,911 >> {'loss': 11.5597, 'grad_norm': 0.5008332133293152, 'learning_rate': 0.0002923373967285185, 'epoch': 0.1769352290679305, 'num_input_tokens_seen': 413293324, 'completed': '56.00% (168 / 300)', 'remaining time': '6:10:54', 'throughput': '7445.04', 'gpu_mem_free': '13415MB', 'step': 168} [Step 168 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [11347, 11348, 11356, 11350, 11350] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350'] [Step 168 / Rank 5] Tasks: ['Single QA'] | Lens: [62390] → Tgt Spa: ['0.350'] [Step 168 / Rank 3] Tasks: ['Single QA'] | Lens: [57575] → Tgt Spa: ['0.350'] [Step 168 / Rank 4] Tasks: ['Single QA'] | Lens: [62390] → Tgt Spa: ['0.350'] [Step 168 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42537] → Tgt Spa: ['1.000'] [Step 168 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42537] → Tgt Spa: ['1.000'] [Step 168 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [11347, 11348, 11356, 11350, 11350] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350'] [Step 168 / Rank 2] Tasks: ['Single QA'] | Lens: [57575] → Tgt Spa: ['0.350'] [Step 168 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [21034, 21026, 21026] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 168 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44759] → Tgt Spa: ['1.000'] [Step 168 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: 
[21034, 21026, 21026] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 168 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44759] → Tgt Spa: ['1.000'] [Step 168 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19556, 19556, 19547] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 168 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [26213, 26223] → Tgt Spa: ['1.000', '1.000'] [Step 168 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19556, 19556, 19547] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 168 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [26213, 26223] → Tgt Spa: ['1.000', '1.000'] [Step 168 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40185] → Tgt Spa: ['1.000'] [Step 168 / Rank 0] Tasks: ['Single QA'] | Lens: [51360] → Tgt Spa: ['0.350'] [Step 168 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40185] → Tgt Spa: ['1.000'] [Step 168 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60715] → Tgt Spa: ['1.000'] [Step 168 / Rank 7] Tasks: ['Single QA'] | Lens: [34733] → Tgt Spa: ['0.350'] [Step 168 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60715] → Tgt Spa: ['1.000'] [Step 168 / Rank 6] Tasks: ['Single QA'] | Lens: [34733] → Tgt Spa: ['0.350'] [Step 168 / Rank 1] Tasks: ['Single QA'] | Lens: [51360] → Tgt Spa: ['0.350'] [Step 168 / Rank 7] Tasks: ['Single QA'] | Lens: [56612] → Tgt Spa: ['0.350'] [Step 168 / Rank 3] Tasks: ['Code'] | Lens: [59113] → Tgt Spa: ['1.000'] [Step 168 / Rank 6] Tasks: ['Single QA'] | Lens: [56612] → Tgt Spa: ['0.350'] [Step 168 / Rank 5] Tasks: ['Single QA'] | Lens: [44064] → Tgt Spa: ['0.350'] [Step 168 / Rank 4] Tasks: ['Single QA'] | Lens: [44064] → Tgt Spa: ['0.350'] [Step 168 / Rank 1] Tasks: ['Code'] | Lens: [46959] → Tgt Spa: ['1.000'] [Step 168 / Rank 2] Tasks: ['Code'] | Lens: [59113] → Tgt Spa: ['1.000'] [Step 168 / Rank 0] Tasks: ['Code'] | Lens: [46959] → Tgt Spa: ['1.000'] [Step 168 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [23093, 23093] → 
Tgt Spa: ['0.350', '0.350'] [Step 168 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25453, 25455] → Tgt Spa: ['1.000', '1.000'] [Step 168 / Rank 5] Tasks: ['Code'] | Lens: [35538] → Tgt Spa: ['1.000'] [Step 168 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [45663] → Tgt Spa: ['1.000'] [Step 168 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25453, 25455] → Tgt Spa: ['1.000', '1.000'] [Step 168 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [23093, 23093] → Tgt Spa: ['0.350', '0.350'] [Step 168 / Rank 4] Tasks: ['Code'] | Lens: [35538] → Tgt Spa: ['1.000'] [Step 168 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [45663] → Tgt Spa: ['1.000'] [Step 168 / Rank 3] Tasks: ['Single QA'] | Lens: [37138] → Tgt Spa: ['0.350'] [Step 168 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [20219, 20223, 20224] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 168 / Rank 0] Tasks: ['Single QA'] | Lens: [34558] → Tgt Spa: ['0.350'] [Step 168 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [20219, 20223, 20224] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 168 / Rank 1] Tasks: ['Single QA'] | Lens: [34558] → Tgt Spa: ['0.350'] [Step 168 / Rank 6] Tasks: ['Single QA'] | Lens: [65101] → Tgt Spa: ['0.350'] [Step 168 / Rank 7] Tasks: ['Single QA'] | Lens: [65101] → Tgt Spa: ['0.350'] [Step 168 / Rank 2] Tasks: ['Single QA'] | Lens: [37138] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:16:13,049 >> @ 168 | Loss: 2.1203 | LM: 2.0553 | Reg: 0.0651 | Spa(Avg): 0.555 [INFO|lh_trainer.py:797] 2026-02-17 02:16:13,049 >> Statistic -> Code | Spa: 0.673 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:797] 2026-02-17 02:16:13,049 >> Statistic -> In-Context | Spa: 0.707 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:16:13,049 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:16:13,049 >> Statistic -> Single | Spa: 0.381 | Tgt: 0.000 | Z-Loss: 
0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:16:13,049 >> Statistic -> Summarization | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.081 | [INFO|lh_trainer.py:810] 2026-02-17 02:16:13,051 >> [Micro-Log] {"loss": 2.1203413593272367, "lm_loss": 2.055290781582395, "reg_loss": 0.06505059502281559, "model_sparsity(avg)": 0.5551311696569124, "Spa-In-Context Learning sparsity": 0.7065972238779068, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10277910716831684, "Spa-Single QA sparsity": 0.3805555502573649, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01894485066489627, "Spa-Code sparsity": 0.6729797937653281, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10046133602207358, "Spa-Summarization sparsity": 0.7083333333333334, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08080828934907913, "Spa-MultiHop QA sparsity": 0.6458333730697632, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1237441934645176, "step": 168, "current_tau": 1.0054631233215332, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.296875, "lambda3 Summarization": 0.1435546875, "lambda4 Code": 0.2431640625} [INFO|lh_trainer.py:331] 2026-02-17 02:16:39,983 >> {'loss': 12.722, 'grad_norm': 0.7233253121376038, 'learning_rate': 0.00028910863734919615, 'epoch': 0.1779884149552396, 'num_input_tokens_seen': 415708708, 'completed': '56.33% (169 / 300)', 'remaining time': '6:08:05', 'throughput': '7185.56', 'gpu_mem_free': '14157MB', 'step': 169} [Step 169 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [26561, 26554] → Tgt Spa: ['1.000', '0.350'] [Step 169 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [25909, 25919] → Tgt Spa: ['1.000', '1.000'] [Step 169 / Rank 2] Tasks: ['Single QA'] | Lens: [39485] → Tgt Spa: ['0.350'] [Step 169 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27085, 27085] → Tgt Spa: ['0.350', '1.000'] [Step 169 / Rank 3] Tasks: 
['Single QA'] | Lens: [39485] → Tgt Spa: ['0.350'] [Step 169 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27085, 27085] → Tgt Spa: ['0.350', '1.000'] [Step 169 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [26561, 26554] → Tgt Spa: ['1.000', '0.350'] [Step 169 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [25909, 25919] → Tgt Spa: ['1.000', '1.000'] [Step 169 / Rank 4] Tasks: ['Single QA'] | Lens: [57506] → Tgt Spa: ['0.350'] [Step 169 / Rank 3] Tasks: ['Code'] | Lens: [61504] → Tgt Spa: ['1.000'] [Step 169 / Rank 1] Tasks: ['Single QA'] | Lens: [35127] → Tgt Spa: ['0.350'] [Step 169 / Rank 5] Tasks: ['Single QA'] | Lens: [57506] → Tgt Spa: ['0.350'] [Step 169 / Rank 0] Tasks: ['Single QA'] | Lens: [35127] → Tgt Spa: ['0.350'] [Step 169 / Rank 6] Tasks: ['Single QA'] | Lens: [65032] → Tgt Spa: ['0.350'] [Step 169 / Rank 2] Tasks: ['Code'] | Lens: [61504] → Tgt Spa: ['1.000'] [Step 169 / Rank 7] Tasks: ['Single QA'] | Lens: [65032] → Tgt Spa: ['0.350'] [Step 169 / Rank 0] Tasks: ['Single QA'] | Lens: [36920] → Tgt Spa: ['0.350'] [Step 169 / Rank 1] Tasks: ['Single QA'] | Lens: [36920] → Tgt Spa: ['0.350'] [Step 169 / Rank 3] Tasks: ['Single QA'] | Lens: [57713] → Tgt Spa: ['0.350'] [Step 169 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17217, 17206, 17208] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 169 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17217, 17206, 17208] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 169 / Rank 7] Tasks: ['Single QA'] | Lens: [39846] → Tgt Spa: ['0.350'] [Step 169 / Rank 2] Tasks: ['Single QA'] | Lens: [57713] → Tgt Spa: ['0.350'] [Step 169 / Rank 6] Tasks: ['Single QA'] | Lens: [39846] → Tgt Spa: ['0.350'] [Step 169 / Rank 3] Tasks: ['Code'] | Lens: [32875] → Tgt Spa: ['1.000'] [Step 169 / Rank 6] Tasks: ['Single QA'] | Lens: [36409] → Tgt Spa: ['0.350'] [Step 169 / Rank 7] Tasks: ['Single QA'] | Lens: [36409] → Tgt Spa: ['0.350'] [Step 169 / Rank 2] Tasks: ['Code'] | Lens: 
[32875] → Tgt Spa: ['1.000'] [Step 169 / Rank 4] Tasks: ['Single QA'] | Lens: [46836] → Tgt Spa: ['0.350'] [Step 169 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29857, 29857] → Tgt Spa: ['0.350', '0.350'] [Step 169 / Rank 5] Tasks: ['Single QA'] | Lens: [46836] → Tgt Spa: ['0.350'] [Step 169 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29857, 29857] → Tgt Spa: ['0.350', '0.350'] [Step 169 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25982, 25983] → Tgt Spa: ['1.000', '1.000'] [Step 169 / Rank 6] Tasks: ['Single QA'] | Lens: [55101] → Tgt Spa: ['0.350'] [Step 169 / Rank 5] Tasks: ['Single QA'] | Lens: [44168] → Tgt Spa: ['0.350'] [Step 169 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53390] → Tgt Spa: ['1.000'] [Step 169 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53390] → Tgt Spa: ['1.000'] [Step 169 / Rank 7] Tasks: ['Single QA'] | Lens: [55101] → Tgt Spa: ['0.350'] [Step 169 / Rank 4] Tasks: ['Single QA'] | Lens: [44168] → Tgt Spa: ['0.350'] [Step 169 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25982, 25983] → Tgt Spa: ['1.000', '1.000'] [Step 169 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16794, 16783, 16794] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 169 / Rank 4] Tasks: ['Single QA'] | Lens: [55697] → Tgt Spa: ['0.350'] [Step 169 / Rank 3] Tasks: ['Single QA'] | Lens: [59447] → Tgt Spa: ['0.350'] [Step 169 / Rank 5] Tasks: ['Single QA'] | Lens: [55697] → Tgt Spa: ['0.350'] [Step 169 / Rank 7] Tasks: ['Single QA'] | Lens: [41281] → Tgt Spa: ['0.350'] [Step 169 / Rank 2] Tasks: ['Single QA'] | Lens: [59447] → Tgt Spa: ['0.350'] [Step 169 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16794, 16783, 16794] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 169 / Rank 6] Tasks: ['Single QA'] | Lens: [41281] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:18:57,404 >> @ 169 | Loss: 2.1926 | LM: 2.1471 | Reg: 0.0455 | Spa(Avg): 
0.479 [INFO|lh_trainer.py:797] 2026-02-17 02:18:57,404 >> Statistic -> Code | Spa: 0.679 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 02:18:57,405 >> Statistic -> In-Context | Spa: 0.703 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:18:57,405 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:18:57,405 >> Statistic -> Single | Spa: 0.374 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:18:57,405 >> Statistic -> Summarization | Spa: 0.644 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:810] 2026-02-17 02:18:57,407 >> [Micro-Log] {"loss": 2.1926323423782983, "lm_loss": 2.147144583364328, "reg_loss": 0.045487736380891874, "model_sparsity(avg)": 0.4790702189008395, "Spa-Single QA sparsity": 0.3742283880710602, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.017384074977599084, "Spa-In-Context Learning sparsity": 0.7027777910232544, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10427138060331345, "Spa-Summarization sparsity": 0.6435185273488363, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11279522875944774, "Spa-Code sparsity": 0.6785714370863778, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09834819393498558, "Spa-MultiHop QA sparsity": 0.6458333730697632, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1237441934645176, "step": 169, "current_tau": 1.0045931339263916, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.298828125, "lambda3 Summarization": 0.14453125, "lambda4 Code": 0.2431640625} [INFO|lh_trainer.py:331] 2026-02-17 02:19:20,916 >> {'loss': 13.1558, 'grad_norm': 0.44205302000045776, 'learning_rate': 0.0002858731769104793, 'epoch': 0.17904160084254872, 'num_input_tokens_seen': 418090970, 'completed': '56.67% (170 / 300)', 'remaining time': '6:05:11', 'throughput': '7401.39', 
'gpu_mem_free': '11395MB', 'step': 170} [Step 170 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [62289] → Tgt Spa: ['1.000'] [Step 170 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [24121, 24112] → Tgt Spa: ['1.000', '1.000'] [Step 170 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [24121, 24112] → Tgt Spa: ['1.000', '1.000'] [Step 170 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [26100, 26092] → Tgt Spa: ['1.000', '1.000'] [Step 170 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [62289] → Tgt Spa: ['1.000'] [Step 170 / Rank 0] Tasks: ['Single QA'] | Lens: [36938] → Tgt Spa: ['0.350'] [Step 170 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [26100, 26092] → Tgt Spa: ['1.000', '1.000'] [Step 170 / Rank 1] Tasks: ['Single QA'] | Lens: [36938] → Tgt Spa: ['0.350'] [Step 170 / Rank 3] Tasks: ['Code'] | Lens: [52752] → Tgt Spa: ['1.000'] [Step 170 / Rank 5] Tasks: ['Code'] | Lens: [50452] → Tgt Spa: ['1.000'] [Step 170 / Rank 1] Tasks: ['Code'] | Lens: [52771] → Tgt Spa: ['1.000'] [Step 170 / Rank 7] Tasks: ['Single QA'] | Lens: [57236] → Tgt Spa: ['0.350'] [Step 170 / Rank 2] Tasks: ['Code'] | Lens: [52752] → Tgt Spa: ['1.000'] [Step 170 / Rank 4] Tasks: ['Code'] | Lens: [50452] → Tgt Spa: ['1.000'] [Step 170 / Rank 0] Tasks: ['Code'] | Lens: [52771] → Tgt Spa: ['1.000'] [Step 170 / Rank 6] Tasks: ['Single QA'] | Lens: [57236] → Tgt Spa: ['0.350'] [Step 170 / Rank 5] Tasks: ['Summarization', 'Code'] | Lens: [29208, 29198] → Tgt Spa: ['1.000', '1.000'] [Step 170 / Rank 4] Tasks: ['Summarization', 'Code'] | Lens: [29208, 29198] → Tgt Spa: ['1.000', '1.000'] [Step 170 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10998, 10999, 10999, 11000, 11000] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 170 / Rank 3] Tasks: ['Summarization'] | Lens: [33214] → Tgt Spa: ['1.000'] [Step 170 / Rank 7] Tasks: ['Single QA'] | Lens: [60723] → Tgt Spa: ['0.350'] [Step 170 / Rank 2] 
Tasks: ['Summarization'] | Lens: [33214] → Tgt Spa: ['1.000'] [Step 170 / Rank 6] Tasks: ['Single QA'] | Lens: [60723] → Tgt Spa: ['0.350'] [Step 170 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10998, 10999, 10999, 11000, 11000] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 170 / Rank 3] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [5265, 5272, 5265, 5274, 5266, 5267, 5269, 5269, 5270, 5277, 5270, 5271] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 170 / Rank 6] Tasks: ['Summarization'] | Lens: [39537] → Tgt Spa: ['1.000'] [Step 170 / Rank 2] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [5265, 5272, 5265, 5274, 5266, 5267, 5269, 5269, 5270, 5277, 5270, 5271] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 170 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17644, 17646, 17646] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 170 / Rank 7] Tasks: ['Summarization'] | Lens: [39537] → Tgt Spa: ['1.000'] [Step 170 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26368, 26369] → Tgt Spa: ['1.000', '1.000'] [Step 170 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17644, 17646, 17646] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 170 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26368, 26369] → Tgt Spa: ['1.000', '1.000'] [Step 170 / Rank 3] Tasks: ['Code'] | Lens: [55088] 
→ Tgt Spa: ['1.000'] [Step 170 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17247, 17251, 17251] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 170 / Rank 6] Tasks: ['Single QA'] | Lens: [37506] → Tgt Spa: ['0.350'] [Step 170 / Rank 2] Tasks: ['Code'] | Lens: [55088] → Tgt Spa: ['1.000'] [Step 170 / Rank 7] Tasks: ['Single QA'] | Lens: [37506] → Tgt Spa: ['0.350'] [Step 170 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17247, 17251, 17251] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 170 / Rank 1] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9164, 9169, 9172, 9164, 9166, 9166, 9174] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 170 / Rank 0] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9164, 9169, 9172, 9164, 9166, 9166, 9174] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 170 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40481] → Tgt Spa: ['1.000'] [Step 170 / Rank 4] Tasks: ['Single QA'] | Lens: [53574] → Tgt Spa: ['0.350'] [Step 170 / Rank 6] Tasks: ['Code'] | Lens: [42493] → Tgt Spa: ['1.000'] [Step 170 / Rank 0] Tasks: ['Single QA'] | Lens: [49131] → Tgt Spa: ['0.350'] [Step 170 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40481] → Tgt Spa: ['1.000'] [Step 170 / Rank 5] Tasks: ['Single QA'] | Lens: [53574] → Tgt Spa: ['0.350'] [Step 170 / Rank 1] Tasks: ['Single QA'] | Lens: [49131] → Tgt Spa: ['0.350'] [Step 170 / Rank 7] Tasks: ['Code'] | Lens: [42493] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:21:43,727 >> @ 170 | Loss: 1.8318 | LM: 1.7560 | Reg: 0.0758 | Spa(Avg): 0.588 [INFO|lh_trainer.py:797] 2026-02-17 02:21:43,727 >> Statistic -> Code | Spa: 0.686 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 02:21:43,727 >> Statistic -> In-Context | Spa: 0.694 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 02:21:43,727 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:21:43,727 >> Statistic -> Single | Spa: 0.408 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:21:43,727 >> Statistic -> Summarization | Spa: 0.647 | Tgt: 1.000 | Z-Loss: 0.112 | [INFO|lh_trainer.py:810] 2026-02-17 02:21:43,729 >> [Micro-Log] {"loss": 1.8317612502723932, "lm_loss": 1.755980374912421, "reg_loss": 0.07578087497192125, "model_sparsity(avg)": 0.5875261835753918, "Spa-Single QA sparsity": 0.40833332935969036, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03760538084122042, "Spa-Code sparsity": 0.6861111124356588, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0958292509118716, "Spa-In-Context Learning sparsity": 0.6944444520132882, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10819686789597784, "Spa-Summarization sparsity": 0.6466049353281657, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11203568263186349, "Spa-MultiHop QA sparsity": 0.6458333730697632, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1237441934645176, "step": 170, "current_tau": 1.003798007965088, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.298828125, "lambda3 Summarization": 0.14453125, "lambda4 Code": 0.244140625} [INFO|lh_trainer.py:331] 2026-02-17 02:22:03,712 >> {'loss': 10.9906, 'grad_norm': 0.8143755197525024, 'learning_rate': 0.0002826315697918581, 'epoch': 0.18009478672985782, 'num_input_tokens_seen': 420536658, 'completed': '57.00% (171 / 300)', 'remaining time': '6:02:18', 'throughput': '7511.54', 'gpu_mem_free': '9977MB', 'step': 171} [Step 171 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [16787, 16787, 16787] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 171 / Rank 2] Tasks: ['In-Context Learning'] | Lens: 
[42354] → Tgt Spa: ['1.000'] [Step 171 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [16787, 16787, 16787] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 171 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32164, 32164] → Tgt Spa: ['0.350', '0.350'] [Step 171 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42354] → Tgt Spa: ['1.000'] [Step 171 / Rank 0] Tasks: ['Single QA'] | Lens: [61778] → Tgt Spa: ['0.350'] [Step 171 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32164, 32164] → Tgt Spa: ['0.350', '0.350'] [Step 171 / Rank 1] Tasks: ['Single QA'] | Lens: [61778] → Tgt Spa: ['0.350'] [Step 171 / Rank 4] Tasks: ['Single QA'] | Lens: [52686] → Tgt Spa: ['0.350'] [Step 171 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28377, 28379] → Tgt Spa: ['1.000', '1.000'] [Step 171 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42122] → Tgt Spa: ['1.000'] [Step 171 / Rank 5] Tasks: ['Single QA'] | Lens: [52686] → Tgt Spa: ['0.350'] [Step 171 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40297] → Tgt Spa: ['1.000'] [Step 171 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40297] → Tgt Spa: ['1.000'] [Step 171 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28377, 28379] → Tgt Spa: ['1.000', '1.000'] [Step 171 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42122] → Tgt Spa: ['1.000'] [Step 171 / Rank 5] Tasks: ['Summarization', 'Single QA', 'Single QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Summarization', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'Summarization'] | Lens: [3606, 3589, 3589, 3588, 3606, 3588, 3608, 3592, 3592, 3591, 3593, 3593, 3593, 3600, 3594, 3594, 3596, 3613] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000'] 
[Step 171 / Rank 4] Tasks: ['Summarization', 'Single QA', 'Single QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Summarization', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'Summarization'] | Lens: [3606, 3589, 3589, 3588, 3606, 3588, 3608, 3592, 3592, 3591, 3593, 3593, 3593, 3600, 3594, 3594, 3596, 3613] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000'] [Step 171 / Rank 1] Tasks: ['Code'] | Lens: [44477] → Tgt Spa: ['1.000'] [Step 171 / Rank 3] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [20796, 20778, 20788] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 171 / Rank 6] Tasks: ['Single QA'] | Lens: [64592] → Tgt Spa: ['0.350'] [Step 171 / Rank 0] Tasks: ['Code'] | Lens: [44477] → Tgt Spa: ['1.000'] [Step 171 / Rank 7] Tasks: ['Single QA'] | Lens: [64592] → Tgt Spa: ['0.350'] [Step 171 / Rank 2] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [20796, 20778, 20788] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 171 / Rank 1] Tasks: ['Single QA'] | Lens: [52282] → Tgt Spa: ['0.350'] [Step 171 / Rank 2] Tasks: ['Single QA'] | Lens: [64972] → Tgt Spa: ['0.350'] [Step 171 / Rank 4] Tasks: ['Single QA'] | Lens: [42348] → Tgt Spa: ['0.350'] [Step 171 / Rank 5] Tasks: ['Single QA'] | Lens: [42348] → Tgt Spa: ['0.350'] [Step 171 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [23879, 23879] → Tgt Spa: ['0.350', '0.350'] [Step 171 / Rank 0] Tasks: ['Single QA'] | Lens: [52282] → Tgt Spa: ['0.350'] [Step 171 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [23879, 23879] → Tgt Spa: ['0.350', '0.350'] [Step 171 / Rank 3] Tasks: ['Single QA'] | Lens: [64972] → Tgt Spa: ['0.350'] [Step 171 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [23634, 23626] → Tgt Spa: ['1.000', '0.350'] [Step 171 / Rank 4] 
Tasks: ['Code', 'Single QA'] | Lens: [23634, 23626] → Tgt Spa: ['1.000', '0.350'] [Step 171 / Rank 7] Tasks: ['Single QA'] | Lens: [41310] → Tgt Spa: ['0.350'] [Step 171 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20506, 20506, 20495] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 171 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22591, 22592] → Tgt Spa: ['1.000', '1.000'] [Step 171 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22591, 22592] → Tgt Spa: ['1.000', '1.000'] [Step 171 / Rank 6] Tasks: ['Single QA'] | Lens: [41310] → Tgt Spa: ['0.350'] [Step 171 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20506, 20506, 20495] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 171 / Rank 4] Tasks: ['Code'] | Lens: [58101] → Tgt Spa: ['1.000'] [Step 171 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [64294] → Tgt Spa: ['1.000'] [Step 171 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [23522, 23531] → Tgt Spa: ['0.350', '1.000'] [Step 171 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [23522, 23531] → Tgt Spa: ['0.350', '1.000'] [Step 171 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [64294] → Tgt Spa: ['1.000'] [Step 171 / Rank 5] Tasks: ['Code'] | Lens: [58101] → Tgt Spa: ['1.000'] [Step 171 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23583, 23582] → Tgt Spa: ['0.350', '1.000'] [Step 171 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23583, 23582] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:24:37,271 >> @ 171 | Loss: 2.1740 | LM: 2.0959 | Reg: 0.0781 | Spa(Avg): 0.559 [INFO|lh_trainer.py:797] 2026-02-17 02:24:37,271 >> Statistic -> Code | Spa: 0.663 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:797] 2026-02-17 02:24:37,271 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:24:37,271 >> Statistic -> MultiHop | Spa: 0.662 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 02:24:37,271 >> Statistic -> Single | Spa: 0.447 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:24:37,271 >> Statistic -> Summarization | Spa: 0.640 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:810] 2026-02-17 02:24:37,273 >> [Micro-Log] {"loss": 2.173998457069198, "lm_loss": 2.0959457798550525, "reg_loss": 0.07805269727638613, "model_sparsity(avg)": 0.5592206778625647, "Spa-Single QA sparsity": 0.4470899360520499, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.063338637751128, "Spa-In-Context Learning sparsity": 0.7115384615384616, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1015732208123574, "Spa-Code sparsity": 0.6626984221594674, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10485645596470151, "Spa-Summarization sparsity": 0.6402777850627899, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1129713237285614, "Spa-MultiHop QA sparsity": 0.6620370348294576, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.13190090656280518, "step": 171, "current_tau": 1.0030778646469116, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.298828125, "lambda3 Summarization": 0.1455078125, "lambda4 Code": 0.244140625} [INFO|lh_trainer.py:331] 2026-02-17 02:25:03,218 >> {'loss': 13.044, 'grad_norm': 0.7114960551261902, 'learning_rate': 0.0002793843714260245, 'epoch': 0.18114797261716692, 'num_input_tokens_seen': 423068800, 'completed': '57.33% (172 / 300)', 'remaining time': '5:59:38', 'throughput': '7053.08', 'gpu_mem_free': '12073MB', 'step': 172} [Step 172 / Rank 4] Tasks: ['Single QA'] | Lens: [57371] → Tgt Spa: ['0.350'] [Step 172 / Rank 6] Tasks: ['Single QA'] | Lens: [55668] → Tgt Spa: ['0.350'] [Step 172 / Rank 5] Tasks: ['Single QA'] | Lens: [57371] → Tgt Spa: ['0.350'] [Step 172 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39662] → Tgt Spa: ['1.000'] [Step 
172 / Rank 7] Tasks: ['Single QA'] | Lens: [55668] → Tgt Spa: ['0.350'] [Step 172 / Rank 3] Tasks: ['Single QA'] | Lens: [50729] → Tgt Spa: ['0.350'] [Step 172 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39662] → Tgt Spa: ['1.000'] [Step 172 / Rank 2] Tasks: ['Single QA'] | Lens: [50729] → Tgt Spa: ['0.350'] [Step 172 / Rank 3] Tasks: ['Code'] | Lens: [53631] → Tgt Spa: ['1.000'] [Step 172 / Rank 6] Tasks: ['Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [6385, 6380, 6380, 6380, 6381, 6381, 6390, 6392, 6385, 6385] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 172 / Rank 2] Tasks: ['Code'] | Lens: [53631] → Tgt Spa: ['1.000'] [Step 172 / Rank 5] Tasks: ['Single QA'] | Lens: [57506] → Tgt Spa: ['0.350'] [Step 172 / Rank 1] Tasks: ['Single QA'] | Lens: [52683] → Tgt Spa: ['0.350'] [Step 172 / Rank 7] Tasks: ['Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [6385, 6380, 6380, 6380, 6381, 6381, 6390, 6392, 6385, 6385] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 172 / Rank 4] Tasks: ['Single QA'] | Lens: [57506] → Tgt Spa: ['0.350'] [Step 172 / Rank 0] Tasks: ['Single QA'] | Lens: [52683] → Tgt Spa: ['0.350'] [Step 172 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29714, 29714] → Tgt Spa: ['0.350', '0.350'] [Step 172 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29714, 29714] → Tgt Spa: ['0.350', '0.350'] [Step 172 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32550, 32551] → Tgt Spa: ['0.350', '0.350'] [Step 172 / Rank 3] Tasks: ['Single QA'] | Lens: [38801] → Tgt Spa: ['0.350'] [Step 172 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [29874, 29882] → Tgt Spa: ['1.000', '1.000'] [Step 
172 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [29874, 29882] → Tgt Spa: ['1.000', '1.000'] [Step 172 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32550, 32551] → Tgt Spa: ['0.350', '0.350'] [Step 172 / Rank 2] Tasks: ['Single QA'] | Lens: [38801] → Tgt Spa: ['0.350'] [Step 172 / Rank 6] Tasks: ['Code'] | Lens: [56744] → Tgt Spa: ['1.000'] [Step 172 / Rank 2] Tasks: ['Single QA'] | Lens: [59700] → Tgt Spa: ['0.350'] [Step 172 / Rank 3] Tasks: ['Single QA'] | Lens: [59700] → Tgt Spa: ['0.350'] [Step 172 / Rank 0] Tasks: ['In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Single QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [4320, 4321, 4339, 4321, 4321, 4342, 4323, 4323, 4331, 4331, 4324, 4326, 4331, 4325, 4326] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 172 / Rank 5] Tasks: ['Single QA'] | Lens: [36042] → Tgt Spa: ['0.350'] [Step 172 / Rank 1] Tasks: ['In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Single QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [4320, 4321, 4339, 4321, 4321, 4342, 4323, 4323, 4331, 4331, 4324, 4326, 4331, 4325, 4326] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 172 / Rank 7] Tasks: ['Code'] | Lens: [56744] → Tgt Spa: ['1.000'] [Step 172 / Rank 4] Tasks: ['Single QA'] | Lens: [36042] → Tgt Spa: ['0.350'] [Step 172 / Rank 1] Tasks: ['Code'] | Lens: [37579] → Tgt Spa: ['1.000'] [Step 172 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61104] → Tgt Spa: ['1.000'] [Step 172 / Rank 6] 
Tasks: ['Single QA'] | Lens: [60538] → Tgt Spa: ['0.350'] [Step 172 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61104] → Tgt Spa: ['1.000'] [Step 172 / Rank 0] Tasks: ['Code'] | Lens: [37579] → Tgt Spa: ['1.000'] [Step 172 / Rank 7] Tasks: ['Single QA'] | Lens: [60538] → Tgt Spa: ['0.350'] [Step 172 / Rank 4] Tasks: ['Code'] | Lens: [62340] → Tgt Spa: ['1.000'] [Step 172 / Rank 5] Tasks: ['Code'] | Lens: [62340] → Tgt Spa: ['1.000'] [Step 172 / Rank 3] Tasks: ['Single QA'] | Lens: [44653] → Tgt Spa: ['0.350'] [Step 172 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [53594] → Tgt Spa: ['1.000'] [Step 172 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25541, 25544] → Tgt Spa: ['1.000', '1.000'] [Step 172 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32374, 32376] → Tgt Spa: ['0.350', '0.350'] [Step 172 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [53594] → Tgt Spa: ['1.000'] [Step 172 / Rank 2] Tasks: ['Single QA'] | Lens: [44653] → Tgt Spa: ['0.350'] [Step 172 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32374, 32376] → Tgt Spa: ['0.350', '0.350'] [Step 172 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25541, 25544] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:27:34,300 >> @ 172 | Loss: 2.1517 | LM: 2.0968 | Reg: 0.0549 | Spa(Avg): 0.506 [INFO|lh_trainer.py:797] 2026-02-17 02:27:34,301 >> Statistic -> Code | Spa: 0.673 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:797] 2026-02-17 02:27:34,301 >> Statistic -> In-Context | Spa: 0.697 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:27:34,301 >> Statistic -> MultiHop | Spa: 0.639 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:27:34,301 >> Statistic -> Single | Spa: 0.404 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:27:34,301 >> Statistic -> Summarization | Spa: 0.625 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:810] 2026-02-17 
02:27:34,303 >> [Micro-Log] {"loss": 2.151717302699884, "lm_loss": 2.0968490143616996, "reg_loss": 0.05486828423454426, "model_sparsity(avg)": 0.5059413500130177, "Spa-In-Context Learning sparsity": 0.6968954205513, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10749448090791702, "Spa-Single QA sparsity": 0.40410051743189496, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03995853929691726, "Spa-Summarization sparsity": 0.625, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1191881000995636, "Spa-Code sparsity": 0.6729797937653281, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10117810354991393, "Spa-MultiHop QA sparsity": 0.6388888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1192968338727951, "step": 172, "current_tau": 1.002432942390442, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.298828125, "lambda3 Summarization": 0.1455078125, "lambda4 Code": 0.2451171875} [INFO|lh_trainer.py:331] 2026-02-17 02:27:53,838 >> {'loss': 12.9103, 'grad_norm': 0.6115074753761292, 'learning_rate': 0.0002761321382037018, 'epoch': 0.18220115850447605, 'num_input_tokens_seen': 425683216, 'completed': '57.67% (173 / 300)', 'remaining time': '5:56:51', 'throughput': '7661.48', 'gpu_mem_free': '6081MB', 'step': 173} [Step 173 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19339, 19330, 19342] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 173 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39274] → Tgt Spa: ['1.000'] [Step 173 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19339, 19330, 19342] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 173 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [48747] → Tgt Spa: ['1.000'] [Step 173 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [48747] → Tgt Spa: ['1.000'] [Step 173 / Rank 5] Tasks: ['Single QA'] | Lens: [55525] → Tgt Spa: 
['0.350'] [Step 173 / Rank 4] Tasks: ['Single QA'] | Lens: [55525] → Tgt Spa: ['0.350'] [Step 173 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39274] → Tgt Spa: ['1.000'] [Step 173 / Rank 3] Tasks: ['Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2518, 2513, 2515, 2532, 2532, 2513, 2515, 2516, 2516, 2516, 2534, 2534, 2518, 2518, 2518, 2521, 2519, 2519, 2518, 2521, 2520, 2521, 2523, 2524, 2523] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 173 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59588] → Tgt Spa: ['1.000'] [Step 173 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [46574] → Tgt Spa: ['1.000'] [Step 173 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22726, 22726] → Tgt Spa: ['1.000', '1.000'] [Step 173 / Rank 2] Tasks: ['Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2518, 2513, 2515, 2532, 2532, 2513, 2515, 2516, 2516, 2516, 2534, 2534, 2518, 2518, 2518, 2521, 2519, 2519, 2518, 2521, 2520, 2521, 2523, 2524, 2523] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', 
'0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 173 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22726, 22726] → Tgt Spa: ['1.000', '1.000'] [Step 173 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [46574] → Tgt Spa: ['1.000'] [Step 173 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59588] → Tgt Spa: ['1.000'] [Step 173 / Rank 6] Tasks: ['Single QA'] | Lens: [54040] → Tgt Spa: ['0.350'] [Step 173 / Rank 0] Tasks: ['Single QA'] | Lens: [61765] → Tgt Spa: ['0.350'] [Step 173 / Rank 1] Tasks: ['Single QA'] | Lens: [61765] → Tgt Spa: ['0.350'] [Step 173 / Rank 7] Tasks: ['Single QA'] | Lens: [54040] → Tgt Spa: ['0.350'] [Step 173 / Rank 2] Tasks: ['Single QA'] | Lens: [39179] → Tgt Spa: ['0.350'] [Step 173 / Rank 3] Tasks: ['Single QA'] | Lens: [39179] → Tgt Spa: ['0.350'] [Step 173 / Rank 4] Tasks: ['Single QA'] | Lens: [44681] → Tgt Spa: ['0.350'] [Step 173 / Rank 5] Tasks: ['Single QA'] | Lens: [44681] → Tgt Spa: ['0.350'] [Step 173 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17626, 17627, 17616] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 173 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17626, 17627, 17616] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 173 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [26957, 26950] → Tgt Spa: ['1.000', '1.000'] [Step 173 / Rank 6] Tasks: ['Single QA'] | Lens: [41419] → Tgt Spa: ['0.350'] [Step 173 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [26466, 26459] → Tgt Spa: ['1.000', '1.000'] [Step 173 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [26957, 26950] → Tgt Spa: ['1.000', '1.000'] [Step 173 / Rank 7] Tasks: ['Single QA'] | Lens: [41419] → Tgt Spa: ['0.350'] [Step 173 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [26466, 26459] → Tgt Spa: ['1.000', '1.000'] [Step 173 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning', 'MultiHop QA', 
'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'Single QA'] | Lens: [3109, 3110, 3111, 3128, 3117, 3110, 3113, 3111, 3113, 3112, 3112, 3114, 3113, 3115, 3132, 3116, 3115, 3116, 3116, 3117, 3116] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 173 / Rank 5] Tasks: ['Code'] | Lens: [38505] → Tgt Spa: ['1.000'] [Step 173 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [37306] → Tgt Spa: ['1.000'] [Step 173 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [37306] → Tgt Spa: ['1.000'] [Step 173 / Rank 4] Tasks: ['Code'] | Lens: [38505] → Tgt Spa: ['1.000'] [Step 173 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'In-Context Learning', 'Single QA'] | Lens: [3109, 3110, 3111, 3128, 3117, 3110, 3113, 3111, 3113, 3112, 3112, 3114, 3113, 3115, 3132, 3116, 3115, 3116, 3116, 3117, 3116] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 173 / Rank 3] Tasks: ['Single QA'] | Lens: [41488] → Tgt Spa: ['0.350'] [Step 173 / Rank 2] Tasks: ['Single QA'] | Lens: [41488] → Tgt Spa: ['0.350'] [Step 173 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43934] → Tgt Spa: ['1.000'] [Step 173 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43934] → Tgt Spa: 
['1.000'] [Step 173 / Rank 3] Tasks: ['Single QA'] | Lens: [49987] → Tgt Spa: ['0.350'] [Step 173 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [32906] → Tgt Spa: ['1.000'] [Step 173 / Rank 4] Tasks: ['Single QA'] | Lens: [41573] → Tgt Spa: ['0.350'] [Step 173 / Rank 5] Tasks: ['Single QA'] | Lens: [41573] → Tgt Spa: ['0.350'] [Step 173 / Rank 2] Tasks: ['Single QA'] | Lens: [49987] → Tgt Spa: ['0.350'] [Step 173 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [32906] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:30:05,477 >> @ 173 | Loss: 2.2464 | LM: 2.1753 | Reg: 0.0711 | Spa(Avg): 0.561 [INFO|lh_trainer.py:797] 2026-02-17 02:30:05,477 >> Statistic -> Code | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 02:30:05,477 >> Statistic -> In-Context | Spa: 0.704 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:30:05,477 >> Statistic -> MultiHop | Spa: 0.613 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:30:05,477 >> Statistic -> Single | Spa: 0.427 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:30:05,477 >> Statistic -> Summarization | Spa: 0.578 | Tgt: 1.000 | Z-Loss: 0.146 | [INFO|lh_trainer.py:810] 2026-02-17 02:30:05,479 >> [Micro-Log] {"loss": 2.2463864162564278, "lm_loss": 2.175291635096073, "reg_loss": 0.07109477341873571, "model_sparsity(avg)": 0.5608443543314934, "Spa-In-Context Learning sparsity": 0.7044753101136949, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.104369864695602, "Spa-Single QA sparsity": 0.42685184478759763, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.053697810601443054, "Spa-Code sparsity": 0.6805555394717625, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09827775082417897, "Spa-Summarization sparsity": 0.5777777791023254, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14587213918566705, "Spa-MultiHop QA 
sparsity": 0.6133333277702332, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10714610397815705, "step": 173, "current_tau": 1.0018634796142578, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.298828125, "lambda3 Summarization": 0.146484375, "lambda4 Code": 0.2451171875} [INFO|lh_trainer.py:331] 2026-02-17 02:30:23,263 >> {'loss': 13.4783, 'grad_norm': 0.7895613312721252, 'learning_rate': 0.00027287542737831016, 'epoch': 0.18325434439178515, 'num_input_tokens_seen': 428019392, 'completed': '58.00% (174 / 300)', 'remaining time': '5:53:48', 'throughput': '7817.27', 'gpu_mem_free': '14753MB', 'step': 174} [Step 174 / Rank 3] Tasks: ['Single QA'] | Lens: [59023] → Tgt Spa: ['0.350'] [Step 174 / Rank 0] Tasks: ['Single QA'] | Lens: [38011] → Tgt Spa: ['0.350'] [Step 174 / Rank 5] Tasks: ['Code', 'Single QA', 'Code', 'Code'] | Lens: [14876, 14879, 14899, 14903] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 174 / Rank 4] Tasks: ['Code', 'Single QA', 'Code', 'Code'] | Lens: [14876, 14879, 14899, 14903] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 174 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [18571, 18570, 18571] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 174 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [18571, 18570, 18571] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 174 / Rank 1] Tasks: ['Single QA'] | Lens: [38011] → Tgt Spa: ['0.350'] [Step 174 / Rank 2] Tasks: ['Single QA'] | Lens: [59023] → Tgt Spa: ['0.350'] [Step 174 / Rank 0] Tasks: ['Single QA'] | Lens: [54855] → Tgt Spa: ['0.350'] [Step 174 / Rank 1] Tasks: ['Single QA'] | Lens: [54855] → Tgt Spa: ['0.350'] [Step 174 / Rank 4] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [20017, 20007, 19999] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 174 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40608] → Tgt Spa: ['1.000'] [Step 174 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40608] → Tgt Spa: ['1.000'] [Step 174 / 
Rank 2] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [4729, 4711, 4711, 4712, 4719, 4720, 4721, 4714, 4715, 4722, 4722, 4716, 4718] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 174 / Rank 5] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [20017, 20007, 19999] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 174 / Rank 3] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [4729, 4711, 4711, 4712, 4719, 4720, 4721, 4714, 4715, 4722, 4722, 4716, 4718] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 174 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42346] → Tgt Spa: ['1.000'] [Step 174 / Rank 7] Tasks: ['Single QA'] | Lens: [40749] → Tgt Spa: ['0.350'] [Step 174 / Rank 3] Tasks: ['Single QA'] | Lens: [49531] → Tgt Spa: ['0.350'] [Step 174 / Rank 6] Tasks: ['Single QA'] | Lens: [40749] → Tgt Spa: ['0.350'] [Step 174 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [21869, 21869] → Tgt Spa: ['0.350', '0.350'] [Step 174 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42346] → Tgt Spa: ['1.000'] [Step 174 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [21869, 21869] → Tgt Spa: ['0.350', '0.350'] [Step 174 / Rank 2] Tasks: ['Single QA'] | Lens: [49531] → Tgt Spa: ['0.350'] [Step 174 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42416] → Tgt Spa: ['1.000'] [Step 174 / Rank 0] Tasks: ['Single QA'] | Lens: [39155] → Tgt Spa: ['0.350'] [Step 174 / Rank 1] Tasks: ['Single QA'] | Lens: [39155] → Tgt Spa: ['0.350'] [Step 174 / Rank 4] Tasks: 
['In-Context Learning', 'Single QA'] | Lens: [24419, 24420] → Tgt Spa: ['1.000', '0.350'] [Step 174 / Rank 3] Tasks: ['Single QA'] | Lens: [43585] → Tgt Spa: ['0.350'] [Step 174 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42416] → Tgt Spa: ['1.000'] [Step 174 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24419, 24420] → Tgt Spa: ['1.000', '0.350'] [Step 174 / Rank 2] Tasks: ['Single QA'] | Lens: [43585] → Tgt Spa: ['0.350'] [Step 174 / Rank 5] Tasks: ['Single QA'] | Lens: [64675] → Tgt Spa: ['0.350'] [Step 174 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24078, 24077] → Tgt Spa: ['1.000', '1.000'] [Step 174 / Rank 4] Tasks: ['Single QA'] | Lens: [64675] → Tgt Spa: ['0.350'] [Step 174 / Rank 7] Tasks: ['Single QA'] | Lens: [50478] → Tgt Spa: ['0.350'] [Step 174 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25996, 25997] → Tgt Spa: ['1.000', '1.000'] [Step 174 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24078, 24077] → Tgt Spa: ['1.000', '1.000'] [Step 174 / Rank 6] Tasks: ['Single QA'] | Lens: [50478] → Tgt Spa: ['0.350'] [Step 174 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25996, 25997] → Tgt Spa: ['1.000', '1.000'] [Step 174 / Rank 6] Tasks: ['Single QA'] | Lens: [45678] → Tgt Spa: ['0.350'] [Step 174 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19254, 19243, 19246] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 174 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27237, 27241] → Tgt Spa: ['1.000', '0.350'] [Step 174 / Rank 4] Tasks: ['Single QA', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5221, 5240, 5229, 5222, 5222, 5223, 5230, 5223, 5224, 5224, 5224, 5225] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', 
'0.350', '1.000', '1.000', '1.000'] [Step 174 / Rank 5] Tasks: ['Single QA', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5221, 5240, 5229, 5222, 5222, 5223, 5230, 5223, 5224, 5224, 5224, 5225] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 174 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27237, 27241] → Tgt Spa: ['1.000', '0.350'] [Step 174 / Rank 7] Tasks: ['Single QA'] | Lens: [45678] → Tgt Spa: ['0.350'] [Step 174 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19254, 19243, 19246] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:32:41,831 >> @ 174 | Loss: 2.1498 | LM: 2.0925 | Reg: 0.0573 | Spa(Avg): 0.523 [INFO|lh_trainer.py:797] 2026-02-17 02:32:41,831 >> Statistic -> Code | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 02:32:41,831 >> Statistic -> In-Context | Spa: 0.699 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:32:41,831 >> Statistic -> MultiHop | Spa: 0.613 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:32:41,831 >> Statistic -> Single | Spa: 0.425 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:32:41,831 >> Statistic -> Summarization | Spa: 0.653 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:810] 2026-02-17 02:32:41,833 >> [Micro-Log] {"loss": 2.1498254661758742, "lm_loss": 2.0925483498722315, "reg_loss": 0.05727710390541082, "model_sparsity(avg)": 0.5226065392295519, "Spa-Single QA sparsity": 0.425, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05087854315061122, "Spa-In-Context Learning sparsity": 0.6990740724972316, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10655472782396135, 
"Spa-Summarization sparsity": 0.652777761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10628783330321312, "Spa-Code sparsity": 0.6814236119389534, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09835011186078191, "Spa-MultiHop QA sparsity": 0.6133333277702332, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10714610397815705, "step": 174, "current_tau": 1.0013694763183594, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.298828125, "lambda3 Summarization": 0.146484375, "lambda4 Code": 0.24609375} [INFO|lh_trainer.py:331] 2026-02-17 02:32:57,205 >> {'loss': 12.899, 'grad_norm': 0.6167088150978088, 'learning_rate': 0.00026961479697048385, 'epoch': 0.18430753027909427, 'num_input_tokens_seen': 430450162, 'completed': '58.33% (175 / 300)', 'remaining time': '5:50:49', 'throughput': '7895.05', 'gpu_mem_free': '9787MB', 'step': 175} [Step 175 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [54080] → Tgt Spa: ['1.000'] [Step 175 / Rank 6] Tasks: ['Single QA'] | Lens: [53666] → Tgt Spa: ['0.350'] [Step 175 / Rank 3] Tasks: ['Single QA'] | Lens: [52635] → Tgt Spa: ['0.350'] [Step 175 / Rank 7] Tasks: ['Single QA'] | Lens: [53666] → Tgt Spa: ['0.350'] [Step 175 / Rank 5] Tasks: ['Single QA'] | Lens: [63934] → Tgt Spa: ['0.350'] [Step 175 / Rank 4] Tasks: ['Single QA'] | Lens: [63934] → Tgt Spa: ['0.350'] [Step 175 / Rank 2] Tasks: ['Single QA'] | Lens: [52635] → Tgt Spa: ['0.350'] [Step 175 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [54080] → Tgt Spa: ['1.000'] [Step 175 / Rank 7] Tasks: ['Single QA'] | Lens: [63691] → Tgt Spa: ['0.350'] [Step 175 / Rank 6] Tasks: ['Single QA'] | Lens: [63691] → Tgt Spa: ['0.350'] [Step 175 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55888] → Tgt Spa: ['1.000'] [Step 175 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55888] → Tgt Spa: ['1.000'] [Step 175 / Rank 0] Tasks: ['Single QA'] | Lens: [60976] → Tgt Spa: ['0.350'] [Step 175 / Rank 
4] Tasks: ['Summarization'] | Lens: [47359] → Tgt Spa: ['1.000'] [Step 175 / Rank 5] Tasks: ['Summarization'] | Lens: [47359] → Tgt Spa: ['1.000'] [Step 175 / Rank 1] Tasks: ['Single QA'] | Lens: [60976] → Tgt Spa: ['0.350'] [Step 175 / Rank 4] Tasks: ['Single QA'] | Lens: [37510] → Tgt Spa: ['0.350'] [Step 175 / Rank 6] Tasks: ['Single QA'] | Lens: [40014] → Tgt Spa: ['0.350'] [Step 175 / Rank 7] Tasks: ['Single QA'] | Lens: [40014] → Tgt Spa: ['0.350'] [Step 175 / Rank 2] Tasks: ['Single QA'] | Lens: [54014] → Tgt Spa: ['0.350'] [Step 175 / Rank 5] Tasks: ['Single QA'] | Lens: [37510] → Tgt Spa: ['0.350'] [Step 175 / Rank 3] Tasks: ['Single QA'] | Lens: [54014] → Tgt Spa: ['0.350'] [Step 175 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [61189] → Tgt Spa: ['1.000'] [Step 175 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [61189] → Tgt Spa: ['1.000'] [Step 175 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [37866] → Tgt Spa: ['1.000'] [Step 175 / Rank 1] Tasks: ['Single QA'] | Lens: [39902] → Tgt Spa: ['0.350'] [Step 175 / Rank 7] Tasks: ['Code'] | Lens: [58748] → Tgt Spa: ['1.000'] [Step 175 / Rank 6] Tasks: ['Code'] | Lens: [58748] → Tgt Spa: ['1.000'] [Step 175 / Rank 0] Tasks: ['Single QA'] | Lens: [39902] → Tgt Spa: ['0.350'] [Step 175 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41554] → Tgt Spa: ['1.000'] [Step 175 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [37866] → Tgt Spa: ['1.000'] [Step 175 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41554] → Tgt Spa: ['1.000'] [Step 175 / Rank 7] Tasks: ['Single QA'] | Lens: [59074] → Tgt Spa: ['0.350'] [Step 175 / Rank 2] Tasks: ['Single QA'] | Lens: [61420] → Tgt Spa: ['0.350'] [Step 175 / Rank 6] Tasks: ['Single QA'] | Lens: [59074] → Tgt Spa: ['0.350'] [Step 175 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59207] → Tgt Spa: ['1.000'] [Step 175 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59207] → Tgt Spa: ['1.000'] [Step 175 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38672] 
→ Tgt Spa: ['1.000'] [Step 175 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [38672] → Tgt Spa: ['1.000'] [Step 175 / Rank 3] Tasks: ['Single QA'] | Lens: [61420] → Tgt Spa: ['0.350'] [Step 175 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17772, 17783, 17774] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 175 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [28258, 28262] → Tgt Spa: ['0.350', '0.350'] [Step 175 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [17654, 17654, 17655] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 175 / Rank 6] Tasks: ['Single QA'] | Lens: [65154] → Tgt Spa: ['0.350'] [Step 175 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17772, 17783, 17774] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 175 / Rank 7] Tasks: ['Single QA'] | Lens: [65154] → Tgt Spa: ['0.350'] [Step 175 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [17654, 17654, 17655] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 175 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [28258, 28262] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:35:45,042 >> @ 175 | Loss: 2.2138 | LM: 2.1667 | Reg: 0.0471 | Spa(Avg): 0.500 [INFO|lh_trainer.py:797] 2026-02-17 02:35:45,043 >> Statistic -> Code | Spa: 0.690 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 02:35:45,043 >> Statistic -> In-Context | Spa: 0.708 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:35:45,043 >> Statistic -> MultiHop | Spa: 0.613 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:35:45,043 >> Statistic -> Single | Spa: 0.355 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:35:45,043 >> Statistic -> Summarization | Spa: 0.688 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:810] 2026-02-17 02:35:45,045 >> [Micro-Log] {"loss": 2.2137891426682472, "lm_loss": 2.1666916074852147, "reg_loss": 0.0470975546243911, "model_sparsity(avg)": 0.5001929004987081, 
"Spa-In-Context Learning sparsity": 0.7083333390099662, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10351570163454328, "Spa-Single QA sparsity": 0.3553921475129969, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.00863926624879241, "Spa-Summarization sparsity": 0.6875, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.0907982736825943, "Spa-Code sparsity": 0.6898148059844971, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09509095301230748, "Spa-MultiHop QA sparsity": 0.6133333277702332, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10714610397815705, "step": 175, "current_tau": 1.0009512901306152, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.298828125, "lambda3 Summarization": 0.146484375, "lambda4 Code": 0.24609375} [INFO|lh_trainer.py:331] 2026-02-17 02:36:11,814 >> {'loss': 13.2827, 'grad_norm': 0.5713215470314026, 'learning_rate': 0.00026635080567245756, 'epoch': 0.18536071616640337, 'num_input_tokens_seen': 432988892, 'completed': '58.67% (176 / 300)', 'remaining time': '5:48:19', 'throughput': '6522.65', 'gpu_mem_free': '8745MB', 'step': 176} [Step 176 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10727, 10728, 10728, 10728, 10728, 10729] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 176 / Rank 6] Tasks: ['Single QA'] | Lens: [33641] → Tgt Spa: ['0.350'] [Step 176 / Rank 5] Tasks: ['Code'] | Lens: [58334] → Tgt Spa: ['1.000'] [Step 176 / Rank 4] Tasks: ['Code'] | Lens: [58334] → Tgt Spa: ['1.000'] [Step 176 / Rank 0] Tasks: ['Code'] | Lens: [35133] → Tgt Spa: ['1.000'] [Step 176 / Rank 1] Tasks: ['Code'] | Lens: [35133] → Tgt Spa: ['1.000'] [Step 176 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10727, 10728, 10728, 10728, 10728, 10729] → Tgt Spa: ['0.350', 
'0.350', '0.350', '0.350', '0.350', '0.350'] [Step 176 / Rank 7] Tasks: ['Single QA'] | Lens: [33641] → Tgt Spa: ['0.350'] [Step 176 / Rank 1] Tasks: ['Single QA'] | Lens: [57372] → Tgt Spa: ['0.350'] [Step 176 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA'] | Lens: [7978, 7978, 7978, 7978, 7978, 7981, 7981, 7979] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 176 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42493] → Tgt Spa: ['1.000'] [Step 176 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA'] | Lens: [7978, 7978, 7978, 7978, 7978, 7981, 7981, 7979] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 176 / Rank 4] Tasks: ['Single QA'] | Lens: [51314] → Tgt Spa: ['0.350'] [Step 176 / Rank 5] Tasks: ['Single QA'] | Lens: [51314] → Tgt Spa: ['0.350'] [Step 176 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42493] → Tgt Spa: ['1.000'] [Step 176 / Rank 0] Tasks: ['Single QA'] | Lens: [57372] → Tgt Spa: ['0.350'] [Step 176 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [25790, 25779] → Tgt Spa: ['1.000', '1.000'] [Step 176 / Rank 1] Tasks: ['Single QA'] | Lens: [60846] → Tgt Spa: ['0.350'] [Step 176 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22084, 22086] → Tgt Spa: ['1.000', '1.000'] [Step 176 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22084, 22086] → Tgt Spa: ['1.000', '1.000'] [Step 176 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [25790, 25779] → Tgt Spa: ['1.000', '1.000'] [Step 176 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [30623, 30623] → Tgt Spa: ['0.350', '0.350'] [Step 176 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [30623, 30623] → Tgt Spa: ['0.350', '0.350'] [Step 176 / Rank 0] Tasks: ['Single QA'] | Lens: [60846] → 
Tgt Spa: ['0.350'] [Step 176 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45267] → Tgt Spa: ['1.000'] [Step 176 / Rank 2] Tasks: ['Single QA'] | Lens: [44635] → Tgt Spa: ['0.350'] [Step 176 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [26896, 26904] → Tgt Spa: ['1.000', '1.000'] [Step 176 / Rank 6] Tasks: ['Single QA'] | Lens: [61800] → Tgt Spa: ['0.350'] [Step 176 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [26896, 26904] → Tgt Spa: ['1.000', '1.000'] [Step 176 / Rank 3] Tasks: ['Single QA'] | Lens: [44635] → Tgt Spa: ['0.350'] [Step 176 / Rank 7] Tasks: ['Single QA'] | Lens: [61800] → Tgt Spa: ['0.350'] [Step 176 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45267] → Tgt Spa: ['1.000'] [Step 176 / Rank 4] Tasks: ['Code'] | Lens: [52550] → Tgt Spa: ['1.000'] [Step 176 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [23635, 23642] → Tgt Spa: ['1.000', '1.000'] [Step 176 / Rank 7] Tasks: ['Code'] | Lens: [44948] → Tgt Spa: ['1.000'] [Step 176 / Rank 5] Tasks: ['Code'] | Lens: [52550] → Tgt Spa: ['1.000'] [Step 176 / Rank 3] Tasks: ['Single QA'] | Lens: [56359] → Tgt Spa: ['0.350'] [Step 176 / Rank 6] Tasks: ['Code'] | Lens: [44948] → Tgt Spa: ['1.000'] [Step 176 / Rank 2] Tasks: ['Single QA'] | Lens: [56359] → Tgt Spa: ['0.350'] [Step 176 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [23635, 23642] → Tgt Spa: ['1.000', '1.000'] [Step 176 / Rank 6] Tasks: ['Single QA'] | Lens: [35841] → Tgt Spa: ['0.350'] [Step 176 / Rank 7] Tasks: ['Single QA'] | Lens: [35841] → Tgt Spa: ['0.350'] [Step 176 / Rank 3] Tasks: ['Single QA'] | Lens: [42433] → Tgt Spa: ['0.350'] [Step 176 / Rank 5] Tasks: ['Single QA'] | Lens: [56728] → Tgt Spa: ['0.350'] [Step 176 / Rank 1] Tasks: ['Single QA'] | Lens: [41151] → Tgt Spa: ['0.350'] [Step 176 / Rank 4] Tasks: ['Single QA'] | Lens: [56728] → Tgt Spa: ['0.350'] [Step 176 / Rank 2] Tasks: ['Single QA'] | Lens: [42433] → Tgt Spa: ['0.350'] [Step 176 / Rank 0] Tasks: ['Single QA'] | Lens: 
[41151] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:38:47,997 >> @ 176 | Loss: 1.8984 | LM: 1.8449 | Reg: 0.0535 | Spa(Avg): 0.501 [INFO|lh_trainer.py:797] 2026-02-17 02:38:47,997 >> Statistic -> Code | Spa: 0.663 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:797] 2026-02-17 02:38:47,997 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:38:47,997 >> Statistic -> MultiHop | Spa: 0.431 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:38:47,998 >> Statistic -> Single | Spa: 0.384 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:38:47,998 >> Statistic -> Summarization | Spa: 0.611 | Tgt: 1.000 | Z-Loss: 0.127 | [INFO|lh_trainer.py:810] 2026-02-17 02:38:47,999 >> [Micro-Log] {"loss": 1.8983662519603968, "lm_loss": 1.8448923798277974, "reg_loss": 0.05347388041748976, "model_sparsity(avg)": 0.5010609527428945, "Spa-Code sparsity": 0.6626984221594674, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10598587776933398, "Spa-Single QA sparsity": 0.384259249482836, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.028139222623957766, "Spa-In-Context Learning sparsity": 0.7152777711550394, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10083219905694325, "Spa-MultiHop QA sparsity": 0.4305555323759715, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.026799437279502552, "Spa-Summarization sparsity": 0.6111111044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12676021456718445, "step": 176, "current_tau": 1.0006089210510254, "lambda1 Single QA": 0.57421875, "lambda2 MultiHop QA": 0.298828125, "lambda3 Summarization": 0.1474609375, "lambda4 Code": 0.2470703125} [INFO|lh_trainer.py:331] 2026-02-17 02:39:09,473 >> {'loss': 11.3902, 'grad_norm': 0.5876232385635376, 'learning_rate': 0.00026308401275233707, 'epoch': 
0.18641390205371247, 'num_input_tokens_seen': 435403104, 'completed': '59.00% (177 / 300)', 'remaining time': '5:45:37', 'throughput': '6794.51', 'gpu_mem_free': '12975MB', 'step': 177} [Step 177 / Rank 4] Tasks: ['Single QA'] | Lens: [55082] → Tgt Spa: ['0.350'] [Step 177 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57482] → Tgt Spa: ['1.000'] [Step 177 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57482] → Tgt Spa: ['1.000'] [Step 177 / Rank 7] Tasks: ['Single QA'] | Lens: [61998] → Tgt Spa: ['0.350'] [Step 177 / Rank 5] Tasks: ['Single QA'] | Lens: [55082] → Tgt Spa: ['0.350'] [Step 177 / Rank 6] Tasks: ['Single QA'] | Lens: [61998] → Tgt Spa: ['0.350'] [Step 177 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [45445] → Tgt Spa: ['1.000'] [Step 177 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [45445] → Tgt Spa: ['1.000'] [Step 177 / Rank 6] Tasks: ['Code'] | Lens: [46651] → Tgt Spa: ['1.000'] [Step 177 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17494, 17506, 17496] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 177 / Rank 7] Tasks: ['Code'] | Lens: [46651] → Tgt Spa: ['1.000'] [Step 177 / Rank 4] Tasks: ['Single QA'] | Lens: [49558] → Tgt Spa: ['0.350'] [Step 177 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [30074, 30073] → Tgt Spa: ['1.000', '1.000'] [Step 177 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [30074, 30073] → Tgt Spa: ['1.000', '1.000'] [Step 177 / Rank 0] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17494, 17506, 17496] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 177 / Rank 5] Tasks: ['Single QA'] | Lens: [49558] → Tgt Spa: ['0.350'] [Step 177 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25421, 25421] → Tgt Spa: ['0.350', '1.000'] [Step 177 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [28490, 28483] → Tgt Spa: ['1.000', '1.000'] [Step 177 / Rank 7] Tasks: ['Code', 'Single QA', 'Single QA', 'Code', 'Code', 'Single QA'] | Lens: 
[9632, 9626, 9627, 9640, 9645, 9640] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 177 / Rank 3] Tasks: ['Single QA'] | Lens: [34598] → Tgt Spa: ['0.350'] [Step 177 / Rank 6] Tasks: ['Code', 'Single QA', 'Single QA', 'Code', 'Code', 'Single QA'] | Lens: [9632, 9626, 9627, 9640, 9645, 9640] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 177 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25421, 25421] → Tgt Spa: ['0.350', '1.000'] [Step 177 / Rank 2] Tasks: ['Single QA'] | Lens: [34598] → Tgt Spa: ['0.350'] [Step 177 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [28490, 28483] → Tgt Spa: ['1.000', '1.000'] [Step 177 / Rank 2] Tasks: ['Single QA'] | Lens: [65352] → Tgt Spa: ['0.350'] [Step 177 / Rank 7] Tasks: ['Single QA'] | Lens: [38979] → Tgt Spa: ['0.350'] [Step 177 / Rank 4] Tasks: ['Single QA'] | Lens: [39556] → Tgt Spa: ['0.350'] [Step 177 / Rank 6] Tasks: ['Single QA'] | Lens: [38979] → Tgt Spa: ['0.350'] [Step 177 / Rank 3] Tasks: ['Single QA'] | Lens: [65352] → Tgt Spa: ['0.350'] [Step 177 / Rank 5] Tasks: ['Single QA'] | Lens: [39556] → Tgt Spa: ['0.350'] [Step 177 / Rank 0] Tasks: ['Single QA'] | Lens: [51698] → Tgt Spa: ['0.350'] [Step 177 / Rank 1] Tasks: ['Single QA'] | Lens: [51698] → Tgt Spa: ['0.350'] [Step 177 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [29879, 29886] → Tgt Spa: ['0.350', '1.000'] [Step 177 / Rank 6] Tasks: ['Single QA'] | Lens: [53604] → Tgt Spa: ['0.350'] [Step 177 / Rank 3] Tasks: ['Single QA'] | Lens: [36478] → Tgt Spa: ['0.350'] [Step 177 / Rank 4] Tasks: ['Single QA'] | Lens: [51527] → Tgt Spa: ['0.350'] [Step 177 / Rank 5] Tasks: ['Single QA'] | Lens: [51527] → Tgt Spa: ['0.350'] [Step 177 / Rank 7] Tasks: ['Single QA'] | Lens: [53604] → Tgt Spa: ['0.350'] [Step 177 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [29879, 29886] → Tgt Spa: ['0.350', '1.000'] [Step 177 / Rank 2] Tasks: ['Single QA'] | Lens: [36478] → Tgt Spa: ['0.350'] [Step 177 / 
Rank 4] Tasks: ['Single QA'] | Lens: [58000] → Tgt Spa: ['0.350'] [Step 177 / Rank 6] Tasks: ['Single QA'] | Lens: [53175] → Tgt Spa: ['0.350'] [Step 177 / Rank 2] Tasks: ['MultiHop QA', 'Code'] | Lens: [30134, 30145] → Tgt Spa: ['0.350', '1.000'] [Step 177 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24028, 24029] → Tgt Spa: ['1.000', '1.000'] [Step 177 / Rank 5] Tasks: ['Single QA'] | Lens: [58000] → Tgt Spa: ['0.350'] [Step 177 / Rank 7] Tasks: ['Single QA'] | Lens: [53175] → Tgt Spa: ['0.350'] [Step 177 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24028, 24029] → Tgt Spa: ['1.000', '1.000'] [Step 177 / Rank 3] Tasks: ['MultiHop QA', 'Code'] | Lens: [30134, 30145] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:41:33,021 >> @ 177 | Loss: 2.1594 | LM: 2.1114 | Reg: 0.0480 | Spa(Avg): 0.505 [INFO|lh_trainer.py:797] 2026-02-17 02:41:33,021 >> Statistic -> Code | Spa: 0.704 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 02:41:33,021 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:41:33,021 >> Statistic -> MultiHop | Spa: 0.597 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:41:33,021 >> Statistic -> Single | Spa: 0.410 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:41:33,021 >> Statistic -> Summarization | Spa: 0.722 | Tgt: 1.000 | Z-Loss: 0.076 | [INFO|lh_trainer.py:810] 2026-02-17 02:41:33,023 >> [Micro-Log] {"loss": 2.1593626153965793, "lm_loss": 2.111401150623957, "reg_loss": 0.04796148929744959, "model_sparsity(avg)": 0.5051118756333987, "Spa-In-Context Learning sparsity": 0.7152777761220932, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1008699219673872, "Spa-Code sparsity": 0.7037037081188626, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09052024119430119, "Spa-Summarization sparsity": 0.7222222089767456, 
"Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.07637684047222137, "Spa-Single QA sparsity": 0.4097222155994839, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.039759297415407166, "Spa-MultiHop QA sparsity": 0.5972222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09866684675216675, "step": 177, "current_tau": 1.000342607498169, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.30078125, "lambda3 Summarization": 0.1474609375, "lambda4 Code": 0.2470703125} [INFO|lh_trainer.py:331] 2026-02-17 02:41:55,235 >> {'loss': 12.9562, 'grad_norm': 0.49001753330230713, 'learning_rate': 0.00025981497795827174, 'epoch': 0.1874670879410216, 'num_input_tokens_seen': 437894208, 'completed': '59.33% (178 / 300)', 'remaining time': '5:42:47', 'throughput': '7514.10', 'gpu_mem_free': '11779MB', 'step': 178} [Step 178 / Rank 0] Tasks: ['Single QA'] | Lens: [51100] → Tgt Spa: ['0.350'] [Step 178 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'Code'] | Lens: [19342, 19337, 19344] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 178 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Summarization', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5207, 5207, 5207, 5207, 5217, 5215, 5228, 5228, 5217, 5211, 5212, 5213] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 178 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'Code'] | Lens: [19342, 19337, 19344] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 178 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'Summarization', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5207, 5207, 5207, 5207, 5217, 5215, 
5228, 5228, 5217, 5211, 5212, 5213] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 178 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [22368, 22375] → Tgt Spa: ['1.000', '1.000'] [Step 178 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [22368, 22375] → Tgt Spa: ['1.000', '1.000'] [Step 178 / Rank 1] Tasks: ['Single QA'] | Lens: [51100] → Tgt Spa: ['0.350'] [Step 178 / Rank 3] Tasks: ['Single QA'] | Lens: [58416] → Tgt Spa: ['0.350'] [Step 178 / Rank 4] Tasks: ['Single QA'] | Lens: [52413] → Tgt Spa: ['0.350'] [Step 178 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17984, 17974, 17977] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 178 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [30618, 30621] → Tgt Spa: ['1.000', '1.000'] [Step 178 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17984, 17974, 17977] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 178 / Rank 5] Tasks: ['Single QA'] | Lens: [52413] → Tgt Spa: ['0.350'] [Step 178 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [30618, 30621] → Tgt Spa: ['1.000', '1.000'] [Step 178 / Rank 2] Tasks: ['Single QA'] | Lens: [58416] → Tgt Spa: ['0.350'] [Step 178 / Rank 1] Tasks: ['Single QA'] | Lens: [51140] → Tgt Spa: ['0.350'] [Step 178 / Rank 4] Tasks: ['Single QA'] | Lens: [51279] → Tgt Spa: ['0.350'] [Step 178 / Rank 0] Tasks: ['Single QA'] | Lens: [51140] → Tgt Spa: ['0.350'] [Step 178 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [37538] → Tgt Spa: ['1.000'] [Step 178 / Rank 6] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [7421, 7419, 7414, 7415, 7423, 7422, 7416, 7418] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 178 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [37538] → Tgt Spa: ['1.000'] [Step 178 / Rank 7] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] 
| Lens: [7421, 7419, 7414, 7415, 7423, 7422, 7416, 7418] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 178 / Rank 5] Tasks: ['Single QA'] | Lens: [51279] → Tgt Spa: ['0.350'] [Step 178 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA'] | Lens: [17659, 17665, 17660] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 178 / Rank 3] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Code', 'Code'] | Lens: [3101, 3101, 3103, 3102, 3103, 3105, 3103, 3102, 3104, 3119, 3121, 3121, 3107, 3111, 3105, 3107, 3106, 3108, 3109, 3116, 3115] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 178 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [37304] → Tgt Spa: ['1.000'] [Step 178 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24895, 24895] → Tgt Spa: ['1.000', '1.000'] [Step 178 / Rank 2] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Code', 'Code'] | Lens: [3101, 3101, 3103, 3102, 3103, 3105, 3103, 3102, 3104, 3119, 3121, 3121, 3107, 3111, 3105, 3107, 3106, 3108, 3109, 3116, 3115] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 178 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [37304] → Tgt Spa: ['1.000'] [Step 
178 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA'] | Lens: [17659, 17665, 17660] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 178 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24895, 24895] → Tgt Spa: ['1.000', '1.000'] [Step 178 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23115, 23133] → Tgt Spa: ['1.000', '1.000'] [Step 178 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23115, 23133] → Tgt Spa: ['1.000', '1.000'] [Step 178 / Rank 4] Tasks: ['Single QA'] | Lens: [39650] → Tgt Spa: ['0.350'] [Step 178 / Rank 5] Tasks: ['Single QA'] | Lens: [39650] → Tgt Spa: ['0.350'] [Step 178 / Rank 6] Tasks: ['Code'] | Lens: [34802] → Tgt Spa: ['1.000'] [Step 178 / Rank 2] Tasks: ['Single QA'] | Lens: [45609] → Tgt Spa: ['0.350'] [Step 178 / Rank 7] Tasks: ['Code'] | Lens: [34802] → Tgt Spa: ['1.000'] [Step 178 / Rank 3] Tasks: ['Single QA'] | Lens: [45609] → Tgt Spa: ['0.350'] [Step 178 / Rank 5] Tasks: ['Single QA'] | Lens: [39797] → Tgt Spa: ['0.350'] [Step 178 / Rank 6] Tasks: ['Single QA'] | Lens: [51241] → Tgt Spa: ['0.350'] [Step 178 / Rank 1] Tasks: ['Code', 'Summarization', 'In-Context Learning'] | Lens: [21343, 21356, 21338] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 178 / Rank 4] Tasks: ['Single QA'] | Lens: [39797] → Tgt Spa: ['0.350'] [Step 178 / Rank 3] Tasks: ['In-Context Learning', 'MultiHop QA', 'Code', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2950, 2951, 2958, 2959, 2953, 2952, 2954, 2953, 2970, 2954, 2953, 2954, 2972, 2961, 2955, 2957, 2974, 2958, 2959, 2958, 2959, 2959] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', 
'0.350', '0.350', '0.350'] [Step 178 / Rank 7] Tasks: ['Single QA'] | Lens: [51241] → Tgt Spa: ['0.350'] [Step 178 / Rank 2] Tasks: ['In-Context Learning', 'MultiHop QA', 'Code', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2950, 2951, 2958, 2959, 2953, 2952, 2954, 2953, 2970, 2954, 2953, 2954, 2972, 2961, 2955, 2957, 2974, 2958, 2959, 2958, 2959, 2959] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 178 / Rank 0] Tasks: ['Code', 'Summarization', 'In-Context Learning'] | Lens: [21343, 21356, 21338] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:43:56,957 >> @ 178 | Loss: 2.0123 | LM: 1.9466 | Reg: 0.0658 | Spa(Avg): 0.555 [INFO|lh_trainer.py:797] 2026-02-17 02:43:56,957 >> Statistic -> Code | Spa: 0.706 | Tgt: 1.000 | Z-Loss: 0.090 | [INFO|lh_trainer.py:797] 2026-02-17 02:43:56,957 >> Statistic -> In-Context | Spa: 0.711 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:43:56,957 >> Statistic -> MultiHop | Spa: 0.647 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:43:56,957 >> Statistic -> Single | Spa: 0.471 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:43:56,957 >> Statistic -> Summarization | Spa: 0.634 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:810] 2026-02-17 02:43:56,960 >> [Micro-Log] {"loss": 2.0123405841489634, "lm_loss": 1.9465523672600586, "reg_loss": 0.06578823665040545, "model_sparsity(avg)": 0.5553645305335522, "Spa-Single QA sparsity": 0.47101448929828144, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0818999603026263, 
"Spa-Code sparsity": 0.7059178715166838, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08990855223458746, "Spa-In-Context Learning sparsity": 0.710648152563307, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10272260258595149, "Spa-Summarization sparsity": 0.6338383772156455, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11913103678009727, "Spa-MultiHop QA sparsity": 0.6472222179174423, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12549941688776017, "step": 178, "current_tau": 1.000152349472046, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.30078125, "lambda3 Summarization": 0.1484375, "lambda4 Code": 0.248046875} [INFO|lh_trainer.py:331] 2026-02-17 02:44:15,432 >> {'loss': 12.074, 'grad_norm': 0.6338276267051697, 'learning_rate': 0.0002565442614225446, 'epoch': 0.1885202738283307, 'num_input_tokens_seen': 440361302, 'completed': '59.67% (179 / 300)', 'remaining time': '5:39:39', 'throughput': '8798.67', 'gpu_mem_free': '6741MB', 'step': 179} [Step 179 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [36921] → Tgt Spa: ['1.000'] [Step 179 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [24664, 24657] → Tgt Spa: ['1.000', '0.350'] [Step 179 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40786] → Tgt Spa: ['1.000'] [Step 179 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [24664, 24657] → Tgt Spa: ['1.000', '0.350'] [Step 179 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40786] → Tgt Spa: ['1.000'] [Step 179 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [36921] → Tgt Spa: ['1.000'] [Step 179 / Rank 7] Tasks: ['Single QA'] | Lens: [42347] → Tgt Spa: ['0.350'] [Step 179 / Rank 6] Tasks: ['Single QA'] | Lens: [42347] → Tgt Spa: ['0.350'] [Step 179 / Rank 7] Tasks: ['Single QA'] | Lens: [46644] → Tgt Spa: ['0.350'] [Step 179 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32134, 32134] → Tgt Spa: ['0.350', '0.350'] [Step 179 / Rank 
4] Tasks: ['Single QA', 'Single QA'] | Lens: [31821, 31821] → Tgt Spa: ['0.350', '0.350'] [Step 179 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [31821, 31821] → Tgt Spa: ['0.350', '0.350'] [Step 179 / Rank 1] Tasks: ['Single QA'] | Lens: [58858] → Tgt Spa: ['0.350'] [Step 179 / Rank 6] Tasks: ['Single QA'] | Lens: [46644] → Tgt Spa: ['0.350'] [Step 179 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32134, 32134] → Tgt Spa: ['0.350', '0.350'] [Step 179 / Rank 0] Tasks: ['Single QA'] | Lens: [58858] → Tgt Spa: ['0.350'] [Step 179 / Rank 1] Tasks: ['Single QA'] | Lens: [34220] → Tgt Spa: ['0.350'] [Step 179 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42871] → Tgt Spa: ['1.000'] [Step 179 / Rank 0] Tasks: ['Single QA'] | Lens: [34220] → Tgt Spa: ['0.350'] [Step 179 / Rank 5] Tasks: ['Single QA'] | Lens: [53361] → Tgt Spa: ['0.350'] [Step 179 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42871] → Tgt Spa: ['1.000'] [Step 179 / Rank 4] Tasks: ['Single QA'] | Lens: [53361] → Tgt Spa: ['0.350'] [Step 179 / Rank 3] Tasks: ['Code'] | Lens: [42243] → Tgt Spa: ['1.000'] [Step 179 / Rank 2] Tasks: ['Code'] | Lens: [42243] → Tgt Spa: ['1.000'] [Step 179 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [25080, 25089] → Tgt Spa: ['1.000', '1.000'] [Step 179 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [25080, 25089] → Tgt Spa: ['1.000', '1.000'] [Step 179 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [22137, 22141] → Tgt Spa: ['1.000', '1.000'] [Step 179 / Rank 4] Tasks: ['Single QA'] | Lens: [39386] → Tgt Spa: ['0.350'] [Step 179 / Rank 6] Tasks: ['In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'In-Context Learning', 'Code', 'In-Context Learning', 'Code'] | Lens: [3896, 3897, 3897, 3898, 3898, 3898, 3900, 3899, 3899, 3900, 3918, 3918, 3900, 3909, 3901, 
3910] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 179 / Rank 5] Tasks: ['Single QA'] | Lens: [39386] → Tgt Spa: ['0.350'] [Step 179 / Rank 7] Tasks: ['In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Summarization', 'In-Context Learning', 'Code', 'In-Context Learning', 'Code'] | Lens: [3896, 3897, 3897, 3898, 3898, 3898, 3900, 3899, 3899, 3900, 3918, 3918, 3900, 3909, 3901, 3910] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 179 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [22137, 22141] → Tgt Spa: ['1.000', '1.000'] [Step 179 / Rank 4] Tasks: ['Code'] | Lens: [59972] → Tgt Spa: ['1.000'] [Step 179 / Rank 1] Tasks: ['Code'] | Lens: [38412] → Tgt Spa: ['1.000'] [Step 179 / Rank 5] Tasks: ['Code'] | Lens: [59972] → Tgt Spa: ['1.000'] [Step 179 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [32133, 32141] → Tgt Spa: ['0.350', '1.000'] [Step 179 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [32133, 32141] → Tgt Spa: ['0.350', '1.000'] [Step 179 / Rank 3] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning'] | Lens: [21474, 21456, 21456] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 179 / Rank 0] Tasks: ['Code'] | Lens: [38412] → Tgt Spa: ['1.000'] [Step 179 / Rank 2] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning'] | Lens: [21474, 21456, 21456] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 179 / Rank 7] Tasks: ['Code'] | Lens: [38957] → Tgt Spa: ['1.000'] [Step 179 / Rank 1] Tasks: ['Single QA'] | Lens: [51351] → Tgt Spa: ['0.350'] [Step 179 / Rank 0] Tasks: ['Single QA'] | Lens: [51351] → Tgt Spa: ['0.350'] [Step 179 / Rank 5] Tasks: 
['Single QA'] | Lens: [64078] → Tgt Spa: ['0.350'] [Step 179 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [26102, 26102] → Tgt Spa: ['0.350', '0.350'] [Step 179 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [26102, 26102] → Tgt Spa: ['0.350', '0.350'] [Step 179 / Rank 6] Tasks: ['Code'] | Lens: [38957] → Tgt Spa: ['1.000'] [Step 179 / Rank 4] Tasks: ['Single QA'] | Lens: [64078] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:46:23,702 >> @ 179 | Loss: 1.9161 | LM: 1.8540 | Reg: 0.0622 | Spa(Avg): 0.546 [INFO|lh_trainer.py:797] 2026-02-17 02:46:23,703 >> Statistic -> Code | Spa: 0.693 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 02:46:23,703 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:46:23,703 >> Statistic -> MultiHop | Spa: 0.625 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:46:23,703 >> Statistic -> Single | Spa: 0.411 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:46:23,703 >> Statistic -> Summarization | Spa: 0.639 | Tgt: 1.000 | Z-Loss: 0.114 | [INFO|lh_trainer.py:810] 2026-02-17 02:46:23,706 >> [Micro-Log] {"loss": 1.9161432459950447, "lm_loss": 1.8539705493797858, "reg_loss": 0.062172708606037, "model_sparsity(avg)": 0.5462601222097874, "Spa-Code sparsity": 0.6931818127632141, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09484128789468245, "Spa-Single QA sparsity": 0.4109477015102611, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0423884383019279, "Spa-In-Context Learning sparsity": 0.7175925970077515, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10059516380230586, "Spa-Summarization sparsity": 0.6388888955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11412418136994044, "Spa-MultiHop QA sparsity": 0.625, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA 
log_z_loss": 0.11305800825357437, "step": 179, "current_tau": 1.0000380277633667, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.30078125, "lambda3 Summarization": 0.1484375, "lambda4 Code": 0.248046875} [INFO|lh_trainer.py:331] 2026-02-17 02:46:49,686 >> {'loss': 11.4969, 'grad_norm': 0.6284471154212952, 'learning_rate': 0.0002532724235655962, 'epoch': 0.1895734597156398, 'num_input_tokens_seen': 442772076, 'completed': '60.00% (180 / 300)', 'remaining time': '5:36:41', 'throughput': '7814.32', 'gpu_mem_free': '8369MB', 'step': 180} [Step 180 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25433, 25434] → Tgt Spa: ['1.000', '0.350'] [Step 180 / Rank 3] Tasks: ['Single QA'] | Lens: [35643] → Tgt Spa: ['0.350'] [Step 180 / Rank 0] Tasks: ['Code'] | Lens: [53679] → Tgt Spa: ['1.000'] [Step 180 / Rank 6] Tasks: ['Single QA'] | Lens: [54776] → Tgt Spa: ['0.350'] [Step 180 / Rank 1] Tasks: ['Code'] | Lens: [53679] → Tgt Spa: ['1.000'] [Step 180 / Rank 7] Tasks: ['Single QA'] | Lens: [54776] → Tgt Spa: ['0.350'] [Step 180 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25433, 25434] → Tgt Spa: ['1.000', '0.350'] [Step 180 / Rank 2] Tasks: ['Single QA'] | Lens: [35643] → Tgt Spa: ['0.350'] [Step 180 / Rank 2] Tasks: ['Single QA'] | Lens: [41943] → Tgt Spa: ['0.350'] [Step 180 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26975, 26976] → Tgt Spa: ['1.000', '1.000'] [Step 180 / Rank 5] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7725, 7735, 7728, 7729, 7730, 7730, 7730, 7731] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 180 / Rank 7] Tasks: ['Single QA'] | Lens: [49355] → Tgt Spa: ['0.350'] [Step 180 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26975, 26976] → Tgt Spa: ['1.000', '1.000'] [Step 180 / Rank 4] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 
'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7725, 7735, 7728, 7729, 7730, 7730, 7730, 7731] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 180 / Rank 6] Tasks: ['Single QA'] | Lens: [49355] → Tgt Spa: ['0.350'] [Step 180 / Rank 3] Tasks: ['Single QA'] | Lens: [41943] → Tgt Spa: ['0.350'] [Step 180 / Rank 1] Tasks: ['Single QA'] | Lens: [56533] → Tgt Spa: ['0.350'] [Step 180 / Rank 4] Tasks: ['Single QA'] | Lens: [45920] → Tgt Spa: ['0.350'] [Step 180 / Rank 7] Tasks: ['Code'] | Lens: [37428] → Tgt Spa: ['1.000'] [Step 180 / Rank 2] Tasks: ['Single QA'] | Lens: [41734] → Tgt Spa: ['0.350'] [Step 180 / Rank 0] Tasks: ['Single QA'] | Lens: [56533] → Tgt Spa: ['0.350'] [Step 180 / Rank 6] Tasks: ['Code'] | Lens: [37428] → Tgt Spa: ['1.000'] [Step 180 / Rank 5] Tasks: ['Single QA'] | Lens: [45920] → Tgt Spa: ['0.350'] [Step 180 / Rank 3] Tasks: ['Single QA'] | Lens: [41734] → Tgt Spa: ['0.350'] [Step 180 / Rank 4] Tasks: ['Single QA'] | Lens: [39995] → Tgt Spa: ['0.350'] [Step 180 / Rank 3] Tasks: ['Single QA'] | Lens: [33992] → Tgt Spa: ['0.350'] [Step 180 / Rank 2] Tasks: ['Single QA'] | Lens: [33992] → Tgt Spa: ['0.350'] [Step 180 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning'] | Lens: [4158, 4158, 4178, 4159, 4159, 4160, 4160, 4178, 4178, 4160, 4160, 4161, 4169, 4168, 4162] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 180 / Rank 6] Tasks: ['Single QA'] | Lens: [58271] → Tgt Spa: ['0.350'] [Step 180 / Rank 7] Tasks: ['Single QA'] | Lens: [58271] → Tgt Spa: ['0.350'] [Step 180 / Rank 5] Tasks: ['Single QA'] | Lens: [39995] → Tgt Spa: ['0.350'] [Step 180 / Rank 
0] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning'] | Lens: [4158, 4158, 4178, 4159, 4159, 4160, 4160, 4178, 4178, 4160, 4160, 4161, 4169, 4168, 4162] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 180 / Rank 4] Tasks: ['Single QA'] | Lens: [54166] → Tgt Spa: ['0.350'] [Step 180 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [47083] → Tgt Spa: ['1.000'] [Step 180 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8214, 8214, 8214, 8214, 8215, 8215, 8215] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 180 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [47083] → Tgt Spa: ['1.000'] [Step 180 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8214, 8214, 8214, 8214, 8215, 8215, 8215] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 180 / Rank 7] Tasks: ['Single QA'] | Lens: [55155] → Tgt Spa: ['0.350'] [Step 180 / Rank 5] Tasks: ['Single QA'] | Lens: [54166] → Tgt Spa: ['0.350'] [Step 180 / Rank 6] Tasks: ['Single QA'] | Lens: [55155] → Tgt Spa: ['0.350'] [Step 180 / Rank 2] Tasks: ['Code'] | Lens: [33987] → Tgt Spa: ['1.000'] [Step 180 / Rank 6] Tasks: ['Code'] | Lens: [53392] → Tgt Spa: ['1.000'] [Step 180 / Rank 0] Tasks: ['Single QA'] | Lens: [53945] → Tgt Spa: ['0.350'] [Step 180 / Rank 1] Tasks: ['Single QA'] | Lens: [53945] → Tgt Spa: ['0.350'] [Step 180 / Rank 7] Tasks: ['Code'] | Lens: [53392] → Tgt Spa: ['1.000'] [Step 180 / Rank 4] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17173, 17165, 17178] → Tgt Spa: 
['1.000', '1.000', '1.000'] [Step 180 / Rank 5] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17173, 17165, 17178] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 180 / Rank 3] Tasks: ['Code'] | Lens: [33987] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 02:49:09,986 >> @ 180 | Loss: 2.0219 | LM: 1.9741 | Reg: 0.0478 | Spa(Avg): 0.490 [INFO|lh_trainer.py:797] 2026-02-17 02:49:09,986 >> Statistic -> Code | Spa: 0.703 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 02:49:09,986 >> Statistic -> In-Context | Spa: 0.714 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:49:09,986 >> Statistic -> MultiHop | Spa: 0.625 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:49:09,986 >> Statistic -> Single | Spa: 0.420 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:49:09,986 >> Statistic -> Summarization | Spa: 0.636 | Tgt: 1.000 | Z-Loss: 0.116 | [INFO|lh_trainer.py:810] 2026-02-17 02:49:09,988 >> [Micro-Log] {"loss": 2.0219071519871554, "lm_loss": 1.9741332546497385, "reg_loss": 0.04777390282833949, "model_sparsity(avg)": 0.4900876296063264, "Spa-Code sparsity": 0.703125, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09103287383913994, "Spa-In-Context Learning sparsity": 0.7141203780968984, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1019954476505518, "Spa-Single QA sparsity": 0.42037036220232643, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05024314867332578, "Spa-Summarization sparsity": 0.6361111283302308, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11605992168188095, "Spa-MultiHop QA sparsity": 0.625, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11305800825357437, "step": 180, "current_tau": 1.0, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.30078125, "lambda3 Summarization": 0.1484375, "lambda4 Code": 
0.248046875} [INFO|lh_trainer.py:331] 2026-02-17 02:49:29,992 >> {'loss': 12.1314, 'grad_norm': 0.47973552346229553, 'learning_rate': 0.000250000025, 'epoch': 0.19062664560294892, 'num_input_tokens_seen': 445142352, 'completed': '60.33% (181 / 300)', 'remaining time': '5:33:47', 'throughput': '7392.97', 'gpu_mem_free': '8493MB', 'step': 181} [Step 181 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8295, 8294, 8298, 8299, 8301, 8302, 8302] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 181 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8295, 8294, 8298, 8299, 8301, 8302, 8302] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 181 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [26629, 26623] → Tgt Spa: ['1.000', '1.000'] [Step 181 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55801] → Tgt Spa: ['1.000'] [Step 181 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [26629, 26623] → Tgt Spa: ['1.000', '1.000'] [Step 181 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55801] → Tgt Spa: ['1.000'] [Step 181 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25321, 25322] → Tgt Spa: ['1.000', '0.350'] [Step 181 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25321, 25322] → Tgt Spa: ['1.000', '0.350'] [Step 181 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [29577, 29561] → Tgt Spa: ['1.000', '1.000'] [Step 181 / Rank 7] Tasks: ['Single QA'] | Lens: [58663] → Tgt Spa: ['0.350'] [Step 181 / Rank 0] Tasks: ['Code'] | Lens: [48098] → Tgt Spa: ['1.000'] [Step 181 / Rank 2] Tasks: ['Single QA', 'Code', 'Code', 'Single QA'] | Lens: [15522, 15549, 15551, 15547] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 181 / Rank 6] Tasks: ['Single QA'] | Lens: [58663] → Tgt Spa: ['0.350'] [Step 181 / Rank 3] Tasks: ['Single 
QA', 'Code', 'Code', 'Single QA'] | Lens: [15522, 15549, 15551, 15547] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 181 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [29577, 29561] → Tgt Spa: ['1.000', '1.000'] [Step 181 / Rank 1] Tasks: ['Code'] | Lens: [48098] → Tgt Spa: ['1.000'] [Step 181 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22847, 22852] → Tgt Spa: ['0.350', '1.000'] [Step 181 / Rank 4] Tasks: ['Single QA'] | Lens: [36322] → Tgt Spa: ['0.350'] [Step 181 / Rank 5] Tasks: ['Single QA'] | Lens: [36322] → Tgt Spa: ['0.350'] [Step 181 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32586, 32586] → Tgt Spa: ['0.350', '0.350'] [Step 181 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [27249, 27249] → Tgt Spa: ['0.350', '0.350'] [Step 181 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [27249, 27249] → Tgt Spa: ['0.350', '0.350'] [Step 181 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32586, 32586] → Tgt Spa: ['0.350', '0.350'] [Step 181 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [22847, 22852] → Tgt Spa: ['0.350', '1.000'] [Step 181 / Rank 4] Tasks: ['Single QA'] | Lens: [49456] → Tgt Spa: ['0.350'] [Step 181 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [26252, 26252] → Tgt Spa: ['0.350', '0.350'] [Step 181 / Rank 2] Tasks: ['Code', 'Single QA'] | Lens: [31246, 31238] → Tgt Spa: ['1.000', '0.350'] [Step 181 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [26252, 26252] → Tgt Spa: ['0.350', '0.350'] [Step 181 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43751] → Tgt Spa: ['1.000'] [Step 181 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43751] → Tgt Spa: ['1.000'] [Step 181 / Rank 5] Tasks: ['Single QA'] | Lens: [49456] → Tgt Spa: ['0.350'] [Step 181 / Rank 3] Tasks: ['Code', 'Single QA'] | Lens: [31246, 31238] → Tgt Spa: ['1.000', '0.350'] [Step 181 / Rank 6] Tasks: ['Single QA'] | Lens: [49876] → Tgt Spa: ['0.350'] [Step 181 / Rank 1] Tasks: ['Single QA'] 
| Lens: [44042] → Tgt Spa: ['0.350'] [Step 181 / Rank 7] Tasks: ['Single QA'] | Lens: [49876] → Tgt Spa: ['0.350'] [Step 181 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [24977, 24972] → Tgt Spa: ['1.000', '1.000'] [Step 181 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17099, 17089, 17089] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 181 / Rank 0] Tasks: ['Single QA'] | Lens: [44042] → Tgt Spa: ['0.350'] [Step 181 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17099, 17089, 17089] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 181 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [24977, 24972] → Tgt Spa: ['1.000', '1.000'] [Step 181 / Rank 1] Tasks: ['Code'] | Lens: [41577] → Tgt Spa: ['1.000'] [Step 181 / Rank 0] Tasks: ['Code'] | Lens: [41577] → Tgt Spa: ['1.000'] [Step 181 / Rank 5] Tasks: ['Single QA'] | Lens: [55767] → Tgt Spa: ['0.350'] [Step 181 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32139, 32139] → Tgt Spa: ['0.350', '0.350'] [Step 181 / Rank 2] Tasks: ['Code'] | Lens: [33870] → Tgt Spa: ['1.000'] [Step 181 / Rank 3] Tasks: ['Code'] | Lens: [33870] → Tgt Spa: ['1.000'] [Step 181 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32139, 32139] → Tgt Spa: ['0.350', '0.350'] [Step 181 / Rank 4] Tasks: ['Single QA'] | Lens: [55767] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:51:44,495 >> @ 181 | Loss: 1.9341 | LM: 1.8667 | Reg: 0.0675 | Spa(Avg): 0.543 [INFO|lh_trainer.py:797] 2026-02-17 02:51:44,496 >> Statistic -> Code | Spa: 0.705 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 02:51:44,496 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:51:44,496 >> Statistic -> MultiHop | Spa: 0.625 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:51:44,496 >> Statistic -> Single | Spa: 0.457 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:51:44,496 >> Statistic -> 
Summarization | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.145 | [INFO|lh_trainer.py:810] 2026-02-17 02:51:44,498 >> [Micro-Log] {"loss": 1.9341452351460855, "lm_loss": 1.866652629027764, "reg_loss": 0.06749261333607137, "model_sparsity(avg)": 0.5426656206448873, "Spa-In-Context Learning sparsity": 0.7123015778405326, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10272920770304543, "Spa-Code sparsity": 0.7045454545454546, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09094101122834465, "Spa-Single QA sparsity": 0.45722222089767456, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0706082587596029, "Spa-Summarization sparsity": 0.5833333134651184, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1454806923866272, "Spa-MultiHop QA sparsity": 0.625, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11305800825357437, "step": 181, "current_tau": 1.0, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.30078125, "lambda3 Summarization": 0.1494140625, "lambda4 Code": 0.2490234375} [INFO|lh_trainer.py:331] 2026-02-17 02:52:05,565 >> {'loss': 11.6049, 'grad_norm': 0.6293119192123413, 'learning_rate': 0.00024672762643440383, 'epoch': 0.19167983149025802, 'num_input_tokens_seen': 447635106, 'completed': '60.67% (182 / 300)', 'remaining time': '5:30:51', 'throughput': '8011.52', 'gpu_mem_free': '12851MB', 'step': 182} [Step 182 / Rank 3] Tasks: ['Single QA'] | Lens: [56673] → Tgt Spa: ['0.350'] [Step 182 / Rank 2] Tasks: ['Single QA'] | Lens: [56673] → Tgt Spa: ['0.350'] [Step 182 / Rank 1] Tasks: ['Single QA'] | Lens: [40718] → Tgt Spa: ['0.350'] [Step 182 / Rank 4] Tasks: ['Single QA'] | Lens: [54150] → Tgt Spa: ['0.350'] [Step 182 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17600, 17589, 17592] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 182 / Rank 0] Tasks: ['Single QA'] | Lens: [40718] → Tgt Spa: ['0.350'] [Step 182 / 
Rank 5] Tasks: ['Single QA'] | Lens: [54150] → Tgt Spa: ['0.350'] [Step 182 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17600, 17589, 17592] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 182 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41180] → Tgt Spa: ['1.000'] [Step 182 / Rank 4] Tasks: ['Single QA'] | Lens: [51735] → Tgt Spa: ['0.350'] [Step 182 / Rank 0] Tasks: ['Single QA'] | Lens: [35112] → Tgt Spa: ['0.350'] [Step 182 / Rank 2] Tasks: ['Single QA'] | Lens: [45273] → Tgt Spa: ['0.350'] [Step 182 / Rank 3] Tasks: ['Single QA'] | Lens: [45273] → Tgt Spa: ['0.350'] [Step 182 / Rank 1] Tasks: ['Single QA'] | Lens: [35112] → Tgt Spa: ['0.350'] [Step 182 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41180] → Tgt Spa: ['1.000'] [Step 182 / Rank 5] Tasks: ['Single QA'] | Lens: [51735] → Tgt Spa: ['0.350'] [Step 182 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23464, 23447] → Tgt Spa: ['1.000', '1.000'] [Step 182 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59820] → Tgt Spa: ['1.000'] [Step 182 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [28450, 28451] → Tgt Spa: ['0.350', '1.000'] [Step 182 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23464, 23447] → Tgt Spa: ['1.000', '1.000'] [Step 182 / Rank 6] Tasks: ['Summarization', 'Single QA'] | Lens: [27087, 27070] → Tgt Spa: ['1.000', '0.350'] [Step 182 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59820] → Tgt Spa: ['1.000'] [Step 182 / Rank 7] Tasks: ['Summarization', 'Single QA'] | Lens: [27087, 27070] → Tgt Spa: ['1.000', '0.350'] [Step 182 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [28450, 28451] → Tgt Spa: ['0.350', '1.000'] [Step 182 / Rank 5] Tasks: ['Single QA'] | Lens: [65063] → Tgt Spa: ['0.350'] [Step 182 / Rank 4] Tasks: ['Single QA'] | Lens: [65063] → Tgt Spa: ['0.350'] [Step 182 / Rank 3] Tasks: ['Single QA'] | Lens: [38668] → Tgt Spa: ['0.350'] [Step 182 / Rank 1] Tasks: ['Single QA', 'Single 
QA', 'Single QA'] | Lens: [21053, 21053, 21053] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 182 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21053, 21053, 21053] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 182 / Rank 2] Tasks: ['Single QA'] | Lens: [38668] → Tgt Spa: ['0.350'] [Step 182 / Rank 7] Tasks: ['Code'] | Lens: [44071] → Tgt Spa: ['1.000'] [Step 182 / Rank 6] Tasks: ['Code'] | Lens: [44071] → Tgt Spa: ['1.000'] [Step 182 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [26355, 26366] → Tgt Spa: ['1.000', '1.000'] [Step 182 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [35334] → Tgt Spa: ['1.000'] [Step 182 / Rank 2] Tasks: ['Single QA'] | Lens: [47646] → Tgt Spa: ['0.350'] [Step 182 / Rank 3] Tasks: ['Single QA'] | Lens: [47646] → Tgt Spa: ['0.350'] [Step 182 / Rank 7] Tasks: ['Code'] | Lens: [53778] → Tgt Spa: ['1.000'] [Step 182 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [26355, 26366] → Tgt Spa: ['1.000', '1.000'] [Step 182 / Rank 6] Tasks: ['Code'] | Lens: [53778] → Tgt Spa: ['1.000'] [Step 182 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [35334] → Tgt Spa: ['1.000'] [Step 182 / Rank 6] Tasks: ['Single QA'] | Lens: [41475] → Tgt Spa: ['0.350'] [Step 182 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [27051, 27051] → Tgt Spa: ['0.350', '0.350'] [Step 182 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [27051, 27051] → Tgt Spa: ['0.350', '0.350'] [Step 182 / Rank 5] Tasks: ['Code'] | Lens: [34871] → Tgt Spa: ['1.000'] [Step 182 / Rank 1] Tasks: ['Single QA'] | Lens: [40926] → Tgt Spa: ['0.350'] [Step 182 / Rank 4] Tasks: ['Code'] | Lens: [34871] → Tgt Spa: ['1.000'] [Step 182 / Rank 0] Tasks: ['Single QA'] | Lens: [40926] → Tgt Spa: ['0.350'] [Step 182 / Rank 7] Tasks: ['Single QA'] | Lens: [41475] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:54:33,466 >> @ 182 | Loss: 2.0917 | LM: 2.0341 | Reg: 0.0576 | Spa(Avg): 0.514 [INFO|lh_trainer.py:797] 2026-02-17 02:54:33,466 >> 
Statistic -> Code | Spa: 0.678 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:797] 2026-02-17 02:54:33,466 >> Statistic -> In-Context | Spa: 0.711 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:54:33,466 >> Statistic -> MultiHop | Spa: 0.625 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:54:33,466 >> Statistic -> Single | Spa: 0.380 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:54:33,466 >> Statistic -> Summarization | Spa: 0.662 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:810] 2026-02-17 02:54:33,468 >> [Micro-Log] {"loss": 2.0917073699335256, "lm_loss": 2.0341301038861275, "reg_loss": 0.05757727812548789, "model_sparsity(avg)": 0.514081783592701, "Spa-Single QA sparsity": 0.38040123052067226, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.020849015343830817, "Spa-In-Context Learning sparsity": 0.7106481393178304, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10340548927585284, "Spa-Code sparsity": 0.6782407263914744, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1009594996770223, "Spa-Summarization sparsity": 0.6620370149612427, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10517941166957219, "Spa-MultiHop QA sparsity": 0.625, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11305800825357437, "step": 182, "current_tau": 1.0, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.30078125, "lambda3 Summarization": 0.1494140625, "lambda4 Code": 0.2490234375} [INFO|lh_trainer.py:331] 2026-02-17 02:54:47,164 >> {'loss': 12.5502, 'grad_norm': 0.5318001508712769, 'learning_rate': 0.00024345578857745548, 'epoch': 0.19273301737756715, 'num_input_tokens_seen': 449969556, 'completed': '61.00% (183 / 300)', 'remaining time': '5:27:58', 'throughput': '7222.96', 'gpu_mem_free': '12111MB', 'step': 183} [Step 183 / Rank 0] Tasks: ['Single QA'] | Lens: 
[51854] → Tgt Spa: ['0.350'] [Step 183 / Rank 6] Tasks: ['Single QA'] | Lens: [37972] → Tgt Spa: ['0.350'] [Step 183 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [21521, 21523, 21533] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 183 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [21521, 21523, 21533] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 183 / Rank 4] Tasks: ['Code'] | Lens: [60999] → Tgt Spa: ['1.000'] [Step 183 / Rank 7] Tasks: ['Single QA'] | Lens: [37972] → Tgt Spa: ['0.350'] [Step 183 / Rank 5] Tasks: ['Code'] | Lens: [60999] → Tgt Spa: ['1.000'] [Step 183 / Rank 1] Tasks: ['Single QA'] | Lens: [51854] → Tgt Spa: ['0.350'] [Step 183 / Rank 4] Tasks: ['Code'] | Lens: [39187] → Tgt Spa: ['1.000'] [Step 183 / Rank 3] Tasks: ['Single QA'] | Lens: [49229] → Tgt Spa: ['0.350'] [Step 183 / Rank 2] Tasks: ['Single QA'] | Lens: [49229] → Tgt Spa: ['0.350'] [Step 183 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [31012, 31004] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [31012, 31004] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 5] Tasks: ['Code'] | Lens: [39187] → Tgt Spa: ['1.000'] [Step 183 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56557] → Tgt Spa: ['1.000'] [Step 183 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56557] → Tgt Spa: ['1.000'] [Step 183 / Rank 4] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [9509, 9521, 9519, 9540, 9542, 9545] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 183 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [33364] → Tgt Spa: ['1.000'] [Step 183 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [22413, 22415] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 5] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [9509, 9521, 9519, 9540, 9542, 9545] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 
183 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25456, 25457] → Tgt Spa: ['1.000', '0.350'] [Step 183 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [22413, 22415] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [33364] → Tgt Spa: ['1.000'] [Step 183 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25456, 25457] → Tgt Spa: ['1.000', '0.350'] [Step 183 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [64393] → Tgt Spa: ['1.000'] [Step 183 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [64393] → Tgt Spa: ['1.000'] [Step 183 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24014, 24032] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24014, 24032] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [23418, 23427] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 1] Tasks: ['Single QA'] | Lens: [44007] → Tgt Spa: ['0.350'] [Step 183 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [23418, 23427] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 0] Tasks: ['Single QA'] | Lens: [44007] → Tgt Spa: ['0.350'] [Step 183 / Rank 5] Tasks: ['Code'] | Lens: [40719] → Tgt Spa: ['1.000'] [Step 183 / Rank 2] Tasks: ['Code', 'Single QA'] | Lens: [25543, 25537] → Tgt Spa: ['1.000', '0.350'] [Step 183 / Rank 7] Tasks: ['Single QA'] | Lens: [62440] → Tgt Spa: ['0.350'] [Step 183 / Rank 3] Tasks: ['Code', 'Single QA'] | Lens: [25543, 25537] → Tgt Spa: ['1.000', '0.350'] [Step 183 / Rank 1] Tasks: ['Single QA'] | Lens: [59047] → Tgt Spa: ['0.350'] [Step 183 / Rank 0] Tasks: ['Single QA'] | Lens: [59047] → Tgt Spa: ['0.350'] [Step 183 / Rank 4] Tasks: ['Code'] | Lens: [40719] → Tgt Spa: ['1.000'] [Step 183 / Rank 6] Tasks: ['Single QA'] | Lens: [62440] → Tgt Spa: ['0.350'] [Step 183 / Rank 4] Tasks: ['Code'] | Lens: [33495] → Tgt Spa: ['1.000'] [Step 183 / Rank 0] Tasks: ['Single QA'] | Lens: 
[51350] → Tgt Spa: ['0.350'] [Step 183 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25653, 25654] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 7] Tasks: ['Single QA'] | Lens: [37362] → Tgt Spa: ['0.350'] [Step 183 / Rank 5] Tasks: ['Code'] | Lens: [33495] → Tgt Spa: ['1.000'] [Step 183 / Rank 6] Tasks: ['Single QA'] | Lens: [37362] → Tgt Spa: ['0.350'] [Step 183 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25653, 25654] → Tgt Spa: ['1.000', '1.000'] [Step 183 / Rank 1] Tasks: ['Single QA'] | Lens: [51350] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 02:57:15,306 >> @ 183 | Loss: 1.9329 | LM: 1.8682 | Reg: 0.0647 | Spa(Avg): 0.573 [INFO|lh_trainer.py:797] 2026-02-17 02:57:15,306 >> Statistic -> Code | Spa: 0.695 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 02:57:15,306 >> Statistic -> In-Context | Spa: 0.717 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:57:15,307 >> Statistic -> MultiHop | Spa: 0.625 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:57:15,307 >> Statistic -> Single | Spa: 0.388 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 02:57:15,307 >> Statistic -> Summarization | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:810] 2026-02-17 02:57:15,309 >> [Micro-Log] {"loss": 1.93290906958282, "lm_loss": 1.8682194653277595, "reg_loss": 0.06468961121087584, "model_sparsity(avg)": 0.5733024577299753, "Spa-Single QA sparsity": 0.38773147265116376, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03136215034949904, "Spa-In-Context Learning sparsity": 0.7171717123551802, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1008810976689512, "Spa-Code sparsity": 0.6954365032059806, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0947079136967659, "Spa-Summarization sparsity": 0.6805555820465088, "Spa-Summarization 
target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09507769346237183, "Spa-MultiHop QA sparsity": 0.625, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11305800825357437, "step": 183, "current_tau": 1.0, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.302734375, "lambda3 Summarization": 0.150390625, "lambda4 Code": 0.25} [INFO|lh_trainer.py:331] 2026-02-17 02:57:33,847 >> {'loss': 11.5975, 'grad_norm': 0.7586023807525635, 'learning_rate': 0.00024018507204172831, 'epoch': 0.19378620326487625, 'num_input_tokens_seen': 452367082, 'completed': '61.33% (184 / 300)', 'remaining time': '5:25:09', 'throughput': '7191.86', 'gpu_mem_free': '9709MB', 'step': 184} [Step 184 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42227] → Tgt Spa: ['1.000'] [Step 184 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27253, 27254] → Tgt Spa: ['0.350', '1.000'] [Step 184 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [54322] → Tgt Spa: ['1.000'] [Step 184 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [54322] → Tgt Spa: ['1.000'] [Step 184 / Rank 4] Tasks: ['Single QA'] | Lens: [45390] → Tgt Spa: ['0.350'] [Step 184 / Rank 5] Tasks: ['Single QA'] | Lens: [45390] → Tgt Spa: ['0.350'] [Step 184 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42227] → Tgt Spa: ['1.000'] [Step 184 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27253, 27254] → Tgt Spa: ['0.350', '1.000'] [Step 184 / Rank 2] Tasks: ['Single QA'] | Lens: [61322] → Tgt Spa: ['0.350'] [Step 184 / Rank 5] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 184 / Rank 7] Tasks: ['Code'] | Lens: [35516] → Tgt Spa: ['1.000'] [Step 184 / Rank 4] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 184 / Rank 6] Tasks: ['Code'] | Lens: [35516] → Tgt Spa: ['1.000'] [Step 184 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [25612, 25621] → Tgt Spa: ['0.350', '1.000'] [Step 184 / Rank 3] Tasks: ['Single QA'] | Lens: [61322] → Tgt 
Spa: ['0.350'] [Step 184 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [25612, 25621] → Tgt Spa: ['0.350', '1.000'] [Step 184 / Rank 5] Tasks: ['Single QA'] | Lens: [35566] → Tgt Spa: ['0.350'] [Step 184 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25243, 25225] → Tgt Spa: ['1.000', '1.000'] [Step 184 / Rank 2] Tasks: ['Single QA'] | Lens: [62402] → Tgt Spa: ['0.350'] [Step 184 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25243, 25225] → Tgt Spa: ['1.000', '1.000'] [Step 184 / Rank 3] Tasks: ['Single QA'] | Lens: [62402] → Tgt Spa: ['0.350'] [Step 184 / Rank 0] Tasks: ['Code'] | Lens: [44892] → Tgt Spa: ['1.000'] [Step 184 / Rank 4] Tasks: ['Single QA'] | Lens: [35566] → Tgt Spa: ['0.350'] [Step 184 / Rank 1] Tasks: ['Code'] | Lens: [44892] → Tgt Spa: ['1.000'] [Step 184 / Rank 4] Tasks: ['Single QA'] | Lens: [33865] → Tgt Spa: ['0.350'] [Step 184 / Rank 5] Tasks: ['Single QA'] | Lens: [33865] → Tgt Spa: ['0.350'] [Step 184 / Rank 2] Tasks: ['Single QA'] | Lens: [64975] → Tgt Spa: ['0.350'] [Step 184 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61589] → Tgt Spa: ['1.000'] [Step 184 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61589] → Tgt Spa: ['1.000'] [Step 184 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59195] → Tgt Spa: ['1.000'] [Step 184 / Rank 3] Tasks: ['Single QA'] | Lens: [64975] → Tgt Spa: ['0.350'] [Step 184 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59195] → Tgt Spa: ['1.000'] [Step 184 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25280, 25279] → Tgt Spa: ['1.000', '1.000'] [Step 184 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [19453, 19454, 19456] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 184 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop 
QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1252, 1251, 1271, 1253, 1253, 1253, 1254, 1253, 1254, 1255, 1254, 1254, 1254, 1273, 1273, 1255, 1256, 1257, 1255, 1274, 1257, 1256, 1275, 1275, 1256, 1258, 1257, 1258, 1258, 1258, 1258, 1258, 1277, 1259, 1259, 1259, 1260, 1259, 1260, 1260, 1260, 1279, 1279, 1260, 1261, 1260, 1280, 1280, 1262, 1261, 1262] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 184 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15807, 15807, 15808, 15808] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 184 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop 
QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1252, 1251, 1271, 1253, 1253, 1253, 1254, 1253, 1254, 1255, 1254, 1254, 1254, 1273, 1273, 1255, 1256, 1257, 1255, 1274, 1257, 1256, 1275, 1275, 1256, 1258, 1257, 1258, 1258, 1258, 1258, 1258, 1277, 1259, 1259, 1259, 1260, 1259, 1260, 1260, 1260, 1279, 1279, 1260, 1261, 1260, 1280, 1280, 1262, 1261, 1262] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 184 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15807, 15807, 15808, 15808] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 184 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25280, 25279] → Tgt Spa: ['1.000', '1.000'] [Step 184 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [19453, 19454, 19456] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 184 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [34220] → Tgt Spa: ['1.000'] [Step 184 / Rank 4] Tasks: ['Summarization', 'Single QA', 'Summarization'] | Lens: [17210, 17193, 17212] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 184 / Rank 6] Tasks: ['Code'] | Lens: [33588] → Tgt Spa: ['1.000'] [Step 184 / Rank 3] Tasks: ['Code'] | Lens: [52483] → Tgt Spa: ['1.000'] [Step 184 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [34220] → Tgt Spa: ['1.000'] [Step 184 / Rank 5] 
Tasks: ['Summarization', 'Single QA', 'Summarization'] | Lens: [17210, 17193, 17212] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 184 / Rank 2] Tasks: ['Code'] | Lens: [52483] → Tgt Spa: ['1.000'] [Step 184 / Rank 7] Tasks: ['Code'] | Lens: [33588] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:00:11,411 >> @ 184 | Loss: 2.0048 | LM: 1.9275 | Reg: 0.0773 | Spa(Avg): 0.594 [INFO|lh_trainer.py:797] 2026-02-17 03:00:11,411 >> Statistic -> Code | Spa: 0.710 | Tgt: 1.000 | Z-Loss: 0.089 | [INFO|lh_trainer.py:797] 2026-02-17 03:00:11,411 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:00:11,411 >> Statistic -> MultiHop | Spa: 0.597 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:00:11,411 >> Statistic -> Single | Spa: 0.434 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:00:11,411 >> Statistic -> Summarization | Spa: 0.668 | Tgt: 1.000 | Z-Loss: 0.103 | [INFO|lh_trainer.py:810] 2026-02-17 03:00:11,413 >> [Micro-Log] {"loss": 2.004830770970633, "lm_loss": 1.9275469358544797, "reg_loss": 0.07728385195756952, "model_sparsity(avg)": 0.5939655949672064, "Spa-In-Context Learning sparsity": 0.7175925837622749, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10130783749951257, "Spa-Single QA sparsity": 0.4337606796851525, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05785706947342707, "Spa-Code sparsity": 0.7100694477558136, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08918924443423748, "Spa-MultiHop QA sparsity": 0.5965447178701075, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1005251595250717, "Spa-Summarization sparsity": 0.6676587419850486, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10309071306671415, "step": 184, "current_tau": 1.0, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.302734375, "lambda3 
Summarization": 0.150390625, "lambda4 Code": 0.25} [INFO|lh_trainer.py:331] 2026-02-17 03:00:29,831 >> {'loss': 12.029, 'grad_norm': 0.7116125226020813, 'learning_rate': 0.00023691603724766298, 'epoch': 0.19483938915218535, 'num_input_tokens_seen': 454829434, 'completed': '61.67% (185 / 300)', 'remaining time': '5:22:26', 'throughput': '6995.94', 'gpu_mem_free': '14715MB', 'step': 185} [Step 185 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19582, 19587, 19578] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 185 / Rank 0] Tasks: ['Single QA'] | Lens: [57595] → Tgt Spa: ['0.350'] [Step 185 / Rank 6] Tasks: ['Single QA'] | Lens: [65069] → Tgt Spa: ['0.350'] [Step 185 / Rank 7] Tasks: ['Single QA'] | Lens: [65069] → Tgt Spa: ['0.350'] [Step 185 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19582, 19587, 19578] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 185 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [65331] → Tgt Spa: ['0.350'] [Step 185 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [65331] → Tgt Spa: ['0.350'] [Step 185 / Rank 1] Tasks: ['Single QA'] | Lens: [57595] → Tgt Spa: ['0.350'] [Step 185 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [29822, 29839] → Tgt Spa: ['0.350', '1.000'] [Step 185 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [29822, 29839] → Tgt Spa: ['0.350', '1.000'] [Step 185 / Rank 3] Tasks: ['Single QA'] | Lens: [55060] → Tgt Spa: ['0.350'] [Step 185 / Rank 2] Tasks: ['Single QA'] | Lens: [55060] → Tgt Spa: ['0.350'] [Step 185 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32628, 32628] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32628, 32628] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32000, 32000] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32000, 32000] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: 
[25246, 25247] → Tgt Spa: ['1.000', '0.350'] [Step 185 / Rank 6] Tasks: ['Single QA'] | Lens: [54528] → Tgt Spa: ['0.350'] [Step 185 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29225, 29225] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 0] Tasks: ['Single QA'] | Lens: [39327] → Tgt Spa: ['0.350'] [Step 185 / Rank 1] Tasks: ['Single QA'] | Lens: [39327] → Tgt Spa: ['0.350'] [Step 185 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29225, 29225] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25246, 25247] → Tgt Spa: ['1.000', '0.350'] [Step 185 / Rank 7] Tasks: ['Single QA'] | Lens: [54528] → Tgt Spa: ['0.350'] [Step 185 / Rank 2] Tasks: ['Single QA'] | Lens: [45425] → Tgt Spa: ['0.350'] [Step 185 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55477] → Tgt Spa: ['1.000'] [Step 185 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [31116, 31117] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 6] Tasks: ['Code'] | Lens: [44410] → Tgt Spa: ['1.000'] [Step 185 / Rank 7] Tasks: ['Code'] | Lens: [44410] → Tgt Spa: ['1.000'] [Step 185 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55477] → Tgt Spa: ['1.000'] [Step 185 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [31116, 31117] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 3] Tasks: ['Single QA'] | Lens: [45425] → Tgt Spa: ['0.350'] [Step 185 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57552] → Tgt Spa: ['1.000'] [Step 185 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57552] → Tgt Spa: ['1.000'] [Step 185 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42823] → Tgt Spa: ['1.000'] [Step 185 / Rank 2] Tasks: ['Single QA'] | Lens: [39660] → Tgt Spa: ['0.350'] [Step 185 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42823] → Tgt Spa: ['1.000'] [Step 185 / Rank 3] Tasks: ['Single QA'] | Lens: [39660] → Tgt Spa: ['0.350'] [Step 185 / Rank 1] Tasks: ['Single QA'] | Lens: [34572] → Tgt Spa: ['0.350'] [Step 185 / Rank 0] Tasks: 
['Single QA'] | Lens: [34572] → Tgt Spa: ['0.350'] [Step 185 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39902] → Tgt Spa: ['1.000'] [Step 185 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39902] → Tgt Spa: ['1.000'] [Step 185 / Rank 7] Tasks: ['Code'] | Lens: [34706] → Tgt Spa: ['1.000'] [Step 185 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32127, 32127] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32127, 32127] → Tgt Spa: ['0.350', '0.350'] [Step 185 / Rank 1] Tasks: ['Single QA'] | Lens: [43068] → Tgt Spa: ['0.350'] [Step 185 / Rank 0] Tasks: ['Single QA'] | Lens: [43068] → Tgt Spa: ['0.350'] [Step 185 / Rank 6] Tasks: ['Code'] | Lens: [34706] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:03:00,050 >> @ 185 | Loss: 2.0499 | LM: 1.9988 | Reg: 0.0511 | Spa(Avg): 0.498 [INFO|lh_trainer.py:797] 2026-02-17 03:03:00,050 >> Statistic -> Code | Spa: 0.705 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 03:03:00,050 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:03:00,050 >> Statistic -> MultiHop | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:03:00,050 >> Statistic -> Single | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:03:00,050 >> Statistic -> Summarization | Spa: 0.674 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 03:03:00,052 >> [Micro-Log] {"loss": 2.049897940053294, "lm_loss": 1.9987785805327196, "reg_loss": 0.051119369959148266, "model_sparsity(avg)": 0.49845677862564725, "Spa-Single QA sparsity": 0.3928571315038772, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.029307863153960733, "Spa-MultiHop QA sparsity": 0.375, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.00794222205877304, "Spa-In-Context Learning sparsity": 0.7194444417953492, 
"Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10057347118854523, "Spa-Summarization sparsity": 0.6736111044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09841975569725037, "Spa-Code sparsity": 0.7048611044883728, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0911557711660862, "step": 185, "current_tau": 1.0, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.302734375, "lambda3 Summarization": 0.150390625, "lambda4 Code": 0.25} [INFO|lh_trainer.py:331] 2026-02-17 03:03:17,561 >> {'loss': 12.2994, 'grad_norm': 0.4598511755466461, 'learning_rate': 0.00023364924432754246, 'epoch': 0.19589257503949448, 'num_input_tokens_seen': 457344632, 'completed': '62.00% (186 / 300)', 'remaining time': '5:19:37', 'throughput': '7497.79', 'gpu_mem_free': '12055MB', 'step': 186} [Step 186 / Rank 7] Tasks: ['Single QA'] | Lens: [36382] → Tgt Spa: ['0.350'] [Step 186 / Rank 6] Tasks: ['Single QA'] | Lens: [36382] → Tgt Spa: ['0.350'] [Step 186 / Rank 4] Tasks: ['Single QA'] | Lens: [64978] → Tgt Spa: ['0.350'] [Step 186 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [20655, 20648, 20648] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 186 / Rank 2] Tasks: ['Single QA'] | Lens: [61570] → Tgt Spa: ['0.350'] [Step 186 / Rank 5] Tasks: ['Single QA'] | Lens: [64978] → Tgt Spa: ['0.350'] [Step 186 / Rank 3] Tasks: ['Single QA'] | Lens: [61570] → Tgt Spa: ['0.350'] [Step 186 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [20655, 20648, 20648] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 186 / Rank 0] Tasks: ['Single QA'] | Lens: [55060] → Tgt Spa: ['0.350'] [Step 186 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 
'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA'] | Lens: [2531, 2532, 2549, 2549, 2539, 2550, 2550, 2536, 2552, 2536, 2536, 2537, 2538, 2539, 2538, 2556, 2538, 2541, 2559, 2557, 2542, 2542, 2559, 2542, 2542] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 186 / Rank 1] Tasks: ['Single QA'] | Lens: [55060] → Tgt Spa: ['0.350'] [Step 186 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [30173, 30168] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA'] | Lens: [2531, 2532, 2549, 2549, 2539, 2550, 2550, 2536, 2552, 2536, 2536, 2537, 2538, 2539, 2538, 2556, 2538, 2541, 2559, 2557, 2542, 2542, 2559, 2542, 2542] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 186 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [58538] → Tgt Spa: ['1.000'] [Step 186 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [58538] → Tgt Spa: ['1.000'] [Step 186 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [30173, 30168] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [22361, 22359] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 3] Tasks: ['Single QA'] | Lens: [42417] → Tgt Spa: ['0.350'] [Step 186 / Rank 4] 
Tasks: ['Single QA'] | Lens: [40363] → Tgt Spa: ['0.350'] [Step 186 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [22361, 22359] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 2] Tasks: ['Single QA'] | Lens: [42417] → Tgt Spa: ['0.350'] [Step 186 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39090] → Tgt Spa: ['1.000'] [Step 186 / Rank 5] Tasks: ['Single QA'] | Lens: [40363] → Tgt Spa: ['0.350'] [Step 186 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39090] → Tgt Spa: ['1.000'] [Step 186 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22289, 22289] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21051, 21051, 21052] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 186 / Rank 3] Tasks: ['Single QA'] | Lens: [42744] → Tgt Spa: ['0.350'] [Step 186 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21051, 21051, 21052] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 186 / Rank 2] Tasks: ['Single QA'] | Lens: [42744] → Tgt Spa: ['0.350'] [Step 186 / Rank 5] Tasks: ['Single QA'] | Lens: [53376] → Tgt Spa: ['0.350'] [Step 186 / Rank 4] Tasks: ['Single QA'] | Lens: [53376] → Tgt Spa: ['0.350'] [Step 186 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22289, 22289] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [27700, 27693] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 6] Tasks: ['Single QA'] | Lens: [34051] → Tgt Spa: ['0.350'] [Step 186 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16606, 16594, 16594] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 186 / Rank 1] Tasks: ['Code'] | Lens: [35462] → Tgt Spa: ['1.000'] [Step 186 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [27700, 27693] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 7] Tasks: ['Single QA'] | Lens: [34051] → Tgt Spa: ['0.350'] [Step 186 / Rank 0] Tasks: ['Code'] | Lens: [35462] → Tgt Spa: ['1.000'] [Step 186 / Rank 
4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16606, 16594, 16594] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 186 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [36707] → Tgt Spa: ['1.000'] [Step 186 / Rank 3] Tasks: ['Single QA'] | Lens: [55752] → Tgt Spa: ['0.350'] [Step 186 / Rank 0] Tasks: ['Single QA'] | Lens: [40581] → Tgt Spa: ['0.350'] [Step 186 / Rank 2] Tasks: ['Single QA'] | Lens: [55752] → Tgt Spa: ['0.350'] [Step 186 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [36707] → Tgt Spa: ['1.000'] [Step 186 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [25626, 25618] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [25626, 25618] → Tgt Spa: ['1.000', '1.000'] [Step 186 / Rank 1] Tasks: ['Single QA'] | Lens: [40581] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:05:31,352 >> @ 186 | Loss: 2.1443 | LM: 2.0807 | Reg: 0.0636 | Spa(Avg): 0.540 [INFO|lh_trainer.py:797] 2026-02-17 03:05:31,352 >> Statistic -> Code | Spa: 0.696 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 03:05:31,352 >> Statistic -> In-Context | Spa: 0.711 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:05:31,353 >> Statistic -> MultiHop | Spa: 0.601 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:05:31,353 >> Statistic -> Single | Spa: 0.414 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:05:31,353 >> Statistic -> Summarization | Spa: 0.610 | Tgt: 1.000 | Z-Loss: 0.131 | [INFO|lh_trainer.py:810] 2026-02-17 03:05:31,355 >> [Micro-Log] {"loss": 2.1443133813639483, "lm_loss": 2.080703371514877, "reg_loss": 0.06361003224931967, "model_sparsity(avg)": 0.5400462945302328, "Spa-Code sparsity": 0.6958333373069763, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09543162882328034, "Spa-In-Context Learning sparsity": 0.7111111044883728, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 
0.10394733771681786, "Spa-Single QA sparsity": 0.41421568043091717, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04162318151279846, "Spa-Summarization sparsity": 0.6097222208976746, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13094651997089385, "Spa-MultiHop QA sparsity": 0.6006944427887598, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10202928632497787, "step": 186, "current_tau": 1.0, "lambda1 Single QA": 0.578125, "lambda2 MultiHop QA": 0.302734375, "lambda3 Summarization": 0.1513671875, "lambda4 Code": 0.251953125} [INFO|lh_trainer.py:331] 2026-02-17 03:05:52,340 >> {'loss': 12.8659, 'grad_norm': 0.6228759288787842, 'learning_rate': 0.0002303852530295162, 'epoch': 0.19694576092680358, 'num_input_tokens_seen': 459728304, 'completed': '62.33% (187 / 300)', 'remaining time': '5:16:41', 'throughput': '7700.22', 'gpu_mem_free': '12025MB', 'step': 187} [Step 187 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [29172, 29165] → Tgt Spa: ['1.000', '1.000'] [Step 187 / Rank 3] Tasks: ['Single QA'] | Lens: [55964] → Tgt Spa: ['0.350'] [Step 187 / Rank 7] Tasks: ['Single QA'] | Lens: [60911] → Tgt Spa: ['0.350'] [Step 187 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [45815] → Tgt Spa: ['1.000'] [Step 187 / Rank 2] Tasks: ['Single QA'] | Lens: [55964] → Tgt Spa: ['0.350'] [Step 187 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [29172, 29165] → Tgt Spa: ['1.000', '1.000'] [Step 187 / Rank 6] Tasks: ['Single QA'] | Lens: [60911] → Tgt Spa: ['0.350'] [Step 187 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [45815] → Tgt Spa: ['1.000'] [Step 187 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [22925, 22918] → Tgt Spa: ['1.000', '1.000'] [Step 187 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [22925, 22918] → Tgt Spa: ['1.000', '1.000'] [Step 187 / Rank 5] Tasks: ['Summarization', 'Code'] | Lens: [28857, 28848] → Tgt Spa: ['1.000', 
'1.000'] [Step 187 / Rank 4] Tasks: ['Summarization', 'Code'] | Lens: [28857, 28848] → Tgt Spa: ['1.000', '1.000'] [Step 187 / Rank 3] Tasks: ['Single QA'] | Lens: [51460] → Tgt Spa: ['0.350'] [Step 187 / Rank 2] Tasks: ['Single QA'] | Lens: [51460] → Tgt Spa: ['0.350'] [Step 187 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [54618] → Tgt Spa: ['1.000'] [Step 187 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [54618] → Tgt Spa: ['1.000'] [Step 187 / Rank 4] Tasks: ['Single QA'] | Lens: [62392] → Tgt Spa: ['0.350'] [Step 187 / Rank 3] Tasks: ['Single QA'] | Lens: [43591] → Tgt Spa: ['0.350'] [Step 187 / Rank 5] Tasks: ['Single QA'] | Lens: [62392] → Tgt Spa: ['0.350'] [Step 187 / Rank 2] Tasks: ['Single QA'] | Lens: [43591] → Tgt Spa: ['0.350'] [Step 187 / Rank 1] Tasks: ['Single QA'] | Lens: [47599] → Tgt Spa: ['0.350'] [Step 187 / Rank 0] Tasks: ['Single QA'] | Lens: [47599] → Tgt Spa: ['0.350'] [Step 187 / Rank 7] Tasks: ['Single QA'] | Lens: [60034] → Tgt Spa: ['0.350'] [Step 187 / Rank 6] Tasks: ['Single QA'] | Lens: [60034] → Tgt Spa: ['0.350'] [Step 187 / Rank 3] Tasks: ['In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [4765, 4772, 4767, 4767, 4770, 4768, 4776, 4769, 4770, 4788, 4777, 4770, 4773] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 187 / Rank 5] Tasks: ['MultiHop QA'] | Lens: [65331] → Tgt Spa: ['0.350'] [Step 187 / Rank 0] Tasks: ['Single QA'] | Lens: [59672] → Tgt Spa: ['0.350'] [Step 187 / Rank 2] Tasks: ['In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [4765, 4772, 4767, 4767, 4770, 4768, 4776, 4769, 4770, 
4788, 4777, 4770, 4773] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 187 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [62771] → Tgt Spa: ['1.000'] [Step 187 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [62771] → Tgt Spa: ['1.000'] [Step 187 / Rank 1] Tasks: ['Single QA'] | Lens: [59672] → Tgt Spa: ['0.350'] [Step 187 / Rank 4] Tasks: ['MultiHop QA'] | Lens: [65331] → Tgt Spa: ['0.350'] [Step 187 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15807, 15807, 15807, 15807] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 187 / Rank 2] Tasks: ['Single QA'] | Lens: [43142] → Tgt Spa: ['0.350'] [Step 187 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [12747, 12748, 12748, 12749, 12759] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000'] [Step 187 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [12747, 12748, 12748, 12749, 12759] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000'] [Step 187 / Rank 3] Tasks: ['Single QA'] | Lens: [43142] → Tgt Spa: ['0.350'] [Step 187 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15807, 15807, 15807, 15807] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 187 / Rank 7] Tasks: ['Single QA'] | Lens: [48721] → Tgt Spa: ['0.350'] [Step 187 / Rank 6] Tasks: ['Single QA'] | Lens: [48721] → Tgt Spa: ['0.350'] [Step 187 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [27298, 27290] → Tgt Spa: ['1.000', '1.000'] [Step 187 / Rank 5] Tasks: ['Single QA'] | Lens: [58132] → Tgt Spa: ['0.350'] [Step 187 / Rank 4] Tasks: ['Single QA'] | Lens: [58132] → Tgt Spa: ['0.350'] [Step 187 / Rank 1] Tasks: ['Single QA'] | Lens: [56329] → Tgt Spa: ['0.350'] [Step 187 / Rank 6] Tasks: ['Single QA'] | Lens: [39562] → Tgt Spa: ['0.350'] [Step 187 / Rank 7] Tasks: ['Single QA'] | Lens: [39562] → Tgt Spa: 
['0.350'] [Step 187 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [27298, 27290] → Tgt Spa: ['1.000', '1.000'] [Step 187 / Rank 0] Tasks: ['Single QA'] | Lens: [56329] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:08:28,120 >> @ 187 | Loss: 2.1563 | LM: 2.1033 | Reg: 0.0530 | Spa(Avg): 0.478 [INFO|lh_trainer.py:797] 2026-02-17 03:08:28,120 >> Statistic -> Code | Spa: 0.698 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 03:08:28,120 >> Statistic -> In-Context | Spa: 0.691 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:08:28,120 >> Statistic -> MultiHop | Spa: 0.417 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:08:28,120 >> Statistic -> Single | Spa: 0.418 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:08:28,120 >> Statistic -> Summarization | Spa: 0.590 | Tgt: 1.000 | Z-Loss: 0.142 | [INFO|lh_trainer.py:810] 2026-02-17 03:08:28,122 >> [Micro-Log] {"loss": 2.1563087021155902, "lm_loss": 2.1033033637019494, "reg_loss": 0.05300532358523924, "model_sparsity(avg)": 0.4775195854405562, "Spa-In-Context Learning sparsity": 0.6909722288449606, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1122840562214454, "Spa-Code sparsity": 0.6979166865348816, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09435525815933943, "Spa-Single QA sparsity": 0.41782406717538834, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05006222288648132, "Spa-Summarization sparsity": 0.5902777910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14153318107128143, "Spa-MultiHop QA sparsity": 0.4166666865348816, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02208341844379902, "step": 187, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.302734375, "lambda3 Summarization": 0.1513671875, "lambda4 Code": 0.251953125} 
[INFO|lh_trainer.py:331] 2026-02-17 03:08:50,699 >> {'loss': 12.9379, 'grad_norm': 0.48603731393814087, 'learning_rate': 0.0002271246226216899, 'epoch': 0.1979989468141127, 'num_input_tokens_seen': 462371360, 'completed': '62.67% (188 / 300)', 'remaining time': '5:13:59', 'throughput': '7409.36', 'gpu_mem_free': '7485MB', 'step': 188} [Step 188 / Rank 3] Tasks: ['Single QA'] | Lens: [55095] → Tgt Spa: ['0.350'] [Step 188 / Rank 5] Tasks: ['Single QA'] | Lens: [48108] → Tgt Spa: ['0.350'] [Step 188 / Rank 4] Tasks: ['Single QA'] | Lens: [48108] → Tgt Spa: ['0.350'] [Step 188 / Rank 2] Tasks: ['Single QA'] | Lens: [55095] → Tgt Spa: ['0.350'] [Step 188 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41910] → Tgt Spa: ['1.000'] [Step 188 / Rank 0] Tasks: ['Summarization'] | Lens: [33665] → Tgt Spa: ['1.000'] [Step 188 / Rank 1] Tasks: ['Summarization'] | Lens: [33665] → Tgt Spa: ['1.000'] [Step 188 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41910] → Tgt Spa: ['1.000'] [Step 188 / Rank 5] Tasks: ['Single QA'] | Lens: [47429] → Tgt Spa: ['0.350'] [Step 188 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [16824, 16827, 16826] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 188 / Rank 1] Tasks: ['Code'] | Lens: [58271] → Tgt Spa: ['1.000'] [Step 188 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [16824, 16827, 16826] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 188 / Rank 4] Tasks: ['Single QA'] | Lens: [47429] → Tgt Spa: ['0.350'] [Step 188 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26146, 26147] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 0] Tasks: ['Code'] | Lens: [58271] → Tgt Spa: ['1.000'] [Step 188 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26146, 26147] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18439, 18428, 18444] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 188 / Rank 6] Tasks: ['Single QA'] | Lens: [60716] → Tgt Spa: 
['0.350'] [Step 188 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17270, 17274, 17275] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 188 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17270, 17274, 17275] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 188 / Rank 7] Tasks: ['Single QA'] | Lens: [60716] → Tgt Spa: ['0.350'] [Step 188 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [13987, 13991, 14007, 14033] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000'] [Step 188 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18439, 18428, 18444] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 188 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [13987, 13991, 14007, 14033] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000'] [Step 188 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29692, 29692] → Tgt Spa: ['0.350', '0.350'] [Step 188 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [24091, 24082] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [18990, 18993, 18995] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 188 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [18990, 18993, 18995] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 188 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29692, 29692] → Tgt Spa: ['0.350', '0.350'] [Step 188 / Rank 4] Tasks: ['Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop 
QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1682, 1682, 1664, 1664, 1665, 1666, 1666, 1685, 1686, 1685, 1667, 1668, 1667, 1667, 1667, 1686, 1668, 1669, 1668, 1667, 1668, 1688, 1688, 1669, 1669, 1689, 1688, 1689, 1671, 1671, 1670, 1670, 1671, 1671, 1672, 1691, 1672, 1671, 1672] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 188 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [24091, 24082] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 5] Tasks: ['Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1682, 1682, 1664, 1664, 1665, 1666, 1666, 1685, 1686, 1685, 1667, 1668, 1667, 1667, 1667, 1686, 1668, 1669, 1668, 1667, 1668, 1688, 1688, 1669, 1669, 1689, 1688, 1689, 1671, 1671, 1670, 1670, 1671, 1671, 1672, 1691, 1672, 1671, 1672] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', 
'0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 188 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24063, 24064] → Tgt Spa: ['1.000', '0.350'] [Step 188 / Rank 5] Tasks: ['Single QA'] | Lens: [48835] → Tgt Spa: ['0.350'] [Step 188 / Rank 7] Tasks: ['Single QA'] | Lens: [37406] → Tgt Spa: ['0.350'] [Step 188 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [22980, 22988] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 6] Tasks: ['Single QA'] | Lens: [37406] → Tgt Spa: ['0.350'] [Step 188 / Rank 4] Tasks: ['Single QA'] | Lens: [48835] → Tgt Spa: ['0.350'] [Step 188 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24063, 24064] → Tgt Spa: ['1.000', '0.350'] [Step 188 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [22980, 22988] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 6] Tasks: ['Single QA'] | Lens: [48564] → Tgt Spa: ['0.350'] [Step 188 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55802] → Tgt Spa: ['1.000'] [Step 188 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [28757, 28754] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55802] → Tgt Spa: ['1.000'] [Step 188 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [22187, 22187] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [22187, 22187] → Tgt Spa: ['1.000', '1.000'] [Step 188 / Rank 7] Tasks: ['Single QA'] | Lens: [48564] → Tgt Spa: ['0.350'] [Step 188 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [28757, 28754] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:11:07,716 >> @ 188 | Loss: 1.9950 | LM: 1.9217 | Reg: 0.0733 | Spa(Avg): 0.541 [INFO|lh_trainer.py:797] 2026-02-17 03:11:07,716 >> Statistic -> Code | Spa: 0.667 | Tgt: 1.000 | Z-Loss: 0.107 | [INFO|lh_trainer.py:797] 2026-02-17 03:11:07,716 >> Statistic -> In-Context | Spa: 0.694 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:11:07,716 >> Statistic -> MultiHop 
| Spa: 0.525 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:11:07,716 >> Statistic -> Single | Spa: 0.364 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:11:07,717 >> Statistic -> Summarization | Spa: 0.621 | Tgt: 1.000 | Z-Loss: 0.127 | [INFO|lh_trainer.py:810] 2026-02-17 03:11:07,719 >> [Micro-Log] {"loss": 1.9949773214757442, "lm_loss": 1.9216658659279346, "reg_loss": 0.07331144201937907, "model_sparsity(avg)": 0.5408913580079874, "Spa-Summarization sparsity": 0.6205808113921772, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12714850665493446, "Spa-Code sparsity": 0.6666666702790693, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10671370070088994, "Spa-Single QA sparsity": 0.36431622963685256, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01009665302430781, "Spa-In-Context Learning sparsity": 0.6944444520132882, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11122462046997887, "Spa-MultiHop QA sparsity": 0.5252057622980189, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06637594538430373, "step": 188, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.302734375, "lambda3 Summarization": 0.1513671875, "lambda4 Code": 0.251953125} [INFO|lh_trainer.py:331] 2026-02-17 03:11:28,284 >> {'loss': 11.9699, 'grad_norm': 0.900370180606842, 'learning_rate': 0.00022386791179629828, 'epoch': 0.1990521327014218, 'num_input_tokens_seen': 464826406, 'completed': '63.00% (189 / 300)', 'remaining time': '5:11:04', 'throughput': '7789.60', 'gpu_mem_free': '7417MB', 'step': 189} [Step 189 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [26988, 26988] → Tgt Spa: ['0.350', '0.350'] [Step 189 / Rank 0] Tasks: ['Single QA'] | Lens: [50840] → Tgt Spa: ['0.350'] [Step 189 / Rank 4] Tasks: ['Code'] | Lens: [35670] → Tgt Spa: ['1.000'] [Step 189 / Rank 5] Tasks: ['Code'] | 
Lens: [35670] → Tgt Spa: ['1.000'] [Step 189 / Rank 2] Tasks: ['Single QA'] | Lens: [49547] → Tgt Spa: ['0.350'] [Step 189 / Rank 3] Tasks: ['Single QA'] | Lens: [49547] → Tgt Spa: ['0.350'] [Step 189 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [26988, 26988] → Tgt Spa: ['0.350', '0.350'] [Step 189 / Rank 1] Tasks: ['Single QA'] | Lens: [50840] → Tgt Spa: ['0.350'] [Step 189 / Rank 3] Tasks: ['Single QA'] | Lens: [60530] → Tgt Spa: ['0.350'] [Step 189 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [28339, 28339] → Tgt Spa: ['0.350', '0.350'] [Step 189 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64440] → Tgt Spa: ['1.000'] [Step 189 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64440] → Tgt Spa: ['1.000'] [Step 189 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [28339, 28339] → Tgt Spa: ['0.350', '0.350'] [Step 189 / Rank 2] Tasks: ['Single QA'] | Lens: [60530] → Tgt Spa: ['0.350'] [Step 189 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [25263, 25263] → Tgt Spa: ['0.350', '0.350'] [Step 189 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [25263, 25263] → Tgt Spa: ['0.350', '0.350'] [Step 189 / Rank 4] Tasks: ['MultiHop QA'] | Lens: [65342] → Tgt Spa: ['0.350'] [Step 189 / Rank 6] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11730, 11748, 11731, 11732, 11732] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350'] [Step 189 / Rank 2] Tasks: ['Single QA'] | Lens: [65451] → Tgt Spa: ['0.350'] [Step 189 / Rank 3] Tasks: ['Single QA'] | Lens: [65451] → Tgt Spa: ['0.350'] [Step 189 / Rank 5] Tasks: ['MultiHop QA'] | Lens: [65342] → Tgt Spa: ['0.350'] [Step 189 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26125, 26125] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26125, 26125] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 7] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Single QA', 'Single 
QA'] | Lens: [11730, 11748, 11731, 11732, 11732] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350'] [Step 189 / Rank 0] Tasks: ['Code'] | Lens: [50910] → Tgt Spa: ['1.000'] [Step 189 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22629, 22632] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 5] Tasks: ['Single QA'] | Lens: [59443] → Tgt Spa: ['0.350'] [Step 189 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22629, 22632] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [24304, 24297] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 1] Tasks: ['Code'] | Lens: [50910] → Tgt Spa: ['1.000'] [Step 189 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [24304, 24297] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 4] Tasks: ['Single QA'] | Lens: [59443] → Tgt Spa: ['0.350'] [Step 189 / Rank 4] Tasks: ['Single QA'] | Lens: [58388] → Tgt Spa: ['0.350'] [Step 189 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24187, 24192] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [19846, 19847, 19849] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 189 / Rank 5] Tasks: ['Single QA'] | Lens: [58388] → Tgt Spa: ['0.350'] [Step 189 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [19846, 19847, 19849] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 189 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23879, 23880] → Tgt Spa: ['1.000', '0.350'] [Step 189 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24187, 24192] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23879, 23880] → Tgt Spa: ['1.000', '0.350'] [Step 189 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'Summarization', 'Single QA', 'Code', 'Summarization', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [5840, 5842, 
5861, 5851, 5864, 5846, 5855, 5867, 5849, 5849, 5853] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 189 / Rank 7] Tasks: ['Single QA'] | Lens: [61118] → Tgt Spa: ['0.350'] [Step 189 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56685] → Tgt Spa: ['1.000'] [Step 189 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27132, 27133] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27132, 27133] → Tgt Spa: ['1.000', '1.000'] [Step 189 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'Summarization', 'Single QA', 'Code', 'Summarization', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [5840, 5842, 5861, 5851, 5864, 5846, 5855, 5867, 5849, 5849, 5853] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 189 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56685] → Tgt Spa: ['1.000'] [Step 189 / Rank 6] Tasks: ['Single QA'] | Lens: [61118] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:14:09,068 >> @ 189 | Loss: 1.8990 | LM: 1.8409 | Reg: 0.0581 | Spa(Avg): 0.525 [INFO|lh_trainer.py:797] 2026-02-17 03:14:09,068 >> Statistic -> Code | Spa: 0.700 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 03:14:09,068 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:14:09,068 >> Statistic -> MultiHop | Spa: 0.403 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:14:09,068 >> Statistic -> Single | Spa: 0.383 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:14:09,068 >> Statistic -> Summarization | Spa: 0.549 | Tgt: 1.000 | Z-Loss: 0.165 | [INFO|lh_trainer.py:810] 2026-02-17 03:14:09,070 >> [Micro-Log] {"loss": 1.8989706750338275, "lm_loss": 1.840883197883765, "reg_loss": 
0.05808746333544453, "model_sparsity(avg)": 0.5249140821397305, "Spa-Single QA sparsity": 0.38293650036766413, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02473597836104177, "Spa-In-Context Learning sparsity": 0.7120370626449585, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10421444475650787, "Spa-Code sparsity": 0.6996528059244156, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09378039091825485, "Spa-MultiHop QA sparsity": 0.4027777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.017216719686985016, "Spa-Summarization sparsity": 0.5486110895872116, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16507354192435741, "step": 189, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.302734375, "lambda3 Summarization": 0.1513671875, "lambda4 Code": 0.251953125} [INFO|lh_trainer.py:331] 2026-02-17 03:14:34,404 >> {'loss': 11.3938, 'grad_norm': 0.6346839070320129, 'learning_rate': 0.0002206156785739756, 'epoch': 0.2001053185887309, 'num_input_tokens_seen': 467463708, 'completed': '63.33% (190 / 300)', 'remaining time': '5:08:27', 'throughput': '7084.95', 'gpu_mem_free': '7153MB', 'step': 190} [Step 190 / Rank 7] Tasks: ['Single QA'] | Lens: [65032] → Tgt Spa: ['0.350'] [Step 190 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26446, 26446] → Tgt Spa: ['1.000', '1.000'] [Step 190 / Rank 3] Tasks: ['Summarization', 'Summarization'] | Lens: [22299, 22300] → Tgt Spa: ['1.000', '1.000'] [Step 190 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26446, 26446] → Tgt Spa: ['1.000', '1.000'] [Step 190 / Rank 2] Tasks: ['Summarization', 'Summarization'] | Lens: [22299, 22300] → Tgt Spa: ['1.000', '1.000'] [Step 190 / Rank 6] Tasks: ['Single QA'] | Lens: [65032] → Tgt Spa: ['0.350'] [Step 190 / Rank 1] Tasks: ['Single QA'] | Lens: [65098] → Tgt Spa: 
['0.350'] [Step 190 / Rank 0] Tasks: ['Single QA'] | Lens: [65098] → Tgt Spa: ['0.350'] [Step 190 / Rank 2] Tasks: ['Single QA'] | Lens: [34553] → Tgt Spa: ['0.350'] [Step 190 / Rank 1] Tasks: ['Code'] | Lens: [64062] → Tgt Spa: ['1.000'] [Step 190 / Rank 5] Tasks: ['Single QA'] | Lens: [64213] → Tgt Spa: ['0.350'] [Step 190 / Rank 7] Tasks: ['Summarization'] | Lens: [36930] → Tgt Spa: ['1.000'] [Step 190 / Rank 0] Tasks: ['Code'] | Lens: [64062] → Tgt Spa: ['1.000'] [Step 190 / Rank 4] Tasks: ['Single QA'] | Lens: [64213] → Tgt Spa: ['0.350'] [Step 190 / Rank 3] Tasks: ['Single QA'] | Lens: [34553] → Tgt Spa: ['0.350'] [Step 190 / Rank 6] Tasks: ['Summarization'] | Lens: [36930] → Tgt Spa: ['1.000'] [Step 190 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60924] → Tgt Spa: ['1.000'] [Step 190 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60924] → Tgt Spa: ['1.000'] [Step 190 / Rank 0] Tasks: ['Single QA'] | Lens: [55084] → Tgt Spa: ['0.350'] [Step 190 / Rank 6] Tasks: ['Single QA'] | Lens: [53158] → Tgt Spa: ['0.350'] [Step 190 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [50886] → Tgt Spa: ['1.000'] [Step 190 / Rank 7] Tasks: ['Single QA'] | Lens: [53158] → Tgt Spa: ['0.350'] [Step 190 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [50886] → Tgt Spa: ['1.000'] [Step 190 / Rank 1] Tasks: ['Single QA'] | Lens: [55084] → Tgt Spa: ['0.350'] [Step 190 / Rank 2] Tasks: ['Code'] | Lens: [36723] → Tgt Spa: ['1.000'] [Step 190 / Rank 4] Tasks: ['Single QA'] | Lens: [49208] → Tgt Spa: ['0.350'] [Step 190 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18416, 18433, 18435] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 190 / Rank 3] Tasks: ['Code'] | Lens: [36723] → Tgt Spa: ['1.000'] [Step 190 / Rank 0] Tasks: ['In-Context Learning', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'In-Context 
Learning', 'Code', 'Code', 'Single QA'] | Lens: [3795, 3796, 3796, 3814, 3796, 3804, 3797, 3805, 3798, 3798, 3799, 3801, 3801, 3801, 3808, 3808, 3804] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 190 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18416, 18433, 18435] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 190 / Rank 1] Tasks: ['In-Context Learning', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'In-Context Learning', 'Code', 'Code', 'Single QA'] | Lens: [3795, 3796, 3796, 3814, 3796, 3804, 3797, 3805, 3798, 3798, 3799, 3801, 3801, 3801, 3808, 3808, 3804] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 190 / Rank 5] Tasks: ['Single QA'] | Lens: [49208] → Tgt Spa: ['0.350'] [Step 190 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [29512, 29531] → Tgt Spa: ['1.000', '1.000'] [Step 190 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27755, 27758] → Tgt Spa: ['1.000', '0.350'] [Step 190 / Rank 3] Tasks: ['Single QA'] | Lens: [49591] → Tgt Spa: ['0.350'] [Step 190 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27755, 27758] → Tgt Spa: ['1.000', '0.350'] [Step 190 / Rank 6] Tasks: ['Code'] | Lens: [40402] → Tgt Spa: ['1.000'] [Step 190 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [29512, 29531] → Tgt Spa: ['1.000', '1.000'] [Step 190 / Rank 2] Tasks: ['Single QA'] | Lens: [49591] → Tgt Spa: ['0.350'] [Step 190 / Rank 7] Tasks: ['Code'] | Lens: [40402] → Tgt Spa: ['1.000'] [Step 190 / Rank 7] Tasks: ['Single QA'] | Lens: [52568] → Tgt Spa: ['0.350'] [Step 190 / Rank 1] Tasks: 
['In-Context Learning'] | Lens: [60132] → Tgt Spa: ['1.000'] [Step 190 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26729, 26711] → Tgt Spa: ['1.000', '1.000'] [Step 190 / Rank 6] Tasks: ['Single QA'] | Lens: [52568] → Tgt Spa: ['0.350'] [Step 190 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [26729, 26711] → Tgt Spa: ['1.000', '1.000'] [Step 190 / Rank 2] Tasks: ['Code'] | Lens: [40563] → Tgt Spa: ['1.000'] [Step 190 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60132] → Tgt Spa: ['1.000'] [Step 190 / Rank 3] Tasks: ['Code'] | Lens: [40563] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:17:07,639 >> @ 190 | Loss: 2.0977 | LM: 2.0276 | Reg: 0.0701 | Spa(Avg): 0.562 [INFO|lh_trainer.py:797] 2026-02-17 03:17:07,639 >> Statistic -> Code | Spa: 0.694 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 03:17:07,640 >> Statistic -> In-Context | Spa: 0.701 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:17:07,640 >> Statistic -> MultiHop | Spa: 0.688 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:17:07,640 >> Statistic -> Single | Spa: 0.415 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:17:07,640 >> Statistic -> Summarization | Spa: 0.637 | Tgt: 1.000 | Z-Loss: 0.120 | [INFO|lh_trainer.py:810] 2026-02-17 03:17:07,643 >> [Micro-Log] {"loss": 2.0976739525794983, "lm_loss": 2.0275709315513573, "reg_loss": 0.07010300595720764, "model_sparsity(avg)": 0.5622276638944944, "Spa-Single QA sparsity": 0.41468252880232676, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04341632248334853, "Spa-Code sparsity": 0.694444457689921, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09585696458816528, "Spa-In-Context Learning sparsity": 0.7013888869966779, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10848061314650945, "Spa-MultiHop QA sparsity": 0.6875, "Spa-MultiHop QA 
target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.14779996126890182, "Spa-Summarization sparsity": 0.637152798473835, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12023143004626036, "step": 190, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.302734375, "lambda3 Summarization": 0.15234375, "lambda4 Code": 0.251953125} [INFO|lh_trainer.py:331] 2026-02-17 03:17:30,882 >> {'loss': 12.586, 'grad_norm': 0.6781347393989563, 'learning_rate': 0.00021736848020814198, 'epoch': 0.20115850447604003, 'num_input_tokens_seen': 469992746, 'completed': '63.67% (191 / 300)', 'remaining time': '5:05:43', 'throughput': '7165.33', 'gpu_mem_free': '5955MB', 'step': 191} [Step 191 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32134, 32134] → Tgt Spa: ['0.350', '0.350'] [Step 191 / Rank 0] Tasks: ['Code', 'Code', 'Code', 'Code', 'Single QA'] | Lens: [10911, 10924, 10936, 10936, 10930] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350'] [Step 191 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32134, 32134] → Tgt Spa: ['0.350', '0.350'] [Step 191 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [23407, 23399] → Tgt Spa: ['1.000', '1.000'] [Step 191 / Rank 2] Tasks: ['Single QA'] | Lens: [49435] → Tgt Spa: ['0.350'] [Step 191 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [23407, 23399] → Tgt Spa: ['1.000', '1.000'] [Step 191 / Rank 1] Tasks: ['Code', 'Code', 'Code', 'Code', 'Single QA'] | Lens: [10911, 10924, 10936, 10936, 10930] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350'] [Step 191 / Rank 3] Tasks: ['Single QA'] | Lens: [49435] → Tgt Spa: ['0.350'] [Step 191 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17911, 17900, 17900] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 191 / Rank 0] Tasks: ['Single QA'] | Lens: [60960] → Tgt Spa: ['0.350'] [Step 191 / Rank 2] Tasks: ['Single QA'] | Lens: [58662] → Tgt Spa: ['0.350'] [Step 191 / Rank 5] Tasks: 
['Summarization'] | Lens: [33177] → Tgt Spa: ['1.000'] [Step 191 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17911, 17900, 17900] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 191 / Rank 1] Tasks: ['Single QA'] | Lens: [60960] → Tgt Spa: ['0.350'] [Step 191 / Rank 3] Tasks: ['Single QA'] | Lens: [58662] → Tgt Spa: ['0.350'] [Step 191 / Rank 4] Tasks: ['Summarization'] | Lens: [33177] → Tgt Spa: ['1.000'] [Step 191 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [7975, 7975, 7975, 7975, 7978, 7976, 7976, 7976] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 191 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [32374, 32364] → Tgt Spa: ['1.000', '1.000'] [Step 191 / Rank 6] Tasks: ['Single QA'] | Lens: [52929] → Tgt Spa: ['0.350'] [Step 191 / Rank 7] Tasks: ['Single QA'] | Lens: [52929] → Tgt Spa: ['0.350'] [Step 191 / Rank 0] Tasks: ['Single QA'] | Lens: [59393] → Tgt Spa: ['0.350'] [Step 191 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [32374, 32364] → Tgt Spa: ['1.000', '1.000'] [Step 191 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [7975, 7975, 7975, 7975, 7978, 7976, 7976, 7976] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 191 / Rank 1] Tasks: ['Single QA'] | Lens: [59393] → Tgt Spa: ['0.350'] [Step 191 / Rank 4] Tasks: ['Single QA'] | Lens: [46704] → Tgt Spa: ['0.350'] [Step 191 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38891] → Tgt Spa: ['1.000'] [Step 191 / Rank 6] Tasks: ['Summarization'] | Lens: [41448] → Tgt Spa: ['1.000'] [Step 191 / Rank 7] Tasks: ['Summarization'] | Lens: [41448] → Tgt Spa: ['1.000'] [Step 191 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [38891] → Tgt Spa: ['1.000'] [Step 191 / Rank 5] Tasks: ['Single QA'] | Lens: [46704] → 
Tgt Spa: ['0.350'] [Step 191 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58762] → Tgt Spa: ['1.000'] [Step 191 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58762] → Tgt Spa: ['1.000'] [Step 191 / Rank 5] Tasks: ['Single QA'] | Lens: [49564] → Tgt Spa: ['0.350'] [Step 191 / Rank 3] Tasks: ['Summarization', 'Single QA'] | Lens: [23095, 23079] → Tgt Spa: ['1.000', '0.350'] [Step 191 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [22720, 22724] → Tgt Spa: ['1.000', '1.000'] [Step 191 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [22720, 22724] → Tgt Spa: ['1.000', '1.000'] [Step 191 / Rank 7] Tasks: ['Single QA'] | Lens: [55695] → Tgt Spa: ['0.350'] [Step 191 / Rank 6] Tasks: ['Single QA'] | Lens: [55695] → Tgt Spa: ['0.350'] [Step 191 / Rank 2] Tasks: ['Summarization', 'Single QA'] | Lens: [23095, 23079] → Tgt Spa: ['1.000', '0.350'] [Step 191 / Rank 4] Tasks: ['Single QA'] | Lens: [49564] → Tgt Spa: ['0.350'] [Step 191 / Rank 5] Tasks: ['Single QA'] | Lens: [40007] → Tgt Spa: ['0.350'] [Step 191 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [31699, 31711] → Tgt Spa: ['1.000', '1.000'] [Step 191 / Rank 4] Tasks: ['Single QA'] | Lens: [40007] → Tgt Spa: ['0.350'] [Step 191 / Rank 1] Tasks: ['Single QA'] | Lens: [61558] → Tgt Spa: ['0.350'] [Step 191 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [31699, 31711] → Tgt Spa: ['1.000', '1.000'] [Step 191 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42721] → Tgt Spa: ['1.000'] [Step 191 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42721] → Tgt Spa: ['1.000'] [Step 191 / Rank 0] Tasks: ['Single QA'] | Lens: [61558] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:20:00,750 >> @ 191 | Loss: 1.9431 | LM: 1.8840 | Reg: 0.0591 | Spa(Avg): 0.520 [INFO|lh_trainer.py:797] 2026-02-17 03:20:00,750 >> Statistic -> Code | Spa: 0.667 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:797] 2026-02-17 03:20:00,750 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 
03:20:00,750 >> Statistic -> MultiHop | Spa: 0.413 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:20:00,750 >> Statistic -> Single | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:20:00,750 >> Statistic -> Summarization | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:810] 2026-02-17 03:20:00,752 >> [Micro-Log] {"loss": 1.9430599268525839, "lm_loss": 1.8840030552819371, "reg_loss": 0.059056854520652756, "model_sparsity(avg)": 0.5202932059764862, "Spa-Code sparsity": 0.6666666766007742, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10781903999547164, "Spa-Single QA sparsity": 0.3925925811131795, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02915339518028001, "Spa-In-Context Learning sparsity": 0.7118055522441864, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10435888729989529, "Spa-Summarization sparsity": 0.680555546283722, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09739146828651428, "Spa-MultiHop QA sparsity": 0.41269842215946745, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02086792087980679, "step": 191, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.3046875, "lambda3 Summarization": 0.15234375, "lambda4 Code": 0.25390625} [INFO|lh_trainer.py:331] 2026-02-17 03:20:25,229 >> {'loss': 11.6584, 'grad_norm': 0.5637754201889038, 'learning_rate': 0.00021412687308952077, 'epoch': 0.20221169036334913, 'num_input_tokens_seen': 472498546, 'completed': '64.00% (192 / 300)', 'remaining time': '5:02:58', 'throughput': '7186.25', 'gpu_mem_free': '5805MB', 'step': 192} [Step 192 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41567] → Tgt Spa: ['1.000'] [Step 192 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [34578] → Tgt Spa: ['1.000'] [Step 192 / Rank 5] Tasks: ['Code'] | Lens: [39780] → Tgt Spa: ['1.000'] [Step 192 / 
Rank 3] Tasks: ['Single QA'] | Lens: [49716] → Tgt Spa: ['0.350'] [Step 192 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [34578] → Tgt Spa: ['1.000'] [Step 192 / Rank 4] Tasks: ['Code'] | Lens: [39780] → Tgt Spa: ['1.000'] [Step 192 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41567] → Tgt Spa: ['1.000'] [Step 192 / Rank 2] Tasks: ['Single QA'] | Lens: [49716] → Tgt Spa: ['0.350'] [Step 192 / Rank 6] Tasks: ['Single QA'] | Lens: [50380] → Tgt Spa: ['0.350'] [Step 192 / Rank 7] Tasks: ['Single QA'] | Lens: [50380] → Tgt Spa: ['0.350'] [Step 192 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26141, 26141] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23987, 23968] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 1] Tasks: ['Code'] | Lens: [34799] → Tgt Spa: ['1.000'] [Step 192 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26141, 26141] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 0] Tasks: ['Code'] | Lens: [34799] → Tgt Spa: ['1.000'] [Step 192 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23987, 23968] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [24685, 24679] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [27971, 27978] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 0] Tasks: ['Single QA'] | Lens: [36297] → Tgt Spa: ['0.350'] [Step 192 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [27971, 27978] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 1] Tasks: ['Single QA'] | Lens: [36297] → Tgt Spa: ['0.350'] [Step 192 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [30042, 30053] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [24685, 24679] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [30042, 30053] → Tgt Spa: 
['1.000', '1.000'] [Step 192 / Rank 4] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21315, 21325, 21319] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 192 / Rank 5] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21315, 21325, 21319] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 192 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27650, 27633] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 0] Tasks: ['Single QA'] | Lens: [57383] → Tgt Spa: ['0.350'] [Step 192 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27650, 27633] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 1] Tasks: ['Single QA'] | Lens: [57383] → Tgt Spa: ['0.350'] [Step 192 / Rank 6] Tasks: ['Code'] | Lens: [54969] → Tgt Spa: ['1.000'] [Step 192 / Rank 7] Tasks: ['Code'] | Lens: [54969] → Tgt Spa: ['1.000'] [Step 192 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'Summarization', 'Code'] | Lens: [7633, 7632, 7635, 7636, 7642, 7646, 7657, 7647] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 192 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [22117, 22107] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 2] Tasks: ['Single QA'] | Lens: [34536] → Tgt Spa: ['0.350'] [Step 192 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [22117, 22107] → Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 3] Tasks: ['Single QA'] | Lens: [34536] → Tgt Spa: ['0.350'] [Step 192 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'Summarization', 'Code'] | Lens: [7633, 7632, 7635, 7636, 7642, 7646, 7657, 7647] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 192 / Rank 4] Tasks: ['Code'] | Lens: [39873] → Tgt Spa: ['1.000'] [Step 192 / Rank 5] Tasks: ['Code'] | Lens: [39873] → Tgt Spa: ['1.000'] [Step 192 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23958, 23958] 
→ Tgt Spa: ['1.000', '1.000'] [Step 192 / Rank 5] Tasks: ['Single QA'] | Lens: [58639] → Tgt Spa: ['0.350'] [Step 192 / Rank 1] Tasks: ['Single QA', 'Single QA', 'In-Context Learning'] | Lens: [19167, 19168, 19168] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 192 / Rank 0] Tasks: ['Single QA', 'Single QA', 'In-Context Learning'] | Lens: [19167, 19168, 19168] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 192 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [17399, 17397, 17398] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 192 / Rank 4] Tasks: ['Single QA'] | Lens: [58639] → Tgt Spa: ['0.350'] [Step 192 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [17399, 17397, 17398] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 192 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23958, 23958] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:22:25,995 >> @ 192 | Loss: 1.9764 | LM: 1.8945 | Reg: 0.0819 | Spa(Avg): 0.605 [INFO|lh_trainer.py:797] 2026-02-17 03:22:25,995 >> Statistic -> Code | Spa: 0.695 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 03:22:25,995 >> Statistic -> In-Context | Spa: 0.708 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:22:25,995 >> Statistic -> MultiHop | Spa: 0.413 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:22:25,995 >> Statistic -> Single | Spa: 0.437 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:22:25,995 >> Statistic -> Summarization | Spa: 0.622 | Tgt: 1.000 | Z-Loss: 0.129 | [INFO|lh_trainer.py:810] 2026-02-17 03:22:25,998 >> [Micro-Log] {"loss": 1.9764235926171143, "lm_loss": 1.894505511969328, "reg_loss": 0.08191806980175897, "model_sparsity(avg)": 0.6052758544683456, "Spa-In-Context Learning sparsity": 0.7083333475249154, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10579125050987516, "Spa-Code sparsity": 0.6953703800837199, "Spa-Code target_sparsity": 1.0, 
"Spa-Code log_z_loss": 0.09622533818085989, "Spa-Single QA sparsity": 0.4374999950329463, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05750454841957738, "Spa-Summarization sparsity": 0.6215277761220932, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1292021945118904, "Spa-MultiHop QA sparsity": 0.41269842215946745, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.02086792087980679, "step": 192, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.3046875, "lambda3 Summarization": 0.1533203125, "lambda4 Code": 0.25390625} [INFO|lh_trainer.py:331] 2026-02-17 03:22:48,837 >> {'loss': 11.8585, 'grad_norm': 0.8815531134605408, 'learning_rate': 0.00021089141265080388, 'epoch': 0.20326487625065823, 'num_input_tokens_seen': 474859284, 'completed': '64.33% (193 / 300)', 'remaining time': '4:59:56', 'throughput': '8219.36', 'gpu_mem_free': '8867MB', 'step': 193} [Step 193 / Rank 5] Tasks: ['Single QA'] | Lens: [40552] → Tgt Spa: ['0.350'] [Step 193 / Rank 6] Tasks: ['Single QA'] | Lens: [46237] → Tgt Spa: ['0.350'] [Step 193 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29682, 29683] → Tgt Spa: ['0.350', '0.350'] [Step 193 / Rank 7] Tasks: ['Single QA'] | Lens: [46237] → Tgt Spa: ['0.350'] [Step 193 / Rank 2] Tasks: ['Code'] | Lens: [64898] → Tgt Spa: ['1.000'] [Step 193 / Rank 3] Tasks: ['Code'] | Lens: [64898] → Tgt Spa: ['1.000'] [Step 193 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29682, 29683] → Tgt Spa: ['0.350', '0.350'] [Step 193 / Rank 4] Tasks: ['Single QA'] | Lens: [40552] → Tgt Spa: ['0.350'] [Step 193 / Rank 3] Tasks: ['Code'] | Lens: [34431] → Tgt Spa: ['1.000'] [Step 193 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Code', 'Code', 'In-Context Learning', 'In-Context 
Learning', 'MultiHop QA', 'MultiHop QA'] | Lens: [3366, 3368, 3384, 3371, 3368, 3386, 3375, 3370, 3370, 3372, 3389, 3388, 3389, 3378, 3378, 3371, 3371, 3374, 3372] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 193 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [25262, 25263] → Tgt Spa: ['0.350', '0.350'] [Step 193 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA'] | Lens: [3366, 3368, 3384, 3371, 3368, 3386, 3375, 3370, 3370, 3372, 3389, 3388, 3389, 3378, 3378, 3371, 3371, 3374, 3372] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 193 / Rank 4] Tasks: ['Single QA'] | Lens: [50485] → Tgt Spa: ['0.350'] [Step 193 / Rank 5] Tasks: ['Single QA'] | Lens: [50485] → Tgt Spa: ['0.350'] [Step 193 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [25262, 25263] → Tgt Spa: ['0.350', '0.350'] [Step 193 / Rank 2] Tasks: ['Code'] | Lens: [34431] → Tgt Spa: ['1.000'] [Step 193 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [44518] → Tgt Spa: ['1.000'] [Step 193 / Rank 6] Tasks: ['Single QA'] | Lens: [51685] → Tgt Spa: ['0.350'] [Step 193 / Rank 1] Tasks: ['Single QA'] | Lens: [51228] → Tgt Spa: ['0.350'] [Step 193 / Rank 0] Tasks: ['Single QA'] | Lens: [51228] → Tgt Spa: ['0.350'] [Step 193 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [26546, 26539] → Tgt Spa: ['1.000', '1.000'] [Step 193 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [26546, 26539] → Tgt Spa: ['1.000', '1.000'] [Step 193 / Rank 7] Tasks: ['Single QA'] | 
Lens: [51685] → Tgt Spa: ['0.350'] [Step 193 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [44518] → Tgt Spa: ['1.000'] [Step 193 / Rank 6] Tasks: ['Single QA'] | Lens: [45467] → Tgt Spa: ['0.350'] [Step 193 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24412, 24412] → Tgt Spa: ['1.000', '1.000'] [Step 193 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24412, 24412] → Tgt Spa: ['1.000', '1.000'] [Step 193 / Rank 4] Tasks: ['Single QA'] | Lens: [64683] → Tgt Spa: ['0.350'] [Step 193 / Rank 7] Tasks: ['Single QA'] | Lens: [45467] → Tgt Spa: ['0.350'] [Step 193 / Rank 3] Tasks: ['Single QA'] | Lens: [55621] → Tgt Spa: ['0.350'] [Step 193 / Rank 2] Tasks: ['Single QA'] | Lens: [55621] → Tgt Spa: ['0.350'] [Step 193 / Rank 5] Tasks: ['Single QA'] | Lens: [64683] → Tgt Spa: ['0.350'] [Step 193 / Rank 5] Tasks: ['Single QA'] | Lens: [39823] → Tgt Spa: ['0.350'] [Step 193 / Rank 2] Tasks: ['Code'] | Lens: [38492] → Tgt Spa: ['1.000'] [Step 193 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22350, 22371] → Tgt Spa: ['1.000', '1.000'] [Step 193 / Rank 3] Tasks: ['Code'] | Lens: [38492] → Tgt Spa: ['1.000'] [Step 193 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22350, 22371] → Tgt Spa: ['1.000', '1.000'] [Step 193 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32121, 32122] → Tgt Spa: ['0.350', '0.350'] [Step 193 / Rank 4] Tasks: ['Single QA'] | Lens: [39823] → Tgt Spa: ['0.350'] [Step 193 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32121, 32122] → Tgt Spa: ['0.350', '0.350'] [Step 193 / Rank 1] Tasks: ['Single QA', 'Summarization'] | Lens: [30414, 30433] → Tgt Spa: ['0.350', '1.000'] [Step 193 / Rank 5] Tasks: ['Single QA'] | Lens: [34632] → Tgt Spa: ['0.350'] [Step 193 / Rank 2] Tasks: ['Single QA'] | Lens: [48112] → Tgt Spa: ['0.350'] [Step 193 / Rank 3] Tasks: ['Single QA'] | Lens: [48112] → Tgt Spa: ['0.350'] [Step 193 / Rank 0] Tasks: ['Single QA', 
'Summarization'] | Lens: [30414, 30433] → Tgt Spa: ['0.350', '1.000'] [Step 193 / Rank 7] Tasks: ['Code'] | Lens: [33538] → Tgt Spa: ['1.000'] [Step 193 / Rank 4] Tasks: ['Single QA'] | Lens: [34632] → Tgt Spa: ['0.350'] [Step 193 / Rank 6] Tasks: ['Code'] | Lens: [33538] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:25:14,778 >> @ 193 | Loss: 1.9911 | LM: 1.9429 | Reg: 0.0482 | Spa(Avg): 0.502 [INFO|lh_trainer.py:797] 2026-02-17 03:25:14,779 >> Statistic -> Code | Spa: 0.693 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 03:25:14,779 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:25:14,779 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:25:14,779 >> Statistic -> Single | Spa: 0.406 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:25:14,779 >> Statistic -> Summarization | Spa: 0.625 | Tgt: 1.000 | Z-Loss: 0.125 | [INFO|lh_trainer.py:810] 2026-02-17 03:25:14,781 >> [Micro-Log] {"loss": 1.9910814439256985, "lm_loss": 1.9428712564210098, "reg_loss": 0.04821018526369395, "model_sparsity(avg)": 0.5016751947502295, "Spa-Single QA sparsity": 0.40608465103876024, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.036491868241379656, "Spa-Summarization sparsity": 0.625, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12486210891178676, "Spa-MultiHop QA sparsity": 0.6458333233992258, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1259546453754107, "Spa-Code sparsity": 0.6927083358168602, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0971499290317297, "Spa-In-Context Learning sparsity": 0.7103174499103001, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10564787260123662, "step": 193, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 
0.3046875, "lambda3 Summarization": 0.1533203125, "lambda4 Code": 0.25390625} [INFO|lh_trainer.py:331] 2026-02-17 03:25:31,552 >> {'loss': 11.9465, 'grad_norm': 0.46758437156677246, 'learning_rate': 0.00020766265327148146, 'epoch': 0.20431806213796735, 'num_input_tokens_seen': 477239588, 'completed': '64.67% (194 / 300)', 'remaining time': '4:57:05', 'throughput': '7314.33', 'gpu_mem_free': '6969MB', 'step': 194} [Step 194 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [63564] → Tgt Spa: ['1.000'] [Step 194 / Rank 5] Tasks: ['Code'] | Lens: [40054] → Tgt Spa: ['1.000'] [Step 194 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [63564] → Tgt Spa: ['1.000'] [Step 194 / Rank 4] Tasks: ['Code'] | Lens: [40054] → Tgt Spa: ['1.000'] [Step 194 / Rank 1] Tasks: ['Single QA'] | Lens: [41705] → Tgt Spa: ['0.350'] [Step 194 / Rank 0] Tasks: ['Single QA'] | Lens: [41705] → Tgt Spa: ['0.350'] [Step 194 / Rank 7] Tasks: ['Single QA'] | Lens: [52277] → Tgt Spa: ['0.350'] [Step 194 / Rank 6] Tasks: ['Single QA'] | Lens: [52277] → Tgt Spa: ['0.350'] [Step 194 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [24651, 24651] → Tgt Spa: ['1.000', '1.000'] [Step 194 / Rank 4] Tasks: ['Code'] | Lens: [37097] → Tgt Spa: ['1.000'] [Step 194 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [24651, 24651] → Tgt Spa: ['1.000', '1.000'] [Step 194 / Rank 5] Tasks: ['Code'] | Lens: [37097] → Tgt Spa: ['1.000'] [Step 194 / Rank 1] Tasks: ['Single QA', 'MultiHop QA'] | Lens: [30888, 30889] → Tgt Spa: ['0.350', '0.350'] [Step 194 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38145] → Tgt Spa: ['1.000'] [Step 194 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38145] → Tgt Spa: ['1.000'] [Step 194 / Rank 0] Tasks: ['Single QA', 'MultiHop QA'] | Lens: [30888, 30889] → Tgt Spa: ['0.350', '0.350'] [Step 194 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29973, 29974] → Tgt Spa: ['0.350', '0.350'] [Step 194 / Rank 7] Tasks: ['Single QA'] | Lens: [45878] → Tgt Spa: ['0.350'] [Step 194 / Rank 4] Tasks: 
['Summarization'] | Lens: [37529] → Tgt Spa: ['1.000'] [Step 194 / Rank 5] Tasks: ['Summarization'] | Lens: [37529] → Tgt Spa: ['1.000'] [Step 194 / Rank 6] Tasks: ['Single QA'] | Lens: [45878] → Tgt Spa: ['0.350'] [Step 194 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23980, 23981] → Tgt Spa: ['1.000', '0.350'] [Step 194 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29973, 29974] → Tgt Spa: ['0.350', '0.350'] [Step 194 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23980, 23981] → Tgt Spa: ['1.000', '0.350'] [Step 194 / Rank 3] Tasks: ['Single QA'] | Lens: [50403] → Tgt Spa: ['0.350'] [Step 194 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12737, 12730, 12730, 12731, 12731] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350'] [Step 194 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12737, 12730, 12730, 12731, 12731] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350'] [Step 194 / Rank 7] Tasks: ['Single QA'] | Lens: [37292] → Tgt Spa: ['0.350'] [Step 194 / Rank 2] Tasks: ['Single QA'] | Lens: [50403] → Tgt Spa: ['0.350'] [Step 194 / Rank 6] Tasks: ['Single QA'] | Lens: [37292] → Tgt Spa: ['0.350'] [Step 194 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44786] → Tgt Spa: ['1.000'] [Step 194 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44786] → Tgt Spa: ['1.000'] [Step 194 / Rank 4] Tasks: ['Single QA'] | Lens: [33665] → Tgt Spa: ['0.350'] [Step 194 / Rank 6] Tasks: ['Single QA'] | Lens: [42758] → Tgt Spa: ['0.350'] [Step 194 / Rank 5] Tasks: ['Single QA'] | Lens: [33665] → Tgt Spa: ['0.350'] [Step 194 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [24886, 24886] → Tgt Spa: ['0.350', '0.350'] [Step 194 / Rank 7] Tasks: ['Single QA'] | Lens: [42758] → Tgt Spa: ['0.350'] [Step 194 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [18123, 18123, 18144] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 194 / Rank 
2] Tasks: ['Single QA', 'Single QA'] | Lens: [24886, 24886] → Tgt Spa: ['0.350', '0.350'] [Step 194 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [18123, 18123, 18144] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 194 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Summarization', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'Single QA', 'Code'] | Lens: [4299, 4300, 4318, 4300, 4301, 4301, 4303, 4302, 4302, 4302, 4304, 4322, 4312, 4305, 4312] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 194 / Rank 3] Tasks: ['Single QA'] | Lens: [41402] → Tgt Spa: ['0.350'] [Step 194 / Rank 1] Tasks: ['Single QA'] | Lens: [46765] → Tgt Spa: ['0.350'] [Step 194 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27681, 27665] → Tgt Spa: ['1.000', '1.000'] [Step 194 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Summarization', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'Single QA', 'Code'] | Lens: [4299, 4300, 4318, 4300, 4301, 4301, 4303, 4302, 4302, 4302, 4304, 4322, 4312, 4305, 4312] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 194 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27681, 27665] → Tgt Spa: ['1.000', '1.000'] [Step 194 / Rank 0] Tasks: ['Single QA'] | Lens: [46765] → Tgt Spa: ['0.350'] [Step 194 / Rank 2] Tasks: ['Single QA'] | Lens: [41402] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:27:35,244 >> @ 194 | Loss: 2.2052 | LM: 2.1442 | Reg: 0.0611 | Spa(Avg): 0.530 [INFO|lh_trainer.py:797] 2026-02-17 
03:27:35,244 >> Statistic -> Code | Spa: 0.706 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 03:27:35,244 >> Statistic -> In-Context | Spa: 0.703 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:27:35,244 >> Statistic -> MultiHop | Spa: 0.556 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:27:35,245 >> Statistic -> Single | Spa: 0.439 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:27:35,245 >> Statistic -> Summarization | Spa: 0.675 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:810] 2026-02-17 03:27:35,247 >> [Micro-Log] {"loss": 2.2052431789537272, "lm_loss": 2.144186858087778, "reg_loss": 0.061056296195602044, "model_sparsity(avg)": 0.5295717604458332, "Spa-Single QA sparsity": 0.4391025648667262, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.057259202196012035, "Spa-MultiHop QA sparsity": 0.5555555820465088, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.07964850217103958, "Spa-In-Context Learning sparsity": 0.7032828222621571, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10868681289932945, "Spa-Summarization sparsity": 0.675, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09981056749820709, "Spa-Code sparsity": 0.706349219594683, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09202938313995089, "step": 194, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.3046875, "lambda3 Summarization": 0.1533203125, "lambda4 Code": 0.25390625} [INFO|lh_trainer.py:331] 2026-02-17 03:27:51,341 >> {'loss': 13.2315, 'grad_norm': 0.5134908556938171, 'learning_rate': 0.00020444114818285127, 'epoch': 0.20537124802527645, 'num_input_tokens_seen': 479559702, 'completed': '65.00% (195 / 300)', 'remaining time': '4:54:02', 'throughput': '8298.61', 'gpu_mem_free': '11813MB', 'step': 195} [Step 195 / Rank 2] Tasks: 
['In-Context Learning'] | Lens: [48091] → Tgt Spa: ['1.000'] [Step 195 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24078, 24079] → Tgt Spa: ['1.000', '0.350'] [Step 195 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [48091] → Tgt Spa: ['1.000'] [Step 195 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24078, 24079] → Tgt Spa: ['1.000', '0.350'] [Step 195 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [65337] → Tgt Spa: ['0.350'] [Step 195 / Rank 7] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 195 / Rank 0] Tasks: ['MultiHop QA'] | Lens: [65337] → Tgt Spa: ['0.350'] [Step 195 / Rank 6] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 195 / Rank 3] Tasks: ['Code'] | Lens: [33532] → Tgt Spa: ['1.000'] [Step 195 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57573] → Tgt Spa: ['1.000'] [Step 195 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [26700, 26693] → Tgt Spa: ['1.000', '1.000'] [Step 195 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [26700, 26693] → Tgt Spa: ['1.000', '1.000'] [Step 195 / Rank 6] Tasks: ['Single QA'] | Lens: [65044] → Tgt Spa: ['0.350'] [Step 195 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57573] → Tgt Spa: ['1.000'] [Step 195 / Rank 2] Tasks: ['Code'] | Lens: [33532] → Tgt Spa: ['1.000'] [Step 195 / Rank 7] Tasks: ['Single QA'] | Lens: [65044] → Tgt Spa: ['0.350'] [Step 195 / Rank 4] Tasks: ['Single QA'] | Lens: [54960] → Tgt Spa: ['0.350'] [Step 195 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [31331, 31328] → Tgt Spa: ['1.000', '0.350'] [Step 195 / Rank 5] Tasks: ['Single QA'] | Lens: [54960] → Tgt Spa: ['0.350'] [Step 195 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [31331, 31328] → Tgt Spa: ['1.000', '0.350'] [Step 195 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [26096, 26098] → Tgt Spa: ['0.350', '0.350'] [Step 195 / Rank 0] Tasks: ['Single QA'] | Lens: [65039] → Tgt Spa: ['0.350'] [Step 195 / Rank 6] 
Tasks: ['Single QA', 'Single QA'] | Lens: [26096, 26098] → Tgt Spa: ['0.350', '0.350'] [Step 195 / Rank 1] Tasks: ['Single QA'] | Lens: [65039] → Tgt Spa: ['0.350'] [Step 195 / Rank 3] Tasks: ['Single QA'] | Lens: [43068] → Tgt Spa: ['0.350'] [Step 195 / Rank 5] Tasks: ['Single QA'] | Lens: [64833] → Tgt Spa: ['0.350'] [Step 195 / Rank 6] Tasks: ['Single QA'] | Lens: [52539] → Tgt Spa: ['0.350'] [Step 195 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [26563, 26563] → Tgt Spa: ['0.350', '0.350'] [Step 195 / Rank 2] Tasks: ['Single QA'] | Lens: [43068] → Tgt Spa: ['0.350'] [Step 195 / Rank 7] Tasks: ['Single QA'] | Lens: [52539] → Tgt Spa: ['0.350'] [Step 195 / Rank 4] Tasks: ['Single QA'] | Lens: [64833] → Tgt Spa: ['0.350'] [Step 195 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [26563, 26563] → Tgt Spa: ['0.350', '0.350'] [Step 195 / Rank 5] Tasks: ['Single QA'] | Lens: [49232] → Tgt Spa: ['0.350'] [Step 195 / Rank 3] Tasks: ['MultiHop QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'Code', 'MultiHop QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA'] | Lens: [3124, 3124, 3124, 3124, 3127, 3125, 3142, 3142, 3133, 3132, 3126, 3126, 3132, 3126, 3128, 3145, 3133, 3127, 3128, 3129] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 195 / Rank 2] Tasks: ['MultiHop QA', 'Single QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'Code', 'MultiHop QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA'] | Lens: [3124, 3124, 3124, 3124, 3127, 3125, 3142, 3142, 3133, 3132, 3126, 3126, 3132, 3126, 3128, 3145, 3133, 3127, 
3128, 3129] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 195 / Rank 7] Tasks: ['Single QA'] | Lens: [35403] → Tgt Spa: ['0.350'] [Step 195 / Rank 4] Tasks: ['Single QA'] | Lens: [49232] → Tgt Spa: ['0.350'] [Step 195 / Rank 1] Tasks: ['Code'] | Lens: [51498] → Tgt Spa: ['1.000'] [Step 195 / Rank 0] Tasks: ['Code'] | Lens: [51498] → Tgt Spa: ['1.000'] [Step 195 / Rank 6] Tasks: ['Single QA'] | Lens: [35403] → Tgt Spa: ['0.350'] [Step 195 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26237, 26258] → Tgt Spa: ['1.000', '1.000'] [Step 195 / Rank 3] Tasks: ['Single QA'] | Lens: [47132] → Tgt Spa: ['0.350'] [Step 195 / Rank 2] Tasks: ['Single QA'] | Lens: [47132] → Tgt Spa: ['0.350'] [Step 195 / Rank 0] Tasks: ['Single QA'] | Lens: [63517] → Tgt Spa: ['0.350'] [Step 195 / Rank 1] Tasks: ['Single QA'] | Lens: [63517] → Tgt Spa: ['0.350'] [Step 195 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43299] → Tgt Spa: ['1.000'] [Step 195 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26237, 26258] → Tgt Spa: ['1.000', '1.000'] [Step 195 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43299] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:30:45,687 >> @ 195 | Loss: 1.9523 | LM: 1.8937 | Reg: 0.0586 | Spa(Avg): 0.519 [INFO|lh_trainer.py:797] 2026-02-17 03:30:45,687 >> Statistic -> Code | Spa: 0.700 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 03:30:45,687 >> Statistic -> In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:30:45,687 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:30:45,687 >> Statistic -> Single | Spa: 0.436 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:30:45,687 >> Statistic -> Summarization | Spa: 0.691 | Tgt: 
1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:810] 2026-02-17 03:30:45,690 >> [Micro-Log] {"loss": 1.9522563079372048, "lm_loss": 1.8936748123766545, "reg_loss": 0.058581472131966926, "model_sparsity(avg)": 0.5192997604608536, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "Spa-Code sparsity": 0.7003968272890363, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09440418864999499, "Spa-In-Context Learning sparsity": 0.7129629453023275, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10460647692282994, "Spa-Single QA sparsity": 0.43640350040636566, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05610079105061136, "Spa-Summarization sparsity": 0.6909722089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09253890812397003, "step": 195, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.3046875, "lambda3 Summarization": 0.154296875, "lambda4 Code": 0.25390625} [INFO|lh_trainer.py:331] 2026-02-17 03:31:11,662 >> {'loss': 11.7135, 'grad_norm': 0.5208679437637329, 'learning_rate': 0.00020122744937322602, 'epoch': 0.20642443391258558, 'num_input_tokens_seen': 482139806, 'completed': '65.33% (196 / 300)', 'remaining time': '4:51:31', 'throughput': '6439.94', 'gpu_mem_free': '4663MB', 'step': 196} [Step 196 / Rank 6] Tasks: ['Code'] | Lens: [57520] → Tgt Spa: ['1.000'] [Step 196 / Rank 7] Tasks: ['Code'] | Lens: [57520] → Tgt Spa: ['1.000'] [Step 196 / Rank 4] Tasks: ['Code'] | Lens: [44086] → Tgt Spa: ['1.000'] [Step 196 / Rank 5] Tasks: ['Code'] | Lens: [44086] → Tgt Spa: ['1.000'] [Step 196 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8814, 8814, 8820, 8815, 8815, 8815, 8816] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 196 / Rank 1] 
Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [20468, 20468, 20493] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 196 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [20468, 20468, 20493] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 196 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8814, 8814, 8820, 8815, 8815, 8815, 8816] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 196 / Rank 1] Tasks: ['Single QA'] | Lens: [63428] → Tgt Spa: ['0.350'] [Step 196 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28997, 29001] → Tgt Spa: ['1.000', '1.000'] [Step 196 / Rank 6] Tasks: ['Single QA'] | Lens: [57259] → Tgt Spa: ['0.350'] [Step 196 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28997, 29001] → Tgt Spa: ['1.000', '1.000'] [Step 196 / Rank 7] Tasks: ['Single QA'] | Lens: [57259] → Tgt Spa: ['0.350'] [Step 196 / Rank 3] Tasks: ['Code'] | Lens: [40277] → Tgt Spa: ['1.000'] [Step 196 / Rank 0] Tasks: ['Single QA'] | Lens: [63428] → Tgt Spa: ['0.350'] [Step 196 / Rank 2] Tasks: ['Code'] | Lens: [40277] → Tgt Spa: ['1.000'] [Step 196 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [25986, 25981] → Tgt Spa: ['1.000', '1.000'] [Step 196 / Rank 0] Tasks: ['Single QA'] | Lens: [35024] → Tgt Spa: ['0.350'] [Step 196 / Rank 5] Tasks: ['Single QA'] | Lens: [58388] → Tgt Spa: ['0.350'] [Step 196 / Rank 4] Tasks: ['Single QA'] | Lens: [58388] → Tgt Spa: ['0.350'] [Step 196 / Rank 6] Tasks: ['Single QA'] | Lens: [39805] → Tgt Spa: ['0.350'] [Step 196 / Rank 1] Tasks: ['Single QA'] | Lens: [35024] → Tgt Spa: ['0.350'] [Step 196 / Rank 7] Tasks: ['Single QA'] | Lens: [39805] → Tgt Spa: ['0.350'] [Step 196 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [25986, 25981] → Tgt Spa: ['1.000', '1.000'] [Step 196 / Rank 3] Tasks: ['Single QA'] | Lens: 
[44215] → Tgt Spa: ['0.350'] [Step 196 / Rank 1] Tasks: ['Single QA'] | Lens: [46859] → Tgt Spa: ['0.350'] [Step 196 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38075] → Tgt Spa: ['1.000'] [Step 196 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38075] → Tgt Spa: ['1.000'] [Step 196 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59495] → Tgt Spa: ['1.000'] [Step 196 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59495] → Tgt Spa: ['1.000'] [Step 196 / Rank 2] Tasks: ['Single QA'] | Lens: [44215] → Tgt Spa: ['0.350'] [Step 196 / Rank 0] Tasks: ['Single QA'] | Lens: [46859] → Tgt Spa: ['0.350'] [Step 196 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [62807] → Tgt Spa: ['1.000'] [Step 196 / Rank 6] Tasks: ['Single QA'] | Lens: [39257] → Tgt Spa: ['0.350'] [Step 196 / Rank 2] Tasks: ['Single QA'] | Lens: [51528] → Tgt Spa: ['0.350'] [Step 196 / Rank 7] Tasks: ['Single QA'] | Lens: [39257] → Tgt Spa: ['0.350'] [Step 196 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [62807] → Tgt Spa: ['1.000'] [Step 196 / Rank 1] Tasks: ['Single QA'] | Lens: [54041] → Tgt Spa: ['0.350'] [Step 196 / Rank 3] Tasks: ['Single QA'] | Lens: [51528] → Tgt Spa: ['0.350'] [Step 196 / Rank 0] Tasks: ['Single QA'] | Lens: [54041] → Tgt Spa: ['0.350'] [Step 196 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61230] → Tgt Spa: ['1.000'] [Step 196 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61230] → Tgt Spa: ['1.000'] [Step 196 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [18183, 18186, 18189] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 196 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40831] → Tgt Spa: ['1.000'] [Step 196 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40831] → Tgt Spa: ['1.000'] [Step 196 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [18183, 18186, 18189] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 196 / Rank 4] Tasks: ['Single QA'] | Lens: [51496] → Tgt Spa: ['0.350'] [Step 196 / Rank 5] Tasks: ['Single QA'] | Lens: [51496] → Tgt Spa: 
['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:33:54,428 >> @ 196 | Loss: 2.0781 | LM: 2.0194 | Reg: 0.0587 | Spa(Avg): 0.548 [INFO|lh_trainer.py:797] 2026-02-17 03:33:54,428 >> Statistic -> Code | Spa: 0.707 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 03:33:54,429 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:33:54,429 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:33:54,429 >> Statistic -> Single | Spa: 0.385 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:33:54,429 >> Statistic -> Summarization | Spa: 0.694 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:810] 2026-02-17 03:33:54,431 >> [Micro-Log] {"loss": 2.078103226919969, "lm_loss": 2.0193867720663548, "reg_loss": 0.058716431003025114, "model_sparsity(avg)": 0.5479221815864245, "Spa-In-Context Learning sparsity": 0.7194444417953492, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10197697281837463, "Spa-Summarization sparsity": 0.6944444179534912, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09054604172706604, "Spa-Single QA sparsity": 0.3848039192311904, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02227073538062327, "Spa-Code sparsity": 0.7065972313284874, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09261257667094469, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 196, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.3046875, "lambda3 Summarization": 0.154296875, "lambda4 Code": 0.255859375} [INFO|lh_trainer.py:331] 2026-02-17 03:34:18,497 >> {'loss': 12.4686, 'grad_norm': 0.6516356468200684, 'learning_rate': 0.0001980221074933525, 'epoch': 0.20747761979989468, 'num_input_tokens_seen': 
484606370, 'completed': '65.67% (197 / 300)', 'remaining time': '4:48:52', 'throughput': '6600.90', 'gpu_mem_free': '9549MB', 'step': 197} [Step 197 / Rank 4] Tasks: ['Single QA'] | Lens: [37680] → Tgt Spa: ['0.350'] [Step 197 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29711, 29711] → Tgt Spa: ['0.350', '0.350'] [Step 197 / Rank 1] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Summarization'] | Lens: [10663, 10663, 10657, 10658, 10662, 10683] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 197 / Rank 3] Tasks: ['Single QA'] | Lens: [44043] → Tgt Spa: ['0.350'] [Step 197 / Rank 5] Tasks: ['Single QA'] | Lens: [37680] → Tgt Spa: ['0.350'] [Step 197 / Rank 0] Tasks: ['Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Summarization'] | Lens: [10663, 10663, 10657, 10658, 10662, 10683] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 197 / Rank 2] Tasks: ['Single QA'] | Lens: [44043] → Tgt Spa: ['0.350'] [Step 197 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29711, 29711] → Tgt Spa: ['0.350', '0.350'] [Step 197 / Rank 5] Tasks: ['Code'] | Lens: [36863] → Tgt Spa: ['1.000'] [Step 197 / Rank 4] Tasks: ['Code'] | Lens: [36863] → Tgt Spa: ['1.000'] [Step 197 / Rank 7] Tasks: ['Single QA'] | Lens: [35776] → Tgt Spa: ['0.350'] [Step 197 / Rank 6] Tasks: ['Single QA'] | Lens: [35776] → Tgt Spa: ['0.350'] [Step 197 / Rank 0] Tasks: ['Single QA', 'Code', 'Code', 'Single QA'] | Lens: [15948, 15956, 15956, 15951] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 197 / Rank 1] Tasks: ['Single QA', 'Code', 'Code', 'Single QA'] | Lens: [15948, 15956, 15956, 15951] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 197 / Rank 2] Tasks: ['Single QA'] | Lens: [52880] → Tgt Spa: ['0.350'] [Step 197 / Rank 3] Tasks: ['Single QA'] | Lens: [52880] → Tgt Spa: ['0.350'] [Step 197 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [51755] → Tgt Spa: ['1.000'] [Step 197 / Rank 2] Tasks: ['In-Context 
Learning'] | Lens: [58631] → Tgt Spa: ['1.000'] [Step 197 / Rank 5] Tasks: ['Single QA'] | Lens: [55829] → Tgt Spa: ['0.350'] [Step 197 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [58631] → Tgt Spa: ['1.000'] [Step 197 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [51755] → Tgt Spa: ['1.000'] [Step 197 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23368, 23350] → Tgt Spa: ['1.000', '1.000'] [Step 197 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23368, 23350] → Tgt Spa: ['1.000', '1.000'] [Step 197 / Rank 4] Tasks: ['Single QA'] | Lens: [55829] → Tgt Spa: ['0.350'] [Step 197 / Rank 5] Tasks: ['Single QA'] | Lens: [52816] → Tgt Spa: ['0.350'] [Step 197 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [45819] → Tgt Spa: ['1.000'] [Step 197 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [53397] → Tgt Spa: ['1.000'] [Step 197 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [21146, 21148, 21151] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 197 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [21146, 21148, 21151] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 197 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [53397] → Tgt Spa: ['1.000'] [Step 197 / Rank 4] Tasks: ['Single QA'] | Lens: [52816] → Tgt Spa: ['0.350'] [Step 197 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45819] → Tgt Spa: ['1.000'] [Step 197 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56255] → Tgt Spa: ['1.000'] [Step 197 / Rank 2] Tasks: ['Single QA'] | Lens: [65084] → Tgt Spa: ['0.350'] [Step 197 / Rank 3] Tasks: ['Single QA'] | Lens: [65084] → Tgt Spa: ['0.350'] [Step 197 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24091, 24092] → Tgt Spa: ['1.000', '1.000'] [Step 197 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56255] → Tgt Spa: ['1.000'] [Step 197 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24091, 24092] → Tgt Spa: 
['1.000', '1.000'] [Step 197 / Rank 7] Tasks: ['Single QA'] | Lens: [59050] → Tgt Spa: ['0.350'] [Step 197 / Rank 6] Tasks: ['Single QA'] | Lens: [59050] → Tgt Spa: ['0.350'] [Step 197 / Rank 5] Tasks: ['Single QA'] | Lens: [35307] → Tgt Spa: ['0.350'] [Step 197 / Rank 4] Tasks: ['Single QA'] | Lens: [35307] → Tgt Spa: ['0.350'] [Step 197 / Rank 0] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [19553, 19544, 19537] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 197 / Rank 1] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [19553, 19544, 19537] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 197 / Rank 2] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [8359, 8366, 8367, 8369, 8362, 8368, 8371] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 197 / Rank 3] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [8359, 8366, 8367, 8369, 8362, 8368, 8371] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 197 / Rank 7] Tasks: ['Code'] | Lens: [37316] → Tgt Spa: ['1.000'] [Step 197 / Rank 6] Tasks: ['Code'] | Lens: [37316] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:36:41,003 >> @ 197 | Loss: 2.0383 | LM: 1.9697 | Reg: 0.0687 | Spa(Avg): 0.557 [INFO|lh_trainer.py:797] 2026-02-17 03:36:41,003 >> Statistic -> Code | Spa: 0.709 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 03:36:41,004 >> Statistic -> In-Context | Spa: 0.721 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:36:41,004 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:36:41,004 >> Statistic -> Single | Spa: 0.437 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:36:41,004 >> Statistic -> Summarization | Spa: 0.627 | Tgt: 1.000 | Z-Loss: 0.124 | [INFO|lh_trainer.py:810] 2026-02-17 03:36:41,006 >> [Micro-Log] {"loss":
2.038319382816553, "lm_loss": 1.9696636566271384, "reg_loss": 0.06865570367275116, "model_sparsity(avg)": 0.5572778868178526, "Spa-Code sparsity": 0.7094907462596893, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09150802468260129, "Spa-Single QA sparsity": 0.4374999867545234, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.058763518172781914, "Spa-Summarization sparsity": 0.627314825852712, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12436308339238167, "Spa-In-Context Learning sparsity": 0.7206790049870809, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10155952473481496, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 197, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.154296875, "lambda4 Code": 0.255859375} [INFO|lh_trainer.py:331] 2026-02-17 03:36:54,145 >> {'loss': 12.2299, 'grad_norm': 0.6271363496780396, 'learning_rate': 0.00019482567176206064, 'epoch': 0.20853080568720378, 'num_input_tokens_seen': 487088894, 'completed': '66.00% (198 / 300)', 'remaining time': '4:45:57', 'throughput': '7974.81', 'gpu_mem_free': '9387MB', 'step': 198} [Step 198 / Rank 2] Tasks: ['Single QA'] | Lens: [48376] → Tgt Spa: ['0.350'] [Step 198 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [18725, 18726, 18728] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 198 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [18725, 18726, 18728] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 198 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19117, 19110, 19111] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 198 / Rank 5] Tasks: ['Single QA'] | Lens: [49209] → Tgt Spa: ['0.350'] [Step 198 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19117, 19110, 19111] → Tgt Spa: ['1.000', '1.000', 
'1.000'] [Step 198 / Rank 3] Tasks: ['Single QA'] | Lens: [48376] → Tgt Spa: ['0.350'] [Step 198 / Rank 4] Tasks: ['Single QA'] | Lens: [49209] → Tgt Spa: ['0.350'] [Step 198 / Rank 2] Tasks: ['Single QA'] | Lens: [52250] → Tgt Spa: ['0.350'] [Step 198 / Rank 4] Tasks: ['Summarization'] | Lens: [44631] → Tgt Spa: ['1.000'] [Step 198 / Rank 3] Tasks: ['Single QA'] | Lens: [52250] → Tgt Spa: ['0.350'] [Step 198 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22756, 22774] → Tgt Spa: ['1.000', '1.000'] [Step 198 / Rank 6] Tasks: ['Single QA'] | Lens: [38983] → Tgt Spa: ['0.350'] [Step 198 / Rank 5] Tasks: ['Summarization'] | Lens: [44631] → Tgt Spa: ['1.000'] [Step 198 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22756, 22774] → Tgt Spa: ['1.000', '1.000'] [Step 198 / Rank 7] Tasks: ['Single QA'] | Lens: [38983] → Tgt Spa: ['0.350'] [Step 198 / Rank 5] Tasks: ['Single QA'] | Lens: [39738] → Tgt Spa: ['0.350'] [Step 198 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [63958] → Tgt Spa: ['1.000'] [Step 198 / Rank 1] Tasks: ['Code'] | Lens: [59031] → Tgt Spa: ['1.000'] [Step 198 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [36924] → Tgt Spa: ['1.000'] [Step 198 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [36924] → Tgt Spa: ['1.000'] [Step 198 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [63958] → Tgt Spa: ['1.000'] [Step 198 / Rank 0] Tasks: ['Code'] | Lens: [59031] → Tgt Spa: ['1.000'] [Step 198 / Rank 4] Tasks: ['Single QA'] | Lens: [39738] → Tgt Spa: ['0.350'] [Step 198 / Rank 5] Tasks: ['Single QA'] | Lens: [60989] → Tgt Spa: ['0.350'] [Step 198 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [45038] → Tgt Spa: ['1.000'] [Step 198 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42824] → Tgt Spa: ['1.000'] [Step 198 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45038] → Tgt Spa: ['1.000'] [Step 198 / Rank 4] Tasks: ['Single QA'] | Lens: [60989] → Tgt Spa: ['0.350'] [Step 198 / Rank 7] Tasks: ['In-Context 
Learning', 'In-Context Learning'] | Lens: [28228, 28229] → Tgt Spa: ['1.000', '1.000'] [Step 198 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28228, 28229] → Tgt Spa: ['1.000', '1.000'] [Step 198 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42824] → Tgt Spa: ['1.000'] [Step 198 / Rank 7] Tasks: ['Single QA'] | Lens: [59436] → Tgt Spa: ['0.350'] [Step 198 / Rank 1] Tasks: ['Single QA'] | Lens: [51028] → Tgt Spa: ['0.350'] [Step 198 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [26401, 26409] → Tgt Spa: ['1.000', '1.000'] [Step 198 / Rank 6] Tasks: ['Single QA'] | Lens: [59436] → Tgt Spa: ['0.350'] [Step 198 / Rank 5] Tasks: ['Single QA'] | Lens: [36796] → Tgt Spa: ['0.350'] [Step 198 / Rank 4] Tasks: ['Single QA'] | Lens: [36796] → Tgt Spa: ['0.350'] [Step 198 / Rank 0] Tasks: ['Single QA'] | Lens: [51028] → Tgt Spa: ['0.350'] [Step 198 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [26401, 26409] → Tgt Spa: ['1.000', '1.000'] [Step 198 / Rank 5] Tasks: ['Single QA'] | Lens: [53381] → Tgt Spa: ['0.350'] [Step 198 / Rank 7] Tasks: ['Single QA'] | Lens: [62438] → Tgt Spa: ['0.350'] [Step 198 / Rank 4] Tasks: ['Single QA'] | Lens: [53381] → Tgt Spa: ['0.350'] [Step 198 / Rank 3] Tasks: ['Single QA'] | Lens: [33357] → Tgt Spa: ['0.350'] [Step 198 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29932, 29932] → Tgt Spa: ['0.350', '1.000'] [Step 198 / Rank 2] Tasks: ['Single QA'] | Lens: [33357] → Tgt Spa: ['0.350'] [Step 198 / Rank 6] Tasks: ['Single QA'] | Lens: [62438] → Tgt Spa: ['0.350'] [Step 198 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29932, 29932] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:39:25,023 >> @ 198 | Loss: 2.2194 | LM: 2.1595 | Reg: 0.0600 | Spa(Avg): 0.529 [INFO|lh_trainer.py:797] 2026-02-17 03:39:25,023 >> Statistic -> Code | Spa: 0.683 | Tgt: 1.000 | Z-Loss: 0.102 | [INFO|lh_trainer.py:797] 2026-02-17 03:39:25,023 >> Statistic -> 
In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:39:25,023 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:39:25,023 >> Statistic -> Single | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:39:25,023 >> Statistic -> Summarization | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 03:39:25,026 >> [Micro-Log] {"loss": 2.2194467075169086, "lm_loss": 2.159459587186575, "reg_loss": 0.05998711401965314, "model_sparsity(avg)": 0.5285493806004524, "Spa-Code sparsity": 0.682539667401995, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10228330109800611, "Spa-In-Context Learning sparsity": 0.7129629585478041, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10466005073653327, "Spa-Summarization sparsity": 0.680555542310079, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09801580756902695, "Spa-Single QA sparsity": 0.3749999954150273, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.019067004134950157, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 198, "current_tau": 1.0, "lambda1 Single QA": 0.58203125, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.1552734375, "lambda4 Code": 0.255859375} [INFO|lh_trainer.py:331] 2026-02-17 03:39:50,313 >> {'loss': 13.3167, 'grad_norm': 0.5966213345527649, 'learning_rate': 0.00019163868987215785, 'epoch': 0.2095839915745129, 'num_input_tokens_seen': 489502024, 'completed': '66.33% (199 / 300)', 'remaining time': '4:43:13', 'throughput': '6848.96', 'gpu_mem_free': '7257MB', 'step': 199} [Step 199 / Rank 5] Tasks: ['Single QA'] | Lens: [56363] → Tgt Spa: ['0.350'] [Step 199 / Rank 4] Tasks: ['Single QA'] | Lens: [56363] → Tgt Spa: ['0.350'] [Step 199 / 
Rank 0] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [11173, 11185, 11189, 11184, 11195] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000'] [Step 199 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [51975] → Tgt Spa: ['1.000'] [Step 199 / Rank 2] Tasks: ['Single QA'] | Lens: [34007] → Tgt Spa: ['0.350'] [Step 199 / Rank 1] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [11173, 11185, 11189, 11184, 11195] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000'] [Step 199 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [51975] → Tgt Spa: ['1.000'] [Step 199 / Rank 3] Tasks: ['Single QA'] | Lens: [34007] → Tgt Spa: ['0.350'] [Step 199 / Rank 5] Tasks: ['Single QA'] | Lens: [52954] → Tgt Spa: ['0.350'] [Step 199 / Rank 4] Tasks: ['Single QA'] | Lens: [52954] → Tgt Spa: ['0.350'] [Step 199 / Rank 6] Tasks: ['Code', 'Single QA'] | Lens: [22102, 22095] → Tgt Spa: ['1.000', '0.350'] [Step 199 / Rank 0] Tasks: ['Single QA'] | Lens: [41487] → Tgt Spa: ['0.350'] [Step 199 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [45368] → Tgt Spa: ['1.000'] [Step 199 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45368] → Tgt Spa: ['1.000'] [Step 199 / Rank 7] Tasks: ['Code', 'Single QA'] | Lens: [22102, 22095] → Tgt Spa: ['1.000', '0.350'] [Step 199 / Rank 1] Tasks: ['Single QA'] | Lens: [41487] → Tgt Spa: ['0.350'] [Step 199 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [31079, 31062] → Tgt Spa: ['1.000', '1.000'] [Step 199 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28915, 28942] → Tgt Spa: ['1.000', '1.000'] [Step 199 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [28915, 28942] → Tgt Spa: ['1.000', '1.000'] [Step 199 / Rank 7] Tasks: ['Code'] | Lens: [51844] → Tgt Spa: ['1.000'] [Step 199 / Rank 3] Tasks: ['Single QA'] | Lens: [44501] → Tgt Spa: ['0.350'] [Step 199 / Rank 2] Tasks: ['Single QA'] | Lens: [44501] → Tgt Spa: ['0.350'] [Step 199 / Rank 6] Tasks: ['Code'] | Lens: 
[51844] → Tgt Spa: ['1.000'] [Step 199 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [31079, 31062] → Tgt Spa: ['1.000', '1.000'] [Step 199 / Rank 3] Tasks: ['Single QA'] | Lens: [46770] → Tgt Spa: ['0.350'] [Step 199 / Rank 1] Tasks: ['Single QA'] | Lens: [36016] → Tgt Spa: ['0.350'] [Step 199 / Rank 0] Tasks: ['Single QA'] | Lens: [36016] → Tgt Spa: ['0.350'] [Step 199 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [21396, 21397, 21399] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 199 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [21396, 21397, 21399] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 199 / Rank 4] Tasks: ['Single QA'] | Lens: [50594] → Tgt Spa: ['0.350'] [Step 199 / Rank 2] Tasks: ['Single QA'] | Lens: [46770] → Tgt Spa: ['0.350'] [Step 199 / Rank 5] Tasks: ['Single QA'] | Lens: [50594] → Tgt Spa: ['0.350'] [Step 199 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [38930] → Tgt Spa: ['1.000'] [Step 199 / Rank 4] Tasks: ['Single QA'] | Lens: [57256] → Tgt Spa: ['0.350'] [Step 199 / Rank 7] Tasks: ['Single QA'] | Lens: [54244] → Tgt Spa: ['0.350'] [Step 199 / Rank 0] Tasks: ['Single QA'] | Lens: [33954] → Tgt Spa: ['0.350'] [Step 199 / Rank 5] Tasks: ['Single QA'] | Lens: [57256] → Tgt Spa: ['0.350'] [Step 199 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38930] → Tgt Spa: ['1.000'] [Step 199 / Rank 1] Tasks: ['Single QA'] | Lens: [33954] → Tgt Spa: ['0.350'] [Step 199 / Rank 6] Tasks: ['Single QA'] | Lens: [54244] → Tgt Spa: ['0.350'] [Step 199 / Rank 6] Tasks: ['Code'] | Lens: [47542] → Tgt Spa: ['1.000'] [Step 199 / Rank 7] Tasks: ['Code'] | Lens: [47542] → Tgt Spa: ['1.000'] [Step 199 / Rank 2] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [19962, 19966, 19963] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 199 / Rank 4] Tasks: ['Single QA'] | Lens: [51912] → Tgt Spa: ['0.350'] [Step 199 / Rank 1] Tasks: ['Single QA'] | Lens: [33046] → Tgt Spa: ['0.350'] [Step 199 / Rank 0] Tasks: ['Single QA'] | Lens: [33046] → Tgt Spa: 
['0.350'] [Step 199 / Rank 3] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [19962, 19966, 19963] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 199 / Rank 5] Tasks: ['Single QA'] | Lens: [51912] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:42:05,593 >> @ 199 | Loss: 2.0006 | LM: 1.9470 | Reg: 0.0535 | Spa(Avg): 0.518 [INFO|lh_trainer.py:797] 2026-02-17 03:42:05,593 >> Statistic -> Code | Spa: 0.698 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 03:42:05,593 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:42:05,593 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:42:05,593 >> Statistic -> Single | Spa: 0.401 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:42:05,593 >> Statistic -> Summarization | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.085 | [INFO|lh_trainer.py:810] 2026-02-17 03:42:05,595 >> [Micro-Log] {"loss": 2.0005883735915027, "lm_loss": 1.947049015512069, "reg_loss": 0.05353937252948526, "model_sparsity(avg)": 0.5178240686655045, "Spa-Code sparsity": 0.6979166666666666, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09597137818733852, "Spa-Single QA sparsity": 0.4010416567325592, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03613526446133619, "Spa-In-Context Learning sparsity": 0.7194444417953492, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10260182917118073, "Spa-Summarization sparsity": 0.7083333134651184, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08508828654885292, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 199, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.1552734375, "lambda4 Code": 
0.255859375} [INFO|lh_trainer.py:331] 2026-02-17 03:42:24,165 >> {'loss': 12.0035, 'grad_norm': 0.5066636204719543, 'learning_rate': 0.0001884617078965841, 'epoch': 0.210637177461822, 'num_input_tokens_seen': 491847958, 'completed': '66.67% (200 / 300)', 'remaining time': '4:40:18', 'throughput': '7623.98', 'gpu_mem_free': '14705MB', 'step': 200} /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . warnings.warn( [INFO|trainer.py:3984] 2026-02-17 03:42:37,237 >> Saving model checkpoint to checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-200 [INFO|configuration_utils.py:419] 2026-02-17 03:42:37,404 >> Configuration saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-200/config.json [INFO|configuration_utils.py:911] 2026-02-17 03:42:37,409 >> Configuration saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-200/generation_config.json [INFO|modeling_utils.py:3580] 2026-02-17 03:43:18,458 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-200/model.safetensors.index.json. 
[INFO|tokenization_utils_base.py:2510] 2026-02-17 03:43:18,465 >> tokenizer config file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-200/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-02-17 03:43:18,470 >> Special tokens file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-200/special_tokens_map.json [INFO|tokenization_utils_base.py:2572] 2026-02-17 03:43:18,472 >> added tokens file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-200/added_tokens.json
[Step 200 / Rank 4] Tasks: ['Code'] | Lens: [39798] → Tgt Spa: ['1.000'] [Step 200 / Rank 6] Tasks: ['Single QA'] | Lens: [65022] → Tgt Spa: ['0.350'] [Step 200 / Rank 2] Tasks: ['Single QA'] | Lens: [39980] → Tgt Spa: ['0.350'] [Step 200 / Rank 7] Tasks: ['Single QA'] | Lens: [65022] → Tgt Spa: ['0.350'] [Step 200 / Rank 3] Tasks: ['Single QA'] | Lens: [39980] → Tgt Spa: ['0.350'] [Step 200 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10257, 10257, 10263, 10258, 10258, 10259] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 200 / Rank 5] Tasks: ['Code'] | Lens: [39798] → Tgt Spa: ['1.000'] [Step 200 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10257, 10257, 10263, 10258, 10258, 10259] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350'] /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/autograd/graph.py:823: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.) 
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [Step 200 / Rank 6] Tasks: ['Single QA'] | Lens: [56320] → Tgt Spa: ['0.350'] [Step 200 / Rank 1] Tasks: ['Single QA'] | Lens: [64912] → Tgt Spa: ['0.350'] [Step 200 / Rank 3] Tasks: ['In-Context Learning', 'Summarization', 'Summarization'] | Lens: [21749, 21769, 21770] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 200 / Rank 7] Tasks: ['Single QA'] | Lens: [56320] → Tgt Spa: ['0.350'] [Step 200 / Rank 4] Tasks: ['Single QA'] | Lens: [58345] → Tgt Spa: ['0.350'] [Step 200 / Rank 0] Tasks: ['Single QA'] | Lens: [64912] → Tgt Spa: ['0.350'] [Step 200 / Rank 2] Tasks: ['In-Context Learning', 'Summarization', 'Summarization'] | Lens: [21749, 21769, 21770] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 200 / Rank 5] Tasks: ['Single QA'] | Lens: [58345] → Tgt Spa: ['0.350'] [Step 200 / Rank 2] Tasks: ['Code', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [6183, 6177, 6178, 6179, 6181, 6183, 6184, 6187, 6195, 6188] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 200 / Rank 7] Tasks: ['Single QA'] | Lens: [45127] → Tgt Spa: ['0.350'] [Step 200 / Rank 4] Tasks: ['Single QA'] | Lens: [64929] → Tgt Spa: ['0.350'] [Step 200 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23420, 23424] → Tgt Spa: ['1.000', '1.000'] [Step 200 / Rank 3] Tasks: ['Code', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [6183, 6177, 6178, 6179, 6181, 6183, 6184, 6187, 6195, 6188] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 200 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23420, 23424] → 
Tgt Spa: ['1.000', '1.000'] [Step 200 / Rank 5] Tasks: ['Single QA'] | Lens: [64929] → Tgt Spa: ['0.350'] [Step 200 / Rank 6] Tasks: ['Single QA'] | Lens: [45127] → Tgt Spa: ['0.350'] [Step 200 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44934] → Tgt Spa: ['1.000'] [Step 200 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32724, 32724] → Tgt Spa: ['0.350', '0.350'] [Step 200 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25507, 25508] → Tgt Spa: ['1.000', '1.000'] [Step 200 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25507, 25508] → Tgt Spa: ['1.000', '1.000'] [Step 200 / Rank 2] Tasks: ['Single QA'] | Lens: [48785] → Tgt Spa: ['0.350'] [Step 200 / Rank 3] Tasks: ['Single QA'] | Lens: [48785] → Tgt Spa: ['0.350'] [Step 200 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44934] → Tgt Spa: ['1.000'] [Step 200 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32724, 32724] → Tgt Spa: ['0.350', '0.350'] [Step 200 / Rank 2] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [17676, 17684, 17684] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 200 / Rank 6] Tasks: ['Single QA'] | Lens: [65033] → Tgt Spa: ['0.350'] [Step 200 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24887, 24888] → Tgt Spa: ['1.000', '0.350'] [Step 200 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32369, 32369] → Tgt Spa: ['0.350', '0.350'] [Step 200 / Rank 7] Tasks: ['Single QA'] | Lens: [65033] → Tgt Spa: ['0.350'] [Step 200 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24887, 24888] → Tgt Spa: ['1.000', '0.350'] [Step 200 / Rank 3] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [17676, 17684, 17684] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 200 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32369, 32369] → Tgt Spa: ['0.350', '0.350'] [Step 200 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17731, 17743, 17731] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 200 / Rank 4] Tasks: 
['Single QA'] | Lens: [48368] → Tgt Spa: ['0.350'] [Step 200 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25541, 25522] → Tgt Spa: ['1.000', '1.000'] [Step 200 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25541, 25522] → Tgt Spa: ['1.000', '1.000'] [Step 200 / Rank 5] Tasks: ['Single QA'] | Lens: [48368] → Tgt Spa: ['0.350'] [Step 200 / Rank 2] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17468, 17481, 17482] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 200 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17731, 17743, 17731] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 200 / Rank 3] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17468, 17481, 17482] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:47:19,056 >> @ 200 | Loss: 2.0035 | LM: 1.9465 | Reg: 0.0570 | Spa(Avg): 0.513 [INFO|lh_trainer.py:797] 2026-02-17 03:47:19,056 >> Statistic -> Code | Spa: 0.701 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 03:47:19,056 >> Statistic -> In-Context | Spa: 0.720 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:47:19,056 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:47:19,056 >> Statistic -> Single | Spa: 0.404 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:47:19,056 >> Statistic -> Summarization | Spa: 0.701 | Tgt: 1.000 | Z-Loss: 0.088 | [INFO|lh_trainer.py:810] 2026-02-17 03:47:19,059 >> [Micro-Log] {"loss": 2.0035059157235082, "lm_loss": 1.9465028230915777, "reg_loss": 0.05700309981572597, "model_sparsity(avg)": 0.5127314801017443, "Spa-Single QA sparsity": 0.40393517166376114, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.041290741869791724, "Spa-Code sparsity": 0.7006172868940566, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09534679187668695, "Spa-In-Context Learning sparsity": 
0.720085464991056, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10242044237943795, "Spa-Summarization sparsity": 0.7013888955116272, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08819259454806645, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 200, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.1552734375, "lambda4 Code": 0.255859375} [INFO|lh_trainer.py:331] 2026-02-17 03:47:35,985 >> {'loss': 12.021, 'grad_norm': 0.5119277834892273, 'learning_rate': 0.00018529527019484594, 'epoch': 0.21169036334913113, 'num_input_tokens_seen': 494483540, 'completed': '67.00% (201 / 300)', 'remaining time': '4:38:40', 'throughput': '4226.12', 'gpu_mem_free': '10351MB', 'step': 201} [Step 201 / Rank 7] Tasks: ['Single QA'] | Lens: [40582] → Tgt Spa: ['0.350'] [Step 201 / Rank 6] Tasks: ['Single QA'] | Lens: [40582] → Tgt Spa: ['0.350'] [Step 201 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19745, 19745, 19757] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 201 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [19745, 19745, 19757] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 201 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26289, 26308] → Tgt Spa: ['1.000', '1.000'] [Step 201 / Rank 0] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Code'] | Lens: [8780, 8791, 8784, 8792, 8786, 8789, 8799] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 201 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26289, 26308] → Tgt Spa: ['1.000', '1.000'] [Step 201 / Rank 1] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Code'] | Lens: [8780, 8791, 8784, 8792, 8786, 8789, 8799] → Tgt Spa: ['0.350', '1.000', 
'0.350', '1.000', '0.350', '0.350', '1.000'] [Step 201 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [22153, 22145] → Tgt Spa: ['1.000', '0.350'] [Step 201 / Rank 7] Tasks: ['Single QA'] | Lens: [51393] → Tgt Spa: ['0.350'] [Step 201 / Rank 3] Tasks: ['Single QA'] | Lens: [44680] → Tgt Spa: ['0.350'] [Step 201 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57817] → Tgt Spa: ['1.000'] [Step 201 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [22153, 22145] → Tgt Spa: ['1.000', '0.350'] [Step 201 / Rank 6] Tasks: ['Single QA'] | Lens: [51393] → Tgt Spa: ['0.350'] [Step 201 / Rank 2] Tasks: ['Single QA'] | Lens: [44680] → Tgt Spa: ['0.350'] [Step 201 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57817] → Tgt Spa: ['1.000'] [Step 201 / Rank 5] Tasks: ['Single QA'] | Lens: [33479] → Tgt Spa: ['0.350'] [Step 201 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [20934, 20925, 20938] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 201 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [20934, 20925, 20938] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 201 / Rank 3] Tasks: ['Summarization'] | Lens: [41352] → Tgt Spa: ['1.000'] [Step 201 / Rank 7] Tasks: ['Code'] | Lens: [35386] → Tgt Spa: ['1.000'] [Step 201 / Rank 4] Tasks: ['Single QA'] | Lens: [33479] → Tgt Spa: ['0.350'] [Step 201 / Rank 2] Tasks: ['Summarization'] | Lens: [41352] → Tgt Spa: ['1.000'] [Step 201 / Rank 6] Tasks: ['Code'] | Lens: [35386] → Tgt Spa: ['1.000'] [Step 201 / Rank 7] Tasks: ['Single QA'] | Lens: [51368] → Tgt Spa: ['0.350'] [Step 201 / Rank 2] Tasks: ['Single QA'] | Lens: [52806] → Tgt Spa: ['0.350'] [Step 201 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [18340, 18339, 18341] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 201 / Rank 3] Tasks: ['Single QA'] | Lens: [52806] → Tgt Spa: ['0.350'] [Step 201 / Rank 6] Tasks: ['Single QA'] | Lens: [51368] → Tgt Spa: ['0.350'] [Step 201 / Rank 4] Tasks: ['In-Context Learning', 
'In-Context Learning'] | Lens: [24894, 24894] → Tgt Spa: ['1.000', '1.000'] [Step 201 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24894, 24894] → Tgt Spa: ['1.000', '1.000'] [Step 201 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [18340, 18339, 18341] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 201 / Rank 5] Tasks: ['Single QA'] | Lens: [41427] → Tgt Spa: ['0.350'] [Step 201 / Rank 4] Tasks: ['Single QA'] | Lens: [41427] → Tgt Spa: ['0.350'] [Step 201 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55985] → Tgt Spa: ['1.000'] [Step 201 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25448, 25449] → Tgt Spa: ['1.000', '0.350'] [Step 201 / Rank 6] Tasks: ['Code'] | Lens: [33894] → Tgt Spa: ['1.000'] [Step 201 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25448, 25449] → Tgt Spa: ['1.000', '0.350'] [Step 201 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55985] → Tgt Spa: ['1.000'] [Step 201 / Rank 7] Tasks: ['Code'] | Lens: [33894] → Tgt Spa: ['1.000'] [Step 201 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [62627] → Tgt Spa: ['1.000'] [Step 201 / Rank 0] Tasks: ['Single QA'] | Lens: [40008] → Tgt Spa: ['0.350'] [Step 201 / Rank 3] Tasks: ['Single QA'] | Lens: [37668] → Tgt Spa: ['0.350'] [Step 201 / Rank 6] Tasks: ['Code'] | Lens: [33760] → Tgt Spa: ['1.000'] [Step 201 / Rank 7] Tasks: ['Code'] | Lens: [33760] → Tgt Spa: ['1.000'] [Step 201 / Rank 2] Tasks: ['Single QA'] | Lens: [37668] → Tgt Spa: ['0.350'] [Step 201 / Rank 1] Tasks: ['Single QA'] | Lens: [40008] → Tgt Spa: ['0.350'] [Step 201 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [62627] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 03:49:43,094 >> @ 201 | Loss: 2.0224 | LM: 1.9552 | Reg: 0.0672 | Spa(Avg): 0.559 [INFO|lh_trainer.py:797] 2026-02-17 03:49:43,094 >> Statistic -> Code | Spa: 0.714 | Tgt: 1.000 | Z-Loss: 0.090 | [INFO|lh_trainer.py:797] 2026-02-17 03:49:43,094 >> Statistic -> 
In-Context | Spa: 0.708 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:49:43,094 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:49:43,094 >> Statistic -> Single | Spa: 0.431 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:49:43,094 >> Statistic -> Summarization | Spa: 0.663 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:810] 2026-02-17 03:49:43,096 >> [Micro-Log] {"loss": 2.022377652426561, "lm_loss": 1.9552049015959103, "reg_loss": 0.06717274823555879, "model_sparsity(avg)": 0.5588073208928108, "Spa-Single QA sparsity": 0.4305555542310079, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.053345830808393654, "Spa-Code sparsity": 0.7138888955116272, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08992919847369193, "Spa-In-Context Learning sparsity": 0.7083333219800677, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10723325610160828, "Spa-Summarization sparsity": 0.663194440305233, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1058741444721818, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 201, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.1552734375, "lambda4 Code": 0.255859375} [INFO|lh_trainer.py:331] 2026-02-17 03:50:08,065 >> {'loss': 12.1343, 'grad_norm': 0.6364587545394897, 'learning_rate': 0.00018213991931974273, 'epoch': 0.21274354923644023, 'num_input_tokens_seen': 496784334, 'completed': '67.33% (202 / 300)', 'remaining time': '4:35:43', 'throughput': '7564.43', 'gpu_mem_free': '11981MB', 'step': 202} [Step 202 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43658] → Tgt Spa: ['1.000'] [Step 202 / Rank 3] Tasks: ['Single QA'] | Lens: [34950] → Tgt Spa: ['0.350'] 
[Step 202 / Rank 2] Tasks: ['Single QA'] | Lens: [34950] → Tgt Spa: ['0.350'] [Step 202 / Rank 0] Tasks: ['Single QA', 'Summarization'] | Lens: [32589, 32611] → Tgt Spa: ['0.350', '1.000'] [Step 202 / Rank 4] Tasks: ['Single QA'] | Lens: [38384] → Tgt Spa: ['0.350'] [Step 202 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43658] → Tgt Spa: ['1.000'] [Step 202 / Rank 5] Tasks: ['Single QA'] | Lens: [38384] → Tgt Spa: ['0.350'] [Step 202 / Rank 1] Tasks: ['Single QA', 'Summarization'] | Lens: [32589, 32611] → Tgt Spa: ['0.350', '1.000'] [Step 202 / Rank 0] Tasks: ['Single QA'] | Lens: [33685] → Tgt Spa: ['0.350'] [Step 202 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23387, 23386] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [31537, 31531] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 1] Tasks: ['Single QA'] | Lens: [33685] → Tgt Spa: ['0.350'] [Step 202 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [31537, 31531] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23387, 23386] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 2] Tasks: ['Summarization'] | Lens: [41160] → Tgt Spa: ['1.000'] [Step 202 / Rank 3] Tasks: ['Summarization'] | Lens: [41160] → Tgt Spa: ['1.000'] [Step 202 / Rank 3] Tasks: ['Code'] | Lens: [44691] → Tgt Spa: ['1.000'] [Step 202 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25296, 25296] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29676, 29676] → Tgt Spa: ['0.350', '0.350'] [Step 202 / Rank 4] Tasks: ['Code'] | Lens: [43484] → Tgt Spa: ['1.000'] [Step 202 / Rank 2] Tasks: ['Code'] | Lens: [44691] → Tgt Spa: ['1.000'] [Step 202 / Rank 5] Tasks: ['Code'] | Lens: [43484] → Tgt Spa: ['1.000'] [Step 202 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29676, 29676] → Tgt Spa: ['0.350', '0.350'] [Step 202 / Rank 7] 
Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25296, 25296] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [45141] → Tgt Spa: ['1.000'] [Step 202 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [18760, 18760, 18760] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 202 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56710] → Tgt Spa: ['1.000'] [Step 202 / Rank 0] Tasks: ['Code'] | Lens: [57957] → Tgt Spa: ['1.000'] [Step 202 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [18760, 18760, 18760] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 202 / Rank 1] Tasks: ['Code'] | Lens: [57957] → Tgt Spa: ['1.000'] [Step 202 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56710] → Tgt Spa: ['1.000'] [Step 202 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [45141] → Tgt Spa: ['1.000'] [Step 202 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23574, 23556] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23574, 23556] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 6] Tasks: ['Single QA'] | Lens: [52510] → Tgt Spa: ['0.350'] [Step 202 / Rank 1] Tasks: ['Single QA'] | Lens: [46294] → Tgt Spa: ['0.350'] [Step 202 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25648, 25648] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25648, 25648] → Tgt Spa: ['1.000', '1.000'] [Step 202 / Rank 0] Tasks: ['Single QA'] | Lens: [46294] → Tgt Spa: ['0.350'] [Step 202 / Rank 7] Tasks: ['Single QA'] | Lens: [52510] → Tgt Spa: ['0.350'] [Step 202 / Rank 1] Tasks: ['Single QA'] | Lens: [48520] → Tgt Spa: ['0.350'] [Step 202 / Rank 2] Tasks: ['Single QA'] | Lens: [37322] → Tgt Spa: ['0.350'] [Step 202 / Rank 5] Tasks: ['Single QA'] | Lens: [49238] → Tgt Spa: ['0.350'] [Step 202 / Rank 4] Tasks: ['Single QA'] | Lens: [49238] → Tgt Spa: ['0.350'] 
[Step 202 / Rank 6] Tasks: ['Single QA'] | Lens: [55723] → Tgt Spa: ['0.350'] [Step 202 / Rank 0] Tasks: ['Single QA'] | Lens: [48520] → Tgt Spa: ['0.350'] [Step 202 / Rank 7] Tasks: ['Single QA'] | Lens: [55723] → Tgt Spa: ['0.350'] [Step 202 / Rank 3] Tasks: ['Single QA'] | Lens: [37322] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:52:15,198 >> @ 202 | Loss: 2.1634 | LM: 2.1069 | Reg: 0.0565 | Spa(Avg): 0.544 [INFO|lh_trainer.py:797] 2026-02-17 03:52:15,198 >> Statistic -> Code | Spa: 0.712 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 03:52:15,198 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:52:15,198 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:52:15,198 >> Statistic -> Single | Spa: 0.364 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:52:15,199 >> Statistic -> Summarization | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:810] 2026-02-17 03:52:15,201 >> [Micro-Log] {"loss": 2.1633717070023217, "lm_loss": 2.106902634104093, "reg_loss": 0.056469093935447745, "model_sparsity(avg)": 0.5443672786156336, "Spa-Single QA sparsity": 0.36388887961705524, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.010993566503748298, "Spa-Summarization sparsity": 0.6805555621782938, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09868793934583664, "Spa-Code sparsity": 0.7118055522441864, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09131843969225883, "Spa-In-Context Learning sparsity": 0.7146464532071893, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10467926886948672, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 202, "current_tau": 1.0, "lambda1 Single QA": 
0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.15625, "lambda4 Code": 0.2578125} [INFO|lh_trainer.py:331] 2026-02-17 03:52:36,141 >> {'loss': 12.9802, 'grad_norm': 0.6135685443878174, 'learning_rate': 0.00017899619592440298, 'epoch': 0.21379673512374933, 'num_input_tokens_seen': 499122570, 'completed': '67.67% (203 / 300)', 'remaining time': '4:32:45', 'throughput': '7895.41', 'gpu_mem_free': '9771MB', 'step': 203} [Step 203 / Rank 2] Tasks: ['Single QA'] | Lens: [59672] → Tgt Spa: ['0.350'] [Step 203 / Rank 3] Tasks: ['Single QA'] | Lens: [59672] → Tgt Spa: ['0.350'] [Step 203 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27169, 27169] → Tgt Spa: ['0.350', '1.000'] [Step 203 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [53023] → Tgt Spa: ['1.000'] [Step 203 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27169, 27169] → Tgt Spa: ['0.350', '1.000'] [Step 203 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [53023] → Tgt Spa: ['1.000'] [Step 203 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20308, 20310, 20301] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 203 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20308, 20310, 20301] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 203 / Rank 3] Tasks: ['Code'] | Lens: [46539] → Tgt Spa: ['1.000'] [Step 203 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28982, 28983] → Tgt Spa: ['1.000', '1.000'] [Step 203 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [60883] → Tgt Spa: ['1.000'] [Step 203 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [60883] → Tgt Spa: ['1.000'] [Step 203 / Rank 2] Tasks: ['Code'] | Lens: [46539] → Tgt Spa: ['1.000'] [Step 203 / Rank 7] Tasks: ['Single QA'] | Lens: [36318] → Tgt Spa: ['0.350'] [Step 203 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28982, 28983] → Tgt Spa: ['1.000', '1.000'] [Step 203 / Rank 6] Tasks: ['Single QA'] | Lens: 
[36318] → Tgt Spa: ['0.350'] [Step 203 / Rank 5] Tasks: ['Single QA'] | Lens: [51194] → Tgt Spa: ['0.350'] [Step 203 / Rank 7] Tasks: ['Single QA'] | Lens: [65359] → Tgt Spa: ['0.350'] [Step 203 / Rank 0] Tasks: ['Summarization'] | Lens: [34099] → Tgt Spa: ['1.000'] [Step 203 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45414] → Tgt Spa: ['1.000'] [Step 203 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [45414] → Tgt Spa: ['1.000'] [Step 203 / Rank 6] Tasks: ['Single QA'] | Lens: [65359] → Tgt Spa: ['0.350'] [Step 203 / Rank 4] Tasks: ['Single QA'] | Lens: [51194] → Tgt Spa: ['0.350'] [Step 203 / Rank 1] Tasks: ['Summarization'] | Lens: [34099] → Tgt Spa: ['1.000'] [Step 203 / Rank 5] Tasks: ['Single QA'] | Lens: [42626] → Tgt Spa: ['0.350'] [Step 203 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43230] → Tgt Spa: ['1.000'] [Step 203 / Rank 2] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15354, 15355, 15356, 15358] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 203 / Rank 4] Tasks: ['Single QA'] | Lens: [42626] → Tgt Spa: ['0.350'] [Step 203 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43230] → Tgt Spa: ['1.000'] [Step 203 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [30253, 30261] → Tgt Spa: ['0.350', '1.000'] [Step 203 / Rank 3] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15354, 15355, 15356, 15358] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 203 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [30253, 30261] → Tgt Spa: ['0.350', '1.000'] [Step 203 / Rank 4] Tasks: ['Single QA'] | Lens: [39797] → Tgt Spa: ['0.350'] [Step 203 / Rank 7] Tasks: ['Single QA'] | Lens: [49742] → Tgt Spa: ['0.350'] [Step 203 / Rank 2] Tasks: ['Single QA'] | Lens: [43606] → Tgt Spa: ['0.350'] [Step 203 / Rank 1] Tasks: ['Code', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [6229, 6228, 6229, 6222, 6223, 6230, 6225, 6227, 6227, 6229] → 
Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 203 / Rank 5] Tasks: ['Single QA'] | Lens: [39797] → Tgt Spa: ['0.350'] [Step 203 / Rank 0] Tasks: ['Code', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [6229, 6228, 6229, 6222, 6223, 6230, 6225, 6227, 6227, 6229] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 203 / Rank 6] Tasks: ['Single QA'] | Lens: [49742] → Tgt Spa: ['0.350'] [Step 203 / Rank 3] Tasks: ['Single QA'] | Lens: [43606] → Tgt Spa: ['0.350'] [Step 203 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45899] → Tgt Spa: ['1.000'] [Step 203 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45899] → Tgt Spa: ['1.000'] [Step 203 / Rank 1] Tasks: ['Single QA'] | Lens: [65359] → Tgt Spa: ['0.350'] [Step 203 / Rank 0] Tasks: ['Single QA'] | Lens: [65359] → Tgt Spa: ['0.350'] [Step 203 / Rank 3] Tasks: ['Code'] | Lens: [44876] → Tgt Spa: ['1.000'] [Step 203 / Rank 2] Tasks: ['Code'] | Lens: [44876] → Tgt Spa: ['1.000'] [Step 203 / Rank 7] Tasks: ['Single QA'] | Lens: [54181] → Tgt Spa: ['0.350'] [Step 203 / Rank 6] Tasks: ['Single QA'] | Lens: [54181] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:55:05,695 >> @ 203 | Loss: 2.1800 | LM: 2.1087 | Reg: 0.0713 | Spa(Avg): 0.531 [INFO|lh_trainer.py:797] 2026-02-17 03:55:05,695 >> Statistic -> Code | Spa: 0.684 | Tgt: 1.000 | Z-Loss: 0.103 | [INFO|lh_trainer.py:797] 2026-02-17 03:55:05,695 >> Statistic -> In-Context | Spa: 0.717 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:55:05,695 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:55:05,695 >> Statistic -> Single | Spa: 0.441 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:55:05,695 >> Statistic -> Summarization | Spa: 0.588 | Tgt: 1.000 | Z-Loss: 
0.151 | [INFO|lh_trainer.py:810] 2026-02-17 03:55:05,698 >> [Micro-Log] {"loss": 2.1799698459605374, "lm_loss": 2.1086662194381156, "reg_loss": 0.07130364296608604, "model_sparsity(avg)": 0.5305073286096255, "Spa-Summarization sparsity": 0.5879629850387573, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.150854654610157, "Spa-Code sparsity": 0.6836419767803616, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10290475189685822, "Spa-In-Context Learning sparsity": 0.7166666507720947, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10388636440038682, "Spa-Single QA sparsity": 0.44078946427295085, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06518392717024606, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 203, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.15625, "lambda4 Code": 0.2578125} [INFO|lh_trainer.py:331] 2026-02-17 03:55:32,882 >> {'loss': 13.0798, 'grad_norm': 0.6455655694007874, 'learning_rate': 0.00017586463866964668, 'epoch': 0.21484992101105846, 'num_input_tokens_seen': 501601060, 'completed': '68.00% (204 / 300)', 'remaining time': '4:30:00', 'throughput': '7011.63', 'gpu_mem_free': '3723MB', 'step': 204} [Step 204 / Rank 5] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [11340, 11342, 11337, 11345, 11346] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000'] [Step 204 / Rank 6] Tasks: ['Code'] | Lens: [53956] → Tgt Spa: ['1.000'] [Step 204 / Rank 0] Tasks: ['Single QA'] | Lens: [49216] → Tgt Spa: ['0.350'] [Step 204 / Rank 1] Tasks: ['Single QA'] | Lens: [49216] → Tgt Spa: ['0.350'] [Step 204 / Rank 4] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [11340, 11342, 11337, 11345, 11346] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000'] 
[Step 204 / Rank 3] Tasks: ['Code', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [8745, 8738, 8745, 8741, 8743, 8751, 8757] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 204 / Rank 7] Tasks: ['Code'] | Lens: [53956] → Tgt Spa: ['1.000'] [Step 204 / Rank 2] Tasks: ['Code', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [8745, 8738, 8745, 8741, 8743, 8751, 8757] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 204 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41725] → Tgt Spa: ['1.000'] [Step 204 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41725] → Tgt Spa: ['1.000'] [Step 204 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [28086, 28094] → Tgt Spa: ['1.000', '1.000'] [Step 204 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59150] → Tgt Spa: ['1.000'] [Step 204 / Rank 5] Tasks: ['Single QA'] | Lens: [48511] → Tgt Spa: ['0.350'] [Step 204 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [28086, 28094] → Tgt Spa: ['1.000', '1.000'] [Step 204 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59150] → Tgt Spa: ['1.000'] [Step 204 / Rank 4] Tasks: ['Single QA'] | Lens: [48511] → Tgt Spa: ['0.350'] [Step 204 / Rank 7] Tasks: ['Single QA'] | Lens: [44223] → Tgt Spa: ['0.350'] [Step 204 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23528, 23546] → Tgt Spa: ['1.000', '1.000'] [Step 204 / Rank 2] Tasks: ['Single QA'] | Lens: [36637] → Tgt Spa: ['0.350'] [Step 204 / Rank 1] Tasks: ['Single QA', 'Code', 'Code'] | Lens: [19972, 19980, 19988] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 204 / Rank 3] Tasks: ['Single QA'] | Lens: [36637] → Tgt Spa: ['0.350'] [Step 204 / Rank 6] Tasks: ['Single QA'] | Lens: [44223] → Tgt Spa: ['0.350'] [Step 204 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23528, 23546] → Tgt Spa: ['1.000', '1.000'] [Step 204 / Rank 0] Tasks: ['Single QA', 'Code', 
'Code'] | Lens: [19972, 19980, 19988] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 204 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39810] → Tgt Spa: ['1.000'] [Step 204 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [18702, 18703, 18709] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 204 / Rank 0] Tasks: ['Code'] | Lens: [46052] → Tgt Spa: ['1.000'] [Step 204 / Rank 1] Tasks: ['Code'] | Lens: [46052] → Tgt Spa: ['1.000'] [Step 204 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39810] → Tgt Spa: ['1.000'] [Step 204 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [18702, 18703, 18709] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 204 / Rank 6] Tasks: ['Single QA'] | Lens: [55868] → Tgt Spa: ['0.350'] [Step 204 / Rank 7] Tasks: ['Single QA'] | Lens: [55868] → Tgt Spa: ['0.350'] [Step 204 / Rank 5] Tasks: ['Single QA'] | Lens: [36625] → Tgt Spa: ['0.350'] [Step 204 / Rank 1] Tasks: ['Single QA'] | Lens: [50735] → Tgt Spa: ['0.350'] [Step 204 / Rank 7] Tasks: ['Single QA'] | Lens: [34211] → Tgt Spa: ['0.350'] [Step 204 / Rank 6] Tasks: ['Single QA'] | Lens: [34211] → Tgt Spa: ['0.350'] [Step 204 / Rank 4] Tasks: ['Single QA'] | Lens: [36625] → Tgt Spa: ['0.350'] [Step 204 / Rank 0] Tasks: ['Single QA'] | Lens: [50735] → Tgt Spa: ['0.350'] [Step 204 / Rank 3] Tasks: ['Single QA'] | Lens: [53030] → Tgt Spa: ['0.350'] [Step 204 / Rank 2] Tasks: ['Single QA'] | Lens: [53030] → Tgt Spa: ['0.350'] [Step 204 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24492, 24511] → Tgt Spa: ['1.000', '1.000'] [Step 204 / Rank 2] Tasks: ['Single QA'] | Lens: [65020] → Tgt Spa: ['0.350'] [Step 204 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20541, 20545, 20537] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 204 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20541, 20545, 20537] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 204 / Rank 3] Tasks: ['Single QA'] | Lens: [65020] → Tgt Spa: ['0.350'] [Step 204 / Rank 4] 
Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24492, 24511] → Tgt Spa: ['1.000', '1.000'] [Step 204 / Rank 1] Tasks: ['Single QA'] | Lens: [45090] → Tgt Spa: ['0.350'] [Step 204 / Rank 0] Tasks: ['Single QA'] | Lens: [45090] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 03:57:48,292 >> @ 204 | Loss: 1.9383 | LM: 1.8740 | Reg: 0.0643 | Spa(Avg): 0.544 [INFO|lh_trainer.py:797] 2026-02-17 03:57:48,292 >> Statistic -> Code | Spa: 0.714 | Tgt: 1.000 | Z-Loss: 0.090 | [INFO|lh_trainer.py:797] 2026-02-17 03:57:48,292 >> Statistic -> In-Context | Spa: 0.704 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:57:48,292 >> Statistic -> MultiHop | Spa: 0.608 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:57:48,292 >> Statistic -> Single | Spa: 0.430 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 03:57:48,292 >> Statistic -> Summarization | Spa: 0.625 | Tgt: 1.000 | Z-Loss: 0.128 | [INFO|lh_trainer.py:810] 2026-02-17 03:57:48,294 >> [Micro-Log] {"loss": 1.9383306329449017, "lm_loss": 1.874007682626446, "reg_loss": 0.06432297003630083, "model_sparsity(avg)": 0.544185404976209, "Spa-Single QA sparsity": 0.42973856014363904, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.052135191645528024, "Spa-In-Context Learning sparsity": 0.7037037014961243, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10928081224362056, "Spa-Code sparsity": 0.7144097164273262, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09037985745817423, "Spa-Summarization sparsity": 0.6250000149011612, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12754086405038834, "Spa-MultiHop QA sparsity": 0.6083333373069764, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11044725235551596, "step": 204, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 
Summarization": 0.15625, "lambda4 Code": 0.2578125} [INFO|lh_trainer.py:331] 2026-02-17 03:58:15,012 >> {'loss': 11.63, 'grad_norm': 0.5825475454330444, 'learning_rate': 0.00017274578413168805, 'epoch': 0.21590310689836756, 'num_input_tokens_seen': 504016506, 'completed': '68.33% (205 / 300)', 'remaining time': '4:27:08', 'throughput': '7449.12', 'gpu_mem_free': '11435MB', 'step': 205} [Step 205 / Rank 2] Tasks: ['Single QA'] | Lens: [59076] → Tgt Spa: ['0.350'] [Step 205 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [21899, 21893] → Tgt Spa: ['1.000', '1.000'] [Step 205 / Rank 3] Tasks: ['Single QA'] | Lens: [59076] → Tgt Spa: ['0.350'] [Step 205 / Rank 6] Tasks: ['Code'] | Lens: [62251] → Tgt Spa: ['1.000'] [Step 205 / Rank 0] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1429, 1429, 1428, 1427, 1428, 1447, 1447, 1447, 1430, 1430, 1430, 1430, 1430, 1430, 1431, 1452, 1432, 1451, 1451, 1434, 1435, 1434, 1452, 1434, 1434, 1433, 1453, 1435, 1435, 1435, 1435, 1455, 1436, 1437, 1437, 1437, 1438, 1456, 1457, 1457, 1438, 1438, 1438, 1457, 1458] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', 
'0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 205 / Rank 7] Tasks: ['Code'] | Lens: [62251] → Tgt Spa: ['1.000'] [Step 205 / Rank 1] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1429, 1429, 1428, 1427, 1428, 1447, 1447, 1447, 1430, 1430, 1430, 1430, 1430, 1430, 1431, 1452, 1432, 1451, 1451, 1434, 1435, 1434, 1452, 1434, 1434, 1433, 1453, 1435, 1435, 1435, 1435, 1455, 1436, 1437, 1437, 1437, 1438, 1456, 1457, 1457, 1438, 1438, 1438, 1457, 1458] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 205 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [21899, 21893] → Tgt Spa: ['1.000', '1.000'] [Step 205 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45145] → Tgt Spa: ['1.000'] [Step 205 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24389, 24390] → Tgt Spa: ['1.000', '1.000'] [Step 205 
/ Rank 2] Tasks: ['In-Context Learning'] | Lens: [45145] → Tgt Spa: ['1.000'] [Step 205 / Rank 1] Tasks: ['Single QA'] | Lens: [60519] → Tgt Spa: ['0.350'] [Step 205 / Rank 7] Tasks: ['Single QA'] | Lens: [62720] → Tgt Spa: ['0.350'] [Step 205 / Rank 6] Tasks: ['Single QA'] | Lens: [62720] → Tgt Spa: ['0.350'] [Step 205 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24389, 24390] → Tgt Spa: ['1.000', '1.000'] [Step 205 / Rank 0] Tasks: ['Single QA'] | Lens: [60519] → Tgt Spa: ['0.350'] [Step 205 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17135, 17145, 17137] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 205 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [31690, 31685] → Tgt Spa: ['1.000', '1.000'] [Step 205 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [28981, 28963] → Tgt Spa: ['1.000', '1.000'] [Step 205 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17135, 17145, 17137] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 205 / Rank 1] Tasks: ['Single QA'] | Lens: [39647] → Tgt Spa: ['0.350'] [Step 205 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [31690, 31685] → Tgt Spa: ['1.000', '1.000'] [Step 205 / Rank 0] Tasks: ['Single QA'] | Lens: [39647] → Tgt Spa: ['0.350'] [Step 205 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [28981, 28963] → Tgt Spa: ['1.000', '1.000'] [Step 205 / Rank 3] Tasks: ['Code'] | Lens: [62044] → Tgt Spa: ['1.000'] [Step 205 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44148] → Tgt Spa: ['1.000'] [Step 205 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44148] → Tgt Spa: ['1.000'] [Step 205 / Rank 5] Tasks: ['Single QA'] | Lens: [65001] → Tgt Spa: ['0.350'] [Step 205 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44159] → Tgt Spa: ['1.000'] [Step 205 / Rank 4] Tasks: ['Single QA'] | Lens: [65001] → Tgt Spa: ['0.350'] [Step 205 / Rank 2] Tasks: ['Code'] | Lens: [62044] → Tgt Spa: ['1.000'] [Step 205 / Rank 0] Tasks: 
['In-Context Learning'] | Lens: [44159] → Tgt Spa: ['1.000'] [Step 205 / Rank 6] Tasks: ['Code'] | Lens: [62520] → Tgt Spa: ['1.000'] [Step 205 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [32597, 32588] → Tgt Spa: ['1.000', '0.350'] [Step 205 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [44717] → Tgt Spa: ['1.000'] [Step 205 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [32597, 32588] → Tgt Spa: ['1.000', '0.350'] [Step 205 / Rank 3] Tasks: ['Single QA'] | Lens: [65363] → Tgt Spa: ['0.350'] [Step 205 / Rank 2] Tasks: ['Single QA'] | Lens: [65363] → Tgt Spa: ['0.350'] [Step 205 / Rank 7] Tasks: ['Code'] | Lens: [62520] → Tgt Spa: ['1.000'] [Step 205 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [44717] → Tgt Spa: ['1.000'] [Step 205 / Rank 4] Tasks: ['Single QA'] | Lens: [43627] → Tgt Spa: ['0.350'] [Step 205 / Rank 5] Tasks: ['Single QA'] | Lens: [43627] → Tgt Spa: ['0.350'] [Step 205 / Rank 7] Tasks: ['Single QA'] | Lens: [35707] → Tgt Spa: ['0.350'] [Step 205 / Rank 3] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Summarization', 'Code'] | Lens: [9192, 9201, 9196, 9198, 9199, 9219, 9210] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 205 / Rank 0] Tasks: ['Single QA'] | Lens: [37504] → Tgt Spa: ['0.350'] [Step 205 / Rank 1] Tasks: ['Single QA'] | Lens: [37504] → Tgt Spa: ['0.350'] [Step 205 / Rank 6] Tasks: ['Single QA'] | Lens: [35707] → Tgt Spa: ['0.350'] [Step 205 / Rank 2] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Summarization', 'Code'] | Lens: [9192, 9201, 9196, 9198, 9199, 9219, 9210] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:00:59,559 >> @ 205 | Loss: 2.0214 | LM: 1.9527 | Reg: 0.0687 | Spa(Avg): 0.558 [INFO|lh_trainer.py:797] 2026-02-17 04:00:59,560 >> Statistic -> Code | Spa: 0.696 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 04:00:59,560 >> Statistic -> In-Context | 
Spa: 0.704 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:00:59,560 >> Statistic -> MultiHop | Spa: 0.553 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:00:59,560 >> Statistic -> Single | Spa: 0.432 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:00:59,560 >> Statistic -> Summarization | Spa: 0.645 | Tgt: 1.000 | Z-Loss: 0.117 | [INFO|lh_trainer.py:810] 2026-02-17 04:00:59,562 >> [Micro-Log] {"loss": 2.0213808678090572, "lm_loss": 1.9526687562465668, "reg_loss": 0.06871209545837094, "model_sparsity(avg)": 0.5584031442801157, "Spa-MultiHop QA sparsity": 0.5533154087681924, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08116257238772608, "Spa-Summarization sparsity": 0.6446078419685364, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11738031313699834, "Spa-Single QA sparsity": 0.43154762046677725, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05636623428602304, "Spa-In-Context Learning sparsity": 0.7037037147416009, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10978491355975468, "Spa-Code sparsity": 0.6958333432674408, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09760052487254142, "step": 205, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.1572265625, "lambda4 Code": 0.2578125} [INFO|lh_trainer.py:331] 2026-02-17 04:01:13,945 >> {'loss': 12.1283, 'grad_norm': 0.6855435967445374, 'learning_rate': 0.0001696401667101963, 'epoch': 0.21695629278567669, 'num_input_tokens_seen': 506604150, 'completed': '68.67% (206 / 300)', 'remaining time': '4:24:24', 'throughput': '7230.76', 'gpu_mem_free': '14147MB', 'step': 206} [Step 206 / Rank 4] Tasks: ['Code'] | Lens: [38391] → Tgt Spa: ['1.000'] [Step 206 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [60011] → Tgt Spa: ['1.000'] [Step 206 / Rank 0] 
Tasks: ['In-Context Learning'] | Lens: [46207] → Tgt Spa: ['1.000'] [Step 206 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57874] → Tgt Spa: ['1.000'] [Step 206 / Rank 5] Tasks: ['Code'] | Lens: [38391] → Tgt Spa: ['1.000'] [Step 206 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57874] → Tgt Spa: ['1.000'] [Step 206 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [46207] → Tgt Spa: ['1.000'] [Step 206 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [60011] → Tgt Spa: ['1.000'] [Step 206 / Rank 0] Tasks: ['Single QA'] | Lens: [51698] → Tgt Spa: ['0.350'] [Step 206 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [16041, 16043, 16043, 16043] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 206 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Code', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA'] | Lens: [2976, 2976, 2964, 2958, 2958, 2961, 2961, 2977, 2962, 2961, 2961, 2961, 2978, 2978, 2967, 2960, 2963, 2979, 2961, 2961, 2962, 2964] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 206 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40209] → Tgt Spa: ['1.000'] [Step 206 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40209] → Tgt Spa: ['1.000'] [Step 206 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [16041, 16043, 16043, 16043] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 206 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Code', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 
'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA'] | Lens: [2976, 2976, 2964, 2958, 2958, 2961, 2961, 2977, 2962, 2961, 2961, 2961, 2978, 2978, 2967, 2960, 2963, 2979, 2961, 2961, 2962, 2964] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 206 / Rank 1] Tasks: ['Single QA'] | Lens: [51698] → Tgt Spa: ['0.350'] [Step 206 / Rank 3] Tasks: ['Single QA'] | Lens: [54045] → Tgt Spa: ['0.350'] [Step 206 / Rank 2] Tasks: ['Single QA'] | Lens: [54045] → Tgt Spa: ['0.350'] [Step 206 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25765, 25767] → Tgt Spa: ['1.000', '1.000'] [Step 206 / Rank 4] Tasks: ['Single QA'] | Lens: [49540] → Tgt Spa: ['0.350'] [Step 206 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25765, 25767] → Tgt Spa: ['1.000', '1.000'] [Step 206 / Rank 5] Tasks: ['Single QA'] | Lens: [49540] → Tgt Spa: ['0.350'] [Step 206 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12737, 12737, 12738, 12738, 12739] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 206 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12737, 12737, 12738, 12738, 12739] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 206 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57638] → Tgt Spa: ['1.000'] [Step 206 / Rank 2] Tasks: ['Code'] | Lens: [41012] → Tgt Spa: ['1.000'] [Step 206 / Rank 1] Tasks: ['Single QA'] | Lens: [34339] → Tgt Spa: ['0.350'] [Step 206 / Rank 7] Tasks: ['Single QA'] | Lens: [43250] → Tgt Spa: ['0.350'] [Step 206 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57638] → Tgt Spa: ['1.000'] [Step 206 / Rank 0] 
Tasks: ['Single QA'] | Lens: [34339] → Tgt Spa: ['0.350'] [Step 206 / Rank 3] Tasks: ['Code'] | Lens: [41012] → Tgt Spa: ['1.000'] [Step 206 / Rank 6] Tasks: ['Single QA'] | Lens: [43250] → Tgt Spa: ['0.350'] [Step 206 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29212, 29212] → Tgt Spa: ['0.350', '0.350'] [Step 206 / Rank 7] Tasks: ['Single QA'] | Lens: [41024] → Tgt Spa: ['0.350'] [Step 206 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29212, 29212] → Tgt Spa: ['0.350', '0.350'] [Step 206 / Rank 0] Tasks: ['Single QA'] | Lens: [58644] → Tgt Spa: ['0.350'] [Step 206 / Rank 1] Tasks: ['Single QA'] | Lens: [58644] → Tgt Spa: ['0.350'] [Step 206 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22473, 22454] → Tgt Spa: ['1.000', '1.000'] [Step 206 / Rank 6] Tasks: ['Single QA'] | Lens: [41024] → Tgt Spa: ['0.350'] [Step 206 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22473, 22454] → Tgt Spa: ['1.000', '1.000'] [Step 206 / Rank 1] Tasks: ['Single QA'] | Lens: [47366] → Tgt Spa: ['0.350'] [Step 206 / Rank 4] Tasks: ['Single QA'] | Lens: [54778] → Tgt Spa: ['0.350'] [Step 206 / Rank 5] Tasks: ['Single QA'] | Lens: [54778] → Tgt Spa: ['0.350'] [Step 206 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24054, 24055] → Tgt Spa: ['1.000', '0.350'] [Step 206 / Rank 0] Tasks: ['Single QA'] | Lens: [47366] → Tgt Spa: ['0.350'] [Step 206 / Rank 3] Tasks: ['Single QA'] | Lens: [52663] → Tgt Spa: ['0.350'] [Step 206 / Rank 2] Tasks: ['Single QA'] | Lens: [52663] → Tgt Spa: ['0.350'] [Step 206 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24054, 24055] → Tgt Spa: ['1.000', '0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:03:39,648 >> @ 206 | Loss: 2.1240 | LM: 2.0710 | Reg: 0.0530 | Spa(Avg): 0.507 [INFO|lh_trainer.py:797] 2026-02-17 04:03:39,648 >> Statistic -> Code | Spa: 0.719 | Tgt: 1.000 | Z-Loss: 0.089 | [INFO|lh_trainer.py:797] 2026-02-17 04:03:39,649 >> Statistic -> In-Context | Spa: 0.711 
| Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:03:39,649 >> Statistic -> MultiHop | Spa: 0.642 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:03:39,649 >> Statistic -> Single | Spa: 0.381 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:03:39,649 >> Statistic -> Summarization | Spa: 0.599 | Tgt: 1.000 | Z-Loss: 0.141 | [INFO|lh_trainer.py:810] 2026-02-17 04:03:39,651 >> [Micro-Log] {"loss": 2.124038809289535, "lm_loss": 2.071034710854292, "reg_loss": 0.05300410033669323, "model_sparsity(avg)": 0.5065551313261191, "Spa-In-Context Learning sparsity": 0.7106481492519379, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1069809403270483, "Spa-Single QA sparsity": 0.381365733842055, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.023199501427977037, "Spa-Summarization sparsity": 0.5992063539368766, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14106290574584687, "Spa-Code sparsity": 0.71875, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.08873918280005455, "Spa-MultiHop QA sparsity": 0.6419753167364333, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12461001757118437, "step": 206, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.1572265625, "lambda4 Code": 0.2578125} [INFO|lh_trainer.py:331] 2026-02-17 04:03:59,929 >> {'loss': 12.7442, 'grad_norm': 0.5899667143821716, 'learning_rate': 0.00016654831853672876, 'epoch': 0.21800947867298578, 'num_input_tokens_seen': 509053728, 'completed': '69.00% (207 / 300)', 'remaining time': '4:21:34', 'throughput': '7378.93', 'gpu_mem_free': '10587MB', 'step': 207} [Step 207 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58531] → Tgt Spa: ['1.000'] [Step 207 / Rank 2] Tasks: ['Single QA'] | Lens: [39993] → Tgt Spa: ['0.350'] [Step 207 / Rank 4] Tasks: ['In-Context 
Learning'] | Lens: [58531] → Tgt Spa: ['1.000'] [Step 207 / Rank 0] Tasks: ['Single QA'] | Lens: [64592] → Tgt Spa: ['0.350'] [Step 207 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61451] → Tgt Spa: ['1.000'] [Step 207 / Rank 3] Tasks: ['Single QA'] | Lens: [39993] → Tgt Spa: ['0.350'] [Step 207 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61451] → Tgt Spa: ['1.000'] [Step 207 / Rank 1] Tasks: ['Single QA'] | Lens: [64592] → Tgt Spa: ['0.350'] [Step 207 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28112, 28112] → Tgt Spa: ['1.000', '1.000'] [Step 207 / Rank 4] Tasks: ['Code'] | Lens: [39030] → Tgt Spa: ['1.000'] [Step 207 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28112, 28112] → Tgt Spa: ['1.000', '1.000'] [Step 207 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [20379, 20379, 20397] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 207 / Rank 5] Tasks: ['Code'] | Lens: [39030] → Tgt Spa: ['1.000'] [Step 207 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18376, 18376, 18388] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 207 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18376, 18376, 18388] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 207 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Summarization'] | Lens: [20379, 20379, 20397] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 207 / Rank 4] Tasks: ['Single QA'] | Lens: [58397] → Tgt Spa: ['0.350'] [Step 207 / Rank 1] Tasks: ['Code'] | Lens: [40735] → Tgt Spa: ['1.000'] [Step 207 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42824] → Tgt Spa: ['1.000'] [Step 207 / Rank 3] Tasks: ['Single QA'] | Lens: [55896] → Tgt Spa: ['0.350'] [Step 207 / Rank 5] Tasks: ['Single QA'] | Lens: [58397] → Tgt Spa: ['0.350'] [Step 207 / Rank 2] Tasks: ['Single QA'] | Lens: [55896] → Tgt Spa: ['0.350'] [Step 207 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42824] → Tgt Spa: ['1.000'] [Step 207 / Rank 0] Tasks: ['Code'] | 
Lens: [40735] → Tgt Spa: ['1.000'] [Step 207 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15854, 15855, 15855, 15856] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 207 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38508] → Tgt Spa: ['1.000'] [Step 207 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15854, 15855, 15855, 15856] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 207 / Rank 5] Tasks: ['Summarization', 'Code'] | Lens: [31398, 31392] → Tgt Spa: ['1.000', '1.000'] [Step 207 / Rank 1] Tasks: ['Single QA'] | Lens: [58646] → Tgt Spa: ['0.350'] [Step 207 / Rank 0] Tasks: ['Single QA'] | Lens: [58646] → Tgt Spa: ['0.350'] [Step 207 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38508] → Tgt Spa: ['1.000'] [Step 207 / Rank 4] Tasks: ['Summarization', 'Code'] | Lens: [31398, 31392] → Tgt Spa: ['1.000', '1.000'] [Step 207 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17749, 17751, 17762] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 207 / Rank 0] Tasks: ['Single QA'] | Lens: [41066] → Tgt Spa: ['0.350'] [Step 207 / Rank 5] Tasks: ['Single QA'] | Lens: [39333] → Tgt Spa: ['0.350'] [Step 207 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [22224, 22216] → Tgt Spa: ['1.000', '1.000'] [Step 207 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17749, 17751, 17762] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 207 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [22224, 22216] → Tgt Spa: ['1.000', '1.000'] [Step 207 / Rank 4] Tasks: ['Single QA'] | Lens: [39333] → Tgt Spa: ['0.350'] [Step 207 / Rank 1] Tasks: ['Single QA'] | Lens: [41066] → Tgt Spa: ['0.350'] [Step 207 / Rank 3] Tasks: ['Single QA'] | Lens: [54686] → Tgt Spa: ['0.350'] [Step 207 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [20069, 20082, 20082] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 207 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [53139] → Tgt Spa: ['1.000'] 
[Step 207 / Rank 2] Tasks: ['Single QA'] | Lens: [54686] → Tgt Spa: ['0.350'] [Step 207 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [53139] → Tgt Spa: ['1.000'] [Step 207 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53384] → Tgt Spa: ['1.000'] [Step 207 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [20069, 20082, 20082] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 207 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53384] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:06:16,924 >> @ 207 | Loss: 2.1430 | LM: 2.0665 | Reg: 0.0765 | Spa(Avg): 0.563 [INFO|lh_trainer.py:797] 2026-02-17 04:06:16,924 >> Statistic -> Code | Spa: 0.671 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:797] 2026-02-17 04:06:16,924 >> Statistic -> In-Context | Spa: 0.717 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:06:16,924 >> Statistic -> MultiHop | Spa: 0.642 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:06:16,924 >> Statistic -> Single | Spa: 0.427 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:06:16,925 >> Statistic -> Summarization | Spa: 0.609 | Tgt: 1.000 | Z-Loss: 0.135 | [INFO|lh_trainer.py:810] 2026-02-17 04:06:16,926 >> [Micro-Log] {"loss": 2.1430156528949738, "lm_loss": 2.0664981808513403, "reg_loss": 0.076517468803407, "model_sparsity(avg)": 0.5628375820815563, "Spa-Single QA sparsity": 0.42658730489867075, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05055065083849643, "Spa-Summarization sparsity": 0.6091269850730896, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13452988330807006, "Spa-Code sparsity": 0.6712963051266141, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10751964069075054, "Spa-In-Context Learning sparsity": 0.7170138955116272, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10436726547777653, "Spa-MultiHop QA sparsity": 
0.6419753167364333, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12461001757118437, "step": 207, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.1572265625, "lambda4 Code": 0.2578125} [INFO|lh_trainer.py:331] 2026-02-17 04:06:37,184 >> {'loss': 12.8581, 'grad_norm': 0.7350984811782837, 'learning_rate': 0.00016347076938355316, 'epoch': 0.21906266456029488, 'num_input_tokens_seen': 511567478, 'completed': '69.33% (208 / 300)', 'remaining time': '4:18:40', 'throughput': '7992.63', 'gpu_mem_free': '9087MB', 'step': 208} [Step 208 / Rank 3] Tasks: ['Single QA'] | Lens: [51021] → Tgt Spa: ['0.350'] [Step 208 / Rank 5] Tasks: ['Single QA'] | Lens: [64041] → Tgt Spa: ['0.350'] [Step 208 / Rank 7] Tasks: ['Code'] | Lens: [65388] → Tgt Spa: ['1.000'] [Step 208 / Rank 4] Tasks: ['Single QA'] | Lens: [64041] → Tgt Spa: ['0.350'] [Step 208 / Rank 6] Tasks: ['Code'] | Lens: [65388] → Tgt Spa: ['1.000'] [Step 208 / Rank 1] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [8302, 8314, 8307, 8309, 8310, 8311, 8320] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 208 / Rank 2] Tasks: ['Single QA'] | Lens: [51021] → Tgt Spa: ['0.350'] [Step 208 / Rank 0] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [8302, 8314, 8307, 8309, 8310, 8311, 8320] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 208 / Rank 3] Tasks: ['Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Single QA', 'Summarization', 'Code', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 
'Summarization', 'Summarization', 'MultiHop QA', 'Summarization'] | Lens: [2114, 2099, 2115, 2116, 2115, 2099, 2116, 2118, 2098, 2100, 2101, 2101, 2119, 2101, 2102, 2121, 2109, 2105, 2124, 2106, 2106, 2124, 2124, 2110, 2108, 2108, 2127, 2128, 2130, 2111, 2130] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 208 / Rank 2] Tasks: ['Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Single QA', 'Summarization', 'Code', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization'] | Lens: [2114, 2099, 2115, 2116, 2115, 2099, 2116, 2118, 2098, 2100, 2101, 2101, 2119, 2101, 2102, 2121, 2109, 2105, 2124, 2106, 2106, 2124, 2124, 2110, 2108, 2108, 2127, 2128, 2130, 2111, 2130] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 208 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [64941] → Tgt Spa: ['1.000'] [Step 208 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [30626, 30627] → Tgt Spa: ['0.350', '0.350'] [Step 208 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [30626, 30627] → Tgt Spa: ['0.350', '0.350'] [Step 208 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'Code'] | Lens: [19820, 19814, 19821] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 208 / 
Rank 1] Tasks: ['Code', 'In-Context Learning', 'Code'] | Lens: [19820, 19814, 19821] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 208 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [64941] → Tgt Spa: ['1.000'] [Step 208 / Rank 3] Tasks: ['Single QA'] | Lens: [35826] → Tgt Spa: ['0.350'] [Step 208 / Rank 7] Tasks: ['Single QA'] | Lens: [49762] → Tgt Spa: ['0.350'] [Step 208 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [32387, 32381] → Tgt Spa: ['1.000', '0.350'] [Step 208 / Rank 2] Tasks: ['Single QA'] | Lens: [35826] → Tgt Spa: ['0.350'] [Step 208 / Rank 0] Tasks: ['Single QA'] | Lens: [39585] → Tgt Spa: ['0.350'] [Step 208 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [32387, 32381] → Tgt Spa: ['1.000', '0.350'] [Step 208 / Rank 6] Tasks: ['Single QA'] | Lens: [49762] → Tgt Spa: ['0.350'] [Step 208 / Rank 1] Tasks: ['Single QA'] | Lens: [39585] → Tgt Spa: ['0.350'] [Step 208 / Rank 2] Tasks: ['Single QA'] | Lens: [64707] → Tgt Spa: ['0.350'] [Step 208 / Rank 6] Tasks: ['Code', 'Single QA'] | Lens: [31078, 31071] → Tgt Spa: ['1.000', '0.350'] [Step 208 / Rank 5] Tasks: ['Code'] | Lens: [58160] → Tgt Spa: ['1.000'] [Step 208 / Rank 0] Tasks: ['Code'] | Lens: [44561] → Tgt Spa: ['1.000'] [Step 208 / Rank 3] Tasks: ['Single QA'] | Lens: [64707] → Tgt Spa: ['0.350'] [Step 208 / Rank 1] Tasks: ['Code'] | Lens: [44561] → Tgt Spa: ['1.000'] [Step 208 / Rank 4] Tasks: ['Code'] | Lens: [58160] → Tgt Spa: ['1.000'] [Step 208 / Rank 7] Tasks: ['Code', 'Single QA'] | Lens: [31078, 31071] → Tgt Spa: ['1.000', '0.350'] [Step 208 / Rank 7] Tasks: ['Single QA'] | Lens: [37154] → Tgt Spa: ['0.350'] [Step 208 / Rank 2] Tasks: ['Single QA'] | Lens: [34550] → Tgt Spa: ['0.350'] [Step 208 / Rank 1] Tasks: ['Single QA'] | Lens: [39384] → Tgt Spa: ['0.350'] [Step 208 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [28436, 28437] → Tgt Spa: ['0.350', '0.350'] [Step 208 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [28436, 28437] → Tgt Spa: ['0.350', '0.350'] [Step 208 / 
Rank 3] Tasks: ['Single QA'] | Lens: [34550] → Tgt Spa: ['0.350'] [Step 208 / Rank 6] Tasks: ['Single QA'] | Lens: [37154] → Tgt Spa: ['0.350'] [Step 208 / Rank 0] Tasks: ['Single QA'] | Lens: [39384] → Tgt Spa: ['0.350'] [Step 208 / Rank 4] Tasks: ['Single QA'] | Lens: [35030] → Tgt Spa: ['0.350'] [Step 208 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [31048, 31056] → Tgt Spa: ['0.350', '1.000'] [Step 208 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 208 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 208 / Rank 6] Tasks: ['Single QA'] | Lens: [44757] → Tgt Spa: ['0.350'] [Step 208 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [31048, 31056] → Tgt Spa: ['0.350', '1.000'] [Step 208 / Rank 7] Tasks: ['Single QA'] | Lens: [44757] → Tgt Spa: ['0.350'] [Step 208 / Rank 5] Tasks: ['Single QA'] | Lens: [35030] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:09:11,899 >> @ 208 | Loss: 1.8790 | LM: 1.8295 | Reg: 0.0495 | Spa(Avg): 0.489 [INFO|lh_trainer.py:797] 2026-02-17 04:09:11,899 >> Statistic -> Code | Spa: 0.711 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 04:09:11,899 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:09:11,899 >> Statistic -> MultiHop | Spa: 0.590 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:09:11,899 >> Statistic -> Single | Spa: 0.437 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:09:11,899 >> Statistic -> Summarization | Spa: 0.611 | Tgt: 1.000 | Z-Loss: 0.134 | [INFO|lh_trainer.py:810] 2026-02-17 04:09:11,901 >> [Micro-Log] {"loss": 1.8790021310948457, "lm_loss": 1.829470006477398, "reg_loss": 0.04953210899839178, "model_sparsity(avg)": 0.489316647251447, "Spa-Single QA sparsity": 0.43722221851348875, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05887599484995008, "Spa-Code sparsity": 0.7108585726131093, "Spa-Code 
target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09184919433160262, "Spa-In-Context Learning sparsity": 0.7152777910232544, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10507211089134216, "Spa-Summarization sparsity": 0.6111111044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13439188400904337, "Spa-MultiHop QA sparsity": 0.5902777825083051, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09927106462419033, "step": 208, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.306640625, "lambda3 Summarization": 0.158203125, "lambda4 Code": 0.2578125} [INFO|lh_trainer.py:331] 2026-02-17 04:09:39,257 >> {'loss': 11.274, 'grad_norm': 0.504421591758728, 'learning_rate': 0.0001604080465728737, 'epoch': 0.220115850447604, 'num_input_tokens_seen': 514136400, 'completed': '69.67% (209 / 300)', 'remaining time': '4:15:57', 'throughput': '7054.65', 'gpu_mem_free': '6799MB', 'step': 209} [Step 209 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32509, 32509] → Tgt Spa: ['0.350', '0.350'] [Step 209 / Rank 0] Tasks: ['Single QA'] | Lens: [64182] → Tgt Spa: ['0.350'] [Step 209 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32509, 32509] → Tgt Spa: ['0.350', '0.350'] [Step 209 / Rank 1] Tasks: ['Single QA'] | Lens: [64182] → Tgt Spa: ['0.350'] [Step 209 / Rank 2] Tasks: ['Single QA'] | Lens: [36511] → Tgt Spa: ['0.350'] [Step 209 / Rank 3] Tasks: ['Single QA'] | Lens: [36511] → Tgt Spa: ['0.350'] [Step 209 / Rank 5] Tasks: ['Single QA'] | Lens: [39256] → Tgt Spa: ['0.350'] [Step 209 / Rank 4] Tasks: ['Single QA'] | Lens: [39256] → Tgt Spa: ['0.350'] [Step 209 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [26557, 26568] → Tgt Spa: ['1.000', '1.000'] [Step 209 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40143] → Tgt Spa: ['1.000'] [Step 209 / Rank 0] Tasks: ['Code'] | Lens: [36950] → Tgt Spa: ['1.000'] [Step 209 / Rank 2] Tasks: 
['Code'] | Lens: [39346] → Tgt Spa: ['1.000'] [Step 209 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40143] → Tgt Spa: ['1.000'] [Step 209 / Rank 3] Tasks: ['Code'] | Lens: [39346] → Tgt Spa: ['1.000'] [Step 209 / Rank 1] Tasks: ['Code'] | Lens: [36950] → Tgt Spa: ['1.000'] [Step 209 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [26557, 26568] → Tgt Spa: ['1.000', '1.000'] [Step 209 / Rank 6] Tasks: ['Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [5727, 5720, 5730, 5723, 5725, 5725, 5728, 5728, 5736, 5728, 5728] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 209 / Rank 2] Tasks: ['Single QA'] | Lens: [51766] → Tgt Spa: ['0.350'] [Step 209 / Rank 7] Tasks: ['Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [5727, 5720, 5730, 5723, 5725, 5725, 5728, 5728, 5736, 5728, 5728] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 209 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29218, 29219] → Tgt Spa: ['0.350', '0.350'] [Step 209 / Rank 3] Tasks: ['Single QA'] | Lens: [51766] → Tgt Spa: ['0.350'] [Step 209 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23734, 23736] → Tgt Spa: ['1.000', '1.000'] [Step 209 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23734, 23736] → Tgt Spa: ['1.000', '1.000'] [Step 209 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29218, 29219] → Tgt Spa: ['0.350', '0.350'] [Step 209 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [22791, 22779] → Tgt Spa: ['1.000', '1.000'] [Step 209 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [23176, 23184] → Tgt Spa: 
['1.000', '1.000'] [Step 209 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [23176, 23184] → Tgt Spa: ['1.000', '1.000'] [Step 209 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26255, 26256] → Tgt Spa: ['0.350', '1.000'] [Step 209 / Rank 1] Tasks: ['Single QA'] | Lens: [61792] → Tgt Spa: ['0.350'] [Step 209 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [26255, 26256] → Tgt Spa: ['0.350', '1.000'] [Step 209 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [22791, 22779] → Tgt Spa: ['1.000', '1.000'] [Step 209 / Rank 0] Tasks: ['Single QA'] | Lens: [61792] → Tgt Spa: ['0.350'] [Step 209 / Rank 5] Tasks: ['Single QA'] | Lens: [35673] → Tgt Spa: ['0.350'] [Step 209 / Rank 3] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [20957, 20957, 20950] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 209 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58726] → Tgt Spa: ['1.000'] [Step 209 / Rank 4] Tasks: ['Single QA'] | Lens: [35673] → Tgt Spa: ['0.350'] [Step 209 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58726] → Tgt Spa: ['1.000'] [Step 209 / Rank 2] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [20957, 20957, 20950] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 209 / Rank 1] Tasks: ['Single QA'] | Lens: [57921] → Tgt Spa: ['0.350'] [Step 209 / Rank 0] Tasks: ['Single QA'] | Lens: [57921] → Tgt Spa: ['0.350'] [Step 209 / Rank 1] Tasks: ['Single QA'] | Lens: [33137] → Tgt Spa: ['0.350'] [Step 209 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Code'] | Lens: [20452, 20452, 20459] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 209 / Rank 7] Tasks: ['Single QA'] | Lens: [34588] → Tgt Spa: ['0.350'] [Step 209 / Rank 6] Tasks: ['Single QA'] | Lens: [34588] → Tgt Spa: ['0.350'] [Step 209 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [25635, 25636] → Tgt Spa: ['0.350', '0.350'] [Step 209 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Code'] | Lens: [20452, 20452, 20459] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 209 / Rank 2] Tasks: 
['Single QA', 'Single QA'] | Lens: [25635, 25636] → Tgt Spa: ['0.350', '0.350'] [Step 209 / Rank 0] Tasks: ['Single QA'] | Lens: [33137] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:12:02,814 >> @ 209 | Loss: 1.9858 | LM: 1.9352 | Reg: 0.0507 | Spa(Avg): 0.508 [INFO|lh_trainer.py:797] 2026-02-17 04:12:02,814 >> Statistic -> Code | Spa: 0.705 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 04:12:02,815 >> Statistic -> In-Context | Spa: 0.717 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:12:02,815 >> Statistic -> MultiHop | Spa: 0.590 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:12:02,815 >> Statistic -> Single | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:12:02,815 >> Statistic -> Summarization | Spa: 0.646 | Tgt: 1.000 | Z-Loss: 0.116 | [INFO|lh_trainer.py:810] 2026-02-17 04:12:02,817 >> [Micro-Log] {"loss": 1.9858482717148338, "lm_loss": 1.9351964424131438, "reg_loss": 0.05065182390778015, "model_sparsity(avg)": 0.5083999757965406, "Spa-Single QA sparsity": 0.39267675984989514, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.029897975904697723, "Spa-Code sparsity": 0.7045454599640586, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09484520351344888, "Spa-In-Context Learning sparsity": 0.7171717231923883, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10438135063106363, "Spa-Summarization sparsity": 0.6458333134651184, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11593515798449516, "Spa-MultiHop QA sparsity": 0.5902777825083051, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09927106462419033, "step": 209, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.158203125, "lambda4 Code": 0.259765625} [INFO|lh_trainer.py:331] 2026-02-17 04:12:16,801 
>> {'loss': 11.9151, 'grad_norm': 0.5333490371704102, 'learning_rate': 0.00015736067488647686, 'epoch': 0.2211690363349131, 'num_input_tokens_seen': 516530356, 'completed': '70.00% (210 / 300)', 'remaining time': '4:13:04', 'throughput': '7597.73', 'gpu_mem_free': '15535MB', 'step': 210} [Step 210 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45961] → Tgt Spa: ['1.000'] [Step 210 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40804] → Tgt Spa: ['1.000'] [Step 210 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [29457, 29471] → Tgt Spa: ['1.000', '1.000'] [Step 210 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [29457, 29471] → Tgt Spa: ['1.000', '1.000'] [Step 210 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59460] → Tgt Spa: ['1.000'] [Step 210 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59460] → Tgt Spa: ['1.000'] [Step 210 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40804] → Tgt Spa: ['1.000'] [Step 210 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45961] → Tgt Spa: ['1.000'] [Step 210 / Rank 1] Tasks: ['Single QA'] | Lens: [58641] → Tgt Spa: ['0.350'] [Step 210 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32125, 32125] → Tgt Spa: ['0.350', '0.350'] [Step 210 / Rank 2] Tasks: ['Code'] | Lens: [56891] → Tgt Spa: ['1.000'] [Step 210 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32125, 32125] → Tgt Spa: ['0.350', '0.350'] [Step 210 / Rank 3] Tasks: ['Code'] | Lens: [56891] → Tgt Spa: ['1.000'] [Step 210 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31039, 31039] → Tgt Spa: ['0.350', '0.350'] [Step 210 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31039, 31039] → Tgt Spa: ['0.350', '0.350'] [Step 210 / Rank 0] Tasks: ['Single QA'] | Lens: [58641] → Tgt Spa: ['0.350'] [Step 210 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40215] → Tgt Spa: ['1.000'] [Step 210 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40215] → Tgt Spa: ['1.000'] [Step 210 / Rank 6] Tasks: ['Single QA'] | Lens: 
[55869] → Tgt Spa: ['0.350'] [Step 210 / Rank 7] Tasks: ['Single QA'] | Lens: [55869] → Tgt Spa: ['0.350'] [Step 210 / Rank 1] Tasks: ['Code'] | Lens: [53594] → Tgt Spa: ['1.000'] [Step 210 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25859, 25859] → Tgt Spa: ['0.350', '1.000'] [Step 210 / Rank 0] Tasks: ['Code'] | Lens: [53594] → Tgt Spa: ['1.000'] [Step 210 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25859, 25859] → Tgt Spa: ['0.350', '1.000'] [Step 210 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24654, 24653] → Tgt Spa: ['1.000', '1.000'] [Step 210 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [45185] → Tgt Spa: ['1.000'] [Step 210 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24654, 24653] → Tgt Spa: ['1.000', '1.000'] [Step 210 / Rank 5] Tasks: ['Single QA'] | Lens: [40628] → Tgt Spa: ['0.350'] [Step 210 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22723, 22723] → Tgt Spa: ['1.000', '1.000'] [Step 210 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [45185] → Tgt Spa: ['1.000'] [Step 210 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22723, 22723] → Tgt Spa: ['1.000', '1.000'] [Step 210 / Rank 4] Tasks: ['Single QA'] | Lens: [40628] → Tgt Spa: ['0.350'] [Step 210 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [54336] → Tgt Spa: ['1.000'] [Step 210 / Rank 1] Tasks: ['MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA'] | Lens: [2852, 2869, 2857, 2851, 2853, 2852, 2854, 2853, 2854, 2873, 2860, 2854, 2856, 2873, 2857, 2874, 2858, 2877, 2877, 2865, 2861, 2862] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', 
'1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 210 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42294] → Tgt Spa: ['1.000'] [Step 210 / Rank 6] Tasks: ['Single QA'] | Lens: [52575] → Tgt Spa: ['0.350'] [Step 210 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42294] → Tgt Spa: ['1.000'] [Step 210 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [54336] → Tgt Spa: ['1.000'] [Step 210 / Rank 7] Tasks: ['Single QA'] | Lens: [52575] → Tgt Spa: ['0.350'] [Step 210 / Rank 0] Tasks: ['MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Code', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA'] | Lens: [2852, 2869, 2857, 2851, 2853, 2852, 2854, 2853, 2854, 2873, 2860, 2854, 2856, 2873, 2857, 2874, 2858, 2877, 2877, 2865, 2861, 2862] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 210 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [50206] → Tgt Spa: ['1.000'] [Step 210 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [28232, 28234] → Tgt Spa: ['1.000', '1.000'] [Step 210 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [50206] → Tgt Spa: ['1.000'] [Step 210 / Rank 2] Tasks: ['Single QA'] | Lens: [42674] → Tgt Spa: ['0.350'] [Step 210 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'In-Context Learning'] | Lens: [8372, 8372, 8372, 8372, 8380, 8373, 8375] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 210 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single 
QA', 'In-Context Learning'] | Lens: [8372, 8372, 8372, 8372, 8380, 8373, 8375] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 210 / Rank 3] Tasks: ['Single QA'] | Lens: [42674] → Tgt Spa: ['0.350'] [Step 210 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [28232, 28234] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:14:35,816 >> @ 210 | Loss: 2.1925 | LM: 2.1182 | Reg: 0.0742 | Spa(Avg): 0.593 [INFO|lh_trainer.py:797] 2026-02-17 04:14:35,816 >> Statistic -> Code | Spa: 0.699 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 04:14:35,816 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:14:35,816 >> Statistic -> MultiHop | Spa: 0.667 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:14:35,816 >> Statistic -> Single | Spa: 0.411 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:14:35,816 >> Statistic -> Summarization | Spa: 0.609 | Tgt: 1.000 | Z-Loss: 0.135 | [INFO|lh_trainer.py:810] 2026-02-17 04:14:35,818 >> [Micro-Log] {"loss": 2.1924699110289416, "lm_loss": 2.1182233579456806, "reg_loss": 0.07424655853537843, "model_sparsity(avg)": 0.5932163819670677, "Spa-In-Context Learning sparsity": 0.7149122702447992, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10533181264212257, "Spa-Code sparsity": 0.6990740630361769, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09721843401590984, "Spa-Single QA sparsity": 0.4114583320915699, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04057918512262404, "Spa-MultiHop QA sparsity": 0.6666666716337204, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.13865629583597183, "Spa-Summarization sparsity": 0.6087962985038757, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13484549646576247, "step": 210, "current_tau": 1.0, "lambda1 
Single QA": 0.5859375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.158203125, "lambda4 Code": 0.259765625} [INFO|lh_trainer.py:331] 2026-02-17 04:14:53,125 >> {'loss': 13.1548, 'grad_norm': 0.9190953969955444, 'learning_rate': 0.00015432917647581338, 'epoch': 0.2222222222222222, 'num_input_tokens_seen': 519028524, 'completed': '70.33% (211 / 300)', 'remaining time': '4:10:10', 'throughput': '7990.35', 'gpu_mem_free': '10645MB', 'step': 211} [Step 211 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40996] → Tgt Spa: ['1.000'] [Step 211 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61464] → Tgt Spa: ['1.000'] [Step 211 / Rank 0] Tasks: ['Single QA'] | Lens: [41129] → Tgt Spa: ['0.350'] [Step 211 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40996] → Tgt Spa: ['1.000'] [Step 211 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [46849] → Tgt Spa: ['1.000'] [Step 211 / Rank 1] Tasks: ['Single QA'] | Lens: [41129] → Tgt Spa: ['0.350'] [Step 211 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [46849] → Tgt Spa: ['1.000'] [Step 211 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61464] → Tgt Spa: ['1.000'] [Step 211 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [19199, 19201, 19205] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 211 / Rank 5] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [8244, 8254, 8255, 8256, 8251, 8258, 8261] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 211 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57492] → Tgt Spa: ['1.000'] [Step 211 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57492] → Tgt Spa: ['1.000'] [Step 211 / Rank 4] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [8244, 8254, 8255, 8256, 8251, 8258, 8261] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 211 / Rank 0] Tasks: ['Single QA'] | Lens: [55596] → Tgt Spa: ['0.350'] [Step 211 / Rank 6] Tasks: ['Code', 
'Code', 'Code'] | Lens: [19199, 19201, 19205] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 211 / Rank 1] Tasks: ['Single QA'] | Lens: [55596] → Tgt Spa: ['0.350'] [Step 211 / Rank 3] Tasks: ['Single QA'] | Lens: [54842] → Tgt Spa: ['0.350'] [Step 211 / Rank 2] Tasks: ['Single QA'] | Lens: [54842] → Tgt Spa: ['0.350'] [Step 211 / Rank 5] Tasks: ['Single QA'] | Lens: [35685] → Tgt Spa: ['0.350'] [Step 211 / Rank 4] Tasks: ['Single QA'] | Lens: [35685] → Tgt Spa: ['0.350'] [Step 211 / Rank 1] Tasks: ['Single QA'] | Lens: [32966] → Tgt Spa: ['0.350'] [Step 211 / Rank 7] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [7811, 7804, 7806, 7807, 7807, 7808, 7814, 7816] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 211 / Rank 0] Tasks: ['Single QA'] | Lens: [32966] → Tgt Spa: ['0.350'] [Step 211 / Rank 6] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [7811, 7804, 7806, 7807, 7807, 7808, 7814, 7816] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 211 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58327] → Tgt Spa: ['1.000'] [Step 211 / Rank 3] Tasks: ['Single QA'] | Lens: [49472] → Tgt Spa: ['0.350'] [Step 211 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22208, 22231] → Tgt Spa: ['1.000', '1.000'] [Step 211 / Rank 2] Tasks: ['Single QA'] | Lens: [49472] → Tgt Spa: ['0.350'] [Step 211 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58327] → Tgt Spa: ['1.000'] [Step 211 / Rank 6] Tasks: ['Single QA'] | Lens: [53970] → Tgt Spa: ['0.350'] [Step 211 / Rank 7] Tasks: ['Single QA'] | Lens: [53970] → Tgt Spa: ['0.350'] [Step 211 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22208, 22231] → Tgt Spa: ['1.000', '1.000'] [Step 211 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23088, 23090] → Tgt Spa: ['1.000', 
'1.000'] [Step 211 / Rank 1] Tasks: ['Single QA'] | Lens: [46676] → Tgt Spa: ['0.350'] [Step 211 / Rank 0] Tasks: ['Single QA'] | Lens: [46676] → Tgt Spa: ['0.350'] [Step 211 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26563, 26564] → Tgt Spa: ['1.000', '0.350'] [Step 211 / Rank 3] Tasks: ['Single QA'] | Lens: [51378] → Tgt Spa: ['0.350'] [Step 211 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23088, 23090] → Tgt Spa: ['1.000', '1.000'] [Step 211 / Rank 2] Tasks: ['Single QA'] | Lens: [51378] → Tgt Spa: ['0.350'] [Step 211 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26563, 26564] → Tgt Spa: ['1.000', '0.350'] [Step 211 / Rank 3] Tasks: ['Single QA'] | Lens: [35261] → Tgt Spa: ['0.350'] [Step 211 / Rank 4] Tasks: ['Single QA'] | Lens: [40595] → Tgt Spa: ['0.350'] [Step 211 / Rank 2] Tasks: ['Single QA'] | Lens: [35261] → Tgt Spa: ['0.350'] [Step 211 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [18486, 18490, 18490] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 211 / Rank 5] Tasks: ['Single QA'] | Lens: [40595] → Tgt Spa: ['0.350'] [Step 211 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32070, 32070] → Tgt Spa: ['0.350', '0.350'] [Step 211 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32070, 32070] → Tgt Spa: ['0.350', '0.350'] [Step 211 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [18486, 18490, 18490] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:17:18,367 >> @ 211 | Loss: 2.2002 | LM: 2.1435 | Reg: 0.0568 | Spa(Avg): 0.511 [INFO|lh_trainer.py:797] 2026-02-17 04:17:18,367 >> Statistic -> Code | Spa: 0.685 | Tgt: 1.000 | Z-Loss: 0.103 | [INFO|lh_trainer.py:797] 2026-02-17 04:17:18,367 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:17:18,367 >> Statistic -> MultiHop | Spa: 0.667 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:17:18,367 >> Statistic -> Single | Spa: 
0.387 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:17:18,367 >> Statistic -> Summarization | Spa: 0.597 | Tgt: 1.000 | Z-Loss: 0.140 | [INFO|lh_trainer.py:810] 2026-02-17 04:17:18,370 >> [Micro-Log] {"loss": 2.2002177077035108, "lm_loss": 2.1434583415587745, "reg_loss": 0.05675937896982456, "model_sparsity(avg)": 0.5114466200272242, "Spa-Single QA sparsity": 0.3869047562281291, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02804895472668466, "Spa-In-Context Learning sparsity": 0.7098765505684747, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10793627550204594, "Spa-Code sparsity": 0.6845237953322274, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10295081298266139, "Spa-Summarization sparsity": 0.5972222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14008285105228424, "Spa-MultiHop QA sparsity": 0.6666666716337204, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.13865629583597183, "step": 211, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.158203125, "lambda4 Code": 0.259765625} [INFO|lh_trainer.py:331] 2026-02-17 04:17:35,851 >> {'loss': 13.2013, 'grad_norm': 0.5969712138175964, 'learning_rate': 0.00015131407077252965, 'epoch': 0.22327540810953134, 'num_input_tokens_seen': 521436334, 'completed': '70.67% (212 / 300)', 'remaining time': '4:07:18', 'throughput': '7398.37', 'gpu_mem_free': '9687MB', 'step': 212} [Step 212 / Rank 5] Tasks: ['Code'] | Lens: [33161] → Tgt Spa: ['1.000'] [Step 212 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23225, 23225] → Tgt Spa: ['1.000', '1.000'] [Step 212 / Rank 1] Tasks: ['Code'] | Lens: [35597] → Tgt Spa: ['1.000'] [Step 212 / Rank 4] Tasks: ['Code'] | Lens: [33161] → Tgt Spa: ['1.000'] [Step 212 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32692, 
32692] → Tgt Spa: ['0.350', '0.350'] [Step 212 / Rank 0] Tasks: ['Code'] | Lens: [35597] → Tgt Spa: ['1.000'] [Step 212 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32692, 32692] → Tgt Spa: ['0.350', '0.350'] [Step 212 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23225, 23225] → Tgt Spa: ['1.000', '1.000'] [Step 212 / Rank 5] Tasks: ['Single QA'] | Lens: [59925] → Tgt Spa: ['0.350'] [Step 212 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [4594, 4588, 4589, 4590, 4593, 4611, 4593, 4594, 4602, 4594, 4596, 4596, 4597, 4598] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 212 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [21404, 21408, 21410] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 212 / Rank 0] Tasks: ['Single QA'] | Lens: [55411] → Tgt Spa: ['0.350'] [Step 212 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [4594, 4588, 4589, 4590, 4593, 4611, 4593, 4594, 4602, 4594, 4596, 4596, 4597, 4598] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 212 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [21404, 21408, 21410] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 212 / Rank 4] Tasks: ['Single QA'] | Lens: [59925] → Tgt Spa: ['0.350'] [Step 212 / Rank 1] Tasks: ['Single QA'] | Lens: [55411] → Tgt Spa: ['0.350'] [Step 212 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'MultiHop QA'] | Lens: [13913, 13919, 
13920, 13920] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 212 / Rank 5] Tasks: ['Summarization', 'Single QA'] | Lens: [22170, 22153] → Tgt Spa: ['1.000', '0.350'] [Step 212 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'MultiHop QA'] | Lens: [13913, 13919, 13920, 13920] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 212 / Rank 1] Tasks: ['Single QA'] | Lens: [64677] → Tgt Spa: ['0.350'] [Step 212 / Rank 4] Tasks: ['Summarization', 'Single QA'] | Lens: [22170, 22153] → Tgt Spa: ['1.000', '0.350'] [Step 212 / Rank 7] Tasks: ['Single QA'] | Lens: [59394] → Tgt Spa: ['0.350'] [Step 212 / Rank 0] Tasks: ['Single QA'] | Lens: [64677] → Tgt Spa: ['0.350'] [Step 212 / Rank 6] Tasks: ['Single QA'] | Lens: [59394] → Tgt Spa: ['0.350'] [Step 212 / Rank 4] Tasks: ['Single QA'] | Lens: [65107] → Tgt Spa: ['0.350'] [Step 212 / Rank 1] Tasks: ['Single QA'] | Lens: [55866] → Tgt Spa: ['0.350'] [Step 212 / Rank 5] Tasks: ['Single QA'] | Lens: [65107] → Tgt Spa: ['0.350'] [Step 212 / Rank 7] Tasks: ['Single QA'] | Lens: [40550] → Tgt Spa: ['0.350'] [Step 212 / Rank 0] Tasks: ['Single QA'] | Lens: [55866] → Tgt Spa: ['0.350'] [Step 212 / Rank 6] Tasks: ['Single QA'] | Lens: [40550] → Tgt Spa: ['0.350'] [Step 212 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19468, 19458, 19458] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 212 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19468, 19458, 19458] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 212 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29947, 29948] → Tgt Spa: ['0.350', '0.350'] [Step 212 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [25347, 25354] → Tgt Spa: ['0.350', '1.000'] [Step 212 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [25347, 25354] → Tgt Spa: ['0.350', '1.000'] [Step 212 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26766, 26766] → Tgt Spa: ['1.000', '1.000'] [Step 212 / Rank 0] Tasks: ['Single QA'] | Lens: [55566] → Tgt Spa: 
['0.350'] [Step 212 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26766, 26766] → Tgt Spa: ['1.000', '1.000'] [Step 212 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29947, 29948] → Tgt Spa: ['0.350', '0.350'] [Step 212 / Rank 1] Tasks: ['Single QA'] | Lens: [55566] → Tgt Spa: ['0.350'] [Step 212 / Rank 6] Tasks: ['Code'] | Lens: [36257] → Tgt Spa: ['1.000'] [Step 212 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [28339, 28339] → Tgt Spa: ['1.000', '0.350'] [Step 212 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [28339, 28339] → Tgt Spa: ['1.000', '0.350'] [Step 212 / Rank 7] Tasks: ['Code'] | Lens: [36257] → Tgt Spa: ['1.000'] [Step 212 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [17738, 17740, 17740] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 212 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29963, 29966] → Tgt Spa: ['0.350', '1.000'] [Step 212 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [17738, 17740, 17740] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 212 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29963, 29966] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:20:12,812 >> @ 212 | Loss: 1.8178 | LM: 1.7523 | Reg: 0.0655 | Spa(Avg): 0.526 [INFO|lh_trainer.py:797] 2026-02-17 04:20:12,813 >> Statistic -> Code | Spa: 0.714 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 04:20:12,813 >> Statistic -> In-Context | Spa: 0.683 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:20:12,813 >> Statistic -> MultiHop | Spa: 0.407 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:20:12,813 >> Statistic -> Single | Spa: 0.449 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:20:12,813 >> Statistic -> Summarization | Spa: 0.569 | Tgt: 1.000 | Z-Loss: 0.160 | [INFO|lh_trainer.py:810] 2026-02-17 04:20:12,816 >> [Micro-Log] {"loss": 1.8178216866217554, "lm_loss": 
1.7523211260170986, "reg_loss": 0.06550055377495785, "model_sparsity(avg)": 0.5258280957738558, "Spa-Code sparsity": 0.7142857228006635, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09113474296672004, "Spa-Single QA sparsity": 0.44949494166807696, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06660124301825734, "Spa-In-Context Learning sparsity": 0.6828703582286835, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11978912788132827, "Spa-MultiHop QA sparsity": 0.40740740299224854, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01940439889828364, "Spa-Summarization sparsity": 0.569444457689921, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16039423644542694, "step": 212, "current_tau": 1.0, "lambda1 Single QA": 0.5859375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.1591796875, "lambda4 Code": 0.259765625} [INFO|lh_trainer.py:331] 2026-02-17 04:20:28,683 >> {'loss': 10.9069, 'grad_norm': 0.6031188368797302, 'learning_rate': 0.0001483158743994661, 'epoch': 0.22432859399684044, 'num_input_tokens_seen': 524024802, 'completed': '71.00% (213 / 300)', 'remaining time': '4:04:32', 'throughput': '7488.38', 'gpu_mem_free': '7933MB', 'step': 213} [Step 213 / Rank 1] Tasks: ['Single QA'] | Lens: [37108] → Tgt Spa: ['0.350'] [Step 213 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [31129, 31139] → Tgt Spa: ['1.000', '0.350'] [Step 213 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20758, 20747, 20751] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 213 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20758, 20747, 20751] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 213 / Rank 0] Tasks: ['Single QA'] | Lens: [37108] → Tgt Spa: ['0.350'] [Step 213 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55470] → Tgt Spa: ['1.000'] [Step 213 / Rank 4] Tasks: ['In-Context Learning'] | Lens: 
[55470] → Tgt Spa: ['1.000'] [Step 213 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [31129, 31139] → Tgt Spa: ['1.000', '0.350'] [Step 213 / Rank 4] Tasks: ['Single QA', 'Summarization'] | Lens: [30209, 30228] → Tgt Spa: ['0.350', '1.000'] [Step 213 / Rank 3] Tasks: ['Single QA'] | Lens: [60846] → Tgt Spa: ['0.350'] [Step 213 / Rank 0] Tasks: ['Single QA'] | Lens: [63019] → Tgt Spa: ['0.350'] [Step 213 / Rank 2] Tasks: ['Single QA'] | Lens: [60846] → Tgt Spa: ['0.350'] [Step 213 / Rank 1] Tasks: ['Single QA'] | Lens: [63019] → Tgt Spa: ['0.350'] [Step 213 / Rank 5] Tasks: ['Single QA', 'Summarization'] | Lens: [30209, 30228] → Tgt Spa: ['0.350', '1.000'] [Step 213 / Rank 7] Tasks: ['Single QA'] | Lens: [35188] → Tgt Spa: ['0.350'] [Step 213 / Rank 6] Tasks: ['Single QA'] | Lens: [35188] → Tgt Spa: ['0.350'] [Step 213 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43408] → Tgt Spa: ['1.000'] [Step 213 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [23556, 23556] → Tgt Spa: ['0.350', '0.350'] [Step 213 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [24985, 24994] → Tgt Spa: ['1.000', '1.000'] [Step 213 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43408] → Tgt Spa: ['1.000'] [Step 213 / Rank 3] Tasks: ['Single QA'] | Lens: [56196] → Tgt Spa: ['0.350'] [Step 213 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [24985, 24994] → Tgt Spa: ['1.000', '1.000'] [Step 213 / Rank 2] Tasks: ['Single QA'] | Lens: [56196] → Tgt Spa: ['0.350'] [Step 213 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [23556, 23556] → Tgt Spa: ['0.350', '0.350'] [Step 213 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [18344, 18351, 18351] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 213 / Rank 5] Tasks: ['Single QA'] | Lens: [46332] → Tgt Spa: ['0.350'] [Step 213 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [18344, 18351, 18351] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 213 / Rank 7] 
Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1907, 1908, 1907, 1927, 1925, 1928, 1929, 1930, 1912, 1930, 1912, 1931, 1914, 1913, 1914, 1915, 1933, 1917, 1917, 1936, 1917, 1917, 1936, 1926, 1920, 1937, 1918, 1938, 1922, 1939, 1941, 1922, 1924, 1942] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 213 / Rank 1] Tasks: ['Code'] | Lens: [48633] → Tgt Spa: ['1.000'] [Step 213 / Rank 4] Tasks: ['Single QA'] | Lens: [46332] → Tgt Spa: ['0.350'] [Step 213 / Rank 0] Tasks: ['Code'] | Lens: [48633] → Tgt Spa: ['1.000'] [Step 213 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1907, 1908, 1907, 1927, 1925, 1928, 1929, 1930, 1912, 1930, 1912, 1931, 1914, 1913, 1914, 1915, 1933, 1917, 1917, 1936, 1917, 1917, 
1936, 1926, 1920, 1937, 1918, 1938, 1922, 1939, 1941, 1922, 1924, 1942] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000'] [Step 213 / Rank 4] Tasks: ['Single QA'] | Lens: [49575] → Tgt Spa: ['0.350'] [Step 213 / Rank 2] Tasks: ['Single QA'] | Lens: [44674] → Tgt Spa: ['0.350'] [Step 213 / Rank 5] Tasks: ['Single QA'] | Lens: [49575] → Tgt Spa: ['0.350'] [Step 213 / Rank 1] Tasks: ['Single QA'] | Lens: [49269] → Tgt Spa: ['0.350'] [Step 213 / Rank 3] Tasks: ['Single QA'] | Lens: [44674] → Tgt Spa: ['0.350'] [Step 213 / Rank 0] Tasks: ['Single QA'] | Lens: [49269] → Tgt Spa: ['0.350'] [Step 213 / Rank 7] Tasks: ['Single QA'] | Lens: [60724] → Tgt Spa: ['0.350'] [Step 213 / Rank 6] Tasks: ['Single QA'] | Lens: [60724] → Tgt Spa: ['0.350'] [Step 213 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15810, 15810, 15810, 15810] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 213 / Rank 5] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA'] | Lens: [1778, 1760, 1760, 1759, 1760, 1761, 1759, 1761, 1762, 1761, 1780, 1780, 1780, 1762, 1764, 1765, 1783, 1764, 1783, 1767, 1766, 1786, 1786, 1767, 1768, 1788, 1788, 1769, 1770, 1771, 1790, 1772, 1772, 1771, 1773, 1776] → Tgt 
Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 213 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15810, 15810, 15810, 15810] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 213 / Rank 6] Tasks: ['Code'] | Lens: [39548] → Tgt Spa: ['1.000'] [Step 213 / Rank 2] Tasks: ['Single QA'] | Lens: [58372] → Tgt Spa: ['0.350'] [Step 213 / Rank 7] Tasks: ['Code'] | Lens: [39548] → Tgt Spa: ['1.000'] [Step 213 / Rank 4] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA'] | Lens: [1778, 1760, 1760, 1759, 1760, 1761, 1759, 1761, 1762, 1761, 1780, 1780, 1780, 1762, 1764, 1765, 1783, 1764, 1783, 1767, 1766, 1786, 1786, 1767, 1768, 1788, 1788, 1769, 1770, 1771, 1790, 1772, 1772, 1771, 1773, 1776] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 213 / Rank 3] Tasks: ['Single QA'] | Lens: [58372] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:22:57,056 >> @ 
213 | Loss: 2.0167 | LM: 1.9591 | Reg: 0.0575 | Spa(Avg): 0.507 [INFO|lh_trainer.py:797] 2026-02-17 04:22:57,056 >> Statistic -> Code | Spa: 0.715 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 04:22:57,056 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:22:57,056 >> Statistic -> MultiHop | Spa: 0.594 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:22:57,056 >> Statistic -> Single | Spa: 0.422 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:22:57,056 >> Statistic -> Summarization | Spa: 0.623 | Tgt: 1.000 | Z-Loss: 0.130 | [INFO|lh_trainer.py:810] 2026-02-17 04:22:57,058 >> [Micro-Log] {"loss": 2.0166502613574266, "lm_loss": 1.9591058756535251, "reg_loss": 0.057544388012805335, "model_sparsity(avg)": 0.5074550608793894, "Spa-Single QA sparsity": 0.42222221195697784, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.048300050920806824, "Spa-Code sparsity": 0.7152777910232544, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09075437486171722, "Spa-In-Context Learning sparsity": 0.7118055522441864, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10721969790756702, "Spa-Summarization sparsity": 0.6232078786819212, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12973781122315314, "Spa-MultiHop QA sparsity": 0.5939153461229234, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10035816784061137, "step": 213, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.1591796875, "lambda4 Code": 0.259765625} [INFO|lh_trainer.py:331] 2026-02-17 04:23:19,706 >> {'loss': 12.0999, 'grad_norm': 0.4547601640224457, 'learning_rate': 0.0001453351010821365, 'epoch': 0.22538177988414956, 'num_input_tokens_seen': 526580534, 'completed': '71.33% (214 / 300)', 'remaining 
time': '4:01:44', 'throughput': '7471.89', 'gpu_mem_free': '7643MB', 'step': 214} [Step 214 / Rank 0] Tasks: ['Single QA'] | Lens: [45996] → Tgt Spa: ['0.350'] [Step 214 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32691, 32691] → Tgt Spa: ['0.350', '0.350'] [Step 214 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27262, 27285] → Tgt Spa: ['1.000', '1.000'] [Step 214 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25479, 25460] → Tgt Spa: ['1.000', '1.000'] [Step 214 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27262, 27285] → Tgt Spa: ['1.000', '1.000'] [Step 214 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32691, 32691] → Tgt Spa: ['0.350', '0.350'] [Step 214 / Rank 1] Tasks: ['Single QA'] | Lens: [45996] → Tgt Spa: ['0.350'] [Step 214 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25479, 25460] → Tgt Spa: ['1.000', '1.000'] [Step 214 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22962, 22964] → Tgt Spa: ['1.000', '1.000'] [Step 214 / Rank 7] Tasks: ['Code'] | Lens: [47638] → Tgt Spa: ['1.000'] [Step 214 / Rank 6] Tasks: ['Code'] | Lens: [47638] → Tgt Spa: ['1.000'] [Step 214 / Rank 3] Tasks: ['Single QA'] | Lens: [44249] → Tgt Spa: ['0.350'] [Step 214 / Rank 2] Tasks: ['Single QA'] | Lens: [44249] → Tgt Spa: ['0.350'] [Step 214 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22962, 22964] → Tgt Spa: ['1.000', '1.000'] [Step 214 / Rank 5] Tasks: ['Single QA'] | Lens: [37815] → Tgt Spa: ['0.350'] [Step 214 / Rank 4] Tasks: ['Single QA'] | Lens: [37815] → Tgt Spa: ['0.350'] [Step 214 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [25873, 25873] → Tgt Spa: ['0.350', '0.350'] [Step 214 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25460, 25461] → Tgt Spa: ['1.000', '0.350'] [Step 214 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [25873, 25873] → Tgt Spa: ['0.350', '0.350'] [Step 214 / 
Rank 4] Tasks: ['Single QA', 'Code'] | Lens: [28341, 28347] → Tgt Spa: ['0.350', '1.000'] [Step 214 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25460, 25461] → Tgt Spa: ['1.000', '0.350'] [Step 214 / Rank 5] Tasks: ['Single QA', 'Code'] | Lens: [28341, 28347] → Tgt Spa: ['0.350', '1.000'] [Step 214 / Rank 6] Tasks: ['Single QA'] | Lens: [39624] → Tgt Spa: ['0.350'] [Step 214 / Rank 7] Tasks: ['Single QA'] | Lens: [39624] → Tgt Spa: ['0.350'] [Step 214 / Rank 4] Tasks: ['Single QA'] | Lens: [56254] → Tgt Spa: ['0.350'] [Step 214 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [26996, 27003] → Tgt Spa: ['0.350', '1.000'] [Step 214 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [28567, 28578] → Tgt Spa: ['1.000', '1.000'] [Step 214 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40826] → Tgt Spa: ['1.000'] [Step 214 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [28567, 28578] → Tgt Spa: ['1.000', '1.000'] [Step 214 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [26996, 27003] → Tgt Spa: ['0.350', '1.000'] [Step 214 / Rank 5] Tasks: ['Single QA'] | Lens: [56254] → Tgt Spa: ['0.350'] [Step 214 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40826] → Tgt Spa: ['1.000'] [Step 214 / Rank 5] Tasks: ['Single QA'] | Lens: [61810] → Tgt Spa: ['0.350'] [Step 214 / Rank 6] Tasks: ['Single QA'] | Lens: [43561] → Tgt Spa: ['0.350'] [Step 214 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60968] → Tgt Spa: ['1.000'] [Step 214 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60968] → Tgt Spa: ['1.000'] [Step 214 / Rank 7] Tasks: ['Single QA'] | Lens: [43561] → Tgt Spa: ['0.350'] [Step 214 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Code'] | Lens: [18105, 18105, 18114] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 214 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Code'] | Lens: [18105, 18105, 18114] → Tgt Spa: ['0.350', '0.350', '1.000'] [Step 214 / Rank 4] Tasks: ['Single QA'] | Lens: [61810] → Tgt Spa: ['0.350'] [Step 214 / Rank 3] 
Tasks: ['MultiHop QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15816, 15817, 15818, 15818] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 214 / Rank 0] Tasks: ['Code'] | Lens: [35487] → Tgt Spa: ['1.000'] [Step 214 / Rank 2] Tasks: ['MultiHop QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15816, 15817, 15818, 15818] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 214 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32196, 32196] → Tgt Spa: ['0.350', '0.350'] [Step 214 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [57670] → Tgt Spa: ['1.000'] [Step 214 / Rank 1] Tasks: ['Code'] | Lens: [35487] → Tgt Spa: ['1.000'] [Step 214 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32196, 32196] → Tgt Spa: ['0.350', '0.350'] [Step 214 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [57670] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:25:32,599 >> @ 214 | Loss: 1.9983 | LM: 1.9379 | Reg: 0.0604 | Spa(Avg): 0.521 [INFO|lh_trainer.py:797] 2026-02-17 04:25:32,599 >> Statistic -> Code | Spa: 0.692 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:797] 2026-02-17 04:25:32,599 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:25:32,599 >> Statistic -> MultiHop | Spa: 0.463 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:25:32,599 >> Statistic -> Single | Spa: 0.374 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:25:32,599 >> Statistic -> Summarization | Spa: 0.611 | Tgt: 1.000 | Z-Loss: 0.133 | [INFO|lh_trainer.py:810] 2026-02-17 04:25:32,601 >> [Micro-Log] {"loss": 1.9983277182715635, "lm_loss": 1.9378978583651285, "reg_loss": 0.06042984653807556, "model_sparsity(avg)": 0.5212673606971899, "Spa-Single QA sparsity": 0.3735380078616895, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01821603856392597, "Spa-In-Context Learning sparsity": 0.7098765505684747, "Spa-In-Context Learning target_sparsity": 
1.0, "Spa-In-Context Learning log_z_loss": 0.10805256830321418, "Spa-Code sparsity": 0.6921296318372091, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09964290757973988, "Spa-Summarization sparsity": 0.6111111044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13318060338497162, "Spa-MultiHop QA sparsity": 0.4629629651705424, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04461697426935037, "step": 214, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.1591796875, "lambda4 Code": 0.259765625} [INFO|lh_trainer.py:331] 2026-02-17 04:25:54,384 >> {'loss': 11.99, 'grad_norm': 0.5760593414306641, 'learning_rate': 0.0001423722615607036, 'epoch': 0.22643496577145866, 'num_input_tokens_seen': 529062886, 'completed': '71.67% (215 / 300)', 'remaining time': '3:58:50', 'throughput': '8024.27', 'gpu_mem_free': '13591MB', 'step': 215} [Step 215 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23075, 23076] → Tgt Spa: ['1.000', '1.000'] [Step 215 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23088, 23089] → Tgt Spa: ['0.350', '1.000'] [Step 215 / Rank 1] Tasks: ['Single QA'] | Lens: [53384] → Tgt Spa: ['0.350'] [Step 215 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23831, 23831] → Tgt Spa: ['0.350', '1.000'] [Step 215 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23075, 23076] → Tgt Spa: ['1.000', '1.000'] [Step 215 / Rank 0] Tasks: ['Single QA'] | Lens: [53384] → Tgt Spa: ['0.350'] [Step 215 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23831, 23831] → Tgt Spa: ['0.350', '1.000'] [Step 215 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23088, 23089] → Tgt Spa: ['0.350', '1.000'] [Step 215 / Rank 3] Tasks: ['Single QA'] | Lens: [51288] → Tgt Spa: ['0.350'] [Step 215 / Rank 6] Tasks: ['Single QA'] | Lens: [52934] → Tgt 
Spa: ['0.350'] [Step 215 / Rank 4] Tasks: ['Code'] | Lens: [38553] → Tgt Spa: ['1.000'] [Step 215 / Rank 2] Tasks: ['Single QA'] | Lens: [51288] → Tgt Spa: ['0.350'] [Step 215 / Rank 0] Tasks: ['Single QA'] | Lens: [33348] → Tgt Spa: ['0.350'] [Step 215 / Rank 5] Tasks: ['Code'] | Lens: [38553] → Tgt Spa: ['1.000'] [Step 215 / Rank 7] Tasks: ['Single QA'] | Lens: [52934] → Tgt Spa: ['0.350'] [Step 215 / Rank 1] Tasks: ['Single QA'] | Lens: [33348] → Tgt Spa: ['0.350'] [Step 215 / Rank 0] Tasks: ['Code'] | Lens: [56948] → Tgt Spa: ['1.000'] [Step 215 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13357, 13357, 13357, 13358] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 215 / Rank 7] Tasks: ['Single QA'] | Lens: [64174] → Tgt Spa: ['0.350'] [Step 215 / Rank 4] Tasks: ['Single QA'] | Lens: [65039] → Tgt Spa: ['0.350'] [Step 215 / Rank 1] Tasks: ['Code'] | Lens: [56948] → Tgt Spa: ['1.000'] [Step 215 / Rank 6] Tasks: ['Single QA'] | Lens: [64174] → Tgt Spa: ['0.350'] [Step 215 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13357, 13357, 13357, 13358] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 215 / Rank 5] Tasks: ['Single QA'] | Lens: [65039] → Tgt Spa: ['0.350'] [Step 215 / Rank 3] Tasks: ['In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [4598, 4608, 4605, 4600, 4600, 4599, 4618, 4600, 4602, 4602, 4611, 4609, 4603, 4605] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 215 / Rank 6] Tasks: ['Single QA'] | Lens: [52961] → Tgt Spa: ['0.350'] [Step 215 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [30414, 30397] → Tgt Spa: ['1.000', '1.000'] [Step 215 / Rank 4] 
Tasks: ['Single QA'] | Lens: [50399] → Tgt Spa: ['0.350'] [Step 215 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [30414, 30397] → Tgt Spa: ['1.000', '1.000'] [Step 215 / Rank 5] Tasks: ['Single QA'] | Lens: [50399] → Tgt Spa: ['0.350'] [Step 215 / Rank 7] Tasks: ['Single QA'] | Lens: [52961] → Tgt Spa: ['0.350'] [Step 215 / Rank 2] Tasks: ['In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [4598, 4608, 4605, 4600, 4600, 4599, 4618, 4600, 4602, 4602, 4611, 4609, 4603, 4605] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 215 / Rank 3] Tasks: ['Single QA'] | Lens: [53359] → Tgt Spa: ['0.350'] [Step 215 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32221, 32223] → Tgt Spa: ['0.350', '0.350'] [Step 215 / Rank 0] Tasks: ['Single QA'] | Lens: [43046] → Tgt Spa: ['0.350'] [Step 215 / Rank 7] Tasks: ['Single QA'] | Lens: [44175] → Tgt Spa: ['0.350'] [Step 215 / Rank 2] Tasks: ['Single QA'] | Lens: [53359] → Tgt Spa: ['0.350'] [Step 215 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32221, 32223] → Tgt Spa: ['0.350', '0.350'] [Step 215 / Rank 1] Tasks: ['Single QA'] | Lens: [43046] → Tgt Spa: ['0.350'] [Step 215 / Rank 6] Tasks: ['Single QA'] | Lens: [44175] → Tgt Spa: ['0.350'] [Step 215 / Rank 3] Tasks: ['Single QA'] | Lens: [64108] → Tgt Spa: ['0.350'] [Step 215 / Rank 5] Tasks: ['Code'] | Lens: [35129] → Tgt Spa: ['1.000'] [Step 215 / Rank 2] Tasks: ['Single QA'] | Lens: [64108] → Tgt Spa: ['0.350'] [Step 215 / Rank 7] Tasks: ['Single QA'] | Lens: [45619] → Tgt Spa: ['0.350'] [Step 215 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24392, 24392] → Tgt Spa: ['1.000', '1.000'] [Step 215 / Rank 4] Tasks: 
['Code'] | Lens: [35129] → Tgt Spa: ['1.000'] [Step 215 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24392, 24392] → Tgt Spa: ['1.000', '1.000'] [Step 215 / Rank 6] Tasks: ['Single QA'] | Lens: [45619] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:28:19,570 >> @ 215 | Loss: 2.0746 | LM: 2.0189 | Reg: 0.0557 | Spa(Avg): 0.490 [INFO|lh_trainer.py:797] 2026-02-17 04:28:19,570 >> Statistic -> Code | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:797] 2026-02-17 04:28:19,570 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:28:19,571 >> Statistic -> MultiHop | Spa: 0.463 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:28:19,571 >> Statistic -> Single | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:28:19,571 >> Statistic -> Summarization | Spa: 0.569 | Tgt: 1.000 | Z-Loss: 0.165 | [INFO|lh_trainer.py:810] 2026-02-17 04:28:19,574 >> [Micro-Log] {"loss": 2.0745739564299583, "lm_loss": 2.018915065253774, "reg_loss": 0.05565888236742467, "model_sparsity(avg)": 0.48987267911434174, "Spa-Single QA sparsity": 0.3888888813200451, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03075028055657943, "Spa-Code sparsity": 0.6805555394717625, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10458512284926005, "Spa-Summarization sparsity": 0.5694444477558136, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1648736670613289, "Spa-In-Context Learning sparsity": 0.717881940305233, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10472456784918904, "Spa-MultiHop QA sparsity": 0.4629629651705424, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04461697426935037, "step": 215, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 
0.1591796875, "lambda4 Code": 0.259765625} [INFO|lh_trainer.py:331] 2026-02-17 04:28:45,786 >> {'loss': 12.4474, 'grad_norm': 0.44997331500053406, 'learning_rate': 0.000139427863502467, 'epoch': 0.22748815165876776, 'num_input_tokens_seen': 531535650, 'completed': '72.00% (216 / 300)', 'remaining time': '3:56:02', 'throughput': '7213.32', 'gpu_mem_free': '10675MB', 'step': 216} [Step 216 / Rank 2] Tasks: ['Code'] | Lens: [33878] → Tgt Spa: ['1.000'] [Step 216 / Rank 7] Tasks: ['MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [2679, 2680, 2681, 2696, 2685, 2699, 2683, 2682, 2701, 2688, 2702, 2702, 2686, 2685, 2686, 2688, 2704, 2687, 2689, 2689, 2690, 2690, 2689, 2707] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 216 / Rank 1] Tasks: ['Single QA'] | Lens: [59672] → Tgt Spa: ['0.350'] [Step 216 / Rank 3] Tasks: ['Code'] | Lens: [33878] → Tgt Spa: ['1.000'] [Step 216 / Rank 0] Tasks: ['Single QA'] | Lens: [59672] → Tgt Spa: ['0.350'] [Step 216 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27039, 27037] → Tgt Spa: ['1.000', '1.000'] [Step 216 / Rank 6] Tasks: ['MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | 
Lens: [2679, 2680, 2681, 2696, 2685, 2699, 2683, 2682, 2701, 2688, 2702, 2702, 2686, 2685, 2686, 2688, 2704, 2687, 2689, 2689, 2690, 2690, 2689, 2707] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 216 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27039, 27037] → Tgt Spa: ['1.000', '1.000'] [Step 216 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [24193, 24202] → Tgt Spa: ['0.350', '1.000'] [Step 216 / Rank 4] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16471, 16460, 16472] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 216 / Rank 6] Tasks: ['Single QA'] | Lens: [58837] → Tgt Spa: ['0.350'] [Step 216 / Rank 5] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16471, 16460, 16472] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 216 / Rank 1] Tasks: ['Single QA'] | Lens: [55975] → Tgt Spa: ['0.350'] [Step 216 / Rank 0] Tasks: ['Single QA'] | Lens: [55975] → Tgt Spa: ['0.350'] [Step 216 / Rank 7] Tasks: ['Single QA'] | Lens: [58837] → Tgt Spa: ['0.350'] [Step 216 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [24193, 24202] → Tgt Spa: ['0.350', '1.000'] [Step 216 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18291, 18280, 18284] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 216 / Rank 3] Tasks: ['Single QA'] | Lens: [55876] → Tgt Spa: ['0.350'] [Step 216 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [41728] → Tgt Spa: ['1.000'] [Step 216 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57125] → Tgt Spa: ['1.000'] [Step 216 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [41728] → Tgt Spa: ['1.000'] [Step 216 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57125] → Tgt Spa: ['1.000'] [Step 216 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18291, 18280, 18284] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 216 
/ Rank 2] Tasks: ['Single QA'] | Lens: [55876] → Tgt Spa: ['0.350'] [Step 216 / Rank 6] Tasks: ['Summarization', 'Code', 'Single QA'] | Lens: [18937, 18926, 18919] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 216 / Rank 4] Tasks: ['Code'] | Lens: [62927] → Tgt Spa: ['1.000'] [Step 216 / Rank 5] Tasks: ['Code'] | Lens: [62927] → Tgt Spa: ['1.000'] [Step 216 / Rank 7] Tasks: ['Summarization', 'Code', 'Single QA'] | Lens: [18937, 18926, 18919] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 216 / Rank 3] Tasks: ['Single QA'] | Lens: [41491] → Tgt Spa: ['0.350'] [Step 216 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21050, 21050, 21050] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 216 / Rank 2] Tasks: ['Single QA'] | Lens: [41491] → Tgt Spa: ['0.350'] [Step 216 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21050, 21050, 21050] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 216 / Rank 7] Tasks: ['Single QA'] | Lens: [49065] → Tgt Spa: ['0.350'] [Step 216 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17423, 17414, 17429] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 216 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17423, 17414, 17429] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 216 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [30008, 30013] → Tgt Spa: ['1.000', '1.000'] [Step 216 / Rank 5] Tasks: ['Single QA'] | Lens: [35564] → Tgt Spa: ['0.350'] [Step 216 / Rank 6] Tasks: ['Single QA'] | Lens: [49065] → Tgt Spa: ['0.350'] [Step 216 / Rank 4] Tasks: ['Single QA'] | Lens: [35564] → Tgt Spa: ['0.350'] [Step 216 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [30008, 30013] → Tgt Spa: ['1.000', '1.000'] [Step 216 / Rank 4] Tasks: ['Code'] | Lens: [46173] → Tgt Spa: ['1.000'] [Step 216 / Rank 6] Tasks: ['Single QA'] | Lens: [61160] → Tgt Spa: ['0.350'] [Step 216 / Rank 7] Tasks: ['Single QA'] | Lens: [61160] → Tgt Spa: ['0.350'] [Step 216 / Rank 5] Tasks: ['Code'] | Lens: [46173] → Tgt Spa: ['1.000'] 
[Step 216 / Rank 1] Tasks: ['Single QA'] | Lens: [52575] → Tgt Spa: ['0.350'] [Step 216 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23760, 23760] → Tgt Spa: ['0.350', '1.000'] [Step 216 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23760, 23760] → Tgt Spa: ['0.350', '1.000'] [Step 216 / Rank 0] Tasks: ['Single QA'] | Lens: [52575] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:31:16,217 >> @ 216 | Loss: 1.9251 | LM: 1.8664 | Reg: 0.0587 | Spa(Avg): 0.539 [INFO|lh_trainer.py:797] 2026-02-17 04:31:16,217 >> Statistic -> Code | Spa: 0.704 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 04:31:16,217 >> Statistic -> In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:31:16,217 >> Statistic -> MultiHop | Spa: 0.661 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:31:16,217 >> Statistic -> Single | Spa: 0.373 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:31:16,217 >> Statistic -> Summarization | Spa: 0.652 | Tgt: 1.000 | Z-Loss: 0.116 | [INFO|lh_trainer.py:810] 2026-02-17 04:31:16,220 >> [Micro-Log] {"loss": 1.9250682815909386, "lm_loss": 1.8664043378084898, "reg_loss": 0.05866393373192599, "model_sparsity(avg)": 0.5390142723917961, "Spa-Single QA sparsity": 0.3732638768851757, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02007075905567035, "Spa-In-Context Learning sparsity": 0.7129629651705424, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10683965558807056, "Spa-Code sparsity": 0.7037036915620168, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09591684676706791, "Spa-Summarization sparsity": 0.6517093961055462, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1158201528283266, "Spa-MultiHop QA sparsity": 0.6607142771993365, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 
0.13602978523288453, "step": 216, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.16015625, "lambda4 Code": 0.26171875} [INFO|lh_trainer.py:331] 2026-02-17 04:31:40,425 >> {'loss': 11.5504, 'grad_norm': 0.6070464849472046, 'learning_rate': 0.00013650241141487582, 'epoch': 0.2285413375460769, 'num_input_tokens_seen': 534061814, 'completed': '72.33% (217 / 300)', 'remaining time': '3:53:16', 'throughput': '7232.56', 'gpu_mem_free': '8375MB', 'step': 217} [Step 217 / Rank 6] Tasks: ['Single QA'] | Lens: [41012] → Tgt Spa: ['0.350'] [Step 217 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32696, 32696] → Tgt Spa: ['0.350', '0.350'] [Step 217 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44876] → Tgt Spa: ['1.000'] [Step 217 / Rank 7] Tasks: ['Single QA'] | Lens: [41012] → Tgt Spa: ['0.350'] [Step 217 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA'] | Lens: [32696, 32696] → Tgt Spa: ['0.350', '0.350'] [Step 217 / Rank 5] Tasks: ['Single QA'] | Lens: [49448] → Tgt Spa: ['0.350'] [Step 217 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44876] → Tgt Spa: ['1.000'] [Step 217 / Rank 4] Tasks: ['Single QA'] | Lens: [49448] → Tgt Spa: ['0.350'] [Step 217 / Rank 3] Tasks: ['Single QA'] | Lens: [55062] → Tgt Spa: ['0.350'] [Step 217 / Rank 5] Tasks: ['Single QA'] | Lens: [64034] → Tgt Spa: ['0.350'] [Step 217 / Rank 2] Tasks: ['Single QA'] | Lens: [55062] → Tgt Spa: ['0.350'] [Step 217 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code'] | Lens: [4523, 4524, 4525, 4526, 4527, 4527, 4529, 4547, 4530, 4530, 4530, 4530, 4539, 4541] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', 
'1.000'] [Step 217 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code'] | Lens: [4523, 4524, 4525, 4526, 4527, 4527, 4529, 4547, 4530, 4530, 4530, 4530, 4539, 4541] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 217 / Rank 4] Tasks: ['Single QA'] | Lens: [64034] → Tgt Spa: ['0.350'] [Step 217 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [51613] → Tgt Spa: ['1.000'] [Step 217 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [51613] → Tgt Spa: ['1.000'] [Step 217 / Rank 0] Tasks: ['Single QA'] | Lens: [63527] → Tgt Spa: ['0.350'] [Step 217 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [28456, 28461] → Tgt Spa: ['1.000', '0.350'] [Step 217 / Rank 1] Tasks: ['Single QA'] | Lens: [63527] → Tgt Spa: ['0.350'] [Step 217 / Rank 7] Tasks: ['Code', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [4844, 4845, 4839, 4839, 4839, 4842, 4843, 4844, 4845, 4844, 4845, 4845, 4846] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 217 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8003, 8003, 8003, 8003, 8004, 8004, 8004, 8004] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 217 / Rank 6] Tasks: ['Code', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 
'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning'] | Lens: [4844, 4845, 4839, 4839, 4839, 4842, 4843, 4844, 4845, 4844, 4845, 4845, 4846] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 217 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [28456, 28461] → Tgt Spa: ['1.000', '0.350'] [Step 217 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8003, 8003, 8003, 8003, 8004, 8004, 8004, 8004] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 217 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [38012] → Tgt Spa: ['1.000'] [Step 217 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38012] → Tgt Spa: ['1.000'] [Step 217 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [53346] → Tgt Spa: ['1.000'] [Step 217 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [53346] → Tgt Spa: ['1.000'] [Step 217 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [26506, 26500] → Tgt Spa: ['1.000', '1.000'] [Step 217 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [26506, 26500] → Tgt Spa: ['1.000', '1.000'] [Step 217 / Rank 2] Tasks: ['Single QA'] | Lens: [49214] → Tgt Spa: ['0.350'] [Step 217 / Rank 3] Tasks: ['Single QA'] | Lens: [49214] → Tgt Spa: ['0.350'] [Step 217 / Rank 3] Tasks: ['Single QA'] | Lens: [51289] → Tgt Spa: ['0.350'] [Step 217 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [22541, 22550] → Tgt Spa: ['1.000', '1.000'] [Step 217 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40863] → Tgt Spa: ['1.000'] [Step 217 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [22541, 22550] → Tgt Spa: ['1.000', '1.000'] [Step 217 / Rank 2] Tasks: ['Single QA'] | Lens: [51289] → Tgt Spa: ['0.350'] [Step 217 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40863] → Tgt Spa: ['1.000'] [Step 217 / Rank 4] 
Tasks: ['Single QA', 'Single QA'] | Lens: [29678, 29678] → Tgt Spa: ['0.350', '0.350'] [Step 217 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29678, 29678] → Tgt Spa: ['0.350', '0.350'] [Step 217 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27679, 27661] → Tgt Spa: ['1.000', '1.000'] [Step 217 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [59857] → Tgt Spa: ['1.000'] [Step 217 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8055, 8055, 8055, 8058, 8058, 8058, 8058, 8058] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 217 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [59857] → Tgt Spa: ['1.000'] [Step 217 / Rank 0] Tasks: ['Single QA'] | Lens: [60517] → Tgt Spa: ['0.350'] [Step 217 / Rank 1] Tasks: ['Single QA'] | Lens: [60517] → Tgt Spa: ['0.350'] [Step 217 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [8055, 8055, 8055, 8058, 8058, 8058, 8058, 8058] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 217 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [27679, 27661] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:34:08,969 >> @ 217 | Loss: 2.2997 | LM: 2.2345 | Reg: 0.0652 | Spa(Avg): 0.529 [INFO|lh_trainer.py:797] 2026-02-17 04:34:08,969 >> Statistic -> Code | Spa: 0.699 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 04:34:08,969 >> Statistic -> In-Context | Spa: 0.706 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:34:08,969 >> Statistic -> MultiHop | Spa: 0.368 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:34:08,969 >> Statistic -> Single | Spa: 0.441 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:34:08,969 >> Statistic -> Summarization | Spa: 0.653 | Tgt: 
1.000 | Z-Loss: 0.114 | [INFO|lh_trainer.py:810] 2026-02-17 04:34:08,971 >> [Micro-Log] {"loss": 2.2996697343575456, "lm_loss": 2.2345192236437774, "reg_loss": 0.06515050599894796, "model_sparsity(avg)": 0.5292308914164702, "Spa-In-Context Learning sparsity": 0.7062757169758832, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10964811151778256, "Spa-Single QA sparsity": 0.4408602080037517, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06145950097350344, "Spa-Summarization sparsity": 0.6527777512868246, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1142358457048734, "Spa-Code sparsity": 0.6990740696589152, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.097710316379865, "Spa-MultiHop QA sparsity": 0.3680555522441864, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.005848680273629725, "step": 217, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.16015625, "lambda4 Code": 0.26171875} [INFO|lh_trainer.py:331] 2026-02-17 04:34:32,828 >> {'loss': 13.798, 'grad_norm': 0.6833432912826538, 'learning_rate': 0.00013359640655908516, 'epoch': 0.229594523433386, 'num_input_tokens_seen': 536687100, 'completed': '72.67% (218 / 300)', 'remaining time': '3:50:29', 'throughput': '7613.81', 'gpu_mem_free': '5477MB', 'step': 218} [Step 218 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [35656] → Tgt Spa: ['1.000'] [Step 218 / Rank 1] Tasks: ['Single QA'] | Lens: [48384] → Tgt Spa: ['0.350'] [Step 218 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [35656] → Tgt Spa: ['1.000'] [Step 218 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29323, 29323] → Tgt Spa: ['0.350', '0.350'] [Step 218 / Rank 0] Tasks: ['Single QA'] | Lens: [48384] → Tgt Spa: ['0.350'] [Step 218 / Rank 3] Tasks: ['In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 
'Single QA', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6652, 6652, 6653, 6655, 6654, 6655, 6661, 6655, 6657] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 218 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29323, 29323] → Tgt Spa: ['0.350', '0.350'] [Step 218 / Rank 2] Tasks: ['In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [6652, 6652, 6653, 6655, 6654, 6655, 6661, 6655, 6657] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 218 / Rank 6] Tasks: ['Single QA'] | Lens: [58353] → Tgt Spa: ['0.350'] [Step 218 / Rank 7] Tasks: ['Single QA'] | Lens: [58353] → Tgt Spa: ['0.350'] [Step 218 / Rank 2] Tasks: ['Single QA'] | Lens: [37172] → Tgt Spa: ['0.350'] [Step 218 / Rank 3] Tasks: ['Single QA'] | Lens: [37172] → Tgt Spa: ['0.350'] [Step 218 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [60479] → Tgt Spa: ['1.000'] [Step 218 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [24661, 24650] → Tgt Spa: ['1.000', '1.000'] [Step 218 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [24661, 24650] → Tgt Spa: ['1.000', '1.000'] [Step 218 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [60479] → Tgt Spa: ['1.000'] [Step 218 / Rank 7] Tasks: ['Single QA'] | Lens: [39256] → Tgt Spa: ['0.350'] [Step 218 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56115] → Tgt Spa: ['1.000'] [Step 218 / Rank 5] Tasks: ['Code'] | Lens: [38817] → Tgt Spa: ['1.000'] [Step 218 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16703, 16716, 16704] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 218 / Rank 4] Tasks: ['Code'] | Lens: [38817] → Tgt Spa: ['1.000'] [Step 218 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56115] → Tgt Spa: ['1.000'] [Step 218 / Rank 6] Tasks: ['Single QA'] | Lens: [39256] → Tgt Spa: ['0.350'] [Step 218 / Rank 0] Tasks: ['Code', 
'Summarization', 'Code'] | Lens: [16703, 16716, 16704] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 218 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [18741, 18742, 18742] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 218 / Rank 7] Tasks: ['Single QA'] | Lens: [52001] → Tgt Spa: ['0.350'] [Step 218 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [18741, 18742, 18742] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 218 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [18413, 18413, 18414] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 218 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [18413, 18413, 18414] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 218 / Rank 6] Tasks: ['Single QA'] | Lens: [52001] → Tgt Spa: ['0.350'] [Step 218 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26362, 26361] → Tgt Spa: ['1.000', '1.000'] [Step 218 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26362, 26361] → Tgt Spa: ['1.000', '1.000'] [Step 218 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [15712, 15712, 15712, 15720] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000'] [Step 218 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59396] → Tgt Spa: ['1.000'] [Step 218 / Rank 7] Tasks: ['Single QA'] | Lens: [47068] → Tgt Spa: ['0.350'] [Step 218 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43270] → Tgt Spa: ['1.000'] [Step 218 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43270] → Tgt Spa: ['1.000'] [Step 218 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59396] → Tgt Spa: ['1.000'] [Step 218 / Rank 6] Tasks: ['Single QA'] | Lens: [47068] → Tgt Spa: ['0.350'] [Step 218 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [15712, 15712, 15712, 15720] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000'] [Step 218 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [26932, 26930] → Tgt Spa: ['1.000', '1.000'] [Step 218 / Rank 
4] Tasks: ['Code', 'Code'] | Lens: [31349, 31352] → Tgt Spa: ['1.000', '1.000'] [Step 218 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [31349, 31352] → Tgt Spa: ['1.000', '1.000'] [Step 218 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40560] → Tgt Spa: ['1.000'] [Step 218 / Rank 2] Tasks: ['Code', 'Code', 'Code', 'Code'] | Lens: [13333, 13341, 13343, 13347] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000'] [Step 218 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [26932, 26930] → Tgt Spa: ['1.000', '1.000'] [Step 218 / Rank 3] Tasks: ['Code', 'Code', 'Code', 'Code'] | Lens: [13333, 13341, 13343, 13347] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000'] [Step 218 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40560] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:36:53,836 >> @ 218 | Loss: 1.9270 | LM: 1.8509 | Reg: 0.0761 | Spa(Avg): 0.592 [INFO|lh_trainer.py:797] 2026-02-17 04:36:53,836 >> Statistic -> Code | Spa: 0.705 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 04:36:53,837 >> Statistic -> In-Context | Spa: 0.704 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:36:53,837 >> Statistic -> MultiHop | Spa: 0.368 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:36:53,837 >> Statistic -> Single | Spa: 0.419 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:36:53,837 >> Statistic -> Summarization | Spa: 0.686 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 04:36:53,839 >> [Micro-Log] {"loss": 1.9270312900965412, "lm_loss": 1.850943972589448, "reg_loss": 0.07608730194624513, "model_sparsity(avg)": 0.5920781890551249, "Spa-Single QA sparsity": 0.41851851145426433, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04483265237261851, "Spa-Summarization sparsity": 0.6861110925674438, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09779558330774307, "Spa-Code sparsity": 0.7048611119389534, 
"Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09547516144812107, "Spa-In-Context Learning sparsity": 0.7040598300787119, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11053182757817782, "Spa-MultiHop QA sparsity": 0.3680555522441864, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.005848680273629725, "step": 218, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.16015625, "lambda4 Code": 0.26171875} [INFO|lh_trainer.py:331] 2026-02-17 04:37:10,549 >> {'loss': 11.5622, 'grad_norm': 0.8076875805854797, 'learning_rate': 0.0001307103468640669, 'epoch': 0.23064770932069512, 'num_input_tokens_seen': 539150044, 'completed': '73.00% (219 / 300)', 'remaining time': '3:47:36', 'throughput': '7807.91', 'gpu_mem_free': '12163MB', 'step': 219} [Step 219 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [30551, 30555] → Tgt Spa: ['1.000', '1.000'] [Step 219 / Rank 7] Tasks: ['Single QA'] | Lens: [60566] → Tgt Spa: ['0.350'] [Step 219 / Rank 3] Tasks: ['Single QA'] | Lens: [44041] → Tgt Spa: ['0.350'] [Step 219 / Rank 0] Tasks: ['Single QA'] | Lens: [54309] → Tgt Spa: ['0.350'] [Step 219 / Rank 6] Tasks: ['Single QA'] | Lens: [60566] → Tgt Spa: ['0.350'] [Step 219 / Rank 1] Tasks: ['Single QA'] | Lens: [54309] → Tgt Spa: ['0.350'] [Step 219 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [30551, 30555] → Tgt Spa: ['1.000', '1.000'] [Step 219 / Rank 2] Tasks: ['Single QA'] | Lens: [44041] → Tgt Spa: ['0.350'] [Step 219 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [21988, 21982] → Tgt Spa: ['1.000', '1.000'] [Step 219 / Rank 2] Tasks: ['Single QA'] | Lens: [56317] → Tgt Spa: ['0.350'] [Step 219 / Rank 1] Tasks: ['Summarization'] | Lens: [34891] → Tgt Spa: ['1.000'] [Step 219 / Rank 3] Tasks: ['Single QA'] | Lens: [56317] → Tgt Spa: ['0.350'] [Step 219 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [21988, 21982] → Tgt Spa: 
['1.000', '1.000'] [Step 219 / Rank 7] Tasks: ['Code'] | Lens: [40961] → Tgt Spa: ['1.000'] [Step 219 / Rank 6] Tasks: ['Code'] | Lens: [40961] → Tgt Spa: ['1.000'] [Step 219 / Rank 0] Tasks: ['Summarization'] | Lens: [34891] → Tgt Spa: ['1.000'] [Step 219 / Rank 7] Tasks: ['Single QA'] | Lens: [62411] → Tgt Spa: ['0.350'] [Step 219 / Rank 5] Tasks: ['Code'] | Lens: [34450] → Tgt Spa: ['1.000'] [Step 219 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [28561, 28562] → Tgt Spa: ['1.000', '1.000'] [Step 219 / Rank 4] Tasks: ['Code'] | Lens: [34450] → Tgt Spa: ['1.000'] [Step 219 / Rank 1] Tasks: ['Single QA'] | Lens: [62459] → Tgt Spa: ['0.350'] [Step 219 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [28561, 28562] → Tgt Spa: ['1.000', '1.000'] [Step 219 / Rank 6] Tasks: ['Single QA'] | Lens: [62411] → Tgt Spa: ['0.350'] [Step 219 / Rank 0] Tasks: ['Single QA'] | Lens: [62459] → Tgt Spa: ['0.350'] [Step 219 / Rank 4] Tasks: ['Single QA'] | Lens: [58967] → Tgt Spa: ['0.350'] [Step 219 / Rank 0] Tasks: ['Code'] | Lens: [35661] → Tgt Spa: ['1.000'] [Step 219 / Rank 2] Tasks: ['MultiHop QA'] | Lens: [63727] → Tgt Spa: ['0.350'] [Step 219 / Rank 1] Tasks: ['Code'] | Lens: [35661] → Tgt Spa: ['1.000'] [Step 219 / Rank 5] Tasks: ['Single QA'] | Lens: [58967] → Tgt Spa: ['0.350'] [Step 219 / Rank 3] Tasks: ['MultiHop QA'] | Lens: [63727] → Tgt Spa: ['0.350'] [Step 219 / Rank 7] Tasks: ['Single QA'] | Lens: [60753] → Tgt Spa: ['0.350'] [Step 219 / Rank 6] Tasks: ['Single QA'] | Lens: [60753] → Tgt Spa: ['0.350'] [Step 219 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56328] → Tgt Spa: ['1.000'] [Step 219 / Rank 3] Tasks: ['Code'] | Lens: [62121] → Tgt Spa: ['1.000'] [Step 219 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43809] → Tgt Spa: ['1.000'] [Step 219 / Rank 0] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17759, 17769, 17770] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 219 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43809] → Tgt Spa: ['1.000'] [Step 
219 / Rank 2] Tasks: ['Code'] | Lens: [62121] → Tgt Spa: ['1.000'] [Step 219 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17759, 17769, 17770] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 219 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56328] → Tgt Spa: ['1.000'] [Step 219 / Rank 4] Tasks: ['Single QA'] | Lens: [55084] → Tgt Spa: ['0.350'] [Step 219 / Rank 1] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10712, 10723, 10723, 10723, 10724, 10726] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 219 / Rank 5] Tasks: ['Single QA'] | Lens: [55084] → Tgt Spa: ['0.350'] [Step 219 / Rank 0] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [10712, 10723, 10723, 10723, 10724, 10726] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350'] [Step 219 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [45947] → Tgt Spa: ['1.000'] [Step 219 / Rank 3] Tasks: ['Code'] | Lens: [36509] → Tgt Spa: ['1.000'] [Step 219 / Rank 2] Tasks: ['Code'] | Lens: [36509] → Tgt Spa: ['1.000'] [Step 219 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [45947] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:39:54,809 >> @ 219 | Loss: 1.8670 | LM: 1.8012 | Reg: 0.0658 | Spa(Avg): 0.549 [INFO|lh_trainer.py:797] 2026-02-17 04:39:54,809 >> Statistic -> Code | Spa: 0.703 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 04:39:54,809 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:39:54,809 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:39:54,809 >> Statistic -> Single | Spa: 0.405 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:39:54,809 >> Statistic -> Summarization | Spa: 0.597 | Tgt: 1.000 | Z-Loss: 0.145 | [INFO|lh_trainer.py:810] 2026-02-17 04:39:54,811 >> [Micro-Log] {"loss": 
1.8670255554219086, "lm_loss": 1.8012436516582966, "reg_loss": 0.0657818951100732, "model_sparsity(avg)": 0.5488040124376615, "Spa-Single QA sparsity": 0.4047618976661137, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.042913072276860476, "Spa-Summarization sparsity": 0.5972222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14456202586491904, "Spa-Code sparsity": 0.7025462985038757, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09662514987091224, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04388635233044624, "Spa-In-Context Learning sparsity": 0.7118055820465088, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10785099491477013, "step": 219, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.30859375, "lambda3 Summarization": 0.16015625, "lambda4 Code": 0.26171875} [INFO|lh_trainer.py:331] 2026-02-17 04:40:15,394 >> {'loss': 11.2022, 'grad_norm': 0.7293125987052917, 'learning_rate': 0.0001278447268412924, 'epoch': 0.23170089520800422, 'num_input_tokens_seen': 541648322, 'completed': '73.33% (220 / 300)', 'remaining time': '3:44:53', 'throughput': '6757.76', 'gpu_mem_free': '6819MB', 'step': 220} [Step 220 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40530] → Tgt Spa: ['1.000'] [Step 220 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40530] → Tgt Spa: ['1.000'] [Step 220 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24147, 24148] → Tgt Spa: ['1.000', '0.350'] [Step 220 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24147, 24148] → Tgt Spa: ['1.000', '0.350'] [Step 220 / Rank 7] Tasks: ['Single QA'] | Lens: [35959] → Tgt Spa: ['0.350'] [Step 220 / Rank 1] Tasks: ['Single QA'] | Lens: [42975] → Tgt Spa: ['0.350'] [Step 220 / Rank 6] Tasks: ['Single QA'] | Lens: [35959] → Tgt Spa: ['0.350'] [Step 220 / 
Rank 0] Tasks: ['Single QA'] | Lens: [42975] → Tgt Spa: ['0.350'] [Step 220 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [63659] → Tgt Spa: ['1.000'] [Step 220 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20510, 20501, 20504] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 220 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20510, 20501, 20504] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 220 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7929, 7929, 7932, 7932, 7932, 7932, 7932, 7932] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 220 / Rank 3] Tasks: ['Single QA'] | Lens: [52060] → Tgt Spa: ['0.350'] [Step 220 / Rank 2] Tasks: ['Single QA'] | Lens: [52060] → Tgt Spa: ['0.350'] [Step 220 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7929, 7929, 7932, 7932, 7932, 7932, 7932, 7932] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 220 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [63659] → Tgt Spa: ['1.000'] [Step 220 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15875, 15875, 15875, 15875] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 220 / Rank 7] Tasks: ['Summarization', 'Single QA'] | Lens: [32383, 32364] → Tgt Spa: ['1.000', '0.350'] [Step 220 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25185, 25186] → Tgt Spa: ['1.000', '0.350'] [Step 220 / Rank 3] Tasks: ['Single QA'] | Lens: [51308] → Tgt Spa: ['0.350'] [Step 220 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25185, 25186] → Tgt Spa: ['1.000', '0.350'] [Step 220 / Rank 6] Tasks: ['Summarization', 'Single QA'] | Lens: [32383, 32364] → Tgt Spa: ['1.000', '0.350'] [Step 220 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15875, 
15875, 15875, 15875] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 220 / Rank 2] Tasks: ['Single QA'] | Lens: [51308] → Tgt Spa: ['0.350'] [Step 220 / Rank 3] Tasks: ['Single QA'] | Lens: [47960] → Tgt Spa: ['0.350'] [Step 220 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23549, 23550] → Tgt Spa: ['1.000', '0.350'] [Step 220 / Rank 7] Tasks: ['Single QA'] | Lens: [60954] → Tgt Spa: ['0.350'] [Step 220 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26008, 26008] → Tgt Spa: ['1.000', '1.000'] [Step 220 / Rank 2] Tasks: ['Single QA'] | Lens: [47960] → Tgt Spa: ['0.350'] [Step 220 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26008, 26008] → Tgt Spa: ['1.000', '1.000'] [Step 220 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23549, 23550] → Tgt Spa: ['1.000', '0.350'] [Step 220 / Rank 6] Tasks: ['Single QA'] | Lens: [60954] → Tgt Spa: ['0.350'] [Step 220 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [30132, 30132] → Tgt Spa: ['0.350', '1.000'] [Step 220 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [30132, 30132] → Tgt Spa: ['0.350', '1.000'] [Step 220 / Rank 4] Tasks: ['Code'] | Lens: [59168] → Tgt Spa: ['1.000'] [Step 220 / Rank 5] Tasks: ['Code'] | Lens: [59168] → Tgt Spa: ['1.000'] [Step 220 / Rank 3] Tasks: ['Single QA'] | Lens: [46165] → Tgt Spa: ['0.350'] [Step 220 / Rank 6] Tasks: ['Code'] | Lens: [37364] → Tgt Spa: ['1.000'] [Step 220 / Rank 7] Tasks: ['Code'] | Lens: [37364] → Tgt Spa: ['1.000'] [Step 220 / Rank 2] Tasks: ['Single QA'] | Lens: [46165] → Tgt Spa: ['0.350'] [Step 220 / Rank 5] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19911, 19902, 19901] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 220 / Rank 2] Tasks: ['Summarization'] | Lens: [40869] → Tgt Spa: ['1.000'] [Step 220 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [30101, 30090] → Tgt Spa: ['1.000', '1.000'] [Step 220 / Rank 7] Tasks: ['In-Context Learning', 
'Code'] | Lens: [24937, 24945] → Tgt Spa: ['1.000', '1.000'] [Step 220 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [30101, 30090] → Tgt Spa: ['1.000', '1.000'] [Step 220 / Rank 3] Tasks: ['Summarization'] | Lens: [40869] → Tgt Spa: ['1.000'] [Step 220 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [24937, 24945] → Tgt Spa: ['1.000', '1.000'] [Step 220 / Rank 4] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19911, 19902, 19901] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:42:37,447 >> @ 220 | Loss: 2.0662 | LM: 1.9912 | Reg: 0.0750 | Spa(Avg): 0.547 [INFO|lh_trainer.py:797] 2026-02-17 04:42:37,447 >> Statistic -> Code | Spa: 0.703 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 04:42:37,447 >> Statistic -> In-Context | Spa: 0.708 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:42:37,447 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:42:37,447 >> Statistic -> Single | Spa: 0.443 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:42:37,447 >> Statistic -> Summarization | Spa: 0.647 | Tgt: 1.000 | Z-Loss: 0.118 | [INFO|lh_trainer.py:810] 2026-02-17 04:42:37,449 >> [Micro-Log] {"loss": 2.066207288453976, "lm_loss": 1.9911775055030982, "reg_loss": 0.07502978481352329, "model_sparsity(avg)": 0.5469473401705424, "Spa-Single QA sparsity": 0.4427083258827527, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06488663676039626, "Spa-Summarization sparsity": 0.6472222328186035, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11804334223270416, "Spa-Code sparsity": 0.7031249850988388, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09622678067535162, "Spa-In-Context Learning sparsity": 0.70833334657881, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10942047834396362, "Spa-MultiHop QA sparsity": 
0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04388635233044624, "step": 220, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.16015625, "lambda4 Code": 0.26171875} [INFO|lh_trainer.py:331] 2026-02-17 04:42:53,413 >> {'loss': 12.3972, 'grad_norm': 0.6351690292358398, 'learning_rate': 0.00012500003750000004, 'epoch': 0.23275408109531331, 'num_input_tokens_seen': 544168352, 'completed': '73.67% (221 / 300)', 'remaining time': '3:42:01', 'throughput': '7973.82', 'gpu_mem_free': '7565MB', 'step': 221} [Step 221 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32199, 32199] → Tgt Spa: ['0.350', '0.350'] [Step 221 / Rank 7] Tasks: ['Single QA'] | Lens: [44335] → Tgt Spa: ['0.350'] [Step 221 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32199, 32199] → Tgt Spa: ['0.350', '0.350'] [Step 221 / Rank 6] Tasks: ['Single QA'] | Lens: [44335] → Tgt Spa: ['0.350'] [Step 221 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [30756, 30757] → Tgt Spa: ['1.000', '0.350'] [Step 221 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [30756, 30757] → Tgt Spa: ['1.000', '0.350'] [Step 221 / Rank 5] Tasks: ['Single QA'] | Lens: [59080] → Tgt Spa: ['0.350'] [Step 221 / Rank 4] Tasks: ['Single QA'] | Lens: [59080] → Tgt Spa: ['0.350'] [Step 221 / Rank 5] Tasks: ['Single QA'] | Lens: [62198] → Tgt Spa: ['0.350'] [Step 221 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [26458, 26450] → Tgt Spa: ['1.000', '1.000'] [Step 221 / Rank 0] Tasks: ['Single QA'] | Lens: [40718] → Tgt Spa: ['0.350'] [Step 221 / Rank 2] Tasks: ['Code'] | Lens: [34633] → Tgt Spa: ['1.000'] [Step 221 / Rank 4] Tasks: ['Single QA'] | Lens: [62198] → Tgt Spa: ['0.350'] [Step 221 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [26458, 26450] → Tgt Spa: ['1.000', '1.000'] [Step 221 / Rank 3] Tasks: ['Code'] | Lens: [34633] → Tgt Spa: ['1.000'] [Step 221 / 
Rank 1] Tasks: ['Single QA'] | Lens: [40718] → Tgt Spa: ['0.350'] [Step 221 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [30583, 30565] → Tgt Spa: ['1.000', '1.000'] [Step 221 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [49570] → Tgt Spa: ['1.000'] [Step 221 / Rank 2] Tasks: ['Code'] | Lens: [60499] → Tgt Spa: ['1.000'] [Step 221 / Rank 0] Tasks: ['Single QA'] | Lens: [38346] → Tgt Spa: ['0.350'] [Step 221 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [30583, 30565] → Tgt Spa: ['1.000', '1.000'] [Step 221 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [49570] → Tgt Spa: ['1.000'] [Step 221 / Rank 3] Tasks: ['Code'] | Lens: [60499] → Tgt Spa: ['1.000'] [Step 221 / Rank 1] Tasks: ['Single QA'] | Lens: [38346] → Tgt Spa: ['0.350'] [Step 221 / Rank 4] Tasks: ['Single QA'] | Lens: [34835] → Tgt Spa: ['0.350'] [Step 221 / Rank 6] Tasks: ['Single QA'] | Lens: [58187] → Tgt Spa: ['0.350'] [Step 221 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [31862, 31863] → Tgt Spa: ['1.000', '1.000'] [Step 221 / Rank 7] Tasks: ['Single QA'] | Lens: [58187] → Tgt Spa: ['0.350'] [Step 221 / Rank 1] Tasks: ['Single QA'] | Lens: [51068] → Tgt Spa: ['0.350'] [Step 221 / Rank 0] Tasks: ['Single QA'] | Lens: [51068] → Tgt Spa: ['0.350'] [Step 221 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [31862, 31863] → Tgt Spa: ['1.000', '1.000'] [Step 221 / Rank 5] Tasks: ['Single QA'] | Lens: [34835] → Tgt Spa: ['0.350'] [Step 221 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [20482, 20483, 20495] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 221 / Rank 7] Tasks: ['Single QA'] | Lens: [58356] → Tgt Spa: ['0.350'] [Step 221 / Rank 6] Tasks: ['Single QA'] | Lens: [58356] → Tgt Spa: ['0.350'] [Step 221 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17296, 17297, 17310] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 221 / Rank 0] Tasks: ['Code', 'Single QA', 'Code', 'In-Context 
Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [6568, 6561, 6570, 6565, 6564, 6566, 6566, 6567, 6568] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 221 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [20482, 20483, 20495] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 221 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17296, 17297, 17310] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 221 / Rank 1] Tasks: ['Code', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [6568, 6561, 6570, 6565, 6564, 6566, 6566, 6567, 6568] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 221 / Rank 3] Tasks: ['Single QA'] | Lens: [64961] → Tgt Spa: ['0.350'] [Step 221 / Rank 0] Tasks: ['Code'] | Lens: [32895] → Tgt Spa: ['1.000'] [Step 221 / Rank 2] Tasks: ['Single QA'] | Lens: [64961] → Tgt Spa: ['0.350'] [Step 221 / Rank 1] Tasks: ['Code'] | Lens: [32895] → Tgt Spa: ['1.000'] [Step 221 / Rank 4] Tasks: ['Code'] | Lens: [56225] → Tgt Spa: ['1.000'] [Step 221 / Rank 7] Tasks: ['MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1601, 1620, 1601, 1602, 1600, 1620, 1620, 1621, 1620, 1621, 1620, 1602, 1621, 1604, 1604, 1604, 1624, 
1606, 1606, 1605, 1606, 1606, 1606, 1606, 1606, 1607, 1627, 1625, 1625, 1608, 1607, 1608, 1627, 1610, 1627, 1627, 1609, 1612, 1610, 1628] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 221 / Rank 5] Tasks: ['Code'] | Lens: [56225] → Tgt Spa: ['1.000'] [Step 221 / Rank 6] Tasks: ['MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1601, 1620, 1601, 1602, 1600, 1620, 1620, 1621, 1620, 1621, 1620, 1602, 1621, 1604, 1604, 1604, 1624, 1606, 1606, 1605, 1606, 1606, 1606, 1606, 1606, 1607, 1627, 1625, 1625, 1608, 1607, 1608, 1627, 1610, 1627, 1627, 1609, 1612, 1610, 1628] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 04:45:33,468 >> @ 221 | Loss: 1.9692 | LM: 1.9129 | Reg: 0.0562 | Spa(Avg): 0.530 
[INFO|lh_trainer.py:797] 2026-02-17 04:45:33,468 >> Statistic -> Code | Spa: 0.701 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 04:45:33,468 >> Statistic -> In-Context | Spa: 0.707 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:45:33,469 >> Statistic -> MultiHop | Spa: 0.549 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:45:33,469 >> Statistic -> Single | Spa: 0.398 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:45:33,469 >> Statistic -> Summarization | Spa: 0.617 | Tgt: 1.000 | Z-Loss: 0.133 | [INFO|lh_trainer.py:810] 2026-02-17 04:45:33,471 >> [Micro-Log] {"loss": 1.9691539493699868, "lm_loss": 1.9129180442541838, "reg_loss": 0.056235903835234545, "model_sparsity(avg)": 0.5297646621863047, "Spa-In-Context Learning sparsity": 0.7069444537162781, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10997165217995644, "Spa-Single QA sparsity": 0.3975694365799427, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03207367897266522, "Spa-Code sparsity": 0.7007575739513744, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09722431207245047, "Spa-Summarization sparsity": 0.6169590699045282, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.133010475651214, "Spa-MultiHop QA sparsity": 0.5486111119389534, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.07873535606389244, "step": 221, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.1611328125, "lambda4 Code": 0.26171875} [INFO|lh_trainer.py:331] 2026-02-17 04:46:01,363 >> {'loss': 11.8149, 'grad_norm': 0.5785694718360901, 'learning_rate': 0.00012217676626306417, 'epoch': 0.23380726698262244, 'num_input_tokens_seen': 546741482, 'completed': '74.00% (222 / 300)', 'remaining time': '3:39:19', 'throughput': '6845.26', 'gpu_mem_free': '14879MB', 
'step': 222} [Step 222 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [24404, 24399] → Tgt Spa: ['1.000', '1.000'] [Step 222 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [23404, 23396] → Tgt Spa: ['1.000', '0.350'] [Step 222 / Rank 6] Tasks: ['Single QA'] | Lens: [51489] → Tgt Spa: ['0.350'] [Step 222 / Rank 1] Tasks: ['Single QA'] | Lens: [42883] → Tgt Spa: ['0.350'] [Step 222 / Rank 7] Tasks: ['Single QA'] | Lens: [51489] → Tgt Spa: ['0.350'] [Step 222 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [23404, 23396] → Tgt Spa: ['1.000', '0.350'] [Step 222 / Rank 0] Tasks: ['Single QA'] | Lens: [42883] → Tgt Spa: ['0.350'] [Step 222 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [24404, 24399] → Tgt Spa: ['1.000', '1.000'] [Step 222 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [22537, 22550] → Tgt Spa: ['1.000', '1.000'] [Step 222 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [4470, 4463, 4463, 4464, 4464, 4465, 4483, 4473, 4466, 4467, 4485, 4467, 4467, 4487] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 222 / Rank 1] Tasks: ['Summarization'] | Lens: [41984] → Tgt Spa: ['1.000'] [Step 222 / Rank 3] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [4470, 4463, 4463, 4464, 4464, 4465, 4483, 4473, 4466, 4467, 4485, 4467, 4467, 4487] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 222 / 
Rank 6] Tasks: ['Summarization'] | Lens: [34832] → Tgt Spa: ['1.000'] [Step 222 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [22537, 22550] → Tgt Spa: ['1.000', '1.000'] [Step 222 / Rank 0] Tasks: ['Summarization'] | Lens: [41984] → Tgt Spa: ['1.000'] [Step 222 / Rank 7] Tasks: ['Summarization'] | Lens: [34832] → Tgt Spa: ['1.000'] [Step 222 / Rank 0] Tasks: ['Single QA'] | Lens: [62901] → Tgt Spa: ['0.350'] [Step 222 / Rank 2] Tasks: ['Single QA'] | Lens: [65078] → Tgt Spa: ['0.350'] [Step 222 / Rank 5] Tasks: ['Single QA'] | Lens: [51065] → Tgt Spa: ['0.350'] [Step 222 / Rank 1] Tasks: ['Single QA'] | Lens: [62901] → Tgt Spa: ['0.350'] [Step 222 / Rank 4] Tasks: ['Single QA'] | Lens: [51065] → Tgt Spa: ['0.350'] [Step 222 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17616, 17616, 17628] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 222 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17616, 17616, 17628] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 222 / Rank 3] Tasks: ['Single QA'] | Lens: [65078] → Tgt Spa: ['0.350'] [Step 222 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [23762, 23762] → Tgt Spa: ['0.350', '0.350'] [Step 222 / Rank 6] Tasks: ['Code'] | Lens: [43329] → Tgt Spa: ['1.000'] [Step 222 / Rank 3] Tasks: ['Single QA'] | Lens: [56348] → Tgt Spa: ['0.350'] [Step 222 / Rank 2] Tasks: ['Single QA'] | Lens: [56348] → Tgt Spa: ['0.350'] [Step 222 / Rank 7] Tasks: ['Code'] | Lens: [43329] → Tgt Spa: ['1.000'] [Step 222 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [23762, 23762] → Tgt Spa: ['0.350', '0.350'] [Step 222 / Rank 0] Tasks: ['Code'] | Lens: [33611] → Tgt Spa: ['1.000'] [Step 222 / Rank 1] Tasks: ['Code'] | Lens: [33611] → Tgt Spa: ['1.000'] [Step 222 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29978, 29978] → Tgt Spa: ['1.000', '0.350'] [Step 222 / Rank 6] Tasks: ['Single QA'] | Lens: [41283] → Tgt Spa: ['0.350'] [Step 222 / Rank 7] Tasks: ['Single QA'] | Lens: [41283] → Tgt Spa: 
['0.350'] [Step 222 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29978, 29978] → Tgt Spa: ['1.000', '0.350'] [Step 222 / Rank 1] Tasks: ['Code'] | Lens: [35396] → Tgt Spa: ['1.000'] [Step 222 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [24884, 24876] → Tgt Spa: ['1.000', '1.000'] [Step 222 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [24884, 24876] → Tgt Spa: ['1.000', '1.000'] [Step 222 / Rank 0] Tasks: ['Code'] | Lens: [35396] → Tgt Spa: ['1.000'] [Step 222 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29748, 29750] → Tgt Spa: ['0.350', '0.350'] [Step 222 / Rank 3] Tasks: ['Single QA'] | Lens: [44531] → Tgt Spa: ['0.350'] [Step 222 / Rank 2] Tasks: ['Single QA'] | Lens: [44531] → Tgt Spa: ['0.350'] [Step 222 / Rank 6] Tasks: ['Single QA'] | Lens: [48589] → Tgt Spa: ['0.350'] [Step 222 / Rank 7] Tasks: ['Single QA'] | Lens: [48589] → Tgt Spa: ['0.350'] [Step 222 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55246] → Tgt Spa: ['1.000'] [Step 222 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55246] → Tgt Spa: ['1.000'] [Step 222 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29748, 29750] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:48:13,669 >> @ 222 | Loss: 2.0045 | LM: 1.9384 | Reg: 0.0661 | Spa(Avg): 0.509 [INFO|lh_trainer.py:797] 2026-02-17 04:48:13,669 >> Statistic -> Code | Spa: 0.701 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 04:48:13,669 >> Statistic -> In-Context | Spa: 0.705 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:48:13,669 >> Statistic -> MultiHop | Spa: 0.549 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:48:13,669 >> Statistic -> Single | Spa: 0.372 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:48:13,669 >> Statistic -> Summarization | Spa: 0.560 | Tgt: 1.000 | Z-Loss: 0.169 | [INFO|lh_trainer.py:810] 2026-02-17 04:48:13,672 >> [Micro-Log] {"loss": 
2.0044864689310393, "lm_loss": 1.938409412279725, "reg_loss": 0.0660770540125668, "model_sparsity(avg)": 0.5092730323473612, "Spa-Single QA sparsity": 0.3723958246409893, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01703497121343389, "Spa-Summarization sparsity": 0.5595237953322274, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.16900237330368587, "Spa-Code sparsity": 0.7007575793699785, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09712902456521988, "Spa-In-Context Learning sparsity": 0.7048611144224802, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11089304399987061, "Spa-MultiHop QA sparsity": 0.5486111119389534, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.07873535606389244, "step": 222, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.1611328125, "lambda4 Code": 0.26171875} [INFO|lh_trainer.py:331] 2026-02-17 04:48:33,966 >> {'loss': 12.0269, 'grad_norm': 0.674688994884491, 'learning_rate': 0.00011937539688347693, 'epoch': 0.23486045286993154, 'num_input_tokens_seen': 549104356, 'completed': '74.33% (223 / 300)', 'remaining time': '3:36:25', 'throughput': '7741.89', 'gpu_mem_free': '7199MB', 'step': 223} [Step 223 / Rank 5] Tasks: ['Single QA'] | Lens: [35100] → Tgt Spa: ['0.350'] [Step 223 / Rank 4] Tasks: ['Single QA'] | Lens: [35100] → Tgt Spa: ['0.350'] [Step 223 / Rank 7] Tasks: ['Single QA'] | Lens: [58262] → Tgt Spa: ['0.350'] [Step 223 / Rank 0] Tasks: ['Single QA'] | Lens: [58154] → Tgt Spa: ['0.350'] [Step 223 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43814] → Tgt Spa: ['1.000'] [Step 223 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43814] → Tgt Spa: ['1.000'] [Step 223 / Rank 6] Tasks: ['Single QA'] | Lens: [58262] → Tgt Spa: ['0.350'] [Step 223 / Rank 1] Tasks: ['Single QA'] | Lens: [58154] → Tgt Spa: ['0.350'] [Step 223 
/ Rank 5] Tasks: ['Summarization'] | Lens: [41894] → Tgt Spa: ['1.000'] [Step 223 / Rank 4] Tasks: ['Summarization'] | Lens: [41894] → Tgt Spa: ['1.000'] [Step 223 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [59156] → Tgt Spa: ['1.000'] [Step 223 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [20269, 20273, 20270] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 223 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [20269, 20273, 20270] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 223 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [59156] → Tgt Spa: ['1.000'] [Step 223 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [24504, 24498] → Tgt Spa: ['1.000', '1.000'] [Step 223 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [24504, 24498] → Tgt Spa: ['1.000', '1.000'] [Step 223 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [36457] → Tgt Spa: ['1.000'] [Step 223 / Rank 2] Tasks: ['Single QA'] | Lens: [36226] → Tgt Spa: ['0.350'] [Step 223 / Rank 3] Tasks: ['Single QA'] | Lens: [36226] → Tgt Spa: ['0.350'] [Step 223 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [36457] → Tgt Spa: ['1.000'] [Step 223 / Rank 0] Tasks: ['Single QA'] | Lens: [62088] → Tgt Spa: ['0.350'] [Step 223 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [62636] → Tgt Spa: ['1.000'] [Step 223 / Rank 1] Tasks: ['Single QA'] | Lens: [62088] → Tgt Spa: ['0.350'] [Step 223 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [62636] → Tgt Spa: ['1.000'] [Step 223 / Rank 2] Tasks: ['Single QA'] | Lens: [39461] → Tgt Spa: ['0.350'] [Step 223 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25030, 25031] → Tgt Spa: ['1.000', '1.000'] [Step 223 / Rank 5] Tasks: ['Code'] | Lens: [34217] → Tgt Spa: ['1.000'] [Step 223 / Rank 1] Tasks: ['Single QA'] | Lens: [54852] → Tgt Spa: ['0.350'] [Step 223 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25030, 25031] → Tgt Spa: ['1.000', '1.000'] [Step 223 / Rank 3] Tasks: ['Single QA'] | Lens: [39461] 
→ Tgt Spa: ['0.350'] [Step 223 / Rank 0] Tasks: ['Single QA'] | Lens: [54852] → Tgt Spa: ['0.350'] [Step 223 / Rank 4] Tasks: ['Code'] | Lens: [34217] → Tgt Spa: ['1.000'] [Step 223 / Rank 4] Tasks: ['In-Context Learning', 'Code', 'Code'] | Lens: [21717, 21728, 21727] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 223 / Rank 5] Tasks: ['In-Context Learning', 'Code', 'Code'] | Lens: [21717, 21728, 21727] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 223 / Rank 6] Tasks: ['Single QA'] | Lens: [60744] → Tgt Spa: ['0.350'] [Step 223 / Rank 3] Tasks: ['Single QA'] | Lens: [35194] → Tgt Spa: ['0.350'] [Step 223 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [23762, 23768] → Tgt Spa: ['0.350', '1.000'] [Step 223 / Rank 7] Tasks: ['Single QA'] | Lens: [60744] → Tgt Spa: ['0.350'] [Step 223 / Rank 2] Tasks: ['Single QA'] | Lens: [35194] → Tgt Spa: ['0.350'] [Step 223 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [23762, 23768] → Tgt Spa: ['0.350', '1.000'] [Step 223 / Rank 5] Tasks: ['Single QA'] | Lens: [33351] → Tgt Spa: ['0.350'] [Step 223 / Rank 7] Tasks: ['Single QA'] | Lens: [56760] → Tgt Spa: ['0.350'] [Step 223 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [21875, 21875] → Tgt Spa: ['0.350', '0.350'] [Step 223 / Rank 4] Tasks: ['Single QA'] | Lens: [33351] → Tgt Spa: ['0.350'] [Step 223 / Rank 6] Tasks: ['Single QA'] | Lens: [56760] → Tgt Spa: ['0.350'] [Step 223 / Rank 3] Tasks: ['Code'] | Lens: [37866] → Tgt Spa: ['1.000'] [Step 223 / Rank 2] Tasks: ['Code'] | Lens: [37866] → Tgt Spa: ['1.000'] [Step 223 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [21875, 21875] → Tgt Spa: ['0.350', '0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:51:10,603 >> @ 223 | Loss: 2.1977 | LM: 2.1466 | Reg: 0.0511 | Spa(Avg): 0.531 [INFO|lh_trainer.py:797] 2026-02-17 04:51:10,603 >> Statistic -> Code | Spa: 0.718 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 04:51:10,603 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 04:51:10,603 >> Statistic -> MultiHop | Spa: 0.549 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:51:10,603 >> Statistic -> Single | Spa: 0.360 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:51:10,603 >> Statistic -> Summarization | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.087 | [INFO|lh_trainer.py:810] 2026-02-17 04:51:10,605 >> [Micro-Log] {"loss": 2.1976790570964417, "lm_loss": 2.1465963311493397, "reg_loss": 0.05108274639739344, "model_sparsity(avg)": 0.5309606405595938, "Spa-Single QA sparsity": 0.36011904052325655, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.007275039454855557, "Spa-Code sparsity": 0.7175925837622749, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09067453278435601, "Spa-In-Context Learning sparsity": 0.715277761220932, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10651407949626446, "Spa-Summarization sparsity": 0.7083333730697632, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08720565587282181, "Spa-MultiHop QA sparsity": 0.5486111119389534, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.07873535606389244, "step": 223, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.1611328125, "lambda4 Code": 0.26171875} [INFO|lh_trainer.py:331] 2026-02-17 04:51:31,924 >> {'loss': 13.1861, 'grad_norm': 0.5755442976951599, 'learning_rate': 0.00011659640936146005, 'epoch': 0.23591363875724064, 'num_input_tokens_seen': 551429474, 'completed': '74.67% (224 / 300)', 'remaining time': '3:33:39', 'throughput': '6532.76', 'gpu_mem_free': '11711MB', 'step': 224} [Step 224 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [12329, 12331, 12346, 12344, 12346] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350'] [Step 224 / Rank 0] Tasks: 
['In-Context Learning'] | Lens: [39327] → Tgt Spa: ['1.000'] [Step 224 / Rank 4] Tasks: ['Single QA'] | Lens: [48537] → Tgt Spa: ['0.350'] [Step 224 / Rank 5] Tasks: ['Single QA'] | Lens: [48537] → Tgt Spa: ['0.350'] [Step 224 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [29848, 29849] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [12329, 12331, 12346, 12344, 12346] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350'] [Step 224 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [29848, 29849] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39327] → Tgt Spa: ['1.000'] [Step 224 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41418] → Tgt Spa: ['1.000'] [Step 224 / Rank 6] Tasks: ['Single QA'] | Lens: [38327] → Tgt Spa: ['0.350'] [Step 224 / Rank 4] Tasks: ['Single QA'] | Lens: [61564] → Tgt Spa: ['0.350'] [Step 224 / Rank 0] Tasks: ['Code'] | Lens: [44433] → Tgt Spa: ['1.000'] [Step 224 / Rank 7] Tasks: ['Single QA'] | Lens: [38327] → Tgt Spa: ['0.350'] [Step 224 / Rank 1] Tasks: ['Code'] | Lens: [44433] → Tgt Spa: ['1.000'] [Step 224 / Rank 5] Tasks: ['Single QA'] | Lens: [61564] → Tgt Spa: ['0.350'] [Step 224 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41418] → Tgt Spa: ['1.000'] [Step 224 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26978, 26979] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 2] Tasks: ['Single QA'] | Lens: [41578] → Tgt Spa: ['0.350'] [Step 224 / Rank 7] Tasks: ['Single QA', 'MultiHop QA'] | Lens: [32699, 32698] → Tgt Spa: ['0.350', '0.350'] [Step 224 / Rank 3] Tasks: ['Single QA'] | Lens: [41578] → Tgt Spa: ['0.350'] [Step 224 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26978, 26979] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23948, 23951] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 0] Tasks: 
['In-Context Learning', 'In-Context Learning'] | Lens: [23948, 23951] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 6] Tasks: ['Single QA', 'MultiHop QA'] | Lens: [32699, 32698] → Tgt Spa: ['0.350', '0.350'] [Step 224 / Rank 1] Tasks: ['Single QA'] | Lens: [61321] → Tgt Spa: ['0.350'] [Step 224 / Rank 4] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [9727, 9737, 9738, 9734, 9742, 9735] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 224 / Rank 5] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [9727, 9737, 9738, 9734, 9742, 9735] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 224 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [26570, 26559] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 0] Tasks: ['Single QA'] | Lens: [61321] → Tgt Spa: ['0.350'] [Step 224 / Rank 7] Tasks: ['Single QA'] | Lens: [37127] → Tgt Spa: ['0.350'] [Step 224 / Rank 6] Tasks: ['Single QA'] | Lens: [37127] → Tgt Spa: ['0.350'] [Step 224 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [26570, 26559] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 4] Tasks: ['Single QA'] | Lens: [64327] → Tgt Spa: ['0.350'] [Step 224 / Rank 5] Tasks: ['Single QA'] | Lens: [64327] → Tgt Spa: ['0.350'] [Step 224 / Rank 6] Tasks: ['Single QA'] | Lens: [42613] → Tgt Spa: ['0.350'] [Step 224 / Rank 7] Tasks: ['Single QA'] | Lens: [42613] → Tgt Spa: ['0.350'] [Step 224 / Rank 2] Tasks: ['Single QA'] | Lens: [53245] → Tgt Spa: ['0.350'] [Step 224 / Rank 3] Tasks: ['Single QA'] | Lens: [53245] → Tgt Spa: ['0.350'] [Step 224 / Rank 0] Tasks: ['Summarization'] | Lens: [37339] → Tgt Spa: ['1.000'] [Step 224 / Rank 1] Tasks: ['Summarization'] | Lens: [37339] → Tgt Spa: ['1.000'] [Step 224 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43859] → Tgt Spa: ['1.000'] [Step 224 / Rank 2] Tasks: ['Single QA'] | Lens: [58402] → Tgt Spa: ['0.350'] [Step 224 / Rank 6] Tasks: ['In-Context 
Learning'] | Lens: [43859] → Tgt Spa: ['1.000'] [Step 224 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [24755, 24764] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 4] Tasks: ['Single QA'] | Lens: [52073] → Tgt Spa: ['0.350'] [Step 224 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [24755, 24764] → Tgt Spa: ['1.000', '1.000'] [Step 224 / Rank 3] Tasks: ['Single QA'] | Lens: [58402] → Tgt Spa: ['0.350'] [Step 224 / Rank 5] Tasks: ['Single QA'] | Lens: [52073] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:54:03,374 >> @ 224 | Loss: 2.0579 | LM: 2.0020 | Reg: 0.0560 | Spa(Avg): 0.519 [INFO|lh_trainer.py:797] 2026-02-17 04:54:03,374 >> Statistic -> Code | Spa: 0.716 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 04:54:03,375 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:54:03,375 >> Statistic -> MultiHop | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:54:03,375 >> Statistic -> Single | Spa: 0.408 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:54:03,375 >> Statistic -> Summarization | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.155 | [INFO|lh_trainer.py:810] 2026-02-17 04:54:03,377 >> [Micro-Log] {"loss": 2.0579250211206577, "lm_loss": 2.0019532643103353, "reg_loss": 0.0559717665213005, "model_sparsity(avg)": 0.5192515378197035, "Spa-In-Context Learning sparsity": 0.709876537322998, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10879022959205839, "Spa-Code sparsity": 0.7160493797726102, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09181751310825348, "Spa-Single QA sparsity": 0.40817900829845005, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.039970346635931894, "Spa-Summarization sparsity": 0.5833333134651184, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15503232181072235, "Spa-MultiHop QA 
sparsity": 0.3888888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.012821914628148079, "step": 224, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.1611328125, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 04:54:26,147 >> {'loss': 12.3476, 'grad_norm': 0.553242027759552, 'learning_rate': 0.00011384027986221911, 'epoch': 0.23696682464454977, 'num_input_tokens_seen': 553859868, 'completed': '75.00% (225 / 300)', 'remaining time': '3:30:52', 'throughput': '6974.94', 'gpu_mem_free': '11709MB', 'step': 225} [Step 225 / Rank 7] Tasks: ['Code'] | Lens: [35216] → Tgt Spa: ['1.000'] [Step 225 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [25254, 25254] → Tgt Spa: ['0.350', '0.350'] [Step 225 / Rank 2] Tasks: ['Single QA'] | Lens: [61999] → Tgt Spa: ['0.350'] [Step 225 / Rank 6] Tasks: ['Code'] | Lens: [35216] → Tgt Spa: ['1.000'] [Step 225 / Rank 0] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [19045, 19046, 19038] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 225 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [25254, 25254] → Tgt Spa: ['0.350', '0.350'] [Step 225 / Rank 1] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [19045, 19046, 19038] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 225 / Rank 3] Tasks: ['Single QA'] | Lens: [61999] → Tgt Spa: ['0.350'] [Step 225 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [7159, 7161, 7162, 7164, 7168, 7169, 7169, 7169, 7171] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 225 / Rank 4] Tasks: ['Code'] | Lens: [42129] → Tgt Spa: ['1.000'] [Step 225 / Rank 1] Tasks: ['Single QA'] | Lens: [57699] → Tgt Spa: ['0.350'] [Step 225 / Rank 0] Tasks: ['Single QA'] | Lens: [57699] → Tgt Spa: ['0.350'] [Step 225 / Rank 5] Tasks: ['Code'] | 
Lens: [42129] → Tgt Spa: ['1.000'] [Step 225 / Rank 7] Tasks: ['Code'] | Lens: [59668] → Tgt Spa: ['1.000'] [Step 225 / Rank 6] Tasks: ['Code'] | Lens: [59668] → Tgt Spa: ['1.000'] [Step 225 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [7159, 7161, 7162, 7164, 7168, 7169, 7169, 7169, 7171] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 225 / Rank 2] Tasks: ['Single QA'] | Lens: [63696] → Tgt Spa: ['0.350'] [Step 225 / Rank 1] Tasks: ['Single QA'] | Lens: [52581] → Tgt Spa: ['0.350'] [Step 225 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28420, 28422] → Tgt Spa: ['1.000', '1.000'] [Step 225 / Rank 3] Tasks: ['Single QA'] | Lens: [63696] → Tgt Spa: ['0.350'] [Step 225 / Rank 0] Tasks: ['Single QA'] | Lens: [52581] → Tgt Spa: ['0.350'] [Step 225 / Rank 5] Tasks: ['Single QA'] | Lens: [63519] → Tgt Spa: ['0.350'] [Step 225 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28420, 28422] → Tgt Spa: ['1.000', '1.000'] [Step 225 / Rank 4] Tasks: ['Single QA'] | Lens: [63519] → Tgt Spa: ['0.350'] [Step 225 / Rank 7] Tasks: ['Single QA'] | Lens: [41931] → Tgt Spa: ['0.350'] [Step 225 / Rank 1] Tasks: ['Single QA'] | Lens: [42539] → Tgt Spa: ['0.350'] [Step 225 / Rank 0] Tasks: ['Single QA'] | Lens: [42539] → Tgt Spa: ['0.350'] [Step 225 / Rank 6] Tasks: ['Single QA'] | Lens: [41931] → Tgt Spa: ['0.350'] [Step 225 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22050, 22071] → Tgt Spa: ['1.000', '1.000'] [Step 225 / Rank 5] Tasks: ['Single QA'] | Lens: [62491] → Tgt Spa: ['0.350'] [Step 225 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22050, 22071] → Tgt Spa: ['1.000', '1.000'] [Step 225 / Rank 4] Tasks: ['Single QA'] | Lens: [62491] → Tgt Spa: ['0.350'] [Step 225 / Rank 5] Tasks: ['Code'] | Lens: [42332] → Tgt Spa: 
['1.000'] [Step 225 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [31106, 31114] → Tgt Spa: ['0.350', '1.000'] [Step 225 / Rank 0] Tasks: ['Single QA'] | Lens: [58746] → Tgt Spa: ['0.350'] [Step 225 / Rank 2] Tasks: ['Single QA'] | Lens: [37936] → Tgt Spa: ['0.350'] [Step 225 / Rank 1] Tasks: ['Single QA'] | Lens: [58746] → Tgt Spa: ['0.350'] [Step 225 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [31106, 31114] → Tgt Spa: ['0.350', '1.000'] [Step 225 / Rank 3] Tasks: ['Single QA'] | Lens: [37936] → Tgt Spa: ['0.350'] [Step 225 / Rank 4] Tasks: ['Code'] | Lens: [42332] → Tgt Spa: ['1.000'] [Step 225 / Rank 5] Tasks: ['Code'] | Lens: [38706] → Tgt Spa: ['1.000'] [Step 225 / Rank 3] Tasks: ['Single QA'] | Lens: [36153] → Tgt Spa: ['0.350'] [Step 225 / Rank 0] Tasks: ['Single QA'] | Lens: [52651] → Tgt Spa: ['0.350'] [Step 225 / Rank 4] Tasks: ['Code'] | Lens: [38706] → Tgt Spa: ['1.000'] [Step 225 / Rank 7] Tasks: ['Code'] | Lens: [58040] → Tgt Spa: ['1.000'] [Step 225 / Rank 2] Tasks: ['Single QA'] | Lens: [36153] → Tgt Spa: ['0.350'] [Step 225 / Rank 6] Tasks: ['Code'] | Lens: [58040] → Tgt Spa: ['1.000'] [Step 225 / Rank 1] Tasks: ['Single QA'] | Lens: [52651] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 04:57:13,077 >> @ 225 | Loss: 1.7304 | LM: 1.6812 | Reg: 0.0493 | Spa(Avg): 0.510 [INFO|lh_trainer.py:797] 2026-02-17 04:57:13,077 >> Statistic -> Code | Spa: 0.715 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 04:57:13,077 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:57:13,077 >> Statistic -> MultiHop | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:57:13,077 >> Statistic -> Single | Spa: 0.438 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 04:57:13,077 >> Statistic -> Summarization | Spa: 0.653 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:810] 2026-02-17 04:57:13,080 >> [Micro-Log] {"loss": 
1.7304279282689095, "lm_loss": 1.6811536693324645, "reg_loss": 0.04927426083789518, "model_sparsity(avg)": 0.5100308557351431, "Spa-Code sparsity": 0.7145061757829454, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09242692010270225, "Spa-Single QA sparsity": 0.43840579364610754, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05938509225582137, "Spa-In-Context Learning sparsity": 0.7194444417953492, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10486088246107102, "Spa-Summarization sparsity": 0.6527777910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11293387413024902, "Spa-MultiHop QA sparsity": 0.3888888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.012821914628148079, "step": 225, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.1611328125, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 04:57:34,503 >> {'loss': 10.3826, 'grad_norm': 0.5550587773323059, 'learning_rate': 0.00011110748063435535, 'epoch': 0.23802001053185887, 'num_input_tokens_seen': 556346556, 'completed': '75.33% (226 / 300)', 'remaining time': '3:28:10', 'throughput': '6601.05', 'gpu_mem_free': '8493MB', 'step': 226} [Step 226 / Rank 3] Tasks: ['Summarization', 'Summarization', 'In-Context Learning'] | Lens: [19023, 19024, 19006] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 226 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31080, 31083] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 0] Tasks: ['Single QA'] | Lens: [42188] → Tgt Spa: ['0.350'] [Step 226 / Rank 1] Tasks: ['Single QA'] | Lens: [42188] → Tgt Spa: ['0.350'] [Step 226 / Rank 4] Tasks: ['Single QA'] | Lens: [65036] → Tgt Spa: ['0.350'] [Step 226 / Rank 2] Tasks: ['Summarization', 'Summarization', 'In-Context Learning'] | Lens: [19023, 19024, 19006] → Tgt Spa: ['1.000', '1.000', '1.000'] 
[Step 226 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31080, 31083] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 5] Tasks: ['Single QA'] | Lens: [65036] → Tgt Spa: ['0.350'] [Step 226 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58324] → Tgt Spa: ['1.000'] [Step 226 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58324] → Tgt Spa: ['1.000'] [Step 226 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [27078, 27078] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [23766, 23766] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [27078, 27078] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 7] Tasks: ['Single QA'] | Lens: [57531] → Tgt Spa: ['0.350'] [Step 226 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [23766, 23766] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 6] Tasks: ['Single QA'] | Lens: [57531] → Tgt Spa: ['0.350'] [Step 226 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29717, 29718] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29717, 29718] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42240] → Tgt Spa: ['1.000'] [Step 226 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [27076, 27076] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [27076, 27076] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42240] → Tgt Spa: ['1.000'] [Step 226 / Rank 2] Tasks: ['Single QA'] | Lens: [58109] → Tgt Spa: ['0.350'] [Step 226 / Rank 3] Tasks: ['Single QA'] | Lens: [58109] → Tgt Spa: ['0.350'] [Step 226 / Rank 0] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 226 / Rank 3] Tasks: ['Single QA'] | Lens: [47448] → Tgt Spa: ['0.350'] [Step 226 / Rank 6] Tasks: ['Single QA'] | Lens: [41816] → Tgt Spa: ['0.350'] [Step 226 / Rank 5] Tasks: ['Single QA'] | Lens: [44219] → 
Tgt Spa: ['0.350'] [Step 226 / Rank 7] Tasks: ['Single QA'] | Lens: [41816] → Tgt Spa: ['0.350'] [Step 226 / Rank 2] Tasks: ['Single QA'] | Lens: [47448] → Tgt Spa: ['0.350'] [Step 226 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 226 / Rank 4] Tasks: ['Single QA'] | Lens: [44219] → Tgt Spa: ['0.350'] [Step 226 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA'] | Lens: [1994, 1995, 2015, 2015, 1996, 2017, 2018, 2000, 2001, 2020, 2004, 2004, 2021, 2021, 2023, 2003, 2005, 2005, 2024, 2006, 2006, 2006, 2008, 2025, 2026, 2026, 2006, 2027, 2028, 2013, 2030, 2011] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 226 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56179] → Tgt Spa: ['1.000'] [Step 226 / Rank 4] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [11579, 11581, 11585, 11592, 11591] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000'] [Step 226 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22392, 22392] → Tgt Spa: ['1.000', '1.000'] [Step 226 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 
'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'In-Context Learning', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA'] | Lens: [1994, 1995, 2015, 2015, 1996, 2017, 2018, 2000, 2001, 2020, 2004, 2004, 2021, 2021, 2023, 2003, 2005, 2005, 2024, 2006, 2006, 2006, 2008, 2025, 2026, 2026, 2006, 2027, 2028, 2013, 2030, 2011] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 226 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22392, 22392] → Tgt Spa: ['1.000', '1.000'] [Step 226 / Rank 5] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [11579, 11581, 11585, 11592, 11591] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000'] [Step 226 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56179] → Tgt Spa: ['1.000'] [Step 226 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [21907, 21915] → Tgt Spa: ['1.000', '1.000'] [Step 226 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32164, 32164] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 3] Tasks: ['Single QA'] | Lens: [60960] → Tgt Spa: ['0.350'] [Step 226 / Rank 1] Tasks: ['Single QA'] | Lens: [39856] → Tgt Spa: ['0.350'] [Step 226 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32164, 32164] → Tgt Spa: ['0.350', '0.350'] [Step 226 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [21907, 21915] → Tgt Spa: ['1.000', '1.000'] [Step 226 / Rank 0] Tasks: ['Single QA'] | Lens: [39856] → Tgt Spa: ['0.350'] [Step 226 / Rank 2] Tasks: ['Single QA'] | Lens: [60960] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:00:18,080 >> @ 
226 | Loss: 2.1990 | LM: 2.1494 | Reg: 0.0496 | Spa(Avg): 0.486 [INFO|lh_trainer.py:797] 2026-02-17 05:00:18,080 >> Statistic -> Code | Spa: 0.718 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 05:00:18,080 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:00:18,080 >> Statistic -> MultiHop | Spa: 0.599 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:00:18,080 >> Statistic -> Single | Spa: 0.386 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:00:18,080 >> Statistic -> Summarization | Spa: 0.663 | Tgt: 1.000 | Z-Loss: 0.110 | [INFO|lh_trainer.py:810] 2026-02-17 05:00:18,082 >> [Micro-Log] {"loss": 2.19900020196413, "lm_loss": 2.1493585454300046, "reg_loss": 0.0496416905176981, "model_sparsity(avg)": 0.48635585233569145, "Spa-Single QA sparsity": 0.3863636282357303, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.025831478213976054, "Spa-In-Context Learning sparsity": 0.710069440305233, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10882723610848188, "Spa-MultiHop QA sparsity": 0.5989583283662796, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10475413803942502, "Spa-Summarization sparsity": 0.6625816962298225, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10983868118594675, "Spa-Code sparsity": 0.7175925771395365, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.091230608522892, "step": 226, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.162109375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 05:00:42,209 >> {'loss': 13.194, 'grad_norm': 0.4513484835624695, 'learning_rate': 0.00010839847992894778, 'epoch': 0.239073196419168, 'num_input_tokens_seen': 558924540, 'completed': '75.67% (227 / 300)', 'remaining time': 
'3:25:27', 'throughput': '6867.10', 'gpu_mem_free': '11979MB', 'step': 227} [Step 227 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [59979] → Tgt Spa: ['1.000'] [Step 227 / Rank 3] Tasks: ['Code'] | Lens: [60246] → Tgt Spa: ['1.000'] [Step 227 / Rank 2] Tasks: ['Code'] | Lens: [60246] → Tgt Spa: ['1.000'] [Step 227 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59747] → Tgt Spa: ['1.000'] [Step 227 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [22875, 22868] → Tgt Spa: ['1.000', '1.000'] [Step 227 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [22875, 22868] → Tgt Spa: ['1.000', '1.000'] [Step 227 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [59979] → Tgt Spa: ['1.000'] [Step 227 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59747] → Tgt Spa: ['1.000'] [Step 227 / Rank 0] Tasks: ['Single QA'] | Lens: [45622] → Tgt Spa: ['0.350'] [Step 227 / Rank 7] Tasks: ['Single QA'] | Lens: [52163] → Tgt Spa: ['0.350'] [Step 227 / Rank 3] Tasks: ['Single QA'] | Lens: [35597] → Tgt Spa: ['0.350'] [Step 227 / Rank 4] Tasks: ['Single QA'] | Lens: [47432] → Tgt Spa: ['0.350'] [Step 227 / Rank 5] Tasks: ['Single QA'] | Lens: [47432] → Tgt Spa: ['0.350'] [Step 227 / Rank 2] Tasks: ['Single QA'] | Lens: [35597] → Tgt Spa: ['0.350'] [Step 227 / Rank 1] Tasks: ['Single QA'] | Lens: [45622] → Tgt Spa: ['0.350'] [Step 227 / Rank 6] Tasks: ['Single QA'] | Lens: [52163] → Tgt Spa: ['0.350'] [Step 227 / Rank 4] Tasks: ['Single QA'] | Lens: [52043] → Tgt Spa: ['0.350'] [Step 227 / Rank 2] Tasks: ['Summarization'] | Lens: [38467] → Tgt Spa: ['1.000'] [Step 227 / Rank 3] Tasks: ['Summarization'] | Lens: [38467] → Tgt Spa: ['1.000'] [Step 227 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [31074, 31079] → Tgt Spa: ['0.350', '0.350'] [Step 227 / Rank 5] Tasks: ['Single QA'] | Lens: [52043] → Tgt Spa: ['0.350'] [Step 227 / Rank 6] Tasks: ['Single QA'] | Lens: [42497] → Tgt Spa: ['0.350'] [Step 227 / Rank 7] Tasks: ['Single QA'] | Lens: [42497] → Tgt Spa: 
['0.350'] [Step 227 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [31074, 31079] → Tgt Spa: ['0.350', '0.350'] [Step 227 / Rank 4] Tasks: ['Code'] | Lens: [36943] → Tgt Spa: ['1.000'] [Step 227 / Rank 3] Tasks: ['Single QA'] | Lens: [35043] → Tgt Spa: ['0.350'] [Step 227 / Rank 5] Tasks: ['Code'] | Lens: [36943] → Tgt Spa: ['1.000'] [Step 227 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64847] → Tgt Spa: ['1.000'] [Step 227 / Rank 2] Tasks: ['Single QA'] | Lens: [35043] → Tgt Spa: ['0.350'] [Step 227 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64847] → Tgt Spa: ['1.000'] [Step 227 / Rank 7] Tasks: ['Summarization', 'Summarization'] | Lens: [24901, 24902] → Tgt Spa: ['1.000', '1.000'] [Step 227 / Rank 6] Tasks: ['Summarization', 'Summarization'] | Lens: [24901, 24902] → Tgt Spa: ['1.000', '1.000'] [Step 227 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [44884] → Tgt Spa: ['1.000'] [Step 227 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [36230] → Tgt Spa: ['1.000'] [Step 227 / Rank 0] Tasks: ['Single QA'] | Lens: [41533] → Tgt Spa: ['0.350'] [Step 227 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [36230] → Tgt Spa: ['1.000'] [Step 227 / Rank 1] Tasks: ['Single QA'] | Lens: [41533] → Tgt Spa: ['0.350'] [Step 227 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [44884] → Tgt Spa: ['1.000'] [Step 227 / Rank 5] Tasks: ['Single QA'] | Lens: [64903] → Tgt Spa: ['0.350'] [Step 227 / Rank 4] Tasks: ['Single QA'] | Lens: [64903] → Tgt Spa: ['0.350'] [Step 227 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58190] → Tgt Spa: ['1.000'] [Step 227 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [26872, 26865] → Tgt Spa: ['1.000', '0.350'] [Step 227 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [26872, 26865] → Tgt Spa: ['1.000', '0.350'] [Step 227 / Rank 0] Tasks: ['Code'] | Lens: [62448] → Tgt Spa: ['1.000'] [Step 227 / Rank 1] Tasks: ['Code'] | Lens: [62448] → Tgt Spa: ['1.000'] [Step 227 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58190] → Tgt 
Spa: ['1.000'] [Step 227 / Rank 2] Tasks: ['Single QA'] | Lens: [51201] → Tgt Spa: ['0.350'] [Step 227 / Rank 3] Tasks: ['Single QA'] | Lens: [51201] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:03:19,918 >> @ 227 | Loss: 1.9814 | LM: 1.9205 | Reg: 0.0609 | Spa(Avg): 0.537 [INFO|lh_trainer.py:797] 2026-02-17 05:03:19,918 >> Statistic -> Code | Spa: 0.714 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 05:03:19,919 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:03:19,919 >> Statistic -> MultiHop | Spa: 0.599 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:03:19,919 >> Statistic -> Single | Spa: 0.371 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:03:19,919 >> Statistic -> Summarization | Spa: 0.606 | Tgt: 1.000 | Z-Loss: 0.143 | [INFO|lh_trainer.py:810] 2026-02-17 05:03:19,921 >> [Micro-Log] {"loss": 1.9814000012799322, "lm_loss": 1.9205278094062426, "reg_loss": 0.06087219911569264, "model_sparsity(avg)": 0.536747682839632, "Spa-In-Context Learning sparsity": 0.7103174584252494, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10867196321487427, "Spa-Single QA sparsity": 0.37072648451878476, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.015620663833732788, "Spa-Code sparsity": 0.7138889074325562, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0926364779472351, "Spa-Summarization sparsity": 0.6064814726511637, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14320731908082962, "Spa-MultiHop QA sparsity": 0.5989583283662796, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10475413803942502, "step": 227, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.162109375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 
05:03:44,393 >> {'loss': 11.8884, 'grad_norm': 0.6309003829956055, 'learning_rate': 0.00010571374191932138, 'epoch': 0.2401263823064771, 'num_input_tokens_seen': 561327442, 'completed': '76.00% (228 / 300)', 'remaining time': '3:22:43', 'throughput': '6594.70', 'gpu_mem_free': '5307MB', 'step': 228} [Step 228 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17509, 17512, 17501] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 228 / Rank 3] Tasks: ['Single QA'] | Lens: [34412] → Tgt Spa: ['0.350'] [Step 228 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41801] → Tgt Spa: ['1.000'] [Step 228 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41801] → Tgt Spa: ['1.000'] [Step 228 / Rank 0] Tasks: ['Single QA'] | Lens: [49671] → Tgt Spa: ['0.350'] [Step 228 / Rank 2] Tasks: ['Single QA'] | Lens: [34412] → Tgt Spa: ['0.350'] [Step 228 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17509, 17512, 17501] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 228 / Rank 1] Tasks: ['Single QA'] | Lens: [49671] → Tgt Spa: ['0.350'] [Step 228 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55510] → Tgt Spa: ['1.000'] [Step 228 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15930, 15930, 15930, 15930] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 228 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55510] → Tgt Spa: ['1.000'] [Step 228 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27340, 27338] → Tgt Spa: ['0.350', '1.000'] [Step 228 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27340, 27338] → Tgt Spa: ['0.350', '1.000'] [Step 228 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [21196, 21191, 21191] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 228 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15930, 15930, 15930, 15930] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 228 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | 
Lens: [21196, 21191, 21191] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 228 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [26721, 26731] → Tgt Spa: ['1.000', '1.000'] [Step 228 / Rank 0] Tasks: ['Single QA'] | Lens: [45605] → Tgt Spa: ['0.350'] [Step 228 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [19456, 19456, 19460] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 228 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [23774, 23774] → Tgt Spa: ['0.350', '0.350'] [Step 228 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [23774, 23774] → Tgt Spa: ['0.350', '0.350'] [Step 228 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [26721, 26731] → Tgt Spa: ['1.000', '1.000'] [Step 228 / Rank 1] Tasks: ['Single QA'] | Lens: [45605] → Tgt Spa: ['0.350'] [Step 228 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [19456, 19456, 19460] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 228 / Rank 4] Tasks: ['Code'] | Lens: [53289] → Tgt Spa: ['1.000'] [Step 228 / Rank 7] Tasks: ['Code'] | Lens: [34482] → Tgt Spa: ['1.000'] [Step 228 / Rank 6] Tasks: ['Code'] | Lens: [34482] → Tgt Spa: ['1.000'] [Step 228 / Rank 0] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16793, 16805, 16806] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 228 / Rank 5] Tasks: ['Code'] | Lens: [53289] → Tgt Spa: ['1.000'] [Step 228 / Rank 3] Tasks: ['Single QA'] | Lens: [65079] → Tgt Spa: ['0.350'] [Step 228 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16793, 16805, 16806] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 228 / Rank 2] Tasks: ['Single QA'] | Lens: [65079] → Tgt Spa: ['0.350'] [Step 228 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43096] → Tgt Spa: ['1.000'] [Step 228 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [60334] → Tgt Spa: ['1.000'] [Step 228 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [31832, 31833] → Tgt Spa: ['0.350', '0.350'] [Step 228 / Rank 2] Tasks: 
['Single QA', 'Single QA'] | Lens: [31832, 31833] → Tgt Spa: ['0.350', '0.350'] [Step 228 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [60334] → Tgt Spa: ['1.000'] [Step 228 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [29550, 29559] → Tgt Spa: ['1.000', '1.000'] [Step 228 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [29550, 29559] → Tgt Spa: ['1.000', '1.000'] [Step 228 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43096] → Tgt Spa: ['1.000'] [Step 228 / Rank 3] Tasks: ['Single QA'] | Lens: [51484] → Tgt Spa: ['0.350'] [Step 228 / Rank 4] Tasks: ['Code', 'MultiHop QA', 'Code', 'Code'] | Lens: [13767, 13763, 13791, 13798] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 228 / Rank 2] Tasks: ['Single QA'] | Lens: [51484] → Tgt Spa: ['0.350'] [Step 228 / Rank 6] Tasks: ['Code'] | Lens: [34976] → Tgt Spa: ['1.000'] [Step 228 / Rank 5] Tasks: ['Code', 'MultiHop QA', 'Code', 'Code'] | Lens: [13767, 13763, 13791, 13798] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 228 / Rank 7] Tasks: ['Code'] | Lens: [34976] → Tgt Spa: ['1.000'] [Step 228 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [63130] → Tgt Spa: ['1.000'] [Step 228 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [63130] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 05:06:11,153 >> @ 228 | Loss: 1.9619 | LM: 1.8882 | Reg: 0.0737 | Spa(Avg): 0.593 [INFO|lh_trainer.py:797] 2026-02-17 05:06:11,153 >> Statistic -> Code | Spa: 0.718 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 05:06:11,153 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:06:11,153 >> Statistic -> MultiHop | Spa: 0.569 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:06:11,153 >> Statistic -> Single | Spa: 0.402 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:06:11,153 >> Statistic -> Summarization | Spa: 0.670 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:810] 
2026-02-17 05:06:11,156 >> [Micro-Log] {"loss": 1.9618531080583732, "lm_loss": 1.8881590055922668, "reg_loss": 0.07369412156792048, "model_sparsity(avg)": 0.5927854925394058, "Spa-Single QA sparsity": 0.40178570577076506, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03376030104534168, "Spa-In-Context Learning sparsity": 0.7118055522441864, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10858917236328125, "Spa-Code sparsity": 0.717592587073644, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09130181434253852, "Spa-Summarization sparsity": 0.670138880610466, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1058313837274909, "Spa-MultiHop QA sparsity": 0.5694444179534912, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08780772238969803, "step": 228, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.162109375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 05:06:36,298 >> {'loss': 11.7711, 'grad_norm': 0.7048976421356201, 'learning_rate': 0.00010305372662151306, 'epoch': 0.2411795681937862, 'num_input_tokens_seen': 563837514, 'completed': '76.33% (229 / 300)', 'remaining time': '3:19:55', 'throughput': '7300.76', 'gpu_mem_free': '5317MB', 'step': 229} [Step 229 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [30070, 30079] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6877, 6878, 6878, 6878, 6878, 6878, 6879, 6880, 6880] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 229 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24133, 24134] → Tgt Spa: ['1.000', '0.350'] [Step 229 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single 
QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6877, 6878, 6878, 6878, 6878, 6878, 6879, 6880, 6880] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 229 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [30070, 30079] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24133, 24134] → Tgt Spa: ['1.000', '0.350'] [Step 229 / Rank 0] Tasks: ['Single QA'] | Lens: [54078] → Tgt Spa: ['0.350'] [Step 229 / Rank 1] Tasks: ['Single QA'] | Lens: [54078] → Tgt Spa: ['0.350'] [Step 229 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17271, 17284, 17273] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 229 / Rank 1] Tasks: ['Single QA'] | Lens: [64740] → Tgt Spa: ['0.350'] [Step 229 / Rank 6] Tasks: ['Single QA'] | Lens: [35568] → Tgt Spa: ['0.350'] [Step 229 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17271, 17284, 17273] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 229 / Rank 2] Tasks: ['Single QA'] | Lens: [35105] → Tgt Spa: ['0.350'] [Step 229 / Rank 7] Tasks: ['Single QA'] | Lens: [35568] → Tgt Spa: ['0.350'] [Step 229 / Rank 3] Tasks: ['Single QA'] | Lens: [35105] → Tgt Spa: ['0.350'] [Step 229 / Rank 0] Tasks: ['Single QA'] | Lens: [64740] → Tgt Spa: ['0.350'] [Step 229 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25872, 25852] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 5] Tasks: ['Single QA'] | Lens: [62003] → Tgt Spa: ['0.350'] [Step 229 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24352, 24356] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25872, 25852] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24352, 24356] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 3] Tasks: ['Single QA'] | Lens: [53155] → Tgt Spa: ['0.350'] [Step 
229 / Rank 2] Tasks: ['Single QA'] | Lens: [53155] → Tgt Spa: ['0.350'] [Step 229 / Rank 4] Tasks: ['Single QA'] | Lens: [62003] → Tgt Spa: ['0.350'] [Step 229 / Rank 6] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [10618, 10620, 10620, 10615, 10624, 10625] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 229 / Rank 4] Tasks: ['Single QA'] | Lens: [63886] → Tgt Spa: ['0.350'] [Step 229 / Rank 3] Tasks: ['Code'] | Lens: [38927] → Tgt Spa: ['1.000'] [Step 229 / Rank 1] Tasks: ['Single QA'] | Lens: [47084] → Tgt Spa: ['0.350'] [Step 229 / Rank 7] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [10618, 10620, 10620, 10615, 10624, 10625] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 229 / Rank 5] Tasks: ['Single QA'] | Lens: [63886] → Tgt Spa: ['0.350'] [Step 229 / Rank 2] Tasks: ['Code'] | Lens: [38927] → Tgt Spa: ['1.000'] [Step 229 / Rank 0] Tasks: ['Single QA'] | Lens: [47084] → Tgt Spa: ['0.350'] [Step 229 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27601, 27607] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [24651, 24660] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 3] Tasks: ['Summarization', 'Single QA'] | Lens: [22317, 22300] → Tgt Spa: ['1.000', '0.350'] [Step 229 / Rank 2] Tasks: ['Summarization', 'Single QA'] | Lens: [22317, 22300] → Tgt Spa: ['1.000', '0.350'] [Step 229 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32202, 32203] → Tgt Spa: ['0.350', '0.350'] [Step 229 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27601, 27607] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [24651, 24660] → Tgt Spa: ['1.000', '1.000'] [Step 229 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32202, 32203] → Tgt Spa: ['0.350', '0.350'] [Step 229 / Rank 4] Tasks: ['Single QA'] | Lens: [53950] → Tgt Spa: ['0.350'] 
[Step 229 / Rank 5] Tasks: ['Single QA'] | Lens: [53950] → Tgt Spa: ['0.350'] [Step 229 / Rank 2] Tasks: ['Single QA'] | Lens: [37550] → Tgt Spa: ['0.350'] [Step 229 / Rank 6] Tasks: ['Single QA'] | Lens: [35648] → Tgt Spa: ['0.350'] [Step 229 / Rank 3] Tasks: ['Single QA'] | Lens: [37550] → Tgt Spa: ['0.350'] [Step 229 / Rank 1] Tasks: ['Single QA'] | Lens: [51033] → Tgt Spa: ['0.350'] [Step 229 / Rank 7] Tasks: ['Single QA'] | Lens: [35648] → Tgt Spa: ['0.350'] [Step 229 / Rank 0] Tasks: ['Single QA'] | Lens: [51033] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:09:14,512 >> @ 229 | Loss: 2.0940 | LM: 2.0402 | Reg: 0.0538 | Spa(Avg): 0.492 [INFO|lh_trainer.py:797] 2026-02-17 05:09:14,512 >> Statistic -> Code | Spa: 0.697 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 05:09:14,512 >> Statistic -> In-Context | Spa: 0.720 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:09:14,512 >> Statistic -> MultiHop | Spa: 0.569 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:09:14,512 >> Statistic -> Single | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:09:14,512 >> Statistic -> Summarization | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.155 | [INFO|lh_trainer.py:810] 2026-02-17 05:09:14,514 >> [Micro-Log] {"loss": 2.0939545494814715, "lm_loss": 2.040165572116772, "reg_loss": 0.053788974895724095, "model_sparsity(avg)": 0.49186599006255466, "Spa-Single QA sparsity": 0.3931623880679791, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.030691212884938486, "Spa-In-Context Learning sparsity": 0.7204861044883728, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10497693717479706, "Spa-Code sparsity": 0.6972222208976746, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09931640774011612, "Spa-Summarization sparsity": 0.5833333333333334, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization 
log_z_loss": 0.15525685995817184, "Spa-MultiHop QA sparsity": 0.5694444179534912, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08780772238969803, "step": 229, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.162109375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 05:09:34,444 >> {'loss': 12.5637, 'grad_norm': 0.4465177655220032, 'learning_rate': 0.00010041888981545026, 'epoch': 0.24223275408109532, 'num_input_tokens_seen': 566302658, 'completed': '76.67% (230 / 300)', 'remaining time': '3:17:09', 'throughput': '6918.86', 'gpu_mem_free': '9377MB', 'step': 230} [Step 230 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32226, 32228] → Tgt Spa: ['0.350', '0.350'] [Step 230 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32226, 32228] → Tgt Spa: ['0.350', '0.350'] [Step 230 / Rank 4] Tasks: ['Single QA'] | Lens: [57026] → Tgt Spa: ['0.350'] [Step 230 / Rank 1] Tasks: ['Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2505, 2506, 2506, 2505, 2523, 2523, 2524, 2524, 2507, 2525, 2524, 2508, 2507, 2524, 2527, 2508, 2510, 2510, 2511, 2529, 2512, 2512, 2512, 2513, 2511, 2513] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 230 / Rank 2] Tasks: ['Single QA'] | Lens: [45264] → Tgt Spa: ['0.350'] [Step 230 / Rank 0] Tasks: ['Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 
'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2505, 2506, 2506, 2505, 2523, 2523, 2524, 2524, 2507, 2525, 2524, 2508, 2507, 2524, 2527, 2508, 2510, 2510, 2511, 2529, 2512, 2512, 2512, 2513, 2511, 2513] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 230 / Rank 5] Tasks: ['Single QA'] | Lens: [57026] → Tgt Spa: ['0.350'] [Step 230 / Rank 3] Tasks: ['Single QA'] | Lens: [45264] → Tgt Spa: ['0.350'] [Step 230 / Rank 3] Tasks: ['Single QA'] | Lens: [44639] → Tgt Spa: ['0.350'] [Step 230 / Rank 6] Tasks: ['Single QA'] | Lens: [64721] → Tgt Spa: ['0.350'] [Step 230 / Rank 5] Tasks: ['Single QA'] | Lens: [47416] → Tgt Spa: ['0.350'] [Step 230 / Rank 4] Tasks: ['Single QA'] | Lens: [47416] → Tgt Spa: ['0.350'] [Step 230 / Rank 0] Tasks: ['Single QA'] | Lens: [49989] → Tgt Spa: ['0.350'] [Step 230 / Rank 7] Tasks: ['Single QA'] | Lens: [64721] → Tgt Spa: ['0.350'] [Step 230 / Rank 1] Tasks: ['Single QA'] | Lens: [49989] → Tgt Spa: ['0.350'] [Step 230 / Rank 2] Tasks: ['Single QA'] | Lens: [44639] → Tgt Spa: ['0.350'] [Step 230 / Rank 3] Tasks: ['Single QA'] | Lens: [53197] → Tgt Spa: ['0.350'] [Step 230 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15828, 15832, 15833, 15839] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 230 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [60995] → Tgt Spa: ['1.000'] [Step 230 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15828, 15832, 15833, 15839] → 
Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 230 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [60995] → Tgt Spa: ['1.000'] [Step 230 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [30613, 30618] → Tgt Spa: ['1.000', '1.000'] [Step 230 / Rank 2] Tasks: ['Single QA'] | Lens: [53197] → Tgt Spa: ['0.350'] [Step 230 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [30613, 30618] → Tgt Spa: ['1.000', '1.000'] [Step 230 / Rank 6] Tasks: ['MultiHop QA'] | Lens: [64767] → Tgt Spa: ['0.350'] [Step 230 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [22831, 22821] → Tgt Spa: ['1.000', '1.000'] [Step 230 / Rank 2] Tasks: ['Code'] | Lens: [35214] → Tgt Spa: ['1.000'] [Step 230 / Rank 5] Tasks: ['Single QA'] | Lens: [39957] → Tgt Spa: ['0.350'] [Step 230 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [22831, 22821] → Tgt Spa: ['1.000', '1.000'] [Step 230 / Rank 3] Tasks: ['Code'] | Lens: [35214] → Tgt Spa: ['1.000'] [Step 230 / Rank 4] Tasks: ['Single QA'] | Lens: [39957] → Tgt Spa: ['0.350'] [Step 230 / Rank 7] Tasks: ['MultiHop QA'] | Lens: [64767] → Tgt Spa: ['0.350'] [Step 230 / Rank 3] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [6192, 6192, 6193, 6193, 6196, 6204, 6205, 6199, 6201, 6210] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 230 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [26621, 26613] → Tgt Spa: ['1.000', '1.000'] [Step 230 / Rank 2] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [6192, 6192, 6193, 6193, 6196, 6204, 6205, 6199, 6201, 6210] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 230 / Rank 7] Tasks: ['Code', 'Single QA'] | Lens: [32579, 32572] → Tgt Spa: ['1.000', '0.350'] [Step 230 / Rank 
1] Tasks: ['Single QA'] | Lens: [43130] → Tgt Spa: ['0.350'] [Step 230 / Rank 0] Tasks: ['Single QA'] | Lens: [43130] → Tgt Spa: ['0.350'] [Step 230 / Rank 6] Tasks: ['Code', 'Single QA'] | Lens: [32579, 32572] → Tgt Spa: ['1.000', '0.350'] [Step 230 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [26621, 26613] → Tgt Spa: ['1.000', '1.000'] [Step 230 / Rank 6] Tasks: ['Single QA'] | Lens: [57585] → Tgt Spa: ['0.350'] [Step 230 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28650, 28650] → Tgt Spa: ['1.000', '1.000'] [Step 230 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20199, 20212, 20202] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 230 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53847] → Tgt Spa: ['1.000'] [Step 230 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20199, 20212, 20202] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 230 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53847] → Tgt Spa: ['1.000'] [Step 230 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28650, 28650] → Tgt Spa: ['1.000', '1.000'] [Step 230 / Rank 7] Tasks: ['Single QA'] | Lens: [57585] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:12:15,717 >> @ 230 | Loss: 1.9379 | LM: 1.8770 | Reg: 0.0609 | Spa(Avg): 0.520 [INFO|lh_trainer.py:797] 2026-02-17 05:12:15,718 >> Statistic -> Code | Spa: 0.706 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 05:12:15,718 >> Statistic -> In-Context | Spa: 0.721 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:12:15,718 >> Statistic -> MultiHop | Spa: 0.635 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:12:15,718 >> Statistic -> Single | Spa: 0.424 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:12:15,718 >> Statistic -> Summarization | Spa: 0.635 | Tgt: 1.000 | Z-Loss: 0.125 | [INFO|lh_trainer.py:810] 2026-02-17 05:12:15,722 >> [Micro-Log] {"loss": 
1.9378609632452328, "lm_loss": 1.8769697435200214, "reg_loss": 0.06089123894344084, "model_sparsity(avg)": 0.5199363405505816, "Spa-Single QA sparsity": 0.42424242333932355, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05015039820732041, "Spa-MultiHop QA sparsity": 0.6345486156642437, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12232312234118581, "Spa-Summarization sparsity": 0.635101009498943, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12459302964535626, "Spa-In-Context Learning sparsity": 0.7206790049870809, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10497316718101501, "Spa-Code sparsity": 0.7058080759915438, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09592877260663292, "step": 230, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.162109375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 05:12:37,854 >> {'loss': 11.6272, 'grad_norm': 0.5073559880256653, 'learning_rate': 9.780968296685557e-05, 'epoch': 0.24328593996840442, 'num_input_tokens_seen': 568934814, 'completed': '77.00% (231 / 300)', 'remaining time': '3:14:24', 'throughput': '7175.63', 'gpu_mem_free': '8465MB', 'step': 231} [Step 231 / Rank 7] Tasks: ['Single QA'] | Lens: [55939] → Tgt Spa: ['0.350'] [Step 231 / Rank 5] Tasks: ['Single QA'] | Lens: [36721] → Tgt Spa: ['0.350'] [Step 231 / Rank 3] Tasks: ['Code'] | Lens: [49525] → Tgt Spa: ['1.000'] [Step 231 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25264, 25265] → Tgt Spa: ['0.350', '0.350'] [Step 231 / Rank 2] Tasks: ['Code'] | Lens: [49525] → Tgt Spa: ['1.000'] [Step 231 / Rank 6] Tasks: ['Single QA'] | Lens: [55939] → Tgt Spa: ['0.350'] [Step 231 / Rank 4] Tasks: ['Single QA'] | Lens: [36721] → Tgt Spa: ['0.350'] [Step 231 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [25264, 25265] → Tgt Spa: 
['0.350', '0.350'] [Step 231 / Rank 3] Tasks: ['Single QA'] | Lens: [42045] → Tgt Spa: ['0.350'] [Step 231 / Rank 5] Tasks: ['Code'] | Lens: [33573] → Tgt Spa: ['1.000'] [Step 231 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [18319, 18319, 18322] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 231 / Rank 1] Tasks: ['Single QA'] | Lens: [51698] → Tgt Spa: ['0.350'] [Step 231 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [18319, 18319, 18322] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 231 / Rank 4] Tasks: ['Code'] | Lens: [33573] → Tgt Spa: ['1.000'] [Step 231 / Rank 0] Tasks: ['Single QA'] | Lens: [51698] → Tgt Spa: ['0.350'] [Step 231 / Rank 2] Tasks: ['Single QA'] | Lens: [42045] → Tgt Spa: ['0.350'] [Step 231 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39211] → Tgt Spa: ['1.000'] [Step 231 / Rank 1] Tasks: ['Single QA'] | Lens: [51036] → Tgt Spa: ['0.350'] [Step 231 / Rank 6] Tasks: ['Code'] | Lens: [38852] → Tgt Spa: ['1.000'] [Step 231 / Rank 2] Tasks: ['Code'] | Lens: [37725] → Tgt Spa: ['1.000'] [Step 231 / Rank 0] Tasks: ['Single QA'] | Lens: [51036] → Tgt Spa: ['0.350'] [Step 231 / Rank 3] Tasks: ['Code'] | Lens: [37725] → Tgt Spa: ['1.000'] [Step 231 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39211] → Tgt Spa: ['1.000'] [Step 231 / Rank 7] Tasks: ['Code'] | Lens: [38852] → Tgt Spa: ['1.000'] [Step 231 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40953] → Tgt Spa: ['1.000'] [Step 231 / Rank 5] Tasks: ['Code'] | Lens: [42089] → Tgt Spa: ['1.000'] [Step 231 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32134, 32134] → Tgt Spa: ['0.350', '0.350'] [Step 231 / Rank 4] Tasks: ['Code'] | Lens: [42089] → Tgt Spa: ['1.000'] [Step 231 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55447] → Tgt Spa: ['1.000'] [Step 231 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55447] → Tgt Spa: ['1.000'] [Step 231 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40953] → Tgt Spa: ['1.000'] [Step 231 / Rank 6] Tasks: ['Single QA', 'Single 
QA'] | Lens: [32134, 32134] → Tgt Spa: ['0.350', '0.350'] [Step 231 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57409] → Tgt Spa: ['1.000'] [Step 231 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42733] → Tgt Spa: ['1.000'] [Step 231 / Rank 3] Tasks: ['Single QA'] | Lens: [38974] → Tgt Spa: ['0.350'] [Step 231 / Rank 6] Tasks: ['Single QA'] | Lens: [59029] → Tgt Spa: ['0.350'] [Step 231 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57409] → Tgt Spa: ['1.000'] [Step 231 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42733] → Tgt Spa: ['1.000'] [Step 231 / Rank 7] Tasks: ['Single QA'] | Lens: [59029] → Tgt Spa: ['0.350'] [Step 231 / Rank 2] Tasks: ['Single QA'] | Lens: [38974] → Tgt Spa: ['0.350'] [Step 231 / Rank 1] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [18110, 18129, 18119] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 231 / Rank 6] Tasks: ['Single QA'] | Lens: [60035] → Tgt Spa: ['0.350'] [Step 231 / Rank 0] Tasks: ['Single QA', 'Summarization', 'Code'] | Lens: [18110, 18129, 18119] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 231 / Rank 3] Tasks: ['Single QA'] | Lens: [33931] → Tgt Spa: ['0.350'] [Step 231 / Rank 2] Tasks: ['Single QA'] | Lens: [33931] → Tgt Spa: ['0.350'] [Step 231 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20169, 20180, 20171] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 231 / Rank 7] Tasks: ['Single QA'] | Lens: [60035] → Tgt Spa: ['0.350'] [Step 231 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20169, 20180, 20171] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 05:14:58,363 >> @ 231 | Loss: 1.9193 | LM: 1.8614 | Reg: 0.0579 | Spa(Avg): 0.546 [INFO|lh_trainer.py:797] 2026-02-17 05:14:58,363 >> Statistic -> Code | Spa: 0.696 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:797] 2026-02-17 05:14:58,363 >> Statistic -> In-Context | Spa: 0.722 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:14:58,363 >> Statistic -> MultiHop | Spa: 
0.635 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:14:58,363 >> Statistic -> Single | Spa: 0.365 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:14:58,363 >> Statistic -> Summarization | Spa: 0.694 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:810] 2026-02-17 05:14:58,365 >> [Micro-Log] {"loss": 1.9192674544950326, "lm_loss": 1.8614140699307125, "reg_loss": 0.05785338775118968, "model_sparsity(avg)": 0.5458140348394712, "Spa-Single QA sparsity": 0.3650793560913631, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.011324532038997859, "Spa-In-Context Learning sparsity": 0.7222222089767456, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10433244705200195, "Spa-Summarization sparsity": 0.6944444179534912, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.0945095606148243, "Spa-Code sparsity": 0.695707071911205, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10004956749352542, "Spa-MultiHop QA sparsity": 0.6345486156642437, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12232312234118581, "step": 231, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.1630859375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 05:15:22,077 >> {'loss': 11.5156, 'grad_norm': 0.6577916145324707, 'learning_rate': 9.522655314989022e-05, 'epoch': 0.24433912585571355, 'num_input_tokens_seen': 571237934, 'completed': '77.33% (232 / 300)', 'remaining time': '3:11:33', 'throughput': '7012.15', 'gpu_mem_free': '9597MB', 'step': 232} [Step 232 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [53078] → Tgt Spa: ['1.000'] [Step 232 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [64656] → Tgt Spa: ['1.000'] [Step 232 / Rank 0] Tasks: ['Single QA'] | Lens: [59287] → Tgt Spa: ['0.350'] [Step 232 / Rank 3] Tasks: ['Code'] | Lens: 
[33441] → Tgt Spa: ['1.000'] [Step 232 / Rank 1] Tasks: ['Single QA'] | Lens: [59287] → Tgt Spa: ['0.350'] [Step 232 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [64656] → Tgt Spa: ['1.000'] [Step 232 / Rank 2] Tasks: ['Code'] | Lens: [33441] → Tgt Spa: ['1.000'] [Step 232 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [53078] → Tgt Spa: ['1.000'] [Step 232 / Rank 5] Tasks: ['Single QA'] | Lens: [55059] → Tgt Spa: ['0.350'] [Step 232 / Rank 2] Tasks: ['Single QA'] | Lens: [34213] → Tgt Spa: ['0.350'] [Step 232 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44517] → Tgt Spa: ['1.000'] [Step 232 / Rank 3] Tasks: ['Single QA'] | Lens: [34213] → Tgt Spa: ['0.350'] [Step 232 / Rank 4] Tasks: ['Single QA'] | Lens: [55059] → Tgt Spa: ['0.350'] [Step 232 / Rank 6] Tasks: ['Single QA'] | Lens: [45266] → Tgt Spa: ['0.350'] [Step 232 / Rank 7] Tasks: ['Single QA'] | Lens: [45266] → Tgt Spa: ['0.350'] [Step 232 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44517] → Tgt Spa: ['1.000'] [Step 232 / Rank 3] Tasks: ['Summarization', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'MultiHop QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Code'] | Lens: [3309, 3299, 3292, 3294, 3293, 3293, 3301, 3293, 3294, 3294, 3296, 3295, 3314, 3296, 3315, 3316, 3297, 3299, 3306] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 232 / Rank 6] Tasks: ['Single QA'] | Lens: [34411] → Tgt Spa: ['0.350'] [Step 232 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [23933, 23943] → Tgt Spa: ['1.000', '1.000'] [Step 232 / Rank 2] Tasks: ['Summarization', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'MultiHop QA', 'Code', 'Single QA', 'In-Context Learning', 
'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Code'] | Lens: [3309, 3299, 3292, 3294, 3293, 3293, 3301, 3293, 3294, 3294, 3296, 3295, 3314, 3296, 3315, 3316, 3297, 3299, 3306] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 232 / Rank 1] Tasks: ['Single QA'] | Lens: [36788] → Tgt Spa: ['0.350'] [Step 232 / Rank 0] Tasks: ['Single QA'] | Lens: [36788] → Tgt Spa: ['0.350'] [Step 232 / Rank 7] Tasks: ['Single QA'] | Lens: [34411] → Tgt Spa: ['0.350'] [Step 232 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [23933, 23943] → Tgt Spa: ['1.000', '1.000'] [Step 232 / Rank 5] Tasks: ['Single QA'] | Lens: [49144] → Tgt Spa: ['0.350'] [Step 232 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [47158] → Tgt Spa: ['1.000'] [Step 232 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [47158] → Tgt Spa: ['1.000'] [Step 232 / Rank 0] Tasks: ['Single QA'] | Lens: [43092] → Tgt Spa: ['0.350'] [Step 232 / Rank 3] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [21263, 21256, 21251] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 232 / Rank 1] Tasks: ['Single QA'] | Lens: [43092] → Tgt Spa: ['0.350'] [Step 232 / Rank 2] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [21263, 21256, 21251] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 232 / Rank 4] Tasks: ['Single QA'] | Lens: [49144] → Tgt Spa: ['0.350'] [Step 232 / Rank 2] Tasks: ['Single QA'] | Lens: [55761] → Tgt Spa: ['0.350'] [Step 232 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [33103] → Tgt Spa: ['1.000'] [Step 232 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [30197, 30197] → Tgt Spa: ['0.350', '0.350'] [Step 232 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [26282, 26289] → Tgt Spa: ['1.000', '1.000'] 
[Step 232 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [26282, 26289] → Tgt Spa: ['1.000', '1.000'] [Step 232 / Rank 3] Tasks: ['Single QA'] | Lens: [55761] → Tgt Spa: ['0.350'] [Step 232 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [30197, 30197] → Tgt Spa: ['0.350', '0.350'] [Step 232 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [33103] → Tgt Spa: ['1.000'] [Step 232 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41092] → Tgt Spa: ['1.000'] [Step 232 / Rank 1] Tasks: ['Code'] | Lens: [57999] → Tgt Spa: ['1.000'] [Step 232 / Rank 6] Tasks: ['Single QA'] | Lens: [44770] → Tgt Spa: ['0.350'] [Step 232 / Rank 7] Tasks: ['Single QA'] | Lens: [44770] → Tgt Spa: ['0.350'] [Step 232 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41092] → Tgt Spa: ['1.000'] [Step 232 / Rank 0] Tasks: ['Code'] | Lens: [57999] → Tgt Spa: ['1.000'] [Step 232 / Rank 2] Tasks: ['Single QA'] | Lens: [38920] → Tgt Spa: ['0.350'] [Step 232 / Rank 3] Tasks: ['Single QA'] | Lens: [38920] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:17:36,726 >> @ 232 | Loss: 2.2540 | LM: 2.1916 | Reg: 0.0624 | Spa(Avg): 0.547 [INFO|lh_trainer.py:797] 2026-02-17 05:17:36,726 >> Statistic -> Code | Spa: 0.689 | Tgt: 1.000 | Z-Loss: 0.103 | [INFO|lh_trainer.py:797] 2026-02-17 05:17:36,726 >> Statistic -> In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:17:36,726 >> Statistic -> MultiHop | Spa: 0.662 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:17:36,726 >> Statistic -> Single | Spa: 0.416 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:17:36,726 >> Statistic -> Summarization | Spa: 0.631 | Tgt: 1.000 | Z-Loss: 0.127 | [INFO|lh_trainer.py:810] 2026-02-17 05:17:36,728 >> [Micro-Log] {"loss": 2.25396446014444, "lm_loss": 2.1915631288041673, "reg_loss": 0.0624013375636423, "model_sparsity(avg)": 0.5469866792360941, "Spa-Single QA sparsity": 0.4157986082136631, "Spa-Single QA 
target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04362416618823772, "Spa-In-Context Learning sparsity": 0.7129629532496135, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10830246806144714, "Spa-Code sparsity": 0.6892361342906952, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10252379532903433, "Spa-Summarization sparsity": 0.6305555820465087, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1271029904484749, "Spa-MultiHop QA sparsity": 0.6620370546976725, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1365566005309423, "step": 232, "current_tau": 1.0, "lambda1 Single QA": 0.58984375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.1630859375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 05:17:59,013 >> {'loss': 13.5238, 'grad_norm': 0.6333816647529602, 'learning_rate': 9.266994297055047e-05, 'epoch': 0.24539231174302265, 'num_input_tokens_seen': 573556058, 'completed': '77.67% (233 / 300)', 'remaining time': '3:08:41', 'throughput': '7385.60', 'gpu_mem_free': '7195MB', 'step': 233} [Step 233 / Rank 5] Tasks: ['Single QA'] | Lens: [51376] → Tgt Spa: ['0.350'] [Step 233 / Rank 7] Tasks: ['Single QA'] | Lens: [49464] → Tgt Spa: ['0.350'] [Step 233 / Rank 2] Tasks: ['Single QA'] | Lens: [48669] → Tgt Spa: ['0.350'] [Step 233 / Rank 3] Tasks: ['Single QA'] | Lens: [48669] → Tgt Spa: ['0.350'] [Step 233 / Rank 4] Tasks: ['Single QA'] | Lens: [51376] → Tgt Spa: ['0.350'] [Step 233 / Rank 1] Tasks: ['Single QA'] | Lens: [62092] → Tgt Spa: ['0.350'] [Step 233 / Rank 0] Tasks: ['Single QA'] | Lens: [62092] → Tgt Spa: ['0.350'] [Step 233 / Rank 6] Tasks: ['Single QA'] | Lens: [49464] → Tgt Spa: ['0.350'] [Step 233 / Rank 7] Tasks: ['Code', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [7495, 7487, 7494, 7489, 7498, 7501, 7494, 7496] → Tgt Spa: ['1.000', 
'0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 233 / Rank 2] Tasks: ['Single QA'] | Lens: [35293] → Tgt Spa: ['0.350'] [Step 233 / Rank 1] Tasks: ['Single QA'] | Lens: [37524] → Tgt Spa: ['0.350'] [Step 233 / Rank 0] Tasks: ['Single QA'] | Lens: [37524] → Tgt Spa: ['0.350'] [Step 233 / Rank 6] Tasks: ['Code', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [7495, 7487, 7494, 7489, 7498, 7501, 7494, 7496] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 233 / Rank 4] Tasks: ['Single QA'] | Lens: [56523] → Tgt Spa: ['0.350'] [Step 233 / Rank 3] Tasks: ['Single QA'] | Lens: [35293] → Tgt Spa: ['0.350'] [Step 233 / Rank 5] Tasks: ['Single QA'] | Lens: [56523] → Tgt Spa: ['0.350'] [Step 233 / Rank 3] Tasks: ['Single QA'] | Lens: [58257] → Tgt Spa: ['0.350'] [Step 233 / Rank 5] Tasks: ['Single QA'] | Lens: [65087] → Tgt Spa: ['0.350'] [Step 233 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [27571, 27564] → Tgt Spa: ['1.000', '1.000'] [Step 233 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16432, 16421, 16422] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 233 / Rank 4] Tasks: ['Single QA'] | Lens: [65087] → Tgt Spa: ['0.350'] [Step 233 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16432, 16421, 16422] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 233 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [27571, 27564] → Tgt Spa: ['1.000', '1.000'] [Step 233 / Rank 2] Tasks: ['Single QA'] | Lens: [58257] → Tgt Spa: ['0.350'] [Step 233 / Rank 3] Tasks: ['Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4654, 4671, 4653, 4653, 4673, 4673, 4655, 4655, 4656, 4656, 4657, 4657, 4658, 
4658] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 233 / Rank 6] Tasks: ['Single QA'] | Lens: [38420] → Tgt Spa: ['0.350'] [Step 233 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18578, 18591, 18592] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 233 / Rank 7] Tasks: ['Single QA'] | Lens: [38420] → Tgt Spa: ['0.350'] [Step 233 / Rank 4] Tasks: ['Single QA'] | Lens: [43619] → Tgt Spa: ['0.350'] [Step 233 / Rank 0] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18578, 18591, 18592] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 233 / Rank 5] Tasks: ['Single QA'] | Lens: [43619] → Tgt Spa: ['0.350'] [Step 233 / Rank 2] Tasks: ['Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4654, 4671, 4653, 4653, 4673, 4673, 4655, 4655, 4656, 4656, 4657, 4657, 4658, 4658] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 233 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23045, 23066] → Tgt Spa: ['1.000', '1.000'] [Step 233 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23045, 23066] → Tgt Spa: ['1.000', '1.000'] [Step 233 / Rank 7] Tasks: ['Code'] | Lens: [57422] → Tgt Spa: ['1.000'] [Step 233 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [39836] → Tgt Spa: ['1.000'] [Step 233 / Rank 6] Tasks: ['Code'] | Lens: [57422] → Tgt Spa: ['1.000'] [Step 233 / Rank 1] Tasks: ['Code'] | Lens: [32349] → Tgt Spa: ['1.000'] [Step 233 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [39836] → Tgt Spa: ['1.000'] [Step 233 / Rank 0] Tasks: ['Code'] | Lens: [32349] → Tgt Spa: ['1.000'] [Step 233 / 
Rank 5] Tasks: ['Single QA'] | Lens: [51210] → Tgt Spa: ['0.350'] [Step 233 / Rank 2] Tasks: ['Code'] | Lens: [36912] → Tgt Spa: ['1.000'] [Step 233 / Rank 3] Tasks: ['Code'] | Lens: [36912] → Tgt Spa: ['1.000'] [Step 233 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [49450] → Tgt Spa: ['1.000'] [Step 233 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [49450] → Tgt Spa: ['1.000'] [Step 233 / Rank 1] Tasks: ['Single QA'] | Lens: [42065] → Tgt Spa: ['0.350'] [Step 233 / Rank 0] Tasks: ['Single QA'] | Lens: [42065] → Tgt Spa: ['0.350'] [Step 233 / Rank 4] Tasks: ['Single QA'] | Lens: [51210] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:20:26,985 >> @ 233 | Loss: 2.0224 | LM: 1.9726 | Reg: 0.0498 | Spa(Avg): 0.511 [INFO|lh_trainer.py:797] 2026-02-17 05:20:26,986 >> Statistic -> Code | Spa: 0.715 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 05:20:26,986 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:20:26,986 >> Statistic -> MultiHop | Spa: 0.662 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:20:26,986 >> Statistic -> Single | Spa: 0.396 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:20:26,986 >> Statistic -> Summarization | Spa: 0.657 | Tgt: 1.000 | Z-Loss: 0.114 | [INFO|lh_trainer.py:810] 2026-02-17 05:20:26,988 >> [Micro-Log] {"loss": 2.022422045469284, "lm_loss": 1.9725774036099513, "reg_loss": 0.049844653316540644, "model_sparsity(avg)": 0.5114018296202024, "Spa-Single QA sparsity": 0.39624181915731993, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03630752506775453, "Spa-Code sparsity": 0.7146464586257935, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09247011352669109, "Spa-In-Context Learning sparsity": 0.7175925811131795, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10626857926448187, "Spa-Summarization sparsity": 
0.6567460128239223, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11377583763429097, "Spa-MultiHop QA sparsity": 0.6620370546976725, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1365566005309423, "step": 233, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.310546875, "lambda3 Summarization": 0.1630859375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 05:20:45,073 >> {'loss': 12.1345, 'grad_norm': 0.48986560106277466, 'learning_rate': 9.014029049082889e-05, 'epoch': 0.24644549763033174, 'num_input_tokens_seen': 575930124, 'completed': '78.00% (234 / 300)', 'remaining time': '3:05:51', 'throughput': '7148.20', 'gpu_mem_free': '12959MB', 'step': 234} [Step 234 / Rank 5] Tasks: ['Single QA'] | Lens: [60518] → Tgt Spa: ['0.350'] [Step 234 / Rank 4] Tasks: ['Single QA'] | Lens: [60518] → Tgt Spa: ['0.350'] [Step 234 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [23757, 23757] → Tgt Spa: ['0.350', '0.350'] [Step 234 / Rank 3] Tasks: ['Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'Summarization', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2881, 2874, 2875, 2893, 2876, 2894, 2878, 2877, 2878, 2879, 2880, 2879, 2899, 2898, 2899, 2898, 2880, 2882, 2888, 2883, 2883, 2884] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 234 / Rank 2] Tasks: ['Code', 'Single QA', 'MultiHop QA', 'Summarization', 'Single QA', 'Summarization', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 
'Summarization', 'In-Context Learning', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2881, 2874, 2875, 2893, 2876, 2894, 2878, 2877, 2878, 2879, 2880, 2879, 2899, 2898, 2899, 2898, 2880, 2882, 2888, 2883, 2883, 2884] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350'] [Step 234 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [23757, 23757] → Tgt Spa: ['0.350', '0.350'] [Step 234 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24891, 24891] → Tgt Spa: ['0.350', '0.350'] [Step 234 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24891, 24891] → Tgt Spa: ['0.350', '0.350'] [Step 234 / Rank 2] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [20549, 20531, 20540] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 234 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39832] → Tgt Spa: ['1.000'] [Step 234 / Rank 1] Tasks: ['Single QA'] | Lens: [34414] → Tgt Spa: ['0.350'] [Step 234 / Rank 3] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [20549, 20531, 20540] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 234 / Rank 7] Tasks: ['Single QA'] | Lens: [40522] → Tgt Spa: ['0.350'] [Step 234 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39832] → Tgt Spa: ['1.000'] [Step 234 / Rank 0] Tasks: ['Single QA'] | Lens: [34414] → Tgt Spa: ['0.350'] [Step 234 / Rank 6] Tasks: ['Single QA'] | Lens: [40522] → Tgt Spa: ['0.350'] [Step 234 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [24681, 24673] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [24681, 24673] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [37988] → Tgt Spa: ['1.000'] [Step 234 / Rank 5] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23943, 23962] → Tgt Spa: ['1.000', '1.000'] [Step 
234 / Rank 3] Tasks: ['Single QA'] | Lens: [52026] → Tgt Spa: ['0.350'] [Step 234 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [37988] → Tgt Spa: ['1.000'] [Step 234 / Rank 2] Tasks: ['Single QA'] | Lens: [52026] → Tgt Spa: ['0.350'] [Step 234 / Rank 4] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23943, 23962] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 3] Tasks: ['In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4375, 4376, 4376, 4376, 4377, 4377, 4377, 4378, 4379, 4381, 4381, 4380, 4380, 4380] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 234 / Rank 5] Tasks: ['Single QA'] | Lens: [60521] → Tgt Spa: ['0.350'] [Step 234 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [23038, 23030] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58104] → Tgt Spa: ['1.000'] [Step 234 / Rank 4] Tasks: ['Single QA'] | Lens: [60521] → Tgt Spa: ['0.350'] [Step 234 / Rank 2] Tasks: ['In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4375, 4376, 4376, 4376, 4377, 4377, 4377, 4378, 4379, 4381, 4381, 4380, 4380, 4380] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 234 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [23038, 23030] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58104] → Tgt Spa: ['1.000'] [Step 234 / Rank 4] 
Tasks: ['Code', 'Code', 'Single QA', 'Single QA'] | Lens: [13947, 13952, 13946, 13948] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350'] [Step 234 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [31838, 31831] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29759, 29759] → Tgt Spa: ['0.350', '0.350'] [Step 234 / Rank 1] Tasks: ['Single QA'] | Lens: [38291] → Tgt Spa: ['0.350'] [Step 234 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [31838, 31831] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 0] Tasks: ['Single QA'] | Lens: [38291] → Tgt Spa: ['0.350'] [Step 234 / Rank 5] Tasks: ['Code', 'Code', 'Single QA', 'Single QA'] | Lens: [13947, 13952, 13946, 13948] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350'] [Step 234 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29759, 29759] → Tgt Spa: ['0.350', '0.350'] [Step 234 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28986, 28995] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56170] → Tgt Spa: ['1.000'] [Step 234 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56170] → Tgt Spa: ['1.000'] [Step 234 / Rank 3] Tasks: ['Single QA'] | Lens: [49544] → Tgt Spa: ['0.350'] [Step 234 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28986, 28995] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [29035, 29029] → Tgt Spa: ['1.000', '1.000'] [Step 234 / Rank 2] Tasks: ['Single QA'] | Lens: [49544] → Tgt Spa: ['0.350'] [Step 234 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [29035, 29029] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 05:23:00,039 >> @ 234 | Loss: 2.2107 | LM: 2.1463 | Reg: 0.0644 | Spa(Avg): 0.547 [INFO|lh_trainer.py:797] 2026-02-17 05:23:00,039 >> Statistic -> Code | Spa: 0.702 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 05:23:00,039 >> Statistic -> 
In-Context | Spa: 0.716 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:23:00,040 >> Statistic -> MultiHop | Spa: 0.634 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:23:00,040 >> Statistic -> Single | Spa: 0.427 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:23:00,040 >> Statistic -> Summarization | Spa: 0.606 | Tgt: 1.000 | Z-Loss: 0.141 | [INFO|lh_trainer.py:810] 2026-02-17 05:23:00,042 >> [Micro-Log] {"loss": 2.2107051027317843, "lm_loss": 2.14626800455153, "reg_loss": 0.06443709743810662, "model_sparsity(avg)": 0.5472482666373253, "Spa-Single QA sparsity": 0.4270833283662796, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05495286722725723, "Spa-Code sparsity": 0.7021604776382446, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09746772713131374, "Spa-In-Context Learning sparsity": 0.7159090854904868, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10697024417194453, "Spa-MultiHop QA sparsity": 0.6342592636744181, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12207541945907804, "Spa-Summarization sparsity": 0.6059027686715126, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14054659102112055, "step": 234, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1630859375, "lambda4 Code": 0.263671875} [INFO|lh_trainer.py:331] 2026-02-17 05:23:20,973 >> {'loss': 13.2642, 'grad_norm': 0.680926501750946, 'learning_rate': 8.763802915365534e-05, 'epoch': 0.24749868351764087, 'num_input_tokens_seen': 578430022, 'completed': '78.33% (235 / 300)', 'remaining time': '3:02:59', 'throughput': '8017.67', 'gpu_mem_free': '8251MB', 'step': 235} [Step 235 / Rank 5] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16664, 16657, 16671] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 235 / Rank 4] Tasks: 
['Summarization', 'Code', 'Summarization'] | Lens: [16664, 16657, 16671] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 235 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11708, 11710, 11710, 11710, 11711] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 235 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [11708, 11710, 11710, 11710, 11711] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 235 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [25559, 25552] → Tgt Spa: ['1.000', '1.000'] [Step 235 / Rank 6] Tasks: ['Single QA'] | Lens: [46468] → Tgt Spa: ['0.350'] [Step 235 / Rank 7] Tasks: ['Single QA'] | Lens: [46468] → Tgt Spa: ['0.350'] [Step 235 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [25559, 25552] → Tgt Spa: ['1.000', '1.000'] [Step 235 / Rank 6] Tasks: ['Single QA'] | Lens: [64693] → Tgt Spa: ['0.350'] [Step 235 / Rank 1] Tasks: ['Code'] | Lens: [43487] → Tgt Spa: ['1.000'] [Step 235 / Rank 5] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [6419, 6420, 6422, 6422, 6422, 6423, 6431, 6423, 6424, 6424] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 235 / Rank 2] Tasks: ['Single QA'] | Lens: [33738] → Tgt Spa: ['0.350'] [Step 235 / Rank 3] Tasks: ['Single QA'] | Lens: [33738] → Tgt Spa: ['0.350'] [Step 235 / Rank 0] Tasks: ['Code'] | Lens: [43487] → Tgt Spa: ['1.000'] [Step 235 / Rank 4] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Single QA'] | Lens: [6419, 6420, 6422, 6422, 6422, 6423, 6431, 6423, 6424, 6424] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 235 / 
Rank 7] Tasks: ['Single QA'] | Lens: [64693] → Tgt Spa: ['0.350'] [Step 235 / Rank 1] Tasks: ['Single QA'] | Lens: [42094] → Tgt Spa: ['0.350'] [Step 235 / Rank 2] Tasks: ['Summarization'] | Lens: [37893] → Tgt Spa: ['1.000'] [Step 235 / Rank 0] Tasks: ['Single QA'] | Lens: [42094] → Tgt Spa: ['0.350'] [Step 235 / Rank 6] Tasks: ['Code', 'Single QA'] | Lens: [23746, 23738] → Tgt Spa: ['1.000', '0.350'] [Step 235 / Rank 7] Tasks: ['Code', 'Single QA'] | Lens: [23746, 23738] → Tgt Spa: ['1.000', '0.350'] [Step 235 / Rank 3] Tasks: ['Summarization'] | Lens: [37893] → Tgt Spa: ['1.000'] [Step 235 / Rank 5] Tasks: ['In-Context Learning', 'Code', 'Summarization'] | Lens: [20322, 20330, 20344] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 235 / Rank 4] Tasks: ['In-Context Learning', 'Code', 'Summarization'] | Lens: [20322, 20330, 20344] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 235 / Rank 2] Tasks: ['Single QA'] | Lens: [55633] → Tgt Spa: ['0.350'] [Step 235 / Rank 3] Tasks: ['Single QA'] | Lens: [55633] → Tgt Spa: ['0.350'] [Step 235 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [37943] → Tgt Spa: ['1.000'] [Step 235 / Rank 0] Tasks: ['Single QA'] | Lens: [39992] → Tgt Spa: ['0.350'] [Step 235 / Rank 1] Tasks: ['Single QA'] | Lens: [39992] → Tgt Spa: ['0.350'] [Step 235 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15933, 15933, 15933, 15933] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 235 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [37943] → Tgt Spa: ['1.000'] [Step 235 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15933, 15933, 15933, 15933] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 235 / Rank 3] Tasks: ['Single QA'] | Lens: [48274] → Tgt Spa: ['0.350'] [Step 235 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24294, 24314] → Tgt Spa: ['1.000', '1.000'] [Step 235 / Rank 7] Tasks: ['Summarization', 'Single QA'] | Lens: [23493, 23475] → Tgt Spa: ['1.000', 
'0.350'] [Step 235 / Rank 4] Tasks: ['Single QA'] | Lens: [35826] → Tgt Spa: ['0.350'] [Step 235 / Rank 5] Tasks: ['Single QA'] | Lens: [35826] → Tgt Spa: ['0.350'] [Step 235 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [24294, 24314] → Tgt Spa: ['1.000', '1.000'] [Step 235 / Rank 2] Tasks: ['Single QA'] | Lens: [48274] → Tgt Spa: ['0.350'] [Step 235 / Rank 6] Tasks: ['Summarization', 'Single QA'] | Lens: [23493, 23475] → Tgt Spa: ['1.000', '0.350'] [Step 235 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28800, 28802] → Tgt Spa: ['1.000', '1.000'] [Step 235 / Rank 2] Tasks: ['Single QA'] | Lens: [45603] → Tgt Spa: ['0.350'] [Step 235 / Rank 1] Tasks: ['Single QA'] | Lens: [54039] → Tgt Spa: ['0.350'] [Step 235 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58408] → Tgt Spa: ['1.000'] [Step 235 / Rank 3] Tasks: ['Single QA'] | Lens: [45603] → Tgt Spa: ['0.350'] [Step 235 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58408] → Tgt Spa: ['1.000'] [Step 235 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28800, 28802] → Tgt Spa: ['1.000', '1.000'] [Step 235 / Rank 0] Tasks: ['Single QA'] | Lens: [54039] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:25:31,873 >> @ 235 | Loss: 2.2011 | LM: 2.1431 | Reg: 0.0581 | Spa(Avg): 0.529 [INFO|lh_trainer.py:797] 2026-02-17 05:25:31,873 >> Statistic -> Code | Spa: 0.722 | Tgt: 1.000 | Z-Loss: 0.090 | [INFO|lh_trainer.py:797] 2026-02-17 05:25:31,873 >> Statistic -> In-Context | Spa: 0.705 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:25:31,874 >> Statistic -> MultiHop | Spa: 0.634 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:25:31,874 >> Statistic -> Single | Spa: 0.429 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:25:31,874 >> Statistic -> Summarization | Spa: 0.678 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:810] 2026-02-17 05:25:31,876 >> [Micro-Log] {"loss": 
2.2011476991077266, "lm_loss": 2.1430857541660466, "reg_loss": 0.05806197503504033, "model_sparsity(avg)": 0.5288966024915377, "Spa-Code sparsity": 0.7222222089767456, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09013611823320389, "Spa-In-Context Learning sparsity": 0.7045454545454546, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11179064959287643, "Spa-Single QA sparsity": 0.4294871733738826, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05256225201838578, "Spa-Summarization sparsity": 0.6782407363255819, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10467472796638806, "Spa-MultiHop QA sparsity": 0.6342592636744181, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12207541945907804, "step": 235, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1630859375, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:25:54,262 >> {'loss': 13.2069, 'grad_norm': 0.5158939361572266, 'learning_rate': 8.516358770862817e-05, 'epoch': 0.24855186940494997, 'num_input_tokens_seen': 580816748, 'completed': '78.67% (236 / 300)', 'remaining time': '3:00:05', 'throughput': '7785.03', 'gpu_mem_free': '8399MB', 'step': 236} [Step 236 / Rank 0] Tasks: ['Single QA'] | Lens: [36735] → Tgt Spa: ['0.350'] [Step 236 / Rank 1] Tasks: ['Single QA'] | Lens: [36735] → Tgt Spa: ['0.350'] [Step 236 / Rank 4] Tasks: ['Code'] | Lens: [37527] → Tgt Spa: ['1.000'] [Step 236 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25509, 25512] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 5] Tasks: ['Code'] | Lens: [37527] → Tgt Spa: ['1.000'] [Step 236 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25509, 25512] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 6] Tasks: ['Code', 'Single QA', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: 
[8861, 8853, 8864, 8865, 8862, 8874, 8877] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 236 / Rank 7] Tasks: ['Code', 'Single QA', 'Code', 'Code', 'Single QA', 'Code', 'Code'] | Lens: [8861, 8853, 8864, 8865, 8862, 8874, 8877] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 236 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [4479, 4480, 4482, 4499, 4481, 4482, 4482, 4483, 4483, 4483, 4484, 4484, 4492, 4485] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 236 / Rank 0] Tasks: ['Single QA'] | Lens: [64936] → Tgt Spa: ['0.350'] [Step 236 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [27687, 27697] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [4479, 4480, 4482, 4499, 4481, 4482, 4482, 4483, 4483, 4483, 4484, 4484, 4492, 4485] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350'] [Step 236 / Rank 7] Tasks: ['Single QA'] | Lens: [49215] → Tgt Spa: ['0.350'] [Step 236 / Rank 1] Tasks: ['Single QA'] | Lens: [64936] → Tgt Spa: ['0.350'] [Step 236 / Rank 6] Tasks: ['Single QA'] | Lens: [49215] → Tgt Spa: ['0.350'] [Step 236 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [27687, 27697] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [31628, 31631] → Tgt Spa: ['0.350', '0.350'] [Step 236 / 
Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [18396, 18400, 18400] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 236 / Rank 6] Tasks: ['Single QA'] | Lens: [44308] → Tgt Spa: ['0.350'] [Step 236 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [28977, 28968] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [28977, 28968] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [18396, 18400, 18400] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 236 / Rank 7] Tasks: ['Single QA'] | Lens: [44308] → Tgt Spa: ['0.350'] [Step 236 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [31628, 31631] → Tgt Spa: ['0.350', '0.350'] [Step 236 / Rank 4] Tasks: ['Summarization'] | Lens: [33615] → Tgt Spa: ['1.000'] [Step 236 / Rank 7] Tasks: ['Single QA'] | Lens: [64731] → Tgt Spa: ['0.350'] [Step 236 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [26103, 26095] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 6] Tasks: ['Single QA'] | Lens: [64731] → Tgt Spa: ['0.350'] [Step 236 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [26103, 26095] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 3] Tasks: ['Single QA', 'Summarization', 'Summarization', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning'] | Lens: [3541, 3559, 3560, 3541, 3561, 3561, 3544, 3545, 3544, 3544, 3552, 3545, 3544, 3544, 3545, 3546, 3565, 3547] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 236 / Rank 5] Tasks: ['Summarization'] | Lens: [33615] → Tgt Spa: ['1.000'] [Step 236 / Rank 2] Tasks: ['Single QA', 'Summarization', 'Summarization', 'In-Context Learning', 'Summarization', 
'Summarization', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning'] | Lens: [3541, 3559, 3560, 3541, 3561, 3561, 3544, 3545, 3544, 3544, 3552, 3545, 3544, 3544, 3545, 3546, 3565, 3547] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 236 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [53788] → Tgt Spa: ['1.000'] [Step 236 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [21797, 21796, 21806] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 236 / Rank 1] Tasks: ['Single QA'] | Lens: [43823] → Tgt Spa: ['0.350'] [Step 236 / Rank 3] Tasks: ['Single QA'] | Lens: [56523] → Tgt Spa: ['0.350'] [Step 236 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [21797, 21796, 21806] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 236 / Rank 0] Tasks: ['Single QA'] | Lens: [43823] → Tgt Spa: ['0.350'] [Step 236 / Rank 2] Tasks: ['Single QA'] | Lens: [56523] → Tgt Spa: ['0.350'] [Step 236 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [53788] → Tgt Spa: ['1.000'] [Step 236 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [37782] → Tgt Spa: ['1.000'] [Step 236 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27119, 27120] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [37782] → Tgt Spa: ['1.000'] [Step 236 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19909, 19920, 19909] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 236 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [24366, 24374] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27119, 27120] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: 
[24366, 24374] → Tgt Spa: ['1.000', '1.000'] [Step 236 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19909, 19920, 19909] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 05:28:18,607 >> @ 236 | Loss: 1.8989 | LM: 1.8236 | Reg: 0.0753 | Spa(Avg): 0.583 [INFO|lh_trainer.py:797] 2026-02-17 05:28:18,607 >> Statistic -> Code | Spa: 0.700 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 05:28:18,607 >> Statistic -> In-Context | Spa: 0.711 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:28:18,607 >> Statistic -> MultiHop | Spa: 0.681 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:28:18,607 >> Statistic -> Single | Spa: 0.488 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:28:18,607 >> Statistic -> Summarization | Spa: 0.654 | Tgt: 1.000 | Z-Loss: 0.115 | [INFO|lh_trainer.py:810] 2026-02-17 05:28:18,609 >> [Micro-Log] {"loss": 1.8989070964356263, "lm_loss": 1.8235969580709934, "reg_loss": 0.0753101466204195, "model_sparsity(avg)": 0.5830623594423135, "Spa-Single QA sparsity": 0.48800505020401697, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.09212453943275084, "Spa-Code sparsity": 0.7002923959179929, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09883132341660951, "Spa-Summarization sparsity": 0.6541666746139526, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11456400156021118, "Spa-In-Context Learning sparsity": 0.7109788315636771, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10911733231374196, "Spa-MultiHop QA sparsity": 0.6805555820465088, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1477014571428299, "step": 236, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1630859375, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 
05:28:32,333 >> {'loss': 11.3934, 'grad_norm': 0.7031413316726685, 'learning_rate': 8.271739013855068e-05, 'epoch': 0.24960505529225907, 'num_input_tokens_seen': 583366398, 'completed': '79.00% (237 / 300)', 'remaining time': '2:57:14', 'throughput': '8064.89', 'gpu_mem_free': '10363MB', 'step': 237} [Step 237 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [64471] → Tgt Spa: ['1.000'] [Step 237 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [64471] → Tgt Spa: ['1.000'] [Step 237 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56373] → Tgt Spa: ['1.000'] [Step 237 / Rank 3] Tasks: ['Code'] | Lens: [49398] → Tgt Spa: ['1.000'] [Step 237 / Rank 2] Tasks: ['Code'] | Lens: [49398] → Tgt Spa: ['1.000'] [Step 237 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23642, 23644] → Tgt Spa: ['1.000', '1.000'] [Step 237 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23642, 23644] → Tgt Spa: ['1.000', '1.000'] [Step 237 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56373] → Tgt Spa: ['1.000'] [Step 237 / Rank 4] Tasks: ['Single QA'] | Lens: [43238] → Tgt Spa: ['0.350'] [Step 237 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [28435, 28436] → Tgt Spa: ['0.350', '0.350'] [Step 237 / Rank 2] Tasks: ['Code'] | Lens: [42526] → Tgt Spa: ['1.000'] [Step 237 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [26355, 26346] → Tgt Spa: ['1.000', '1.000'] [Step 237 / Rank 3] Tasks: ['Code'] | Lens: [42526] → Tgt Spa: ['1.000'] [Step 237 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [26355, 26346] → Tgt Spa: ['1.000', '1.000'] [Step 237 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [28435, 28436] → Tgt Spa: ['0.350', '0.350'] [Step 237 / Rank 5] Tasks: ['Single QA'] | Lens: [43238] → Tgt Spa: ['0.350'] [Step 237 / Rank 5] Tasks: ['Single QA'] | Lens: [38382] → Tgt Spa: ['0.350'] [Step 237 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18816, 18818, 18806] → Tgt Spa: ['1.000', '1.000', '1.000'] 
[Step 237 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [49901] → Tgt Spa: ['1.000'] [Step 237 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [49901] → Tgt Spa: ['1.000'] [Step 237 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [27413, 27416] → Tgt Spa: ['0.350', '0.350'] [Step 237 / Rank 4] Tasks: ['Single QA'] | Lens: [38382] → Tgt Spa: ['0.350'] [Step 237 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [27413, 27416] → Tgt Spa: ['0.350', '0.350'] [Step 237 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18816, 18818, 18806] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 237 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28091, 28093] → Tgt Spa: ['1.000', '1.000'] [Step 237 / Rank 7] Tasks: ['Code', 'Single QA'] | Lens: [29698, 29691] → Tgt Spa: ['1.000', '0.350'] [Step 237 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19099, 19112, 19106] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 237 / Rank 6] Tasks: ['Code', 'Single QA'] | Lens: [29698, 29691] → Tgt Spa: ['1.000', '0.350'] [Step 237 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28091, 28093] → Tgt Spa: ['1.000', '1.000'] [Step 237 / Rank 4] Tasks: ['Code'] | Lens: [36674] → Tgt Spa: ['1.000'] [Step 237 / Rank 5] Tasks: ['Code'] | Lens: [36674] → Tgt Spa: ['1.000'] [Step 237 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19099, 19112, 19106] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 237 / Rank 1] Tasks: ['Single QA'] | Lens: [57045] → Tgt Spa: ['0.350'] [Step 237 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [36797] → Tgt Spa: ['1.000'] [Step 237 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [36797] → Tgt Spa: ['1.000'] [Step 237 / Rank 6] Tasks: ['Single QA'] | Lens: [55869] → Tgt Spa: ['0.350'] [Step 237 / Rank 4] Tasks: ['Single QA'] | Lens: [48721] → Tgt Spa: ['0.350'] [Step 237 / Rank 5] Tasks: ['Single QA'] | Lens: [48721] → Tgt Spa: ['0.350'] [Step 237 / Rank 0] Tasks: ['Single QA'] 
| Lens: [57045] → Tgt Spa: ['0.350'] [Step 237 / Rank 7] Tasks: ['Single QA'] | Lens: [55869] → Tgt Spa: ['0.350'] [Step 237 / Rank 5] Tasks: ['Single QA'] | Lens: [51793] → Tgt Spa: ['0.350'] [Step 237 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [32303, 32297] → Tgt Spa: ['1.000', '0.350'] [Step 237 / Rank 2] Tasks: ['Code'] | Lens: [34406] → Tgt Spa: ['1.000'] [Step 237 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [32303, 32297] → Tgt Spa: ['1.000', '0.350'] [Step 237 / Rank 3] Tasks: ['Code'] | Lens: [34406] → Tgt Spa: ['1.000'] [Step 237 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58547] → Tgt Spa: ['1.000'] [Step 237 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58547] → Tgt Spa: ['1.000'] [Step 237 / Rank 4] Tasks: ['Single QA'] | Lens: [51793] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:30:45,442 >> @ 237 | Loss: 1.8563 | LM: 1.7885 | Reg: 0.0679 | Spa(Avg): 0.581 [INFO|lh_trainer.py:797] 2026-02-17 05:30:45,442 >> Statistic -> Code | Spa: 0.714 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 05:30:45,442 >> Statistic -> In-Context | Spa: 0.721 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:30:45,442 >> Statistic -> MultiHop | Spa: 0.681 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:30:45,442 >> Statistic -> Single | Spa: 0.366 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:30:45,442 >> Statistic -> Summarization | Spa: 0.656 | Tgt: 1.000 | Z-Loss: 0.113 | [INFO|lh_trainer.py:810] 2026-02-17 05:30:45,444 >> [Micro-Log] {"loss": 1.8563306517899036, "lm_loss": 1.788454129671057, "reg_loss": 0.06787652338971384, "model_sparsity(avg)": 0.5814043208956718, "Spa-In-Context Learning sparsity": 0.7206790049870809, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10504937420288722, "Spa-Single QA sparsity": 0.36574073632558185, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 
0.011850192347386232, "Spa-Code sparsity": 0.7138888955116272, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09337978959083557, "Spa-Summarization sparsity": 0.6562499850988388, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11312104016542435, "Spa-MultiHop QA sparsity": 0.6805555820465088, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1477014571428299, "step": 237, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1630859375, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:31:08,006 >> {'loss': 11.138, 'grad_norm': 0.7286622524261475, 'learning_rate': 8.02998555867832e-05, 'epoch': 0.2506582411795682, 'num_input_tokens_seen': 585825914, 'completed': '79.33% (238 / 300)', 'remaining time': '2:54:21', 'throughput': '7899.63', 'gpu_mem_free': '6823MB', 'step': 238} [Step 238 / Rank 2] Tasks: ['Single QA'] | Lens: [43204] → Tgt Spa: ['0.350'] [Step 238 / Rank 5] Tasks: ['Summarization', 'In-Context Learning', 'Summarization'] | Lens: [21176, 21158, 21178] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 238 / Rank 7] Tasks: ['Code'] | Lens: [54051] → Tgt Spa: ['1.000'] [Step 238 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40524] → Tgt Spa: ['1.000'] [Step 238 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40524] → Tgt Spa: ['1.000'] [Step 238 / Rank 3] Tasks: ['Single QA'] | Lens: [43204] → Tgt Spa: ['0.350'] [Step 238 / Rank 6] Tasks: ['Code'] | Lens: [54051] → Tgt Spa: ['1.000'] [Step 238 / Rank 4] Tasks: ['Summarization', 'In-Context Learning', 'Summarization'] | Lens: [21176, 21158, 21178] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 238 / Rank 1] Tasks: ['Single QA'] | Lens: [51701] → Tgt Spa: ['0.350'] [Step 238 / Rank 0] Tasks: ['Single QA'] | Lens: [51701] → Tgt Spa: ['0.350'] [Step 238 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41030] → Tgt Spa: ['1.000'] [Step 238 / Rank 5] Tasks: ['Code', 'Code', 
'Code'] | Lens: [18480, 18481, 18487] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 238 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [18480, 18481, 18487] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 238 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1409, 1408, 1408, 1408, 1409, 1410, 1409, 1411, 1429, 1428, 1428, 1409, 1409, 1409, 1410, 1429, 1411, 1411, 1411, 1412, 1412, 1413, 1411, 1432, 1431, 1415, 1413, 1414, 1432, 1414, 1414, 1433, 1433, 1433, 1435, 1416, 1415, 1417, 1416, 1415, 1417, 1417, 1416, 1416, 1435, 1435] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 238 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 
'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization'] | Lens: [1409, 1408, 1408, 1408, 1409, 1410, 1409, 1411, 1429, 1428, 1428, 1409, 1409, 1409, 1410, 1429, 1411, 1411, 1411, 1412, 1412, 1413, 1411, 1432, 1431, 1415, 1413, 1414, 1432, 1414, 1414, 1433, 1433, 1433, 1435, 1416, 1415, 1417, 1416, 1415, 1417, 1417, 1416, 1416, 1435, 1435] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 238 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41030] → Tgt Spa: ['1.000'] [Step 238 / Rank 3] Tasks: ['Single QA'] | Lens: [47630] → Tgt Spa: ['0.350'] [Step 238 / Rank 1] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 238 / Rank 2] Tasks: ['Single QA'] | Lens: [47630] → Tgt Spa: ['0.350'] [Step 238 / Rank 5] Tasks: ['Single QA'] | Lens: [40873] → Tgt Spa: ['0.350'] [Step 238 / Rank 0] Tasks: ['Single QA'] | Lens: [64906] → Tgt Spa: ['0.350'] [Step 238 / Rank 6] Tasks: ['Single QA'] | Lens: [56497] → Tgt Spa: ['0.350'] [Step 238 / Rank 7] Tasks: ['Single QA'] | Lens: [56497] → Tgt Spa: ['0.350'] [Step 238 / Rank 4] Tasks: ['Single QA'] | Lens: [40873] → Tgt Spa: ['0.350'] [Step 238 / Rank 3] Tasks: ['Code'] | Lens: [60921] → Tgt Spa: ['1.000'] [Step 238 / Rank 5] Tasks: ['Single QA'] | Lens: [45408] → Tgt Spa: ['0.350'] [Step 238 / Rank 2] Tasks: 
['Code'] | Lens: [60921] → Tgt Spa: ['1.000'] [Step 238 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57627] → Tgt Spa: ['1.000'] [Step 238 / Rank 4] Tasks: ['Single QA'] | Lens: [45408] → Tgt Spa: ['0.350'] [Step 238 / Rank 1] Tasks: ['Summarization'] | Lens: [52081] → Tgt Spa: ['1.000'] [Step 238 / Rank 0] Tasks: ['Summarization'] | Lens: [52081] → Tgt Spa: ['1.000'] [Step 238 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57627] → Tgt Spa: ['1.000'] [Step 238 / Rank 3] Tasks: ['Single QA'] | Lens: [56620] → Tgt Spa: ['0.350'] [Step 238 / Rank 0] Tasks: ['Single QA'] | Lens: [34046] → Tgt Spa: ['0.350'] [Step 238 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43214] → Tgt Spa: ['1.000'] [Step 238 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43214] → Tgt Spa: ['1.000'] [Step 238 / Rank 7] Tasks: ['Single QA', 'Summarization'] | Lens: [24036, 24055] → Tgt Spa: ['0.350', '1.000'] [Step 238 / Rank 1] Tasks: ['Single QA'] | Lens: [34046] → Tgt Spa: ['0.350'] [Step 238 / Rank 2] Tasks: ['Single QA'] | Lens: [56620] → Tgt Spa: ['0.350'] [Step 238 / Rank 6] Tasks: ['Single QA', 'Summarization'] | Lens: [24036, 24055] → Tgt Spa: ['0.350', '1.000'] [Step 238 / Rank 0] Tasks: ['Single QA'] | Lens: [49967] → Tgt Spa: ['0.350'] [Step 238 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24022, 24003] → Tgt Spa: ['1.000', '1.000'] [Step 238 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24022, 24003] → Tgt Spa: ['1.000', '1.000'] [Step 238 / Rank 1] Tasks: ['Single QA'] | Lens: [49967] → Tgt Spa: ['0.350'] [Step 238 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [31130, 31123] → Tgt Spa: ['1.000', '0.350'] [Step 238 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [31130, 31123] → Tgt Spa: ['1.000', '0.350'] [Step 238 / Rank 3] Tasks: ['In-Context Learning', 'Code', 'In-Context Learning'] | Lens: [21549, 21557, 21551] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 238 / Rank 2] Tasks: ['In-Context Learning', 'Code', 'In-Context 
Learning'] | Lens: [21549, 21557, 21551] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 05:33:39,454 >> @ 238 | Loss: 1.9068 | LM: 1.8435 | Reg: 0.0633 | Spa(Avg): 0.535 [INFO|lh_trainer.py:797] 2026-02-17 05:33:39,454 >> Statistic -> Code | Spa: 0.716 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 05:33:39,454 >> Statistic -> In-Context | Spa: 0.705 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:33:39,454 >> Statistic -> MultiHop | Spa: 0.576 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:33:39,454 >> Statistic -> Single | Spa: 0.369 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:33:39,454 >> Statistic -> Summarization | Spa: 0.647 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:810] 2026-02-17 05:33:39,456 >> [Micro-Log] {"loss": 1.9067890755832195, "lm_loss": 1.8435009258391801, "reg_loss": 0.06328813491563778, "model_sparsity(avg)": 0.534525123735269, "Spa-In-Context Learning sparsity": 0.7048611044883728, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11186293605715036, "Spa-Single QA sparsity": 0.36858973594812244, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.014075586608109565, "Spa-Summarization sparsity": 0.6473765472571055, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11914683878421783, "Spa-Code sparsity": 0.7162698337009975, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09244718934808459, "Spa-MultiHop QA sparsity": 0.5755208302289248, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09276125521864742, "step": 238, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1640625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:33:57,093 >> {'loss': 11.4407, 'grad_norm': 0.6047478318214417, 'learning_rate': 
7.791139828542587e-05, 'epoch': 0.2517114270668773, 'num_input_tokens_seen': 588320922, 'completed': '79.67% (239 / 300)', 'remaining time': '2:51:33', 'throughput': '7377.90', 'gpu_mem_free': '10917MB', 'step': 239} [Step 239 / Rank 3] Tasks: ['Single QA'] | Lens: [59932] → Tgt Spa: ['0.350'] [Step 239 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56976] → Tgt Spa: ['1.000'] [Step 239 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56976] → Tgt Spa: ['1.000'] [Step 239 / Rank 4] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code', 'Code', 'Code'] | Lens: [8770, 8768, 8762, 8772, 8773, 8784, 8788] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 239 / Rank 2] Tasks: ['Single QA'] | Lens: [59932] → Tgt Spa: ['0.350'] [Step 239 / Rank 5] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code', 'Code', 'Code'] | Lens: [8770, 8768, 8762, 8772, 8773, 8784, 8788] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 239 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [63143] → Tgt Spa: ['1.000'] [Step 239 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [63143] → Tgt Spa: ['1.000'] [Step 239 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [41997] → Tgt Spa: ['1.000'] [Step 239 / Rank 5] Tasks: ['Single QA'] | Lens: [56714] → Tgt Spa: ['0.350'] [Step 239 / Rank 4] Tasks: ['Single QA'] | Lens: [56714] → Tgt Spa: ['0.350'] [Step 239 / Rank 6] Tasks: ['Single QA'] | Lens: [33016] → Tgt Spa: ['0.350'] [Step 239 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [31266, 31256] → Tgt Spa: ['1.000', '1.000'] [Step 239 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [41997] → Tgt Spa: ['1.000'] [Step 239 / Rank 7] Tasks: ['Single QA'] | Lens: [33016] → Tgt Spa: ['0.350'] [Step 239 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [31266, 31256] → Tgt Spa: ['1.000', '1.000'] [Step 239 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41130] → Tgt Spa: ['1.000'] [Step 239 / Rank 1] Tasks: ['Single QA'] | Lens: [47656] 
→ Tgt Spa: ['0.350'] [Step 239 / Rank 6] Tasks: ['Code'] | Lens: [49622] → Tgt Spa: ['1.000'] [Step 239 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41130] → Tgt Spa: ['1.000'] [Step 239 / Rank 5] Tasks: ['Single QA'] | Lens: [45715] → Tgt Spa: ['0.350'] [Step 239 / Rank 7] Tasks: ['Code'] | Lens: [49622] → Tgt Spa: ['1.000'] [Step 239 / Rank 0] Tasks: ['Single QA'] | Lens: [47656] → Tgt Spa: ['0.350'] [Step 239 / Rank 4] Tasks: ['Single QA'] | Lens: [45715] → Tgt Spa: ['0.350'] [Step 239 / Rank 7] Tasks: ['Single QA'] | Lens: [51277] → Tgt Spa: ['0.350'] [Step 239 / Rank 1] Tasks: ['Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization'] | Lens: [1377, 1377, 1358, 1359, 1378, 1359, 1359, 1360, 1379, 1360, 1360, 1379, 1361, 1361, 1361, 1380, 1364, 1361, 1363, 1362, 1381, 1381, 1363, 1362, 1364, 1364, 1363, 1362, 1364, 1363, 1382, 1364, 1365, 1384, 1383, 1383, 1384, 1365, 1364, 1366, 1366, 1384, 1367, 1366, 1385, 1385, 1386] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', 
'1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 239 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [21872, 21873] → Tgt Spa: ['1.000', '0.350'] [Step 239 / Rank 4] Tasks: ['Single QA'] | Lens: [57697] → Tgt Spa: ['0.350'] [Step 239 / Rank 6] Tasks: ['Single QA'] | Lens: [51277] → Tgt Spa: ['0.350'] [Step 239 / Rank 0] Tasks: ['Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization'] | Lens: [1377, 1377, 1358, 1359, 1378, 1359, 1359, 1360, 1379, 1360, 1360, 1379, 1361, 1361, 1361, 1380, 1364, 1361, 1363, 1362, 1381, 1381, 1363, 1362, 1364, 1364, 1363, 1362, 1364, 1363, 1382, 1364, 1365, 1384, 1383, 1383, 1384, 1365, 1364, 1366, 1366, 1384, 1367, 1366, 1385, 1385, 1386] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000'] [Step 239 / Rank 5] Tasks: ['Single QA'] | Lens: [57697] → Tgt Spa: ['0.350'] [Step 239 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [21872, 21873] → Tgt Spa: ['1.000', 
'0.350'] [Step 239 / Rank 3] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [6065, 6065, 6066, 6067, 6068, 6086, 6069, 6069, 6069, 6069] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 239 / Rank 1] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [20042, 20033, 20025] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 239 / Rank 5] Tasks: ['Single QA'] | Lens: [42458] → Tgt Spa: ['0.350'] [Step 239 / Rank 0] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [20042, 20033, 20025] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 239 / Rank 2] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [6065, 6065, 6066, 6067, 6068, 6086, 6069, 6069, 6069, 6069] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000'] [Step 239 / Rank 6] Tasks: ['Single QA'] | Lens: [45415] → Tgt Spa: ['0.350'] [Step 239 / Rank 7] Tasks: ['Single QA'] | Lens: [45415] → Tgt Spa: ['0.350'] [Step 239 / Rank 4] Tasks: ['Single QA'] | Lens: [42458] → Tgt Spa: ['0.350'] [Step 239 / Rank 6] Tasks: ['Single QA'] | Lens: [36761] → Tgt Spa: ['0.350'] [Step 239 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [48657] → Tgt Spa: ['1.000'] [Step 239 / Rank 0] Tasks: ['Code'] | Lens: [55039] → Tgt Spa: ['1.000'] [Step 239 / Rank 1] Tasks: ['Code'] | Lens: [55039] → Tgt Spa: ['1.000'] [Step 239 / Rank 7] Tasks: ['Single QA'] | Lens: [36761] → Tgt Spa: ['0.350'] [Step 239 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [48657] → Tgt Spa: ['1.000'] [Step 239 / Rank 3] Tasks: ['Single QA'] | Lens: [34753] → Tgt Spa: ['0.350'] [Step 239 / Rank 2] Tasks: ['Single QA'] | Lens: [34753] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 
2026-02-17 05:36:16,594 >> @ 239 | Loss: 2.2664 | LM: 2.2010 | Reg: 0.0654 | Spa(Avg): 0.528 [INFO|lh_trainer.py:797] 2026-02-17 05:36:16,595 >> Statistic -> Code | Spa: 0.679 | Tgt: 1.000 | Z-Loss: 0.108 | [INFO|lh_trainer.py:797] 2026-02-17 05:36:16,595 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:36:16,595 >> Statistic -> MultiHop | Spa: 0.582 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:36:16,595 >> Statistic -> Single | Spa: 0.430 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:36:16,595 >> Statistic -> Summarization | Spa: 0.626 | Tgt: 1.000 | Z-Loss: 0.131 | [INFO|lh_trainer.py:810] 2026-02-17 05:36:16,597 >> [Micro-Log] {"loss": 2.266387104988098, "lm_loss": 2.2009998013575873, "reg_loss": 0.06538727046669617, "model_sparsity(avg)": 0.5280702287952105, "Spa-In-Context Learning sparsity": 0.7180555462837219, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10614533722400665, "Spa-Single QA sparsity": 0.42992423610253766, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05379782927709378, "Spa-Summarization sparsity": 0.6256944358348846, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13111318349838258, "Spa-MultiHop QA sparsity": 0.5823045350887157, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0959023315873411, "Spa-Code sparsity": 0.6791666567325592, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10756962299346924, "step": 239, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1640625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:36:36,847 >> {'loss': 13.5983, 'grad_norm': 0.5910817980766296, 'learning_rate': 7.55524274843415e-05, 'epoch': 0.2527646129541864, 'num_input_tokens_seen': 590762520, 'completed': '80.00% (240 / 300)', 
'remaining time': '2:48:42', 'throughput': '7641.73', 'gpu_mem_free': '8169MB', 'step': 240} [Step 240 / Rank 3] Tasks: ['Code'] | Lens: [36190] → Tgt Spa: ['1.000'] [Step 240 / Rank 7] Tasks: ['Single QA'] | Lens: [51897] → Tgt Spa: ['0.350'] [Step 240 / Rank 1] Tasks: ['Single QA'] | Lens: [35182] → Tgt Spa: ['0.350'] [Step 240 / Rank 2] Tasks: ['Code'] | Lens: [36190] → Tgt Spa: ['1.000'] [Step 240 / Rank 0] Tasks: ['Single QA'] | Lens: [35182] → Tgt Spa: ['0.350'] [Step 240 / Rank 6] Tasks: ['Single QA'] | Lens: [51897] → Tgt Spa: ['0.350'] [Step 240 / Rank 4] Tasks: ['Summarization', 'Summarization'] | Lens: [28504, 28507] → Tgt Spa: ['1.000', '1.000'] [Step 240 / Rank 5] Tasks: ['Summarization', 'Summarization'] | Lens: [28504, 28507] → Tgt Spa: ['1.000', '1.000'] [Step 240 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17351, 17363, 17352] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 240 / Rank 1] Tasks: ['Single QA'] | Lens: [35410] → Tgt Spa: ['0.350'] [Step 240 / Rank 6] Tasks: ['Single QA'] | Lens: [53734] → Tgt Spa: ['0.350'] [Step 240 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17351, 17363, 17352] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 240 / Rank 7] Tasks: ['Single QA'] | Lens: [53734] → Tgt Spa: ['0.350'] [Step 240 / Rank 3] Tasks: ['Single QA'] | Lens: [52571] → Tgt Spa: ['0.350'] [Step 240 / Rank 0] Tasks: ['Single QA'] | Lens: [35410] → Tgt Spa: ['0.350'] [Step 240 / Rank 2] Tasks: ['Single QA'] | Lens: [52571] → Tgt Spa: ['0.350'] [Step 240 / Rank 2] Tasks: ['Single QA'] | Lens: [56709] → Tgt Spa: ['0.350'] [Step 240 / Rank 5] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 240 / Rank 1] Tasks: ['Single QA'] | Lens: [44769] → Tgt Spa: ['0.350'] [Step 240 / Rank 0] Tasks: ['Single QA'] | Lens: [44769] → Tgt Spa: ['0.350'] [Step 240 / Rank 4] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 240 / Rank 6] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [9803, 9808, 
9809, 9809, 9809, 9815] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 240 / Rank 3] Tasks: ['Single QA'] | Lens: [56709] → Tgt Spa: ['0.350'] [Step 240 / Rank 7] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code', 'Code'] | Lens: [9803, 9808, 9809, 9809, 9809, 9815] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 240 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23315, 23316] → Tgt Spa: ['1.000', '1.000'] [Step 240 / Rank 7] Tasks: ['Single QA'] | Lens: [39958] → Tgt Spa: ['0.350'] [Step 240 / Rank 6] Tasks: ['Single QA'] | Lens: [39958] → Tgt Spa: ['0.350'] [Step 240 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [63180] → Tgt Spa: ['1.000'] [Step 240 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [30694, 30687] → Tgt Spa: ['1.000', '1.000'] [Step 240 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [63180] → Tgt Spa: ['1.000'] [Step 240 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [30694, 30687] → Tgt Spa: ['1.000', '1.000'] [Step 240 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23315, 23316] → Tgt Spa: ['1.000', '1.000'] [Step 240 / Rank 3] Tasks: ['Single QA'] | Lens: [50115] → Tgt Spa: ['0.350'] [Step 240 / Rank 1] Tasks: ['Code'] | Lens: [47359] → Tgt Spa: ['1.000'] [Step 240 / Rank 2] Tasks: ['Single QA'] | Lens: [50115] → Tgt Spa: ['0.350'] [Step 240 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40922] → Tgt Spa: ['1.000'] [Step 240 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40922] → Tgt Spa: ['1.000'] [Step 240 / Rank 0] Tasks: ['Code'] | Lens: [47359] → Tgt Spa: ['1.000'] [Step 240 / Rank 5] Tasks: ['Code'] | Lens: [57562] → Tgt Spa: ['1.000'] [Step 240 / Rank 4] Tasks: ['Code'] | Lens: [57562] → Tgt Spa: ['1.000'] [Step 240 / Rank 4] Tasks: ['Single QA'] | Lens: [35255] → Tgt Spa: ['0.350'] [Step 240 / Rank 7] Tasks: ['Single QA'] | Lens: [42664] → Tgt Spa: ['0.350'] [Step 240 / Rank 6] Tasks: ['Single QA'] | Lens: 
[42664] → Tgt Spa: ['0.350'] [Step 240 / Rank 5] Tasks: ['Single QA'] | Lens: [35255] → Tgt Spa: ['0.350'] [Step 240 / Rank 3] Tasks: ['Single QA'] | Lens: [39493] → Tgt Spa: ['0.350'] [Step 240 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [27089, 27078] → Tgt Spa: ['1.000', '1.000'] [Step 240 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [27089, 27078] → Tgt Spa: ['1.000', '1.000'] [Step 240 / Rank 2] Tasks: ['Single QA'] | Lens: [39493] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:39:08,632 >> @ 240 | Loss: 1.8575 | LM: 1.7972 | Reg: 0.0604 | Spa(Avg): 0.530 [INFO|lh_trainer.py:797] 2026-02-17 05:39:08,632 >> Statistic -> Code | Spa: 0.693 | Tgt: 1.000 | Z-Loss: 0.102 | [INFO|lh_trainer.py:797] 2026-02-17 05:39:08,632 >> Statistic -> In-Context | Spa: 0.717 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:39:08,632 >> Statistic -> MultiHop | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:39:08,633 >> Statistic -> Single | Spa: 0.385 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:39:08,633 >> Statistic -> Summarization | Spa: 0.688 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 05:39:08,635 >> [Micro-Log] {"loss": 1.8575220158090815, "lm_loss": 1.7971692638238892, "reg_loss": 0.06035273270875526, "model_sparsity(avg)": 0.530189037322998, "Spa-Single QA sparsity": 0.3854166567325592, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.026449377279883873, "Spa-In-Context Learning sparsity": 0.7166666507720947, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10728142410516739, "Spa-Code sparsity": 0.6933760643005371, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10190823616889808, "Spa-Summarization sparsity": 0.6875000298023224, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.0978698618710041, "Spa-MultiHop QA sparsity": 0.375, "Spa-MultiHop 
QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.008196473121643066, "step": 240, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1640625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:39:22,598 >> {'loss': 11.1451, 'grad_norm': 0.5623341202735901, 'learning_rate': 7.322334738103267e-05, 'epoch': 0.2538177988414955, 'num_input_tokens_seen': 593119346, 'completed': '80.33% (241 / 300)', 'remaining time': '2:45:52', 'throughput': '7109.55', 'gpu_mem_free': '10359MB', 'step': 241} [Step 241 / Rank 1] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [21349, 21352, 21367] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 241 / Rank 3] Tasks: ['Single QA'] | Lens: [41914] → Tgt Spa: ['0.350'] [Step 241 / Rank 4] Tasks: ['Single QA'] | Lens: [57584] → Tgt Spa: ['0.350'] [Step 241 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [37821] → Tgt Spa: ['1.000'] [Step 241 / Rank 2] Tasks: ['Single QA'] | Lens: [41914] → Tgt Spa: ['0.350'] [Step 241 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [37821] → Tgt Spa: ['1.000'] [Step 241 / Rank 5] Tasks: ['Single QA'] | Lens: [57584] → Tgt Spa: ['0.350'] [Step 241 / Rank 0] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [21349, 21352, 21367] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 241 / Rank 4] Tasks: ['Single QA'] | Lens: [33809] → Tgt Spa: ['0.350'] [Step 241 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [25060, 25068] → Tgt Spa: ['1.000', '1.000'] [Step 241 / Rank 5] Tasks: ['Single QA'] | Lens: [33809] → Tgt Spa: ['0.350'] [Step 241 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [27063, 27064] → Tgt Spa: ['0.350', '0.350'] [Step 241 / Rank 1] Tasks: ['Single QA'] | Lens: [62454] → Tgt Spa: ['0.350'] [Step 241 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [27063, 27064] → Tgt Spa: ['0.350', '0.350'] [Step 241 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [25060, 25068] → Tgt Spa: ['1.000', '1.000'] 
[Step 241 / Rank 0] Tasks: ['Single QA'] | Lens: [62454] → Tgt Spa: ['0.350'] [Step 241 / Rank 7] Tasks: ['Summarization', 'Summarization'] | Lens: [29446, 29446] → Tgt Spa: ['1.000', '1.000'] [Step 241 / Rank 5] Tasks: ['Single QA'] | Lens: [60519] → Tgt Spa: ['0.350'] [Step 241 / Rank 4] Tasks: ['Single QA'] | Lens: [60519] → Tgt Spa: ['0.350'] [Step 241 / Rank 6] Tasks: ['Summarization', 'Summarization'] | Lens: [29446, 29446] → Tgt Spa: ['1.000', '1.000'] [Step 241 / Rank 0] Tasks: ['Single QA'] | Lens: [35217] → Tgt Spa: ['0.350'] [Step 241 / Rank 3] Tasks: ['Single QA'] | Lens: [46707] → Tgt Spa: ['0.350'] [Step 241 / Rank 1] Tasks: ['Single QA'] | Lens: [35217] → Tgt Spa: ['0.350'] [Step 241 / Rank 2] Tasks: ['Single QA'] | Lens: [46707] → Tgt Spa: ['0.350'] [Step 241 / Rank 4] Tasks: ['Single QA'] | Lens: [34578] → Tgt Spa: ['0.350'] [Step 241 / Rank 7] Tasks: ['Code'] | Lens: [37193] → Tgt Spa: ['1.000'] [Step 241 / Rank 3] Tasks: ['Single QA'] | Lens: [58608] → Tgt Spa: ['0.350'] [Step 241 / Rank 5] Tasks: ['Single QA'] | Lens: [34578] → Tgt Spa: ['0.350'] [Step 241 / Rank 6] Tasks: ['Code'] | Lens: [37193] → Tgt Spa: ['1.000'] [Step 241 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24741, 24741] → Tgt Spa: ['1.000', '1.000'] [Step 241 / Rank 2] Tasks: ['Single QA'] | Lens: [58608] → Tgt Spa: ['0.350'] [Step 241 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24741, 24741] → Tgt Spa: ['1.000', '1.000'] [Step 241 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [23385, 23392] → Tgt Spa: ['0.350', '1.000'] [Step 241 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [6018, 6018, 6019, 6019, 6028, 6020, 6030, 6030, 6023, 6024] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 241 / Rank 7] Tasks: ['Summarization'] | Lens: [41126] → 
Tgt Spa: ['1.000'] [Step 241 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [6018, 6018, 6019, 6019, 6028, 6020, 6030, 6030, 6023, 6024] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 241 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [23385, 23392] → Tgt Spa: ['0.350', '1.000'] [Step 241 / Rank 6] Tasks: ['Summarization'] | Lens: [41126] → Tgt Spa: ['1.000'] [Step 241 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA'] | Lens: [4095, 4095, 4102, 4096, 4099, 4098, 4100, 4099, 4105, 4098, 4099, 4099, 4099, 4101, 4101] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 241 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA'] | Lens: [4095, 4095, 4102, 4096, 4099, 4098, 4100, 4099, 4105, 4098, 4099, 4099, 4099, 4101, 4101] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350'] [Step 241 / Rank 3] Tasks: ['Single QA'] | Lens: [64069] → Tgt Spa: ['0.350'] [Step 241 / Rank 4] Tasks: ['Single QA'] | Lens: [38637] → Tgt Spa: ['0.350'] [Step 241 / Rank 0] Tasks: ['Single QA'] | Lens: [46689] → Tgt Spa: ['0.350'] [Step 241 / Rank 7] Tasks: ['Single QA'] | Lens: [38664] → Tgt Spa: ['0.350'] [Step 241 / Rank 5] Tasks: ['Single QA'] | Lens: [38637] → Tgt Spa: ['0.350'] [Step 241 / 
Rank 1] Tasks: ['Single QA'] | Lens: [46689] → Tgt Spa: ['0.350'] [Step 241 / Rank 6] Tasks: ['Single QA'] | Lens: [38664] → Tgt Spa: ['0.350'] [Step 241 / Rank 2] Tasks: ['Single QA'] | Lens: [64069] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:41:51,125 >> @ 241 | Loss: 2.0799 | LM: 2.0327 | Reg: 0.0472 | Spa(Avg): 0.500 [INFO|lh_trainer.py:797] 2026-02-17 05:41:51,125 >> Statistic -> Code | Spa: 0.715 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 05:41:51,125 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:41:51,125 >> Statistic -> MultiHop | Spa: 0.667 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:41:51,125 >> Statistic -> Single | Spa: 0.433 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:41:51,125 >> Statistic -> Summarization | Spa: 0.698 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:810] 2026-02-17 05:41:51,128 >> [Micro-Log] {"loss": 2.079904742538929, "lm_loss": 2.0327369458973408, "reg_loss": 0.04716778330233259, "model_sparsity(avg)": 0.4995177475114663, "Spa-Code sparsity": 0.7152777671813965, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09293081462383271, "Spa-Summarization sparsity": 0.6979166716337204, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09343719109892845, "Spa-Single QA sparsity": 0.43287036567926407, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.055865259802279375, "Spa-In-Context Learning sparsity": 0.7179487026654757, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10674347957739463, "Spa-MultiHop QA sparsity": 0.6666666666666666, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.14038901527722678, "step": 241, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1640625, "lambda4 Code": 
0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:42:18,886 >> {'loss': 12.4794, 'grad_norm': 0.40974873304367065, 'learning_rate': 7.092455705138504e-05, 'epoch': 0.25487098472880465, 'num_input_tokens_seen': 595480902, 'completed': '80.67% (242 / 300)', 'remaining time': '2:43:06', 'throughput': '6697.97', 'gpu_mem_free': '11151MB', 'step': 242} [Step 242 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [23880, 23880] → Tgt Spa: ['0.350', '0.350'] [Step 242 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [23880, 23880] → Tgt Spa: ['0.350', '0.350'] [Step 242 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56757] → Tgt Spa: ['1.000'] [Step 242 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [23960, 23954] → Tgt Spa: ['1.000', '1.000'] [Step 242 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56757] → Tgt Spa: ['1.000'] [Step 242 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [23960, 23954] → Tgt Spa: ['1.000', '1.000'] [Step 242 / Rank 0] Tasks: ['MultiHop QA', 'Single QA'] | Lens: [32703, 32705] → Tgt Spa: ['0.350', '0.350'] [Step 242 / Rank 1] Tasks: ['MultiHop QA', 'Single QA'] | Lens: [32703, 32705] → Tgt Spa: ['0.350', '0.350'] [Step 242 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [18179, 18183, 18182] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 242 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [29708, 29715] → Tgt Spa: ['0.350', '1.000'] [Step 242 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [36603] → Tgt Spa: ['1.000'] [Step 242 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [18179, 18183, 18182] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 242 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [18993, 18996, 18995] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 242 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [36603] → Tgt Spa: ['1.000'] [Step 242 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [18993, 18996, 18995] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 242 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: 
[29708, 29715] → Tgt Spa: ['0.350', '1.000'] [Step 242 / Rank 5] Tasks: ['Single QA'] | Lens: [64193] → Tgt Spa: ['0.350'] [Step 242 / Rank 7] Tasks: ['Single QA'] | Lens: [40999] → Tgt Spa: ['0.350'] [Step 242 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59043] → Tgt Spa: ['1.000'] [Step 242 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59043] → Tgt Spa: ['1.000'] [Step 242 / Rank 6] Tasks: ['Single QA'] | Lens: [40999] → Tgt Spa: ['0.350'] [Step 242 / Rank 2] Tasks: ['Single QA'] | Lens: [57535] → Tgt Spa: ['0.350'] [Step 242 / Rank 4] Tasks: ['Single QA'] | Lens: [64193] → Tgt Spa: ['0.350'] [Step 242 / Rank 3] Tasks: ['Single QA'] | Lens: [57535] → Tgt Spa: ['0.350'] [Step 242 / Rank 4] Tasks: ['Single QA'] | Lens: [59401] → Tgt Spa: ['0.350'] [Step 242 / Rank 2] Tasks: ['Single QA'] | Lens: [48516] → Tgt Spa: ['0.350'] [Step 242 / Rank 0] Tasks: ['Single QA'] | Lens: [41259] → Tgt Spa: ['0.350'] [Step 242 / Rank 6] Tasks: ['Code', 'Summarization'] | Lens: [29675, 29691] → Tgt Spa: ['1.000', '1.000'] [Step 242 / Rank 5] Tasks: ['Single QA'] | Lens: [59401] → Tgt Spa: ['0.350'] [Step 242 / Rank 1] Tasks: ['Single QA'] | Lens: [41259] → Tgt Spa: ['0.350'] [Step 242 / Rank 3] Tasks: ['Single QA'] | Lens: [48516] → Tgt Spa: ['0.350'] [Step 242 / Rank 7] Tasks: ['Code', 'Summarization'] | Lens: [29675, 29691] → Tgt Spa: ['1.000', '1.000'] [Step 242 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43508] → Tgt Spa: ['1.000'] [Step 242 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18880, 18891, 18896] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 242 / Rank 0] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [20888, 20870, 20882] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 242 / Rank 1] Tasks: ['Summarization', 'In-Context Learning', 'Code'] | Lens: [20888, 20870, 20882] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 242 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [18880, 18891, 18896] → Tgt Spa: 
['1.000', '1.000', '1.000'] [Step 242 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43508] → Tgt Spa: ['1.000'] [Step 242 / Rank 3] Tasks: ['Code'] | Lens: [36292] → Tgt Spa: ['1.000'] [Step 242 / Rank 2] Tasks: ['Code'] | Lens: [36292] → Tgt Spa: ['1.000'] [Step 242 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [24715, 24707] → Tgt Spa: ['1.000', '1.000'] [Step 242 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25176, 25177] → Tgt Spa: ['1.000', '1.000'] [Step 242 / Rank 1] Tasks: ['Single QA'] | Lens: [33980] → Tgt Spa: ['0.350'] [Step 242 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32164, 32164] → Tgt Spa: ['0.350', '0.350'] [Step 242 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25176, 25177] → Tgt Spa: ['1.000', '1.000'] [Step 242 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32164, 32164] → Tgt Spa: ['0.350', '0.350'] [Step 242 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [24715, 24707] → Tgt Spa: ['1.000', '1.000'] [Step 242 / Rank 0] Tasks: ['Single QA'] | Lens: [33980] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:44:37,967 >> @ 242 | Loss: 1.9941 | LM: 1.9279 | Reg: 0.0662 | Spa(Avg): 0.565 [INFO|lh_trainer.py:797] 2026-02-17 05:44:37,967 >> Statistic -> Code | Spa: 0.715 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 05:44:37,967 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:44:37,967 >> Statistic -> MultiHop | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:44:37,967 >> Statistic -> Single | Spa: 0.388 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:44:37,967 >> Statistic -> Summarization | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:810] 2026-02-17 05:44:37,969 >> [Micro-Log] {"loss": 1.9941073263374467, "lm_loss": 1.9278850382349144, "reg_loss": 0.06622227121260948, "model_sparsity(avg)": 
0.5646219129363695, "Spa-MultiHop QA sparsity": 0.3888888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01290164515376091, "Spa-Single QA sparsity": 0.38782050059391904, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.026469861237833705, "Spa-Code sparsity": 0.7147435958568866, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09311092530305569, "Spa-In-Context Learning sparsity": 0.7145061625374688, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10820846921867794, "Spa-Summarization sparsity": 0.6805555522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1013437919318676, "step": 242, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1640625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:44:55,549 >> {'loss': 11.9646, 'grad_norm': 0.663348376750946, 'learning_rate': 6.865645038128743e-05, 'epoch': 0.2559241706161137, 'num_input_tokens_seen': 597986692, 'completed': '81.00% (243 / 300)', 'remaining time': '2:40:14', 'throughput': '7997.41', 'gpu_mem_free': '14501MB', 'step': 243} [Step 243 / Rank 5] Tasks: ['Single QA'] | Lens: [51701] → Tgt Spa: ['0.350'] [Step 243 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [59672] → Tgt Spa: ['1.000'] [Step 243 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [59672] → Tgt Spa: ['1.000'] [Step 243 / Rank 3] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [20860, 20849, 20843] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 243 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [31883, 31891] → Tgt Spa: ['1.000', '1.000'] [Step 243 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [31883, 31891] → Tgt Spa: ['1.000', '1.000'] [Step 243 / Rank 2] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [20860, 20849, 20843] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 243 / Rank 4] Tasks: 
['Single QA'] | Lens: [51701] → Tgt Spa: ['0.350'] [Step 243 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [26013, 26015] → Tgt Spa: ['1.000', '1.000'] [Step 243 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [26013, 26015] → Tgt Spa: ['1.000', '1.000'] [Step 243 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32076, 32076] → Tgt Spa: ['0.350', '0.350'] [Step 243 / Rank 3] Tasks: ['Single QA'] | Lens: [51477] → Tgt Spa: ['0.350'] [Step 243 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32076, 32076] → Tgt Spa: ['0.350', '0.350'] [Step 243 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [5468, 5469, 5469, 5472, 5470, 5474, 5474, 5475, 5476, 5478, 5478] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 243 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [5468, 5469, 5469, 5472, 5470, 5474, 5474, 5475, 5476, 5478, 5478] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 243 / Rank 2] Tasks: ['Single QA'] | Lens: [51477] → Tgt Spa: ['0.350'] [Step 243 / Rank 6] Tasks: ['Code'] | Lens: [38146] → Tgt Spa: ['1.000'] [Step 243 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [31213, 31205] → Tgt Spa: ['1.000', '0.350'] [Step 243 / Rank 2] Tasks: ['Single QA'] | Lens: [56070] → Tgt Spa: ['0.350'] [Step 243 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [31213, 31205] → Tgt Spa: ['1.000', '0.350'] [Step 243 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19697, 19709, 19699] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 243 / Rank 0] 
Tasks: ['Code', 'Summarization', 'Code'] | Lens: [19697, 19709, 19699] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 243 / Rank 3] Tasks: ['Single QA'] | Lens: [56070] → Tgt Spa: ['0.350'] [Step 243 / Rank 7] Tasks: ['Code'] | Lens: [38146] → Tgt Spa: ['1.000'] [Step 243 / Rank 4] Tasks: ['Single QA'] | Lens: [53249] → Tgt Spa: ['0.350'] [Step 243 / Rank 1] Tasks: ['Code'] | Lens: [46074] → Tgt Spa: ['1.000'] [Step 243 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [62607] → Tgt Spa: ['1.000'] [Step 243 / Rank 5] Tasks: ['Single QA'] | Lens: [53249] → Tgt Spa: ['0.350'] [Step 243 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [62607] → Tgt Spa: ['1.000'] [Step 243 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18717, 18730, 18719] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 243 / Rank 0] Tasks: ['Code'] | Lens: [46074] → Tgt Spa: ['1.000'] [Step 243 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18717, 18730, 18719] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 243 / Rank 6] Tasks: ['Single QA'] | Lens: [65021] → Tgt Spa: ['0.350'] [Step 243 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [44877] → Tgt Spa: ['1.000'] [Step 243 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [13454, 13464, 13476, 13476] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350'] [Step 243 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [13454, 13464, 13476, 13476] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350'] [Step 243 / Rank 7] Tasks: ['Single QA'] | Lens: [65021] → Tgt Spa: ['0.350'] [Step 243 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [44877] → Tgt Spa: ['1.000'] [Step 243 / Rank 1] Tasks: ['Summarization'] | Lens: [64055] → Tgt Spa: ['1.000'] [Step 243 / Rank 0] Tasks: ['Summarization'] | Lens: [64055] → Tgt Spa: ['1.000'] [Step 243 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56474] → Tgt Spa: ['1.000'] [Step 243 / Rank 1] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17850, 17852, 17866] → Tgt 
Spa: ['1.000', '1.000', '1.000'] [Step 243 / Rank 5] Tasks: ['Single QA'] | Lens: [47427] → Tgt Spa: ['0.350'] [Step 243 / Rank 0] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17850, 17852, 17866] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 243 / Rank 6] Tasks: ['Single QA'] | Lens: [41975] → Tgt Spa: ['0.350'] [Step 243 / Rank 7] Tasks: ['Single QA'] | Lens: [41975] → Tgt Spa: ['0.350'] [Step 243 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56474] → Tgt Spa: ['1.000'] [Step 243 / Rank 4] Tasks: ['Single QA'] | Lens: [47427] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:47:31,696 >> @ 243 | Loss: 1.9982 | LM: 1.9238 | Reg: 0.0745 | Spa(Avg): 0.592 [INFO|lh_trainer.py:797] 2026-02-17 05:47:31,696 >> Statistic -> Code | Spa: 0.713 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 05:47:31,696 >> Statistic -> In-Context | Spa: 0.720 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:47:31,696 >> Statistic -> MultiHop | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:47:31,697 >> Statistic -> Single | Spa: 0.464 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:47:31,697 >> Statistic -> Summarization | Spa: 0.700 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:810] 2026-02-17 05:47:31,699 >> [Micro-Log] {"loss": 1.9982148682077725, "lm_loss": 1.9237607816855113, "reg_loss": 0.0744540791589922, "model_sparsity(avg)": 0.591566709180673, "Spa-Code sparsity": 0.7129629611968994, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09383178303639093, "Spa-Summarization sparsity": 0.7, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09210205078125, "Spa-In-Context Learning sparsity": 0.720085464991056, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10591938690497325, "Spa-Single QA sparsity": 0.4635416530072689, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 
0.07641303459240589, "Spa-MultiHop QA sparsity": 0.3888888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01290164515376091, "step": 243, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1640625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:47:52,940 >> {'loss': 11.9893, 'grad_norm': 0.6817061901092529, 'learning_rate': 6.64194159991414e-05, 'epoch': 0.25697735650342285, 'num_input_tokens_seen': 600640014, 'completed': '81.33% (244 / 300)', 'remaining time': '2:37:27', 'throughput': '7478.75', 'gpu_mem_free': '9935MB', 'step': 244} [Step 244 / Rank 3] Tasks: ['Single QA'] | Lens: [35813] → Tgt Spa: ['0.350'] [Step 244 / Rank 0] Tasks: ['Single QA'] | Lens: [36942] → Tgt Spa: ['0.350'] [Step 244 / Rank 6] Tasks: ['Single QA'] | Lens: [44049] → Tgt Spa: ['0.350'] [Step 244 / Rank 2] Tasks: ['Single QA'] | Lens: [35813] → Tgt Spa: ['0.350'] [Step 244 / Rank 1] Tasks: ['Single QA'] | Lens: [36942] → Tgt Spa: ['0.350'] [Step 244 / Rank 7] Tasks: ['Single QA'] | Lens: [44049] → Tgt Spa: ['0.350'] [Step 244 / Rank 4] Tasks: ['Code'] | Lens: [35538] → Tgt Spa: ['1.000'] [Step 244 / Rank 5] Tasks: ['Code'] | Lens: [35538] → Tgt Spa: ['1.000'] [Step 244 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [24670, 24680] → Tgt Spa: ['1.000', '1.000'] [Step 244 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43498] → Tgt Spa: ['1.000'] [Step 244 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22495, 22517] → Tgt Spa: ['1.000', '1.000'] [Step 244 / Rank 4] Tasks: ['Single QA'] | Lens: [55825] → Tgt Spa: ['0.350'] [Step 244 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22495, 22517] → Tgt Spa: ['1.000', '1.000'] [Step 244 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [24670, 24680] → Tgt Spa: ['1.000', '1.000'] [Step 244 / Rank 5] Tasks: ['Single QA'] | Lens: [55825] → Tgt Spa: ['0.350'] [Step 244 / 
Rank 6] Tasks: ['In-Context Learning'] | Lens: [43498] → Tgt Spa: ['1.000'] [Step 244 / Rank 1] Tasks: ['Single QA'] | Lens: [42041] → Tgt Spa: ['0.350'] [Step 244 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25252, 25238] → Tgt Spa: ['1.000', '1.000'] [Step 244 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25252, 25238] → Tgt Spa: ['1.000', '1.000'] [Step 244 / Rank 0] Tasks: ['Single QA'] | Lens: [42041] → Tgt Spa: ['0.350'] [Step 244 / Rank 6] Tasks: ['Single QA'] | Lens: [47762] → Tgt Spa: ['0.350'] [Step 244 / Rank 4] Tasks: ['Single QA'] | Lens: [63428] → Tgt Spa: ['0.350'] [Step 244 / Rank 7] Tasks: ['Single QA'] | Lens: [47762] → Tgt Spa: ['0.350'] [Step 244 / Rank 5] Tasks: ['Single QA'] | Lens: [63428] → Tgt Spa: ['0.350'] [Step 244 / Rank 1] Tasks: ['Single QA'] | Lens: [39117] → Tgt Spa: ['0.350'] [Step 244 / Rank 6] Tasks: ['Code'] | Lens: [44569] → Tgt Spa: ['1.000'] [Step 244 / Rank 3] Tasks: ['Code'] | Lens: [45046] → Tgt Spa: ['1.000'] [Step 244 / Rank 0] Tasks: ['Single QA'] | Lens: [39117] → Tgt Spa: ['0.350'] [Step 244 / Rank 2] Tasks: ['Code'] | Lens: [45046] → Tgt Spa: ['1.000'] [Step 244 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [23789, 23790] → Tgt Spa: ['0.350', '0.350'] [Step 244 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [23789, 23790] → Tgt Spa: ['0.350', '0.350'] [Step 244 / Rank 7] Tasks: ['Code'] | Lens: [44569] → Tgt Spa: ['1.000'] [Step 244 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25234, 25234] → Tgt Spa: ['1.000', '1.000'] [Step 244 / Rank 7] Tasks: ['Single QA'] | Lens: [42096] → Tgt Spa: ['0.350'] [Step 244 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32118, 32118] → Tgt Spa: ['0.350', '0.350'] [Step 244 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32183, 32184] → Tgt Spa: ['0.350', '0.350'] [Step 244 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32118, 32118] → Tgt Spa: ['0.350', '0.350'] [Step 244 / Rank 6] 
Tasks: ['Single QA'] | Lens: [42096] → Tgt Spa: ['0.350'] [Step 244 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32183, 32184] → Tgt Spa: ['0.350', '0.350'] [Step 244 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25234, 25234] → Tgt Spa: ['1.000', '1.000'] [Step 244 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40030] → Tgt Spa: ['1.000'] [Step 244 / Rank 0] Tasks: ['Single QA'] | Lens: [58088] → Tgt Spa: ['0.350'] [Step 244 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40030] → Tgt Spa: ['1.000'] [Step 244 / Rank 6] Tasks: ['Single QA'] | Lens: [54551] → Tgt Spa: ['0.350'] [Step 244 / Rank 5] Tasks: ['Single QA'] | Lens: [41240] → Tgt Spa: ['0.350'] [Step 244 / Rank 7] Tasks: ['Single QA'] | Lens: [54551] → Tgt Spa: ['0.350'] [Step 244 / Rank 4] Tasks: ['Single QA'] | Lens: [41240] → Tgt Spa: ['0.350'] [Step 244 / Rank 1] Tasks: ['Single QA'] | Lens: [58088] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 05:50:03,806 >> @ 244 | Loss: 2.1845 | LM: 2.1352 | Reg: 0.0493 | Spa(Avg): 0.497 [INFO|lh_trainer.py:797] 2026-02-17 05:50:03,806 >> Statistic -> Code | Spa: 0.715 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 05:50:03,807 >> Statistic -> In-Context | Spa: 0.714 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:50:03,807 >> Statistic -> MultiHop | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:50:03,807 >> Statistic -> Single | Spa: 0.380 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:50:03,807 >> Statistic -> Summarization | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:810] 2026-02-17 05:50:03,809 >> [Micro-Log] {"loss": 2.1845068161686263, "lm_loss": 2.135228402291735, "reg_loss": 0.04927840367599856, "model_sparsity(avg)": 0.49681712066133815, "Spa-Single QA sparsity": 0.37962961859173244, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.023421144171152264, 
"Spa-In-Context Learning sparsity": 0.7142856972558158, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1083603384239333, "Spa-Code sparsity": 0.7152777910232544, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09289795160293579, "Spa-Summarization sparsity": 0.680555522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.100931815803051, "Spa-MultiHop QA sparsity": 0.3888888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.01290164515376091, "step": 244, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1640625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:50:26,094 >> {'loss': 13.107, 'grad_norm': 0.46102288365364075, 'learning_rate': 6.421383720927206e-05, 'epoch': 0.258030542390732, 'num_input_tokens_seen': 602922284, 'completed': '81.67% (245 / 300)', 'remaining time': '2:34:35', 'throughput': '7450.90', 'gpu_mem_free': '6713MB', 'step': 245} [Step 245 / Rank 5] Tasks: ['Single QA'] | Lens: [51549] → Tgt Spa: ['0.350'] [Step 245 / Rank 4] Tasks: ['Single QA'] | Lens: [51549] → Tgt Spa: ['0.350'] [Step 245 / Rank 6] Tasks: ['Single QA'] | Lens: [50123] → Tgt Spa: ['0.350'] [Step 245 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22557, 22558] → Tgt Spa: ['1.000', '1.000'] [Step 245 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22557, 22558] → Tgt Spa: ['1.000', '1.000'] [Step 245 / Rank 2] Tasks: ['Summarization'] | Lens: [42209] → Tgt Spa: ['1.000'] [Step 245 / Rank 3] Tasks: ['Summarization'] | Lens: [42209] → Tgt Spa: ['1.000'] [Step 245 / Rank 7] Tasks: ['Single QA'] | Lens: [50123] → Tgt Spa: ['0.350'] [Step 245 / Rank 6] Tasks: ['Summarization'] | Lens: [49288] → Tgt Spa: ['1.000'] [Step 245 / Rank 5] Tasks: ['Single QA'] | Lens: [37042] → Tgt Spa: ['0.350'] [Step 245 / Rank 3] Tasks: ['Single QA'] | Lens: 
[40856] → Tgt Spa: ['0.350'] [Step 245 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29051, 29051] → Tgt Spa: ['1.000', '1.000'] [Step 245 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29051, 29051] → Tgt Spa: ['1.000', '1.000'] [Step 245 / Rank 2] Tasks: ['Single QA'] | Lens: [40856] → Tgt Spa: ['0.350'] [Step 245 / Rank 7] Tasks: ['Summarization'] | Lens: [49288] → Tgt Spa: ['1.000'] [Step 245 / Rank 4] Tasks: ['Single QA'] | Lens: [37042] → Tgt Spa: ['0.350'] [Step 245 / Rank 5] Tasks: ['Single QA'] | Lens: [58143] → Tgt Spa: ['0.350'] [Step 245 / Rank 4] Tasks: ['Single QA'] | Lens: [58143] → Tgt Spa: ['0.350'] [Step 245 / Rank 7] Tasks: ['Code', 'Code', 'MultiHop QA', 'Code'] | Lens: [15615, 15616, 15610, 15622] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000'] [Step 245 / Rank 2] Tasks: ['Single QA'] | Lens: [52405] → Tgt Spa: ['0.350'] [Step 245 / Rank 1] Tasks: ['Single QA'] | Lens: [36648] → Tgt Spa: ['0.350'] [Step 245 / Rank 6] Tasks: ['Code', 'Code', 'MultiHop QA', 'Code'] | Lens: [15615, 15616, 15610, 15622] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000'] [Step 245 / Rank 3] Tasks: ['Single QA'] | Lens: [52405] → Tgt Spa: ['0.350'] [Step 245 / Rank 0] Tasks: ['Single QA'] | Lens: [36648] → Tgt Spa: ['0.350'] [Step 245 / Rank 1] Tasks: ['Single QA'] | Lens: [38370] → Tgt Spa: ['0.350'] [Step 245 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [61699] → Tgt Spa: ['1.000'] [Step 245 / Rank 0] Tasks: ['Single QA'] | Lens: [38370] → Tgt Spa: ['0.350'] [Step 245 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [53516] → Tgt Spa: ['1.000'] [Step 245 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61699] → Tgt Spa: ['1.000'] [Step 245 / Rank 3] Tasks: ['Single QA'] | Lens: [33698] → Tgt Spa: ['0.350'] [Step 245 / Rank 2] Tasks: ['Single QA'] | Lens: [33698] → Tgt Spa: ['0.350'] [Step 245 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [53516] → Tgt Spa: ['1.000'] [Step 245 / Rank 1] Tasks: ['Single QA'] | 
Lens: [56725] → Tgt Spa: ['0.350'] [Step 245 / Rank 2] Tasks: ['Single QA'] | Lens: [64902] → Tgt Spa: ['0.350'] [Step 245 / Rank 0] Tasks: ['Single QA'] | Lens: [56725] → Tgt Spa: ['0.350'] [Step 245 / Rank 3] Tasks: ['Single QA'] | Lens: [64902] → Tgt Spa: ['0.350'] [Step 245 / Rank 7] Tasks: ['Single QA'] | Lens: [65038] → Tgt Spa: ['0.350'] [Step 245 / Rank 5] Tasks: ['Code'] | Lens: [38215] → Tgt Spa: ['1.000'] [Step 245 / Rank 4] Tasks: ['Code'] | Lens: [38215] → Tgt Spa: ['1.000'] [Step 245 / Rank 6] Tasks: ['Single QA'] | Lens: [65038] → Tgt Spa: ['0.350'] [Step 245 / Rank 4] Tasks: ['In-Context Learning', 'Code', 'Single QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [4003, 4011, 4005, 4011, 4005, 4008, 4007, 4007, 4008, 4014, 4008, 4010, 4011, 4016, 4009, 4011] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 245 / Rank 5] Tasks: ['In-Context Learning', 'Code', 'Single QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'Single QA'] | Lens: [4003, 4011, 4005, 4011, 4005, 4008, 4007, 4007, 4008, 4014, 4008, 4010, 4011, 4016, 4009, 4011] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350'] [Step 245 / Rank 3] Tasks: ['Single QA'] | Lens: [46927] → Tgt Spa: ['0.350'] [Step 245 / Rank 2] Tasks: ['Single QA'] | Lens: [46927] → Tgt Spa: ['0.350'] [Step 245 / Rank 0] Tasks: ['Single QA'] | Lens: [51852] → Tgt Spa: ['0.350'] [Step 245 / Rank 1] Tasks: ['Single QA'] | Lens: [51852] → Tgt Spa: ['0.350'] [Step 245 / Rank 7] Tasks: 
['Code', 'Code', 'Summarization'] | Lens: [18279, 18279, 18290] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 245 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [18279, 18279, 18290] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 05:52:58,020 >> @ 245 | Loss: 2.2364 | LM: 2.1831 | Reg: 0.0533 | Spa(Avg): 0.510 [INFO|lh_trainer.py:797] 2026-02-17 05:52:58,020 >> Statistic -> Code | Spa: 0.701 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 05:52:58,020 >> Statistic -> In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:52:58,020 >> Statistic -> MultiHop | Spa: 0.583 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:52:58,020 >> Statistic -> Single | Spa: 0.409 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:52:58,020 >> Statistic -> Summarization | Spa: 0.699 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:810] 2026-02-17 05:52:58,022 >> [Micro-Log] {"loss": 2.236393202096224, "lm_loss": 2.1830789893865585, "reg_loss": 0.05331420419679489, "model_sparsity(avg)": 0.5096571097771326, "Spa-In-Context Learning sparsity": 0.7129629453023275, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10895153010884921, "Spa-Single QA sparsity": 0.40895060698191327, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04176391416048217, "Spa-Summarization sparsity": 0.6990740696589152, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09315208345651627, "Spa-Code sparsity": 0.7013888776302337, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09857483431696892, "Spa-MultiHop QA sparsity": 0.5833333333333334, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09747740626335144, "step": 245, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 
Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:53:16,599 >> {'loss': 13.4184, 'grad_norm': 0.4497967064380646, 'learning_rate': 6.204009192625087e-05, 'epoch': 0.25908372827804105, 'num_input_tokens_seen': 605350038, 'completed': '82.00% (246 / 300)', 'remaining time': '2:31:47', 'throughput': '7119.28', 'gpu_mem_free': '9413MB', 'step': 246} [Step 246 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22788, 22789] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 0] Tasks: ['Single QA'] | Lens: [40718] → Tgt Spa: ['0.350'] [Step 246 / Rank 1] Tasks: ['Single QA'] | Lens: [40718] → Tgt Spa: ['0.350'] [Step 246 / Rank 3] Tasks: ['Single QA'] | Lens: [34785] → Tgt Spa: ['0.350'] [Step 246 / Rank 2] Tasks: ['Single QA'] | Lens: [34785] → Tgt Spa: ['0.350'] [Step 246 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22788, 22789] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23151, 23152] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23151, 23152] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 4] Tasks: ['Summarization', 'Summarization', 'In-Context Learning'] | Lens: [21233, 21233, 21215] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 246 / Rank 6] Tasks: ['Single QA'] | Lens: [45057] → Tgt Spa: ['0.350'] [Step 246 / Rank 7] Tasks: ['Single QA'] | Lens: [45057] → Tgt Spa: ['0.350'] [Step 246 / Rank 2] Tasks: ['Code', 'Code', 'In-Context Learning', 'Summarization', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [5652, 5652, 5646, 5665, 5648, 5666, 5648, 5648, 5650, 5650, 5658] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 246 / Rank 3] Tasks: ['Code', 'Code', 'In-Context Learning', 'Summarization', 'Single QA', 'Summarization', 
'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code'] | Lens: [5652, 5652, 5646, 5665, 5648, 5666, 5648, 5648, 5650, 5650, 5658] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 246 / Rank 5] Tasks: ['Summarization', 'Summarization', 'In-Context Learning'] | Lens: [21233, 21233, 21215] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 246 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43457] → Tgt Spa: ['1.000'] [Step 246 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43457] → Tgt Spa: ['1.000'] [Step 246 / Rank 4] Tasks: ['Single QA'] | Lens: [34810] → Tgt Spa: ['0.350'] [Step 246 / Rank 5] Tasks: ['Single QA'] | Lens: [34810] → Tgt Spa: ['0.350'] [Step 246 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25489, 25489] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 3] Tasks: ['Single QA'] | Lens: [38677] → Tgt Spa: ['0.350'] [Step 246 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25489, 25489] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 7] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [8610, 8630, 8613, 8620, 8621, 8616, 8617] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 246 / Rank 6] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Code', 'Code', 'Single QA', 'Single QA'] | Lens: [8610, 8630, 8613, 8620, 8621, 8616, 8617] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350'] [Step 246 / Rank 2] Tasks: ['Single QA'] | Lens: [38677] → Tgt Spa: ['0.350'] [Step 246 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [22082, 22073] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [25257, 25257] → Tgt Spa: ['0.350', '0.350'] [Step 246 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44775] → Tgt Spa: ['1.000'] [Step 246 / Rank 6] Tasks: 
['Single QA'] | Lens: [55858] → Tgt Spa: ['0.350'] [Step 246 / Rank 7] Tasks: ['Single QA'] | Lens: [55858] → Tgt Spa: ['0.350'] [Step 246 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [22082, 22073] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44775] → Tgt Spa: ['1.000'] [Step 246 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [25257, 25257] → Tgt Spa: ['0.350', '0.350'] [Step 246 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32131, 32132] → Tgt Spa: ['0.350', '0.350'] [Step 246 / Rank 3] Tasks: ['Code'] | Lens: [34022] → Tgt Spa: ['1.000'] [Step 246 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [30204, 30204] → Tgt Spa: ['0.350', '0.350'] [Step 246 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32131, 32132] → Tgt Spa: ['0.350', '0.350'] [Step 246 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [30204, 30204] → Tgt Spa: ['0.350', '0.350'] [Step 246 / Rank 7] Tasks: ['Single QA'] | Lens: [45052] → Tgt Spa: ['0.350'] [Step 246 / Rank 2] Tasks: ['Code'] | Lens: [34022] → Tgt Spa: ['1.000'] [Step 246 / Rank 6] Tasks: ['Single QA'] | Lens: [45052] → Tgt Spa: ['0.350'] [Step 246 / Rank 3] Tasks: ['Single QA'] | Lens: [49740] → Tgt Spa: ['0.350'] [Step 246 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [50221] → Tgt Spa: ['1.000'] [Step 246 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [50221] → Tgt Spa: ['1.000'] [Step 246 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Code', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1952, 1954, 1953, 1953, 1954, 
1937, 1937, 1936, 1936, 1957, 1956, 1955, 1939, 1939, 1939, 1957, 1938, 1958, 1942, 1947, 1959, 1945, 1962, 1962, 1943, 1943, 1962, 1962, 1944, 1951, 1946, 1944, 1966] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 246 / Rank 2] Tasks: ['Single QA'] | Lens: [49740] → Tgt Spa: ['0.350'] [Step 246 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [28800, 28798] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [28800, 28798] → Tgt Spa: ['1.000', '1.000'] [Step 246 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Code', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [1952, 1954, 1953, 1953, 1954, 1937, 1937, 1936, 1936, 1957, 1956, 1955, 1939, 1939, 1939, 1957, 1938, 1958, 1942, 1947, 1959, 1945, 1962, 1962, 1943, 1943, 1962, 1962, 1944, 1951, 1946, 1944, 1966] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 05:55:07,655 >> @ 246 | Loss: 2.2778 | LM: 2.2102 | Reg: 0.0676 | Spa(Avg): 0.549 [INFO|lh_trainer.py:797] 2026-02-17 
05:55:07,655 >> Statistic -> Code | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 05:55:07,655 >> Statistic -> In-Context | Spa: 0.721 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:55:07,655 >> Statistic -> MultiHop | Spa: 0.590 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:55:07,655 >> Statistic -> Single | Spa: 0.435 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:55:07,655 >> Statistic -> Summarization | Spa: 0.624 | Tgt: 1.000 | Z-Loss: 0.132 | [INFO|lh_trainer.py:810] 2026-02-17 05:55:07,657 >> [Micro-Log] {"loss": 2.2777920154233775, "lm_loss": 2.210222248608867, "reg_loss": 0.06756974992458709, "model_sparsity(avg)": 0.5494578666985035, "Spa-Single QA sparsity": 0.4349415176793149, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.06143788315698897, "Spa-In-Context Learning sparsity": 0.7214052186292761, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10536739536944557, "Spa-Code sparsity": 0.7083333253860473, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09568300992250442, "Spa-Summarization sparsity": 0.6236772452081952, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13203128036998568, "Spa-MultiHop QA sparsity": 0.5898148258527119, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09923430780569713, "step": 246, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:55:25,391 >> {'loss': 13.6668, 'grad_norm': 0.6371114253997803, 'learning_rate': 5.989855261014141e-05, 'epoch': 0.2601369141653502, 'num_input_tokens_seen': 607725012, 'completed': '82.33% (247 / 300)', 'remaining time': '2:28:50', 'throughput': '9220.22', 'gpu_mem_free': '8339MB', 'step': 247} [Step 247 / Rank 5] Tasks: 
['In-Context Learning'] | Lens: [41720] → Tgt Spa: ['1.000'] [Step 247 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41720] → Tgt Spa: ['1.000'] [Step 247 / Rank 7] Tasks: ['Single QA'] | Lens: [58338] → Tgt Spa: ['0.350'] [Step 247 / Rank 3] Tasks: ['Single QA'] | Lens: [54258] → Tgt Spa: ['0.350'] [Step 247 / Rank 2] Tasks: ['Single QA'] | Lens: [54258] → Tgt Spa: ['0.350'] [Step 247 / Rank 6] Tasks: ['Single QA'] | Lens: [58338] → Tgt Spa: ['0.350'] [Step 247 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15872, 15872, 15872, 15872] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 247 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15872, 15872, 15872, 15872] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 247 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [61551] → Tgt Spa: ['1.000'] [Step 247 / Rank 6] Tasks: ['Summarization', 'Summarization'] | Lens: [30441, 30447] → Tgt Spa: ['1.000', '1.000'] [Step 247 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61551] → Tgt Spa: ['1.000'] [Step 247 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60483] → Tgt Spa: ['1.000'] [Step 247 / Rank 1] Tasks: ['Code'] | Lens: [34103] → Tgt Spa: ['1.000'] [Step 247 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60483] → Tgt Spa: ['1.000'] [Step 247 / Rank 0] Tasks: ['Code'] | Lens: [34103] → Tgt Spa: ['1.000'] [Step 247 / Rank 7] Tasks: ['Summarization', 'Summarization'] | Lens: [30441, 30447] → Tgt Spa: ['1.000', '1.000'] [Step 247 / Rank 5] Tasks: ['Single QA'] | Lens: [50379] → Tgt Spa: ['0.350'] [Step 247 / Rank 0] Tasks: ['Single QA'] | Lens: [34787] → Tgt Spa: ['0.350'] [Step 247 / Rank 1] Tasks: ['Single QA'] | Lens: [34787] → Tgt Spa: ['0.350'] [Step 247 / Rank 7] Tasks: ['Single QA'] | Lens: [44089] → Tgt Spa: ['0.350'] [Step 247 / Rank 3] Tasks: ['Single QA'] | Lens: [47128] → Tgt Spa: ['0.350'] [Step 247 / Rank 2] Tasks: ['Single QA'] | Lens: [47128] → Tgt Spa: ['0.350'] [Step 247 / 
Rank 4] Tasks: ['Single QA'] | Lens: [50379] → Tgt Spa: ['0.350'] [Step 247 / Rank 6] Tasks: ['Single QA'] | Lens: [44089] → Tgt Spa: ['0.350'] [Step 247 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [25807, 25801] → Tgt Spa: ['1.000', '1.000'] [Step 247 / Rank 2] Tasks: ['Single QA'] | Lens: [37420] → Tgt Spa: ['0.350'] [Step 247 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [25807, 25801] → Tgt Spa: ['1.000', '1.000'] [Step 247 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23919, 23919] → Tgt Spa: ['1.000', '1.000'] [Step 247 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [45480] → Tgt Spa: ['1.000'] [Step 247 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23919, 23919] → Tgt Spa: ['1.000', '1.000'] [Step 247 / Rank 3] Tasks: ['Single QA'] | Lens: [37420] → Tgt Spa: ['0.350'] [Step 247 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [45480] → Tgt Spa: ['1.000'] [Step 247 / Rank 3] Tasks: ['Single QA'] | Lens: [50402] → Tgt Spa: ['0.350'] [Step 247 / Rank 1] Tasks: ['Single QA'] | Lens: [36962] → Tgt Spa: ['0.350'] [Step 247 / Rank 0] Tasks: ['Single QA'] | Lens: [36962] → Tgt Spa: ['0.350'] [Step 247 / Rank 2] Tasks: ['Single QA'] | Lens: [50402] → Tgt Spa: ['0.350'] [Step 247 / Rank 5] Tasks: ['Single QA'] | Lens: [43632] → Tgt Spa: ['0.350'] [Step 247 / Rank 7] Tasks: ['Code'] | Lens: [58925] → Tgt Spa: ['1.000'] [Step 247 / Rank 4] Tasks: ['Single QA'] | Lens: [43632] → Tgt Spa: ['0.350'] [Step 247 / Rank 6] Tasks: ['Code'] | Lens: [58925] → Tgt Spa: ['1.000'] [Step 247 / Rank 4] Tasks: ['Single QA'] | Lens: [49272] → Tgt Spa: ['0.350'] [Step 247 / Rank 0] Tasks: ['Single QA'] | Lens: [58015] → Tgt Spa: ['0.350'] [Step 247 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27342, 27344] → Tgt Spa: ['1.000', '1.000'] [Step 247 / Rank 2] Tasks: ['Single QA'] | Lens: [51019] → Tgt Spa: ['0.350'] [Step 247 / Rank 5] Tasks: ['Single QA'] | Lens: [49272] → Tgt Spa: 
['0.350'] [Step 247 / Rank 1] Tasks: ['Single QA'] | Lens: [58015] → Tgt Spa: ['0.350'] [Step 247 / Rank 3] Tasks: ['Single QA'] | Lens: [51019] → Tgt Spa: ['0.350'] [Step 247 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27342, 27344] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 05:57:46,714 >> @ 247 | Loss: 2.2942 | LM: 2.2440 | Reg: 0.0502 | Spa(Avg): 0.509 [INFO|lh_trainer.py:797] 2026-02-17 05:57:46,714 >> Statistic -> Code | Spa: 0.699 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 05:57:46,714 >> Statistic -> In-Context | Spa: 0.721 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:57:46,714 >> Statistic -> MultiHop | Spa: 0.590 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:57:46,714 >> Statistic -> Single | Spa: 0.386 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 05:57:46,714 >> Statistic -> Summarization | Spa: 0.688 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 05:57:46,716 >> [Micro-Log] {"loss": 2.294158624485135, "lm_loss": 2.24400509086748, "reg_loss": 0.05015349030630508, "model_sparsity(avg)": 0.5092592512567838, "Spa-Single QA sparsity": 0.38643789992612954, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.026745242662453914, "Spa-Code sparsity": 0.6990740696589152, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09933070093393326, "Spa-In-Context Learning sparsity": 0.7206790049870809, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10567113674349254, "Spa-Summarization sparsity": 0.6875, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.0983208566904068, "Spa-MultiHop QA sparsity": 0.5898148258527119, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09923430780569713, "step": 247, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, 
"lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 05:58:09,000 >> {'loss': 13.765, 'grad_norm': 0.5277857780456543, 'learning_rate': 5.778958620268094e-05, 'epoch': 0.2611901000526593, 'num_input_tokens_seen': 610117954, 'completed': '82.67% (248 / 300)', 'remaining time': '2:26:00', 'throughput': '7313.00', 'gpu_mem_free': '7195MB', 'step': 248} [Step 248 / Rank 3] Tasks: ['Single QA'] | Lens: [46692] → Tgt Spa: ['0.350'] [Step 248 / Rank 4] Tasks: ['Single QA'] | Lens: [35244] → Tgt Spa: ['0.350'] [Step 248 / Rank 1] Tasks: ['Single QA'] | Lens: [47805] → Tgt Spa: ['0.350'] [Step 248 / Rank 6] Tasks: ['Single QA'] | Lens: [65447] → Tgt Spa: ['0.350'] [Step 248 / Rank 0] Tasks: ['Single QA'] | Lens: [47805] → Tgt Spa: ['0.350'] [Step 248 / Rank 2] Tasks: ['Single QA'] | Lens: [46692] → Tgt Spa: ['0.350'] [Step 248 / Rank 5] Tasks: ['Single QA'] | Lens: [35244] → Tgt Spa: ['0.350'] [Step 248 / Rank 7] Tasks: ['Single QA'] | Lens: [65447] → Tgt Spa: ['0.350'] [Step 248 / Rank 1] Tasks: ['Single QA'] | Lens: [38744] → Tgt Spa: ['0.350'] [Step 248 / Rank 2] Tasks: ['Single QA'] | Lens: [54849] → Tgt Spa: ['0.350'] [Step 248 / Rank 4] Tasks: ['Single QA'] | Lens: [57284] → Tgt Spa: ['0.350'] [Step 248 / Rank 7] Tasks: ['Single QA'] | Lens: [55768] → Tgt Spa: ['0.350'] [Step 248 / Rank 0] Tasks: ['Single QA'] | Lens: [38744] → Tgt Spa: ['0.350'] [Step 248 / Rank 5] Tasks: ['Single QA'] | Lens: [57284] → Tgt Spa: ['0.350'] [Step 248 / Rank 6] Tasks: ['Single QA'] | Lens: [55768] → Tgt Spa: ['0.350'] [Step 248 / Rank 3] Tasks: ['Single QA'] | Lens: [54849] → Tgt Spa: ['0.350'] [Step 248 / Rank 3] Tasks: ['Single QA'] | Lens: [37156] → Tgt Spa: ['0.350'] [Step 248 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context 
Learning'] | Lens: [4658, 4658, 4659, 4660, 4666, 4659, 4660, 4660, 4678, 4678, 4667, 4660, 4660, 4660] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 248 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [26624, 26634] → Tgt Spa: ['1.000', '1.000'] [Step 248 / Rank 1] Tasks: ['Single QA'] | Lens: [45068] → Tgt Spa: ['0.350'] [Step 248 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4658, 4658, 4659, 4660, 4666, 4659, 4660, 4660, 4678, 4678, 4667, 4660, 4660, 4660] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 248 / Rank 0] Tasks: ['Single QA'] | Lens: [45068] → Tgt Spa: ['0.350'] [Step 248 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [26624, 26634] → Tgt Spa: ['1.000', '1.000'] [Step 248 / Rank 2] Tasks: ['Single QA'] | Lens: [37156] → Tgt Spa: ['0.350'] [Step 248 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17371, 17361, 17372] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 248 / Rank 4] Tasks: ['Single QA'] | Lens: [61833] → Tgt Spa: ['0.350'] [Step 248 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43506] → Tgt Spa: ['1.000'] [Step 248 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43506] → Tgt Spa: ['1.000'] [Step 248 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [17371, 17361, 17372] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 248 / Rank 5] Tasks: ['Single QA'] | Lens: [61833] → Tgt Spa: ['0.350'] [Step 248 / Rank 0] Tasks: ['Code'] | Lens: [35556] → Tgt Spa: ['1.000'] [Step 248 / Rank 1] Tasks: ['Code'] | Lens: [35556] → Tgt Spa: ['1.000'] [Step 248 / Rank 4] Tasks:
['MultiHop QA'] | Lens: [65337] → Tgt Spa: ['0.350'] [Step 248 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26269, 26270] → Tgt Spa: ['1.000', '1.000'] [Step 248 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [24481, 24491] → Tgt Spa: ['1.000', '1.000'] [Step 248 / Rank 2] Tasks: ['Single QA'] | Lens: [56252] → Tgt Spa: ['0.350'] [Step 248 / Rank 3] Tasks: ['Single QA'] | Lens: [56252] → Tgt Spa: ['0.350'] [Step 248 / Rank 5] Tasks: ['MultiHop QA'] | Lens: [65337] → Tgt Spa: ['0.350'] [Step 248 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [24481, 24491] → Tgt Spa: ['1.000', '1.000'] [Step 248 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26269, 26270] → Tgt Spa: ['1.000', '1.000'] [Step 248 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [54887] → Tgt Spa: ['1.000'] [Step 248 / Rank 3] Tasks: ['Single QA'] | Lens: [57461] → Tgt Spa: ['0.350'] [Step 248 / Rank 1] Tasks: ['Single QA'] | Lens: [41473] → Tgt Spa: ['0.350'] [Step 248 / Rank 6] Tasks: ['Single QA'] | Lens: [47428] → Tgt Spa: ['0.350'] [Step 248 / Rank 0] Tasks: ['Single QA'] | Lens: [41473] → Tgt Spa: ['0.350'] [Step 248 / Rank 2] Tasks: ['Single QA'] | Lens: [57461] → Tgt Spa: ['0.350'] [Step 248 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [54887] → Tgt Spa: ['1.000'] [Step 248 / Rank 7] Tasks: ['Single QA'] | Lens: [47428] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:00:47,568 >> @ 248 | Loss: 2.0474 | LM: 2.0054 | Reg: 0.0420 | Spa(Avg): 0.468 [INFO|lh_trainer.py:797] 2026-02-17 06:00:47,569 >> Statistic -> Code | Spa: 0.687 | Tgt: 1.000 | Z-Loss: 0.105 | [INFO|lh_trainer.py:797] 2026-02-17 06:00:47,569 >> Statistic -> In-Context | Spa: 0.707 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:00:47,569 >> Statistic -> MultiHop | Spa: 0.507 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:00:47,569 >> Statistic -> Single | Spa: 0.386 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 06:00:47,569 >> Statistic -> Summarization | Spa: 0.615 | Tgt: 1.000 | Z-Loss: 0.138 | [INFO|lh_trainer.py:810] 2026-02-17 06:00:47,571 >> [Micro-Log] {"loss": 2.047443182207644, "lm_loss": 2.0054455015342683, "reg_loss": 0.041997678058881625, "model_sparsity(avg)": 0.46818507214387256, "Spa-Single QA sparsity": 0.3856209060725044, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.028447851805728588, "Spa-Code sparsity": 0.687499980131785, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10472534721096356, "Spa-In-Context Learning sparsity": 0.707264950642219, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11135516774195892, "Spa-MultiHop QA sparsity": 0.5069444477558136, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06786351918708533, "Spa-Summarization sparsity": 0.6145833283662796, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13772696256637573, "step": 248, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 06:01:09,599 >> {'loss': 12.2847, 'grad_norm': 0.4079309403896332, 'learning_rate': 5.5713554064406314e-05, 'epoch': 0.26224328593996843, 'num_input_tokens_seen': 612557846, 'completed': '83.00% (249 / 300)', 'remaining time': '2:23:14', 'throughput': '6754.98', 'gpu_mem_free': '12809MB', 'step': 249} [Step 249 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26782, 26784] → Tgt Spa: ['1.000', '1.000'] [Step 249 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [45254] → Tgt Spa: ['1.000'] [Step 249 / Rank 1] Tasks: ['Single QA'] | Lens: [36795] → Tgt Spa: ['0.350'] [Step 249 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [45254] → Tgt Spa: ['1.000'] [Step 249 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: 
[26782, 26784] → Tgt Spa: ['1.000', '1.000'] [Step 249 / Rank 0] Tasks: ['Single QA'] | Lens: [36795] → Tgt Spa: ['0.350'] [Step 249 / Rank 5] Tasks: ['Single QA'] | Lens: [59515] → Tgt Spa: ['0.350'] [Step 249 / Rank 4] Tasks: ['Single QA'] | Lens: [59515] → Tgt Spa: ['0.350'] [Step 249 / Rank 6] Tasks: ['Code', 'Summarization'] | Lens: [23084, 23092] → Tgt Spa: ['1.000', '1.000'] [Step 249 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27666, 27665] → Tgt Spa: ['0.350', '1.000'] [Step 249 / Rank 5] Tasks: ['Single QA'] | Lens: [35254] → Tgt Spa: ['0.350'] [Step 249 / Rank 4] Tasks: ['Single QA'] | Lens: [35254] → Tgt Spa: ['0.350'] [Step 249 / Rank 7] Tasks: ['Code', 'Summarization'] | Lens: [23084, 23092] → Tgt Spa: ['1.000', '1.000'] [Step 249 / Rank 2] Tasks: ['Single QA'] | Lens: [54865] → Tgt Spa: ['0.350'] [Step 249 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27666, 27665] → Tgt Spa: ['0.350', '1.000'] [Step 249 / Rank 3] Tasks: ['Single QA'] | Lens: [54865] → Tgt Spa: ['0.350'] [Step 249 / Rank 5] Tasks: ['Code'] | Lens: [46230] → Tgt Spa: ['1.000'] [Step 249 / Rank 3] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [8455, 8454, 8454, 8450, 8461, 8455, 8459] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350'] [Step 249 / Rank 7] Tasks: ['Single QA'] | Lens: [52326] → Tgt Spa: ['0.350'] [Step 249 / Rank 6] Tasks: ['Single QA'] | Lens: [52326] → Tgt Spa: ['0.350'] [Step 249 / Rank 4] Tasks: ['Code'] | Lens: [46230] → Tgt Spa: ['1.000'] [Step 249 / Rank 0] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [20382, 20384, 20377] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 249 / Rank 1] Tasks: ['Code', 'Code', 'Single QA'] | Lens: [20382, 20384, 20377] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 249 / Rank 2] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [8455, 8454, 8454, 8450, 8461, 8455, 8459] → Tgt Spa: ['1.000', '1.000', 
'1.000', '0.350', '1.000', '0.350', '0.350'] [Step 249 / Rank 1] Tasks: ['Single QA'] | Lens: [50122] → Tgt Spa: ['0.350'] [Step 249 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23244, 23244] → Tgt Spa: ['1.000', '1.000'] [Step 249 / Rank 0] Tasks: ['Single QA'] | Lens: [50122] → Tgt Spa: ['0.350'] [Step 249 / Rank 6] Tasks: ['Code', 'MultiHop QA', 'Code', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'Code', 'MultiHop QA', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'MultiHop QA'] | Lens: [3724, 3719, 3725, 3717, 3734, 3725, 3720, 3726, 3722, 3721, 3739, 3722, 3721, 3722, 3723, 3723, 3726] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 249 / Rank 3] Tasks: ['Single QA'] | Lens: [34808] → Tgt Spa: ['0.350'] [Step 249 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23244, 23244] → Tgt Spa: ['1.000', '1.000'] [Step 249 / Rank 2] Tasks: ['Single QA'] | Lens: [34808] → Tgt Spa: ['0.350'] [Step 249 / Rank 7] Tasks: ['Code', 'MultiHop QA', 'Code', 'In-Context Learning', 'Summarization', 'Code', 'In-Context Learning', 'Code', 'MultiHop QA', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'MultiHop QA'] | Lens: [3724, 3719, 3725, 3717, 3734, 3725, 3720, 3726, 3722, 3721, 3739, 3722, 3721, 3722, 3723, 3723, 3726] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 249 / Rank 3] Tasks: ['Code'] | Lens: [36859] → Tgt Spa: ['1.000'] [Step 249 / Rank 5] Tasks: ['Single QA'] | Lens: [52512] → Tgt Spa: ['0.350'] [Step 249 / Rank 7] Tasks: ['Single QA'] | Lens: [59145] → Tgt Spa: ['0.350'] [Step 249 / Rank 1] 
Tasks: ['Single QA'] | Lens: [41748] → Tgt Spa: ['0.350'] [Step 249 / Rank 6] Tasks: ['Single QA'] | Lens: [59145] → Tgt Spa: ['0.350'] [Step 249 / Rank 4] Tasks: ['Single QA'] | Lens: [52512] → Tgt Spa: ['0.350'] [Step 249 / Rank 2] Tasks: ['Code'] | Lens: [36859] → Tgt Spa: ['1.000'] [Step 249 / Rank 0] Tasks: ['Single QA'] | Lens: [41748] → Tgt Spa: ['0.350'] [Step 249 / Rank 2] Tasks: ['Code'] | Lens: [40976] → Tgt Spa: ['1.000'] [Step 249 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Single QA'] | Lens: [17699, 17699, 17683] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 249 / Rank 4] Tasks: ['Code'] | Lens: [33505] → Tgt Spa: ['1.000'] [Step 249 / Rank 3] Tasks: ['Code'] | Lens: [40976] → Tgt Spa: ['1.000'] [Step 249 / Rank 6] Tasks: ['Code'] | Lens: [33569] → Tgt Spa: ['1.000'] [Step 249 / Rank 7] Tasks: ['Code'] | Lens: [33569] → Tgt Spa: ['1.000'] [Step 249 / Rank 5] Tasks: ['Code'] | Lens: [33505] → Tgt Spa: ['1.000'] [Step 249 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Single QA'] | Lens: [17699, 17699, 17683] → Tgt Spa: ['1.000', '1.000', '0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:03:29,662 >> @ 249 | Loss: 1.8689 | LM: 1.8062 | Reg: 0.0627 | Spa(Avg): 0.540 [INFO|lh_trainer.py:797] 2026-02-17 06:03:29,662 >> Statistic -> Code | Spa: 0.710 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 06:03:29,662 >> Statistic -> In-Context | Spa: 0.707 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:03:29,662 >> Statistic -> MultiHop | Spa: 0.704 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:03:29,662 >> Statistic -> Single | Spa: 0.426 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:03:29,663 >> Statistic -> Summarization | Spa: 0.586 | Tgt: 1.000 | Z-Loss: 0.154 | [INFO|lh_trainer.py:810] 2026-02-17 06:03:29,665 >> [Micro-Log] {"loss": 1.868891153484583, "lm_loss": 1.8062074081972241, "reg_loss": 0.06268373042985331, "model_sparsity(avg)": 0.5397643931210041, 
"Spa-Single QA sparsity": 0.4261695836719714, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05692341132089496, "Spa-In-Context Learning sparsity": 0.7070706974376332, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11154154688119888, "Spa-Code sparsity": 0.7100694365799427, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0950980712659657, "Spa-Summarization sparsity": 0.5861111044883728, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15415160059928895, "Spa-MultiHop QA sparsity": 0.7037037213643392, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.16171314815680185, "step": 249, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.265625} [INFO|lh_trainer.py:331] 2026-02-17 06:03:42,583 >> {'loss': 11.2133, 'grad_norm': 0.5939152240753174, 'learning_rate': 5.3670811912737094e-05, 'epoch': 0.2632964718272775, 'num_input_tokens_seen': 614861376, 'completed': '83.33% (250 / 300)', 'remaining time': '2:20:23', 'throughput': '7528.69', 'gpu_mem_free': '9529MB', 'step': 250} [Step 250 / Rank 5] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16547, 16559, 16560] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 250 / Rank 3] Tasks: ['Single QA'] | Lens: [36920] → Tgt Spa: ['0.350'] [Step 250 / Rank 1] Tasks: ['Single QA'] | Lens: [49118] → Tgt Spa: ['0.350'] [Step 250 / Rank 7] Tasks: ['Single QA'] | Lens: [52415] → Tgt Spa: ['0.350'] [Step 250 / Rank 0] Tasks: ['Single QA'] | Lens: [49118] → Tgt Spa: ['0.350'] [Step 250 / Rank 4] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16547, 16559, 16560] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 250 / Rank 2] Tasks: ['Single QA'] | Lens: [36920] → Tgt Spa: ['0.350'] [Step 250 / Rank 6] Tasks: ['Single QA'] | Lens: [52415] → Tgt Spa: ['0.350'] [Step 250 / Rank 1] Tasks: ['Summarization', 
'Code', 'Code'] | Lens: [19295, 19287, 19286] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 250 / Rank 6] Tasks: ['Code'] | Lens: [40790] → Tgt Spa: ['1.000'] [Step 250 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19295, 19287, 19286] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 250 / Rank 7] Tasks: ['Code'] | Lens: [40790] → Tgt Spa: ['1.000'] [Step 250 / Rank 3] Tasks: ['Single QA'] | Lens: [36322] → Tgt Spa: ['0.350'] [Step 250 / Rank 4] Tasks: ['Single QA'] | Lens: [52960] → Tgt Spa: ['0.350'] [Step 250 / Rank 5] Tasks: ['Single QA'] | Lens: [52960] → Tgt Spa: ['0.350'] [Step 250 / Rank 2] Tasks: ['Single QA'] | Lens: [36322] → Tgt Spa: ['0.350'] [Step 250 / Rank 6] Tasks: ['Single QA'] | Lens: [42020] → Tgt Spa: ['0.350'] [Step 250 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [23450, 23442] → Tgt Spa: ['1.000', '1.000'] [Step 250 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [62700] → Tgt Spa: ['1.000'] [Step 250 / Rank 5] Tasks: ['Single QA'] | Lens: [53369] → Tgt Spa: ['0.350'] [Step 250 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [23450, 23442] → Tgt Spa: ['1.000', '1.000'] [Step 250 / Rank 7] Tasks: ['Single QA'] | Lens: [42020] → Tgt Spa: ['0.350'] [Step 250 / Rank 4] Tasks: ['Single QA'] | Lens: [53369] → Tgt Spa: ['0.350'] [Step 250 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [62700] → Tgt Spa: ['1.000'] [Step 250 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [57148] → Tgt Spa: ['1.000'] [Step 250 / Rank 6] Tasks: ['Single QA'] | Lens: [46960] → Tgt Spa: ['0.350'] [Step 250 / Rank 5] Tasks: ['Code'] | Lens: [56520] → Tgt Spa: ['1.000'] [Step 250 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [17660, 17660, 17662] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 250 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [57148] → Tgt Spa: ['1.000'] [Step 250 / Rank 7] Tasks: ['Single QA'] | Lens: [46960] → Tgt Spa: ['0.350'] [Step 250 / Rank 4] Tasks: ['Code'] | Lens: [56520] → Tgt Spa: ['1.000'] 
[Step 250 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [17660, 17660, 17662] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 250 / Rank 2] Tasks: ['Single QA'] | Lens: [41971] → Tgt Spa: ['0.350'] [Step 250 / Rank 1] Tasks: ['Single QA'] | Lens: [50955] → Tgt Spa: ['0.350'] [Step 250 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [61470] → Tgt Spa: ['1.000'] [Step 250 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [61470] → Tgt Spa: ['1.000'] [Step 250 / Rank 6] Tasks: ['Single QA'] | Lens: [36498] → Tgt Spa: ['0.350'] [Step 250 / Rank 7] Tasks: ['Single QA'] | Lens: [36498] → Tgt Spa: ['0.350'] [Step 250 / Rank 3] Tasks: ['Single QA'] | Lens: [41971] → Tgt Spa: ['0.350'] [Step 250 / Rank 0] Tasks: ['Single QA'] | Lens: [50955] → Tgt Spa: ['0.350'] [Step 250 / Rank 3] Tasks: ['Single QA'] | Lens: [43290] → Tgt Spa: ['0.350'] [Step 250 / Rank 4] Tasks: ['Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'Code'] | Lens: [8507, 8508, 8513, 8515, 8539, 8528, 8529] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 250 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64503] → Tgt Spa: ['1.000'] [Step 250 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [2884, 2884, 2883, 2884, 2884, 2885, 2902, 2892, 2887, 2887, 2888, 2905, 2888, 2889, 2905, 2893, 2889, 2888, 2890, 2890, 2891, 2907] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 250 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64503] → Tgt Spa: ['1.000'] [Step 250 / Rank 
7] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization'] | Lens: [2884, 2884, 2883, 2884, 2884, 2885, 2902, 2892, 2887, 2887, 2888, 2905, 2888, 2889, 2905, 2893, 2889, 2888, 2890, 2890, 2891, 2907] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 250 / Rank 2] Tasks: ['Single QA'] | Lens: [43290] → Tgt Spa: ['0.350'] [Step 250 / Rank 5] Tasks: ['Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'Code', 'Code'] | Lens: [8507, 8508, 8513, 8515, 8539, 8528, 8529] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:06:12,816 >> @ 250 | Loss: 2.1578 | LM: 2.1031 | Reg: 0.0547 | Spa(Avg): 0.510 [INFO|lh_trainer.py:797] 2026-02-17 06:06:12,816 >> Statistic -> Code | Spa: 0.694 | Tgt: 1.000 | Z-Loss: 0.102 | [INFO|lh_trainer.py:797] 2026-02-17 06:06:12,817 >> Statistic -> In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:06:12,817 >> Statistic -> MultiHop | Spa: 0.599 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:06:12,817 >> Statistic -> Single | Spa: 0.414 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:06:12,817 >> Statistic -> Summarization | Spa: 0.674 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:810] 2026-02-17 06:06:12,819 >> [Micro-Log] {"loss": 2.1578146318594613, "lm_loss": 2.1030975747853518, "reg_loss": 0.05471703472236792, "model_sparsity(avg)": 0.5101962089538574, "Spa-Single QA sparsity": 0.413888880610466, "Spa-Single QA target_sparsity": 
0.349609375, "Spa-Single QA log_z_loss": 0.04446069389814511, "Spa-Summarization sparsity": 0.6736111268401146, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10576110705733299, "Spa-Code sparsity": 0.6944444417953491, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10220608115196228, "Spa-In-Context Learning sparsity": 0.7129629651705424, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1089072513083617, "Spa-MultiHop QA sparsity": 0.5992063454219273, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10573562926479749, "step": 250, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:06:38,785 >> {'loss': 12.9469, 'grad_norm': 0.5120065212249756, 'learning_rate': 5.166170976102475e-05, 'epoch': 0.2643496577145866, 'num_input_tokens_seen': 617294518, 'completed': '83.67% (251 / 300)', 'remaining time': '2:17:36', 'throughput': '6904.38', 'gpu_mem_free': '4267MB', 'step': 251} [Step 251 / Rank 1] Tasks: ['Single QA'] | Lens: [56070] → Tgt Spa: ['0.350'] [Step 251 / Rank 0] Tasks: ['Single QA'] | Lens: [56070] → Tgt Spa: ['0.350'] [Step 251 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23390, 23391] → Tgt Spa: ['0.350', '1.000'] [Step 251 / Rank 2] Tasks: ['Single QA'] | Lens: [44081] → Tgt Spa: ['0.350'] [Step 251 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [27254, 27254] → Tgt Spa: ['0.350', '0.350'] [Step 251 / Rank 3] Tasks: ['Single QA'] | Lens: [44081] → Tgt Spa: ['0.350'] [Step 251 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [27254, 27254] → Tgt Spa: ['0.350', '0.350'] [Step 251 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23390, 23391] → Tgt Spa: ['0.350', '1.000'] [Step 251 / Rank 2] Tasks: ['Code'] | Lens: [42581] → Tgt Spa: ['1.000'] [Step 251 / Rank 3] Tasks: 
['Code'] | Lens: [42581] → Tgt Spa: ['1.000'] [Step 251 / Rank 4] Tasks: ['Single QA'] | Lens: [35086] → Tgt Spa: ['0.350'] [Step 251 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [54723] → Tgt Spa: ['1.000'] [Step 251 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [54723] → Tgt Spa: ['1.000'] [Step 251 / Rank 1] Tasks: ['Single QA'] | Lens: [52150] → Tgt Spa: ['0.350'] [Step 251 / Rank 0] Tasks: ['Single QA'] | Lens: [52150] → Tgt Spa: ['0.350'] [Step 251 / Rank 5] Tasks: ['Single QA'] | Lens: [35086] → Tgt Spa: ['0.350'] [Step 251 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42113] → Tgt Spa: ['1.000'] [Step 251 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42113] → Tgt Spa: ['1.000'] [Step 251 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28726, 28730] → Tgt Spa: ['1.000', '1.000'] [Step 251 / Rank 3] Tasks: ['Code'] | Lens: [44071] → Tgt Spa: ['1.000'] [Step 251 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [24507, 24501] → Tgt Spa: ['1.000', '1.000'] [Step 251 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [24507, 24501] → Tgt Spa: ['1.000', '1.000'] [Step 251 / Rank 2] Tasks: ['Code'] | Lens: [44071] → Tgt Spa: ['1.000'] [Step 251 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28726, 28730] → Tgt Spa: ['1.000', '1.000'] [Step 251 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40772] → Tgt Spa: ['1.000'] [Step 251 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41897] → Tgt Spa: ['1.000'] [Step 251 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [20177, 20174, 20193] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 251 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40772] → Tgt Spa: ['1.000'] [Step 251 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [20177, 20174, 20193] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 251 / Rank 7] Tasks: ['Single QA'] | Lens: [54825] → Tgt Spa: ['0.350'] [Step 251 / Rank 3] Tasks: 
['In-Context Learning'] | Lens: [41897] → Tgt Spa: ['1.000'] [Step 251 / Rank 6] Tasks: ['Single QA'] | Lens: [54825] → Tgt Spa: ['0.350'] [Step 251 / Rank 2] Tasks: ['Single QA'] | Lens: [33763] → Tgt Spa: ['0.350'] [Step 251 / Rank 6] Tasks: ['Single QA'] | Lens: [53209] → Tgt Spa: ['0.350'] [Step 251 / Rank 3] Tasks: ['Single QA'] | Lens: [33763] → Tgt Spa: ['0.350'] [Step 251 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28738, 28745] → Tgt Spa: ['1.000', '1.000'] [Step 251 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15810, 15810, 15810, 15810] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 251 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28738, 28745] → Tgt Spa: ['1.000', '1.000'] [Step 251 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15810, 15810, 15810, 15810] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 251 / Rank 7] Tasks: ['Single QA'] | Lens: [53209] → Tgt Spa: ['0.350'] [Step 251 / Rank 7] Tasks: ['Single QA'] | Lens: [43208] → Tgt Spa: ['0.350'] [Step 251 / Rank 2] Tasks: ['Code'] | Lens: [43955] → Tgt Spa: ['1.000'] [Step 251 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [32205, 32215] → Tgt Spa: ['0.350', '1.000'] [Step 251 / Rank 6] Tasks: ['Single QA'] | Lens: [43208] → Tgt Spa: ['0.350'] [Step 251 / Rank 4] Tasks: ['Single QA'] | Lens: [34407] → Tgt Spa: ['0.350'] [Step 251 / Rank 5] Tasks: ['Single QA'] | Lens: [34407] → Tgt Spa: ['0.350'] [Step 251 / Rank 3] Tasks: ['Code'] | Lens: [43955] → Tgt Spa: ['1.000'] [Step 251 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [32205, 32215] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:08:51,398 >> @ 251 | Loss: 2.1214 | LM: 2.0663 | Reg: 0.0551 | Spa(Avg): 0.537 [INFO|lh_trainer.py:797] 2026-02-17 06:08:51,398 >> Statistic -> Code | Spa: 0.718 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 06:08:51,398 >> Statistic -> 
In-Context | Spa: 0.716 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:08:51,398 >> Statistic -> MultiHop | Spa: 0.599 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:08:51,398 >> Statistic -> Single | Spa: 0.368 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:08:51,398 >> Statistic -> Summarization | Spa: 0.722 | Tgt: 1.000 | Z-Loss: 0.083 | [INFO|lh_trainer.py:810] 2026-02-17 06:08:51,400 >> [Micro-Log] {"loss": 2.1214100966850915, "lm_loss": 2.066297598183155, "reg_loss": 0.0551124800016017, "model_sparsity(avg)": 0.5367476803561052, "Spa-Single QA sparsity": 0.36846404215868783, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.014733024859143531, "Spa-In-Context Learning sparsity": 0.7159090909090909, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10768065804784949, "Spa-Code sparsity": 0.7175925970077515, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09263196090857188, "Spa-Summarization sparsity": 0.7222222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08261598646640778, "Spa-MultiHop QA sparsity": 0.5992063454219273, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10573562926479749, "step": 251, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:09:08,867 >> {'loss': 12.7285, 'grad_norm': 0.6367613673210144, 'learning_rate': 4.968659185858018e-05, 'epoch': 0.26540284360189575, 'num_input_tokens_seen': 619635220, 'completed': '84.00% (252 / 300)', 'remaining time': '2:14:44', 'throughput': '7798.08', 'gpu_mem_free': '6325MB', 'step': 252} [Step 252 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25898, 25898] → Tgt Spa: ['1.000', '1.000'] [Step 252 / Rank 0] Tasks: ['Single QA'] | 
Lens: [50121] → Tgt Spa: ['0.350'] [Step 252 / Rank 7] Tasks: ['Single QA'] | Lens: [36900] → Tgt Spa: ['0.350'] [Step 252 / Rank 2] Tasks: ['Code'] | Lens: [40052] → Tgt Spa: ['1.000'] [Step 252 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25898, 25898] → Tgt Spa: ['1.000', '1.000'] [Step 252 / Rank 3] Tasks: ['Code'] | Lens: [40052] → Tgt Spa: ['1.000'] [Step 252 / Rank 1] Tasks: ['Single QA'] | Lens: [50121] → Tgt Spa: ['0.350'] [Step 252 / Rank 6] Tasks: ['Single QA'] | Lens: [36900] → Tgt Spa: ['0.350'] [Step 252 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [23086, 23086] → Tgt Spa: ['0.350', '0.350'] [Step 252 / Rank 5] Tasks: ['Single QA'] | Lens: [52505] → Tgt Spa: ['0.350'] [Step 252 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [23086, 23086] → Tgt Spa: ['0.350', '0.350'] [Step 252 / Rank 7] Tasks: ['Summarization'] | Lens: [39122] → Tgt Spa: ['1.000'] [Step 252 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [24788, 24805] → Tgt Spa: ['0.350', '1.000'] [Step 252 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [24788, 24805] → Tgt Spa: ['0.350', '1.000'] [Step 252 / Rank 6] Tasks: ['Summarization'] | Lens: [39122] → Tgt Spa: ['1.000'] [Step 252 / Rank 4] Tasks: ['Single QA'] | Lens: [52505] → Tgt Spa: ['0.350'] [Step 252 / Rank 1] Tasks: ['Code'] | Lens: [33845] → Tgt Spa: ['1.000'] [Step 252 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7998, 7998, 7999, 7999, 7999, 7998, 7999, 7999] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 252 / Rank 3] Tasks: ['Code'] | Lens: [52051] → Tgt Spa: ['1.000'] [Step 252 / Rank 2] Tasks: ['Code'] | Lens: [52051] → Tgt Spa: ['1.000'] [Step 252 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7998, 7998, 7999, 7999, 7999, 7998, 7999, 7999] → Tgt Spa: ['0.350', 
'0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 252 / Rank 0] Tasks: ['Code'] | Lens: [33845] → Tgt Spa: ['1.000'] [Step 252 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [59715] → Tgt Spa: ['1.000'] [Step 252 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [59715] → Tgt Spa: ['1.000'] [Step 252 / Rank 4] Tasks: ['Single QA'] | Lens: [41254] → Tgt Spa: ['0.350'] [Step 252 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24693, 24694] → Tgt Spa: ['1.000', '1.000'] [Step 252 / Rank 2] Tasks: ['Single QA'] | Lens: [44254] → Tgt Spa: ['0.350'] [Step 252 / Rank 6] Tasks: ['Single QA'] | Lens: [65362] → Tgt Spa: ['0.350'] [Step 252 / Rank 5] Tasks: ['Single QA'] | Lens: [41254] → Tgt Spa: ['0.350'] [Step 252 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24693, 24694] → Tgt Spa: ['1.000', '1.000'] [Step 252 / Rank 7] Tasks: ['Single QA'] | Lens: [65362] → Tgt Spa: ['0.350'] [Step 252 / Rank 3] Tasks: ['Single QA'] | Lens: [44254] → Tgt Spa: ['0.350'] [Step 252 / Rank 4] Tasks: ['Code'] | Lens: [59198] → Tgt Spa: ['1.000'] [Step 252 / Rank 5] Tasks: ['Code'] | Lens: [59198] → Tgt Spa: ['1.000'] [Step 252 / Rank 1] Tasks: ['Code'] | Lens: [62487] → Tgt Spa: ['1.000'] [Step 252 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32660, 32662] → Tgt Spa: ['0.350', '0.350'] [Step 252 / Rank 0] Tasks: ['Code'] | Lens: [62487] → Tgt Spa: ['1.000'] [Step 252 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [31720, 31730] → Tgt Spa: ['1.000', '1.000'] [Step 252 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [31720, 31730] → Tgt Spa: ['1.000', '1.000'] [Step 252 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32660, 32662] → Tgt Spa: ['0.350', '0.350'] [Step 252 / Rank 3] Tasks: ['Single QA'] | Lens: [46451] → Tgt Spa: ['0.350'] [Step 252 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [27152, 27153] → Tgt Spa: ['0.350', '0.350'] [Step 252 / Rank 2] Tasks: ['Single QA'] | Lens: 
[46451] → Tgt Spa: ['0.350'] [Step 252 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43870] → Tgt Spa: ['1.000'] [Step 252 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [27152, 27153] → Tgt Spa: ['0.350', '0.350'] [Step 252 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43870] → Tgt Spa: ['1.000'] [Step 252 / Rank 5] Tasks: ['Single QA'] | Lens: [49167] → Tgt Spa: ['0.350'] [Step 252 / Rank 4] Tasks: ['Single QA'] | Lens: [49167] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:11:42,157 >> @ 252 | Loss: 1.9378 | LM: 1.8756 | Reg: 0.0621 | Spa(Avg): 0.545 [INFO|lh_trainer.py:797] 2026-02-17 06:11:42,158 >> Statistic -> Code | Spa: 0.701 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 06:11:42,158 >> Statistic -> In-Context | Spa: 0.720 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:11:42,158 >> Statistic -> MultiHop | Spa: 0.599 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:11:42,158 >> Statistic -> Single | Spa: 0.417 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:11:42,158 >> Statistic -> Summarization | Spa: 0.715 | Tgt: 1.000 | Z-Loss: 0.086 | [INFO|lh_trainer.py:810] 2026-02-17 06:11:42,160 >> [Micro-Log] {"loss": 1.9377696874241035, "lm_loss": 1.8756384005149205, "reg_loss": 0.06213125765013198, "model_sparsity(avg)": 0.545138880610466, "Spa-Single QA sparsity": 0.417270523050557, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04302232100061425, "Spa-Code sparsity": 0.7013888756434122, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09910612056652705, "Spa-In-Context Learning sparsity": 0.7202380895614624, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10593200262103762, "Spa-Summarization sparsity": 0.7152777910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08564663678407669, "Spa-MultiHop QA sparsity": 0.5992063454219273, 
"Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10573562926479749, "step": 252, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:11:59,596 >> {'loss': 11.6266, 'grad_norm': 0.5774930119514465, 'learning_rate': 4.774579663168803e-05, 'epoch': 0.2664560294892048, 'num_input_tokens_seen': 622075956, 'completed': '84.33% (253 / 300)', 'remaining time': '2:11:56', 'throughput': '7147.98', 'gpu_mem_free': '11273MB', 'step': 253} [Step 253 / Rank 1] Tasks: ['In-Context Learning', 'Single QA', 'Single QA'] | Lens: [21843, 21844, 21844] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 253 / Rank 5] Tasks: ['Single QA'] | Lens: [39651] → Tgt Spa: ['0.350'] [Step 253 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20246, 20237, 20235] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 253 / Rank 4] Tasks: ['Single QA'] | Lens: [39651] → Tgt Spa: ['0.350'] [Step 253 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA', 'Code'] | Lens: [14113, 14121, 14114, 14124] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000'] [Step 253 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA', 'Code'] | Lens: [14113, 14121, 14114, 14124] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000'] [Step 253 / Rank 0] Tasks: ['In-Context Learning', 'Single QA', 'Single QA'] | Lens: [21843, 21844, 21844] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 253 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [20246, 20237, 20235] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 253 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [29715, 29715] → Tgt Spa: ['0.350', '0.350'] [Step 253 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [32971] → Tgt Spa: ['1.000'] [Step 253 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [22158, 22151] → Tgt Spa: ['1.000', '1.000'] [Step 253 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 
'Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6341, 6342, 6342, 6342, 6349, 6349, 6342, 6343, 6343, 6343] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 253 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [22158, 22151] → Tgt Spa: ['1.000', '1.000'] [Step 253 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [32971] → Tgt Spa: ['1.000'] [Step 253 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA'] | Lens: [6341, 6342, 6342, 6342, 6349, 6349, 6342, 6343, 6343, 6343] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 253 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [29715, 29715] → Tgt Spa: ['0.350', '0.350'] [Step 253 / Rank 4] Tasks: ['Single QA'] | Lens: [49149] → Tgt Spa: ['0.350'] [Step 253 / Rank 6] Tasks: ['Code'] | Lens: [63250] → Tgt Spa: ['1.000'] [Step 253 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60416] → Tgt Spa: ['1.000'] [Step 253 / Rank 2] Tasks: ['Single QA'] | Lens: [54400] → Tgt Spa: ['0.350'] [Step 253 / Rank 5] Tasks: ['Single QA'] | Lens: [49149] → Tgt Spa: ['0.350'] [Step 253 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60416] → Tgt Spa: ['1.000'] [Step 253 / Rank 3] Tasks: ['Single QA'] | Lens: [54400] → Tgt Spa: ['0.350'] [Step 253 / Rank 7] Tasks: ['Code'] | Lens: [63250] → Tgt Spa: ['1.000'] [Step 253 / Rank 1] Tasks: ['Summarization'] | Lens: [53326] → Tgt Spa: ['1.000'] [Step 253 / Rank 2] Tasks: ['Single QA'] | Lens: [40517] → Tgt Spa: ['0.350'] [Step 253 / Rank 4] Tasks: ['Summarization', 'Single QA'] | Lens: [32238, 32220] → Tgt Spa: ['1.000', '0.350'] [Step 253 / Rank 0] Tasks: ['Summarization'] | Lens: [53326] → Tgt Spa: ['1.000'] [Step 253 / Rank 5] Tasks: ['Summarization', 'Single QA'] | Lens: [32238, 32220] → Tgt Spa: ['1.000', '0.350'] [Step 253 / Rank 6] 
Tasks: ['Code', 'In-Context Learning'] | Lens: [23188, 23181] → Tgt Spa: ['1.000', '1.000'] [Step 253 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [23188, 23181] → Tgt Spa: ['1.000', '1.000'] [Step 253 / Rank 3] Tasks: ['Single QA'] | Lens: [40517] → Tgt Spa: ['0.350'] [Step 253 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [38785] → Tgt Spa: ['1.000'] [Step 253 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [62743] → Tgt Spa: ['1.000'] [Step 253 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [38785] → Tgt Spa: ['1.000'] [Step 253 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [26251, 26251] → Tgt Spa: ['0.350', '0.350'] [Step 253 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [26251, 26251] → Tgt Spa: ['0.350', '0.350'] [Step 253 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [26246, 26241] → Tgt Spa: ['1.000', '1.000'] [Step 253 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [26246, 26241] → Tgt Spa: ['1.000', '1.000'] [Step 253 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [62743] → Tgt Spa: ['1.000'] [Step 253 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39013] → Tgt Spa: ['1.000'] [Step 253 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [31100, 31100] → Tgt Spa: ['0.350', '1.000'] [Step 253 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [31100, 31100] → Tgt Spa: ['0.350', '1.000'] [Step 253 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19301, 19291, 19305] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 253 / Rank 1] Tasks: ['Single QA'] | Lens: [45871] → Tgt Spa: ['0.350'] [Step 253 / Rank 0] Tasks: ['Single QA'] | Lens: [45871] → Tgt Spa: ['0.350'] [Step 253 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19301, 19291, 19305] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 253 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39013] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:14:17,126 >> @ 253 | Loss: 1.9788 | LM: 
1.9115 | Reg: 0.0673 | Spa(Avg): 0.563 [INFO|lh_trainer.py:797] 2026-02-17 06:14:17,127 >> Statistic -> Code | Spa: 0.713 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 06:14:17,127 >> Statistic -> In-Context | Spa: 0.714 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:14:17,127 >> Statistic -> MultiHop | Spa: 0.599 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:14:17,127 >> Statistic -> Single | Spa: 0.364 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:14:17,127 >> Statistic -> Summarization | Spa: 0.671 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:810] 2026-02-17 06:14:17,129 >> [Micro-Log] {"loss": 1.9788143162926037, "lm_loss": 1.9115203619003296, "reg_loss": 0.06729392402727778, "model_sparsity(avg)": 0.5630690592030684, "Spa-In-Context Learning sparsity": 0.7138888835906982, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1086446113884449, "Spa-Single QA sparsity": 0.3636363527991555, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.011059001135766845, "Spa-Summarization sparsity": 0.6712962885697683, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10625040406982104, "Spa-Code sparsity": 0.7133838480169122, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0942931974476034, "Spa-MultiHop QA sparsity": 0.5992063454219273, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10573562926479749, "step": 253, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:14:33,629 >> {'loss': 11.8729, 'grad_norm': 0.7033908367156982, 'learning_rate': 4.583965662561915e-05, 'epoch': 0.26750921537651395, 'num_input_tokens_seen': 624607758, 'completed': '84.67% (254 / 300)', 'remaining time': '2:09:05', 'throughput': 
'8218.38', 'gpu_mem_free': '11991MB', 'step': 254} [Step 254 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19160, 19162, 19151] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 254 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17081, 17080, 17091] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 254 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41581] → Tgt Spa: ['1.000'] [Step 254 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27634, 27656] → Tgt Spa: ['1.000', '1.000'] [Step 254 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41581] → Tgt Spa: ['1.000'] [Step 254 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17081, 17080, 17091] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 254 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27634, 27656] → Tgt Spa: ['1.000', '1.000'] [Step 254 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19160, 19162, 19151] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 254 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31617, 31618] → Tgt Spa: ['0.350', '0.350'] [Step 254 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23781, 23782] → Tgt Spa: ['1.000', '0.350'] [Step 254 / Rank 5] Tasks: ['Single QA'] | Lens: [63427] → Tgt Spa: ['0.350'] [Step 254 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [40739] → Tgt Spa: ['1.000'] [Step 254 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31617, 31618] → Tgt Spa: ['0.350', '0.350'] [Step 254 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23781, 23782] → Tgt Spa: ['1.000', '0.350'] [Step 254 / Rank 4] Tasks: ['Single QA'] | Lens: [63427] → Tgt Spa: ['0.350'] [Step 254 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [40739] → Tgt Spa: ['1.000'] [Step 254 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [56154] → Tgt Spa: ['1.000'] [Step 254 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [21328, 21341, 21332] → Tgt Spa: ['1.000', '1.000', 
'1.000'] [Step 254 / Rank 0] Tasks: ['Single QA'] | Lens: [50964] → Tgt Spa: ['0.350'] [Step 254 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25746, 25747] → Tgt Spa: ['1.000', '1.000'] [Step 254 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [21328, 21341, 21332] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 254 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [56154] → Tgt Spa: ['1.000'] [Step 254 / Rank 1] Tasks: ['Single QA'] | Lens: [50964] → Tgt Spa: ['0.350'] [Step 254 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25746, 25747] → Tgt Spa: ['1.000', '1.000'] [Step 254 / Rank 1] Tasks: ['Single QA'] | Lens: [44032] → Tgt Spa: ['0.350'] [Step 254 / Rank 0] Tasks: ['Single QA'] | Lens: [44032] → Tgt Spa: ['0.350'] [Step 254 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1439, 1439, 1458, 1440, 1439, 1440, 1440, 1440, 1440, 1440, 1460, 1459, 1460, 1460, 1444, 1442, 1442, 1461, 1461, 1444, 1443, 1444, 1444, 1444, 1444, 1445, 1444, 1463, 1463, 1445, 1446, 1445, 1445, 1464, 1446, 1446, 1446, 1446, 1447, 1449, 1448, 1448, 1448, 1449, 1449] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', 
'0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 254 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23090, 23090] → Tgt Spa: ['1.000', '0.350'] [Step 254 / Rank 4] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [19627, 19646, 19648] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 254 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23090, 23090] → Tgt Spa: ['1.000', '0.350'] [Step 254 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [1439, 1439, 1458, 1440, 1439, 1440, 1440, 1440, 1440, 1440, 1460, 1459, 1460, 1460, 1444, 1442, 1442, 1461, 1461, 1444, 1443, 1444, 1444, 1444, 1444, 1445, 1444, 1463, 1463, 1445, 1446, 1445, 1445, 1464, 1446, 1446, 1446, 1446, 1447, 1449, 1448, 1448, 1448, 1449, 1449] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', 
'0.350'] [Step 254 / Rank 5] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [19627, 19646, 19648] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 254 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57957] → Tgt Spa: ['1.000'] [Step 254 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [47352] → Tgt Spa: ['1.000'] [Step 254 / Rank 5] Tasks: ['Single QA'] | Lens: [56498] → Tgt Spa: ['0.350'] [Step 254 / Rank 1] Tasks: ['Single QA'] | Lens: [58395] → Tgt Spa: ['0.350'] [Step 254 / Rank 4] Tasks: ['Single QA'] | Lens: [56498] → Tgt Spa: ['0.350'] [Step 254 / Rank 0] Tasks: ['Single QA'] | Lens: [58395] → Tgt Spa: ['0.350'] [Step 254 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [47352] → Tgt Spa: ['1.000'] [Step 254 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57957] → Tgt Spa: ['1.000'] [Step 254 / Rank 5] Tasks: ['Single QA'] | Lens: [38498] → Tgt Spa: ['0.350'] [Step 254 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [22659, 22649] → Tgt Spa: ['1.000', '1.000'] [Step 254 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13971, 13971, 13972, 13977] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 254 / Rank 4] Tasks: ['Single QA'] | Lens: [38498] → Tgt Spa: ['0.350'] [Step 254 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13971, 13971, 13972, 13977] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 254 / Rank 1] Tasks: ['In-Context Learning', 'Summarization', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Code'] | Lens: [5910, 5929, 5914, 5919, 5913, 5914, 5915, 5921, 5914, 5915, 5921] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 254 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [22659, 22649] → Tgt Spa: ['1.000', '1.000'] [Step 254 / Rank 0] Tasks: ['In-Context Learning', 'Summarization', 'Single QA', 'Code', 'Single QA', 'In-Context 
Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Code'] | Lens: [5910, 5929, 5914, 5919, 5913, 5914, 5915, 5921, 5914, 5915, 5921] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:16:49,530 >> @ 254 | Loss: 2.1965 | LM: 2.1269 | Reg: 0.0696 | Spa(Avg): 0.559 [INFO|lh_trainer.py:797] 2026-02-17 06:16:49,531 >> Statistic -> Code | Spa: 0.699 | Tgt: 1.000 | Z-Loss: 0.101 | [INFO|lh_trainer.py:797] 2026-02-17 06:16:49,531 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:16:49,531 >> Statistic -> MultiHop | Spa: 0.558 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:16:49,531 >> Statistic -> Single | Spa: 0.393 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:16:49,531 >> Statistic -> Summarization | Spa: 0.640 | Tgt: 1.000 | Z-Loss: 0.124 | [INFO|lh_trainer.py:810] 2026-02-17 06:16:49,533 >> [Micro-Log] {"loss": 2.196480913708607, "lm_loss": 2.1268817111849785, "reg_loss": 0.06959919176491287, "model_sparsity(avg)": 0.559197003642718, "Spa-In-Context Learning sparsity": 0.7104700803756714, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11005586443039087, "Spa-Summarization sparsity": 0.6396198806009794, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12405484718711753, "Spa-Single QA sparsity": 0.392746905485789, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.030377381791671116, "Spa-Code sparsity": 0.6986111044883728, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1006002351641655, "Spa-MultiHop QA sparsity": 0.5579365083149501, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.08511358558067254, "step": 254, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 
Summarization": 0.1650390625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:17:01,584 >> {'loss': 13.1789, 'grad_norm': 0.7179914712905884, 'learning_rate': 4.396849844765079e-05, 'epoch': 0.2685624012638231, 'num_input_tokens_seen': 627172634, 'completed': '85.00% (255 / 300)', 'remaining time': '2:06:13', 'throughput': '8667.74', 'gpu_mem_free': '6943MB', 'step': 255} [Step 255 / Rank 4] Tasks: ['Single QA'] | Lens: [33495] → Tgt Spa: ['0.350'] [Step 255 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18163, 18175, 18163] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 255 / Rank 5] Tasks: ['Single QA'] | Lens: [33495] → Tgt Spa: ['0.350'] [Step 255 / Rank 0] Tasks: ['Single QA'] | Lens: [46333] → Tgt Spa: ['0.350'] [Step 255 / Rank 1] Tasks: ['Single QA'] | Lens: [46333] → Tgt Spa: ['0.350'] [Step 255 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [60326] → Tgt Spa: ['1.000'] [Step 255 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [60326] → Tgt Spa: ['1.000'] [Step 255 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18163, 18175, 18163] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 255 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [25527, 25527] → Tgt Spa: ['0.350', '0.350'] [Step 255 / Rank 3] Tasks: ['Single QA'] | Lens: [52921] → Tgt Spa: ['0.350'] [Step 255 / Rank 6] Tasks: ['Code', 'Single QA', 'Code', 'Code'] | Lens: [13818, 13813, 13826, 13841] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 255 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [25527, 25527] → Tgt Spa: ['0.350', '0.350'] [Step 255 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25633, 25634] → Tgt Spa: ['0.350', '0.350'] [Step 255 / Rank 7] Tasks: ['Code', 'Single QA', 'Code', 'Code'] | Lens: [13818, 13813, 13826, 13841] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000'] [Step 255 / Rank 2] Tasks: ['Single QA'] | Lens: [52921] → Tgt Spa: ['0.350'] [Step 255 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [25633, 25634] → 
Tgt Spa: ['0.350', '0.350'] [Step 255 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'Single QA', 'Summarization', 'Summarization', 'Code', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA'] | Lens: [2410, 2410, 2409, 2393, 2411, 2411, 2391, 2394, 2394, 2412, 2394, 2394, 2412, 2412, 2400, 2414, 2414, 2413, 2396, 2414, 2396, 2400, 2398, 2398, 2399, 2417, 2400] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 255 / Rank 7] Tasks: ['Single QA'] | Lens: [57349] → Tgt Spa: ['0.350'] [Step 255 / Rank 0] Tasks: ['In-Context Learning', 'Code', 'Code'] | Lens: [19483, 19492, 19494] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 255 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'Single QA', 'Summarization', 'Summarization', 'Code', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA'] | Lens: [2410, 2410, 2409, 2393, 2411, 2411, 2391, 2394, 2394, 2412, 2394, 2394, 2412, 2412, 2400, 2414, 2414, 2413, 2396, 2414, 2396, 2400, 2398, 2398, 2399, 2417, 2400] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 255 
/ Rank 6] Tasks: ['Single QA'] | Lens: [57349] → Tgt Spa: ['0.350'] [Step 255 / Rank 3] Tasks: ['Single QA'] | Lens: [44071] → Tgt Spa: ['0.350'] [Step 255 / Rank 2] Tasks: ['Single QA'] | Lens: [44071] → Tgt Spa: ['0.350'] [Step 255 / Rank 1] Tasks: ['In-Context Learning', 'Code', 'Code'] | Lens: [19483, 19492, 19494] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 255 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [36356] → Tgt Spa: ['1.000'] [Step 255 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [30620, 30620] → Tgt Spa: ['0.350', '0.350'] [Step 255 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [30620, 30620] → Tgt Spa: ['0.350', '0.350'] [Step 255 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [36356] → Tgt Spa: ['1.000'] [Step 255 / Rank 3] Tasks: ['Single QA'] | Lens: [42748] → Tgt Spa: ['0.350'] [Step 255 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58679] → Tgt Spa: ['1.000'] [Step 255 / Rank 2] Tasks: ['Single QA'] | Lens: [42748] → Tgt Spa: ['0.350'] [Step 255 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58679] → Tgt Spa: ['1.000'] [Step 255 / Rank 2] Tasks: ['Single QA'] | Lens: [64543] → Tgt Spa: ['0.350'] [Step 255 / Rank 7] Tasks: ['Single QA'] | Lens: [53367] → Tgt Spa: ['0.350'] [Step 255 / Rank 4] Tasks: ['Single QA'] | Lens: [61559] → Tgt Spa: ['0.350'] [Step 255 / Rank 3] Tasks: ['Single QA'] | Lens: [64543] → Tgt Spa: ['0.350'] [Step 255 / Rank 0] Tasks: ['Single QA'] | Lens: [36164] → Tgt Spa: ['0.350'] [Step 255 / Rank 6] Tasks: ['Single QA'] | Lens: [53367] → Tgt Spa: ['0.350'] [Step 255 / Rank 5] Tasks: ['Single QA'] | Lens: [61559] → Tgt Spa: ['0.350'] [Step 255 / Rank 1] Tasks: ['Single QA'] | Lens: [36164] → Tgt Spa: ['0.350'] [Step 255 / Rank 3] Tasks: ['Single QA'] | Lens: [65026] → Tgt Spa: ['0.350'] [Step 255 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58299] → Tgt Spa: ['1.000'] [Step 255 / Rank 4] Tasks: ['Single QA'] | Lens: [48542] → Tgt Spa: ['0.350'] [Step 255 / Rank 2] Tasks: ['Single QA'] | Lens: 
[65026] → Tgt Spa: ['0.350'] [Step 255 / Rank 7] Tasks: ['Summarization'] | Lens: [42626] → Tgt Spa: ['1.000'] [Step 255 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58299] → Tgt Spa: ['1.000'] [Step 255 / Rank 5] Tasks: ['Single QA'] | Lens: [48542] → Tgt Spa: ['0.350'] [Step 255 / Rank 6] Tasks: ['Summarization'] | Lens: [42626] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:19:37,951 >> @ 255 | Loss: 2.1399 | LM: 2.0939 | Reg: 0.0460 | Spa(Avg): 0.477 [INFO|lh_trainer.py:797] 2026-02-17 06:19:37,951 >> Statistic -> Code | Spa: 0.679 | Tgt: 1.000 | Z-Loss: 0.109 | [INFO|lh_trainer.py:797] 2026-02-17 06:19:37,951 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:19:37,951 >> Statistic -> MultiHop | Spa: 0.638 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:19:37,951 >> Statistic -> Single | Spa: 0.368 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:19:37,951 >> Statistic -> Summarization | Spa: 0.593 | Tgt: 1.000 | Z-Loss: 0.150 | [INFO|lh_trainer.py:810] 2026-02-17 06:19:37,953 >> [Micro-Log] {"loss": 2.1399205699563026, "lm_loss": 2.0939476769417524, "reg_loss": 0.045972878331667744, "model_sparsity(avg)": 0.47651426990826923, "Spa-Single QA sparsity": 0.36772485574086505, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.014987584646968614, "Spa-In-Context Learning sparsity": 0.7194444417953492, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1062643900513649, "Spa-Code sparsity": 0.6788194477558136, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.1090592946857214, "Spa-Summarization sparsity": 0.5925925890604655, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1504404589533806, "Spa-MultiHop QA sparsity": 0.6376262578097257, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1244383684613488, "step": 255, 
"current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.1650390625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:20:04,810 >> {'loss': 12.8395, 'grad_norm': 0.41594576835632324, 'learning_rate': 4.213264271110397e-05, 'epoch': 0.26961558715113215, 'num_input_tokens_seen': 629690912, 'completed': '85.33% (256 / 300)', 'remaining time': '2:03:27', 'throughput': '6872.09', 'gpu_mem_free': '7015MB', 'step': 256} [Step 256 / Rank 7] Tasks: ['Single QA'] | Lens: [57326] → Tgt Spa: ['0.350'] [Step 256 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41608] → Tgt Spa: ['1.000'] [Step 256 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [31103, 31104] → Tgt Spa: ['0.350', '0.350'] [Step 256 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41608] → Tgt Spa: ['1.000'] [Step 256 / Rank 6] Tasks: ['Single QA'] | Lens: [57326] → Tgt Spa: ['0.350'] [Step 256 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26566, 26567] → Tgt Spa: ['1.000', '0.350'] [Step 256 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [31103, 31104] → Tgt Spa: ['0.350', '0.350'] [Step 256 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [26566, 26567] → Tgt Spa: ['1.000', '0.350'] [Step 256 / Rank 5] Tasks: ['Code'] | Lens: [36354] → Tgt Spa: ['1.000'] [Step 256 / Rank 1] Tasks: ['Code', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code', 'Code', 'Code', 'MultiHop QA'] | Lens: [3432, 3445, 3427, 3427, 3446, 3447, 3429, 3430, 3432, 3436, 3437, 3430, 3431, 3431, 3432, 3439, 3438, 3440, 3433] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 256 / Rank 3] Tasks: ['Single QA'] | Lens: 
[48297] → Tgt Spa: ['0.350'] [Step 256 / Rank 0] Tasks: ['Code', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code', 'Code', 'Single QA', 'In-Context Learning', 'Single QA', 'MultiHop QA', 'Code', 'Code', 'Code', 'MultiHop QA'] | Lens: [3432, 3445, 3427, 3427, 3446, 3447, 3429, 3430, 3432, 3436, 3437, 3430, 3431, 3431, 3432, 3439, 3438, 3440, 3433] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 256 / Rank 7] Tasks: ['Code'] | Lens: [61979] → Tgt Spa: ['1.000'] [Step 256 / Rank 2] Tasks: ['Single QA'] | Lens: [48297] → Tgt Spa: ['0.350'] [Step 256 / Rank 4] Tasks: ['Code'] | Lens: [36354] → Tgt Spa: ['1.000'] [Step 256 / Rank 6] Tasks: ['Code'] | Lens: [61979] → Tgt Spa: ['1.000'] [Step 256 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [25342, 25344] → Tgt Spa: ['0.350', '0.350'] [Step 256 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [54176] → Tgt Spa: ['1.000'] [Step 256 / Rank 3] Tasks: ['Single QA'] | Lens: [39288] → Tgt Spa: ['0.350'] [Step 256 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32503, 32503] → Tgt Spa: ['0.350', '0.350'] [Step 256 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25342, 25344] → Tgt Spa: ['0.350', '0.350'] [Step 256 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32503, 32503] → Tgt Spa: ['0.350', '0.350'] [Step 256 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [54176] → Tgt Spa: ['1.000'] [Step 256 / Rank 2] Tasks: ['Single QA'] | Lens: [39288] → Tgt Spa: ['0.350'] [Step 256 / Rank 7] Tasks: ['Single QA'] | Lens: [52996] → Tgt Spa: ['0.350'] [Step 256 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27216, 27216] → Tgt Spa: ['1.000', '1.000'] [Step 256 / Rank 0] Tasks: ['Single QA'] | Lens: [57506] → Tgt Spa: ['0.350'] [Step 256 / Rank 5] 
Tasks: ['In-Context Learning'] | Lens: [52733] → Tgt Spa: ['1.000'] [Step 256 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [52733] → Tgt Spa: ['1.000'] [Step 256 / Rank 6] Tasks: ['Single QA'] | Lens: [52996] → Tgt Spa: ['0.350'] [Step 256 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27216, 27216] → Tgt Spa: ['1.000', '1.000'] [Step 256 / Rank 1] Tasks: ['Single QA'] | Lens: [57506] → Tgt Spa: ['0.350'] [Step 256 / Rank 3] Tasks: ['Single QA'] | Lens: [51083] → Tgt Spa: ['0.350'] [Step 256 / Rank 1] Tasks: ['Single QA'] | Lens: [53031] → Tgt Spa: ['0.350'] [Step 256 / Rank 6] Tasks: ['Single QA'] | Lens: [61212] → Tgt Spa: ['0.350'] [Step 256 / Rank 0] Tasks: ['Single QA'] | Lens: [53031] → Tgt Spa: ['0.350'] [Step 256 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [45321] → Tgt Spa: ['1.000'] [Step 256 / Rank 7] Tasks: ['Single QA'] | Lens: [61212] → Tgt Spa: ['0.350'] [Step 256 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [45321] → Tgt Spa: ['1.000'] [Step 256 / Rank 2] Tasks: ['Single QA'] | Lens: [51083] → Tgt Spa: ['0.350'] [Step 256 / Rank 4] Tasks: ['Single QA'] | Lens: [56575] → Tgt Spa: ['0.350'] [Step 256 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24199, 24200] → Tgt Spa: ['1.000', '0.350'] [Step 256 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [24199, 24200] → Tgt Spa: ['1.000', '0.350'] [Step 256 / Rank 0] Tasks: ['Code', 'Summarization'] | Lens: [24710, 24724] → Tgt Spa: ['1.000', '1.000'] [Step 256 / Rank 1] Tasks: ['Code', 'Summarization'] | Lens: [24710, 24724] → Tgt Spa: ['1.000', '1.000'] [Step 256 / Rank 5] Tasks: ['Single QA'] | Lens: [56575] → Tgt Spa: ['0.350'] [Step 256 / Rank 7] Tasks: ['Single QA'] | Lens: [64205] → Tgt Spa: ['0.350'] [Step 256 / Rank 6] Tasks: ['Single QA'] | Lens: [64205] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:22:39,072 >> @ 256 | Loss: 2.1480 | LM: 2.0944 | Reg: 0.0536 | Spa(Avg): 0.503 [INFO|lh_trainer.py:797] 2026-02-17 
06:22:39,072 >> Statistic -> Code | Spa: 0.704 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 06:22:39,072 >> Statistic -> In-Context | Spa: 0.708 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:22:39,072 >> Statistic -> MultiHop | Spa: 0.616 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:22:39,072 >> Statistic -> Single | Spa: 0.394 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:22:39,072 >> Statistic -> Summarization | Spa: 0.635 | Tgt: 1.000 | Z-Loss: 0.126 | [INFO|lh_trainer.py:810] 2026-02-17 06:22:39,075 >> [Micro-Log] {"loss": 2.1479777719359845, "lm_loss": 2.0944072813532935, "reg_loss": 0.05357047989673447, "model_sparsity(avg)": 0.5029392006496588, "Spa-In-Context Learning sparsity": 0.7083333283662796, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1110025414576133, "Spa-Single QA sparsity": 0.39351851315725417, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02910165108429889, "Spa-Code sparsity": 0.7037037081188626, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09813146624300215, "Spa-Summarization sparsity": 0.6354166716337204, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12637102603912354, "Spa-MultiHop QA sparsity": 0.6157407363255819, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11368054151535034, "step": 256, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:23:05,543 >> {'loss': 12.8879, 'grad_norm': 0.5446677207946777, 'learning_rate': 4.0332403980408214e-05, 'epoch': 0.2706687730384413, 'num_input_tokens_seen': 632255410, 'completed': '85.67% (257 / 300)', 'remaining time': '2:00:41', 'throughput': '7094.68', 'gpu_mem_free': '10431MB', 'step': 257} [Step 257 / Rank 0] Tasks: 
['Single QA'] | Lens: [36405] → Tgt Spa: ['0.350'] [Step 257 / Rank 5] Tasks: ['Code'] | Lens: [34663] → Tgt Spa: ['1.000'] [Step 257 / Rank 6] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [19220, 19239, 19231] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 257 / Rank 2] Tasks: ['Single QA'] | Lens: [36463] → Tgt Spa: ['0.350'] [Step 257 / Rank 3] Tasks: ['Single QA'] | Lens: [36463] → Tgt Spa: ['0.350'] [Step 257 / Rank 7] Tasks: ['In-Context Learning', 'Summarization', 'Code'] | Lens: [19220, 19239, 19231] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 257 / Rank 1] Tasks: ['Single QA'] | Lens: [36405] → Tgt Spa: ['0.350'] [Step 257 / Rank 4] Tasks: ['Code'] | Lens: [34663] → Tgt Spa: ['1.000'] [Step 257 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [45446] → Tgt Spa: ['1.000'] [Step 257 / Rank 7] Tasks: ['Single QA'] | Lens: [37351] → Tgt Spa: ['0.350'] [Step 257 / Rank 6] Tasks: ['Single QA'] | Lens: [37351] → Tgt Spa: ['0.350'] [Step 257 / Rank 4] Tasks: ['Single QA'] | Lens: [55731] → Tgt Spa: ['0.350'] [Step 257 / Rank 2] Tasks: ['Code', 'Single QA'] | Lens: [22370, 22362] → Tgt Spa: ['1.000', '0.350'] [Step 257 / Rank 3] Tasks: ['Code', 'Single QA'] | Lens: [22370, 22362] → Tgt Spa: ['1.000', '0.350'] [Step 257 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [45446] → Tgt Spa: ['1.000'] [Step 257 / Rank 5] Tasks: ['Single QA'] | Lens: [55731] → Tgt Spa: ['0.350'] [Step 257 / Rank 4] Tasks: ['Code'] | Lens: [58318] → Tgt Spa: ['1.000'] [Step 257 / Rank 6] Tasks: ['Code'] | Lens: [36841] → Tgt Spa: ['1.000'] [Step 257 / Rank 5] Tasks: ['Code'] | Lens: [58318] → Tgt Spa: ['1.000'] [Step 257 / Rank 7] Tasks: ['Code'] | Lens: [36841] → Tgt Spa: ['1.000'] [Step 257 / Rank 3] Tasks: ['Single QA'] | Lens: [34727] → Tgt Spa: ['0.350'] [Step 257 / Rank 2] Tasks: ['Single QA'] | Lens: [34727] → Tgt Spa: ['0.350'] [Step 257 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [61980] → Tgt Spa: ['1.000'] [Step 257 / Rank 1] Tasks: ['In-Context Learning'] 
| Lens: [61980] → Tgt Spa: ['1.000'] [Step 257 / Rank 7] Tasks: ['Single QA'] | Lens: [56579] → Tgt Spa: ['0.350'] [Step 257 / Rank 1] Tasks: ['Single QA'] | Lens: [39192] → Tgt Spa: ['0.350'] [Step 257 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [54941] → Tgt Spa: ['1.000'] [Step 257 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [54941] → Tgt Spa: ['1.000'] [Step 257 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [28520, 28522] → Tgt Spa: ['1.000', '1.000'] [Step 257 / Rank 6] Tasks: ['Single QA'] | Lens: [56579] → Tgt Spa: ['0.350'] [Step 257 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [28520, 28522] → Tgt Spa: ['1.000', '1.000'] [Step 257 / Rank 0] Tasks: ['Single QA'] | Lens: [39192] → Tgt Spa: ['0.350'] [Step 257 / Rank 5] Tasks: ['Single QA'] | Lens: [58748] → Tgt Spa: ['0.350'] [Step 257 / Rank 6] Tasks: ['Code'] | Lens: [36044] → Tgt Spa: ['1.000'] [Step 257 / Rank 4] Tasks: ['Single QA'] | Lens: [58748] → Tgt Spa: ['0.350'] [Step 257 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [23569, 23578] → Tgt Spa: ['0.350', '1.000'] [Step 257 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [20984, 20984, 20984] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 257 / Rank 1] Tasks: ['Single QA', 'Code'] | Lens: [23569, 23578] → Tgt Spa: ['0.350', '1.000'] [Step 257 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [20984, 20984, 20984] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 257 / Rank 7] Tasks: ['Code'] | Lens: [36044] → Tgt Spa: ['1.000'] [Step 257 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15898, 15898, 15898, 15898] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 257 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15898, 15898, 15898, 15898] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 257 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [24921, 24931] → Tgt Spa: ['1.000', '1.000'] [Step 257 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [30946, 30946] → Tgt Spa: 
['0.350', '0.350'] [Step 257 / Rank 0] Tasks: ['Single QA'] | Lens: [35239] → Tgt Spa: ['0.350'] [Step 257 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [30946, 30946] → Tgt Spa: ['0.350', '0.350'] [Step 257 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [24921, 24931] → Tgt Spa: ['1.000', '1.000'] [Step 257 / Rank 1] Tasks: ['Single QA'] | Lens: [35239] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:25:25,631 >> @ 257 | Loss: 1.9984 | LM: 1.9388 | Reg: 0.0597 | Spa(Avg): 0.538 [INFO|lh_trainer.py:797] 2026-02-17 06:25:25,631 >> Statistic -> Code | Spa: 0.706 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 06:25:25,631 >> Statistic -> In-Context | Spa: 0.714 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:25:25,631 >> Statistic -> MultiHop | Spa: 0.616 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:25:25,631 >> Statistic -> Single | Spa: 0.394 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:25:25,631 >> Statistic -> Summarization | Spa: 0.653 | Tgt: 1.000 | Z-Loss: 0.115 | [INFO|lh_trainer.py:810] 2026-02-17 06:25:25,633 >> [Micro-Log] {"loss": 1.9984395143886406, "lm_loss": 1.9387828561787803, "reg_loss": 0.05965664830970733, "model_sparsity(avg)": 0.5381944378217062, "Spa-Single QA sparsity": 0.3937908411026001, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.032393798499148994, "Spa-In-Context Learning sparsity": 0.7138888835906982, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10860181301832199, "Spa-Code sparsity": 0.7061965740644015, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09717070769805175, "Spa-Summarization sparsity": 0.6527777910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11510024219751358, "Spa-MultiHop QA sparsity": 0.6157407363255819, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 
0.11368054151535034, "step": 257, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:25:42,327 >> {'loss': 11.9906, 'grad_norm': 0.6709144711494446, 'learning_rate': 3.856809071720225e-05, 'epoch': 0.2717219589257504, 'num_input_tokens_seen': 634582544, 'completed': '86.00% (258 / 300)', 'remaining time': '1:57:50', 'throughput': '7421.48', 'gpu_mem_free': '14017MB', 'step': 258} [Step 258 / Rank 5] Tasks: ['Code'] | Lens: [38383] → Tgt Spa: ['1.000'] [Step 258 / Rank 6] Tasks: ['Single QA'] | Lens: [65029] → Tgt Spa: ['0.350'] [Step 258 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57129] → Tgt Spa: ['1.000'] [Step 258 / Rank 2] Tasks: ['Single QA'] | Lens: [51196] → Tgt Spa: ['0.350'] [Step 258 / Rank 4] Tasks: ['Code'] | Lens: [38383] → Tgt Spa: ['1.000'] [Step 258 / Rank 3] Tasks: ['Single QA'] | Lens: [51196] → Tgt Spa: ['0.350'] [Step 258 / Rank 7] Tasks: ['Single QA'] | Lens: [65029] → Tgt Spa: ['0.350'] [Step 258 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57129] → Tgt Spa: ['1.000'] [Step 258 / Rank 4] Tasks: ['Single QA'] | Lens: [52668] → Tgt Spa: ['0.350'] [Step 258 / Rank 1] Tasks: ['Single QA'] | Lens: [49995] → Tgt Spa: ['0.350'] [Step 258 / Rank 3] Tasks: ['Single QA'] | Lens: [57709] → Tgt Spa: ['0.350'] [Step 258 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [28686, 28683] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 5] Tasks: ['Single QA'] | Lens: [52668] → Tgt Spa: ['0.350'] [Step 258 / Rank 0] Tasks: ['Single QA'] | Lens: [49995] → Tgt Spa: ['0.350'] [Step 258 / Rank 2] Tasks: ['Single QA'] | Lens: [57709] → Tgt Spa: ['0.350'] [Step 258 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [28686, 28683] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 6] Tasks: ['Code', 'Summarization'] | Lens: [27790, 27801] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 4] Tasks: ['In-Context 
Learning', 'Code'] | Lens: [25465, 25474] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [31368, 31361] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [25465, 25474] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 3] Tasks: ['Single QA'] | Lens: [35367] → Tgt Spa: ['0.350'] [Step 258 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [31368, 31361] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 7] Tasks: ['Code', 'Summarization'] | Lens: [27790, 27801] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 2] Tasks: ['Single QA'] | Lens: [35367] → Tgt Spa: ['0.350'] [Step 258 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [38065] → Tgt Spa: ['1.000'] [Step 258 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [38065] → Tgt Spa: ['1.000'] [Step 258 / Rank 3] Tasks: ['Single QA'] | Lens: [57362] → Tgt Spa: ['0.350'] [Step 258 / Rank 6] Tasks: ['Single QA'] | Lens: [52812] → Tgt Spa: ['0.350'] [Step 258 / Rank 7] Tasks: ['Single QA'] | Lens: [52812] → Tgt Spa: ['0.350'] [Step 258 / Rank 2] Tasks: ['Single QA'] | Lens: [57362] → Tgt Spa: ['0.350'] [Step 258 / Rank 0] Tasks: ['Code'] | Lens: [34859] → Tgt Spa: ['1.000'] [Step 258 / Rank 1] Tasks: ['Code'] | Lens: [34859] → Tgt Spa: ['1.000'] [Step 258 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [59616] → Tgt Spa: ['1.000'] [Step 258 / Rank 7] Tasks: ['Single QA'] | Lens: [63925] → Tgt Spa: ['0.350'] [Step 258 / Rank 3] Tasks: ['Single QA', 'MultiHop QA', 'Single QA', 'Single QA'] | Lens: [16066, 16071, 16072, 16073] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 258 / Rank 2] Tasks: ['Single QA', 'MultiHop QA', 'Single QA', 'Single QA'] | Lens: [16066, 16071, 16072, 16073] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 258 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [59616] → Tgt Spa: ['1.000'] [Step 258 / Rank 6] Tasks: ['Single QA'] | Lens: [63925] → Tgt Spa: ['0.350'] [Step 258 / Rank 0] Tasks: ['In-Context 
Learning'] | Lens: [44549] → Tgt Spa: ['1.000'] [Step 258 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44549] → Tgt Spa: ['1.000'] [Step 258 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29679, 29679] → Tgt Spa: ['0.350', '0.350'] [Step 258 / Rank 6] Tasks: ['Code'] | Lens: [54021] → Tgt Spa: ['1.000'] [Step 258 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29679, 29679] → Tgt Spa: ['0.350', '0.350'] [Step 258 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18234, 18235, 18225] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 258 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22658, 22658] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [18234, 18235, 18225] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 258 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22658, 22658] → Tgt Spa: ['1.000', '1.000'] [Step 258 / Rank 7] Tasks: ['Code'] | Lens: [54021] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:28:18,347 >> @ 258 | Loss: 2.0823 | LM: 2.0189 | Reg: 0.0634 | Spa(Avg): 0.550 [INFO|lh_trainer.py:797] 2026-02-17 06:28:18,347 >> Statistic -> Code | Spa: 0.720 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 06:28:18,347 >> Statistic -> In-Context | Spa: 0.717 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:28:18,348 >> Statistic -> MultiHop | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:28:18,348 >> Statistic -> Single | Spa: 0.368 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:28:18,348 >> Statistic -> Summarization | Spa: 0.646 | Tgt: 1.000 | Z-Loss: 0.121 | [INFO|lh_trainer.py:810] 2026-02-17 06:28:18,350 >> [Micro-Log] {"loss": 2.08228167394797, "lm_loss": 2.0189304587741694, "reg_loss": 0.06335120003010768, "model_sparsity(avg)": 0.5501543171703815, "Spa-In-Context Learning sparsity": 0.7170138955116272, 
"Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10728232655674219, "Spa-Single QA sparsity": 0.36805554798671175, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.018326224110621427, "Spa-Summarization sparsity": 0.6458333432674408, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12089556641876698, "Spa-Code sparsity": 0.7204861044883728, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09150533378124237, "Spa-MultiHop QA sparsity": 0.4583333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03877846896648407, "step": 258, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:28:37,972 >> {'loss': 12.4937, 'grad_norm': 0.6252522468566895, 'learning_rate': 3.684000522748107e-05, 'epoch': 0.2727751448130595, 'num_input_tokens_seen': 637108470, 'completed': '86.33% (259 / 300)', 'remaining time': '1:55:03', 'throughput': '7190.43', 'gpu_mem_free': '9109MB', 'step': 259} [Step 259 / Rank 5] Tasks: ['Single QA'] | Lens: [38281] → Tgt Spa: ['0.350'] [Step 259 / Rank 6] Tasks: ['Single QA'] | Lens: [38644] → Tgt Spa: ['0.350'] [Step 259 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17036, 17048, 17048] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 259 / Rank 4] Tasks: ['Single QA'] | Lens: [38281] → Tgt Spa: ['0.350'] [Step 259 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54329] → Tgt Spa: ['1.000'] [Step 259 / Rank 7] Tasks: ['Single QA'] | Lens: [38644] → Tgt Spa: ['0.350'] [Step 259 / Rank 0] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17036, 17048, 17048] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 259 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54329] → Tgt Spa: ['1.000'] [Step 259 / Rank 5] Tasks: ['Single QA'] | Lens: [59432] → Tgt Spa: 
['0.350'] [Step 259 / Rank 7] Tasks: ['Single QA'] | Lens: [50870] → Tgt Spa: ['0.350'] [Step 259 / Rank 3] Tasks: ['Single QA'] | Lens: [51281] → Tgt Spa: ['0.350'] [Step 259 / Rank 6] Tasks: ['Single QA'] | Lens: [50870] → Tgt Spa: ['0.350'] [Step 259 / Rank 2] Tasks: ['Single QA'] | Lens: [51281] → Tgt Spa: ['0.350'] [Step 259 / Rank 0] Tasks: ['Code'] | Lens: [52961] → Tgt Spa: ['1.000'] [Step 259 / Rank 1] Tasks: ['Code'] | Lens: [52961] → Tgt Spa: ['1.000'] [Step 259 / Rank 4] Tasks: ['Single QA'] | Lens: [59432] → Tgt Spa: ['0.350'] [Step 259 / Rank 5] Tasks: ['Single QA'] | Lens: [57503] → Tgt Spa: ['0.350'] [Step 259 / Rank 4] Tasks: ['Single QA'] | Lens: [57503] → Tgt Spa: ['0.350'] [Step 259 / Rank 1] Tasks: ['Single QA'] | Lens: [43259] → Tgt Spa: ['0.350'] [Step 259 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Code', 'Single QA'] | Lens: [8427, 8430, 8438, 8441, 8432, 8443, 8436] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 259 / Rank 7] Tasks: ['Code'] | Lens: [36600] → Tgt Spa: ['1.000'] [Step 259 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'Code', 'Single QA', 'Code', 'Single QA'] | Lens: [8427, 8430, 8438, 8441, 8432, 8443, 8436] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 259 / Rank 0] Tasks: ['Single QA'] | Lens: [43259] → Tgt Spa: ['0.350'] [Step 259 / Rank 6] Tasks: ['Code'] | Lens: [36600] → Tgt Spa: ['1.000'] [Step 259 / Rank 5] Tasks: ['Single QA'] | Lens: [36024] → Tgt Spa: ['0.350'] [Step 259 / Rank 6] Tasks: ['Single QA', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code'] | Lens: [5740, 5739, 5739, 5748, 5741, 5744, 5763, 5745, 5746, 5766, 5754] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 259 / Rank 7] Tasks: ['Single QA', 'In-Context 
Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code'] | Lens: [5740, 5739, 5739, 5748, 5741, 5744, 5763, 5745, 5746, 5766, 5754] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 259 / Rank 4] Tasks: ['Single QA'] | Lens: [36024] → Tgt Spa: ['0.350'] [Step 259 / Rank 2] Tasks: ['Single QA'] | Lens: [36154] → Tgt Spa: ['0.350'] [Step 259 / Rank 1] Tasks: ['Single QA'] | Lens: [41381] → Tgt Spa: ['0.350'] [Step 259 / Rank 0] Tasks: ['Single QA'] | Lens: [41381] → Tgt Spa: ['0.350'] [Step 259 / Rank 3] Tasks: ['Single QA'] | Lens: [36154] → Tgt Spa: ['0.350'] [Step 259 / Rank 5] Tasks: ['Single QA'] | Lens: [38218] → Tgt Spa: ['0.350'] [Step 259 / Rank 6] Tasks: ['In-Context Learning', 'Summarization', 'Summarization'] | Lens: [16727, 16746, 16746] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 259 / Rank 7] Tasks: ['In-Context Learning', 'Summarization', 'Summarization'] | Lens: [16727, 16746, 16746] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 259 / Rank 0] Tasks: ['Single QA'] | Lens: [37110] → Tgt Spa: ['0.350'] [Step 259 / Rank 2] Tasks: ['Code'] | Lens: [41721] → Tgt Spa: ['1.000'] [Step 259 / Rank 3] Tasks: ['Code'] | Lens: [41721] → Tgt Spa: ['1.000'] [Step 259 / Rank 4] Tasks: ['Single QA'] | Lens: [38218] → Tgt Spa: ['0.350'] [Step 259 / Rank 1] Tasks: ['Single QA'] | Lens: [37110] → Tgt Spa: ['0.350'] [Step 259 / Rank 6] Tasks: ['Single QA'] | Lens: [52922] → Tgt Spa: ['0.350'] [Step 259 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25796, 25797] → Tgt Spa: ['1.000', '1.000'] [Step 259 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [30932, 30936] → Tgt Spa: ['1.000', '1.000'] [Step 259 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [30932, 30936] → Tgt Spa: ['1.000', '1.000'] [Step 259 / Rank 4] Tasks: 
['In-Context Learning', 'In-Context Learning'] | Lens: [25796, 25797] → Tgt Spa: ['1.000', '1.000'] [Step 259 / Rank 2] Tasks: ['Single QA'] | Lens: [55308] → Tgt Spa: ['0.350'] [Step 259 / Rank 7] Tasks: ['Single QA'] | Lens: [52922] → Tgt Spa: ['0.350'] [Step 259 / Rank 3] Tasks: ['Single QA'] | Lens: [55308] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:30:45,581 >> @ 259 | Loss: 2.0711 | LM: 2.0215 | Reg: 0.0496 | Spa(Avg): 0.495 [INFO|lh_trainer.py:797] 2026-02-17 06:30:45,581 >> Statistic -> Code | Spa: 0.673 | Tgt: 1.000 | Z-Loss: 0.111 | [INFO|lh_trainer.py:797] 2026-02-17 06:30:45,581 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:30:45,581 >> Statistic -> MultiHop | Spa: 0.458 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:30:45,581 >> Statistic -> Single | Spa: 0.415 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:30:45,582 >> Statistic -> Summarization | Spa: 0.648 | Tgt: 1.000 | Z-Loss: 0.120 | [INFO|lh_trainer.py:810] 2026-02-17 06:30:45,584 >> [Micro-Log] {"loss": 2.0710726181666055, "lm_loss": 2.0214637368917465, "reg_loss": 0.04960886742143581, "model_sparsity(avg)": 0.4951123297214508, "Spa-Code sparsity": 0.6728395091162788, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.11130400995413463, "Spa-Summarization sparsity": 0.6481481492519379, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11985244726141293, "Spa-Single QA sparsity": 0.4152777701616287, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04559546462260187, "Spa-In-Context Learning sparsity": 0.709595962004228, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11044252934780988, "Spa-MultiHop QA sparsity": 0.4583333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.03877846896648407, "step": 259, "current_tau": 1.0, "lambda1 
Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:31:06,237 >> {'loss': 12.4264, 'grad_norm': 0.46408626437187195, 'learning_rate': 3.514844360979712e-05, 'epoch': 0.2738283307003686, 'num_input_tokens_seen': 639426634, 'completed': '86.67% (260 / 300)', 'remaining time': '1:52:12', 'throughput': '7817.67', 'gpu_mem_free': '7009MB', 'step': 260} [Step 260 / Rank 4] Tasks: ['Single QA'] | Lens: [49460] → Tgt Spa: ['0.350'] [Step 260 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [23699, 23706] → Tgt Spa: ['1.000', '1.000'] [Step 260 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [23699, 23706] → Tgt Spa: ['1.000', '1.000'] [Step 260 / Rank 0] Tasks: ['Code', 'Code', 'Code'] | Lens: [18684, 18683, 18684] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 260 / Rank 3] Tasks: ['Single QA'] | Lens: [40127] → Tgt Spa: ['0.350'] [Step 260 / Rank 1] Tasks: ['Code', 'Code', 'Code'] | Lens: [18684, 18683, 18684] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 260 / Rank 5] Tasks: ['Single QA'] | Lens: [49460] → Tgt Spa: ['0.350'] [Step 260 / Rank 2] Tasks: ['Single QA'] | Lens: [40127] → Tgt Spa: ['0.350'] [Step 260 / Rank 4] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [11396, 11398, 11399, 11394, 11404] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000'] [Step 260 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26847, 26866] → Tgt Spa: ['1.000', '1.000'] [Step 260 / Rank 7] Tasks: ['Single QA'] | Lens: [54197] → Tgt Spa: ['0.350'] [Step 260 / Rank 5] Tasks: ['Code', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [11396, 11398, 11399, 11394, 11404] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000'] [Step 260 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [26847, 26866] → Tgt Spa: ['1.000', '1.000'] [Step 260 / Rank 0] Tasks: ['Single QA'] | Lens: [58405] → Tgt Spa: ['0.350'] [Step 260 / Rank 6] 
Tasks: ['Single QA'] | Lens: [54197] → Tgt Spa: ['0.350'] [Step 260 / Rank 1] Tasks: ['Single QA'] | Lens: [58405] → Tgt Spa: ['0.350'] [Step 260 / Rank 4] Tasks: ['Single QA'] | Lens: [47758] → Tgt Spa: ['0.350'] [Step 260 / Rank 6] Tasks: ['Single QA'] | Lens: [51387] → Tgt Spa: ['0.350'] [Step 260 / Rank 2] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16977, 16988, 16989] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 260 / Rank 3] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16977, 16988, 16989] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 260 / Rank 7] Tasks: ['Single QA'] | Lens: [51387] → Tgt Spa: ['0.350'] [Step 260 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59080] → Tgt Spa: ['1.000'] [Step 260 / Rank 5] Tasks: ['Single QA'] | Lens: [47758] → Tgt Spa: ['0.350'] [Step 260 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59080] → Tgt Spa: ['1.000'] [Step 260 / Rank 3] Tasks: ['Single QA'] | Lens: [50883] → Tgt Spa: ['0.350'] [Step 260 / Rank 5] Tasks: ['Single QA'] | Lens: [51757] → Tgt Spa: ['0.350'] [Step 260 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61208] → Tgt Spa: ['1.000'] [Step 260 / Rank 2] Tasks: ['Single QA'] | Lens: [50883] → Tgt Spa: ['0.350'] [Step 260 / Rank 1] Tasks: ['Summarization', 'Summarization'] | Lens: [24676, 24672] → Tgt Spa: ['1.000', '1.000'] [Step 260 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61208] → Tgt Spa: ['1.000'] [Step 260 / Rank 4] Tasks: ['Single QA'] | Lens: [51757] → Tgt Spa: ['0.350'] [Step 260 / Rank 0] Tasks: ['Summarization', 'Summarization'] | Lens: [24676, 24672] → Tgt Spa: ['1.000', '1.000'] [Step 260 / Rank 5] Tasks: ['Single QA'] | Lens: [36504] → Tgt Spa: ['0.350'] [Step 260 / Rank 1] Tasks: ['MultiHop QA', 'Single QA'] | Lens: [32112, 32113] → Tgt Spa: ['0.350', '0.350'] [Step 260 / Rank 4] Tasks: ['Single QA'] | Lens: [36504] → Tgt Spa: ['0.350'] [Step 260 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43195] → Tgt Spa: ['1.000'] [Step 260 / Rank 3] Tasks: 
['Single QA'] | Lens: [65033] → Tgt Spa: ['0.350'] [Step 260 / Rank 0] Tasks: ['MultiHop QA', 'Single QA'] | Lens: [32112, 32113] → Tgt Spa: ['0.350', '0.350'] [Step 260 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43195] → Tgt Spa: ['1.000'] [Step 260 / Rank 2] Tasks: ['Single QA'] | Lens: [65033] → Tgt Spa: ['0.350'] [Step 260 / Rank 3] Tasks: ['Single QA'] | Lens: [63308] → Tgt Spa: ['0.350'] [Step 260 / Rank 7] Tasks: ['Single QA'] | Lens: [50240] → Tgt Spa: ['0.350'] [Step 260 / Rank 5] Tasks: ['Code'] | Lens: [35210] → Tgt Spa: ['1.000'] [Step 260 / Rank 4] Tasks: ['Code'] | Lens: [35210] → Tgt Spa: ['1.000'] [Step 260 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40321] → Tgt Spa: ['1.000'] [Step 260 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40321] → Tgt Spa: ['1.000'] [Step 260 / Rank 6] Tasks: ['Single QA'] | Lens: [50240] → Tgt Spa: ['0.350'] [Step 260 / Rank 2] Tasks: ['Single QA'] | Lens: [63308] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:33:43,064 >> @ 260 | Loss: 2.2627 | LM: 2.2057 | Reg: 0.0570 | Spa(Avg): 0.510 [INFO|lh_trainer.py:797] 2026-02-17 06:33:43,064 >> Statistic -> Code | Spa: 0.679 | Tgt: 1.000 | Z-Loss: 0.109 | [INFO|lh_trainer.py:797] 2026-02-17 06:33:43,064 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:33:43,064 >> Statistic -> MultiHop | Spa: 0.528 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:33:43,064 >> Statistic -> Single | Spa: 0.363 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:33:43,064 >> Statistic -> Summarization | Spa: 0.611 | Tgt: 1.000 | Z-Loss: 0.142 | [INFO|lh_trainer.py:810] 2026-02-17 06:33:43,066 >> [Micro-Log] {"loss": 2.2626912482082844, "lm_loss": 2.2057134360074997, "reg_loss": 0.05697782605905862, "model_sparsity(avg)": 0.509567899008592, "Spa-Code sparsity": 0.6791666626930237, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10879827439785003, "Spa-Single 
QA sparsity": 0.36309522816113066, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.011086079499883843, "Spa-In-Context Learning sparsity": 0.7152777711550394, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1085744487742583, "Spa-Summarization sparsity": 0.6111111164093017, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.14200303107500076, "Spa-MultiHop QA sparsity": 0.5277777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06857362389564514, "step": 260, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:34:10,142 >> {'loss': 13.5761, 'grad_norm': 0.5272302627563477, 'learning_rate': 3.349369570452542e-05, 'epoch': 0.27488151658767773, 'num_input_tokens_seen': 641900154, 'completed': '87.00% (261 / 300)', 'remaining time': '1:49:26', 'throughput': '6724.98', 'gpu_mem_free': '11879MB', 'step': 261} [Step 261 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22668, 22650] → Tgt Spa: ['1.000', '1.000'] [Step 261 / Rank 6] Tasks: ['Single QA'] | Lens: [37144] → Tgt Spa: ['0.350'] [Step 261 / Rank 2] Tasks: ['Single QA'] | Lens: [51561] → Tgt Spa: ['0.350'] [Step 261 / Rank 7] Tasks: ['Single QA'] | Lens: [37144] → Tgt Spa: ['0.350'] [Step 261 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41752] → Tgt Spa: ['1.000'] [Step 261 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41752] → Tgt Spa: ['1.000'] [Step 261 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22668, 22650] → Tgt Spa: ['1.000', '1.000'] [Step 261 / Rank 3] Tasks: ['Single QA'] | Lens: [51561] → Tgt Spa: ['0.350'] [Step 261 / Rank 5] Tasks: ['Code'] | Lens: [60288] → Tgt Spa: ['1.000'] [Step 261 / Rank 6] Tasks: ['Single QA'] | Lens: [50120] → Tgt Spa: ['0.350'] [Step 261 / Rank 1] Tasks: 
['In-Context Learning', 'In-Context Learning'] | Lens: [26870, 26870] → Tgt Spa: ['1.000', '1.000'] [Step 261 / Rank 3] Tasks: ['Summarization'] | Lens: [50958] → Tgt Spa: ['1.000'] [Step 261 / Rank 4] Tasks: ['Code'] | Lens: [60288] → Tgt Spa: ['1.000'] [Step 261 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26870, 26870] → Tgt Spa: ['1.000', '1.000'] [Step 261 / Rank 7] Tasks: ['Single QA'] | Lens: [50120] → Tgt Spa: ['0.350'] [Step 261 / Rank 2] Tasks: ['Summarization'] | Lens: [50958] → Tgt Spa: ['1.000'] [Step 261 / Rank 4] Tasks: ['Single QA'] | Lens: [33884] → Tgt Spa: ['0.350'] [Step 261 / Rank 5] Tasks: ['Single QA'] | Lens: [33884] → Tgt Spa: ['0.350'] [Step 261 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55162] → Tgt Spa: ['1.000'] [Step 261 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55162] → Tgt Spa: ['1.000'] [Step 261 / Rank 6] Tasks: ['Single QA', 'Summarization'] | Lens: [22908, 22930] → Tgt Spa: ['0.350', '1.000'] [Step 261 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [64063] → Tgt Spa: ['1.000'] [Step 261 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [64063] → Tgt Spa: ['1.000'] [Step 261 / Rank 7] Tasks: ['Single QA', 'Summarization'] | Lens: [22908, 22930] → Tgt Spa: ['0.350', '1.000'] [Step 261 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17065, 17067, 17063] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 261 / Rank 3] Tasks: ['Single QA'] | Lens: [45409] → Tgt Spa: ['0.350'] [Step 261 / Rank 1] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23584, 23568] → Tgt Spa: ['1.000', '1.000'] [Step 261 / Rank 0] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [23584, 23568] → Tgt Spa: ['1.000', '1.000'] [Step 261 / Rank 5] Tasks: ['Single QA'] | Lens: [45918] → Tgt Spa: ['0.350'] [Step 261 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [17065, 17067, 17063] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 261 / Rank 4] Tasks: ['Single QA'] | Lens: 
[45918] → Tgt Spa: ['0.350'] [Step 261 / Rank 2] Tasks: ['Single QA'] | Lens: [45409] → Tgt Spa: ['0.350'] [Step 261 / Rank 5] Tasks: ['Single QA'] | Lens: [56989] → Tgt Spa: ['0.350'] [Step 261 / Rank 4] Tasks: ['Single QA'] | Lens: [56989] → Tgt Spa: ['0.350'] [Step 261 / Rank 0] Tasks: ['Single QA'] | Lens: [33213] → Tgt Spa: ['0.350'] [Step 261 / Rank 3] Tasks: ['Single QA'] | Lens: [49212] → Tgt Spa: ['0.350'] [Step 261 / Rank 1] Tasks: ['Single QA'] | Lens: [33213] → Tgt Spa: ['0.350'] [Step 261 / Rank 6] Tasks: ['Code'] | Lens: [36986] → Tgt Spa: ['1.000'] [Step 261 / Rank 7] Tasks: ['Code'] | Lens: [36986] → Tgt Spa: ['1.000'] [Step 261 / Rank 2] Tasks: ['Single QA'] | Lens: [49212] → Tgt Spa: ['0.350'] [Step 261 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [27178, 27179] → Tgt Spa: ['1.000', '1.000'] [Step 261 / Rank 5] Tasks: ['Single QA'] | Lens: [58372] → Tgt Spa: ['0.350'] [Step 261 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [27178, 27179] → Tgt Spa: ['1.000', '1.000'] [Step 261 / Rank 7] Tasks: ['Single QA'] | Lens: [50951] → Tgt Spa: ['0.350'] [Step 261 / Rank 2] Tasks: ['Summarization'] | Lens: [32913] → Tgt Spa: ['1.000'] [Step 261 / Rank 4] Tasks: ['Single QA'] | Lens: [58372] → Tgt Spa: ['0.350'] [Step 261 / Rank 6] Tasks: ['Single QA'] | Lens: [50951] → Tgt Spa: ['0.350'] [Step 261 / Rank 3] Tasks: ['Summarization'] | Lens: [32913] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:36:34,341 >> @ 261 | Loss: 2.0695 | LM: 2.0097 | Reg: 0.0598 | Spa(Avg): 0.546 [INFO|lh_trainer.py:797] 2026-02-17 06:36:34,341 >> Statistic -> Code | Spa: 0.719 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 06:36:34,341 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:36:34,341 >> Statistic -> MultiHop | Spa: 0.528 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:36:34,341 >> Statistic -> Single | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 06:36:34,342 >> Statistic -> Summarization | Spa: 0.690 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 06:36:34,343 >> [Micro-Log] {"loss": 2.069505088031292, "lm_loss": 2.009728512416283, "reg_loss": 0.059776583356627576, "model_sparsity(avg)": 0.5462962885697683, "Spa-Summarization sparsity": 0.6904761876378741, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09769198724201747, "Spa-In-Context Learning sparsity": 0.7123015778405326, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10989423096179962, "Spa-Single QA sparsity": 0.3749999900658925, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01656967211359491, "Spa-Code sparsity": 0.7194444417953492, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.091910919547081, "Spa-MultiHop QA sparsity": 0.5277777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06857362389564514, "step": 261, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.3125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:36:56,968 >> {'loss': 12.417, 'grad_norm': 0.5684494972229004, 'learning_rate': 3.1876045044200884e-05, 'epoch': 0.27593470247498686, 'num_input_tokens_seen': 644205144, 'completed': '87.33% (262 / 300)', 'remaining time': '1:46:37', 'throughput': '6908.35', 'gpu_mem_free': '9075MB', 'step': 262} [Step 262 / Rank 5] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [7756, 7764, 7756, 7757, 7758, 7758, 7760, 7760] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 262 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61837] → Tgt Spa: ['1.000'] [Step 262 / Rank 4] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single 
QA', 'Single QA'] | Lens: [7756, 7764, 7756, 7757, 7758, 7758, 7760, 7760] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 262 / Rank 7] Tasks: ['Single QA'] | Lens: [36939] → Tgt Spa: ['0.350'] [Step 262 / Rank 1] Tasks: ['Single QA'] | Lens: [51530] → Tgt Spa: ['0.350'] [Step 262 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61837] → Tgt Spa: ['1.000'] [Step 262 / Rank 6] Tasks: ['Single QA'] | Lens: [36939] → Tgt Spa: ['0.350'] [Step 262 / Rank 0] Tasks: ['Single QA'] | Lens: [51530] → Tgt Spa: ['0.350'] [Step 262 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56745] → Tgt Spa: ['1.000'] [Step 262 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56745] → Tgt Spa: ['1.000'] [Step 262 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [26893, 26899] → Tgt Spa: ['0.350', '1.000'] [Step 262 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [26893, 26899] → Tgt Spa: ['0.350', '1.000'] [Step 262 / Rank 5] Tasks: ['Single QA'] | Lens: [41384] → Tgt Spa: ['0.350'] [Step 262 / Rank 4] Tasks: ['Single QA'] | Lens: [41384] → Tgt Spa: ['0.350'] [Step 262 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [25433, 25426] → Tgt Spa: ['1.000', '1.000'] [Step 262 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [25433, 25426] → Tgt Spa: ['1.000', '1.000'] [Step 262 / Rank 1] Tasks: ['Single QA'] | Lens: [51029] → Tgt Spa: ['0.350'] [Step 262 / Rank 5] Tasks: ['Single QA'] | Lens: [47570] → Tgt Spa: ['0.350'] [Step 262 / Rank 0] Tasks: ['Single QA'] | Lens: [51029] → Tgt Spa: ['0.350'] [Step 262 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [40709] → Tgt Spa: ['1.000'] [Step 262 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [40709] → Tgt Spa: ['1.000'] [Step 262 / Rank 3] Tasks: ['Single QA'] | Lens: [41331] → Tgt Spa: ['0.350'] [Step 262 / Rank 4] Tasks: ['Single QA'] | Lens: [47570] → Tgt Spa: ['0.350'] [Step 262 / Rank 2] Tasks: ['Single QA'] | Lens: [41331] → Tgt Spa: ['0.350'] [Step 262 / Rank 7] Tasks: 
['In-Context Learning'] | Lens: [41543] → Tgt Spa: ['1.000'] [Step 262 / Rank 0] Tasks: ['Single QA'] | Lens: [39843] → Tgt Spa: ['0.350'] [Step 262 / Rank 4] Tasks: ['Single QA'] | Lens: [32854] → Tgt Spa: ['0.350'] [Step 262 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [31839, 31839] → Tgt Spa: ['0.350', '0.350'] [Step 262 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41543] → Tgt Spa: ['1.000'] [Step 262 / Rank 1] Tasks: ['Single QA'] | Lens: [39843] → Tgt Spa: ['0.350'] [Step 262 / Rank 5] Tasks: ['Single QA'] | Lens: [32854] → Tgt Spa: ['0.350'] [Step 262 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [31839, 31839] → Tgt Spa: ['0.350', '0.350'] [Step 262 / Rank 6] Tasks: ['Code', 'Single QA'] | Lens: [30213, 30207] → Tgt Spa: ['1.000', '0.350'] [Step 262 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42551] → Tgt Spa: ['1.000'] [Step 262 / Rank 3] Tasks: ['Code'] | Lens: [35231] → Tgt Spa: ['1.000'] [Step 262 / Rank 2] Tasks: ['Code'] | Lens: [35231] → Tgt Spa: ['1.000'] [Step 262 / Rank 4] Tasks: ['Single QA'] | Lens: [41923] → Tgt Spa: ['0.350'] [Step 262 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42551] → Tgt Spa: ['1.000'] [Step 262 / Rank 5] Tasks: ['Single QA'] | Lens: [41923] → Tgt Spa: ['0.350'] [Step 262 / Rank 7] Tasks: ['Code', 'Single QA'] | Lens: [30213, 30207] → Tgt Spa: ['1.000', '0.350'] [Step 262 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23709, 23709] → Tgt Spa: ['1.000', '1.000'] [Step 262 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23709, 23709] → Tgt Spa: ['1.000', '1.000'] [Step 262 / Rank 3] Tasks: ['Single QA'] | Lens: [63927] → Tgt Spa: ['0.350'] [Step 262 / Rank 6] Tasks: ['Single QA'] | Lens: [42988] → Tgt Spa: ['0.350'] [Step 262 / Rank 1] Tasks: ['Code'] | Lens: [37724] → Tgt Spa: ['1.000'] [Step 262 / Rank 2] Tasks: ['Single QA'] | Lens: [63927] → Tgt Spa: ['0.350'] [Step 262 / Rank 0] Tasks: ['Code'] | Lens: [37724] → Tgt Spa: ['1.000'] [Step 262 / Rank 7] Tasks: ['Single QA'] | Lens: [42988] → Tgt Spa: ['0.350'] [Step 262
/ Rank 7] Tasks: ['Single QA'] | Lens: [42988] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:39:12,807 >> @ 262 | Loss: 2.1879 | LM: 2.1342 | Reg: 0.0537 | Spa(Avg): 0.518 [INFO|lh_trainer.py:797] 2026-02-17 06:39:12,807 >> Statistic -> Code | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 06:39:12,807 >> Statistic -> In-Context | Spa: 0.720 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:39:12,807 >> Statistic -> MultiHop | Spa: 0.528 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:39:12,807 >> Statistic -> Single | Spa: 0.392 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:39:12,808 >> Statistic -> Summarization | Spa: 0.690 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 06:39:12,809 >> [Micro-Log] {"loss": 2.187908706565698, "lm_loss": 2.134192238251368, "reg_loss": 0.05371648385092461, "model_sparsity(avg)": 0.5178674707810084, "Spa-Single QA sparsity": 0.39204544912685046, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.030295258774210444, "Spa-Code sparsity": 0.7083333233992258, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09652748083074887, "Spa-In-Context Learning sparsity": 0.7204861044883728, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10637405887246132, "Spa-Summarization sparsity": 0.6904761876378741, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09769198724201747, "Spa-MultiHop QA sparsity": 0.5277777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06857362389564514, "step": 262, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:39:39,178 >> {'loss': 13.1275, 'grad_norm': 0.5754216909408569, 'learning_rate': 
3.0295768804936502e-05, 'epoch': 0.27698788836229593, 'num_input_tokens_seen': 646496932, 'completed': '87.67% (263 / 300)', 'remaining time': '1:43:48', 'throughput': '7064.26', 'gpu_mem_free': '13211MB', 'step': 263} [Step 263 / Rank 3] Tasks: ['Single QA'] | Lens: [50816] → Tgt Spa: ['0.350'] [Step 263 / Rank 0] Tasks: ['Code'] | Lens: [54669] → Tgt Spa: ['1.000'] [Step 263 / Rank 5] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17858, 17871, 17860] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 263 / Rank 1] Tasks: ['Code'] | Lens: [54669] → Tgt Spa: ['1.000'] [Step 263 / Rank 4] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17858, 17871, 17860] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 263 / Rank 6] Tasks: ['Single QA', 'Code', 'Single QA'] | Lens: [19157, 19164, 19160] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 263 / Rank 7] Tasks: ['Single QA', 'Code', 'Single QA'] | Lens: [19157, 19164, 19160] → Tgt Spa: ['0.350', '1.000', '0.350'] [Step 263 / Rank 2] Tasks: ['Single QA'] | Lens: [50816] → Tgt Spa: ['0.350'] [Step 263 / Rank 5] Tasks: ['Code'] | Lens: [58536] → Tgt Spa: ['1.000'] [Step 263 / Rank 6] Tasks: ['Single QA'] | Lens: [51841] → Tgt Spa: ['0.350'] [Step 263 / Rank 7] Tasks: ['Single QA'] | Lens: [51841] → Tgt Spa: ['0.350'] [Step 263 / Rank 4] Tasks: ['Code'] | Lens: [58536] → Tgt Spa: ['1.000'] [Step 263 / Rank 3] Tasks: ['Single QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5509, 5518, 5519, 5511, 5513, 5514, 5516, 5518, 5517, 5519, 5519] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 263 / Rank 1] Tasks: ['Single QA'] | Lens: [56519] → Tgt Spa: ['0.350'] [Step 263 / Rank 0] Tasks: ['Single QA'] | Lens: [56519] → Tgt Spa: ['0.350'] [Step 263 / Rank 2] Tasks: ['Single QA', 'Code', 'Code', 'In-Context Learning', 
'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5509, 5518, 5519, 5511, 5513, 5514, 5516, 5518, 5517, 5519, 5519] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 263 / Rank 5] Tasks: ['Single QA'] | Lens: [43066] → Tgt Spa: ['0.350'] [Step 263 / Rank 3] Tasks: ['Single QA'] | Lens: [54464] → Tgt Spa: ['0.350'] [Step 263 / Rank 2] Tasks: ['Single QA'] | Lens: [54464] → Tgt Spa: ['0.350'] [Step 263 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [44051] → Tgt Spa: ['1.000'] [Step 263 / Rank 6] Tasks: ['Single QA'] | Lens: [41920] → Tgt Spa: ['0.350'] [Step 263 / Rank 4] Tasks: ['Single QA'] | Lens: [43066] → Tgt Spa: ['0.350'] [Step 263 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [44051] → Tgt Spa: ['1.000'] [Step 263 / Rank 7] Tasks: ['Single QA'] | Lens: [41920] → Tgt Spa: ['0.350'] [Step 263 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16575, 16566, 16565] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 263 / Rank 6] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [5038, 5031, 5032, 5035, 5034, 5036, 5035, 5036, 5045, 5037, 5039, 5047, 5038] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 263 / Rank 5] Tasks: ['Single QA'] | Lens: [40591] → Tgt Spa: ['0.350'] [Step 263 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16575, 16566, 16565] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 263 / Rank 7] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 
'Single QA', 'Code', 'In-Context Learning'] | Lens: [5038, 5031, 5032, 5035, 5034, 5036, 5035, 5036, 5045, 5037, 5039, 5047, 5038] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000'] [Step 263 / Rank 4] Tasks: ['Single QA'] | Lens: [40591] → Tgt Spa: ['0.350'] [Step 263 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [25256, 25256] → Tgt Spa: ['0.350', '0.350'] [Step 263 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [25256, 25256] → Tgt Spa: ['0.350', '0.350'] [Step 263 / Rank 5] Tasks: ['Code'] | Lens: [57665] → Tgt Spa: ['1.000'] [Step 263 / Rank 7] Tasks: ['Single QA'] | Lens: [35914] → Tgt Spa: ['0.350'] [Step 263 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [9912, 9914, 9914, 9914, 9913, 9915] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 263 / Rank 3] Tasks: ['Summarization', 'Single QA'] | Lens: [32388, 32371] → Tgt Spa: ['1.000', '0.350'] [Step 263 / Rank 6] Tasks: ['Single QA'] | Lens: [35914] → Tgt Spa: ['0.350'] [Step 263 / Rank 2] Tasks: ['Summarization', 'Single QA'] | Lens: [32388, 32371] → Tgt Spa: ['1.000', '0.350'] [Step 263 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [9912, 9914, 9914, 9914, 9913, 9915] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 263 / Rank 4] Tasks: ['Code'] | Lens: [57665] → Tgt Spa: ['1.000'] [Step 263 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55569] → Tgt Spa: ['1.000'] [Step 263 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32180, 32180] → Tgt Spa: ['0.350', '0.350'] [Step 263 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32691, 32692] → Tgt Spa: ['0.350', '0.350'] [Step 263 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32180, 32180] → Tgt Spa: ['0.350', '0.350'] [Step 263 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55569] → Tgt Spa: 
['1.000'] [Step 263 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32691, 32692] → Tgt Spa: ['0.350', '0.350'] [Step 263 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [31468, 31469] → Tgt Spa: ['0.350', '1.000'] [Step 263 / Rank 2] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [31468, 31469] → Tgt Spa: ['0.350', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:41:51,883 >> @ 263 | Loss: 1.9089 | LM: 1.8446 | Reg: 0.0643 | Spa(Avg): 0.516 [INFO|lh_trainer.py:797] 2026-02-17 06:41:51,883 >> Statistic -> Code | Spa: 0.710 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 06:41:51,883 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:41:51,883 >> Statistic -> MultiHop | Spa: 0.528 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:41:51,883 >> Statistic -> Single | Spa: 0.423 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:41:51,883 >> Statistic -> Summarization | Spa: 0.546 | Tgt: 1.000 | Z-Loss: 0.181 | [INFO|lh_trainer.py:810] 2026-02-17 06:41:51,885 >> [Micro-Log] {"loss": 1.9088953956961632, "lm_loss": 1.8446317352354527, "reg_loss": 0.06426367238357973, "model_sparsity(avg)": 0.5162509170671304, "Spa-Code sparsity": 0.7104700803756714, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09555085863058384, "Spa-Single QA sparsity": 0.42289271642421855, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04836690377700945, "Spa-In-Context Learning sparsity": 0.7116013064103968, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1101313342942911, "Spa-Summarization sparsity": 0.5462962786356608, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1808569704492887, "Spa-MultiHop QA sparsity": 0.5277777910232544, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.06857362389564514, "step": 263, 
"current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:42:12,722 >> {'loss': 11.4534, 'grad_norm': 0.5298753380775452, 'learning_rate': 2.87531377589305e-05, 'epoch': 0.27804107424960506, 'num_input_tokens_seen': 649096904, 'completed': '88.00% (264 / 300)', 'remaining time': '1:40:57', 'throughput': '8466.57', 'gpu_mem_free': '6199MB', 'step': 264} [Step 264 / Rank 5] Tasks: ['Single QA'] | Lens: [46122] → Tgt Spa: ['0.350'] [Step 264 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59043] → Tgt Spa: ['1.000'] [Step 264 / Rank 7] Tasks: ['Code', 'Single QA', 'Summarization'] | Lens: [16385, 16387, 16418] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 264 / Rank 4] Tasks: ['Single QA'] | Lens: [46122] → Tgt Spa: ['0.350'] [Step 264 / Rank 6] Tasks: ['Code', 'Single QA', 'Summarization'] | Lens: [16385, 16387, 16418] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 264 / Rank 3] Tasks: ['Single QA'] | Lens: [57566] → Tgt Spa: ['0.350'] [Step 264 / Rank 2] Tasks: ['Single QA'] | Lens: [57566] → Tgt Spa: ['0.350'] [Step 264 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59043] → Tgt Spa: ['1.000'] [Step 264 / Rank 5] Tasks: ['Single QA'] | Lens: [52692] → Tgt Spa: ['0.350'] [Step 264 / Rank 7] Tasks: ['Code'] | Lens: [63851] → Tgt Spa: ['1.000'] [Step 264 / Rank 0] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25241, 25244] → Tgt Spa: ['1.000', '0.350'] [Step 264 / Rank 4] Tasks: ['Single QA'] | Lens: [52692] → Tgt Spa: ['0.350'] [Step 264 / Rank 2] Tasks: ['Single QA'] | Lens: [60738] → Tgt Spa: ['0.350'] [Step 264 / Rank 1] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25241, 25244] → Tgt Spa: ['1.000', '0.350'] [Step 264 / Rank 6] Tasks: ['Code'] | Lens: [63851] → Tgt Spa: ['1.000'] [Step 264 / Rank 3] Tasks: ['Single QA'] | Lens: [60738] → Tgt Spa: ['0.350'] [Step 264 / Rank 4] Tasks: ['Summarization', 'Code'] | Lens: 
[28263, 28252] → Tgt Spa: ['1.000', '1.000'] [Step 264 / Rank 0] Tasks: ['Single QA'] | Lens: [55634] → Tgt Spa: ['0.350'] [Step 264 / Rank 5] Tasks: ['Summarization', 'Code'] | Lens: [28263, 28252] → Tgt Spa: ['1.000', '1.000'] [Step 264 / Rank 2] Tasks: ['Single QA'] | Lens: [60152] → Tgt Spa: ['0.350'] [Step 264 / Rank 3] Tasks: ['Single QA'] | Lens: [60152] → Tgt Spa: ['0.350'] [Step 264 / Rank 1] Tasks: ['Single QA'] | Lens: [55634] → Tgt Spa: ['0.350'] [Step 264 / Rank 7] Tasks: ['Single QA'] | Lens: [35563] → Tgt Spa: ['0.350'] [Step 264 / Rank 6] Tasks: ['Single QA'] | Lens: [35563] → Tgt Spa: ['0.350'] [Step 264 / Rank 6] Tasks: ['MultiHop QA'] | Lens: [64798] → Tgt Spa: ['0.350'] [Step 264 / Rank 3] Tasks: ['Single QA'] | Lens: [65116] → Tgt Spa: ['0.350'] [Step 264 / Rank 4] Tasks: ['Single QA'] | Lens: [58117] → Tgt Spa: ['0.350'] [Step 264 / Rank 5] Tasks: ['Single QA'] | Lens: [58117] → Tgt Spa: ['0.350'] [Step 264 / Rank 1] Tasks: ['Summarization'] | Lens: [37348] → Tgt Spa: ['1.000'] [Step 264 / Rank 2] Tasks: ['Single QA'] | Lens: [65116] → Tgt Spa: ['0.350'] [Step 264 / Rank 7] Tasks: ['MultiHop QA'] | Lens: [64798] → Tgt Spa: ['0.350'] [Step 264 / Rank 0] Tasks: ['Summarization'] | Lens: [37348] → Tgt Spa: ['1.000'] [Step 264 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25410, 25410] → Tgt Spa: ['1.000', '1.000'] [Step 264 / Rank 4] Tasks: ['Summarization', 'In-Context Learning', 'Summarization'] | Lens: [19906, 19888, 19907] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 264 / Rank 5] Tasks: ['Summarization', 'In-Context Learning', 'Summarization'] | Lens: [19906, 19888, 19907] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 264 / Rank 1] Tasks: ['Single QA'] | Lens: [42414] → Tgt Spa: ['0.350'] [Step 264 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23531, 23531] → Tgt Spa: ['1.000', '1.000'] [Step 264 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23531, 23531] → Tgt 
Spa: ['1.000', '1.000'] [Step 264 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25410, 25410] → Tgt Spa: ['1.000', '1.000'] [Step 264 / Rank 0] Tasks: ['Single QA'] | Lens: [42414] → Tgt Spa: ['0.350'] [Step 264 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [23362, 23357] → Tgt Spa: ['1.000', '1.000'] [Step 264 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57862] → Tgt Spa: ['1.000'] [Step 264 / Rank 0] Tasks: ['Single QA'] | Lens: [36985] → Tgt Spa: ['0.350'] [Step 264 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57862] → Tgt Spa: ['1.000'] [Step 264 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [23362, 23357] → Tgt Spa: ['1.000', '1.000'] [Step 264 / Rank 3] Tasks: ['Single QA'] | Lens: [37505] → Tgt Spa: ['0.350'] [Step 264 / Rank 2] Tasks: ['Single QA'] | Lens: [37505] → Tgt Spa: ['0.350'] [Step 264 / Rank 1] Tasks: ['Single QA'] | Lens: [36985] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:44:48,917 >> @ 264 | Loss: 2.1396 | LM: 2.0879 | Reg: 0.0517 | Spa(Avg): 0.520 [INFO|lh_trainer.py:797] 2026-02-17 06:44:48,917 >> Statistic -> Code | Spa: 0.722 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 06:44:48,917 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:44:48,917 >> Statistic -> MultiHop | Spa: 0.486 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:44:48,918 >> Statistic -> Single | Spa: 0.369 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:44:48,918 >> Statistic -> Summarization | Spa: 0.686 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:810] 2026-02-17 06:44:48,919 >> [Micro-Log] {"loss": 2.1396048311144114, "lm_loss": 2.0879071025798717, "reg_loss": 0.051697715622140095, "model_sparsity(avg)": 0.519868828356266, "Spa-In-Context Learning sparsity": 0.7175925970077515, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 
0.10759153962135315, "Spa-Single QA sparsity": 0.369047611951828, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.012538707657118462, "Spa-Summarization sparsity": 0.6861111164093018, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09930706024169922, "Spa-Code sparsity": 0.7222222089767456, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09090471267700195, "Spa-MultiHop QA sparsity": 0.4861111044883728, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.05049293860793114, "step": 264, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:45:10,879 >> {'loss': 12.8376, 'grad_norm': 0.49052122235298157, 'learning_rate': 2.724841622807116e-05, 'epoch': 0.2790942601369142, 'num_input_tokens_seen': 651600900, 'completed': '88.33% (265 / 300)', 'remaining time': '1:38:11', 'throughput': '7027.48', 'gpu_mem_free': '14059MB', 'step': 265} [Step 265 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [31663, 31666] → Tgt Spa: ['1.000', '1.000'] [Step 265 / Rank 3] Tasks: ['Single QA'] | Lens: [55107] → Tgt Spa: ['0.350'] [Step 265 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [31663, 31666] → Tgt Spa: ['1.000', '1.000'] [Step 265 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15904, 15904, 15904, 15904] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 265 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25810, 25812] → Tgt Spa: ['1.000', '1.000'] [Step 265 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25810, 25812] → Tgt Spa: ['1.000', '1.000'] [Step 265 / Rank 2] Tasks: ['Single QA'] | Lens: [55107] → Tgt Spa: ['0.350'] [Step 265 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15904, 15904, 15904, 15904] → Tgt Spa: ['0.350', '0.350', 
'0.350', '0.350'] [Step 265 / Rank 3] Tasks: ['Single QA'] | Lens: [49219] → Tgt Spa: ['0.350'] [Step 265 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [1684, 1684, 1685, 1687, 1705, 1706, 1706, 1689, 1688, 1687, 1707, 1689, 1689, 1689, 1709, 1709, 1710, 1691, 1691, 1691, 1710, 1692, 1713, 1695, 1694, 1694, 1694, 1695, 1694, 1695, 1695, 1695, 1715, 1698, 1696, 1715, 1698, 1697] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 265 / Rank 2] Tasks: ['Single QA'] | Lens: [49219] → Tgt Spa: ['0.350'] [Step 265 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [42543] → Tgt Spa: ['1.000'] [Step 265 / Rank 0] Tasks: ['Single QA'] | Lens: [51467] → Tgt Spa: ['0.350'] [Step 265 / Rank 1] Tasks: ['Single QA'] | Lens: [51467] → Tgt Spa: ['0.350'] [Step 265 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop 
QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [1684, 1684, 1685, 1687, 1705, 1706, 1706, 1689, 1688, 1687, 1707, 1689, 1689, 1689, 1709, 1709, 1710, 1691, 1691, 1691, 1710, 1692, 1713, 1695, 1694, 1694, 1694, 1695, 1694, 1695, 1695, 1695, 1715, 1698, 1696, 1715, 1698, 1697] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 265 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [42543] → Tgt Spa: ['1.000'] [Step 265 / Rank 0] Tasks: ['Single QA'] | Lens: [54195] → Tgt Spa: ['0.350'] [Step 265 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16979, 16979, 16990] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 265 / Rank 1] Tasks: ['Single QA'] | Lens: [54195] → Tgt Spa: ['0.350'] [Step 265 / Rank 3] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17463, 17464, 17476] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 265 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16979, 16979, 16990] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 265 / Rank 4] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16682, 16684, 16696] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 265 / Rank 5] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [16682, 16684, 16696] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 265 / Rank 2] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17463, 17464, 17476] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 265 / Rank 4] Tasks: ['Single QA'] | Lens: [45708] → Tgt Spa: ['0.350'] [Step 265 / Rank 3] Tasks: ['Summarization', 
'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5425, 5409, 5409, 5410, 5418, 5419, 5412, 5413, 5431, 5414, 5415, 5415] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 265 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [52666] → Tgt Spa: ['1.000'] [Step 265 / Rank 2] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5425, 5409, 5409, 5410, 5418, 5419, 5412, 5413, 5431, 5414, 5415, 5415] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 265 / Rank 5] Tasks: ['Single QA'] | Lens: [45708] → Tgt Spa: ['0.350'] [Step 265 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [52666] → Tgt Spa: ['1.000'] [Step 265 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25279, 25280] → Tgt Spa: ['1.000', '1.000'] [Step 265 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25279, 25280] → Tgt Spa: ['1.000', '1.000'] [Step 265 / Rank 5] Tasks: ['Single QA'] | Lens: [57696] → Tgt Spa: ['0.350'] [Step 265 / Rank 3] Tasks: ['Single QA'] | Lens: [60605] → Tgt Spa: ['0.350'] [Step 265 / Rank 4] Tasks: ['Single QA'] | Lens: [57696] → Tgt Spa: ['0.350'] [Step 265 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [30954, 30954] → Tgt Spa: ['0.350', '0.350'] [Step 265 / Rank 2] Tasks: ['Single QA'] | Lens: [60605] → Tgt Spa: ['0.350'] [Step 265 / Rank 1] Tasks: ['Code'] | Lens: [44894] → Tgt Spa: ['1.000'] [Step 265 / Rank 0] Tasks: ['Code'] | Lens: [44894] → Tgt Spa: ['1.000'] [Step 265 / Rank 6] Tasks: ['Single QA', 
'Single QA'] | Lens: [30954, 30954] → Tgt Spa: ['0.350', '0.350'] [Step 265 / Rank 1] Tasks: ['Single QA', 'Code', 'Code', 'MultiHop QA'] | Lens: [14749, 14763, 14763, 14761] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 265 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42317] → Tgt Spa: ['1.000'] [Step 265 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42317] → Tgt Spa: ['1.000'] [Step 265 / Rank 0] Tasks: ['Single QA', 'Code', 'Code', 'MultiHop QA'] | Lens: [14749, 14763, 14763, 14761] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350'] [Step 265 / Rank 3] Tasks: ['Single QA'] | Lens: [51407] → Tgt Spa: ['0.350'] [Step 265 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [26846, 26847] → Tgt Spa: ['0.350', '0.350'] [Step 265 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [26846, 26847] → Tgt Spa: ['0.350', '0.350'] [Step 265 / Rank 2] Tasks: ['Single QA'] | Lens: [51407] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 06:47:30,941 >> @ 265 | Loss: 2.1847 | LM: 2.1169 | Reg: 0.0678 | Spa(Avg): 0.540 [INFO|lh_trainer.py:797] 2026-02-17 06:47:30,942 >> Statistic -> Code | Spa: 0.714 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 06:47:30,942 >> Statistic -> In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:47:30,942 >> Statistic -> MultiHop | Spa: 0.585 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:47:30,942 >> Statistic -> Single | Spa: 0.402 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:47:30,942 >> Statistic -> Summarization | Spa: 0.634 | Tgt: 1.000 | Z-Loss: 0.127 | [INFO|lh_trainer.py:810] 2026-02-17 06:47:30,944 >> [Micro-Log] {"loss": 2.1846656799316406, "lm_loss": 2.1168718300759792, "reg_loss": 0.06779384971499287, "model_sparsity(avg)": 0.5399203971028328, "Spa-Single QA sparsity": 0.40196077262654023, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.034748715245822334, "Spa-In-Context 
Learning sparsity": 0.7129629532496135, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10966312487920125, "Spa-Code sparsity": 0.7136752146940964, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0942501058945289, "Spa-MultiHop QA sparsity": 0.5848214349576405, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09823195609663214, "Spa-Summarization sparsity": 0.6336805559694767, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12745214207097888, "step": 265, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:47:49,489 >> {'loss': 13.108, 'grad_norm': 0.5886843204498291, 'learning_rate': 2.578186203864648e-05, 'epoch': 0.28014744602422326, 'num_input_tokens_seen': 654189842, 'completed': '88.67% (266 / 300)', 'remaining time': '1:35:21', 'throughput': '8161.37', 'gpu_mem_free': '8971MB', 'step': 266} [Step 266 / Rank 7] Tasks: ['Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [5792, 5792, 5785, 5788, 5788, 5788, 5788, 5789, 5789, 5797, 5790] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000'] [Step 266 / Rank 0] Tasks: ['Single QA'] | Lens: [35481] → Tgt Spa: ['0.350'] [Step 266 / Rank 2] Tasks: ['Code'] | Lens: [61912] → Tgt Spa: ['1.000'] [Step 266 / Rank 4] Tasks: ['Single QA'] | Lens: [43794] → Tgt Spa: ['0.350'] [Step 266 / Rank 6] Tasks: ['Code', 'Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'In-Context Learning'] | Lens: [5792, 5792, 5785, 5788, 5788, 5788, 5788, 5789, 5789, 5797, 5790] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', 
'0.350', '0.350', '1.000', '1.000'] [Step 266 / Rank 3] Tasks: ['Code'] | Lens: [61912] → Tgt Spa: ['1.000'] [Step 266 / Rank 5] Tasks: ['Single QA'] | Lens: [43794] → Tgt Spa: ['0.350'] [Step 266 / Rank 1] Tasks: ['Single QA'] | Lens: [35481] → Tgt Spa: ['0.350'] [Step 266 / Rank 0] Tasks: ['Summarization', 'Single QA', 'Single QA'] | Lens: [16444, 16427, 16427] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 266 / Rank 6] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [20675, 20665, 20658] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 266 / Rank 4] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Code', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [8149, 8168, 8150, 8164, 8160, 8171, 8174, 8174] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 266 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [33647] → Tgt Spa: ['1.000'] [Step 266 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [33647] → Tgt Spa: ['1.000'] [Step 266 / Rank 1] Tasks: ['Summarization', 'Single QA', 'Single QA'] | Lens: [16444, 16427, 16427] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 266 / Rank 5] Tasks: ['Single QA', 'Summarization', 'Single QA', 'Code', 'Single QA', 'Code', 'Code', 'Code'] | Lens: [8149, 8168, 8150, 8164, 8160, 8171, 8174, 8174] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 266 / Rank 7] Tasks: ['Summarization', 'Code', 'In-Context Learning'] | Lens: [20675, 20665, 20658] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 266 / Rank 7] Tasks: ['Code'] | Lens: [58007] → Tgt Spa: ['1.000'] [Step 266 / Rank 2] Tasks: ['Single QA'] | Lens: [49855] → Tgt Spa: ['0.350'] [Step 266 / Rank 6] Tasks: ['Code'] | Lens: [58007] → Tgt Spa: ['1.000'] [Step 266 / Rank 3] Tasks: ['Single QA'] | Lens: [49855] → Tgt Spa: ['0.350'] [Step 266 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24384, 24383] → Tgt Spa: ['1.000', '1.000'] [Step 266 / Rank 1] Tasks: ['In-Context 
Learning'] | Lens: [41256] → Tgt Spa: ['1.000'] [Step 266 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [41256] → Tgt Spa: ['1.000'] [Step 266 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24384, 24383] → Tgt Spa: ['1.000', '1.000'] [Step 266 / Rank 4] Tasks: ['Single QA'] | Lens: [51460] → Tgt Spa: ['0.350'] [Step 266 / Rank 5] Tasks: ['Single QA'] | Lens: [51460] → Tgt Spa: ['0.350'] [Step 266 / Rank 1] Tasks: ['Single QA'] | Lens: [44067] → Tgt Spa: ['0.350'] [Step 266 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22448, 22468] → Tgt Spa: ['1.000', '1.000'] [Step 266 / Rank 6] Tasks: ['Single QA'] | Lens: [54068] → Tgt Spa: ['0.350'] [Step 266 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22448, 22468] → Tgt Spa: ['1.000', '1.000'] [Step 266 / Rank 0] Tasks: ['Single QA'] | Lens: [44067] → Tgt Spa: ['0.350'] [Step 266 / Rank 7] Tasks: ['Single QA'] | Lens: [54068] → Tgt Spa: ['0.350'] [Step 266 / Rank 1] Tasks: ['Code'] | Lens: [36708] → Tgt Spa: ['1.000'] [Step 266 / Rank 2] Tasks: ['Single QA'] | Lens: [51227] → Tgt Spa: ['0.350'] [Step 266 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [30628, 30629] → Tgt Spa: ['0.350', '0.350'] [Step 266 / Rank 4] Tasks: ['Single QA'] | Lens: [48112] → Tgt Spa: ['0.350'] [Step 266 / Rank 5] Tasks: ['Single QA'] | Lens: [48112] → Tgt Spa: ['0.350'] [Step 266 / Rank 3] Tasks: ['Single QA'] | Lens: [51227] → Tgt Spa: ['0.350'] [Step 266 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [30628, 30629] → Tgt Spa: ['0.350', '0.350'] [Step 266 / Rank 0] Tasks: ['Code'] | Lens: [36708] → Tgt Spa: ['1.000'] [Step 266 / Rank 4] Tasks: ['Single QA', 'Summarization', 'Summarization'] | Lens: [20380, 20399, 20400] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 266 / Rank 3] Tasks: ['Single QA'] | Lens: [49702] → Tgt Spa: ['0.350'] [Step 266 / Rank 1] Tasks: ['Single QA'] | Lens: [49117] → Tgt Spa: ['0.350'] [Step 266 / Rank 5] Tasks: ['Single QA', 
'Summarization', 'Summarization'] | Lens: [20380, 20399, 20400] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 266 / Rank 0] Tasks: ['Single QA'] | Lens: [49117] → Tgt Spa: ['0.350'] [Step 266 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58580] → Tgt Spa: ['1.000'] [Step 266 / Rank 2] Tasks: ['Single QA'] | Lens: [49702] → Tgt Spa: ['0.350'] [Step 266 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58580] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:50:05,307 >> @ 266 | Loss: 2.1342 | LM: 2.0771 | Reg: 0.0571 | Spa(Avg): 0.527 [INFO|lh_trainer.py:797] 2026-02-17 06:50:05,307 >> Statistic -> Code | Spa: 0.715 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 06:50:05,307 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:50:05,307 >> Statistic -> MultiHop | Spa: 0.585 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:50:05,307 >> Statistic -> Single | Spa: 0.426 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:50:05,307 >> Statistic -> Summarization | Spa: 0.639 | Tgt: 1.000 | Z-Loss: 0.128 | [INFO|lh_trainer.py:810] 2026-02-17 06:50:05,309 >> [Micro-Log] {"loss": 2.1341762940088906, "lm_loss": 2.0770819671452045, "reg_loss": 0.05709432489432705, "model_sparsity(avg)": 0.5273875879744688, "Spa-Single QA sparsity": 0.42572462817896967, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05086014800927723, "Spa-Summarization sparsity": 0.6388888855775198, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1277021231750647, "Spa-In-Context Learning sparsity": 0.7180555582046508, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10747441351413727, "Spa-Code sparsity": 0.7146464586257935, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09390385787595402, "Spa-MultiHop QA sparsity": 0.5848214349576405, "Spa-MultiHop QA target_sparsity": 
0.349609375, "Spa-MultiHop QA log_z_loss": 0.09823195609663214, "step": 266, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:50:27,630 >> {'loss': 12.8051, 'grad_norm': 0.5260895490646362, 'learning_rate': 2.435372647716701e-05, 'epoch': 0.2812006319115324, 'num_input_tokens_seen': 656636650, 'completed': '89.00% (267 / 300)', 'remaining time': '1:32:31', 'throughput': '7736.16', 'gpu_mem_free': '9789MB', 'step': 267} [Step 267 / Rank 0] Tasks: ['Code'] | Lens: [35190] → Tgt Spa: ['1.000'] [Step 267 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22809, 22811] → Tgt Spa: ['1.000', '1.000'] [Step 267 / Rank 1] Tasks: ['Code'] | Lens: [35190] → Tgt Spa: ['1.000'] [Step 267 / Rank 5] Tasks: ['Single QA'] | Lens: [50488] → Tgt Spa: ['0.350'] [Step 267 / Rank 6] Tasks: ['Single QA'] | Lens: [48484] → Tgt Spa: ['0.350'] [Step 267 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22809, 22811] → Tgt Spa: ['1.000', '1.000'] [Step 267 / Rank 4] Tasks: ['Single QA'] | Lens: [50488] → Tgt Spa: ['0.350'] [Step 267 / Rank 7] Tasks: ['Single QA'] | Lens: [48484] → Tgt Spa: ['0.350'] [Step 267 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [22385, 22395] → Tgt Spa: ['1.000', '1.000'] [Step 267 / Rank 5] Tasks: ['Single QA'] | Lens: [49065] → Tgt Spa: ['0.350'] [Step 267 / Rank 2] Tasks: ['Single QA'] | Lens: [63852] → Tgt Spa: ['0.350'] [Step 267 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [22385, 22395] → Tgt Spa: ['1.000', '1.000'] [Step 267 / Rank 4] Tasks: ['Single QA'] | Lens: [49065] → Tgt Spa: ['0.350'] [Step 267 / Rank 3] Tasks: ['Single QA'] | Lens: [63852] → Tgt Spa: ['0.350'] [Step 267 / Rank 7] Tasks: ['Code'] | Lens: [35025] → Tgt Spa: ['1.000'] [Step 267 / Rank 6] Tasks: ['Code'] | Lens: [35025] → Tgt Spa: ['1.000'] [Step 267 / Rank 5] Tasks: ['Code'] | 
Lens: [39440] → Tgt Spa: ['1.000'] [Step 267 / Rank 6] Tasks: ['Code'] | Lens: [64223] → Tgt Spa: ['1.000'] [Step 267 / Rank 3] Tasks: ['Single QA'] | Lens: [59033] → Tgt Spa: ['0.350'] [Step 267 / Rank 7] Tasks: ['Code'] | Lens: [64223] → Tgt Spa: ['1.000'] [Step 267 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [4001, 3996, 3997, 3998, 4017, 4000, 4019, 4008, 4000, 4000, 4001, 4001, 4001, 4002, 4003, 4003] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 267 / Rank 4] Tasks: ['Code'] | Lens: [39440] → Tgt Spa: ['1.000'] [Step 267 / Rank 2] Tasks: ['Single QA'] | Lens: [59033] → Tgt Spa: ['0.350'] [Step 267 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'Single QA', 'Single QA', 'Summarization', 'Single QA', 'Summarization', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning'] | Lens: [4001, 3996, 3997, 3998, 4017, 4000, 4019, 4008, 4000, 4000, 4001, 4001, 4001, 4002, 4003, 4003] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 267 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [24046, 24038] → Tgt Spa: ['1.000', '0.350'] [Step 267 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [25535, 25536] → Tgt Spa: ['0.350', '0.350'] [Step 267 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [24046, 24038] → Tgt Spa: ['1.000', '0.350'] [Step 267 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25535, 25536] → Tgt Spa: ['0.350', '0.350'] [Step 267 / Rank 3] Tasks: ['Single QA'] | Lens: [58883] → 
Tgt Spa: ['0.350'] [Step 267 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [20045, 20046, 20047] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 267 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | Lens: [20045, 20046, 20047] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 267 / Rank 2] Tasks: ['Single QA'] | Lens: [58883] → Tgt Spa: ['0.350'] [Step 267 / Rank 5] Tasks: ['Code'] | Lens: [60894] → Tgt Spa: ['1.000'] [Step 267 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42846] → Tgt Spa: ['1.000'] [Step 267 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [23759, 23760] → Tgt Spa: ['0.350', '0.350'] [Step 267 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42846] → Tgt Spa: ['1.000'] [Step 267 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16940, 16933, 16934] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 267 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [23759, 23760] → Tgt Spa: ['0.350', '0.350'] [Step 267 / Rank 4] Tasks: ['Code'] | Lens: [60894] → Tgt Spa: ['1.000'] [Step 267 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16940, 16933, 16934] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 267 / Rank 3] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [10035, 10045, 10041, 10065, 10058, 10061] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '0.350'] [Step 267 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16995, 16984, 16995] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 267 / Rank 0] Tasks: ['Single QA'] | Lens: [52879] → Tgt Spa: ['0.350'] [Step 267 / Rank 4] Tasks: ['Summarization'] | Lens: [33447] → Tgt Spa: ['1.000'] [Step 267 / Rank 2] Tasks: ['Single QA', 'Code', 'Single QA', 'Code', 'Single QA', 'Single QA'] | Lens: [10035, 10045, 10041, 10065, 10058, 10061] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '0.350'] [Step 267 / Rank 5] Tasks: ['Summarization'] | Lens: [33447] → Tgt Spa: ['1.000'] [Step 267 / Rank 1] Tasks: ['Single QA'] | Lens: [52879] 
→ Tgt Spa: ['0.350'] [Step 267 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16995, 16984, 16995] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:53:04,994 >> @ 267 | Loss: 1.6472 | LM: 1.5779 | Reg: 0.0693 | Spa(Avg): 0.561 [INFO|lh_trainer.py:797] 2026-02-17 06:53:04,994 >> Statistic -> Code | Spa: 0.706 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 06:53:04,994 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:53:04,994 >> Statistic -> MultiHop | Spa: 0.585 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:53:04,994 >> Statistic -> Single | Spa: 0.460 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:53:04,994 >> Statistic -> Summarization | Spa: 0.583 | Tgt: 1.000 | Z-Loss: 0.159 | [INFO|lh_trainer.py:810] 2026-02-17 06:53:04,997 >> [Micro-Log] {"loss": 1.6472033336758614, "lm_loss": 1.5779371062914531, "reg_loss": 0.06926625654644643, "model_sparsity(avg)": 0.5613305320342382, "Spa-Code sparsity": 0.7058823529411765, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09743684267296511, "Spa-In-Context Learning sparsity": 0.7180555582046508, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10747441351413727, "Spa-Single QA sparsity": 0.46022726459936664, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.07491264665308832, "Spa-Summarization sparsity": 0.5833333432674408, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1585592826207479, "Spa-MultiHop QA sparsity": 0.5848214349576405, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09823195609663214, "step": 267, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:53:24,352 
>> {'loss': 9.8832, 'grad_norm': 0.7550118565559387, 'learning_rate': 2.2964254247309006e-05, 'epoch': 0.2822538177988415, 'num_input_tokens_seen': 659070838, 'completed': '89.33% (268 / 300)', 'remaining time': '1:29:44', 'throughput': '6887.05', 'gpu_mem_free': '8401MB', 'step': 268} [Step 268 / Rank 4] Tasks: ['Code'] | Lens: [42355] → Tgt Spa: ['1.000'] [Step 268 / Rank 5] Tasks: ['Code'] | Lens: [42355] → Tgt Spa: ['1.000'] [Step 268 / Rank 0] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19098, 19087, 19088] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 268 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43883] → Tgt Spa: ['1.000'] [Step 268 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19098, 19087, 19088] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 268 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [17744, 17744, 17745] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 268 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43883] → Tgt Spa: ['1.000'] [Step 268 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [17744, 17744, 17745] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 268 / Rank 5] Tasks: ['Single QA'] | Lens: [64590] → Tgt Spa: ['0.350'] [Step 268 / Rank 1] Tasks: ['Single QA'] | Lens: [51456] → Tgt Spa: ['0.350'] [Step 268 / Rank 7] Tasks: ['Single QA'] | Lens: [58664] → Tgt Spa: ['0.350'] [Step 268 / Rank 4] Tasks: ['Single QA'] | Lens: [64590] → Tgt Spa: ['0.350'] [Step 268 / Rank 3] Tasks: ['Single QA'] | Lens: [37113] → Tgt Spa: ['0.350'] [Step 268 / Rank 6] Tasks: ['Single QA'] | Lens: [58664] → Tgt Spa: ['0.350'] [Step 268 / Rank 0] Tasks: ['Single QA'] | Lens: [51456] → Tgt Spa: ['0.350'] [Step 268 / Rank 2] Tasks: ['Single QA'] | Lens: [37113] → Tgt Spa: ['0.350'] [Step 268 / Rank 6] Tasks: ['Single QA'] | Lens: [59042] → Tgt Spa: ['0.350'] [Step 268 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58170] → Tgt Spa: ['1.000'] [Step 268 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 
'Single QA', 'Code'] | Lens: [9235, 9236, 9237, 9237, 9237, 9237, 9247] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 268 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58170] → Tgt Spa: ['1.000'] [Step 268 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [22722, 22730] → Tgt Spa: ['1.000', '1.000'] [Step 268 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [22722, 22730] → Tgt Spa: ['1.000', '1.000'] [Step 268 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code'] | Lens: [9235, 9236, 9237, 9237, 9237, 9237, 9247] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000'] [Step 268 / Rank 7] Tasks: ['Single QA'] | Lens: [59042] → Tgt Spa: ['0.350'] [Step 268 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29954, 29954] → Tgt Spa: ['0.350', '0.350'] [Step 268 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29954, 29954] → Tgt Spa: ['0.350', '0.350'] [Step 268 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [44963] → Tgt Spa: ['1.000'] [Step 268 / Rank 0] Tasks: ['Summarization', 'Code', 'Single QA'] | Lens: [18118, 18106, 18100] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 268 / Rank 5] Tasks: ['Summarization'] | Lens: [36177] → Tgt Spa: ['1.000'] [Step 268 / Rank 1] Tasks: ['Summarization', 'Code', 'Single QA'] | Lens: [18118, 18106, 18100] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 268 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [44963] → Tgt Spa: ['1.000'] [Step 268 / Rank 4] Tasks: ['Summarization'] | Lens: [36177] → Tgt Spa: ['1.000'] [Step 268 / Rank 3] Tasks: ['Code'] | Lens: [42547] → Tgt Spa: ['1.000'] [Step 268 / Rank 2] Tasks: ['Code'] | Lens: [42547] → Tgt Spa: ['1.000'] [Step 268 / Rank 6] Tasks: ['Single QA'] | Lens: [61776] → Tgt Spa: ['0.350'] [Step 268 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32593, 32593] → Tgt Spa: ['0.350', '0.350'] [Step 268 / Rank 7] Tasks: ['Single QA'] | Lens: [61776] → Tgt Spa: 
['0.350'] [Step 268 / Rank 5] Tasks: ['Single QA'] | Lens: [65025] → Tgt Spa: ['0.350'] [Step 268 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32593, 32593] → Tgt Spa: ['0.350', '0.350'] [Step 268 / Rank 4] Tasks: ['Single QA'] | Lens: [65025] → Tgt Spa: ['0.350'] [Step 268 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12029, 12029, 12030, 12032, 12033] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 268 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [21437, 21438, 21452] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 268 / Rank 3] Tasks: ['Single QA'] | Lens: [45607] → Tgt Spa: ['0.350'] [Step 268 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [12029, 12029, 12030, 12032, 12033] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350'] [Step 268 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [31380, 31372] → Tgt Spa: ['1.000', '1.000'] [Step 268 / Rank 2] Tasks: ['Single QA'] | Lens: [45607] → Tgt Spa: ['0.350'] [Step 268 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [31380, 31372] → Tgt Spa: ['1.000', '1.000'] [Step 268 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [21437, 21438, 21452] → Tgt Spa: ['1.000', '1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:55:49,673 >> @ 268 | Loss: 1.9235 | LM: 1.8590 | Reg: 0.0645 | Spa(Avg): 0.541 [INFO|lh_trainer.py:797] 2026-02-17 06:55:49,673 >> Statistic -> Code | Spa: 0.706 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 06:55:49,674 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:55:49,674 >> Statistic -> MultiHop | Spa: 0.585 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:55:49,674 >> Statistic -> Single | Spa: 0.381 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:55:49,674 >> Statistic -> Summarization | Spa: 0.639 | Tgt: 1.000 | 
Z-Loss: 0.129 | [INFO|lh_trainer.py:810] 2026-02-17 06:55:49,676 >> [Micro-Log] {"loss": 1.923450647542874, "lm_loss": 1.858967946221431, "reg_loss": 0.06448272125756678, "model_sparsity(avg)": 0.5413993609448274, "Spa-Summarization sparsity": 0.638888880610466, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12912703305482864, "Spa-Code sparsity": 0.7061965786493741, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09735745191574097, "Spa-Single QA sparsity": 0.38078703234593075, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.020946368526589747, "Spa-In-Context Learning sparsity": 0.7194444417953492, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10688925087451935, "Spa-MultiHop QA sparsity": 0.5848214349576405, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09823195609663214, "step": 268, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:56:06,522 >> {'loss': 11.5407, 'grad_norm': 0.592755138874054, 'learning_rate': 2.1613683427986202e-05, 'epoch': 0.2833070036861506, 'num_input_tokens_seen': 661668122, 'completed': '89.67% (269 / 300)', 'remaining time': '1:26:55', 'throughput': '8007.88', 'gpu_mem_free': '6933MB', 'step': 269} [Step 269 / Rank 7] Tasks: ['Single QA'] | Lens: [49571] → Tgt Spa: ['0.350'] [Step 269 / Rank 0] Tasks: ['Summarization', 'Code'] | Lens: [23988, 23981] → Tgt Spa: ['1.000', '1.000'] [Step 269 / Rank 5] Tasks: ['Single QA'] | Lens: [50326] → Tgt Spa: ['0.350'] [Step 269 / Rank 6] Tasks: ['Single QA'] | Lens: [49571] → Tgt Spa: ['0.350'] [Step 269 / Rank 1] Tasks: ['Summarization', 'Code'] | Lens: [23988, 23981] → Tgt Spa: ['1.000', '1.000'] [Step 269 / Rank 4] Tasks: ['Single QA'] | Lens: [50326] → Tgt Spa: ['0.350'] [Step 269 / Rank 3] Tasks: ['Single QA'] | 
Lens: [51535] → Tgt Spa: ['0.350'] [Step 269 / Rank 2] Tasks: ['Single QA'] | Lens: [51535] → Tgt Spa: ['0.350'] [Step 269 / Rank 2] Tasks: ['Code'] | Lens: [45838] → Tgt Spa: ['1.000'] [Step 269 / Rank 7] Tasks: ['Code'] | Lens: [61176] → Tgt Spa: ['1.000'] [Step 269 / Rank 6] Tasks: ['Code'] | Lens: [61176] → Tgt Spa: ['1.000'] [Step 269 / Rank 5] Tasks: ['Summarization'] | Lens: [42830] → Tgt Spa: ['1.000'] [Step 269 / Rank 0] Tasks: ['Single QA'] | Lens: [50719] → Tgt Spa: ['0.350'] [Step 269 / Rank 1] Tasks: ['Single QA'] | Lens: [50719] → Tgt Spa: ['0.350'] [Step 269 / Rank 3] Tasks: ['Code'] | Lens: [45838] → Tgt Spa: ['1.000'] [Step 269 / Rank 4] Tasks: ['Summarization'] | Lens: [42830] → Tgt Spa: ['1.000'] [Step 269 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [42765] → Tgt Spa: ['1.000'] [Step 269 / Rank 3] Tasks: ['Single QA'] | Lens: [42677] → Tgt Spa: ['0.350'] [Step 269 / Rank 2] Tasks: ['Single QA'] | Lens: [42677] → Tgt Spa: ['0.350'] [Step 269 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [25170, 25162] → Tgt Spa: ['1.000', '1.000'] [Step 269 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [25170, 25162] → Tgt Spa: ['1.000', '1.000'] [Step 269 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [42765] → Tgt Spa: ['1.000'] [Step 269 / Rank 6] Tasks: ['Single QA'] | Lens: [53530] → Tgt Spa: ['0.350'] [Step 269 / Rank 7] Tasks: ['Single QA'] | Lens: [53530] → Tgt Spa: ['0.350'] [Step 269 / Rank 1] Tasks: ['Single QA'] | Lens: [52037] → Tgt Spa: ['0.350'] [Step 269 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25344, 25346] → Tgt Spa: ['1.000', '0.350'] [Step 269 / Rank 7] Tasks: ['Single QA'] | Lens: [55317] → Tgt Spa: ['0.350'] [Step 269 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32313, 32316] → Tgt Spa: ['0.350', '0.350'] [Step 269 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25344, 25346] → Tgt Spa: ['1.000', '0.350'] [Step 269 / Rank 6] Tasks: ['Single QA'] | Lens: [55317] → Tgt 
Spa: ['0.350'] [Step 269 / Rank 0] Tasks: ['Single QA'] | Lens: [52037] → Tgt Spa: ['0.350'] [Step 269 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32313, 32316] → Tgt Spa: ['0.350', '0.350'] [Step 269 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16728, 16728, 16717] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 269 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16728, 16728, 16717] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 269 / Rank 2] Tasks: ['Code'] | Lens: [37442] → Tgt Spa: ['1.000'] [Step 269 / Rank 3] Tasks: ['Code'] | Lens: [37442] → Tgt Spa: ['1.000'] [Step 269 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [40872] → Tgt Spa: ['1.000'] [Step 269 / Rank 5] Tasks: ['Single QA'] | Lens: [47559] → Tgt Spa: ['0.350'] [Step 269 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [40872] → Tgt Spa: ['1.000'] [Step 269 / Rank 4] Tasks: ['Single QA'] | Lens: [47559] → Tgt Spa: ['0.350'] [Step 269 / Rank 5] Tasks: ['Single QA'] | Lens: [58713] → Tgt Spa: ['0.350'] [Step 269 / Rank 2] Tasks: ['Single QA'] | Lens: [44944] → Tgt Spa: ['0.350'] [Step 269 / Rank 3] Tasks: ['Single QA'] | Lens: [44944] → Tgt Spa: ['0.350'] [Step 269 / Rank 6] Tasks: ['Single QA'] | Lens: [41424] → Tgt Spa: ['0.350'] [Step 269 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [30354, 30355] → Tgt Spa: ['1.000', '1.000'] [Step 269 / Rank 7] Tasks: ['Single QA'] | Lens: [41424] → Tgt Spa: ['0.350'] [Step 269 / Rank 4] Tasks: ['Single QA'] | Lens: [58713] → Tgt Spa: ['0.350'] [Step 269 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [30354, 30355] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 06:58:23,751 >> @ 269 | Loss: 1.8496 | LM: 1.7982 | Reg: 0.0514 | Spa(Avg): 0.510 [INFO|lh_trainer.py:797] 2026-02-17 06:58:23,752 >> Statistic -> Code | Spa: 0.712 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 06:58:23,752 >> Statistic -> In-Context | Spa: 0.717 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 06:58:23,752 >> Statistic -> MultiHop | Spa: 0.585 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:58:23,752 >> Statistic -> Single | Spa: 0.369 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 06:58:23,752 >> Statistic -> Summarization | Spa: 0.646 | Tgt: 1.000 | Z-Loss: 0.120 | [INFO|lh_trainer.py:810] 2026-02-17 06:58:23,754 >> [Micro-Log] {"loss": 1.8496448335548241, "lm_loss": 1.7982222409918904, "reg_loss": 0.051422586295908936, "model_sparsity(avg)": 0.5103202164173126, "Spa-Summarization sparsity": 0.6458333283662796, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1197004932910204, "Spa-Code sparsity": 0.7123015778405326, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09479664478983198, "Spa-Single QA sparsity": 0.3685185154279073, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.012240031516800325, "Spa-In-Context Learning sparsity": 0.7166666746139526, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1080595761537552, "Spa-MultiHop QA sparsity": 0.5848214349576405, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09823195609663214, "step": 269, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 06:58:46,651 >> {'loss': 11.0979, 'grad_norm': 0.5145466923713684, 'learning_rate': 2.0302245432555708e-05, 'epoch': 0.2843601895734597, 'num_input_tokens_seen': 664055676, 'completed': '90.00% (270 / 300)', 'remaining time': '1:24:06', 'throughput': '7455.11', 'gpu_mem_free': '7595MB', 'step': 270} [Step 270 / Rank 6] Tasks: ['Code'] | Lens: [46444] → Tgt Spa: ['1.000'] [Step 270 / Rank 4] Tasks: ['MultiHop QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 
'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Code', 'Single QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [3019, 3019, 3024, 3017, 3019, 3036, 3019, 3020, 3035, 3037, 3019, 3021, 3026, 3022, 3039, 3039, 3023, 3024, 3024, 3025, 3026] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 270 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [38735] → Tgt Spa: ['1.000'] [Step 270 / Rank 5] Tasks: ['MultiHop QA', 'MultiHop QA', 'Code', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Code', 'Single QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [3019, 3019, 3024, 3017, 3019, 3036, 3019, 3020, 3035, 3037, 3019, 3021, 3026, 3022, 3039, 3039, 3023, 3024, 3024, 3025, 3026] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350'] [Step 270 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [38735] → Tgt Spa: ['1.000'] [Step 270 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55306] → Tgt Spa: ['1.000'] [Step 270 / Rank 7] Tasks: ['Code'] | Lens: [46444] → Tgt Spa: ['1.000'] [Step 270 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55306] → Tgt Spa: ['1.000'] [Step 270 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [31008, 31008] → Tgt Spa: ['0.350', '1.000'] [Step 270 / Rank 0] Tasks: ['Single QA'] | Lens: [65008] → Tgt Spa: ['0.350'] [Step 270 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27314, 27313] → Tgt Spa: ['1.000', '1.000'] [Step 270 / Rank 3] Tasks: 
['In-Context Learning', 'In-Context Learning'] | Lens: [27314, 27313] → Tgt Spa: ['1.000', '1.000'] [Step 270 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [31008, 31008] → Tgt Spa: ['0.350', '1.000'] [Step 270 / Rank 6] Tasks: ['MultiHop QA'] | Lens: [65170] → Tgt Spa: ['0.350'] [Step 270 / Rank 7] Tasks: ['MultiHop QA'] | Lens: [65170] → Tgt Spa: ['0.350'] [Step 270 / Rank 1] Tasks: ['Single QA'] | Lens: [65008] → Tgt Spa: ['0.350'] [Step 270 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [63360] → Tgt Spa: ['0.350'] [Step 270 / Rank 0] Tasks: ['MultiHop QA'] | Lens: [63360] → Tgt Spa: ['0.350'] [Step 270 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [17848, 17849, 17851] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 270 / Rank 7] Tasks: ['Single QA', 'Summarization'] | Lens: [32450, 32469] → Tgt Spa: ['0.350', '1.000'] [Step 270 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [54139] → Tgt Spa: ['1.000'] [Step 270 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [17848, 17849, 17851] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 270 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [54139] → Tgt Spa: ['1.000'] [Step 270 / Rank 6] Tasks: ['Single QA', 'Summarization'] | Lens: [32450, 32469] → Tgt Spa: ['0.350', '1.000'] [Step 270 / Rank 4] Tasks: ['Single QA'] | Lens: [40425] → Tgt Spa: ['0.350'] [Step 270 / Rank 2] Tasks: ['Single QA'] | Lens: [62485] → Tgt Spa: ['0.350'] [Step 270 / Rank 5] Tasks: ['Single QA'] | Lens: [40425] → Tgt Spa: ['0.350'] [Step 270 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14581, 14581, 14581, 14582] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 270 / Rank 0] Tasks: ['Single QA'] | Lens: [56612] → Tgt Spa: ['0.350'] [Step 270 / Rank 3] Tasks: ['Single QA'] | Lens: [62485] → Tgt Spa: ['0.350'] [Step 270 / Rank 1] Tasks: ['Single QA'] | Lens: [56612] → Tgt Spa: ['0.350'] [Step 270 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [14581, 14581, 14581, 
14582] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 270 / Rank 4] Tasks: ['Single QA'] | Lens: [57673] → Tgt Spa: ['0.350'] [Step 270 / Rank 0] Tasks: ['Single QA'] | Lens: [45097] → Tgt Spa: ['0.350'] [Step 270 / Rank 2] Tasks: ['Single QA'] | Lens: [34045] → Tgt Spa: ['0.350'] [Step 270 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [58124] → Tgt Spa: ['1.000'] [Step 270 / Rank 1] Tasks: ['Single QA'] | Lens: [45097] → Tgt Spa: ['0.350'] [Step 270 / Rank 3] Tasks: ['Single QA'] | Lens: [34045] → Tgt Spa: ['0.350'] [Step 270 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [58124] → Tgt Spa: ['1.000'] [Step 270 / Rank 5] Tasks: ['Single QA'] | Lens: [57673] → Tgt Spa: ['0.350'] [Step 270 / Rank 1] Tasks: ['Single QA'] | Lens: [64181] → Tgt Spa: ['0.350'] [Step 270 / Rank 3] Tasks: ['Single QA'] | Lens: [48175] → Tgt Spa: ['0.350'] [Step 270 / Rank 7] Tasks: ['Single QA'] | Lens: [64678] → Tgt Spa: ['0.350'] [Step 270 / Rank 4] Tasks: ['Code', 'In-Context Learning'] | Lens: [25812, 25806] → Tgt Spa: ['1.000', '1.000'] [Step 270 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [25812, 25806] → Tgt Spa: ['1.000', '1.000'] [Step 270 / Rank 0] Tasks: ['Single QA'] | Lens: [64181] → Tgt Spa: ['0.350'] [Step 270 / Rank 6] Tasks: ['Single QA'] | Lens: [64678] → Tgt Spa: ['0.350'] [Step 270 / Rank 2] Tasks: ['Single QA'] | Lens: [48175] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:01:34,442 >> @ 270 | Loss: 2.1504 | LM: 2.0912 | Reg: 0.0592 | Spa(Avg): 0.518 [INFO|lh_trainer.py:797] 2026-02-17 07:01:34,443 >> Statistic -> Code | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 07:01:34,443 >> Statistic -> In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:01:34,443 >> Statistic -> MultiHop | Spa: 0.591 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:01:34,443 >> Statistic -> Single | Spa: 0.407 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 07:01:34,443 >> Statistic -> Summarization | Spa: 0.667 | Tgt: 1.000 | Z-Loss: 0.109 | [INFO|lh_trainer.py:810] 2026-02-17 07:01:34,445 >> [Micro-Log] {"loss": 2.1503823498884835, "lm_loss": 2.0911657561858497, "reg_loss": 0.059216607546356194, "model_sparsity(avg)": 0.5183600609501203, "Spa-In-Context Learning sparsity": 0.7125, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10984740257263184, "Spa-Single QA sparsity": 0.4071637329302336, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04154777683097085, "Spa-MultiHop QA sparsity": 0.590909096327695, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1016170284287496, "Spa-Code sparsity": 0.7083333390099662, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0964386665395328, "Spa-Summarization sparsity": 0.6666666567325592, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10911730180184047, "step": 270, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:02:00,889 >> {'loss': 12.9023, 'grad_norm': 0.5453233122825623, 'learning_rate': 1.9030164969166632e-05, 'epoch': 0.28541337546076884, 'num_input_tokens_seen': 666712162, 'completed': '90.33% (271 / 300)', 'remaining time': '1:21:21', 'throughput': '6838.23', 'gpu_mem_free': '4431MB', 'step': 271} [Step 271 / Rank 3] Tasks: ['Code', 'Code', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5381, 5382, 5374, 5382, 5382, 5378, 5377, 5378, 5378, 5379, 5380, 5380] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 271 / Rank 6] Tasks: 
['In-Context Learning', 'Code'] | Lens: [26281, 26291] → Tgt Spa: ['1.000', '1.000'] [Step 271 / Rank 5] Tasks: ['Single QA'] | Lens: [49226] → Tgt Spa: ['0.350'] [Step 271 / Rank 2] Tasks: ['Code', 'Code', 'In-Context Learning', 'Code', 'Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [5381, 5382, 5374, 5382, 5382, 5378, 5377, 5378, 5378, 5379, 5380, 5380] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 271 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [26281, 26291] → Tgt Spa: ['1.000', '1.000'] [Step 271 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [30602, 30605] → Tgt Spa: ['1.000', '1.000'] [Step 271 / Rank 4] Tasks: ['Single QA'] | Lens: [49226] → Tgt Spa: ['0.350'] [Step 271 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [30602, 30605] → Tgt Spa: ['1.000', '1.000'] [Step 271 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [22445, 22441] → Tgt Spa: ['1.000', '1.000'] [Step 271 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [54001] → Tgt Spa: ['1.000'] [Step 271 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [22445, 22441] → Tgt Spa: ['1.000', '1.000'] [Step 271 / Rank 7] Tasks: ['Code'] | Lens: [61113] → Tgt Spa: ['1.000'] [Step 271 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [38417] → Tgt Spa: ['1.000'] [Step 271 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [38417] → Tgt Spa: ['1.000'] [Step 271 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [54001] → Tgt Spa: ['1.000'] [Step 271 / Rank 6] Tasks: ['Code'] | Lens: [61113] → Tgt Spa: ['1.000'] [Step 271 / Rank 3] Tasks: ['Code'] | Lens: [44182] → Tgt Spa: ['1.000'] [Step 271 / Rank 1] Tasks: ['Code'] | Lens: [57047] → Tgt Spa: ['1.000'] [Step 271 / Rank 5] Tasks: ['Single QA'] | Lens: [40721] → Tgt Spa: ['0.350'] [Step 271 / Rank 0] Tasks: 
['Code'] | Lens: [57047] → Tgt Spa: ['1.000'] [Step 271 / Rank 2] Tasks: ['Code'] | Lens: [44182] → Tgt Spa: ['1.000'] [Step 271 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29686, 29687] → Tgt Spa: ['0.350', '0.350'] [Step 271 / Rank 4] Tasks: ['Single QA'] | Lens: [40721] → Tgt Spa: ['0.350'] [Step 271 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29686, 29687] → Tgt Spa: ['0.350', '0.350'] [Step 271 / Rank 3] Tasks: ['Code'] | Lens: [41209] → Tgt Spa: ['1.000'] [Step 271 / Rank 5] Tasks: ['Code'] | Lens: [37771] → Tgt Spa: ['1.000'] [Step 271 / Rank 1] Tasks: ['Single QA'] | Lens: [45629] → Tgt Spa: ['0.350'] [Step 271 / Rank 2] Tasks: ['Code'] | Lens: [41209] → Tgt Spa: ['1.000'] [Step 271 / Rank 6] Tasks: ['Single QA'] | Lens: [46462] → Tgt Spa: ['0.350'] [Step 271 / Rank 4] Tasks: ['Code'] | Lens: [37771] → Tgt Spa: ['1.000'] [Step 271 / Rank 7] Tasks: ['Single QA'] | Lens: [46462] → Tgt Spa: ['0.350'] [Step 271 / Rank 0] Tasks: ['Single QA'] | Lens: [45629] → Tgt Spa: ['0.350'] [Step 271 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Code'] | Lens: [7008, 7009, 7009, 7009, 7009, 7009, 7010, 7018, 7017] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 271 / Rank 5] Tasks: ['Single QA'] | Lens: [64518] → Tgt Spa: ['0.350'] [Step 271 / Rank 1] Tasks: ['Code'] | Lens: [34268] → Tgt Spa: ['1.000'] [Step 271 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Code'] | Lens: [7008, 7009, 7009, 7009, 7009, 7009, 7010, 7018, 7017] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 271 / Rank 0] Tasks: ['Code'] | Lens: [34268] → Tgt Spa: ['1.000'] [Step 271 / Rank 4] Tasks: ['Single QA'] | Lens: [64518] → Tgt Spa: ['0.350'] [Step 271 / Rank 2] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: 
[17979, 17993, 17993] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 271 / Rank 3] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [17979, 17993, 17993] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 271 / Rank 5] Tasks: ['Code'] | Lens: [43970] → Tgt Spa: ['1.000'] [Step 271 / Rank 2] Tasks: ['Single QA'] | Lens: [64712] → Tgt Spa: ['0.350'] [Step 271 / Rank 1] Tasks: ['Single QA'] | Lens: [38197] → Tgt Spa: ['0.350'] [Step 271 / Rank 3] Tasks: ['Single QA'] | Lens: [64712] → Tgt Spa: ['0.350'] [Step 271 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27751, 27752] → Tgt Spa: ['1.000', '0.350'] [Step 271 / Rank 4] Tasks: ['Code'] | Lens: [43970] → Tgt Spa: ['1.000'] [Step 271 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [27751, 27752] → Tgt Spa: ['1.000', '0.350'] [Step 271 / Rank 0] Tasks: ['Single QA'] | Lens: [38197] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:04:25,415 >> @ 271 | Loss: 1.7164 | LM: 1.6471 | Reg: 0.0693 | Spa(Avg): 0.580 [INFO|lh_trainer.py:797] 2026-02-17 07:04:25,415 >> Statistic -> Code | Spa: 0.710 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 07:04:25,415 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:04:25,415 >> Statistic -> MultiHop | Spa: 0.591 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:04:25,415 >> Statistic -> Single | Spa: 0.365 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:04:25,415 >> Statistic -> Summarization | Spa: 0.688 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:810] 2026-02-17 07:04:25,417 >> [Micro-Log] {"loss": 1.7163996938616037, "lm_loss": 1.6470611741145451, "reg_loss": 0.06933849275810644, "model_sparsity(avg)": 0.580150463928779, "Spa-Code sparsity": 0.7099673116908354, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09582941014977063, "Spa-In-Context Learning sparsity": 0.7185185035069783, "Spa-In-Context Learning 
target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1072865496079127, "Spa-Single QA sparsity": 0.3654513843357563, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.010277446126565337, "Spa-Summarization sparsity": 0.6875, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09899374842643738, "Spa-MultiHop QA sparsity": 0.590909096327695, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1016170284287496, "step": 271, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:04:51,975 >> {'loss': 10.2984, 'grad_norm': 0.8424168825149536, 'learning_rate': 1.7797660002257764e-05, 'epoch': 0.2864665613480779, 'num_input_tokens_seen': 669145358, 'completed': '90.67% (272 / 300)', 'remaining time': '1:18:33', 'throughput': '7111.02', 'gpu_mem_free': '13431MB', 'step': 272} [Step 272 / Rank 3] Tasks: ['Single QA'] | Lens: [56250] → Tgt Spa: ['0.350'] [Step 272 / Rank 5] Tasks: ['Code'] | Lens: [58908] → Tgt Spa: ['1.000'] [Step 272 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [37431] → Tgt Spa: ['1.000'] [Step 272 / Rank 6] Tasks: ['Single QA'] | Lens: [57251] → Tgt Spa: ['0.350'] [Step 272 / Rank 2] Tasks: ['Single QA'] | Lens: [56250] → Tgt Spa: ['0.350'] [Step 272 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [37431] → Tgt Spa: ['1.000'] [Step 272 / Rank 4] Tasks: ['Code'] | Lens: [58908] → Tgt Spa: ['1.000'] [Step 272 / Rank 7] Tasks: ['Single QA'] | Lens: [57251] → Tgt Spa: ['0.350'] [Step 272 / Rank 5] Tasks: ['Code', 'Summarization'] | Lens: [24991, 25003] → Tgt Spa: ['1.000', '1.000'] [Step 272 / Rank 0] Tasks: ['Single QA'] | Lens: [34900] → Tgt Spa: ['0.350'] [Step 272 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'In-Context 
Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [3700, 3701, 3709, 3702, 3702, 3705, 3723, 3706, 3705, 3706, 3725, 3707, 3707, 3726, 3709, 3708, 3708] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 272 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning', 'Code', 'Single QA', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [3700, 3701, 3709, 3702, 3702, 3705, 3723, 3706, 3705, 3706, 3725, 3707, 3707, 3726, 3709, 3708, 3708] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 272 / Rank 6] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [5841, 5823, 5823, 5824, 5830, 5827, 5835, 5828, 5827, 5838, 5832] → Tgt Spa: ['1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 272 / Rank 1] Tasks: ['Single QA'] | Lens: [34900] → Tgt Spa: ['0.350'] [Step 272 / Rank 4] Tasks: ['Code', 'Summarization'] | Lens: [24991, 25003] → Tgt Spa: ['1.000', '1.000'] [Step 272 / Rank 7] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning', 'Code', 'Single QA'] | Lens: [5841, 5823, 5823, 5824, 5830, 5827, 5835, 5828, 5827, 5838, 5832] → Tgt Spa: ['1.000', '1.000', '1.000', 
'0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 272 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22575, 22594] → Tgt Spa: ['1.000', '1.000'] [Step 272 / Rank 2] Tasks: ['Code', 'Single QA'] | Lens: [32272, 32267] → Tgt Spa: ['1.000', '0.350'] [Step 272 / Rank 1] Tasks: ['Code'] | Lens: [35829] → Tgt Spa: ['1.000'] [Step 272 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22575, 22594] → Tgt Spa: ['1.000', '1.000'] [Step 272 / Rank 3] Tasks: ['Code', 'Single QA'] | Lens: [32272, 32267] → Tgt Spa: ['1.000', '0.350'] [Step 272 / Rank 4] Tasks: ['Single QA'] | Lens: [37370] → Tgt Spa: ['0.350'] [Step 272 / Rank 5] Tasks: ['Single QA'] | Lens: [37370] → Tgt Spa: ['0.350'] [Step 272 / Rank 0] Tasks: ['Code'] | Lens: [35829] → Tgt Spa: ['1.000'] [Step 272 / Rank 5] Tasks: ['Summarization', 'Summarization'] | Lens: [22647, 22647] → Tgt Spa: ['1.000', '1.000'] [Step 272 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56194] → Tgt Spa: ['1.000'] [Step 272 / Rank 6] Tasks: ['Single QA'] | Lens: [35623] → Tgt Spa: ['0.350'] [Step 272 / Rank 3] Tasks: ['Single QA'] | Lens: [33196] → Tgt Spa: ['0.350'] [Step 272 / Rank 2] Tasks: ['Single QA'] | Lens: [33196] → Tgt Spa: ['0.350'] [Step 272 / Rank 4] Tasks: ['Summarization', 'Summarization'] | Lens: [22647, 22647] → Tgt Spa: ['1.000', '1.000'] [Step 272 / Rank 7] Tasks: ['Single QA'] | Lens: [35623] → Tgt Spa: ['0.350'] [Step 272 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56194] → Tgt Spa: ['1.000'] [Step 272 / Rank 5] Tasks: ['Single QA'] | Lens: [57587] → Tgt Spa: ['0.350'] [Step 272 / Rank 4] Tasks: ['Single QA'] | Lens: [57587] → Tgt Spa: ['0.350'] [Step 272 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [65428] → Tgt Spa: ['1.000'] [Step 272 / Rank 3] Tasks: ['Code'] | Lens: [33628] → Tgt Spa: ['1.000'] [Step 272 / Rank 1] Tasks: ['Single QA'] | Lens: [51238] → Tgt Spa: ['0.350'] [Step 272 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [65428] 
→ Tgt Spa: ['1.000'] [Step 272 / Rank 0] Tasks: ['Single QA'] | Lens: [51238] → Tgt Spa: ['0.350'] [Step 272 / Rank 2] Tasks: ['Code'] | Lens: [33628] → Tgt Spa: ['1.000'] [Step 272 / Rank 6] Tasks: ['Code'] | Lens: [50026] → Tgt Spa: ['1.000'] [Step 272 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26507, 26507] → Tgt Spa: ['1.000', '1.000'] [Step 272 / Rank 2] Tasks: ['Single QA'] | Lens: [49459] → Tgt Spa: ['0.350'] [Step 272 / Rank 7] Tasks: ['Code'] | Lens: [50026] → Tgt Spa: ['1.000'] [Step 272 / Rank 1] Tasks: ['Code'] | Lens: [40283] → Tgt Spa: ['1.000'] [Step 272 / Rank 0] Tasks: ['Code'] | Lens: [40283] → Tgt Spa: ['1.000'] [Step 272 / Rank 3] Tasks: ['Single QA'] | Lens: [49459] → Tgt Spa: ['0.350'] [Step 272 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26507, 26507] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:07:10,114 >> @ 272 | Loss: 2.0549 | LM: 1.9789 | Reg: 0.0759 | Spa(Avg): 0.558 [INFO|lh_trainer.py:797] 2026-02-17 07:07:10,114 >> Statistic -> Code | Spa: 0.702 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 07:07:10,114 >> Statistic -> In-Context | Spa: 0.710 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:07:10,114 >> Statistic -> MultiHop | Spa: 0.701 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:07:10,114 >> Statistic -> Single | Spa: 0.418 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:07:10,114 >> Statistic -> Summarization | Spa: 0.590 | Tgt: 1.000 | Z-Loss: 0.154 | [INFO|lh_trainer.py:810] 2026-02-17 07:07:10,116 >> [Micro-Log] {"loss": 2.0548878150681653, "lm_loss": 1.9789408817887306, "reg_loss": 0.07594693900318816, "model_sparsity(avg)": 0.5576351756850878, "Spa-In-Context Learning sparsity": 0.7103174527486166, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11089112290314265, "Spa-Single QA sparsity": 0.4177350401878357, 
"Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04871564412202973, "Spa-Code sparsity": 0.7020202116532759, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09892503849484703, "Spa-MultiHop QA sparsity": 0.7013888955116272, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.16106745600700378, "Spa-Summarization sparsity": 0.5902777686715126, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.15406629908829927, "step": 272, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:07:27,804 >> {'loss': 12.3293, 'grad_norm': 0.6945912837982178, 'learning_rate': 1.6604941715210256e-05, 'epoch': 0.28751974723538704, 'num_input_tokens_seen': 671496934, 'completed': '91.00% (273 / 300)', 'remaining time': '1:15:43', 'throughput': '7545.36', 'gpu_mem_free': '12375MB', 'step': 273} [Step 273 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [23546, 23546] → Tgt Spa: ['0.350', '0.350'] [Step 273 / Rank 6] Tasks: ['Single QA'] | Lens: [49583] → Tgt Spa: ['0.350'] [Step 273 / Rank 0] Tasks: ['Single QA'] | Lens: [34952] → Tgt Spa: ['0.350'] [Step 273 / Rank 4] Tasks: ['Summarization'] | Lens: [53196] → Tgt Spa: ['1.000'] [Step 273 / Rank 1] Tasks: ['Single QA'] | Lens: [34952] → Tgt Spa: ['0.350'] [Step 273 / Rank 7] Tasks: ['Single QA'] | Lens: [49583] → Tgt Spa: ['0.350'] [Step 273 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [23546, 23546] → Tgt Spa: ['0.350', '0.350'] [Step 273 / Rank 5] Tasks: ['Summarization'] | Lens: [53196] → Tgt Spa: ['1.000'] [Step 273 / Rank 2] Tasks: ['Code'] | Lens: [35005] → Tgt Spa: ['1.000'] [Step 273 / Rank 5] Tasks: ['Code'] | Lens: [56245] → Tgt Spa: ['1.000'] [Step 273 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [35244] → Tgt Spa: ['1.000'] [Step 273 / Rank 3] Tasks: ['Code'] | Lens: [35005] → Tgt Spa: 
['1.000'] [Step 273 / Rank 6] Tasks: ['Single QA'] | Lens: [52323] → Tgt Spa: ['0.350'] [Step 273 / Rank 7] Tasks: ['Single QA'] | Lens: [52323] → Tgt Spa: ['0.350'] [Step 273 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [35244] → Tgt Spa: ['1.000'] [Step 273 / Rank 4] Tasks: ['Code'] | Lens: [56245] → Tgt Spa: ['1.000'] [Step 273 / Rank 1] Tasks: ['In-Context Learning', 'Single QA', 'Single QA'] | Lens: [21459, 21460, 21460] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 273 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15910, 15910, 15910, 15910] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 273 / Rank 6] Tasks: ['Code'] | Lens: [52894] → Tgt Spa: ['1.000'] [Step 273 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15910, 15910, 15910, 15910] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 273 / Rank 7] Tasks: ['Code'] | Lens: [52894] → Tgt Spa: ['1.000'] [Step 273 / Rank 5] Tasks: ['Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Single QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2812, 2813, 2812, 2812, 2811, 2829, 2811, 2812, 2820, 2814, 2813, 2820, 2813, 2815, 2814, 2815, 2815, 2832, 2832, 2833, 2816, 2817, 2817] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 273 / Rank 0] Tasks: ['In-Context Learning', 'Single QA', 'Single QA'] | Lens: [21459, 21460, 21460] → Tgt Spa: ['1.000', '0.350', '0.350'] [Step 273 / Rank 4] Tasks: ['Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Single QA', 'Code', 'MultiHop 
QA', 'MultiHop QA', 'Code', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA'] | Lens: [2812, 2813, 2812, 2812, 2811, 2829, 2811, 2812, 2820, 2814, 2813, 2820, 2813, 2815, 2814, 2815, 2815, 2832, 2832, 2833, 2816, 2817, 2817] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350'] [Step 273 / Rank 7] Tasks: ['Single QA'] | Lens: [60536] → Tgt Spa: ['0.350'] [Step 273 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32547, 32545] → Tgt Spa: ['0.350', '0.350'] [Step 273 / Rank 1] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [5076, 5075, 5076, 5077, 5077, 5077, 5097, 5079, 5079, 5089, 5085, 5085] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 273 / Rank 4] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16716, 16727, 16727] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 273 / Rank 5] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16716, 16727, 16727] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 273 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32547, 32545] → Tgt Spa: ['0.350', '0.350'] [Step 273 / Rank 6] Tasks: ['Single QA'] | Lens: [60536] → Tgt Spa: ['0.350'] [Step 273 / Rank 0] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'In-Context Learning'] | Lens: [5076, 5075, 5076, 5077, 5077, 5077, 5097, 5079, 5079, 5089, 5085, 5085] → 
Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 273 / Rank 6] Tasks: ['Single QA'] | Lens: [54284] → Tgt Spa: ['0.350'] [Step 273 / Rank 4] Tasks: ['Summarization'] | Lens: [38134] → Tgt Spa: ['1.000'] [Step 273 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60921] → Tgt Spa: ['1.000'] [Step 273 / Rank 3] Tasks: ['Single QA'] | Lens: [39324] → Tgt Spa: ['0.350'] [Step 273 / Rank 5] Tasks: ['Summarization'] | Lens: [38134] → Tgt Spa: ['1.000'] [Step 273 / Rank 2] Tasks: ['Single QA'] | Lens: [39324] → Tgt Spa: ['0.350'] [Step 273 / Rank 7] Tasks: ['Single QA'] | Lens: [54284] → Tgt Spa: ['0.350'] [Step 273 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60921] → Tgt Spa: ['1.000'] [Step 273 / Rank 4] Tasks: ['Code'] | Lens: [37711] → Tgt Spa: ['1.000'] [Step 273 / Rank 3] Tasks: ['Single QA'] | Lens: [55407] → Tgt Spa: ['0.350'] [Step 273 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [23283, 23291] → Tgt Spa: ['1.000', '1.000'] [Step 273 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [25742, 25735] → Tgt Spa: ['1.000', '1.000'] [Step 273 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [23283, 23291] → Tgt Spa: ['1.000', '1.000'] [Step 273 / Rank 5] Tasks: ['Code'] | Lens: [37711] → Tgt Spa: ['1.000'] [Step 273 / Rank 2] Tasks: ['Single QA'] | Lens: [55407] → Tgt Spa: ['0.350'] [Step 273 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [25742, 25735] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:09:53,280 >> @ 273 | Loss: 1.9374 | LM: 1.8729 | Reg: 0.0645 | Spa(Avg): 0.551 [INFO|lh_trainer.py:797] 2026-02-17 07:09:53,280 >> Statistic -> Code | Spa: 0.711 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 07:09:53,280 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:09:53,280 >> Statistic -> MultiHop | Spa: 0.686 | Tgt: 0.000 | Z-Loss: 0.000 |
[INFO|lh_trainer.py:797] 2026-02-17 07:09:53,280 >> Statistic -> Single | Spa: 0.436 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:09:53,280 >> Statistic -> Summarization | Spa: 0.679 | Tgt: 1.000 | Z-Loss: 0.103 | [INFO|lh_trainer.py:810] 2026-02-17 07:09:53,282 >> [Micro-Log] {"loss": 1.9373879643778007, "lm_loss": 1.8729136638964217, "reg_loss": 0.06447431655639473, "model_sparsity(avg)": 0.5509217331806818, "Spa-Single QA sparsity": 0.43555554628372195, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.05765978889539838, "Spa-In-Context Learning sparsity": 0.7179487118354211, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10752357427890484, "Spa-Summarization sparsity": 0.6790123515658908, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10323732263512081, "Spa-Code sparsity": 0.7111111164093018, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09528687819838524, "Spa-MultiHop QA sparsity": 0.6856060732494701, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1520778794180263, "step": 273, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:10:13,992 >> {'loss': 11.6243, 'grad_norm': 0.6020376086235046, 'learning_rate': 1.545221447416239e-05, 'epoch': 0.28857293312269616, 'num_input_tokens_seen': 673956840, 'completed': '91.33% (274 / 300)', 'remaining time': '1:12:55', 'throughput': '7400.98', 'gpu_mem_free': '11369MB', 'step': 274} [Step 274 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29330, 29331] → Tgt Spa: ['0.350', '1.000'] [Step 274 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [42602] → Tgt Spa: ['1.000'] [Step 274 / Rank 6] Tasks: ['Single QA'] | Lens: [65027] → Tgt Spa: ['0.350'] [Step 274 / Rank 0] Tasks: ['Summarization', 'Code', 
'Code'] | Lens: [18883, 18871, 18873] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 274 / Rank 7] Tasks: ['Single QA'] | Lens: [65027] → Tgt Spa: ['0.350'] [Step 274 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42602] → Tgt Spa: ['1.000'] [Step 274 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [29330, 29331] → Tgt Spa: ['0.350', '1.000'] [Step 274 / Rank 1] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18883, 18871, 18873] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 274 / Rank 3] Tasks: ['Single QA'] | Lens: [61157] → Tgt Spa: ['0.350'] [Step 274 / Rank 7] Tasks: ['Single QA'] | Lens: [64047] → Tgt Spa: ['0.350'] [Step 274 / Rank 0] Tasks: ['Single QA'] | Lens: [54180] → Tgt Spa: ['0.350'] [Step 274 / Rank 2] Tasks: ['Single QA'] | Lens: [61157] → Tgt Spa: ['0.350'] [Step 274 / Rank 5] Tasks: ['Single QA'] | Lens: [34620] → Tgt Spa: ['0.350'] [Step 274 / Rank 6] Tasks: ['Single QA'] | Lens: [64047] → Tgt Spa: ['0.350'] [Step 274 / Rank 1] Tasks: ['Single QA'] | Lens: [54180] → Tgt Spa: ['0.350'] [Step 274 / Rank 4] Tasks: ['Single QA'] | Lens: [34620] → Tgt Spa: ['0.350'] [Step 274 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [27156, 27156] → Tgt Spa: ['0.350', '0.350'] [Step 274 / Rank 7] Tasks: ['Single QA'] | Lens: [57510] → Tgt Spa: ['0.350'] [Step 274 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [27156, 27156] → Tgt Spa: ['0.350', '0.350'] [Step 274 / Rank 3] Tasks: ['Single QA'] | Lens: [55168] → Tgt Spa: ['0.350'] [Step 274 / Rank 2] Tasks: ['Single QA'] | Lens: [55168] → Tgt Spa: ['0.350'] [Step 274 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [57635] → Tgt Spa: ['1.000'] [Step 274 / Rank 6] Tasks: ['Single QA'] | Lens: [57510] → Tgt Spa: ['0.350'] [Step 274 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [57635] → Tgt Spa: ['1.000'] [Step 274 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [55798] → Tgt Spa: ['1.000'] [Step 274 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [55798] → Tgt Spa: ['1.000'] [Step 
274 / Rank 4] Tasks: ['Summarization', 'MultiHop QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'In-Context Learning'] | Lens: [3767, 3750, 3757, 3756, 3751, 3751, 3751, 3752, 3752, 3753, 3772, 3753, 3753, 3753, 3759, 3753, 3753] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 274 / Rank 7] Tasks: ['Single QA'] | Lens: [64608] → Tgt Spa: ['0.350'] [Step 274 / Rank 2] Tasks: ['Single QA', 'Code', 'Code', 'Code'] | Lens: [13141, 13150, 13150, 13154] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000'] [Step 274 / Rank 3] Tasks: ['Single QA', 'Code', 'Code', 'Code'] | Lens: [13141, 13150, 13150, 13154] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000'] [Step 274 / Rank 5] Tasks: ['Summarization', 'MultiHop QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'Single QA', 'Single QA', 'Code', 'Single QA', 'In-Context Learning'] | Lens: [3767, 3750, 3757, 3756, 3751, 3751, 3751, 3752, 3752, 3753, 3772, 3753, 3753, 3753, 3759, 3753, 3753] → Tgt Spa: ['1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000'] [Step 274 / Rank 6] Tasks: ['Single QA'] | Lens: [64608] → Tgt Spa: ['0.350'] [Step 274 / Rank 1] Tasks: ['Code'] | Lens: [62999] → Tgt Spa: ['1.000'] [Step 274 / Rank 2] Tasks: ['Code'] | Lens: [37967] → Tgt Spa: ['1.000'] [Step 274 / Rank 5] Tasks: ['Single QA'] | Lens: [35239] → Tgt Spa: ['0.350'] [Step 274 / Rank 0] Tasks: ['Code'] | Lens: [62999] → Tgt Spa: ['1.000'] [Step 274 / Rank 6] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: 
[15974, 15975, 15976, 15976] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 274 / Rank 4] Tasks: ['Single QA'] | Lens: [35239] → Tgt Spa: ['0.350'] [Step 274 / Rank 3] Tasks: ['Code'] | Lens: [37967] → Tgt Spa: ['1.000'] [Step 274 / Rank 7] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15974, 15975, 15976, 15976] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 274 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24918, 24920] → Tgt Spa: ['1.000', '1.000'] [Step 274 / Rank 3] Tasks: ['Single QA'] | Lens: [51387] → Tgt Spa: ['0.350'] [Step 274 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [24132, 24132] → Tgt Spa: ['0.350', '0.350'] [Step 274 / Rank 6] Tasks: ['Code'] | Lens: [34946] → Tgt Spa: ['1.000'] [Step 274 / Rank 2] Tasks: ['Single QA'] | Lens: [51387] → Tgt Spa: ['0.350'] [Step 274 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24918, 24920] → Tgt Spa: ['1.000', '1.000'] [Step 274 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [24132, 24132] → Tgt Spa: ['0.350', '0.350'] [Step 274 / Rank 7] Tasks: ['Code'] | Lens: [34946] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:13:05,311 >> @ 274 | Loss: 2.0644 | LM: 2.0137 | Reg: 0.0506 | Spa(Avg): 0.503 [INFO|lh_trainer.py:797] 2026-02-17 07:13:05,311 >> Statistic -> Code | Spa: 0.706 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 07:13:05,311 >> Statistic -> In-Context | Spa: 0.694 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:13:05,311 >> Statistic -> MultiHop | Spa: 0.618 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:13:05,311 >> Statistic -> Single | Spa: 0.392 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:13:05,311 >> Statistic -> Summarization | Spa: 0.620 | Tgt: 1.000 | Z-Loss: 0.133 | [INFO|lh_trainer.py:810] 2026-02-17 07:13:05,313 >> [Micro-Log] {"loss": 2.0643933241566024, "lm_loss": 2.0137447056670985, 
"reg_loss": 0.05064860503868355, "model_sparsity(avg)": 0.503404131780068, "Spa-Summarization sparsity": 0.6203703880310059, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13295213878154755, "Spa-Code sparsity": 0.7058080651543357, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09751589664004066, "Spa-Single QA sparsity": 0.39178240050872165, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.029336649876010295, "Spa-In-Context Learning sparsity": 0.6944444278875986, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11782181511322658, "Spa-MultiHop QA sparsity": 0.6180555820465088, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11417927592992783, "step": 274, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:13:23,574 >> {'loss': 12.3864, 'grad_norm': 0.5085281133651733, 'learning_rate': 1.4339675792992671e-05, 'epoch': 0.2896261190100053, 'num_input_tokens_seen': 676522688, 'completed': '91.67% (275 / 300)', 'remaining time': '1:10:08', 'throughput': '6767.13', 'gpu_mem_free': '10771MB', 'step': 275} [Step 275 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [25198, 25206] → Tgt Spa: ['1.000', '1.000'] [Step 275 / Rank 3] Tasks: ['Single QA'] | Lens: [44079] → Tgt Spa: ['0.350'] [Step 275 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [25198, 25206] → Tgt Spa: ['1.000', '1.000'] [Step 275 / Rank 7] Tasks: ['Single QA'] | Lens: [36499] → Tgt Spa: ['0.350'] [Step 275 / Rank 6] Tasks: ['Single QA'] | Lens: [36499] → Tgt Spa: ['0.350'] [Step 275 / Rank 2] Tasks: ['Single QA'] | Lens: [44079] → Tgt Spa: ['0.350'] [Step 275 / Rank 5] Tasks: ['Single QA'] | Lens: [58396] → Tgt Spa: ['0.350'] [Step 275 / Rank 4] Tasks: ['Single QA'] | Lens: [58396] → Tgt Spa: ['0.350'] [Step 275 / 
Rank 3] Tasks: ['Single QA'] | Lens: [49853] → Tgt Spa: ['0.350'] [Step 275 / Rank 6] Tasks: ['Single QA'] | Lens: [61216] → Tgt Spa: ['0.350'] [Step 275 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43667] → Tgt Spa: ['1.000'] [Step 275 / Rank 1] Tasks: ['Code'] | Lens: [35742] → Tgt Spa: ['1.000'] [Step 275 / Rank 2] Tasks: ['Single QA'] | Lens: [49853] → Tgt Spa: ['0.350'] [Step 275 / Rank 7] Tasks: ['Single QA'] | Lens: [61216] → Tgt Spa: ['0.350'] [Step 275 / Rank 0] Tasks: ['Code'] | Lens: [35742] → Tgt Spa: ['1.000'] [Step 275 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43667] → Tgt Spa: ['1.000'] [Step 275 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [26453, 26461] → Tgt Spa: ['1.000', '1.000'] [Step 275 / Rank 1] Tasks: ['Single QA'] | Lens: [49221] → Tgt Spa: ['0.350'] [Step 275 / Rank 5] Tasks: ['Single QA'] | Lens: [57763] → Tgt Spa: ['0.350'] [Step 275 / Rank 0] Tasks: ['Single QA'] | Lens: [49221] → Tgt Spa: ['0.350'] [Step 275 / Rank 3] Tasks: ['Single QA'] | Lens: [62992] → Tgt Spa: ['0.350'] [Step 275 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [26453, 26461] → Tgt Spa: ['1.000', '1.000'] [Step 275 / Rank 2] Tasks: ['Single QA'] | Lens: [62992] → Tgt Spa: ['0.350'] [Step 275 / Rank 4] Tasks: ['Single QA'] | Lens: [57763] → Tgt Spa: ['0.350'] [Step 275 / Rank 3] Tasks: ['Single QA'] | Lens: [33977] → Tgt Spa: ['0.350'] [Step 275 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [37874] → Tgt Spa: ['1.000'] [Step 275 / Rank 5] Tasks: ['Code'] | Lens: [40093] → Tgt Spa: ['1.000'] [Step 275 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [37874] → Tgt Spa: ['1.000'] [Step 275 / Rank 4] Tasks: ['Code'] | Lens: [40093] → Tgt Spa: ['1.000'] [Step 275 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23102, 23102] → Tgt Spa: ['0.350', '1.000'] [Step 275 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [23102, 23102] → Tgt Spa: ['0.350', '1.000'] [Step 275 / Rank 2] Tasks: ['Single QA'] | Lens: 
[33977] → Tgt Spa: ['0.350'] [Step 275 / Rank 3] Tasks: ['Single QA'] | Lens: [41009] → Tgt Spa: ['0.350'] [Step 275 / Rank 7] Tasks: ['Single QA'] | Lens: [64531] → Tgt Spa: ['0.350'] [Step 275 / Rank 6] Tasks: ['Single QA'] | Lens: [64531] → Tgt Spa: ['0.350'] [Step 275 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [32549, 32550] → Tgt Spa: ['0.350', '0.350'] [Step 275 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [32549, 32550] → Tgt Spa: ['0.350', '0.350'] [Step 275 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [29956, 29957] → Tgt Spa: ['0.350', '0.350'] [Step 275 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [29956, 29957] → Tgt Spa: ['0.350', '0.350'] [Step 275 / Rank 2] Tasks: ['Single QA'] | Lens: [41009] → Tgt Spa: ['0.350'] [Step 275 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29326, 29327] → Tgt Spa: ['1.000', '0.350'] [Step 275 / Rank 5] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [21166, 21161, 21181] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 275 / Rank 0] Tasks: ['Single QA'] | Lens: [53544] → Tgt Spa: ['0.350'] [Step 275 / Rank 7] Tasks: ['Single QA'] | Lens: [65069] → Tgt Spa: ['0.350'] [Step 275 / Rank 4] Tasks: ['Code', 'In-Context Learning', 'Summarization'] | Lens: [21166, 21161, 21181] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 275 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [29326, 29327] → Tgt Spa: ['1.000', '0.350'] [Step 275 / Rank 1] Tasks: ['Single QA'] | Lens: [53544] → Tgt Spa: ['0.350'] [Step 275 / Rank 6] Tasks: ['Single QA'] | Lens: [65069] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:15:56,943 >> @ 275 | Loss: 2.1756 | LM: 2.1327 | Reg: 0.0430 | Spa(Avg): 0.482 [INFO|lh_trainer.py:797] 2026-02-17 07:15:56,943 >> Statistic -> Code | Spa: 0.714 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 07:15:56,943 >> Statistic -> In-Context | Spa: 0.720 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 
07:15:56,943 >> Statistic -> MultiHop | Spa: 0.618 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:15:56,943 >> Statistic -> Single | Spa: 0.365 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:15:56,943 >> Statistic -> Summarization | Spa: 0.694 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:810] 2026-02-17 07:15:56,945 >> [Micro-Log] {"loss": 2.1756326258182526, "lm_loss": 2.132660264149308, "reg_loss": 0.042972351215818584, "model_sparsity(avg)": 0.4819637288649877, "Spa-In-Context Learning sparsity": 0.7202380895614624, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.1065548722233091, "Spa-Code sparsity": 0.7138888835906982, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09417061805725098, "Spa-Single QA sparsity": 0.365497068354958, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.013499379501138864, "Spa-Summarization sparsity": 0.6944444179534912, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.0952208861708641, "Spa-MultiHop QA sparsity": 0.6180555820465088, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11417927592992783, "step": 275, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:16:23,825 >> {'loss': 13.0538, 'grad_norm': 0.4224667549133301, 'learning_rate': 1.3267516299476845e-05, 'epoch': 0.29067930489731436, 'num_input_tokens_seen': 678987128, 'completed': '92.00% (276 / 300)', 'remaining time': '1:07:21', 'throughput': '6836.16', 'gpu_mem_free': '8691MB', 'step': 276} [Step 276 / Rank 5] Tasks: ['Single QA'] | Lens: [56327] → Tgt Spa: ['0.350'] [Step 276 / Rank 7] Tasks: ['Single QA'] | Lens: [64056] → Tgt Spa: ['0.350'] [Step 276 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [32199, 32199] → Tgt Spa: ['0.350', '0.350'] 
[Step 276 / Rank 6] Tasks: ['Single QA'] | Lens: [64056] → Tgt Spa: ['0.350'] [Step 276 / Rank 0] Tasks: ['Summarization', 'Code', 'Single QA'] | Lens: [17925, 17915, 17907] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 276 / Rank 4] Tasks: ['Single QA'] | Lens: [56327] → Tgt Spa: ['0.350'] [Step 276 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [32199, 32199] → Tgt Spa: ['0.350', '0.350'] [Step 276 / Rank 1] Tasks: ['Summarization', 'Code', 'Single QA'] | Lens: [17925, 17915, 17907] → Tgt Spa: ['1.000', '1.000', '0.350'] [Step 276 / Rank 1] Tasks: ['Single QA'] | Lens: [48782] → Tgt Spa: ['0.350'] [Step 276 / Rank 3] Tasks: ['Single QA'] | Lens: [55573] → Tgt Spa: ['0.350'] [Step 276 / Rank 4] Tasks: ['Single QA'] | Lens: [45603] → Tgt Spa: ['0.350'] [Step 276 / Rank 7] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17838, 17851, 17844] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 276 / Rank 2] Tasks: ['Single QA'] | Lens: [55573] → Tgt Spa: ['0.350'] [Step 276 / Rank 5] Tasks: ['Single QA'] | Lens: [45603] → Tgt Spa: ['0.350'] [Step 276 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [17838, 17851, 17844] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 276 / Rank 0] Tasks: ['Single QA'] | Lens: [48782] → Tgt Spa: ['0.350'] [Step 276 / Rank 6] Tasks: ['Single QA'] | Lens: [39382] → Tgt Spa: ['0.350'] [Step 276 / Rank 7] Tasks: ['Single QA'] | Lens: [39382] → Tgt Spa: ['0.350'] [Step 276 / Rank 3] Tasks: ['Code'] | Lens: [53682] → Tgt Spa: ['1.000'] [Step 276 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23238, 23258] → Tgt Spa: ['1.000', '1.000'] [Step 276 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [32606, 32602] → Tgt Spa: ['1.000', '0.350'] [Step 276 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23238, 23258] → Tgt Spa: ['1.000', '1.000'] [Step 276 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [32606, 32602] → Tgt Spa: ['1.000', '0.350'] [Step 276 / Rank 2] Tasks: ['Code'] | Lens: [53682] → Tgt Spa: 
['1.000'] [Step 276 / Rank 6] Tasks: ['Single QA'] | Lens: [35909] → Tgt Spa: ['0.350'] [Step 276 / Rank 7] Tasks: ['Single QA'] | Lens: [35909] → Tgt Spa: ['0.350'] [Step 276 / Rank 1] Tasks: ['Single QA'] | Lens: [43217] → Tgt Spa: ['0.350'] [Step 276 / Rank 3] Tasks: ['MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization'] | Lens: [2139, 2142, 2143, 2145, 2145, 2144, 2145, 2145, 2146, 2145, 2146, 2150, 2169, 2152, 2153, 2152, 2171, 2152, 2152, 2172, 2157, 2156, 2176, 2156, 2158, 2176, 2176, 2175, 2175, 2176] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 276 / Rank 0] Tasks: ['Single QA'] | Lens: [43217] → Tgt Spa: ['0.350'] [Step 276 / Rank 5] Tasks: ['Summarization', 'Code'] | Lens: [23272, 23262] → Tgt Spa: ['1.000', '1.000'] [Step 276 / Rank 2] Tasks: ['MultiHop QA', 'MultiHop QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization'] | Lens: [2139, 2142, 2143, 2145, 2145, 2144, 2145, 2145, 2146, 2145, 2146, 2150, 2169, 2152, 2153, 2152, 2171, 2152, 
2152, 2172, 2157, 2156, 2176, 2156, 2158, 2176, 2176, 2175, 2175, 2176] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000'] [Step 276 / Rank 4] Tasks: ['Summarization', 'Code'] | Lens: [23272, 23262] → Tgt Spa: ['1.000', '1.000'] [Step 276 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [44178] → Tgt Spa: ['1.000'] [Step 276 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [44178] → Tgt Spa: ['1.000'] [Step 276 / Rank 5] Tasks: ['Code'] | Lens: [34374] → Tgt Spa: ['1.000'] [Step 276 / Rank 3] Tasks: ['Single QA'] | Lens: [46228] → Tgt Spa: ['0.350'] [Step 276 / Rank 0] Tasks: ['Single QA'] | Lens: [36543] → Tgt Spa: ['0.350'] [Step 276 / Rank 1] Tasks: ['Single QA'] | Lens: [36543] → Tgt Spa: ['0.350'] [Step 276 / Rank 2] Tasks: ['Single QA'] | Lens: [46228] → Tgt Spa: ['0.350'] [Step 276 / Rank 4] Tasks: ['Code'] | Lens: [34374] → Tgt Spa: ['1.000'] [Step 276 / Rank 5] Tasks: ['Single QA'] | Lens: [35564] → Tgt Spa: ['0.350'] [Step 276 / Rank 1] Tasks: ['Single QA'] | Lens: [63880] → Tgt Spa: ['0.350'] [Step 276 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43911] → Tgt Spa: ['1.000'] [Step 276 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26417, 26418] → Tgt Spa: ['1.000', '1.000'] [Step 276 / Rank 4] Tasks: ['Single QA'] | Lens: [35564] → Tgt Spa: ['0.350'] [Step 276 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26417, 26418] → Tgt Spa: ['1.000', '1.000'] [Step 276 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43911] → Tgt Spa: ['1.000'] [Step 276 / Rank 0] Tasks: ['Single QA'] | Lens: [63880] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:18:38,308 >> @ 276 | Loss: 2.0993 | LM: 2.0458 | Reg: 0.0535 | Spa(Avg): 0.512 [INFO|lh_trainer.py:797] 2026-02-17 
07:18:38,308 >> Statistic -> Code | Spa: 0.714 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 07:18:38,308 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:18:38,308 >> Statistic -> MultiHop | Spa: 0.631 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:18:38,308 >> Statistic -> Single | Spa: 0.413 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:18:38,308 >> Statistic -> Summarization | Spa: 0.638 | Tgt: 1.000 | Z-Loss: 0.126 | [INFO|lh_trainer.py:810] 2026-02-17 07:18:38,310 >> [Micro-Log] {"loss": 2.0992775031675897, "lm_loss": 2.0457664616405964, "reg_loss": 0.05351104113894204, "model_sparsity(avg)": 0.5122299405435721, "Spa-Summarization sparsity": 0.637820514348837, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12577218390428102, "Spa-Code sparsity": 0.7142857142857143, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09401114923613411, "Spa-Single QA sparsity": 0.4130116952093024, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04257413362594027, "Spa-In-Context Learning sparsity": 0.7194444417953492, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10688925087451935, "Spa-MultiHop QA sparsity": 0.6311728292041354, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12247117546697457, "step": 276, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:19:04,503 >> {'loss': 12.5957, 'grad_norm': 0.45536261796951294, 'learning_rate': 1.2235919702624524e-05, 'epoch': 0.2917324907846235, 'num_input_tokens_seen': 681376426, 'completed': '92.33% (277 / 300)', 'remaining time': '1:04:32', 'throughput': '7435.02', 'gpu_mem_free': '4821MB', 'step': 277} [Step 277 / Rank 3] Tasks: 
['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Summarization', 'Summarization', 'Summarization', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4462, 4456, 4455, 4457, 4457, 4457, 4476, 4477, 4477, 4460, 4460, 4461, 4461, 4461] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 277 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43783] → Tgt Spa: ['1.000'] [Step 277 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43783] → Tgt Spa: ['1.000'] [Step 277 / Rank 6] Tasks: ['Code'] | Lens: [64344] → Tgt Spa: ['1.000'] [Step 277 / Rank 2] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Summarization', 'Summarization', 'Summarization', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [4462, 4456, 4455, 4457, 4457, 4457, 4476, 4477, 4477, 4460, 4460, 4461, 4461, 4461] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000'] [Step 277 / Rank 0] Tasks: ['Code'] | Lens: [64681] → Tgt Spa: ['1.000'] [Step 277 / Rank 7] Tasks: ['Code'] | Lens: [64344] → Tgt Spa: ['1.000'] [Step 277 / Rank 1] Tasks: ['Code'] | Lens: [64681] → Tgt Spa: ['1.000'] [Step 277 / Rank 7] Tasks: ['Code'] | Lens: [53564] → Tgt Spa: ['1.000'] [Step 277 / Rank 3] Tasks: ['Code'] | Lens: [47029] → Tgt Spa: ['1.000'] [Step 277 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26960, 26961] → Tgt Spa: ['1.000', '1.000'] [Step 277 / Rank 6] Tasks: ['Code'] | Lens: [53564] → Tgt Spa: ['1.000'] [Step 277 / Rank 0] Tasks: ['Single QA'] | Lens: [48472] → Tgt Spa: ['0.350'] [Step 277 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | 
Lens: [26960, 26961] → Tgt Spa: ['1.000', '1.000'] [Step 277 / Rank 2] Tasks: ['Code'] | Lens: [47029] → Tgt Spa: ['1.000'] [Step 277 / Rank 1] Tasks: ['Single QA'] | Lens: [48472] → Tgt Spa: ['0.350'] [Step 277 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [26112, 26113] → Tgt Spa: ['0.350', '0.350'] [Step 277 / Rank 3] Tasks: ['Single QA'] | Lens: [57581] → Tgt Spa: ['0.350'] [Step 277 / Rank 0] Tasks: ['Code'] | Lens: [63362] → Tgt Spa: ['1.000'] [Step 277 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24323, 24324] → Tgt Spa: ['1.000', '1.000'] [Step 277 / Rank 2] Tasks: ['Single QA'] | Lens: [57581] → Tgt Spa: ['0.350'] [Step 277 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24323, 24324] → Tgt Spa: ['1.000', '1.000'] [Step 277 / Rank 1] Tasks: ['Code'] | Lens: [63362] → Tgt Spa: ['1.000'] [Step 277 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [26112, 26113] → Tgt Spa: ['0.350', '0.350'] [Step 277 / Rank 6] Tasks: ['Single QA'] | Lens: [65038] → Tgt Spa: ['0.350'] [Step 277 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24682, 24684] → Tgt Spa: ['1.000', '1.000'] [Step 277 / Rank 3] Tasks: ['Single QA'] | Lens: [55143] → Tgt Spa: ['0.350'] [Step 277 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28066, 28065] → Tgt Spa: ['1.000', '1.000'] [Step 277 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24682, 24684] → Tgt Spa: ['1.000', '1.000'] [Step 277 / Rank 2] Tasks: ['Single QA'] | Lens: [55143] → Tgt Spa: ['0.350'] [Step 277 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [28066, 28065] → Tgt Spa: ['1.000', '1.000'] [Step 277 / Rank 7] Tasks: ['Single QA'] | Lens: [65038] → Tgt Spa: ['0.350'] [Step 277 / Rank 5] Tasks: ['Code'] | Lens: [53112] → Tgt Spa: ['1.000'] [Step 277 / Rank 7] Tasks: ['Single QA'] | Lens: [55426] → Tgt Spa: ['0.350'] [Step 277 / Rank 3] Tasks: ['Single QA'] | Lens: [35181] 
→ Tgt Spa: ['0.350'] [Step 277 / Rank 2] Tasks: ['Single QA'] | Lens: [35181] → Tgt Spa: ['0.350'] [Step 277 / Rank 6] Tasks: ['Single QA'] | Lens: [55426] → Tgt Spa: ['0.350'] [Step 277 / Rank 1] Tasks: ['Single QA'] | Lens: [49688] → Tgt Spa: ['0.350'] [Step 277 / Rank 4] Tasks: ['Code'] | Lens: [53112] → Tgt Spa: ['1.000'] [Step 277 / Rank 0] Tasks: ['Single QA'] | Lens: [49688] → Tgt Spa: ['0.350'] [Step 277 / Rank 3] Tasks: ['Single QA'] | Lens: [55052] → Tgt Spa: ['0.350'] [Step 277 / Rank 0] Tasks: ['Single QA'] | Lens: [49724] → Tgt Spa: ['0.350'] [Step 277 / Rank 2] Tasks: ['Single QA'] | Lens: [55052] → Tgt Spa: ['0.350'] [Step 277 / Rank 5] Tasks: ['Single QA'] | Lens: [48455] → Tgt Spa: ['0.350'] [Step 277 / Rank 6] Tasks: ['Summarization'] | Lens: [33670] → Tgt Spa: ['1.000'] [Step 277 / Rank 4] Tasks: ['Single QA'] | Lens: [48455] → Tgt Spa: ['0.350'] [Step 277 / Rank 1] Tasks: ['Single QA'] | Lens: [49724] → Tgt Spa: ['0.350'] [Step 277 / Rank 7] Tasks: ['Summarization'] | Lens: [33670] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:21:44,467 >> @ 277 | Loss: 1.9784 | LM: 1.9113 | Reg: 0.0671 | Spa(Avg): 0.562 [INFO|lh_trainer.py:797] 2026-02-17 07:21:44,467 >> Statistic -> Code | Spa: 0.716 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 07:21:44,467 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:21:44,467 >> Statistic -> MultiHop | Spa: 0.631 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:21:44,467 >> Statistic -> Single | Spa: 0.414 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:21:44,467 >> Statistic -> Summarization | Spa: 0.632 | Tgt: 1.000 | Z-Loss: 0.127 | [INFO|lh_trainer.py:810] 2026-02-17 07:21:44,472 >> [Micro-Log] {"loss": 1.978429160391291, "lm_loss": 1.9112995015457273, "reg_loss": 0.06712966610696942, "model_sparsity(avg)": 0.5618386181692282, "Spa-Code sparsity": 0.7162698337009975, "Spa-Code 
target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09323750223432269, "Spa-Single QA sparsity": 0.4136904776096344, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.042209843040577004, "Spa-In-Context Learning sparsity": 0.7181372502270866, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10744316437665154, "Spa-Summarization sparsity": 0.6319444477558136, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12743009813129902, "Spa-MultiHop QA sparsity": 0.6311728292041354, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12247117546697457, "step": 277, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:22:05,034 >> {'loss': 11.8706, 'grad_norm': 0.6411481499671936, 'learning_rate': 1.1245062761201955e-05, 'epoch': 0.2927856766719326, 'num_input_tokens_seen': 683908570, 'completed': '92.67% (278 / 300)', 'remaining time': '1:01:44', 'throughput': '7013.04', 'gpu_mem_free': '10733MB', 'step': 278} [Step 278 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [23085, 23080] → Tgt Spa: ['1.000', '0.350'] [Step 278 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [64420] → Tgt Spa: ['1.000'] [Step 278 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [28334, 28334] → Tgt Spa: ['0.350', '0.350'] [Step 278 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [64420] → Tgt Spa: ['1.000'] [Step 278 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43755] → Tgt Spa: ['1.000'] [Step 278 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [28334, 28334] → Tgt Spa: ['0.350', '0.350'] [Step 278 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43755] → Tgt Spa: ['1.000'] [Step 278 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [23085, 23080] → Tgt Spa: ['1.000', '0.350'] [Step 278 / Rank 5] Tasks: ['Single QA'] | Lens: [36285] → 
Tgt Spa: ['0.350'] [Step 278 / Rank 7] Tasks: ['Single QA'] | Lens: [64969] → Tgt Spa: ['0.350'] [Step 278 / Rank 2] Tasks: ['Single QA'] | Lens: [44837] → Tgt Spa: ['0.350'] [Step 278 / Rank 0] Tasks: ['Single QA'] | Lens: [34955] → Tgt Spa: ['0.350'] [Step 278 / Rank 3] Tasks: ['Single QA'] | Lens: [44837] → Tgt Spa: ['0.350'] [Step 278 / Rank 1] Tasks: ['Single QA'] | Lens: [34955] → Tgt Spa: ['0.350'] [Step 278 / Rank 6] Tasks: ['Single QA'] | Lens: [64969] → Tgt Spa: ['0.350'] [Step 278 / Rank 4] Tasks: ['Single QA'] | Lens: [36285] → Tgt Spa: ['0.350'] [Step 278 / Rank 6] Tasks: ['Single QA'] | Lens: [62933] → Tgt Spa: ['0.350'] [Step 278 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [50917] → Tgt Spa: ['1.000'] [Step 278 / Rank 7] Tasks: ['Single QA'] | Lens: [62933] → Tgt Spa: ['0.350'] [Step 278 / Rank 1] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25630, 25630] → Tgt Spa: ['0.350', '1.000'] [Step 278 / Rank 4] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23897, 23898] → Tgt Spa: ['1.000', '0.350'] [Step 278 / Rank 0] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [25630, 25630] → Tgt Spa: ['0.350', '1.000'] [Step 278 / Rank 5] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [23897, 23898] → Tgt Spa: ['1.000', '0.350'] [Step 278 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [50917] → Tgt Spa: ['1.000'] [Step 278 / Rank 4] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [8084, 8092, 8092, 8086, 8088, 8088, 8096, 8090] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 278 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [28454, 28455] → Tgt Spa: ['1.000', '0.350'] [Step 278 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42083] → Tgt Spa: ['1.000'] [Step 278 / Rank 5] Tasks: ['Single QA', 'Code', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [8084, 8092, 8092, 8086, 8088, 8088, 8096, 
8090] → Tgt Spa: ['0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 278 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42083] → Tgt Spa: ['1.000'] [Step 278 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [28454, 28455] → Tgt Spa: ['1.000', '0.350'] [Step 278 / Rank 1] Tasks: ['Single QA'] | Lens: [60046] → Tgt Spa: ['0.350'] [Step 278 / Rank 0] Tasks: ['Single QA'] | Lens: [60046] → Tgt Spa: ['0.350'] [Step 278 / Rank 5] Tasks: ['Single QA'] | Lens: [53888] → Tgt Spa: ['0.350'] [Step 278 / Rank 3] Tasks: ['Code'] | Lens: [51828] → Tgt Spa: ['1.000'] [Step 278 / Rank 4] Tasks: ['Single QA'] | Lens: [53888] → Tgt Spa: ['0.350'] [Step 278 / Rank 2] Tasks: ['Code'] | Lens: [51828] → Tgt Spa: ['1.000'] [Step 278 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19395, 19386, 19397] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 278 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58189] → Tgt Spa: ['1.000'] [Step 278 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [19395, 19386, 19397] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 278 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58189] → Tgt Spa: ['1.000'] [Step 278 / Rank 3] Tasks: ['Single QA'] | Lens: [64903] → Tgt Spa: ['0.350'] [Step 278 / Rank 0] Tasks: ['Single QA'] | Lens: [37779] → Tgt Spa: ['0.350'] [Step 278 / Rank 7] Tasks: ['Single QA'] | Lens: [51866] → Tgt Spa: ['0.350'] [Step 278 / Rank 6] Tasks: ['Single QA'] | Lens: [51866] → Tgt Spa: ['0.350'] [Step 278 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58101] → Tgt Spa: ['1.000'] [Step 278 / Rank 2] Tasks: ['Single QA'] | Lens: [64903] → Tgt Spa: ['0.350'] [Step 278 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58101] → Tgt Spa: ['1.000'] [Step 278 / Rank 1] Tasks: ['Single QA'] | Lens: [37779] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:24:55,664 >> @ 278 | Loss: 2.1585 | LM: 2.0971 | Reg: 0.0614 | Spa(Avg): 0.536 [INFO|lh_trainer.py:797] 
2026-02-17 07:24:55,664 >> Statistic -> Code | Spa: 0.718 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 07:24:55,664 >> Statistic -> In-Context | Spa: 0.722 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:24:55,664 >> Statistic -> MultiHop | Spa: 0.631 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:24:55,665 >> Statistic -> Single | Spa: 0.402 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:24:55,665 >> Statistic -> Summarization | Spa: 0.688 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:810] 2026-02-17 07:24:55,667 >> [Micro-Log] {"loss": 2.158514130511321, "lm_loss": 2.09711854653627, "reg_loss": 0.061395585090698056, "model_sparsity(avg)": 0.5364342145621777, "Spa-Code sparsity": 0.7175925970077515, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09270988901456197, "Spa-Single QA sparsity": 0.40211639801661175, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03354256494813377, "Spa-In-Context Learning sparsity": 0.7222222089767456, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10571892559528351, "Spa-Summarization sparsity": 0.6875, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09899374842643738, "Spa-MultiHop QA sparsity": 0.6311728292041354, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12247117546697457, "step": 278, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:25:23,219 >> {'loss': 12.9511, 'grad_norm': 0.6085736155509949, 'learning_rate': 1.0295115253445109e-05, 'epoch': 0.2938388625592417, 'num_input_tokens_seen': 686435460, 'completed': '93.00% (279 / 300)', 'remaining time': '0:58:58', 'throughput': '6375.07', 'gpu_mem_free': '13505MB', 'step': 279} [Step 279 / Rank 7] Tasks: 
['Single QA'] | Lens: [57916] → Tgt Spa: ['0.350'] [Step 279 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [23082, 23083] → Tgt Spa: ['0.350', '0.350'] [Step 279 / Rank 6] Tasks: ['Single QA'] | Lens: [57916] → Tgt Spa: ['0.350'] [Step 279 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [61905] → Tgt Spa: ['1.000'] [Step 279 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [62494] → Tgt Spa: ['1.000'] [Step 279 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [61905] → Tgt Spa: ['1.000'] [Step 279 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [23082, 23083] → Tgt Spa: ['0.350', '0.350'] [Step 279 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [62494] → Tgt Spa: ['1.000'] [Step 279 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [39679] → Tgt Spa: ['1.000'] [Step 279 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [32617, 32636] → Tgt Spa: ['0.350', '1.000'] [Step 279 / Rank 6] Tasks: ['Single QA'] | Lens: [50944] → Tgt Spa: ['0.350'] [Step 279 / Rank 4] Tasks: ['Single QA'] | Lens: [62698] → Tgt Spa: ['0.350'] [Step 279 / Rank 5] Tasks: ['Single QA'] | Lens: [62698] → Tgt Spa: ['0.350'] [Step 279 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [32617, 32636] → Tgt Spa: ['0.350', '1.000'] [Step 279 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [39679] → Tgt Spa: ['1.000'] [Step 279 / Rank 7] Tasks: ['Single QA'] | Lens: [50944] → Tgt Spa: ['0.350'] [Step 279 / Rank 3] Tasks: ['Single QA'] | Lens: [62918] → Tgt Spa: ['0.350'] [Step 279 / Rank 0] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18542, 18532, 18544] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 279 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [46093] → Tgt Spa: ['1.000'] [Step 279 / Rank 2] Tasks: ['Single QA'] | Lens: [62918] → Tgt Spa: ['0.350'] [Step 279 / Rank 6] Tasks: ['MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Summarization', 
'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [2799, 2797, 2803, 2799, 2799, 2799, 2799, 2800, 2806, 2805, 2799, 2801, 2818, 2819, 2800, 2803, 2801, 2820, 2821, 2804, 2822, 2805, 2805] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350'] [Step 279 / Rank 1] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [18542, 18532, 18544] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 279 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [46093] → Tgt Spa: ['1.000'] [Step 279 / Rank 7] Tasks: ['MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Code', 'In-Context Learning', 'Single QA', 'Summarization', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [2799, 2797, 2803, 2799, 2799, 2799, 2799, 2800, 2806, 2805, 2799, 2801, 2818, 2819, 2800, 2803, 2801, 2820, 2821, 2804, 2822, 2805, 2805] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350', '0.350'] [Step 279 / Rank 6] Tasks: ['Single QA'] | Lens: [41253] → Tgt Spa: ['0.350'] [Step 279 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [26325, 26334] → Tgt Spa: ['1.000', '1.000'] [Step 279 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [26325, 26334] → Tgt Spa: ['1.000', '1.000'] [Step 279 / Rank 7] Tasks: ['Single QA'] | Lens: [41253] → Tgt Spa: ['0.350'] [Step 279 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22435, 22436] → Tgt Spa: ['1.000', 
'1.000'] [Step 279 / Rank 5] Tasks: ['Code'] | Lens: [43128] → Tgt Spa: ['1.000'] [Step 279 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22435, 22436] → Tgt Spa: ['1.000', '1.000'] [Step 279 / Rank 4] Tasks: ['Code'] | Lens: [43128] → Tgt Spa: ['1.000'] [Step 279 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [48642] → Tgt Spa: ['1.000'] [Step 279 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [52780] → Tgt Spa: ['1.000'] [Step 279 / Rank 2] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Summarization', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Single QA'] | Lens: [5020, 5001, 5002, 5002, 5004, 5012, 5023, 5006, 5006, 5007, 5025, 5007, 5009] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 279 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [48642] → Tgt Spa: ['1.000'] [Step 279 / Rank 6] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [19586, 19600, 19601] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 279 / Rank 3] Tasks: ['Summarization', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'Summarization', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'Single QA'] | Lens: [5020, 5001, 5002, 5002, 5004, 5012, 5023, 5006, 5006, 5007, 5025, 5007, 5009] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 279 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [52780] → Tgt Spa: ['1.000'] [Step 279 / Rank 7] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [19586, 19600, 19601] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 279 / Rank 3] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25167, 25149] → Tgt Spa: 
['1.000', '1.000'] [Step 279 / Rank 7] Tasks: ['Code'] | Lens: [42323] → Tgt Spa: ['1.000'] [Step 279 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Summarization', 'Code'] | Lens: [11744, 11744, 11756, 11774, 11764] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '1.000'] [Step 279 / Rank 0] Tasks: ['Summarization', 'Single QA'] | Lens: [23961, 23942] → Tgt Spa: ['1.000', '0.350'] [Step 279 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Summarization', 'Code'] | Lens: [11744, 11744, 11756, 11774, 11764] → Tgt Spa: ['0.350', '0.350', '0.350', '1.000', '1.000'] [Step 279 / Rank 1] Tasks: ['Summarization', 'Single QA'] | Lens: [23961, 23942] → Tgt Spa: ['1.000', '0.350'] [Step 279 / Rank 6] Tasks: ['Code'] | Lens: [42323] → Tgt Spa: ['1.000'] [Step 279 / Rank 2] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25167, 25149] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:27:49,395 >> @ 279 | Loss: 2.1644 | LM: 2.0863 | Reg: 0.0781 | Spa(Avg): 0.593 [INFO|lh_trainer.py:797] 2026-02-17 07:27:49,395 >> Statistic -> Code | Spa: 0.712 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 07:27:49,395 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:27:49,395 >> Statistic -> MultiHop | Spa: 0.659 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:27:49,395 >> Statistic -> Single | Spa: 0.403 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:27:49,395 >> Statistic -> Summarization | Spa: 0.650 | Tgt: 1.000 | Z-Loss: 0.119 | [INFO|lh_trainer.py:810] 2026-02-17 07:27:49,397 >> [Micro-Log] {"loss": 2.164372748384873, "lm_loss": 2.086308707793554, "reg_loss": 0.07806404689229869, "model_sparsity(avg)": 0.5932073506216208, "Spa-In-Context Learning sparsity": 0.7185672458849455, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10726166555756017, "Spa-Summarization sparsity": 
0.6501736156642437, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11859497893601656, "Spa-Code sparsity": 0.712499988079071, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09478677436709404, "Spa-Single QA sparsity": 0.4027777686715126, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.03830384021421196, "Spa-MultiHop QA sparsity": 0.6590909090909091, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1375578601251949, "step": 279, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.166015625, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:28:02,634 >> {'loss': 12.9862, 'grad_norm': 0.7676545977592468, 'learning_rate': 9.38623994796912e-06, 'epoch': 0.2948920484465508, 'num_input_tokens_seen': 689001010, 'completed': '93.33% (280 / 300)', 'remaining time': '0:56:09', 'throughput': '8046.77', 'gpu_mem_free': '11953MB', 'step': 280} [Step 280 / Rank 1] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16512, 16522, 16524] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 280 / Rank 3] Tasks: ['Single QA'] | Lens: [42357] → Tgt Spa: ['0.350'] [Step 280 / Rank 2] Tasks: ['Single QA'] | Lens: [42357] → Tgt Spa: ['0.350'] [Step 280 / Rank 5] Tasks: ['Code'] | Lens: [60327] → Tgt Spa: ['1.000'] [Step 280 / Rank 4] Tasks: ['Code'] | Lens: [60327] → Tgt Spa: ['1.000'] [Step 280 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42675] → Tgt Spa: ['1.000'] [Step 280 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42675] → Tgt Spa: ['1.000'] [Step 280 / Rank 0] Tasks: ['Code', 'Summarization', 'Summarization'] | Lens: [16512, 16522, 16524] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 280 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [17215, 17216, 17217] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 280 / Rank 6] Tasks: ['Single QA'] | Lens: [60447] → Tgt Spa: ['0.350'] [Step 280 / Rank 1] 
Tasks: ['Single QA', 'Code'] | Lens: [22979, 22987] → Tgt Spa: ['0.350', '1.000'] [Step 280 / Rank 5] Tasks: ['Code', 'Single QA', 'Code'] | Lens: [19309, 19302, 19311] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 280 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [17215, 17216, 17217] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 280 / Rank 0] Tasks: ['Single QA', 'Code'] | Lens: [22979, 22987] → Tgt Spa: ['0.350', '1.000'] [Step 280 / Rank 7] Tasks: ['Single QA'] | Lens: [60447] → Tgt Spa: ['0.350'] [Step 280 / Rank 4] Tasks: ['Code', 'Single QA', 'Code'] | Lens: [19309, 19302, 19311] → Tgt Spa: ['1.000', '0.350', '1.000'] [Step 280 / Rank 5] Tasks: ['Single QA'] | Lens: [63025] → Tgt Spa: ['0.350'] [Step 280 / Rank 1] Tasks: ['Single QA'] | Lens: [59074] → Tgt Spa: ['0.350'] [Step 280 / Rank 4] Tasks: ['Single QA'] | Lens: [63025] → Tgt Spa: ['0.350'] [Step 280 / Rank 2] Tasks: ['Code'] | Lens: [58047] → Tgt Spa: ['1.000'] [Step 280 / Rank 0] Tasks: ['Single QA'] | Lens: [59074] → Tgt Spa: ['0.350'] [Step 280 / Rank 6] Tasks: ['Single QA'] | Lens: [65069] → Tgt Spa: ['0.350'] [Step 280 / Rank 3] Tasks: ['Code'] | Lens: [58047] → Tgt Spa: ['1.000'] [Step 280 / Rank 7] Tasks: ['Single QA'] | Lens: [65069] → Tgt Spa: ['0.350'] [Step 280 / Rank 0] Tasks: ['Single QA'] | Lens: [54004] → Tgt Spa: ['0.350'] [Step 280 / Rank 1] Tasks: ['Single QA'] | Lens: [54004] → Tgt Spa: ['0.350'] [Step 280 / Rank 6] Tasks: ['Code'] | Lens: [42109] → Tgt Spa: ['1.000'] [Step 280 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [54191] → Tgt Spa: ['1.000'] [Step 280 / Rank 4] Tasks: ['Single QA'] | Lens: [49701] → Tgt Spa: ['0.350'] [Step 280 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [54191] → Tgt Spa: ['1.000'] [Step 280 / Rank 5] Tasks: ['Single QA'] | Lens: [49701] → Tgt Spa: ['0.350'] [Step 280 / Rank 7] Tasks: ['Code'] | Lens: [42109] → Tgt Spa: ['1.000'] [Step 280 / Rank 6] Tasks: ['Single QA'] | Lens: [57921] → Tgt Spa: ['0.350'] [Step 280 / Rank 7] Tasks: ['Single 
QA'] | Lens: [57921] → Tgt Spa: ['0.350'] [Step 280 / Rank 5] Tasks: ['Code'] | Lens: [59155] → Tgt Spa: ['1.000'] [Step 280 / Rank 4] Tasks: ['Code'] | Lens: [59155] → Tgt Spa: ['1.000'] [Step 280 / Rank 3] Tasks: ['Single QA'] | Lens: [48511] → Tgt Spa: ['0.350'] [Step 280 / Rank 2] Tasks: ['Single QA'] | Lens: [48511] → Tgt Spa: ['0.350'] [Step 280 / Rank 0] Tasks: ['Single QA'] | Lens: [52997] → Tgt Spa: ['0.350'] [Step 280 / Rank 1] Tasks: ['Single QA'] | Lens: [52997] → Tgt Spa: ['0.350'] [Step 280 / Rank 3] Tasks: ['Single QA'] | Lens: [54067] → Tgt Spa: ['0.350'] [Step 280 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43028] → Tgt Spa: ['1.000'] [Step 280 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41295] → Tgt Spa: ['1.000'] [Step 280 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [32129, 32129] → Tgt Spa: ['0.350', '0.350'] [Step 280 / Rank 2] Tasks: ['Single QA'] | Lens: [54067] → Tgt Spa: ['0.350'] [Step 280 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41295] → Tgt Spa: ['1.000'] [Step 280 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [32129, 32129] → Tgt Spa: ['0.350', '0.350'] [Step 280 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43028] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:30:41,718 >> @ 280 | Loss: 1.7734 | LM: 1.7157 | Reg: 0.0577 | Spa(Avg): 0.535 [INFO|lh_trainer.py:797] 2026-02-17 07:30:41,718 >> Statistic -> Code | Spa: 0.712 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 07:30:41,718 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:30:41,718 >> Statistic -> MultiHop | Spa: 0.659 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:30:41,718 >> Statistic -> Single | Spa: 0.389 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:30:41,718 >> Statistic -> Summarization | Spa: 0.688 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:810] 2026-02-17 07:30:41,720 >> [Micro-Log] 
{"loss": 1.7734359751145046, "lm_loss": 1.7157154008746147, "reg_loss": 0.057720572057102494, "model_sparsity(avg)": 0.535493819663922, "Spa-Code sparsity": 0.7121211940591986, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09493371979756789, "Spa-Summarization sparsity": 0.6875, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.0989312082529068, "Spa-Single QA sparsity": 0.38888888359069823, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.024577776078755657, "Spa-In-Context Learning sparsity": 0.7152777910232544, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10864473879337311, "Spa-MultiHop QA sparsity": 0.6590909090909091, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.1375578601251949, "step": 280, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:31:01,715 >> {'loss': 10.6406, 'grad_norm': 0.5978413820266724, 'learning_rate': 8.518592575878607e-06, 'epoch': 0.29594523433385994, 'num_input_tokens_seen': 691555714, 'completed': '93.67% (281 / 300)', 'remaining time': '0:53:21', 'throughput': '7132.84', 'gpu_mem_free': '11829MB', 'step': 281} [Step 281 / Rank 5] Tasks: ['Single QA'] | Lens: [34871] → Tgt Spa: ['0.350'] [Step 281 / Rank 6] Tasks: ['Summarization'] | Lens: [62991] → Tgt Spa: ['1.000'] [Step 281 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [25539, 25539] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [25539, 25539] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 2] Tasks: ['Single QA'] | Lens: [32854] → Tgt Spa: ['0.350'] [Step 281 / Rank 4] Tasks: ['Single QA'] | Lens: [34871] → Tgt Spa: ['0.350'] [Step 281 / Rank 7] Tasks: ['Summarization'] | Lens: [62991] → Tgt Spa: ['1.000'] [Step 281 / Rank 3] Tasks: ['Single QA'] | 
Lens: [32854] → Tgt Spa: ['0.350'] [Step 281 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [38103] → Tgt Spa: ['1.000'] [Step 281 / Rank 5] Tasks: ['Single QA'] | Lens: [54187] → Tgt Spa: ['0.350'] [Step 281 / Rank 0] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Code'] | Lens: [11606, 11615, 11624, 11648, 11650] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000'] [Step 281 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [31105, 31105] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [31105, 31105] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 4] Tasks: ['Single QA'] | Lens: [54187] → Tgt Spa: ['0.350'] [Step 281 / Rank 1] Tasks: ['Single QA', 'Code', 'Code', 'Code', 'Code'] | Lens: [11606, 11615, 11624, 11648, 11650] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000'] [Step 281 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [38103] → Tgt Spa: ['1.000'] [Step 281 / Rank 4] Tasks: ['Summarization'] | Lens: [63378] → Tgt Spa: ['1.000'] [Step 281 / Rank 5] Tasks: ['Summarization'] | Lens: [63378] → Tgt Spa: ['1.000'] [Step 281 / Rank 2] Tasks: ['Single QA'] | Lens: [42765] → Tgt Spa: ['0.350'] [Step 281 / Rank 0] Tasks: ['Single QA', 'MultiHop QA'] | Lens: [32700, 32699] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39280] → Tgt Spa: ['1.000'] [Step 281 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39280] → Tgt Spa: ['1.000'] [Step 281 / Rank 3] Tasks: ['Single QA'] | Lens: [42765] → Tgt Spa: ['0.350'] [Step 281 / Rank 1] Tasks: ['Single QA', 'MultiHop QA'] | Lens: [32700, 32699] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29225, 29225] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29225, 29225] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 3] Tasks: ['Code'] | Lens: [59706] → Tgt Spa: ['1.000'] [Step 281 / Rank 2] Tasks: ['Code'] | Lens: [59706] → Tgt Spa: 
['1.000'] [Step 281 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [23090, 23091] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [56953] → Tgt Spa: ['1.000'] [Step 281 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [56953] → Tgt Spa: ['1.000'] [Step 281 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [23090, 23091] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [17009, 17011, 17013] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 281 / Rank 4] Tasks: ['Single QA'] | Lens: [55868] → Tgt Spa: ['0.350'] [Step 281 / Rank 7] Tasks: ['Single QA'] | Lens: [64053] → Tgt Spa: ['0.350'] [Step 281 / Rank 0] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16742, 16751, 16742] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 281 / Rank 5] Tasks: ['Single QA'] | Lens: [55868] → Tgt Spa: ['0.350'] [Step 281 / Rank 6] Tasks: ['Single QA'] | Lens: [64053] → Tgt Spa: ['0.350'] [Step 281 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [17009, 17011, 17013] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 281 / Rank 1] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [16742, 16751, 16742] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 281 / Rank 5] Tasks: ['Single QA'] | Lens: [44219] → Tgt Spa: ['0.350'] [Step 281 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [42105] → Tgt Spa: ['1.000'] [Step 281 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16677, 16665, 16678] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 281 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16677, 16665, 16678] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 281 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [28628, 28628] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [28628, 28628] → Tgt Spa: ['0.350', '0.350'] [Step 281 / Rank 4] Tasks: ['Single QA'] | Lens: [44219] → Tgt Spa: ['0.350'] [Step 281 / Rank 2] Tasks: ['In-Context 
Learning'] | Lens: [42105] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:33:43,979 >> @ 281 | Loss: 2.0692 | LM: 2.0147 | Reg: 0.0544 | Spa(Avg): 0.510 [INFO|lh_trainer.py:797] 2026-02-17 07:33:43,979 >> Statistic -> Code | Spa: 0.711 | Tgt: 1.000 | Z-Loss: 0.095 | [INFO|lh_trainer.py:797] 2026-02-17 07:33:43,979 >> Statistic -> In-Context | Spa: 0.712 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:33:43,979 >> Statistic -> MultiHop | Spa: 0.347 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:33:43,979 >> Statistic -> Single | Spa: 0.370 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:33:43,980 >> Statistic -> Summarization | Spa: 0.647 | Tgt: 1.000 | Z-Loss: 0.122 | [INFO|lh_trainer.py:810] 2026-02-17 07:33:43,982 >> [Micro-Log] {"loss": 2.069163642358035, "lm_loss": 2.014730599281999, "reg_loss": 0.05443306074206097, "model_sparsity(avg)": 0.5099151233832041, "Spa-Single QA sparsity": 0.3698830290844566, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.01603981807803441, "Spa-Code sparsity": 0.7108585726131093, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0953808304938403, "Spa-MultiHop QA sparsity": 0.3472222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0007529668509960175, "Spa-In-Context Learning sparsity": 0.7118055522441864, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11012112721800804, "Spa-Summarization sparsity": 0.6472222208976746, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12151511609554291, "step": 281, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:33:58,794 >> {'loss': 12.415, 'grad_norm': 0.5144767165184021, 'learning_rate': 7.692321804084169e-06, 'epoch': 
0.296998420221169, 'num_input_tokens_seen': 694038390, 'completed': '94.00% (282 / 300)', 'remaining time': '0:50:33', 'throughput': '7010.08', 'gpu_mem_free': '8097MB', 'step': 282} [Step 282 / Rank 7] Tasks: ['Single QA'] | Lens: [53595] → Tgt Spa: ['0.350'] [Step 282 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [23389, 23390] → Tgt Spa: ['0.350', '0.350'] [Step 282 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15809, 15810, 15810, 15810] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 282 / Rank 6] Tasks: ['Single QA'] | Lens: [53595] → Tgt Spa: ['0.350'] [Step 282 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [23389, 23390] → Tgt Spa: ['0.350', '0.350'] [Step 282 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15809, 15810, 15810, 15810] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 282 / Rank 1] Tasks: ['Single QA'] | Lens: [41107] → Tgt Spa: ['0.350'] [Step 282 / Rank 0] Tasks: ['Single QA'] | Lens: [41107] → Tgt Spa: ['0.350'] [Step 282 / Rank 5] Tasks: ['Single QA'] | Lens: [51784] → Tgt Spa: ['0.350'] [Step 282 / Rank 1] Tasks: ['Code'] | Lens: [38266] → Tgt Spa: ['1.000'] [Step 282 / Rank 2] Tasks: ['Code', 'Single QA'] | Lens: [31837, 31830] → Tgt Spa: ['1.000', '0.350'] [Step 282 / Rank 0] Tasks: ['Code'] | Lens: [38266] → Tgt Spa: ['1.000'] [Step 282 / Rank 4] Tasks: ['Single QA'] | Lens: [51784] → Tgt Spa: ['0.350'] [Step 282 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27036, 27036] → Tgt Spa: ['1.000', '1.000'] [Step 282 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27036, 27036] → Tgt Spa: ['1.000', '1.000'] [Step 282 / Rank 3] Tasks: ['Code', 'Single QA'] | Lens: [31837, 31830] → Tgt Spa: ['1.000', '0.350'] [Step 282 / Rank 0] Tasks: ['Single QA'] | Lens: [65118] → Tgt Spa: ['0.350'] [Step 282 / Rank 3] Tasks: ['Single QA'] | Lens: [52955] → Tgt Spa: ['0.350'] [Step 282 / Rank 1] Tasks: ['Single QA'] | 
Lens: [65118] → Tgt Spa: ['0.350'] [Step 282 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [50073] → Tgt Spa: ['1.000'] [Step 282 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [50073] → Tgt Spa: ['1.000'] [Step 282 / Rank 6] Tasks: ['Single QA'] | Lens: [50390] → Tgt Spa: ['0.350'] [Step 282 / Rank 2] Tasks: ['Single QA'] | Lens: [52955] → Tgt Spa: ['0.350'] [Step 282 / Rank 7] Tasks: ['Single QA'] | Lens: [50390] → Tgt Spa: ['0.350'] [Step 282 / Rank 4] Tasks: ['Single QA'] | Lens: [58934] → Tgt Spa: ['0.350'] [Step 282 / Rank 5] Tasks: ['Single QA'] | Lens: [58934] → Tgt Spa: ['0.350'] [Step 282 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26071, 26072] → Tgt Spa: ['1.000', '1.000'] [Step 282 / Rank 3] Tasks: ['Single QA'] | Lens: [60594] → Tgt Spa: ['0.350'] [Step 282 / Rank 2] Tasks: ['Single QA'] | Lens: [60594] → Tgt Spa: ['0.350'] [Step 282 / Rank 0] Tasks: ['Single QA'] | Lens: [49452] → Tgt Spa: ['0.350'] [Step 282 / Rank 1] Tasks: ['Single QA'] | Lens: [49452] → Tgt Spa: ['0.350'] [Step 282 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26071, 26072] → Tgt Spa: ['1.000', '1.000'] [Step 282 / Rank 7] Tasks: ['Single QA'] | Lens: [53642] → Tgt Spa: ['0.350'] [Step 282 / Rank 6] Tasks: ['Single QA'] | Lens: [53642] → Tgt Spa: ['0.350'] [Step 282 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [61191] → Tgt Spa: ['1.000'] [Step 282 / Rank 2] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27885, 27905] → Tgt Spa: ['1.000', '1.000'] [Step 282 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32070, 32070] → Tgt Spa: ['0.350', '0.350'] [Step 282 / Rank 3] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [27885, 27905] → Tgt Spa: ['1.000', '1.000'] [Step 282 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [61191] → Tgt Spa: ['1.000'] [Step 282 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32070, 32070] → Tgt Spa: ['0.350', '0.350'] [Step 282 / Rank 0] Tasks: ['MultiHop QA', 
'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA'] | Lens: [2058, 2079, 2060, 2062, 2064, 2062, 2063, 2063, 2069, 2063, 2064, 2065, 2064, 2083, 2082, 2064, 2066, 2084, 2083, 2085, 2085, 2085, 2085, 2066, 2086, 2086, 2086, 2086, 2086, 2086, 2068] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [Step 282 / Rank 4] Tasks: ['Single QA'] | Lens: [60970] → Tgt Spa: ['0.350'] [Step 282 / Rank 2] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18168, 18180, 18170] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 282 / Rank 3] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [18168, 18180, 18170] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 282 / Rank 7] Tasks: ['Single QA', 'Summarization'] | Lens: [29691, 29710] → Tgt Spa: ['0.350', '1.000'] [Step 282 / Rank 5] Tasks: ['Single QA'] | Lens: [60970] → Tgt Spa: ['0.350'] [Step 282 / Rank 6] Tasks: ['Single QA', 'Summarization'] | Lens: [29691, 29710] → Tgt Spa: ['0.350', '1.000'] [Step 282 / Rank 1] Tasks: ['MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 
'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA'] | Lens: [2058, 2079, 2060, 2062, 2064, 2062, 2063, 2063, 2069, 2063, 2064, 2065, 2064, 2083, 2082, 2064, 2066, 2084, 2083, 2085, 2085, 2085, 2085, 2066, 2086, 2086, 2086, 2086, 2086, 2086, 2068] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:36:36,173 >> @ 282 | Loss: 2.3109 | LM: 2.2601 | Reg: 0.0508 | Spa(Avg): 0.500 [INFO|lh_trainer.py:797] 2026-02-17 07:36:36,173 >> Statistic -> Code | Spa: 0.717 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 07:36:36,173 >> Statistic -> In-Context | Spa: 0.722 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:36:36,173 >> Statistic -> MultiHop | Spa: 0.609 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:36:36,174 >> Statistic -> Single | Spa: 0.395 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:36:36,174 >> Statistic -> Summarization | Spa: 0.618 | Tgt: 1.000 | Z-Loss: 0.136 | [INFO|lh_trainer.py:810] 2026-02-17 07:36:36,175 >> [Micro-Log] {"loss": 2.310915903498729, "lm_loss": 2.2601338159292936, "reg_loss": 0.050782093002150454, "model_sparsity(avg)": 0.5004635763665041, "Spa-Single QA sparsity": 0.39484126511074247, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.029002531475964047, "Spa-Code sparsity": 0.7166666507720947, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09308751225471497, "Spa-In-Context Learning sparsity": 0.7222222089767456, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10571892559528351, "Spa-MultiHop QA sparsity": 0.6092592517534892, "Spa-MultiHop 
QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11066492373744646, "Spa-Summarization sparsity": 0.6180555522441864, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.13613535629378426, "step": 282, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:37:00,281 >> {'loss': 13.8655, 'grad_norm': 0.4338608384132385, 'learning_rate': 6.907569209828871e-06, 'epoch': 0.29805160610847814, 'num_input_tokens_seen': 696690606, 'completed': '94.33% (283 / 300)', 'remaining time': '0:47:46', 'throughput': '7306.89', 'gpu_mem_free': '7243MB', 'step': 283} [Step 283 / Rank 4] Tasks: ['Single QA'] | Lens: [43520] → Tgt Spa: ['0.350'] [Step 283 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25069, 25098] → Tgt Spa: ['1.000', '1.000'] [Step 283 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [25069, 25098] → Tgt Spa: ['1.000', '1.000'] [Step 283 / Rank 5] Tasks: ['Single QA'] | Lens: [43520] → Tgt Spa: ['0.350'] [Step 283 / Rank 0] Tasks: ['Single QA'] | Lens: [41234] → Tgt Spa: ['0.350'] [Step 283 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [33750] → Tgt Spa: ['1.000'] [Step 283 / Rank 1] Tasks: ['Single QA'] | Lens: [41234] → Tgt Spa: ['0.350'] [Step 283 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [33750] → Tgt Spa: ['1.000'] [Step 283 / Rank 5] Tasks: ['Single QA'] | Lens: [36213] → Tgt Spa: ['0.350'] [Step 283 / Rank 2] Tasks: ['In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Single QA'] | Lens: [4272, 4292, 4291, 4273, 4274, 4281, 4274, 4276, 4276, 4276, 4276, 4277, 4277, 4285, 4277] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', 
'1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 283 / Rank 1] Tasks: ['Single QA'] | Lens: [37288] → Tgt Spa: ['0.350'] [Step 283 / Rank 3] Tasks: ['In-Context Learning', 'Summarization', 'Summarization', 'In-Context Learning', 'Single QA', 'Code', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'Code', 'Single QA'] | Lens: [4272, 4292, 4291, 4273, 4274, 4281, 4274, 4276, 4276, 4276, 4276, 4277, 4277, 4285, 4277] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 283 / Rank 4] Tasks: ['Single QA'] | Lens: [36213] → Tgt Spa: ['0.350'] [Step 283 / Rank 0] Tasks: ['Single QA'] | Lens: [37288] → Tgt Spa: ['0.350'] [Step 283 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [43362] → Tgt Spa: ['1.000'] [Step 283 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [43362] → Tgt Spa: ['1.000'] [Step 283 / Rank 5] Tasks: ['Single QA'] | Lens: [39273] → Tgt Spa: ['0.350'] [Step 283 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17765, 17766, 17766] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 283 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17857, 17847, 17847] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 283 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [17857, 17847, 17847] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 283 / Rank 3] Tasks: ['Code'] | Lens: [39626] → Tgt Spa: ['1.000'] [Step 283 / Rank 2] Tasks: ['Code'] | Lens: [39626] → Tgt Spa: ['1.000'] [Step 283 / Rank 4] Tasks: ['Single QA'] | Lens: [39273] → Tgt Spa: ['0.350'] [Step 283 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17765, 17766, 17766] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 283 / Rank 6] Tasks: ['Single QA'] | Lens: [43248] → Tgt Spa: ['0.350'] [Step 283 / Rank 4] Tasks: ['Single QA'] | Lens: 
[56330] → Tgt Spa: ['0.350'] [Step 283 / Rank 3] Tasks: ['Single QA'] | Lens: [58410] → Tgt Spa: ['0.350'] [Step 283 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26769, 26770] → Tgt Spa: ['1.000', '1.000'] [Step 283 / Rank 5] Tasks: ['Single QA'] | Lens: [56330] → Tgt Spa: ['0.350'] [Step 283 / Rank 2] Tasks: ['Single QA'] | Lens: [58410] → Tgt Spa: ['0.350'] [Step 283 / Rank 7] Tasks: ['Single QA'] | Lens: [43248] → Tgt Spa: ['0.350'] [Step 283 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26769, 26770] → Tgt Spa: ['1.000', '1.000'] [Step 283 / Rank 5] Tasks: ['Single QA'] | Lens: [49394] → Tgt Spa: ['0.350'] [Step 283 / Rank 0] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 283 / Rank 6] Tasks: ['Single QA'] | Lens: [37287] → Tgt Spa: ['0.350'] [Step 283 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [65334] → Tgt Spa: ['0.350'] [Step 283 / Rank 2] Tasks: ['Single QA', 'Single QA'] | Lens: [29708, 29708] → Tgt Spa: ['0.350', '0.350'] [Step 283 / Rank 4] Tasks: ['Single QA'] | Lens: [49394] → Tgt Spa: ['0.350'] [Step 283 / Rank 7] Tasks: ['Single QA'] | Lens: [37287] → Tgt Spa: ['0.350'] [Step 283 / Rank 3] Tasks: ['Single QA', 'Single QA'] | Lens: [29708, 29708] → Tgt Spa: ['0.350', '0.350'] [Step 283 / Rank 3] Tasks: ['Single QA'] | Lens: [52958] → Tgt Spa: ['0.350'] [Step 283 / Rank 4] Tasks: ['Single QA'] | Lens: [42583] → Tgt Spa: ['0.350'] [Step 283 / Rank 6] Tasks: ['Single QA'] | Lens: [36930] → Tgt Spa: ['0.350'] [Step 283 / Rank 7] Tasks: ['Single QA'] | Lens: [36930] → Tgt Spa: ['0.350'] [Step 283 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [50994] → Tgt Spa: ['1.000'] [Step 283 / Rank 5] Tasks: ['Single QA'] | Lens: [42583] → Tgt Spa: ['0.350'] [Step 283 / Rank 2] Tasks: ['Single QA'] | Lens: [52958] → Tgt Spa: ['0.350'] [Step 283 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [50994] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:39:05,119 >> @ 283 | Loss: 2.1248 | 
LM: 2.0830 | Reg: 0.0419 | Spa(Avg): 0.485 [INFO|lh_trainer.py:797] 2026-02-17 07:39:05,119 >> Statistic -> Code | Spa: 0.719 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 07:39:05,119 >> Statistic -> In-Context | Spa: 0.704 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:39:05,119 >> Statistic -> MultiHop | Spa: 0.347 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:39:05,120 >> Statistic -> Single | Spa: 0.418 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:39:05,120 >> Statistic -> Summarization | Spa: 0.677 | Tgt: 1.000 | Z-Loss: 0.106 | [INFO|lh_trainer.py:810] 2026-02-17 07:39:05,122 >> [Micro-Log] {"loss": 2.1248357539686062, "lm_loss": 2.0829771536712847, "reg_loss": 0.041858617293958865, "model_sparsity(avg)": 0.48535879453023273, "Spa-Single QA sparsity": 0.41805554628372193, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04845239195856266, "Spa-Summarization sparsity": 0.6765873261860439, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10571115783282689, "Spa-In-Context Learning sparsity": 0.7037036965290705, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11366030449668567, "Spa-MultiHop QA sparsity": 0.3472222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.0007529668509960175, "Spa-Code sparsity": 0.7194444417953492, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09198781847953796, "step": 283, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:39:24,466 >> {'loss': 12.749, 'grad_norm': 0.4250311851501465, 'learning_rate': 6.1644692564298475e-06, 'epoch': 0.29910479199578727, 'num_input_tokens_seen': 698974368, 'completed': '94.67% (284 / 300)', 'remaining time': '0:44:56', 
'throughput': '7919.58', 'gpu_mem_free': '9793MB', 'step': 284} [Step 284 / Rank 4] Tasks: ['Single QA'] | Lens: [33970] → Tgt Spa: ['0.350'] [Step 284 / Rank 3] Tasks: ['Single QA'] | Lens: [58138] → Tgt Spa: ['0.350'] [Step 284 / Rank 1] Tasks: ['Single QA'] | Lens: [50575] → Tgt Spa: ['0.350'] [Step 284 / Rank 2] Tasks: ['Single QA'] | Lens: [58138] → Tgt Spa: ['0.350'] [Step 284 / Rank 5] Tasks: ['Single QA'] | Lens: [33970] → Tgt Spa: ['0.350'] [Step 284 / Rank 0] Tasks: ['Single QA'] | Lens: [50575] → Tgt Spa: ['0.350'] [Step 284 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31102, 31102] → Tgt Spa: ['0.350', '0.350'] [Step 284 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31102, 31102] → Tgt Spa: ['0.350', '0.350'] [Step 284 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [64242] → Tgt Spa: ['1.000'] [Step 284 / Rank 7] Tasks: ['Single QA'] | Lens: [40629] → Tgt Spa: ['0.350'] [Step 284 / Rank 0] Tasks: ['MultiHop QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA'] | Lens: [3625, 3626, 3627, 3627, 3628, 3634, 3629, 3628, 3628, 3648, 3636, 3629, 3629, 3629, 3631, 3631, 3631, 3632] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 284 / Rank 2] Tasks: ['Single QA'] | Lens: [65092] → Tgt Spa: ['0.350'] [Step 284 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [64242] → Tgt Spa: ['1.000'] [Step 284 / Rank 3] Tasks: ['Single QA'] | Lens: [65092] → Tgt Spa: ['0.350'] [Step 284 / Rank 1] Tasks: ['MultiHop QA', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Code', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'Summarization', 'Code', 
'Single QA', 'In-Context Learning', 'In-Context Learning', 'MultiHop QA', 'In-Context Learning', 'In-Context Learning', 'Single QA'] | Lens: [3625, 3626, 3627, 3627, 3628, 3634, 3629, 3628, 3628, 3648, 3636, 3629, 3629, 3629, 3631, 3631, 3631, 3632] → Tgt Spa: ['0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350'] [Step 284 / Rank 6] Tasks: ['Single QA'] | Lens: [40629] → Tgt Spa: ['0.350'] [Step 284 / Rank 7] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [21756, 21746, 21757] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 284 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [26661, 26654] → Tgt Spa: ['1.000', '1.000'] [Step 284 / Rank 5] Tasks: ['Single QA'] | Lens: [53558] → Tgt Spa: ['0.350'] [Step 284 / Rank 0] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [4833, 4827, 4827, 4828, 4830, 4840, 4832, 4834, 4834, 4836, 4837, 4837, 4837] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 284 / Rank 6] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [21756, 21746, 21757] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 284 / Rank 4] Tasks: ['Single QA'] | Lens: [53558] → Tgt Spa: ['0.350'] [Step 284 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [26661, 26654] → Tgt Spa: ['1.000', '1.000'] [Step 284 / Rank 1] Tasks: ['Code', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Code', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [4833, 4827, 4827, 4828, 4830, 4840, 4832, 4834, 4834, 4836, 
4837, 4837, 4837] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 284 / Rank 2] Tasks: ['Code'] | Lens: [51859] → Tgt Spa: ['1.000'] [Step 284 / Rank 4] Tasks: ['Single QA'] | Lens: [44329] → Tgt Spa: ['0.350'] [Step 284 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Code', 'Code'] | Lens: [13164, 13166, 13202, 13206] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000'] [Step 284 / Rank 5] Tasks: ['Single QA'] | Lens: [44329] → Tgt Spa: ['0.350'] [Step 284 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Code', 'Code'] | Lens: [13164, 13166, 13202, 13206] → Tgt Spa: ['0.350', '0.350', '1.000', '1.000'] [Step 284 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [21915, 21919] → Tgt Spa: ['1.000', '1.000'] [Step 284 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [21915, 21919] → Tgt Spa: ['1.000', '1.000'] [Step 284 / Rank 3] Tasks: ['Code'] | Lens: [51859] → Tgt Spa: ['1.000'] [Step 284 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58573] → Tgt Spa: ['1.000'] [Step 284 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [47706] → Tgt Spa: ['1.000'] [Step 284 / Rank 7] Tasks: ['Single QA'] | Lens: [46721] → Tgt Spa: ['0.350'] [Step 284 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [47706] → Tgt Spa: ['1.000'] [Step 284 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [41109] → Tgt Spa: ['1.000'] [Step 284 / Rank 6] Tasks: ['Single QA'] | Lens: [46721] → Tgt Spa: ['0.350'] [Step 284 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [41109] → Tgt Spa: ['1.000'] [Step 284 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58573] → Tgt Spa: ['1.000'] [Step 284 / Rank 6] Tasks: ['Single QA'] | Lens: [59073] → Tgt Spa: ['0.350'] [Step 284 / Rank 5] Tasks: ['Code'] | Lens: [42581] → Tgt Spa: ['1.000'] [Step 284 / Rank 4] Tasks: ['Code'] | Lens: [42581] → Tgt Spa: ['1.000'] [Step 284 / Rank 0] Tasks: ['Single QA'] | Lens: [55140] → Tgt Spa: ['0.350'] 
[Step 284 / Rank 2] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17050, 17050, 17053] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 284 / Rank 7] Tasks: ['Single QA'] | Lens: [59073] → Tgt Spa: ['0.350'] [Step 284 / Rank 3] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17050, 17050, 17053] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 284 / Rank 1] Tasks: ['Single QA'] | Lens: [55140] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:41:55,507 >> @ 284 | Loss: 2.2196 | LM: 2.1566 | Reg: 0.0630 | Spa(Avg): 0.543 [INFO|lh_trainer.py:797] 2026-02-17 07:41:55,507 >> Statistic -> Code | Spa: 0.710 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 07:41:55,507 >> Statistic -> In-Context | Spa: 0.716 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:41:55,507 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:41:55,507 >> Statistic -> Single | Spa: 0.444 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:41:55,507 >> Statistic -> Summarization | Spa: 0.646 | Tgt: 1.000 | Z-Loss: 0.125 | [INFO|lh_trainer.py:810] 2026-02-17 07:41:55,509 >> [Micro-Log] {"loss": 2.219637838502725, "lm_loss": 2.1566276233643293, "reg_loss": 0.0630102191886787, "model_sparsity(avg)": 0.542940312375625, "Spa-Single QA sparsity": 0.4444444321450733, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.0653656699495124, "Spa-MultiHop QA sparsity": 0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.129756398499012, "Spa-In-Context Learning sparsity": 0.7164351840813955, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10816159720222156, "Spa-Code sparsity": 0.7097222208976746, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.0958533026278019, "Spa-Summarization sparsity": 0.6458333432674408, "Spa-Summarization 
target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12494925906260808, "step": 284, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:42:18,593 >> {'loss': 13.3178, 'grad_norm': 0.57984459400177, 'learning_rate': 5.463149270238596e-06, 'epoch': 0.3001579778830964, 'num_input_tokens_seen': 701514324, 'completed': '95.00% (285 / 300)', 'remaining time': '0:42:07', 'throughput': '7293.41', 'gpu_mem_free': '8277MB', 'step': 285} [Step 285 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [31109, 31110] → Tgt Spa: ['0.350', '0.350'] [Step 285 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22262, 22247] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [31109, 31110] → Tgt Spa: ['0.350', '0.350'] [Step 285 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22262, 22247] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24681, 24682] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 3] Tasks: ['Single QA'] | Lens: [35034] → Tgt Spa: ['0.350'] [Step 285 / Rank 2] Tasks: ['Single QA'] | Lens: [35034] → Tgt Spa: ['0.350'] [Step 285 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24681, 24682] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 1] Tasks: ['Single QA'] | Lens: [60323] → Tgt Spa: ['0.350'] [Step 285 / Rank 6] Tasks: ['Single QA'] | Lens: [36060] → Tgt Spa: ['0.350'] [Step 285 / Rank 7] Tasks: ['Single QA'] | Lens: [36060] → Tgt Spa: ['0.350'] [Step 285 / Rank 3] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24284, 24286] → Tgt Spa: ['0.350', '1.000'] [Step 285 / Rank 5] Tasks: ['Single QA'] | Lens: [61979] → Tgt Spa: ['0.350'] [Step 285 / Rank 0] Tasks: ['Single QA'] | Lens: [60323] → Tgt Spa: ['0.350'] [Step 285 / Rank 2] Tasks: ['Single QA', 
'In-Context Learning'] | Lens: [24284, 24286] → Tgt Spa: ['0.350', '1.000'] [Step 285 / Rank 4] Tasks: ['Single QA'] | Lens: [61979] → Tgt Spa: ['0.350'] [Step 285 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25680, 25681] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 2] Tasks: ['Single QA'] | Lens: [40005] → Tgt Spa: ['0.350'] [Step 285 / Rank 7] Tasks: ['Single QA'] | Lens: [34410] → Tgt Spa: ['0.350'] [Step 285 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [43591] → Tgt Spa: ['1.000'] [Step 285 / Rank 6] Tasks: ['Single QA'] | Lens: [34410] → Tgt Spa: ['0.350'] [Step 285 / Rank 3] Tasks: ['Single QA'] | Lens: [40005] → Tgt Spa: ['0.350'] [Step 285 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25680, 25681] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [43591] → Tgt Spa: ['1.000'] [Step 285 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [23775, 23777] → Tgt Spa: ['0.350', '0.350'] [Step 285 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [53016] → Tgt Spa: ['1.000'] [Step 285 / Rank 1] Tasks: ['Single QA'] | Lens: [49337] → Tgt Spa: ['0.350'] [Step 285 / Rank 6] Tasks: ['Single QA'] | Lens: [55464] → Tgt Spa: ['0.350'] [Step 285 / Rank 7] Tasks: ['Single QA'] | Lens: [55464] → Tgt Spa: ['0.350'] [Step 285 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [53016] → Tgt Spa: ['1.000'] [Step 285 / Rank 0] Tasks: ['Single QA'] | Lens: [49337] → Tgt Spa: ['0.350'] [Step 285 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [23775, 23777] → Tgt Spa: ['0.350', '0.350'] [Step 285 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [30194, 30194] → Tgt Spa: ['0.350', '0.350'] [Step 285 / Rank 2] Tasks: ['Single QA'] | Lens: [43175] → Tgt Spa: ['0.350'] [Step 285 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [54816] → Tgt Spa: ['1.000'] [Step 285 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [30194, 30194] → Tgt Spa: ['0.350', '0.350'] [Step 285 / Rank 5] 
Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26516, 26516] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26516, 26516] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [54816] → Tgt Spa: ['1.000'] [Step 285 / Rank 3] Tasks: ['Single QA'] | Lens: [43175] → Tgt Spa: ['0.350'] [Step 285 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [43446] → Tgt Spa: ['1.000'] [Step 285 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [47193] → Tgt Spa: ['1.000'] [Step 285 / Rank 5] Tasks: ['Single QA'] | Lens: [41007] → Tgt Spa: ['0.350'] [Step 285 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [27510, 27505] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [27510, 27505] → Tgt Spa: ['1.000', '1.000'] [Step 285 / Rank 4] Tasks: ['Single QA'] | Lens: [41007] → Tgt Spa: ['0.350'] [Step 285 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [43446] → Tgt Spa: ['1.000'] [Step 285 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [47193] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:44:31,305 >> @ 285 | Loss: 2.3863 | LM: 2.3294 | Reg: 0.0569 | Spa(Avg): 0.523 [INFO|lh_trainer.py:797] 2026-02-17 07:44:31,305 >> Statistic -> Code | Spa: 0.722 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 07:44:31,305 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:44:31,305 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:44:31,305 >> Statistic -> Single | Spa: 0.383 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:44:31,305 >> Statistic -> Summarization | Spa: 0.653 | Tgt: 1.000 | Z-Loss: 0.115 | [INFO|lh_trainer.py:810] 2026-02-17 07:44:31,308 >> [Micro-Log] {"loss": 2.386275698741277, "lm_loss": 2.3293742102881274, "reg_loss": 0.05690150528001444, 
"model_sparsity(avg)": 0.5228587885697683, "Spa-In-Context Learning sparsity": 0.7182539616312299, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10739467080150332, "Spa-Single QA sparsity": 0.38316992801778454, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.023395589299445206, "Spa-Summarization sparsity": 0.6527777910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11543932557106018, "Spa-Code sparsity": 0.7222222089767456, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09090471267700195, "Spa-MultiHop QA sparsity": 0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.129756398499012, "step": 285, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:44:47,327 >> {'loss': 14.3177, 'grad_norm': 0.7126944661140442, 'learning_rate': 4.803729418824403e-06, 'epoch': 0.30121116377040547, 'num_input_tokens_seen': 703856054, 'completed': '95.33% (286 / 300)', 'remaining time': '0:39:18', 'throughput': '7872.17', 'gpu_mem_free': '11069MB', 'step': 286} [Step 286 / Rank 5] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13347, 13347, 13349, 13349] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 286 / Rank 4] Tasks: ['Code', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13347, 13347, 13349, 13349] → Tgt Spa: ['1.000', '0.350', '0.350', '0.350'] [Step 286 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31621, 31621] → Tgt Spa: ['0.350', '0.350'] [Step 286 / Rank 3] Tasks: ['Single QA'] | Lens: [36938] → Tgt Spa: ['0.350'] [Step 286 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [41387] → Tgt Spa: ['1.000'] [Step 286 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31621, 31621] → Tgt Spa: ['0.350', '0.350'] [Step 286 / Rank 2] Tasks: ['Single 
QA'] | Lens: [36938] → Tgt Spa: ['0.350'] [Step 286 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [41387] → Tgt Spa: ['1.000'] [Step 286 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [30630, 30630] → Tgt Spa: ['0.350', '0.350'] [Step 286 / Rank 5] Tasks: ['Single QA'] | Lens: [55708] → Tgt Spa: ['0.350'] [Step 286 / Rank 2] Tasks: ['Single QA'] | Lens: [58405] → Tgt Spa: ['0.350'] [Step 286 / Rank 4] Tasks: ['Single QA'] | Lens: [55708] → Tgt Spa: ['0.350'] [Step 286 / Rank 7] Tasks: ['Code'] | Lens: [37196] → Tgt Spa: ['1.000'] [Step 286 / Rank 6] Tasks: ['Code'] | Lens: [37196] → Tgt Spa: ['1.000'] [Step 286 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [30630, 30630] → Tgt Spa: ['0.350', '0.350'] [Step 286 / Rank 3] Tasks: ['Single QA'] | Lens: [58405] → Tgt Spa: ['0.350'] [Step 286 / Rank 7] Tasks: ['Single QA'] | Lens: [65040] → Tgt Spa: ['0.350'] [Step 286 / Rank 4] Tasks: ['Code'] | Lens: [54417] → Tgt Spa: ['1.000'] [Step 286 / Rank 5] Tasks: ['Code'] | Lens: [54417] → Tgt Spa: ['1.000'] [Step 286 / Rank 2] Tasks: ['Single QA'] | Lens: [48691] → Tgt Spa: ['0.350'] [Step 286 / Rank 1] Tasks: ['Code', 'In-Context Learning'] | Lens: [25737, 25730] → Tgt Spa: ['1.000', '1.000'] [Step 286 / Rank 3] Tasks: ['Single QA'] | Lens: [48691] → Tgt Spa: ['0.350'] [Step 286 / Rank 6] Tasks: ['Single QA'] | Lens: [65040] → Tgt Spa: ['0.350'] [Step 286 / Rank 0] Tasks: ['Code', 'In-Context Learning'] | Lens: [25737, 25730] → Tgt Spa: ['1.000', '1.000'] [Step 286 / Rank 5] Tasks: ['Single QA'] | Lens: [57445] → Tgt Spa: ['0.350'] [Step 286 / Rank 4] Tasks: ['Single QA'] | Lens: [57445] → Tgt Spa: ['0.350'] [Step 286 / Rank 2] Tasks: ['Single QA'] | Lens: [52674] → Tgt Spa: ['0.350'] [Step 286 / Rank 0] Tasks: ['Single QA'] | Lens: [38621] → Tgt Spa: ['0.350'] [Step 286 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [35504] → Tgt Spa: ['1.000'] [Step 286 / Rank 3] Tasks: ['Single QA'] | Lens: [52674] → Tgt Spa: ['0.350'] [Step 286 / Rank 1] Tasks: ['Single 
QA'] | Lens: [38621] → Tgt Spa: ['0.350'] [Step 286 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [35504] → Tgt Spa: ['1.000'] [Step 286 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [31092, 31093] → Tgt Spa: ['0.350', '0.350'] [Step 286 / Rank 2] Tasks: ['Single QA'] | Lens: [56252] → Tgt Spa: ['0.350'] [Step 286 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41059] → Tgt Spa: ['1.000'] [Step 286 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [30028, 30029] → Tgt Spa: ['1.000', '1.000'] [Step 286 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [30028, 30029] → Tgt Spa: ['1.000', '1.000'] [Step 286 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [41059] → Tgt Spa: ['1.000'] [Step 286 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [31092, 31093] → Tgt Spa: ['0.350', '0.350'] [Step 286 / Rank 3] Tasks: ['Single QA'] | Lens: [56252] → Tgt Spa: ['0.350'] [Step 286 / Rank 4] Tasks: ['Code'] | Lens: [47013] → Tgt Spa: ['1.000'] [Step 286 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27643, 27644] → Tgt Spa: ['1.000', '1.000'] [Step 286 / Rank 6] Tasks: ['Single QA'] | Lens: [65048] → Tgt Spa: ['0.350'] [Step 286 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27643, 27644] → Tgt Spa: ['1.000', '1.000'] [Step 286 / Rank 5] Tasks: ['Code'] | Lens: [47013] → Tgt Spa: ['1.000'] [Step 286 / Rank 7] Tasks: ['Single QA'] | Lens: [65048] → Tgt Spa: ['0.350'] [Step 286 / Rank 2] Tasks: ['Summarization'] | Lens: [63872] → Tgt Spa: ['1.000'] [Step 286 / Rank 3] Tasks: ['Summarization'] | Lens: [63872] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:47:19,597 >> @ 286 | Loss: 2.1376 | LM: 2.0845 | Reg: 0.0530 | Spa(Avg): 0.523 [INFO|lh_trainer.py:797] 2026-02-17 07:47:19,598 >> Statistic -> Code | Spa: 0.717 | Tgt: 1.000 | Z-Loss: 0.093 | [INFO|lh_trainer.py:797] 2026-02-17 07:47:19,598 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 
| [INFO|lh_trainer.py:797] 2026-02-17 07:47:19,598 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:47:19,598 >> Statistic -> Single | Spa: 0.375 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:47:19,598 >> Statistic -> Summarization | Spa: 0.722 | Tgt: 1.000 | Z-Loss: 0.083 | [INFO|lh_trainer.py:810] 2026-02-17 07:47:19,600 >> [Micro-Log] {"loss": 2.1375539054473243, "lm_loss": 2.084522428611914, "reg_loss": 0.05303148284535079, "model_sparsity(avg)": 0.5230034651855627, "Spa-In-Context Learning sparsity": 0.7187499850988388, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10718857310712337, "Spa-Single QA sparsity": 0.3749999937258269, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.015749919730679768, "Spa-Code sparsity": 0.7166666746139526, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09307092428207397, "Spa-Summarization sparsity": 0.7222222089767456, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08315852284431458, "Spa-MultiHop QA sparsity": 0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.129756398499012, "step": 286, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:47:48,491 >> {'loss': 12.8253, 'grad_norm': 0.5229653120040894, 'learning_rate': 4.1863226903840625e-06, 'epoch': 0.3022643496577146, 'num_input_tokens_seen': 706380374, 'completed': '95.67% (287 / 300)', 'remaining time': '0:36:30', 'throughput': '6966.98', 'gpu_mem_free': '8769MB', 'step': 287} [Step 287 / Rank 5] Tasks: ['Code', 'In-Context Learning'] | Lens: [25469, 25461] → Tgt Spa: ['1.000', '1.000'] [Step 287 / Rank 1] Tasks: ['Single QA'] | Lens: [44911] → Tgt Spa: ['0.350'] [Step 287 / Rank 4] Tasks: 
['Code', 'In-Context Learning'] | Lens: [25469, 25461] → Tgt Spa: ['1.000', '1.000'] [Step 287 / Rank 0] Tasks: ['Single QA'] | Lens: [44911] → Tgt Spa: ['0.350'] [Step 287 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [28058, 28051] → Tgt Spa: ['1.000', '1.000'] [Step 287 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16597, 16599, 16589] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 287 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [28058, 28051] → Tgt Spa: ['1.000', '1.000'] [Step 287 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [16597, 16599, 16589] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 287 / Rank 4] Tasks: ['Single QA'] | Lens: [43443] → Tgt Spa: ['0.350'] [Step 287 / Rank 6] Tasks: ['Single QA'] | Lens: [49461] → Tgt Spa: ['0.350'] [Step 287 / Rank 3] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [8035, 8042, 8036, 8041, 8040, 8043, 8052, 8044] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 287 / Rank 5] Tasks: ['Single QA'] | Lens: [43443] → Tgt Spa: ['0.350'] [Step 287 / Rank 7] Tasks: ['Single QA'] | Lens: [49461] → Tgt Spa: ['0.350'] [Step 287 / Rank 2] Tasks: ['Single QA', 'Code', 'Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [8035, 8042, 8036, 8041, 8040, 8043, 8052, 8044] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350'] [Step 287 / Rank 1] Tasks: ['In-Context Learning', 'Summarization', 'Summarization'] | Lens: [20805, 20824, 20827] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 287 / Rank 0] Tasks: ['In-Context Learning', 'Summarization', 'Summarization'] | Lens: [20805, 20824, 20827] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 287 / Rank 5] Tasks: ['Single QA'] | Lens: [36223] → Tgt Spa: ['0.350'] [Step 287 / Rank 3] Tasks: ['Single QA'] | Lens: [57077] → Tgt Spa: ['0.350'] [Step 287 / Rank 2] Tasks: 
['Single QA'] | Lens: [57077] → Tgt Spa: ['0.350'] [Step 287 / Rank 0] Tasks: ['Summarization'] | Lens: [34621] → Tgt Spa: ['1.000'] [Step 287 / Rank 1] Tasks: ['Summarization'] | Lens: [34621] → Tgt Spa: ['1.000'] [Step 287 / Rank 6] Tasks: ['Single QA'] | Lens: [49974] → Tgt Spa: ['0.350'] [Step 287 / Rank 4] Tasks: ['Single QA'] | Lens: [36223] → Tgt Spa: ['0.350'] [Step 287 / Rank 7] Tasks: ['Single QA'] | Lens: [49974] → Tgt Spa: ['0.350'] [Step 287 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [53458] → Tgt Spa: ['1.000'] [Step 287 / Rank 6] Tasks: ['Code'] | Lens: [53132] → Tgt Spa: ['1.000'] [Step 287 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39386] → Tgt Spa: ['1.000'] [Step 287 / Rank 7] Tasks: ['Code'] | Lens: [53132] → Tgt Spa: ['1.000'] [Step 287 / Rank 3] Tasks: ['Single QA'] | Lens: [53577] → Tgt Spa: ['0.350'] [Step 287 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [53458] → Tgt Spa: ['1.000'] [Step 287 / Rank 2] Tasks: ['Single QA'] | Lens: [53577] → Tgt Spa: ['0.350'] [Step 287 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39386] → Tgt Spa: ['1.000'] [Step 287 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [35705] → Tgt Spa: ['1.000'] [Step 287 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [41198] → Tgt Spa: ['1.000'] [Step 287 / Rank 1] Tasks: ['Single QA'] | Lens: [51036] → Tgt Spa: ['0.350'] [Step 287 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [35705] → Tgt Spa: ['1.000'] [Step 287 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [41198] → Tgt Spa: ['1.000'] [Step 287 / Rank 0] Tasks: ['Single QA'] | Lens: [51036] → Tgt Spa: ['0.350'] [Step 287 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [56766] → Tgt Spa: ['1.000'] [Step 287 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [56766] → Tgt Spa: ['1.000'] [Step 287 / Rank 2] Tasks: ['Single QA'] | Lens: [44042] → Tgt Spa: ['0.350'] [Step 287 / Rank 0] Tasks: ['Single QA'] | Lens: [49981] → Tgt Spa: ['0.350'] [Step 287 / Rank 6] Tasks: ['In-Context Learning', 
'Code'] | Lens: [24159, 24167] → Tgt Spa: ['1.000', '1.000'] [Step 287 / Rank 3] Tasks: ['Single QA'] | Lens: [44042] → Tgt Spa: ['0.350'] [Step 287 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [50656] → Tgt Spa: ['1.000'] [Step 287 / Rank 1] Tasks: ['Single QA'] | Lens: [49981] → Tgt Spa: ['0.350'] [Step 287 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [50656] → Tgt Spa: ['1.000'] [Step 287 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [24159, 24167] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:50:00,142 >> @ 287 | Loss: 2.1525 | LM: 2.0863 | Reg: 0.0661 | Spa(Avg): 0.558 [INFO|lh_trainer.py:797] 2026-02-17 07:50:00,142 >> Statistic -> Code | Spa: 0.702 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 07:50:00,142 >> Statistic -> In-Context | Spa: 0.713 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:50:00,142 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:50:00,142 >> Statistic -> Single | Spa: 0.411 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:50:00,143 >> Statistic -> Summarization | Spa: 0.642 | Tgt: 1.000 | Z-Loss: 0.128 | [INFO|lh_trainer.py:810] 2026-02-17 07:50:00,145 >> [Micro-Log] {"loss": 2.152472397312522, "lm_loss": 2.08634019891421, "reg_loss": 0.06613221023386966, "model_sparsity(avg)": 0.5577015777428945, "Spa-Single QA sparsity": 0.4105902723968029, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.040076624194625765, "Spa-In-Context Learning sparsity": 0.7125, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10984740257263184, "Spa-Summarization sparsity": 0.6416666626930236, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12773799300193786, "Spa-Code sparsity": 0.7023809381893703, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09888987668922969, "Spa-MultiHop QA sparsity": 
0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.129756398499012, "step": 287, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:50:18,071 >> {'loss': 12.9148, 'grad_norm': 0.676238477230072, 'learning_rate': 3.6110348743820393e-06, 'epoch': 0.3033175355450237, 'num_input_tokens_seen': 708733546, 'completed': '96.00% (288 / 300)', 'remaining time': '0:33:41', 'throughput': '7865.93', 'gpu_mem_free': '10915MB', 'step': 288} [Step 288 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [64104] → Tgt Spa: ['1.000'] [Step 288 / Rank 3] Tasks: ['Single QA'] | Lens: [52167] → Tgt Spa: ['0.350'] [Step 288 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [64104] → Tgt Spa: ['1.000'] [Step 288 / Rank 7] Tasks: ['Single QA'] | Lens: [42298] → Tgt Spa: ['0.350'] [Step 288 / Rank 4] Tasks: ['Single QA'] | Lens: [53574] → Tgt Spa: ['0.350'] [Step 288 / Rank 5] Tasks: ['Single QA'] | Lens: [53574] → Tgt Spa: ['0.350'] [Step 288 / Rank 2] Tasks: ['Single QA'] | Lens: [52167] → Tgt Spa: ['0.350'] [Step 288 / Rank 6] Tasks: ['Single QA'] | Lens: [42298] → Tgt Spa: ['0.350'] [Step 288 / Rank 1] Tasks: ['Single QA'] | Lens: [57588] → Tgt Spa: ['0.350'] [Step 288 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [62139] → Tgt Spa: ['1.000'] [Step 288 / Rank 5] Tasks: ['Single QA'] | Lens: [51216] → Tgt Spa: ['0.350'] [Step 288 / Rank 0] Tasks: ['Single QA'] | Lens: [57588] → Tgt Spa: ['0.350'] [Step 288 / Rank 4] Tasks: ['Single QA'] | Lens: [51216] → Tgt Spa: ['0.350'] [Step 288 / Rank 6] Tasks: ['Single QA'] | Lens: [65040] → Tgt Spa: ['0.350'] [Step 288 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [62139] → Tgt Spa: ['1.000'] [Step 288 / Rank 7] Tasks: ['Single QA'] | Lens: [65040] → Tgt Spa: ['0.350'] [Step 288 / Rank 1] Tasks: ['Code'] | Lens: [38910] → Tgt Spa: ['1.000'] [Step 288 / Rank 
4] Tasks: ['In-Context Learning'] | Lens: [62196] → Tgt Spa: ['1.000'] [Step 288 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [62196] → Tgt Spa: ['1.000'] [Step 288 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [60506] → Tgt Spa: ['1.000'] [Step 288 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [42764] → Tgt Spa: ['1.000'] [Step 288 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [42764] → Tgt Spa: ['1.000'] [Step 288 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [60506] → Tgt Spa: ['1.000'] [Step 288 / Rank 0] Tasks: ['Code'] | Lens: [38910] → Tgt Spa: ['1.000'] [Step 288 / Rank 6] Tasks: ['Single QA'] | Lens: [57447] → Tgt Spa: ['0.350'] [Step 288 / Rank 7] Tasks: ['Single QA'] | Lens: [57447] → Tgt Spa: ['0.350'] [Step 288 / Rank 1] Tasks: ['Single QA', 'Single QA'] | Lens: [31100, 31100] → Tgt Spa: ['0.350', '0.350'] [Step 288 / Rank 4] Tasks: ['Single QA'] | Lens: [52530] → Tgt Spa: ['0.350'] [Step 288 / Rank 2] Tasks: ['Single QA'] | Lens: [45920] → Tgt Spa: ['0.350'] [Step 288 / Rank 3] Tasks: ['Single QA'] | Lens: [45920] → Tgt Spa: ['0.350'] [Step 288 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [31100, 31100] → Tgt Spa: ['0.350', '0.350'] [Step 288 / Rank 5] Tasks: ['Single QA'] | Lens: [52530] → Tgt Spa: ['0.350'] [Step 288 / Rank 2] Tasks: ['Single QA'] | Lens: [36984] → Tgt Spa: ['0.350'] [Step 288 / Rank 5] Tasks: ['Code'] | Lens: [62892] → Tgt Spa: ['1.000'] [Step 288 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58338] → Tgt Spa: ['1.000'] [Step 288 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58338] → Tgt Spa: ['1.000'] [Step 288 / Rank 4] Tasks: ['Code'] | Lens: [62892] → Tgt Spa: ['1.000'] [Step 288 / Rank 6] Tasks: ['Single QA'] | Lens: [63932] → Tgt Spa: ['0.350'] [Step 288 / Rank 3] Tasks: ['Single QA'] | Lens: [36984] → Tgt Spa: ['0.350'] [Step 288 / Rank 7] Tasks: ['Single QA'] | Lens: [63932] → Tgt Spa: ['0.350'] [Step 288 / Rank 2] Tasks: ['Single QA'] | Lens: [38716] → Tgt Spa: ['0.350'] [Step 288 / Rank 5] 
Tasks: ['In-Context Learning'] | Lens: [41228] → Tgt Spa: ['1.000'] [Step 288 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [41228] → Tgt Spa: ['1.000'] [Step 288 / Rank 6] Tasks: ['Single QA'] | Lens: [65022] → Tgt Spa: ['0.350'] [Step 288 / Rank 1] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23923, 23945] → Tgt Spa: ['1.000', '1.000'] [Step 288 / Rank 7] Tasks: ['Single QA'] | Lens: [65022] → Tgt Spa: ['0.350'] [Step 288 / Rank 3] Tasks: ['Single QA'] | Lens: [38716] → Tgt Spa: ['0.350'] [Step 288 / Rank 0] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23923, 23945] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 07:53:11,840 >> @ 288 | Loss: 2.3145 | LM: 2.2662 | Reg: 0.0483 | Spa(Avg): 0.509 [INFO|lh_trainer.py:797] 2026-02-17 07:53:11,840 >> Statistic -> Code | Spa: 0.722 | Tgt: 1.000 | Z-Loss: 0.091 | [INFO|lh_trainer.py:797] 2026-02-17 07:53:11,840 >> Statistic -> In-Context | Spa: 0.720 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:53:11,840 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:53:11,840 >> Statistic -> Single | Spa: 0.358 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:53:11,841 >> Statistic -> Summarization | Spa: 0.681 | Tgt: 1.000 | Z-Loss: 0.102 | [INFO|lh_trainer.py:810] 2026-02-17 07:53:11,843 >> [Micro-Log] {"loss": 2.3145369552075863, "lm_loss": 2.2662436527510486, "reg_loss": 0.04829332627802311, "model_sparsity(avg)": 0.5089698980251948, "Spa-In-Context Learning sparsity": 0.7204861044883728, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10645037889480591, "Spa-Single QA sparsity": 0.35833332141240437, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.00859454933864375, "Spa-Code sparsity": 0.7222222089767456, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09090471267700195, "Spa-Summarization sparsity": 
0.6805555820465088, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10197541862726212, "Spa-MultiHop QA sparsity": 0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.129756398499012, "step": 288, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:53:38,526 >> {'loss': 13.8872, 'grad_norm': 0.5809754729270935, 'learning_rate': 3.0779645434241003e-06, 'epoch': 0.3043707214323328, 'num_input_tokens_seen': 711304704, 'completed': '96.33% (289 / 300)', 'remaining time': '0:30:54', 'throughput': '6413.29', 'gpu_mem_free': '10041MB', 'step': 289} [Step 289 / Rank 3] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18756, 18746, 18748] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 289 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [23286, 23287] → Tgt Spa: ['0.350', '0.350'] [Step 289 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [60347] → Tgt Spa: ['1.000'] [Step 289 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22776, 22779] → Tgt Spa: ['1.000', '1.000'] [Step 289 / Rank 2] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [18756, 18746, 18748] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 289 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22776, 22779] → Tgt Spa: ['1.000', '1.000'] [Step 289 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [60347] → Tgt Spa: ['1.000'] [Step 289 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [23286, 23287] → Tgt Spa: ['0.350', '0.350'] [Step 289 / Rank 4] Tasks: ['Single QA'] | Lens: [49590] → Tgt Spa: ['0.350'] [Step 289 / Rank 1] Tasks: ['Summarization', 'Summarization', 'In-Context Learning'] | Lens: [21639, 21640, 21624] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 289 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [55913] → Tgt Spa: ['1.000'] [Step 289 
/ Rank 7] Tasks: ['Single QA'] | Lens: [41128] → Tgt Spa: ['0.350'] [Step 289 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [55913] → Tgt Spa: ['1.000'] [Step 289 / Rank 5] Tasks: ['Single QA'] | Lens: [49590] → Tgt Spa: ['0.350'] [Step 289 / Rank 0] Tasks: ['Summarization', 'Summarization', 'In-Context Learning'] | Lens: [21639, 21640, 21624] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 289 / Rank 6] Tasks: ['Single QA'] | Lens: [41128] → Tgt Spa: ['0.350'] [Step 289 / Rank 7] Tasks: ['Single QA'] | Lens: [65030] → Tgt Spa: ['0.350'] [Step 289 / Rank 5] Tasks: ['Single QA'] | Lens: [64031] → Tgt Spa: ['0.350'] [Step 289 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [19238, 19238, 19238] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 289 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29223, 29224] → Tgt Spa: ['1.000', '1.000'] [Step 289 / Rank 4] Tasks: ['Single QA'] | Lens: [64031] → Tgt Spa: ['0.350'] [Step 289 / Rank 6] Tasks: ['Single QA'] | Lens: [65030] → Tgt Spa: ['0.350'] [Step 289 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [29223, 29224] → Tgt Spa: ['1.000', '1.000'] [Step 289 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [19238, 19238, 19238] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 289 / Rank 2] Tasks: ['Single QA'] | Lens: [39821] → Tgt Spa: ['0.350'] [Step 289 / Rank 3] Tasks: ['Single QA'] | Lens: [39821] → Tgt Spa: ['0.350'] [Step 289 / Rank 4] Tasks: ['Single QA'] | Lens: [51551] → Tgt Spa: ['0.350'] [Step 289 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56736] → Tgt Spa: ['1.000'] [Step 289 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [56736] → Tgt Spa: ['1.000'] [Step 289 / Rank 1] Tasks: ['Single QA'] | Lens: [39070] → Tgt Spa: ['0.350'] [Step 289 / Rank 5] Tasks: ['Single QA'] | Lens: [51551] → Tgt Spa: ['0.350'] [Step 289 / Rank 0] Tasks: ['Single QA'] | Lens: [39070] → Tgt Spa: ['0.350'] [Step 289 / Rank 1] Tasks: ['Single QA'] | Lens: [38813] → Tgt Spa: ['0.350'] 
[Step 289 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [63635] → Tgt Spa: ['1.000'] [Step 289 / Rank 0] Tasks: ['Single QA'] | Lens: [38813] → Tgt Spa: ['0.350'] [Step 289 / Rank 5] Tasks: ['Single QA'] | Lens: [37499] → Tgt Spa: ['0.350'] [Step 289 / Rank 4] Tasks: ['Single QA'] | Lens: [37499] → Tgt Spa: ['0.350'] [Step 289 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [63635] → Tgt Spa: ['1.000'] [Step 289 / Rank 7] Tasks: ['Single QA'] | Lens: [56331] → Tgt Spa: ['0.350'] [Step 289 / Rank 6] Tasks: ['Single QA'] | Lens: [56331] → Tgt Spa: ['0.350'] [Step 289 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [32544, 32544] → Tgt Spa: ['0.350', '0.350'] [Step 289 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [32544, 32544] → Tgt Spa: ['0.350', '0.350'] [Step 289 / Rank 0] Tasks: ['Single QA'] | Lens: [65267] → Tgt Spa: ['0.350'] [Step 289 / Rank 3] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code', 'Code', 'Code'] | Lens: [8685, 8686, 8679, 8690, 8689, 8695, 8698] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 289 / Rank 2] Tasks: ['Code', 'Code', 'Single QA', 'Code', 'Code', 'Code', 'Code'] | Lens: [8685, 8686, 8679, 8690, 8689, 8695, 8698] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000'] [Step 289 / Rank 1] Tasks: ['Single QA'] | Lens: [65267] → Tgt Spa: ['0.350'] [Step 289 / Rank 6] Tasks: ['Single QA'] | Lens: [65076] → Tgt Spa: ['0.350'] [Step 289 / Rank 7] Tasks: ['Single QA'] | Lens: [65076] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:56:21,081 >> @ 289 | Loss: 2.3031 | LM: 2.2486 | Reg: 0.0544 | Spa(Avg): 0.514 [INFO|lh_trainer.py:797] 2026-02-17 07:56:21,081 >> Statistic -> Code | Spa: 0.699 | Tgt: 1.000 | Z-Loss: 0.100 | [INFO|lh_trainer.py:797] 2026-02-17 07:56:21,081 >> Statistic -> In-Context | Spa: 0.721 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:56:21,081 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | 
[INFO|lh_trainer.py:797] 2026-02-17 07:56:21,081 >> Statistic -> Single | Spa: 0.391 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:56:21,081 >> Statistic -> Summarization | Spa: 0.648 | Tgt: 1.000 | Z-Loss: 0.120 | [INFO|lh_trainer.py:810] 2026-02-17 07:56:21,083 >> [Micro-Log] {"loss": 2.3030704744160175, "lm_loss": 2.24863384043177, "reg_loss": 0.054436651810343996, "model_sparsity(avg)": 0.5140817910432816, "Spa-In-Context Learning sparsity": 0.7206790049870809, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10636910630597009, "Spa-Summarization sparsity": 0.6481481393178304, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.1200428456068039, "Spa-Single QA sparsity": 0.39052286919425516, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.027795647620223463, "Spa-Code sparsity": 0.6994949579238892, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10015081004662947, "Spa-MultiHop QA sparsity": 0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.129756398499012, "step": 289, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:56:49,132 >> {'loss': 13.8184, 'grad_norm': 0.5316653251647949, 'learning_rate': 2.58720303636711e-06, 'epoch': 0.3054239073196419, 'num_input_tokens_seen': 713915084, 'completed': '96.67% (290 / 300)', 'remaining time': '0:28:06', 'throughput': '6847.58', 'gpu_mem_free': '3793MB', 'step': 290} [Step 290 / Rank 1] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19133, 19136, 19124] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 290 / Rank 7] Tasks: ['Single QA'] | Lens: [36549] → Tgt Spa: ['0.350'] [Step 290 / Rank 6] Tasks: ['Single QA'] | Lens: [36549] → Tgt Spa: ['0.350'] [Step 290 / Rank 5] Tasks: ['Summarization', 
'Code', 'Summarization'] | Lens: [16910, 16901, 16912] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 290 / Rank 2] Tasks: ['Single QA'] | Lens: [39259] → Tgt Spa: ['0.350'] [Step 290 / Rank 4] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [16910, 16901, 16912] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 290 / Rank 3] Tasks: ['Single QA'] | Lens: [39259] → Tgt Spa: ['0.350'] [Step 290 / Rank 0] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19133, 19136, 19124] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 290 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [22285, 22305] → Tgt Spa: ['0.350', '1.000'] [Step 290 / Rank 7] Tasks: ['Single QA'] | Lens: [34758] → Tgt Spa: ['0.350'] [Step 290 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [22285, 22305] → Tgt Spa: ['0.350', '1.000'] [Step 290 / Rank 4] Tasks: ['Single QA'] | Lens: [53614] → Tgt Spa: ['0.350'] [Step 290 / Rank 1] Tasks: ['Code', 'Single QA'] | Lens: [32491, 32486] → Tgt Spa: ['1.000', '0.350'] [Step 290 / Rank 0] Tasks: ['Code', 'Single QA'] | Lens: [32491, 32486] → Tgt Spa: ['1.000', '0.350'] [Step 290 / Rank 6] Tasks: ['Single QA'] | Lens: [34758] → Tgt Spa: ['0.350'] [Step 290 / Rank 5] Tasks: ['Single QA'] | Lens: [53614] → Tgt Spa: ['0.350'] [Step 290 / Rank 7] Tasks: ['Single QA'] | Lens: [49866] → Tgt Spa: ['0.350'] [Step 290 / Rank 6] Tasks: ['Single QA'] | Lens: [49866] → Tgt Spa: ['0.350'] [Step 290 / Rank 1] Tasks: ['Single QA'] | Lens: [55604] → Tgt Spa: ['0.350'] [Step 290 / Rank 3] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [15865, 15865, 15873, 15866] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350'] [Step 290 / Rank 2] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA'] | Lens: [15865, 15865, 15873, 15866] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350'] [Step 290 / Rank 4] Tasks: ['Single QA'] | Lens: [60517] → Tgt Spa: ['0.350'] [Step 290 / Rank 0] Tasks: ['Single QA'] | Lens: [55604] → Tgt Spa: ['0.350'] [Step 290 / Rank 5] 
Tasks: ['Single QA'] | Lens: [60517] → Tgt Spa: ['0.350'] [Step 290 / Rank 1] Tasks: ['Single QA'] | Lens: [33975] → Tgt Spa: ['0.350'] [Step 290 / Rank 6] Tasks: ['Single QA'] | Lens: [45081] → Tgt Spa: ['0.350'] [Step 290 / Rank 7] Tasks: ['Single QA'] | Lens: [45081] → Tgt Spa: ['0.350'] [Step 290 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23229, 23230] → Tgt Spa: ['1.000', '1.000'] [Step 290 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23229, 23230] → Tgt Spa: ['1.000', '1.000'] [Step 290 / Rank 5] Tasks: ['Single QA'] | Lens: [44058] → Tgt Spa: ['0.350'] [Step 290 / Rank 4] Tasks: ['Single QA'] | Lens: [44058] → Tgt Spa: ['0.350'] [Step 290 / Rank 0] Tasks: ['Single QA'] | Lens: [33975] → Tgt Spa: ['0.350'] [Step 290 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58263] → Tgt Spa: ['1.000'] [Step 290 / Rank 4] Tasks: ['Single QA'] | Lens: [65464] → Tgt Spa: ['0.350'] [Step 290 / Rank 2] Tasks: ['Single QA'] | Lens: [33293] → Tgt Spa: ['0.350'] [Step 290 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61740] → Tgt Spa: ['1.000'] [Step 290 / Rank 3] Tasks: ['Single QA'] | Lens: [33293] → Tgt Spa: ['0.350'] [Step 290 / Rank 5] Tasks: ['Single QA'] | Lens: [65464] → Tgt Spa: ['0.350'] [Step 290 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58263] → Tgt Spa: ['1.000'] [Step 290 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61740] → Tgt Spa: ['1.000'] [Step 290 / Rank 2] Tasks: ['Single QA'] | Lens: [33736] → Tgt Spa: ['0.350'] [Step 290 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [27739, 27749] → Tgt Spa: ['1.000', '1.000'] [Step 290 / Rank 6] Tasks: ['Code', 'Code', 'Code'] | Lens: [20575, 20578, 20580] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 290 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [27739, 27749] → Tgt Spa: ['1.000', '1.000'] [Step 290 / Rank 3] Tasks: ['Single QA'] | Lens: [33736] → Tgt Spa: ['0.350'] [Step 290 / Rank 7] Tasks: ['Code', 'Code', 'Code'] | 
Lens: [20575, 20578, 20580] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 290 / Rank 1] Tasks: ['Single QA'] | Lens: [44278] → Tgt Spa: ['0.350'] [Step 290 / Rank 0] Tasks: ['Single QA'] | Lens: [44278] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 07:59:04,319 >> @ 290 | Loss: 2.1771 | LM: 2.1297 | Reg: 0.0473 | Spa(Avg): 0.487 [INFO|lh_trainer.py:797] 2026-02-17 07:59:04,319 >> Statistic -> Code | Spa: 0.703 | Tgt: 1.000 | Z-Loss: 0.099 | [INFO|lh_trainer.py:797] 2026-02-17 07:59:04,319 >> Statistic -> In-Context | Spa: 0.722 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:59:04,320 >> Statistic -> MultiHop | Spa: 0.646 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:59:04,320 >> Statistic -> Single | Spa: 0.386 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 07:59:04,320 >> Statistic -> Summarization | Spa: 0.667 | Tgt: 1.000 | Z-Loss: 0.110 | [INFO|lh_trainer.py:810] 2026-02-17 07:59:04,322 >> [Micro-Log] {"loss": 2.177072312682867, "lm_loss": 2.1297410242259502, "reg_loss": 0.04733128128767324, "model_sparsity(avg)": 0.48731674378116924, "Spa-Summarization sparsity": 0.6666666746139527, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11019663512706757, "Spa-Code sparsity": 0.7031250074505806, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09864119254052639, "Spa-Single QA sparsity": 0.38596490809791967, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02511720657103548, "Spa-In-Context Learning sparsity": 0.7222222089767456, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10571892559528351, "Spa-MultiHop QA sparsity": 0.6458333134651184, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.129756398499012, "step": 290, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 
0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 07:59:19,031 >> {'loss': 13.0624, 'grad_norm': 0.4242398142814636, 'learning_rate': 2.1388344426689387e-06, 'epoch': 0.30647709320695105, 'num_input_tokens_seen': 716304858, 'completed': '97.00% (291 / 300)', 'remaining time': '0:25:17', 'throughput': '7971.30', 'gpu_mem_free': '11203MB', 'step': 291} [Step 291 / Rank 7] Tasks: ['Summarization', 'Code'] | Lens: [22894, 22884] → Tgt Spa: ['1.000', '1.000'] [Step 291 / Rank 4] Tasks: ['Code'] | Lens: [48261] → Tgt Spa: ['1.000'] [Step 291 / Rank 0] Tasks: ['Summarization'] | Lens: [47613] → Tgt Spa: ['1.000'] [Step 291 / Rank 2] Tasks: ['Code'] | Lens: [34012] → Tgt Spa: ['1.000'] [Step 291 / Rank 6] Tasks: ['Summarization', 'Code'] | Lens: [22894, 22884] → Tgt Spa: ['1.000', '1.000'] [Step 291 / Rank 1] Tasks: ['Summarization'] | Lens: [47613] → Tgt Spa: ['1.000'] [Step 291 / Rank 5] Tasks: ['Code'] | Lens: [48261] → Tgt Spa: ['1.000'] [Step 291 / Rank 3] Tasks: ['Code'] | Lens: [34012] → Tgt Spa: ['1.000'] [Step 291 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19923, 19913, 19914] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 291 / Rank 3] Tasks: ['Code', 'In-Context Learning'] | Lens: [21892, 21884] → Tgt Spa: ['1.000', '1.000'] [Step 291 / Rank 5] Tasks: ['Code'] | Lens: [37761] → Tgt Spa: ['1.000'] [Step 291 / Rank 2] Tasks: ['Code', 'In-Context Learning'] | Lens: [21892, 21884] → Tgt Spa: ['1.000', '1.000'] [Step 291 / Rank 0] Tasks: ['Single QA'] | Lens: [50369] → Tgt Spa: ['0.350'] [Step 291 / Rank 1] Tasks: ['Single QA'] | Lens: [50369] → Tgt Spa: ['0.350'] [Step 291 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [19923, 19913, 19914] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 291 / Rank 4] Tasks: ['Code'] | Lens: [37761] → Tgt Spa: ['1.000'] [Step 291 / Rank 5] Tasks: ['Single QA', 'Single QA'] | Lens: [25251, 25252] → Tgt Spa: ['0.350', '0.350'] [Step 291 / Rank 2] Tasks: ['Code'] | Lens: [52928] → Tgt Spa: ['1.000'] [Step 291 / 
Rank 0] Tasks: ['MultiHop QA'] | Lens: [61498] → Tgt Spa: ['0.350'] [Step 291 / Rank 4] Tasks: ['Single QA', 'Single QA'] | Lens: [25251, 25252] → Tgt Spa: ['0.350', '0.350'] [Step 291 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [57605] → Tgt Spa: ['1.000'] [Step 291 / Rank 3] Tasks: ['Code'] | Lens: [52928] → Tgt Spa: ['1.000'] [Step 291 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [57605] → Tgt Spa: ['1.000'] [Step 291 / Rank 1] Tasks: ['MultiHop QA'] | Lens: [61498] → Tgt Spa: ['0.350'] [Step 291 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [51759] → Tgt Spa: ['1.000'] [Step 291 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [51759] → Tgt Spa: ['1.000'] [Step 291 / Rank 6] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Single QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA'] | Lens: [1801, 1820, 1822, 1822, 1821, 1802, 1804, 1804, 1822, 1803, 1804, 1824, 1824, 1824, 1806, 1806, 1806, 1806, 1808, 1808, 1808, 1828, 1827, 1827, 1828, 1809, 1828, 1828, 1811, 1812, 1809, 1811, 1830, 1832, 1831, 1813] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 291 / Rank 3] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [20794, 20785, 20795] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 291 / Rank 1] 
Tasks: ['Single QA'] | Lens: [54865] → Tgt Spa: ['0.350'] [Step 291 / Rank 2] Tasks: ['Summarization', 'Code', 'Summarization'] | Lens: [20794, 20785, 20795] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 291 / Rank 7] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Single QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA'] | Lens: [1801, 1820, 1822, 1822, 1821, 1802, 1804, 1804, 1822, 1803, 1804, 1824, 1824, 1824, 1806, 1806, 1806, 1806, 1808, 1808, 1808, 1828, 1827, 1827, 1828, 1809, 1828, 1828, 1811, 1812, 1809, 1811, 1830, 1832, 1831, 1813] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '1.000', '0.350'] [Step 291 / Rank 0] Tasks: ['Single QA'] | Lens: [54865] → Tgt Spa: ['0.350'] [Step 291 / Rank 1] Tasks: ['Single QA'] | Lens: [58754] → Tgt Spa: ['0.350'] [Step 291 / Rank 3] Tasks: ['Single QA'] | Lens: [35295] → Tgt Spa: ['0.350'] [Step 291 / Rank 5] Tasks: ['Single QA'] | Lens: [65037] → Tgt Spa: ['0.350'] [Step 291 / Rank 2] Tasks: ['Single QA'] | Lens: [35295] → Tgt Spa: ['0.350'] [Step 291 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17958, 17960, 17963] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 291 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Summarization'] | Lens: [17958, 17960, 
17963] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 291 / Rank 0] Tasks: ['Single QA'] | Lens: [58754] → Tgt Spa: ['0.350'] [Step 291 / Rank 4] Tasks: ['Single QA'] | Lens: [65037] → Tgt Spa: ['0.350'] [Step 291 / Rank 1] Tasks: ['Single QA'] | Lens: [47445] → Tgt Spa: ['0.350'] [Step 291 / Rank 4] Tasks: ['Single QA'] | Lens: [57325] → Tgt Spa: ['0.350'] [Step 291 / Rank 2] Tasks: ['Summarization'] | Lens: [35778] → Tgt Spa: ['1.000'] [Step 291 / Rank 6] Tasks: ['Single QA'] | Lens: [60976] → Tgt Spa: ['0.350'] [Step 291 / Rank 0] Tasks: ['Single QA'] | Lens: [47445] → Tgt Spa: ['0.350'] [Step 291 / Rank 7] Tasks: ['Single QA'] | Lens: [60976] → Tgt Spa: ['0.350'] [Step 291 / Rank 5] Tasks: ['Single QA'] | Lens: [57325] → Tgt Spa: ['0.350'] [Step 291 / Rank 3] Tasks: ['Summarization'] | Lens: [35778] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 08:01:45,314 >> @ 291 | Loss: 1.9832 | LM: 1.9121 | Reg: 0.0712 | Spa(Avg): 0.558 [INFO|lh_trainer.py:797] 2026-02-17 08:01:45,314 >> Statistic -> Code | Spa: 0.721 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 08:01:45,314 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:01:45,314 >> Statistic -> MultiHop | Spa: 0.572 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:01:45,314 >> Statistic -> Single | Spa: 0.390 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:01:45,314 >> Statistic -> Summarization | Spa: 0.638 | Tgt: 1.000 | Z-Loss: 0.125 | [INFO|lh_trainer.py:810] 2026-02-17 08:01:45,317 >> [Micro-Log] {"loss": 1.9832175162931283, "lm_loss": 1.9120544418692589, "reg_loss": 0.07116308105469216, "model_sparsity(avg)": 0.5583847785989443, "Spa-Summarization sparsity": 0.6383547026377457, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12509262361205542, "Spa-Single QA sparsity": 0.39004628856976825, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA 
log_z_loss": 0.027330289691841852, "Spa-MultiHop QA sparsity": 0.5717592636744181, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09109077292184035, "Spa-Code sparsity": 0.7206790049870809, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09150643812285529, "Spa-In-Context Learning sparsity": 0.7175925970077515, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10766946772734325, "step": 291, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:02:09,361 >> {'loss': 11.8993, 'grad_norm': 0.6927266716957092, 'learning_rate': 1.7329355879798507e-06, 'epoch': 0.3075302790942601, 'num_input_tokens_seen': 718782282, 'completed': '97.33% (292 / 300)', 'remaining time': '0:22:28', 'throughput': '7272.41', 'gpu_mem_free': '10545MB', 'step': 292} [Step 292 / Rank 3] Tasks: ['Single QA'] | Lens: [51694] → Tgt Spa: ['0.350'] [Step 292 / Rank 5] Tasks: ['Single QA'] | Lens: [65104] → Tgt Spa: ['0.350'] [Step 292 / Rank 7] Tasks: ['Single QA'] | Lens: [53251] → Tgt Spa: ['0.350'] [Step 292 / Rank 6] Tasks: ['Single QA'] | Lens: [53251] → Tgt Spa: ['0.350'] [Step 292 / Rank 0] Tasks: ['Single QA'] | Lens: [63024] → Tgt Spa: ['0.350'] [Step 292 / Rank 4] Tasks: ['Single QA'] | Lens: [65104] → Tgt Spa: ['0.350'] [Step 292 / Rank 2] Tasks: ['Single QA'] | Lens: [51694] → Tgt Spa: ['0.350'] [Step 292 / Rank 1] Tasks: ['Single QA'] | Lens: [63024] → Tgt Spa: ['0.350'] [Step 292 / Rank 4] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [17184, 17184, 17185] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 292 / Rank 2] Tasks: ['Single QA'] | Lens: [37785] → Tgt Spa: ['0.350'] [Step 292 / Rank 3] Tasks: ['Single QA'] | Lens: [37785] → Tgt Spa: ['0.350'] [Step 292 / Rank 1] Tasks: ['Single QA'] | Lens: [50385] → Tgt Spa: ['0.350'] [Step 292 / Rank 0] Tasks: ['Single 
QA'] | Lens: [50385] → Tgt Spa: ['0.350'] [Step 292 / Rank 5] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [17184, 17184, 17185] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 292 / Rank 7] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27764, 27773] → Tgt Spa: ['0.350', '1.000'] [Step 292 / Rank 6] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [27764, 27773] → Tgt Spa: ['0.350', '1.000'] [Step 292 / Rank 3] Tasks: ['Code', 'Code'] | Lens: [31605, 31604] → Tgt Spa: ['1.000', '1.000'] [Step 292 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15904, 15904, 15904, 15904] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 292 / Rank 6] Tasks: ['Single QA'] | Lens: [39832] → Tgt Spa: ['0.350'] [Step 292 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [23453, 23462] → Tgt Spa: ['1.000', '1.000'] [Step 292 / Rank 2] Tasks: ['Code', 'Code'] | Lens: [31605, 31604] → Tgt Spa: ['1.000', '1.000'] [Step 292 / Rank 7] Tasks: ['Single QA'] | Lens: [39832] → Tgt Spa: ['0.350'] [Step 292 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [23453, 23462] → Tgt Spa: ['1.000', '1.000'] [Step 292 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [15904, 15904, 15904, 15904] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 292 / Rank 3] Tasks: ['Single QA'] | Lens: [53269] → Tgt Spa: ['0.350'] [Step 292 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27861, 27863] → Tgt Spa: ['1.000', '1.000'] [Step 292 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [27861, 27863] → Tgt Spa: ['1.000', '1.000'] [Step 292 / Rank 2] Tasks: ['Single QA'] | Lens: [53269] → Tgt Spa: ['0.350'] [Step 292 / Rank 0] Tasks: ['Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [8543, 8543, 8544, 8546, 8546, 8553, 8554] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 292 / Rank 1] Tasks: 
['Single QA', 'In-Context Learning', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Code'] | Lens: [8543, 8543, 8544, 8546, 8546, 8553, 8554] → Tgt Spa: ['0.350', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000'] [Step 292 / Rank 7] Tasks: ['Single QA'] | Lens: [52955] → Tgt Spa: ['0.350'] [Step 292 / Rank 6] Tasks: ['Single QA'] | Lens: [52955] → Tgt Spa: ['0.350'] [Step 292 / Rank 5] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24051, 24051] → Tgt Spa: ['0.350', '1.000'] [Step 292 / Rank 3] Tasks: ['Single QA', 'Summarization'] | Lens: [32468, 32489] → Tgt Spa: ['0.350', '1.000'] [Step 292 / Rank 1] Tasks: ['Single QA'] | Lens: [34093] → Tgt Spa: ['0.350'] [Step 292 / Rank 7] Tasks: ['Single QA'] | Lens: [39952] → Tgt Spa: ['0.350'] [Step 292 / Rank 0] Tasks: ['Single QA'] | Lens: [34093] → Tgt Spa: ['0.350'] [Step 292 / Rank 6] Tasks: ['Single QA'] | Lens: [39952] → Tgt Spa: ['0.350'] [Step 292 / Rank 2] Tasks: ['Single QA', 'Summarization'] | Lens: [32468, 32489] → Tgt Spa: ['0.350', '1.000'] [Step 292 / Rank 4] Tasks: ['Single QA', 'In-Context Learning'] | Lens: [24051, 24051] → Tgt Spa: ['0.350', '1.000'] [Step 292 / Rank 5] Tasks: ['Code', 'Code', 'Code'] | Lens: [16412, 16412, 16412] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 292 / Rank 6] Tasks: ['Single QA'] | Lens: [39118] → Tgt Spa: ['0.350'] [Step 292 / Rank 7] Tasks: ['Single QA'] | Lens: [39118] → Tgt Spa: ['0.350'] [Step 292 / Rank 2] Tasks: ['Single QA'] | Lens: [64542] → Tgt Spa: ['0.350'] [Step 292 / Rank 1] Tasks: ['Code'] | Lens: [33533] → Tgt Spa: ['1.000'] [Step 292 / Rank 4] Tasks: ['Code', 'Code', 'Code'] | Lens: [16412, 16412, 16412] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 292 / Rank 0] Tasks: ['Code'] | Lens: [33533] → Tgt Spa: ['1.000'] [Step 292 / Rank 3] Tasks: ['Single QA'] | Lens: [64542] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 08:04:28,592 >> @ 292 | Loss: 2.1219 | LM: 2.0757 | Reg: 0.0462 | Spa(Avg): 0.479 [INFO|lh_trainer.py:797] 2026-02-17
08:04:28,592 >> Statistic -> Code | Spa: 0.707 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:797] 2026-02-17 08:04:28,592 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:04:28,592 >> Statistic -> MultiHop | Spa: 0.572 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:04:28,592 >> Statistic -> Single | Spa: 0.424 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:04:28,593 >> Statistic -> Summarization | Spa: 0.653 | Tgt: 1.000 | Z-Loss: 0.115 | [INFO|lh_trainer.py:810] 2026-02-17 08:04:28,594 >> [Micro-Log] {"loss": 2.1218616676827273, "lm_loss": 2.0756825463225446, "reg_loss": 0.04617911144547785, "model_sparsity(avg)": 0.4786981865763664, "Spa-Single QA sparsity": 0.42438271089836405, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.04906030240278967, "Spa-In-Context Learning sparsity": 0.7175925970077515, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10766946772734325, "Spa-Code sparsity": 0.7067901293436686, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09704171038336223, "Spa-Summarization sparsity": 0.6527777910232544, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.11543932557106018, "Spa-MultiHop QA sparsity": 0.5717592636744181, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09109077292184035, "step": 292, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:04:54,979 >> {'loss': 12.7312, 'grad_norm': 0.42387041449546814, 'learning_rate': 1.3695760209790617e-06, 'epoch': 0.30858346498156924, 'num_input_tokens_seen': 721256712, 'completed': '97.67% (293 / 300)', 'remaining time': '0:19:39', 'throughput': '7470.31', 'gpu_mem_free': '14783MB', 'step': 293} [Step 293 / Rank 1] 
Tasks: ['Single QA', 'Single QA'] | Lens: [23889, 23891] → Tgt Spa: ['0.350', '0.350'] [Step 293 / Rank 2] Tasks: ['Single QA'] | Lens: [49744] → Tgt Spa: ['0.350'] [Step 293 / Rank 4] Tasks: ['Single QA'] | Lens: [60513] → Tgt Spa: ['0.350'] [Step 293 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [24623, 24616] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 3] Tasks: ['Single QA'] | Lens: [49744] → Tgt Spa: ['0.350'] [Step 293 / Rank 5] Tasks: ['Single QA'] | Lens: [60513] → Tgt Spa: ['0.350'] [Step 293 / Rank 0] Tasks: ['Single QA', 'Single QA'] | Lens: [23889, 23891] → Tgt Spa: ['0.350', '0.350'] [Step 293 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [24623, 24616] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 3] Tasks: ['Single QA'] | Lens: [41257] → Tgt Spa: ['0.350'] [Step 293 / Rank 6] Tasks: ['Code', 'In-Context Learning'] | Lens: [21937, 21929] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 1] Tasks: ['Single QA'] | Lens: [55087] → Tgt Spa: ['0.350'] [Step 293 / Rank 5] Tasks: ['Single QA'] | Lens: [51540] → Tgt Spa: ['0.350'] [Step 293 / Rank 7] Tasks: ['Code', 'In-Context Learning'] | Lens: [21937, 21929] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 0] Tasks: ['Single QA'] | Lens: [55087] → Tgt Spa: ['0.350'] [Step 293 / Rank 2] Tasks: ['Single QA'] | Lens: [41257] → Tgt Spa: ['0.350'] [Step 293 / Rank 4] Tasks: ['Single QA'] | Lens: [51540] → Tgt Spa: ['0.350'] [Step 293 / Rank 2] Tasks: ['Summarization'] | Lens: [35184] → Tgt Spa: ['1.000'] [Step 293 / Rank 3] Tasks: ['Summarization'] | Lens: [35184] → Tgt Spa: ['1.000'] [Step 293 / Rank 6] Tasks: ['Single QA', 'Code'] | Lens: [21855, 21864] → Tgt Spa: ['0.350', '1.000'] [Step 293 / Rank 5] Tasks: ['Single QA'] | Lens: [41711] → Tgt Spa: ['0.350'] [Step 293 / Rank 4] Tasks: ['Single QA'] | Lens: [41711] → Tgt Spa: ['0.350'] [Step 293 / Rank 7] Tasks: ['Single QA', 'Code'] | Lens: [21855, 21864] → Tgt Spa: ['0.350', '1.000'] [Step 293 / Rank 1] Tasks: ['In-Context Learning'] | 
Lens: [63181] → Tgt Spa: ['1.000'] [Step 293 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [63181] → Tgt Spa: ['1.000'] [Step 293 / Rank 5] Tasks: ['Single QA'] | Lens: [54371] → Tgt Spa: ['0.350'] [Step 293 / Rank 0] Tasks: ['Single QA'] | Lens: [41199] → Tgt Spa: ['0.350'] [Step 293 / Rank 7] Tasks: ['Single QA'] | Lens: [53360] → Tgt Spa: ['0.350'] [Step 293 / Rank 3] Tasks: ['Single QA'] | Lens: [49171] → Tgt Spa: ['0.350'] [Step 293 / Rank 1] Tasks: ['Single QA'] | Lens: [41199] → Tgt Spa: ['0.350'] [Step 293 / Rank 2] Tasks: ['Single QA'] | Lens: [49171] → Tgt Spa: ['0.350'] [Step 293 / Rank 6] Tasks: ['Single QA'] | Lens: [53360] → Tgt Spa: ['0.350'] [Step 293 / Rank 4] Tasks: ['Single QA'] | Lens: [54371] → Tgt Spa: ['0.350'] [Step 293 / Rank 0] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [9175, 9177, 9178, 9178, 9180, 9175, 9184] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 293 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [26537, 26538] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 1] Tasks: ['Code', 'Code', 'Code', 'Code', 'Code', 'Single QA', 'Code'] | Lens: [9175, 9177, 9178, 9178, 9180, 9175, 9184] → Tgt Spa: ['1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 293 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [21259, 21259, 21259] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 293 / Rank 3] Tasks: ['Code'] | Lens: [45175] → Tgt Spa: ['1.000'] [Step 293 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [26537, 26538] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning', 'In-Context Learning'] | Lens: [21259, 21259, 21259] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 293 / Rank 2] Tasks: ['Code'] | Lens: [45175] → Tgt Spa: ['1.000'] [Step 293 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [40553] → Tgt Spa: ['1.000'] [Step 293 / Rank 2] Tasks: ['In-Context Learning', 'In-Context 
Learning'] | Lens: [31908, 31912] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [31908, 31912] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 7] Tasks: ['Code', 'Code'] | Lens: [31856, 31862] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [40553] → Tgt Spa: ['1.000'] [Step 293 / Rank 0] Tasks: ['Single QA'] | Lens: [52814] → Tgt Spa: ['0.350'] [Step 293 / Rank 6] Tasks: ['Code', 'Code'] | Lens: [31856, 31862] → Tgt Spa: ['1.000', '1.000'] [Step 293 / Rank 1] Tasks: ['Single QA'] | Lens: [52814] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 08:07:18,400 >> @ 293 | Loss: 2.0295 | LM: 1.9742 | Reg: 0.0553 | Spa(Avg): 0.530 [INFO|lh_trainer.py:797] 2026-02-17 08:07:18,401 >> Statistic -> Code | Spa: 0.713 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 08:07:18,401 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:07:18,401 >> Statistic -> MultiHop | Spa: 0.572 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:07:18,401 >> Statistic -> Single | Spa: 0.381 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:07:18,401 >> Statistic -> Summarization | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.089 | [INFO|lh_trainer.py:810] 2026-02-17 08:07:18,403 >> [Micro-Log] {"loss": 2.029549814760685, "lm_loss": 1.974200659741958, "reg_loss": 0.05534916347824037, "model_sparsity(avg)": 0.530464610705773, "Spa-Single QA sparsity": 0.38055554231007893, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02480631029078116, "Spa-In-Context Learning sparsity": 0.7175925837622749, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10767545964982775, "Spa-Code sparsity": 0.7132936545780727, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09444533661007881, "Spa-Summarization sparsity": 
0.7083333730697632, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.08924694359302521, "Spa-MultiHop QA sparsity": 0.5717592636744181, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09109077292184035, "step": 293, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:07:37,568 >> {'loss': 12.1773, 'grad_norm': 0.5390251874923706, 'learning_rate': 1.048818001457775e-06, 'epoch': 0.30963665086887837, 'num_input_tokens_seen': 723712914, 'completed': '98.00% (294 / 300)', 'remaining time': '0:16:51', 'throughput': '7553.41', 'gpu_mem_free': '8839MB', 'step': 294} [Step 294 / Rank 5] Tasks: ['Single QA'] | Lens: [36638] → Tgt Spa: ['0.350'] [Step 294 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [59974] → Tgt Spa: ['1.000'] [Step 294 / Rank 4] Tasks: ['Single QA'] | Lens: [36638] → Tgt Spa: ['0.350'] [Step 294 / Rank 2] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25532, 25533] → Tgt Spa: ['1.000', '0.350'] [Step 294 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [59974] → Tgt Spa: ['1.000'] [Step 294 / Rank 3] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [25532, 25533] → Tgt Spa: ['1.000', '0.350'] [Step 294 / Rank 7] Tasks: ['Code'] | Lens: [60319] → Tgt Spa: ['1.000'] [Step 294 / Rank 6] Tasks: ['Code'] | Lens: [60319] → Tgt Spa: ['1.000'] [Step 294 / Rank 5] Tasks: ['Single QA'] | Lens: [43901] → Tgt Spa: ['0.350'] [Step 294 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26441, 26442] → Tgt Spa: ['1.000', '1.000'] [Step 294 / Rank 7] Tasks: ['In-Context Learning', 'Code'] | Lens: [24274, 24283] → Tgt Spa: ['1.000', '1.000'] [Step 294 / Rank 6] Tasks: ['In-Context Learning', 'Code'] | Lens: [24274, 24283] → Tgt Spa: ['1.000', '1.000'] [Step 294 / Rank 1] Tasks: ['Single QA'] | Lens: [34232] → Tgt Spa: ['0.350'] [Step 294 / 
Rank 4] Tasks: ['Single QA'] | Lens: [43901] → Tgt Spa: ['0.350'] [Step 294 / Rank 0] Tasks: ['Single QA'] | Lens: [34232] → Tgt Spa: ['0.350'] [Step 294 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26441, 26442] → Tgt Spa: ['1.000', '1.000'] [Step 294 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24785, 24767] → Tgt Spa: ['1.000', '1.000'] [Step 294 / Rank 3] Tasks: ['Code'] | Lens: [40455] → Tgt Spa: ['1.000'] [Step 294 / Rank 1] Tasks: ['Single QA'] | Lens: [61116] → Tgt Spa: ['0.350'] [Step 294 / Rank 7] Tasks: ['Single QA'] | Lens: [43578] → Tgt Spa: ['0.350'] [Step 294 / Rank 6] Tasks: ['Single QA'] | Lens: [43578] → Tgt Spa: ['0.350'] [Step 294 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [24785, 24767] → Tgt Spa: ['1.000', '1.000'] [Step 294 / Rank 0] Tasks: ['Single QA'] | Lens: [61116] → Tgt Spa: ['0.350'] [Step 294 / Rank 2] Tasks: ['Code'] | Lens: [40455] → Tgt Spa: ['1.000'] [Step 294 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [39725] → Tgt Spa: ['1.000'] [Step 294 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [39725] → Tgt Spa: ['1.000'] [Step 294 / Rank 1] Tasks: ['Single QA'] | Lens: [39243] → Tgt Spa: ['0.350'] [Step 294 / Rank 6] Tasks: ['Single QA'] | Lens: [57383] → Tgt Spa: ['0.350'] [Step 294 / Rank 7] Tasks: ['Single QA'] | Lens: [57383] → Tgt Spa: ['0.350'] [Step 294 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [65319] → Tgt Spa: ['1.000'] [Step 294 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [65319] → Tgt Spa: ['1.000'] [Step 294 / Rank 0] Tasks: ['Single QA'] | Lens: [39243] → Tgt Spa: ['0.350'] [Step 294 / Rank 5] Tasks: ['Single QA'] | Lens: [49774] → Tgt Spa: ['0.350'] [Step 294 / Rank 1] Tasks: ['Single QA'] | Lens: [45867] → Tgt Spa: ['0.350'] [Step 294 / Rank 7] Tasks: ['Code'] | Lens: [34910] → Tgt Spa: ['1.000'] [Step 294 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [52402] → Tgt Spa: ['1.000'] [Step 294 / Rank 4] Tasks: ['Single QA'] | 
Lens: [49774] → Tgt Spa: ['0.350'] [Step 294 / Rank 0] Tasks: ['Single QA'] | Lens: [45867] → Tgt Spa: ['0.350'] [Step 294 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [52402] → Tgt Spa: ['1.000'] [Step 294 / Rank 6] Tasks: ['Code'] | Lens: [34910] → Tgt Spa: ['1.000'] [Step 294 / Rank 4] Tasks: ['Single QA'] | Lens: [39666] → Tgt Spa: ['0.350'] [Step 294 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [36489] → Tgt Spa: ['1.000'] [Step 294 / Rank 7] Tasks: ['Single QA'] | Lens: [55060] → Tgt Spa: ['0.350'] [Step 294 / Rank 6] Tasks: ['Single QA'] | Lens: [55060] → Tgt Spa: ['0.350'] [Step 294 / Rank 3] Tasks: ['Single QA'] | Lens: [52794] → Tgt Spa: ['0.350'] [Step 294 / Rank 2] Tasks: ['Single QA'] | Lens: [52794] → Tgt Spa: ['0.350'] [Step 294 / Rank 5] Tasks: ['Single QA'] | Lens: [39666] → Tgt Spa: ['0.350'] [Step 294 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [36489] → Tgt Spa: ['1.000'] [INFO|lh_trainer.py:781] 2026-02-17 08:10:04,774 >> @ 294 | Loss: 2.0430 | LM: 1.9853 | Reg: 0.0577 | Spa(Avg): 0.528 [INFO|lh_trainer.py:797] 2026-02-17 08:10:04,775 >> Statistic -> Code | Spa: 0.677 | Tgt: 1.000 | Z-Loss: 0.110 | [INFO|lh_trainer.py:797] 2026-02-17 08:10:04,775 >> Statistic -> In-Context | Spa: 0.722 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:10:04,775 >> Statistic -> MultiHop | Spa: 0.572 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:10:04,775 >> Statistic -> Single | Spa: 0.365 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:10:04,775 >> Statistic -> Summarization | Spa: 0.694 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:810] 2026-02-17 08:10:04,777 >> [Micro-Log] {"loss": 2.042957273001472, "lm_loss": 1.9852835008253653, "reg_loss": 0.05767379289318342, "model_sparsity(avg)": 0.5283564726511637, "Spa-In-Context Learning sparsity": 0.7222222089767456, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10571892559528351, 
"Spa-Single QA sparsity": 0.36538460621467006, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.012294160479751345, "Spa-Code sparsity": 0.6770833432674408, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.10987288504838943, "Spa-Summarization sparsity": 0.6944444179534912, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09551927447319031, "Spa-MultiHop QA sparsity": 0.5717592636744181, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.09109077292184035, "step": 294, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:10:25,330 >> {'loss': 12.2577, 'grad_norm': 0.632005512714386, 'learning_rate': 7.707164896513524e-07, 'epoch': 0.31068983675618744, 'num_input_tokens_seen': 726014718, 'completed': '98.33% (295 / 300)', 'remaining time': '0:14:02', 'throughput': '6860.32', 'gpu_mem_free': '13129MB', 'step': 295} [Step 295 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [55030] → Tgt Spa: ['1.000'] [Step 295 / Rank 7] Tasks: ['Single QA'] | Lens: [35864] → Tgt Spa: ['0.350'] [Step 295 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning', 'Single QA', 'Summarization', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Code', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [4343, 4342, 4343, 4361, 4346, 4363, 4345, 4347, 4352, 4346, 4365, 4348, 4347, 4349, 4348] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 295 / Rank 2] Tasks: ['Single QA'] | Lens: [46149] → Tgt Spa: ['0.350'] [Step 295 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [55030] → Tgt Spa: ['1.000'] [Step 295 / Rank 0] Tasks: ['In-Context Learning', 
'In-Context Learning', 'Single QA', 'Summarization', 'MultiHop QA', 'Summarization', 'In-Context Learning', 'MultiHop QA', 'Code', 'In-Context Learning', 'Summarization', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning'] | Lens: [4343, 4342, 4343, 4361, 4346, 4363, 4345, 4347, 4352, 4346, 4365, 4348, 4347, 4349, 4348] → Tgt Spa: ['1.000', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '1.000'] [Step 295 / Rank 6] Tasks: ['Single QA'] | Lens: [35864] → Tgt Spa: ['0.350'] [Step 295 / Rank 3] Tasks: ['Single QA'] | Lens: [46149] → Tgt Spa: ['0.350'] [Step 295 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [16630, 16630, 16634] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 295 / Rank 7] Tasks: ['Single QA'] | Lens: [64184] → Tgt Spa: ['0.350'] [Step 295 / Rank 6] Tasks: ['Single QA'] | Lens: [64184] → Tgt Spa: ['0.350'] [Step 295 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [16630, 16630, 16634] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 295 / Rank 1] Tasks: ['Single QA'] | Lens: [59431] → Tgt Spa: ['0.350'] [Step 295 / Rank 5] Tasks: ['Single QA'] | Lens: [54848] → Tgt Spa: ['0.350'] [Step 295 / Rank 4] Tasks: ['Single QA'] | Lens: [54848] → Tgt Spa: ['0.350'] [Step 295 / Rank 0] Tasks: ['Single QA'] | Lens: [59431] → Tgt Spa: ['0.350'] [Step 295 / Rank 2] Tasks: ['Code', 'Code', 'Code'] | Lens: [18128, 18127, 18130] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 295 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [43221] → Tgt Spa: ['1.000'] [Step 295 / Rank 1] Tasks: ['Code'] | Lens: [43237] → Tgt Spa: ['1.000'] [Step 295 / Rank 3] Tasks: ['Code', 'Code', 'Code'] | Lens: [18128, 18127, 18130] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 295 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [43221] → Tgt Spa: ['1.000'] [Step 295 / Rank 7] Tasks: ['Single QA'] | Lens: [45272] → Tgt Spa: ['0.350'] [Step 295 / Rank 0] Tasks: ['Code'] | Lens: [43237] → Tgt Spa: 
['1.000'] [Step 295 / Rank 6] Tasks: ['Single QA'] | Lens: [45272] → Tgt Spa: ['0.350'] [Step 295 / Rank 5] Tasks: ['Single QA'] | Lens: [59396] → Tgt Spa: ['0.350'] [Step 295 / Rank 1] Tasks: ['In-Context Learning', 'Code'] | Lens: [25470, 25479] → Tgt Spa: ['1.000', '1.000'] [Step 295 / Rank 0] Tasks: ['In-Context Learning', 'Code'] | Lens: [25470, 25479] → Tgt Spa: ['1.000', '1.000'] [Step 295 / Rank 4] Tasks: ['Single QA'] | Lens: [59396] → Tgt Spa: ['0.350'] [Step 295 / Rank 3] Tasks: ['Code'] | Lens: [59180] → Tgt Spa: ['1.000'] [Step 295 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [39369] → Tgt Spa: ['1.000'] [Step 295 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [39369] → Tgt Spa: ['1.000'] [Step 295 / Rank 2] Tasks: ['Code'] | Lens: [59180] → Tgt Spa: ['1.000'] [Step 295 / Rank 7] Tasks: ['Single QA'] | Lens: [62209] → Tgt Spa: ['0.350'] [Step 295 / Rank 1] Tasks: ['Single QA'] | Lens: [47775] → Tgt Spa: ['0.350'] [Step 295 / Rank 6] Tasks: ['Single QA'] | Lens: [62209] → Tgt Spa: ['0.350'] [Step 295 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25753, 25753] → Tgt Spa: ['1.000', '1.000'] [Step 295 / Rank 4] Tasks: ['MultiHop QA'] | Lens: [65219] → Tgt Spa: ['0.350'] [Step 295 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25753, 25753] → Tgt Spa: ['1.000', '1.000'] [Step 295 / Rank 0] Tasks: ['Single QA'] | Lens: [47775] → Tgt Spa: ['0.350'] [Step 295 / Rank 5] Tasks: ['MultiHop QA'] | Lens: [65219] → Tgt Spa: ['0.350'] [Step 295 / Rank 3] Tasks: ['In-Context Learning'] | Lens: [35583] → Tgt Spa: ['1.000'] [Step 295 / Rank 5] Tasks: ['In-Context Learning', 'Code'] | Lens: [26359, 26368] → Tgt Spa: ['1.000', '1.000'] [Step 295 / Rank 7] Tasks: ['Single QA'] | Lens: [49451] → Tgt Spa: ['0.350'] [Step 295 / Rank 6] Tasks: ['Single QA'] | Lens: [49451] → Tgt Spa: ['0.350'] [Step 295 / Rank 2] Tasks: ['In-Context Learning'] | Lens: [35583] → Tgt Spa: ['1.000'] [Step 295 / Rank 1] Tasks: ['Single 
QA'] | Lens: [38138] → Tgt Spa: ['0.350'] [Step 295 / Rank 4] Tasks: ['In-Context Learning', 'Code'] | Lens: [26359, 26368] → Tgt Spa: ['1.000', '1.000'] [Step 295 / Rank 0] Tasks: ['Single QA'] | Lens: [38138] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 08:12:56,804 >> @ 295 | Loss: 2.0406 | LM: 1.9809 | Reg: 0.0597 | Spa(Avg): 0.541 [INFO|lh_trainer.py:797] 2026-02-17 08:12:56,804 >> Statistic -> Code | Spa: 0.713 | Tgt: 1.000 | Z-Loss: 0.094 | [INFO|lh_trainer.py:797] 2026-02-17 08:12:56,804 >> Statistic -> In-Context | Spa: 0.709 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:12:56,804 >> Statistic -> MultiHop | Spa: 0.588 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:12:56,804 >> Statistic -> Single | Spa: 0.388 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:12:56,804 >> Statistic -> Summarization | Spa: 0.528 | Tgt: 1.000 | Z-Loss: 0.191 | [INFO|lh_trainer.py:810] 2026-02-17 08:12:56,806 >> [Micro-Log] {"loss": 2.0406125653535128, "lm_loss": 1.9809006402889888, "reg_loss": 0.059711924632817194, "model_sparsity(avg)": 0.5405285432934761, "Spa-In-Context Learning sparsity": 0.7092592477798462, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.11125948429107665, "Spa-Single QA sparsity": 0.3878205097638644, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.025791156240022525, "Spa-Summarization sparsity": 0.5277777711550394, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.19082755595445633, "Spa-MultiHop QA sparsity": 0.5879629651705424, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.10201904798547427, "Spa-Code sparsity": 0.713383826342496, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09440371868285266, "step": 295, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, 
"lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:13:14,441 >> {'loss': 12.2437, 'grad_norm': 0.623199462890625, 'learning_rate': 5.353191368222112e-07, 'epoch': 0.31174302264349657, 'num_input_tokens_seen': 728471242, 'completed': '98.67% (296 / 300)', 'remaining time': '0:11:14', 'throughput': '7263.06', 'gpu_mem_free': '13597MB', 'step': 296} [Step 296 / Rank 4] Tasks: ['Single QA'] | Lens: [39938] → Tgt Spa: ['0.350'] [Step 296 / Rank 7] Tasks: ['Single QA'] | Lens: [37719] → Tgt Spa: ['0.350'] [Step 296 / Rank 0] Tasks: ['Single QA'] | Lens: [35097] → Tgt Spa: ['0.350'] [Step 296 / Rank 3] Tasks: ['Single QA'] | Lens: [37678] → Tgt Spa: ['0.350'] [Step 296 / Rank 1] Tasks: ['Single QA'] | Lens: [35097] → Tgt Spa: ['0.350'] [Step 296 / Rank 5] Tasks: ['Single QA'] | Lens: [39938] → Tgt Spa: ['0.350'] [Step 296 / Rank 2] Tasks: ['Single QA'] | Lens: [37678] → Tgt Spa: ['0.350'] [Step 296 / Rank 6] Tasks: ['Single QA'] | Lens: [37719] → Tgt Spa: ['0.350'] [Step 296 / Rank 2] Tasks: ['MultiHop QA', 'In-Context Learning'] | Lens: [30381, 30384] → Tgt Spa: ['0.350', '1.000'] [Step 296 / Rank 6] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16808, 16799, 16800] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 296 / Rank 4] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24135, 24135] → Tgt Spa: ['1.000', '1.000'] [Step 296 / Rank 3] Tasks: ['MultiHop QA', 'In-Context Learning'] | Lens: [30381, 30384] → Tgt Spa: ['0.350', '1.000'] [Step 296 / Rank 0] Tasks: ['Code', 'Code'] | Lens: [27097, 27097] → Tgt Spa: ['1.000', '1.000'] [Step 296 / Rank 5] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [24135, 24135] → Tgt Spa: ['1.000', '1.000'] [Step 296 / Rank 1] Tasks: ['Code', 'Code'] | Lens: [27097, 27097] → Tgt Spa: ['1.000', '1.000'] [Step 296 / Rank 7] Tasks: ['Summarization', 'Code', 'Code'] | Lens: [16808, 16799, 16800] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 296 / Rank 1] Tasks: ['Single QA'] | Lens: [38453] → Tgt 
Spa: ['0.350'] [Step 296 / Rank 6] Tasks: ['Single QA'] | Lens: [43160] → Tgt Spa: ['0.350'] [Step 296 / Rank 0] Tasks: ['Single QA'] | Lens: [38453] → Tgt Spa: ['0.350'] [Step 296 / Rank 4] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [2143, 2128, 2128, 2146, 2128, 2149, 2132, 2148, 2149, 2148, 2136, 2131, 2134, 2135, 2152, 2143, 2137, 2136, 2139, 2138, 2155, 2137, 2140, 2156, 2156, 2139, 2141, 2157, 2139, 2141] → Tgt Spa: ['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 296 / Rank 3] Tasks: ['Single QA', 'Code'] | Lens: [21928, 21936] → Tgt Spa: ['0.350', '1.000'] [Step 296 / Rank 7] Tasks: ['Single QA'] | Lens: [43160] → Tgt Spa: ['0.350'] [Step 296 / Rank 5] Tasks: ['Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA'] | Lens: [2143, 2128, 2128, 2146, 2128, 2149, 2132, 2148, 2149, 2148, 2136, 2131, 2134, 2135, 2152, 2143, 2137, 2136, 2139, 2138, 2155, 2137, 2140, 2156, 2156, 2139, 2141, 2157, 2139, 2141] → Tgt Spa: 
['1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350'] [Step 296 / Rank 2] Tasks: ['Single QA', 'Code'] | Lens: [21928, 21936] → Tgt Spa: ['0.350', '1.000'] [Step 296 / Rank 7] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19444, 19445, 19436] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 296 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [21441, 21442, 21461] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 296 / Rank 5] Tasks: ['Code', 'Single QA'] | Lens: [23412, 23402] → Tgt Spa: ['1.000', '0.350'] [Step 296 / Rank 4] Tasks: ['Code', 'Single QA'] | Lens: [23412, 23402] → Tgt Spa: ['1.000', '0.350'] [Step 296 / Rank 2] Tasks: ['Single QA'] | Lens: [49214] → Tgt Spa: ['0.350'] [Step 296 / Rank 6] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [19444, 19445, 19436] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 296 / Rank 3] Tasks: ['Single QA'] | Lens: [49214] → Tgt Spa: ['0.350'] [Step 296 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning', 'Summarization'] | Lens: [21441, 21442, 21461] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 296 / Rank 4] Tasks: ['Single QA'] | Lens: [61570] → Tgt Spa: ['0.350'] [Step 296 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22866, 22886] → Tgt Spa: ['1.000', '1.000'] [Step 296 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26027, 26027] → Tgt Spa: ['1.000', '1.000'] [Step 296 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [26027, 26027] → Tgt Spa: ['1.000', '1.000'] [Step 296 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [22866, 22886] → Tgt Spa: ['1.000', '1.000'] [Step 296 / Rank 2] Tasks: ['Code'] | Lens: [46569] → Tgt Spa: ['1.000'] [Step 296 / Rank 5] 
Tasks: ['Single QA'] | Lens: [61570] → Tgt Spa: ['0.350'] [Step 296 / Rank 3] Tasks: ['Code'] | Lens: [46569] → Tgt Spa: ['1.000'] [Step 296 / Rank 5] Tasks: ['Single QA', 'Code', 'Summarization'] | Lens: [17337, 17345, 17360] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 296 / Rank 4] Tasks: ['Single QA', 'Code', 'Summarization'] | Lens: [17337, 17345, 17360] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 296 / Rank 1] Tasks: ['Code'] | Lens: [63270] → Tgt Spa: ['1.000'] [Step 296 / Rank 6] Tasks: ['Single QA'] | Lens: [53257] → Tgt Spa: ['0.350'] [Step 296 / Rank 2] Tasks: ['Single QA'] | Lens: [38853] → Tgt Spa: ['0.350'] [Step 296 / Rank 0] Tasks: ['Code'] | Lens: [63270] → Tgt Spa: ['1.000'] [Step 296 / Rank 3] Tasks: ['Single QA'] | Lens: [38853] → Tgt Spa: ['0.350'] [Step 296 / Rank 7] Tasks: ['Single QA'] | Lens: [53257] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 08:15:14,273 >> @ 296 | Loss: 1.9866 | LM: 1.9265 | Reg: 0.0601 | Spa(Avg): 0.546 [INFO|lh_trainer.py:797] 2026-02-17 08:15:14,273 >> Statistic -> Code | Spa: 0.708 | Tgt: 1.000 | Z-Loss: 0.096 | [INFO|lh_trainer.py:797] 2026-02-17 08:15:14,273 >> Statistic -> In-Context | Spa: 0.719 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:15:14,273 >> Statistic -> MultiHop | Spa: 0.627 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:15:14,274 >> Statistic -> Single | Spa: 0.387 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:15:14,274 >> Statistic -> Summarization | Spa: 0.667 | Tgt: 1.000 | Z-Loss: 0.110 | [INFO|lh_trainer.py:810] 2026-02-17 08:15:14,275 >> [Micro-Log] {"loss": 1.9865862776835759, "lm_loss": 1.9264906359215577, "reg_loss": 0.06009564484702423, "model_sparsity(avg)": 0.5456597159306208, "Spa-Single QA sparsity": 0.3867521286010742, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.02427604618983773, "Spa-Code sparsity": 0.7083333233992258, "Spa-Code target_sparsity": 1.0, "Spa-Code 
log_z_loss": 0.09638242858151595, "Spa-In-Context Learning sparsity": 0.7187499850988388, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10718857310712337, "Spa-Summarization sparsity": 0.6674836593515733, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.10992285346283633, "Spa-MultiHop QA sparsity": 0.6273148059844971, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12053842770142688, "step": 296, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:15:39,125 >> {'loss': 11.9195, 'grad_norm': 0.5572063326835632, 'learning_rate': 3.4266627709491055e-07, 'epoch': 0.3127962085308057, 'num_input_tokens_seen': 730842942, 'completed': '99.00% (297 / 300)', 'remaining time': '0:08:25', 'throughput': '8196.14', 'gpu_mem_free': '5255MB', 'step': 297} [Step 297 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [58979] → Tgt Spa: ['1.000'] [Step 297 / Rank 7] Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [19810, 19813, 19804] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 297 / Rank 6] Tasks: ['Code', 'Code', 'In-Context Learning'] | Lens: [19810, 19813, 19804] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 297 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [58979] → Tgt Spa: ['1.000'] [Step 297 / Rank 3] Tasks: ['Single QA'] | Lens: [58833] → Tgt Spa: ['0.350'] [Step 297 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [11414, 11416, 11422, 11416, 11424] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000'] [Step 297 / Rank 2] Tasks: ['Single QA'] | Lens: [58833] → Tgt Spa: ['0.350'] [Step 297 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Code', 'Single QA', 'Code'] | Lens: [11414, 11416, 11422, 11416, 11424] → Tgt Spa: ['0.350', '0.350', '1.000', '0.350', '1.000'] [Step 297 / Rank 0] Tasks: 
['Single QA'] | Lens: [65261] → Tgt Spa: ['0.350'] [Step 297 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [61670] → Tgt Spa: ['1.000'] [Step 297 / Rank 2] Tasks: ['Single QA'] | Lens: [41926] → Tgt Spa: ['0.350'] [Step 297 / Rank 4] Tasks: ['Code'] | Lens: [35249] → Tgt Spa: ['1.000'] [Step 297 / Rank 1] Tasks: ['Single QA'] | Lens: [65261] → Tgt Spa: ['0.350'] [Step 297 / Rank 5] Tasks: ['Code'] | Lens: [35249] → Tgt Spa: ['1.000'] [Step 297 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [61670] → Tgt Spa: ['1.000'] [Step 297 / Rank 3] Tasks: ['Single QA'] | Lens: [41926] → Tgt Spa: ['0.350'] [Step 297 / Rank 3] Tasks: ['Single QA'] | Lens: [35420] → Tgt Spa: ['0.350'] [Step 297 / Rank 4] Tasks: ['In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [5016, 5017, 5018, 5019, 5037, 5019, 5022, 5023, 5022, 5023, 5025, 5025, 5026] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 297 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13350, 13350, 13351, 13352] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 297 / Rank 5] Tasks: ['In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA', 'Summarization', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'In-Context Learning', 'In-Context Learning', 'Single QA', 'In-Context Learning', 'Single QA'] | Lens: [5016, 5017, 5018, 5019, 5037, 5019, 5022, 5023, 5022, 5023, 5025, 5025, 5026] → Tgt Spa: ['1.000', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000', '1.000', '1.000', '0.350', '1.000', '0.350'] [Step 297 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA'] | Lens: [13350, 13350, 13351, 13352] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350'] [Step 297 
/ Rank 7] Tasks: ['In-Context Learning'] | Lens: [56985] → Tgt Spa: ['1.000'] [Step 297 / Rank 2] Tasks: ['Single QA'] | Lens: [35420] → Tgt Spa: ['0.350'] [Step 297 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [56985] → Tgt Spa: ['1.000'] [Step 297 / Rank 5] Tasks: ['In-Context Learning'] | Lens: [62228] → Tgt Spa: ['1.000'] [Step 297 / Rank 3] Tasks: ['Summarization', 'Code'] | Lens: [24419, 24408] → Tgt Spa: ['1.000', '1.000'] [Step 297 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Code'] | Lens: [7307, 7307, 7308, 7309, 7315, 7313, 7314, 7322] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 297 / Rank 7] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25481, 25483] → Tgt Spa: ['1.000', '1.000'] [Step 297 / Rank 2] Tasks: ['Summarization', 'Code'] | Lens: [24419, 24408] → Tgt Spa: ['1.000', '1.000'] [Step 297 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA', 'Single QA', 'Code', 'Single QA', 'Single QA', 'Code'] | Lens: [7307, 7307, 7308, 7309, 7315, 7313, 7314, 7322] → Tgt Spa: ['0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000'] [Step 297 / Rank 4] Tasks: ['In-Context Learning'] | Lens: [62228] → Tgt Spa: ['1.000'] [Step 297 / Rank 6] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25481, 25483] → Tgt Spa: ['1.000', '1.000'] [Step 297 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [63053] → Tgt Spa: ['1.000'] [Step 297 / Rank 5] Tasks: ['Summarization'] | Lens: [46386] → Tgt Spa: ['1.000'] [Step 297 / Rank 3] Tasks: ['Single QA'] | Lens: [56612] → Tgt Spa: ['0.350'] [Step 297 / Rank 6] Tasks: ['Code', 'Summarization', 'Code'] | Lens: [20690, 20702, 20693] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 297 / Rank 4] Tasks: ['Summarization'] | Lens: [46386] → Tgt Spa: ['1.000'] [Step 297 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [63053] → Tgt Spa: ['1.000'] [Step 297 / Rank 7] Tasks: ['Code', 
'Summarization', 'Code'] | Lens: [20690, 20702, 20693] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 297 / Rank 2] Tasks: ['Single QA'] | Lens: [56612] → Tgt Spa: ['0.350'] [Step 297 / Rank 5] Tasks: ['Single QA'] | Lens: [64036] → Tgt Spa: ['0.350'] [Step 297 / Rank 6] Tasks: ['Code'] | Lens: [33014] → Tgt Spa: ['1.000'] [Step 297 / Rank 4] Tasks: ['Single QA'] | Lens: [64036] → Tgt Spa: ['0.350'] [Step 297 / Rank 1] Tasks: ['Single QA', 'Summarization'] | Lens: [22332, 22352] → Tgt Spa: ['0.350', '1.000'] [Step 297 / Rank 0] Tasks: ['Single QA', 'Summarization'] | Lens: [22332, 22352] → Tgt Spa: ['0.350', '1.000'] [Step 297 / Rank 3] Tasks: ['Single QA'] | Lens: [61563] → Tgt Spa: ['0.350'] [Step 297 / Rank 7] Tasks: ['Code'] | Lens: [33014] → Tgt Spa: ['1.000'] [Step 297 / Rank 2] Tasks: ['Single QA'] | Lens: [61563] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 08:18:24,803 >> @ 297 | Loss: 2.0808 | LM: 2.0047 | Reg: 0.0761 | Spa(Avg): 0.583 [INFO|lh_trainer.py:797] 2026-02-17 08:18:24,803 >> Statistic -> Code | Spa: 0.705 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 08:18:24,803 >> Statistic -> In-Context | Spa: 0.718 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:18:24,803 >> Statistic -> MultiHop | Spa: 0.627 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:18:24,803 >> Statistic -> Single | Spa: 0.483 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:18:24,803 >> Statistic -> Summarization | Spa: 0.692 | Tgt: 1.000 | Z-Loss: 0.097 | [INFO|lh_trainer.py:810] 2026-02-17 08:18:24,805 >> [Micro-Log] {"loss": 2.0808319921294847, "lm_loss": 2.0047486114005246, "reg_loss": 0.07608339478125951, "model_sparsity(avg)": 0.5831541493535042, "Spa-Single QA sparsity": 0.48344016992128813, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.08964194471571738, "Spa-Code sparsity": 0.7045454437082465, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 
0.09811370955272154, "Spa-In-Context Learning sparsity": 0.7175925890604655, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10767306288083395, "Spa-Summarization sparsity": 0.6916666507720948, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09695759266614914, "Spa-MultiHop QA sparsity": 0.6273148059844971, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.12053842770142688, "step": 297, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:18:50,978 >> {'loss': 12.485, 'grad_norm': 0.6717016696929932, 'learning_rate': 1.927909205451808e-07, 'epoch': 0.3138493944181148, 'num_input_tokens_seen': 733445910, 'completed': '99.33% (298 / 300)', 'remaining time': '0:05:37', 'throughput': '6783.77', 'gpu_mem_free': '10713MB', 'step': 298} [Step 298 / Rank 6] Tasks: ['Single QA'] | Lens: [65021] → Tgt Spa: ['0.350'] [Step 298 / Rank 3] Tasks: ['Single QA'] | Lens: [46762] → Tgt Spa: ['0.350'] [Step 298 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58170] → Tgt Spa: ['1.000'] [Step 298 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58170] → Tgt Spa: ['1.000'] [Step 298 / Rank 5] Tasks: ['Single QA'] | Lens: [54776] → Tgt Spa: ['0.350'] [Step 298 / Rank 2] Tasks: ['Single QA'] | Lens: [46762] → Tgt Spa: ['0.350'] [Step 298 / Rank 7] Tasks: ['Single QA'] | Lens: [65021] → Tgt Spa: ['0.350'] [Step 298 / Rank 4] Tasks: ['Single QA'] | Lens: [54776] → Tgt Spa: ['0.350'] [Step 298 / Rank 7] Tasks: ['Single QA'] | Lens: [36628] → Tgt Spa: ['0.350'] [Step 298 / Rank 5] Tasks: ['Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'Summarization', 'Code', 'Single QA', 'Code'] | Lens: [3561, 3553, 3554, 3554, 
3561, 3556, 3558, 3556, 3555, 3556, 3557, 3556, 3564, 3558, 3576, 3566, 3559, 3568] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 298 / Rank 0] Tasks: ['Single QA'] | Lens: [40461] → Tgt Spa: ['0.350'] [Step 298 / Rank 6] Tasks: ['Single QA'] | Lens: [36628] → Tgt Spa: ['0.350'] [Step 298 / Rank 3] Tasks: ['In-Context Learning', 'Code'] | Lens: [27256, 27265] → Tgt Spa: ['1.000', '1.000'] [Step 298 / Rank 1] Tasks: ['Single QA'] | Lens: [40461] → Tgt Spa: ['0.350'] [Step 298 / Rank 2] Tasks: ['In-Context Learning', 'Code'] | Lens: [27256, 27265] → Tgt Spa: ['1.000', '1.000'] [Step 298 / Rank 4] Tasks: ['Code', 'In-Context Learning', 'MultiHop QA', 'MultiHop QA', 'Code', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Single QA', 'Single QA', 'MultiHop QA', 'MultiHop QA', 'Code', 'Single QA', 'Summarization', 'Code', 'Single QA', 'Code'] | Lens: [3561, 3553, 3554, 3554, 3561, 3556, 3558, 3556, 3555, 3556, 3557, 3556, 3564, 3558, 3576, 3566, 3559, 3568] → Tgt Spa: ['1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000', '0.350', '1.000'] [Step 298 / Rank 6] Tasks: ['Single QA'] | Lens: [63983] → Tgt Spa: ['0.350'] [Step 298 / Rank 1] Tasks: ['Single QA'] | Lens: [49179] → Tgt Spa: ['0.350'] [Step 298 / Rank 5] Tasks: ['Single QA'] | Lens: [35627] → Tgt Spa: ['0.350'] [Step 298 / Rank 2] Tasks: ['Single QA'] | Lens: [33032] → Tgt Spa: ['0.350'] [Step 298 / Rank 7] Tasks: ['Single QA'] | Lens: [63983] → Tgt Spa: ['0.350'] [Step 298 / Rank 0] Tasks: ['Single QA'] | Lens: [49179] → Tgt Spa: ['0.350'] [Step 298 / Rank 3] Tasks: ['Single QA'] | Lens: [33032] → Tgt Spa: ['0.350'] [Step 298 / Rank 4] Tasks: ['Single QA'] | Lens: [35627] → Tgt Spa: ['0.350'] [Step 298 / Rank 1] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21454, 21455, 21455] → Tgt 
Spa: ['0.350', '0.350', '0.350'] [Step 298 / Rank 7] Tasks: ['In-Context Learning'] | Lens: [46191] → Tgt Spa: ['1.000'] [Step 298 / Rank 5] Tasks: ['Code', 'Code'] | Lens: [30997, 30997] → Tgt Spa: ['1.000', '1.000'] [Step 298 / Rank 2] Tasks: ['Single QA'] | Lens: [46700] → Tgt Spa: ['0.350'] [Step 298 / Rank 4] Tasks: ['Code', 'Code'] | Lens: [30997, 30997] → Tgt Spa: ['1.000', '1.000'] [Step 298 / Rank 3] Tasks: ['Single QA'] | Lens: [46700] → Tgt Spa: ['0.350'] [Step 298 / Rank 0] Tasks: ['Single QA', 'Single QA', 'Single QA'] | Lens: [21454, 21455, 21455] → Tgt Spa: ['0.350', '0.350', '0.350'] [Step 298 / Rank 6] Tasks: ['In-Context Learning'] | Lens: [46191] → Tgt Spa: ['1.000'] [Step 298 / Rank 6] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17099, 17099, 17112] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 298 / Rank 0] Tasks: ['Single QA'] | Lens: [51206] → Tgt Spa: ['0.350'] [Step 298 / Rank 2] Tasks: ['Single QA'] | Lens: [34544] → Tgt Spa: ['0.350'] [Step 298 / Rank 7] Tasks: ['Code', 'Code', 'Summarization'] | Lens: [17099, 17099, 17112] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 298 / Rank 5] Tasks: ['Code'] | Lens: [45469] → Tgt Spa: ['1.000'] [Step 298 / Rank 3] Tasks: ['Single QA'] | Lens: [34544] → Tgt Spa: ['0.350'] [Step 298 / Rank 1] Tasks: ['Single QA'] | Lens: [51206] → Tgt Spa: ['0.350'] [Step 298 / Rank 4] Tasks: ['Code'] | Lens: [45469] → Tgt Spa: ['1.000'] [Step 298 / Rank 1] Tasks: ['Single QA'] | Lens: [51696] → Tgt Spa: ['0.350'] [Step 298 / Rank 7] Tasks: ['Single QA', 'Single QA'] | Lens: [29954, 29954] → Tgt Spa: ['0.350', '0.350'] [Step 298 / Rank 0] Tasks: ['Single QA'] | Lens: [51696] → Tgt Spa: ['0.350'] [Step 298 / Rank 5] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 
'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Code'] | Lens: [1895, 1916, 1914, 1912, 1914, 1915, 1916, 1897, 1898, 1916, 1898, 1898, 1899, 1899, 1900, 1919, 1906, 1920, 1903, 1903, 1901, 1901, 1923, 1921, 1902, 1903, 1903, 1922, 1905, 1906, 1923, 1906, 1925, 1913] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', '1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 298 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25404, 25405] → Tgt Spa: ['1.000', '1.000'] [Step 298 / Rank 6] Tasks: ['Single QA', 'Single QA'] | Lens: [29954, 29954] → Tgt Spa: ['0.350', '0.350'] [Step 298 / Rank 4] Tasks: ['MultiHop QA', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'MultiHop QA', 'Summarization', 'MultiHop QA', 'Summarization', 'Code'] | Lens: [1895, 1916, 1914, 1912, 1914, 1915, 1916, 1897, 1898, 1916, 1898, 1898, 1899, 1899, 1900, 1919, 1906, 1920, 1903, 1903, 1901, 1901, 1923, 1921, 1902, 1903, 1903, 1922, 1905, 1906, 1923, 1906, 1925, 1913] → Tgt Spa: ['0.350', '1.000', '1.000', '1.000', '1.000', '1.000', '1.000', '0.350', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '0.350', '1.000', '0.350', '1.000', '0.350', '0.350', '0.350', '0.350', '1.000', 
'1.000', '0.350', '0.350', '0.350', '1.000', '0.350', '0.350', '1.000', '0.350', '1.000', '1.000'] [Step 298 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [25404, 25405] → Tgt Spa: ['1.000', '1.000'] [INFO|lh_trainer.py:781] 2026-02-17 08:21:10,034 >> @ 298 | Loss: 2.0972 | LM: 2.0386 | Reg: 0.0586 | Spa(Avg): 0.507 [INFO|lh_trainer.py:797] 2026-02-17 08:21:10,034 >> Statistic -> Code | Spa: 0.704 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:797] 2026-02-17 08:21:10,034 >> Statistic -> In-Context | Spa: 0.720 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:21:10,034 >> Statistic -> MultiHop | Spa: 0.614 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:21:10,034 >> Statistic -> Single | Spa: 0.419 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:21:10,035 >> Statistic -> Summarization | Spa: 0.644 | Tgt: 1.000 | Z-Loss: 0.123 | [INFO|lh_trainer.py:810] 2026-02-17 08:21:10,037 >> [Micro-Log] {"loss": 2.097165590773026, "lm_loss": 2.038590998699268, "reg_loss": 0.05857461374156022, "model_sparsity(avg)": 0.5072678402066231, "Spa-In-Context Learning sparsity": 0.7199074029922485, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 0.10669419666131337, "Spa-Single QA sparsity": 0.4185605997389013, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.046129113661167634, "Spa-Code sparsity": 0.7037037014961243, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09840173398454984, "Spa-MultiHop QA sparsity": 0.6143162388067979, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.11344771860883786, "Spa-Summarization sparsity": 0.6440972201526165, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.12251959927380085, "step": 298, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 
Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:21:28,642 >> {'loss': 12.583, 'grad_norm': 0.46403953433036804, 'learning_rate': 8.571874754380943e-08, 'epoch': 0.3149025803054239, 'num_input_tokens_seen': 735908532, 'completed': '99.67% (299 / 300)', 'remaining time': '0:02:48', 'throughput': '7809.70', 'gpu_mem_free': '8719MB', 'step': 299} [Step 299 / Rank 6] Tasks: ['Single QA'] | Lens: [51534] → Tgt Spa: ['0.350'] [Step 299 / Rank 3] Tasks: ['Code'] | Lens: [40815] → Tgt Spa: ['1.000'] [Step 299 / Rank 4] Tasks: ['Single QA'] | Lens: [59062] → Tgt Spa: ['0.350'] [Step 299 / Rank 7] Tasks: ['Single QA'] | Lens: [51534] → Tgt Spa: ['0.350'] [Step 299 / Rank 0] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22659, 22661] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 1] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [22659, 22661] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 2] Tasks: ['Code'] | Lens: [40815] → Tgt Spa: ['1.000'] [Step 299 / Rank 5] Tasks: ['Single QA'] | Lens: [59062] → Tgt Spa: ['0.350'] [Step 299 / Rank 0] Tasks: ['Code'] | Lens: [39921] → Tgt Spa: ['1.000'] [Step 299 / Rank 3] Tasks: ['Single QA'] | Lens: [50746] → Tgt Spa: ['0.350'] [Step 299 / Rank 5] Tasks: ['Single QA'] | Lens: [56505] → Tgt Spa: ['0.350'] [Step 299 / Rank 7] Tasks: ['Single QA'] | Lens: [65016] → Tgt Spa: ['0.350'] [Step 299 / Rank 1] Tasks: ['Code'] | Lens: [39921] → Tgt Spa: ['1.000'] [Step 299 / Rank 6] Tasks: ['Single QA'] | Lens: [65016] → Tgt Spa: ['0.350'] [Step 299 / Rank 4] Tasks: ['Single QA'] | Lens: [56505] → Tgt Spa: ['0.350'] [Step 299 / Rank 2] Tasks: ['Single QA'] | Lens: [50746] → Tgt Spa: ['0.350'] [Step 299 / Rank 2] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23689, 23689] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 1] Tasks: ['MultiHop QA', 'Single QA'] | Lens: [31259, 31260] → Tgt Spa: ['0.350', '0.350'] [Step 299 / Rank 0] Tasks: ['MultiHop QA', 'Single QA'] | Lens: [31259, 31260] → 
Tgt Spa: ['0.350', '0.350'] [Step 299 / Rank 5] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25914, 25898] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 3] Tasks: ['In-Context Learning', 'In-Context Learning'] | Lens: [23689, 23689] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 4] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [25914, 25898] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 7] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22097, 22080] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 6] Tasks: ['Summarization', 'In-Context Learning'] | Lens: [22097, 22080] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 6] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [32268, 32270] → Tgt Spa: ['1.000', '0.350'] [Step 299 / Rank 3] Tasks: ['Single QA'] | Lens: [64334] → Tgt Spa: ['0.350'] [Step 299 / Rank 4] Tasks: ['Single QA'] | Lens: [64934] → Tgt Spa: ['0.350'] [Step 299 / Rank 0] Tasks: ['In-Context Learning'] | Lens: [58455] → Tgt Spa: ['1.000'] [Step 299 / Rank 2] Tasks: ['Single QA'] | Lens: [64334] → Tgt Spa: ['0.350'] [Step 299 / Rank 5] Tasks: ['Single QA'] | Lens: [64934] → Tgt Spa: ['0.350'] [Step 299 / Rank 7] Tasks: ['In-Context Learning', 'Single QA'] | Lens: [32268, 32270] → Tgt Spa: ['1.000', '0.350'] [Step 299 / Rank 1] Tasks: ['In-Context Learning'] | Lens: [58455] → Tgt Spa: ['1.000'] [Step 299 / Rank 4] Tasks: ['Single QA'] | Lens: [43828] → Tgt Spa: ['0.350'] [Step 299 / Rank 0] Tasks: ['Single QA', 'Code', 'In-Context Learning'] | Lens: [20093, 20102, 20096] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 299 / Rank 5] Tasks: ['Single QA'] | Lens: [43828] → Tgt Spa: ['0.350'] [Step 299 / Rank 2] Tasks: ['Single QA'] | Lens: [62018] → Tgt Spa: ['0.350'] [Step 299 / Rank 3] Tasks: ['Single QA'] | Lens: [62018] → Tgt Spa: ['0.350'] [Step 299 / Rank 7] Tasks: ['Single QA'] | Lens: [45983] → Tgt Spa: ['0.350'] [Step 299 / Rank 6] Tasks: ['Single QA'] | Lens: [45983] → Tgt Spa: ['0.350'] [Step 299 / Rank 1] Tasks: 
['Single QA', 'Code', 'In-Context Learning'] | Lens: [20093, 20102, 20096] → Tgt Spa: ['0.350', '1.000', '1.000'] [Step 299 / Rank 2] Tasks: ['Single QA'] | Lens: [60145] → Tgt Spa: ['0.350'] [Step 299 / Rank 7] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23302, 23326] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 1] Tasks: ['Code'] | Lens: [34153] → Tgt Spa: ['1.000'] [Step 299 / Rank 0] Tasks: ['Code'] | Lens: [34153] → Tgt Spa: ['1.000'] [Step 299 / Rank 6] Tasks: ['In-Context Learning', 'Summarization'] | Lens: [23302, 23326] → Tgt Spa: ['1.000', '1.000'] [Step 299 / Rank 4] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20657, 20661, 20650] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 299 / Rank 5] Tasks: ['Summarization', 'Summarization', 'Code'] | Lens: [20657, 20661, 20650] → Tgt Spa: ['1.000', '1.000', '1.000'] [Step 299 / Rank 3] Tasks: ['Single QA'] | Lens: [60145] → Tgt Spa: ['0.350'] [INFO|lh_trainer.py:781] 2026-02-17 08:24:15,246 >> @ 299 | Loss: 2.0245 | LM: 1.9718 | Reg: 0.0527 | Spa(Avg): 0.523 [INFO|lh_trainer.py:797] 2026-02-17 08:24:15,246 >> Statistic -> Code | Spa: 0.719 | Tgt: 1.000 | Z-Loss: 0.092 | [INFO|lh_trainer.py:797] 2026-02-17 08:24:15,246 >> Statistic -> In-Context | Spa: 0.715 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:24:15,246 >> Statistic -> MultiHop | Spa: 0.472 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:24:15,246 >> Statistic -> Single | Spa: 0.364 | Tgt: 0.000 | Z-Loss: 0.000 | [INFO|lh_trainer.py:797] 2026-02-17 08:24:15,246 >> Statistic -> Summarization | Spa: 0.689 | Tgt: 1.000 | Z-Loss: 0.098 | [INFO|lh_trainer.py:810] 2026-02-17 08:24:15,248 >> [Micro-Log] {"loss": 2.024469687913855, "lm_loss": 1.9717628775785367, "reg_loss": 0.05270681706315372, "model_sparsity(avg)": 0.5234374962747097, "Spa-In-Context Learning sparsity": 0.7152777671813965, "Spa-In-Context Learning target_sparsity": 1.0, "Spa-In-Context Learning log_z_loss": 
0.10867707580327987, "Spa-Code sparsity": 0.7194444417953492, "Spa-Code target_sparsity": 1.0, "Spa-Code log_z_loss": 0.09198781847953796, "Spa-MultiHop QA sparsity": 0.4722222089767456, "Spa-MultiHop QA target_sparsity": 0.349609375, "Spa-MultiHop QA log_z_loss": 0.04466351121664047, "Spa-Single QA sparsity": 0.36408728786877226, "Spa-Single QA target_sparsity": 0.349609375, "Spa-Single QA log_z_loss": 0.014260723746182131, "Spa-Summarization sparsity": 0.6888889074325562, "Spa-Summarization target_sparsity": 1.0, "Spa-Summarization log_z_loss": 0.09835912585258484, "step": 299, "current_tau": 1.0, "lambda1 Single QA": 0.59375, "lambda2 MultiHop QA": 0.314453125, "lambda3 Summarization": 0.1669921875, "lambda4 Code": 0.267578125} [INFO|lh_trainer.py:331] 2026-02-17 08:24:39,040 >> {'loss': 12.1468, 'grad_norm': 0.5197071433067322, 'learning_rate': 2.1468104356439287e-08, 'epoch': 0.315955766192733, 'num_input_tokens_seen': 738472692, 'completed': '100.00% (300 / 300)', 'remaining time': '0:00:00', 'throughput': '6733.66', 'gpu_mem_free': '14699MB', 'step': 300} /opt/conda/envs/qqt/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . 
warnings.warn( [INFO|trainer.py:3984] 2026-02-17 08:24:51,728 >> Saving model checkpoint to checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-300 [INFO|configuration_utils.py:419] 2026-02-17 08:24:51,894 >> Configuration saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-300/config.json [INFO|configuration_utils.py:911] 2026-02-17 08:24:51,900 >> Configuration saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-300/generation_config.json [INFO|modeling_utils.py:3580] 2026-02-17 08:25:32,734 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-300/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-02-17 08:25:32,740 >> tokenizer config file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-300/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-02-17 08:25:32,745 >> Special tokens file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-300/special_tokens_map.json [INFO|tokenization_utils_base.py:2572] 2026-02-17 08:25:32,748 >> added tokens file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/checkpoint-300/added_tokens.json
[INFO|trainer.py:2681] 2026-02-17 08:26:35,388 >> Training completed. Do not forget to share your model on huggingface.co/models =) [INFO|lh_trainer.py:331] 2026-02-17 08:26:35,390 >> {'train_runtime': 50692.0245, 'train_samples_per_second': 0.284, 'train_steps_per_second': 0.006, 'train_loss': 12.393445908228557, 'epoch': 0.315955766192733, 'num_input_tokens_seen': 738472692, 'completed': '100.00% (300 / 300)', 'remaining time': '0:00:00', 'throughput': '0.00', 'gpu_mem_free': '9949MB', 'step': 300} [rank2]:[W217 08:26:48.139290592 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [rank4]:[W217 08:26:48.173515781 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [rank6]:[W217 08:26:48.324190272 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.
For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [INFO|trainer.py:3984] 2026-02-17 08:26:48,904 >> Saving model checkpoint to checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B [INFO|configuration_utils.py:419] 2026-02-17 08:26:49,069 >> Configuration saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/config.json [INFO|configuration_utils.py:911] 2026-02-17 08:26:49,074 >> Configuration saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/generation_config.json [rank7]:[W217 08:26:49.134459652 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [rank5]:[W217 08:26:49.193167492 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [rank3]:[W217 08:26:49.234541487 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [rank1]:[W217 08:26:49.367149821 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) [INFO|modeling_utils.py:3580] 2026-02-17 08:27:30,280 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 7 checkpoint shards. 
You can find where each parameters has been saved in the index located at checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2510] 2026-02-17 08:27:30,287 >> tokenizer config file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/tokenizer_config.json [INFO|tokenization_utils_base.py:2519] 2026-02-17 08:27:30,292 >> Special tokens file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/special_tokens_map.json [INFO|tokenization_utils_base.py:2572] 2026-02-17 08:27:30,295 >> added tokens file saved in checkpoints/2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B/added_tokens.json ***** train metrics ***** epoch = 0.316 num_input_tokens_seen = 738472692 train_loss = 12.3934 train_runtime = 14:04:52.02 train_samples_per_second = 0.284 train_steps_per_second = 0.006 swanlab: Experiment 2.15steps300_full_xattn_layer_router_test_wfrozen_end_0.35_Qwen3-8B has completed swanlab: 🏠 View project at https://swanlab.cn/@qqtang/NIPS swanlab: 🚀 View run at https://swanlab.cn/@qqtang/NIPS/runs/t1f5tfg4dj0dg1bwh58hb [rank0]:[W217 08:27:33.190935283 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())