# Bundle conflicts log: flash-attn-4-sm120-sncbl

Conflicts and genuine bugs encountered while stacking PRs #2348/2349/2389/2439 on current Dao-AILab/flash-attention main. Each entry: location, what the two sides did, how I resolved it, and whether an individual PR needs a backport fix at merge time.

Bundle validated on SM121a (DGX Spark GB10):

- TMA forward (dense, bf16, causal): max_diff=0.0078
- Paged KV (varlen, bf16, causal, ps=16): max_diff=0.0039
- Dropout (bf16, causal, p=0.1, seed=42): finite, magnitude ratio 1.11
- Plain forward (bf16/fp16, causal={0,1}): max_diff 0.0005-0.008

---

## [SEMANTIC] interface.py SM120 dispatch: #2348 × #2349 × #2389 × #2439

See entry below; addressed during stacking.

## [SEMANTIC] flash_fwd.py FlashAttentionForwardSm80.call(): #2348 × #2389

See entry below; addressed during stacking.

## [MINOR] flash_fwd.py launch()/kernel() args: #2389 × #2439

See entry below; addressed during stacking.

---

## [BUG found in #2349] FlashAttentionForwardSm120Tma.__call__ stream position

**Location**: `flash_attn/cute/flash_fwd_sm120_tma.py` `__call__` signature.

**Bug**: `stream` is declared at position 7 (right after `softmax_scale`) instead of at the end. The base `FlashAttentionForwardSm80.__call__` and `FlashAttentionForwardSm100.__call__` both put `stream` last, with a comment: *"Always keep stream as the last parameter (EnvStream: obtained implicitly via TVM FFI)"*. `cute.compile` binds args positionally, so the `compile_args` list (which ends with `current_stream`) would pass `cu_seqlens_q_tensor` where the TMA kernel expects `stream`.

**When #2349 lands alone**: Manifests as a runtime error once the compiled TMA kernel is invoked through `_flash_attn_fwd`: `DSLRuntimeError: expects argument #20 (dropout_seed_hi) to be one of (Int32, NoneType), but got _FakeStream`. The TMA PR branch's own tests probably route the stream differently or never hit this code path.

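
The positional-binding failure described above can be reproduced in plain Python without the DSL. This is a minimal sketch under stated assumptions: `FakeStream`, `call_stream_last`, `call_stream_early`, and `stream_is_last` are hypothetical stand-ins with shortened signatures, not the real kernel entry points or the `cute.compile` machinery.

```python
import inspect


class FakeStream:
    """Hypothetical stand-in for the stream object appended to compile_args."""


def call_stream_last(q, softmax_scale, cu_seqlens_q=None, stream=None):
    # Follows the base-class convention: stream is the final parameter.
    return {"cu_seqlens_q": cu_seqlens_q, "stream": stream}


def call_stream_early(q, softmax_scale, stream=None, cu_seqlens_q=None):
    # Mirrors the reported bug: stream sits right after softmax_scale.
    return {"cu_seqlens_q": cu_seqlens_q, "stream": stream}


def stream_is_last(fn):
    # Cheap regression guard: the last declared parameter must be `stream`.
    return list(inspect.signature(fn).parameters)[-1] == "stream"


stream = FakeStream()
# Argument list built by the caller; it always ends with the stream.
compile_args = ["q_tensor", 0.125, "cu_seqlens_q_tensor", stream]

# Positional binding: the i-th argument goes to the i-th parameter.
good = call_stream_last(*compile_args)
bad = call_stream_early(*compile_args)

print(good["stream"] is stream)     # stream arrives in the stream slot
print(bad["stream"])                # cu_seqlens_q lands in the stream slot
print(bad["cu_seqlens_q"] is stream)  # and the stream lands in a tensor slot
```

A check like `stream_is_last` could run in a unit test over every forward class, so a mis-ordered signature fails before any kernel is compiled.
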

**Fix in bundle**: Moved `stream` to the end of TMA's `__call__` signature to match the base class. Also added `dropout_seed_lo` / `dropout_seed_hi` parameters, which the kernel accepts and asserts to be None (interface dispatch already gates TMA on `dropout_p == 0`).

**PR backport recommendation**: apply the same stream-last fix to #2349 directly. This is independent of any other PR: the TMA kernel is broken as written.

---

## [BUG found during merge] DSL SSA collision on shared locals `row_scale` / `sO`

**Location**: `flash_attn/cute/flash_fwd.py` inside `FlashAttentionForwardSm80.call()`.

PR #2389 adds a block-sparse branch that defines `row_scale` and `sO`. The #2348/base dense path also defines `row_scale` and `sO` inside a runtime `if n_block_max > n_block_min:` gate. When both branches coexist, the DSL's SSA analysis sees the same variable name assigned in both a compile-time branch (block-sparse) and a dynamic-if branch (dense), then rejects the dynamic assignment: `"row_scale is None prior to this if, and update to _Tensor inside of this if is not supported."`

**Fix in bundle**: renamed block-sparse locals to `bs_row_scale` / `bs_sO`. They're local-use-only, so the rename has no downstream effect.

**Also**: replaced my initial combined gate `if const_expr(blocksparse_tensors is None) and n_block_max > n_block_min:` with nested `if const_expr(blocksparse_tensors is None): / if n_block_max > n_block_min:`, so the DSL evaluates the `const_expr` purely at compile time.

**PR backport recommendation**: this only surfaces when #2348 and #2389 are both applied. Whichever PR merges second needs to rename its `row_scale`/`sO` to avoid the collision and nest the conditions. Either PR can absorb the fix; it is easier to do in #2389 since it already introduces the block-sparse branch.

---

## [SEMANTIC] interface.py SM120 dispatch

**Four-way conflict** (same location, all four PRs touch it):

- **#2348** sets `num_stages=2` unconditionally, gates on no block-sparse.
- **#2349** adds `FlashAttentionForwardSm120Tma` dispatch.
- **#2389** enables block-sparse on SM120, which conflicts with #2348's assert.
- **#2439** passes `p_dropout=dropout_p` to the SM120 constructor; the TMA kernel doesn't implement dropout.

**Bundle resolution**:

```python
is_varlen = cu_seqlens_q is not None or cu_seqlens_k is not None
use_tma_sm120 = (
    page_table is None
    and not is_varlen
    and not use_block_sparsity
    and dropout_p == 0.0
)
if use_tma_sm120 and FlashAttentionForwardSm120Tma.can_implement(...):
    fa_fwd = FlashAttentionForwardSm120Tma(...)
else:
    num_stages_sm120 = 2 if page_table is not None else 1
    fa_fwd = FlashAttentionForwardSm120(
        ...,
        num_stages=num_stages_sm120,
        ...,
        p_dropout=dropout_p,
    )
```

**PR backport at merge**: whichever PR lands LAST among the four needs to incorporate the combined dispatch. ~5-10 lines each. Flag in PR descriptions now.

---

## [SEMANTIC] flash_fwd.py call(): #2348 × #2389

**Conflict**: #2348 wraps prologue/mainloop in `if n_block_max > n_block_min:` (split-KV empty-range guard), adds `split_idx`, and uses the `n_block - n_block_min` bound. #2389 adds a **separate** block-sparse mainloop path that completes to its own epilogue, then re-gates dense on `if const_expr(blocksparse_tensors is None)`.

**Bundle resolution**:

```python
if const_expr(blocksparse_tensors is not None):
    # block-sparse complete flow (bs_row_scale, bs_sO; see BUG note above)
    ...
if const_expr(blocksparse_tensors is None):
    if n_block_max > n_block_min:
        # dense flow (split-KV-aware, split_idx in epilogue)
        ...
```

**PR backport**: same guidance as dispatch conflict. Scope ~30-40 lines of code movement. Biggest rebase cost; flag in whichever PR is waiting.

---

## [MINOR] flash_fwd.py launch()/kernel() args: #2389 × #2439

**Conflict**: both add kwargs to the same launch call + kernel signature + interface.py `call_args` list.

**Bundle resolution**: keep both (`blocksparse_tensors`, then `dropout_nheads`/`lo`/`hi`). Trivial to backport.

---

Bundle branch: `bundle/sncbl-sm120`. Final validated commits:

1. `657e8b5` Apply PR #2348 (+#2336)
2. `fa631d1` Apply PR #2349 + merge with #2348 dispatch (amended: TMA stream-last fix)
3. `d921b6a` Apply PR #2389 + merge with #2348/#2349
4. `a865455` Apply PR #2439 + merge with #2389/dispatch (amended: bs_ rename + nested gate)

All syntax-clean and validated on SM121a end-to-end through three distinct kernel paths (TMA, paged KV, dropout).

---

## [BUG found in bundle, fixed via #2484] GQA / MQA crd2idx error in flash_fwd_sm120{,_tma}.py

**Symptom (pre-fix)**: any call to `flash_attn_func` or `flash_attn_varlen_func` on SM120 with `qhead_per_kvhead > 1` (i.e., real GQA / MQA workloads like Qwen3, LLaMA3) failed at compile time with:

```
loc("tPrPtr[i] = utils.elem_pointer(tensor, ((h_idx, m_idx),)).toint()" ("flash_attn/cute/pack_gqa.py":139:20)): error: unable to compute crd2idx with '!cute.layout<"(?):(?{i64 div=8})">' and '!cute.coord<"((?,?))">'
```

**Cause**: `Sm80.__call__`'s epilogue (which `Sm120` inherits and `Sm120Tma` calls via `self.epilogue`) takes the `pack_gqa.store_O` branch when `self.pack_gqa` is True (default for `qhead_per_kvhead > 1` in interface.py). `pack_gqa.store_O` calls `compute_ptr` (pack_gqa.py:139), which expects a packed `((qhead_per_kvhead, seqlen_q), headdim)` layout. Sm90 and Sm100 apply `pack_gqa_layout` before handing tensors to PackGQA, but Sm80 does not, so the layout reaching `compute_ptr` is un-packed and crd2idx against the hierarchical coord fails. Even adding the `pack_gqa_layout` calls is not sufficient because Sm80's mainloop tile sizing assumes `tile_m` divides the seqlen dimension cleanly, which fails when `qhead_per_kvhead` does not divide `tile_m`.

**Fix**: override `self.pack_gqa = False` in both `FlashAttentionForwardSm120.__init__` and `FlashAttentionForwardSm120Tma.__init__`, after `super().__init__()`.
This routes GQA / MQA through the non-packed epilogue branch, which is functionally correct on every shape tested. Tracked upstream as [Dao-AILab/flash-attention#2484](https://github.com/Dao-AILab/flash-attention/pull/2484).

**Validation post-fix**: 64 / 64 configurations pass on SM121a (MHA + GQA Qwen3 + GQA LLaMA3 + MQA, dense + varlen, bf16 + fp16, causal + non-causal, batched). Max diff ≤ 0.0156 against PyTorch f32 reference.
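
The shape of the `pack_gqa` override described in the fix can be sketched in plain Python. This is only an illustration under stated assumptions: `BaseForward`, `Sm120Forward`, and `epilogue_branch` are hypothetical stand-ins, and the packing default is folded into the stand-in constructor here, whereas the log says the real default is chosen in interface.py.

```python
class BaseForward:
    """Hypothetical stand-in for the Sm80 base class, not the real constructor."""

    def __init__(self, qhead_per_kvhead=1, pack_gqa=None):
        self.qhead_per_kvhead = qhead_per_kvhead
        # Assumed default: pack when several query heads share one KV head.
        self.pack_gqa = (qhead_per_kvhead > 1) if pack_gqa is None else pack_gqa

    def epilogue_branch(self):
        # Stand-in for the epilogue's choice between the two store paths.
        return "packed" if self.pack_gqa else "non_packed"


class Sm120Forward(BaseForward):
    """Hypothetical stand-in for the SM120 subclass."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Override after super().__init__() so neither the base default nor
        # a caller-supplied value can re-enable the packed epilogue.
        self.pack_gqa = False


base = BaseForward(qhead_per_kvhead=4)
sm120 = Sm120Forward(qhead_per_kvhead=4)
forced = Sm120Forward(qhead_per_kvhead=4, pack_gqa=True)

print(base.epilogue_branch())   # packed
print(sm120.epilogue_branch())  # non_packed
print(forced.pack_gqa)          # False
```

Placing the assignment after `super().__init__()` is the point of the design: an assignment before it would be overwritten by whatever the base constructor stores.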