P4016R0 — Canonical Parallel Reduction: A Fixed Expression Structure for Run-To-Run Consistency

P4016R0 — Canonical Parallel Reduction: A Fixed Expression Structure for Run-To-Run Consistency (3 items) SG6, LEWG

Andrew Drakeford

This paper addresses the gap between std::accumulate (fully specified sequential left-fold) and std::reduce (parallel but with unspecified expression structure) by proposing a canonical reduction expression for parallel reductions. For a given input sequence, lane count L, and binary operation, the proposal defines a single fixed abstract expression via a two-stage interleaved binary tree (per-lane iterative pairwise reduction followed by cross-lane reduction), enabling run-to-run reproducibility without constraining execution strategy. The paper seeks LEWG directional approval of the semantic framework before committing to a specific API surface.

Appendix M.1 (CI Regression Testing) — Call to canonical_reduce<8> passes three arguments (first, last, init), omitting the required binary_op parameter. No three-argument overload is declared anywhere in the paper; all illustrative signatures (section 9.1, Appendix A, Appendix J.3) require four arguments including BinaryOperation binary_op. [1]
Appendix M.2 (Distributed Training Checkpoints) — Both calls to canonical_reduce<8> pass three arguments (first, last, init), omitting binary_op. Same mismatch with every declared signature in the paper, which uniformly requires four arguments. [2]
Appendix H, section H.2 (Representative observations) — Literal asterisk characters appear after '[Note:' and before '--end note]' in the bracketed note. These are unrendered markdown bold/italic markers that survived the document toolchain as raw '*' characters. [3]

References — Anthropic Citations API

[1]
"double computed = canonical_reduce<8>( // L = 8 gradients.begin(), gradients.end(), 0.0);"

[2]
"auto gradient_sum = canonical_reduce<8>( local_gradients.begin(), local_gradients.end(), 0.0); and auto restored_sum = canonical_reduce<8>( restored_gradients.begin(), restored_gradients.end(), 0.0);"

[3]
"[Note:* These ﬁgures reﬂect a baseline prototype. and *—end note]"

Summary: P4016R0 proposes a new parallel reduction algorithm, canonical_reduce, parameterized by a fixed summation tree depth L, that guarantees identical floating-point results across runs regardless of thread count or execution order. The paper targets SG6 and LEWG with motivating use cases in scientific computing, ML training, and CI reproducibility.

Pipeline: Discovery (Anthropic Opus + Citations API) → Verification Gate (OpenRouter Opus) → Report Writer (OpenRouter Opus)
Provenance: All references are machine-verified character positions from the Anthropic Citations API — deterministic, exact substrings, not model-generated quotes.