P4016R0 — Canonical Parallel Reduction: A Fixed Expression Structure for Run-To-Run Consistency (3 items) SG6, LEWG
Andrew Drakeford
This paper addresses the gap between std::accumulate (fully specified sequential left-fold) and std::reduce (parallel but with unspecified expression structure) by proposing a canonical reduction expression for parallel reductions. For a given input sequence, lane count L, and binary operation, the proposal defines a single fixed abstract expression via a two-stage interleaved binary tree (per-lane iterative pairwise reduction followed by cross-lane reduction), enabling run-to-run reproducibility without constraining execution strategy. The paper seeks LEWG directional approval of the semantic framework before committing to a specific API surface.

References — Anthropic Citations API

[1]
"double computed = canonical_reduce<8>( // L = 8 gradients.begin(), gradients.end(), 0.0);"
[2]
"auto gradient_sum = canonical_reduce<8>( local_gradients.begin(), local_gradients.end(), 0.0); and auto restored_sum = canonical_reduce<8>( restored_gradients.begin(), restored_gradients.end(), 0.0);"
[3]
"[Note:* These figures reflect a baseline prototype. and *—end note]"
Summary: P4016R0 proposes a new parallel reduction algorithm, canonical_reduce, parameterized by a fixed summation tree depth L, that guarantees identical floating-point results across runs regardless of thread count or execution order. The paper targets SG6 and LEWG with motivating use cases in scientific computing, ML training, and CI reproducibility.
Pipeline: Discovery (Anthropic Opus + Citations API) → Verification Gate (OpenRouter Opus) → Report Writer (OpenRouter Opus)
Provenance: All references are machine-verified character positions from the Anthropic Citations API — deterministic, exact substrings, not model-generated quotes.