P2728R11 — Unicode in the Library, Part 1: UTF Transcoding

(13 items) SG-16 Unicode, SG-9 Ranges, LEWG

Eddie Nolan

This paper proposes range adaptor views for transcoding between UTF-8, UTF-16, and UTF-32 encodings in the C++ standard library. It introduces views::to_utf8, views::to_utf16, views::to_utf32, and their _or_error variants that produce std::expected values, along with as_char8_t, as_char16_t, and as_char32_t code unit casting adaptors. Invalid UTF subsequences are handled by substituting U+FFFD replacement characters (per Unicode Substitution of Maximal Subparts) or by yielding typed error values via utf_transcoding_error, providing a safe modern replacement for the deprecated std::codecvt facilities.

Section 9.2, 24.7.?.6, success() specification — The range 'between 0xC0 and 0xC2' for invalid_utf8_leading_byte includes 0xC2, which is a valid UTF-8 leading byte (starts 2-byte sequences for U+0080-U+00BF). The paper's own design discussion in Section 10 states this applies to 0xC0-0xC1 only. Should be 'between 0xC0 and 0xC1'. [1]
Section 9.2, 24.7.?.6, success() and read() specifications — The exposition-only name from-type is used normatively (e.g., 'If from-type is char8_t') but is never defined anywhere in the proposed wording. Presumably range_value_t, but this must be stated. [2]
Section 1, High-Level Overview — The first example pipes the string literal u8"\U0001f642" (type const char8_t[5]: four UTF-8 code units plus the terminating null) directly into views::to_utf32, but Section 9.2 states that if T is an array of char8_t, char16_t, or char32_t, to_utfN(E) is ill-formed. The example is therefore ill-formed under the paper's own rules. Section 6.1 correctly uses u8"..."sv. [3]
Section 10.1 — The prose says 'arrays of char' when explaining why array inputs are rejected, but the proposed wording rejects arrays of char8_t, char16_t, or char32_t, not arrays of char. The prose does not match the wording. [4]
Section 9.2, 24.7.?.6, read() Effects — After the period following 'decoding c', the next clause 'encodes c into buf_' begins with a lowercase letter and lacks a subject. Missing 'It' or a semicolon join. [5]
Section 9.2, 24.7.?.1 Overview, empty_view bullet — utf_trancoding_error is misspelled; the declared type is utf_transcoding_error (missing 's'). [6]
Section 4, final paragraph — to_utfN_as_error is inconsistent with the name used everywhere else in the paper, which is to_utfN_or_error. [7]
Section 9.3, Code unit adaptors — Stray trailing '/' in the exposition-only comment: reads '// exposition only/' instead of '// exposition only'. [8]
Section 9.2, 24.7.?.1 Overview — Doubled word: 'that that char8_t corresponds to UTF-8'. Remove one 'that'. [9]
Section 9.2, 24.7.?.6, success() specification for out_of_range — Doubled word: 'If the the current input subsequence'. Remove one 'the'. [10]
Section 10.3, 'Why We Don't Cache begin()' — 'compexity' is misspelled; should be 'complexity'. [11]
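
The first finding turns on which code units may begin a UTF-8 sequence. The following is a minimal C++ sketch of that classification, based on the well-formed UTF-8 byte ranges in the Unicode Standard. The names lead_kind and classify_lead are hypothetical helpers introduced for illustration; they do not appear in P2728R11 and are not the paper's implementation.

```cpp
#include <cstdint>

// Hypothetical helper types, not taken from P2728R11's wording.
enum class lead_kind { ascii, continuation, two_byte, three_byte, four_byte, invalid };

// Classifies one code unit by the role it can play as the first byte
// of a UTF-8 sequence.
constexpr lead_kind classify_lead(std::uint8_t b) {
    return b <= 0x7F ? lead_kind::ascii          // 0x00..0x7F: U+0000..U+007F
         : b <= 0xBF ? lead_kind::continuation   // 0x80..0xBF: never starts a sequence
         : b <= 0xC1 ? lead_kind::invalid        // 0xC0, 0xC1: could only encode overlong forms
         : b <= 0xDF ? lead_kind::two_byte       // 0xC2..0xDF
         : b <= 0xEF ? lead_kind::three_byte     // 0xE0..0xEF
         : b <= 0xF4 ? lead_kind::four_byte      // 0xF0..0xF4
         :             lead_kind::invalid;       // 0xF5..0xFF: would exceed U+10FFFF
}

// 0xC0 and 0xC1 are never valid, but 0xC2 begins the two-byte
// sequences for U+0080..U+00BF (0xC2 0x80 through 0xC2 0xBF).
static_assert(classify_lead(0xC0) == lead_kind::invalid, "0xC0 is never a valid lead byte");
static_assert(classify_lead(0xC1) == lead_kind::invalid, "0xC1 is never a valid lead byte");
static_assert(classify_lead(0xC2) == lead_kind::two_byte, "0xC2 starts a two-byte sequence");
static_assert(classify_lead(0xF4) == lead_kind::four_byte, "0xF4 starts a four-byte sequence");
static_assert(classify_lead(0xF5) == lead_kind::invalid, "0xF5 is never a valid lead byte");
```

All checks run at compile time, so a range written as 'between 0xC0 and 0xC2' would misclassify a valid lead byte, which is the inconsistency the finding reports.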

References — Anthropic Citations API

[1]
"If the current input subsequence is a code unit between 0xC0 and 0xC2, or between 0xF5 and 0xFF, returns invalid_utf8_leading_byte."

[2]
"If from-type is char8_t:" (multiple occurrences: §24.7.?.6 success() and read()/read-reverse())

[3]
"static_assert((u8"🙂" | views::to_utf32 | ranges::to()) == U"🙂");"

[4]
"the to_utfN CPOs reject all inputs that are arrays of char, as do the as_charN_t casting CPOs."

[5]
"It sets to_increment_ to the number of code units read while decoding c. encodes c into buf_"

[6]
"empty_view>{}"

[7]
"The to_utfN_as_error views also use this scheme but produce unexpected values instead of replacement characters."

[8]
"struct implicit-cast-to { // exposition only/"

[9]
"which is that that char8_t corresponds to UTF-8"

[10]
"If the the current input subsequence is the code unit 0xF4"

[11]
"the underlying range could have unbounded compexity"

Summary: P2728R11 proposes a set of UTF transcoding range adaptors (to_utf8, to_utf16, to_utf32) and their error-handling variants for the C++ standard library, built on top of std::ranges. It provides views that lazily transcode between UTF-8, UTF-16, and UTF-32, with configurable error handling via expected-based interfaces.

Pipeline: Discovery (Anthropic Opus + Citations API) → Verification Gate (OpenRouter Opus) → Report Writer (OpenRouter Opus)
Provenance: All references are machine-verified character positions from the Anthropic Citations API — deterministic, exact substrings, not model-generated quotes.