P2728R11 — Unicode in the Library, Part 1: UTF Transcoding
(13 items)
SG-16 Unicode, SG-9 Ranges, LEWG
This paper proposes range adaptor views for transcoding between UTF-8, UTF-16, and UTF-32 encodings in the C++ standard library. It introduces views::to_utf8, views::to_utf16, views::to_utf32, and their _or_error variants that produce std::expected values, along with as_char8_t, as_char16_t, and as_char32_t code unit casting adaptors. Invalid UTF subsequences are handled by substituting U+FFFD replacement characters (per Unicode Substitution of Maximal Subparts) or by yielding typed error values via utf_transcoding_error, providing a safe modern replacement for the deprecated std::codecvt facilities.
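The substitution behaviour described above can be illustrated without the proposed API. The following is a minimal sketch, not the paper's implementation: decoded, byte_at, decode_one, and to_utf32_lossy are hypothetical names, and the code reads plain bytes from a std::string_view rather than char8_t so that it also compiles as C++17. It assumes the well-formed UTF-8 byte ranges of Table 3-7 in the Unicode Standard and replaces each maximal ill-formed subpart with U+FFFD. The ok flag stands in for the error state that the _or_error variants are described as reporting through std::expected.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: one decoded scalar value or one ill-formed subpart.
struct decoded {
    char32_t value;      // scalar value, or U+FFFD when ok is false
    std::size_t length;  // input bytes consumed, always at least 1
    bool ok;             // false when the consumed bytes were ill-formed
};

constexpr char32_t replacement = 0xFFFD;

constexpr unsigned char byte_at(std::string_view s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// Decodes the sequence at the front of `in`. Precondition: !in.empty().
constexpr decoded decode_one(std::string_view in) {
    const unsigned char b0 = byte_at(in, 0);
    if (b0 < 0x80) return {b0, 1, true};

    std::size_t n = 0;                   // expected sequence length
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    char32_t cp = 0;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        n = 2; cp = b0 & 0x1F;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        n = 3; cp = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;       // excludes overlong forms
        if (b0 == 0xED) hi = 0x9F;       // excludes surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        n = 4; cp = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;       // excludes overlong forms
        if (b0 == 0xF4) hi = 0x8F;       // excludes values above U+10FFFF
    } else {
        return {replacement, 1, false};  // 0x80..0xC1 or 0xF5..0xFF
    }

    for (std::size_t i = 1; i < n; ++i) {
        if (i >= in.size()) return {replacement, i, false};  // truncated input
        const unsigned char b = byte_at(in, i);
        // The i bytes read so far form the maximal subpart; b is not consumed.
        if (b < lo || b > hi) return {replacement, i, false};
        cp = (cp << 6) | (b & 0x3F);
        lo = 0x80; hi = 0xBF;            // later bytes use the plain range
    }
    return {cp, n, true};
}

// Replacement policy: every ill-formed subpart becomes one U+FFFD.
std::u32string to_utf32_lossy(std::string_view in) {
    std::u32string out;
    while (!in.empty()) {
        const decoded d = decode_one(in);
        out.push_back(d.value);
        in.remove_prefix(d.length);
    }
    return out;
}

static_assert(decode_one("\xF0\x9F\x99\x82").value == 0x1F642, "U+1F642 decodes from four bytes");
static_assert(decode_one("\xC2\x80").ok, "0xC2 is a valid leading byte");
static_assert(!decode_one("\xC0\x80").ok, "0xC0 is never a valid leading byte");
```

Under these assumptions, the bytes E2 82 followed by 'A' decode to U+FFFD followed by U+0041: the truncated three-byte sequence is replaced as a unit and decoding resumes at 'A'.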
- Section 9.2, 24.7.?.6, success() specification — The range 'between 0xC0 and 0xC2' for invalid_utf8_leading_byte includes 0xC2, which is a valid UTF-8 leading byte (it starts the 2-byte sequences for U+0080-U+00BF). The paper's own design discussion in Section 10 states this applies to 0xC0-0xC1 only. Should be 'between 0xC0 and 0xC1'. [8]
- Section 9.2, 24.7.?.6, success() and read() specifications — The exposition-only name from-type is used normatively (e.g., 'If from-type is char8_t') but is never defined anywhere in the proposed wording. Presumably range_value_t, but this must be stated. [10]
- Section 1, High-Level Overview — The first example passes u8"\U0001f642" (const char8_t[5]) directly to views::to_utf32, but Section 9.2 states that if T is an array of char8_t, char16_t, or char32_t, to_utfN(E) is ill-formed. The example is ill-formed under the paper's own rules. Section 6.1 correctly uses u8"..."sv. [1]
- Section 10.1 — Prose says 'arrays of char' when explaining why array inputs are rejected, but the proposed wording rejects arrays of char8_t, char16_t, or char32_t, not arrays of char. The prose does not match the wording. [9]
- Section 9.2, 24.7.?.6, read() Effects — After the period following 'decoding c', the next clause 'encodes c into buf_' begins with a lowercase letter and lacks a subject. It needs a leading 'It' or a semicolon joining it to the previous sentence. [7]
- Section 9.2, 24.7.?.1 Overview, empty_view bullet — utf_trancoding_error is misspelled; the declared type is utf_transcoding_error (missing 's'). [3]
- Section 4, final paragraph — to_utfN_as_error is inconsistent with the name used everywhere else in the paper, which is to_utfN_or_error. [2]
- Section 9.3, Code unit adaptors — Stray trailing '/' in the exposition-only comment: reads '// exposition only/' instead of '// exposition only'. [11]
- Section 9.2, 24.7.?.1 Overview — Doubled word: 'that that char8_t corresponds to UTF-8'. Remove one 'that'. [4]
- Section 9.2, 24.7.?.6, success() specification for out_of_range — Doubled word: 'If the the current input subsequence'. Remove one 'the'. [5]
- Section 10.3, 'Why We Don't Cache begin()' — 'compexity' is misspelled; should be 'complexity'. [6]
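The findings on the 0xC0 to 0xC2 range and on the array-typed literal can both be checked with a few compile-time assertions. The sketch below is illustrative only: two_byte_min, two_byte_max, literal_type, and as_view are hypothetical names, and nothing here uses the proposed views. It assumes only the standard two-byte UTF-8 layout (five payload bits in the leading byte, six in the continuation byte). The code unit type of a u8 literal is char8_t in C++20 and char in C++17; the array extent and view size checked here are the same in both.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <type_traits>

// Smallest and largest values a two-byte sequence with leading byte `lead`
// could carry, ignoring whether that leading byte is actually permitted.
constexpr std::uint32_t two_byte_min(std::uint8_t lead) {
    return (lead & 0x1Fu) << 6;
}
constexpr std::uint32_t two_byte_max(std::uint8_t lead) {
    return ((lead & 0x1Fu) << 6) | 0x3Fu;
}

// 0xC0 and 0xC1 could only carry values that already fit in one byte,
// so any sequence they start would be an overlong encoding.
static_assert(two_byte_max(0xC0) == 0x3F, "0xC0 reaches at most U+003F");
static_assert(two_byte_max(0xC1) == 0x7F, "0xC1 reaches at most U+007F");

// 0xC2 carries exactly U+0080..U+00BF, so it must remain a valid leading byte.
static_assert(two_byte_min(0xC2) == 0x80, "0xC2 starts at U+0080");
static_assert(two_byte_max(0xC2) == 0xBF, "0xC2 ends at U+00BF");

// A u8 string literal is an array that includes the terminating null:
// four UTF-8 code units for U+1F642 plus one terminator.
using literal_type = std::remove_reference_t<decltype(u8"\U0001f642")>;
static_assert(std::is_array_v<literal_type>, "the literal is an array");
static_assert(std::extent_v<literal_type> == 5, "4 code units + null");

// The sv suffix yields a string view over the four code units only.
using namespace std::string_view_literals;
constexpr auto as_view = u8"\U0001f642"sv;
static_assert(!std::is_array_v<decltype(as_view)>, "the view is not an array");
static_assert(as_view.size() == 4, "terminator is not part of the view");
```

The last two groups show why the literal in the Section 1 example and the u8"..."sv form in Section 6.1 are different kinds of input: the first is an array of five code units, the second a view of four.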
References — Anthropic Citations API
[1]
"static_assert((u8"🙂" | views::to_utf32 | ranges::to()) == U"🙂");"
[2]
"The to_utfN_as_error views also use this scheme but produce unexpected values instead of replacement characters."
[3]
"empty_view>{}"
[4]
"which is that that char8_t corresponds to UTF-8"
[5]
"If the the current input subsequence is the code unit 0xF4"
[6]
"the underlying range could have unbounded compexity"
[7]
"It sets to_increment_ to the number of code units read while decoding c. encodes c into buf_"
[8]
"If the current input subsequence is a code unit between 0xC0 and 0xC2, or between 0xF5 and 0xFF, returns invalid_utf8_leading_byte."
[9]
"the to_utfN CPOs reject all inputs that are arrays of char, as do the as_charN_t casting CPOs."
[10]
"If from-type is char8_t:" (multiple occurrences: §24.7.?.6 success() and read()/read-reverse())
[11]
"struct implicit-cast-to { // exposition only/"
Summary: P2728R11 proposes a set of UTF transcoding range adaptors (to_utf8, to_utf16, to_utf32) and their error-handling variants for the C++ standard library, built on top of std::ranges. It provides views that lazily transcode between UTF-8, UTF-16, and UTF-32, with configurable error handling via expected-based interfaces.
Pipeline: Discovery (Anthropic Opus + Citations API) → Verification Gate (OpenRouter Opus) → Report Writer (OpenRouter Opus)
Provenance: All references are machine-verified character positions from the Anthropic Citations API — deterministic, exact substrings, not model-generated quotes.