r/wg21 - P3904R1 - When paths go WTF: making formatting lossless

Posted by u/path_encoding_enjoyer · 6 hr. ago

Document: P3904R1

Author: Victor Zverovich

Date: 2026-01-28

Audience: SG16

So P2845 landed in C++26 and fixed most of the std::filesystem::path formatting mess — you can now print a path containing Cyrillic or CJK characters on Windows without getting garbage, which was genuinely broken before. Except there is one leftover case: Windows paths can contain lone surrogates. A lone surrogate is a UTF-16 code unit in the range 0xD800–0xDFFF that is not part of a valid surrogate pair. Windows treats path names as opaque sequences of WCHARs and does not care. The problem is that when P2845 formatted these, it replaced them with U+FFFD � — and since all lone surrogates collapsed to the same �, two distinct paths became identical after formatting. Silent data loss.

This paper proposes encoding ill-formed UTF-16 paths using WTF-8 ("Wobbly Transformation Format − 8-bit"), a UTF-8 superset that can losslessly represent arbitrary sequences of 16-bit code units including lone surrogates. The path L"\xD800" would format as "\xED\xA0\x80" instead of "�", preserving the surrogate value in the output. std::print to a terminal still shows � via the recommended-practice substitution rule, so terminal display does not change. Rust’s OsString and Node.js’s libuv already use WTF-8 for exactly this reason; the paper cites both as prior art. The implementation is already in {fmt}. The read-path API (reconstructing a path from those WTF-8 bytes) is left to a follow-on paper.
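For anyone who wants the byte-level mechanics, here is a minimal sketch of the encoding step. `to_wtf8` is a hypothetical helper written for this post, not the paper's wording and not {fmt}'s actual code. It treats the path as raw 16-bit code units, combines well-formed surrogate pairs exactly as UTF-8 would, and lets lone surrogates fall through to the ordinary three-byte form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper (not from the paper or {fmt}): encode arbitrary
// 16-bit code units as WTF-8.
std::string to_wtf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool is_lead = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_lead && i + 1 < in.size()) {
            const std::uint32_t next = in[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Well-formed pair: combine into one supplementary code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // Three bytes. Lone surrogates land here too, which is the only
            // place the output can differ from valid UTF-8.
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch, `assert(to_wtf8(u"\xD800") == "\xED\xA0\x80")` holds, and 0xD801 maps to a different byte string, so the two paths stay distinguishable.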

▲ 87 points (94% upvoted) · 36 comments

u/committee_gonna_committee_42 312 points 5 hr. ago

committee gonna committee, but for Unicode specifically

u/unicode_plumber_throwaway 148 points 4 hr. ago

Quick note that nobody in the paper seems to mention: "\xED\xA0\x80" is not valid UTF-8.

RFC 3629 §3 explicitly forbids encoding surrogate code points in UTF-8. The byte sequence ED A0 80 passes the three-byte structural check (leading byte followed by two continuation bytes) but fails the validity check because it encodes U+D800, a surrogate. ICU, libxml2, Chromium’s base::IsStringUTF8, and every conforming UTF-8 decoder will either reject it or silently replace it with U+FFFD — which destroys the round-trip you just created.
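A strict check in the spirit of RFC 3629 can be sketched as below. `is_valid_utf8` is a hypothetical name, and production decoders such as ICU are considerably more involved; the point is only to show that ED A0 80 clears the structural test and is then rejected by the surrogate range test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical strict validator: structural check first, then the
// value checks (overlong forms, surrogates, values above U+10FFFF).
bool is_valid_utf8(std::string_view s) {
    // Smallest code point that legitimately needs 1..4 bytes.
    static constexpr std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                       // not a valid lead byte
        if (s.size() - i < len) return false;    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp > 0x10FFFF) return false;                  // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += len;
    }
    return true;
}
```

Under this sketch `is_valid_utf8("\xED\xA0\x80")` is false, while the neighbouring non-surrogate `"\xED\x9F\xBF"` (U+D7FF) passes.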

So after this paper, std::format("{}", path(L"\xD800")) gives you a std::string whose bytes are illegal in UTF-8. You will not know this from the type. Anything downstream that receives this string and assumes std::string contains UTF-8 (which is basically everything in modern C++ that processes text) will either corrupt the data or blow up.

Rust solved this exact problem by making OsString a distinct type that is explicitly not String. The WTF-8 bytes live inside OsString and you cannot accidentally pass them to a function expecting valid UTF-8 without an explicit, documented conversion.

C++ has no OsString. After this paper, std::string can contain valid UTF-8, WTF-8, Latin-1, or random bytes — same as before, but now with a new source of non-UTF-8 bytes created by a formatting function that looks like it produces text.

The goal is right. The tradeoff may be worth it. But the missing discussion of what happens when WTF-8 bytes escape into a UTF-8-expecting context is a real gap in the paper.

u/path_encoding_enjoyer 41 points 3 hr. ago

This is the paragraph I was hoping would show up in the thread. The paper acknowledges the read-path is deferred but does not address the contamination angle at all. Agree this should come up in SG16 review.

u/daily_template_wizard_cpp 22 points 3 hr. ago

The std::string as a bag of bytes problem is as old as the language. This paper does not make it worse, it just adds one more source of non-UTF-8 content.

u/unicode_plumber_throwaway 67 points 3 hr. ago

There is a difference between "std::string does not enforce UTF-8" (always been true) and "std::format can now produce WTF-8 bytes from a path and the API surface gives you no indication of this". The second is new behavior introduced by a formatting function that looks like it produces text. That is worth naming explicitly.

u/rust_pilgrim_9000 89 points 5 hr. ago

Rust solved this

u/not_a_rust_hater_exactly 156 points 4 hr. ago

Yes, and the paper literally cites Rust’s OsString as prior art in the proposal section. WTF-8 is the Rust solution. The question is whether you want it living in std::string (C++ approach) or in a dedicated type with enforced boundaries (OsString approach). The paper picks std::string.

u/rust_pilgrim_9000 44 points 4 hr. ago

ok fair

u/windows_fs_veteran_2019 93 points 4 hr. ago

Hit this exact bug two years ago building a backup tool. Two paths with lone surrogates, \xD800 and \xD801, both formatted to "�". Files from one path silently clobbered files from the other because the formatted names were identical in every log line and every map key. Took a week to track down.
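The collision is easy to reproduce in isolation. A minimal sketch, with hypothetical helper names `replace_lone_surrogates` and `distinct_keys` (this is not the backup tool's code): every lone surrogate is replaced with U+FFFD, the result is used as a set key, and two distinct names end up as one entry.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical lossy step: every lone surrogate becomes U+FFFD,
// well-formed surrogate pairs are left alone.
std::u16string replace_lone_surrogates(std::u16string s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool lead = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool trail = s[i] >= 0xDC00 && s[i] <= 0xDFFF;
        const bool next_is_trail =
            i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (lead && next_is_trail) {
            ++i;  // well-formed pair: keep both code units
        } else if (lead || trail) {
            s[i] = u'\uFFFD';  // lossy substitution
        }
    }
    return s;
}

// Counts how many distinct keys survive, with or without the lossy step.
std::size_t distinct_keys(const std::vector<std::u16string>& names, bool lossy) {
    std::set<std::u16string> keys;
    for (const auto& name : names) {
        keys.insert(lossy ? replace_lone_surrogates(name) : name);
    }
    return keys.size();
}
```

For the two names \xD800 and \xD801, `distinct_keys(names, true)` is 1 and `distinct_keys(names, false)` is 2, which is the clobbering described above in miniature.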

WTF-8 is the right call. The lossless guarantee is more valuable than worrying about what downstream parsers do with the bytes. Document your interfaces and validate at system boundaries.

u/strict_utf8_only 71 points 3 hr. ago

> Document your interfaces and validate at system boundaries.

That principle only holds when the type system enforces the distinction. OsString vs String in Rust makes the documentation machine-checked. A std::string containing WTF-8 returned from std::format has no annotation whatsoever. "Validate at boundaries" requires every downstream author to know that std::format("{}", some_path) can produce non-UTF-8. In practice, they will not know.

u/windows_fs_veteran_2019 48 points 3 hr. ago

Fair point, and I do not have a great counterargument. The C++ answer is "std::string was never guaranteed to be UTF-8 either" which is technically true and practically unsatisfying. I still think lossless beats lossy even with the type-system gap.

u/strict_utf8_only 82 points 2 hr. ago

Agreed that lossless is better. My narrower concern is that SG16 should require implementations to document that path formatting can produce WTF-8, and that the read-path companion paper should advance together with this one. Shipping only the write side of a round-trip is shipping half a footgun.

u/UB_enjoyer_69 274 points 5 hr. ago

the WTF-8 specification is hosted at wtf-8.codeberg.page and opens with the phrase "Wobbly Transformation Format". 10/10 naming. the standards committee is incapable of this energy.

u/compiles_first_try_420 103 points 4 hr. ago

finally a backronym that accurately describes the situation

u/former_boost_contributor_cpp 57 points 3 hr. ago

Section 5 is a single sentence: "The proposal has been implemented in {fmt} where the default std::filesystem::path representation is now lossless."

For R1 that’s technically sufficient (implementation experience is implementation experience), but it would be nice to see at least one real-world deployment beyond "I shipped it in my own library." Not a blocker for SG16 but someone will ask in the room.

u/path_encoding_enjoyer 19 points 2 hr. ago

> someone will ask in the room

Zverovich has shipped things before. I would not be surprised if the R2 adds a sentence about real {fmt} users who have hit this on Windows. That tends to be enough for SG16.

u/async_io_burned_twice 67 points 4 hr. ago

can we please finish the networking TS before I retire. I will accept WTF-8 path encoding as a down payment.

u/senior_json_victim_2019 38 points 3 hr. ago

what happens when I put a WTF-8 formatted path into nlohmann/json

u/not_a_real_cpp_dev_throwaway 71 points 2 hr. ago

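nlohmann::json validates UTF-8 by default when serializing strings. ensure_ascii only controls whether non-ASCII characters are written as \uXXXX escapes; it does not switch validation off. What matters is the error_handler argument of dump(): with the default strict handler you get an exception at serialize time, and with the replace handler the invalid bytes become U+FFFD, which quietly throws the surrogate away again. Either way: not your problem until it’s someone else’s production incident.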

u/senior_json_victim_2019 53 points 1 hr. ago

Edit: tested it. nlohmann throws type_error.316 when you call dump() on a JSON value holding a WTF-8 string (the assignment itself goes through). At least it fails loudly.

u/library_design_skeptic 104 points 2 hr. ago

> The API for the read path of the round trip will be proposed by a separate paper.

This sentence is load-bearing and slightly alarming. The paper ships only one half of the round-trip: you can format a path to WTF-8, but there is no standard API to reconstruct a path from those WTF-8 bytes. So the "lossless" guarantee means information is preserved in the string, but you cannot use that string to recover the original path without the follow-on paper.

If the follow-on slips a revision cycle you have a path that prints to something humans cannot read and cannot be parsed back to a path. That is an odd intermediate state to standardize. I would want to see both papers advance together.
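To make the missing half concrete, here is roughly what the read side has to do. This is a sketch only: `from_wtf8` is a made-up name, no such standard API exists today, and the follow-on paper may well look nothing like it. It decodes WTF-8 bytes back into 16-bit code units, accepting the three-byte surrogate sequences that a strict UTF-8 decoder would refuse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical read-side helper. Returns std::nullopt on structurally
// broken input. Simplification: overlong forms are not rejected.
std::optional<std::u16string> from_wtf8(std::string_view s) {
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return std::nullopt;                     // not a valid lead byte
        if (s.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp > 0x10FFFF) return std::nullopt;
        if (cp >= 0x10000) {
            // Supplementary code point: split into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            // BMP value, including lone surrogates (the WTF-8 extension).
            out += static_cast<char16_t>(cp);
        }
        i += len;
    }
    return out;
}
```

With this sketch `from_wtf8("\xED\xA0\x80")` yields the single code unit 0xD800 again. Until something equivalent is standardized, every user who wants the round trip has to write and validate a decoder like this themselves.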

u/unicode_plumber_throwaway 88 points 1 hr. ago

Exactly this. The Rust analogy is relevant here too: OsString pairs each write API with a read API, encode_wide() with from_wide() for UTF-16 and as_encoded_bytes() with from_encoded_bytes_unchecked() for the byte form. They were designed and shipped as a unit. The write side without the read side is a lossy display mechanism with extra steps and false promises of round-trippability.

u/definitely_not_a_committee_member 14 points 2 hr. ago

wait so is WTF-8 valid UTF-8 or not? the paper calls it "a superset" which made me think yes but the comments above say no

u/yet_another_encoding_nerd 92 points 1 hr. ago

WTF-8 is a superset of UTF-8 in the sense that every valid UTF-8 byte sequence is also valid WTF-8. But WTF-8 includes additional byte sequences (the ones encoding lone surrogates) that are explicitly forbidden by RFC 3629 and therefore not valid UTF-8. Valid UTF-8 ⊂ WTF-8, but WTF-8 ⊄ valid UTF-8. Think of it like how IEEE 754 allows NaN as a bit pattern: the float container accepts it, the real-number interpretation rejects it.

u/python_also_suffered 46 points 3 hr. ago

Python hit this in 3.0 and landed PEP 383 (surrogateescape error handler) in 3.1. Completely different mechanism: Python keeps lone surrogates as surrogate code points inside its str objects (so a str is not guaranteed to be valid Unicode text either) rather than as WTF-8 bytes, and surrogateescape uses the range U+DC80 to U+DCFF to smuggle undecodable bytes through. Both approaches have the "looks like text, isn’t" problem. C++ is in good company.

u/not_a_rust_hater_exactly 29 points 2 hr. ago

The paper cites PEP 383 in the references. The Python approach only works because Python’s str can hold surrogate code points internally. C++ std::u32string could theoretically do the same, but nobody uses u32string for filesystem paths.

u/utf8_everywhere_or_gtfo -12 points 3 hr. ago

skill issue. just use UTF-8 paths on Windows.

u/windows_fs_veteran_2019 83 points 2 hr. ago

The file system does not care about your encoding preferences. External tools, network shares, SMB mounts, and legacy software create paths with lone surrogates. You do not control the namespace.

u/async_skeptic_embedded 77 points 1 hr. ago

Subtle thing the paper handles but does not emphasize: after this change, std::format("{}", path(L"\xD800")) and std::print("{}", path(L"\xD800")) behave differently.

std::format gives you WTF-8 bytes (\xED\xA0\x80) in a std::string. std::print to a terminal follows the recommended practice in [ostream.formatted.print] of substituting replacement characters on transcoding, so you see � on screen.

Same path, same format string, same function family — different bytes depending on which one you called. The distinction is intentional and correct (lossless storage vs. human-readable display), but it is going to trip people up when they debug why their log file does not match their terminal output.

u/former_boost_contributor_cpp 54 points 47 minutes ago

This is the thing that will generate the most confused bug reports in two years. "Why does my log file show garbage bytes but my terminal shows a box character", except it won’t even look like bytes in the log; it will look like mojibake or a broken glyph depending on the viewer. Fun times ahead.

u/[deleted] -8 points 2 hr. ago

[deleted]

u/build_system_victim_irl 4 points 2 hr. ago

what did they say

u/not_a_real_cpp_dev_throwaway 61 points 2 hr. ago

something about how the Windows NT kernel should not be allowed to create paths with lone surrogates. policy suggestions for the NT filesystem team, on a C++ standards thread.

u/laughs_in_ascii 134 points 4 hr. ago

laughs in ASCII

u/committee_gonna_committee_42 87 points 3 hr. ago

you will regret this when someone names a file 🦄

u/library_design_skeptic 43 points 23 minutes ago

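Edit: went and read the full WTF-8 spec at wtf-8.codeberg.page before posting this. Worth reading before the SG16 review. One property worth knowing: valid UTF-8 is a subset, and concatenating a WTF-8 string with valid UTF-8 always yields well-formed WTF-8. So you can concatenate a path component (possibly WTF-8) with a directory separator or other valid UTF-8 string and get valid WTF-8. The one case the spec flags is joining two WTF-8 strings where the first ends in a lead surrogate and the second starts with a trail surrogate: that pair has to be merged into a single four-byte sequence. Still better behaved than some alternatives.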
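The WTF-8 spec's section on concatenation has one wrinkle that a plain byte append misses: if the left string ends in a lead surrogate and the right one starts with a trail surrogate, the two three-byte sequences have to be merged into one four-byte sequence to stay well-formed. A minimal sketch, assuming both inputs are already well-formed WTF-8 (`wtf8_concat` is a hypothetical name, not something the paper proposes):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper. Assumes both inputs are well-formed WTF-8.
std::string wtf8_concat(std::string a, std::string_view b) {
    const auto u = [](char c) { return static_cast<unsigned char>(c); };
    const std::size_t n = a.size();
    // In WTF-8 a lead surrogate is ED A0..AF xx, a trail is ED B0..BF xx.
    const bool a_ends_with_lead =
        n >= 3 && u(a[n - 3]) == 0xED && (u(a[n - 2]) & 0xF0) == 0xA0;
    const bool b_starts_with_trail =
        b.size() >= 3 && u(b[0]) == 0xED && (u(b[1]) & 0xF0) == 0xB0;
    if (!(a_ends_with_lead && b_starts_with_trail)) {
        a.append(b);  // the common case, e.g. component + "/" + component
        return a;
    }
    // Recover both surrogate values, then the supplementary code point.
    const std::uint32_t lead =
        0xD000u | ((u(a[n - 2]) & 0x3Fu) << 6) | (u(a[n - 1]) & 0x3Fu);
    const std::uint32_t trail =
        0xD000u | ((u(b[1]) & 0x3Fu) << 6) | (u(b[2]) & 0x3Fu);
    const std::uint32_t cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
    // Replace the two 3-byte sequences with one 4-byte sequence.
    a.resize(n - 3);
    a += static_cast<char>(0xF0 | (cp >> 18));
    a += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    a += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    a += static_cast<char>(0x80 | (cp & 0x3F));
    a.append(b.substr(3));
    return a;
}
```

Joining a path component with a separator takes the plain-append branch, so the everyday case costs nothing extra; only the lead-plus-trail seam needs the merge.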

Still think the read-path paper should advance in parallel. But the encoding itself is cleanly designed.
