Document: P3904R1
Author: Victor Zverovich
Date: 2026-01-28
Audience: SG16
So P2845 landed in C++26 and fixed most of the std::filesystem::path formatting mess — you can now print a path containing Cyrillic or CJK characters on Windows without getting garbage, which was genuinely broken before. Except there is one leftover case: Windows paths can contain lone surrogates. A lone surrogate is a UTF-16 code unit in the range 0xD800–0xDFFF that is not part of a valid surrogate pair. Windows treats path names as opaque sequences of WCHARs and does not care. The problem is that when P2845 formatted these, it replaced them with U+FFFD � — and since all lone surrogates collapsed to the same �, two distinct paths became identical after formatting. Silent data loss.
This paper proposes encoding ill-formed UTF-16 paths using WTF-8 ("Wobbly Transformation Format − 8-bit"), a UTF-8 superset that can losslessly represent arbitrary sequences of 16-bit code units, including lone surrogates. The path L"\xD800" would format as "\xED\xA0\x80" instead of "�", preserving the surrogate value in the output. std::print to a terminal still shows � via the recommended-practice substitution rule, so terminal display does not change. Rust's OsString and Node.js's libuv already use WTF-8 for exactly this reason; the paper cites both as prior art. The implementation is already in {fmt}. The read-path API, reconstructing a path from those WTF-8 bytes, is left to a follow-on paper.
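Mechanically, WTF-8 is the usual UTF-8 bit-packing applied without the surrogate exclusion. A minimal sketch of the write side, assuming a hypothetical helper name (wtf8_encode_unit is not the paper's API) and covering only a single unpaired 16-bit unit; real WTF-8 first combines a well-matched surrogate pair into one supplementary code point and emits four bytes:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one 16-bit code unit with the
// generalized UTF-8 scheme. WTF-8 applies this even to lone
// surrogates (0xD800-0xDFFF), which strict UTF-8 forbids.
std::string wtf8_encode_unit(std::uint16_t cu) {
    std::string out;
    if (cu < 0x80) {
        out += static_cast<char>(cu);
    } else if (cu < 0x800) {
        out += static_cast<char>(0xC0 | (cu >> 6));
        out += static_cast<char>(0x80 | (cu & 0x3F));
    } else {
        // Three-byte form; for 0xD800 this yields ED A0 80.
        out += static_cast<char>(0xE0 | (cu >> 12));
        out += static_cast<char>(0x80 | ((cu >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cu & 0x3F));
    }
    return out;
}
```

For L"\xD800" this produces exactly the ED A0 80 bytes the paper describes.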
committee gonna committee, but for Unicode specifically
Quick note that nobody in the paper seems to mention:
"\xED\xA0\x80" is not valid UTF-8. RFC 3629 §3 explicitly forbids encoding surrogate code points in UTF-8. The byte sequence ED A0 80 passes the three-byte structural check (leading byte followed by two continuation bytes) but fails the validity check because it encodes U+D800, a surrogate. ICU, libxml2, Chromium's base::IsStringUTF8, and every conforming UTF-8 decoder will either reject it or silently replace it with U+FFFD, which destroys the round-trip you just created.
So after this paper, std::format("{}", path(L"\xD800")) gives you a std::string whose bytes are illegal in UTF-8. You will not know this from the type. Anything downstream that receives this string and assumes std::string contains UTF-8 (which is basically everything in modern C++ that processes text) will either corrupt the data or blow up.
Rust solved this exact problem by making OsString a distinct type that is explicitly not String. The WTF-8 bytes live inside OsString and you cannot accidentally pass them to a function expecting valid UTF-8 without an explicit, documented conversion.
C++ has no OsString. After this paper, std::string can contain valid UTF-8, WTF-8, Latin-1, or random bytes: same as before, but now with a new source of non-UTF-8 bytes created by a formatting function that looks like it produces text.
The goal is right. The tradeoff may be worth it. But the missing discussion of what happens when WTF-8 bytes escape into a UTF-8-expecting context is a real gap in the paper.
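The check a strict decoder applies here can be sketched in a few lines (is_surrogate_3byte is a made-up name for illustration, not any library's API): decode the structurally valid 3-byte form, then reject the surrogate range.

```cpp
#include <cstdint>
#include <string_view>

// Does a structurally well-formed 3-byte sequence encode a surrogate?
// (A full RFC 3629 validator also checks overlongs and 4-byte forms;
// this sketch covers only the ED A0 80 case discussed above.)
bool is_surrogate_3byte(std::string_view s) {
    if (s.size() != 3) return false;
    auto b0 = static_cast<unsigned char>(s[0]);
    auto b1 = static_cast<unsigned char>(s[1]);
    auto b2 = static_cast<unsigned char>(s[2]);
    if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80)
        return false;                        // not a 3-byte form at all
    std::uint32_t cp = (std::uint32_t(b0 & 0x0F) << 12)
                     | (std::uint32_t(b1 & 0x3F) << 6)
                     |  std::uint32_t(b2 & 0x3F);
    return cp >= 0xD800 && cp <= 0xDFFF;     // RFC 3629 forbids this range
}
```

ED A0 80 decodes to U+D800 and trips the final test; a valid 3-byte character such as 中 (E4 B8 AD, U+4E2D) passes through untouched.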
This is the paragraph I was hoping would show up in the thread. The paper acknowledges the read-path is deferred but does not address the contamination angle at all. Agree this should come up in SG16 review.
The std::string as a bag of bytes problem is as old as the language. This paper does not make it worse; it just adds one more source of non-UTF-8 content.
There is a difference between "std::string does not enforce UTF-8" (always been true) and "std::format can now produce WTF-8 bytes from a path and the API surface gives you no indication of this". The second is new behavior introduced by a formatting function that looks like it produces text. That is worth naming explicitly.
Rust solved this
Yes, and the paper literally cites Rust's OsString as prior art in the proposal section. WTF-8 is the Rust solution. The question is whether you want it living in std::string (C++ approach) or in a dedicated type with enforced boundaries (OsString approach). The paper picks std::string.
ok fair
Hit this exact bug two years ago building a backup tool. Two paths with lone surrogates, \xD800 and \xD801, both formatted to "�". Files from one path silently clobbered files from the other because the formatted names were identical in every log line and every map key. Took a week to track down.
WTF-8 is the right call. The lossless guarantee is more valuable than worrying about what downstream parsers do with the bytes. Document your interfaces and validate at system boundaries.
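The clobbering is reproducible with any string-keyed container. This sketch hard-codes the two formatter outputs (EF BF BD, i.e. U+FFFD, for the old lossy behavior; WTF-8 triples for the proposed one) rather than calling std::format, since the helper name and byte values here are illustrative:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Insert two formatted path names into an index and report how many
// distinct keys survive. If the names collide, the second insert
// silently overwrites the first -- the backup-tool bug in miniature.
std::size_t distinct_keys(const std::string& a, const std::string& b) {
    std::map<std::string, int> index;
    index[a] = 1;
    index[b] = 2;   // clobbers entry 1 when a == b
    return index.size();
}
```

Pre-P3904, both paths format to U+FFFD and distinct_keys returns 1; with WTF-8 the keys stay distinct and it returns 2.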
That principle only holds when the type system enforces the distinction. OsString vs String in Rust makes the documentation machine-checked. A std::string containing WTF-8 returned from std::format has no annotation whatsoever. "Validate at boundaries" requires every downstream author to know that std::format("{}", some_path) can produce non-UTF-8. In practice, they will not know.
Fair point, and I do not have a great counterargument. The C++ answer is "std::string was never guaranteed to be UTF-8 either", which is technically true and practically unsatisfying. I still think lossless beats lossy even with the type-system gap.
Agreed that lossless is better. My narrower concern is that SG16 should require implementations to document that path formatting can produce WTF-8, and that the read-path companion paper should advance together with this one. Shipping only the write side of a round-trip is shipping half a footgun.
the WTF-8 specification is hosted at wtf-8.codeberg.page and opens with the phrase "Wobbly Transformation Format". 10/10 naming. the standards committee is incapable of this energy.
finally a backronym that accurately describes the situation
Section 5 is a single sentence: "The proposal has been implemented in {fmt} where the default std::filesystem::path representation is now lossless."
For R1 that's technically sufficient (implementation experience is implementation experience) but it would be nice to see at least one real-world deployment beyond "I shipped it in my own library." Not a blocker for SG16 but someone will ask in the room.
Zverovich has shipped things before. I would not be surprised if the R2 adds a sentence about real {fmt} users who have hit this on Windows. That tends to be enough for SG16.
can we please finish the networking TS before I retire. I will accept WTF-8 path encoding as a down payment.
what happens when I put a WTF-8 formatted path into nlohmann/json
nlohmann::json validates UTF-8 when serializing strings. With the error handler set to error_handler_t::replace or error_handler_t::ignore the offending bytes are substituted with U+FFFD or dropped; with the default strict handler you get an exception at serialize time. Either way: not your problem until it's someone else's production incident.
Edit: tested it. nlohmann throws type_error.316 when you call dump() on a value holding a WTF-8 string. At least it fails loudly.
This sentence is load-bearing and slightly alarming. The paper ships only one half of the round-trip: you can format a path to WTF-8, but there is no standard API to reconstruct a path from those WTF-8 bytes. So the "lossless" guarantee means information is preserved in the string, but you cannot use that string to recover the original path without the follow-on paper.
If the follow-on slips a revision cycle you have a path that prints to something humans cannot read and cannot be parsed back to a path. That is an odd intermediate state to standardize. I would want to see both papers advance together.
Exactly this. The Rust analogy is relevant here too: OsString has both encode_wide() (write) and from_encoded_bytes_unchecked() (read). They were designed and shipped as a unit. The write side without the read side is a lossy display mechanism with extra steps and false promises of round-trippability.
wait so is WTF-8 valid UTF-8 or not? the paper calls it "a superset" which made me think yes but the comments above say no
WTF-8 is a superset of UTF-8 in the sense that every valid UTF-8 byte sequence is also valid WTF-8. But WTF-8 includes additional byte sequences (the ones encoding lone surrogates) that are explicitly forbidden by RFC 3629 and therefore not valid UTF-8. Valid UTF-8 ⊂ WTF-8, but WTF-8 ⊄ valid UTF-8. Think of it like how IEEE 754 allows NaN as a bit pattern: the float container accepts it, the real-number interpretation rejects it.
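The subset relation can be made concrete with one decode loop and two policies. A hedged sketch (is_well_formed is an illustrative helper, covering only the checks named in the comments; the WTF-8 spec has further rules about adjacent surrogate pairs that this ignores):

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// One decoder, two policies: strict UTF-8 rejects decoded surrogate
// code points, WTF-8 accepts them. Structure, overlong forms, and the
// U+10FFFF ceiling are checked identically, so every string the strict
// mode accepts, the wobbly mode accepts too.
bool is_well_formed(std::string_view s, bool allow_surrogates) {
    for (std::size_t i = 0; i < s.size();) {
        auto b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        std::uint32_t cp, min;
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                     // stray continuation or FE/FF
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t j = 1; j < len; ++j) {
            auto c = static_cast<unsigned char>(s[i + j]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return false;  // overlong / too big
        if (!allow_surrogates && cp >= 0xD800 && cp <= 0xDFFF) return false;
        i += len;
    }
    return true;
}
```

Here is_well_formed(s, false) is the UTF-8 check and is_well_formed(s, true) the WTF-8 one; "\xED\xA0\x80" passes only the latter, and anything passing the former passes both.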
Python hit this in 3.0 and landed PEP 383 (surrogateescape error handler). Completely different mechanism: Python stores lone surrogates as surrogate code points in its internal str objects (technically not valid Unicode text either) rather than WTF-8 bytes. Both approaches have the \"looks like text, isn’t\" problem. C++ is in good company.
The paper cites PEP 383 in the references. The Python approach only works because Python's str can hold surrogate code points internally. C++ std::u32string could theoretically do the same, but nobody uses u32string for filesystem paths.
skill issue. just use UTF-8 paths on Windows.
The file system does not care about your encoding preferences. External tools, network shares, SMB mounts, and legacy software create paths with lone surrogates. You do not control the namespace.
Subtle thing the paper handles but does not emphasize: after this change, std::format("{}", path(L"\xD800")) and std::print("{}", path(L"\xD800")) behave differently. std::format gives you WTF-8 bytes (\xED\xA0\x80) in a std::string. std::print to a terminal follows the recommended practice in [ostream.formatted.print] of substituting replacement characters on transcoding, so you see � on screen.
Same path, same format string, same function family, different bytes depending on which one you called. The distinction is intentional and correct (lossless storage vs. human-readable display), but it is going to trip people up when they debug why their log file does not match their terminal output.
This is the thing that will generate the most confused bug reports in two years. \"Why does my log file show garbage bytes but my terminal shows a box character\" — except it won’t even look like bytes in the log, it will look like mojibake or a broken glyph depending on the viewer. Fun times ahead.
[deleted]
what did they say
something about how the Windows NT kernel should not be allowed to create paths with lone surrogates. policy suggestions for the NT filesystem team, on a C++ standards thread.
laughs in ASCII
you will regret this when someone names a file 🦄
Edit: went and read the full WTF-8 spec at wtf-8.codeberg.page before posting this. Worth reading before the SG16 review. One property the spec guarantees: concatenation of WTF-8 strings preserves WTF-8 validity, and valid UTF-8 is a subset. So you can concatenate a path component (possibly WTF-8) with a directory separator or other valid UTF-8 string and get valid WTF-8. That is better behaved than some alternatives.
Still think the read-path paper should advance in parallel. But the encoding itself is cleanly designed.