From d20e4115e1628bf4a43f6ab2d050a3a195b4ceb4 Mon Sep 17 00:00:00 2001 From: Lasse Collin Date: Tue, 17 Dec 2024 15:01:29 +0200 Subject: [PATCH] Windows: Document the need for setlocale(LC_ALL, ".UTF8") Also warn about unpaired surrogates and (somewhat UTF-8-specific) MAX_PATH issue in FindFirstFileA(). Fixes: 46ee0061629fb075d61d83839e14dd193337af59 (cherry picked from commit 20dfca8171dad4c64785ac61d5b68972c444877b) --- .../w32_application.manifest.comments.txt | 28 ++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/src/common/w32_application.manifest.comments.txt b/src/common/w32_application.manifest.comments.txt index ad0835cc..686d0bd9 100644 --- a/src/common/w32_application.manifest.comments.txt +++ b/src/common/w32_application.manifest.comments.txt @@ -67,6 +67,31 @@ This is useful for programs that use main(): to the UTF-8 code page and aren't distinguishable from filenames that contain the actual replacement character U+FFFD. + FindFirstFileA() and FindFirstFileExA() also suffer from the above + issue where unpaired surrogates become U+FFFD. Another issue is + that filenames may require more bytes in UTF-8 than in a legacy + code page. In UTF-8, a very long filename may exceed MAX_PATH bytes + and thus these APIs cannot list such filenames anymore because + WIN32_FIND_DATAA has a member "CHAR cFileName[MAX_PATH]". + +In addition to the application manifest, setlocale(LC_ALL, ".UTF8") +needs to be called to make functions like mbrtowc() use UTF-8. Regular +setlocale(LC_ALL, "") uses a legacy code page even when an application +manifest specifies UTF-8. The effects of setlocale(LC_ALL, ".UTF8") +partially overlap with the application manifest though: + + - Application manifest affects argv[] in main() and file system APIs. + Multibyte functions like mbrtowc() aren't affected. + + - setlocale() affects file system APIs and multibyte functions like + mbrtowc(). argv[] isn't affected because argv[] has already been + constructed when an application has a chance to call setlocale(). + +To keep everything in sync, it's best to set an UTF-8 locale only when +the active code page is UTF-8 and thus argv[] is in UTF-8: + + setlocale(LC_ALL, GetACP() == CP_UTF8 ? ".UTF8" : ""); + If different programs use different code pages, compatibility issues are possible. For example, if one program produces a list of filenames and another program reads it, both programs should use @@ -76,7 +101,8 @@ char-based file system APIs. If building with a MinGW-w64 toolchain, it is strongly recommended to use UCRT instead of the old MSVCRT. For example, with the UTF-8 code page, MSVCRT doesn't convert non-ASCII characters correctly -when writing to console with printf(). With UCRT it works. +when writing to console with printf(). With UCRT it works. Also, +MSVCRT doesn't support ".UTF8" in setlocale(). Long path names