C++:Encoding
Encodings in C Strings.
Category
The topics you need to know are listed below.
- Source code file encoding.
- Unicode & UTF-8
- C++:literal
- std::locale
- Codepage (Encoding)
- Font
MSVC char encoding
- Stackoverflow - C++11 std::cout << “string literal in UTF-8” to Windows cmd console? (Visual Studio 2015)
- [Recommended] Stackoverflow - How do I print UTF-8 from c++ console application on Windows
Just for additional information:
'ANSI' refers to windows-125x, used for win32 applications while 'OEM' refers to the code page used by console/MS-DOS applications. Current active code-pages can be retrieved with functions GetOEMCP() and GetACP().
In order to output something correctly to the console, you should:
- ensure the current OEM code page supports the characters you want to output (if necessary, use SetConsoleOutputCP() to set it properly)
- convert the string from current ANSI code (win32) to the console OEM code page
Here are some utilities for doing so:
#include <windows.h>
// Convert a UTF-16 string (16-bit) to an OEM string (8-bit)
#define UNICODEtoOEM(str) WCHARtoCHAR(str, CP_OEMCP)
// Convert an OEM string (8-bit) to a UTF-16 string (16-bit)
#define OEMtoUNICODE(str) CHARtoWCHAR(str, CP_OEMCP)
// Convert an ANSI string (8-bit) to a UTF-16 string (16-bit)
#define ANSItoUNICODE(str) CHARtoWCHAR(str, CP_ACP)
// Convert a UTF-16 string (16-bit) to an ANSI string (8-bit)
#define UNICODEtoANSI(str) WCHARtoCHAR(str, CP_ACP)
/* Convert a single/multi-byte string to a UTF-16 string (16-bit).
   MultiByteToWideChar lets the caller specify the code page of the input string.
   Returns NULL on failure; the caller releases the result with LocalFree().
*/
LPWSTR CHARtoWCHAR(LPCSTR str, UINT codePage) {
    // A length of -1 converts the whole null-terminated string, terminator
    // included, so the returned size already counts the terminator.
    int size_needed = MultiByteToWideChar(codePage, 0, str, -1, NULL, 0);
    if (size_needed == 0) return NULL;
    LPWSTR wstr = (LPWSTR) LocalAlloc(LPTR, sizeof(WCHAR) * size_needed);
    if (wstr == NULL) return NULL;
    MultiByteToWideChar(codePage, 0, str, -1, wstr, size_needed);
    return wstr;
}
/* Convert a UTF-16 string (16-bit) to a single/multi-byte string.
   WideCharToMultiByte lets the caller specify the code page of the output string.
   Returns NULL on failure; the caller releases the result with LocalFree().
*/
LPSTR WCHARtoCHAR(LPCWSTR wstr, UINT codePage) {
    // A length of -1 converts the whole null-terminated string, terminator included.
    int size_needed = WideCharToMultiByte(codePage, 0, wstr, -1, NULL, 0, NULL, NULL);
    if (size_needed == 0) return NULL;
    LPSTR str = (LPSTR) LocalAlloc(LPTR, sizeof(CHAR) * size_needed);
    if (str == NULL) return NULL;
    WideCharToMultiByte(codePage, 0, wstr, -1, str, size_needed, NULL, NULL);
    return str;
}
Example
In a Windows environment that uses the MS949 code page, the following two forms print a Unicode string correctly (assuming console output has been set to UTF-8 with SetConsoleOutputCP).
- "\xea\xb0\x80"
- u8"가"
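Both forms work because they yield the same three bytes: the UTF-8 encoding of U+AC00 (가). The sketch below is a minimal, illustrative encoder that shows where those bytes come from; `encode_utf8` is a hypothetical helper name introduced here, not part of any library mentioned on this page.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper for illustration: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(0xAC00)` returns the bytes 0xEA 0xB0 0x80, which is exactly what the escaped literal spells out by hand.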
Below is the code that was actually tested:
// MS949 code page
#include <windows.h>  // SetConsoleOutputCP, CP_UTF8
#include <cstdio>     // printf
#include <iostream>
#include <locale>
using namespace std;
int main()
{
    // Note: from C++20 on, u8 literals have type const char8_t[] and can no
    // longer be passed to printf directly; this test assumes an earlier standard.
    SetConsoleOutputCP(CP_UTF8);
    //GetConsoleOutputCP();
    // TEST CODE:
    //std::cout.imbue(std::locale("ko_KR.UTF-8"));
    //std::cout.imbue(std::locale("kor"));
    //std::locale::global(std::locale("ko_KR.UTF-8"));
    //_setmode(_fileno(stdout), _O_U8TEXT);
    //std::cout << u8"\xea\xb0\x80" << "////";
    //std::cout << "\xea\xb0\x80" << "////";
    //std::cout << u8"가" << "////";
    //std::cout << "가" << "////";
    printf(u8"\xea\xb0\x80////"); // ???
    printf("\xea\xb0\x80////");   // UTF8 (0xEA 0xB0 0x80)
    printf(u8"가////");           // UTF8 (0xEA 0xB0 0x80)
    printf("가////");             // MS949 (0xB0 0xA1)
    return 0;
}
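The byte values noted in the comments above can be checked by dumping a literal's bytes rather than relying on what the console displays. The following is a minimal sketch; `to_hex` is a hypothetical helper introduced for illustration, and which bytes a plain "가" literal contains depends on the compiler's source and execution character set settings.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper for illustration: format every byte of s as two
// uppercase hex digits, separated by single spaces.
std::string to_hex(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char first so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned>(static_cast<unsigned char>(s[i]));
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `to_hex("\xea\xb0\x80")` returns `EA B0 80`. Going by the comments in the test code, a plain `"가"` compiled under MS949 would be expected to give `B0 A1` instead.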
Note that cout appears to have a problem related to locale or imbue: with MSVC, the output is not printed correctly.
See also
Favorite site
- [Recommended] Working with Encodings in C Strings
- [Recommended] Unicode in C and C++: What You Can Do About It Today
- [Recommended] Stackoverflow: std::wstring VS std::string
- [Recommended] UTF-8 and Unicode FAQ for Unix/Linux
- Using UTF-8 in C++
- KLDP - How to use UTF-8 on Windows?
- Printing Unicode characters to standard output (Win32 console application)