C++:Encoding

Encodings in C Strings.

Category

The topics you need to know are listed below.

MSVC char encoding

Just for additional information:

'ANSI' refers to the Windows code page used by Win32 applications (windows-125x on Western systems, 949 on a Korean system), while 'OEM' refers to the code page used by console/MS-DOS applications. The currently active code pages can be retrieved with the functions GetACP() (ANSI) and GetOEMCP() (OEM).
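As a minimal sketch of that query, the snippet below prints both active code pages. The names code_page_label and print_active_code_pages are hypothetical helpers invented for this illustration, and the label table only covers a handful of well-known identifiers. The Windows calls are guarded so the file still compiles elsewhere, where the print function simply does nothing.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: human-readable label for a few well-known code pages.
std::string code_page_label(unsigned cp) {
    switch (cp) {
        case 437:   return "OEM United States";
        case 949:   return "Korean (MS949)";
        case 1252:  return "Western European (windows-1252)";
        case 65001: return "UTF-8";
        default:    return "other";
    }
}

// Print the active ANSI and OEM code pages. Does nothing outside Windows,
// because these code pages are a Windows concept.
void print_active_code_pages() {
#ifdef _WIN32
    unsigned acp = GetACP();    // code page used by Win32 "ANSI" APIs
    unsigned oem = GetOEMCP();  // code page used by console/MS-DOS applications
    std::printf("ANSI: %u (%s)\n", acp, code_page_label(acp).c_str());
    std::printf("OEM:  %u (%s)\n", oem, code_page_label(oem).c_str());
#endif
}
```

Calling print_active_code_pages() from main() on a Korean Windows installation would be expected to report 949 for both values, but the actual numbers depend on the system settings.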

In order to output something correctly to the console, you should:

  • ensure the current OEM code page supports the characters you want to output (if necessary, use SetConsoleOutputCP() to set it properly)
  • convert the string from the current ANSI code page (Win32) to the console OEM code page

Here are some utilities for doing so:

#include <windows.h>

// Forward declarations so the macros below can be used anywhere in the file.
// The caller owns each returned buffer and must release it with LocalFree().
LPWSTR CHARtoWCHAR(LPCSTR str, UINT codePage);
LPSTR  WCHARtoCHAR(LPCWSTR wstr, UINT codePage);

// Convert a UTF-16 string (16-bit) to an OEM string (8-bit)
#define UNICODEtoOEM(str)   WCHARtoCHAR(str, CP_OEMCP)

// Convert an OEM string (8-bit) to a UTF-16 string (16-bit)
#define OEMtoUNICODE(str)   CHARtoWCHAR(str, CP_OEMCP)

// Convert an ANSI string (8-bit) to a UTF-16 string (16-bit)
#define ANSItoUNICODE(str)  CHARtoWCHAR(str, CP_ACP)

// Convert a UTF-16 string (16-bit) to an ANSI string (8-bit)
#define UNICODEtoANSI(str)  WCHARtoCHAR(str, CP_ACP)


/* Convert a single/multi-byte string to a UTF-16 string (16-bit).
 MultiByteToWideChar lets us specify the code page of the input string.
 Returns NULL on failure.
*/
LPWSTR CHARtoWCHAR(LPCSTR str, UINT codePage) {
    // A length of -1 converts up to and including the terminating NUL.
    int size_needed = MultiByteToWideChar(codePage, 0, str, -1, NULL, 0);
    if (size_needed <= 0) return NULL;
    LPWSTR wstr = (LPWSTR) LocalAlloc(LPTR, sizeof(WCHAR) * size_needed);
    if (wstr == NULL) return NULL;
    MultiByteToWideChar(codePage, 0, str, -1, wstr, size_needed);
    return wstr;
}


/* Convert a UTF-16 string (16-bit) to a single/multi-byte string.
 WideCharToMultiByte lets us specify the code page of the output string.
 Returns NULL on failure.
*/
LPSTR WCHARtoCHAR(LPCWSTR wstr, UINT codePage) {
    // A length of -1 converts up to and including the terminating NUL.
    int size_needed = WideCharToMultiByte(codePage, 0, wstr, -1, NULL, 0, NULL, NULL);
    if (size_needed <= 0) return NULL;
    LPSTR str = (LPSTR) LocalAlloc(LPTR, sizeof(CHAR) * size_needed);
    if (str == NULL) return NULL;
    WideCharToMultiByte(codePage, 0, wstr, -1, str, size_needed, NULL, NULL);
    return str;
}
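The two steps listed earlier (ANSI to UTF-16, then UTF-16 to OEM) can also be combined into a single function. The following is only a sketch under a few assumptions: ansi_to_oem is a hypothetical name, it uses std::string and std::wstring for storage instead of the LocalAlloc buffers used by the utilities above, and outside Windows it returns its input unchanged so that the file still compiles.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: re-encode a string from the active ANSI code page to
// the active OEM code page by going through UTF-16.
// Returns an empty string if a conversion fails.
std::string ansi_to_oem(const std::string& ansi) {
#ifdef _WIN32
    if (ansi.empty()) return std::string();
    const int in_len = static_cast<int>(ansi.size());

    // Step 1: ANSI -> UTF-16. The first call only measures the output size.
    int wide_len = MultiByteToWideChar(CP_ACP, 0, ansi.data(), in_len, nullptr, 0);
    if (wide_len <= 0) return std::string();
    std::wstring wide(static_cast<std::size_t>(wide_len), L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.data(), in_len, &wide[0], wide_len);

    // Step 2: UTF-16 -> OEM, again measuring first and converting second.
    int oem_len = WideCharToMultiByte(CP_OEMCP, 0, wide.data(), wide_len,
                                      nullptr, 0, nullptr, nullptr);
    if (oem_len <= 0) return std::string();
    std::string oem(static_cast<std::size_t>(oem_len), '\0');
    WideCharToMultiByte(CP_OEMCP, 0, wide.data(), wide_len,
                        &oem[0], oem_len, nullptr, nullptr);
    return oem;
#else
    // Assumption for this sketch: no ANSI/OEM distinction outside Windows.
    return ansi;
#endif
}
```

Because the lengths are passed explicitly, the Windows calls do not write a terminating NUL; the std::string and std::wstring objects track the length themselves, and nothing needs to be freed by the caller.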

Example

On Windows with the MS949 code page, when printing a Unicode string (assuming the console output has been set to UTF-8 with SetConsoleOutputCP), the following two forms print correctly:

  • "\xea\xb0\x80"
  • u8"가"
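Both forms are meant to carry the same three bytes, the UTF-8 encoding of U+AC00 ('가'). The sketch below derives those bytes from the code point so the hex escapes can be checked by hand. encode_utf8 and to_hex are hypothetical helpers written for this illustration, and the encoder does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode a single Unicode code point as UTF-8.
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Format the bytes of a string as space-separated uppercase hex pairs.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

For example, to_hex(encode_utf8(0xAC00)) yields "EA B0 80", which matches the escapes in "\xea\xb0\x80". Passing a narrow literal such as "가" to to_hex is a quick way to see which bytes the compiler actually stored for it.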

Below is the code that was actually tested:

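// MS949 CODEPAGE
// Note: printf(u8"...") relies on u8 literals being plain char arrays, which
// holds up to C++17. In C++20 they are char8_t; MSVC's /Zc:char8_t- option
// restores the older behavior.

#include <windows.h>   // SetConsoleOutputCP, CP_UTF8
#include <cstdio>      // printf
#include <iostream>
#include <locale>

using namespace std;

int main()
{
    // Make the console interpret the bytes written to it as UTF-8.
    SetConsoleOutputCP(CP_UTF8);
    //GetConsoleOutputCP();

    // TEST CODE:
    //std::cout.imbue(std::locale("ko_KR.UTF-8"));
    //std::cout.imbue(std::locale("kor"));
    //std::locale::global(std::locale("ko_KR.UTF-8"));
    //_setmode(_fileno(stdout), _O_U8TEXT);

    //std::cout << u8"\xea\xb0\x80" << "////";
    //std::cout << "\xea\xb0\x80" << "////";
    //std::cout << u8"가" << "////";
    //std::cout << "가" << "////";

    printf(u8"\xea\xb0\x80////"); // ???
    printf("\xea\xb0\x80////"); // UTF8 (0xEA 0xB0 0x80)
    printf(u8"가////"); // UTF8 (0xEA 0xB0 0x80)
    printf("가////"); // MS949 (0xB0 0xA1)

    return 0;
}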

Note that cout appears to have a problem related to the locale or imbue(): the output is not printed correctly with MSVC.
