Localization in 2009 and the broken C++ standard.
There are many goodies in the upcoming C++0x standard. Both the core language and the standard library have been significantly improved.
However, one important part of the library remains broken: localization.
Let's write a simple C++ program that prints a number to a file:
#include <iostream>
#include <fstream>
#include <locale>
int main()
{
    // Set the global locale to the system default
    std::locale::global(std::locale(""));
    // Open the file "number.txt"
    std::ofstream number("number.txt");
    // Write a number to the file; the file is closed when main returns
    number << 13456 << std::endl;
}
And in C:
#include <stdio.h>
#include <locale.h>
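int main()
{
    setlocale(LC_ALL, "");
    FILE *f = fopen("number.txt", "w");
    if (!f)
        return 1;
    /* The ' flag (a POSIX extension) requests locale-specific grouping;
       %d matches the int argument */
    fprintf(f, "%'d\n", 13456);
    fclose(f);
    return 0;
}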
Let's run both programs with the en_US.UTF-8 locale and observe the following number in the output file:
13,456
Now let's run both programs with the Russian locale: LC_ALL=ru_RU.UTF-8 ./a.out. The C version gives us the expected result:
13 456
While the C++ version produces:
13<?>456
The output is not valid UTF-8! What happened? What is the difference between the C library and the C++ library, given that both use the same locale database?
According to the locale data, the thousands separator in Russian is U+2002 (EN SPACE), a code point that requires more than one byte in UTF-8. Now take a look at the C++ number formatting facet, std::numpunct: its member function thousands_sep returns a single character. In the C locale definition, by contrast, the thousands separator is represented as a string, so it is not limited to a single character as it is in the C++ standard class. A multi-byte separator simply cannot be represented through the char-based C++ facet.
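The interface difference can be checked directly. The following is a minimal sketch, assuming a C++0x/C++11 compiler for static_assert and decltype; the helper names cxx_thousands_sep and c_thousands_sep are hypothetical and exist only for this illustration.

```cpp
#include <clocale>      // std::lconv, std::localeconv
#include <locale>       // std::locale, std::numpunct, std::use_facet
#include <string>
#include <type_traits>  // std::is_same
#include <utility>      // std::declval

// C++: the facet hands back exactly one char, so a multi-byte UTF-8
// separator cannot be represented.
static_assert(
    std::is_same<decltype(std::declval<const std::numpunct<char>&>().thousands_sep()),
                 char>::value,
    "std::numpunct<char>::thousands_sep() returns a single char");

// C: the same setting is a NUL-terminated string of any length.
static_assert(std::is_same<decltype(std::lconv::thousands_sep), char*>::value,
              "lconv::thousands_sep is a C string");

// Hypothetical helper: the separator as the C++ facet of `loc` reports it.
// The result is always exactly one char long.
std::string cxx_thousands_sep(const std::locale& loc)
{
    return std::string(1, std::use_facet<std::numpunct<char> >(loc).thousands_sep());
}

// Hypothetical helper: the separator as the C library reports it for the
// locale currently selected with setlocale(). It may be several bytes long.
std::string c_thousands_sep()
{
    return std::localeconv()->thousands_sep;
}
```

Calling c_thousands_sep() after setlocale(LC_ALL, "") and cxx_thousands_sep(std::locale("")) under ru_RU.UTF-8 should make the mismatch visible as a difference in string length, although the exact bytes depend on the locale data installed on the system.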
This was just one simple and easily reproducible problem with the C++ standard locale facets. There are many more:
- std::time_get is not symmetric with std::time_put (unlike strftime/strptime in C) and does not allow easy parsing of times with AM/PM marks.
- std::ctype is very simplistic: it assumes that toupper/tolower can be done on a per-character basis, although case conversion may change the number of characters and is context dependent.
- std::collate does not support collation strength (case sensitive or insensitive).
- There is no way to specify a time zone different from the global time zone in time formatting and parsing.
- Time formatting and parsing always assume the Gregorian calendar.
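The std::ctype limitation from the list above can be illustrated with a short sketch, again assuming a C++0x/C++11 compiler. The helper name to_upper_per_char is hypothetical. Because the facet maps one character to one character, the result always has the same length as the input, so the German "straße" can never become the 7-character "STRASSE".

```cpp
#include <locale>       // std::locale, std::ctype, std::use_facet
#include <string>
#include <type_traits>  // std::is_same
#include <utility>      // std::declval

// The single-character overload returns exactly one character.
static_assert(
    std::is_same<decltype(std::declval<const std::ctype<wchar_t>&>().toupper(L'a')),
                 wchar_t>::value,
    "std::ctype<wchar_t>::toupper(wchar_t) returns a single wchar_t");

// Hypothetical helper: uppercases text by converting each character
// independently, which is the only kind of case mapping std::ctype offers.
std::wstring to_upper_per_char(const std::wstring& text, const std::locale& loc)
{
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(loc);
    std::wstring result(text);
    for (std::wstring::size_type i = 0; i < result.size(); ++i)
        result[i] = ct.toupper(result[i]);  // one character in, one character out
    return result;
}
```

Whatever locale is passed in, to_upper_per_char(L"stra\u00DFe", loc) returns six characters; what happens to the ß itself depends on the implementation and the locale data.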
It's very frustrating that in 2009 such annoying, easily reproducible bugs still exist and make the localization facilities totally useless in certain locales.
All the work I have recently done on localization support in the CppCMS framework has convinced me to make an important decision: ICU will be a mandatory dependency and will provide most localization facilities by default, because native C++ localization is a no-go…
The question is: "Will the C++0x committee revisit localization support in C++0x?"
Comments
I love the idea of your project :) the blog was a little slow the last few days but I'm soooo cool using this framework :)
I'm glad to hear ;)
I'm really looking forward to the growth and development of this project. It’s awesome to write web apps/services in C++ ;)
Keep up the good work!