Chinese characters from XML

Hi Everyone

I’m using Hiro Nishihara’s excellent ofxTrueTypeFontUC to render Chinese characters but having trouble loading them from XML. Here’s what I’ve tried so far:

Changing the encoding (both in the XML header and in Notepad++) to everything available to me, including UTF-8, Big5, ISO10646, etc. Some of these produce different combinations of random characters; others make the XML unreadable to ofxXmlSettings, which then reports an error.

Converting the returned XML string to wstring:

wstring chineseText = util::ofxTrueTypeFontUC::convToWString(myXml.getValue("word", ""));  

Converting the characters to Unicode numerical character references, e.g. ? ??? = 鋈 韣顪飋 in the XML. The idea here being that once they are loaded I can convert them back to Chinese and create a wstring.

Using ofxJSON - same problems.
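The numeric character reference idea mentioned above could be sketched roughly as follows. This is only an illustration: `decodeCharRefs` is a hypothetical helper, not part of openFrameworks or ofxTrueTypeFontUC, and it assumes the loaded string is plain ASCII apart from references such as `&#27721;` or `&#x6C49;`, all of which fit in a single `wchar_t`.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: expands decimal (&#27721;) and hexadecimal (&#x6C49;)
// character references into wide characters and copies everything else
// through unchanged. Only code points up to U+FFFF are handled, so each
// result fits in one wchar_t on both Windows and OS X.
std::wstring decodeCharRefs(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t end = in.find(';', i);
        if (in.compare(i, 2, "&#") == 0 && end != std::string::npos) {
            std::string body = in.substr(i + 2, end - (i + 2));
            int base = 10;
            if (!body.empty() && (body[0] == 'x' || body[0] == 'X')) {
                base = 16;
                body.erase(0, 1);
            }
            if (!body.empty() && std::isxdigit(static_cast<unsigned char>(body[0]))) {
                char* stop = nullptr;
                unsigned long cp = std::strtoul(body.c_str(), &stop, base);
                bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
                if (*stop == '\0' && cp != 0 && cp <= 0xFFFF && !surrogate) {
                    out.push_back(static_cast<wchar_t>(cp));
                    i = end + 1;
                    continue;
                }
            }
        }
        // Not a usable reference: keep the byte as it is.
        out.push_back(static_cast<unsigned char>(in[i]));
        ++i;
    }
    return out;
}
```

The resulting wstring could then be handed to the font's drawString call in place of the output of convToWString, assuming the font file contains the glyphs.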

I’m working on a PC using OF 0.8.0 and VS2012. I’m using Notepad++ for XML editing.

I didn’t manage to load Chinese characters from XML but I did find an alternative route. I’m loading the characters as binary from a UTF-16 encoded txt file (confusingly labelled Unicode in Notepad on Win7).

Here’s some code I managed to cobble together from sifting through a seemingly endless stream of stackoverflow threads. It’s largely voodoo but it’s working a treat!

// needs #include <codecvt>, <fstream> and <locale>  
// in setup(): open as a byte stream  
std::wifstream fin(ofToDataPath("chinese.txt").c_str(), std::ios::binary);  
// apply BOM-sensitive UTF-16 facet  
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));  
// wstrings has to outlive setup() so draw() can see it, e.g. as a member of testApp  
vector<wstring> wstrings;  
// start with one empty line so there is always a string to append to  
wstrings.push_back(wstring());  
for (wchar_t c; fin.get(c); ) {  
	// newline (10) and carriage return (13) would otherwise render as square characters  
	bool isLineEnd = ((int)c == 10 || (int)c == 13);  
	if (isLineEnd) {  
		// start a new line; skipping empty ones avoids a blank entry for each CR LF pair  
		if (!wstrings.back().empty())  
			wstrings.push_back(wstring());  
	} else {  
		wstrings.back().push_back(c);  
	}  
}  

In draw():  

for (int i = 0; i < wstrings.size(); i++)  
	font.drawString(wstrings[i], 10, 50 + (20 * i));  

This text will be loaded from a server and saved to a local file. I’ll update this thread as I learn more.

I somehow managed to put off learning about text encoding/character sets etc until now. Like many who know about this stuff I waited until it bit me on the arse before investigating. For those interested in learning more, this article made for a great intro:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

This is a great place to start as well!:

What’s the declaration on your XML file? Is it UTF-16? Not sure if that affects ofxXmlSettings but it should. Also, I wouldn’t be surprised if that munges Unicode up. Poco::XML also does the same thing iirc so it’s best to just use text or something that allows you to use a naked filestream.

I just tried UTF-16 in the XML declaration and encoding and it didn’t make a difference. The XML doesn’t even load. I get the following error:

[ error ] ofxXmlSettings: pushTag(): tag "tags" not found  

Even if it did load I don’t believe it would be possible to convert the string value returned from ofxXmlSettings to a wstring and retain the Chinese characters. I used:

wstring chineseWordFromXml = util::ofxTrueTypeFontUC::convToWString(tagXml.getValue("word", ""));  

When testing in the past I found the following conversion resulted in random characters:

string chineseString = "???? ?? ??? ??? \n";  
wstring chineseWString = util::ofxTrueTypeFontUC::convToWString(chineseString);  
font.drawString(chineseWString, 0, 0);  

txt files should suffice for this project but it would be interesting to know if there is a way to use ofxXmlSettings in this context without altering the class.

Actually, just tested this with ofXml and it works fine:

<?xml version="1.0"?>  
	<chinese>???? ?? ??? ???</chinese>  


void testApp::setup(){  
    cout << xml.getValue("chinese") << endl;  
}  


???? ?? ??? ???  

Not sure whether that’ll parse properly for ofxTrueTypeFontUC but would be curious to see.

I just tried your code on both Win 7 (VS2013) and osx (Xcode 4.5) and the XML won’t load. I’m using this code:

void testApp::setup(){  
    if (xml.load("data.xml"))  
        cout << "xml loaded! " << endl;  
    else  
        cout << "xml not loaded! " << endl;  
    cout << "XML: " << xml.getValue("chinese", "") << endl;  
}  

And I’m seeing “xml not loaded!” every time. Tried several variations. What setup are you using? And how are you making your XML file?

I’m making it in Sublime Text, nothing tricky. Are you using ofXml or ofxXmlSettings?

xml.getValue("chinese", "")  

makes me think you’re using ofxXmlSettings, but I could be wrong. ofxXmlSettings *won’t* load this, but for me at least (and hopefully not just for me :) ) ofXml will.

I had no idea ofXml existed in addition to ofxXmlSettings. I just investigated and it doesn’t seem to work on my PC or Mac. Even just trying:


alone results in - [ error ] ofXml: loadFromBuffer(): DOM ERROR - on both platforms

I tried many different approaches, including loading the XML as an ofFile as in the documentation, replacing the Chinese with English characters, changing the XML encoding, ofToDataPath(…), etc.

Even the xmlExample in examples/utils folder breaks when I try to draw.

osx - I get a SIGABRT on line 357 of ofAppGLFWWindow.cpp
Win7 - It breaks at line 366 in mlock.c

I wonder if others are experiencing the same.

Yeah, it’s not well documented yet. Is it a valid XML file that you’re trying to load? Can you send it to me so I can test?

Sure here’s the project. I’ve tested the XML in my usual way of dragging it into Chrome. No complaints.

Cool will check this as soon as I get into the office

Great cheers Josh

So, I opened both up in HexFiend and you can see on the right that there’s some difference in the encoding; not sure what that is. Where/how did you make that file?

I used TextWrangler on osx to make that particular file. On PC I used Notepad. As mentioned, both were encoded in UTF-16. I’ll try Sublime Text.

Weird, tried using Sublime Text to create the XML file on osx and PC and had the same error message. osx made XML file attached.

That errors out for me too but also still has null bytes in between every character and a weird encoding, so, “<?xml version="1.0"?>” in mine is:

3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 2E 30 22 3F 3E  

in yours is

FE FF 00 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 22 00 31 00 2E 00 30 00 22 00 3F 00 3E  

Looks like some kind of double-byte or wide-char thing. In Sublime Text I did File->Save With Encoding->UTF-8 and then ran it again and it works ok. I then saved it as UTF16 and it’s back to how it was before (i.e. empty bytes and not loading).
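For what it’s worth, the leading FE FF in the second dump matches the UTF-16 big-endian byte order mark, and the 00 bytes are the high halves of 16-bit code units, which is what you would expect from a file saved as UTF-16. A rough way to check a file’s first bytes is sketched below; `detectBom` is a hypothetical helper, not an openFrameworks or Poco call, and it only knows the three marks listed (a UTF-32 little-endian mark would be misreported as UTF-16LE).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: names the encoding implied by a leading byte order mark.
// Returns "none" when the buffer does not start with one of the known marks.
std::string detectBom(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    return "none";
}
```

Run on the two dumps above, the first (starting 3C 3F 78) has no mark and the second (starting FE FF) is reported as UTF-16BE.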

All three are attached. I’d love to get Poco::XML working with UTF-16 but I’m not sure how to set the encoding type properly. Trying the following doesn’t seem to work:

Poco::UTF16Encoding encode;  
parser.addEncoding("UTF-16", &encode);  
document = parser.parseString(buffer);  

I’ll see what I can figure out later on.

I couldn’t make it work with UTF-16 either, but why not use UTF-8? Is it mandatory for your project to use wstrings and UTF-16?

Out of curiosity I just tried to save, load and print Chinese text, and it works fine.

Try using UTF-8 in your editor settings:

Xcode will probably do it automatically once you type Chinese text in it; if not, change it in the text settings in the right-hand bar.

If you’re on Windows with Code::Blocks, go to general settings - other settings

Change “Use encoding when opening files” to UTF-8

Click “As default encoding”

Then… just open an empty example

Include ofxXmlSettings and either ofxTrueTypeFontUC or, even better, ofxFontStash.

Put a proper Unicode TTF file in your data folder, like Arial Unicode.ttf,

and do:

XML.setValue("text", "???? ?? ??? ???");  
textChin = XML.getValue("text", "nooo!");  

and then just draw that string with ofxTrueTypeFontUC or ofxFontStash.

Just for the record, I tried this 5 minutes ago on a Mac with both addons. Works like a charm.

On PC, ofxTrueTypeFontUC doesn’t work all the time, but ofxFontStash does.


Thanks for your input kyri.

Using UTF-8 for the XML seems to work perfectly on osx - this goes against everything I’ve learned on the subject lately. The Chinese characters are rendering perfectly using ofxTrueTypeFontUC. However, it’s a different story on a PC (we’re using PCs for the project). The UTF-8 XML loaded fine but the characters were incorrect - usual mix of random western characters.

Again, happy to load from a text file so this isn’t the end of the world. Would be good to crack this one though.
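One plausible explanation for the random western characters on the PC (not confirmed anywhere in this thread) is that the UTF-8 bytes are being widened one byte per character instead of being decoded as multi-byte sequences. A minimal sketch of decoding them by hand is below. `utf8ToCodePoints` is a hypothetical helper, it needs a C++11 compiler for std::u32string, and it makes no claim about how convToWString works internally. Malformed input becomes U+FFFD; overlong forms and surrogates are not rejected.

```cpp
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into Unicode code points.
std::u32string utf8ToCodePoints(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray or invalid lead byte

        bool ok = i + extra < bytes.size();  // false if the sequence is cut off
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

For code points up to U+FFFF the values can be copied straight into a wstring on both platforms; anything higher would need surrogate pairs on Windows, where wchar_t is 16 bits wide.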

Josh, the results from HexFiend are interesting. The Chinese characters should be double byte. If I run them through a conversion tool, the hex code points look like this:

6C49 5B57 6F22 5B57 0020 9820 9908 0020 6EBF 7154 7143 0020 9C59 9DED 9EC2
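A note on “double byte”: each code point listed above fits in one 16-bit UTF-16 unit, but in UTF-8 everything from U+0800 to U+FFFF takes three bytes, so a UTF-8 file will not show two bytes per character in a hex editor. The sketch below shows the mapping; `codePointToUtf8` is a hypothetical helper written for illustration (C++11 for char32_t) and assumes a valid code point.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (no surrogates, at most U+10FFFF).
std::string codePointToUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, the first code point in the list, 6C49, comes out as the bytes E6 B1 89, while the space (0020) stays a single byte.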

Hmm. I wonder if this might be Visual Studio’s wchar_t/wstring thing? I’ll check it out when I get to my Windows machine and see what I can figure out. Visual Studio may treat the Poco internals differently than the OS X gcc/Apple LLVM compilers. Very interesting though, I’m glad to hear it’s not something in ofXml being super-broken :)