Dealing with multi-byte strings

Hello all,
I am out in Taiwan doing an installation and need some help understanding how to deal with strings of Chinese characters I am gathering from their version of Twitter, called Plurk.
I am using ofxFontStash to write the characters correctly to the screen, so that end of things is fine. I am having some success chopping up sentences with ofSplitString, though there are several symbols they use which I cannot seem to split with. The biggest issue is that I want to search the strings of sentences for specific “Chinese characters”, which are in fact multi-byte (typically 3 bytes each in UTF-8), so I cannot use the typical someString[i] method of iterating through the chars and grabbing the right one. Is there a good way to break up a string into 3-byte pieces? And then how do I compare each piece to the “Chinese character” I want?
As an example, I am looking for this “我”, which is the byte sequence “230 136 145” in decimal, or “\230 \136 \145” (?), as it appears in longer strings.
By making my OF files Unicode I wonder if I have made this more difficult, because I cannot see how things are being converted in Xcode.
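For what it is worth, here is a minimal sketch in plain C++ (no OF or Poco calls) of one way to walk a UTF-8 std::string one character at a time and compare each piece against a target such as “我”. It assumes the string really is valid UTF-8. The helper names utf8SequenceLength, splitUtf8Chars and countChar are made up for this example. Note that decimal values are not valid C++ escapes: the bytes 230 136 145 would be written in hex as "\xE6\x88\x91".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and for stray/invalid bytes so callers always advance.
std::size_t utf8SequenceLength(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: most Chinese characters
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // continuation or invalid byte
}

// Hypothetical helper: one std::string per UTF-8 character.
std::vector<std::string> splitUtf8Chars(const std::string& text) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len = utf8SequenceLength(static_cast<unsigned char>(text[i]));
        if (i + len > text.size()) len = text.size() - i; // truncated tail
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

// Hypothetical helper: count occurrences of a single UTF-8 character.
std::size_t countChar(const std::string& text, const std::string& target) {
    std::size_t count = 0;
    for (const std::string& c : splitUtf8Chars(text)) {
        if (c == target) ++count;
    }
    return count;
}
```

With this, countChar(sentence, "\xE6\x88\x91") would count the occurrences of “我”. Pieces are sized from the lead byte rather than fixed at 3 bytes, so ASCII and punctuation mixed into the sentence are handled too.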
Anyone with experience doing this or a better understanding of how to deal with these things in OF or C++?
Thanks!

Hey @digitalColeman - I recently replaced all of the built-in oF string search code with UTF-8 equivalents – it is in this PR. While not in the core, you could have a look at the equivalent Poco:: functions and use them for UTF-8 string searching …

Find it here:

https://github.com/openframeworks/openFrameworks/pull/2910/files

Also, this is required reading :smile:

Chris, thanks, the article was a blast!
I do want to follow up on whether I should be using a different kind of string, or whether the Poco functions will automatically understand that I want to do operations on UTF-8. I see the core functions in your pull request; I just want to be sure I am using them correctly.
For instance should I use u16string to store and process things?
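As a rough illustration of the trade-off: a plain std::string can hold UTF-8 bytes as-is, and if one element per character is wanted, decoding to std::u32string is one option. The sketch below is a hand-rolled decoder in standard C++, not an OF or Poco function; utf8ToUtf32 is a hypothetical name. It replaces malformed bytes with U+FFFD and does not check for overlong encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Malformed or truncated sequences become U+FFFD (replacement character).
std::u32string utf8ToUtf32(const std::string& text) {
    std::u32string out;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        char32_t cp = 0xFFFD; // default for stray continuation/invalid bytes
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        if (i + len > text.size()) { // sequence cut off at end of string
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(text[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; } // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { // bad continuation: emit U+FFFD and resync on next byte
            out.push_back(0xFFFD);
            i += 1;
            continue;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, decoded[i] indexes whole code points, so “我” is the single value 0x6211. A u16string would also fit most Chinese characters in one element, but characters outside the Basic Multilingual Plane need two elements (a surrogate pair) there.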
I see your functions have a new version of ofIsStringInString, but you didn't make any change to ofSplitString. Does that mean we still cannot split strings on multi-byte characters?
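In case it helps anyone else reading, here is a sketch of splitting on a multi-byte delimiter using only std::string::find. This is not the ofSplitString implementation; splitByDelimiter is a hypothetical name. It relies on the fact that in valid UTF-8 the lead bytes and continuation bytes never overlap, so a byte-wise search for a complete character cannot match in the middle of another character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on every occurrence of `delimiter`.
// Both are assumed to be valid UTF-8; the delimiter may be several bytes long.
std::vector<std::string> splitByDelimiter(const std::string& text,
                                          const std::string& delimiter) {
    std::vector<std::string> parts;
    if (delimiter.empty()) { // nothing to split on: return the input unchanged
        parts.push_back(text);
        return parts;
    }
    std::size_t start = 0;
    std::size_t pos = 0;
    while ((pos = text.find(delimiter, start)) != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + delimiter.size(); // skip the whole delimiter, not one byte
    }
    parts.push_back(text.substr(start)); // remainder after the last delimiter
    return parts;
}
```

For example, the ideographic full stop “。” (U+3002) is the three bytes "\xE3\x80\x82", so splitByDelimiter(sentence, "\xE3\x80\x82") would break a post into sentences. Empty pieces are kept when two delimiters are adjacent.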
Chris