Splitting a text file into a list of individual character strings (supporting accents!)

Hello,

I have been struggling with a bit of code that was working on 0.7 and is not working anymore on 0.8.

I have a text file with multiple lines and accents.
I load it using ofFile and ofBuffer into a vector, each element being a line of the text.

Then I want to split each line into another vector of strings, each element being an individual character of the text.

It works well with unaccented characters but it does not work with accents, because each accented letter takes up 2 chars (bytes) in the string.

On 0.7 I had a method borrowed from the forum to substitute characters before splitting. Now on 0.8 it doesn’t seem to work.

I know it is related to Unicode/UTF-8…
I tried a lot of things, including ofxTrueTypeFontUC, ofxFTGL, wstring… without any success.

I always get some \303 \251 each time I have a “é”.

Any idea?

Thanks a lot!

That method was included in OF 0.8, so if your file is UTF-8 you shouldn’t need to do the substitution beforehand anymore. If you want your old code to keep working as it is, you can set the encoding on the font object:

ttf.setEncoding(OF_ENCODING_ISO_8859_15);

so the substitution you were doing will still work

Thanks a lot, the old code is working fine with this method.

But it’s not working without the substitution. When I convert from string to vector to split each character, it doesn’t handle multi-byte characters: for example I get ‘\303’ in one cell and ‘\251’ in the next one instead of ‘é’, so I have to analyze each byte before splitting and decide whether to merge 1, 2 or 3 bytes into one cell.
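
For anyone who wants to do that merging by hand, here is a minimal sketch in plain C++ (no openFrameworks calls). The function names `utf8SequenceLength` and `splitUtf8Characters` are made up for this example. It assumes the input is valid UTF-8 and reads the length of each character from its first byte; malformed bytes are simply passed through one at a time.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Falls back to 1 for unexpected bytes (e.g. a stray continuation byte)
// so that malformed input still advances.
static std::size_t utf8SequenceLength(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: e.g. accented Latin letters
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;
}

// Hypothetical helper: one vector cell per UTF-8 encoded character.
std::vector<std::string> splitUtf8Characters(const std::string& line) {
    std::vector<std::string> cells;
    std::size_t i = 0;
    while (i < line.size()) {
        std::size_t len = utf8SequenceLength(static_cast<unsigned char>(line[i]));
        if (i + len > line.size()) {
            len = line.size() - i; // truncated sequence at the end of the line
        }
        cells.push_back(line.substr(i, len));
        i += len;
    }
    return cells;
}
```

With this, a line containing "café" gives 4 cells, and the last cell holds both bytes ‘\303’ ‘\251’ together. Note that it splits per code point, so a letter written as base character plus a separate combining accent would still end up in two cells.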

Anyway I am happy with my old code + OF_ENCODING_ISO_8859_15.

Thanks a lot!

Any idea how to split a string into letters regardless of whether they take one or two bytes? Thanks

you can use ofUTF8Iterator like:

for(auto c: ofUTF8Iterator(str)){
    // c holds one Unicode code point decoded from the UTF-8 string
}

or if you want to for example put every letter in a vector:

ofUTF8Iterator it(str);
std::vector<int32_t> letters(it.begin(), it.end());

to then put an int32_t utf8 codepoint back into a string you can use:

ofAppendUTF8(str, c);
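
In case it helps to see what appending a code point involves, here is a rough plain C++ sketch of the encoding step. This is not the openFrameworks implementation, just the standard UTF-8 bit layout; `appendCodepointUtf8` is a made-up name, and the sketch does no validation (surrogates and values above 0x10FFFF are not rejected).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: append code point `cp` to `out`, encoded as UTF-8.
void appendCodepointUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

For example, appending 0xE9 (é) adds the two bytes ‘\303’ ‘\251’ that showed up earlier in the thread.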

Thank you very much @arturo
It would be great to be able to use ofUTF8Iterator(str).size() too

if you are using master you can call ofUTF8Length
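
If you are not on master, a possible stand-in is to count the bytes that start a character. This is a minimal sketch in plain C++, assuming valid UTF-8 input; `countUtf8Codepoints` is a made-up name, not an openFrameworks function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a UTF-8 string.
// Continuation bytes look like 10xxxxxx, so every other byte starts a character.
std::size_t countUtf8Codepoints(const std::string& str) {
    std::size_t count = 0;
    for (unsigned char byte : str) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "café" this returns 4, whereas str.size() would report 5 bytes.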
