[SOLVED] Insert newline characters in a string

Hi everyone

I’m using openFrameworks on Ubuntu and am using ofxTrueTypeFontUC for showing Unicode characters inside my program.

I have a string which is just one line and very long, and I want to display it inside a circle, so what I want to do is to insert “new line” characters in proper places so it gets wrapped inside the circle.

The problem is that if the string contains multi-byte characters and I insert a “\n” in the middle of such a character, it splits the character and a strange character is shown instead.

I was wondering if I should use a type other than string, or maybe an addon is the right way to go?

Thanks in advance

This Stack Overflow question might help; it seems similar to what you want to do. I’ve done it by hand before, and it’s not impossible once you know the UTF-8 spec, but if I were doing it again I’d look for a library that lets you iterate over the characters:

they recommend this library:

http://utfcpp.sourceforge.net/
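
For reference, doing it by hand mostly comes down to one rule of UTF-8: continuation bytes always have the bit pattern 10xxxxxx, so a newline may only go in front of a byte that does not match that pattern. Here is a minimal sketch of that idea, assuming the input is valid UTF-8. The names `isContinuationByte`, `wrapUtf8` and `maxPerLine` are made up for illustration and are not part of openFrameworks or utfcpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
static bool isContinuationByte(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Hypothetical helper: inserts '\n' after every maxPerLine code points,
// never between the bytes of one character. Assumes valid UTF-8 input.
std::string wrapUtf8(const std::string& text, std::size_t maxPerLine) {
    assert(maxPerLine > 0);
    std::string out;
    std::size_t countOnLine = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        // A non-continuation byte starts a new code point.
        if (!isContinuationByte(c)) {
            if (countOnLine == maxPerLine) {
                out += '\n';
                countOnLine = 0;
            }
            ++countOnLine;
        }
        out += text[i];
    }
    return out;
}
```

For wrapping inside a circle you would presumably want a different line length per row rather than one fixed `maxPerLine`, but the boundary check stays the same.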


This can also be done without external libraries.

Here’s an ofSketch snippet:

#include "Poco/UTF8Encoding.h"
#include "Poco/TextIterator.h"

void setup() {
	// put your setup code here, to run once:
	
	Poco::UTF8Encoding utf8Encoding;
    std::string utf8String("汉语 / 漢語; Hànyǔ or 中文; Zhōngwén");
    Poco::TextIterator it(utf8String, utf8Encoding);
    Poco::TextIterator end(utf8String);
    
    while (it != end) {
        std::cout << "Unicode Codepoint (an integer): " << *it << std::endl;
        
        ++it; 
    }

}

void draw() {
	// put your main code here, to run once each frame:
	
}

Also, by the way, with 0.9.0 this iteration stuff will be in the core as Poco 1.6 (and eventually c++11 will make all of this stuff much much easier :))


@chriss
out of curiosity

how do you convert that codepoint into a utf8 string though?

For example

in utf8

string Pi__;
Pi__ += -49;
Pi__ += -128;

cout<< Pi__<<endl;

output : π

Poco will return 960, which is the Unicode code point (U+03C0), the same value you would see in UTF-32 or UTF-16…
(despite the fact that you used UTF8Encoding in your iterator)

so… *it = 960

how do you turn that thing into a … π ?
that is, into the two bytes 11001111 (-49) & 10000000 (-128)

or at least a wchar_t * ?

I remember trying to figure this out before but couldn’t find anything inside Poco’s documentation
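
To make the relationship between 960 and those two bytes concrete, here is a small sketch of the two-byte case only: the top 5 bits of the code point go into a 110xxxxx lead byte and the low 6 bits into a 10xxxxxx continuation byte. `encodeTwoByte` is a hypothetical name used for illustration, and the function deliberately handles nothing outside U+0080..U+07FF.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes a code point in U+0080..U+07FF as two UTF-8 bytes.
std::string encodeTwoByte(unsigned int codepoint) {
    assert(codepoint >= 0x80 && codepoint <= 0x7FF);
    std::string out;
    out += static_cast<char>(0xC0 | (codepoint >> 6));    // 110xxxxx: top 5 bits
    out += static_cast<char>(0x80 | (codepoint & 0x3F));  // 10xxxxxx: low 6 bits
    return out;
}
```

For 960 (binary 01111 000000) this gives 11001111 and 10000000, i.e. 0xCF and 0x80, which are the same bytes as -49 and -128 when read as signed chars.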

never mind I’ve figured it out :smile:

@poorya7

basically:
combining @zach & Chris’ posts

The value the iterator returns is already the Unicode code point as a plain integer,
so there is no need to go through a hex string, an unsigned short or a wchar_t first
(an unsigned short would also cut off anything above 0xffff).
Just pass it through the typical encoding described in the post.
Simple =)
Here is the function if anybody cares:

string ofUTF16DecToUtf8Char(int input)
{
    string out;

    // Negative values, surrogates (0xd800-0xdfff) and anything above 0x10ffff
    // are not valid code points, so nothing is emitted for them.
    if (input < 0 || (input >= 0xd800 && input <= 0xdfff) || input > 0x10ffff)
        return out;

    unsigned int codepoint = static_cast<unsigned int>(input);

    if (codepoint <= 0x7f)
    {
        // 1 byte: plain ASCII
        out.append(1, static_cast<char>(codepoint));
    }
    else if (codepoint <= 0x7ff)
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}

and then by using Poco like Chriss suggested:

string RESULT;

using Poco::TextIterator;
using Poco::UTF8Encoding;
string utf8String("Κύκλος Circle");
UTF8Encoding utf8;
TextIterator it(utf8String, utf8);
TextIterator end(utf8String);
for (; it != end; ++it)
{
    // Re-encode each code point as UTF-8 and put a newline after every character.
    RESULT += ofUTF16DecToUtf8Char(*it);
    RESULT += "\n";
}


Thanks a lot everyone.
I used the method @igiso suggested. Basically I break my long string into a long ‘one-byte’ string and then insert new lines in the proper places. I’m not sure it’s fully solved though, because with the newly converted string, I used the code below at the position where the multi-byte character is, and the result is still a strange character!

string newDescription=description.substr(0,7)+"etc";

With that code, if the ‘description’ string has the multi-byte character at index 7, it still generates a strange character before ‘etc’. That makes me wonder: how is it possible to convert a string with multi-byte characters into a string of single-byte characters? Wouldn’t some data be lost in that process? Or am I doing something wrong?

Thanks again

… this is a huge topic

I assumed you were using UTF-8.

If you are on Visual Studio… and you are doing string = "π";

this is not necessarily UTF-8 (it depends on the encoding the compiler uses for narrow string literals).

wstring is not UTF-8, wchar_t is not UTF-8.

A string does not hold single-byte characters, and UTF-8 is not single-byte…

string is an array of chars, ergo an array of bytes.

A UTF-8 character is not a single byte (unless it is an ASCII character; UTF-8 is ASCII-compatible),

so “π” cannot fit into a char.

It can fit into a char *, a wchar_t or a string.

wchar_t is not single-byte either…

wstring is an array of wchar_t.

So basically, if you have an array of wchar_t, FORGET ALL OF THE ABOVE.
A wstring can be browsed character by character in a loop without all this fuss. UTF-8 varies in width internally, so it is not supposed to be indexed like that.
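
If per-character indexing is what you are after, one portable option is to decode the UTF-8 bytes into a fixed-width string first. The sketch below uses C++11's `std::u32string` rather than `wstring`, which is my own substitution, because `char32_t` is 32 bits on every platform. `utf8ToUtf32` is a hypothetical helper name, and the code assumes well-formed input: it does not detect overlong or otherwise invalid sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into one char32_t per code point.
// Assumes well-formed input; invalid sequences are not detected.
std::u32string utf8ToUtf32(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t length = 1;
        char32_t codepoint = lead;  // 0xxxxxxx: ASCII
        if ((lead & 0xE0) == 0xC0) { length = 2; codepoint = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; codepoint = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; codepoint = lead & 0x07; }
        // A truncated sequence violates the well-formed-input assumption.
        assert(i + length <= utf8.size());
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < length; ++k) {
            codepoint = (codepoint << 6) |
                        (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out.push_back(codepoint);
        i += length;
    }
    return out;
}
```

After decoding, `size()` and `operator[]` work in units of code points, so element 0 of the decoded "π" is 960.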

  1. what platform are you on?

  2. what is the poco iterator returning when you pass a string = “π” in it.

then… everything else depends on the above two actually…

Maybe you don’t need the Poco iterator and can just use wstring + UTF-16.

I assumed you have a Unicode UTF-8 encoded string and you want to browse through the characters.

If you have an L"π", that is not UTF-8; that is a Unicode wide-character string, probably UTF-16 (wchar_t is 16 bits on Windows, but 32 bits on Linux).

In that case the above process will not work.

Regarding wide-char, multibyte and losing data… that is a HUGE topic…

and there is SOOOO much wrong terminology and so many misconceptions

a good place to start is this:
http://doc.cat-v.org/plan_9/4th_edition/papers/utf
:beer:


also…

The code you are posting…

will not work reliably with UTF-8

let’s say:

string s = "λοβ";

if you do s.size() this will not return 3

it will return 6

so you can’t use substr with character positions.

UTF-8 is using the string as a container of bytes.

if you want to do the above,

follow the example I gave you and basically

if the iterator is at the 7th character, then add the “etc”

in terms of the byte array, “etc” may then be added at the 14th position of the array, not the 7th

(depending on the language of the text, i.e. how many bytes each character takes)
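
A sketch of that idea without Poco, assuming valid UTF-8: walk the bytes, count only the ones that start a character, and cut at the byte index where the requested character count is reached. `utf8Prefix` is a hypothetical helper name, not an openFrameworks function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the first `count` code points of a UTF-8 string.
// Assumes valid UTF-8; if the string is shorter, all of it is returned.
std::string utf8Prefix(const std::string& text, std::size_t count) {
    // Valid UTF-8 never starts with a continuation byte (10xxxxxx).
    assert(text.empty() ||
           (static_cast<unsigned char>(text[0]) & 0xC0) != 0x80);
    std::size_t seen = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        bool startsCodePoint = (c & 0xC0) != 0x80;
        if (startsCodePoint) {
            if (seen == count) {
                return text.substr(0, i);  // byte index i is a character boundary
            }
            ++seen;
        }
    }
    return text;
}
```

With this, something like `utf8Prefix(description, 7) + "etc"` cuts after seven characters, however many bytes they occupy.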

here is another example:

const char * p = "Πi";

if you do strlen(p)

it will return 3, not 2, because Π = 2 bytes and i = 1 (sizeof(p) would only give you the size of the pointer)
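
To get the character count instead of the byte count, you can count only the bytes that start a character, i.e. skip the 10xxxxxx continuation bytes. A minimal sketch, assuming valid UTF-8; `utf8Length` is a hypothetical name used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: counts code points in a NUL-terminated UTF-8 string
// by skipping continuation bytes (10xxxxxx).
std::size_t utf8Length(const char* text) {
    assert(text != nullptr);
    std::size_t count = 0;
    for (; *text != '\0'; ++text) {
        if ((static_cast<unsigned char>(*text) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the "Πi" example, `std::strlen` reports 3 bytes while `utf8Length` reports 2 characters.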

:four_leaf_clover:


Thanks a lot for the great tips @igiso. I’ll be sure to keep them in mind.