Discussion:
Get correct Scintilla-Text with UTF8 encoding?
Charly Dante
2014-05-18 11:08:58 UTC
Hi there,

I'm currently trying to get the correct text from my Scintilla control in
UTF-8 encoding, but so far I have failed to do so.

I have used SendMessage(my_sci_window, SCI_SETCODEPAGE, SC_CP_UTF8, 0); to
enable all Unicode characters, such as Russian or Chinese, for my edit
control. These characters are displayed correctly when entered into the
Scintilla control.

However, if I try to get the whole text of the Scintilla control, I only
get garbage values for the characters that were Chinese/Russian symbols. I
use this code to get the Scintilla text:

size_t text_length = SendMessage(my_sci_window, WM_GETTEXTLENGTH, 0, 0);
char *buffer = new char[text_length + 1];
SendMessage(my_sci_window, WM_GETTEXT, text_length + 1,
reinterpret_cast<LPARAM>(buffer));

I tried both WM_GETTEXT and SCI_GETTEXT, but found no difference.
Converting the char buffer to a wchar_t buffer using mbstowcs_s doesn't
help either. As soon as the Scintilla control contains any non-ASCII
characters such as Chinese or Thai, they are turned into broken characters
and I cannot get the correct text of the Scintilla control.

Any idea how to fix this?

Best,
CD
--
You received this message because you are subscribed to the Google Groups "scintilla-interest" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scintilla-interest+***@googlegroups.com.
To post to this group, send email to scintilla-***@googlegroups.com.
Visit this group at http://groups.google.com/group/scintilla-interest.
For more options, visit https://groups.google.com/d/optout.
Neil Hodgson
2014-05-20 01:09:50 UTC
Post by Charly Dante
However, if I try to get the whole text of the Scintilla Control, I only get Garbage values for the chars that were Chinese/Russian Symbols.
I was very hesitant to reply to this question because the description is extremely vague. Unless you include more information in your question, no one will have any idea what your problem is and it is unlikely anyone will reply.

Neil
Charly Dante
2014-05-21 18:22:54 UTC
Hi,

OK, then I will explain my problem in more detail. Actually, the only thing I
want is a function to get the whole text of the Scintilla control and a
function to set the whole text of the Scintilla control. I have already
written those functions and they work perfectly fine with ASCII characters.
However, if the text contains Chinese, Russian or similar characters, I get
problems.

If you open Scite and select *"File->Encoding->UTF-8"*, you can paste
basically all of those characters into the Scite text window and Scite
will display them correctly. Take for example this test string: "test
string 俿侶侏佶". It contains several ASCII characters but also some Chinese
characters. If I paste it into Scite with the encoding set to UTF-8, I have
no problems and all characters are displayed correctly.

However, with those characters my functions to get and set the text of the
control don't work anymore. Consider the following code:

// Get the text length of the scite control
size_t text_length = SendMessage(scite_hwnd, WM_GETTEXTLENGTH, 0, 0);

// Allocate a buffer for the text
char *buffer = new char[text_length];

// Get the text of the scite control
SendMessage(scite_hwnd, WM_GETTEXT, text_length, (LPARAM)buffer);

// Test: Convert the text to a wchar_t array
size_t newsize = strlen(buffer) + 1;
wchar_t * wcstring = new wchar_t[newsize];
size_t convertedChars = 0;
mbstowcs_s(&convertedChars, wcstring, newsize, buffer, _TRUNCATE);

// Set the text of the scite control: Both buffer (char array) and wcstring
// (wchar_t array) don't work correctly
SendMessage(scite_hwnd, WM_SETTEXT, 0, (LPARAM)wcstring);


I tried WM_SETTEXT with both the char array and the wchar_t array, but
neither delivers the result I expect. Basically, I only get the text of
the Scite control and then set it again to exactly the same text, so
the text should remain unchanged.

With wcstring I only get one letter in the control, namely "t". I guess
this doesn't work at all because WM_SETTEXT expects a pointer to a char
array and not a wchar_t array. However, with the char array it doesn't
really work either, because then I get:

test string 俿侶侏

followed by two "black boxes". The first "black box" contains the text xE4
and the second the text xBD; the last character of the test string
"test string 俿侶侏佶" is missing.

This is my problem and this is what I want to fix. I hope this makes more
sense now :)

Best,
CD
Matthew Brush
2014-05-21 18:29:12 UTC
Hi,

Did you try passing SC_CP_UTF8 with the SCI_SETCODEPAGE message before
setting the buffer with UTF-8 encoded bytes?

Cheers,
Matthew Brush
Charly Dante
2014-05-21 19:34:48 UTC
Uhm, yes, as described in the initial post, I already did that. Otherwise
Scite would not display the Chinese characters correctly, right?

The option I mentioned, *"File->Encoding->UTF-8"*, should already do that,
but I also tried explicitly sending SC_CP_UTF8 via SCI_SETCODEPAGE, with
no difference.

I tried MultiByteToWideChar like this:

// Get the correct length of the buffer
int wchars_num = MultiByteToWideChar(CP_UTF8, 0, buffer, -1, NULL, 0);

// Allocate an array of that length
wchar_t * wcstring = new wchar_t[wchars_num + 1];

// Convert the text to a wchar_t array
MultiByteToWideChar(CP_UTF8, 0, buffer, -1, wcstring, wchars_num);

// Try to send it to the scite window -> fail
SendMessage(scite_hwnd, WM_SETTEXT, 0, reinterpret_cast<LPARAM>(wcstring));

but it doesn't work either... So even if I can convert it correctly to a
wchar_t array, how can I pass it correctly to the Scintilla/Scite edit
window? SendMessage expects a pointer to a char array, so that doesn't seem
to work at all...

I mean, this has to be possible somehow, right?
Neil Hodgson
2014-05-21 22:24:39 UTC
[previous message was empty because I hit the wrong button]

Prefer SCI_* messages over WM_* messages: SCI_* messages are defined completely by Scintilla, are more likely to be used by others, and are thus more likely to be maintained. The WM_* messages are just for compatibility and are in the deprecated section of the documentation.
Post by Charly Dante
// Get the text length of the scite control
size_t text_length = SendMessage(scite_hwnd, WM_GETTEXTLENGTH, 0, 0);
http://msdn.microsoft.com/en-us/library/windows/desktop/ms632628(v=vs.85).aspx
-> The return value is the length of the text in characters, not including the terminating null character.
Post by Charly Dante
// Allocate a buffer for the text
char *buffer = new char[text_length];
The buffer should have room for a NUL terminator, so it's new char[text_length+1].
Post by Charly Dante
// Get the text of the scite control
SendMessage(scite_hwnd, WM_GETTEXT, text_length, (LPARAM)buffer);
http://msdn.microsoft.com/en-us/library/windows/desktop/ms632627(v=vs.85).aspx
-> wParam The maximum number of characters to be copied, including the terminating null character.

So wParam should be text_length+1.

In the current release, 3.4.1, both WM_GETTEXT and WM_SETTEXT use UTF-8 (when that is set as the code page).

In release 3.4.2, for compatibility with some general-purpose applications like screen readers, WM_GETTEXT and WM_GETTEXTLENGTH will return UTF-16, but only when the Scintilla window is created as a wide character window (CreateWindowW).
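
The sizing rules above can be sketched as follows. This is only an illustration under stated assumptions: `Send` is a hypothetical stand-in for SendMessage bound to the Scintilla window (so the sketch compiles without windows.h), `GetTextUtf8` and `MakeFakeControl` are made-up names, the message numbers are copied from Scintilla.h, and the length convention is the one described in this thread for the 3.x releases, where the length passed to SCI_GETTEXT includes the terminating NUL. Later Scintilla releases may use a different convention, so check the documentation of the version in use.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Message numbers as defined in Scintilla.h.
constexpr unsigned int kSciGetLength = 2006;  // SCI_GETLENGTH
constexpr unsigned int kSciGetText = 2182;    // SCI_GETTEXT

// Hypothetical stand-in for SendMessage(hwnd, msg, wParam, lParam).
using Send =
    std::function<std::intptr_t(unsigned int, std::uintptr_t, std::intptr_t)>;

// Returns the document bytes. They are UTF-8 only if the control's code
// page was set to SC_CP_UTF8 beforehand (assumption).
std::string GetTextUtf8(const Send &send) {
    const std::size_t length =
        static_cast<std::size_t>(send(kSciGetLength, 0, 0));
    std::vector<char> buffer(length + 1);  // +1 for the NUL terminator
    send(kSciGetText, buffer.size(),
         reinterpret_cast<std::intptr_t>(buffer.data()));
    return std::string(buffer.data(), length);
}

// Fake control for experiments only: copies at most wParam - 1 bytes and
// appends a NUL, mimicking the 3.x convention assumed above.
Send MakeFakeControl(const std::string &document) {
    return [document](unsigned int msg, std::uintptr_t wParam,
                      std::intptr_t lParam) -> std::intptr_t {
        if (msg == kSciGetLength)
            return static_cast<std::intptr_t>(document.size());
        if (msg == kSciGetText && wParam > 0) {
            char *out = reinterpret_cast<char *>(lParam);
            const std::size_t n =
                std::min<std::size_t>(document.size(), wParam - 1);
            document.copy(out, n);
            out[n] = '\0';
            return static_cast<std::intptr_t>(n);
        }
        return 0;
    };
}
```

With a real window, `send` would forward to SendMessage. Passing text_length instead of text_length+1 drops the final byte; for text ending in a three-byte UTF-8 character that leaves two dangling bytes, which would be consistent with the xE4 and xBD boxes reported earlier in the thread.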

Neil
Ferdinand Prantl
2014-05-21 19:02:05 UTC
Post by Charly Dante
// Test: Convert the text to a wchar_t array
size_t newsize = strlen(buffer) + 1;
wchar_t * wcstring = new wchar_t[newsize];
size_t convertedChars = 0;
mbstowcs_s(&convertedChars, wcstring, newsize, buffer, _TRUNCATE);
You cannot generally use mbstowcs_s for UTF-8 conversions. It expects the
input encoding set up by your current locale (LC_CTYPE), which may not be
UTF-8. You can use APIs like MultiByteToWideChar or libiconv, for example.
Having the luxury of the Scintilla sources, you can include UniConversion.h
and UniConversion.cxx in your project and use them without additional
dependencies.

When using mbstowcs_s or similar APIs, you should let the API compute the
target length (in this case by calling the function with NULL as the
target). You'd probably have no problem in your sample, but in the reverse
direction you would.
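
The count-then-convert pattern can be sketched portably as follows. On Windows, MultiByteToWideChar(CP_UTF8, ...) does this job: called with a NULL output buffer and a size of 0, it returns the required number of wide characters. The hand-written decoder below is only a stand-in so that the example compiles anywhere. `DecodeOne`, `Utf16Length` and `Utf8ToUtf16` are hypothetical names, and the input is assumed to be well-formed UTF-8 (no validation is attempted).

```cpp
#include <cstddef>
#include <string>

// Decodes the code point starting at s[i] and advances i past it.
// Assumes well-formed UTF-8; malformed input is not detected.
char32_t DecodeOne(const std::string &s, std::size_t &i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;  // single-byte (ASCII)
    // Number of continuation bytes implied by the lead byte.
    const int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
    for (int k = 0; k < extra && i < s.size(); ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Pass 1: number of UTF-16 code units needed (terminator not counted).
std::size_t Utf16Length(const std::string &utf8) {
    std::size_t units = 0;
    for (std::size_t i = 0; i < utf8.size();)
        units += (DecodeOne(utf8, i) >= 0x10000) ? 2 : 1;
    return units;
}

// Pass 2: convert, using the computed length to size the target.
std::u16string Utf8ToUtf16(const std::string &utf8) {
    std::u16string out;
    out.reserve(Utf16Length(utf8));
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = DecodeOne(utf8, i);
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

The point of the first pass is that the source and target lengths differ: a three-byte UTF-8 character becomes one UTF-16 unit, so sizing the target from strlen of the source over-allocates here, while the same shortcut in the opposite direction would under-allocate.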

I'm sorry for the off-topic remark - I think that Matthew's answer nailed it.

--- Ferda
Charly Dante
2014-05-25 20:39:35 UTC
Hi again,

I have now tried several things (including all the tips mentioned) and I'm
getting quite desperate, because I just can't get it to work. To rule out
problems such as choosing the wrong buffer length, here is a (very small)
minimal example which already demonstrates my problem:


char *buffer = "teststringüü";
SendMessage(scite_hwnd, WM_SETTEXT, 0, (LPARAM)buffer);


This simply doesn't work. The text in the Scite control is set to
"teststring", followed by two black boxes containing
"xFC" instead of the desired characters "üü". I also tried SCI_SETTEXT
(= 2181), but that doesn't work at all: I don't know why, but the text of
the Scite text control is completely erased by SCI_SETTEXT (note that I'm
trying to modify the Scite text from another, external application; could
that cause a problem with SCI_SETTEXT?). However, the main problem also
exists within my own Scintilla application.

Both my application and Scite are set to Unicode encoding (Scite via
*"File->Encoding->UTF-8"*).

The strange thing is that if I copy and paste the test string
"teststringüü" into the Scite control, everything works fine and is
displayed correctly. However, if I use the programmatic solution from
above, I get black boxes and broken values in the edit control.

I just don't get what I'm doing wrong...

Best,
CD
Neil Hodgson
2014-05-25 21:30:31 UTC
Post by Charly Dante
char *buffer = "teststringüü";
SendMessage(scite_hwnd, WM_SETTEXT, 0, (LPARAM)buffer);
This simply doesn't work. The text in the Scite Control is set to "teststring" and then followed by two black boxes containing "xFC" and not the desired chars "üü".
'\xFC' is 'ü' in Windows-1252 (and ISO-8859-1) so either your source code is in Windows-1252, or your compiler is converting the literal to Windows-1252.
http://en.wikipedia.org/wiki/Windows-1252
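
One way to take the source file encoding and the compiler out of the picture is to spell the non-ASCII bytes with explicit escapes. The sketch below is only an illustration: the constant names and `LooksLikeUtf8` are hypothetical, and the check is a simplified structural test (lead byte ranges and continuation bytes only), not a full UTF-8 validator.

```cpp
#include <cstddef>
#include <string>

// "teststringüü" with the two ü spelled as explicit UTF-8 bytes (C3 BC),
// so the result does not depend on how the source file is saved.
const char kTestStringUtf8[] = "teststring\xC3\xBC\xC3\xBC";

// The same text as a Windows-1252 compiler or source file would produce it.
const char kTestStringCp1252[] = "teststring\xFC\xFC";

// Simplified structural check: every lead byte must be followed by the
// right number of continuation bytes (10xxxxxx).
bool LooksLikeUtf8(const std::string &s) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (b < 0x80) extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return false;  // stray continuation or invalid lead, e.g. 0xFC
        if (i + extra >= s.size()) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

The byte 0xFC can never occur in UTF-8, so a control whose code page is SC_CP_UTF8 has no character to draw for it; the xFC boxes described above are presumably how such bytes are displayed.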
Post by Charly Dante
I also tried SCI_SETTEXT (= 2181), but this one doesn't work at all, I don't know why, but the text of the Scite text control is completely erased with SCI_SETTEXT (note that I'm trying to modify the Scite text from another, external application; could that cause a problem with SCI_SETTEXT?). However, the main problem also exists within my own Scintilla application.
Each application has a separate address space and you can not, in general, hand an address in your application to another. Windows provides limited interception of some known messages, including WM_SETTEXT, and marshals the string across to the other application's address space.

Neil
Lex Trotman
2014-05-26 00:58:01 UTC
Post by Charly Dante
char *buffer = "teststringüü";
SendMessage(scite_hwnd, WM_SETTEXT, 0, (LPARAM)buffer);
This simply doesn't work. The text in the Scite Control is set to
"teststring" and then followed by two black boxes containing
"xFC" and not the desired chars "üü". I also tried SCI_SETTEXT (= 2181), but
FC is not the UTF-8 encoding of the character ü; it's the Unicode code
point value (U+00FC). The UTF-8 encoding is the two bytes C3 BC. Is your
locale UTF-8?
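
The difference between a code point and its UTF-8 bytes can be made concrete with a small encoder. This is a minimal sketch, not code from Scintilla: `EncodeUtf8` is a hypothetical helper, and it assumes the input is a valid Unicode scalar value (no surrogate or range checks).

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string EncodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00FC the two-byte branch applies: the top bits give 0xC3 and the low six bits give 0xBC, which is the C3 BC pair mentioned above.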

Cheers
Lex