Unicode case conversion

Discussion:

Neil Hodgson

2013-07-11 02:42:08 UTC

Scintilla's case-insensitive search and upper and lower casing methods have behaved differently on the different supported platforms. To ensure all platforms behave in compliance with Unicode standards for Unicode text, some new files have been added: CaseConvert and CaseFolder.

CaseConvert contains Unicode's case conversion tables in a compressed form which (including code) adds about 7K to Scintilla. These tables are expanded into 3 case conversion objects (folding, upper, lower) when required and each of these takes around 14K.

The conversions performed are generic and not locale or context aware. I'll accept good patches that implement support for particular locales and that do not impact performance of the generic case. Locale and context sensitive variations are described for Turkish, Azeri, Greek and Lithuanian at the end of
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
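
As a rough illustration of what "generic and not locale or context aware" means, the sketch below uses Python's built-in string methods, which are also locale-independent. It shows Python's behaviour only and is not taken from Scintilla's CaseConvert; the helper name `describe` is made up for this example.

```python
def describe(ch):
    """Return the locale-independent upper, lower and folded forms of ch."""
    return {
        "upper": ch.upper(),
        "lower": ch.lower(),
        "fold": ch.casefold(),
    }

# Generic mapping: "i" upper-cases to plain "I" and "I" lower-cases to "i".
# A Turkish or Azeri aware conversion would instead pair "i" with
# U+0130 (dotted capital I) and "I" with U+0131 (dotless small i).
print(describe("i"))
print(describe("I"))

# Some mappings change length: sharp s upper-cases to two letters.
print(describe("\u00df"))

# U+0130 lower-cases to "i" followed by U+0307 (combining dot above).
print([hex(ord(c)) for c in describe("\u0130")["lower"]])
```

A locale-aware variant would need extra rules on top of this, which is what the SpecialCasing.txt entries describe.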

CaseFolder collects case folding code that was previously in Document and adds CaseFolderUnicode which is used on each platform for case-insensitive searching. Since it uses CaseConvert, there are no calls to the platform and searching is much faster - about twice as fast searching a test document with roughly similar amounts of ASCII and Japanese text.
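
A minimal sketch of case-insensitive search by folding, assuming Python's `str.casefold` as a stand-in for the folding tables. The function name `find_case_insensitive` is hypothetical and this is not how CaseFolderUnicode is written; it only shows why the match has to be mapped back to positions in the original text when folding changes lengths.

```python
def find_case_insensitive(text, pattern):
    """Return (start, end) of the first match in text, or None.

    Both sides are case folded. Folding is done per character on the
    text so the result is expressed in original text positions even
    when a character folds to more than one character.
    """
    target = pattern.casefold()
    if not target:
        return (0, 0)
    for start in range(len(text)):
        folded = ""
        for end in range(start, len(text)):
            folded += text[end].casefold()
            if len(folded) >= len(target):
                break
        if folded == target:
            return (start, end + 1)
    return None

print(find_case_insensitive("Die STRASSE", "stra\u00dfe"))
print(find_case_insensitive("Die Stra\u00dfe", "STRASSE"))
```

This naive version refolds text at every start position, so it is far slower than a table-driven folder; it is meant only to show the idea.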

There are time-space tradeoffs with this code. For example, searching was 10% faster when an unordered_map was tried but that more than doubled memory use.
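
The shape of that tradeoff can be sketched as two lookups over the same data: a pair of sorted arrays searched by bisection, and a hash map. This is an illustration in Python under assumed names (`build_fold_table`, `fold_sorted`, `fold_dict`); it does not reproduce Scintilla's data structures or the timing and memory figures above.

```python
from bisect import bisect_left

def build_fold_table(limit=0x3000):
    """Collect code points below limit whose folded form differs."""
    keys, values = [], []
    for cp in range(limit):
        ch = chr(cp)
        folded = ch.casefold()
        if folded != ch:
            keys.append(cp)       # appended in ascending order, so sorted
            values.append(folded)
    return keys, values

KEYS, VALUES = build_fold_table()
FOLD_DICT = dict(zip(KEYS, VALUES))

def fold_sorted(ch):
    """Compact layout: binary search over the sorted key array."""
    cp = ord(ch)
    i = bisect_left(KEYS, cp)
    if i < len(KEYS) and KEYS[i] == cp:
        return VALUES[i]
    return ch

def fold_dict(ch):
    """Hash map layout: constant-time lookup, more memory per entry."""
    return FOLD_DICT.get(ord(ch), ch)

print(fold_sorted("A"), fold_dict("A"))
```

Both functions return the same results; only the lookup cost and the storage overhead differ.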

The changes only affect UTF-8 documents except on Windows where there were some changes for other encodings. Moving case conversion for non-UTF8 encodings from platform layers into generic code would standardise behaviour and simplify platform layers. Adding encoding tables to enable this would expand the size of Scintilla considerably. The tables for 8-bit encodings are small but adding tables for the 5 supported Asian DBCS encodings may more than double Scintilla's executable size. Support for GB18030, which has been requested (and is a legal compliance issue in China), may be even larger as it encodes all of Unicode.

Platform maintainers may wish to switch their implementations of case folding and case conversion in a similar way. See the changes to ScintillaGTK.cxx in the main change set as an example.
https://sourceforge.net/p/scintilla/code/ci/8235df9a162916daf4f3c3d01b309d9216968e96/

Neil

--
You received this message because you are subscribed to the Google Groups "scintilla-interest" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scintilla-interest+***@googlegroups.com.
To post to this group, send email to scintilla-***@googlegroups.com.
Visit this group at http://groups.google.com/group/scintilla-interest.
For more options, visit https://groups.google.com/groups/opt_out.

Mike Lischke

2013-07-11 07:05:33 UTC

Neil,

Post by Neil Hodgson
Scintilla's case-insensitive search and upper and lower casing methods have behaved differently on the different supported platforms. To ensure all platforms behave in compliance with Unicode standards for Unicode text, some new files have been added: CaseConvert and CaseFolder.
CaseConvert contains Unicode's case conversion tables in a compressed form which (including code) adds about 7K to Scintilla. These tables are expanded into 3 case conversion objects (folding, upper, lower) when required and each of these takes around 14K.

I guess using glib is not an option for Scintilla. It has everything you would need in that regard (normalization, case folding, character class determination etc.).

Post by Neil Hodgson
The changes only affect UTF-8 documents except on Windows where there were some changes for other encodings. Moving case conversion for non-UTF8 encodings from platform layers into generic code would standardise behaviour and simplify platform layers. Adding encoding tables to enable this would expand the size of Scintilla considerably. The tables for 8-bit encodings are small but adding tables for the 5 supported Asian DBCS encodings may more than double Scintilla's executable size. Support for GB18030, which has been requested (and is a legal compliance issue in China), may be even larger as it encodes all of Unicode.

I'm sorry, but I don't understand why you still keep that burden of supporting old, error-prone encodings instead of moving completely to Unicode (having converters for documents in such encodings would be fine, but internally I'd make it Unicode only).

Mike

--
www.soft-gems.net

Neil Hodgson

2013-07-12 01:45:21 UTC

Post by Mike Lischke
I guess using glib is not an option for Scintilla. It has everything you would need in that regard (normalization, case folding, character class determination etc.).

Licensing and packaging are problematic with glib. It's LGPL so static linking it into a Scintilla-based application would require the application be available in relinkable form. Using dynamic linking would mean distributing multiple executable files whereas applications can currently be distributed as single files like the Sc1 version of SciTE.

ICU has more compatible licensing but is large (around 24 MB for all the DLLs on Windows) and defining a subset to static link would require some work.

Post by Mike Lischke
I'm sorry, but I don't understand why you still keep that burden of supporting old, error-prone encodings instead of moving completely to Unicode (having converters for documents in such encodings would be fine, but internally I'd make it Unicode only).

While it's difficult to know, I suspect Latin-1 based files (mostly Windows-1252) still greatly outnumber UTF-8 files. Converting at the I/O edge moves outside-encoding detection and fixing to file save time, when the user is less likely to remember the context of the unsaveable characters. Some affordances should be added to Scintilla, such as highlighting outside-encoding characters, but that would require more implementation effort and whitelist/blacklist data that is similar to implementing encodings.

Removing encoding support would decrease compatibility greatly and cause downstream projects to fail. If I was starting Scintilla today, I might consider a Unicode-only core but it's not worth the cost to switch now.

Neil


Lex Trotman

2013-07-12 02:17:19 UTC

Post by Neil Hodgson

Post by Mike Lischke
I guess using glib is not an option for Scintilla. It has everything you would need in that regard (normalization, case folding, character class determination etc.).

Licensing and packaging are problematic with glib. It's LGPL so static linking it into a Scintilla-based application would require the application be available in relinkable form. Using dynamic linking would mean distributing multiple executable files whereas applications can currently be distributed as single files like the Sc1 version of SciTE.

ICU has more compatible licensing but is large (around 24 MB for all the DLLs on Windows) and defining a subset to static link would require some work.

Post by Neil Hodgson

Post by Mike Lischke
I'm sorry, but I don't understand why you still keep that burden of supporting old, error-prone encodings instead of moving completely to Unicode (having converters for documents in such encodings would be fine, but internally I'd make it Unicode only).

While it's difficult to know, I suspect Latin-1 based files (mostly Windows-1252) still greatly outnumber UTF-8 files. Converting at the I/O edge moves outside-encoding detection and fixing to file save time, when the user is less likely to remember the context of the unsaveable characters. Some affordances should be added to Scintilla, such as highlighting outside-encoding characters, but that would require more implementation effort and whitelist/blacklist data that is similar to implementing encodings.

The experience with Geany (which always has the Scintilla buffer in UTF-8)
is that detecting encodings of existing files is problematic, encodings
overlap, and a file may detect correctly as several encodings.
In the end a manual selection had to be offered so the user can reload and
correct an erroneous "guess" of the encoding. That could not be included
inside Scintilla. Note locale is useless, some users switch languages (and
so locale encodings) regularly and users can access shared fileservers
using machines with differing locales.
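
The overlap is easy to reproduce. The sketch below, using only Python's standard codecs, tries a list of candidate encodings on the same bytes and keeps every one that decodes without error. The function name `plausible_encodings` is invented for this example and is not Geany's detection code.

```python
def plausible_encodings(data, candidates):
    """Return {encoding: decoded text} for every candidate that decodes."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # not valid in this encoding, so rule it out
    return results

# The single byte 0xE9 is invalid as UTF-8 but valid in many 8-bit
# encodings, where it stands for different characters.
data = b"caf\xe9"
for name, text in plausible_encodings(
        data, ["utf-8", "iso-8859-1", "iso-8859-2", "windows-1251"]).items():
    print(name, repr(text))
```

Three of the four candidates accept the bytes, and nothing in the data says which one the author meant, so the user has to be able to override the guess.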

Cheers
Lex

Post by Neil Hodgson
Removing encoding support would decrease compatibility greatly and cause downstream projects to fail. If I was starting Scintilla today, I might consider a Unicode-only core but it's not worth the cost to switch now.


Mike Lischke

2013-07-12 06:55:00 UTC

Lex,

The experience with Geany (which always has the Scintilla buffer in UTF-8) is that detecting encodings of existing files is problematic, encodings overlap, and a file may detect correctly as several encodings.
In the end a manual selection had to be offered so the user can reload and correct an erroneous "guess" of the encoding. That could not be included inside Scintilla. Note locale is useless, some users switch languages (and so locale encodings) regularly and users can access shared fileservers using machines with differing locales.

Very similar to what we do in our app (MySQL Workbench). Scintilla is always set to UTF-8 and LF line endings, regardless where it runs. This simplifies text processing a lot. The encoding is not the business of an editor control. These are two different layers. I would not even put any encoding converter into the editor control. By separating storage and presentation more strictly one could avoid a lot of work.

Guessing an encoding is indeed a difficult task and often impossible. That's what I mean by "encodings are error prone" (code page based encodings share code points).

Mike

--
www.soft-gems.net

Neil Hodgson

2013-07-12 07:40:34 UTC

Post by Mike Lischke
Very similar to what we do in our app (MySQL Workbench). Scintilla is always set to UTF-8 and LF line endings, regardless where it runs. This simplifies text processing a lot. The encoding is not the business of an editor control. These are two different layers. I would not even put any encoding converter into the editor control. By separating storage and presentation more strictly one could avoid a lot of work.

What do you do when characters are entered that can not be saved into the file's encoding?

Neil


Mike Lischke

2013-07-12 09:13:17 UTC

Post by Neil Hodgson

Post by Mike Lischke
Very similar to what we do in our app (MySQL Workbench). Scintilla is always set to UTF-8 and LF line endings, regardless where it runs. This simplifies text processing a lot. The encoding is not the business of an editor control. These are two different layers. I would not even put any encoding converter into the editor control. By separating storage and presentation more strictly one could avoid a lot of work.

What do you do when characters are entered that can not be saved into the file's encoding?

The application has to warn the user about this and ask if the file should be saved in an encoding that includes these characters. Otherwise (depending on the application's strategy) the text cannot be saved at all or the text is converted and the invalid characters are replaced with valid ones.
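
A save-time check along those lines could look like the sketch below. It is an assumption about how an application might do it, written with Python's standard codecs. The function names are hypothetical and nothing here is a Scintilla or MySQL Workbench API.

```python
def unsaveable_characters(text, encoding):
    """List (index, character) pairs that cannot be encoded in encoding."""
    bad = []
    for index, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, ch))
    return bad

def encode_with_replacement(text, encoding):
    """Fallback strategy: substitute '?' for anything that does not fit."""
    return text.encode(encoding, errors="replace")

text = "\u201chi\u201d caf\u00e9"
problems = unsaveable_characters(text, "iso-8859-1")
if problems:
    # An application would warn here and offer a different encoding.
    print("cannot save:", problems)
print(encode_with_replacement(text, "iso-8859-1"))
```

The list of positions is what lets the application show the user where the problem characters are instead of only reporting that the save failed.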

That's a process I have seen in several other applications before and believe this is a good approach. It also pushes people a bit towards Unicode where they have been already 10 years ago.

Mike

--
www.soft-gems.net

Mike Lischke

2013-07-12 09:19:04 UTC

Post by Mike Lischke
That's a process I have seen in several other applications before and believe this is a good approach. It also pushes people a bit towards Unicode where they have been already 10 years ago.

... where they *should* have been...

sorry

Mike

--
www.soft-gems.net

Matthew Brush

2013-07-12 04:02:41 UTC

Post by Neil Hodgson

Post by Mike Lischke
I guess using glib is not an option for Scintilla. It has
everything you would need in that regard (normalization, case
folding, character class determination etc.).

Licensing and packaging are problematic with glib. It's LGPL so static
linking it into a Scintilla-based application would require the
application be available in relinkable form. Using dynamic linking
would mean distributing multiple executable files whereas
applications can currently be distributed as single files like the
Sc1 version of SciTE.
ICU has more compatible licensing but is large (around 24 MB for all
the DLLs on Windows) and defining a subset to static link would
require some work.

Would there be any way to make it part of the platform layer? It seems
like most toolkits/platforms already provide this stuff, like the
mentioned GLib for GTK+, in QString for Qt, in CFMutableString for
Cocoa/CoreFoundation, and in NLS support functions for Win32. Just a
thought.

Cheers,
Matthew Brush


Neil Hodgson

2013-07-12 06:53:56 UTC

Would there be any way to make it part of the platform layer? It seems like most toolkits/platforms already provide this stuff, like the mentioned GLib for GTK+, in QString for Qt, in CFMutableString for Cocoa/CoreFoundation, and in NLS support functions for Win32.

That was what was done before this change. The platforms produced different results. The results from the Win32 calls were different to those defined by the Unicode standard.

Case-insensitive Unicode search performs many string folding operations. Platform calls generally allocated memory each time which can be quite slow.

Neil


Mike Lischke

2013-07-12 06:57:17 UTC

Would there be any way to make it part of the platform layer? It seems like most toolkits/platforms already provide this stuff, like the mentioned GLib for GTK+, in QString for Qt, in CFMutableString for Cocoa/CoreFoundation, and in NLS support functions for Win32. Just a thought.

The main problem is probably the internal representation, not so much the Unicode specific parts (like case folding). The used encoding determines in many places how byte sequences are to be treated etc. This cannot be moved to the platform layers.

Mike

--
www.soft-gems.net

Jason Haslam

2013-07-12 06:01:13 UTC

Post by Neil Hodgson

Post by Mike Lischke
I'm sorry, but I don't understand why you still keep that burden of supporting old, error-prone encodings instead of moving completely to Unicode (having converters for documents in such encodings would be fine, but internally I'd make it Unicode only).

While it's difficult to know, I suspect Latin-1 based files (mostly Windows-1252) still greatly outnumber UTF-8 files. Converting at the I/O edge moves outside-encoding detection and fixing to file save time, when the user is less likely to remember the context of the unsaveable characters. Some affordances should be added to Scintilla, such as highlighting outside-encoding characters, but that would require more implementation effort and whitelist/blacklist data that is similar to implementing encodings.

I think that we had a similar discussion about the time that the Qt platform layer was going in. I don't see it in the mailing list archive so it must have been on a private thread. My perspective was that, even if it's always Unicode internally, the application still has to have some notion of which encoding the file is eventually going to be saved in. So each time new text is inserted the application can check to see if it can be represented in the current encoding and prompt the user to change it if not.

Post by Neil Hodgson
Removing encoding support would decrease compatibility greatly and cause downstream projects to fail. If I was starting Scintilla today, I might consider a Unicode-only core but it's not worth the cost to switch now.

Fair enough, but in my opinion, new users are better served by unconditionally setting the codepage to UTF-8 and handling the conversion in the application layer.

Jason


Neil Hodgson

2013-07-12 08:19:41 UTC

Post by Jason Haslam
My perspective was that, even if it's always Unicode internally, the application still has to have some notion of which encoding the file is eventually going to be saved in. So each time new text is inserted the application can check to see if it can be represented in the current encoding and prompt the user to change it if not.

There should be some support for helping applications deal with this. A first step could be to fire an event when text is being pasted, dropped, or typed into Scintilla so that the application can cancel any insertion and possibly insert a modified form. For example, an application could replace "curly quotes" with "normal quotes" when the file's encoding is ISO-8859-1. A refinement would fire these events only when characters outside some set were about to be inserted.
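
What an application might do inside such an event can be sketched as a plain function. The event itself does not exist yet, so the name `filter_insertion`, its return convention, and the replacement table are all assumptions made for this example rather than any Scintilla API.

```python
# Hypothetical replacement table: curly quotes to their plain forms.
PLAIN_QUOTES = str.maketrans({
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
})

def fits(text, encoding):
    """True when every character of text can be stored in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def filter_insertion(text, encoding):
    """Decide what to insert: the text, a modified form, or None to cancel."""
    if fits(text, encoding):
        return text                      # nothing to do
    candidate = text.translate(PLAIN_QUOTES)
    if fits(candidate, encoding):
        return candidate                 # insert the modified form
    return None                          # cancel; the application can prompt

print(filter_insertion("\u201cquoted\u201d", "iso-8859-1"))
```

The refinement of firing only for characters outside some set corresponds to calling this function only when `fits` would fail.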

Neil


Mike Lischke

2013-07-12 09:18:08 UTC

Post by Neil Hodgson

Post by Jason Haslam
My perspective was that, even if it's always Unicode internally, the application still has to have some notion of which encoding the file is eventually going to be saved in. So each time new text is inserted the application can check to see if it can be represented in the current encoding and prompt the user to change it if not.

There should be some support for helping applications deal with this. A first step could be to fire an event when text is being pasted, dropped, or typed into Scintilla so that the application can cancel any insertion and possibly insert a modified form. For example, an application could replace "curly quotes" with "normal quotes" when the file's encoding is ISO-8859-1. A refinement would fire these events only when characters outside some set were about to be inserted.

Neil, this all is completely unnecessary if internally you would only use Unicode. All the compatibility handling is pushed to the application's file handling (where it belongs!). The editor is the presentation layer. It's not its task to deal with encoding problems (which are by nature a storage problem). The user can insert text that doesn't match the file encoding, sure, but he can also remove it afterwards before saving takes place. So handling all this during editing is just adding complexity. The save action is the place to handle this in one go for all the text.

Mike

--
www.soft-gems.net

Neil Hodgson

2013-07-13 00:20:28 UTC

Post by Mike Lischke
Neil, this all is completely unnecessary if internally you would only use Unicode.

This is *for* using UTF-8 in Scintilla. Just like almost all of the new case conversion code is *for* using UTF-8 in Scintilla. Yet you complain as if these features were favouring the use of other encodings in Scintilla.

Post by Mike Lischke
All the compatibility handling is pushed to the application's file handling (where it belongs!). The editor is the presentation layer. It's not its task to deal with encoding problems (which are by nature a storage problem). The user can insert text that doesn't match the file encoding, sure, but he can also remove it afterwards before saving takes place. So handling all this during editing is just adding complexity. The save action is the place to handle this in one go for all the text.

Delaying fixes until save time makes for a messy experience closing the application with additional dialogues when you'll often just want to shut down quickly. Showing problems as they occur, as with syntax highlighting and squiggly underlines for spelling and syntax errors, allows them to be fixed in the context of their creation when the user is more likely to remember their intent.

Source code and markup files are streams of bytes when read from disk and return to disk as streams of bytes when saved. The resulting bytes should differ from the input only where the user or the application has decided to make a change. Scintilla should not be performing normalisations or other modifications unless asked.
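
A small demonstration of why silent normalisation matters for byte-for-byte fidelity, using Python's `unicodedata` module. The helper `roundtrip_utf8` is hypothetical and only stands in for "load, optionally normalise, save".

```python
import unicodedata

def roundtrip_utf8(data, normalise=None):
    """Decode UTF-8 bytes, optionally normalise, and encode again."""
    text = data.decode("utf-8")
    if normalise:
        text = unicodedata.normalize(normalise, text)
    return text.encode("utf-8")

original = "caf\u00e9".encode("utf-8")     # precomposed e-acute

# Without normalisation the bytes come back unchanged.
print(roundtrip_utf8(original) == original)

# With NFKD the accent is split off, so the saved bytes differ even
# though the user changed nothing.
print(roundtrip_utf8(original, "NFKD"))
```

A version control diff on the second result would show a change on every line with an accented character.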

You may think that all text should be valid UTF-8 with combining characters decomposed into NFKD form, CRLF line ends, no tabs in indentation, a terminating line end, and all instances of "Neil" correctly capitalised. Scintilla enables your application to choose to maintain these or other rules but Scintilla itself is not biased towards any particular rules. While it's possible to add restrictions on top of a generic byte stream, it is much more work to remove or change restrictions when they are hard-coded into a component.

Neil


Mike Lischke

2013-07-13 09:34:14 UTC

Hey Neil,

Post by Neil Hodgson

Post by Mike Lischke
Neil, this all is completely unnecessary if internally you would only use Unicode.

This is *for* using UTF-8 in Scintilla. Just like almost all of the new case conversion code is *for* using UTF-8 in Scintilla. Yet you complain as if these features were favouring the use of other encodings in Scintilla.

Yes, I'm aware of that. My objection was about having to add support for application notification for pasting text and similar operations, where you suggested letting the application replace invalid characters that do not fit into the current encoding. While such a replace feature might be handy for other tasks, this and the notifications would not be needed if encoding handling were pushed to the application layer.

Post by Neil Hodgson

Post by Mike Lischke
All the compatibility handling is pushed to the application's file handling (where it belongs!). The editor is the presentation layer. It's not its task to deal with encoding problems (which are by nature a storage problem). The user can insert text that doesn't match the file encoding, sure, but he can also remove it afterwards before saving takes place. So handling all this during editing is just adding complexity. The save action is the place to handle this in one go for all the text.

Delaying fixes until save time makes for a messy experience closing the application with additional dialogues when you'll often just want to shut down quickly.

But instead you want to bother the user while he just wants to type? That's certainly not better - and adds the overhead we are talking about. Additionally, you force the user to think about the file encoding at a time this is usually not relevant. The user might want to later change the encoding but gets requests from Scintilla for anything that doesn't fit the *current* encoding.

Post by Neil Hodgson
Showing problems as they occur, as with syntax highlighting and squiggly underlines for spelling and syntax errors, allows them to be fixed in the context of their creation when the user is more likely to remember their intent.

File encoding is a completely different thing. It should be handled once, not constantly while typing, and only when needed (the user could add a "bad" character but removes it afterwards, so no change to the encoding would be necessary).

Post by Neil Hodgson
Source code and markup files are streams of bytes when read from disk and return to disk as streams of bytes when saved. The resulting bytes should differ from the input only where the user or the application has decided to make a change. Scintilla should not be performing normalisations or other modifications unless asked.

Neil, all that is not only my personal opinion. Look for instance at Visual Studio. When you edit a cpp file encoded as ANSI and add text outside that range it is happily accepted without asking for any additional handling, confirmation and what not. But when you save the file you are asked to change the encoding. That's the least disturbing action.

Mike

--
www.soft-gems.net

Ferdinand Prantl

2013-07-13 12:03:50 UTC

An interesting discussion to follow.

Post by Mike Lischke

Post by Neil Hodgson
Delaying fixes until save time makes for a messy experience closing the application with additional dialogues when you'll often just want to shut down quickly.

But instead you want to bother the user while he just wants to type? That's certainly not better - and adds the overhead we are talking about. Additionally, you force the user to think about the file encoding at a time this is usually not relevant.

+1 for not bugging me with additional dialogs on save & close :-)

It might depend on how the application decides to handle those
notifications. Actually, I'd welcome a notification that what I'm pasting
is in a wrong encoding. Later I could have a mess in the editor. Remember
that Scintilla is not a Unicode-only editor. It's a pity that the input
encoding cannot be always 100% detected.

Post by Mike Lischke

Post by Neil Hodgson
Source code and markup files are streams of bytes when read from disk and return to disk as streams of bytes when saved. The resulting bytes should differ from the input only where the user or the application has decided to make a change. Scintilla should not be performing normalisations or other modifications unless asked.

Neil, all that is not only my personal opinion. Look for instance at Visual Studio. When you edit a cpp file encoded as ANSI and add text outside that range it is happily accepted without asking for any additional handling, confirmation and what not. But when you save the file you are asked to change the encoding. That's the least disturbing action.

Visual Studio (unless you turn it off) changes the source file encoding to
UTF-8 with BOM (!) after you paste in a text fragment with a non-ANSI
character, whether or not you later delete it. When saving the file I may
get notified, but I have no idea what change it was and just say "whatever,
I'm going to check the file in some real editor". It is quite annoying, and
my colleagues use Notepad++ (a Scintilla-based editor, as you know) to make
sure that the file stays nicely in ANSI - and I use SciTE for this ;-) Not
all tools work with Unicode sources and even fewer survive a meeting with
the BOM...

The advantage of not switching to Unicode is having the editing component
usable in non-Unicode applications. Unicode doesn't provide 100%
round-tripping conversion for them in general and some applications prefer
to stay with MBCS. Probably not so important for a programmer's editor,
but... Configurability and flexibility are good things for an editing
component. The application (editor) should set it up so that it serves the
user best.

--- Ferda


Philippe Lhoste

2013-07-15 15:22:24 UTC

Post by Ferdinand Prantl
Not all tools work
with Unicode sources and even fewer survive a meeting with the BOM...

Indeed! The UTF-8 BOM should be named BOOM, instead...

Notepad (not ++) can have this annoying behavior too.
--
Philippe Lhoste
-- (near) Paris -- France
-- http://Phi.Lho.free.fr
-- -- -- -- -- -- -- -- -- -- -- -- -- --


Randy Kramer

2013-08-26 00:27:06 UTC

Post by Ferdinand Prantl
An interesting discussion to follow.

Post by Mike Lischke

Post by Neil Hodgson
Delaying fixes until save time makes for a messy experience closing
the application with additional dialogues when you'll often just want to
shut down quickly.
But instead you want to bother the user while he just wants to type?
That's certainly not better - and adds the overhead we are talking about.
Additionally, you force the user to think about the file encoding at a
time this is usually not relevant.

+1 for not bugging me with additional dialogs on save & close :-)

+1

Post by Ferdinand Prantl
It might depend on how the application decides to handle those
notifications. Actually, I'd welcome a notification that what I'm pasting
is in the wrong encoding.

+1

Post by Ferdinand Prantl
Later I could have a mess in the editor.

+1

Post by Ferdinand Prantl
Remember
that Scintilla is not a Unicode-only editor. It's a pity that the input
encoding cannot always be detected with 100% certainty.

Post by Mike Lischke

Post by Neil Hodgson
Source code and markup files are streams of bytes when read from disk
and return to disk as streams of bytes when saved. The resulting bytes
should differ from the input only where the user or the application has
decided to make a change. Scintilla should not be performing
normalisations or other modifications unless asked.
Neil, all that is not only my personal opinion. Look for instance at
Visual Studio. When you edit a cpp file encoded as ANSI and add text
outside that range it is happily accepted without asking for any
additional handling, confirmation and what not. But when you save the
file you are asked to change the encoding. That's the least disturbing
action.

Visual Studio (unless you turn it off) changes the source file encoding to
UTF-8 with BOM (!) after you paste in a text fragment with a non-ANSI
character, whether or not you later delete it. When saving the file I may
get notified, but I have no idea what change it was and just say "whatever,
I'm going to check the file in some real editor". It is quite annoying, and
my colleagues use Notepad++ (a Scintilla-based editor, as you know) to make
sure that the file stays nicely in ANSI - and I use SciTE for this ;-) Not
all tools work with Unicode sources and even fewer survive a meeting with
the BOM...

+1

Post by Ferdinand Prantl
The advantage of not switching to Unicode is having the editing component
usable in non-Unicode applications. Unicode doesn't provide 100%
round-tripping conversion for them in general and some applications prefer
to stay with MBCS. Probably not so important for a programmer's editor,
but... Configurability and flexibility are good things for an editing
component. The application (editor) should set it up so that it serves the
user best.

+1

Randy Kramer


Mike Lischke

2013-07-12 06:48:19 UTC

Post by Neil Hodgson

Post by Mike Lischke
I'm sorry, but I don't understand why you still keep that burden of supporting old, error-prone encodings instead of moving completely to Unicode (having converters for documents in such encodings would be fine, but internally I'd make it Unicode only).

While it's difficult to know, I suspect Latin-1 based files (mostly Windows-1252) still greatly outnumber UTF-8 files. Converting at the I/O edge moves outside-encoding detection and fixing to file save time, when the user is less likely to remember the context of the unsaveable characters. Some affordances should be added to Scintilla, such as highlighting outside-encoding characters, but that would require more implementation effort and whitelist/blacklist data that is similar to implementing encodings.
Removing encoding support would decrease compatibility greatly and cause downstream projects to fail. If I were starting Scintilla today, I might consider a Unicode-only core, but it's not worth the cost to switch now.

You have of course more insight into Scintilla than I, but to me it appears as if all the extra code (just like the new case conversion stuff) is certainly more work than switching to a Unicode-only core and having this solved once and for all.

Mike

--
www.soft-gems.net