Neil Hodgson
2013-07-11 02:42:08 UTC
Scintilla's case-insensitive search and upper and lower casing methods have behaved differently on the different supported platforms. To ensure all platforms behave in compliance with Unicode standards for Unicode text, some new files have been added: CaseConvert and CaseFolder.
CaseConvert contains Unicode's case conversion tables in a compressed form which (including code) adds about 7K to Scintilla. These tables are expanded into 3 case conversion objects (folding, upper, lower) when required and each of these takes around 14K.
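To make the compressed-then-expanded idea concrete, here is a minimal sketch. It is not the actual CaseConvert code: the CaseRange layout, the sample data and the SketchCaseConverter class are all hypothetical names invented for illustration. It assumes the compact form is a list of ranges and that the expanded form is a sorted array searched with a binary search.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical compact form: 'length' characters starting at 'first',
// spaced 'stride' apart, each converting to the character 'delta' away.
struct CaseRange { char32_t first; int length; int stride; int delta; };

// Expanded form: one entry per convertible character.
struct CasePair { char32_t from; char32_t to; };

// Illustrative sample only, a tiny fraction of the real Unicode data.
const std::vector<CaseRange> sampleLowerRanges = {
    {U'A', 26, 1, 32},     // Basic Latin A..Z -> a..z
    {0x0391, 17, 1, 32},   // Greek Alpha..Rho -> alpha..rho
    {0x0100, 24, 2, 1},    // Latin Extended-A: even U+0100..U+012E -> next odd
};

class SketchCaseConverter {
    std::vector<CasePair> pairs;
public:
    // Expand the ranges into individual pairs, then sort by source character
    // so that lookups can use a binary search.
    explicit SketchCaseConverter(const std::vector<CaseRange> &ranges) {
        for (const CaseRange &r : ranges) {
            for (int i = 0; i < r.length; i++) {
                const char32_t from = static_cast<char32_t>(r.first + i * r.stride);
                pairs.push_back({from, static_cast<char32_t>(from + r.delta)});
            }
        }
        std::sort(pairs.begin(), pairs.end(),
            [](const CasePair &a, const CasePair &b) { return a.from < b.from; });
    }
    // Return the converted character, or ch itself when there is no mapping.
    char32_t Convert(char32_t ch) const {
        const auto it = std::lower_bound(pairs.begin(), pairs.end(), ch,
            [](const CasePair &p, char32_t value) { return p.from < value; });
        return (it != pairs.end() && it->from == ch) ? it->to : ch;
    }
    std::size_t Size() const { return pairs.size(); }
};
```

In this sketch, three small range records expand into 67 pairs. The object would be constructed on first use, so a program that never changes case would pay only for the compact data.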
The conversions performed are generic and not locale or context aware. I'll accept good patches that implement support for particular locales and that do not impact performance of the generic case. Locale and context sensitive variations are described for Turkish, Azeri, Greek and Lithuanian at the end of
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
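As an illustration of what a locale-aware patch might have to handle, here is a rough sketch of the Turkish and Azeri dotted and dotless I rule layered over a generic conversion. The names CaseLocale, LowerGeneric and LowerCase are hypothetical and not part of Scintilla. The generic function covers only ASCII to keep the example short. The tailoring is also simplified: it ignores the SpecialCasing.txt context condition for I followed by a combining dot above.

```cpp
enum class CaseLocale { generic, turkic };

// Stand-in for the generic, locale-independent conversion (ASCII only here).
inline char32_t LowerGeneric(char32_t ch) {
    return (ch >= U'A' && ch <= U'Z') ? static_cast<char32_t>(ch + 32) : ch;
}

// Apply the Turkic tailoring first, then fall through to the generic case.
// The generic path costs one extra comparison when no locale is selected.
inline char32_t LowerCase(char32_t ch, CaseLocale locale) {
    if (locale == CaseLocale::turkic) {
        if (ch == U'I')
            return 0x0131;   // LATIN SMALL LETTER DOTLESS I
        if (ch == 0x0130)    // LATIN CAPITAL LETTER I WITH DOT ABOVE
            return U'i';
    }
    return LowerGeneric(ch);
}
```

With the generic locale, 'I' lowers to 'i' as usual. With the Turkic locale it lowers to U+0131, which is why a single generic table cannot satisfy both groups of users.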
CaseFolder collects case folding code that was previously in Document and adds CaseFolderUnicode, which is used on each platform for case-insensitive searching. Since it uses CaseConvert, there are no calls to the platform and searching is much faster: about twice as fast when searching a test document with roughly similar amounts of ASCII and Japanese text.
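The principle behind folding-based search can be sketched as follows: fold both the text and the pattern with the same in-process function, then compare the folded bytes, so no platform API is called per character. FoldASCII and FindCaseInsensitive are hypothetical helpers and not the CaseFolder interface. To keep the example short, the fold covers only ASCII and leaves UTF-8 lead and trail bytes untouched, where the real code would use the Unicode tables.

```cpp
#include <string>

// Stand-in fold: lower-cases ASCII letters and copies all other bytes
// unchanged, so multi-byte UTF-8 sequences pass through intact.
inline std::string FoldASCII(const std::string &mixed) {
    std::string folded = mixed;
    for (char &ch : folded) {
        if (ch >= 'A' && ch <= 'Z')
            ch = static_cast<char>(ch - 'A' + 'a');
    }
    return folded;
}

// Case-insensitive find: fold both sides, then do a plain byte search.
// The returned byte offset is valid in the original text only because this
// particular fold never changes the length of the string.
inline std::string::size_type FindCaseInsensitive(const std::string &text,
                                                  const std::string &pattern) {
    return FoldASCII(text).find(FoldASCII(pattern));
}
```

A full Unicode fold can change the byte length of some characters, so a real implementation would need to map match positions back to the unfolded document rather than reuse the offset directly.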
There are time-space tradeoffs with this code. For example, searching was 10% faster when an unordered_map was tried but that more than doubled memory use.
The changes only affect UTF-8 documents except on Windows where there were some changes for other encodings. Moving case conversion for non-UTF-8 encodings from platform layers into generic code would standardise behaviour and simplify platform layers. Adding encoding tables to enable this would expand the size of Scintilla considerably. The tables for 8-bit encodings are small but adding tables for the 5 supported Asian DBCS encodings may more than double Scintilla's executable size. Support for GB18030, which has been requested (and is a legal compliance issue in China), may be even larger as it encodes all of Unicode.
Platform maintainers may wish to switch their implementations of case folding and case conversion in a similar way. See the changes to ScintillaGTK.cxx in the main change set as an example.
https://sourceforge.net/p/scintilla/code/ci/8235df9a162916daf4f3c3d01b309d9216968e96/
Neil
--
You received this message because you are subscribed to the Google Groups "scintilla-interest" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scintilla-interest+***@googlegroups.com.
To post to this group, send email to scintilla-***@googlegroups.com.
Visit this group at http://groups.google.com/group/scintilla-interest.
For more options, visit https://groups.google.com/groups/opt_out.