Unicode lexing
Neil Hodgson
2013-06-29 11:35:46 UTC
Bug #1483 reported a problem in some lexers that was introduced when StyleContext was changed to report whole characters in UTF-8 instead of each byte. The problem occurred because GetRelative worked in bytes while Forward worked in characters, so the two could disagree about which position they referred to.

Since lexers sometimes should work in bytes (for example when extracting an identifier to check against a list of keywords) and sometimes in characters, two new methods were added to StyleContext: GetRelativeCharacter works in characters (so matches Forward) and ForwardBytes works in bytes (matching GetRelative).
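To make the mismatch concrete, here is a toy sketch that does not use StyleContext at all. Utf8CharBytes and ForwardChars are hypothetical helpers written for this illustration; they only model the difference between "one byte ahead" and "one character ahead" in a UTF-8 buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Scintilla: number of bytes in the UTF-8
// character that starts with `lead`. Bytes that cannot start a character
// are counted as a single byte.
inline std::size_t Utf8CharBytes(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 1;
}

// Hypothetical helper: byte position reached by moving `nChars` whole
// characters forward from byte position `pos`, clamped to the end of text.
inline std::size_t ForwardChars(const std::string &text, std::size_t pos,
                                std::size_t nChars) {
    assert(pos <= text.size());
    while (nChars > 0 && pos < text.size()) {
        pos += Utf8CharBytes(static_cast<unsigned char>(text[pos]));
        --nChars;
    }
    return pos < text.size() ? pos : text.size();
}
```

With the text "é=1" encoded as the bytes C3 A9 3D 31, moving one character forward from the start lands on byte 2 ('='), but looking one byte ahead lands on byte 1 (0xA9, the second half of 'é'). A lexer that steps in characters and peeks in bytes would be looking at different places, which is the kind of mismatch the new methods are meant to avoid.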

Kein-Hong Man and I have checked and, where needed, changed some of the lexers to work with UTF-8 and DBCS. LexBash, LexFortran, LexLua, LexPython, LexPerl, and LexTCL should now be OK, with changes made to the Perl and Lua lexers.

Other lexers that use GetRelative should be checked by people familiar with those languages, in cases where the language is meant to work with UTF-8 or DBCS: LexCaml, LexFlagship, LexHaskell, LexMarkdown, LexModula, LexPS, LexPascal, LexRebol, LexSML, LexSmalltalk, LexTADS3, LexTxt2tags.

In many cases, the lexer is just checking for sequences of ASCII bytes so no changes are needed. For example, LexFortran uses GetRelative to find whether the byte '!' is located after a sequence of whitespace bytes (space, tab and vertical tab), which determines how to handle line continuations. When there is a problem, it should normally be solvable by choosing either bytes or characters and changing Forward to ForwardBytes or GetRelative to GetRelativeCharacter.
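As a rough illustration of why a pure ASCII check needs no change, here is a standalone sketch of the kind of test described for Fortran. It is not the code in LexFortran; CommentFollowsWhitespace is a hypothetical function operating on a plain std::string rather than on a StyleContext.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the check described above, working purely in
// bytes. Returns true when the first byte after a run of space, tab and
// vertical tab bytes starting at `pos` is '!'.
inline bool CommentFollowsWhitespace(const std::string &line,
                                     std::size_t pos) {
    while (pos < line.size() &&
           (line[pos] == ' ' || line[pos] == '\t' || line[pos] == '\v')) {
        ++pos;
    }
    return pos < line.size() && line[pos] == '!';
}
```

In UTF-8, bytes below 0x80 never appear inside a multi-byte character, so a byte-wise scan for space, tab, vertical tab and '!' cannot stop in the middle of a character, whatever non-ASCII text surrounds it. The only requirement is that `pos` is a byte position, consistent with how the scan itself counts.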

The code for moving between positions by character and retrieving character values is now provided by Document over the IDocumentWithLineEnd interface so that lexing and display work in the same way.

For invalid UTF-8, each byte is reported individually as an invalid value equal to the byte's value + 0xDC80. This prevents the lexer from being confused if it attaches meaning to the characters U+0080 .. U+00FF.
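The mapping can be sketched as below. This is a simplified decoder written for illustration, not Scintilla's implementation: InvalidByteValue and DecodeLenient are hypothetical names, and the validity check only looks at lead byte ranges and continuation bytes (it does not reject overlong 3 or 4 byte forms or encoded surrogates, which a full decoder would).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Value reported for a byte that is not part of valid UTF-8,
// following the rule stated above: the byte's value + 0xDC80.
inline int InvalidByteValue(unsigned char byte) {
    return 0xDC80 + byte;
}

// Hypothetical, simplified decoder: returns one value per valid character
// and one value per invalid byte.
inline std::vector<int> DecodeLenient(const std::string &text) {
    std::vector<int> out;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        int value = lead;
        if (lead >= 0xC2 && lead <= 0xDF) { len = 2; value = lead & 0x1F; }
        else if (lead >= 0xE0 && lead <= 0xEF) { len = 3; value = lead & 0x0F; }
        else if (lead >= 0xF0 && lead <= 0xF4) { len = 4; value = lead & 0x07; }
        // A single byte is valid only if it is ASCII.
        bool valid = lead < 0x80;
        if (len > 1) {
            // All continuation bytes must be present and of the form 10xxxxxx.
            valid = i + len <= text.size();
            for (std::size_t k = 1; valid && k < len; ++k) {
                const unsigned char trail =
                    static_cast<unsigned char>(text[i + k]);
                if ((trail & 0xC0) != 0x80) valid = false;
                else value = (value << 6) | (trail & 0x3F);
            }
        }
        if (valid) { out.push_back(value); i += len; }
        else { out.push_back(InvalidByteValue(lead)); i += 1; }
    }
    return out;
}
```

Under these assumptions, the valid sequence C3 A9 decodes to 0xE9 ('é'), while a stray Latin-1 style byte E9 is reported as 0xDD69. A lexer that gives 'é' a meaning therefore cannot mistake the stray byte for it.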

http://sourceforge.net/p/scintilla/bugs/1483/

Neil