Issues with longish case insensitive regexps

Discussion:

Chris Emerson

2014-02-13 22:20:30 UTC

Hi,

I noticed that case-insensitive regular expressions stop working at around
60 characters. I believe this is because in the "NFA", each character turns
into a character set (for the two cases), which takes 33 bytes or so (256
bits plus opcode); the overall NFA limit is 2048 bytes.

It looks like it wouldn't be too hard to add a new NFA opcode to represent
a sparse character set, say with one length byte followed by N bytes, which
could be used whenever that's smaller than the bitset (or < some limit).

Would that kind of change be accepted? I'm aware of the recommendation to
use a separate regexp engine for serious use.

Regards,

Chris Emerson

--
You received this message because you are subscribed to the Google Groups "scintilla-interest" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scintilla-interest+***@googlegroups.com.
To post to this group, send email to scintilla-***@googlegroups.com.
Visit this group at http://groups.google.com/group/scintilla-interest.
For more options, visit https://groups.google.com/groups/opt_out.

Neil Hodgson

2014-02-13 23:51:31 UTC

Permalink

Post by Chris Emerson
It looks like it wouldn't be too hard to add a new NFA opcode to represent
a sparse character set, say with one length byte followed by N bytes, which
could be used whenever that's smaller than the bitset (or < some limit).

The common case will be ASCII letters which will have a bit set somewhere between 97 and 122 so will be 13 to 16 bytes. +1 for the length so 14-17. So this will approximately double the length that works but at considerable added complexity and potential for bugs.

May be better to allocate and size nfa dynamically. Something simple like ensuring at the start of Compile that it is at least (BITBLK+1)*length + fiddle factor (enough to counter safety margin in mpMax calculation). Any patch that does this should ensure that normal usage doesn't cause constant reallocation.

Neil

Chris Emerson

2014-02-14 00:21:37 UTC

Permalink

Hi,

Post by Neil Hodgson

Post by Chris Emerson
It looks like it wouldn't be too hard to add a new NFA opcode to
represent a sparse character set, say with one length byte followed by N
bytes, which could be used whenever that's smaller than the bitset (or <
some limit).

I was a bit ambiguous. I meant rather than store a shortened bitfield, just
store the values themselves. So a case insensitive 'A' would be 4 bytes:
opcode/2/'a'/'A'

The downside is that you have to scan the list to look for the character
you're comparing against.

Post by Neil Hodgson
May be better to allocate and size nfa dynamically. Something simple like
ensuring at the start of Compile that it is at least (BITBLK+1)*length +
fiddle factor (enough to counter safety margin in mpMax calculation). Any
patch that does this should ensure that normal usage doesn't cause constant
reallocation.

Allocating dynamically would be good, and get rid of the limit entirely (but
possibly allocate a largish buffer). The two obvious strategies are to
allocate the worst case size, as you say, or start with a reasonable one
and double as needed when hitting pathological cases.

This could be combined with the new opcode to keep the typical size smaller.

Regards,

Chris

Neil Hodgson

2014-02-15 02:43:30 UTC

Permalink

Post by Chris Emerson
This could be combined with the new opcode to keep the typical size smaller.

This is the first report of this issue since the size of bitsets was increased to 32 bytes 11 years ago which halved the maximum length of case-insensitive search strings. Fixing an uncommon issue should take extreme care to avoid adding bugs by minimising the size and effects of the fix.

Neil