ambiguous match with multiple-character collating elements

Henry Spencer henry at zoo.toronto.edu
Sun Sep 9 14:39:29 AEST 1990


From: henry at zoo.toronto.edu (Henry Spencer)

In article <487 at usenix.ORG> karl at IMA.ISC.COM (Karl Heuer) writes:
>In an environment where the digraph "ch" collates as a single element, what
>happens if an attempt is made to match the subject string "chi" with the
>pattern "[c[.ch.]]i" or "[c[.ch.]]hi"?  Is the implementation required to
>report a successful match in both cases?  If so, it would seem necessary to
>use a nondeterministic finite automaton or equivalent, thus making simple
>regexp matching and filename globbing as complex as egrep pattern matching.

Looking at draft 10, I don't think there is much doubt that they both must
match, assuming those are legal regular expressions.  (If "c" is not a
collating element or noncollating character, they're not.)  If both "c"
and "ch" are valid collating elements, the bracket expression must be
prepared to match either.

The wording could stand improving.

As for the implementation aspects, yes, this is a pain.  However, there
is basically no such thing as "simple" regexp matching. :-)  The extra
complexity added by multicharacter collating elements, while annoying,
is not that big a deal.  I think Karl is confused.  *Any* non-trivial
regexp matching ends up using either nondeterministic or deterministic
automata, sometimes behind clever plastic disguises.  The very simplest
forms, like globbing, sometimes can get away without having to compile
the regexp string into an internal form, by running a nondeterministic
automaton directly from the regexp string.  That will get a bit harder
because of the greater complexity of 1003.2 regexps.  However, there
is no way that even "simple" regexp matching (I assume Karl is thinking
of things like ed) is viable without a compilation step.

Given that 1003.2 defines -- finally! -- library functions for regexp
work of various kinds, including globbing, the complexity will, in any
case, be localized to library functions.  The programmer won't have to
worry about it unless he's actually writing those library functions.
(*That* won't be fun.)
-- 
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday|  henry at zoo.toronto.edu   utzoo!henry

Volume-Number: Volume 21, Number 95



More information about the Comp.std.unix mailing list