what should egrep '|root' print? (syntax/semantics)

Kenneth Almquist ka at june.cs.washington.edu
Tue Sep 27 19:12:06 AEST 1988


henry at utzoo.uucp (Henry Spencer) writes:
> Well, personally, I'd dearly love to be able to use (| and |) as metasymbols,
> since (a) one highly desirable extension to my regexp package would be the
> beginning/end-of-identifier metasymbols found in many implementations,
> (b) I am deeply opposed to declaring more unbackslashed characters to be
> metasymbols, and (c) I am even more deeply opposed to declaring *any*
> backslashed characters to be metasymbols.  There are other possibilities,
> exploiting sequences that are syntax errors at the moment, but none of
> them is nearly as pretty.  (Not a trivial issue, given that users have to
> remember whatever sequence gets chosen.)  Alas, I am also sympathetic
> to the argument that (1) it would be an unfortunate inconsistency, and
> (2) programs that generate regexps might have to go out of their way to
> avoid generating these magic sequences.  Argh.  Any thoughts?

My solution (when I faced this problem a long time ago) was to make an
asterisk at the start of a regular expression require that the string
matched not be preceded or followed by a character which can appear in
a word.  The arguments pro and con seem to be:

1)  Word beginning and ending patterns are more flexible.  Can anyone come
    up with a use for this flexibility?  I can't.

2)  The asterisk convention is easier to type.

3)  The asterisk convention is easy to explain to a beginner on an intuitive
    level ("Place an asterisk in front of the expression to search for a
    word"), although a complete explanation of the semantics is about as
    complicated for either convention.

4)  Even after the user learns the word begin and end operators, the user
    still has to type two operators to get a word search, which increases
    the cognitive complexity compared to typing one operator to get a word
    search.

5)  Neither syntax is intuitively obvious, but (| and |) do have intuitively
    obvious interpretations (each consists of a parenthesis and a '|'
    operator) which differ from the interpretation that Henry suggests for
    them.
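
To make the asterisk convention concrete, here is a rough sketch in Python
rather than in any real egrep or regexp package.  The names compile_pattern
and matching_lines are made up for illustration, and the definition of a
"word character" (letter, digit, or underscore) is an assumption.  A leading
asterisk has nothing to repeat, and Python's re module rejects it as an
error, so the sketch strips it and wraps the rest of the expression in
lookaround assertions.

```python
import re

# Assumption: a "word character" is a letter, digit, or underscore.
WORD_CHAR = r"[A-Za-z0-9_]"

def compile_pattern(pattern):
    """Compile a pattern; a leading '*' requests a word search (hypothetical)."""
    if pattern.startswith("*"):
        body = pattern[1:]
        # The matched string must not be preceded or followed by a
        # character which can appear in a word.
        return re.compile(r"(?<!%s)(?:%s)(?!%s)" % (WORD_CHAR, body, WORD_CHAR))
    return re.compile(pattern)

def matching_lines(pattern, lines):
    """Return the lines that contain a match, the way egrep selects lines."""
    rx = compile_pattern(pattern)
    return [line for line in lines if rx.search(line)]

lines = ["root:x:0:0", "chroot /mnt", "rootkit found", "uprooted tree"]
print(matching_lines("*root", lines))   # only the first line
print(matching_lines("root", lines))    # all four lines
```

With the asterisk, only "root:x:0:0" is selected; without it, all four lines
are, since each contains "root" as a substring.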

The basic problem with the word beginning and ending patterns is that they
are at the wrong level.  If they are *only* used as building blocks to build
word searches, then a higher level feature like the asterisk convention
which allows users to request word searches directly is a better choice.
And they are too high level to be used for much else besides constructing
word searches.  The rare cases where they are used for something else (if
such cases exist) can be handled by lower level features from which word
beginning and ending patterns can be constructed.  I expect that Henry's
regexp package (like egrep) already has the required features.
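
As a sketch of what "lower level features" might look like, the following
builds a word search out of nothing but alternation, line anchors, and a
negated character class, all of which egrep-style expressions provide.  It is
written with Python's re module for convenience; the helper name word_search
and the choice of word characters are assumptions, not anything taken from
Henry's package.

```python
import re

# Assumption: anything other than a letter, digit, or underscore ends a word.
NON_WORD = r"[^A-Za-z0-9_]"

def word_search(word_pattern):
    """Build a word search from anchors, alternation, and a character class."""
    # (^|non-word) stands in for "beginning of word";
    # (non-word|$) stands in for "end of word".
    return re.compile(r"(^|%s)(%s)(%s|$)" % (NON_WORD, word_pattern, NON_WORD))

rx = word_search("root")
for line in ["root:x:0:0", "su root", "chroot /mnt", "rootkit found"]:
    print("%-15s %s" % (line, "match" if rx.search(line) else "no match"))
```

One caveat with this construction: the neighboring non-word characters are
consumed as part of the match.  That is harmless when the expression is only
used to select lines, as egrep does, but it matters if the matched text is
to be extracted or substituted (group 2 holds the word itself here).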

In conclusion, I believe that including the (| and |) operators in a regular
expression package is a poor idea on two grounds.  The semantics are wrong;
if word searches are desired there are better ways to provide them, such as
the asterisk convention.  And (| and |) are a lousy choice of operators,
for reasons which Henry notes in his article, while the asterisk convention
has no such problems.
				Kenneth Almquist



More information about the Comp.unix.wizards mailing list