Word-oriented GREP

Tom Christiansen tchrist at convex.COM
Tue Apr 30 00:05:47 AEST 1991


From the keyboard of flee at cs.psu.edu (Felix Lee):
:My point was, I don't want a word-oriented grep.  I don't want a
:line-oriented grep either.  I want a character-oriented grep, a grep
:that will just grab matching substrings from an arbitrary stream.
:And from this tool you can do word-oriented or line-oriented or
:whatever-oriented grepping.
:
:With the current line-oriented grep, you cannot search for a pattern
:that spans lines.  Say you want to find occurrences of "the dog" in a
:file, where the words can be separated by any whitespace, including
:newlines.  You cannot do this easily with existing tools.
	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Depends on your idea of easy, I guess.  If you don't mind loading the
whole file in memory to do your work, you can use perl to slurp the whole
stream into the pattern space and do pattern matches to your heart's
content -- ones that are beyond grep's wildest dreams, too.
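
To make the "the dog" case concrete, here is a minimal sketch of the
slurp-then-match idea, written in Python rather than perl.  The names
find_spanning and SAMPLE are made up for illustration, and the sample
text is invented; the only point is that \s+ also matches newlines once
the whole input is treated as one string.

```python
import re

# "the" and "dog" separated by any run of whitespace, newlines included.
PATTERN = re.compile(r"\bthe\s+dog\b")

def find_spanning(text):
    """Return every substring of text that matches PATTERN."""
    return [m.group(0) for m in PATTERN.finditer(text)]

# Invented sample: the first occurrence is split across two lines.
SAMPLE = "I walked the\n  dog, and then the dog slept.\n"

if __name__ == "__main__":
    # The whole text is searched as one string, not line by line.
    for hit in find_spanning(SAMPLE):
        print(repr(hit))
```

The cost is the one already mentioned: the whole input has to sit in
memory before the match starts.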

For example, here's a quick-and-dirty attempt to grep out function
declarations.  It "knows" that I (and all other reasonable people :-) 
always put the type on a preceding line, kind of like this:

    char *
    funct(arg)
	some_type arg;
    {
	blah
	blah blah 
	blahdy blahdy blah 
    } 

And what I'd like back is this:

    char *
    funct(arg)
	some_type arg;


Here's the code.  It's got some extra foo to get rid of C junk I don't
want to see.  I suppose I could run it through cpp if I were serious.

    #!/usr/bin/perl

    undef $/;		# disable input record separator
    $_ = <>;		# slurp input into pattern space

    s#/\*#\200#g;	# trim comments first
    s#\*/#\201#g;
    s#\200[^\201]*\201##g;

    s/^#.*//gm;		# and cpp directives; /m makes ^ match at each line

    # now for the real work
    s/((\w+[*\s]*){0,2}\n(\w+)\s*\([^)]*\)[^{]*)\{/print $1, "\n"/eg;


You could probably put this into a script that took the guts of the
LHS of the last line and did all the rest for you.  The last line is
really all that matters anyway.
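
A rough sketch of such a wrapper, in Python rather than perl, might
look like the following.  It is an assumption-laden port, not the
script above: the function name declarations and the SAMPLE text are
hypothetical, and comments are removed with a non-greedy match instead
of the \200/\201 marker trick.  The declaration pattern itself is
copied from the last line of the perl code.

```python
import re

# Pattern taken from the perl script; group 1 is the text before the "{".
DECL = re.compile(r"((\w+[*\s]*){0,2}\n(\w+)\s*\([^)]*\)[^{]*)\{")

def declarations(source, pattern=DECL):
    """Strip C comments and cpp directives, then return group 1 of each match."""
    # Remove /* ... */ comments, which may span lines (non-greedy match).
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.S)
    # Remove preprocessor directives; re.M lets ^ match at every line start.
    source = re.sub(r"^#.*", "", source, flags=re.M)
    return [m.group(1).rstrip() for m in pattern.finditer(source)]

# Invented input in the style shown earlier: type on its own line.
SAMPLE = (
    "#include <stdio.h>\n"
    "/* a comment { with a brace */\n"
    "char *\n"
    "funct(arg)\n"
    "\tsome_type arg;\n"
    "{\n"
    "\tblah\n"
    "}\n"
)

if __name__ == "__main__":
    for decl in declarations(SAMPLE):
        print(decl)
        print()
```

Passing a different compiled pattern as the second argument reuses the
same cleanup steps for some other multi-line search.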

One problem with this kind of operation is that if you aren't careful
about your regexps, it can take a LONG time.  In the example given above,
if you change the {0,2} to a mere *, you'll be waiting for quite a while
as it tries all the possibilities.  Limiting function types to 0 to 2
words speeds it up into something tolerable.  You've got the same problem
with that \n there.  If you make it optional, the pattern matcher has 
a lot less to anchor it anywhere, and you get exponential blow-up.
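
The effect is easy to reproduce on a small scale.  The sketch below is
in Python, not perl, and uses only the type-prefix part of the pattern
followed by the newline; time_match and TEXT are made-up names, and the
timings will differ between machines and regex engines.  The input is a
run of word characters with no newline, so every attempt has to fail,
and the unbounded version tries every way of cutting the run into
"words" before giving up.

```python
import re
import time

# Stand-ins for the type prefix of the pattern, each followed by the newline.
BOUNDED = re.compile(r"(\w+[*\s]*){0,2}\n")
UNBOUNDED = re.compile(r"(\w+[*\s]*)*\n")

def time_match(pattern, text):
    """Return the seconds one match attempt anchored at the start takes."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start

# Twenty word characters and no newline: neither pattern can match.
TEXT = "x" * 20

if __name__ == "__main__":
    print("bounded   {0,2}: %.6f s" % time_match(BOUNDED, TEXT))
    print("unbounded *    : %.6f s" % time_match(UNBOUNDED, TEXT))
```

Each extra character in TEXT roughly doubles the work for the unbounded
pattern in a backtracking matcher, so keep the input short when trying
this.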

But most greppers aren't usually worried about this kind of thing.  
I'm not sure how often it would really come up.


--tom


