Fast file scan

Larry Wall lwall at jpl-devvax.JPL.NASA.GOV
Fri Sep 28 06:50:09 AEST 1990


In article <1990Sep27.195749.2552 at iwarp.intel.com> merlyn at iwarp.intel.com (Randal Schwartz) writes:
: In article <299 at lysator.liu.se>, pen at lysator (Peter Eriksson) writes:
: | I'd like to know how to scan all files in a directory (and its
: | subdirectories) for a specific string (without regular expressions) as
: | fast as possible with standard Unix tools and/or some special programs.
: | 
: | (I've written such a program, and would like to compare my implementation
: | of it with others.)
: | 
: | (Using the find+fgrep combination is slooooow....)
: | 
: | Any ideas?
: 
: I assume your objection to find+fgrep is that you must start an fgrep
: on each filename (or set of filenames if you use xargs).  Here's a
: solution in Perl that uses 'find' to spit out the filenames (because
: it is fast at that) and Perl to scan the text files.
: 
: ================================================== snip
: #!/usr/bin/perl
: $lookfor = shift;
: open(FIND,"find topdir -type f -print|") || die "Cannot find: $!";
: MAIN: while($FILE = <FIND>) {
: 	open(FILE) || next MAIN; # skip it if I can't open it for read
: 	{ local($/); undef $/; # slurp fast
: 		while (<FILE>) {
: 			(print "$FILE\n"), next MAIN if index($_,$lookfor);
: 		}
: 	}
: }
: ================================================== snip
: 
: This will have a problem if $lookfor straddles buffers, but it'll find
: everything else.  (A small matter of programming to fix the straddling
: problem.)

A bit hasty there, Randal.  First, you have to say index($_,$lookfor) >= 0,
since index() returns -1 when the string isn't found (and 0, which is
false, when it matches at the very start).  Second, there's no buffer
straddling problem--undeffing $/ will cause it to slurp in the whole file.
Thus, the inner while loop is unnecessary.  Third, you didn't chop the
filename.
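
To make the index() point concrete, here is a minimal sketch of its three
possible outcomes.  The sample strings are made up for illustration, and it
assumes the default array base of 0.

```perl
#!/usr/bin/perl
# index() returns the zero-based offset of the first match, or -1 if none.
$text = "needle in a haystack";          # hypothetical sample data

$at_start = index($text, "needle");      # match at the very start
$inside   = index($text, "hay");         # match further in
$missing  = index($text, "pin");         # no match

# A bare truth test treats "match at start" as false and "no match" as
# true; an explicit comparison with 0 tells found from not found.
foreach $pos ($at_start, $inside, $missing) {
    printf "%2d  bare test: %-5s  >= 0 test: %s\n",
        $pos, ($pos ? "true" : "false"), ($pos >= 0 ? "true" : "false");
}
```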

Another consideration is that Perl doesn't do Boyer-Moore indexing (currently)
except on literals.  So you probably want to use an eval around the loop,
to tell Perl it's okay to compile the BM search table for the string.

I think I'd do it something like this:

#!/usr/bin/perl

($lookfor = shift) =~ s/(\W)/\\$1/g;	# quote metas

open(FIND,"find topdir -type f -print|") || die "Can't run find: $!\n";

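# The pattern is interpolated into the here-document before eval compiles
# it, so Perl sees a literal and can build the BM search table once.
eval <<EOB;
    while(<FIND>) {
	chop;	# strip the newline from the filename
	open(FILE, \$_) || next;	# skip files that can't be opened
	\$size = (stat(FILE))[7];	# file size in bytes
	read(FILE,\$buf,\$size) || next;	# pull in the whole file at once
	print "\$_\n" if \$buf =~ /$lookfor/;
    }
EOB
die $@ if $@;	# report any compile error from the eval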

Using read() may or may not beat using <FILE>--it depends on how efficiently
your fread() function is coded.  Some are real smart, and read directly
into your variable.  Others are real stupid, and copy the data a time or
two first.
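
For comparison, the <FILE> variant might look roughly like this.  It is
only a sketch: slurp() is a hypothetical helper name, and nothing here says
which approach is faster on a given system.

```perl
#!/usr/bin/perl
# Sketch: read a whole file with <FILE> instead of read().
# slurp() is a hypothetical helper name, not part of Perl.
sub slurp {
    local($name) = @_;
    local($/);                  # undefined $/ means no record separator
    local($buf);
    open(FILE, $name) || return undef;
    $buf = <FILE>;              # a single read returns the entire file
    close(FILE);
    defined($buf) ? $buf : "";  # guard against undef from a failed read
}

# Usage: print the byte count of the file named on the command line.
print length(&slurp($ARGV[0])), "\n" if @ARGV;
```

Timing both versions on your own files is the only reliable way to pick one.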

You might also want to throw a -T FILE test in there after the open if
you want to reject binary files outright.
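
A minimal sketch of that -T filter, written as a standalone helper rather
than inside the eval.  The name text_only() is hypothetical, and -T is only
a heuristic guess based on the file's contents.

```perl
#!/usr/bin/perl
# Sketch: keep only the files that -T guesses are text.
# text_only() is a hypothetical helper name, not part of Perl.
sub text_only {
    local(@keep);
    foreach $name (@_) {
        open(FILE, $name) || next;       # skip unreadable files
        push(@keep, $name) if -T FILE;   # heuristic: looks like text
        close(FILE);
    }
    @keep;
}

# Usage: list the text files among the names on the command line.
print join("\n", &text_only(@ARGV)), "\n" if @ARGV;
```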

Larry


