regexp(3) A Clear Explanation (Medium length)

Ken Latham latham at bsdpkh.UUCP
Wed Aug 27 11:04:41 AEST 1986


Dr. Megabyte (megabyte at chinet.UUCP) writes:
>I've poured myself over my manual and looked at regcmp(1), regcmp(3), and
>regexp(3), and I'm still not sure how to use these functions.  Could someone
>send me some clear info on how to use these functions along with some examples?
>
>For the record: I am running Zeus 3.21 which is SYS III port to those of you 
>who are fortunate to have never heard of it.


I am not familiar with Zeus and am only quasi-familiar with sys3.  The following
is a sys5 explanation which, if memory serves me, should cover it.

1. regcmp(3) - a function which translates regular expressions
	( a variant of ed(1) style ) to an internal form.  Its string
	arguments are concatenated to form the regular expression, and the
	argument list must end with a null pointer.  The char pointer
	returned is the address of a ( non-null-terminated ) string that
	represents the regular expression.  This 'compiled' regular expression
	can be interpreted by regex(3).
		If the returned pointer is NULL then the regular expression
	was bad.  You get no hint as to where, so you will have to 'walk'
	through the regular expression by hand and determine where the
	syntax error is.

2. regcmp(1) - a user level command that will compile files of regular
	expressions into either data files containing the compiled expressions
	or into C files declaring data structures containing same.

3. regex(3) - the compiled regular expression interpreter which scans the
	subject string to determine if it is in fact a member of the language
	described by the compiled regular expression.  If nothing matches it
	returns NULL.  Otherwise it returns a pointer to the first character
	in the subject string past the end of the match, i.e. the character
	at which pattern acceptance stopped.  For a complete match this is
	the '\0' which terminated the subject string.  If it is any other
	character, only part of the string was accepted; whether that is
	good enough is program dependent.
		A global variable 'loc1' ( according to the manual ) points
	to the position at which the match started in the subject string.
	This is usually the start of the subject string, but may vary with
	the application.

	The ACTUAL NAME of 'loc1' may be different than advertised!!
	On sys5 it is '__loc1'.  You can do an 'nm' on libPW.a to determine
	the name for your version.

EX.
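	char *compex, *badchar, *regcmp(), *regex();
	char *subject = "A_long_identifier_name";
	extern char *__loc1;	/* start of the match, name may vary (see above) */
	.
	.
	/* the argument list of regcmp must end with a null pointer */
	compex = regcmp( "[a-zA-Z][_a-zA-Z0-9]*", (char *)0 );
	if ( compex == NULL )
		.. some error routine to say that the RE is BAD !
	.
	.
	badchar = regex( compex, subject );
	if ( badchar != NULL && *badchar == '\0' && __loc1 == subject )
	{
		...then HOORAH, it was a COMPLETE match!!!
	}
	else
	{
		... BOO HISSS, only a partial match or no match was made.
		badchar is NULL if there was no match at all.  Otherwise
		you may want to accept some partial matches, in which
		case you can look at what stopped the match
		before the string terminator ('\0').  look at *badchar.
	}
	.
	.
	free( compex );		/* regcmp got the space from malloc(3) */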

	NOTE:
		both "[a-zA-Z][_a-zA-Z0-9]*" and "A_long_identifier_name"
		could just as easily be variables that are pointers to
		strings !!! It is much more useful when used on variables :-).
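
	For those without libPW handy, here is a rough sketch of the same
	"complete match" test that should compile as it stands.  Note that it
	does NOT use regcmp(3)/regex(3): the POSIX <regex.h> routines
	regcomp()/regexec() stand in for them, and their RE syntax is not
	identical.  The helper name is_complete_match() is made up for
	this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * is_complete_match() is a hypothetical helper, not a library routine.
 * Returns 1 if the whole of subject matches pattern, 0 if only part of
 * it (or none of it) matches, and -1 if the pattern does not compile.
 */
int is_complete_match(const char *pattern, const char *subject)
{
	regex_t re;
	regmatch_t m;
	int complete;

	assert(pattern != NULL && subject != NULL);

	/* stands in for regcmp(3): translate the RE to an internal form */
	if (regcomp(&re, pattern, 0) != 0)
		return -1;		/* the RE is BAD */

	/* stands in for regex(3): run the compiled RE over the subject */
	if (regexec(&re, subject, 1, &m, 0) != 0) {
		regfree(&re);
		return 0;		/* no match at all */
	}

	/*
	 * m.rm_so plays the part of __loc1 (where the match started);
	 * m.rm_eo plays the part of the pointer regex(3) returns
	 * (the first character past the end of the match).
	 */
	complete = (m.rm_so == 0 && subject[m.rm_eo] == '\0');

	regfree(&re);	/* the analogue of freeing what regcmp(3) returned */
	return complete;
}
```

	Called as is_complete_match( "[a-zA-Z][_a-zA-Z0-9]*", name ), it
	accepts only strings that are an identifier from the first character
	to the last.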

	
Some side notes:

	If it is the regular expressions and not the actual calls that
	give you problems then you need to buy a text book on the subject
	and get familiar with them.

	If you are familiar with REs then note that the (...)$n  notation
	utilized in regex(3) is an added extension to normal REs.

	The other arguments ret0, ret1 ..., ret9 in regex(3) are there simply
	to provide pointers to regions where the  (...)$n  extractions should
	be copied.  A subexpression surrounded by (....)$0 will extract the
	substring of the subject string which matches the portion of the
	regular expression enclosed in (...)$0 and copy it to ret0; likewise
	(...)$1 goes to ret1, and so on.  Each ret pointer you use must hold
	the address of a preallocated area large enough to hold the longest
	possible substring.
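
	The "preallocated area" business can be sketched as follows.  Again
	this is NOT regex(3) itself: POSIX regcomp()/regexec() stand in, with
	a plain ( ... ) group playing the part of (...)$0 and a caller-supplied
	buffer playing the part of ret0.  The helper extract_first_group() and
	its bufsize parameter are inventions for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * extract_first_group() is a hypothetical helper.  It copies the text
 * matched by the first parenthesized subexpression of pattern into buf,
 * which the caller must have allocated (bufsize bytes).
 * Returns 0 on success, -1 on a bad RE, no match, or too small a buffer.
 */
int extract_first_group(const char *pattern, const char *subject,
			char *buf, size_t bufsize)
{
	regex_t re;
	regmatch_t m[2];	/* m[0] = whole match, m[1] = first ( ... ) */
	size_t len;

	assert(buf != NULL && bufsize > 0);

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	if (regexec(&re, subject, 2, m, 0) != 0 || m[1].rm_so == -1) {
		regfree(&re);
		return -1;
	}
	regfree(&re);

	len = (size_t)(m[1].rm_eo - m[1].rm_so);
	if (len + 1 > bufsize)
		return -1;	/* regex(3) is passed no size, so it cannot check */

	memcpy(buf, subject + m[1].rm_so, len);
	buf[len] = '\0';
	return 0;
}
```

	With a 32 byte buffer, the pattern "^([a-zA-Z]+)=" and the subject
	"name=value", the buffer ends up holding "name".  The size check is
	the part you have to supply yourself when handing a ret area to
	regex(3).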


	That should just about do it!  Hope that helps.  Sorry if you found
	this long winded, but I wanted to be complete.


			Ken Latham, AT&T-IS (via AGS Inc.), Orlando , FL

			uucp: ihnp4!codas!bsdpkh!latham


