Using Lex (and Yacc) on a string.

Mon Aug 13 08:09:16 AEST 1990

(This topic probably belongs elsewhere; perhaps comp.lang.misc or
comp.unix.questions....  Ah well.)

In article <174 at ittc.wec.com> ptb at ittc.wec.com (Pat Broderick) writes:
>The things needed might look something like:
># define input() (((yytchar=yysptr>yysbuf?U(*--yysptr):getc(yyin))==10?
 (yylineno++,yytchar):yytchar)==EOF?0:yytchar)   /* standard defn from lex */

># define input() (((yytchar=yysptr>yysbuf?U(*--yysptr):(*yynyy++))==10?
 (yylineno++,yytchar):yytchar)==EOF?0:yytchar)          ^^^^^^^^^^

>This works fine for us, hope it helps.

This should work, but is overkill.  (It also does not address a question
whose answer I myself am unsure about.)

Here is what is going on:

Lex (or Flex or other similar lexer of your choice) implements a DFSA
(Deterministic Finite State Automaton) that is simply an optimized (in
space if not time) variant on

	int table[NSTATES][128] = { huge amounts of junk };

	yylex() {
		int state = 0;	/* ignoring BEGINs that is */
		yyleng = 0;
		for (;;) {
			c = input();		/* eat the next char */
			yytext[yyleng++] = c;	/* store it in yytext */
			state = table[state][c];/* find the next state */
			switch (state) {	/* and see what to do */
			... cases that exactly match something ...
				do actions from C code;
			... cases indicating we ate too much ...
				unput(some of the things we ate);
				do actions from C code;
			... cases indicating `no match' ...
				output(the things we ate);
				break;
			}
		}
	}

One noteworthy thing about this is that lex can never unput() something
it has not input() `from the same place' (unless you put you own unput()
actions into your lexer: a dangerous practise).  Thus, if you are reading
from a string in a buffer, your `unput' action can be much simpler, and
likewise your input() macro can be simplified:

#define	input() ((yytchar = *mystring++) == '\n' ? (yylineno++, yytchar) : \
		 yytchar)
#define	unput(c) (mystring--)

These two also take advantage of the fact that Lex wants `EOF' to be
the value 0, rather than the (implementation-defined but usually) -1
that stdio returns.  The end of a C string is the character '\0' which
has the value 0.

The question I have is whether lex might call input() again after reading
EOF once.  Since the end of a real file tends to remain the end of the
file no matter how many times it is read per second, it seems possible
that the implementation might invoke input() again after input() returns
0 but without an intervening unput(0)---i.e., it may depend on EOF being
`sticky'.  In this case the input macro must be more careful:

#define input() ((yytchar = *mystr++) == 0 ? (mystr--, 0) : \
	(yytchar == '\n' ? (yylineno++, '\n') : yytchar))

or as a GCC inline function:

	static inline input() {
		int c = *mystr++;
		if (c == 0) mystr--;
		else if (c == '\n') yylineno++;
		return c;
	}

This requires a corresponding change to unput(), however, since now
unput(0) should do nothing:

#define unput(c) ((c) ? mystr-- : 0)

All of these eliminate the need for yysbuf (an array of size YYLMAX that
holds unput() characters since ungetc() only guarantees one character of
pushback).

Lex could be considerably more efficient by avoiding all this copying of
text from one place to another; I believe flex does this.  This usually
means bypassing stdio, of course....
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris at cs.umd.edu	Path:	uunet!mimsy!chris
	(New campus phone system, active sometime soon: +1 301 405 2750)