Match C Comments... the right answer (was Re: LEX rule, anyone???)

Martin Weitzel martin at mwtech.UUCP
Tue Dec 19 21:49:30 AEST 1989


In article <5365 at omepd.UUCP> merlyn at iwarp.intel.com (Randal Schwartz) writes:
>In article <2191 at prune.bbn.com>, rsalz at bbn (Rich Salz) writes:
>| In <601 at vice2utc.chalmers.se> d5kwedb at dtek.chalmers.se (Kristian Wedberg) writes:
>| >A question from a friend of mine, Pär Eriksson:
>| >	Does anyone know how to write a LEX rule for C comments,
>| >	ie for everything between /* and */, nesting not allowed?
>| 
>| We go through this in comp.lang.c about once a year.  Almost everyone
>| gets it wrong.  The best thing to do is to define a lex rule that
>| catches "/*" and in the actions for that rule look for */ on your own.

True, but only partially.

>
>OK, almost everyone got it wrong (and I took great delight at pointing
>out what was wrong sometimes :-), but there was ONE correct answer --
>mine ( :-)...
>
>		"/*"(\**[^*/]|\/)*\*+\/

All solutions to this problem that use only regexes suffer from one
major flaw: the text buffer of lex (yytext) holds, in most cases, only
about 200 bytes. If the buffer overflows, you're out of luck! (You may
enlarge the buffer, I know, but it is not uncommon for a comment to run
well over 2 KByte, and nothing in C limits the length of a comment, so
how big should the buffer be?)
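
To illustrate the "look for */ on your own" approach quoted above, here
is a minimal sketch in plain C. The function name skip_comment and the
FILE-based interface are my own assumptions for illustration; inside a
real lex action the same loop would presumably read its characters via
lex's input() routine instead of getc(). The point is only that the
comment text is never collected anywhere, so its length does not matter.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from the original post): call it after the
 * opening slash-star has been consumed. It reads up to and including
 * the closing star-slash, one character at a time, so there is no
 * buffer that could overflow.
 * Returns 0 on success, -1 if end of input arrives first.
 */
static int skip_comment(FILE *in)
{
    int c;
    int prev = 0;               /* previous character inside the comment */

    while ((c = getc(in)) != EOF) {
        if (prev == '*' && c == '/')
            return 0;           /* found the terminator */
        prev = c;
    }
    return -1;                  /* unterminated comment */
}
```

Only one character of history (prev) is kept, which is all that is
needed to recognize the two-character terminator.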

>
>Don't accept any cheap imitations.  They're probably wrong.  (This one
>is wrong in that it will match C comments inside of text strings, but
>that's just plain pathological. :-)

How long will I have to hear this! It's so easy with "lex", because
there are these wonderful "Start Conditions" that nobody seems to know
about :-(

The following skeleton strips all comments from a C source (and
*correctly* handles strings and char constants).

----------------------------------- cut here ------------------------
%start CTEXT COMMENT STRING
%%
%{
		BEGIN CTEXT;
%}
<CTEXT>"/*"	{ BEGIN COMMENT; }
<CTEXT>'\\''	{ ECHO; }
<CTEXT>'[^']+'	{ ECHO; }
<CTEXT>\"	{ ECHO; BEGIN STRING; }
<CTEXT>.|\n	{ ECHO; }

<COMMENT>"*/"	{ BEGIN CTEXT; }
<COMMENT>.|\n	;

<STRING>\\(.|\n)	{ ECHO; /* backslash escape pair */ }
<STRING>\"	{ ECHO; BEGIN CTEXT; }
<STRING>.	{ ECHO; }
<STRING>\n	{ ECHO; /* syntax error! */ }
----------------------------------- cut here ------------------------

Note that character constants and strings are handled by separate
rules, so this piece of code can easily be extended to build other
tools that need this distinction.
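
For readers without lex at hand, the idea behind start conditions can be
sketched as an explicit state variable in plain C. This is only an
illustration under my own assumptions: the function name strip_comments
is hypothetical, and the extra CHARCONST state stands in for the two
character-constant patterns of the lex version. Backslash escapes are
handled by copying the escaped character unconditionally.

```c
#include <stdio.h>

/* One state per start condition, plus CHARCONST for the quote rules. */
enum state { CTEXT, COMMENT, STRING, CHARCONST };

/*
 * Copy `in` to `out` with comments removed. Hypothetical function, a
 * hand-written counterpart of the lex rules; all names are illustrative.
 */
static void strip_comments(FILE *in, FILE *out)
{
    enum state st = CTEXT;
    int c, next;

    while ((c = getc(in)) != EOF) {
        switch (st) {
        case CTEXT:
            if (c == '/') {
                next = getc(in);
                if (next == '*') { st = COMMENT; break; }
                if (next != EOF) ungetc(next, in);  /* not a comment */
            } else if (c == '"') {
                st = STRING;
            } else if (c == '\'') {
                st = CHARCONST;
            }
            putc(c, out);
            break;
        case COMMENT:
            if (c == '*') {
                next = getc(in);
                if (next == '/') st = CTEXT;        /* end of comment */
                else if (next != EOF) ungetc(next, in);
            }
            break;
        case STRING:
        case CHARCONST:
            putc(c, out);
            if (c == '\\') {                 /* copy the escaped character */
                if ((next = getc(in)) != EOF) putc(next, out);
            } else if (c == (st == STRING ? '"' : '\'')) {
                st = CTEXT;                  /* closing quote */
            }
            break;
        }
    }
}
```

Calling strip_comments(stdin, stdout) from main() would give a filter
comparable to the lex program. Each `case` plays the role of one
<STATE> prefix, which is roughly what lex generates for you.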

>(No, I don't have one in Perl... :-)

A general remark: Sometimes people discover a tool and learn to handle
it. Sometimes they learn that it is a really powerful tool, far more
powerful than they thought in the beginning. And very often then comes
the big mistake: they want to do *everything* with this tool.

IMHO sometimes you should lean back and accept that not everything
can be done easily in the way you love most. (I have to tell this to
myself sometimes :-).)

[rest deleted]
-- 
Martin Weitzel, email: martin at mwtech.UUCP, voice: 49-(0)6151-6 56 83


