LEX with all eight bits?

Martin Weitzel martin at mwtech.UUCP
Fri Mar 23 06:04:35 AEST 1990


In article <9463 at discus.technion.ac.il> joel%techunix.bitnet at jade.berkeley.edu (Yossi (Joel) Hoffman) writes:
>Hi folks!  I was trying to use LEX to process a text (yes, text) file
>that happens to use all eight bits (the 8th bit signifies Hebrew text).
>I just inserted the 8-bit letters in the usual way, but LEX choked on
>it.  (It didn't produce any C output at all.)  This couldn't just be
>a coincidence; is there anyway I can tell LEX that I'm going to use
>all 8 bits?
>Any help will be much appreciated.

Though there are some efforts to make U*IX '8-bit clean', I have not
yet seen an implementation of 'lex' that supports 8-bit
chars. The major problem is that 'lex' uses the 8th bit for its own
purposes in the compiled representation of the regular expressions
(and it seems that no one at AT&T or at the software companies that
port U*IX is willing to dig into the sources of 'lex' ... :-()

SO BE AWARE: Even if 'lex' produces a compilable 'lex.yy.c', the
behaviour may be strange if you feed it input with the 8th bit set!
(This specific problem hit me some time ago and I searched for
hours to track down the cause. The pity is that only
a few characters trigger the erratic behaviour. So if SOME test
input seems to be processed correctly under SOME circumstances, you
have no guarantee that ALL input will be processed correctly under
ALL circumstances!)

Whether there are workarounds or not depends on your problem:
If you only want to process all chars with the high bit set in
some more or less uniform way, you may roll your own version
of the 'input' macro and translate the 8-bit chars to some
other representation. E.g. you can establish a buffer which
parallels 'yytext' where you store the 'real' input, but let
the macro return one common representation for all characters
that you treat in the same way anyhow. [To the poster: If you
need any further hints, mail me a little more about your problem.]
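
A minimal sketch of that idea in plain C, just to make it concrete.
The names 'fold_byte', 'folded_input', 'shadow' and the placeholder
value 0x01 are all made up for illustration; none of them come from
'lex' itself. It also assumes that 0x01 never occurs in the real
input and that the text fits in a fixed-size buffer, and it ignores
pushback ('unput') and line counting, which a real replacement for
the 'input' macro would have to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical 7-bit stand-in handed to the scanner for ANY high-bit byte. */
#define HIGH_PLACEHOLDER 0x01
/* Hypothetical fixed capacity of the shadow buffer. */
#define SHADOW_MAX 1024

/* The 'real' input bytes, stored in the order the scanner consumed them. */
unsigned char shadow[SHADOW_MAX];
size_t shadow_len = 0;

/*
 * Record the original byte and return what the scanner should see:
 * bytes 1..127 pass through unchanged, bytes with the 8th bit set
 * collapse to HIGH_PLACEHOLDER, and EOF becomes 0 (the value the
 * classic 'input' macro uses to signal end of input).
 */
int fold_byte(int c)
{
    if (c == EOF)
        return 0;
    assert(shadow_len < SHADOW_MAX);      /* sketch only: no growth logic */
    shadow[shadow_len++] = (unsigned char)c;
    return (c & 0x80) ? HIGH_PLACEHOLDER : c;
}

/* Convenience wrapper: fetch the next byte from a stream and fold it. */
int folded_input(FILE *in)
{
    return fold_byte(getc(in));
}
```

In the 'lex' specification the rules would then match the placeholder
(e.g. as \001) instead of the 8-bit characters, and an action can look
up the original bytes in 'shadow' at the position corresponding to the
matched text. Hooking this in would presumably mean redefining 'input'
in the definitions section in terms of 'folded_input(yyin)'; check the
macros in your own 'lex.yy.c' first, since the details differ between
implementations.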

As a general rule, avoid characters outside the range 1 .. 127
in your input as well as in the regular expression specification!
(BTW: Who knows how the PD Version FLEX handles this?)
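
To check a file against that rule before feeding it to 'lex', a small
helper along these lines may be enough. The function name
'first_unsafe_byte' is hypothetical; it simply reports the offset of
the first byte that is 0 or has the 8th bit set.

```c
#include <stddef.h>

/*
 * Return the offset of the first byte outside the range 1..127,
 * or -1 if every byte in buf[0..n-1] is inside that range.
 */
long first_unsafe_byte(const unsigned char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (buf[i] == 0 || buf[i] > 127)
            return (long)i;
    }
    return -1;
}
```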
-- 
Martin Weitzel, email: martin at mwtech.UUCP, voice: 49-(0)6151-6 56 83


