SGML <tag> stripping

Richard L. Goerwitz goer at ellis.uchicago.edu
Fri Dec 14 09:40:43 AEST 1990


SGML (and Chicago) style <tags> are an increasingly familiar sight,
and I often want to just strip them out (or else perform simple
manipulations on them).  Here's a simple program that does this.

Sorry to the people who try to make me believe that C is the right
tool for everything.  This is a text-processing problem, and so I
have used a good text-processing language, namely Icon.

I'd be curious to know how to do this in perl.
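
For comparison, the default stripping behaviour can be roughly
approximated with sed.  This is only a sketch, not an equivalent of the
Icon program below: it assumes tags never nest, never continue across a
line break, and that the text contains no backslash-escaped < or >.
The function name strip_tags is made up for illustration.

```shell
# strip_tags: delete every <...> span on each line read from stdin.
# Rough approximation only: no backslash escapes, no nested delimiters,
# and no tags that continue onto the next line.
strip_tags() {
    sed 's/<[^>]*>//g'
}

printf '%s\n' 'Some <bold>heavy</bold> text.' | strip_tags
```

On the sample line this prints "Some heavy text."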

-Richard (goer at sophist.uchicago.edu)
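
To unpack, the shar header asks that everything above the "#!/bin/sh"
line be removed first.  A minimal sketch of that step, followed by the
build command given in the README, assuming the posting has been saved
to a file and that icont is on the PATH.  The function names shar_body
and unpack_and_build are hypothetical.

```shell
# shar_body FILE: print FILE from the first "#!/bin/sh" line onward,
# discarding the mail headers and message text above the archive.
shar_body() {
    sed -n '/^#!\/bin\/sh/,$p' "$1"
}

# unpack_and_build FILE: unpack the archive into the current directory,
# then build with the icont command line given in the README.
# Assumes icont is installed; nothing here checks for it.
unpack_and_build() {
    shar_body "$1" | sh &&
    icont -o stripsgml readtbl.icn slashbal.icn stripsgml.icn stripunb.icn
}
```

Run in an empty directory, "unpack_and_build saved-posting" would then
extract the six files and attempt the build (saved-posting is a
placeholder for wherever the message was stored).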


---- Cut Here and feed the following to sh ----
#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 12/13/1990 23:08 UTC by goer at sophist.uchicago.edu
# Source directory /u/richard/Stripsgml
#
# existing files will NOT be overwritten unless -c is specified
# This format requires very little intelligence at unshar time.
# "if test", "cat", "rm", "echo", "true", and "sed" may be needed.
#
#                                                                          
#                                                                          
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#   2697 -r--r--r-- stripsgml.icn
#   3289 -r--r--r-- stripunb.icn
#   1915 -r--r--r-- readtbl.icn
#   2274 -r--r--r-- slashbal.icn
#    983 -rw-r--r-- README
#    659 -rw-r--r-- Makefile.dist
#
if test -r _shar_seq_.tmp; then
	echo 'Must unpack archives in sequence!'
	echo Please unpack part `cat _shar_seq_.tmp` next
	exit 1
fi
# ============= stripsgml.icn ==============
if test -f 'stripsgml.icn' -a X"$1" != X"-c"; then
	echo 'x - skipping stripsgml.icn (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting stripsgml.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'stripsgml.icn' &&
X############################################################################
X#
X#	Name:	 stripsgml.icn
X#
X#	Title:	 Strip (or translate) simple SGML tags from a file
X#
X#	Author:	 Richard L. Goerwitz
X#
X#	Version: 1.9
X#
X############################################################################
X#
X#  Strip or perform simple translation on SGML <>-style tags.  Usage
X#  is as follows:
X#
X#      stripsgml [-f translation-file] [left-delimiter [right-delimiter]]
X#
X#  The default left-delimiter is <, the default right delimiter is >.
X#  If no translation file is specified, the program acts as a strip-
X#  per, simply removing material between the delimiters.  Stripsgml
X#  takes its input from stdin, writing to stdout.
X#
X#  The format of the translation file is:
X#
X#      code	initialization	completion
X#
X#  A tab or colon separates the fields.  If you want to use a tab or colon
X#  as part of the text (and not as a separator), place a backslash before
X#  it.  The completion field is optional.  There is not currently any way
X#  of specifying a completion field without an initialization field.  Do
X#  not specify delimiters as part of code.
X#
X#  Note that, if you are translating SGML code into font change or escape
X#  sequences, you may get unexpected results.  This isn't stripsgml's
X#  fault.  It's just a matter of how your terminal or WP operates.  Some
X#  need to be "reminded" at the beginning of each line what mode or font
X#  is being used.  Note also that stripsgml assumes < and > as delimiters.
X#  If you want to put a greater-than or less-than sign into your text,
X#  put a backslash before it.  This will effectively "escape" the spe-
X#  cial meaning of those symbols.  It is now possible to change the
X#  default delimiters, but the option has not been thoroughly tested.
X#
X############################################################################
X#
X#  Links: slashbal.icn ./stripunb.icn ./readtbl.icn
X#
X############################################################################
X
X
Xprocedure main(a)
X
X    local usage, _arg, L, R
X
X    usage:=
X     "usage:  stripsgml [-f map-file] [left-delimiter(s) [right-delimiter(s)]]"
X
X    L := '<'; R := '>'
X    while _arg := get(a) do {
X        if _arg == "-f" then {
X            map_file := open(get(a)) |
X                stop("stripsgml:  can't open map_file\n",usage)
X            t := readtbl(map_file)
X        }
X        else {
X            L := _arg
X            R := cset(get(a))
X        }
X    }
X
X    every line := !&input do
X	write(stripunb(L,R,line,&null,&null,t))  # t is the map table
X
X    # last_k is the stack used in stripunb.icn
X    if *\last_k ~= 0 then
X	stop("Unexpected EOF encountered.  Expecting ", pop(last_k), ".")
X
Xend
SHAR_EOF
true || echo 'restore of stripsgml.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= stripunb.icn ==============
if test -f 'stripunb.icn' -a X"$1" != X"-c"; then
	echo 'x - skipping stripunb.icn (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting stripunb.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'stripunb.icn' &&
X############################################################################
X#
X#	Name:	 stripunb.icn
X#
X#	Title:	 Strip unbalanced material
X#
X#	Author:	 Richard L. Goerwitz
X#
X#	Version: 1.5
X#
X############################################################################
X#  
X#  This routine strips material from a line which is unbalanced with
X#  respect to the characters defined in arguments 1 and 2 (unbalanced
X#  being defined as bal() defines it, except that characters preceded
X#  by a backslash are counted as regular characters, and are not taken
X#  into account by the balancing algorithm).
X#
X#  One little bit of weirdness I added in is a table argument. Put
X#  simply, if you call stripunb() as follows,
X#
X#      stripunb('<','>',s,&null,&null,t)
X#
X#  and if t is a table having the form,
X#
X#      key:  "bold"        value: outstr("\e[2m", "\e[1m")
X#      key:  "underline"   value: outstr("\e[4m", "\e[1m")
X#      etc.
X#
X#  then every instance of "<bold>" in string s will be mapped to
X#  "\e[2m", and every instance of "</bold>" will be mapped to "\e[1m".
X#  Values in table t must be records of type outstr(on, off).  When
X#  "</>" is encountered, stripunb will output the .off value for the
X#  preceding .on string encountered.
X#
X############################################################################
X#
X#  Links: slashbal.icn
X#
X############################################################################
X
Xglobal last_k
Xrecord outstr(on, off)
X
X
Xprocedure stripunb(c1,c2,s,i,j,t)
X
X    # NB:  Stripunb() returns a string - not an integer (like find,
X    # upto).
X
X    local lookinfor, bothcs, s2, k, new_s
X    #global last_k
X    initial last_k := list()
X
X    /c1 := '<'
X    /c2 := '>'
X    bothcs := c1 ++ c2
X    lookinfor := c1 ++ '\\'
X    c := &cset -- c1 -- c2
X
X    /s := \&subject | stop("stripunb:  No string argument.")
X    if \i then {
X	if i < 1 then
X	    i := *s + (i+1)
X    }
X    else i := \&pos | 1
X    if \j then {
X	if j < 1 then
X	    j := *s + (j+1)
X    }
X    else j := *s + 1
X
X    s2 := ""
X    s ? {
X	while s2 ||:= tab(upto(lookinfor)) do {
X	    if ="\\" then {
X		if not any(bothcs) then
X		    s2 ||:= "\\"
X		&pos+1 > j & (return s2)
X		s2 ||:= move(1)
X		next
X	    }
X	    else {
X		&pos > j & (return s2)
X		any(c1) |
X		    stop("stripunb:  Unbalanced string, pos(",&pos,").\n",s)
X		if not (k := tab(slashbal(c,c1,c2)))
X		then {
X		    # If the last char on the line is the right-delim...
X		    if (.&subject[&pos:0]||" ") ? slashbal(c,c1,c2)
X		    # ...then, naturally, the rest of the line is the tag.
X		    then k := tab(0)
X		    else {
X			# BUT, if it's not the right-delim, then we have a
X			# tag split by a line break.  Blasted things.
X			return stripunb(c1,c2,&subject||read(&input),
X					*.&subject,,t) |
X			# Can't find the right delimiter.  Parsing error.
X			stop("stripunb:  Incomplete tag\n",s[1:80] | s)
X		    }
X		}
X		# t is the map table.
X		if \t then {
X		    k ?:= 2(tab(any(c1)), tab(upto(c2)), move(1), pos(0))
X		    if k ?:= (="/", tab(0)) then {
X			compl:= pop(last_k) | stop("Incomplete tag, ",&subject) 
X			if k == ""
X			then k := compl
X			else k == compl | stop("Incorrectly paired tag,/tag.")
X			s2 ||:= \(\t[k]).off
X		    }
X		    else {
X			s2 ||:= \(\t[k]).on
X			push(last_k, k)
X		    }
X		}
X	    }
X	}
X	s2 ||:= tab(0)
X    }
X
X    return s2
X
Xend
SHAR_EOF
true || echo 'restore of stripunb.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= readtbl.icn ==============
if test -f 'readtbl.icn' -a X"$1" != X"-c"; then
	echo 'x - skipping readtbl.icn (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting readtbl.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'readtbl.icn' &&
X############################################################################
X#
X#	Name:	 readtbl.icn
X#
X#	Title:	 Read user-created stripsgml table
X#
X#	Author:	 Richard L. Goerwitz
X#
X#	Version: 1.1
X#
X############################################################################
X#  
X#  This file is part of the stripsgml package.  It does the job of read-
X#  ing optional user-created mapping information from a file.  The purpose
X#  of this file is to specify how each code in a given input text should
X#  be translated.  Each line has the form:
X#
X#      SGML-designator	start_code	end_code
X#
X#  where the SGML designator is something like "quote" (without the quota-
X#  tion marks), and the start and end codes are the way in which you want
X#  the beginning and end of a <quote>...</quote> sequence to be transla-
X#  ted.  Presumably, in this instance, your codes would indicate some set
X#  level of indentation, and perhaps a font change.  If you don't have an
X#  end code for a particular SGML designator, just leave it blank.
X#
X############################################################################
X#
X#  Links: stripsgml.icn
X#
X############################################################################
X
X
Xprocedure readtbl(f)
X
X    local t, line, k, on_sequence, off_sequence
X
X    /f & stop("readtbl:  Arg must be a valid open file.")
X
X    t := table()
X
X    every line := trim(!f,'\t ') do {
X	line ? {
X	    k := tabslashupto('\t:') &
X	    tab(many('\t:')) &
X	    on_sequence := tabslashupto('\t:') | tab(0)
X	    tab(many('\t:'))
X	    off_sequence := tab(0)
X	} | stop("readtbl:  Bad map file format.")
X	insert(t, k, outstr(on_sequence, off_sequence))
X    }
X
X    return t
X
Xend
X
X
X
Xprocedure tabslashupto(c,s)
X
X    POS := &pos
X
X    while tab(upto('\\' ++ c)) do {
X	if ="\\" then {
X	    move(1)
X	    next
X	}
X	else {
X	    if any(c) then {
X		suspend &subject[POS:.&pos]
X	    }
X	}
X    }
X
X    &pos := POS
X    fail
X
Xend
SHAR_EOF
true || echo 'restore of readtbl.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= slashbal.icn ==============
if test -f 'slashbal.icn' -a X"$1" != X"-c"; then
	echo 'x - skipping slashbal.icn (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting slashbal.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'slashbal.icn' &&
X############################################################################
X#
X#	Name:	 slashbal.icn
X#
X#	Title:	 Bal() with backslash escaping
X#
X#	Author:	 Richard L. Goerwitz
X#
X#	Version: 1.8
X#
X############################################################################
X#
X#  I am often frustrated at bal()'s inability to deal elegantly with
X#  the common \backslash escaping convention (a way of telling Unix
X#  Bourne and C shells, for instance, not to interpret a given
X#  character as a "metacharacter").  I recognize that bal()'s generic
X#  behavior is a must, and so I wrote slashbal() to fill the gap.
X#
X#  Slashbal behaves like bal, except that it ignores, for purposes of
X#  balancing, any c2/c3 char which is preceded by a backslash.  Note
X#  that we are talking about internally represented backslashes, and
X#  not necessarily the backslashes used in Icon string literals.  If
X#  you have "\(" in your source code, the string produced will have no
X#  backslash.  To get this effect, you would need to write "\\(".
X#
X#  BUGS:  Note that, like bal() (v8), slashbal() cannot correctly
X#  handle cases where c2 and c3 intersect.
X#
X############################################################################
X#
X#  Links: none
X#
X############################################################################
X
Xprocedure slashbal(c1, c2, c3, s, i, j)
X
X    local twocs, allcs, chr2, count
X
X    /c1 := &cset
X    /c2 := '('
X    /c3 := ')'
X    twocs := c2 ++ c3
X    allcs := c1 ++ c2 ++ c3 ++ '\\'
X
X    /s := \&subject | stop("slashbal:  No string argument.")
X    if \i then {
X	if i < 1 then
X	    i := *s + (i+1)
X    }
X    else i := \&pos | 1
X    if \j then {
X	if j < 1 then
X	    j := *s + (j+1)
X    }
X    else j := *s + 1
X
X    count := 0
X    s ? {
X	while tab(upto(allcs)) do {
X	    chr := move(1)
X	    &pos > j & fail
X	    if chr == "\\" & any(twocs) then {
X		chr2 := move(1)
X		&pos > j & fail
X		&pos > i | next
X		if any(c1, chr) & count = 0 then
X		    suspend .&pos - 2
X		if any(c1, chr2) & count = 0 then
X		    suspend .&pos - 1
X	    }
X	    else {
X		&pos > j & fail
X		if any(c1, chr) & count = 0 then {
X		    &pos > i | next
X		    suspend .&pos - 1
X		}
X		if any(c2, chr) then
X		    count +:= 1
X		else if any(c3, chr) & count > 0 then
X		    count -:= 1
X	    }
X	}
X    }
X
Xend
SHAR_EOF
true || echo 'restore of slashbal.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= README ==============
if test -f 'README' -a X"$1" != X"-c"; then
	echo 'x - skipping README (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting README (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'README' &&
X
XRe:  stripsgml.icn & associated files
X
XThis program is documented in the various source files, most notably
Xstripsgml.icn.  Please look them over, even if you are not an Icon
Xprogrammer.
X
XIn order to compile this program, you will need an Icon interpreter
X(or compiler).  If you do not have it, get it.  It is free, and can be
Xobtained via ftp from cs.arizona.edu.  If you do not have access to
Xthe internet, drop a line to the icon-project at arizona.edu, and they
Xwill fill you in on what to do.
X
XIf you are working on a Unix system, you can simply mv Makefile.dist
Xto Makefile, and then make.  Users on other systems will need to type:
X
X     icont -o stripsgml readtbl.icn slashbal.icn stripsgml.icn stripunb.icn
X
XAs I said above, see the file stripsgml.icn for more information on
Xhow to use this program.  This program is not fancy, and handles only
Xthe simplest <>-style markup.  It is in no way an attempt to handle
Xthe full metalanguage!
X
X-Richard (goer at sophist.uchicago.edu)
X
SHAR_EOF
true || echo 'restore of README failed'
rm -f _shar_wnt_.tmp
fi
# ============= Makefile.dist ==============
if test -f 'Makefile.dist' -a X"$1" != X"-c"; then
	echo 'x - skipping Makefile.dist (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting Makefile.dist (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'Makefile.dist' &&
XPROGNAME = stripsgml
X
X# Please edit these to reflect your local file structure & conventions.
XDESTDIR = /usr/local/bin
XOWNER = bin
XGROUP = bin
X
XSRC = $(PROGNAME).icn stripunb.icn readtbl.icn slashbal.icn
X
X$(PROGNAME): $(SRC)
X	icont -o $(PROGNAME) $(SRC)
X
X# Pessimistic assumptions regarding the environment (in particular,
X# I don't assume you have the BSD "install" shell script).
Xinstall: $(PROGNAME)
X	@sh -c "test -d $(DESTDIR) || (mkdir $(DESTDIR) && chmod 755 $(DESTDIR))"
X	cp $(PROGNAME) $(DESTDIR)/
X	chgrp $(GROUP) $(DESTDIR)/$(PROGNAME)
X	chown $(OWNER) $(DESTDIR)/$(PROGNAME)
X	@echo "\nInstallation done.\n"
X
Xclean:
X	-rm -f *~ *.u?
X	-rm -f $(PROGNAME)
SHAR_EOF
true || echo 'restore of Makefile.dist failed'
rm -f _shar_wnt_.tmp
fi
exit 0
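
The translation file format described in the stripsgml.icn header can
be tried with a small hand-made file.  This is a sketch only: the file
name sgml.map, the function name make_map, and the replacement strings
are invented for illustration, and the last command assumes stripsgml
has already been built in the current directory.

```shell
# make_map FILE: write a three-field translation file, one tag per line:
#     code <TAB> initialization <TAB> completion
# (a colon may be used instead of the tab as the separator).
make_map() {
    {
        printf 'bold\t*\t*\n'         # <bold> becomes *, </bold> becomes *
        printf 'underline:_:_\n'      # same layout, colon-separated
    } > "$1"
}

make_map sgml.map
cat sgml.map

# Run the translator only if it has actually been built here.
if [ -x ./stripsgml ]; then
    printf '%s\n' 'Some <bold>heavy</bold> text.' | ./stripsgml -f sgml.map
fi
```

Plain printable strings are used here so the result is easy to read on
any terminal; escape sequences such as those shown in the stripunb.icn
header would go in the same two fields.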


