Unix binary/text files: is there a difference?

der Mouse mouse at thunder.mcrcim.mcgill.edu
Tue Mar 26 17:07:24 AEST 1991


In article <77384 at bu.edu.bu.edu>, jdubb at bucsf.bu.edu (jay dubb) writes:
> I've looked in a bunch of C and Unix books, and can't seem to find a
> good explanation of this - maybe someone can help... Is there a way
> to tell (from a C program) whether a given file contains text or
> data?

No.  It's not a well-defined distinction, for one thing.  Many files
are both text and data - any file interpreted by a program can be
considered data....

> The reason I'd like to know, is that I've noticed that if you have a
> file into which you have done something like
> write(fid,&an_int,sizeof(int)) and then you take this file to another
> machine via FTP (in binary mode), and try to read() the int back, it
> doesn't work (because of byte-order differences, I assume).

Possibly size differences as well; sizeof(int) varies between machines,
and sometimes an int is only 16 bits.
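
One common way around both problems, sketched here under assumptions
that are not in the original question: pick a fixed width and a fixed
byte order for the file, and convert explicitly when writing and
reading.  The helper names (put_u32_be, get_u32_be, write_u32,
read_u32) are made up for illustration, big-endian is an arbitrary
choice, and uint32_t assumes a C99 <stdint.h>.

```c
#include <stdint.h>
#include <stdio.h>

/* Store a 32-bit value as exactly four bytes, most significant first,
 * regardless of the host's own byte order or sizeof(int). */
static void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Inverse of put_u32_be: rebuild the value from four bytes. */
static uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Write one value to a stream; returns 0 on success, -1 on failure. */
static int write_u32(FILE *fp, uint32_t v)
{
    unsigned char buf[4];

    put_u32_be(buf, v);
    return fwrite(buf, 1, sizeof buf, fp) == sizeof buf ? 0 : -1;
}

/* Read one value from a stream; returns 0 on success, -1 on failure. */
static int read_u32(FILE *fp, uint32_t *v)
{
    unsigned char buf[4];

    if (fread(buf, 1, sizeof buf, fp) != sizeof buf)
        return -1;
    *v = get_u32_be(buf);
    return 0;
}
```

A file written with write_u32() on one machine and transferred in
binary mode should read back the same value with read_u32() on another,
since neither side depends on its native int layout.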

> So, what I'd like to know is, is there a difference (in terms of
> something stat() could tell me, for example) between straight text
> files and files which contain raw numbers (without searching through
> the whole file to check, hopefully)?

No.  The only distinction is the contents.  (It's true that executable
binaries typically have their execute bits turned on, but so do shell
scripts, and many binary files have no execute bits set at all.)  UNIX
is not a system like VMS, where the filesystem imposes lots and lots of
structure on file contents.
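
To make that concrete, here is a small sketch of about all that stat()
can offer on this question: the file type (regular file, directory,
...) and the permission bits.  The function name is hypothetical, and
the result says nothing about whether the contents are text.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Returns 1 if path is a regular file with at least one execute bit
 * set, 0 if it is not, and -1 if stat() fails.  A shell script and a
 * compiled binary both return 1; a raw data file and a text file
 * without execute bits both return 0. */
static int is_executable_regular(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISREG(st.st_mode))
        return 0;
    return (st.st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) != 0;
}
```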

> the 'file' command seems to be able to do this - I've tried it on a
> text file, and on a file with raw ints and floats, and it says "text"
> and "data" respectively. Does it really know, or is it making a guess

It is making a guess based on reading some small portion of the file
(typically the first 1K or 4K or so) and applying various heuristics.
Often there is a file of "magic numbers" which describes various
identifiable patterns, such as the 0x1f 0x9d in the first two bytes of
a compressed file, but that's a frill for the purposes under
discussion.
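
A rough sketch of that kind of guessing follows.  It is not the actual
source of 'file'; the names (enum kind, guess_kind, guess_stream), the
1K buffer and the "printable ASCII plus a few control characters" rule
are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdio.h>

enum kind { KIND_EMPTY, KIND_COMPRESSED, KIND_DATA, KIND_TEXT };

/* Guess what the first n bytes of a file look like. */
static enum kind guess_kind(const unsigned char *buf, size_t n)
{
    size_t i;

    if (n == 0)
        return KIND_EMPTY;

    /* Magic number check: compress(1) output starts with 0x1f 0x9d. */
    if (n >= 2 && buf[0] == 0x1f && buf[1] == 0x9d)
        return KIND_COMPRESSED;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        /* Common whitespace and backspace are acceptable in text. */
        if (c == '\n' || c == '\t' || c == '\r' || c == '\f' || c == '\b')
            continue;
        /* Anything else outside printable ASCII means "data". */
        if (c < 0x20 || c > 0x7e)
            return KIND_DATA;
    }
    return KIND_TEXT;
}

/* Apply the guess to the first 1K of an already opened stream. */
static enum kind guess_stream(FILE *fp)
{
    unsigned char buf[1024];
    size_t n = fread(buf, 1, sizeof buf, fp);

    return guess_kind(buf, n);
}
```

Because only a prefix of the file is examined, and because any run of
bytes that happens to be printable passes the test, the answer is
exactly what it claims to be: a guess.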

You were also lucky.  If your int happened to have the value 0x0a6f6f66
(175075174 in decimal) on a little-endian machine, a data file
containing just that int will look like a text file with just one line
reading "foo".  Of course, the chance of this goes down sharply with
the number of "raw" numbers being written, and other factors, but you
get the idea.
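
The arithmetic can be checked with a few lines of C.  This sketch lays
the value out in little-endian order by hand, so it gives the same
bytes on any host; the function name is made up, uint32_t assumes C99,
and the character values assume ASCII.

```c
#include <stdint.h>

/* Lay out v the way a little-endian machine stores a 32-bit int in
 * memory: least significant byte first. */
static void store_u32_le(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)v;          /* 0x66 'f'  */
    out[1] = (unsigned char)(v >> 8);   /* 0x6f 'o'  */
    out[2] = (unsigned char)(v >> 16);  /* 0x6f 'o'  */
    out[3] = (unsigned char)(v >> 24);  /* 0x0a '\n' */
}
```

Calling store_u32_le(buf, 0x0a6f6f66) and writing the four bytes of buf
to a file gives a file holding the single line "foo".  The per-byte
comments above apply to that particular value only.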

					der Mouse

			old: mcgill-vision!mouse
			new: mouse at larry.mcrcim.mcgill.edu



More information about the Comp.unix.programmer mailing list