Hex escape for quoted multibyte character

Teruhiko Kurosaka - Sun Intercon kuro%shochu at Sun.COM
Wed Apr 26 04:18:40 AEST 1989


I have a question about the relationship among three new concepts and
notations introduced by the ANSI C draft: multibyte characters,
wide characters, and hexadecimal escape notation.

For the following discussion, let's assume that a character X is a
multibyte character represented on some system by the three-byte
sequence 0x8e 0xab 0xcd.

The first question I have is how to represent this three-byte character
with hexadecimal escape sequences within a double-quoted string.
The draft (12/7/88 p.30 line 14) says:
	The hexadecimal digits that follow the backslash and the letter x in
	a hexadecimal escape sequence are taken to be part of the construction of a
	single character for an integer character constant or of a single wide
	character for a wide character constant.  The numeric value of the
	hexadecimal integer so formed specifies the value of the desired
	character or wide character.

If I take this literally, it would be:
	char *the_multibyte_char="\x8eabcd";		/* I-1 */

However, I noticed that the draft sometimes uses the words "character" and
"byte" interchangeably.  If "character" here actually means a byte, then
	char *the_multibyte_char="\x8e\xab\xcd";	/* I-2 */
must be the right notation.
What I mean to express here is:
	char the_multibyte_char_array[]={0x8e, 0xab, 0xcd, 0};	
	char *the_multibyte_char=the_multibyte_char_array;	
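
The intent of I-2 can be sketched as a small check.  This is only an
illustration under stated assumptions: an implementation with 8-bit chars
on which each hexadecimal escape in a narrow string yields one byte.  The
function names are hypothetical and do not come from the draft.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the I-2 spelling yields exactly the
   bytes 0x8e 0xab 0xcd followed by a terminating 0. */
int i2_matches_byte_array(void)
{
    static const char from_escapes[] = "\x8e\xab\xcd";
    static const unsigned char expected[] = { 0x8e, 0xab, 0xcd, 0 };

    return sizeof from_escapes == sizeof expected
        && memcmp(from_escapes, expected, sizeof expected) == 0;
}

/* A hex escape absorbs every hex digit that follows it, so the I-1
   spelling "\x8eabcd" is read as one escape, not as three bytes.
   Splitting the literal ends the escape early: "\x8e" "ab" is the byte
   0x8e followed by the ordinary letters 'a' and 'b'. */
size_t split_literal_length(void)
{
    static const char split[] = "\x8e" "ab";
    return strlen(split);   /* one byte from the escape, then 'a', 'b' */
}
```

The split-literal form matters only when the text after an escape happens
to begin with a hexadecimal digit.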


Another related question is how to use the hexadecimal escape in
a wide character string ( L"..." ).  Let's say the wide character value
for this character X is 0xbcde.  Then should a wide character string
that contains only the one character X be written as:
	wchar_t *the_wide_char_str=L"\xbcde";		/* II-1 */
or should it be:
	wchar_t *the_wide_char_str=L"\xbc\xde";		/* II-2 */
to mean:
	wchar_t the_wide_char_array[]={0xbcde, 0};
	wchar_t *the_wide_char_str=the_wide_char_array;
?
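
The difference between II-1 and II-2 can be made concrete by counting
elements.  This is a sketch only, assuming an implementation on which each
hexadecimal escape in a wide string literal yields one wide character and
on which wchar_t can hold 0xbcde.  The names are hypothetical.

```c
#include <stddef.h>

/* II-1: a single escape, so one wide character plus the terminator. */
static const wchar_t one_escape[] = L"\xbcde";

/* II-2: two escapes, so two wide characters (0xbc and 0xde) plus the
   terminator. */
static const wchar_t two_escapes[] = L"\xbc\xde";

/* Number of wide characters before the terminating zero. */
size_t ii1_length(void)
{
    return sizeof one_escape / sizeof one_escape[0] - 1;
}

size_t ii2_length(void)
{
    return sizeof two_escapes / sizeof two_escapes[0] - 1;
}

/* Returns 1 if II-1 produced the wide character value assumed for X. */
int ii1_first_is_x(void)
{
    return one_escape[0] == 0xbcde;
}
```

Under these assumptions only II-1 matches the array {0xbcde, 0}.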

And finally, which is right?
	wchar_t the_wide_char=L'\xbcde';		/* III-1 */
	wchar_t the_wide_char=L'\xbc\xde';		/* III-2 */
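
A sketch of the III-1 reading, assuming wchar_t can hold 0xbcde; the
function name is hypothetical.  III-2 is deliberately left out of the
compiled code: it places two escapes in one wide character constant, and
the value of a constant containing more than one character depends on the
implementation.

```c
#include <stddef.h>

/* III-1: one escape in a wide character constant specifies the value of
   that one wide character.  Returns 1 if the value is 0xbcde. */
int iii1_value_is_bcde(void)
{
    wchar_t the_wide_char = L'\xbcde';
    return the_wide_char == 0xbcde;
}
```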

My personal choices are I-2, II-1 and III-1.  This is based on my personal belief that
a hexadecimal escape sequence should describe the value of the 'atom' element
of a notation.  Because a double-quoted string is of type (char *), its atom's datatype
is char, which actually means a byte for historical reasons all of you know.  Therefore
an escape sequence should describe a byte.  For the same reason, a hexadecimal
escape sequence within a wide character constant or string literal should describe
a wide character.

I would like to know what other people think about this.
In your response, please distinguish between what you think ANSI C should have been and
how the ANSI C spec (draft) should be interpreted.
Thank you in advance.

-T.Kurosaka, Sun Microsystems


