Character encoding

ASCII

ASCII assigns seven-bit codes to the most commonly used characters and to a set of non-printable characters. The non-printable characters serve as control functions for the early textual user interfaces and for data framing. The assignment of numbers to characters was done with the following in mind:

  1. The lowest numbers are for the non-printable (control) characters

  2. Digit characters can easily be converted to their ASCII codes by adding an offset of 0x30

  3. Lower and upper case letters differ in just one bit (value 0x20); when this bit is set, the letter is lower case (see the small C sketch after this list).
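
A small C sketch illustrating points 2 and 3 above; the values can be checked against the ASCII table below:

#include <stdio.h>

int main(void)
{
    /* Point 2: a digit becomes its ASCII code by adding the offset 0x30 ('0') */
    int digit = 7;
    char ascii_digit = (char)(digit + 0x30);   /* 0x37 = '7' */

    /* Point 3: upper and lower case differ only in the 0x20 bit */
    char upper = 'A';                          /* 0x41 */
    char lower = (char)(upper | 0x20);         /* 0x61 = 'a' */

    printf("%c %c\n", ascii_digit, lower);     /* prints: 7 a */
    return 0;
}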

In an ASCII text file a line break has to be inserted where a line ends. The Linux style is to put a single line feed character (LF, 0x0a, in C \n). The Microsoft style puts two characters: carriage return (CR, 0x0d, in C \r) followed by line feed (LF, 0x0a). Therefore, when a Linux style text file is passed to the Microsoft world, the whole file may appear as a single text line. Microsoft text files, on the other hand, usually look correct under Linux. Linux editors can make a mess when editing a Microsoft file and end up with both styles inside; Microsoft editors may behave strangely when they open a file with Linux line ends: they might ignore the line breaks, show little squares, or handle them correctly.
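
In practice, tools such as dos2unix convert between the two styles; as an illustration only, a minimal C sketch that converts Microsoft style line ends read from stdin to Linux style on stdout could look like this:

#include <stdio.h>

int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        if (c == '\r') {
            int next = getchar();
            if (next == '\n') {
                putchar('\n');             /* CR LF becomes a single LF */
            } else {
                putchar('\r');             /* lone CR: keep it unchanged */
                if (next != EOF)
                    putchar(next);
            }
        } else {
            putchar(c);
        }
    }
    return 0;
}

Run it for example as ./crlf2lf < dosfile.txt > linuxfile.txt (the program and file names are placeholders).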

Soon people wanted to have more characters. Many localized, conflicting character tables were invented. Finally UTF-8, which is upward compatible with ASCII, was introduced, and the conflicting character tables should not be used anymore.

ASCII Table

http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf

Lower 5 bits   Higher 3 bits
               000    001                      010                        011
0 0000         NUL    space                    @ (commercial at)          ` (grave accent)
0 0001         SOH    ! (exclamation mark)     A                          a
0 0010         STX    " (quotation mark)       B                          b
0 0011         ETX    # (number sign)          C                          c
0 0100         EOT    $ (dollar sign)          D                          d
0 0101         ENQ    % (percent sign)         E                          e
0 0110         ACK    & (ampersand)            F                          f
0 0111         BEL    ' (apostrophe)           G                          g
0 1000         BS     ( (left parenthesis)     H                          h
0 1001         HT     ) (right parenthesis)    I                          i
0 1010         LF     * (asterisk)             J                          j
0 1011         VT     + (plus sign)            K                          k
0 1100         FF     , (comma)                L                          l
0 1101         CR     - (hyphen, minus sign)   M                          m
0 1110         SO     . (full stop)            N                          n
0 1111         SI     / (solidus)              O                          o
1 0000         DLE    0                        P                          p
1 0001         DC1    1                        Q                          q
1 0010         DC2    2                        R                          r
1 0011         DC3    3                        S                          s
1 0100         DC4    4                        T                          t
1 0101         NAK    5                        U                          u
1 0110         SYN    6                        V                          v
1 0111         ETB    7                        W                          w
1 1000         CAN    8                        X                          x
1 1001         EM     9                        Y                          y
1 1010         SUB    : (colon)                Z                          z
1 1011         ESC    ; (semicolon)            [ (left square bracket)    { (left curly bracket)
1 1100         FS     < (less-than sign)       \ (reverse solidus)        | (vertical line)
1 1101         GS     = (equals sign)          ] (right square bracket)   } (right curly bracket)
1 1110         RS     > (greater than sign)    ^ (circumflex accent)      ~ (tilde)
1 1111         US     ? (question mark)        _ (low line)               DEL

UTF-8

Once there was 7-bit ASCII and all were happy. With 7 bits, 128 characters can be represented. But a byte has 8 bits, and people wanted more:

  1. A few native language characters such as ö ä ü

  2. Graphical characters to draw some kind of graphics on a character terminal

  3. Smileys and symbols

  4. Japanese, Chinese, Korean, ...

This led to many different 8-bit character sets, where all relevant sets have the 128 characters of the original ASCII set in common. There was obviously a need to standardize those character sets, and many standards were created. ISO-8859 has different parts and defines such character sets; Microsoft uses the term codepage for it.

Unicode (ISO 10646) defines a universal character set that assigns a number to every available character. Obviously one byte containing 8 bits cannot hold such large numbers, so an encoding is necessary.

One approach is UTF-16, used by Microsoft, which simply uses two bytes for each character. This has several disadvantages:

  1. A zero byte in the data could be interpreted as a string terminator (or EOF) and abort a running low level routine in an application not dealing with UTF-16.

  2. Memory consumption doubles even when writing pure English text that does not need more than 7 bits.

  3. With two bytes, at most 65'536 characters can be represented.

UTF-8 is a way to avoid this overhead and can encode a maximum of 2'097'152 characters (Unicode itself is limited to 1'114'112 code points, U+0000 to U+10FFFF, of which only about 109'449 characters are currently assigned). In UTF-8 a character dynamically requires 1 to 4 bytes; this has the disadvantage that a UTF-8 file needs to be read and parsed sequentially. The 7-bit ASCII characters are encoded identically in UTF-8, so a pure ASCII file is also a valid UTF-8 file. The encoding is best understood by looking at the following table:

Table 3.1. Unicode to UTF-8

Unicode character number                Binary coding
0x0 .. 0x7F (0 .. 127)                  0xxxxxxx                             1 byte character, identical to 7-bit ASCII
0x80 .. 0x7FF (128 .. 2047)             110xxxxx 10xxxxxx                    2 byte character
0x800 .. 0xFFFF (2048 .. 65535)         1110xxxx 10xxxxxx 10xxxxxx           3 byte character
0x10000 .. 0x1FFFFF (65536 .. 2097151)  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 byte character
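
To illustrate the table, here is a minimal C sketch that encodes a single Unicode code point into UTF-8 bytes; the function name and the example code point are chosen only for illustration:

#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode code point into UTF-8 following Table 3.1.
   Writes up to 4 bytes to out and returns the number of bytes,
   or 0 if the code point is out of range. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                          /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {                  /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x1FFFFF) {               /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x00F6, buf);          /* U+00F6 is the character ö */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);               /* prints: C3 B6 */
    printf("\n");
    return 0;
}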


Note

The close relationship and compatibility look good; however, they can suddenly lead to bad software crashes when a program expects plain ASCII but then receives UTF-8.

UTF-8 uses the most significant bit of a byte to indicate that more than one byte is required for the character. Every byte belonging to a non-ASCII character has this bit set to 1, so even if the byte sequence gets corrupted or reading starts in the middle of a multi-byte character, no byte can be wrongly interpreted as an ASCII character.
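
This property can be used, for example, to count the characters of a UTF-8 string by simply skipping all continuation bytes; a minimal sketch, the function name is made up for illustration:

#include <stdio.h>

/* Count the characters in a UTF-8 string: every byte of the form
   10xxxxxx (0x80..0xBF) is a continuation byte and is skipped. */
static int utf8_strlen(const char *s)
{
    int count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    }
    return count;
}

int main(void)
{
    printf("%d\n", utf8_strlen("h\xC3\xB6llo"));  /* "höllo": 6 bytes, 5 characters */
    return 0;
}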

For changing the encoding of filenames, app-text/convmv can be used.

Example usage of convmv

emerge --ask app-text/convmv

(Command format)

convmv -f <current-encoding> -t utf-8 <filename>

(Substitute ISO-8859-1 with the character set you are converting from)

convmv -f iso-8859-1 -t utf-8 <filename>

For changing the contents of files, use the iconv utility, bundled with glibc:

Example usage of iconv (substitute iso-8859-1 with the charset you are converting from)

(Check the output is sane)

iconv -f iso-8859-1 -t utf-8 <filename>

(To convert a file, the output must be written to a new file)

iconv -f iso-8859-1 -t utf-8 <filename> > <newfile>
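
The same conversion can also be done programmatically with the iconv(3) functions provided by glibc; a minimal sketch, the sample string is only an example:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char in[] = "Gr\xFC" "ezi";                /* "Grüezi" encoded in ISO-8859-1 */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");   /* to-encoding, from-encoding */
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    *outp = '\0';
    printf("%s\n", out);                       /* the same string, now encoded in UTF-8 */
    iconv_close(cd);
    return 0;
}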

app-text/recode can also be used for this purpose. For example, to convert from ISO-8859-15 (also known as Latin-9) to UTF-8, run recode l9..u8 <my>.xml

Note

Having UTF-8 enabled does not mean that your computer has all the required fonts. Fonts and character encoding are not the same!

With so many characters, a problem is finding out which number corresponds to which character. Helpful references are http://www.unicode.org/charts/index.html or http://unicode.coeurlumiere.com/

beep

Sending out 0x07, the bell character, should make a beep: in C printf("\a"); (a stands for alert), or in the shell echo -e '\a' or echo -ne '\007' (-e makes echo interpret the \ escapes). However this might fail, since the loudspeaker is a device, and on a multi-user system such as Linux there might be no permission to use it.

There is also a package called beep; after installing it, beep can be tested in the console. As beep -h or man beep shows, beep has some options.

Note

Especially in a desktop environment it might happen that the root console can beep but the user console cannot. A way out is to use the sound card instead: aplay <soundfile>

Finally, the kernel module pcspkr, which controls the speaker, needs to be loaded.
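
(If pcspkr is not loaded automatically, it can be loaded manually)

modprobe pcspkr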

