Character encoding

ASCII

ASCII assigns seven-bit codes to the most commonly used characters and to a set of non-printable characters. The non-printable characters serve as control functions for the early textual user interfaces and for data framing. The assignment of numbers to characters was done with the following in mind:

  1. The lowest numbers are for the non-printable (control) characters

  2. Digit characters can easily be converted to their ASCII codes by adding an offset of 0x30

  3. Lower and upper case letters differ in just one bit (value 0x20); when this bit is set, the letter is lower case (see the small C sketch after this list).
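
A small C sketch illustrating points 2 and 3 above; the values can be checked against the ASCII table below:

#include <stdio.h>

int main(void)
{
    /* Point 2: a digit becomes its ASCII code by adding the offset 0x30 ('0') */
    int digit = 7;
    char ascii_digit = (char)(digit + 0x30);   /* 0x37 = '7' */

    /* Point 3: upper and lower case differ only in the 0x20 bit */
    char upper = 'A';                          /* 0x41 */
    char lower = (char)(upper | 0x20);         /* 0x61 = 'a' */

    printf("%c %c\n", ascii_digit, lower);     /* prints: 7 a */
    return 0;
}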

In an ASCII text file a line break has to be inserted where a line ends. The Linux style is to put a single line feed character (LF, 0x0a, in C \n). The Microsoft style puts two characters: carriage return (CR, 0x0d, in C \r) followed by line feed (LF, 0x0a). Therefore, when a Linux style text file is passed to the Microsoft world, the whole file may appear as a single text line. Microsoft text files, on the other hand, usually look correct under Linux. Linux editors can make a mess when editing a Microsoft file and end up with both styles inside; Microsoft editors may behave strangely when they open a file with Linux line ends: they might ignore the line breaks, show little squares, or handle them correctly.
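
In practice, tools such as dos2unix convert between the two styles; as an illustration only, a minimal C sketch that converts Microsoft style line ends read from stdin to Linux style on stdout could look like this:

#include <stdio.h>

int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        if (c == '\r') {
            int next = getchar();
            if (next == '\n') {
                putchar('\n');             /* CR LF becomes a single LF */
            } else {
                putchar('\r');             /* lone CR: keep it unchanged */
                if (next != EOF)
                    putchar(next);
            }
        } else {
            putchar(c);
        }
    }
    return 0;
}

Run it for example as ./crlf2lf < dosfile.txt > linuxfile.txt (the program and file names are placeholders).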

Soon people wanted to have more characters. Many localized, conflicting character tables were invented. Finally UTF-8, which is upward compatible with ASCII, was introduced, and the conflicting character tables should not be used anymore.

ASCII Table

http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf

Lower 5 bits   Higher 3 bits
               000    001                      010                        011
0 0000         NUL    space                    @ (commercial at)          ` (grave accent)
0 0001         SOH    ! (exclamation mark)     A                          a
0 0010         STX    " (quotation mark)       B                          b
0 0011         ETX    # (number sign)          C                          c
0 0100         EOT    $ (dollar sign)          D                          d
0 0101         ENQ    % (percent sign)         E                          e
0 0110         ACK    & (ampersand)            F                          f
0 0111         BEL    ' (apostrophe)           G                          g
0 1000         BS     ( (left parenthesis)     H                          h
0 1001         HT     ) (right parenthesis)    I                          i
0 1010         LF     * (asterisk)             J                          j
0 1011         VT     + (plus sign)            K                          k
0 1100         FF     , (comma)                L                          l
0 1101         CR     - (hyphen, minus sign)   M                          m
0 1110         SO     . (full stop)            N                          n
0 1111         SI     / (solidus)              O                          o
1 0000         DLE    0                        P                          p
1 0001         DC1    1                        Q                          q
1 0010         DC2    2                        R                          r
1 0011         DC3    3                        S                          s
1 0100         DC4    4                        T                          t
1 0101         NAK    5                        U                          u
1 0110         SYN    6                        V                          v
1 0111         ETB    7                        W                          w
1 1000         CAN    8                        X                          x
1 1001         EM     9                        Y                          y
1 1010         SUB    : (colon)                Z                          z
1 1011         ESC    ; (semicolon)            [ (left square bracket)    { (left curly bracket)
1 1100         FS     < (less-than sign)       \ (reverse solidus)        | (vertical line)
1 1101         GS     = (equals sign)          ] (right square bracket)   } (right curly bracket)
1 1110         RS     > (greater than sign)    ^ (circumflex accent)      ~ (tilde)
1 1111         US     ? (question mark)        _ (low line)               DEL

UTF-8

Once there was 7-bit ASCII and all were happy. With 7 bits, 128 characters can be represented. But a byte has 8 bits, and people wanted more:

  1. A few native language characters such as ö ä ü

  2. Graphical characters to draw some kind of graphics on a character terminal

  3. Smileys and symbols

  4. Japanese, Chinese, Korean, ...

This led to many different 8-bit character sets, where all relevant sets have the 128 characters of the original ASCII set in common. There was obviously a need to standardize those character sets, and many standards were created. ISO-8859 has different parts and defines such character sets; Microsoft uses the term codepage for it.

Unicode (ISO 10646) defines a universal character set that assigns a number to every available character. Obviously one byte containing 8 bits cannot hold such large numbers, so an encoding is necessary.

One approach is UTF-16, used by Microsoft, which simply uses two bytes for each character. This has several disadvantages:

  1. A zero byte in the data could be interpreted as a string terminator (or EOF) and abort a running low level routine in an application not dealing with UTF-16.

  2. Memory consumption doubles even when writing pure English text that does not need more than 7 bits.

  3. With two bytes, at most 65'536 characters can be represented.

UTF-8 is a way to avoid this overhead and can encode a maximum of 2'097'152 characters (Unicode itself is limited to 1'114'112 code points, U+0000 to U+10FFFF, of which only about 109'449 characters are currently assigned). In UTF-8 a character dynamically requires 1 to 4 bytes; this has the disadvantage that a UTF-8 file needs to be read and parsed sequentially. The 7-bit ASCII characters are encoded identically in UTF-8, so a pure ASCII file is also a valid UTF-8 file. The encoding is best understood by looking at the following table:

Table 3.1. Unicode to UTF-8

Unicode character number                Binary coding
0x0 .. 0x7F (0 .. 127)                  0xxxxxxx                             1 byte character, identical to 7-bit ASCII
0x80 .. 0x7FF (128 .. 2047)             110xxxxx 10xxxxxx                    2 byte character
0x800 .. 0xFFFF (2048 .. 65535)         1110xxxx 10xxxxxx 10xxxxxx           3 byte character
0x10000 .. 0x1FFFFF (65536 .. 2097151)  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 byte character
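
To illustrate the table, here is a minimal C sketch that encodes a single Unicode code point into UTF-8 bytes; the function name and the example code point are chosen only for illustration:

#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode code point into UTF-8 following Table 3.1.
   Writes up to 4 bytes to out and returns the number of bytes,
   or 0 if the code point is out of range. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                          /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {                  /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x1FFFFF) {               /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x00F6, buf);          /* U+00F6 is the character ö */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);               /* prints: C3 B6 */
    printf("\n");
    return 0;
}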


Note

The close relationship and compatibility look good; however, they can suddenly lead to bad software crashes when a program expects plain ASCII but then receives UTF-8.

UTF-8 uses the most significant bit of a byte to indicate that more than one byte is required for the character. Every byte belonging to a non-ASCII character has this bit set to 1, so even if the byte sequence gets corrupted or reading starts in the middle of a multi-byte character, no byte can be wrongly interpreted as an ASCII character.
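
This property can be used, for example, to count the characters of a UTF-8 string by simply skipping all continuation bytes; a minimal sketch, the function name is made up for illustration:

#include <stdio.h>

/* Count the characters in a UTF-8 string: every byte of the form
   10xxxxxx (0x80..0xBF) is a continuation byte and is skipped. */
static int utf8_strlen(const char *s)
{
    int count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    }
    return count;
}

int main(void)
{
    printf("%d\n", utf8_strlen("h\xC3\xB6llo"));  /* "höllo": 6 bytes, 5 characters */
    return 0;
}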

For changing the encoding of filenames, app-text/convmv can be used.

Example usage of convmv

emerge --ask app-text/convmv

(Command format)

convmv -f <current-encoding> -t utf-8 <filename>

(Substitute ISO-8859-1 with the character set you are converting from)

convmv -f iso-8859-1 -t utf-8 <filename>

For changing the contents of files, use the iconv utility, bundled with glibc:

Example usage of iconv (substitute iso-8859-1 with the charset you are converting from)

(Check the output is sane)

iconv -f iso-8859-1 -t utf-8 <filename>

(To convert a file, the output must be written to a new file)

iconv -f iso-8859-1 -t utf-8 <filename> > <newfile>
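
The same conversion can also be done programmatically with the iconv(3) functions provided by glibc; a minimal sketch, the sample string is only an example:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char in[] = "Gr\xFC" "ezi";                /* "Grüezi" encoded in ISO-8859-1 */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");   /* to-encoding, from-encoding */
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    *outp = '\0';
    printf("%s\n", out);                       /* the same string, now encoded in UTF-8 */
    iconv_close(cd);
    return 0;
}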

app-text/recode can also be used for this purpose. For example, to convert from ISO-8859-15 (also known as Latin-9) to UTF-8, run recode l9..u8 <my>.xml

Note

Having UTF-8 enabled does not mean that your computer has all the required fonts. Fonts and character encoding are not the same!

With so many characters, a problem is finding out which number corresponds to which character. Helpful references are http://www.unicode.org/charts/index.html or http://unicode.coeurlumiere.com/

beep

Sending out 0x07, the bell character, should make a beep: in C printf("\a"); (a stands for alert), or in the shell echo -e '\a' or echo -ne '\007' (-e makes echo interpret the \ escapes). However this might fail, since the loudspeaker is a device, and on a multi-user system such as Linux there might be no permission to use it.

There is also a package called beep; after installing it, beep can be tested in the console. As beep -h or man beep shows, beep has some options.

Note

Especially in a desktop environment it might happen that the root console can beep but the user console cannot. A way out is to use the sound card instead: aplay <soundfile>

Finally, the kernel module pcspkr, which controls the speaker, needs to be loaded.
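
(If pcspkr is not loaded automatically, it can be loaded manually)

modprobe pcspkr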

