ASCII codes the most commonly used characters, plus a set of non-printable characters, in seven bits. The non-printable characters serve as functions for early textual user interfaces and for data framing. The assignment of numbers to characters was done with the following in mind:
The lowest numbers are for the non-printable characters.
Digit characters can easily be converted to ASCII by adding an offset of 0x30.
Lower and upper case letters differ in just one bit (value 0x20): when that bit is set, the letter is lower case (see the sketch after this list).
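The last two points can be seen directly in C. A minimal sketch (it uses no library facilities beyond printf; it just manipulates the code values):

#include <stdio.h>

int main(void)
{
    int digit = 7;
    char c = digit + 0x30;        /* digit 7 -> ASCII character '7' (0x37) */

    char upper = 'A';
    char lower = upper | 0x20;    /* setting the case bit: 'A' (0x41) -> 'a' (0x61) */
    char back  = lower & ~0x20;   /* clearing it again: 'a' -> 'A' */

    printf("%c %c %c %c\n", c, upper, lower, back);  /* prints: 7 A a A */
    return 0;
}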
In an ASCII text file, a line break has to be put where a line ends. The Linux style is to put a single line feed character (LF, 0x0a, in C \n). The Microsoft style puts two characters: carriage return (CR, 0x0d, in C \r) followed by line feed (LF, 0x0a). Therefore, when a Linux-style text file is passed to the Microsoft world, the whole file may appear as a single text line; Microsoft-style ASCII files, on the other hand, look correct under Linux. Linux editors can make a mess when opening a Microsoft file and end up with both styles inside, while Microsoft editors behave strangely when they open a Linux-style file: they might ignore the line breaks, insert little squares, or handle it correctly.
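As an illustration, a minimal C filter that converts Microsoft-style line breaks to Linux style by simply dropping every CR byte (dedicated tools such as dos2unix do the same job more carefully):

#include <stdio.h>

/* Read a text stream on stdin, write it to stdout without CR characters,
   turning CR LF line breaks into plain LF. */
int main(void)
{
    int c;
    while ((c = getchar()) != EOF)
        if (c != '\r')            /* drop CR (0x0d), keep everything else */
            putchar(c);
    return 0;
}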
Soon people wanted more characters, and many localized, mutually conflicting character tables were invented. Finally UTF-8, which is upward compatible with ASCII, was introduced, and the conflicting character tables should not be used anymore.
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf
Lower 5 bits \ Higher 3 bits | 000 | 001 | 010 | 011
0 0000 | NUL | space | @ (commercial at) | ` (grave accent)
0 0001 | SOH | ! (exclamation mark) | A | a
0 0010 | STX | " (quotation mark) | B | b
0 0011 | ETX | # (number sign) | C | c
0 0100 | EOT | $ (dollar sign) | D | d
0 0101 | ENQ | % (percent sign) | E | e
0 0110 | ACK | & (ampersand) | F | f
0 0111 | BEL | ' (apostrophe) | G | g
0 1000 | BS | ( (left parenthesis) | H | h
0 1001 | HT | ) (right parenthesis) | I | i
0 1010 | LF | * (asterisk) | J | j
0 1011 | VT | + (plus sign) | K | k
0 1100 | FF | , (comma) | L | l
0 1101 | CR | - (hyphen, minus sign) | M | m
0 1110 | SO | . (full stop) | N | n
0 1111 | SI | / (solidus) | O | o
1 0000 | DLE | 0 | P | p
1 0001 | DC1 | 1 | Q | q
1 0010 | DC2 | 2 | R | r
1 0011 | DC3 | 3 | S | s
1 0100 | DC4 | 4 | T | t
1 0101 | NAK | 5 | U | u
1 0110 | SYN | 6 | V | v
1 0111 | ETB | 7 | W | w
1 1000 | CAN | 8 | X | x
1 1001 | EM | 9 | Y | y
1 1010 | SUB | : (colon) | Z | z
1 1011 | ESC | ; (semicolon) | [ (left square bracket) | { (left curly bracket)
1 1100 | FS | < (less-than sign) | \ (reverse solidus) | | (vertical line)
1 1101 | GS | = (equals sign) | ] (right square bracket) | } (right curly bracket)
1 1110 | RS | > (greater-than sign) | ^ (circumflex accent) | ~ (tilde)
1 1111 | US | ? (question mark) | _ (low line) | DEL
Once there was 7-bit ASCII and all were happy. With 7 bits, 128 characters could be represented. But a byte has 8 bits, and people wanted more:
A few native-language characters: ö ä ü
Graphical characters to draw some kind of graphics on a character terminal
Smileys and symbols
Japanese, Chinese, Korean, ...
This led to many different 8-bit character sets, where all relevant sets had the 128 characters of the original ASCII set in common. There was obviously a need to standardize those character sets, and many standards were created. ISO 8859 defines such sets in its different parts; Microsoft uses the term codepage for the same concept.
Unicode (ISO 10646) defines a universal character set that assigns a number to every available character. Obviously one byte of 8 bits cannot hold such large numbers, so an encoding is necessary.
One approach is UTF-16, used by Microsoft, which simply uses two bytes for each character. This has many disadvantages:
A 0 byte in the data could be interpreted as a string terminator or EOF, cutting short a running low-level routine in an application not aware of UTF-16 (see the sketch after this list).
Memory consumption doubles, even for pure English text that uses no more than 7 bits per character.
A limit of 65,536 characters (strictly, this limit applies to the older UCS-2; UTF-16 gets past it with surrogate pairs).
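The first point can be demonstrated with a few lines of C: the UTF-16LE bytes for the text "Hi" contain zeros, so a routine that treats them as a C string stops far too early.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Hi" in UTF-16LE: each character takes two bytes, the high byte is 0. */
    char utf16le[] = { 'H', 0x00, 'i', 0x00 };

    /* strlen() knows nothing about UTF-16 and stops at the first 0 byte. */
    printf("strlen sees %zu byte(s) of %zu\n",
           strlen(utf16le), sizeof utf16le);   /* prints: 1 of 4 */
    return 0;
}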
UTF-8 is a way to reduce this overhead. It has room for a maximum of 2,097,152 characters (Unicode itself allows at most 1,114,112 characters, U+0000 to U+10FFFF, of which just 109,449 are currently assigned). In UTF-8 a character dynamically requires 1 to 4 bytes; the disadvantage is that a UTF-8 file needs to be read and parsed sequentially. The 7-bit ASCII characters are coded identically in UTF-8, so pure ASCII files are also valid, identical UTF-8 files. The encoding is best understood by looking at the following table:
Table 3.1. Unicode to UTF-8
Unicode character number | binary coding | description
0x0 .. 0x7F (0 .. 127) | 0xxxxxxx | 1 byte character, identical to 7-bit ASCII
0x80 .. 0x7FF (128 .. 2047) | 110xxxxx 10xxxxxx | 2 byte character
0x800 .. 0xFFFF (2048 .. 65535) | 1110xxxx 10xxxxxx 10xxxxxx | 3 byte character
0x10000 .. 0x1FFFFF (65536 .. 2097151) | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 4 byte character
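The table translates almost mechanically into code. A minimal C sketch (the function name utf8_encode is made up here, not a library function) that encodes a Unicode character number according to the four rows above:

#include <stdio.h>

/* Encode the Unicode character number cp into buf according to the table
   above; returns the number of bytes written (1..4), or 0 if out of range.
   utf8_encode is a name made up for this sketch, not a library function. */
int utf8_encode(unsigned long cp, unsigned char buf[4])
{
    if (cp <= 0x7F) {                         /* 0xxxxxxx */
        buf[0] = cp;
        return 1;
    } else if (cp <= 0x7FF) {                 /* 110xxxxx 10xxxxxx */
        buf[0] = 0xC0 | (cp >> 6);
        buf[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp <= 0xFFFF) {                /* 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = 0xE0 | (cp >> 12);
        buf[1] = 0x80 | ((cp >> 6) & 0x3F);
        buf[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else if (cp <= 0x1FFFFF) {              /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = 0xF0 | (cp >> 18);
        buf[1] = 0x80 | ((cp >> 12) & 0x3F);
        buf[2] = 0x80 | ((cp >> 6) & 0x3F);
        buf[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0xFC, buf);           /* 0xFC is the character number of u-umlaut */
    for (int i = 0; i < n; i++)
        printf("0x%02X ", buf[i]);            /* prints: 0xC3 0xBC */
    printf("\n");
    return 0;
}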
This close relationship and compatibility look convenient, but they can cause sudden, nasty software failures when a program expects ASCII and UTF-8 arrives instead.
UTF-8 uses the most significant bit of a byte to indicate that more than one byte is required for the character. Every byte belonging to a non-ASCII character has its most significant bit set to 1, so even if the reading sequence gets corrupted, no byte of a multi-byte character can be misinterpreted as an ASCII character.
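A sketch of what that property buys: a reader that lands in the middle of a character can resynchronize by skipping continuation bytes, which all match the bit pattern 10xxxxxx.

#include <stdio.h>

/* A byte of the form 10xxxxxx can never be an ASCII character or the start
   of a UTF-8 sequence, so a desynchronized reader simply skips such bytes. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;    /* top two bits are 10 */
}

int main(void)
{
    unsigned char utf8[] = { 0xC3, 0xBC, 'x' };  /* u-umlaut followed by 'x' */
    int i = 1;                    /* pretend we started reading mid-character */
    while (is_continuation(utf8[i]))
        i++;
    printf("resynchronized at byte %d: %c\n", i, utf8[i]);  /* byte 2: x */
    return 0;
}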
For changing the encoding of filenames, app-text/convmv can be used.
Example usage of convmv:
emerge --ask app-text/convmv
(Command format)
convmv -f <current-encoding> -t utf-8 <filename>
(Substitute iso-8859-1 with the character set you are converting from)
convmv -f iso-8859-1 -t utf-8 <filename>
For changing the contents of files, use the iconv utility, bundled with glibc:
Example usage of iconv (substitute iso-8859-1 with the charset you are converting from):
(Check that the output is sane)
iconv -f iso-8859-1 -t utf-8 <filename>
(Convert a file; the output must go to a new file)
iconv -f iso-8859-1 -t utf-8 <filename> > <newfile>
app-text/recode can also be used for this purpose. To convert a file from ISO-8859-15 (also known as Latin-9) to UTF-8 in place, do:
recode l9..u8 <filename>
Having UTF-8 enabled does not mean that your computer has all the required fonts. Fonts and character encodings are not the same thing!
A problem with having so many characters is finding out which number corresponds to which character. Helpful references are http://www.unicode.org/charts/index.html and http://unicode.coeurlumiere.com/.
Sending out 0x07, the bell character, with C printf("\a"); (a stands for alert), with echo -e '\a', or with echo -ne '\007' (-e makes echo interpret the \ character) should make a beep. However, this might fail: the loudspeaker is a device, and on a multi-user system such as Linux there might be no permission to beep.
There is also a package called beep; after installing it, beep can be tested in the console. As beep -h or man beep shows, beep has some options, for example:
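(A 440 Hz beep lasting 200 ms; -f sets the frequency and -l the length in the classic beep utility)
beep -f 440 -l 200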
Especially in a desktop environment, it might happen that the root console can beep but the user console cannot. A way out is to use the sound card: aplay <soundfile>.
Finally, the kernel module pcspkr, which controls the speaker, needs to be loaded.
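(Load the PC speaker driver by hand, as root, if it is missing)
modprobe pcspkr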