DRAFT AND NOTES FOR PRESENTATION
================================

Page 1
=============================================================================

Introduction
Computers and Unicode
Internationalization and localization
Linux for non-English people
Unicode in the console
Unicode in X Window
  - font support
  - locales
  - keyboard input; SCIM
  - printing
More information
Questions?

Page 2
=============================================================================

Introduction
  - Eric Hameleers

Page 3
=============================================================================

Computers and Unicode

In the early days of computing, all written (English) text could be
represented by a 7-bit character set (ASCII). This was later expanded to
8-bit character sets such as Latin-1, with different encodings (code tables)
for different European languages.

Non-European languages required more than the 256 character slots that an
8-bit code table offers. Unicode was invented to address this issue. Unicode
gives each character a numerical place but does not define how the character
is displayed; the font designer decides this.

Unicode was expanded twice and can now hold over 1 million characters. The
highest code point is currently 1,114,111 (2^20 + 2^16 - 1; hexadecimal value
10FFFF). A single shared code space is needed because with the earlier,
separate character sets, text in Arabic characters for instance could not be
represented in a Chinese character set.

Unicode assigns numbers to characters (code points) but it says nothing
about how those numbers are stored (the encoding). Several encoding standards
exist, of which UTF-8 is the most widely used nowadays. UTF-8 is compatible
with 7-bit ASCII (stored in one byte) and uses 2, 3 or 4 bytes for every
other character: the non-ASCII Latin-1 characters use two bytes, and the rest
of the Basic Multilingual Plane, which encompasses virtually all other
languages, uses at most three.

UTF-16 (2 or 4 bytes per character) has better compatibility with other
encodings like UCS-2, but it takes more space to store the Roman (ASCII)
alphabet.
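The byte counts for UTF-8 can be checked from a shell. What follows is only
a minimal sketch: it assumes a POSIX shell with printf and wc, and that the
script itself is saved as UTF-8. The helper name utf8_bytes is made up for
this illustration and is not a standard tool.

```shell
# utf8_bytes: hypothetical helper, not a standard command.
# Prints how many bytes its argument occupies (the script is assumed to be UTF-8).
utf8_bytes() {
    # wc -c counts bytes whatever the locale; $(( )) drops any padding wc adds
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

utf8_bytes 'A'    # ASCII letter                       -> prints 1
utf8_bytes 'é'    # Latin-1 range (U+00E9)             -> prints 2
utf8_bytes '€'    # Basic Multilingual Plane (U+20AC)  -> prints 3
utf8_bytes '𝄞'    # outside the BMP (U+1D11E)          -> prints 4
```

The same helper can be pointed at any string you paste into the script, which
makes it easy to see how much room a text in your own language takes in UTF-8.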
Page 4
=============================================================================

Locales

A 'locale' is a set of variables that define the user's language, country
and local preferences (number formats, currency and such). The user's locale
is used to adapt the user interface to what the user is accustomed to. Your
locale also defines the character set your applications use.

You can list all available locales on the computer with this command:

  $ locale -a

You can find out which locale is currently active by typing:

  $ locale

This outputs a list of environment variables like below:

  LANG=en_US.UTF-8
  LC_CTYPE=en_US.UTF-8
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE=C
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

Of these, LANG and LC_ALL will be used most frequently. LANG sets the
default for every category; LC_ALL, when set, overrides all the others.

Example: I want my desktop to display messages and menus in English but use
Dutch localization (measurements, currency, time display, word sort, paper
sizes and such):

  LANG="en_US"
  LC_MESSAGES="en_US"
  LC_CTYPE="nl_NL@euro"
  LC_COLLATE="nl_NL@euro"
  LC_TIME="nl_NL"
  LC_NUMERIC="nl_NL"
  LC_MONETARY="nl_NL@euro"
  LC_PAPER="nl_NL"
  LC_TELEPHONE="nl_NL"
  LC_ADDRESS="nl_NL"
  LC_MEASUREMENT="nl_NL"
  LC_NAME="nl_NL"

Locale settings are defined in /etc/profile.d/lang.sh (globally) or
~/.profile (per-user setting).

Page 5
=============================================================================

Internationalization and Localization

Internationalization and localization are related to the ability of software
to be used with different languages and regions.

Internationalization of software (abbreviated to i18n) means that it is
designed to use libraries which enable support for new languages without
having to rewrite the complete application.

Localization of software (L10n) means that you add text translations and
locale-specific elements.
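The split between i18n and L10n can be illustrated with a toy shell script.
This is only a sketch under stated assumptions: real applications normally
rely on a library such as GNU gettext, and the function names translate and
language_of as well as the message key "greeting" are invented for this
example. The program logic (i18n) looks up messages by key; supporting a new
language (L10n) only means adding a line to the catalog.

```shell
# translate: hypothetical message catalog, one branch per language.
# $1 = two-letter language code, $2 = message key
translate() {
    case "$1:$2" in
        nl:greeting) echo "Hallo wereld" ;;
        de:greeting) echo "Hallo Welt" ;;
        *:greeting)  echo "Hello world" ;;   # fall back to English
        *)           echo "$2" ;;            # unknown key: show the key itself
    esac
}

# language_of: hypothetical helper that reduces a locale name to its language,
# e.g. nl_NL.UTF-8 -> nl, nl_NL@euro -> nl, C -> C
language_of() {
    echo "${1%%[_.@]*}"
}

# Pick the message locale the usual way: LC_ALL wins over LC_MESSAGES,
# which wins over LANG; "C" is used when none of them is set.
locale_name=${LC_ALL:-${LC_MESSAGES:-${LANG:-C}}}
translate "$(language_of "$locale_name")" greeting
```

Run with LANG=nl_NL.UTF-8 (and LC_ALL and LC_MESSAGES unset) the last line
picks the Dutch text; with an unknown language it falls back to English.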
All major Free and Open Source initiatives support these standards: KDE,
Gnome, OpenOffice and Mozilla. Many of the smaller FOSS projects implement
support as well.

Page 6
=============================================================================

Linux for non-English people

With locales, UTF-8, i18n and L10n, we have the basics covered. We can fully
customize our Slackware computer for use in our own country.

To start using Unicode, you must define a UTF-8 locale for at least the
language and messages. The easiest way is to edit /etc/profile.d/lang.sh and
change

  export LANG=en_US
  #export LANG=en_US.UTF-8

to

  #export LANG=en_US
  export LANG=en_US.UTF-8

and log in again. Use your own language instead of "en_US" of course. For
this presentation, we will assume "en_US.UTF-8".

Note: the extension to your language setting must be ".UTF-8". When you look
at the output of the 'locale -a' command you will see "en_US.utf8" listed.
Be careful not to copy and paste this value from the 'locale -a' output!
Your Unicode support will not be enabled.

The next sections describe:
  - how to display Unicode in the Linux console
  - how to display and input Unicode in X Window
  - how to print text containing Unicode

Page 7
=============================================================================

Unicode in the console

Using non-Latin text in the console became easier starting with the Linux
2.6.24 kernel: the Linux console is now Unicode-enabled by default. In
earlier days, you had to run

  $ unicode_start

in a console to enable Unicode support.

Slackware disables Unicode in /etc/lilo.conf (user selectable during
installation):

  append="vt.default_utf8=0"

You will need a console font with decent Unicode coverage. Example: the
Terminus font (not part of Slackware).
To load this font after you have installed it, add the following to your
~/.profile file:

  # load the Terminus font only on the Linux text console, not in X terminals
  if [ "$TERM" = "linux" ]; then
    setfont ter-v16n
  fi

The default Slackware console font will be loaded with UTF-8 extensions if
you run the command:

  $ setfont -v

which also gives good results.

There are good reasons why Slackware does not come with a Unicode-enabled
console by default. The collection of tools on your system expects input in
plain Roman text (ASCII text). Entering a Unicode text string on the command
prompt can have unexpected results if the shell interprets this string and
executes the command. Several terminal-based (curses) applications will not
display correctly, especially where graphical characters are involved,
because of double-width Unicode characters. Also, 'man' will show display
glitches that may (or may not) be worked around by starting man like this:

  $ LC_ALL=C man

The 'C' locale is designed for portability and performance. It will be the
default if no specific locale has been set.

Show the characters of your console font:

  $ showconsolefont

Page 8
=============================================================================

Unicode in X Window

Slackware was (and is to some extent) not targeted at international
audiences. The installer is only available in English, and even though the
console accepts various keyboard layouts, the safe way of working has always
been to use the plain "en_US" locale. Adjusting the LANG variable would at
least allow display of your system's messages and logs in your own language.

No decent support existed for non-European languages. Specifically, CJK
(Chinese/Japanese/Korean, or just 'Asian language') support was hard to get
right: Slackware did not have high-quality fonts with full CJK coverage, and
TrueType font rendering of double-width fonts like Chinese was messy and
lacked proper bold-face. Text input of these languages was difficult.
Several initiatives existed which required extensive patching of core
Slackware packages (X, fontconfig, freetype), and the 'best' looking fonts
(or even, good-looking fonts at all) were all produced by Microsoft and thus
non-free.

In Slackware 12.1 we made an effort to converge all the missing pieces into
the core distribution.

- By 2007, CJK font rendering in freetype, fontconfig and the X libraries
  had advanced to a point where patching of the vanilla sources was no
  longer required. This was ideal for Slackware, which has the philosophy of
  applying as little patching as possible and letting the developers use
  these patches to enhance their products. Slackware's X Window system,
  based on X.Org 7.x, will properly render double-width Asian fonts out of
  the box.

- We added several excellent free TrueType Unicode fonts:
  - Redhat Liberation fonts (drop-in replacements for the Microsoft fonts
    Arial, Courier New and Times New Roman. These fonts are metric
    equivalents, which means that web pages designed to be viewed with MS
    fonts will render identically when using Liberation fonts)
  - Zen Hei, part of the Wen Quan Yi (literally 'Spring of Letters') font
    family. This font is primarily added for its crisp display of Chinese
    (simplified and traditional) but it has good coverage of a great many
    other languages too.
  - Sazanami (a good Japanese font)
  - Sinhala (for display of Sanskrit/Sri Lankan texts)
  - TibMachUni (Tibetan Machine Unicode font)

- Crucial was the addition of keyboard input methods that allow the user to
  enter extended (non-Latin) characters using a standard keyboard. Several
  projects exist that implement Input Methods, like UIM, IIIMF, fcitx and
  SCIM. We adopted the SCIM (Smart Common Input Method) platform because it
  is widely adopted among other distributions, is actively developed and
  covers a lot more than just CJK input. These factors make it the safe
  choice.
  It can be argued that fcitx ('Free Chinese Input Toy for X') is faster and
  more elegant, but this IM only supports Chinese, and we wanted to offer
  the broadest support possible.

How do we use SCIM for working with Unicode texts?

Prepare your system for SCIM - read HINTS_AND_TIPS.TXT on the Slackware CD!

- The first requirement is to use a UTF-8 locale. We covered this earlier
  on. The scim daemon will not start if it detects a lack of UTF-8 support.

- Make the scim profile scripts executable. These will set up your
  environment correctly for the use of scim with X applications. Run this
  command:

    # chmod +x /etc/profile.d/scim.*

- Start the scim daemon as soon as your X session starts. The scim daemon
  must be active before any of your X applications. In KDE, you can add a
  shell script to the ~/.kde/Autostart folder that runs the command
  "scim -d". In XFCE you can add "scim -d" to the Autostarted Applications.
  If you boot your computer in runlevel 4 (the graphical XDM/KDM login) you
  can simply add the line "scim -d" to your ~/.xprofile file. This gives
  you a Desktop Environment independent way of starting scim.

- GTK apps like firefox will crash if you remove or forget to install the
  scim-bridge package, so take care which packages you leave out. A full
  Slackware installation is always recommended if your hard drive has the
  space.

Using SCIM

When scim is up and running, you will see a small keyboard icon in your
system tray. Right-click it to enter SCIM Setup. In 'Global Setup' select
your keyboard layout, and you are ready to start entering characters of just
about any language you wish!

Press the magical key combo in order to activate or deactivate SCIM input.
The SCIM taskbar in the desktop's corner allows you to select your language.
As you type, SCIM will show an overview of applicable character glyphs (if
you are inputting complex characters like Japanese or Chinese).

Note that not only CJK is supported.
SCIM offers input methods for many other languages like Greek, Russian, even
accented German.

GTK is Unicode friendly. If you know the number of a code point (Unicode
character) you can input this character into any GTK input field, using the
key combination of -U + number.

Printing Unicode

Printing is supported out of the box by all major applications, such as
OpenOffice and the Mozilla family. For texts that you need to send to the
printer from the command line, use ghostscript to create PostScript or PDF
and send that to your printer. The ghostscript version in Slackware 12.1 has
been tested and approved by native Japanese and Chinese Slackware users.

Page X
=============================================================================

More information:

  http://en.wikipedia.org/wiki/UTF-8
  http://gentoo-wiki.com/HOWTO_Make_your_system_use_unicode/utf-8
  http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.nls/doc/nlsgdrf/locale_env.htm