Using UTF-8 locales in [B]LFS

Source: Internet | Published by: 运营商大数据平台规范 | Editor: 程序博客网 | Time: 2024/06/05 17:30

AUTHOR: Alexander E. Patrakov

DATE: 2003-11-06

LICENSE: Public Domain

SYNOPSIS: Using UTF-8 locales in [B]LFS

DESCRIPTION:
This hint explains what should be changed in the LFS and BLFS instructions
current at the time of this writing in order to use locales such as
ru_RU.UTF-8.

PREREQUISITES: LFS 5.1-pre1 or later, good knowledge of C

CONFLICTS: compressed manual pages

*** NOTE ***
This hint is not maintained by the author.
***

CHANGELOG:
2003-11-06: Initial submission
2004-02-25: Added some BLFS packages

HINT:

IMPORTANT INFORMATION

Don't follow this hint unless you are prepared to fix broken things! I never
had a full BLFS install, and because of that some packages that are broken in
UTF-8 locales may well be missing from this hint.

Also, please don't ask support questions related to this hint on mailing lists
hosted on linuxfromscratch.org (and of course don't provide support yourself)
if you can't answer the questions at the end of the hint.

Also note that while our goal is to move to the international UTF-8 encoding,
we have to disable internationalization completely in some older applications.
So this hint really becomes an antihint: we gain nothing except compatibility
with bleeding-edge RedHat-like distros in their default configuration, and we
lose what we were aiming to get: internationalization.

Once again, don't follow this hint blindly.

BIG WARNING: you will probably have to convert ALL your documents.

Part 1. INTRODUCTION

1. Single-byte and double-byte encodings and UTF-8: what's wrong

Most European languages have a relatively short alphabet (fewer than 40
characters). This makes it possible to represent the characters of that
alphabet (both upper-case and lower-case), the English alphabet, digits and
punctuation with a single byte. The result is known as a single-byte
encoding. An example of such an encoding is KOI8-R, commonly used in Russia.
All single-byte encodings are ASCII-compatible in the sense that characters
representable in ASCII are also representable in these encodings and have the
same code. They are also reverse-ASCII-compatible in the sense that every byte
with a value less than 0x80 represents the same character as it does in
ASCII. Current LFS and BLFS work well with such encodings.

This approach doesn't work with Asian languages such as Chinese, Japanese and
Korean (referred to collectively as CJK in the rest of this hint). They have
more than 256 different characters, because single characters represent
syllables and even words. So-called double-byte encodings are used with these
languages. They represent English letters, digits and punctuation with single
bytes equal to the ASCII representation of those characters, and native CJK
characters with two-byte sequences. An example is GB2312, used in China. Since
CJK characters are twice as wide as English ones in a monospaced font, the
"on-screen" width of a string encoded with such methods is directly
proportional to the number of bytes in it (there is one exception: any
two-byte sequence starting with the 0x8e byte in EUC-JP takes as much space as
an English letter). LFS and BLFS don't work well with Asian languages and
double-byte encodings for two reasons:

1) It is impossible to display double-width characters on a Linux console (even
on a framebuffer console) without additional programs that are not in the book.
Installation of e.g. zhcon corrects this.

2) Some assumptions that work with single-byte encodings fail with double-byte
ones. First, some double-byte encodings are not reverse-ASCII-compatible: a
byte with a value less than 0x80 can be either an ASCII-representable character
or the second byte of a two-byte sequence. Second, correctly finding the n-th
character in a string is a complex task because some characters occupy one
byte, and some characters are represented by two-byte sequences.
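
To make the "n-th character" problem concrete, here is a minimal shell sketch
that counts characters rather than bytes in a double-byte stream. It assumes
an encoding where any byte of 0x80 or above starts a two-byte sequence (true
for GB2312; it ignores special cases such as three-byte EUC-JP sequences).
The function name count_dbcs_chars is made up for this illustration and is
not part of any package.

```shell
#!/bin/sh
# count_dbcs_chars: hypothetical helper, for illustration only.
# Reads bytes on stdin and counts characters, assuming that a byte
# below 0x80 is a complete ASCII character and that any other byte
# is the first byte of a two-byte sequence.
count_dbcs_chars() {
    # od prints every input byte as an unsigned decimal number.
    od -An -v -tu1 | awk '
        {
            for (i = 1; i <= NF; i++) {
                if (skip) { skip = 0; continue }   # second byte: not a new character
                chars++
                if (($i + 0) >= 128) skip = 1      # lead byte: swallow the next byte
            }
        }
        END { print chars + 0 }'
}

# "A" followed by the two-byte GB2312 sequence 0xD6 0xD0:
# three bytes on disk, but only two characters.
printf 'A\326\320' | wc -c
printf 'A\326\320' | count_dbcs_chars
```

The point is that the character count can only be found by walking the string
from its beginning; software that treats byte offsets as character offsets
gets it wrong.
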
Software that makes bad assumptions needs to be either patched or not
installed at all.

Today there is a need to encode multilingual texts. E.g., foreign clients of
companies don't want their names to be distorted beyond recognition by a chain
of multiple transliterations. Since all single-byte and double-byte encodings
are capable of representing characters of at most two alphabets (English +
national), there is a need for a new character set to encode multilingual
texts. Such a character set exists and it is named Unicode.

UTF-8 is a method of representing Unicode text with a stream of 8-bit bytes.
The resulting stream is both ASCII-compatible and reverse-ASCII-compatible. A
single character can occupy from 1 to 4 bytes. Many current distributions of
Linux configure locales using the UTF-8 character encoding by default. This
doesn't work with (B)LFS for the same reasons as with double-byte encodings.
However,

1) There is no framebuffer-based terminal that is capable of displaying the
full range of Unicode characters (if one doesn't count the Debian-specific
bterm from the "bogl" package, bogl = Ben's Own Graphics Library).
Fortunately, it is not needed in most cases. The Linux console is capable of
displaying Latin (including accented), Greek, Arabic and Cyrillic characters
together even without a framebuffer. Also, xterm works just fine.

2) There is one more assumption that breaks with UTF-8. The relation of the
on-screen width of a string to the number of bytes in it is very complex.
That's why e.g. Midnight Commander works with double-byte encodings, but
doesn't work with UTF-8.

3) Many packages in a UTF-8 locale fail to provide compatibility with older
documents saved in a traditional single-byte or double-byte encoding.

Part 1. LFS PACKAGES

1. Suggested changes to the installation instructions

The following packages should be configured differently in Chapter 6:
- ncurses
- vim
- man

1a. Modified Ncurses installation instructions

First of all, you need NCurses 5.4.
Get it from
http://ftp.gnu.org/gnu/ncurses/ncurses-5.4.tar.gz

The new Ncurses version has experimental support for wide characters.
According to the output of ./configure --help, it is activated by passing
the --enable-widec argument to ./configure. The resulting libraries are
binary-incompatible with "normal" ncurses and therefore a letter "w" is
appended automatically to their names: libncursesw.so.5.4. For compatibility
with precompiled commercial applications, we will install two versions of
ncurses.

Now we are ready to install the non-wide-character version of ncurses, almost
by the book:

    ./configure --prefix=/usr --with-shared --without-debug
    make
    make install

This installs /usr/lib/libncurses.so.5.4. We will move it to /lib later.
Then install a wide-character-enabled version:

    make distclean
    ./configure --prefix=/usr --with-shared --without-debug --enable-widec
    make
    make install

This installs /usr/lib/libncursesw.so.5.4 and related libraries.

Move important libraries to /lib and correct permissions:

    chmod 755 /usr/lib/*.5.4
    chmod 644 /usr/lib/libncurses++*.a
    mv /usr/lib/libncurses.so.5* /lib
    mv /usr/lib/libncursesw.so.5* /lib

Make the symbolic links:

    ln -sf ../../lib/libncursesw.so.5 /usr/lib/libncurses.so
    ln -sf libncurses.so /usr/lib/libcurses.so
    ln -sf ../../lib/libncursesw.so.5 /usr/lib/libncursesw.so
    ln -sf libncursesw.so /usr/lib/libcursesw.so

Note the first command. Now all applications trying to link at compile time
against -lncurses will actually link to the wide-character version,
/lib/libncursesw.so.5. This works because the two libraries are
source-compatible. At runtime, the linker will happily resolve the dependency
upon libncursesw.so.5. And for precompiled commercial applications that
depend on the ordinary version of ncurses there is /lib/libncurses.so.5.

1b. Modified Vim instructions

For Vim to work correctly in double-byte encodings and in UTF-8, the
--enable-multibyte switch has to be added to the ./configure command line.
Note that it is not necessary in BLFS since --with-features= (more than
normal) implies this.

    echo '#define SYS_VIMRC_FILE "/etc/vimrc"' >> src/feature.h
    echo '#define SYS_GVIMRC_FILE "/etc/gvimrc"' >> src/feature.h
    ./configure --prefix=/usr --enable-multibyte
    make
    make install
    ln -s vim /usr/bin/vi

Vim is able to edit files in arbitrary encodings if you use a UTF-8-based
locale. E.g. to read the file price.txt that is known to be in CP1251
encoding, type:

    :e ++enc=cp1251 price.txt

It will be automatically converted. To save the file in KOI8-R encoding under
the name price.koi, type:

    :w ++enc=koi8-r price.koi

Vim is even able to automatically detect the character set of the file
being read under some conditions. This works because real texts in most
single-byte and double-byte encodings contain sequences of bytes that are not
valid in UTF-8.

This capability needs to be configured. To do so, create the file /etc/vimrc
with the following contents (replace koi8-r with the name of the single-byte
or double-byte encoding that is most often used in your country):

    " Begin /etc/vimrc
    set nocompatible
    set bs=2
    set fileencodings=ucs-bom,utf-8,koi8-r
    " End /etc/vimrc

For more information, read /usr/share/vim/vim62/doc/mbyte.txt

1c. Modified Man instructions

Since Man internationalization does not work at all in UTF-8 locales (the
messages are still output in single-byte or double-byte encodings, appearing
as lines of unreadable squares on the screen) and because Russian messages are
improperly translated (and offensive!), we will disable NLS. This will not
prevent you from viewing manual pages in your native language. It just means
that messages like "What manual page do you want?"
will remain untranslated.

Install the "man" package with the following commands:

    patch -Np1 -i ../man-<version>-manpath.patch
    patch -Np1 -i ../man-<version>-80cols.patch
    patch -Np1 -i ../man-<version>-pager.patch
    DEFS="-DNONLS" ./configure -default -confdir=/etc +lang all
    make
    make install

Now we have to decide what to do with manual pages in your native language.
They are provided with the corresponding packages in the single-byte or
double-byte encoding, but not in UTF-8. Therefore, they won't display properly.
There are two solutions to this problem.

The first solution is to store them in the single-byte or double-byte encoding,
i.e. as they come with the corresponding packages, and convert them into UTF-8
on the fly. To do this, search for the line in /etc/man.conf that starts with
"PAGER". Replace it with something like the following:

    PAGER /usr/bin/iconv -c -f koi8-r | /usr/bin/less -isR

(replace koi8-r with your 8-bit or double-byte encoding). Note that this change
does not hurt you if you later switch back to the usual encoding: iconv will
be a no-op. Unfortunately, this doesn't work well with graphical man page
viewers like Yelp (from GNOME-2.4) or Konqueror, since they just ignore the
"PAGER" variable in /etc/man.conf (if they read /etc/man.conf at all) and
assume that manual pages are stored in the character set of the current locale.

The second solution would be to convert manual pages to UTF-8. Unfortunately,
I had no success with this.
RedHat provides some patches for groff-1.18.1.
I tried to convert all manual pages into UTF-8 and changed man.conf to have
the line

    # WRONG!
    NROFF /usr/bin/iconv -c -t koi8-r | /usr/bin/nroff -Tlatin1 -mandoc | /usr/bin/iconv -c -f koi8-r

This didn't work well because some manual pages contain just

    .so filename

and don't display properly.

In fact, the *roff specification says that the input must be in iso8859-1
encoding, there is no way to typeset anything except Latin and Greek according
to the specification, and all localized manual pages (even in the single-byte
Cyrillic KOI8-R encoding!) are really a hack and violate the specification.

2. Setting up UTF-8 based locale and environment variables

Some UTF-8 locales (e.g. se_NO.UTF-8) are installed during the

    make localedata/install-locales

step while installing glibc. But most UTF-8 locales must be created
manually, e.g.:

    localedef -c -i ru_RU -f UTF-8 ru_RU.UTF-8

The role of the -c switch is to continue the creation of the locale even if
warnings are issued. After the locale has been created, applications must be
told to use it. All that is required is to set some environment
variables. An easy "solution" is to add this to your /etc/profile:

    # WRONG!!!
    export LC_ALL=ru_RU.UTF-8
    export LANG=ru_RU.UTF-8

This "solution" is wrong because these variables will be available to processes
started from your login shell, but will not be available to the readline
library that the shell uses. The readline library uses this information e.g. to
determine how many bytes to remove from the input buffer (must be one UTF-8
character) and how many character cells to erase on the screen (again, one
full character) if you press the Backspace or Delete key.

Yes, if you _type_ export LC_ALL=ru_RU.UTF-8 in the login shell, then it will
pass this setting to the readline library. But this doesn't work in the shell
startup files. This is a bug in bash.
So the correct LC_ALL variable must already be in the environment when the
login shell starts.

If one adds the above LC_ALL and LANG variables into /etc/environment, it will
work for login shells started by the "login" program, but will not work for
shells started by the "su" or "sshd" programs. This approach also requires you
to place these variables into /etc/profile so that they will be available from
KDE (the "startkde" script from KDE 3.2.0 sources /etc/profile).

Another approach is to make the login shell set the correct locale variables
and reexecute itself. To accomplish this, add the following snippet at the
very beginning of your /etc/profile:

    if [ "x$LC_ALL" = "x" ]
    then
        export LC_ALL=ru_RU.UTF-8
        export LANG=ru_RU.UTF-8
        if ( echo $- | grep -q i )
        then
            exec -a "$0" /bin/bash "$@"
        fi
    fi

The $- check is there because /etc/profile is sometimes sourced by other
scripts that run in noninteractive shells. Such shells don't need to be
reexecuted, since you don't want to replace a script that sourced /etc/profile
with an instance of /bin/bash called with the same parameters as the script.
Of course, you will have to replace ru_RU above with something more
appropriate.

If you are using xdm, you also want to include the following lines at the
beginning of /etc/X11/xdm/Xsession:

    [ -r /etc/profile ] && . /etc/profile
    [ -r $HOME/.bash_profile ] && . $HOME/.bash_profile

Consult the documentation of other display managers for the means to set the
environment in the started session.

3. Setting up Linux console

We will modify the /etc/rc.d/init.d/loadkeys script.

    #!/bin/bash
    # Begin $rc_base/init.d/loadkeys - Loadkeys Script

    # Based on loadkeys script from LFS-3.1 and earlier.
    # Rewritten by Gerard Beekmans - gerard@linuxfromscratch.org
    # Modified for UTF-8 locales by Alexander E. Patrakov - semzx@newmail.ru

    source /etc/sysconfig/rc
    source $rc_functions

    echo -n "Setting screen font..."
    for console in /dev/tty[1-6]
    do
        (
            setfont
            setfont LatArCyrHeb-16
        ) <$console >$console 2>&1
    done
    evaluate_retval

    echo -n "Loading keymap..."
    kbd_mode -u
    loadkeys ru1 2>/dev/null &&
    dumpkeys -c koi8-r | loadkeys --unicode
    evaluate_retval

    # End $rc_base/init.d/loadkeys

Some comments concerning this script.

1) The empty "setfont" command works around a bug in 2.6 kernels.

2) We don't switch the console output to UTF-8 here. We will do that in
/etc/issue (the idea is stolen from the "redhat-style-logon" hint). This is
necessary because otherwise this switching will affect only the first console.
As an alternative, you can write a "for" loop here sending <ESC>%G to all
virtual consoles.

3) The kbd package does not provide ready-to-use keymaps for UTF-8 locales,
except for the Ukrainian one. First, we load the now-wrong ru1 keymap (the
numeric character codes there are valid only for the koi8-r character set),
then we dump it, replacing numeric codes with human-readable descriptions of
characters (e.g. "cyrillic_small_letter_e"). The resulting keymap is usable in
UTF-8 mode, so we load it with loadkeys --unicode.

Let's create /etc/issue:

    echo -e '\033[2J\033[f\033%GWelcome to Linux From Scratch\n' >/etc/issue

The meaning of the escape sequences:

    <ESC>[2J = clear entire screen
    <ESC>[f  = move the cursor to the corner of the screen
    <ESC>%G  = put the console into UTF-8 mode

Set up the screen font and keyboard now, if you don't want to reboot:

    /etc/rc.d/init.d/loadkeys

Then kill all agetty processes so that they reread /etc/issue:

    killall agetty

4. Conclusion

From your next login, you will use a UTF-8 based locale, with all its benefits
and drawbacks.

Known bugs:
- The Caps Lock key does not work on Linux console for national characters.
  The guilty package is kbd.
- Some packages don't display line drawing characters in UTF-8 mode on Linux
  console. This is a bug in the packages themselves.
  See the ALSA section below for a more detailed discussion.

Part 2. BLFS PACKAGES

1. GnuPG

The package itself is internationalized well and supports UTF-8 out of the box.
Unfortunately, some applications (e.g. Enigmail) assume that the output of gpg
is in iso8859-1. For applications that cannot be fixed easily, create the
following script:

    #!/bin/sh
    export LC_ALL=C
    export LANG=C
    exec /usr/bin/gpg "$@"

Save it as /usr/bin/gpg-nolocale, give it the "executable" bit and configure
the offending application to use this script instead of the real gpg binary.

2. Emacs

I don't use Emacs at all, but your comments are welcome. Don't expect
any console-based editor except Vim, Emacs and Yudit to work in a UTF-8 locale.

3. Slang

Get the patch
http://www.linuxfromscratch.org/patches/downloads/slang/slang-1.4.9-utf8.patch

Install Slang using the following instructions:

    patch -Np1 -i ../slang-1.4.9-utf8.patch
    ./configure --prefix=/usr
    make CFLAGS="-O2 -pipe -DUTF8"
    make install
    make CFLAGS="-O2 -pipe -DUTF8" ELF_CFLAGS="-O2 -pipe -DUTF8" elf
    make install-elf
    make install-links
    chmod 755 /usr/lib/libslang.so.1.4.9

WARNING: you should pass -DUTF8 in CFLAGS to all applications that depend
on Slang.

4. Aspell

To be done.

5. GPM

GPM cannot cut/paste non-ASCII characters. It is really a limitation of the
Linux console. You can google for a kernel patch named
unicode_copypaste_2.4.19.patch.gz
but I would recommend against it. I had crashes and repeatable kernel panics
with it.

6. Zip/Unzip

If you put a file with non-ASCII characters in its name into the archive, you
will be unable to get that name correctly under Windows.

7. Midnight Commander

First, install Slang. Then, get the patch
http://www.linuxfromscratch.org/patches/downloads/mc/mc-4.6.0-utf8.patch

Install Midnight Commander with the following instructions:

    patch -Np1 -i ../mc-4.6.0-utf8.patch
    CFLAGS="-O2 -pipe -DUTF8" ./configure \
        --prefix=/usr --with-screen=slang \
        --what-else-you want, e.g. --with-vfs --with-samba --enable-charset --without-ext2undel \
        --with-configdir=/etc/samba --with-codepagedir=/usr/share/samba/codepages
    make
    make install

Unfortunately, this patch is not sufficient. In particular, it is impossible
to view and edit files containing non-ASCII characters using the internal
viewer and editor. Configure Midnight Commander to use an external editor,
e.g. Vim.

8. w3m

You need w3m-m17n, not just a bare w3m. Unfortunately, w3m-m17n-0.4.2 does not
exist yet.

9. Mutt, Pine

I don't use them at all, but Debian has a patch for Mutt.
Your comments are welcome.

10. GTK+-1.2.10

This package's default style files in /etc/gtk don't work in UTF-8 locales.
Changing "koi8-r" to "iso10646-1" fonts in /etc/gtk/gtkrc.ru fixes the problem
with improper fonts for Russians. Beware that KDE also sets GTK styles (in
~/.kde/share/config/gtkrc and ~/.gtkrc), so these files also may need some
manual editing.

11. LessTif

This package does not support Unicode well.

12. KDEMultimedia

The players show ID3 tags with national characters improperly.

13. Yelp

The problems with manual pages have already been mentioned in the Man section.

14. ALSA

Alsamixer 1.0.2 won't show the line drawing characters on Linux console in
UTF-8 mode. This is a bug in alsamixer. The problem is that NCurses must
know whether the Linux console is in UTF-8 mode or not. To do that, NCurses
checks the current locale setting (in the order: LC_ALL, LC_CTYPE, LANG).
Also, it has to compute how many cells a given character occupies. This
requires a valid LC_CTYPE setting.

But this means that a program that links to ncurses must call

    setlocale(LC_CTYPE, "")

before initscr(). This patch fixes the issue in alsamixer:
http://www.linuxfromscratch.org/patches/downloads/alsa-utils/alsa-utils-1.0.2-locale.patch

After reading the text above and looking at the alsamixer patch, you should
be able to fix this kind of problem in other packages.
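
When debugging such a package, it can help to see which locale name the lookup
order described above (LC_ALL, then LC_CTYPE, then LANG) would pick in the
current environment. The following is only a sketch of that order in shell:
the helper names effective_ctype and is_utf8_name are invented for this
example, and the check on the locale name is a heuristic (the authoritative
answer comes from "locale charmap").

```shell
#!/bin/sh
# effective_ctype: hypothetical helper, for illustration only.
# Prints the locale name that the LC_CTYPE category resolves to,
# following the LC_ALL, LC_CTYPE, LANG order. Empty variables are
# treated as unset; "C" is the fallback when nothing is set.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' "C"
    fi
}

# is_utf8_name: hypothetical helper. Succeeds if the locale NAME looks
# like a UTF-8 locale. This inspects only the name, not the locale data.
is_utf8_name() {
    case "$1" in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_name "$(effective_ctype)"; then
    echo "LC_CTYPE resolves to a UTF-8 locale name: $(effective_ctype)"
else
    echo "LC_CTYPE resolves to a non-UTF-8 locale name: $(effective_ctype)"
fi
```

If the name printed here is a UTF-8 locale but the program still draws
poor-man line characters, the missing setlocale() call is a likely suspect.
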
Please send patches to patches@linuxfromscratch.org.

Don't send a patch for the "lxdialog" program that comes with the kernel
sources and is used during "make menuconfig", since that will break Question
2 in the quiz below and I will no longer be able to check whether others are
ready to follow this hint.

15. XMMS

This package will not show ID3 tags properly out of the box, because they are
usually in the windowsish single-byte or double-byte encoding and not in UTF-8.
The patch from http://rusxmms.sourceforge.net/ helps.

16. Dillo

This package does not support Unicode.

17. XSane

The gtk+-1.2.10 version is affected by a bug in gtk+ style support
and does not work properly even in the ru_RU.koi8r locale. To work around
the problem, don't build the GIMP plugin; XSane will then link against GTK2.

18. Xpdf

Since this package depends on LessTif, the support of UTF-8 in the GUI is
rather poor. E.g., the filenames in the file selector are shown improperly. But
the non-GUI tool, pdftotext, works flawlessly and can extract text in the UTF-8
encoding from PDF files.

19. A2ps

This package does not support Unicode.

20. TeX

To use UTF-8 as an input encoding with TeX, you should download the following
package:
http://www.unruh.de/DniQ/latex/unicode/unicode.tgz

Just unpack it into /usr/share/texmf/tex, remove all files except

    ucs/*.sty, ucs/*.def, ucs/data/*

and then run mktexlsr. Then you will be able to write

    \usepackage[utf-8]{inputenc}

in the document preamble, but I doubt that anyone else will be able to TeX
your documents.

If you want someone else to be able to extract text in UTF-8 encoding from
your PDF files generated by PDFTeX or dvipdfm, you should also install
the "cm-super" font package from CTAN.

Part 3. CONCLUSIONS

Probably you understand from reading the above that UTF-8 causes more trouble
than it is worth. If you followed this hint, I hope that I didn't damage your
system irreversibly.

Please post your deviations and report other broken packages to
patrakov@ums.usu.ru

APPENDIX A. QUIZ

You should follow the hint only if you know all the answers.

1) The non-wide character version of ncurses 5.4 uses poor-man line-drawing
   characters on Linux console in UTF-8 mode. What other terminal type is
   affected by this? Where (which file and line) is the check? Where is the
   piece of code that substitutes these poor-man line-drawing characters
   instead of those which came from the terminfo database? Where does
   ncurses 5.4 check the current locale?

2) The Linux kernel build process uses the "lxdialog" program during the
   "make menuconfig" step. Unfortunately, lxdialog has the same bug as
   alsamixer (see the hint). Can you make a patch for lxdialog yourself?