Tip #246: Working with Unicode (the same, rewritten for legibility)

created:		May 10, 2002 16:19		complexity:		basic
author:		Tony Mechelynck		as of Vim:		6.0

1. Where to look for help
-------------------------
	:h utf8
	:h encoding-values
	:h 'enc'
	:h 'fenc'
	:h 'fencs'
	:h 'tenc'
	:h 'bomb'
	:h 'guifont'
	:h ga
	:h g8
	:h :dig
	:h i_Ctrl-V_digit
	:h has()

2. What to do
-------------
(These are *examples*. Modify them to suit your work environment.)

	if has("multi_byte")
		set encoding=utf-8
		setglobal fileencoding=utf-8
		set bomb
		set termencoding=iso-8859-15
		" 8-bit encodings always match, so they must come last
		set fileencodings=ucs-bom,utf-8,iso-8859-15
	else
		echoerr "Sorry, this version of (g)vim was not compiled with +multi_byte"
	endif

3. What the above does
----------------------
* has("multi_byte") checks whether the right options are compiled in. If they are not, it is no use trying to use Unicode.

* 'encoding' sets how Vim represents characters internally. Utf-8 is necessary for most flavors of Unicode.

* 'fileencoding' sets the encoding for a particular file (local to buffer); :setglobal sets the default value. An empty value can also be used: it means the same as 'encoding'. Or you may want to set one of the ucs encodings. That might make the same disk file bigger or smaller, depending on your particular mix of characters. Also, utf-8 has a single fixed byte order, while ucs can be big-endian or little-endian, so if you use ucs you will probably need to set 'bomb' (see below).

* 'bomb' (boolean): if set, Vim puts a "byte order mark" at the start of the file when writing it with a 'fileencoding' of utf-8 or one of the ucs encodings. The option has no effect on 8-bit encodings (iso-8859, etc.).

* 'termencoding' defines how your keyboard encodes what you type (and, in a terminal, what the display understands). The value you put there will depend on your locale: iso-8859-15 is Latin1 + Euro currency sign, but you may want something else for, say, an Eastern European keyboard.

* 'fileencodings' defines the heuristic used to set 'fileencoding' (local to buffer) when reading an existing file. The first entry that matches is used; if none matches, 'fileencoding' is left empty, which means the value of 'encoding' is used. Ucs-bom is "ucs with byte-order mark"; it must not come after utf-8 if you want it to be used. An 8-bit encoding such as iso-8859-15 accepts any byte sequence, so it must come last: anything listed after it is never tried.

4. Additional remarks
---------------------
* In "replace" mode, one utf character (one or more data bytes) replaces one utf character (which need not use the same number of bytes).

* In "normal" mode, ga shows the character under the cursor as text, decimal, octal and hex; g8 shows which byte(s) is/are used to represent it.

* In "insert" or "replace" mode,
  - any character defined on your keyboard can be entered the usual way (even with dead keys if you have them, e.g. French circumflex, German umlaut, etc.);
  - any character which has a "digraph" (there are a great many of them, see :dig after setting enc=utf-8) can be entered with a Ctrl-K prefix;
  - any utf character at all can be entered with a Ctrl-V prefix, either <Ctrl-V> u aaaa or <Ctrl-V> U bbbbbbbb, with 0 <= aaaa <= FFFF, or 0 <= bbbbbbbb <= 7FFFFFFF.

* Unicode can be used to create html "body text", at least for Netscape 6 and probably for IE; but on my machine it doesn't display properly as "title text" (i.e., between <title></title> tags in the <head> part).

* Gvim will display it properly if you have the fonts for it, provided that you set 'guifont' to some fixed-width font which has the glyphs you want to use. Courier New is OK for French, German, Greek, Russian and more, but I'm not sure about Hebrew or Arabic. Its glyphs are of a more "fixed" width than those of, e.g., Lucida Console: the latter can be awkward if you need bold Cyrillic writing.

Happy Vimming!
Tony.

Additional Notes

Anonymous, July 7, 2002 17:03

This doesn't work in gvim on MS-Windows. Apparently you need to use CTRL-Q instead of CTRL-V, e.g. CTRL-Q u00f1.

[email protected], July 25, 2002 17:53

I use gvim on W32 but I avoid sourcing $VIMRUNTIME/mswin.vim so Ctrl-V works for me. But you are right, I ought to have mentioned that if Ctrl-V has been mapped to do a paste, then one should use Ctrl-Q instead. -- The Author.
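
If you are not sure which of the two keys applies to your setup, a check along the following lines may help. This is only a sketch: it assumes that the paste mapping, if there is one, is defined on <C-V> for Insert mode (as mswin.vim does), and the messages are purely illustrative.

```vim
" Sketch: find out whether Ctrl-V has been remapped in Insert mode.
" maparg() returns the right-hand side of a mapping, or an empty
" string when no mapping exists for that key in that mode.
if maparg('<C-V>', 'i') != ''
  echo 'Ctrl-V is mapped in Insert mode: try Ctrl-Q u00f1'
else
  echo 'Ctrl-V is not mapped in Insert mode: Ctrl-V u00f1 should work'
endif
```

Source it (or paste it on the command line) after your vimrc has been loaded, so that any mappings are already in place.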

[email protected], July 26, 2002 12:46

About the byte-order mark: The Unicode standard defines a byte-order mark for optional use at the start of all Unicode files (utf-8 as well as little-endian or big-endian utf-16 and utf-32); it is the character "zero width no-break space", codepoint U+FEFF, and comes out as follows:

	utf-8:      EF BB BF
	utf-16 le:  FF FE
	utf-16 be:  FE FF
	utf-32 le:  FF FE 00 00
	utf-32 be:  00 00 FE FF

It is guaranteed not to clash with a valid Unicode character of a different encoding and/or endianness. It defines both unit size (8|16|32 bits) and endianness. I don't know if Vim can generate it for utf-8. You can always "make one" by typing Ctrl-V u FEFF at the start of a file (or use Ctrl-Q if Ctrl-V doesn't work for you, see above). utf-16 and utf-32 correspond roughly to what Vim calls ucs-2 and ucs-4 respectively (see :h encoding-values).

Tony.
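
On the question of a BOM for utf-8: Vim's help for 'bomb' lists utf-8 among the encodings the option applies to, so something like the following should produce one. Treat it as a sketch rather than a recipe: it assumes a Vim built with +multi_byte and a buffer that already has a file name.

```vim
" Sketch: write the current buffer as utf-8 with a byte-order mark.
" Both options are local to the buffer.
setlocal fileencoding=utf-8
setlocal bomb
write
" Show the values that are now in effect for this buffer.
setlocal fileencoding? bomb?
```

To write the file without the mark instead, use :setlocal nobomb before :write. When the file is read back with ucs-bom in 'fileencodings', Vim should detect the mark and set 'bomb' for that buffer.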



