Wednesday, 15 February 2012

Unicode and Encoding

Just for my reference
http://csharpindepth.com/Articles/General/Unicode.aspx
http://stackoverflow.com/questions/496321/utf8-utf16-and-utf32
http://stackoverflow.com/questions/643694/utf-8-vs-unicode

In a nutshell:
1. .NET strings are UTF-16 internally, and Encoding.Unicode means UTF-16 (little-endian).
2. UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Compact for English text, less so for Asian text, since most CJK characters take 3 bytes.
3. UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes (a surrogate pair). Wasteful for English text (2 bytes per ASCII character), compact for Asian text (most CJK characters take 2 bytes).
4. UTF-32: Fixed-width encoding. All code points take 4 bytes. An enormous memory hog, but simple to index because every code point has the same size. Rarely used.
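
The byte counts in the list above are easy to check for yourself. Here is a minimal sketch in Python rather than .NET, just because it is quick to run. The helper name encoded_sizes and the sample characters are my own picks for illustration. It uses the little-endian codecs so that no byte order mark is added to the count; "utf-16-le" is the same byte layout that Encoding.Unicode produces in .NET.

```python
# One sample character from each UTF-8 width class listed above.
# The samples are arbitrary picks, written as escapes to avoid
# depending on the encoding of this source file.
SAMPLES = {
    "A": "U+0041 (ASCII)",
    "\u00e9": "U+00E9 (in U+0080..U+07FF)",
    "\u4e2d": "U+4E2D (in U+0800..U+FFFF)",
    "\U0001F600": "U+1F600 (above U+FFFF)",
}


def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32.

    The "-le" codecs are used so that no byte order mark is prepended.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for ch, note in SAMPLES.items():
        print(note, encoded_sizes(ch))
```

Calling encoded_sizes on a whole string instead of a single character shows the trade-off between English and Asian text: plain ASCII text doubles in size going from UTF-8 to UTF-16, while a string of CJK characters shrinks from 3 bytes per character to 2.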
