Nearly Half of All Websites Are Using Unicode
Eighteen months ago, Google published a graph that anticipated this trend, showing that Unicode had by then overtaken every other text encoding on the web. Growth has been even more pronounced since. Internally, Google converts text in any other character encoding to Unicode before processing it, and does so for all the text it searches. This simplifies processing of the pages it indexes from around the globe, which are written in many different languages, all of which Unicode can represent.
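The convert-first approach described above can be illustrated with a minimal Python sketch. It is not Google's actual pipeline; the helper name `to_unicode` and the sample word are assumptions made for illustration, and it assumes the page's encoding is already known (for example from an HTTP header).

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode bytes from a declared encoding into a Unicode string.

    Hypothetical helper for illustration only. errors="replace" swaps any
    undecodable byte for U+FFFD instead of raising an exception.
    """
    return raw.decode(declared_encoding, errors="replace")


# The same word stored under two different encodings yields different bytes.
latin1_bytes = "café".encode("iso-8859-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")         # b'caf\xc3\xa9'

# Once decoded, both are the same Unicode string and can be processed alike.
from_latin1 = to_unicode(latin1_bytes, "iso-8859-1")
from_utf8 = to_unicode(utf8_bytes, "utf-8")
```

After decoding, downstream code such as tokenizing or indexing only ever deals with one representation, whatever encoding the page originally arrived in.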
The ISO-8859-1 character set, part of the ISO 8859 standard and also known as Latin-1, is the default character set used by most web browsers. Its first 128 characters are the original ASCII character set, which includes the digits zero through nine, the uppercase and lowercase English alphabet, and some special characters; the remainder contains the characters used in Western European languages along with some commonly used special characters. Unicode was created to overcome the limitations of the ISO 8859 standard, a goal of special importance given the global nature of the internet.
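The relationship between ASCII, ISO-8859-1, and Unicode can be checked directly with Python's built-in codecs. This is a small sketch; the function name `fits_latin1` is an assumption made for illustration.

```python
# Bytes 0-127 decode to the same characters under ASCII and ISO-8859-1.
low_bytes = bytes(range(128))
same_low_range = low_bytes.decode("ascii") == low_bytes.decode("iso-8859-1")

# The upper half holds Western European characters: byte 0xE9 is 'é'.
e_acute = bytes([0xE9]).decode("iso-8859-1")


def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Western European text fits; Japanese text (or the euro sign) does not.
western_ok = fits_latin1("café")
japanese_ok = fits_latin1("日本語")
```

The failure on non-Western text is the limitation the paragraph refers to: a single-byte character set has room for only 256 characters.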
Google uses the latest version of Unicode, version 5.2, which added over 6,600 new characters to the character set, including esoteric ones such as Egyptian Hieroglyphs along with many others for more widely used languages. On its blog, Google writes: "We're constantly improving our handling of existing characters. For example, the characters 'fi' can either be represented as two characters ('f' and 'i'), or a special display form 'ﬁ'. A Google search for [financials] or [office] used to not see these as equivalent; to the software they would just look like *nancials and of*ce. There are thousands of characters like this, and they occur in surprisingly many pages on the web, especially generated PDF documents."
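One common way to treat such display forms as equivalent to their plain spellings is Unicode compatibility normalization (NFKC). The sketch below uses Python's standard `unicodedata` module; it is an illustration of the general technique, not a description of how Google's search handles it, and the function name `fold_compatibility_forms` is an assumption.

```python
import unicodedata


def fold_compatibility_forms(text: str) -> str:
    """Apply NFKC normalization, which expands ligatures such as U+FB01."""
    return unicodedata.normalize("NFKC", text)


# U+FB01 is the single-character 'fi' ligature often found in generated PDFs.
ligature_word = "\ufb01nancials"
plain_word = fold_compatibility_forms(ligature_word)

# After folding, a match against the ordinary two-letter spelling succeeds.
matches_query = plain_word == "financials"
```

NFKC also changes other compatibility characters (for example, full-width letters and superscript digits), so it suits matching and indexing better than rewriting the stored text.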