Thursday, March 28, 2024

Nearly Half of All Websites Are Using Unicode

Web pages use many different character encodings. Most encodings can represent only a few languages, but Unicode can represent thousands. According to Google’s internal data, Unicode use is still rising, and almost 50 percent of all websites now use the encoding.


Google published a graph 18 months ago which predicted this trend, showing that Unicode had by then overtaken every other text encoding on the web. Since then, the growth in Unicode use has been even more pronounced. Internally, Google converts text in any other character encoding to Unicode before processing it, and does so for all the text it searches. This makes it easier to process the web pages it indexes from around the globe, which are written in many different languages, all of which Unicode can handle.
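The "convert everything to Unicode first" step described above can be illustrated with a minimal Python sketch. It is not Google's pipeline: the `to_unicode` helper and the sample byte strings are hypothetical, and the sketch assumes each page's encoding is already known, which in practice usually has to be detected from headers or the content itself.

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode bytes from a known legacy encoding into a Unicode string.

    Hypothetical helper for illustration; undecodable bytes are replaced
    with U+FFFD rather than raising an error.
    """
    return raw.decode(declared_encoding, errors="replace")


# The same kind of text arriving in three different encodings.
samples = [
    (b"caf\xe9", "iso-8859-1"),           # Western European, one byte per character
    ("café".encode("utf-8"), "utf-8"),    # already a Unicode encoding
    ("кафе".encode("koi8-r"), "koi8-r"),  # Cyrillic legacy encoding
]

# After decoding, every document is an ordinary Unicode string and can be
# processed by the same code regardless of how it was stored.
texts = [to_unicode(raw, enc) for raw, enc in samples]
print(texts)
```

Once decoded, the first two samples compare equal even though their byte representations differ, which is the practical benefit of normalizing to a single internal representation.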


The ISO-8859-1 character set, the first part of the ISO 8859 standard and also known as Latin-1, has long been the default character set in most web browsers. Its first 128 characters are the original ASCII character set, which includes the digits zero through nine, the uppercase and lowercase English alphabet, and some special characters. The remaining positions hold characters used in Western European languages, along with some commonly used special characters. Unicode was created to overcome the limitations of encodings such as those in the ISO 8859 standard, which is especially important given the global nature of the internet.
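The layout of ISO-8859-1 and its limits can be checked with a short Python sketch. The `fits_latin1` helper is a hypothetical name introduced here for illustration; the codec names are the ones in Python's standard library.

```python
# Bytes 0-127 decode identically as ASCII and as ISO-8859-1.
ascii_bytes = bytes(range(128))
same_lower_half = ascii_bytes.decode("ascii") == ascii_bytes.decode("iso-8859-1")

# Bytes above 127 carry Western European letters in ISO-8859-1.
high = bytes([0xE9, 0xF1, 0xFC]).decode("iso-8859-1")


def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(same_lower_half)       # the ASCII range is shared
print(high)                  # accented Western European letters
print(fits_latin1("café"))   # representable in ISO-8859-1
print(fits_latin1("日本語"))  # not representable; needs Unicode
```

Text outside the 256 positions of ISO-8859-1, such as Japanese, cannot be encoded in it at all, whereas a Unicode encoding such as UTF-8 handles both examples.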


Google uses the latest version of Unicode, version 5.2, which added over 6,600 new characters to the character set, including esoteric characters such as Egyptian Hieroglyphs as well as many others for more widely used languages. On its blog, Google writes: “We’re constantly improving our handling of existing characters. For example, the characters ‘fi’ can either be represented as two characters (‘f’ and ‘i’), or a special display form ‘ﬁ’. A Google search for [financials] or [office] used to not see these as equivalent — to the software they would just look like *nancials and of*ce. There are thousands of characters like this, and they occur in surprisingly many pages on the web, especially generated PDF documents.”
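The equivalence Google describes, between the two letters "f" and "i" and the single ligature character U+FB01, can be demonstrated with Unicode compatibility normalization. The sketch below uses NFKC from Python's standard `unicodedata` module as one way to fold such display forms; the article does not say which method Google uses, and `fold_display_forms` is a hypothetical helper name.

```python
import unicodedata


def fold_display_forms(text: str) -> str:
    """Replace compatibility characters such as ligatures with their
    plain equivalents using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# "\ufb01" is LATIN SMALL LIGATURE FI, a single code point.
with_ligature = "\ufb01nancials"
folded = fold_display_forms(with_ligature)

print(with_ligature == "financials")  # False: the raw strings differ
print(folded == "financials")         # True: equal after normalization
print(fold_display_forms("of\ufb01ce") == "office")
```

Normalizing both the indexed text and the query in the same way lets a search for the plain spelling match documents that contain the ligature form.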


With Unicode 5.2, this is no longer a problem for the search engine giant. Google searches can now properly match documents that contain those special characters, which is why Google is particularly pleased with this announcement.
