
Nearly Half of All Websites Are Using Unicode

Feb 2, 2010

Web pages use many different character encodings, but while most encodings can represent only a few languages at most, Unicode can represent thousands of languages. According to Google’s own internal data, the use of Unicode is still rising, with almost 50 percent of all websites now using the encoding.


Eighteen months ago, Google published a graph that pointed to this trend, showing that Unicode had by then overtaken every other text encoding on the web. Since then, the growth in the use of Unicode has been even more pronounced. Internally, Google converts text in any other character encoding to Unicode before processing it, and it does this for all the text that it searches. This makes it easier to process the web pages it indexes from around the globe, which are written in many different languages, all of which Unicode is able to handle.
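The "convert everything to Unicode first" approach can be illustrated with a minimal Python sketch. This is not Google's code; the function name `to_unicode` and the sample pages are hypothetical, and the sketch assumes the source encoding of each page is already known (for example, from an HTTP header or a meta tag).

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode bytes in a declared legacy encoding into a Unicode string.

    Assumption: the caller already knows the page's encoding.
    Real crawlers also have to detect or guess it.
    """
    return raw.decode(declared_encoding)


# Two hypothetical pages stored in different legacy encodings.
latin1_page = "café".encode("iso-8859-1")   # b'caf\xe9'
sjis_page = "日本語".encode("shift_jis")

# After decoding, both are ordinary Unicode strings and can be
# handled by the same text-processing code.
pages = [
    to_unicode(latin1_page, "iso-8859-1"),
    to_unicode(sjis_page, "shift_jis"),
]

# Re-encoding as UTF-8 gives a single storage format for every language.
utf8_pages = [page.encode("utf-8") for page in pages]
```

Once every page is a Unicode string, later steps such as tokenizing or matching queries no longer need to know which encoding the page originally used.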


The ISO-8859-1 character set, one part of the ISO 8859 standard and also known as Latin-1, is the default character set used by most web browsers. Its first 128 characters are the original ASCII character set, which includes the digits zero through nine, the uppercase and lowercase English alphabet, and some special characters. The rest contains the accented letters used in Western European languages, along with some commonly used special characters. Unicode was created to overcome the limitations of the ISO 8859 standard, which is of special importance given the global nature of the internet.
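The relationship between ASCII, ISO-8859-1, and Unicode can be checked with a short Python sketch. The helper name `fits_latin1` is hypothetical and exists only for this illustration; the codec names are the ones shipped with Python's standard library.

```python
# The first 128 byte values decode to the same characters under ASCII
# and under ISO-8859-1, because ISO-8859-1 extends ASCII.
ascii_bytes = bytes(range(128))
same_lower_half = (
    ascii_bytes.decode("ascii") == ascii_bytes.decode("iso-8859-1")
)


def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# Western European text fits; Cyrillic and Japanese text do not,
# which is the limitation Unicode was designed to remove.
western = fits_latin1("Grüße")
cyrillic = fits_latin1("Привет")
japanese = fits_latin1("日本語")
```

Here `western` is true while `cyrillic` and `japanese` are false, whereas all three strings can be encoded as UTF-8 without error.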


Google uses the latest version of Unicode, version 5.2, which added over 6,600 new characters to the character set, including esoteric characters such as Egyptian Hieroglyphs along with many others for more widely used languages. In their blog, Google writes that “We’re constantly improving our handling of existing characters. For example, the characters ‘fi’ can either be represented as two characters (‘f’ and ‘i’), or a special display form ‘ﬁ’. A Google search for [financials] or [office] used to not see these as equivalent — to the software they would just look like *nancials and of*ce. There are thousands of characters like this, and they occur in surprisingly many pages on the web, especially generated PDF documents.”
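One standard way to treat the single-character “fi” ligature (U+FB01) as equivalent to the two letters “f” and “i” is Unicode compatibility normalization (NFKC). The Python sketch below shows the idea using the standard `unicodedata` module. It is only an illustration of the technique, not a description of how Google’s search actually handles these characters, and the function name `fold_compatibility` is hypothetical.

```python
import unicodedata


def fold_compatibility(text: str) -> str:
    """Replace compatibility characters such as ligatures with their
    plain equivalents, using Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# "\ufb01" is the single-character "fi" ligature often found in
# text extracted from generated PDF documents.
with_ligature = "\ufb01nancials"
office_with_ligature = "of\ufb01ce"

# After normalization, the ligature forms compare equal to the
# plain spellings, so a search for either form can match both.
matches_financials = fold_compatibility(with_ligature) == "financials"
matches_office = fold_compatibility(office_with_ligature) == "office"
```

Applying the same normalization to both the indexed text and the query is what makes the two spellings match each other.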


With Unicode 5.2 in use, this is no longer a problem for the search engine giant. Searches performed using Google can now properly match documents that contain those special characters, which is why Google is particularly happy with this announcement.
