Internationalization

By Sue Charlesworth

Despite its name, the World Wide Web has had some difficulty reaching past the Western languages and alphabets. Character representation in HTML has largely been confined to the ISO 8859-1 (Latin-1) character set. This character set contains letters for English, French, Spanish, German, and the Scandinavian languages, but no Greek, Hebrew, Arabic, or Cyrillic characters, among others, and few scientific and mathematical symbols. Latin-1 also has no provision for marking reading direction.

Part of the problem with Latin-1 is that it simply doesn't have room to handle all the alphabets and languages of the world. It is an 8-bit, single-byte coded graphic character set and, as such, can represent only up to 256 characters.

Enter Unicode. Unicode is a character-encoding standard that uses 16-bit codes, raising the number of characters that can be encoded from 256 to more than 65,000 (65,536 code positions).
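To make the 16-bit idea concrete, here is a small Python sketch under stated assumptions: it uses Python's "utf-16-be" codec as a stand-in for the 16-bit encoding described above (for the characters shown, each one occupies exactly one 16-bit unit), and the helper name as_16bit_units is hypothetical.

```python
# A 16-bit code has 2**16 = 65,536 possible values.
SIXTEEN_BIT_CAPACITY = 2 ** 16


def as_16bit_units(text: str) -> list[int]:
    """Split text into 16-bit code units.

    Assumption: Python's "utf-16-be" codec stands in for the 16-bit
    encoding; each character used below fits in a single unit.
    """
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Latin, Greek, Hebrew, and Cyrillic letters each get one 16-bit value.
for ch in ["A", "é", "α", "ש", "Я"]:
    units = as_16bit_units(ch)
    print(ch, [f"0x{unit:04X}" for unit in units])
```

Characters that Latin-1 cannot hold, such as the Greek, Hebrew, and Cyrillic letters here, each map to a single 16-bit value.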

HTML 4.0 uses the Universal Character Set (UCS) as its document character set. UCS is character-by-character equivalent to Unicode 2.0.
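One practical consequence is that a numeric character reference in HTML 4.0 names a UCS code position, so a character can be written in a document even when the file itself is stored in plain ASCII. The Python sketch below illustrates this with the standard "xmlcharrefreplace" error handler. The function name to_numeric_references is an assumption made for this example, not something defined by HTML or by the original text.

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric reference.

    Hypothetical helper for illustration. The "xmlcharrefreplace" error
    handler emits &#N; where N is the character's code position.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The Greek letter alpha is code position 945 (hexadecimal 03B1).
print(to_numeric_references("α"))
print(to_numeric_references("Ελληνικά in HTML"))
```

ASCII characters pass through unchanged, so the result can be pasted into an HTML document regardless of how the file is saved.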

from Special Edition Using HTML 4: Appendix A
What's New in HTML 4.0

© Copyright Macmillan Computer Publishing. All rights reserved.
