Refactoring HTML: Well-Formedness
Change Name to Lowercase
Make all element and attribute names lowercase. Make most entity names lowercase, except for those that refer to capital letters.
XHTML uses lowercase names exclusively. All elements and attributes are written in lowercase. For example,
There are relatively few trade-offs for converting to lowercase. All modern browsers support lowercase tag names without any problems. A few very old browsers that were never in widespread use, such as HotJava, only supported uppercase for some tags. The same is true of early versions of Java Swing's built-in HTML renderer. However, this has long since been fixed.
It is also possible that some homegrown scripts based on regular expressions may not recognize lowercase forms. If you have any scripts that screen-scrape your HTML, you'll need to check them to make sure they're also ready to handle lowercase tag names. Once you're done making the document well-formed, it may be time to consider refactoring those scripts, too, so that they use a real parser instead of regular expression hacks. However, that can wait. Usually it's simple enough to change the expressions to look for lowercase tag names instead of uppercase ones, or to not care about the case of the tag names at all.
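As a minimal sketch of the "not care about the case" option, assuming a scraping script written in Python, the pattern can be compiled with re.IGNORECASE. The extract_title helper and the sample markup are hypothetical and stand in for whatever your own scripts look for.

```python
import re

# Hypothetical scraper pattern that pulls out the text of the title element.
# re.IGNORECASE lets the tag name match whether it is <TITLE>, <Title>, or <title>.
TITLE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the title text, or None if no title element is found."""
    match = TITLE.search(html)
    return match.group(1).strip() if match else None

print(extract_title("<HTML><HEAD><TITLE>Old page</TITLE></HEAD></HTML>"))  # Old page
print(extract_title("<html><head><title>New page</title></head></html>"))  # New page
```

A script written this way keeps working both before and after the conversion, so it does not have to be updated in lockstep with the documents.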
The first rule of well-formedness is that every start-tag has a matching end-tag. The matching part is crucial. Although classic HTML is case-insensitive, XML and XHTML are not. <DIV> is not the same as <div>, and a </div> end-tag cannot close a <DIV> start-tag.
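The case sensitivity is easy to observe with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree module; the is_well_formed helper is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<div>text</div>"))  # True
print(is_well_formed("<DIV>text</div>"))  # False: </div> does not close <DIV>
print(is_well_formed("<DIV>text</DIV>"))  # True: consistent case is well-formed XML
```

The third case is well-formed XML even though it is not the lowercase form that XHTML requires.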
For purely well-formedness reasons, all that's needed is to normalize the case. All tags could be capitalized or not, as long as you're consistent. However, it's easiest for everyone if we pick one case convention and stick to it. The community has chosen lowercase for XHTML. Thus, the first step is to convert all tag names, attribute names, and entity names to lowercase. For example:
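To illustrate, the snippet below pairs a made-up fragment (not taken from any real page) with its normalized form. It is written as Python strings only so that the claim "nothing but letter case changes" can be checked mechanically.

```python
# Made-up sample markup, before and after case normalization.
before = '<P CLASS="intro">Copyright &COPY; 2008 <A HREF="index.html">Example</A></P>'
after = '<p class="intro">Copyright &copy; 2008 <a href="index.html">Example</a></p>'

# The two fragments differ only in letter case, so case-folding makes them equal.
print(before != after)                  # True
print(before.lower() == after.lower())  # True
```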
There are several ways to do this.
The first and the simplest is to use TagSoup or Tidy in XHTML mode. Along with many other changes, these tools will convert all tag and attribute names to lowercase. They will also change entity names that need to be in lowercase.
You can also accomplish this with regular expressions. Because HTML element and attribute names are composed almost exclusively of the Latin letters A to Z and a to z (the digits in h1 through h6 are the main exception), this isn't too difficult. Let's start with the element names. There are likely to be thousands, perhaps millions, of these, so you don't want to fix them by hand.
Tags are easy to search for. This regular expression will find all start-tags that contain at least one capital letter:
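One pattern that fits this description, offered as a sketch rather than the only possibility, is <[a-z]*[A-Z][a-zA-Z0-9]*. The Python snippet below applies it to a made-up sample; the pattern and the variable names are assumptions introduced for illustration.

```python
import re

# "<", then optional lowercase letters, then at least one capital letter,
# then the rest of the name. Only the tag name is matched, not the attributes.
UPPER_START_TAG = re.compile(r"<[a-z]*[A-Z][a-zA-Z0-9]*")

sample = "<!DOCTYPE html><P>One</P><p>two</p><Table><tr><TD>x</TD></tr></Table>"

print(UPPER_START_TAG.findall(sample))  # ['<P', '<Table', '<TD']
```

The pattern skips <!DOCTYPE because "!" cannot start a tag name, and it skips end-tags because "/" follows the "<". It does not know about comments or CDATA sections, so review the matches before changing anything.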
This regular expression will find all end-tags that contain at least one capital letter:
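The end-tag version of the same sketch only adds the slash: </[a-z]*[A-Z][a-zA-Z0-9]*. As before, the pattern, names, and sample markup are illustrative assumptions.

```python
import re

# "</", then optional lowercase letters, then at least one capital letter,
# then the rest of the name.
UPPER_END_TAG = re.compile(r"</[a-z]*[A-Z][a-zA-Z0-9]*")

sample = "<P>One</P><p>two</p><Table><tr><TD>x</TD></tr></Table>"

print(UPPER_END_TAG.findall(sample))  # ['</P', '</TD', '</Table']
```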
Entities are also easy. This regular expression finds all entity references that contain a capital letter other than the initial letters:
I set up the preceding regular expression to find at least three capital letters to avoid accidentally triggering on references such as &Omega; that should have a single initial capital letter and on references such as &AElig; that have two initial capital letters. This may miss some cases, such as &AMp;, but those are rare in practice. Usually entity references are either all uppercase or all lowercase. If any such mixed cases exist, we'll find them later with xmllint and fix them by hand.
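A pattern with the behavior described here can be sketched as &[A-Z]{3}[A-Za-z0-9]*; which requires three leading capital letters. This exact pattern is an assumption, as are the names and sample text in the Python snippet below.

```python
import re

# "&", three capital letters, any remaining name characters, then ";".
UPPER_ENTITY = re.compile(r"&[A-Z]{3}[A-Za-z0-9]*;")

sample = "AT&AMP;T &COPY; 2008 &Omega; &AElig; &AMp; &nbsp;"

print(UPPER_ENTITY.findall(sample))  # ['&AMP;', '&COPY;']
```

As expected, &Omega; and &AElig; are left alone and &AMp; is missed. Note that HTML also defines a few legitimate all-uppercase names, such as &ETH; and &THORN;, which this sketch would flag, so check the matches before lowercasing them.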
Attributes are trickier to find because the pattern to find them (=name) may appear inside the plain text of the document. I much prefer to use Tidy or TagSoup to fix these. However, if you know you have a large problem with particular attributes, it's easy to do a search and replace for individual ones, for instance, href=. As long as you aren't writing about HTML, that string is unlikely to appear in plain text content.
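A targeted search and replace of this kind might look like the following Python sketch. The lowercase_attribute helper is hypothetical, and it makes no attempt to tell markup from plain text, so the caveat above applies to it in full.

```python
import re

def lowercase_attribute(html, name):
    """Lowercase every occurrence of one attribute name that is followed by '='."""
    # The lookahead (?=\s*=) requires an equals sign after the name
    # without consuming it, so the spacing around "=" is preserved.
    pattern = re.compile(r"\b" + re.escape(name) + r"(?=\s*=)", re.IGNORECASE)
    return pattern.sub(name.lower(), html)

sample = '<A HREF="index.html">Home</A> <a Href = "b.html">B</a>'

print(lowercase_attribute(sample, "href"))
# <A href="index.html">Home</A> <a href = "b.html">B</a>
```

Only the attribute name changes. The tag names and the attribute values are left exactly as they were.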
Sometimes your initial find will discover that only a few tags use uppercase. For instance, if there are lots of uppercase table tags, you can quickly change <TR> to <tr>, </TR> to </tr>, and so forth without even using regular expressions. If the problem is a little broader, consider using Tidy or TagSoup. If that doesn't work, you'll need a tool that can replace text while changing its case. jEdit can't do this. However, Perl and BBEdit can. Use \L in the replacement pattern to convert all characters to lowercase. For example, let's start with the regular expression for start-tags:
This expression will replace it with its lowercase equivalent:
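Python's re module has no \L escape, so an equivalent sketch in Python passes a replacement function to re.sub instead. The pattern is an assumption carried over from the tag searches, extended with an optional slash so that one pass covers end-tags as well, and lowercase_tag_names is a hypothetical helper name.

```python
import re

# Group 1 is "<" or "</"; group 2 is a tag name containing at least one capital.
UPPER_TAG = re.compile(r"(</?)([a-z]*[A-Z][a-zA-Z0-9]*)")

def lowercase_tag_names(html):
    """Lowercase start-tag and end-tag names, leaving everything else alone."""
    return UPPER_TAG.sub(lambda m: m.group(1) + m.group(2).lower(), html)

sample = '<P Class="x">One<BR></P>'

print(lowercase_tag_names(sample))  # <p Class="x">One<br></p>
```

Attribute names such as Class are deliberately untouched here, since attributes are handled separately.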