Thursday, April 18, 2024

Parsing Building and Street Fields from an Address using Regular Expressions

One of the most challenging tasks that I have faced in recent memory involved importing a comma-separated values (CSV) file into my client’s database. Talk about trying to fit a round peg into a square hole! The first hurdle I encountered was that the addresses were stored in one field, whereas my client’s database had them split into Building name/number, Street, Town/City/State, Zip/Postcode, and Country. I foolishly expected to see a fairly obvious pattern in the address formats, such as “number street name suffix, city, zip code, country”. What I got was a tough lesson in address parsing! It took a while, but I managed to concoct something workable to handle my admittedly limited dataset using regular expressions. In today’s article, I’ll show you how to use them to capture the first part of an address (i.e., the building/house number) when dealing with a fairly simple and fixed address format.

What’s in a Street Address

The more you look at address formats, the more you realize that the variety seems endless. To keep things manageable, let’s limit the addresses to those that pertain to a house or business. That is to say, a habitable structure of some sort, as opposed to a milepost, post office box, or set of geolocation coordinates. Here are a few examples of address numbers and street names:

100 Baker Street
International Convention Centre, 8 Quayside, level 2
109 - 111 Wharfside Street
40-42 Parkway
25b-26 Sun Street
Panton House, 39 Panton Street
Unit 6, Royal Festival Hall, Belvedere Road
R2-R3 City Quay, Gunwharf Quays
43a Garden Walk
6/7 Marine Road
10 - 12 Acacia Ave
4513 3RD STREET CIRCLE WEST
0 1/2 Fifth Avenue
A 19 Calle 11

You would think that some kind of standard exists for this type of thing. It turns out that one does: in the U.S., the Thoroughfare, Landmark, and Postal Address Data Standard was drafted by the Federal Geographic Data Committee in 2005 and has been updated several times since then. The final draft was released in February of 2011. In section 2.2.1, “Address Number Elements”, it outlines the rules governing the Address Number Prefix, Address Number, and Address Number Suffix, all of which make up the Complete Address Number. So, in theory, if we were to cover every possibility outlined therein, we should be able to parse any correctly formatted building and street address. The key phrase there is “correctly formatted”; in practice, people often make mistakes, either by accident or simply because they don’t know the “correct” format. How many people bother to learn it, I wonder?
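To make those three elements a little more concrete, here is a minimal sketch that splits a simple Complete Address Number into prefix, number, and suffix. The function name splitAddressNumber and the pattern are my own illustration, based on a loose reading of the element names rather than on the standard's full rules, and it makes no attempt to handle ranges such as "25b-26".

```javascript
// Hypothetical helper (not from the standard): splits a simple
// Complete Address Number into prefix / number / suffix.
function splitAddressNumber(text) {
  // Optional leading letters (prefix), then digits (number), then an
  // optional trailing letter run or fraction (suffix).
  var pattern = /^(?:([A-Za-z]+)[\s-]*)?(\d+)\s*([A-Za-z]+|\d+\/\d+)?$/;
  var m = pattern.exec(text.trim());
  if (!m) {
    return null; // not a form this sketch understands
  }
  return { prefix: m[1] || '', number: m[2], suffix: m[3] || '' };
}

console.log(splitAddressNumber('43a'));   // { prefix: '', number: '43', suffix: 'a' }
console.log(splitAddressNumber('A 19'));  // { prefix: 'A', number: '19', suffix: '' }
console.log(splitAddressNumber('0 1/2')); // { prefix: '', number: '0', suffix: '1/2' }
```

Anything the pattern does not recognize comes back as null, which makes it easy to set those records aside for manual review.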

With so much variety to be had, it’s not easy to parse the building field(s) from an address without using another field as an anchor. In fact, many developers advocate parsing address fields in reverse order, starting with the Postal/Zip Code, because it has far fewer variations. Countries can be matched against a finite list. If you’re really dedicated, you can even match up cities, provinces/states, and street names against known locations. The trick is to narrow the selections down as you go. A country contains only so many provinces or states, which in turn contain only so many cities, which in turn contain only so many streets. If you think that’s an impossible task, consider that GPS devices do it all the time. With the help of web services and the like, there is no shortage of up-to-date lists available.
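Here is a rough sketch of that reverse-order idea, assuming comma-delimited addresses. The COUNTRIES list, the POSTAL_CODE pattern (U.S. ZIP and Canadian postal codes only), and the peelFromEnd function are all placeholders of my own; a real import would draw on much fuller lists.

```javascript
// Illustrative data only: a real list would come from a lookup table or web service.
var COUNTRIES = ['USA', 'Canada', 'United Kingdom'];
// U.S. ZIP (with optional +4) or Canadian postal code at the end of a field.
var POSTAL_CODE = /\b(\d{5}(?:-\d{4})?|[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d)$/;

// Hypothetical helper: peel the country and postal code off the end of an address.
function peelFromEnd(address) {
  var parts = address.split(',')
    .map(function (p) { return p.trim(); })
    .filter(function (p) { return p.length > 0; });
  var result = { country: '', postalCode: '', rest: '' };

  // 1. Country: compare the last field against the finite list.
  var last = parts[parts.length - 1];
  var isCountry = parts.length > 0 && COUNTRIES.some(function (c) {
    return c.toLowerCase() === last.toLowerCase();
  });
  if (isCountry) {
    result.country = parts.pop();
  }

  // 2. Postal code: look at the end of whatever field is now last.
  if (parts.length > 0) {
    var tail = parts[parts.length - 1];
    var m = POSTAL_CODE.exec(tail);
    if (m) {
      result.postalCode = m[1];
      var remainder = tail.slice(0, m.index).trim();
      if (remainder) {
        parts[parts.length - 1] = remainder; // keep e.g. the state abbreviation
      } else {
        parts.pop();
      }
    }
  }

  // 3. Whatever is left still needs the city/state and street treatment.
  result.rest = parts.join(', ');
  return result;
}

console.log(peelFromEnd('100 Baker Street, Springfield, IL 62704, USA'));
// { country: 'USA', postalCode: '62704', rest: '100 Baker Street, Springfield, IL' }
```

Each step shrinks the string that the next step has to deal with, which is the whole point of working backwards.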

Parsing the Address Number

According to the Thoroughfare, Landmark, and Postal Address Data Standard document, the Address Number Prefix “is not found in most Complete Address Numbers”, so let’s start with the Address Number itself. I’ll be using JavaScript to execute my tests because it doesn’t require any special tools. Any language that supports RegExes would work.

Here is the basic page:

<!DOCTYPE HTML>

<html>
<head>
<title>Address Parsing Tests</title>
</head>
<body>
<div id="output"></div>
</body>
</html>

Include this code within SCRIPT tags beneath the output DIV element in the page:

var NumericAddresses = ['100 Baker Street',
                        '109 - 111 Wharfside Street',
                        '40-42 Parkway',
                        '25b-26 Sun Street',
                        '43a Garden Walk',
                        '6/7 Marine Road',
                        '10 - 12 Acacia Ave',
                        '4513 3RD STREET CIRCLE WEST',
                        '0 1/2 Fifth Avenue',
                        '194-03 1/2 50th Avenue'];
                       
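// Address number, optional letter suffix, optional range or fraction, then trailing whitespace
var re = /^\d+\w*\s*(?:(?:[\-\/]?\s*)?\d*(?:\s*\d+\/\s*)?\d+)?\s+/;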

var output = '';
for (var i = 0; i < NumericAddresses.length; i++) {
    output += NumericAddresses[i] + ' matched "' + NumericAddresses[i].match(re) + '"<br/>';
}
document.getElementById('output').innerHTML = output;

RegEx Explanation

The following bullets explain every part of the above RegEx in glorious detail:

  • /^\d+\w*\s*(?:(?:[\-\/]?\s*)?\d*(?:\s*\d+\/\s*)?\d+)?\s+/
    • ^ assert position at the start of the string
    • \d+ match a digit [0-9]
      • Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    • \w* match any word character [a-zA-Z0-9_]
      • Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
    • \s* match any white space character [\r\n\t\f ]
      • Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
    • (?:(?:[\-\/]?\s*)?\d*(?:\s*\d+\/\s*)?\d+)? Non-capturing group
      • Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
      • (?:[\-\/]?\s*)? Non-capturing group
        • Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
        • [\-\/]? match a single character present in the list below
          • Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
          • \- matches the character - literally
          • \/ matches the character / literally
        • \s* match any white space character [\r\n\t\f ]
          • Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
      • \d* match a digit [0-9]
        • Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
      • (?:\s*\d+\/\s*)? Non-capturing group
        • Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
        • \s* match any white space character [\r\n\t\f ]
          • Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
        • \d+ match a digit [0-9]
          • Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
        • \/ matches the character / literally
        • \s* match any white space character [\r\n\t\f ]
          • Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
      • \d+ match a digit [0-9]
        • Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
    • \s+ match any white space character [\r\n\t\f ]
      • Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]

Output

As you can see, I was able to match all of the above test values using one RegEx. That being said, I would definitely encourage you to split up additional tests, such as those that target addresses that start with a letter.

100 Baker Street matched "100 "
109 - 111 Wharfside Street matched "109 - 111 "
40-42 Parkway matched "40-42 "
25b-26 Sun Street matched "25b-26 "
43a Garden Walk matched "43a "
6/7 Marine Road matched "6/7 "
10 - 12 Acacia Ave matched "10 - 12 "
4513 3RD STREET CIRCLE WEST matched "4513 "
0 1/2 Fifth Avenue matched "0 1/2 "
194-03 1/2 50th Avenue matched "194-03 1/2 "
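Once the address number has been matched, whatever follows it is a reasonable first guess at the street. The sketch below shows one way to split the two; splitBuildingAndStreet is a hypothetical helper name of my own, and the pattern is the one explained above, written out with its backslash escapes.

```javascript
// Same pattern as explained above, with backslash escapes in place.
var addressNumberRe = /^\d+\w*\s*(?:(?:[\-\/]?\s*)?\d*(?:\s*\d+\/\s*)?\d+)?\s+/;

// Hypothetical helper: returns the building number and the remaining street text,
// or null when the address does not start with a number.
function splitBuildingAndStreet(address) {
  var m = address.match(addressNumberRe);
  if (!m) {
    return null;
  }
  return {
    building: m[0].trim(),                      // drop the trailing whitespace
    street: address.slice(m[0].length).trim()   // everything after the match
  };
}

console.log(splitBuildingAndStreet('25b-26 Sun Street'));
// { building: '25b-26', street: 'Sun Street' }
console.log(splitBuildingAndStreet('194-03 1/2 50th Avenue'));
// { building: '194-03 1/2', street: '50th Avenue' }
console.log(splitBuildingAndStreet('Panton House, 39 Panton Street'));
// null
```

Addresses that begin with a building name or a letter come back as null here, so they can be routed to a separate test.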

Conclusion

The moral of the story is that parsing address fields is not a trivial exercise. Your best bet would be to use one of the many existing address parsing libraries or web services. Some of them are open source, so money should not pose an issue. Working against a limited dataset, I was able to import all of the records using a basic process of elimination, which is a fairly time-consuming endeavor. In retrospect, I should have charged more than a hundred bucks for that job!

Rob Gravelle
Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.
