Split on Non-quoted Characters

By Robert Gravelle

A common task in Web apps is processing input from JSON, XML, or AJAX callbacks. Oftentimes, strings represent a delimited array of elements. Splitting on the delimiter character works in most instances, except when that character is part of a valid element value. In an effort to avoid this predicament, developers will attempt to find a rarely used delimiter character, such as a tilde (~) or something equally uncommon. Another approach is to simply ignore delimiter character instances that lie between string literal delimiters, i.e., single and double quotes. In today's article we'll be building just such a string iterator.

 

The Best Approach

The standard way to split a string is to use the String.split() method. Unfortunately, it falls short when you need to skip delimiters that fall within quoted text; it happily splits on all instances of your delimiter. Even regular expressions, as great as they are, don't excel at matching characters within other characters. Those require complex lookaheads and lookbehinds, the latter of which JavaScript did not support until ECMAScript 2018 and which older engines still lack. Instead, I would recommend using a string iterator for this task.
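To make the problem concrete, here is a small sketch of String.split() breaking a quoted value apart. The sample input is made up for illustration; it is a shortened variant of the kind of list discussed in this article.

```javascript
// Illustrative input: three logical elements, one of which contains commas.
var input = '"Item 1", "Item 4, 5, & 6", 11';

// split() has no notion of quotes, so it cuts inside the quoted element too.
var parts = input.split(',');

console.log(parts.length); // 5 pieces instead of the 3 we wanted
console.log(parts);
```

The quoted element comes back in three fragments, which is exactly what a quote-aware iterator is meant to avoid.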

 

The nextNonQuotedToken() Function:

Java has a StringCharacterIterator class that is easily emulated in JavaScript. Here is a for loop which processes one character at a time:

var charIndex, currentChar;
for (charIndex=0; charIndex<stringToSplit.length; charIndex++) {
  currentChar = stringToSplit.charAt(charIndex);
  //process character...
}

Iterating through each character in this way allows us to keep track of both single and double quotes and determine whether we are currently within a string literal. Since quotes always come in matched pairs, we can track them using boolean variables. Quotes can be considered to be cancelling characters because we are only within a string when we've encountered an odd number of single or double quote characters.

Also, don't forget that strings require special treatment because they may include escaped characters. Escaped quotes should not be counted because we are only interested in the matched opening and closing quotes. Therefore, we must test for quotes that are not preceded by an escape character.
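The odd/even idea and the escape test can be tried out in isolation. The sketch below uses a hypothetical helper, countUnescapedQuotes(), which is not part of the article's code; it only counts quotes that are not directly preceded by a backslash. Note that this simple look-behind of one character is an assumption that mirrors the test described above, and it would misjudge a quote preceded by an escaped backslash.

```javascript
// Hypothetical helper: counts occurrences of quoteChar that are not
// immediately preceded by a backslash.
function countUnescapedQuotes(text, quoteChar) {
  var count = 0, i;
  for (i = 0; i < text.length; i++) {
    if (text.charAt(i) === quoteChar
        && (i === 0 || text.charAt(i - 1) !== '\\')) {
      count++;
    }
  }
  return count;
}

// Actual text: a, "b \" c   (the string literal is still open at the end)
var openText = 'a, "b \\" c';
// Actual text: a, "b \" c"  (the string literal has been closed)
var closedText = 'a, "b \\" c"';

// An odd count means we are still inside a double-quoted string.
var insideAtEndOfOpen = countUnescapedQuotes(openText, '"') % 2 === 1;
var insideAtEndOfClosed = countUnescapedQuotes(closedText, '"') % 2 === 1;

console.log(insideAtEndOfOpen);   // true
console.log(insideAtEndOfClosed); // false
```

Toggling a boolean on each unescaped quote, as the function in this article does, is equivalent to checking whether this count is odd.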

 

Here is the code for the nextNonQuotedToken() method thus far:

var nextNonQuotedToken = function() { //this creates a closure for indexOfLastMatch
  var indexOfLastMatch = -1;

  return function(stringToSplit, splitChar, includeSplitChar) {
    var withinSingleQuotedString = false,
        withinDoubleQuotedString = false,
        nextMatch = '',
        currentChar,
        charIndex;

    for (charIndex=indexOfLastMatch+1; charIndex<stringToSplit.length; charIndex++) {
      currentChar = stringToSplit.charAt(charIndex);
      //only count quotes that are not preceded by an escape character
      if ( /['"]/.test(currentChar)
        && (charIndex == 0 || stringToSplit.charAt(charIndex-1) != '\\')) {
        if (currentChar == "'" ) {
          withinSingleQuotedString = !withinSingleQuotedString;
        }
        else if (currentChar == '"' ) {
          withinDoubleQuotedString = !withinDoubleQuotedString;
        }
      }
    }
  };
}();


Testing for Unquoted Matched Tokens

Having tracked single and double quotes, we can test for a match whenever we encounter the splitChar character outside of a quoted string. The match is the substring from the first character after the index of the last matched delimiter up to the current index (charIndex). The includeSplitChar parameter is a boolean value; in an addition it is coerced to 0 for false and 1 for true, so we can add it to the current charIndex without performing any additional conversion, as long as the caller actually passes true or false (an omitted argument is undefined, which would turn the sum into NaN). The indexOfLastMatch variable is updated to the current index and the for loop is exited using the break statement.
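The boolean arithmetic is easy to verify on its own. The following sketch uses made-up variable names and a made-up sample string; the last lines show one possible way (an assumption, not the article's code) to guard against a missing flag.

```javascript
var text = 'a,b,c';
var commaIndex = text.indexOf(','); // position of the first comma: 1

// false is coerced to 0, so the delimiter is excluded.
var withoutDelimiter = text.substring(0, commaIndex + false);

// true is coerced to 1, so the end index moves past the delimiter.
var withDelimiter = text.substring(0, commaIndex + true);

// If the flag is omitted it is undefined, and number + undefined is NaN.
var omittedFlag;
var unsafeEnd = commaIndex + omittedFlag;

// A conditional expression yields a usable number even for undefined.
var safeEnd = commaIndex + (omittedFlag ? 1 : 0);

console.log(withoutDelimiter); // a
console.log(withDelimiter);    // a,
console.log(isNaN(unsafeEnd)); // true
console.log(safeEnd);          // 1
```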

 

The final line in the function returns the matched string:

    for (charIndex=indexOfLastMatch+1; charIndex<stringToSplit.length; charIndex++) {
      currentChar = stringToSplit.charAt(charIndex);
      if ( /['"]/.test(currentChar) ) {
        //...toggle the quote flags as before
      }
      else if ( currentChar == splitChar
             && !withinSingleQuotedString
             && !withinDoubleQuotedString ) {
        nextMatch = stringToSplit.substring(indexOfLastMatch+1, charIndex+includeSplitChar);
        indexOfLastMatch = charIndex;
        break;
      }
    }
    return nextMatch;
  }

 

Including the End of String Match

If we were to leave things as is, there would be one significant flaw: the function does not return the last token following the last delimiter. The reason is that the loop is only exited when the delimiter is found. Unless the delimiter is the last character in the string, the last token will not be returned. Therefore, we have to check for the end of the string. The check requires its own if statement, rather than another else if branch, because it can be true at the same time as the first if statement, for instance when the string ends with a closing quote. In fact, this is exactly what happened when I ran the function against the following test string of quoted values:

'"Item 1", "Item 2", "Item 3a, \'Item 3b, 3c\'", "Item 4, 5, & 6", "Item 7", "Item 8", "Item 9, Item 10", 11 , 12'

 

Here's the code to include the last token:

        //...
        nextMatch = stringToSplit.substring(indexOfLastMatch+1, charIndex+includeSplitChar);
        indexOfLastMatch = charIndex;
        break;
      }
      //end of string: return everything after the last delimiter
      if (charIndex == stringToSplit.length - 1) {
        nextMatch = stringToSplit.substring(indexOfLastMatch+1);
        indexOfLastMatch = charIndex;
      }
    }
    return nextMatch;
  }
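Putting the pieces from this article together, the whole function might look like the sketch below. Two things here are my own assumptions rather than the article's code: the createNonQuotedTokenizer() factory, which replaces the immediately invoked function so that each string gets a fresh indexOfLastMatch, and the (includeSplitChar ? 1 : 0) expression, which tolerates an omitted flag. The driver loop stops at the first empty token, so it is only suitable for input without empty elements.

```javascript
// Hypothetical factory: each call returns a tokenizer with its own state.
function createNonQuotedTokenizer() {
  var indexOfLastMatch = -1;

  return function nextNonQuotedToken(stringToSplit, splitChar, includeSplitChar) {
    var withinSingleQuotedString = false,
        withinDoubleQuotedString = false,
        nextMatch = '',
        currentChar,
        charIndex;

    for (charIndex = indexOfLastMatch + 1; charIndex < stringToSplit.length; charIndex++) {
      currentChar = stringToSplit.charAt(charIndex);
      // Toggle the quote flags on quotes not preceded by a backslash.
      if (/['"]/.test(currentChar)
          && (charIndex === 0 || stringToSplit.charAt(charIndex - 1) !== '\\')) {
        if (currentChar === "'") {
          withinSingleQuotedString = !withinSingleQuotedString;
        } else {
          withinDoubleQuotedString = !withinDoubleQuotedString;
        }
      } else if (currentChar === splitChar
          && !withinSingleQuotedString
          && !withinDoubleQuotedString) {
        // Unquoted delimiter: return the text since the previous delimiter.
        nextMatch = stringToSplit.substring(
          indexOfLastMatch + 1, charIndex + (includeSplitChar ? 1 : 0));
        indexOfLastMatch = charIndex;
        break;
      }
      // End of string: return whatever follows the last delimiter.
      if (charIndex === stringToSplit.length - 1) {
        nextMatch = stringToSplit.substring(indexOfLastMatch + 1);
        indexOfLastMatch = charIndex;
      }
    }
    return nextMatch;
  };
}

var testString = '"Item 1", "Item 2", "Item 3a, \'Item 3b, 3c\'", "Item 4, 5, & 6", "Item 7", "Item 8", "Item 9, Item 10", 11 , 12';
var nextToken = createNonQuotedTokenizer();
var tokens = [];
var token;

// Keep asking for tokens until the tokenizer has nothing left to return.
while ((token = nextToken(testString, ',', false)) !== '') {
  tokens.push(token);
}

console.log(tokens.length); // 9
console.log(tokens);
```

The tokens keep their surrounding whitespace and quotes; stripping those is left to a separate step.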

 

Conclusion

In the next article we'll take what we did here today and transplant it into a proper Iterator class as well as introduce a couple of additional methods: a split() method that calls nextNonQuotedToken(), and a utility to strip the quotes from string literals.



Rob Gravelle resides in Ottawa, Canada, and is the founder of GravelleConsulting.com. Rob has built systems for Intelligence-related organizations such as Canada Border Services, CSIS as well as for numerous commercial businesses. Email Rob to receive a free estimate on your software project. Should you hire Rob and his firm, you'll receive 15% off for mentioning that you heard about it here!

In his spare time, Rob has become an accomplished guitar player, and has released several CDs. His former band, Ivory Knight, was rated as one Canada's top hard rock and metal groups by Brave Words magazine (issue #92).
