Wednesday, June 23, 2021

Parsing Atom Feeds using XPath

Atom has become a very popular way to share information over the World Wide Web. The Atom Syndication Format is built on the XML standard so that it may be displayed directly in modern browsers or parsed by scripts and software to fetch pertinent information in real time. Once the feed has been converted into an XML document object, a script or program can either iterate over its node elements or query the document for specific data. In this tutorial, we’ll learn how to apply the XPath query language to Atom feeds using PHP in order to target specific content.

Atom Document Structure

Take a look at the sample feed below and you’ll notice several things:

  • There is a feed tag that contains the base URL and an attribute called xmlns that points to the Atom spec. The name xmlns is short for “XML NameSpace”. Keep that in mind because we’ll need it a little later on…
  • The feed contains some meta info such as the title, subtitle, and author details.
  • A feed may contain one or more entries, which are each identified by an id and title tag.
  • Entries in turn may contain content.
<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xml:base="https://afictionalsite.com/Feed.aspx?FeedName=ProductPrices" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">GoodFoodTalks_WebProductPriceByShop</title>
  <subtitle>GoodFoodTalks_WebProductPriceByShop description</subtitle>
  <id>uuid:4286313d-b515-4896-81a5-0d6c6fcd858f</id>
  <updated>2015-03-13T08:29:06Z</updated>
  <author>
    <name>A Fictional Site</name>
    <uri>https://afictionalsite.com/Feed.aspx</uri>
    <email>[email protected]</email>
  </author>
  <entry>
    <id>https://afictionalsite.com/Feed.aspx?FeedName=ProductPrices&amp;ItemID=2</id>
    <title>682</title>
    <updated>2014-11-11T11:29:28Z</updated>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <product_no><![CDATA[682]]></product_no>
      <TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>
      <EatInPrice><![CDATA[3.60]]></EatInPrice>
      <Updated><![CDATA[2014-11-11 11:29:28]]></Updated>
    </content>
  </entry>
  <entry>
    <id>https://afictionalsite.com/Feed.aspx?FeedName=ProductPrices&amp;ItemID=2</id>
    <title>698</title>
    <updated>2014-11-11T11:29:28Z</updated>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <product_no><![CDATA[698]]></product_no>
      <TakeAwayPrice><![CDATA[1.69]]></TakeAwayPrice>
      <EatInPrice><![CDATA[2.00]]></EatInPrice>
      <Updated><![CDATA[2014-11-11 11:29:28]]></Updated>
    </content>
  </entry>
  <entry>
    <id>https://afictionalsite.com/Feed.aspx?FeedName=ProductPrices&amp;ItemID=2</id>
    <title>705</title>
    <updated>2014-11-11T11:29:28Z</updated>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <product_no><![CDATA[705]]></product_no>
      <TakeAwayPrice><![CDATA[3.25]]></TakeAwayPrice>
      <EatInPrice><![CDATA[3.90]]></EatInPrice>
      <Updated><![CDATA[2014-11-11 11:29:28]]></Updated>
    </content>
  </entry>
</feed>

XPath and PHP

Before you can query a feed, you have to fetch it, for example with PHP’s cURL functions. The response body can then be loaded into a SimpleXMLElement object via its constructor. Be sure to check for a value of FALSE first: with the CURLOPT_RETURNTRANSFER option set, curl_exec() returns the response as a string on success or FALSE on failure. As explained in the Display Secure Atom Feeds in WordPress article, from which the sample code originates, passing the LIBXML_NOCDATA constant to the SimpleXMLElement constructor “tells the parser to extract values from <![CDATA[]]> tags.”

Each entry is stored in the <SimpleXMLElement instance>->entry array.

include_once(ABSPATH.WPINC.'/rss.php');
 
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL, 'https://afictionalsite.com/Feed.aspx?FeedName=TheGoodNewsFirst');
curl_setopt($ch,CURLOPT_TIMEOUT,60);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);

// Execute, grab errors
$result = curl_exec($ch);
if($result !== FALSE) {
  $pricesfeedXml = new SimpleXMLElement($result, LIBXML_NOCDATA);
  //fetch the products for each category
  if (isset($pricesfeedXml->entry) ) {
    foreach($pricesfeedXml->entry as $entry) {
      //process each item
    }
  }
}

// close cURL resource, and free up system resources
curl_close($ch);
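
The snippet above needs a live feed and a WordPress environment. As a minimal, network-free sketch of the same parsing step, the following loads a trimmed copy of the sample feed from a string and collects each entry's title. The $sampleFeed, $feedXml and $titles names are illustrative assumptions, not part of the original code:

```php
<?php
// A trimmed copy of the sample feed, embedded as a string so that no
// network call is needed (assumption made purely for illustration).
$sampleFeed = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">GoodFoodTalks_WebProductPriceByShop</title>
  <entry>
    <title>682</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>
    </content>
  </entry>
  <entry>
    <title>698</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <TakeAwayPrice><![CDATA[1.69]]></TakeAwayPrice>
    </content>
  </entry>
</feed>
XML;

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$feedXml = new SimpleXMLElement($sampleFeed, LIBXML_NOCDATA);

$titles = [];
foreach ($feedXml->entry as $entry) {
    // Child elements are read with property syntax; cast to string for the text.
    $titles[] = (string) $entry->title;
}

echo implode(', ', $titles), "\n";
```

Plain property access such as $feedXml->entry needs no namespace prefix here; the prefix only becomes necessary once you switch to xpath().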

Fetching a Content Node by Child Element Value

For our first XPath query, we’ll fetch a content node that contains a child element with a specific value.

Before you can execute any queries against the XML document, you have to register the Atom namespace – that’s the xmlns attribute in the opening feed tag. If you don’t, you won’t get any results. To do that, invoke the XML document instance’s registerXPathNamespace() method. It accepts two arguments: the prefix that you want to use for the namespace and the namespace URI, which is the value of the xmlns attribute.

Once you’ve done that, you have to include your namespace prefix at every level of your query paths. Ignoring namespaces for a moment, the XPath query to fetch a content node with a <shop_no> value of 10 would look like this:

//entry/content[shop_no[text()='10']]

Our “atom” namespace must be prefixed to every document element, giving us this string:

//atom:entry/atom:content[atom:shop_no[text()='10']]

Note that text() doesn’t require a prefix because it is a node test rather than an element name.
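
To see why the prefix matters, here is a small sketch that runs both forms of the query against a one-entry feed. The inline feed and the $unprefixed and $prefixed variable names are assumptions made for illustration:

```php
<?php
// A one-entry feed built inline for illustration only.
$xml = new SimpleXMLElement(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    . '<entry><content type="application/xml"><shop_no>10</shop_no></content></entry>'
    . '</feed>'
);

// Without a prefix, XPath looks for elements in no namespace and finds nothing.
$unprefixed = $xml->xpath("//entry/content[shop_no[text()='10']]");

// The prefix name is arbitrary; only the namespace URI has to match the feed's xmlns.
$xml->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
$prefixed = $xml->xpath("//atom:entry/atom:content[atom:shop_no[text()='10']]");

var_dump(empty($unprefixed), count($prefixed));
```

Because only the URI is compared, a prefix such as 'a' would work equally well, provided the same prefix is used throughout the query.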

With that in mind, here is the code that executes the query by invoking the XML document instance’s xpath() method. On success it returns an array of SimpleXMLElement objects, which is simply empty when nothing matches; if the query itself fails it returns FALSE (or null):

//the atom namespace MUST be registered before using xpath
$pricesfeedXml->registerXpathNamespace('atom' , 'http://www.w3.org/2005/Atom');

$shop_no = get_post_meta( $restaurant_id, 'shop_number', true );
$xpath = "//atom:entry/atom:content[atom:shop_no[text()='" . $shop_no . "']]";
$prices = $pricesfeedXml->xpath($xpath);
if ( empty( $prices ) ) {
  echo 'No prices found for shop no ' . $shop_no . "\n";
}
else {
  //do something with the price
}
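
As one possible way to fill in the “do something with the price” step, the sketch below maps each matching product number to its take-away price. The take_away_prices_for_shop() helper is hypothetical, and the inline feed is a cut-down copy of the sample data:

```php
<?php
// Cut-down copy of the sample feed, embedded for illustration only.
$feed = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <product_no><![CDATA[682]]></product_no>
      <TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <product_no><![CDATA[698]]></product_no>
      <TakeAwayPrice><![CDATA[1.69]]></TakeAwayPrice>
    </content>
  </entry>
</feed>
XML;

$pricesfeedXml = new SimpleXMLElement($feed, LIBXML_NOCDATA);

// Hypothetical helper: returns product_no => take-away price for one shop.
function take_away_prices_for_shop(SimpleXMLElement $xml, string $shopNo): array
{
    $xml->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
    $nodes = $xml->xpath(
        "//atom:entry/atom:content[atom:shop_no[text()='" . $shopNo . "']]"
    );

    $prices = [];
    // "?: []" turns a failed query (false or null) into an empty loop.
    foreach ($nodes ?: [] as $content) {
        $prices[(string) $content->product_no] = (float) (string) $content->TakeAwayPrice;
    }
    return $prices;
}

print_r(take_away_prices_for_shop($pricesfeedXml, '2'));
```

A shop number that does not appear in the feed simply yields an empty array, so the caller can test the result with empty().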

Fetching an Entry Node by Child Element Value

If you require one or more of the <entry> node’s own child elements, such as its id or title, you have a couple of options available. First, you can perform the same lookup as above and then step up to the parent node using the parent:: axis, as follows:

$xpath = "//atom:entry/atom:content[atom:shop_no[text()='" . $shop_no . "']]/parent::*";
$entry_node = $pricesfeedXml->xpath($xpath);

Another way to accomplish the same thing would be to simply move the content step into the predicate (the square brackets), alongside the shop_no test:

$xpath = "//atom:entry[atom:content[atom:shop_no[text()='" . $shop_no . "']]]";
$entry_node = $pricesfeedXml->xpath($xpath);
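
The two approaches should select the same node. A quick sketch to check this, using a made-up two-entry feed (the entry titled 900 for shop 10 is invented for this example) and the illustrative variable names $viaParent and $viaPredicate:

```php
<?php
// Made-up two-entry feed; only the second entry belongs to shop 10.
$feed = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>682</title>
    <content type="application/xml"><shop_no>2</shop_no></content>
  </entry>
  <entry>
    <title>900</title>
    <content type="application/xml"><shop_no>10</shop_no></content>
  </entry>
</feed>
XML;

$xml = new SimpleXMLElement($feed);
$xml->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
$shop_no = '10';

// Option 1: match the content node, then step up to its parent entry.
$viaParent = $xml->xpath(
    "//atom:entry/atom:content[atom:shop_no[text()='" . $shop_no . "']]/parent::*"
);

// Option 2: select the entry directly and put the whole test in a predicate.
$viaPredicate = $xml->xpath(
    "//atom:entry[atom:content[atom:shop_no[text()='" . $shop_no . "']]]"
);

echo (string) $viaParent[0]->title, ' / ', (string) $viaPredicate[0]->title, "\n";
```

The predicate form tends to read more naturally, because the path names the node you want and the brackets hold the condition.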

Homing in On a Specific Value

Going the other way, you can fetch a single element based on the value of one of its siblings using something like the following query:

$xpath = "//atom:entry/atom:content/atom:TakeAwayPrice[../atom:shop_no[text()='" . $shop_no . "']]";
$take_away_price_node = $pricesfeedXml->xpath($xpath);

Since the xpath() method returns an array even for a single match, you have to take the first element and cast it to a string to get the text value:

$take_away_price = (string) $take_away_price_node[0];
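
Indexing into an empty result triggers a warning, so it can be worth guarding the lookup. A minimal sketch, assuming a one-entry inline feed and a null fallback when no price is found (both are choices made for this example):

```php
<?php
// One-entry feed built inline for illustration only.
$xml = new SimpleXMLElement(
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry><content type="application/xml">'
    . '<shop_no><![CDATA[2]]></shop_no><TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>'
    . '</content></entry></feed>',
    LIBXML_NOCDATA
);
$xml->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');

$nodes = $xml->xpath(
    "//atom:entry/atom:content/atom:TakeAwayPrice[../atom:shop_no[text()='2']]"
);

// Guard against an empty (or failed) result before indexing into it,
// then convert the node's text to a number.
$take_away_price = !empty($nodes) ? (float) (string) $nodes[0] : null;

var_dump($take_away_price);
```

Casting to a string first and then to a float keeps the conversion explicit, which helps if the price is later used in arithmetic.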

Conclusion

Using XPath to query XML documents for specific data sure beats iterating over every node element. Depending on the language that you are working in, the exact mechanism and syntax for using XPath may differ, but the general principles outlined here should hold up well.

Rob Gravelle
Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.
