Parsing Atom Feeds using XPath

Atom has become a very popular way to share information over the World Wide Web. The Atom Syndication Format is built on XML, so a feed can be displayed directly in modern browsers or parsed by scripts and software to fetch pertinent information in real time. Once a feed has been converted into an XML document object, a script or program can either iterate over its node elements or query the document for specific data. In this tutorial, we’ll learn how to apply the XPath query language to Atom feeds using PHP in order to target specific content.

Atom Document Structure

Take a look at the sample feed below and you’ll notice several things:

There is a feed tag that contains the base URL and an attribute called xmlns that holds the Atom namespace URI. The name xmlns is short for “XML NameSpace”. Keep that in mind because we’ll need it a little later on…
The feed contains some meta info such as the title, subtitle, and author details.
A feed may contain one or more entries, which are each identified by an id and title tag.
Entries in turn may contain content.

<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xml:base="https://afictionalsite.com/Feed.aspx?FeedName=ProductPrices" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">GoodFoodTalks_WebProductPriceByShop</title>
  <subtitle>GoodFoodTalks_WebProductPriceByShop description</subtitle>
  <id>uuid:4286313d-b515-4896-81a5-0d6c6fcd858f</id>
  <updated>2015-03-13T08:29:06Z</updated>
  <author>
    <name>A Fictional Site</name>
    <uri>https://afictionalsite.com/Feed.aspx</uri>
    <email>[email protected]</email>
  </author>
  <entry>
    <id>https://afictionalsite.com/Feed.aspx?FeedName=ProductPrices&amp;ItemID=2</id>
    <title>682</title>
    <updated>2014-11-11T11:29:28Z</updated>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <product_no><![CDATA[682]]></product_no>
      <TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>
      <EatInPrice><![CDATA[3.60]]></EatInPrice>
      <Updated><![CDATA[2014-11-11 11:29:28]]></Updated>
    </content>
  </entry>
  <entry>
    <id>https://afictionalsite.com/Feed.aspx?FeedName=ProductPrices&amp;ItemID=2</id>
    <title>698</title>
    <updated>2014-11-11T11:29:28Z</updated>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <product_no><![CDATA[698]]></product_no>
      <TakeAwayPrice><![CDATA[1.69]]></TakeAwayPrice>
      <EatInPrice><![CDATA[2.00]]></EatInPrice>
      <Updated><![CDATA[2014-11-11 11:29:28]]></Updated>
    </content>
  </entry>
  <entry>
    <id>https://afictionalsite.com/Feed.aspx?FeedName=ProductPrices&amp;ItemID=2</id>
    <title>705</title>
    <updated>2014-11-11T11:29:28Z</updated>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <product_no><![CDATA[705]]></product_no>
      <TakeAwayPrice><![CDATA[3.25]]></TakeAwayPrice>
      <EatInPrice><![CDATA[3.90]]></EatInPrice>
      <Updated><![CDATA[2014-11-11 11:29:28]]></Updated>
    </content>
  </entry>
</feed>
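
To see how these pieces look from PHP's side, here is a minimal sketch that loads a trimmed-down copy of the sample feed and reads the namespace declaration, the meta info, and the number of entries. The shortened feed and its placeholder entry id are assumptions made for illustration, and the sketch assumes that the feed declares the Atom namespace as its default namespace.

```php
<?php
// Trimmed-down copy of the sample feed (shortened for illustration).
$xml = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">GoodFoodTalks_WebProductPriceByShop</title>
  <author><name>A Fictional Site</name></author>
  <entry>
    <id>urn:example:placeholder-id</id>
    <title>682</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>
    </content>
  </entry>
</feed>
XML;

$feed = new SimpleXMLElement($xml, LIBXML_NOCDATA);

// Namespaces declared on the root element; the default namespace
// is stored under an empty-string key.
$namespaces = $feed->getDocNamespaces();

// Cast to string to turn SimpleXMLElement objects into plain values.
$feedTitle  = (string) $feed->title;
$authorName = (string) $feed->author->name;
$entryCount = count($feed->entry);

echo "Namespace: " . $namespaces[''] . "\n";
echo "Title:     " . $feedTitle . "\n";
echo "Author:    " . $authorName . "\n";
echo "Entries:   " . $entryCount . "\n";
```

The namespace URI printed here is the same string that has to be registered before running any XPath queries.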

XPath and PHP

Before you can query a feed, you have to fetch it using a utility like cURL. The response body may then be loaded into a SimpleXMLElement object via its constructor. Because the CURLOPT_RETURNTRANSFER option is set, curl_exec() returns the response as a string on success or FALSE on failure, so be sure to check for FALSE before parsing. As explained in the Display Secure Atom Feeds in WordPress article, from which the sample code originates, passing the LIBXML_NOCDATA constant to the SimpleXMLElement constructor “tells the parser to extract values from <![CDATA[]]> tags.”

Each entry is available through the entry property of the SimpleXMLElement instance, which can be iterated over like an array.

include_once(ABSPATH.WPINC.'/rss.php');

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://afictionalsite.com/Feed.aspx?FeedName=TheGoodNewsFirst');
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Execute the request; $result is the response body, or FALSE on failure
$result = curl_exec($ch);
if ($result !== FALSE) {
  // The constructor throws an Exception if the body is not well-formed XML
  $pricesfeedXml = new SimpleXMLElement($result, LIBXML_NOCDATA);
  // fetch the products for each category
  if (isset($pricesfeedXml->entry)) {
    foreach ($pricesfeedXml->entry as $entry) {
      // process each item
    }
  }
}

// close cURL resource, and free up system resources
curl_close($ch);
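
The parsing and iteration step can also be tried without a network call. The sketch below is only an illustration: parse_feed() is a hypothetical helper that is not part of the original code, and the inline feed is a shortened, made-up stand-in for the cURL response. It shows one way to cope with a body that is not well-formed XML and how to read values out of each entry.

```php
<?php
// Hypothetical helper (not from the original article): returns null
// instead of throwing when the body is not well-formed XML.
function parse_feed(string $body): ?SimpleXMLElement
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        return new SimpleXMLElement($body, LIBXML_NOCDATA);
    } catch (Exception $e) {
        return null;
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

// Stand-in for the string that curl_exec() would return.
$body = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>682</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>
    </content>
  </entry>
  <entry>
    <title>698</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <TakeAwayPrice><![CDATA[1.69]]></TakeAwayPrice>
    </content>
  </entry>
</feed>
XML;

$rows = [];
$feed = parse_feed($body);
if ($feed !== null) {
    foreach ($feed->entry as $entry) {
        // Cast to string: SimpleXML hands back objects, not plain values.
        $rows[] = [
            'product' => (string) $entry->title,
            'price'   => (string) $entry->content->TakeAwayPrice,
        ];
    }
}

print_r($rows);
```

Returning null keeps the calling code in the same shape as the FALSE check used for curl_exec(); whether to swallow or log the parse error is a design choice left open here.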

Fetching a Content Node by Attribute

For our first xpath query, we’ll fetch a content node that contains a specific attribute value.

Before you can execute any queries against the XML document, you have to register the Atom namespace, which is the value of the xmlns attribute in the opening feed tag. If you don’t, your queries won’t return any results. To do that, invoke the XML document instance’s registerXPathNamespace() method. It accepts two arguments: the prefix that you want to give the namespace and the value of the xmlns attribute, which is the Atom namespace URI.

Once you’ve done that, you have to include your namespace prefix at every level of your query paths. Without namespaces, the xpath query to fetch a content node with a <shop_no> value of 10 would look like this:

//entry/content[shop_no[text()='10']]

Our “atom” namespace must be prefixed to every document element, giving us this string:

//atom:entry/atom:content[atom:shop_no[text()='10']]

Note that the text() function doesn’t require a prefix because it’s a function and not a document element.
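
The effect of the prefix is easy to check with a small experiment. The following sketch runs both forms of the query against a shortened, made-up feed (the entries and shop numbers are assumptions for illustration) and counts the matches.

```php
<?php
// Shortened sample feed; one entry belongs to shop 10, the other to shop 2.
$xml = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>900</title>
    <content type="application/xml">
      <shop_no><![CDATA[10]]></shop_no>
      <TakeAwayPrice><![CDATA[4.50]]></TakeAwayPrice>
    </content>
  </entry>
  <entry>
    <title>682</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>
    </content>
  </entry>
</feed>
XML;

$feed = new SimpleXMLElement($xml, LIBXML_NOCDATA);

// Unprefixed names only match elements that are in no namespace at all,
// so this query finds nothing in an Atom feed.
$withoutPrefix = $feed->xpath("//entry/content[shop_no[text()='10']]");

// Bind the "atom" prefix to the Atom namespace URI, then use it on every element.
$feed->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
$withPrefix = $feed->xpath("//atom:entry/atom:content[atom:shop_no[text()='10']]");

echo count($withoutPrefix) . " match(es) without the prefix\n";
echo count($withPrefix) . " match(es) with the prefix\n";
```

The prefix itself is arbitrary; only the namespace URI it is bound to has to match the feed.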

With that in mind, here is the code that executes the query by invoking the XML document instance’s xpath() method. On success it returns an array of SimpleXMLElement objects, which is empty when nothing matches; if the query itself fails, it returns a boolean value of FALSE:

//the atom namespace MUST be registered before using xpath
$pricesfeedXml->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');

$shop_no = get_post_meta( $restaurant_id, 'shop_number', true );
$xpath = "//atom:entry/atom:content[atom:shop_no[text()='" . $shop_no . "']]";
$prices = $pricesfeedXml->xpath($xpath);
//xpath() returns FALSE on error and an empty array when nothing matches
if ( empty($prices) ) {
  echo 'No prices found for shop no ' . $shop_no . "\n";
}
else {
  //do something with the price
}
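
Because the shop number is concatenated straight into the query string, it may be worth validating it first. The sketch below wraps the lookup in a hypothetical find_shop_content() helper, which is not part of the original code, and assumes that shop numbers are purely numeric as they are in the sample feed. The inline feed is a made-up stand-in.

```php
<?php
// Hypothetical helper: returns the matching <content> nodes, or an empty array.
function find_shop_content(SimpleXMLElement $feed, string $shopNo): array
{
    // Assumption: shop numbers consist of digits only. Rejecting anything
    // else keeps quotes and XPath operators out of the query string.
    if (preg_match('/^[0-9]+$/D', $shopNo) !== 1) {
        return [];
    }

    $feed->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
    $nodes = $feed->xpath(
        "//atom:entry/atom:content[atom:shop_no[text()='" . $shopNo . "']]"
    );

    // Normalise an error result to an empty array.
    return is_array($nodes) ? $nodes : [];
}

$xml = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>682</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <TakeAwayPrice><![CDATA[2.99]]></TakeAwayPrice>
    </content>
  </entry>
  <entry>
    <title>698</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
      <TakeAwayPrice><![CDATA[1.69]]></TakeAwayPrice>
    </content>
  </entry>
  <entry>
    <title>900</title>
    <content type="application/xml">
      <shop_no><![CDATA[10]]></shop_no>
      <TakeAwayPrice><![CDATA[4.50]]></TakeAwayPrice>
    </content>
  </entry>
</feed>
XML;

$feed = new SimpleXMLElement($xml, LIBXML_NOCDATA);

foreach (find_shop_content($feed, '2') as $content) {
    echo 'Take-away price: ' . (string) $content->TakeAwayPrice . "\n";
}
```

If shop numbers can contain other characters in a real feed, the validation pattern would need to be adjusted accordingly.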

Fetching an Entry Node by Attribute

If you require one or more of the <entry> node’s properties, such as its id or title, you have a couple of options available. First, you can perform a similar lookup to the one above and then step back up the tree using the parent axis (parent::*). Because <shop_no> sits two levels below <entry>, the step has to be applied twice:

$xpath = "//atom:entry/atom:content/atom:shop_no[text()='" . $shop_no . "']/parent::*/parent::*";
$entry_node = $pricesfeedXml->xpath($xpath);

Another way to accomplish the same thing would be to simply move the content level into the square brackets, where the attribute is located:

$xpath = "//atom:entry[atom:content[atom:shop_no[text()='" . $shop_no . "']]]";
$entry_node = $pricesfeedXml->xpath($xpath);
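
As a quick sanity check, the sketch below runs two entry-level queries against a shortened, made-up feed: one with the condition nested inside square brackets on the entry, and one that walks up from <shop_no> with two parent steps. The feed contents are assumptions for illustration. Both queries should land on the same <entry> element.

```php
<?php
$xml = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>682</title>
    <content type="application/xml">
      <shop_no><![CDATA[2]]></shop_no>
    </content>
  </entry>
  <entry>
    <title>900</title>
    <content type="application/xml">
      <shop_no><![CDATA[10]]></shop_no>
    </content>
  </entry>
</feed>
XML;

$feed = new SimpleXMLElement($xml, LIBXML_NOCDATA);
$feed->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
$shop_no = '10';

// Option 1: keep the path at the entry level and nest the condition
// inside the predicate.
$viaPredicate = $feed->xpath(
    "//atom:entry[atom:content[atom:shop_no[text()='" . $shop_no . "']]]"
);

// Option 2: select the shop_no element, then climb two levels
// (shop_no -> content -> entry) with the parent axis.
$viaParents = $feed->xpath(
    "//atom:entry/atom:content/atom:shop_no[text()='" . $shop_no . "']/parent::*/parent::*"
);

echo $viaPredicate[0]->getName() . ' with title ' . (string) $viaPredicate[0]->title . "\n";
echo $viaParents[0]->getName() . ' with title ' . (string) $viaParents[0]->title . "\n";
```

The nested-predicate form does not depend on how deep <shop_no> sits relative to the node being climbed from, which arguably makes it the easier of the two to maintain.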

Homing in On a Specific Value

Going the other way, you can fetch a single property based on a sibling using something like the following query:

$xpath = "//atom:entry/atom:content/atom:TakeAwayPrice[../atom:shop_no[text()='" . $shop_no . "']]";
$take_away_price_node = $pricesfeedXml->xpath($xpath);

Since the xpath() method returns an array of SimpleXMLElement objects, you have to take the first element and cast it to a string to get at the value:

$take_away_price = (string) $take_away_price_node[0];
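
Putting the query and the value extraction together, a lookup function might look like the following sketch. The take_away_price() helper and the inline feed are hypothetical, and the function assumes that the shop number passed in has already been validated.

```php
<?php
// Hypothetical helper: returns the first matching take-away price as a
// float, or null when the shop has no entry in the feed.
// Assumption: $shopNo has already been validated by the caller.
function take_away_price(SimpleXMLElement $feed, string $shopNo): ?float
{
    $feed->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');
    $nodes = $feed->xpath(
        "//atom:entry/atom:content/atom:TakeAwayPrice[../atom:shop_no[text()='" . $shopNo . "']]"
    );

    // Covers both an error result and an empty array.
    if (empty($nodes)) {
        return null;
    }

    // Cast the element to a string first, then to a number.
    return (float) (string) $nodes[0];
}

$xml = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>900</title>
    <content type="application/xml">
      <shop_no><![CDATA[10]]></shop_no>
      <TakeAwayPrice><![CDATA[4.50]]></TakeAwayPrice>
      <EatInPrice><![CDATA[5.25]]></EatInPrice>
    </content>
  </entry>
</feed>
XML;

$feed  = new SimpleXMLElement($xml, LIBXML_NOCDATA);
$price = take_away_price($feed, '10');

echo $price === null ? "No price found\n" : 'Price: ' . number_format($price, 2) . "\n";
```

Returning null for a missing shop keeps "no price" distinct from a genuine price of zero.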

Conclusion

Using xpath to query XML documents for specific data sure beats iterating over every node element. Depending on the language that you are working in, the exact mechanism and syntax for using xpath may differ, but the general principles outlined here should hold up well.
