XPath: A small tutorial

David G - Drupal

Recently I was migrating a set of pages from an old site to a new site. To stay flexible, I wanted to scrape the content from the pages. Having used XPath and CSS selectors for trivial tasks before, I decided to try my hand at using XPath from within PHP for a more cumbersome task. I will not be providing a full tutorial on all aspects of XPath, but will rather describe my current problem and my solution. There are many basic XPath tutorials on the internet and, like any language, it’s simply good to immerse yourself and learn it slowly over time.

So my initial problem is that I have a page from a legacy website that looks like this:

Basic HTML page for scraping. While it looks simple, it may not be all sunshine and roses!

The page is designed using HTML tables for layout (eww). The page uses table rows to break up content into heading + nested links pairings. By that I mean one table row holds a heading, and the following table row contains additional HTML with the link(s) related to (or nested under) that parent heading.

These linked pages are being moved into a Drupal CMS site. I would like the headings to be placed into a taxonomy tree, the nested pages to be categorized under the correct heading, and the content of each destination page on the legacy site to be moved over as, essentially, a Drupal page. For this task I’ll be using the Migrate module, but as I mentioned, this blog post will merely cover the XPath necessary to accomplish this task.

So the page is broken up into related segments. We can kind of see this visually using Firebug:

Firebug shows the related table rows with a header and nested links.

Much like CSS Selectors we can target major elements of an HTML page using XPath:

  • /some/path/to/item/on/page – An example location path in XPath. For instance, the head element of any HTML page can be reached with the path /html/head. Similarly, the body tag of an HTML page is at the location /html/body

Note that in my example above the location paths to head and body are absolute. That is, we can treat the DOM a lot like a tree, and in part like a folder structure of nested elements, so we can easily write the direct path to items in our HTML DOM. Further simple examples:

  • /html/body/h1[1] – Get the 1st H1 element of the body element of a page.
  • /html/body/div[@class='footer'] – Get the div element found in body whose class attribute is exactly footer.

Note in the above 2 examples that XPath returns all matching elements at the same depth of the DOM unless you’re specific about what you want. h1[1] may seem a little redundant (or at least it should be), but it’s saying get the 1st H1 tag that is a direct child of body. In XPath, counting starts at 1 and not at 0 as in many languages such as C. Also note that comparing the class attribute this way assumes that footer is the only class assigned to the div. To search for a div which has footer as 1 of numerous classes you’d likely want to use the XPath contains() function.
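To make the points above concrete, here is a minimal PHP sketch using DOMDocument and DOMXPath. The markup is a made-up sample (not the legacy page from this post), and the space-padding idiom around contains() is one common way to match a single class among several, offered here as an assumption rather than the only option.

```php
<?php
// Hypothetical sample markup, invented for this illustration.
$html = <<<'HTML'
<html><head><title>Demo</title></head>
<body>
  <h1>First heading</h1>
  <h1>Second heading</h1>
  <div class="footer">Only footer</div>
  <div class="wide footer">Footer plus another class</div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Positions are 1-based: h1[1] is the first h1 child of body.
$firstH1 = $xpath->query('/html/body/h1[1]')->item(0)->textContent;

// Exact comparison: the class attribute must be the string "footer" and nothing else.
$exact = $xpath->query("/html/body/div[@class='footer']")->length;

// Token match: pad the class list with spaces so "footer" is matched as a whole word.
$token = $xpath->query(
    "/html/body/div[contains(concat(' ', normalize-space(@class), ' '), ' footer ')]"
)->length;

echo $firstH1, "\n", $exact, "\n", $token, "\n";
```

With this sample, the exact comparison should only find the first div, while the contains() version should also find the div carrying two classes.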

The last item I want to look at before I show my solution is the descendant-or-self axis. When using XPath you typically have a target node you want to find. As your expression traverses the DOM, each step of your query is evaluated relative to a context node, and that node can itself be queried and inspected.

The descendant-or-self axis can be abbreviated as // (strictly, // is short for /descendant-or-self::node()/). This means myself or things deeper in the DOM. Keep in mind that using it usually means the full DOM below that point must be searched. Some examples of using this are:

  • //h3 – Give me all H3 tags found on the page.
  • /html/body/section/nav//a – If your webpage was HTML5 this could give you all link tags within a specific navigation block.
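The two descendant-or-self examples above can be tried with a short PHP sketch like the one below. The page markup is hypothetical, and it assumes the HTML5 navigation element is nav. Since PHP's DOMDocument relies on libxml2's HTML parser, which reports HTML5 tags as errors, the sketch collects those errors internally instead of printing warnings.

```php
<?php
// Hypothetical HTML5-style sample page, invented for this illustration.
$html = <<<'HTML'
<html><body>
  <h3>Top</h3>
  <section>
    <nav><ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul></nav>
    <h3>Inside section</h3>
  </section>
  <a href="/c">Outside nav</a>
</body></html>
HTML;

// Keep parser complaints about unknown tags (section, nav) out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// "//h3" searches the whole document, at any depth.
$h3Count = $xpath->query('//h3')->length;

// An absolute path down to nav, then "//a" for links at any depth below it.
$navLinks = [];
foreach ($xpath->query('/html/body/section/nav//a') as $a) {
    $navLinks[] = $a->getAttribute('href');
}

echo $h3Count, "\n", implode(', ', $navLinks), "\n";
```

Note how the link outside the nav block is not picked up, because the search only starts below /html/body/section/nav.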

So with all the above pieces of knowledge I can, with some difficulty, write some XPath expressions to get the data I want:

Give me all the headings on my HTML page:

//tr//td[@class='docpres'] – give me all the headings of the document.

Then in PHP I can loop through all the headings I get and do an additional query:
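As a rough sketch of that loop, the PHP below runs the heading query against a small stand-in table. The table contents and URLs are invented for illustration; only the docpres class and the “Popularity” heading come from this post.

```php
<?php
// Hypothetical stand-in for the legacy page: heading rows followed by link rows.
$html = <<<'HTML'
<html><body><table>
  <tr><td class="docpres">Popularity</td></tr>
  <tr><td><a href="/pages/top10.html">Top 10</a> <a href="/pages/trends.html">Trends</a></td></tr>
  <tr><td class="docpres">Archive</td></tr>
  <tr><td><a href="/pages/2009.html">2009</a></td></tr>
</table></body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect the text of every heading cell, in document order.
$headings = [];
foreach ($xpath->query("//tr//td[@class='docpres']") as $td) {
    $headings[] = trim($td->textContent);
}

print_r($headings);
```

Each entry in $headings can then drive a follow-up query for the links that belong to it.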

This is the expression I used to get a URL list under a header.

This is the meat and potatoes of my task! This says: give me all link tags from the row (TR) that immediately follows the TD’s parent row, where the text of that previous row’s TD is “Popularity” in this example. Please excuse the image; WordPress wasn’t accepting the expression as text due to the complex markup!

Note that this query uses parentheses to group and chain my logic.

Other syntactical items I haven’t mentioned yet are:

.. – (dot dot) This is the abbreviated form of the parent axis, parent::node(). Again, the DOM is a tree, and at any point I can query what’s around me using XPath functions or inspect attributes as needed. following-sibling is an example of another axis specifier in XPath. So I’m requesting self or deeper (//) from my TD’s parent, limited to its first following sibling (the next TR).
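Since the expression itself is only shown as an image, here is a hedged reconstruction from the description above; it may differ from the exact expression used. It combines a predicate on the heading text, .. to step up to the row, following-sibling::tr[1] for the next row, and a parenthesized group followed by //a. The sample table and URLs are invented. The second half shows a relative variant that passes each heading cell as the context node to DOMXPath::query(), which suits the PHP loop over headings.

```php
<?php
// Hypothetical stand-in for the legacy page.
$html = <<<'HTML'
<html><body><table>
  <tr><td class="docpres">Popularity</td></tr>
  <tr><td><a href="/pages/top10.html">Top 10</a> <a href="/pages/trends.html">Trends</a></td></tr>
  <tr><td class="docpres">Archive</td></tr>
  <tr><td><a href="/pages/2009.html">2009</a></td></tr>
</table></body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Reconstructed absolute form: heading cell -> parent row -> next row -> links.
$expr = "(//td[@class='docpres'][text()='Popularity']/../following-sibling::tr[1])//a";
$popularityLinks = [];
foreach ($xpath->query($expr) as $a) {
    $popularityLinks[] = $a->getAttribute('href');
}

// Relative form: run the same steps from each heading cell as the context node.
$byHeading = [];
foreach ($xpath->query("//tr//td[@class='docpres']") as $td) {
    $name = trim($td->textContent);
    $byHeading[$name] = [];
    foreach ($xpath->query('../following-sibling::tr[1]//a', $td) as $a) {
        $byHeading[$name][] = $a->getAttribute('href');
    }
}

print_r($popularityLinks);
print_r($byHeading);
```

The relative form avoids building a new expression string per heading, which also sidesteps quoting problems when a heading contains an apostrophe.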

Lastly, another way to write text() = 'Popularity', for example, is:

This version uses the contains() substring function and the abbreviation ‘.’, meaning self (the current node), within our query. I opted to use text() because I feel it’s slightly less cryptic should I need to look at this code at a later date.
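The two styles are not strictly equivalent, which a small PHP sketch can show. Here the contains() form is assumed to look like contains(., 'Popularity'); the sample rows are invented. text() = '...' compares individual text nodes for an exact match, while contains(., '...') looks for a substring in the element's whole string value, including text inside nested tags.

```php
<?php
// Hypothetical heading rows: a plain one, one with nested markup, one unrelated.
$html = <<<'HTML'
<html><body><table>
  <tr><td class="docpres">Popularity</td></tr>
  <tr><td class="docpres"><b>Popularity</b> (2010)</td></tr>
  <tr><td class="docpres">Archive</td></tr>
</table></body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact match against a direct text node of the cell.
$exact = $xpath->query("//td[@class='docpres'][text()='Popularity']")->length;

// Substring match against the cell's full string value.
$loose = $xpath->query("//td[@class='docpres'][contains(., 'Popularity')]")->length;

echo $exact, "\n", $loose, "\n";
```

With this sample, only the plain cell should satisfy the text() test, while contains() should also pick up the cell whose heading is wrapped in a b tag. Which behavior is wanted depends on how regular the legacy markup is.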

It took me the better part of a day to derive these 2 XPath expressions, but since PHP supports XPath (1.0) I can use file_get_contents and PHP to tear apart HTML fairly easily! Woo!
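For completeness, here is a minimal sketch of that pipeline: read the page with file_get_contents, parse it with DOMDocument, and hand it to DOMXPath. The helper name xpath_for_html_source is hypothetical, and a temporary local file stands in for the legacy page so the example runs without network access; in practice the argument could be a URL if allow_url_fopen is enabled.

```php
<?php
/**
 * Build a DOMXPath for the HTML found at a file path or URL.
 * Hypothetical helper, invented for this illustration.
 */
function xpath_for_html_source(string $source): DOMXPath
{
    $html = file_get_contents($source);
    if ($html === false) {
        throw new RuntimeException("Could not read $source");
    }

    // Legacy markup is rarely valid, so collect parser errors quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($doc);
}

// A temporary file stands in for the legacy page.
$path = tempnam(sys_get_temp_dir(), 'xpath');
file_put_contents(
    $path,
    '<html><body><table>'
    . '<tr><td class="docpres">Popularity</td></tr>'
    . '<tr><td><a href="/pages/top10.html">Top 10</a></td></tr>'
    . '</table></body></html>'
);

$xpath = xpath_for_html_source($path);
$linkCount = $xpath->query('//a')->length;
$headingCount = $xpath->query("//tr//td[@class='docpres']")->length;
unlink($path);

echo $linkCount, "\n", $headingCount, "\n";
```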

I hope you’ve found this small example on XPath helpful.

Author Spotlight

David Gurba

I am a web programmer currently employed at UCSB. I have been developing web applications professionally for 8+ years now. For the last 5 years I’ve been actively developing websites primarily in PHP using Drupal. I have experience using LAMP and developing data driven websites for clients in aviation, higher education and e-commerce. If you’d like to contact me I can be reached at david.gurba@arvixe.com