Eric Nagel

Scraping Websites using PHP’s Document Object Model

I have been, or can be if you click on a link and make a purchase, compensated via a cash payment, gift, or something else of value for writing this post. Regardless, I only recommend products or services I use personally and believe will be good for my readers.

I had a need to get some data that wasn’t available in the merchant’s datafeed, and since I wasn’t going to put a huge burden on the merchant’s site (i.e., if you scrape, be nice*), I decided to write a little scraper to do so.

In this script, I’m using nothing but built-in PHP functionality: cURL, the Document Object Model (DOM), and SimpleXML. By sticking to PHP’s built-in functions, the script stays as portable and efficient as possible.

In my particular example, I’m trying to get the YouTube XML data from the embedded video on www.tennisexpress.com/PRINCE-Thunder-Rip-OS-Tennis-Racquets-6936. So that’s set as my URL to fetch:

$cURL = 'http://www.tennisexpress.com/PRINCE-Thunder-Rip-OS-Tennis-Racquets-6936';

Next, I use cURL to fetch the HTML source. If cURL isn’t available, you can use file_get_contents() instead, but cURL gives you much more control. I added in some retries, too:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $cURL);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
	'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.15) Gecko/2009101601 Firefox/3.0.15 GTB6 (.NET CLR 3.5.30729)'
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

$cHTML = curl_exec($ch);

$nTries = 1;
while (($nTries < 3) && (curl_error($ch))) {
	echo "Try $nTries - curl error: " . curl_error($ch);
	echo("\t$cSearchURL\n\n");
	$cHTML = curl_exec($ch);
	$nTries++;
} // ends while (($nTries < 3) && (curl_error($ch)))

if (curl_error($ch)) {
	echo "Curl error: " . curl_error($ch);
	echo("\n$cSearchURL\n");

	exit();
} // ends if (curl_error($ch))
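
As mentioned above, file_get_contents() also works if cURL isn’t compiled into your PHP. A rough sketch of that fallback (the User-Agent string here is just a placeholder, and you lose the easy retry loop):

// Sketch: file_get_contents() fallback if cURL is unavailable
$context = stream_context_create(array(
	'http' => array(
		'method'  => 'GET',
		'header'  => 'User-Agent: Mozilla/5.0 (compatible; scraper)',
		'timeout' => 10,
	),
));

$cHTML = file_get_contents($cURL, false, $context);
if ($cHTML === false) {
	echo "Failed to fetch $cURL\n";
	exit();
}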

Now that I have the HTML source, I need to load it into a DOMDocument, then parse that and load up my XPath:

// Create the DOM Document
$dom = new DOMDocument();

// Load the HTML (the @ suppresses the warnings DOMDocument
// throws on real-world, not-quite-valid markup)
@$dom->loadHTML($cHTML);

// Get the paths
$xPath = new DOMXPath($dom);

At this point, you shouldn’t have had to make any changes to the code, except for the starting URL. However, this is where you have to start customizing the script for your own needs. To get the path to the element you’re trying to scrape, the Web Developer Firefox toolbar is ideal. Within the toolbar, under Outline, choose Outline Current Element, and as you mouse over elements of the page, you’ll see the path directly under the toolbar. If you press Ctrl-C while hovering over an element, the path is copied to your clipboard.

html > body > div #wrapper > div #mainContent > div #rightContent2 > div #productPageWrapper > div #prodImageContainer > div > div > object > embed

This path, converted to XPath syntax, is what you’re going to use to navigate the DOMDocument to get the data you’re looking for.

$cEmbedURL = $xPath->evaluate("/html/body/div[@id='wrapper']/div[@id='mainContent']/div[@id='rightContent2']/div[@id='productPageWrapper']/div[@id='prodImageContainer']/div/div/object/embed")->item(0)->getAttribute('src');
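
One caveat: if the merchant redesigns the page, item(0) will be NULL and the getAttribute() call will throw a fatal error. A slightly more defensive version of the same lookup (the original XPath, just split up with a length check):

$oNodes = $xPath->evaluate("/html/body/div[@id='wrapper']/div[@id='mainContent']/div[@id='rightContent2']/div[@id='productPageWrapper']/div[@id='prodImageContainer']/div/div/object/embed");

if ($oNodes->length == 0) {
	echo "Couldn't find the embed tag - did the page layout change?\n";
	exit();
}

$cEmbedURL = $oNodes->item(0)->getAttribute('src');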

Hopefully you can follow along: you copy the path from the Web Developer toolbar into your PHP code with some slight modifications, replacing each " > " separator with a "/" and rewriting IDs like div #wrapper as div[@id='wrapper'].

Tip: If you’re looking for text within a tag, instead of using ->getAttribute('src') simply use ->textContent
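
For example, to grab the text of the page’s <title> tag (a hypothetical lookup, not part of this script):

// Grab the text inside the <title> tag instead of an attribute
$cTitle = $xPath->evaluate('/html/head/title')->item(0)->textContent;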

If you happen to be looking to scrape YouTube XML data, the rest of my script goes something like:

// Pull out the video ID
$cVideoID = str_replace('http://www.youtube.com/v/', '', str_replace('&hl=en&fs=1', '', $cEmbedURL));
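
// A less brittle alternative (a sketch, not part of the original script):
// the str_replace() calls above assume the embed URL always ends in
// &hl=en&fs=1, while a regex on the /v/VIDEOID portion survives
// parameter changes
if (preg_match('#youtube\.com/v/([A-Za-z0-9_-]+)#', $cEmbedURL, $aMatches)) {
	$cVideoID = $aMatches[1];
}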

// Fetch YouTube's XML data on the video
$cXMLURL = 'http://www.youtube.com/oembed?url=http%3A//www.youtube.com/watch?v%3D' . urlencode($cVideoID) . '&format=xml';
echo("\$cXMLURL is $cXMLURL\n");

// Parse the XML
$oXML = simplexml_load_file($cXMLURL);

// You wouldn't use print_r in your final script, but this shows you what you now have to work with
print_r($oXML);
/*
SimpleXMLElement Object
(
    [provider_url] => http://www.youtube.com/
    [title] => Prince Thunder RIP Tennis Racquet- Tennis Express Racket Review
    [html] => <object width="480" height="295"><param name="movie" value="http://www.youtube.com/v/iI-G0BWk_yI?version=3"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/iI-G0BWk_yI?version=3" type="application/x-shockwave-flash" width="480" height="295" allowscriptaccess="always" allowfullscreen="true"></embed></object>
    [author_name] => tennisexpress
    [height] => 295
    [thumbnail_width] => 480
    [width] => 480
    [version] => 1.0
    [author_url] => http://www.youtube.com/user/tennisexpress
    [provider_name] => YouTube
    [thumbnail_url] => http://i2.ytimg.com/vi/iI-G0BWk_yI/hqdefault.jpg
    [type] => video
    [thumbnail_height] => 360
)
*/
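
From there, each field is just a property on the SimpleXMLElement. For example (casting to string, since SimpleXML returns objects rather than plain strings):

// Pull individual fields off the SimpleXML object; cast to string,
// since each property is a SimpleXMLElement, not a plain string
$cTitle     = (string) $oXML->title;
$cThumbnail = (string) $oXML->thumbnail_url;

echo("Video: $cTitle\nThumbnail: $cThumbnail\n");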

Now if you’re building a datafeed site and want a leg up on your competition, you can scrape the merchant’s site* for that extra content.

*I know there are going to be a lot of people (particularly non-affiliates) who think scraping is bad. So I have some ground rules when doing this:

  1. Ask for a copy of the merchant’s database. You’re going to be generating sales for them; they’ll probably give it to you if you just ask
  2. If they won’t, or can’t, scrape slowly. Don’t hammer their site, requesting multiple pages per second. Add some sleep() calls in there (see the sketch after this list). Even better: scrape overnight, not during the busy shopping hours
  3. Don’t use your affiliate link as the starting URL! This will drive up clicks, with 0 conversions, and kill the program’s EPC.
  4. Cache results for as long as you can. Caching not only takes the load off of the merchant’s site, but keeps your own site fast, as well.
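
To put rules 2 and 4 into practice, a wrapper like this is usually enough (a rough sketch; the five-second delay, cache directory, and one-week lifetime are placeholders, and the cache directory is assumed to already exist):

// Sketch: throttle requests and cache pages to disk (rules 2 and 4)
function fetch_cached($cURL, $cCacheDir = './cache', $nMaxAge = 604800) {
	$cCacheFile = $cCacheDir . '/' . md5($cURL) . '.html';

	// Serve from cache if it's fresh enough (default: one week)
	if (file_exists($cCacheFile) && (time() - filemtime($cCacheFile) < $nMaxAge)) {
		return file_get_contents($cCacheFile);
	}

	// Be nice: wait a few seconds between live requests
	sleep(5);

	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $cURL);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_TIMEOUT, 10);
	$cHTML = curl_exec($ch);
	curl_close($ch);

	if ($cHTML !== false) {
		file_put_contents($cCacheFile, $cHTML);
	}

	return $cHTML;
}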