I had a need to get some data that wasn’t available in the merchant’s datafeed, and since I wasn’t going to be putting a huge burden on the merchant’s site (i.e., if you scrape, be nice*), I decided to write a little scraper to do so.

In this script, I’m using only built-in PHP functionality: cURL, the Document Object Model (DOM), and SimpleXML. By sticking to PHP’s built-in functions, the script stays as portable and efficient as possible.

In my particular example, I’m trying to get the YouTube XML data from the embedded video on www.tennisexpress.com/PRINCE-Thunder-Rip-OS-Tennis-Racquets-6936. So that’s set as my URL to fetch:

$cURL = 'http://www.tennisexpress.com/PRINCE-Thunder-Rip-OS-Tennis-Racquets-6936';

Next, I use cURL to fetch the HTML source. If cURL isn’t available, you can use file_get_contents(), but cURL gives you much more control. I added some retries, too:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $cURL);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
	'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.15) Gecko/2009101601 Firefox/3.0.15 GTB6 (.NET CLR 3.5.30729)'
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

$cHTML = curl_exec($ch);

$nTries = 1;
while (($nTries < 3) && (curl_error($ch))) {
	echo "Try $nTries - curl error: " . curl_error($ch);
	echo("\t$cURL\n\n");
	$cHTML = curl_exec($ch);
	$nTries++;
} // ends while (($nTries < 3) && (curl_error($ch)))

if (curl_error($ch)) {
	echo "Curl error: " . curl_error($ch);
	echo("\n$cURL\n");

	exit();
} // ends if (curl_error($ch))

Now that I have the HTML source, I need to load it into a DOMDocument, then parse that and load up my XPath:

// Create the DOM Document
$dom = new DOMDocument();

// Load the HTML
@$dom->loadHTML($cHTML);

// Get the paths
$xPath = new DOMXPath($dom);
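If you’d rather not blanket-suppress errors with @, a hedged alternative is to let libxml collect the (very common) HTML parse warnings instead. A minimal sketch, using a throwaway piece of malformed HTML as the input:

```php
<?php
// Collect libxml warnings rather than silencing everything with @
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<p>Unclosed paragraph<br>');  // sample malformed HTML

// Inspect (or just discard) whatever warnings the parse produced
$aErrors = libxml_get_errors();
libxml_clear_errors();

$xPath = new DOMXPath($dom);
echo $xPath->evaluate('//p')->item(0)->textContent;
```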

At this point, you shouldn’t have had to make any changes to the code, except for the starting URL. However, this is where you have to start customizing the script for your own needs. To get the path to the object you’re trying to scrape, the Web Developer Firefox toolbar is ideal. Within the toolbar, under Outline, choose Outline Current Element, and as you mouse over elements of the page, you’ll see the path directly under the toolbar. If you press Ctrl-C while moused over an element, the path is copied to your clipboard.

XPath

html > body > div #wrapper > div #mainContent > div #rightContent2 > div #productPageWrapper > div #prodImageContainer > div > div > object > embed

This XPath is what you’re going to use to navigate the DOMDocument to get the data you’re looking for.

$cEmbedURL = $xPath->evaluate("/html/body/div[@id='wrapper']/div[@id='mainContent']/div[@id='rightContent2']/div[@id='productPageWrapper']/div[@id='prodImageContainer']/div/div/object/embed")->item(0)->getAttribute('src');

Hopefully you can follow along, but basically you copy the path from the Web Developer Firefox toolbar into your PHP code, with some slight modifications.

Tip: If you’re looking for the text within a tag, instead of using ->getAttribute('src'), simply use ->textContent.
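To make that attribute-versus-text distinction concrete, here’s a tiny self-contained sketch; the markup and the "demo" id are made up just for the example:

```php
<?php
// Same DOM + XPath API as the scraper, on an inline HTML snippet
$dom = new DOMDocument();
@$dom->loadHTML('<div id="demo"><a href="http://example.com/">Example Link</a></div>');
$xPath = new DOMXPath($dom);

$a = $xPath->evaluate("//div[@id='demo']/a")->item(0);

echo $a->getAttribute('href'); // the attribute value: http://example.com/
echo "\n";
echo $a->textContent;          // the text inside the tag: Example Link
```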

If you happen to be looking to scrape YouTube XML data, the rest of my script goes something like:

// Pull out the video ID
$cVideoID = str_replace('http://www.youtube.com/v/', '', str_replace('&hl=en&fs=1', '', $cEmbedURL));

// Fetch YouTube's XML data on the video
$cXMLURL = 'http://www.youtube.com/oembed?url=' . urlencode('http://www.youtube.com/watch?v=' . $cVideoID) . '&format=xml';
echo("\$cXMLURL is $cXMLURL\n");

// Parse the XML
$oXML = simplexml_load_file($cXMLURL);

// You wouldn't use print_r in your final script, but this shows you what you now have to work with
print_r($oXML);
/*
SimpleXMLElement Object
(
    [provider_url] => http://www.youtube.com/
    [title] => Prince Thunder RIP Tennis Racquet- Tennis Express Racket Review
    [html] => <object width="480" height="295"><param name="movie" value="http://www.youtube.com/v/iI-G0BWk_yI?version=3"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/iI-G0BWk_yI?version=3" type="application/x-shockwave-flash" width="480" height="295" allowscriptaccess="always" allowfullscreen="true"></embed></object>
    [author_name] => tennisexpress
    [height] => 295
    [thumbnail_width] => 480
    [width] => 480
    [version] => 1.0
    [author_url] => http://www.youtube.com/user/tennisexpress
    [provider_name] => YouTube
    [thumbnail_url] => http://i2.ytimg.com/vi/iI-G0BWk_yI/hqdefault.jpg
    [type] => video
    [thumbnail_height] => 360
)
*/
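From there, individual fields can be read straight off the object. A minimal sketch, using a trimmed inline stand-in for the oEmbed response above (SimpleXML returns objects, so cast to string before use):

```php
<?php
// Trimmed, hypothetical stand-in for YouTube's oEmbed XML response
$cXML = '<oembed>'
      . '<title>Prince Thunder RIP Tennis Racquet</title>'
      . '<author_name>tennisexpress</author_name>'
      . '<thumbnail_url>http://i2.ytimg.com/vi/iI-G0BWk_yI/hqdefault.jpg</thumbnail_url>'
      . '</oembed>';
$oXML = simplexml_load_string($cXML);

// Cast each SimpleXMLElement field to string before storing or echoing
$cTitle     = (string) $oXML->title;
$cThumbnail = (string) $oXML->thumbnail_url;

echo "$cTitle\n$cThumbnail\n";
```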

Now if you’re building a datafeed site and want a leg up on your competition, you can scrape the merchant’s site* for that extra content.

*I know there are going to be a lot of people (particularly non-affiliates) who think scraping is bad. So I have some ground rules when doing this:

  1. Ask for a copy of the merchant’s database. You’re going to be generating sales for them, so they’ll probably give it to you if you just ask.
  2. If they won’t, or can’t, scrape slowly. Don’t hammer their site by requesting multiple pages per second. Add some sleep() calls in there. Even better: scrape overnight, not during the busy shopping hours.
  3. Don’t use your affiliate link as the starting URL! This will drive up clicks, with 0 conversions, and kill the program’s EPC.
  4. Cache results for as long as you can. Caching not only takes the load off of the merchant’s site, it keeps your own site fast as well.
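Rules #2 and #4 can be sketched in a few lines: cache each fetched page to disk so a URL is only requested once per day, and pause between live requests. The function name, cache location, and 2-second pause are arbitrary choices; in a real script you’d swap the file_get_contents() call for the cURL routine above.

```php
<?php
// Hypothetical helper: throttled, cached fetch (rules #2 and #4)
function fetchCached($cURL, $nCacheSeconds = 86400) {
	$cCacheFile = sys_get_temp_dir() . '/scrape_' . md5($cURL) . '.html';

	// Serve from cache if we fetched this URL recently
	if (file_exists($cCacheFile) && (time() - filemtime($cCacheFile)) < $nCacheSeconds) {
		return file_get_contents($cCacheFile);
	}

	$cHTML = file_get_contents($cURL); // or the cURL fetch from earlier
	file_put_contents($cCacheFile, $cHTML);

	sleep(2); // be nice: pause before the next live request
	return $cHTML;
}
```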

10 Comments » for Scraping Websites using PHP’s Document Object Model
  1. John Ward says:

    XPATH is the way to go. Saved me so much time since I discovered it.

  2. Dirkjan says:

    Hi Eric,

    About point 3, “Don’t use your affiliate link as the starting URL”: how are you getting the URL, since most datafeeds only give the affiliate link?

    • Eric Nagel says:

      Hi Dirkjan –

      That was the most difficult part. I used the site’s search engine and popped in the Item Number, then took the first (and only) result to get the product’s URL.

      Sometimes you can guess what the URL is from the item ID. Be creative on this part!

  3. CtrTard says:

    Great post. I just discovered Xpath recently. It’s a great tool to utilize for scraping.

    One thing I suggest, instead of Web Developer Tools, use the XPather add-on for Firefox. https://addons.mozilla.org/af/firefox/addon/xpather/ This lets you easily get the true XPath and even has a point-and-click mode where you just click on the page element you want an XPath for.

  4. Matt Pardo says:

    Eric, great stuff. FYI, you may want to look at Perl or Expect when you get a chance. Writing bots is very easy with these two languages especially expect although I still prefer Perl.

  5. Shane says:

    Dude, you are the man. This is exactly what I needed. I will second the vote for Xpather, too. I had to use it in conjunction with the Web Developer toolbar so that I could pinpoint exactly which element I wanted, but it saved me so much time translating into xpath syntax.

    A problem I ran into was that textContent and nodeValue both return the contents with the HTML stripped out, though. To get the full contents of the element, this is what I’m doing:


    $element = $xPath->evaluate("...")->item(0);
    $elementcontent = $dom->saveXML($element);

  6. Jerry Lee says:

    So, if I am not getting youtube stuff, but just links from another site, how do I echo those results out?

  7. Appartamenti Barcellona says:

    Well, great article. I’m figuring out how to make it work in my case, but it is absolutely a good base to start from.
    Thanks Eric

  8. Carlos Fabuel Cava says:

    Good work. It is good practice to make this work with only PHP functions.
