I had a need to get some data that wasn’t available in the merchant’s datafeed, and since I wasn’t going to be putting a huge burden on the merchant’s site (i.e., if you scrape, be nice*), I decided to write a little scraper to do so.

In this script, I’m using only built-in PHP functionality: cURL, the Document Object Model (DOM), and SimpleXML. By sticking to PHP’s built-in functions, the script stays as portable and efficient as possible.

In my particular example, I’m trying to get the YouTube XML data from the embedded video on www.tennisexpress.com/PRINCE-Thunder-Rip-OS-Tennis-Racquets-6936. So that’s set as my URL to fetch:

$cURL = 'http://www.tennisexpress.com/PRINCE-Thunder-Rip-OS-Tennis-Racquets-6936';

Next, I use cURL to fetch the HTML source. If cURL isn’t available, you can use file_get_contents(), but cURL gives you much more control. I added some retries, too:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $cURL);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
	'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.15) Gecko/2009101601 Firefox/3.0.15 GTB6 (.NET CLR 3.5.30729)'
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

$cHTML = curl_exec($ch);

$nTries = 1;
while (($nTries < 3) && (curl_error($ch))) {
	echo "Try $nTries - curl error: " . curl_error($ch);
	echo("\t$cURL\n\n");
	$cHTML = curl_exec($ch);
	$nTries++;
} // ends while (($nTries < 3) && (curl_error($ch)))

if (curl_error($ch)) {
	echo "Curl error: " . curl_error($ch);
	echo("\n$cURL\n");

	exit();
} // ends if (curl_error($ch))

Now that I have the HTML source, I need to load it into a DOMDocument, then parse that and load up my XPath:

// Create the DOM Document
$dom = new DOMDocument();

// Load the HTML
@$dom->loadHTML($cHTML);

// Get the paths
$xPath = new DOMXPath($dom);
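If you’d rather not blanket-suppress errors with @, a hedged alternative is to let libxml collect the (very common) HTML parse warnings instead. A minimal sketch, using a throwaway piece of malformed HTML as the input:

```php
<?php
// Collect libxml warnings rather than silencing everything with @
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<p>Unclosed paragraph<br>');  // sample malformed HTML

// Inspect (or just discard) whatever warnings the parse produced
$aErrors = libxml_get_errors();
libxml_clear_errors();

$xPath = new DOMXPath($dom);
echo $xPath->evaluate('//p')->item(0)->textContent;
```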

At this point, you shouldn’t have had to make any changes to the code, except for the starting URL. However, this is where you have to start customizing the script for your own needs. To get the path to the object you’re trying to scrape, the Web Developer Firefox toolbar is ideal. Within the toolbar, under Outline, choose Outline Current Element, and as you mouse over elements of the page, you’ll see the path directly under the toolbar. If you press Ctrl-C while moused over an element, the path is copied to your clipboard.

XPath

html > body > div #wrapper > div #mainContent > div #rightContent2 > div #productPageWrapper > div #prodImageContainer > div > div > object > embed

This XPath is what you’re going to use to navigate the DOMDocument to get the data you’re looking for.

$cEmbedURL = $xPath->evaluate("/html/body/div[@id='wrapper']/div[@id='mainContent']/div[@id='rightContent2']/div[@id='productPageWrapper']/div[@id='prodImageContainer']/div/div/object/embed")->item(0)->getAttribute('src');

Hopefully you can follow along, but basically you copy the path from the Web Developer Firefox toolbar into your PHP code, with some slight modifications.

Tip: If you’re looking for the text within a tag, instead of using ->getAttribute('src'), simply use ->textContent.
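To make that attribute-versus-text distinction concrete, here’s a tiny self-contained sketch; the markup and the "demo" id are made up just for the example:

```php
<?php
// Same DOM + XPath API as the scraper, on an inline HTML snippet
$dom = new DOMDocument();
@$dom->loadHTML('<div id="demo"><a href="http://example.com/">Example Link</a></div>');
$xPath = new DOMXPath($dom);

$a = $xPath->evaluate("//div[@id='demo']/a")->item(0);

echo $a->getAttribute('href'); // the attribute value: http://example.com/
echo "\n";
echo $a->textContent;          // the text inside the tag: Example Link
```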

If you happen to be looking to scrape YouTube XML data, the rest of my script goes something like:

// Pull out the video ID
$cVideoID = str_replace('http://www.youtube.com/v/', '', str_replace('&hl=en&fs=1', '', $cEmbedURL));

// Fetch YouTube's XML data on the video
$cXMLURL = 'http://www.youtube.com/oembed?url=' . urlencode('http://www.youtube.com/watch?v=' . $cVideoID) . '&format=xml';
echo("\$cXMLURL is $cXMLURL\n");

// Parse the XML
$oXML = simplexml_load_file($cXMLURL);

// You wouldn't use print_r in your final script, but this shows you what you now have to work with
print_r($oXML);
/*
SimpleXMLElement Object
(
    [provider_url] => http://www.youtube.com/
    [title] => Prince Thunder RIP Tennis Racquet- Tennis Express Racket Review
    [html] => <object width="480" height="295"><param name="movie" value="http://www.youtube.com/v/iI-G0BWk_yI?version=3"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/iI-G0BWk_yI?version=3" type="application/x-shockwave-flash" width="480" height="295" allowscriptaccess="always" allowfullscreen="true"></embed></object>
    [author_name] => tennisexpress
    [height] => 295
    [thumbnail_width] => 480
    [width] => 480
    [version] => 1.0
    [author_url] => http://www.youtube.com/user/tennisexpress
    [provider_name] => YouTube
    [thumbnail_url] => http://i2.ytimg.com/vi/iI-G0BWk_yI/hqdefault.jpg
    [type] => video
    [thumbnail_height] => 360
)
*/
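From there, individual fields can be read straight off the object. A minimal sketch, using a trimmed inline stand-in for the oEmbed response above (SimpleXML returns objects, so cast to string before use):

```php
<?php
// Trimmed, hypothetical stand-in for YouTube's oEmbed XML response
$cXML = '<oembed>'
      . '<title>Prince Thunder RIP Tennis Racquet</title>'
      . '<author_name>tennisexpress</author_name>'
      . '<thumbnail_url>http://i2.ytimg.com/vi/iI-G0BWk_yI/hqdefault.jpg</thumbnail_url>'
      . '</oembed>';
$oXML = simplexml_load_string($cXML);

// Cast each SimpleXMLElement field to string before storing or echoing
$cTitle     = (string) $oXML->title;
$cThumbnail = (string) $oXML->thumbnail_url;

echo "$cTitle\n$cThumbnail\n";
```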

Now if you’re building a datafeed site and want a leg up on your competition, you can scrape the merchant’s site* for that extra content.

*I know there are going to be a lot of people (particularly non-affiliates) who think scraping is bad. So I have some ground rules when doing this:

  1. Ask for a copy of the merchant’s database. You’re going to be generating sales for them, so they’ll probably give it to you if you just ask.
  2. If they won’t, or can’t, scrape slowly. Don’t hammer their site by requesting multiple pages per second. Add some sleep() calls in there. Even better: scrape overnight, not during the busy shopping hours.
  3. Don’t use your affiliate link as the starting URL! This will drive up clicks, with 0 conversions, and kill the program’s EPC.
  4. Cache results for as long as you can. Caching not only takes the load off of the merchant’s site, it keeps your own site fast as well.
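Rules #2 and #4 can be sketched in a few lines: cache each fetched page to disk so a URL is only requested once per day, and pause between live requests. The function name, cache location, and 2-second pause are arbitrary choices; in a real script you’d swap the file_get_contents() call for the cURL routine above.

```php
<?php
// Hypothetical helper: throttled, cached fetch (rules #2 and #4)
function fetchCached($cURL, $nCacheSeconds = 86400) {
	$cCacheFile = sys_get_temp_dir() . '/scrape_' . md5($cURL) . '.html';

	// Serve from cache if we fetched this URL recently
	if (file_exists($cCacheFile) && (time() - filemtime($cCacheFile)) < $nCacheSeconds) {
		return file_get_contents($cCacheFile);
	}

	$cHTML = file_get_contents($cURL); // or the cURL fetch from earlier
	file_put_contents($cCacheFile, $cHTML);

	sleep(2); // be nice: pause before the next live request
	return $cHTML;
}
```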

10 Comments » for Scraping Websites using PHP’s Document Object Model
  1. John Ward says:

    XPATH is the way to go. Saved me so much time since I discovered it.

  2. Dirkjan says:

    Hi Eric,

    About point 3, “Don’t use your affiliate link as the starting URL”: how are you getting the URL, since most datafeeds only give the affiliate link?

    • Eric Nagel says:

      Hi Dirkjan –

      That was the most difficult part. I used the site’s search engine and popped in the Item Number, then took the first (and only) result to get the product’s URL.

      Sometimes you can guess what the URL is from the item ID. Be creative on this part!

  3. CtrTard says:

    Great post. I just discovered Xpath recently. It’s a great tool to utilize for scraping.

    One thing I suggest, instead of Web Developer Tools, use the XPather add-on for Firefox. https://addons.mozilla.org/af/firefox/addon/xpather/ This lets you easily get the true XPath and even has a point-and-click mode where you just click on the page element you want an XPath for.

  4. Matt Pardo says:

    Eric, great stuff. FYI, you may want to look at Perl or Expect when you get a chance. Writing bots is very easy with these two languages especially expect although I still prefer Perl.

  5. Shane says:

    Dude, you are the man. This is exactly what I needed. I will second the vote for Xpather, too. I had to use it in conjunction with the Web Developer toolbar so that I could pinpoint exactly which element I wanted, but it saved me so much time translating into xpath syntax.

    A problem I ran into was that textContent and nodeValue both return the contents with the HTML stripped out, though. To get the full contents of the element, this is what I’m doing:


    $element = $xPath->evaluate("...")->item(0);
    $elementcontent = $dom->saveXML($element);

  6. Jerry Lee says:

    So, if I am not getting youtube stuff, but just links from another site, how do I echo those results out?

  7. Appartamenti Barcellona says:

    Well, great article. I’m figuring out how to make it work in my case, but it is absolutely a good base to start from.
    Thanks Eric

  8. Carlos Fabuel Cava says:

    Good work. It is good practice to make this work with only PHP functions.
