How to Scrape Meta Tags From Any Web Page With PHP

A couple of weeks ago I started work on a new project, a directory-like app. One of the things this application had to do was let users add their websites to their profiles, and so the first question came up: “How am I going to get their website’s title and meta tags?”

At first I thought I would have to mess with regular expressions, which I know only a little about, but PHP has surprised me yet again.

Extracting the Meta Tags

PHP comes with a neat function appropriately called get_meta_tags. It takes the URL (or local file path) you want to read the tags from and returns an associative array keyed by each meta tag's name attribute, or false if the page cannot be read.

array get_meta_tags ( string $filename [, bool $use_include_path = false ] )

Check out the response I get when I use http://webhole.net/ as the URL.

array(4) {
  ["google-site-verification"]=>
  string(43) "57M6wvhmlaOIdGR6hOSFSs9bWEi4RDEhZSnOwwTAy00"
  ["msvalidate_01"]=>
  string(32) "15918B50B901900FE4E7CA6A559AD241"
  ["description"]=>
  string(139) "Web development and design site covering the latest techniques and tools to help you get developing ASAP. We help developers of all levels."
  ["keywords"]=>
  string(92) "web development, web news, all about the web, linux tips, python, java, php, css, javascript"
}

Not only did I get the SEO meta tags I wanted, description and keywords, but I also got all other types of meta tags.
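
If you want to try this without touching the network, here is a small sketch. The sample HTML and the temporary file are my own assumptions for illustration; get_meta_tags itself is used exactly as above. It also shows how the keys are normalized: names are lowercased and special characters such as a dot become an underscore, which is why a tag named msvalidate.01 appears as msvalidate_01.

```php
<?php
// Hypothetical sample page, written to a temporary file so no network access is needed.
$html = <<<HTML
<html>
<head>
<title>Sample</title>
<meta name="Description" content="A sample page">
<meta name="msvalidate.01" content="ABC123">
<meta name="keywords" content="php, meta">
</head>
<body></body>
</html>
HTML;

$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

// get_meta_tags() accepts a local path as well as a URL.
$tags = get_meta_tags($file);
unlink($file);

// Keys are lowercased, and special characters such as "." become "_".
var_dump($tags);
```

Only meta tags that carry a name attribute end up in the array, and parsing stops at the closing head tag.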

Getting the Title Tag

This one is a little harder, but not crazy hard, and you might learn a function or two in the process.

I am going to need three functions to get the title: strpos, substr and file_get_contents.

int strpos ( string $haystack , mixed $needle [, int $offset = 0 ] )

string substr ( string $string , int $start [, int $length ] )

I will use the function strpos to find out where the title begins, ends and how long it is.

To find where the title begins I have to add seven, the length of the string <title>, to the position of the <title> tag.

$page=file_get_contents($url);
$titleStart=strpos($page,'<title>')+7;

The title’s length is the position of </title> minus the position where the title begins.

$titleLength=strpos($page,'</title>')-$titleStart;

Now that I know where the title begins and how long it is, I can use substr to get it.

$title=substr($page,$titleStart,$titleLength);
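
To make the arithmetic concrete, here is the same sequence run against a short string instead of a downloaded page. The sample markup is an assumption of mine, chosen so the offsets are easy to count by hand.

```php
<?php
// Hypothetical page source, used in place of file_get_contents($url).
$page = '<html><head><title>Example Page</title></head></html>';

// '<title>' starts at offset 12; adding its 7 characters lands on the content.
$titleStart = strpos($page, '<title>') + 7;             // 19

// '</title>' starts at offset 31, so the content is 31 - 19 = 12 characters long.
$titleLength = strpos($page, '</title>') - $titleStart; // 12

$title = substr($page, $titleStart, $titleLength);
echo $title; // Example Page
```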

Let me put all this together into a nice function.

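function getMetaData($url){
	// get meta tags; get_meta_tags returns false if the page cannot be read
	$meta=get_meta_tags($url);
	if($meta===false){
		$meta=array();
	}
	// store page
	$page=file_get_contents($url);
	if($page===false){
		return $meta;
	}
	// find the opening and closing title tags
	$tagStart=strpos($page,'<title>');
	$tagEnd=strpos($page,'</title>');
	// strpos returns false when a tag is missing, so only extract when both exist
	if($tagStart!==false && $tagEnd!==false){
		// find where the title CONTENT begins
		$titleStart=$tagStart+7;
		// find how long the title is
		$titleLength=$tagEnd-$titleStart;
		// extract title from $page
		$meta['title']=substr($page,$titleStart,$titleLength);
	}
	// return array of data
	return $meta;
}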

$tags=getMetaData('http://www.yahoo.com/');

echo 'Title: '.$tags['title'];
echo '<br />';
echo 'Description: '.$tags['description'];
echo '<br />';
echo 'Keywords: '.$tags['keywords'];

The result should look like this.

Title: Yahoo!
Description: Welcome to Yahoo!, the world’s most visited home page. Quickly find what you’re searching for, get in touch with friends and stay in-the-know with the latest news and information.
Keywords: yahoo, yahoo home page, yahoo homepage, yahoo search, yahoo mail, yahoo messenger, yahoo games, news, finance, sport, entertainment
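
One caveat: searching for the exact string <title> misses pages that write the tag in uppercase or give it attributes. Below is a rough sketch of a more forgiving variant that still sticks to the string functions. The helper name extractTitle is hypothetical, it assumes PHP 7.1 or later for the nullable return type, and it is not a full HTML parser, so treat it as a starting point only.

```php
<?php
// Hypothetical helper, not part of the original function: returns the title text,
// or null when the page has no complete <title> element.
function extractTitle(string $page): ?string
{
    // stripos() is case-insensitive, so <TITLE> matches as well.
    $open = stripos($page, '<title');
    if ($open === false) {
        return null;
    }
    // Skip to the end of the opening tag, which may carry attributes.
    $contentStart = strpos($page, '>', $open);
    if ($contentStart === false) {
        return null;
    }
    $contentStart++;
    // Look for the closing tag only after the opening one.
    $close = stripos($page, '</title>', $contentStart);
    if ($close === false) {
        return null;
    }
    // Trim the whitespace that often surrounds the title text.
    return trim(substr($page, $contentStart, $close - $contentStart));
}

var_dump(extractTitle('<head><TITLE lang="en"> Yahoo! </TITLE></head>'));
var_dump(extractTitle('<head></head>'));
```

Returning null instead of an empty string lets the caller tell "no title tag" apart from "empty title".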

Happy scraping :)
