Simple PHP code for Scraping Google – 2012

Google recently changed the HTML output of its SERP pages, so Google scrapers that worked until a couple of days ago no longer work. This is my first version of a SERP scraper that works (at this point) with the new HTML structure of the SERP pages. It is a decent solution if you need to monitor your Google placement for given keywords, along with the placement of your competitors. With a little bit of work, the results can be stored in a database on a daily basis, and with simple reporting you can have your own ranking monitor, one of the most important SEO tools.
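As a rough illustration of the daily storage idea, here is a minimal sketch that records one position per day and reads the history back. It assumes PHP 7.1 or later with the pdo_sqlite extension; the table name rankings and the helpers save_rank() and rank_history() are hypothetical names chosen for this example, and the dates and positions are made up.

```php
<?php
// Hypothetical helper: create the table on first use and record one check.
function save_rank(PDO $db, string $date, string $url, string $keywords, int $position): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS rankings (
            checked_on TEXT NOT NULL,
            url        TEXT NOT NULL,
            keywords   TEXT NOT NULL,
            position   INTEGER NOT NULL,
            PRIMARY KEY (checked_on, url, keywords)
        )'
    );
    // INSERT OR REPLACE (SQLite syntax) keeps one row per day, URL and keyword set.
    $stmt = $db->prepare(
        'INSERT OR REPLACE INTO rankings (checked_on, url, keywords, position)
         VALUES (:checked_on, :url, :keywords, :position)'
    );
    $stmt->execute([
        ':checked_on' => $date,
        ':url'        => $url,
        ':keywords'   => $keywords,
        ':position'   => $position,
    ]);
}

// Hypothetical helper: return [date => position] for a simple report.
function rank_history(PDO $db, string $url, string $keywords): array
{
    $stmt = $db->prepare(
        'SELECT checked_on, position FROM rankings
         WHERE url = :url AND keywords = :keywords
         ORDER BY checked_on'
    );
    $stmt->execute([':url' => $url, ':keywords' => $keywords]);
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}

// In-memory database and made-up positions, for illustration only.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
save_rank($db, '2012-03-01', 'aromawebdesign.com', 'vancouver web design', 12);
save_rank($db, '2012-03-02', 'aromawebdesign.com', 'vancouver web design', 9);

foreach (rank_history($db, 'aromawebdesign.com', 'vancouver web design') as $day => $position) {
    echo "$day: position $position\n";
}
```

In a real setup you would point the PDO connection at a file or a database server and call save_rank() once a day (for example from cron) with the position the scraper found.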
Note: Google doesn’t like scrapers. When it detects you, you will be penalized: at first temporarily, with your IP blocked for a short period of time, and if you continue to scrape, your IP will be blocked for a longer period of time or permanently.
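One way to act on this warning is to stop as soon as a request fails and wait longer before each retry. The sketch below is only an assumption about how that might look: backoff_delay() and fetch_or_null() are hypothetical helpers, the 30-second base and one-hour cap are arbitrary values, and treating any status other than 200 as a sign of trouble is a simplification.

```php
<?php
// Hypothetical helper: seconds to wait before retry number $attempt (1-based).
// The delay doubles on every attempt and is capped at $max.
function backoff_delay(int $attempt, int $base = 30, int $max = 3600): int
{
    $delay = $base * (2 ** ($attempt - 1));
    return (int) min($delay, $max);
}

// Hypothetical helper: fetch a page, returning null unless the status is 200.
function fetch_or_null(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    unset($ch); // release the handle
    return ($body !== false && $status === 200) ? $body : null;
}

// Delays for the first four retries with the default settings.
foreach ([1, 2, 3, 4] as $attempt) {
    echo "retry $attempt: wait " . backoff_delay($attempt) . " seconds\n";
}
```

A caller would sleep() for backoff_delay($attempt) seconds whenever fetch_or_null() returns null, and give up after a few attempts rather than keep requesting pages.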

<?php
// Read the parameters from the query string, falling back to defaults.
$url      = !empty($_GET['url'])      ? $_GET['url']         : "aromawebdesign.com";
$keywords = !empty($_GET['keywords']) ? $_GET['keywords']    : "vancouver web design";
$limit    = !empty($_GET['limit'])    ? (int) $_GET['limit'] : 200;
google_rank($url, $keywords, $limit);
// Fetch a URL with cURL and return the response body as a string.
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    $string = curl_exec($ch);
    curl_close($ch);
    return $string;
}
// Fetch up to $limit Google results for $keywords and report where $url ranks.
function google_rank($url, $keywords, $limit)
{
    $keywstring = urlencode($keywords); // spaces become "+"
    $pages = ceil($limit / 100);        // 100 results are requested per page (num=100)
    $content = "";
    for ($n = 1; $n <= $pages; $n++)
    {
        $start = ($n - 1) * 100; // offset of the first result on this page
        $gurl = "http://www.google.ca/search?as_q=$keywstring&hl=en&client=firefox-a&channel=s&rls=org.mozilla%3Aen-US%3Aofficial&num=100&start=$start&btnG=Google+Search&as_epq=&as_oq=&as_eq=&lr=&as_ft=i&as_filetype=&as_qdr=all&as_occt=any&as_dt=i&as_sitesearch=&as_rights=&safe=images";
        $content .= get_content($gurl);
    }
    // Each result URL is wrapped in a <cite> element.
    preg_match_all("/<cite>(.*?)<\/cite>/", $content, $res);
    $replace = array("<strong>", "</strong>", "www.");
    $n = 0;
    $pos = 0;
    $res[1] = array_slice($res[1], 3); // skip the first three matches, treated here as Google ads
    $out = ""; // numbered list of results, built up as a string
    foreach ($res[1] as $result)
    {
        $n++;
        $result = str_replace($replace, "", $result);
        $result = rtrim($result, "/"); // drop a trailing slash
        if ($result == $url)
        {
            $result = "<strong>" . $result . "</strong>";
            $pos = $n;
        }
        $out .= $n . ' ' . $result . "<br />\n";
        if ($n == $limit) break;
    }
    // Escape the user-supplied values before echoing them back as HTML.
    $safe_url = htmlspecialchars($url);
    $safe_keywords = htmlspecialchars($keywords);
    $safe_limit = (int) $limit;
    echo "I found <strong>$safe_url</strong> at position <strong>$pos</strong> for keyword(s): <strong>$safe_keywords</strong> in the first <strong>$safe_limit</strong> results:<br />\n";
    echo $out;
}
?>
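To check the matching logic without sending any requests to Google, the extraction and comparison steps can be run against a sample string. This is a sketch under assumptions: normalize_result() and find_position() are hypothetical helpers, the sample markup is invented and far simpler than a real SERP page, and unlike the script above it does not skip the first three matches.

```php
<?php
// Hypothetical helper: reduce a <cite> value to a bare host/path for comparison.
function normalize_result(string $cite): string
{
    $host = strip_tags($cite);                        // drop highlighting tags
    $host = preg_replace('#^https?://#i', '', $host); // drop the scheme
    $host = preg_replace('#^www\.#i', '', $host);     // drop a leading "www."
    return rtrim(trim($host), '/');                   // drop trailing slashes
}

// Hypothetical helper: 1-based position of $url among the <cite> values, or 0.
function find_position(string $html, string $url): int
{
    preg_match_all('#<cite[^>]*>(.*?)</cite>#is', $html, $matches);
    foreach ($matches[1] as $i => $cite) {
        if (normalize_result($cite) === $url) {
            return $i + 1;
        }
    }
    return 0;
}

// Invented sample markup, for illustration only.
$sample = '<li><cite>www.example.org/</cite></li>'
        . '<li><cite><b>aromawebdesign</b>.com/</cite></li>'
        . '<li><cite>https://example.net/page</cite></li>';

echo find_position($sample, 'aromawebdesign.com'), "\n"; // prints 2
```

Keeping the parsing in a function that takes the HTML as an argument also makes it easier to adjust the pattern the next time the SERP markup changes.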

About the Author:
This article was written by Dobrisa Curkovic, Senior Web Developer at Aroma Web Design, a web design company from Vancouver, BC.
