How to: Scrape search engines without pissing them off

Written by Ian Lurie Wednesday, 28 September 2011 12:43

You can learn a lot about a search engine by scraping its results. It’s the only easy way you can get an hourly or daily record of exactly what Google, Bing or Yahoo! (you know, back when Yahoo! was a search engine company) show their users. It’s also the easiest way to track your keyword rankings.

SERP Scraping

Like it or not, whether you use a third-party tool or your own, if you practice SEO then you’re scraping search results.

If you follow a few simple rules, it’s a lot easier than you think.

The problem with scraping

Automated scraping — grabbing search results using your own ‘bot’ — violates every search engine’s terms of service. Search engines sniff out and block major scrapers.

If you ever perform a series of searches that match the behavior of a SERP crawler, Google and Bing will interrupt your search with a captcha page. You have to enter the captcha or perform whatever test the page requires before performing another query.

That (supposedly) blocks bots and other scripts from automatically scraping lots of pages at once.

The reason? Resources. A single automated SERP scraper can perform tens, hundreds or even thousands of queries per second. The only limitations are bandwidth and processing power. Google doesn’t want to waste server cycles on a bunch of sweaty-palmed search geeks’ Python scripts. So, they block almost anything that looks like an automatic query.

Your job, if you ever did anything like this, which you wouldn’t, is to buy or create software that does not look like an automatic query. Here are a few tricks my friend told me:

 

Stay on the right side of the equation

Note that I said “almost” anything. The search engines aren’t naive. Google knows every SEO scrapes their results. So does Bing. Both engines have to decide when to block a scraper. Testing suggests that the equation behind the block/don’t-block decision balances:

  • Potential server load created by the scraper.
  • The potential load created by blocking the scraper.
  • The query ‘space’ for the search phrase.

At a minimum, any SERP bot must have the potential of tying up server resources. If it doesn’t, the search engine won’t waste the CPU cycles. It’s not worth the effort required to block the bot.

So, if you’re scraping the SERPs, you need to stay on the right side of that equation: Be so unobtrusive that, even if you’re detected, you’re not worth squashing.

Disclaimer

Understand, now, that everything I talked about in this article is totally hypothetical. I certainly don’t scrape Google. That would violate their terms of service.

And I’m sure companies like AuthorityLabs and SERPBuddy have worked out special agreements for hourly scraping of the major search engines. But I have this… friend… who’s been experimenting a bit, testing what’s allowed and what’s not, see…

Cough.

How I did my test

I tested all these theories with three Python scripts. All of them:

  1. Perform a Google search.
  2. Download the first page of results.
  3. Download the next 4 pages of results.
  4. Save all 5 pages for parsing.

Script #1 had no shame. It hit Google as fast as possible and didn’t attempt to behave like a ‘normal’ web browser.

Script #2 was a little embarrassed. It pretended to be Mozilla Firefox and only queried Google once every 30 seconds.

Script #3 was downright bashful. It selected a random user agent from a list of 10, and paused between queries for anywhere between 15 and 60 seconds.
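For the curious, here’s a minimal sketch of what that third, bashful script might look like. It’s purely hypothetical: the user-agent strings and example keyword are made up, and I’m assuming the requests library; only Google’s standard q and start query parameters reflect how its result pages are actually paged.

    import random
    import time

    import requests  # assumed HTTP library; urllib.request would also work

    # Hypothetical pool of typical browser user agents (script #3 used 10).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    ]

    def fetch_serps(keyword, pages=5):
        """Pull the first `pages` result pages for one keyword, slowly and randomly."""
        html_pages = []
        for page in range(pages):
            headers = {"User-Agent": random.choice(USER_AGENTS)}  # pick a random 'browser'
            params = {"q": keyword, "start": page * 10}           # Google pages results by 10
            resp = requests.get("https://www.google.com/search",
                                params=params, headers=headers, timeout=30)
            html_pages.append(resp.text)
            time.sleep(random.uniform(15, 60))  # pause 15-60 seconds between queries
        return html_pages

    if __name__ == "__main__":
        keyword = "seo reporting software"  # hypothetical example keyword
        for i, html in enumerate(fetch_serps(keyword), start=1):
            # Save each raw page so it can be parsed later.
            with open(f"{keyword.replace(' ', '_')}_page{i}.html", "w", encoding="utf-8") as f:
                f.write(html)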

The results

Script #3 did the best. That’s hardly a surprise. But the difference is:

  • Script #1 was blocked within 3 searches.
  • Script #2 was blocked within 10 searches.
  • Script #3 was never blocked, and performed 150 searches. That means it pulled 5 pages of ranking data for 150 different keywords.

There’s no way any of these scripts fooled Google. The search engine had to know that scripts 1, 2 and 3 were all scrapers. But it only blocked 1 and 2.

My theory: Script 3 created so small a burden that it wasn’t worth it for Google to block it. Just as important, though, was the fact that script 3 didn’t make itself obvious. Detectable? Absolutely. I didn’t rotate IP addresses or do any other serious concealment. But script 3 behaved like it was, well, embarrassed. And a little contrition goes a long way. If you acknowledge you’re scraping on the square and behave yourself, Google may cut you some slack.

The rules

Based on all of this, here are my guidelines for scraping results:

  1. Scrape slowly. Don’t pound the crap out of Google or Bing. Make your script pause for at least 20 seconds between queries.
  2. Scrape randomly. Randomize the amount of time between queries.
  3. Be a browser. Have a list of typical user agents (browsers). Choose one of those randomly for each query.
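Boiled down into code, those three rules might look something like the helper below. Again, this is just a sketch under assumptions: the polite_get name, the user-agent list and the 20-45 second window are mine, and I’m assuming the requests library.

    import random
    import time

    import requests  # assumed HTTP library

    # Hypothetical list of typical browser user agents (rule 3).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    ]

    _last_query = 0.0  # timestamp of the previous query

    def polite_get(url, **params):
        """Fetch a SERP URL while obeying all three rules."""
        global _last_query
        # Rules 1 and 2: wait a random amount of time, never less than
        # 20 seconds since the previous query.
        remaining = random.uniform(20, 45) - (time.time() - _last_query)
        if remaining > 0:
            time.sleep(remaining)
        _last_query = time.time()
        # Rule 3: present a randomly chosen browser user agent on every query.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        return requests.get(url, params=params, headers=headers, timeout=30)

    # Example: polite_get("https://www.bing.com/search", q="seo reporting software")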

Follow all three of these and you’re a well-behaved scraper. Even if Google and Bing figure out what you’re up to, they leave you alone: You’re not a burglar. You’re scouring the gutter for loose change. And they’re OK with that.