How to: Scrape search engines without pissing them off
You can learn a lot about a search engine by scraping its results. It’s the only easy way to get an hourly or daily record of exactly what Google, Bing or Yahoo! (you know, back when Yahoo! was a search engine company) show their users. It’s also the easiest way to track your keyword rankings.
Like it or not, whether you use a third-party tool or your own, if you practice SEO then you’re scraping search results.
If you follow a few simple rules, it’s a lot easier than you think.
The problem with scraping
Automated scraping — grabbing search results using your own ‘bot’ — violates every search engine’s terms of service. Search engines sniff out and block major scrapers.
If you ever perform a series of searches that match the behavior of a SERP crawler, Google and Bing will interrupt your search with a captcha page. You have to enter the captcha or pass whatever test the page requires before performing another query.
That (supposedly) blocks bots and other scripts from automatically scraping lots of pages at once.
The reason? Resources. A single automated SERP scraper can perform tens, hundreds or even thousands of queries per second. The only limitations are bandwidth and processing power. Google doesn’t want to waste server cycles on a bunch of sweaty-palmed search geeks’ Python scripts. So, they block almost anything that looks like an automatic query.
Your job, if you ever did anything like this, which you wouldn’t, is to buy or create software that does not look like an automatic query. Here are a few tricks my friend told me:
Stay on the right side of the equation
Note that I said “almost” anything. The search engines aren’t naive. Google knows every SEO scrapes their results. So does Bing. Both engines have to decide when to block a scraper, and testing suggests the block/don’t-block decision balances:
- Potential server load created by the scraper.
- The potential load created by blocking the scraper.
- The query ‘space’ for the search phrase.
At a minimum, a SERP bot must have the potential to tie up server resources. If it doesn’t, the search engine won’t waste the CPU cycles; it’s not worth the effort required to block the bot.
So, if you’re scraping the SERPs, you need to stay on the right side of that equation: be so unobtrusive that, even if you’re detected, you’re not worth squashing.
Disclaimer
Understand, now, that everything I talked about in this article is totally hypothetical. I certainly don’t scrape Google. That would violate their terms of service.
And I’m sure companies like AuthorityLabs and SERPBuddy have worked out special agreements for hourly scraping of the major search engines. But I have this… friend… who’s been experimenting a bit, testing what’s allowed and what’s not, see…
Cough.
How I did my test
I tested all these theories with three Python scripts. All of them:
- Perform a Google search.
- Download the first page of results.
- Download the next 4 pages.
- Save the pages for parsing.
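The fetch loop behind all three scripts can be sketched with a small helper that builds each page’s URL. Google’s `q` and `start` parameters are real, but the exact URL layout my scripts used is an assumption here:

```python
import urllib.parse

def results_url(query, page, per_page=10):
    """Build the URL for one page of Google results.

    Google's `start` parameter is an offset into the result list,
    so page 0 starts at result 0, page 1 at result 10, and so on.
    """
    params = {"q": query, "start": page * per_page}
    return "https://www.google.com/search?" + urllib.parse.urlencode(params)

def page_urls(query, pages=5):
    """URLs for the first page plus the next four, as in the test scripts."""
    return [results_url(query, p) for p in range(pages)]
```

Each script then fetched these five URLs and wrote the raw HTML to disk for parsing; only the pacing and the headers differed from script to script.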
Script #1 had no shame. It hit Google as fast as possible and didn’t attempt to behave like a ‘normal’ web browser.
Script #2 was a little embarrassed. It pretended to be Mozilla Firefox and only queried Google once every 30 seconds.
Script #3 was downright bashful. It selected a random user agent from a list of 10, and paused between queries for anywhere between 15 and 60 seconds.
The results
Script #3 did the best. That’s hardly a surprise. But look at the difference:
- Script #1 was blocked within 3 searches.
- Script #2 was blocked within 10 searches.
- Script #3 was never blocked, and performed 150 searches. That means it pulled 5 pages of ranking data for 150 different keywords.
There’s no way any of these scripts fooled Google. The search engine had to know that scripts 1, 2 and 3 were all scrapers. But it only blocked 1 and 2.
My theory: Script 3 created so small a burden that it wasn’t worth it for Google to block it. Just as important, though, was the fact that script 3 didn’t make itself obvious. Detectable? Absolutely. I didn’t rotate IP addresses or do any other serious concealment. But script 3 behaved like it was, well, embarrassed. And a little contrition goes a long way. If you acknowledge you’re scraping on the square and behave yourself, Google may cut you some slack.
The rules
Based on all of this, here are my guidelines for scraping results:
- Scrape slowly. Don’t pound the crap out of Google or Bing. Make your script pause for at least 20 seconds between queries.
- Scrape randomly. Randomize the amount of time between queries.
- Be a browser. Have a list of typical user agents (browsers).Choose one of those randomly for each query.
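Put together, a minimal sketch of a well-behaved scraper in Python might look like the following. The user-agent strings and timing constants are illustrative assumptions, not a tested configuration:

```python
import random
import time
import urllib.request

# Illustrative pool of desktop browser user-agent strings; in practice,
# copy current values from real browsers and refresh the list occasionally.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def pick_user_agent():
    """Rule 3 (be a browser): choose a typical user agent at random per query."""
    return random.choice(USER_AGENTS)

def pick_delay(low=20.0, high=60.0):
    """Rules 1 and 2 (scrape slowly, scrape randomly): pause at least
    `low` seconds between queries, randomized up to `high`."""
    return random.uniform(low, high)

def polite_fetch(url):
    """Sleep a random interval, then fetch `url` presenting a browser user agent."""
    time.sleep(pick_delay())
    req = urllib.request.Request(url, headers={"User-Agent": pick_user_agent()})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Calling `polite_fetch` once per results page keeps the query rate slow, irregular and browser-like, which is the whole point of the three rules.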
Follow all three of these and you’re a well-behaved scraper. Even if Google and Bing figure out what you’re up to, they’ll leave you alone: you’re not a burglar. You’re scouring the gutter for loose change. And they’re OK with that.