Using builtwith with Python 3, and fixing the problems that come up
Source: Internet  Published: centos broadcom  Editor: 程序博客网  Time: 2024/05/21 11:16
- First, install builtwith with pip install builtwith:

    C:\Users\Administrator>pip install builtwith
    Collecting builtwith
      Downloading builtwith-1.3.2.tar.gz
    Installing collected packages: builtwith
      Running setup.py install for builtwith ... done
    Successfully installed builtwith-1.3.2
- Create a new project in PyCharm and enter the following test code:

    import builtwith
    tech_used = builtwith.parse('http://www.baidu.com')
    print(tech_used)
Running it produces the following error:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 1, in <module>
        import builtwith
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43
        except Exception, e:
                        ^
    SyntaxError: invalid syntax

    Process finished with exit code 1
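The cause is that this builtwith release was written for Python 2.x, so a few places need changing. Double-click the file named in PyCharm's error output to open it, then make these three kinds of changes:
1. The Python 2 form "except Exception, e" is no longer supported; change it to "except Exception as e".
2. In Python 3 print is a function, so whatever follows print must be wrapped in parentheses (adding the parentheses is enough; nothing else needs to change).
3. builtwith uses Python 2's urllib2 package, which does not exist in Python 3, so the urllib2-related code has to be rewritten.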
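The first two changes can be sketched as follows. This is a minimal illustration rather than code from builtwith: the names `risky` and `run` are made up for the example, and the Python 2 forms appear only in comments.

```python
def risky():
    """Hypothetical helper that always fails, standing in for a network call."""
    raise ValueError('boom')


def run():
    try:
        risky()
    # Python 2 wrote this as:  except Exception, e:
    except Exception as e:
        # Python 2 wrote this as:  print 'Error:', e
        message = 'Error: %s' % e
        print(message)
        return message
    return None
```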
Items 1 and 2 are easy to fix. The changes below mainly deal with item 3:
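First replace import urllib2 with the two imports below, and change the calls that use it to match (urllib2.Request becomes urllib.request.Request, and urllib2.urlopen becomes urllib.request.urlopen):

    import urllib.request
    import urllib.error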
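As a sketch of what the urllib2 replacement looks like in practice, the helpers below build and send the same kind of request that builtwith sends. The names `make_request` and `fetch` and the `head_only` parameter are hypothetical, introduced only for this example; `urllib.request.Request` and `urllib.request.urlopen` are the standard-library calls. Nothing touches the network until `fetch` is called.

```python
import urllib.request
import urllib.error


def make_request(url, user_agent='builtwith', head_only=False):
    """Build a request object; hypothetical helper, not part of builtwith."""
    # Python 2: urllib2.Request(url, None, {'User-Agent': user_agent})
    request = urllib.request.Request(url, None, {'User-Agent': user_agent})
    if head_only:
        # Same trick the library uses to ask for headers only.
        request.get_method = lambda: 'HEAD'
    return request


def fetch(url):
    """Return the raw response body as bytes, or None on a URL error."""
    try:
        # Python 2: urllib2.urlopen(request)
        with urllib.request.urlopen(make_request(url)) as response:
            return response.read()
    except urllib.error.URLError as e:
        print('Error:', e)
        return None
```

Note that `fetch` returns bytes, not str, which matters for the next step.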
Run the project again and you hit the following error:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 3, in <module>
        builtwith.parse('http://www.baidu.com')
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62, in builtwith
        if contains(html, snippet):
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105, in contains
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1
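This is because in Python 3 response.read() returns bytes rather than str, so the data has to be decoded before the library's regular expressions (which are str patterns) can search it. Change this code:

    if html is None:
        html = response.read()

to:

    if html is None:
        html = response.read()
        html = html.decode('utf-8')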
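The TypeError is easy to reproduce without any network access. In this small sketch the byte string stands in for what `response.read()` returns; the sample markup is made up.

```python
import re

raw = b'<script src="jquery.min.js"></script>'  # made-up sample, as bytes
pattern = re.compile('jquery', flags=re.IGNORECASE)  # a str pattern

try:
    pattern.search(raw)
    mismatch = False
except TypeError:
    # A str pattern cannot be used to search a bytes object.
    mismatch = True

# After decoding, the same pattern works.
text = raw.decode('utf-8')
match = pattern.search(text)
```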
However, if you switch the site to 'www.163.com', it fails again:

    C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
    Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte
    Traceback (most recent call last):
      File "F:/python/first/FirstPy", line 2, in <module>
        tech_used = builtwith.parse('http://www.163.com')
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63, in builtwith
        if contains(html, snippet):
      File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106, in contains
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1
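This output contains two errors. The first line is the decode failure, printed by the except block inside builtwith. Because that exception is caught, html stays as bytes and the earlier TypeError follows. A minimal sketch of the same sequence, using GBK-encoded sample text in place of the real page (an assumption: the post does not state which encoding the site serves):

```python
raw = '网易新闻'.encode('gbk')   # stand-in for response.read()

html = raw
try:
    html = html.decode('utf-8')   # raises UnicodeDecodeError for these bytes
except Exception as e:
    print('Error:', e)            # the error is reported, then execution continues

# The assignment inside the try block never completed, so html is unchanged.
still_bytes = isinstance(html, bytes)

# Decoding with the matching codec succeeds.
decoded = raw.decode('gbk')
```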
So do different sites need different decodings? Here is one way to detect the encoding a site uses.
We need to install a package called chardet:

    C:\Users\Administrator>pip install chardet
    Collecting chardet
      Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
        100% |████████████████████████████████| 184kB 616kB/s
    Installing collected packages: chardet
    Successfully installed chardet-2.3.0

    C:\Users\Administrator>
Change the corresponding code in builtwith as follows, and remember to import chardet:

    encode_type = chardet.detect(html)
    if encode_type['encoding'] == 'utf-8':
        html = html.decode('utf-8')
    else:
        html = html.decode('gbk')
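The if/else above treats every page that is not detected as utf-8 as gbk. A slightly more general variant, sketched below, decodes with whatever encoding was detected and falls back through a short list when that fails. The function name `decode_html` and the fallback order are assumptions made for this example, not part of builtwith. The function takes the dictionary returned by `chardet.detect(html)` as an argument, so the sketch itself needs only the standard library.

```python
def decode_html(raw, detection=None, fallbacks=('utf-8', 'gb18030')):
    """Decode page bytes to str.

    detection is the dict returned by chardet.detect(raw), or None.
    Hypothetical helper for illustration; not part of builtwith.
    """
    candidates = []
    if detection and detection.get('encoding'):
        candidates.append(detection['encoding'])
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep going with replacement characters instead of failing.
    return raw.decode('utf-8', errors='replace')
```

Inside builtwith this could be called as `html = decode_html(html, chardet.detect(html))`. gb18030 is chosen as the fallback because it is a superset of gbk.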
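Then came the biggest problem: however I typed this last piece of code, it was reported as an indentation error. After an hour of trying things, I finally found the cause. Pressing Enter straight after typing "if xxxx:" produced indentation and spacing that Python rejected. I had to delete the automatically inserted indentation by hand and align the line with typed spaces.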
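One plausible cause, though the post does not confirm it, is that the editor's auto-indent inserted tab characters into a file that is otherwise indented with spaces. Python 3 rejects indentation that mixes the two inconsistently. Under that assumption, a small check like the following can point at the offending lines. The function name `lines_with_tabs` is made up for this sketch.

```python
def lines_with_tabs(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            flagged.append(number)
    return flagged


# The last line is indented with a tab instead of four spaces.
sample = (
    "if encoding == 'utf-8':\n"
    "    html = html.decode('utf-8')\n"
    "else:\n"
    "\thtml = html.decode('gbk')\n"
)
flagged = lines_with_tabs(sample)
```

The standard library's tabnanny module (python -m tabnanny followed by the file name) performs a more thorough check of the same kind.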
Finally, for the lazy, here is the fully modified __init__.py:
    import sys
    import os
    import re
    import json
    import urllib.request
    import urllib.error
    import chardet


    def builtwith(url, headers=None, html=None, user_agent='builtwith'):
        """Detect the technology used to build a website

        >>> builtwith('http://wordpress.com')
        {u'blogs': [u'PHP', u'WordPress'], u'font-scripts': [u'Google Font API'], u'web-servers': [u'Nginx'], u'javascript-frameworks': [u'Modernizr'], u'programming-languages': [u'PHP'], u'cms': [u'WordPress']}
        >>> builtwith('http://webscraping.com')
        {u'javascript-frameworks': [u'jQuery', u'Modernizr'], u'web-frameworks': [u'Twitter Bootstrap'], u'web-servers': [u'Nginx']}
        >>> builtwith('http://microsoft.com')
        {u'javascript-frameworks': [u'jQuery'], u'mobile-frameworks': [u'jQuery Mobile'], u'operating-systems': [u'Windows Server'], u'web-servers': [u'IIS']}
        >>> builtwith('http://jquery.com')
        {u'cdn': [u'CloudFlare'], u'web-servers': [u'Nginx'], u'javascript-frameworks': [u'jQuery', u'Modernizr'], u'programming-languages': [u'PHP'], u'cms': [u'WordPress'], u'blogs': [u'PHP', u'WordPress']}
        >>> builtwith('http://joomla.org')
        {u'font-scripts': [u'Google Font API'], u'miscellaneous': [u'Gravatar'], u'web-servers': [u'LiteSpeed'], u'javascript-frameworks': [u'jQuery'], u'programming-languages': [u'PHP'], u'web-frameworks': [u'Twitter Bootstrap'], u'cms': [u'Joomla'], u'video-players': [u'YouTube']}
        """
        techs = {}

        # check URL
        for app_name, app_spec in data['apps'].items():
            if 'url' in app_spec:
                if contains(url, app_spec['url']):
                    add_app(techs, app_name, app_spec)

        # download content
        if None in (headers, html):
            try:
                request = urllib.request.Request(url, None, {'User-Agent': user_agent})
                if html:
                    # already have HTML so just need to make HEAD request for headers
                    request.get_method = lambda : 'HEAD'
                response = urllib.request.urlopen(request)
                if headers is None:
                    headers = response.headers
                if html is None:
                    html = response.read()
                    encode_type=chardet.detect(html)
                    if encode_type['encoding']=='utf-8':
                        html=html.decode('utf-8')
                    else:
                        html=html.decode('gbk')
            except Exception as e:
                print ('Error:', e)
                request = None

        # check headers
        if headers:
            for app_name, app_spec in data['apps'].items():
                if 'headers' in app_spec:
                    if contains_dict(headers, app_spec['headers']):
                        add_app(techs, app_name, app_spec)

        # check html
        if html:
            for app_name, app_spec in data['apps'].items():
                for key in 'html', 'script':
                    snippets = app_spec.get(key, [])
                    if not isinstance(snippets, list):
                        snippets = [snippets]
                    for snippet in snippets:
                        if contains(html, snippet):
                            add_app(techs, app_name, app_spec)
                            break

            # check meta
            # XXX add proper meta data parsing
            metas = dict(re.compile('<meta[^>]*?name=[\'"]([^>]*?)[\'"][^>]*?content=[\'"]([^>]*?)[\'"][^>]*?>', re.IGNORECASE).findall(html))
            for app_name, app_spec in data['apps'].items():
                for name, content in app_spec.get('meta', {}).items():
                    if name in metas:
                        if contains(metas[name], content):
                            add_app(techs, app_name, app_spec)
                            break

        return techs
    parse = builtwith


    def add_app(techs, app_name, app_spec):
        """Add this app to technology
        """
        for category in get_categories(app_spec):
            if category not in techs:
                techs[category] = []
            if app_name not in techs[category]:
                techs[category].append(app_name)
                implies = app_spec.get('implies', [])
                if not isinstance(implies, list):
                    implies = [implies]
                for app_name in implies:
                    add_app(techs, app_name, data['apps'][app_name])


    def get_categories(app_spec):
        """Return category names for this app_spec
        """
        return [data['categories'][str(c_id)] for c_id in app_spec['cats']]


    def contains(v, regex):
        """Removes meta data from regex then checks for a regex match
        """
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)


    def contains_dict(d1, d2):
        """Takes 2 dictionaries

        Returns True if d1 contains all items in d2"""
        for k2, v2 in d2.items():
            v1 = d1.get(k2)
            if v1:
                if not contains(v1, v2):
                    return False
            else:
                return False
        return True


    def load_apps(filename='apps.json.py'):
        """Load apps from Wappalyzer JSON (https://github.com/ElbertF/Wappalyzer)
        """
        # get the path of this filename relative to the current script
        # XXX add support to download update
        filename = os.path.join(os.getcwd(), os.path.dirname(__file__), filename)
        return json.load(open(filename))
    data = load_apps()


    if __name__ == '__main__':
        urls = sys.argv[1:]
        if urls:
            for url in urls:
                results = builtwith(url)
                for result in sorted(results.items()):
                    print ('%s: %s' % result)
        else:
            print ('Usage: %s url1 [url2 url3 ...]' % sys.argv[0])
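One detail of the listing worth illustrating is the contains helper. As its docstring says, it removes metadata from the pattern before matching: everything from the first \; separator onward is dropped, and only the part before it is compiled as a regular expression. The sketch below copies that function and runs it on a made-up pattern and snippet; neither is taken from the apps.json.py data file.

```python
import re


def contains(v, regex):
    """Removes meta data from regex then checks for a regex match"""
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)


# Made-up pattern in "regex\;metadata" form; only the part before \; is compiled.
pattern = r'jquery.*\.js\;version:1.0'
html = '<script src="/static/jQuery.min.js"></script>'

hit = contains(html, pattern)                       # case-insensitive match
miss = contains('<p>no scripts here</p>', pattern)  # no match, returns None
```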