CS109 Lecture 7
Source: Internet | Editor: 程序博客网 | Time: 2024/06/06 13:22
Data Scraping
Sources
- From a website
- With an API
Copyright and permissions
- Be careful and polite
- Give credit
- Care about media law
- Don’t be evil
Useful tags
<h1></h1>
<p></p>
<br>
<a href='url'>Link</a>
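To illustrate how these tags look to a program, here is a minimal sketch using Python's standard-library `html.parser`. It collects the heading text and the link target from a small snippet. The `TagCollector` class and the sample HTML string are invented for this example and are not part of the lecture.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records <h1> text and <a> href values."""

    def __init__(self):
        super().__init__()
        self.headings = []   # text found inside <h1>...</h1>
        self.links = []      # href values of <a> tags
        self._in_h1 = False  # True while the parser is inside an <h1>

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


# Sample HTML invented for illustration.
sample_html = (
    "<h1>Events</h1><p>First paragraph</p><br>"
    "<a href='http://example.com'>Link</a>"
)

collector = TagCollector()
collector.feed(sample_html)
print(collector.headings)
print(collector.links)
```

For real pages, a dedicated library such as the ones listed in these notes is usually more convenient than a hand-written parser subclass.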
Useful Libraries for Scraping
- urllib
- BeautifulSoup
- pattern
- lxml
Get Data From Website
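import urllib2
import bs4

url = 'url'
source = urllib2.urlopen(url).read()  # download the page's HTML
soup = bs4.BeautifulSoup(source)
soup.findAll('a')  # find all <a></a> tags
tag = soup.find('a')  # first <a> tag
tag.get('href')  # value of its href attribute
C = soup.findAll('p', {'class': 'Event'})  # <p> tags with class "Event"
t = C[0]
t.findNextSiblings()  # siblings that follow t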
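The notes use `urllib2`, which exists only in Python 2. In Python 3 its functionality lives in `urllib.request`. The sketch below shows the same BeautifulSoup steps in Python 3, assuming the `bs4` package is installed. To stay runnable offline it parses an inline HTML string invented for illustration, rather than downloading a page. It uses `find_all` and `find_next_siblings`, the current bs4 names for `findAll` and `findNextSiblings`.

```python
import bs4

# Sample HTML invented for illustration; a real page would come from
# urllib.request.urlopen(url).read() in Python 3.
source = """
<html><body>
<a href="http://example.com/a">First</a>
<p class="Event">Talk</p>
<p class="Event">Workshop</p>
<p class="Other">Lunch</p>
</body></html>
"""

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = bs4.BeautifulSoup(source, "html.parser")

# href attribute of every <a> tag
links = [a.get("href") for a in soup.find_all("a")]

# every <p> tag whose class is "Event"
events = soup.find_all("p", {"class": "Event"})
event_names = [p.get_text() for p in events]

# <p> siblings that come after the first Event paragraph
following = [s.get_text() for s in events[0].find_next_siblings("p")]

print(links)
print(event_names)
print(following)
```

Passing a tag name to `find_next_siblings` skips the whitespace text nodes between elements, so only `<p>` tags are returned.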
Get Data With An API
import json  # JavaScript Object Notation
import urllib2
import requests

api_key = 'mykey'
url = 'url' + api_key
data = urllib2.urlopen(url).read()  # raw JSON text returned by the API

#---simple example--------
a = {'a': 1, 'b': 2}
s = json.dumps(a)   # serialize the dict to a JSON string
a2 = json.loads(s)  # parse the string back into a dict
#-------------------------

dataDict = json.loads(data)
dataDict.keys()
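The JSON step can be tried without a network connection or an API key. The sketch below uses only the standard library. The response body in `data` is a hypothetical example invented here, not output from any real API. Note that `json.dumps` returns a string, whereas `json.dump` writes to a file object.

```python
import json

# Simple round trip: dict -> JSON string -> dict
a = {"a": 1, "b": 2}
s = json.dumps(a)    # serialize to a str
a2 = json.loads(s)   # parse back into a dict

# Hypothetical API response body, invented for illustration.
data = '{"title": "Some Movie", "year": 2013, "ratings": {"critics": 87}}'
data_dict = json.loads(data)

print(a2 == a)
print(sorted(data_dict.keys()))
print(data_dict["ratings"]["critics"])
```

Nested JSON objects become nested dictionaries, so fields are reached by chaining keys as in the last line.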