How to Grab the Original Video Files from Tudou (土豆网): Principle and Implementation

Source: Internet  Editor: 程序博客网  Time: 2024/04/27 19:44
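I used to think that grabbing flv or swf files was just a matter of reading the page's HTML source, but on Tudou the file address is nowhere to be found there, and for a long time I had no idea how to proceed. Today I finally worked out the principle behind this kind of capture; evidently the HTTP protocol deserves a much closer study.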

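Video files on sites like Tudou cannot be grabbed directly with Xunlei (迅雷). The video is in Flash flv format and is loaded by a player, so what Xunlei captures is only the customized flv player (an swf file), not the video itself. icekernel's suggestion is to look in the IE cache, but there is clearly a better way. The typical packet sequence is given below, using the original URL http://www.toodou.com/v/kV5FNAWE4RY as the example.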

1. GET /v/kV5FNAWE4RY, which returns a 302 redirect with a Location header
---------------------------------------------------------------------------------------------------------------------
GET /v/kV5FNAWE4RY HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-shockwave-flash, */*
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; InfoPath.1)
Host: www.toodou.com
Connection: Keep-Alive

*****************************Response:************************
HTTP/1.1 302 Found
Date: Wed, 16 Aug 2006 05:40:12 GMT
Server: Apache/2.2.2 (Unix) DAV/2 PHP/4.4.2
X-Powered-By: PHP/4.4.2
Set-Cookie: toodou=0fa1a1ba1262236ed123f0074095aaa9; path=/; domain=.toodou.com
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: private
Pragma: no-cache
Location: http://www.toodou.com/player/player.swf?iid=2119336
Content-Length: 0
Keep-Alive: timeout=300, max=9858
Connection: Keep-Alive
Content-Type: text/html
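
The only thing later steps need from this redirect is the iid query parameter in the Location header. A minimal sketch of pulling it out in Python follows; the function name extract_iid is my own and not part of the site, and the URL is the one from the capture above.

```python
from urllib.parse import urlsplit, parse_qs


def extract_iid(location: str) -> str:
    """Return the iid query parameter from a player URL (hypothetical helper)."""
    query = urlsplit(location).query
    values = parse_qs(query).get("iid")
    if not values:
        raise ValueError(f"no iid parameter in {location!r}")
    return values[0]


# Location header value taken from the 302 response above.
print(extract_iid("http://www.toodou.com/player/player.swf?iid=2119336"))
```

The iid is kept as a string because it is only ever echoed back to the server in step 3.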


2. GET /player/player.swf?iid=2119336, which downloads the flv player.
---------------------------------------------------------------------------------------------------------------------
GET /player/player.swf?iid=2119336 HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-shockwave-flash, */*
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; InfoPath.1)
Host: www.toodou.com
Connection: Keep-Alive
If-Modified-Since: Thu, 10 Aug 2006 06:15:07 GMT
If-None-Match: "591369-6cd7-bfd150c0"
Cookie: toodou=0fa1a1ba1262236ed123f0074095aaa9

*****************************Response:************************
HTTP/1.1 200 OK
Date: Wed, 16 Aug 2006 05:40:35 GMT
Server: Apache/2.2.2 (Unix) DAV/2 PHP/4.4.2
Last-Modified: Thu, 10 Aug 2006 06:15:07 GMT
ETag: "62d0e3-6cd7-bfd150c0"
Accept-Ranges: bytes
Content-Length: 27863
Keep-Alive: timeout=300, max=9947
Connection: Keep-Alive
Content-Type: application/x-shockwave-flash

CWS.....x....  (compressed SWF body, 27863 bytes per Content-Length; binary data omitted)
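
The body starts with "CWS", the signature of a zlib-compressed SWF file, which confirms that this response is the player and not the video. The "x" that appears 8 bytes in is consistent with the start of a zlib stream (0x78). As a rough sketch, the 8-byte SWF header can be checked as below; parse_swf_header is a made-up helper name, and the sample bytes are synthetic rather than the real player.

```python
import struct
import zlib


def parse_swf_header(data: bytes) -> dict:
    """Read the fixed 8-byte SWF header (hypothetical helper)."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    signature = data[:3]
    # "FWS" is an uncompressed SWF, "CWS" a zlib-compressed one.
    if signature not in (b"FWS", b"CWS"):
        raise ValueError("not an FWS/CWS SWF file")
    version = data[3]
    # Little-endian 32-bit length of the whole file after decompression.
    (length,) = struct.unpack("<I", data[4:8])
    return {"compressed": signature == b"CWS", "version": version, "length": length}


# Synthetic sample: 8-byte header followed by a zlib stream of 19 zero bytes.
sample = b"CWS" + bytes([9]) + struct.pack("<I", 27) + zlib.compress(b"\x00" * 19)
print(parse_swf_header(sample))
```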


3. POST the iid to /player/info.php to obtain the address of the original flv file
---------------------------------------------------------------------------------------------------------------------
POST /player/info.php HTTP/1.1
Accept: */*
Referer: http://www.toodou.com/player/player.swf?iid=2119336
x-flash-version: 9,0,16,0
Content-Type: application/x-www-form-urlencoded
Content-Length: 48
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; InfoPath.1)
Host: www.toodou.com
Connection: Keep-Alive
Cache-Control: no-cache
Cookie: toodou=0fa1a1ba1262236ed123f0074095aaa9

onLoad=%5Btype%20Function%5D&times=0&iid=2119336

*****************************Response:************************
HTTP/1.1 200 OK
Date: Wed, 16 Aug 2006 05:40:35 GMT
Server: Apache/2.2.2 (Unix) DAV/2 PHP/4.4.2
X-Powered-By: PHP/4.4.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: private
Pragma: no-cache
Content-Length: 155
Keep-Alive: timeout=300, max=9990
Connection: Keep-Alive
Content-Type: text/html

r=1&ipic=http://image.toodou.com/data/imgs/i/002/119/336/w.jpg&insite=0&a=flv|http://player0.toodou.com/flv/002/119/336/2119336.flv|vi|146400|0|0|0|5729673
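
The response body is a query-string-style list of fields, and the flv address is the second "|"-separated value of the a field. A minimal sketch of extracting it is shown here. The helper name extract_flv_url is mine, and the meaning of the other "|"-separated values is not documented in the capture, so the code ignores them.

```python
from urllib.parse import parse_qs


def extract_flv_url(body: str) -> str:
    """Return the flv URL from an info.php response body (hypothetical helper)."""
    fields = parse_qs(body)
    if "a" not in fields:
        raise ValueError("response has no 'a' field")
    parts = fields["a"][0].split("|")
    # Observed layout: flv|<url>|vi|... ; only the first two parts are used here.
    if len(parts) < 2 or parts[0] != "flv":
        raise ValueError("unexpected layout of the 'a' field")
    return parts[1]


# Response body copied from the capture above.
body = ("r=1&ipic=http://image.toodou.com/data/imgs/i/002/119/336/w.jpg"
        "&insite=0&a=flv|http://player0.toodou.com/flv/002/119/336/2119336.flv"
        "|vi|146400|0|0|0|5729673")
print(extract_flv_url(body))
```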


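Clearly, watching the response to the POST /player/info.php request is enough to capture the file: the flv address is the second "|"-separated value of the a field. Alternatively, you can simulate the packets yourself and send them to the server.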
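
Simulating the step 3 packet might look like the sketch below, using only Python's standard library. Everything here is reconstructed from the capture above and is an assumption about what the server requires: the field name times is inferred from the 48-byte Content-Length, the cookie and Referer are simply replayed, and build_info_request is a name I made up. The site may no longer answer these URLs, so the sketch only builds the request; passing it to urllib.request.urlopen would send it.

```python
from urllib.parse import urlencode, quote
from urllib.request import Request

INFO_URL = "http://www.toodou.com/player/info.php"  # endpoint seen in the capture


def build_info_request(iid: str, cookie: str) -> Request:
    """Build the POST that the Flash player sends in step 3 (hypothetical helper)."""
    # quote_via=quote encodes the space as %20, as in the captured body.
    body = urlencode(
        {"onLoad": "[type Function]", "times": "0", "iid": iid},
        quote_via=quote,
    )
    headers = {
        "Referer": f"http://www.toodou.com/player/player.swf?iid={iid}",
        "Content-Type": "application/x-www-form-urlencoded",
        "Cookie": cookie,
    }
    return Request(INFO_URL, data=body.encode("ascii"), headers=headers, method="POST")


# Values copied from the capture; Content-Length is added automatically on send.
req = build_info_request("2119336", "toodou=0fa1a1ba1262236ed123f0074095aaa9")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Whether the server actually checks the Referer or the cookie is not something the capture shows; they are included only to mirror the original request as closely as possible.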
 