[Notes] grep usage: counting matches and searching for URLs in HTML

Source: Internet | Posted by: 淘宝列表调成大图 | Editor: 程序博客网 | Time: 2024/06/07 07:59

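★ 1. Counting matches

Example: counting how many entries in the CVE list mention android and how many mention linux.
The CVE list is available at http://cve.mitre.org/data/downloads/allitems.csv
It is in CSV format, with one entry per line.

To keep the grep usage easy to follow, only simple examples are used here, such as counting the lines that contain the string android.
Note: the test environment is cygwin.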

♦ 1.1 Building the test data

$ echo -e "android, android\nandroid,android\nandroid\nandroid,linux\nlinux,linux\nlinux" > test.txt

Note: echo's -e option enables interpretation of escape sequences, so each \n in the string above is treated as a newline.

The contents of test.txt:

$ cat test.txt
android, android
android,android
android
android,linux
linux,linux
linux
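Not every shell's echo handles -e the same way (some print the -e literally), so a printf-based variant may be more portable. The following is a minimal sketch of that alternative; the file name test_printf.txt is hypothetical and chosen only to avoid overwriting test.txt.

```shell
# printf reuses the '%s\n' format for every argument,
# so each argument becomes one line of the file.
printf '%s\n' \
  'android, android' \
  'android,android' \
  'android' \
  'android,linux' \
  'linux,linux' \
  'linux' > test_printf.txt

# Count the lines written (tr strips the padding some wc versions add).
line_count=$(wc -l < test_printf.txt | tr -d ' ')
echo "$line_count"
```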

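♦ 1.2 Counting the lines that contain `android` and `linux`

Use grep's -c option, which prints only the number of matching lines instead of the lines themselves. A line that contains the string twice still counts once.

Lines containing android:

$ grep -ic "android" ./test.txt
4
$ grep -ni "android" ./test.txt
1:android, android
2:android,android
3:android
4:android,linux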
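Because -c counts lines rather than occurrences, the two numbers can differ when a line contains the string more than once. A small sketch of the difference, using a hypothetical helper function sample that reproduces the six test lines without touching test.txt:

```shell
# Hypothetical helper: emits the same six lines as test.txt.
sample() {
  printf '%s\n' 'android, android' 'android,android' 'android' \
    'android,linux' 'linux,linux' 'linux'
}

# -c counts matching lines.
matching_lines=$(sample | grep -ic 'android')

# -o prints every match on its own line, so wc -l counts occurrences.
occurrences=$(sample | grep -io 'android' | wc -l | tr -d ' ')

echo "lines: $matching_lines, occurrences: $occurrences"
```

Here the first two lines each contain android twice, which is why the occurrence count is higher than the line count.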

Likewise, lines containing linux:

$ grep -ic "linux" ./test.txt
3
$ grep -ni "linux" ./test.txt
4:android,linux
5:linux,linux
6:linux

Lines containing android or linux (or both):

$ grep -ic "android\|linux" ./test.txt
6
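The \| alternation in a basic regular expression is a GNU extension. With the -E option (extended regular expressions), alternation is written as a plain |. A sketch of both spellings, again using a hypothetical sample helper in place of test.txt; with GNU grep both counts are expected to agree:

```shell
# Hypothetical helper: emits the same six lines as test.txt.
sample() {
  printf '%s\n' 'android, android' 'android,android' 'android' \
    'android,linux' 'linux,linux' 'linux'
}

# Basic regular expression: alternation spelled \| (GNU extension).
bre_count=$(sample | grep -ic 'android\|linux')

# Extended regular expression (-E): alternation spelled |.
ere_count=$(sample | grep -icE 'android|linux')

echo "BRE: $bre_count, ERE: $ere_count"
```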

So, by inclusion-exclusion, the number of lines containing both android and linux is: 4+3−6=1
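The same figure can be cross-checked directly by chaining two grep calls, so that the second only sees lines the first one kept. This is a sketch with a hypothetical sample helper standing in for test.txt:

```shell
# Hypothetical helper: emits the same six lines as test.txt.
sample() {
  printf '%s\n' 'android, android' 'android,android' 'android' \
    'android,linux' 'linux,linux' 'linux'
}

android_lines=$(sample | grep -ic 'android')
linux_lines=$(sample | grep -ic 'linux')
either_lines=$(sample | grep -icE 'android|linux')

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
by_formula=$((android_lines + linux_lines - either_lines))

# Direct count: keep android lines, then count those that also contain linux.
by_pipeline=$(sample | grep -i 'android' | grep -ic 'linux')

echo "formula: $by_formula, pipeline: $by_pipeline"
```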

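★ 2. Searching for URLs in HTML

♦ 2.1 Getting a sample file

Take http://slide.ent.sina.com.cn/star/slide_4_704_254805.html#p=10 as the example.

In Firefox, open "Developer" -> "Inspector" (or press the shortcut ctrl+shift+c), locate the HTML document, and save it locally as test.html, the file name used in the commands that follow.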

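♦ 2.2 Finding all URLs that start with http

Use grep -wio "http://[0-9_a-zA-Z\/.\-]*" ./test.html

Option -w: match whole words only, i.e. the matched text must not be immediately preceded or followed by a word character (a letter, digit, or underscore).
Option -i: ignore case.
Option -o: print only the matched part of each line.

For this regular expression, -i is not needed for the character set, because [0-9_a-zA-Z\/.\-] already covers both upper and lower case; -i only affects the literal http prefix.
Note that - must come last in the set (or first), otherwise it may fail to match. Inside a grep bracket expression, \ is a literal character rather than an escape, so \- in the middle starts a range instead of matching a hyphen. In Python, \- in the middle of a set is fine.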
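The effect of the hyphen's position can be seen on a one-line input. This is a minimal sketch, assuming the C locale so that ranges follow ASCII order; the input string x1-2 is made up for illustration:

```shell
# '-' placed last in the set: it is a literal hyphen.
hyphen_last=$(printf 'x1-2\n' | LC_ALL=C grep -o 'x[0-9a-]*')

# '\-a' in the middle: '\' is literal inside brackets, so this is
# the range from '\' to 'a', which does not include '-'.
hyphen_middle=$(printf 'x1-2\n' | LC_ALL=C grep -o 'x[0-9\-a]*')

echo "last: $hyphen_last"
echo "middle: $hyphen_middle"
```

With the hyphen last the whole string x1-2 matches; with \- in the middle the match stops at x1.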

Output (partial):

$ grep -wio "http://[0-9_a-zA-Z\/.\-]*" ./test.html
http://ent.sina.com.cn/js/470/20130123/comment.js
http://comment5.news.sina.com.cn/count/info
http://beacon.sina.com.cn/ckctl.html
http://i.sso.sina.com.cn/images/login/icon_custom.png
http://api.sina.com.cn/weibo/2/users/show.json
http://n.sinaimg.cn/ent/4_img/upload/d411fbc6/w1280h960/20171201/TBHA-fypikwt0416669.jpg
(remaining output omitted)
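To try the command without saving a real page, a tiny stand-in document is enough. The sketch below uses a hypothetical helper sample_html with made-up example.com URLs; it also shows what -i contributes, since one URL is written in upper case.

```shell
# Hypothetical stand-in for test.html (all URLs are made up).
sample_html() {
  cat <<'EOF'
<script src="http://example.com/js/app.js"></script>
<img src="http://img.example.com/a/b-1.jpg">
<a href="HTTP://EXAMPLE.COM/Page.html">link</a>
<p>no url on this line</p>
EOF
}

# Same options and pattern as above: one URL per output line.
urls=$(sample_html | grep -wio 'http://[0-9_a-zA-Z\/.\-]*')
echo "$urls"
url_count=$(printf '%s\n' "$urls" | wc -l | tr -d ' ')

# Without -i the upper-case HTTP:// line is no longer counted.
lines_without_i=$(sample_html | grep -wc 'http://[0-9_a-zA-Z\/.\-]*')
echo "with -i: $url_count URLs, without -i: $lines_without_i matching lines"
```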

♦ 2.3 Finding all images whose URL starts with http

grep -wio "http://[0-9_a-zA-Z\/.\-]*\(.jpg\|.png\)" ./test.html
Here \(.jpg\|.png\) means that both .jpg and .png endings are searched for. The alternation must be written \| (a bare | is a literal character in a basic regular expression). The unescaped . matches any character; writing \.jpg and \.png is stricter.

$ grep -wio "http://[0-9_a-zA-Z\/.\-]*\(.jpg\|.png\)" ./test.html
http://www.sinaimg.cn/dy/deco/2013/0604/dot_hover.png
http://www.sinaimg.cn/dy/deco/2013/0604/weibo.png
http://www.sinaimg.cn/dy/deco/2013/0604/weibo_hover.png
http://www.sinaimg.cn/cj/hd/close_h2.jpg
http://n.sinaimg.cn/ent/4_img/upload/d411fbc6/w800h600/20171201/ZZ5M-fypikwt0416547.jpg
http://www.sinaimg.cn/ent/deco/2014/0311/images/sc_pic_loginImage.png
http://n.sinaimg.cn/ent/4_img/upload/d411fbc6/w1280h960/20171201/TBHA-fypikwt0416669.jpg
(remaining output omitted)
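The difference between an unescaped and an escaped dot before the extension can be checked on a small stand-in document. This sketch uses -E so that the group and alternation need no backslashes; the helper sample_html and its example.com URLs are hypothetical.

```shell
# Hypothetical stand-in for test.html (all URLs are made up).
sample_html() {
  cat <<'EOF'
<img src="http://img.example.com/a/photo.jpg">
<img src="http://img.example.com/a/icon.png">
<script src="http://example.com/js/app.js"></script>
<a href="http://example.com/xjpg">odd</a>
EOF
}

# Unescaped dot: '.' matches any character, so ".../xjpg" also matches.
loose_count=$(sample_html \
  | grep -wioE 'http://[0-9a-zA-Z_/.-]*(.jpg|.png)' | wc -l | tr -d ' ')

# Escaped dot: only a literal ".jpg" or ".png" ending matches.
strict_count=$(sample_html \
  | grep -wioE 'http://[0-9a-zA-Z_/.-]*\.(jpg|png)' | wc -l | tr -d ' ')

echo "loose: $loose_count, strict: $strict_count"
```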

♦ 2.4 Finding only the jpg images whose URL starts with `http://n.sinaimg.cn`

grep -wio "http://n.sinaimg.cn[0-9_a-zA-Z\/.\-]*.jpg" ./test.html
The original HTML contains repeated URLs, so the matches contain duplicates as well.

$ grep -wio "http://n.sinaimg.cn[0-9_a-zA-Z\/.\-]*.jpg" ./test.html
http://n.sinaimg.cn/ent/4_img/upload/d411fbc6/w800h600/20171201/ZZ5M-fypikwt0416547.jpg
http://n.sinaimg.cn/ent/4_img/upload/d411fbc6/w1280h960/20171201/TBHA-fypikwt0416669.jpg
http://n.sinaimg.cn/ent/4_img/upload/d411fbc6/w800h600/20171201/ZZ5M-fypikwt0416547.jpg
http://n.sinaimg.cn/ent/4_img/upload/d411fbc6/w800h600/20171201/EtX1-fypikwt0416552.jpg
(remaining output omitted)
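If the duplicates are unwanted, one common approach (not part of the original commands) is to pipe the matches through sort -u. A sketch on a hypothetical stand-in document; the /demo/ paths are made up, and the dots in the pattern are escaped here for strictness:

```shell
# Hypothetical stand-in for test.html; the /demo/ paths are made up.
sample_html() {
  cat <<'EOF'
<img src="http://n.sinaimg.cn/demo/a.jpg">
<img src="http://n.sinaimg.cn/demo/b.jpg">
<img src="http://n.sinaimg.cn/demo/a.jpg">
<img src="http://www.sinaimg.cn/demo/c.jpg">
EOF
}

pattern='http://n\.sinaimg\.cn[0-9_a-zA-Z/.-]*\.jpg'

# All matches, duplicates included.
all_count=$(sample_html | grep -wio "$pattern" | wc -l | tr -d ' ')

# sort -u sorts the matches and keeps one copy of each.
unique_urls=$(sample_html | grep -wio "$pattern" | sort -u)
unique_count=$(printf '%s\n' "$unique_urls" | wc -l | tr -d ' ')

echo "$unique_urls"
echo "all: $all_count, unique: $unique_count"
```

Note that sort -u changes the order of the URLs; it does not preserve the order in which they appear in the page.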
