A simple HTML tag filter in Perl
Source: Internet   Editor: 程序博客网   Date: 2024/05/05 03:45
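This program fetches the HTML source of a page and extracts the following tags together with a few of their basic attributes:
<script>: attributes src, type
<a>: attribute href
<img>: attribute src
More useful features will be added later, and the command-line interface will be improved step by step.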
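As a rough illustration of the lookup described above, the sketch below parses a small inline HTML string with HTML::TreeBuilder and collects the listed attributes for each tag. The sample markup, the %wanted table, and the extract_attrs helper are assumptions made for this example and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical table: tag name => attributes to report (mirrors the list above).
my %wanted = (
    script => [qw(src type)],
    a      => [qw(href)],
    img    => [qw(src)],
);

# Hypothetical helper: returns one "tag attr=value ..." string per matching element.
sub extract_attrs {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @rows;
    for my $el ($tree->find_by_tag_name(keys %wanted)) {
        my $tag   = $el->tag;
        my @pairs = map {
            my $value = $el->attr($_);
            $_ . '=' . (defined $value ? $value : 'none');
        } @{ $wanted{$tag} };
        push @rows, "$tag @pairs";
    }
    $tree->delete;    # release the parse tree
    return @rows;
}

# Sample markup, made up for this example.
my $sample = <<'HTML';
<html><head><script src="app.js" type="text/javascript"></script></head>
<body><a href="http://example.com/">link</a><img src="logo.png"></body></html>
HTML

print "$_\n" for extract_attrs($sample);
```

Keeping the tag-to-attribute mapping in one hash means a new tag can be supported by adding a single entry.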
Usage:
perl filter_html.pl <URL>
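A minimal sketch of how the argument handling behind that usage line could look, with optional tag names accepted after the URL. The parse_args helper and the default tag list are assumptions for illustration only, and no request is sent here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split the argument list into a URL and a tag list.
# Returns an empty list when no URL was given.
sub parse_args {
    my @args = @_;
    return unless @args;
    my $url  = shift @args;
    my @tags = map { lc($_) } @args;
    # Assumed default: the three tags the article says the script reports.
    @tags = qw(a script img) unless @tags;
    return ($url, \@tags);
}

my ($url, $tags) = parse_args(@ARGV);
if (!defined $url) {
    print "usage: filter_html.pl <webpage_url> [tag ...]\n";
}
else {
    print "url: $url\n";
    print "tags: @$tags\n";
}
```

For example, `perl sketch.pl http://example.com/ IMG` would report the URL and the single tag `img`, while omitting the tag names falls back to the assumed default list.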
#!/usr/bin/perl
# --------------------------
# author : ez
# date : 2015/8/23
# describe : this script sends an HTTP request to a URL and filters the
#            special tags you input
# --------------------------
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use Data::Dumper;
use HTML::TreeBuilder;
# use HTML::Parser;

our $VERSION = 1.0;

# dispatch table: tag name => handler that prints the attributes of interest
my %disp_func = (
    a => sub {
        my $em = shift;
        # skip anything that is not an <a> element
        return if !defined ($em) or $em -> tag () ne 'a';
        my $href = $em -> attr ('href');
        print "a url = " . ($href ? $href : 'none') . "\n";
    },
    script => sub {
        my $em = shift;
        # skip anything that is not a <script> element
        return if !defined ($em) or $em -> tag () ne 'script';
        my $type = $em -> attr ('type');
        my $src = $em -> attr ('src');
        print "script type = " . ($type ? $type : 'none')
            . ", src = " . ($src ? $src : 'none') . "\n";
    },
    img => sub {
        my $em = shift;
        # skip anything that is not an <img> element
        return if !defined ($em) or $em -> tag () ne 'img';
        my $src = $em -> attr ('src');
        print "img src = " . ($src ? $src : 'none') . "\n";
    },
);

&_usage () if @ARGV < 1;
my $url = shift @ARGV;

# tags given on the command line; parsed but not used yet,
# the lookup below is still hard-coded
my @tags = qw(a script form img);
@tags = @ARGV if @ARGV >= 1;

my $useragent = LWP::UserAgent -> new;
my $request = HTTP::Request -> new ('GET' => $url);
$request -> content_type ('application/x-www-form-urlencoded');
$request -> header ('Accept-Language' => 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3');

print "[-] sending request to $url ...\n";
my $html = $useragent -> request ($request);
# stop early instead of parsing an error page
die "[!] request failed: " . $html -> status_line () . "\n"
    unless $html -> is_success ();
print "[-] get response !\n";

my $tree = HTML::TreeBuilder -> new;
$tree -> parse ($html -> content ());
$tree -> eof ();
my $html_tag = $tree -> elementify ();
# my @decendants = $html_tag -> descendants ();

# maybe the parameter could be more exciting :-)
my @find_tags = $html_tag -> find_by_tag_name ('a', 'script', 'img');
foreach (@find_tags) {
    next if !defined ($_) or $_ -> tag () eq '';
    # ignore tags that have no handler in the dispatch table
    next if !exists $disp_func {$_ -> tag ()};
    &{$disp_func {$_ -> tag ()}} ($_);
}
$tree -> delete ();

sub get_tags {
    my ($tag, $node) = @_;
    return if ! $tag;
}

sub _usage {
    print "usage: filter_html.pl <webpage_url>\n";
    exit;
}

# debug
# my $tag = $val -> tag (); # get 'html'
# TODO : parse start
# my $items = $tree -> findnodes ('/html/body//a');
# for my $item ($items -> get_nodelist ()) {
#     my $str = $item -> content -> [0];
#     print "$str\n";
# }
# print $html -> as_string ();
# print $html -> content ();
# my @line = $html -> content ();
# /<(\S*?)[^>]>.*?<\/\1>|<.*?\/>/
# foreach (@line) {
#     print "$_\n";
# }
__END__
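The script routes each element to a handler through a hash of code references keyed by tag name. A self-contained sketch of that dispatch-table pattern follows. Plain hash references stand in for parsed elements so that it runs without any CPAN module; the %handler_for table, the describe helper, and the sample data are assumptions for illustration. The sketch returns nothing for a tag without a handler (for example form), because calling a missing table entry as a code reference is a fatal error.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Dispatch table: tag name => code reference that formats one element.
# Elements are plain hash refs here (an assumption for this sketch).
my %handler_for = (
    a      => sub { "a url = " . ($_[0]{href} // 'none') },
    img    => sub { "img src = " . ($_[0]{src} // 'none') },
    script => sub {
        "script type = " . ($_[0]{type} // 'none')
            . ", src = " . ($_[0]{src} // 'none');
    },
);

# Hypothetical helper: look up the handler by tag name and call it.
# Returns undef for tags that have no handler instead of dying.
sub describe {
    my ($element) = @_;
    my $handler = $handler_for{ $element->{tag} // '' };
    return $handler ? $handler->($element) : undef;
}

my @elements = (
    { tag => 'a',    href   => 'http://example.com/' },
    { tag => 'img' },
    { tag => 'form', action => '/submit' },    # no handler registered
);

for my $element (@elements) {
    my $line = describe($element);
    print "$line\n" if defined $line;
}
```

The defined-or operator (//) used here needs Perl 5.10 or later.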
Note: your Perl installation may not include the HTML::TreeBuilder and Data::Dumper modules; if they are missing, install them from CPAN.
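One way to check up front whether the modules are present is to try loading each of them. The sketch below does that with eval and require; the module_available helper name is an assumption for this example, and the module list mirrors the use lines of the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded.
sub module_available {
    my ($module) = @_;
    # Only accept plain package names before passing them to string eval.
    return 0 unless $module =~ /\A\w+(?:::\w+)*\z/;
    return eval("require $module; 1") ? 1 : 0;
}

for my $module (qw(LWP::UserAgent HTML::TreeBuilder Data::Dumper)) {
    printf "%-20s %s\n", $module,
        module_available($module) ? 'installed' : 'missing (install from CPAN)';
}
```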
Environment: Linux 3.18.0-kali3-amd64 #1 SMP Debian x86_64 GNU/Linux
Perl: v5.14.2 built for x86_64-linux-gnu-thread-multi