HTML & XML Parsing
Source: Internet  Editor: 程序博客网  Time: 2024/05/17 02:17
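Many libraries can parse HTML and XML, for example dom4j and jsoup. In the end they all depend on a DOM-style tree: each one converts the input into a tree and exposes it through a Document object. Some of these parsing approaches are introduced below.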
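To make the "tree plus Document object" idea concrete, here is a minimal sketch that uses the parser built into the JDK (`javax.xml.parsers`) rather than dom4j or jsoup. The class name `DomTreeSketch`, its helper methods, and the sample XML are assumptions made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTreeSketch {

    // Parse an XML string into an in-memory tree rooted at a Document object.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Walk the tree recursively and record "<depth> <element name>" in document order.
    static List<String> outline(Node node, int depth, List<String> out) {
        int childDepth = depth;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(depth + " " + node.getNodeName());
            childDepth = depth + 1;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            outline(c, childDepth, out);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample input
        Document doc = parse("<books><book><title>A</title></book><book/></books>");
        for (String line : outline(doc, 0, new ArrayList<String>())) {
            System.out.println(line);
        }
    }
}
```

The libraries covered below wrap the same idea in their own node types, so the calls differ but the tree walk looks much the same.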
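Parsing with dom4j. The methods shown here all come from the dom4j documentation, but the overall parsing process is the same whether you use DOM parsing or some other approach.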
**First, obtain a Document object**
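import java.net.URL;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;

public Document parse(URL url) throws DocumentException {
    SAXReader reader = new SAXReader();
    // Document document = reader.read("src/Book.xml"); // load the XML document into a Document object
    Document document = reader.read(url);
    return document;
}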
Using Iterators
import java.util.Iterator;
import org.dom4j.Attribute;
import org.dom4j.Element;

// Get the root element
Element root = document.getRootElement();

// iterate through child elements of root
for ( Iterator i = root.elementIterator(); i.hasNext(); ) {
    Element element = (Element) i.next();
    // do something
}

// iterate through child elements of root with element name "foo"
for ( Iterator i = root.elementIterator( "foo" ); i.hasNext(); ) {
    Element foo = (Element) i.next();
    // do something
}

// iterate through attributes of root
for ( Iterator i = root.attributeIterator(); i.hasNext(); ) {
    Attribute attribute = (Attribute) i.next();
    // do something
}
Powerful Navigation with XPath
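import java.util.List;
import org.dom4j.Node;

// All <bar> elements that are children of a <foo> element, anywhere in the document
List list = document.selectNodes( "//foo/bar" );
// The first matching <author> element
Node node = document.selectSingleNode( "//foo/bar/author" );
// The value of that element's "name" attribute
String name = node.valueOf( "@name" );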
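For comparison, the same XPath expressions can be tried without any third-party library. The sketch below uses the JDK's `javax.xml.xpath` package instead of dom4j; the class `XPathSketch`, its two helpers, and the sample document are hypothetical and exist only to show what `//foo/bar` and `@name` select.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical sample document shaped to match the expressions used above.
    static final String XML =
        "<root><foo><bar><author name=\"James\"/></bar><bar/></foo></root>";

    // Number of nodes selected by the expression.
    static int count(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + expr, e);
        }
    }

    // String value of the first node selected by the expression.
    static String text(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(XML, "//foo/bar"));              // 2
        System.out.println(text(XML, "//foo/bar/author/@name"));  // James
    }
}
```

`//` matches at any depth, so `//foo/bar` selects both `bar` elements under `foo`, while `@name` addresses an attribute rather than a child element.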
Creating a new XML document
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

Document document = DocumentHelper.createDocument();
Element root = document.addElement( "root" );

Element author1 = root.addElement( "author" )
    .addAttribute( "name", "James" )
    .addAttribute( "location", "UK" )
    .addText( "James Strachan" );

Element author2 = root.addElement( "author" )
    .addAttribute( "name", "Bob" )
    .addAttribute( "location", "US" )
    .addText( "Bob McWhirter" );
**Writing a document to a file**
import java.io.FileWriter;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

// lets write to a file
XMLWriter writer = new XMLWriter( new FileWriter( "output.xml" ) );
writer.write( document );
writer.close();

// Pretty print the document to System.out
OutputFormat format = OutputFormat.createPrettyPrint();
writer = new XMLWriter( System.out, format );
writer.write( document );

// Compact format to System.out
format = OutputFormat.createCompactFormat();
writer = new XMLWriter( System.out, format );
writer.write( document );
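The create-then-write flow can also be sketched with the JDK alone, which may help when dom4j is not on the classpath. This is not the dom4j API: it builds a W3C DOM document and serializes it with `javax.xml.transform`. The class `WriteSketch` and its methods are hypothetical names, and the exact whitespace of the indented output can vary between JDK versions.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class WriteSketch {

    // Build <root><author name="James" location="UK">James Strachan</author></root>.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            Element author = doc.createElement("author");
            author.setAttribute("name", "James");
            author.setAttribute("location", "UK");
            author.setTextContent("James Strachan");
            root.appendChild(author);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not create document", e);
        }
    }

    // Serialize the document to a String, indented (pretty) or on one line (compact).
    static String serialize(Document doc, boolean indent) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(serialize(doc, true));   // pretty-printed
        System.out.println(serialize(doc, false));  // compact
    }
}
```

Passing a `StreamResult` that wraps a `FileWriter` instead of a `StringWriter` would write the same output to a file.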
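jsoup can fetch pages from the web and is very widely used. With it you can easily traverse an entire document.
More of the documentation is available here: http://www.open-open.com/jsoup/selector-syntax.htm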
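import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Each call below is an independent alternative; html is a String, input is a java.io.File

// Parse a String into a Document
Document doc = Jsoup.parse(html);

// Parse a body fragment only
doc = Jsoup.parseBodyFragment(html);

// Fetch an HTML document from a website and parse it
doc = Jsoup.connect("http://example.com/").get();
// A POST request is also available; search online for the details

// Parse with a specific charset
doc = Jsoup.parse(input, "UTF-8", "http://example.com/");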
Using jsoup to traverse a document depth-first and extract specific content
Create a class
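import org.jsoup.nodes.Element;

public class TagStruct {
    private Element e;     // the element itself
    private int deep;      // depth in the tree (body is 1)
    private String xpath;  // path from the root down to this element

    public TagStruct(Element e, int deep, String xpath) {
        this.e = e;
        this.deep = deep;
        this.xpath = xpath;
    }

    // Getters used by the traversal code
    public Element getE() { return e; }
    public int getDeep() { return deep; }
    public String getXpath() { return xpath; }

    // Without this, printing a TagStruct only shows the default Object.toString()
    @Override
    public String toString() {
        return deep + " " + xpath;
    }
}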
The parsing process
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Stack;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// http://tieba.baidu.com/f?kw=c%E8%AF%AD%E8%A8%80&fr=index
Document doc = Jsoup.connect("http://tieba.baidu.com/f?kw=c%E8%AF%AD%E8%A8%80&fr=index").get();
List<TagStruct> list = new ArrayList<TagStruct>();
Stack<TagStruct> sk = new Stack<TagStruct>();
Elements allElements = doc.getAllElements();
Element child = doc.child(0); // the <html> tag, i.e. the root tag
Element body = doc.body();
doc.siblingElements();
System.out.println(doc.getElementsByTag("a").size());

// Depth-first traversal: a Stack pops the most recently pushed element first
// (a queue would give breadth-first order instead)
TagStruct t = new TagStruct(body, 1, "//body");
sk.push(t);
while (!sk.isEmpty()) {
    TagStruct pop = sk.pop();
    Element e = pop.getE();
    Elements elements = e.children();
    for (int i = 0; i < elements.size(); i++) {
        Element el = elements.get(i);
        // i is the 0-based position among all children of e, so this is a positional path, not strict XPath
        TagStruct ta = new TagStruct(el, pop.getDeep() + 1, pop.getXpath() + "/" + el.tagName() + "[" + i + "]");
        if ("a".equals(el.tagName())) { // compare string contents with equals, not ==
            list.add(ta);
        }
        sk.push(ta);
    }
}

// Sort the collected <a> elements by depth, shallowest first
Comparator<TagStruct> comparator = new Comparator<TagStruct>() {
    public int compare(TagStruct o1, TagStruct o2) {
        return o1.getDeep() - o2.getDeep();
    }
};
Collections.sort(list, comparator);
// list.sort(comparator);
for (int i = 0; i < list.size(); i++) {
    System.out.println(list.get(i).toString());
}
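The traversal above labels each child with its 0-based position among all siblings, whereas XPath positional predicates are 1-based and count only siblings with the same tag name. A minimal sketch of that variant follows. It assumes well-formed (XHTML-like) input and uses the JDK DOM parser instead of jsoup so that it runs without extra dependencies; `PathSketch`, `Located`, and `pathsOf` are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PathSketch {

    // An element together with its depth and XPath-style location.
    static final class Located {
        final Element element;
        final int depth;
        final String path;

        Located(Element element, int depth, String path) {
            this.element = element;
            this.depth = depth;
            this.path = path;
        }
    }

    // Paths of all elements named `tag`, sorted by depth and then by path.
    static List<String> pathsOf(String xml, String tag) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }

        List<Located> found = new ArrayList<>();
        Deque<Located> stack = new ArrayDeque<>();  // explicit stack, so the walk is depth-first
        stack.push(new Located(root, 1, "/" + root.getTagName()));
        while (!stack.isEmpty()) {
            Located cur = stack.pop();
            if (cur.element.getTagName().equals(tag)) {
                found.add(cur);
            }
            Map<String, Integer> seen = new HashMap<>();  // per-tag counters for this parent
            for (Node c = cur.element.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() != Node.ELEMENT_NODE) {
                    continue;  // skip text, comments, and other non-element nodes
                }
                Element child = (Element) c;
                int index = seen.merge(child.getTagName(), 1, Integer::sum);  // 1-based
                stack.push(new Located(child, cur.depth + 1,
                        cur.path + "/" + child.getTagName() + "[" + index + "]"));
            }
        }

        found.sort(Comparator.comparingInt((Located l) -> l.depth).thenComparing(l -> l.path));
        List<String> paths = new ArrayList<>();
        for (Located l : found) {
            paths.add(l.path);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Hypothetical sample input
        String xml = "<body><div><a/><a/></div><a/></body>";
        // prints /body/a[1], /body/div[1]/a[1], /body/div[1]/a[2]
        for (String p : pathsOf(xml, "a")) {
            System.out.println(p);
        }
    }
}
```

Replacing `push`/`pop` with `addLast`/`pollFirst` on the same `Deque` would turn the walk into a breadth-first one, which visits elements in depth order and makes the final sort by depth unnecessary.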