Parsing HTML/XML with Nokogiri
Source: Internet  Editor: 程序博客网  Time: 2024/06/06 07:32
Searching an HTML / XML Document
Basic Searching
Let’s suppose you have the following document:
Let’s further suppose that you want a list of all the characters in all the shows in this document.
The Node methods xpath and css actually return a NodeSet, which acts very much like an array, and contains matching nodes from the document.
You can use any XPath or CSS query you like (see the chapter on XPath and CSS syntax for more information).
Notably, you can even use CSS queries in an XML document!
CSS queries are often the easiest and most succinct way to express what you’re looking for, so don’t be afraid to use them!
Single Results
If you know you’re going to get only a single result back, you can use the shortcuts at_css and at_xpath instead of having to access the first element of a NodeSet.
Namespaces
Just like our Ruby code, XML can suffer from name collisions. For example, an autoparts dealer can sell tires and so can a bike dealer. Both of them may use a “tire” tag to describe the tires they sell. However, we need to be able to tell the difference between a car tire and a bike tire. This is where namespaces come to the rescue.
Namespaces associate tags with a unique URL. Let’s take a look at the autoparts store’s XML versus the bike store’s:
Since the URLs are unique, we can associate our query with a URL and get only the tires belonging to that URL:
To make this namespace registration a bit easier, Nokogiri will automatically register any namespaces it finds on the root node for you. Nokogiri will associate the name in the declaration with the supplied URL. If we stick to this naming convention, we can shorten up our code.
Let’s take this Atom feed for example:
If we stick to the convention, we can grab all title tags like this:
Don’t be fooled though. You do not have to use XPath to get the benefits of namespaces. CSS selectors can be used as well. CSS just uses the pipe symbol to indicate a namespace search.
Let’s see the previous search rewritten to use CSS:
When using CSS, if the namespace is called “xmlns”, you can even omit the namespace name. That means your CSS will reduce to:
Dealing with namespaces is a broad topic. If you need more examples, be sure to check out this blog post or send an email to the mailing list, and we can help out.
But I’m Lazy and Don’t Want to Deal With Namespaces!
Lazy == Efficient, so no judgements. :)
If you have an XML document with namespaces, but would prefer to ignore them entirely (and query as if Tim Bray had never invented them), then you can call remove_namespaces! on an XML::Document to remove all namespaces. Of course, if the document had nodes with the same names but different namespaces, they will now be ambiguous. But you’re lazy! You don’t care!
Slop [1]
Maybe you want a more interactive (read: sloppy) way to access nodes and attributes. If you like what XmlSimple does, then you’ll probably like Nokogiri’s Slop mode. [2]
Slop mode allows you to violate the Law of Demeter with extreme prejudice, by using #method_missing to introspect on a node’s child tags. [3]
Aww yeah. Can you feel the spirit of @jbarnette and @nakajima flowing through you? That’s the power of the slop. [4]
[1] Don’t use this.
[2] This may or may not be a backhanded compliment.
[3] No, really, don’t use this. If you use it, don’t report bugs.
[4] You’ve been warned!
Reposted from http://www.nokogiri.org/tutorials/searching_a_xml_html_document.html