How to extract data from XML nodes in Scala
Problem: In a Scala application, you want to extract information from XML you receive, so you can use the data in your application.
Solution
Use the methods of the Scala Elem and NodeSeq classes to extract the data. The most commonly used methods of the Elem class are shown here:

Commonly used methods of the Elem class

Method                  Description
------                  -----------
x \ "div"               Searches the XML literal x for elements of type <div>. Only searches immediate child nodes (no grandchild or “descendant” nodes).
x \\ "div"              Searches the XML literal x for elements of type <div>. Returns matching elements from child nodes at any depth of the XML tree.
x.attribute("class")    Returns the value of the given attribute in the current node.
                        <a x="10" y="20">foo</a>.attribute("x") // returns Some(10).
x.attributes            Returns all attributes of the current node, prefixed and unprefixed, in no particular order.
                        scala> <a x="10" y="20">foo</a>.attributes
                        res0: scala.xml.MetaData = x="10" y="20"
x.child                 Returns the children of the current node.
                        <a><b>foo</b></a>.child // returns <b>foo</b>.
x.copy(...)             Returns a copy of the element, letting you replace data during the copy process.
x.label                 The name of the current element.
                        <a><b>foo</b></a>.label // returns a.
x.text                  Returns a concatenation of text(n) for each child n.
x.toString              Emits the XML literal as a String. Use scala.xml.PrettyPrinter to format the output, if desired.
Examples
The following examples demonstrate most of the methods just shown. Given this XML literal:
    scala> val x = <div class="content"><p>Hello</p><p>world</p></div>
    x: scala.xml.Elem = <div class="content"><p>Hello</p><p>world</p></div>

you can search for and extract subelements with the \ and \\ XPath methods:

    scala> x \ "p"
    res0: scala.xml.NodeSeq = NodeSeq(<p>Hello</p>, <p>world</p>)

    scala> x \\ "p"
    res1: scala.xml.NodeSeq = NodeSeq(<p>Hello</p>, <p>world</p>)
These methods will be demonstrated more in subsequent recipes.
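In the example above both methods return the same result, because every <p> element is an immediate child of the <div>. The following is a minimal sketch of a case where they differ. It assumes Scala 2 with the scala-xml library on the classpath, and the doc value and its contents are made up for illustration:

```scala
import scala.xml.Elem

// Hypothetical document: one <p> is a direct child of <div>,
// the other is nested one level deeper inside a <section>.
val doc: Elem = <div><p>top</p><section><p>nested</p></section></div>

val direct = doc \ "p"     // immediate children only
val anyDepth = doc \\ "p"  // matching elements at any depth

println(direct.map(_.text).mkString(", "))    // top
println(anyDepth.map(_.text).mkString(", "))  // top, nested
```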
The label method returns the name of the current element. A <p> tag returns p, a <div> tag returns div, etc.:

    scala> x.label
    res2: String = div

    scala> <name>Joe</name>.label
    res3: String = name

The text method returns the text from all subelements, which the Scaladoc describes as, “a concatenation of all text(n) for each child n”:

    scala> x.text
    res4: String = Helloworld
Later examples will demonstrate how to improve on this result.
Element attributes are extracted with the attribute or attributes methods. The following examples demonstrate how to call these methods, and the values they return:

    scala> x.attribute("class")
    res5: Option[Seq[scala.xml.Node]] = Some(content)

    scala> x.attributes("class")
    res6: Seq[scala.xml.Node] = content

    scala> x.attributes.get("class")
    res7: Option[Seq[scala.xml.Node]] = Some(content)

The following examples demonstrate how those same method calls behave when you search for an attribute that doesn’t exist:

    scala> x.attribute("foo")
    res8: Option[Seq[scala.xml.Node]] = None

    scala> x.attributes("foo")
    res9: Seq[scala.xml.Node] = null

    scala> x.attributes.get("foo")
    res10: Option[Seq[scala.xml.Node]] = None

    scala> x.attributes.get("foo").getOrElse("N/A")
    res11: Object = N/A
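Note that x.attributes("foo") returns null, and that the getOrElse call above is typed as Object because the default is a String while the attribute value is a Seq[Node]. One way to avoid both issues is to convert the nodes to a String before supplying the default. The sketch below assumes Scala 2 with scala-xml available; attrOrElse is a hypothetical helper name, not part of the library:

```scala
import scala.xml.Elem

val x: Elem = <div class="content"><p>Hello</p><p>world</p></div>

// Hypothetical helper: returns the attribute value as a String,
// or the given default when the attribute is missing.
def attrOrElse(e: Elem, name: String, default: String): String =
  e.attribute(name)                            // Option[Seq[Node]]
    .map(nodes => nodes.map(_.text).mkString)  // Option[String]
    .getOrElse(default)

println(attrOrElse(x, "class", "N/A"))  // content
println(attrOrElse(x, "foo", "N/A"))    // N/A
```

Going through the Option returned by attribute means the null returned by attributes("foo") never enters your code.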
To demonstrate more ways to work with element attributes, let’s create a new element:
    scala> val w = <forecast day="Thu" date="10 Nov 2011" low="37" high="58" />
    w: scala.xml.Elem = <forecast day="Thu" date="10 Nov 2011" low="37" high="58" />

These examples show how attribute and attributes work with multiple attributes:

    scala> w.attribute("day")
    res0: Option[Seq[scala.xml.Node]] = Some(Thu)

    scala> w.attributes("day")
    res1: Seq[scala.xml.Node] = Thu

    scala> w.attributes
    res2: scala.xml.MetaData = day="Thu" date="10 Nov 2011" low="37" high="58"

These examples show how to iterate over a set of attributes:

    scala> for (a <- w.attributes) println(s"key: ${a.key}, value: ${a.value}")
    key: day, value: Thu
    key: date, value: 10 Nov 2011
    key: low, value: 37
    key: high, value: 58

    scala> w.attributes.asAttrMap
    res3: Map[String,String] = Map(low -> 37, date -> 10 Nov 2011, day -> Thu, high -> 58)
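Once the attributes are available as a Map[String,String], they can be copied into an ordinary Scala class. The following is one possible sketch, assuming Scala 2 with scala-xml available. The Forecast case class and the toForecast function are hypothetical names introduced for this example:

```scala
import scala.util.Try
import scala.xml.Elem

val w: Elem = <forecast day="Thu" date="10 Nov 2011" low="37" high="58" />

// Hypothetical model class for the forecast data.
case class Forecast(day: String, date: String, low: Int, high: Int)

// Returns None if an attribute is missing or a temperature is not an integer.
def toForecast(e: Elem): Option[Forecast] = {
  val attrs = e.attributes.asAttrMap
  for {
    day  <- attrs.get("day")
    date <- attrs.get("date")
    low  <- attrs.get("low").flatMap(s => Try(s.toInt).toOption)
    high <- attrs.get("high").flatMap(s => Try(s.toInt).toOption)
  } yield Forecast(day, date, low, high)
}

println(toForecast(w))  // Some(Forecast(Thu,10 Nov 2011,37,58))
```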
Child elements
The child method returns all child nodes of the current element. To demonstrate this, let’s create a new XML variable:

    scala> val p = <person><name>Ken</name><age>23</age></person>
    p: scala.xml.Elem = <person><name>Ken</name><age>23</age></person>

The child method returns immediate child nodes:

    scala> p.child
    res0: Seq[scala.xml.Node] = ArrayBuffer(<name>Ken</name>, <age>23</age>)

You can use child to iterate over all the children:

    scala> for (n <- p.child) println(n)
    <name>Ken</name>
    <age>23</age>

Because child returns a sequence, you can also access the child elements like this:

    scala> p.child(0)
    res1: scala.xml.Node = <name>Ken</name>

    scala> p.child(0).label
    res2: String = name

    scala> p.child(0).text
    res3: String = Ken

    scala> p.child(1)
    res4: scala.xml.Node = <age>23</age>

    scala> p.child(1).text.toInt
    res5: Int = 23
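Accessing children by index works here because the XML is written on one line. When an XML literal is spread across indented lines, the whitespace between the tags also appears as child nodes, so the positions shift. The sketch below, which assumes Scala 2 with scala-xml available, keeps only the element children. The formatted value is a made-up variant of the person data above:

```scala
import scala.xml.Elem

// The same person data, but indented across several lines.
val formatted: Elem =
  <person>
    <name>Ken</name>
    <age>23</age>
  </person>

// child includes the whitespace-only text nodes between the tags.
val all = formatted.child

// Keep only the child nodes that are elements.
val elems = formatted.child.collect { case e: Elem => e }

println(elems.map(_.label).mkString(", "))  // name, age
```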
Text and strings
The toString method returns the XML structure as a String:

    scala> p.toString
    res6: String = <person><name>Ken</name><age>23</age></person>
You can improve this result with the PrettyPrinter class.
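As a brief sketch of what that looks like, assuming Scala 2 with scala-xml available, scala.xml.PrettyPrinter takes a maximum line width and an indentation step. The values 80 and 2 used here are arbitrary choices:

```scala
import scala.xml.{Elem, PrettyPrinter}

val p: Elem = <person><name>Ken</name><age>23</age></person>

// 80 = maximum line width, 2 = spaces per indentation level
val printer = new PrettyPrinter(80, 2)
val pretty: String = printer.format(p)

// Prints the XML across several indented lines instead of one.
println(pretty)
```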
This approach shows another way to extract the text from the elements:

    scala> for (n <- p.child) yield n.text
    res7: Seq[String] = ArrayBuffer(Ken, 23)
There are more ways to tackle these problems using XPath methods, which will be shown in subsequent chapters.
As a word of caution, be careful with the text method. It returns different results depending on how the XML is formatted, which can be a particular problem when extracting XHTML data. To demonstrate this, the following examples show the output when there is a space before the <br/> tag, and when there is no space:

    scala> <div><p>Hello, world, <br/>it's me.</p></div>.text
    res0: String = Hello, world, it's me.

    scala> <div><p>Hello, world,<br/>it's me.</p></div>.text
    res1: String = Hello, world,it's me.

In the next examples the same XML, formatted in different ways, yields different results, because the whitespace between the tags of the second literal is kept as part of the text:

    scala> <div><p>Is 2 > 1?</p><p>Why do you ask?</p></div>.text
    res2: String = Is 2 > 1?Why do you ask?

    scala> <div>
         |   <p>Is 2 > 1?</p>
         |   <p>Why do you ask?</p>
         | </div>.text
    res3: String =
    "
      Is 2 > 1?
      Why do you ask?
    "
If you need to extract text in this manner, a workaround is to extract the text components individually into a sequence, and then re-combine the sequence as desired. The following example demonstrates how to accomplish this with the child, label, and text methods. Given this XML literal:

    val xml = <div><p>Is 2 > 1?</p><p>Why do you ask?</p></div>

the child method returns the elements as a sequence:

    scala> xml.child
    res0: Seq[scala.xml.Node] = ArrayBuffer(<p>Is 2 > 1?</p>, <p>Why do you ask?</p>)

This lets you write the following code, which creates a sequence of strings from the <p> tags:

    val strings = for {
      e <- xml.child
      if e.label == "p"
    } yield e.text

The REPL shows that the resulting variable strings has the following type and data:

    strings: Seq[String] = ArrayBuffer(Is 2 > 1?, Why do you ask?)
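To complete the workaround, the sequence can be re-combined with mkString. The following sketch assumes Scala 2 with scala-xml available. The single-space separator and the trim call are choices made for this example, and joined is a name introduced here:

```scala
import scala.xml.Elem

val xml: Elem = <div><p>Is 2 > 1?</p><p>Why do you ask?</p></div>

// Collect the text of each <p> child, as shown above.
val strings = for {
  e <- xml.child
  if e.label == "p"
} yield e.text

// Trim each piece and join the pieces with a single space.
val joined = strings.map(_.trim).mkString(" ")

println(joined)  // Is 2 > 1? Why do you ask?
```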
In the XPath recipes in this chapter you’ll see how to accomplish some of the same tasks using the \ and \\ methods.
Example data sets and REPL memory errors
If you want to test these commands against large data sets, this URL maintains a nice collection of sample XML data:
- http://www.cs.washington.edu/research/xmldatasets/
The NASA data set is 23 MB, and causes the Scala REPL to crash with a Java heap space error:
    scala> val xml = scala.xml.XML.loadFile("nasa.xml")
    java.lang.OutOfMemoryError: Java heap space
    ...
To get around this problem, you can allocate more heap space when starting the REPL with this command:
$ scala -J-Xms256m -J-Xmx512m
or this command:
$ env JAVA_OPTS="-Xms256m -Xmx512m" scala