Matching the Content of XML Tags with Python

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 01:28
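For processing XML files, some blog posts recommend import xml.dom.minidom. That package requires strictly well-formed XML, however, and it could not parse files such as the one below, whose root tag is <document>. The tag name itself is probably not the cause, since minidom accepts arbitrary tag names; a more likely cause is that the files are not well-formed, for example because of an unescaped & in the text or more than one top-level element.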
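To see where minidom actually gives up, here is a minimal sketch. The two XML strings are made-up stand-ins for dataset files, and `text_of` is a hypothetical helper name; the assumption being tested is that a custom tag such as <document> parses fine while an unescaped & does not. Whether that is what went wrong with the real dataset is not confirmed by the post.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up samples: the first is well-formed, the second contains a bare '&'.
well_formed = '<document id="1" name="Offa_of_Mercia"><text>Offa was king.</text></document>'
malformed = '<document id="2"><text>Mercia & Wales</text></document>'


def text_of(xml_string):
    """Return the content of the first <text> element, or None if parsing fails."""
    try:
        dom = minidom.parseString(xml_string)
    except ExpatError:
        # Raised by the underlying expat parser for XML that is not well-formed.
        return None
    return dom.getElementsByTagName("text")[0].firstChild.data


ok_result = text_of(well_formed)   # custom tag names are not a problem
bad_result = text_of(malformed)    # the unescaped '&' makes parsing fail
print(ok_result, bad_result)
```

If the real files fail for a reason like this, a regular expression is a pragmatic fallback, because it does not care whether the file is well-formed.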
<document id="0a86e2ad2b1828b0250b305984113e7a-6" name="Offa_of_Mercia" cat="25"><text>Offa was frequently in conflict with the various Welsh kingdoms. There was a battle between the Mercians and the Welsh at Hereford in 760, and Offa is recorded as campaigning against the Welsh in 778, 784 and 796 in the tenth-century ''Annales Cambriae''.Annales Cambriae, ''sub anno'' 760, 778 and 784.Stenton, ''Anglo-Saxon England'', pp. 214–215.The best known relic associated with Offa's time is Offa's Dyke, a great earthen barrier that runs approximately along the border between England and Wales. It is mentioned by the monk Asser in his biography of Alfred the Great: "a certain vigorous king called Offa...had a great dyke built between Wales and Mercia from sea to sea".Asser, ''Alfred the Great'', ch. 14, p. 71.  The dyke has not been dated by archaeological methods, but most historians find no reason to doubt Asser's attribution.Margaret Worthington, "Offa's Dyke", in Lapidge, ''Blackwell Encyclopaedia of Anglo-Saxon England'', p. 341.  Early names for the dyke in both Welsh and English also support the attribution to Offa.Stenton, ''Anglo-Saxon England'', p. 213.  Despite Asser's comment that the dyke ran "from sea to sea", it is now thought that the original structure only covered about two-thirds of the length of the border: in the north it ends near Llanfynydd, less than five miles (8 km) from the coast, while in the south it stops at Rushock Hill, near Kington in Herefordshire, less than fifty miles (80 km) from the Bristol Channel. The total length of this section is about sixty-four miles (103 km).  
Other earthworks exist along the Welsh border, of which Wat's Dyke is one of the largest, but it is not possible to date them relative to each other and so it cannot be determined whether Offa's Dyke was a copy of or the inspiration for Wat's Dyke.Margaret Worthington, "Wat's Dyke", in Lapidge et al., ''Blackwell Encyclopaedia of Anglo-Saxon England'', p. 468.The construction of the dyke suggests that it was built to create an effective barrier and to command views into Wales. This implies that the Mercians who built it were free to choose the best location for the dyke.  There are settlements to the west of the dyke that have names that imply they were English by the eighth century, so it may be that in choosing the location of the barrier the Mercians were consciously surrendering some territory to the native Britons.Stenton cites, for example, the village "Burlingjobb", in Powys, not far from the south end of the dyke, as having a name unlikely to have risen as late as the ninth century. Stenton, ''Anglo-Saxon England'', p. 214.  Alternatively it may be that these settlements had already been retaken by the Welsh, implying a defensive role for the barrier.Patrick Wormald, "Offa's Dyke", in James Campbell et al., ''The Anglo-Saxons'', pp. 120–121. The effort and expense that must have gone into building the dyke are impressive, and suggest that the king who had it built (whether Offa or someone else) had considerable resources at his disposal.</text><imageset><image id="225ec96db2f4ac904fd9d6b37c82e447" name="Offa's_Dyke" sectnum="6#1">../img/225ec96db2f4ac904fd9d6b37c82e447.jpg</image></imageset></document>

For a passage like the one above, taken from the Wikipedia dataset, the goal is to extract the content between <text> and </text>.
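A regular expression can do this. By default, however, the dot in a pattern does not match a newline, so a pattern like <text>(.*)</text> cannot span several lines. The approach taken here is to remove all newlines before matching; passing the re.DOTALL flag would also work.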
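The newline behaviour can be checked on a tiny made-up string. This sketch compares the default matching with re.DOTALL, and uses the non-greedy form .*? so that a match stops at the first closing tag (an addition for illustration; the original script uses the greedy .*).

```python
import re

# Made-up sample with a newline inside the <text> element.
sample = "<document><text>first line\nsecond line</text></document>"

# Default behaviour: '.' does not match '\n', so the pattern finds nothing.
no_flag = re.findall(r"<text>(.*)</text>", sample)

# With re.DOTALL, '.' also matches '\n'; '.*?' stops at the first </text>.
with_flag = re.findall(r"<text>(.*?)</text>", sample, flags=re.DOTALL)

# The approach from the post: delete the newlines, then match without flags.
stripped = re.findall(r"<text>(.*)</text>", sample.replace("\n", ""))

print(no_flag, with_flag, stripped)
```

Deleting the newlines joins adjacent lines without a space, while re.DOTALL keeps the original line breaks in the extracted text. Which one is preferable depends on what the text is used for afterwards.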
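__author__ = 'Mandy'
import os
import re

dir_path = 'D:/wikipedia_dataset/texts/'
new_path = 'D:/process_xml/wikipedia_txt/'

for fi in os.listdir(dir_path):
    # Read the whole file into one string (assumption: the files are UTF-8)
    with open(dir_path + fi, encoding='utf-8') as src:
        a = src.read()
    b = a.replace("\n", "")  # '.' does not match newlines by default, so remove them first
    x = re.findall('<text>(.*)</text>', b)
    nam = fi.split('.xml')
    # file() was removed in Python 3; open() with a context manager closes the file for us
    with open(new_path + nam[0] + '.txt', 'w', encoding='utf-8') as f:
        f.writelines(x)
print('ok')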
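The script above hard-codes Windows paths. As a more portable variant, the sketch below wraps the same idea in a function and tries it out in a temporary directory. `extract_texts` and `TEXT_RE` are hypothetical names, UTF-8 input is an assumption, and unlike the original script it keeps the newlines inside the extracted text by using re.DOTALL.

```python
import re
import tempfile
from pathlib import Path

# Non-greedy match between <text> and </text>; DOTALL lets '.' cross newlines.
TEXT_RE = re.compile(r"<text>(.*?)</text>", re.DOTALL)


def extract_texts(src_dir, dst_dir):
    """Write the <text> content of every .xml file in src_dir to a .txt file in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for xml_path in sorted(src_dir.glob("*.xml")):
        raw = xml_path.read_text(encoding="utf-8")  # assumption: UTF-8 input
        out_path = dst_dir / (xml_path.stem + ".txt")
        out_path.write_text("".join(TEXT_RE.findall(raw)), encoding="utf-8")
        written.append(out_path.name)
    return written


# Small self-check on a made-up file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "texts"
    src.mkdir()
    (src / "a.xml").write_text(
        "<document><text>line one\nline two</text></document>", encoding="utf-8"
    )
    names = extract_texts(src, Path(tmp) / "out")
    content = (Path(tmp) / "out" / "a.txt").read_text(encoding="utf-8")

print(names, repr(content))
```

For the dataset in the post, the call would presumably be extract_texts('D:/wikipedia_dataset/texts', 'D:/process_xml/wikipedia_txt').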



