Matching the Content of XML Tags in Python
Source: Internet  Editor: 程序博客网  Date: 2024/06/06 01:28
Some blog posts recommend import xml.dom.minidom for processing XML files, but that package only accepts strictly well-formed XML, and files like the one below, which use custom tags such as <document>, could not be parsed with it. Note that minidom accepts arbitrary tag names, so the tag itself is unlikely to be the obstacle; a more probable cause is content that is not well-formed, for example an unescaped & somewhere in the text. A sample file:

<document id="0a86e2ad2b1828b0250b305984113e7a-6" name="Offa_of_Mercia" cat="25"><text>Offa was frequently in conflict with the various Welsh kingdoms. There was a battle between the Mercians and the Welsh at Hereford in 760, and Offa is recorded as campaigning against the Welsh in 778, 784 and 796 in the tenth-century ''Annales Cambriae''.Annales Cambriae, ''sub anno'' 760, 778 and 784.Stenton, ''Anglo-Saxon England'', pp. 214–215.The best known relic associated with Offa's time is Offa's Dyke, a great earthen barrier that runs approximately along the border between England and Wales. It is mentioned by the monk Asser in his biography of Alfred the Great: "a certain vigorous king called Offa...had a great dyke built between Wales and Mercia from sea to sea".Asser, ''Alfred the Great'', ch. 14, p. 71. The dyke has not been dated by archaeological methods, but most historians find no reason to doubt Asser's attribution.Margaret Worthington, "Offa's Dyke", in Lapidge, ''Blackwell Encyclopaedia of Anglo-Saxon England'', p. 341. Early names for the dyke in both Welsh and English also support the attribution to Offa.Stenton, ''Anglo-Saxon England'', p. 213. Despite Asser's comment that the dyke ran "from sea to sea", it is now thought that the original structure only covered about two-thirds of the length of the border: in the north it ends near Llanfynydd, less than five miles (8 km) from the coast, while in the south it stops at Rushock Hill, near Kington in Herefordshire, less than fifty miles (80 km) from the Bristol Channel. The total length of this section is about sixty-four miles (103 km).
Other earthworks exist along the Welsh border, of which Wat's Dyke is one of the largest, but it is not possible to date them relative to each other and so it cannot be determined whether Offa's Dyke was a copy of or the inspiration for Wat's Dyke.Margaret Worthington, "Wat's Dyke", in Lapidge et al., ''Blackwell Encyclopaedia of Anglo-Saxon England'', p. 468.The construction of the dyke suggests that it was built to create an effective barrier and to command views into Wales. This implies that the Mercians who built it were free to choose the best location for the dyke. There are settlements to the west of the dyke that have names that imply they were English by the eighth century, so it may be that in choosing the location of the barrier the Mercians were consciously surrendering some territory to the native Britons.Stenton cites, for example, the village "Burlingjobb", in Powys, not far from the south end of the dyke, as having a name unlikely to have risen as late as the ninth century. Stenton, ''Anglo-Saxon England'', p. 214. Alternatively it may be that these settlements had already been retaken by the Welsh, implying a defensive role for the barrier.Patrick Wormald, "Offa's Dyke", in James Campbell et al., ''The Anglo-Saxons'', pp. 120–121. The effort and expense that must have gone into building the dyke are impressive, and suggest that the king who had it built (whether Offa or someone else) had considerable resources at his disposal.</text><imageset><image id="225ec96db2f4ac904fd9d6b37c82e447" name="Offa's_Dyke" sectnum="6#1">../img/225ec96db2f4ac904fd9d6b37c82e447.jpg</image></imageset></document>
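For files that are well-formed, a standard-library parser may already be enough, since custom tag names such as <document> are accepted. The following is only a minimal sketch using xml.etree.ElementTree; the SAMPLE string and the extract_text helper are hypothetical stand-ins and are not part of the original post. It also shows one plausible failure mode, an unescaped &, under the assumption that something similar occurs in the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample with the same shape as the dataset file above.
SAMPLE = (
    '<document id="demo-1" name="Offa_of_Mercia" cat="25">'
    '<text>Offa was frequently in conflict\nwith the various Welsh kingdoms.</text>'
    '<imageset><image id="img-1" name="Offas_Dyke">../img/img-1.jpg</image></imageset>'
    '</document>'
)


def extract_text(xml_string):
    """Return the content of the first <text> child, or None if parsing fails."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # Raised for input that is not well-formed, e.g. an unescaped "&".
        return None
    return root.findtext("text")


parsed = extract_text(SAMPLE)
broken = extract_text('<document><text>Tom & Jerry</text></document>')
print(parsed)
print(broken)
```

When extract_text returns None for a file, a regular-expression fallback is a reasonable next step.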
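The text above is a sample from the Wikipedia dataset, and the goal is to extract the content between <text> and </text>. A regular expression can do this. By default, "." in a regular expression does not match a newline character, so the script below removes all newlines before matching (passing re.DOTALL to the re functions is an alternative).

__author__ = 'Mandy'

import os
import re

dir_path = 'D:/wikipedia_dataset/texts/'
new_path = 'D:/process_xml/wikipedia_txt/'

files = os.listdir(dir_path)
for fi in files:
    # Python 3; encoding='utf-8' is an assumption about the dataset files
    with open(dir_path + fi, encoding='utf-8') as src:
        a = src.read()
    b = a.replace("\n", "")  # "." does not match newlines by default, so strip them first
    x = re.findall('<text>(.*)</text>', b)
    nam = fi.split('.xml')
    # write the extracted text to <name>.txt in the output directory
    with open(new_path + nam[0] + '.txt', 'w', encoding='utf-8') as f:
        f.writelines(x)
print('ok')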
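As an alternative to removing newlines before matching, the pattern can be compiled with re.DOTALL so that "." also matches newline characters. The sketch below makes the match non-greedy as well, so that several <text> blocks in one string are captured separately. The names TEXT_PATTERN and extract_text_blocks and the sample string are illustrative assumptions, not part of the original script.

```python
import re

# Non-greedy group so each <text>...</text> pair is captured on its own;
# re.DOTALL lets "." match newline characters too.
TEXT_PATTERN = re.compile(r"<text>(.*?)</text>", re.DOTALL)


def extract_text_blocks(raw):
    """Return the contents of every <text>...</text> pair found in raw."""
    return TEXT_PATTERN.findall(raw)


sample = "<document><text>first line\nsecond line</text><text>another</text></document>"
blocks = extract_text_blocks(sample)
print(blocks)
```

Unlike the newline-stripping approach, this keeps the original line breaks inside the extracted text.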