Python Tika guide
IMPORTANT NOTE: Thanks to Chris Wilson's work, it seems that a single command line:
pip install git+git://github.com/aptivate/python-tika.git
will do the job! Much better, isn't it? See http://blog.aptivate.org/2012/02/01/content-indexing-in-django-using-apache-tika/ for more info. The rest of this guide is now clearly deprecated; I keep it here just in case...
This document is a very short guide to building and using Tika (an all-purpose library for extracting content and metadata from documents) through a Python wrapper. The wrapper is built using JCC.
http://lucene.apache.org/tika/
http://lucene.apache.org/pylucene/jcc/index.html
So far, only the few features I am interested in have been tested.
Install
Install JCC: http://lucene.apache.org/pylucene/jcc/documentation/install.html
Install Tika: http://lucene.apache.org/tika/0.7/gettingstarted.html
Don't forget to run mvn install in the tika directory.
You will need the jar files from tika-parsers/target, tika-core/target and tika-app/target.
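The build command in the next step reads the jars from a jar/ directory, so it can help to gather them in one place first. Below is a minimal sketch of that step. It assumes Maven names each file <module>-<version>.jar inside <module>/target, which matches the file names used in the build command. The function name collect_jars and the default jar destination are illustrative and not part of Tika or JCC.

```python
import os
import shutil

# The three Tika modules whose jars are needed for the wrapper build.
MODULES = ("tika-parsers", "tika-core", "tika-app")


def collect_jars(tika_root, version, dest="jar"):
    """Copy the jars built by `mvn install` into `dest` and return their paths.

    Assumes each jar lives at <tika_root>/<module>/target/<module>-<version>.jar.
    """
    if not os.path.isdir(dest):
        os.makedirs(dest)
    copied = []
    for module in MODULES:
        name = "{0}-{1}.jar".format(module, version)
        src = os.path.join(tika_root, module, "target", name)
        if not os.path.isfile(src):
            # Fail early instead of letting the JCC build stop halfway.
            raise IOError("missing {0}; did `mvn install` succeed?".format(src))
        shutil.copy(src, dest)
        copied.append(os.path.join(dest, name))
    return copied
```

For example, collect_jars("/path/to/tika", "0.7") would fill ./jar with the three 0.7 jars, or raise an error naming the first one it cannot find.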
Build Tika Python wrapper with jcc:
> cd jcc/jcc
> sudo python __main__.py --jar jar/tika-parsers-0.7.jar --jar jar/tika-core-0.7.jar \
    java.io.File java.io.FileInputStream java.io.StringBufferInputStream \
    --package org.xml.sax.ContentHandler --package org.xml.sax.SAXException \
    --include jar/tika-app-0.7.jar --python tika --reserved asm --build --install
I have been told that the package line should be: "--package org.xml.sax". I don't know if it is because of a version change and I haven't tested it, but try it if you have errors with the command as it is.
1 Feb 2012: thanks to another fellow Tika user for his input:
I concur with the need to change the package to "--package org.xml.sax". Without this, I do not get "errors" during the compilation process, but jcc silently ignores the all-important AutoDetectParser.parse() method, and produces a wrapper with no such method in it, because it doesn't recognise the return type. This causes the example code that you gave to fail because of the missing method.
I also needed to add an OSGI library for Tika 1.0, which I happened to find on my system, so my final command was:
python ../jcc/jcc/__main__.py \
    --include /usr/share/java/org.eclipse.osgi.jar --jar tika-parsers-1.0.jar \
    --jar tika-core-1.0.jar \
    java.io.File java.io.FileInputStream \
    java.io.StringBufferInputStream \
    --package org.xml.sax \
    --include tika-app-1.0.jar \
    --python tika --version 1.0 --reserved asm
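Since the two commands above differ only in the Tika version, the --package value and an optional extra --include, the argument list can be generated instead of retyped. The following is a rough sketch that uses only the flags already shown in this guide. The helper name jcc_build_args and its parameters are made up for illustration, and appending --build --install to the 1.0 variant is an assumption carried over from the 0.7 command.

```python
def jcc_build_args(version, jar_dir="jar", sax_package="org.xml.sax",
                   extra_includes=()):
    """Return the arguments to pass to JCC's __main__.py for a Tika wrapper build."""
    args = []
    # Extra jars needed on the classpath only, e.g. the OSGI jar for Tika 1.0.
    for include in extra_includes:
        args += ["--include", include]
    args += [
        "--jar", "{0}/tika-parsers-{1}.jar".format(jar_dir, version),
        "--jar", "{0}/tika-core-{1}.jar".format(jar_dir, version),
        # Extra Java classes to wrap so they are usable from Python.
        "java.io.File", "java.io.FileInputStream",
        "java.io.StringBufferInputStream",
        # Wrapping the whole package keeps AutoDetectParser.parse() available.
        "--package", sax_package,
        "--include", "{0}/tika-app-{1}.jar".format(jar_dir, version),
        "--python", "tika",
        "--version", version,
        "--reserved", "asm",
        "--build", "--install",  # assumption: build and install in one run
    ]
    return args


if __name__ == "__main__":
    # Print the command line for Tika 1.0 with the OSGI jar mentioned above.
    args = jcc_build_args(
        "1.0", jar_dir=".",
        extra_includes=["/usr/share/java/org.eclipse.osgi.jar"])
    print("python ../jcc/jcc/__main__.py " + " ".join(args))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess with the path to JCC's __main__.py prepended.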
Usage example
In a python console:
# Setup module and virtual machine
import tika
tika.initVM()

# The all purpose parser from Tika (html, pdf, open documents, etc...)
parser = tika.AutoDetectParser()

# Create input from a small fake html code
# Alternatively you can use: input = tika.FileInputStream(tika.File("/path/to/example"))
input = tika.StringBufferInputStream("<html><title>My title</title><body>My body</body></html>")

# Create handler for content, metadata and context
content = tika.BodyContentHandler()
metadata = tika.Metadata()
context = tika.ParseContext()

# Parse the data and display result
parser.parse(input,content,metadata,context)
content.toString()
> u'My body'
metadata.toString()
> u'title=My title Content-Encoding=UTF-8 Content-Type=text/html '
metadata.get('title')
> u'My title'
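For repeated use, the console session above can be folded into a small function. This is only a sketch: the name extract is hypothetical, and it takes the wrapper module as an argument so it can be tried without the JCC build at hand. It assumes tika.initVM() has already been called, and that the wrapper exposes close() on FileInputStream (a standard java.io method, but untested here).

```python
def extract(tika, path):
    """Parse the file at `path` with Tika and return (body_text, title).

    `tika` is the JCC-built wrapper module, after tika.initVM() has been called.
    """
    parser = tika.AutoDetectParser()
    stream = tika.FileInputStream(tika.File(path))
    content = tika.BodyContentHandler()
    metadata = tika.Metadata()
    context = tika.ParseContext()
    try:
        parser.parse(stream, content, metadata, context)
    finally:
        # Assumption: close() is exposed by the generated wrapper.
        stream.close()
    return content.toString(), metadata.get("title")


# Typical use, once the wrapper is installed:
#   import tika
#   tika.initVM()
#   text, title = extract(tika, "/path/to/example")
```

Passing the module in rather than importing it inside the function is a convenience for testing. Importing tika at the top of your own script works just as well.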