Python Tika guide



IMPORTANT NOTE: Thanks to Chris Wilson's work, it seems that a single command, pip install git+git://github.com/aptivate/python-tika.git, will do the work! Much better, isn't it? See http://blog.aptivate.org/2012/02/01/content-indexing-in-django-using-apache-tika/ for more info. The following is now clearly deprecated; I keep it here just in case...

This document is a very short guide to building and using Tika (an all-purpose library for extracting document content and metadata) through a Python wrapper. The wrapper is built using JCC.

http://lucene.apache.org/tika/

http://lucene.apache.org/pylucene/jcc/index.html

So far, only the few features I am interested in have been tested.

Install

Install JCC: http://lucene.apache.org/pylucene/jcc/documentation/install.html

Install Tika: http://lucene.apache.org/tika/0.7/gettingstarted.html

Don't forget to run mvn install in the Tika directory.
You will need the jar files from tika-parsers/target, tika-core/target and tika-app/target.
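
Gathering those jar files by hand is easy to get wrong, so here is a minimal sketch that copies them into a single directory. The function name collect_jars and the destination directory are my own assumptions (the build command only needs the jars to be reachable under some path such as jar/); the three module names come from the note above. It copies every .jar it finds in each target directory, so you may end up with a few extra files.

```python
import glob
import os
import shutil


def collect_jars(tika_dir, dest_dir,
                 modules=("tika-parsers", "tika-core", "tika-app")):
    """Copy the jars found in <tika_dir>/<module>/target into dest_dir.

    collect_jars is a hypothetical helper, not part of Tika or JCC.
    Returns the sorted list of copied file names.
    """
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    copied = []
    for module in modules:
        pattern = os.path.join(tika_dir, module, "target", "*.jar")
        for jar in glob.glob(pattern):
            shutil.copy(jar, dest_dir)
            copied.append(os.path.basename(jar))
    return sorted(copied)
```

For example, collect_jars("/path/to/tika", "jar") run from the directory where you launch the build would fill the jar/ directory used in the command below.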

Build Tika Python wrapper with jcc:

> cd jcc/jcc
> sudo python __main__.py --jar jar/tika-parsers-0.7.jar --jar jar/tika-core-0.7.jar \
    java.io.File java.io.FileInputStream java.io.StringBufferInputStream \
    --package org.xml.sax.ContentHandler --package org.xml.sax.SAXException \
    --include jar/tika-app-0.7.jar --python tika --reserved asm --build --install

I have been told that the package line should be: "--package org.xml.sax". I don't know if it is because of a version change and I haven't tested it, but try it if you have errors with the command as it is.
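
If you want to try both variants of the package option, it may help to assemble the build command from Python. The sketch below only reuses the options already shown in this guide; the function name jcc_command, its parameters and the jar/ layout are assumptions of mine, and I have not run it against every Tika version.

```python
def jcc_command(version, main_script="__main__.py", jar_dir="jar",
                package="org.xml.sax"):
    """Return the JCC build command as an argument list.

    jcc_command is a hypothetical helper. Pass a different `package`
    value to try another variant of the --package option.
    """
    def jar(name):
        # e.g. jar/tika-core-0.7.jar
        return "%s/tika-%s-%s.jar" % (jar_dir, name, version)

    return [
        "python", main_script,
        "--jar", jar("parsers"),
        "--jar", jar("core"),
        "java.io.File", "java.io.FileInputStream",
        "java.io.StringBufferInputStream",
        "--package", package,
        "--include", jar("app"),
        "--python", "tika",
        "--reserved", "asm",
        "--build", "--install",
    ]
```

You can then run it with subprocess.call(jcc_command("0.7")) from the jcc/jcc directory (prefix the list with "sudo" if the install step needs it).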

1 Feb 2012: thanks to another fellow Tika user for his input:

I concur with the need to change the package to "--package org.xml.sax". Without this, I do not get "errors" during the compilation process, but jcc silently ignores the all-important AutoDetectParser.parse() method, and produces a wrapper with no such method in it, because it doesn't recognise the return type. This causes the example code that you gave to fail because of the missing method.

I also needed to add an OSGI library for Tika 1.0, which I happened to find on my system, so my final command was:

python ../jcc/jcc/__main__.py \
       --include /usr/share/java/org.eclipse.osgi.jar \
       --jar tika-parsers-1.0.jar \
       --jar tika-core-1.0.jar \
       java.io.File java.io.FileInputStream \
       java.io.StringBufferInputStream \
       --package org.xml.sax \
       --include tika-app-1.0.jar \
       --python tika --version 1.0 --reserved asm
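
Since the build can succeed while silently dropping a method, a quick check right after installing the wrapper may save some head-scratching. This is a minimal sketch: missing_methods is a hypothetical helper of mine, and it assumes the generated classes expose their wrapped Java methods as ordinary callable attributes, which I have not verified for every JCC version.

```python
def missing_methods(cls, names):
    """Return the names in `names` that `cls` does not expose as callables."""
    return [name for name in names
            if not callable(getattr(cls, name, None))]


# Intended use with the freshly built wrapper (not run here):
#   import tika
#   tika.initVM()
#   print(missing_methods(tika.AutoDetectParser, ["parse"]))
# An empty list would mean parse() made it into the wrapper.
```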

Usage example

In a Python console:

# Setup module and virtual machine
import tika
tika.initVM()

# The all purpose parser from Tika (html, pdf, open documents, etc...)
parser = tika.AutoDetectParser()

# Create input from a small fake html code
# Alternatively you can use: input = tika.FileInputStream(tika.File("/path/to/example"))
input = tika.StringBufferInputStream("<html><title>My title</title><body>My body</body></html>")

# Create handler for content, metadata and context
content = tika.BodyContentHandler()
metadata = tika.Metadata()
context = tika.ParseContext()

# Parse the data and display result
parser.parse(input,content,metadata,context)
content.toString()
> u'My body'
metadata.toString()
> u'title=My title Content-Encoding=UTF-8 Content-Type=text/html '
metadata.get('title')
> u'My title'
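
For files on disk, the FileInputStream alternative mentioned in the comment above can be wrapped in a small helper. This is only a sketch under a few assumptions: extract_file is a hypothetical name, the wrapper module is passed in as an argument (my own choice, so the helper does not depend on how you import it), tika.initVM() has already been called, and close() is the standard java.io.FileInputStream method as exposed by the wrapper.

```python
def extract_file(tika, path):
    """Parse the file at `path` using the JCC-built wrapper module `tika`.

    Returns (body_text, metadata_string). Assumes tika.initVM() was
    called beforehand. extract_file is a hypothetical helper.
    """
    stream = tika.FileInputStream(tika.File(path))
    content = tika.BodyContentHandler()
    metadata = tika.Metadata()
    context = tika.ParseContext()
    try:
        # Same call as in the console session, on a file-backed stream
        tika.AutoDetectParser().parse(stream, content, metadata, context)
    finally:
        # Release the file handle even if parsing raises
        stream.close()
    return content.toString(), metadata.toString()
```

In a console you would call it as: import tika; tika.initVM(); text, meta = extract_file(tika, "/path/to/example").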