Running Apache Tika in Server Mode


We use Apache Tika to extract plain text from PDF files. Tika does a good job, except that it takes quite a while to return results. As an example, extracting the text from a 234-slide PDF presentation takes about 3.5 seconds on my laptop. That becomes a performance problem if you want to extract the text not of a single file but of, say, 12,000 files: processed one after another at that rate, they would take more than 11 hours.

Here is the command I used to measure how long it takes to extract the content of a single document (note that -h selects HTML output; -t selects plain text):

$ time java -jar tika-app-1.3.jar -h some.pdf
[...]
real    0m2.935s
user    0m4.640s
sys     0m0.178s

Now Tika can also be run in a server mode. Here is the command to start Tika as a server:

$ java -jar tika-app-1.3.jar -t --server --port 12345

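You can now pass your (PDF) documents to that server (e.g. with NetCat) and get the same results as before. As you can see, things become a lot faster, most likely because the JVM startup and Tika initialization costs are paid once when the server starts rather than for every document: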

$ time nc 127.0.0.1 12345 < some.pdf
real    0m0.386s
user    0m0.003s
sys     0m0.015s

So if you have the same performance problems with Tika as we had, this might be a solution!


Tika supports two "server" modes. The simpler, original one is the --server flag of Tika-App. The more functional but more recent one is the JAX-RS (JSR-311) server component, which ships as an additional jar.

The Tika-App Network Server is very simple to use. Start Tika-App with the --server flag and a --port ### flag telling it which port to listen on. Then connect to that port and send it a single file. You'll get back the HTML version. NetCat works well for this: something like java -jar tika-app.jar --server --port 12345 followed by nc 127.0.0.1 12345 < MyFileToExtract will get you back the HTML.
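If NetCat is not available, the same exchange can be sketched in a few lines of Python. This is only an illustration of the connect, send, read pattern described above, not part of Tika: the function name extract_via_tika_socket and its defaults are made up for this example, and it assumes the server waits for end-of-input before replying and closes the connection when it is done.

```python
import socket


def extract_via_tika_socket(data: bytes,
                            host: str = "127.0.0.1",
                            port: int = 12345,
                            timeout: float = 60.0) -> bytes:
    """Send one document to a Tika-App --server socket and return the raw reply.

    Hypothetical helper for illustration; host, port and timeout are assumptions.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Push the whole document to the server.
        sock.sendall(data)
        # Half-close the connection so the server sees end-of-input
        # (assumption: the server only replies once the input is complete).
        sock.shutdown(socket.SHUT_WR)
        # Read until the server closes its side of the connection.
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

With a server started as shown above, extract_via_tika_socket(open("some.pdf", "rb").read()) would return the extracted output as bytes. Each call opens a fresh connection, since the server handles a single file per connection.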

The JAX-RS (JSR-311) server component supports a few different URLs for things like metadata, plain text, etc. You start the server with java -jar tika-server.jar, then make HTTP PUT calls to the appropriate URL with your input document, and you'll get the resource back. There are loads of details and examples (including using curl for testing) on the wiki page.
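For callers outside of curl, a PUT request like the one described above might look as follows in Python. Treat this as a rough sketch under stated assumptions: the helper name tika_server_put is invented here, and the base URL http://localhost:9998, the tika resource path, and the Accept header are assumptions that should be checked against the server documentation for your Tika version.

```python
import urllib.request


def tika_server_put(data: bytes,
                    resource: str = "tika",
                    base_url: str = "http://localhost:9998",
                    accept: str = "text/plain",
                    timeout: float = 60.0) -> str:
    """PUT a document to a Tika server resource and return the response body.

    Hypothetical helper; base_url, resource and accept are assumed defaults.
    """
    request = urllib.request.Request(
        f"{base_url}/{resource}",
        data=data,
        method="PUT",
        headers={
            "Accept": accept,
            # Set explicitly: urllib would otherwise label the body as
            # form-encoded, which is wrong for a binary document.
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Decode using the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

A call such as tika_server_put(open("some.pdf", "rb").read()) would then return the extracted text, while passing a different resource argument would target one of the other URLs the server exposes.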

The Tika App Network Server is fairly simple, only supports one mode (extract to HTML), and is generally used for testing, demos, prototyping, etc. The Tika JAXRS Server is a fully RESTful service which talks HTTP and exposes a wide range of Tika's modes. It's the generally recommended way these days to interface with Tika over the network and/or from non-Java stacks.