Running Apache Tika in Server Mode
We are using Apache Tika for plain-text extraction from PDF files. Tika does a good job here, except that it takes quite long to return results. As an example, extracting the text from a 234-slide PDF presentation takes about 3.5 seconds on my laptop. This becomes a performance problem if you want to extract the text not of a single file but of, say, 12,000 files: at that rate, the whole batch would take more than 11 hours.
Here is the command I used to measure how long it takes to get the text of a document:
$ time java -jar tika-app-1.3.jar -h some.pdf
[...]
real 0m2.935s
user 0m4.640s
sys 0m0.178s
Now Tika can also be run in a server mode. Here is the command to start Tika as a server:
$ java -jar tika-app-1.3.jar -t --server --port 12345
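You can now pass your PDF documents to that server (e.g. with NetCat) and get the same results as before. As you can see, things become a lot faster, most likely because the JVM startup and Tika initialization cost is paid only once instead of once per file: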
$ time nc 127.0.0.1 12345 < some.pdf
real 0m0.386s
user 0m0.003s
sys 0m0.015s
So if you have the same performance problems with Tika as we had, this might be a solution!
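If you would rather talk to the server from a script than from NetCat, the same exchange can be sketched in Python. This is only a minimal sketch, not part of Tika: the function name `extract_via_tika_socket` is made up for illustration, and it assumes the server started with --server reads the document until the client closes its sending side (which is what nc does at end of input) and then replies in UTF-8.

```python
import socket


def extract_via_tika_socket(path, host="127.0.0.1", port=12345, timeout=60.0):
    """Send one file to a Tika-App --server port and return the reply as text.

    Hypothetical helper: it mimics `nc host port < path`.
    """
    with open(path, "rb") as f:
        payload = f.read()

    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        # Half-close the connection to signal end of input (assumed to be
        # what the server waits for before it starts answering).
        sock.shutdown(socket.SHUT_WR)

        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # server closed the connection: reply is complete
                break
            chunks.append(chunk)

    # Assumption: the server replies in UTF-8; undecodable bytes are replaced.
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With the server from above running on port 12345, `extract_via_tika_socket("some.pdf")` should return what the nc call prints, and calling it in a loop over a directory avoids starting a new JVM for each of the 12,000 files.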
Tika supports two "server" modes. The simpler, original one is the --server flag of Tika-App. The more functional but more recent one is the JAX-RS (JSR-311) server component, which is an additional jar.
The Tika-App Network Server is very simple to use. Simply start Tika-App with the --server flag and a --port ### flag telling it what port to listen on. Then connect to that port and send it a single file. You'll get back the HTML version. NetCat works well for this. Something like
$ java -jar tika-app.jar --server --port 12345
followed by
$ nc 127.0.0.1 12345 < MyFileToExtract
will get you back the HTML.
The JAX-RS JSR-311 server component supports a few different URLs, for things like metadata, plain text, etc. You start the server with
$ java -jar tika-server.jar
then make HTTP PUT calls to the appropriate URL with your input document, and you'll get the resource back. There are loads of details and examples (including using curl for testing) on the wiki page.
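As a rough illustration of such a PUT call from a non-Java stack, here is a minimal Python sketch using only the standard library. The function name `extract_via_tika_rest` is hypothetical, and the defaults are assumptions you should check against the wiki page for your Tika version: that the server listens on port 9998 and that the `/tika` resource returns plain text when asked with `Accept: text/plain`.

```python
import urllib.request


def extract_via_tika_rest(path, base_url="http://localhost:9998", timeout=60.0):
    """PUT one file to a Tika JAX-RS server and return the extracted text.

    Hypothetical helper; the /tika resource and port 9998 are assumptions.
    """
    with open(path, "rb") as f:
        body = f.read()

    request = urllib.request.Request(
        base_url.rstrip("/") + "/tika",  # assumed text-extraction resource
        data=body,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Override urllib's form-encoded default so the server is not
            # told the document is form data.
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the response body is UTF-8 text.
        return response.read().decode("utf-8", errors="replace")
```

Passing `base_url` explicitly keeps the sketch usable when the server runs on another host or port.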
The Tika App Network Server is fairly simple, only supports one mode (extract to HTML), and is generally used for testing, demos, and prototyping. The Tika JAXRS Server is a fully RESTful service which talks HTTP and exposes a wide range of Tika's modes. It's the generally recommended way these days to interface with Tika over the network and/or from non-Java stacks.