Apache Tajo™ - An open source big data warehouse system in Hadoop


The main goal of the Apache Tajo™ project is to build an advanced open source data warehouse system in Hadoop for processing web-scale data sets.

Features

Interactive and Batch Queries
- Fully distributed SQL query processing on large data sets stored in HDFS and other data sources
- Very low response time (from about 100 ms) for simple queries (e.g., a plain aggregation or a small-large join) on reasonably sized data
Long-running query support
- Fault tolerance support that avoids restarting a query when some tasks fail
- Dynamic scheduling support that handles straggling and heterogeneous cluster nodes
Query Optimization
- Cost-based optimization for bushy join trees
- Progressive query optimization for reoptimizing running queries
ETL
- ETL features that transform one data format to another data format
- Various file formats support, such as CSV, RCFile, and RowFile (a row store file)
Extensibility
- User-defined function support
- Scanner/Appender interface for custom file formats
Compatibility
- ANSI/ISO SQL standard compliance and PostgreSQL compliance for non-standard parts
- HiveQL mode support
- Access to tables in HCatalog and Hive MetaStore
- JDBC driver support
Ease of use
- Interactive shell to allow users to submit SQL queries to Tajo clusters
- Backup/Restore utility
- Asynchronous/Synchronous Java API to enable clients to submit SQL queries to Tajo clusters

Ref: http://tajo.apache.org/

The key difference between Tajo and Impala is the design goal. To increase the performance of query processing, Impala adopts an approach in which main memory is utilized as much as possible and intermediate data are transferred via streaming. If a query requires too much memory, Impala cannot process the query. Thus, Impala says that it is not an alternative to Hive.

However, Tajo uses query optimization that considers user queries, the characteristics of the data, the status of the cluster, and so on. Thus, Tajo can process a query with Impala's algorithm, Hive's algorithm, or any other algorithm. For example, Tajo can process a join query using either the repartition join or the merge join. Intermediate results can be materialized to disk or kept in memory. Since Tajo builds a query plan that takes the above factors into account, it can always process user queries. So, we can say that Tajo can be an alternative to Hive.

Tajo can outperform Hive for most queries. The key reason is that Tajo uses its own query engine while Hive uses MapReduce. This restricts Hive to MapReduce-based algorithms, whereas Tajo can use a more optimized algorithm.

A sort query is a good example. Hive supports only hash partitioning. Thus, each node sorts data locally in the map phase and *ONE NODE* must perform the global sort in the reduce phase.

However, Tajo supports a sort algorithm using range partitioning. In the first phase, each node sorts data locally as in Hive, but the intermediate data are partitioned by the range of the sort key. In the second phase, each node performs a local sort to get the final results. Since the intermediate data are partitioned by the range of the sort key, the per-node results, taken in range order, form a correctly sorted final result.
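The two-phase sort described above can be illustrated with a small single-process simulation. This is only a sketch of the general range-partitioning idea, not Tajo's actual implementation: the function names, the sampling strategy used to choose range boundaries, and the use of plain Python lists to stand in for cluster nodes are all assumptions made for illustration.

```python
import bisect


def local_sort(chunks):
    """Phase 1 (first half): every simulated node sorts its own chunk."""
    return [sorted(chunk) for chunk in chunks]


def choose_boundaries(sorted_chunks, num_partitions, samples_per_chunk=8):
    """Pick num_partitions - 1 split keys from a sample of each node's data.

    Sampling is an assumption of this sketch; a real engine may derive
    the ranges differently.
    """
    sample = []
    for chunk in sorted_chunks:
        if chunk:
            step = max(1, len(chunk) // samples_per_chunk)
            sample.extend(chunk[::step])
    sample.sort()
    if not sample:
        return []
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(sorted_chunks, boundaries):
    """Phase 1 (second half): route each key to the partition owning its range."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for chunk in sorted_chunks:
        for key in chunk:
            # Partition i receives keys k with boundaries[i-1] <= k < boundaries[i].
            partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions


def range_partitioned_sort(chunks, num_partitions):
    sorted_chunks = local_sort(chunks)
    boundaries = choose_boundaries(sorted_chunks, num_partitions)
    partitions = range_partition(sorted_chunks, boundaries)
    # Phase 2: each node sorts only its own key range; no single node
    # has to sort the whole data set.
    sorted_partitions = [sorted(part) for part in partitions]
    # Ranges do not overlap, so reading partitions in order is globally sorted.
    return [key for part in sorted_partitions for key in part]


if __name__ == "__main__":
    node_data = [[9, 3, 7], [1, 8, 2], [6, 4, 5]]
    print(range_partitioned_sort(node_data, num_partitions=3))
```

The point of the design is visible in the last step: because every key in one partition is smaller than every key in the next, the final result needs no merge on a single node, in contrast to the single-reducer global sort described for hash partitioning.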

Ref: http://mail-archives.apache.org/mod_mbox/tajo-dev/201305.mbox/%3CCACZfFK6PNE+AuNX6CQ0WD784ZxUavEykEKa-rWFMXp0xdyAHmg@mail.gmail.com%3E

0 0