Apache Tajo™ - An open source big data warehouse system in Hadoop
Source: Internet  Editor: 程序博客网  Date: 2024/05/16 19:23
The main goal of the Apache Tajo™ project is to build an advanced open source data warehouse system in Hadoop for processing web-scale data sets.
Features
- Interactive and Batch Queries
  - Fully distributed SQL query processing on large data sets stored in HDFS and other data sources
  - Very low response time (from about 100 ms) for simple queries (e.g., a plain aggregation or a small-large join) on reasonably sized data
  - Long-running query support
  - Fault tolerance support that avoids restarting a query when some tasks fail
  - Dynamic scheduling support that handles straggling and heterogeneous cluster nodes
- Query Optimization
  - Cost-based optimization for bushy join trees
  - Progressive query optimization for reoptimizing running queries
- ETL
  - ETL features that transform one data format into another
  - Support for various file formats, such as CSV, RCFile, and RowFile (a row store file)
- Extensibility
  - User-defined function support
  - Scanner/Appender interface for custom file formats
- Compatibility
  - ANSI/ISO SQL standard compliance, and PostgreSQL compliance for non-standard parts
  - HiveQL mode support
  - Access to tables in HCatalog and the Hive MetaStore
  - JDBC driver support
- Ease of Use
  - Interactive shell that allows users to submit SQL queries to Tajo clusters
  - Backup/Restore utility
  - Asynchronous/Synchronous Java API that enables clients to submit SQL queries to Tajo clusters
Ref: http://tajo.apache.org/
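To make the "bushy join trees" item more concrete, here is a minimal Python sketch of cost-based join ordering. It is not Tajo's optimizer: the table names, row counts, selectivities, and the cost model (sum of estimated intermediate result sizes) are all made-up assumptions for illustration. It only shows why allowing bushy trees, where both inputs of a join may themselves be joins, can find a cheaper plan than left-deep trees alone.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical statistics: four tables joined in a chain a-b-c-d.
ROWS = {"a": 1000, "b": 1000, "c": 1000, "d": 1000}
# Assumed selectivity of each join predicate (pairs without one are cross products).
EDGES = {
    frozenset("ab"): 1e-4,
    frozenset("bc"): 1e-2,
    frozenset("cd"): 1e-4,
}


def cardinality(tables):
    """Estimated number of rows produced by joining all tables in the set."""
    size = 1.0
    for t in tables:
        size *= ROWS[t]
    for edge, selectivity in EDGES.items():
        if edge <= tables:  # predicate applies once both of its tables are joined
            size *= selectivity
    return size


@lru_cache(maxsize=None)
def best_plan(tables, bushy=True):
    """Return (cost, plan) for joining the frozenset `tables`.

    Cost is the sum of the estimated sizes of every join output.
    With bushy=False, the right input must be a single base table (left-deep).
    """
    if len(tables) == 1:
        (t,) = tables
        return 0.0, t
    best = None
    members = sorted(tables)
    for k in range(1, len(members)):
        for left in combinations(members, k):
            left = frozenset(left)
            right = tables - left
            if not bushy and len(right) != 1:
                continue
            left_cost, left_plan = best_plan(left, bushy)
            right_cost, right_plan = best_plan(right, bushy)
            cost = left_cost + right_cost + cardinality(tables)
            if best is None or cost < best[0]:
                best = (cost, (left_plan, right_plan))
    return best


if __name__ == "__main__":
    all_tables = frozenset(ROWS)
    print("bushy:    ", best_plan(all_tables, True))
    print("left-deep:", best_plan(all_tables, False))
```

Under these assumed numbers, the cheapest bushy plan joins a with b and c with d first and then combines the two small results, which costs less than any left-deep order. A real optimizer would use richer statistics and a more detailed cost model.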
The key difference between Tajo and Impala is the design goal. To increase query processing performance, Impala adopts an approach in which main memory is utilized as much as possible and intermediate data are transferred via streaming. If a query requires too much memory, Impala cannot process it. Thus, Impala says that it is not an alternative to Hive.

Tajo, however, uses query optimization that considers user queries, the characteristics of the data, the status of the cluster, and so on. Thus, Tajo can process a query with Impala's algorithm, Hive's algorithm, or any other algorithm. For example, Tajo can process a join query using either the repartition join or the merge join. Intermediate results can be materialized to disk or kept in memory. Since Tajo builds a query plan that considers the factors mentioned above, it can always process user queries. So we can say that Tajo can be an alternative to Hive.

Tajo can outperform Hive for most queries. The key reason is that Tajo uses its own query engine while Hive uses MapReduce. This restricts Hive to MapReduce-based algorithms, whereas Tajo can use a more optimized algorithm.

A sort query is a good example. Hive supports only hash partitioning. Thus, each node sorts data locally in the map phase, and *ONE NODE* must perform the global sort in the reduce phase. Tajo, however, supports a sort algorithm using range partitioning. In the first phase, each node sorts data locally as in Hive, but the intermediate data are partitioned by the range of the sort key. In the second phase, each node performs a local sort to get the final results. Since the intermediate data are partitioned by the range of the sort key, reading the partitions in range order yields correctly sorted final results.
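The two sort strategies described above can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not Tajo or Hive code: each inner list stands in for the data held by one node, and the function names and the sampling rule used to pick range boundaries are hypothetical.

```python
import bisect
import heapq


def single_reducer_sort(node_data):
    """Local sort on every node, then one node merges all sorted runs."""
    local_runs = [sorted(chunk) for chunk in node_data]
    return list(heapq.merge(*local_runs))


def choose_boundaries(node_data, num_partitions):
    """Pick up to num_partitions - 1 split keys from a sample of the data."""
    sample = sorted(key for chunk in node_data for key in chunk[::2])
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def range_partition_sort(node_data, num_partitions):
    """Two-phase sort using range partitioning on the sort key."""
    boundaries = choose_boundaries(node_data, num_partitions)
    partitions = [[] for _ in range(num_partitions)]

    # Phase 1: each node sorts locally and routes every key to the
    # partition whose key range contains it.
    for chunk in node_data:
        for key in sorted(chunk):
            partitions[bisect.bisect_right(boundaries, key)].append(key)

    # Phase 2: each partition is sorted independently (in a cluster,
    # on a different node and in parallel).
    sorted_parts = [sorted(part) for part in partitions]

    # Every key in partition i is <= every key in partition i + 1, so
    # concatenating the partitions in order gives the global order.
    return [key for part in sorted_parts for key in part]


if __name__ == "__main__":
    nodes = [[5, 1, 9], [3, 8, 2], [7, 4, 6, 0]]
    print(single_reducer_sort(nodes))
    print(range_partition_sort(nodes, 3))
```

Both functions return the same ordering; the difference is that the second phase of the range-partitioned version has no single node that must see all the data.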
Ref: http://mail-archives.apache.org/mod_mbox/tajo-dev/201305.mbox/%3CCACZfFK6PNE+AuNX6CQ0WD784ZxUavEykEKa-rWFMXp0xdyAHmg@mail.gmail.com%3E