Impala and Shark Benchmark


Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.

Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated from the ground up as part of the Hadoop ecosystem and leverages the same flexible file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack.

Impala Benchmark

Shark Benchmark

Shark extends Apache Hive to dramatically speed up both in-memory and on-disk queries. Impala is an enterprise data warehouse system that works well with Hive/HDFS and, at the architectural level, resembles traditional parallel databases.

Both systems share overlapping goals, but there are substantial differences.

Compatibility with existing systems: Shark builds directly on the Apache Hive codebase, so it naturally supports virtually all Hive features. It supports the existing Hive SQL language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts. Because Impala uses a custom C++ runtime, it doesn't support Hive UDFs. Both systems will integrate out of the box with many BI tools, and this has been a major goal for Impala. Shark is being used with some BI tools, like Tableau, but this hasn't been explored much.

In-memory data processing: Shark allows users to explicitly load data in memory to speed up query processing, and uses an efficient, compressed column-oriented format for its memory. Impala does not yet provide in-memory storage.
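To make the idea of a compressed, column-oriented in-memory layout concrete, here is a minimal Python sketch. It is not Shark's actual storage format: the function names (to_columns, rle_encode, rle_decode), the sample rows, and the choice of run-length encoding as the compression scheme are all assumptions made for illustration.

```python
from itertools import groupby


def to_columns(rows, names):
    """Pivot a list of row tuples into a dict of per-column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical sample data: (country, clicks)
rows = [("us", 1), ("us", 2), ("us", 3), ("de", 4), ("de", 5)]
columns = to_columns(rows, ["country", "clicks"])

# Repeated values in a column compress well once they are stored together.
encoded_country = rle_encode(columns["country"])

# A query that only aggregates "clicks" never has to read "country".
total_clicks = sum(columns["clicks"])
```

Storing values column by column keeps similar values adjacent, which is what makes simple schemes such as run-length encoding effective, and it lets a scan skip columns the query does not reference.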

Fault tolerance: Shark is designed to support both short and long-running queries. It can recover from mid-query faults thanks to the underlying Spark engine. Impala is more focused on short queries at the moment, and is not fault-tolerant (queries must be restarted if a node fails, which is arguably acceptable for short queries).
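The difference between the two recovery strategies can be sketched with a toy simulation. This is an illustration only, not code from Spark or Impala: run_query, its parameters, and the partition data are hypothetical, and a "node failure" is simulated as a single task that fails once.

```python
def run_query(partitions, task, fail_once_on, recover_per_partition):
    """Sum task(partition) over all partitions with one simulated failure.

    Returns (result, tasks_executed) so the two strategies can be compared.
    """
    failed_already = False
    executed = 0
    while True:
        results = []
        restarted = False
        for idx, part in enumerate(partitions):
            executed += 1                      # attempt the task
            if idx == fail_once_on and not failed_already:
                failed_already = True          # the simulated node failure
                if not recover_per_partition:
                    restarted = True           # discard all partial results
                    break
                executed += 1                  # re-run only the lost task
            results.append(task(part))
        if not restarted:
            return sum(results), executed


partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Mid-query recovery: only the failed partition is recomputed.
recovered = run_query(partitions, sum, fail_once_on=2, recover_per_partition=True)

# No fault tolerance: the whole query starts over after the failure.
restarted = run_query(partitions, sum, fail_once_on=2, recover_per_partition=False)
```

Both strategies return the same answer; they differ in how much work is repeated. With four short tasks the gap is small, which is why restarting is arguably acceptable for short queries, but the wasted work grows with the length of the query.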

Performance: It's really too early to do a full comparison. Both report between 10x and 100x speedups over Hive, but these are anecdotal and workload-dependent. Both projects also have major optimizations coming in the next 6 months. In our experience, the current version of Shark is regularly 100x faster than Hive with in-memory data, and 5-10x faster with on-disk data, depending on the queries (for queries with joins it can be faster than that).

Target audience: In our understanding, Impala is quite focused on traditional enterprise customers and on OLAP and data warehouse workloads. Shark supports traditional OLAP, but also invests effort in supporting more complex uses of Hive (such as UDFs), processing of unstructured data (e.g. ETL), and advanced analytics like machine learning (through integration with Spark). The long-term goal for Shark is to have a unified system that supports both SQL and advanced analytics (machine learning, statistics, etc.).

Development language: Shark is written in Java and Scala, and runs on the JVM. Impala is written in C++. Impala compiles queries into LLVM intermediate representations, which can be further optimized by a just-in-time compiler. The Shark team is working on compiling queries into JVM bytecode.
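The general idea behind query compilation can be shown in a few lines of Python. This is only an analogy: Impala emits LLVM IR and Shark was targeting JVM bytecode, whereas this sketch generates Python source. The tuple-based expression format and the names interpret, to_source, and compile_expr are hypothetical.

```python
import operator

# Operator name -> (source symbol, function used by the interpreter)
OPS = {
    "add": ("+", operator.add),
    "mul": ("*", operator.mul),
    "gt": (">", operator.gt),
}


def interpret(expr, row):
    """Interpreted execution: walk the expression tree again for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    _, fn = OPS[kind]
    return fn(interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Translate the expression tree into a Python source fragment."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    symbol, _ = OPS[kind]
    return f"({to_source(expr[1])} {symbol} {to_source(expr[2])})"


def compile_expr(expr):
    """Compiled execution: walk the tree once, return a function of one row."""
    return eval(f"lambda row: {to_source(expr)}")


# Hypothetical predicate: price * qty > 100
predicate = ("gt", ("mul", ("col", "price"), ("col", "qty")), ("lit", 100))
rows = [{"price": 30, "qty": 4}, {"price": 5, "qty": 3}]

compiled = compile_expr(predicate)
interpreted_result = [interpret(predicate, r) for r in rows]
compiled_result = [compiled(r) for r in rows]
```

The interpreter pays the cost of tree traversal and operator dispatch on every row, while the compiled function pays it once per query. Real engines get a further benefit because the generated code can then be optimized by the LLVM or JVM just-in-time compiler.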

Open source: Both systems are open source (Apache licensed). Shark started at UC Berkeley and has already accepted major contributions from companies such as Yahoo!. Impala was developed within Cloudera and was only recently released.

Ref: http://www.sigmoidanalytics.com/impala-shark-benchmark/

http://www.quora.com/Apache-Hadoop/How-does-Impala-compare-to-Shark

0 0