Big Data Benchmark (Hive, Impala, Shark (Spark), and Redshift benchmark)
Source: Internet | Editor: 程序博客网 | Time: 2024/05/01 15:34
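I came across an article that benchmarks several parallel processing frameworks and thought it would help with technology selection, so I am reposting it here. The tests were run by AMPLab, the birthplace of Spark, and compare Redshift, Impala, Shark, and Hive. The original is at https://amplab.cs.berkeley.edu/benchmark/. The article compares the performance of several mainstream parallel analytic frameworks (Hive, Spark, Impala) across different scenarios. It runs scans, aggregations, and joins over the same datasets, varies the query conditions to change the size of the result, and shows how each platform behaves in each case. The results show the following:
- Redshift performs strongly overall and is well balanced.
- Shark and Impala do well on small datasets that can be processed in memory, but overall they fall short of Redshift. Impala outperforms Shark on joins.
- Hive performs worst in every respect.
Of course, Hive and Shark are fully open source, while Redshift is Amazon's commercial MPP database. I looked at the pricing and it seems acceptable.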
- CPU: 16 virtual cores - Intel Xeon E5
- ECU: 35
- Memory: 120 GiB
- Storage: 24 HDD with 16TB of local attached storage
- Network: 10 Gigabit Ethernet with support for cluster placement groups
- Disk I/O: Very High
- API: dw.hs1.8xlarge
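Impala and Shark are roughly on par in these tests, and each has large companies in China working with it. However, Impala is only partially open source, which limits its use. Shark is built on Spark and is fully open source. From that angle, Shark seems to be the better choice.
Of course, since the benchmark comes from Spark's original home, some showing off is only natural, but the data should be hard to fake, so the results can serve as a reference when choosing among these frameworks. The original text follows: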
Here from HackerNews? This was originally posted several months ago. Check back in two weeks for an updated benchmark including newer versions of Hive, Impala, and Shark.
Introduction
Several analytic frameworks have been announced in the last six months. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger). This benchmark provides quantitative and qualitative comparisons of four systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.
- Redshift - a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse.
- Hive - a Hadoop-based data warehousing system. (v0.10, 1/2013 Note: Hive v0.11, which advertises improved performance, was recently released but is not yet included)
- Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8 preview, 5/2013)
- Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.0, 4/2013)
This remains a work in progress and will evolve to include additional frameworks and new capabilities. We welcome contributions.
What is being evaluated?
This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDF’s, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDF’s), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.
Dataset and Workload
Our dataset and queries are inspired by the benchmark contained in "A comparison of approaches to large scale analytics". The input data set consists of a set of unstructured HTML documents and two SQL tables which contain summary information. It was generated using Intel’s Hadoop benchmark tools and data sampled from the Common Crawl document corpus. There are three datasets with the following schemas:
Documents
- unstructured HTML documents (no SQL schema)
Rankings
- pageURL VARCHAR(300)
- pageRank INT
- avgDuration INT
UserVisits
- sourceIP VARCHAR(116)
- destURL VARCHAR(100)
- visitDate DATE
- adRevenue FLOAT
- userAgent VARCHAR(256)
- countryCode CHAR(3)
- languageCode CHAR(6)
- searchWord VARCHAR(32)
- duration INT
Query 1 and Query 2 are exploratory SQL queries. We vary the size of the result to expose scaling properties of each system.
- Variant A: BI-Like - result sets are small (e.g., could fit in memory in a BI tool)
- Variant B: Intermediate - result set may not fit in memory on one node
- Variant C: ETL-Like - result sets are large and require several nodes to store
Query 3 is a join query with a small result set, but varying sizes of joins.
Query 4 is a bulk UDF query. It calculates a simplified version of PageRank using a sample of the Common Crawl dataset.
Hardware Configuration
Instance stats
Cluster stats
Results | May 2013
We launch EC2 clusters and run each query several times. We report the median response time here. Except for Redshift, all data is stored on HDFS in compressed SequenceFile format using CDH 4.2.0. Each query is run with six frameworks:
- Redshift: Amazon Redshift with default options.
- Shark - disk: Input and output tables are on-disk compressed with gzip. OS buffer cache is cleared before each run.
- Impala - disk: Input and output tables are on-disk compressed with snappy. OS buffer cache is cleared before each run.
- Shark - mem: Input tables are stored in Spark cache. Output tables are stored in Spark cache.
- Impala - mem: Input tables are coerced into the OS buffer cache. Output tables are on disk (Impala has no notion of a cached table).
- Hive: Hive with default options. Input and output tables are on disk compressed with snappy. OS buffer cache is cleared before each run.
1. Scan Query
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
- Query 1A: 32,888 results
- Query 1B: 3,331,851 results
- Query 1C: 89,974,976 results
This query scans and filters the dataset and stores the results.
This query primarily tests the throughput with which each framework can read and write table data. The best performers are Impala (mem) and Shark (mem) which see excellent throughput by avoiding disk. For on-disk data, Redshift sees the best throughput for two reasons. First, the Redshift clusters have more disks and second, Redshift uses columnar compression which allows it to bypass a field which is not used in the query. Shark and Impala scan at HDFS throughput with fewer disks.
Both Shark and Impala outperform Hive by 3-4X due in part to more efficient task launching and scheduling. As the result sets get larger, Impala becomes bottlenecked on the ability to persist the results back to disk. It seems as if writing large tables is not yet optimized in Impala, presumably because its core focus is BI-style queries.
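To see how the threshold X controls the result size in the scan query, the query can be tried on a handful of rows. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in engine (SQLite is not one of the benchmarked systems), and the sample rows and the helper name scan_rankings are made up.

```python
import sqlite3


def scan_rankings(rows, x):
    """Run the Query 1 scan over toy rows; return sorted (pageURL, pageRank) pairs."""
    conn = sqlite3.connect(":memory:")
    # Schema follows the Rankings table described in the dataset section.
    conn.execute(
        "CREATE TABLE rankings (pageURL VARCHAR(300), pageRank INT, avgDuration INT)"
    )
    conn.executemany("INSERT INTO rankings VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT pageURL, pageRank FROM rankings WHERE pageRank > ?", (x,)
    ).fetchall()
    conn.close()
    # Sorting happens in Python only to make the output order deterministic.
    return sorted(result)


# Made-up sample rows; the real table is produced by the benchmark's generator.
TOY_RANKINGS = [
    ("a.example/1", 5, 10),
    ("b.example/2", 50, 20),
    ("c.example/3", 500, 30),
    ("d.example/4", 5000, 40),
]

if __name__ == "__main__":
    # Lowering X lets more rows through, mirroring variants A, B and C.
    for x in (1000, 100, 10):
        print(x, len(scan_rankings(TOY_RANKINGS, x)))
```

Lowering X is what moves the query from a BI-like result toward an ETL-like one; the engine still has to scan the whole table each time.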
2. Aggregation Query
SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
- Query 2A: 2,067,313 groups
- Query 2B: 31,348,913 groups
- Query 2C: 253,890,330 groups
This query applies string parsing to each input tuple then performs a high-cardinality aggregation.
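The effect of the prefix length X on the number of groups can be illustrated on toy data. This is a sketch under assumptions: sqlite3 stands in for the benchmarked engines, only the two referenced columns are modelled, and the sample rows and the helper name revenue_by_prefix are made up.

```python
import sqlite3


def revenue_by_prefix(rows, x):
    """Sum adRevenue per sourceIP prefix of length x, as in Query 2."""
    conn = sqlite3.connect(":memory:")
    # Only the two columns the query touches are modelled here.
    conn.execute("CREATE TABLE uservisits (sourceIP VARCHAR(116), adRevenue FLOAT)")
    conn.executemany("INSERT INTO uservisits VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT SUBSTR(sourceIP, 1, ?), SUM(adRevenue) FROM uservisits "
        "GROUP BY SUBSTR(sourceIP, 1, ?)",
        (x, x),
    ).fetchall()
    conn.close()
    return dict(result)


# Made-up rows; revenues are powers of two so the sums are exact.
TOY_VISITS = [
    ("10.0.0.1", 1.0),
    ("10.0.0.2", 2.0),
    ("10.0.1.1", 4.0),
    ("10.1.0.1", 8.0),
    ("11.0.0.1", 16.0),
]

if __name__ == "__main__":
    # A longer prefix splits the same rows into more groups.
    for x in (2, 4, 6):
        print(x, revenue_by_prefix(TOY_VISITS, x))
```

A longer prefix yields more distinct keys, which is how the three variants reach progressively higher group counts from the same input table.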
Redshift’s columnar storage provides greater benefit than in Query 1 since several columns of the UserVisits table are unused. While Shark’s in-memory tables are also columnar, it is bottlenecked here on the speed at which it evaluates the SUBSTR expression. Since Impala is reading from the OS buffer cache, it must read and decompress entire rows. Unlike Shark, however, Impala evaluates this expression using very efficient compiled code. These two factors offset each other and Impala and Shark achieve roughly the same raw throughput for in-memory tables. For larger result sets, Impala again sees high latency due to the speed of materializing output tables.
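The columnar advantage described here can be made concrete with a toy accounting of how many field values each layout has to touch. This is a rough sketch, not a model of any real engine: it ignores compression, block sizes and I/O granularity, and the function names and sample rows are made up.

```python
# Column names follow the UserVisits schema given in the dataset section.
USERVISITS_COLUMNS = (
    "sourceIP", "destURL", "visitDate", "adRevenue", "userAgent",
    "countryCode", "languageCode", "searchWord", "duration",
)


def to_columns(rows, names):
    """Pivot row tuples into one list per column (a toy columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def values_read_row_layout(rows):
    """A row layout hands back every field of every row it scans."""
    return sum(len(row) for row in rows)


def values_read_column_layout(columns, needed):
    """A column layout reads only the columns the query references."""
    return sum(len(columns[name]) for name in needed)


# Made-up sample rows.
TOY_ROWS = [
    ("10.0.0.1", "a.example/1", "2000-01-01", 1.5, "ua", "USA", "en-US", "x", 3),
    ("10.0.0.2", "b.example/2", "2000-01-02", 2.5, "ua", "USA", "en-US", "y", 4),
]

if __name__ == "__main__":
    columns = to_columns(TOY_ROWS, USERVISITS_COLUMNS)
    # Query 2 references only sourceIP and adRevenue.
    print(values_read_row_layout(TOY_ROWS))
    print(values_read_column_layout(columns, ("sourceIP", "adRevenue")))
```

With nine columns and two referenced, the columnar scan touches two ninths of the values, which is the intuition behind the larger benefit in this query than in Query 1.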
3. Join Query
SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X') GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1
- Query 3A: 485,312 rows
- Query 3B: 53,332,015 rows
- Query 3C: 533,287,121 rows
This query joins a smaller table to a larger table then sorts the results.
When the join is small (3A), all frameworks spend the majority of time scanning the large table and performing date comparisons. For larger joins, the initial scan becomes a less significant fraction of overall response time. For this reason the gap between in-memory and on-disk representations diminishes in query 3C. All frameworks perform partitioned joins to answer this query. CPU (due to hashing join keys) and network IO (due to shuffling data) are the primary bottlenecks. Redshift has an edge in this case because the overall network capacity in the cluster is higher.
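The partitioned join mentioned above can be sketched in a few lines: both inputs are routed to partitions by a hash of the join key (the shuffle), and each partition is then joined locally with a hash table. This is a single-process illustration under assumptions, not how any of the benchmarked systems implements it; the CRC32 hash, the tuple layouts and the function names are choices made for the example.

```python
import zlib
from collections import defaultdict


def partition(rows, key_index, num_partitions):
    """Shuffle step: route each row to a partition chosen by hashing its join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        h = zlib.crc32(row[key_index].encode("utf-8"))
        parts[h % num_partitions].append(row)
    return parts


def partitioned_join(rankings, uservisits, num_partitions=4):
    """Join rankings (pageURL, pageRank) with uservisits (sourceIP, destURL,
    adRevenue) on pageURL = destURL, one partition at a time."""
    r_parts = partition(rankings, 0, num_partitions)
    uv_parts = partition(uservisits, 1, num_partitions)
    joined = []
    for r_part, uv_part in zip(r_parts, uv_parts):
        # Build a hash table on the Rankings side (the smaller table here).
        by_url = defaultdict(list)
        for page_url, page_rank in r_part:
            by_url[page_url].append(page_rank)
        # Probe it with the UserVisits rows that landed in the same partition.
        for source_ip, dest_url, ad_revenue in uv_part:
            for page_rank in by_url.get(dest_url, []):
                joined.append((source_ip, page_rank, ad_revenue))
    return joined
```

Because equal keys always hash to the same partition, each partition can be joined independently. In a cluster, the partitioning step is the network shuffle and the build and probe loops are the hashing work that the text identifies as the main bottlenecks.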
4. UDF Query
CREATE TABLE url_counts_partial AS SELECT TRANSFORM (line) USING "python /root/url_count.py" as (sourcePage, destPage, cnt) FROM documents;
CREATE TABLE url_counts_total AS SELECT SUM(cnt) AS totalCount, destPage FROM url_counts_partial GROUP BY destPage;
This query calls an external Python function which extracts and aggregates URL information from a web crawl dataset. It then aggregates a total count per URL.
Impala and Redshift do not currently support calling this type of UDF, so they are omitted from the result set. The performance advantage of Shark (disk) over Hive in this query is less pronounced than in 1, 2, or 3 because the shuffle and reduce phases take a relatively small amount of time (this query only shuffles a small amount of data) so the task-launch overhead of Hive is less pronounced. Also note that when the data is in-memory, Shark is bottlenecked by the speed at which it can pipe tuples to the Python process rather than memory throughput. This makes the speedup relative to disk around 5X (rather than 10X or more seen in other queries).
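For readers unfamiliar with TRANSFORM-style UDFs: the engine pipes rows to the external script on standard input and reads tab-separated output rows back. The contents of url_count.py are not shown in this article, so the following is a hypothetical stand-in rather than the benchmark's script. It assumes each input line has the form "sourceURL, TAB, HTML" and that only absolute http(s) links in href attributes are counted; both are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed link pattern: absolute http(s) URLs inside double-quoted href attributes.
HREF = re.compile(r'href="(https?://[^"]+)"')


def count_links(lines):
    """Yield (sourcePage, destPage, cnt) for each distinct link on each line.

    Assumed input format (not taken from the benchmark): "sourceURL<TAB>html".
    """
    for line in lines:
        source, _, html = line.rstrip("\n").partition("\t")
        counts = Counter(HREF.findall(html))
        for dest, cnt in sorted(counts.items()):
            yield source, dest, cnt


def main(stdin, stdout):
    """Read rows from stdin and write tab-separated output rows to stdout."""
    for source, dest, cnt in count_links(stdin):
        stdout.write(f"{source}\t{dest}\t{cnt}\n")
```

Calling main(sys.stdin, sys.stdout) at the bottom of the file would turn this into a script usable in the first statement; the second statement then sums cnt per destPage inside the engine.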
Discussion
These numbers compare performance on SQL workloads, but raw performance is just one of many important attributes of an analytic framework. The reason why systems like Hive, Impala, and Shark are used is because they offer a high degree of flexibility, both in terms of the underlying format of the data and the type of computation employed. Below we summarize a few qualitative points of comparison:
FAQ
What's next?
We would like to include the columnar storage formats for Hadoop-based systems, such as Parquet and RC file. We would also like to run the suite at higher scale factors, using different types of nodes, and/or inducing failures during execution. Finally, we plan to re-evaluate on a regular basis as new versions are released.
We wanted to begin with a relatively well known workload, so we chose a variant of the Pavlo benchmark. This benchmark is heavily influenced by relational queries (SQL) and leaves out other types of analytics, such as machine learning and graph processing. The largest table also has fewer columns than in many modern RDBMS warehouses. In future iterations of this benchmark, we may extend the workload to address these gaps.
How is this different from the 2008 Pavlo et al. benchmark?
This benchmark is not an attempt to exactly recreate the environment of the Pavlo et al. benchmark. Instead, it draws on that benchmark for inspiration in the dataset and workload. The most notable differences are as follows:
- We run on a public cloud instead of using dedicated hardware.
- We require that the results be materialized to an output table. This is necessary because some queries in our version have results which do not fit in memory on one machine.
- The dataset used for Query 4 is an actual web crawl rather than a synthetic one.
- Query 4 uses a Python UDF instead of SQL/Java UDF’s.
- We create different permutations of queries 1-3. These permutations result in shorter or longer response times.
- The dataset is generated using the newer Intel generator instead of the original C scripts. The newer tools are well supported and designed to output Hadoop datasets.
Did you consider comparing Vertica, Teradata, SAP Hana, MongoDB, Postgres, RAMCloud, SQLite, insert-dbms-or-query-engine-here... etc?
We’ve started with a small number of EC2-hosted query engines because our primary goal is producing verifiable results. Over time we’d like to grow the set of frameworks. We actively welcome contributions!
This workload doesn't represent queries I run -- how can I test these frameworks on my own workload?
We’ve tried to cover a set of fundamental operations in this benchmark, but of course, it may not correspond to your own workload. The prepare scripts provided with this benchmark will load sample data sets into each framework. From there, you are welcome to run your own types of queries against these tables. Because these are all easy to launch on EC2, you can also load your own datasets.
Do these queries take advantage of data-layout options, such as Hive/Impala/Shark partitions or Redshift sort columns?
For now, no. The idea is to test “out of the box” performance on these queries even if you haven’t done a bunch of up-front work at the loading stage to optimize for specific access patterns. We may relax this requirement in the future.
Why didn't you test Hive in memory?
We did, but the results were very hard to stabilize. The reason is that it is hard to coerce the entire input into the buffer cache because of the way Hive uses HDFS: each file in HDFS has three replicas and Hive’s underlying scheduler may choose to launch a task at any replica on a given run. As a result, you would need 3X the amount of buffer cache (which exceeds the capacity in these clusters) and/or need to have precise control over which node runs a given task (which is not offered by the MapReduce scheduler).
Contributing a New Framework
We plan to run this benchmark regularly and may introduce additional workloads over time. We welcome the addition of new frameworks as well. The only requirement is that running the benchmark be reproducible and verifiable in similar fashion to those already included. The best place to start is by contacting Patrick Wendell from the U.C. Berkeley AMPLab.
Run This Benchmark Yourself
Since Redshift, Shark, Hive, and Impala all provide tools to easily provision a cluster on EC2, this benchmark can be easily replicated.
Hosted data sets
To allow this benchmark to be easily reproduced, we’ve prepared various sizes of the input dataset in S3. The scale factor is defined such that each node in a cluster of the given size will hold ~25GB of the UserVisits table, ~1GB of the Rankings table, and ~30GB of the web crawl, uncompressed. The datasets are encoded in TextFile and SequenceFile format along with corresponding compressed versions. They are available publicly at s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix].
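The scale-factor arithmetic can be written down as a small helper. This is a sketch under one stated reading: the scale factor is taken to equal the number of nodes, so the approximate uncompressed total is the per-node figure times the scale factor. The suffix values are not listed here, so the helper takes the suffix as an argument rather than guessing it; the function names are made up.

```python
# Approximate uncompressed GB per node, as stated in the text.
PER_NODE_GB = {"UserVisits": 25, "Rankings": 1, "Documents": 30}

# Format directories that appear in the published S3 path pattern.
FORMATS = ("text", "text-deflate", "sequence", "sequence-snappy")


def approx_dataset_gb(scale_factor):
    """Approximate uncompressed size of each dataset, assuming one node per unit of scale."""
    if scale_factor < 1:
        raise ValueError("scale_factor must be at least 1")
    return {table: gb * scale_factor for table, gb in PER_NODE_GB.items()}


def dataset_path(file_format, suffix):
    """Fill in the published path pattern; the caller supplies the suffix."""
    if file_format not in FORMATS:
        raise ValueError(f"unknown file format: {file_format}")
    return f"s3n://big-data-benchmark/pavlo/{file_format}/{suffix}"


if __name__ == "__main__":
    print(approx_dataset_gb(5))
```

Under that reading, a scale factor of 5 corresponds to roughly 125GB of UserVisits, 5GB of Rankings and 150GB of web crawl before compression.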
Table columns: Rankings (rows), Rankings (bytes), UserVisits (rows), UserVisits (bytes), Documents (bytes)
Launching and Loading Clusters
Create an Impala, Redshift, Hive or Shark cluster using their provided provisioning tools.
- Each cluster should be created in the US East EC2 Region
- For Redshift, use the Amazon AWS console. Make sure to whitelist the node you plan to run the benchmark from in the Redshift control panel.
- For Impala and Hive, use the Cloudera Manager EC2 deployment instructions. Make sure to upload your own RSA key so that you can use the same key to log into the nodes and run queries.
- For Shark, use the Spark/Shark EC2 launch scripts. These are available as part of the latest Spark distribution. *NOTE: You must set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
$> ec2/spark-ec2 -s 5 -k [KEY PAIR NAME] -i [IDENTITY FILE] --hadoop-major-version=2 -t "m2.4xlarge" launch [CLUSTER NAME]
Scripts for preparing data are included in the benchmark GitHub repo. Use the provided prepare-benchmark.sh to load an appropriately sized dataset into the cluster.
$> ./prepare-benchmark.sh --help
Here are a few examples showing the options used in this benchmark…
$> ./prepare-benchmark.sh --redshift --aws-key-id=[AWS KEY ID] --aws-key=[AWS KEY] --redshift-username=[USERNAME] --redshift-password=[PASSWORD] --redshift-host=[ODBC HOST] --redshift-database=[DATABASE] --scale-factor=5
$> ./prepare-benchmark.sh --shark --aws-key-id=[AWS KEY ID] --aws-key=[AWS KEY] --shark-host=[SHARK MASTER] --shark-identity-file=[IDENTITY FILE] --scale-factor=5 --file-format=text-deflate
$> ./prepare-benchmark.sh --impala --aws-key-id=[AWS KEY ID] --aws-key=[AWS KEY] --impala-host=[NAME NODE] --impala-identity-file=[IDENTITY FILE] --scale-factor=5 --file-format=sequence-snappy
$> ./run-query.sh --redshift --redshift-username=[USERNAME] --redshift-password=[PASSWORD] --redshift-host=[ODBC HOST] --redshift-database=[DATABASE] --query-num=[QUERY NUM]
$> ./run-query.sh --shark --shark-host=[SHARK MASTER] --shark-identity-file=[IDENTITY FILE] --query-num=[QUERY NUM]
$> ./run-query.sh --impala --impala-hosts=[COMMA SEPARATED LIST OF IMPALA NODES] --impala-identity-file=[IDENTITY FILE] --query-num=[QUERY NUM]
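The results section reports the median response time over several runs of each query. If you time queries yourself, a small harness along the following lines may help. It is a minimal sketch, not part of the benchmark's scripts: the function name, the default of five runs and the use of wall-clock time are assumptions.

```python
import statistics
import time


def median_response_time(run_query, runs=5):
    """Call run_query() `runs` times and return the median wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Here run_query can be any zero-argument callable, for example one that launches one of the run-query.sh commands above through subprocess.run. The median is less sensitive than the mean to an occasional slow run, which matters in a shared environment such as EC2.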
- If you are adding a new framework or using this to produce your own scientific performance numbers, get in touch with us. The virtualized environment of EC2 makes eking out the best results a bit tricky. We can help.