When would someone use Apache Tez instead of Apache Spark, or vice versa?

Source: Internet | Published by: 魔兽世界5.4.8数据库 | Editor: 程序博客网 | Time: 2024/06/06 03:10

Answer 1

In a nutshell, Spark is a more mature version of Tez, plus much, much more. If Tez comes with your version of Hive or Pig, use it as the backend execution engine over MapReduce. If you're planning to use the APIs directly, whether to write a data-transformation job, implement a distributed machine learning algorithm, or write your own higher-level data processing language, use Spark, hands down.

Disclaimer: I work at Cloudera, which just started offering Spark support, so I would say a lot of the things I'm about to say. I also hope this doesn't come off as too disparaging to the Tez project. Though I see it as somewhat of a misdirected effort, I've worked on YARN with many of the engineers working on Tez, and think highly of them.

Tez is a ~1.5-year-old implementation of a 2007 paper from Microsoft that generalized the MapReduce distributed compute framework. Spark is a ~4-year-old implementation of a 2010 paper from Berkeley that built on the Microsoft paper. Spark adds "Resilient Distributed Datasets" (RDDs), an abstraction that makes it easy to work with distributed in-memory data.

As mentioned in the question, both Tez and Spark provide a distributed execution engine that can handle arbitrary DAGs, targeted at processing large amounts of data. Both can read data from and write data to Hadoop using any MapReduce input or output format. The main focus of Tez so far has been providing a faster engine than MapReduce under Hadoop's traditional data-processing languages like Hive and Pig. Spark has these capabilities, but has also put a lot of effort into a clean user-facing API with a rich set of operators. It can express wordcount in 3 lines of Scala or 15 lines of Java. It also provides an interactive shell (REPL) and a Python API, which are especially useful for data science audiences and make development easier in general. Tez exposes an API for constructing a data flow DAG: you define vertices and edges and the way that data gets transferred between them. Their wordcount example runs over 300 lines. I believe it also supports the Hadoop MapReduce API, but if you're using that you're not taking advantage of the arbitrary-DAG capabilities.
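To make the contrast between the two API styles concrete, here is a minimal sketch in plain Python. It uses neither Spark nor Tez: the first half only mimics the shape of an operator-chain API, and the second half mimics the shape of an explicit DAG API. Names such as `vertices`, `edges`, `run_dag`, and `sum_by_key` are hypothetical and chosen for illustration, and the input lines are made up.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from itertools import chain

# Made-up input standing in for a distributed dataset.
lines = ["to be or not to be", "to see or not to see"]

# Style 1: operator chain. This is the shape of a high-level API;
# these are ordinary Python calls, not Spark operators.
words = chain.from_iterable(line.split() for line in lines)  # split lines into words
chained_counts = dict(Counter(words))                        # count occurrences per word


# Style 2: explicit DAG. The caller names every vertex and every edge.
def sum_by_key(pairs):
    """Add up the values of (key, value) pairs that share a key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


# Each vertex is a function of the outputs of its upstream vertices.
vertices = {
    "source": lambda: lines,
    "tokenize": lambda src: [w for line in src for w in line.split()],
    "pair": lambda tokens: [(w, 1) for w in tokens],
    "sum": sum_by_key,
}

# Each entry maps a vertex to the upstream vertices it reads from.
edges = {
    "source": [],
    "tokenize": ["source"],
    "pair": ["tokenize"],
    "sum": ["pair"],
}


def run_dag(vertices, edges):
    """Run every vertex after its upstream vertices and keep all outputs."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = [results[upstream] for upstream in edges[name]]
        results[name] = vertices[name](*inputs)
    return results


dag_counts = run_dag(vertices, edges)["sum"]
print(chained_counts == dag_counts)
```

Both halves compute the same counts. The difference is in how much the caller has to spell out: the chained style hides the graph behind operators, while the DAG style makes every stage and data dependency explicit, which is part of why a low-level example tends to run much longer.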

From a community/adoption/staying-power standpoint, Spark boasts over a hundred contributors from a diverse set of companies like Databricks, Intel, Yahoo, and Cloudera. The mailing lists are constantly overflowing my inbox. Nearly all Tez contributions come from a single company (Hortonworks).

Spark and Tez are both distributed data-processing engines that target similar use cases. My opinion is that Spark's maturity, cleaner, richer APIs, thriving community, and first-class support for RDDs and in-memory data make it a superior choice in nearly every situation.

Answer 2

I do not agree with the very good answer by Sandy Ryza. Though the answer is more or less correct, there is one use case where Tez can score significantly over Spark: extreme scale. For instance, if you want to join a 100-terabyte table to another 100-terabyte table, Hive 13 on Tez is a better option than Shark.

Hortonworks co-founder Arun C Murthy also confirmed this in our discussion at the Hadoop meetup @Inmobi a few days back. They have beta customers at that scale and have successfully run a 100 TB vs. 100 TB join on Hive 13 on top of Tez. I have never seen Spark/Shark being used at this scale.

Answer 3

With more community backing, flexibility and depth, Spark and Hive on Spark will prevail. Spark is supported by all distros, whereas Tez is kind of alone out there. We should expect smarter companies to pick Spark and Hive on Spark going forward.

Answer 4

Do your own benchmarks and understand your requirements; only then can you decide. There is no one-size-fits-all answer.

In my experience, I'll use Tez as the Hive execution engine, while I am going to use Spark for most of the queries or jobs. Recently, I have seen wonderful results from Spark for EMPI, while Hive and Tez will provide immense value for the use cases pointed out by Vijay. In use cases like EMPI, where the data volume is not too large but the computation is highly CPU intensive, Spark is the best way to go.

Blog 1

On February 3rd, Cloudera announced support for Apache Spark as part of Cloudera Enterprise. I’ve blogged about Spark before so I won’t go into substantial detail here, but the short version is Spark improves upon MapReduce by removing the need to write data to disk between steps. Spark also takes advantage of in-memory processing and data sharing for further optimizations.

The other successor to MapReduce (of course there is more than one) is Apache Tez. Tez improves upon MapReduce by removing the need to write data to disk between steps (sound familiar?). It also has in-memory capabilities similar to Spark. Thus far Hortonworks has thrown its weight behind Tez development as part of the Stinger project.

Both Tez and Spark are described as supplementing MapReduce workloads. However, I don’t think this will be the case much longer. The world has changed since Google published the original MapReduce paper in 2004. Memory prices have plummeted while data volumes and sources have increased, making legacy MapReduce less appealing.

Vendors will likely begin distancing themselves from MapReduce for more performant options once there are some high profile customer references. It remains to be seen what this means for early adopters with legacy MapReduce applications.

0 0