When would someone use Apache Tez instead of Apache Spark, or vice versa?
Source: Internet. Editor: 程序博客网. Time: 2024/06/06 03:10
Answer 1
In a nutshell, Spark is a more mature version of Tez, plus much, much more. If Tez comes with your version of Hive or Pig, use it as the backend execution engine over MapReduce. If you're planning to use the APIs directly, whether to write a data-transformation job, implement a distributed machine learning algorithm, or write your own higher-level data processing language, use Spark, hands down.
Disclaimer: I work at Cloudera, which just started offering Spark support, so I would say a lot of the things I'm about to say. I also hope this doesn't come off as too disparaging to the Tez project. Though I see it as somewhat of a misdirected effort, I've worked on YARN with many of the engineers working on Tez, and think highly of them.
Tez is a ~1.5 year-old implementation of a 2007 paper from Microsoft that generalized the MapReduce distributed compute framework. Spark is a ~4 year-old implementation of a 2010 paper from Berkeley that built on the Microsoft paper. Spark adds "Resilient Distributed Datasets" (RDDs), an abstraction that makes it easy to work with distributed in-memory data.
As mentioned in the question, both Tez and Spark provide a distributed execution engine that can handle arbitrary DAGs, targeted towards processing large amounts of data. Both can read and write data to and from Hadoop using any MapReduce input or output format.
The main focus of Tez so far has been providing a faster engine than MapReduce under Hadoop's traditional data-processing languages like Hive and Pig. Spark has these capabilities, but also spent a lot of effort on a clean user-facing API with a rich set of operators. It can express wordcount in 3 lines of Scala or 15 lines of Java. It also provides an interactive shell (REPL) and a Python API, which are especially great for data sciency audiences and facilitate development in general.
Tez exposes an API for constructing a data flow DAG - you define vertexes and edges and the way that data gets transferred between them. Their wordcount example runs over 300 lines. I believe it also supports the Hadoop MapReduce API, but if you're using that you're not taking advantage of the arbitrary-DAG capabilities.
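To make the "vertexes and edges" idea concrete, here is a minimal sketch of a data-flow DAG running wordcount in plain Python. It is a toy, not the Tez or Spark API: the ToyDag class, its method names, and the "tokenizer" and "summer" vertex names are all hypothetical, and everything runs in a single process with no distribution.

```python
from collections import defaultdict, deque


class ToyDag:
    """Toy data-flow DAG (hypothetical, not the Tez API).

    Vertices are plain functions; an edge hands one vertex's output
    to the next vertex as an argument.
    """

    def __init__(self):
        self.vertices = {}                # vertex name -> function
        self.edges = defaultdict(list)    # upstream name -> downstream names
        self.inputs = defaultdict(list)   # downstream name -> upstream names

    def add_vertex(self, name, func):
        self.vertices[name] = func

    def add_edge(self, src, dst):
        self.edges[src].append(dst)
        self.inputs[dst].append(src)

    def run(self, source_data):
        # Topological order: a vertex runs once all its upstream vertices are done.
        pending = {name: len(self.inputs[name]) for name in self.vertices}
        ready = deque(name for name, n in pending.items() if n == 0)
        results = {}
        while ready:
            name = ready.popleft()
            upstream = [results[src] for src in self.inputs[name]]
            # Root vertices read the source data; others read upstream outputs.
            args = upstream if upstream else [source_data]
            results[name] = self.vertices[name](*args)
            for dst in self.edges[name]:
                pending[dst] -= 1
                if pending[dst] == 0:
                    ready.append(dst)
        return results


def tokenize(lines):
    # Split every line into lowercase words.
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    # Sum the occurrences of each word.
    totals = defaultdict(int)
    for word in words:
        totals[word] += 1
    return dict(totals)


dag = ToyDag()
dag.add_vertex("tokenizer", tokenize)
dag.add_vertex("summer", count)
dag.add_edge("tokenizer", "summer")

word_counts = dag.run(["tez and spark", "spark and hive"])["summer"]
```

Even in this toy, most of the code is plumbing for building and scheduling the graph rather than the wordcount logic itself, which hints at why an engine-level DAG API tends to be more verbose than a high-level operator API.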
From a community/adoption/stayin
Spark and Tez are both distributed data-processing engines that target similar use cases. My opinion is that Spark's maturity, cleaner, richer APIs, thriving community, and first-class support for RDDs and in-memory data make it a superior choice in nearly every situation.
Answer 2
I do not agree with the very good answer by Sandy Ryza. Though the answer is more or less correct, there is one use case where Tez can score significantly over Spark. This is the one that involves extreme scale - for instance, if you want to join a 100-terabyte table to another 100-terabyte table, Hive 13 on Tez is a better option than Shark.
Hortonworks co-founder Arun C Murthy also confirmed this in our discussion at the Hadoop meetup at InMobi a few days back. They have beta customers at that scale and have successfully run a 100TB vs. 100TB join on Hive 13 on top of Tez. I have never seen Spark/Shark being used at this scale.
Answer 3
With more community backing, flexibility and depth, Spark and Hive on Spark will prevail. Spark is supported by all distros, whereas Tez is kind of alone out there. We should expect smarter companies to pick Spark and Hive on Spark going forward.
Answer 4
Do your own benchmarks and understand your requirements; only then can you decide. There is no one-size-fits-all.
In my experience, I'll use Tez as the Hive execution engine, while I'll use Spark for most of the queries or jobs. Recently, I have seen wonderful results from Spark for EMPI, while Hive and Tez will provide immense value for the use cases pointed out by Vijay. In use cases like EMPI, where data volume is not too large but computation is very CPU intensive, Spark is the best way to go.
Blog 1
On February 3rd, Cloudera announced support for Apache Spark as part of Cloudera Enterprise. I've blogged about Spark before so I won't go into substantial detail here, but the short version is Spark improves upon MapReduce by removing the need to write data to disk between steps. Spark also takes advantage of in-memory processing and data sharing for further optimizations.
The other successor to MapReduce (of course there is more than one) is Apache Tez. Tez improves upon MapReduce by removing the need to write data to disk between steps (sound familiar?). It also has in-memory capabilities similar to Spark. Thus far Hortonworks has thrown its weight behind Tez development as part of the Stinger project.
Both Tez and Spark are described as supplementing MapReduce workloads. However, I don't think this will be the case much longer. The world has changed since Google published the original MapReduce paper in 2004. Memory prices have plummeted while data volumes and sources have increased, making legacy MapReduce less appealing.
Vendors will likely begin distancing themselves from MapReduce for more performant options once there are some high profile customer references. It remains to be seen what this means for early adopters with legacy MapReduce applications.
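As a rough illustration of what "writing data to disk between steps" means, the sketch below runs the same two-step job twice in plain Python: once persisting the intermediate result to a temporary file, as chained MapReduce jobs would, and once passing it along in memory. The function names and the tiny dataset are made up for the example, and neither path reflects how Spark or Tez is actually implemented.

```python
import json
import os
import tempfile


def parse(lines):
    # Step 1: turn raw text records into integers.
    return [int(x) for x in lines]


def square(nums):
    # Step 2: transform each parsed value.
    return [n * n for n in nums]


def run_with_disk(lines):
    """MapReduce-style chaining: step 1 writes a file that step 2 reads back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "step1.json")
        with open(path, "w") as f:
            json.dump(parse(lines), f)       # materialize the intermediate output
        with open(path) as f:
            intermediate = json.load(f)      # next step re-reads it from disk
        return sum(square(intermediate))


def run_in_memory(lines):
    """DAG-style chaining: the intermediate result never leaves memory."""
    return sum(square(parse(lines)))


records = ["1", "2", "3"]
disk_total = run_with_disk(records)
memory_total = run_in_memory(records)
```

Both paths compute the same total; the difference is only the extra serialization and I/O round trip in the first, which is the overhead the post attributes to MapReduce.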