Hadoop Summit 2013


Sessions:

http://hadoopsummit.org/program/

Reading list:

Optimizing MapReduce Job Performance (http://www.slideshare.net/cloudera/mr-perf)

Optimizing MapReduce job performance is often seen as something of a black art. To maximize performance, developers need to understand the inner workings of the MapReduce execution framework and how they are affected by various configuration parameters and MR design patterns. The talk will illustrate the underlying mechanics of job and task execution, including the map-side sort/spill, the shuffle, and the reduce-side merge, and then explain how different job configuration parameters and job design strategies affect the performance of these operations. Though the talk will cover internals, it will also provide practical tips, guidelines, and rules of thumb for better job performance. The talk is primarily targeted at developers directly using the MapReduce API, though it will also include some tips for users of higher-level frameworks.
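To make the configuration side concrete, here is a minimal, hedged sketch of a word-count job that sets a few of the sort/spill and shuffle knobs the talk discusses. The property names are the Hadoop 2 ones (older releases use io.sort.mb, io.sort.spill.percent, and io.sort.factor), and the specific values are illustrative placeholders rather than recommendations:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count with a few of the sort/spill/shuffle knobs from the talk set explicitly.
public class TunedWordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2 property names; placeholder values, not recommendations.
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // larger map-side sort buffer -> fewer spills
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // start spilling later
        conf.setInt("mapreduce.task.io.sort.factor", 50);         // merge more spill segments per pass
        conf.setBoolean("mapreduce.map.output.compress", true);   // shrink the shuffle

        Job job = Job.getInstance(conf, "tuned word count");
        job.setJarByClass(TunedWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner cuts shuffle volume before the reduce-side merge
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Enabling the combiner and compressing map output both reduce the data that has to cross the network during the shuffle, which is usually the first thing to try before touching the buffer sizes.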

Improving HBase Availability and Repair (http://www.slideshare.net/cloudera/120613-hadoopsummithbaseavailabilitybean-hsieh)

Apache HBase is a rapidly evolving random-access distributed data store built on top of Apache Hadoop’s HDFS and Apache ZooKeeper. Drawing from real-world support experiences, this talk gives administrators insight into improving HBase’s availability and recovering from situations where HBase is not available. We share tips on the common root causes of unavailability, explain how to diagnose them, and prescribe measures for ensuring maximum availability of an HBase cluster. We discuss new features that improve recovery time, such as distributed log splitting, as well as supportability improvements. We also describe utilities, including new failure recovery tools that we have developed and contributed, that can be used to diagnose and repair rare corruption problems on live HBase systems.
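As a small illustration of the diagnostic side, the sketch below polls cluster status through the HBase client API (the 0.94-era HBaseAdmin class) and lists dead region servers. It is a hedged example of the kind of check an administrator might script, not one of the tools described in the talk:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Quick availability check: report live/dead region servers and overall region load.
public class HBaseHealthCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            ClusterStatus status = admin.getClusterStatus();
            System.out.println("live region servers: " + status.getServersSize());
            System.out.println("regions: " + status.getRegionsCount()
                    + " (avg load " + status.getAverageLoad() + ")");
            for (ServerName dead : status.getDeadServerNames()) {
                System.out.println("DEAD: " + dead); // candidates for log splitting / recovery
            }
        } finally {
            admin.close();
        }
    }
}
```

The repair utilities the abstract alludes to are surfaced through the hbck tool (run as hbase hbck), which can report and, with the appropriate options, attempt to fix metadata inconsistencies on a live cluster.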

Hadoop Distributed File System Reliability and Durability at Facebook (http://www.slideshare.net/Hadoop_Summit/hadoop-distributed-file-system-at-facebook)

The Hadoop Distributed Filesystem, or HDFS, provides the storage layer for a variety of critical services at Facebook. The HDFS Namenode is often singled out as a particularly weak aspect of the design of HDFS, because it represents a single point of failure within an otherwise redundant system. To address this weakness, Facebook has been developing a highly available Namenode, known as Avatarnode. The objective of this study was to determine how much effect Avatarnode would have on overall service reliability and durability. To analyze this, we categorized, by root cause, the last two years of operational incidents in the Data Warehouse and Messages services at Facebook, a total of 66 incidents. We were able to show that approximately 10% of each service's incidents would have been prevented had Avatarnode been in place. Avatarnode would have prevented none of our incidents that involved data loss, and all of the most severe data loss incidents were a result of human error or software bugs. Our conclusion is that Avatarnode will improve the reliability of services that use HDFS, but that the HDFS Namenode represents only a small portion of overall operational incidents in services that use HDFS as a storage layer.

HDFS NameNode High Availability (http://www.slideshare.net/Hadoop_Summit/hdfs-namenode-high-availability)

The HDFS NameNode is a robust and reliable service, as seen in production at Yahoo and at other customers' sites. However, the NameNode does not have automatic failover support. A hot-failover solution called HA NameNode is currently under active development (HDFS-1623). This talk will cover its architecture, design, and setup. We will also discuss the future direction of HA NameNode.
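For context on what an HA setup looks like from the outside, here is a hedged sketch of how a Hadoop 2 client addresses an HA NameNode pair through a logical nameservice. The nameservice id, host names, and port are placeholders, and in a real deployment these properties would live in hdfs-site.xml / core-site.xml rather than being set in code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Client view of an HA NameNode pair: requests go to a logical nameservice,
// and the failover proxy provider picks whichever NameNode is currently active.
public class HaClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");   // logical URI, not a single host
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);
        for (FileStatus st : fs.listStatus(new Path("/"))) {
            System.out.println(st.getPath());
        }
        fs.close();
    }
}
```

Because clients talk to the logical URI hdfs://mycluster rather than to a specific host, a failover from the active to the standby NameNode does not require any client-side reconfiguration.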

Spark and Shark (http://www.slideshare.net/Hadoop_Summit/spark-and-shark)

Spark is an open source cluster computing framework that can outperform Hadoop by 30x by storing datasets in memory across jobs. Shark is a port of Apache Hive onto Spark, which provides a similar speedup for SQL queries, allowing interactive exploration of data in existing Hive warehouses. This talk will cover how both Spark and Shark are being used at various companies to accelerate big data analytics, the architecture of the systems, and where they are heading. In particular, we will show how both systems are used for large-scale machine learning, where the ability to keep data in memory across iterations yields substantial speedups, and for interactive data mining, from Shark’s SQL interface or Spark’s Scala-based console. We will also discuss an upcoming extension, Spark Streaming, that adds support for low-latency stream processing in Spark, giving users a unified interface for batch and online analytics.
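To illustrate the in-memory reuse that drives those speedups, here is a minimal sketch using Spark's Java API (the talk itself demos the Scala console and Shark's SQL interface). The input path is a placeholder, and the sketch assumes a Spark release with Java 8 lambda support:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Load a dataset once, keep it in cluster memory, then run several queries against it.
public class CachedLogQuery {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CachedLogQuery").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> logs = sc.textFile("hdfs:///logs/access.log").cache(); // placeholder path

        long errors = logs.filter(line -> line.contains("ERROR")).count();   // first query materializes the cache
        long warnings = logs.filter(line -> line.contains("WARN")).count();  // second query reads from memory

        System.out.println("errors=" + errors + " warnings=" + warnings);
        sc.stop();
    }
}
```

The point of cache() is that the second query reads the dataset from cluster memory instead of re-scanning HDFS, which is where the iterative machine-learning and interactive data-mining gains described above come from.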
