HDPCD-Java Review Notes (1)

Source: Internet. Edited by: 程序博客网. Time: 2024/05/21 16:58

1. Understand Hadoop HDFS

Pig -- A scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.

Hive -- Provides SQL-like access to your Big Data.

HBase -- A Hadoop database.

Accumulo -- A robust, scalable, high-performance data storage and retrieval system built on Hadoop and ZooKeeper.

Ambari  -- For provisioning, managing, and monitoring Apache Hadoop clusters.

Sqoop -- For efficiently transferring bulk data between Hadoop and relational databases.

Falcon -- A data processing and management solution for Hadoop, designed for data motion, coordination of data pipelines, life cycle management, and data discovery.

Oozie -- A workflow scheduler system to manage Apache Hadoop jobs.

Solr -- A standalone enterprise search server with a REST-like API.

Flume -- For efficiently collecting, aggregating, and moving large amounts of log data.

ZooKeeper -- An open-source server which enables highly reliable distributed coordination.

Mahout -- An Apache project whose goal is to build scalable machine learning libraries.


The Apache Hadoop 2.x project consists of the following modules:

Hadoop Common -- The utilities that provide support for the other Hadoop modules.

HDFS -- The Hadoop Distributed File System

YARN -- A framework for job scheduling and cluster resource management.

MapReduce -- For processing large data sets in a scalable and parallel fashion.


YARN splits up the functionality of the JobTracker in Hadoop 1.x into two separate processes:

ResourceManager -- A daemon process that allocates cluster resources to applications.

ApplicationMaster -- A per-application process that provides the runtime for executing applications.


Putting a file into HDFS involves the following steps:

1) A client application sends a request to the NameNode that specifies where it wants to put the file in the file system.

2) The NameNode determines how the data is broken down into blocks and which DataNodes will be used to store those blocks. That information is given to the client application.

3) The client application communicates directly with each DataNode, writing the blocks onto the DataNode.

4) The DataNode then replicates the newly created block to two other DataNodes (assuming a replication factor of 3).
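
The arithmetic behind steps 2 and 4 can be sketched in a few lines of plain Java. This is a toy model, not the Hadoop API: the class BlockPlanner and its methods are hypothetical names, and the 128 MB block size is an assumption (it is the usual Hadoop 2.x default, but it is configurable per cluster and per file).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: illustrates how a file size maps to blocks and replicas.
// BlockPlanner is a hypothetical name and not part of Hadoop.
final class BlockPlanner {
    // Assumed block size of 128 MB; real clusters may configure a different value.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    /** Returns the length in bytes of each block for a file of the given size. */
    static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            // Every block is full-sized except possibly the last one.
            long length = Math.min(remaining, blockSize);
            blocks.add(length);
            remaining -= length;
        }
        return blocks;
    }

    /** Total bytes stored across the cluster once each block has `replication` copies. */
    static long storedBytes(long fileSize, int replication) {
        return fileSize * replication;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // hypothetical 300 MB file
        List<Long> blocks = splitIntoBlocks(fileSize, BLOCK_SIZE);
        System.out.println("number of blocks: " + blocks.size());
        System.out.println("last block bytes: " + blocks.get(blocks.size() - 1));
        System.out.println("bytes stored with replication 3: " + storedBytes(fileSize, 3));
    }
}
```

Under these assumptions a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and a replication factor of 3 means the cluster stores three copies of each of them.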


The NameNode has the following characteristics:

It is the master of the DataNodes and executes file system namespace operations like opening, closing, and renaming files and directories. 

It determines the mapping of blocks to DataNodes and maintains the file system namespace.


The NameNode performs these tasks by maintaining two files:

fsimage_N -- Contains the entire file system namespace, including the mapping of blocks to files and file system properties.

edits_N -- A transaction log that persistently records every change that occurs to file system metadata.
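
How the two files work together can be illustrated with a small simulation: the namespace is rebuilt by loading the last snapshot and replaying every logged change made after it. This is only a sketch under stated assumptions. The class NamespaceRecovery, the Edit type, and the CREATE/DELETE operation names are hypothetical, and the sketch assumes that N is a transaction ID marking how far the image is up to date. The real fsimage and edits files are binary and record far more than path names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of snapshot-plus-log recovery; not the real fsimage/edits format.
final class NamespaceRecovery {

    /** One logged namespace change. Only two hypothetical operations are modelled. */
    static final class Edit {
        final long txId;
        final String op;   // "CREATE" or "DELETE" (illustrative names)
        final String path;

        Edit(long txId, String op, String path) {
            this.txId = txId;
            this.op = op;
            this.path = path;
        }
    }

    /**
     * Rebuilds the namespace: start from the snapshot taken at checkpointTxId,
     * then replay, in order, every edit with a higher transaction ID.
     */
    static Set<String> recover(Set<String> fsimage, long checkpointTxId, List<Edit> edits) {
        Set<String> namespace = new TreeSet<>(fsimage);
        for (Edit e : edits) {
            if (e.txId <= checkpointTxId) {
                continue; // already reflected in the snapshot
            }
            switch (e.op) {
                case "CREATE":
                    namespace.add(e.path);
                    break;
                case "DELETE":
                    namespace.remove(e.path);
                    break;
                default:
                    throw new IllegalArgumentException("unknown op: " + e.op);
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        Set<String> image = new TreeSet<>();
        image.add("/a");
        image.add("/b");
        List<Edit> log = new ArrayList<>();
        log.add(new Edit(9, "CREATE", "/b"));  // older than the checkpoint, skipped
        log.add(new Edit(11, "CREATE", "/c"));
        log.add(new Edit(12, "DELETE", "/a"));
        System.out.println(recover(image, 10, log));
    }
}
```

The design point is that the snapshot keeps startup fast while the log keeps every change durable; periodically merging the log into a new snapshot keeps the log short.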


The DataNodes are responsible for:

Handling read and write requests from application clients.

Performing block creation, deletion, and replication upon instruction from the NameNode.

Sending heartbeats to the NameNode.

Sending a Blockreport to the NameNode.
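
Heartbeats let the NameNode notice when a DataNode stops responding. A minimal sketch of that idea follows, as a toy model rather than NameNode code: HeartbeatMonitor and its methods are hypothetical names, timestamps are passed in explicitly, and the timeout is a parameter instead of any real Hadoop setting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy liveness tracker: a node is considered dead when its last heartbeat
// is older than the timeout. HeartbeatMonitor is a hypothetical name.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that a heartbeat from the given DataNode arrived at nowMillis. */
    void heartbeat(String dataNodeId, long nowMillis) {
        lastSeen.put(dataNodeId, nowMillis);
    }

    /** Returns, in sorted order, the nodes whose last heartbeat is older than the timeout. */
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        Collections.sort(dead);
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(30_000); // illustrative 30 s timeout
        monitor.heartbeat("dn1", 0);
        monitor.heartbeat("dn2", 20_000);
        System.out.println("dead at t=40s: " + monitor.deadNodes(40_000));
    }
}
```

Once a node is considered dead, the blocks it held are under-replicated, which is when the NameNode would instruct other DataNodes to create new replicas.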


Overview of HDFS High Availability (NameNode HA)

Quorum Journal Manager

All Namespace modifications are logged durably to a majority of the JournalNode daemons (hence the name quorum).

As the Standby Node sees the edits in the JournalNodes, it applies them to its own namespace.
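
The "majority" rule can be made concrete with a little arithmetic: with n JournalNodes, an edit counts as durable once floor(n/2) + 1 of them have acknowledged it, so the remaining nodes may fail without losing edits. The sketch below is illustrative only, and JournalQuorum and its method names are hypothetical.

```java
// Toy arithmetic for majority quorums; JournalQuorum is a hypothetical name.
final class JournalQuorum {

    /** Smallest number of JournalNodes that forms a majority of n. */
    static int majority(int journalNodes) {
        if (journalNodes < 1) {
            throw new IllegalArgumentException("need at least one JournalNode");
        }
        return journalNodes / 2 + 1;
    }

    /** True when enough JournalNodes have acknowledged the edit. */
    static boolean isDurable(int acks, int journalNodes) {
        return acks >= majority(journalNodes);
    }

    /** Number of JournalNodes that can be down while writes still succeed. */
    static int toleratedFailures(int journalNodes) {
        return journalNodes - majority(journalNodes);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " JournalNodes: majority=" + majority(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

The numbers show why an odd count is the usual choice: four JournalNodes tolerate one failure, the same as three, while five tolerate two.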

Configuring Automatic Failover

ZKFailoverController(ZKFC) -- A new component that is a ZooKeeper client that monitors and manages the state of a NameNode.


HDFS Commands

ls, du, count, chgrp, chown, chmod, stat, cat, text, tail, get, copyFromLocal, put, copyToLocal, getmerge, mv, cp, mkdir, rm, rm -R, touchz

test -- Checks a path: -e tests whether it exists, -d whether it is a directory, and -z whether the file is zero length.

expunge -- Empties the user’s Trash folder.


The Hadoop Filesystem API

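    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Loads Hadoop settings such as core-site.xml from the classpath.
    Configuration conf = new Configuration();
    // A relative path resolves against the working directory (the user's HDFS home directory by default).
    Path dir = new Path("results");
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(dir)) {   // create the directory only if it is missing
        fs.mkdirs(dir);
    }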

