Hadoop 2.6.0 Study Notes
Hadoop 2 has three core modules.
HDFS:
Responsible for distributed data storage.
Master/slave structure:
Master node (there can be two): NameNode
Slave nodes (many): DataNode
The NameNode is responsible for:
Receiving client requests; it is the entry point for all user operations.
Maintaining the directory structure of the file system, known as the namespace.
- It is the management node of the whole file system: it maintains the file/directory tree, the metadata of every file and directory, and the list of data blocks belonging to each file, and it receives client operation requests.
- Its persistent files include:
- fsimage: the metadata image file, a snapshot of the NameNode's in-memory metadata at a point in time.
- edits: the operation (edit) log file.
- fstime: records the time of the most recent checkpoint.
- These files are stored in the local Linux file system.
The DataNode is responsible for:
- Storing the actual file data.
- Block: the basic storage unit. For a file of length size, the file is split from offset 0 into fixed-size, numbered pieces; each piece is one block. The default HDFS block size is 128 MB, so a 256 MB file has 256/128 = 2 blocks (see the block-info sketch after this list).
- Unlike ordinary file systems, a file in HDFS that is smaller than one block does not occupy a whole block of storage.
- Replication: each block is stored as multiple replicas, three by default.
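A minimal sketch (the path is hypothetical; cluster settings come from the client configuration) showing how the block size, replication factor and block count of a file can be inspected through the Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/hadoopfile");       // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        long blockSize = status.getBlockSize();                // e.g. 134217728 bytes (128 MB)
        short replication = status.getReplication();           // e.g. 3
        long blocks = (status.getLen() + blockSize - 1) / blockSize; // number of blocks, rounded up

        System.out.printf("blockSize=%d replication=%d blocks=%d%n", blockSize, replication, blocks);
        fs.close();
    }
}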
Data storage: staging
When an HDFS client uploads data, it first buffers the data locally. Once the buffered data reaches one block in size, the client asks the NameNode to allocate a block. The NameNode returns the addresses of the DataNodes that will hold the block, and the client then talks to those DataNodes directly, writing the data into a block file on each DataNode.
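A minimal client-side write sketch (hypothetical path and content); the FSDataOutputStream returned by create() buffers the data and requests new blocks from the NameNode as described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // The client just writes into the stream; block allocation and the
        // DataNode pipeline are handled transparently by the output stream.
        try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/demo.txt"))) { // hypothetical path
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}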
The HDFS read path works as follows (a minimal read sketch follows this list):
- 1. The client calls the open method of a FileSystem object, which in practice is a DistributedFileSystem instance.
- 2. DistributedFileSystem obtains the locations of the first batch of blocks from the NameNode over RPC. For each block, one location per replica is returned, and the locations are sorted by the Hadoop network topology so that the DataNode closest to the client comes first.
- 3. The first two steps return an FSDataInputStream, which wraps a DFSInputStream; DFSInputStream manages the data streams to the DataNodes and the NameNode. When the client calls read, DFSInputStream connects to the closest DataNode holding the first block.
- 4. Data streams from the DataNode to the client.
- 5. When the first block has been read, the connection to that DataNode is closed and the next block is read. All of this is transparent to the client, which simply sees one continuous stream.
- 6. When the first batch of blocks has been read, DFSInputStream asks the NameNode for the locations of the next batch and continues; once all blocks have been read, all streams are closed.
- If communication with a DataNode fails while reading, DFSInputStream switches to the next-closest DataNode holding the block being read, records which DataNode failed, and skips it for the remaining blocks. DFSInputStream also verifies block checksums; if it finds a corrupt block, it reports it to the NameNode and then reads a replica of that block from another DataNode.
- The key design point is that the client contacts DataNodes directly to retrieve data, while the NameNode only supplies the best DataNode for each block. Because these block-location requests are served entirely from the NameNode's memory, HDFS can handle a large number of concurrent clients through the DataNode cluster.
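A minimal read sketch under the same assumptions (hypothetical path); open() below returns the FSDataInputStream discussed above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);               // a DistributedFileSystem when fs.defaultFS is hdfs://
        Path file = new Path("/user/hadoop/demo.txt");      // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);                    // data streams from the closest DataNode
            }
        }
        fs.close();
    }
}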
Back on the write path, a packet is removed from the ack queue only after every DataNode in the pipeline has acknowledged receiving it.
HDFS 2 federation
(1) Better HDFS scalability. Several NameNodes each manage part of the namespace, so a cluster can grow to more nodes and the number of stored files is no longer limited by the memory of a single NameNode, as it was in 1.0.
(2) Higher performance. Several NameNodes manage different data and serve clients at the same time, giving users higher aggregate read/write throughput.
(3) Good isolation. Data for different workloads can be assigned to different NameNodes as needed, so workloads barely affect one another.
The usual federation diagram is rather terse and many of the design considerations are not obvious, so here is a brief summary.
The main benefits of this design are:
Existing NameNodes need no configuration changes.
If an existing client only talks to one NameNode, its code and configuration also need no changes.
It provides good scalability while allowing other file systems or applications to use the block storage pools directly.
Unified block storage management keeps resource utilization high.
A degree of file-access isolation can be achieved with firewall configuration alone, without complex Kerberos authentication.
Paths are mapped to NameNodes automatically, which makes federation configuration changes transparent to applications.
MapReduce:
A batch-processing computation model that relies on disk I/O.
Execution steps (a minimal map/reduce sketch follows this list):
1. Map tasks
1.1 Read the input file and parse it into key/value pairs: each line of the input becomes one key/value pair, and the map function is called once per pair.
1.2 In the user-written map logic, process each input key/value pair and emit new key/value pairs.
1.3 Partition the emitted key/value pairs.
1.4 Within each partition, sort and group the data by key, collecting all values with the same key into one group.
1.5 (Optional) Run a combiner on the grouped data to reduce it locally.
2. Reduce tasks
2.1 Copy the outputs of the map tasks over the network to the reduce nodes, one partition per reducer.
2.2 Merge and sort the map outputs, then run the user-written reduce logic on each key and its list of values, emitting new key/value pairs.
2.3 Write the reduce output to files.
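A minimal WordCount-style sketch of these steps using the Hadoop 2 MapReduce API (class names and the tokenization rule are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Steps 1.1/1.2: each input line arrives as (byte offset, line text); emit (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // partitioned, sorted and grouped by the framework (1.3/1.4)
            }
        }
    }
}

// Step 2.2: for each word, sum the grouped values; step 2.3 writes the result to HDFS.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

A driver class would then configure a Job with these classes (setMapperClass, setReducerClass, input and output paths) and submit it to the cluster.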
Master/slave structure (in the classic MapReduce 1 framework):
Master node (only one): JobTracker
Slave nodes (many): TaskTracker
The JobTracker is responsible for:
Receiving computation jobs submitted by clients.
Distributing the work to the TaskTrackers, i.e. task scheduling.
Monitoring how the TaskTrackers execute.
The TaskTrackers are responsible for:
Executing the tasks assigned by the JobTracker.
YARN:
The platform for resource scheduling and management.
Master/slave structure:
Master node (there can be two): ResourceManager
Slave nodes (many): NodeManager
The ResourceManager is responsible for:
Allocating and scheduling cluster resources.
Applications such as MapReduce, Storm and Spark must implement an ApplicationMaster before they can be managed by the RM.
The NodeManager is responsible for:
Managing the resources of a single node.
Hadoop cluster setup guide: http://blog.csdn.NET/yinhaonefu/article/details/43345287
Distributed File System
As data volumes grow beyond what the disks managed by a single operating system can hold, the data has to be spread across the disks of many machines, but this makes administration and maintenance awkward. A system for managing files across multiple machines is therefore needed: the distributed file system.
It is a file system that lets files be shared across multiple hosts over the network, so that users on many machines can share files and storage space.
Transparency: accessing a file over the network looks, to programs and users, just like accessing a local disk.
Fault tolerance: even if some nodes go offline, the system as a whole keeps running without data loss.
There are many distributed file systems; HDFS is only one of them, and it is not well suited to small files.
The HDFS Shell
The bin/hdfs dfs command
appendToFile
Usage: hdfs dfs -appendToFile <localsrc> ... <dst>
Appends one or more local files to the specified file in HDFS; input can also be read from stdin.
· hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile
· hdfs dfs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
· hdfs dfs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
· hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
Exit Code:
Returns 0 on success and 1 on error.
cat
Usage: hdfs dfs -cat URI [URI ...]
Displays the contents of the given files (copies source paths to stdout).
Example:
· hdfs dfs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
· hdfs dfs -cat file:///file3 /user/hadoop/file4
Exit Code:
Returns 0 on success and -1 on error.
chgrp (change group)
Usage: hdfs dfs -chgrp [-R] GROUP URI [URI ...]
Changes the group association of files.
Options
· The -R option will make the change recursively through the directory structure.
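Example (illustrative group and path):
· hdfs dfs -chgrp -R hadoop /user/hadoop/dir1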
chmod
Usage: hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Changes the permissions of files.
Options
· The -R option will make the change recursively through the directory structure.
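Example (illustrative mode and path):
· hdfs dfs -chmod -R 755 /user/hadoop/dir1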
chown
Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
Changes the owner of files.
Options
· The -R option will make the change recursively through the directory structure.
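Example (illustrative owner, group and path):
· hdfs dfs -chown -R hadoop:hadoop /user/hadoop/dir1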
copyFromLocal
Usage: hdfs dfs -copyFromLocal <localsrc> URI
Similar to put command, except that the source is restricted to a local file reference.
Options:
· The -f option will overwrite the destination if it already exists.
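Example (illustrative paths):
· hdfs dfs -copyFromLocal localfile /user/hadoop/hadoopfile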
copyToLocal
Usage: hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file reference.
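Example (illustrative paths):
· hdfs dfs -copyToLocal /user/hadoop/hadoopfile localfile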
count
Usage: hdfs dfs -count [-q] [-h] <paths>
Counts the number of directories, files and bytes under the given paths. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME
The output columns with -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME
The -h option shows sizes in human readable format.
Example:
· hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
· hdfs dfs -count -q hdfs://nn1.example.com/file1
· hdfs dfs -count -q -h hdfs://nn1.example.com/file1
Exit Code:
Returns 0 on success and -1 on error.
cp
Usage: hdfs dfs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>
Copies files or directories; the destination may be overwritten, and the original attributes (permissions and so on) may be preserved.
Options:
· The -f option will overwrite the destination if it already exists.
· The -p option will preserve file attributes [topax] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no arg, then preserves timestamps, ownership, permission. If -pa is specified, then preserves permission also because ACL is a super-set of permission. Determination of whether raw namespace extended attributes are preserved is independent of the -p flag.
Example:
· hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
· hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
du
Usage: hdfs dfs -du [-s] [-h] URI [URI ...]
Displays the sizes of files and directories.
Options:
· The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
· The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864)
Example:
· hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1
Exit Code: Returns 0 on success and -1 on error.
dus
Usage: hdfs dfs -dus <args>
Displays a summary of file lengths.
Note: This command is deprecated. Instead use hdfs dfs -du -s.
expunge
Usage: hdfs dfs -expunge
Empties the trash.
get
Usage: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
· hdfs dfs -get /user/hadoop/file localfile
· hdfs dfs -get hdfs://nn.example.com/user/hadoop/file localfile
Exit Code:
Returns 0 on success and -1 on error.
getfacl
Usage: hdfs dfs -getfacl [-R] <path>
Displays the Access Control Lists (ACLs) of files and directories.
Options:
· -R: List the ACLs of all files and directories recursively.
· path: File or directory to list.
Examples:
· hdfs dfs -getfacl /file
· hdfs dfs -getfacl -R /dir
Exit Code:
Returns 0 on success and non-zero on error.
getfattr
Usage: hdfs dfs -getfattr [-R] -n name | -d [-e en] <path>
Displays the extended attribute names and values (if any) for a file or directory.
Options:
· -R: Recursively list the attributes for all files and directories.
· -n name: Dump the named extended attribute value.
· -d: Dump all extended attribute values associated with pathname.
· -e encoding: Encode values after retrieving them. Valid encodings are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes ("), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
· path: The file or directory.
Examples:
· hdfs dfs -getfattr -d /file
· hdfs dfs -getfattr -R -n user.myAttr /dir
Exit Code:
Returns 0 on success and non-zero on error.
getmerge
Usage: hdfs dfs -getmerge <src> <localdst> [addnl]
Merges the files under src into a single local destination file. The optional addnl flag adds a newline character at the end of each file.
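Example (illustrative paths):
· hdfs dfs -getmerge /user/hadoop/dir1 ./merged.txt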
ls
Usage: hdfs dfs -ls [-R] <args>
Options:
· The -R option will return stat recursively through the directory structure.
For a file returns stat on the file with the following format:
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
For a directory it returns list of its direct children as in Unix. A directory is listed as:
permissions userid groupid modification_date modification_time dirname
Example:
· hdfs dfs -ls /user/hadoop/file1
Exit Code:
Returns 0 on success and -1 on error.
lsr
Usage: hdfs dfs -lsr <args>
Recursive version of ls.
Note: This command is deprecated. Instead use hdfs dfs -ls -R
mkdir
Usage: hdfs dfs -mkdir [-p] <paths>
Takes path uri's as argument and creates directories.
Options:
· The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
· hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
· hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
moveFromLocal
Usage: hdfs dfs -moveFromLocal <localsrc> <dst>
Similar to put command, except that the source localsrc is deleted after it's copied.
moveToLocal
Usage: hdfs dfs -moveToLocal [-crc] <src> <dst>
Displays a "Not implemented yet" message.
mv
Usage: hdfs dfs -mv URI [URI ...] <dest>
Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.
Example:
· hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
· hdfs dfs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
Exit Code:
Returns 0 on success and -1 on error.
put
Usage: hdfs dfs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.
· hdfs dfs -put localfile /user/hadoop/hadoopfile
· hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
· hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
· hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
rm
Usage: hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]
Delete files specified as args.
Options:
· The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
· The -R option deletes the directory and any content under it recursively.
· The -r option is equivalent to -R.
· The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.
Example:
· hdfs dfs -rm hdfs://nn.example.com/file /user/hadoop/emptydir
Exit Code:
Returns 0 on success and -1 on error.
rmr
Usage: hdfs dfs -rmr [-skipTrash] URI [URI ...]
Recursive version of delete.
Note: This command is deprecated. Instead use hdfs dfs -rm -r
setfacl
Usage: hdfs dfs -setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]
Sets Access Control Lists (ACLs) of files and directories.
Options:
· -b: Remove all but the base ACL entries. The entries for user, group and others are retained for compatibility with permission bits.
· -k: Remove the default ACL.
· -R: Apply operations to all files and directories recursively.
· -m: Modify ACL. New entries are added to the ACL, and existing entries are retained.
· -x: Remove specified ACL entries. Other ACL entries are retained.
· --set: Fully replace the ACL, discarding all existing entries. The acl_spec must include entries for user, group, and others for compatibility with permission bits.
· acl_spec: Comma separated list of ACL entries.
· path: File or directory to modify.
Examples:
· hdfs dfs -setfacl -m user:hadoop:rw- /file
· hdfs dfs -setfacl -x user:hadoop /file
· hdfs dfs -setfacl -b /file
· hdfs dfs -setfacl -k /dir
· hdfs dfs -setfacl --set user::rw-,user:hadoop:rw-,group::r--,other::r-- /file
· hdfs dfs -setfacl -R -m user:hadoop:r-x /dir
· hdfs dfs -setfacl -m default:user:hadoop:r-x /dir
Exit Code:
Returns 0 on success and non-zero on error.
setfattr
Usage: hdfs dfs -setfattr -n name [-v value] | -x name <path>
Sets an extended attribute name and value for a file or directory.
Options:
· -b: Remove all but the base ACL entries. The entries for user, group and others are retained for compatibility with permission bits.
· -n name: The extended attribute name.
· -v value: The extended attribute value. There are three different encoding methods for the value. If the argument is enclosed in double quotes, then the value is the string inside the quotes. If the argument is prefixed with 0x or 0X, then it is taken as a hexadecimal number. If the argument begins with 0s or 0S, then it is taken as a base64 encoding.
· -x name: Remove the extended attribute.
· path: The file or directory.
Examples:
· hdfs dfs -setfattr -n user.myAttr -v myValue /file
· hdfs dfs -setfattr -n user.noValue /file
· hdfs dfs -setfattr -x user.myAttr /file
Exit Code:
Returns 0 on success and non-zero on error.
setrep
Usage: hdfs dfs -setrep [-R] [-w] <numReplicas> <path>
Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.
Options:
· The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
· The -R flag is accepted for backwards compatibility. It has no effect.
Example:
· hdfs dfs -setrep -w 3 /user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.
stat
Usage: hdfs dfs -stat URI [URI ...]
Returns the stat information on the path.
Example:
· hdfs dfs -stat path
Exit Code: Returns 0 on success and -1 on error.
tail
Usage: hdfs dfs -tail [-f] URI
Displays last kilobyte of the file to stdout.
Options:
· The -f option will output appended data as the file grows, as in Unix.
Example:
· hdfs dfs -tail pathname
Exit Code: Returns 0 on success and -1 on error.
test
Usage: hdfs dfs -test -[ezd] URI
Options:
· The -e option will check to see if the file exists, returning 0 if true.
· The -z option will check to see if the file is zero length, returning 0 if true.
· The -d option will check to see if the path is directory, returning 0 if true.
Example:
· hdfs dfs -test -e filename
text
Usage: hdfs dfs -text <src>
Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
touchz
Usage: hdfs dfs -touchz URI [URI ...]
Create a file of zero length.
Example:
· hdfs dfs -touchz pathname
Exit Code: Returns 0 on success and -1 on error.