Hadoop Archives


Introduction

Introduced:

Hadoop Archives (HAR files) were introduced in Hadoop 0.18.0.

Purpose:

Packs many small HDFS files into a single archive file, much like zip/rar on Windows or tar on Linux: multiple files are bundled into one.

Why it matters:

HAR exists to relieve the pressure that large numbers of small files put on NameNode memory: every file, directory, and block is a metadata object held in the NameNode's heap, so millions of tiny files can exhaust it.
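The scale of the problem is easy to estimate with a back-of-envelope calculation. The sketch below assumes the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); the exact cost varies by Hadoop version, so treat the numbers as illustrative only:

```python
# Rough estimate of NameNode heap savings from archiving small files.
# Assumption (approximate, version-dependent): ~150 bytes of NameNode
# heap per namespace object (file, directory, or block).
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files, blocks_per_file=1):
    """Approximate NameNode heap used by file objects plus block objects."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

small_files = 1_000_000                 # one million small files, one block each
before = namenode_bytes(small_files)

# After archiving: the NameNode tracks only the HAR's few internal files
# (e.g. part-0, _index, _masterindex), however many blocks they span.
after = namenode_bytes(4, blocks_per_file=8)

print(f"before: {before / 1e6:.0f} MB, after: {after / 1e3:.1f} KB")
```

The point is the ratio, not the exact figures: archiving collapses millions of metadata objects into a handful, regardless of the per-object constant.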

How it works:

A HAR file works by layering a hierarchical file system on top of HDFS: the archive exposes its contents through its own namespace.

A HAR file is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into the archive.

To the client, using a HAR file is transparent: files inside it are accessed through normal paths (via the har:// scheme). On the HDFS side, however, the number of files the NameNode tracks drops.


Read efficiency is lower:

Reading a file through a HAR is no more efficient than reading it directly from HDFS, and in practice it may be slightly slower, because every access to a file in a HAR requires reading two levels of index files before reading the file data itself.

Although HAR files can be used as MapReduce job input, there is no special mechanism that lets maps treat each file packed inside a HAR as an individual HDFS file.
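The two-level lookup can be sketched conceptually. The Python below is hypothetical: its data structures merely stand in for the archive's `_masterindex` and `_index` files (this is not their real on-disk format), but it shows why every read pays for two index lookups before any file data is touched:

```python
# Conceptual sketch of HAR's two-level index lookup.
# Hypothetical structures standing in for _masterindex / _index.

def har_hash(path):
    # Stand-in for the real hash; any deterministic hash works for the sketch.
    return sum(path.encode()) % 1000

# Logical file -> (part file, offset, length) inside the archive.
files = {
    "/a.txt": ("part-0", 0, 120),
    "/b.txt": ("part-0", 120, 300),
    "/c.txt": ("part-0", 420, 50),
}

# "_index": one row per archived file, sorted by path hash.
index = sorted((har_hash(p), p, loc) for p, loc in files.items())

# "_masterindex": splits _index into fixed-size blocks, recording each
# block's first hash and starting row, so only one block must be scanned.
BLOCK = 2
master_index = [(index[i][0], i) for i in range(0, len(index), BLOCK)]

def locate(path):
    """Each access costs two index reads before touching file data."""
    h = har_hash(path)
    # Read 1: _masterindex -> which block of _index holds this hash.
    start = 0
    for first_hash, row in master_index:
        if first_hash <= h:
            start = row
    # Read 2: scan the selected _index block for the exact path.
    for _, p, loc in index[start:start + BLOCK]:
        if p == path:
            return loc          # (part file, offset, length)
    raise FileNotFoundError(path)
```

Only after `locate` returns does a reader open the `part-*` file and seek to the offset, which is the third I/O step the section above refers to.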


Creation command:

hadoop archive -archiveName xxx.har -p /src /dest

General form:

archive -archiveName <NAME>.har -p <parent path> [-r <replication factor>] <src>* <dest>
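For scripted use, the invocation can be assembled programmatically. The `har_create` helper below is a hypothetical convenience wrapper around the syntax above, not part of any Hadoop API; `dry_run` returns the argument list instead of executing it:

```python
import subprocess

def har_create(name, parent, srcs, dest, replication=None, dry_run=False):
    """Build (and optionally run) a `hadoop archive` invocation.

    Mirrors: archive -archiveName <NAME>.har -p <parent> [-r <repl>] <src>* <dest>
    Hypothetical helper for illustration; requires a Hadoop client to run.
    """
    cmd = ["hadoop", "archive", "-archiveName", f"{name}.har", "-p", parent]
    if replication is not None:
        cmd += ["-r", str(replication)]
    cmd += list(srcs) + [dest]
    if dry_run:
        return cmd
    return subprocess.run(cmd, check=True)

# The command used in the walkthrough below: pack /tmp into /temp.har at /.
print(har_create("temp", "/tmp", [], "/", dry_run=True))
```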

Listing command:

hadoop fs -ls -R har:///<path>/xxx.har


Worked example:

Note: only files already in HDFS can be archived; passing a path that is not in HDFS produces an error.


1. hdfs dfs -ls /

drwx------   - hadoop supergroup          0 2016-04-14 22:19 /tmp
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 /wc


2. hadoop archive -archiveName temp.har -p /tmp /

This launches a MapReduce job:

16/08/13 00:41:16 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO client.RMProxy: Connecting to ResourceManager at hello110/192.168.255.130:8032
16/08/13 00:41:18 INFO mapreduce.JobSubmitter: number of splits:1
16/08/13 00:41:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1471019987033_0001
16/08/13 00:41:19 INFO impl.YarnClientImpl: Submitted application application_1471019987033_0001
16/08/13 00:41:19 INFO mapreduce.Job: The url to track the job: http://hello110:8088/proxy/application_1471019987033_0001/
16/08/13 00:41:19 INFO mapreduce.Job: Running job: job_1471019987033_0001
16/08/13 00:41:35 INFO mapreduce.Job: Job job_1471019987033_0001 running in uber mode : false
16/08/13 00:41:35 INFO mapreduce.Job:  map 0% reduce 0%
16/08/13 00:41:57 INFO mapreduce.Job:  map 100% reduce 0%
16/08/13 00:42:21 INFO mapreduce.Job:  map 100% reduce 100%
16/08/13 00:42:23 INFO mapreduce.Job: Job job_1471019987033_0001 completed successfully


3. hdfs dfs -ls /

drwxr-xr-x   - hadoop supergroup          0 2016-08-13 00:42 /temp.har  (newly created)
drwx------   - hadoop supergroup          0 2016-04-14 22:19 /tmp
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 /wc


4. hadoop fs -ls -R har:///temp.har

drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/hadoop
drwxr-xr-x   - hadoop supergroup          0 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging
drwxr-xr-x   - hadoop supergroup          0 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging/har_dj36hy
-rw-r--r--   1 hadoop supergroup       1593 2016-08-13 00:41 har:///temp.har/hadoop-yarn/staging/hadoop/.staging/har_dj36hy/_har_src_files
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/history
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:19 har:///temp.har/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x   - hadoop supergroup          0 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop
-rw-r--r--   1 hadoop supergroup      33303 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001-1460643581404-hadoop-wcount.jar-1460643608082-1-1-SUCCEEDED-default-1460643592087.jhist
-rw-r--r--   1 hadoop supergroup        349 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001.summary
-rw-r--r--   1 hadoop supergroup     115449 2016-04-14 22:20 har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001_conf.xml


5. hdfs dfs -cat har:///temp.har/hadoop-yarn/staging/history/done_intermediate/hadoop/job_1460643564332_0001_conf.xml

<property><name>mapreduce.tasktracker.instrumentation</name><value>org.apache.hadoop.mapred.TaskTrackerMetricsInst</value><source>mapred-default.xml</source><source>job.xml</source></property>
<property><name>io.seqfile.sorter.recordlimit</name><value>1000000</value><source>core-default.xml</source><source>job.xml</source></property>
<property><name>yarn.sharedcache.webapp.address</name><value>0.0.0.0:8788</value><source>yarn-default.xml</source><source>job.xml</source></property>
<property><name>yarn.app.mapreduce.am.resource.mb</name><value>1536</value><source>mapred-default.xml</source><source>job.xml</source></property>
<property><name>mapreduce.framework.name</name><value>yarn</value><source>mapred-site.xml</source><source>job.xml</source></property>
<property><name>mapreduce.job.reduce.slowstart.completedmaps</name><value>0.05</value><source>mapred-default.xml</source><source>job.xml</source></property>
..................... (remaining output truncated) .....................



