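HDFS Distributed File System (2): The HDFS Java Interface

Hadoop study notes. The example project used in this document is at https://git.oschina.net/weiwei02/WHadoop

The HDFS Java interface

Hadoop is written in Java, and every Hadoop filesystem interaction can be invoked through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations. The other filesystem interfaces are most commonly used with HDFS, because the other filesystems in Hadoop generally already have tools for accessing the underlying filesystem, but most of them will work with any Hadoop filesystem.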

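### HTTP interface

There are two ways to access HDFS over HTTP. The first is direct access, where the HDFS daemons themselves serve HTTP requests from clients: the web server embedded in the namenode (port 50070) serves directory listings in XML or JSON format, and file data is streamed from the datanodes' web servers (port 50075). The second is access through a proxy (one or more), where the proxy accesses HDFS on the client's behalf, usually through the DistributedFileSystem API. The port numbers above are the Hadoop 1.x and 2.x defaults; Hadoop 3 changed them to 9870 and 9864.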
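As a rough illustration of the direct route: on Hadoop 2.x the namenode's web server exposes the WebHDFS REST API under `/webhdfs/v1`. The sketch below only builds request URIs and sends nothing. The host name `namenode.example.com`, the paths, and the helper name `webHdfsUri` are hypothetical, and it assumes WebHDFS is enabled on the cluster and that the path needs no URL escaping.

```java
import java.net.URI;

public class WebHdfsUrls {

    /** Builds a WebHDFS request URI for an operation such as LISTSTATUS or OPEN. */
    static URI webHdfsUri(String host, int port, String hdfsPath, String op) {
        // hdfsPath is assumed to be absolute and free of characters that need escaping
        return URI.create("http://" + host + ":" + port + "/webhdfs/v1" + hdfsPath + "?op=" + op);
    }

    public static void main(String[] args) {
        // Hypothetical namenode host; 50070 is the namenode HTTP port mentioned above
        System.out.println(webHdfsUri("namenode.example.com", 50070, "/user/tom", "LISTSTATUS"));
        System.out.println(webHdfsUri("namenode.example.com", 50070, "/user/tom/quangle.txt", "OPEN"));
    }
}
```

Requesting the first URI with any HTTP client returns the directory listing as JSON. The OPEN request is answered with a redirect to a datanode, which then streams the file data.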

### Java interface
#### 1. Reading data from a Hadoop URL


    InputStream in = null;
    try {
        in = new URL("hdfs://host/path").openStream();
        // process in
    } finally {
        IOUtils.closeStream(in);
    }

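Getting a Java program to recognize Hadoop's hdfs URL scheme takes a little extra work. The approach used here is to call the setURLStreamHandlerFactory method of java.net.URL with an instance of FsUrlStreamHandlerFactory. This method can be called at most once per Java virtual machine, so it is usually executed in a static block. The limitation means that if some other part of the program, such as a third-party component outside your control, has already set a URLStreamHandlerFactory, you will not be able to use this approach to read data from Hadoop. The next section discusses an alternative.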
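The once-per-JVM restriction comes from java.net.URL itself rather than from Hadoop, so it can be shown with the JDK alone. In this minimal sketch a do-nothing factory stands in for FsUrlStreamHandlerFactory; the class name `HandlerFactoryOnce` and the helper `tryInstall` are made up for illustration.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

public class HandlerFactoryOnce {

    /** Tries to install a factory; returns true if the JVM accepted it. */
    static boolean tryInstall(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error alreadyDefined) {
            // java.net.URL signals a second call by throwing java.lang.Error
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in factory: returning null lets the JDK fall back to its built-in handlers
        URLStreamHandlerFactory standIn = protocol -> null;
        System.out.println(tryInstall(standIn));
        System.out.println(tryInstall(standIn));
    }
}
```

In a fresh JVM the first call succeeds and the second is rejected, which is exactly the situation a program runs into when a third-party component has installed its own factory first.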

Example 1: Displaying a file from a Hadoop filesystem on standard output, similar to the UNIX cat command

    package cn.weiwei.WHadoop.hdfs;

    import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
    import org.apache.hadoop.io.IOUtils;

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     *
     * Displays a file from a Hadoop filesystem on standard output using a URLStreamHandler
     */
    public class URLCat {

        static {
            // May be called at most once per JVM
            URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        }

        public void catFile(String[] args) throws IOException {
            InputStream inputStream = null;
            try {
                inputStream = new URL(args[0]).openStream();
                // 4096 = buffer size; false = do not close the streams after copying
                IOUtils.copyBytes(inputStream, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(inputStream);
            }
        }
    }

A JUnit test that runs it:

    package cn.weiwei.WHadoop.hdfs;

    import org.junit.Test;

    import static org.junit.Assert.*;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     */
    public class URLCatTest {

        @Test
        public void catFile() throws Exception {
            String[] a = {"hdfs://"};
            URLCat urlCat = new URLCat();
            urlCat.catFile(a);
        }
    }

Output:

    2017-02-06 18:02:23,560 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

    Process finished with exit code 0

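#### 2. Reading data using the FileSystem API

We can also open an input stream for a file by using the FileSystem API.

A file in a Hadoop filesystem is represented by a Hadoop Path object (not a java.io.File object, whose semantics are too closely tied to the local filesystem). You can think of a Path as a Hadoop filesystem URI, such as hdfs://.

A FileSystem instance is obtained through one of the following static factory methods:

    public static FileSystem get(Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf, String user) throws IOException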

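A Configuration object encapsulates a client's or server's configuration, which is set using configuration files read from the classpath (such as conf/core-site.xml).

To get a local filesystem instance, use the convenience method getLocal():

    public static LocalFileSystem getLocal(Configuration conf) throws IOException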
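In the factory methods that take a URI, the scheme and authority of the URI decide which filesystem is returned, and a URI with neither falls back to the default filesystem from the configuration. The JDK-only sketch below just shows which parts of a URI play which role; the host `namenode.example.com`, port 8020, and the file path are hypothetical.

```java
import java.net.URI;

public class HdfsUriParts {

    /** Splits a URI into the parts that identify the filesystem and the part that names the file. */
    static String describe(String uri) {
        URI u = URI.create(uri);
        return "scheme=" + u.getScheme() + ", authority=" + u.getAuthority() + ", path=" + u.getPath();
    }

    public static void main(String[] args) {
        // prints: scheme=hdfs, authority=namenode.example.com:8020, path=/user/tom/quangle.txt
        System.out.println(describe("hdfs://namenode.example.com:8020/user/tom/quangle.txt"));
        // prints: scheme=null, authority=null, path=/user/tom/quangle.txt
        System.out.println(describe("/user/tom/quangle.txt"));
    }
}
```

This is why the examples in this document pass `URI.create(uri)` to `FileSystem.get()` and then reuse the same string to build the Path.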

With a FileSystem instance in hand, we call open() to get the input stream for a file:

    public FSDataInputStream open(Path f) throws IOException
    public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

The first method uses a default buffer size of 4 KB.

Example 2: Displaying a file from a Hadoop filesystem on standard output by using FileSystem directly

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.io.InputStream;
    import java.net.URI;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     * Displays a file from a Hadoop filesystem on standard output by using FileSystem directly
     */
    public class FileSystemCat {

        public void cat(String uri) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
            InputStream inputStream = null;
            try {
                inputStream = fileSystem.open(new Path(uri));
                IOUtils.copyBytes(inputStream, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(inputStream);
            }
        }
    }

A JUnit test that runs it:

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import org.junit.Test;

    import static org.junit.Assert.*;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     */
    public class FileSystemCatTest {

        @Test
        public void cat() throws Exception {
            FileSystemCat fileSystemCat = new FileSystemCat();
            fileSystemCat.cat("hdfs://");
        }
    }

Output:

    2017-02-06 18:52:19,298 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

    Process finished with exit code 0

##### The FSDataInputStream object

The open() method on FileSystem returns an FSDataInputStream rather than a standard java.io class. It is a specialization of java.io.DataInputStream that supports random access, so you can read from any part of the stream:

    package org.apache.hadoop.fs;

    import java.io.*;
    import java.nio.ByteBuffer;
    import java.util.EnumSet;

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;
    import org.apache.hadoop.io.ByteBufferPool;
    import org.apache.hadoop.fs.ByteBufferUtil;
    import org.apache.hadoop.util.IdentityHashStore;

    /** Utility that wraps a {@link FSInputStream} in a {@link DataInputStream}
     * and buffers input through a {@link BufferedInputStream}. */
    @InterfaceAudience.Public
    @InterfaceStability.Stable
    public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable,
          ByteBufferReadable, HasFileDescriptor, CanSetDropBehind, CanSetReadahead,
          HasEnhancedByteBufferAccess, CanUnbuffer {
    }


The Seekable interface permits seeking to a position in the file and provides a query method, getPos(), for the current offset from the start of the file:

    /**
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    package org.apache.hadoop.fs;

    import java.io.*;

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    /**
     *  Stream that permits seeking.
     */
    @InterfaceAudience.Public
    @InterfaceStability.Evolving
    public interface Seekable {
      /**
       * Seek to the given offset from the start of the file.
       * The next read() will be from that location.  Can't
       * seek past the end of the file.
       */
      void seek(long pos) throws IOException;

      /**
       * Return the current offset from the start of the file
       */
      long getPos() throws IOException;

      /**
       * Seeks a different copy of the data.  Returns true if
       * found a new source, false otherwise.
       */
      @InterfaceAudience.Private
      boolean seekToNewSource(long targetPos) throws IOException;
    }



Example 3: Displaying a file from a Hadoop filesystem on standard output twice, by using seek()

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.net.URI;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     * Displays a file from a Hadoop filesystem on standard output twice, by using seek()
     */
    public class FileSystemDoubleCat {

        public void cat(String uri) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream inputStream = null;
            try {
                inputStream = fileSystem.open(new Path(uri));
                IOUtils.copyBytes(inputStream, System.out, 4096, false);
                inputStream.seek(0); // go back to the start of the file
                IOUtils.copyBytes(inputStream, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(inputStream);
            }
        }
    }

Output:

    2017-02-06 19:22:07,472 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

    Process finished with exit code 0

FSDataInputStream also implements the PositionedReadable interface, for reading part of a file starting at a given offset:

    package org.apache.hadoop.fs;

    import java.io.*;

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    /** Stream that permits positional reading. */
    @InterfaceAudience.Public
    @InterfaceStability.Evolving
    public interface PositionedReadable {
      /**
       * Read upto the specified number of bytes, from a given
       * position within a file, and return the number of bytes read. This does not
       * change the current offset of a file, and is thread-safe.
       */
      public int read(long position, byte[] buffer, int offset, int length)
        throws IOException;

      /**
       * Read the specified number of bytes, from a given
       * position within a file. This does not
       * change the current offset of a file, and is thread-safe.
       */
      public void readFully(long position, byte[] buffer, int offset, int length)
        throws IOException;

      /**
       * Read number of bytes equal to the length of the buffer, from a given
       * position within a file. This does not
       * change the current offset of a file, and is thread-safe.
       */
      public void readFully(long position, byte[] buffer) throws IOException;
    }
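The difference between seeking and positional reading can be tried out without a cluster. The sketch below is only a JDK analogue and uses no Hadoop classes: `FileChannel.position(long)` plays the role of `seek()`, and `FileChannel.read(ByteBuffer, long)` plays the role of the positional `read()`, which likewise leaves the current offset alone. The class name, helper names, and temporary file are assumptions made for the illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalReadSketch {

    /** Reads up to length bytes starting at position without moving the channel's own offset. */
    static String readAt(FileChannel channel, long position, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        // Loop because a single read may return fewer bytes than requested
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, position + buffer.position());
            if (n < 0) {
                break; // end of file reached
            }
        }
        return new String(buffer.array(), 0, buffer.position(), StandardCharsets.UTF_8);
    }

    /** Seeks to offset 3, reads 9 bytes at offset 18, and reports the text plus the offset afterwards. */
    static String demo() throws IOException {
        Path file = Files.createTempFile("quangle", ".txt");
        try {
            Files.write(file, "On the top of the Crumpetty Tree".getBytes(StandardCharsets.UTF_8));
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                channel.position(3);                  // analogous to seek(3)
                String text = readAt(channel, 18, 9); // analogous to read(position, ...)
                return text + " " + channel.position();
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints "Crumpetty 3": the offset is still 3
    }
}
```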




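#### 3. Writing data

The FileSystem class has several methods for creating a file. The simplest takes a Path for the file to be created and returns an output stream to write to:


    public FSDataOutputStream create(Path path) throws IOException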



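There is also an overloaded version that takes a Progressable callback interface, so that your application can be notified of the progress of the data being written to the datanodes:

    package org.apache.hadoop.util;

    public interface Progressable {
        public void progress();
    }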
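To make the callback idea concrete without a cluster, here is a minimal JDK-only sketch of a copy loop that reports progress to its caller. `ProgressCallback` is a hypothetical stand-in with the same shape as Progressable, and calling it once per chunk is an assumption of this sketch; when and how often Hadoop itself invokes progress() is an internal detail.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ProgressCopySketch {

    /** Hypothetical stand-in with the same shape as Hadoop's Progressable. */
    interface ProgressCallback {
        void progress();
    }

    /** Copies in to out, invoking the callback once per chunk written; returns the chunk count. */
    static int copy(InputStream in, OutputStream out, int chunkSize, ProgressCallback callback)
            throws IOException {
        byte[] buffer = new byte[chunkSize];
        int chunks = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            callback.progress(); // tell the caller that more data has been written
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10_000]);
        OutputStream out = new ByteArrayOutputStream();
        // 10,000 bytes in 4096-byte chunks means three callbacks, so three dots are printed
        copy(in, out, 4096, () -> System.out.print("."));
        System.out.println();
    }
}
```

The anonymous Progressable passed to create() in the copy example of this section works the same way: it prints a period each time Hadoop reports progress.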


As an alternative to creating a new file, you can append to an existing file with append():

    public FSDataOutputStream append(Path path) throws IOException



Example 4: Copying a local file to HDFS, showing progress

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.util.Progressable;

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     * Copies a local file to the HDFS filesystem
     */
    public class FileCopyWithProgress {

        public void copyFileToHDFS(String localSrc, String dst) throws Exception {
            InputStream inputStream = new BufferedInputStream(new FileInputStream(localSrc));
            Configuration configuration = new Configuration();
            FileSystem fileSystem = FileSystem.get(URI.create(dst), configuration);
            OutputStream outputStream = fileSystem.create(new Path(dst), new Progressable() {
                @Override
                public void progress() {
                    System.out.print(".");
                }
            });
            // true = close both streams once the copy is complete
            IOUtils.copyBytes(inputStream, outputStream, 4096, true);
        }
    }

The shared path constants and a JUnit test that runs the copy:

    package cn.weiwei.WHadoop.hdfs;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     */
    public class HDFSPathTest {
        public static String HDFS_ROOT_PATH = "hdfs://";
        public static String HDFS_WORKSPACE_PATH = "hdfs://";
    }

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import cn.weiwei.WHadoop.hdfs.HDFSPathTest;
    import org.junit.Test;

    import static org.junit.Assert.*;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     */
    public class FileCopyWithProgressTest extends HDFSPathTest {

        @Test
        public void copyFileToHDFS() throws Exception {
            FileCopyWithProgress fileCopyWithProgress = new FileCopyWithProgress();
            fileCopyWithProgress.copyFileToHDFS("/media/weiwei/office/workspace/IntelliJIDEA/WHadoop/input/docs/1400-8.txt",
                    HDFS_WORKSPACE_PATH + "/input/docs/1400-8.txt");
        }
    }

Output:

    2017-02-06 21:08:06,062 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    ..................

    Process finished with exit code 0


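##### The FSDataOutputStream object

The create() method on FileSystem returns an FSDataOutputStream which, like FSDataInputStream, has a method for querying the current position in the file:

    package org.apache.hadoop.fs;

    public class FSDataOutputStream extends DataOutputStream implements Syncable {

        public long getPos() throws IOException {
            // implementation elided
        }

        // implementation elided
    }

Unlike FSDataInputStream, however, FSDataOutputStream does not permit seeking, because HDFS allows only sequential writes to an open file or appends to an already written file.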


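##### Directories

FileSystem provides a method for creating a directory. It also creates any parent directories that do not already exist:

    public boolean mkdirs(Path path) throws IOException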


##### Querying the filesystem
1. File metadata

The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information. The getFileStatus() method on FileSystem returns the FileStatus object for a single file or directory.

Example 5: Showing file status information

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import cn.weiwei.WHadoop.hdfs.HDFSPathTest;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.OutputStream;

    import static org.hamcrest.MatcherAssert.assertThat;
    import static org.hamcrest.core.Is.is;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-6
     * Shows file status information
     */
    public class ShowFileStatusTest extends HDFSPathTest {

        private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing
        private FileSystem fs;

        @Before
        public void setUp() throws IOException {
            Configuration conf = new Configuration();
            if (System.getProperty("test.build.data") == null) {
                System.setProperty("test.build.data", "/tmp");
            }
            cluster = new MiniDFSCluster(conf, 1, true, null);
            fs = cluster.getFileSystem();
            OutputStream out = fs.create(new Path("/dir/file"));
            out.write("content".getBytes("UTF-8"));
            out.close();
        }

        @After
        public void tearDown() throws IOException {
            if (fs != null) { fs.close(); }
            if (cluster != null) { cluster.shutdown(); }
        }

        @Test(expected = FileNotFoundException.class)
        public void throwsFileNotFoundForNonExistentFile() throws IOException {
            fs.getFileStatus(new Path("no-such-file"));
        }

        @Test
        public void fileStatusForFile() throws IOException {
            Path file = new Path("/dir/file");
            FileStatus stat = fs.getFileStatus(file);
            assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
            assertThat(stat.isDir(), is(false));
            assertThat(stat.getLen(), is(7L)); // "content" is 7 bytes
            // assertThat(stat.getModificationTime(),
            //         is(lessThanOrEqualTo(System.currentTimeMillis())));
            assertThat(stat.getReplication(), is((short) 1));
            // assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));
            assertThat(stat.getOwner(), is(System.getProperty("user.name")));
            assertThat(stat.getGroup(), is("supergroup"));
            assertThat(stat.getPermission().toString(), is("rw-r--r--"));
        }

        @Test
        public void fileStatusForDirectory() throws IOException {
            Path dir = new Path("/dir");
            FileStatus stat = fs.getFileStatus(dir);
            assertThat(stat.getPath().toUri().getPath(), is("/dir"));
            assertThat(stat.isDir(), is(true));
            assertThat(stat.getLen(), is(0L));
            // assertThat(stat.getModificationTime(),
            //         is(lessThanOrEqualTo(System.currentTimeMillis())));
            assertThat(stat.getReplication(), is((short) 0));
            assertThat(stat.getBlockSize(), is(0L));
            assertThat(stat.getOwner(), is(System.getProperty("user.name")));
            assertThat(stat.getGroup(), is("supergroup"));
            assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
        }
    }


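2. Listing files

Finding information about a single file or directory is useful, but you often also need to list the contents of a directory. That is what FileSystem's listStatus() methods are for:

    public FileStatus[] listStatus(Path path) throws IOException
    public FileStatus[] listStatus(Path path, PathFilter filter) throws IOException
    public FileStatus[] listStatus(Path[] paths) throws IOException
    public FileStatus[] listStatus(Path[] paths, PathFilter filter) throws IOException

When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in that directory. The variants that take a PathFilter restrict which files and directories are matched, and those that take an array of paths return the combined results for all of the paths in a single array.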


Example 6: Showing the file statuses for a collection of paths in a Hadoop filesystem

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    import java.io.IOException;
    import java.net.URI;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-12
     * Shows the file statuses for a collection of paths in a Hadoop filesystem
     */
    public class ListStatus {

        public void listStatus(String[] args) throws IOException {
            String uri = args[0];
            Configuration configuration = new Configuration();
            FileSystem fileSystem = FileSystem.get(URI.create(uri), configuration);

            Path[] paths = new Path[args.length];
            for (int i = 0; i < paths.length; i++) {
                paths[i] = new Path(args[i]);
            }

            FileStatus[] fileStatuses = fileSystem.listStatus(paths);
            // Turn the array of FileStatus objects into an array of Path objects
            Path[] listedPaths = FileUtil.stat2Paths(fileStatuses);
            for (Path path : listedPaths) {
                System.out.println(path);
            }
        }
    }

A JUnit test that runs it:

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import cn.weiwei.WHadoop.hdfs.HDFSPathTest;
    import org.junit.Test;

    import static org.junit.Assert.*;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-12
     */
    public class ListStatusTest extends HDFSPathTest {

        @Test
        public void listStatus() throws Exception {
            ListStatus listStatus = new ListStatus();
            listStatus.listStatus(new String[]{HDFS_ROOT_PATH + "/", HDFS_WORKSPACE_PATH + "/"});
        }
    }

Output:

    2017-02-12 15:34:19,230 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    hdfs://

    Process finished with exit code 0

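3. File patterns

It is common to process a set of files in a single operation. Rather than enumerating each file and directory, you can match several files with one glob expression. Hadoop provides two FileSystem methods for this:

    public FileStatus[] globStatus(Path pathPattern) throws IOException
    public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

globStatus() returns an array of FileStatus objects whose paths match the supplied pattern, and the optional PathFilter can restrict the matches further. Hadoop supports the same set of glob characters as the UNIX bash shell.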

4. PathFilter objects

Glob patterns are not always powerful enough to describe the set of files you want. For example, it is not generally possible to exclude one particular file with a glob pattern. The listStatus() and globStatus() methods therefore optionally take a PathFilter, which gives programmatic control over matching:

    package org.apache.hadoop.fs;

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    @InterfaceAudience.Public
    @InterfaceStability.Stable
    public interface PathFilter {
      /**
       * Tests whether or not the specified abstract pathname should be
       * included in a pathname list.
       *
       * @param  path  The abstract pathname to be tested
       * @return  <code>true</code> if and only if <code>pathname</code>
       *          should be included
       */
      boolean accept(Path path);
    }

Example 7: A PathFilter for excluding paths that match a regular expression

    package cn.weiwei.WHadoop.hdfs.filesystem;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    /**
     * @author WangWeiwei
     * @version 1.0
     * @since 17-2-12
     * Excludes paths that match a regular expression
     */
    public class RegexExcludePathFilter implements PathFilter {

        private final String regex;

        public RegexExcludePathFilter(String regex) {
            this.regex = regex;
        }

        @Override
        public boolean accept(Path path) {
            // String.matches() succeeds only if the whole path string matches the regex
            return !path.toString().matches(regex);
        }
    }
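One detail of the filter above is easy to miss: `String.matches()` tests the whole string, so the regular expression has to cover the entire path. The JDK-only sketch below applies the same test to plain strings; the sample paths and the class name are hypothetical and stand in for what a glob such as `/2007/*/*` might return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegexExcludeSketch {

    /** Applies the same test as RegexExcludePathFilter.accept(): keep a path unless the WHOLE string matches. */
    static List<String> excludeMatching(List<String> paths, String regex) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (!path.matches(regex)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical paths
        List<String> paths = Arrays.asList("/2007/12/30", "/2007/12/31");
        System.out.println(excludeMatching(paths, "^.*/2007/12/31$")); // [/2007/12/30]
        System.out.println(excludeMatching(paths, "12/31"));           // [/2007/12/30, /2007/12/31]
    }
}
```

The second call excludes nothing because `12/31` matches only part of each path. With the real classes, the equivalent call would pass `new RegexExcludePathFilter("^.*/2007/12/31$")` as the second argument of `globStatus()`.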

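##### Deleting data

Use the delete() method on FileSystem to permanently remove files or directories:

    public boolean delete(Path path, boolean recursive) throws IOException

If the path is a file or an empty directory, the value of recursive is ignored. A nonempty directory is deleted, along with its contents, only if recursive is true; otherwise an IOException is thrown.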


