Using HanLP for Chinese Word Segmentation in Spark


1. Upload HanLP's data directory (containing the dictionaries and models) to HDFS, then set root in the project's hanlp.properties file to the HDFS path that contains it, for example:

    root=hdfs://localhost:9000/tmp/
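
If the data directory is not on HDFS yet, it can be uploaded with the standard Hadoop client (hadoop fs -put). Below is a minimal programmatic sketch using Hadoop's FileSystem API instead; the local path /path/to/hanlp/data and the class name UploadHanlpData are placeholders.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadHanlpData {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
            // Recursively copies the local data directory (dictionaries and models)
            // under the root configured in hanlp.properties, i.e. /tmp/data on HDFS.
            fs.copyFromLocalFile(new Path("/path/to/hanlp/data"), new Path("/tmp/data"));
            fs.close();
        }
    }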

2. Implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface so that HanLP reads and writes through HDFS instead of the local file system:

    // Imports required in the enclosing file:
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import com.hankcs.hanlp.corpus.io.IIOAdapter;

    public static class HadoopFileIoAdapter implements IIOAdapter {
        @Override
        public InputStream open(String path) throws IOException {
            // Resolve the file system from the path's URI and open the file on HDFS.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(path), conf);
            return fs.open(new Path(path));
        }

        @Override
        public OutputStream create(String path) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(path), conf);
            return fs.create(new Path(path));
        }
    }

3. Set the IOAdapter and create the segmenter:

    import com.hankcs.hanlp.HanLP;
    import com.hankcs.hanlp.seg.Segment;
    import com.hankcs.hanlp.seg.CRF.CRFSegment;

    private static Segment segment;

    static {
        // The adapter must be installed before constructing the segmenter,
        // because CRFSegment loads its model (here from HDFS) at creation time.
        HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
        segment = new CRFSegment();
    }

After that, segment can be called inside Spark transformations to tokenize text.
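
As a minimal end-to-end sketch, the snippet below applies the segmenter inside a map over an RDD of lines. It assumes the static block from step 3 sits in the same class; the SegmentJob class name and the input/output paths are illustrative, while segment.seg() and the Term.word field are HanLP's actual API.

    import java.util.stream.Collectors;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SegmentJob {
        // ... the static `segment` field and static block from step 3 go here ...

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("hanlp-segmentation"));

            // Hypothetical input: one document per line.
            JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/input/docs.txt");

            // The first use of `segment` on an executor runs the static
            // initializer there, loading the model from HDFS via the adapter;
            // nothing needs to be serialized from the driver.
            JavaRDD<String> tokenized = lines.map(line ->
                    segment.seg(line).stream()
                            .map(term -> term.word)
                            .collect(Collectors.joining(" ")));

            tokenized.saveAsTextFile("hdfs://localhost:9000/output/segmented");
            sc.stop();
        }
    }

Because the segmenter is built lazily in a static block rather than shipped from the driver, each executor JVM constructs its own copy from the shared HDFS dictionaries.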
