Hadoop and Spark Study Diary 2
Source: Internet · Editor: 程序博客网 · Date: 2024/06/16 16:49
1. TF-IDF [1]
Explanation:
(1) Term frequency (TF) = (number of times a term appears in a document) / (total number of terms in the document).
(2) Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the term + 1)).
(3) TF-IDF = TF × IDF.
Note: a term's TF-IDF is proportional to how often it appears in the document and inversely proportional to how many documents in the corpus contain it. Automatic keyword extraction therefore computes the TF-IDF of every term in a document, sorts the terms in descending order, and takes the top few.
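The three formulas above can be checked with a small self-contained sketch. This is plain Java over a made-up toy corpus, not Mahout's implementation:

```java
import java.util.*;

public class TfIdf {
    // TF = occurrences of the term in the document / total terms in the document
    static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // IDF = log(total documents / (documents containing the term + 1))
    static double idf(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / (containing + 1));
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("spark", "runs", "on", "yarn"),
            Arrays.asList("hadoop", "yarn", "schedules", "containers"),
            Arrays.asList("mahout", "clusters", "documents"));
        List<String> doc = corpus.get(0);
        // "yarn" appears in two of three documents, so its TF-IDF is
        // lower than that of "spark", which appears in only one.
        System.out.printf("spark: %.4f%n", tfIdf(doc, corpus, "spark"));
        System.out.printf("yarn:  %.4f%n", tfIdf(doc, corpus, "yarn"));
    }
}
```

This also shows the effect of the "+1" in the IDF denominator: a term that appears in every document gets log(N/(N+1)) < 0 rather than a division-by-zero for an unseen term.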
2. clusteredPoints
Explanation: The clusteredPoints directory contains the final mapping from cluster IDs to document IDs.
3. public abstract class AbstractJob extends org.apache.hadoop.conf.Configured implements
org.apache.hadoop.util.Tool
Explanation: Superclass of many Mahout Hadoop "jobs". A job drives the configuration and launch of one or more maps and reduces in order to accomplish some task.
4. protected void addInputOption()
Explanation: Add the default input directory option, '-i', which takes a directory name as an argument. When parseArguments(String[]) is called, the inputPath will be set based upon the value for this option. If this method is called, the input is required.
5. protected void addOutputOption()
Explanation: Add the default output directory option, '-o', which takes a directory name as an argument. When parseArguments(String[]) is called, the outputPath will be set based upon the value for this option. If this method is called, the output is required.
6. public static org.apache.commons.cli2.builder.DefaultOptionBuilder distanceMeasureOption()
Explanation: Returns a default command line option for specifying the distance measure class to use. Used by Canopy, FuzzyKmeans, Kmeans, MeanShift.
7. protected org.apache.commons.cli2.Option addOption(org.apache.commons.cli2.Option option)
Explanation: Add an arbitrary option to the set of options this job will parse when parseArguments(String[]) is called.
8. protected void addOption(String name, String shortName, String description)
Explanation: Add an option to the set of options this job will parse when parseArguments(String[]) is called.
9. protected void addOption(String name, String shortName, String description, boolean required)
Explanation: Same as above; the required flag indicates whether the option must be supplied.
10. protected void addOption(String name, String shortName, String description, String defaultValue)
Explanation: Same as above; defaultValue is used when the option is not supplied on the command line.
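The addOption/parseArguments pattern of items 7–10 can be illustrated with a self-contained sketch. This is plain Java mimicking the behaviour for illustration only; it is not Mahout code, and the class and method names below are made up for this example:

```java
import java.util.*;

// Minimal illustration of the AbstractJob option pattern: register options
// (name, short name, required flag, default value), then parse
// "--name value" / "-shortName value" arguments into a map.
public class MiniJobOptions {
    static class Opt {
        final String name, shortName, defaultValue;
        final boolean required;
        Opt(String name, String shortName, boolean required, String defaultValue) {
            this.name = name; this.shortName = shortName;
            this.required = required; this.defaultValue = defaultValue;
        }
    }

    private final List<Opt> opts = new ArrayList<>();

    void addOption(String name, String shortName, boolean required, String defaultValue) {
        opts.add(new Opt(name, shortName, required, defaultValue));
    }

    // Like addInputOption()/addOutputOption(): once added, '-i'/'-o' are required.
    void addInputOption()  { addOption("input", "i", true, null); }
    void addOutputOption() { addOption("output", "o", true, null); }

    Map<String, String> parseArguments(String[] args) {
        Map<String, String> values = new HashMap<>();
        for (int k = 0; k + 1 < args.length; k += 2) {
            for (Opt o : opts) {
                if (args[k].equals("--" + o.name) || args[k].equals("-" + o.shortName)) {
                    values.put(o.name, args[k + 1]);
                }
            }
        }
        // Apply defaults, then enforce required options.
        for (Opt o : opts) {
            if (!values.containsKey(o.name)) {
                if (o.defaultValue != null) values.put(o.name, o.defaultValue);
                else if (o.required) throw new IllegalArgumentException("missing --" + o.name);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        MiniJobOptions job = new MiniJobOptions();
        job.addInputOption();
        job.addOutputOption();
        job.addOption("distanceMeasure", "dm", false, "SquaredEuclidean");
        Map<String, String> parsed =
            job.parseArguments(new String[] {"-i", "/in", "-o", "/out"});
        // Unsupplied optional arguments fall back to their defaults.
        System.out.println(parsed.get("input") + " " + parsed.get("distanceMeasure"));
    }
}
```

The real AbstractJob delegates to Apache Commons CLI2 option objects, but the control flow — register options first, then have parseArguments fill in values and enforce required ones — is the same.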
11. WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Explanation: upload the Spark jars to HDFS once and point spark.yarn.jars at them, so the client stops re-uploading them on every submission:
(1) hdfs dfs -mkdir -p /spark/jars
(2) hdfs dfs -put $SPARK_HOME/jars/* /spark/jars
(3) In spark-defaults.conf, set: spark.yarn.jars hdfs:///spark/jars/*
12. Running ./bin/spark-shell --master yarn --deploy-mode client fails with the following errors [2]:
(1)Diagnostics: Container [pid=10458,containerID=container_1501381238319_0003_02_000001] is running
beyond virtual memory limits. Current usage: 178.2 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual
memory used. Killing container.
(2)ERROR spark.SparkContext: Error initializing SparkContext.
Explanation: edit yarn-site.xml as follows:
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
Note: the first property disables YARN's virtual-memory check, so containers are no longer killed for exceeding their virtual-memory limit. The second raises the allowed ratio of virtual memory to physical memory; the default is 2.1.
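The figures in the diagnostic line follow directly from that default ratio: a container allocated 1 GB of physical memory gets a 1 × 2.1 = 2.1 GB virtual-memory limit, and the 2.3 GB actually in use exceeds it. A quick check of the arithmetic (plain Java; the memory figures are taken from the error message above):

```java
public class VmemCheck {
    // YARN's virtual-memory limit is the physical allocation times the
    // yarn.nodemanager.vmem-pmem-ratio setting.
    static double vmemLimitGb(double physicalGb, double ratio) {
        return physicalGb * ratio;
    }

    public static void main(String[] args) {
        double usedGb = 2.3; // virtual memory the container actually used
        System.out.println("limit at ratio 2.1: " + vmemLimitGb(1.0, 2.1)
            + " GB -> killed: " + (usedGb > vmemLimitGb(1.0, 2.1)));
        System.out.println("limit at ratio 4:   " + vmemLimitGb(1.0, 4.0)
            + " GB -> killed: " + (usedGb > vmemLimitGb(1.0, 4.0)));
    }
}
```

So raising the ratio to 4 gives the same 1 GB container a 4 GB virtual-memory budget, which comfortably covers the 2.3 GB observed.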
References:
[1] The application of TF-IDF and cosine similarity: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
[2] Scheduling and isolation of memory and CPU resources in Hadoop YARN: http://dongxicheng.org/mapreduce-nextgen/hadoop-yarn-memory-cpu-scheduling/