Hadoop and Spark Study Notes 2

Source: Internet. Editor: 程序博客网. Time: 2024/06/16 16:49

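1. TF-IDF (term frequency-inverse document frequency)

Explanation:

(1) Term frequency (TF) = number of times a term appears in the document / total number of terms in the document.

(2) Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the term + 1)).

(3) TF-IDF = TF * IDF.

Note: TF-IDF grows with the number of times a term appears in a document and shrinks as the term becomes more common across the language as a whole (estimated from the corpus). An automatic keyword extraction algorithm therefore computes the TF-IDF value of every term in the document, sorts the terms in descending order, and takes the top few.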
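The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function names tf_idf and top_keywords and the toy corpus are made up for this example, documents are assumed to be pre-tokenized lists of words, and the logarithm is taken as the natural log (the formula above does not fix the base).

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Return {term: TF-IDF score} for one tokenized document.

    corpus is a list of tokenized documents (hypothetical toy data below).
    """
    counts = Counter(doc_tokens)
    total_terms = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        # TF = occurrences of the term / total terms in the document
        tf = count / total_terms
        # IDF = log(total documents / (documents containing the term + 1))
        docs_with_term = sum(1 for d in corpus if term in d)
        idf = math.log(n_docs / (docs_with_term + 1))
        scores[term] = tf * idf
    return scores

def top_keywords(doc_tokens, corpus, k=3):
    """Sort terms by TF-IDF in descending order and keep the first k."""
    scores = tf_idf(doc_tokens, corpus)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy corpus, assumed for illustration only.
corpus = [
    ["spark", "runs", "on", "yarn"],
    ["hadoop", "runs", "on", "hdfs"],
    ["mahout", "runs", "on", "hadoop"],
    ["kmeans", "clusters", "documents"],
]
doc = corpus[0]
print(top_keywords(doc, corpus, k=2))
```

In this toy corpus, "runs" and "on" appear in three of the four documents, so their IDF is log(4/4) = 0 and they drop out as keywords, while "spark" and "yarn" appear only once and rank highest. Note that with the +1 in the denominator, a term that occurs in every document gets a slightly negative IDF.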

2. clusteredPoints

Explanation: The clusteredPoints directory contains the final mapping from cluster IDs to document IDs.

3. public abstract class AbstractJob extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool

Explanation: Superclass of many Mahout Hadoop "jobs". A job drives configuration and launch of one or more maps and reduces in order to accomplish some task.

4. protected void addInputOption()

Explanation: Add the default input directory option, '-i', which takes a directory name as an argument. When parseArguments(String[]) is called, the inputPath will be set based upon the value for this option. If this method is called, the input is required.

5. protected void addOutputOption()

Explanation: Add the default output directory option, '-o', which takes a directory name as an argument. When parseArguments(String[]) is called, the outputPath will be set based upon the value for this option. If this method is called, the output is required.

6. public static org.apache.commons.cli2.builder.DefaultOptionBuilder distanceMeasureOption()

Explanation: Returns a default command line option for specification of the distance measure class to use. Used by Canopy, FuzzyKmeans, Kmeans, MeanShift.

7. protected org.apache.commons.cli2.Option addOption(org.apache.commons.cli2.Option option)

Explanation: Add an arbitrary option to the set of options this job will parse when parseArguments(String[]) is called.

8. protected void addOption(String name, String shortName, String description)

Explanation: Add an option to the set of options this job will parse when parseArguments(String[]) is called.

9. protected void addOption(String name, String shortName, String description, boolean required)

Explanation: Add an option to the set of options this job will parse when parseArguments(String[]) is called. The required flag marks whether the option must be supplied.

10. protected void addOption(String name, String shortName, String description, String defaultValue)

Explanation: Add an option to the set of options this job will parse when parseArguments(String[]) is called. The defaultValue is used when the option is not supplied.

11. WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

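Explanation: Upload the Spark jars to HDFS once and point spark.yarn.jars at them, so that they are not uploaded again from SPARK_HOME on every submission:

(1) hdfs dfs -mkdir -p /spark/jars

(2) hdfs dfs -put $SPARK_HOME/jars/* /spark/jars

(3) Set the property, for example in conf/spark-defaults.conf: spark.yarn.jars hdfs:///spark/jars/*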

12. Errors when starting ./bin/spark-shell --master yarn --deploy-mode client [2]

(1) Diagnostics: Container [pid=10458,containerID=container_1501381238319_0003_02_000001] is running beyond virtual memory limits. Current usage: 178.2 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.

(2) ERROR spark.SparkContext: Error initializing SparkContext.

Explanation: Edit yarn-site.xml as follows:

<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>
<property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4</value>
</property>

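Note: The first setting controls whether the NodeManager enforces virtual memory limits on containers (the default is true). Setting it to false disables the check, so a container is no longer killed for exceeding its virtual memory limit. The second setting raises the ratio of virtual memory to physical memory allowed for a container. The default is 2.1, which is why the container above, with 1 GB of physical memory, was limited to 2.1 GB of virtual memory.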
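The arithmetic behind the error message and the two settings can be checked with a small sketch. This is a simplified model, not YARN's actual code: the function names vmem_limit_gb and would_be_killed are hypothetical, and the model only assumes that the virtual memory limit equals the container's physical memory multiplied by yarn.nodemanager.vmem-pmem-ratio.

```python
def vmem_limit_gb(pmem_gb, vmem_pmem_ratio=2.1):
    """Virtual memory limit = physical memory * vmem-pmem ratio (simplified)."""
    return pmem_gb * vmem_pmem_ratio

def would_be_killed(vmem_used_gb, pmem_gb, vmem_pmem_ratio=2.1, check_enabled=True):
    """True if the container exceeds its virtual memory limit and the check is on."""
    if not check_enabled:
        # Models yarn.nodemanager.vmem-check-enabled=false: no limit is enforced.
        return False
    return vmem_used_gb > vmem_limit_gb(pmem_gb, vmem_pmem_ratio)

# Numbers from the diagnostics above: 2.3 GB virtual memory used, 1 GB physical.
print(would_be_killed(2.3, 1.0))                       # default ratio 2.1
print(would_be_killed(2.3, 1.0, vmem_pmem_ratio=4))    # ratio raised to 4
print(would_be_killed(2.3, 1.0, check_enabled=False))  # check disabled
```

With the default ratio the limit is 2.1 GB, so 2.3 GB of usage trips the check. Raising the ratio to 4 lifts the limit to 4 GB, and disabling the check removes the limit altogether, so either change on its own would stop this particular container from being killed.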

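References:

[1] Applications of TF-IDF and Cosine Similarity (TF-IDF与余弦相似性的应用): http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

[2] Scheduling and Isolation of Memory and CPU Resources in Hadoop YARN (Hadoop YARN中内存和CPU两种资源的调度和隔离): http://dongxicheng.org/mapreduce-nextgen/hadoop-yarn-memory-cpu-scheduling/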
