Hadoop and Spark Study Notes 2

Source: Internet  Editor: 程序博客网  Time: 2024/06/16 16:49
1. TF-IDF (term frequency-inverse document frequency)

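Analysis:

(1) Term frequency (TF) = number of times the term appears in the document / total number of terms in the document.

(2) Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the term + 1)).

(3) TF-IDF = TF * IDF.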

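Note: a term's TF-IDF grows with the number of times the term occurs in the document and shrinks as the term becomes more common across the corpus, that is, as more documents contain it. An automatic keyword-extraction algorithm therefore computes the TF-IDF value of every term in the document, sorts the terms in descending order, and takes the first few.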
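The three formulas and the keyword-extraction step above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names tf_idf and top_keywords and the toy corpus are made up for this example, documents are assumed to be already split into terms, and the natural logarithm is assumed because the notes do not fix a base.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: TF-IDF} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total_terms = len(doc)
        scores.append({
            # TF = count / total terms; IDF = log(N / (df + 1)).
            term: (count / total_terms) * math.log(n_docs / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=3):
    """Sort terms by descending TF-IDF and keep the first k."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]


# Hypothetical toy corpus, used only for illustration.
corpus = [
    ["bee", "bee", "honey", "china"],
    ["china", "economy"],
    ["china", "travel", "guide"],
    ["china", "history"],
]
scores = tf_idf(corpus)
print(top_keywords(scores[0], k=2))  # ['bee', 'honey']
```

In this toy corpus "bee" ranks first because it is frequent in the first document and absent from the others. Because of the +1 in the denominator, a term that occurs in every document ("china" here) ends up with a slightly negative IDF, so it sorts to the bottom.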


2. clusteredPoints

Analysis: the clusteredPoints directory contains the final mapping from cluster ID to document ID.


3. public abstract class AbstractJob extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool

Analysis: Superclass of many Mahout Hadoop "jobs". A job drives configuration and launch of one or more maps and reduces in order to accomplish some task.


4. protected void addInputOption()

Analysis: Add the default input directory option, '-i', which takes a directory name as an argument. When parseArguments(String[]) is called, the inputPath will be set based upon the value for this option. If this method is called, the input is required.


5. protected void addOutputOption()

Analysis: Add the default output directory option, '-o', which takes a directory name as an argument. When parseArguments(String[]) is called, the outputPath will be set based upon the value for this option. If this method is called, the output is required.


6. public static org.apache.commons.cli2.builder.DefaultOptionBuilder distanceMeasureOption()

Analysis: Returns a default command line option for specification of distance measure class to use. Used by Canopy, FuzzyKmeans, Kmeans, MeanShift.


7. protected org.apache.commons.cli2.Option addOption(org.apache.commons.cli2.Option option)

Analysis: Add an arbitrary option to the set of options this job will parse when parseArguments(String[]) is called.


8. protected void addOption(String name, String shortName, String description)

Analysis: Add an option to the set of options this job will parse when parseArguments(String[]) is called.


9. protected void addOption(String name, String shortName, String description, boolean required)

Analysis: Add an option to the set of options this job will parse when parseArguments(String[]) is called. The required parameter states whether the option must be supplied.


10. protected void addOption(String name, String shortName, String description, String defaultValue)

Analysis: Add an option to the set of options this job will parse when parseArguments(String[]) is called. The defaultValue parameter is the value used when the option is not given on the command line.


11. WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

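Analysis: upload the Spark jars to HDFS once and point spark.yarn.jars at them, so that they are not uploaded again from SPARK_HOME on every submission.

(1) Create the target directory on HDFS: hdfs dfs -mkdir -p /spark/jars

(2) Upload the jars: hdfs dfs -put $SPARK_HOME/jars/* /spark/jars

(3) Add this line to conf/spark-defaults.conf: spark.yarn.jars        hdfs:///spark/jars/*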


12. Errors when starting ./bin/spark-shell --master yarn --deploy-mode client [2]

(1) Diagnostics: Container [pid=10458,containerID=container_1501381238319_0003_02_000001] is running beyond virtual memory limits. Current usage: 178.2 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.

(2) ERROR spark.SparkContext: Error initializing SparkContext.

Analysis: edit yarn-site.xml as follows:

<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>
<property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4</value>
</property>
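Note: the first setting controls whether the NodeManager enforces virtual memory limits on containers. Setting it to false disables the check, so a container is no longer killed for exceeding its virtual memory limit. The second setting raises the allowed ratio of virtual memory to physical memory for a container. The default is 2.1, which is why the 1 GB container above was limited to 2.1 GB of virtual memory.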
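The arithmetic behind the diagnostics message can be checked with a short Python sketch. It assumes the limit is simply the container's physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio. The function names are made up for this example, and the NodeManager's real check is more involved than a single comparison.

```python
def vmem_limit_mb(pmem_mb, vmem_pmem_ratio=2.1):
    """Virtual memory ceiling: physical allocation times the configured ratio."""
    return pmem_mb * vmem_pmem_ratio


def exceeds_vmem_limit(vmem_used_mb, pmem_mb, vmem_pmem_ratio=2.1):
    """True if the container uses more virtual memory than the ratio allows."""
    return vmem_used_mb > vmem_limit_mb(pmem_mb, vmem_pmem_ratio)


# Values taken from the diagnostics above: 1 GB physical, 2.3 GB virtual used.
pmem_mb = 1024
vmem_used_mb = 2.3 * 1024

print(exceeds_vmem_limit(vmem_used_mb, pmem_mb))       # True: above the 2.1 GB limit
print(exceeds_vmem_limit(vmem_used_mb, pmem_mb, 4))    # False: the limit is now 4 GB
```

With the default ratio the 2.3 GB of virtual memory is over the limit. With a ratio of 4 the same container fits, even if the check is left enabled.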


References:

[1] Applications of TF-IDF and cosine similarity: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

[2] Scheduling and isolation of memory and CPU resources in Hadoop YARN: http://dongxicheng.org/mapreduce-nextgen/hadoop-yarn-memory-cpu-scheduling/
