NLP01: A small example of a Chinese word cloud with Python's wordcloud


[Image: word cloud]

The image above was generated from the poem below.

"When You Are Old", William Butler Yeats

When you are old and grey and full of sleep,
And nodding by the fire, take down this book,
And slowly read, and dream of the soft look
Your eyes had once, and of their shadows deep;

How many loved your moments of glad grace,
And loved your beauty with love false or true,
But one man loved the pilgrim soul in you,
And loved the sorrows of your changing face;

And bending down beside the glowing bars,
Murmur, a little sadly, how love fled
And paced upon the mountains overhead
And hid his face amid a crowd of stars.

Summary: this post only covers installing wordcloud and running a small demo; it may help beginners get started.

1. Installation

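There are many ways to build a word cloud, but in my opinion Python's wordcloud package is the most capable: it can shape the cloud with a custom image.
Documentation: https://amueller.github.io/word_cloud/
GitHub: https://github.com/amueller/word_cloud
Install: pip install wordcloud
Or download the package from http://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud and then install it.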
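
After installing, it can be useful to confirm that the packages are importable before running the code in section 5. The following is a minimal sketch using only the standard library; the helper name is_installed is made up for this illustration and is not part of wordcloud.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # jieba and matplotlib are used by the code in section 5
    for name in ("wordcloud", "jieba", "matplotlib"):
        status = "ok" if is_installed(name) else "missing, try: pip install " + name
        print(name, "->", status)
```

This only checks that the import system can locate each package; it does not import them or check their versions.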

2. A look at the API

In the API, the WordCloud class is the central class.

class wordcloud.WordCloud(font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=0.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='black', max_font_size=None, font_step=1, mode='RGB', relative_scaling=0.5, regexp=None, collocations=True, colormap=None, normalize_plurals=True)

font_path : string
    Font path to the font that will be used (OTF or TTF). Defaults to DroidSansMono path on a Linux machine. If you are on another OS or don't have this font, you need to adjust this path.
    [Author's note: on Windows 7 this has to be changed, otherwise Chinese text comes out garbled.]
width : int (default=400)
    Width of the canvas.
height : int (default=200)
    Height of the canvas.
prefer_horizontal : float (default=0.90)
    The ratio of times to try horizontal fitting as opposed to vertical. If prefer_horizontal < 1, the algorithm will try rotating the word if it doesn't fit. (There is currently no built-in way to get only vertical words.)
mask : nd-array or None (default=None)
scale : float (default=1)
    Scaling between computation and drawing. For large word-cloud images, using scale instead of larger canvas size is significantly faster, but might lead to a coarser fit for the words.
min_font_size : int (default=4)
    Smallest font size to use. Will stop when there is no more room in this size.
font_step : int (default=1)
    Step size for the font. font_step > 1 might speed up computation but give a worse fit.
max_words : number (default=200)
    The maximum number of words.
stopwords : set of strings or None
    The words that will be eliminated. If None, the built-in STOPWORDS list will be used.
background_color : color value (default="black")
    Background color for the word cloud image.
max_font_size : int or None (default=None)
    Maximum font size for the largest word. If None, height of the image is used.
mode : string (default="RGB")
    Transparent background will be generated when mode is "RGBA" and background_color is None.
relative_scaling : float (default=.5)
    Importance of relative word frequencies for font-size. With relative_scaling=0, only word-ranks are considered. With relative_scaling=1, a word that is twice as frequent will have twice the size. If you want to consider the word frequencies and not only their rank, relative_scaling around .5 often looks good.
color_func : callable, default=None
    Callable with parameters word, font_size, position, orientation, font_path, random_state that returns a PIL color for each word. Overwrites "colormap". See colormap for specifying a matplotlib colormap instead.
regexp : string or None (optional)
    Regular expression to split the input text into tokens in process_text. If None is specified, r"\w[\w']+" is used.
collocations : bool, default=True
    Whether to include collocations (bigrams) of two words.
colormap : string or matplotlib colormap, default="viridis"
    Matplotlib colormap to randomly draw colors from for each word. Ignored if "color_func" is specified.
normalize_plurals : bool, default=True
    Whether to remove trailing 's' from words. If True and a word appears with and without a trailing 's', the one with trailing 's' is removed and its counts are added to the version without trailing 's', unless the word ends with 'ss'.
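
The regexp default deserves attention for Chinese text. The pattern r"\w[\w']+" quoted above only matches tokens of at least two characters, so under that default single-character words never reach the cloud. The sketch below uses the standard re module to mimic only the tokenizing step; it is not the library's full process_text logic (stopwords, collocations and plural normalization are left out), the helper name tokenize is made up, and the sample strings are invented for illustration. Other releases of wordcloud may use a different default.

```python
import re

# Default token pattern quoted in the WordCloud documentation above
DEFAULT_PATTERN = r"\w[\w']+"


def tokenize(text, pattern=DEFAULT_PATTERN):
    """Split text into tokens the way a regexp-based tokenizer would."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    # English: the one-letter words "I" and "a" do not match the pattern
    print(tokenize("I don't like a cat"))
    # Chinese, already segmented and joined with spaces:
    # the single-character word "脚" is dropped, longer words are kept
    print(tokenize("脚 抽筋 怎么办"))
    # A pattern that also accepts single-character tokens
    print(tokenize("脚 抽筋 怎么办", pattern=r"\w+"))
```

If single-character words matter for a given text, passing a pattern such as r"\w+" as the regexp argument is one option to try.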

3. Image

Image file name: mask_png.png
[Image: mask_png.png]

4. Chinese test document

Title: 脚抽筋怎么办 (What to do about a foot cramp)
URL: http://health.china.com/html/jiankang/jijiuzhinan/richangjijiu/201603/26-328450.html

5. Code

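# -*- coding: utf-8 -*-
import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud


def doWordcloud():
    # Read the Chinese test document
    with open('test.txt', 'r', encoding='UTF-8') as f:
        comment_text = f.read()
    # Segment the text with jieba and join the words with spaces,
    # so that WordCloud can split the string into tokens
    cut_text = " ".join(jieba.cut(comment_text))
    # Load the mask image as a NumPy array
    color_mask = np.array(Image.open("mask_png.png"))
    cloud = WordCloud(
        # Set a font with Chinese glyphs; without one the text is garbled.
        # On Windows 7 the installed fonts can be found in C:\Windows\Fonts
        font_path="simsun.ttc",
        mask=color_mask,
        max_words=200,
        max_font_size=80,
        width=1000,
        height=1000
    )
    word_cloud = cloud.generate(cut_text)  # generate the word cloud
    # word_cloud.to_file("pic.jpg")  # save the image
    plt.imshow(word_cloud)
    plt.axis('off')
    plt.show()


if __name__ == "__main__":
    doWordcloud()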

Notes: test.txt contains the text of the article 脚抽筋怎么办;
mask_png.png is the picture of the little girl shown above.
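
The code above hands a space-joined string to generate(), which tokenizes and counts the words internally. An alternative, sketched here under assumptions, is to count the segmented words yourself and pass the counts to generate_from_frequencies(). The helper names word_frequencies and render_cloud, the sample stopword set and the output file name are hypothetical; wordcloud is imported lazily so that the counting part runs without it.

```python
from collections import Counter


def word_frequencies(tokens, stopwords=frozenset(), min_len=2):
    """Count tokens, skipping stopwords and tokens shorter than min_len."""
    counts = Counter()
    for token in tokens:
        token = token.strip()
        if len(token) < min_len or token in stopwords:
            continue
        counts[token] += 1
    return dict(counts)


def render_cloud(frequencies, font_path, out_path):
    """Draw a word cloud from a {word: count} mapping and save it to a file."""
    from wordcloud import WordCloud  # third-party; imported only when rendering

    cloud = WordCloud(font_path=font_path, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


if __name__ == "__main__":
    # Stand-in for list(jieba.cut(text)); these tokens are made up
    tokens = ["脚", "抽筋", "怎么办", ",", "抽筋", "的", "原因"]
    freqs = word_frequencies(tokens, stopwords={"怎么办"})
    print(freqs)
    # render_cloud(freqs, "simsun.ttc", "pic.png")  # needs wordcloud and the font
```

Counting explicitly makes it easy to apply a Chinese stopword list and to decide whether single-character words should be kept (set min_len=1 to keep them).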

6. Result

[Image: the resulting word cloud]

[Author: happyprince; http://blog.csdn.net/ld326/article/details/78341147]
