@@ -146,6 +146,8 @@ You also need to save charles Root Certificate, it also contains in the same men
<div class="col-box-title">Newest Posts</div>
<ul class="post-list">
<li><a class="post-link" href="/archivers/generate-word-cloud-with-chinese-fenci">Generate Word Cloud Figures with Chinese-Tokenization and WordCloud python libraries</a></li>
<p>Recently, I set up a web-based RSS client for retrieving and organizing everyday news. I used <a href="https://tt-rss.org/">TinyTinyRSS</a>, also known as ttrss, a popular RSS client that is Docker-friendly. Thanks to developer <a href="https://ttrss.henry.wang/#about">HenryQW</a>, a well-written Nginx-based Docker configuration is already available on Docker Hub. As I added more feeds, I found that some of them do not need to be checked every day. So I decided to write a script that automatically lists the keywords that appeared over a recent period and generates a heat-map-style figure from them.</p>
<p>Before going further, here is a general overview of my setup.</p>
<p>My first step is to read all text-based information from TTRSS’s PostgreSQL database. From that text, I use a Chinese NLP library, <a href="https://github.com/fxsjy/jieba">jieba</a>, to extract the keywords and their occurrence frequencies. <a href="https://github.com/amueller/word_cloud">WordCloud</a>, a Python library, then generates and displays the word cloud figure. Later sections cover the details.</p>
<p>My first idea was to generate a keyword heat map for the past week of economy news. Since this post focuses on Chinese tokenization and drawing the word cloud figure, I’ll just leave my database code here for reference. The SQL connector I used is <a href="https://pypi.org/project/psycopg2/">psycopg2</a>, an easy-to-use PostgreSQL library.</p>
<span class="n">cur</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">'SELECT content FROM public.ttrss_entries </span><span class="se">\
</span><span class="s"> where date_updated > now() - interval </span><span class="se">\'</span><span class="s">1 week</span><span class="se">\'</span><span class="s"> AND id in ( </span><span class="se">\
</span><span class="s"> select int_id from DB_TABLE_NAME </span><span class="se">\
</span><span class="s"> where feed_id='</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="nb">id</span><span class="p">)</span><span class="o">+</span><span class="s">' </span><span class="se">\
<p>Most arguments are intuitive and easy to understand. The only exception is the argument of the function <em>get_1w_of_feed_byid</em>: this <strong>id</strong> is the index of a feed in my subscriptions.</p>
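<p>The query above builds its SQL by concatenating <em>str(id)</em> into the string. A minimal sketch of the same lookup with a bound parameter instead, assuming a DB-API cursor such as the one psycopg2 provides. The function name <em>fetch_week_of_feed</em> is hypothetical, DB_TABLE_NAME is kept as the placeholder from the original, and the query is assumed to end where the visible part ends:</p>

```python
# Sketch only: the query tail is not visible in the original, so the
# closing parenthesis below is an assumption.
QUERY = """
SELECT content FROM public.ttrss_entries
WHERE date_updated > now() - interval '1 week'
  AND id IN (
      SELECT int_id FROM DB_TABLE_NAME
      WHERE feed_id = %s
  )
"""


def fetch_week_of_feed(cur, feed_id):
    """Return the content of last week's entries for one feed.

    `cur` is any DB-API cursor (for example from psycopg2). The driver
    binds `feed_id` to the %s placeholder, so no manual quoting or
    string concatenation is needed.
    """
    cur.execute(QUERY, (feed_id,))
    return [row[0] for row in cur.fetchall()]
```

<p>Passing the value as a parameter leaves quoting to the driver, which also avoids shadowing the Python built-in <em>id</em>.</p>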
<h2 id="tokenize-with-frequency">Tokenize with frequency</h2>
<p>I tried two popular tokenization libraries and chose <a href="https://github.com/fxsjy/jieba">jieba</a> after a few comparisons. Before cutting the sentences, we first need to remove all punctuation marks.</p>
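<p>One possible way to strip the punctuation, as a sketch rather than the exact code used here: a regular expression that removes every character that is not a letter, digit, or CJK character. The helper name <em>strip_punctuation</em> is hypothetical.</p>

```python
import re

# \W matches anything that is not a Unicode word character, so CJK
# characters, Latin letters, and digits survive while punctuation and
# whitespace are removed. The underscore counts as a word character,
# so it is listed explicitly.
_NON_WORD = re.compile(r"[\W_]+")


def strip_punctuation(text):
    """Return `text` with punctuation, whitespace, and underscores removed."""
    return _NON_WORD.sub("", text)


print(strip_punctuation("新冠病毒,北京!Hello, world."))
```

<p>Note that this also removes spaces, so English words run together. That is acceptable when the text is mostly Chinese, but it may not suit mixed-language feeds.</p>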
<p>Once we have a string that contains only characters, we can call jieba. The function <em>jieba.posseg.cut</em>, with or without paddle mode, gives us a list of words together with their “part of speech”. As you can see in the following code, I also did two more things.</p>
<p>First, in the if statement, I kept only nouns from certain categories. Category abbreviations such as “nr” and “ns” represent different parts of speech; the categories I used are listed in the following table. More details are available at this <a href="https://github.com/fxsjy/jieba">link</a>.</p>
<p>Second, I kept only words that are at least 2 characters long. Chinese, unlike Latin writing systems, has no spaces between words. As a result, some single-character words, such as conjunctions, are easily misrecognized as proper nouns, which fills the result with single characters that are not real keywords. I am not able to improve the NLP method itself, so I used a simple fix: dropping every word shorter than 2 characters.</p>
<span class="n">words</span> <span class="o">=</span> <span class="n">pseg</span><span class="p">.</span><span class="n">cut</span><span class="p">(</span><span class="n">content</span><span class="p">)</span> <span class="c1"># Invoking jieba.posseg.cut function
</span><span class="k">if</span> <span class="n">flag</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'nr'</span><span class="p">,</span> <span class="s">'ns'</span><span class="p">,</span> <span class="s">'nt'</span><span class="p">,</span> <span class="s">'nw'</span><span class="p">,</span> <span class="s">'nz'</span><span class="p">,</span> <span class="s">'PER'</span><span class="p">,</span> <span class="s">'ORG'</span><span class="p">,</span> <span class="s">'x'</span><span class="p">]:</span> <span class="c1"># LOC
<p>With all the words extracted, we can easily calculate their frequencies. After that, we can use the following line of code to print a sorted result and verify correctness.</p>
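<p>The filtering and counting steps can be sketched as follows. This is an illustration, not the post’s exact code: <em>count_keywords</em> is a hypothetical helper, the flag list is copied from the if statement above, and the sample pairs (including their tags) are made up to stand in for the output of <em>jieba.posseg.cut</em>, whose items unpack into a word and a flag in the same way.</p>

```python
from collections import Counter

# Part-of-speech flags kept by the if statement in the post.
KEPT_FLAGS = {"nr", "ns", "nt", "nw", "nz", "PER", "ORG", "x"}


def count_keywords(pairs, min_len=2):
    """Count words whose flag is kept and whose length is at least min_len."""
    counts = Counter()
    for word, flag in pairs:
        if flag in KEPT_FLAGS and len(word) >= min_len:
            counts[word] += 1
    return counts


# Made-up (word, flag) pairs standing in for real tokenizer output.
sample = [("新冠", "nz"), ("北京", "ns"), ("新冠", "nz"), ("的", "uj"), ("京", "ns")]
freq = count_keywords(sample)

# Print the keywords from most to least frequent.
for word, n in freq.most_common():
    print(word, n)
```

<p>A <em>Counter</em> is a dict subclass, so the result can be passed anywhere a keyword-to-frequency dictionary is expected.</p>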
<p>With a dictionary that maps each keyword to its frequency, we can simply call built-in functions of the wordcloud library to generate the figure.</p>
<p>First, we need to create an instance of the WordCloud class. As you can see in my code, I set 6 parameters: the width and height of the canvas, the maximum number of words used to generate the figure, the font, the background color, and the margin between any two words.</p>
<p>With the instance ready, we call the function <em>generate_from_frequencies</em> and pass the keyword dictionary to it. The result can be rendered as a bitmap image, which we can plot on screen with <a href="https://matplotlib.org/">matplotlib</a>.</p>
<p>I tested my plot on the Ubuntu subsystem on Windows 10. Unfortunately, matplotlib under the subsystem depends on an X11 server, which is not available on Windows by default, so we need to install one. <a href="https://sourceforge.net/projects/xming/">Xming</a> is the one I used.</p>
<p>This word cloud reflects the most frequent keywords in economy news for the week starting 06-28-2020. The two largest words in the figure are “新冠” and “新冠病毒”, both of which mean “Covid-19” (that week fell during the second Covid wave in Beijing, China). The image is sized to fit my phone screen, and I can use an app to automatically sync it as my phone’s wallpaper. However, too many location nouns appear in this image, which is something I can improve in the future.</p>
</article>
<div class="post-comments">
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'codersherlockblog'; // required: replace example with your forum shortname
<li><a class="post-link" href="/archivers/some-of-my-previews-exper-work">Some of my previews experiment works: 2016</a></li>