The R package dplyr: data cleaning and tidying

Source: Internet. Editor: 程序博客网. Time: 2024/04/20 22:38
<div id="article_content" class="article_content">
<p><span style="font-size:14px"><span style="font-family:SimSun">The dplyr package is mainly used for data cleaning and tidying. Coursera course link: </span><a target="_blank" href="https://class.coursera.org/getdata-017" style="font-family:SimSun">Getting and Cleaning Data</a></span></p>
<p><span style="font-family:SimSun; font-size:14px">You can also load the swirl package, install the Getting and Cleaning Data course, and follow along interactively.</span></p>
<p><span style="font-family:SimSun; font-size:14px">As follows:</span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">library(swirl)
install_from_swirl(&quot;Getting and Cleaning Data&quot;)
swirl()</pre><br>
<p></p>
<p><span style="font-family:SimSun; font-size:14px">This article is mainly based on the introductory vignette that ships with the package: <a target="_blank" href="http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">Introduction to dplyr</a></span></p>
<p><span style="font-family:SimSun; font-size:14px">1. Example data</span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">&gt; library(nycflights13)
&gt; dim(flights)
[1] 336776     16
&gt; head(flights, 3)
Source: local data frame [3 x 16]
  year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
1 2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227
2 2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227
3 2013     1   1      542         2      923        33      AA  N619AA   1141    JFK  MIA      160
Variables not shown: distance (dbl), hour (dbl), minute (dbl)</pre><br>
2. Wrap a large data frame as a tbl_df so that it prints in a friendlier, truncated form
<p></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">&gt; flights_df &lt;- tbl_df(flights)
&gt; flights_df</pre>
<p></p>
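<p><span style="font-family:SimSun; font-size:14px">A note for newer installations, offered as a sketch rather than as part of the original walkthrough: tbl_df() has been deprecated since dplyr 1.0.0 in favour of as_tibble(). Assuming a recent dplyr, the same wrapping step might look like this. The built-in mtcars data set is used here only as a stand-in for an ordinary data frame, and cars_tbl is an illustrative name.</span></p>
<pre name="code" class="html">library(dplyr)

# mtcars is a plain data.frame that ships with R; it stands in for any data frame
cars_tbl &lt;- as_tibble(mtcars)

# The result carries the tbl_df class, so printing shows only the first rows
# and the columns that fit on screen
class(cars_tbl)
cars_tbl</pre>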
<p><span style="font-family:SimSun; font-size:14px"><br>
</span></p>
<p><span style="font-family:SimSun; font-size:14px">3. Filter rows with filter()</span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">&gt; filter(flights_df, month == 1, day == 1)
Source: local data frame [842 x 16]
   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
1  2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227
2  2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227</pre>This keeps only the rows where month is 1 and day is 1.
<p></p>
<p>The base R equivalent:</p>
<p></p>
<pre name="code" class="html">flights_df[flights_df$month == 1 &amp; flights_df$day == 1, ]</pre><br>
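<p><span style="font-family:SimSun; font-size:14px">The two forms agree on complete data, but they treat missing values differently: filter() drops rows where the condition evaluates to NA, while logical subsetting with [ returns a row of NAs for each of them. The small data frame below is made up purely to illustrate this; toy, kept_dplyr and kept_base are hypothetical names.</span></p>
<pre name="code" class="html">library(dplyr)

# Hypothetical data: the third row has a missing month
toy &lt;- data.frame(month = c(1, 1, NA, 2), day = c(1, 2, 1, 1))

# filter() keeps only rows where the condition is TRUE
kept_dplyr &lt;- filter(toy, month == 1, day == 1)

# Base subsetting turns the NA condition into an extra all-NA row
kept_base &lt;- toy[toy$month == 1 &amp; toy$day == 1, ]

nrow(kept_dplyr)  # 1
nrow(kept_base)   # 2</pre>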
4. Select rows by position with slice()
<p></p>
<p></p>
<pre name="code" class="html">slice(flights_df, 1:10)</pre><br>
5. Sort rows with arrange()
<p></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">&gt;arrange(flights_df, year, month, day)</pre>将flights_df数据按照year,month,day的升序排列。
<p></p>
<p><span style="font-family:SimSun; font-size:14px">Use desc() to sort a column in descending order:</span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">&gt;arrange(flights_df, year, desc(month), day)</pre>R语言当中的自带函数
<p></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">flights_df[order(flights_df$year, flights_df$month, flights_df$day), ]
flights_df[order(desc(flights_df$arr_delay)), ]</pre>
<p></p>
<p><span style="font-family:SimSun; font-size:14px"><br>
</span></p>
6. Select columns with select()
<p><span style="font-family:SimSun; font-size:14px">Pick the columns you need by name:<br>
</span></p>
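<pre name="code" class="html">select(flights_df, year, month, day)</pre>This selects three columns.<br>
Use the : operator to select a range of adjacent columns:<br>
<pre name="code" class="html">select(flights_df, year:day)</pre>Use - to drop the columns you do not want: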
<p></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">select(flights_df, -(year:day))</pre><br>
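<p><span style="font-family:SimSun; font-size:14px">Beyond names and ranges, select() also accepts helper functions such as starts_with(), and rename() changes a column name while keeping every other column. The sketch below uses a small made-up data frame (toy, dep_cols and renamed are hypothetical names) so that it runs without nycflights13.</span></p>
<pre name="code" class="html">library(dplyr)

# Hypothetical two-row extract with a few flights-like columns
toy &lt;- data.frame(year = 2013, month = 1, day = 1:2,
                  dep_time = c(517, 533), dep_delay = c(2, 4),
                  tailnum = c("N14228", "N24211"))

# Keep every column whose name starts with "dep"
dep_cols &lt;- select(toy, starts_with("dep"))
names(dep_cols)  # "dep_time" "dep_delay"

# Rename tailnum to tail_num; all other columns are kept as they are
renamed &lt;- rename(toy, tail_num = tailnum)
names(renamed)</pre>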
7. Add new columns with mutate()
<p></p>
<p><span style="font-family:SimSun; font-size:14px">Create new columns computed from existing ones:</span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">&gt; mutate(flights_df,
+        gain = arr_delay - dep_delay,
+        speed = distance / air_time * 60)</pre>
<p></p>
<p><span style="font-family:SimSun; font-size:14px"><br>
</span></p>
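<p><span style="font-family:SimSun; font-size:14px">Within a single mutate() call, a later expression can refer to a column created earlier in the same call, and transmute() keeps only the newly created columns. A minimal sketch on made-up values (toy, with_gain and only_new are hypothetical names):</span></p>
<pre name="code" class="html">library(dplyr)

# Hypothetical delays in minutes and flight time in minutes
toy &lt;- data.frame(arr_delay = c(11, 20), dep_delay = c(2, 4),
                  air_time = c(227, 227))

# gain_per_hour reuses gain, which is defined one line above it
with_gain &lt;- mutate(toy,
                    gain = arr_delay - dep_delay,
                    gain_per_hour = gain / (air_time / 60))
with_gain$gain  # 9 16

# transmute() returns only the new columns
only_new &lt;- transmute(toy, gain = arr_delay - dep_delay)
names(only_new)  # "gain"</pre>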
8. Summarise with summarise()<br>
<pre name="code" class="html">&gt; summarise(flights,
+           delay = mean(dep_delay, na.rm = TRUE))</pre>
<p><span style="font-family:SimSun; font-size:14px">This computes the mean of dep_delay, ignoring missing values.</span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<p><span style="font-family:SimSun; font-size:14px">9. Draw random samples</span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">sample_n(flights_df, 10)</pre>This randomly selects 10 rows.<br>
<pre name="code" class="html">sample_frac(flights_df, 0.01)</pre><span style="font-family:SimSun; font-size:14px">This randomly selects 1% of the rows.</span><br>
<br>
<p></p>
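<p><span style="font-family:SimSun; font-size:14px">Because the rows are drawn at random, the result changes from run to run unless the random seed is fixed first. The sketch below uses a made-up data frame and hypothetical variable names. It also shows slice_sample(), which newer dplyr versions (1.0.0 and later) offer as the successor to sample_n() and sample_frac().</span></p>
<pre name="code" class="html">library(dplyr)

# Hypothetical data: 100 rows identified by id
toy &lt;- data.frame(id = 1:100)

# Fixing the seed before each draw makes the sample reproducible
set.seed(1)
first_draw &lt;- sample_n(toy, 10)
set.seed(1)
second_draw &lt;- sample_n(toy, 10)
identical(first_draw, second_draw)  # TRUE

# 10% of the rows; with 100 rows this is 10 rows
tenth &lt;- sample_frac(toy, 0.1)

# Newer spelling of the same idea (assumes dplyr 1.0.0 or later)
newer &lt;- slice_sample(toy, n = 10)</pre>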
<p><span style="font-family:SimSun; font-size:14px">10. Group with group_by()</span></p>
<p><span style="font-family:SimSun; font-size:14px"></span></p>
<pre name="code" class="html">library(ggplot2)

# Group the flights by tailnum and store the grouped data as by_tailnum
by_tailnum &lt;- group_by(flights, tailnum)
# For each tailnum group, count the flights and compute the mean distance
# and the mean arr_delay
delay &lt;- summarise(by_tailnum,
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE))
# Keep planes with more than 20 flights and a mean distance under 2000
delay &lt;- filter(delay, count &gt; 20, dist &lt; 2000)
ggplot(delay, aes(dist, delay)) +
    geom_point(aes(size = count), alpha = 1/2) +
    geom_smooth() +
    scale_size_area()
</pre><br>
<img src="http://img.blog.csdn.net/20150122175820824?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvdTAxMTI1Mzg3NA==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt=""><br>
<p></p>
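<p><span style="font-family:SimSun; font-size:14px">To see what the grouped summary computes, it can help to run the same call on a few rows small enough to check by hand. The values below are made up for illustration, and toy and by_plane are hypothetical names.</span></p>
<pre name="code" class="html">library(dplyr)

# Hypothetical flights: plane A flew twice (one arr_delay missing), plane B once
toy &lt;- data.frame(tailnum = c("A", "A", "B"),
                  distance = c(100, 300, 500),
                  arr_delay = c(10, NA, -5))

by_plane &lt;- summarise(group_by(toy, tailnum),
                      count = n(),
                      dist = mean(distance, na.rm = TRUE),
                      delay = mean(arr_delay, na.rm = TRUE))

# One row per plane:
#   A: count 2, dist 200, delay 10 (the NA is ignored because na.rm = TRUE)
#   B: count 1, dist 500, delay -5
by_plane</pre>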
<p><span style="font-family:SimSun; font-size:14px"><br>
</span></p>
<p>When several operations are applied in sequence, each intermediate result has to be stored by assignment:</p>
<p></p>
<pre name="code" class="html">a1 &lt;- group_by(flights, year, month, day)
a2 &lt;- select(a1, arr_delay, dep_delay)
a3 &lt;- summarise(a2,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE))
a4 &lt;- filter(a3, arr &gt; 30 | dep &gt; 30)</pre><br>
11. Chaining with the pipe operator %&gt;%
<p></p>
<p>Start with the name of the data, then apply the operations to it one after another:</p>
<p></p>
<pre name="code" class="html">flights %&gt;%
    group_by(year, month, day) %&gt;%
    select(arr_delay, dep_delay) %&gt;%
    summarise(
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
    ) %&gt;%
    filter(arr &gt; 30 | dep &gt; 30)
</pre>None of the steps has to repeat the data name as its first argument.
<p></p>
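<p><span style="font-family:SimSun; font-size:14px">The pipe works by passing the value on its left as the first argument of the call on its right, so x %&gt;% f(y) means f(x, y). A minimal sketch on made-up data (toy, piped and nested are hypothetical names) that checks the piped and the nested forms give the same result:</span></p>
<pre name="code" class="html">library(dplyr)

# Hypothetical data
toy &lt;- data.frame(x = 1:3, y = c(10, 20, 30))

# Piped form: read top to bottom
piped &lt;- toy %&gt;%
    filter(x &gt; 1) %&gt;%
    summarise(total = sum(y))

# Nested form: the same calls, read inside out
nested &lt;- summarise(filter(toy, x &gt; 1), total = sum(y))

identical(piped, nested)  # TRUE
piped$total               # 50</pre>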
<p><br>
</p>
<p>To learn more about the package, see its reference manual (60 pages): <a target="_blank" href="http://cran.rstudio.com/web/packages/dplyr/dplyr.pdf">dplyr</a></p>
  
</div>