Data Processing with stringr

Source: Internet  Editor: 程序博客网  Time: 2024/05/18 17:44

Preface:

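In the data processing stage, the dplyr package does most of the work. As data becomes more varied and complex, however, string handling matters more and more, and base R's string functions are limited and not very convenient to use. The stringr package covers most everyday string-processing needs: it simplifies converting, searching, identifying, locating, matching, replacing, extracting, and splitting strings in R, and it also wraps some of the more complex string-processing functions.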

I. String Concatenation Functions

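1. word() function: extract words from a sentence - Usage:

word(string, start = 1L, end = start, sep = fixed(" "))
# sep is the separator between words; the default is a single space
  • Simple example:
library(stringr)
data <- 'Using R programming to work for data science'
# extract the last two words
word(data, -2:-1)
## [1] "data"    "science"
# starting from the 1st word, extract the first 3 words
word(data, start = 1, end = 3)
## [1] "Using R programming"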
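
The sep argument matters when the "words" are not separated by spaces. The following is a minimal sketch, assuming a hypothetical underscore-delimited identifier (the value of `id` is made up for illustration and does not come from the original example):

```r
library(stringr)

# Hypothetical underscore-delimited identifier, used only for illustration
id <- "2016_05_13_sales_report"

# Treat "_" as the word separator and take the first three pieces
date_part <- word(id, start = 1, end = 3, sep = fixed("_"))

# Negative positions count from the end: -1 is the last piece
last_part <- word(id, start = -1, sep = fixed("_"))

print(date_part)
print(last_part)
```

Because end defaults to start, calling word() with only start returns a single piece.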

2. str_wrap() function: wrap paragraphs - Usage:

str_wrap(string, width = 80, indent = 0, exdent = 0)
# width: target width of each line
# indent: indentation of the first line of each paragraph; default is no indentation
# exdent: indentation of the remaining lines of each paragraph; default is no indentation
  • Simple example:
string <- "New York is 3 hours ahead of California, but it does not make California slow. Someone graduated at age of 22, but waited 5 years before securing a good job!"
str_wrap(string, width = 80, indent = 4)
## [1] "    New York is 3 hours ahead of California, but it does not make California slow.\nSomeone graduated at age of 22, but waited 5 years before securing a good job!"
# \n is the newline character
# cat() prints the string and turns each \n into an actual line break
cat(str_wrap(string, indent = 4), sep = "\n")
##     New York is 3 hours ahead of California, but it does not make California slow.
## Someone graduated at age of 22, but waited 5 years before securing a good job!

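3. str_trim() function: strip leading and trailing whitespace from a string - Usage:

str_trim(string, side = c("both", "left", "right"))
# side: trim whitespace from both ends, the left end, or the right end
  • Simple example:
data <- '  Using R to explore the data science   '
str_trim(data, side = "both")
## [1] "Using R to explore the data science"
# both the leading and the trailing spaces are removed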

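4. str_c() function: join strings - Usage:

str_c(..., sep = "", collapse = NULL)
# sep: the string inserted between the arguments (default is the empty string); works like sep in paste()
# collapse: if not NULL, the string used to join the elements of the result into a single string;
#           when only one vector is passed, sep has no effect and collapse does the joining
  • Simple example:
str_c('x_', c(1:10), ':')
##  [1] "x_1:"  "x_2:"  "x_3:"  "x_4:"  "x_5:"  "x_6:"  "x_7:"  "x_8:" 
##  [9] "x_9:"  "x_10:"
str_c(c(2016, 05, 13), collapse = '-')
## [1] "2016-5-13"
# joining within a single vector: use collapse instead of sep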
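
The difference between sep and collapse is easiest to see when both are used together. The sketch below uses made-up `keys` and `values` vectors (hypothetical names, not from the original example), and also shows how str_c() treats missing values:

```r
library(stringr)

# Hypothetical key/value vectors, used only for illustration
keys   <- c("a", "b", "c")
values <- c("1", "2", "3")

# sep joins the arguments element by element, giving a vector of length 3
pairs <- str_c(keys, values, sep = "=")

# collapse then joins that vector into one single string
query <- str_c(keys, values, sep = "=", collapse = "&")

# Unlike paste(), str_c() propagates missing values
with_na <- str_c("x_", c("1", NA))

print(pairs)
print(query)
print(with_na)
```

Here paste("x_", NA) would instead produce the literal text "x_ NA", which is worth keeping in mind when switching between the two functions.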

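5. str_pad() function: pad strings - Usage:

str_pad(string, width, side = c("left", "right", "both"), pad = " ")
# width: the width of the string after padding
# side: the side on which padding is added; the default is the left
# pad: the padding character; the default is a space
  • Simple example:
data <- 'Michael_Jordan'
str_pad(data, width = 20, side = "both", pad = "*")
## [1] "***Michael_Jordan***"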
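
A common use of padding is giving identifiers a fixed width with leading zeros. A minimal sketch, assuming a hypothetical vector `ids` of numeric codes stored as strings:

```r
library(stringr)

# Hypothetical identifiers stored as strings, used only for illustration
ids <- c("7", "42", "123", "12345")

# Left-pad with "0" up to a width of 3 characters
padded <- str_pad(ids, width = 3, side = "left", pad = "0")

print(padded)
```

Strings that are already longer than width are returned unchanged; str_pad() never truncates.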

6. str_dup() function: duplicate strings - Usage:

str_dup(string, times)
# times: the number of times each string is repeated
  • Simple example:
data <- c("A", "B", "C", "D")
str_dup(data, 2)
## [1] "AA" "BB" "CC" "DD"
str_dup(data, 1:4)
## [1] "A"    "BB"   "CCC"  "DDDD"

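7. str_sub() function: extract substrings - Usage:

str_sub(string, start = 1L, end = -1L)
# Similar to word(), except that str_sub() extracts a substring by character position
# and can also be used for replacement, while word() extracts whole words.
  • Simple example:
data <- "Using R programming to work for data science"
str_sub(data, 1, 4)
## [1] "Usin"
word(data, 1, 4)
## [1] "Using R programming to"
# str_sub() cuts by character, word() cuts by word
str_sub(data, 1, 7) <- 'Using Python'; data
## [1] "Using Python programming to work for data science"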
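
Negative positions in str_sub() count from the end of the string, which is handy for taking suffixes. A short sketch reusing the sentence from the example above (the variable names `last7` and `prefixes` are introduced here for illustration):

```r
library(stringr)

data <- "Using R programming to work for data science"

# start = -7 with the default end = -1 takes the last 7 characters
last7 <- str_sub(data, start = -7)

# str_sub() is vectorised over its input
prefixes <- str_sub(c("apple", "banana"), 1, 3)

print(last7)
print(prefixes)
```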

II. String Computation Functions

1. str_length(): string length, similar to the nchar() function

fruit <- c('apple', 'banana', 'pear', NA)
str_length(fruit)
## [1]  5  6  4 NA
nchar(fruit)
## [1]  5  6  4 NA

2. str_count(): count the matches of a pattern in each string

str_count(fruit, pattern = "a")
## [1]  1  3  1 NA
## counting digits with \\d
str_count(fruit, '\\d')
## [1]  0  0  0 NA

3. str_order(), str_sort(): sort a character vector

fruit <- c('banana', 'pear', 'orange', 'apple', 'pinapple')
str_sort(fruit, decreasing = FALSE)  ## ascending order
## [1] "apple"    "banana"   "orange"   "pear"     "pinapple"
str_order(fruit)  ## returns the indices that put the vector in ascending order
## [1] 4 1 3 2 5
fruit[str_order(fruit)]
## [1] "apple"    "banana"   "orange"   "pear"     "pinapple"
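
Alphabetical sorting can be surprising when strings contain numbers. The sketch below assumes a stringr version that provides the numeric argument of str_sort(), and uses a hypothetical `labels` vector made up for illustration:

```r
library(stringr)

# Hypothetical labels with embedded numbers, used only for illustration
labels <- c("x10", "x2", "x1")

# Plain alphabetical order compares character by character
alpha <- str_sort(labels)

# numeric = TRUE sorts embedded digits by their numeric value
natural <- str_sort(labels, numeric = TRUE)

# decreasing = TRUE reverses the alphabetical order
reversed <- str_sort(labels, decreasing = TRUE)

print(alpha)
print(natural)
print(reversed)
```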

III. String Matching Functions

1. str_split(), str_split_fixed(): split strings

data <- 'myxyznamexyzisxyzkobexyzbryant!'
str_split(data, 'xyz')  ## returns a list; the pattern argument is xyz
## [[1]]
## [1] "my"      "name"    "is"      "kobe"    "bryant!"
str_split_fixed(data, 'xyz', 5)  ## returns a matrix
##      [,1] [,2]   [,3] [,4]   [,5]     
## [1,] "my" "name" "is" "kobe" "bryant!"

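2. str_match(), str_match_all(): extract the matched strings

string <- c('139-1234-5678', '133,1267,4589', '134 6543 7890', '178 2345 1111 or 133 7890 1234')
str_match(string, '[1][3-9]{2}[- ,][0-9]{4}[- ,][0-9]{4}')
##      [,1]           
## [1,] "139-1234-5678"
## [2,] "133,1267,4589"
## [3,] "134 6543 7890"
## [4,] "178 2345 1111"
## Explanation: [] lists the characters allowed at that position, and {} gives the number of repetitions.
## str_match() returns only the first match in each string, so the second number in the last element is not returned.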
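
The matrix returned by str_match() becomes more useful once the pattern contains capture groups, and str_match_all() is what returns every match rather than only the first. The following sketch adds parentheses to the phone-number pattern; the grouped pattern and the variable names `first` and `all_matches` are introduced here for illustration:

```r
library(stringr)

string <- c('139-1234-5678', '178 2345 1111 or 133 7890 1234')

# Same pattern as above, with each block of digits wrapped in a capture group
pattern <- '(1[3-9]{2})[- ,]([0-9]{4})[- ,]([0-9]{4})'

# One row per input string: column 1 is the full match, columns 2-4 the groups
first <- str_match(string, pattern)

# A list with one matrix per input string, one row per match found
all_matches <- str_match_all(string, pattern)

print(first)
print(all_matches[[2]])
```

The second input string contains two numbers, so its matrix in `all_matches` has two rows while `first` keeps only the first of them.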

3. str_detect(): test whether a string contains a match for a pattern

str_detect(fruit, 'an')  ## match "an"
## [1]  TRUE FALSE  TRUE FALSE FALSE
str_detect(fruit, '\\d')  ## match digits
## [1] FALSE FALSE FALSE FALSE FALSE
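
Because str_detect() returns a logical vector, it can be used directly for filtering and counting. A small sketch using the same `fruit` vector (the names `has_an`, `by_index`, `by_subset`, and `n_matches` are introduced here for illustration):

```r
library(stringr)

fruit <- c('banana', 'pear', 'orange', 'apple', 'pinapple')

# Logical vector: TRUE where the element contains "an"
has_an <- str_detect(fruit, 'an')

# Use it as an index to keep only the matching elements
by_index <- fruit[has_an]

# str_subset() gives the same result in one step
by_subset <- str_subset(fruit, 'an')

# Summing the logical vector counts the matching elements
n_matches <- sum(has_an)

print(by_index)
print(n_matches)
```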

IV. String Transformation Functions

1. str_to_upper(), str_to_lower(), str_to_title(): case conversion

data <- 'a new way to explore the world'
str_to_upper(data)
## [1] "A NEW WAY TO EXPLORE THE WORLD"
str_to_title(data, locale = "")  ## capitalize the first letter of each word
## [1] "A New Way To Explore The World"

2. str_subset(): keep the elements of a vector that match a regular expression

## match at the start
str_subset(fruit, '^a')
## [1] "apple"
## match at the end
str_subset(fruit, 'e$')
## [1] "orange"   "apple"    "pinapple"

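3. str_replace(), str_replace_all(): string replacement

string <- c('139-1234-5678', '133,1267,4589', '134 6543 7890', '178 2345 1111 or 133 7890 1234')
# str_match_all() returns a list of matrices; unlist() flattens it into a character vector
# so that each phone number is its own element
string <- unlist(str_match_all(string, '[1][3-9]{2}[- ,][0-9]{4}[- ,][0-9]{4}'))
# replace the comma separators, then the space separators, with hyphens
string <- str_replace_all(string, ',', '-')
str_replace_all(string, ' ', '-')
## [1] "139-1234-5678" "133-1267-4589" "134-6543-7890" "178-2345-1111"
## [5] "133-7890-1234"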
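
str_replace() and str_replace_all() differ only in how many matches they touch. A minimal sketch on a single phone number (the variable names `first_only` and `every` are introduced here for illustration):

```r
library(stringr)

phone <- "133,1267,4589"

# str_replace() changes only the first match in each string
first_only <- str_replace(phone, ",", "-")

# str_replace_all() changes every match; the character class [ ,]
# handles commas and spaces in a single call
every <- str_replace_all(phone, "[ ,]", "-")

print(first_only)
print(every)
```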