Spark SQL filter not contains


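Software environment: CDH 5.8.0.

Problem: when using Spark SQL to read and process Hive data, I needed a "does not contain" filter. (Spark SQL has the contains, like, and rlike functions.)

The Hive table id_url has the following contents: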

+------------+-----------------------------------+--+
| id_url.id  |            id_url.url             |
+------------+-----------------------------------+--+
| 1          | http://abc.com/ac/10987_2.html    |
| 2          | http://abc.com/ac/109872.html     |
| 3          | http://abc.com/ac/10987_4.html    |
| 4          | http://abc.com/ac/10987_30.html   |
| 14         | http://abc.com/ac/a10987_30.html  |
| 42         | http://abc.com/ac/c10987_30.html  |
| 43         | http://abc.com/ac/1d0987_30.html  |
+------------+-----------------------------------+--+

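First, consider the easy case: finding the pages whose url contains "30".

Assume the table has already been loaded into a DataFrame named data: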

scala> val data = sqlContext.sql("select * from fansy.id_url")
data: org.apache.spark.sql.DataFrame = [id: int, url: string]

Then contains, like, or rlike can be used, as follows:

scala> data.filter(data("url") contains "30").collect.foreach(println(_))
[4,http://abc.com/ac/10987_30.html]
[14,http://abc.com/ac/a10987_30.html]
[42,http://abc.com/ac/c10987_30.html]
[43,http://abc.com/ac/1d0987_30.html]

scala> data.filter(data("url") like "%30%").collect.foreach(println(_))
[4,http://abc.com/ac/10987_30.html]
[14,http://abc.com/ac/a10987_30.html]
[42,http://abc.com/ac/c10987_30.html]
[43,http://abc.com/ac/1d0987_30.html]

scala> data.filter(data("url") rlike ".*30.*").collect.foreach(println(_))
[4,http://abc.com/ac/10987_30.html]
[14,http://abc.com/ac/a10987_30.html]
[42,http://abc.com/ac/c10987_30.html]
[43,http://abc.com/ac/1d0987_30.html]
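A side note on the rlike pattern: as far as I know, Spark evaluates rlike with Java regex "find" semantics, so the pattern only has to match somewhere in the string and the surrounding .* is optional. The plain-Scala sketch below needs no Spark. The urls list just copies the table above, and containsMatch is a hypothetical helper name used only for illustration. It shows the difference between "find" and a full-string match:

```scala
import java.util.regex.Pattern

object RlikeFindSketch {
  // Sample URLs copied from the id_url table above.
  val urls: Seq[String] = Seq(
    "http://abc.com/ac/10987_2.html",
    "http://abc.com/ac/109872.html",
    "http://abc.com/ac/10987_4.html",
    "http://abc.com/ac/10987_30.html",
    "http://abc.com/ac/a10987_30.html",
    "http://abc.com/ac/c10987_30.html",
    "http://abc.com/ac/1d0987_30.html"
  )

  // Hypothetical helper: true if the regex matches anywhere in s ("find" semantics).
  def containsMatch(regex: String, s: String): Boolean =
    Pattern.compile(regex).matcher(s).find()

  def main(args: Array[String]): Unit = {
    val bare   = urls.filter(containsMatch("30", _))      // pattern without wildcards
    val padded = urls.filter(containsMatch(".*30.*", _))  // pattern used in the post
    println(bare == padded)                       // both select the same URLs
    println(urls.filter(_.matches("30")).isEmpty) // a full-string match finds nothing
    bare.foreach(println)
  }
}
```

If that assumption about rlike holds, data("url") rlike "30" should return the same rows as the ".*30.*" version.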

But what about "does not contain"?

1. Use an rlike regular expression that matches strings not containing 30:

scala> data.filter(data("url") rlike "^((?!30).)*$").collect.foreach(println(_))
[1,http://abc.com/ac/10987_2.html]
[2,http://abc.com/ac/109872.html]
[3,http://abc.com/ac/10987_4.html]

However, this approach becomes very inefficient when matching against a large number of strings.
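The pattern ^((?!30).)*$ applies a negative lookahead at every character position, which is one likely reason it is slow. As a quick check outside Spark, the plain-Scala sketch below (sample data copied from the table; the object and method names are illustrative, not from the original post) suggests that it selects exactly the strings a negated substring test would select:

```scala
object NotContainsRegexSketch {
  // Sample URLs copied from the id_url table above.
  val urls: Seq[String] = Seq(
    "http://abc.com/ac/10987_2.html",
    "http://abc.com/ac/109872.html",
    "http://abc.com/ac/10987_4.html",
    "http://abc.com/ac/10987_30.html",
    "http://abc.com/ac/a10987_30.html",
    "http://abc.com/ac/c10987_30.html",
    "http://abc.com/ac/1d0987_30.html"
  )

  // Negative lookahead: before consuming each character, require that "30" does not start there.
  private val notThirty = "^((?!30).)*$".r.pattern

  // Strings accepted by the regex.
  def byRegex: Seq[String] = urls.filter(u => notThirty.matcher(u).find())

  // Strings accepted by simply negating a substring test.
  def byNegation: Seq[String] = urls.filter(u => !u.contains("30"))

  def main(args: Array[String]): Unit = {
    byRegex.foreach(println)
    println(byRegex == byNegation) // the two filters agree on this sample
  }
}
```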

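2. Use not or !

Looking at the Column API, there is also a negation function, available as not or !, which inverts a condition built with contains, like, rlike, and so on. (not comes from org.apache.spark.sql.functions; ! is the unary operator defined on Column.) For example: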

scala> val t = not (data("url") contains "30")
t: org.apache.spark.sql.Column = NOT Contains(url, 30)

scala> val t1 = ! (data("url") contains "30")
t1: org.apache.spark.sql.Column = NOT Contains(url, 30)

Filtering with either t or t1 then gives the expected result:

scala> data.filter(t).collect.foreach(println(_))
[1,http://abc.com/ac/10987_2.html]
[2,http://abc.com/ac/109872.html]
[3,http://abc.com/ac/10987_4.html]

scala> data.filter(t1).collect.foreach(println(_))
[1,http://abc.com/ac/10987_2.html]
[2,http://abc.com/ac/109872.html]
[3,http://abc.com/ac/10987_4.html]
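For trying this outside spark-shell, here is a minimal standalone sketch. It assumes the Spark 1.6-era SQLContext API (the version I believe ships with CDH 5.8.0) and a Spark dependency on the classpath. The in-memory DataFrame is only a stand-in for the Hive table fansy.id_url, and the object name, app name, and run method are made up for illustration. It also shows a third option not covered above: passing a SQL condition string with NOT LIKE to filter.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.not

object NotContainsSparkSketch {
  // Returns, for each variant, the sorted ids of rows whose url does not contain "30".
  def run(): Seq[Seq[Int]] = {
    // Local context for illustration; in spark-shell, sc and sqlContext already exist.
    val conf = new SparkConf().setMaster("local[1]").setAppName("not-contains-sketch")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in for the Hive table fansy.id_url.
    val data = Seq(
      (1, "http://abc.com/ac/10987_2.html"),
      (2, "http://abc.com/ac/109872.html"),
      (3, "http://abc.com/ac/10987_4.html"),
      (4, "http://abc.com/ac/10987_30.html"),
      (14, "http://abc.com/ac/a10987_30.html"),
      (42, "http://abc.com/ac/c10987_30.html"),
      (43, "http://abc.com/ac/1d0987_30.html")
    ).toDF("id", "url")

    val viaNot  = data.filter(not(data("url") contains "30")) // functions.not
    val viaBang = data.filter(!(data("url") contains "30"))   // Column's unary !
    val viaSql  = data.filter("url NOT LIKE '%30%'")          // SQL condition string

    val ids = Seq(viaNot, viaBang, viaSql)
      .map(_.collect().map(_.getInt(0)).toSeq.sorted)
    sc.stop()
    ids
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

All three variants are expected to keep only ids 1, 2, and 3. Which one to use is mostly a matter of taste; the Column-based forms compose more easily with other conditions.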

Share, grow, enjoy.

Stay grounded and focused.

When reposting, please credit the blog: http://blog.csdn.NET/fansy1990
