如何在 MySQL 中判断中文字符?

来源:互联网 发布:em算法 python代码 编辑:程序博客网 时间:2024/05/22 12:28

一 引子


在生产环境中,经常会有这样的场景:获得中文数据。那问题就来了,怎么才能匹配出中文字符呢?


本文提供两种方法。


二 演示


2.1 环境


mysql> SHOW VARIABLES LIKE "%version%";

+-------------------------+------------------------------+

| Variable_name | Value |

+-------------------------+------------------------------+

| protocol_version | 10 |

| version | 5.1.73 |

| version_comment | MySQL Community Server (GPL) |

| version_compile_machine | i386 |

| version_compile_os | apple-darwin10.3.0 |

+-------------------------+------------------------------+

5 rows in set (0.00 sec)


2.2 创建测试表和插入测试数据


mysql -S /tmp/mysql_5173.sock -uroot -proot


创建测试表和插入测试数据。


mysql> USE test;

Database changed


mysql> CREATE TABLE user

-> (name VARCHAR(20)

-> ) DEFAULT CHARSET = utf8 ENGINE = INNODB;

Query OK, 0 rows affected (0.10 sec)


mysql> SHOW TABLE STATUS LIKE 'user' \G;

*************************** 1. row ***************************

Name: user

Engine: InnoDB

Version: 10

Row_format: Compact

Rows: 2

Avg_row_length: 8192

Data_length: 16384

Max_data_length: 0

Index_length: 0

Data_free: 0

Auto_increment: NULL

Create_time: 2015-01-16 18:01:36

Update_time: NULL

Check_time: NULL

Collation: utf8_general_ci

Checksum: NULL

Create_options:

Comment:

1 row in set (0.00 sec)


ERROR:

No query specified


mysql> INSERT INTO user VALUES('robin');

Query OK, 1 row affected (0.01 sec)


mysql> INSERT INTO user VALUES('温国兵');

Query OK, 1 row affected (0.00 sec)


三 实现


3.1 方法一 正则表达式


mysql> SELECT * FROM user \G;

*************************** 1. row ***************************

name: robin

*************************** 2. row ***************************

name: 温国兵

2 rows in set (0.00 sec)


mysql> SELECT name,

-> CASE name REGEXP "[\u0391-\uFFE5]"

-> WHEN 1 THEN "不是中文字符"

-> ELSE "是中文字符"

-> END AS "判断是否是中文字符"

-> FROM user;

+-----------+-----------------------------+

| name | 判断是否是中文字符 |

+-----------+-----------------------------+

| robin | 不是中文字符 |

| 温国兵 | 是中文字符 |

+-----------+-----------------------------+

2 rows in set (0.00 sec)


mysql> SELECT name FROM user WHERE NOT (name REGEXP "[\u0391-\uFFE5]");

+-----------+

| name |

+-----------+

| 温国兵 |

+-----------+

1 row in set (0.00 sec)


3.2 方法二 length() 和 char_length()


mysql> SELECT name, length(name), char_length(name) FROM user;

+-----------+--------------+-------------------+

| name | length(name) | char_length(name) |

+-----------+--------------+-------------------+

| robin | 5 | 5 |

| 温国兵 | 20 | 9 |

+-----------+--------------+-------------------+

2 rows in set (0.00 sec)


mysql> SELECT name FROM user WHERE length(name) <> char_length(name);

+-----------+

| name |

+-----------+

| 温国兵 |

+-----------+

1 row in set (0.00 sec)


四 总结


方法一中,[\u0391-\uFFE5] 匹配中文以外的字符。


方法二中,当字符集为UTF-8,并且字符为中文时,length() 和 char_length() 两个方法返回的结果不相同。


参考官方文档:


LENGTH()

Return the length of a string in bytes

Returns the length of the string str, measured in bytes. A multibyte character counts as multiple bytes. This means that for a string containing five 2-byte characters, LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.


CHAR_LENGTH()

Return number of characters in argument

Returns the length of the string str, measured in characters. A multibyte character counts as a single character. This means that for a string containing five 2-byte characters, LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.

0 0