URLEncoder Function for MySQL(mysql urlencode 支持中文)
来源:互联网 发布:运动数据记录器 编辑:程序博客网 时间:2024/06/15 17:48
Today I needed to convert some URL segments that are stored in a MySQL database from raw UTF-8 strings (many of which contain multi-byte characters in several hundred languages) into URL-encoded strings. This is for some work I’m doing using the latest version of SilverStripe (see Making SiteTree Fields Longer in SilverStripe 3.X for more info). Since I could not find any good examples of MySQL functions already built to do this, I ended up writing my own. Code and explanations posted below the fold…
Note first of all that I did find this example on DZone. However, it was lacking in a few ways:
- The main thing: it does not handle multi-byte characters
- It encodes characters that it doesn’t need to (i.e. hyphen, period, underscore, and tilde).
- It doesn’t encode spaces correctly (according to RFC they should be %20 and not the plus symbol like the old-school days)
Because of this, I needed to write my own. I tried starting with the one from DZone as a base, but ended up basically rewriting the entire thing. This should (hopefully) be compliant with RFC 3986 that defines percent-encoding of URL segments. You can also see the Wikipedia Percent Encoding article for a little bit easier-to-read definition. Here is the entire piece of code.
Here are some interesting things to take note of regarding this implementation:
- I tested it against 127,465 URL segments I have in my database in several hundred languages to see if it returned the exact same results as PHP’s
rawurlencode
. It did. - It’s probably worth mentioning that this obviously should not be used on entire URLs since it will encode reserved characters like slashes, etc. It can only be used on, for instance, individual segments (between slash separators) or individual query string values (not the entire query string or key/value pairs).
- There’s one obvious flaw that’s worth mentioning so I don’t get flamed for it: If you really do take in a string that is 4096 characters long and it contains even a single character that has to be encoded, your return value will end up botched. I suppose it will end up with a MySQL truncation warning, but I haven’t tested this. Since I’m encoding individual segments of a URL, none of them should ever be anywhere near that long. If your URLs are that long you are going to definitely have other issues unrelated to this, so I don’t feel this is a major flaw.
And now to the code itself:
DELIMITER ;
DROP
FUNCTION
IF EXISTS urlencode;
DELIMITER |
CREATE
FUNCTION
URLENCODE(str
VARCHAR
(4096) CHARSET utf8)
RETURNS
VARCHAR
(4096) CHARSET utf8
DETERMINISTIC
CONTAINS
SQL
BEGIN
-- the individual character we are converting in our loop
-- NOTE: must be VARCHAR even though it won't vary in length
-- CHAR(1), when used with SUBSTRING, made spaces '' instead of ' '
DECLARE
sub
VARCHAR
(1) CHARSET utf8;
-- the ordinal value of the character (i.e. ñ becomes 50097)
DECLARE
val
BIGINT
DEFAULT
0;
-- the substring index we use in our loop (one-based)
DECLARE
ind
INT
DEFAULT
1;
-- the integer value of the individual octet of a character being encoded
-- (which is potentially multi-byte and must be encoded one byte at a time)
DECLARE
oct
INT
DEFAULT
0;
-- the encoded return string that we build up during execution
DECLARE
ret
VARCHAR
(4096)
DEFAULT
'
';
-- our loop index for looping through each octet while encoding
DECLARE octind INT DEFAULT 0;
IF ISNULL(str) THEN
RETURN NULL;
ELSE
SET ret = '
';
-- loop through the input string one character at a time - regardless
-- of how many bytes a character consists of
WHILE ind <= CHAR_LENGTH(str) DO
SET sub = MID(str, ind, 1);
SET val = ORD(sub);
-- these values are ones that should not be converted
-- see http://tools.ietf.org/html/rfc3986
IF NOT (val BETWEEN 48 AND 57 OR -- 48-57 = 0-9
val BETWEEN 65 AND 90 OR -- 65-90 = A-Z
val BETWEEN 97 AND 122 OR -- 97-122 = a-z
-- 45 = hyphen, 46 = period, 95 = underscore, 126 = tilde
val IN (45, 46, 95, 126)) THEN
-- This is not an "unreserved" char and must be encoded:
-- loop through each octet of the potentially multi-octet character
-- and convert each into its hexadecimal value
-- we start with the high octect because that is the order that ORD
-- returns them in - they need to be encoded with the most significant
-- byte first
SET octind = OCTET_LENGTH(sub);
WHILE octind > 0 DO
-- get the actual value of this octet by shifting it to the right
-- so that it is at the lowest byte position - in other words, make
-- the octet/byte we are working on the entire number (or in even
-- other words, oct will no be between zero and 255 inclusive)
SET oct = (val >> (8 * (octind - 1)));
-- we append this to our return string with a percent sign, and then
-- a left-zero-padded (to two characters) string of the hexadecimal
-- value of this octet)
SET ret = CONCAT(ret, '
%', LPAD(HEX(oct), 2, 0));
-- now we need to reset val to essentially zero out the octet that we
-- just encoded so that our number decreases and we are only left with
-- the lower octets as part of our integer
SET
val = (val & (POWER(256, (octind - 1)) - 1));
SET
octind = (octind - 1);
END
WHILE;
ELSE
-- this character was not one that needed to be encoded and can simply be
-- added to our return string as-is
SET
ret = CONCAT(ret, sub);
END
IF;
SET
ind = (ind + 1);
END
WHILE;
END
IF;
RETURN
ret;
END
;
|
DELIMITER ;
The heart of this is the MySQL ORD function which converts the multi-byte character into an integer representation. Then I just do some bitwise operations to get each respective byte as an integer, convert that to hexadecimal, and then concatenate the string. Note that while UTF-8 can have four-byte characters (or, for the nit-picky: pre-2003, it could have six-byte characters), in this case we will only ever encounter three-byte characters since we are not using MySQL’s utf8mb4 encoding.
For you math or computer science majors out there – I’m sure you’ll know more efficient or elegant ways of doing the actual encoding – especially the bitwise operators for handling each byte of a multi-byte character. Anyway, if anyone is able to use this and it helps you, or if you notice an improvement, please leave a comment or email me. I’d be happy to get your feedback.
- URLEncoder Function for MySQL(mysql urlencode 支持中文)
- URLEncoder Function for MySQL(mysql urlencode 支持中文)
- 使用 Mysql 支持中文 for windows
- Mysql 支持中文显示
- MySQL中文支持问题
- MySQL中文显示支持
- 让mysql支持中文
- MySQL中文支持
- alter mysql 支持中文
- ubuntu mysql支持中文
- MySQL支持中文
- 让MySQL支持中文
- Mysql 支持中文集合
- Mysql支持中文
- 让MySQL支持中文
- 让MySQL支持中文
- mysql支持中文设置
- mysql字符集支持中文
- vim
- iOS 开发的常见问题不涉及网络
- Unable to add module to the current project as it is not of packaging type 'pom' 问题解决
- 给小白的Java EE生存指南(4) : 一只叫Tom的猫
- 渠得清如许,源头活水来——2015年年终总结
- URLEncoder Function for MySQL(mysql urlencode 支持中文)
- 【自己总结的】网页乱码解决方式
- 给小白的Java EE生存指南(5) :AJAX
- 中介者模式
- cf602e New Year Tree dfs序+二进制+线段树区间修改
- 从IAR arm 5.5 IAR ARM 6.30 等低版本升级为高版本 IAR arm 7.4
- Spring MVC 中文件上传
- UVA213----Message Decoding
- SpringMVC+Mybatis框架整合源码SSM Nginx,Hudson