Converting Character Sets

来源：互联网发布：詹姆斯09年对魔术数据编辑：程序博客网时间：2024/05/22 08:20

March 17, 2009 by Ryan Lowe 23 Comments

The web is going the way of utf8. Drizzle has chosen it as the default character set, most back-ends to websites use it to store text data, and those who are still using latin1 have begun to migrate their databases to utf8. Googling for “mysql convert charset to utf8″ results in a plethora of sites, each with a slightly different approach, and each broken in some respect. I’ll outline those approaches here and show why they don’t work, and then present a script that can generically be used to convert a database (or set of tables) to a target character set and collation.

Approach #1:

Shell
1
ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Take the following table as an example why this approach will not work:

Shell
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
mysql>CREATETABLE`t1`(
->`c1`textNOTNULL
->)ENGINE=MyISAMDEFAULTCHARSET=latin1;
QueryOK,0rowsaffected(0.02sec)
 
mysql>ALTERTABLE`t1`CONVERTTOCHARACTERSETutf8COLLATEutf8_general_ci;
QueryOK,0rowsaffected(0.02sec)
Records:0Duplicates:0Warnings:0
 
mysql>SHOWCREATETABLE`t1`\G
***************************1.row***************************
Table:t1
CreateTable:CREATETABLE`t1`(
`c1`mediumtextNOTNULL
)ENGINE=MyISAMDEFAULTCHARSET=utf8
1rowinset(0.01sec)

Notice the implicit conversion of c1 from text to mediumtext. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.

Approach #2 (outlined here):

This approach avoids the issue of implicit conversions by changing each data type to it’s binary counterpart before conversion. Due to implementation limitations, however, it also converts any pre-existing binary columns to their text counterpart. Additionally, this approach will fail because a binary column cannot be part of a FULLTEXT index. Even if these limitations are overcome, this process is inherently unsuitable for large databases because it requires multiple alter statements to be run on each table:

1) Drop FULLTEXT indexes
2) Convert target columns to their binary counterparts
3) Convert the table to the target character set
4) Convert target columns to their original data types
5) Add FULLTEXT indexes back

For those of us routinely waiting hours, if not days, for a single alter statement to finish, this is unacceptable.

Approach #3:

Dumping the entire database and re-importing it with the appropriate server & client character sets.

This is a three-step process, where one must first dump only the schema and then edit it by hand to have the appropriate character sets and the dump the data separately. After which, the schema must be re-created and data imported. If you’re using replication, this usually isn’t even an option because you’ll have a ridiculous amount of binary logs and force a reload of data on every server in the replication chain (very time/bandwidth/disk space consuming).

Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:

Shell
1
2
3
ALTER TABLE `t1`
DEFAULT CHARSET=utf8,
MODIFY COLUMN `c1` text CHARACTER SET utf8;

This approach will both change the default character set for the table and target column, while leaving in place any FULLTEXT indexes. It also requires only a single ALTER statement for a given table. A perl script has been put together to parallel-ize the ALTER statements and is available at:

Shell
1
%>wgethttp://www.pablowe.net/convert_charset

It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:

- Proper handling of string foreign keys (currently fails, but you probably shouldn’t be using strings as foreign keys anyway …)
- Allow throttling of the number of threads created (currently creates one per table)

Shell
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
Usage:
convert_charset --database=database [options]
 
Options:
--askpass        Prompt for a MySQL password
--charset        The target character set to convert to
--collate        The target collation to convert to
--database|d     The target database
--help|?         Display this help and exit
--host|h         The target host
--ignore-columns Columns to ignore, useful if you want to
keep the existing charset for a target column
Comma-separated.  NO SPACES.
table.column
--ignore-tables  A comma-separated list of tables to ignore
--password|p     The MySQL password to use
--port           The target port
--tables         A comma-separated list of tables to convert.
All non-named tables will be ignored
--test           Print the ALTER statements that would be executed
without executing them.
--user|u         The MySQL user
--version|V      Display version information and exit
 
defaults are:
 
ATTRIBUTE                  VALUE
-------------------------- ------------------
askpass                    FALSE
charset                    utf8
collate                    No Default Value
database                   No Default Value
help                       FALSE
host                       localhost
ignore-columns             No Default Value
ignore-tables              No Default Value
password                   No Default Value
port                       3306
tables                     No Default Value
test                       FALSE
user                       Current User
version                    FALSE

Filed Under: Insight for Developers Tagged With: Community, Tips, Tools

March 17, 2009 by Ryan Lowe 23 Comments

Approach #1:

Shell
1
ALTERTABLE`t1`CONVERTTOCHARACTERSETutf8COLLATEutf8_general_ci;

Take the following table as an example why this approach will not work:

Shell
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
mysql> CREATE TABLE `t1` (
-> `c1` text NOT NULL
-> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.02 sec)
 
mysql> ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
 
mysql> SHOW CREATE TABLE `t1`\G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`c1` mediumtext NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.01 sec)

Notice the implicit conversion of c1 from text to mediumtext. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.

Approach #2 (outlined here):

For those of us routinely waiting hours, if not days, for a single alter statement to finish, this is unacceptable.

Approach #3:

Dumping the entire database and re-importing it with the appropriate server & client character sets.

Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:

Shell
1
2
3
ALTERTABLE`t1`
DEFAULTCHARSET=utf8,
MODIFYCOLUMN`c1`textCHARACTERSETutf8;

Shell
1
%> wget http://www.pablowe.net/convert_charset

It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:

Shell
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
Usage:
convert_charset--database=database[options]
 
Options:
--askpass        PromptforaMySQLpassword
--charset        Thetargetcharactersettoconvertto
--collate        Thetargetcollationtoconvertto
--database|d    Thetargetdatabase
--help|?        Displaythishelpandexit
--host|h        Thetargethost
--ignore-columnsColumnstoignore,usefulifyouwantto
keeptheexistingcharsetforatargetcolumn
Comma-separated.  NOSPACES.
table.column
--ignore-tables  Acomma-separatedlistoftablestoignore
--password|p    TheMySQLpasswordtouse
--port          Thetargetport
--tables        Acomma-separatedlistoftablestoconvert.
Allnon-namedtableswillbeignored
--test          PrinttheALTERstatementsthatwouldbeexecuted
withoutexecutingthem.
--user|u        TheMySQLuser
--version|V      Displayversioninformationandexit
 
defaultsare:
 
ATTRIBUTE                  VALUE
--------------------------------------------
askpass                    FALSE
charset                    utf8
collate                    NoDefaultValue
database                  NoDefaultValue
help                      FALSE
host                      localhost
ignore-columns            NoDefaultValue
ignore-tables              NoDefaultValue
password                  NoDefaultValue
port                      3306
tables                    NoDefaultValue
test                      FALSE
user                      CurrentUser
version                    FALSE

Filed Under: Insight for Developers Tagged With: Community, Tips, Tools

备注：该篇技术文档是与MySQL数据库中的“备份恢复”主题相关的技术主题。

本文转载自：http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/

March 17, 2009 by Ryan Lowe 23 Comments

Approach #1:

Shell
1
ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Take the following table as an example why this approach will not work:

Shell
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
mysql>CREATETABLE`t1`(
->`c1`textNOTNULL
->)ENGINE=MyISAMDEFAULTCHARSET=latin1;
QueryOK,0rowsaffected(0.02sec)
 
mysql>ALTERTABLE`t1`CONVERTTOCHARACTERSETutf8COLLATEutf8_general_ci;
QueryOK,0rowsaffected(0.02sec)
Records:0Duplicates:0Warnings:0
 
mysql>SHOWCREATETABLE`t1`\G
***************************1.row***************************
Table:t1
CreateTable:CREATETABLE`t1`(
`c1`mediumtextNOTNULL
)ENGINE=MyISAMDEFAULTCHARSET=utf8
1rowinset(0.01sec)

Notice the implicit conversion of c1 from text to mediumtext. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.

Approach #2 (outlined here):

For those of us routinely waiting hours, if not days, for a single alter statement to finish, this is unacceptable.

Approach #3:

Dumping the entire database and re-importing it with the appropriate server & client character sets.

Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:

Shell
1
2
3
ALTER TABLE `t1`
DEFAULT CHARSET=utf8,
MODIFY COLUMN `c1` text CHARACTER SET utf8;

Shell
1
%>wgethttp://www.pablowe.net/convert_charset

It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:

Shell
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
Usage:
convert_charset--database=database[options]
 
Options:
--askpass        PromptforaMySQLpassword
--charset        Thetargetcharactersettoconvertto
--collate        Thetargetcollationtoconvertto
--database|d    Thetargetdatabase
--help|?        Displaythishelpandexit
--host|h        Thetargethost
--ignore-columnsColumnstoignore,usefulifyouwantto
keeptheexistingcharsetforatargetcolumn
Comma-separated.  NOSPACES.
table.column
--ignore-tables  Acomma-separatedlistoftablestoignore
--password|p    TheMySQLpasswordtouse
--port          Thetargetport
--tables        Acomma-separatedlistoftablestoconvert.
Allnon-namedtableswillbeignored
--test          PrinttheALTERstatementsthatwouldbeexecuted
withoutexecutingthem.
--user|u        TheMySQLuser
--version|V      Displayversioninformationandexit
 
defaultsare:
 
ATTRIBUTE                  VALUE
--------------------------------------------
askpass                    FALSE
charset                    utf8
collate                    NoDefaultValue
database                  NoDefaultValue
help                      FALSE
host                      localhost
ignore-columns            NoDefaultValue
ignore-tables              NoDefaultValue
password                  NoDefaultValue
port                      3306
tables                    NoDefaultValue
test                      FALSE
user                      CurrentUser
version                    FALSE

Filed Under: Insight for Developers Tagged With: Community, Tips, Tools

0 0