Converting Character Sets
The web is going the way of utf8. Drizzle has chosen it as the default character set, most back-ends to websites use it to store text data, and those who are still using latin1 have begun to migrate their databases to utf8. Googling for “mysql convert charset to utf8” turns up a plethora of sites, each with a slightly different approach, and each broken in some respect. I’ll outline those approaches here, show why they don’t work, and then present a script that can generically convert a database (or set of tables) to a target character set and collation.
Approach #1:
Take the following table as an example of why this approach will not work:
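The original table definition and statement did not survive the reposting; the following is a minimal sketch of the kind of table that exhibits the problem, assuming Approach #1 is a plain `ALTER TABLE … CONVERT TO CHARACTER SET` (the table name `t1` and column `c1` are illustrative):

```sql
-- Hypothetical example table; the original post's definition was lost.
CREATE TABLE t1 (
  c1 TEXT
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- Approach #1: let MySQL convert everything implicitly.
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8;

-- SHOW CREATE TABLE t1 now reports c1 as MEDIUMTEXT: utf8 needs up to
-- 3 bytes per character, so MySQL widens the type to preserve character
-- capacity. Types change silently, and data can be truncated.
```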
Notice the implicit conversion of c1 from TEXT to MEDIUMTEXT. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.
Approach #2 (outlined here):
This approach avoids the issue of implicit conversions by changing each data type to its binary counterpart before conversion. Due to implementation limitations, however, it also converts any pre-existing binary columns to their text counterparts. Additionally, this approach will fail because a binary column cannot be part of a FULLTEXT index. Even if these limitations are overcome, the process is inherently unsuitable for large databases because it requires multiple ALTER statements to be run on each table:
1) Drop FULLTEXT indexes
2) Convert target columns to their binary counterparts
3) Convert the table to the target character set
4) Convert target columns to their original data types
5) Add FULLTEXT indexes back
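On a hypothetical table t1 with a TEXT column c1 and a FULLTEXT index ft_c1, the five steps above look roughly like this (a sketch of the technique, not the original post's code):

```sql
-- 1) Drop the FULLTEXT index (binary columns cannot be part of one).
ALTER TABLE t1 DROP INDEX ft_c1;

-- 2) Convert the text column to its binary counterpart (TEXT -> BLOB),
--    so no character-set conversion is applied to the stored bytes.
ALTER TABLE t1 MODIFY c1 BLOB;

-- 3) Convert the table to the target character set.
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8;

-- 4) Convert the column back to its original type, now read as utf8.
ALTER TABLE t1 MODIFY c1 TEXT CHARACTER SET utf8;

-- 5) Add the FULLTEXT index back.
ALTER TABLE t1 ADD FULLTEXT INDEX ft_c1 (c1);
```

That is five table rebuilds where one would do.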
For those of us routinely waiting hours, if not days, for a single ALTER statement to finish, this is unacceptable.
Approach #3:
Dumping the entire database and re-importing it with the appropriate server & client character sets.
This is a three-step process: first dump only the schema and edit it by hand to use the appropriate character sets, then dump the data separately. After that, the schema must be re-created and the data imported. If you’re using replication, this usually isn’t even an option, because it generates a ridiculous amount of binary log and forces a reload of the data on every server in the replication chain (very time-, bandwidth-, and disk-space-consuming).
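The dump-and-reload procedure can be sketched with mysqldump as follows (the database name `mydb` is illustrative, and the `sed` one-liner stands in for the hand edit of the schema):

```sh
# 1) Dump the schema only, then edit the CREATE TABLE statements to use
#    the target character set (normally done by hand).
mysqldump --no-data mydb > schema.sql
sed -i 's/CHARSET=latin1/CHARSET=utf8/g' schema.sql

# 2) Dump the data separately, without the schema.
mysqldump --no-create-info --default-character-set=utf8 mydb > data.sql

# 3) Re-create the schema, then re-import the data.
mysql mydb < schema.sql
mysql --default-character-set=utf8 mydb < data.sql
```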
Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:
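The statement itself was lost in the reposting; against a hypothetical latin1 table t1 with a single TEXT column c1, it would look something like:

```sql
-- One ALTER: change the table's default character set and explicitly
-- MODIFY the column, naming its type so MySQL cannot silently widen it.
-- Any FULLTEXT index on c1 stays in place.
ALTER TABLE t1
  DEFAULT CHARACTER SET utf8,
  MODIFY c1 TEXT CHARACTER SET utf8;
```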
This approach changes the default character set for both the table and the target column, while leaving any FULLTEXT indexes in place. It also requires only a single ALTER statement per table. A Perl script has been put together to parallelize the ALTER statements and is available at:
It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:
- Proper handling of string foreign keys (currently fails, but you probably shouldn’t be using strings as foreign keys anyway …)
- Allow throttling of the number of threads created (currently creates one per table)
Reprinted from: http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/