Mysql 字符集

Posted Oct 22, 2024

By Ryan

10 min read

结论：

连接和字段utf8和utf8mb4混用时部分操作会存在隐式转换，降低性能，如果存在utf8mb4四字节转utf8时会报错或乱码。建议5个字符集全部统一utf8mb4，且排序规则统一
排序规则选用：根据业务，大小写敏感使用bin，不敏感使用general_ci
字符集转换会锁表，根据表中字符串数据量决定处理时间，耗时较长
转换字符集会增加字节占用，需要考虑转换后部分字段类型的长度是否充足
一般建议表设置字符集，字段不设置默认继承表字符集，从而统一（标准SQL只有表和字段二级）

https://dev.mysql.com/doc/refman/5.7/en/charset.html

https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

charset 字符集：字符的编码规则，混用可能乱码，如utf8、latin1

collation 排序规则：用于字符集内比较字符串，由3部分组成：{charset}_{lang}_{suffixes}。

lang

bin：基于字符集的二进制数值比较，只有一种排序规则所以没有其他后缀
general：简化的排序规则，采用了提高速度的快捷方式，准确性较差（utf8/utf8mb4默认）
unicode：可以在多种语言中准确排序，但性能较general低。支持扩展，把一个字符看作多个字符组合

suffix

_ai：不区分重音，仅部分字符集有，unicode没有
_as：区分重音，仅部分字符集有，unicode没有
_ci：不区分大小写
_cs：区分大小写，一般包含了区分重音，仅部分字符集有，unicode没有
_bin：基于字符集的二进制数值比较，区分大小写和重音

MySQL在服务器、库、表、字符串字段（char、varchar、*text）、连接都可设置字符集、排序规则、可以混用字符集、排序规则

服务器：编译时指定，或启动时配置文件指定。客户端连接未指定时默认。MySQL特有
数据库：创建时指定，为空使用服务器的。LOAD DATA语句和存储过程没有指定时默认。MySQL特有
表：创建时指定，为空使用数据库。存储过程涉及字符集会使用表的字符集。SQL标准
字段：创建时指定，为空使用表，最终存储数据使用的字符编码，SQL中涉及字符对比使用字段的排序规则。SQL标准
连接：客户端在建立连接时或建立连接后可指定客户端的字符集，服务端会将客户端发送的数据按客户端字符集接受，按客户端指令取数据，再按客户端字符集对数据进行字符集转换，返回给客户端

查看支持字符集及默认语句 ` SHOW CHARACTER SET `

各级字符集和排序规则策略

两个都指定了则使用指定的
只指定字符集，则使用指定字符集的默认排序规则：utf8mb4 -> utf8mb4, utf8mb4_general_ci
只指定排序规则，则使用排序规则关联的字符集：utf8mb4_general_ci -> utf8mb4, utf8mb4_general_ci
都没有指定，使用上级的字符集、排序规则

使用ALTER TABLE转换字符集时需注意兼容性，如utf8mb4是utf8的超集，则utf8mb4转utf8会丢数据，若编码格式不同会乱码

在数据库连接中指定编码

Character_set_client 客户端发送语句的字符集
Character_set_connection 服务器从客户端接收语句后将语句转换后的字符集
Character_set_results 服务器将查询结果返回给客户端的字符集

查看session字符集

        
SHOW SESSION VARIABLES LIKE 'character\_set\_%';
SHOW SESSION VARIABLES LIKE 'collation\_%';

字符集冲突解决策略

https://dev.mysql.com/doc/refman/5.7/en/charset-collation-coercibility.html

若存在不同排序规则（字符集），则MySQL会按下图试图转换其中一种排序规则（字符集）来解决

Use the collation with the lowest coercibility value.
If both sides have the same coercibility, then:
- If both sides are Unicode, or both sides are not Unicode, it is an error.
- If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. For example, the following statement does not return an error:
  1 SELECT CONCAT(utf8_column, latin1_column) FROM t1;
  It returns a result that has a character set of utf8 and the same collation as utf8_column. Values of latin1_column are automatically converted to utf8 before concatenating.
- For an operation with operands from the same character set but that mix a _bin collation and a _ci or _cs collation, the _bin collation is used. This is similar to how operations that mix nonbinary and binary strings evaluate the operands as binary strings, applied to collations rather than data types.

各规则的coercibility value

An explicit COLLATE clause has a coercibility of 0 (not coercible at all).
The concatenation of two strings with different collations has a coercibility of 1.
The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.
The collation of a literal has a coercibility of 4.
The collation of a numeric or temporal value has a coercibility of 5.
NULL or an expression that is derived from NULL has a coercibility of 6.

从utf转换为utf8mb4

https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html

由于char、varchar、text等指定的是字段的字符数(MySQL5+)，而每种类型最长字节数有限制，所以需要考虑转换后字段长度是否会超出最大字节数

type	max len(byte)	utf8 len	utf8mb4 len	remark
char	255	255	255
varchar	65533	21,844		255byte内1个byte标记长度，超过255使用2个字节标记长度
binary	255
varbinary	65533			255byte内1个byte标记长度，超过2552个字节标记长度
tinytext	255
text	65535
mediumtext	16777215
longtext	4294967295

所有文本类型若是可为空则再减少1byte用于存储是否为空

索引同理，innodb最大索引长度为768字节，所以转utfmb4后最大索引长度为191字符

操作

修改字符集sql说明：

确认数据库和表的字符集和排序规则：可以通过以下命令查看数据库和表的字符集和排序规则：
1 2 SHOW CREATE DATABASE database_name; SHOW CREATE TABLE table_name;
如果字符集和排序规则不是utf8mb4和utf8mb4_bin，则需要修改数据库和表的字符集和排序规则。
修改数据库的字符集和排序规则：可以通过以下命令修改数据库的字符集和排序规则：
1 ALTER DATABASE database_name CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

修改表和表中所有文本类型字段的字符集和排序规则：

        
      
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

只修改字段的字符集和排序规则：

        
      
ALTER TABLE table_name MODIFY column_name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

操作流程：

修改字段的字符集或排序规则会锁表，锁表时间根据表中记录的字符数量而定，时间较长，谨慎操作，避免业务期间操作
修改字符集前需确认所有表的字符长度不会超出文本类型的最大字节数，如果可能超出的需要修改字段类型
执行以下sql：需将 “数据库名” 改为对应数据库，排序规则 utf8mb4_bin 根据需要修改

        
      
SELECT
	CONCAT( 'ALTER TABLE ', table_name, ' CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;' ) AS alterStatement 
FROM
	information_schema.TABLES 
WHERE
	table_schema = '数据库名';

该脚本将为每一个表生成一条修改语句，并将其作为alterStatement返回

将生成的修改语句复制到数据库中执行，这些语句将逐个修改每个表的字符集

This post is licensed under CC BY 4.0 by the author.