Sign In Try Free

Character Set and Collation

This document introduces the character sets and collations supported by TiDB.

Concepts

A character set is a set of symbols and encodings. The default character set in TiDB is utf8mb4, which matches the default in MySQL 8.0 and above.

A collation is a set of rules for comparing characters in a character set, and the sorting order of characters. For example in a binary collationAandado not compare as equal:


              
SETNAMES utf8mb4COLLATEutf8mb4_bin;SELECT 'A' = 'a';SETNAMES utf8mb4COLLATEutf8mb4_general_ci;SELECT 'A' = 'a';

              
SELECT 'A' = 'a';

              
+ -----------+ | 'A' = 'a' | + -----------+ | 0 | + -----------+ 1 row in set(0.00sec)

              
SETNAMES utf8mb4COLLATEutf8mb4_general_ci;

              
Query OK,0 rowsaffected (0.00sec)

              
SELECT 'A' = 'a';

              
+ -----------+ | 'A' = 'a' | + -----------+ | 1 | + -----------+ 1 row in set(0.00sec)

TiDB defaults to using a binary collation. This differs from MySQL, which uses a case-insensitive collation by default.

Character sets and collations supported by TiDB

Currently, TiDB supports the following character sets:


              
SHOW CHARACTER SET;

              
+ ---------+-------------------------------------+-------------------+--------+ |Charset|Description| Default collation |Maxlen| + ---------+-------------------------------------+-------------------+--------+ |ascii|US ASCII|ascii_bin| 1 | | binary | binary | binary | 1 | |gbk|Chinese Internal Code Specification|gbk_bin| 2 | |latin1|Latin1|latin1_bin| 1 | |utf8|UTF-8Unicode|utf8_bin| 3 | |utf8mb4|UTF-8Unicode|utf8mb4_bin| 4 | + ---------+-------------------------------------+-------------------+--------+ 6 rows in set(0.00sec)

TiDB supports the following collations:


              
SHOW COLLATION;

              
+ --------------------+---------+------+---------+----------+---------+ | Collation |Charset|Id| Default |Compiled|Sortlen| + --------------------+---------+------+---------+----------+---------+ |ascii_bin|ascii| 65 |Yes|Yes| 1 | | binary | binary | 63 |Yes|Yes| 1 | |gbk_bin|gbk| 87 | |Yes| 1 | |gbk_chinese_ci|gbk| 28 |Yes|Yes| 1 | |latin1_bin|latin1| 47 |Yes|Yes| 1 | |utf8_bin|utf8| 83 |Yes|Yes| 1 | |utf8_general_ci|utf8| 33 | |Yes| 1 | |utf8_unicode_ci|utf8| 192 | |Yes| 1 | |utf8mb4_bin|utf8mb4| 46 |Yes|Yes| 1 | |utf8mb4_general_ci|utf8mb4| 45 | |Yes| 1 | |utf8mb4_unicode_ci|utf8mb4| 224 | |Yes| 1 | + --------------------+---------+------+---------+----------+---------+ 11 rows in set(0.00sec)

You can use the following statement to view the collations (under thenew framework for collations) that corresponds to the character set.


              
SHOW COLLATION WHERECharset= 'utf8mb4';

              
+ --------------------+---------+------+---------+----------+---------+ | Collation |Charset|Id| Default |Compiled|Sortlen| + --------------------+---------+------+---------+----------+---------+ |utf8mb4_bin|utf8mb4| 46 |Yes|Yes| 1 | |utf8mb4_general_ci|utf8mb4| 45 | |Yes| 1 | |utf8mb4_unicode_ci|utf8mb4| 224 | |Yes| 1 | + --------------------+---------+------+---------+----------+---------+ 3 rows in set(0.00sec)

For details about the TiDB support of the GBK character set, seeGBK.

utf8andutf8mb4in TiDB

In MySQL, the character setutf8是有限的,最多三个字节。这是苏fficient to store characters in the Basic Multilingual Plane (BMP), but not enough to store characters such as emojis. For this, it is recommended to use the character setutf8mb4instead.

By default, TiDB also limits the character setutf8to a maximum of three bytes to ensure that data created in TiDB can still safely be restored in MySQL. You can disable it by changing the value of the system variabletidb_check_mb4_value_in_utf8toOFF.

下面演示了wh默认行为en inserting a 4-byte emoji character into a table. TheINSERTstatement fails for theutf8character set, but succeeds forutf8mb4:


              
CREATE TABLEutf8_test ( cchar(1)NOT NULL)CHARACTER SETutf8;

              
Query OK,0 rowsaffected (0.09sec)

              
CREATE TABLEutf8m4_test ( cchar(1)NOT NULL)CHARACTER SETutf8mb4;

              
Query OK,0 rowsaffected (0.09sec)

              
INSERT INTOutf8_testVALUES('');

              
ERROR1366(HY000): incorrect utf8valuef09f9889()for columnc

              
INSERT INTOutf8m4_testVALUES('');

              
Query OK,1 rowaffected (0.02sec)

              
SELECT char_length(c), length(c), cFROMutf8_test;

              
Empty set(0.01sec)

              
SELECT char_length(c), length(c), cFROMutf8m4_test;

              
+ ----------------+-----------+------+ | char_length(c)|length(c)|c| + ----------------+-----------+------+ | 1 | 4 || + ----------------+-----------+------+ 1 row in set(0.00sec)

Character set and collation in different layers

The character set and collation can be set at different layers.

Database character set and collation

Each database has a character set and a collation. You can use the following statements to specify the database character set and collation:


              
CREATEDATABASE db_name [[DEFAULT]CHARACTER SETcharset_name] [[DEFAULT]COLLATEcollation_name]ALTERDATABASE db_name [[DEFAULT]CHARACTER SETcharset_name] [[DEFAULT]COLLATEcollation_name]

DATABASEcan be replaced withSCHEMAhere.

Different databases can use different character sets and collations. Use thecharacter_set_databaseandcollation_databaseto see the character set and collation of the current database:


              
CREATESCHEMA test1CHARACTER SETutf8mb4COLLATEutf8mb4_general_ci;

              
Query OK,0 rowsaffected (0.09sec)

              
USE test1;

              
Database changed

              
SELECT@@character_set_database, @@collation_database;

              
+ --------------------------|----------------------+ |@@character_set_database|@@collation_database| + --------------------------|----------------------+ |utf8mb4|utf8mb4_general_ci| + --------------------------|----------------------+ 1 row in set(0.00sec)

              
CREATESCHEMA test2CHARACTER SETlatin1COLLATElatin1_bin;

              
Query OK,0 rowsaffected (0.09sec)

              
USE test2;

              
Database changed

              
SELECT@@character_set_database, @@collation_database;

              
+ --------------------------|----------------------+ |@@character_set_database|@@collation_database| + --------------------------|----------------------+ |latin1|latin1_bin| + --------------------------|----------------------+ 1 row in set(0.00sec)

You can also see the two values inINFORMATION_SCHEMA:


              
SELECTDEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAMEFROMINFORMATION_SCHEMA.SCHEMATAWHERESCHEMA_NAME= 'db_name';

Table character set and collation

You can use the following statement to specify the character set and collation for tables:


              
CREATE TABLEtbl_name (column_list) [[DEFAULT]CHARACTER SETcharset_name] [COLLATEcollation_name]]ALTER TABLEtbl_name [[DEFAULT]CHARACTER SETcharset_name] [COLLATEcollation_name]

For example:


              
CREATE TABLEt1(aint)CHARACTER SETutf8mb4COLLATEutf8mb4_general_ci;

              
Query OK,0 rowsaffected (0.08sec)

If the table character set and collation are not specified, the database character set and collation are used as their default values.

Column character set and collation

You can use the following statement to specify the character set and collation for columns:


              
col_name {CHAR | VARCHAR |TEXT} (col_length) [CHARACTER SETcharset_name] [COLLATEcollation_name] col_name {ENUM| SET} (val_list) [CHARACTER SETcharset_name] [COLLATEcollation_name]

If the column character set and collation are not specified, the table character set and collation are used as their default values.

String character sets and collation

Each string corresponds to a character set and a collation. When you use a string, this option is available:


              
[_charset_name]'string'[COLLATEcollation_name]

Example:


              
SELECT 'string';SELECT_utf8mb4'string';SELECT_utf8mb4'string' COLLATEutf8mb4_general_ci;

Rules:

  • Rule 1: If you specifyCHARACTER SET charset_nameandCOLLATE collation_name,n thecharset_namecharacter set and thecollation_namecollation are used directly.
  • Rule 2: If you specifyCHARACTER SET charset_namebut do not specifyCOLLATE collation_name,charset_namecharacter set and the default collation ofcharset_nameare used.
  • Rule 3: If you specify neitherCHARACTER SET charset_namenorCOLLATE collation_name,character set and collation given by the system variablescharacter_set_connectionandcollation_connectionare used.

Client connection character set and collation

  • The server character set and collation are the values of thecharacter_set_serverandcollation_serversystem variables.

  • The character set and collation of the default database are the values of thecharacter_set_databaseandcollation_databasesystem variables.

You can usecharacter_set_connectionandcollation_connectionto specify the character set and collation for each connection. Thecharacter_set_clientvariable is to set the client character set.

Before returning the result, thecharacter_set_results系统变量表示the character set in which the server returns query results to the client, including the metadata of the result.

You can use the following statement to set the character set and collation that is related to the client:

  • SET NAMES 'charset_name' [COLLATE 'collation_name']

    SET NAMESindicates what character set the client will use to send SQL statements to the server.SET NAMES utf8mb4indicates that all the requests from the client use utf8mb4, as well as the results from the server.

    TheSET NAMES 'charset_name'statement is equivalent to the following statement combination:

    
                    
    SETcharacter_set_client=charset_name;SETcharacter_set_results=charset_name;SETcharacter_set_connection=charset_name;

    COLLATEis optional, if absent, the default collation of thecharset_nameis used to set thecollation_connection.

  • SET CHARACTER SET 'charset_name'

    Similar toSET NAMES,SET NAMES 'charset_name'statement is equivalent to the following statement combination:

    
                    
    SETcharacter_set_client=charset_name;SETcharacter_set_results=charset_name;SETcharset_connection=@@charset_database;SETcollation_connection=@@collation_database;

Selection priorities of character sets and collations

String > Column > Table > Database > Server

General rules on selecting character sets and collation

  • Rule 1: If you specifyCHARACTER SET charset_nameandCOLLATE collation_name,n thecharset_namecharacter set and thecollation_namecollation are used directly.
  • Rule 2: If you specifyCHARACTER SET charset_nameand do not specifyCOLLATE collation_name,n thecharset_namecharacter set and the default collation ofcharset_nameare used.
  • Rule 3: If you specify neitherCHARACTER SET charset_namenorCOLLATE collation_name,character set and collation with higher optimization levels are used.

Validity check of characters

If the specified character set isutf8orutf8mb4, TiDB only supports the validutf8characters. For invalid characters, TiDB reports theincorrect utf8 valueerror. This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions.

To disable this error reporting, useset @@tidb_skip_utf8_check=1;to skip the character check.

Collation support framework

The syntax support and semantic support for the collation are influenced by thenew_collations_enabled_on_first_bootstrapconfiguration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings.

Before v4.0, TiDB provides only theold framework for collations. In this framework, TiDB supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations.

Since v4.0, TiDB supports anew framework for collations. In this framework, TiDB semantically parses different collations and strictly follows the collations when comparing strings.

Old framework for collations

Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collations, which means that the byte order determines the character order. Different from MySQL, TiDB does not handle the trailing spaces of a character, which causes the following behavior differences:


              
CREATE TABLEt(avarchar(20) charset utf8mb4collateutf8mb4_general_ciPRIMARYKEY);

              
Query OK,0 rowsaffected

              
INSERT INTOtVALUES('A');

              
Query OK,1 rowaffected

              
INSERT INTOtVALUES('a');

              
Query OK,1 rowaffected

In TiDB, the preceding statement is successfully executed. In MySQL, becauseutf8mb4_general_ciis case-insensitive, theDuplicate entry 'a'error is reported.


              
INSERT INTOt1VALUES('a ');

              
Query OK,1 rowaffected

In TiDB, the preceding statement is successfully executed. In MySQL, because comparison is performed after the spaces are filled in, theDuplicate entry 'a 'error is returned.

New framework for collations

Since TiDB v4.0, a complete framework for collations is introduced.

This new framework supports semantically parsing collations and introduces thenew_collations_enabled_on_first_bootstrapconfiguration item to decide whether to enable the new framework when a cluster is first initialized. To enable the new framework, setnew_collations_enabled_on_first_bootstraptotrue. For details, seenew_collations_enabled_on_first_bootstrap. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through thenew_collation_enabledvariable in themysql.tidbtable:


              
SELECTVARIABLE_VALUEFROMmysql.tidbWHEREVARIABLE_NAME= 'new_collation_enabled';

              
+ ----------------+ |VARIABLE_VALUE| + ----------------+ | True | + ----------------+ 1 row in set(0.00sec)

This new framework supports semantically parsing collations. TiDB enables the new framework by default when a cluster is first initialized.

Under the new framework, TiDB supports theutf8_general_ci,utf8mb4_general_ci,utf8_unicode_ci,utf8mb4_unicode_ci,gbk_chinese_ci, andgbk_bincollations, which is compatible with MySQL.

When one ofutf8_general_ci,utf8mb4_general_ci,utf8_unicode_ci,utf8mb4_unicode_ci, andgbk_chinese_ciis used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation'sPADDINGbehavior:


              
CREATE TABLEt(avarchar(20) charset utf8mb4collateutf8mb4_general_ciPRIMARYKEY);

              
Query OK,0 rowsaffected (0.00sec)

              
INSERT INTOtVALUES('A');

              
Query OK,1 rowaffected (0.00sec)

              
INSERT INTOtVALUES('a');

              
ERROR1062(23000): Duplicate entry'a' forkey't.PRIMARY'# TiDBiscompatiblewiththecase - insensitive collation ofMySQL.

              
INSERT INTOtVALUES('a ');

              
ERROR1062(23000): Duplicate entry'a ' forkey't.PRIMARY'# TiDBmodifiesthe `PADDING` behaviortobe compatiblewithMySQL.

Coercibility values of collations in expressions

If an expression involves multiple clauses of different collations, you need to infer the collation used in the calculation. The rules are as follows:

  • The coercibility value of the explicitCOLLATEclause is0.
  • If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is1.
  • The collation of the column,CAST(),CONVERT(), orBINARY()has a coercibility value of2.
  • The system constant (the string returned byUSER ()orVERSION ()) has a coercibility value of3.
  • The coercibility value of constants is4.
  • The coercibility value of numbers or intermediate variables is5.
  • NULLor expressions derived fromNULLhas a coercibility value of6.

When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values. If the coercibility values of two clauses are the same, the collation is determined according to the following priority:

binary > utf8mb4_bin > (utf8mb4_general_ci = utf8mb4_unicode_ci) > utf8_bin > (utf8_general_ci = utf8_unicode_ci) > latin1_bin > ascii_bin

TiDB cannot infer the collation and reports an error in the following situations:

  • If the collations of two clauses are different and the coercibility value of both clauses is0.
  • If the collations of two clauses are incompatible and the returned type of expression isString.

COLLATEclause

TiDB supports using theCOLLATEclause to specify the collation of an expression. The coercibility value of this expression is0, which has the highest priority. See the following example:


              
SELECT 'a' =_utf8mb4'A' collateutf8mb4_general_ci;

              
+ -----------------------------------------------+ | 'a' =_utf8mb4'A' collateutf8mb4_general_ci| + -----------------------------------------------+ | 1 | + -----------------------------------------------+ 1 row in set(0.00sec)

For more details, seeConnection Character Sets and Collations.

Download PDF Request docs changes Ask questions on Discord
Playground
New
One-stop & interactive experience of TiDB's capabilities WITHOUT registration.
Was this page helpful?
Products
TiDB
TiDB Dedicated
TiDB Serverless
Pricing
Get Demo
Get Started
©2023PingCAP. All Rights Reserved.