NEW

Pulse 2025 Product Roundup: From Monitoring to AI-Native Control Plane

MySQL Character Set and Collation: Migrating to utf8mb4

MySQL's utf8 character set is a misnomer — it only stores up to 3 bytes per character (BMP only), which excludes emoji, some CJK extension characters, and several mathematical symbols. utf8mb4 is the correct 4-byte UTF-8 implementation and is the right choice for any table that will hold user-generated content. Failing to use it causes silent data truncation on emoji and error ERROR 1366: Incorrect string value for 4-byte characters.

Impact

Inserting a 4-byte Unicode character (any emoji, many special symbols) into a utf8 column silently truncates or errors depending on sql_mode. Columns declared as utf8 but receiving utf8mb4 data produce different behavior in different client drivers — some truncate silently, others raise an exception. Collation mismatches between columns in a JOIN produce ERROR 1267: Illegal mix of collations and prevent index use.

Common Causes

  1. Database or table created with the default character set when character_set_server was latin1 or utf8
  2. ORM that creates tables without specifying charset, inheriting the server default
  3. Migrating from MySQL 5.7 (where utf8 was the historical default for many installations) to 8.0 without converting data
  4. JOINing a utf8 column against a utf8mb4 column (different charsets prevent index use)
  5. utf8mb4_unicode_ci vs. utf8mb4_general_ci vs. utf8mb4_0900_ai_ci mismatches across tables that are JOINed

Checking Current Character Sets

-- Server defaults
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Database defaults
SELECT schema_name, default_character_set_name, default_collation_name
FROM information_schema.schemata
WHERE schema_name = 'mydb';

-- Table and column character sets
SELECT
    table_name,
    table_collation
FROM information_schema.tables
WHERE table_schema = 'mydb';

SELECT
    column_name, character_set_name, collation_name, column_type
FROM information_schema.columns
WHERE table_schema = 'mydb'
  AND table_name = 'users'
  AND character_set_name IS NOT NULL;

Converting to utf8mb4

  1. Set the server default for new databases and tables:

    [mysqld]
    character-set-server = utf8mb4
    collation-server = utf8mb4_0900_ai_ci   # MySQL 8.0 default — fast, Unicode 9.0 aware
    

    Or at runtime (for new connections and new objects):

    SET GLOBAL character_set_server = 'utf8mb4';
    SET GLOBAL collation_server = 'utf8mb4_0900_ai_ci';
    
  2. Convert an existing database:

    ALTER DATABASE mydb
        CHARACTER SET utf8mb4
        COLLATION utf8mb4_unicode_ci;
    

    This changes the default for new tables created in this database but does not convert existing tables or columns.

  3. Convert existing tables (non-blocking for InnoDB with ALGORITHM=INPLACE in many cases):

    -- Convert the table and all text columns
    ALTER TABLE users
        CONVERT TO CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci;
    

    CONVERT TO rewrites all CHAR, VARCHAR, TEXT, ENUM, and SET columns. This is a full table rebuild (ALGORITHM=COPY) if the column byte widths change (which they do for utf8utf8mb4 on VARCHAR). For large tables, use pt-online-schema-change or gh-ost.

  4. For large tables, convert column by column with online DDL:

    -- ALGORITHM=INPLACE may work for column charset changes in MySQL 8.0
    ALTER TABLE posts
        MODIFY COLUMN body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
        ALGORITHM=INPLACE,
        LOCK=NONE;
    

    If INPLACE is not supported for this operation, the statement will error (safe — nothing changes).

  5. Generate conversion statements for all tables in a database:

    SELECT CONCAT(
        'ALTER TABLE `', table_schema, '`.`', table_name,
        '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;'
    )
    FROM information_schema.tables
    WHERE table_schema = 'mydb'
      AND table_collation NOT LIKE 'utf8mb4%';
    

Choosing the Right Collation

  1. Collation comparison:

    Collation Speed Case Accent Unicode version Notes
    utf8mb4_general_ci Fastest insensitive insensitive Basic Legacy; avoid for new tables
    utf8mb4_unicode_ci Fast insensitive insensitive Unicode 4.0 Good compatibility
    utf8mb4_0900_ai_ci Fastest insensitive insensitive Unicode 9.0 MySQL 8.0 default; best choice for new tables
    utf8mb4_0900_as_cs Fast sensitive sensitive Unicode 9.0 For case/accent-sensitive lookups
    utf8mb4_bin Fast sensitive sensitive Binary Byte-exact comparison

    For most applications: utf8mb4_unicode_ci (MySQL 5.7 compat) or utf8mb4_0900_ai_ci (MySQL 8.0 only).

Connection Character Set

  1. Ensure the client connection uses utf8mb4:
    SET NAMES utf8mb4;
    -- or equivalently:
    SET character_set_client = utf8mb4;
    SET character_set_results = utf8mb4;
    SET character_set_connection = utf8mb4;
    
    In application drivers:
    # JDBC
    jdbc:mysql://host/db?useUnicode=true&characterEncoding=UTF-8
    
    # Python (mysql-connector-python)
    charset='utf8mb4'
    
    # SQLAlchemy
    create_engine("mysql+pymysql://user:pass@host/db?charset=utf8mb4")
    
    # Node (mysql2)
    charset: 'utf8mb4'
    

Index Length Considerations

  1. utf8mb4 uses up to 4 bytes per character. A VARCHAR(191) in utf8mb4 is 191 × 4 = 764 bytes — within InnoDB's 767-byte index key limit on older row formats. For MySQL 5.7 with COMPACT row format:

    -- If you hit "Specified key was too long; max key length is 767 bytes":
    SHOW VARIABLES LIKE 'innodb_large_prefix';  -- MySQL 5.7
    -- Enable large prefix if not on
    SET GLOBAL innodb_large_prefix = ON;
    SHOW VARIABLES LIKE 'innodb_file_format';
    SET GLOBAL innodb_file_format = Barracuda;
    

    In MySQL 8.0 with DYNAMIC row format (the default), the index key limit is 3072 bytes — VARCHAR(767) in utf8mb4 is fine.

    For TEXT and BLOB columns, prefix indexes are required:

    ALTER TABLE articles ADD INDEX idx_title (title(191));
    

Fixing Collation Mismatch Errors

  1. Error 1267: Illegal mix of collations appears when JOINing or comparing columns with different charsets or collations:
    -- Example: users.email is utf8mb4_unicode_ci but logins.email is utf8_general_ci
    SELECT u.* FROM users u JOIN logins l ON u.email = l.email;
    -- ERROR 1267: Illegal mix of collations
    
    -- Fix: COLLATE cast in the query (temporary)
    ON u.email = l.email COLLATE utf8mb4_unicode_ci
    
    -- Permanent fix: convert the column
    ALTER TABLE logins MODIFY COLUMN email VARCHAR(255)
        CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    

Additional Information

  • MySQL 8.0 sets utf8mb4 as the server default. New installations on 8.0 are correct out of the box; only upgrades from 5.7 or older need conversion.
  • The utf8 alias in MySQL is utf8mb3 internally (confirmed in MySQL 8.0.28). Using utf8 in new code works but issues a deprecation warning in MySQL 8.0.28+ and will be removed in a future release. Always use utf8mb4 explicitly.
  • ENUM and SET columns also have character sets and must be converted explicitly.
  • Converting tables with CONVERT TO is safe for data that was correctly stored as valid Unicode. Data stored as latin1 that represents actual binary data (image bytes, etc.) may be corrupted by charset conversion — inspect before converting.
  • Amazon Aurora and RDS respect the character_set_server parameter group setting, which defaults to utf8mb4 on newer parameter groups.

Frequently Asked Questions

Q: After converting my table, emoji are still truncated. Why? A: Check the connection charset. Even if the column is utf8mb4, if the client connection is negotiated as utf8 (3-byte), the data is truncated before reaching the server. Verify with SHOW SESSION VARIABLES LIKE 'character_set%'.

Q: Will CONVERT TO CHARACTER SET utf8mb4 corrupt existing data? A: No, if the existing data is valid UTF-8. utf8mb4 is a superset of MySQL's utf8 — all characters that fit in 3 bytes are identical in both charsets. Only 4-byte characters are new to utf8mb4.

Q: My VARCHAR(255) index fails after converting to utf8mb4. What happened? A: On MySQL 5.7 with the COMPACT row format, utf8mb4 × 255 characters × 4 bytes = 1020 bytes — over the 767-byte key limit. Solutions: enable innodb_large_prefix (5.7), switch to DYNAMIC row format, or reduce the column length to VARCHAR(191).

Q: Is utf8mb4_0900_ai_ci backward compatible with utf8mb4_unicode_ci? A: For most strings, yes — they sort and compare identically for common characters. However, Unicode 9.0 (0900) makes different ordering decisions for some edge cases and handles more Unicode codepoints. Do not mix them in a JOIN predicate without a COLLATE cast.

Q: I changed the server collation but existing rows still compare with the old collation. Why? A: The server/database collation only affects new objects. Existing columns retain their declared collation. You must ALTER TABLE ... CONVERT TO or MODIFY COLUMN to change the collation on existing data.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.