Sign In Try Free

TiDB Lightning CSV Support and Restrictions

This document describes how to migrate data from CSV files to TiDB using TiDB Lightning. For information about how to generate CSV files from MySQL, seeExport to CSV files using Dumpling.

TiDB Lightning supports reading CSV (comma-separated values) data source, as well as other delimited format such as TSV (tab-separated values).

File name

A CSV file representing a whole table must be named asdb_name.table_name.csv. This will be restored as a tabletable_nameinside the databasedb_name.

If a table spans multiple CSV files, they should be named likedb_name.table_name.003.csv. The number part do not need to be continuous, but must be increasing and zero-padded.

The file extension must be*.csv, even if the content is not separated by commas.

Schema

CSV files are schema-less. To import them into TiDB, a table schema must be provided. This could be done either by:

  • Providing a file nameddb_name.table_name-schema.sqlcontaining theCREATE TABLEDDL statement, and also a file nameddb_name-schema-create.sqlcontaining theCREATE DATABASEDDL statement.
  • Creating the empty tables directly in TiDB in the first place, and then setting[mydumper] no-schema = trueintidb-lightning.toml.

配置

The CSV format can be configured intidb-lightning.tomlunder the[mydumper.csv]section. Most settings have a corresponding option in the MySQLLOAD DATAstatement.


              
[mydumper.csv] # Separator between fields. Must not be empty. separator=',' # Quoting delimiter. Empty value means no quoting. delimiter='"' # Whether the CSV files contain a header. # If `header` is true, the first line will be skipped. header=true # Whether the CSV contains any NULL value. # If `not-null` is true, all columns from CSV cannot be NULL. not-null=false # When `not-null` is false (that is, CSV can contain NULL), # fields equal to this value will be treated as NULL. null='\N' # Whether to interpret backslash escapes inside fields. backslash-escape=true # If a line ends with a separator, remove it. trim-last-separator=false

In all string fields such asseparatoranddelimiter, if the input involves special characters, you can use backslash escape sequence to represent them in adouble-quotedstring ("…"). For example,separator = "\u001f"means using the ASCII character 0x1F as separator.

Additionally, you can usesingle-quotedstrings ('…')抑制反斜杠光电子能谱ping. For example,separator = '\t'means using the two-character string: a backslash followed by the letter "t", as the separator.

See theTOML v1.0.0 specificationfor details.

separator

  • Defines the field separator.

  • Can be multiple characters, but must not be empty.

  • Common values:

    • ','for CSV (comma-separated values)
    • "\t"for TSV (tab-separated values)
    • "\u0001"to use the ASCII character 0x01 as separator
  • Corresponds to theFIELDS TERMINATED BYoption in the LOAD DATA statement.

delimiter

  • Defines the delimiter used for quoting.

  • Ifdelimiteris empty, all fields are unquoted.

  • Common values:

    • '"'quote fields with double-quote, same asRFC 4180
    • ''disable quoting
  • Corresponds to theFIELDS ENCLOSED BYoption in theLOAD DATAstatement.

  • WhetherallCSV files contain a header row.
  • Ifheaderis true, the first row will be used as thecolumn names. Ifheaderis false, the first row is not special and treated as an ordinary data row.

not-nullandnull

  • Thenot-nullsetting controls whether all fields are non-nullable.

  • Ifnot-nullis false, the string specified bynullwill be transformed to the SQL NULL instead of a concrete value.

  • Quoting will not affect whether a field is null.

    For example, with the CSV file:

    
                    
    A,B,C \N,"\N",

    In the default settings (not-null = false; null = '\N'), the columnsAandBare both converted to NULL after importing to TiDB. The columnCis simply the empty string''but not NULL.

backslash-escape

  • Whether to interpret backslash escapes inside fields.

  • Ifbackslash-escapeis true, the following sequences are recognized and transformed:

    Sequence Converted to
    \ 0 空字符(U + 0000)
    \b Backspace (U+0008)
    \n Line feed (U+000A)
    \r Carriage return (U+000D)
    \t Tab (U+0009)
    \Z Windows EOF (U+001A)

    In all other cases (for example,\") the backslash is simply stripped, leaving the next character (") in the field. The character left has no special roles (for example, delimiters) and is just an ordinary character.

  • Quoting will not affect whether backslash escapes are interpreted.

  • Corresponds to theFIELDS ESCAPED BY '\'option in theLOAD DATAstatement.

trim-last-separator

  • Treats the fieldseparatoras a terminator, and removes all trailing separators.

    For example, with the CSV file:

    
                    
    A,,B,,
  • Whentrim-last-separator = false, this is interpreted as a row of 5 fields('A', '', 'B', '', '').

  • Whentrim-last-separator = true, this is interpreted as a row of 3 fields('A', '', 'B').

Non-configurable options

TiDB Lightning does not support every option supported by theLOAD DATAstatement. Some examples:

  • The line terminator must only be CR (\r),低频(\n) or CRLF (\r\n), which meansLINES TERMINATED BYis not customizable.
  • There cannot be line prefixes (LINES STARTING BY).
  • The header cannot be simply skipped (IGNORE n LINES). It must be valid column names if present.

Strict format

Lightning works the best when the input files have uniform size around 256 MB. When the input is a single huge CSV file, TiDB Lightning can only use one thread to process it, which slows down import speed a lot.

This can be fixed by splitting the CSV into multiple files first. For the generic CSV format, there is no way to quickly identify when a row starts and ends without reading the whole file. Therefore, TiDB Lightning by default doesnotautomatically split a CSV file. However, if you are certain that the CSV input adheres to certain restrictions, you can enable thestrict-formatsetting to allow TiDB Lightning to split the file into multiple 256 MB-sized chunks for parallel processing.


              
[mydumper] strict-format=true

Currently, a strict CSV file means every field occupies only a single line. In other words, one of the following must be true:

  • Delimiter is empty, or
  • Every field does not contain CR (\r) or LF (\n).

If a CSV file is not strict, butstrict-formatwas wrongly set totrue, a field spanning multiple lines may be cut in half into two chunks, causing parse failure, or even worse, quietly importing corrupted data.

Common configurations

CSV

The default setting is already tuned for CSV following RFC 4180.


              
[mydumper.csv] separator=',' delimiter='"' header=true not-null=false null='\N' backslash-escape=true trim-last-separator=false

Example content:


              
ID,Region,Count 1,"East",32 2,"South",\N 3,"West",10 4,"North",39

TSV


              
[mydumper.csv] separator="\t" delimiter='' header=true not-null=false null='NULL' backslash-escape=false trim-last-separator=false

Example content:


              
ID Region Count 1 East 32 2 South NULL 3 West 10 4 North 39

TPC-H DBGEN


              
[mydumper.csv] separator='|' delimiter='' header=false not-null=true backslash-escape=false trim-last-separator=true

Example content:


              
1|East|32| 2|South|0| 3|West|10| 4|North|39|
Download PDF Request docs changes Ask questions on Discord
Playground
New
One-stop & interactive experience of TiDB's capabilities WITHOUT registration.
Was this page helpful?
Products
TiDB
TiDB Dedicated
TiDB Serverless
Pricing
Get Demo
Get Started
©2023PingCAP. All Rights Reserved.