
Troubleshoot TiCDC

This document introduces the common errors you might encounter when using TiCDC, and the corresponding maintenance and troubleshooting methods.

TiCDC replication interruptions

How do I know whether a TiCDC replication task is interrupted?

  • Check the changefeed checkpoint monitoring metric of the replication task (choose the right changefeed id) in the Grafana dashboard. If the metric value stays unchanged, or the checkpoint lag metric keeps increasing, the replication task might be interrupted.
  • Check the exit error count monitoring metric. If the metric value is greater than 0, an error has occurred in the replication task.
  • Execute cdc cli changefeed list and cdc cli changefeed query to check the status of the replication task. stopped means the task has stopped, and the error item provides the detailed error message. After the error occurs, you can search error on running processor in the TiCDC server log to see the error stack for troubleshooting.
  • In some extreme cases, the TiCDC service is restarted. You can search the FATAL level log in the TiCDC server log for troubleshooting.

How do I know whether the replication task is stopped manually?

You can know whether the replication task is stopped manually by executing cdc cli. For example:

cdc cli changefeed query --server=http://127.0.0.1:8300 --changefeed-id 28c43ffc-2316-4f4f-a70b-d1a7c59ba79f

In the output of the above command, admin-job-type shows the state of this replication task:

  • 0: In progress, which means that the task is not stopped manually.
  • 1: Paused. When the task is paused, all replication processors exit. The configuration and the replication status of the task are retained, so you can resume the task from checkpoint-ts.
  • 2: Resumed. The replication task resumes from checkpoint-ts.
  • 3: Removed. When the task is removed, all replication processors are ended, and the configuration information of the replication task is cleared up. The replication status is retained only for later queries.
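When scripting around cdc cli output, the numeric admin-job-type can be mapped to a readable label. The following shell helper is a minimal sketch based only on the values listed above; the function name admin_job_state is an example, not part of the TiCDC toolset.

```shell
# admin_job_state: map the numeric admin-job-type from the output of
# `cdc cli changefeed query` to a readable label (values as documented above).
# The function name is hypothetical, not part of cdc cli.
admin_job_state() {
  case "$1" in
    0) echo "in progress" ;;
    1) echo "paused" ;;
    2) echo "resumed" ;;
    3) echo "removed" ;;
    *) echo "unknown" ;;
  esac
}

admin_job_state 1   # prints "paused"
```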

How do I handle replication interruptions?

A replication task might be interrupted in the following known scenarios:

  • The downstream continues to be abnormal, and TiCDC still fails after many retries.

    • In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of gc-ttl.

    • Handling method: You can resume the replication task via the HTTP interface after the downstream is back to normal.

  • Replication cannot continue because of incompatible SQL statement(s) in the downstream.

    • In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of gc-ttl.
    • Handling procedures:
      1. Query the status information of the replication task using the cdc cli changefeed query command and record the value of checkpoint-ts.
      2. Use the new task configuration file and add the ignore-txn-start-ts parameter to skip the transaction corresponding to the specified start-ts.
      3. Stop the old replication task via HTTP API. Execute cdc cli changefeed create to create a new task and specify the new task configuration file. Specify checkpoint-ts recorded in step 1 as the start-ts and start a new task to resume the replication.
  • In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.

    • In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of gc-ttl.
    • Handling procedures:
      1. Pause the replication task by executing cdc cli changefeed pause -c <changefeed-id>.
      2. Wait for about one minute, and then resume the replication task by executing cdc cli changefeed resume -c <changefeed-id>.

What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption?

  • Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions.

How do I handle the Error 1298: Unknown or incorrect time zone: 'UTC' error when creating a replication task or replicating data to MySQL?

This error is returned when the downstream MySQL does not load the time zone. You can load the time zone by running mysql_tzinfo_to_sql. After loading the time zone, you can create tasks and replicate data normally.

mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql -p

If the output of the command above is similar to the following one, the import is successful:

Enter password:
Warning: Unable to load '/usr/share/zoneinfo/iso3166.tab' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/leap-seconds.list' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/zone.tab' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/zone1970.tab' as time zone. Skipping it.

If the downstream is a special MySQL environment (a public cloud RDS or some MySQL derivative versions) and importing the time zone using the preceding method fails, you can use the default time zone of the downstream by setting time-zone to an empty value, such as time-zone="".

When using time zones in TiCDC, it is recommended to explicitly specify the time zone, such as time-zone="Asia/Shanghai". Also, make sure that the tz specified in TiCDC server configurations and the time-zone specified in the Sink URI are consistent with the time zone configuration of the downstream database. This prevents data inconsistency caused by inconsistent time zones.
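As a sketch, a changefeed that pins the time zone explicitly in the Sink URI might be created as follows; the server address and MySQL credentials are placeholder values, not a prescribed setup:

```shell
cdc cli changefeed create --server=http://127.0.0.1:8300 \
    --sink-uri="mysql://root:123456@127.0.0.1:3306/?time-zone=Asia/Shanghai"
```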

How do I handle the incompatibility issue of configuration files caused by a TiCDC upgrade?

Refer to Notes for compatibility.

The start-ts timestamp of the TiCDC task is quite different from the current time. During the execution of this task, replication is interrupted and an error [CDC:ErrBufferReachLimit] occurs. What should I do?

Since v4.0.9, you can try to enable the unified sorter feature in your replication task, or use the BR tool for an incremental backup and restore, and then start the TiCDC replication task from a new time.

When the downstream of a changefeed is a database similar to MySQL and TiCDC executes a time-consuming DDL statement, all other changefeeds are blocked. What should I do?

  1. Pause the execution of the changefeed that contains the time-consuming DDL statement. Then you can see that other changefeeds are no longer blocked.
  2. Search for the apply job field in the TiCDC log and confirm the start-ts of the time-consuming DDL statement.
  3. Manually execute the DDL statement in the downstream. After the execution finishes, go on performing the following operations.
  4. Modify the changefeed configuration and add the above start-ts to the ignore-txn-start-ts configuration item.
  5. Resume the paused changefeed.
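Step 4 above corresponds to the filter section of the changefeed configuration file. A minimal sketch, where 415241823337054209 is a placeholder standing in for the start-ts found in step 2:

```toml
# changefeed.toml (sketch) -- skip the transaction whose start-ts was
# recorded in step 2; the timestamp below is a placeholder value.
[filter]
ignore-txn-start-ts = [415241823337054209]
```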

After I upgrade the TiCDC cluster to v4.0.8, the [CDC:ErrKafkaInvalidConfig]Canal requires old value to be enabled error is reported when I execute a changefeed. What should I do?

Since v4.0.8, if the canal-json, canal or maxwell protocol is used for output in a changefeed, TiCDC enables the old value feature automatically. However, if you have upgraded TiCDC from an earlier version to v4.0.8 or later, when the changefeed uses the canal-json, canal or maxwell protocol and the old value feature is disabled, this error is reported.

To fix the error, take the following steps:

  1. Set the value of enable-old-value in the changefeed configuration file to true.

  2. Execute cdc cli changefeed pause to pause the replication task.

    cdc cli changefeed pause -c test-cf --server=http://127.0.0.1:8300
  3. Execute cdc cli changefeed update to update the original changefeed configuration.

    cdc cli changefeed update -c test-cf --server=http://127.0.0.1:8300 --sink-uri="mysql://127.0.0.1:3306/?max-txn-row=20&worker-number=8" --config=changefeed.toml
  4. Execute cdc cli changefeed resume to resume the replication task.

    cdc cli changefeed resume -c test-cf --server=http://127.0.0.1:8300
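Step 1 above amounts to a single line in the changefeed configuration file, for example:

```toml
# changefeed.toml (sketch) -- enable the old value feature required by the
# canal-json, canal, and maxwell protocols.
enable-old-value = true
```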

The [tikv:9006]GC life time is shorter than transaction duration, transaction starts at xx, GC safe point is yy error is reported when I use TiCDC to create a changefeed. What should I do?

You need to run the pd-ctl service-gc-safepoint --pd=<pd-addr> command to query the current GC safepoint and service GC safepoint. If the GC safepoint is smaller than the start-ts of the TiCDC replication task (changefeed), you can directly add the --disable-gc-check option to the cdc cli changefeed create command to create a changefeed.

If the result of pd-ctl service-gc-safepoint --pd=<pd-addr> does not have gc_worker service_id:

  • If your PD version is v4.0.8 or earlier, refer to PD issue #3128 for details.
  • If your PD is upgraded from v4.0.8 or an earlier version to a later version, refer to PD issue #3366 for details.

When I use TiCDC to replicate messages to Kafka, Kafka returns theMessage was too largeerror. Why?

For TiCDC v4.0.8 or earlier versions, you cannot effectively control the size of the message output to Kafka only by configuring the max-message-bytes setting for Kafka in the Sink URI. To control the message size, you also need to increase the limit on the bytes of messages to be received by Kafka. To add such a limit, add the following configuration to the Kafka server configuration:

# The maximum byte number of a message that the broker receives
message.max.bytes=2147483648
# The maximum byte number of a message that the broker copies
replica.fetch.max.bytes=2147483648
# The maximum message byte number that the consumer side reads
fetch.message.max.bytes=2147483648
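On the TiCDC side, the max-message-bytes parameter is carried in the Kafka Sink URI. A sketch only: the broker address, topic name, and the 1 MiB value below are placeholders, not recommended settings.

```shell
cdc cli changefeed create --server=http://127.0.0.1:8300 \
    --sink-uri="kafka://127.0.0.1:9092/topic-name?max-message-bytes=1048576"
```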

How can I find out whether a DDL statement fails to execute in the downstream during TiCDC replication? How do I resume the replication?

If a DDL statement fails to execute, the replication task (changefeed) automatically stops. The checkpoint-ts is the DDL statement's finish-ts minus one. If you want TiCDC to retry executing this statement in the downstream, use cdc cli changefeed resume to resume the replication task. For example:

cdc cli changefeed resume -c test-cf --server=http://127.0.0.1:8300

If you want to skip this DDL statement that goes wrong, set the start-ts of the changefeed to the checkpoint-ts (the timestamp at which the DDL statement goes wrong) plus one, and then run the cdc cli changefeed create command to create a new changefeed task. For example, if the checkpoint-ts at which the DDL statement goes wrong is 415241823337054209, run the following commands to skip this DDL statement:

cdc cli changefeed remove --server=http://127.0.0.1:8300 --changefeed-id simple-replication-task
cdc cli changefeed create --server=http://127.0.0.1:8300 --sink-uri="mysql://root:123456@127.0.0.1:3306/" --changefeed-id="simple-replication-task" --sort-engine="unified" --start-ts 415241823337054210
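The "plus one" arithmetic above can be done directly in the shell, since a TSO is a plain 64-bit integer. A minimal sketch using the checkpoint-ts value from this example:

```shell
# Compute the new start-ts from the checkpoint-ts at which the DDL failed.
checkpoint_ts=415241823337054209   # example value from this document
start_ts=$((checkpoint_ts + 1))    # one past the failing DDL statement
echo "$start_ts"                   # prints 415241823337054210
```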