
Common Deployment Failures of TiDB on Kubernetes

This document describes the common deployment failures of TiDB on Kubernetes and their solutions.

The Pod is not created normally

After creating a cluster, if the Pod is not created, you can diagnose it using the following commands:

```shell
kubectl get tidbclusters -n ${namespace} && \
kubectl describe tidbclusters -n ${namespace} ${cluster_name} && \
kubectl get statefulsets -n ${namespace} && \
kubectl describe statefulsets -n ${namespace} ${cluster_name}-pd
```

After creating a backup/restore task, if the Pod is not created, you can perform a diagnostic operation by executing the following commands:

```shell
kubectl get backups -n ${namespace}
kubectl get jobs -n ${namespace}
kubectl describe backups -n ${namespace} ${backup_name}
kubectl describe backupschedules -n ${namespace} ${backupschedule_name}
kubectl describe jobs -n ${namespace} ${backupjob_name}
kubectl describe restores -n ${namespace} ${restore_name}
```

The Pod is in the Pending state

The Pending state of a Pod is usually caused by insufficient resources, for example:

  • The `StorageClass` of the PVC used by the PD, TiKV, TiFlash, Pump, Monitor, Backup, and Restore Pods does not exist, or the available PVs are insufficient.
  • No nodes in the Kubernetes cluster can satisfy the CPU or memory resources requested by the Pod.
  • The number of TiKV or PD replicas and the number of nodes in the cluster do not satisfy the high availability scheduling policy of tidb-scheduler.

You can check the specific reason for Pending by using the `kubectl describe pod` command:

```shell
kubectl describe po -n ${namespace} ${pod_name}
```

CPU or memory resources are insufficient

If the CPU or memory resources are insufficient, you can lower the CPU or memory resources requested by the corresponding component for scheduling, or add a new Kubernetes node.
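As a sketch, the per-component requests can be lowered directly in the TidbCluster spec (the field layout follows the TidbCluster CRD; the values below are illustrative placeholders, not recommendations):

```yaml
spec:
  tikv:
    requests:
      # Illustrative values: lower these so the Pod fits on an existing node
      cpu: "1"
      memory: 2Gi
```

After updating, verify that the Pod leaves the Pending state with `kubectl get po -n ${namespace}`.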

StorageClass of the PVC does not exist

If the `StorageClass` of the PVC cannot be found, take the following steps:

  1. Get the available `StorageClass` in the cluster:

     ```shell
     kubectl get storageclass
     ```

  2. Change `storageClassName` to the name of a `StorageClass` available in the cluster.

  3. Update the configuration file:

    • If you want to start the TiDB cluster, execute `kubectl edit tc ${cluster_name} -n ${namespace}` to update the cluster.
    • If you want to run a backup/restore task, first execute `kubectl delete bk ${backup_name} -n ${namespace}` to delete the old backup/restore task, and then execute `kubectl apply -f backup.yaml` to create a new one.
  4. Delete the StatefulSet and the corresponding PVCs:

     ```shell
     kubectl delete pvc -n ${namespace} ${pvc_name} && \
     kubectl delete sts -n ${namespace} ${statefulset_name}
     ```
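For reference, the `storageClassName` field from step 2 sits in the TidbCluster spec like this (a sketch; `local-storage` is a placeholder and must match a class listed by `kubectl get storageclass`):

```yaml
spec:
  pd:
    storageClassName: local-storage
  tikv:
    storageClassName: local-storage
```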

Insufficient available PVs

If a `StorageClass` exists in the cluster but the available PVs are insufficient, you need to add PV resources correspondingly. For a Local PV, you can expand it by referring to Local PV Configuration.

The high availability scheduling policy of tidb-scheduler is not satisfied

tidb-scheduler has a high availability scheduling policy for PD and TiKV. For the same TiDB cluster, if there are N replicas of PD, then at most `M=(N-1)/2` PD Pods can be scheduled to each node (if N<3, then M=1); if there are N replicas of TiKV, then at most `M=ceil(N/3)` TiKV Pods can be scheduled to each node (if N<3, then M=1; `ceil` means rounding up).

If the Pod's state becomesPendingbecause the high availability scheduling policy is not satisfied, you need to add more nodes in the cluster.
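The limits above are easy to check with a bit of shell arithmetic (a sketch; `N` is the planned TiKV replica count):

```shell
# Max TiKV Pods per node: M = ceil(N/3), with M = 1 when N < 3
N=5
if [ "$N" -lt 3 ]; then
  M=1
else
  M=$(( (N + 2) / 3 ))  # integer ceiling of N/3
fi
# Minimum node count needed to place all N replicas
MIN_NODES=$(( (N + M - 1) / M ))
echo "$M $MIN_NODES"
```

For `N=5`, this yields `M=2`, so at least 3 nodes are needed for the TiKV Pods to leave the Pending state.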

The Pod is in the `CrashLoopBackOff` state

A Pod in the `CrashLoopBackOff` state means that the container in the Pod repeatedly aborts (in a loop of abort, restart by `kubelet`, abort again). There are many potential causes of `CrashLoopBackOff`.

View the log of the current container

```shell
kubectl -n ${namespace} logs -f ${pod_name}
```

View the log when the container was last restarted

```shell
kubectl -n ${namespace} logs -p ${pod_name}
```

After checking the error messages in the log, you can refer to Cannot start `tidb-server`, Cannot start `tikv-server`, and Cannot start `pd-server` for further troubleshooting.

"cluster id mismatch"

When the "cluster id mismatch" message appears in the TiKV Pod log, the TiKV Pod might have used old data from another or a previous TiKV Pod. This error can occur if the data on the local disk remains uncleared when you configure local storage in the cluster, or if the data is not recycled by the local volume provisioner due to a forced deletion of the PV.

If you confirm that the TiKV should join the cluster as a new node and that the data on the PV should be deleted, you can delete the TiKV Pod and the corresponding PVC. The TiKV Pod automatically rebuilds and binds a new PV for use. When configuring local storage, delete the old data on the machine to avoid Kubernetes reusing it. In cluster operation and maintenance, manage PVs using the local volume provisioner and do not delete them forcibly. You can manage the lifecycle of a PV by creating and deleting PVCs, and by setting the `reclaimPolicy` of the PV.
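The cleanup described above can be sketched as follows (names are placeholders; by tidb-operator convention the TiKV PVC is typically named `tikv-${cluster_name}-tikv-${ordinal}`, but verify with `kubectl get pvc -n ${namespace}` first):

```shell
# Delete the PVC first so the rebuilt Pod binds a fresh PV
kubectl delete pvc -n ${namespace} tikv-${cluster_name}-tikv-0
# Then delete the Pod; the StatefulSet controller recreates it
kubectl delete pod -n ${namespace} ${cluster_name}-tikv-0
```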

`ulimit` is not big enough

TiKV might fail to start when `ulimit` is not big enough. In this case, you can modify the `/etc/security/limits.conf` file of the Kubernetes node to increase the `ulimit`:

```
root soft nofile 1000000
root hard nofile 1000000
root soft core unlimited
root soft stack 10240
```

PD Pod reports `nslookup domain failed`

In this case, the PD Pod log contains messages like the following:

```
Thu Jan 13 14:55:52 IST 2022
;; Got recursion not available from 10.43.0.10, trying next server
;; Got recursion not available from 10.43.0.10, trying next server
;; Got recursion not available from 10.43.0.10, trying next server
Server:    10.43.0.10
Address:   10.43.0.10#53

** server can't find basic-pd-0.basic-pd-peer.default.svc: NXDOMAIN
nslookup domain basic-pd-0.basic-pd-peer.default.svc failed
```

This type of failure occurs when the cluster meets both of the following conditions:

  • There are two `nameserver` entries in `/etc/resolv.conf`, and the second one is not the IP of CoreDNS.
  • The version of PD is one of the following:
    • Greater than or equal to v5.0.5.
    • Greater than or equal to v5.1.4.
    • Greater than or equal to v5.2.4.
    • Any 5.3 version.
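To check the first condition, you can inspect `/etc/resolv.conf` inside the PD Pod and compare it with the cluster DNS Service IP (a sketch; `basic-pd-0` is a placeholder Pod name, and the DNS Service is usually named `kube-dns` even when CoreDNS serves it):

```shell
kubectl exec -n ${namespace} basic-pd-0 -- cat /etc/resolv.conf
kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
```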

To address this failure, add `startUpScriptVersion` to the TidbCluster as follows:

```yaml
...
spec:
  pd:
    startUpScriptVersion: "v1"
...
```

This failure occurs because there is something wrong with `nslookup` in the base image (see details in #4379). After `startUpScriptVersion` is configured to `v1`, TiDB Operator uses `dig` to check DNS instead of `nslookup`.

Other causes

If you cannot confirm the cause from the log and the `ulimit` value is also normal, troubleshoot the issue by using the debug mode.
