How to Use TiDB Cloud with AWS Glue Catalog

Author: Andrew Ren (TiDB Cloud Solutions Architect at PingCAP)

A data catalog is a collection of data metadata. The catalog is a glossary and inventory of available data across different data platforms such as databases, data warehouses, and data lakes. Data users, particularly analysts and data scientists, use it to help find specific data that they need. You can use the data catalog to store, annotate, and share metadata.

TiDB Cloud is a fully managed cloud service of TiDB. The user experience is similar to Amazon Relational Database Service (RDS) and Google Cloud SQL. With a few clicks in the UI, you can get a fully functional, production-ready database in either Amazon Web Services (AWS) or Google Cloud Platform (GCP).

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse or data lake, you must catalog this data.

This tutorial will walk you through how to integrate TiDB Cloud with AWS Glue Data Catalog and manage TiDB metadata with the catalog. Major steps include:

  1. Prepare your AWS account with the necessary networking and access configurations.
  2. Create a TiDB Cloud on-demand cluster.
  3. Create an AWS Glue Data Catalog and link your TiDB Cloud cluster to it.
  4. Create test data and check out the data catalog.
  5. Clean up the test environment.

TiDB Cloud and AWS Glue Data Catalog Architecture

Before you begin

Before you try the steps in this article, make sure you have:

  • A basic understanding of AWS
  • An AWS account
  • A TiDB Cloud account. (If you do not have one, sign up for one first.)

Prepare your AWS account

The first step is to prepare your account for AWS Glue and AWS Glue Data Catalog.

Create an S3 endpoint

Create an Amazon Simple Storage Service (S3) endpoint and attach it to the Amazon Virtual Private Cloud (VPC). Glue needs this S3 endpoint to export logs from your Glue workers in the VPC to S3.

  1. Go to AWS VPC. In the left panel, click Endpoints.
  2. Click Create Endpoint.
  3. In the Service name field, search for s3, select the Gateway type, and create the endpoint.

Note your AWS account ID, VPC ID, and the VPC’s Classless Inter-Domain Routing (CIDR) block. You’ll enter them in a later step.
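The endpoint could also be created from a script instead of the console. The following is a minimal sketch, assuming boto3 and its EC2 `create_vpc_endpoint` call; the helper name, region, VPC ID, and route table ID are placeholders introduced for illustration, not values from this tutorial.

```python
def s3_gateway_endpoint_params(region, vpc_id, route_table_ids):
    """Build keyword arguments for EC2 create_vpc_endpoint (S3, Gateway type)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # S3 service names follow the pattern com.amazonaws.<region>.s3
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


# Placeholder identifiers; replace them with your own.
params = s3_gateway_endpoint_params(
    "us-west-2", "vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"]
)

# With AWS credentials configured, the call would look like:
#   import boto3
#   boto3.client("ec2", region_name="us-west-2").create_vpc_endpoint(**params)
```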

Create a security group

Create a security group in the VPC named glue_eni. Later, you will assign it to the Glue worker so that it has network access.

  1. Create a security group and name it glue_eni.
  2. Set security group inbound and outbound rules.
    1. Specify a self-referencing inbound rule for all TCP ports, to allow AWS Glue components to communicate and also prevent access from other networks.
    2. Allow all outbound traffic. (These are the default settings.)
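A self-referencing rule names the security group itself as the traffic source. As a rough sketch, assuming boto3's EC2 `authorize_security_group_ingress` call, the rule could be expressed like this; the helper name and the group ID are hypothetical.

```python
def self_referencing_tcp_rule(group_id):
    """Ingress rule letting members of a group reach each other on all TCP ports."""
    return {
        "GroupId": group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 0,
                "ToPort": 65535,
                # The source is the group itself rather than a CIDR range.
                "UserIdGroupPairs": [{"GroupId": group_id}],
            }
        ],
    }


rule = self_referencing_tcp_rule("sg-0123456789abcdef0")  # placeholder ID

# With AWS credentials configured:
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(**rule)
```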

Prepare an IAM role

Prepare an Identity and Access Management (IAM) role to grant the necessary permissions to the Glue worker.

  1. Create an IAM role, choose the Glue use case, and name it glue_iam.
  2. Assign the policy AWSGlueServiceRole to glue_iam.
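Choosing the Glue use case in the console amounts to a trust policy that lets the Glue service assume the role. A hedged sketch of the scripted equivalent, assuming boto3's IAM `create_role` and `attach_role_policy` calls; the helper name is an illustration only.

```python
import json

# ARN of the AWS managed policy named in the step above.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy():
    """Return a trust policy document that lets AWS Glue assume the role."""
    return json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {"Service": "glue.amazonaws.com"},
                    "Action": "sts:AssumeRole",
                }
            ],
        }
    )


# With AWS credentials configured:
#   import boto3
#   iam = boto3.client("iam")
#   iam.create_role(RoleName="glue_iam", AssumeRolePolicyDocument=glue_trust_policy())
#   iam.attach_role_policy(RoleName="glue_iam", PolicyArn=GLUE_SERVICE_POLICY_ARN)
```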

Create a TiDB Cloud on-demand cluster

Now that your AWS account is ready, create a TiDB Cloud cluster and connect it to your AWS VPC environment.

Set up a customized CIDR

Before you create a TiDB Cloud cluster, set up a customized CIDR for TiDB Cloud, so you can link the TiDB Cloud VPC and your own AWS VPC later via VPC Peering.

  1. Go to your TiDB Cloud console.
  2. Click Network Access.
  3. Select Project CIDR.
  4. Input the CIDR. Make sure it’s different from your existing AWS VPC’s CIDR. Otherwise, you won’t be able to VPC peer them.
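In practice, "different" here means the two ranges must not overlap at all. A quick way to check a candidate CIDR before entering it is Python's standard `ipaddress` module; the CIDR values below are made-up examples.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share any address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


# Example values only: an AWS VPC CIDR and two candidate TiDB Cloud CIDRs.
aws_vpc_cidr = "172.31.0.0/16"
print(cidrs_overlap(aws_vpc_cidr, "172.31.16.0/20"))  # True: cannot be peered
print(cidrs_overlap(aws_vpc_cidr, "10.250.0.0/16"))   # False: safe to use
```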

Create a TiDB Cloud cluster

Now comes the exciting part: creating a TiDB Cloud cluster. Since it’s just a test, you can create the smallest usable cluster size: one TiDB node and three TiKV nodes.

  1. Go to your TiDB Cloud console. At the top right of the screen, click Create a cluster.
  2. Set Cluster Name to test.
  3. Set your Root Password and note it for future use.
  4. Change the number of TiDB / TiKV nodes and click Create.

Your TiDB Cloud cluster will be created in approximately 5 to 10 minutes.

Connect TiDB Cloud to AWS VPC

Connect your TiDB Cloud environment to AWS VPC through VPC peering.

  1. Go to your TiDB Cloud console and click Network Access.
  2. Click VPC Peering.
  3. Click Add and configure VPC peering to your AWS VPC.
  4. Select AWS as the cloud provider.
  5. Input the AWS account ID, VPC ID, and its CIDR. (You can find this information in the AWS console.)
  6. Click Initialize.

Accept the invitation in AWS

VPC peering has been initialized, but there’s no peering ID yet because you haven’t accepted the invitation in your AWS account.

  1. In your AWS console, go to the VPC page.
  2. In the left panel, click Peering connections.
  3. Find the VPC peering request from TiDB Cloud.
  4. Click Accept request.
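The same acceptance can be sketched with boto3's EC2 client, assuming its `describe_vpc_peering_connections` and `accept_vpc_peering_connection` calls. The two helper names are hypothetical. Review the returned IDs before accepting anything, because the filter returns every pending request in the region.

```python
def pending_peering_ids(ec2):
    """Return IDs of VPC peering requests that are waiting to be accepted."""
    response = ec2.describe_vpc_peering_connections(
        Filters=[{"Name": "status-code", "Values": ["pending-acceptance"]}]
    )
    return [c["VpcPeeringConnectionId"] for c in response["VpcPeeringConnections"]]


def accept_peering(ec2, peering_id):
    """Accept one peering request and return the status code AWS reports."""
    response = ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)
    return response["VpcPeeringConnection"]["Status"]["Code"]


# With AWS credentials configured:
#   import boto3
#   ec2 = boto3.client("ec2")
#   for peering_id in pending_peering_ids(ec2):
#       print(peering_id)  # confirm this is the TiDB Cloud request, then accept it
```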

Verify the VPC peering status

Check that the VPC peering status is already active.

  1. Return to your TiDB Cloud console.
  2. Click Network Access.
  3. Click VPC Peering. You should see that the Status has changed to active.

Now you have the two VPCs connected, but Glue workers can’t yet access the TiDB Cloud cluster. To fully grant access, you need to add two more networking configurations on the AWS side.

Add networking configurations

In your AWS VPC’s route tables, add routing to the VPC peering CIDR so that the VPC router knows where to send the traffic when its target is the TiDB Cloud’s CIDR.

  1. Go to the VPC page in your AWS console. In the left panel, click Route Tables.
  2. To enter the detail page, click your route table’s ID.
  3. On the right side of the screen, click Edit routes. Then click Add route.
  4. Set Destination to your TiDB Cloud CIDR.
  5. Set Target to the ID of your VPC peering connection. (It is prefixed with pcx-.)
  6. Click Save changes.
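For reference, the route added above corresponds to a single EC2 `create_route` call in boto3. This is a sketch only; the helper name and all identifiers are placeholders.

```python
def peering_route_params(route_table_id, tidb_cloud_cidr, peering_id):
    """Build keyword arguments for EC2 create_route targeting a VPC peering."""
    if not peering_id.startswith("pcx-"):
        raise ValueError("VPC peering connection IDs start with 'pcx-'")
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": tidb_cloud_cidr,
        "VpcPeeringConnectionId": peering_id,
    }


route = peering_route_params(
    "rtb-0123456789abcdef0", "10.250.0.0/16", "pcx-0123456789abcdef0"
)

# With AWS credentials configured:
#   import boto3
#   boto3.client("ec2").create_route(**route)
```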

Allow inbound traffic

Next, in the glue_eni security group, allow all inbound traffic from the TiDB Cloud CIDR, so that the security group allows the traffic from Glue workers to the TiDB Cloud database.

  1. Go to the VPC page in your AWS console. In the left panel, click Security Groups.
  2. To enter the detail page, find glue_eni and click its ID.
  3. Click Edit inbound rules. Then click Add rule.
  4. Set Type to All traffic.
  5. Set Source to Custom.
  6. In the input box next to Source, enter your TiDB Cloud’s CIDR.
  7. Click Save rules.
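Expressed as parameters for boto3's `authorize_security_group_ingress`, the "All traffic" rule uses protocol `-1` and a CIDR source. The helper name, group ID, and CIDR below are illustrative assumptions.

```python
def allow_all_from_cidr(group_id, cidr):
    """Ingress rule allowing every protocol and port from one CIDR block."""
    return {
        "GroupId": group_id,
        "IpPermissions": [
            {
                "IpProtocol": "-1",  # -1 stands for all protocols and ports
                "IpRanges": [{"CidrIp": cidr, "Description": "TiDB Cloud CIDR"}],
            }
        ],
    }


inbound = allow_all_from_cidr("sg-0123456789abcdef0", "10.250.0.0/16")

# With AWS credentials configured:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(**inbound)
```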

The network setup is done. Once you know the endpoint of the TiDB cluster, you will be able to connect to it.

Get the endpoint

Go to your TiDB Cloud console and open the detail page of the TiDB cluster you just created.

  1. Click Connect.
  2. Choose VPC Peering and click Creating Endpoint.

Make a note of the endpoint because you’ll need this information to create the Data Catalog in the next section.

Create the Data Catalog

AWS Glue can manage TiDB’s metadata, but it needs to know where to look for the data. That’s why you’ll need to configure a database, a connection, and a crawler.

Add a database

  1. Go to the AWS Glue console.
  2. In the left panel, click Database.
  3. Click Add database.
  4. Set Database name to tidb.
  5. Click Create.

Add a connection

  1. In the left panel, click Connections.
  2. Click Add connection.
  3. Set Connection name to tidb.
  4. Set the Connection type to JDBC.
  5. Click Next.
  6. Configure the JDBC URL in the following format, replacing the endpoint placeholder with the endpoint you noted in the previous section:

    jdbc:mysql://[tidb cloud endpoint]:4000/test



  7. Set Username to root.
  8. Set Password to your TiDB Cloud cluster password.
  9. Choose the VPC and Subnet.
  10. Select glue_eni as the security group.
  11. Finish the creation flow.
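The connection settings above map onto the `ConnectionInput` structure of the Glue `create_connection` API. The sketch below assumes boto3; the helper names, the endpoint host, and the subnet, security group, and Availability Zone values are placeholders.

```python
def tidb_jdbc_url(endpoint, database="test", port=4000):
    """Build the JDBC URL in the format used by this tutorial."""
    return f"jdbc:mysql://{endpoint}:{port}/{database}"


def tidb_connection_input(endpoint, password, subnet_id, security_group_id, availability_zone):
    """Build the ConnectionInput argument for Glue create_connection."""
    return {
        "Name": "tidb",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": tidb_jdbc_url(endpoint),
            "USERNAME": "root",
            "PASSWORD": password,
        },
        # Network placement of the Glue worker, including the glue_eni group.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": [security_group_id],
            "AvailabilityZone": availability_zone,
        },
    }


connection = tidb_connection_input(
    "tidb.example.internal",  # hypothetical endpoint
    "your-root-password",
    "subnet-0123456789abcdef0",
    "sg-0123456789abcdef0",
    "us-west-2a",
)

# With AWS credentials configured:
#   import boto3
#   boto3.client("glue").create_connection(ConnectionInput=connection)
```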

Test the connection

To make sure your setup is correct, test the connection:

  1. In the connection list, select the connection you created.
  2. Click Test connection.
  3. You should see a green note in the console: “tidb connected successfully to your instance.”

Create a crawler

The Glue crawler crawls the metadata via the database connection. To create a crawler:

  1. In the left panel, click Crawlers.
  2. Click Add crawler.
  3. Set Crawler name to tidb.
  4. Keep the default values for Crawler source type and Repeat crawls of S3 data stores, and click Next.
  5. Set Choose a data store to JDBC.
  6. Set Connection to tidb.
  7. Set Include path to test/%.
  8. Click Next.
  9. Set IAM role to glue_iam.
  10. Click Next.
  11. Set Database to tidb.
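The crawler settings above correspond to the parameters of the Glue `create_crawler` API. Here is a minimal sketch, assuming boto3; the helper name and the optional `schedule` argument (with an example cron expression) are illustrative additions.

```python
def tidb_crawler_params(schedule=None):
    """Build keyword arguments for Glue create_crawler using the names above."""
    params = {
        "Name": "tidb",
        "Role": "glue_iam",
        "DatabaseName": "tidb",
        # JDBC target: the tidb connection, every table in the test database.
        "Targets": {"JdbcTargets": [{"ConnectionName": "tidb", "Path": "test/%"}]},
    }
    if schedule is not None:
        # Glue cron syntax, for example "cron(0 2 * * ? *)" for 02:00 UTC daily.
        params["Schedule"] = schedule
    return params


crawler = tidb_crawler_params()

# With AWS credentials configured:
#   import boto3
#   boto3.client("glue").create_crawler(**crawler)
```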

Now all Glue setups are ready—you have a database, a connection, and a crawler. Next, you’ll need some test data to run the crawler and see what happens.

Create test data in the TiDB Cloud cluster

Use TiDB Cloud’s web shell feature to insert the test data. With this approach, you won’t need to create extra EC2 instances.

  1. Go to your TiDB Cloud console.
  2. Go to the TiDB cluster you created.
  3. Click Connect.
  4. Select Web SQL Shell.
  5. Click Open SQL Shell.

Run the following queries one by one to create two tables in the test database.

    USE test;
    CREATE TABLE t1 (a INT);
    CREATE TABLE t2 (id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT, b VARCHAR(200) NOT NULL);

Leave the web SQL shell open so that you can come back later to manipulate the schema and test how the data catalog picks up schema changes.

Check the Data Catalog

Run the crawler

Run the crawler to collect metadata from the TiDB Cloud cluster.

  1. Go to the AWS Glue console and choose Crawlers.
  2. Select the tidb crawler.
  3. Click the Run crawler button.

This example runs the crawler manually so you can see how it works step by step. In production, you can set it up to run on a schedule so that it picks up your metadata changes automatically.

After about two minutes, the crawler finishes its task.
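To wait for completion from a script rather than watching the console, one option is to start the crawler and poll its state. The sketch below assumes boto3's Glue `start_crawler` and `get_crawler` calls and treats the `READY` state as "finished"; the function name, polling interval, and timeout are arbitrary choices.

```python
import time


def run_crawler_and_wait(glue, name, poll_seconds=15, timeout_seconds=600):
    """Start a crawler, then poll until it returns to the READY state."""
    glue.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":  # the crawler is idle again, so the run has ended
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name!r} did not finish in {timeout_seconds}s")


# With AWS credentials configured:
#   import boto3
#   run_crawler_and_wait(boto3.client("glue"), "tidb")
```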

Verify the results

Go to the tables and check the synchronization results. In the left panel, under Databases, click Tables. The two test tables you created are now in the Data Catalog.

If you click into table t2, you’ll see that the two columns you created were recorded correctly.

You can add comments to the two columns in production to explain what these fields are for.
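The same check can be scripted by listing the tables in the tidb catalog database. This is a sketch, assuming boto3's Glue `get_tables` call and its `TableList` / `StorageDescriptor.Columns` response fields; the function name is hypothetical, and the table names the crawler assigns are not assumed here.

```python
def catalog_columns(glue, database):
    """Map each catalog table name to its list of (column name, type) pairs."""
    tables = {}
    kwargs = {"DatabaseName": database}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [(c["Name"], c.get("Type")) for c in columns]
        token = page.get("NextToken")
        if not token:  # no further pages
            return tables
        kwargs["NextToken"] = token


# With AWS credentials configured:
#   import boto3
#   for name, columns in catalog_columns(boto3.client("glue"), "tidb").items():
#       print(name, columns)
```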

To add a new column to table t2, go back to TiDB Cloud’s web SQL shell and run the following query.

    ALTER TABLE t2 ADD COLUMN c INT NOT NULL;

Now return to the AWS Glue console and run the tidb crawler again.

Afterwards, check the table details again, and you’ll see that a new column is added. The comments on the old columns remain unchanged.

On the top right, click Compare versions to display the table versions and their differences.
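A comparable diff can be computed from the table definitions that the Glue `get_table_versions` API returns, where each version carries a `Table` with its columns. The helper names below are hypothetical, and the example input is hand-written to mirror the t2 change rather than real catalog output.

```python
def column_names(table):
    """Return the column names recorded in a Glue table definition."""
    return [c["Name"] for c in table["StorageDescriptor"]["Columns"]]


def diff_versions(older, newer):
    """List the columns added and removed between two table definitions."""
    old, new = column_names(older), column_names(newer)
    return {
        "added": [n for n in new if n not in old],
        "removed": [n for n in old if n not in new],
    }


# Hand-written stand-ins for two versions of t2.
v0 = {"StorageDescriptor": {"Columns": [{"Name": "id"}, {"Name": "b"}]}}
v1 = {"StorageDescriptor": {"Columns": [{"Name": "id"}, {"Name": "b"}, {"Name": "c"}]}}
print(diff_versions(v0, v1))  # {'added': ['c'], 'removed': []}

# With AWS credentials configured, the inputs would come from:
#   versions = glue.get_table_versions(DatabaseName="tidb", TableName=...)["TableVersions"]
#   and each entry's "Table" field.
```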

Clean up the test environment

Make sure to clean up the test environment so that you don’t get a surprise bill:

  1. Delete the TiDB Cloud cluster.
  2. Delete the Glue crawler, connection, and database.
  3. Delete the glue_eni security group.
  4. Delete the glue_iam IAM role.
  5. Delete VPC peering and the route table rule.
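Part of the cleanup could be scripted. The sketch below covers only the Glue resources and the IAM role, assuming boto3's `delete_crawler`, `delete_connection`, `delete_database`, `detach_role_policy`, and `delete_role` calls. The function names are hypothetical, and the TiDB Cloud cluster, security group, VPC peering, and route still need to be removed separately.

```python
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def delete_glue_resources(glue, name="tidb"):
    """Delete the crawler, connection, and catalog database created earlier."""
    glue.delete_crawler(Name=name)
    glue.delete_connection(ConnectionName=name)
    glue.delete_database(Name=name)


def delete_glue_role(iam, role_name="glue_iam"):
    """Detach the managed policy, then delete the role."""
    # IAM refuses to delete a role that still has managed policies attached.
    iam.detach_role_policy(RoleName=role_name, PolicyArn=GLUE_SERVICE_POLICY_ARN)
    iam.delete_role(RoleName=role_name)


# With AWS credentials configured:
#   import boto3
#   delete_glue_resources(boto3.client("glue"))
#   delete_glue_role(boto3.client("iam"))
```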

Summary

TiDB Cloud and the AWS Glue Data Catalog can work seamlessly together without any customization. This tutorial treated TiDB Cloud as a normal JDBC connection, and it worked out well. Using the networking configuration shown in this tutorial to connect TiDB Cloud and AWS Glue, you’ll be able to automatically synchronize all metadata changes from your TiDB Cloud cluster to AWS Glue. Also, you can use the AWS Glue Data Catalog to annotate the metadata and manage metadata versions.

Ready to give TiDB Cloud a try? TiDB Cloud Developer Tier is now available. You can run a TiDB cluster for free for one year on Amazon Web Services. And make sure to follow us on Twitter to stay updated on TiDB Cloud news!
