快速启动Guide for TiDB HTAP
This guide walks you through the quickest way to get started with TiDB's one-stop solution of Hybrid Transactional and Analytical Processing (HTAP).
Basic concepts
Before using TiDB HTAP, you need to have some basic knowledge aboutTiKV, a row-based storage engine for TiDB Online Transactional Processing (OLTP), andTiFlash, a columnar storage engine for TiDB Online Analytical Processing (OLAP).
- Storage engines of HTAP: The row-based storage engine and the columnar storage engine co-exist for HTAP. Both storage engines can replicate data automatically and keep strong consistency. The row-based storage engine optimizes OLTP performance, and the columnar storage engine optimizes OLAP performance.
- Data consistency of HTAP: As a distributed and transactional key-value database, TiKV provides transactional interfaces with ACID compliance, and guarantees data consistency between multiple replicas and high availability with the implementation of theRaft consensus algorithm. As a columnar storage extension of TiKV, TiFlash replicates data from TiKV in real time according to the Raft Learner consensus algorithm, which ensures that data is strongly consistent between TiKV and TiFlash.
- Data isolation of HTAP: TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.
- MPP computing engine:MPPis a distributed computing framework provided by the TiFlash engine since TiDB 5.0, which allows data exchange between nodes and provides high-performance, high-throughput SQL algorithms. In the MPP mode, the run time of the analytic queries can be significantly reduced.
Steps
In this document, you can experience the convenience and high performance of TiDB HTAP by querying an example table in aTPC-H达taset. TPC-H is a popular decision support benchmark that consists of a suite of business oriented ad-hoc queries with large volumes of data and a high degree of complexity. To experience 22 complete SQL queries using TPC-H, visittidb-bench repoorTPC-Hfor instructions on how to generate query statements and data.
Step 1. Deploy a local test environment
Before using TiDB HTAP, follow the steps in the快速启动Guide for the TiDB Database Platformto prepare a local test environment, and run the following command to deploy a TiDB cluster:
tiup playground
Step 2. Prepare test data
In the following steps, you can create aTPC-H达taset as the test data to use TiDB HTAP. If you are interested in TPC-H, seeGeneral Implementation Guidelines.
Install the test data generation tool by running the following command:
tiup install benchGenerate the test data by running the following command:
tiup bench tpch --sf=1 prepareIf the output of this command shows
Finished
, it indicates that the data is created.Execute the following SQL statement to view the generated data:
SELECTCONCAT(table_schema,'.',table_name)AS 'Table Name', table_rowsAS 'Number of Rows', FORMAT_BYTES(data_length)AS 'Data Size', FORMAT_BYTES(index_length)AS “索引大小”, FORMAT_BYTES(data_length+index_length)AS 'Total' FROMinformation_schema.TABLESWHEREtable_schema= 'test';As you can see from the output, eight tables are created in total, and the largest table has 6.5 million rows (the number of rows created by the tool depends on the actual SQL query result because the data is randomly generated).
+ ---------------+----------------+-----------+------------+-----------+ | TableName|Numberof Rows |Data Size|Index Size|Total| + ---------------+----------------+-----------+------------+-----------+ |test.nation| 25 | 2.44KiB| 0bytes| 2.44KiB| |test.region| 5 | 416bytes| 0bytes| 416bytes| |test.part| 200000 | 25.07MiB| 0bytes| 25.07MiB| |test.supplier| 10000 | 1.45MiB| 0bytes| 1.45MiB| |test.partsupp| 800000 | 120.17MiB| 12.21MiB| 132.38MiB| |test.customer| 150000 | 24.77MiB| 0bytes| 24.77MiB| |test.orders| 1527648 | 174.40MiB| 0bytes| 174.40MiB| |test.lineitem| 6491711 | 849.07MiB| 99.06MiB| 948.13MiB| + ---------------+----------------+-----------+------------+-----------+ 8 rows in set(0.06sec)This is a database of a commercial ordering system. In which, the
test.nation
table indicates the information about countries, thetest.region
table indicates the information about regions, thetest.part
table indicates the information about parts, thetest.supplier
table indicates the information about suppliers, thetest.partsupp
table indicates the information about parts of suppliers, thetest.customer
table indicates the information about customers, thetest.customer
table indicates the information about orders, and thetest.lineitem
table indicates the information about online items.
Step 3. Query data with the row-based storage engine
To know the performance of TiDB with only the row-based storage engine, execute the following SQL statements:
SELECTl_orderkey,SUM( l_extendedprice*(1
-l_discount) )ASrevenue, o_orderdate, o_shippriorityFROMcustomer, orders, lineitemWHEREc_mktsegment=
'BUILDING'
ANDc_custkey=o_custkeyANDl_orderkey=o_orderkeyANDo_orderdate<
DATE
'1996-01-01'
ANDl_shipdate>
DATE
'1996-02-01'
GROUP
BYl_orderkey, o_orderdate, o_shippriorityORDER
BYrevenueDESC, o_orderdate limit10;
This is a shipping priority query, which provides the priority and potential revenue of the highest-revenue order that has not been shipped before a specified date. The potential revenue is defined as the sum ofl_extendedprice * (1-l_discount)
. The orders are listed in the descending order of revenue. In this example, this query lists the unshipped orders with potential query revenue in the top 10.
Step 4. Replicate the test data to the columnar storage engine
After TiFlash is deployed, TiKV does not replicate data to TiFlash immediately. You need to execute the following DDL statements in a MySQL client of TiDB to specify which tables need to be replicated. After that, TiDB will create the specified replicas in TiFlash accordingly.
ALTER
TABLEtest.customerSETTIFLASH REPLICA1;ALTER
TABLEtest.ordersSETTIFLASH REPLICA1;ALTER
TABLEtest.lineitemSETTIFLASH REPLICA1;
To check the replication status of the specific tables, execute the following statements:
SELECT
*
FROMinformation_schema.tiflash_replicaWHERETABLE_SCHEMA=
'test'
andTABLE_NAME=
'customer';SELECT
*
FROMinformation_schema.tiflash_replicaWHERETABLE_SCHEMA=
'test'
andTABLE_NAME=
'orders';SELECT
*
FROMinformation_schema.tiflash_replicaWHERETABLE_SCHEMA=
'test'
andTABLE_NAME=
'lineitem';
In the result of the above statements:
AVAILABLE
indicates whether the TiFlash replica of a specific table is available or not.1
means available and0
means unavailable. Once theAVAILABLE
field becomes1
, this status does not change anymore.PROGRESS
means the progress of the replication. The value is between 0.0 and 1.0. 1 means that the replication progress of the TiFlash replica is complete.
Step 5. Analyze data faster using HTAP
Execute the SQL statements inStep 3again, and you can see the performance of TiDB HTAP.
For tables with TiFlash replicas, the TiDB optimizer automatically determines whether to use TiFlash replicas based on the cost estimation. To check whether or not a TiFlash replica is selected, you can use thedesc
orexplain analyze
statement. For example:
explain analyzeSELECTl_orderkey,SUM( l_extendedprice*(1
-l_discount) )ASrevenue, o_orderdate, o_shippriorityFROMcustomer, orders, lineitemWHEREc_mktsegment=
'BUILDING'
ANDc_custkey=o_custkeyANDl_orderkey=o_orderkeyANDo_orderdate<
DATE
'1996-01-01'
ANDl_shipdate>
DATE
'1996-02-01'
GROUP
BYl_orderkey, o_orderdate, o_shippriorityORDER
BYrevenueDESC, o_orderdate limit10;
If the result of theEXPLAIN
statement showsExchangeSender
andExchangeReceiver
operators, it indicates that the MPP mode has taken effect.
In addition, you can specify that each part of the entire query is computed using only the TiFlash engine. For detailed information, seeUse TiDB to read TiFlash replicas.
You can compare query results and query performance of these two methods.