• notice
  • Congratulations on the launch of the Sought Tech site

Deep paging query optimization of MySQL tens of millions of databases, rejecting online failures!

In the process of optimizing the project code, a problem of deep paging of tens of millions of data was found. The reason is this

There is a consumables MCS_PROD table in the library, which is assembled into a single consumable product within the system by synchronizing the multi-dimensional data of the external data, and finally synchronized to the ES search engine

The MySQL synchronization ES process is as follows:

  1. Trigger synchronization in the form of timed tasks, such as the time frequency of half a day or a day

  2. The form of synchronization is incremental synchronization. According to the mechanism of update time, for example, the first synchronization query >= 1970-01-01 00:00:00.0

  3. Record the maximum update time for storage, and the next update synchronization is based on this condition

  4. Get data in pagination, add one to the current page number, and loop to the last page

The problem also arises here. The deeper the MySQL query paging OFFSET, the worse the performance . It is estimated that the online MCS_PROD table records about 1000w.

At 10 entries per page, the OFFSET value will drag down query performance and form a "performance abyss"

There are two ways to optimize the synchronization class code for this problem:

  1. Optimized with cursor and streaming solutions

  2. Optimize deep paging performance, the article focuses on this topic

The article directory is as follows:

  • Software and hardware description

  • Rediscover MySQL pagination

  • Deep paging optimization

    • Subquery optimization

    • delayed association

    • bookmark record

  • ORDER BY giant pit, tread carefully

    • ORDER BY index invalidation example

  • Conclusion

Software and hardware description

MySQL VERSION

mysql> select version();+-----------+| version() |+-----------+| 5.7.30    |+-----------+1 row in set (0.01 sec)

Table structure description

Drawing on the structure of the company table, the fields, lengths and names have been deleted

mysql> DESC MCS_PROD;+-----------------------+--------------+------+-----+---------+----------------+| Field                 | Type         | Null | Key | Default | Extra          |+-----------------------+--------------+------+-----+---------+----------------+| MCS_PROD_ID           | int(11)      | NO   | PRI | NULL    | auto_increment || MCS_CODE              | varchar(100) | YES  |     |         |                || MCS_NAME              | varchar(500) | YES  |     |         |                || UPDT_TIME             | datetime     | NO   | MUL | NULL    |                |+-----------------------+--------------+------+-----+---------+----------------+4 rows in set (0.01 sec)

Through the test, the classmates helped to create about 500w of data

mysql> SELECT COUNT(*) FROM MCS_PROD;+----------+| count(*) |+----------+|  5100000 |+----------+1 row in set (1.43 sec)

The SQL statement is as follows

Because the function needs to meet the method of incremental pull , there will be conditional query of data update time, and related query sorting (there are pits here)

SELECT
	MCS_PROD_ID,
	MCS_CODE,
	MCS_NAME,
	UPDT_TIMEFROM
	MCS_PRODWHERE
	UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY UPDT_TIMELIMIT xx, xx

Rediscover MySQL pagination

The LIMIT clause can be used to force the SELECT statement to return the specified number of records. LIMIT takes one or two numeric arguments, which must be an integer constant

If two parameters are given, the first parameter specifies the offset of the first returned record line, and the second parameter specifies the maximum number of returned record lines

As a simple example, analyze the SQL query process and understand why the deep paging performance is poor

mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE (UPDT_TIME >= '1970-01-01 00:00:00.0') ORDER BY UPDT_TIME LIMIT 100000, 1;+-------------+-------------------------+------------------+---------------------+| MCS_PROD_ID | MCS_CODE                | MCS_NAME         | UPDT_TIME           |+-------------+-------------------------+------------------+---------------------+|      181789 | XA601709733186213015031 | 尺、橈骨LC-DCP骨板 | 2020-10-19 16:22:19 |+-------------+-------------------------+------------------+---------------------+1 row in set (3.66 sec)mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE (UPDT_TIME >= '1970-01-01 00:00:00.0') ORDER BY UPDT_TIME LIMIT 100000, 1;+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref  | rows    | filtered | Extra                 |+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+|  1 | SIMPLE      | MCS_PROD | NULL       | range | MCS_PROD_1    | MCS_PROD_1 | 5       | NULL | 2296653 |   100.00 | Using index condition |+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+1 row in set, 1 warning (0.01 sec)

Briefly describe the execution process of the above SQL:

  1. First query the table MCS_PROD, filter the UPDT_TIME condition, query the display column (involving the return table operation) for sorting and LIMIT

  2. LIMIT 100000, 1 means scan 100001 rows that satisfy the condition, then throw away the first 100000 rows

MySQL spends a lot of random I/O to query the data of the clustered index in the return table, and the data of these 100,000 random I/O queries will not appear in the result set

If the system concurrency is a little higher, and each query scans more than 100,000 rows, the performance will definitely be worrying. In addition, the deeper the LIMIT paging OFFSET, the worse the performance (emphasized many times)

Figure 1 Data is for reference only

Deep paging optimization

There are roughly three common strategies for MySQL deep paging optimization:

  1. Subquery optimization

  2. delayed association

  3. bookmark record

The above three points can greatly improve the query efficiency. The core idea is to let MySQL scan as few pages as possible , obtain the records that need to be accessed, and then return to the original table according to the associated columns to query the required columns.

Subquery optimization

The subquery deep paging optimization statement is as follows:

mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID >= ( SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) LIMIT 1;+-------------+-------------------------+------------------------+| MCS_PROD_ID | MCS_CODE                | MCS_NAME               |+-------------+-------------------------+------------------------+|     3021401 | XA892010009391491861476 | 金屬解剖型接骨板T型接骨板A |+-------------+-------------------------+------------------------+1 row in set (0.76 sec)mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID >= ( SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) LIMIT 1;+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+--------------------------+| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref  | rows    | filtered | Extra                    |+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+--------------------------+|  1 | PRIMARY     | MCS_PROD | NULL       | range | PRIMARY       | PRIMARY    | 4       | NULL | 2296653 |   100.00 | Using where              ||  2 | SUBQUERY    | m1       | NULL       | range | MCS_PROD_1    | MCS_PROD_1 | 5       | NULL | 2296653 |   100.00 | Using where; Using index |+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+--------------------------+2 rows in set, 1 warning (0.77 sec)

According to the execution plan, the subquery table m1 query uses the index. First , the primary key ID of the clustered index is obtained on the index, which saves the table return operation , and then the second query directly searches 10 more according to the ID of the first query.

Figure 2 Data is for reference only

delayed association

The "delayed associative" deep paging optimization statement is as follows:

mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD INNER JOIN (SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) AS  MCS_PROD2 USING(MCS_PROD_ID);+-------------+-------------------------+------------------------+| MCS_PROD_ID | MCS_CODE                | MCS_NAME               |+-------------+-------------------------+------------------------+|     3021401 | XA892010009391491861476 | 金屬解剖型接骨板T型接骨板A |+-------------+-------------------------+------------------------+1 row in set (0.75 sec)mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD INNER JOIN (SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) AS  MCS_PROD2 USING(MCS_PROD_ID);+----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+---------+----------+--------------------------+| id | select_type | table      | partitions | type   | possible_keys | key        | key_len | ref                   | rows    | filtered | Extra                    |+----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+---------+----------+--------------------------+|  1 | PRIMARY     | <derived2> | NULL       | ALL    | NULL          | NULL       | NULL    | NULL                  | 2296653 |   100.00 | NULL                     ||  1 | PRIMARY     | MCS_PROD   | NULL       | eq_ref | PRIMARY       | PRIMARY    | 4       | MCS_PROD2.MCS_PROD_ID |       1 |   100.00 | NULL                     ||  2 | DERIVED     | m1         | NULL       | range  | MCS_PROD_1    | MCS_PROD_1 | 5       | NULL                  | 2296653 |   100.00 | Using where; Using index |+----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+---------+----------+--------------------------+3 rows in set, 1 warning (0.00 sec)

The idea and performance are consistent with sub-query optimization, but it is executed in the form of JOIN

bookmark record

Regarding the LIMIT deep paging problem, the core lies in the OFFSET value, which will cause MySQL to scan a large number of unnecessary rows and discard them

We can first use bookmarks to record the location where the data was taken last time , and then we can start scanning directly from this location next time, which can avoid using OFFEST

Suppose you need to query the first record after 3000000 rows of data, the query can be written like this

mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID < 3000000 ORDER BY UPDT_TIME LIMIT 1;+-------------+-------------------------+---------------------------------+| MCS_PROD_ID | MCS_CODE                | MCS_NAME                        |+-------------+-------------------------+---------------------------------+|         127 | XA683240878449276581799 | 股骨近端-1螺紋孔鎖定板(純鈦)YJBL01 |+-------------+-------------------------+---------------------------------+1 row in set (0.00 sec)mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID < 3000000 ORDER BY UPDT_TIME LIMIT 1;+----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+-------------+| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref  | rows | filtered | Extra       |+----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+-------------+|  1 | SIMPLE      | MCS_PROD | NULL       | index | PRIMARY       | MCS_PROD_1 | 5       | NULL |    2 |    50.00 | Using where |+----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+-------------+1 row in set, 1 warning (0.00 sec)

The benefits are obvious, the query speed is super fast, and the performance will be stable at the millisecond level , considering the performance of other methods

However, this method is also relatively limited, and requires a field similar to continuous self-increment, as well as a continuous concept that the business can accommodate, depending on the situation.

The picture above is the list of files in the Alibaba Cloud OSS Bucket. I dare to guess whether it can be done in the form of bookmark records.

ORDER BY giant pit, tread carefully

The following remarks may break your order by all good YY

Let’s talk about the conclusion first, when the LIMIT OFFSET is too deep, it will invalidate the ORDER BY ordinary index (union, only these indexes are not tested)

mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME,UPDT_TIME FROM MCS_PROD WHERE (UPDT_TIME >= '1970-01-01 00:00:00.0') ORDER BY UPDT_TIME LIMIT 100000, 1;+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref  | rows    | filtered | Extra                 |+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+|  1 | SIMPLE      | MCS_PROD | NULL       | range | MCS_PROD_1    | MCS_PROD_1 | 5       | NULL | 2296653 |   100.00 | Using index condition |+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+1 row in set, 1 warning (0.00 sec)

Let's talk about the ORDER BY execution process first:

  1. Initialize SORT_BUFFER and put it in the four fields of MCS_PROD_ID, MCS_CODE, MCS_NAME, UPDT_TIME

  2. Find the primary key ID that satisfies the conditions from the index UPDT_TIME, return the table to query the four column values and store them in SORT_BUFFER

  3. Continue to query records that meet the UPDT_TIME condition from the index, and continue to step 2

  4. Sort data in SORT_BUFFER by UPDT_TIME

  5. After the sorting is successful, take out the records that meet the LIMIT conditions and return to the client

Sorting by UPDT_TIME may be done in memory, or external sorting may be required, depending on the memory required for sorting and the parameter SORT_BUFFER_SIZE

SORT_BUFFER_SIZE is the memory opened by MySQL for sorting . If the amount of sorted data is less than SORT_BUFFER_SIZE, the sorting will be done in memory. If the amount of data is too large to fit in the memory, the disk will be sorted by temporary files.

Regarding the SORT_BUFFER_SIZE parameter, there are relatively few useful information on the Internet. If you have problems during the test process, you can add WeChat to communicate together

ORDER BY index invalidation example

When OFFSET is 100000, it is known through key Extra that the disk temporary file sorting is not used. At this time, OFFSET is adjusted to 500000.

A cool song to the classmate who wrote this SQL and found Using filesort

mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME,UPDT_TIME FROM MCS_PROD WHERE (UPDT_TIME >= '1970-01-01 00:00:00.0') ORDER BY UPDT_TIME LIMIT 500000, 1;+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                       |+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+|  1 | SIMPLE      | MCS_PROD | NULL       | ALL  | MCS_PROD_1    | NULL | NULL    | NULL | 4593306 |    50.00 | Using where; Using filesort |+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+1 row in set, 1 warning (0.00 sec)

Using filesort means that in addition to the index, additional external sorting actions are required , and the performance will be seriously affected

Therefore, we should combine the corresponding business logic to avoid regular LIMIT OFFSET , and use the #deep paging optimization chapter to modify the corresponding business

Conclusion

Finally, one thing needs to be stated, MySQL itself is not suitable for single table large data volume business

Because MySQL is used in enterprise-level projects, querying against database tables is not a simple condition, there may be more complex joint queries, or there are frequent new or update operations when there is a large amount of data, maintaining indexes or data ACID characteristics are inevitable There is a performance sacrifice

If the data growth of database tables can be expected in the early stage of design, reasonable reconstruction and optimization methods should be conceived, such as ES with query, database and table partition, TiDB and other solutions.

References:

  1. High Performance MySQL Third Edition

  2. "MySQL Practical 45 Lectures"


Tags

Technical otaku

Sought technology together

Related Topic

0 Comments

Leave a Reply

+