
Sub-database and sub-table for a MySQL database with hundreds of millions of rows

In the era of the mobile Internet, massive amounts of user data are generated every day, and analyzing user behavior requires statistics over that data. When the data volume is small, database optimization hardly matters; once it grows large, system response slows and TPS plummets, until the service becomes unavailable.

Some people may ask: why not use Oracle? It is true that many developers pay no attention to SQL while coding, handing all performance problems to a DBA to solve through SQL tuning. However, not every project has a DBA, and not every project uses an Oracle database. Besides, with Oracle behind you, solving performance problems under a large data volume is a relatively easy thing.

So, can MySQL support a data volume of 100 million rows? My answer is yes. Most Internet companies build their data storage solutions mainly on MySQL; it is chiefly deep-pocketed state-owned enterprises and banks that run Oracle, with full-time DBAs at their service.

This article takes an actual project as an example and analyzes how to optimize the database layer by layer. The project's background is an enterprise-level unified message processing platform: customer data of 50 million plus records, a message flow of 10 million per minute, and a daily message flow of about 100 million.

Although a single MySQL table can store 1 billion rows, its performance is very poor at that point. Since one table cannot do the job, the way out is to spread the data across multiple places, which is how the sub-database and sub-table schemes came about. At present there are three common schemes: partitioning, sub-database and sub-table, and NoSQL/NewSQL.

In actual projects, a combination of these three schemes is often used. At present, the core data of most systems is stored mainly in an RDBMS, supplemented by NoSQL/NewSQL.


Partitioning scheme

A partitioned table is implemented through multiple related underlying tables, which are also represented by handle objects, so we can also access each partition directly. All partitions of a table use the same storage engine, and the index of a partitioned table is simply an identical index on each underlying table. From the storage engine's perspective, an underlying table is no different from an ordinary table; the engine does not need to know whether it is serving an ordinary table or part of a partitioned one.

This scheme has its merits: it hides the sharding details from users, and even a query whose conditions do not include the sharding column still works (only with mediocre performance). Its shortcomings are just as obvious, though: many resources are capped by a single machine, such as the number of connections and network throughput. How to partition is therefore one of the key decisions in practice.

In our project, take customer information as an example: the customer data volume is 50 million plus, and the project requires saving the customer's bank card binding relationships, certificate binding relationships, and the business information bound to each customer. How should the database be designed for this business? In the first phase of the project, we created a single customer-business binding table that redundantly stores the business information bound by each customer. Its basic structure is roughly as follows:


When querying, the bank card number, the business number, and the certificate number are all indexed; as requirements grew, this table accumulated more than 10 indexes. Moreover, when a customer terminates a contract and then signs again, two rows are saved with different binding states. Suppose we have 50 million customers with 5 business types and an average of 2 cards each: the table would reach an astonishing 500 million rows. In fact, even before the number of users in our system exceeded one million, the table could no longer cope.

MySQL stores table data on disk as files, by default under /mysql/data (you can check via the datadir setting in my.cnf). A MyISAM table corresponds mainly to three files: a .frm file storing the table structure, a .MYD file storing the table data, and a .MYI file storing the table indexes. These files had grown huge, especially the .MYD file, at almost 5 GB. So we performed the first partitioning optimization. MySQL supports four partitioning methods:

  • RANGE partitioning

  • LIST partitioning

  • HASH partitioning

  • KEY partitioning
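As a minimal sketch of RANGE partitioning, the first of the four, here is a hypothetical log table (table and column names are illustrative, not from the project) partitioned by the year of its timestamp:

```sql
-- Hypothetical example: a RANGE-partitioned table split by YEAR(create_time).
CREATE TABLE t_login_log (
    id          BIGINT   NOT NULL,
    create_time DATETIME NOT NULL,
    PRIMARY KEY (id, create_time)   -- the partition column must be part of every unique key
)
PARTITION BY RANGE (YEAR(create_time)) (
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```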


In our project, RANGE and LIST partitioning were not usable: if we partitioned by range or list on the binding number, that number has no actual business meaning and is never used as a query condition. That left HASH partitioning and KEY partitioning, and HASH partitioning only supports partitioning on an integer column (or an expression yielding an integer). Looking at our table structure, we found that no column is of integer type, so how could we partition? We could add a column: a binding-time column of type int, and partition by binding time, so that users bound on the same day land in the same partition. After this optimization, our inserts became much faster, but queries were still very slow. Why? Because queries filter only by bank card number or certificate number, never by time, so each query still makes MySQL search every partition.
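A sketch of this first attempt, with hypothetical table and column names (bind_day stands in for the added int binding-time column):

```sql
-- Sketch: an INT column bind_day (e.g. 20230415) added so HASH partitioning
-- is possible, spreading each day's bindings across 16 partitions.
CREATE TABLE t_customer_bind (
    bind_no  BIGINT      NOT NULL,
    card_no  VARCHAR(19) NOT NULL,
    cert_no  VARCHAR(18) NOT NULL,
    bind_day INT         NOT NULL,   -- yyyymmdd of the binding time
    PRIMARY KEY (bind_no, bind_day)
)
PARTITION BY HASH (bind_day)
PARTITIONS 16;

-- A query filtering only on card_no carries no partition information,
-- so MySQL still has to search every partition:
--   SELECT * FROM t_customer_bind WHERE card_no = '6222...';
```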

Then came the second round of optimization. Since HASH and KEY partitioning want such a column, perhaps we could derive one. Analysis revealed that the string of digits on a bank card hides a secret: card numbers are generally 16 to 19 digits long, so would taking one of those digits as the partition key be feasible? It turns out that within this string there is indeed one digit randomly generated from 0 to 9; its position differs with the card number length, and it is by no means the last digit, which is generally a check digit and not random. Our new design performs KEY partitioning on the bank card number plus this random digit. On every query, the random digit is extracted by calculation and combined with the card number in the query condition, which achieves partition-pruned queries. One thing must be stressed: after partitioning, the indexes created must also be on the partition column, otherwise MySQL will still search all partitions. With lookups by bank card number solved, the next question was how to query the binding relationship by certificate number. As just mentioned, the index must be built on the partition key, or a full table scan results.
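A sketch of the card-number scheme, with hypothetical names; rand_bit stands for the random digit the application extracts from the card number:

```sql
-- Sketch: the random digit is pulled out of the card number into its own
-- column and used as the KEY partition column.
CREATE TABLE t_card_bind (
    card_no  VARCHAR(19) NOT NULL,
    rand_bit TINYINT     NOT NULL,   -- the random digit extracted from card_no
    biz_no   VARCHAR(32) NOT NULL,
    PRIMARY KEY (card_no, rand_bit)  -- index includes the partition column
)
PARTITION BY KEY (rand_bit)
PARTITIONS 10;

-- The application computes rand_bit from card_no and queries both columns,
-- so MySQL can prune the search to a single partition:
--   SELECT * FROM t_card_bind WHERE card_no = '6222...' AND rand_bit = 7;
```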

We created a new table to store the customers' certificate-number binding relationships. Each customer's certificate number is unique, so in the new table the certificate number serves as the primary key. But how to compute the partition key? Customer certificate information is quite varied, including ID card numbers, Hong Kong/Macao/Taiwan passes, motor vehicle driver's licenses, and so on. How do we find a partition key in such disorderly certificate numbers?

To solve this problem, we split the certificate-number binding table in two: one table dedicated to certificates of the ID card type, and another for all other certificate types. In the ID card binding table, we take the birth month embedded in the ID number as the partition key, so the ID numbers of customers born in the same month are stored in the same partition, giving 12 partitions. The other certificate types amount to fewer than 100,000 rows, so that table needs no partitioning at all. This way, every query first picks the right table based on the certificate type, then computes the partition key and queries.
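A sketch of the ID card table under these assumptions (hypothetical names; in an 18-digit Chinese ID number the birth month occupies digits 11-12):

```sql
-- Sketch: ID-card bindings split into 12 partitions by the birth month
-- encoded in the certificate number.
CREATE TABLE t_idcard_bind (
    cert_no     CHAR(18)    NOT NULL,
    birth_month TINYINT     NOT NULL,  -- CAST(SUBSTRING(cert_no, 11, 2) AS UNSIGNED)
    biz_no      VARCHAR(32) NOT NULL,
    PRIMARY KEY (cert_no, birth_month) -- the partition column must appear in the primary key
)
PARTITION BY LIST (birth_month) (
    PARTITION p01 VALUES IN (1),  PARTITION p02 VALUES IN (2),
    PARTITION p03 VALUES IN (3),  PARTITION p04 VALUES IN (4),
    PARTITION p05 VALUES IN (5),  PARTITION p06 VALUES IN (6),
    PARTITION p07 VALUES IN (7),  PARTITION p08 VALUES IN (8),
    PARTITION p09 VALUES IN (9),  PARTITION p10 VALUES IN (10),
    PARTITION p11 VALUES IN (11), PARTITION p12 VALUES IN (12)
);
```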

With this partition design in place, at 20 million stored users the bank card table's data file is split into 10 small files and the certificate table's data file into 12 small files, which solves those two query problems. The remaining question: what about the business number? A customer signs up for multiple businesses, so how do we store them? Here the partitioning scheme no longer fits; we need the sub-table scheme.

Sub-database and sub-table

There are many versions of sub-database sub-table middleware on the Internet; some well-known solutions are:

  • Alibaba's TDDL, DRDS, and Cobar;

  • JD Finance's Sharding-JDBC;

  • the open-source community's MyCAT;

  • 360's Atlas;

  • Meituan's Zebra;

  • and self-developed middleware at other companies such as NetEase, 58, and JD.com.

But however many sub-database sub-table middleware solutions there are, in summary they fall into two types: client mode and proxy mode.


client mode


proxy mode

Whether client mode or proxy mode, the core steps are the same: SQL parsing, rewriting, routing, execution, and result merging. Personally, I prefer the client mode: simple architecture, small performance loss, and low operation and maintenance cost. If you introduce MyCAT or Cobar into a project, their standalone mode cannot guarantee reliability: once the machine goes down, the service becomes unavailable. You then have to bring in HAProxy for a highly available cluster deployment, and to solve HAProxy's own availability problem you also need Keepalived.


We abandoned that approach in the project and adopted Sharding-JDBC instead. Back to the business problem at hand: how to split databases and tables by business type. The first and most important step in sub-database sub-table is the choice of the sharding column; how well it is chosen directly determines whether the whole sub-database sub-table scheme ultimately succeeds.

The choice of the sharding column is strongly tied to the business. In our project scenario, the best sharding column is undoubtedly the business number. Through the business number, a customer's different bound and contracted businesses are stored in different tables, and at query time the request is routed by business number to the corresponding table, further optimizing the SQL.

So far we have discussed database optimization for the customer signing and binding scenario. Now let's talk about the storage scheme for the massive flow data.

Vertical sub-database

With nearly 10 million transactions per minute and nearly 100 million per day, writing and querying efficiently is a big challenge. The approach is still the familiar trio of sub-database, sub-table, and partitioning, plus read-write separation, but this time in a different order: first sub-table, then sub-database, and finally partition.

We split the message flow into tables by business type, so the flow of the same business lands in the same table. After splitting the tables, we split the databases: the flow-related data is stored in a separate database, since this data has high requirements for writes but low requirements for queries and updates, and should be kept apart from the frequently updated data. After splitting the databases, we partition the tables.


This is a vertical sub-database split along business lines. Vertical sub-database means storing tables with low correlation in different databases according to business coupling, so that system resources are fully utilized. Such a split pairs naturally with the application's microservice governance: each microservice uses an independent database. The data of different modules is stored in separate databases, and cross-module joins are no longer possible; where such a query exists, it is solved either by data redundancy or by post-processing in application code.

If a cross-database join really cannot be eliminated, the small table's data is made redundant in the database with the large data volume. For example, if a query on the flow table needs channel information, and the channel information lives in the basic management database, then either the code performs a second query against the channel table in the basic management database, or the channel table is duplicated into the flow database.

Even after splitting off the 100-million-per-day flow data, the single-table volume in the flow database was still too large. So we continued by partitioning each flow table according to certain business rules (usually the query index columns), turning one table into N tables; of course, these changes are imperceptible to the application layer.


Partitioning is generally set up around the query index columns. For flow table A, for instance, queries always filter by mobile phone number and batch number, so when creating partitions we chose the mobile phone number and batch number. With this setup, queries use the index: for each query, MySQL computes from the conditions which partition the data falls in, and retrieves it directly from the corresponding partition table, avoiding a full table scan.
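For example, a KEY partition on those two columns might be declared as follows (a sketch with hypothetical names):

```sql
-- Sketch: a flow table KEY-partitioned on the two columns queries always
-- carry, so each lookup is pruned to one partition.
CREATE TABLE t_a_serial (
    serial_no   BIGINT      NOT NULL,
    mobile      VARCHAR(11) NOT NULL,
    batch_no    VARCHAR(32) NOT NULL,
    create_time DATETIME    NOT NULL,
    PRIMARY KEY (serial_no, mobile, batch_no)  -- includes both partition columns
)
PARTITION BY KEY (mobile, batch_no)
PARTITIONS 20;

-- A query supplying both columns is routed to a single partition:
--   SELECT * FROM t_a_serial WHERE mobile = '138...' AND batch_no = 'B001';
```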

With over 100 million rows of flow per day, we of course also had to migrate data to history tables. The customer required flow data to be kept for half a year, and some key flows for a full year. Deleting the data was out of the question, and so was running away, although I did feel the urge to delete the data and run. In practice the data cannot simply be deleted anyway: DELETE's clumsy performance ruled it out first, and TRUNCATE is not much better suited. We adopted a more ingenious method, with the following steps:

  1. Create temporary table 1, identical to the original table: create table test_a_serial_1 like test_a_serial;

  2. Rename the original table to temporary table 2: alter table test_a_serial rename test_a_serial_{date};

  3. Rename temporary table 1 to the original table's name: alter table test_a_serial_1 rename test_a_serial; At this point the live flow table is a brand-new empty table that keeps receiving the current day's flow, while temporary table 2 holds yesterday's data plus a sliver of today's. The {date} in temporary table 2's name is yesterday's date, obtained by calculation; a temporary table 2 stamped with yesterday's date is produced every day, each holding about 10 million rows.

  4. The migration of historical data from the current day's table into yesterday's flow table is handled entirely by a scheduled task, usually triggered shortly after 12:00 midnight. The rename completes in seconds, but a few rows dated yesterday may still land in the current day's table. So finally we insert those historical flow rows from the current day's table into yesterday's table: insert into test_a_serial_{date}(column1, column2, ...) select column1, column2, ... from test_a_serial where LEFT(create_time, 8) = '{date}'; commit;

This completes the migration of the flow data.
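Put together, the daily rotation sketched above might look like this. The table names follow the article; the date that the scheduled job would compute is hardcoded here, and create_time is assumed to be stored as a 'yyyymmddHHmmss' string, which is what the LEFT(create_time, 8) comparison implies:

```sql
-- Step 1: create an empty clone of the live flow table.
CREATE TABLE test_a_serial_1 LIKE test_a_serial;

-- Steps 2 and 3 in one atomic statement: archive the full table under
-- yesterday's date and swap the empty clone in as the live table, so there
-- is no window in which test_a_serial does not exist.
RENAME TABLE test_a_serial   TO test_a_serial_20230414,
             test_a_serial_1 TO test_a_serial;

-- Step 4: move the few rows dated yesterday out of the new live table.
INSERT INTO test_a_serial_20230414
SELECT * FROM test_a_serial
WHERE LEFT(create_time, 8) = '20230414';
COMMIT;
```

Using a single multi-table RENAME TABLE rather than two ALTER TABLE ... RENAME statements is a small design choice: MySQL performs the renames atomically, so writers never see a missing table between the two steps.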

Per business requirements, some business data only needs to be kept for half a year; beyond that it is deleted. At deletion time, the flows older than half a year can be filtered by the {date} in the table name and those tables dropped directly.

Over half a year, a single business flow table accumulates more than 180 tables, each with 20 partitions, so how do we query? Since our project does not demand particularly high real-time performance for flow queries, we route the query according to the requested time interval.

In general, based on the time interval the customer selects, the query conditions are applied to each table within that interval, the results are collected into a temporary table, and the temporary table is then queried for the final result.
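A sketch of such a routed query over two daily tables (hypothetical names, dates, and conditions):

```sql
-- Sketch: the application resolves the customer's date range to the daily
-- tables it covers, unions the per-table results into a temporary table,
-- then queries the temporary table for the final result.
CREATE TEMPORARY TABLE tmp_serial_result AS
    SELECT * FROM test_a_serial_20230413 WHERE mobile = '13800000000'
    UNION ALL
    SELECT * FROM test_a_serial_20230414 WHERE mobile = '13800000000';

SELECT * FROM tmp_serial_result ORDER BY create_time;
```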

The above is the database-level optimization for a large data volume. For a table receiving 100 million rows per day, after splitting, each table partition holds about 5 million rows. Even with this design we still faced other problems, such as flow statistics: with such a data volume and more than 100 statistical dimensions in the project, even running the statistics 100 times a day would be a struggle. We solved this with real-time computation, involving stream computing, message queues, caching middleware, and more. Stay tuned!

