Sub-table handling: design ideas and implementation

Author: heiyeluren (passers-by in the dark night)
Blog: http://blog.csdn.net/heiyeshuwu
Time: 2007-01-19 01:44:20
Splitting a table into sub-tables is a fairly popular idea at present. Especially under heavy load, sub-tables are a good way to spread the pressure on the database.
First, we need to understand why tables should be split and what the benefits are. Let's briefly look at how a database executes a SQL statement:
Receive SQL --> put it into the SQL execution queue --> parse the SQL with the analyzer --> read or modify data according to the analysis result --> return the result
Of course, this flow is not necessarily accurate; it is only my own understanding. So where is a problem most likely to occur in this process? If the previous SQL statement has not finished, the following ones cannot run, because the data table file must be locked to guarantee data integrity. There are two kinds of lock: shared and exclusive. While a shared lock is held, other threads can still read the data file, but they are not allowed to modify it. An exclusive lock means that one thread owns the whole file and no other thread can access it.
MyISAM, generally regarded as the fastest storage engine in MySQL, uses table-level locking. Once the table is locked, nothing else can access the data file, and the next operation is accepted only after the previous one completes. This situation, where queued operations cannot run because an earlier one has not finished, is called blocking, commonly known as "locking the table".
What are the direct consequences of locking the table? A large number of SQL statements cannot run immediately and must wait until every statement ahead of them in the queue has finished. Statements that cannot run return no result or are seriously delayed, which hurts the user experience.
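The effect of the queue can be sketched with a little arithmetic. This is a simplified model, not a measurement: it assumes every statement needs the table lock exclusively, and the helper name `queue_waits` and the run times are made up for illustration.

```php
<?php
// Hypothetical helper: given the run time (ms) of each queued statement,
// return how long each one waits while the statements ahead of it hold
// the table lock.
function queue_waits(array $durationsMs): array
{
    $waits = [];
    $elapsed = 0;
    foreach ($durationsMs as $d) {
        $waits[] = $elapsed;   // waits for everything queued before it
        $elapsed += $d;
    }
    return $waits;
}

// One table: a slow 500 ms statement blocks the two fast ones behind it.
$single = queue_waits([500, 10, 10]);   // [0, 500, 510]

// Two sub-tables: the slow statement only blocks its own table's queue.
$tableA = queue_waits([500]);           // [0]
$tableB = queue_waits([10, 10]);        // [0, 10]

print_r($single);
print_r($tableB);
```

In this model the two fast statements wait 500 ms and 510 ms behind the slow one on a single table, but only 0 ms and 10 ms once they are routed to a different table.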
This matters most for frequently used tables, such as the user information table in an SNS system or the post table in a forum system, which receive a large number of visits. To fetch data quickly and return it to the user, you must use some technique to solve this problem, and that is the sub-table technique I want to talk about today.
Table splitting, as the name suggests, takes a table that stores one type of data and splits it into several tables that are stored separately. When data is read, different users access different tables and do not conflict with each other, which reduces the chance of table locking. For example, suppose user data is currently split into two tables, user_1 and user_2, holding different users: user_1 stores the first 100,000 users and user_2 stores the next 100,000. If the two users heiyeluren1 and heiyeluren2 are queried at the same time, their records are read from different tables, which reduces the possibility of locking the table.
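The user_1 / user_2 example above could be routed roughly as follows. This is only a sketch under assumptions: it presumes users have numeric IDs starting at 1, and the function name `user_table_for` is hypothetical.

```php
<?php
// Hypothetical helper: map a numeric user ID (starting at 1) to its
// sub-table, with a fixed number of users per table.
function user_table_for(int $userId, int $usersPerTable = 100000): string
{
    if ($userId < 1) {
        throw new InvalidArgumentException('user ID must be >= 1');
    }
    // IDs 1..100000 -> user_1, 100001..200000 -> user_2, ...
    $index = intdiv($userId - 1, $usersPerTable) + 1;
    return 'user_' . $index;
}

echo user_table_for(1), "\n";       // user_1
echo user_table_for(100000), "\n";  // user_1
echo user_table_for(100001), "\n";  // user_2
```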
I have not actually tried the two sub-table methods described below, so I cannot guarantee that they work exactly as written; I only offer them as design ideas. The sub-table examples below are built around a Tieba (post bar) system. (If you have not used Tieba, please Google it.)
II. Sub-table processing based on a base table
The general idea of this method is to keep one main table that stores all the basic information. Whenever an item needs to find the table that stores its data, it looks up the corresponding table name and other details in this base table, and can then access that table directly. If the base table is not fast enough, the whole base table can be kept in a cache or in memory for convenient and efficient lookups.
Based on how Tieba works, we construct the following 3 tables:
1. Board table: stores information about the boards in Tieba
2. Topic table: stores the topics in each board, used for browsing
3. Reply table: stores the original content of each topic and the replies to it
The board table contains the following fields:
Field description   Field name    Type
Board ID            board_id      int(10)
Board name          board_name    char(50)
Sub-table ID        table_id      smallint(5)
Created time        created       datetime
The topic table contains the following fields:
Field description   Field name    Type
Topic ID            topic_id      int(10)
Topic name          topic_name    char(255)
Board ID            board_id      int(10)
Created time        created       datetime
The reply table contains the following fields:
Field description   Field name    Type
Reply ID            reply_id      int(10)
Reply content       reply_text    text
Topic ID            topic_id      int(10)
Board ID            board_id      int(10)
Created time        created       datetime
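As a rough sketch, the topic and reply structures above might be turned into per-sub-table DDL like this. The column names and types come from the field lists; the primary keys, the secondary indexes, the ENGINE clause and the function name `subtable_ddl` are assumptions added for illustration.

```php
<?php
// Build the CREATE TABLE statements for one sub-table ID.
// Keys, indexes and ENGINE=MyISAM are assumptions, not part of the
// original field lists.
function subtable_ddl(int $tableId): array
{
    $topic = <<<SQL
CREATE TABLE topic_{$tableId} (
  topic_id   INT(10) NOT NULL,
  topic_name CHAR(255) NOT NULL,
  board_id   INT(10) NOT NULL,
  created    DATETIME NOT NULL,
  PRIMARY KEY (topic_id),
  KEY idx_board (board_id)
) ENGINE=MyISAM
SQL;

    $reply = <<<SQL
CREATE TABLE reply_{$tableId} (
  reply_id   INT(10) NOT NULL,
  reply_text TEXT NOT NULL,
  topic_id   INT(10) NOT NULL,
  board_id   INT(10) NOT NULL,
  created    DATETIME NOT NULL,
  PRIMARY KEY (reply_id),
  KEY idx_topic (topic_id)
) ENGINE=MyISAM
SQL;

    return ['topic' => $topic, 'reply' => $reply];
}

$ddl = subtable_ddl(1);
echo $ddl['topic'], ";\n\n", $ddl['reply'], ";\n";
```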
The above is the table structure of the whole Tieba system. The relationship between the three tables is:
Board --> multiple topics
Topic --> multiple replies
In other words, the table file sizes relate as follows:
Board table file < topic table file < reply table file
So we can basically conclude that the topic table and the reply table need to be split, in order to improve the speed and performance of data retrieval, queries and changes.
Looking at the table structure above, you will notice that the board table stores a "table_id" field. This field records which sub-tables hold the topics and replies of a board.
For example, suppose we have a board called "PHP" whose board_id is 1 and whose sub-table ID is also 1. Its record is:
board_id | board_name | table_id | created
1 | PHP | 1 | 2007-01-19 00:30:12
Correspondingly, to fetch all topics in the "PHP" board, we build the name of the table that stores its topics from the table_id saved in the board table. For example, if the prefix of our topic tables is "topic_", the topic table for "PHP" is "topic_1", and we execute:
SELECT * FROM topic_1 WHERE board_id=1 ORDER BY topic_id DESC LIMIT 10
In this way we get the list of topics under this board. To view the replies under a topic, we again use the "table_id" saved in the board table. For example, if the prefix of our reply tables is "reply_", we can build the query for the replies to the topic with ID 1 in the "PHP" board:
SELECT * FROM reply_1 WHERE topic_id=1 ORDER BY reply_id DESC LIMIT 10
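The lookup described above could be sketched like this. The `$boards` array stands in for the board table (in a real system these rows would come from MySQL), and the helper name `subtables_for` is hypothetical.

```php
<?php
// Stand-in for the board table, keyed by board_id.
$boards = [
    1 => ['board_id' => 1, 'board_name' => 'PHP', 'table_id' => 1,
          'created' => '2007-01-19 00:30:12'],
];

// Hypothetical helper: resolve the topic/reply table names for a board
// from the table_id stored in its base-table row.
function subtables_for(array $boards, int $boardId): array
{
    if (!isset($boards[$boardId])) {
        throw new OutOfBoundsException("unknown board $boardId");
    }
    // Casting to int keeps the generated table name safe to embed in SQL.
    $tableId = (int) $boards[$boardId]['table_id'];
    return ['topic' => 'topic_' . $tableId, 'reply' => 'reply_' . $tableId];
}

$t = subtables_for($boards, 1);
$topicSql = sprintf(
    'SELECT * FROM %s WHERE board_id=%d ORDER BY topic_id DESC LIMIT 10',
    $t['topic'], 1
);
$replySql = sprintf(
    'SELECT * FROM %s WHERE topic_id=%d ORDER BY reply_id DESC LIMIT 10',
    $t['reply'], 1
);
echo $topicSql, "\n", $replySql, "\n";
```

Adding a new board then only requires inserting one more row into the board table with the table_id it should use.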
Here we can clearly see that we rely on a base table, and that the base table is our board table. That raises a question: how do we keep the base table fast and efficient once it holds a large amount of data?
Of course, we must keep this base table at its best speed and performance. For example, it can be stored in a MySQL memory table, or kept in memory in a cache such as Memcache, adjusted according to the actual situation.
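One way to picture the caching idea is a small in-process cache in front of the base table. This is a sketch only: the class name `BoardTableCache` is hypothetical, `$loader` stands in for the real database query, and a shared cache such as Memcache would take the place of the `$rows` array.

```php
<?php
// Hypothetical cache in front of the base table.
final class BoardTableCache
{
    private array $rows = [];   // board_id => table_id
    private $loader;            // callable that queries the base table
    public int $misses = 0;

    public function __construct(callable $loader)
    {
        $this->loader = $loader;
    }

    public function tableId(int $boardId): int
    {
        if (!array_key_exists($boardId, $this->rows)) {
            $this->misses++;    // only go to the database on a miss
            $this->rows[$boardId] = ($this->loader)($boardId);
        }
        return $this->rows[$boardId];
    }
}

// Pretend database lookup: every board maps to sub-table 1.
$cache = new BoardTableCache(fn (int $boardId): int => 1);
$cache->tableId(1);
$cache->tableId(1);
echo $cache->misses, "\n";      // 1
```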
In general, the base-table approach is a fairly good solution for Web 2.0 sites such as SNS, dating sites and forums. On these sites, a single table can be used to store the relationship between a basic identifier and its target table. The advantage of keeping this mapping in a table is that later expansion is very convenient: only one more record needs to be added.
[Advantage] Adding and removing nodes is very convenient, which makes later upgrades and maintenance much easier.
[Disadvantage] Whenever you add a table or operate on a table, you still have to go through the database, which can become a bottleneck.
III. Sub-table processing based on a hash algorithm
As we know, a hash table relies on a value computed by a hash algorithm. Ideally this value is unique, and it is used to locate the stored value. That is what is called a hash table.
The hash algorithm we use for sub-tables follows a similar idea: from the ID or name of the original target, compute the name of the table that stores its data with some hash algorithm, and then access that table.
Continuing with the Tieba example above, each board has a board name and a board ID, and both values are fixed and unique. We can therefore take one of these two values and do some arithmetic on it to get the name of the target table.
Now suppose our Tieba system allows at most 100 million records and each table holds 1 million records, so the whole system needs no more than 100 tables. On this basis, we hash the board ID to get a key value, use that key as part of the table name, and then access the corresponding table.
We construct a simple hash algorithm:
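function get_hash($id){
    // Hex-encode the ID string (e.g. "1" -> "31"), keep the first 4 characters
    $str = bin2hex((string)$id);
    $hash = substr($str, 0, 4);
    if (strlen($hash) < 4) {
        // Right-pad with "0" up to 4 characters
        $hash = str_pad($hash, 4, "0");
    }
    return $hash;
}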
The algorithm takes a board ID and returns a 4-character string. If the string is too short, it is padded with 0.
For example, get_hash(1) returns "3100" and get_hash(23819) returns "3233". Combining the result with a table prefix gives the table to access. So to access the content of the board with ID 1, the combined tables are topic_3100 and reply_3100, and the target tables can be accessed directly.
Of course, with a hash algorithm, data for different IDs may fall into the same table. This differs from a hash table, which tries to resolve collisions; we do not need to do that here. We do, however, need to predict and analyze which table names the data may be stored under.
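To make the table naming and the "same table" case concrete, here is a small sketch. It restates get_hash() compactly so the block runs on its own; the helper name `hash_tables` is hypothetical.

```php
<?php
// Same idea as get_hash() above: hex-encode the ID string, keep the first
// 4 characters, and right-pad with "0" if shorter.
function get_hash($id): string
{
    $hash = substr(bin2hex((string) $id), 0, 4);
    return str_pad($hash, 4, '0');
}

// Hypothetical helper: combine the hash with the table prefixes.
function hash_tables($boardId): array
{
    $h = get_hash($boardId);
    return ['topic' => 'topic_' . $h, 'reply' => 'reply_' . $h];
}

print_r(hash_tables(1));       // topic_3100, reply_3100
print_r(hash_tables(23819));   // topic_3233, reply_3233

// Different IDs can land in the same table: 10, 100 and 1000 all give 3130.
var_dump(get_hash(10) === get_hash(100), get_hash(100) === get_hash(1000));

// Count how many distinct tables the IDs 1..100000 would use.
$tables = array_unique(array_map('get_hash', range(1, 100000)));
echo count($tables), "\n";     // 99
```

Because this particular hash depends only on the first two decimal digits of the ID, it yields 99 distinct tables for positive IDs, which fits the "no more than 100 tables" budget. The spread is uneven, though: ID 1 has topic_3100 to itself, while 10, 100 to 109, 1000 to 1099 and so on all share topic_3130. This is the kind of prediction and analysis of table names that the approach calls for.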
If more data needs to be stored, the hash can likewise be computed on the board name, for example with the same binary-to-hexadecimal conversion used above. Because there are far more Chinese characters than digits and letters, the chance of repetition is smaller, but the number of tables that may result is larger, so some other issues must be considered accordingly.
In the final analysis, if you use the hash approach you must choose a good hash algorithm, so that it generates enough tables and data queries become faster.
[Advantage] The hash algorithm yields the name of the target table directly, which is efficient.
[Disadvantage] Scalability is relatively poor. Once a hash algorithm is chosen, the amount of data it can handle is fixed; the system can only run within that amount and cannot exceed it.
IV. Other issues
1. Search
Once the tables have been split, we can no longer search the table directly, because it is impractical to scan the dozens or hundreds of tables that may already exist in the system. Search must therefore be handled by a third-party component. For example, Lucene is a good choice as an on-site search engine.
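To see why direct search becomes impractical, consider what a naive search over 100 topic tables would have to look like. This sketch only builds the SQL text; the function name `naive_search_sql` and the numeric table IDs are assumptions for illustration, and it is not meant as a recommended approach.

```php
<?php
// Hypothetical illustration: without a search engine, a topic-name search
// has to query every sub-table and merge the results.
function naive_search_sql(array $tableIds): string
{
    $parts = [];
    foreach ($tableIds as $id) {
        $parts[] = sprintf(
            '(SELECT topic_id, topic_name FROM topic_%d WHERE topic_name LIKE ?)',
            (int) $id
        );
    }
    return implode("\nUNION ALL\n", $parts);
}

$sql = naive_search_sql(range(1, 100));
echo substr_count($sql, 'SELECT'), "\n";   // 100 sub-queries for 100 tables
```

Every sub-query scans its own table, so the cost grows with the number of sub-tables, which is why an external index is suggested instead.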
2. Table files
As we know, MySQL's MyISAM engine generates three files for each table, *.frm, *.MYD and *.MYI, which store the table structure, the table data and the table indexes respectively. Under Linux it is best to keep the number of files in a directory below about 1000, otherwise data retrieval becomes slower. Since each table generates three files, retrieval becomes very slow once the number of sub-tables exceeds about 300, so at that point the data must be split again, for example across several databases.
With the base-table method, we can add a new field that records which database each table is stored in. With the hash method, we must take some digit of the hash value as the name of the database. In this way the problem is well solved.
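Both ways of choosing a database could be sketched as follows. Everything naming-related here is an assumption: the "db_" prefix, the use of the second hash character as the database suffix, and the `db_id` column are hypothetical choices made for illustration.

```php
<?php
// Compact form of the hash function from section III.
function get_hash($id): string
{
    return str_pad(substr(bin2hex((string) $id), 0, 4), 4, '0');
}

// Hash method: take one character of the hash as the database suffix.
function locate_by_hash($boardId): string
{
    $hash = get_hash($boardId);
    return 'db_' . $hash[1] . '.topic_' . $hash;
}

// Base-table method: the board row carries an extra column naming the
// database ("db_id" is a hypothetical column name).
function locate_by_base_table(array $boardRow): string
{
    return sprintf('db_%d.topic_%d', $boardRow['db_id'], $boardRow['table_id']);
}

echo locate_by_hash(23819), "\n";                                  // db_2.topic_3233
echo locate_by_base_table(['db_id' => 3, 'table_id' => 1]), "\n";  // db_3.topic_1

// With three MyISAM files per table, 1000 files cover at most this many tables:
echo intdiv(1000, 3), "\n";                                        // 333
```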
In heavily loaded applications, the database has always been a major bottleneck that must be overcome. This article explains two ways of splitting tables, and I hope it gives many people some inspiration. Of course, the code and ideas in this article have not been tested in any real code, so there is no guarantee that the design is completely accurate and practical; readers need to analyze and implement it carefully when they use it.
This article was written in a hurry and its quality may suffer. If you find errors, please don't take offense. Comments and suggestions are welcome, thank you!

