
How to design an IM system with billions of messages

This article will not present a one-size-fits-all IM solution, nor judge whether a particular architecture is good or bad; instead, it discusses the common problems in designing IM systems and how the industry solves them. There is no universal solution: every design has its own pros and cons, and only the system that best fits the business is a good system. Moreover, with limited people, resources, and time, many trade-offs are usually required; in that case, a system that can be iterated quickly and extended easily is a good system.

IM core concepts

User : The user of the system

Message: the content exchanged between users. An IM system usually supports several message types: text, emoticon/sticker, image, video, file, and so on.

Conversation: usually refers to the association formed between two users because they chat with each other.

Group: usually refers to the association formed among multiple users because they chat together.

Terminal: the device on which a user runs the IM client, usually Android, iOS, Web, and so on.

Unread count: the number of messages a user has not yet read.

User status: whether a user is currently online, offline, suspended, and so on.

Relationship chain: the relationship between users, usually a one-way friend relationship, a two-way friend relationship, a follow relationship, and so on. Note the difference from a conversation: a conversation only exists once two users start chatting, whereas a relationship can be established without any chat. A graph database (Neo4j, etc.) can be used to store relationship chains, since it naturally expresses real-world relationships and is easy to model.

Single chat: a one-on-one chat.

Group chat: a chat among the members of a group.

Customer service: in e-commerce, users usually need services such as pre-sale and after-sale consultation, so customer service agents are introduced to handle user inquiries.

Message distribution: in e-commerce, a store usually has several customer service agents, and deciding which agent handles a user's inquiry is message distribution. Distribution usually follows a series of rules, such as whether an agent is online (if not, the inquiry must be routed to another agent), whether the inquiry is pre-sale or after-sale, how busy each agent currently is, and so on.

Mailbox: in this article, a mailbox refers to a Timeline, i.e. a queue for sending and receiving messages.

Read diffusion vs. write diffusion

Read diffusion

Let's look at read diffusion first. As shown in the figure above, A has a separate mailbox (some posts call it a Timeline) for each person and each group A chats with, and when A checks for chat messages, A must read every mailbox that has new messages. Note the difference from read diffusion in a Feeds system: in a Feeds system everyone has their own outbox, a post is written once to the author's outbox, and reading requires pulling from the outboxes of everyone you follow. In an IM system, read diffusion usually means one mailbox for every pair of associated users and one mailbox for each group.

Advantages of read diffusion:

  • The write operation (sending a message) is very lightweight: whether it is a single chat or a group chat, the message only needs to be written to the corresponding mailbox once.

  • Each mailbox is naturally the chat history of its two participants, which makes viewing and searching chat history easy.

Disadvantages of read diffusion:

  • The read operation (reading messages) is heavy: every mailbox with new messages must be read (a minimal sketch of this read path follows).
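The sketch below illustrates that read path, assuming a hypothetical `mailbox_store` interface (none of these names come from a real library):

```python
# A minimal sketch of the read path under read diffusion (illustrative only).

def pull_new_messages(user_id, mailbox_store):
    """Read every mailbox (1:1 or group) that has messages newer than the
    user's last-read position: this is the heavy part of read diffusion."""
    new_messages = {}
    for mailbox_id in mailbox_store.mailboxes_of(user_id):          # every conversation and group
        last_read = mailbox_store.last_read_position(user_id, mailbox_id)
        messages = mailbox_store.read_after(mailbox_id, last_read)  # one read per mailbox
        if messages:
            new_messages[mailbox_id] = messages
    return new_messages
```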


Write diffusion

Next look at write diffusion.

In write diffusion, everyone reads messages only from their own mailbox, but writing (sending a message) is handled as follows for single chat and group chat (a minimal sketch follows the list):

  • Single chat: the message is written to both the sender's mailbox and the receiver's mailbox. If the chat history of the two users also needs to be viewable, another copy has to be written (in principle the history could be reconstructed by traversing both users' mailboxes, but that would be very inefficient).

  • Group chat: the message must be written to the mailbox of every group member, and again another copy is needed if the group's chat history must be viewable. Clearly, write diffusion greatly amplifies write operations for group chats.
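The sketch below shows that write path with hypothetical `mailbox_store` and `group_store` interfaces, to make the write amplification visible:

```python
# A minimal sketch of the write path under write diffusion (illustrative only).

def send_single_chat(msg, sender_id, receiver_id, mailbox_store):
    mailbox_store.append(sender_id, msg)      # sender's own timeline
    mailbox_store.append(receiver_id, msg)    # receiver's timeline
    mailbox_store.append_history(sender_id, receiver_id, msg)  # optional chat-history copy

def send_group_chat(msg, group_id, group_store, mailbox_store):
    for member_id in group_store.members(group_id):  # write amplification: one write per member
        mailbox_store.append(member_id, msg)
    mailbox_store.append_group_history(group_id, msg)  # optional group-history copy
```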

Advantages of write diffusion:

  • Read operation is very lightweight

  • Multi-terminal synchronization of messages can be easily done

Disadvantages of write diffusion:

  • The write operation is very heavy, especially for group chats


Note that in the Feeds system:

  • Write diffusion is also called: Push, Fan-out or Write-fanout

  • Read diffusion is also called: Pull, Fan-in or Read-fanout

Unique ID design

Under normal circumstances, ID design mainly has the following categories:

  • UUID

  • ID generation method based on Snowflake

  • Generation based on a DB step size (segment allocation)

  • Self-incrementing ID generation method based on Redis or DB

  • Unique IDs generated by special rules

For specific implementation methods and advantages and disadvantages, please refer to a previous blog post: Distributed Unique ID Analysis

The main places that need a unique ID in the IM system are:

  • Session ID

  • Message ID

Message ID

Let's take a look at three issues that need to be considered when designing a message ID.

Is it okay if the message ID is not incremental?

Let's first see what happens if the ID is not incremental:

  • Using strings wastes storage space, and the storage engine cannot keep adjacent messages physically close to each other, which hurts both write and read performance.

  • Using random numbers also prevents the storage engine from storing adjacent messages together, which increases random IO and reduces performance; in addition, random IDs make uniqueness hard to guarantee.

Therefore, the message ID should preferably be incremental.

Global increment vs. user-level increment vs. session-level increment

Global increment: the message ID increases over time across the entire IM system. Snowflake is commonly used for this (strictly speaking, Snowflake IDs are only incremental per worker). In a read-diffusion system with such IDs, to avoid losing messages each message can carry the ID of the previous message; the front end uses it to detect gaps and pulls again if a message is missing.
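As an illustration of the Snowflake approach, here is a minimal Snowflake-style generator sketch (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence; the custom epoch is an assumption, and clock-rollback handling is omitted). As noted above, such IDs are only monotonic per worker:

```python
import threading
import time

class SnowflakeLike:
    """Minimal Snowflake-style ID generator: 41-bit millisecond timestamp,
    10-bit worker id, 12-bit per-millisecond sequence (illustrative only)."""

    def __init__(self, worker_id, epoch_ms=1_577_836_800_000):  # assumed epoch: 2020-01-01 UTC
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence within one millisecond
                if self.sequence == 0:                        # sequence exhausted: wait for the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.epoch_ms) << 22) | (self.worker_id << 12) | self.sequence
```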

User-level increment: the message ID is only guaranteed to increase for a single user; IDs of different users are independent and may repeat. Typical example: WeChat. In a write-diffusion system, the mailbox timeline ID and the message ID need to be designed separately: the timeline ID increases per user, while the message ID increases globally. In a read-diffusion system, user-level increment does not seem very necessary.

Session-level increment: the message ID is only guaranteed to increase within a single session; IDs of different sessions are independent and may repeat. Typical example: QQ.

Continuous increment vs. monotonic increment

Continuous increment means IDs are generated as 1, 2, 3, ..., n; monotonic increment only requires that a later ID be larger than an earlier one, without being contiguous.

As far as I know, QQ's message IDs increase continuously at the session level. The advantage is that when the next message arrives and its ID is not contiguous, the client knows a message was lost and asks the server for it. Some may wonder: couldn't the client simply pull periodically to check for lost messages? No, because the message ID is only continuous per session; if a user has thousands of sessions, periodic pulling would require that many requests and the server would be overwhelmed.

For read diffusion, continuously incremented message IDs are a good approach. If monotonic increment is used instead, each message must carry the ID of the previous message (so the chat messages form a linked list) in order to detect loss; a sketch of this check follows.
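The sketch below shows that linked-list check, assuming each message carries a `prev_id` field (the field names are illustrative):

```python
# A minimal sketch of detecting lost messages when IDs are only monotonic:
# each message carries the ID of the previous message in the same conversation.

def find_missing_prev_ids(messages):
    """`messages`: dicts with 'msg_id' and 'prev_id' for one conversation.
    Returns the prev_ids that are referenced but not present locally."""
    known = {m["msg_id"] for m in messages}
    missing = []
    for m in messages:
        prev = m["prev_id"]
        if prev is not None and prev not in known:
            missing.append(prev)   # ask the server to re-send starting from this ID
    return missing
```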


To summarize:

  • Write diffusion: the mailbox timeline ID increases per user and the message ID increases globally; in this case monotonic increment is sufficient.

  • Read diffusion: the message ID can increase per session, preferably continuously.

Session ID

Let's look at the issues that need attention when designing a session ID.

A relatively simple way to generate a session ID (a unique ID generated by special rules) is to splice from_user_id and to_user_id together:

  1. If from_user_id and to_user_id are both 32-bit integers, they can easily be combined into a 64-bit session ID with bit arithmetic: conversation_id = ${from_user_id} << 32 | ${to_user_id} (before splicing, make sure the smaller user ID is used as from_user_id, so that any two users can conveniently derive the same session ID; see the sketch after this list).

  2. If from_user_id and to_user_id are both 64-bit integers, they can only be concatenated into a string, which is painful: it wastes storage space and performs poorly.
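A minimal sketch of the first approach, assuming both user IDs fit in 32 bits:

```python
# Rule-based conversation ID for 32-bit user IDs (illustrative only).

def conversation_id(user_a: int, user_b: int) -> int:
    lo, hi = sorted((user_a, user_b))             # the smaller ID always goes into the high bits
    assert 0 <= lo < 2**32 and 0 <= hi < 2**32    # both IDs must fit in 32 bits
    return (lo << 32) | hi

def split_conversation_id(conv_id: int):
    return conv_id >> 32, conv_id & 0xFFFFFFFF

# conversation_id(42, 7) == conversation_id(7, 42): both users derive the same ID.
```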

My former employer used the first approach above. It has a flaw: as the business expands globally, if 32-bit user IDs are no longer enough and must be widened to 64 bits, the scheme requires drastic changes. A 32-bit integer ID looks like it can hold 2.1 billion users, but IDs are usually not assigned contiguously (to avoid revealing real user numbers), so 32-bit user IDs are far from enough. A design that depends entirely on the user ID is therefore not desirable.

Therefore, the session ID can instead be generated with a global increment, plus a mapping table that stores the relationship between from_user_id, to_user_id and conversation_id.

Push mode vs. pull mode vs. push-pull combined mode

In the IM system, there are usually three possible ways to obtain new messages:

  • Push mode: When there are new messages, the server actively pushes to all terminals (iOS, Android, PC, etc.)

  • Pull mode: the front end actively requests messages. Push mode is generally used to keep messages real-time, while pull mode is generally used to fetch historical messages.

  • Push-pull combination mode: When there is a new message, the server will first push a notification with a new message to the front end, and the front end will pull the message from the server after receiving the notification

The simplified diagram of the push mode is as follows:

As shown in the figure above, a message sent by a user is normally pushed to all of the receiver's terminals after the server stores it. But a push can be lost. The most common case is that the user is pseudo-online: if the push service relies on a long connection, that connection may already be broken (the user has actually dropped offline), but the server usually only notices after a heartbeat cycle and mistakenly believes the user is still online (pseudo-online is a term I coined; I could not find a better word for it). Therefore, using push mode alone may lose messages.

The simplified diagram of the push-pull combination mode is as follows:

The push-pull combined mode solves the message-loss problem of pure push: when there is a new message, the server pushes a notification, and the front end then requests the latest message list; to guard against lost notifications, the front end can also actively re-request at regular intervals. The push-pull combined mode works best with write diffusion, because write diffusion only needs to pull a single personal-mailbox timeline, whereas read diffusion has N timelines (one per mailbox) and pulling them all periodically would perform poorly.

Industry solutions

We have covered the common design problems of IM systems; now let's see how the industry designs them. Studying mainstream industry solutions helps us understand IM system design. The following is based on publicly available information, which may not be accurate, so treat it as a reference only.

WeChat

Although WeChat has built many of its basic frameworks in-house, that does not prevent us from understanding its architecture. From WeChat's article "From 0 to 1: The Evolution of WeChat Back-end System", we can see that WeChat mainly adopts write diffusion plus the push-pull combined mode. Because group chat also uses write diffusion, and write diffusion is resource-hungry, WeChat groups have a maximum size (currently 500). This is an obvious shortcoming of write diffusion: supporting groups of tens of thousands of members becomes much harder.

It can also be seen from the article that WeChat adopts a multi-data center architecture:


Each WeChat data center is autonomous and holds a full copy of the data, and the data centers are synchronized through a self-developed message queue. To guarantee data consistency, each user belongs to exactly one data center and can only read and write data in that data center; if a user connects to another data center, the user is automatically redirected to the home data center. Accessing another user's data also only requires accessing one's own data center. In addition, WeChat uses a three-campus disaster-recovery architecture and uses Paxos to guarantee data consistency.

From WeChat's article "Trillion-level Calling System: Architecture Design and Evolution of the WeChat Serial Number Generator", we can see that WeChat's ID design uses DB step-size (segment) based generation with user-level increment. As shown below:


In the WeChat serial number generator, an arbitration service generates the routing table (which stores the full mapping from uid number segments to AllocSvr instances), and the routing table is synchronized to the AllocSvr instances and the clients. If an AllocSvr goes down, the arbitration service reschedules its uid number segments to other AllocSvr instances.

DingTalk

DingTalk has not published much. From the article "Ali DingTalk Technology Sharing: Enterprise-level IM King-DingTalk's Outstanding Back-end Architecture", we only know that DingTalk initially used the write-diffusion model and later, in order to support groups of tens of thousands of members, appears to have switched to read diffusion.

But when talking about Alibaba's IM systems, one has to mention Alibaba's self-developed Tablestore. Normally an IM system has a separate auto-increment ID service, but Tablestore creatively introduces auto-incrementing primary key columns, which pushes ID generation down into the DB layer and supports user-level increment (traditional DBs such as MySQL only support table-level auto-increment, i.e. global auto-increment). For details, see "How to Optimize High Concurrency IM System Architecture".

Twitter

What? Isn't Twitter a Feeds system? Isn't this article about IM? Yes, Twitter is a Feeds system, but Feeds systems and IM systems share many design problems, so studying Feeds systems is a useful reference when designing an IM system. Besides, there is no harm in studying Feeds systems and broadening one's technical horizon.

Twitter's auto-increment ID design is probably familiar to everyone: the famous Snowflake, which makes IDs globally incremental.

From the talk "How We Learned to Stop Worrying and Love Fan-In at Twitter", we can see that Twitter initially used the write-diffusion model: the Fanout Service fanned writes out to the Timelines Cache (backed by Redis), the Timeline Service read the timeline data, and API Services returned it to the user.

However, because write diffusion is too expensive for big Vs (accounts with huge followings), Twitter later adopted a combination of write diffusion and read diffusion. As shown below:

Users with a small number of followers still tweet via the write-diffusion model; the Timeline Mixer service then merges the user's own Timeline, the big Vs' Timelines, and system recommendations, and API Services return the combined result to the user.

58 Daojia

58 Daojia has implemented a universal real-time messaging platform:

As shown, msg-server stores the mapping between applications and MQ topics; based on this configuration, msg-server pushes messages to different MQ queues, and the corresponding applications consume them. Adding a new application therefore only requires a configuration change.

To ensure reliable message delivery, 58 Daojia also introduced an acknowledgment mechanism: the message platform persists an incoming message to the database first, and deletes it only after the receiver acknowledges it at the application layer (ACK). The acknowledgment mechanism works best with single-device login; if multiple terminals can be logged in at the same time it becomes more troublesome, because the message can only be deleted after every terminal has acknowledged it.

By now it should be clear that designing an IM system is very challenging. Let's continue with the issues that need to be considered when designing one.

Problems an IM system needs to solve

How to ensure real-time message delivery

For the communication protocol, we mainly have the following choices:

  1. Use TCP sockets with a self-designed protocol: 58 Daojia, etc.

  2. Use UDP sockets: QQ, etc.

  3. Use HTTP long polling: the WeChat web version, etc.

Whichever method is used, real-time notification is achievable. But real-time delivery can still be affected by how we process messages. For example, suppose we push through MQ and a message goes to a group of 10,000 members: if pushing to one person takes 2 ms, pushing to all 10,000 sequentially takes 20 s, and subsequent messages are blocked for those 20 s. If we need to finish the push within 10 ms, the required push concurrency is: members (10,000) / (total push budget 10 ms / per-person push time 2 ms) = 2,000.
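The same back-of-the-envelope calculation, written as a tiny helper so the numbers can be re-checked under different assumptions:

```python
import math

def required_push_concurrency(members: int, per_push_ms: float, budget_ms: float) -> int:
    pushes_per_worker = budget_ms / per_push_ms     # 10 ms / 2 ms = 5 pushes per worker within the budget
    return math.ceil(members / pushes_per_worker)   # 10,000 / 5 = 2,000 concurrent pushes

print(required_push_concurrency(10_000, 2, 10))     # -> 2000
```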

Therefore, when choosing a concrete implementation, we must evaluate the throughput of our system and evaluate and test every link in the chain; only then can real-time message push be guaranteed.

How to ensure message ordering

Messages may be out of order in the following situations:

  • If messages are sent over HTTP rather than a long connection, they may end up out of order, because the back end is usually a cluster and HTTP requests may land on different servers; due to network latency or different processing speeds, a later message may be finished first. Solutions:

  • The front end sends messages strictly one at a time, waiting for one to complete before sending the next. This degrades the user experience and is generally not recommended.

  • Attach a sequence ID generated by the front end and let the receiver sort by it. This is a bit more work on the front end, and during a chat a late message may get inserted into the middle of the receiver's history list, which looks strange and may cause the user to miss it. This can be mitigated by re-sorting the window when the user switches conversations, while simply appending each newly received message to the end otherwise.

  • To optimize the experience, some IM systems (QQ, for example) use an asynchronous send-acknowledgment mechanism: as soon as the message reaches the server and the server hands it to MQ, the send counts as successful; if it later fails because of permissions or other issues, the back end pushes a separate notification. In this case, MQ needs an appropriate sharding strategy:

  • Shard by to_user_id: with this strategy, if the sender's other terminals need to be synchronized, they may receive the sender's messages out of order, because different queues may be processed at different speeds. For example, the sender sends m1 and then m2, but the server may process m2 before m1, so the sender's other terminal receives m2 first and its conversation list becomes confused.

  • Shard by conversation_id: this strategy also leads to out-of-order multi-terminal synchronization.

  • Shard by from_user_id: in this case, this is the better choice (a minimal sketch follows this list).

  • However, to optimize performance, the push may be done before writing to MQ; in that case, sharding by to_user_id is the better choice.
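A minimal sketch of how the partition (shard) key might be chosen; the partition count and hash function are assumptions, not taken from any particular MQ:

```python
import zlib

NUM_PARTITIONS = 64   # assumed number of MQ partitions

def partition_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def partition_by_sender(msg: dict) -> int:    # preserves per-sender ordering (multi-terminal sync)
    return partition_for(str(msg["from_user_id"]))

def partition_by_receiver(msg: dict) -> int:  # preserves per-receiver ordering (push before MQ)
    return partition_for(str(msg["to_user_id"]))
```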

How to maintain user online status

Many IM systems need to display a user's status: online, busy, and so on. Redis or distributed consistent hashing can be used to store users' online status.

  1. Redis stores user online status


Looking at the figure above, some may wonder: why update Redis on every heartbeat? With a TCP long connection, shouldn't an update only be needed on connect and disconnect? Under normal circumstances, yes. However, the server may fail, or the network between the server and Redis may have problems; a purely event-based update would then leave the user status wrong. Therefore, if the online status must be accurate, it is best to refresh it through the heartbeat (a minimal sketch follows).

Since a single Redis instance runs on one machine, we can use Redis Cluster or Codis to improve reliability and performance.
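A minimal sketch of heartbeat-driven status updates using the redis-py client; the key naming and TTL are assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
HEARTBEAT_TTL = 40   # seconds; slightly longer than the heartbeat interval

def on_heartbeat(user_id: str, device: str, server_id: str) -> None:
    # Refresh the key on every heartbeat: if the server crashes or the connection
    # is lost, the key simply expires and the "pseudo-online" state clears itself.
    r.set(f"online:{user_id}:{device}", server_id, ex=HEARTBEAT_TTL)

def is_online(user_id: str, device: str) -> bool:
    return r.exists(f"online:{user_id}:{device}") == 1
```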


  2. Distributed consistent hashing stores the online status of users

When using distributed consistent hashing, attention must be paid to migrating user status when the Status Server Cluster is scaled out or in; otherwise the user status will be inconsistent right after the operation. Virtual nodes are also needed to avoid data skew (a minimal sketch follows).
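A minimal consistent-hash ring sketch with virtual nodes for mapping users to status servers (the state migration mentioned above is omitted):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._keys = []    # sorted hash values on the ring
        self._ring = {}    # hash value -> node
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):              # virtual nodes smooth out data skew
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._keys, h)
            self._ring[h] = node

    def node_for(self, key: str) -> str:
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[self._keys[idx]]

# ring = ConsistentHashRing(["status-1", "status-2", "status-3"])
# ring.node_for("user:42")  -> the status server that holds this user's state
```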

How to do multi-terminal synchronization

Read diffusion

As mentioned earlier, read diffusion mainly relies on push mode for message synchronization, and the message ID of a single session increases sequentially. If the front end receives a pushed message and finds the message ID not contiguous, it asks the back end to resend the missing messages. However, the last message of a conversation may still be lost. To make delivery more reliable, the historical conversation list can carry the ID of the last message of each conversation: when the front end receives a new message, it first pulls the latest conversation list and checks whether the last message of that conversation is present locally. If it is not, messages may have been lost and the front end pulls that conversation's message list again; if the conversation's last message ID matches the last message ID in the local list, the front end does nothing. The performance bottleneck of this approach is pulling the historical conversation list, because every new message triggers one such pull. At WeChat's scale, messages alone may reach 200,000 QPS; a historical conversation list kept in a traditional DB such as MySQL would certainly not hold up, so it is best stored in a Redis cluster with AOF enabled (RDB may lose data). One can only sigh that performance and simplicity cannot both be had.


Write diffusion

For write diffusion, multi-terminal synchronization is simpler: the front end only needs to record the last synchronized position, send it when synchronizing, and the server returns all data after that position; the front end then advances its synchronization position (a minimal sketch follows).
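A minimal sketch of this cursor-based sync, with a hypothetical `timeline_store` interface:

```python
# Multi-terminal sync under write diffusion (illustrative only): each terminal
# keeps the last timeline position it has synced and asks for everything after it.

def sync(user_id, last_synced_seq, timeline_store, limit=200):
    """Return the timeline entries after `last_synced_seq` and the new position."""
    entries = timeline_store.read_after(user_id, last_synced_seq, limit)
    new_position = entries[-1]["seq"] if entries else last_synced_seq
    return entries, new_position

# Client flow: keep `position` in local storage, call sync() on reconnect or on a
# "new message" notification, render the entries, then persist the new position.
```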

How to handle unread counts

In IM systems, handling unread counts is very important. Unread counts are generally divided into per-session unread counts and a total unread count. If handled improperly, the two can become inconsistent, which seriously degrades the user experience.

Read diffusion

For read diffusion, we can store both the per-session unread counts and the total unread count on the back end, but the back end must guarantee that the two are updated atomically and consistently, generally by one of the following two methods:

  1. Use Redis' MULTI transactions, retrying when a transaction fails. Note that Codis clusters do not support transactions.

  2. Use an embedded Lua script. This requires the per-session unread counts and the total unread count to live on the same Redis node (with Codis, a hashtag can be used). The downside is that implementation logic becomes scattered, which raises maintenance cost (a minimal sketch follows this list).
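A minimal sketch of the Lua approach (option 2) using redis-py; the key names and the `{uid}` hashtag are assumptions:

```python
import redis

r = redis.Redis()

# Atomically bump one conversation's unread count and the user's total unread count.
# With Codis, the {uid} hashtag keeps both keys on the same node.
INCR_UNREAD = r.register_script("""
    redis.call('HINCRBY', KEYS[1], ARGV[1], 1)  -- per-session unread: field = conversation id
    return redis.call('INCR', KEYS[2])          -- total unread counter
""")

def incr_unread(uid: str, conversation_id: str) -> int:
    return INCR_UNREAD(keys=[f"unread:{{{uid}}}", f"total_unread:{{{uid}}}"],
                       args=[conversation_id])
```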


Write diffusion

For write diffusion, the server usually weakens the concept of a conversation, i.e. it does not store the historical conversation list, and the front end is responsible for computing unread counts. Marking messages as read or unread only records an event in the mailbox, and each terminal replays these events to maintain its per-session unread counts. With this approach, the unread counts on different terminals may become inconsistent; at least WeChat shows this problem.

If write diffusion also stored unread counts via a historical conversation list, the user timeline service would be tightly coupled with the conversation service; guaranteeing atomicity and consistency would then require distributed transactions, which would greatly reduce system performance.

How to store historical messages

Read diffusion

For read diffusion, only one copy needs to be stored, sharded by session ID.


Write diffusion

For write diffusion, two copies need to be stored: a message list whose Timeline is the user, and a message list whose Timeline is the conversation. The user-Timeline list can be sharded by user ID, and the conversation-Timeline list can be sharded by session ID.

Hot and cold data separation

For IM, historical message storage has a strong time-series characteristic: the older a message, the lower its probability of being accessed and the lower its value.

If we need to store years of (or even permanent) message history, which is common in e-commerce IM, separating hot and cold history becomes essential. Hot-cold separation generally follows the HWC (Hot-Warm-Cold) architecture: newly sent messages go into the Hot storage system (Redis can be used) and the Warm storage system, and a Store Scheduler periodically migrates cold data to the Cold storage system according to certain rules. When fetching messages, the Hot, Warm and Cold storage systems are queried in order, and the Store Service merges the data and returns it to the IM Service (a minimal sketch of the read path follows).
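A minimal sketch of that tiered read path, assuming the three stores expose a common, hypothetical `read` method:

```python
# Query the Hot tier first, then Warm, then Cold, until enough messages are found.

def get_history(conversation_id, before_msg_id, limit, hot, warm, cold):
    results = []
    for store in (hot, warm, cold):                   # newest tier first
        remaining = limit - len(results)
        if remaining <= 0:
            break
        batch = store.read(conversation_id, before_msg_id, remaining)
        results.extend(batch)
        if batch:
            before_msg_id = batch[-1]["msg_id"]       # continue from the oldest message fetched so far
    return results
```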

How to design the access layer

There are mainly the following methods to achieve load balancing at the access layer:

  1. Hardware load balancing: e.g. F5, A10. Hardware load balancers are powerful and very stable, but they are very expensive and not recommended unless the company has very deep pockets.

  2. DNS-based load balancing: relatively simple to implement, but switching or scaling via DNS takes effect very slowly, DNS supports only a limited number of IPs, and its load-balancing strategies are fairly simple.

  3. DNS + layer-4 load balancing + layer-7 load balancing: e.g. DNS + DPVS + Nginx or DNS + LVS + Nginx. Why add layer-4 load balancing? Because layer-7 load balancing is very CPU-intensive and often needs to be scaled up or down; a large site may need many layer-7 load balancers but only a small number of layer-4 ones. This architecture therefore suits large-scale applications with short connections such as HTTP (if traffic is modest, DNS + layer-7 load balancing alone is enough). For long connections, however, a layer-7 load balancer such as Nginx is not great: Nginx configurations are changed and reloaded frequently, and a reload breaks TCP connections, causing a large number of disconnects.

  4. DNS + layer-4 load balancing: layer-4 load balancers are generally stable and rarely change, which makes them more suitable for long connections.

For the long-connection access layer, if we need more flexible load-balancing strategies or need to do grayscale releases, we can introduce a scheduling service, as shown in the figure below:

The Access Schedule Service can assign an Access Service according to various strategies, for example:

  • Assign according to the grayscale strategy

  • Assign based on proximity

  • Assign according to the fewest connections (a minimal sketch follows this list)
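A minimal sketch combining the three strategies above; the fields of each server record are assumptions about what the access servers report to the scheduler:

```python
# Pick an access server for a new long connection (illustrative only).

def pick_access_server(servers, user_region=None, grayscale_group=None):
    candidates = servers
    if grayscale_group is not None:                            # grayscale: prefer the canary pool
        candidates = [s for s in candidates if s["group"] == grayscale_group] or candidates
    if user_region is not None:                                # proximity: prefer the user's region
        nearby = [s for s in candidates if s["region"] == user_region]
        candidates = nearby or candidates
    return min(candidates, key=lambda s: s["connections"])     # fewest connections wins

# servers = [{"addr": "10.0.0.1:443", "region": "eu", "group": "stable", "connections": 120}, ...]
```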

Architecture experience

Finally, some architecture experience from building large-scale applications:

  1. Grayscale! Grayscale! Grayscale!

  2. Monitor! Monitor! Monitor!

  3. Warning! Warning! Warning!

  4. Cache! Cache! Cache!

  5. Rate limiting! Circuit breaking! Degradation!

  6. Low coupling, high cohesion!

  7. Avoid single points and embrace statelessness!

  8. Evaluate! Evaluate! Evaluate!

  9. Stress test! Stress test! Stress test!


References

How We Developed DingTalk: Implementing the Message System Architecture

Feed Stream System Design: General Principles

Rongyun first disclosed the four key points of high-concurrency system architecture design

A complete design of an instant messaging system (IM) for massive online users Plus

Instant Messaging Network (IM Developer Community) — Featured Technology

From 0 to 1: The evolution of WeChat background system

Trillion-level calling system: WeChat serial number generator architecture design and evolution

WeChat PaxosStore: Explain the Paxos algorithm protocol in a simple way

Design practice of WeChat background based on time series of massive data cold and hot hierarchical architecture design practice

DingTalk the way to innovate enterprise-level IM storage architecture

Discussion on the Synchronization and Storage of Chat Messages in Modern IM System

Alibaba DingTalk Technology Sharing: Enterprise-level IM King——DingTalk's superiority in back-end architecture

How to optimize the architecture of high-concurrency IM system

Message system in modern IM system—architecture

Message System in Modern IM System—Model

Message system in modern IM system—implementation

How We Learned to Stop Worrying and Love Fan-In at Twitter

58 Daojia general real-time messaging platform architecture details

How to ensure the "timing" and "consistency" of IM real-time messages

