
Practice of GraphQL and Metadata Driven Architecture in Backend BFF

1 Origin of BFF

The term BFF comes from Sam Newman's blog post "Pattern: Backends For Frontends" and refers to a backend that serves a specific frontend. What problem does BFF solve? According to the original description, with the rise of the mobile Internet, server-side functionality originally built for the desktop Web needed to be offered to mobile apps as well, and several problems arose in the process:

  • The Mobile App and the Desktop Web differ in their UI.

  • The Mobile App itself spans different terminals, not only iOS but also Android, and the UIs of these terminals differ from each other.

  • The original backend functionality is already tightly coupled to the desktop Web UI.

Because of these differences between ends, server-side functionality has to be adapted and tailored to each end, while the server's own business capability is relatively single-purpose. This creates a contradiction between the single-purpose business capability of the server and the differing demands of the ends. How is this resolved? This is the "Single-purpose Edge Services for UIs and external parties" described in the article's subtitle: a BFF is introduced to absorb the differences between ends. This is also a pattern widely adopted in the industry.

Figure 1 Schematic diagram of BFF


In actual business practice, there are many sources of such end-to-end differences, both technical and business. For example: whether the user's client is Android or iOS, whether the screen is large or small, and which version it runs. Or: which industry the business belongs to, what the product form is, which scenario the function is placed in, who the target user group is, and so on. All of these factors create differences in end-facing functional logic.

The commodity display business that the author's team is responsible for has a strong claim to this problem. For the same commodity business, the display logic on the C side is deeply affected by factors such as commodity type, industry, transaction form, place of delivery, and target user group. At the same time, consumer-facing functions iterate frequently, which intensifies this contradiction and turns it into a contradiction between the stability of a single-purpose server and the flexible, differentiated demands of the ends. This is why the commodity display BFF exists as a business system. This article introduces some of the problems and solutions in the context of Meituan's in-store merchandise display scenarios.

2 Core contradictions in the context of BFF

The BFF layer is introduced to resolve the contradiction between the stability of a single-purpose server and the differing, flexible demands of the terminals. The contradiction does not disappear; it is transferred: from a contradiction between the backend and the frontend to a contradiction between the BFF and the frontend. The main work of the author's team is to fight this contradiction. The following takes a specific business scenario as an example, combined with current business characteristics, to illustrate the concrete problems we face in the BFF production mode. The picture below shows the display modules of group-purchase shelves in two different industries. We consider these two modules to be the display scenarios of two products: they are two independently defined sets of product logic, and they iterate separately.

Figure 2 Display scenarios


In the early stage of business development there were not many such scenarios. The BFF layer was built in a "chimney" style, with functions rapidly developed and launched to meet business demands, and under those circumstances the contradiction was not obvious. As the business and the industry developed, many such commodity display functions accumulated, and the contradiction gradually intensified, mainly in the following two aspects:

  • Business support efficiency: With commodity display scenarios growing and the number of APIs exploding, business support efficiency scales linearly with manpower, and the system cannot support large-scale expansion of business scenarios.

  • High system complexity: Core functions iterate continuously, the internal logic is full of if…else…, and the code is written procedurally. System complexity is high, and the code is difficult to modify and maintain.

So how did these problems arise? They need to be understood in light of the "chimney-style" construction background and the business and system characteristics of the commodity display scene.

Feature 1: Many external dependencies, differences between scenarios, and high user-experience requirements

The figure shows two group-purchase shelf modules in different industries. For such a seemingly small module, the backend needs to call more than 20 downstream services at the BFF layer to assemble all the data — that is the first point. In the two scenarios above, the sets of required data sources differ, and such differences are common — that is the second point. For example, a data source required by the pedicure group-purchase shelf is not needed by the beauty group-purchase shelf, and a data source required by the beauty group-purchase shelf is not needed by the pedicure group-purchase shelf. Third, although there are many downstream services, the C-side user experience must still be guaranteed.

These characteristics create several technical difficulties: 1) The aggregation granularity is hard to control. Should aggregation be built per scenario or built once for all scenarios? If it is built per scenario, similar aggregation logic is inevitably rewritten in different scenarios; if it is built once, a large, all-encompassing data aggregation inevitably makes invalid calls. 2) The complexity of the aggregation logic is hard to control. With so many data sources, we must consider not only how to write the business logic but also how to orchestrate asynchronous calls; if code complexity is not well controlled, subsequent changes to the aggregation become a problem.

Feature 2: Many pieces of display logic, differences between scenarios, and coupling between common and scenario-specific logic

We can clearly see that the logic of a certain type of scenario is largely common. Take the group-deal display scenarios: intuitively they all display information from the single dimension of the group deal, but that is only the surface. In fact, there are many differences in how the modules are generated, such as the following two:

  • Differences in field-splicing logic: Take the group-deal titles of the two group-purchase shelves in the figure above. Both are titles, but the display rule on the beauty group-purchase shelf is "[type] + group-deal title", while the display rule on the pedicure group-purchase shelf is just "group-deal title".

  • Differences in sorting and filtering logic: For the same group-deal list, scenario A sorts by sales in descending order while scenario B sorts by price; different scenarios have different sorting logic.

There are many display-logic differences like these. Scenarios that look similar actually differ in many internal details. How to handle these differences in the backend is a hard problem. The most common way to write it is to implement logical routing by reading specific condition fields, as shown below:

if ("丽人".equals(category)) {          // beauty industry
    title = "[" + category + "]" + productTitle;
} else if ("足疗".equals(category)) {   // pedicure industry
    title = productTitle;
}

This approach is functionally fine and allows common logic to be reused. But in practice, with many scenarios, a great deal of difference-judging logic piles up, and functions keep iterating. It is easy to imagine that the system becomes more and more complex and harder and harder to modify and maintain.

Summary: In the BFF layer, the display scenarios of different products differ. In the early stage of business development, the system supported rapid business trial and error through independent, per-scenario construction, and the problems caused by business differences were not obvious. As the business keeps developing, more and more scenarios need to be built and operated, and they are scaling up; the business now places higher demands on engineering efficiency. Against this background of many scenarios with differences between them, how to expand scenarios efficiently while keeping system complexity under control is the core problem in our business.

3 BFF application mode analysis

At present there are two main modes for this kind of solution in the industry: the back-end BFF mode and the front-end BFF mode.

3.1 Backend BFF Mode

In the back-end BFF mode, the BFF is owned by back-end engineers. The most widespread practice of this mode is the back-end BFF solution based on GraphQL: the back-end encapsulates display fields into display services, which are orchestrated through GraphQL and exposed for the front-end to use. As shown below:

Figure 3 Back-end BFF mode


The biggest feature and advantage of this mode is that when a display field already exists, the backend does not need to care about the frontend's differentiated requirements; on-demand querying is supported by GraphQL. This copes well with differences in display fields across scenarios: the frontend queries data on demand through GraphQL, and the backend does not need to change. At the same time, with GraphQL's orchestration and aggregate-query capabilities, the backend can decompose its logic into different display services, which resolves some of the BFF layer's complexity.

However, this model still has several problems: display service granularity, data graph partitioning, and field proliferation. The following figure shows a concrete case of this model:

Figure 4 Back-end BFF mode (case)


1) Display service granularity design problem

This solution requires that display logic and data-fetching logic be encapsulated together in one module, forming a display service, as shown in the figure above. However, the relationship between display logic and data-fetching logic is many-to-many, as the following example illustrates:

Background: There are two display services that respectively encapsulate the query capabilities for product titles and product labels. Scenario: A PM requests that in a certain scenario the product title be displayed in the form "[type] + product title". Splicing the product title now depends on type data, and the type data is already fetched inside the product label display service. Question: Should the product title display service fetch the type data itself, or should the two display services be merged?

The problem described above is about controlling the granularity of display services. One might suspect that the granularity in the example is simply too small; but looking at it the other way, if the two services are merged there will inevitably be redundancy. This is the difficulty of display service design, and the root cause is that display logic and data-fetching logic are in a many-to-many relationship yet are designed together.

2) Data graph division problem

The data of multiple display services is aggregated through GraphQL into one graph (the GraphQL Schema), forming a data view. When data is needed, as long as it is in the graph, it can be queried on demand with a Query. The question is: how should this graph be organized? One graph or several? If the graph is too large, maintaining the data relationships becomes complicated; if it is too small, the value of the solution itself is diminished.

3) Display service internal complexity and model proliferation problem

As mentioned above, a product title can have different splicing logics, and this is especially common in product display scenarios. For example, for the same price, industry A shows the discounted price while industry B shows the pre-discount price; for the same label slot, industry C shows the service duration while industry D shows product characteristics, and so on. So how should the display model be designed? Take the title field: is a single title field on the display model enough, or should there be both a title and a separate titleWithCategory? If it is the former, there must be if…else… logic inside the service to distinguish the title splicing method, which adds to the display service's complexity; if it is the latter, and there are many such fields, the display service's model fields will keep proliferating.

Summary: The back-end BFF mode can resolve some of the back-end logic complexity while providing a reuse mechanism for display fields. But some issues remain unresolved, such as the granularity design of display services, the partitioning of the data graph, and the internal complexity and field proliferation of display services. Representatives of this practice include Facebook, Airbnb, eBay, iQiyi, Ctrip, Qunar and so on.

3.2 Front-end BFF mode

The front-end BFF mode is specifically introduced in the "And Autonomy" section of Sam Newman's article. It means that the BFF is owned by the front-end team itself, as shown in the following diagram:

Figure 5 Front-end BFF mode


The idea behind this model is that requirements which could be delivered by one team should not be split across two teams, because two teams bring greater communication and collaboration costs; in essence it turns a cross-team conflict into an intra-team one. The front-end fully takes over BFF development and becomes self-sufficient in data querying, greatly reducing front-end/back-end collaboration costs. However, this model does not address some of the core issues we care about, such as how to deal with complexity, how to deal with differences, and how to design the display model. It also has preconditions and drawbacks: it requires a relatively complete front-end infrastructure, and the front-end must understand business logic in addition to rendering.

Summary: In the front-end BFF model, the front-end queries and uses data independently, reducing the cost of cross-team collaboration and improving BFF R&D efficiency. The current representative of this practice is Alibaba.

4 Design of Information Aggregation Architecture Based on GraphQL and Metadata

4.1 Overall idea

After analyzing the back-end BFF and front-end BFF modes, we finally chose the back-end BFF model. The front-end BFF scheme would have a large impact on our current R&D model: it not only requires substantial front-end resources but also requires building a complete front-end infrastructure, so its implementation cost is relatively high.

Although the back-end GraphQL BFF mode has the problems mentioned above, it is of great reference value overall, for example the idea of reusing fields and the idea of querying data on demand. In the commodity display scenario, 80% of the work is concentrated on aggregating and integrating data, and this part has strong reuse value, so querying and aggregating information is the main contradiction we face. Our idea is therefore to improve on the GraphQL + back-end BFF scheme so that data-fetching logic and display logic can be accumulated, composed, and reused. The overall architecture is shown in the following diagram:

Figure 6 Improvement ideas based on GraphQL BFF


As can be seen from the figure, the biggest difference from the traditional GraphQL BFF solution is that we push GraphQL down to the data-aggregation layer. Since the data comes from the commodity domain and that domain is relatively stable, the scale of the data graph is controllable and relatively stable. In addition, the core design of the overall architecture includes three aspects: 1) separation of data fetching and display; 2) normalization of the query model; 3) a metadata-driven architecture.

Through the separation of fetching and display we solve the display service granularity problem and make display logic and fetching logic accumulable and reusable; through the normalized design of the query model we solve the display-field proliferation problem; and through the metadata-driven architecture we make capabilities visible and make the orchestration and execution of business components automatic, so that business developers can focus on the business logic itself. These three parts are introduced one by one below.

4.2 Core Design

4.2.1 Separation of fetching and displaying

As mentioned above, in the commodity display scenario display logic and fetching logic are in a many-to-many relationship, and the traditional GraphQL-based back-end BFF practice encapsulates them together, which is the root cause of the difficulty in sizing display services. Consider what each is concerned with: fetching logic focuses on how to query and aggregate data, while display logic focuses on how to process data and produce the required display fields. Their concerns are different, and putting them together increases the complexity of the display service. Therefore, our idea is to separate fetching logic from display logic and encapsulate each into its own logical unit, called the fetching unit and the display unit respectively. After fetching and display are separated, GraphQL also sinks down to realize on-demand aggregation of data, as shown in the following figure:

Figure 7 Fetching display separation + metadata description


So what is the encapsulation granularity of fetching and display logic? It cannot be too small or too large. In the design of granularity we have two core considerations: 1) Reuse — display logic and fetching logic are assets that can be reused across commodity display scenes, and we want them to accumulate and be usable individually on demand; 2) Simplicity — keeping each unit simple so that it is easy to modify and maintain. Based on these two considerations, granularity is defined as follows:

  • Fetching unit: encapsulates, as far as possible, only one external data source, and is responsible for simplifying the model returned by that data source. The model it produces is called the fetching model.

  • Display unit: encapsulates, as far as possible, the processing logic of only one display field.

The advantage of separation is simplicity and composability; so how is composition achieved? Our idea is to describe the relationship between the units through metadata, and to have a unified execution framework associate and run them based on that metadata (the specific design is introduced below). Through the separation of fetching and display, association via metadata, and composed invocation at runtime, each logic unit stays simple while still being reusable, which also solves the display service granularity problem in the traditional solution.
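To make the separation concrete, here is a minimal Java sketch under assumed names: the FetchUnit and DisplayUnit interfaces, the @DependsOn annotation, and the model, context, and client classes are illustrative, not the framework's actual API.

import java.util.concurrent.CompletableFuture;

// Illustrative interfaces only: one fetching unit per external data source,
// one display unit per display field.
interface FetchUnit<P, M> {
    CompletableFuture<M> fetch(P param);
}

interface DisplayUnit<T> {
    T render(DisplayContext ctx);
}

// A fetching unit wraps a single downstream call and simplifies its response
// into a fetching model (DealBaseModel and DealRpcClient are assumed).
class DealBaseFetchUnit implements FetchUnit<Long, DealBaseModel> {
    private final DealRpcClient dealRpcClient;   // assumed async client for the deal service
    DealBaseFetchUnit(DealRpcClient client) { this.dealRpcClient = client; }
    public CompletableFuture<DealBaseModel> fetch(Long dealId) {
        return dealRpcClient.queryBase(dealId).thenApply(DealBaseModel::fromRpcDto);
    }
}

// A display unit produces exactly one display field and declares its data
// dependencies through metadata (@DependsOn is an assumed annotation).
@DependsOn(fetchModels = {DealBaseModel.class, CategoryModel.class})
class TitleWithCategoryDisplayUnit implements DisplayUnit<String> {
    public String render(DisplayContext ctx) {
        DealBaseModel deal = ctx.get(DealBaseModel.class);
        CategoryModel category = ctx.get(CategoryModel.class);
        return "[" + category.getName() + "]" + deal.getTitle();
    }
}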

4.2.2 Query Model Normalization

Through what interface are the processing results of the display units exposed? Next we discuss the design of the query interface.

1) Difficulties in query interface design

There are two design patterns for common query interfaces:

  • Strongly typed mode: the query interface returns a POJO, and each query result corresponds to a specific field with a specific business meaning in the POJO.

  • Weakly typed mode: query results are returned in a KV or JSON structure, with no explicit static fields.

Both modes are widely used in the industry, and their advantages and disadvantages are clear. The strongly typed mode is developer-friendly, but the business iterates continuously and the display units accumulated by the system keep growing; the DTO returned by the interface therefore gains more and more fields, and every new feature requires modifying the interface query model and upgrading the JAR version. A JAR upgrade involves both the data provider and the data consumer, which is an obvious efficiency problem. Moreover, a continuously iterated query model will eventually contain hundreds or thousands of fields and become hard to maintain.

The weakly typed mode makes up for exactly this shortcoming, but it is very unfriendly to developers: which query results are in the interface's query model is completely opaque during development, while programmers naturally prefer to understand logic through code rather than configuration and documentation. In fact, these two interface design patterns share a common problem — a lack of abstraction. The next two subsections introduce the abstraction ideas behind the query model returned by the interface and the framework capabilities that support it.

2) Query model normalization design

Back in the product display scenario, a display field can have many different implementations; for example, the product title has two: 1) product title; 2) [category] + product title. The relationship between the product title and these two pieces of display logic is essentially an abstract-concrete relationship. Once this is identified, the idea becomes clear: we abstract the query model. The query model consists entirely of abstract display fields, and one display field corresponds to multiple display units, as shown in the following figure:

Figure 8 Query model normalization + metadata description


At the implementation level, the relationship between display fields and display units is also described with metadata. This design slows model proliferation to some extent but cannot prevent it entirely. For example, besides the standard attributes every product has (price, inventory, sales, and so on), different product types usually have attributes unique to them; an escape-room themed product, for instance, has a descriptive attribute like "party size". Our solution is to introduce extended attributes that carry such non-standard fields. Building the query model with standard fields plus extended attributes better solves the field-proliferation problem.
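As a small illustration of what a normalized query model might look like (the class and field names here are assumed, not the real model):

import java.util.Map;

// A normalized query model: abstract standard display fields shared by all products,
// plus an extended-attribute map for non-standard, type-specific fields.
class ProductDisplayModel {
    private String title;       // which display unit fills this abstract field is
    private String price;       //   decided by metadata and the selected query plan
    private Integer stock;
    private Integer sales;
    // non-standard fields, e.g. "partySize" for an escape-room themed product
    private Map<String, Object> extendedAttributes;
    // getters and setters omitted
}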

4.2.3 Metadata Driven Architecture

So far we have defined how to decompose business logic units and how to design the query model, and we have mentioned using metadata to describe the relationships between them. Business logic and models implemented according to these definitions have strong reuse value and can accumulate as business assets. So why describe the relationships between business capabilities and models with metadata?

We introduce metadata descriptions for two main purposes: 1) automatic orchestration of code logic — relationships between pieces of business logic are described by metadata, and at runtime the framework wires the logic together automatically based on that metadata, eliminating a large amount of glue code; 2) visualization of business capabilities — the metadata itself describes what the business logic provides, as in the following two examples:

  • The string display of a group deal's base price, for example: 30 yuan.

  • The display field for a group deal's market price, for example: 100 yuan.

Such metadata is reported to the system and can be used to display the capabilities the system currently provides. Components and their associations are described by metadata, and the framework analyzes the metadata to automatically invoke and execute business components, forming the following metadata-driven architecture:

Figure 9 Metadata-driven architecture


The overall architecture consists of three core parts:

  • Business capabilities: Standard business logic units, including fetching units, display units, and query models, are key reusable assets.

  • Metadata: Describes business functions (such as display units, fetching units) and the relationship between business functions, such as the data that the display unit depends on, and the display fields mapped by the display unit.

  • Execution engine: Responsible for consuming metadata, and scheduling and executing business logic based on metadata.

Through the organic combination of the above three parts, a metadata-driven style architecture is formed.
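The following sketch illustrates, under assumed names (MetadataDrivenEngine, DisplayUnitMeta, QueryPlan, and the unit interfaces from the earlier sketch), how such an execution engine might consume metadata to wire fetching units and display units together; it is an illustration of the idea, not the framework's actual implementation.

import java.util.*;
import java.util.concurrent.CompletableFuture;

class MetadataDrivenEngine {
    // Registered at startup by scanning unit metadata: display field -> its metadata.
    private final Map<String, DisplayUnitMeta> metaByDisplayField = new HashMap<>();

    CompletableFuture<Map<String, Object>> execute(QueryPlan plan, Long itemId) {
        // 1. Resolve which fetching units the requested display fields depend on.
        Set<FetchUnit<Long, ?>> fetchUnits = new HashSet<>();
        for (String field : plan.displayFields()) {
            fetchUnits.addAll(metaByDisplayField.get(field).dependentFetchUnits());
        }
        // 2. Run the required fetching units asynchronously and collect their models
        //    (in the real architecture, GraphQL sits here to aggregate on demand).
        DisplayContext ctx = new DisplayContext();
        CompletableFuture<?>[] fetches = fetchUnits.stream()
                .map(unit -> unit.fetch(itemId).thenAccept(ctx::put))
                .toArray(CompletableFuture[]::new);
        // 3. Once all data is ready, run each display unit to produce its field.
        return CompletableFuture.allOf(fetches).thenApply(ignored -> {
            Map<String, Object> result = new HashMap<>();
            for (String field : plan.displayFields()) {
                result.put(field, metaByDisplayField.get(field).displayUnit().render(ctx));
            }
            return result;
        });
    }
}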

5 Optimization practices for GraphQL

5.1 Simplifying usage

1) Problems of using GraphQL directly

Introducing GraphQL brings some additional complexity, for example its own concepts such as Schema and RuntimeWiring. The following is the development process based on the native GraphQL-Java framework:

Figure 10 Native GraphQL usage process


For engineers who have never used GraphQL, these concepts raise the cost of learning and understanding, and they are usually unrelated to the business domain. We only want GraphQL's on-demand query capability, yet we are dragged down by GraphQL itself. Business developers should be able to focus on the business logic. How do we solve this?

As the famous computer scientist David Wheeler said, "All problems in computer science can be solved by another level of indirection." There is no problem that adding a layer cannot solve — in essence, someone has to take responsibility for that layer. Therefore, we add an execution-engine layer on top of native GraphQL to shield its complexity, so that developers only need to focus on business logic.

2) Standardization of the fetch interface

First, data access needs to be simplified. The native DataFetcher and DataLoader are fairly generic abstractions and lack business semantics. In query scenarios, we concluded that all queries fall into one of three patterns:

  • 1-to-1: query one result by one condition.

  • 1-to-N: query multiple results by one condition.

  • N-to-N: the batch version of 1-to-1 or 1-to-N.

On this basis, we standardized the fetch interfaces. Business developers judge which pattern a scenario belongs to and choose the corresponding interface as needed. The standardized design of the fetch interfaces is as follows:

Figure 11 Standardization of query interface


Business developers pick the fetcher type they need and specify the result type through generics. The 1-to-1 and 1-to-N cases are relatively simple. For N-to-N, we define a batch query interface to handle "N+1" scenarios, in which the batchSize field specifies the shard size and batchKey specifies the query key; business development only needs to supply the parameters, and the framework handles the rest automatically. In addition, we require the returned result to be a CompletableFuture, to satisfy full-link asynchrony for aggregate queries.
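A minimal sketch of what the three standardized fetcher shapes could look like; the interface and method names are illustrative, not the framework's real API:

import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// 1-to-1: one condition, one result; results are always CompletableFuture for full-link asynchrony.
interface OneToOneFetcher<P, R> {
    CompletableFuture<R> fetch(P param);
}

// 1-to-N: one condition, many results.
interface OneToManyFetcher<P, R> {
    CompletableFuture<List<R>> fetch(P param);
}

// N-to-N: the batch version used for "N+1" scenarios; batchSize controls sharding,
// and the batch keys are the query keys described above.
interface BatchFetcher<K, R> {
    int batchSize();
    CompletableFuture<Map<K, R>> batchFetch(List<K> batchKeys);
}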

3) Aggregation Orchestration Automation

The standardization of the data-access interfaces makes the semantics of data sources clearer and lets developers pick what they need, which simplifies business development. But at this point, after business developers finish writing a Fetcher, they still need to go elsewhere to write the Schema, and after that, to write the mapping between the Schema and the Fetcher. Business developers would rather keep writing code than jump to another place to maintain configuration after the code is written; maintaining the code and its configuration in two places also increases the chance of errors. Can these tedious steps be removed?

In essence, the Schema and RuntimeWiring are just descriptions of information. If that information can be expressed another way, these steps can be removed. Our optimization is therefore: add annotations during business development, describe this information through the annotations' metadata, and let the framework do the rest. The solution is illustrated below:

Figure 12 Annotation metadata describing Schema and RuntimeWiring

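As an illustration of the idea (the annotation and its attributes are hypothetical, and the fetcher shape reuses the sketch above), the metadata a developer writes might look like this, with the framework generating the Schema and RuntimeWiring from it at startup:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.concurrent.CompletableFuture;

// Hypothetical annotation: carries the information otherwise written by hand
// into the Schema and RuntimeWiring.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface GraphQLField {
    String parent();   // the GraphQL type the fetcher attaches to, e.g. "Deal"
    String name();     // the field name exposed in the generated Schema
}

@GraphQLField(parent = "Deal", name = "saleTag")
class DealSaleTagFetcher implements OneToOneFetcher<Long, String> {
    public CompletableFuture<String> fetch(Long dealId) {
        // downstream call omitted; the framework binds this fetcher to Deal.saleTag
        return CompletableFuture.completedFuture("hot-sale");
    }
}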

5.2 Performance optimization

5.2.1 GraphQL performance issues

Although GraphQL is open source, Facebook open-sourced only the specification and did not provide an implementation; the GraphQL-Java framework is contributed by the community. Using the open-source GraphQL-Java solution as our on-demand query engine, we found a number of problems in applying GraphQL — some caused by using it incorrectly, others by the implementation of GraphQL-Java itself. Typical problems we encountered include:

  • CPU-consuming query parsing, including Schema parsing and Query parsing.

  • When the query model is complex, especially when there is a large list, there is a delay problem.

  • Reflection-based model conversion CPU consumption issue.

  • The DataLoader layer-by-layer scheduling problem.

Therefore, we made some optimizations and modifications to both our usage and the framework to solve the problems listed above. This chapter focuses on our optimization ideas for GraphQL-Java.

5.2.2 GraphQL compilation optimization

1) An overview of the principles of the GraphQL language

GraphQL is a query language designed for building client applications, with an intuitive and flexible syntax for describing data requirements and interactions. GraphQL is a domain-specific language (DSL), and the GraphQL-Java framework we use implements its compilation layer on top of ANTLR 4, a Java-based language definition and recognition tool (a meta-language for defining languages). Their relationship is as follows:

Figure 13 Schematic diagram of the basic principle of the GraphQL language


The GraphQL execution engine accepts a Schema and a Query expressed in the language defined by GraphQL. The execution engine cannot understand GraphQL text directly; the GraphQL compiler must first translate it into document objects the engine can understand before execution. The compiler is implemented in Java, and experience shows that under high traffic, interpreting in real time makes this code a CPU hotspot and adds to response latency; the more complex the Schema or Query, the more obvious the performance loss.

2) Schema and Query compilation cache

The Schema expresses the data view and is isomorphic to the fetching model; it is relatively stable and there are few of them — in our business scenario there is only one per service. Therefore, our approach is to build the Schema-based GraphQL execution engine at startup and cache it as a singleton. Queries differ from scenario to scenario, so the Query parse result cannot be a singleton; our approach is to implement the PreparsedDocumentProvider interface and cache the compilation result keyed by the Query. As shown below:

Figure 14 Schematic diagram of Query cache implementation

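A minimal sketch of the Query-keyed compilation cache, here using Caffeine; the PreparsedDocumentProvider signature varies slightly across GraphQL-Java versions, so treat the lambda form as illustrative:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import graphql.ExecutionInput;
import graphql.GraphQL;
import graphql.execution.preparsed.PreparsedDocumentEntry;
import graphql.execution.preparsed.PreparsedDocumentProvider;
import graphql.schema.GraphQLSchema;
import java.util.function.Function;

class GraphQLFactory {
    static GraphQL build(GraphQLSchema schema) {
        // Parsed and validated documents are cached, keyed by the Query string.
        Cache<String, PreparsedDocumentEntry> cache =
                Caffeine.newBuilder().maximumSize(10_000).build();
        PreparsedDocumentProvider provider =
                (ExecutionInput input, Function<ExecutionInput, PreparsedDocumentEntry> compute) ->
                        cache.get(input.getQuery(), key -> compute.apply(input));
        // The Schema-based engine itself is built once at startup and held as a singleton.
        return GraphQL.newGraphQL(schema)
                      .preparsedDocumentProvider(provider)
                      .build();
    }
}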

5.2.3 GraphQL execution engine optimization

1) GraphQL execution mechanism and problems

Let's first look at the operating mechanism of the GraphQL-Java execution engine. Assuming we choose AsyncExecutionStrategy as the execution strategy, the execution process of the GraphQL execution engine looks like this:

Figure 15 GraphQL execution engine execution process


The sequence diagram above has been simplified and irrelevant information removed. AsyncExecutionStrategy's execute method implements the asynchronous execution strategy for objects; it is the starting point of query execution and the entry point for querying the root node, and AsyncExecutionStrategy queries the multiple fields of an object using a loop plus asynchronous calls. Starting from AsyncExecutionStrategy's execute method, the GraphQL query process can be understood as follows:

  1. Call the get method of the DataFetcher bound to the current field. If the field has no DataFetcher bound, the default PropertyDataFetcher is used to fetch the field; PropertyDataFetcher reads the field from the source object via reflection.

  2. Wrap the result of the DataFetcher query into a CompletableFuture; if the result is already a CompletableFuture, it is not wrapped.

  3. When the result's CompletableFuture completes, completeValue is called and the result is processed according to its type:

    • If the result is a list, the list is traversed and completeValue is executed recursively for each element.

    • If the result is an object, execute is called on that object, returning to the starting point, i.e. AsyncExecutionStrategy's execute.

That is the execution process of GraphQL. What problems does it have? Based on the numbered labels in the figure, let's look at the problems we encountered when applying GraphQL in our business scenarios. These are not necessarily problems in other scenarios and are offered for reference only:

Problem 1: PropertyDataFetcher CPU hotspot. PropertyDataFetcher is on the hot path of the entire query process and its implementation leaves some room for optimization, so at runtime it becomes a CPU hotspot. (For details, see the related commit and discussion on GitHub: https://github.com/graphql-java/graphql-java/pull/1815)

Figure 16 PropertyDataFetcher becomes a CPU hotspot


Problem 2: List computation is time-consuming. List elements are processed in a loop, so when a query result contains a large list the loop adds significant latency to the overall query. As a concrete example, suppose a query result contains a list of 1,000 elements and processing each element takes 0.01 ms; the total is 10 ms, and under GraphQL's query mechanism this 10 ms blocks the entire link.

2) Type conversion optimization

The GraphQL model returned by the GraphQL query engine is isomorphic to the fetching model returned by the business DataFetcher, but the types of all fields are converted into GraphQL's internal types. The reason PropertyDataFetcher becomes a CPU hotspot lies in this model conversion process. The conversion from the business-defined model to the GraphQL type model is illustrated below:

Figure 17 Schematic diagram of business model to GraphQL model conversion


When the query result model has many fields, say tens of thousands, each query performs tens of thousands of PropertyDataFetcher operations, which shows up as the CPU hotspot. Our solution is to keep the original business model unchanged and back-fill the results of the non-PropertyDataFetcher queries onto the business model, as shown in the following diagram:

Figure 18 Schematic diagram of reverse filling of the query result model


Based on this idea, the result we get from the GraphQL execution engine is the object model returned by the business Fetcher, which not only removes the CPU hotspot caused by reflection-based field conversion but is also friendlier for business development, because the GraphQL model resembles a JSON model, lacks business types, and is cumbersome to use directly. We piloted this optimization in one scenario: average response time dropped by 1.457 ms, TP99 dropped by 5.82 ms on average, and average CPU utilization fell by about 12%.

3) List calculation optimization

When a list has many elements, the latency of the default single-threaded traversal is very noticeable, and optimizing it is necessary for scenarios sensitive to response time. Our solution is to make full use of the CPU's multi-core computing capability: split the list into tasks and execute them in parallel on multiple threads. The implementation mechanism is as follows:

Figure 19 List traversal multi-core computing ideas

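A simplified sketch of the idea (not the actual engine modification): split the element list into chunks, process the chunks on a thread pool, and join the futures at the end.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class ParallelListCompletion {
    private static final ExecutorService POOL =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Replace the single-threaded per-element loop with parallel chunked processing.
    static <T, R> CompletableFuture<List<R>> completeList(List<T> elements,
                                                          Function<T, R> completeValue,
                                                          int chunkSize) {
        List<CompletableFuture<List<R>>> chunkFutures = new ArrayList<>();
        for (int i = 0; i < elements.size(); i += chunkSize) {
            List<T> chunk = elements.subList(i, Math.min(i + chunkSize, elements.size()));
            chunkFutures.add(CompletableFuture.supplyAsync(
                    () -> chunk.stream().map(completeValue).collect(Collectors.toList()), POOL));
        }
        return CompletableFuture.allOf(chunkFutures.toArray(new CompletableFuture[0]))
                .thenApply(done -> chunkFutures.stream()
                        .flatMap(f -> f.join().stream())
                        .collect(Collectors.toList()));
    }
}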

5.2.4 GraphQL-DataLoader scheduling optimization

1) Basic Principles of DataLoader

Let's briefly introduce the basic principle of DataLoader. DataLoader has two methods, load and dispatch. In scenarios that solve the N+1 problem, DataLoader is used like this:

Figure 20 Basic principle of DataLoader


The process has two stages: in the first stage, load is called N times; in the second stage, dispatch is called, which actually executes the data query, achieving the effect of batched, sharded queries.
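The two stages can be seen in a small standalone example with the java-dataloader library that GraphQL-Java integrates (DataLoaderFactory.newDataLoader is the 3.x factory; older versions expose DataLoader.newDataLoader instead):

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import org.dataloader.BatchLoader;
import org.dataloader.DataLoader;
import org.dataloader.DataLoaderFactory;

class DataLoaderDemo {
    public static void main(String[] args) {
        // The batch function receives all queued keys and issues ONE downstream query.
        BatchLoader<Long, String> titleBatchLoader = keys ->
                CompletableFuture.supplyAsync(() ->
                        keys.stream().map(id -> "title-" + id).collect(Collectors.toList()));

        DataLoader<Long, String> loader = DataLoaderFactory.newDataLoader(titleBatchLoader);

        // Stage 1: load is called N times; keys are only queued, nothing is queried yet.
        CompletableFuture<String> t1 = loader.load(1L);
        CompletableFuture<String> t2 = loader.load(2L);
        CompletableFuture<String> t3 = loader.load(3L);

        // Stage 2: dispatch fires the batched query and completes the pending futures.
        loader.dispatch();
        System.out.println(List.of(t1.join(), t2.join(), t3.join()));
    }
}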

2) DataLoader scheduling problem

GraphQL-Java's built-in support for DataLoader is implemented by FieldLevelTrackingApproach. So what is wrong with FieldLevelTrackingApproach? The following diagram illustrates the problem caused by the native DataLoader scheduling mechanism:

Figure 21 GraphQL-Java's problem with DataLoader scheduling


The problem is obvious: with the FieldLevelTrackingApproach implementation, the DataLoader dispatch for the next level is issued only after the results of the current level have all returned. Under this implementation, the total query time is TOTAL = MAX(level-1 latency) + MAX(level-2 latency) + MAX(level-3 latency) + …, i.e. the sum of the maximum latency of each level. In fact, if business developers orchestrated the calls by hand, the theoretical result would be that the total time equals the latency of the longest call chain among all chains, which is the reasonable expectation — but that is not what the FieldLevelTrackingApproach implementation delivers.

For some business scenarios the behavior above is unacceptable; for example, our list scenario has a total response-time budget of under 100 ms, of which tens of milliseconds are spent for this reason. One way to solve it is to orchestrate scenarios with especially strict response-time requirements independently, without GraphQL; another is to solve the problem at the GraphQL level and keep the architecture unified. Next, we describe how we extended the GraphQL-Java execution engine to solve this problem.

3) DataLoader scheduling optimization

For the DataLoader scheduling performance problem, our solution is to call dispatch immediately after the last load call on a given DataLoader, so the query request is issued right away. The question is: how do we know which load is the last one? This is the crux of the DataLoader scheduling problem. The following example explains our solution:

Figure 22 Schematic diagram of query object results


Suppose the structure of the model we query is as follows: under the root node Query there is a field named subjects, which is a list containing two elements, both object instances of ModelA; ModelA has two fields, fieldA and fieldB; the fieldA of subjects[0] is associated with an instance of ModelB, and the fieldB of subjects[0] is associated with multiple instances of ModelC.

To make this easier to follow, we define some concepts: field, field instance, field instance execution completed, field instance value size, and so on:

  • Field: has a unique path, is static, and is unrelated to the size of runtime objects, e.g. subjects and subjects/fieldA.

  • Field instance: an instance of a field, with a unique path; it is dynamic and related to the size of runtime objects, e.g. subjects[0]/fieldA and subjects[1]/fieldA are both instances of the field subjects/fieldA.

  • Field instance execution completed: all object instances associated with a field instance have been executed by GraphQL.

  • Field instance value size: the number of object instances associated with a field instance. In the example above, the value size of subjects[0]/fieldA is 1, and the value size of subjects[0]/fieldB is 3.

In addition to the above definitions, our business scenario also satisfies the following conditions:

  • There is only 1 root node, and the root node is a list.

  • A DataLoader must belong to a field, and the number of times the DataLoader is executed under a field equals the number of object instances under that field.

Based on the above information, we can make the following analysis:

  • When executing a field instance, we know the current field instance's value size, which equals the number of times the field's associated DataLoader needs to execute load under the current instance. Therefore, after each load, we can tell whether the current object instance is the last object of the field instance it belongs to.

  • Instances of an object may hang under different field instances, so the current object instance being the last one of its field instance does not mean it is the last of all object instances; that holds if and only if the field instance it belongs to is also the last instance of its field.

  • We can calculate the number of instances of a field from the value size of its parent field instances. For example, if we know that the size of subjects is 2, then the field subjects has two field instances, subjects[0] and subjects[1], and the field subjects/fieldA likewise has two instances, subjects[0]/fieldA and subjects[1]/fieldA. We can therefore infer, from the root node downward, whether a field instance has finished executing.

From the above analysis we can conclude: an object's dispatch condition is met when the field instance it belongs to, and all ancestor field instances of that field, have finished executing, and the currently executed object instance is the last object instance of its field instance. Based on this judgment logic, our implementation checks after each DataFetcher call whether a dispatch needs to be issued, and issues it if so. In addition, the timing and condition above can miss a case: when the current object instance is not the last one but the remaining object size is 0, the dispatch of the DataLoader associated with the current object would never be triggered, so an extra check is needed when the object size is 0.

Following this logical analysis, we optimized the DataLoader call chain and achieved the theoretically optimal effect.
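A greatly simplified sketch of the core trick for a single field instance, assuming the expected number of load calls is known from the field instance value size; the real implementation additionally tracks parent field instances and the zero-size corner case described above.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import org.dataloader.DataLoader;

// Fire dispatch right after the last expected load, instead of waiting for the
// whole level to finish as FieldLevelTrackingApproach does.
class EagerDispatcher<K, V> {
    private final DataLoader<K, V> loader;
    private final AtomicInteger remainingLoads;

    EagerDispatcher(DataLoader<K, V> loader, int expectedLoads) {
        this.loader = loader;
        this.remainingLoads = new AtomicInteger(expectedLoads);
    }

    CompletableFuture<V> load(K key) {
        CompletableFuture<V> future = loader.load(key);
        if (remainingLoads.decrementAndGet() == 0) {
            loader.dispatch();   // last expected load for this field instance
        }
        return future;
    }
}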

6 The impact of the new architecture on the R&D model

Productivity determines the relations of production. The metadata-driven information aggregation architecture is the core productivity for building display scenes, while the business development model and process are the relations of production, so they change accordingly. Next we introduce the impact of the new architecture on R&D from the perspectives of development mode and process.

6.1 Business-focused development model

The new architecture provides a set of standardized code decomposition constraints based on business abstractions. In the past, developers' understanding of the system was often "query some services and glue the data together"; now, developers share a consistent understanding of the business and of how to decompose the code: the display unit represents display logic and the fetching unit represents fetching logic. At the same time, a lot of tedious and error-prone logic is shielded by the framework, so developers can focus more on the business logic itself, such as understanding and encapsulating business data, understanding and writing display logic, and abstracting and building the query model. As shown in the following diagram:

Figure 23 Business development focuses on the business itself


6.2 R&D process upgrade

The new architecture affects not only how code is written but also the R&D process itself. Thanks to the visualization and configuration capabilities provided by the metadata architecture, the new R&D process differs significantly from the old one, as shown below:

Figure 24 Comparison of the R&D process before and after building a display scene based on the development framework


In the past, development was end-to-end: building each display scene required going through the entire process from interface alignment to API development. With the new architecture, the system gains multi-layer reuse plus visualization and configuration capabilities.

Scenario 1: This is the best case. The fetching capability and the display capability have both already been accumulated. All the developer needs to do is create a query plan, select the required display units on the operations platform, and then, holding the query plan ID, retrieve the required display information through the query interface. The visualization and configuration interface is shown below:

Figure 25 Visualization and copywriting as needed


Scenario 2: The display capability may not exist yet, but the data source has already been connected through the operations platform. This is not difficult either: the developer only needs to write a piece of processing logic on top of the existing data source, which is a pleasant exercise in pure logic. The data source list is shown below:

Figure 26 Data source list visualization


Scenario 3: The worst case is that the system cannot meet the current query requirements. This is relatively rare, because the back-end services are fairly stable, so there is no need to panic: connect the data source according to the standard specification, then write the processing logic fragments, and these capabilities can then be reused continuously.

7 Summary

The complexity of the commodity display scene shows itself in many scenarios, many dependencies, much logic, and differences between scenarios. In this context, in the early stage of a business, there is nothing wrong with "chimney-style", per-scenario construction. However, as the business keeps developing, functions keep iterating, and scenarios scale up, the drawbacks of "chimney-style" construction gradually appear, including high code complexity and a lack of capability accumulation.

Based on an analysis of the core contradictions faced by Meituan's in-store merchandise display scenarios, this article has introduced:

  • The different BFF application modes in the industry, and their respective advantages and disadvantages.

  • An improved, metadata-driven architecture design based on the GraphQL back-end BFF mode.

  • The problems and solutions we encountered in our practice with GraphQL.

  • The impact of the new architecture on the R&D model.

