Basic knowledge about Taobao’s homepage
Many people use Taobao but know little about its homepage. Here I share an explanation from a Taobao homepage designer, in the hope that it gives you a basic understanding of the Taobao homepage.
1. Relevant background introduction
The Taobao homepage is the facade of Taobao. It carries the entrances to almost all Taobao businesses, and its traffic is huge, measured in units of billions. In recent years, with the rise of mobile (wireless) clients, the focus of the business has shifted to mobile (it can hardly be called a shift any more; the business is basically all mobile), so traffic to Taobao's PC homepage has dropped. Even so, its average daily PV is still quite high.
Taobao's homepage has always been a testing ground for internal platforms and technologies, and it is always changing. The latest frameworks and systems are piloted on it first. Just imagine: if an upgrade or optimization that needs to be promoted has already launched on Taobao's homepage and shown good data and stability, why would other businesses not try it and make the change? At the same time, the team that worked on Taobao's front-end technical architecture last year naturally took the initiative to push some experimental content into the business.
There is too much business to build every page by hand. In fact, most pages are built on internal building platforms, where operations staff or front-end developers assemble pages from modules. The front end therefore focuses on building the platform itself, on keeping modules generic and reusable, and, of course, on some engineering work.
For pages built on a building platform, the front end only needs to develop the atomic modules that make up the page; the overall rendering is handled entirely by a unified script that the platform provides. On the Taobao homepage, considering the huge number of page modules and the cross-department and cross-team communication involved, the rendering model is slightly different.
2. The overall changes of Taobao’s homepage
As mentioned in the background, Taobao's homepage relies on the internal building platform, so its changes naturally follow the changes in the building system.
1. Taobao homepage under PHP
Soon after I took over Taobao's homepage, it went through its annual revision. At that time it was still running in a PHP environment. Note that all code on Taobao's homepage is fully controlled by the front end. The front end does not deal with the database directly; its data comes from two sources.
Data source
The first is data filled in by operations. The front end "digs pits", that is, reserves slots in the template, and operators fill data into them.
When operators fill in these slots, the data for the PHP template is generated, and the final output is a complete HTML fragment (rendered in real time).
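As a rough illustration of the slot idea, here is a minimal sketch in Python rather than PHP. The template syntax, the slot names `title` and `link`, and the `render_fragment` helper are all assumptions made for the example, not the platform's real format.

```python
from html import escape
from string import Template

# Hypothetical module template: $title and $link are the slots reserved by the front end.
BANNER_TEMPLATE = Template('<a class="banner" href="$link">$title</a>')


def render_fragment(template: Template, slot_data: dict) -> str:
    """Fill the reserved slots with operator-supplied data and return an HTML fragment."""
    # Escape every value so operator input cannot break out of the markup.
    safe = {key: escape(str(value), quote=True) for key, value in slot_data.items()}
    return template.substitute(safe)


fragment = render_fragment(
    BANNER_TEMPLATE,
    {"title": "Summer Sale", "link": "https://example.com/sale"},
)
```

The front end owns the template, operations own the data, and the rendered fragment is what reaches the page.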
In the old version of the building system, a submodule was constructed this way. My description is very simple, but a platform has to consider much more: data ordering, scheduled release, rollback, filtering, data synchronization, data updates, version control, permission control, references to other systems, and so on.
The second is data provided by a backend or a personalization platform. Different businesses have different demands. Some have their own backends and require data produced by their own business; some want each user to see different content ("a thousand faces for a thousand people") and want to plug in an algorithm; some deal directly with sellers and want to use merchant-recruitment data; and some expect data filtered from a data pool. In short, Taobao's homepage has to connect to many systems and has many interfaces. The integration of dynamic data sources is covered later.
The domain names of these systems all differ, so JSONP naturally became the first choice.
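For readers unfamiliar with JSONP, the sketch below shows the server side of the pattern: the JSON payload is wrapped in a call to a callback that the page names in its request. The function names `to_jsonp` and `from_jsonp` are illustrative only and do not come from any Taobao system.

```python
import json
import re

_CALLBACK = r"[A-Za-z_$][\w$.]*"


def to_jsonp(callback: str, payload: dict) -> str:
    """Wrap a JSON payload in a call to the callback named by the client."""
    # Accept only identifier-like names so the response cannot carry arbitrary script.
    if not re.fullmatch(_CALLBACK, callback):
        raise ValueError(f"unsafe callback name: {callback!r}")
    return f"{callback}({json.dumps(payload)});"


def from_jsonp(body: str) -> dict:
    """Extract the JSON payload from a JSONP response body."""
    match = re.fullmatch(rf"\s*{_CALLBACK}\((.*)\);?\s*", body, flags=re.S)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))
```

Because the response is loaded through a script tag rather than an XHR, it is not subject to the same-origin restriction, which is why the pattern suits a page that talks to many domains.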
Some special systems, such as advertising, are not rendered through a simple JSONP request. They may take part in the whole ad rendering process, for example by loading their own JS and taking over rendering control.
Page structure
The above covers where the data comes from and how a submodule is structured, so how is the whole page composed? There are two ways to build a page from modules. One is visual building: operations staff or front-end developers drag developed modules (or modules chosen from the module library) into a container to form a page.
Of course, this is only a simplified model. A real system has many more issues to consider, such as page layout, multi-terminal adaptation, temporarily hiding modules, position adjustment, skin selection, module duplication, and so on.
The other is source-code building: modules are introduced by module id, with markers such as lazyload added to control the rendering rhythm and data entry. The difference between the two is that source-code building makes it easier to control the structure of the modules and the order in which they render.
Dynamic data source
The homepage faces many interfaces and platforms and connects with dozens of business parties, so interfaces are a big problem. Because the back-end systems differ, there is basically no way to unify the format of the data sources. Whenever an operator wants to switch to a system that is more comfortable to use or has better data, the front end and back end probably have to communicate and integrate several more times.
The platform can connect to data sources directly. This means the slots we reserve can not only be filled in by operations, but can also import data straight from various data sources. Of course, this requires a field-mapping conversion step.
After binding, the data can be output synchronously or asynchronously; both are capabilities the platform provides. This solution basically solves the problem of back-end system and interface changes and reduces the communication cost between front end and back end.
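The field-mapping step can be pictured as a small rename table between what a backend returns and what a module expects. The sketch below is only an assumption about its shape: the field names, the dotted-path convention, and `map_fields` itself are hypothetical.

```python
def map_fields(record: dict, mapping: dict) -> dict:
    """Rename backend fields to the names a module expects.

    `mapping` maps module field name -> backend field path (dots descend into nested dicts).
    """
    def lookup(source: dict, path: str):
        value = source
        for key in path.split("."):
            value = value[key]
        return value

    return {field: lookup(record, path) for field, path in mapping.items()}


# Hypothetical backend record and mapping for one module.
backend_item = {
    "item_title": "Red shoes",
    "pic": {"url": "//img.example.com/1.jpg"},
    "jump": "//example.com/item/1",
}
slot_mapping = {"title": "item_title", "image": "pic.url", "link": "jump"}
slot_data = map_fields(backend_item, slot_mapping)
```

When operations switch to a different backend, only the mapping changes; the module template keeps reading `title`, `image`, and `link`.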
Note, however, that although the platform unifies the page's interfaces, it also means that every request from the page first flows through the platform and is then distributed to the various backends. This places very high demands on the platform's ability to withstand load.
2. The transition from PHP to Node
The daily request volume of Taobao's homepage cannot be carried by just ten or twenty servers. Supporting it requires a service cluster.
Each CDN node had PHP rendering capability. When a page was published, all of its modules and data were synchronized to every CDN node. That was the basic model. It looked fine, but after a period of operation and maintenance, many security and performance problems slowly emerged:
Performance issues. Each PHP page contains multiple submodules, and submodules may reference other submodules. PHP's include operation is expensive: each reference is a disk I/O. With thousands of PHP pages like Taobao's homepage running on one rendering node, you can imagine how low the efficiency was.
Push mechanism issues. File synchronization is a rather unpleasant mechanism. First, its timing cannot be controlled: a file may reach all nodes within a few seconds, or take more than a minute or two. The synchronization may also fail, and health checks for it are costly. When releases are frequent, many files need to be synchronized, which easily causes the queue to pile up and makes an already poor synchronization experience worse.
Real-time issues. Before a file is pushed, it may also pass through some upstream systems. The longer the publishing chain, the slower a change takes effect online; at its slowest, it takes about five minutes. For requirements that demand strong real-time behavior (such as flash sales), such a delay is completely unacceptable.
Of course, there were many other problems, such as rising operation and maintenance costs, increased security risks, and a shortage of senior PHP engineers. So the PHP rendering container was destined to be retired.
In the new model the service cluster is a Cache CDN. It only handles static files and has no PHP/Node rendering capability, so it is efficient, performs well, and withstands heavy load. The Cache cluster can also be expanded simply by paying for more capacity.
When a user visits, Nginx first goes to the Cache CDN. On a cache hit, the response is returned directly; on a miss, the request goes back to the origin server. The origin server is a Node service with module rendering capability, and it can do many things:
· Control the cache response headers: max-age and s-maxage set how long the page is cached on the client and on the Cache respectively. This cache time can be adjusted at any time as needed, for example lengthened during major promotions;
· Control the internal/external network environment and the A/B test status;
· Integrate front-end tool chains, such as detection, compression, filtering, etc.
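To make the first point concrete, the following sketch builds a `Cache-Control` value in which `max-age` governs the browser and `s-maxage` governs shared caches such as a CDN. The helper name and the lifetimes are hypothetical; the article does not state the real values.

```python
def cache_control(browser_seconds: int, cdn_seconds: int) -> str:
    """Build a Cache-Control value: max-age is for the client, s-maxage for shared caches."""
    if browser_seconds < 0 or cdn_seconds < 0:
        raise ValueError("cache lifetimes must be non-negative")
    return f"public, max-age={browser_seconds}, s-maxage={cdn_seconds}"


# Hypothetical lifetimes: short in normal operation, longer during a major promotion.
NORMAL = cache_control(browser_seconds=60, cdn_seconds=120)
PROMOTION = cache_control(browser_seconds=300, cdn_seconds=600)
```

Since the origin emits this header on every render, changing the cache time is a configuration change on the origin rather than a redeploy of the cache layer.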
It has many other advantages that are not listed here. This model also adds a layer of disaster recovery: the origin server pushes data at regular intervals to a backup server in the same machine room as the Cache. If the origin server fails, traffic automatically fails over to the backup data.
The change of model was not only a breakthrough in operation and maintenance; it also reduced the security risk when the CDN is attacked, and it removed the various detection mechanisms that sync required, saving millions in annual costs. As the above shows, the advantages are quite obvious.
3. Node: a different model
The PHP model above only covered the HTML and data parts. Attentive readers will have noticed that static resources such as CSS and JS were not mentioned. How do they fit into page rendering?
In the old PHP page, we directly included one CSS file and one JS file. Taobao releases static resources through git-based versioned iterations, and these static resources were placed directly in a single git repository. That is to say:
For every release, you publish the git repository, update the version number in the PHP, and then publish the PHP code. Some optimizations were made, of course, such as automatically updating the version number when the git repository is released.
In the new model, a module's CSS/JS and template are kept together, and its CSS/JS are independent of the static resources of the other modules on the page. The aim is for a single module to be able to run entirely on its own, which makes it easier to reuse.
A module's slot definitions are also separated from its template, and the data format is defined with JSON Schema.
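A data format of this kind might look like the schema below. It is an invented example for a banner-like module, and the `validate` function handles only a tiny subset of JSON Schema (`type` for objects, arrays and strings, plus `required`, `properties`, and `items`), just enough to show how a schema constrains the data filled into a module.

```python
# Hypothetical schema describing the data a banner module accepts.
BANNER_SCHEMA = {
    "type": "object",
    "required": ["title", "items"],
    "properties": {
        "title": {"type": "string"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["link", "image"],
                "properties": {"link": {"type": "string"}, "image": {"type": "string"}},
            },
        },
    },
}

# Only these three JSON Schema types are handled in this sketch.
_TYPES = {"object": dict, "array": list, "string": str}


def validate(data, schema: dict, path: str = "$") -> list:
    """Return a list of error strings; an empty list means the data matches."""
    expected = schema.get("type")
    if expected and not isinstance(data, _TYPES[expected]):
        return [f"{path}: expected {expected}"]
    errors = []
    if isinstance(data, dict):
        for key in schema.get("required", []):
            if key not in data:
                errors.append(f"{path}.{key}: missing")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in data:
                errors.extend(validate(data[key], sub_schema, f"{path}.{key}"))
    if isinstance(data, list) and "items" in schema:
        for index, item in enumerate(data):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors
```

Keeping the schema apart from the template lets the platform generate the operator's input form and check submissions without reading the template at all.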
Modules are independent and isolated from each other, so there is some redundancy, but the benefits of decoupling far outweigh it. In practice, each individual module is managed in a repository. Page rendering is relatively simple: the origin Node container merges every module's index.xtpl into a single page.xtpl. To reduce page requests, the CSS and JS are also each combined (combo) into one file.
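The merge step can be sketched as plain concatenation in module order. This is an assumption about the mechanism, not a description of the real container or combo service; the output names `page.css` and `page.js` and the sample modules are invented for the example.

```python
def build_page(modules: list) -> dict:
    """Merge per-module files into page-level files, keeping module order."""
    return {
        "page.xtpl": "\n".join(m["index.xtpl"] for m in modules),
        "page.css": "\n".join(m["index.css"] for m in modules),
        # Join scripts with ';' so a module missing a trailing semicolon cannot break the next one.
        "page.js": ";\n".join(m["index.js"] for m in modules),
    }


# Two hypothetical modules, each carrying its own template, CSS, and JS.
modules = [
    {"index.xtpl": "<div class='nav'></div>", "index.css": ".nav{}", "index.js": "initNav()"},
    {"index.xtpl": "<div class='feed'></div>", "index.css": ".feed{}", "index.js": "initFeed()"},
]
page = build_page(modules)
```

Each module stays runnable on its own, and the page-level files are a build product that can be regenerated whenever a module is upgraded.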
The page is aware of any module update. The next time you enter the system, you are prompted to choose whether to upgrade the modules and the page.
3. Performance optimization of Taobao homepage
The homepage has many modules. If they were all output in one go, the number of DOM nodes would certainly exceed 4k, and the first-screen time would be extremely long. Under the TMS development specification, each TMS module contains an index.js and an index.css, which are finally delivered as two combo files, one for JS and one for CSS. When the homepage loads, the index.js files are not all executed at once; otherwise the page would block severely right at the start.
Page rendering logic
· Traverse all TMS modules (each carries a J_Module hook);
· Some TMS modules have no JS content but still load an index.js; add the tb-pass class to such modules so that their JS is skipped;
· Divide the page into two blocks, first screen and non-first screen, and add the first-screen modules to lazy-load monitoring first;
· After the first-screen modules have loaded, or once the user interacts with the page (scrolling, mouse movement, etc.), add the non-first-screen modules to lazy-load monitoring;
· Handle some special modules, which start loading a few hundred pixels before they enter the viewport;
· Listen for scrolling and render modules according to the logic above;
· Even after a module's JS has executed, the module may not render, because its priority is low. Such modules add their own event listeners and wait, for example, until a mouseover or onload event fires before rendering their content.
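The scheduling part of this logic can be simulated outside the browser. The sketch below is a simplified model, not the homepage's real script: `Module`, its fields, and `modules_to_render` are hypothetical, positions are plain pixel numbers, and `has_js=False` stands in for the tb-pass class.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    top: int                # distance from the top of the page, in pixels
    has_js: bool = True     # False plays the role of the tb-pass class
    preload_px: int = 0     # special modules start loading this many pixels early
    rendered: bool = False


def modules_to_render(modules, scroll_top, viewport_height, first_screen_done):
    """Mark and return the modules whose JS should run at this scroll position."""
    visible_bottom = scroll_top + viewport_height
    due = []
    for module in modules:
        if module.rendered or not module.has_js:
            continue
        is_first_screen = module.top < viewport_height
        if not is_first_screen and not first_screen_done:
            continue  # wait until the first screen has loaded or the user has interacted
        if module.top - module.preload_px < visible_bottom:
            due.append(module)
    for module in due:
        module.rendered = True
    return due
```

Calling the function again on every scroll event returns only the modules that have newly come into range, so each module's JS runs at most once.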
Code performance optimization is delicate work. Optimizing a huge, unoptimized page may amount to a code refactoring. This section covered optimizations of details inside the page; standards and conventions in the development process, and optimization of each link in the online access path, have not been covered.
4. Stability guarantee of Taobao homepage
Under heavy traffic, any small problem is magnified into a big one, so every sporadic problem encountered during development deserves attention. However, many sporadic problems cannot be reproduced in our test environment: region-related problems (for example, a CDN node in Shanghai going down), user-attribute problems (for example, a blank area, or "skylight", on the page for users whose nickname ends with the letter s), browser plug-in problems, carrier ad injection, and so on.
It is hard to anticipate every issue before going online, but two things must be done well: disaster recovery, and monitoring with early warning.
1. Full disaster recovery mechanism
Full disaster recovery has to be considered at two levels:
· Asynchronous interface request errors, including malformed interface data, request timeouts, etc.;
· Synchronous rendering, that is, rendering errors on the origin page.
Asynchronous interface requests mainly involve back-end systems. Many systems are connected, and each has different stability and load tolerance, so several measures are combined to guard against failures.
The result of each data request is cached locally, and hard-coded fallback data (a "hard bottom") is provided for every interface. Another measure is to retry: if a request fails once, it is sent a second time.
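These measures combine naturally into one fallback chain: retry, then the last good response from the local cache, then the hard-coded data. The sketch below assumes that ordering; the function name, the `(data, source)` return shape, and the use of plain dicts for the cache and fallback data are choices made for the example.

```python
def fetch_with_fallback(name, request, cache, hard_fallback, retries=1):
    """Try the interface, retrying on failure; then use cached data, then hard-coded data.

    `request` is any zero-argument callable that returns data or raises an exception.
    Returns (data, source) so the caller can report which path was taken.
    """
    for _attempt in range(retries + 1):
        try:
            data = request()
        except Exception:
            continue  # failed or timed out: try again if attempts remain
        cache[name] = data  # remember the last good response locally
        return data, "live"
    if name in cache:
        return cache[name], "cache"
    return hard_fallback[name], "hard"
```

Returning the source alongside the data matters for monitoring: a module that silently renders from hard-coded data looks healthy to users while its interface is down.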
Synchronous rendering only needs the page template and the synchronous data. If either has an error, the origin reports an error, and the content returned by the back-to-origin request is an error page with a 5xx status code. Such an error is not necessarily caused by the developer; it may be a synchronization anomaly or a circuit break somewhere in the system chain.
Once the origin shows any abnormality, Nginx switches to the homepage mirror in the same machine room as the Cache CDN. The content of this mirror is a backup of the HTML source of Taobao's homepage.
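The request path described here (cache, then origin, then mirror) can be written down as a small decision function. In production this lives in the Nginx and CDN configuration, not in application code; the three callables below are hypothetical stand-ins used only to make the order of decisions explicit.

```python
def serve_homepage(cache_lookup, render_origin, read_mirror):
    """Return (status, html) following the order cache -> origin -> mirror.

    cache_lookup() returns cached HTML or None; render_origin() returns (status, html)
    or raises; read_mirror() returns the backup HTML kept next to the cache.
    """
    cached = cache_lookup()
    if cached is not None:
        return 200, cached
    try:
        status, html = render_origin()
    except Exception:
        status, html = 500, ""  # treat an unreachable origin like a 5xx response
    if 500 <= status <= 599:
        return 200, read_mirror()  # origin is unhealthy: serve the backup copy
    return status, html
```

The user receives a slightly stale homepage instead of an error page, which is the whole point of keeping the mirror in the same machine room as the cache.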
2. Monitoring and early warning mechanism
Monitoring also has two levels:
· Module-level monitoring: tracking points on interface requests, detection of blank ("skylight") modules, etc.;
· Page-level monitoring: add a special marker to the page, then periodically fetch the page from every CDN node and check whether the marker exists.
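The page-level check reduces to fetching the page from each node and looking for the marker. A minimal sketch follows, assuming the fetch is supplied by the caller; the node names, the marker string, and `find_broken_nodes` are made up for the example.

```python
def find_broken_nodes(nodes, fetch_html, marker):
    """Return the CDN nodes whose copy of the page lacks the marker or cannot be fetched."""
    broken = []
    for node in nodes:
        try:
            html = fetch_html(node)
        except Exception:
            broken.append(node)  # an unreachable node counts as broken
            continue
        if marker not in html:
            broken.append(node)
    return broken
```

Placing the marker near the end of the document means a truncated response also fails the check, not just an error page.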
Module-level monitoring covers quite a lot. The more monitoring points there are and the finer they are, the faster problems can be located. For example, on a slightly complex module I set these monitoring points:
· Interface data format error, request failure, and request timeout: at least three tracking points;
· A tracking point for failed hard-fallback data requests;
· A tracking point for a module that has still not rendered after 5 seconds;
· A check of the links and images in the module against blacklists and whitelists.
Some of the monitoring also handles clear-cut errors automatically. For example, if http images appear on an https page, the problem is handled automatically and immediately.
3. Automated testing before going online
This is part of Taobao's overall engineering environment: front-end automated testing. Generally, these checks are run before going online:
· Check whether the HTML complies with the specification
· Check the https upgrade status
· Check the validity of links
· Check the validity of static resources
· Check for JavaScript errors
· Check whether a pop-up box appears when the page loads
· Check whether the page calls console.*
· Record the page's JS memory usage
Of course, you can also add your own test cases, for example to check interface data formats or blank ("skylight") modules. The automated checks can also be scheduled as regular regression runs, which is relatively safe.
5. Agile measures for Taobao homepage
1. Health check
The page has many modules. To track changes at every small point on the page, I collect detailed statistics on every stage of requesting and rendering.
Once an interface request fails, an interface falls back to its disaster recovery logic, or a module takes more than 5 seconds to render, a yellow alert appears in the console. At the same time, the alert statistics are also sent to the server.
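The alert rules amount to a few checks over per-module statistics. The sketch below assumes a simple dict per module; the keys `request_failed`, `used_fallback`, and `render_seconds` are invented for the example, and only the 5-second limit comes from the text.

```python
def health_alerts(module_stats, render_limit_seconds=5.0):
    """Turn per-module statistics into alert messages.

    Each stat is a dict with hypothetical keys:
    name, request_failed, used_fallback, render_seconds.
    """
    alerts = []
    for stat in module_stats:
        name = stat["name"]
        if stat.get("request_failed"):
            alerts.append(f"{name}: interface request failed")
        if stat.get("used_fallback"):
            alerts.append(f"{name}: disaster recovery logic used")
        if stat.get("render_seconds", 0.0) > render_limit_seconds:
            alerts.append(f"{name}: render took {stat['render_seconds']:.1f}s")
    return alerts
```

The same list can feed both outputs mentioned above: print each message as a console warning and send the batch to the server.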
2. Interface Hub
Interface Hub is a management tool for data requests.
Rendering many of the page's modules requires more than one data source. When operations report that the data rendered on the page is abnormal, the data can be looked up directly through the Hub, which speeds up locating the bug. The Hub can also switch environments, redirecting an interface's requests to the daily or pre-release environment, which makes it a powerful debugging tool.
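A tool like this can be imagined as a registry that maps each interface name to one URL per environment, with per-interface overrides for debugging. The class below is a guess at the idea, not the real Interface Hub; every name and URL in it is hypothetical.

```python
class InterfaceHub:
    """Registry of interface URLs per environment, with per-interface overrides."""

    def __init__(self, default_env="production"):
        self.default_env = default_env
        self._urls = {}       # interface name -> {environment: url}
        self._overrides = {}  # interface name -> environment chosen for debugging

    def register(self, name, **urls_by_env):
        self._urls[name] = dict(urls_by_env)

    def switch(self, name, env):
        """Point one interface at another environment without touching the rest."""
        if env not in self._urls[name]:
            raise KeyError(f"{name} has no {env} endpoint")
        self._overrides[name] = env

    def resolve(self, name):
        env = self._overrides.get(name, self.default_env)
        return self._urls[name][env]


hub = InterfaceHub()
hub.register("feed", production="https://api.example.com/feed", daily="https://daily.example.com/feed")
hub.switch("feed", "daily")
```

Because modules ask the hub for their URL instead of hard-coding it, one interface can be redirected to a test environment while the rest of the page keeps using production data.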
3. Quick channel
I placed a quick-operation channel before and after the execution of the page script. When an urgent online problem occurs, such as broken or overflowing styles, or an interface error that leaves a blank area ("skylight"), the page's CSS and JS can be modified directly through this channel, and the fix is online within two minutes.
However, this kind of channel is only suitable for emergency repairs. After all, inserting JS code at will is very risky.