1. Modern IT systems are very complicated. Once a system is large enough, some part of it will always get out of control. No complex program in the world has ever been free of bugs; the only question is whether you have run into them yet. A bank's system is built from the products of many different hardware and software vendors and is far more complicated than an ordinary home computer. If even a simple home computer crashes ... Once a system reaches a certain level of complexity, the problem cannot be completely solved by adding more people and more money.
2. Keeping problems from happening costs money, a lot of money (for example, it costs a medium-sized bank hundreds of millions to build a decent disaster recovery system). But the problem is only "possible", while the money spent is real. If you were the executive in charge, you would not keep investing in it without limit either.
One of the best ways to keep a system stable is not to change it. But new business requirements mean the system really does have to be upgraded constantly, and every change is a challenge to stable operation.
Because of three words: large-scale centralization. In the early days, bank systems were not networked, so a problem affected only a single district or city. Over the past ten years the banking industry has been centralizing its data on a large scale: four of the five major banks, all except Bank of China, have completed large-scale centralization. ICBC was the first to complete it, in a project called 9991, which seems to have run from 1999 to 2002. Most banks, including ICBC, Agricultural Bank of China, China Construction Bank, Bank of Communications, China Development Bank, Agricultural Development Bank, Shanghai Pudong Development Bank, Huaxia Bank and Minsheng Bank, operate with two centers, one in Beijing and the other in Shanghai (Bank of Communications seems to have a center in Wuhan, and the People's Bank of China seems to have one in Wuxi). Bank of China centralized into five centers long ago, but it has not yet moved to a dual-center setup.
Centralization has many commercial benefits, but as far as the blast radius of a stability problem is concerned, it is a bit like putting all your eggs in one basket. Although many people and a lot of money are devoted to watching this basket, even the tightest watch has a gap somewhere, and with eggs packed this densely, chickens could hatch!
In the past there was no Weibo or WeChat, so unless you were one of the unlucky users, you would never know that something had gone wrong. Before online banking and Taobao, nobody bought anything in the middle of the night. Many years ago I was on a rollout project in a large province, and a major problem hit at 3 am. If it was not fixed before 8 o'clock, every bank branch in the province would be unable to open. At 6 o'clock the bank president was standing behind me, watching me work. At 7 o'clock it was finally fixed. If it happened today, the pressure would be even greater.
Because of four words: historical reasons. Banks began building their IT in the 1980s, and the traditional thinking was still to run a program on a single server (sometimes set up as a dual-machine hot standby). Most Internet companies built their IT in the 21st century, and most of them adopted the distributed approach: multiple computers run the program at the same time, so if one of them fails, the impact is not that great.
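To put rough numbers on that difference, here is a minimal sketch. It assumes identical machines with the load spread evenly across them, which is an assumption of this illustration and not a description of any real bank or Internet company; the function name `capacity_lost` is hypothetical.

```python
def capacity_lost(total_nodes: int, failed_nodes: int = 1) -> float:
    """Fraction of serving capacity lost when `failed_nodes` machines fail.

    Assumes identical machines with load spread evenly across them.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    return failed_nodes / total_nodes


# Single server: one failure takes out everything until a standby takes over.
print(capacity_lost(1))   # 1.0
# Ten machines sharing the load: one failure removes a tenth of the capacity.
print(capacity_lost(10))  # 0.1
```

The sketch ignores what a hot standby buys the single-server design: the standby limits how long the outage lasts, not how much of the service is affected while it lasts.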
The defining requirement for bank programs is stability, and changing the architecture is very risky (some programs still use technology from 20 years ago). So although the banks are slowly turning, they have not turned very far to this day. As an aside, this shows how hard reform is; hats off to Uncle Deng.
Banking IT is the most rigorous sector of China's IT industry. For example, some banks do not even allow vendor maintenance personnel to operate the systems; only bank employees may do so.
Every significant change must have a plan, even an operation that has been done hundreds of times, such as replacing a hard disk or changing an IP address. However, there is a considerable gap between the plan and reality. As mentioned above, the system is very complicated; if every possible problem were written down, the plan could have hundreds of branches. Moreover, system failures do not follow your emergency plan.
The most important functions of an emergency plan are to satisfy oversight from above, to set up in advance the emergency software and hardware environment the plan may call for, to sketch out a rough line of approach, and to train the team. When a truly complicated problem occurs, it is still the experts on the spot who solve it.
The most common and simplest overall indicators for measuring a continuously operating system are RTO (Recovery Time Objective) and RPO (Recovery Point Objective). In layman's terms, they are roughly how long the system is down and how much data is lost.
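As a minimal sketch of how the two indicators are checked against a single incident: the function names, the timestamps and the objective values below are all made up for illustration and do not come from any real bank.

```python
from datetime import datetime, timedelta


def measure_outage(last_good_copy: datetime,
                   failure: datetime,
                   service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (downtime, data_loss_window) for one incident."""
    if not last_good_copy <= failure <= service_restored:
        raise ValueError("expected last_good_copy <= failure <= service_restored")
    downtime = service_restored - failure        # compared against the RTO
    data_loss_window = failure - last_good_copy  # compared against the RPO
    return downtime, data_loss_window


def meets_objectives(downtime: timedelta, data_loss_window: timedelta,
                     rto: timedelta, rpo: timedelta) -> bool:
    """True only if both the time objective and the data objective are met."""
    return downtime <= rto and data_loss_window <= rpo


# Hypothetical incident: last replicated copy at 02:55, failure at 03:00,
# service back at 07:00.
downtime, loss = measure_outage(datetime(2024, 1, 1, 2, 55),
                                datetime(2024, 1, 1, 3, 0),
                                datetime(2024, 1, 1, 7, 0))
print(downtime)  # 4:00:00
print(loss)      # 0:05:00
# Against a hypothetical RTO of 2 hours and RPO of 15 minutes:
print(meets_objectives(downtime, loss,
                       rto=timedelta(hours=2), rpo=timedelta(minutes=15)))  # False
```

In this made-up case the data objective is met (5 minutes of exposure against a 15-minute RPO) but the time objective is not (4 hours down against a 2-hour RTO).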
Your money is safe in the bank. Generally speaking, problems stay at the level of an outage (the system cannot run for some period of time) and do not reach the level of data loss or data errors. Even if data is lost, accurate data can generally be retrieved from a backup center or a disaster recovery center. Bank systems also reconcile accounts every night to ensure the data is accurate.
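The idea behind a nightly reconciliation can be sketched as follows. This is only an illustration of the principle (opening balance plus the day's journal entries should equal the closing balance), not how any real core banking system does it; the function name `reconcile`, the data layout and the sample figures are all hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def reconcile(opening, closing, journal):
    """Return accounts whose closing balance does not match opening + journal.

    opening, closing: dict of account id -> Decimal balance
    journal: iterable of (account id, Decimal amount);
             deposits are positive, withdrawals negative
    Result: dict of account id -> (expected balance, actual balance)
    """
    # Net movement per account over the day.
    net = defaultdict(Decimal)
    for account, amount in journal:
        net[account] += amount

    mismatches = {}
    for account in set(opening) | set(closing) | set(net):
        expected = opening.get(account, Decimal("0")) + net[account]
        actual = closing.get(account, Decimal("0"))
        if expected != actual:
            mismatches[account] = (expected, actual)
    return mismatches


# Made-up data: account B's closing balance is 5.00 higher than the journal explains.
opening = {"A": Decimal("100.00"), "B": Decimal("50.00")}
journal = [("A", Decimal("-30.00")), ("B", Decimal("30.00"))]
closing = {"A": Decimal("70.00"), "B": Decimal("85.00")}
print(reconcile(opening, closing, journal))
# {'B': (Decimal('80.00'), Decimal('85.00'))}
```

`Decimal` is used instead of `float` so that amounts compare exactly, which matters when the whole point of the check is to catch small discrepancies.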
Let's talk first about the time it takes to locate the problem. Once the problem is reported to the IT information center (or is spotted by the monitoring system), the people in the IT center start checking the system to locate the cause of the fault. If they cannot pin it down, they need to bring in the relevant software and hardware specialists, either on site or through remote support (for security reasons most banks do not allow their systems to be examined remotely, and it takes time for maintenance personnel to get to the data center ...). Finding the root cause can easily take an hour. It always takes time to be examined and diagnosed at a hospital, right?
Solving the problem is even harder. In fact, just as with everyone's own computer, restarting is often the most effective fix, but many business systems cannot simply be restarted when something goes wrong (doing so may affect other business systems). To this day, most standard maintenance contracts from major foreign vendors do not promise a time to repair.
Now let's talk about the disaster recovery system, and emphasize a fact that many IT people do not know: a bank will not lightly trigger a full switchover to its disaster recovery system! As mentioned earlier, the IT system is already very complicated; a disaster recovery system amounts to building a second copy of it, and the complexity more than doubles. Switching over is very troublesome, very painful, and ties up a great deal of manpower and resources. It is done only in a major disaster (such as an earthquake, a fire in the computer room, or a terrorist bombing).
Of course, disaster recovery switchover drills are held regularly, but because of the risk, the core system is generally not switched over for real. In the past, a provincial bank in East China switched to its disaster recovery center and never switched back to the production center. Recently, a rural credit cooperative in northwest China successfully cut its core production over to the disaster recovery system, which is no small feat, but it is after all a small bank with independent legal-person status, and that is not how big banks operate.
In addition, I have seen many comments saying that "no one dares to take the risk of switching to the disaster recovery site".