What does GPT mean for autonomous driving?
By Huang Huadan, Zhijia.com (智驾网)
ChatGPT has set off an AI craze, so what kind of chemical reaction will happen when GPT meets autonomous driving?
The full name of GPT is Generative Pre-trained Transformer. Simply put, it is a deep learning model for text generation, trained on data available on the Internet.
On April 11, at the 8th HAOMO AI DAY, Haomo CEO Gu Weihao officially released DriveGPT, built on GPT technology, with the Chinese name Xuehu Hai Ruo (雪湖·海若).
What can DriveGPT do? How was it built? Gu Weihao gave a detailed interpretation at AI DAY. In addition, AI DAY also showcased the upgrade of MANA, Haomo's autonomous driving data intelligence system, mainly its progress in visual perception.
1.
What is DriveGPT? What can it achieve?
Firstly, Gu Weihao explained the principle of GPT. The essence of the generative pre-trained Transformer model is to solve for the probability of the next word: each call samples from the probability distribution and generates one word, and in this way a series of tokens can be generated for various downstream tasks.
Take Chinese natural language as an example. A single character or word is a Token, and Chinese has a vocabulary of roughly 50,000 Tokens. When Tokens are fed into the model, the output is a probability distribution over the next word. This distribution reflects the knowledge and logic embedded in the language, so when the large model outputs the next word, it is reasoning from that linguistic knowledge and logic, much like deducing the murderer from the tangled clues of a detective novel.
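The next-word mechanism described above can be sketched in a few lines of Python. Everything here is illustrative: the tiny vocabulary and hand-written probability table stand in for a real trained Transformer, and only the sampling loop reflects how GPT-style generation actually proceeds.

```python
import random

# Toy vocabulary; a real GPT has tens of thousands of Tokens.
VOCAB = ["the", "car", "brakes", "turns", "<end>"]

def next_token_probs(context):
    # A real model would run a Transformer here; we fake a distribution
    # that depends only on the last token of the context.
    table = {
        None:  [0.7, 0.1, 0.1, 0.05, 0.05],
        "the": [0.0, 0.9, 0.05, 0.05, 0.0],
        "car": [0.0, 0.0, 0.5, 0.4, 0.1],
    }
    last = context[-1] if context else None
    return table.get(last, [0.0, 0.0, 0.0, 0.0, 1.0])

def generate(max_tokens=10, seed=0):
    """Generate tokens one at a time by sampling from the distribution."""
    rng = random.Random(seed)
    out = []
    for _ in range(max_tokens):
        probs = next_token_probs(out)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<end>":
            break
        out.append(token)
    return out

print(generate())
```

Each call samples one token from the model's distribution and appends it to the context, which is exactly the "solve the probability of the next word" loop described above.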
As a large model suitable for autonomous driving training, DriveGPT Xuehu Hai Ruo has three abilities:
1. It can generate many scene sequences by probability; each scene is a global scene, and each scene sequence is a situation that may actually occur in the future.
2. While generating these scene sequences, it can also quantify the vehicle behavior trajectory of greatest concern in each scene; that is, the vehicle's future trajectory information is generated together with the scene.
3. With this trajectory, DriveGPT Xuehu Hai Ruo can output the whole decision logic chain while generating the scene sequence and trajectory.
In other words, with the help of DriveGPT Xuehu Hai Ruo, planning, decision-making and reasoning can all be completed under a unified generative framework.
Specifically, DriveGPT Xuehu Hai Ruo is designed to tokenize the driving scene, which Haomo calls Drive Language.
Drive Language discretizes the driving space, and each Token represents a small part of the scene. At present, Haomo's Token vocabulary space holds about 50,000 Tokens. Given a sequence of scene Tokens describing what has already happened, the model can generate all possible future scenes from that history.
In other words, DriveGPT Xuehu Hai Ruo is also like a reasoning machine. Tell it what happened in the past, and it can infer many possibilities in the future according to probability.
When a series of Tokens are put together, they form a complete time series of driving scenes, including the state of the whole traffic environment and of the ego vehicle at each future moment.
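As a hedged illustration of what "discretizing the driving space" could look like, the toy code below maps continuous ego positions onto a coarse grid and turns a short trajectory into a Token sequence. The grid size and encoding are invented for illustration; a real Drive Language vocabulary is far larger and encodes much more than position.

```python
# Hypothetical Drive Language sketch: a 10m x 10m area is split into a
# 10x10 grid, and each cell becomes one Token (100 tokens in total).
GRID_W, GRID_H = 10, 10

def scene_to_token(x, y):
    """Map a continuous (x, y) position to a discrete Token id."""
    col = min(int(x), GRID_W - 1)
    row = min(int(y), GRID_H - 1)
    return row * GRID_W + col

def token_to_cell(token):
    """Invert the mapping: Token id -> (row, col) grid cell."""
    return divmod(token, GRID_W)

# A driving episode becomes a sequence of Tokens, one per time step --
# exactly the kind of input a GPT-style model consumes.
trajectory = [(0.5, 0.5), (1.6, 0.4), (2.7, 0.6), (3.9, 1.2)]
token_sequence = [scene_to_token(x, y) for x, y in trajectory]
print(token_sequence)
```

Feeding such a Token sequence to an autoregressive model and sampling continuations is what "generating possible future scenes from history" means in practice.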
With Drive Language in place, DriveGPT can be trained.
The training process of DriveGPT begins with large-scale pre-training based on driving data and the Drive Language defined above.
Then, using scenes where the driver did or did not take over during real use, the pre-training results are scored and ranked, and a feedback model is trained. That is, correct human driving behavior is used to correct wrong autonomous driving behavior.
Subsequently, the model is continuously optimized and iterated using reinforcement learning.
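The feedback step can be sketched with the standard pairwise ranking loss used in RLHF-style training: a clip where the human did not take over is treated as preferred to one where they did. The linear "reward model" and two-dimensional features below are invented stand-ins for a real network; only the loss, -log(sigmoid(r_good - r_bad)), and its gradient are the genuine article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(weights, features):
    # Stand-in reward model: a linear score over clip features.
    return sum(w * f for w, f in zip(weights, features))

def ranking_loss(weights, preferred, rejected):
    # Pairwise ranking loss: push the preferred clip's reward above the rejected one's.
    return -math.log(sigmoid(reward(weights, preferred) - reward(weights, rejected)))

def train(pairs, lr=0.1, steps=200):
    w = [0.0, 0.0]
    for _ in range(steps):
        for good, bad in pairs:
            # Gradient of the ranking loss w.r.t. the score difference.
            g = sigmoid(reward(w, good) - reward(w, bad)) - 1.0
            w = [wi - lr * g * (gi - bi) for wi, gi, bi in zip(w, good, bad)]
    return w

# Each pair: (features of a no-takeover clip, features of a takeover clip).
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.9, 0.1], [0.2, 0.8])]
w = train(pairs)
# After training, the no-takeover clip should score higher.
print(reward(w, [1.0, 0.2]) > reward(w, [0.1, 0.9]))
```

The trained feedback model can then score fresh generations, which is the signal a reinforcement-learning loop needs to keep improving the pre-trained model.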
The pre-training model adopts a decoder-only GPT structure, with each Token describing the scene state at a certain moment, including the state of obstacles, the state of the ego vehicle, lane lines and so on.
At present, Haomo's pre-training model has 12 billion parameters. Using 40 million kilometers of driving data from mass-produced cars, it can perform generative tasks for all kinds of scenes.
These generated results are then optimized according to human preferences, with trade-offs made across safety, efficiency and comfort. At the same time, Haomo uses screened human takeover data, about 5 thousand Clips, to train the feedback model and continuously optimize the pre-trained model.
When outputting the decision logic chain, DriveGPT Xuehu Hai Ruo uses prompt technology. The input end gives the model a hint, telling it "where to go, whether to slow down or speed up", and lets it reason step by step. Given this prompt, the model generates results in the expected direction, each result carrying a decision logic chain and a probability of occurring in the future, so the driving strategy with the most probable and most logical chain can be chosen.
A vivid example illustrates the reasoning ability of DriveGPT Xuehu Hai Ruo. Suppose the model is prompted to "reach a certain target point". DriveGPT Xuehu Hai Ruo will generate many possible driving styles: some aggressive, changing lanes continuously to overtake and reach the target quickly; some steady, following the car ahead all the way to the destination. If the prompt contains no other instructions, DriveGPT Xuehu Hai Ruo will optimize according to the feedback training and finally give a result that matches most people's driving preferences.
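A toy version of this prompt-then-select behavior: assume the model returns several candidate plans, each with a probability and a decision logic chain (all hard-coded here for illustration), and the planner keeps the most probable candidate, i.e. the plan closest to majority driving preference after feedback training.

```python
def generate_candidates(prompt):
    # A real DriveGPT would sample these; here they are invented examples.
    return [
        {"style": "aggressive", "prob": 0.18,
         "chain": ["change lane left", "overtake", "merge back", "reach goal"]},
        {"style": "steady", "prob": 0.55,
         "chain": ["keep lane", "follow lead car", "reach goal"]},
        {"style": "hesitant", "prob": 0.27,
         "chain": ["slow down", "wait", "keep lane", "reach goal"]},
    ]

def pick_plan(prompt):
    """Choose the most probable candidate: each candidate carries both a
    probability and the decision logic chain that explains it."""
    return max(generate_candidates(prompt), key=lambda c: c["prob"])

best = pick_plan("reach the target point")
print(best["style"], best["chain"])
```

Because every candidate carries its own logic chain, selecting a plan automatically yields the explanation for it, which is the point of generating trajectory and reasoning together.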
2.
What has Haomo done to realize DriveGPT?
First of all, the training and deployment of DriveGPT Xuehu Hai Ruo cannot be separated from computing power support.
In January this year, together with Volcano Engine, Haomo released its self-built intelligent computing center, MANA OASIS (Xuehu Oasis). OASIS delivers 670 PFLOPS of computing power, 2 TB/s of storage bandwidth and 800 GB/s of communication bandwidth.
Of course, computing power alone is not enough; support from the training and inference framework is also needed. Therefore, Haomo has made the following three upgrades.
The first is the upgrade of training stability guarantees.
Large-scale model training is a very arduous task. As data scale, cluster scale and training time grow by orders of magnitude, minor problems in system stability are magnified enormously. If not handled, training tasks frequently fail and abort abnormally, wasting the large resources invested up to that point.
On top of the large-model training framework, Haomo and Volcano Engine jointly established a complete training support framework. Through it, Haomo achieved minute-level capture and recovery of abnormal tasks, which can keep thousand-GPU training jobs running continuously for months without abnormal interruption, effectively guaranteeing the stability of DriveGPT Xuehu Hai Ruo's large-model training.
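One ingredient of minute-level recovery can be sketched as ordinary checkpoint-and-resume: persist training state frequently and atomically, and on restart continue from the last good checkpoint instead of from scratch. The file format and the simulated "training" below are invented for illustration.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, crash_at=None):
    step, state = load_checkpoint(path)
    while step < total_steps:
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated node failure")
        step += 1
        state = {"loss": 1.0 / step}  # stand-in for a real training step
        save_checkpoint(path, step, state)
    return step, state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(path, total_steps=10, crash_at=6)
except RuntimeError:
    pass  # the job is restarted and resumes from step 6, not step 0
step, state = train(path, total_steps=10)
print(step)
```

In a real thousand-GPU job the same pattern applies, except the "state" is sharded model and optimizer state and the restart is triggered automatically by the scheduler.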
The second is the upgrade of elastic resource scheduling.
Haomo has a large amount of real data returned by mass-produced cars and can automatically learn about the real world from it. Because the volume of data returned varies hugely at different times of day, the training platform needs elastic scheduling capability to adapt to the data volume.
Haomo extended incremental learning technology to large-scale model training, built a continuous learning system for large models, and developed a task-level elastic scheduler that schedules resources at minute granularity, bringing cluster computing resource utilization to 95%.
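A toy sketch of the elastic-scheduling idea: let the number of allocated workers track the backlog of returned data, so the cluster is neither idle overnight nor overloaded at peak. The cluster size, per-worker throughput and hourly backlog are all invented numbers.

```python
CLUSTER_SIZE = 100      # total workers available (invented)
CLIPS_PER_WORKER = 50   # assumed per-worker throughput per hour (invented)

def workers_needed(pending_clips):
    """Workers required for the current backlog: round up, cap at cluster size."""
    need = -(-pending_clips // CLIPS_PER_WORKER)  # ceiling division
    return min(need, CLUSTER_SIZE)

# Simulated hourly backlog of clips returned by the fleet over part of a day.
backlog_by_hour = [200, 150, 4000, 5200, 800, 300]
allocation = [workers_needed(b) for b in backlog_by_hour]
utilization = sum(allocation) / (len(allocation) * CLUSTER_SIZE)
print(allocation, round(utilization, 2))
```

A real task-level scheduler also has to migrate running tasks and rebalance within minutes, but the core decision, scaling allocation with data volume, is the same.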
the third is the upgrade of throughput efficiency.
In terms of training efficiency, for the Transformer's large matrix computations, efficiency is improved by splitting the data between inner and outer loops and keeping it in SRAM as much as possible. In the traditional training framework the operator pipeline is very long; by introducing the Lego operator library provided by Volcano Engine, end-to-end throughput was improved by 84%.
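The inner/outer-loop split with data kept in SRAM is the idea popularized by FlashAttention: process keys and values block by block while maintaining running softmax statistics, so the full attention matrix is never materialized. A pure-Python sketch of just the math (no GPU, no SRAM), checked against naive attention:

```python
import math

def naive_attention(q, ks, vs):
    """Reference: full softmax over all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(vs[0])
    return [sum(e * v[d] for e, v in zip(exps, vs)) / z for d in range(dim)]

def tiled_attention(q, ks, vs, block=2):
    """Same result, computed block by block with online softmax statistics."""
    m = float("-inf")        # running max of scores seen so far
    z = 0.0                  # running softmax denominator
    acc = [0.0] * len(vs[0])  # running weighted sum of values
    for i in range(0, len(ks), block):           # outer loop: key/value blocks
        kb, vb = ks[i:i + block], vs[i:i + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in kb]
        new_m = max([m] + scores)
        scale = math.exp(m - new_m) if m != float("-inf") else 0.0
        z *= scale                               # rescale old statistics
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vb):             # inner loop: within the block
            e = math.exp(s - new_m)
            z += e
            acc = [a + e * vd for a, vd in zip(acc, v)]
        m = new_m
    return [a / z for a in acc]

q = [0.1, 0.3]
ks = [[0.2, 0.1], [0.5, 0.4], [0.3, 0.9], [0.7, 0.2], [0.1, 0.1]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
print(tiled_attention(q, ks, vs))
```

The tiled version matches the naive one for any block size, which is what makes the loop split a pure efficiency optimization: on real hardware each block stays in fast SRAM instead of round-tripping through slow memory.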
With this computing power and these three upgrades, DriveGPT Xuehu Hai Ruo can be trained and iterated more effectively.
3.
MANA gets a major upgrade, and cameras replace ultrasonic radar
In December 2021, at the 4th HAOMO AI DAY, MANA, Haomo's autonomous driving data intelligence system, was released. After more than a year of application and iteration, MANA has now ushered in a comprehensive upgrade.
According to Gu Weihao, this upgrade mainly includes:
1. The capabilities of perception-related and cognition-related large models are integrated into DriveGPT.
2. The basic computing service is specially optimized for large-scale model training in terms of parameter scale, stability and efficiency, and integrated into OASIS.
3. A data synthesis service using NeRF technology has been added to reduce the cost of acquiring corner-case data.
4. For rapid delivery across multiple chips and multiple vehicle models, the heterogeneous deployment tools and vehicle adaptation tools have been optimized.
DriveGPT has been introduced in detail above; the following mainly looks at MANA's progress in visual perception.
Gu Weihao said that the core purpose of visual perception tasks is to recover the dynamic and static information and texture distribution of the real world. Therefore, Haomo upgraded the architecture of its visual self-supervised model, integrating prediction of the environment's 3D structure, velocity field and texture distribution into a single training objective, so that it can handle various specific tasks with ease. At present, the dataset for the visual self-supervised model exceeds 4 million Clips, and perception performance has improved by 20%.
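Folding several prediction targets into one training objective usually means a weighted sum of per-task losses. The sketch below is purely illustrative: the MSE losses, the weights and the two-element "predictions" stand in for real network outputs for 3D structure, velocity field and texture.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(pred, target, w_struct=1.0, w_vel=0.5, w_tex=0.25):
    # One training objective covering all three targets; the weights
    # balancing the tasks are invented placeholders.
    return (w_struct * mse(pred["depth"], target["depth"])
            + w_vel * mse(pred["velocity"], target["velocity"])
            + w_tex * mse(pred["texture"], target["texture"]))

pred = {"depth": [2.0, 4.1], "velocity": [0.9, 0.0], "texture": [0.5, 0.6]}
target = {"depth": [2.0, 4.0], "velocity": [1.0, 0.0], "texture": [0.5, 0.5]}
print(combined_loss(pred, target))
```

Training one backbone against such a combined objective is what lets a single self-supervised model serve several downstream perception tasks at once.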
In the parking scene, Haomo has met parking requirements using pure-vision ranging with fisheye cameras: measurement accuracy reaches 30 cm within 15 m, and accuracy within 2 m is better than 10 cm. Using pure vision instead of ultrasonic radar further reduces the cost of the whole solution.
In addition, for pure-vision 3D reconstruction, visual self-supervised large-model technology allows large volumes of mass-production returned video to be converted into 3D-labeled ground-truth data usable for BEV model training, without relying on lidar.
Through an upgraded NeRF, the reconstruction error has been reduced further.
This article comes from the author at Zhijia.com (智驾网), and the copyright belongs to the author. Please contact the author for reproduction in any form. The content represents only the author's views and is unrelated to the publishing platform.