Google has unveiled GameNGen, the first game engine powered entirely by a neural model, which supports real-time interaction with a complex game environment while maintaining high quality over long trajectories. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Its next-frame prediction reaches a PSNR of 29.4, comparable to lossy JPEG compression. In tests, human raters had difficulty distinguishing short clips of the real game from clips of the simulation.
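For readers unfamiliar with the metric, PSNR is derived from the mean squared error between a predicted frame and the real one. The following is a minimal sketch, not the evaluation code used by the researchers: the function names `psnr` and `rmse_for_psnr`, the representation of a frame as a flat list of pixel values, and the 8-bit peak value of 255 are all assumptions made for illustration.

```python
import math


def psnr(reference, predicted, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two equally sized frames.

    Frames are flat sequences of pixel intensities; max_value is the
    largest possible intensity (255 is an assumed 8-bit peak).
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_for_psnr(psnr_db, max_value=255.0):
    """Invert the PSNR formula: the RMS pixel error that yields psnr_db."""
    return max_value / 10 ** (psnr_db / 20.0)
```

Under the 8-bit assumption, `rmse_for_psnr(29.4)` works out to a root-mean-square error of roughly 8.6 intensity levels per pixel, which gives a feel for how close the predicted frames are to the real ones.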
GameNGen is trained in two stages. In the first, a deep reinforcement learning agent learns to play the game, and its actions and observations are recorded as training data for the generative model. In the second, a diffusion model is trained to predict the next frame conditioned on the preceding frames and actions. Because each generated frame is fed back in as context for the next one, small prediction errors would otherwise accumulate; adding Gaussian noise to the context frames during training teaches the model to correct such errors and keeps the visuals stable over long trajectories.
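The noise augmentation step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual training code: the function name `augment_context`, the representation of frames as lists of floats, the uniform sampling of the noise level, and the choice to return that level alongside the noisy frames are all hypothetical.

```python
import random


def augment_context(frames, max_noise_std, rng=random):
    """Corrupt past frames with Gaussian noise before they are used as context.

    frames: list of frames, each a list of float pixel values.
    max_noise_std: upper bound on the noise standard deviation (assumed knob).
    rng: source of randomness, e.g. a random.Random instance.

    Returns (noisy_frames, noise_std). Returning the sampled level is an
    assumption; it lets a model be told how corrupted its context is.
    """
    # Sample one noise level per training example (assumed: uniform).
    noise_std = rng.uniform(0.0, max_noise_std)
    # Build new lists so the clean frames are left untouched.
    noisy_frames = [
        [pixel + rng.gauss(0.0, noise_std) for pixel in frame]
        for frame in frames
    ]
    return noisy_frames, noise_std
```

A training loop would pass the noisy frames to the model as context while still using the clean next frame as the prediction target, so the model learns to produce a correct frame from imperfect history.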
To improve image quality, the decoder of the Stable Diffusion v1.4 model was fine-tuned, which eliminated artifacts that appeared when predicting game frames, particularly in small details such as the HUD along the bottom of the screen.
In conclusion, GameNGen marks a significant advance in game simulation, using recent developments in diffusion models to deliver a high-quality game experience in real time.