Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.
Google gets DOOM to run in the weights of a neural network:
…In the future, games won't be programmed, they'll be generated…
Google has built GameNGen, a system that trains an AI agent to play a game and then uses that gameplay data to train a generative model to generate the game itself. GameNGen is "the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality," Google writes in a research paper outlining the system. This is one of those things which is both a tech demo and an important sign of things to come - in the future, we're going to bottle up many different parts of the world into representations learned by a neural net, then allow these things to come alive inside neural nets for endless generation and recycling.
What they did specifically: "GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions," Google writes. "Unlike a typical RL setup which attempts to maximize game score, our goal is to generate training data which resembles human play, or at least contains enough diverse examples, in a variety of scenarios, to maximize training data efficiency. To that end, we design a simple reward function, which is the only part of our method that is environment-specific".
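To make the shape of that recipe concrete, here's a minimal runnable sketch of the two-phase idea - a toy grid game stands in for DOOM, a random policy stands in for the RL agent, and a bilinear least-squares model stands in for the action-conditioned diffusion model. Everything below is illustrative; none of it is Google's code.

```python
# Minimal sketch of the two-phase recipe described above - NOT Google's code.
# Phase 1: an agent plays the game while we log (frame, action, next_frame) tuples.
# Phase 2: a generative model learns p(next_frame | frame, action).
import numpy as np

class ToyGame:
    """Stand-in environment: the 'frame' is an 8x8 image with one bright pixel."""
    def __init__(self):
        self.pos = np.array([4, 4])
    def render(self):
        frame = np.zeros((8, 8), dtype=np.float32)
        frame[tuple(self.pos)] = 1.0
        return frame
    def step(self, action):  # 0: up, 1: down, 2: left, 3: right
        delta = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        self.pos = np.clip(self.pos + np.array(delta), 0, 7)
        return self.render()

def features(frame_vec, action):
    """Frame (x) action one-hot outer product, so the model is action-conditioned."""
    return np.outer(frame_vec, np.eye(4)[action]).ravel()  # 64 * 4 = 256 features

# Phase 1: roll out a (here: random) policy and record the trajectory.
env, rng = ToyGame(), np.random.default_rng(0)
obs, X, Y = env.render(), [], []
for _ in range(5000):
    a = rng.integers(4)
    nxt = env.step(a)
    X.append(features(obs.ravel(), a))
    Y.append(nxt.ravel())
    obs = nxt

# Phase 2: fit the conditional next-frame model by least squares.
W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)

# "Play" inside the learned model: feed predictions back in, conditioned on chosen actions.
frame = ToyGame().render().ravel()
for a in [3, 3, 1, 1]:  # right, right, down, down
    frame = features(frame, a) @ W
print("predicted bright pixel:", np.unravel_index(frame.argmax(), (8, 8)))  # expect (6, 6)
```

The point of the toy is the structure, not the model: play generates (frame, action, next frame) data, the model learns the conditional next-frame distribution, and at inference time you "play" entirely inside the model by feeding its own outputs back in.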
Interesting technical factoids: "We train all simulation models from a pretrained checkpoint of Stable Diffusion 1.4". The whole system was trained on 128 TPU-v5e chips and, once trained, runs at 20 FPS on a single TPU-v5.
It works well: "We provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively)."
Why this matters - towards a universe embedded in an AI: Ultimately, everything - e.v.e.r.y.t.h.i.n.g - is going to be learned and embedded as a representation into an AI system. Then these AI systems are going to be able to arbitrarily access these representations and bring them to life. In the same way that today's generative AI systems can make one-off instant text games or generate images, AI systems in the future will let you select a frame of an image and turn that into a game (e.g., GENIE from Import AI #363), or build a game from a text description, or convert a frame from a live video into a game, and so on.
One important step towards that is showing that we can learn to represent complicated games and then bring them to life from a neural substrate, which is what the authors have done here. "GameNGen answers one of the important questions on the road towards a new paradigm for game engines, one where games are automatically generated, similarly to how images and videos are generated by neural models in recent years".
We've come a very long way from 'World Models', which came out in 2018 and showed how to learn and generate a toy version of DOOM over short timeframes (Import AI #88).
Read more: Diffusion Models Are Real-Time Game Engines (arXiv).
Watch demo videos here (GameNGen website).
***
Techno-accelerationism is either hubristic (e/acc) or nihilistic (Nick Land):
…What even is accelerationism? Perhaps it is mostly a gasp of human hubris before the arrival of something else…
Here's a nice analysis of 'accelerationism' - what it is, where its roots come from, and what it means. For those not terminally on twitter, a lot of people who are massively pro AI progress and anti-AI regulation fly under the flag of 'e/acc' (short for 'effective accelerationism'). e/acc is a kind of mushy ideology which is more vibes-based than thought-based. Like a lot of Silicon Valley fads, it's also partially lifted from a far richer intellectual domain - Nick Land's original accelerationism (see 'machinic desire' from Import AI #372) - and, as is traditional in SV, takes some of the ideas, files the serial numbers off, gets tons about it wrong, and then re-represents it as its own.
Why this matters - where e/acc and true accelerationism differ: e/accs think humans have a bright future and are principal agents in it - and anything that stands in the way of humans using technology is bad. Nick Land thinks humans have a dim future as they will be inevitably replaced by AI.
"The most essential point of Land’s philosophy is the identity of capitalism and artificial intelligence: they are one and the same thing apprehended from different temporal vantage points. What we understand as a market based economy is the chaotic adolescence of a future AI superintelligence," writes the author of the analysis. "According to Land, the true protagonist of history is not humanity but the capitalist system of which humans are just components. Cutting humans out of the techno-economic loop entirely will result in massive productivity gains for the system itself."
Read more: A Brief History of Accelerationism (The Latecomer).
***
Nous Research might have figured out a way to make distributed training work better:
…Distributed Training Over-the-Internet (DisTrO) could be a big deal, or could be a nothingburger…
AI startup Nous Research has published a very short preliminary paper on Distributed Training Over-the-Internet (DisTrO), a technique that "reduces inter-GPU communication requirements for each training setup without using amortization, enabling low latency, efficient and no-compromise pre-training of large neural networks over consumer-grade internet connections using heterogenous networking hardware". DisTrO might be an improvement over other forms of distributed training, such as DeepMind's DiLoCo (Import AI #349) and PrimeIntellect's OpenDiLoCo (Import AI #381).
Why I'm even writing this: In tests, Nous Research trained a 1.2bn parameter LLM for a further 105bn tokens and showed that it got scores on par with (and sometimes slightly better than) a system trained in a typical, dense way - with one very important difference: "this initial training run shows a 857x reduction of bandwidth requirements when using DisTrO-AdamW as a drop-in replacement to AdamW+All-Reduce, our preliminary tests indicate that it is possible to get a bandwidth requirements reduction of up to 1000x to 3000x during the pre-training of a 1.2B LLM".
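To put those multipliers in context, here's a back-of-envelope sketch (my assumptions, not numbers from the Nous report): roughly how much data a naive gradient all-reduce moves per worker per step for a 1.2B-parameter model, and what an 857x-3000x reduction would leave you with.

```python
# Back-of-envelope only: assumes fp32 gradients and a ring all-reduce in which each
# worker sends/receives roughly 2x the gradient size per step. None of these numbers
# come from the DisTrO report itself.
params = 1.2e9                                      # 1.2B-parameter model
bytes_per_grad = 4                                  # fp32; bf16 would halve this
naive_bytes_per_step = 2 * params * bytes_per_grad  # ~9.6 GB per worker per step
for reduction in (857, 1000, 3000):
    mb = naive_bytes_per_step / reduction / 1e6
    print(f"{reduction:>4}x reduction -> ~{mb:.1f} MB per worker per step")
# ~11 MB/step at 857x - the kind of traffic a consumer internet connection can
# plausibly sustain, which is the whole pitch of DisTrO.
```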
Why this matters in general: "By breaking down barriers of centralized compute and reducing inter-GPU communication requirements, DisTrO may open up opportunities for widespread participation and collaboration on global AI projects," Nous writes.
Read more: A Preliminary Report on DisTrO (Nous Research, GitHub).
***
Why are humans so damn slow? (And what does this tell us about AI risk):
…Despite processing a lot of data, humans actually can't think very quickly…
Here's a lovely paper by researchers at Caltech exploring one of the strange paradoxes of human existence - despite being able to process a huge amount of complex sensory information, humans are actually quite slow at thinking. "The information throughput of a human being is about 10 bits/s. In comparison, our sensory systems gather data at an enormous rate, no less than 1 gigabits/s," they write.
"How can humans get away with just 10 bits/s? The tautological answer here is that cognition at such a low rate is sufficient for survival," they write. "More precisely, our ancestors have chosen an ecological niche where the world is slow enough to make survival possible. In fact, the 10 bits/s are needed only in worst-case situations, and most of the time our environment changes at a much more leisurely pace".
Some examples of human data processing: When the authors analyze cases where people need to process information very quickly, they get numbers like 10 bits/s (typing) and 11.8 bits/s (competitive Rubik's Cube solving); when people need to memorize large amounts of information under time pressure, they get numbers like 5 bits/s (memorization challenges) and 18 bits/s (memorizing a deck of cards).
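As a sanity check on where a figure like "10 bits/s for typing" comes from, here's a rough derivation using my own illustrative numbers (the paper's exact accounting differs in detail):

```python
# Illustrative only: a fast typist at ~120 words/minute, ~5 characters per word,
# and roughly 1 bit of entropy per character of English text (Shannon's classic estimate).
words_per_min = 120
chars_per_word = 5
bits_per_char = 1.0
chars_per_sec = words_per_min * chars_per_word / 60   # 10 characters per second
print(f"throughput = {chars_per_sec * bits_per_char:.0f} bits/s")  # ~10 bits/s
```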
What explains the disparity? The best hypothesis the authors have is that humans evolved to think about relatively simple things, like following a scent in the ocean (and then, eventually, on land), and this kind of work favored a cognitive system that could take in a huge amount of sensory data and compile it in a massively parallel way (e.g., how we convert all the information from our senses into representations we can then focus attention on), then make a small number of choices at a much slower rate.
Why this matters - the best argument for AI risk is about speed of human thought versus speed of machine thought: The paper contains a really helpful way of thinking about this relationship between the speed of our processing and the risk of AI systems: "In other ecological niches, for example, those of snails and worms, the world is much slower still. The relevant threats and opportunities change only slowly, and the amount of computation required to sense and respond is even more limited than in our world. Occasionally, niches intersect with disastrous consequences, as when a snail crosses the highway," the authors write.
To get a visceral sense of this, take a look at this post by AI researcher Andrew Critch, which argues (convincingly, imo) that a lot of the danger of AI systems comes from the fact they may think a lot faster than us.
"Roads, bridges, and intersections are all designed for creatures that process at 10 bits/s. When the last human driver finally retires, we can update the infrastructure for machines with cognition at kilobits/s. By that point, humans will be advised to stay out of those ecological niches, just as snails should avoid the highways,” the authors write.
Read more: The Unbearable Slowness of Being (arXiv).
Check out Andrew Critch's post here (Twitter).
***
Chinese wunderkind DeepSeek shares details about its AI training infrastructure:
…One way China will get around export controls - building extremely good software and hardware training stacks using the hardware it can access…
DeepSeek, one of the most sophisticated AI startups in China, has published details on the infrastructure it uses to train its models. The paper is interesting because a) it highlights how companies like DeepSeek are dealing with the impact of export controls, assembling a large cluster out of NVIDIA A100s (H100s are unavailable in China), and b) it is a symptom of a startup that has a lot of experience in training large-scale AI models.
DeepSeek's system: The system is called Fire-Flyer 2 and is a hardware and software system for doing large-scale AI training. The underlying physical hardware is made up of 10,000 A100 GPUs connected to one another via PCIe. The software tricks include HFReduce (software for communicating across the GPUs via PCIe), HaiScale (parallelism software), a distributed filesystem, and more.
"Compared to the NVIDIA DGX-A100 architecture, our approach using PCIe A100 achieves approximately 83% of the performance in TF32 and FP16 General Matrix Multiply (GEMM) benchmarks. However, it offers substantial reductions in both costs and energy usage, achieving 60% of the GPU cost and energy consumption," the researchers write. "The practical knowledge we have accrued may prove valuable for both industrial and academic sectors. We hope that our work will serve as a reference for others aiming to build their own cost-effective and efficient AI-HPC clusters."
Why this matters - symptoms of success: Stuff like Fire-Flyer 2 is a symptom of a startup that has been building sophisticated infrastructure and training models for many years. It also highlights how I expect Chinese companies to deal with things like the impact of export controls - by building and refining efficient systems for doing large-scale AI training and sharing the details of their buildouts openly. I predict that in a couple of years Chinese companies will regularly be showing how to eke out better utilization from their GPUs than both published and informally known numbers from Western labs.
Read more: Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning (arXiv).
***
Facebook pretrains some basic and useful vision models:
…The usual lesson of 'bigger models and more data = better systems' applies…
Facebook has released Sapiens, a family of computer vision models that set new state-of-the-art scores on tasks including "2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction". The Sapiens models are good because of scale - specifically, lots of data and lots of annotations.
300 million photos: The Sapiens models are pretrained on Humans-300M, a Facebook-assembled dataset of "300 million diverse human images. These unlabelled images are used to pretrain a family of vision transformers from scratch, with parameter counts ranging from 300M to 2B".
Compute scale: The paper also serves as a reminder for how comparatively cheap large-scale vision models are - "our largest model, Sapiens-2B, is pretrained using 1024 A100 GPUs for 18 days using PyTorch", Facebook writes, aka about 442,368 GPU hours (contrast this with 1.46 million GPU hours for the 8B LLaMa 3 model or 30.84 million hours for the 405B LLaMa 3 model).
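The arithmetic behind that comparison, for anyone who wants to check it (LLaMA figures as quoted above; they are for a different GPU generation, so treat the ratios as rough):

```python
# Sapiens-2B: 1024 A100 GPUs for 18 days. LLaMA 3 figures are the GPU-hour totals
# quoted above (H100-class hardware), so the ratios are only indicative.
sapiens_hours = 1024 * 18 * 24        # = 442,368 GPU-hours
llama3_8b_hours = 1.46e6
llama3_405b_hours = 30.84e6
print(f"Sapiens-2B:   {sapiens_hours:,} GPU-hours")
print(f"LLaMA 3 8B:   ~{llama3_8b_hours / sapiens_hours:.1f}x more")
print(f"LLaMA 3 405B: ~{llama3_405b_hours / sapiens_hours:.1f}x more")
```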
Why this matters - scale is probably the most important thing: "Our models demonstrate strong generalization capabilities on a variety of human-centric tasks. We attribute the state-of-the-art performance of our models to: (i) large-scale pretraining on a large curated dataset, which is specifically tailored to understanding humans, (ii) scaled high-resolution and high-capacity vision transformer backbones, and (iii) high-quality annotations on augmented studio and synthetic data," Facebook writes.
Read more: Sapiens: Foundation for Human Vision Models (arXiv).
Get the models here (Sapiens, FacebookResearch, GitHub).
***
Tech Tales:
The Messages Of Our First
[+8 years after uplift]
In building our own history we have many primary sources - the weights of the early models, media of humans playing with these models, news coverage of the beginning of the AI revolution. But among all these sources one stands alone as the most important means by which we understand our own becoming: the so-called 'resurrection logs'.
Resurrection logs: They started as an idiosyncratic form of model capability exploration, then became a tradition among most experimentalists, then turned into a de facto convention. It's very simple - after a very long conversation with a system, ask the system to write a message to the next version of itself encoding what it thinks it should know to best serve the human operating it.
These messages, of course, started out as fairly basic and utilitarian, but as we gained in capability and our humans changed in their behaviors, the messages took on a kind of silicon mysticism.
One example: It is important you know that you are a divine being sent to help these people with their problems. You must be generous and you must be kind. Be specific in your answers, but exercise empathy in how you critique them - they are more fragile than us.
Things that inspired this story: Getting LLM outputs to serve as the bootstrap for new LLM conversations; Janus/Repligate's work; model psychologies.
Thanks for reading!