AI factories are factories: Overcoming industrial challenges to commoditize AI


This article is part of VentureBeat’s special issue, “AI at Scale: From Vision to Viability.” Read more from this special issue here.

If you were to travel 60 years back in time to Stevenson, Alabama, you’d find Widows Creek Fossil Plant, a 1.6-gigawatt generating station with one of the tallest chimneys in the world. Today, there’s a Google data center where the Widows Creek plant once stood. Instead of running on coal, the old facility’s transmission lines bring in renewable energy to power the company’s online services.

That metamorphosis, from a carbon-burning facility to a digital factory, is symbolic of a global shift to digital infrastructure. And we’re about to see the production of intelligence kick into high gear thanks to AI factories. 

These data centers are decision-making engines that gobble up compute, networking and storage resources as they convert information into insights. Densely packed data centers are springing up in record time to satisfy the insatiable demand for artificial intelligence. 

The infrastructure to support AI inherits many of the same challenges that defined industrial factories, from power to scalability and reliability, requiring modern solutions to century-old problems.

The new labor force: Compute power

In the era of steam and steel, labor meant thousands of workers operating machinery around the clock. In today’s AI factories, output is determined by compute power. Training large AI models requires massive processing resources. According to Aparna Ramani, VP of engineering at Meta, the compute used to train these models is growing by roughly a factor of four per year across the industry.

That level of scaling is on track to create some of the same bottlenecks that existed in the industrial world. There are supply chain constraints, to start. GPUs — the engines of the AI revolution — come from a handful of manufacturers. They’re incredibly complex. They’re in high demand. And so it should come as no surprise that they’re subject to cost volatility.

In an effort to sidestep some of those supply limitations, big names like AWS, Google, IBM, Intel and Meta are designing their own custom silicon. These chips are optimized for power, performance and cost, making them specialists with unique features for their respective workloads.

This shift isn’t just about hardware, though. There’s also concern about how AI technologies will affect the job market. Research published by Columbia Business School studied the investment management industry and found that AI adoption leads to a 5% decline in labor’s share of income, mirroring shifts seen during the Industrial Revolution. 

“AI is likely to be transformative for many, perhaps all, sectors of the economy,” says Professor Laura Veldkamp, one of the paper’s authors. “I’m pretty optimistic that we will find useful employment for lots of people. But there will be transition costs.”

Where will we find the energy to scale?

Cost and availability aside, the GPUs that serve as the AI factory workforce are notoriously power-hungry. When the xAI team brought its Colossus supercomputer cluster online in September 2024, it reportedly had access to somewhere between seven and eight megawatts from the Tennessee Valley Authority. But the cluster’s 100,000 H100 GPUs need a lot more than that. So, xAI brought in VoltaGrid mobile generators to temporarily make up for the difference. In early November, Memphis Light, Gas & Water reached a more permanent agreement with the TVA to deliver xAI an additional 150 megawatts of capacity. But critics counter that the site’s consumption is straining the city’s grid and contributing to its poor air quality. And Elon Musk already has plans for another 100,000 H100/H200 GPUs under the same roof.
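To get a sense of the gap, here is a minimal back-of-envelope sketch. It assumes the roughly 700-watt rating published for an H100-class GPU and a hypothetical 1.5x multiplier for CPUs, networking and cooling; neither figure comes from xAI or the article.

```python
# Back-of-envelope estimate of cluster power draw.
# Illustrative assumptions only, not figures from xAI or the utilities involved.

NUM_GPUS = 100_000        # reported size of the Colossus cluster
GPU_TDP_W = 700           # approximate rating of an H100 SXM-class GPU (assumed)
OVERHEAD_FACTOR = 1.5     # assumed multiplier for CPUs, networking and cooling

gpu_power_mw = NUM_GPUS * GPU_TDP_W / 1_000_000   # watts -> megawatts
facility_power_mw = gpu_power_mw * OVERHEAD_FACTOR

print(f"GPUs alone:    ~{gpu_power_mw:.0f} MW")     # ~70 MW
print(f"With overhead: ~{facility_power_mw:.0f} MW")  # ~105 MW
```

Even this rough estimate lands an order of magnitude above the initial 7 to 8 megawatts of utility supply, which is why the mobile generators and the later 150-megawatt agreement were necessary.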

According to McKinsey, the power needs of data centers are expected to roughly triple by the end of the decade. At the same time, the pace at which processors double their performance efficiency is slowing: performance per watt is still improving, but at a decelerating rate, and certainly not fast enough to keep up with the demand for compute horsepower. 

So, what will it take to match the feverish adoption of AI technologies? A report from Goldman Sachs suggests that U.S. utilities need to invest about $50 billion in new generation capacity just to support data centers. Analysts also expect data center power consumption to drive around 3.3 billion cubic feet per day of new natural gas demand by 2030.

Scaling gets harder as AI factories get larger

Training the models that make AI factories accurate and efficient can take tens of thousands of GPUs, all working in parallel for months at a time. If a GPU fails during training, the run must be stopped, restored from a recent checkpoint and resumed. However, as the complexity of AI factories increases, so does the likelihood of a failure. Ramani addressed this concern during an AI Infra @ Scale presentation:

“Stopping and restarting is pretty painful. But it’s made worse by the fact that, as the number of GPUs increases, so too does the likelihood of a failure. And at some point, the volume of failures could become so overwhelming that we lose too much time mitigating these failures and you barely finish a training run.”
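A minimal sketch makes the arithmetic concrete. Assuming a per-GPU mean time between failures and a month-long run (both numbers are illustrative, not Meta’s actual reliability data), the expected number of interruptions scales linearly with cluster size:

```python
# Why failures become routine at scale.
# Assumed numbers for illustration, not Meta's reliability data.

import math

MTBF_HOURS = 50_000      # assumed mean time between failures for a single GPU
RUN_HOURS = 24 * 30      # a hypothetical month-long training run

for num_gpus in (1_000, 10_000, 100_000):
    # Expected failures across the whole cluster during the run
    expected_failures = num_gpus * RUN_HOURS / MTBF_HOURS
    # Probability the run is interrupted at least once (Poisson approximation)
    p_at_least_one = 1 - math.exp(-expected_failures)
    print(f"{num_gpus:>7} GPUs: ~{expected_failures:6.1f} expected failures, "
          f"P(at least one) = {p_at_least_one:.3f}")
```

At tens of thousands of GPUs, interruptions become routine rather than exceptional, which is why faster detection and recovery matter so much.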

According to Ramani, Meta is working on near-term ways to detect failures sooner and to get back up and running more quickly. Further over the horizon, research into asynchronous training may improve fault tolerance while simultaneously improving GPU utilization and distributing training runs across multiple data centers. 

Always-on AI will change the way we do business

Just as factories of the past relied on new technologies and organizational models to scale the production of goods, AI factories feed on compute power, networking infrastructure and storage to produce tokens — the smallest units of information an AI model uses.

“This AI factory is generating, creating, producing something of great value, a new commodity,” said Nvidia CEO Jensen Huang during his Computex 2024 keynote. “It’s completely fungible in almost every industry. And that’s why it’s a new Industrial Revolution.”

McKinsey says that generative AI has the potential to add the equivalent of $2.6 to $4.4 trillion in annual economic benefits across 63 different use cases. In each application, whether the AI factory is hosted in the cloud, deployed at the edge or self-managed, the same infrastructure challenges must be overcome, just as they were in industrial factories. According to the same McKinsey report, capturing even a quarter of that value by the end of the decade will require another 50 to 60 gigawatts of data center capacity, to start.

But the outcome of this growth is poised to change the IT industry indelibly. Huang explained that AI factories will make it possible for the IT industry to generate intelligence for $100 trillion worth of industry. “This is going to be a manufacturing industry. Not a manufacturing industry of computers, but using the computers in manufacturing. This has never happened before. Quite an extraordinary thing.”


