Foundational LLMs require data, compute, and specialised tech talent
US leads due to self-sufficient data, compute, funding, and risk culture
India has talent and vast data, but fragmented datasets and limited compute
The United States and China are the frontrunners in the global AI race. Since the advent of generative AI, these two nations have been the only ones capable of developing strong foundational large language models (LLMs) with global applicability.
As the pioneer of this technology, the US is home to most of the leading AI start-ups behind these models. OpenAI’s GPT, Google’s Gemini, xAI’s Grok, and Anthropic’s Claude have all been developed in the United States. China, meanwhile, has rapidly caught up with its own models such as DeepSeek’s R1 and V3.
A foundational LLM is a general-purpose AI model trained on vast amounts of diverse text data to understand and generate human-like language. It serves as the base layer or “foundation” upon which specialised applications, such as chatbots, summarisers, translators, or domain-specific models, can be built or fine-tuned for targeted tasks.
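To make the “base layer” idea concrete, below is a minimal sketch of fine-tuning an open base model on a small domain-specific corpus using the Hugging Face transformers and datasets libraries. The base checkpoint name and the JSONL file are illustrative placeholders, not a reference to any model or dataset mentioned in this article.

```python
# Minimal sketch: fine-tuning an open base LLM on a small domain corpus.
# The checkpoint name and data file below are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

base = "gpt2"  # placeholder: any open causal base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder corpus: a JSONL file with one {"text": ...} record per line.
dataset = load_dataset("json", data_files="domain_corpus.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the fine-tuned weights land in ./finetuned-model
```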
Despite the technology being recognised as crucial, only two countries have successfully developed foundational language models that can be deployed at scale. Of the two, China built DeepSeek’s models using its own data and infrastructure while relying on computing hardware sourced from the US. All other major LLMs in the world have been developed in the US.
What goes into the development of this technology, and why is the US leading?
From a nation’s perspective, three key components are essential to build a foundational LLM: data, compute, and tech talent.
Any AI start-up that possesses high-quality datasets, powerful computing infrastructure, and highly skilled tech talent has all the core ingredients required to develop a foundational LLM.
A large volume of high-quality data is needed to train the model on diverse language patterns, factual knowledge, and reasoning tasks. Strong compute infrastructure, comprising thousands of GPUs or AI chips, enables the efficient processing of these massive datasets during training. Finally, highly skilled tech talent, including AI researchers, data engineers, and system architects, is vital for designing model architectures, optimising algorithms, and ensuring scalability and reliability.
Together, these components determine the quality, performance, and global competitiveness of an LLM.
Explaining the training requirements of a model in depth, Ramprakash Ramamoorthy, Director of AI Research at Zoho Corp, said, “You need sustained access to high-bandwidth interconnects, fault-tolerant training pipelines, a legally clean and diverse corpus, fast de-duplication and contamination control, robust tokenization for target languages, and an eval stack that measures capability and safety across reasoning, tools, and long-context.”
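As a rough illustration of the de-duplication step he mentions, the sketch below drops exact duplicates from a corpus by hashing normalised documents. Production pipelines go further, with near-duplicate detection (for example MinHash) and contamination checks against evaluation sets, which this example does not attempt.

```python
# Minimal sketch of exact de-duplication for a pretraining corpus.
# Real pipelines also handle near-duplicates and eval-set contamination.
import hashlib

def normalise(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())

def deduplicate(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Hello  world", "hello world", "A different document"]
print(deduplicate(corpus))  # the second copy of "hello world" is dropped
```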
Highlighting the post-training requirements of a model, he added, “Post-training needs high-quality preference data, scalable RLHF/DPO infrastructure, and red-teaming. Operating costs are dominated by power, networking reliability, and people who can recover from training pathologies without losing weeks.”
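For readers unfamiliar with the DPO objective he refers to, here is a hedged sketch of the Direct Preference Optimization loss for a single preference pair. In a real trainer the log-probabilities would come from the policy being trained and a frozen reference model; the values below are purely illustrative.

```python
# Sketch of the DPO loss for one (chosen, rejected) response pair.
# logp_* are summed token log-probabilities under the trained policy and a
# frozen reference model; beta controls how far the policy may drift.
import math

def dpo_loss(logp_chosen_policy, logp_rejected_policy,
             logp_chosen_ref, logp_rejected_ref, beta=0.1):
    margin = ((logp_chosen_policy - logp_chosen_ref)
              - (logp_rejected_policy - logp_rejected_ref))
    # Negative log-sigmoid of the scaled margin: the loss shrinks when the
    # policy prefers the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

print(dpo_loss(-42.0, -55.0, -45.0, -50.0))  # illustrative log-probabilities
```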
In terms of data, compute, and talent, the United States is largely self-sufficient. Tech giants such as Google, Meta, and Microsoft possess vast amounts of high-quality data that can be leveraged to build foundational LLMs.
The world’s leading compute companies, Nvidia and AMD, are also based in the US. Nvidia’s H100 chips are currently the most sought-after AI GPUs. These chipmakers not only supply GPUs to AI start-ups within the US but also export them globally.
Lastly, there is no shortage of tech talent in the country; the Silicon Valley ecosystem continues to attract top talent from around the world, encouraging them to contribute to the nation’s technological vision.
Apart from being self-reliant in meeting its AI component needs, the US also has an investment ecosystem that actively bets on future technologies, and a culture of innovation within start-ups.
According to Stanford’s AI Index reports from 2017 to 2025, the US has attracted approximately $470.9 billion in private AI investments. In contrast, India’s private investment during the same period totaled only about $11.1 billion, placing it significantly behind the US in both funding scale and growth pace.
Kriti Goyal, a machine learning engineer at a leading US technology company, remarked that the US had been betting on AI even before 2020.
Highlighting its extensive venture capital network, she said, “The private investment isn't just since 2020 and the advent of large language models; it's been there for the longest time. For example, the underlying technology for large language models (transformers, attention mechanisms, etc.) was also invented in the US. Even outside of the big tech companies, it has a deep venture capital network that allowed for the incredible growth of once small research labs and companies.”
Goyal also praised the culture of risk-taking in the US. She noted that this mindset is evident even in established companies, which continue to invest substantial portions of their resources in R&D. Additionally, the venture capital market often takes larger bets on new ideas early on, during the pre-seed and seed rounds.
India occupies varying positions when it comes to the key components required to build a foundational model: data, compute, and talent.
The country’s skilled tech workforce has long been one of its greatest strengths. India also ranks highly in AI skill penetration, as reflected in its rapid hiring growth, vast engineering talent pool, and expanding community of AI practitioners.
However, having a skilled talent pool is not enough. The key questions are whether this talent is being effectively utilised in India’s foundational LLM efforts, whether the country has the infrastructure to match their ambitions, and most importantly, whether this talent is willing to take the risks required to build a truly homegrown foundational LLM.
Goyal highlighted that, unlike in the US, Indian talent tends to be risk-averse when it comes to exploring new domains. She said, “With regard to talent, people want to explore interesting new domains, even if that means going into a less established career choice with lesser pay for a while. In India, however, for example, we frequently see the top engineering talent moving towards Quantitative Finance and other high-paying roles, and much fewer of this top talent chooses to do a life of research.”
Zoho’s Ramamoorthy also believes that India has an abundance of strong software engineers capable of building a foundational model. However, he notes that the country falls short in other areas. “Talent depth is good for application engineering and MLOps, but thinner for frontier pretraining, tokenizer design for Indic scripts, and safety alignment at scale,” he said.
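To give a concrete sense of the tokenizer work he points to, here is a hedged sketch that trains a tiny byte-pair-encoding tokenizer on a few Hindi sentences with the Hugging Face tokenizers library. A production Indic tokenizer would be trained on far larger, properly licensed corpora spanning many scripts, with careful vocabulary-size tuning.

```python
# Minimal sketch: training a tiny BPE tokenizer on Devanagari text.
# A real Indic tokenizer needs a large multilingual corpus across scripts.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

sample_sentences = [
    "भारत एक विशाल देश है।",
    "कृत्रिम बुद्धिमत्ता तेज़ी से बढ़ रही है।",
    "डेटा और कंप्यूट दोनों ज़रूरी हैं।",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(sample_sentences, trainer=trainer)

print(tokenizer.encode("भारत में कृत्रिम बुद्धिमत्ता").tokens)
```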
India’s billion-plus users and its large digital public infrastructure (Aadhaar, UPI, and many government datasets) mean huge amounts of real-world transactional and administrative data that could power national models.
However, this huge amount of data is uneven. Datasets are often fragmented across agencies and suffer from quality and standardisation issues, while data-protection rules and their evolving implementation (the DPDP Act regime and new guidance) place fresh limits and compliance burdens on how personal data may be pooled or exported for model training.
“Data is a mixed picture. India has rich domain data in finance, healthcare, and e-governance, but rights, cleanliness, and multilingual coverage are uneven. Building licensed, de-duplicated, and well-labeled corpora with strong governance is the main gap, not raw bytes,” Ramamoorthy said.
Similarly, Goyal notes that while India’s large population generates substantial data that could be leveraged for development, much of this data is fragmented. Additionally, a significant portion is controlled by global tech firms, making access challenging. She stated, “India has 1.45 billion people generating digital footprints, but most data is locked within platforms owned by global tech firms. There's a severe shortage of high-quality, culturally grounded multilingual datasets needed for India-specific models.”
Ramamoorthy identifies compute as the weakest pillar in India’s foundational LLM efforts. He notes that while sourcing high-end accelerators is feasible, supply is too unreliable to plan around.
Allocation depends on long-term cloud contracts, strategic relationships, and export controls, which can change abruptly. Even with purchase orders, delivery delays, spare parts, and replacement boards pose significant schedule risks. Building domestic semiconductor fabs at cutting-edge nodes is a multi-decade endeavor.
While advanced packaging and assembly are more achievable in the near term, they do not address the critical need for access to leading process nodes and HBM supply.
“Teams can rent H100 class instances, but allocations are bursty, queues are long, and many clusters lack the network topology needed for efficient data-parallel and pipeline-parallel training. Power and cooling at sustained density are non-trivial,” Ramamoorthy said.
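As background on the data-parallel training he mentions, below is a hedged, minimal sketch using PyTorch’s DistributedDataParallel, which replicates a model on each GPU and averages gradients across them every step. The tiny model and random data are placeholders; at cluster scale, that gradient exchange is exactly where interconnect bandwidth and network topology start to dominate.

```python
# Minimal sketch of data-parallel training with PyTorch DistributedDataParallel.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
# The linear model and random batches are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimiser = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = ddp_model(x).pow(2).mean()
        optimiser.zero_grad()
        loss.backward()      # gradients are all-reduced across GPUs here
        optimiser.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```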
Goyal emphasises that relying on the US for semiconductor chip imports is not a viable long-term solution. She explains, “Through India’s Semiconductor Mission, we have approved 10 projects worth approximately $18.2 billion.” However, she notes that these facilities will produce 28- to 90-nanometer technology, whereas leading US fabs manufacture 3-nanometer and 2-nanometer chips, each requiring over $20 billion in investment, an amount India has not yet committed to.
“India is definitely getting there. It is just because of this catch-up, the Indian semiconductor manufacturing will take at least 5-10 years,” Goyal said.