Why India Needs Its Own LLM: Insights into the Challenges and Opportunities

Building a large language model (LLM) from scratch requires a combination of large datasets, compute power (GPUs), a skilled workforce, capital, and infrastructure. India has not yet produced an LLM that can compete with global leaders such as OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, DeepSeek-R1, and others.

India’s tech ecosystem was initially reluctant to develop a homegrown LLM. However, after the rise of DeepSeek, the country has recognised the importance of building one locally. The tech industry now understands that a homegrown LLM can better meet regional needs and reduce reliance on foreign models, addressing privacy and security concerns. It also opens the door to vernacular LLMs, which foreign models have struggled to provide.

Investment Challenge for AI Start-Ups

Several Indian AI start-ups, including Ola’s Krutrim, CoRover AI, Sarvam AI and Uniphore, have attempted to build LLMs within India. They have developed functional AI models from scratch, but none has so far achieved mass adoption.

India's limited impact on the AI industry is largely due to a lack of investment in AI start-ups. While the US is investing $500bn through the ‘Stargate Project’ to strengthen its AI infrastructure and counter foreign competition, and Britain is rolling out a $14bn AI Opportunities Action Plan, Indian AI start-ups are facing significant funding challenges.

In 2024, Indian AI start-ups raised a total of $166mn, which is significantly lower than the $518.2mn raised in 2022, according to a Business Standard report.

Sunil Gupta, Co-Founder, CEO & MD, Yotta Data Services said, “To truly make a global impact, India must accelerate investments in AI research, enhance computational infrastructure, and build stronger public-private partnerships. Fostering an environment that encourages collaboration between start-ups, academia, and industry leaders is crucial. By creating a robust AI ecosystem with the right policies and resources, India can solidify its place as a leading AI hub, driving innovation and setting global standards.”

The Union Budget 2025 has attempted to tackle this problem by announcing a DeepTech Fund of Funds (FoF). The Government of India is considering establishing this FoF to support the next generation of start-ups. Additionally, the government plans to provide 10,000 fellowships for tech research in IITs and IISc over the next five years.

The DeepTech FoF follows the proposal to expand the Small Industries Development Bank of India (SIDBI) Fund of Funds for start-ups (FFS) by allocating an additional Rs 10,000 crore to the scheme to enhance funding opportunities. The budget also includes an investment of approximately Rs 20,000 crore to promote private-sector innovation.

Beyond the lack of investment, AI initiatives in India face several other challenges in developing a model that meets global standards. Some of these are outlined below.

LLM Training Data

Data is crucial for LLM development as it forms the foundation for training the model. Companies and start-ups collect LLM training data from sources like websites, online books, research papers, code repositories, and open datasets such as Common Crawl and Wikipedia. Most training data comes from web scraping, where text is collected from different websites, while open-source datasets like Common Crawl provide pre-curated text for AI training.
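
To give a concrete sense of the web-scraping step described above, here is a minimal sketch that fetches a single page and keeps only its longer visible paragraphs. The URL and the word-count threshold are illustrative placeholders, not a description of any particular company’s pipeline.

```python
# Minimal sketch of one step in a web-scraping data pipeline:
# download a page, strip markup, and keep only reasonably long text blocks.
# The URL and thresholds are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_page_text(url: str, min_words: int = 20) -> list[str]:
    """Fetch a page and return paragraphs long enough to be useful as training text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Drop script/style tags so only human-readable text remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return [p for p in paragraphs if len(p.split()) >= min_words]

if __name__ == "__main__":
    for para in scrape_page_text("https://en.wikipedia.org/wiki/Large_language_model"):
        print(para[:80])
```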

Data collection for LLM training involves several steps: data sourcing, cleaning, annotation and augmentation. The collected text is then labelled and filtered to make it fit for training, with additional checks for privacy, bias mitigation, scalability and quality control.
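
As an illustration of what the cleaning and quality-control stage might look like in practice, the sketch below normalises whitespace, scrubs e-mail addresses as a crude privacy check, filters by length, and removes exact duplicates. The specific heuristics and thresholds are assumptions chosen for the example.

```python
# Illustrative cleaning/filtering pass over raw text records.
# The heuristics (length limits, email scrubbing, exact dedup) stand in for the
# privacy, quality-control and deduplication checks described above.
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def clean_corpus(records: list[str], min_words: int = 20, max_words: int = 2000) -> list[str]:
    seen_hashes = set()
    cleaned = []
    for text in records:
        text = " ".join(text.split())                 # normalise whitespace
        text = EMAIL_RE.sub("[EMAIL]", text)          # crude privacy scrub
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):   # length/quality filter
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                     # exact-match deduplication
            continue
        seen_hashes.add(digest)
        cleaned.append(text)
    return cleaned

if __name__ == "__main__":
    raw = ["Contact me at test@example.com for the dataset. " * 5,
           "too short",
           "Contact me at test@example.com for the dataset. " * 5]
    print(len(clean_corpus(raw)))  # -> 1 after filtering and dedup
```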

Given its population, India has vast amounts of data, especially across its diverse languages, but it lacks sufficient high-quality digital data in many Indian languages to build strong LLMs, particularly when compared with English. This presents a major challenge in creating effective LLMs tailored to the Indian context.

Initiatives like "Bhasha Daan" are working to address this gap by crowdsourcing data in various Indian languages.
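
To show how crowdsourced vernacular text can be sorted by language, here is a minimal, self-contained sketch that tags text by Unicode script (Devanagari, Tamil, Bengali and so on). It is only a toy: production pipelines would normally use a trained language-identification model rather than script ranges.

```python
# Toy script tagger for sorting crowdsourced Indic text.
# Real pipelines would use a trained language-ID model; this only inspects
# Unicode ranges, so it distinguishes scripts, not individual languages.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),   # Hindi, Marathi, ...
    "Bengali":    (0x0980, 0x09FF),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def dominant_script(text: str) -> str:
    counts = {name: 0 for name in SCRIPT_RANGES}
    counts["Latin/other"] = 0
    for ch in text:
        code = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= code <= hi:
                counts[name] += 1
                break
        else:
            if ch.isalpha():
                counts["Latin/other"] += 1
    return max(counts, key=counts.get)

if __name__ == "__main__":
    print(dominant_script("भारत को अपना एलएलएम चाहिए"))   # Devanagari
    print(dominant_script("India needs its own LLM"))      # Latin/other
```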

CoRover founder & CEO Ankush Sabharwal said, “Data is crude oil, a raw material waiting to be refined. AI platforms and solutions are the game-changers that transform it into usable fuel. India, with its vast data reserves, is poised to create innovative AI solutions that benefit our nation and the world. Let's trust Indian engineers and scientists to drive research and development. Let's prioritise long-term impact over immediate gains.”

GPU Imports

Processing vast amounts of data requires powerful GPUs (Graphics Processing Units), which enable efficient computation through parallel processing. More data improves model performance, but without high-performance GPUs, training on large datasets would be slow and impractical.
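
As a rough illustration of why that parallelism matters, the sketch below times the same dense matrix multiplication on the CPU and, if one is available, on a CUDA GPU using PyTorch; the matrix size is an arbitrary assumption for the example.

```python
# Rough illustration of GPU vs CPU throughput on the dense matrix maths
# that dominates LLM training. The matrix size is an arbitrary placeholder.
import time
import torch

def time_matmul(device: str, size: int = 4096) -> float:
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    if device == "cuda":
        torch.cuda.synchronize()          # finish setup before timing
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()          # wait for the asynchronous GPU kernel
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"CPU: {time_matmul('cpu'):.3f} s")
    if torch.cuda.is_available():
        print(f"GPU: {time_matmul('cuda'):.3f} s")
    else:
        print("No CUDA GPU detected; skipping GPU timing.")
```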

For LLM server applications, ‘Professional’ or ‘Compute’ level GPUs are recommended due to their larger VRAM and better suitability for the cooling requirements of server chassis. Examples include NVIDIA's RTX 6000 Ada, L40S, and H100, as well as AMD's Instinct MI-series GPUs.
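
A back-of-the-envelope calculation shows why the larger VRAM on such cards matters: the sketch below estimates the memory needed just to hold a model's weights at different precisions (training requires considerably more for gradients, optimizer states and activations). The parameter counts used are illustrative examples, not figures from any specific Indian model.

```python
# Back-of-the-envelope VRAM estimate for holding model weights alone.
# Training needs considerably more (gradients, optimizer states, activations).
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

def weight_memory_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1024**3

if __name__ == "__main__":
    for billions in (7, 13, 70):                      # illustrative model sizes
        params = billions * 1e9
        line = ", ".join(f"{p}: {weight_memory_gb(params, p):.0f} GB"
                         for p in BYTES_PER_PARAM)
        print(f"{billions}B params -> {line}")
```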

To address this challenge, Reliance Industries has partnered with Nvidia to build AI supercomputers and LLMs catering to India’s diverse languages. Mukesh Ambani is reportedly purchasing AI semiconductors from Nvidia to power his data centre in Jamnagar, which is set to become the world’s largest by capacity.

However, US President Donald Trump recently announced that the US is imposing restrictions on GPU exports to other nations, including India. In this context, the cost-efficient approach demonstrated by China's DeepSeek could play a key role in enabling the development of homegrown LLMs in India.
