In a quiet corner of her home, surrounded by cushions doubling as soundproofing and a mic tuned to pick up every breath, Manasi Agavekar from Maharashtra prepares to speak—not just into a microphone but into the ears and hearts of her future listeners.
A voice-over artist at Rian.io, an AI-powered platform for video localisation and document translation, Manasi has lent her voice to a wide range of content: children’s stories, educational videos, bank reports, investment explainers and advertisements. Each piece, she says, demands a different version of her, a different emotion and a different heartbeat.
“You cannot just read the lines. You must live them,” she says softly. “A child’s story needs joy, innocence. An ad needs conviction. A medical report needs calm clarity. The voice must carry more than just meaning—it must carry feeling.”
Her journey into voice-over began unexpectedly. Initially part of the team building Rian’s translation technology, she was trained in voice work during the pandemic as the need for multilingual, multimedia content surged. She took it seriously, undergoing sessions in voice modulation, breathing and diction across Hindi, Marathi and English. Each language, she says, requires its own discipline, its own emotional compass.
As the calls for domestic large language models (LLMs) grow, one crucial ingredient that remains hard to source is quality voice data, especially in regional Indian languages.
Experts point to two reasons this data is crucial. “The amount of online text data in Indian languages is significantly less compared to English. Even for Hindi, which is widely spoken, the availability of digital text is low. This scarcity becomes even more stark when we look at other regional languages,” says Anand Shiralkar, founder, Rian.io.
The second is the growth of voice usage itself: even casual users start relying on voice commands once they discover how easy they are to use on mobile. In a country with over 900mn internet users, where literacy is not a prerequisite for voice use, the potential for start-ups working in this space is huge.
This trend is evident in the numbers. According to Grand View Research, India’s voice commerce market was valued at $1.57bn in 2024 and is projected to reach $7.47bn by 2030, growing at a compound annual growth rate of 32% between 2025 and 2030.
Voice data captures accents, tones and literacy variations, making it more inclusive than text. But limited high-quality regional data often leads to AI outputs that sound outdated or overly formal. For instance, Tamil voice models have been found to reflect older, standard usage rather than everyday language, affecting applications like speech recognition and voice assistants.
That is where start-ups like Rian.io and AiVanta come in. AiVanta focuses on AI-generated video content and works closely with content creators, voice-over artists and even struggling actors to build a repository of high-quality voice samples in Indian languages.
These samples are then monetised through AI voice generation platforms like ElevenLabs, enabling artists to earn passive income whenever their voice is used.
Currently, AiVanta is working with eight languages (Hindi, English, Tamil, Telugu, Kannada, Bengali, Marathi and Punjabi) across studios in Mumbai, Delhi, Bangalore and Goa. “We invite artists to our studios to record for two hours—either reading scripts or engaging in casual conversation. Once the audio is processed and uploaded, every time their voice is used, they get paid,” Ahuja adds. Over 1,200 artists have worked with AiVanta so far, some earning lakhs per month through this model, with payments made per use rather than upfront fees.
Bootstrapping Voice Data
For Vaibhav Vats Shukla, founder of Quansys AI, a start-up building a fully autonomous BPO powered by voice AI, creating a high-quality voice dataset in Indian languages wasn’t a luxury; it was a necessity. But professional voice datasets come at a steep cost, often ₹3,000 to ₹5,000 per recorded minute. So instead of buying data, he decided to build it from scratch.
It started in August 2024, when he and his co-founders began recording their own voices. But to truly reflect India’s linguistic diversity, they needed more. Vaibhav reached out to about 20 close friends who spoke different Indian languages (Gujarati, Marathi, Bengali, Kannada, Tamil and more). These weren’t voice artists; they were just everyday people. “We explained what we were building and asked if they could help by recording paragraphs in their native language,” he says. Most of the recordings came through WhatsApp: raw, real and full of background noise.
Using internal tools, the team cleaned these recordings, removed filler words and annotated them in detail. They even developed frameworks to augment this real voice data with synthetic samples, scraping, indexing and generating variations to better train their AI models. “The idea was to simulate data patterns at scale,” says Vaibhav. “We wanted our models to understand not just languages, but dialects, how Hindi is spoken in Rajasthan versus Bihar or Madhya Pradesh.”
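Pipelines like the one Vaibhav describes typically work from word-level transcripts with timestamps. A minimal sketch of the filler-removal step in plain Python (the filler list, function name and timing threshold are illustrative assumptions, not Quansys AI’s actual tooling):

```python
# Illustrative filler list; a real pipeline would cover many
# languages and code-mixed speech.
FILLERS = {"um", "uh", "haan", "matlab", "toh"}

def clean_segments(words, min_gap=0.3):
    """words: list of (token, start_sec, end_sec) from a transcript.
    Drops filler tokens, then merges the remaining words into speech
    spans, starting a new span whenever the silence between words
    exceeds min_gap seconds. Returns (clean_tokens, spans)."""
    kept = [(t, s, e) for t, s, e in words if t.lower() not in FILLERS]
    spans = []
    for _, s, e in kept:
        if spans and s - spans[-1][1] < min_gap:
            spans[-1] = (spans[-1][0], e)   # extend the current span
        else:
            spans.append((s, e))            # begin a new span
    return [t for t, _, _ in kept], spans
```

The spans can then be used to cut the audio file itself, keeping only the stretches worth training on.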
By October, after an early product failure marked by poor pronunciation and tone, the team doubled down on building a richer, cleaner dataset. Today, their training corpus spans 1,500 hours of recorded audio, born not from expensive studios but from a network of friends, a WhatsApp group and a lot of scrappy determination.
The Dialect Divide
Jigar Doshi, director of Machine Learning (ML) at Artpark, an R&D non-profit housed inside the Indian Institute of Science (IISc), Bengaluru, has been tirelessly working on Project Vaani, funded by Google. “We want to collect diverse data from every zip code in India,” he explains, speaking of a vision that includes 150,000 hours of speech from over 1mn people across 773 districts, open-sourced and linguistically rich.
To collect this data, Doshi and his team ask participants to describe images (photos of weddings, beaches, offices), prompting natural speech in their native tongues. The audio and text are captured and paired, building a treasure trove of multimodal data essential for training the next generation of AI models.
But the process is messy. Vendors are sent to districts with specific demographic targets: age, gender and educational level. “We make sure we’re not just recording literate populations,” says Doshi. “We go to public health centres and anganwadis. We sit with people, offer incentives and ask them to describe what they see.”
It also comes with its own challenges. He recounts how one vendor, trying to cheat the system, recorded himself multiple times under different names. Only an ML model trained to match speaker identities caught the repetition.
Collecting data is just one part; ensuring it’s clean and useful is another. At scale, the team must detect voice overlap, background noise, abuse or faked gender identity. “Someone listed themselves as female but it was clearly a male voice,” Doshi recalls. “You have to build safeguards. You cannot trust that everyone has good intentions, even if they mean well.”
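One common way to build such a safeguard is speaker verification: represent each contributor’s voice as an embedding vector and flag suspiciously similar pairs. A hedged, minimal sketch, where toy two-dimensional vectors stand in for embeddings a trained speaker model would produce, and the threshold is illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def flag_duplicates(embeddings, threshold=0.9):
    """embeddings: dict of contributor name -> voice embedding.
    Returns pairs of names whose voices look like the same speaker."""
    names = list(embeddings)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine(embeddings[a], embeddings[b]) > threshold
    ]
```

In practice the embeddings would come from a speaker-verification model, and flagged pairs would go to a human reviewer rather than being rejected automatically.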
Sometimes, the smallest oversights cause the biggest problems. Doshi, who has long worked in data collection, recalls a challenge during his Covid Cough project—an initiative to collect audio samples of coughs for research. In one district, over 60% of the submissions were rejected. After digging deeper, Doshi discovered that a phone case was covering the microphone, muffling every single recording. “There’s no way we could’ve caught that without going on the ground,” he says.
So, Doshi goes to the ground. Frequently. He’s visited Rajasthan, Telangana and even the border areas of Punjab to fix broken models in the field. Sometimes it means debugging AI systems on-site. Other times, it is watching frontline workers struggle to type out words like “hemoglobin.”
All this data is intended to be open source, available to start-ups, governments and researchers.
But it’s not that simple. “At the moment, only big tech companies can truly make use of it,” says a source. Start-ups often lack the engineering muscle to convert raw voice datasets into working models. “They want plug-and-play modules. We’re still in the raw material stage,” the source adds.
The race for regional data is not limited to tech giants; players big and small are now chasing it. As OpenAI’s India policy head Pragya Misra said at an event in Delhi, “I often tell my colleagues: our emails may look great but when I actually turn on our audio model and try to speak to it in Hindi or even a bit of Tamil or Rajasthani, it doesn’t deliver for me. It works well in Hindi but not in many other languages. And that’s a gap.” It remains to be seen who wins the race.