Outlook Business Desk
Microsoft has launched three new models under its Microsoft AI (MAI) family, spanning transcription, voice and image generation, as the company steps up efforts to expand multimodal artificial intelligence (AI) tools for developers across platforms.
The lineup features MAI-Transcribe-1 for converting speech into text, MAI-Voice-1 for building customised voices and MAI-Image-2 for generating images. Together, these tools are designed to support a wide range of AI applications within one connected ecosystem.
According to reports, these models are accessible from today through Microsoft Foundry and the MAI Playground. Foundry serves as a unified platform for building and scaling generative AI applications, while Playground allows users to test features and share feedback.
Mustafa Suleyman, head of Microsoft’s AI division, said the models went through extensive testing and red-teaming processes. He added that Foundry includes built-in guardrails, governance features and enterprise-grade controls to support secure and compliant AI deployment.
MAI-Transcribe-1 enables speech-to-text conversion across 25 commonly used languages, including Hindi. Microsoft says it achieves lower mean word error rates, indicating higher accuracy, when compared with rival models such as Google’s Gemini 3.1 Flash and OpenAI’s GPT-Transcribe.
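For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word-level substitutions, deletions and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not Microsoft's evaluation code, the function names and sample sentences are hypothetical, and it assumes simple whitespace tokenisation with no text normalisation.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    real benchmarks usually normalise case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_word_error_rate(pairs: list[tuple[str, str]]) -> float:
    """Average the per-utterance WER over (reference, hypothesis) pairs."""
    return mean(word_error_rate(ref, hyp) for ref, hyp in pairs)


if __name__ == "__main__":
    # Hypothetical sample: one substitution and one deletion in six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value means fewer word-level mistakes, which is why a lower mean WER is read as higher accuracy.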
The transcription model can handle batch processing up to 2.5 times faster than Microsoft’s earlier Azure Fast service. It comes with a starting price of $0.36 per hour, aiming to deliver better efficiency while remaining cost-effective for developers.
MAI-Voice-1 allows developers to build customised voices using only a few seconds of audio input. The model can produce up to 60 seconds of speech within one second, with pricing starting at $22 per one million characters.
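As a rough guide to what the quoted starting prices mean in practice, the Python sketch below multiplies them out for a given workload. It assumes the $0.36 figure applies per hour of audio transcribed and that billing scales linearly. Actual charges may vary by tier, region or usage, and the function names are hypothetical.

```python
# Starting prices as quoted in this article (US dollars).
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36   # MAI-Transcribe-1 (assumed: per hour of audio)
VOICE_USD_PER_MILLION_CHARS = 22.0     # MAI-Voice-1


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given hours of audio."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesising speech for the given character count."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


if __name__ == "__main__":
    # Hypothetical workload: 100 hours of audio and 500,000 characters of text.
    print(f"Transcription: ${transcription_cost(100):.2f}")
    print(f"Voice: ${voice_cost(500_000):.2f}")
```

On those assumptions, 100 hours of transcription would come to about $36 and 500,000 characters of generated speech to about $11.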
MAI-Image-2, first introduced in the MAI Playground last month, is now widely available via Foundry. The model generates images at least twice as fast as earlier versions while maintaining consistent output quality.
Microsoft is rolling out these models across products like Copilot, Bing and PowerPoint. The company said enterprises have already started adopting them, pointing to rising demand for scalable multimodal AI tools among businesses and developer communities.