By Sree Sivaram, Writer, IBM Blue Studio
India’s linguistic diversity is staggering: 22 official languages, hundreds of dialects and a billion daily conversations shaped by region, context and culture. For AI systems to truly serve this diverse market, experts say, they need to work well across languages and cultures. That’s the promise behind OpenAI’s IndQA, a new benchmark designed to evaluate how well AI models reason across Indian languages and cultural contexts.
Unlike traditional benchmarks that might focus on grammar or syntax, IndQA tests for cultural fluency. Can a model interpret a proverb in Tamil? Reason through a literary scenario in Hindi? Understand the nuance of a Bengali news article? With 12 languages and 10 domains, including law, health, history and current affairs, IndQA helps push AI systems to engage with India as it is: complex, multilingual and deeply contextual.
For developers and tech leaders building AI solutions in India, IndQA signals a shift from translation to localization. It’s not just about making AI speak Indian languages anymore. It’s about making it think in them. That’s a critical leap for sectors like healthcare, fintech and government services, where accuracy and cultural relevance aren’t optional; they’re requirements.
Benchmarks like IndQA are also helping enterprises evaluate model performance before deployment. IndQA tests not just translation, but reasoning, accuracy across 10 domains and cultural fluency in 12 Indian languages. Because responses are graded against expert-defined criteria and ideal answers, OpenAI claims, IndQA offers a clearer picture of real-world readiness.
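To make that grading approach concrete, here is a minimal Python sketch of how rubric-based scoring of this kind could work: each response is checked against expert-defined criteria, and the weighted share of criteria met becomes the score. The class, function and parameter names are illustrative assumptions, not OpenAI’s actual implementation.

```python
# A minimal, illustrative sketch of rubric-based grading in the spirit of
# IndQA's expert-defined criteria. Names and data layout are assumptions
# for illustration, not OpenAI's actual implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    description: str   # e.g. "Explains the proverb's literal meaning"
    weight: float = 1.0

def grade_response(
    response: str,
    rubric: list[RubricItem],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the weighted fraction of rubric criteria satisfied (0.0 to 1.0).

    `judge` is any callable (a human reviewer or an LLM-as-judge wrapper)
    that returns True when the response meets the given criterion.
    """
    total = sum(item.weight for item in rubric)
    earned = sum(
        item.weight for item in rubric if judge(response, item.description)
    )
    return earned / total if total else 0.0
```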
As AI adoption accelerates across Indian industries, tools that measure real-world readiness, especially in regional contexts, will become essential. “For enterprises operating in India, linguistic accuracy is not a nice-to-have; it’s mission critical,” said Jasbir Kaur, Engagement Acceleration Leader at IBM’s Innovation Studio in Bangalore, in an interview with IBM Think. “Benchmarks like [these] give organizations the confidence that their AI solutions are not just functional, but culturally and contextually intelligent. This is the difference between deploying a tool and delivering trust.”
IBM’s work offers a clear example of how this benchmarking trend is playing out in practice. In 2024, IBM researchers released MILU, an open-source, multi-task Indic language understanding benchmark designed to evaluate LLMs on cultural knowledge across 11 Indic languages and 41 subjects. This year, IBM’s Granite 4.0 models have shown promising performance on Indian-language knowledge tasks at reduced cost and latency. Trained on nearly 100 billion tokens of Indian-language data during pre-training and 1.5 million instances during post-training, the Granite 4.0 models surpassed comparable models on several Indic skill benchmarks in new research, including the MILU-IN, BoolQ and Sanskriti datasets.
IndQA wasn’t built in isolation. It was shaped by 261 Indian researchers, linguists and domain experts, who contributed questions, reviewed outputs and ensured cultural fidelity. These contributors, fluent in both English and their native languages, crafted complex, reasoning-heavy prompts rooted in regional context. For each question, they also defined grading criteria, ideal answers and English translations to ensure clarity. Every prompt went through peer review and multiple iterations before final sign-off, making IndQA not just a benchmark, but a rigorously built standard for culturally grounded AI evaluation.
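For readers who want to picture what one of these records might contain, the sketch below models a single item with the fields described above: the question, its language and domain, the expert grading criteria, an ideal answer and an English translation. The field names are hypothetical; IndQA’s published format may differ.

```python
# Hypothetical shape of a single culturally grounded benchmark item,
# modeled on the fields described in the article; not IndQA's published schema.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str                # reasoning-heavy prompt in the native language
    language: str                # e.g. "Tamil", "Hindi", "Bengali"
    domain: str                  # e.g. "law", "health", "history"
    grading_criteria: list[str]  # expert-defined criteria for a strong answer
    ideal_answer: str            # reference answer written by the contributor
    english_translation: str     # added for clarity during review
    peer_reviewed: bool = False  # set once the item passes review and sign-off
```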
This emphasis on cultural and domain relevance echoes IBM’s work in responsible AI. Projects such as IBM’s open-source AI Fairness 360 toolkit and ongoing research at IBM India highlight the importance of transparency and fairness in model development. As India’s AI ecosystem matures, collaborations across academia, industry and government will be critical to building systems that serve all users, not just English-speaking ones.
“Open-source benchmarks play a crucial role in advancing AI research and development as they make progress measurable, transparent and reproducible,” said Michal Shmueli-Scheuer, Distinguished Engineer, IBM AI Benchmarking and Evaluation. “By providing shared datasets and evaluation protocols, they create a common yardstick for comparing models, fostering collaboration across academia and industry, and speeding up innovation.”
Developments like OpenAI’s IndQA and IBM Research India’s MILU benchmark are part of a broader shift toward evaluation tools that reflect India’s linguistic and cultural richness. As benchmarks evolve, users and developers alike will need more tools that test AI’s ability to navigate that richness: from code-switching in Hinglish to cultural references in Malayalam cinema. For enterprise builders, the takeaway is clear: inclusive AI starts with inclusive evaluation.