Indic language generative AI in the works at top Indian colleges
Even as tech giants push platforms such as ChatGPT, Bing and Bard, top engineering colleges in India are taking on a growing number of generative artificial intelligence (AI) research projects, many of which seek to understand how the technology can help build tools akin to OpenAI's ChatGPT, but for Indian languages.
Generative AI platforms have been all the rage since the second half of last year, as Microsoft and Google fold these programs into their existing services. Even the Ministry of Electronics and Information Technology (MeitY), on Feb 3, said it is "cognizant" of the emergence and proliferation of generative AI, and noted that AI can be a "kinetic enabler" for growth in India.
However, researchers at these institutes point to a host of challenges for generative AI projects in academia, the biggest being the scarcity of Indic-language data, the cost of such projects, and the scale of computing power needed. Indian researchers have been working on such projects for more than three years.
“In academia, we’re using techniques from language models, namely the transformer architecture, for different tasks such as classification of data, answering questions, machine translation and building chatbots,” said Tapas Kumar Mishra, assistant professor of computer science engineering at National Institute of Technology (NIT), Rourkela.
The transformer architecture is the underlying algorithm for generative AI tools: it can process conversational human-language inputs and, after understanding context, generate output.
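To illustrate the kind of task these researchers describe, here is a minimal sketch of transformer-based translation using the open-source Hugging Face transformers library. The checkpoint named below is a public community model for Hindi-to-English, used purely as an illustration; it is not the model built at any of the institutes mentioned.

```python
# A minimal sketch of transformer-based Hindi-to-English translation
# with the Hugging Face transformers library. The checkpoint is a
# public community model, shown here only to illustrate the technique.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")

# Hindi input meaning "India is a vast country."
result = translator("भारत एक विशाल देश है।")
print(result[0]["translation_text"])
```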
While global platforms work mostly in English, Mishra said researchers on his team are working on languages such as Hindi, Bangla and Kannada, creating models that can take questions in these languages and generate output in English. They aren't using OpenAI's tools for this, but have achieved "very good" scores on the industry-standard Bilingual Evaluation Understudy (BLEU) metric.
He said NIT Rourkela has achieved scores of between 25 and 30 on Hindi-to-English translation, and 19 on Bangla-to-English. For reference, OpenAI's GPT-4 model scores 22.9 on English-to-French output. The institute published a research paper on Hindi-to-English translation last month with the Association for Computing Machinery, a US-based scientific and educational society that publishes research on natural language processing (NLP).
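BLEU works by comparing a model's output against human reference translations and scoring their n-gram overlap, typically on a 0-100 scale where higher is better. Below is a sketch of how such a score is computed with the open-source sacrebleu library; the sentences are invented for illustration, not data from the NIT Rourkela experiments.

```python
# Illustrative BLEU computation with the sacrebleu library.
# The hypothesis/reference sentences are invented examples.
from sacrebleu.metrics import BLEU

hypotheses = ["India is a vast country."]          # model outputs
references = [["India is a very large country."]]  # one reference stream, aligned per sentence

bleu = BLEU()
score = bleu.corpus_score(hypotheses, references)
print(f"BLEU: {score.score:.1f}")  # scores of 25-30 are solid for Hindi-English
```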
NIT Rourkela isn't the only one. Students at the Indian Institute of Technology (IIT) Madras have also taken up such projects. Harish Guruprasad, assistant professor of computer science engineering at IIT Madras, said one such project involves "better translated YouTube videos in Tamil".
"Students mostly took this up to compare their own research language models with GPT-4, and eventually publish a paper on new approaches to translating videos into Indian languages," he added.
Generative AI research at these institutes also extends beyond Indic languages. For instance, Debanga Raj Neog, assistant professor of data science and AI at IIT Guwahati, said the institute is working on creating "affordable visual animation models that study eyes and facial movements from open-source visual databases, and use this to replicate the process." IIT Guwahati, too, is working on a research paper on this.
Professor Mausam, founding head of the Yardi School of Artificial Intelligence at IIT Delhi, said the institute built a language model called 'MatSciBERT' in 2022, specifically for materials science research. "There are a lot of scientific publications in every field, with many collaborators. There are lots of papers and tasks to be done, such as property prediction of a material, and for that it was valuable to read a bunch of scientific articles in that domain and create a language model," he said. MatSciBERT has been downloaded tens of thousands of times, and people have been using it for various materials science tasks already, he added.
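For practitioners, loading such a domain-specific model for downstream work looks roughly like the sketch below. The Hugging Face checkpoint name is an assumption based on the group's public release and should be verified before use.

```python
# A sketch of loading MatSciBERT for downstream materials-science NLP
# tasks. The checkpoint identifier is assumed from the IIT Delhi
# group's public Hugging Face release; verify it before relying on it.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
model = AutoModel.from_pretrained("m3rg-iitd/matscibert")

inputs = tokenizer("SrTiO3 thin films exhibit high dielectric constants.",
                   return_tensors="pt")
outputs = model(**inputs)               # contextual embeddings for the sentence
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```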
The key problem for most researchers, though, is computing power. NIT Rourkela has 13 machines, each with a 24GB graphics processing unit (GPU). Mausam noted that the scale of compute required is "exorbitant and prohibitive".
"For instance, GPT-3 cost OpenAI $4.6 million for a single training run, not accounting for errors and re-trials. No company, apart from the top tech firms, can even think about training such models, within Indian academia or even in the industry. Looking to train India-specific language models is therefore premature, unless we create the compute infrastructure to be able to do so," he said.
A senior executive who formerly worked on government technology projects said, on condition of anonymity, that there is "a lack of clarity in terms of enabling access to India's supercomputer infrastructure owned by the MeitY-backed Centre for Development of Advanced Computing (C-DAC)." As Mint reported on July 6 last year, India's supercomputing power is also well behind global systems.
The executive added that while multiple top institutes, including IIT Delhi, have been consulted on using the infrastructure for their research initiatives, not much progress has taken place in this regard.
The availability of data is another problem for India. For instance, NIT Rourkela uses various public datasets, such as the Samanantar database released by IIT Madras. "This consisted of low-resource language pairs of Indic languages. We're also building our own datasets by scraping newspapers and converting to various languages, and then working on that. We're also using publicly available data, such as state government-backed local language data repositories," said Mishra.
To accelerate AI research in India, MeitY launched 'Bhashini' in May last year, an Indic-language database that institutes can tap.
However, access to data at the scale needed for such projects remains an issue. "When a language has a huge amount of data available, transformer architectures can produce great efficiency of translation. But with small amounts of data, this is difficult to work with. For instance, translating from Odia to Hindi, such models are not very efficient," IIT Madras' Guruprasad said.