ASR State of the Art: IndicWav2Vec

Jyoti Dabass, Ph.D
Python in Plain English
3 min read · Jan 10, 2024

Automatic speech recognition (ASR) is a technique for automatically decoding and transcribing spoken language. An ASR system typically extracts features from audio recordings or streams and then uses one or more algorithms to map those features to the corresponding text.
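
To make the "features to text" step concrete, here is a minimal sketch of greedy CTC decoding, the kind of mapping commonly used with fine-tuned wav2vec 2.0 style models. The vocabulary, the frame ids, and the function name `ctc_greedy_decode` are made up for illustration; a real model predicts one id per audio frame from its own vocabulary.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then drop the blank symbol."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


# Hypothetical vocabulary and per-frame predictions (assumptions, not model output)
vocab = {0: "<blank>", 1: "n", 2: "a", 3: "m", 4: "s", 5: "t", 6: "e"}
frames = [1, 1, 0, 2, 2, 0, 3, 3, 3, 0, 2, 4, 4, 0, 5, 6, 6]

text = ctc_greedy_decode(frames, vocab)
print(text)  # namaste
```

The blank symbol is what lets a decoder of this kind output a doubled letter: the frames `[1, 0, 1]` decode to "nn", while `[1, 1]` decode to a single "n".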

Automatic speech recognition

IndicWav2Vec is a multilingual speech model pretrained on 40 Indian languages. Among multilingual speech models, it covers the widest variety of Indian languages.

This post walks you through using IndicWav2Vec to convert speech to text. It uses the librosa library for resampling, the ai4bharat/indicwav2vec models through the Hugging Face transformers pipeline, and a Gradio interface for recording and displaying results.

import time
from transformers import pipeline
import gradio as gr
import numpy as np
import librosa

transcriber_hindi = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec-hindi")
transcriber_bengali = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec_v1_bengali")
transcriber_odia = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec-odia")
transcriber_gujarati = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec_v1_gujarati")
# transcriber_telugu = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec_v1_telugu")
# transcriber_sinhala = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec_v1_sinhala")
# transcriber_tamil = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec_v1_tamil")
# transcriber_nepali = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec_v1_nepali")
# transcriber_marathi = pipeline("automatic-speech-recognition", model="ai4bharat/indicwav2vec_v1_marathi")

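languages = ["hindi", "bengali", "odia", "gujarati"]

# Map each language name to its pipeline (avoids calling eval on user input)
transcribers = {
    "hindi": transcriber_hindi,
    "bengali": transcriber_bengali,
    "odia": transcriber_odia,
    "gujarati": transcriber_gujarati,
}

def resample_to_16k(audio, orig_sr):
    # Resample the waveform from its original rate to 16 kHz
    y_resampled = librosa.resample(y=audio, orig_sr=orig_sr, target_sr=16000)
    return y_resampled

def transcribe(audio, lang="hindi"):
    if lang not in languages:
        return "No Model", "So Stay tuned!"
    sr, y = audio  # Gradio passes (sample_rate, samples)
    y = y.astype(np.float32)
    if y.ndim > 1:
        y = y.mean(axis=1)  # down-mix multi-channel audio to mono
    peak = np.max(np.abs(y))
    if peak > 0:
        y /= peak  # peak-normalize; skipped for silent input to avoid dividing by zero
    y_resampled = resample_to_16k(y, sr)
    pipe = transcribers[lang]
    start_time = time.time()
    trans = pipe(y_resampled)
    end_time = time.time()

    # Return the transcript and the inference time in seconds
    return trans["text"], (end_time - start_time)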

demo = gr.Interface(
    transcribe,
    inputs=["microphone", gr.Radio(["hindi", "bengali", "odia", "gujarati"], value="hindi")],
    # inputs=["microphone",gr.Radio(["hindi","bengali","odia","gujarati","telugu","sinhala","tamil","nepali","marathi"],value="hindi")],
    outputs=["text", "text"],
    examples=[
        ["./Samples/Hindi_1.mp3", "hindi"], ["./Samples/Hindi_2.mp3", "hindi"],
        ["./Samples/Hindi_3.mp3", "hindi"], ["./Samples/Hindi_4.mp3", "hindi"],
        ["./Samples/Hindi_5.mp3", "hindi"], ["./Samples/Tamil_2.mp3", "hindi"],
        ["./Samples/climate ex short.wav", "hindi"],
        ["./Samples/Gujarati_1.wav", "gujarati"], ["./Samples/Gujarati_2.wav", "gujarati"],
        ["./Samples/Bengali_1.wav", "bengali"], ["./Samples/Bengali_2.wav", "bengali"],
    ],
)
# examples=[["./Samples/Hindi_1.mp3","hindi"],["./Samples/Hindi_2.mp3","hindi"],["./Samples/Tamil_1.mp3","tamil"],["./Samples/Tamil_2.mp3","hindi"],["./Samples/Nepal_1.mp3","nepali"],["./Samples/Nepal_2.mp3","nepali"],["./Samples/Marathi_1.mp3","marathi"],["./Samples/Marathi_2.mp3","marathi"],["./Samples/climate ex short.wav","hindi"]])
demo.launch()
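
The audio preparation inside transcribe (convert to mono, scale by the peak value) can be checked in isolation before any model is loaded. The sketch below is a dependency-free illustration using plain Python lists; the helper names `to_mono` and `peak_normalize` are hypothetical, and in the code above the same steps are done with NumPy arrays.

```python
def to_mono(samples):
    """Average the channels when each sample is a (left, right, ...) pair."""
    if samples and isinstance(samples[0], (list, tuple)):
        return [sum(frame) / len(frame) for frame in samples]
    return [float(s) for s in samples]


def peak_normalize(samples):
    """Scale samples so the largest absolute value is 1.0; leave silence unchanged."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # nothing to scale, and dividing would fail
    return [s / peak for s in samples]


# Three made-up stereo samples in 16-bit integer range
stereo = [(1000, 3000), (-4000, -4000), (0, 2000)]
mono = to_mono(stereo)
normalized = peak_normalize(mono)
print(normalized)  # [0.5, -1.0, 0.25]
```

Guarding the silent case matters in practice: a recording that contains only zeros has a peak of 0, and dividing by it would fill the array with invalid values before it ever reaches the model.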

You can try the models in the Hugging Face Space "Ai4bharat Indicwave2vec Models" by ashokrawat2023. For demonstration purposes, results are shown for transcribing speech in Hindi, Bengali, Gujarati, and Odia.

Results

Cheers!! Happy reading!! Keep learning!!

Please upvote if you liked this!! Thanks!!

You can connect with me on LinkedIn (Jyoti Dabass, Ph.D) and GitHub (jyotidabass) for more related content. Thanks!!

Researcher and engineer with an interest in data science, analytics, marketing, image analysis, computer vision, fuzzy logic, and natural language processing.