By Khav SAROEUN, Software Engineer

Communication via voice message (instead of typing) is very popular in Cambodia, so there are many use cases for both speech-to-text (S2T) and text-to-speech in Khmer. This blog post details the framework we used to assess how well various S2T tools recognize and transcribe Khmer speech, using the Google Web Speech API as an example. We’ll walk through converting audio files into the correct format, automating the transcription process, and saving the output to a text file for analysis.

What We’ll Learn

  • How to convert audio files into a format compatible with speech-to-text processing.
  • How to use Python for speech recognition and transcription with the Google Web Speech API.
  • How to write the transcribed text to a file.

Tools & Libraries Required

  1. Python: The core language used for scripting the automation process.
  2. SpeechRecognition: A library that integrates with the Google Web Speech API to handle transcription of audio files.
  3. Pydub: A library for converting and manipulating audio files (e.g., .m4a to .wav format).
  4. FFmpeg: A multimedia framework that Pydub uses for audio conversion. Download and install it, then add it to your system PATH environment variable (running ffmpeg -version should display the version if it is installed correctly).
  5. os & pathlib: Python standard-library modules for file management; the examples here use pathlib to build paths and iterate over folders.
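
Since Pydub relies on the ffmpeg executable for non-WAV formats, it can help to confirm from Python that FFmpeg is reachable before running the scripts. The sketch below uses only the standard library; the helper names find_ffmpeg and ffmpeg_version_line are hypothetical and not part of Pydub.

```python
import shutil
import subprocess


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path to the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which(executable)


def ffmpeg_version_line(ffmpeg_path):
    """Return the first line printed by `ffmpeg -version` (empty string if none)."""
    result = subprocess.run(
        [ffmpeg_path, "-version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH; install it or add its folder to PATH")
    else:
        print(ffmpeg_version_line(path))
```

If the script reports that ffmpeg was not found, fix the PATH setting before moving on to the conversion step.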

Step-by-Step Code Guide

Note that all code here has been simplified (hardcoded paths, etc.) for readability.

1. Prepare the Audio File and Convert it

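In this step, we write a script that converts audio files to the correct format, using Pydub and FFmpeg together.

Most audio recordings are in .m4a, .ogg, .mp3, or .mp4 format, but the SpeechRecognition library reads audio files only in WAV, AIFF, or FLAC format, so we convert each recording to .wav before sending it to the Google Web Speech API. Here’s how to convert your files:

Install Pydub and FFmpeg (the apt-get command below is for Debian/Ubuntu; on other systems, install FFmpeg using its official installer or your package manager):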

pip install pydub
sudo apt-get install ffmpeg

Example code:

from pathlib import Path

from pydub import AudioSegment

# Folder containing the recordings to convert
entries = Path('/path/to/your/folder')

for entry in entries.iterdir():
    if entry.is_file() and entry.suffix == '.ogg':  # set your source file extension here
        voice_file_path = entry.resolve()
        wav_file_path = voice_file_path.with_suffix('.wav')

        try:
            # Read the source file and write a .wav copy next to it
            audio = AudioSegment.from_file(voice_file_path, format="ogg")
            audio.export(wav_file_path, format="wav")
        except Exception as e:
            print(f"Error processing {voice_file_path}: {e}")
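
The loop above handles one extension at a time. As a minimal sketch of handling several formats in one pass, the hypothetical helper plan_conversions below lists each source file together with its target .wav path and a format name derived from its extension. It assumes that the extension (without the dot) is a format name FFmpeg recognizes, which should be checked for your own files; the set of extensions is also an assumption.

```python
from pathlib import Path

# Extensions to convert; an assumed list, adjust it to match your recordings.
SOURCE_SUFFIXES = {".ogg", ".m4a", ".mp3", ".mp4"}


def plan_conversions(folder, suffixes=SOURCE_SUFFIXES):
    """Return (source_path, wav_path, format_name) for each matching file in folder."""
    plan = []
    for entry in sorted(Path(folder).iterdir()):
        suffix = entry.suffix.lower()
        if entry.is_file() and suffix in suffixes:
            # Target path sits next to the source; format name is the bare extension.
            plan.append((entry, entry.with_suffix(".wav"), suffix.lstrip(".")))
    return plan
```

Each tuple can then be fed to the same two Pydub calls used above: AudioSegment.from_file(source_path, format=format_name) followed by audio.export(wav_path, format="wav").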

2. Transcribing Audio with SpeechRecognition in Khmer

This script transcribes audio recordings into text for Khmer (km-KH), the official language of Cambodia. It uses the SpeechRecognition library, which sends the audio to the Google Web Speech API; the API supports Khmer language input.

Install SpeechRecognition (pathlib is part of the Python standard library and needs no separate install):

pip install SpeechRecognition

Example code:

import speech_recognition as sr
from pathlib import Path

recognizer = sr.Recognizer()

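# Path to a .wav file produced in step 1
wav_file_path = Path("record/file.wav")

# Load the whole file into memory as audio data
with sr.AudioFile(str(wav_file_path)) as source:
    audio_data = recognizer.record(source)

try:
    # Send the audio to the Google Web Speech API, requesting Khmer (km-KH)
    transcribed_text = recognizer.recognize_google(audio_data, language="km-KH")
    print(transcribed_text)  # you can also write the transcribed text to a file
except sr.UnknownValueError:
    print("The API could not understand the audio")
except sr.RequestError as e:
    print(f"Could not get a response from the API: {e}")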
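
Because the recognizer reads the .wav file directly, a quick check with Python's standard-library wave module can catch a failed conversion before any request is sent. This is a minimal sketch; describe_wav is a hypothetical helper, and it only handles the uncompressed PCM files that the wave module can open (wave.open raises wave.Error otherwise).

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate (Hz), and duration (seconds) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate if rate else 0.0,
        }
```

For example, print(describe_wav("record/file.wav")) shows at a glance whether a recording is empty or unexpectedly short.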

3. Full code

The full version of the code combines both steps to transcribe Khmer (km-KH) audio recordings into text. First, it iterates over a folder of audio files and converts each one to .wav. Then, it uses SpeechRecognition to transcribe each recording. Finally, it appends the text to results.txt.

from pathlib import Path
import speech_recognition as sr
from pydub import AudioSegment

# If ffmpeg is not on your PATH, set the path to the executable manually (Windows example)
AudioSegment.converter = r"C:\ffmpeg-7.0.2-full_build\ffmpeg-7.0.2-full_build\bin\ffmpeg.exe"

# Open the results file in append mode; UTF-8 keeps the Khmer text intact
with open("results.txt", "a", encoding="utf-8") as result_file:
    entries = Path('/path/to/your/folder')

    for entry in entries.iterdir():
        if entry.is_file() and entry.suffix == '.wav':  # set your source extension here: '.mp4', '.m4a', '.mp3', '.ogg', etc.
            voice_file_path = entry.resolve()
            wav_file_path = voice_file_path.with_suffix('.wav')

            try:
                # Convert the original audio file to wav
                audio = AudioSegment.from_file(voice_file_path, format="wav")  # must match the source extension, e.g. "ogg"
                audio.export(wav_file_path, format="wav")

                # Transcribe the converted wav file using Khmer language recognition
                recognizer = sr.Recognizer()
                with sr.AudioFile(str(wav_file_path)) as source:
                    audio_data = recognizer.record(source)
                    transcribed_text = recognizer.recognize_google(audio_data, language="km-KH")

                # Write the transcribed text to the file
                result_file.write(f"{transcribed_text}\n")

            except Exception as e:
                print(f"Error processing {voice_file_path}: {e}")
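
The script above writes only the transcribed text, so it can be hard to tell later which recording produced which line. As one possible variation (not part of the original script), the hypothetical helper append_result below writes the file name and the transcription as a tab-separated row, using only the standard library.

```python
import csv


def append_result(result_path, audio_name, text):
    """Append one 'file name<TAB>transcription' row to a UTF-8 text file."""
    with open(result_path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow([audio_name, text])
```

Inside the loop, append_result("results.txt", voice_file_path.name, transcribed_text) would take the place of the result_file.write call.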

Example contents of results.txt:

តើខ្ញុំអាចទាក់ទងអ្នកដោយរបៀបណាប្រសិនបើខ្ញុំមានបញ្ហា
តើអ្នកកំពុងតែធ្វើអ្វីនៅក្នុងបន្ទប់នេះ
តើប្រភេទកីឡាអ្វីដែលអ្នកចូលចិត្តជាងគេ
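
To analyze output like this, each line can be compared against a reference transcript typed by hand. The post does not prescribe a scoring method; one common option, sketched here as an assumption, is the character error rate (edit distance divided by reference length), which is convenient for Khmer because words are not separated by spaces. The function names are hypothetical, and the distance is counted in Unicode code points, so a Khmer consonant cluster counts as several characters.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in Unicode code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the reference
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A rate of 0.0 means the transcription matches the reference exactly; higher values mean more character-level edits are needed to turn one into the other.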

Conclusion

By following the above steps, you can test the Google Web Speech API’s ability to understand spoken Khmer and automate the process of converting audio, transcribing it, and writing the output to results.txt. Keep in mind that transcription accuracy varies with the clarity of the audio and the amount of background noise, so the results may not be perfect.