Audio Data Collection – Short Explanation

AI-powered personal assistants are already a familiar part of daily life. Whether it is Apple’s Siri, Google’s assistant responding to the “Hey Google” keyphrase, or a product from another vendor entirely, these assistants have quickly become the norm in our households.
However, they did not arrive fully fledged. They were trained on collected audio data to recognize human speech, so that they could understand both the words and the meaning behind them.

Audio Data Collection in the Real World

The real world contains a multitude of languages. In addition, the same word can mean different things depending on the context of the sentence and even the tone of voice.

Training virtual assistants and other AI systems to recognize what is actually being said, and what it means, can be complicated. It requires large amounts of data collected across many languages and dialects. This data pool cannot be homogeneous: it needs to account for different accents and even different levels of recording quality.
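
One way to check that a data pool is not homogeneous is to tally recordings by speaker metadata and flag thin categories. The sketch below is only illustrative: the manifest layout, the field names (`accent`, `device`) and the 25% threshold are assumptions made for this example, not part of any particular tool.

```python
from collections import Counter

# Hypothetical manifest: one record per recording, with self-reported metadata.
recordings = [
    {"file": "rec_001.wav", "accent": "US", "device": "mobile"},
    {"file": "rec_002.wav", "accent": "US", "device": "mobile"},
    {"file": "rec_003.wav", "accent": "US", "device": "landline"},
    {"file": "rec_004.wav", "accent": "UK", "device": "mobile"},
    {"file": "rec_005.wav", "accent": "IN", "device": "mobile"},
]


def coverage(records, field):
    """Count how many recordings fall under each value of a metadata field."""
    return Counter(record[field] for record in records)


def underrepresented(records, field, min_share):
    """Return the values of `field` whose share of all recordings is below min_share."""
    counts = coverage(records, field)
    total = sum(counts.values())
    return sorted(value for value, n in counts.items() if n / total < min_share)


if __name__ == "__main__":
    # The 0.25 threshold is an arbitrary choice for illustration.
    print(underrepresented(recordings, "accent", 0.25))
    print(underrepresented(recordings, "device", 0.25))
```

A report like this only shows where the pool is thin. Deciding which accents, devices or quality levels matter still depends on the target market.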

Audio Data Collection in the World of AI

Collecting data for machine learning and AI training involves several steps.

  1. Determine what will be said and what the meaning of that phrase is.
    As much as possible, try to understand the real-world situation in which this information will be used, as that will color the data you gather. Review chat transcripts and emails to see what questions were asked and how they were resolved, so that the data you collect is representative.

  2. Understand the language those words will be said in.
    Here, “language” does not mean regional variation; it means the technical jargon specific to a role.
    As an example, a client might say: “Hello, I need to have someone come to my house for service. I am having a problem with my internet router, and the speed is very slow.”
    In this context, the first sentence is almost irrelevant. The model needs to understand the second sentence and know what to do in that specific instance.

  3. Build a script in that language, using those words.
    Your script should emulate the real world as much as possible. In this example, you’d want to count the number of speed issues that have been reported, and perhaps also how many problems involved cell signal or television channels, so that the script reflects how often each issue comes up. The script should be of a reasonable length (15 minutes or less) and offer the speakers a chance to pause between statements. Speakers need to identify themselves and their unique circumstances (accent, device, etc.) in every recording.

  4. Record real-world people in a variety of different situations using the words in the script, and transcribe the recordings.
    Based on your target market, focus your data collection on the demographic of your ideal population. This means considering mobile as well as landline recordings, along with regional variations in accent and more.
    Convert the voices into text files using an automated speech-to-text engine, then have a human check the transcriptions for errors. Each text file should be named or labeled so that it is easy to link to its audio recording.

  5. Build a test set, then train a new model and evaluate it against that test set.
    Using the audio files you’ve gathered and the text files you’ve transcribed, create segmented pairs, and set aside a subset of these pairs as your test set. This subset should be kept separate from the training data and used only to test the accuracy of your model.
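
The pairing and hold-out described in the steps above could be sketched roughly as follows. This assumes, purely for illustration, that each transcript shares a base name with its recording (for example `call_01.wav` and `call_01.txt`). The function names, the 20% test fraction and the fixed seed are likewise hypothetical choices rather than a prescribed method.

```python
import random
from pathlib import Path


def pair_by_stem(audio_names, transcript_names):
    """Pair each audio file with the transcript that shares its base name."""
    transcripts = {Path(name).stem: name for name in transcript_names}
    pairs = []
    for audio in sorted(audio_names):
        stem = Path(audio).stem
        if stem in transcripts:  # recordings without a transcript are skipped
            pairs.append((audio, transcripts[stem]))
    return pairs


def split_pairs(pairs, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and hold out a fraction of the pairs for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


if __name__ == "__main__":
    audio = ["call_01.wav", "call_02.wav", "call_03.wav", "call_04.wav", "call_05.wav"]
    text = ["call_01.txt", "call_02.txt", "call_03.txt", "call_04.txt", "notes.txt"]
    train, test = split_pairs(pair_by_stem(audio, text), test_fraction=0.25)
    print(len(train), "training pairs;", len(test), "held-out test pairs")
```

Fixing the seed keeps the held-out subset identical from run to run, so later models are scored against the same test pairs.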

Use the remainder of the segmented pairs to train the language model and add other pairs as you go to continue growing its capabilities. After the model is trained, validate it against the test data set to judge its accuracy and iterate until you get it to the level needed.
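
The text above does not say how accuracy is judged. One commonly used measure for speech recognition is word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the model's output into the reference transcript, divided by the number of reference words. The sketch below assumes that metric and a simple lowercase, whitespace-based tokenization; both are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output (deletion)
                curr[j - 1] + 1,     # extra word in the output (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the speed is very slow"   # human-checked transcript
    hypothesis = "the speed is slow"       # model output for the same audio
    print(word_error_rate(reference, hypothesis))
```

Averaging this score over the held-out test pairs gives a single number to track between training iterations. Lower is better, and the value can exceed 1.0 when the output contains many extra words.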