Speech recognition training data for software development

Case study – Creation and analysis of voice recordings as training data for speech recognition software

Thousands of Clickworkers record voice commands used to control car infotainment systems. The recordings are then transcribed and analyzed, giving the manufacturer the large body of speech recognition training data needed to develop and optimize the speech recognition software.

The challenge for speech recognition training data

Voice control systems are only as good as their speech recognition. The biggest challenge is optimizing and training these speech recognition systems to react to the large variety of voice commands.
Development that does not account for how people actually reason and behave cannot produce an ideal speech recognition system. In many cases, users’ voice commands are not recognized or are misunderstood.

Users often have to repeat a command several times before the system responds correctly and displays the desired information. This is time-consuming and, while driving, distracting.

To recognize the individual voice commands of potential users, the system has to be trained on speech recordings from thousands of different people, each with their own wording and pronunciation.

The solution: creating data sets to improve speech recognition software

Thousands of our Clickworkers from different countries and regions record how they would issue a command to call up a predefined reaction x or information y via the infotainment system. Every voice recording differs, even within the same language, because each Clickworker has an individual choice of words, word order, and pronunciation.

The speech recognition algorithms must also be trained to react to certain cues, such as keywords. In a second step, our Clickworkers therefore transcribe all the voice recordings and analyze the sentences to identify the keywords used and their frequency.

With the help of these recordings, manufacturers train their speech recognition software and optimize the infotainment system to respond to the different ways individual users interact with it.

Project Data

Clickworker qualifications: Native speakers from the target regions

Languages: 9 languages

Number of voice recordings (in MP4 format): 810,000 (600 recordings for each of 150 scenarios in each of the 9 languages)

1. Task: Create the audio recording
2. Task: Transcribe the recordings
3. Task: Analyze and evaluate the recordings

Quality assurance: a second Clickworker, the transcriber, checks the quality of the recordings

Data transfer: via XLS file
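
The recording volume in the project data above can be checked with a quick calculation. This is only a sketch: it assumes that "600 recordings" means 600 per scenario in each language, which is the reading under which the stated total of 810,000 works out. The variable names are illustrative and not part of the project.

```python
# Figures taken from the project data; the per-scenario reading is an assumption.
languages = 9
scenarios_per_language = 150
recordings_per_scenario = 600

# Recordings collected for one language across all scenarios.
recordings_per_language = scenarios_per_language * recordings_per_scenario

# Recordings collected across all languages.
total_recordings = languages * recordings_per_language

print(f"Per language: {recordings_per_language:,}")
print(f"Total: {total_recordings:,}")
```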

Workflow

  1. The project is discussed with the customer and the tasks are defined accordingly.
  2. clickworker sets up the project in a three-stage distribution of tasks, including briefings for the Clickworkers and quality assurance.


    1. Task: Creation of voice recordings
      • Audio recordings in 9 languages
      • 600 recordings for each of 150 scenarios per language
      • 1,200 Clickworkers per language are requested
      • Audio format: MP4 files
    2. Task: Quality assurance and transcription
      • Checking and transcription of the 810,000 voice recordings made by native speakers
    3. Task: Analysis and evaluation
      • Calculation of the keywords and their frequency per scenario and language
      • Filtering the phrases, including their frequency, per scenario and language
  3. The final task results are transferred to the customer as an XLS file.
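
The analysis step above (keywords and phrases with their frequency per scenario and language) could be sketched roughly as follows. This is a minimal illustration, not the procedure used in the project: the stop-word list, the function names, the scenario label "navigate_home", and the sample transcripts are all hypothetical, and the project itself relied on Clickworkers rather than a script for this analysis.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list (English only); a real project would need one per language.
STOP_WORDS = {"the", "a", "to", "please", "me", "i", "my"}


def tokenize(text):
    """Lowercase a transcript and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def keyword_frequencies(transcripts):
    """Count non-stop-words per (language, scenario) pair.

    transcripts: iterable of (language, scenario, text) tuples.
    """
    counts = defaultdict(Counter)
    for language, scenario, text in transcripts:
        words = tokenize(text)
        counts[(language, scenario)].update(w for w in words if w not in STOP_WORDS)
    return counts


def phrase_frequencies(transcripts):
    """Count whole normalized phrases per (language, scenario) pair."""
    counts = defaultdict(Counter)
    for language, scenario, text in transcripts:
        phrase = " ".join(tokenize(text))
        counts[(language, scenario)][phrase] += 1
    return counts


# Invented sample transcripts for illustration only.
sample = [
    ("en", "navigate_home", "Take me home"),
    ("en", "navigate_home", "Navigate home, please"),
    ("en", "navigate_home", "take me home"),
    ("de", "navigate_home", "Fahr mich nach Hause"),
]

keywords = keyword_frequencies(sample)
phrases = phrase_frequencies(sample)

for (language, scenario), counter in keywords.items():
    print(language, scenario, counter.most_common(3))
```

Keying the counters by language and scenario mirrors the way the results are reported in the workflow: one keyword and phrase ranking per scenario and language.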


  • Speed
  • Three services from a single source
  • Simple access to know-how and language skills
  • Quality assured results
  • Scalable throughput
  • Flexible workforce

The difficulties of speech recognition training data: machine learning and the human factor

Speech recognition offers many useful applications that can make day-to-day activities easier. Whether it is used to search for something online, unlock a smartphone, or operate a car infotainment system, more and more programs rely on voice input. This poses challenges for software development. Since every person speaks differently because of their dialect, individual mannerisms, or potential speech impediments, the program needs to be trained to recognize the same words in many variations. This is why the human factor plays such an important role in gathering speech recognition training data. Training the system on a single recording would not yield the desired results. Instead, we provide a multitude of different voice recordings that help the machine learn. Once this foundation has been laid, the software can use the training data to draw the right conclusions and keep improving.