Realistic training data for machine learning

Data are the foundation for training algorithms. The more realistic the data, the better the results, because an AI system can only learn from the information it is given. This is obvious, but it is often overlooked. Training data are realistic when they reflect the data that the AI system will encounter in real operation. Unrealistic data sets hamper machine learning and lead to expensive misinterpretations.

Unsuitable training data are expensive

Artificial neural networks need to be fed good input to be able to learn, just like the human brain. Artificial intelligence consists of algorithms that are fed data from which they are meant to learn, a process known as machine learning. Ultimately, the data used to train a system determine what it knows and what it can accomplish. Using artificially created or open data as training data carries a great risk of distorted results, because such data are often not realistic. If the data do not match the conditions under which the system will be used, the system can deliver insufficient or incorrect results, as the following example illustrates.

While developing software for drone cameras, the developers use photographs found on the Internet. Such photos are in ample supply on Facebook or Instagram. However, they have two typical features:

  • They are generally taken at head height
  • And the targeted object is nearly always at the center of the image.

A self-learning algorithm will draw incorrect conclusions from these features. These supposedly general patterns are not useful for assessing photos taken from a drone; at worst they are even harmful. In this example, the algorithm might learn that important objects are always at the center of the image, which is a false conclusion: drones take photographs from various perspectives and distances.
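The center bias described above can be checked before training. The following is a minimal sketch, not a prescribed method: it assumes that object annotations are available as bounding boxes in normalized coordinates, and the function names, the sample boxes and the threshold of 0.15 are all hypothetical choices made for illustration.

```python
from statistics import mean

def center_offsets(boxes):
    """Return each object's distance from the image center.

    Each box is (x_min, y_min, x_max, y_max) in coordinates normalized
    to the range 0..1, so the image center is (0.5, 0.5).
    """
    offsets = []
    for x_min, y_min, x_max, y_max in boxes:
        cx = (x_min + x_max) / 2
        cy = (y_min + y_max) / 2
        offsets.append(((cx - 0.5) ** 2 + (cy - 0.5) ** 2) ** 0.5)
    return offsets

def looks_center_biased(boxes, threshold=0.15):
    """Flag a data set whose objects sit, on average, close to the center.

    The threshold is an assumed value and would need tuning per project.
    """
    return mean(center_offsets(boxes)) < threshold

# Hypothetical annotations: social media photos vs. drone footage.
social_media_boxes = [(0.4, 0.4, 0.6, 0.6), (0.35, 0.3, 0.65, 0.7), (0.45, 0.4, 0.6, 0.65)]
drone_boxes = [(0.0, 0.0, 0.2, 0.2), (0.7, 0.1, 0.9, 0.3), (0.1, 0.7, 0.3, 0.95)]

print(looks_center_biased(social_media_boxes))  # True
print(looks_center_biased(drone_boxes))         # False
```

A data set that is flagged this way is not necessarily unusable, but it suggests that the object positions differ from what a drone camera would deliver.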

Another example: To train automotive software for the German market, the development team uses photos of traffic situations taken worldwide. In this case there is a risk that, in practice, the artificial neural network mistakes an advertising poster that resembles a foreign traffic sign for a road sign.

Verifying existing data sets

How does one identify poor training data sets? The following signs can be indications:

  • They are incorrect to a large extent,
  • They do not match the values the system will be working with,
  • Or the data sets have numerous outliers and redundant information.
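Some of these signs can be screened for automatically. The sketch below assumes a data set of numeric readings and a known operating range; the function name `audit_values`, the sample readings and the range of -20 to 50 are hypothetical. It uses the median absolute deviation as one possible outlier test and treats exact duplicates as a simple stand-in for redundant information.

```python
from statistics import median

def audit_values(values, expected_min, expected_max, mad_factor=5.0):
    """Summarize three warning signs in a list of numeric readings."""
    # Values the system would never see in operation.
    out_of_range = [v for v in values if not expected_min <= v <= expected_max]

    # Outliers via the median absolute deviation (MAD), which is robust
    # against the very outliers it is meant to detect.
    med = median(values)
    mad = median(abs(v - med) for v in values)
    outliers = [v for v in values if mad > 0 and abs(v - med) > mad_factor * mad]

    # Exact duplicates as a rough measure of redundancy.
    duplicates = len(values) - len(set(values))
    return {"out_of_range": out_of_range, "outliers": outliers, "duplicates": duplicates}

# Hypothetical temperature readings for a system that operates between -20 and 50.
readings = [20.5, 21.0, 21.0, 19.8, 22.1, 95.0, -40.0]
report = audit_values(readings, expected_min=-20, expected_max=50)
print(report)
```

Whether a data set is incorrect to a large extent cannot be determined this way; that requires comparison with trusted reference data or human review.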

The solution is to gather the data oneself, or to have them newly gathered by a provider to meet one's requirements, and / or to examine existing data sets to determine whether they are suitable for the respective system. Data sets are suitable when they correspond to the input that the system will receive, and must recognize and correctly evaluate, in operation.

At clickworker you can have your Datasets for Machine Learning newly generated – to meet your individual requirements and tailored to the specifications of your system.

The quality of training data can be verified based on the following questions:

  • What methods and technologies were used to generate the data?
  • Is the source of the data reliable? Or was data collection associated with a specific intention?
  • Where do the data come from? Many training data sets have a clearly defined geographical focus. Is this suitable for the special use?
  • When were the data collected?
  • In which surroundings / under what conditions were the data generated?
  • How are the data related, and why were they gathered?

The crowd takes on data creation and quality control

The crowd is especially effective at generating and verifying training data for systems with artificial intelligence. In principle there are three approaches, which can also be combined:

  • Create new training data (for instance photos, video datasets, audio datasets & voice datasets),
  • Evaluate and classify existing data sets according to their quality and / or content,
  • Control and assess results that are supplied by AI systems.

Inadequate data can also be optimized for use as training data at a later date. Within a short period of time, Clickworkers can process raw data: add keywords and tags, annotate elements in images with bounding boxes, polygons and key points, or carry out semantic segmentation.

The data sets and results are subsequently checked by means of various procedures, including peer review, the dual control principle and majority decision.
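A majority decision can be sketched in a few lines. This is an illustration only and does not describe clickworker's actual process: the function name `majority_label`, the item names and the labels are hypothetical, and the rule that a label needs more than half of the votes is an assumption.

```python
from collections import Counter

def majority_label(votes, min_agreement=0.5):
    """Return the label given by more than `min_agreement` of the votes.

    Returns None if no label reaches that share, so the item can be
    sent back for another round of review.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None

# Hypothetical labels from three crowd workers per image.
crowd_votes = {
    "img_001": ["car", "car", "truck"],
    "img_002": ["sign", "poster", "billboard"],
}
consensus = {item: majority_label(votes) for item, votes in crowd_votes.items()}
print(consensus)  # {'img_001': 'car', 'img_002': None}
```

Items without a consensus, such as the second image here, are the ones most worth routing to peer review.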

More information about the clickworker “Datasets for Machine Learning” service.

Summary: Realistic data sets pay off

The main risk of unrealistic data is that they can corrupt an entire algorithm. This is similar to the human brain: if the basic assumptions and information turn out to be incorrect, then the hypotheses and worldviews built on them are also incorrect. For the machine as well as for the brain, this means starting all over again, which can be very expensive in the case of the machine. No company can afford to use an unsafe technology. It is therefore advisable to pay attention to the quality of the training data sets from the outset to avoid these unnecessary costs.



Jan Knupper