Training data for AI: There is more to it than quantity

March 19, 2019

Artificial intelligence is being used in a growing number of application areas. Machines require large amounts of data to perform similarly to human beings, so quantity counts, particularly when addressing challenging problems and complex issues. However, the quality of the data is also significant, especially for the training data used in machine learning. With this data, algorithms can refine themselves and machines learn how to learn.

Machine learning or artificial intelligence?

Machine learning (ML) is often equated with artificial intelligence (AI). That is not quite accurate. Machine learning is a part of artificial intelligence, but not everything AI does is ML. What is the subtle yet important difference?

  • Artificial intelligence describes software that imitates the cognitive skills of human beings and solves problems with methods it has learned.
  • Machine learning means that the software independently learns how to develop methods to solve problems based on data.

However, this difference is not always kept in mind in practice, especially in the business sector. Machine learning is often used as a synonym for artificial intelligence. If you take a closer look at it, though, you will soon realize that artificial intelligence requires machine learning to further develop and improve itself. Artificial intelligence and machine learning have one thing in common: they need data – the more, the better.

At clickworker, a workforce of over 4.5 million Clickworkers provides you with datasets for machine learning, in any quantity and in high quality.

Training data for AI – Large amounts of data for complex problems

Machine learning is especially suitable for solving complex problems. The more variables need to be calculated, the more complicated the task becomes. And because the problems are so complex, a correspondingly large amount of data is required to ensure that the trained system can respond effectively. One example is autonomous driving. Many different elements must be analyzed, classified and calculated to produce the correct response within a few milliseconds.

It is therefore clear that the amount of training data plays a significant role. The number of possible situations in road traffic is practically infinite. A large amount of training data, which provides an increasingly exact picture, is required to detect similar structures in traffic situations. A comparison with surveys illustrates this clearly.

  • The more people are questioned, the more closely the result reflects real-life conditions.
  • The fewer people are questioned, the more likely it is that the result is down to chance.
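
The survey comparison can be made concrete with the standard margin-of-error approximation for a surveyed proportion. The following is a minimal sketch, not part of the original article: the function name `margin_of_error`, the worst-case proportion of 0.5 and the 95% confidence factor of 1.96 are assumptions chosen for illustration, and the formula presumes a simple random sample.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a surveyed proportion.

    Assumes a simple random sample; z = 1.96 corresponds to roughly
    95% confidence, and proportion = 0.5 is the worst case.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


# The uncertainty shrinks as more people are questioned.
for n in (10, 100, 1_000, 10_000):
    points = margin_of_error(n) * 100
    print(f"{n:>6} respondents: +/- {points:.1f} percentage points")
```

Under these assumptions the uncertainty shrinks with the square root of the sample size, so each tenfold increase in respondents cuts it by a factor of roughly three. This only illustrates the survey analogy; how much training data a given AI task needs cannot be read off this formula.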

Quantity therefore makes a difference. How large the amount of training data for AI needs to be always depends on the respective task. Speech recognition in particular requires large quantities of training data. For instance, some experts call for at least ten thousand hours of audio data as a basis for a system that can operate at moderate speed.

What about the quality of training data for AI?

The quality of the training data for AI must be up to scratch. This is not surprising. At best the system will simply ignore poor data. However, this implies that the system can rate the quality of the data. At worst poor quality data may lead to incorrect results with expensive consequences. Unsatisfactory data quality is therefore one of the main reasons why many companies have been reluctant to use artificial intelligence until now.

But how can one obtain good training data? A determining factor is the use of an intelligent quality control system. Several testing methods can be used to validate the data. These include, for example:

  • proofreading, editing or peer review,
  • the dual-control method,
  • majority decisions when the results differ.
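
As an illustration of the last method, a majority decision over the labels that several people assign to the same item could look roughly like the sketch below. It is a hypothetical example, not clickworker's actual procedure: the function name `majority_label`, the sample item IDs and the rule that ties are sent to manual review are all assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority, or None if there is none."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    # Require more than half of all votes so that ties are never decided silently.
    return label if votes > len(labels) / 2 else None


# Hypothetical annotations: each image was labeled by several people.
annotations = {
    "image_001": ["person", "person", "barrel"],
    "image_002": ["barrel", "person"],
}

for item_id, labels in annotations.items():
    decision = majority_label(labels)
    print(item_id, decision if decision is not None else "needs review")
```

A strict majority is used here so that disagreements without a clear winner are escalated instead of being resolved arbitrarily. A majority decision only helps when most annotators are right; if most of them make the same mistake, the wrong label still wins.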

These methods can be used for any number of tasks. Generally, a computer cannot evaluate the quality of the data. Therefore, there is always a risk that machine learning may lead to results that are logical, formally speaking, and yet incorrect in practice. For example: in an autonomous driving test series the test subjects (because they are not paying attention) repeatedly classify a specific image of a person as a barrel. The system’s response is logical: in a critical traffic situation it evaluates running over a (perceived) barrel as a reasonable alternative that will cause the least possible damage.


It comes as no surprise that the quality and quantity of training data for machine learning are equally important. The skepticism of many decision-makers in large companies towards artificial intelligence often stems from doubts about the quality of training data. However, quality and quantity do not contradict each other. clickworker offers both: a workforce of millions of people, individually selected according to strict criteria for the respective task, delivers training data in both high quality and large quantity.

Fast, cost-effective, flexible: clickworker is the solution for your special AI project. Make use of our managed service for generating training data.


This article was written by Jan Knupper on March 19, 2019.

