Data are the foundation for training algorithms. The more realistic the data, the better the results, because artificial intelligence depends on precise and reliable information to train its algorithms. This may sound obvious, but it is often overlooked. Training data are realistic when they reflect the data that the AI system gathers in real operation. Unrealistic data sets hinder machine learning and lead to expensive misinterpretations.
Artificial neural networks need good input in order to learn, just like the human brain. Ultimately, the data used to train a system determines what it knows and can accomplish. Artificial intelligence consists of algorithms that learn from the data they are fed, so-called machine learning. Using artificially created or open data as training data carries a great risk of distorted results, because such data are often not realistic. If the training data do not realistically reflect what the system will encounter in use, the results can be insufficient or incorrect, as the following example illustrates.
While developing software for drone cameras, the developers make use of photographs found on the Internet. Such photos exist in ample supply on Facebook and Instagram. However, they typically share two features:
A self-learning algorithm will draw incorrect conclusions from these features. These supposedly general structures are not useful for assessing photos taken from a drone; at worst, they may even be harmful. In this example, the algorithm might learn that important objects always appear at the center of the image. That is a false conclusion: photographs taken by drones come from a wide variety of perspectives and distances.
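One way to catch this kind of mismatch before training is to measure where annotated objects actually sit in the images. The following is a minimal sketch, assuming the data set's bounding boxes are available as `(x, y, width, height)` tuples in normalized image coordinates; the function name and example boxes are illustrative, not part of any specific tool.

```python
# Sketch: detect "center bias" in an image data set.
# Assumes each annotation is a bounding box (x, y, w, h) in a
# normalized 1.0 x 1.0 image coordinate system. Illustrative only.

def center_bias(annotations, tolerance=0.15):
    """Return the fraction of boxes whose center lies within
    `tolerance` of the image center (0.5, 0.5)."""
    if not annotations:
        return 0.0
    near_center = 0
    for x, y, w, h in annotations:
        cx, cy = x + w / 2, y + h / 2
        if abs(cx - 0.5) <= tolerance and abs(cy - 0.5) <= tolerance:
            near_center += 1
    return near_center / len(annotations)

# Hypothetical examples: social media photos tend to place the
# subject in the middle; drone shots are framed far more freely.
social_media_boxes = [(0.40, 0.40, 0.20, 0.20),
                      (0.45, 0.35, 0.15, 0.30),
                      (0.38, 0.42, 0.25, 0.20)]
drone_boxes = [(0.05, 0.10, 0.10, 0.10),
               (0.80, 0.70, 0.15, 0.10),
               (0.40, 0.45, 0.20, 0.15)]

print(center_bias(social_media_boxes))  # 1.0 -> suspicious bias
print(center_bias(drone_boxes))         # ~0.33 -> more varied framing
```

A ratio close to 1.0 on data scraped from social media, compared against a much lower ratio on real operational footage, is exactly the kind of red flag that suggests the training data will not transfer.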
Another example: to train automobile software for the German market, the development team uses photos of traffic situations taken worldwide. Here there is a risk that, in practice, the artificial neural network mistakes an advertising poster that resembles a foreign traffic sign for a road sign.
How does one identify poor training data sets? The following signs, for instance, can be indications:
The solution is to gather the data oneself or have them newly gathered by a provider. That way, the data can be gathered to meet one's own requirements, and/or existing data sets can be examined for whether they suit the respective system. They are suitable when they correspond to the input the system receives, recognizes, and correctly evaluates in operation.
At clickworker you can have your AI training data newly generated – to meet your individual requirements and tailored to the specifications of your system.
The quality of training data can be verified based on the following questions:
The crowd is especially successful at both generating and verifying training data for AI systems. In principle, there are three approaches, which can also be combined:
Inadequate data can also be optimized later for use as training data. Within a short period of time, Clickworkers can process raw data: add keywords and tags, annotate elements in images with bounding boxes, polygons, and key points, or carry out semantic segmentation.
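To make the annotation types above concrete, here is a sketch of what one annotated image record might look like. The field names are hypothetical, loosely modeled on common annotation formats such as COCO, not the format of any particular tool.

```python
# Sketch: one annotated image record combining the annotation types
# mentioned above (tags, bounding boxes, polygons, key points).
# All field names are illustrative.

record = {
    "image": "frame_0001.jpg",
    "tags": ["street", "daytime", "traffic"],
    "objects": [
        {
            "label": "car",
            "bbox": [120, 80, 64, 40],           # x, y, width, height (px)
            "polygon": [(120, 80), (184, 80),    # outline usable for
                        (184, 120), (120, 120)], # semantic segmentation
        },
        {
            "label": "pedestrian",
            "bbox": [300, 60, 20, 50],
            "keypoints": {"head": (310, 62), "feet": (310, 110)},
        },
    ],
}

labels = [obj["label"] for obj in record["objects"]]
print(labels)  # ['car', 'pedestrian']
```

Keeping all annotation layers on one record like this makes it easy to check later whether every object has the labels the target system actually needs.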
The data sets and results are subsequently checked by means of various procedures, including peer review, the dual-control principle, and majority decision.
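The majority-decision procedure can be sketched in a few lines: several workers label the same item, the most frequent label wins, and items without a clear majority are flagged for further review. The function name and threshold below are assumptions for illustration.

```python
# Sketch: quality control by majority decision. Several annotators
# label the same item; the majority label wins, and items without a
# clear majority are flagged for review. Illustrative names only.
from collections import Counter

def majority_decision(labels, min_agreement=0.5):
    """Return (winning_label, agreed) for one item labeled by
    several workers; agreed is False if no clear majority exists."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreed = votes / len(labels) > min_agreement
    return label, agreed

print(majority_decision(["stop_sign", "stop_sign", "billboard"]))
# ('stop_sign', True) -- 2 of 3 workers agree

result = majority_decision(["stop_sign", "billboard"])
print(result[1])  # False -- tie, so the item goes to peer review
```

In practice, items that fail the agreement threshold would be routed to the other procedures mentioned above, such as peer review or the dual-control principle.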
More information about the clickworker “AI training data” service.
The main risk of unrealistic data is that it can falsify an entire algorithm. This is similar to the human brain: if the basic assumptions and information turn out to be incorrect, then the hypotheses and worldviews built on them are also incorrect. For the machine as well as the brain, the consequence is having to start all over again, which in the case of the machine can be very expensive. No company can afford to rely on an unreliable technology. It is therefore advisable to pay attention to the quality of the training data sets from the outset and avoid these unnecessary costs.
This article was written by Jan Knupper on May 14, 2019.