Data Extraction – Short conceptual explanation

Data extraction is the process of retrieving digital information from a source system or storage device so that it can be processed further. Typically the extracted data is then transformed into a different format and loaded into a new system, such as a database or content management system. The word “extract” comes from the Latin “extrahere,” meaning “to draw out.” This implies that digital content is being pulled out of one system for use in another, as opposed to non-digital content being put into a system, which is more commonly known as “data entry.”
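The extract, transform and load sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text, the field names (sku, name, price) and the JSON output standing in for the "new system" are all hypothetical and not taken from any particular product.

```python
import csv
import io
import json
from decimal import Decimal

# Hypothetical source data: a CSV export from some existing system.
SOURCE_CSV = "sku,name,price\nA100,Kettle,24.99\nB200,Toaster,31.50\n"


def extract(csv_text):
    """Pull the raw rows out of the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Convert each row into the shape the (assumed) target system expects."""
    return [
        {
            "sku": row["sku"],
            "name": row["name"],
            # Store prices as whole pence to avoid floating-point rounding.
            "price_pence": int(Decimal(row["price"]) * 100),
        }
        for row in rows
    ]


def load(records):
    """Serialise the records as JSON, standing in for loading a new system."""
    return json.dumps(records, indent=2)


records = transform(extract(SOURCE_CSV))
print(load(records))
```

Keeping the three steps as separate functions mirrors the way the process is usually described, and makes it easy to swap the source or the destination without touching the transformation.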

Data capture

In the early days of computing, getting information into computers was a manual process known as data entry, in which teams of people typed the information on paper documents into terminals. These days the process of digitisation, that is, turning hard copy (paper) into soft copy (electronic files), is much easier thanks to optical character recognition (OCR) technology. OCR enables machines to scan documents and recognise handwritten, typed or photographed text, which is then stored as machine-readable text that can be manipulated by software. Product data management combines extraction from digital sources such as Excel spreadsheets with entry from paper sources such as catalogues to maintain online shops.

Data extraction for warehouses

Once data has been collected, it is typically extracted from the devices that captured it and loaded into a central storage facility. For large businesses this is usually a data warehouse, which is typically built on a relational database management system. In retail, for example, every transaction that takes place on a till in a brick-and-mortar store or in a shopping cart on a website produces data. This includes product codes, prices, credit card details and the location, date and time of everything sold. This information, which is largely just a stream of numbers, is not in a user-friendly format. Once it has been loaded into tables in a database, software developers can turn it into reports showing the sales figures, profits and losses, and stock levels that business owners need to make decisions.
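As a rough sketch of loading till data into database tables, the example below uses Python's built-in sqlite3 module with an in-memory database. A real data warehouse would be a much larger system; the table name (sales), its columns and the sample transactions are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical raw till records: product code, price, location, date and time.
RAW_TRANSACTIONS = [
    ("A100", 24.99, "Leeds", "2024-03-01T09:15:00"),
    ("B200", 31.50, "Leeds", "2024-03-01T09:47:00"),
    ("A100", 24.99, "Online", "2024-03-01T10:02:00"),
]


def load_transactions(conn, rows):
    """Create the (assumed) sales table if needed and insert the raw rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "product_code TEXT, price REAL, location TEXT, sold_at TEXT)"
    )
    # Parameterised inserts keep the data separate from the SQL text.
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_transactions(conn, RAW_TRANSACTIONS)

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```

Once the rows are in a table with named columns, they can be filtered, grouped and totalled with ordinary SQL rather than by parsing a stream of numbers.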

Data extraction from the warehouse

Producing the reports, spreadsheets, graphics and charts that people use in their work can also be regarded as a form of data extraction, because it involves pulling information from a central storage location and presenting it in a user-friendly form that is easily understood. In the retail example above, management want to see the big picture rather than individual transactions, and this is accomplished by software developers writing programs that query the database and format the results. However data is initially captured, it is unlikely to be in a useful format to begin with, so extraction will always be a vital step in making it available to the people who wish to use it, known as the “end users.”
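A program that queries the database and formats the results might look something like the following minimal sketch. It uses Python's built-in sqlite3 module with a small in-memory table; the table layout, the sample rows and the sales_report function are hypothetical, chosen only to show individual transactions being summarised into a "big picture" view.

```python
import sqlite3

# Build a small in-memory table of hypothetical transactions (prices in pence).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product_code TEXT, price_pence INTEGER, location TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A100", 2499, "Leeds"),
        ("B200", 3150, "Leeds"),
        ("A100", 2499, "Online"),
    ],
)


def sales_report(conn):
    """Summarise units sold and revenue per product as readable lines."""
    query = (
        "SELECT product_code, COUNT(*), SUM(price_pence) "
        "FROM sales GROUP BY product_code ORDER BY product_code"
    )
    lines = []
    for code, units, total_pence in conn.execute(query):
        lines.append(f"{code}: {units} sold, total GBP {total_pence / 100:.2f}")
    return lines


report = sales_report(conn)
print("\n".join(report))
```

The grouping and totalling is done by the database, and the program only formats the result, which is the usual division of labour when reports are drawn from a warehouse.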