What is a data pipeline in computer vision?

A data pipeline in computer vision is the path that data flows through, from collection and storage to model training and deployment. Ideally, it is a connected technical set-up in which data storage is linked to various data preparation and MLOps tools, which in turn are connected through an API to the machine learning model and the deployed product.

Today, many people talk about data-centric AI. Under this approach, engineers focus their efforts not only on the model but also on the data on which the model is trained. Data pipelines play a crucial role in doing this well. The purpose of a data pipeline is to enable engineers to work more efficiently by streamlining and automating data flows and intelligently managing datasets and data lakes.

To make this concept clearer, Figure 1 depicts the standard machine learning pipeline design.

Figure 1: Machine Learning pipeline; own illustration by author

Data collection is the first step in the pipeline. From there, the data is ingested into a data storage platform or infrastructure. This is where the typical data pipeline starts, continuing through steps like data curation, data labeling, and model training until deployment. One goal of a data pipeline is that information can move not only forward but also backward through the workflow. Such information can include:

  • Metadata (e.g., weather, time, GPS location)
  • Labels/annotations
  • Model predictions 
  • Embeddings 
  • Images/videos 
  • Experiment results (e.g., model accuracy on a given dataset) 
  • Deployed system model confidences 
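
One way to picture this is a single record per sample that every stage can read from and write to. The following is a minimal sketch under that assumption; the class name `SampleRecord`, its field names, and the storage URI are hypothetical and do not come from any particular tool. Experiment results usually describe a whole dataset rather than one sample, so they would typically live in a separate store.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleRecord:
    """One image or video plus everything the pipeline learns about it."""
    sample_id: str
    media_uri: str                                   # where the image/video is stored
    metadata: dict = field(default_factory=dict)     # e.g. weather, time, GPS location
    labels: list = field(default_factory=list)       # annotations from the labeling step
    predictions: list = field(default_factory=list)  # model outputs from training or evaluation
    embedding: Optional[list] = None                 # feature vector, if already computed
    deployed_confidence: Optional[float] = None      # confidence reported by the deployed system


record = SampleRecord(
    sample_id="frame_0001",
    media_uri="s3://example-bucket/frame_0001.jpg",  # hypothetical location
    metadata={"weather": "rain", "time": "2021-06-01T08:30:00"},
)

# Later stages write back to the same record, so information also flows backward.
record.labels.append({"class": "car", "bbox": [10, 20, 110, 90]})
record.deployed_confidence = 0.42
```

Because the labeling step and the deployed system both update the same record, an earlier stage such as data curation can later query these fields when deciding what to collect or label next.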

All of this information needs to be stored and managed accordingly while the data flows are automated and monitored. A good data pipeline setup makes it possible to efficiently run workflows like active learning, where model performance on specific samples directly impacts the subsequent data selection stage. Anomalies can also easily be detected and automatically flagged. At the same time, a good pipeline should provide analytics and insights about the state and health of the data in all stages of the pipeline.
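
As an illustration of the active learning idea, the sketch below picks the samples the model is least confident about and sends those to labeling first. It assumes one confidence score per unlabeled sample; the function name `select_for_labeling` and the scores are made up for the example, and least-confidence sampling is only one of several possible selection strategies.

```python
def select_for_labeling(confidences, budget):
    """Return the IDs of the `budget` samples the model is least confident about."""
    ranked = sorted(confidences, key=confidences.get)  # ascending confidence
    return ranked[:budget]


# Hypothetical confidences reported by the model for unlabeled samples.
confidences = {"img_a": 0.97, "img_b": 0.51, "img_c": 0.88, "img_d": 0.34}
to_label = select_for_labeling(confidences, budget=2)
print(to_label)  # ['img_d', 'img_b']
```

The same scores could also drive simple anomaly flagging, for example by marking every sample whose confidence falls below a chosen threshold.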

Another goal of an automated data pipeline is to facilitate data operations within the ML pipeline. Such a setup not only provides more information flow between the steps, thereby increasing the efficiency and performance of the resulting products, but also improves the logistics of ML pipelines by reducing the overhead associated with them. This overhead includes the effort required to manage such pipelines, for example transferring data between different tooling platforms or manually monitoring each step and the data transfer between them.
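
The sketch below shows, in a very reduced form, what removing that manual overhead can look like: stages are chained in code, and each hand-off is logged automatically instead of being checked by hand. The runner `run_pipeline` and the two stage functions are hypothetical stand-ins, not part of any specific tool.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_pipeline(data, stages):
    """Pass `data` through each (name, function) stage and record basic health stats."""
    report = []
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        elapsed = time.perf_counter() - start
        report.append({"stage": name, "items_out": len(data), "seconds": elapsed})
        log.info("%s: %d items out in %.4fs", name, len(data), elapsed)
    return data, report


# Hypothetical stand-ins for real curation and labeling steps.
def curate(samples):
    """Drop samples that were marked as blurry."""
    return [s for s in samples if not s["blurry"]]


def queue_for_labeling(samples):
    """Mark the remaining samples as waiting for annotation."""
    return [{**s, "status": "queued_for_labeling"} for s in samples]


samples = [
    {"id": 1, "blurry": False},
    {"id": 2, "blurry": True},
    {"id": 3, "blurry": False},
]
result, report = run_pipeline(
    samples, [("curate", curate), ("queue_for_labeling", queue_for_labeling)]
)
```

The returned `report` is a small example of the per-stage monitoring data a pipeline could expose, here just item counts and timings.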

Data curation plays an essential role in building good data pipelines. Selecting the right data for labeling and model training is crucial. The information listed above is therefore the basis for selecting which batch of newly collected data to label and subsequently feed to the model. Especially with large amounts of data, manual workflows for selecting the samples to be labeled become infeasible. Instead, selection needs to be automated based on defined parameters, and data flows need to be monitored (see our article about data selection in computer vision). This is where tools like Lightly come in. Lightly enables you to automate the selection and the model-data feedback loop completely. This allows the data pipeline to scale and shortens re-training cycles.
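
To make "automated selection based on parameters" more concrete, here is a minimal sketch of one common idea: choosing a diverse subset by greedy farthest-point selection on embeddings, so that near-duplicates are skipped. This is an illustrative assumption only. It is not Lightly's API or its selection algorithm, and the function name `select_diverse` and the toy two-dimensional embeddings are invented for the example.

```python
import math


def select_diverse(embeddings, n_select):
    """Greedy farthest-point selection.

    Repeatedly picks the sample that is farthest from everything chosen so far.
    """
    ids = list(embeddings)
    if not ids or n_select <= 0:
        return []
    selected = [ids[0]]  # start from the first sample (arbitrary choice)
    # Distance from every sample to its nearest already-selected sample.
    min_dist = {i: math.dist(embeddings[i], embeddings[selected[0]]) for i in ids}
    while len(selected) < min(n_select, len(ids)):
        remaining = [i for i in ids if i not in selected]
        candidate = max(remaining, key=min_dist.get)
        selected.append(candidate)
        for i in ids:
            min_dist[i] = min(min_dist[i], math.dist(embeddings[i], embeddings[candidate]))
    return selected


embeddings = {
    "img_a": [0.0, 0.0],
    "img_b": [0.1, 0.0],  # near-duplicate of img_a
    "img_c": [5.0, 5.0],
    "img_d": [0.0, 4.0],
}
chosen = select_diverse(embeddings, n_select=3)
print(chosen)  # ['img_a', 'img_c', 'img_d']
```

Here the parameter is simply the number of samples to select; the near-duplicate `img_b` is left out because it adds little new information for labeling.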

The following section gives an example of how to build a data pipeline for computer vision.

Example of a data pipeline for computer vision

Figure 2 pictures several data pipeline set-ups with different tooling possibilities. One can either write custom scripts or use a tool like Lightly to connect the various pieces of the development chain.

Figure 2: Automated data flows through AI data pipeline; own illustration by author

Using Lightly, a data pipeline for computer vision could, for example, have the following tooling set-up: 

Figure 3: Machine Learning Pipeline Tools for Computer Vision; own table by author

A data pipeline in machine learning in general, and in computer vision in particular, is the path that data flows through. It aims to enable automated and scalable data processing for machine learning.

This blog post is part of a series about data pipelines. Part 2 elaborates on how to build a data pipeline for computer vision, and Part 3 argues why data pipelines are important.

Matthias Heller,
Co-founder Lightly.ai 
