How should I build my data pipeline for computer vision?

A 3-step guide for building data pipelines for computer vision. A data pipeline exists to automate data streams and transfers, data selection, and dataset management, so automation should be the primary focus when building one. This article explains how to build a data pipeline for computer vision in three steps.

With an automated data pipeline, data can be continuously collected. Information about the data can freely flow between the stages. Even if some tasks are outsourced or handled on different platforms, little friction occurs between the steps.

For this, feature information and storage need to be interconnected with tooling.

Setting up your data pipeline therefore takes three crucial steps: (1) data storage connection, (2) tooling connection, and (3) installation and monitoring.

1. Data storage connection

In this first step, the idea is to automate the flow of collected data from your private storage into the data pipeline. For example, if your data is stored with a cloud provider, such as in Amazon S3, you can connect that storage to your data pipeline using, for instance, Lightly, and transfers will then occur automatically. As the data comes in, it is streamed to your browser, where the Lightly UI visualizes it and helps you manage it. In addition, automatic processing can ensure that only value-adding samples are added to the training dataset. In this setup, no data leaves the servers, guaranteeing data security.
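To make the idea of an automated, incremental transfer more concrete, here is a minimal sketch in Python. It is not Lightly's implementation: the names `IngestState` and `ingest_new_objects` are hypothetical, the storage listing is simulated with an in-memory dictionary instead of real S3 calls, and "value-adding" is reduced to the simplest possible rule, namely skipping objects that were already processed or that are byte-identical to an earlier sample.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class IngestState:
    """Remembers what the pipeline has already pulled from storage."""
    seen_keys: set = field(default_factory=set)
    seen_hashes: set = field(default_factory=set)


def ingest_new_objects(listing, state):
    """Return keys of objects that are new and not exact duplicates.

    `listing` maps object keys to raw bytes. In a real pipeline it would
    come from the storage provider's list/get calls (assumption: here it
    is a plain dictionary so the sketch stays self-contained).
    """
    accepted = []
    for key in sorted(listing):
        if key in state.seen_keys:
            continue  # already processed in an earlier run
        state.seen_keys.add(key)
        digest = hashlib.sha256(listing[key]).hexdigest()
        if digest in state.seen_hashes:
            continue  # byte-identical to a sample we already have
        state.seen_hashes.add(digest)
        accepted.append(key)
    return accepted


state = IngestState()
# First run: both objects are new.
first = ingest_new_objects({"a.jpg": b"cat", "b.jpg": b"dog"}, state)
# Second run: a.jpg is known, c.jpg duplicates b.jpg, d.jpg is new.
second = ingest_new_objects(
    {"a.jpg": b"cat", "c.jpg": b"dog", "d.jpg": b"bird"}, state
)
print(first)
print(second)
```

Keeping the state between runs is what turns a one-off copy into a continuous flow: each run only looks at what has arrived since the last one.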

Figure 1: CLI command to connect S3 to Lightly; screenshot by the author

2. Tooling connection

A common data pipeline challenge in ML is the overhead of getting a dataset curated and labeled. This overhead comes from a variety of tasks, e.g.:

  • frame extraction from videos
  • manual selection for representative samples
  • eliminating edge cases and corner cases
  • rebalancing of the dataset to avoid bias
  • manual annotation
  • finding wrongly labeled samples
  • training models and tracking experiments
  • model debugging and finding false positives
  • writing scripts for data transfer

Since many engineering teams outsource some of these tasks, the data often has to be moved manually to a third party’s platform. With an automated data pipeline, this transfer can happen with little effort. There are two options: writing scripts or using a tool such as Lightly. The Lightly API can be connected directly to a variety of data labeling tools and services, MLOps platforms, and experiment tracking tools. As illustrated in Figure 3 of part 1 of this blog series, Lightly could be connected with Labelstudio and Weights & Biases, and data could then be transferred directly from one platform to the other.
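For the "writing scripts" option, a transfer script often boils down to converting a list of selected samples into whatever import format the next tool expects. The sketch below is only an illustration: `build_labeling_tasks` and the example URL are hypothetical, and the task layout (a list of objects with a `data` field holding an `image` URL) is an assumption modeled on the JSON import style of common labeling tools, so check the exact field names against your tool's documentation.

```python
import json


def build_labeling_tasks(selected_files, base_url):
    """Turn selected filenames into a list of labeling-task dictionaries.

    Assumption: the labeling tool imports tasks shaped like
    {"data": {"image": <url>}}; adapt the keys to your tool.
    """
    base = base_url.rstrip("/")
    return [{"data": {"image": f"{base}/{name}"}} for name in selected_files]


# Hypothetical selection produced by an earlier pipeline stage.
tasks = build_labeling_tasks(
    ["frame_001.jpg", "frame_042.jpg"], "https://example.com/bucket/"
)
payload = json.dumps(tasks, indent=2)
print(payload)
```

Every such script has to be written and maintained per tool pair, which is the overhead that a tool with ready-made integrations is meant to remove.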

3. Installation and monitoring

In this step, we want to automate the workflow. We decide which steps will be manual and which automatic, and set parameters for the different data operations (e.g., the criteria under which samples from an incoming data collection batch are added to the training set). If everything is installed properly, the machine learning engineer can relax and watch incoming data flow automatically through the machine learning development steps. Along the pipeline, redundant data or data that adds little value is removed using Lightly, so that the model is trained only on essential samples. This lets the engineer focus on model training and deployment while only occasionally having to monitor and check the data flow, dataset quality, and flagged anomalies.
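As a rough sketch of what such a parameterized selection rule might look like, the Python example below adds an incoming sample only if its embedding is far enough from everything already in the training set, and produces a small report for monitoring. This is an illustrative assumption, not the selection method Lightly uses: the function `select_batch`, the `min_distance` parameter, and the two-dimensional embeddings are all hypothetical.

```python
import math


def select_batch(existing, incoming, min_distance):
    """Split an incoming batch into samples to add and samples to drop.

    A sample is added only if its embedding is at least `min_distance`
    (Euclidean) away from every embedding already kept.
    """
    kept = list(existing)
    added, dropped = [], []
    for name, embedding in incoming:
        nearest = min(
            (math.dist(embedding, e) for e in kept), default=math.inf
        )
        if nearest >= min_distance:
            kept.append(embedding)  # later samples are compared to it too
            added.append(name)
        else:
            dropped.append(name)  # too similar to existing data
    return added, dropped


existing = [(0.0, 0.0)]
incoming = [
    ("img_1", (0.1, 0.0)),   # near-duplicate of existing data
    ("img_2", (1.0, 1.0)),   # novel sample
    ("img_3", (1.05, 1.0)),  # near-duplicate of img_2
]
added, dropped = select_batch(existing, incoming, min_distance=0.5)

# A per-batch report like this is what the engineer would monitor.
report = {"added": len(added), "dropped": len(dropped)}
print(added, dropped, report)
```

The threshold is the kind of parameter that gets set once during installation; afterwards, a sudden change in the added/dropped ratio is a useful signal that the incoming data has shifted.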

Figure 2: Automatic data flow management with Lightly; illustration by Lightly 


Data pipelines should be built in a way that enables automated processing. It is also vital to choose a data pipeline design and architecture that can scale to large amounts of data, so that future refactoring can be avoided. The goal is to have no manual steps, or at least as few as possible. Thus, having seamless integrations with every tool along the machine learning pipeline is crucial.

Matthias Heller,
