Unstructured data is growing exponentially and creating new challenges. This article presents five ways to tackle data problems in computer vision and machine learning: understanding data, curating data, building an efficient data pipeline, managing data and sharing data.
Two assumptions can be made about the machine learning field. First, it is no secret that obtaining data has become cheap and that the amount of data is increasing fast. Indeed, more data will be created in the next 3 years than in the last 30 years (IDC, 2020). While big data opens up endless opportunities, its vastness generates a new challenge: handling these huge amounts of data. Second, and closely linked to rapidly growing data, research in the machine learning field is fast-moving. Papers are published at great speed and are often accompanied by a working codebase. In fact, some websites, such as Papers With Code, exist exclusively for papers published with a GitHub repository. Along with the codebase, tutorials for implementing the new code in various machine learning architectures are also available. Thus, despite the evolving nature of this complex field, new research can be implemented almost instantly thanks to extensive practical guidance. Indeed, any machine learning engineer would be able to train a new model on a custom dataset.
As a result, the competitive advantage in the machine learning market lies not in how quickly the most recent research can be applied to a new machine learning system, but rather in finding a way to handle and curate vast, almost unlimited sets of raw data. This blog post outlines five ways to tackle this data challenge and presents Lightly’s current solutions.
1. Understand your data
As its name indicates, machine learning is the process of teaching a machine to act in certain ways. A machine learns from the data it is fed, so machine learning models are highly dependent on data. Indeed, data has become a commodity for companies using machine learning and is being collected in great quantities. Once collected, the data is raw, in other words unprocessed, and of little use. Moreover, processing steps such as labeling are very expensive and time-consuming, and are therefore usually carried out for only a small fraction of the dataset. Hence, understanding data in its raw form is essential to save time and costs.
Understanding raw data includes two processes: an assessment and an exploration of the data (towardsdatascience, 2019). The assessment includes describing high-level characteristics of the dataset, such as size and format. This process is enabled by uploading datasets to the Lightly platform. There, such information (datatype, number of samples and sample size) is automatically shown, as exemplified below for a public dataset of road images, Comma10k.
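The assessment step can also be approximated locally before uploading anything. The following is a minimal sketch, not Lightly's implementation: the function name `summarize_dataset` and the returned keys are made up for illustration, and it only assumes that the dataset is a folder of files.

```python
from collections import Counter
from pathlib import Path


def summarize_dataset(root):
    """Return high-level characteristics of all files under `root`."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    sizes = [p.stat().st_size for p in files]
    return {
        "num_samples": len(files),
        # Number of files per (lower-cased) extension, e.g. {".png": 2}
        "formats": dict(Counter(p.suffix.lower() for p in files)),
        "total_bytes": sum(sizes),
        "mean_bytes": sum(sizes) / len(sizes) if sizes else 0.0,
    }
```

Calling `summarize_dataset("path/to/images")` on a local copy of a dataset gives the same kind of overview (number of samples, formats, sample size) in a few lines.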
Once the data has been assessed, it must be explored. Indeed, the size and format of the Comma10k dataset provide little insight into the actual content and distribution of these images. Exploring data is one of Lightly’s unique features. Its visualization tool lets teams see clusters of similar images in a scatterplot. This way, categories of data are formed. Additionally, outliers or missing points can be identified in this view. Exploring data is crucial in every industry. Automotive companies developing autonomous vehicles, for instance, have huge datasets containing many hours of video footage. Without an understanding of the collected data, these companies have no precise indication of the type of images they have collected. The plot obtained with Lightly provides visual insights into the collected data by showing groups of similar data.
Below, the Comma10k dataset is shown through the visualization tool on the Lightly platform. A cluster of night images on the left and another of day images on the right can be observed. Additionally, an outlier is shown, placed between the day and night clusters. This image seems to have been taken at dawn, hence its in-between nature. This is a significant insight, as it exposes a scenario for which little data has been collected.
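The idea of an image sitting between two clusters can be illustrated with a toy example. The sketch below assumes every image has already been reduced to a 2D point (how Lightly computes these coordinates is not covered here) and flags points whose nearest neighbor is unusually far away. The function names, the toy coordinates and the `factor=3.0` threshold are illustrative assumptions.

```python
import math
from statistics import median


def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def find_outliers(points, factor=3.0):
    """Indices of points whose nearest neighbor is unusually far away.

    "Unusually far" means more than `factor` times the median
    nearest-neighbor distance (an arbitrary choice for this sketch).
    """
    nn = nearest_neighbor_distances(points)
    threshold = factor * median(nn)
    return [i for i, d in enumerate(nn) if d > threshold]


# Toy 2D "embeddings": a night cluster, a day cluster, one image in between.
night = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
day = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
dawn = [(2.5, 2.5)]
points = night + day + dawn

outliers = find_outliers(points)  # [8], the index of the dawn image
```

The brute-force neighbor search is quadratic in the number of images, which is fine for a toy example but would need an index structure for a real dataset.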
The distribution of the data can also be observed for various parameters, such as file size, width, height, signal-to-noise ratio, luminance, contrast or color channels. Engineering teams can select the data that interests them by setting ranges for these parameters. Below, the signal-to-noise ratio is explored. For images, this relates to image quality. In the examples below, images of different quality have been selected. Note that darker images, often taken at night, are of lower quality. Thus, the parameters are an additional way of understanding the dataset.
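To make these parameters concrete, here is a small sketch that computes three of them for grayscale images given as rows of 0-255 values. The definitions are assumptions chosen for illustration (mean intensity for luminance, standard deviation for contrast, mean divided by standard deviation for signal-to-noise ratio); the exact formulas used on the Lightly platform are not given in this post. The helper names and the toy images are hypothetical.

```python
from statistics import fmean, pstdev


def image_stats(gray):
    """Simple statistics for a grayscale image given as rows of 0-255 values."""
    pixels = [v for row in gray for v in row]
    mean = fmean(pixels)   # mean luminance
    std = pstdev(pixels)   # RMS contrast
    snr = mean / std if std > 0 else float("inf")
    return {"luminance": mean, "contrast": std, "snr": snr}


def select_by_range(stats_by_name, key, low, high):
    """Names of images whose statistic `key` lies in [low, high]."""
    return sorted(n for n, s in stats_by_name.items() if low <= s[key] <= high)


# Two tiny 2x2 toy images: a bright "day" frame and a dark "night" frame.
images = {
    "day": [[200, 210], [190, 200]],
    "night": [[10, 30], [0, 20]],
}
stats = {name: image_stats(img) for name, img in images.items()}

dark_images = select_by_range(stats, "luminance", 0, 50)  # ["night"]
```

In this toy data the dark frame also has the lower signal-to-noise ratio, which mirrors the observation about night images above.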
2. Curate your data
No matter how much machine learning engineers attempt to perfect their models and increase their performance, a model is only as good as the data it is trained on. Think of children and education: children can be very smart, but the books they read and their access to education ultimately determine what they know. Thus, once the raw data has been properly understood, the selection of training data is crucial. This data must be of the highest quality and relevance for the task at hand.
The curation process involves selecting and filtering data so that the model is well prepared for its task. To make a machine learning model as performant as possible, the training set must be representative of different scenarios. With Lightly’s algorithms, redundant data is eliminated, while outliers and special scenarios are retained. For a dataset of street images, this would mean removing similar images captured at the same streetlight while keeping interesting scenarios, such as people crossing the street, and making sure that images from the different clusters, such as different types of road signs, vehicles or sidewalk distractions, are sufficiently represented. Lightly’s algorithms do this automatically, thereby creating a diverse and representative dataset. As a result, this automated data curation process is more efficient, faster and cheaper than manual selection.
The diversification of the subset is enabled by a sampling strategy named CORESET. The new dataset can be customized in terms of size, i.e. how many images it should contain, and minimum distance between two data points, i.e. the amount of visual difference between two images. This gives engineers full control over the training set and a better understanding of it.
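The two settings described above, subset size and minimum distance, map naturally onto a greedy farthest-point selection, a common way to approximate a coreset. The sketch below is not Lightly's CORESET implementation, which is not described in this post. It assumes every image is represented by a point (for example an embedding), and the function name, parameters and toy points are illustrative.

```python
import math


def greedy_coreset(points, n_samples, min_distance=0.0, start=0):
    """Greedy farthest-point selection.

    Repeatedly adds the point that is farthest from everything selected
    so far. Stops at `n_samples`, or earlier when no remaining point is
    at least `min_distance` away from the current selection.
    """
    selected = [start]
    # Distance from every point to its closest selected point.
    dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(n_samples, len(points)):
        idx = max(range(len(points)), key=dist.__getitem__)
        if dist[idx] <= 0.0 or dist[idx] < min_distance:
            break  # only duplicates or near-duplicates are left
        selected.append(idx)
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return selected


# Three near-duplicates, two similar images, and one isolated image.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]

subset = greedy_coreset(points, n_samples=3)                      # [0, 5, 3]
capped = greedy_coreset(points, n_samples=6, min_distance=1.0)    # [0, 5, 3]
```

In the second call, the minimum distance ends the selection after three images even though six were requested, because every remaining image is a near-duplicate of one already chosen.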
The subset is then processed and can be inspected in the same visualization tool. This gives full transparency and eliminates the black-box nature of most large datasets.
3. Build an efficient data pipeline
The continuously evolving industry brings along new obstacles in the data pipeline, as every step of the process becomes more complex and market players specialize. Hence, managing data through this crowded pipeline becomes a challenge.
While Lightly’s main tasks are centered around data selection, also referred to as data curation, its solutions facilitate the entire machine learning pipeline. First, collection processes are optimized as a result of a thorough understanding of the collected data. The insights highlight scenarios which are sufficiently covered as well as underrepresented edge cases for which more data must be collected. Second, data selection is twice as fast with Lightly’s algorithms. Data selection for model training is no longer done manually or randomly. Instead, training sets can be diversified according to preferences in a few clicks. Third, the data labeling process is optimized. Usually, as labeling is expensive and the datasets are large, only a very small fraction of the data, chosen at random, is labeled. With annotation costs of up to $10 per image, a well-curated dataset for autonomous driving can easily cost millions. With Lightly’s smart data selection process, the same accuracy can be reached with half the amount of data, saving up to millions in labeling costs. Fourth, model training is improved, as the curated training sets lead to better generalization and increased accuracy. Lastly, as all steps gain efficiency, the development process is accelerated, significantly reducing a product’s time to market.
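The labeling arithmetic can be made explicit with a quick back-of-the-envelope calculation. The $10 per image and the "half the data" figures come from the paragraph above. The dataset size of 200,000 images is a purely hypothetical number chosen for illustration.

```python
def labeling_cost(num_images, cost_per_image):
    """Total annotation cost in dollars."""
    return num_images * cost_per_image


COST_PER_IMAGE = 10.0        # upper bound quoted above
NUM_IMAGES = 200_000         # hypothetical dataset size

full = labeling_cost(NUM_IMAGES, COST_PER_IMAGE)          # 2,000,000.0
curated = labeling_cost(NUM_IMAGES // 2, COST_PER_IMAGE)  # 1,000,000.0
savings = full - curated                                  # 1,000,000.0
```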
Furthermore, Lightly’s application programming interface (API) allows companies to easily integrate the solution into existing data pipelines. We refer to this as Lightly Mission Control. Data can be managed in one place, and easily transferred from step to step.
Lightly also offers active learning. This is the process of choosing samples the model struggles with, labeling them and adding them to the training set so the model can be trained again. Once labeled, these difficult samples are the ones that help the model improve its accuracy the most. It is similar to teaching children: once they can solve simple problems, it no longer makes sense to give them more simple problems. Instead, they should be given more challenging tasks to improve further.
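One common way to find "samples the model struggles with" is uncertainty sampling: rank unlabeled samples by how unsure the model's prediction is and send the top ones to labeling. The sketch below uses prediction entropy as the uncertainty score. It is a generic illustration rather than Lightly's active learning implementation, and the sample ids, probabilities and function names are made up.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the `budget` sample ids the model is least certain about."""
    ranked = sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)
    return ranked[:budget]


# Hypothetical softmax outputs of a 3-class model on unlabeled images.
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident
    "img_002": [0.40, 0.35, 0.25],  # uncertain
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.34, 0.33, 0.33],  # close to a coin toss between classes
}

to_label = select_for_labeling(predictions, budget=2)  # ["img_004", "img_002"]
```

After the selected images are labeled and added to the training set, the model is retrained and the loop repeats with fresh predictions.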
In addition, the platform can integrate labeling partners from across the world in order to facilitate the entire machine learning pipeline. Thus, engineering teams can simply upload the collected dataset and the rest of the process will be carried out on the same platform, thereby increasing the data pipeline’s efficiency.
4. Manage your data
Managing data is an obvious challenge of big data. The large quantities of data collected over time must be administered carefully in order to be exploited efficiently.
Data management, defined as “the process of ingesting, storing, organizing and maintaining the data created and collected by an organization” (Tech Target, 2019), entails different practices. First, data can be managed in time. Versioning allows datasets to be continuously updated and improved over time, for different projects for example. A past dataset could become relevant for new projects. Therefore, storing it with a precise indication of its content is essential. Tracking data over time is related to versioning. Here, however, the emphasis is on the ability to go back in time and observe a dataset’s evolution or check its current relevance. Second, data can also be managed in space. Splitting or merging datasets are examples of such management. Indeed, some models are trained to specialize in specific tasks with specific data, while others require a vast and diverse dataset to train a general model.
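Both kinds of management can be sketched in a few lines. The example below treats a dataset as a collection of sample ids with metadata, derives a version fingerprint from its content (managing in time), and splits or merges subsets (managing in space). All names, the metadata field `country` and the file names are hypothetical, and real tooling would also track the file contents, not just the ids.

```python
import hashlib


def dataset_version(sample_ids):
    """Short fingerprint that changes whenever the set of samples changes."""
    digest = hashlib.sha256()
    for sample_id in sorted(set(sample_ids)):
        digest.update(sample_id.encode("utf-8") + b"\n")
    return digest.hexdigest()[:12]


def split_by(samples, key):
    """Split {sample_id: metadata} into subsets keyed by a metadata field."""
    subsets = {}
    for sample_id, meta in samples.items():
        subsets.setdefault(meta[key], set()).add(sample_id)
    return subsets


def merge(*subsets):
    """Merge subsets without duplicating shared samples."""
    return set().union(*subsets)


samples = {
    "frame_001.jpg": {"country": "CH"},
    "frame_002.jpg": {"country": "DE"},
    "frame_003.jpg": {"country": "CH"},
}

by_country = split_by(samples, "country")       # one subset per country
combined = merge(by_country["CH"], by_country["DE"])
version = dataset_version(combined)             # same for the same samples
```

Because the fingerprint is computed from the sorted sample ids, two teams holding the same samples get the same version string regardless of the order in which the data was collected.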
Data management is relevant in almost every industry, each for its own purposes. On the one hand, autonomous driving companies split data by creating subsets for the different countries they operate in, as road signage differs. On the other hand, night and day footage as well as frames of unexpected animals or pedestrians crossing roads are significant in most cases and can be added to or merged with several training subsets. New versions of datasets for these different use cases are then created and tracked over time.
Currently with Lightly, subsets can easily be generated by manually selecting clusters with the visualization tool. They can be named and utilized separately. Below, a cluster of 63 similar images of bridges has been manually selected. A separate subset could be created if further diversification or manipulations are desired within this subset.
5. Share your data internally
As previously mentioned, data has become a commodity. It is valuable to many people in an organization for various projects and goals. Having a central platform to share data is crucial for coordination and optimization within engineering teams, as teams can work on projects in one place. Along with engineers, other stakeholders should be granted access to the datasets in order to facilitate coordination and communication among the various functions and divisions of the company. This way, processes become more transparent to those who are otherwise disconnected from the data pipeline. Nevertheless, while data sharing is important, access management should be implemented as well because of data privacy concerns.
In addition, data sharing allows for reproducibility. In other words, different teams can obtain the same results when working with the same datasets on other projects. Indeed, when the work and processes of colleagues are visible to others, internal processes for working on data can be standardized across the organization.
Currently, Lightly users sign in with individual accounts that give them access to their personal datasets. Ideally, this system would be upgraded to allow datasets to be shared among users of the same organization, if desired.
Author: Sarah Meibom
Thanks to Malte Ebner, Jeremy Prescott and Igor Susmelj for reading the drafts of this blog.