To efficiently curate their dataset, Greenwood used LightlyTrain to train their own DINOv2 model on unlabeled road surface images. The resulting model captures different road surface conditions much better than an off-the-shelf model. This made data curation the most effective lever for improving model performance, and LightlyTrain enabled that shift.
Using LightlyTrain and the custom DINOv2 model, the team generated embeddings for their entire dataset. These embeddings gave them a scalable way to explore the data, run similarity search, remove redundancy across millions of road-surface images, and extract valuable samples for labeling.
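As a rough illustration of two of these embedding-based steps, the sketch below runs a brute-force similarity search and a greedy near-duplicate filter over a handful of toy vectors. It assumes the embeddings have already been exported as plain lists of floats. The function names, the 0.95 cosine threshold, and the example vectors are hypothetical choices made for illustration, and they are not Greenwood's pipeline or the LightlyTrain API.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def most_similar(query, embeddings, k=3):
    """Return the indices of the k embeddings closest to the query."""
    q = normalize(query)
    scored = [(cosine(q, normalize(e)), i) for i, e in enumerate(embeddings)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


def drop_near_duplicates(embeddings, threshold=0.95):
    """Greedily keep an embedding only if it is not too similar to any kept one."""
    kept_indices = []
    kept_vectors = []
    for i, e in enumerate(embeddings):
        v = normalize(e)
        if all(cosine(v, u) < threshold for u in kept_vectors):
            kept_indices.append(i)
            kept_vectors.append(v)
    return kept_indices


# Toy stand-ins for image embeddings (hypothetical values).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly identical to the first vector
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # nearly identical to the third vector
    [0.0, 0.0, 1.0],
]

kept = drop_near_duplicates(embeddings, threshold=0.95)
neighbors = most_similar([1.0, 0.1, 0.0], embeddings, k=2)
print("kept indices:", kept)
print("nearest neighbors of the query:", neighbors)
```

The pairwise loops here are quadratic in the number of images, so at the scale of millions of images an approximate nearest-neighbor index would typically replace them. The greedy filter also depends on the order in which images are visited, which is a simplification of this sketch.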
Why Curation Was Essential
With the current dataset, the team quickly reached the point where finding relevant samples for labeling meant manually inspecting hundreds or thousands of images.
What they needed instead was a better understanding of which images were actually informative, and LightlyTrain helped provide that structure. Three observations made the need clear:
- Training improvements from additional labeled data were modest
- Adding labels without addressing redundancy led to diminishing returns
- The dataset needed to be organized before annotation could have impact