Under the hood of LightlyStudio: why we rebuilt and what makes it Gen-2
LightlyOne has evolved. We rebuilt the core from the ground up to deliver what modern ML teams need: a unified tool for curation, labeling, training, and evaluation, all built on a new foundation. We’ve swapped our document database for a columnar analytics engine (DuckDB), replaced performance bottlenecks with Rust, and created a typed, Python-first SDK.
To understand why, we need to rewind to the "Gen-1" world LightlyOne was born into. The ML ecosystem was a different beast.
The Gen-1 World: Heavy Containers and Slow Feedback
When we started, TensorFlow was king, PyTorch was the new kid, and every minor CUDA upgrade meant hours of dependency hell. Running anything GPU-related reliably across machines was so painful that Docker wasn't a choice; it was a necessity for us.
LightlyOne was designed for that world. Users created Python jobs that ran inside Docker containers, which we scheduled on the backend. It worked, and it scaled: we had customers running it across 96 CPU cores to process over 10 million images within a few hours.
LightlyOne was effective in its time, designed for an era where containers were crucial for reproducible and scalable GPU workloads. However, its architecture now struggles to keep pace with the rapid iteration expected by modern ML teams.
Testing required building and shipping large containers.
Feedback only arrived once a job had finished.
Iterating on data felt disconnected from the rest of the workflow.
While the design was well suited to its original purpose, current workflows demand a more agile and direct approach.
This architecture had another bottleneck. Our core algorithms needed C++ for performance, but shipping mixed C++/Python binaries is a nightmare. Docker "solved" this by making it portable, but it just made the feedback loop even slower.
This architectural debt boiled down to two forms of "friction" that were biting us and our users every day:
Data-Model Friction: Our document database was fighting our queries. We wanted to ask things like WHERE model_version='rtdetr-0.8' AND iou<0.5 or GROUP BY class. These are analytic queries: columnar engines crush these patterns; our document store fought them.
Workflow Friction: When curation and labeling live in different systems, you lose the moment. You find a rich pocket of failure cases... then you export a JSON or CSV file, re-import it elsewhere, re-index, and re-label. The context, and the momentum with it, is lost.
We tried to bend LightlyOne further. It wasn’t enough. To match the speed and scale of modern ML and make embeddings truly first-class, we needed a clean start.
What we mean by “Gen-2” (and how to spot it)
“Gen-2” isn’t defined by its tech stack. It’s defined by how it feels to use. A Gen-2 data tool for ML makes iteration instant, setup effortless, and workflows unified. LightlyStudio focuses on the two things that matter most: speed and flow.
What makes LightlyStudio a Gen-2 tool:
Zero-friction setup. Install with a single command and start in minutes. No Docker, no backend services, no waiting for containers to spin up.
Instant feedback. Operations run interactively instead of being queued as background jobs. Explore data, run queries, and see results in under a second.
Unified workflows. Curation, labeling, and analysis live in the same place, so momentum is never lost. Find issues, fix them, and move on without switching tools.
Cross-modal by design. Images, text, and 3D data share the same schemas and search space. Cross-modal queries just work.
Typed, discoverable SDK. A Python interface that is both rich and predictable, making it easy to extend or integrate; a short usage sketch follows this list.
Reproducible by default. Every subset and annotation session is tracked and versioned automatically.
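To make those last points concrete, here is a minimal sketch of what an interactive session could look like. The import path and method names below are illustrative assumptions, not the actual LightlyStudio API; the docs have the real interface.

```python
# Hypothetical usage sketch: the names below are illustrative, not the real API.
from lightly_studio import Dataset  # assumed import path

# Point LightlyStudio at a local folder: no Docker, no backend services.
ds = Dataset.from_folder("./images")  # hypothetical constructor

# Interactive, sub-second filtering instead of a queued background job.
hard_cases = ds.filter("model_version == 'rtdetr-0.8' and iou < 0.5")  # hypothetical

# Curation and labeling live together: tag the subset and keep working.
hard_cases.tag("relabel-queue")  # hypothetical
```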
Table 1: LightlyOne vs. LightlyStudio comparison table.

| Category | LightlyOne (Gen-1) | LightlyStudio (Gen-2) | What it means for you |
| --- | --- | --- | --- |
| Scalability | Handles up to a few hundred thousand samples before the UI lags | Interactive performance on 10M+ samples | Work at full dataset scale without prefiltering |
| Performance | Heavy jobs run offline or in background containers | Sub-second reactions on ImageNet-scale queries using DuckDB and Rust | Explore and iterate interactively instead of waiting for jobs |
| Multimodality | Image-centric pipelines | Embedding-native, designed for text, 3D, and cross-modal search | Curate and analyze multimodal data in a single workflow |
The architecture we landed on
We rebuilt LightlyStudio from the ground up with four clean layers.
We replaced MongoDB with DuckDB, an embedded analytical database that runs directly inside the Python process. It requires no extra installation or server setup and is small enough to ship as part of the package. Because everything happens in memory, computations are extremely fast and filters or joins over tens of millions of samples complete almost instantly.
Rust handles the heavy lifting: computations, projections, and selection. Python, via FastAPI, orchestrates everything else. We moved from C++ to Rust because it gives us deterministic performance, easy builds for macOS, Windows, and Linux across x86 and ARM architectures, and a future path to WebAssembly for browser-side performance.
Our new frontend uses streaming data loading to handle 10M+ samples interactively. You can zoom, search, and label instantly.
Interactive experience. Streaming makes exploring large datasets feel immediate. The frontend and SDK mirror this responsiveness with typed models and instant validation.
The result of this new architecture? Even before our full performance-tuning pass, the current version of LightlyStudio can process all 1.4 million images of ImageNet on a standard MacBook Pro, fully offline.
Why DuckDB
The size of ML datasets has grown massively since we introduced LightlyOne, requiring a more modern database solution. We chose DuckDB because it unlocks:
Built for Analytics, Not Documents: DuckDB is a columnar engine. This means it only reads the columns you need for a query (like WHERE class='cat'), instead of scanning entire documents.
Blazing-Fast Laptop-Scale Performance: It's an in-process database, meaning it runs inside the same process as LightlyStudio. There is no heavy server to run. On ImageNet-sized datasets, typical queries complete in under half a second on a laptop and stream directly to the UI.
Lightweight & Portable: DuckDB ships as a simple pip package. This eliminates the heavy dependencies of database servers (like MongoDB) and makes LightlyStudio easy to install and run anywhere, including offline on a MacBook.
Orders of Magnitude Faster for ML: For the analytical queries ML teams actually run (complex aggregations, joins, and filters), columnar engines like DuckDB outperform document databases by orders of magnitude (often 10x-100x faster).
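As an illustration of the query pattern this enables, the sketch below uses the real duckdb Python package against a synthetic metadata table; the table and column names (samples, class, iou, model_version) are made up for the example.

```python
import duckdb

# DuckDB runs inside the Python process: no server, just `pip install duckdb`.
con = duckdb.connect()  # in-memory database

# Synthetic per-sample metadata standing in for real prediction stats.
con.execute("""
    CREATE TABLE samples AS
    SELECT
        'class_' || CAST(i % 1000 AS VARCHAR)  AS class,
        random()                               AS iou,
        'rtdetr-0.' || CAST(i % 9 AS VARCHAR)  AS model_version
    FROM range(1400000) t(i)  -- roughly ImageNet-sized
""")

# A columnar engine only reads the columns this query touches, which is
# why analytic filters and aggregations like this stay fast on a laptop.
rows = con.execute("""
    SELECT class, count(*) AS n, avg(iou) AS mean_iou
    FROM samples
    WHERE model_version = 'rtdetr-0.8' AND iou < 0.5
    GROUP BY class
    ORDER BY n DESC
    LIMIT 10
""").fetchall()
print(rows)
```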
Why Rust
Python is perfect for orchestration; Rust is perfect for predictable, safe, high-throughput code.
In LightlyOne, our performance-critical code was in C++. This gave us speed, but shipping mixed C++/Python binaries reliably across every OS (macOS, Linux, Windows) and CPU (x86, ARM) was a constant maintenance battle.
Rust is our solution. It gives us C++-level performance but with a modern compiler that guarantees memory safety and produces a single, static binary that just works. We've replaced our Python bottlenecks and C++ "hot paths" with Rust, specifically for:
Typicality computation
Dimensionality reduction and 2D projections (PaCMAP)
Sampling algorithms and diversity selection
With the new Rust core, typicality selection and 2D projection using PaCMAP are both around 50x faster and no longer depend on complex Python tooling like Numba.
This new core guarantees deterministic performance without a garbage collector and without manual memory tuning, eliminating an entire class of bugs.
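The Rust implementation itself isn't shown here, but one common way to define typicality (an assumption on our part; the actual algorithm may differ) is a sample's average cosine similarity to its k nearest neighbors. A NumPy sketch of that idea:

```python
import numpy as np

def typicality_scores(embeddings: np.ndarray, k: int = 20) -> np.ndarray:
    """Score each sample by mean cosine similarity to its k nearest neighbors.

    High scores mark prototypical samples, low scores mark outliers.
    This is a conceptual NumPy sketch, not the Rust core's implementation.
    """
    # L2-normalize so dot products become cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T                    # (n, n) pairwise similarity matrix
    np.fill_diagonal(sims, -np.inf)   # exclude self-similarity
    # Mean similarity over the k most similar neighbors of each sample.
    topk = np.partition(sims, -k, axis=1)[:, -k:]
    return topk.mean(axis=1)

# Example: select the 100 most typical samples from toy embeddings.
emb = np.random.randn(2_000, 512).astype(np.float32)
most_typical = np.argsort(typicality_scores(emb))[-100:]
```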
Multimodality from day one
LightlyStudio treats every modality equally. Any encoder that outputs vectors and a schema can register, covering images, text, video, and 3D point clouds.
Cross-modal search: Filter by text, find matching images, or pivot from 3D data to captions.
Unified schema: Consistent handling for bboxes, polygons, keypoints, masks and captions.
Future-ready: Native support for text-only datasets means you can also curate LLM fine-tuning corpora.
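Under the hood, "cross-modal queries just work" reduces to nearest-neighbor search in a shared embedding space: if images and text are embedded by the same CLIP-style encoder, a text query is just a vector compared against image vectors. A minimal sketch, assuming such embeddings already exist:

```python
import numpy as np

def cross_modal_search(text_vec: np.ndarray, image_vecs: np.ndarray, top_k: int = 5):
    """Return the indices of the images closest to a text query.

    Assumes text_vec and image_vecs live in the same shared embedding space
    (e.g. from a CLIP-style encoder); that shared space is what makes a
    text query comparable to image vectors at all.
    """
    t = text_vec / np.linalg.norm(text_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = imgs @ t  # cosine similarity of every image to the query
    return np.argsort(sims)[::-1][:top_k]

# Toy example with random stand-ins for real encoder outputs.
image_embeddings = np.random.randn(1_000, 512)
text_query = np.random.randn(512)  # would come from the text encoder
print(cross_modal_search(text_query, image_embeddings))
```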
Roadmap: what’s next
This new foundation allows us to move faster than ever. We're also expanding our support for richer 3D and temporal sequences and building deeper ecosystem bridges to make LightlyStudio the true heartbeat of your ML loop. For teams that want a zero-infrastructure solution, a managed cloud version with real-time collaboration is on the way.
Get Started with Gen-2
Try LightlyStudio locally. It's open source under the Apache 2.0 license. Run pip install lightly-studio and see the performance on your own data.
Explore the docs and examples to see how to plug LightlyStudio into your existing pipeline.
Contribute on GitHub. Help us add a selector, wire up an encoder, or file an issue to report bugs or request new features.
Planning a migration? If you're on LightlyOne, we're here to help map your workflows. LightlyOne will remain supported at least until the end of 2026.
We built LightlyStudio to be the data tool we wanted the day embeddings became mainstream: unified, fast, and engineer-friendly, with multimodality as a core feature. If you’ve felt the pain of Gen-1 pipelines, we think LightlyStudio will feel like finally taking the limiter off.
Welcome to Gen-2.