
Improving Data Quality in Autonomous Driving: Hirundo Uncovered Mislabels in 10% of BDD100K

Shmuel Y. Hayoun
April 4, 2024


The Berkeley DeepDrive 100K (BDD100K) is considered one of the most expansive and trusted datasets for autonomous driving. Using Hirundo’s proprietary Data Influence analysis, we automatically flagged data issues and mislabels in over 10% of the dataset, and resolved them as well. This article covers our findings, why they matter, and how we obtained them.

What is BDD100K and Why It Matters

In the rapidly evolving world of autonomous driving, datasets like BDD100K have emerged as critical resources for the development and testing of machine learning models.

This expansive dataset comprises over 100,000 video sequences captured at diverse times of day, under various weather conditions, and across multiple urban and rural landscapes. It is a well-established dataset that is used both to develop and to benchmark machine learning models for autonomous vehicles.

The Importance of Data Quality

Data quality plays an integral role in the world of artificial intelligence and machine learning. Meticulously collected and carefully curated data that is accurate and reliable leads to models that are efficient, precise, and robust. These models are capable of making accurate predictions and decisions, thereby enabling businesses and organizations to operate more effectively and efficiently, and to make informed, data-driven decisions.

In contrast, inaccurate or mislabeled data can have detrimental effects. It can lead to skewed, inconsistent results and poor overall performance of the machine learning models. The inevitable outcome is models that make inaccurate predictions and exhibit ineffective decision-making, which at best will cause a loss of resources, but might have more dire consequences in safety-critical systems.

That applies to all kinds of models and sectors. A computer vision (CV) model for an autonomous driving or medical diagnosis use case could make life-threatening false predictions if it relies on faulty data. A natural language processing (NLP) model for marketing that was trained on mislabeled text would fail to understand the sentiment and intent of new texts. Models used for financial fraud detection would fail to capture suspicious transactions if their data were skewed. And data is always skewed to some degree, in any model and any use case.

Note: Though in general there are various types of mislabels, depending on the domain and deep learning task, in the scope of this post our focus is on misclassifications.

Where Mislabels Come From

Mislabels are a given in almost all datasets, despite continuous efforts to prevent them. They derive from the basic mode of labeling operations, and range from human-derived errors (in manual labeling) to computer-derived errors (in automated labeling).

Manual data annotation is a complex and tedious task that requires a high level of attention to detail. As such, it is prone to errors from a variety of causes: ambiguity in the data, data that requires deep domain expertise, and inconsistencies in the guidelines and formatting. Last but not least, it is an exhausting task that invariably comes with a heavy workload and usually incentivizes quantity and speed over quality.

Automated labeling, while efficient, can introduce errors due to algorithmic limitations coupled with complex or nuanced data. Computer vision models may misinterpret images due to poor lighting or obstructions, while natural language algorithms might overlook context or sarcasm. These errors are systematic, lacking the nuanced understanding of human annotators.

Dealing with Mislabels: Common Practices and Their Shortfalls

Given the critical need for accurately labeled data, various methods (and paid software) are used in industry to tackle the persistent problem of mislabels. These include leveraging pre-trained models, conducting statistical analysis of the dataset, seeking consensus between human annotators, and crowdsourcing annotations. None, however, has become a gold standard for perfecting datasets.

  • Pre-trained models (using AI to improve datasets) - Pre-trained models can help spot mislabels in datasets by comparing their predictions with the given annotations. This method, though dependent on the model's accuracy, can help flag outliers or inconsistencies for further review. However, applicability is limited to unseen data. Therefore, the ability to effectively analyze the training set is dependent on the availability of high-quality open-source models or datasets in a specific domain. Otherwise, this can be accomplished by training multiple models on different subsets of the data, each used to examine a designated subset, which can be a highly time-consuming and resource-intensive process.
  • Statistical analysis (looking for clues in the data) - Statistical analysis of the data features can be used to identify outliers by highlighting data points that deviate from the general distribution patterns. This method, while useful for catching gross errors or widely deviant points, may not effectively capture the nuanced or contextual mislabels that do not necessarily manifest as statistical outliers.
  • Consensus labeling (using human review to improve datasets) - Human review remains essential for identifying complex mislabels. Human annotators can provide nuanced insights that automated systems might overlook. However, it is time-consuming and costly, and the reviewers will often suffer from the same ambiguity and fatigue as the original labelers.
  • Crowdsourcing (leveraging the wisdom of the crowd) - Crowdsourcing can complement the previous methods by bringing diverse perspectives to the data review process, as well as making it more cost-effective for identifying errors in less specialized domains. However, the effectiveness degrades with the complexity of the data; the collective opinion of an average crowd would work well in telling dogs from cats, but it can’t replace an expert in deciding whether a tumor is present in a CT scan.
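
As a rough illustration of the multiple-model approach mentioned in the first bullet, the sketch below splits a dataset into folds, trains a simple classifier on the other folds, and flags every sample whose given label disagrees with the out-of-fold prediction. It is a toy example, not a tool used in our analysis: the function names, the nearest-centroid classifier, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np

def nearest_centroid_predict(train_x, train_y, test_x, n_classes):
    """Predict, for each test point, the class with the closest training centroid."""
    centroids = np.stack(
        [train_x[train_y == c].mean(axis=0) for c in range(n_classes)]
    )
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def flag_out_of_fold_disagreements(x, y, n_folds=5, seed=0):
    """Return sorted indices whose given label differs from the prediction
    of a model that never saw that sample during training."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    folds = np.array_split(order, n_folds)
    n_classes = int(y.max()) + 1
    flagged = []
    for held_out in folds:
        train_idx = np.setdiff1d(order, held_out)
        preds = nearest_centroid_predict(
            x[train_idx], y[train_idx], x[held_out], n_classes
        )
        flagged.extend(held_out[preds != y[held_out]].tolist())
    return sorted(flagged)

def make_toy_data(seed=0):
    """Two well-separated 2-D clusters with three deliberately flipped labels
    (hypothetical data, for illustration only)."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([
        rng.normal(-5.0, 0.5, size=(50, 2)),
        rng.normal(5.0, 0.5, size=(50, 2)),
    ])
    y = np.array([0] * 50 + [1] * 50)
    corrupted = [3, 60, 77]
    noisy_y = y.copy()
    noisy_y[corrupted] = 1 - noisy_y[corrupted]
    return x, noisy_y, corrupted

x, noisy_y, corrupted = make_toy_data()
suspects = flag_out_of_fold_disagreements(x, noisy_y)
print("flagged for review:", suspects)
```

Even in this tiny setting, one model has to be trained per fold, which hints at why the approach becomes time-consuming and resource-intensive for large deep learning models.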

Identifying Mislabels via Data Influence Analysis

Enter our unique data influence methodology. At its core, our influence analysis traces the impact of individual data points across the training lifecycle of machine learning models. This innovative technique allows us to pinpoint inaccuracies with unprecedented precision, significantly reducing the time and resources typically required for dataset correction.

By focusing on the effect of data points on the model's learning process, our method inherently incorporates both the data's statistical properties and its relevance to the model's performance. This dual focus ensures a more comprehensive analysis of data quality, highlighting mislabels that affect learning outcomes directly.

Unlike prediction-based approaches, ours does not rely on assuming that a pre-trained model is accurate in a specific domain, which makes it applicable and effective across various domains and sectors. Considering the relevance of each data point to the model training process allows for the detection of errors that are contextually significant and might otherwise not be flagged by statistical thresholds.
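
Our own method is proprietary and is not reproduced here. To give a feel for what "tracing the impact of a data point across training" can mean, the sketch below uses a publicly known stand-in instead: a TracIn-style self-influence score, which sums the learning rate times the squared norm of each example's loss gradient over training checkpoints. The logistic regression model, the hyperparameters, the function names, and the toy data are all assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_influence_scores(x, y, lr=0.1, n_steps=200, checkpoint_every=10):
    """Train logistic regression with full-batch gradient descent and, at each
    checkpoint, add lr * ||gradient of the example's loss||^2 to its score."""
    n, d = x.shape
    xb = np.hstack([x, np.ones((n, 1))])  # append a constant bias feature
    w = np.zeros(d + 1)
    scores = np.zeros(n)
    for step in range(n_steps):
        residual = sigmoid(xb @ w) - y    # d(loss)/d(logit) per example
        if step % checkpoint_every == 0:
            per_example_grad = residual[:, None] * xb
            scores += lr * np.sum(per_example_grad ** 2, axis=1)
        w -= lr * (xb.T @ residual) / n   # gradient descent update
    return scores

def make_toy_data(seed=0, flipped_index=5):
    """Two 2-D clusters with a single deliberately flipped label
    (hypothetical data, for illustration only)."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([
        rng.normal(-2.0, 0.5, size=(20, 2)),
        rng.normal(2.0, 0.5, size=(20, 2)),
    ])
    y = np.array([0.0] * 20 + [1.0] * 20)
    y[flipped_index] = 1.0 - y[flipped_index]
    return x, y, flipped_index

x, y, flipped_index = make_toy_data()
scores = self_influence_scores(x, y)
print("most suspicious sample:", int(np.argmax(scores)))
```

The intuition is that a correctly labeled sample stops contributing large gradients once the model has learned its class, while a mislabeled sample keeps pulling against the rest of the data throughout training and so accumulates a high score.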

While this article covers our analysis of BDD100K, an autonomous driving dataset, our platform also shows promising results in healthcare, Industry 4.0, agriculture, and many other domains.

Our Findings: Mislabels in 10% of BDD100K

Our findings after applying the data influence analysis to a significant subset of the BDD100K dataset were enlightening. Our approach's versatility across different tasks enabled us to deliver results at the level of individual labels, rather than at the broader level of an entire image.

More than 10% of the sampled images were found to contain issues, ranging from minor mislabels to blatant inaccuracies that could significantly impair model performance.
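
To make the relationship between label-level results and an image-level figure concrete, here is a minimal sketch of how per-box suspicion scores could be rolled up into the share of images with at least one flagged box. The record layout, the scores, and the threshold are made-up toy values, not BDD100K results, and the function names are hypothetical.

```python
def flag_suspicious_boxes(box_scores, threshold):
    """box_scores: list of (image_id, box_id, score) tuples.
    Return the (image_id, box_id) pairs whose score reaches the threshold."""
    return [(img, box) for img, box, score in box_scores if score >= threshold]

def image_level_issue_rate(box_scores, threshold):
    """Fraction of images that contain at least one flagged box."""
    all_images = {img for img, _, _ in box_scores}
    flagged_images = {img for img, _ in flag_suspicious_boxes(box_scores, threshold)}
    return len(flagged_images) / len(all_images)

# Toy scores for four images (illustrative values only).
toy_scores = [
    ("img_a", 0, 0.10), ("img_a", 1, 0.95),
    ("img_b", 0, 0.20),
    ("img_c", 0, 0.05), ("img_c", 1, 0.30),
    ("img_d", 0, 0.15),
]

print(flag_suspicious_boxes(toy_scores, threshold=0.8))
print(image_level_issue_rate(toy_scores, threshold=0.8))
```

Working at the level of individual boxes means a reviewer can be pointed to the specific label in question rather than to a whole image.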

Using the results from our influence analysis, we automatically flagged the most suspicious bounding boxes (b-boxes) across the dataset. Below is an example of an image with all of its original b-boxes, alongside a view featuring only the flagged b-boxes.

Following this step, we also used our platform to suggest a corrected label for each b-box. Below are a few interactive views of images found to hold mislabels, with the detected mislabels to the right, and the suggested corrected labels to the left.

The full dataset presenting our findings can be found here.

These findings underscore the critical need for ongoing vigilance and continuous improvement in data quality to support the safe development of autonomous driving technologies.


High-quality data is the lifeblood of any AI or machine learning system. It is the key to unlocking the transformative potential of these technologies and propelling the advancement of sectors like autonomous driving. Therefore, it is critical for organizations to invest in data quality management to ensure the accuracy and reliability of their data, and ultimately, the effectiveness of their AI systems.

While the challenge of maintaining the integrity of datasets like BDD100K is significant, it is not insurmountable with our data-influence-driven methodology. This method stands out for its efficiency, effectiveness, and versatility, and is applicable across all sectors where data plays a pivotal role.

We are on the brink of revolutionizing data accuracy and quality in industry. As we prepare to launch our cutting-edge Data Optimization platform, we invite organizations across all sectors to join us in this new era of data optimization where accuracy meets efficiency. Stay tuned for an unveiling that promises to harness the full potential of data, powered by the innovative spirit of our team.

About Hirundo

We are a Tel Aviv- and London-based startup offering an AI Optimization & Machine Unlearning platform that allows AI teams to quickly find and remove any unwanted and faulty data powering their AI/ML models.

Our platform uses state-of-the-art technology, combining proprietary methods and best-in-class open source solutions. Our team includes experienced professionals from academia and industry, including the Technion’s former Dean of Computer Science.

We have partnered with some of the industry’s leading companies, including Intel, Nvidia, Microsoft, Google and Amazon, and are starting to work with further early adopters looking to revolutionize the quality of their AI models.

Shmuel Y. Hayoun
Senior Deep Learning Researcher, Hirundo
