The F1 Score is a key metric for evaluating classification models, especially with imbalanced data. It balances precision and recall to reflect a model’s true performance, offering a more reliable alternative to accuracy in tasks like fraud detection.
Below, you can find a quick summary of key points about the F1 score.
What is the F1 Score?
The F1 score is the harmonic mean of precision (P) and recall (R), ranging from 0 (worst) to 1 (best).
A score of 1 indicates that both precision and recall are 100%. A score of 0 means that either precision or recall is 0, i.e., the model has completely failed on the positive class.
How to interpret the F1 Score?
A higher F1 score (closer to 1) indicates that the model is effective at finding the positive class and makes few mistakes. If either precision or recall is low, the F1 score will also be low. F1 is only high when both are high.
For example, a model with 90% precision and 90% recall achieves an F1 of approximately 0.90. In comparison, a model with 90% precision but only 10% recall drops to an F1 of about 0.18 (2 × 0.9 × 0.1 / (0.9 + 0.1) = 0.18). This shows that F1 rewards balanced performance rather than excellence in a single metric.
Why use F1 over accuracy?
Accuracy can be misleading when classes are imbalanced. F1 is better for such cases because it focuses on the positive class and accounts for both false positives and false negatives. For example, in fraud detection, a model that always predicts "not fraud" will have high accuracy but an F1 score of 0, which reflects its failure to catch any fraud.
Evaluating classification models is a key part of any machine learning (ML) workflow. While the accuracy metric is common, it is often unreliable for imbalanced datasets. The F1 score combines both precision and recall to provide a clearer view when false positives and false negatives matter.
This article covers the F1 score, its importance, and how to apply it in real-world ML and computer vision tasks like fraud detection, medical diagnosis, and quality control.
We will explore the following:
And remember, improving your F1 score starts long before evaluation - it starts with better data. At Lightly, we help you boost model performance by focusing on the data that matters most:
Better data selection leads to more reliable metrics and better models.
The F1 score, also known as the ‘F-score’ or ‘F-measure’, is the harmonic mean of precision and recall. It provides a single number that balances both precision and recall, which helps you compare machine learning models more effectively. Mathematically, it is defined as:
F1 = 2 × (P × R) / (P + R)
Where precision (P) is the fraction of predicted positives that are correct, and recall (R) is the fraction of true instances that are correctly identified.
Alternatively, in terms of true positives (TP), false positives (FP), and false negatives (FN):
F1 = 2TP / (2TP + FP + FN)
The F1 Score ranges from 0 to 1. It only reaches a high value when both precision and recall are high (near 1). This makes it a strict but fair way to measure balanced performance, especially in situations involving imbalanced classes. If either metric is low, the F1 score drops quickly.
It is useful when you want to balance catching all positive cases and avoiding false alarms. Additionally, it is a part of the F-beta (Fβ) family, where F1 gives equal weight to both precision and recall. Fβ allows you to prioritize one over the other depending on your use case.
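To make the definition concrete, here is a minimal sketch in plain Python (no libraries; the helper names are just illustrative) that computes F1 both from precision and recall and directly from the TP/FP/FN counts:
# Minimal helpers for computing F1 (illustrative, not from any library)
def f1_from_pr(precision, recall):
    # Harmonic mean of precision and recall; 0 if both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp, fp, fn):
    # Equivalent form: F1 = 2TP / (2TP + FP + FN)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

print(f1_from_pr(0.9, 0.9))     # ≈ 0.90
print(f1_from_pr(0.9, 0.1))     # ≈ 0.18
print(f1_from_counts(2, 1, 1))  # ≈ 0.67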
To understand the F1 score, it is helpful first to understand the concepts of precision, recall, and accuracy. All three come from the confusion matrix, which is a standard tool for evaluating binary classification models.
A confusion matrix is a 2×2 table that summarizes prediction results for binary classification. It organizes predictions into four categories: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
The confusion matrix provides the counts of TP, FP, FN, and TN, which are the basis for calculating all of these metrics. It helps identify the types of errors a model makes and supports reliable metric selection and model improvement.
Precision is the positive predictive value. It measures the quality of positive predictions by calculating the proportion of correctly identified positives among all positive predictions:
Precision = TP / (TP + FP)
High precision means fewer false positives. In other words, when the model says “positive,” it is usually correct. Precision ignores the false negatives and true negatives. This is important in cases like spam detection, where you don’t want to flag real emails as spam. Here, a real email classified as spam would be a false positive.
Recall is also called sensitivity or the true positive rate. It measures how well the model finds all actual positive instances by calculating the proportion of actual positive instances it correctly identifies:
Recall = TP / (TP + FN)
High recall means the model misses few positive cases, i.e., it produces few false negatives (FN). Recall ignores false positives. A model that labels every sample as positive will have a recall of 1 (it detects all positives).
However, that model’s precision may be low if it also marks negatives as positives. Recall is important in domains like medical testing, where missing a disease can be costly. Here, a patient may be classified as not having a disease (a false negative) when, in reality, they are suffering from a serious condition.
Accuracy is the overall fraction of correct predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
It considers both classes, but it can be misleading if classes are imbalanced. For example, if 95 out of 100 emails are not spam, a model that always predicts "not spam" will be 95% accurate but will never catch any actual spam.
Pro Tip: Check out also our Image Classification Guide for Engineers.
In datasets where one class dominates the data, the model’s accuracy can be misleading. While the accuracy metric measures general correctness, it fails to reflect class imbalance or types of errors.
For example, in a medical test where only 1 in 1,000 patients has a disease, a model that always predicts "healthy" achieves 99.9% accuracy. But its recall is 0% and its F1 score is 0 (since TP = 0), meaning it never finds the disease.
In this case, the F1 score offers a more balanced evaluation that addresses this limitation for imbalanced datasets by summarizing model performance in one metric. It penalizes both false positives and false negatives, offering a clearer view of how well the model identifies the positive class.
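As a quick sketch of this point (assuming scikit-learn is installed and using synthetic labels that mirror the 1-in-1,000 example above), an "always healthy" classifier scores high on accuracy but 0 on F1:
from sklearn.metrics import accuracy_score, f1_score

# Synthetic 1-in-1,000 prevalence: 999 healthy (0), 1 sick (1)
y_true = [0] * 999 + [1]
y_pred = [0] * 1000  # a model that always predicts "healthy"

print(accuracy_score(y_true, y_pred))             # 0.999
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0, since TP = 0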
Different metrics serve different purposes when evaluating model performance. Below is a comparison to help you understand when to use each metric for specific business needs and requirements:
Let’s expand on this comparison to help you choose the right metric for your problem:
Changing the classification threshold affects the trade-off between both precision and recall. Raising the threshold increases the model’s precision (fewer false positives) but lowers recall (more false negatives). Lowering it has the reverse effect.
The Precision-Recall (PR) curve shows this trade-off across different thresholds, and the F1 score helps find the best balance for your use case.
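If your model outputs scores or probabilities, you can scan candidate thresholds and pick the one that maximizes F1. The sketch below assumes scikit-learn and uses placeholder arrays for the true labels and predicted scores:
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder labels and predicted probabilities for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.6])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# F1 at each threshold; the last precision/recall pair has no threshold, so drop it
f1_per_threshold = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_index = np.argmax(f1_per_threshold)

print("Best threshold:", thresholds[best_index])
print("Best F1:", f1_per_threshold[best_index])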
When precision and recall require different emphasis, the F-beta (Fβ) score provides a weighted alternative to the standard F1 score. It is defined as:
Fβ = (1 + β²) × (P × R) / (β² × P + R)
where β controls how much weight recall gets relative to precision: β > 1 favors recall (e.g., F2), β < 1 favors precision (e.g., F0.5), and β = 1 recovers the standard F1 score.
For example, in fraud detection, it is often more important to catch all fraudulent transactions, even if that means raising some false alarms. In such cases, the F2 score is used to put more weight on recall over precision.
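In scikit-learn, this is exposed through fbeta_score. The sketch below uses made-up labels to show how β shifts the score toward recall (F2) or precision (F0.5):
from sklearn.metrics import f1_score, fbeta_score

# Made-up labels: precision ≈ 0.67, recall = 0.50 for the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(f1_score(y_true, y_pred))               # ≈ 0.57 (equal weight)
print(fbeta_score(y_true, y_pred, beta=2))    # ≈ 0.53 (recall weighted more, and recall is low here)
print(fbeta_score(y_true, y_pred, beta=0.5))  # ≈ 0.63 (precision weighted more)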
The F1 Score is commonly used in ML tasks with imbalanced datasets or when both error types matter. For multi-class settings, it is calculated through micro or macro averages to balance performance across classes.
The F1 score has widespread use in the following cases:
Financial institutions use the F1 score to evaluate fraud detection models, where fraudulent cases make up less than 1% of transactions. These systems must accurately detect fraud while avoiding false alarms that block legitimate activity. It is crucial to minimize false positives, where the model flags legitimate transactions as fraudulent.
A high F1 score shows the model captures fraud effectively while balancing both detection accuracy and customer experience.
Business impact: High precision reduces false alarms that block legitimate transactions and frustrate customers. High recall ensures fraudulent transactions are caught before they cause financial losses. F1 score captures both requirements in a single metric aligned with business objectives.
F1 is used to evaluate diagnostic models in healthcare applications. Rare diseases or conditions (like early cancer detection) require high recall (few missed cases) while avoiding unnecessary procedures (reasonable precision).
In these settings, both false positives and false negatives carry significant consequences when predictions are compared against true labels.
Clinical considerations: Missing a disease (false negative) could delay critical treatment, while over-diagnosing (false positives) leads to unnecessary anxiety and medical procedures. F1 score ensures models balance both concerns appropriately.
Email providers and social media platforms use the F1 score to evaluate content filtering systems. The goal is to detect harmful or spam content accurately without wrongly blocking legitimate messages. For instance, non-spam emails (i.e., the negative class) should not be flagged as harmful, ensuring the user experience is minimally disrupted.
Implementation challenges: Spam detection models often prioritize precision to minimize false positives, and thus may use the F0.5 score, which weighs precision higher. In contrast, content moderation platforms may use the balanced F1 or F2 score (which favors recall more), depending on how urgently they need to remove harmful content versus allowing free expression.
F1 is widely used in natural language processing (NLP) tasks, such as named-entity recognition (NER), document retrieval, and question answering (Q&A). It helps balance the relevance of retrieved documents (precision) with the recall of all relevant items.
Technical applications: Evaluations of modern large language models (LLMs) use the F1 score to assess generated content accuracy, measuring how well models retrieve relevant information and avoid hallucinations or irrelevant responses.
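For example, extractive question answering is often scored with a token-overlap F1 between the predicted and reference answers (the convention popularized by SQuAD-style evaluation). The sketch below is a simplified version that skips the usual text normalization:
from collections import Counter

def token_f1(prediction, reference):
    # Simplified token-overlap F1 (no lowercasing or punctuation stripping)
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the Eiffel Tower", "the Eiffel Tower"))  # ≈ 0.86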
F1 can easily be computed using libraries. Here is how to calculate a basic F1 score for a binary classification problem in Python:
from sklearn.metrics import f1_score, confusion_matrix, classification_report
import numpy as np
# Example: Credit card fraud detection results
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0] # True labels
y_pred = [0, 1, 0, 0, 0, 1, 0, 1, 0, 0] # Model predictions
# Calculate F1 score
f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.3f}")
# Detailed analysis
cm = confusion_matrix(y_true, y_pred)
print("\nConfusion Matrix:")
print(cm)
# Full classification report
report = classification_report(y_true, y_pred)
print("\nClassification Report:")
print(report)
Output
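For this toy example, the script should print an F1 score of about 0.667: the model makes 2 true positive, 1 false positive, and 1 false negative predictions, so precision and recall are both 2/3. The printed confusion matrix is [[6 1] [1 2]] (rows are true labels, columns are predictions), and the classification report breaks precision, recall, and F1 down per class.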
For more than two classes, F1 can be averaged in several ways: a macro average computes F1 per class and takes the unweighted mean, a micro average pools TP, FP, and FN across all classes before computing F1, and a weighted average weights each class’s F1 by its support (the number of true instances of that class).
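As a sketch (with made-up labels for a three-class problem), scikit-learn’s average parameter selects among these strategies:
from sklearn.metrics import f1_score

# Made-up labels for a three-class problem
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average='micro'))     # pools TP/FP/FN across classes
print(f1_score(y_true, y_pred, average='weighted'))  # per-class F1 weighted by support
print(f1_score(y_true, y_pred, average=None))        # per-class F1 scores as an array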
Pro Tip: Check out 12 Best Data Annotation Tools for Computer Vision (Free & Paid)
While F1 is widely used, especially in multiclass classification, it has some limitations:
Below are some alternatives to consider:
Effective model evaluation depends heavily on the quality and balance of your training data.
Lightly AI offers tools for data curation that help ML teams identify the most informative and diverse samples for labeling.
Focusing labeling efforts on high-value data allows Lightly to reduce redundancy and address class imbalance more effectively. These are two common challenges that directly affect metrics like precision, recall, and F1 score.
Its self-supervised learning methods (e.g., contrastive learning) enable models to learn useful data representations without needing many labels. This results in faster training and improved model robustness. This approach is especially valuable for applications where positive classes are rare or costly to label, such as fraud detection or medical diagnostics.
Lightly’s data-centric platform integrates smoothly into your ML workflow. It helps improve the quality of the entire dataset, reduce labeling costs, and enhance model performance metrics.
This short demo video shows how curated data can boost F1 and other model metrics.
https://www.youtube.com/watch?v=OoHZcZ5e-54
Pro tip: Have a look at our list of 10 Best Data Curation Tools for Computer Vision [2025] to compare your choices.
The F1 score is a key tool for evaluating classification models, especially when addressing class imbalance or the impact of false positives and false negatives. It offers a more balanced view of model performance than accuracy, which can be misleading in some cases.
For multi-class scenarios, macro and micro F1 scores offer a balanced assessment of model performance. Additionally, weighted F1 accounts for the relative contribution of each class to the overall score.
As machine learning continues to progress, the F1 score will remain critical for artificial intelligence (AI) professionals. It’s important to combine the F1 score with other metrics and prioritize quality data curation to achieve accurate and actionable model evaluations.