What is precision vs recall in machine learning?

Thilo Huellmann

Co-Founder & CTO


"To minimize the mistakes your AI will make, you should use the most accurate machine learning model." Sounds straightforward, right? However, making the least mistakes should not always be your goal since different types of mistakes can have varying impacts. ML models will make mistakes and it is therefore crucial to decide which mistakes you can better live with.

To choose the right ML model and make informed decisions based on its predictions, it is important to understand different measures of relevance.

Why you shouldn't blindly use your most accurate ML model

First, let's start by defining accuracy:

The accuracy of an ML model describes the share of data points it classified correctly.

To use a practical example, let's look at an image classification problem in which the AI is tasked to label an image dataset containing images of 500 cats and 500 dogs. The model correctly labels 500 dogs and 499 cats. Mistakenly, it labels one cat as "dog". The corresponding accuracy is therefore 99.9%. Here, accuracy is a good assessment of model quality.

For comparison, let's look at a second, less balanced example: A hospital looks for cancer in 1,000 images. In reality, two of those pictures contain evidence of cancer, but the model only detects one of them. Since the model makes only one mistake by labeling one cancerous image as "healthy", the accuracy of the model is also 99.9%.

In this case, the 99.9% accuracy gives a wrong impression, as the model actually missed 50% of the relevant items. Without doubt, it would be preferable to lower the accuracy to 99% and mistakenly flag 8 healthy images as "cancerous" if, in return, the second cancerous image could be detected – the trade-off of manually checking 10 images to find the two relevant ones is clearly worth it.
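The point of the two examples can be verified in a few lines of Python; a minimal sketch using the numbers from above:

```python
# A minimal sketch of accuracy: the share of correct predictions.
def accuracy(correct: int, total: int) -> float:
    return correct / total

# Cat/dog example: 999 of 1,000 images labeled correctly.
print(accuracy(999, 1000))   # 0.999

# Cancer example: also 999 of 1,000 correct -- same accuracy,
# even though the one mistake here is far more serious.
print(accuracy(999, 1000))   # 0.999
```

Both models score identically, which is exactly why accuracy alone cannot tell them apart.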

But how can you formalize this when choosing your ML model? Let's dive deeper.

Measuring relevance: Dealing with high-priority classes

If your dataset is not well balanced or mistakes have varying impact, your model's accuracy is not a good measure for performance.

Whenever you are looking for specific information, the main task is often to differentiate between the relevant data you are looking for and the irrelevant information that clouds your view. Therefore, it is more important to analyze model performance concerning relevant elements and not the overall dataset.

Let's look at our first example. If the objective is to detect dogs, all dogs are relevant elements whereas cats are irrelevant elements.

In this task, the AI can make two types of mistakes:

  1. It can miss a detection of a dog (false negative) or
  2. It can wrongly identify a cat as a dog (false positive).

For a detailed description of the different mistakes, their possible implications, and how you can systematically control them, head over to our article on how to control AI-enabled workflow automation.

[Figure: Matrix of the choices the dog/cat image classifier made – understanding false positives and false negatives]

Ideally, the AI should detect all dogs without a miss and make no mistake by labeling a cat as a dog. Hence, there are two main dimensions according to which the correctness of machine learning models can be compared.
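The four outcomes (true/false positives and negatives) can be counted directly from labels. A toy sketch with hypothetical data, where "dog" is the relevant class:

```python
# Hypothetical labels for a tiny dataset; "dog" is the relevant class.
actual    = ["dog", "dog", "cat", "cat", "cat"]
predicted = ["dog", "cat", "dog", "cat", "cat"]

pairs = list(zip(actual, predicted))
tp = sum(a == "dog" and p == "dog" for a, p in pairs)  # dogs found
fn = sum(a == "dog" and p == "cat" for a, p in pairs)  # dogs missed
fp = sum(a == "cat" and p == "dog" for a, p in pairs)  # cats flagged as dogs
tn = sum(a == "cat" and p == "cat" for a, p in pairs)  # cats correctly ignored

print(tp, fp, fn, tn)  # 1 1 1 2
```

These four counts are all you need to compute every measure discussed in this article.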


The precision of a model describes how many of the detected items are truly relevant. It is calculated by dividing the true positives by all predicted positives (true positives plus false positives).

In our first example, it compares the number of dogs that were detected to the total number of animals labeled as "dog" – the real dogs plus any cats mistaken for dogs. Since missed dogs are not considered in the calculation, precision can be increased by raising the confidence threshold at which the model labels an image as "dog".

In the cat/dog example, the precision is at 99.8% since out of the 501 animals that were detected as dogs, only one was a cat. If we look at the cancer example, we get a perfect score of 100% since the model detected no healthy image as cancerous.

Besides being a measure of model performance, precision can also be seen as the probability that a randomly selected item which is labeled as "relevant" is a true positive. In the cancer example, the precision percentage can be translated as the probability that an image which the model detected as cancerous actually shows cancer.
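As a sketch, precision can be computed directly from the counts in the two examples:

```python
# Precision = true positives / (true positives + false positives).
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

# Cat/dog example: 500 dogs detected correctly, 1 cat detected as "dog".
print(round(precision(500, 1), 4))  # 0.998

# Cancer example: 1 cancerous image detected, no healthy image flagged.
print(precision(1, 0))  # 1.0
```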


Recall is a measure of how many relevant elements were detected. Therefore it divides true positives by the number of relevant elements.

In our cat/dog example, it compares the number of dogs that were detected to the total number of dogs in the dataset. Since all 500 dogs were found, the recall of the model is a perfect 100%.

In contrast, the cancer-detection model has a terrible recall. Since only one of the two cancerous images was detected, the recall is 50%. While accuracy and precision suggested that the model is suitable for detecting cancer, calculating recall reveals its weakness.

As with precision, analyzing recall alone can also give a wrong impression of model performance. A model labeling every animal in the dataset as "dog" would have a recall of 100%, since it would detect all dogs without a miss. The 500 wrongly labeled cats would have no impact on recall.

For the individual element, the recall percentage gives the probability that a randomly selected relevant item from the dataset will be detected.
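Recall follows the same pattern, dividing by the relevant elements instead of the detections; a sketch with the numbers from both examples:

```python
# Recall = true positives / (true positives + false negatives).
def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

# Cat/dog example: all 500 dogs detected, none missed.
print(recall(500, 0))  # 1.0

# Cancer example: 1 of 2 cancerous images detected.
print(recall(1, 1))    # 0.5
```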

[Figure: How precision and recall are calculated]

Going back to the question of how to select the right model: there is a trade-off between detecting all relevant items and avoiding wrong detections. In the end, the decision depends on your use case.

Put differently, you will need to consider these questions: How crucial is it that you detect every relevant element? Are you willing to manually sort out irrelevant elements in return for optimal recall?

In the cancer diagnosis example, false negatives should be avoided at all costs since they can have lethal consequences. Here, recall is a better measure than precision.

If you were to optimize recommendations on YouTube, false negatives are less important since only a small subset of recommendations is shown anyway. Instead, false positives (bad recommendations) should be avoided. Hence, the model should be optimized for precision.

Combining precision and recall: The F-measure

There is also a way to combine the two: it can sometimes make sense to calculate what's called the F-measure, which balances precision and recall in a single number. However, unless you are currently preparing for a statistics exam, the above might already be a stretch and, to be honest, we struggle with these formulas too. When working with our software, all you really need to worry about are the two measures above.
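For the curious, here is a sketch of the F1 score, the most common F-measure. It is the harmonic mean of precision and recall, so it only gets close to 1 when both measures are high; the numbers below are taken from the cancer example:

```python
# F1 score: harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Cancer example: precision 1.0, recall 0.5.
print(round(f1(1.0, 0.5), 3))  # 0.667
```

Despite the model's perfect precision, the poor recall drags the combined score down, which is exactly what you want from a balanced measure.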

If you are still hungry for more, here's the Wikipedia article for it 🤓

Now that you're here

Levity is a tool that allows you to train AI models on images, documents, and text data. You can rebuild manual workflows and connect everything to your existing systems without writing a single line of code.

If you liked this blog post, you'll probably love Levity.
