Coefficient of Variation in Python for Data Scientists

May 24, 2026 | Mathematics and Statistics

Most data scientists reach for the standard deviation without thinking twice. But when you’re comparing two datasets measured in different units, or trying to decide which features carry the most signal for your model, the standard deviation quietly fails you.

That’s where the Coefficient of Variation (CV) steps in.

It’s one of those statistics that feels almost too simple, yet it solves a problem that trips up even experienced analysts. In this guide, you’ll learn exactly what CV is, when to use it (and when not to), how to calculate it in Python, and, crucially, how it applies to real machine learning workflows.

What is the coefficient of variation (CV)?

The Coefficient of Variation is a relative measure of variability. Instead of expressing how spread out your data is in absolute terms, it expresses the spread as a proportion of the mean.

In plain terms: the CV tells you how much variability exists relative to the average.

This makes it incredibly powerful for comparisons. A dataset with a standard deviation of 50 sounds very spread out, unless the mean is 10,000, in which case that variability is tiny. The CV captures that relationship in a single number.

You may also see it called Relative Standard Deviation (RSD).

The CV formula

The formula is straightforward:

CV = (Standard Deviation / Mean) × 100

Or in notation:

CV = (σ / μ) × 100
  • σ = standard deviation of the dataset
  • μ = mean of the dataset
  • The result is expressed as a percentage

Quick example

Imagine you’re comparing the height of two plant species after an experiment:

| Species | Mean Height (cm) | Std Dev (cm) | CV |
|---|---|---|---|
| Species A | 50 | 5 | 10% |
| Species B | 200 | 15 | 7.5% |

Species A has a lower standard deviation but is actually more variable relative to its mean. The CV reveals this immediately. Standard deviation alone would have been misleading.
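The CV column of the plant table can be checked directly from the summary statistics. The snippet below is a minimal sketch; `cv_percent` is a helper name introduced here for illustration, not part of any library.

```python
def cv_percent(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100

# Summary statistics from the plant table: (mean, std dev), both in cm
species = {"Species A": (50, 5), "Species B": (200, 15)}

cv_by_species = {name: cv_percent(sd, mean) for name, (mean, sd) in species.items()}

for name, cv in cv_by_species.items():
    print(f"{name}: CV = {cv:.1f}%")
```

Species A comes out at 10.0% and Species B at 7.5%, matching the table.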

How to calculate CV in Python

Using NumPy (from scratch)

python

import numpy as np

data = [23, 27, 31, 25, 29, 33, 22, 28]

mean = np.mean(data)
std = np.std(data, ddof=1)  # ddof=1 for sample standard deviation
cv = (std / mean) * 100

print(f"Mean: {mean:.2f}")
print(f"Standard Deviation: {std:.2f}")
print(f"Coefficient of Variation: {cv:.2f}%")

Output:

Mean: 27.25
Standard Deviation: 3.81
Coefficient of Variation: 13.97%

Note: Use ddof=1 when working with a sample (which is almost always the case in data science). Use ddof=0 only when you have the full population.
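To see how much the choice of `ddof` matters, the sketch below computes both versions on the same eight values. On small samples the gap is noticeable; it shrinks as the sample grows.

```python
import numpy as np

data = np.array([23, 27, 31, 25, 29, 33, 22, 28])
mean = data.mean()

# Population formula divides by n; sample formula divides by n - 1
cv_population = np.std(data, ddof=0) / mean * 100
cv_sample = np.std(data, ddof=1) / mean * 100

print(f"CV with ddof=0 (population): {cv_population:.2f}%")
print(f"CV with ddof=1 (sample):     {cv_sample:.2f}%")

# The two always differ by the factor sqrt(n / (n - 1))
n = data.size
print(f"Ratio: {cv_sample / cv_population:.4f}")
print(f"sqrt(n / (n - 1)): {np.sqrt(n / (n - 1)):.4f}")
```

Because the two results differ only by the factor sqrt(n / (n - 1)), the distinction is negligible for thousands of rows but worth getting right for a handful of cross-validation folds or lab replicates.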

Using Pandas on a real dataset

python

import pandas as pd
import numpy as np

# Simulated dataset: three features measured on ten observations
data = {
    'feature_age': [23, 45, 31, 52, 28, 41, 37, 60, 29, 44],
    'feature_income': [30000, 85000, 42000, 120000, 37000, 78000, 54000, 95000, 31000, 67000],
    'feature_score': [71, 68, 74, 70, 72, 69, 73, 67, 75, 70]
}

df = pd.DataFrame(data)

# Calculate CV for each feature
cv = (df.std() / df.mean()) * 100
print("Coefficient of Variation per feature:")
print(cv.round(2))

Output:

Coefficient of Variation per feature:
feature_age       29.85
feature_income    47.53
feature_score      3.67
dtype: float64

feature_score has very low variability relative to its mean, which suggests it may carry little discriminative information, while feature_income is highly variable. This already gives us a hint for feature selection (more on that below).

Using SciPy (one-liner)

python

from scipy.stats import variation

data = [23, 45, 31, 52, 28, 41, 37, 60, 29, 44]
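# variation() returns the CV as a ratio and defaults to ddof=0 (population std);
# pass ddof=1 to match the sample-based NumPy and pandas results above
cv = variation(data, ddof=1) * 100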
print(f"CV: {cv:.2f}%")


CV vs Standard Deviation: When to use which

Figure: Coefficient of Variation vs. Standard Deviation. Both datasets have σ = 10. Dataset A (mean = 20) has a CV of 50%, which is highly variable; Dataset B (mean = 200) has a CV of just 5%. Same standard deviation, completely different story.

This is the question most tutorials skip. Here’s the honest answer:

| Situation | Use Standard Deviation | Use CV |
|---|---|---|
| Comparing spread within one dataset | ✅ | |
| Comparing two datasets with the same unit | ✅ | |
| Comparing two datasets with different units | | ✅ |
| Comparing datasets with very different means | | ✅ |
| The mean is close to zero | ✅ | ❌ (CV becomes unstable) |
| Data contains negative values | ✅ | ❌ (CV loses meaning) |

The key rule: whenever your question is “which dataset is relatively more spread out?”, reach for the CV. If your question is “how far from the mean is a typical data point, in real units?”, use the standard deviation.
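The "same standard deviation, different CV" situation is easy to reproduce. The sketch below uses constructed values (not real data) and the population formula (`ddof=0`) so that σ comes out at exactly 10 for both datasets.

```python
import numpy as np

# Symmetric offsets whose population standard deviation is exactly 10
offsets = np.array([-10, 10, -10, 10])

dataset_a = 20 + offsets    # mean 20
dataset_b = 200 + offsets   # mean 200

def summarize(values):
    """Return (mean, population std dev, CV in percent)."""
    mean = values.mean()
    std = values.std(ddof=0)
    return mean, std, std / mean * 100

for label, values in [("A", dataset_a), ("B", dataset_b)]:
    mean, std, cv = summarize(values)
    print(f"Dataset {label}: mean={mean:.0f}, std={std:.0f}, CV={cv:.0f}%")
```

Both datasets report a standard deviation of 10, yet the CV is 50% for Dataset A and 5% for Dataset B.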

Interpreting CV values: What is a “good” CV?

There’s no universal answer: it depends heavily on your domain. Here’s a practical reference:

| CV Value | Interpretation | Typical Context |
|---|---|---|
| < 10% | Low variability, very consistent data | Lab measurements, manufacturing QC |
| 10% – 30% | Moderate variability | Biological data, social science surveys |
| 30% – 60% | High variability | Financial returns, sales data |
| > 60% | Very high variability | Could signal outliers, heterogeneous data |

A few domain-specific rules of thumb

In manufacturing / quality control: a CV under 10% is typically acceptable. It means your process produces consistent output.

In finance: a higher CV is expected, because market returns are inherently volatile. Here, CV is used to compare the risk per unit of return across different assets.

In machine learning feature analysis: features with very low CV (close to 0%) are nearly constant and carry little predictive value. Features with very high CV may be dominated by outliers.
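If you want to attach the rough bands from the reference table to a computed CV, a small lookup helper is enough. This is only a sketch: `describe_cv` is a hypothetical name, the cutoffs are the rules of thumb listed in this section rather than a standard, and the handling of the exact boundaries (10, 30, 60) is an arbitrary choice made here.

```python
def describe_cv(cv_percent):
    """Map a CV (in percent) to the rough bands from the reference table."""
    if cv_percent < 0:
        raise ValueError("CV bands assume a positive mean, so CV must be >= 0")
    if cv_percent < 10:
        return "low variability"
    if cv_percent < 30:
        return "moderate variability"
    if cv_percent <= 60:
        return "high variability"
    return "very high variability"

for value in (3.7, 18.0, 47.5, 120.0):
    print(f"CV = {value:5.1f}% -> {describe_cv(value)}")
```

Treat the label as a starting point for a conversation with a domain expert, not as a verdict.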

Coefficient of variation for feature selection in ML

This is where the CV goes beyond textbook statistics and becomes a genuine ML tool.

The core idea

When building a model, you want features that vary enough to be informative. A feature that takes almost the same value for every observation teaches your model nothing. The CV is a fast, unit-free filter to identify these dead-weight features.

This approach is called variance-based filter feature selection, and the CV improves on raw variance by normalizing across scale.
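The reason CV improves on raw variance is that variance depends on the unit of measurement while CV does not. The sketch below illustrates this with hypothetical heights recorded once in centimetres and once in metres; the numbers are invented for the demonstration.

```python
import numpy as np

# Hypothetical heights of the same five people, in centimetres and in metres
heights_cm = np.array([158.0, 165.0, 171.0, 176.0, 183.0])
heights_m = heights_cm / 100

def cv_percent(values):
    """Sample coefficient of variation, in percent."""
    return np.std(values, ddof=1) / np.mean(values) * 100

# Raw variance changes by a factor of 100**2 when the unit changes
print(f"Variance (cm^2): {np.var(heights_cm, ddof=1):.2f}")
print(f"Variance (m^2):  {np.var(heights_m, ddof=1):.6f}")

# The CV is identical in both units
print(f"CV (cm): {cv_percent(heights_cm):.2f}%")
print(f"CV (m):  {cv_percent(heights_m):.2f}%")
```

A fixed variance cutoff would treat the two columns very differently even though they hold the same information, whereas a CV cutoff treats them identically.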

Practical example: Removing low-CV features

python

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Calculate CV for all features
cv_scores = (df.std() / df.mean()) * 100
cv_df = pd.DataFrame({'Feature': df.columns, 'CV (%)': cv_scores.values})
cv_df = cv_df.sort_values('CV (%)', ascending=True)

print(cv_df.to_string(index=False))

Output (illustrative excerpt; run the code to see the exact values):

                Feature  CV (%)
 mean fractal dimension    6.32
        mean smoothness    7.14
          mean symmetry    8.56
       mean compactness   48.21
             worst area   86.34
...

Filtering out low-CV features

python

# Set a CV threshold (e.g., remove features with CV < 10%)
threshold = 10
selected_features = cv_df[cv_df['CV (%)'] >= threshold]['Feature'].tolist()

df_filtered = df[selected_features]
print(f"Features kept: {len(selected_features)} out of {len(df.columns)}")

This is a simple, interpretable pre-processing step that runs in milliseconds and requires no model training. It works best as a first filter before applying more sophisticated methods like mutual information or LASSO.

Pro tip: combine CV filtering with scikit-learn's VarianceThreshold for a robust, two-step approach that removes both near-zero variance and near-zero relative variance features.
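One way the two-step idea could look in practice is sketched below. The toy DataFrame and its column names are hypothetical, and the 10% cutoff is only an illustrative choice. `VarianceThreshold` is scikit-learn's own transformer; with its default `threshold=0.0` it drops columns whose variance is exactly zero.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: one constant column, one near-constant, two informative
df = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "near_constant": [100.0, 101.0, 100.0, 99.0, 100.0, 101.0],
    "age": [23.0, 45.0, 31.0, 52.0, 28.0, 41.0],
    "income": [30000.0, 85000.0, 42000.0, 120000.0, 37000.0, 78000.0],
})

# Step 1: drop zero-variance columns (threshold=0.0 is scikit-learn's default)
selector = VarianceThreshold(threshold=0.0)
selector.fit(df)
step1_cols = df.columns[selector.get_support()]

# Step 2: among the survivors, drop columns whose CV is below the cutoff
cv_threshold = 10  # percent; an illustrative cutoff, not a universal rule
cv = df[step1_cols].std() / df[step1_cols].mean() * 100
kept = cv[cv >= cv_threshold].index.tolist()

print("After VarianceThreshold:", list(step1_cols))
print("After CV filter:", kept)
```

The first step removes the constant column; the second removes `near_constant`, whose absolute variance is non-zero but whose spread is tiny relative to its mean of about 100.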

Real-world data science examples

1. Comparing sensor readings across different scales

A smart factory collects temperature (in °C, typical values: 400–600) and vibration frequency (in Hz, typical values: 0.5–3.0). You need to know which sensor is more erratic.

Standard deviation would make temperature look far more variable just because of scale. CV normalizes this and gives you a fair comparison, with one caveat: Celsius is an interval scale, so convert the temperature readings to kelvin before computing their CV (see the limitations section below).
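The sensor comparison can be sketched as follows. The readings are invented for illustration, and the temperature is shifted to kelvin first, on the assumption that a ratio scale is wanted for the CV.

```python
import numpy as np

# Hypothetical readings; the values are invented for illustration only
temperature_c = np.array([498.0, 503.0, 510.0, 495.0, 507.0, 501.0])  # degrees Celsius
vibration_hz = np.array([1.2, 2.4, 0.8, 2.9, 1.6, 2.1])               # hertz

# Celsius is an interval scale, so shift to kelvin (a ratio scale) first
temperature_k = temperature_c + 273.15

def cv_percent(values):
    """Sample coefficient of variation, in percent."""
    return np.std(values, ddof=1) / np.mean(values) * 100

temp_std = np.std(temperature_k, ddof=1)
vib_std = np.std(vibration_hz, ddof=1)

print(f"Temperature: std = {temp_std:.2f} K,  CV = {cv_percent(temperature_k):.2f}%")
print(f"Vibration:   std = {vib_std:.2f} Hz, CV = {cv_percent(vibration_hz):.2f}%")
```

With these invented numbers, temperature has the larger standard deviation, but vibration has by far the larger CV, so the vibration sensor is the more erratic of the two in relative terms.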


2. Evaluating model consistency across cross-validation folds

python

import numpy as np

# Accuracy scores across 5 cross-validation folds (not to be confused with the coefficient of variation)
model_a_scores = [0.91, 0.89, 0.93, 0.90, 0.92]
model_b_scores = [0.95, 0.81, 0.97, 0.78, 0.93]

def cv_score(scores):
    return (np.std(scores, ddof=1) / np.mean(scores)) * 100

print(f"Model A CV: {cv_score(model_a_scores):.2f}%")
print(f"Model B CV: {cv_score(model_b_scores):.2f}%")

Output:

Model A CV: 1.74%
Model B CV: 9.77%

Model B reaches the highest accuracy on individual folds, but its average is lower (0.888 vs. 0.910) and it is far more inconsistent.

Model A is the more reliable choice for production. The CV makes this visible at a glance.

3. Risk assessment in finance

You’re comparing two investment portfolios. Portfolio A yields a 12% average return with an 8% std dev. Portfolio B yields a 20% average return with an 18% std dev.

  • Portfolio A CV: (8/12) × 100 = 66.7%
  • Portfolio B CV: (18/20) × 100 = 90%

Portfolio B offers higher returns but takes on proportionally more risk per unit of return.
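The same arithmetic in Python, as a minimal sketch using only the figures from this example. The helper name `cv_percent` is introduced here for illustration.

```python
# Figures from the portfolio example: (average return %, standard deviation %)
portfolios = {"Portfolio A": (12.0, 8.0), "Portfolio B": (20.0, 18.0)}

def cv_percent(mean_return, std_dev):
    """Risk per unit of return, expressed as a percentage."""
    if mean_return <= 0:
        raise ValueError("CV is not meaningful for a zero or negative mean return")
    return std_dev / mean_return * 100

cv_by_portfolio = {name: cv_percent(mean, sd) for name, (mean, sd) in portfolios.items()}

for name, cv in cv_by_portfolio.items():
    print(f"{name}: CV = {cv:.1f}%")

# The lower CV carries less risk per unit of expected return
print("Lower relative risk:", min(cv_by_portfolio, key=cv_by_portfolio.get))
```

The guard on `mean_return` matters in finance, where average returns can sit near zero or go negative and the CV stops being interpretable.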


CV limitations: When not to use it

The CV is powerful but it has real limitations you need to know:

1. The mean must be non-zero. If the mean is zero, the CV is undefined; if it is close to zero, the CV blows up and becomes meaningless. For example, CV is useless for data centered around zero (temperature in Celsius, profit/loss ratios).

2. CV requires a ratio scale. The data must have a true zero point. You can use CV on height, weight, income, or reaction time. You cannot use it on Celsius temperatures or IQ scores, because both are interval scales without a true zero.

3. Negative values break the interpretation. If your dataset contains negative numbers and the mean is negative, the CV formula still produces a number, but it loses its intuitive meaning entirely.

4. CV is not robust to outliers. Since it’s based on the mean and standard deviation, a single extreme value can distort both and inflate the CV dramatically. In that case, consider using the IQR-based equivalent: Quartile Coefficient of Dispersion (QCD).

python

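import numpy as np

# QCD: a robust alternative to CV for skewed/outlier-heavy data
def quartile_cv(data):
    """Return the quartile coefficient of dispersion as a ratio (not a percentage)."""
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    return (q3 - q1) / (q3 + q1)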
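To make the outlier sensitivity concrete, the sketch below compares CV and QCD on a small list before and after adding one extreme value. The data are hypothetical, and `cv_ratio` is a helper name introduced here. The two statistics are not on the same scale, so compare how much each one changes rather than their absolute values.

```python
import numpy as np

def cv_ratio(data):
    """Sample coefficient of variation, as a ratio (not a percentage)."""
    return np.std(data, ddof=1) / np.mean(data)

def quartile_cv(data):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, q3 = np.percentile(data, [25, 75])
    return (q3 - q1) / (q3 + q1)

# Hypothetical values; the second list adds a single extreme reading
clean = [23, 27, 31, 25, 29, 33, 22, 28]
with_outlier = clean + [250]

for label, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{label:>12}: CV = {cv_ratio(values):.3f}, QCD = {quartile_cv(values):.3f}")
```

A single extreme reading multiplies the CV roughly tenfold here, while the QCD barely moves because the quartiles ignore the tails.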

FAQ

Q: Can the coefficient of variation be greater than 100%? Yes.

A CV above 100% means the standard deviation is larger than the mean, indicating extremely high variability. This is common in right-skewed distributions like income or transaction amounts.

Q: Can CV be negative? Technically, if the mean is negative, the formula produces a negative value.

But that value has no conventional interpretation: CV is only meaningful when the mean is positive.

Q: What’s the difference between CV and relative standard deviation (RSD)? They are the same thing.

RSD is the term preferred in chemistry and laboratory sciences; CV is more common in statistics and data science.

Q: What is a good CV for machine learning features? There’s no universal cutoff.

A common practice is to drop features with CV below 5–10% as they carry little discriminative signal. But always validate this with domain knowledge: a medical biomarker with 4% CV might still be clinically significant.

Q: How do I calculate CV in pandas for grouped data?

python

df.groupby('category')['value'].agg(lambda x: x.std() / x.mean() * 100)
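For a runnable version of the grouped calculation, here is a minimal sketch on a hypothetical DataFrame. The column names `category` and `value` mirror the one-liner, and the numbers are invented.

```python
import pandas as pd

# Hypothetical data; 'category' and 'value' match the column names in the one-liner
df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "value": [10.0, 12.0, 14.0, 100.0, 150.0, 200.0],
})

def cv_percent(series):
    """Sample CV in percent (pandas .std() uses ddof=1 by default)."""
    return series.std() / series.mean() * 100

cv_by_group = df.groupby("category")["value"].agg(cv_percent)
print(cv_by_group.round(2))
```

Group A has mean 12 and sample std 2, giving about 16.67%; group B has mean 150 and sample std 50, giving about 33.33%. Using a named function instead of a lambda also makes the aggregation easier to reuse and test.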

Q: Is CV the same as standard deviation divided by mean? Yes, that’s the exact formula. Multiply by 100 to express it as a percentage.

Conclusion

The Coefficient of Variation is one of the most underused tools in a data scientist’s statistical toolkit. It solves a problem that standard deviation cannot: making variability comparable across different scales and units.

Whether you’re comparing sensor data, evaluating model consistency, screening features before training, or assessing financial risk, CV gives you a normalized, interpretable measure in a single number.

The Python implementations above are ready to plug directly into your EDA or preprocessing pipeline.

