Skip to content

Statistics

Z-Score

A z-score is a statistical measure that quantifies how many standard deviations a data point is from the mean. The formula for calculating a z-score is:

z = (x - μ) / σ

Or z equals x - mean / standard deviation.

To add a z-score to a dataframe try:

score_df = df.assign(
    z_score=lambda x: x.Age.sub(x.Age.mean()).div(x.Age.std()),
    z_score_absolute=lambda x: x.z_score.abs(),
)

Rate of Change

In a DataFrame with a Datetime Index, you can calculate the change in a column from one day to the next using diff or pct_change and also use rank to rank the size of the change:

change_df = df.assign(
    actual_change=lambda x: x.docker_volume_usage_percent.diff().fillna(0).abs(),
    percent_change=lambda x: x.docker_volume_usage_percent.pct_change().fillna(0).abs(),
    ranked_change=lambda x: x.actual_change.rank()
)

change_df.sort_values('ranked_change', ascending=False).head()

Binning Data

When dealing with discrete measurements, it can be useful to categorise the data. For instance, if you have a series of disk usage measurements, ranging from 73% to 99% you could categorise the measurements into 3 bins spread equally between the min and max values and label them 'normal', 'high', 'critical':


df['volume_usage_level'] = pd.cut(
    df.docker_volume_usage_percent, 
    bins=3, labels=['normal', 'high', 'critical']
)

Now you can search the DataFrame by volume_usage_level or run df.volume_usage_level.value_counts() to see how often the volume is at 'normal', 'high' or 'critical' levels.

qcut can categorise your data into bins containing equal number of measurements.