Data Aggregation

Merging DataFrames

The pandas concat function works similarly to SQL UNION ALL, appending one dataset to another. The pandas merge method is equivalent to a SQL JOIN, and supports inner, outer, left and right joins.
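As a minimal sketch of the UNION ALL behaviour (using two small hypothetical DataFrames, not the dataset used below):

```python
import pandas as pd

# Two small DataFrames with the same columns
a = pd.DataFrame({'id': [1, 2], 'value': ['x', 'y']})
b = pd.DataFrame({'id': [3], 'value': ['z']})

# Append b to a, like SQL UNION ALL; ignore_index renumbers the rows
combined = pd.concat([a, b], ignore_index=True)
```
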

class_lookup = pd.DataFrame(
    [ ( 1, 'First'), (2, 'Second'), (3, 'Third') ],
    columns = ['id', 'description']
)

inner_join = df.merge(class_lookup, left_on='Pclass', right_on='id')

The left_on and right_on parameters are necessary because the two DataFrames use different names for the 'passenger class' field. This join also means that our resulting inner_join DataFrame contains two columns with the same data: 'Pclass' and 'id'. We can rename the column in one DataFrame to match the other, resulting in only one column for the 'passenger class':

inner_join = df.merge(
    class_lookup.rename(dict(id='Pclass'), axis=1),
    on='Pclass'
)

Left, Right and Outer Joins are made using the how parameter:

left_join = df.merge(
    class_lookup.rename(dict(id='Pclass'), axis=1),
    on='Pclass', how='left'
)

Outer joins also benefit from the indicator parameter, which adds a column showing whether each row resulted from a match in both tables, or was present only on the left or right side of the join:

outer_join = df.merge(
    class_lookup.rename(dict(id='Pclass'), axis=1),
    on='Pclass', how='outer', indicator=True
)

merge operations can also be done using the DataFrame indexes:

inner_join = df.merge(
    class_df, left_index=True, right_index=True
)

If both DataFrames have columns with the same names, the columns from the left DataFrame will be suffixed with _x and the columns from the right DataFrame with _y. You can alter this behaviour with the suffixes parameter, providing a tuple of the required suffixes.
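For example (a sketch with two hypothetical DataFrames that share a 'fare' column):

```python
import pandas as pd

# Two hypothetical DataFrames that both have a 'fare' column
booked = pd.DataFrame({'Pclass': [1, 2], 'fare': [80.0, 20.0]})
paid = pd.DataFrame({'Pclass': [1, 2], 'fare': [75.0, 22.0]})

# Choose explicit suffixes instead of the default _x / _y
merged = booked.merge(paid, on='Pclass', suffixes=('_booked', '_paid'))
# merged.columns: Pclass, fare_booked, fare_paid
```
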

If joining DataFrames using indexes, then you can use join instead of merge. The right table must use the index, but for the left table you can specify a column using the on parameter:

inner_join = df.join(
    class_df, how='inner',
    lsuffix='_l', rsuffix='_r',
    on='passenger_class_id'
)

Note that join defaults to a left join, so the how parameter is needed here to perform an inner join.

Use set operations to test the effects of joins before they're made. An Inner Join is represented as an intersection:

patient.set_index('patient_id', inplace=True)
pathology.set_index('patient_id', inplace=True)

patient.index.intersection(pathology.index)

The difference shows what will be lost in an inner, left or right join:

# index values in patient but not in pathology
patient.index.difference(pathology.index)

# index values in pathology but not in patient
pathology.index.difference(patient.index)

Aggregate Functions

Pandas provides access to aggregate methods that can be applied to one or more columns:

df[['CostInBillingCurrency', 'EffectivePrice', 'Quantity']].sum()

Additional aggregate functions include mean, count, min, max and cumsum.
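For example, with a tiny hypothetical DataFrame:

```python
import pandas as pd

# A small illustrative DataFrame; the column name is hypothetical
sales = pd.DataFrame({'Quantity': [2, 5, 3]})

sales['Quantity'].count()   # number of non-null values: 3
sales['Quantity'].mean()    # average quantity
sales['Quantity'].cumsum()  # running total: 2, 7, 10
```
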

The agg method allows you to apply multiple aggregates to multiple columns in a DataFrame:

df.agg({
    'CostInBillingCurrency': ['sum', 'mean'],
    'EffectivePrice': ['min', 'max', 'mean'],
    'Quantity': ['sum', 'min', 'max'],
})

The groupby method allows you to run aggregates after grouping rows of data:

df.groupby('ResourceGroup').sum()

However, before applying the aggregate function, it is sensible to select only the columns you wish to aggregate; otherwise the function is applied to every column in the DataFrame:

df[['ResourceGroup', 'CostInBillingCurrency']].groupby('ResourceGroup').sum()

Alternatively, the agg method can be combined with groupby to select the columns and the specific aggregation needed:

df[
    ['MeterSubCategory', 'EffectivePrice', 'Quantity', 'CostInBillingCurrency']
].groupby(
    ['MeterSubCategory', 'EffectivePrice']
).agg(['sum', 'max'])

df.groupby(
    ['MeterSubCategory', 'EffectivePrice']
).agg({
    'Quantity': 'sum',
    'CostInBillingCurrency': 'sum',
})

Note that agg accepts a list if you want to apply the same aggregates to every column, or a dictionary if you want to specify which aggregations to apply to each column.

Using multiple aggregates results in a MultiIndex in the columns:

df_agg = df[
    ['MeterSubCategory', 'EffectivePrice', 'Quantity', 'CostInBillingCurrency']
].groupby(
    ['MeterSubCategory', 'EffectivePrice']
).agg(['sum', 'max'])

df_agg.columns

MultiIndex([(             'Quantity', 'sum'),
            (             'Quantity', 'max'),
            ('CostInBillingCurrency', 'sum'),
            ('CostInBillingCurrency', 'max')],
           )

However, you can remove this hierarchical index by joining each column tuple into a single string:

df_agg.columns = ['_'.join(tuple_col) for tuple_col in df_agg.columns]

You can include a DatetimeIndex or PeriodIndex in the groupby, and sort the output by values or index:

df.groupby([df.ResourceGroup, df.index.month]).agg({
    'CostInBillingCurrency': 'sum'
}).sort_values('CostInBillingCurrency')

df.groupby([df.ResourceGroup, df.index.month]).agg({
    'CostInBillingCurrency': 'sum'
}).sort_index()

You can select the largest or smallest values for a group of columns using nlargest or nsmallest:

df.groupby([df.ResourceGroup, df.index.month]).agg({
    'CostInBillingCurrency': 'sum'
}).nlargest(n=10, columns=['CostInBillingCurrency'])

The groupby method accepts Grouper objects. For a DataFrame with a DatetimeIndex, you can set a quarterly grouper:

df['CostInBillingCurrency'].groupby(pd.Grouper(freq='QE')).sum()

By selecting the frequency in the Grouper, you can aggregate over time periods in a DatetimeIndex:

df.loc['2024-12-01': '2024-12-15', 'CostInBillingCurrency'].groupby(pd.Grouper(freq='D')).sum()

The resample method allows you to aggregate data to a different granularity:

weekly_df = df.resample('1W').agg({
    'CostInBillingCurrency': 'sum',
})