Skip to content

Data Aggregation

Merging DataFrames

The pandas concat function works similar to the SQL UNION ALL: by appending one dataset to another. The pandas merge method is equivalent to an SQL JOIN, and support inner, outer and left and right joins.

class_lookup = pd.DataFrame(
    [ ( 1, 'First'), (2, 'Second'), (3, 'Third') ],
    columns = ['id', 'description']
)

inner_join = df.merge(class_lookup, left_on='Pclass', right_on='id')

The left_on and right_on parameters are necessary because the two dataframes use different names for the 'passenger class' field. This join also means that our resulting inner_join DataFrame contains two columns containing the same data: 'Pclass' and 'id'. We can rename the column in one DataFrame to match the other, resulting in only one column for the 'passenger class':

inner_join = df.merge(
    class_lookup.rename(dict(id='Pclass'), axis=1),
    on='Pclass'
)

Left, Right and Outer Joins are made using the how parameter:

left_join = df.merge(
    class_lookup.rename(dict(id='Pclass'), axis=1),
    on='Pclass', how='left'
)

Outer joins also allow an indicator parameter, that adds a column to indicate if the row resulted from a match on both tables, or from no match on either the left or right side of the join:

outer_join = df.merge(
    class_lookup.rename(dict(id='Pclass'), axis=1),
    on='Pclass', how='outer', indicator=True
)

merge operations can also be done using the DataFrame indexes:

inner_join = df.merge(
    class_df, left_index=True, right_index=True
)

If both DataFrames have columns with the same names, the columns from the left DataFrame will be suffixed with _x and the columns from the right DataFrame with _y. You can alter this behaviour with the suffixes parameter, providing a tuple of the required suffixes.

If joining DataFrames using indexes, then you can use join instead of merge. The left table must use the index, but for the right table you can specify a column using the on parameter:

inner_join = df.join(
    class_df, lsuffix='_l', rsuffix='_r'
    on='passenger_class_id'
)

Use set operations to test the effects of joins before they're made. An Inner Join is represented as an intersection:

patient.set_index('patient_id', inplace=True)
pathology.set_index('patient_id', inplace=True)

patient.index.intersection(pathology.index)

The difference shows what will be lost in the join, either where the join type is inner, left or right:

# index values in patient but not in pathology
patient.index.difference(pathology.index)

# index values in pathology but not in patient
pathology.index.difference(patient.index)

Aggregate Functions

Pandas provides access to aggregate methods that can be applied to one or more columns:

print(df[["Date of Death", "Date of Publication"]].max())
print(df[["Date of Death", "Date of Publication"]].min())

The agg() method allows you to apply multiple aggregates to a DataFrame:

df_aggregations = df.agg(
    {
        "Date of Death": ["min", "max", "count", ],
        "Date of Birth": ["count"],
        "Date of Publication": ["max"]
    }
)

Use group_by() to group aggregates:

religions = df[["Nationality", "Religion"]].groupby("Religion").count()

Data can be sorted using the sort_values() or sort_index() methods:

print(df.sort_values(by=["Date of Birth", "Place of Birth"], ascending=True)[["Date of Birth", "Place of Birth"]])

You can select the largest or smallest values for a group of columns using nlargest or nsmallest:

df.nlargest(n=10, columns=['Nationality', 'Religion']

Matplotlib

To create a graph of a DataFrame use the plot() method. Pandas creates a line plot by default for each of the Series in a DataFrame that has numeric values. All the plots created by Pandas are Matplotlib objects. The matplotlib.pyplot library provides the show() method to display the graph:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ukgovunclaimedestates.csv", parse_dates=True)

religions = df["Religion"].groupby(df["Religion"]).count()

religions.plot()

plt.show()

Use selection criteria on your DataFrame to choose which Series to plot. In JupyterLab, you can specify the x and y axes for your plot:

plt.plot(df.index, df['java'])

You can use slicing to drop elements from the plot:

plt.plot(df.index[:-1], df['java'][:-1])

You can also plot mulitple columns on the same graph, and add some graph formating:

plt.figure(figsize=(16,10))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('No of Posts', fontsize=14)
plt.ylim(0,35000)

plt.plot(df.index, df['java'])
plt.plot(df.index, df['python'])

To plot multiple columns use a for loop:

plt.figure(figsize=(16,10))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('No of Posts', fontsize=14)
plt.ylim(0,35000)

for column in df.columns:
    plt.plot(df.index, df[column], linewidth=3, label=df[column].name)

plt.legend(fontsize=14)

For time series data you can specify a rolling mean to smooth out the data: instead of plotting the value of each data point, you can specify a window to calculate an average value for each data point based on the values either side of the data point:

df['rolling_mem'] = df.mem_used_percent.rolling('3D').mean()

To calculate a rolling average, your time-series data must be monotonic: that is the index must change in equal steps. Missing dates are not allowed. Check for monotonicity using either df.index.is_monotonic_increasing or df.index.is_monotonic_decreasing.

Plotting two columns with varying value-ranges can look untidy and make trend-spotting difficult. Instead you can specify two different y-axes for each plot to make comparison easier:

ax1 = plt.gca() # gets the current axis
ax2 = ax1.twinx() # create another axis that shares the same x-axis

ax1.plot(sets_by_year.index[:-2], sets_by_year.set_num[:-2], color='g')
ax2.plot(themes_by_year.index[:-2], themes_by_year.nr_themes[:-2], color='b')

ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Sets', color='g')
ax2.set_ylabel('Number of Themes', color='b')

If your xticks overlap, you can specify a rotation to make them readable:

plt.xticks(fontsize=14, rotation=45)

You can also use locator functions for marking the x- and y- axes:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

df_unemployment = pd.read_csv('UE Benefits Search vs UE Rate 2004-19.csv')
df_unemployment['MONTH'] = pd.to_datetime(df_unemployment['MONTH'])

roll_df = df_unemployment[['UE_BENEFITS_WEB_SEARCH', 'UNRATE']].rolling(window=6).mean()
roll_df['month'] = df_unemployment['MONTH']


plt.figure(figsize=(16,10))
plt.title('Rolling Web Searches vs Unemployment Rate', fontsize=20)
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)

ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.grid(color='gray', linestyle='--')

ax1.set_ylabel('Web Searches', fontsize=16, color='indianred')
ax2.set_ylabel('Unemployment Rate', fontsize=16, color='cadetblue')

ax1.set_xlim(roll_df.month[5:].min(), roll_df.month[5:].max())

years = mdates.YearLocator()
months = mdates.MonthLocator()
years_fmt = mdates.DateFormatter('%Y')
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_fmt)
ax1.xaxis.set_minor_locator(months)

ax1.plot(roll_df.month[5:], roll_df.UE_BENEFITS_WEB_SEARCH[5:], color='indianred', linewidth=3, marker='o')
ax2.plot(roll_df.month[5:], df_unemployment.UNRATE[5:], color='cadetblue', linewidth=3)

Pandas provides a number of graph types including line, area, bar, pie and scatter:

  • plt.plot for a line chart
  • plt.scatter for a scatter plot
  • plt.bar for a bar chart

Plotly

Bar charts are a good way to visualise 'categorical' data. You can use value_counts() to quickly create categorical data:

ratings = df.value_counts('Content_Rating')

Given a dataframe that contains the following data:

Content_Rating
Everyone           6621
Teen                912
Mature 17+          357
Everyone 10+        305
Adults only 18+       3
Unrated               1
Name: count, dtype: int64

we can present this as a pie chart using plotly:

fig = px.pie(
    labels=ratings.index,
    values=ratings.values,
    names=ratings.index,
    title="Content Rating"
)
fig.show()

We can change the location of the text on the pie chart using update_traces():

fig = px.pie(
    labels=ratings.index,
    values=ratings.values,
    title='Content Rating',
    names=ratings.index,
)
fig.update_traces(
    textposition='outside', 
    textinfo='percent+label'
)
fig.show()

To format the pie chart as a 'doughnut chart' add the 'hole' parameter:

fig = px.pie(
    labels=ratings.index,
    values=ratings.values,
    title='Content Rating',
    names=ratings.index,
    hole=0.4
)
fig.update_traces(
    textposition='inside',
    textfont_size=15,
    textinfo='percent'
)
fig.show()

Which produces:

doughnut