Visualising Data
Matplotlib
To create a graph of a DataFrame use the plot()
method. Pandas creates
a line plot by default for each of the Series in a DataFrame
that has numeric values. All the plots created by Pandas are Matplotlib
objects. The matplotlib.pyplot
library provides the show()
method
to display the graph. First import pyplot
:
import pandas as pd
import matplotlib.pyplot as plt
For a DataFrame with a DatetimeIndex and a single numeric column,
you can call plot
directly on the DataFrame, or pass the
axes to pyplot
:
daily_df = ws_df.groupby(pd.Grouper(freq='D')).agg({
'CostInBillingCurrency': 'sum',
})
daily_df.plot()
## Alternate method:
# plt.plot(daily_df.index, daily_df.values)
Using the plot
method on the DataFrame might provide better default
formatting of axes and labels. However, you can customise the parameters
to plt.plot
. For instance, instead of supplying the actual date,
you can provide the day:
plt.plot(daily_df.index.day, daily_df.values)
Use selection criteria on your DataFrame to choose which Series to plot.
plt.plot(df.index, df['java'])
You can use slicing to drop elements from the plot:
plt.plot(df.index[:-1], df['java'][:-1])
You can create a scatter plot by specifying a format string as the third
argument. By also specifying data=
we can refer to the columns by name
rather than as series objects:
plt.plot('EffectivePrice', 'UnitPrice', 'rx', data=df)
Format strings are three-part formats to specify:
- Colour: [b = blue, k = black, r = red, g = green, m = magenta, c = cyan]
- Marker: ['.', ',', 'o', 'v', ...., 'x' ]
- Linestyle: ['-', '--', '-.', ':']
Histograms involve assigning values to 'bins' that can then be plotted by their frequency:
plt.hist(ws_df.UnitPrice, bins=10, color='skyblue', edgecolor='black')
# Adding labels and title
plt.xlabel('Costs')
plt.ylabel('Frequency')
plt.title('Cost Distributions')
plt.savefig('reports/cost_distributions.png')
Note the use of savefig
method to save the plot to a PNG file.
Plot Components
The Figure
object is the top-level component for matplotlib
visualisations. Use the figure
method to create a figure:
fig = plt.figure()
The subplots
method can be used to create a Figure with
Axes objects for subplots:
fig, axes = plt.subplots(nrows=2,ncols=2)
The above code creates a Figure with Axes for 4 plots laid-out in two rows and two columns.
You can also create layouts by specifying a figure size and then
adding axes. Axes are specified
as [left, bottom, width, height]
as proportions of the figure size:
figure = plt.figure( figsize = ( 9, 9 ) )
# Add axes [ 'left', 'bottom', 'width', 'height' ]
bottom_left = figure.add_axes( [ 0, 0, 3/9, 5/9 ] )
top_left = figure.add_axes([ 0, 6/9, 3/9, 3/9 ] )
top_right = figure.add_axes( [ 4/9, 6/9, 5/9, 3/9 ] )
bottom_right = figure.add_axes( [4/9, 0, 5/9, 5/9 ] )
bottom_left.plot( [500, 600, 700], [22, 24, 26])
top_right.plot( [1,2,3,4], [2,4,6,8])
add_axes
can be used to add plots inside other plots:
fig = plt.figure(figsize=(9,9))
outside = fig.add_axes([0.1, 0.1, 0.95, 0.95])
inside = fig.add_axes([0.5, 0.5, 0.25, 0.25])
You can also plot mulitple columns on the same graph, and add some graph formating:
plt.figure(figsize=(16,10))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('No of Posts', fontsize=14)
plt.ylim(0,35000)
plt.plot(df.index, df['java'])
plt.plot(df.index, df['python'])
To plot multiple columns use a for
loop:
plt.figure(figsize=(16,10))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Date', fontsize=14)
plt.ylabel('No of Posts', fontsize=14)
plt.ylim(0,35000)
for column in df.columns:
plt.plot(df.index, df[column], linewidth=3, label=df[column].name)
plt.legend(fontsize=14)
For time series data you can specify a rolling mean to smooth out the data: instead of plotting the value of each data point, you can specify a window to calculate an average value for each data point based on the values either side of the data point:
df['rolling_mem'] = df.mem_used_percent.rolling('3D').mean()
To calculate a rolling average, your time-series data must be monotonic:
that is the index must change in equal steps. Missing dates are not allowed.
Check for monotonicity using either df.index.is_monotonic_increasing
or
df.index.is_monotonic_decreasing
.
Whereas a rolling window calculates an aggregate from the values within the
specified window, an expanding
window, calculates the aggregate for all values up to
the current value, that is a cumulative aggregate.
sub_df['CumulativeCost'] = sub_df['CostInBillingCurrency'].expanding().sum('CostInBillingCurrency')
The same assignment can be made using the cumsum
function:
sub_df['CumulativeCost'] = sub_df['CostInBillingCurrency'].cumsum()
Plotting two columns with varying value-ranges can look untidy and make trend-spotting difficult. Instead you can specify two different y-axes for each plot to make comparison easier:
ax1 = plt.gca() # gets the current axis
ax2 = ax1.twinx() # create another axis that shares the same x-axis
ax1.plot(sets_by_year.index[:-2], sets_by_year.set_num[:-2], color='g')
ax2.plot(themes_by_year.index[:-2], themes_by_year.nr_themes[:-2], color='b')
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Sets', color='g')
ax2.set_ylabel('Number of Themes', color='b')
If your xticks overlap, you can specify a rotation to make them readable:
plt.xticks(fontsize=14, rotation=45)
You can also use locator functions for marking the x- and y- axes:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
df_unemployment = pd.read_csv('UE Benefits Search vs UE Rate 2004-19.csv')
df_unemployment['MONTH'] = pd.to_datetime(df_unemployment['MONTH'])
roll_df = df_unemployment[['UE_BENEFITS_WEB_SEARCH', 'UNRATE']].rolling(window=6).mean()
roll_df['month'] = df_unemployment['MONTH']
plt.figure(figsize=(16,10))
plt.title('Rolling Web Searches vs Unemployment Rate', fontsize=20)
plt.xticks(fontsize=14, rotation=45)
plt.yticks(fontsize=14)
ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.grid(color='gray', linestyle='--')
ax1.set_ylabel('Web Searches', fontsize=16, color='indianred')
ax2.set_ylabel('Unemployment Rate', fontsize=16, color='cadetblue')
ax1.set_xlim(roll_df.month[5:].min(), roll_df.month[5:].max())
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_fmt = mdates.DateFormatter('%Y')
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_fmt)
ax1.xaxis.set_minor_locator(months)
ax1.plot(roll_df.month[5:], roll_df.UE_BENEFITS_WEB_SEARCH[5:], color='indianred', linewidth=3, marker='o')
ax2.plot(roll_df.month[5:], df_unemployment.UNRATE[5:], color='cadetblue', linewidth=3)
Pandas provides a number of graph types including line, area, bar, pie and scatter:
plt.plot
for a line chartplt.scatter
for a scatter plotplt.bar
for a bar chart
Plotly
Bar charts are a good way to visualise 'categorical' data. You can use
value_counts()
to quickly create categorical data:
ratings = df.value_counts('Content_Rating')
Given a dataframe that contains the following data:
Content_Rating
Everyone 6621
Teen 912
Mature 17+ 357
Everyone 10+ 305
Adults only 18+ 3
Unrated 1
Name: count, dtype: int64
we can present this as a pie chart using plotly:
fig = px.pie(
labels=ratings.index,
values=ratings.values,
names=ratings.index,
title="Content Rating"
)
fig.show()
We can change the location of the text on the pie chart using update_traces()
:
fig = px.pie(
labels=ratings.index,
values=ratings.values,
title='Content Rating',
names=ratings.index,
)
fig.update_traces(
textposition='outside',
textinfo='percent+label'
)
fig.show()
To format the pie chart as a 'doughnut chart' add the 'hole' parameter:
fig = px.pie(
labels=ratings.index,
values=ratings.values,
title='Content Rating',
names=ratings.index,
hole=0.4
)
fig.update_traces(
textposition='inside',
textfont_size=15,
textinfo='percent'
)
fig.show()
Which produces: