DataFrame Basics

Data in Pandas is managed via Series and DataFrame objects.

Series

A pandas.Series object is a data structure for an array of data of a single type, and can be thought of as a column in a table. Create a Series by providing an array of values:

import numpy as np
import pandas as pd

np.random.seed(0) # set a seed so that the random numbers are always the same
pd.Series(np.random.rand(5), name='random')

In addition to the data we provide, pandas adds an Index object representing the row labels. The Index is used to align element-wise operations on Series objects. For example, adding two Series adds the values of the elements with matching index labels. Where an index label appears in only one of the Series, the result contains a null value (NaN or None) for that label.
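
For example, a minimal sketch with two illustrative Series (the values here are made up):

s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20, 30], index=[1, 2, 3])

# labels 1 and 2 match, so their values are added;
# labels 0 and 3 appear in only one Series, so they become NaN
s1 + s2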

DataFrames

A pandas.DataFrame object is a 2-dimensional data structure used to store data in columns.

To create a DataFrame manually, use a dictionary of lists:

df = pd.DataFrame(
  {
    'Forename': ['Jeff', 'James', 'Eric', 'Clark'],
    'Surname': ['Jefferson', 'Jameson', 'Ericsson', 'Clarkson'],
    'Date of Death': ['15/12/2021', '12/05/2002', '25/10/2023', '28/07/2012'],
    'Place of Death': [np.random.choice(['Norfolk', 'Shropshire', 'Devon', 'Bedfordshire']) for _ in range(4)],
    'Deceased': [np.random.choice(['Yes', 'No']) for _ in range(4)],
    'Age': np.random.rand(4) * 100
  }
)

Or a list of dictionaries:

df2 = pd.DataFrame([
    {'forename': 'Jeff', 'surname': 'Jefferson', 'date of death': '15/12/2021'},
    {'forename': 'James', 'surname': 'Jameson', 'date of death': '01/10/2021'},
    {'forename': 'Eric', 'surname': 'Ericsson', 'date of death': '11/05/2021'},
    {'forename': 'Clark', 'surname': 'Clarkson', 'date of death': '25/02/2021'},
])

Or a list of tuples:

df3 = pd.DataFrame([
    ('Jeff', 'Jefferson', '15/12/2021'),
    ('James', 'Jameson', '01/10/2021'),
    ('Eric', 'Ericsson', '11/05/2021'),
    ('Clark', 'Clarkson', '25/02/2021')],
    columns=['forename', 'surname', 'date of death']
)

Or a list of lists:

df4 = pd.DataFrame([
        ['Jeff', 'Jefferson', '15/12/2021'],
        ['James', 'Jameson', '01/10/2021'],
        ['Eric', 'Ericsson', '11/05/2021'],
        ['Clark', 'Clarkson', '25/02/2021']
    ],
    columns=['forename', 'surname', 'date of death']
)

Or a numpy array:

df5 = pd.DataFrame(
    np.array([
        ['Jeff', 'Jefferson', '15/12/2021'],
        ['James', 'Jameson', '01/10/2021'],
        ['Eric', 'Ericsson', '11/05/2021'],
        ['Clark', 'Clarkson', '25/02/2021']
    ]),
    columns=['forename', 'surname', 'date of death']
)

External Data

DataFrames can be created from CSV files using the pandas read_csv function:

df = pd.read_csv("data/unclaimedestates.csv", parse_dates=True)

The read_csv function provides additional arguments that allow you to specify separators, delimiters, etc. Use help(pd.read_csv) or pd.read_csv? for the full function definition.
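
For example, a semicolon-separated file with comma decimal marks (the filename here is hypothetical) might be read with:

# sep sets the field separator; decimal sets the decimal mark
df_eu = pd.read_csv("data/estates_eu.csv", sep=";", decimal=",")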

There are functions for reading data in various formats, all beginning with a read_ prefix: e.g. read_json, read_excel, read_sql. Try:

functions = [function for function in dir(pd) if function.startswith('read_')]

To read data from a database, you need to provide both a query (or table name) and a connection object:

import pandas as pd
from sqlalchemy import URL, create_engine
import psycopg2  # the driver used by SQLAlchemy's postgresql dialect

psql_connection_string = URL.create(
    'postgresql',
    username='user_name',
    password='passwd',
    host='server_name',
    port=5432,
    database='database_name'
)
conn = create_engine(psql_connection_string)

query = '''
SELECT * FROM
"monitoring"."collect_data"
WHERE collected_date > CURRENT_DATE - INTERVAL '60 days'
ORDER BY collected_date desc
'''

query_df = pd.read_sql_query(query, con=conn)

table_df = pd.read_sql_table(
    'collect_data',
    schema='monitoring',
    con=conn,
    index_col='collected_date',
)

DataFrames can also be written to external files using the corresponding to_ methods: e.g. to_json, to_excel, to_sql. With the to_sql method you can provide the table name, connection and what action to take if the table already exists:

df.to_sql(name='users', con=conn, if_exists='replace')

Pandas to_ methods also allow you to include or exclude the DataFrame index in the output:

df.to_excel("recipes.xlsx", sheet_name="flapjack", index=False)

Examining Data

DataFrame objects provide a number of attributes and functions that you can use to examine the data frame.

Attribute   Description
shape       number of rows and columns
columns     names of the columns
dtypes      data type for each column

All the output from the shape, columns, and dtypes attributes is also provided by the info method, along with details on the index and memory usage:

df.info()

The describe method provides summary statistics (min, max, mean, count and percentiles) for the numeric columns, but can also be used to get summary data (count, unique values, top value and its frequency) for non-numeric data:

df.describe(include=object)  # np.object was removed from NumPy; use the builtin object
# or
df.describe(include='all')

By default describe provides the 0.25, 0.5 and 0.75 percentiles, but you can choose others with the percentiles parameter:

df.describe(percentiles=[0.05, 0.50, 0.95])

describe only provides summary statistics for non-null values.

Each summary statistic is also available via separate summary methods that can be applied to either the DataFrame or Series objects, e.g min, max, idxmin, idxmax, sum, mean, median. For Series objects additional summary methods exist: unique, value_counts and mode.
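
For example, a few sketches using columns from the df created earlier:

df.Age.mean()              # mean of the Age column
df.Age.idxmax()            # index label of the row with the largest Age
df.Surname.value_counts()  # frequency of each surname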

Selecting Data

Display the DataFrame by evaluating its name. For large DataFrames this shows the first 5 and last 5 rows. To show the first n rows use the head method, tail for the last n rows, or sample for n random rows:

df.sample(10)
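
Similarly, for the first or last three rows:

df.head(3)  # first 3 rows
df.tail(3)  # last 3 rows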

Each column in a DataFrame is a Series. To select a column from a DataFrame, use the column label between square brackets. For multiple columns, supply a list of labels:

df[["Place of Death", "Date of Birth"]]

DataFrames support slicing operations, which can be combined with column selection:

df[["Place of Death", "Date of Birth"]][5:15]
# Order doesn't matter
df[5:15][["Place of Death", "Date of Birth"]]

However, using slicing and selection to assign values to a DataFrame is not the recommended approach: instead you should use the loc or iloc indexers. loc and iloc accept two selectors: a row selector and a column selector. Either selector can be expressed as a single value, a list or a slice:

df.loc[2, 'Surname']
df.loc[[2,3], 'Surname']
df.loc[2:3, 'Surname']

df.loc[df.Surname == 'Johnson', ['Forename','Surname']]
df.loc[df.Surname == 'Johnson', 'Forename':'Surname']

df.loc[[df.Age.idxmin(), df.Age.idxmax()], :]

With loc, slicing operations are inclusive of the end index if provided. iloc instead follows the standard Python slicing approach, and excludes the end value of a slice. iloc expects integer positions for both row and column selectors:

# returns only row 2, columns 1 and 2
df.iloc[2:3, 1:3]

# returns rows 2 and 3, columns 1, 2 and 3
df.iloc[[2,3],[1,2,3]]

iloc allows slicing for both row and column selectors.

For selecting a single value from a DataFrame, the at and iat indexers provide better performance:

wanted = df.at[2,'Surname']
wanted = df.iat[2,1]

Filtering Data

You can filter the output using a condition inside the selection brackets. For a date comparison like the one below to behave correctly, the column should contain parsed dates rather than strings (see the parse_dates argument to read_csv):

df[df["Date of Birth"] < '01/01/1970']

The isin() method can be used to check a column against multiple values:

df[df["Surname"].isin(["Smith", "Jones"])]

You can also use the OR operator (|) to achieve the same result:

df[(df["Surname"] == "Jones") | (df["Surname"] == "Smith")]

The AND operator (&) can be used to specify multiple conditions that must all return True (a sketch follows the next example). The negation operator (~) can be used to negate the result of a condition:

df[~(df.Surname == 'Jameson')]
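
As a sketch combining conditions with &, using the Age and Deceased columns from the df created earlier (each condition needs its own parentheses, because & binds more tightly than the comparison operators):

df[(df.Age > 30) & (df.Deceased == 'Yes')]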

To filter out rows where a particular column has Null values, use the notna() method:

df[df["Executors"].notna()]

We can also use strings and regular expressions:

# contains 'son' anywhere in the string
df[df.Surname.str.contains('son')]

# contains 'son' at the end of the string
df[df.Surname.str.contains(r'son$')]

# starts with 'Jeff' or 'James' (the non-capturing group keeps the ^ anchor applying to both alternatives)
df[df.Surname.str.contains(r'^(?:Jeff|James)')]

To filter numeric values between a lower and an upper bound, use between:

df[df.Age.between(10, 32, inclusive='both')]

pop can be used to remove a column and save it to a Series. If the column contains Boolean values, it can later be used to filter the DataFrame, even though the column no longer exists in the DataFrame. This works because the Series shares the DataFrame's index:

is_retired = df.pop('Retired')

df['Retired']  # now raises a KeyError

retired_df = df[is_retired].copy()
not_retired_df = df[~is_retired].copy()

This can be useful for filtering a DataFrame without storing the filter column in it. The copy method is used to create new DataFrames that can be worked on independently of the original.

The popped Series does not have to contain Booleans; it can be compared against a value to produce a Boolean mask:

where_died = df.pop('Place of Death')

df[where_died == 'Norfolk']

These filtering techniques can also be used as the row selector with loc; iloc requires a plain boolean array rather than a Series.
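
A minimal sketch, assuming the df created earlier:

mask = df.Age > 50
df.loc[mask, ['Forename', 'Surname']]
df.iloc[mask.to_numpy(), 0:2]  # iloc needs a boolean array, not a Series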

DataFrame Queries

Pandas provides the query method to filter data without building a Boolean mask. The query method allows more complex filters to be expressed with less typing:

df = pd.read_csv('titanic.csv')

df[(df.Survived == 1) & (df.Sex == 'male')].equals(
    df.query('Survived == 1 & Sex == "male"'))
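
query can also reference Python variables with @, and column names containing spaces can be wrapped in backticks. A sketch with the titanic data (min_age is just an illustrative variable):

min_age = 18
df.query('Survived == 1 and Age > @min_age')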

Concatenate DataFrames

The concat() function can be used to combine two tables with a similar structure. Specify axis=0 to append the second DataFrame as new rows, or axis=1 to add its columns alongside the first.

To join the rows of two tables use:

disk_data = pd.concat([disk_data_server_1, disk_data_server_2], axis=0)

If either DataFrame contains columns not found in the other, these are added as new columns in the result, and the rows where those columns did not exist are filled with NaN. This behaviour can be changed by specifying join='inner', which keeps only the columns common to the input DataFrames (see the sketch below).
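
A sketch, assuming the server DataFrames from the example above:

# keep only the columns common to both DataFrames
disk_data = pd.concat([disk_data_server_1, disk_data_server_2], join='inner')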

The ignore_index option can be used to create a fresh index for the new DataFrame. If you wish to keep the original indices from both input DataFrames, you can instead add a hierarchical row index (via keys) to identify which rows came from which input:

disk_data = pd.concat([disk_data_server_1, disk_data_server_2], axis=0, keys=["srv1", "srv2"])

This produces a DataFrame with a multi-index. To access entries by index value, supply a tuple or list of tuples:

disk_data.loc[[('srv2',0),('srv2',1)],:]

When using concat for a column-wise join (axis=1), pandas uses the row indexes to align the rows. For rows that don't share a common index, values are filled with NaN:
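
For example, joining a hypothetical cpu_data DataFrame alongside disk_data:

# rows are aligned on the shared index; unmatched rows get NaN
combined = pd.concat([cpu_data, disk_data], axis=1)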