Data Cleansing

Before carrying out any data analysis tasks, the data should be prepared for analysis. This process involves a number of tasks, including:

  • Renaming columns
  • Re-ordering columns and rows
  • Data-type conversions
  • Deduplication
  • Handling missing or invalid data
  • Filtering out unnecessary data

To avoid altering our original data, we can use the copy method to create a new copy of the DataFrame:

df_clean = df.copy()

Columns

The rename method can be used to rename columns in your DataFrame:

df_renamed = df.rename(
    columns={
        "Date of Publication": "publication_date",
        "Date of Death": "death_date",
        "Date of Birth": "birth_date",
    },
)

You can also use a function to rename every column at once, for example normalising the names to snake_case:

df_renamed = df.rename(lambda x: x.lower().replace(' ', '_'), axis=1)

Sort columns using sort_index(axis=1).
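For example, to order the renamed columns alphabetically:

df_sorted = df_renamed.sort_index(axis=1)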

You can use pandas conversion functions to change the data types of columns:

df_renamed.date_of_birth = pd.to_datetime(df_renamed.date_of_birth)

Use format='mixed' if your date strings are in different formats:

data = {
    'First Name': ['John', 'Jack', 'Jane', 'Jill', 'Jake'],
    'Family Name': ['Johnson', 'Jackson', 'Janeson', 'Jillson', 'Jakeson'],
    'Date of Birth': ['2001-05-20', '1980-12-13 12:24:00', 
        '1999-12-31 03:30:00', '1964-06-20', '2020-05-12 17:32'],
}

df = pd.DataFrame(data)
df_renamed = df.rename(lambda x: x.lower().replace(' ', '_'), axis=1)
df_renamed.date_of_birth = pd.to_datetime(
    df_renamed.date_of_birth, format='mixed'
)

Altering the data type of a column containing dates from 'object' to 'datetime' means we get fuller statistics via the describe method and can carry out operations associated with datetime objects:

df_renamed['time_of_birth'] = df_renamed.date_of_birth.dt.time
df_renamed['day_of_birth'] = df_renamed.date_of_birth.dt.date

To convert strings to numeric data, you can first clean up the string values and then use the pandas to_numeric function:

# remove commas
df_clean['Installs'] = df_clean['Installs'].astype(str).str.replace(',', '')
# remove currency symbol
df_clean['Installs'] = df_clean['Installs'].str.replace('£', '')
# convert to numeric
df_clean['Installs'] = pd.to_numeric(df_clean['Installs'])

astype can be used to perform some type conversions. Float64 values can be converted to Int64 values. No rounding is performed; everything after the decimal point is lost:

import numpy as np

data = {
    'forename': ['John', 'Jack', 'Jane', 'Jill', 'Jake'],
    'surname': ['Johnson', 'Jackson', 'Janeson', 'Jillson', 'Jakeson'],
    'date_of_birth': ['2001-05-20', '1980-12-13 12:24:00', '1999-12-31 03:30:00', '1964-06-20', '2020-05-12 17:32'],
    'sats_score': np.random.random(5) * 100,
    'marital_status': ['divorced', 'beheaded', 'died', 'divorced', 'beheaded']
}

df1 = pd.DataFrame(data)
df1['sats_score_truncated'] = df1.sats_score.astype('int')

The marital_status column can be converted from the 'object' type to the 'category' type using astype:

df1.marital_status = df1.marital_status.astype('category')
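Once converted, the cat accessor exposes the category metadata:

df1.marital_status.cat.categories   # the distinct categories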

Use the drop() method to remove unwanted rows or columns. Set inplace=True to alter the current DataFrame or use assignment to capture the resulting DataFrame.

drop defaults to removing rows (axis=0), but we can also drop columns either by specifying axis=1 or by providing a list to the columns parameter:

# creates new dataframe and leaves the original unchanged
nameless_df = df.drop(columns=['Forename', 'Surname'])

# affects the original dataframe
df.drop(['Forename', 'Surname'], axis=1, inplace=True)

dropna can be used to drop columns by specifying the axis and a threshold: thresh sets the minimum number of non-null values a column must contain to be kept:

df_clean = df.dropna(
    axis='columns',
    thresh=3 # keep if 3 or more non-null values
)

df_clean = df.dropna(
    axis=1,
    thresh=int(df.shape[0] * 0.75) # keep if 75% or more non-null values
)

We can also use del df['column_name'] to remove a specific column from a DataFrame.

Indexes

If no index is specified, pandas will add an index when creating your DataFrame. This will be a RangeIndex, containing integers from 0 to n-1, where n is the number of rows in your dataset.
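For example:

pd.DataFrame({'a': [10, 20, 30]}).index   # RangeIndex(start=0, stop=3, step=1)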

You can specify an index using a specific column from your data, or by generating one manually:

data = {
    'forename': ['John', 'Jack', 'Jane', 'Jill', 'Jake'],
    'surname': ['Johnson', 'Jackson', 'Janeson', 'Jillson', 'Jakeson'],
    'date_of_birth': ['2001-05-20', '1980-12-13 12:24:00', '1999-12-31 03:30:00', '1964-06-20', '2020-05-12 17:32'],
}

df1 = pd.DataFrame(data, index=pd.date_range(start='2025-06-01', periods=5, freq='D'))

df2 = pd.DataFrame(data)
df2.date_of_birth = pd.to_datetime(df2.date_of_birth, format='mixed')
df2.index = df2.date_of_birth

Note that the number of elements in the date_range should match the number of rows in the DataFrame. Both df1 and df2 now have an index of type DatetimeIndex.

You can localise your DatetimeIndex (make it timezone-aware) and convert it to UTC:

df2 = df2.tz_localize('CET')   # attach a timezone to the naive index
df2 = df2.tz_convert('UTC')    # convert the localised index to UTC

A DatetimeIndex or a PeriodIndex can be used to slice a DataFrame by dates, but partial-string slicing like this only works reliably if the DatetimeIndex has been sorted first:

df_dated = pd.DataFrame(data)
df_dated.date_of_birth = pd.to_datetime(df_dated.date_of_birth, format='mixed')
df_dated.set_index('date_of_birth', inplace=True)
df_dated.sort_index(inplace=True)

df_dated['1964':'2001']          # rows with years 1964 through 2001
df_dated['2001-05':'2020']       # mixed precision is allowed
df_dated['200105':'2020']        # compact form of the previous slice
df_dated['1980':'2025-01-01']

A DatetimeIndex can be converted to a PeriodIndex, where the index values are converted to the specified period (e.g. week, month, year):

df2 = df2.to_period('M')

Use sort_index to sort data by index value. Both sort_index and sort_values return a new DataFrame, so you have to use assignment to keep the sorted data, or use inplace=True.

reset_index will copy the current index on a DataFrame to a new column (preserving its current name), and create a new RangeIndex for the DataFrame.
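For example, using df1 from above:

df_sorted = df1.sort_index()        # returns a new DataFrame; keep it via assignment
df_reset = df_sorted.reset_index()  # the old index becomes a column; a new RangeIndex is created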

set_index can be used to drop the current index, and replace it with one or more columns from the DataFrame. Rows from a MultiIndex are accessed using tuples: (outer_index, inner_index). If you set a MultiIndex, you can undo this later with unstack. The unstack method will move the innermost index column to the DataFrame columns.
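A minimal sketch of a MultiIndex, using hypothetical city and year columns:

sales = pd.DataFrame({
    'city': ['Leeds', 'Leeds', 'York', 'York'],
    'year': [2024, 2025, 2024, 2025],
    'total': [100, 120, 80, 95],
})
sales = sales.set_index(['city', 'year'])
sales.loc[('York', 2025)]   # access a row with an (outer_index, inner_index) tuple
sales.unstack()             # the innermost level (year) moves to the columns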

Note that df.set_index('column_name', inplace=True) works differently to df.index = df.column_name: df.index will copy the column from the data and use this as the index, whereas set_index will move the column from the data and use this as the index.
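To see the difference, a quick sketch with the forename column from the data above:

df_a = df2.copy()
df_a.index = df_a.forename        # forename is copied: it remains a column in df_a
df_b = df2.set_index('forename')  # forename is moved: it is no longer a column in df_b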

reindex can be used to align the DataFrame to a new index. Where the DataFrame is missing values for the new index, you can pick a method to fill in the row values (see the sketch after the list below):

  • ffill - forward fill from nearest previous value
  • bfill - back fill from nearest next value
  • nearest - fill from the nearest value
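A minimal sketch with a small Series:

s = pd.Series([1.0, 3.0], index=[1, 3])
s.reindex(range(5), method='ffill')   # 0 -> NaN, 1 -> 1.0, 2 -> 1.0, 3 -> 3.0, 4 -> 3.0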

Derived Data

You can add new columns to a DataFrame using assignments:

df["Forename Lower Case"] = df["Forename"].str.lower()

df['Retired'] = df.Age > 67

The assign method allows us to add multiple columns at once. However, assign does not change the original DataFrame, but returns a new one. If you want to change the original, just assign the return value back to the original DataFrame:

df = df.assign(
    is_retired=df.Age >= 67,
    is_working_age=(df.Age >= 18) & (df.Age < 67),
    is_child=df.Age < 18,
    income=lambda x: np.where(x.salary.isnull(), x.pension, x.salary),
)

The same assign statement can be written with lambda functions:

df = df.assign(
    is_retired=lambda x: x.Age >= 67,
    is_working_age=lambda x: x.Age.between(18, 67, inclusive='left'),
    is_child=lambda x: x.Age < 18
)

Using lambda functions in your assign statements offers the benefit of being able to reference columns created earlier in the same statement.
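For example, is_toddler below can use age_months, which is created in the same call (a sketch assuming an Age column in years):

df = df.assign(
    age_months=lambda x: x.Age * 12,
    is_toddler=lambda x: x.age_months < 36,  # references the column assigned just above
)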

Rows

To drop rows we need to provide a list of index labels: df.drop([1, 2], inplace=True).

Use isna to find rows with null values in specified columns:

df_null_rating_or_age = df[df.Rating.isna() | df.Age.isna()]

If used on the DataFrame, isna will return a DataFrame with the values replaced with True or False to indicate null or non-null values.
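This makes it easy to count the null values in each column:

df.isna().sum()   # number of nulls per column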

If you use dropna on the DataFrame, this will return a DataFrame containing only rows where none of the values are null: in other words, if any value in a row is null, that row is dropped. This behaviour can be changed using the how parameter. The default is 'any', but you can set the parameter to 'all' so that only rows where all values are null are dropped. Combining this with the subset parameter allows you to drop only rows where all the values in the specified columns are null:

## drop rows where the subset columns are all null
df_clean = df.dropna(how='all', subset=['home', 'var', 'lib'])

## drop rows where there is a null value in any of the subset columns
df_clean = df.dropna(how='any', subset=['home', 'var', 'lib'])

fillna can be used to replace NaN values:

df = df.assign(
    salary=lambda x: x.salary.fillna('0.00'),
    free_space=lambda x: x.free_space.ffill(),
)

df_clean['pension'] = df_clean['pension'].fillna('0.00')

Note that assign always returns a new DataFrame (it has no inplace option), and calling fillna with inplace=True on a slice such as df_clean.loc[:, 'pension'] may operate on a copy without updating the original, so prefer reassignment as shown above.

You can also use apply to apply the same fill strategy to all columns in the DataFrame or a list of columns:

df_clean = df.apply(lambda x: x.ffill())

Note that apply does not have an inplace option, so you have to reassign the DataFrame or DataFrame slice:

df.loc[:, ['name', 'toy']] = df.loc[:, ['name', 'toy']].apply(
    lambda x: x.str.replace('Bat', 'Cat')
)

Apply lets you run vectorised operations on rows or columns, which is the preferred approach in pandas. Although methods like iterrows and items exist, these should be used only as a last resort, as it is less efficient to loop through items one-by-one in pandas. There is an applymap method to run a function that has not been vectorised, or np.vectorize to vectorise your own functions.
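For example, a sketch using applymap to run a plain Python function over every element of a slice (on pandas 2.1 and later, DataFrame.map is the preferred name for applymap):

df_upper = df.loc[:, ['name', 'toy']].applymap(
    lambda v: v.upper() if isinstance(v, str) else v
)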

duplicated returns a boolean Series flagging duplicated rows, leaving the first occurrence of each row unflagged. The subset parameter can be used to specify which columns to use for the comparison. Use drop_duplicates to remove the duplicated rows (keeping the first occurrence):

df_clean = df.drop_duplicates(subset=['App', 'Type', 'Price'])

You can use filtering to drop unwanted values:

df_clean = df_clean[df_clean.Installs > 300]

Reshaping DataFrames

The pivot() method can be used to reshape data from long-format to wide-format:

import pandas as pd

data = [
    ('2025-01-06', 'home', 97),
    ('2025-01-06', 'opt', 36),
    ('2025-01-06', 'var', 83),
    ('2025-01-07', 'home', 96),
    ('2025-01-07', 'opt', 36),
    ('2025-01-07', 'var', 83),
    ('2025-01-08', 'home', 95),
    ('2025-01-08', 'opt', 45),
    ('2025-01-08', 'var', 63),
]

df = pd.DataFrame(data, columns=['date', 'mount_point', 'free_space'])
df.date = pd.to_datetime(df.date)

pivot_df = df.pivot(index='date', columns='mount_point', values='free_space')

If you pass a list to the values parameter, you will have a hierarchical index in the columns, which you access using [<value_column_name>][<column_name>].
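For example, a sketch that adds a hypothetical used_space column to the data above:

df['used_space'] = 100 - df.free_space
pivot_df2 = df.pivot(index='date', columns='mount_point',
                     values=['free_space', 'used_space'])
pivot_df2['free_space']['home']   # the free_space values for the home mount point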

The reverse of pivot() is melt(), which will turn data in wide-format to long-format:

wide_data = [
    ('2025-01-06', 97, 36, 83),
    ('2025-01-07', 96, 36, 83),
    ('2025-01-08', 95, 45, 63)
]
df2 = pd.DataFrame(wide_data, columns=['date', 'home', 'opt', 'var'])

melt_df = df2.melt(
    id_vars='date',
    value_vars=['home', 'opt', 'var'],
    var_name='mount_point',
    value_name='free_space',
)

The merge function works similarly to an SQL JOIN statement, producing a DataFrame that results from combining two DataFrames on a common key column:

# patient and pathology are DataFrames sharing a patient_id column
patient_result_data = pd.merge(patient, pathology, how="left", on="patient_id")

If the two tables to merge hold the same data under different column names, use the left_on and right_on parameters:

patient_result_data = pd.merge(patient, pathology,
    how="left", left_on="patient_id", right_on="hospital_number")

Working with Datetime Values

Use the to_datetime() function to work with datetime data in your columns. Use the dayfirst parameter for dates in "%d/%m/%Y" format.

df = pd.read_csv("ukgovunclaimedestates.csv", parse_dates=True)

df["Date of Birth"] = pd.to_datetime(df["Date of Birth"], dayfirst=True)
df["Date of Death"] = pd.to_datetime(df["Date of Death"], dayfirst=True)

print(df.groupby(df["Date of Birth"].dt.year)["Date of Birth"].count())

df["Age"] = df["Date of Death"].dt.year - df["Date of Birth"].dt.year

Converting columns to datetime objects provides access to utility attributes to extract 'year', 'month', 'day' and 'weekday', and to perform calculations, via the dt accessor. If you set the index (using the set_index() method) to a datetime column, then you can use the same attributes directly on the index.

df["date_of_birth"] = pd.to_datetime(df["Date of Birth"], dayfirst=True)
df.set_index("date_of_birth", inplace=True)

print(df.index.year)

Working with String Values

Text data is stored in 'object' format in a DataFrame. Using the str accessor on text data allows access to various string methods, including lower, split and replace.

split turns a single element into a list of elements; use the get method to select the element you want:

df["Surname"] = df["Name"].str.split(", ").str.get(0)
df["Forename"] = df["Name"].str.split(", ").str.get(1)

replace uses a dictionary to replace values:

df["Gender"] = df["Sex"].replace({"male": "m", "female": "f"})

Row Comprehensions

The to_dict() method of a DataFrame will produce a dictionary mapping the column names to the column values; by default each column maps to an inner {index: value} dictionary, and orient='list' gives plain lists instead. If you want a dictionary mapping row values to other row values you can use the iterrows() method.
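For example:

df.to_dict()                # {column: {index: value}}
df.to_dict(orient='list')   # {column: [list of values]}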

The iterrows() method returns an iterator of (index, row) tuples, where row is a Series holding that row's values. This can be used in a comprehension to create a new dictionary:

df = pd.read_csv("ukgovunclaimedestates.csv", parse_dates=True)

person_dict = {row["BV Reference"]:row["Surname"] for (index, row) in df.iterrows()}