SQL Data Cleaning - AFL grand finals

Blog post description.

SQLMINI SKILL STUDYTABLEAU PUBLICCURRENT PROJECT

Fynn Burgess

12/9/20234 min read

a stack of stacked blue and white plates
a stack of stacked blue and white plates

Vlog update with full AFL/VFL grand finals project coming soon!

I compiled data on AFL/VFL grand finals since 1898 using SQL, ensuring I kept a reference list throughout. Which you can find below. To maintain accuracy, I even enlisted a mostly unbiased third party to review the data both before and after cleaning. My aim was to create a dataset that AFL fans and analysts would find intriguing.

Intentionally introducing data issues with the help of a friend, I then utilized SQL in Google BigQuery to rectify these problems. I also performed the same data cleanup using MySQL Workbench, a version of which I plan to share later.

The provided code exemplifies the types of queries conducted during the project, summarized for readability. For more detailed code snippets, feel free to email me at fynn@fynnburgess.com, and I'll happily share them.

When cleaning the data, I initially validated query outcomes using 'select' statements to ensure accuracy. Subsequently, I implemented changes to the data table using 'update' queries."

SQL Code - BigQuery

Data Quality Assurance examples

Duplicate Detection:

-- Find duplicate rows based on specific columns (e.g., year)

SELECT year, COUNT(*) AS duplicate_count

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`

GROUP BY year

HAVING COUNT(*) > 1;

NULL Values:

-- Check for NULL values in a specific column (e.g., Team_name_winner)

SELECT COUNT(*) AS null_team_name_winner

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`

WHERE Team_name_winner IS NULL;

Fixing nulls in the Team name columns:

-- The following query standardizes team names by replacing short forms with full team names

SELECT

-- Using COALESCE to handle NULL values by checking both Team_name_winner and Team_nickname_winner columns

CASE COALESCE(Team_name_winner, Team_nickname_winner)

-- Each WHEN condition checks for specific short forms and replaces them with full team names

WHEN 'Crows' THEN 'Adelaide Crows'

WHEN 'Lions' THEN 'Brisbane Lions'

WHEN 'Blues' THEN 'Carlton Blues'

WHEN 'Magpies' THEN 'Collingwood Magpies'

WHEN 'Bombers' THEN 'Essendon Bombers'

WHEN 'Dockers' THEN 'Fremantle Dockers'

WHEN 'Cats' THEN 'Geelong Cats'

WHEN 'Suns' THEN 'Gold Coast Suns'

WHEN 'Giants' THEN 'Greater Western Sydney Giants'

WHEN 'Hawks' THEN 'Hawthorn Hawks'

WHEN 'Demons' THEN 'Melbourne Demons'

WHEN 'Kangaroos/North' THEN 'North Melbourne Kangaroos'

WHEN 'Power' THEN 'Port Adelaide Power'

WHEN 'Tigers' THEN 'Richmond Tigers'

WHEN 'Saints' THEN 'St Kilda Saints'

WHEN 'Swans' THEN 'Sydney Swans'

WHEN 'Eagles' THEN 'West Coast Eagles'

WHEN 'Bulldogs' THEN 'Western Bulldogs'

ELSE COALESCE(Team_name_winner, Team_nickname_winner) -- If the team name doesn't match any specified condition, keep the original name

END AS adjusted_name

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;

Fixing spelling mistakes

SELECT

-- Standardise team names for winners and losers based on specific conditions

CASE

-- Winners' names standardisation

WHEN Team_name_winner IN ('Bisbane Lions', 'Bisband Lions') THEN 'Brisbane Lions'

WHEN Team_name_winner = 'South Melbourne' THEN 'South Melbourne (Sydney Swans)'

WHEN Team_name_winner IN (‘Fitzroy’, ‘Fitsroy') THEN 'Fitzroy Lions (Brisbane Lions)'

ELSE Team_name_winner

END AS corrected_name_winner,

CASE

-- Losers' names standardization

WHEN Team_name_loser IN ('Bisbane Lions', 'Bisband Lions') THEN 'Brisbane Lions'

WHEN Team_name_loser = 'South Melbourne' THEN 'South Melbourne (Sydney Swans)'

WHEN Team_name_loser IN ('Fitzroy', 'Fitsroy') THEN 'Fitzroy Lions (Brisbane Lions)'

ELSE Team_name_loser

END AS corrected_name_loser,

-- Standardize team nicknames for winners and losers

CASE

WHEN Team_nickname_winner = 'Bvlues' THEN 'Blues'

ELSE Team_nickname_winner

END AS corrected_nickname_winner,

CASE

WHEN Team_nickname_loser = 'Bvlues' THEN 'Blues'

ELSE Team_nickname_loser

END AS corrected_nickname_loser

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;

Example update statement

UPDATE `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`

SET

-- Update Team_name_winner based on specific conditions

Team_name_winner =

CASE

WHEN Team_name_winner IN ('Bisbane Lions', 'Bisband Lions') THEN 'Brisbane Lions'

WHEN Team_name_winner = 'South Melbourne' THEN 'South Melbourne (Sydney Swans)'

WHEN Team_name_winner IN (‘Fitzroy’, ‘Fitsroy') THEN 'Fitzroy Lions (Brisbane Lions)'

ELSE Team_name_winner

END,

-- Update Team_name_loser based on specific conditions

Team_name_loser =

CASE

WHEN Team_name_loser IN ('Bisbane Lions', 'Bisband Lions') THEN 'Brisbane Lions'

WHEN Team_name_loser = 'South Melbourne' THEN 'South Melbourne (Sydney Swans)'

WHEN Team_name_loser IN ('Fitzroy', 'Fitsroy') THEN 'Fitzroy Lions (Brisbane Lions)'

ELSE Team_name_loser

END,

-- Update Team_nickname_winner based on specific conditions

Team_nickname_winner =

CASE

WHEN Team_nickname_winner = 'Bvlues' THEN 'Blues'

ELSE Team_nickname_winner

END,

-- Update Team_nickname_loser based on specific conditions

Team_nickname_loser =

CASE

WHEN Team_nickname_loser = 'Bvlues' THEN 'Blues'

ELSE Team_nickname_loser

END;

Replacing the ‘$’ in-front of some scores

SELECT

REPLACE(FullTime_Score_winner, '$', '') AS FullTime_Score_winner,

REPLACE(Q3_Score_winner, '$', '') AS Q3_Score_winner,

REPLACE(HalfTime_Score_winner, '$', '') AS HalfTime_Score_winner,

REPLACE(Q1_Score_winner, '$', '') AS Q1_Score_winner,

REPLACE(Q1_Score_loser, '$', '') AS Q1_Score_loser,

REPLACE(HalfTime_Score_loser, '$', '') AS HalfTime_Score_loser,

REPLACE(Q3_Score_loser, '$', '') AS Q3_Score_loser,

REPLACE(FullTime_Score_loser, '$', '') AS FullTime_Score_loser

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;

Getting rid of the .0.0

SELECT

-- Create a modified number by extracting substrings from FullTime_Score_loser column

CONCAT(

LEFT(FullTime_Score_loser, 4), -- Extract the first 4 characters

' ',

RIGHT(FullTime_Score_loser, 4) -- Extract the last 4 characters

) AS modified_number

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`

WHERE LENGTH(FullTime_Score_loser) >= 8 -- Ensure the length of FullTime_Score___loser is at least 8 characters

AND (

-- Filter records where FullTime_Score_loser starts with specific digits

FullTime_Score___loser LIKE '6.%'

OR FullTime_Score___loser LIKE '5.%'

OR FullTime_Score___loser LIKE '8.%'

OR FullTime_Score___loser LIKE '3.%'

);

Updating column

UPDATE `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`

SET FullTime_Score_loser = CONCAT(

LEFT(FullTime_Score_loser, 4), -- Extract the first 4 characters

' (',

RIGHT(FullTime_Score_loser, 4), -- Extract the last 4 characters

')' -- Add parentheses

)

WHERE LENGTH(FullTime_Score_loser) >= 8 -- Ensure the length of FullTime_Score___loser is at least 8 characters

AND (

-- Filter records where FullTime_Score_loser starts with specific digits

FullTime_Score_loser LIKE '6.%'

OR FullTime_Score_loser LIKE '5.%'

OR FullTime_Score_loser LIKE '8.%'

OR FullTime_Score_loser LIKE '3.%'

);

Getting rid of duplicate rows

SELECT *

FROM (

-- Select all columns and add a row number based on the 'year' column

SELECT *,

ROW_NUMBER() OVER (PARTITION BY year ORDER BY year) AS row_num -- Assign row numbers partitioned by 'year'

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`

) AS subquery

WHERE row_num = 1; -- Filter to retrieve only the rows with the first row number in each 'year' group

SELECT DISTINCT

SELECT DISTINCT *

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;

Creating total score columns

SELECT

*,

CASE

WHEN FullTime_Score_loser LIKE '%(%' THEN REGEXP_EXTRACT(FullTime_Score_loser, r'\((\d+)\)')

ELSE NULL

END AS Total_FullTime_Score_loser,

CASE

WHEN HalfTime_Score_loser LIKE '%(%' THEN REGEXP_EXTRACT(HalfTime_Score_loser, r'\((\d+)\)')

ELSE NULL

END AS Total_HalfTime_Score_loser,

CASE

WHEN FullTime_Score_winner LIKE '%(%' THEN REGEXP_EXTRACT(FullTime_Score_winner, r'\((\d+)\)')

ELSE NULL

END AS Total_FullTime_Score_winner,

CASE

WHEN HalfTime_Score_winner LIKE '%(%' THEN REGEXP_EXTRACT(HalfTime_Score_winner, r'\((\d+)\)')

ELSE NULL

END AS Total_HalfTime_Score_winner,

CASE

WHEN Q1_Score_loser LIKE '%(%' THEN REGEXP_EXTRACT(Q1_Score_loser, r'\((\d+)\)')

ELSE NULL

END AS Total_Q1_Score_loser,

CASE

WHEN Q3_Score_loser LIKE '%(%' THEN REGEXP_EXTRACT(Q3_Score_loser, r'\((\d+)\)')

ELSE NULL

END AS Total_Q3_Score_loser,

CASE

WHEN Q1_Score_winner LIKE '%(%' THEN REGEXP_EXTRACT(Q1_Score_winner, r'\((\d+)\)')

ELSE NULL

END AS Total_Q1_Score_winner,

CASE

WHEN Q3_Score_winner LIKE '%(%' THEN REGEXP_EXTRACT(Q3_Score_winner, r'\((\d+)\)')

ELSE NULL

END AS Total_Q3_Score_winner

FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;

You can freely utilize this dataset, but I kindly request that you also acknowledge me for providing it. If you come across any inaccurate or faulty data, please feel free to contact me without hesitating.