SQL Data Cleaning - AFL grand finals
Blog post description.
SQLMINI SKILL STUDYTABLEAU PUBLICCURRENT PROJECT
Fynn Burgess
12/9/20234 min read
Vlog update with full AFL/VFL grand finals project coming soon!
I compiled data on AFL/VFL grand finals since 1898 using SQL, ensuring I kept a reference list throughout. Which you can find below. To maintain accuracy, I even enlisted a mostly unbiased third party to review the data both before and after cleaning. My aim was to create a dataset that AFL fans and analysts would find intriguing.
Intentionally introducing data issues with the help of a friend, I then utilized SQL in Google BigQuery to rectify these problems. I also performed the same data cleanup using MySQL Workbench, a version of which I plan to share later.
The provided code exemplifies the types of queries conducted during the project, summarized for readability. For more detailed code snippets, feel free to email me at fynn@fynnburgess.com, and I'll happily share them.
When cleaning the data, I initially validated query outcomes using 'select' statements to ensure accuracy. Subsequently, I implemented changes to the data table using 'update' queries."
SQL Code - BigQuery
Data Quality Assurance examples
Duplicate Detection:
-- Find duplicate rows based on specific columns (e.g., year)
SELECT year, COUNT(*) AS duplicate_count
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`
GROUP BY year
HAVING COUNT(*) > 1;
NULL Values:
-- Check for NULL values in a specific column (e.g., Team_name_winner)
SELECT COUNT(*) AS null_team_name_winner
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`
WHERE Team_name_winner IS NULL;
Fixing nulls in the Team name columns:
-- The following query standardizes team names by replacing short forms with full team names
SELECT
-- Using COALESCE to handle NULL values by checking both Team_name_winner and Team_nickname_winner columns
CASE COALESCE(Team_name_winner, Team_nickname_winner)
-- Each WHEN condition checks for specific short forms and replaces them with full team names
WHEN 'Crows' THEN 'Adelaide Crows'
WHEN 'Lions' THEN 'Brisbane Lions'
WHEN 'Blues' THEN 'Carlton Blues'
WHEN 'Magpies' THEN 'Collingwood Magpies'
WHEN 'Bombers' THEN 'Essendon Bombers'
WHEN 'Dockers' THEN 'Fremantle Dockers'
WHEN 'Cats' THEN 'Geelong Cats'
WHEN 'Suns' THEN 'Gold Coast Suns'
WHEN 'Giants' THEN 'Greater Western Sydney Giants'
WHEN 'Hawks' THEN 'Hawthorn Hawks'
WHEN 'Demons' THEN 'Melbourne Demons'
WHEN 'Kangaroos/North' THEN 'North Melbourne Kangaroos'
WHEN 'Power' THEN 'Port Adelaide Power'
WHEN 'Tigers' THEN 'Richmond Tigers'
WHEN 'Saints' THEN 'St Kilda Saints'
WHEN 'Swans' THEN 'Sydney Swans'
WHEN 'Eagles' THEN 'West Coast Eagles'
WHEN 'Bulldogs' THEN 'Western Bulldogs'
ELSE COALESCE(Team_name_winner, Team_nickname_winner) -- If the team name doesn't match any specified condition, keep the original name
END AS adjusted_name
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;
Fixing spelling mistakes
SELECT
-- Standardise team names for winners and losers based on specific conditions
CASE
-- Winners' names standardisation
WHEN Team_name_winner IN ('Bisbane Lions', 'Bisband Lions') THEN 'Brisbane Lions'
WHEN Team_name_winner = 'South Melbourne' THEN 'South Melbourne (Sydney Swans)'
WHEN Team_name_winner IN (‘Fitzroy’, ‘Fitsroy') THEN 'Fitzroy Lions (Brisbane Lions)'
ELSE Team_name_winner
END AS corrected_name_winner,
CASE
-- Losers' names standardization
WHEN Team_name_loser IN ('Bisbane Lions', 'Bisband Lions') THEN 'Brisbane Lions'
WHEN Team_name_loser = 'South Melbourne' THEN 'South Melbourne (Sydney Swans)'
WHEN Team_name_loser IN ('Fitzroy', 'Fitsroy') THEN 'Fitzroy Lions (Brisbane Lions)'
ELSE Team_name_loser
END AS corrected_name_loser,
-- Standardize team nicknames for winners and losers
CASE
WHEN Team_nickname_winner = 'Bvlues' THEN 'Blues'
ELSE Team_nickname_winner
END AS corrected_nickname_winner,
CASE
WHEN Team_nickname_loser = 'Bvlues' THEN 'Blues'
ELSE Team_nickname_loser
END AS corrected_nickname_loser
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;
Example update statement
UPDATE `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`
SET
-- Update Team_name_winner based on specific conditions
Team_name_winner =
CASE
WHEN Team_name_winner IN ('Bisbane Lions', 'Bisband Lions') THEN 'Brisbane Lions'
WHEN Team_name_winner = 'South Melbourne' THEN 'South Melbourne (Sydney Swans)'
WHEN Team_name_winner IN (‘Fitzroy’, ‘Fitsroy') THEN 'Fitzroy Lions (Brisbane Lions)'
ELSE Team_name_winner
END,
-- Update Team_name_loser based on specific conditions
Team_name_loser =
CASE
WHEN Team_name_loser IN ('Bisbane Lions', 'Bisband Lions') THEN 'Brisbane Lions'
WHEN Team_name_loser = 'South Melbourne' THEN 'South Melbourne (Sydney Swans)'
WHEN Team_name_loser IN ('Fitzroy', 'Fitsroy') THEN 'Fitzroy Lions (Brisbane Lions)'
ELSE Team_name_loser
END,
-- Update Team_nickname_winner based on specific conditions
Team_nickname_winner =
CASE
WHEN Team_nickname_winner = 'Bvlues' THEN 'Blues'
ELSE Team_nickname_winner
END,
-- Update Team_nickname_loser based on specific conditions
Team_nickname_loser =
CASE
WHEN Team_nickname_loser = 'Bvlues' THEN 'Blues'
ELSE Team_nickname_loser
END;
Replacing the ‘$’ in-front of some scores
SELECT
REPLACE(FullTime_Score_winner, '$', '') AS FullTime_Score_winner,
REPLACE(Q3_Score_winner, '$', '') AS Q3_Score_winner,
REPLACE(HalfTime_Score_winner, '$', '') AS HalfTime_Score_winner,
REPLACE(Q1_Score_winner, '$', '') AS Q1_Score_winner,
REPLACE(Q1_Score_loser, '$', '') AS Q1_Score_loser,
REPLACE(HalfTime_Score_loser, '$', '') AS HalfTime_Score_loser,
REPLACE(Q3_Score_loser, '$', '') AS Q3_Score_loser,
REPLACE(FullTime_Score_loser, '$', '') AS FullTime_Score_loser
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;
Getting rid of the .0.0
SELECT
-- Create a modified number by extracting substrings from FullTime_Score_loser column
CONCAT(
LEFT(FullTime_Score_loser, 4), -- Extract the first 4 characters
' ',
RIGHT(FullTime_Score_loser, 4) -- Extract the last 4 characters
) AS modified_number
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`
WHERE LENGTH(FullTime_Score_loser) >= 8 -- Ensure the length of FullTime_Score___loser is at least 8 characters
AND (
-- Filter records where FullTime_Score_loser starts with specific digits
FullTime_Score___loser LIKE '6.%'
OR FullTime_Score___loser LIKE '5.%'
OR FullTime_Score___loser LIKE '8.%'
OR FullTime_Score___loser LIKE '3.%'
);
Updating column
UPDATE `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`
SET FullTime_Score_loser = CONCAT(
LEFT(FullTime_Score_loser, 4), -- Extract the first 4 characters
' (',
RIGHT(FullTime_Score_loser, 4), -- Extract the last 4 characters
')' -- Add parentheses
)
WHERE LENGTH(FullTime_Score_loser) >= 8 -- Ensure the length of FullTime_Score___loser is at least 8 characters
AND (
-- Filter records where FullTime_Score_loser starts with specific digits
FullTime_Score_loser LIKE '6.%'
OR FullTime_Score_loser LIKE '5.%'
OR FullTime_Score_loser LIKE '8.%'
OR FullTime_Score_loser LIKE '3.%'
);
Getting rid of duplicate rows
SELECT *
FROM (
-- Select all columns and add a row number based on the 'year' column
SELECT *,
ROW_NUMBER() OVER (PARTITION BY year ORDER BY year) AS row_num -- Assign row numbers partitioned by 'year'
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`
) AS subquery
WHERE row_num = 1; -- Filter to retrieve only the rows with the first row number in each 'year' group
SELECT DISTINCT
SELECT DISTINCT *
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;
Creating total score columns
SELECT
*,
CASE
WHEN FullTime_Score_loser LIKE '%(%' THEN REGEXP_EXTRACT(FullTime_Score_loser, r'\((\d+)\)')
ELSE NULL
END AS Total_FullTime_Score_loser,
CASE
WHEN HalfTime_Score_loser LIKE '%(%' THEN REGEXP_EXTRACT(HalfTime_Score_loser, r'\((\d+)\)')
ELSE NULL
END AS Total_HalfTime_Score_loser,
CASE
WHEN FullTime_Score_winner LIKE '%(%' THEN REGEXP_EXTRACT(FullTime_Score_winner, r'\((\d+)\)')
ELSE NULL
END AS Total_FullTime_Score_winner,
CASE
WHEN HalfTime_Score_winner LIKE '%(%' THEN REGEXP_EXTRACT(HalfTime_Score_winner, r'\((\d+)\)')
ELSE NULL
END AS Total_HalfTime_Score_winner,
CASE
WHEN Q1_Score_loser LIKE '%(%' THEN REGEXP_EXTRACT(Q1_Score_loser, r'\((\d+)\)')
ELSE NULL
END AS Total_Q1_Score_loser,
CASE
WHEN Q3_Score_loser LIKE '%(%' THEN REGEXP_EXTRACT(Q3_Score_loser, r'\((\d+)\)')
ELSE NULL
END AS Total_Q3_Score_loser,
CASE
WHEN Q1_Score_winner LIKE '%(%' THEN REGEXP_EXTRACT(Q1_Score_winner, r'\((\d+)\)')
ELSE NULL
END AS Total_Q1_Score_winner,
CASE
WHEN Q3_Score_winner LIKE '%(%' THEN REGEXP_EXTRACT(Q3_Score_winner, r'\((\d+)\)')
ELSE NULL
END AS Total_Q3_Score_winner
FROM `aflprojects.AFL_VFL_GrandFinals.AFL_VFL_GrandFinals`;
You can freely utilize this dataset, but I kindly request that you also acknowledge me for providing it. If you come across any inaccurate or faulty data, please feel free to contact me without hesitating.