SQL Functions for Data Cleaning: Detailed Tips and Examples
1 Introduction to Data Cleaning with SQL
Data cleaning is a crucial step in data analysis. Dirty data leads to incorrect analyses, wrong business decisions, and poor machine learning models. SQL, the most popular query language for relational databases, provides powerful functions for cleaning data efficiently.
Key Focus:
Removing duplicates
Handling null values
Standardizing text
Fixing data types
Validating data consistency
2 Removing Duplicates
Duplicates often distort statistics and analyses. SQL provides simple ways to remove them.
Example:
-- Find duplicates based on email address
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
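-- Delete duplicate rows, keeping the one with the lowest id.
-- DELETE FROM <cte> is SQL Server syntax and PostgreSQL rejects it, so filter by id instead.
WITH ranked AS (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);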
Tip: Always back up data before running DELETE operations!
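One way to act on this tip is to rehearse the DELETE inside a transaction and roll it back after checking the row counts. The sketch below is an illustration only: it assumes an in-memory SQLite database driven from Python's sqlite3 module (SQLite 3.25 or newer for ROW_NUMBER()), and the table contents are made up.

```python
import sqlite3

# Hypothetical sample data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"),
     (3, "a@example.com"), (4, "a@example.com")],
)
conn.commit()

# Rehearse the delete: sqlite3 opens a transaction before the DELETE runs.
cur = conn.execute(
    """
    DELETE FROM users
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
            FROM users
        )
        WHERE rn > 1
    )
    """
)
deleted = cur.rowcount
after_delete = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Nothing is permanent until commit(); rollback() restores the rows.
conn.rollback()
after_rollback = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(deleted, after_delete, after_rollback)
```

If the counts look right, the same statement can be rerun and followed by commit() instead of rollback(). A rollback does not replace a real backup; it only protects against a mistake in this one statement.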
3 Handling NULL Values
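NULLs can break aggregations and comparisons: aggregates such as COUNT(column) and AVG() skip them, and comparing a value to NULL with = or <> never evaluates to true.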
Functions:
COALESCE()
IS NULL
NULLIF()
Example:
-- Replace NULL phone numbers with 'Unknown'
SELECT id, COALESCE(phone, 'Unknown') AS phone
FROM users;
-- Find all users without a phone number
SELECT * FROM users WHERE phone IS NULL;
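NULLIF() is listed above without an example. A minimal sketch of how it can pair with COALESCE(), assuming an in-memory SQLite database and made-up rows in which empty strings and 'N/A' stand in for missing phone numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, phone TEXT)")
conn.executemany(
    "INSERT INTO users (id, phone) VALUES (?, ?)",
    [(1, "555-0100"), (2, None), (3, ""), (4, "N/A")],  # hypothetical rows
)

# NULLIF(a, b) returns NULL when a = b, so placeholder strings become real
# NULLs, and COALESCE then supplies a single display value for all of them.
phones = conn.execute(
    """
    SELECT id, COALESCE(NULLIF(NULLIF(phone, ''), 'N/A'), 'Unknown') AS phone
    FROM users
    ORDER BY id
    """
).fetchall()

# COUNT(*) counts rows; COUNT(phone) skips rows where phone is NULL.
total_rows, non_null_phones = conn.execute(
    "SELECT COUNT(*), COUNT(phone) FROM users"
).fetchone()

# '= NULL' never matches anything; IS NULL is the correct test.
eq_null = conn.execute(
    "SELECT COUNT(*) FROM users WHERE phone = NULL").fetchone()[0]
is_null = conn.execute(
    "SELECT COUNT(*) FROM users WHERE phone IS NULL").fetchone()[0]

print(phones, total_rows, non_null_phones, eq_null, is_null)
```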
4 Standardizing Text Fields
Consistency is vital for text fields.
Functions:
LOWER(), UPPER()
TRIM()
REPLACE()
Example:
-- Standardize email to lowercase
UPDATE users
SET email = LOWER(email);
-- Remove leading/trailing spaces
UPDATE users
SET username = TRIM(username);
-- Replace spaces with underscores
UPDATE users
SET username = REPLACE(username, ' ', '_');
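The order of these steps matters: trimming has to come before replacing spaces, otherwise the padding itself turns into underscores. A small sketch of the three functions combined in one UPDATE, assuming an in-memory SQLite database and an invented sample row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, username TEXT)")
conn.execute(
    "INSERT INTO users (id, email, username) VALUES (?, ?, ?)",
    (1, "  Alice@Example.COM ", "  john doe "),  # hypothetical messy row
)

# What happens if spaces are replaced before trimming.
naive = conn.execute(
    "SELECT REPLACE(username, ' ', '_') FROM users").fetchone()[0]

# Trim first, then lowercase the email and replace inner spaces.
conn.execute(
    """
    UPDATE users
    SET email = LOWER(TRIM(email)),
        username = REPLACE(TRIM(username), ' ', '_')
    """
)
cleaned = conn.execute("SELECT email, username FROM users").fetchone()

print(repr(naive), cleaned)
```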
5 Correcting Data Types
Wrong data types can cause performance issues and calculation errors.
Example:
-- Cast string to integer
SELECT CAST(age AS INTEGER) FROM users;
-- Alter column to the correct type (PostgreSQL syntax)
ALTER TABLE users
ALTER COLUMN age TYPE INTEGER USING age::INTEGER;
Tip: Always validate conversions first with SELECT!
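A validation pass might list the values that would not convert cleanly before the column type is changed. The sketch below assumes an in-memory SQLite database, where age is stored as text, and uses SQLite's GLOB operator; in PostgreSQL a comparable filter could be written with a regular expression such as age !~ '^[0-9]+$'. The rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age TEXT)")
conn.executemany(
    "INSERT INTO users (id, age) VALUES (?, ?)",
    [(1, "34"), (2, "abc"), (3, " 27"), (4, ""), (5, None)],  # hypothetical
)

# Flag non-NULL values that are empty or contain any non-digit character.
# GLOB '*[^0-9]*' matches a string with at least one non-digit in it.
not_convertible = conn.execute(
    """
    SELECT id, age
    FROM users
    WHERE age IS NOT NULL
      AND (age = '' OR age GLOB '*[^0-9]*')
    ORDER BY id
    """
).fetchall()

print(not_convertible)
```

Rows returned by this query need to be fixed or set to NULL before the cast; an empty result suggests the conversion is safe to run.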
6 Validating Data Consistency
Ensuring relationships and data consistency prevents bugs.
Example:
-- Find records with invalid foreign keys
SELECT orders.id
FROM orders
LEFT JOIN customers ON orders.customer_id = customers.id
WHERE customers.id IS NULL;
Tip: Use constraints like FOREIGN KEY, UNIQUE, and CHECK to enforce consistency.
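To illustrate the tip, the sketch below declares all three constraint types and shows the database rejecting bad rows at insert time. It assumes an in-memory SQLite database, where foreign keys must be switched on per connection with PRAGMA foreign_keys; the schema and the values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.execute(
    """
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT UNIQUE,
        age   INTEGER CHECK (age BETWEEN 0 AND 120)
    )
    """
)
conn.execute(
    """
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (id)
    )
    """
)
conn.execute(
    "INSERT INTO customers (id, email, age) VALUES (1, 'a@example.com', 30)")

# Each statement violates exactly one constraint.
attempts = {
    "duplicate email": "INSERT INTO customers (id, email, age) "
                       "VALUES (2, 'a@example.com', 41)",
    "age out of range": "INSERT INTO customers (id, email, age) "
                        "VALUES (3, 'c@example.com', 150)",
    "unknown customer": "INSERT INTO orders (id, customer_id) VALUES (1, 99)",
}
rejected = []
for label, statement in attempts.items():
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError:
        rejected.append(label)

customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, customer_count, order_count)
```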
7 Splitting and Combining Fields
Sometimes fields are improperly combined.
Functions:
SUBSTRING()
SPLIT_PART() (PostgreSQL)
CONCAT()
Example:
-- Extract domain from email
SELECT email, SPLIT_PART(email, '@', 2) AS domain
FROM users;
-- Combine first and last names
SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM users;
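SPLIT_PART() is PostgreSQL-specific, so here is a hedged sketch of the same split using substr() and instr(), assuming an in-memory SQLite database and invented rows. It also shows a pitfall when combining fields: joining with the || operator yields NULL as soon as one part is NULL, which COALESCE() can guard against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, "
    "last_name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [(1, "Ana", "Lima", "ana@example.com"),
     (2, "Bo", None, "bo@mail.example.org")],  # hypothetical rows
)

rows = conn.execute(
    """
    SELECT substr(email, 1, instr(email, '@') - 1) AS local_part,
           substr(email, instr(email, '@') + 1)    AS domain,
           first_name || ' ' || last_name          AS naive_name,
           TRIM(COALESCE(first_name, '') || ' ' ||
                COALESCE(last_name, ''))           AS full_name
    FROM users
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```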
8 Date and Time Cleaning
Dates are often in wrong formats.
Functions:
TO_DATE()
EXTRACT()
NOW()
Example:
-- Convert string to date
SELECT TO_DATE(birthdate, 'YYYY-MM-DD')
FROM users;
-- Extract year from a date
SELECT EXTRACT(YEAR FROM created_at) AS signup_year FROM users;
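When a column mixes date formats, the strings can be rewritten into one format first and anything still unparseable flagged. The sketch below is one possible approach under stated assumptions: an in-memory SQLite database (which has date() and strftime() rather than TO_DATE() and EXTRACT()), made-up rows, and DD/MM/YYYY as the only alternative format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, birthdate TEXT)")
conn.executemany(
    "INSERT INTO users (id, birthdate) VALUES (?, ?)",
    [(1, "1990-05-17"), (2, "17/05/1990"), (3, "unknown")],  # hypothetical
)

# Rewrite DD/MM/YYYY as YYYY-MM-DD, then let date() parse the result.
# date() returns NULL for text it cannot interpret as a date.
normalized = conn.execute(
    """
    SELECT id,
           date(CASE
                    WHEN birthdate LIKE '__/__/____'
                    THEN substr(birthdate, 7, 4) || '-' ||
                         substr(birthdate, 4, 2) || '-' ||
                         substr(birthdate, 1, 2)
                    ELSE birthdate
                END) AS birthdate_iso
    FROM users
    ORDER BY id
    """
).fetchall()

# SQLite counterpart of EXTRACT(YEAR FROM ...).
signup_year = conn.execute(
    "SELECT CAST(strftime('%Y', '2024-03-09 10:15:00') AS INTEGER)"
).fetchone()[0]

print(normalized, signup_year)
```

Rows whose birthdate_iso comes back as NULL are the ones that still need manual review.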
9 Dealing with Outliers
Outliers skew analysis and need flagging.
Example:
-- Find users with unrealistic ages
SELECT * FROM users WHERE age < 0 OR age > 120;
-- Cap ages at a realistic maximum
UPDATE users
SET age = 120
WHERE age > 120;
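Capping overwrites the original value, so it can help to flag suspicious rows first and decide later what to do with them. A minimal sketch of such a flag, assuming an in-memory SQLite database, invented rows, and the same 0 to 120 range used above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO users (id, age) VALUES (?, ?)",
    [(1, 34), (2, -5), (3, 130), (4, None)],  # hypothetical rows
)

# Label each row instead of changing it; the raw age stays available.
flags = conn.execute(
    """
    SELECT id,
           CASE
               WHEN age IS NULL           THEN 'missing'
               WHEN age < 0 OR age > 120  THEN 'out_of_range'
               ELSE 'ok'
           END AS age_flag
    FROM users
    ORDER BY id
    """
).fetchall()

print(flags)
```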
10 Aggregating and Deduplicating Data
Aggregating data helps summarize and clean large datasets.
Example:
-- Get most recent order per user
SELECT user_id, MAX(order_date) AS last_order
FROM orders
GROUP BY user_id;
Tip: Use GROUP BY with aggregates smartly to deduplicate while summarizing.
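GROUP BY with MAX() returns only the latest date, not the rest of that order's columns. When the whole row is needed, a window function is one alternative. The sketch below assumes an in-memory SQLite database (3.25 or newer) and a hypothetical amount column that is not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, "2024-01-05", 20.0),
     (2, 10, "2024-03-01", 35.5),
     (3, 11, "2024-02-10", 12.0)],  # hypothetical rows
)

# Rank each user's orders from newest to oldest and keep rank 1.
# The id tiebreaker makes the result deterministic for same-day orders.
latest_orders = conn.execute(
    """
    SELECT id, user_id, order_date, amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY order_date DESC, id DESC
               ) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY user_id
    """
).fetchall()

print(latest_orders)
```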
11 Writing Reusable Cleaning Scripts
Instead of manual cleaning, write stored procedures.
Example:
-- PostgreSQL: a BEGIN ... END body requires LANGUAGE plpgsql, not LANGUAGE SQL
CREATE PROCEDURE clean_users()
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE users SET email = LOWER(TRIM(email));
    DELETE FROM users WHERE email IS NULL;
END;
$$;
-- Call the procedure
CALL clean_users();
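Not every database supports stored procedures (SQLite does not), so the same idea can also live in application code. The sketch below is one possible shape, not a port of the procedure above: a hypothetical Python function that runs both steps in a single transaction against an in-memory SQLite database and reports how many rows each step touched.

```python
import sqlite3


def clean_users(conn):
    """Run the cleaning steps atomically; return (normalized, deleted) counts."""
    with conn:  # commits on success, rolls back if any statement raises
        normalized = conn.execute(
            "UPDATE users SET email = LOWER(TRIM(email)) "
            "WHERE email IS NOT NULL AND email <> LOWER(TRIM(email))"
        ).rowcount
        deleted = conn.execute(
            "DELETE FROM users WHERE email IS NULL"
        ).rowcount
    return normalized, deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(1, "  Ana@Example.com "), (2, None), (3, "bo@example.com")],  # made up
)
conn.commit()

first_run = clean_users(conn)
second_run = clean_users(conn)  # a second pass should find nothing to change
emails = [row[0] for row in conn.execute("SELECT email FROM users ORDER BY id")]

print(first_run, second_run, emails)
```

Restricting the UPDATE to rows that actually change keeps the counts meaningful and makes the script safe to rerun.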
12 Trimming Whitespace with TRIM, LTRIM, RTRIM
Purpose: Remove leading/trailing spaces
Example:
-- Remove leading and trailing spaces from 'name'
SELECT TRIM(name) AS cleaned_name
FROM users;
-- Remove leading spaces from 'city'
SELECT LTRIM(city) AS cleaned_city
FROM offices;
13 Standardizing Text with UPPER, LOWER, and INITCAP
Purpose: Ensure consistent capitalization
Example:
-- Convert 'email' to lowercase
SELECT LOWER(email) AS cleaned_email
FROM subscribers;
-- Capitalize first letter of each word in 'city'
SELECT INITCAP(city) AS cleaned_city
FROM locations;
14 Extracting Substrings with SUBSTRING/SUBSTR
Purpose: Split or clean parts of a string
Example:
-- Extract first 3 characters of a product code
SELECT SUBSTRING(product_code, 1, 3) AS category_id FROM products;
-- Isolate domain from emails (everything after the '@')
SELECT SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain FROM users;
15 Pattern Matching with LIKE and Regular Expressions
Purpose: Validate formats (e.g., emails, phone numbers)
Example:
-- Find invalid emails (missing '@')
SELECT email
FROM customers
WHERE email NOT LIKE '%@%';
-- PostgreSQL: list phone numbers that match NNN-NNN-NNNN
-- (use !~ instead of ~ to list the ones that do not match)
SELECT phone
FROM contacts
WHERE phone ~ '^\d{3}-\d{3}-\d{4}$';
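The ~ operator is PostgreSQL-specific. As a hedged illustration of the same check elsewhere, SQLite's REGEXP operator has no built-in implementation, but Python's sqlite3 module lets you register one. The sketch below assumes an in-memory database and made-up phone numbers, and lists the values that fail the NNN-NNN-NNNN format.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backs SQLite's 'value REGEXP pattern'; SQLite passes the pattern first."""
    if value is None:
        return 0
    return int(re.fullmatch(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts (id, phone) VALUES (?, ?)",
    [(1, "555-123-4567"), (2, "5551234567"), (3, "555-123-456a")],  # made up
)

PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"
invalid_phones = [
    row[0]
    for row in conn.execute(
        "SELECT phone FROM contacts WHERE NOT (phone REGEXP ?) ORDER BY id",
        (PHONE_PATTERN,),
    )
]

print(invalid_phones)
```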
16 Conditional Logic with CASE Statements
Purpose: Categorize or recode values
Example:
-- Recode 'status' values for consistency
SELECT
CASE
WHEN status IN ('A', 'Active') THEN 'Active'
WHEN status IN ('I', 'Inactive') THEN 'Inactive'
ELSE 'Unknown'
END AS cleaned_status
FROM accounts;
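Exact-match recoding misses variants such as 'active ' or 'a'. One option is to normalize case and whitespace inside the CASE expression before comparing. The sketch below assumes an in-memory SQLite database and invented status values, and counts the rows per cleaned status to check the mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO accounts (status) VALUES (?)",
    [("A",), ("Active",), ("active ",), ("I",), ("inactive",), ("X",), (None,)],
)

# UPPER(TRIM(...)) folds spelling variants together before the comparison;
# NULL and unrecognized codes fall through to 'Unknown'.
status_counts = conn.execute(
    """
    SELECT cleaned_status, COUNT(*) AS n
    FROM (
        SELECT CASE
                   WHEN UPPER(TRIM(status)) IN ('A', 'ACTIVE')   THEN 'Active'
                   WHEN UPPER(TRIM(status)) IN ('I', 'INACTIVE') THEN 'Inactive'
                   ELSE 'Unknown'
               END AS cleaned_status
        FROM accounts
    )
    GROUP BY cleaned_status
    ORDER BY cleaned_status
    """
).fetchall()

print(status_counts)
```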
17 Aggregation with GROUP BY and HAVING for Validation
Purpose: Identify outliers or invalid groups
Example:
-- Find orders whose total quantity is negative
SELECT order_id, SUM(quantity) AS total_qty
FROM order_items
GROUP BY order_id
HAVING SUM(quantity) < 0;
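HAVING SUM(quantity) < 0 only catches orders whose total is negative; an order with one negative line can still sum to a positive number. A short sketch contrasting it with HAVING MIN(quantity) < 0, assuming an in-memory SQLite database and made-up order lines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, quantity INTEGER)")
conn.executemany(
    "INSERT INTO order_items (order_id, quantity) VALUES (?, ?)",
    [(1, 5), (1, -2),   # one bad line, positive total
     (2, -1), (2, -3),  # negative total
     (3, 2), (3, 2)],   # clean order
)

# Orders whose summed quantity is negative.
negative_total = [
    row[0]
    for row in conn.execute(
        "SELECT order_id FROM order_items "
        "GROUP BY order_id HAVING SUM(quantity) < 0 ORDER BY order_id"
    )
]

# Orders containing at least one negative line.
any_negative_line = [
    row[0]
    for row in conn.execute(
        "SELECT order_id FROM order_items "
        "GROUP BY order_id HAVING MIN(quantity) < 0 ORDER BY order_id"
    )
]

print(negative_total, any_negative_line)
```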
18 Final Checklist Before Finishing Data Cleaning
Conclusion
Good SQL data cleaning practices ensure reliable, maintainable datasets. Mastering SQL functions like COALESCE(), LOWER(), TRIM(), SUBSTRING(), and EXTRACT() empowers you to keep your data sharp, clean, and ready for action!