SQL Functions for Data Cleaning: Detailed Tips and Examples

1 Introduction to Data Cleaning with SQL

Data cleaning is a crucial step in data analysis. Dirty data leads to incorrect analyses, wrong business decisions, and poor machine learning models. SQL, the most popular query language for relational databases, provides powerful functions for cleaning data efficiently.

Key Focus:

Removing duplicates

Handling null values

Standardizing text

Fixing data types

Validating data consistency

2 Removing Duplicates

Duplicates often distort statistics and analyses. SQL provides simple ways to remove them.

Example:

-- Find duplicates based on email address
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Delete duplicate rows, keeping the one with the lowest id.
-- Deleting from the CTE itself works in SQL Server but not in PostgreSQL,
-- so the ranked rows are matched back to the table by id.
WITH ranked AS (
SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);

Tip: Always back up data before running DELETE operations!
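The find-then-delete workflow can be tried end to end without touching a real database. The following is a minimal sketch in Python using the standard sqlite3 module and an in-memory database; the sample rows are made up for illustration. It swaps ROW_NUMBER() for MIN(id) with GROUP BY, which gives the same "keep the lowest id" result and also runs on older SQLite versions.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"),
     (3, "a@example.com"), (4, "a@example.com")],
)

# Step 1: list emails that occur more than once.
duplicates = conn.execute(
    "SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

# Step 2: delete every row that is not the lowest id for its email.
with conn:  # commits on success, rolls back on error
    conn.execute(
        "DELETE FROM users "
        "WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY email)"
    )

remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(duplicates)
print(remaining)
```

Running the duplicate check before and after the DELETE is a cheap way to confirm that the cleanup did what was intended.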

3 Handling NULL Values

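NULLs can break aggregations and comparisons: aggregate functions such as COUNT(column) and AVG() skip them, and comparing anything to NULL with = or <> yields unknown instead of true or false.

Functions:

COALESCE()

IS NULL

NULLIF()

Example:

-- Replace NULL phone numbers with 'Unknown'
SELECT id, COALESCE(phone, 'Unknown') AS phone
FROM users;

-- Find all users without a phone number
SELECT * FROM users WHERE phone IS NULL;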
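NULLIF() is listed above but not demonstrated. A common use is to turn empty or blank strings into NULL so that one COALESCE() fallback covers both cases. The sketch below assumes Python's sqlite3 module and a made-up users table; the same expression should carry over to PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, phone TEXT)")
conn.executemany(
    "INSERT INTO users (id, phone) VALUES (?, ?)",
    [(1, "555-0100"), (2, ""), (3, None), (4, "   ")],
)

# NULLIF(x, '') returns NULL when x equals '', so blank strings and real
# NULLs are both handled by the same COALESCE fallback.
rows = conn.execute(
    "SELECT id, COALESCE(NULLIF(TRIM(phone), ''), 'Unknown') AS phone "
    "FROM users ORDER BY id"
).fetchall()

# COUNT(*) counts rows; COUNT(expression) skips rows where it is NULL.
total, with_phone = conn.execute(
    "SELECT COUNT(*), COUNT(NULLIF(TRIM(phone), '')) FROM users"
).fetchone()
print(rows, total, with_phone)
```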

4 Standardizing Text Fields

Consistency is vital for text fields.

Functions:

LOWER(), UPPER()

TRIM()

REPLACE()

Example:

-- Standardize email to lowercase
UPDATE users
SET email = LOWER(email);

-- Remove leading/trailing spaces
UPDATE users
SET username = TRIM(username);

-- Replace spaces with underscores
UPDATE users
SET username = REPLACE(username, ' ', '_');
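Before running such updates it helps to preview which rows would change, and afterwards to check whether normalization has surfaced new duplicates. This is a rough sketch using Python's sqlite3 module with invented sample rows, not a prescribed procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, username TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "  Alice@Example.COM ", " alice smith "),
     (2, "bob@example.com", "bob"),
     (3, "alice@example.com", "alice2")],
)

# Preview: which rows would the cleanup change?
to_change = conn.execute(
    "SELECT id FROM users "
    "WHERE email <> LOWER(TRIM(email)) "
    "   OR username <> REPLACE(TRIM(username), ' ', '_') "
    "ORDER BY id"
).fetchall()

with conn:
    conn.execute(
        "UPDATE users SET email = LOWER(TRIM(email)), "
        "username = REPLACE(TRIM(username), ' ', '_')"
    )

# Values that differed only by case or spacing are now identical.
new_duplicates = conn.execute(
    "SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1"
).fetchall()
print(to_change, new_duplicates)
```

In this sample, lowercasing makes row 1 collide with row 3, which is why text standardization is usually followed by a second deduplication pass.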

5 Correcting Data Types

Wrong data types can cause performance issues and calculation errors.

Example:

-- Cast string to integer
SELECT CAST(age AS INTEGER) FROM users;

-- Alter column to correct type (PostgreSQL syntax)
ALTER TABLE users
ALTER COLUMN age TYPE INTEGER USING age::INTEGER;

Tip: Always validate conversions first with SELECT!
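One way to follow that tip is to list the values that would not convert cleanly before casting anything. The sketch below uses Python's sqlite3 module; the sample rows are made up, and GLOB is SQLite-specific (in PostgreSQL a regular expression test would play the same role).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'age' is stored as text here to mimic a column with the wrong type.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "34"), (2, " 27 "), (3, "abc"), (4, ""), (5, None), (6, "41.5")],
)

# Rows whose trimmed value is empty or contains a non-digit would not
# convert cleanly. '*[^0-9]*' matches any string containing a non-digit.
bad_rows = conn.execute(
    "SELECT id, age FROM users "
    "WHERE age IS NOT NULL "
    "  AND (TRIM(age) = '' OR TRIM(age) GLOB '*[^0-9]*') "
    "ORDER BY id"
).fetchall()

# Convert only the rows that pass the check.
converted = conn.execute(
    "SELECT id, CAST(TRIM(age) AS INTEGER) FROM users "
    "WHERE TRIM(age) <> '' AND TRIM(age) NOT GLOB '*[^0-9]*' "
    "ORDER BY id"
).fetchall()
print(bad_rows, converted)
```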

6 Validating Data Consistency

Ensuring relationships and data consistency prevents bugs.

Example:

-- Find records with invalid foreign keys
SELECT orders.id
FROM orders
LEFT JOIN customers ON orders.customer_id = customers.id
WHERE customers.id IS NULL;

Tip: Use constraints like FOREIGN KEY, UNIQUE, and CHECK to enforce consistency.
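To illustrate how such constraints reject bad rows at insert time, here is a small sketch using Python's sqlite3 module. The schema and the try_insert helper are hypothetical; note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, whereas PostgreSQL enforces them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " email TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " quantity INTEGER NOT NULL CHECK (quantity > 0))"
)
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")


def try_insert(sql):
    """Return True if the statement is accepted, False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid order": try_insert("INSERT INTO orders VALUES (1, 1, 2)"),
    "unknown customer": try_insert("INSERT INTO orders VALUES (2, 99, 1)"),
    "negative quantity": try_insert("INSERT INTO orders VALUES (3, 1, -5)"),
    "duplicate email": try_insert("INSERT INTO customers VALUES (2, 'a@example.com')"),
}
print(results)
```

Only the first insert is accepted; the other three fail on the FOREIGN KEY, CHECK, and UNIQUE constraints respectively, so the bad rows never reach the table.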

7 Splitting and Combining Fields

Sometimes fields are improperly combined.

Functions:

SUBSTRING()

SPLIT_PART() (PostgreSQL)

CONCAT()

Example:

-- Extract domain from email
SELECT email, SPLIT_PART(email, '@', 2) AS domain
FROM users;

-- Combine first and last names
SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM users;
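SPLIT_PART() is PostgreSQL-specific, so on other engines the same split has to be built from more basic functions. The sketch below, using Python's sqlite3 module and invented rows, uses SUBSTR() with INSTR() for the domain and the || operator for the name. Because || returns NULL when either operand is NULL, each part is wrapped in COALESCE().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, "
    "first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [(1, "ann@example.com", "Ann", "Lee"),
     (2, "bob@mail.example.org", "Bob", None)],
)

rows = conn.execute(
    "SELECT id, "
    # Everything after the first '@'.
    "       SUBSTR(email, INSTR(email, '@') + 1) AS domain, "
    # NULL-safe concatenation; TRIM removes the stray space if a part is missing.
    "       TRIM(COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')) AS full_name "
    "FROM users ORDER BY id"
).fetchall()
print(rows)
```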

8 Date and Time Cleaning

Dates are often stored as strings in inconsistent formats.

Functions:

TO_DATE()

EXTRACT()

NOW()

Example:

-- Convert string to date (TO_DATE is available in PostgreSQL and Oracle)
SELECT TO_DATE(birthdate, 'YYYY-MM-DD')
FROM users;

-- Extract year from a date
SELECT EXTRACT(YEAR FROM created_at) AS signup_year FROM users;
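When one column mixes several date formats, a single format mask is not enough. One possible approach, sketched here with Python's sqlite3 module, is to register a small helper that tries a list of candidate formats. The to_iso_date function, the KNOWN_FORMATS list, and the sample rows are all assumptions made for illustration; strftime('%Y', ...) stands in for EXTRACT(YEAR FROM ...), which SQLite lacks.

```python
import sqlite3
from datetime import datetime

# Candidate input formats, tried in order. This list is an assumption;
# an ambiguous value such as 03/04/2021 is read as day/month here.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(text):
    """Return the date as 'YYYY-MM-DD', or None if no known format matches."""
    if text is None:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


conn = sqlite3.connect(":memory:")
conn.create_function("to_iso_date", 1, to_iso_date)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, birthdate TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "1990-05-17"), (2, "17/05/1990"), (3, "19900517"), (4, "May 1990")],
)

cleaned = conn.execute(
    "SELECT id, to_iso_date(birthdate) FROM users ORDER BY id"
).fetchall()

# Extract the year from the cleaned value.
years = conn.execute(
    "SELECT id, CAST(strftime('%Y', to_iso_date(birthdate)) AS INTEGER) "
    "FROM users WHERE to_iso_date(birthdate) IS NOT NULL ORDER BY id"
).fetchall()
print(cleaned, years)
```

Values that match none of the formats come back as NULL, so they can be reviewed by hand rather than silently misread.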

9 Dealing with Outliers

Outliers skew analysis and need flagging.

Example:

-- Find users with unrealistic ages
SELECT * FROM users WHERE age < 0 OR age > 120;

-- Cap ages at realistic maximum
UPDATE users
SET age = 120
WHERE age > 120;
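Capping overwrites the original value, so an alternative is to flag suspicious rows and leave the raw data untouched. The sketch below uses Python's sqlite3 module; the age_suspect column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, 34), (2, -3), (3, 150), (4, None), (5, 120)],
)

# 'age_suspect' is a hypothetical flag column added for this sketch.
conn.execute("ALTER TABLE users ADD COLUMN age_suspect INTEGER NOT NULL DEFAULT 0")
with conn:
    conn.execute("UPDATE users SET age_suspect = 1 WHERE age < 0 OR age > 120")

flagged = conn.execute(
    "SELECT id, age FROM users WHERE age_suspect = 1 ORDER BY id"
).fetchall()

# Analyses can exclude flagged rows without losing the raw values.
avg_clean_age = conn.execute(
    "SELECT AVG(age) FROM users WHERE age_suspect = 0"
).fetchone()[0]
print(flagged, avg_clean_age)
```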

10 Aggregating and Deduplicating Data

Aggregating data helps summarize and clean large datasets.

Example:

-- Get most recent order per user
SELECT user_id, MAX(order_date) AS last_order
FROM orders
GROUP BY user_id;

Tip: Use GROUP BY with aggregates to deduplicate while summarizing.
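The query above returns only the date. When the whole row of the latest order is needed, the aggregate can be joined back to the table. This is a sketch using Python's sqlite3 module with made-up rows; it assumes order_date is stored as ISO 'YYYY-MM-DD' text, so that MAX() picks the latest date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, "2024-01-05", 20.0),
     (2, 10, "2024-03-12", 35.5),
     (3, 11, "2024-02-01", 12.0)],
)

# Join each user's MAX(order_date) back to the full rows.
latest_orders = conn.execute(
    "SELECT o.id, o.user_id, o.order_date, o.amount "
    "FROM orders AS o "
    "JOIN (SELECT user_id, MAX(order_date) AS last_order "
    "      FROM orders GROUP BY user_id) AS m "
    "  ON o.user_id = m.user_id AND o.order_date = m.last_order "
    "ORDER BY o.user_id"
).fetchall()
print(latest_orders)
```

If two orders of the same user share the latest date, both rows are returned, so a tie-breaker such as the highest id may still be needed.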

11 Writing Reusable Cleaning Scripts

Instead of cleaning manually each time, write stored procedures.

Example:

-- PostgreSQL 11+: a BEGIN ... END body requires LANGUAGE plpgsql
CREATE PROCEDURE clean_users()
LANGUAGE plpgsql
AS $$
BEGIN
UPDATE users SET email = LOWER(TRIM(email));
DELETE FROM users WHERE email IS NULL;
END;
$$;

-- Call procedure
CALL clean_users();
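Where stored procedures are not available (SQLite has none, for example), the same idea can be expressed as a function in application code. The following is a sketch using Python's sqlite3 module; the function name mirrors the procedure above, and the sample rows are invented. Both statements run in one transaction, and the function reports how many rows each step touched.

```python
import sqlite3


def clean_users(conn):
    """Normalize emails and drop rows without one; return (updated, deleted) counts."""
    with conn:  # one transaction: both steps apply, or neither does
        updated = conn.execute(
            "UPDATE users SET email = LOWER(TRIM(email)) "
            "WHERE email <> LOWER(TRIM(email))"
        ).rowcount
        deleted = conn.execute(
            "DELETE FROM users WHERE email IS NULL"
        ).rowcount
    return updated, deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "  Ann@Example.com "), (2, None), (3, "bob@example.com")],
)

first_run = clean_users(conn)
second_run = clean_users(conn)  # nothing left to clean
emails = [row[0] for row in conn.execute("SELECT email FROM users ORDER BY id")]
print(first_run, second_run, emails)
```

A second run that changes nothing is a useful property for a cleaning script, since it can then be scheduled repeatedly without side effects.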


12 Trimming Whitespace with TRIM, LTRIM, RTRIM

Purpose: Remove leading/trailing spaces.

Example:

-- Remove leading and trailing spaces from 'name'
SELECT TRIM(name) AS cleaned_name
FROM users;

-- Remove leading spaces from 'city' (RTRIM does the same for trailing spaces)
SELECT LTRIM(city) AS cleaned_city
FROM offices;

13 Standardizing Text with UPPER, LOWER, and INITCAP

Purpose: Ensure consistent capitalization.

Example:

-- Convert 'email' to lowercase
SELECT LOWER(email) AS cleaned_email
FROM subscribers;

-- Capitalize first letter of each word in 'city'
SELECT INITCAP(city) AS cleaned_city
FROM locations;

14 Extracting Substrings with SUBSTRING/SUBSTR

Purpose: Split or clean parts of a string.

Example:

-- Extract first 3 characters of a product code
SELECT SUBSTRING(product_code, 1, 3) AS category_id FROM products;

-- Isolate domain from emails (everything after the '@')
SELECT SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain FROM users;

15 Pattern Matching with LIKE and Regular Expressions

Purpose: Validate formats (e.g., emails, phone numbers).

Example:

-- Find invalid emails (missing '@')
SELECT email
FROM customers
WHERE email NOT LIKE '%@%';

-- PostgreSQL: find phone numbers that match the NNN-NNN-NNNN pattern
-- (use !~ instead of ~ to list the ones that do not match)
SELECT phone
FROM contacts
WHERE phone ~ '^\d{3}-\d{3}-\d{4}$';
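Regex support varies by engine. SQLite, for instance, parses the REGEXP operator but ships no implementation, so one has to be registered by the application. The sketch below does that with Python's sqlite3 and re modules; the regexp helper and the sample rows are assumptions for illustration, and the pattern is the one from the text.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's REGEXP operator: 'value REGEXP pattern' calls regexp(pattern, value)."""
    if value is None:
        return None  # keep SQL semantics: NULL in, NULL out
    return int(re.search(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [(1, "555-123-4567"), (2, "5551234567"), (3, "555-123-456x"), (4, None)],
)

PHONE_PATTERN = r"^\d{3}-\d{3}-\d{4}$"

valid_phones = conn.execute(
    "SELECT phone FROM contacts WHERE phone REGEXP ? ORDER BY id",
    (PHONE_PATTERN,),
).fetchall()

# For cleaning, the non-matching rows are usually the interesting ones.
invalid_phones = conn.execute(
    "SELECT phone FROM contacts WHERE phone NOT REGEXP ? ORDER BY id",
    (PHONE_PATTERN,),
).fetchall()
print(valid_phones, invalid_phones)
```

Rows with a NULL phone appear in neither list, so missing values still need a separate IS NULL check.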

16 Conditional Logic with CASE Statements

Purpose: Categorize or recode values.

Example:

-- Recode 'status' values for consistency
SELECT
CASE
WHEN status IN ('A', 'Active') THEN 'Active'
WHEN status IN ('I', 'Inactive') THEN 'Inactive'
ELSE 'Unknown'
END AS cleaned_status
FROM accounts;
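A recoding rule is only as good as its list of known values, so it is worth checking which raw values fall through to 'Unknown'. The sketch below uses Python's sqlite3 module with invented rows; applying UPPER(TRIM(...)) before the comparison is an addition of this sketch so that variants such as 'active ' are also recognized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO accounts (status) VALUES (?)",
    [("A",), ("active ",), ("I",), ("Inactive",), ("X",), (None,)],
)

# The recoding rule, normalized for case and surrounding spaces.
RECODE = (
    "CASE "
    "WHEN UPPER(TRIM(status)) IN ('A', 'ACTIVE') THEN 'Active' "
    "WHEN UPPER(TRIM(status)) IN ('I', 'INACTIVE') THEN 'Inactive' "
    "ELSE 'Unknown' END"
)

counts = conn.execute(
    f"SELECT {RECODE} AS cleaned_status, COUNT(*) "
    "FROM accounts GROUP BY cleaned_status ORDER BY cleaned_status"
).fetchall()

# Raw values the rule does not cover yet, including NULL.
unmapped = conn.execute(
    f"SELECT DISTINCT status FROM accounts WHERE {RECODE} = 'Unknown'"
).fetchall()
print(counts, unmapped)
```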


17 Aggregation with GROUP BY and HAVING for Validation

Purpose: Identify outliers or invalid groups.

Example:

-- Find orders whose total quantity is negative
SELECT order_id, SUM(quantity) AS total_qty
FROM order_items
GROUP BY order_id
HAVING SUM(quantity) < 0;
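The choice of aggregate in HAVING changes what counts as invalid. SUM(quantity) < 0 catches only orders whose total is negative, while MIN(quantity) < 0 catches any order that contains at least one negative line. The sketch below contrasts the two using Python's sqlite3 module and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, quantity INTEGER)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?)",
    [(1, 2), (1, 3),    # all lines positive
     (2, -1), (2, 5),   # one negative line, positive total
     (3, -4), (3, 1)],  # negative total
)

# Orders whose summed quantity is negative.
negative_total = conn.execute(
    "SELECT order_id FROM order_items "
    "GROUP BY order_id HAVING SUM(quantity) < 0 ORDER BY order_id"
).fetchall()

# Orders with at least one negative line item.
any_negative_line = conn.execute(
    "SELECT order_id FROM order_items "
    "GROUP BY order_id HAVING MIN(quantity) < 0 ORDER BY order_id"
).fetchall()
print(negative_total, any_negative_line)
```

Order 2 is missed by the SUM check because its positive line outweighs the negative one.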

18 Final Checklist Before Finishing Data Cleaning

Conclusion

Good SQL data cleaning practices ensure reliable, maintainable datasets. Mastering SQL functions like COALESCE(), LOWER(), TRIM(), SUBSTRING(), and EXTRACT() empowers you to keep your data sharp, clean, and ready for action!

Date posted: 29/04/2025, 17:02
