Introduction to Data Science
Data Science is the combined discipline of mathematics, statistics, programming, and domain expertise dedicated to extracting valuable insights from data. It transforms raw data into actionable knowledge by turning complex, unstructured information into clear, meaningful signals. As both an art and a science, Data Science enables businesses to make informed decisions, optimize operations, and uncover hidden patterns that drive growth.
Data Science is an interdisciplinary field focused on analyzing, modeling, and interpreting data to support decision-making and automation
Whether you’re tracking customer behavior on an e-commerce site, analyzing patient records in healthcare, or training self-driving cars, you’re witnessing the power of Data Science
1.2 The Evolution of Data Science
Data Science originates from classical statistics and database systems, but it has now evolved into a standalone discipline due to the explosive growth of digital data, technological advancements in computing, and the rise of machine learning technologies.
1962 – First mentions of “Data Analysis as a science”
1990s – Rise of business intelligence and data mining
2000s – Growth of web data and predictive modeling
2010s – Machine Learning, Big Data, and Deep Learning
2020s – Real-time analytics, MLOps, and AI integration
1.3 Why is Data Science Important?
Data is the new oil — but raw oil is useless without refinement. Data Science refines raw data into insights that can:
A retail company uses data science to:
Without these insights, the company might overstock unpopular items or miss sales opportunities
Data Science follows a structured workflow. Each step builds on the previous one, allowing you to move from a vague business question to concrete, data-driven decisions.
1 Problem Definition – Understand the real-world issue
2 Data Collection – Gather relevant data from databases, APIs, web, etc
3 Data Cleaning – Handle missing, incorrect, or inconsistent values
4 Exploratory Data Analysis (EDA) – Discover patterns and relationships
5 Modeling – Apply algorithms to predict or classify outcomes
6 Evaluation – Measure how well the model performs
7 Communication – Present insights with visuals and narratives
8 Deployment – Integrate into production (web apps, dashboards, etc.)
We'll cover each of these steps in detail in upcoming chapters
1.5 Key Components of Data Science
Data Science stands on three foundational pillars:
Mathematics & Statistics Understand patterns and probability in data
Programming (Python, R) Build models, automate analysis, handle data
Domain Expertise Contextualize data and interpret results accurately
Missing any one of these weakens the overall process. For example, a model might perform well technically but be useless if it doesn't solve a business problem.
1.6 Real-World Applications of Data Science
Data Science touches almost every industry. Here's how it's transforming different sectors:
Healthcare
Predict disease outbreaks using historical and geographic data
Diagnose diseases from imaging using machine learning (e.g., cancer detection)
Personalize treatment plans based on patient history and genetic data
Finance
Detect fraudulent transactions in real time
Predict credit risk using historical repayment data
Automate stock trading through algorithmic strategies
E-commerce
Recommend products based on browsing/purchase history
Analyze customer feedback through sentiment analysis
Transportation & Logistics
Predict demand for ride-sharing (e.g., Uber surge pricing)
Optimize delivery routes using traffic and weather data
Improve logistics efficiency and fuel savings
Social Media
Identify trending topics using NLP and clustering
Target ads based on user behavior and demographics
Detect fake news using text classification models
Agriculture
Monitor crop health with drone imagery
Predict yield based on climate, soil, and water data
Automate irrigation using IoT and real-time analytics
A Data Scientist is like a modern-day detective — they uncover patterns hidden in data. But the field is diverse, and different roles exist within the data science ecosystem.
Data Scientist Designs models and experiments, tells data-driven stories
Data Analyst Analyzes datasets, creates dashboards and reports
Data Engineer Builds and maintains data pipelines and infrastructure
ML Engineer Implements and deploys machine learning models
Business Analyst Bridges the gap between data and business decisions
Statistician Specializes in probability and statistical inference
Each role collaborates with others to complete the data science puzzle
1.8 Skills Required for Data Science
Being a data scientist requires a blend of technical, analytical, and communication skills. Here’s a breakdown:
Python, R – Core languages for DS
Pandas, NumPy – Data wrangling and computation
Scikit-learn, TensorFlow – Modeling and machine learning
Critical thinking and decision-making
💡 Tip: You don’t need to master everything at once. Build gradually, layer by layer.
1.9 Tools Used in Data Science
Your toolbox as a data scientist will evolve, but here are the essential categories and tools:
Python: The industry standard for DS and ML
R: Excellent for statistics-heavy workflows
SQL, MongoDB – Structured and unstructured databases
CSV, JSON, Parquet – Common data formats
Data visualization: matplotlib, seaborn, plotly
Machine learning: scikit-learn, xgboost, lightgbm
Deep learning: tensorflow, keras, pytorch
1.10 How Data Science Differs from Related Fields
It’s easy to confuse Data Science with related fields like AI, Machine Learning, or even traditional statistics. Here's a breakdown to help clarify:
🔍 Data Science vs Machine Learning
Data Science: the broader field; includes data cleaning, analysis, modeling, and communication; combines statistics, software engineering, and domain knowledge.
Machine Learning: a subset of Data Science; focuses on algorithms that learn patterns; primarily concerned with training models.
🧠 Data Science vs Artificial Intelligence (AI)
Data Science: works with real-world data to derive insights; data-focused; may or may not use AI.
Artificial Intelligence: builds systems that can mimic human intelligence; task-performance focused; AI may use Data Science to function (e.g., model training).
📊 Data Science vs Traditional Statistics
Data Science: practical, computational, works with large-scale data; uses tools like Python, R, Hadoop; focused on real-world applications.
Traditional Statistics: theoretical, focuses on inference from data; uses tools like SPSS, SAS, R; focused on sampling, distribution theory, etc.
1.11 The Ethics of Data Science
With great data comes great responsibility
Data Scientists must operate ethically, ensuring they do not cause harm through their work. Bias, misuse, and lack of transparency can have severe consequences.
Bias in Data: Models trained on biased datasets can reinforce discrimination
Privacy: Mishandling personal data (e.g., location, health, finances)
Transparency: Opaque black-box models make it hard to justify decisions
Manipulation: Using data to mislead people or influence opinions unethically
✅ Best Practice: Always ask — “Could this model harm someone?” and “Would I be okay if my data were used this way?”
1.12 Limitations and Challenges of Data Science
Data Science isn’t a magical solution. Here are common challenges:
Poor data quality (missing, noisy, inconsistent)
Data silos in large companies
Lack of stakeholder buy-in
Imagine a bank training a model on biased loan data. Even if the model is 95% accurate, it may reject many eligible applicants simply because past data reflected systemic bias.
1.13 The Future of Data Science
Data Science continues to evolve rapidly. Key future trends:
Automated Machine Learning (AutoML) – Non-experts can train strong models
Explainable AI (XAI) – Making models more interpretable
MLOps – Applying DevOps to ML pipelines for better collaboration and deployment
Synthetic Data – Generating fake but realistic data for testing or privacy
Edge Analytics – Real-time decision-making on devices (e.g., IoT)
Data Science is also converging with disciplines like blockchain, cloud computing, and robotics
By now, you should understand:
What Data Science is (and isn’t)
Key roles in the data science ecosystem
The workflow followed in most projects
Skills, challenges, and ethical considerations
This chapter has set the stage for what’s to come. From Chapter 2 onward, we’ll begin coding, cleaning, and exploring real datasets.
1 Define Data Science in your own words. How is it different from statistics and AI?
2 List 3 industries where data science is making a big impact. Explain how.
3 What are the main steps in a typical data science workflow?
4 Describe at least 5 roles related to data science and what they do
5 Identify 3 challenges in data science and how you might solve them
6 Explain the ethical risks of using biased data to train a machine learning model
7 What is the role of domain knowledge in a successful data science project?
8 Why is Python so popular in the data science ecosystem?
9 Give an example where Data Science could go wrong due to poor communication
10 What trends do you think will shape the next decade of Data Science?
Setting Up the Python Environment
Before we dive into coding or analysis, we must properly set up our Python environment for Data Science. Think of this like preparing your lab before running experiments — without the right tools and a clean workspace, you can’t perform high-quality work.
In this chapter, we'll guide you through:
Using Anaconda and virtual environments
Managing packages with pip and conda
Working in Jupyter Notebooks and VS Code
Organizing your data science projects for real-world scalability
2.1 Installing Python
Python is a general-purpose programming language often used in data science for data manipulation, statistical modeling, and machine learning, due to its clean syntax and robust ecosystem
There are two common ways to install Python:
1 Visit https://www.python.org/downloads/
2 Download the latest version (e.g., Python 3.12.x)
3 During installation:
   ✅ Check the box: “Add Python to PATH”
   ✅ Choose "Customize installation" → enable pip and IDLE
To verify, open a terminal or command prompt and type: python --version
Anaconda is a Python distribution that includes:
Conda (package and environment manager)
Hundreds of data science libraries (NumPy, Pandas, etc.)
Why use Anaconda? Because it solves library compatibility issues and simplifies package/environment management.
1 Visit https://www.anaconda.com/products/distribution
2 Download the installer (choose Python 3.x)
3 Run the installer and follow the prompts
4 Open the Anaconda Navigator or Anaconda Prompt
To verify: conda --version and python --version
Now that Python is installed, we need a place to write and run code. Let’s compare a few popular environments for data science.
📓 Jupyter Notebook
Exportable as HTML, PDF, etc.
To launch it: jupyter notebook
Use this when doing EDA (Exploratory Data Analysis) or developing models step by step
💻 VS Code (Visual Studio Code)
While Jupyter is great for analysis, VS Code is better for organizing larger projects and production-ready scripts
Extensions for Jupyter, Python, Docker
Great for version-controlled data science workflows
Install the Python extension in VS Code for best performance
2.4 Virtual Environments: Why and How
Different projects often need different — and sometimes conflicting — versions of the same libraries, and installing everything globally quickly leads to breakage. That’s why we use virtual environments.
A virtual environment is an isolated Python environment that allows you to install specific packages and dependencies without affecting your global Python setup or other projects
✅ Benefits of Using Virtual Environments
2.4.1 Creating a Virtual Environment (Using venv)
1 Open your terminal or command prompt
2 Navigate to your project folder: cd my_project
3 Create the environment: python -m venv env
4 Activate the environment:
   Windows: .\env\Scripts\activate
   Mac/Linux: source env/bin/activate
You should now see (env) in your terminal, which confirms it's active
5 Install your libraries: pip install pandas numpy matplotlib
2.4.2 Creating Environments with Conda (Recommended)
If you use Anaconda, you can use conda environments, which are more powerful than venv:
conda create --name ds_env python=3.11
conda activate ds_env
Then install: conda install pandas numpy matplotlib scikit-learn
You can list all environments: conda env list
2.5 pip vs conda: Which to Use?
Both are package managers, but they have differences:
Language support: pip – Python only; conda – Python, R, C, etc.
Speed and dependencies: pip is faster but can break dependencies; conda is slower but resolves dependencies better.
Binary packaging: pip – limited; conda – full binary support.
Best practice: Use conda when using Anaconda. Use pip when outside Anaconda or when conda doesn't support a package.
2.6 Managing Project Structure: Professionalism from Day 1
Now that you're coding with isolated environments, let’s structure your projects for clarity and scalability
Here’s a typical Data Science project folder layout:
✅ This structure separates raw data, notebooks, source code, and outputs
🧠 Tools to Help You Stay Organized
requirements.txt: Tracks pip-installed packages pip freeze > requirements.txt
environment.yml: For Conda-based projects conda env export > environment.yml
These files are essential for reproducibility, especially when sharing your project or collaborating in teams
2.7 Essential Python Libraries for Data Science
Let’s explore the core Python libraries that every data scientist must know
NumPy (short for Numerical Python) is a fundamental package for scientific computing. It offers powerful N-dimensional array objects and broadcasting operations for fast numerical processing.
Faster operations than native Python lists
Core of most data science libraries (including Pandas, Scikit-learn, etc.)
Supports matrix operations, statistical computations, and linear algebra
🔨 Basic Usage
import numpy as np
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]])
print(a.mean())   # Output: 2.0
print(b.shape)    # Output: (2, 2)
Linear algebra in ML algorithms
2.7.2 Pandas – Data Analysis and Manipulation
Pandas is a fast, powerful, flexible library for data analysis and manipulation, built on top of NumPy.
Built for tabular data (like spreadsheets)
Easy to read CSV, Excel, SQL files
Provides DataFrames, which are central to data wrangling
🔨 Basic Usage
import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head())       # First 5 rows
print(df.describe())   # Summary statistics
Matplotlib is a comprehensive library for creating static, animated, and interactive plots in Python.
Compatible with Pandas and NumPy
🔨 Basic Usage
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Simple Line Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.show()
Line charts, histograms, scatter plots
Seaborn is a high-level data visualization library built on top of Matplotlib. It provides an interface for drawing attractive and informative statistical graphics.
Cleaner, more informative visuals than Matplotlib
Built-in themes and color palettes
Easily works with Pandas DataFrames
🔨 Basic Usage
import seaborn as sns
import pandas as pd
df = pd.read_csv("iris.csv")
sns.pairplot(df, hue="species")
2.7.5 Scikit-learn – Machine Learning in Python
Scikit-learn is the most widely used library for building machine learning models in Python. It includes tools for classification, regression, clustering, and model evaluation.
Robust set of ML algorithms and utilities
🔨 Basic Usage
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[['feature1', 'feature2']]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LinearRegression()
model.fit(X_train, y_train)
Mastering how to install, update, and manage Data Science libraries is crucial for an efficient workflow, especially when collaborating with teams or working across multiple machines. Proper management ensures that your projects run smoothly with the right versions of essential tools, minimizing compatibility issues.
2.8.1 Installing Packages with pip
pip is Python’s default package installer. It's simple and widely used.
🧠 Example: Installing NumPy and Pandas pip install numpy pandas
To install a specific version: pip install pandas==1.5.3
To upgrade a package: pip install --upgrade matplotlib
To uninstall a package: pip uninstall seaborn
To list all installed packages: pip list
To freeze the current environment for sharing: pip freeze > requirements.txt
2.8.2 Installing with conda (Anaconda Users)
conda is the package manager that comes with Anaconda. It’s especially useful when managing libraries that have C or Fortran dependencies (e.g., NumPy, SciPy).
🧠 Examples conda install numpy pandas conda install -c conda-forge seaborn
To create a requirements file (environment config): conda env export > environment.yml
To recreate an environment: conda env create -f environment.yml
By default, Jupyter Notebooks are functional but plain. To make your notebooks more effective and beautiful:
Use Markdown to write human-readable explanations, headings, and lists directly in your notebooks
You can also embed equations using LaTeX:
The jupyter_contrib_nbextensions package lets you enable time-saving plugins.
Install it:
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
Activate desired extensions in the browser interface
To make your notebooks visually appealing:
pip install jupyterthemes
jt -t monokai
You can customize fonts, cell width, and background colors
As your projects grow, managing versions becomes critical. Git allows you to track changes, collaborate with others, and roll back when things break.
Git is a distributed version control system that tracks changes to your code and lets you collaborate with others using repositories
1 Initialize a repository: git init
2 Track changes: git add . followed by git commit -m "Initial commit"
3 Push to GitHub: git remote add origin <your-repo-url> then git push -u origin master
Git is vital when working on data science projects in a team or deploying models into production
2.11 Optional Tools: Docker and Virtual Workspaces
If you want to ensure that your project runs exactly the same on every machine (including servers), consider Docker
Docker packages your environment, code, and dependencies into a container — a self-contained unit that runs the same anywhere
RUN pip install -r requirements.txt
Platforms like Google Colab and Kaggle Kernels provide cloud-based Jupyter notebooks with free GPUs and pre-installed libraries
Beginners who don’t want to set up local environments
Running large computations in the cloud
2.12 Putting It All Together: A Real-World Setup Example
Starting a new data science project from scratch shows how the pieces of this chapter come together in a practical workflow. The following realistic scenario walks through the process step by step.
You are starting a project to analyze customer churn using machine learning. You want to:
Structure your project folders cleanly
Step 1: Create Your Project Folder mkdir customer_churn_analysis cd customer_churn_analysis
Step 2: Set Up a Virtual Environment
python -m venv env
source env/bin/activate   # or .\env\Scripts\activate on Windows
Step 3: Install Your Libraries pip install numpy pandas matplotlib seaborn scikit-learn jupyter
Step 4: Freeze Your Environment pip freeze > requirements.txt
Step 5: Initialize Git
git init
echo "env/" >> .gitignore
git add .
git commit -m "Initial setup"
Step 6: Create a Clean Folder Structure customer_churn_analysis/
├── data/ # Raw and processed data
├── src/ # Python scripts (cleaning, modeling)
├── env/ # Virtual environment (excluded from Git)
2.13 Best Practices for Managing Environments
✅ Use one environment per project
✅ Always track dependencies (requirements.txt or environment.yml)
✅ Use version control (Git) from the beginning
✅ Separate raw data from processed data
✅ Use virtual environments even if you're on Colab or Jupyter (when possible)
Python Environment – Core Python + pip, or Anaconda
Virtual Environments – Isolate project dependencies
pip vs conda – Package management options
Library Installation – Install with pip or conda
Essential Libraries – NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn
Project Structure – Professional and organized
Git – Track changes and collaborate
Docker & Workspaces – Reproducibility and scalability
1 Set up a local virtual environment and install the following libraries:
numpy, pandas, matplotlib, seaborn, scikit-learn
2 Create a Jupyter notebook in a project folder and:
Load a sample CSV file using Pandas
Plot a histogram and a scatter plot
3 Create and commit your project using Git:
Write a README.md explaining the project purpose
Add a .gitignore to exclude your env/ folder
4 Optional: Export your environment to a requirements.txt file and test re-creating it on a different machine or in a new folder
5 Explore one new Python library from this list (choose one):
Python Programming Essentials for Data Science
This chapter establishes the essential programming foundation for mastering data science with Python. Revisiting Python's core concepts, even for those with prior experience, enables you to write cleaner, more efficient, and scalable code tailored to data analysis. Strengthening these fundamentals will enhance your ability to develop robust data science solutions and set the stage for the advanced topics ahead.
3.1 Why Learn Python Fundamentals for Data Science?
Even though Python has a wide range of data science libraries, you still need to master the basics because:
Libraries like Pandas and Scikit-learn are built on core Python features (e.g., loops, lists, dictionaries)
Data pipelines and transformations often require custom Python code
Debugging, writing efficient functions, and building clean scripts requires a solid foundation in Python logic
This chapter focuses on practical programming — the kind you’ll actually use when cleaning data, writing algorithms, or building models
3.2 Variables, Data Types, and Basic Operations
Python is a dynamically typed language — you don’t have to declare variable types explicitly
3.2.1 Variables
name = "Alice"
age = 30
height = 5.5
is_data_scientist = True
Python automatically understands the type of each variable
Type – Example – Description
int – 10 – Integer number
float – 3.14 – Decimal number
str – "Data" – String/text
bool – True, False – Boolean value
list – [1, 2, 3] – Ordered collection
dict – {"key": "value"} – Key-value pairs
3.2.3 Type Conversion
x = "100"
y = int(x)    # Converts string to integer
z = float(y)  # Converts integer to float
Comparison operators: ==, !=, >, <, >=, <=
An if/elif/else example (x is assumed to hold a number):
x = 12
if x > 10:
    print("Greater than 10")
elif x == 10:
    print("Equal to 10")
else:
    print("Less than 10")
For Loop
for i in range(5):
    print(i)
While Loop
count = 0
while count < 5:
    print(count)
    count += 1
break: exits the loop entirely
continue: skips the current iteration
pass: does nothing (used as a placeholder) — a short example of all three follows
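A minimal illustration of the three statements together (the numbers are arbitrary):
for i in range(10):
    if i == 3:
        continue   # skip 3 and move to the next iteration
    if i == 6:
        break      # stop the loop entirely at 6
    if i == 5:
        pass       # placeholder: nothing special happens for 5
    print(i)       # prints 0, 1, 2, 4, 5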
A more Pythonic way to write loops:
squares = [x**2 for x in range(5)]
print(squares)  # Output: [0, 1, 4, 9, 16]
This is widely used in data preprocessing pipelines and feature engineering tasks
Functions allow you to encapsulate logic and reuse it — a must-have for clean and maintainable data science scripts
3.4.1 Defining a Function
def greet(name):
    return f"Hello, {name}!"

print(greet("Lukka"))
3.4.2 Parameters and Return Values
def add(x, y):
    return x + y

You can return multiple values:
def get_stats(numbers):
    return min(numbers), max(numbers), sum(numbers)/len(numbers)
These are anonymous functions, used in one-liners or where short logic is needed:
square = lambda x: x**2
print(square(5))  # Output: 25
Often used with map(), filter(), and apply() in Pandas
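As a short illustration of that pattern (the DataFrame, column names, and 18% tax rate below are made up for the example):
import pandas as pd

nums = [1, 2, 3, 4]
doubled = list(map(lambda x: x * 2, nums))        # [2, 4, 6, 8]
evens = list(filter(lambda x: x % 2 == 0, nums))  # [2, 4]

df = pd.DataFrame({"price": [100, 250, 75]})                   # illustrative data
df["price_with_tax"] = df["price"].apply(lambda p: p * 1.18)   # assumed tax rate
print(df)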
Understanding Python's built-in data structures is critical for data science work — whether you're parsing data, storing intermediate results, or constructing data pipelines
A list is an ordered, mutable collection of elements. Elements can be of any data type.
fruits = ["apple", "banana", "cherry"]
Key Operations:
fruits[0]                 # Access
fruits.append("kiwi")     # Add
fruits.remove("banana")   # Remove
fruits.sort()             # Sort
len(fruits)               # Length
A tuple is like a list, but immutable (cannot be changed after creation).
dimensions = (1920, 1080)
Used when you want to protect the integrity of data (e.g., coordinates, feature shapes in ML)
Dictionaries store data in key-value pairs — extremely useful in data science for mapping, grouping, or storing metadata.
person = {"name": "Lukka", "age": 25}
Common Operations:
person["name"]             # Access
person["city"] = "Mumbai"  # Add new key
del person["age"]          # Delete key
list(person.keys())        # All keys
list(person.values())      # All values
A set is an unordered collection of unique elements.
unique_tags = set(["data", "science", "data", "python"])
print(unique_tags)  # {'python', 'data', 'science'}
Set operations: union, intersection, difference
set1 = {1, 2, 3}
set2 = {3, 4, 5}
set1 & set2  # Intersection → {3}
Since a large portion of data science involves textual data (column names, logs, labels), mastering Python strings is a must
3.6.1 String Basics
text = "Data Science"
text.lower()                # "data science"
text.upper()                # "DATA SCIENCE"
text.replace("Data", "AI")  # "AI Science"
3.6.2 String Indexing and Slicing
text[0]    # 'D'
text[:4]   # 'Data'
text[-1]   # 'e'
Modern and readable way to embed variables in strings:
name = "Lukka"
score = 95.5
print(f"{name} scored {score} in the test.")
.split() – Break a string into a list
.join() – Combine a list into a string (see the example below)
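A quick example of both methods (the sample string is made up):
text = "data,science,python"
parts = text.split(",")      # ['data', 'science', 'python']
joined = " | ".join(parts)   # 'data | science | python'
print(parts)
print(joined)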
Reading from and writing to files is an essential skill, especially for working with datasets
3.7.1 Reading and Writing Text Files
# Writing
with open("notes.txt", "w") as f:
    f.write("Hello Data Science!")

# Reading
with open("notes.txt", "r") as f:
    content = f.read()
print(content)
3.7.2 Reading and Writing CSV Files
This is a very common file type in data science.
import csv

# Writing CSV
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Age"])
    writer.writerow(["Lukka", 25])

# Reading CSV
with open("data.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
JSON (JavaScript Object Notation) is a popular format for structured data.
import json
data = {"name": "Lukka", "score": 95}

# Write JSON
with open("data.json", "w") as f:
    json.dump(data, f)

# Read JSON
with open("data.json", "r") as f:
    result = json.load(f)
print(result)
In data science, errors are common — especially with messy data. Handling errors gracefully helps build robust and reliable pipelines.
Exceptions are runtime errors that disrupt normal execution
Use try-except to catch and handle exceptions:
try:
    result = 10 / 0
except ZeroDivisionError:
    print("You can't divide by zero!")
You can catch multiple errors:
try:
    value = int("abc")
except (ValueError, TypeError) as e:
    print("Error:", e)
else runs if no exception occurs
finally runs no matter what:
try:
    f = open("mydata.csv")
except FileNotFoundError:
    print("File not found.")
else:
    print("File opened successfully.")
finally:
    print("Finished file operation.")
3.9 Comprehensions for Efficient Data Processing
Short-hand for creating lists:
squares = [x**2 for x in range(10)]
With conditions:
even_squares = [x**2 for x in range(10) if x % 2 == 0]
3.9.2 Dictionary Comprehensions
names = ["Alice", "Bob", "Charlie"]
lengths = {name: len(name) for name in names}
3.9.3 Set Comprehensions
nums = [1, 2, 2, 3, 4, 4, 5]
unique_squares = {x**2 for x in nums}
Comprehensions are critical in Pandas when using apply() or transforming datasets inline
3.10 Modular Code: Functions, Scripts, and Modules
Instead of rewriting code:
def clean_name(name):
    return name.strip().title()
This improves code readability, testability, and debugging
You can save reusable code in .py files.
Example: helpers.py
def square(x):
    return x * x

You can import this into another file:
from helpers import square
print(square(5))  # Output: 25
Python includes many standard modules:
import math
print(math.sqrt(16))  # Output: 4.0
import random
print(random.choice(["a", "b", "c"]))
Any .py file can become a module. Group reusable functions or helpers like this:
# file: utils.py
def is_even(n):
    return n % 2 == 0

Usage:
import utils
print(utils.is_even(10))  # True
In real data science projects, messy code is a major bottleneck. Writing clean code helps with:
Variables: lowercase_with_underscores (e.g., customer_age)
Functions: lowercase_with_underscores (e.g., calculate_mean())
# Calculate average of a list
def avg(numbers):
    """Returns the average of a list of numbers."""
    return sum(numbers) / len(numbers)
Use comments sparingly but helpfully
Don’t repeat code (DRY principle)
Avoid long functions (split logic)
Linters such as pylint and flake8 detect errors and enforce conventions, and black auto-formats code:
pip install flake8 black
3.12 Object-Oriented Programming (OOP) in Python
While not always required, object-oriented programming (OOP) can help you build scalable, reusable, and clean code—especially in larger data projects, simulations, or machine learning model pipelines
OOP is a programming paradigm based on the concept of “objects” — which are instances of classes. A class defines a blueprint, and an object is a real implementation.
3.12.2 Creating a Class
class Person:
    def __init__(self, name, age):  # Constructor
        self.name = name
        self.age = age

    def greet(self):
        return f"Hello, I’m {self.name} and I’m {self.age} years old."

# Creating an object
p1 = Person("Lukka", 25)
print(p1.greet())
self refers to the instance of the class
Extending classes lets you reuse and customize functionality. For example, a subclass like DataScientist(Person) can add specific attributes such as a skill and override methods to suit the specialized role. Inheritance promotes code reuse, modularity, and easier maintenance: the new class inherits everything from the parent while adding its own features.
A reconstructed version of that subclass (the skill attribute and greeting text are illustrative):
class DataScientist(Person):
    def __init__(self, name, age, skill):
        super().__init__(name, age)   # reuse Person's constructor
        self.skill = skill

    def greet(self):
        return f"Hello, I’m {self.name} and I specialize in {self.skill}."

ds = DataScientist("Lukka", 25, "Machine Learning")
print(ds.greet())
3.12.4 Why OOP in Data Science?
Helps model data in ML projects (e.g., building pipeline classes)
Supports custom data transformers and encoders
Useful for simulations or modeling systems
These are advanced Python features that allow you to process large datasets efficiently and lazily
Anything that can be looped over using for is iterable.
nums = [1, 2, 3]
it = iter(nums)
print(next(it))  # 1
print(next(it))  # 2
You can create custom iterators with classes by defining __iter__() and __next__()
Generators simplify writing iterators using yield:
def countdown(n):
    while n > 0:
        yield n
        n -= 1

for x in countdown(5):
    print(x)
Memory-efficient (doesn’t store all values in memory)
Ideal for large file streaming, web scraping, etc
Python is great for writing scripts that automate repetitive data tasks
To read and clean a CSV file, use pandas to load the data with pd.read_csv("sales.csv"). Standardize column names by stripping whitespace, converting to lowercase, and replacing spaces with underscores using a list comprehension. Remove rows with missing values via df.dropna(inplace=True), then save the cleaned data to a new file with df.to_csv("cleaned_sales.csv", index=False).
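A minimal sketch of those steps, assuming a file named sales.csv exists in the working directory:
import pandas as pd

df = pd.read_csv("sales.csv")                    # load the raw file (assumed to exist)

# Standardize column names: strip spaces, lowercase, use underscores
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

df.dropna(inplace=True)                          # drop rows with missing values

df.to_csv("cleaned_sales.csv", index=False)      # save the cleaned copy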
You can automate the merging of multiple CSV files in a folder using the os and pandas libraries. First, list all files in the target directory with os.listdir("data/"). Then initialize an empty DataFrame to hold the combined data. Loop through each file and, if it ends with ".csv", read its contents into a DataFrame and concatenate it with the main DataFrame. Finally, save the merged dataset as "merged.csv" with to_csv().
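A sketch of that merge loop, assuming a data/ folder that contains only compatible CSV files:
import os
import pandas as pd

combined = pd.DataFrame()                          # start with an empty frame

for filename in os.listdir("data/"):               # assumed folder of CSVs
    if filename.endswith(".csv"):
        part = pd.read_csv(os.path.join("data/", filename))
        combined = pd.concat([combined, part], ignore_index=True)

combined.to_csv("merged.csv", index=False)         # write the merged dataset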
Use Task Scheduler (Windows) or cron (Linux/macOS) to schedule Python scripts to run daily or weekly
0 6 * * * /usr/bin/python3 /home/user/clean_data.py
3.15 Working with Dates and Time
Python has multiple ways to handle timestamps — critical in time series and log data
3.15.1 datetime Module
from datetime import datetime
now = datetime.now()
print(now)  # Current time
formatted = now.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)
3.15.2 Parsing Strings to Dates
date_str = "2025-05-18"
date_obj = datetime.strptime(date_str, "%Y-%m-%d")
You can also perform date arithmetic:
from datetime import timedelta
tomorrow = now + timedelta(days=1)
3.15.3 Time with Pandas
df['date'] = pd.to_datetime(df['date_column'])
df = df.set_index('date')
monthly_avg = df.resample('M').mean()
Time-based grouping and rolling averages are key in time series analysis
As data science projects grow, keeping your code modular and organized becomes essential for long-term success
Use a clean folder layout to separate scripts, notebooks, data, and outputs
Put frequently used functions (e.g., cleaning functions) into reusable Python files
Use Python's logging module to monitor your script's behavior, especially during long or automated runs. Configure the logging level with logging.basicConfig(level=logging.INFO) and emit messages such as logging.info("Process started") to keep a clear record of the script's execution for easier debugging and maintenance.
You can log to a file: logging.basicConfig(filename="process.log", level=logging.INFO)
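A small sketch of the basic pattern; the filename, format string, and messages are illustrative:
import logging

# Write INFO-level (and above) messages to a log file
logging.basicConfig(filename="process.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

logging.info("Process started")
logging.warning("Missing values found in column 'age'")  # example warning
logging.info("Process finished")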
3.17 Virtual Environments and Dependency Management
For real-world projects, always isolate dependencies using virtual environments
Avoid polluting global Python setup
3.17.2 Creating and Using a Virtual Environment
python -m venv venv

# Windows
.\venv\Scripts\activate

# macOS/Linux
source venv/bin/activate
Install dependencies inside the environment: pip install pandas numpy
Freeze current environment: pip freeze > requirements.txt
Install from a file: pip install -r requirements.txt
3.18 Intro to Python Libraries for Data Science
Python’s power in data science comes from its ecosystem of libraries Here's a quick primer
Extremely fast (C-optimized):
import numpy as np
arr = np.array([1, 2, 3])
print(arr.mean())
Filtering, grouping, transforming, etc.:
import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head())
3.18.3 Matplotlib & Seaborn – Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(df["sales"])
plt.show()
3.18.4 Scikit-Learn – Machine Learning
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
We’ll explore all of these in much greater detail in later chapters
Let’s review the key ideas you’ve learned in this chapter:
✅ Python basics (variables, types, control flow)
✅ Functions, modules, and reusable code
✅ Error handling and writing clean scripts
✅ Working with files (CSV, JSON)
✅ Virtual environments and project setup
Test your understanding with these exercises
Write a function that accepts a list of numbers and returns the mean
Write a script that reads a file called data.txt, cleans it by removing empty lines, and saves a cleaned version to clean_data.txt
Use a list comprehension to create a list of squares for numbers divisible by 3 from 0 to 30
Create a class Student with attributes name, grades, and a method average() that returns the mean of grades
Create a sample directory structure for a sales forecasting project and place:
a CSV file in data/raw/
a Python script in scripts/ that loads and cleans the data
Data Cleaning and Preprocessing
Data cleaning is a crucial step in the data science pipeline, as raw datasets from real-world sources often contain missing values, incorrect data types, duplicates, and inconsistencies that hinder accurate analysis. Effective data cleaning ensures data integrity, enabling valid insights and robust model performance. Since it can take up a significant portion of a data scientist’s time, mastering efficient cleaning techniques is essential for successful data analysis and machine learning projects.
Data cleaning and preprocessing are essential steps in any data-driven project: neglecting this phase can lead to inaccurate conclusions, poor model generalization, and wasted effort downstream. Properly preparing your data ensures the reliability and effectiveness of your models.
Understanding and Handling Missing Data
Missing data is a common challenge in raw datasets, often caused by data entry errors, sensor failures, skipped survey responses, or data corruption. Understanding the underlying mechanism of missingness is crucial for selecting the appropriate method to address these gaps. Proper handling of missing data improves the accuracy of insights derived from the dataset.
There are three primary types of missing data:
Missing Completely at Random (MCAR): The probability of missingness is the same for all observations. The absence of data does not depend on any observed or unobserved variable.
Missing at Random (MAR): The missingness is related to other observed data but not the missing data itself. For instance, income might be missing more often among younger participants.
Missing Not at Random (MNAR): The likelihood of data being missing is related to the actual missing value itself. For example, individuals with extremely high incomes may be more inclined to omit this information.
Understanding which type of missingness is present helps inform whether deletion or imputation is appropriate and what kind of imputation is most defensible
Detecting Missing Values in Python
Using the pandas library, you can quickly identify missing values in a dataset. After loading the data with pd.read_csv(), call isnull() combined with sum() to display the number of missing values per column. This quick quality check makes it easier to handle incomplete data before further analysis.
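A minimal sketch of that check; the filename customers.csv is hypothetical:
import pandas as pd

df = pd.read_csv("customers.csv")     # hypothetical dataset

print(df.isnull().sum())              # missing-value count per column
print(df.isnull().mean() * 100)       # percentage missing per column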
To determine whether the dataset contains any missing values at all: print(df.isnull().values.any())
Strategies for Handling Missing Data
The approach to managing missing data depends on the nature of the dataset and the amount of missingness
When missing data is minimal or confined to non-essential records or features, it can be handled by removing the affected rows or columns. Use df.dropna(inplace=True) to delete rows containing any missing values, or df.dropna(axis=1, inplace=True) to drop columns with missing data. This keeps the dataset clean while preserving as much valuable information as possible.
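In code, the two options look like this; the thresh variant at the end is an added illustration, not from the original text:
# Drop rows that contain any missing value
df.dropna(inplace=True)

# Or drop entire columns that contain missing values
df.dropna(axis=1, inplace=True)

# Gentler variant: keep a column only if at least half its values are present
df.dropna(axis=1, thresh=len(df) // 2, inplace=True)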
When deletion would lead to significant information loss, imputing missing values can be a viable alternative
Mean or Median Imputation:
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Salary'].fillna(df['Salary'].median(), inplace=True)
Mode Imputation (suitable for categorical data):
df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
Forward Fill and Backward Fill:
These are useful for time-series data where previous or subsequent values can logically replace missing ones.
df.fillna(method='ffill', inplace=True)
df.fillna(method='bfill', inplace=True)
For example, setting missing locations to 'Unknown':
df['Location'].fillna('Unknown', inplace=True)
For more advanced cases, statistical models such as K-nearest neighbors (KNN) imputation, regression-based imputation, and multiple imputation can handle substantial missing data. These techniques leverage relationships within the dataset to produce more realistic replacement values, helping preserve the validity of insights derived from incomplete data.
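As a hedged illustration, scikit-learn's KNNImputer can fill numeric gaps using values from similar rows; the column names and neighbor count here are hypothetical:
from sklearn.impute import KNNImputer

num_cols = ["Age", "Salary"]            # hypothetical numeric columns
imputer = KNNImputer(n_neighbors=5)     # impute from the 5 most similar rows
df[num_cols] = imputer.fit_transform(df[num_cols])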
In raw datasets, inconsistent or incorrect data types — such as dates stored as plain strings or numerical values saved as text — can cause errors during data transformation and analysis. Ensuring proper data types is essential for accurate processing, and standardizing formats up front prevents many common issues.
The dtypes attribute in pandas helps identify data types of each column: print(df.dtypes)
Converting columns to appropriate data types is straightforward but crucial
Converting Strings to DateTime: df['JoinDate'] = pd.to_datetime(df['JoinDate'])
Converting Strings to Numeric Values: df['Revenue'] = pd.to_numeric(df['Revenue'], errors='coerce')
Reducing memory usage and ensuring proper encoding of categorical data:
df['Membership'] = df['Membership'].astype('category')
Regularly validating data types can prevent subtle bugs in the analysis and model training phases. In production environments, automated checks are often implemented to enforce schema consistency.
Detecting and Removing Duplicate Records
Duplicate records can negatively impact analysis by over-representing specific entries, resulting in biased insights and inaccurate statistics. Duplicates often stem from multiple data collection systems, accidental re-entries, or logging errors. Eliminating them is essential for accurate, reliable, and unbiased results.
Pandas provides simple yet powerful methods for detecting and eliminating duplicate rows
To detect duplicates: duplicate_rows = df[df.duplicated()] print(f"Number of duplicate rows: {len(duplicate_rows)}")
To remove duplicates: df.drop_duplicates(inplace=True)
Duplicates can also be checked based on specific columns by passing them as arguments: df.drop_duplicates(subset=['CustomerID', 'Email'], inplace=True)
It is always advisable to verify whether duplicates are truly erroneous before removal. In some cases, repeated entries may represent legitimate recurring transactions or events.
Outliers are data points that fall significantly outside the normal range of a dataset. They can represent genuine variation or result from errors such as data entry mistakes, equipment malfunctions, or system bugs. Identifying and investigating outliers matters because they can distort analysis outcomes, and distinguishing true outliers from errors determines whether they should be corrected, excluded, or kept.
Outliers can be visualized using plots:
Box Plot:
import matplotlib.pyplot as plt
df.boxplot(column='AnnualIncome')
plt.show()
Histogram:
df['AnnualIncome'].hist(bins=50)  # bin count illustrative
plt.show()
Outliers can also be identified using the Z-score, which measures how many standard deviations a data point lies from the mean. Using scipy.stats, compute df['zscore'] = zscore(df['AnnualIncome']); values with a Z-score greater than 3 or less than -3 are commonly treated as outliers, e.g., outliers = df[(df['zscore'] > 3) | (df['zscore'] < -3)]. This provides a simple statistical approach to outlier detection.
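A condensed version of that filter, following the column name and thresholds described above:
from scipy.stats import zscore

df["zscore"] = zscore(df["AnnualIncome"])     # standard deviations from the mean
outliers = df[(df["zscore"] > 3) | (df["zscore"] < -3)]
print(len(outliers), "potential outliers")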
This is a robust method commonly used in practice
Q1 = df['AnnualIncome'].quantile(0.25)
Q3 = df['AnnualIncome'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['AnnualIncome'] < Q1 - 1.5 * IQR) |
              (df['AnnualIncome'] > Q3 + 1.5 * IQR)]
Options for dealing with outliers include:
Removal: Only if the outlier is known to be an error or is unrepresentative of the population
Capping (Winsorizing): Replace extreme values with the nearest acceptable values
Transformation: Apply mathematical functions (log, square root) to compress the scale of outliers
Segmentation: Treat outliers as a separate category or analyze them separately if they are meaningful
Each strategy should be used with caution, ensuring the data's integrity and the relevance of the outliers to the business context
Cleaning and Normalizing Text Data
Handling textual data like user inputs, product names, or addresses requires addressing formatting inconsistencies that can hinder analysis. Key issues include variations in casing, special characters, trailing spaces, and inconsistent encoding. Normalizing text properly is essential for accurate processing and reliable matching.
# Convert to lowercase df['City'] = df['City'].str.lower()
# Strip leading/trailing whitespaces df['City'] = df['City'].str.strip()
# Remove special characters using regex df['City'] = df['City'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
Proper cleaning ensures uniformity and helps avoid false distinctions between values that are effectively the same, e.g., "New York", "new york", and " New York "
Most machine learning algorithms cannot handle categorical variables in raw string form. These variables must be encoded into a numerical format.
Label Encoding — suitable for ordinal variables where the categories have an inherent order:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['EducationLevel'] = le.fit_transform(df['EducationLevel'])
One-Hot Encoding — ideal for nominal variables (no inherent order); this creates binary columns for each category:
df = pd.get_dummies(df, columns=['Gender', 'Region'])
To prevent the dummy variable trap in linear models, the dummy variables must not be perfectly collinear: the trap occurs when one dummy column can be linearly predicted from the others, leading to multicollinearity. The common solution is to drop one dummy column during encoding, which pandas' get_dummies supports via drop_first=True. For example, df = pd.get_dummies(df, columns=['Region'], drop_first=True) avoids the trap and improves model stability.
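A short sketch of the drop_first pattern just described:
import pandas as pd

# One-hot encode Region but drop the first category to avoid perfect collinearity
df = pd.get_dummies(df, columns=["Region"], drop_first=True)
print(df.columns)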
For large cardinality categorical variables (e.g., thousands of product IDs), dimensionality reduction or embedding techniques are often considered instead
Many machine learning algorithms, including K-Nearest Neighbors, Support Vector Machines, and gradient-descent-based models, are highly sensitive to the scale of input features. When one feature dominates because of its magnitude, these algorithms can behave unpredictably, hurting accuracy and training stability. Applying scaling and normalization is therefore a crucial data preparation step.
Standardization uses StandardScaler from sklearn.preprocessing to remove the mean and scale features to unit variance. It works particularly well when features are approximately normally distributed. You can apply it to specific columns, such as 'Age' and 'Income', by fitting and transforming the DataFrame with scaler.fit_transform(df[['Age', 'Income']]).
Each value becomes z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
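Putting the description and formula together, a minimal standardization sketch using the column names above:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                                    # z = (x - mean) / std
df[["Age", "Income"]] = scaler.fit_transform(df[["Age", "Income"]])
print(df[["Age", "Income"]].describe())                      # means ~0, std ~1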
Min-max scaling scales features to a fixed range — commonly [0, 1]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
Min-max scaling is especially useful when the algorithm does not make assumptions about the distribution of the data
This method uses the median and interquartile range, making it resilient to outliers:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in understanding the characteristics of a dataset before formal modeling or hypothesis testing. Its main goal is to uncover patterns, identify anomalies, and assess assumptions to gain insight into the data’s underlying structure. EDA uses both numerical summaries and visual techniques, such as charts and graphs, to provide a comprehensive overview of the dataset and to safeguard the quality of subsequent analyses.
A well-executed EDA lays the groundwork for meaningful analysis, helping to guide decisions about feature engineering, data cleaning, and modeling strategies
Before diving into methods and techniques, it is important to understand what EDA seeks to accomplish:
Uncover patterns and trends in the data
Identify missing values and outliers
Verify assumptions required for statistical models
Generate hypotheses for further investigation
EDA does not follow a rigid structure—it is often iterative and guided by the nature of the dataset and the goals of the analysis
Understanding the Structure of the Dataset
The first step in EDA is to get an overview of the dataset’s structure: the number of rows and columns, data types, column names, and basic statistical properties
Basic Inspection in Python
import pandas as pd
df = pd.read_csv('sales_data.csv')
# View dimensions print(df.shape)
# Preview the data print(df.head())
# Get column names print(df.columns)
# Data types and non-null values print(df.info())
# Summary statistics for numeric columns print(df.describe())
This initial inspection helps detect inconsistencies, such as unexpected data types or missing columns, and provides insight into the scales, ranges, and summary statistics of numeric features
Univariate analysis examines a single variable at a time. This includes analyzing distributions, central tendencies (mean, median, mode), and dispersion (standard deviation, variance, range).
Histograms and box plots are commonly used to visualize numeric distributions.
import matplotlib.pyplot as plt

# Histogram
df['Revenue'].hist(bins=30)  # bin count illustrative
plt.title('Revenue Distribution')
plt.xlabel('Revenue')
plt.ylabel('Frequency')
plt.show()

# Box plot
df.boxplot(column='Revenue')
plt.title('Box Plot of Revenue')
plt.show()
These plots reveal skewness, modality (unimodal, bimodal), and potential outliers
For categorical variables, frequency counts and bar charts are informative
# Frequency table
print(df['Region'].value_counts())

# Bar chart
df['Region'].value_counts().plot(kind='bar')
plt.title('Number of Records by Region')
plt.xlabel('Region')
plt.ylabel('Count')
plt.show()
This helps assess the balance of category representation and identify dominant or rare categories
Bivariate analysis explores relationships between two variables—typically one independent and one dependent variable
Scatter plots, correlation matrices, and regression plots are used to study the relationship between two numerical variables
# Scatter plot
df.plot.scatter(x='AdvertisingSpend', y='Revenue')
plt.title('Revenue vs Advertising Spend')
plt.show()

# Correlation
print(df[['AdvertisingSpend', 'Revenue']].corr())
Box plots and group-wise aggregations are useful when analyzing the effect of a categorical variable on a numerical variable
# Box plot
df.boxplot(column='Revenue', by='Region')
plt.title('Revenue by Region')
plt.suptitle('')  # Remove automatic title
plt.show()

# Grouped statistics
print(df.groupby('Region')['Revenue'].mean())
Crosstabs and stacked bar charts can show relationships between two categorical variables
# Crosstab
pd.crosstab(df['Region'], df['MembershipLevel'])

# Stacked bar plot
pd.crosstab(df['Region'], df['MembershipLevel']).plot(kind='bar', stacked=True)
plt.title('Membership Level Distribution by Region')
plt.show()
Bivariate analysis is key for identifying predictive relationships, feature relevance, and interactions that can be leveraged in modeling
Multivariate analysis examines the interactions between three or more variables simultaneously, enabling the discovery of complex relationships and patterns. It helps identify clusters or segments within the data and shows how multiple features collectively influence outcomes, which is essential for understanding multi-dimensional data and making informed decisions.
Pair plots allow for the simultaneous visualization of relationships between multiple numerical variables.
import seaborn as sns
sns.pairplot(df[['Revenue', 'AdvertisingSpend', 'CustomerAge', 'Tenure']])
plt.show()
Each cell in the pair plot shows a scatter plot (or histogram on the diagonal) for a pair of variables, helping identify correlations, linearity, and potential groupings
A correlation heatmap offers a compact visualization of pairwise correlation coefficients between numerical features, helping to identify relationships at a glance. Compute the correlation matrix with corr_matrix = df.corr(numeric_only=True), then visualize it with Seaborn: sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5). Annotations and a clear color scheme improve interpretability; finish with plt.title('Correlation Heatmap') and plt.show().
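The steps described above, assembled into one runnable snippet (the styling choices are illustrative):
import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = df.corr(numeric_only=True)          # pairwise correlations of numeric columns
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()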
This is useful for detecting multicollinearity, identifying redundant features, and guiding feature selection
Grouping data by one or more categorical variables and then analyzing numerical trends helps in understanding how different segments behave
# Average revenue by gender and membership level
grouped = df.groupby(['Gender', 'MembershipLevel'])['Revenue'].mean()
print(grouped)
You can also visualize such groupings using grouped bar plots or facet grids
Facet grids allow for conditioned plotting based on one or more categorical variables.
g = sns.FacetGrid(df, col='MembershipLevel')
g.map(plt.hist, 'Revenue', bins=20)  # bin count illustrative
plt.show()
Facet grids are extremely useful for comparative analysis across multiple segments
For datasets containing temporal information, such as timestamps or dates, it's important to examine trends over time
# Ensure datetime format
df['OrderDate'] = pd.to_datetime(df['OrderDate'])

# Set index and resample
df.set_index('OrderDate', inplace=True)
monthly_revenue = df['Revenue'].resample('M').sum()

# Plot
monthly_revenue.plot()
plt.title('Monthly Revenue Trend')
plt.xlabel('Month')
plt.ylabel('Total Revenue')
plt.show()
Time-based EDA helps reveal seasonality, trends, and cycles that may impact forecasting and decision-making
Skewness refers to the asymmetry of a distribution. Many statistical methods assume normality, and skewed distributions can violate those assumptions.
Detecting Skewness:
print(df['Revenue'].skew())
A skew of 0 indicates a symmetric distribution
A positive skew means the tail is on the right
A negative skew means the tail is on the left
Transformations can be used to normalize the data:
import numpy as np

# Log transformation
df['Revenue_log'] = np.log1p(df['Revenue'])

# Square root transformation
df['Revenue_sqrt'] = np.sqrt(df['Revenue'])

# Box-Cox (requires positive values)
from scipy.stats import boxcox
df['Revenue_boxcox'], _ = boxcox(df['Revenue'] + 1)
These transformations can improve model performance and meet algorithmic assumptions
Anomalies, also known as outliers, are data points that deviate significantly from the majority of the dataset. Some anomalies represent genuine phenomena such as fraud or rare events; others result from errors in data entry, measurement, or collection. Proper identification and analysis of outliers is essential for accurate interpretation and reliable decision-making.
Detecting anomalies during EDA is crucial, as they can distort summary statistics and affect model performance
Box plots are a simple and effective way to visually detect outliers
# Box plot of revenue
df.boxplot(column='Revenue')
plt.title('Revenue Box Plot')
plt.show()
The Z-score indicates how many standard deviations a value lies from the mean, which makes it a convenient outlier score. Using SciPy, compute the Z-scores for the 'Revenue' column with stats.zscore(df['Revenue']), take the absolute value so deviations in either direction count, and keep the rows whose score exceeds 3 in a new DataFrame df_outliers.
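A compact version of that filter, following the description above:
from scipy import stats

z_scores = abs(stats.zscore(df['Revenue']))   # deviation from the mean, in std units
df_outliers = df[z_scores > 3]                # flag values more than 3 std away
print(df_outliers)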
Typically, a Z-score greater than 3 is considered an outlier in a normal distribution
Interquartile Range (IQR) is the range between the 25th and 75th percentiles
Q1 = df['Revenue'].quantile(0.25)
Q3 = df['Revenue'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Revenue'] < (Q1 - 1.5 * IQR)) |
              (df['Revenue'] > (Q3 + 1.5 * IQR))]
print(outliers)
This method is more robust than Z-scores and does not assume a normal distribution
Options for handling outliers depend on the context:
Remove: If they result from data entry errors
Cap or Floor (Winsorizing): Set extreme values to percentile thresholds (see the sketch after this list)
Transform: Apply log or Box-Cox transformations to reduce their impact
Separate Models: Train different models for normal and anomalous data, if appropriate
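A minimal capping sketch, using illustrative 1st and 99th percentile thresholds:
# Cap (winsorize) revenue at the 1st and 99th percentiles
lower = df['Revenue'].quantile(0.01)
upper = df['Revenue'].quantile(0.99)
df['Revenue_capped'] = df['Revenue'].clip(lower=lower, upper=upper)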
Feature Engineering Insights from EDA
A crucial by-product of EDA is the opportunity to create new features that capture relationships or behaviors not explicitly represented in the raw data
Ratios: Revenue per customer, clicks per impression
Time-Based: Days since last purchase, month, weekday
Aggregates: Mean revenue per region, max tenure per product
Flags: High-value customer (revenue > threshold), recent activity (last 30 days)
df['RevenuePerVisit'] = df['Revenue'] / df['NumVisits']
df['IsHighValueCustomer'] = df['Revenue'] > 1000
df['Weekday'] = df['OrderDate'].dt.day_name()
EDA guides which features to create by helping you understand what patterns are most meaningful in the data
Documentation is an often-overlooked aspect of EDA, but it is vital for reproducibility, collaboration, and model auditing. Good documentation includes:
A record of the data sources and versions used
A summary of key observations and statistics
Justifications for data cleaning decisions
Descriptions of features added, removed, or transformed
Tools like Jupyter Notebooks, markdown cells, and inline commentary are excellent for documenting EDA
Automated profiling libraries such as pandas-profiling (now ydata-profiling) and Sweetviz generate comprehensive, interactive EDA reports with minimal effort. For example, import `ProfileReport` from `ydata_profiling`, build the report with `profile = ProfileReport(df, title="EDA Report", explorative=True)`, and export it to HTML with `profile.to_file("eda_report.html")`. These tools streamline the EDA process, making it easier to identify data characteristics and prepare datasets for analysis.
These tools provide an overview of missing values, data types, correlations, distributions, and alerts for potential issues
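A comparable Sweetviz sketch, assuming the sweetviz package is installed:
# Generate an interactive Sweetviz report for the same DataFrame
import sweetviz as sv

report = sv.analyze(df)
report.show_html('sweetviz_report.html')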
EDA can become an open-ended task. While thoroughness is important, there is a point of diminishing returns. Signs that EDA is complete include:
You've examined all variables of interest
Key relationships and patterns are understood
Data quality issues have been addressed
Useful derived features have been engineered
Modeling assumptions have been explored or validated
At this stage, you are ready to proceed to model building with confidence that your understanding of the dataset is solid
Exploratory Data Analysis (EDA) is a vital step in the data science process that uncovers the structure, patterns, and unique characteristics of your dataset. By analyzing data thoroughly through EDA, data scientists gain insights that are essential for building robust models and making informed decisions. This phase lays the foundation for successful data modeling and enhances the overall quality of data-driven solutions.
Here are the key takeaways from this chapter:
Initial Inspection: Begin with shape, column types, missing values, and summary statistics
Univariate Analysis: Understand the distribution and variability of individual variables using histograms, box plots, and frequency counts
Bivariate Analysis: Examine relationships between pairs of variables to reveal trends, group differences, or associations
Multivariate Analysis: Explore interactions among three or more variables through pair plots, heatmaps, and grouped aggregations
Visualization: Use a variety of plots (histograms, box plots, scatter plots, heatmaps, bar charts, and facet grids) to detect patterns and anomalies
Outlier Detection: Identify and manage outliers using visual tools, Z-score, and IQR methods
Feature Engineering: Use insights from EDA to create new features that enhance model performance
Documentation: Keep a detailed, clear, and reproducible record of all findings and decisions made during EDA
Exploratory Data Analysis is a tailored process that varies with the nature of the dataset, the specific problem, and the chosen modeling approach. Its primary goal is to develop a comprehensive understanding of the data, identify potential issues early, and ensure smooth progress in the subsequent modeling and deployment stages. Customizing EDA techniques to these factors is essential for effective analysis and successful machine learning outcomes.
These exercises will help reinforce your understanding and give you practical experience applying EDA techniques
1 Initial Dataset Summary
o Load a dataset (e.g., Titanic, Iris, or your own)
o Print the shape, info, and summary statistics
o List the number of missing values per column
2 Univariate Visualizations
o Plot histograms and box plots for at least three numerical variables
o Plot bar charts for two categorical variables
o Identify any distributions that are skewed
3 Bivariate Analysis
o Create scatter plots between pairs of numerical variables
o Use box plots to examine how a numerical variable varies across categories
o Calculate and interpret the correlation between features
4 Multivariate Analysis
o Generate a pair plot for 4–5 variables
o Use a heatmap to visualize correlations across numerical features
o Perform a grouped aggregation (mean, count) for two categorical variables
5 Outlier Detection
o Use both the Z-score and IQR methods to identify outliers in a chosen variable
o Remove or cap the outliers
o Compare summary statistics before and after
6 Feature Engineering
o Derive a new feature based on a ratio (e.g., revenue per visit)
o Create binary flags based on thresholds or business logic
o Extract date-based features such as month or weekday
7 Time Series Exploration (optional if dataset includes dates)
o Convert a column to datetime and set it as an index
o Resample to monthly or weekly granularity
o Plot a time series trend
8 EDA Report
o Use pandas-profiling or Sweetviz to generate an automated EDA report
o Review the report to confirm consistency with your manual analysis
9 Reflection
o Write a short paragraph summarizing the main insights gained from your EDA
o List the assumptions you have made and the questions that emerged during your analysis
Feature Engineering is the process of transforming raw data into meaningful features that boost the predictive performance of machine learning models. Although algorithm selection matters, the quality and relevance of features ultimately determine the success of a machine learning project. Effective feature engineering improves model accuracy and yields more reliable, interpretable results.
In real-world scenarios, raw data is rarely ready for direct use by models; it often contains irrelevant fields, inconsistent formats, or hidden information that needs to be extracted. Feature engineering transforms this raw data into meaningful, structured input suitable for machine learning algorithms by selecting, creating, and refining features that capture essential patterns and insights.
A feature (also called an attribute or variable) is an individual measurable property or characteristic of the phenomenon being observed. In the context of supervised learning:
Input features are the independent variables used to predict an outcome
Target feature (or label) is the dependent variable or output we aim to predict
The process of identifying, constructing, transforming, and selecting features is collectively known as feature engineering
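As a tiny illustrative sketch (the column names below are hypothetical), the split between input features and the target looks like this in pandas:
# Hypothetical columns: Age, Tenure, and NumVisits are used to predict whether a customer churned
X = df[['Age', 'Tenure', 'NumVisits']]  # input (independent) features
y = df['Churned']                       # target (dependent) variable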
Model performance depends heavily on high-quality data: models are only as good as the data they are trained on. Regardless of the algorithm, whether linear regression, decision trees, or neural networks, poorly constructed or irrelevant features can significantly hinder accuracy. Ensuring relevant, well-designed features is essential for optimal machine learning results.
Key reasons why feature engineering is critical:
Increases model accuracy: Well-engineered features provide better signal and reduce noise
Reduces model complexity: Simpler models with relevant features are more interpretable and generalize better
Addresses data issues: Handles missing values, categorical variables, and skewed distributions
Encodes domain knowledge: Converts domain expertise into measurable inputs
Improves interpretability: Transparent features lead to models that are easier to understand and trust
Feature engineering encompasses a wide array of techniques, each suited for different types of data and modeling challenges The most commonly used strategies include:
Creating new features from existing data can often reveal patterns and relationships that raw data does not explicitly present.
a Mathematical Transformations
Applying arithmetic operations can uncover meaningful ratios, differences, or composite metrics:
df['RevenuePerVisit'] = df['Revenue'] / df['NumVisits']
df['AgeDifference'] = df['Age'] - df['Tenure']
b Text Extraction
Extract information from strings such as domain names, keywords, or substrings:
df['EmailDomain'] = df['Email'].str.split('@').str[1]
c Date-Time Decomposition
To analyze order data effectively, decompose timestamps into components such as day, month, year, hour, and weekday. Convert string dates to datetime format with `pd.to_datetime(df['OrderDate'])`, then extract components such as the month with `df['OrderMonth'] = df['OrderDate'].dt.month` and weekday names with `df['OrderWeekday'] = df['OrderDate'].dt.day_name()`. These components make time-based patterns available to downstream analysis and models.
This allows the model to learn temporal patterns like seasonality, holidays, or business cycles
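Collected in one place, the decomposition described above looks like this (assuming pandas is imported as pd):
# Decompose the order timestamp into reusable components
import pandas as pd

df['OrderDate'] = pd.to_datetime(df['OrderDate'])
df['OrderMonth'] = df['OrderDate'].dt.month
df['OrderWeekday'] = df['OrderDate'].dt.day_name()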
Transforming variables improves their distribution, removes skewness, or stabilizes variance.
a Log Transformation
Useful when dealing with positively skewed data (e.g., sales, income):
df['LogRevenue'] = np.log1p(df['Revenue'])
b Normalization / Min-Max Scaling
Brings all features to a similar scale, typically between 0 and 1:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['NormalizedTenure']] = scaler.fit_transform(df[['Tenure']])
c Standardization
Centers data around the mean with a standard deviation of 1, which benefits algorithms that assume a Gaussian distribution:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['StandardizedAge']] = scaler.fit_transform(df[['Age']])
Many machine learning models require numerical input, so categorical variables must be encoded before modeling.
a One-Hot Encoding
Converts categorical variables into binary vectors:
city_dummies = pd.get_dummies(df['City'], prefix='City')
b Label Encoding
Assigns a unique integer to each category:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['GenderEncoded'] = le.fit_transform(df['Gender'])
Use label encoding cautiously: it may impose an ordinal relationship where none exists.
c Frequency Encoding
Replaces each category with its frequency in the dataset:
freq_encoding = df['ProductCategory'].value_counts().to_dict()
df['ProductCategoryFreq'] = df['ProductCategory'].map(freq_encoding)
This keeps the feature compact even for high-cardinality columns while still conveying how common each category is.
Binning converts continuous variables into categorical bins or intervals.
a Equal-Width Binning
Splits values into intervals of equal range:
df['BinnedAge'] = pd.cut(df['Age'], bins=5)
b Equal-Frequency Binning
Each bin contains approximately the same number of observations:
df['QuantileTenure'] = pd.qcut(df['Tenure'], q=4)
c Custom Binning
Apply domain-specific knowledge to define meaningful thresholds:
bins = [0, 18, 35, 60, 100]
labels = ['Teen', 'Young Adult', 'Adult', 'Senior']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels)
Beyond basic transformations and encodings, there are more sophisticated strategies that can significantly enhance model performance, especially when dealing with complex data or relationships
Creating interaction terms captures the combined effect of two or more features.
a Polynomial Interactions
Generate products or ratios between features:
df['Income_Tenure'] = df['Income'] * df['Tenure']
df['IncomePerTenure'] = df['Income'] / (df['Tenure'] + 1)
These are especially useful in models that do not automatically account for interactions (e.g., linear regression).
b Concatenated Categorical Features
Combine categories to form new compound features:
df['Region_Product'] = df['Region'] + "_" + df['ProductCategory']
This can capture localized preferences or behaviors
6 Handling Missing Data as Features
Missing data can itself be informative: the fact that a value is absent (for example, undisclosed income) may be predictive. Creating a binary indicator such as `df['IsIncomeMissing'] = df['Income'].isnull().astype(int)` lets the model use that signal.
Then combine this with imputation for the original column:
df['Income'] = df['Income'].fillna(df['Income'].median())
This preserves missingness information while making the column usable by models
7 Target Encoding
Target encoding replaces each category with the mean of the target variable for that category, which is a powerful way to inject category information into a predictive model. Compute the mean of the target (e.g., sales) for each category with `mean_encoded = df.groupby('ProductCategory')['Sales'].mean()`, then map these means back onto the data with `df['ProductCategoryMeanSales'] = df['ProductCategory'].map(mean_encoded)`. Be careful, however: naive target encoding can leak information from the target into the training data, so it must be applied with appropriate safeguards.
To prevent leakage, it should be done using cross-validation or on out-of-fold data
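Here is a minimal out-of-fold sketch of that idea, assuming the same ProductCategory and Sales columns:
# Out-of-fold target encoding: each row is encoded using means computed without that row's fold
from sklearn.model_selection import KFold
import numpy as np

df['ProductCategoryTargetEnc'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    fold_means = df.iloc[train_idx].groupby('ProductCategory')['Sales'].mean()
    encoded = df.iloc[val_idx]['ProductCategory'].map(fold_means)
    df.iloc[val_idx, df.columns.get_loc('ProductCategoryTargetEnc')] = encoded.values
# Categories never seen in a training fold fall back to the global mean
df['ProductCategoryTargetEnc'] = df['ProductCategoryTargetEnc'].fillna(df['Sales'].mean())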
8 Dimensionality Reduction for Feature Construction
High-dimensional data (e.g., text, images, sensor data) can overwhelm models. Dimensionality reduction helps capture the essential information in fewer variables.
a Principal Component Analysis (PCA)
PCA identifies the axes (components) along which the data varies the most:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df[['Feature1', 'Feature2', 'Feature3']])
df['PC1'] = principal_components[:, 0]
df['PC2'] = principal_components[:, 1]
PCA features are especially helpful when input features are highly correlated.
b t-SNE or UMAP (for visualization or clustering tasks)
Non-linear dimensionality reduction methods such as t-SNE (available in scikit-learn's manifold module) are primarily used for visualization, but they can also support clustering and segmentation. Import TSNE, set `n_components=2`, and apply `fit_transform` to the numerical data to reduce it to two dimensions; the resulting embeddings can be stored as new columns such as 'TSNE1' and 'TSNE2' for visual analysis and further exploration.
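A minimal sketch of that embedding, using the same illustrative feature names as the PCA example:
# Reduce three numeric features to a 2-D t-SNE embedding
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
embedding = tsne.fit_transform(df[['Feature1', 'Feature2', 'Feature3']])
df['TSNE1'] = embedding[:, 0]
df['TSNE2'] = embedding[:, 1]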
9 Handling High-Cardinality Categorical Features
High cardinality refers to categorical variables with a large number of distinct values, such as user IDs or zip codes. Naively one-hot encoding these variables produces thousands of sparse features, which hurts model performance and inflates memory usage, so they need more efficient treatment. Common options include:
Hashing trick (often used in online systems):
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=16, input_type='string')  # 16 is an illustrative size
hashed_features = hasher.transform([[z] for z in df['ZipCode'].astype(str)])  # each sample is an iterable of strings
Domain grouping: Merge rare categories into an "Other" group
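A small sketch of that grouping step, with an illustrative frequency threshold:
# Merge rare zip codes (fewer than 50 occurrences) into an 'Other' bucket
counts = df['ZipCode'].value_counts()
rare = counts[counts < 50].index
df['ZipCodeGrouped'] = df['ZipCode'].where(~df['ZipCode'].isin(rare), 'Other')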
Feature engineering is most powerful when infused with domain knowledge. Understanding the context of the data allows you to craft features that reflect real-world patterns, behaviors, or constraints.
10 Examples by Domain
a Retail / E-commerce
AverageBasketValue = Total Revenue / Number of Orders
RepeatRate = Number of Repeat Purchases / Total Purchases
DaysSinceLastPurchase = Today – Last Purchase Date
b Finance
LoanToIncomeRatio = Loan Amount / Annual Income
CreditUtilization = Current Balance / Total Credit Limit
DebtToAssetRatio = Total Liabilities / Total Assets
c Healthcare
AgeAtDiagnosis = Diagnosis Date – Date of Birth
HospitalStayLength = Discharge Date – Admission Date
d Web Analytics
PagesPerSession = Total Page Views / Sessions
BounceRateFlag = 1 if Single Page Visit, else 0
AvgSessionDuration = Total Time on Site / Sessions
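As an illustration, two of the retail metrics above might be computed like this (the column names are hypothetical, and pandas is assumed to be imported as pd):
# Hypothetical retail columns: TotalRevenue, NumOrders, LastPurchaseDate
df['AverageBasketValue'] = df['TotalRevenue'] / df['NumOrders']
df['DaysSinceLastPurchase'] = (pd.Timestamp.today() - pd.to_datetime(df['LastPurchaseDate'])).dt.days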
In each domain, thoughtful feature creation often leads to performance gains that cannot be achieved by model tuning alone
Time-related variables are often rich with latent structure; however, raw timestamps rarely reveal it on their own.
Break time into components that may drive behavior:
df['Hour'] = df['Timestamp'].dt.hour
df['DayOfWeek'] = df['Timestamp'].dt.dayofweek
df['Month'] = df['Timestamp'].dt.month
This allows the model to learn patterns such as hour-of-day effects, weekday-versus-weekend differences, and monthly seasonality.
Many time components are cyclical (e.g., hour of day, day of week). Encoding them linearly (0 to 23 for hours) introduces misleading relationships: hour 23 and hour 0 appear far apart even though they are adjacent.
Instead, use sine and cosine transformations:
df['Hour_sin'] = np.sin(2 * np.pi * df['Hour'] / 24)
df['Hour_cos'] = np.cos(2 * np.pi * df['Hour'] / 24)
This encodes circularity so the model understands that hour 0 and hour 23 are adjacent
After engineering features, not all of them will be relevant. Feature selection helps retain only the most informative ones.
Variance Threshold: Remove features with little to no variability
Correlation Analysis: Remove highly correlated (redundant) features (see the sketch below)
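A compact sketch of both filters, assuming X is a numeric feature DataFrame and the thresholds shown are illustrative:
# Drop near-constant features, then drop one of each highly correlated pair
import numpy as np
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_reduced = X.loc[:, selector.fit(X).get_support()]

corr = X_reduced.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_filtered = X_reduced.drop(columns=to_drop)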
Use models to estimate feature importance:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh')
Recursive Feature Elimination (RFE) is a feature selection technique that repeatedly trains a model and removes the least important features. In scikit-learn, import `RFE` from `sklearn.feature_selection` and pair it with an estimator such as `LogisticRegression` from `sklearn.linear_model`; initialize RFE with that estimator and set the number of features to keep via the `n_features_to_select` parameter. This process identifies the most relevant features for your predictive model, improving both accuracy and interpretability.
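A minimal sketch of that workflow, reusing the X and y from above and an illustrative target of five features:
# Select the five most useful features with RFE around a logistic regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(X.columns[rfe.support_])  # names of the selected features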
For large datasets or rapid experimentation, feature engineering can be partially automated
Featuretools: Automatically creates features from relational datasets using deep feature synthesis
tsfresh: Extracts hundreds of features from time series data
Kats: Facebook’s time-series analysis toolkit, which includes feature extraction
AutoML tools (e.g., Auto-sklearn, H2O): Often include feature selection/creation as part of their pipeline
Automated feature engineering should not replace domain expertise but can accelerate baseline exploration and model development
Feature engineering is a cornerstone of effective data science While machine learning algorithms provide the machinery to discover patterns, it is well-crafted features that feed them meaningful, structured signals
Key takeaways from this chapter:
Good features > complex models: Thoughtfully engineered features often outperform more complex algorithms applied to raw data
Feature creation includes mathematical combinations, time decomposition, and domain-specific metrics
Transformations (e.g., log, standardization) correct skewness, stabilize variance, and bring comparability across features
Categorical encoding techniques such as one-hot, label, and target encoding are critical for handling non-numeric data
Binning can simplify models, aid interpretability, and capture non-linear patterns
Advanced strategies such as interaction terms, missingness indicators, and dimensionality reduction can capture hidden structure
Cyclical variables (time-based features) must be encoded in ways that respect their periodic nature
Feature selection reduces noise, improves interpretability, and often boosts performance
Automation tools can rapidly generate useful features but should be guided by domain understanding
Feature engineering is an iterative process that combines technical skill, statistical intuition, and domain knowledge. Mastering it enhances your ability to build effective models and solve problems efficiently, making you both a better data modeler and a more capable problem solver.
1 Given a dataset with CustomerID, OrderDate, TotalAmount, and NumItems:
Create features for AverageItemPrice, DaysSinceLastOrder, and MonthlySpendingTrend
2 Use a dataset with Income and Age:
Apply a log transformation to Income
3 Take a column Country with 10 unique values:
Try frequency encoding and explain its impact on interpretability
4 From a dataset with a timestamp column, create:
Hour of day and day of week
Sine and cosine encodings for hour
5 Target Encoding with Cross-Validation
Apply target encoding to a Category column using out-of-fold mean target values
Compare it with one-hot encoding in terms of model accuracy
6 With a dataset that includes a UserID field:
Propose three strategies to manage this feature
Implement one of them and compare model performance
7 Use a dataset with at least 20 numeric features:
Apply correlation filtering to remove redundant variables
Use a tree-based model to evaluate feature importances
8 Given loan application data, create: