Data Cleansing using Python Online Course
This course focuses on the critical process of data preparation in machine learning, teaching you how to transform raw data into a model-ready format. You’ll learn essential preprocessing techniques such as data imputation, advanced cleansing, handling non-numeric values, and meeting algorithm-specific requirements for scale and distribution. The course also covers strategies to prevent data leakage and ensure accurate model evaluation. By the end, you’ll have mastered practical data cleaning and preprocessing skills to build reliable and effective machine learning models.
Who should take the course?
This course is ideal for data analysts, data scientists, students, and professionals who want to learn how to clean and prepare raw data for analysis using Python. It’s well-suited for those with basic Python knowledge who are looking to improve data quality by handling missing values, duplicates, inconsistencies, and formatting issues. Whether you’re an aspiring data professional, a researcher working with messy datasets, or a business professional aiming to make accurate data-driven decisions, this course will equip you with the practical skills to perform effective data cleansing with Python.
What you will learn
- Prepare data in a way that avoids data leakage
- Identify and handle problems with messy data
- Know which feature selection method to choose based on the data types
- Transform the probability distribution of input variables
- Identify and remove irrelevant and redundant input variables
- Project variables into a lower-dimensional space
Course Outline
Introduction
- Course Introduction
- Course Structure
- Is this Course Right for You?
Foundations
- Introducing Data Preparation
- The Machine Learning Process
- Data Preparation Defined
- Choosing a Data Preparation Technique
- What is Data in Machine Learning?
- Raw Data
- Machine Learning is Mostly Data Preparation
- Common Data Preparation Tasks - Data Cleansing
- Common Data Preparation Tasks - Feature Selection
- Common Data Preparation Tasks - Data Transforms
- Common Data Preparation Tasks - Feature Engineering
- Common Data Preparation Tasks - Dimensionality Reduction
- Data Leakage
- Problem with Naïve Data Preparation
- Case Study: Data Leakage: Train / Test / Split Naïve Approach
- Case Study: Data Leakage: Train / Test / Split Correct Approach
- Case Study: Data Leakage: K-Fold Naïve Approach
- Case Study: Data Leakage: K-Fold Correct Approach
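The train/test-split case studies above can be sketched as follows. This is a minimal illustration on synthetic data (not the course's dataset): the naive approach fits the scaler on all rows before splitting, leaking test-set statistics into training, while the correct approach fits the scaler on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

# Naive: the scaler sees test rows before the split (data leakage)
X_leaky = MinMaxScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_leaky, y, test_size=0.33, random_state=1)
naive_acc = accuracy_score(
    y_te, LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te))

# Correct: split first, then fit the scaler on the training split only
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=1)
scaler = MinMaxScaler().fit(X_tr)  # statistics come from training data only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
correct_acc = accuracy_score(
    y_te, LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr).predict(X_te_s))

print(f"naive: {naive_acc:.3f}  correct: {correct_acc:.3f}")
```

The leakage here is subtle: the accuracies may even look similar, but the naive estimate is optimistically biased because the test rows influenced the scaling statistics.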
Data Cleansing
- Data Cleansing Overview
- Identify Columns That Contain a Single Value
- Identify Columns with Few Values
- Remove Columns with Low Variance
- Identify and Remove Rows That Contain Duplicate Data
- Defining Outliers
- Remove Outliers - The Standard Deviation Approach
- Remove Outliers - The IQR Approach
- Automatic Outlier Detection
- Mark Missing Values
- Remove Rows with Missing Values
- Statistical Imputation
- Mean Value Imputation
- Simple Imputer with Model Evaluation
- Compare Different Statistical Imputation Strategies
- K-Nearest Neighbors Imputation
- KNNImputer and Model Evaluation
- Iterative Imputation
- IterativeImputer and Model Evaluation
- IterativeImputer and Different Imputation Order
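As a taste of the statistical imputation covered above, here is a minimal sketch on a contrived array: `SimpleImputer` with the `mean` strategy replaces each `NaN` with its column's mean of the observed values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # column-wise mean imputation
X_filled = imputer.fit_transform(X)

print(X_filled)
# column 0: mean of observed values is (1 + 7) / 2 = 4.0
# column 1: mean of observed values is (2 + 4) / 2 = 3.0
```

The same `fit`/`transform` interface applies to `KNNImputer` and `IterativeImputer`, which the course compares against this simple baseline.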
Feature Selection
- Feature Selection Introduction
- Feature Selection Defined
- Statistics for Feature Selection
- Loading a Categorical Dataset
- Encode the Dataset for Modeling
- Chi-Squared
- Mutual Information
- Modeling with Selected Categorical Features
- Feature Selection with ANOVA on Numerical Input
- Feature Selection with Mutual Information
- Modeling with Selected Numerical Features
- Tuning the Number of Selected Features
- Select Features for Numerical Output
- Linear Correlation with Correlation Statistics
- Linear Correlation with Mutual Information
- Baseline and Model Built Using Correlation
- Model Built Using Mutual Information Features
- Tuning the Number of Selected Features
- Recursive Feature Elimination
- RFE for Classification
- RFE for Regression
- RFE Hyperparameters
- Feature Ranking for RFE
- Feature Importance Scores Defined
- Feature Importance Scores: Linear Regression
- Feature Importance Scores: Logistic Regression and CART
- Feature Importance Scores: Random Forests
- Permutation Feature Importance
- Feature Selection with Importance
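A minimal sketch (on synthetic data, not the course's datasets) of the filter-style selection covered above: `SelectKBest` with the ANOVA F-statistic keeps the `k` highest-scoring numerical inputs for a classification target.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=1)

# Score every input against the target, keep the 4 best
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 4)
print(selector.get_support())  # boolean mask of the chosen columns
```

Swapping `f_classif` for `mutual_info_classif` (or `chi2` on non-negative categorical encodings) gives the other filter statistics the course compares; RFE follows the same fit/transform pattern but wraps an estimator instead of a scoring function.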
Data Transforms
- Scale Numerical Data
- Diabetes Dataset for Scaling
- MinMaxScaler Transform
- StandardScaler Transform
- Robust Scaling Data
- Robust Scaler Applied to Dataset
- Explore Robust Scaler Range
- Nominal and Ordinal Variables
- Ordinal Encoding
- One-Hot Encoding Defined
- One-Hot Encoding
- Dummy Variable Encoding
- Ordinal Encoder Transform on Breast Cancer Dataset
- Make Distributions More Gaussian
- Power Transform on Contrived Dataset
- Power Transform on Sonar Dataset
- Box-Cox on Sonar Dataset
- Yeo-Johnson on Sonar Dataset
- Polynomial Features
- Effect of Polynomial Degrees
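The encoding transforms above can be sketched with a contrived nominal column: `OneHotEncoder` turns each category into its own binary column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
X_onehot = encoder.fit_transform(X).toarray()  # dense array for readability

print(encoder.categories_)  # categories are sorted: blue, green, red
print(X_onehot)             # one binary column per category, one 1 per row
```

Dropping the first category column turns this into the dummy-variable encoding covered above; `OrdinalEncoder`, `MinMaxScaler`, and `PowerTransformer` all share the same `fit`/`transform` interface.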
Advanced Transforms
- Transforming Different Data Types
- The ColumnTransformer
- The ColumnTransformer on Abalone Dataset
- Manually Transform Target Variable
- Automatically Transform Target Variable
- Challenge of Preparing New Data for a Model
- Save Model and Data Scaler
- Load and Apply Saved Scalers
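The `ColumnTransformer` and save/load workflow above can be sketched on a small contrived frame (the column names here are illustrative, loosely echoing the abalone dataset): scale the numeric column and one-hot encode the categorical one in a single fitted object, then persist it with joblib so identical preparation can be applied to new data later.

```python
import pandas as pd
from joblib import dump, load
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({"length": [10.0, 20.0, 30.0],
                   "sex": ["M", "F", "I"]})

# Apply a different preparation to each column type in one object
ct = ColumnTransformer([
    ("num", MinMaxScaler(), ["length"]),
    ("cat", OneHotEncoder(), ["sex"]),
])
X = ct.fit_transform(df)
print(X.shape)  # (3, 4): 1 scaled column + 3 one-hot columns

dump(ct, "transformer.joblib")        # save the fitted transformer
ct_loaded = load("transformer.joblib")
X2 = ct_loaded.transform(df)          # same preparation, applied later
```

Persisting the fitted transformer alongside the model is what makes the "preparing new data" challenge tractable: the exact training-time statistics are reused at prediction time.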
Dimensionality Reduction
- Curse of Dimensionality
- Techniques for Dimensionality Reduction
- Linear Discriminant Analysis
- Linear Discriminant Analysis Demonstrated
- Principal Component Analysis
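As a minimal sketch of the PCA technique listed above (synthetic data, not a course dataset): project ten correlated inputs onto their top three principal components.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=300, n_features=10, random_state=2)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (300, 3)
print(pca.explained_variance_ratio_)  # variance captured per component
```

`LinearDiscriminantAnalysis` offers a supervised alternative with the same fit/transform interface, using class labels to choose the projection.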