Data Preprocessing & Feature
AI is technology that enables computers and machines to simulate human learning.
What is Data Preprocessing?
Data preprocessing is the process of cleaning, transforming, and preparing raw data so that machine learning algorithms can learn from it effectively.
💡 Think of it like cleaning and organizing ingredients before cooking — if the data is messy, the model won’t learn properly.
Steps:
1. Data Cleaning
- Handle missing values (NaN, blanks).
- Remove duplicates.
- Correct inconsistencies (e.g., “Male” vs. “M”).
2. Data Transformation
- Convert categorical values into numbers (One-Hot Encoding, Label Encoding).
- Normalize or standardize numeric values (rescale them to a common range such as [0, 1], or to zero mean and unit variance).
- Convert text to tokens (for NLP).
3. Data Reduction
- Remove irrelevant columns.
- Dimensionality reduction (e.g., PCA; t-SNE is mostly used for visualization rather than as model input).
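The cleaning and transformation steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the records, the column names (gender, age, income), and the helper names clean, min_max_scale, and standardize are all made up for this example, and filling missing values with the mean is just one possible strategy. In practice, libraries such as pandas or scikit-learn are commonly used for this work.

```python
from statistics import mean, pstdev

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"gender": "Male", "age": 34, "income": 52000.0},
    {"gender": "M", "age": None, "income": 61000.0},
    {"gender": "Female", "age": 29, "income": 48000.0},
    {"gender": "Female", "age": 29, "income": 48000.0},  # exact duplicate
    {"gender": "F", "age": 45, "income": 75000.0},
]

# Assumed mapping from short labels to one canonical spelling.
GENDER_ALIASES = {"M": "Male", "F": "Female"}


def clean(rows):
    """Fix inconsistent labels, drop exact duplicates, fill missing ages."""
    # 1. Correct inconsistencies such as "M" vs. "Male".
    fixed = [
        {**row, "gender": GENDER_ALIASES.get(row["gender"], row["gender"])}
        for row in rows
    ]
    # 2. Remove exact duplicates while keeping the original order.
    seen, unique = set(), []
    for row in fixed:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # 3. Fill missing ages with the mean of the observed ages.
    known_ages = [row["age"] for row in unique if row["age"] is not None]
    fill = mean(known_ages)
    return [
        {**row, "age": fill if row["age"] is None else row["age"]}
        for row in unique
    ]


def min_max_scale(values):
    """Normalize: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardize: shift to zero mean and scale to unit variance.

    Assumes the values are not all identical (non-zero spread).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


cleaned = clean(raw_rows)
incomes = [row["income"] for row in cleaned]
income_scaled = min_max_scale(incomes)
income_standardized = standardize(incomes)
```

Labels are corrected before duplicates are removed, so rows that differ only in spelling ("F" vs. "Female") can also be recognized as duplicates.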
What are Features?
Features are the measurable properties or characteristics of the data that the model uses to make predictions.
- In a tabular dataset, the input columns are the features, and one column is usually the target (output) that the model learns to predict.
Example: Predicting house prices
- Features → size, location, number_of_rooms
- Target → price
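In code, the split between features and target for the house price example might look like the sketch below. The listing values are invented for illustration, and the names X and y are only a common convention for inputs and outputs, not something the example requires.

```python
# Hypothetical house listings; column names follow the example above.
houses = [
    {"size": 120, "location": "suburb", "number_of_rooms": 4, "price": 250000},
    {"size": 75, "location": "city", "number_of_rooms": 2, "price": 310000},
    {"size": 200, "location": "rural", "number_of_rooms": 6, "price": 280000},
]

FEATURE_COLUMNS = ["size", "location", "number_of_rooms"]
TARGET_COLUMN = "price"

# X holds one row of feature values per house (the inputs).
X = [[house[col] for col in FEATURE_COLUMNS] for house in houses]
# y holds the matching target value for each row (the output).
y = [house[TARGET_COLUMN] for house in houses]
```

Note that location is still text here, so it would need to be encoded as numbers before most models could use it.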
Feature Engineering
Feature engineering means creating new features or modifying existing ones to improve model performance.
Examples:
- Creating new features: From date_of_birth, derive a new feature age. From transaction_amount, derive log(transaction_amount) to reduce skewness.
- Feature selection: Keep only the most relevant features and remove noise (irrelevant columns).
- Encoding categorical features: Convert ["Red", "Blue", "Green"] → [1,0,0], [0,1,0], [0,0,1] (one-hot encoding).
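The examples above can be tried out with a short standard-library sketch. The helper names (age_in_years, one_hot, drop_constant_columns), the sample record, and the fixed reference date are assumptions made for this illustration. Dropping constant columns is only one very simple form of feature selection, shown here because it needs no model.

```python
import math
from datetime import date


def age_in_years(date_of_birth, today):
    """Whole years elapsed between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of value."""
    return [1 if value == category else 0 for category in categories]


def drop_constant_columns(rows):
    """Remove columns whose value is identical in every row."""
    keep = [col for col in rows[0] if len({row[col] for row in rows}) > 1]
    return [{col: row[col] for col in keep} for row in rows]


COLORS = ["Red", "Blue", "Green"]
REFERENCE_DATE = date(2024, 1, 1)  # fixed so the example is reproducible

# Hypothetical input record.
record = {
    "date_of_birth": date(1990, 6, 15),
    "transaction_amount": 1000.0,
    "color": "Blue",
}

engineered = {
    # New feature derived from date_of_birth.
    "age": age_in_years(record["date_of_birth"], REFERENCE_DATE),
    # Log transform to reduce skewness; the amount must be positive.
    "log_transaction_amount": math.log(record["transaction_amount"]),
    # One-hot encoding of a categorical feature.
    "color_one_hot": one_hot(record["color"], COLORS),
}

# Simple feature selection: "country" never varies, so it is dropped.
selected = drop_constant_columns(
    [{"age": 33, "country": "US"}, {"age": 41, "country": "US"}]
)
```

If amounts can be zero, math.log1p (which computes log(1 + x)) is a common alternative to a plain logarithm.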
✅ In summary:
- Data preprocessing = cleaning + transforming data.
- Features = inputs that describe the problem.
- Feature engineering = creating/selecting the best features to boost model performance.