Normalize Data Transform
The Normalize Data transform scales numerical data in your dataset using a choice of methods. Scaling is important for many machine learning algorithms and statistical analyses that assume features are on a similar scale.
Basic Usage
To normalize numerical data in your dataset:
- Select the Normalize Data transform from the transform menu.
- Choose the numerical column(s) you want to normalize in the "Target Columns" dropdown.
- Select the normalization method you want to apply.
- (Optional) Configure advanced options for the selected method.
- Apply the transformation.
Only numerical columns will be available for selection in the "Target Columns" dropdown.
Configuration Options
Basic Options
- Target Columns: Select one or more numerical columns to normalize.
- Method: Choose the normalization method to apply. Available options are:
  - Min-Max Scaling
  - Z-Score Standardization
  - Robust Scaling
  - Max Absolute Scaling
  - Normalizer
  - Quantile Transformation
  - Power Transformation
Hover over each normalization method to see a brief explanation of its use case and characteristics.
Advanced Options
Each normalization method has its own set of advanced options:
Min-Max Scaling
- Range From: Lower bound of the scaling range (default: 0)
- Range To: Upper bound of the scaling range (default: 1)
Z-Score Standardization
- With Mean: Center the data before scaling (default: True)
- With Standard Deviation: Scale the data to unit variance (default: True)
Robust Scaling
- Quantile Range: IQR range for scaling (default: 25-75)
- With Centering: Center the data before scaling (default: True)
- With Scaling: Scale the data to IQR (default: True)
Normalizer
- Norm: The norm to use for normalization (options: 'l1', 'l2', 'max')
Quantile Transformation
- Number of Quantiles: Number of quantiles to compute (default: 1000)
- Output Distribution: Type of output distribution (options: 'uniform', 'normal')
Power Transformation
- Method: Power transformation method (options: 'yeo-johnson', 'box-cox')
Normalization Methods Explained
Min-Max Scaling
Scales data to a fixed range, typically between 0 and 1.
Best Use Case: When you need bounded values within a specific range, useful for algorithms that require non-negative inputs.
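For reference, the behaviour described here matches scikit-learn's MinMaxScaler; a minimal sketch with made-up values (the transform's own implementation may differ):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One numerical column, in the (n_samples, n_features) shape scikit-learn expects
x = np.array([[3.0], [7.0], [11.0], [15.0]])

# "Range From" / "Range To" are assumed to map to feature_range
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(x).ravel())  # approximately [0.0, 0.33, 0.67, 1.0]
```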
Z-Score Standardization
Transforms data to have a mean of 0 and a standard deviation of 1.
Best Use Case: When your data follows a Gaussian distribution and you need to compare features with different scales.
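A minimal sketch of the same idea using scikit-learn's StandardScaler (the data values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0]])

# "With Mean" / "With Standard Deviation" are assumed to map to with_mean / with_std
z = StandardScaler(with_mean=True, with_std=True).fit_transform(x)
print(z.mean(), z.std())  # 0.0 and 1.0
```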
Robust Scaling
Scales data using statistics that are robust to outliers.
Best Use Case: When your dataset contains significant outliers that would distort other scaling methods.
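A sketch using scikit-learn's RobustScaler, assuming the transform's options map to its parameters:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# 500.0 is an outlier; scaling uses the median and IQR, so it does not distort the rest
x = np.array([[1.0], [2.0], [3.0], [4.0], [500.0]])

scaler = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0))
print(scaler.fit_transform(x).ravel())  # the non-outlier values stay close to 0
```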
Max Absolute Scaling
Scales each feature by its maximum absolute value.
Best Use Case: When you want to scale data without moving the zero point, particularly useful for sparse data.
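A sketch with scikit-learn's MaxAbsScaler and made-up values:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Zeros stay zero, which is why this method suits sparse data
x = np.array([[-4.0], [0.0], [2.0], [8.0]])
print(MaxAbsScaler().fit_transform(x).ravel())  # [-0.5, 0.0, 0.25, 1.0]
```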
Normalizer
Scales individual samples to a unit norm.
Best Use Case: When you're interested in the proportions of the features rather than their absolute values, often used in text classification or clustering.
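Unlike the column-wise methods above, this one works per sample (row). A sketch with scikit-learn's Normalizer:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Each row is scaled to unit L2 norm; only the proportions between features remain
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])
print(Normalizer(norm='l2').fit_transform(X))  # rows [0.6, 0.8] and [0.707, 0.707]
```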
Quantile Transformation
Transforms features to follow a uniform or normal distribution.
Best Use Case: When you want to spread out the most frequent values or reduce the impact of outliers.
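A sketch with scikit-learn's QuantileTransformer on skewed, randomly generated data:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
x = rng.exponential(size=(200, 1))  # heavily right-skewed input

# n_quantiles should not exceed the number of samples
qt = QuantileTransformer(n_quantiles=200, output_distribution='normal')
x_t = qt.fit_transform(x)
print(x_t.mean(), x_t.std())  # approximately 0 and 1
```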
Power Transformation
Applies a power transformation to make data more Gaussian-like.
Best Use Case: When dealing with skewed data and you want to stabilize variance and improve the normality of features.
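A sketch with scikit-learn's PowerTransformer, which also standardizes its output by default:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(size=(200, 1))  # right-skewed, strictly positive data

# 'box-cox' requires strictly positive inputs; 'yeo-johnson' also accepts zero and negatives
pt = PowerTransformer(method='box-cox')
print(pt.fit_transform(x).std())  # close to 1 after the transform
```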
Examples
Here's an example of how to use the Normalize Data transform:
Example: Normalizing Student Grades
Input Dataset:
| Student_ID | Student_Name | Subject | Grade | Ranking |
|---|---|---|---|---|
| 101 | Alice | Math | 0 | 6 |
| 102 | Bob | Science | 20 | 5 |
| 103 | Charlie | English | 40 | 4 |
| 104 | David | Math | 60 | 3 |
| 105 | Eva | Science | 80 | 2 |
| 106 | Frank | Math | 100 | 1 |
Configuration:
- Target Columns: Grade
- Method: Min-Max Scaling
  - Range From: 0
  - Range To: 1
Result:
| Student_ID | Student_Name | Subject | Grade | Ranking |
|---|---|---|---|---|
| 101 | Alice | Math | 0.0 | 6 |
| 102 | Bob | Science | 0.2 | 5 |
| 103 | Charlie | English | 0.4 | 4 |
| 104 | David | Math | 0.6 | 3 |
| 105 | Eva | Science | 0.8 | 2 |
| 106 | Frank | Math | 1.0 | 1 |
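If you want to reproduce this result in code, a minimal pandas/scikit-learn sketch (assuming the transform's Min-Max Scaling behaves like MinMaxScaler) looks like this:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "Student_ID": [101, 102, 103, 104, 105, 106],
    "Grade": [0, 20, 40, 60, 80, 100],
    "Ranking": [6, 5, 4, 3, 2, 1],
})

# Only the selected target column is scaled; the other columns are left untouched
df[["Grade"]] = MinMaxScaler(feature_range=(0, 1)).fit_transform(df[["Grade"]])
print(df)  # Grade becomes 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
```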
Best Practices
- Choose the Right Method: Consider the distribution of your data and the requirements of your analysis or model when selecting a normalization method.
- Handle Outliers: If your data contains significant outliers, consider Robust Scaling or another method that is less sensitive to extreme values.
- Preserve Zero: For some applications it is important to preserve the zero point. In such cases, consider methods like Max Absolute Scaling.
- Consistent Scaling: Fit the scaling on your training data and apply the same fitted parameters to your test data to keep the two consistent (see the sketch after this list).
- Check Assumptions: Some methods (like Z-Score Standardization) assume an approximately normal distribution. Verify that your data meets the assumptions of the chosen method.
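A minimal sketch of consistent train/test scaling, using scikit-learn's StandardScaler as a stand-in:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse those same parameters on the test data
```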
Troubleshooting
- If you don't see a column in the "Target Columns" dropdown, check if it's correctly identified as a numerical column in your dataset.
- For methods sensitive to outliers (like Min-Max Scaling), check your data for extreme values that might skew the results.
- If using Power Transformation with the Box-Cox method, ensure all your data is strictly positive; Box-Cox does not work with zero or negative values (see the sketch below).
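A small sketch of that check, using scikit-learn's PowerTransformer as a stand-in and made-up values:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.array([[0.0], [1.5], [3.0]])  # contains a zero, so Box-Cox would raise an error

# Fall back to Yeo-Johnson when the column is not strictly positive
method = 'box-cox' if (x > 0).all() else 'yeo-johnson'
x_t = PowerTransformer(method=method).fit_transform(x)
```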