Handle Outliers Transform

The Handle Outliers transform lets you treat or remove outliers in numerical data that a previous outlier detection step has flagged. Handling these values keeps a few extreme data points from unduly influencing your analysis.

Basic Usage

To handle outliers in your dataset:

  1. Ensure you have first run an outlier detection transform on your data.
  2. Select the Handle Outliers transform from the transform menu.
  3. Choose the method for handling outliers.
  4. (Optional) Configure advanced options for the selected method.
  5. Apply the transformation.

Note: The Handle Outliers transform affects only data points that a previous step identified as outliers.

Configuration Options

Basic Options

  • Handle Option: Choose the method to use for handling outliers. Available options are:
    • Cap & Floor
    • Remove

Tip: "Cap & Floor" replaces outliers with boundary values, while "Remove" eliminates the entire row containing an outlier.

Advanced Options

If you choose "Cap & Floor" as your handling method, you'll have additional options:

  • Cap & Floor Settings: Choose the method for determining the cap and floor values:

    • Tukey's Method
    • Median Absolute Deviation (MAD)
    • Min-Max Method
  • Multiplier: For Tukey's and MAD methods, specify the multiplier to use in calculating the boundaries.

Outlier Handling Methods Explained

Cap & Floor

Replaces outlier values with upper (cap) or lower (floor) boundary values.

Best for: Preserving data points while limiting the impact of extreme values.

Remove

Eliminates entire rows containing outlier values.

Best for: Cases where you believe outliers represent erroneous or irrelevant data points.
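
The difference between the two handle options can be sketched in pandas. This is a minimal illustration, not the transform's actual implementation: the helper `handle_outliers`, its parameters, and the column names `Value` and `Outlier` are all hypothetical, and the boundary values are passed in by hand.

```python
import pandas as pd

def handle_outliers(df, column, flag_column, option, floor=None, cap=None):
    """Hypothetical helper: treat only rows already flagged as outliers."""
    flagged = df[flag_column].astype(bool)
    if option == "remove":
        # Drop every row that contains a flagged value.
        return df.loc[~flagged].copy()
    if option == "cap_floor":
        out = df.copy()
        # Clip flagged values into [floor, cap]; unflagged rows stay untouched.
        out.loc[flagged, column] = out.loc[flagged, column].clip(lower=floor, upper=cap)
        return out
    raise ValueError(f"unknown option: {option}")

# Made-up readings; the Outlier column stands in for a prior detection step.
df = pd.DataFrame({
    "Value": [10, 12, 95, 11, -40],
    "Outlier": [False, False, True, False, True],
})

capped = handle_outliers(df, "Value", "Outlier", "cap_floor", floor=5, cap=20)
removed = handle_outliers(df, "Value", "Outlier", "remove")

print(capped["Value"].tolist())   # [10, 12, 20, 11, 5]
print(removed["Value"].tolist())  # [10, 12, 11]
```

Capping keeps all five rows and changes two values, whereas removal keeps three rows and changes none.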

Cap & Floor Methods

Tukey's Method

Uses the interquartile range (IQR = Q3 - Q1) to define boundaries: the floor is Q1 - multiplier × IQR and the cap is Q3 + multiplier × IQR.

Best for: Datasets where you want to consider the spread of the middle 50% of the data.
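
As a rough sketch of how Tukey boundaries are typically computed, the function below uses NumPy's default (linear interpolation) percentiles. The function name `tukey_bounds` is hypothetical, and the transform may use a different quartile convention.

```python
import numpy as np

def tukey_bounds(values, multiplier=1.5):
    """Return (floor, cap) as Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

floor, cap = tukey_bounds([100, 120, 1500, 80, 110], multiplier=1.5)
print(floor, cap)  # 70.0 150.0
```

A larger multiplier widens the boundaries, so fewer values fall outside them.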

Median Absolute Deviation (MAD)

Uses the median of the absolute deviations from the median, a robust measure of variability, to define boundaries.

Best for: Datasets with extreme skewness or outliers, as it's less sensitive to extreme values than standard deviation.
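
A minimal sketch of MAD-based boundaries follows. It assumes the boundaries sit at median ± multiplier × MAD with the raw (unscaled) MAD; some implementations scale MAD by about 1.4826 to make it comparable to a standard deviation, and this page does not say which convention the transform uses. The name `mad_bounds` and the multiplier of 3 are illustrative only.

```python
import numpy as np

def mad_bounds(values, multiplier=3.0):
    """Return (floor, cap) as median -/+ multiplier * MAD (unscaled MAD assumed)."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    # MAD: median of the absolute deviations from the median.
    mad = np.median(np.abs(x - median))
    return median - multiplier * mad, median + multiplier * mad

floor, cap = mad_bounds([100, 120, 1500, 80, 110], multiplier=3.0)
print(floor, cap)  # 80.0 140.0
```

Here the median is 110 and the MAD is 10, so the single extreme value (1500) has no effect on either boundary.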

Min-Max Method

Uses the minimum and maximum values from the dataset to set caps and floors.

Best for: Situations where there are strict limits to the values a variable can take (e.g., percentages must be between 0 and 100).
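
The page does not spell out which values the Min-Max method draws its minimum and maximum from. The sketch below assumes one plausible reading: the floor and cap are the smallest and largest values among rows not flagged as outliers. Both that interpretation and the name `min_max_bounds` are assumptions.

```python
import pandas as pd

def min_max_bounds(series, is_outlier):
    """Assumed behavior: bounds are the min and max of the non-outlier values."""
    inliers = series[~is_outlier]
    return inliers.min(), inliers.max()

sales = pd.Series([100, 120, 1500, 80, 110])
flags = pd.Series([False, False, True, False, False])

floor, cap = min_max_bounds(sales, flags)
print(floor, cap)  # 80 120
```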

Examples

Here's an example of how to use the Handle Outliers transform:

Example: Capping Outliers in Sales Data

Input Dataset (after outlier detection):

Date        Product  Sales  Outlier
2023-01-01  A        100    False
2023-01-02  B        120    False
2023-01-03  A        1500   True
2023-01-04  C        80     False
2023-01-05  B        110    False

Configuration:

  • Handle Option: Cap & Floor
  • Cap & Floor Settings: Tukey's Method
  • Multiplier: 1.5

Result:

Date        Product  Sales  Outlier
2023-01-01  A        100    False
2023-01-02  B        120    False
2023-01-03  A        150    True
2023-01-04  C        80     False
2023-01-05  B        110    False

The outlier value (1500) has been capped at 150, the upper boundary calculated by Tukey's method with a multiplier of 1.5.
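
The capped value of 150 can be reproduced with a short pandas sketch. It assumes the quartiles are taken over all five Sales values with linear interpolation (the pandas default), which gives Q1 = 100, Q3 = 120, and IQR = 20, so the cap is 120 + 1.5 × 20 = 150. Whether the transform computes quartiles exactly this way is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
    "Product": ["A", "B", "A", "C", "B"],
    "Sales": [100, 120, 1500, 80, 110],
    "Outlier": [False, False, True, False, False],
})

q1 = df["Sales"].quantile(0.25)   # 100.0
q3 = df["Sales"].quantile(0.75)   # 120.0
iqr = q3 - q1                     # 20.0
floor = q1 - 1.5 * iqr            # 70.0
cap = q3 + 1.5 * iqr              # 150.0

result = df.copy()
result["Sales"] = result["Sales"].astype(float)
mask = result["Outlier"]
# Only rows flagged by the detection step are clipped to [floor, cap].
result.loc[mask, "Sales"] = result.loc[mask, "Sales"].clip(lower=floor, upper=cap)

print(result)
```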

Best Practices

  1. Understand Your Data: Choose a handling method that aligns with your data's characteristics and your analysis goals.

  2. Preserve Information: When possible, use capping methods instead of removal to retain as much data as possible.

  3. Document Your Choices: Keep a record of how you handled outliers, including the method and any parameters used.

  4. Consider Domain Knowledge: Use your understanding of the data to inform your outlier handling strategy.

  5. Validate Results: After handling outliers, re-examine your data to ensure the results align with your expectations.

Troubleshooting

  • If too many data points are being capped, consider increasing the multiplier for Tukey's or MAD methods, which widens the boundaries.
  • If removing outliers results in a significant loss of data, consider using a capping method instead.
  • Always check the distribution of your data after handling outliers to ensure you haven't introduced new biases.
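
One way to run the distribution check is to compare summary statistics before and after handling and to measure what share of values changed. The helper `compare_distributions` below is a hypothetical sketch built on pandas `describe()`, using the Sales values from the example above.

```python
import pandas as pd

def compare_distributions(before, after):
    """Side-by-side summary statistics for a column before and after handling."""
    return pd.DataFrame({"before": before.describe(), "after": after.describe()})

before = pd.Series([100, 120, 1500, 80, 110], name="Sales")
after = pd.Series([100, 120, 150, 80, 110], name="Sales")

summary = compare_distributions(before, after)
# Fraction of values that the handling step modified.
changed_share = (before != after).mean()

print(summary)
print(f"share of values changed: {changed_share:.0%}")
```

A large share of changed values, or a pile-up of values exactly at the cap or floor, suggests the boundaries are too tight for the data.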