Detect Outliers Transform

The Detect Outliers transform allows you to identify unusual or extreme values in your numerical data. This is crucial for understanding data quality, preparing data for analysis, and identifying potentially interesting or problematic data points.

Basic Usage

To detect outliers in your dataset:

1. Select the Detect Outliers transform from the transform menu.
2. Choose the numerical columns you want to analyze in the "Columns to Consider" dropdown.
3. Select the outlier detection method you want to apply.
4. (Optional) Configure advanced options for the selected method.
5. Apply the transformation.

Note: The Detect Outliers transform applies only to numerical columns, so only numerical columns appear in the "Columns to Consider" dropdown.

Configuration Options

Basic Options

Columns to Consider: Select the numerical columns you want to analyze for outliers. The selected columns will be displayed for clarity.
Detector: Choose the method to use for detecting outliers. Available options include:
- Z-Score
- Tukey's Fences
- Standard Deviation
- Percentile
- Isolation Forest
- Local Outlier Factor (LOF)
- Selective
- Correlated

Tip: Hover over each detection method to see a brief explanation of its characteristics and best use cases.

Advanced Options

Each detection method has its own set of parameters. Here are some examples:

Z-Score

Threshold: Absolute Z-score above which a value is considered an outlier.

Tukey's Fences

Multiplier: Multiplier for the Interquartile Range (IQR) to determine the fences.

Percentile

Lower Percentile: Values below this percentile are considered outliers.
Upper Percentile: Values above this percentile are considered outliers.

Isolation Forest

Contamination: Expected proportion of outliers in the dataset.

Outlier Detection Methods Explained

Z-Score

Measures how many standard deviations a data point is from the mean.

Best for: Roughly normally distributed data without extreme outliers, since extreme values inflate the mean and standard deviation that the score is built on.
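As a rough illustration of the Z-score idea, the sketch below measures each value's distance from the mean in units of the standard deviation and flags values beyond a threshold. The function name `zscore_outliers`, the use of the sample standard deviation, and the strict `>` comparison are assumptions made for this example; the transform's own implementation may differ.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumption)
    if sigma == 0:
        # A constant column has no spread, so nothing can be an outlier.
        return [False] * len(values)
    return [abs((v - mu) / sigma) > threshold for v in values]

# Twenty typical readings plus one extreme value.
readings = [10.0] * 10 + [11.0] * 10 + [50.0]
mask = zscore_outliers(readings, threshold=3.0)
```

Here the reading of 50 has a Z-score of roughly 4.4, so it is the only value flagged.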

Tukey's Fences

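Uses the Interquartile Range (IQR = Q3 - Q1) to identify outliers: values below Q1 - k × IQR or above Q3 + k × IQR, where k is the multiplier, fall outside the fences.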

Best for: Datasets where you want to consider the spread of the middle 50% of the data.
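The fences can be sketched in a few lines. This example assumes the conventional multiplier of 1.5 and the "inclusive" (linear interpolation) quartile convention from Python's `statistics` module; quartile conventions vary between tools, so the transform may produce slightly different fences. The function names are hypothetical.

```python
from statistics import quantiles

def tukey_fences(values, multiplier=1.5):
    """Return (lower_fence, upper_fence) computed from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def tukey_outliers(values, multiplier=1.5):
    """Return a boolean mask that is True for values outside the fences."""
    lower, upper = tukey_fences(values, multiplier)
    return [v < lower or v > upper for v in values]

sales = [100, 120, 1500, 80, 110]
fences = tukey_fences(sales)   # (70.0, 150.0)
mask = tukey_outliers(sales)   # only 1500 lies outside the fences
```

Because the quartiles depend only on the middle of the data, the single large value does not widen the fences the way it would widen a standard deviation.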

Standard Deviation

Identifies outliers based on their distance from the mean in terms of standard deviations.

Best for: Normally distributed data where you want to consider the overall spread.

Percentile

Identifies outliers based on their position within the ordered dataset.

Best for: When you want to define outliers based on their relative position in the data distribution.
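A minimal sketch of percentile-based flagging follows. It assumes whole-number percentiles between 1 and 99, default cut-offs of 5 and 95, and the "inclusive" interpolation method of `statistics.quantiles`; these choices and the function name are illustrative rather than the transform's documented behavior.

```python
from statistics import quantiles

def percentile_outliers(values, lower_pct=5, upper_pct=95):
    """Flag values below the lower percentile or above the upper percentile."""
    # quantiles(..., n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v < low or v > high for v in values]

values = list(range(1, 101))  # the integers 1 through 100
mask = percentile_outliers(values)
flagged = [v for v, is_outlier in zip(values, mask) if is_outlier]
```

Note that this method always flags roughly the same share of rows (here the five smallest and five largest values), whether or not those rows are unusual.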

Isolation Forest

An algorithm that isolates anomalies by randomly selecting features and split values. Points that can be separated from the rest in fewer splits are scored as more anomalous.

Best for: High-dimensional datasets and when you expect outliers to be rare and different.
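To make the isolation idea concrete, here is a small from-scratch sketch. It is a simplification, not the transform's implementation: every tree is built on all rows rather than on a random subsample, and all function names are hypothetical. The contamination value is used only to decide how many of the highest-scoring rows to flag.

```python
import math
import random

def average_path_length(n):
    """c(n): average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(rows, rng, depth, max_depth):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # nothing to split on for this feature
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def path_length(row, node, depth=0):
    """Splits needed to reach a leaf, plus a correction for unsplit leaves."""
    if "size" in node:
        return depth + average_path_length(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)

def isolation_scores(rows, n_trees=100, seed=0):
    """Score between 0 and 1; higher means the row is easier to isolate."""
    rng = random.Random(seed)
    n = len(rows)  # assumes at least two rows
    max_depth = math.ceil(math.log2(n))
    trees = [build_tree(rows, rng, 0, max_depth) for _ in range(n_trees)]
    norm = average_path_length(n)
    return [2 ** (-sum(path_length(r, t) for t in trees) / (n_trees * norm))
            for r in rows]

def isolation_outliers(rows, contamination=0.05, **kwargs):
    """Flag the highest-scoring fraction of rows given by contamination."""
    scores = isolation_scores(rows, **kwargs)
    n_flag = max(1, round(contamination * len(rows)))
    cutoff = sorted(scores, reverse=True)[n_flag - 1]
    return [s >= cutoff for s in scores]

# A tight 5 x 4 grid of points plus one point far away from it.
cluster = [(float(x), float(y)) for x in range(5) for y in range(4)]
rows = cluster + [(100.0, 100.0)]
mask = isolation_outliers(rows, contamination=0.05)
```

The distant point is usually cut off by the very first split, so its average path is much shorter than that of any grid point and it receives the highest score.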

Local Outlier Factor (LOF)

Compares the local density of a point to the densities of its neighbors.

Best for: Datasets with varying densities where local context is important.
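The density comparison can be sketched with the standard LOF definitions. This is an illustrative version that uses exactly k nearest neighbours, Euclidean distance, and no handling of duplicate points; the function name and the choice of k are assumptions, not the transform's settings.

```python
import math

def local_outlier_factor(points, k=2):
    """Return one LOF score per point; scores well above 1 suggest an outlier."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of each point's k nearest neighbours, excluding the point itself.
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each point to its k-th nearest neighbour.
    k_distance = [dist[i][neighbours[i][-1]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        k / sum(max(k_distance[j], dist[i][j]) for j in neighbours[i])
        for i in range(n)
    ]
    # LOF: average density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = local_outlier_factor(points, k=2)
```

The four corner points have the same density as their neighbours and score 1.0, while the isolated point at (5, 5) scores about 6 because its neighbourhood is far sparser than theirs.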

Selective

Applies different methods based on the characteristics of each column.

Best for: Datasets with diverse columns that may require different outlier detection approaches.

Correlated

Considers relationships between features when detecting outliers.

Best for: Multivariate data where outliers are noticeable due to interactions between features.
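One common way to account for relationships between features is the Mahalanobis distance, which measures how far a row lies from the mean after allowing for the correlation between columns. The sketch below applies it to two columns; it is an assumption that the Correlated detector works along these lines, and the function names and the 0.975 confidence level are illustrative.

```python
import math
from statistics import mean

def mahalanobis_squared(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my) + sxx * (y - my) ** 2) / det
        for x, y in points
    ]

def correlated_outliers(points, confidence=0.975):
    """Flag points whose squared distance exceeds a chi-square cutoff."""
    # With two features the chi-square quantile has the closed form -2 * ln(1 - p).
    cutoff = -2.0 * math.log(1.0 - confidence)
    return [d > cutoff for d in mahalanobis_squared(points)]

# Ten points close to the line y = x, plus (3, 8), which breaks the pattern.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.8), (9, 9.1), (10, 9.9),
          (3, 8.0)]
mask = correlated_outliers(points)
```

Neither 3 nor 8 is unusual within its own column, so a per-column method would not flag the last point; it stands out only because the combination contradicts the relationship between the two columns.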

Examples

Here's an example of how to use the Detect Outliers transform:

Example: Detecting Outliers in Sales Data

Input Dataset:

Date	Product	Sales	Customer_Rating
2023-01-01	A	100	4.5
2023-01-02	B	120	3.8
2023-01-03	A	1500	4.2
2023-01-04	C	80	4.2
2023-01-05	B	110	4.0

Configuration:

Columns to Consider: Sales, Customer_Rating
Detector: Z-Score
Threshold: 3

Result:

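Treating the five rows above as an excerpt of a larger dataset in which Sales values are typically near 100, the transform will mark the following as an outlier:

Row 3, Sales column (value: 1500)

The output mask will have True for this cell, indicating it's an outlier. On these five rows alone, no value can reach a Z-score of 3: with n values, the largest possible Z-score is (n - 1) / sqrt(n) using the sample standard deviation, which is about 1.79 for n = 5 (or 2 using the population standard deviation). A lower threshold would be needed to flag 1500 in a dataset this small.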
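To check the arithmetic behind this example, the sketch below computes the Z-scores of the five Sales values directly. It uses the sample standard deviation from Python's `statistics` module, which is an assumption; the transform may use a different convention.

```python
import math
from statistics import mean, stdev

sales = [100, 120, 1500, 80, 110]

mu = mean(sales)      # 382
sigma = stdev(sales)  # sample standard deviation, about 625.16
z_scores = [(v - mu) / sigma for v in sales]

# Largest Z-score any single value can reach in a sample of this size.
n = len(sales)
max_possible_z = (n - 1) / math.sqrt(n)

largest_z = max(abs(z) for z in z_scores)
```

Computed on these five rows alone, the value 1500 has a Z-score of about 1.79, essentially the maximum possible for five values, so it exceeds a threshold of 3 only when the column contains many more typical values than are shown here.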

Best Practices

Choose the Right Method: Consider the distribution and characteristics of your data when selecting an outlier detection method.
Visualize Your Data: Use visualization tools in conjunction with outlier detection to better understand your data's distribution and potential outliers.
Consider Context: Remember that not all statistical outliers are errors or problematic. Some may be valuable insights.
Multiple Methods: For critical analyses, consider using multiple outlier detection methods and comparing results.
Handle Outliers Carefully: Once detected, decide whether to remove, transform, or keep outliers based on your specific use case and domain knowledge.

Troubleshooting

If no outliers are detected, try adjusting the parameters of your chosen method (e.g., lowering the Z-score threshold).
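For methods sensitive to the relative scale of the columns (distance-based methods such as Local Outlier Factor), consider normalizing your data first. Z-scores themselves do not change when a column is rescaled.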
If you're getting too many outliers, check if your data is highly skewed or if you're using an inappropriate method for your data's distribution.
