Merge Transform
The Merge transform allows you to combine two datasets based on common columns or keys. It offers multiple joining methods and advanced options, providing flexibility for efficient data merging and alignment.
Basic Usage
To merge two datasets:
The Merge transform requires at least two Data Input nodes.
Make sure you have added two datasets to your pipeline before using this operation.

- Select the Merge transform from the transform menu.
- Choose the two datasets you want to merge.
- Specify the columns to join on for each dataset.
- Select the appropriate join type.
- (Optional) Configure additional options for enhanced merging flexibility.
Configuration Options
Basic Options
- Left On: Select the column(s) from the first dataset to use as the join key.
- Right On: Select the column(s) from the second dataset to use as the join key.
- Join Type: Choose the method for combining the datasets:
Inner: Keeps only rows with matching keys in both datasets.Left: Keeps all rows from the left dataset and matching rows from the right.Right: Keeps all rows from the right dataset and matching rows from the left.Outer: Keeps all rows from both datasets, filling in nulls where no match exists.
Advanced Options
- Auto Convert: Automatically convert data types of join columns if they're incompatible.
- Fuzzy Merge: Enables matching for similar, but not identical values.
- Fuzzy Threshold: Set the similarity threshold for fuzzy matching (0-100).
- Fuzzy Distance Method: Choose the algorithm for calculating string similarity.
Fuzzy Merge
Fuzzy merging is particularly useful when datasets contain similar but not identical values due to typos, formatting inconsistencies, or varied data entry practices.
Fuzzy Distance Methods
Click to see available fuzzy distance methods
- Ratio: Standard Levenshtein distance.
- Partial Ratio: Matches substrings.
- Token Sort Ratio: Compares sorted word lists.
- Token Set Ratio: Compares unique word sets.
- WRatio: Advanced algorithm considering string length and matching characters.
- QRatio: Fast, less accurate alternative to WRatio.
- UWRatio: More accurate version of WRatio.
- Jaro: Accounts for character similarity and transpositions.
- Jaro-Winkler: Prioritizes matches at the start of strings.
- Cosine: Compares vector representations of strings.
- Hamming: Counts differing positions in equal-length strings.
- Longest Common Substring: Finds the longest shared sequence.
- N-gram: Compares character sequences.
- Soundex: Phonetic comparison of names.
- Metaphone: Improved phonetic comparison.
- Double Metaphone: Handles multiple pronunciations across languages.
Examples
Here are some examples of how to use the Merge transform:
Example 1: Basic Inner Join
Input Datasets:
Dataset 1:
| id | value1 |
|---|---|
| 1 | A |
| 2 | B |
| 3 | C |
Dataset 2:
| id | value2 |
|---|---|
| 2 | X |
| 3 | Y |
| 4 | Z |
Configuration:
- Left On:
id - Right On:
id - Join Type:
Inner
Result:
| id | value1 | value2 |
|---|---|---|
| 2 | B | X |
| 3 | C | Y |
Example 2: Left Join with Fuzzy Matching
Input Datasets:
Dataset 1:
| name | score |
|---|---|
| John | 85 |
| Mary | 92 |
| Michael | 78 |
Dataset 2:
| employee | department |
|---|---|
| Jon | Sales |
| Mary Ann | Marketing |
| Mike | IT |
Configuration:
- Left On:
name - Right On:
employee - Join Type:
Left - Fuzzy Merge:
Enabled - Fuzzy Threshold:
80 - Fuzzy Distance Method:
Jaro-Winkler
Result:
| name | score | employee | department |
|---|---|---|---|
| John | 85 | Jon | Sales |
| Mary | 92 | Mary Ann | Marketing |
| Michael | 78 | Mike | IT |
When using fuzzy merge, experiment with different thresholds and distance methods to find the best balance between matching accuracy and false positives.
Fuzzy merging can be computationally intensive for large datasets. Consider using it selectively or on smaller subsets of your data for optimal performance.