Skip to main content

Merge Transform

The Merge transform allows you to combine two datasets based on common columns or keys. It offers multiple joining methods and advanced options, providing flexibility for efficient data merging and alignment.

Basic Usage

To merge two datasets:

caution

The Merge transform requires at least two Data Input nodes.
Make sure you have added two datasets to your pipeline before using this operation.

merge Mode

  1. Select the Merge transform from the transform menu.
  2. Choose the two datasets you want to merge.
  3. Specify the columns to join on for each dataset.
  4. Select the appropriate join type.
  5. (Optional) Configure additional options for enhanced merging flexibility.

Configuration Options

Basic Options

  • Left On: Select the column(s) from the first dataset to use as the join key.
  • Right On: Select the column(s) from the second dataset to use as the join key.
  • Join Type: Choose the method for combining the datasets:
    • Inner: Keeps only rows with matching keys in both datasets.
    • Left: Keeps all rows from the left dataset and matching rows from the right.
    • Right: Keeps all rows from the right dataset and matching rows from the left.
    • Outer: Keeps all rows from both datasets, filling in nulls where no match exists.

Advanced Options

  • Auto Convert: Automatically convert data types of join columns if they're incompatible.
  • Fuzzy Merge: Enables matching for similar, but not identical values.
    • Fuzzy Threshold: Set the similarity threshold for fuzzy matching (0-100).
    • Fuzzy Distance Method: Choose the algorithm for calculating string similarity.

Fuzzy Merge

Fuzzy merging is particularly useful when datasets contain similar but not identical values due to typos, formatting inconsistencies, or varied data entry practices.

Fuzzy Distance Methods

Click to see available fuzzy distance methods
  • Ratio: Standard Levenshtein distance.
  • Partial Ratio: Matches substrings.
  • Token Sort Ratio: Compares sorted word lists.
  • Token Set Ratio: Compares unique word sets.
  • WRatio: Advanced algorithm considering string length and matching characters.
  • QRatio: Fast, less accurate alternative to WRatio.
  • UWRatio: More accurate version of WRatio.
  • Jaro: Accounts for character similarity and transpositions.
  • Jaro-Winkler: Prioritizes matches at the start of strings.
  • Cosine: Compares vector representations of strings.
  • Hamming: Counts differing positions in equal-length strings.
  • Longest Common Substring: Finds the longest shared sequence.
  • N-gram: Compares character sequences.
  • Soundex: Phonetic comparison of names.
  • Metaphone: Improved phonetic comparison.
  • Double Metaphone: Handles multiple pronunciations across languages.

Examples

Here are some examples of how to use the Merge transform:

Example 1: Basic Inner Join

Input Datasets:

Dataset 1:

idvalue1
1A
2B
3C

Dataset 2:

idvalue2
2X
3Y
4Z

Configuration:

  • Left On: id
  • Right On: id
  • Join Type: Inner

Result:

idvalue1value2
2BX
3CY
Example 2: Left Join with Fuzzy Matching

Input Datasets:

Dataset 1:

namescore
John85
Mary92
Michael78

Dataset 2:

employeedepartment
JonSales
Mary AnnMarketing
MikeIT

Configuration:

  • Left On: name
  • Right On: employee
  • Join Type: Left
  • Fuzzy Merge: Enabled
  • Fuzzy Threshold: 80
  • Fuzzy Distance Method: Jaro-Winkler

Result:

namescoreemployeedepartment
John85JonSales
Mary92Mary AnnMarketing
Michael78MikeIT
tip

When using fuzzy merge, experiment with different thresholds and distance methods to find the best balance between matching accuracy and false positives.

caution

Fuzzy merging can be computationally intensive for large datasets. Consider using it selectively or on smaller subsets of your data for optimal performance.