Merge Transform

The Merge transform allows you to combine two datasets based on common columns or keys. It offers multiple joining methods and advanced options, providing flexibility for efficient data merging and alignment.

Basic Usage

To merge two datasets:

caution

The Merge transform requires at least two Data Input nodes.
Make sure you have added two datasets to your pipeline before using this operation.

merge Mode

Select the Merge transform from the transform menu.
Choose the two datasets you want to merge.
Specify the columns to join on for each dataset.
Select the appropriate join type.
(Optional) Configure additional options for enhanced merging flexibility.

Configuration Options

Basic Options

Left On: Select the column(s) from the first dataset to use as the join key.
Right On: Select the column(s) from the second dataset to use as the join key.
Join Type: Choose the method for combining the datasets:
- Inner: Keeps only rows with matching keys in both datasets.
- Left: Keeps all rows from the left dataset and matching rows from the right.
- Right: Keeps all rows from the right dataset and matching rows from the left.
- Outer: Keeps all rows from both datasets, filling in nulls where no match exists.

Advanced Options

Auto Convert: Automatically convert data types of join columns if they're incompatible.
Fuzzy Merge: Enables matching for similar, but not identical values.
- Fuzzy Threshold: Set the similarity threshold for fuzzy matching (0-100).
- Fuzzy Distance Method: Choose the algorithm for calculating string similarity.

Fuzzy Merge

Fuzzy merging is particularly useful when datasets contain similar but not identical values due to typos, formatting inconsistencies, or varied data entry practices.

Fuzzy Distance Methods

Click to see available fuzzy distance methods

Ratio: Standard Levenshtein distance.
Partial Ratio: Matches substrings.
Token Sort Ratio: Compares sorted word lists.
Token Set Ratio: Compares unique word sets.
WRatio: Advanced algorithm considering string length and matching characters.
QRatio: Fast, less accurate alternative to WRatio.
UWRatio: More accurate version of WRatio.
Jaro: Accounts for character similarity and transpositions.
Jaro-Winkler: Prioritizes matches at the start of strings.
Cosine: Compares vector representations of strings.
Hamming: Counts differing positions in equal-length strings.
Longest Common Substring: Finds the longest shared sequence.
N-gram: Compares character sequences.
Soundex: Phonetic comparison of names.
Metaphone: Improved phonetic comparison.
Double Metaphone: Handles multiple pronunciations across languages.

Examples

Here are some examples of how to use the Merge transform:

Example 1: Basic Inner Join

Input Datasets:

Dataset 1:

id	value1
1	A
2	B
3	C

Dataset 2:

id	value2
2	X
3	Y
4	Z

Configuration:

Left On: id
Right On: id
Join Type: Inner

Result:

id	value1	value2
2	B	X
3	C	Y

Example 2: Left Join with Fuzzy Matching

Input Datasets:

Dataset 1:

name	score
John	85
Mary	92
Michael	78

Dataset 2:

employee	department
Jon	Sales
Mary Ann	Marketing
Mike	IT

Configuration:

Left On: name
Right On: employee
Join Type: Left
Fuzzy Merge: Enabled
Fuzzy Threshold: 80
Fuzzy Distance Method: Jaro-Winkler

Result:

name	score	employee	department
John	85	Jon	Sales
Mary	92	Mary Ann	Marketing
Michael	78	Mike	IT

tip

When using fuzzy merge, experiment with different thresholds and distance methods to find the best balance between matching accuracy and false positives.

caution

Fuzzy merging can be computationally intensive for large datasets. Consider using it selectively or on smaller subsets of your data for optimal performance.

Basic Usage​

Configuration Options​

Basic Options​

Advanced Options​

Fuzzy Merge​

Fuzzy Distance Methods​

Examples​