Text Cleanup Transform

The Text Cleanup transform cleans the text columns in your dataset. It converts emojis to their text names, removes punctuation (keeping any characters you choose to retain), normalizes whitespace, and replaces null values.

Basic Usage

To clean up text data in your dataset:

  1. Select the Text Cleanup transform from the transform menu.
  2. Choose the text columns you want to clean.
  3. (Optional) Configure advanced options for more precise control.
  4. Apply the transformation.

Configuration Options

Basic Options

  • Select Columns: Choose one or more text columns to clean. Only string (object) and categorical columns will be available for selection.

Advanced Options

  • Retained Characters: Specify characters to keep in the text after cleaning. Enter characters without spaces (e.g., "!?,"). If left empty, all standard punctuation will be removed.
  • Null Value Replacement: Enter a string to replace null values. Defaults to an empty string.
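
The options above can be sketched in plain Python for a single cell value. This is a rough approximation, not the transform's actual implementation: the function name `clean_text`, the small `EMOJI_NAMES` table, and the choice to collapse repeated whitespace are all assumptions made for illustration.

```python
import string

# Hypothetical emoji-to-name table; a real implementation would likely rely
# on a full emoji library rather than a hand-written mapping.
EMOJI_NAMES = {
    "😍": "smilingfacewithhearteyes",
    "😐": "neutralface",
    "😡": "enragedface",
}


def clean_text(value, retained="", null_replacement=""):
    """Approximate the documented cleanup for one cell value."""
    # Null Value Replacement: nulls become the replacement string as-is.
    if value is None:
        return null_replacement
    text = str(value)
    # Emoji conversion: swap each known emoji for its text name.
    for symbol, name in EMOJI_NAMES.items():
        text = text.replace(symbol, name)
    # Retained Characters: drop standard ASCII punctuation except these.
    to_remove = "".join(ch for ch in string.punctuation if ch not in retained)
    text = text.translate(str.maketrans("", "", to_remove))
    # Whitespace normalization (assumed): collapse runs and trim the ends.
    return " ".join(text.split())


print(clean_text("Awesome 😍!!!", retained="!."))
print(clean_text("jane_doe", retained="!."))
print(clean_text(None, null_replacement="Anonymous"))
```

With an empty `retained` string, every character in `string.punctuation` is removed, which mirrors the documented behavior of leaving Retained Characters empty.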

Examples

Here are some examples of how to use the Text Cleanup transform:

Example 1: Cleaning Product Reviews

Input Dataset:

Review Text | User Name | Rating
Awesome 😍!!! | john123 | 5
Not bad at all. 😐 | jane_doe | 4
Terrible product 😡 | null | 1

Configuration:

  • Select Columns: Review Text, User Name
  • Retained Characters: !.
  • Null Value Replacement: Anonymous

Result:

Review Text | User Name | Rating
Awesome smilingfacewithhearteyes!!! | john123 | 5
Not bad at all. neutralface | janedoe | 4
Terrible product enragedface | Anonymous | 1

Example 2: Cleaning Chat Messages

Input Dataset:

Message | Sender | Timestamp
Hi there! 👋 | Alice | 10:00 AM
How are you? 🙂 | Bob | 10:01 AM
I'm good, thanks! | null | 10:02 AM

Configuration:

  • Select Columns: Message, Sender
  • Retained Characters: ?!
  • Null Value Replacement: Unknown

Result:

Message | Sender | Timestamp
Hi there! wavinghand | Alice | 10:00 AM
How are you? slightlysmilingface | Bob | 10:01 AM
Im good thanks! | Unknown | 10:02 AM
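
To illustrate how column selection interacts with the other options, the chat-message example can be reproduced with a short Python sketch. It is not the transform's real code: `clean_cell`, `clean_rows`, and the two-entry `EMOJI_NAMES` table are hypothetical names, and rows are modeled as plain dictionaries for simplicity.

```python
import string

# Hypothetical lookup covering only the emojis in this example.
EMOJI_NAMES = {"👋": "wavinghand", "🙂": "slightlysmilingface"}


def clean_cell(value, retained, null_replacement):
    """Clean one value: null replacement, emoji names, punctuation, whitespace."""
    if value is None:
        return null_replacement
    text = str(value)
    for symbol, name in EMOJI_NAMES.items():
        text = text.replace(symbol, name)
    drop = "".join(ch for ch in string.punctuation if ch not in retained)
    return " ".join(text.translate(str.maketrans("", "", drop)).split())


def clean_rows(rows, columns, retained="", null_replacement=""):
    """Clean only the selected columns; every other column is left untouched."""
    return [
        {
            key: clean_cell(val, retained, null_replacement) if key in columns else val
            for key, val in row.items()
        }
        for row in rows
    ]


rows = [
    {"Message": "Hi there! 👋", "Sender": "Alice", "Timestamp": "10:00 AM"},
    {"Message": "How are you? 🙂", "Sender": "Bob", "Timestamp": "10:01 AM"},
    {"Message": "I'm good, thanks!", "Sender": None, "Timestamp": "10:02 AM"},
]
cleaned = clean_rows(
    rows, columns={"Message", "Sender"}, retained="?!", null_replacement="Unknown"
)
for row in cleaned:
    print(row)
```

Because Timestamp is not among the selected columns, its colon survives. Had it been selected, "10:00 AM" would have become "1000 AM".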

Tip

When cleaning text data, consider which punctuation marks are important for maintaining the meaning of your text. Use the "Retained Characters" option to keep these specific characters.

Caution

Text cleanup can significantly alter your data. Always review the results to ensure that important information hasn't been inadvertently removed or changed.

Best Practices

  1. Preserve Original Data: Consider creating new columns for cleaned text rather than overwriting existing ones, especially when working with important textual data.

  2. Consistent Cleaning: Apply the same cleaning rules across all relevant text columns to maintain consistency in your dataset.

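  3. Handle Emojis Carefully: Emoji conversion replaces each emoji with its text name (for example, 😍 becomes "smilingfacewithhearteyes"), which changes the length of the text and can change its meaning. Ensure this doesn't negatively impact your analysis.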

  4. Check for Data Loss: After cleaning, verify that no crucial information has been lost, especially when removing punctuation or converting special characters.

  5. Null Value Handling: Choose your null value replacement carefully. In some cases, it is better to leave nulls as they are than to replace them with a string.
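
Several of these practices can be checked with a few lines of Python before committing to a cleanup. The sketch below is illustrative only: `strip_punctuation` and `removed_characters` are hypothetical helpers, the "Review Text (clean)" column name is made up, and only punctuation removal is modeled. It writes the cleaned text to a new column, leaves nulls untouched, and reports which characters were lost.

```python
import string
from collections import Counter


def strip_punctuation(text, retained=""):
    """Remove standard ASCII punctuation except the retained characters."""
    drop = "".join(ch for ch in string.punctuation if ch not in retained)
    return text.translate(str.maketrans("", "", drop))


def removed_characters(original, cleaned):
    """Count characters present in the original but missing after cleaning."""
    return Counter(original) - Counter(cleaned)


rows = [
    {"Review Text": "Price: $4.99 (50% off)!"},
    {"Review Text": None},
]

for row in rows:
    original = row["Review Text"]
    if original is None:
        # Keep the null as-is instead of substituting a string.
        row["Review Text (clean)"] = None
        continue
    # Write to a new column so the original text is preserved.
    row["Review Text (clean)"] = strip_punctuation(original, retained="!")
    print(row["Review Text (clean)"], dict(removed_characters(original, row["Review Text (clean)"])))
```

Here the price "$4.99" turns into "499" once "$" and "." are stripped, which is the kind of silent data loss worth catching before the cleaned column replaces the original.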