Text Cleanup Transform
The Text Cleanup transform cleans text columns in your dataset. It converts emojis to text names, removes punctuation (except for characters you choose to keep), normalizes whitespace, and replaces null values.
Basic Usage
To clean up text data in your dataset:
- Select the Text Cleanup transform from the transform menu.
- Choose the text columns you want to clean.
- (Optional) Configure advanced options for more precise control.
- Apply the transformation.
Configuration Options
Basic Options
- Select Columns: Choose one or more text columns to clean. Only string (object) and categorical columns will be available for selection.
Advanced Options
- Retained Characters: Specify characters to keep in the text after cleaning. Enter characters without spaces (e.g., "!?,"). If left empty, all standard punctuation will be removed.
- Null Value Replacement: Enter a string to replace null values. Defaults to an empty string.
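To make these options concrete, here is a minimal sketch of how they might act on a single cell value. It is not the transform's actual implementation: the `clean_text` function, its parameter names, and the `EMOJI_NAMES` lookup table are hypothetical, the table covers only the emojis used in the examples on this page, "standard punctuation" is assumed to mean ASCII punctuation, and nulls are assumed to arrive as `None`.

```python
import re
import string

# Hypothetical lookup table covering only the emojis used in this page's
# examples; the transform's real emoji handling is not documented here.
EMOJI_NAMES = {
    "😍": "smilingfacewithhearteyes",
    "😐": "neutralface",
    "😡": "enragedface",
    "👋": "wavinghand",
    "🙂": "slightlysmilingface",
}


def clean_text(value, retained="", null_replacement=""):
    """Approximate the documented options for a single cell value."""
    # Null Value Replacement: assumes nulls are represented as None.
    if value is None:
        return null_replacement
    text = str(value)
    # Emoji conversion: swap each known emoji for its text name.
    for symbol, name in EMOJI_NAMES.items():
        text = text.replace(symbol, name)
    # Punctuation handling: drop ASCII punctuation unless it is retained.
    drop = set(string.punctuation) - set(retained)
    text = "".join(ch for ch in text if ch not in drop)
    # Whitespace normalization: collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Not bad at all. 😐", retained="!."))  # Not bad at all. neutralface
print(clean_text("jane_doe", retained="!."))            # janedoe
print(clean_text(None, null_replacement="Anonymous"))   # Anonymous
```

Leaving `retained` empty in this sketch removes every ASCII punctuation character, which mirrors the documented default for Retained Characters.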
Examples
Here are some examples of how to use the Text Cleanup transform:
Example 1: Cleaning Product Reviews
Input Dataset:
| Review Text | User Name | Rating |
|---|---|---|
| Awesome 😍!!! | john123 | 5 |
| Not bad at all. 😐 | jane_doe | 4 |
| Terrible product 😡 | null | 1 |
Configuration:
- Select Columns: Review Text, User Name
- Retained Characters: !.
- Null Value Replacement: Anonymous
Result:
| Review Text | User Name | Rating |
|---|---|---|
| Awesome smilingfacewithhearteyes!!! | john123 | 5 |
| Not bad at all. neutralface | janedoe | 4 |
| Terrible product enragedface | Anonymous | 1 |
Example 2: Cleaning Chat Messages
Input Dataset:
| Message | Sender | Timestamp |
|---|---|---|
| Hi there! 👋 | Alice | 10:00 AM |
| How are you? 🙂 | Bob | 10:01 AM |
| I'm good, thanks! | null | 10:02 AM |
Configuration:
- Select Columns: Message, Sender
- Retained Characters: ?!
- Null Value Replacement: Unknown
Result:
| Message | Sender | Timestamp |
|---|---|---|
| Hi there! wavinghand | Alice | 10:00 AM |
| How are you? slightlysmilingface | Bob | 10:01 AM |
| Im good thanks! | Unknown | 10:02 AM |
When cleaning text data, consider which punctuation marks are important for maintaining the meaning of your text. Use the "Retained Characters" option to keep these specific characters.
Text cleanup can significantly alter your data. Always review the results to ensure that important information hasn't been inadvertently removed or changed.
Best Practices
- Preserve Original Data: Consider creating new columns for cleaned text rather than overwriting existing ones, especially when working with important textual data.
- Consistent Cleaning: Apply the same cleaning rules across all relevant text columns to maintain consistency in your dataset.
- Handle Emojis Carefully: Be aware that emoji conversion can change the length and meaning of text. Ensure this doesn't negatively impact your analysis.
- Check for Data Loss: After cleaning, verify that no crucial information has been lost, especially when removing punctuation or converting special characters.
- Null Value Handling: Choose your null value replacement carefully. In some cases, it might be better to keep nulls as is rather than replacing them with a string.
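Two of these practices, preserving the original data and checking for data loss, can be sketched in a few lines of plain Python. The helper names `add_cleaned_column` and `removed_characters`, the `_clean` suffix, and the stand-in `cleaner` are all hypothetical and exist only for illustration. Rows are modeled as dictionaries rather than as the transform's own dataset type.

```python
from collections import Counter


def add_cleaned_column(rows, column, cleaner, suffix="_clean"):
    """Return new rows with the cleaned value stored next to the original."""
    return [{**row, column + suffix: cleaner(row[column])} for row in rows]


def removed_characters(original, cleaned):
    """Count characters present in the original but missing after cleaning."""
    return Counter(original) - Counter(cleaned)


rows = [{"Message": "I'm good, thanks!"}]


def cleaner(text):
    # Stand-in cleaner for illustration: drops apostrophes and commas only.
    return text.replace("'", "").replace(",", "")


result = add_cleaned_column(rows, "Message", cleaner)
print(result[0]["Message"])        # original value is still available
print(result[0]["Message_clean"])  # Im good thanks!
print(removed_characters(result[0]["Message"], result[0]["Message_clean"]))
```

Because the original column is kept, the removed-character counts can be reviewed before deciding whether the cleaned column is safe to use in place of the original.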