String Splitter (Regex)

This node splits the string content of a selected column into logical groups using regular expressions. A capturing group is usually delimited by a pair of parentheses, and the pattern inside the parentheses is itself a regular expression. Optionally, a group can be named. See Pattern for more information. For each input string, the captured groups are the output values. These can be appended to the table in different ways; by default, each group corresponds to one additional output column.

A short introduction to groups and capturing is given in the Java API. Some examples are given below:

Parsing Patent Numbers

Patent identifiers such as "US5443036-X21" consist of a country code of at most two letters ("US"), a patent number ("5443036"), and optionally an application code ("X21") that is separated from the number by a dash or a space character. They can be grouped by the expression ([A-Za-z]{1,2})([0-9]+)[ \-]?(.*$). Each of the parenthesized terms corresponds to one of the aforementioned properties. To obtain named output columns, group names can be added to the pattern:

(?<CC>[A-Za-z]{1,2}) is now identified with "CC" in the output.
(?<patentNumber>[0-9]+) is now identified with "patentNumber".
[ \-]? is not a capturing group, so it remains unchanged.
(?<applicationCode>.*$) is now identified with "applicationCode".

Named and unnamed groups can also be mixed in one pattern.
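
The named pattern above can be tried outside the node with plain java.util.regex. The following is a minimal sketch, not the node's implementation: the class name PatentSplitExample and the method split are hypothetical, and requiring the whole string to match (Matcher.matches()) as well as returning an empty map for non-matching input are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PatentSplitExample {
    // The pattern from the text with all three capturing groups named.
    // In a Java string literal the backslash has to be doubled.
    static final Pattern PATENT = Pattern.compile(
        "(?<CC>[A-Za-z]{1,2})(?<patentNumber>[0-9]+)[ \\-]?(?<applicationCode>.*$)");

    // Maps each group name to its captured value.
    // Assumption: non-matching input yields an empty map.
    static Map<String, String> split(String input) {
        Map<String, String> groups = new LinkedHashMap<>();
        Matcher m = PATENT.matcher(input);
        if (m.matches()) {
            groups.put("CC", m.group("CC"));
            groups.put("patentNumber", m.group("patentNumber"));
            groups.put("applicationCode", m.group("applicationCode"));
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints {CC=US, patentNumber=5443036, applicationCode=X21}
        System.out.println(split("US5443036-X21"));
    }
}
```

Because the separator and the application code are both optional in the pattern, an identifier without an application code, such as "US5443036", still matches in this sketch and yields an empty string for applicationCode.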

Strip File URLs

Splitting is particularly useful when this node is used to parse the file URL of a file reader node (the URL is exposed as a flow variable and then exported to a table using a Variable to Table node). The format of such URLs is similar to "file:c:\some\directory\foo.csv". Using the pattern [A-Za-z]*:(.*[/\\])(?<filename>([^\.]*)\.(.*$)) generates four groups.

The first group identifies the directory and is denoted by (.*[/\\]). It consumes all characters up to and including the final slash or backslash; in the example, this is "c:\some\directory\". The second group represents the file name and encloses the third and fourth groups. The third group, ([^\.]*), consumes all characters after the directory that are not a dot '.' (which is "foo" in the above example). The pattern then expects a single dot, which is not captured on its own, and finally the fourth group (.*$), which reads until the end of the string and yields the file suffix ("csv"). The groups for the above example are:

Group 1 : c:\some\directory\
Group filename : foo.csv
Group 3 : foo
Group 4 : csv
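
The group numbering of the nested pattern can be checked with a short java.util.regex sketch. The class name FileUrlSplitExample and the method split are hypothetical, and matching against the whole string is an assumption about how the pattern is applied; groups are numbered by the position of their opening parenthesis, so the named group "filename" is also group 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FileUrlSplitExample {
    // The pattern from the text; every backslash is doubled for the Java string literal.
    static final Pattern FILE_URL = Pattern.compile(
        "[A-Za-z]*:(.*[/\\\\])(?<filename>([^\\.]*)\\.(.*$))");

    // Returns the four groups in order: directory, file name, base name, suffix.
    // Assumption: non-matching input yields an empty list.
    static List<String> split(String url) {
        List<String> groups = new ArrayList<>();
        Matcher m = FILE_URL.matcher(url);
        if (m.matches()) {
            groups.add(m.group(1));          // directory, including the final separator
            groups.add(m.group("filename")); // same capture as m.group(2)
            groups.add(m.group(3));          // file name up to the first dot
            groups.add(m.group(4));          // everything after that dot
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints [c:\some\directory\, foo.csv, foo, csv]
        System.out.println(split("file:c:\\some\\directory\\foo.csv"));
    }
}
```

Since the third group stops at the first dot, a name with several dots such as "data.tar.gz" is split into "data" and "tar.gz" in this sketch.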

Email Address Extraction

Let's consider a scenario where you have a list of email addresses. Using the pattern (?<username>.+)@(?<domain>.+), you can extract the username and domain from the addresses. The groups for the email address "john.doe@example.com" are:

Group username : john.doe
Group domain : example.com
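
The email pattern can be exercised the same way. This is a sketch under assumptions: the class name EmailSplitExample and the method split are hypothetical, the whole string is required to match, and returning null for a non-matching string is a choice made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class EmailSplitExample {
    // The pattern from the text with its two named groups.
    static final Pattern EMAIL = Pattern.compile("(?<username>.+)@(?<domain>.+)");

    // Returns {username, domain}, or null if the input does not match (assumption).
    static String[] split(String address) {
        Matcher m = EMAIL.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("username"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] parts = split("john.doe@example.com");
        // prints john.doe | example.com
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Both groups use .+, so each side of the "@" must contain at least one character; a string such as "@example.com" does not match in this sketch.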

Node details


Input ports

Type: Table
Data Table
Input table with string column to be split.

Output ports

Type: Table
Input with split columns
Input table with additional column(s) and potentially duplicated rows representing the pattern groups. See "Output matched groups as" for more details.
