Regex Split


This node splits the string content of a selected column into logical groups using a regular expression. A group is delimited by a pair of parentheses, and the pattern inside the parentheses is itself a regular expression. The content of each group is appended to the table as a separate column.

A short introduction to groups and capturing is given in the Java API documentation. Some examples are given below:

Parsing Patent Numbers

Patent identifiers such as "US5443036-X21" consist of a country code of at most two letters ("US"), a patent number ("5443036"), and optionally an application code ("X21") that is separated from the number by a dash or a space character. Such identifiers can be split with the expression ([A-Za-z]{1,2})([0-9]*)[ \-]*(.*$). Each of the parenthesized terms corresponds to one of these three parts.
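The grouping can be tried outside the node with plain java.util.regex. The following is a minimal sketch, not the node's actual implementation: the class name PatentSplit and the helper split are hypothetical, and it assumes the pattern has to match the whole string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatentSplit {
    // Pattern from the text above; in Java source each backslash is doubled.
    static final Pattern PATENT =
        Pattern.compile("([A-Za-z]{1,2})([0-9]*)[ \\-]*(.*$)");

    // Hypothetical helper: returns the three groups, or null if there is no match.
    static String[] split(String id) {
        Matcher m = PATENT.matcher(id);
        if (!m.matches()) { // assumption: the whole string must match
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split("US5443036-X21");
        // Prints: US | 5443036 | X21
        System.out.println(String.join(" | ", parts));
    }
}
```

Because the separator class [ \-]* sits outside any parentheses, the dash or space is consumed but does not end up in a group.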

Strip File URLs

The node is particularly useful for parsing the file URL of a file reader node (the URL is exposed as a flow variable and then exported to a table using a Variable to Table node). The format of such URLs is similar to "file:c:\some\directory\foo.csv". The pattern [A-Za-z]*:(.*[/\\])(([^\.]*)\.(.*$)) generates four groups, which are numbered by counting the opening parentheses from left to right.

The first group, "(.*[/\\])", identifies the directory. It consumes all characters up to and including the final slash or backslash; in the example this is "c:\some\directory\". The second group represents the complete file name and encloses the third and fourth groups. The third group, "([^\.]*)", consumes all characters after the directory that are not a dot '.' ("foo" in the above example). The pattern then expects a single dot, which is not part of the third or fourth group, followed by the fourth group, "(.*$)", which reads until the end of the string and yields the file suffix ("csv"). The groups for the above example are:

  1. c:\some\directory\
  2. foo.csv
  3. foo
  4. csv
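The four groups can be reproduced with plain java.util.regex. This is a minimal sketch rather than the node's own code: the class name FileUrlSplit and the helper split are hypothetical, and it again assumes the pattern has to match the whole string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileUrlSplit {
    // Pattern from the text above; in Java source each backslash is doubled,
    // so the regex class [/\\] is written as "[/\\\\]".
    static final Pattern FILE_URL =
        Pattern.compile("[A-Za-z]*:(.*[/\\\\])(([^\\.]*)\\.(.*$))");

    // Hypothetical helper: returns all capture groups, or null if there is no match.
    static String[] split(String url) {
        Matcher m = FILE_URL.matcher(url);
        if (!m.matches()) { // assumption: the whole string must match
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i); // group 0 is the whole match, so skip it
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = split("file:c:\\some\\directory\\foo.csv");
        for (int i = 0; i < groups.length; i++) {
            System.out.println((i + 1) + ". " + groups[i]);
        }
    }
}
```

Note that the first group keeps the trailing separator, because the slash or backslash sits inside its parentheses. For a name with several dots, such as "foo.tar.gz", the third group stops at the first dot, so the fourth group is "tar.gz".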

Input Ports

  1. Port Type: Data
    Input table with string column to be split.

Output Ports

  1. Port Type: Data
    Input table extended by additional columns, one for each pattern group.