This node splits the string content of a selected column into logical groups using regular expressions. A capturing group is delimited by a pair of parentheses, and the pattern inside the parentheses is itself a regular expression. Optionally, a group can be named. See Pattern for more information. For each input string, the text matched by each capturing group becomes an output value. These values can be appended to the table in different ways; by default, every group corresponds to one additional output column.
A short introduction to groups and capturing is given in the Java API. Some examples are given below:
Parsing Patent Numbers
Patent identifiers such as "US5443036-X21" consist of a country code of at most two letters ("US"), a patent number ("5443036"), and possibly an application code ("X21"), which is separated from the number by a dash or a space character. They can be grouped by the expression ([A-Za-z]{1,2})([0-9]+)[ \-]?(.*$). Each of the parenthesized terms corresponds to one of the aforementioned properties. For named output columns, we can add group names to the pattern:
- (?<CC>[A-Za-z]{1,2}) is now identified with "CC" in the output.
- (?<patentNumber>[0-9]+) is now identified with "patentNumber".
- [ \-]? is not, and never was, a capturing group, so it remains unchanged.
- (?<applicationCode>.*$) is now identified with "applicationCode".
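The named pattern above can be tried outside the node with plain java.util.regex. The following is a minimal sketch, not the node's implementation: the class name PatentGroups and the helper split are hypothetical, and requiring the whole string to match (Matcher.matches()) is an assumption about how the pattern is applied.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PatentGroups {
    // Pattern from the text with the three group names added.
    static final Pattern PATENT = Pattern.compile(
        "(?<CC>[A-Za-z]{1,2})(?<patentNumber>[0-9]+)[ \\-]?(?<applicationCode>.*$)");

    // Returns {CC, patentNumber, applicationCode}, or null if the input does not match.
    // Assumption: the whole input must match the pattern.
    static String[] split(String input) {
        Matcher m = PATENT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group("CC"), m.group("patentNumber"), m.group("applicationCode")
        };
    }

    public static void main(String[] args) {
        // Prints: US | 5443036 | X21
        System.out.println(String.join(" | ", split("US5443036-X21")));
    }
}
```

An identifier without an application code, such as "US5443036", still matches; the last group is then the empty string.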
Strip File URLs
This example is particularly useful when the node is used to parse the file URL of a file reader node (the URL is exposed as a flow variable and then exported to a table using a Variable to Table node). The format of such URLs is similar to "file:c:\some\directory\foo.csv". The pattern [A-Za-z]*:(.*[/\\])(?<filename>([^\.]*)\.(.*$)) generates four groups. The first group, (.*[/\\]), identifies the directory. It consumes all characters up to and including the final slash or backslash; in the example, this is "c:\some\directory\". The second group represents the file name and encloses the third and fourth groups. The third group, ([^\.]*), consumes all characters after the directory that are not a dot '.' ("foo" in the above example). The pattern then expects a single dot, which is not captured by a group of its own, followed by the fourth group, (.*$), which reads until the end of the string and holds the file suffix ("csv"). The groups for the above example are:
- Group 1 : c:\some\directory\
- Group filename : foo.csv
- Group 3 : foo
- Group 4 : csv
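The group numbering for the file URL pattern can be checked with a short java.util.regex sketch. The class name FileUrlGroups and the helper split are hypothetical, and applying the pattern to the whole string with Matcher.matches() is an assumption. Note that in Java a named group also receives a number, so the filename group is reachable both by name and as group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FileUrlGroups {
    // Pattern from the text; backslashes are doubled for the Java string literal.
    static final Pattern FILE_URL = Pattern.compile(
        "[A-Za-z]*:(.*[/\\\\])(?<filename>([^\\.]*)\\.(.*$))");

    // Returns {group 1, group "filename", group 3, group 4}, or null if there is no match.
    static String[] split(String url) {
        Matcher m = FILE_URL.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group("filename"), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Prints the directory, the file name, the base name and the suffix, one per line.
        for (String group : split("file:c:\\some\\directory\\foo.csv")) {
            System.out.println(group);
        }
    }
}
```

Because [/\\] sits inside the first pair of parentheses, the directory group keeps its trailing separator.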
Email Address Extraction
Let's consider a scenario where you have a list of email addresses. Using the pattern (?<username>.+)@(?<domain>.+), you can extract the username and domain from the addresses. The groups for the email address "john.doe@example.com" are:
- Group username : john.doe
- Group domain : example.com
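As a rough illustration of applying the email pattern to a whole list, the sketch below turns each matching address into a row of two values. The class name EmailGroups, the helper split, and the choice to skip non-matching inputs are assumptions made for this example and do not describe the node's own handling of non-matching rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class EmailGroups {
    static final Pattern EMAIL = Pattern.compile("(?<username>.+)@(?<domain>.+)");

    // Builds one row {username, domain} per matching address.
    // Assumption: inputs that do not match are skipped.
    static List<String[]> split(List<String> addresses) {
        List<String[]> rows = new ArrayList<>();
        for (String address : addresses) {
            Matcher m = EMAIL.matcher(address);
            if (m.matches()) {
                rows.add(new String[] { m.group("username"), m.group("domain") });
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints: john.doe | example.com
        for (String[] row : split(List.of("john.doe@example.com", "not-an-address"))) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

Since .+ is greedy, an input containing several '@' characters is split at the last one; a stricter pattern such as [^@]+ for the username would be needed to reject such inputs.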