v1.10 - 10 January 2022
PLEASE NOTE THIS COMPONENT IS STILL A PROTOTYPE AND SUBJECT TO CHANGE - FEEDBACK WELCOME!
Python Script updated with some improvements, and also recoded so that it can be tested (in part) outside of KNIME (e.g. using VSCode) for easier development.
Reads the supplied XML file. The specified path is first treated as a local file system path; if that fails, the component attempts to read it as a URL.
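The file-then-URL fallback described above could be sketched roughly as follows. This is only an illustration and not the component's actual script; the function name read_xml_source is hypothetical.

```python
import urllib.request


def read_xml_source(location):
    """Return the raw bytes of an XML document given a file path or a URL.

    Hypothetical helper for illustration only; the component's own script
    may handle this differently.
    """
    try:
        # First attempt: treat the location as a local file system path.
        with open(location, "rb") as handle:
            return handle.read()
    except OSError:
        # Fallback: treat the very same string as a URL.
        with urllib.request.urlopen(location) as response:
            return response.read()
```

If the string is neither a readable file nor a usable URL, the error raised by the second attempt is the one the caller sees.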
This component uses Python 3, so Python 3 must be installed and available in your KNIME environment. It makes use of the following Python modules: cElementTree, pandas, urllib.
The XML data is output in grouped tabular format, which means that the rows should be ungrouped (use an Ungroup node). Data items that are expected to be repeated across all rows of a "group" should be excluded from the selection of columns to be ungrouped. That way, repeated data is "copied down" across the ungrouped rows where appropriate.
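To show what ungrouping does to one grouped row, here is a small plain-Python sketch. It only mimics the idea behind the Ungroup node; the function name ungroup and the sample values are made up for illustration.

```python
def ungroup(row, list_columns):
    """Expand the list-valued columns of one grouped row into several rows.

    Columns that are not named in list_columns are copied down unchanged.
    """
    n_rows = max((len(row[name]) for name in list_columns), default=0)
    result = []
    for index in range(n_rows):
        new_row = dict(row)  # start with every column "copied down"
        for name in list_columns:
            values = row[name]
            # Shorter lists are padded with None (an assumption of this sketch).
            new_row[name] = values[index] if index < len(values) else None
        result.append(new_row)
    return result


# Illustrative grouped row: one order id, two items.
grouped = {
    "orderid": "A1",
    "title": ["First item", "Second item"],
    "quantity": ["1", "3"],
}
rows = ungroup(grouped, ["title", "quantity"])
```

Here "orderid" is left out of the ungrouped columns, so its value appears on both resulting rows.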
The columns and their paths are output on the "Column Paths" port and on the "Path to Column Mapping" port. The Column Paths port is "by column name", so if there is a column-name clash (which can occur if more than one element in the XML has the same element name) the rows on this port will be incomplete, as will the resulting data output.
The "Path to Column Mapping" port shows the same information, but is "path centric" and so will contain any columns for which "name clash" has occurred.
The "Column Name Clash" port identifies clashing names. It returns no data if no name clash occurred, so it can be used to quickly verify that all expected columns have been handled correctly.
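The idea of a name clash can be sketched as below, assuming the default column name is simply the last segment of the path (element or attribute name). That naming rule and the helper names are assumptions made for this illustration, not a description of the component's internals.

```python
from collections import defaultdict


def column_name(path):
    """Derive a column name from a path: last segment, without a leading '@'."""
    return path.rsplit("/", 1)[-1].lstrip("@")


def find_name_clashes(paths):
    """Return {column name: [paths]} for names produced by more than one path."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[column_name(path)].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}


# Hypothetical paths: two different elements are both called "name".
example_paths = [
    "//shiporder/@orderid",
    "//shiporder/shipto/name",
    "//shiporder/item/name",
]
clashes = find_name_clashes(example_paths)
```

An empty result corresponds to the "no data on the port" case; anything else points at the paths that need distinct names in a mapping file.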
The name of a CSV "Column Name to Path" mapping file can be supplied, which allows you to specify which elements/columns to return, based on their paths. If you specify a different column name for a path in this file, the column is renamed on the output.
Paths follow a basic "pseudo xpath" format. No additional xpath syntax should be used as it will not be recognised, and will result in data in the file being ignored.
Element paths are defined by the format //element1/element2/element3
Attribute paths are defined by the format //element1/element2/element3/@attributename
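As a rough illustration of this path format, the following sketch walks an XML document with the standard library's ElementTree and lists element and attribute paths in the same notation. It is not the component's code: the function name list_paths is hypothetical, the sample XML is made up, namespaces are ignored, and every element is listed rather than only those the component would turn into columns.

```python
import xml.etree.ElementTree as ET


def list_paths(xml_text):
    """List element and attribute paths in '//a/b/c' and '//a/b/@attr' form."""
    root = ET.fromstring(xml_text)
    paths = []

    def visit(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        if path not in paths:
            paths.append(path)
        for attribute in element.attrib:
            attribute_path = f"{path}/@{attribute}"
            if attribute_path not in paths:
                paths.append(attribute_path)
        for child in element:
            visit(child, path)

    # Starting from "/" makes the root element's path begin with "//".
    visit(root, "/")
    return paths


sample = (
    '<shiporder orderid="A1">'
    "<orderperson>Someone</orderperson>"
    "<item><title>First</title></item>"
    "<item><title>Second</title></item>"
    "</shiporder>"
)
found = list_paths(sample)
```

Repeated elements such as the two item entries yield a single path, since the path identifies a column rather than an individual row.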
Rows in the Column Name - Path mapping table can be "commented out". To do this, all that is necessary is that the path be "invalidated", which can easily be achieved by, for example, adding a '#' to the end of the line.
For example, in the following mapping file the paths for the * and orderperson lines have been "invalidated", so those lines are ignored:
Column Name,Path
*,*#
Order Id,//shiporder/@orderid
orderperson,//shiporder/orderperson#
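A minimal sketch of how such a mapping file might be read, keeping only rows whose path still looks like a valid pseudo xpath. The regular expression, the function name load_mapping, and the decision to skip anything that does not match are assumptions of this sketch; in particular it does not try to interpret a bare * entry, so that row would be skipped here whether or not it ends in '#'.

```python
import csv
import io
import re

# Assumed shape of a valid path: //element[/element...][/@attribute]
PATH_PATTERN = re.compile(r"//[\w.:-]+(/[\w.:-]+)*(/@[\w.:-]+)?")


def load_mapping(csv_text):
    """Return {path: column name} for rows whose path matches PATH_PATTERN."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = (row.get("Path") or "").strip()
        name = (row.get("Column Name") or "").strip()
        if PATH_PATTERN.fullmatch(path):
            mapping[path] = name
        # Any other path (for example one ending in '#') is silently ignored.
    return mapping


example_csv = """Column Name,Path
*,*#
Order Id,//shiporder/@orderid
orderperson,//shiporder/orderperson#
"""
mapping = load_mapping(example_csv)
```

With the example file above, only the Order Id row survives, and the orderid attribute would be output under the column name "Order Id".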
The path will change if you specify a different collection subtree, and/or root. If you are having difficulty working out the correct path, execute the node and take a look in the Column Paths output port to see what the paths are with the current configuration.
v 1.0 (Prototype) @takbb Brian Bates
This is a prototype, but it is fully functioning and may well be suitable for your needs. If you wish to use it, please test it with your data to confirm that it works well for you before relying on it!
Please provide feedback on any issues found, or any suggestions for improvement, or usability.
- Data (Type: Table): The data read from the XML file, based on the column names created either from the raw element/attribute names, or from the Column Name - Path mapping file (if supplied).
- Column Paths (Type: Table): A list of column names, with their associated paths, that can be used to create a Column Name - Path mapping file.
- Path to Column Mapping (Type: Table): The mapping of paths to column names, which can be used to manually check for column name clash. If two paths result in the same column name, you should create a Column Name - Path mapping file (using the output from Port 2) and edit the column names to be returned.
- Column Name Clash (Type: Table): Shows any mappings for which "Column Name clash" has occurred. If there is any data on this port, steps should be taken to provide a suitable mapping file to resolve the issue.