Apache Tika is a library used mainly to detect document types and to extract text content and metadata from a wide range of file formats. Internally, Tika delegates the parsing and detection work to existing document parsers and type detection libraries, and exposes them through a single generic API that acts as a universal type detector and content extractor. For more information about Tika, see the Tika website.
This node parses any kind of document supported by Tika. The file types to parse are selected in the configuration dialog, either by file extension or by MIME type. The dialog also lets the user choose which information (metadata and content) is extracted from each file. Where possible, files embedded in the input files, such as e-mail attachments, can be extracted as well and stored in a specified directory. An authentication setting is provided for parsing encrypted files.
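The choice between filtering by file extension and filtering by MIME type can be illustrated with a minimal sketch. It is not the node's implementation: `select_files` and the sample file names are hypothetical, and the MIME type is only guessed from the file name with Python's standard `mimetypes` module, whereas Tika can also inspect file content.

```python
import mimetypes
from pathlib import PurePath


def select_files(paths, extensions=None, mime_types=None):
    """Keep the paths that match the chosen extensions OR the chosen MIME types.

    Hypothetical helper: exactly one of the two filters must be given,
    mirroring the either/or choice described for the dialog.
    """
    if (extensions is None) == (mime_types is None):
        raise ValueError("choose exactly one of extensions or mime_types")

    selected = []
    if extensions is not None:
        # Normalise "PDF", ".pdf" and "pdf" to the same form.
        wanted = {e.lower().lstrip(".") for e in extensions}
        for path in paths:
            if PurePath(path).suffix.lower().lstrip(".") in wanted:
                selected.append(path)
    else:
        for path in paths:
            # Name-based guess only; returns None for unknown extensions.
            guessed, _encoding = mimetypes.guess_type(path)
            if guessed in mime_types:
                selected.append(path)
    return selected


files = ["report.pdf", "notes.txt", "index.html", "archive.unknownext"]
by_ext = select_files(files, extensions=["pdf", "txt"])
by_mime = select_files(files, mime_types={"text/html"})
print(by_ext)
print(by_mime)
```

Filtering by extension is cheap but trusts the file name; filtering by MIME type is more robust when the type is detected from the file content rather than from the name.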
- Type: Table. Metadata output table: An output table containing the parsed document data. The columns correspond to the items selected in the Metadata list in the configuration dialog.
- Type: Table. Attachment output table: An output table containing the names of the input files that contain embedded files, together with the paths to the extracted files in the output directory.
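The shape of the attachment output table (input file name paired with the path of each extracted file) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `email` package rather than Tika, and `extract_attachments`, `mail.eml`, and `data.csv` are hypothetical names.

```python
import tempfile
from email.message import EmailMessage
from pathlib import Path


def extract_attachments(message, source_name, output_dir):
    """Save each attachment and return (input file, extracted path) rows."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = []
    for part in message.iter_attachments():
        filename = part.get_filename()
        if not filename:
            continue  # skip attachments without a usable name
        # Keep only the final name component so files stay inside output_dir.
        target = out / Path(filename).name
        target.write_bytes(part.get_payload(decode=True))
        rows.append((source_name, str(target)))
    return rows


# Build a small in-memory e-mail with one attachment (sample data).
msg = EmailMessage()
msg["Subject"] = "Quarterly report"
msg.set_content("See the attached file.")
msg.add_attachment(
    b"col1,col2\n1,2\n",
    maintype="application",
    subtype="octet-stream",
    filename="data.csv",
)

with tempfile.TemporaryDirectory() as tmp:
    rows = extract_attachments(msg, "mail.eml", tmp)
    extracted = Path(rows[0][1]).read_bytes()
    for source, path in rows:
        print(source, "->", path)
```

Each row plays the role of one line in the attachment table: the first element names the input file, the second gives the location of the extracted file.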