This workflow shows how to parse the content of files with the Tika nodes, detect the language of that content with the Tika language detector, and finally assign a part-of-speech (POS) tag to each English word found in the files. First, the Tika parser reads files from a specified directory and parses their content; any attachments or embedded files it detects are extracted as well. A language detector node then identifies the language of each file's content, and any file not written in English is filtered out. The remaining files are converted into documents, and a Stanford tagger assigns a POS tag to each term.
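The control flow described above (parse, detect language, keep only English, tag) can be sketched outside KNIME as follows. This is a minimal illustration, not the workflow's implementation: `ParsedFile`, `detect_language`, `toy_tagger`, and `tag_english_files` are hypothetical names, the stopword-based detector is only a stand-in for the Tika language detector, and the lookup-table tagger is only a stand-in for the Stanford tagger. The files are assumed to be parsed into plain text already.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class ParsedFile:
    """Hypothetical container for one already-parsed file."""
    path: str
    content: str


# Stand-in for a real language detector: tiny stopword lists (assumption).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
}

# Stand-in for a real POS tagger: a tiny lookup table (assumption).
TOY_TAGS = {"the": "DT", "cat": "NN", "is": "VBZ", "in": "IN", "house": "NN"}


def detect_language(text: str) -> str:
    """Return the language code whose stopwords occur most often in the text."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    return max(scores, key=scores.get)


def toy_tagger(text: str) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token; unknown tokens get 'UNK'."""
    return [(tok, TOY_TAGS.get(tok.lower(), "UNK")) for tok in text.split()]


def tag_english_files(
    files: Iterable[ParsedFile],
    detect: Callable[[str], str] = detect_language,
    tag: Callable[[str], List[Tuple[str, str]]] = toy_tagger,
) -> Dict[str, List[Tuple[str, str]]]:
    """Drop every file not detected as English, then tag the remaining ones."""
    tagged = {}
    for f in files:
        if detect(f.content) != "en":
            continue  # mirrors the filter step: non-English files are removed
        tagged[f.path] = tag(f.content)
    return tagged


if __name__ == "__main__":
    sample = [
        ParsedFile("a.txt", "The cat is in the house"),
        ParsedFile("b.txt", "Der Hund ist nicht das Problem"),
    ]
    for path, tokens in tag_english_files(sample).items():
        print(path, tokens)
```

Passing the detector and tagger in as callables keeps the filter logic separate from the components, so real ones could be swapped in without changing `tag_english_files`.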
Workflow
Apache Tika integration
Used extensions & nodes
Created with KNIME Analytics Platform version 4.1.0