The component trains an STM topic model via unsupervised learning. It integrates with the R implementation of Structural Topic Models (STM), following Roberts, Stewart and Tingley, Journal of Statistical Software (2019) (cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf), via the R library 'stm' (cran.r-project.org/web/packages/stm).
On its first execution the component automatically installs R and all the required libraries. For this to work you need to install Conda (we recommend Miniconda, available at "docs.conda.io/en/latest/miniconda.html"). KNIME Analytics Platform can usually detect the default Conda installation path automatically. You can verify that KNIME Analytics Platform is using the correct path via "File > Preferences > KNIME > Conda".
DISCLAIMER: this component won't work on Apple M1 systems as the 'stm' package is not available for 'osx-arm64' via 'conda-forge' ("anaconda.org/conda-forge/r-stm"). For Apple Intel systems manual installation of additional software might be required after the Conda Environment Propagation node executes. For details visit: docs.knime.com/latest/r_installation_guide
Use the component settings to select a column of the Document type provided by the KNIME Textprocessing Extension. To create such a column, apply the Strings to Document node, along with any other preprocessing required (stop word removal, stemming, ...), upstream of this component.
Given K, the number of topics to create, the component returns the predicted topic for each document as well as a set of terms representing each of the K topics.
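The description above does not say how the predicted topic is derived. A common convention, assumed here, is to pick the topic with the largest share in each document's row of the document-topic proportion matrix (the fitted 'stm' model exposes this matrix as theta). The function name and the sample values below are hypothetical and only illustrate that convention:

```python
def predicted_topics(theta):
    """Return the 1-based index of the dominant topic for each document.

    theta is a list of rows, one per document, each holding K topic
    proportions. Assumption: "predicted topic" means the arg-max of the row.
    Ties resolve to the lowest topic index.
    """
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in theta]


# Hypothetical proportions for 2 documents and K = 3 topics.
example_theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
print(predicted_topics(example_theta))
```

Topic indices are 1-based here to match R's indexing; the component's actual output column may label topics differently.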
Optionally, you can provide metadata fields and metadata columns to the algorithm. Metadata fields are extracted from the selected document column. Metadata columns are simply additional columns you provide at the input.
Make sure to provide an operator (+, -, /, *) for the automated 'Prevalence Formula' when you provide more than one metadata field or column.
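To illustrate why an operator is needed, here is a minimal sketch of how several metadata names could be joined into a one-sided R-style formula of the kind 'stm' accepts for its prevalence argument. The helper name and the metadata names "rating" and "day" are hypothetical, and the component's own formula builder may behave differently:

```python
ALLOWED_OPERATORS = {"+", "-", "/", "*"}


def build_prevalence_formula(terms, operator="+"):
    """Join metadata names into a one-sided formula string such as '~ a + b'.

    Illustrative helper only; it is not part of the component or of 'stm'.
    """
    if not terms:
        raise ValueError("at least one metadata field or column is required")
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    # With a single term the operator is never used, which is why it only
    # matters when more than one field or column is provided.
    return "~ " + f" {operator} ".join(terms)


print(build_prevalence_formula(["rating", "day"], "+"))  # ~ rating + day
print(build_prevalence_formula(["rating"]))              # ~ rating
```

In R formula syntax the operators carry different meanings ("+" adds terms, "*" adds terms together with their interaction), so the choice changes the model being fitted.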
- Type: Table. Document Table: Data table with the document collection to analyze in the KNIME Textprocessing column type (use the 'Strings to Document' node first). Each row contains one document. Documents can be pre-processed (stop word removal, stemming, ...).