This component computes several metrics for topics created by the Topic Extractor (Parallel LDA) node and the Topic Extractor (STM) component. The metrics listed below are computed from a table of pre-processed documents and a table of weighted terms for each topic. You can provide the topics of a single model or of multiple models.
Take a look at the example workflows at the bottom of this page to learn how to concatenate topics from different models trained on the same corpus of documents or add a ‘model ID’ to the output of the Topic Extractor (Parallel LDA) node.
DISCLAIMER: this verified component is currently marked as part of KNIME Labs (knime.com/knime-labs). Provide feedback at upskilling@knime.com
Topic Semantic Coherence score:
This component calculates a semantic coherence score for each topic. Semantic coherence measures how coherent a topic is by checking whether its top terms appear together in the same documents more often than not. This experimental implementation is based on the paper by Mimno et al. (2011) [dl.acm.org/doi/10.5555/2145432.2145462].
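The co-occurrence idea can be illustrated with a minimal Python sketch of the coherence formula from Mimno et al. (2011), which sums log((D(v_m, v_l) + 1) / D(v_l)) over pairs of top terms, where D counts the documents containing the given terms. This is not the component's own implementation: the function name, the representation of documents as sets of terms, and the choice to skip terms missing from the corpus are assumptions made for illustration.

```python
import math


def semantic_coherence(top_terms, documents):
    """Coherence of one topic in the style of Mimno et al. (2011).

    top_terms: list of terms ordered by decreasing weight (hypothetical input).
    documents: list of sets, each holding the terms of one document.
    """
    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            v_m, v_l = top_terms[m], top_terms[l]
            # D(v_l): number of documents containing the higher-ranked term
            d_l = sum(1 for doc in documents if v_l in doc)
            # D(v_m, v_l): number of documents containing both terms
            d_ml = sum(1 for doc in documents if v_l in doc and v_m in doc)
            if d_l == 0:
                continue  # assumption: ignore terms that never occur
            score += math.log((d_ml + 1) / d_l)
    return score


docs = [{"cat", "dog"}, {"cat", "dog"}, {"cat", "fish"}, {"car", "road"}]
coherent = semantic_coherence(["cat", "dog"], docs)
incoherent = semantic_coherence(["cat", "road"], docs)
print(coherent, incoherent)
```

Scores are at most slightly above zero and become more negative as the top terms co-occur less often, so values closer to zero indicate a more coherent topic.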
Topic Exclusivity score:
This component calculates the exclusivity of topics. Exclusivity is computed using an experimental implementation of the FREX function by Bischof and Airoldi (2012) [dl.acm.org/doi/10.5555/3042573.3042578]. FREX takes into account not only how exclusive/unique terms are between different topics (top terms table), but also how rare those terms are in documents of the same topic (documents table).
When comparing multiple models, different models can assign the same document to different topics, so exclusivity can only be computed from how unique terms are in the topics' top terms table. Read more in the description of the “Ignore Assigned Topic Column” setting.
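As a rough illustration, the sketch below computes a FREX-style score from term weights alone, which is closest to the case where only the top terms table is used. It follows the commonly cited form of FREX, the weighted harmonic mean of the empirical CDF of a term's exclusivity and the empirical CDF of its weight within a topic. The function names, the nested-dictionary input, and the default weight `w=0.5` are assumptions for illustration and do not describe how the component is built.

```python
def ecdf(values):
    """Empirical CDF of each entry: share of entries less than or equal to it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frex(topic_term_weights, w=0.5):
    """FREX-style score per topic and term.

    topic_term_weights: dict topic -> dict term -> weight (hypothetical layout).
    w: balance between exclusivity (w) and frequency (1 - w), an assumed default.
    """
    topics = list(topic_term_weights)
    terms = sorted({t for k in topics for t in topic_term_weights[k]})
    # Total weight of each term across all topics
    totals = {t: sum(topic_term_weights[k].get(t, 0.0) for k in topics) for t in terms}
    scores = {}
    for k in topics:
        freq = [topic_term_weights[k].get(t, 0.0) for t in terms]
        # Exclusivity: share of a term's total weight that belongs to topic k
        excl = [f / totals[t] if totals[t] > 0 else 0.0 for f, t in zip(freq, terms)]
        freq_cdf, excl_cdf = ecdf(freq), ecdf(excl)
        scores[k] = {
            t: 1.0 / (w / e + (1 - w) / f)
            for t, e, f in zip(terms, excl_cdf, freq_cdf)
        }
    return scores


weights = {
    "A": {"cat": 0.5, "the": 0.4, "dog": 0.1},
    "B": {"car": 0.5, "the": 0.4, "road": 0.1},
}
print(frex(weights)["A"])
```

In this toy input, "the" carries a high weight in both topics, so it scores lower than "cat", which is both heavily weighted and unique to topic A.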
Topic Neighbor Distance score:
This component computes an experimental distance between topics within the same model or across several models. Each topic is represented as a normalized vector obtained by pivoting the top terms by topic table, and the cosine distance between these topic vectors is computed. For each topic, this distance is used to identify the closest and farthest topic within one model or across multiple models.
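The pivot-and-compare procedure can be sketched in plain Python as follows. The row layout `(topic_id, term, weight)` and all function names are assumptions for illustration. The component itself is a KNIME workflow, so this only mirrors the idea described above.

```python
import math


def topic_vectors(rows):
    """Pivot (topic_id, term, weight) rows into unit-length term vectors."""
    raw = {}
    for topic, term, weight in rows:
        raw.setdefault(topic, {})[term] = weight
    vectors = {}
    for topic, vec in raw.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[topic] = {t: w / norm for t, w in vec.items()} if norm else dict(vec)
    return vectors


def cosine_distance(u, v):
    """Cosine distance between two unit-length sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    return 1.0 - dot


def nearest_and_farthest(vectors):
    """For each topic, return (closest topic, farthest topic)."""
    result = {}
    for a in vectors:
        dists = {b: cosine_distance(vectors[a], vectors[b]) for b in vectors if b != a}
        result[a] = (min(dists, key=dists.get), max(dists, key=dists.get))
    return result


rows = [
    ("T1", "cat", 0.6), ("T1", "dog", 0.4),
    ("T2", "cat", 0.5), ("T2", "fish", 0.5),
    ("T3", "car", 0.7), ("T3", "road", 0.3),
]
print(nearest_and_farthest(topic_vectors(rows)))
```

Topics that share no top terms have a distance of 1, while topics with identical normalized weights have a distance of 0. When comparing several models, the topic identifier would need to include the model ID so that topics from different models stay distinct.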
- Documents Table (Type: Table): The pre-processed documents from the corpus used to train the topic model. They can be the ones used in training or a hold-out sample. The documents should be in the KNIME Textprocessing format (use the Strings to Document node).
- Top Terms by Topics Table (Type: Table): The topics' top terms table created either by the Topic Extractor (Parallel LDA) node or the Topic Extractor (STM) component (second table output). This table should list the top weighted terms for each topic for one or more models. If you concatenate topics from different models, make sure to add a column for the model ID.