- Type: Table. Table with original text document: Table with the document column (not necessarily pre-processed for readability) and the topic label assigned by the model. This is usually the first output of the "Topic Extractor (Parallel LDA)" node or the "Topic Extractor (STM)" component. Each row should be a document, and the following columns should be available: the document column (from KNIME Textprocessing), the assigned topic, and the probability columns for each topic.
- Type: Table. Top Terms by Topic: Second output table of the "Topic Extractor (Parallel LDA)" node or the "Topic Extractor (STM)" component. Each row should be a term, and the following columns should be available: the term (String type), the topic id, and the weight.
This component visualizes and analyzes the results of a topic model. It is compatible with any topic model that produces a topic-term matrix and a topic-document matrix. We recommend using this component downstream from the Topic Extractor (Parallel LDA) node [kni.me/n/w7Vr1wY8Bu8Gfpv7] or the Topic Extractor (STM) component [kni.me/c/DFANPa0NHnZb9tSV]. For more details, see the port documentation.

The component's interactive view helps validate a chosen topic model solution and offers insight into the similarity between the extracted topics.

The Topic Explorer View offers two modes:
- Explore by Topic: explore the topics (second input) in a similarity bubble chart, select topics, and visualize the coherence and exclusivity scores from the Topic Scorer component (kni.me/c/5_W2h2g6hBY_M0Bc) together with the associated tag cloud. You can also scroll through the topics, each represented as a small bar chart.
- Explore by Document: explore the documents (first input) in a similarity bubble chart, select topics, and view a preview or the full text of the documents with the topic terms highlighted.

Both modes provide a similarity bubble chart in which topics or documents with higher semantic similarity are positioned closer together in 2-dimensional space. This is achieved by combining several analytics techniques:
1) For the “Explore by Topic” mode, a Word2Vec model (kni.me/n/QPMbC4vyfvPkfV8F) is used to calculate the distances between all words in the documents. These distances are then used to build a distance matrix between topics by averaging the distances of the words associated with each topic.
2) The distance matrix is then processed with Multidimensional Scaling (MDS) (kni.me/n/SCgPuzvfM-9t325D), which reduces it to two dimensions. These two dimensions serve as the coordinates of each topic in 2-dimensional space.
3) The size of each topic bubble reflects how frequent the topic is among the documents: it represents the mean probability that the input documents belong to that topic.
4) In the “Explore by Document” mode, each bubble represents a different document, and a similar approach is applied using the documents' bag of words instead of the topic model's output terms.

DISCLAIMER: With a large number of documents, this data app slows down. By default, the top 250 rows from the first (top) input and the top 10 terms per topic from the second input are considered. You can increase these numbers in the component dialog. To avoid performance issues, we advise applying stratified sampling to the first input, using the assigned topic column, in a Row Sampling node (kni.me/n/3o-UY2qMENf5piCd) before the component.

This component can be used as a data app, running either locally or on KNIME Server and KNIME Business Hub.
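To make the topic similarity step more concrete, the following is a minimal Python sketch of how a topic distance matrix could be built by averaging word distances. It is not the component's actual workflow. The toy `word_vectors` stand in for a trained Word2Vec model, the column names `term`, `topic`, and `weight` are assumed, and cosine distance with an unweighted average is an assumption, since the description only speaks of "distances".

```python
import math
from itertools import product

# Hypothetical toy word vectors standing in for a trained Word2Vec model.
word_vectors = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "vote": [0.1, 0.9],
    "party": [0.2, 0.8],
    "market": [0.5, 0.5],
}

# Rows shaped like the second input: term, topic id, weight (names assumed).
top_terms = [
    {"term": "goal", "topic": "topic_0", "weight": 0.30},
    {"term": "match", "topic": "topic_0", "weight": 0.25},
    {"term": "vote", "topic": "topic_1", "weight": 0.35},
    {"term": "party", "topic": "topic_1", "weight": 0.20},
    {"term": "market", "topic": "topic_2", "weight": 0.40},
    {"term": "vote", "topic": "topic_2", "weight": 0.10},
]


def cosine_distance(u, v):
    """Return 1 - cosine similarity (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def terms_by_topic(rows):
    """Group the terms of each topic into a list."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["topic"], []).append(row["term"])
    return grouped


def topic_distance_matrix(rows, vectors):
    """Average the pairwise word distances between every pair of topics."""
    grouped = terms_by_topic(rows)
    topics = sorted(grouped)
    matrix = {}
    for a, b in product(topics, repeat=2):
        if a == b:
            # Assumption: a topic has zero distance to itself.
            matrix[(a, b)] = 0.0
            continue
        pairs = [
            cosine_distance(vectors[s], vectors[t])
            for s in grouped[a]
            for t in grouped[b]
        ]
        matrix[(a, b)] = sum(pairs) / len(pairs)
    return topics, matrix


topics, distances = topic_distance_matrix(top_terms, word_vectors)
for a in topics:
    print(a, [round(distances[(a, b)], 3) for b in topics])
```

A matrix like this would then be passed to an MDS step to obtain the two chart coordinates per topic. The term weights are ignored in this sketch; weighting the average by them would be a reasonable variation.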
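The stratified sampling advice and the bubble size definition can be illustrated outside KNIME with a short Python sketch. It only mirrors the idea behind the Row Sampling node and is not its implementation. The column names `document` and `assigned_topic`, the probability columns `topic_0` to `topic_2`, and the 50% sampling fraction are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical rows shaped like the first input: a document, its assigned
# topic, and one probability column per topic (all column names assumed).
topic_counts = {"topic_0": 30, "topic_1": 20, "topic_2": 10}
documents = []
for topic, count in topic_counts.items():
    for i in range(count):
        row = {"document": f"{topic} text {i}", "assigned_topic": topic}
        for other in topic_counts:
            row[other] = 0.8 if other == topic else 0.1
        documents.append(row)


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction of rows from every group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Keep at least one row per topic so that no topic disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def mean_topic_probability(rows, topic_columns):
    """Mean probability per topic, as used for the bubble size."""
    return {
        col: sum(row[col] for row in rows) / len(rows)
        for col in topic_columns
    }


sample = stratified_sample(documents, "assigned_topic", fraction=0.5, seed=42)
bubble_sizes = mean_topic_probability(sample, list(topic_counts))
print(len(sample), bubble_sizes)
```

Because every topic is sampled at the same rate, the topic proportions in the sample match those of the full table, so the mean probabilities that drive the bubble sizes stay close to the values for the full data.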
Created with KNIME Analytics Platform version 4.7.6