The interactive view of this component helps validate a chosen topic model solution and offers insight into how similar the extracted topics are to one another.
The Topic Explorer View offers two modes:
- Explore by Topic: explore the topics (second input) in a similarity bubble chart, select topics, and visualize their coherence and exclusivity scores from the Topic Scorer component (kni.me/c/5_W2h2g6hBY_M0Bc) together with the associated tag cloud. Additionally, you can scroll through the topics, each represented as a small bar chart.
- Explore by Document: explore the documents (first input) in a similarity bubble chart, select topics and visualize the preview or the full length of documents where the terms inside the topics are highlighted.
Both modes provide a similarity bubble chart, where topics or documents with higher semantic similarity are positioned closer to each other on the graph in 2-dimensional space. This is achieved through a combination of distinct analytics techniques:
1) For the “Explore by Topic” mode, we utilize a Word2Vec model (kni.me/n/QPMbC4vyfvPkfV8F) to calculate the distances between all words within the documents. These distances are then used to construct a distance matrix, representing the similarity among all topics by averaging the distances of the words associated with each specific topic.
2) The distance matrix generated by Word2Vec is further processed using Multidimensional Scaling (MDS) (kni.me/n/SCgPuzvfM-9t325D), which decomposes it into two dimensions. These two dimensions serve as the coordinates of each topic in a 2-dimensional space. Additionally, the size of the points representing topics directly corresponds to their frequency among the documents.
3) The size of each bubble represents the mean probability that the input documents belong to that topic.
4) In the “Explore by Document” mode, each bubble represents a different document, as we follow a similar approach using the documents' bag of words instead of the topic model's output terms.
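
The averaging in step 1 can be illustrated outside KNIME with a minimal Python sketch. It is not the component's actual workflow: the toy 2-dimensional word vectors stand in for a trained Word2Vec model, the topic names and terms are made up, and the choices of cosine distance and of setting the diagonal to zero are assumptions made for illustration. The resulting matrix is the kind of input that an MDS step would then reduce to two coordinates per topic.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def topic_distance_matrix(topics, embeddings):
    """Average the word-to-word distances for every pair of topics.

    topics: dict mapping a topic name to its list of terms.
    embeddings: dict mapping a term to its vector (hypothetical stand-in
    for Word2Vec output).
    """
    names = list(topics)
    matrix = []
    for a in names:
        row = []
        for b in names:
            if a == b:
                # Assumption: a topic is at distance zero from itself.
                row.append(0.0)
                continue
            pairs = list(product(topics[a], topics[b]))
            total = sum(
                cosine_distance(embeddings[x], embeddings[y]) for x, y in pairs
            )
            row.append(total / len(pairs))
        matrix.append(row)
    return names, matrix


# Toy vectors, invented for this example only.
embeddings = {
    "goal": [1.0, 0.0],
    "match": [0.9, 0.1],
    "team": [0.8, 0.2],
    "vote": [0.0, 1.0],
    "party": [0.1, 0.9],
}
topics = {
    "sports": ["goal", "match"],
    "football": ["team", "goal"],
    "politics": ["vote", "party"],
}

names, dist = topic_distance_matrix(topics, embeddings)
for name, row in zip(names, dist):
    print(name, [round(d, 3) for d in row])
```

In this toy data the "sports" and "football" topics end up much closer to each other than either is to "politics", which is the behavior the bubble chart relies on when it places similar topics near one another.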
DISCLAIMER: This data app slows down when dealing with a large number of documents. By default, only the top 250 rows from the first (top) input and the top 10 terms per topic from the second input are considered. You can increase these numbers in the component dialog. To avoid performance issues, it is advisable to apply stratified sampling to the first input, using the assigned topic column, in a Row Sampling node (kni.me/n/3o-UY2qMENf5piCd) before the component.
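
To show what stratified sampling on the assigned topic column achieves, here is a minimal Python sketch. It does not reproduce the Row Sampling node; the `stratified_sample` helper, the `doc_id` and `topic` field names, and the 25% fraction are hypothetical choices for illustration. The point is that each topic keeps roughly the same share of rows in the sample as in the full input.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, topic_key, fraction, seed=0):
    """Sample the same fraction of rows from every topic group."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[topic_key]].append(row)

    sample = []
    for members in groups.values():
        # Keep at least one row per topic so no topic disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up input: 800 documents assigned to topic_0, 200 to topic_1.
rows = [
    {"doc_id": i, "topic": "topic_0" if i < 800 else "topic_1"}
    for i in range(1000)
]

sample = stratified_sample(rows, "topic", fraction=0.25)
counts = Counter(row["topic"] for row in sample)
print(len(sample), dict(counts))
```

With these numbers the sample holds 250 rows, matching the component's default row limit, while preserving the 4:1 ratio between the two topics.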
This component can be used as a data app, running either in a local environment or on KNIME Server and KNIME Business Hub.