This node analyses documents and extracts relevant keywords using the graph-based approach described in "KeyGraph: Automatic Indexing by Co-occurrence Graph based on Building Construction Metaphor" by Yukio Ohsawa.
First, a predetermined number of terms is selected based on frequency (the high frequency set, HF) and added as the initial nodes of the graph.
The association strength between each pair of these terms is then calculated using the following scoring method: assoc(term1, term2) = min(occurrence frequency of term1 in the sentence, occurrence frequency of term2 in the sentence), summed over every sentence in the document. The top |HF|-1 associations are inserted into the graph as edges.
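The selection and association steps above can be sketched as follows. This is only an illustration, not the node's actual code: the function name top_associations, the input format (a list of tokenised sentences) and the choice to drop zero-strength pairs are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def top_associations(sentences, hf_size):
    """sentences: list of token lists (hypothetical input format)."""
    # Document-wide frequencies pick the high frequency set HF.
    totals = Counter(t for s in sentences for t in s)
    hf = [t for t, _ in totals.most_common(hf_size)]
    # Per-sentence frequencies are needed for the min() in assoc.
    per_sentence = [Counter(s) for s in sentences]
    assoc = {}
    for a, b in combinations(hf, 2):
        assoc[(a, b)] = sum(min(c[a], c[b]) for c in per_sentence)
    # Keep the |HF|-1 strongest pairs as edges.
    ranked = sorted(assoc.items(), key=lambda kv: kv[1], reverse=True)
    limit = max(len(hf) - 1, 0)
    # Assumption: pairs that never co-occur are not turned into edges.
    edges = [pair for pair, score in ranked[:limit] if score > 0]
    return hf, assoc, edges
```

For three sentences "a b a", "a b" and "c b", assoc(a, b) is 1 + 1 + 0 = 2 and assoc(b, c) is 1, so those two pairs become the |HF|-1 = 2 edges.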
If an edge between two terms is the only path that connects them, it is pruned.
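One way to picture the pruning rule is the sketch below, which drops every edge whose endpoints are no longer connected once that edge is ignored. The function name prune_single_path_edges is hypothetical, and testing each edge against the original, unpruned graph is an assumption; the node may proceed differently.

```python
from collections import defaultdict, deque

def prune_single_path_edges(edges):
    edges = [tuple(e) for e in edges]

    def connected_without(skip, src, dst):
        # Build adjacency lists while ignoring the edge under test.
        adj = defaultdict(set)
        for i, (u, v) in enumerate(edges):
            if i != skip:
                adj[u].add(v)
                adj[v].add(u)
        # Breadth-first search from src looking for dst.
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return True
            for nxt in adj[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        return False

    # An edge survives only if another path links its two terms.
    return [e for i, e in enumerate(edges)
            if connected_without(i, e[0], e[1])]
```

With the edges a-b, b-c, c-a and c-d, the triangle is kept and c-d is pruned, because nothing else connects c and d.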
The graph's connected subgraphs are then extracted and treated as "concept" clusters. A new batch of terms is added based on their key score, which is the conditional probability that a term will be used if the author has all the concepts (clusters) in mind: P(UNION(t|g)), where t is the term and the union is taken over every cluster g in the set of clusters.
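A rough sketch of cluster extraction and the key score is given below. The description above does not spell out how the probability is estimated, so the example assumes the formulation commonly attributed to Ohsawa's paper, key(t) = 1 - PRODUCT over g of (1 - based(t, g) / neighbors(g)); the node's actual computation may differ. The function names, the tokenised input format and the treatment of isolated terms as single-term clusters are likewise assumptions.

```python
from collections import Counter, defaultdict

def clusters_from_edges(nodes, edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk collects one connected subgraph.
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

def key_score(term, clusters, sentences):
    counts = [Counter(s) for s in sentences]
    miss = 1.0  # probability that no cluster "calls for" the term
    for g in clusters:
        based = neighbors = 0
        for c in counts:
            g_total = sum(c[t] for t in g)
            for w, n in c.items():
                # Cluster occurrences in the sentence, excluding w itself.
                g_minus_w = g_total - n if w in g else g_total
                neighbors += n * g_minus_w
                if w == term:
                    based += n * g_minus_w
        if neighbors:
            miss *= 1.0 - based / neighbors
    return 1.0 - miss
```

Terms that often share sentences with cluster members score close to 1, while a term that never co-occurs with any cluster scores 0.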
Each of these new terms is then linked to every cluster using the strongest scoring edge amongst the possible ones.
Finally, all the terms in the graph are rated using this formula: score(t) = sum over every edge connecting t to another term w, of the sum over every sentence, of min(freq(t), freq(w)).
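The final rating can be illustrated with the short sketch below. It assumes freq() is the frequency of a term within a single sentence and that the graph is given as a list of undirected edges; the function name term_scores is hypothetical.

```python
from collections import Counter

def term_scores(edges, sentences):
    counts = [Counter(s) for s in sentences]
    scores = Counter()
    for t, w in edges:
        # Edge strength: per-sentence min of the two frequencies, summed.
        strength = sum(min(c[t], c[w]) for c in counts)
        # An undirected edge contributes to both of its terms.
        scores[t] += strength
        scores[w] += strength
    # Highest-scoring terms first.
    return scores.most_common()
```

For the sentences "a b a", "a b" and "c b" with edges a-b and b-c, the edge a-b has strength 2 and b-c has strength 1, so b scores 3, a scores 2 and c scores 1.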
Setting the console's output level to DEBUG will make this node display the contents of the clusters after the pruning phase.
- Type: Table
  Documents input table: The input table which contains the documents to analyse.