Chi-Square Keyword Extractor

This node analyses documents and extracts relevant keywords using co-occurrence statistics, as described in "Keyword extraction from a single document using word co-occurrence statistical information" by Y. Matsuo and M. Ishizuka.
First, the most frequent terms (see node settings) are extracted and then clustered, using pointwise mutual information and a normalized version of the L1 norm as measures of the distance between their co-occurrence probability distributions.
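The two measures can be illustrated with a minimal Python sketch. The node description does not give the exact formulas, so everything here is an assumption: co-occurrence is counted per sentence, the L1 distance is halved so that it falls in [0, 1], and the function names (`cooccurrence_counts`, `pmi`, `normalized_l1`) are hypothetical, not part of the node.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count sentences per term and sentences per unordered term pair."""
    term_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence))  # each term counts once per sentence
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))  # pairs are stored sorted
    return term_counts, pair_counts


def pmi(a, b, term_counts, pair_counts, n_sentences):
    """Pointwise mutual information of two terms (assumed sentence-level)."""
    joint = pair_counts[tuple(sorted((a, b)))]
    if joint == 0:
        return float("-inf")  # the terms never appear together
    return math.log(joint * n_sentences / (term_counts[a] * term_counts[b]))


def normalized_l1(a, b, pair_counts, vocabulary):
    """Half the L1 distance between the co-occurrence distributions of a and b.

    Halving is an assumed normalization: it maps the distance into [0, 1].
    """
    others = [t for t in vocabulary if t not in (a, b)]

    def distribution(term):
        counts = [pair_counts[tuple(sorted((term, t)))] for t in others]
        total = sum(counts)
        return [c / total if total else 0.0 for c in counts]

    pa, pb = distribution(a), distribution(b)
    return 0.5 * sum(abs(x - y) for x, y in zip(pa, pb))
```

A high PMI or a small normalized L1 distance would both suggest that two terms belong together; the thresholds the node applies are not documented here.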
A term is considered a member of a cluster if it is similar to every term already in that cluster according to at least one of the two measures. If more than one cluster meets this condition, the one with the highest average score is used. If no cluster qualifies, a new one is created.
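The assignment rule can be sketched as follows. This is an illustration under stated assumptions, not the node's code: `is_similar` and `score` are hypothetical callables standing in for the node's similarity test and scoring, and ties between equally good clusters go to the earliest one.

```python
def assign_term(term, clusters, is_similar, score):
    """Place a term into the best matching cluster, or start a new one.

    is_similar(a, b) -> bool and score(a, b) -> float are hypothetical
    callables; the node's real measures and thresholds are not shown here.
    """
    # A cluster qualifies only if the term is similar to all of its members.
    candidates = [
        cluster for cluster in clusters
        if all(is_similar(term, member) for member in cluster)
    ]
    if not candidates:
        clusters.append([term])  # no cluster qualifies: create a new one
        return clusters[-1]

    def average_score(cluster):
        return sum(score(term, member) for member in cluster) / len(cluster)

    best = max(candidates, key=average_score)  # first cluster wins on ties
    best.append(term)
    return best


def cluster_terms(terms, is_similar, score):
    """Cluster terms one at a time, in the order given."""
    clusters = []
    for term in terms:
        assign_term(term, clusters, is_similar, score)
    return clusters
```

Because terms are processed in order, the result of such a greedy scheme can depend on the order in which the frequent terms are visited.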
Once clustering is done, the terms are ranked in decreasing order of the deviation between their expected co-occurrence with the clusters and the co-occurrence actually observed. The terms with the highest deviation are returned as keywords.
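The ranking step can be illustrated with a short Python sketch. It assumes the node uses the basic chi-square statistic from the cited paper, where the expected co-occurrence of a term with a cluster is the term's total co-occurrence count multiplied by the cluster's expected probability. The function names, the input dictionaries, and the choice to take that total as the sum of the observed counts are simplifications made for this example.

```python
def chi_square_score(observed, expected_prob):
    """Chi-square deviation of one term from its expected cluster co-occurrence.

    observed:      dict mapping cluster id -> observed co-occurrence count
    expected_prob: dict mapping cluster id -> expected probability of the cluster
    """
    n_w = sum(observed.values())  # assumed total co-occurrence count of the term
    score = 0.0
    for cluster, probability in expected_prob.items():
        expected = n_w * probability
        if expected > 0:  # skip clusters with no expected co-occurrence
            score += (observed.get(cluster, 0) - expected) ** 2 / expected
    return score


def rank_keywords(observed_by_term, expected_prob):
    """Return (term, score) pairs sorted by decreasing chi-square score."""
    scores = {
        term: chi_square_score(observed, expected_prob)
        for term, observed in observed_by_term.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A term that co-occurs with the clusters in the same proportions as expected scores close to zero, while a term that is strongly biased towards particular clusters scores high and is ranked as a keyword candidate.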
Setting the console's output level to DEBUG makes this node display the set of frequent terms, the distances between them during the clustering phase, and the final clusters.

Node details

Input ports

Type: Table
Documents input table
The input table which contains the documents to analyse.

Output ports

Type: Table
Keywords output table
The output table which contains (keyword term, deviation value, associated document) tuples.
