Node / Manipulator

Stanford Tagger

Other Data Types Text Processing Enrichment Streamable

This node assigns a part-of-speech (POS) tag to each term of a document. It is applicable to French, English, German, Spanish and Arabic texts. The underlying tagger models are provided by the Stanford NLP group:
http://nlp.stanford.edu/software/tagger.shtml

For English texts the Penn Treebank tag set is used:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
For German texts the STTS tag set is used:
http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html
For French texts the French Treebank tag set is used:
http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php
For Spanish texts the Ancora Treebank tag set is used:
https://nlp.stanford.edu/software/spanish-faq.shtml#tagset
For Arabic texts an Arabic Penn Treebank tag set is used:
https://nlp.stanford.edu/software/parser-arabic-faq.html#d
There are also German, Spanish and French models that use the Universal Dependencies POS tag set:
http://universaldependencies.org/u/pos/

Note: the provided tagger models vary in memory consumption and processing speed. The English bidirectional, WSJ bidirectional, German hgc, and German dewac models in particular require a lot of memory. To use these models, it is recommended to run KNIME with at least 2 GB of heap space. To increase the heap space, change the -Xmx setting in the knime.ini file. If KNIME is running with less than 1.5 GB of heap space, it is recommended to use the English left3words, English left3words caseless, or German fast models for tagging English or German texts.
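As an illustration of the heap-space change described in the note, the following is a minimal Python sketch that rewrites the -Xmx line in the text of a knime.ini file. The function name `set_max_heap` is hypothetical and not part of KNIME. The sketch assumes the usual launcher layout in which JVM options appear one per line after a `-vmargs` line. Editing the file by hand in a text editor works just as well.

```python
def set_max_heap(ini_text: str, heap: str = "2g") -> str:
    """Return ini_text with its -Xmx option set to the given heap size.

    Assumption: JVM options are listed one per line after a "-vmargs" line.
    """
    lines = ini_text.splitlines()
    new_option = f"-Xmx{heap}"
    for i, line in enumerate(lines):
        if line.strip().startswith("-Xmx"):
            # Replace the existing maximum heap setting.
            lines[i] = new_option
            break
    else:
        # No -Xmx line found: append one, adding -vmargs first if it is missing.
        if "-vmargs" not in (line.strip() for line in lines):
            lines.append("-vmargs")
        lines.append(new_option)
    return "\n".join(lines) + "\n"


# Example with a made-up, minimal ini text (a real knime.ini has more lines).
sample = "-vmargs\n-Xmx1024m\n"
print(set_max_heap(sample, "2g"))
```

KNIME has to be restarted for a changed -Xmx value to take effect, since the setting is read when the JVM starts.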

Descriptions of the models (taken from the website of the Stanford NLP group):

  • English bidirectional: Trained on WSJ sections 0-18 using a bidirectional architecture and including word shape and distributional similarity features. Penn Treebank tagset.
  • English left3words: Trained on WSJ sections 0-18 and extra parser training data using the left3words architecture and includes word shape and distributional similarity features. Penn Treebank tagset.
  • English left3words caseless: Trained on WSJ sections 0-18 and extra parser training data using the left3words architecture and includes word shape and distributional similarity features. Penn Treebank tagset. Ignores case.
  • English WSJ 0-18 bidirectional distsim: Trained on WSJ sections 0-18 using a bidirectional architecture and including word shape and distributional similarity features. Penn Treebank tagset.
  • English WSJ 0-18 bidirectional no distsim: Trained on WSJ sections 0-18 using a bidirectional architecture and including word shape. No distributional similarity features. Penn Treebank tagset.
  • English WSJ 0-18 caseless left 3 words distsim: Trained on WSJ sections 0-18 using the left3words architecture and includes word shape and distributional similarity features. Penn Treebank tagset. Ignores case.
  • English WSJ 0-18 left 3 words distsim: Trained on WSJ sections 0-18 using the left3words architecture and includes word shape and distributional similarity features. Penn Treebank tagset.
  • English WSJ 0-18 left 3 words no distsim: Trained on WSJ sections 0-18 using the left3words architecture and includes word shape. Penn Treebank tagset.

To use the following tagger models, the corresponding language pack has to be installed (File -> Install KNIME Extensions...).

  • German hgc: Trained on the first 80% of the Negra corpus, which uses the STTS tagset.
  • German dewac: This model uses features from the distributional similarity clusters built from the deWac web corpus.
  • German fast: Lacks distributional similarity features, but is several times faster than the other alternatives.
  • German fast caseless: Lacks distributional similarity features, but is several times faster than the other alternatives. Ignores case.
  • German UD: This is a model that produces Universal Dependencies POS tags.
  • French: Trained on the French treebank.
  • French UD: This is a model that produces Universal Dependencies POS tags.
  • Spanish: Trained on the Spanish Ancora tagset.
  • Spanish distsim: Trained on the Spanish Ancora tagset.
  • Spanish UD: This is a model that produces Universal Dependencies POS tags.
  • Arabic: This is a model that produces POS tags for Arabic language.
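The memory guidance from the note above can be sketched as a small lookup helper. This is only an illustration: the function names `needs_more_heap` and `lightweight_alternatives` are hypothetical, the model names are copied from this page, and the 2 GB and 1.5 GB thresholds are the recommendations stated in the note rather than hard limits enforced by the node.

```python
# Models the note singles out as memory-hungry (names as written on this page).
HEAVY_MODELS = {
    "English bidirectional",
    "WSJ bidirectional",
    "German hgc",
    "German dewac",
}

# Models the note recommends when less than 1.5 GB of heap is available.
LIGHT_MODELS = {
    "English": ["English left3words", "English left3words caseless"],
    "German": ["German fast"],
}


def needs_more_heap(model: str, heap_gb: float) -> bool:
    """True if the model is listed as heavy and heap is below the recommended 2 GB."""
    return model in HEAVY_MODELS and heap_gb < 2.0


def lightweight_alternatives(language: str, heap_gb: float) -> list[str]:
    """Return the recommended light models for a language when heap is below 1.5 GB."""
    if heap_gb < 1.5:
        return list(LIGHT_MODELS.get(language, []))
    return []


print(needs_more_heap("German hgc", 1.0))
print(lightweight_alternatives("English", 1.0))
```

The note gives no recommendation for languages other than English and German, so the helper returns an empty list for them.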

Node details

Input ports
  1. Type: Table
    Documents input table
    The input table containing the documents to tag.
Output ports
  1. Type: Table
    Documents output table
    An output table containing the tagged documents.


Related workflows & nodes

  1. 01_Tagging Words in Documents
    Text Mining Books From Words To Wisdom
    For the purposes of enrichment, this workflow applies a variety of tags to words in a Doc…
    vincenzo > Public > From_Words_To_Wisdom_Book > Chapter3 > 01_POS_NE_Taggers
  2. Stanford Lemmatizer Example
    NLP Natural Language Processing Lemmatize
    A lemmatizer removes inflections, e.g. in case of plurals, pronoun case, and verb endings …
    knime > Examples > 08_Other_Analytics_Types > 01_Text_Processing > 11_Lemmatizer_Preprocessing
  3. Online Job Postings
    NLP Natural Language Processing
    The workflow performs text processing of the Job Posts dataset (only IT related postings)…
    knime > Examples > 08_Other_Analytics_Types > 01_Text_Processing > 19_Analyse_and_Visualize_Job_Postings
  4. Online Job Postings
    NLP Natural Language Processing
    The workflow performs text processing of the Job Posts dataset (only IT related postings)…
    marcelo_linero > Public > 19_Analyse_and_Visualize_Job_Postings
  5. Apache Tika integration
    NLP Natural Language Processing Tika
    The goal of the workflow is to show how to parse content of files using Tika nodes, detec…
    knime > Examples > 08_Other_Analytics_Types > 01_Text_Processing > 16_Tika_Parsing
  6. IS Literature Mining with Topic Detection (LDA)
    Topic detection Text summarization LDA
    This workflow shows how to extract topics from the input data presented as an Excel file.…
    knime > Education > Courses > L4-TP Introduction to Text Processing > Supplementary Workflows > TextMining_IS_Literature_BoW_LDA
  7. IS Literature Mining with Topic Detection (LDA)
    Topic detection Text summarization LDA
    This workflow shows how to extract topics from the input data presented as an Excel file.…
    scottf > Public > TextMiningWebinar > TextMining_IS_Literature_BoW_LDA
  8. IS Literature Mining with Topic Detection (LDA)
    Topic detection Text summarization LDA
    This workflow shows how to extract topics from the input data presented as an Excel file.…
    scottf > Public > TextMining_IS_Literature_BoW_LDA
  9. IS Literature Mining with Topic Detection (LDA)
    Topic detection Text summarization LDA
    This workflow shows how to extract topics from the input data presented as an Excel file.…
    huds_gfm18 > Public > TextMiningWebinar > TextMining_IS_Literature_BoW_LDA
  10. Topic Detection LDA: Summarizing Romeo & Juliet or cataloging News
    Topic detection Text summarization LDA
    The workflow shows two examples for the Topic Extractor (Parallel LDA) node. The first wor…
    knime > Examples > 08_Other_Analytics_Types > 01_Text_Processing > 25_Topic_Detection_LDA
