
Auto Categorical Features Embedding

Author: ashokharnal
Draft. Latest edits on Oct 6, 2021 5:51 AM
This component encodes string (categorical) features as numeric features. Three encodings are available: a) value counts of each categorical variable (count encoding), b) ranked label-count encoding, and c) target encoding. Several string-type categorical variables may be specified at once.

As input, the component takes train and test datasets. You have to specify:
  a. Target column name
  b. String columns to be encoded
  c. Type of categorical encoding desired
  d. Whether to perform PCA

To avoid data leakage during target encoding (and to let you check for it), the encoding is computed using only the train data. The encoded values are then mapped onto the test data. For target encoding to be reliable, the dataset should generally be large.

The component optionally performs PCA on a DataFrame consisting of any one of the three encodings, or all three, together with the numeric features already present. The PCA model is built on the train data and applied to the test data. The principal components retained explain 95% of the variance.

The component outputs two dataframes, train and test. They contain either the principal components or the encoded columns along with the numeric columns already present. The output dataframes include the target column.

The component uses a Python script. The Python libraries used are numpy, pandas, scikit-learn and pyarrow. To speed up categorical feature generation, go to File > Preferences > KNIME > Python > Serialization library and select Apache Arrow as the serialization library.

For a description of the three encoding methods, please refer to: https://wrosinski.github.io/fe_categorical_encoding/
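The three encodings described above can be sketched in pandas roughly as follows. This is only an illustration and not the component's actual script: the function names, the rank direction (1 = most frequent category) and the fill values used for categories that appear only in the test data are all assumptions. In each case the mapping is learned from the train data alone and then applied to the test data.

```python
import pandas as pd


def count_encode(train, test, col):
    """Replace each category by how often it occurs in the train data."""
    counts = train[col].value_counts()
    # Assumption: categories unseen in train get a count of 0.
    return train[col].map(counts), test[col].map(counts).fillna(0)


def label_count_encode(train, test, col):
    """Replace each category by its rank when sorted by train frequency."""
    counts = train[col].value_counts()
    # Assumption: rank 1 is the most frequent category; ties share a rank.
    ranks = counts.rank(method="dense", ascending=False).astype(int)
    # Assumption: categories unseen in train get rank 0.
    return train[col].map(ranks), test[col].map(ranks).fillna(0)


def target_encode(train, test, col, target):
    """Replace each category by the mean of the target in the train data."""
    means = train.groupby(col)[target].mean()
    # Assumption: unseen categories fall back to the overall train mean.
    prior = train[target].mean()
    return train[col].map(means), test[col].map(means).fillna(prior)


# Hypothetical toy data, for illustration only.
train = pd.DataFrame({"city": list("aaabbc"), "y": [1, 1, 0, 0, 1, 0]})
test = pd.DataFrame({"city": ["a", "c", "z"]})

train["city_count"], test["city_count"] = count_encode(train, test, "city")
train["city_rank"], test["city_rank"] = label_count_encode(train, test, "city")
train["city_target"], test["city_target"] = target_encode(train, test, "city", "y")
```

Plain per-category means like these are noisy for rare categories, which is one reason a large dataset helps target encoding.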

Component details

Input ports
  1. Type: Table
    trainDataIn
    Input train data here. The table may contain string and numeric features.
  2. Type: Table
    testDataIn
    Input test data here. The table may contain string and numeric features.
Output ports
  1. Type: Table
    trainDataOut
    Transformed train data. Output is either the encoded columns or the principal components, together with the target column.
  2. Type: Table
    testDataOut
    Transformed test data. Output is either the encoded columns or the principal components, together with the target column.
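
When PCA is selected, the way the two output tables are produced could look roughly like the sketch below. It is not the component's actual script: the function name, the column names and the standardization step are assumptions, while passing a fraction as n_components to scikit-learn's PCA is the standard way to keep just enough components to reach a given share of explained variance. The model is fitted on train only and then applied to both tables.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_train_test(train, test, target, variance=0.95):
    """Fit scaler and PCA on train only, then transform train and test."""
    features = [c for c in train.columns if c != target]
    # Assumption: features are standardized before PCA.
    scaler = StandardScaler().fit(train[features])
    pca = PCA(n_components=variance, svd_solver="full")
    pca.fit(scaler.transform(train[features]))
    names = [f"pc{i + 1}" for i in range(pca.n_components_)]

    def project(df):
        comps = pca.transform(scaler.transform(df[features]))
        out = pd.DataFrame(comps, columns=names, index=df.index)
        if target in df.columns:
            out[target] = df[target]  # keep the target in the output
        return out

    return project(train), project(test)


# Hypothetical numeric data standing in for encoded plus numeric columns.
rng = np.random.default_rng(0)


def make_frame(n):
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    return pd.DataFrame({
        "x1": x1,
        "x2": x2,
        "x3": x1 + 0.01 * rng.normal(size=n),  # nearly redundant with x1
        "target": rng.integers(0, 2, size=n),
    })


train_out, test_out = pca_train_test(make_frame(200), make_frame(50), "target")
```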

Used extensions & nodes

Created with KNIME Analytics Platform version 4.4.1
  • KNIME Base nodes (Trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 4.4.1
  • KNIME Python Integration
    KNIME AG, Zurich, Switzerland
    Version 4.4.1
  • KNIME Quick Forms (Trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 4.4.1


