Molecule Activity Classification with Machine Learning

Tags: Life Sciences, Prediction, Machine Learning, Use Case, Activity
knime
Version v1.0 (latest), created on Dec 29, 2025 2:30 PM

The amount of data available to researchers has increased drastically in recent years, including large datasets on chemical compounds relevant to pharmacological research. Machine learning models can analyze these datasets and identify patterns that help predict the pharmacokinetic properties, liabilities, or hazards of novel compounds. This has the potential to significantly accelerate industrial and academic pharmacological research and development, saving both time and money.

This workflow demonstrates the basic principle of training and evaluating different machine learning models for the binary classification of compounds as active or inactive against a specific target protein. The data used for training and testing is a list of compounds with the SMILES notation of their molecular structure and their activity against the target of interest, given as pIC50 (the negative base-10 logarithm of the IC50 in mol/L). Each compound is labeled active or inactive by applying a threshold to its pIC50 value. The dataset is then reduced to the molecular fingerprint of each compound (a compact numeric representation of the molecule) together with its activity label, and passed on to three branches, each demonstrating one of the following machine learning models:

  • Random Forest: an ensemble method that trains many individual decision trees on random subsets of the data; in classification problems, the trees then "vote" on the final result

  • Resilient Backpropagation (RProp): a supervised learning algorithm for training feedforward neural networks

  • Support Vector Machine (SVM): a supervised learning algorithm for classification or regression that searches for the optimal separating boundary between classes
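
The labeling step and the "voting" idea behind the Random Forest branch can be sketched in plain Python. This is a minimal illustration, not the workflow's implementation: the cutoff `PIC50_THRESHOLD`, the 4-bit toy fingerprints, and the helper names are all assumptions made for the example, and single-bit decision stumps stand in for the full decision trees a real random forest would grow.

```python
import random
from collections import Counter

# Assumption: illustrative cutoff only; the workflow's actual threshold is not stated here.
PIC50_THRESHOLD = 6.0


def label_activity(pic50, threshold=PIC50_THRESHOLD):
    """Return 'active' when pIC50 meets the threshold, else 'inactive'."""
    return "active" if pic50 >= threshold else "inactive"


def fit_stump(rows, labels, bit):
    """For one fingerprint bit, learn the majority label seen for bit=0 and bit=1."""
    fallback = Counter(labels).most_common(1)[0][0]
    rule = {}
    for value in (0, 1):
        seen = Counter(lab for row, lab in zip(rows, labels) if row[bit] == value)
        rule[value] = seen.most_common(1)[0][0] if seen else fallback
    return bit, rule


def fit_ensemble(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample of rows and one randomly chosen bit."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        bit = rng.randrange(len(rows[0]))           # random feature for this stump
        ensemble.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], bit)
        )
    return ensemble


def predict(ensemble, row):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(rule[row[bit]] for bit, rule in ensemble)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (4-bit fingerprint, pIC50); all values are made up.
toy = [
    ([1, 1, 1, 0], 7.5), ([1, 1, 1, 0], 6.8), ([1, 1, 1, 0], 6.0), ([1, 1, 1, 0], 8.1),
    ([0, 0, 0, 1], 4.2), ([0, 0, 0, 1], 5.1), ([0, 0, 0, 1], 5.9), ([0, 0, 0, 1], 3.7),
]
fingerprints = [fp for fp, _ in toy]
labels = [label_activity(p) for _, p in toy]
model = fit_ensemble(fingerprints, labels)
print(predict(model, [1, 1, 1, 0]), predict(model, [0, 0, 0, 1]))
```

Real molecular fingerprints typically have hundreds or thousands of bits, and a production model would come from a tested library rather than a hand-written ensemble like this one.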

The X Partitioner and X Aggregator nodes run a cross-validation loop, which provides a reliable first estimate of each model's performance on unseen data. Low variation in the resulting error rates across folds points to a robust model for the intended use. Together with the visualization dashboard, which includes ROC curves, confusion matrices, and other performance metrics, this allows the best of the three models to be chosen for further optimization and subsequent deployment.
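
The idea of cross-validation and of checking the spread of error rates can be sketched as follows. This is a rough illustration under stated assumptions rather than a description of how the KNIME nodes work internally: the function names, the five-fold split, and the majority-label baseline model are all hypothetical.

```python
import random
import statistics


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices once and deal them into k disjoint folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(rows, labels, fit, predict, k=5, seed=0):
    """Return one error rate per fold: train on k-1 folds, test on the held-out fold."""
    folds = k_fold_indices(len(rows), k, seed)
    error_rates = []
    for held_out in folds:
        test_set = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in test_set]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        wrong = sum(predict(model, rows[i]) != labels[i] for i in held_out)
        error_rates.append(wrong / len(held_out))
    return error_rates


# Hypothetical baseline model: always predict the most common training label.
def fit_majority(train_rows, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, row):
    return model


# Made-up data: 20 placeholder rows, 15 labeled active and 5 inactive.
rows = list(range(20))
labels = ["active"] * 15 + ["inactive"] * 5
error_rates = cross_validate(rows, labels, fit_majority, predict_majority, k=5)

# A small standard deviation across folds suggests a stable model.
print("mean error:", statistics.mean(error_rates))
print("std dev:", statistics.pstdev(error_rates))
```

Any model exposing a `fit` and a `predict` function of the same shape could be passed in instead of the baseline, which makes it possible to compare several models on identical folds.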

Note: This workflow is based on the TeachOpenCADD workflow, specifically Workflow 7 (Ligand-based screening: Machine learning), available from the KNIME Community Hub and Zenodo. It uses a processed version of the example data provided there: a list of substances active against the Epidermal Growth Factor Receptor (EGFR) that has been filtered according to Lipinski's Rule of Five (see the use case Compound Library Screening (ADME) for details).

External resources

  • Teach Open CADD - zenodo
  • Compound Library Screening (ADME)
  • Teach Open CADD - Master Workflow
  • Teach Open CADD - Workflow 7 (Ligand-based screening: Machine learning)

Used extensions & nodes

Created with KNIME Analytics Platform version 5.9.0
  • KNIME Base Chemistry Types & Nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 5.9.0, knime
  • KNIME Base nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 5.9.0, knime
  • KNIME Ensemble Learning Wrappers (trusted extension), KNIME AG, Zurich, Switzerland, version 5.9.0, knime
  • KNIME Expressions (trusted extension), KNIME AG, Zurich, Switzerland, version 5.9.0, knime
  • KNIME JavaScript Views (Labs) (trusted extension), KNIME AG, Zurich, Switzerland, version 5.9.0, knime
  • KNIME Views (trusted extension), KNIME AG, Zurich, Switzerland, version 5.9.0, knime
  • RDKit Nodes Feature (trusted extension), Novartis, version 5.2.1, manuelschwarze

Legal

By using or downloading the workflow, you agree to our terms and conditions.
