This workflow shows how to combine the "Serial" Chunk Loop nodes with the RDKit Substructure Filter node to balance the workload, as a compromise between fully serial and fully parallel execution of substructure searching.
I ran the workflow on a laptop with 6 cores and 128 GB of memory. Problem A), with 100 million rows, took 4 hours to run to successful completion.
All the parallelism is achieved by the -RDKit Substructure Filter- node. This node handles parallelism itself and does not need any further parallelism added around it. In fact, if parallelism is added using -Parallel Chunk Loop- nodes, the two parallelism schemes compete for the same resources, which is most probably a source of conflict. In other words, nesting one parallel solution inside another is not recommended. For the same reason, running two parallelized branches of a workflow at the same time is not recommended either.
This solution uses all the cores in the computer to achieve the parallelism.
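The pattern described above (a serial outer loop over chunks, with a single parallel layer inside the filter) can be sketched outside KNIME as follows. This is only an illustration under stated assumptions: it does not call KNIME or RDKit, the names `chunks`, `matches`, `parallel_filter` and `filter_in_serial_chunks` are hypothetical, and the plain substring test stands in for a real substructure match. The thread pool is there to show the structure, not to reproduce the performance of the node.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
import os


def chunks(rows, size):
    """Yield successive lists of at most `size` rows (the serial chunk loop)."""
    it = iter(rows)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def matches(row, query):
    # Assumption: a plain substring test stands in for a substructure match.
    return query in row


def parallel_filter(block, query, pool):
    # The only parallel layer: rows of one chunk are spread over the pool.
    flags = list(pool.map(lambda row: matches(row, query), block))
    return [row for row, keep in zip(block, flags) if keep]


def filter_in_serial_chunks(rows, query, chunk_size, workers=None):
    # One pool sized to the machine; no second pool is wrapped around it.
    workers = workers or os.cpu_count() or 1
    hits = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for block in chunks(rows, chunk_size):  # chunks run one after another
            hits.extend(parallel_filter(block, query, pool))
    return hits


rows = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "c1ccccc1O"]
print(filter_in_serial_chunks(rows, "c1ccccc1", chunk_size=2))
```

Wrapping the outer `for` loop in a second pool would correspond to the nested setup discouraged above: both layers would then request the same cores at the same time.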
"Problem B)" works in the same way and shows how to implement the solution when several queries are run against a huge number of molecules.
Workflow
20220718 Pikairos How to Optimize RDKit Parallelized Substructure Filtering
Created with KNIME Analytics Platform version 4.5.2