OCR_Python_Portable_CEP

Workflow

Version v0.9 (Latest)

OCR Foreign Language PDFs with Python and KNIME (with Tesseract and PDFium)

This workflow shows how to run OCR on PDFs written in a foreign language using Python and KNIME. If the desired language does not appear in the drop-down when configuring the OCR Component, additional languages can be added by editing the component's Python script.
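As a rough illustration of what such a script tweak might look like, the sketch below assumes the script maps drop-down labels to Tesseract language codes. The names `LANGUAGE_CODES`, `add_language`, and `tesseract_lang_argument` are hypothetical and not taken from the workflow; the codes themselves (`eng`, `deu`, `fra`, `ukr`) follow Tesseract's usual `<code>.traineddata` naming.

```python
# Hypothetical mapping from drop-down labels to Tesseract language codes.
# The real component script may organise this differently.
LANGUAGE_CODES = {
    "English": "eng",
    "German": "deu",
    "French": "fra",
}


def add_language(label, code, table=LANGUAGE_CODES):
    """Register an extra language; `code` should match a <code>.traineddata file."""
    # Codes such as "chi_sim" contain an underscore, so strip it before checking.
    if not code.isascii() or not code.replace("_", "").isalnum():
        raise ValueError(f"unexpected Tesseract language code: {code!r}")
    table[label] = code
    return table


def tesseract_lang_argument(labels, table=LANGUAGE_CODES):
    """Join codes with '+', the syntax Tesseract uses for several languages."""
    return "+".join(table[label] for label in labels)


add_language("Ukrainian", "ukr")
print(tesseract_lang_argument(["Ukrainian", "English"]))  # ukr+eng
```

The resulting string is the kind of value that can be passed as the `lang` argument of `pytesseract.image_to_string`.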

Conda must be installed on the machine and configured in KNIME (under Preferences - KNIME - Conda) as described in the "Prerequisites" section of the Conda documentation linked below.

For portability, the Conda Environment Propagation (CEP) node sets up the environment, so it should not be necessary to create the following environment manually. The commands are listed for the sake of completeness, in case a workaround without the CEP node is needed:

Linux: conda create -n knime_ocr_tess_pdfium -c knime -c conda-forge --strict-channel-priority python=3.11 knime-python-scripting=5.8 pypdfium2 opencv pytesseract tesseract pillow numpy pandas
Windows: conda create -n knime_ocr_tess_pdfium -c knime -c conda-forge -c pypdfium2-team -c bblanchon --strict-channel-priority python=3.11 knime-python-scripting=5.8 pypdfium2-team::pypdfium2_helpers opencv pytesseract tesseract pillow numpy pandas
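If the environment is created manually, a quick check along the following lines may help confirm that it is usable. This is a minimal sketch, not part of the workflow: the helper names are hypothetical, and it assumes the usual import names for the packages listed above (`cv2` for opencv, `PIL` for pillow, and `pypdfium2` for both the Linux and Windows pypdfium2 packages).

```python
import importlib.util
import shutil

# Import names differ from the conda package names for opencv and pillow.
REQUIRED_MODULES = ["pypdfium2", "cv2", "pytesseract", "PIL", "numpy", "pandas"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def tesseract_on_path():
    """pytesseract only wraps the tesseract executable, so it must be on PATH."""
    return shutil.which("tesseract") is not None


print("Missing Python modules:", missing_modules() or "none")
print("tesseract executable found:", tesseract_on_path())
```

Run it with the Python interpreter of the `knime_ocr_tess_pdfium` environment; an empty list of missing modules and a found `tesseract` executable suggest the environment matches the commands above.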

Note: If any language other than English is selected, the workflow downloads the matching Tesseract language files and stores them in the workflow folder under /data/tessdata/.
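The download step can be pictured roughly as follows. This is a sketch under stated assumptions rather than the workflow's actual code: it assumes the files are fetched from the official `tesseract-ocr/tessdata` GitHub repository, and `TESSDATA_URL` and `ensure_language` are hypothetical names.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: language files come from the official tesseract-ocr/tessdata
# repository; the workflow may use a different source or model variant.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/raw/main/{lang}.traineddata"


def ensure_language(lang, tessdata_dir, fetch=urlretrieve):
    """Download <lang>.traineddata into tessdata_dir unless it already exists."""
    tessdata_dir = Path(tessdata_dir)
    target = tessdata_dir / f"{lang}.traineddata"
    if not target.exists():
        tessdata_dir.mkdir(parents=True, exist_ok=True)
        # `fetch` is injectable so the function can be exercised offline.
        fetch(TESSDATA_URL.format(lang=lang), target)
    return target
```

A call such as `ensure_language("deu", "data/tessdata")` would place `deu.traineddata` in the folder, which can then be handed to Tesseract through pytesseract's `config` argument, for example `config='--tessdata-dir "data/tessdata"'`.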

This workflow is based on this one: https://hub.knime.com/s/hDBtIjjK900pPNaK

External resources

Conda Documentation, Prerequisites


Legal

By using or downloading the workflow, you agree to our terms and conditions.