OCR Foreign Language PDFs with Python and KNIME (with Tesseract and PDFium)
This workflow shows how to OCR PDFs written in a foreign language using Python and KNIME. If the desired language does not appear in the drop-down when configuring the OCR Component, additional languages can be added by editing the Python script inside the component.
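As a rough illustration of what adding a language could look like, the sketch below keeps a mapping from drop-down label to Tesseract language code and passes the chosen code to the OCR call. The names `add_language` and `ocr_pdf` and the mapping itself are hypothetical and not taken from the workflow's script, which may be organized differently. The sketch assumes the package is importable as `pypdfium2` and that `pytesseract` can find a Tesseract binary.

```python
def add_language(label, code, languages):
    """Register a Tesseract language code (e.g. 'ukr') under a display label."""
    code = code.strip()
    if not code:
        raise ValueError("Tesseract language code must not be empty")
    languages[label] = code
    return languages


def ocr_pdf(pdf_path, lang_code, tessdata_dir=None, dpi=300):
    """Render each PDF page with PDFium and OCR it with Tesseract.

    Returns a list with one text string per page.
    """
    # Imported lazily so add_language() works without the OCR stack installed.
    import pypdfium2 as pdfium
    import pytesseract

    # Point Tesseract at a custom language folder if one is given.
    config = f'--tessdata-dir "{tessdata_dir}"' if tessdata_dir else ""

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            # PDF user space is 72 units per inch, so scale = dpi / 72.
            image = page.render(scale=dpi / 72).to_pil()
            texts.append(
                pytesseract.image_to_string(image, lang=lang_code, config=config)
            )
    finally:
        pdf.close()
    return texts


# Hypothetical drop-down entries; extend with further labels and codes as needed.
languages = {"English": "eng", "German": "deu", "French": "fra"}
add_language("Ukrainian", "ukr", languages)
```

A call such as `ocr_pdf("scan.pdf", languages["Ukrainian"], tessdata_dir="data/tessdata")` would then OCR the file with the Ukrainian model, provided the corresponding `.traineddata` file is present in that folder.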
Conda must be installed on the machine and configured in KNIME as described in the "Prerequisites" section of this documentation (under Preferences - KNIME - Conda).
For portability, the Conda Environment Propagation (CEP) node sets up the environment, so it should not be necessary to create the following environment manually. The commands are listed for the sake of completeness, in case a workaround without the CEP node is needed:
Linux: conda create -n knime_ocr_tess_pdfium -c knime -c conda-forge --strict-channel-priority python=3.11 knime-python-scripting=5.8 pypdfium2 opencv pytesseract tesseract pillow numpy pandas
Windows: conda create -n knime_ocr_tess_pdfium -c knime -c conda-forge -c pypdfium2-team -c bblanchon --strict-channel-priority python=3.11 knime-python-scripting=5.8 pypdfium2-team::pypdfium2_helpers opencv pytesseract tesseract pillow numpy pandas
Note: If any language other than English is selected, the workflow downloads the matching Tesseract language files and stores them within the workflow folder under /data/tessdata/.
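The download step described in the note could be sketched roughly as follows. This is an illustration, not the workflow's actual code: the function name `ensure_traineddata` is hypothetical, and the source URL assumes the official tesseract-ocr/tessdata repository on GitHub, while the workflow may use a different source such as tessdata_fast or tessdata_best.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: language files are fetched from the official tessdata repository.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/raw/main/{lang}.traineddata"


def ensure_traineddata(lang, tessdata_dir, fetch=urlretrieve):
    """Make sure <lang>.traineddata exists in tessdata_dir.

    Returns the path to the language file, or None for English, which is
    assumed to ship with the Tesseract installation. The `fetch` parameter
    exists so the downloader can be swapped out, e.g. for offline testing.
    """
    if lang == "eng":
        return None

    target_dir = Path(tessdata_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{lang}.traineddata"

    # Download only once; later runs reuse the stored file.
    if not target.exists():
        fetch(TESSDATA_URL.format(lang=lang), str(target))
    return target
```

Calling `ensure_traineddata("deu", "data/tessdata")` downloads the German model on the first run and reuses the stored file afterwards, so the workflow stays usable offline once a language has been fetched.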
This workflow is based on this one: https://hub.knime.com/s/hDBtIjjK900pPNaK