How to use Python in PDI
Python Integration : Pentaho Data Integration
Python is an open-source, interpreted, high-level programming language. It offers strong support for object-oriented programming and is one of the most widely used languages among data scientists and data analysts across a range of applications and projects. The language provides rich functionality for mathematics, statistics, and scientific computing, along with excellent libraries for data science applications.
Learn How to use R language in PDI or explore Pentaho data science integration for comprehensive analytics solutions.
Pentaho 10.2 Solution Architecture for Python Integration
Architecture Overview
Pentaho 10.2 provides seamless Python integration for data science workflows, from data preparation through model deployment.
Solution Architecture Blueprint

Pentaho 10.2 Data Integration and Python
Pentaho 10.2 is a unified data platform that provides Data Integration as one of its services. Using Pentaho 10.2 Data Integration (PDI) you can access, prepare, and blend data faster; it also provides seamless orchestration for building data pipeline services.
It also provides capabilities to operationalize Python, so that data scientists can take advantage of this versatile language's strengths to develop predictive solutions alongside existing PDI steps. Python can be integrated into Pentaho 10.2 via the CPython Script Executor (Marketplace plugin).
Integrating Python in Pentaho 10.2 Data Integration (PDI)
Before using the CPython Script Executor, the following steps need to be completed:
- Install a suitable Python version and set the system variable paths accordingly
- Install the core base libraries:
  - Pandas
  - Numpy
  - Py4j
  - Matplotlib
  - sklearn
- Install the CPython Script Executor from the Marketplace provided in your PDI
- Restart the DI Server and Spoon so that the environment changes take effect
Once these steps are complete, you can create a new transformation and begin working with the CPython Script Executor step.
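As a quick sanity check after installation, a short script like the following can confirm that the core libraries resolve in the Python interpreter PDI will launch. This is a standalone sketch for verification, not part of PDI itself:

```python
import importlib.util

# Core libraries listed in the install steps above.
# Note: "sklearn" is the import name for the scikit-learn package.
REQUIRED = ["pandas", "numpy", "py4j", "matplotlib", "sklearn"]

def missing_libraries(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_libraries(REQUIRED)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are available.")
```

Run this with the same Python executable configured in the step so you are testing the environment PDI will actually use.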
CPython Script Executor
The step uses the C implementation of the Python programming language. While there are JVM-based solutions available – such as Jython – that allow a more tightly integrated experience when executing in the JVM, these do not facilitate the use of many high-powered Python libraries for scientific computing, due to the fact that such libraries include highly optimized components that are written in C or Fortran. In order to gain access to such libraries, the PDI step launches, and communicates with, a micro-service running in the C Python environment.
The CPython Script Executor step requires the developer to point the step at the Python environment and to map the input and/or output fields.
- Step name: Specifies the unique name of the Python Executor step on the canvas. You can customize the name or leave it as the default.
The step itself offers maximum flexibility when it comes to dealing with data. It can act as a start point/data source in PDI (thus allowing the developer the freedom to source data directly via their Python code if so desired), or it can accept data from an upstream step and push it into the Python environment. In the latter case, the user can opt to send all incoming rows to Python in one hit, send fixed sized batches of rows, or send rows one-at-a-time.
A Python script can be specified via the built-in editor or loaded from a file dynamically at runtime.
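A typical script body might look like the sketch below. Two assumptions are made for illustration: the step dialog is configured to expose incoming rows as a pandas DataFrame named `df`, and the variable `out_df` is configured as the step's output; both names are set in the step dialog, not fixed by Python. The guard at the bottom builds a small sample frame so the script can also be run on its own outside PDI:

```python
import pandas as pd

def enrich(frame: pd.DataFrame) -> pd.DataFrame:
    """Add a derived column; a stand-in for real data-science logic."""
    result = frame.copy()
    result["total"] = result["quantity"] * result["unit_price"]
    return result

# Inside the step, "df" would already be populated from the upstream rows.
# Outside PDI, fall back to a hypothetical sample so the script still runs.
if "df" not in globals():
    df = pd.DataFrame({"quantity": [2, 5], "unit_price": [3.0, 1.5]})

out_df = enrich(df)
print(out_df)
```

Downstream steps in the transformation would then receive the rows of `out_df`, including the derived `total` field.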
Key Benefits of Pentaho 10.2 for Python Integration
- Faster Execution: Java 17 provides 2-3x faster Python script execution
- Seamless Integration: CPython Script Executor integrates Python seamlessly into PDI workflows
- Flexible Data Handling: Can act as data source or accept upstream data, supports batch or row-by-row processing
- Rich Library Support: Access to high-powered Python libraries (Pandas, Numpy, sklearn, Matplotlib)
- Operationalization: Python models can be deployed directly into operational workflows
- Consistent Environments: Docker containerization provides consistent Python environments
- Data Quality: Integrated data quality validation ensures Python model inputs are accurate
- Complete Lineage: Open Lineage tracking provides complete audit trail through Python scripts
Frequently Asked Questions
How does Pentaho integrate with Python?
Pentaho 10.2 Data Integration integrates with Python through the CPython Script Executor step, which embeds Python scripts seamlessly into PDI workflows. Python scripts can act as data sources or accept upstream data, supporting both batch and row-by-row processing within PDI transformations.
What Python libraries can I use with Pentaho?
Pentaho supports access to high-powered Python libraries including Pandas (data manipulation), Numpy (numerical computing), sklearn (machine learning), Matplotlib (visualization), and other Python data science libraries. Python scripts integrated into PDI can leverage any Python library available in the environment.
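For illustration, here is a minimal snippet combining two of the libraries named above, Numpy for numerical arrays and Pandas for tabular manipulation. It can be run on its own or pasted into the step's script editor:

```python
import numpy as np
import pandas as pd

# Standardize a column of values: NumPy computes the statistics,
# pandas holds the tabular result.
values = np.array([1.0, 2.0, 3.0, 4.0])
frame = pd.DataFrame({"value": values})
frame["zscore"] = (frame["value"] - values.mean()) / values.std()
print(frame)
```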
How do I execute Python scripts in Pentaho PDI?
Python scripts are executed in Pentaho PDI using the CPython Script Executor step. This step integrates Python seamlessly into PDI workflows, allowing Python scripts to process data from upstream steps and pass results to downstream steps in the transformation pipeline.
Can Python models be deployed in production with Pentaho?
Yes. Python models can be deployed directly into operational workflows using Pentaho PDI. Models operationalized in PDI provide real-time predictions as data flows through pipelines, with automated model updates and complete lineage tracking.
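A common operationalization pattern is to persist a trained model during development and unpickle it inside a CPython Script Executor step for scoring. The sketch below is self-contained for illustration: `ThresholdModel` is a hypothetical stand-in for a real trained estimator (e.g. from scikit-learn), and the pickle path is illustrative, not a PDI convention:

```python
import os
import pickle
import tempfile

class ThresholdModel:
    """Stand-in for a trained model with a predict() method."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        return [1 if v >= self.threshold else 0 for v in values]

# Training side: persist the model once (path is illustrative).
model_path = os.path.join(tempfile.gettempdir(), "churn_model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(ThresholdModel(threshold=0.5), fh)

# Scoring side: what a CPython Script Executor step would do per batch --
# load the model and score the incoming values.
with open(model_path, "rb") as fh:
    model = pickle.load(fh)

scores = [0.2, 0.7, 0.9]
predictions = model.predict(scores)
print(predictions)  # -> [0, 1, 1]
```

In a real deployment, the training and scoring sides would live in separate transformations, with the pickled model shared via a file location both can reach.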
What are the benefits of using Python with Pentaho?
Key benefits include faster execution (2-3x with Java 17), seamless integration into PDI workflows, flexible data handling (batch or row-by-row), rich library support (Pandas, Numpy, sklearn), operationalization of Python models, consistent environments (Docker), and complete lineage tracking.
Does Pentaho ensure data quality for Python scripts?
Yes. Pentaho provides integrated data quality validation ensuring Python model inputs are accurate. 250+ predefined quality rules and AI/ML-powered anomaly detection ensure data quality before Python scripts process the data.
How does Pentaho handle Python environment consistency?
Pentaho supports Docker containerization providing consistent Python environments across development, testing, and production. This ensures Python scripts run identically in all environments, reducing deployment issues and ensuring reproducible results.
🎯 Ready to integrate Python with Pentaho?
Pentaho 10.2 Data Integration enables seamless integration of Python data science scripts into operational workflows. Turn your Python scripts into operationalized predictive solutions with faster execution and complete lineage tracking.
Contact TenthPlanet for expert Pentaho Python integration and data science implementation services.
Note: This guide provides a comprehensive overview of Python integration with Pentaho PDI. Actual implementations may vary based on specific Python libraries, model requirements, and operational needs.