PDI (Kettle) ETL with Hortonworks 2.4 & 2.2 using Big Data plugins
Background for this trial:
Everyone talks about big data, so I wanted to know how I could interact with this big data world, with all its buzzwords like Cassandra, Hortonworks and the other distributions available on the internet. So I picked the Hortonworks VMware sandbox, which promised to work out of the box, and it did.
Next, as a BI/ETL professional coming from the relational world, I wanted to explore how I could store some data in a big data system. After a considerable amount of research I learned that in the big data world things get stored in the Hadoop Distributed File System (HDFS). From there the data is moved into Hive (queried with SQL-like HiveQL), HBase (a key-value store) or anywhere else for further processing as needed. So wherever the data will eventually be accessed from within the Hadoop ecosystem, it has to reach HDFS as a first step.
Please note: this is not a complete guide to configuring PDI to scale in a real-world, multi-node Hortonworks cluster. Please refer to the Pentaho and Hortonworks documentation for more elaborate notes.
Objective:
Copy a CSV file from a local Windows machine to a Hortonworks Sandbox VM's HDFS file system using Pentaho Data Integration (PDI) ETL.
My current environment & software setup:
- Windows 7 laptop with 16 GB RAM
- Hortonworks HDP 2.2 (VMware sandbox) (tested with PDI 6.0.1, working well with HDP 2.2)
- Hortonworks HDP 2.4 (VMware sandbox) (tested with PDI 6.0.1, working well with HDP 2.4)
- PDI Community Edition 6.x
- PuTTY
- WinSCP
My journey of trial and error:
A local VMware instance of the Hortonworks sandbox 2.4 was running at 192.168.56.129. This is the latest sandbox available for download as of 11/03/2016.
To make PDI talk to your Hortonworks cluster, shims need to be configured. Refer to http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home
As per the Pentaho 6.0 documentation, I needed to obtain some configuration files from the HDP VM's Linux host to make my PDI work with HDP 2.4.
My local instance of PDI CE lives at C:\pentaho\design-tools\DI-CE\data-integration.
The files to be copied from the VM are:
- core-site.xml
- hbase-site.xml
- hdfs-site.xml
- hive-site.xml
- mapred-site.xml
- yarn-site.xml
Missing files can be obtained from /etc/hive/conf & /etc/hbase/conf on the VM and moved to the shim folder under plugins\pentaho-big-data-plugin\hadoop-configurations (hdp22 in my case); a scripted way to pull everything is sketched below.
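If you prefer the command line over WinSCP, something like the following pulls all of these files in one go. This is only a sketch: it assumes pscp (PuTTY's command-line scp client) is on the PATH, and reuses the sandbox IP, root login and hdp22 shim folder from this post, so adjust the paths to your own setup.
- pscp root@192.168.56.129:/etc/hadoop/conf/*-site.xml "C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22"
- pscp root@192.168.56.129:/etc/hive/conf/hive-site.xml "C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22"
- pscp root@192.168.56.129:/etc/hbase/conf/hbase-site.xml "C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22"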
- Launch PDI Spoon -> Tools -> Hadoop Distribution and select Hortonworks HDP 2.2.x to configure PDI to work with the Hortonworks 2.2 shim.
- Edited hdfs-site.xml at C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22, changing the property below from true to false:
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
- After editing this file locally, I copied the files back to the sandbox VM at /etc/hadoop/conf (a command-line alternative is sketched after these steps).
- Restarted the VM for all the settings to take effect.
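The copy back can also be done with pscp instead of WinSCP; again just a sketch, reusing the IP, login and shim path already mentioned in this post:
- pscp "C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22\hdfs-site.xml" root@192.168.56.129:/etc/hadoop/conf/
As far as I can tell, the shim chosen in Spoon is simply recorded in plugins\pentaho-big-data-plugin\plugin.properties as active.hadoop.configuration=hdp22, so that file is a good place to check if the Tools menu selection does not seem to stick.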
I created a folder in the HDFS file system using PuTTY, logged in as root, and executed the commands below; the final ls confirms the folder exists:
- hadoop fs -mkdir /user/cheran
- hadoop fs -chmod 777 /user/cheran
- hadoop fs -ls /user/
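If HDFS refuses the mkdir for the root account (which can happen while dfs.permissions.enabled is still true), the same thing can usually be done as the hdfs superuser; this is a hedged aside rather than something I needed on the sandbox:
- sudo -u hdfs hadoop fs -mkdir /user/cheran
- sudo -u hdfs hadoop fs -chown root /user/cheran    # hand the folder back to root afterwards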
Finally, everything is set to run a PDI job that pushes a file from my local Windows machine to HDFS on the Linux box using PDI 6.0 CE ETL.
I used the sample transformation file obtained from the Pentaho documentation. Thanks once again, Pentaho, for the excellent documentation & samples. Follow the link to obtain the sample data & transformation files.
The pictures below explain more about the job and my changes.
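As a side note, the same job can be kicked off without the Spoon UI using Kitchen (for .kjb jobs) or Pan (for .ktr transformations), the command-line runners that ship with PDI. The file path below is only a placeholder for wherever you saved the sample:
- cd C:\pentaho\design-tools\DI-CE\data-integration
- kitchen.bat /file:C:\samples\load_csv_to_hdfs.kjb /level:Basic    # placeholder path and name, adjust to your copy of the sample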
Outcome
The job took roughly 45 seconds to load a 75 MB CSV file from my Windows machine into the Linux HDFS file system.
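To double-check from the sandbox side, a quick listing confirms the file landed and shows its size (the folder is the one created earlier; the file name will be whatever your job writes):
- hadoop fs -ls /user/cheran
- hadoop fs -du -h /user/cheran    # human-readable size, should show roughly 75 MB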
Task accomplished: I interacted with a big data system. Please click here to download my modified files.
PDI 6.0 (Windows-based) can load data into both the Hortonworks 2.2 and 2.4 HDFS file systems.