PDI (Kettle) ETL with Hortonworks 2.4 & 2.2 using Big Data plugins
Background for this trial:
Everyone talks about big data, so I wanted to know how I could interact with this big data world, with all its buzzwords like Cassandra, Hortonworks and the other distributions available on the internet. So I picked the Hortonworks VMware sandbox, which promised to work out of the box, and it did.
Next, as a BI/ETL professional coming from the relational world, I wanted to explore how I could store some data in a big data system. After a considerable amount of research I learned that in the big data world things get stored in the Hadoop Distributed File System (HDFS). From there the data is moved into Hive (queried with SQL-like HiveQL), HBase (a key-value store) or anywhere else for further processing as needed. So wherever the data will eventually be accessed from within the Hadoop ecosystem, it has to reach HDFS as a first step.
Please note: this is not a complete guide to configuring PDI to scale in a real-world, multi-node Hortonworks cluster. Please refer to the Pentaho and Hortonworks documentation for more elaborate notes.
Objective:
Copy a CSV file from a local Windows machine to a Hortonworks Sandbox VM's HDFS file system using Pentaho Data Integration (PDI) ETL.
My current environment & software setup:
- Windows 7 laptop with 16 GB RAM
- Hortonworks HDP 2.2 (VMware sandbox) (tested with PDI 6.0.1, working well with HDP 2.2)
- Hortonworks HDP 2.4 (VMware sandbox) (tested with PDI 6.0.1, working well with HDP 2.4)
- PDI Community Edition 6.x
- PuTTY
- WinSCP
My journey of trial and error:
A local VMware instance of the Hortonworks sandbox 2.4 was running at 192.168.56.129. This is the latest sandbox available for download as of 11/03/2016.
To make PDI talk to your Hortonworks cluster, shims need to be configured. Refer to http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home
As per the Pentaho 6.0 documentation, I needed to obtain some configuration files from the HDP VM's Linux host to make my PDI work with HDP 2.4.
My local instance of PDI CE lives at C:\pentaho\design-tools\DI-CE\data-integration.
The files to be copied from the VM are:
- core-site.xml
- hbase-site.xml
- hdfs-site.xml
- hive-site.xml
- mapred-site.xml
- yarn-site.xml
Missing files can be obtained from /etc/hive/conf & /etc/hbase/conf on the VM and moved to the shim folder under plugins\pentaho-big-data-plugin\hadoop-configurations (hdp22 in my case); a scripted way to pull everything is sketched below.
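If you prefer the command line over WinSCP, something like the following pulls all of these files in one go. This is only a sketch: it assumes pscp (PuTTY's command-line scp client) is on the PATH, and reuses the sandbox IP, root login and hdp22 shim folder from this post, so adjust the paths to your own setup.
- pscp root@192.168.56.129:/etc/hadoop/conf/*-site.xml "C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22"
- pscp root@192.168.56.129:/etc/hive/conf/hive-site.xml "C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22"
- pscp root@192.168.56.129:/etc/hbase/conf/hbase-site.xml "C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22"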
- Launch PDI Spoon -> Tools -> Hadoop Distribution and select Hortonworks HDP 2.2.x to configure PDI to work with the Hortonworks 2.2 shim.
- Edited hdfs-site.xml at C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22, changing the property below from true to false:
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
- After editing this file locally, I copied the files back to the sandbox VM at /etc/hadoop/conf (a command-line alternative is sketched after these steps).
- Restarted the VM for all the settings to take effect.
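The copy back can also be done with pscp instead of WinSCP; again just a sketch, reusing the IP, login and shim path already mentioned in this post:
- pscp "C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22\hdfs-site.xml" root@192.168.56.129:/etc/hadoop/conf/
As far as I can tell, the shim chosen in Spoon is simply recorded in plugins\pentaho-big-data-plugin\plugin.properties as active.hadoop.configuration=hdp22, so that file is a good place to check if the Tools menu selection does not seem to stick.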
I created a folder in the HDFS file system using PuTTY, logged in as root, and executed the commands below; the final ls confirms the folder exists:
- hadoop fs -mkdir /user/cheran
- hadoop fs -chmod 777 /user/cheran
- hadoop fs -ls /user/
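If HDFS refuses the mkdir for the root account (which can happen while dfs.permissions.enabled is still true), the same thing can usually be done as the hdfs superuser; this is a hedged aside rather than something I needed on the sandbox:
- sudo -u hdfs hadoop fs -mkdir /user/cheran
- sudo -u hdfs hadoop fs -chown root /user/cheran    # hand the folder back to root afterwards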
Finally, everything is set to run a PDI job that pushes a file from my local Windows machine to HDFS on the Linux box using PDI 6.0 CE ETL.
I used the sample transformation file obtained from the Pentaho documentation. Thanks once again, Pentaho, for the excellent documentation & samples. Follow the link to obtain the sample data & transformation files.
The pictures below explain more about the job and my changes.
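As a side note, the same job can be kicked off without the Spoon UI using Kitchen (for .kjb jobs) or Pan (for .ktr transformations), the command-line runners that ship with PDI. The file path below is only a placeholder for wherever you saved the sample:
- cd C:\pentaho\design-tools\DI-CE\data-integration
- kitchen.bat /file:C:\samples\load_csv_to_hdfs.kjb /level:Basic    # placeholder path and name, adjust to your copy of the sample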
Outcome
The job took roughly 45 seconds to load a 75 MB CSV file from my Windows machine into the Linux HDFS file system.
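To double-check from the sandbox side, a quick listing confirms the file landed and shows its size (the folder is the one created earlier; the file name will be whatever your job writes):
- hadoop fs -ls /user/cheran
- hadoop fs -du -h /user/cheran    # human-readable size, should show roughly 75 MB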
Task accomplished: I interacted with a big data system. Please click here to download my modified files.
PDI 6.0 (Windows-based) can load data into both the Hortonworks 2.2 and 2.4 HDFS file systems.