Friday 11 March 2016

PDI with Hortonworks 2.4 & 2.2

PDI (Kettle) ETL with Hortonworks 2.4 & 2.2 using the Big Data plugin



Background for this trial:
Everyone talks about big data, so I wanted to know how I could interact with this big data world and all its buzzwords like Cassandra, Hortonworks and the other distributions available on the internet. So I picked the Hortonworks VMware sandbox, which promised to work out of the box, and it did.

Next, as a BI / ETL professional from a relational world, I wanted to explore how I could store some data in a big data system. After a fair amount of research I learned that in the big data world things get stored in the Hadoop Distributed File System (HDFS). From there the data is moved into Hive (SQL-like HiveQL), HBase (a key-value store) or anywhere else for further processing as needed. So wherever the data is to be accessed from within the Hadoop ecosystem, it has to reach HDFS as a first step.

Please note: this is not a complete guide to configuring PDI at scale in a real-world Hortonworks environment with multiple nodes and clusters. Please refer to the Pentaho and Hortonworks documentation for detailed notes.

Objective:

Copy a CSV file from a local Windows machine to the HDFS file system of a Hortonworks Sandbox VM using Pentaho Data Integration (PDI) ETL.

My current environment & software setup:
  • Windows 7 laptop with 16 GB RAM
  • Hortonworks 2.2 (VMware sandbox) (tested with PDI 6.0.1, working well with HDP 2.2)
  • Hortonworks 2.4 (VMware sandbox) (tested with PDI 6.0.1, working well with HDP 2.4)
  • PDI Community Edition 6.x
  • PuTTY
  • WinSCP

My journey of trial and error:
A local VMware instance of the Hortonworks sandbox 2.4 was running at 192.168.56.129. This is the latest sandbox available for download as of 11/03/2016.
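Before going any further it is worth a quick check that the sandbox is reachable from the Windows host. A minimal sketch from a Windows command prompt, assuming the sandbox answers SSH on port 22 with the default root login (your IP, port and credentials may differ):

  REM confirm the sandbox answers on the host-only network
  ping 192.168.56.129

  REM open an SSH session with plink (bundled with PuTTY); you will be prompted for the password
  plink -ssh root@192.168.56.129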





To make PDI talk to your Hortonworks instance, shims need to be configured. Refer to http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home

As per the Pentaho 6.0 documentation, I needed to obtain some configuration files from the HDP VM's Linux host to make my PDI work with HDP 2.4.

My local instance of PDI CE is installed at C:\pentaho\design-tools\DI-CE\data-integration.
The files to be copied from the VM are:
  1. core-site.xml 
  2. hbase-site.xml 
  3. hdfs-site.xml 
  4. hive-site.xml 
  5. mapred-site.xml 
  6. yarn-site.xml
Of these 6 files, I could only find 4 at /etc/hadoop/conf, which are:
  1. core-site.xml 
  2. hdfs-site.xml 
  3. mapred-site.xml 
  4. yarn-site.xml 
Copy them to C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22

The missing files (hive-site.xml and hbase-site.xml) can be obtained from /etc/hive/conf & /etc/hbase/conf and moved to the same location; one way to copy everything over is sketched below.
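Since PuTTY is already in the toolkit, its pscp utility is one way to pull all six files down to the shim folder (WinSCP works just as well through its GUI). A rough sketch from a Windows command prompt, assuming pscp is on the PATH and the default sandbox root login; adjust the paths to match your own install:

  REM destination is the hdp22 shim folder of my PDI install
  set SHIM=C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22
  pscp root@192.168.56.129:/etc/hadoop/conf/core-site.xml %SHIM%
  pscp root@192.168.56.129:/etc/hadoop/conf/hdfs-site.xml %SHIM%
  pscp root@192.168.56.129:/etc/hadoop/conf/mapred-site.xml %SHIM%
  pscp root@192.168.56.129:/etc/hadoop/conf/yarn-site.xml %SHIM%
  pscp root@192.168.56.129:/etc/hive/conf/hive-site.xml %SHIM%
  pscp root@192.168.56.129:/etc/hbase/conf/hbase-site.xml %SHIM%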
  • Launch PDI Spoon –> Tools –> Hadoop Distribution and select Hortonworks HDP 2.2.x to configure PDI to work with the Hortonworks 2.2 shim
  • Edited hdfs-site.xml at C:\pentaho\design-tools\DI-CE\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\hdp22, changing the property below from true to false:
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>
  • After editing this file locally I copied it back to the sandbox VM at /etc/hadoop/conf (see the sketch after this list).
  • Restarted the VM for all the settings to take effect.
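A quick sketch of that copy-back step, again using pscp with the default root login and the %SHIM% variable set in the earlier sketch (WinSCP drag-and-drop does the same job):

  REM push the locally edited hdfs-site.xml back to the sandbox
  pscp %SHIM%\hdfs-site.xml root@192.168.56.129:/etc/hadoop/conf/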
Please keep in mind that this property change has a security impact on the HDFS file system, so research a bit more to understand why you are doing it. Since it's my local VM, I changed this setting so that PDI could push content to the HDFS file system without any restrictions (an alternative that leaves permissions enabled is sketched after the commands below).

I created a folder in the HDFS file system using PuTTY, logged in as root, and executed the commands below
  • hadoop fs -mkdir /user/cheran
  • hadoop fs -chmod 777 /user/cheran
To make sure the folder exists, I queried the HDFS file system using

  • hadoop fs -ls /user/
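If you would rather leave dfs.permissions.enabled set to true, an alternative is to create and hand over the folder as the hdfs superuser instead; a sketch, with cheran used only as a placeholder for whatever user name PDI writes as:

  # run on the sandbox as root; act as the hdfs superuser rather than relaxing permissions
  sudo -u hdfs hadoop fs -mkdir -p /user/cheran
  sudo -u hdfs hadoop fs -chown cheran /user/cheran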




Finally, everything is set to run a PDI job that pushes a file from my local machine to HDFS on the Linux box using PDI 6.0 CE ETL.

I used the sample transformation file obtained from the Pentaho documentation. Thanks once again, Pentaho, for the excellent documentation and samples. Follow the link to obtain the sample data and transformation files.
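As a side note, the same transformation can also be run headless with PDI's Pan tool instead of opening Spoon; a minimal sketch, where the .ktr path is just a placeholder for wherever you saved the sample transformation:

  REM run from the PDI data-integration folder
  cd C:\pentaho\design-tools\DI-CE\data-integration
  pan.bat /file:C:\samples\load_csv_to_hdfs.ktr /level:Basic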
The pictures below explain more about the job and my changes.





Outcome 

The job took roughly 45 seconds to load a 75 MB CSV file from my Windows machine to the Linux HDFS file system.
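To double-check the load, the file can be listed back from HDFS over the PuTTY session; the exact contents of the folder depend on the transformation's output step:

  # list the target folder and report the size of what landed there
  hadoop fs -ls /user/cheran
  hadoop fs -du -h /user/cheran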

Task accomplished: I have interacted with a big data system. Please click here to download my modified files.
PDI 6.0 (Windows based) can load data into both the Hortonworks 2.2 and 2.4 HDFS file systems.

