Tuesday, 3 March 2020

AWS Lambda Function Runtime Upgrade from Python 2.7 to 3.7.x




Why should this upgrade happen?

Python has announced that the 2.x series has reached its end of life (EOL). The message from the Python Software Foundation is:
"We have decided that January 1, 2020, will be the day that we sunset Python 2. That means that we will not improve it anymore after that day, even if someone finds a security problem in it. You should upgrade to Python 3 as soon as you can."

Who is this for?

AWS Lambda users whose functions are still running on the Python 2.7 runtime.

Getting ready for some action

1. Run the 2to3 command-line utility

Verify that Python 3 is the default version on the operating system:
$ python --version
Python 3.7.6
2to3 utility
2to3 is a command-line utility that ships with Python 3's default installation (this was tested on macOS). If it is missing, there are numerous ways to install it on Windows or Linux.
Point this utility at the Lambda function's Python file so it can upgrade the code to Python 3 syntax where needed.

If no changes are needed, the utility's output looks like this:

$ 2to3 lambda_function.py
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: No changes to lambda_function.py
RefactoringTool: Files that need to be modified:
RefactoringTool: lambda_function.py

If changes are needed, the utility outputs a diff:

$ 2to3 lambda_function.py
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: Refactored lambda_function.py
--- lambda_function.py (original)
+++ lambda_function.py (refactored)
@@ -1,5 +1,5 @@
 def greet(name):
-    print "Hello, {0}!".format(name)
-print "What's your name?"
-name = raw_input()
+    print("Hello, {0}!".format(name))
+print("What's your name?")
+name = input()
 greet(name)
RefactoringTool: Files that need to be modified:
RefactoringTool: lambda_function.py
Use the -w flag to write the changes back to the file:
$ 2to3 -w lambda_function.py
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: Refactored lambda_function.py
--- lambda_function.py (original)
+++ lambda_function.py (refactored)
@@ -1,5 +1,5 @@
 def greet(name):
-    print "Hello, {0}!".format(name)
-print "What's your name?"
-name = raw_input()
+    print("Hello, {0}!".format(name))
+print("What's your name?")
+name = input()
 greet(name)
RefactoringTool: Files that were modified:
RefactoringTool: lambda_function.py

2. Package the Python file

The Lambda Python file has now been rewritten with the required Python 3 syntax. Use the regular process to check the file into the repository and build it with Python 3 compatible boto3 libraries.
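As a rough sketch of that build step (assuming a single handler file plus a requirements.txt of Python 3 dependencies; neither file name is prescribed by this process), the deployment zip could be produced like this:

import subprocess
import zipfile
from pathlib import Path

# Hypothetical layout: lambda_function.py plus a requirements.txt of Python 3 dependencies.
BUILD_DIR = Path("build")
BUILD_DIR.mkdir(exist_ok=True)

# Vendor the dependencies (e.g. boto3, if you do not rely on the version bundled in the runtime).
subprocess.run(
    ["pip", "install", "-r", "requirements.txt", "-t", str(BUILD_DIR)],
    check=True,
)

# Zip the handler and the vendored packages into the deployment package.
with zipfile.ZipFile("function.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("lambda_function.py")
    for path in BUILD_DIR.rglob("*"):
        if path.is_file():
            zf.write(path, path.relative_to(BUILD_DIR))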

3. Upload the build to S3 and deploy it, or deploy it using pipelines or Terraform, and test it on Lambda with the Python 3.7 runtime. This worked successfully in our tests.
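For reference, here is a hedged boto3 sketch of that upload-and-deploy step outside of a pipeline (the bucket, key and function names are placeholders, not values from this post):

import boto3

# Placeholder names; substitute your own bucket, key and function name.
BUCKET = "my-lambda-artifacts"
KEY = "builds/function.zip"
FUNCTION_NAME = "my-function"

# Upload the deployment package to S3.
boto3.client("s3").upload_file("function.zip", BUCKET, KEY)

lambda_client = boto3.client("lambda")

# Switch the function to the Python 3.7 runtime.
# (If Lambda reports the function is still updating, wait for the previous update to finish.)
lambda_client.update_function_configuration(
    FunctionName=FUNCTION_NAME,
    Runtime="python3.7",
)

# Point the function at the new package.
lambda_client.update_function_code(
    FunctionName=FUNCTION_NAME,
    S3Bucket=BUCKET,
    S3Key=KEY,
    Publish=True,
)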

4. Roll out the same process for every other lambda_function.py file that is due for an upgrade.

More references and usage examples for the 2to3 utility can be found in the official Python documentation.

Friday, 8 February 2019

Some useful big-data ecosystem interview questions

Here are some good big-data ecosystem questions, shared with the community for interview preparation.

HDFS
  • What are the main components and node types that make up HDFS?
  • Briefly explain the purpose and function of the FsImage and EditLog.
  • What is the difference between a secondary NameNode and a standby NameNode?
  • What are the options for replicating an HDFS cluster, for example between a production and a DR cluster?
  • How are corrupted HDFS files detected and subsequently recovered?
  • Describe HDFS tiered storage concepts.
Kafka
  • Describe the functions of Kafka MirrorMaker.
  • How do you integrate Kafka MirrorMaker into an existing architecture?
Spark
  • In Spark, what are the definitions of jobs, stages, and tasks?
  • Describe each of the execution modes that Spark supports.
YARN
  • What is YARN used for? Describe each of its core components.
Cassandra
  • Explain how Cassandra stores data.
  • What is the difference between partitioning and clustering?
Thanks to the interview panel for sharing a copy of these questions.

Wednesday, 26 September 2018

Access VBA with MS SQL Server - Passing a Parameter from VBA Code to a SQL Server Query





As a part of consulting life, you never know what projects you will be working on, so here is a quick little transition story from the open-source world to the Microsoft world.



Tasks Explored

Migrate Access to SQL Server

The first task is to demonstrate that Access VBA code can pass a parameter to a SQL Server query using an ADODB connection string.


FYI: SSMA won't work with SQL Server 2008 R2 as it is end of life, so we have to live with a careful manual migration.


What was used


  • MS SQL Server 2008 R2 (or any version)
  • Access 2010 (or any version)
  • Firewall exception for port 1433 added, named pipes enabled
  • SQL Server Pubs database with the standard stored procedure


Reference



VBA Code


'BeginAppendVB 

    'To integrate this code 
    'replace the data source and initial catalog values 
    'in the connection string 

Public Sub Main() 
    On Error GoTo ErrorHandler 

    'recordset, command and connection variables 
    Dim Cnxn As ADODB.Connection 
    Dim cmdByRoyalty As ADODB.Command 
    Dim prmByRoyalty As ADODB.Parameter 
    Dim rstByRoyalty As ADODB.Recordset 
    Dim rstAuthors As ADODB.Recordset 
    Dim strCnxn As String 
    Dim strSQLAuthors As String 
    Dim strSQLByRoyalty As String 
     'record variables 
    Dim intRoyalty As Integer 
    Dim strAuthorID As String 

    ' Open connection 
    Set Cnxn = New ADODB.Connection 
    strCnxn = "Provider='sqloledb';Data Source='MySqlServer';" & _ 
        "Initial Catalog='Pubs';Integrated Security='SSPI';" 
    Cnxn.Open strCnxn 

    ' Open command object with one parameter 
    Set cmdByRoyalty = New ADODB.Command 
    cmdByRoyalty.CommandText = "byroyalty" 
    cmdByRoyalty.CommandType = adCmdStoredProc 

    ' Get parameter value and append parameter 
    intRoyalty = Trim(InputBox("Enter royalty:")) 
    Set prmByRoyalty = cmdByRoyalty.CreateParameter("percentage", adInteger, adParamInput) 
    cmdByRoyalty.Parameters.Append prmByRoyalty 
    prmByRoyalty.Value = intRoyalty 

    ' Create recordset by executing the command 
    Set cmdByRoyalty.ActiveConnection = Cnxn 
    Set rstByRoyalty = cmdByRoyalty.Execute 

    ' Open the Authors Table to get author names for display 
    ' and set cursor client-side 
    Set rstAuthors = New ADODB.Recordset 
    strSQLAuthors = "Authors" 
    rstAuthors.Open strSQLAuthors, Cnxn, adUseClient, adLockOptimistic, adCmdTable 

    ' Print recordset adding author names from Authors table 
    Debug.Print "Authors with " & intRoyalty & " percent royalty" 

    Do Until rstByRoyalty.EOF 
        strAuthorID = rstByRoyalty!au_id 
        Debug.Print "   " & rstByRoyalty!au_id & ", "; 
        rstAuthors.Filter = "au_id = '" & strAuthorID & "'" 
        Debug.Print rstAuthors!au_fname & " " & rstAuthors!au_lname 
        rstByRoyalty.MoveNext 
    Loop 

    ' clean up 
    rstByRoyalty.Close 
    rstAuthors.Close 
    Cnxn.Close 
    Set rstByRoyalty = Nothing 
    Set rstAuthors = Nothing 
    Set Cnxn = Nothing 
    Exit Sub 

ErrorHandler: 
    ' clean up 
    If Not rstByRoyalty Is Nothing Then 
        If rstByRoyalty.State = adStateOpen Then rstByRoyalty.Close 
    End If 
    Set rstByRoyalty = Nothing 

    If Not rstAuthors Is Nothing Then 
        If rstAuthors.State = adStateOpen Then rstAuthors.Close 
    End If 
    Set rstAuthors = Nothing 

    If Not Cnxn Is Nothing Then 
        If Cnxn.State = adStateOpen Then Cnxn.Close 
    End If 
    Set Cnxn = Nothing 

    If Err <> 0 Then 
        MsgBox Err.Source & "-->" & Err.Description, , "Error" 
    End If 
End Sub 
'EndAppendVB



Database : Microsoft Pubs


Stored proc definition

CREATE PROCEDURE [dbo].[byroyalty] @percentage int
AS
select au_id from titleauthor
where titleauthor.royaltyper = @percentage

Connection String

    ' Open connection
    Set Cnxn = New ADODB.Connection
    strCnxn = "Provider=SQLOLEDB; Data Source=localhost;Database=pubs;Trusted_Connection=Yes;"
    Cnxn.Open strCnxn




Voila, it works!





Saturday, 17 September 2016

Kafka Consumer Plugin Integration with PDI 6.1


Configuration of the Kafka Consumer plugin for PDI (Pentaho Data Integration)



Issue Background:
The Kafka Consumer plugin does not work out of the box when integrated with the PDI Spoon plugins, whereas the Kafka Producer does. Many people wonder how to fix this, so here is a workaround.
Tested Environment:
  • PDI 6.1 EE or CE
  • pentaho-kafka-producer-v1.6.zip
  • pentaho-kafka-consumer-v1.4.zip

Producer and Consumer Installation
  • Download the latest Kafka Consumer & Producer plugins from GitHub - https://github.com/RuckusWirelessIL
  • Wherever you have PDI installed, extract the plugins to the appropriate directory.
Example:

C:\pentaho\pdi6.1\data-integration\plugins\steps or your matching Linux path
Note: if you don't find the steps folder under data-integration\plugins, please create a new folder named "steps".



Open C:\pentaho\pdi6.1\data-integration\plugins\steps\pentaho-kafka-consumer\lib and check that the 17 jar files that ship with the plugin are present.


Procedure to Copy & Update Library Files

  • Compare the *.jar files under C:\pentaho\pdi6.1\data-integration\plugins\steps\pentaho-kafka-consumer\lib against those under pdi6.1\data-integration\lib.
  • Perform the copy & replace operation so that the plugin's lib folder ends up with the jar versions that ship under data-integration\lib (a small comparison sketch follows below).
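To make that comparison less error-prone, here is a small, hedged Python sketch (the paths are the example Windows paths from above; adjust them to your installation) that lists the jars whose versions differ between the plugin's lib folder and data-integration\lib:

import re
from pathlib import Path

# Example paths from this post; adjust to your own installation.
PLUGIN_LIB = Path(r"C:\pentaho\pdi6.1\data-integration\plugins\steps\pentaho-kafka-consumer\lib")
PDI_LIB = Path(r"C:\pentaho\pdi6.1\data-integration\lib")

def jar_versions(folder):
    """Map artifact name -> version string for every *.jar in a folder."""
    versions = {}
    for jar in folder.glob("*.jar"):
        match = re.match(r"(.+?)-(\d[\w.\-]*)\.jar$", jar.name)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions

plugin_jars = jar_versions(PLUGIN_LIB)
pdi_jars = jar_versions(PDI_LIB)

# Print every artifact the plugin bundles with a different version than PDI ships.
for name, plugin_version in sorted(plugin_jars.items()):
    pdi_version = pdi_jars.get(name)
    if pdi_version and pdi_version != plugin_version:
        print(f"{name}: plugin lib has {plugin_version}, data-integration/lib has {pdi_version}")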





  • Edit plugin.xml under \pdi6.1\data-integration\plugins\steps\pentaho-kafka-consumer and update the library version numbers to match the jar files under \data-integration\lib.

Sample - untouched original plugin.xml
                <libraries>
                                <library name="pentaho-kafka-consumer.jar"/>
                                <library name="lib/jline-0.9.94.jar"/>
                                <library name="lib/jopt-simple-3.2.jar"/>
                                <library name="lib/junit-3.8.1.jar"/>
                                <library name="lib/kafka_2.10-0.8.2.1.jar"/>
                                <library name="lib/kafka-clients-0.8.2.1.jar"/>
                                <library name="lib/log4j-1.2.16.jar"/>
                                <library name="lib/lz4-1.2.0.jar"/>
                                <library name="lib/metrics-core-2.2.0.jar"/>
                                <library name="lib/netty-3.7.0.Final.jar"/>
                                <library name="lib/scala-library-2.10.4.jar"/>
                                <library name="lib/slf4j-api-1.7.2.jar"/>
                                <library name="lib/slf4j-log4j12-1.6.1.jar"/>
                                <library name="lib/snappy-java-1.1.1.6.jar"/>
                                <library name="lib/xercesImpl-2.9.1.jar"/>
                                <library name="lib/xml-apis-1.3.04.jar"/>
                                <library name="lib/zkclient-0.3.jar"/>
                                <library name="lib/zookeeper-3.4.6.jar"/>
                </libraries>
Sample - modified plugin.xml
<libraries>
                                <library name="pentaho-kafka-consumer.jar"/>
                                <library name="lib/jline-0.9.94.jar"/>
                                <library name="lib/jopt-simple-3.2.jar"/>
                                <library name="lib/junit-3.8.1.jar"/>
                                <library name="lib/kafka_2.10-0.8.2.1.jar"/>
                                <library name="lib/kafka-clients-0.8.2.1.jar"/>
                                <library name="lib/log4j-1.2.17.jar"/>  
                                <library name="lib/lz4-1.2.0.jar"/>
                                <library name="lib/metrics-core-2.2.0.jar"/>
                                <library name="lib/netty-3.7.0.Final.jar"/>
                                <library name="lib/scala-library-2.10.4.jar"/>
                                <library name="lib/slf4j-api-1.7.7.jar"/>
                                <library name="lib/slf4j-log4j12-1.7.7.jar"/>
                                <library name="lib/snappy-java-1.1.0.jar"/>
                                <library name="lib/xercesImpl-2.9.1.jar"/>
                                <library name="lib/xml-apis-2.0.2.jar"/> 
                                <library name="lib/zkclient-0.3.jar"/>
                                <library name="lib/zookeeper-3.4.6.jar"/>
                </libraries>

  • Restart Spoon and test the producer and consumer.

Friday, 10 June 2016

Data Modelling in Impala - Cloudera aka BIG DATA

Data Modelling in Impala - Cloudera aka BIG DATA - Star Schema / ER Schema - Using the Parquet Format with Snappy / GZIP Compression

Background for the research:

Being a SQL fanatic from the relational RDBMS world, I was curious to explore how data modelling
[table creation] is done inside the big data world. The data model can follow any schema such as ER, star, or Data Vault. At the end of the day I see these schemas as plain tables with some extra bits and bolts: relationship keys and data organisation among them.

The goal was to create some table structures inside Impala with a good amount of data. Later this can be reimplemented as a star schema / ER / Data Vault model for a bigger picture and use case.

I planned to load 46 million records into my Impala table and analyse query performance with sum and average functions over the entire dataset. The query performance results are published in this post.

I selected Impala because the theory says it can handle bulk volumes of data with an efficient query mechanism, and it can be used for cube analytics / BI reporting needs on enormous data volumes without any MapReduce jobs.

By the end of this lengthy post you will be able to figure out how things work inside Impala. I have tried my best to organise the topics to cover the important areas below.


  1. Impala background & theory
  2. Impala environment recommendation
  3. DDL for internal & external tables
  4. Parquet & CSV file format based DDL
  5. Gzip & Snappy codecs for Parquet
  6. HDFS file ingestion - ingesting sample data into Impala - 46 million rows
  7. Querying Impala
  8. Query performance analysis
  9. Indyco - a data modeller for Impala



Disclaimer:

The information presented here was put together after plenty of research and reading from many different sources, to help the open-source community gain knowledge and share thoughts. I take no responsibility for any inaccurate information conveyed. I am human and I make mistakes in understanding theory, but in the end whatever I have produced here works up to an experimental level. Some software download or sample data links may stop working over time, and I take no responsibility for updating this post. So enjoy reading and spreading the knowledge as it is; you are free to reproduce any content published here.



Prerequisites & knowledge recommendations:
  1. SQL DDL knowledge
  2. Linux knowledge
  3. General exposure to HDFS & the Hadoop ecosystem

My Environment Setup

  • Laptop RAM - 16 GB (this is very much needed, as 10 GB is the minimum required for the Cloudera Enterprise sandbox, and only the enterprise edition ships Impala)
  • VM - Oracle VirtualBox
  • Cloudera Enterprise sandbox - cloudera-quick-start-vm-5.7.0-0-virtualbox
  • Processor - 2 CPU cores dedicated to the VM





Bingo, you should be all set to run the Cloudera sandbox without any hurdles!






Launching Impala

Impala can be launched and used via the HUE web interface - Cloudera has made life easy for us to run the various big data components. Just launch Firefox; the browser comes bookmarked with all the necessary URLs to get started. Impala can also be used from the terminal - just launch a terminal and run impala-shell. A small programmatic example is sketched below as well.
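For completeness, queries can also be issued programmatically. A minimal hedged sketch using the third-party impyla Python library (an assumption on my side - it is not bundled with the sandbox) against the quickstart VM's default Impala port might look like this:

from impala.dbapi import connect  # pip install impyla

# Connect to the Impala daemon on the quickstart VM (default HiveServer2-compatible port 21050).
conn = connect(host="quickstart.cloudera", port=21050)
cursor = conn.cursor()

# Run a simple statement and print the result rows.
cursor.execute("SHOW TABLES")
for (table_name,) in cursor.fetchall():
    print(table_name)

cursor.close()
conn.close()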

Impala Introduction & Some Theory:

Impala is an MPP SQL engine running inside the Cloudera big data ecosystem. It supports native ANSI-92 SQL (compatible with Hive SQL) - refer to the Impala SQL Reference - so happy days, SQL folks. But be aware that Impala has been designed for efficient query retrieval. Hence it supports statements like SELECT, ORDER BY, GROUP BY, joins, views and aggregate functions.

Impala does not support operations like DELETE, UPDATE, foreign keys, constraints, indexes or rollback. Hence its OLTP scope is limited.


Internal table creation without file formats:

By default, a table creation statement creates an internal table under the HDFS location /user/hive/warehouse/uk_regions

CREATE TABLE uk_regions (name STRING, abbr STRING); 
INSERT INTO uk_regions VALUES ("Sheffield" ,"SHF"), ("Leeds","LDS"), ("London","LDN");


system logs :


[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/uk_regions
Found 2 items
-rw-r--r--   1 impala supergroup   210 2016-06-01 12:54 /user/hive/warehouse/uk_regions/2345e31cd0a1c8d2-3c2fbcda5d8d159d_1304749469_data.0.
drwxrwxrwx   - impala supergroup     0 2016-06-01 12:54 /user/hive/warehouse/uk_regions/_impala_insert_staging

During the table creation I did not explicitly mention any table storage format, so the content is stored with the default storage type as 2345e31cd0a1c8d2-3c2fbcda5d8d159d_1304749469_data.0. The same table can be written in MySQL as follows; this is given just to show the difference in DDL.

CREATE TABLE `uk_regions`( `name` varchar(20) NOT NULL, `abbr` varchar(20) NOT NULL , PRIMARY KEY (`name`)) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Internal table creation with multiple file formats:
Impala supports the text, RCFile, SequenceFile, Parquet and Avro file formats with their appropriate compression codecs. File formats help Impala store and retrieve data from HDFS efficiently, in either columnar or row-based ordering. The Cloudera documentation table on file formats explains in more depth the different formats and their use cases for INSERT and CREATE statements.



Parquet File format : 

Impala also helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is especially good for queries scanning particular columns within a table, for example to query "wide" tables with many columns, or to perform aggregation operations such as SUM() and AVG() that need to process most or all of the values from a column. Each data file contains the values for a set of rows (the "row group"). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column. Queries against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.

Internal table creation in parquet format:

CREATE TABLE parquet_uk_regions (name STRING, abbr STRING) STORED AS PARQUET ;
INSERT INTO parquet_uk_regions VALUES ("Sheffield" ,"SHF"), ("Leeds","LDS"), ("London","LDN");

system logs :



[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/parquet_uk_regions/
Found 2 items
drwxrwxrwx   - impala supergroup     0 2016-06-01 15:05 /user/hive/warehouse/parquet_uk_regions/_impala_insert_staging
-rw-r--r--   1 impala supergroup   379 2016-06-01 15:05 /user/hive/warehouse/parquet_uk_regions/c044e267795afc59-89f2340de627089b_1709498250_data.0.parq

Here the file content gets stored as a Parquet file: c044e267795afc59-89f2340de627089b_1709498250_data.0.parq

External tables:
Impala also lets you create external tables. The main difference between internal and external tables is that an external table lets the end user hold their data anywhere in the HDFS directory structure. To be clearer, internal tables create and hold data only in the location explained earlier:



/user/hive/warehouse/parquet_uk_regions/

So all internal table content is placed under /user/hive/warehouse, where parquet_uk_regions is a table name. An external table, however, lets the user create folders anywhere in Hadoop as per their requirement, as in the example below:

/cheran/CSV

Here CSV is my table name and all the data related to the CSV table is placed under that directory. Another major advantage of external tables is that when you drop them in impala-shell or HUE, the data does not get deleted from HDFS. Before creating external tables we need to create the directory and set the necessary permissions for Impala to execute queries or operations against HDFS.




  • hadoop fs -mkdir /user/cheran
  • hadoop fs -mkdir /user/cheran/csv
  •  hadoop fs -chmod 777 /user/cheran/csv
  • sudo -u hdfs hadoop fs -chown -R impala:supergroup
        /user/cheran/csv
  • Please note: without executing the above commands, Impala will throw file permission exceptions when you execute any queries on top of this directory.

    External table creation in CSV format: the reason to create a CSV-based DDL is that our downloaded sample data is in CSV format. This DDL can store / read CSV files from HDFS.

     create external table if not exists logs (
    TimeSeconds double,
    parsedDate timestamp,
    dateTimeStr string,
    ipLayerProtocol int,
    ipLayerProtocolCode string,
    firstSeenSrcIp string,
    firstSeenDestIp string,
    firstSeenSrcPort int,
    firstSeenDestPor int,
    moreFragment int,
    contFragment int,
    durationSecond int,
    firstSeenSrcPayloadByte bigint,
    firstSeenDestPayloadByte bigint,
    firstSeenSrcTotalByte bigint,
    firstSeenDestTotalByte bigint,
    firstSeenSrcPacketCoun int,
    firstSeenDestPacketCoun int,
    recordForceOut int)
    row format delimited fields terminated by ',' lines terminated by '\n'
    location '/user/cheran/csv/';

    Table & sample data referenced from: http://raffy.ch/

    Now we have tables; as a next step let's download some sample data that matches the above DDL. The public dataset used here has 46 million records, so it is worth checking how Cloudera handles this volume against time.

    Steps to get the sample data and move it to HDFS
    • wget -P /home/cloudera/Downloads/ http://www.vacommunity.org/dl388
    This is a 500 MB file download and it holds 46 million records in 3 zip files. Unzip it to a local location:

    • unzip /home/cloudera/Downloads/VAST2013MC3_NetworkFlow.zip -d /home/cloudera/Downloads/

    When extracted, the 500 MB download occupied 5.6 GB of my disk space.

    • hdfs dfs -put /home/cloudera/Downloads/nf/nf-chunk*.csv /user/cheran/csv/

    This moves all the CSV files to the external table location under HDFS. That's all you need to do; you can start querying the data from impala-shell. Below are my query response times when handling 46 million records.

    CSV tables Query performance 

    [quickstart.cloudera:21000] > select count(*)  from logs;
    Query: select count(*)  from logs
    +----------+
    | count(*) |
    +----------+
    | 46138313 |
    +----------+
    Fetched 1 row(s) in 19.8s

    Query: select avg(durationsecond) from logs
    +---------------------+
    | avg(durationsecond) |
    +---------------------+
    | 19.98479690738564   |
    +---------------------+
    Fetched 1 row(s) in 23.3s

    CSV tables file storage inside HDFS


    When I queried the HDFS file system:

    [cloudera@quickstart Downloads]$ hadoop fs -du -s -h /user/cheran/csv/
    5.4 G  5.4 G  /user/cheran/csv


    I can see all the CSV files occupied 5.4 GB to hold the 46 million records. Please note Impala can give better query response times if these CSV files are converted to Parquet format.


    External table creation in Parquet format 

    Just before doing this, let's set up the required folders & permissions in HDFS:



  • hadoop fs -mkdir /user/cheran
  • hadoop fs -mkdir /user/cheran/log_parquet_format
  •  hadoop fs -chmod 777 /user/cheran/log_parquet_format
  • sudo -u hdfs hadoop fs -chown -R impala:supergroup
        /user/cheran/log_parquet_format


  • Please note: without executing the above commands, Impala will throw file permission exceptions when you execute any queries on top of this directory.

    External table creation in Parquet format: this DDL will store / read Parquet files.

    CREATE EXTERNAL TABLE IF NOT EXISTS log_parquet_format (
                    TimeSeconds double,
                    parsedDate timestamp,
                    dateTimeStr string,
                    ipLayerProtocol int,
                    ipLayerProtocolCode string,
                    firstSeenSrcIp string,
                    firstSeenDestIp string,
                    firstSeenSrcPort int,
                    firstSeenDestPor int,
                    moreFragment int,
                    contFragment int,
                    durationSecond int,
                    firstSeenSrcPayloadByte bigint,
                    firstSeenDestPayloadByte bigint,
                    firstSeenSrcTotalByte bigint,
                    firstSeenDestTotalByte bigint,
                    firstSeenSrcPacketCoun int,
                    firstSeenDestPacketCoun int,
                    recordForceOut int)
    STORED AS PARQUETFILE
    LOCATION '/user/cheran/log_parquet_format/';

    Converting all the existing CSV table data into Parquet format:

    The default compression codec for the Parquet format is Snappy; this can be changed to gzip as required.

    set compression_codec=snappy;
    insert overwrite table log_parquet_format select * from logs;

    or

    set compression_codec=gzip;
    insert overwrite table log_parquet_format select * from logs;

    This stores the Parquet files in compressed form using gzip or Snappy compression. Cloudera mostly recommends Snappy because of its query retrieval performance. I tried both Snappy and gzip to see how they differ in terms of storage space occupied.



    Parquet tables Query performance 

    select sum(durationsecond) from log_parquet_format;
    +---------------------+
    | sum(durationsecond) |
    +---------------------+
    | 922064755           |
    +---------------------+
    Fetched 1 row(s) in 1.85s


    [quickstart.cloudera:21000] > select avg(durationsecond) from log_parquet_format;
    Query: select avg(durationsecond) from log_parquet_format
    +---------------------+
    | avg(durationsecond) |
    +---------------------+
    | 19.98479690738564   |
    +---------------------+

    Fetched 1 row(s) in 1.95s


    Parquet tables file storage inside HDFS

      
    [cloudera@quickstart Downloads]$   hadoop fs -du -s -h /user/cheran/gzip/
    448.4 M  448.4 M  /user/cheran/gzip


    [cloudera@quickstart Downloads]$   hadoop fs -du -s -h /user/cheran/log_parquet_format/
    798.6 M  798.6 M  /user/cheran/log_parquet_format



    Getting Started with Indyco:


    I have tried the Indyco tool and it has some cool features. Indyco has efficient data governance & modelling capabilities. It lets you create physical / logical / conceptual data models and deploy them to Impala through a JDBC driver. It also supports Parquet file format based DDL creation. So try it out and let me know your thoughts. You may not get this tool via a straight open download link; you have to request a download link from Indyco for evaluation.





    Conclusion

    So we have discussed elaborately how to create tables using DDL and also introduced Indyco. Now it is down to your requirements & hard work to model an ER / star / Data Vault schema.


    Let me know your thoughts and the challenges you face while doing data modelling. Best of luck with all your Impala adventures.