Friday, 10 June 2016

Data Modelling in Impala - Cloudera aka BIG DATA

Data Modelling in Impala - Cloudera aka BIG DATA - STAR Schema / ER Schema - Using the Parquet Format with Snappy / GZIP Compression

Background for the research:

Being a SQL fanatic from the relational RDBMS world, I was curious to explore how data modelling [table creation] is done in the big data world. The model can follow any schema, be it ER, Star or Data Vault. At the end of the day I see these schemas as plain tables with a few extra nuts and bolts: relationship keys and some organisation of the data among them.

The goal was to create a table structure inside Impala and load it with a good amount of data. Later on, the same structure can be re-implemented as a Star schema, ER model or Data Vault for the bigger picture and a real use case.

I planned to load 46 million records into my Impala table and analyse query performance with SUM and AVG functions over the entire dataset. The query performance results are published in this post.

I selected Impala because the theory says it can handle bulk volumes of data with an efficient query engine, and that it can serve cube analytics / BI reporting needs over enormous data volumes without launching any MapReduce jobs.

By the end of this lengthy post you will be able to figure out how things work inside Impala. I have tried my best to organise the topics to cover the important areas below.


  1. Impala background & theory
  2. Impala environment recommendations
  3. DDL for internal & external tables
  4. Parquet & CSV file format based DDL
  5. Gzip & Snappy codecs for Parquet
  6. HDFS file ingestion - loading sample data into Impala (46 million rows)
  7. Querying Impala
  8. Query performance analysis
  9. Indyco - a data modeller for Impala



Disclaimer:

The information presented here was put together after a good deal of research and reading from many different sources, to help the open-source community gain knowledge and share thoughts. I take no responsibility for any inaccurate information conveyed; I am human and make mistakes in understanding theory, but everything I produced here works at least at an experimental level. Some software download or sample data links may stop working over time, and I take no responsibility for keeping this post up to date. So enjoy reading and spreading the knowledge as it is; you are free to reproduce any content published here.



Prerequisites & recommended knowledge:
  1. SQL DDL knowledge
  2. Linux knowledge
  3. General exposure to HDFS & the Hadoop ecosystem

My Environment Setup

  • Laptop RAM - 16 GB (this really is needed: 10 GB is the minimum required for the Cloudera Enterprise sandbox, and only the Enterprise edition ships with Impala)
  • VM - Oracle VirtualBox
  • Cloudera Enterprise Sandbox - cloudera-quick-start-vm-5.7.0-0-virtualbox
  • Processor - 2 CPU cores dedicated to the VM





Bingo, you should be all set to run the Cloudera sandbox without any hurdles!






Launching Impala 

Impala can be launched and used via the HUE web interface. It has to be said that Cloudera has made life easy for us here: just launch Firefox and you will find the browser already bookmarked with all the necessary URLs to get started with the various big data components. Impala can also be used from the terminal: just open a terminal and run impala-shell.
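
Once the shell is up, a couple of statements are enough to confirm the daemon is reachable. A minimal sketch (run inside impala-shell; the databases and tables you see will be whatever already exists in your sandbox):

SHOW DATABASES;   -- list the databases Impala can see
USE default;      -- switch to the default database
SHOW TABLES;      -- list the tables registered in the metastore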

Impala Introduction & some theory:

Impala is an MPP SQL engine running inside the Cloudera big data ecosystem. It supports native ANSI SQL-92 (compatible with Hive SQL), see the Impala SQL Reference, so happy days for SQL folks. But be aware that Impala has been designed for efficient query retrieval. Hence it supports read-oriented SQL such as SELECT, ORDER BY, GROUP BY, joins, views and aggregate functions.

Impala does not support operations like DELETE, UPDATE, foreign keys, constraints, indexes or rollback. Hence its OLTP scope is limited.
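
To give a feel for that split, here is a small hedged sketch using the uk_regions table created in the next section: the first query is exactly the read-heavy style Impala is built for, while the commented-out statements are the OLTP-style mutations it does not accept.

-- read-oriented SQL of this shape is Impala's comfort zone
SELECT abbr, COUNT(*) AS cnt
FROM uk_regions
GROUP BY abbr
ORDER BY cnt DESC;

-- not supported: OLTP-style mutation
-- UPDATE uk_regions SET abbr = 'SHEF' WHERE name = 'Sheffield';
-- DELETE FROM uk_regions WHERE name = 'London';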


Internal table creation without file formats :

By default, a CREATE TABLE statement creates an internal table under the HDFS location /user/hive/warehouse/uk_regions:

CREATE TABLE uk_regions (name STRING, abbr STRING); 
INSERT INTO uk_regions VALUES ("Sheffield" ,"SHF"), ("Leeds","LDS"), ("London","LDN");


system logs :


[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/uk_regions
Found 2 items
-rw-r--r-- 1 impala supergroup 210 2016-06-01 12:54 /user/hive/warehouse/uk_regions/2345e31cd0a1c8d2-3c2fbcda5d8d159d_1304749469_data.0.
drwxrwxrwx - impala supergroup 0 2016-06-01 12:54 /user/hive/warehouse/uk_regions/_impala_insert_staging

During the table creation I did not explicitly specify any storage format, so the data was written in Impala's default text format to a data file with a generated name, 2345e31cd0a1c8d2-3c2fbcda5d8d159d_1304749469_data.0. The same table can be written in MySQL as follows; this is given just to show the difference in DDL.

CREATE TABLE `uk_regions`( `name` varchar(20) NOT NULL, `abbr` varchar(20) NOT NULL , PRIMARY KEY (`name`)) ENGINE=InnoDB DEFAULT CHARSET=utf8;
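
If you want Impala itself to confirm the location and storage format it picked, its catalog statements will report them; a minimal sketch:

-- column types plus location, file format and SerDe details
DESCRIBE FORMATTED uk_regions;

-- the full DDL Impala holds for the table
SHOW CREATE TABLE uk_regions;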

Internal table creation with multiple file formats:
Impala supports the text, RC, sequence, Parquet and Avro file formats together with their appropriate compression codecs. File formats help Impala store and retrieve data from HDFS efficiently, using either columnar or row-based ordering. The Cloudera documentation includes a table that explains each format and its supported CREATE and INSERT operations in more depth; a short sketch of the different STORED AS clauses follows below.

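As a rough sketch of how the format choice shows up in the DDL (the table and column names here are made up purely for illustration, and note that for some formats, such as Avro, RCFile and SequenceFile, Impala can query the data but the loading is normally done through Hive):

-- text is the default, so no STORED AS clause is needed
CREATE TABLE demo_text    (id INT, name STRING);

-- explicit format choices
CREATE TABLE demo_seq     (id INT, name STRING) STORED AS SEQUENCEFILE;
CREATE TABLE demo_rc      (id INT, name STRING) STORED AS RCFILE;
CREATE TABLE demo_parquet (id INT, name STRING) STORED AS PARQUET;
-- recent Impala versions can derive the Avro schema from the column list;
-- older ones need TBLPROPERTIES with an explicit avro.schema
CREATE TABLE demo_avro    (id INT, name STRING) STORED AS AVRO;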



Parquet File format : 

Impala also helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is especially good for queries scanning particular columns within a table, for example to query "wide" tables with many columns, or to perform aggregation operations such as SUM() and AVG() that need to process most or all of the values from a column. Each data file contains the values for a set of rows (the "row group"). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column. Queries against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.
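
As a concrete illustration of the kind of query Parquet favours, here is a hedged sketch that touches only two of the nineteen columns of the log table built later in this post, so only those two column chunks need to be read from disk:

-- scans just ipLayerProtocolCode and durationSecond out of a wide table
SELECT ipLayerProtocolCode,
       AVG(durationSecond) AS avg_duration_s
FROM log_parquet_format
GROUP BY ipLayerProtocolCode;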

Internal table creation in parquet format:

CREATE TABLE parquet_uk_regions (name STRING, abbr STRING) STORED AS PARQUET;
INSERT INTO parquet_uk_regions VALUES ("Sheffield" ,"SHF"), ("Leeds","LDS"), ("London","LDN");

system logs :



[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/parquet_uk_regions/
Found 2 items
drwxrwxrwx - impala supergroup 0 2016-06-01 15:05 /user/hive/warehouse/parquet_uk_regions/_impala_insert_staging
-rw-r--r-- 1 impala supergroup 379 2016-06-01 15:05 /user/hive/warehouse/parquet_uk_regions/c044e267795afc59-89f2340de627089b_1709498250_data.0.parq

Here the data is stored as a Parquet file: c044e267795afc59-89f2340de627089b_1709498250_data.0.parq

External table:
Impala also lets you create external tables. The main difference between internal and external tables is that an external table lets the end user keep the data anywhere in the HDFS directory structure. To make this clearer: an internal table creates and holds its data only under the location explained earlier,

/user/hive/warehouse/parquet_uk_regions/

which means all table content sits under /user/hive/warehouse, with parquet_uk_regions being the table name. An external table, on the other hand, lets the user create folders anywhere in Hadoop as required, for example

/cheran/CSV

Here CSV is my table name and all the data belonging to that table is placed underneath it. Another major advantage of external tables is that when you drop them from impala-shell or HUE, the data does not get deleted from HDFS. Before creating external tables we need to create the directory and set the necessary permissions so that Impala can run queries and other operations against it in HDFS.




  • hadoop fs -mkdir /user/cheran
  • hadoop fs -mkdir /user/cheran/csv
  • hadoop fs -chmod 777 /user/cheran/csv
  • sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/cheran/csv
  • Please note: without running the above commands Impala will throw a file permission exception when you execute any query on top of this directory.

    External table creation in CSV format: The reason for creating a CSV-based DDL is that our downloaded sample data is in CSV format. This DDL can store and read CSV files from HDFS.

     create external table if not exists logs (
    TimeSeconds double,
    parsedDate timestamp,
    dateTimeStr string,
    ipLayerProtocol int,
    ipLayerProtocolCode string,
    firstSeenSrcIp string,
    firstSeenDestIp string,
    firstSeenSrcPort int,
    firstSeenDestPor int,
    moreFragment int,
    contFragment int,
    durationSecond int,
    firstSeenSrcPayloadByte bigint,
    firstSeenDestPayloadByte bigint,
    firstSeenSrcTotalByte bigint,
    firstSeenDestTotalByte bigint,
    firstSeenSrcPacketCoun int,
    firstSeenDestPacketCoun int,
    recordForceOut int)
    row format delimited fields terminated by ',' lines terminated by '\n'
    location '/user/cheran/csv/';

    table & sample data referenced from   : http://raffy.ch/

    Now we have a table, so as the next step let's download some sample data that matches the above DDL. The public dataset I used here has 46 million records, so it is a worthy test of how Cloudera handles this volume against the clock.

    Steps to get the sample data and move to HDFS 
    • wget -O /home/cloudera/Downloads/VAST2013MC3_NetworkFlow.zip http://www.vacommunity.org/dl388
    This is a roughly 500 MB download and it holds the 46 million records in 3 zip files. Unzip it to a local location:

    • unzip /home/cloudera/Downloads/VAST2013MC3_NetworkFlow.zip -d /home/cloudera/Downloads/

    When extracted, the 500 MB download occupied 5.6 GB of my disk space.

    • hdfs dfs -put /home/cloudera/Downloads/nf/nf-chunk*.csv /user/cheran/csv/

    That moves all the CSV files to the external table location in HDFS. That is all you need to do; you can start querying the data from impala-shell. Below are my query response times against the 46 million records.

    CSV tables Query performance 

    [quickstart.cloudera:21000] > select count(*)  from logs;
    Query: select count(*)  from logs
    +----------+
    | count(*) |
    +----------+
    | 46138313 |
    +----------+
    Fetched 1 row(s) in 19.8

    Query: select avg(durationsecond) from logs
    +---------------------+
    | avg(durationsecond) |
    +---------------------+
    | 19.98479690738564   |
    +---------------------+
    Fetched 1 row(s) in 23.3s

    CSV tables file storage inside HDFS


    When I queried the HDFS file system:

    [cloudera@quickstart Downloads]$ hadoop fs -du -s -h /user/cheran/csv/
    5.4 G  5.4 G  /user/cheran/csv


    I can see the CSV files occupy 5.4 GB to hold the 46 million records. Please note that Impala can give better query response times if all these CSV files are converted to the Parquet format, which is what the next sections do.


    External table creation in Parquet format 

    Just before doing this, let's set up the required folders & permissions in HDFS:



  • hadoop fs -mkdir /user/cheran
  • hadoop fs -mkdir /user/cheran/log_parquet_format
  • hadoop fs -chmod 777 /user/cheran/log_parquet_format
  • sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/cheran/log_parquet_format

  • Please note: without running the above commands Impala will throw a file permission exception when you execute any query on top of this directory.

    External table creation in Parquet format: this DDL will store and read Parquet files.

    CREATE EXTERNAL TABLE IF NOT EXISTS log_parquet_format (
                    TimeSeconds double,
                    parsedDate timestamp,
                    dateTimeStr string,
                    ipLayerProtocol int,
                    ipLayerProtocolCode string,
                    firstSeenSrcIp string,
                    firstSeenDestIp string,
                    firstSeenSrcPort int,
                    firstSeenDestPor int,
                    moreFragment int,
                    contFragment int,
                    durationSecond int,
                    firstSeenSrcPayloadByte bigint,
                    firstSeenDestPayloadByte bigint,
                    firstSeenSrcTotalByte bigint,
                    firstSeenDestTotalByte bigint,
                    firstSeenSrcPacketCoun int,
                    firstSeenDestPacketCoun int,
                    recordForceOut int)
    STORED AS PARQUETFILE
    LOCATION '/user/cheran/log_parquet_format/';

    Converting the existing CSV table data into Parquet format:

    The default compression codec for the Parquet format is Snappy; this can be changed to Gzip as required.

    set compression_codec=snappy;
    insert overwrite table log_parquet_format select * from logs;

    or

    set compression_codec=gzip;
    insert overwrite table log_parquet_format select * from logs;

    This stores the Parquet files in compressed form using gzip or snappy. Cloudera mostly recommends snappy because of its fast decompression at query time. I tried both snappy and gzip (keeping the gzip-compressed copy under a separate HDFS location) to see how they differ in the storage space they occupy; a couple of statements that make the comparison easy to check from inside Impala are sketched below.
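
    Once the data has been loaded, Impala itself can report the table's on-disk size and file format; a minimal sketch:

    -- gather table and column statistics (this also helps the planner with later queries)
    COMPUTE STATS log_parquet_format;

    -- report #rows, #files, total size and file format for the table
    SHOW TABLE STATS log_parquet_format;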



    Parquet tables Query performance 

    select sum(durationsecond) from log_parquet_format;
    +---------------------+
    | sum(durationsecond) |
    +---------------------+
    | 922064755           |
    +---------------------+
    Fetched 1 row(s) in 1.85s


    [quickstart.cloudera:21000] > select avg(durationsecond) from log_parquet_format;
    Query: select avg(durationsecond) from log_parquet_format
    +---------------------+
    | avg(durationsecond) |
    +---------------------+
    | 19.98479690738564   |
    +---------------------+

    Fetched 1 row(s) in 1.95s


    Parquet tables file storage inside HDFS


    [cloudera@quickstart Downloads]$   hadoop fs -du -s -h /user/cheran/gzip/
    448.4 M  448.4 M  /user/cheran/gzip


    [cloudera@quickstart Downloads]$   hadoop fs -du -s -h /user/cheran/log_parquet_format/
    798.6 M  798.6 M  /user/cheran/log_parquet_format

    So the gzip-compressed copy occupies about 448 MB and the snappy-compressed log_parquet_format table about 799 MB, both a big saving over the 5.4 GB of raw CSV.



    Getting Started with Indyco:


    I have tried the Indyco tool and it has some cool features. Indyco has strong data governance & modelling capabilities. It lets you create physical, logical and conceptual data models and deploy them to Impala through a JDBC driver. It also supports Parquet-based DDL generation. So give it a try and let me know your thoughts. You may not find a straight open download link for this tool; you have to request an evaluation download link from Indyco.





    Conclusion

    We have discussed at length how to create tables in Impala using DDL, and I have also introduced you to Indyco. Now it is down to your requirements & hard work to model an ER / STAR / Data Vault schema.


    Let me know your thoughts and the challenges you face while doing data modelling. Best of luck with all your Impala adventures!


