Friday 10 June 2016

Data Modelling in Impala - Cloudera aka BIG DATA

Data Modelling in Impala - Cloudera aka BIG DATA - STAR Schema / ER Schema - Using Parquet Format with Snappy / GZIP Compression

Background for the research:

Being a SQL fanatic from the relational RDBMS world, I was curious to explore how data modelling
[table creation] is done in the big data world. The data model can be of any schema, like ER, star, or Data Vault. At the end of the day I see these schemas as plain tables with some extra bits and bolts: relationship keys and the data organisation among them.

The goal was to create some table structures inside Impala with a good amount of data. Later this can be re-implemented as a star schema / ER / Data Vault model for a bigger picture and use case.

I planned to load 46 million records into my Impala table and analyse query performance with SUM and AVG functions over the entire dataset. The query performance results are published in this post.

I selected Impala because the theory says it can handle bulk volumes of data with an efficient query mechanism, and that it can serve Cube Analytics / BI Reporting needs over enormous data volumes without any MapReduce jobs.

By the end of this lengthy post you will be able to figure out how things work inside Impala. I have tried my best to organise the topics to cover the important areas below.


  1. Impala background & theory
  2. Impala environment recommendations
  3. DDL for internal & external tables
  4. Parquet & CSV file format based DDL
  5. Gzip & Snappy codecs for Parquet
  6. HDFS file ingestion - ingesting sample data into Impala - 46 million rows
  7. Querying Impala
  8. Query performance analysis
  9. Indyco - a data modeller for Impala



Disclaimer:

Whatever information is presented here was produced after tons of research and reading from many different sources, to help the open-source community gain knowledge and share thoughts. I take no responsibility for any inaccurate information conveyed; I am human, I make mistakes in understanding theory, but in the end whatever I produced here works up to some experimental level. Some software download or sample data links may stop working over time, and I take no responsibility for updating this post. So enjoy reading and spreading the knowledge as it is; you are free to reproduce any content published here.



Prerequisite & knowledge recommendation:
  1. SQL DDL knowledge
  2. Linux knowledge
  3. General exposure to HDFS & the Hadoop ecosystem

My Environment Setup

  • Laptop RAM - 16 GB (this is very much needed, as 10 GB is the minimum required for the Cloudera Enterprise sandbox, and only the enterprise edition has Impala on it)
  • VM - Oracle VirtualBox
  • Cloudera Enterprise sandbox - cloudera-quickstart-vm-5.7.0-0-virtualbox
  • Processor - 2 CPU cores dedicated to the VM





Bingo, you should be all set to run the Cloudera sandbox without any hurdles!






Launching Impala 

Impala can be launched and queried via the Hue web interface. It must be said that Cloudera has made life easy for us to explore the various big data components: just launch Firefox, and the browser comes bookmarked with all the necessary URLs to get started. Impala can also be used from the terminal: just open a terminal and run impala-shell.

Impala Introduction & some theory:

Impala is an MPP SQL engine running inside the Cloudera big data ecosystem. It supports native ANSI-92 SQL (compatible with Hive SQL) - refer to the Impala SQL Reference - so happy days, SQL folks. But be aware that Impala has been designed for efficient query retrieval. Hence it supports read-oriented DML such as SELECT, ORDER BY, GROUP BY, joins, views and aggregate functions.

Impala does not support operations like DELETE, UPDATE, foreign keys, constraints, indexes or rollback. Hence the OLTP scope is limited.
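
To make that supported surface concrete, here is a minimal sketch of the kind of read-only SQL Impala is happy with. The sales and regions tables are hypothetical names invented purely for this illustration.

-- Hypothetical tables, for illustration only
CREATE TABLE sales (region STRING, amount BIGINT);
CREATE TABLE regions (region STRING, country STRING);

-- Aggregation, grouping, ordering and joins are all supported
SELECT r.country, SUM(s.amount) AS total_sales
FROM sales s
JOIN regions r ON s.region = r.region
GROUP BY r.country
ORDER BY total_sales DESC;

-- Views are supported as well
CREATE VIEW top_regions AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

-- UPDATE / DELETE statements are not available, so a statement like
-- DELETE FROM sales WHERE region = 'LDN'; would simply fail.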


Internal table creation without file formats:

By default a CREATE TABLE statement creates an internal table under the HDFS warehouse location, in this case /user/hive/warehouse/uk_regions

CREATE TABLE uk_regions (name STRING, abbr STRING); 
INSERT INTO uk_regions VALUES ("Sheffield" ,"SHF"), ("Leeds","LDS"), ("London","LDN");


System logs:


[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/uk_regions
Found 2 items
-rw-r--r-- 1 impala supergroup 210 2016-06-01 12:54 /user/hive/warehouse/uk_regions/2345e31cd0a1c8d2-3c2fbcda5d8d159d_1304749469_data.0.
drwxrwxrwx - impala supergroup 0 2016-06-01 12:54 /user/hive/warehouse/uk_regions/_impala_insert_staging

During the table creation I did not explicitly mention any storage format, so the contents were written in the default (text) storage format as the data file 2345e31cd0a1c8d2-3c2fbcda5d8d159d_1304749469_data.0. The same table can be written in MySQL as follows; this is given just to illustrate the difference in DDL.

CREATE TABLE `uk_regions`( `name` varchar(20) NOT NULL, `abbr` varchar(20) NOT NULL , PRIMARY KEY (`name`)) ENGINE=InnoDB DEFAULT CHARSET=utf8;
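
Back on the Impala side, if you want to confirm where the table landed and which storage format it picked up, one quick way (shown here as a sketch, not the only option) is to ask Impala for the table metadata:

-- Shows the HDFS location, input/output format and other table metadata
DESCRIBE FORMATTED uk_regions;

-- Shows per-table statistics, including file count, size and file format
SHOW TABLE STATS uk_regions;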

Internal table creation with multiple file formats:
Impala supports the text, RCFile, SequenceFile, Parquet and Avro file formats with their appropriate compression codecs. File formats help Impala store and retrieve data from HDFS efficiently, using either columnar or row-based ordering. The file-format table in the Cloudera documentation helps to understand in more depth the different formats and their use cases for INSERT and CREATE statements.
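
As a quick illustration of how the file format is chosen at table-creation time, here is a minimal sketch; the demo_* table names are made up for this example, and note that Impala of this era can create all of these formats but can only INSERT into text and Parquet tables (the others are loaded via Hive and queried from Impala).

-- Plain text is the default when no STORED AS clause is given
CREATE TABLE demo_text    (id INT, name STRING);
CREATE TABLE demo_seq     (id INT, name STRING) STORED AS SEQUENCEFILE;
CREATE TABLE demo_rc      (id INT, name STRING) STORED AS RCFILE;
CREATE TABLE demo_avro    (id INT, name STRING) STORED AS AVRO;
CREATE TABLE demo_parquet (id INT, name STRING) STORED AS PARQUET;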



Parquet File format : 

Impala also helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is especially good for queries scanning particular columns within a table, for example to query "wide" tables with many columns, or to perform aggregation operations such as SUM() and AVG() that need to process most or all of the values from a column. Each data file contains the values for a set of rows (the "row group"). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column. Queries against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.
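
For example, a query like the sketch below, run against the 46-million-row Parquet table built later in this post, only has to read the column chunks for durationSecond from each row group rather than whole rows, which is exactly where the Parquet format pays off.

-- Only the durationSecond column chunks are read from each row group;
-- all other columns in the table are skipped entirely.
SELECT SUM(durationSecond), AVG(durationSecond)
FROM log_parquet_format;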

Internal table creation in parquet format:

CREATE TABLE parquet_uk_regions (name STRING, abbr STRING) STORED AS PARQUET ;
INSERT INTO parquet_uk_regions VALUES ("Sheffield" ,"SHF"), ("Leeds","LDS"), ("London","LDN");

System logs:



[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/parquet_uk_regions/
Found 2 items
drwxrwxrwx - impala supergroup 0 2016-06-01 15:05 /user/hive/warehouse/parquet_uk_regions/_impala_insert_staging
-rw-r--r-- 1 impala supergroup 379 2016-06-01 15:05 /user/hive/warehouse/parquet_uk_regions/c044e267795afc59-89f2340de627089b_1709498250_data.0.parq

Here the file contents get stored as a Parquet file: c044e267795afc59-89f2340de627089b_1709498250_data.0.parq

External table:
Impala also lets you create external tables. The main difference between internal and external tables is that an external table lets the end user hold the data anywhere in the HDFS directory structure. To make this clearer, internal tables create and hold their data only in the location explained earlier:



/user/hive/warehouse/parquet_uk_regions/

So all internal table content is placed under /user/hive/warehouse, where parquet_uk_regions is the table name. An external table, on the other hand, lets the user create folders anywhere in Hadoop as per his requirement, as in the example below:

/cheran/CSV

Here CSV is my table name and all the data related to the CSV table is placed under that directory. Another major advantage of external tables is that when you drop them in impala-shell / Hue, the data does not get deleted from HDFS. Before creating external tables we need to create the directory and set the necessary permissions for Impala to execute any sort of query or operation against HDFS.




  • hadoop fs -mkdir /user/cheran
  • hadoop fs -mkdir /user/cheran/csv
  • hadoop fs -chmod 777 /user/cheran/csv
  • sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/cheran/csv
  • Please note: without executing the above commands, Impala will throw a file permission exception when you execute any queries on top of this directory

    External table creation in CSV format: the reason to create a CSV-based DDL is that our downloaded sample data is in CSV format. This DDL is capable of storing / reading CSV files from HDFS.

     create external table if not exists logs (
    TimeSeconds double,
    parsedDate timestamp,
    dateTimeStr string,
    ipLayerProtocol int,
    ipLayerProtocolCode string,
    firstSeenSrcIp string,
    firstSeenDestIp string,
    firstSeenSrcPort int,
    firstSeenDestPor int,
    moreFragment int,
    contFragment int,
    durationSecond int,
    firstSeenSrcPayloadByte bigint,
    firstSeenDestPayloadByte bigint,
    firstSeenSrcTotalByte bigint,
    firstSeenDestTotalByte bigint,
    firstSeenSrcPacketCoun int,
    firstSeenDestPacketCoun int,
    recordForceOut int)
    row format delimited fields terminated by ',' lines terminated by '\n'
    location '/user/cheran/csv/';

    Table & sample data referenced from: http://raffy.ch/

    Now we have tables, and as the next step let's download some sample data which matches the above DDL. The public dataset used here has 46 million records, so it is worth checking how Cloudera handles this volume against time.

    Steps to get the sample data and move it to HDFS
    • wget -P /home/cloudera/Downloads/ http://www.vacommunity.org/dl388
    This is a roughly 500 MB download and it holds 46 million records split across 3 files. Unzip it to a local file location.

    • unzip /home/cloudera/Downloads/VAST2013MC3_NetworkFlow.zip -d /home/cloudera/Downloads/

    When extracted, the 500 MB download occupied 5.6 GB of my disk space.

    • hdfs dfs -put /home/cloudera/Downloads/nf/nf-chunk*.csv /user/cheran/csv/

    This moves all the CSV files to the external table location in HDFS. That is all you need to do; you can start querying the data from impala-shell. Below are my query response times when handling 46 million records.
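
    If freshly uploaded files do not show up in query results straight away, Impala has to be told that data was added to the table's directory behind its back. A quick way to do that (a general Impala option, not something specific to this dataset) is:

    -- Re-scan the table's HDFS directory for new data files
    refresh logs;
    -- Or, more heavy-handed: rebuild all cached metadata for the table
    -- invalidate metadata logs;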

    CSV tables Query performance 

    [quickstart.cloudera:21000] > select count(*)  from logs;
    Query: select count(*)  from logs
    +----------+
    | count(*) |
    +----------+
    | 46138313 |
    +----------+
    Fetched 1 row(s) in 19.8s

    Query: select avg(durationsecond) from logs
    +---------------------+
    | avg(durationsecond) |
    +---------------------+
    | 19.98479690738564   |
    +---------------------+
    Fetched 1 row(s) in 23.3s

    CSV tables file storage inside HDFS


    When I queried the HDFS file system:

    [cloudera@quickstart Downloads]$ hadoop fs -du -s -h /user/cheran/csv/
    5.4 G  5.4 G  /user/cheran/csv


    I can see that all the CSV files occupy 5.4 GB to hold the 46 million records. Please note that Impala can give better query response times if all these CSV files are converted to Parquet format.


    External table creation in Parquet format 

    Just before doing this, let's set up the required folders & permissions in HDFS.



  • hadoop fs -mkdir /user/cheran
  • hadoop fs -mkdir /user/cheran/log_parquet_format
  • hadoop fs -chmod 777 /user/cheran/log_parquet_format
  • sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/cheran/log_parquet_format

  • Please note: without executing the above commands, Impala will throw a file permission exception when you execute any queries on top of this directory

    External table creation in Parquet format: this DDL will store / read Parquet files.

    CREATE EXTERNAL TABLE IF NOT EXISTS log_parquet_format (
                    TimeSeconds double,
                    parsedDate timestamp,
                    dateTimeStr string,
                    ipLayerProtocol int,
                    ipLayerProtocolCode string,
                    firstSeenSrcIp string,
                    firstSeenDestIp string,
                    firstSeenSrcPort int,
                    firstSeenDestPor int,
                    moreFragment int,
                    contFragment int,
                    durationSecond int,
                    firstSeenSrcPayloadByte bigint,
                    firstSeenDestPayloadByte bigint,
                    firstSeenSrcTotalByte bigint,
                    firstSeenDestTotalByte bigint,
                    firstSeenSrcPacketCoun int,
                    firstSeenDestPacketCoun int,
                    recordForceOut int)
    STORED AS PARQUETFILE
    LOCATION '/user/cheran/log_parquet_format/';

    Converting all the existing CSV table data into Parquet format:

    The default compression codec for the Parquet format is Snappy; this can be changed to Gzip as required.

    set compression_codec=snappy;
    insert overwrite table log_parquet_format select * from logs;

    or

    set compression_codec=gzip;
    insert overwrite table log_parquet_format select * from logs;

    This stores the Parquet files in compressed form using Gzip or Snappy compression. Cloudera mostly recommends Snappy because of its query retrieval speed. I tried both Snappy and Gzip to see how they differ in terms of storage space.
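
    If you want to keep both copies side by side (the listing further below measures a separate /user/cheran/gzip directory), you need a second Parquet table pointing at that location. The walkthrough does not show that DDL, so here is a minimal sketch; the table name log_parquet_gzip is an assumption made for this example.

    -- Hypothetical second table used only to hold the Gzip-compressed copy
    CREATE EXTERNAL TABLE IF NOT EXISTS log_parquet_gzip
    LIKE log_parquet_format
    STORED AS PARQUET
    LOCATION '/user/cheran/gzip/';

    set compression_codec=gzip;
    insert overwrite table log_parquet_gzip select * from logs;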



    Parquet tables Query performance 

    select sum(durationsecond) from log_parquet_format;
    +---------------------+
    | sum(durationsecond) |
    +---------------------+
    | 922064755           |
    +---------------------+
    Fetched 1 row(s) in 1.85s


    [quickstart.cloudera:21000] > select avg(durationsecond) from log_parquet_format;
    Query: select avg(durationsecond) from log_parquet_format
    +---------------------+
    | avg(durationsecond) |
    +---------------------+
    | 19.98479690738564   |
    +---------------------+

    Fetched 1 row(s) in 1.95s


    Parquet tables file storage inside HDFS

      
    [cloudera@quickstart Downloads]$ hadoop fs -du -s -h /user/cheran/gzip/
    448.4 M  448.4 M  /user/cheran/gzip


    [cloudera@quickstart Downloads]$ hadoop fs -du -s -h /user/cheran/log_parquet_format/
    798.6 M  798.6 M  /user/cheran/log_parquet_format

    So the Gzip-compressed copy comes in at roughly 448 MB against roughly 799 MB for the Snappy-compressed table, both a large saving over the 5.4 GB of raw CSV; Gzip squeezes harder, while Snappy remains the codec Cloudera recommends for query speed.



    Getting Started with Indyco:


    I have tried the Indyco tool and it has some cool features. Indyco has efficient data governance & modelling capabilities. It lets you create physical / logical / conceptual data models and deploy them to Impala through a JDBC driver. It also supports Parquet-based DDL creation. So give it a try and let me know your thoughts. You may not get this tool via a straight open download link; you have to request a download link from Indyco for evaluation.





    Conclusion

    So we have discussed elaborately how to create tables using DDL, and I have also introduced you to Indyco. Now it is down to your requirements & hard work to model an ER / star / Data Vault schema.
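
    As a parting illustration of where this leads, here is a minimal sketch of what a small star schema could look like in Impala using Parquet tables. The fact and dimension tables (fact_flows, dim_protocol, dim_date) are hypothetical names invented for this example, not part of the walkthrough above.

    -- Hypothetical dimension tables
    CREATE TABLE dim_protocol (protocol_key INT, protocol_code STRING) STORED AS PARQUET;
    CREATE TABLE dim_date     (date_key INT, full_date TIMESTAMP, day_of_week STRING) STORED AS PARQUET;

    -- Hypothetical fact table referencing the dimensions by surrogate keys
    -- (remember: Impala does not enforce foreign keys, they are a modelling convention only)
    CREATE TABLE fact_flows (
        date_key INT,
        protocol_key INT,
        duration_second INT,
        src_total_byte BIGINT,
        dest_total_byte BIGINT)
    STORED AS PARQUET;

    -- A typical star-join / aggregation query
    SELECT d.day_of_week, p.protocol_code, SUM(f.src_total_byte) AS bytes_sent
    FROM fact_flows f
    JOIN dim_date d     ON f.date_key = d.date_key
    JOIN dim_protocol p ON f.protocol_key = p.protocol_key
    GROUP BY d.day_of_week, p.protocol_code
    ORDER BY bytes_sent DESC;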


    Let me know your thoughts and the challenges you face while doing data modelling. Best of luck with all your Impala adventures!


