In this post I will share my experience with an Apache Hadoop component called Hive, which enables you to run SQL on an Apache Hadoop Big Data cluster.
Being a great fan of SQL and relational databases, this was my opportunity to set up a mechanism where I could transfer some (a lot of) data from a relational database into Hadoop and query it with SQL. Not a very difficult thing to do these days; in fact it is very easy with Apache Hive!
All you need is access to a Hadoop cluster with the Hive module installed on it. You can provision a Hadoop cluster yourself by downloading it and installing it in pseudo-distributed mode on your own PC, or you can run one in the cloud with Amazon AWS EMR in a pay-as-you-go fashion.
There are many ways of doing this; just Google it and you will be surprised how easy it is. It is easier than it sounds. Below are the steps for installing it on your own PC (Linux). Just download and install Apache Hadoop and Hive from the Apache Hadoop Downloads page.
You will need to download and install three things from the above link; a rough sketch of the download and unpack steps follows the list.
- Hadoop (HDFS and Big Data Framework, the cluster)
- Hive (data warehouse module)
- Sqoop (data importer)
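To give an idea of how little is involved, here is a minimal sketch of the download and unpack steps for Hadoop and Hive on a Linux PC. The version numbers, the mirror URL and the home-directory paths are only placeholders for illustration; use whatever the downloads page currently offers. Sqoop is the same idea: download, unpack, set SQOOP_HOME.
$ # fetch and unpack the tarballs (versions and URLs are illustrative)
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
$ wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
$ tar -xzf hadoop-3.3.6.tar.gz
$ tar -xzf apache-hive-3.1.3-bin.tar.gz
$ # point your environment at the unpacked directories
$ export HADOOP_HOME=$HOME/hadoop-3.3.6
$ export HIVE_HOME=$HOME/apache-hive-3.1.3-bin
$ export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin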
You will also need to put the JDBC connector of the database (Oracle, MySQL...) you want to extract data from into the lib folder of your Sqoop installation. For example, the MySQL JDBC connector can be downloaded from here. Don't expect loads of tinkering when installing Apache Hadoop, Hive or Sqoop: it is just unzipping binary tarballs and a few line changes in some config files, that's all. It is not a big deal, and it is free. There are tons of tutorials on the internet on this; here is one I used from another blogger, bogotobogo.
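As a concrete example, dropping the MySQL connector into Sqoop's lib folder looks roughly like the following; the jar file name (driver version) and the SQOOP_HOME variable are assumptions for illustration:
$ # copy the MySQL JDBC driver into Sqoop's lib directory so Sqoop can talk to MySQL
$ cp mysql-connector-java-5.1.49-bin.jar $SQOOP_HOME/lib/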
What is Hive?
Hive is Big Data SQL, the Data Warehouse in Hadoop. You can create tables, indexes, partitioned tables, external tables and views, like in a relational database Data Warehouse. You can run SQL to do joins and to query the Hive tables in parallel using the MapReduce framework. It is actually quite fun to see your SQL queries translating to MapReduce jobs and running in parallel, like the parallel SQL queries we run on Oracle EE Data Warehouses and other databases. :0) The syntax looks very much like MySQL's SQL syntax.
Hive is NOT an OLTP transactional database: it does not have the INSERT, UPDATE, DELETE transactions of OLTP and it doesn't conform to ANSI SQL or the ACID properties of transactions.
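To give a flavour of the SQL, here is a small HiveQL sketch typed at the hive prompt (how to get to the prompt is shown further down). The table and column names are made up for illustration; the point is how MySQL-like it feels, and how a partitioned table and a GROUP BY query look:
hive> CREATE TABLE page_views (ip STRING, url STRING, hits INT)
    >   PARTITIONED BY (view_date STRING);
hive> SELECT url, SUM(hits)
    >   FROM page_views
    >   WHERE view_date = '2015-01-01'
    >   GROUP BY url;
The SELECT above is the kind of statement Hive compiles into MapReduce jobs behind the scenes.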
Direct insert into Hive with Apache Sqoop:
After you have installed Hadoop, have Hive set up and are able to log in to it, you can use Sqoop - the data importer of Hadoop - as in the following command to directly import a table from MySQL via JDBC into Hive using MapReduce.
$ sqoop import --connect jdbc:mysql://mydatabasename --username kubilay -P --table mytablename --hive-import --hive-drop-import-delims --hive-database dbadb --num-mappers 16 --split-by id
Sqoop import options explained:
- -P will ask for the password
- --hive-import makes Sqoop import the data straight into a Hive table, which it creates for you
- --hive-drop-import-delims Drops \n, \r, and \01 from string fields when importing to Hive.
- --hive-database tells Sqoop which Hive database to import into; otherwise the table goes to the default database.
- --num-mappers the number of parallel map tasks to run, like parallel processes / threads in SQL
- --split-by the column of the table used to split the work units, like a partitioning key in database partitioning.
The above command will import any MySQL table you give in place of mytablename, from the MySQL database you specify, into Hive using MapReduce.
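Once Sqoop finishes, a quick sanity check is to count the rows on the Hive side and compare with MySQL. This is just a sketch reusing the placeholder database and table names from the command above:
$ # row count in Hive (runs as a MapReduce job)
$ hive -e 'SELECT COUNT(*) FROM dbadb.mytablename'
$ # compare with the row count in the source MySQL database
$ mysql -u kubilay -p -e 'SELECT COUNT(*) FROM mytablename' mydatabasename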
Once you have imported the table you can log in to Hive and run SQL against it like in any relational database. On a properly configured system you can log in to Hive just by calling hive from the command line, like this:
$ hive
hive>
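Once you are at the prompt, a few statements like the ones below are enough to start poking around; the database and table names are the ones from the Sqoop example and are purely illustrative:
hive> SHOW DATABASES;
hive> USE dbadb;
hive> SHOW TABLES;
hive> DESCRIBE mytablename;
hive> SELECT * FROM mytablename LIMIT 10;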
More commands:
A couple of other commands I found useful when I was experimenting with this:
List running Hadoop jobs
hadoop job -list
Kill running Hadoop jobs
hadoop job -kill job_1234567891011_1234
List particular table directories in HDFS
hadoop fs -ls mytablename
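Note that Hive's managed tables live under its warehouse directory (by default /user/hive/warehouse), so depending on your configuration the full path is more likely something like:
$ # list the HDFS directory behind the imported table (path is illustrative)
$ hadoop fs -ls /user/hive/warehouse/dbadb.db/mytablename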
More resources & Links