Database Systems is a blog about Databases, Oracle, Salesforce and Data Integration, by Kubilay Tsil Kara.
In this post I will share my experience with Hive, an Apache Hadoop component that lets you run SQL on an Apache Hadoop Big Data cluster.
Being a great fan of SQL and relational databases, this was my opportunity to set up a mechanism to transfer some (a lot of) data from a relational database into Hadoop and query it with SQL. Not a very difficult thing to do these days; in fact it is very easy with Apache Hive!
All you need is access to a Hadoop cluster with the Hive module installed. You can provision a Hadoop cluster yourself by downloading it and installing it in pseudo-distributed mode on your own PC, or you can run one in the cloud with Amazon AWS EMR in a pay-as-you-go fashion.
There are many ways of doing this; just Google it and you will be surprised how easy it is. To install on your own PC (Linux), just download and install Apache Hadoop and Hive from the Apache Hadoop Downloads page.
You will need to download and install three things from the above link:
- Hadoop (HDFS and Big Data Framework, the cluster)
- Hive (data warehouse module)
- Sqoop (data importer)
What is Hive?
Hive is Big Data SQL: the Data Warehouse in Hadoop. You can create tables, indexes, partitioned tables, external tables and views, much like in a relational database Data Warehouse. You can run SQL to do joins and to query the Hive tables in parallel using the MapReduce framework. It is actually quite fun to see your SQL queries translated into MapReduce jobs and run in parallel, like the parallel SQL queries we run on Oracle EE Data Warehouses and other databases. :0) The syntax looks very much like MySQL's SQL syntax.
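To give a flavour of this, here is a small HiveQL sketch of a partitioned table and a query against it; the table and column names are made up purely for illustration:

```sql
-- Hypothetical example: a partitioned Hive table (HiveQL)
CREATE TABLE page_views (
  view_id   BIGINT,
  user_name STRING,
  url       STRING
)
PARTITIONED BY (view_date STRING);

-- Joins and aggregates like this are compiled into MapReduce jobs
-- that run in parallel across the cluster
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2015-01-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

The WHERE clause on the partition column means Hive only reads the files for that partition, much like partition pruning in a relational Data Warehouse.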
Hive is NOT an OLTP transactional database: it does not support OLTP-style INSERT, UPDATE and DELETE transactions, and it does not conform to ANSI SQL or the ACID properties of transactions.
Direct insert into Hive with Apache Sqoop:
After you have installed Hadoop, have Hive set up and are able to log in to it, you can use Sqoop, the data importer of Hadoop, as in the following command to import a table from MySQL via JDBC directly into Hive using MapReduce.
$ sqoop import --connect jdbc:mysql://myhostname/mydatabasename --username kubilay -P --table mytablename --hive-import --hive-drop-import-delims --hive-database dbadb --num-mappers 16 --split-by id
Sqoop import options explained:
- -P prompts for the password
- --hive-import makes Sqoop import the data straight into a Hive table, which it creates for you
- --hive-drop-import-delims drops \n, \r, and \01 from string fields when importing into Hive
- --hive-database tells it which Hive database to import into; otherwise the table goes to the default database
- --num-mappers is the number of parallel map tasks to run, like parallel processes / threads in SQL
- --split-by is the column of the table used to split work units, like a partitioning key in database partitioning
Once you have imported the table you can log in to Hive and query it with SQL like any relational database. On a properly configured system you can log in to Hive just by calling hive from the command line like this:
$ hive
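Inside the Hive shell, the imported table can then be inspected and queried with familiar SQL; the database and table names below follow the sqoop command above:

```sql
-- Switch to the database the Sqoop import targeted
USE dbadb;

-- Have a look at what was imported
SHOW TABLES;
DESCRIBE mytablename;

-- An ordinary SQL query; Hive runs it as MapReduce jobs
SELECT COUNT(*) FROM mytablename;
```

Even a simple COUNT(*) like this is executed as a MapReduce job, which is why there is a noticeable start-up delay compared to the same query on MySQL.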
More Commands to list jobs:
A couple of other commands I found useful while experimenting with this:
List running Hadoop jobs
hadoop job -list
Kill running Hadoop jobs
hadoop job -kill job_1234567891011_1234
List a particular table's directory in HDFS (by default Hive stores managed table data under /user/hive/warehouse):
hadoop fs -ls /user/hive/warehouse/dbadb.db/mytablename