Big Data Playground for Engineers: Hive and Metastore
This is part of the series called Big Data Playground for Engineers and the content page is here!
Git:
A fully functional code base and use case examples are up and running.
Repo: https://github.com/gyan42/spark-streaming-playground
Website: https://gyan42.github.io/spark-streaming-playground/build/html/index.html
After this you can set up a common metastore between Spark Streaming, Spark SQL and Hive, enabling cross-tool query capability.
From Apache Spark you can read Hive tables and vice versa!
What is Hive?
Apache Hive is an open-source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files. It processes structured and semi-structured data in Hadoop.
In a traditional database, both the compute and the data live on the same machine, forcing people to buy a bigger machine every time they hit a bottleneck in processing power or storage.
With Hadoop in the picture, the data and the database query engine are decoupled to take advantage of distributed computing.
Hive queries are similar to traditional SQL queries, but with extra power and tweaks to handle data in terabytes!
In short, the data lives in a distributed storage system like HDFS/S3/GFS, and the compute power comes from a cluster backed by Hadoop MapReduce (vanishing these days), Apache Spark, or some custom-built distributed query engine with its own metastore.
Refer to this link in the repo to set up Hive on your local machine:
https://gyan42.github.io/spark-streaming-playground/build/html/setup/ApacheHive.html
What is Hive Metastore?
As we know, the database is among the most important and powerful parts of any organization. It is the collection of Schema, Tables, Relationships, Queries, and Views. It is an organized collection of data.
Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (like their schema and location) and partitions in a relational database, and it provides clients access to this information through the metastore service API.
All Hive installations need a metastore service, where Hive stores table metadata. It is implemented using tables in a relational database. By default, Hive uses an embedded Derby database, which allows only a single process to access the storage, so we cannot run more than one Hive CLI instance at a time. That is fine when running Hive on a personal machine or for some developer task, but for a cluster deployment MySQL or a similar relational database is required.
Hive metastore consists of two fundamental units:
1. A service that provides metastore access to other Apache Hive services.
2. Disk storage for the Hive metadata which is separate from HDFS storage.
Hive Metastore Modes
There are three modes for Hive Metastore deployment:
· Embedded Metastore
· Local Metastore
· Remote Metastore
1. Embedded Metastore
By default, the metastore service runs in the same JVM as the Hive service and uses an embedded Derby database stored on the local file system. The limitation of this mode is that only one embedded Derby database can access the database files on disk at any one time, so only one Hive session can be open at a time.
If we try to start a second session, it produces an error when it attempts to open a connection to the metastore. To allow many services to connect to the metastore, Derby would have to be configured as a network server. This mode is good for unit testing, but it is not good for practical deployments.
2. Local Metastore
Hive is a data-warehousing framework, so a single-session metastore is not what Hive prefers. To overcome this limitation of the Embedded Metastore, the Local Metastore was introduced. This mode allows us to have many Hive sessions, i.e. many users can use the metastore at the same time. We achieve this by using any JDBC-compliant database like MySQL, which runs in a separate JVM or on a different machine than the Hive service and metastore service, which still run together in the same JVM.
3. Remote Metastore
Moving further, there is another metastore configuration called Remote Metastore. In this mode, the metastore runs in its own separate JVM, not in the Hive service JVM. Other processes communicate with the metastore server using Thrift network APIs. We can also run one or more additional metastore servers in this case to provide higher availability. This also brings better manageability/security, because the database tier can be completely firewalled off and the database credentials no longer need to be shared with every Hive user in order to access the metastore database.
To use a remote metastore, configure the Hive service by setting hive.metastore.uris to the metastore server URI(s). Metastore server URIs are of the form thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when starting the metastore server.
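For illustration, here is a minimal PySpark sketch of pointing a Spark session at such a remote metastore; the thrift://localhost:9083 URI and the warehouse path are placeholder values for a local setup, not something configured in this repo.
from pyspark.sql import SparkSession

# Minimal sketch: talk to a remote Hive metastore over Thrift.
# The metastore URI and warehouse directory below are placeholder values.
spark = (SparkSession.builder
         .appName("remote-metastore-demo")
         .config("hive.metastore.uris", "thrift://localhost:9083")
         .config("spark.sql.warehouse.dir", "/tmp/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

# Any catalog operation now goes through the remote metastore service.
spark.sql("SHOW DATABASES").show()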
Databases Supported by Hive
Hive supports the following five backend databases:
· Derby
· MySQL
· MS SQL Server
· Oracle
· Postgres
PostgreSQL as Backend for Hive Metastore
Here is a good comparison: PostgreSQL vs MySQL
Setup:
#install ubuntu packages
sudo apt-get install postgresql postgresql-contrib
# check the version
sudo -u postgres psql -c "SELECT version();"
# test the installation
sudo su - postgres
psql #to launch the terminal
\q #to quit
# or to run psql directly
sudo -i -u postgres psql
To cut short the permission configuration for new users, let's create an Ubuntu user with the same name.
We need to log into PostgreSQL to create users before they can use the DB.
For our case we are going to create a user called `hive` and a DB called `hive`:
sudo adduser hive #password hive
sudo su - postgres
psql #to launch the terminal
CREATE USER hive WITH PASSWORD 'hive'; -- drop user hive; (if needed)
\du
CREATE DATABASE hive;
grant all privileges on database hive to hive;
\list # to see the DB created
\q
We need to make some changes to the connection configs to enable login from different services:
# change the 3rd column values to "all"
sudo vim /etc/postgresql/10/main/pg_hba.conf
# "local" is for Unix domain socket connections only
local all all md5
# IPv4 local connections:
host all all 127.0.0.1/32 md5
# IPv6 local connections:
host all all ::1/128 md5
If you would like to restart the service:
sudo service postgresql restart
sudo systemctl restart postgresql
To see how many active connections are made to the DB (this helps in debugging when many connections are open):
SELECT pid, application_name, state FROM pg_stat_activity;
This is sometimes important for debugging connections: if connections are not closed properly they become stale and hang around, causing new connections to bounce back.
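If you prefer to check this from Python rather than psql, a small psycopg2 sketch like the one below works too (it assumes the `hive` user/DB created above and that psycopg2 is installed, as done in the next section).
import psycopg2

# Sketch: list active connections to the `hive` DB, handy for spotting stale sessions.
conn = psycopg2.connect(host="localhost", port=5432,
                        database="hive", user="hive", password="hive")
with conn.cursor() as cur:
    cur.execute("SELECT pid, application_name, state FROM pg_stat_activity;")
    for pid, app_name, state in cur.fetchall():
        print(pid, app_name, state)
conn.close()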
Use the Hive-provided schematool to set up the metastore tables and schema:
/path/to/hive/bin/schematool -dbType postgres -initSchema
Then try running the following commands; you should see a bunch of tables there:
sudo -i -u hive psql -d hive
# asks for two passwords: one for sudo and the other for the DB user `hive`, which is `hive`
hive=> \dt
Fire up a Python shell and test the connection:
pip install psycopg2==2.8.3 pandas
python
import psycopg2
import pandas as pd

conn = psycopg2.connect(host="localhost", port=5432, database="hive", user="hive", password="hive")
sql_command = "SELECT * FROM \"CDS\";"
print(sql_command)
# Load the data
data = pd.read_sql(sql_command, conn)
print(data)

# >>> output
Empty DataFrame
Columns: [CD_ID]
Index: []
Integration with Spark
As part of our workflow we need to read and write to PostgreSQL from Apache Spark. Let's test whether we can do it from a local `pyspark` terminal.
# it is mandatory to pass the Postgres Maven coordinates so the driver is available at runtime
pyspark --packages postgresql:postgresql:9.1-901-1.jdbc4

# read the table "CDS"
df = spark.read. \
format("jdbc"). \
option("url", "jdbc:postgresql://localhost:5432/hive"). \
option("schema", "public"). \
option("dbtable", "\"CDS\""). \
option("user", "hive"). \
option("password", "hive"). \
option("driver", "org.postgresql.Driver"). \
load()

# display the table
df.printSchema()
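Writing back to PostgreSQL uses the mirror-image API. The sketch below appends a tiny DataFrame into a hypothetical `demo_scores` table; the table name and columns are made up for illustration, and it assumes the same pyspark session started with the Postgres package above.
# Sketch: write a small DataFrame to PostgreSQL over JDBC.
# `demo_scores` is a hypothetical table used only for this example.
rows = [("alice", 10), ("bob", 20)]
out_df = spark.createDataFrame(rows, ["name", "score"])
out_df.write. \
    format("jdbc"). \
    option("url", "jdbc:postgresql://localhost:5432/hive"). \
    option("dbtable", "demo_scores"). \
    option("user", "hive"). \
    option("password", "hive"). \
    option("driver", "org.postgresql.Driver"). \
    mode("append"). \
    save()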
And now comes the configuration part for Hive and Spark to use a common metastore…
The following config is important since Spark and Hive ship with different versions of the metastore schema:
hive.metastore.schema.verification is disabled, to keep the metastore compatible between the different services.
/path/to/hive/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>PostgreSQL metastore driver class name</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:5432/hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>To disable the schema check, so that Spark and Hive can work together</description>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10001</value>
<description>TCP port number to listen on, default 10000</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
<description>
Setting this property to true will have HiveServer2 execute
Hive operations as the user making the calls to it.
</description>
</property>
</configuration>
/path/to/spark/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
<description>
Setting this property to true will have HiveServer2 execute
Hive operations as the user making the calls to it.
</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description>To disable the schema check, so that Spark and Hive can work together</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<description>PostgreSQL metastore driver class name</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:5432/hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
</configuration>
Now go ahead and create DBs/tables in Spark and see them in Hive, or vice versa.
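For example, a quick smoke test from `pyspark` (started with the hive-site.xml above on its conf path) could look like the sketch below; the `ssp_demo` database and `tweets` table are placeholder names.
from pyspark.sql import SparkSession

# Sketch: create a database and a table through the shared metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS ssp_demo")
spark.sql("CREATE TABLE IF NOT EXISTS ssp_demo.tweets (id BIGINT, text STRING) STORED AS PARQUET")
spark.sql("SHOW TABLES IN ssp_demo").show()

# The same table should now be visible from the Hive CLI / beeline:
#   SHOW TABLES IN ssp_demo;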
Spark Thrift Server:
The Thrift server exposes a JDBC/ODBC endpoint for Spark, hosting itself as a running application in the Spark cluster. It can be thought of as a distributed SQL engine with a JDBC/ODBC endpoint.
Spark doc @ https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html
sbin/start-thriftserver.sh \
--master spark://IMCHLT276:7077 \
--hiveconf hive.server2.thrift.bind.host=localhost \
--hiveconf hive.server2.thrift.port=10000 \
--executor-memory 2g \
--conf spark.jars=libs/postgresql-42.2.10.jar \
--conf spark.cores.max=2
Test it with the `beeline` client; the user will be your machine login user with an empty password.
/path/to/spark/bin/beeline
!connect jdbc:hive2://localhost:10000
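If you prefer Python over beeline, the PyHive client can talk to the same endpoint; this is a sketch assuming `pip install 'pyhive[hive]'` and the thrift server running on port 10000 as started above. The username should be your machine login user (shown here as a placeholder `ssp`).
from pyhive import hive

# Sketch: query the Spark Thrift Server (HiveServer2 protocol) from Python.
conn = hive.connect(host="localhost", port=10000, username="ssp")
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
print(cursor.fetchall())
cursor.close()
conn.close()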
Hive Thrift Server:
Use beeline from the Hive package to connect to the Hive thrift server (assuming it is already running); port 10001 is the one configured in /path/to/hive/conf/hive-site.xml.
/path/to/hive/bin/beeline -n ssp -u jdbc:hive2://localhost:10001
References:
- https://www.quora.com/What-is-Hive-Metastore
- https://www.tecmint.com/install-postgresql-on-ubuntu/
- https://linuxize.com/post/how-to-install-postgresql-on-ubuntu-18-04/
- https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9
- https://linuxize.com/post/how-to-create-a-sudo-user-on-ubuntu/
- https://stackoverflow.com/questions/21898152/why-cant-you-start-postgres-in-docker-using-service-postgres-start
- https://markheath.net/post/exploring-postgresql-with-docker