emp_id,emp_name,salary
198,Donald,2600
199,Douglas,2600
200,Jennifer,4400
201,Michael,13000
202,Pat,6000
203,Susan,6500
HIVE
Hive is data warehouse software within the Hadoop ecosystem, designed to read, write, and manage large, tabular datasets for data analysis.
A data warehouse stores historical data from many different sources so that you can analyze and extract insights from it. These insights are used for reporting.
- Hive is scalable and fast because it is designed to work on petabytes of data.
- It is easy to use if you are familiar with SQL, because the Hive query language, HiveQL, is based on SQL.
- Hive supports the following file formats (see the sketch after this list):
- Sequence files, consisting of binary key-value pairs
- Record columnar (RC) files, which store the columns of a table in a columnar format
- Text or flat files.
- Hive allows for data cleaning and filtering tasks according to users’ requirements.
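As a rough illustration of the file-format choice, HiveQL lets you name the storage format when you create a table. The table and column names below are invented for this sketch:

-- Minimal sketches, one per supported format (illustrative names):
CREATE TABLE logs_text (msg STRING) STORED AS TEXTFILE;      -- text/flat file
CREATE TABLE logs_seq  (msg STRING) STORED AS SEQUENCEFILE;  -- binary key-value pairs
CREATE TABLE logs_rc   (msg STRING) STORED AS RCFILE;        -- record columnar format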
RDBMS vs HIVE
RDBMS is the acronym for Relational Database Management System, a type of database management system specifically designed for relational databases.
- A relational database is a database that stores data in a structured format, using rows and columns, known as a table.
Let’s look at the differences between a traditional RDBMS and Hive.
- Traditional RDBMS is used to maintain a database and uses the structured query language known as SQL.
- Hive is used to maintain a data warehouse and uses the Hive query language, a language inspired by SQL.
- Traditional RDBMS is suited for real-time data analysis, like data from sensors.
- Hive is suited for static data analysis.
- Traditional RDBMS allows for as many read and write operations as a user needs.
- Hive is based on the write once, read many methodology.
- Traditional RDBMS can handle up to terabytes of data.
- Hive is designed to handle petabytes of data.
- RDBMS enforces schema on write: data is verified against the schema when it is loaded, before loading can proceed.
- Hive uses schema on read: data is not verified against the schema at load time.
- Lastly, traditional RDBMS may not always have built-in support for data partitioning.
- Hive supports partitioning. Partitioning means dividing the table into parts based on the values of a particular column, such as date or city, as sketched below.
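A minimal HiveQL sketch of partitioning (the table, columns, and path are invented for illustration): each distinct value of the partition column gets its own directory, so queries that filter on that column scan only the matching partitions.

-- Hypothetical partitioned table:
CREATE TABLE orders (order_id INT, amount DOUBLE)
PARTITIONED BY (city STRING);

-- Loading data into one partition (path is illustrative):
LOAD DATA INPATH '/data/orders_london.csv' INTO TABLE orders PARTITION (city = 'London');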
Architecture
This is what the Hive architecture looks like.
There are three main parts of the architecture:
- The Hive clients: Hive provides different drivers for communication depending on the type of application.
- For example, for Java-based applications it uses JDBC drivers, and other types of applications use ODBC drivers. These drivers communicate with the servers.
- Hive services:
- Client interactions are done through the Hive services. Any query operations are done here.
- The command line interface acts as an interface for the Hive service.
- The driver takes in query statements, monitors the progress and life cycle of each session, and stores metadata generated from the query statements.
- The meta store stores the metadata, that is, the information about each table, such as its location and schema.
- The meta store, file system, and job client in turn communicate with Hive storage and computing to perform the following:
- Metadata information about tables is stored in a database, and query results and data loaded into the tables are stored in the Hadoop cluster, on HDFS.
Let’s look at the concepts in detail.
Hive Client
Components in the Hive clients include:
- The JDBC client, which allows Java-based applications to connect to Hive, and
- the ODBC client, which allows applications based on the ODBC protocol to connect to Hive.
Hive Services
The Hive services section of the architecture is in charge of performing queries.
- The Hive server is used to run queries and allows multiple clients to submit requests.
- It is built to support the JDBC and ODBC clients.
- The driver receives query statements submitted through the command line and sends the query to the compiler after initiating a session.
- The optimizer performs transformations on the execution plan and splits tasks to improve speed and efficiency.
- The executor executes the tasks after the optimizer has split them (the EXPLAIN sketch below shows one such plan).
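To see the plan that the compiler and optimizer produce for a statement, HiveQL provides EXPLAIN. The query below is illustrative and assumes the Employee table created later in this walkthrough:

-- Show the execution plan without running the query:
EXPLAIN SELECT emp_name, salary FROM Employee WHERE salary > 5000;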
Hive Storage & Computing
- The meta store is storage for the metadata, which is the information about the tables.
- The meta store is in charge of keeping this information in a central place (see the DESCRIBE sketch below).
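One way to see what the meta store holds for a table is HiveQL's DESCRIBE FORMATTED, which prints the schema, storage location, and other table metadata. The table name here assumes the Employee table created later:

-- Inspect the metadata the meta store keeps for a table:
DESCRIBE FORMATTED Employee;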
How To Use Hive
I will be demonstrating the use of Hive on the cloud: import data in CSV format, create a table, load data into the table, and finally query the table.
CLI Terminal
- Open an instance of your cloud server
- Open a new terminal
Create Directory
- Create a data directory in the project directory:
mkdir /home/project/data
Change Directories
cd /home/project/data
Download Data
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/data/emp.csv
View Data
- Open the file and view the sample data
Set Up Hive and Beeline
- We'll use the Hive image from Docker Hub for this example.
- Pull the Hive image into your system by running:
docker pull apache/hive:4.0.0-alpha-1
Run Hive Server
- Run the Hive server (port 10000 for JDBC connections, port 10002 for the web UI).
- Name the server instance myhiveserver.
- Mount the local data folder in the Hive server as hive_custom_data.
- This means that the whole data folder you created locally, along with anything you later add to it, is available inside the container under the directory /hive_custom_data.
- Use:
docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 -v /home/project/data:/hive_custom_data --name myhiveserver apache/hive:4.0.0-alpha-1
Open Hive Server GUI
- Open the HiveServer GUI and look at the Hive server
- Under Other > Click on Launch Application
- Input the port number: 10002
Access beeline
- Access Beeline.
- This is where you can create, modify, and delete tables, and access the data in them.
docker exec -it myhiveserver beeline -u 'jdbc:hive2://localhost:10000/'
Create Table
- Create a new table Employee with three columns matching the CSV file: emp_id, emp_name, salary
create table Employee(emp_id string, emp_name string, salary int) row format delimited fields terminated by ',' ;
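Note that emp.csv begins with a header row (emp_id,emp_name,salary), which the definition above loads as an ordinary data row; it shows up later in the query output as a row with a NULL salary. If you want Hive to skip it, one variant (using Hive's skip.header.line.count table property; not part of the original walkthrough) is:

create table Employee(emp_id string, emp_name string, salary int)
row format delimited fields terminated by ','
tblproperties ("skip.header.line.count" = "1");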
Check Table
- Check that the table was created
show tables;
- This will list the Employee table
Load Data from CSV
LOAD DATA INPATH '/hive_custom_data/emp.csv' INTO TABLE Employee;
List all Rows
SELECT * FROM employee;
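Once the data is loaded, any other HiveQL projection or filter works the same way; for example (an illustrative query, not part of the original steps):

SELECT emp_name, salary FROM employee WHERE salary > 5000;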
View Details in GUI
- In the Home Tab
- You can see the list of queries we just performed
Hive internally uses MapReduce to process and analyze data. When you execute a Hive query, it generates MapReduce jobs that run on the Hadoop cluster.
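A plain SELECT * like the one above can typically be answered by a simple fetch of the underlying files, whereas aggregations and joins are compiled into MapReduce stages. A small illustrative example that would trigger such a job:

-- Aggregation that compiles to a MapReduce job on a Hadoop-backed Hive:
SELECT AVG(salary) AS avg_salary FROM employee;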
Quit Beeline
CTRL + D
/home/project$ mkdir /home/project/data
/home/project$ cd /home/project/data
# Download Data
/home/project/data$ wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/data/emp.csv
# Pull Hive image
/home/project/data$ docker pull apache/hive:4.0.0-alpha-1
4.0.0-alpha-1: Pulling from apache/hive
1efc276f4ff9: Pull complete
a2f2f93da482: Pull complete
1a2de4cc9431: Pull complete
d2421c7a4bbf: Pull complete
94537d112428: Pull complete
c1c180ff1f7a: Pull complete
e7bf3bcb5626: Pull complete
03dc5be35f32: Pull complete
f40d13a78683: Pull complete
bbc9e14bf9a1: Pull complete
1a47e3a1960a: Pull complete
1111ce1bbac2: Pull complete
4f4fb700ef54: Pull complete
Digest: sha256:b06cfa5eb6e70a9ad6fbdbaef8369b7f4832b1adb892adce118d3eec474dc7ae
Status: Downloaded newer image for apache/hive:4.0.0-alpha-1
docker.io/apache/hive:4.0.0-alpha-1
# Setup Hive
/home/project/data$ docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 -v /home/project/data:/hive_custom_data --name myhiveserver apache/hive:4.0.0-alpha-1
ef254b8992a25e5f71cca5ec7172a3de57de06b889a9d8ca8f08151f696fef02
# Open HiveServer GUI using the icon
/home/project/data$ docker exec -it myhiveserver beeline -u 'jdbc:hive2://localhost:10000/'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://localhost:10000/
Connected to: Apache Hive (version 4.0.0-alpha-1)
Driver: Hive JDBC (version 4.0.0-alpha-1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 4.0.0-alpha-1 by Apache Hive
0: jdbc:hive2://localhost:10000/>
# Create Table
0: jdbc:hive2://localhost:10000/> create table Employee(emp_id string, emp_name string, salary int) row format delimited fields terminated by ',' ;
INFO : Compiling command(queryId=hive_20241028205658_dcca05c6-9159-4d27-9057-18cc32dcb406): create table Employee(emp_id string, emp_name string, salary int) row format delimited fields terminated by ','
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20241028205658_dcca05c6-9159-4d27-9057-18cc32dcb406); Time taken: 1.799 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20241028205658_dcca05c6-9159-4d27-9057-18cc32dcb406): create table Employee(emp_id string, emp_name string, salary int) row format delimited fields terminated by ','
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20241028205658_dcca05c6-9159-4d27-9057-18cc32dcb406); Time taken: 0.743 seconds
No rows affected (2.79 seconds)
# Check if Table is created
0: jdbc:hive2://localhost:10000/> show tables;
INFO : Compiling command(queryId=hive_20241028205920_f59a2fb5-046c-48cc-bfd4-5eb833b02073): show tables
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20241028205920_f59a2fb5-046c-48cc-bfd4-5eb833b02073); Time taken: 0.081 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20241028205920_f59a2fb5-046c-48cc-bfd4-5eb833b02073): show tables
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20241028205920_f59a2fb5-046c-48cc-bfd4-5eb833b02073); Time taken: 0.119 seconds
+-----------+
| tab_name |
+-----------+
| employee |
+-----------+
1 row selected (0.396 seconds)
# Load Data from csv
0: jdbc:hive2://localhost:10000/> LOAD DATA INPATH '/hive_custom_data/emp.csv' INTO TABLE Employee;
INFO : Compiling command(queryId=hive_20241028210232_559cb98d-76de-4cb9-ac14-e799ea2b6f60): LOAD DATA INPATH '/hive_custom_data/emp.csv' INTO TABLE Employee
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20241028210232_559cb98d-76de-4cb9-ac14-e799ea2b6f60); Time taken: 0.182 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20241028210232_559cb98d-76de-4cb9-ac14-e799ea2b6f60): LOAD DATA INPATH '/hive_custom_data/emp.csv' INTO TABLE Employee
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table default.employee from file:/hive_custom_data/emp.csv
INFO : Starting task [Stage-1:STATS] in serial mode
INFO : Executing stats task
INFO : Table default.employee stats: [numFiles=1, numRows=0, totalSize=854, rawDataSize=0, numFilesErasureCoded=0]
INFO : Completed executing command(queryId=hive_20241028210232_559cb98d-76de-4cb9-ac14-e799ea2b6f60); Time taken: 0.25 seconds
No rows affected (0.457 seconds)
# List all rows from Table
0: jdbc:hive2://localhost:10000/> SELECT * FROM employee;
INFO : Compiling command(queryId=hive_20241028210417_30e53dbd-d31e-4e77-9105-24e98e79c00a): SELECT * FROM employee
INFO : No Stats for default@employee, Columns: emp_name, salary, emp_id
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:employee.emp_id, type:string, comment:null), FieldSchema(name:employee.emp_name, type:string, comment:null), FieldSchema(name:employee.salary, type:int, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20241028210417_30e53dbd-d31e-4e77-9105-24e98e79c00a); Time taken: 2.097 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20241028210417_30e53dbd-d31e-4e77-9105-24e98e79c00a): SELECT * FROM employee
INFO : Completed executing command(queryId=hive_20241028210417_30e53dbd-d31e-4e77-9105-24e98e79c00a); Time taken: 0.001 seconds
+------------------+--------------------+------------------+
| employee.emp_id | employee.emp_name | employee.salary |
+------------------+--------------------+------------------+
| emp_id | emp_name | NULL |
| 198 | Donald | 2600 |
| 199 | Douglas | 2600 |
| 137 | Renske | 3600 |
| 138 | Stephen | 3200 |
| 139 | John | 2700 |
| 140 | Joshua | 2500 |
+------------------+--------------------+------------------+
51 rows selected (2.134 seconds)
# Quit CTRL+D
0: jdbc:hive2://localhost:10000/> CTRL+D
[WARN] Failed to create directory: /home/hive/.beeline No such file or directory
Closing: 0: jdbc:hive2://localhost:10000/
theia@theiadocker-emhrcf:/home/project/data$