Hadoop Overview

Before we get into Hadoop, let's look at how we got here: Big Data

Some industries contributing to Big Data:

  • Retail
    • Price analytics, segmentation, price points, margin analysis, sentiment analysis, marketing strategies
  • Insurance
    • Fraud analytics, anomalies detection, risk assessment, predictive modeling to identify high-risk customers
  • Telecommunications
    • Network security, location-based promotions, network analytics, optimized pricing
  • Manufacturing
    • Predictive maintenance, production optimization
  • Automotive Industry
    • Predictive support regarding malfunctions, pre-emptive ordering of parts and repairs, self-driving cars, real-time data analysis for autonomous driving
  • Finance
    • Customer segmentation, targeted marketing, customer demographics, transaction frequency, behavior patterns, social media posts and interactions with customer service systems, algorithmic trading, ML-based decisions using instant analysis of large quantities of data

Hadoop is an open-source framework used to process enormous datasets. It is a set of open-source programs and procedures that serve as the framework for Big Data operations: it processes massive data across distributed file systems that are linked together and allows applications to run on clusters. (A cluster is a collection of computers working together at the same time to perform tasks.)

It should be noted that Hadoop is not a database but an ecosystem that can handle processes and jobs in parallel or concurrently.

Hadoop is optimized to handle massive quantities of data which could be:

  • Structured (tabular)
  • Semi-structured
  • Unstructured

Core Components


The term Hadoop is often used to refer to both the core components of Hadoop as well as the ecosystem of related projects. The core components of Hadoop include:

Hadoop Common

Hadoop Common is an essential part of the Apache Hadoop framework. It refers to the collection of common utilities and libraries that support the other Hadoop modules.

HDFS

There is a storage component called Hadoop Distributed File System, or HDFS. It handles large data sets running on commodity hardware.

Commodity hardware is low-specification, industry-standard hardware; using it allows a single Hadoop cluster to scale out to hundreds or even thousands of nodes.
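
As a quick illustration of working with HDFS from the shell (the user name and file paths below are hypothetical), the following commands copy a local file into HDFS and then show how it was split into blocks and replicated across data nodes:

# Copy a local file into HDFS (user and paths are hypothetical)
hdfs dfs -put sales.csv /user/alice/sales.csv

# Show how HDFS split the file into blocks and where the replicas are stored
hdfs fsck /user/alice/sales.csv -files -blocks -locations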

MapReduce

MapReduce is the processing unit of Hadoop and a core component of the framework. It processes data by splitting large datasets into smaller units and processing them simultaneously.

For a while, MapReduce was the only way to access the data stored in HDFS. There are now other systems, like Hive and Pig.
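
As a minimal sketch of how a MapReduce job is submitted, the Hadoop Streaming utility lets ordinary shell programs act as the map and reduce steps; the jar location and HDFS paths below are assumptions that depend on your installation:

# Submit a streaming job: /bin/cat acts as the mapper, /usr/bin/wc as the reducer
# (jar path and input/output directories are hypothetical)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/alice/input \
  -output /user/alice/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc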

YARN

YARN is short for "Yet Another Resource Negotiator." It is a very important component because it allocates the RAM and CPU that Hadoop needs to run batch processing, stream processing, interactive processing, and graph processing over data stored in HDFS.
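
On a running cluster, you can ask YARN which nodes and applications it is currently managing; for example, assuming the yarn client is on your PATH:

# List the nodes YARN can schedule work on
yarn node -list

# List the applications YARN is currently running
yarn application -list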

Hadoop Ecosystem


The Hadoop ecosystem is made up of components that support one another for Big Data processing. We can examine the Hadoop ecosystem based on the various stages.

Ingest Data

When data is received from multiple sources, Flume and Sqoop are responsible for ingesting it and transferring it to the storage component.

Ingesting is the first stage of Big Data processing.

  • Whenever you deal with big data, you get data from different sources.
  • You then use tools like Flume and Sqoop.

Flume

  • Flume is a distributed service that collects, aggregates, and transfers Big Data to the storage system.
  • Flume has a simple and flexible architecture based on streaming data flows and uses a simple, extensible data model that allows for online analytic applications (a launch example is sketched after this list).
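
A Flume agent is defined in a properties file that names its sources, channels, and sinks, and is then started from the shell. As a hedged sketch (the configuration file name and agent name are assumptions):

# Start a Flume agent named "a1" whose sources, channels, and sinks
# are defined in example.conf (file and agent names are hypothetical)
flume-ng agent --conf conf --conf-file example.conf --name a1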

Sqoop

  • Sqoop is an open-source product designed to transfer bulk data between relational database systems and Hadoop.
  • Sqoop looks in the relational database and summarizes the schema.
  • It then generates MapReduce code to import and export data as needed.
  • Sqoop allows you to quickly develop any other MapReduce applications that use the records that Sqoop stored in HDFS (a hedged import example follows this list).
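
For illustration, a typical Sqoop import pulls one table from a relational database into an HDFS directory; the connection string, credentials, table, and target path below are hypothetical:

# Import the "customers" table from a MySQL database into HDFS
# (host, database, user, table, and target directory are hypothetical)
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username dbuser -P \
  --table customers \
  --target-dir /user/alice/customers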

Storage Component

The storage components are HDFS and HBase.

HBase

HBase is a column-oriented non-relational database system that runs on top of HDFS.

  • It provides real-time wrangling access to the Hadoop file system.
  • HBase uses hash tables to store data in indexes and allows for random access of data, which makes lookups faster (a short shell session is sketched below).
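
As a short, hedged illustration of that random-access style (the table and column-family names are hypothetical), a few commands typed inside the HBase shell create a table, write a cell, and read it back:

# Run these inside the HBase shell (started with: hbase shell)
create 'users', 'info'                    # table 'users' with column family 'info'
put 'users', 'row1', 'info:name', 'Ana'   # write one cell
get 'users', 'row1'                       # random-access read of that row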

Cassandra

Cassandra is a scalable, NoSQL database designed to have no single point of failure.

Process & Analyze Data

From HDFS, the data is distributed to MapReduce-based frameworks like Pig and Hive to process and analyze it, and the processing is done by parallel computing.

Pig

In the Analyze Data stage, Pig is used for analyzing large amounts of data.

  • Pig is a procedural data flow and programming language that follows an order and set of commands (a short Pig Latin sketch follows).
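
A short Pig Latin script illustrates that step-by-step, procedural style; the input path and field names are assumptions:

-- Count word occurrences in a text file (input path is hypothetical)
lines   = LOAD '/user/alice/input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
DUMP counts;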

Hive

Hive is used mainly for creating reports and operates on the server side of a cluster.

  • Hive is a declarative programming language, which means it allows users to express which data they wish to receive.
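
For example, a declarative HiveQL query states only which result is wanted, not how to compute it; the employees table below is hypothetical (a similar table is created in Cheat Sheet 2):

-- Count hires per date from a hypothetical employees table
SELECT hire_date, COUNT(*) AS hires
FROM employees
GROUP BY hire_date
ORDER BY hires DESC;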

Access Data

The final stage is "Access data," where users can reach the analyzed and refined data. At this stage, tools like Impala and Hue are often used.

Impala

Impala is a scalable system that allows non-technical users to search and access the data in Hadoop.

  • You do not need to be skilled in programming to use Impala.
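
For instance, a query can be run directly from the Impala shell; the host name, port, and table below are assumptions:

# Run a single query through impala-shell (host, port, and table are hypothetical)
impala-shell -i impala-host:21000 -q "SELECT COUNT(*) FROM employees;"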

Hue

Hue is another tool of choice at this stage.

  • Hue is an acronym for Hadoop user experience.
  • Hue allows you to upload, browse, and query data.
  • You can run Pig jobs and workflow in Hue.
  • Hue also provides an SQL editor for several query languages, such as Hive and MySQL.

Drawbacks


Hadoop contains many smaller components. Although efficient at first glance, Hadoop falls short at several kinds of simple tasks, described below.

Transaction Processing

Hadoop is not good for processing transactions due to its lack of random access.

Non-parallel Processing

Hadoop is not good when the work cannot be done in parallel.

Dependencies in Data

Hadoop is not good when there are dependencies within the data. Dependencies arise when record one must be processed before record two.

Low Latency Access

Hadoop is also not good for low-latency data access. "Low latency" means only small delays, unnoticeable to humans, between an input being processed and the corresponding output, giving the system real-time characteristics.

This can be especially important for internet connections utilizing services such as:

  • trading
  • online gaming
  • Voice over IP

Small Files

Hadoop is also not good for processing lots of small files, although there is work being done in this area, such as IBM's Adaptive MapReduce.

Intensive Calculations with Little Data

Hadoop is not good for intensive calculations with little data.

Remedies

  • To deal with the shortcomings of Hadoop, new tools like Hive were built on top of it. Hive provided an SQL-like query language and strong statistical functions.
  • Pig was popular for its multi-query approach, which cuts down the number of times the data is scanned.

Cheat Sheet 1


Term Definition
Anomaly detection A process in machine learning that identifies data points, events, and observations that deviate from a data set’s normal behavior. Detecting anomalies from time series data is a pain point that is critical to address for industrial applications.
Apache This open-source HTTP server implements current HTTP standards to be highly secure, easily configurable, and highly extendible. The Apache Software Foundation builds and distributes it under the Apache Software License.
Apache Cassandra It is a scalable, NoSQL database specifically designed not to have a single point of failure.
Apache Nutch An extensible and scalable web crawler software product to aggregate data from the web.
Apache ZooKeeper A centralized service for maintaining configuration information to maintain healthy links between nodes. It provides synchronization across distributed applications. It also tracks server failure and network partitions by triggering an error message and then repairing the failed nodes.
Big data Data sets whose type or size supersedes the ability of traditional relational databases to manage, capture, and process the data with low latency. Big data characteristics include high volume, velocity, and variety.
Big data analytics Uses advanced analytic techniques against large, diverse big data sets, including structured, semi-structured, and unstructured data, from varied sources and sizes, from terabytes to zettabytes.
Block Minimum amount of data written or read, and also offers fault tolerance. The default block size can be 64 or 128 MB, depending on the user’s system configuration. Each file stored need not take up the storage of the preconfigured block size.
Clusters These servers are managed and participate in workload management. They allow enterprise applications to supersede the throughput achieved with a single application server.
Command-line interface (CLI) Used to enter commands that enable users to manage the system.
Commodity hardware Consists of low-cost workstations or desktop computers that are IBM-compatible and run multiple operating systems such as Microsoft Windows, Linux, and DOS without additional adaptations or software.
Data ingestion The first stage of big data processing. It is a process of importing and loading data into IBM® WatsonX.data. You can use the Ingestion jobs tab from the Data manager page to load data securely and easily into WatsonX.data console.
Data sets Created by extracting data from packages or data modules. They gather a customized collection of items that you use frequently. As users update their data set, dashboards and stories are also updated.
Data warehouse Stores historical data from many different sources so users can analyze and extract insights from it.
Distributed computing A system or machine with multiple components on different machines. Each component has its own job, but the components communicate with each other to run as one system for the end user.
Driver Receives query statements submitted through the command line and sends the query to the compiler after initiating a session.
Executor Executes tasks after the optimizer has split the tasks.
Extended Hadoop Ecosystem Consists of libraries or software packages commonly used with or installed on top of the Hadoop core.
Fault tolerance A system is fault-tolerant if it can continue performing despite parts failing. Fault tolerance helps to make your remote-boot infrastructure more robust. In the case of OS deployment servers, the whole system is fault-tolerant if the servers back up each other.
File system An all-comprehensive directory structure with a root ( / ) directory and other directories and files under a logical volume. The complete information about the file system is centralized in the /etc/filesystems file.
Flume A distributed service that collects, aggregates, and transfers big data to the storage system. Offers a simple yet flexible architecture that streams data flows and uses an extensible data model, allowing online analytic applications.
Hadoop An open-source software framework offering reliable distributed processing of large data sets using simplified programming models.
Hadoop Common Fundamental part of the Apache Hadoop framework. It refers to a collection of primary utilities and libraries that support other Hadoop modules.
Hadoop Distributed File System (HDFS) A file system distributed on multiple file servers, allowing programmers to access or store files from any network or computer. It is the storage layer of Hadoop. It works by splitting the files into blocks, creating replicas of the blocks, and storing them on different machines. It can access streaming data seamlessly. It uses a command-line interface to interact with Hadoop.
Hadoop Ecosystem It splits big data analytics processing tasks into smaller tasks. The small tasks are performed in conjunction using an algorithm (MapReduce) and then distributed across a Hadoop cluster (nodes that perform parallel computations on big data sets).
Hadoop Ecosystem stages The four main stages are: ingest, store, process and analyze, and access.
HBase A column-oriented, non-relational database system that runs on top of the Hadoop Distributed File System (HDFS). It provides real-time wrangling access to the Hadoop file system. It uses hash tables to store data in indexes and allow for random data access, making lookups faster.
High-throughput Throughput quantifies the data processed in a timeframe. The target system needs robust throughput for heavy workloads with substantial data changes from the source database to prevent latency spikes. Performance objectives are frequently outlined with throughput targets. High throughput is achieved when most messages are delivered successfully, whereas low successful delivery rates indicate poor throughput and network performance.
Hive It is a data warehouse infrastructure used in data query and analysis with an SQL-like interface. It helps in generating and creating reports. It is a declarative programming language allowing users to express which data they wish to receive.
Hive client Hive provides different communication drivers depending on the application type. For example, Java-based applications use JDBC drivers, and other applications use ODBC drivers. These drivers communicate with the servers.
Hive server Used to execute queries and enable multiple clients to submit requests. It can support JDBC and ODBC clients.
Hive services Client interactions and query operations are done through the Hive services. The command-line interface acts as an interface for the Hive service. The driver takes in query statements, monitors each session’s progress and life cycle, and stores metadata generated from the query statements.
Hive Web Interface A web-based user interface that interacts with Hive through a web browser. It offers a graphical user interface (GUI) to browse tables, execute Hive queries, and manage Hive resources.
HMaster The master server that monitors the region server instances. It assigns regions to region servers and distributes services to different region servers. It also manages any changes to the schema and metadata operations.
Hue An acronym for Hadoop user experience. It allows you to upload, browse, and query data. Users can run Pig jobs and workflow in Hue. It also provides an SQL editor for several query languages, like Hive and MySQL.
Impala A scalable system that allows nontechnical users to search for and access the data in Hadoop.
InputSplits Created by the logical division of data. They serve as an input to a single Mapper job.
JDBC client Component in the Hive client that allows Java-based applications to connect to Hive.
Low latency data access A type of data access allowing minimal delays, not noticeable to humans, between an input processed and corresponding output offering real-time characteristics. It is crucial for internet connections using trading, online gaming, and Voice over IP.
Map Job in MapReduce converts a set of data into another set of data. The elements fragment into tuples (key/value pairs).
MapReduce A program model and processing technique used in distributed computing based on Java. It splits the data into smaller units and processes big data. It is the first method used to query data stored in HDFS. It allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
Meta store Stores the metadata, the data, and information about each table, such as the location and schema. The meta store, file system, and job client in turn communicate with Hive storage and computing: metadata information from tables is stored in a database, while query results and data loaded from the tables are stored in a Hadoop cluster on HDFS.
Node A single independent system for storing and processing big data. HDFS follows the primary/secondary concept.
ODBC (Open Database Connectivity) Client Component in the Hive client, which allows applications based on the ODBC protocol to connect to Hive.
Optimizer Performs transformations on the execution and splits the tasks to help speed up and improve efficiency.
Parallel computing Workload for each job is distributed across several processors on one or more computers, called compute nodes.
Parser A program that interprets the physical bit stream of an incoming message and creates an internal logical representation of the message in a tree structure. The parser also regenerates a bit stream from the internal message tree representation for an outgoing message.
Partitioning This implies dividing the table into parts depending on the values of a specific column, such as date or city.
Pig Hadoop component Famous for its multi-query approach, it analyzes large amounts of data. It is a procedural data flow and programming language that follows an order and set of commands.
Primary node Also known as the name node, it regulates client file access and maintains, manages, and assigns tasks to the secondary node. The architecture is such that per cluster, there is one name node and multiple data nodes, the secondary nodes.
Rack The collection of about forty to fifty data nodes using the same network switch.
Rack awareness When performing operations such as read and write, the name node maximizes performance by choosing the data nodes closest to itself. Developers can select data nodes on the same rack or nearby racks. This reduces network traffic and improves cluster performance. The name node keeps the rack ID information to achieve rack awareness.
Read In this operation, the client requests the location of the data nodes containing the blocks from the primary node. The client then reads the files from the closest data nodes.
Reduce Job in MapReduce that uses output from a map as an input and combines data tuples into small sets of tuples.
Region The basic building element and smallest unit of the HBase cluster, consisting of column families. It contains multiple stores, one for each column family, and has two components: HFile and MemStore.
Region servers These servers receive read and write requests from the client. They assign the request to a region where the column family resides. They serve and manage regions present in a distributed cluster. The region servers can communicate directly with the client to facilitate requests.
Relational database Data is organized into rows and columns collectively, forming a table. The data is structured across tables, joined by a primary or a foreign key.
Relational Database Management System (RDBMS) Traditional RDBMS maintains a database and uses the structured query language, SQL. It is suited for real-time data analysis, like data from sensors. It allows for as many read-and-write operations as a user may require. It can handle up to terabytes of data. It enforces that the schema must verify loading data before it can proceed. It may not always have built-in support for data partitioning.
Replication The process of creating a copy of a data block. Rack awareness is used here as well: replicas are placed on data nodes in different racks, so if a rack goes down, users can obtain the data from another rack.
Replication factor Defined as the number of times you make a copy of the data block. Users can set the number of copies they want, depending on their configuration.
Schema It is a collection of named objects. It provides a way to group those objects logically. A schema is also a name qualifier; it provides a way to use the same natural name for several objects and prevent ambiguous references.
Secondary node This node is also known as a data node. There can be hundreds of data nodes in the HDFS that manage the storage system. They perform read and write requests at the instructions of the name node. They also create, replicate, and delete file blocks based on instructions from the name node.
Semi-structured data Semi-structured data (JSON, CSV, XML) is the “bridge” between structured and unstructured data. It does not have a predefined data model and is more complex than structured data, yet easier to store than unstructured data.
Shuffle Phase in which interim map output from mappers transfers to reducers. Every reducer fetches interim results for all values associated with the same key from multiple nodes. This is a network-intensive operation within the Hadoop cluster nodes.
Sqoop An open-source product designed to transfer bulk data between relational database systems and Hadoop. It looks at the relational database and summarizes the schema. It generates MapReduce code to import and export data. It helps develop other MapReduce applications that use the records stored in HDFS.
Streaming Implies HDFS provides a constant bitrate when transferring data rather than having the data transferred in waves.
Structured data Structured data, typically categorized as quantitative data, is highly organized and easily decipherable by machine learning algorithms. Developed by IBM in 1974, structured query language (SQL) is the programming language used to manage structured data.
Unstructured data Information lacking a predefined data model or not fitting into relational tables.
Write In this operation, the Name node ensures that the file does not exist. If the file exists, the client gets an IO Exception message. If the file does not exist, the client is given access to start writing files.
Yet Another Resource Negotiator (YARN) Prepares Hadoop for batch, stream, interactive, and graph processing.

Cheat Sheet 2


Package/Method Description Code Example
bin/hadoop All Hadoop commands are invoked by the bin/hadoop script. Running the Hadoop script without any arguments prints the description for all commands.

Running Hadoop script without arguments:

bin/hadoop
cat Reads each file parameter in sequence and writes it to standard output. If you do not specify a file name, the cat command reads from standard input. You can also specify a file name of - (dash) for standard input.

Create two sample files.

echo "This is file 1" > file1.txt echo "This is file 2" > file2.txt

Use the cat command to read and display the contents of both files

cat file1.txt file2.txt

Sample output (Contents of file1.txt and file2.txt):

This is file 1
This is file 2
cd Used to move efficiently from the existing working directory to different directories on your system.

Basic syntax of cd command:

cd [options]... [directory]

Example 1: Change directory location to “folder1”

cd /usr/local/folder1

Example 2: Get back to the previous working directory

cd -

Example 3: Move up one level from the present working directory tree

cd ..
create table Used to create a new table in a database

Create a new database (if not already created).

CREATE DATABASE your_database;

Use the newly created database.

USE your_database;

Create a new table named “employees” in Hive.

CREATE TABLE employees (
    id INT,
    first_name STRING,
    last_name STRING,
    email STRING,
    hire_date DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Show the list of tables in the database.

SHOW TABLES;

Sample Output (List of Tables):

OK
employees
curl A command-line tool (pronounced “curl”) that allows data to be exchanged between a device and a server through a terminal. The user specifies the server URL, the location where they want to send the request, and the data they want to send to the server URL using this command-line interface (CLI).

Example 1: Sending a GET request and displaying the response

Send a GET request to a server and display the response.

curl https://www.example.com

In this example, we use the curl command to send a GET request to https://www.example.com and display the HTML response from the server.


Example 2: Sending data to a server using POST Request:

Send a POST request with data to a server and display the response.

curl -X POST -d "name=John&age=30" https://www.example.com/api

In this example, we use the curl command to send a POST request to https://www.example.com/api with data name=John&age=30 and display the JSON response from the server.

docker exec Runs a new command in a running container. It only runs when the container’s primary process is running, and it is not restarted if the container is restarted.

Running a command in a running Docker container:

Run a new command inside a running Docker container.

docker exec -it container_name_or_id ls /app

Sample Output (List of files in the ‘/app’ Directory inside the container):

file1.txt
file2.txt
subdirectory

In this example:

  • docker exec is used to run a new command (ls /app) inside a running Docker container.

  • -it enables an interactive terminal session, which allows you to see the output of the command.

  • container_name_or_id is the name or ID of the running Docker container you want to execute the command in.

  • ls /app is the command that lists the files and directories in the ‘/app’ directory inside the container.

docker-compose Compose is a tool for defining and running multi-container Docker applications. It uses the YAML file to configure the services and enables us to create and start all the services from just one configuration file.

Starting Docker containers using docker-compose:

Suppose you have a docker-compose.yml file like this:

version: '3'
services:
  web:
    image: nginx:latest
    ports:
      - "80:80"
  db:
    image: postgres:latest
    environment:
      POSTGRES_PASSWORD: example_password

You can use docker-compose to start the services defined in the docker-compose.yml file as follows:

Navigate to the directory containing the docker-compose.yml file.

cd /path/to/your/docker-compose-project

Start the Docker containers defined in the docker-compose.yml file

docker-compose up
docker pull You can download Docker images from the internet.
docker pull [OPTIONS] IMAGE_NAME[:TAG]
docker run It runs a command in a new container, getting the image and starting the container if needed.
docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
git clone You can create a copy of a specific repository or branch within a repository.
git clone REPOSITORY_URL [DESTINATION_DIRECTORY]
hdfs dfs Apache Hadoop's hadoop fs and hdfs dfs are file system commands used to interact with HDFS. These commands are very similar to Unix commands. The major difference between the two is that hadoop fs also works with other file systems, such as S3 and Azure storage, while hdfs dfs is specific to HDFS.

Example 1: Listing files and directories in HDFS

List files and directories in the root directory of HDFS.

hdfs dfs -ls /

Sample output:

drwxr-xr-x   - hdfs hduser          0 2023-09-13 10:00 /user
drwxrwxrwx   - hdfs hduser          0 2023-09-13 10:05 /tmp
drwxrwxrwx   - mapred hduser        0 2023-09-13 10:10 /mapred

Example 2: Creating a directory in HDFS

Create a new directory named "mydata" in HDFS.

hdfs dfs -mkdir /user/your_username/mydata
hdfs dfs -cat Displays the contents of a file.

Display the contents of a file in HDFS.

hdfs dfs -cat /path/to/file.txt
hdfs dfs -mkdir Creates a directory named path in HDFS

Create a directory in HDFS.

hdfs dfs -mkdir /user/username/mydirectory
hdfs dfs -put Upload a file or folder from the local disk to HDFS.

Upload a file from the local file system to HDFS.

hdfs dfs -put localfile.txt /user/username/hdfsfile.txt
LOAD DATA INPATH Hive provides the functionality to load data into precreated tables, either from the local file system or from HDFS. This command is used to load data into a Hive table.

Load data from HDFS into a Hive table.

LOAD DATA INPATH '/user/username/hdfsfile.txt' INTO TABLE mytable;
ls Writes to standard output the contents of each specified Directory parameter or the name of each specified file parameter, along with any other information you ask for with the flags. If you do not specify a file or directory parameter, the ls command displays the contents of the current directory.

Basic command syntax

ls [options] [file/directory]

Example 1: Sorts the file names by last modification time; adding 'r' displays them in reverse order

ls -lt
ls -ltr

Example 2: Displays hidden files

ls -a
mkdir Used to create one or more directories specified by the Directory parameter. Each new directory contains the standard entries dot (.) and dot dot (..). You can specify the permissions for the new directories with the -m Mode flag.

Create a new directory named “myfolder.”

mkdir myfolder
SELECT * FROM Lists all the rows from the table to check if the data has been loaded from the file.

Select all rows from a table.

SELECT * FROM tablename;
show tables Used to see all the tables in the database that have been selected.

Show all tables in the selected database.

SHOW TABLES;
tar Looks for archives on the default device (usually tape) unless you specify another device. When writing to an archive, the tar command uses a temporary file (the /tmp/tar* file) and maintains in memory a table of files with several links.

Create a tar archive of a directory.

tar -cvf archive.tar /path/to/directory
wget Stands for web get. wget is a free, noninteractive file downloader command. Noninteractive means it can work in the background when the user is not logged in.

Basic syntax of the wget command; commonly used options are [-V], [-h], [-b], [-e], [-o], [-a], [-q]

wget [options]... [URL]...

Example 1: Downloads file.txt from the HTTP website URL into the working directory.

wget http://example.com/file.txt

Example 2: Downloads archive.zip from the HTTP website URL in the background and returns you to the command prompt in the interim.

wget -b http://www.example.org/files/archive.zip