# Download data to our environment
/home/project$ curl https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz --output hadoop-3.3.6.tar.gz
# Extract tar file
/home/project$ tar -xvf hadoop-3.3.6.tar.gz
# move to directory
/home/project$ cd hadoop-3.3.6
/home/project/hadoop-3.3.6$
# Verify Hadoop setup
/home/project/hadoop-3.3.6$ bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in worker mode
hosts filename                   list of hosts to use in worker mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:

    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the required libraries
conftest      validate configuration XML files
credential    interact with credential providers
distch        distributed metadata changer
distcp        copy file or directories recursively
dtutil        operations related to delegation tokens
envvars       display computed Hadoop environment variables
fs            run a generic filesystem user client
gridmix       submit a mix of synthetic job, modeling a profiled from production load
jar <jar>     run a jar file. NOTE: please use "yarn jar" to launch YARN applications, not this command.
jnipath       prints the java.library.path
kdiag         Diagnose Kerberos Problems
kerbname      show auth_to_local principal conversion
key           manage keys via the KeyProvider
rumenfolder   scale a rumen input trace
rumentrace    convert logs into a rumen trace
s3guard       S3 Commands
trace         view and modify Hadoop tracing settings
version       print the version

    Daemon Commands:

kms           run KMS, the Key Management Server
registrydns   run the registry DNS server

SUBCOMMAND may print help when invoked w/o parameters or with -h.
# Download data.txt to local directory
/home/project/hadoop-3.3.6$ curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/data.txt --output data.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    47  100    47    0     0    339      0 --:--:-- --:--:-- --:--:--   340
# Run MapReduce wordcount on the file and store the output in /user/root/output
/home/project/hadoop-3.3.6$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount data.txt output
# See the output file generated
/home/project/hadoop-3.3.6$ ls output
_SUCCESS  part-r-00000
# View the output of word count
/home/project/hadoop-3.3.6$ cat output/part-r-00000
BigData	1
Hadoop	1
IBM	2
MapReduce	2
How To Use - 2
MapReduce
MapReduce is a programming pattern that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop.
- MapReduce is a processing technique and a program model for distributed computing, it is based on Java, can also be coded in C++, Python, Ruby, R
- Distributed computing is a system or machine with multiple components located on different machines.
- Each component has its own job, but the components communicate with each other to run as one system for the end user.
- The MapReduce algorithm consists of two important tasks - Map and Reduce.
Map
As the name suggests, the MapReduce framework contains two tasks, Map and Reduce.
Map takes in an input file and performs mapping by processing and extracting important data into key-value pairs; these form the preliminary output list.
Some more reorganization goes on before the preliminary output is sent to the Reducer.
Reduce
The Reducer works with multiple map functions and aggregates the pairs using their keys to produce a final output. MapReduce keeps track of its tasks by creating unique keys to ensure that all the processes are solving the same problem.
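The Map, shuffle, and Reduce stages can be sketched on a single machine with standard Unix tools. This is only an analogy (Hadoop distributes each stage across a cluster), and the sample input lines below are made up for illustration:

```shell
# Map: split lines into one word (key) per line
# Shuffle: sort brings identical keys next to each other
# Reduce: uniq -c sums the occurrences in each group
printf 'IBM BigData\nIBM Hadoop MapReduce\nMapReduce\n' |
  tr -s ' ' '\n' |
  sort |
  uniq -c
# -> BigData 1, Hadoop 1, IBM 2, MapReduce 2
```

Each pipe stage maps onto one MapReduce phase: the per-word split is the map, `sort` plays the role of the shuffle that groups equal keys, and `uniq -c` is the reduce that aggregates each group.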
Input
The input data is a file that is saved in the Hadoop file system called HDFS.
Example
Input
- Now let’s assume we have an input file that contains names of people, and we would like to do a word count on the unique name occurrences.
Map
- First, the data is split into four files; each is converted into key-value pairs and worked on separately.
- For example, for the first split line containing Teju and Briana, we get two key-value pairs, each with one occurrence.
- The same is done for all the other key-value pairs.
Reduce
- The reducer then processes the data that comes from the map step.
- The Reducer starts with shuffling. Shuffling sorts each key together with a list of its values; for example, the key Teju appears with the list of values collected in the previous step.
- We will have Teju [1, 1, 1], because the name Teju occurred 3 times in the "Map" step.
- It does the same for the rest of the names, counting how many times each appeared in the "Map" step.
- The Reducer layer then aggregates the values in each list, and the final output is saved to an output file stored in HDFS.
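The Teju walkthrough above can be reproduced locally. The input lines here are hypothetical, chosen so that Teju occurs three times as in the text, and the intermediate files make the shuffle's grouping (Teju -> [1, 1, 1]) visible:

```shell
# Hypothetical input: "Teju" occurs 3 times, as in the walkthrough
printf 'Teju Briana\nTeju Joseph\nBriana Teju\n' > names.txt

# Map: emit a (name, 1) pair for every occurrence
tr -s ' ' '\n' < names.txt | awk '{print $1"\t1"}' > pairs.txt

# Shuffle: sorting groups the pairs by key, so the three
# "Teju  1" lines end up adjacent -- i.e. Teju -> [1, 1, 1]
sort pairs.txt > grouped.txt

# Reduce: sum the values for each key
awk -F'\t' '{c[$1] += $2} END {for (n in c) print n "\t" c[n]}' grouped.txt | sort > counts.txt
cat counts.txt   # Briana 2, Joseph 1, Teju 3
```

Inspecting `grouped.txt` after the sort shows exactly the grouped list the text describes, before the reduce collapses it to a single total per name.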
Advantages
Parallel Computing
One key advantage of MapReduce is its ability to run a high number of parallel jobs across multiple nodes. A node is an independent computer used for processing and storing big volumes of data.
Splitting Independent Tasks
In Hadoop we have two types of nodes: the name node and the data node. MapReduce allows splitting and running independent tasks in parallel by dividing each job into smaller tasks, which in turn saves time.
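The split-and-parallelize idea can be sketched locally, with shell background jobs standing in for data nodes. The input file and the two-line split size below are made up for illustration:

```shell
# Hypothetical input file
printf 'a b a\nc a b\nb c a\nd a b\n' > big.txt

# Divide the input into independent 2-line splits
split -l 2 big.txt piece_

# Map each split as a separate background job (stand-in for data nodes)
for p in piece_*; do
  tr -s ' ' '\n' < "$p" | awk '{print $1"\t1"}' > "mapped_$p" &
done
wait   # all independent map tasks have finished

# Merge the partial results and reduce to final totals
cat mapped_piece_* | sort |
  awk -F'\t' '{c[$1] += $2} END {for (k in c) print k "\t" c[k]}' | sort > totals.txt
cat totals.txt   # a 5, b 4, c 2, d 1
```

Because each split is mapped independently, the map tasks never coordinate with each other; only the final merge-and-reduce step needs to see all the partial results, which is what makes the pattern scale across nodes.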
Variety of Data Processed
MapReduce is very flexible and can process data that come in tabular and non-tabular forms. Therefore, MapReduce provides business value to organizations regardless of how their data is structured.
Multiple Language Support
It offers support for several programming languages, such as Java, C++, Python, Ruby, and R.
Platform for Analysis
It provides a platform for analysis, data warehousing and more.
Use Cases
Some use cases not displayed above:
Video Streaming
MapReduce is used by Netflix to recommend movies based on what a user has watched in the past, drawing on the user's interests.
Financial Institutions
It is also used in financial institutions like banks and credit card companies to flag and detect anomalies in user transactions.
Advertisement
It can also be used in the advertisement industry to understand a user’s behavior by how they engage with ads. Google ads work by using MapReduce to understand the engagement of users with an ad.
How To Use - 1
The steps outlined in this lab use single-node Hadoop Version 3.3.6. Hadoop is most useful when deployed in a fully distributed mode on a large cluster of networked servers sharing a large volume of data. However, for basic understanding, we will configure Hadoop on a single node.
We will run the WordCount example with an input text and see how the content of the input file is processed.
Download Data
Start New Terminal
- Since we are in the cloud, start a new terminal
Download Tar File
curl https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz --output hadoop-3.3.6.tar.gz
Extract tar file
tar -xvf hadoop-3.3.6.tar.gz
Move to Directory
cd hadoop-3.3.6
Verify Hadoop Setup
- Check the hadoop command to see if it is set up. This will display the usage documentation for the hadoop script.
bin/hadoop
Download Data to Directory
- Download data.txt to current directory
curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/data.txt --output data.txt
Run MapReduce
- Run the MapReduce application for wordcount on data.txt and store the output in /user/root/output
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount data.txt output
List the Output
- Once the word count runs successfully, you can run the following command to see the output file it has generated
ls output
- The output should show two files: _SUCCESS and part-r-00000
View the Output
- It might take a couple of minutes to process
- To view the word count output run
cat output/part-r-00000
Clean Up Previous Files
- Delete the data file as well as the output folder from the previous run
rm data.txt
rm -rf output
Create Text File
- Use a text editor to create data.txt with the following content:
Italy Venice
Italy Pizza Pizza Pasta Gelato
Save File
- Save the file in the Hadoop directory
Run MapReduce
- To run wordcount on data.txt and save the result in /user/root/output, use:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount data.txt output
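If no graphical editor is handy, data.txt can also be created from the terminal with a heredoc, and a local pipeline can preview the counts the wordcount job should report. The preview is only a sanity check, not part of the lab itself:

```shell
# Create data.txt with the lab's two lines, without a text editor
cat > data.txt <<'EOF'
Italy Venice
Italy Pizza Pizza Pasta Gelato
EOF

# Preview the counts the Hadoop wordcount job should produce
tr -s ' ' '\n' < data.txt | sort | uniq -c
# -> Gelato 1, Italy 2, Pasta 1, Pizza 2, Venice 1
```

Running the Hadoop wordcount job on this file should yield the same per-word totals, just written to output/part-r-00000 instead of the terminal.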
List Output
ls output
_SUCCESS part-r-00000
View Output
cat output/part-r-00000
rm data.txt
rm -rf output
# Run MapReduce wordcount and save the result in output
/home/project/hadoop-3.3.6$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount data.txt output
:
# List Output
/home/project/hadoop-3.3.6$ ls output
_SUCCESS  part-r-00000
# View Output
/home/project/hadoop-3.3.6$ cat output/part-r-00000
Gelato	1
Italy	2
Pasta	1
Pizza	2
Venice	1
# Quit
CTRL+D
Social Media
MapReduce can be used by social media platforms like LinkedIn and Instagram to analyze who visited, viewed, and interacted with your profile and posts.