An implementation of the MapReduce model to digest log files generated using a log generator and produce metrics based on the distribution of the logs. This project makes use of HDFS and the MapReduce framework to perform the required operations on the log files.
This project consists of multiple MapReduce jobs that can be executed on a Hadoop cluster backed by HDFS. Below are the steps that need to be followed in order to execute the MapReduce jobs.
- Open the command prompt/terminal at the root directory of the project and run the command `sbt clean compile assembly`
- If you need to execute the test cases, run `sbt clean compile test`
- Once the assembly step is completed, a fat jar file will be generated under `target/scala-3.0.2/<Jar name>`
- In order to generate an input log file, from within the terminal run `sbt clean compile run` and then select the LogGenerator class. This will generate a log file according to the parameters set in the `application.conf` file. A consolidated sketch of the local workflow is shown below.
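For reference, a minimal sketch of the local workflow might look like the following. It assumes `sbt` is on the PATH; the `runMain` class name is a placeholder, so substitute the fully qualified name of the LogGenerator class as it appears in the `sbt run` prompt, and adjust the jar name to whatever `build.sbt` produces.

```
# Run the test suite and build the fat jar
sbt clean compile test
sbt assembly

# Generate an input log file without the interactive prompt.
# "com.example.LogGenerator" is a placeholder; use the class listed by `sbt run`.
sbt "runMain com.example.LogGenerator"

# The fat jar should now be available here
ls target/scala-3.0.2/
```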
Here we use the Hortonworks Hadoop sandbox to execute our MapReduce jobs.
- Using an scp tool, transfer the jar file as well as the generated log file onto the Hortonworks sandbox VM instance.
- Move the input file into an HDFS directory using the command `hdfs dfs -put <name of the log file> <hdfs directory>`
- Run the individual jobs from the jar file by running the command `hadoop jar <jar name> <class name> <input directory path> <output directory path>`
- Once a job executes successfully, its output can be viewed in the specified output directory. An end-to-end example is sketched below.
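Putting these steps together, a typical run against the sandbox might look like the sketch below. The user, host, and SSH port are the usual HDP sandbox defaults but are assumptions here, and all file, directory, and class names are placeholders.

```
# Copy the fat jar and the generated log file onto the sandbox (adjust user/host/port)
scp -P 2222 target/scala-3.0.2/<jar name> maria_dev@sandbox-hdp.hortonworks.com:~/
scp -P 2222 <name of the log file> maria_dev@sandbox-hdp.hortonworks.com:~/

# On the sandbox: stage the log file in HDFS
hdfs dfs -mkdir -p /user/maria_dev/input
hdfs dfs -put <name of the log file> /user/maria_dev/input

# Run one MapReduce job from the fat jar; repeat with a different class and
# output directory for each job
hadoop jar <jar name> <class name> /user/maria_dev/input /user/maria_dev/output

# Inspect the results (reducer output files are named part-r-*)
hdfs dfs -ls /user/maria_dev/output
hdfs dfs -cat /user/maria_dev/output/part-r-00000
```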
Here we execute our MapReduce jobs on an AWS EMR cluster.
- Create an S3 bucket and move the fat jar as well as the generated log files into separate folders within the bucket.
- From the AWS console, navigate to the EMR service and click on create cluster.
- Here we create a cluster to execute step jobs, adding a Custom JAR execution step for each MapReduce job that we need to run.
- For each of the steps created this way, we need to specify the S3 location of the jar and provide the appropriate arguments to the job.
- Finally, click on create cluster. This will start the resource provisioning process and execute the steps one after the other.
- The output files can be viewed at the S3 path specified in the arguments passed to each of the jobs. An equivalent AWS CLI sketch covering the upload and cluster creation is shown below.
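For completeness, the upload and cluster creation can also be done from the AWS CLI. The sketch below is illustrative only: the bucket name, folder layout, release label, instance settings, and job class names are assumptions, and whether the class name must be passed as the first argument depends on the fat jar's manifest.

```
# Upload the fat jar and the log file into separate folders of a new bucket
# (s3://my-mapreduce-bucket is a placeholder name)
aws s3 mb s3://my-mapreduce-bucket
aws s3 cp target/scala-3.0.2/<jar name> s3://my-mapreduce-bucket/jars/
aws s3 cp <name of the log file> s3://my-mapreduce-bucket/input/

# Create an EMR cluster with one Custom JAR step per MapReduce job; the cluster
# terminates automatically once all steps have finished
aws emr create-cluster \
  --name "log-metrics-mapreduce" \
  --release-label emr-6.3.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --auto-terminate \
  --log-uri s3://my-mapreduce-bucket/emr-logs/ \
  --steps Type=CUSTOM_JAR,Name=Job1,ActionOnFailure=CONTINUE,Jar=s3://my-mapreduce-bucket/jars/<jar name>,Args=[<class name>,s3://my-mapreduce-bucket/input/,s3://my-mapreduce-bucket/output/job1/] \
          Type=CUSTOM_JAR,Name=Job2,ActionOnFailure=CONTINUE,Jar=s3://my-mapreduce-bucket/jars/<jar name>,Args=[<class name>,s3://my-mapreduce-bucket/input/,s3://my-mapreduce-bucket/output/job2/]
```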