Big Data Project A.Y. 2024/2025
- Giovanni Antonioni [Second Job]
- Luca Rubboli (0001083742) [First Job]
Only the yellow and green datasets are analyzed.
The first job consists of the following steps:
- Dataset cleanup
- Discretization of continuous features
- Select features to categorize data into classes, then calculate the average prices per distance and per time ($/mile and $/minute)
- Define bins representing the price difference w.r.t. the average price (e.g. avg_price | avg_price + 5 | avg_price + 10 | ...)
- Establish a subset of discrete features
- For each combination of the discrete features' values, calculate the percentage of data points that fall into each price bin (see the sketch after this list)
- Results visualization through appropriate graphs
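
A minimal sketch of the binning and aggregation idea in Spark SQL follows; the column names (total_amount, trip_distance, passenger_count, payment_type) are illustrative assumptions based on the yellow/green trip schema, not necessarily the exact ones used by the project:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object FirstJobSketch {

  // Percentage of trips per price bin for every combination of two
  // hypothetical discrete features (passenger_count, payment_type).
  def pricePerMileDistribution(trips: DataFrame): DataFrame = {
    // Average price per mile over the whole dataset.
    val avgPrice = trips
      .agg(avg(col("total_amount") / col("trip_distance")))
      .first()
      .getDouble(0)

    // 5 $/mile wide bins around the average, as in avg | avg + 5 | avg + 10 | ...
    // (bin 0 = [avg, avg + 5), bin 1 = [avg + 5, avg + 10), negative bins below avg).
    val binned = trips.withColumn(
      "price_bin",
      floor((col("total_amount") / col("trip_distance") - lit(avgPrice)) / lit(5.0))
    )

    // Share of trips in each price bin, within each feature combination.
    val byCombination = Window.partitionBy("passenger_count", "payment_type")
    binned
      .groupBy("passenger_count", "payment_type", "price_bin")
      .count()
      .withColumn(
        "percentage",
        round(col("count") * 100.0 / sum("count").over(byCombination), 2)
      )
  }
}
```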
The second job consists of the following steps:
- Dataset cleanup
- Discretization of continuous features
- Select features to categorize data into classes (average_price is also calculated and used in the categorization), then calculate the average tips ($)
- For each categorization, analyze how tips are affected when one feature's bin changes while all the other features are held fixed, computing the percentage change of the average tip w.r.t. the first bin (see the sketch after this list)
- Results visualization through appropriate graphs
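
A minimal sketch of the tip analysis follows; tip_amount and the bin/feature names are illustrative assumptions, not necessarily the project's exact columns:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object SecondJobSketch {

  // Average tip for each bin of `varying`, with all features in `fixed` held
  // constant, plus the percentage change w.r.t. the first bin of `varying`.
  def tipChangeByBin(trips: DataFrame, varying: String, fixed: Seq[String]): DataFrame = {
    val avgTips = trips
      .groupBy((fixed :+ varying).map(col): _*)
      .agg(avg("tip_amount").as("avg_tip"))

    // Within each fixed-feature combination, order the bins and take the
    // first one as the reference value.
    val byFixed = Window.partitionBy(fixed.map(col): _*).orderBy(col(varying))
    val firstBinTip = first("avg_tip").over(byFixed)

    avgTips.withColumn(
      "tip_pct_change",
      round((col("avg_tip") - firstBinTip) * 100.0 / firstBinTip, 2)
    )
  }
}
```

For instance, tipChangeByBin(trips, "distance_bin", Seq("passenger_count", "payment_type")) (with a hypothetical distance_bin column) would show how the average tip moves across distance bins for each fixed (passenger_count, payment_type) pair.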
To run the jobs locally, create a run configuration with the following settings:
- Configuration type: Spark Submit - Local (deprecated)
- Name: Spark Local
- Spark home: your Spark directory
- Application: point to the .jar file inside the build/libs folder of this repository; if it is missing, build the project (./gradlew build)
- Class: jobs.FirstJob / jobs.SecondJob
- Run arguments: local
- Cluster manager: Local
- Master: local
It is also possible to add a Before Launch step that triggers the Gradle build task, forcing a build before every run.
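
The entry points already exist in the repository; the following is only a minimal sketch of how the local run argument could be honored, assuming the jobs use it to pick the Spark master:

```scala
package jobs

import org.apache.spark.sql.SparkSession

object FirstJob {
  def main(args: Array[String]): Unit = {
    val builder = SparkSession.builder().appName("FirstJob")
    // "local" runs everything on this machine; any other value (e.g. "remote")
    // leaves the master to spark-submit / the EMR cluster configuration.
    val spark = args.headOption match {
      case Some("local") => builder.master("local[*]").getOrCreate()
      case _             => builder.getOrCreate()
    }
    try {
      // ... job logic ...
    } finally spark.stop()
  }
}
```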
Create a cluster on AWS EMR via the CLI.
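One possible invocation is sketched below; the cluster name, release label, instance type/count and key pair name are placeholders, and the default EMR roles are assumed to exist:

```
aws emr create-cluster \
  --name "bigdata-cluster" \
  --release-label emr-7.1.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=your-key-pair
```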
Make sure that SSH connections are enabled in the Security Group of the master node.
Under src/main/resources, create a file called "aws_credentials.txt"; put the value of your aws_access_key on the first line and the value of your aws_secret_access_key on the second line.
Open the src/main/scala/utils/Config file and update the variables' values according to your settings
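
A minimal sketch of how such a Config object might load the credentials file; the real utils/Config fields may differ, and everything below besides the two-line file format is an assumption:

```scala
package utils

import scala.io.Source
import scala.util.Using

object Config {
  // Placeholders: update these according to your own AWS setup.
  val bucketName: String = "your-s3-bucket"
  val region: String = "us-east-1"

  // aws_access_key on line 1, aws_secret_access_key on line 2.
  lazy val (accessKey, secretKey) =
    Using.resource(Source.fromResource("aws_credentials.txt")) { src =>
      val lines = src.getLines().toList
      (lines(0).trim, lines(1).trim)
    }
}
```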
To run the jobs on the EMR cluster, create a run configuration with the following settings:
- Configuration type: Spark Submit - Cluster
- Name: Spark Cluster
- Region: us-east-1
- Remote Target: Add EMR connection
  - Authentication type: Profile from credentials file
  - Profile name: your profile name
  - Click on "Test connection" to verify: if you cannot connect or there are no deployed clusters, the connection will not be saved
- Enter a new SSH Configuration:
  - Host: the address of the primary node of the cluster, i.e., the MasterPublicDnsName
  - Username: hadoop
  - Authentication type: Key pair
  - Private key file: point to your .ppk / .pem key
  - Test the connection
- Application: point to the .jar file inside the build/libs folder of this repository; if it is missing, build the project (./gradlew build)
- Class: jobs.FirstJob / jobs.SecondJob
- Run arguments: remote
- Before launch: Upload Files Through SFTP
It is also possible to add a Before Launch step that triggers the Gradle build task, forcing a build before every run.