This repository contains the final team project of the Northcoders Data Engineering Bootcamp, showcasing a full-stack ETL (Extract, Transform, Load) pipeline designed for real-world data engineering practice.
- Data ingestion from PostgreSQL into AWS S3 data lakes.
- Transformation into star-schema format using pandas and awswrangler (see the sketch after this list).
- Deployment managed with Infrastructure-as-Code (Terraform).
- Automated testing and deployment with CI/CD pipelines via GitHub Actions and Makefile.
- Monitoring, logging, and alerts integrated via AWS CloudWatch and SNS.
- Business dashboards and insights delivered through Tableau.
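To illustrate the transform step, here is a minimal, hedged sketch using pandas and awswrangler; the table and column names are illustrative assumptions rather than the repo's actual schema, and the bucket path is a placeholder:

```python
import awswrangler as wr
import pandas as pd


def transform_staff_to_dim(staff: pd.DataFrame, departments: pd.DataFrame) -> pd.DataFrame:
    """Join raw staff and department extracts into a dim_staff-style dimension table.

    Column names here are hypothetical; the real pipeline's star schema may differ.
    """
    dim_staff = staff.merge(departments, on="department_id", how="left")
    return dim_staff[["staff_id", "first_name", "last_name", "department_name", "email_address"]]


def write_to_processed_bucket(df: pd.DataFrame, table_name: str, bucket: str) -> None:
    """Write a transformed table to the processed S3 bucket as Parquet."""
    wr.s3.to_parquet(df=df, path=f"s3://{bucket}/{table_name}/", dataset=True)
```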
- awswrangler 3.12.0
- boto3 1.38.24
- pandas 2.3.0
- pg8000 1.31.2
- pytest 8.3.5
- urllib3 2.4.0
To install this project, run:

```sh
git clone https://github.com/sapkotahari/de-project-funland
cd de-project-funland
```

Create a virtual environment:

```sh
python -m venv venv
```

Activate your venv:

```sh
source venv/bin/activate
```

Install packages:
Required packages are listed in our requirements.txt and can be installed using our Makefile:

```sh
make requirements
```

Firstly, activate your virtual environment:
```sh
source venv/bin/activate
```

To use AWS services and infrastructure, sign up for an AWS account and create an IAM user. Once this is done, extract your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, then export them:

```sh
export AWS_ACCESS_KEY_ID=<your key id>
export AWS_SECRET_ACCESS_KEY=<your secret access key>
export AWS_DEFAULT_REGION=<your default region>
```

An AWS parameter needs to be put into Parameter Store with the parameter name "last_checked". This parameter is a date in the format "YYYY-MM-DD HH:MM:SS:ffffff". The date should be some date before 2019, to ensure that all the data gets extracted from the database initially.
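The parameter can be created in the AWS console or CLI; as a hedged alternative, the same thing can be done from Python with boto3 (the value below is just an example pre-2019 timestamp):

```python
import boto3

ssm = boto3.client("ssm")

# Seed "last_checked" with a pre-2019 timestamp so the first run extracts all rows.
ssm.put_parameter(
    Name="last_checked",
    Value="2000-01-01 00:00:00:000000",
    Type="String",
    Overwrite=True,
)

# Subsequent runs read the value back to extract only rows updated since then.
last_checked = ssm.get_parameter(Name="last_checked")["Parameter"]["Value"]
```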
In AWS Secrets Manager, you should set up a secret with the name "db_creds" containing the following key-value pairs, e.g.:

```json
{
    "DB_USER": "<your database username>",
    "DB_PASSWORD": "<your database password>",
    "DB_HOST": "<your database host>",
    "DB_NAME": "<your database name>",
    "DB_PORT": "<your database port>"
}
```
Now your AWS account is linked to your local terminal and you are ready to navigate to the terraform directory:
```sh
cd terraform
```

In this directory, an initialisation is needed to download the required HashiCorp providers and to set up the remote location of the Terraform state file. To accomplish this, we run:
```sh
terraform init
```

Once this has finished, we are ready to see a plan of the infrastructure and its availability:
```sh
terraform plan
```

Be sure that all the information looks correct, and then we are ready to deploy! Run:
```sh
terraform apply
```

All the infrastructure should be created (ingestion and processed buckets, ETL Lambdas and a Step Function to orchestrate them, alongside the necessary IAM roles, CloudWatch logs and notification systems):
```
Apply complete! Resources: 1 added, 0 changed, 1 destroyed.

Outputs:

notification_email = "<email to receive error notifications>"
```
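If you want to check the deployment from Python rather than the console, a minimal boto3 sketch (the state machine name depends on what the Terraform config creates) might look like:

```python
import boto3

sfn = boto3.client("stepfunctions")

# List the deployed state machines and their ARNs.
for machine in sfn.list_state_machines()["stateMachines"]:
    print(machine["name"], machine["stateMachineArn"])

# A run could then be triggered manually with:
# sfn.start_execution(stateMachineArn="<your state machine ARN>")
```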
To see the infrastructure, we can use the AWS CLI to view our buckets:

```sh
aws s3 ls
```

Example output:

```
2025-05-28 10:24:59 <ingestion-bucket-name>
2025-05-28 10:24:59 <processed-bucket-name>
```

Checking the AWS console, we can also see our state machine.
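Once the pipeline has run, the ingested files can also be inspected from Python; a hedged sketch with boto3, assuming the placeholder bucket name is replaced with yours:

```python
import boto3

s3 = boto3.client("s3")

# List up to the first 1,000 keys written to the ingestion bucket.
response = s3.list_objects_v2(Bucket="<ingestion-bucket-name>")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```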
Set up a .env file with the following values:

```sh
totesys_user=<your database username>
totesys_password=<your database password>
totesys_database=<your database name>
totesys_host=<your database host>
totesys_port=<your database port>
```

Add the given PYTHONPATH to your environment variables:
```sh
export PYTHONPATH=$(pwd)
```

To run tests, run the following command:
```sh
make unit-test
```

To run all checks (tests, linting, security and coverage), run the following command:
```sh
make run-checks
```
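For reference, a hedged sketch of how local test code might build a database connection from these values (assuming the .env variables have been exported into the shell; the repo's actual fixtures may differ):

```python
import os

import pg8000.native


def get_local_connection() -> pg8000.native.Connection:
    """Connect to the totesys database using the values from .env."""
    return pg8000.native.Connection(
        user=os.environ["totesys_user"],
        password=os.environ["totesys_password"],
        database=os.environ["totesys_database"],
        host=os.environ["totesys_host"],
        port=int(os.environ["totesys_port"]),
    )
```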
- A graphic representation showing the countries where the products are sold. The size of each dot corresponds to total sales in that country.
- A graph showing sales for each country for the years 2023 and 2024.
- A graph showing sales for each city for the years 2023 and 2024.
- A graph showing total sales by month (see the sketch below).
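As a rough illustration of the kind of aggregation behind the monthly sales chart (table and column names here are purely hypothetical, not the repo's actual schema):

```python
import pandas as pd


def total_sales_by_month(fact_sales: pd.DataFrame, dim_date: pd.DataFrame) -> pd.DataFrame:
    """Join a sales fact table to a date dimension and sum sales per month."""
    sales = fact_sales.merge(dim_date, on="date_id", how="left")
    sales["total_price"] = sales["units_sold"] * sales["unit_price"]
    return sales.groupby(["year", "month"], as_index=False)["total_price"].sum()
```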
We would like to acknowledge Northcoders for providing the Data Engineering Bootcamp, which was instrumental in building the foundations for this project.
We also used the following resources and tools throughout the project:
- Pandas - For data sanitising.
- Boto3 - The Amazon Web Services (AWS) SDK for Python, used extensively for interacting with AWS services.
- Terraform - Comprehensive and clear documentation that helped in managing infrastructure as code.
- AWS Wrangler - A Python library that made working with AWS data services much easier.

