Rossman Kaggle Mini-Competition

The following project was completed as part of the following Kaggle Competion

The goal was to develop a time-series model to predict the daily sales for 1,115 Rossman stores located across Germany. Rossman is a major drug store that operates in Europe.

Teamname: " #1 "

The python version of our environment is 3.9.7

Follow the instructions below to get started

Conda environment

To set up the conda environment and download the requirements enter these commands:

conda create --name minicomp python=3.9.7

conda config --append channels conda-forge

conda activate minicomp

pip install -r requirements.txt

Python Notebook

After installing sucessfully all the requirments,open the notebook Compute_RMSPE.ipynb and go through the different sections.

1 - Data Preparation

By running this cell, an input_file is requested from the user, which will be used as the holdout or test file.

In this section we also go over the necessary modifications to prepare the data for our model.

And so this is the transform part of our data pipeline.

The transformations were made based on our data exploration as we went through feature selection, fixing missing data, removing problematic features with many nulls, and other statistical analysis such as target correlation.

We also use categorical encoders to turn non numerical features into numbers. In general we used one_hot_encoder for most features, however we found that using a target_encoder for store considerably improves the accuracy of our model.

2 - XGB Model

The model chosen is XGBRegressor using the XGBoost module which is a scalable distributed gradient boosted decision tree model.

The hyperparameters were chosen based on the optimal configuration found during our hyperparameters tunning experiments

Run this cell to train the model using the fit function, and get the RMSPE (Root mean square percentage error) for the input_file

3 - Formatting Results

Once the predictions are made we just need to do some formatting to fit the sample_submission.csv structure and write out the results.

4 - Error Analysis

To improve the performance of our model it is crucial to visualize the results and do some error analysis.

Some data visualization techniques will be needed to isolate, observe and manage erroneous predictions.

Data

All the data can be found in the corresponding data folder

store.csv the data for the 1115 Stores,
train.csv the training dataset
test.csv is the Kaggle competition test dataset
holdout_b29.csv is the DSR mini-competion's test dataset

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Results		Results
data		data
Compute_RMSPE.ipynb		Compute_RMSPE.ipynb
Main.ipynb		Main.ipynb
README.md		README.md
requirements.txt		requirements.txt
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Rossman Kaggle Mini-Competition

Conda environment

Python Notebook

1 - Data Preparation

2 - XGB Model

3 - Formatting Results

4 - Error Analysis

Data

About

Uh oh!

Releases

Packages

Uh oh!

Languages

khuss/RossmanPred

Folders and files

Latest commit

History

Repository files navigation

Rossman Kaggle Mini-Competition

Conda environment

Python Notebook

1 - Data Preparation

2 - XGB Model

3 - Formatting Results

4 - Error Analysis

Data

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages