Skip to content

This repository contains coursework for the Statistical Computing & Empirical Methods unit, which blends foundational statistical techniques, empirical data exploration, and predictive modeling using R.

License

Notifications You must be signed in to change notification settings

thrisharajkumar/Statistical-Computing-Empirical-Methods-Using-R-Programming

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Statistical Computing & Empirical Methods Using R Programming

Course: Statistical Computing & Empirical Methods
Timeline: October 2024 – December 2024
Institution: University of Bristol
Author: Thrisha Rajkumar

This repository contains coursework for the Statistical Computing & Empirical Methods unit, blending foundational statistical techniques, empirical data exploration, and predictive modeling using R.


Project Highlights

1. Supervised Learning & Predictive Modelling

  • Developed Decision Tree and Random Forest classifiers for predicting loan approvals.
  • Addressed missing data, encoded categorical variables, and engineered features.
  • Tuned models via grid search with 10-fold cross-validation.
  • Achieved:
    • AUC = 0.7949
    • F1 Score = 0.70

Loan Prediction Metrics


2. Data Wrangling & Exploratory Data Analysis (EDA)

  • Cleaned multi-source financial data (13,824 rows, 63 indicators).
  • Performed joins, renaming, and grouping by income/region.
  • Created insightful plots to explore patterns:

Income Distribution
Applicant Income Histogram

Credit History vs Loan Status
Loan Status by Credit History

Regional Financial Flow (Violin Plot)
Violin Plot of Financial Flow by Region


3. Statistical Simulation & Computing

  • Estimated λ for exponential distribution using Maximum Likelihood Estimation (MLE).
  • Bootstrapped 1,000 samples to construct confidence intervals: CI = [0.0191, 0.0207].
  • Simulated 100,000 trials of discrete variables.
  • Demonstrated Central Limit Theorem (CLT) using 50,000 sample means.

CLT Density Plot
CLT Distribution

Exponential λ Visualization
Exponential Distribution


📊 Dataset Sources

1. Loan Approval Data

  • Kaggle – Finance Loan Approval Prediction
  • Includes demographic and financial attributes for binary classification.
  • Note: Data files (train.csv, test.csv) are not included due to licensing. Please download them directly from Kaggle.

2. Simulated Exponential Data

  • Generated internally using rexp() in R for bootstrapping & CLT demonstrations.

📁 Folder Structure

Statistical-Computing-R/
│
├── notebooks/                 # R Markdown reports
│   ├── SectionA_Code_Output.Rmd
│   ├── SectionB_Analysis.Rmd
│   └── SectionC_Methodology.Rmd
│
├── reports/                   # Final PDF submissions
│   ├── Section-A.pdf
│   ├── Section-B.pdf
│   └── Section-C.pdf
│
├── images/                    # Visuals used in README
│   ├── applicant_income_histogram.png
│   ├── loan_status_by_credit.png
│   ├── loan_prediction_metrics.png
│   ├── clt_distribution.png
│   ├── exponential_distribution_lambda.png
│   └── eda_violinplot_timeseries.png
│
├── setup.txt                  # Installation and usage instructions
└── README.md                  # This file

🛠 How to Run

Step 1 – Install Required Packages

install.packages(c("tidyverse", "rpart", "randomForest", "caret", "boot",
                   "ggplot2", "e1071", "reshape2", "knitr"))

Step 2 – Run Models

# Load dataset
train <- read.csv("train.csv")

# Decision Tree
library(rpart)
tree_model <- rpart(Loan_Status ~ ., data = train, method = "class")

# Random Forest
library(randomForest)
rf_model <- randomForest(Loan_Status ~ ., data = train, ntree = 100)

Step 3 – Bootstrap & CLT

library(boot)
lambda_hat <- function(x, i) 1 / mean(x[i])
boot_data <- rexp(1000, rate = 0.02)
boot_lambda <- boot(data = boot_data, statistic = lambda_hat, R = 1000)
boot.ci(boot_lambda, type = "perc")

📑 Documentation

Section Description
A Code and outputs for EDA and wrangling
B Descriptive analysis and distribution modeling
C Simulation-based inference, bootstrapping, MLE

All reports are in the /reports/ folder.


🧠 Skills Demonstrated

  • R Programming (Tidyverse, ggplot2, caret, rpart, randomForest)
  • Data Wrangling & Joining
  • Predictive Modelling & Evaluation
  • Statistical Estimation (MLE, Bootstrapping)
  • Central Limit Theorem Demonstration
  • Reproducible Analysis with R Markdown

📜 License

This repository is for academic and educational purposes under the University of Bristol.
Reuse is permitted with citation of author and coursework context.

About

This repository contains coursework for the Statistical Computing & Empirical Methods unit, which blends foundational statistical techniques, empirical data exploration, and predictive modeling using R.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published