The purpose of this repository is to provide a quick start for a machine learning take-home problem. These problems are usually framed like the following:
- Here is some data and instructions
- Build a model for us using a Jupyter notebook
- Send us the notebook and/or some slides and we'll talk about it
Years ago, when we gave this problem to a candidate, he went further and actually deployed a little model inside a web application. We all really appreciated his desire to take a vision end to end, and this repo is meant to give a similar flavor.
With this repository you can be sure that your work can be reproduced by a hiring committee, because it's dockerized! In my experience it pays dividends to actually walk through a notebook instead of just looking at the output.
This template includes an example dataset that I created by iteratively querying the New Jersey State Health Assessment Data; these data are also available as an extract on Kaggle. Full caveat: there is virtually no signal in these data for predicting premature birth outcomes; the dataset is merely meant to illustrate the problem.
This repository exposes four components that are useful in a data science proof of concept.
- A container running Jupyter notebooks with common machine learning libraries (available @ localhost:8888). Any notebooks will persist in a mounted volume (./volumes/notebooks)
- A container running Postgres in the event a relational database is useful (available @ localhost:5432). Any transformations will persist between containers in a mounted volume (./volumes/postgres)
- A container running FastAPI to serve predictions from a scikit-learn model (available @ localhost:8080)
- A container running Streamlit that lets a user request predictions from the scikit-learn model based on their inputs (available @ localhost:8501)
turn on the application
docker-compose up
turn off the application
docker-compose down
rebuild the application
docker-compose up --build
|-- containers # code
| |-- python # interactive jupyter notebooks
| |-- fastapi # deploy pickled model as a REST API
| |-- streamlit # access REST API in a user interface
|-- volumes # persistent data
| |-- notebooks # jupyter notebooks persisted here
| |-- postgres # database files persisted here, not in version control
| |-- static # static files that are loaded into postgres or jupyter
There are several secrets pertaining to the Postgres database stored in the .env file at the root of the repository.
PGHOST=postgres
PGUSER=local
PGPASSWORD=password
PGPORT=5432
PGDATABASE=postgres
You can connect to PostgreSQL on localhost:5432 as user 'local' with password 'password' using any SQL client.
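For example, with the psql client installed on your host:
psql -h localhost -p 5432 -U local -d postgres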
Inside the dockerized jupyter notebook, you can connect to PostgreSQL with the following URI
from sqlalchemy import create_engine
engine = create_engine('postgresql://local:password@postgres:5432/postgres')
connection = engine.connect()
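If you would rather not hard-code credentials, you can build the same URI from the PG* variables; this is a minimal sketch assuming docker-compose passes the .env values through to the notebook container, and 'births' is a hypothetical table name:
import os
import pandas as pd
from sqlalchemy import create_engine

# assemble the URI from the PG* variables defined in .env
uri = (
    f"postgresql://{os.environ['PGUSER']}:{os.environ['PGPASSWORD']}"
    f"@{os.environ['PGHOST']}:{os.environ['PGPORT']}/{os.environ['PGDATABASE']}"
)
engine = create_engine(uri)

# 'births' is a hypothetical table name for the example dataset
df = pd.read_sql('SELECT * FROM births LIMIT 5', engine)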
This template uses conda environments in each container. Simply modify the environment.yml to add anything you like.
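For example, to make a library like xgboost available in the notebooks, you might extend the dependencies list; the surrounding entries here are illustrative, and your environment.yml will differ:
name: notebook
dependencies:
  - python=3.10
  - pandas
  - scikit-learn
  - xgboost  # newly added dependency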
The model is available as a REST API endpoint on port 8080. It accepts JSON data that look like one row of the dataframe the model was trained on.
curl --request POST http://127.0.0.1:8080/predict \
-H 'Content-Type: application/json' \
-d '{"age_group": "Under 15 yrs","reported_race_ethnicity": "White, non-Hispanic", "previous_births": "None","tobacco_use_during_pregnancy": "Yes","adequate_prenatal_care": "Inadequate"}'
A small web application takes the features that drive your model, then returns a prediction from the REST API.
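A minimal sketch of such a Streamlit app, assuming the FastAPI container is reachable at the docker-compose service name 'fastapi' and using the same field names as the curl example (the input options shown are illustrative):
import requests
import streamlit as st

st.title('Predict premature birth outcomes')

# options shown here are illustrative, not the full set from the dataset
payload = {
    'age_group': st.selectbox('Age group', ['Under 15 yrs', '15-19 yrs']),
    'reported_race_ethnicity': st.text_input('Race/ethnicity', 'White, non-Hispanic'),
    'previous_births': st.text_input('Previous births', 'None'),
    'tobacco_use_during_pregnancy': st.selectbox('Tobacco use during pregnancy', ['Yes', 'No']),
    'adequate_prenatal_care': st.selectbox('Prenatal care', ['Adequate', 'Inadequate']),
}

if st.button('Predict'):
    # inside the compose network the API is reached by service name,
    # assumed here to be 'fastapi', rather than localhost
    response = requests.post('http://fastapi:8080/predict', json=payload, timeout=10)
    st.json(response.json())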
There is literally zero security. Keep this on localhost.
- The postgres database uses a default, publicly known password.
- The REST API calls are not encrypted.
- The jupyter notebook runs as root in a container.
- The user interface is exposed without encryption or a password.
