A complete system for harvesting, normalizing, and storing U.S. Census API metadata & geographies into a relational SQLite database. The goal is to build a SQL database that makes query building faster.
This project automates the discovery, download, normalization, and storage of Census API datasets, variables, and geographic hierarchies into a structured SQLite database.
It:

- Pulls dataset metadata from the U.S. Census API
- Builds lookup tables for:
  - Regions
  - Divisions
  - States
  - Counties
  - Census Tracts
- Inserts API-linked metadata:
  - Datasets
  - Years
  - Variables
  - Geography levels
- Manages intermediate relationship tables mapping datasets ↔ variables ↔ geographies ↔ years
- Cleans missing or malformed Census API responses
- Stores everything in a relational schema so query building becomes straightforward
The core of the system is the `data_inserts` class, which extends a base `data_pull` class and orchestrates the entire data ingestion pipeline.
The class automates nearly every required step:
On initialization:

- Ensures the missing geography record exists
- Ensures the missing variable record exists
- Loads dataset + metadata URLs via `pull_urls()`
| Method | Description |
|---|---|
| `insert_states()` | Loads TIGER/Line state shapefile → SQLite table `states` |
| `insert_county()` | Loads county shapefile → `county` |
| `insert_track()` | Loops every state tract file → `track` |
| `insert_regions()` | Inserts static Census regions |
| `insert_divisions()` | Inserts static Census divisions |
| `insert_geo_full()` | Parses API geography responses and populates: • `geo_table` (geographic levels) • `geo_interm` (dataset-year-geo relationships) |
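As a point of reference, here is a minimal sketch of what loading one TIGER/Line shapefile into SQLite can look like. This is not the project's implementation; the URL, vintage, and column selection are illustrative assumptions.

```python
# Minimal sketch, not the project's implementation: load the TIGER/Line state
# shapefile and write it to a SQLite "states" table. URL, vintage, and column
# selection are illustrative assumptions.
import sqlite3
import geopandas as gpd

url = "https://www2.census.gov/geo/tiger/TIGER2023/STATE/tl_2023_us_state.zip"
states = gpd.read_file(url)

# Store geometry as WKT text so a plain SQLite table can hold it
states["geometry_wkt"] = states.geometry.to_wkt()
subset = states[["STATEFP", "STUSPS", "NAME", "REGION", "DIVISION", "geometry_wkt"]]

with sqlite3.connect("database.db") as conn:
    subset.to_sql("states", conn, if_exists="replace", index=False)
```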
The geo insertion logic forms the core feature of this project: discovering all possible API geographies and normalizing them.
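To illustrate that normalization, the sketch below parses a dataset's geography.json response into flat rows. The `fips`, `name`, `geoLevelDisplay`, `requires`, and `wildcard` keys come from the public Census API; the row shape is an assumption, not the project's actual code.

```python
# Minimal sketch, assuming the public Census API geography.json layout.
# The resulting row shape is illustrative, not the project's schema.
import requests

# Hypothetical example dataset: 2022 ACS 5-year
url = "https://api.census.gov/data/2022/acs/acs5/geography.json"
resp = requests.get(url, timeout=30)
resp.raise_for_status()

rows = []
for level in resp.json().get("fips", []):
    rows.append(
        {
            "geo_level": level.get("geoLevelDisplay"),        # e.g. "050" for county
            "geo_name": level.get("name"),                    # e.g. "county"
            "requires": ",".join(level.get("requires", [])),  # required parent geographies
            "wildcard": ",".join(level.get("wildcard", [])),  # parents that accept "*"
        }
    )

print(rows[:3])
```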
| Method | Description |
|---|---|
| `insert_datasets()` | Extracts dataset names + API paths from metadata |
| `insert_years()` | Extracts years (`c_vintage`) and inserts into `year_table` |
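For context, dataset names, API paths, and vintages all come from the public catalog at https://api.census.gov/data.json. Below is a minimal sketch of extracting them; the catalog keys are those of the public API, while the output row shape is an assumption.

```python
# Minimal sketch, assuming the public Census data.json catalog layout.
import requests

catalog = requests.get("https://api.census.gov/data.json", timeout=60).json()

records = []
for ds in catalog.get("dataset", []):
    access_url = next(
        (d.get("accessURL") for d in ds.get("distribution", []) if d.get("accessURL")),
        None,
    )
    records.append(
        {
            "name": "/".join(ds.get("c_dataset", [])),  # e.g. "acs/acs5"
            "api_path": access_url,                     # base API URL for the dataset
            "year": ds.get("c_vintage"),                # absent for timeseries datasets
        }
    )

print(records[:3])
```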
| Method | Description |
|---|---|
| `insert_var_full()` | Retrieves each dataset’s variable list, cleans it, and populates: • `variable_table` • `variable_interm` (dataset-year-variable relationships) |
Handles:
- Wildcards
- Missing labels
- Conversion of raw JSON into structured tables
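Here is a minimal sketch of that cleaning step, assuming the public variables.json layout (a top-level `variables` mapping of variable name → metadata). The missing-label fallback and the output row shape are illustrative assumptions.

```python
# Minimal sketch, assuming the public Census variables.json layout.
# Missing-label handling and the output row shape are illustrative assumptions.
import requests

# Hypothetical example dataset: 2022 ACS 5-year
url = "https://api.census.gov/data/2022/acs/acs5/variables.json"
variables = requests.get(url, timeout=60).json().get("variables", {})

rows = []
for var_name, meta in variables.items():
    if var_name in {"for", "in", "ucgid"}:  # geography clauses, not real variables
        continue
    rows.append(
        {
            "var_name": var_name,
            "label": meta.get("label") or "missing",  # fall back when the label is absent
            "concept": meta.get("concept"),
        }
    )

print(len(rows), rows[:2])
```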
The class provides helper operations that support the ingestion workflow:
- `get_year_id(year)`
- `get_dataset_id(dataset)`
- `get_geo_id(geo_lv)`
- `get_geo_desc(geo_name)`
- `get_var_id(var_name)`
- Relationship-table checkers:
  - `check_geo_interm_id(...)`
  - `check_variable_interm_id(...)`
These ensure:
- Referential integrity
- No duplicate entries
- Late discovery of unknown variables or geographies is handled smoothly
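Here is a sketch of how these helpers might be combined before writing a relationship row. Only the helper names and parameters listed above come from this README; the argument values, return-value handling, and the final insert step are assumptions.

```python
# Minimal sketch; argument values, returned ids, and the insert step are
# assumptions, not the project's documented behavior.
from src.inserts import data_inserts

runner = data_inserts()

year_id = runner.get_year_id(2022)              # hypothetical year argument
dataset_id = runner.get_dataset_id("acs/acs5")  # hypothetical dataset name
geo_id = runner.get_geo_id("county")            # hypothetical geography level

# Hypothetical argument order: only insert the relationship if it is new
if not runner.check_geo_interm_id(dataset_id, year_id, geo_id):
    ...  # insert into geo_interm (handled internally by insert_geo_full)
```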
Install the dependencies with pip:

```bash
pip install polars requests duckdb geopandas alembic
```

or using uv:

```bash
uv sync
```

Additionally, you'll need to run the migration to prepare the database for inserting the data:

```bash
alembic upgrade head
```

Then run the full ingestion pipeline:

```python
from src.inserts import data_inserts

runner = data_inserts(
saving_dir="data/",
db_file="sqlite:///database.db",
log_file="data_process.log",
)
runner.insert_regions()
runner.insert_divisions()
runner.insert_states()
runner.insert_county()
runner.insert_track()
runner.insert_datasets()
runner.insert_years()
runner.insert_geo_full()
runner.insert_var_full()
```

The `data_inserts` class also exposes `self.conn`, which can be used to query the database directly. Here is an example:

```python
from src.inserts import data_inserts

di = data_inserts()
di.conn.execute(
"""
SELECT * FROM sqlite_db.geo_table;
"""
).df()
```

This will return a `pd.DataFrame`.
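Because the metadata lives in relationship tables, joins across them stay simple. Here is a sketch of such a query, continuing from the example above; the table names appear earlier in this README, but the id and column names used in the join are assumptions about the schema created by the Alembic migrations.

```python
# Minimal sketch: join the variable relationship table back to its lookups.
# Table names come from this README; the id/column names are assumptions.
df = di.conn.execute(
    """
    SELECT y.year, v.var_name, v.label
    FROM sqlite_db.variable_interm AS vi
    JOIN sqlite_db.variable_table  AS v ON v.id = vi.variable_id
    JOIN sqlite_db.year_table      AS y ON y.id = vi.year_id
    """
).df()
print(df.head())
```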