
Conversation

@dannyim (Contributor) commented Dec 3, 2025

I tried running index_setsm.py with --check, but it was very slow: using GDAL 3.9.3 and Python 3.12.11 on our compute cluster, it took nearly 7 hours against a ~32 million record Postgres table to check just 16 records. It looks like most of that time was spent on the client fetching all rows of the table into memory.

This PR improves execution time by performing the check within Postgres instead; I'm seeing it complete in about 12 minutes against the same dataset. I'm not sure this is the best way to perform the check on the database side, though.
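
For illustration only, the server-side idea can reduce to a single set-difference query, so only mismatched IDs travel back to the client. This is a minimal sketch, not the PR's actual code, and `check_tmp`, `dest`, and `record_id` are placeholder names (note the follow-up comment below revises this approach):

```python
# Sketch only: assumes the freshly indexed records have already been loaded
# into a temp table "check_tmp", and that "dest" is the destination table.
# "record_id" stands in for whatever the real identifier column is.
import psycopg2

def ids_missing_from_dest(conn):
    """Return record IDs present in check_tmp but absent from dest."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT record_id FROM check_tmp "
            "EXCEPT "
            "SELECT record_id FROM dest"
        )
        return [row[0] for row in cur.fetchall()]
```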

@clairecporter (Member)

Cool! Do you or @klassenjs use the --check functionality? I know we did a lot of work to actually catch errors instead of having to check after the fact.

@dannyim (Contributor, Author) commented Dec 8, 2025

@clairecporter I just started using --check recently as a sanity check and to potentially catch errors earlier in our ingest pipeline.

I also realized that my current approach in this PR is flawed: since I'm using the same GDAL path to build the temp table, it's really just checking how consistent GDAL is, rather than whether GDAL wrote to Postgres correctly. I think the way to go is to do something like `select * from dest where <identifier> in (id_0, id_1, ..., id_n)` and compare the result against layer_recordids. I'm guessing that will still be much faster than fetching all records, at least in my use case, where I'm indexing in small batches of dozens of records at a time. Thoughts?
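
A minimal sketch of that query-by-ID check, assuming a psycopg2 connection; `record_id` is a stand-in for the real identifier column, while `dest` and `layer_recordids` match the names used in the comment above:

```python
# Sketch of the suggested check: query dest for only the IDs just written,
# then diff against layer_recordids on the client. Only the small batch of
# candidate IDs crosses the wire in each direction.
import psycopg2

def missing_record_ids(conn, layer_recordids):
    """Return the IDs from layer_recordids that are absent from dest."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT record_id FROM dest WHERE record_id = ANY(%s)",
            (list(layer_recordids),),
        )
        found = {row[0] for row in cur.fetchall()}
    return sorted(set(layer_recordids) - found)
```

With batches of only dozens of IDs, Postgres should be able to answer this from an index on the identifier column, so it should stay fast even as the table grows.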

@clairecporter
Copy link
Member

Feel free to strip out the --check logic if you can be reasonably sure the errors and warnings are properly caught.
