
Conversation

@dannyim (Contributor) commented Dec 3, 2025

I tried running index_setsm.py with --check, but it was very slow: using GDAL 3.9.3 and Python 3.12.11 on our compute cluster, it took nearly 7 hours against a ~32 million record Postgres table to check just 16 records. It looks like most of that time was spent on the client fetching all rows of the table into memory.

This PR improves execution time by performing the check within Postgres instead; I'm seeing it complete in about 12 minutes against the same dataset. I'm not sure this is the best way to perform the check on the database side, though.
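
For illustration only, the server-side idea can reduce to a single set-difference query, so only mismatched IDs travel back to the client. This is a minimal sketch, not the PR's actual code, and `check_tmp`, `dest`, and `record_id` are placeholder names (note the follow-up comment below revises this approach):

```python
# Sketch only: assumes the freshly indexed records have already been loaded
# into a temp table "check_tmp", and that "dest" is the destination table.
# "record_id" stands in for whatever the real identifier column is.
import psycopg2

def ids_missing_from_dest(conn):
    """Return record IDs present in check_tmp but absent from dest."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT record_id FROM check_tmp "
            "EXCEPT "
            "SELECT record_id FROM dest"
        )
        return [row[0] for row in cur.fetchall()]
```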

@clairecporter (Member)

Cool! Do you or @klassenjs use the --check functionality? I know we did a lot of work to actually catch errors instead of having to check after the fact.

@dannyim (Contributor, Author) commented Dec 8, 2025

@clairecporter I just started using --check recently as a sanity check and to potentially catch errors earlier in our ingest pipeline.

I also realized that my current approach in this PR is flawed: since I'm using the same GDAL path to build the temp table, it's really just checking how consistent GDAL is, rather than whether GDAL wrote to Postgres correctly. I think the way to go is to do something like `select * from dest where <identifier> in (id_0, id_1, ..., id_n)` and compare the result against layer_recordids. I'm guessing that will still be much faster than fetching all records, at least in my use case, where I'm indexing in small batches of dozens of records at a time. Thoughts?
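
A minimal sketch of that query-by-ID check, assuming a psycopg2 connection; `record_id` is a stand-in for the real identifier column, while `dest` and `layer_recordids` match the names used in the comment above:

```python
# Sketch of the suggested check: query dest for only the IDs just written,
# then diff against layer_recordids on the client. Only the small batch of
# candidate IDs crosses the wire in each direction.
import psycopg2

def missing_record_ids(conn, layer_recordids):
    """Return the IDs from layer_recordids that are absent from dest."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT record_id FROM dest WHERE record_id = ANY(%s)",
            (list(layer_recordids),),
        )
        found = {row[0] for row in cur.fetchall()}
    return sorted(set(layer_recordids) - found)
```

With batches of only dozens of IDs, Postgres should be able to answer this from an index on the identifier column, so it should stay fast even as the table grows.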

@clairecporter
Copy link
Member

Feel free to strip out the --check logic if you can be reasonably sure the errors and warnings are properly caught.
