For now, when the WAL folder grows too large, use the following fix route:
- 1) Increase the size of the out-of-space PVC(s), using Lens. (instead of the default 10000Mi, set it to 10100Mi or something)
- 2) Kill the pods that are using those PVC(s), so they can finish resizing (kill the "...instance1..." and "repo-host-0" pods, OR do "Restart" on the two corresponding StatefulSets). Steps 1-3 are also sketched as kubectl commands after this list.
- Note: The resizing may fail to take place on the first attempt, for one or both PVCs; repeat the process until it works. (can confirm by checking `df` in the two pods mentioned above, or more easily, by running `describe` on the PVCs, like seen here, except done for the purpose of seeing the events rather than the pods using the PVC; can also just observe the events in the Lens UI)
- Note: In some cases, this step appears unnecessary. (it wasn't necessary one time when I increased the size of just the repo PVC)
- 3) Open a shell in the "instance1" and/or "repo-host-0" pods; check whether "df" still shows 100% on any of the filesystems. If they are all lower than 100% now, that's good.
- 4) IF the database was not corrupted by the space running out (the first time this issue happened, the db got corrupted to some extent, requiring a full scp -> ... -> pgdump import process -- but the second time it didn't), then the main database PVC should reduce in size a lot as the WAL segments get cleared out. (doesn't seem to happen for the repo PVC as well unfortunately; EDIT: see second comment in thread for apparent explanation)
- 5) Restart the app-server, to confirm that it works again. (it should, if step 4 succeeded)
- 6) You should now try to do a pgdump of the contents. See: https://github.com/debate-map/app#pg-dump
- Note: Atm the option-1 pgdump approach (nodejs script) is failing for the prod cluster, since the database has too much data for the HTTP request to complete before nginx times out the request. Need to fix this. For now, use option 4. (a generic in-pod pg_dump fallback is sketched below)
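A minimal kubectl sketch of steps 1-3, for when Lens isn't handy. The namespace, PVC, StatefulSet, and pod names below are placeholders (they will differ in the actual cluster); check the real names with `kubectl get pvc,pods -n <namespace>` first.

```bash
NS=postgres-operator  # placeholder namespace

# 1) Bump the PVC sizes slightly (e.g. 10000Mi -> 10100Mi).
kubectl patch pvc debate-map-instance1-abcd-pgdata -n $NS \
  -p '{"spec":{"resources":{"requests":{"storage":"10100Mi"}}}}'
kubectl patch pvc debate-map-repo1 -n $NS \
  -p '{"spec":{"resources":{"requests":{"storage":"10100Mi"}}}}'

# 2) Restart the pods using those PVCs so the filesystem resize can complete.
kubectl rollout restart statefulset debate-map-instance1-abcd -n $NS
kubectl rollout restart statefulset debate-map-repo-host -n $NS

# Confirm the resize actually happened (look at the Conditions/Events sections).
kubectl describe pvc debate-map-instance1-abcd-pgdata -n $NS
kubectl describe pvc debate-map-repo1 -n $NS

# 3) Check disk usage inside the pods; nothing should still be at 100%.
kubectl exec debate-map-instance1-abcd-0 -n $NS -- df -h
kubectl exec debate-map-repo-host-0 -n $NS -- df -h
```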
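For step 6, this doesn't restate the README's option 4; as a generic fallback with the same idea (dump from inside the cluster rather than over the nginx-proxied HTTP route), something like the following works. The pod, namespace, user, and database names are assumptions.

```bash
# Placeholders: pod name, namespace, user, and database name.
kubectl exec debate-map-instance1-abcd-0 -n postgres-operator -- \
  pg_dump -U postgres -d debate-map > debate-map-dump.sql
```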
Also see: #331 (comment)
Other misc. info from DM
Btw: The issue of the PVC getting to 100% space usage happened again. Thankfully this time it did not corrupt the database, so I was able to fix it by simply increasing the PVC's size, restarting the database pod, then restarting the app-server. After that, the 100% usage (from the pg_wal folder like before) went down to ~20%, presumably since the cause of the WAL sticking around was disconnected, letting the WAL segments get cleaned up.
However, this is of course a terrible thing to keep happening.
Some remediation plans:
- Detection: Make space usage more observable. I want to get emails set up at some point, but for now I added this little display to my custom taskbar panel: (it updates by sending a graphql query to the monitor backend once per minute) [image]
- Root cause: Discover whatever is causing the database to keep its WAL segments from being cleaned up, and resolve it. (some diagnostic queries for this are sketched after this list)
Possibly it is my logical-replication slot, maybe after an app-server crash or something.
But possibly it's some side-effect of the pgbackrest backups getting broken. (I discovered that after we restored from backup on June 25th, the next day the pgbackrest backups started working like normal. They kept working until July 20th. Maybe that's the point where postgres knew the backups were failing and so started keeping all WAL segments until the pgbackrest backups could complete, similar to here: https://www.crunchydata.com/blog/postgres-is-out-of-disk-and-how-to-recover-the-dos-and-donts#broken-archives)
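To narrow down which of the two suspects above is holding WAL back, a couple of diagnostic queries can be run via psql inside the database pod (pod, namespace, and user names are placeholders again). An inactive replication slot retaining lots of WAL points at the logical-replication-slot theory; a climbing `failed_count` in `pg_stat_archiver` points at the broken-pgbackrest-archiving theory.

```bash
NS=postgres-operator  # placeholder namespace; the pod name below is a placeholder too

# How much WAL each replication slot is forcing postgres to retain.
kubectl exec debate-map-instance1-abcd-0 -n $NS -- psql -U postgres -c \
  "SELECT slot_name, slot_type, active,
          pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
   FROM pg_replication_slots;"

# Whether the archive_command (pgbackrest) is failing to archive WAL.
kubectl exec debate-map-instance1-abcd-0 -n $NS -- psql -U postgres -c \
  "SELECT archived_count, failed_count, last_archived_time, last_failed_time
   FROM pg_stat_archiver;"
```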
Other notes:
- The PVC size-increasing worked for the main database pod+pvc, but was more complicated for the "repo-host" / in-cluster database copy, as seen in the screenshot above. (not exactly sure what that repo1 is, but anyway its large folder is `/pgbackrest/archive` rather than the `/pgdata/pg_wal` on the main postgres database pod)
- More specifically, the size increase worked, but the WAL data did not clear out in that repo-host PVC like it did for the main database PVC. (a quick pgbackrest health check is sketched below)
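Relatedly, a quick way to see whether pgbackrest itself considers the backups/archive healthy is to run `pgbackrest info` in the repo-host pod. The stanza name "db" is PGO's default but is an assumption here, as are the pod and namespace names.

```bash
# Shows backup history plus the WAL archive range pgbackrest is retaining in /pgbackrest/archive.
kubectl exec debate-map-repo-host-0 -n postgres-operator -- pgbackrest info --stanza=db
```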