SysV Shared memory module #882
base: pid-take-2
Conversation
* Replace p2 estimator with SES
* Fix PID tests to use new smoother
* Remove p2
* update images
* Correct comment
* update images
* Clean up comments, address Abed's concerns
* Update comment
* Update images
* run all experiments
* Unit tests for success criteria
* Add incident detection and adaptive convergence
* Update success criteria version 2
* Address Abed's comments
* Update experiments Gemfile to prevent error
* Update experiments
* Address Abed comments again
* Remove init
* Experiments

Co-authored-by: Abdulrahman Alhamali <abdulrahman.alhamali@shopify.com>
Force-pushed from 227e8c7 to eb6750e.
Summary: Here is a summary of the conversation. This summary was generated automatically using Google's gemini-2.5-pro (temperature 0.5).
Links (internal): 🧵 Slack Thread

**Kris Gaudel** (2025‑11‑21 02:06)
I spent a lot of time thinking today about how the shared memory and PID update logic will be implemented (notes and drawings here). The description is in the 🧵

**Kris Gaudel** (2025‑11‑21 02:09)
*How the semaphore is initialized in the bulkhead*
1. Process 1 calls semget to try to create the semaphore; since the set doesn't exist yet, the call does not fail with EEXIST and creates it (sem_id = semget(...))
2. Process 2 calls semget, but the exclusive create fails with EEXIST, so it attaches to the set that already exists (sem_id = wait_for_new_semaphore_set(res->key, permissions);), as in the sketch below
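
As a minimal C sketch of that create-or-attach handshake (the function name and permissions handling are illustrative, not Semian's actual code):

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Create-or-attach: exactly one process wins the exclusive create;
 * every other process falls back to attaching to the existing set. */
static int get_or_create_semaphore_set(key_t key, int nsems, int permissions)
{
    /* Try to create the set exclusively; only one process can succeed. */
    int sem_id = semget(key, nsems, IPC_CREAT | IPC_EXCL | permissions);
    if (sem_id != -1)
        return sem_id; /* we are Process 1: created it, must initialize it */

    if (errno != EEXIST)
        return -1; /* genuine failure */

    /* We are Process 2..N: the set exists, attach to it. */
    return semget(key, nsems, permissions);
}
```

One classic SysV subtlety: an attaching process can observe the set before its creator has finished initializing the semaphore values, which is presumably what wait_for_new_semaphore_set waits out.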

**Kris Gaudel** (2025‑11‑21 02:11)
*We can follow the same logic but for our shared memory*
1. P1 tries to create the shared memory; it doesn't already exist, so shm_id = shmget(...) succeeds
2. P2 tries to create the shared memory, but it already exists, so it fetches the existing segment's id with a plain shmget (no exclusive create) and attaches with shmat, as sketched below
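
A hedged sketch of the same pattern for the segment itself; the shared_pid_state_t layout is hypothetical, pieced together from the fields mentioned later in this thread:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/* Hypothetical layout of the shared state discussed in this thread. */
typedef struct {
    double rejection_rate;   /* output of the PID controller */
    long   update_timestamp; /* epoch seconds of the last PID update */
    int    ref_count;        /* number of attached processes */
} shared_pid_state_t;

static shared_pid_state_t *get_or_create_shared_state(key_t key, int permissions, int *created)
{
    /* P1 path: try the exclusive create. */
    int shm_id = shmget(key, sizeof(shared_pid_state_t), IPC_CREAT | IPC_EXCL | permissions);
    *created = (shm_id != -1);

    if (shm_id == -1) {
        if (errno != EEXIST)
            return NULL;
        /* P2 path: the segment already exists, fetch its id with a plain shmget. */
        shm_id = shmget(key, sizeof(shared_pid_state_t), permissions);
        if (shm_id == -1)
            return NULL;
    }

    /* Both paths end the same way: map the segment into this process. */
    void *addr = shmat(shm_id, NULL, 0);
    return addr == (void *)-1 ? NULL : (shared_pid_state_t *)addr;
}
```

Only the creator (*created set) should initialize the fields; on Linux a freshly created segment is zero-filled, but attachers still need the same initialization handshake as with the semaphore.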

**Kris Gaudel** (2025‑11‑21 02:14)
*Stepping back, how will our ACBs connect to this shared memory?*
• Each process will call Semian.register(...), initializing an instance of an ACB
• The first process will try to create the shared memory as described above, the second will attach to the shared memory

**Kris Gaudel** (2025‑11‑21 02:15)
*Ok, but now how will UpdatePID get called?*
• Currently, each process has its own background thread that calls UpdatePID at the end of every window
• We can employ similar logic, where each process has its own background thread that tries to acquire a mutex upon reaching the end of its window (see the sketch below)
• If it gets the mutex, it can call UpdatePID
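
A sketch of that end-of-window attempt, reusing shared_pid_state_t from the sketch above and treating semaphore 0 as a mutex (update_pid is a placeholder for the real PID computation):

```c
#include <sys/sem.h>

void update_pid(shared_pid_state_t *state); /* placeholder: the real PID update */

/* Non-blocking lock attempt; returns 1 if we got the mutex. */
static int try_lock(int sem_id)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = IPC_NOWAIT | SEM_UNDO };
    return semop(sem_id, &op, 1) == 0;
}

static void unlock(int sem_id)
{
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
    semop(sem_id, &op, 1);
}

/* Called from each process's background thread at the end of its window. */
static void maybe_update_pid(int sem_id, shared_pid_state_t *state)
{
    if (!try_lock(sem_id))
        return; /* another process is updating right now; skip */
    update_pid(state);
    unlock(sem_id);
}
```

SEM_UNDO is worth the extra flag: if the lock holder dies mid-update, the kernel releases the semaphore, which partially addresses the crash concerns raised later in the thread.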

**Kris Gaudel** (2025‑11‑21 02:51)
*Great, but what if we have a case where P1 updates the PID, then P2 (whose window just finished) tries to update the PID even though it was just updated?*
• Timestamps! A timestamp of the last time the PID was updated
• Our UpdatePID function will update these things in our shared memory: rejection_rate and update_timestamp
• Processes looking to update the PID must pass this condition: if time.now() - last_update < window_size * MULTIPLIER (where MULTIPLIER is some value less than 1, like 0.9, to account for latency), we skip the update since it happened very recently (see the sketch below)
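
Continuing the sketch, the timestamp guard would sit inside the critical section (MULTIPLIER at 0.9 is the example value from the message; compute_rejection_rate is a placeholder):

```c
#include <time.h>

#define MULTIPLIER 0.9 /* a bit under 1, to account for latency */

double compute_rejection_rate(shared_pid_state_t *state); /* placeholder */

/* Caller must hold the mutex. Returns 1 if it performed the update. */
static int update_pid_if_stale(shared_pid_state_t *state, double window_size_s)
{
    time_t now = time(NULL);
    if ((double)(now - state->update_timestamp) < window_size_s * MULTIPLIER)
        return 0; /* another process updated very recently; skip */

    state->rejection_rate   = compute_rejection_rate(state);
    state->update_timestamp = (long)now;
    return 1;
}
```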

**Kris Gaudel** (2025‑11‑21 02:52)
*How will shared memory get cleaned up?*
• Reference counting
• When a process shuts down, it decrements the shared memory's reference count (ref_count -= 1)
• Each process checks the reference count on the way out; the one that brings it to zero is the last user
• That last process calls the destructor to clean up the shared memory (sketch below)
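
One hedged way to express that teardown in C, using compiler atomic builtins on the ref_count field from the earlier sketch (the PR mentions atomic ops; the exact scheme here is illustrative):

```c
#include <sys/shm.h>

/* After attaching, each process announces itself. */
static void shm_ref(shared_pid_state_t *state)
{
    __atomic_add_fetch(&state->ref_count, 1, __ATOMIC_SEQ_CST);
}

/* On clean shutdown: decrement first, then detach; the process that
 * brings the count to zero removes the segment. */
static void shm_unref(int shm_id, shared_pid_state_t *state)
{
    int remaining = __atomic_sub_fetch(&state->ref_count, 1, __ATOMIC_SEQ_CST);
    shmdt(state); /* drop our mapping; touching state after this is invalid */
    if (remaining == 0)
        shmctl(shm_id, IPC_RMID, NULL); /* last one out destroys the segment */
}
```

As the thread notes below, this only covers clean shutdowns: a SIGKILL'd process never decrements the count. The kernel-maintained shm_nattch field (readable via shmctl(IPC_STAT)) is one way to cross-check, since the kernel updates it even when a process dies.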

**Kris Gaudel** (2025‑11‑21 02:55)
New architecture and diagram to show inheritance

**Kris Gaudel** (2025‑11‑21 02:56)
Happy to pair on this tomorrow and clarify things

**Abdulrahman Alhamali** (2025‑11‑21 14:00)
wow, I really like how detailed this is. Overall it seems solid to me, and I really like the idea of all of the processes having their own PID controller. Because if we kept it to a single process, that process might die, and we'd end up without a controller.

**Anagayan Ariaran** (2025‑11‑21 14:28)
Hey, great work thinking through all this! The last-update timestamp is definitely a key value to track.

> each process has its own background thread

Just thinking about this: if each process has a thread that's constantly checking, we're more likely to encounter contention. I also have concerns around opening threads to do this inside applications that call Semian. We don't know the architecture of others, and opening threads that just check on this might not be a great pattern. To combat this, I was thinking the check could run on a fraction of the requests coming through, say on 5% of requests after a time window. On that fraction, they try to update the values by acquiring the semaphore. It's more on the hot path, but it's not every request, assuming the actual calculations themselves won't take too long in general. (reference)

> *How will shared memory get cleaned up?*

Great work thinking this through for normal operations. I think the hard part is what we do when processes crash or are killed, and whether we can account for those unexpected events. There might be a tradeoff here where we can't account for everything but know that things will get cleaned up when pods restart, and the total volume of memory we use is marginal.
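
For concreteness, the sampling idea could look roughly like this, reusing the helpers sketched earlier (SAMPLE_RATE and the hook name are illustrative):

```c
#include <stdlib.h>

#define SAMPLE_RATE 0.05 /* ~5% of requests, per the suggestion above */

/* Instead of a background thread, run this after handling a request. */
static void after_request_hook(int sem_id, shared_pid_state_t *state, double window_size_s)
{
    if ((double)rand() / RAND_MAX >= SAMPLE_RATE)
        return; /* not sampled: stay off the hot path */

    if (!try_lock(sem_id))
        return; /* someone else is already updating */
    update_pid_if_stale(state, window_size_s);
    unlock(sem_id);
}
```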

**Abdulrahman Alhamali** (2025‑11‑21 14:31)
> To combat this I was thinking this can be considered on a fraction of requests coming through, like say on 5% of requests after a time window. On that fraction they try to update the values by acquiring the semaphore. It's more on the hot path, but it's not every request, assuming that the actual calculations themselves won't take too long in general. ([reference](https://draw.shopify.io/dtcmrhusnppj5ourmdddbpl2?d=v-1039.-1282.3598.3777.nzTtLG_aXBOO-JtR8phRE))

This might be useful for the current implementation of the PID controller that we have; some people might not be fans of the fact that we are creating a thread. The problem with doing it after requests, though, is that requests with slow queries would end up blocking it, so the sliding window implementation would become less effective. However, this becomes less of an issue if we have many processes, so it's mostly a concern for the current implementation.

**Abdulrahman Alhamali** (2025‑11‑21 15:35)
Kris, let's make sure your research gets immortalized in GitHub

**Kris Gaudel** (2025‑11‑21 15:59)
Thanks for the feedback, everyone!

**Kris Gaudel** (2025‑11‑21 16:00)
I haven't used GitHubArchiver before; it'd be useful to refer to this stuff somewhere.
Kris Gaudel archived this conversation from gsd-47642-semian-autoconfig at 2025‑11‑21 16:46.
All times are in UTC.


Part of https://github.com/orgs/Shopify/projects/13332?pane=issue&itemId=139784715&issue=Shopify%7Cresiliency%7C6646
This implements the first two parts of the issue (SysV shared memory and atomic ops). PID state updates will be in a follow-up PR.
Instructions to build the extension and run the unit test (after initializing the container):
- `podman-compose -f .devcontainer/docker-compose.yml exec semian bash` to get into the container
- `bundle exec rake build` to compile
- `bundle exec ruby -Ilib:test test/shared_memory_test.rb` to run the unit test for the extension