Skip to content

Conversation

@alecksphillips
Copy link
Collaborator

@alecksphillips alecksphillips commented Aug 24, 2023

There is currently a lack of reproducibility with MPI, due to numerical instability when calculating the cdf for resampling.

This pull request will (eventually) do a few things:

  • Separate calculation of cdf to a separate function to aid testing
  • Add tests for calculate_cdf
  • Fix numerical stability of calculate_cdf for single core
  • Fix numerical stability of calculate_cdf for MPI (this last point will require adding a parallel logcumsumexp to redistribution submodule)

@alecksphillips alecksphillips force-pushed the fix/MPI-reproducibility branch 2 times, most recently from 58231e1 to c9190bb Compare September 22, 2023 11:14
separate calculation of cdf in get_num_copies to test separately

add test (currently fails due to numerical inaccuracy)
Add logcumsumexp to Executor and Executor_MPI with accompanying tests

Executor_MPI.logsumexp(x) currently implemented naively as log(exec.cumsum(log(x))) and thus fails tests due to numerical inaccuracy
Use exec.logcumsumexp in calculate cdf for numerical stability (currently still unstable with MPI)
Mark test_logcumsumexp in test_executor_MPI
For MPI implementation of logcumsumexp, use inclusive_prefix_logsumexp
from redistribution submodule.
Tests in tests/test_executor_MPI.py all pass including for large array
of numbers for logcumsumexp, however running working example (e.g.
examples/mpi/decision_tree_example.py) with MPI results in error when
performing reduction operation:

```
Fatal Python error: exception in user-defined reduction operation
Traceback (most recent call last):
  File "mpi4py\MPI\opimpl.pxi", line 99, in mpi4py.MPI.op_user_mpi
    File "mpi4py\MPI\opimpl.pxi", line 90, in mpi4py.MPI.op_user_py
      File
      "...\discretesampling\base\executor\MPI\distributed_fixed_size_redistribution\prefix_sum.py",
      line 7, in LSE~
  x = np.frombuffer(xmem, dtype=np.float64)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: buffer size must be a multiple of element size
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants