From 4bb62af8dfbce57324eee8f7b096db0a246cec18 Mon Sep 17 00:00:00 2001
From: Steve Bronder <stevo15025@gmail.com>
Date: Tue, 6 Apr 2021 18:11:24 -0400
Subject: [PATCH 1/7] add design doc for parallel services

---
 designs/0020-parallel-chain-api.md | 106 +++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 designs/0020-parallel-chain-api.md
diff --git a/designs/0020-parallel-chain-api.md b/designs/0020-parallel-chain-api.md
new file mode 100644
index 0000000..474eda0
--- /dev/null
+++ b/designs/0020-parallel-chain-api.md
@@ -0,0 +1,106 @@
+- Feature Name: parallel_chain_api
+- Start Date: 2021-04-06
+- RFC PR: (leave this empty)
+- Stan Issue: (leave this empty)
+
+# Summary
+[summary]: #summary
+
+This outlines a services layer API for running multiple chains in one Stan program.
+
+# Motivation
+[motivation]: #motivation
+
+Currently to run multiple chains for a given model a user or developer must use higher level parallelization tools such as `gnu parallel` or R/Python parallelism schemes. However, we have access to the TBB and with it a schedular for managing hierarchical parallelism. We can utilize the TBB to provide service API's for running multiple chains in one program and safely account for possible parallelism within a model using tools such as `reduce_sum()`.
+
+The benefits of this scheme are mostly memory savings and standardization of multi chain processes in Stan. Because a Stan model is immutable after construction, it is possible to share that model across all chains. For a model that uses 1GB of data, running 8 chains in parallel as separate processes uses 8GB of RAM, whereas sharing the model across the chains uses only 1GB.
+
+Having a standardized IO and API for multi chain processes will allow researchers to develop methods which utilize information across chains. This research can allow for algorithms such as automated warmup periods where instead of hard coding the number of warmups, warmups will only happen until a set of conditions are achieved and then we can begin sampling.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+Each of the servies layers in [`src/stan/services/`](https://github.com/stan-dev/stan/blob/147fba5fb93aa007ec42744a36d97cc84c291945/src/stan/services/sample/hmc_nuts_dense_e_adapt.hpp) layer will have the current API for single chain processes as well as an API for running multi chain processes. Their inputs are conceptually the same, but several of the inputs have been changed to be vectors of the single chain processes arguments in order to account for multiple chains. For instance, the signature of a single chain for `hmc_nuts_dense_e_adapt` now has `std::vector`s for the initialial values, inverse metric, and init, sample, and diagnostic writers.
+
+```cpp
+template <class Model>
+int hmc_nuts_dense_e_adapt(
+    Model& model,
+    const stan::io::var_context& init,
+    const stan::io::var_context& init_inv_metric,
+    unsigned int random_seed,
+    unsigned int chain, double init_radius, int num_warmup, int num_samples,
+    int num_thin, bool save_warmup, int refresh, double stepsize,
+    double stepsize_jitter, int max_depth, double delta, double gamma,
+    double kappa, double t0, unsigned int init_buffer, unsigned int term_buffer,
+    unsigned int window,
+    callbacks::interrupt& interrupt,
+    callbacks::logger& logger,
+    callbacks::writer& init_writer,
+    callbacks::writer& sample_writer,
+    callbacks::writer& diagnostic_writer)
+```
+
+```cpp
+template <class Model, typename InitContext, typename InitInvContext,
+          typename InitWriter, typename SampleWriter, typename DiagnosticWriter>
+int hmc_nuts_dense_e_adapt(
+    Model& model,
+    // now vectors
+    const std::vector<InitContext>& init,
+    const std::vector<InitInvContext>& init_inv_metric,
+    unsigned int random_seed, unsigned int chain, double init_radius,
+    int num_warmup, int num_samples, int num_thin, bool save_warmup,
+    int refresh, double stepsize, double stepsize_jitter, int max_depth,
+    double delta, double gamma, double kappa, double t0,
+    unsigned int init_buffer, unsigned int term_buffer, unsigned int window,
+    // interrupt and logger must be threadsafe
+    callbacks::interrupt& interrupt,
+    callbacks::logger& logger,
+    // now vectors
+    std::vector<InitWriter>& init_writer,
+    std::vector<SampleWriter>& sample_writer,
+    std::vector<DiagnosticWriter>& diagnostic_writer,
+    size_t n_chain)
+```
+
+Additionally the new API has an argument `n_chain` which tells the backend how many chains to run. All of the vector inputs must be the same size as `n_chain`. For optional performance, `InitContext` and `InitInvContext` can either be any type inheriting from `stan::io::var_context` or either `std::shared_ptr<>` or `std::unique_ptr<>` with an underlying pointer whose type is derived from `stan::io::var_context`. Within the new API these arguments are accessed through a function `stan::io::get_underlying(const T& x)` which for any of the above inputs returns a reference to the object inheriting from `stan::io::var_context`. For upstream APIs such as rstan which uses `Rcpp` this function can be overloaded to support smart pointers such as `Rcpp::Xptr`.
+
+```cpp
+namespace stan {
+namespace io {
+template <typename T>
+const auto& get_underlying(const Rcpp::Xptr<T>& x) {
+  return *x;
+}
+}
+}
+```
+
+This scheme allows for flexibility, where a user can pass one initialization for all chains and the program can make one shared pointer used in all instances of the vector.
+
+`interrupt`, `logger`, and the elements of the vectors `init`, `init_inv_metric`, `init_writer`, `sample_writer`, and `diagnostic_writer` must be threadsafe. `init` and `init_inv_metric` are only read from, so they should be threadsafe by default. Any of the writers which write to `std::cout` are safe by the standard, though it is recommended to write any output to a local `std::stringstream` and then pass the fully constructed output so that thread outputs are not mixed together. See the code [here](https://github.com/stan-dev/stan/pull/3033/files#diff-ab5eb0683288927defb395f1af49548c189f6e7ab4b06e217dec046b0c1be541R80) for an example. Additionally, if the elements of `init_writer`, `sample_writer`, and `diagnostic_writer` each point to unique output they will be threadsafe as well.
+
+### Recommended Upstream Initialization
+
+Upstream packages can generate `init` and `init_inv_metric` as they wish, though for cmdstan the prototype follows the following rules for reading user input.
+
+If the user specifies their init as `{file_name}.{file_ending}` then the program will search for `{file_name}_{1..chains}.{file_ending}` where `chains` is the integer value specified for the user for the number of chains to run in the program. If it fails to find any of the `{file_name}_{1..chains}.{file_ending}` it will then search for `{file_name}.{file_ending}` and if found will use that. Otherwise an exception will occur.
+
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+The services API on the backend has a prototype implementation found [here](https://github.com/stan-dev/stan/blob/147fba5fb93aa007ec42744a36d97cc84c291945/src/stan/services/sample/hmc_nuts_dense_e_adapt.hpp#L206). The main additions to this change are in creating the following for each chain.
+
+1. PRNGs
+2. Initializations
+3. Samplers
+4. inverse metrics
+
+Then a [`tbb::parallel_for()`](https://github.com/stan-dev/stan/blob/147fba5fb93aa007ec42744a36d97cc84c291945/src/stan/services/sample/hmc_nuts_dense_e_adapt.hpp#L261) is used to run each of the samplers.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+This does add overhead to existing implimentations in managing the per chain IO. Performance tests still need to be completed to assess the efficiency of nested parallelism (i.e. using `reduce_sum()`) inside of chains executing in parallel.

From ec7f1468aac457d76c8629181f13fc9ea0f041d6 Mon Sep 17 00:00:00 2001
From: Steve Bronder <stevo15025@gmail.com>
Date: Fri, 9 Apr 2021 12:47:59 -0400
Subject: [PATCH 2/7] update with sbc Q

---
 designs/0020-parallel-chain-api.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/designs/0020-parallel-chain-api.md b/designs/0020-parallel-chain-api.md
index 474eda0..9952b0c 100644
--- a/designs/0020-parallel-chain-api.md
+++ b/designs/0020-parallel-chain-api.md
@@ -104,3 +104,17 @@ Then a [`tbb::parallel_for()`](https://github.com/stan-dev/stan/blob/147fba5fb93
 [drawbacks]: #drawbacks
 
 This does add overhead to existing implimentations in managing the per chain IO. Performance tests still need to be completed to assess the efficiency of nested parallelism (i.e. using `reduce_sum()`) inside of chains executing in parallel.
+
+
+### Open Questions
+
+The main open question is whether to recommend upstream users of services to generate N models or a single model
+whenever a Stan program uses `*_rng()` functions in transformed data for methods such as Simulation Based Calibration.
+With 1 model the transformed data will be shared across all chains. With SBC we commonly want to run multiple
+data sets and the question is whether we want multiple chains over one dataset or a chain for each data set.
+If we would like to have multiple models in one program when the user uses an `*_rng()` function, there is a [`stanc3 PR`](https://github.com/stan-dev/stanc3/pull/868) to add a method to check whether the user uses an rng function in
+transformed data. Upstream service users can generate one model, then ask it if an rng is used in transformed data
+to decide whether to generate N more models.
+
+Personally, I think it makes sense to run multiple chains for each generated dataset (having 1 model).
+This makes sense to me as we can check for recovery of parameters given K datasets and N chains per dataset.

From 401694aba52b359e3f3732031c5a35d17e4e1fe0 Mon Sep 17 00:00:00 2001
From: Steve Bronder <stevo15025@gmail.com>
Date: Tue, 20 Apr 2021 19:07:57 -0400
Subject: [PATCH 3/7] update with init_chain_id and clarifications on
 replication between multi-chain program and multiple programs with single
 chains

---
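A minimal sketch of the per-chain init file naming that this patch describes for the cmdstan prototype. It assumes one file per chain, so `M` chains starting at `id` `N` give the ids `N, N + 1, ..., N + M - 1`, and it assumes the name is split at the last `.`. `per_chain_init_files` is a hypothetical helper for illustration, not a cmdstan function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of cmdstan): builds the candidate names
// `{file_name}_{id}.{file_ending}` for ids N, N + 1, ..., N + M - 1.
inline std::vector<std::string> per_chain_init_files(const std::string& path,
                                                     unsigned int id,
                                                     std::size_t num_chains) {
  assert(num_chains > 0);
  // Split at the last '.', so "init.data.R" becomes "init.data" and ".R".
  const std::size_t dot = path.rfind('.');
  const std::string stem
      = (dot == std::string::npos) ? path : path.substr(0, dot);
  const std::string ending
      = (dot == std::string::npos) ? std::string() : path.substr(dot);
  std::vector<std::string> files;
  files.reserve(num_chains);
  for (std::size_t i = 0; i < num_chains; ++i) {
    files.push_back(stem + "_" + std::to_string(id + i) + ending);
  }
  return files;
}
```

A caller would check that these files exist and fall back to the unsuffixed name when the per-chain files are absent; that file system check is left out of the sketch.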
 designs/0020-parallel-chain-api.md | 46 +++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/designs/0020-parallel-chain-api.md b/designs/0020-parallel-chain-api.md
index 9952b0c..411df79 100644
--- a/designs/0020-parallel-chain-api.md
+++ b/designs/0020-parallel-chain-api.md
@@ -11,7 +11,7 @@ This outlines a services layer API for running multiple chains in one Stan progr
 # Motivation
 [motivation]: #motivation
 
-Currently to run multiple chains for a given model a user or developer must use higher level parallelization tools such as `gnu parallel` or R/Python parallelism schemes. However, we have access to the TBB and with it a schedular for managing hierarchical parallelism. We can utilize the TBB to provide service API's for running multiple chains in one program and safely account for possible parallelism within a model using tools such as `reduce_sum()`.
+Currently, to run multiple chains for a given model a user or developer must use higher level parallelization tools such as `gnu parallel` or R/Python parallelism schemes. However, we have access to the TBB and with it a schedular for managing hierarchical parallelism. We can utilize the TBB to provide service API's for running multiple chains in one program and safely account for possible parallelism within a model using tools such as `reduce_sum()`.
 
 The benefits to this scheme are mostly in memory savings and standardization of multi chain processes in Stan. Because a stan model is immutable after construction it's possible to share that model across all chains. For a model that uses 1GB of data running 8 chains in parallel means we use 8GB of RAM. However by sharing the model across the chains we simply use 1GB of data.
 
@@ -20,7 +20,7 @@ Having a standardized IO and API for multi chain processes will allow researcher
 # Guide-level explanation
 [guide-level-explanation]: #guide-level-explanation
 
-Each of the servies layers in [`src/stan/services/`](https://github.com/stan-dev/stan/blob/147fba5fb93aa007ec42744a36d97cc84c291945/src/stan/services/sample/hmc_nuts_dense_e_adapt.hpp) layer will have the current API for single chain processes as well as an API for running multi chain processes. Their inputs are conceptually the same, but several of the inputs have been changed to be vectors of the single chain processes arguments in order to account for multiple chains. For instance, the signature of a single chain for `hmc_nuts_dense_e_adapt` now has `std::vector`s for the initialial values, inverse metric, and init, sample, and diagnostic writers.
+Each of the services layers in [`src/stan/services/`](https://github.com/stan-dev/stan/blob/147fba5fb93aa007ec42744a36d97cc84c291945/src/stan/services/sample/hmc_nuts_dense_e_adapt.hpp) will have the current API for single chain processes as well as an API for running multi chain processes. Their inputs are conceptually the same, but several of the inputs have been changed to be vectors of the single chain arguments in order to account for multiple chains. For instance, the multi chain signature of `hmc_nuts_dense_e_adapt` now has `std::vector`s for the initial values, inverse metric, init writers, sample writers, and diagnostic writers.
 
 ```cpp
 template <class Model>
@@ -29,7 +29,7 @@ int hmc_nuts_dense_e_adapt(
     const stan::io::var_context& init,
     const stan::io::var_context& init_inv_metric,
     unsigned int random_seed,
-    unsigned int chain, double init_radius, int num_warmup, int num_samples,
+    unsigned int init_chain_id, double init_radius, int num_warmup, int num_samples,
     int num_thin, bool save_warmup, int refresh, double stepsize,
     double stepsize_jitter, int max_depth, double delta, double gamma,
     double kappa, double t0, unsigned int init_buffer, unsigned int term_buffer,
@@ -49,7 +49,7 @@ int hmc_nuts_dense_e_adapt(
     // now vectors
     const std::vector<InitContext>& init,
     const std::vector<InitInvContext>& init_inv_metric,
-    unsigned int random_seed, unsigned int chain, double init_radius,
+    unsigned int random_seed, unsigned int init_chain_id, double init_radius,
     int num_warmup, int num_samples, int num_thin, bool save_warmup,
     int refresh, double stepsize, double stepsize_jitter, int max_depth,
     double delta, double gamma, double kappa, double t0,
@@ -64,7 +64,7 @@ int hmc_nuts_dense_e_adapt(
     size_t n_chain)
 ```
 
-Additionally the new API has an argument `n_chain` which tells the backend how many chains to run. All of the vector inputs must be the same size as `n_chain`. For optional performance, `InitContext` and `InitInvContext` can either be any type inheriting from `stan::io::var_context` or either `std::shared_ptr<>` or `std::unique_ptr<>` with an underlying pointer whose type is derived from `stan::io::var_context`. Within the new API these arguments are accessed through a function `stan::io::get_underlying(const T& x)` which for any of the above inputs returns a reference to the object inheriting from `stan::io::var_context`. For upstream APIs such as rstan which uses `Rcpp` this function can be overloaded to support smart pointers such as `Rcpp::Xptr`.
+Additionally the new API has an argument `n_chain` which tells the backend how many chains to run and `init_chain_id` instead of `chain`. `init_chain_id` will be used to generate PRNGs for each chain as `seed + init_chain_id + chain_num` where `chain_num` is the i'th chain being generated. All of the vector inputs must be the same size as `n_chain`. For optional flexibility, `InitContext` and `InitInvContext` can either be any type inheriting from `stan::io::var_context` or either `std::shared_ptr<>` or `std::unique_ptr<>` with an underlying pointer whose type is derived from `stan::io::var_context`. Within the new API these arguments are accessed through a function `stan::io::get_underlying(const T& x)` which for any of the above inputs returns a reference to the object inheriting from `stan::io::var_context`. For upstream APIs such as rstan which uses `Rcpp` this function can be overloaded to support smart pointers such as `Rcpp::Xptr`.
 
 ```cpp
 namespace stan {
@@ -81,13 +81,6 @@ This scheme allows for flexibility, where a user can pass one initialization for
 
 The elements of the vectors for `init`, `init_inv_metric`, `interrupt`, `logger`, `init_writer`, `sample_writer`, and `diagnostic_writer` must be threadsafe. `init` and `init_inv_metric` are only read from so should be threadsafe by default. Any of the writers which write to `std::cout` are safe by the standard, though it is recommended to write any output to an local `std::stringstream` and then pass the fully constructed output so that thread outputs are not mixed together. See the code [here](https://github.com/stan-dev/stan/pull/3033/files#diff-ab5eb0683288927defb395f1af49548c189f6e7ab4b06e217dec046b0c1be541R80) for an example. Additionally if the elements of `init_writer`, `sample_writer`, and `diagnostic_writer` each point to unique output they will be threadsafe as well.
 
-### Recommended Upstream Initialization
-
-Upstream packages can generate `init` and `init_inv_metric` as they wish, though for cmdstan the prototype follows the following rules for reading user input.
-
-If the user specifies their init as `{file_name}.{file_ending}` then the program will search for `{file_name}_{1..chains}.{file_ending}` where `chains` is the integer value specified for the user for the number of chains to run in the program. If it fails to find any of the `{file_name}_{1..chains}.{file_ending}` it will then search for `{file_name}.{file_ending}` and if found will use that. Otherwise an exception will occur.
-
-
 # Reference-level explanation
 [reference-level-explanation]: #reference-level-explanation
 
@@ -100,10 +93,37 @@ The services API on the backend has a prototype implementation found [here](http
 
 Then a [`tbb::parallel_for()`](https://github.com/stan-dev/stan/blob/147fba5fb93aa007ec42744a36d97cc84c291945/src/stan/services/sample/hmc_nuts_dense_e_adapt.hpp#L261) is used to run the each of the samplers.
 
+### Recommended Upstream Initialization
+
+Upstream packages can generate `init` and `init_inv_metric` as they wish, though for cmdstan the prototype uses the following rules for reading user input.
+
+If the user specifies their init as `{file_name}.{file_ending}` with an input `id` of `N` and chains `M` then the program will search for `{file_name}_{N..(N + M - 1)}.{file_ending}` where `N..(N + M - 1)` is the integer sequence `N, N + 1, ..., N + M - 1`, one file per chain. If the program does not find the per-chain files it will then search for `{file_name}.{file_ending}` and if found will use that for all chains. Otherwise an exception will occur.
+
+Documentation must be added to clarify reproducibility between a multi-chain program and running multiple chains across several programs. This requires
+
+1. Using the same random seed for the multi-chain program and each program running a chain.
+2. Starting each single-chain program with the `id` of the `i`th chain in the multi-chain program.
+
+For example, the following two sets of calls should produce the same results up to floating point accuracy.
+
+```bash
+# From cmdstan example folder
+# running 4 chains at once
+examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R chains=4 id=1 random seed=123 output file=output.csv
+# Running 4 separate chains
+examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R chains=1 id=1 random seed=123 output file=output1.csv
+examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R chains=1 id=2 random seed=123 output file=output2.csv
+examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R chains=1 id=3 random seed=123 output file=output3.csv
+
+examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R chains=1 id=4 random seed=123 output file=output4.csv
+```
+
+
+
 # Drawbacks
 [drawbacks]: #drawbacks
 
-This does add overhead to existing implimentations in managing the per chain IO. Performance tests still need to be completed to assess the efficiency of nested parallelism (i.e. using `reduce_sum()`) inside of chains executing in parallel.
+This does add overhead to existing implementations in managing the per chain IO.
 
 
 ### Open Questions

From 6b60ded3e68ba03957a6b2452e82d18103e28941 Mon Sep 17 00:00:00 2001
From: Steve Bronder <stevo15025@gmail.com>
Date: Tue, 20 Apr 2021 21:53:17 -0400
Subject: [PATCH 4/7] update n_chain to num_chains

---
 designs/0020-parallel-chain-api.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/designs/0020-parallel-chain-api.md b/designs/0020-parallel-chain-api.md
index 411df79..f69e697 100644
--- a/designs/0020-parallel-chain-api.md
+++ b/designs/0020-parallel-chain-api.md
@@ -61,10 +61,10 @@ int hmc_nuts_dense_e_adapt(
     std::vector<InitWriter>& init_writer,
     std::vector<SampleWriter>& sample_writer,
     std::vector<DiagnosticWriter>& diagnostic_writer,
-    size_t n_chain)
+    size_t num_chains)
 ```
 
-Additionally the new API has an argument `n_chain` which tells the backend how many chains to run and `init_chain_id` instead of `chain`. `init_chain_id` will be used to generate PRNGs for each chain as `seed + init_chain_id + chain_num` where `chain_num` is the i'th chain being generated. All of the vector inputs must be the same size as `n_chain`. For optional flexibility, `InitContext` and `InitInvContext` can either be any type inheriting from `stan::io::var_context` or either `std::shared_ptr<>` or `std::unique_ptr<>` with an underlying pointer whose type is derived from `stan::io::var_context`. Within the new API these arguments are accessed through a function `stan::io::get_underlying(const T& x)` which for any of the above inputs returns a reference to the object inheriting from `stan::io::var_context`. For upstream APIs such as rstan which uses `Rcpp` this function can be overloaded to support smart pointers such as `Rcpp::Xptr`.
+Additionally the new API has an argument `num_chains` which tells the backend how many chains to run and `init_chain_id` instead of `chain`. `init_chain_id` will be used to generate PRNGs for each chain as `seed + init_chain_id + chain_num` where `chain_num` is the i'th chain being generated. All of the vector inputs must be the same size as `num_chains`. For optional flexibility, `InitContext` and `InitInvContext` can either be any type inheriting from `stan::io::var_context` or either `std::shared_ptr<>` or `std::unique_ptr<>` with an underlying pointer whose type is derived from `stan::io::var_context`. Within the new API these arguments are accessed through a function `stan::io::get_underlying(const T& x)` which for any of the above inputs returns a reference to the object inheriting from `stan::io::var_context`. For upstream APIs such as rstan which uses `Rcpp` this function can be overloaded to support smart pointers such as `Rcpp::Xptr`.
 
 ```cpp
 namespace stan {

From 31249fbd8c3697336973ad661bf7354ec12612f8 Mon Sep 17 00:00:00 2001
From: Steve Bronder <stevo15025@gmail.com>
Date: Tue, 27 Apr 2021 23:54:35 -0400
Subject: [PATCH 5/7] 1. Add more to motivation about service layer parallelism
 unifying the interfaces. 2. Update function signature 3. Change var context
 vectors to only take in classes with valid operator* 4. Example of multi init
 5. Removed doc about transformed parameters as with it the loose definition
 of model

---
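A minimal sketch of the `operator*` requirement on `InitContext` introduced in this patch. `var_context_like` and `dump_like` are hypothetical stand-ins for `stan::io::var_context` and a class derived from it, and `context_names` is an illustrative function rather than part of the services API. The point is only that every element is accessed through `*init[i]`, so one `std::shared_ptr` can be reused for all chains.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for stan::io::var_context (illustration only).
struct var_context_like {
  virtual ~var_context_like() = default;
  virtual std::string name() const = 0;
};

// Hypothetical derived context.
struct dump_like : var_context_like {
  std::string name() const override { return "dump"; }
};

// Accepts any element type whose operator* yields a reference to a class
// derived from the context base, e.g. std::shared_ptr or std::unique_ptr.
template <typename InitContext>
std::vector<std::string> context_names(const std::vector<InitContext>& init,
                                       std::size_t num_chains) {
  // All vector inputs must have one element per chain.
  assert(init.size() == num_chains);
  std::vector<std::string> names;
  names.reserve(num_chains);
  for (std::size_t i = 0; i < num_chains; ++i) {
    const var_context_like& ctx = *init[i];
    names.push_back(ctx.name());
  }
  return names;
}
```

Filling the vector with copies of a single `std::shared_ptr` gives every chain the same read-only initialization without copying the underlying context.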
 designs/0020-parallel-chain-api.md | 43 ++++++++----------------------
 1 file changed, 11 insertions(+), 32 deletions(-)

diff --git a/designs/0020-parallel-chain-api.md b/designs/0020-parallel-chain-api.md
index f69e697..9c89317 100644
--- a/designs/0020-parallel-chain-api.md
+++ b/designs/0020-parallel-chain-api.md
@@ -11,7 +11,7 @@ This outlines a services layer API for running multiple chains in one Stan progr
 # Motivation
 [motivation]: #motivation
 
-Currently, to run multiple chains for a given model a user or developer must use higher level parallelization tools such as `gnu parallel` or R/Python parallelism schemes. However, we have access to the TBB and with it a schedular for managing hierarchical parallelism. We can utilize the TBB to provide service API's for running multiple chains in one program and safely account for possible parallelism within a model using tools such as `reduce_sum()`.
+Currently, to run multiple chains for a given model a user or developer must use higher level parallelization tools such as `gnu parallel` or R/Python parallelism schemes. The high level approach is used partly because of intricacies at the lower level around managing Stan's thread local stack allocators along with multi-threaded IO. Providing a service layer API for multiple chains in one Stan program removes the requirement that each interface independently implement all the tools necessary for parallel chains. Moreover, we have access to the TBB and with it a scheduler for managing hierarchical parallelism. We can utilize the TBB to provide service APIs for running multiple chains in one program and safely account for possible parallelism within a model using tools such as `reduce_sum()`.
 
 The benefits to this scheme are mostly in memory savings and standardization of multi chain processes in Stan. Because a stan model is immutable after construction it's possible to share that model across all chains. For a model that uses 1GB of data running 8 chains in parallel means we use 8GB of RAM. However by sharing the model across the chains we simply use 1GB of data.
 
@@ -42,10 +42,11 @@ int hmc_nuts_dense_e_adapt(
 ```
 
 ```cpp
-template <class Model, typename InitContext, typename InitInvContext,
+template <typename Model, typename InitContext, typename InitInvContext,
           typename InitWriter, typename SampleWriter, typename DiagnosticWriter>
 int hmc_nuts_dense_e_adapt(
     Model& model,
+    size_t num_chains,
     // now vectors
     const std::vector<InitContext>& init,
     const std::vector<InitInvContext>& init_inv_metric,
@@ -60,24 +61,10 @@ int hmc_nuts_dense_e_adapt(
     // now vectors
     std::vector<InitWriter>& init_writer,
     std::vector<SampleWriter>& sample_writer,
-    std::vector<DiagnosticWriter>& diagnostic_writer,
-    size_t num_chains)
+    std::vector<DiagnosticWriter>& diagnostic_writer)
 ```
 
-Additionally the new API has an argument `num_chains` which tells the backend how many chains to run and `init_chain_id` instead of `chain`. `init_chain_id` will be used to generate PRNGs for each chain as `seed + init_chain_id + chain_num` where `chain_num` is the i'th chain being generated. All of the vector inputs must be the same size as `num_chains`. For optional flexibility, `InitContext` and `InitInvContext` can either be any type inheriting from `stan::io::var_context` or either `std::shared_ptr<>` or `std::unique_ptr<>` with an underlying pointer whose type is derived from `stan::io::var_context`. Within the new API these arguments are accessed through a function `stan::io::get_underlying(const T& x)` which for any of the above inputs returns a reference to the object inheriting from `stan::io::var_context`. For upstream APIs such as rstan which uses `Rcpp` this function can be overloaded to support smart pointers such as `Rcpp::Xptr`.
-
-```cpp
-namespace stan {
-namespace io {
-template <typename T>
-const auto& get_underlying(const Rcpp::Xptr<T>& x) {
-  return *x;
-}
-}
-}
-```
-
-This scheme allows for flexibility, where a user can pass one initialization for all chains and the program can make one shared pointer used in all instances of the vector.
+Additionally the new API has an argument `num_chains` which tells the backend how many chains to run, and `init_chain_id` replaces `chain`. `init_chain_id` will be used to generate PRNGs for each chain as `seed + init_chain_id + chain_num` where `chain_num` is the index of the chain being generated. All of the vector inputs must have size `num_chains`. `InitContext` and `InitInvContext` must have a valid `operator*` which returns a reference to a class derived from `stan::io::var_context`.
 
 The elements of the vectors for `init`, `init_inv_metric`, `interrupt`, `logger`, `init_writer`, `sample_writer`, and `diagnostic_writer` must be threadsafe. `init` and `init_inv_metric` are only read from so should be threadsafe by default. Any of the writers which write to `std::cout` are safe by the standard, though it is recommended to write any output to an local `std::stringstream` and then pass the fully constructed output so that thread outputs are not mixed together. See the code [here](https://github.com/stan-dev/stan/pull/3033/files#diff-ab5eb0683288927defb395f1af49548c189f6e7ab4b06e217dec046b0c1be541R80) for an example. Additionally if the elements of `init_writer`, `sample_writer`, and `diagnostic_writer` each point to unique output they will be threadsafe as well.
 
@@ -99,6 +86,12 @@ Upstream packages can generate `init` and `init_inv_metric` as they wish, though
 
 If the user specifies their init as `{file_name}.{file_ending}` with an input `id` of `N` and chains `M` then the program will search for `{file_name}_{N..(N + M)}.{file_ending}` where `N..(N + M)` is a linear integer sequence from `N` to `N + M`. If the program fails to find any of the `{file_name}_{N..(N + M)}.{file_ending}` it will then search for `{file_name}.{file_ending}` and if found will use that. Otherwise an exception will occur.
 
+For example, if a user specifies `chains=4`, `id=2`, and their init file as `init=init.data.R` then the program
+will first search for `init.data_2.R`. If it finds that file it will then search for `init.data_3.R`,
+`init.data_4.R`, and `init.data_5.R` and will fail if any of them is missing. If the program fails to find `init.data_2.R` then it will attempt
+to find `init.data.R` and if successful will use these initial values for all chains. If neither
+is found then an error will be thrown.
+
 Documentation must be added to clarify reproducibility between a multi-chain program and running multiple chains across several programs. This requires
 
 1. Using the same random seed for the multi-chain program and each program running a chain.
@@ -124,17 +117,3 @@ examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.
 [drawbacks]: #drawbacks
 
 This does add overhead to existing implimentations in managing the per chain IO.
-
-
-### Open Questions
-
-The main open question is whether to recommend upstream users of services to generate N models or a single model
-whenever a Stan program uses `*_rng()` functions in transformed data for methods such as Simulation Based Calibration.
-With 1 model the transformed data will be shared across all chains. With SBC we commonly want to run multiple
-data sets and the question is whether we want multiple chains over one dataset or a chain for each data set.
-If we would like to have multiple models in one program if the user uses an `*_rng()` there is a [`stanc3 PR`](https://github.com/stan-dev/stanc3/pull/868) to add a method to check whether the user uses an rng function in
-tranformed data. Upstream service users can generate one model, then ask it if an rng is used in transformed data
-to decide whether it wants to generate N more models.
-
-Personally, I think it makes since to run multiple chains for each generated dataset (having 1 model).
-This makes sense to me as we can check for recovery of parameters given K datasets and N chains per dataset.

From d1ea3f41ec5c403f07bd67f22e074f71eece6c90 Mon Sep 17 00:00:00 2001
From: Steve Bronder <stevo15025@gmail.com>
Date: Tue, 11 May 2021 10:36:52 -0400
Subject: [PATCH 6/7] update docs for PRNG

---
 designs/0020-parallel-chain-api.md | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/designs/0020-parallel-chain-api.md b/designs/0020-parallel-chain-api.md
index 9c89317..672fef6 100644
--- a/designs/0020-parallel-chain-api.md
+++ b/designs/0020-parallel-chain-api.md
@@ -80,6 +80,22 @@ The services API on the backend has a prototype implementation found [here](http
 
 Then a [`tbb::parallel_for()`](https://github.com/stan-dev/stan/blob/147fba5fb93aa007ec42744a36d97cc84c291945/src/stan/services/sample/hmc_nuts_dense_e_adapt.hpp#L261) is used to run the each of the samplers.
 
+PRNGs will be initialized as in the following pseudocode, where a constant stride is used to advance each chain's PRNG to its own region of the stream.
+
+```cpp
+inline boost::ecuyer1988 create_rng(unsigned int seed, unsigned int init_chain_id, unsigned int chain_num) {
+  // Initialize L'Ecuyer generator
+  boost::ecuyer1988 rng(seed);
+
+  // Seek generator to disjoint region for each chain
+  static const uintmax_t DISCARD_STRIDE = static_cast<uintmax_t>(1) << 50;
+  rng.discard(DISCARD_STRIDE * (init_chain_id + chain_num - 1));
+  return rng;
+}
+```
+
+The constant stride guarantees that, given the same seed, running multiple chains in one program and running those chains across multiple programs are reproducible with respect to each other, as noted below.
+
 ### Recommended Upstream Initialization
 
 Upstream packages can generate `init` and `init_inv_metric` as they wish, though for cmdstan the prototype follows the following rules for reading user input.
@@ -89,7 +105,7 @@ If the user specifies their init as `{file_name}.{file_ending}` with an input `i
 For example, if a user specifies `chains=4`, `id=2`, and their init file as `init=init.data.R` then the program
 will first search for `init.data_2.R` and if it finds it will then search for `init.data_3.R`,
 `init.data_4.R`, `init.data_5.R` and will fail if all files are not found. If the program fails to find `init.data_2.R` then it will attempt
-to find `init.data.R` and if successfull will use these initial values for all chains. If neither
+to find `init.data.R` and if successful will use these initial values for all chains. If neither
 are found then an error will be thrown.
 
 Documentation must be added to clarify reproducibility between a multi-chain program and running multiple chains across several programs. This requires
@@ -111,9 +127,18 @@ examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.
 examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R chains=1 id=4 random seed=123 output file=output4.csv
 ```
 
+In general, the constant stride allows `N` chains to be split across programs as follows, where `n1 + n2 + n3 + n4 = N`.
+
+```
+seed=848383, id=1, chains=n1
+seed=848383, id=1 + n1, chains=n2
+seed=848383, id=1 + n1 + n2, chains=n3
+seed=848383, id=1 + n1 + n2 + n3, chains=n4
+```
+
 
 
 # Drawbacks
 [drawbacks]: #drawbacks
 
-This does add overhead to existing implimentations in managing the per chain IO.
+This does add overhead to existing implementations in managing the per chain IO.

From fa63256d9757eba22ce1ab68cf5e5ae022ed32e9 Mon Sep 17 00:00:00 2001
From: Steve Bronder <stevo15025@gmail.com>
Date: Tue, 11 May 2021 16:00:58 -0400
Subject: [PATCH 7/7] makes init contexts names to have Ptr in them

---
 designs/0020-parallel-chain-api.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/designs/0020-parallel-chain-api.md b/designs/0020-parallel-chain-api.md
index 672fef6..d4f0fdb 100644
--- a/designs/0020-parallel-chain-api.md
+++ b/designs/0020-parallel-chain-api.md
@@ -42,14 +42,14 @@ int hmc_nuts_dense_e_adapt(
 ```
 
 ```cpp
-template <typename Model, typename InitContext, typename InitInvContext,
+template <typename Model, typename InitContextPtr, typename InitInvContextPtr,
           typename InitWriter, typename SampleWriter, typename DiagnosticWriter>
 int hmc_nuts_dense_e_adapt(
     Model& model,
     size_t num_chains,
     // now vectors
-    const std::vector<InitContext>& init,
-    const std::vector<InitInvContext>& init_inv_metric,
+    const std::vector<InitContextPtr>& init,
+    const std::vector<InitInvContextPtr>& init_inv_metric,
     unsigned int random_seed, unsigned int init_chain_id, double init_radius,
     int num_warmup, int num_samples, int num_thin, bool save_warmup,
     int refresh, double stepsize, double stepsize_jitter, int max_depth,
@@ -64,7 +64,7 @@ int hmc_nuts_dense_e_adapt(
     std::vector<DiagnosticWriter>& diagnostic_writer)
 ```
 
-Additionally the new API has an argument `num_chains` which tells the backend how many chains to run and `init_chain_id` instead of `chain`. `init_chain_id` will be used to generate PRNGs for each chain as `seed + init_chain_id + chain_num` where `chain_num` is the i'th chain being generated. All of the vector inputs must be the same size as `num_chains`. `InitContext` and `InitInvContext` must have a valid `operator*` which returns back a reference to a class derived from `stan::io::var_context`.
+Additionally, the new API has an argument `num_chains`, which tells the backend how many chains to run, and takes `init_chain_id` instead of `chain`. `init_chain_id` will be used to generate the PRNG for each chain: each PRNG is seeded with `random_seed` and advanced by a constant stride times `init_chain_id + chain_num - 1`, where `chain_num` is the i'th chain being generated. All of the vector inputs must have size `num_chains`. `InitContextPtr` and `InitInvContextPtr` must have a valid `operator*` which returns a reference to a class derived from `stan::io::var_context`.
 
 The elements of the vectors for `init`, `init_inv_metric`, `interrupt`, `logger`, `init_writer`, `sample_writer`, and `diagnostic_writer` must be threadsafe. `init` and `init_inv_metric` are only read from so should be threadsafe by default. Any of the writers which write to `std::cout` are safe by the standard, though it is recommended to write any output to an local `std::stringstream` and then pass the fully constructed output so that thread outputs are not mixed together. See the code [here](https://github.com/stan-dev/stan/pull/3033/files#diff-ab5eb0683288927defb395f1af49548c189f6e7ab4b06e217dec046b0c1be541R80) for an example. Additionally if the elements of `init_writer`, `sample_writer`, and `diagnostic_writer` each point to unique output they will be threadsafe as well.