Remove temporary allocations #39
Conversation
There was an implicit conversion to float64 taking place due to a cast
to a Python list (and hence Python floats). This resulted in a
serious (factor ~10) degradation in performance, which should now be
fixed.
The performance degradation was a result of temporary allocations
performed by pandas when the dtype of a frame was implicitly changed
by updates of the form:
```python
import pandas as pd
import numpy as np
df = pd.DataFrame({"f32": np.ones(2, dtype="float32")})
df.iloc[1:2] = np.float64(1/3)
```
Since pandas 2.1, such operations raise a FutureWarning. All occurrences
of that warning in DEMENTpy are resolved in this commit.
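As a rough sketch of how such warnings can be avoided (illustrative only, not necessarily the exact change made in DEMENTpy): cast the assigned value to the column's existing dtype, so pandas never needs to upcast.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"f32": np.ones(2, dtype="float32")})

# Assigning a value that is already float32 keeps the column dtype intact,
# so no temporary float64 allocation and no FutureWarning is triggered.
df.iloc[1:2] = np.float32(1/3)
assert df["f32"].dtype == np.float32
```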
```diff
  choose_taxa = np.zeros((self.n_taxa,self.gridsize), dtype='int8')
  for i in range(self.n_taxa):
-     choose_taxa[i,:] = np.random.choice([1,0], self.gridsize, replace=True, p=[frequencies[i], 1-frequencies[i]])
+     choose_taxa[i,:] = np.random.binomial(1, frequencies[i], self.gridsize)
```
When I changed the numbers back to "float32", this sampling started failing with an error saying that the probabilities do not sum to 1.0. I presume they must be converted to "float64" before getting summed inside np.random.choice.
I have changed the sampling to binomial, which I believe should be equivalent and avoids the summation errors.
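For context, a small sketch (not part of the PR diff) of why the two samplers are interchangeable here: np.random.binomial(1, p, size) draws independent Bernoulli(p) samples, just like np.random.choice([1, 0], size, p=[p, 1 - p]), but it never builds a probability vector that must sum to exactly 1.0 within float64 tolerance.
```python
import numpy as np

p = np.float32(0.3)   # a float32 frequency, as in choose_taxa
gridsize = 10

# Each element is 1 with probability p and 0 otherwise; no probability
# vector is constructed, so float32 rounding in 1 - p cannot break a sum check.
draws = np.random.binomial(1, p, gridsize)
print(draws)
```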
```diff
  [Enzyme_Loss,
-  Enzyme_Loss.mul(self.Enz_Attrib['N_cost'].tolist()*self.gridsize,axis=0),
-  Enzyme_Loss.mul(self.Enz_Attrib['P_cost'].tolist()*self.gridsize,axis=0)],
+  Enzyme_Loss.mul(np.repeat(self.Enz_Attrib['N_cost'].values, self.gridsize), axis=0),
```
The 'float64' was appearing here: the tolist method returns a list of Python floats, which are double precision.
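A minimal illustration (hypothetical data, not DEMENTpy code) of the difference: .tolist() converts every element to a Python float (double precision), while np.repeat on the underlying .values array keeps the original float32 dtype.
```python
import numpy as np
import pandas as pd

cost = pd.Series(np.array([0.1, 0.2], dtype="float32"))

# .tolist() yields Python floats, so arithmetic with it upcasts to float64.
print(type(cost.tolist()[0]))           # <class 'float'>

# np.repeat on the ndarray preserves float32.
print(np.repeat(cost.values, 3).dtype)  # float32
```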
Builds on #34, related to #22
There was an implicit conversion to float64 taking place due to a cast to a Python list (and hence Python floats). This resulted in a serious (factor ~10) degradation in performance, which should now be fixed.
The problem was ultimately related to how pandas handles partial assignments in `DataFrame`s and `Series`. If the data is assigned to from a higher-precision expression, the `dtype` of the Series may be implicitly changed, which causes a temporary memory allocation (see the example in the commit message above).
Please note that the behaviour is a bit peculiar and perhaps unintuitive: pandas will not change the `dtype` if it is "not necessary", that is, if the new value can be represented exactly in the prior dtype (see the sketch below).
The implicit conversion to float64 was causing some `DataFrame`s to change their dtype to "float64", which in turn caused `Max_Uptake` (in the `grid.uptake` method) to get repeatedly changed from "float32" to "float64", which appears to have been the main performance bottleneck.
Tagging @bioatmosphere explicitly since you were interested during the meeting yesterday ;-)
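The sketch below reconstructs the two cases described above (illustrative values, not the exact snippet from the original description):
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"f32": np.ones(4, dtype="float32")})

# 0.5 is exactly representable in float32, so the upcast is "not necessary"
# and the column dtype stays float32.
df.iloc[0:1] = np.float64(0.5)
print(df["f32"].dtype)   # float32

# 1/3 is not exactly representable in float32, so the column is upcast to
# float64 (raising a FutureWarning since pandas 2.1).
df.iloc[1:2] = np.float64(1/3)
print(df["f32"].dtype)   # float64
```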