Optimize MesaData.read_log_data() and MesaData.remove_backups() #22
Hi Bill,
I've been using mesa_reader to handle very large grids of stars (~1000 individual runs adding up to tens of GiB) for my work, and I felt tempted to optimize the file loading/parsing procedure.
Using `numpy.genfromtxt()` (or `numpy.loadtxt()`, for that matter) runs into pitfalls for large files, as it parses each line in Python and concatenates the records into lists before forming the ndarray at the end. Having unknown data types at runtime adds extra overhead (as in `genfromtxt`). I switched it to use `pandas.read_csv()`, which is substantially faster. I wrote a simple parser for the first data line to determine the data type of each column, so it should handle floats, ints, NaNs, and logicals just fine.
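Sketched out, the approach looks something like this (a minimal sketch, not the code in this PR itself: `infer_dtype` is a hypothetical helper, the line offsets assume the standard MESA history layout with column names on line 6 and data from line 7, and each column is assumed to keep a single type for the whole file):

```python
import numpy as np
import pandas as pd


def infer_dtype(token):
    """Guess a column's dtype from its token on the first data line
    (hypothetical helper). Ints and floats (including NaN) map to numpy
    dtypes; logicals and anything else are left as strings."""
    for caster, dtype in ((int, np.int64), (float, np.float64)):
        try:
            caster(token)
            return dtype
        except ValueError:
            continue
    return str


def read_log_data(file_name, bulk_names_line=6):
    """Sketch of the pandas-based reader. Column names are assumed to
    sit on line `bulk_names_line` (1-based), with data on the lines
    after it."""
    # Peek at the names line and the first data line to fix the dtypes.
    with open(file_name) as f:
        lines = [f.readline() for _ in range(bulk_names_line + 1)]
    names = lines[bulk_names_line - 1].split()
    first_row = lines[bulk_names_line].split()
    dtypes = {n: infer_dtype(t) for n, t in zip(names, first_row)}
    # Let pandas do the bulk parsing in C with known dtypes.
    return pd.read_csv(
        file_name,
        sep=r'\s+',                # whitespace-delimited columns
        skiprows=bulk_names_line,  # skip the header block and names line
        names=names,
        dtype=dtypes,
    )
```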
Similarly, using `pandas.DataFrame.drop_duplicates()` in the `remove_backups` method gives a modest speed increase. As far as my testing has found, the output should be exactly the same, but I do not know how it would handle, say, an incomplete line (if you managed to open a log file mid-write).
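The `remove_backups` change amounts to something like this (again a sketch: it assumes the bulk data sits in a DataFrame with a `model_number` column, and that every row superseded by a restart reappears as an exact `model_number` duplicate later in the file):

```python
import pandas as pd


def remove_backups(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the DataFrame-based backup removal. After a restart,
    MESA rewrites earlier model numbers, so keeping only the last
    occurrence of each model_number discards the superseded rows."""
    deduped = df.drop_duplicates(subset='model_number', keep='last')
    return deduped.reset_index(drop=True)
```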
Some simple profiling with a test grid (84 history files, ~600 MiB) shows a pretty good speed increase, especially if you have an SSD and are not limited by storage speed:
- `genfromtxt` method (last commit): *(profiling output omitted)*
- `read_csv` method (this PR): *(profiling output omitted)*
For my particular test and system, the speedup is around 400%. :)
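A comparison along these lines can be reproduced with something like the following (the glob pattern is hypothetical; point it at your own grid of history files):

```python
import glob
import time

import numpy as np
import pandas as pd

# Hypothetical grid layout; adjust the pattern to your own runs.
history_files = glob.glob('test_grid/*/LOGS/history.data')

start = time.perf_counter()
for path in history_files:
    np.genfromtxt(path, skip_header=5, names=True)
print(f'genfromtxt: {time.perf_counter() - start:.1f} s')

start = time.perf_counter()
for path in history_files:
    pd.read_csv(path, sep=r'\s+', header=5)
print(f'read_csv:   {time.perf_counter() - start:.1f} s')
```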
However, this approach does require adding pandas as a dependency, which may not be desirable.