Optimize MesaData.read_log_data() and MesaData.remove_backups() #22
Hi Bill,
I've been using mesa_reader to handle very large grids of stars (~1000 individual runs adding up to tens of GiB) for my work, and I felt tempted to optimize the file loading/parsing procedure.
Using `numpy.genfromtxt()` (or `numpy.loadtxt()`, for that matter) runs into pitfalls for large files, as it parses each line in Python and concatenates the records into lists before forming the ndarray at the end. Having unknown data types at runtime adds extra overhead (as in `genfromtxt`). I switched it to use `pandas.read_csv()`, which is substantially faster. I wrote a simple parser for the first data line to determine the data type of each column, so it should handle floats, ints, NaNs, and logicals just fine.
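Sketched out, the approach looks something like this (a minimal sketch, not the code in this PR itself: `infer_dtype` is a hypothetical helper, the line offsets assume the standard MESA history layout with column names on line 6 and data from line 7, and each column is assumed to keep a single type for the whole file):

```python
import numpy as np
import pandas as pd


def infer_dtype(token):
    """Guess a column's dtype from its token on the first data line
    (hypothetical helper). Ints and floats (including NaN) map to numpy
    dtypes; logicals and anything else are left as strings."""
    for caster, dtype in ((int, np.int64), (float, np.float64)):
        try:
            caster(token)
            return dtype
        except ValueError:
            continue
    return str


def read_log_data(file_name, bulk_names_line=6):
    """Sketch of the pandas-based reader. Column names are assumed to
    sit on line `bulk_names_line` (1-based), with data on the lines
    after it."""
    # Peek at the names line and the first data line to fix the dtypes.
    with open(file_name) as f:
        lines = [f.readline() for _ in range(bulk_names_line + 1)]
    names = lines[bulk_names_line - 1].split()
    first_row = lines[bulk_names_line].split()
    dtypes = {n: infer_dtype(t) for n, t in zip(names, first_row)}
    # Let pandas do the bulk parsing in C with known dtypes.
    return pd.read_csv(
        file_name,
        sep=r'\s+',                # whitespace-delimited columns
        skiprows=bulk_names_line,  # skip the header block and names line
        names=names,
        dtype=dtypes,
    )
```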
Similarly, using `pandas.DataFrame.drop_duplicates()` in the `remove_backups` method gives a modest speed increase. As far as my testing has found, the output should be exactly the same, but I do not know how it would handle, say, an incomplete line (if you managed to open a log file mid-write).
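The `remove_backups` change amounts to something like this (again a sketch: it assumes the bulk data sits in a DataFrame with a `model_number` column, and that every row superseded by a restart reappears as an exact `model_number` duplicate later in the file):

```python
import pandas as pd


def remove_backups(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the DataFrame-based backup removal. After a restart,
    MESA rewrites earlier model numbers, so keeping only the last
    occurrence of each model_number discards the superseded rows."""
    deduped = df.drop_duplicates(subset='model_number', keep='last')
    return deduped.reset_index(drop=True)
```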
Some simple profiling with a test grid (84 history files, ~600 MiB) shows a pretty good speed increase, especially if you have an SSD and are not limited by storage speed:
- `genfromtxt` method (last commit): *(profiling output omitted)*
- `read_csv` method (this PR): *(profiling output omitted)*
For my particular test and system, the speedup is around 400%. :)
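A comparison along these lines can be reproduced with something like the following (the glob pattern is hypothetical; point it at your own grid of history files):

```python
import glob
import time

import numpy as np
import pandas as pd

# Hypothetical grid layout; adjust the pattern to your own runs.
history_files = glob.glob('test_grid/*/LOGS/history.data')

start = time.perf_counter()
for path in history_files:
    np.genfromtxt(path, skip_header=5, names=True)
print(f'genfromtxt: {time.perf_counter() - start:.1f} s')

start = time.perf_counter()
for path in history_files:
    pd.read_csv(path, sep=r'\s+', header=5)
print(f'read_csv:   {time.perf_counter() - start:.1f} s')
```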
However, this approach does require adding pandas as a dependency, which may not be desirable.