@d-maclean d-maclean commented Jan 3, 2025

Hi Bill,

I've been using mesa_reader to handle very large grids of stars (~1000 individual runs adding up to tens of GiB) for my work, and I was tempted to optimize the file loading/parsing procedure.

Using numpy.genfromtxt() (or numpy.loadtxt(), for that matter) runs into pitfalls with large files, because it parses each line in Python and accumulates the records in lists before forming the ndarray at the end. Not knowing the data types ahead of time adds further overhead (as in genfromtxt). I switched to pandas.read_csv(), which is substantially faster. I also wrote a simple parser for the first data line to determine each column's data type, so it should handle floats, ints, NaNs, and logicals just fine.
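For readers who want to see the general idea, here is a minimal sketch of reading a whitespace-delimited table with pandas.read_csv(). It is not the code in this PR: the helper name `read_table`, the sample text, and the number of preamble lines (`skiprows=2`) are all made up for illustration, and a real log will need its own `skiprows` value.

```python
import io

import pandas as pd

# Hypothetical miniature log: two preamble lines, one line of column
# names, then data rows. This layout is an assumption for illustration.
SAMPLE = """\
preamble line one
preamble line two
model_number star_age log_L
1 1.0e3 0.01
2 2.0e3 0.02
3 3.0e3 NaN
"""


def read_table(source, skiprows=2):
    """Read a whitespace-delimited table, letting pandas infer dtypes."""
    # sep=r"\s+" splits on runs of whitespace; skiprows drops the
    # preamble so the next line is used as the header row.
    return pd.read_csv(source, sep=r"\s+", skiprows=skiprows)


df = read_table(io.StringIO(SAMPLE))
print(df.dtypes)
```

In this sketch pandas infers an integer column for `model_number` and float columns for the other two, with the literal `NaN` parsed as a missing value.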

Similarly, implementing pandas.DataFrame.drop_duplicates() in the remove_backups method gives a modest speed increase.
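As a rough illustration of the drop_duplicates() idea (not the actual remove_backups implementation), the sketch below assumes that repeated rows are identified by a `model_number` column and that the most recent row for each model should win. The function name `drop_repeated_models` and the sample data are hypothetical.

```python
import pandas as pd


def drop_repeated_models(df, key="model_number"):
    """Keep only the last row recorded for each value of `key`."""
    # keep="last" retains the most recent occurrence of each duplicate;
    # the index is reset so rows are numbered contiguously again.
    return df.drop_duplicates(subset=key, keep="last").reset_index(drop=True)


# Hypothetical history in which models 2 and 3 were written twice.
history = pd.DataFrame(
    {
        "model_number": [1, 2, 3, 2, 3, 4],
        "star_age": [1.0, 2.0, 3.0, 2.5, 3.5, 4.5],
    }
)
clean = drop_repeated_models(history)
print(clean)
```

Note that this only removes rows whose key is repeated later; rows from an abandoned run whose model numbers are never rewritten would need separate handling.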

As far as my testing has found, the output of this should be exactly the same, but I do not know how it would handle, say, an incomplete line (if you managed to open a log file mid-write).

Some simple profiling with a test grid (84 history files, ~600 MiB) shows a pretty good speed increase, especially if you have an SSD and are not limited by storage speed:

genfromtxt method (last commit)

         47863545 function calls (47838824 primitive calls) in 28.949 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.975   28.975 /home/duncan_m/Projects/sample_history/mesa_test.py:11(test)
       84    0.000    0.000   28.975    0.345 /home/duncan_m/.conda/envs/sci/lib/python3.11/site-packages/mesa_reader/__init__.py:103(__init__)
       84    0.001    0.000   28.975    0.345 /home/duncan_m/.conda/envs/sci/lib/python3.11/site-packages/mesa_reader/__init__.py:152(read_data)
  ---> 84    0.451    0.005   28.974    0.345 /home/duncan_m/.conda/envs/sci/lib/python3.11/site-packages/mesa_reader/__init__.py:187(read_log_data)
...
  ---> 84    0.198    0.002    1.004    0.012 /home/duncan_m/.conda/envs/sci/lib/python3.11/site-packages/mesa_reader/__init__.py:673(remove_backups)

read_csv method (this PR)

         3814731 function calls (3747761 primitive calls) in 6.821 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.821    6.821 /home/duncan_m/Projects/sample_history/mesa_test.py:11(test)
       84    0.000    0.000    6.821    0.081 /home/duncan_m/.conda/envs/sci-test/lib/python3.12/site-packages/mesa_reader/__init__.py:105(__init__)
       84    0.003    0.000    6.821    0.081 /home/duncan_m/.conda/envs/sci-test/lib/python3.12/site-packages/mesa_reader/__init__.py:182(read_data)
  ---> 84    0.007    0.000    6.817    0.081 /home/duncan_m/.conda/envs/sci-test/lib/python3.12/site-packages/mesa_reader/__init__.py:217(read_log_data)
...
  ---> 84    0.121    0.001    0.187    0.002 /home/duncan_m/.conda/envs/sci-test/lib/python3.12/site-packages/mesa_reader/__init__.py:714(remove_backups)

For my particular test and system, that is roughly a 4x speedup (28.9 s down to 6.8 s). :)
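Tables like the ones above, sorted by cumulative time, can be produced with the standard-library cProfile and pstats modules. The following is a minimal sketch rather than the script used for these numbers: `test` here is a stand-in workload, and `profile_call` is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def test():
    # Stand-in workload; a real test would load the history files instead.
    return sum(i * i for i in range(10000))


def profile_call(func, top=10):
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" orders by time spent in a function plus its callees.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_call(test)
print(report)
```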

However, this method does require adding pandas as a dependency, which may not necessarily be desirable.

@d-maclean
Author

Updates:

I found that the heuristic method I used to detect data-types is not as good as the parser provided by pandas. I went ahead and removed that, to keep things simple and reliable. Behavior should, again, be unchanged.

@wmwolf
Owner

wmwolf commented Jan 15, 2025

@d-maclean: this is fantastic! I think we will need to move to pandas anyway since we will hopefully transition to hdf5 for output files (or at least have them as an option), which I believe pandas can open up relatively easily. I'll take a closer look at this in the coming weeks and hopefully issue a new version soon after.
