
Conversation

@natir commented Jun 21, 2021

Hi,

In my laboratory we use facets on many whole-genome human datasets, and on this data facets has a huge memory footprint, approximately 150 GiB.

The purpose of this PR is to try to reduce facets' memory usage. To do this, I replace some classic R data.frame objects with the tidyverse tibble data structure, and I also use the tidyverse pipe syntax to perform some operations on these tibbles.
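A minimal sketch of the flavor of change (hypothetical column names, not the actual diff):

library(dplyr)
library(tibble)

# Before: classic data.frame processed step by step; made-up columns,
# not facets code
snps <- data.frame(chrom = c("1", "1", "2"),
                   pos   = c(100, 200, 300),
                   depth = c(30, 0, 25))
snps$keep <- snps$depth > 0
snps <- snps[snps$keep, c("chrom", "pos", "depth")]

# After: a tibble processed in one dplyr pipeline
snps <- tibble(chrom = c("1", "1", "2"),
               pos   = c(100, 200, 300),
               depth = c(30, 0, 25)) %>%
  filter(depth > 0)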

With all these changes, I divide memory usage by 2.

On my test dataset the results are the same between my PR and version v0.6.1, but maybe I missed something.

I'm not a good R developer and may have made some silly mistakes, so if you prefer to just take the idea behind my changes and rewrite them, please do.

Thanks

@veseshan (Collaborator) commented

Can you give me some breakdown of where this memory explosion occurs? My back-of-the-envelope calculation says:

R:> x = rnorm(12e6) # one locus every 250 bases across 3000 megabases
R:> format(object.size(x), units="Mb")
[1] "91.6 Mb"

The jointseg data frame has 16 columns, but even that wouldn't translate to 150 GiB of memory use.

Have you tried using readSnpMatrixDT.R in path/facets/extRfns/ to read in the data?
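For reference, a minimal sketch of a data.table-based reader in that spirit, assuming a snp-pileup style CSV; the actual column handling in readSnpMatrixDT.R may differ:

library(data.table)

# Sketch only: fread is much lighter on memory than read.csv, and the
# read-count matrix keeps just the columns facets needs.
read_pileup <- function(filename) {
  pileup <- fread(filename)
  pileup[, .(Chromosome, Position,
             NOR.DP = File1R + File1A, NOR.RD = File1R,
             TUM.DP = File2R + File2A, TUM.RD = File2R)]
}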

Thanks

@natir (Author) commented Jun 22, 2021

With v0.6.1 the memory peak is during file reading; using readSnpMatrixDT.R, like my change does, solves this issue.

But another peak occurs during preProcSample; I assume it is more specifically in procSnps (some duplication, column creation, a call into Fortran code, and filtering that is not done in place).
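A toy illustration of the copying problem, not facets code: each base-R step below allocates fresh memory while the old objects stay live until garbage collection, so the peak is several times the size of the table itself.

n <- 1e7
snps <- data.frame(pos = seq_len(n), depth = rpois(n, 30))
snps$keep <- snps$depth > 25   # allocates a full-length logical vector
snps <- snps[snps$keep, ]      # allocates a copy of every surviving column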

With v0.6.1 and readSnpMatrixDT.R, memory usage is 85 GiB; my version uses 70 GiB.

@veseshan (Collaborator) commented

Can you tell me how big the pileup matrix is, i.e. how many loci? And how many of them end up in jointseg? Thanks.

@natir (Author) commented Jun 23, 2021

The pileup matrix contains 546,700,164 loci.

To count the jointseg loci, I look at $jointseg in the output produced by procSample; I get 5,583,831 rows.
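A rough size check in the spirit of the back-of-the-envelope above, assuming six double-precision columns for the read-count matrix (the real pileup may use integer columns, which would halve this):

R:> n = 546700164      # loci in the pileup
R:> 6 * 8 * n / 2^30   # six 8-byte columns, in GiB
[1] 24.4394

So two or three live working copies of a table this size already reach the 50 to 75 GiB range, consistent with the figures quoted above.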

@veseshan (Collaborator) commented

Given that the whole genome is around 3 gigabases, the pileup seems to have a locus every 6 bases. That is a lot of redundant data, as neighboring loci will be highly serially correlated. You can DM me if you want to talk about this further.

I will look into how your code can be used to reduce the memory use of procSnps.

Thanks
