Dev chunk optimization postprocessveppanel #390
Open
migrau wants to merge 42 commits into dev from dev-chunk-optimization-POSTPROCESSVEPPANEL
+567
−263
42 commits
035a0c7 dev: VEP chunk and VEP cache beegfs (migrau)
8ef2919 fix: use standard cache for ENSEMBLVEP_VEP (migrau)
40bb507 perf: improve VEP performance by converting input format (migrau)
bb21b25 fix: panel_postprocessing_annotation.py (migrau)
7c73d3b fix: arguments safe_transform_context (migrau)
276152d perf: chunking panel_custom_processing.py (migrau)
7bc3a16 perf: CREATECAPTUREDPANELS containers edited. create_panel_versions.p… (migrau)
346665d fix: python3 container for CREATECAPTUREDPANELS (migrau)
08d8fad fix: remove container option CREATECAPTUREDPANELS. fix conda versions… (migrau)
5c8ff55 fix: typo CREATECAPTUREDPANELS (migrau)
891ec85 fix: wave true only for CREATECAPTUREDPANELS (migrau)
e1fd6af fix: syntax config module CREATECAPTUREDPANELS (migrau)
ca0ae01 fix: new way to specify wave for a single process (migrau)
5560c25 fix: toString added for wave (migrau)
c0c3e97 fix: wave label added (migrau)
24efcf6 fix: wave true for everything (migrau)
7734938 fix: wave false except CREATECAPTUREDPANELS (migrau)
b625332 fix: comma... (migrau)
8110a34 fix: wave removed. New container created (migrau)
e718e41 fix: Removed wave from nextflow.config (migrau)
9fd0ed7 fix: adjust memory requeriments (migrau)
abc85ed perf: added new profile, nanoseq (migrau)
3e0b4b5 fix: naming withLabel config review (migrau)
61ec864 fix: nanoseq config resourceLimits (migrau)
0188172 fix: correct withName * (migrau)
b0e422a fix: SITESFROMPOSITIONS memory test (migrau)
63dcea7 fix SITESFROMPOSITIONS (migrau)
7c2f56b fix: SITESFROMPOSITIONS (migrau)
6e53f23 fix: fix profile (migrau)
e9d1b3b fix: SITESFROMPOSITIONS config (migrau)
1dffd94 fix: POSTPROCESSVEPPANEL. Time (migrau)
24b170a fix: RESOURCE LIMITS added (migrau)
d243ebc fix: typo (migrau)
945c129 fix: update base.config (migrau)
198ff20 fix: adjust nanoconfig (migrau)
0cfd80f Merge branch 'dev' into dev-chunk-optimization-POSTPROCESSVEPPANEL (migrau)
6c64f4d fix: parallelization optional. Include sort for bedtools merge (migrau)
b2f12fd fix: gene omega error: "No flagged entries found; skipping plots and … (migrau)
d4ed3c2 fix: Add debug logging and ensure failing_consensus file is always cr… (migrau)
4be3b45 feat: Add chunking support for SITESFROMPOSITIONS with genomic sorting (migrau)
e52cb76 feat: add parallel_processing_parameters section to schema for chunki… (migrau)
92580ce update dnds genes list (FerriolCalvet)
    @@ -17,29 +17,60 @@
     }


    +def load_chr_data_chunked(filepath, chrom, chunksize=1_000_000):
    +    """
    +    Loads data for a specific chromosome from a large VEP output file in chunks.
    +
    +    Args:
    +        filepath (str): Path to the VEP output file.
    +        chrom (str): Chromosome to filter.
    +        chunksize (int): Number of rows per chunk.
    +
    +    Returns:
    +        pd.DataFrame: Filtered DataFrame for the chromosome.
    +    """
    +    reader = pd.read_csv(filepath, sep="\t", na_values=custom_na_values, chunksize=chunksize, dtype={'CHROM': str})
    +    chr_data = []
    +    for chunk in reader:
    +        filtered = chunk[chunk["CHROM"] == chrom]
    +        if not filtered.empty:
    +            chr_data.append(filtered)
    +    return pd.concat(chr_data) if chr_data else pd.DataFrame()
    +
    +
     def customize_panel_regions(VEP_output_file, custom_regions_file, customized_output_annotation_file,
    -                            simple = True
    +                            simple = True,
    +                            chr_chunk_size = 1_000_000
                                 ):
         """
    -    # TODO
    -    explain what this function does
    +    Modifies annotations in a VEP output file based on custom genomic regions.
    +
    +    - For each region in the custom regions file, identifies the corresponding slice
    +      in the VEP output.
    +    - Updates gene names and impact values for the region.
    +    - Saves both the modified annotation file and a record of added regions.
    +
    +    Args:
    +        VEP_output_file (str): Path to the full VEP output file (TSV).
    +        custom_regions_file (str): Custom region definitions (tab-delimited).
    +        customized_output_annotation_file (str): Output file for updated annotations.
    +        simple (bool): If True, outputs simplified annotations; else adds more fields.
         """

         # simple = ['CHROM', 'POS', 'REF', 'ALT', 'MUT_ID' , 'GENE', 'IMPACT' , 'CONTEXT_MUT', 'CONTEXT']
         # rich = ['CHROM', 'POS', 'REF', 'ALT', 'MUT_ID', 'STRAND', 'GENE', 'IMPACT', 'Feature', 'Protein_position', 'Amino_acids', 'CONTEXT_MUT', 'CONTEXT']
    -    all_possible_sites = pd.read_csv(VEP_output_file, sep = "\t",
    -                                     na_values = custom_na_values)
    -    print("all possible sites loaded")

         custom_regions_df = pd.read_table(custom_regions_file)

         added_regions_df = pd.DataFrame()

         current_chr = ""
    -    for ind, row in custom_regions_df.iterrows():
    +    chr_data = pd.DataFrame()
    +
    +    for _, row in custom_regions_df.iterrows():
             try:
                 if row["CHROM"] != current_chr:
                     current_chr = row["CHROM"]
    -                chr_data = all_possible_sites[all_possible_sites["CHROM"] == current_chr]
    +                chr_data = load_chr_data_chunked(VEP_output_file, current_chr, chunksize=chr_chunk_size)

                     print("Updating chromosome to:", current_chr)

                 # Get start and end indices

    @@ -88,25 +119,25 @@ def customize_panel_regions(VEP_output_file, custom_regions_file, customized_out

                 ## Insert modified rows back into the df
                 if simple:
    -                all_possible_sites.loc[original_df_start: original_df_end, ["GENE", "IMPACT"]] = hotspot_data[["GENE", "IMPACT"]].values
    +                chr_data.loc[original_df_start: original_df_end, ["GENE", "IMPACT"]] = hotspot_data[["GENE", "IMPACT"]].values
                 else:
                     print("Getting Feature to '-'")
                     hotspot_data["Feature"] = '-'
    -                all_possible_sites.loc[original_df_start: original_df_end, ["GENE", "IMPACT", "Feature"]] = hotspot_data[["GENE", "IMPACT", "Feature"]].values
    +                chr_data.loc[original_df_start: original_df_end, ["GENE", "IMPACT", "Feature"]] = hotspot_data[["GENE", "IMPACT", "Feature"]].values


                 added_regions_df = pd.concat((added_regions_df, hotspot_data))
                 print("Small region added:", row["NAME"])

             except Exception as e:
                 print(f"Error processing row {row}: {e}")

    -    all_possible_sites = all_possible_sites.drop_duplicates(subset = ['CHROM', 'POS', 'REF', 'ALT', 'MUT_ID',
    -                                                                      'GENE', 'CONTEXT_MUT', 'CONTEXT', 'IMPACT'],
    -                                                            keep = 'first')
    -    all_possible_sites.to_csv(customized_output_annotation_file,
    -                              header = True,
    -                              index = False,
    -                              sep = "\t")
    +    chr_data = chr_data.drop_duplicates(
    +        subset=['CHROM', 'POS', 'REF', 'ALT', 'MUT_ID', 'GENE', 'CONTEXT_MUT', 'CONTEXT', 'IMPACT'],
    +        keep='first'
    +    )
    +    chr_data.to_csv(customized_output_annotation_file, header=True, index=False, sep="\t")

Comment on lines +135 to +140
Collaborator
I am not sure this does the same as before. The script is supposed to output the same full TSV table, with the values replaced in only some of the rows. Here it seems that only the data from the last chromosome loaded will be written out, but maybe I got it wrong.
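If the concern above holds, one possible way to keep the chunked loading while still writing the full table is to visit every chromosome in the input, not only those with custom regions, and append each one to the output file. The following is a minimal sketch under stated assumptions, not the PR's implementation: `chromosomes_in_file`, `write_customized_table` and the `edit_chromosome` callback are hypothetical names, and the `na_values=custom_na_values` argument used by the real script is left out so the example is self-contained.

```python
import pandas as pd


def load_chr_data_chunked(filepath, chrom, chunksize=1_000_000):
    """Return all rows of one chromosome, reading the TSV in chunks."""
    reader = pd.read_csv(filepath, sep="\t", chunksize=chunksize, dtype={"CHROM": str})
    parts = [chunk[chunk["CHROM"] == chrom] for chunk in reader]
    parts = [part for part in parts if not part.empty]
    return pd.concat(parts) if parts else pd.DataFrame()


def chromosomes_in_file(filepath, chunksize=1_000_000):
    """List chromosomes in order of first appearance, reading only the CHROM column."""
    seen = []
    reader = pd.read_csv(filepath, sep="\t", usecols=["CHROM"],
                         dtype={"CHROM": str}, chunksize=chunksize)
    for chunk in reader:
        for chrom in chunk["CHROM"].unique():
            if chrom not in seen:
                seen.append(chrom)
    return seen


def write_customized_table(vep_file, output_file, edit_chromosome, chunksize=1_000_000):
    """Write every chromosome to output_file, edited or not.

    edit_chromosome(chrom, df) is a hypothetical callback that returns the rows
    for one chromosome, modified where custom regions apply and unchanged otherwise.
    """
    first = True
    for chrom in chromosomes_in_file(vep_file, chunksize):
        chr_data = load_chr_data_chunked(vep_file, chrom, chunksize)
        chr_data = edit_chromosome(chrom, chr_data)
        # The first chromosome creates the file with a header; later ones append.
        chr_data.to_csv(output_file, sep="\t", index=False,
                        header=first, mode="w" if first else "a")
        first = False
```

This keeps only one chromosome in memory at a time, at the cost of one pass over the input per chromosome. The output is grouped by chromosome in order of first appearance, so the row order matches the original table only if the input is already grouped that way.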

         added_regions_df = added_regions_df.drop_duplicates(subset = ['CHROM', 'POS', 'REF', 'ALT', 'MUT_ID',
                                                                       'GENE', 'CONTEXT_MUT', 'CONTEXT', 'IMPACT'],

    @@ -123,8 +154,9 @@ def customize_panel_regions(VEP_output_file, custom_regions_file, customized_out
     @click.option('--custom-regions-file', required=True, type=click.Path(exists=True), help='Input custom regions file (TSV)')
     @click.option('--customized-output-annotation-file', required=True, type=click.Path(), help='Output annotation file (TSV)')
     @click.option('--simple', is_flag=True, help='Use simple annotation')
    -def main(vep_output_file, custom_regions_file, customized_output_annotation_file, simple):
    -    customize_panel_regions(vep_output_file, custom_regions_file, customized_output_annotation_file, simple)
    +@click.option('--chr-chunk-size', type=int, default=1000000, show_default=True, help='Chunk size for per-chromosome loading')
    +def main(vep_output_file, custom_regions_file, customized_output_annotation_file, simple, chr_chunk_size):
    +    customize_panel_regions(vep_output_file, custom_regions_file, customized_output_annotation_file, simple, chr_chunk_size)

     if __name__ == '__main__':
         main()

These changes may not be required, since I already updated the Nextflow module to make the failing consensus file optional.
I would prefer not to generate the file if there is nothing to report.
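A minimal sketch of that preference, assuming the entries to report are held in a pandas DataFrame. The helper name `write_if_any` is hypothetical and not taken from the PR, and the approach only works when the consuming Nextflow output is declared optional, as the comment says it already is.

```python
import pandas as pd


def write_if_any(df, out_path):
    """Write df as a TSV only when it has rows; return True if a file was written."""
    if df.empty:
        # Nothing to report: leave no file behind, so the optional output stays absent.
        return False
    df.to_csv(out_path, sep="\t", index=False)
    return True
```

With this shape, the absence of the file itself signals that nothing failed, and downstream steps do not need to parse an empty table.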