Skip to content

Conversation

@MaribelleHGomez
Copy link
Contributor

@MaribelleHGomez MaribelleHGomez commented Mar 20, 2023

Jira Ticket: PXP-10845

New Features

  • Utility file for DSV file manipulation; functions include: collecting file contents into a list, finding file intersection and difference, chunking files, splitting file by headers, merging files

Breaking Changes

Bug Fixes

Improvements

Dependency updates

Deployment changes

@github-actions
Copy link

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.

output_writer.writerow(record)


def convert_type(file_name: str, new_type: str, has_headers=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so is this only changing the extensions? For a tsv to a csv would we not need to replace the tabs to commas?

write(records, new_file_name)


def merge_files(files: Union[str, list], output_file_name: str, has_headers=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have different headers or are different file types? I think we should check for that

Args:
file_name(str):
Filename or file path
Returns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an example in the description here?

has_headers=False,
):
"""
Chunk records into files of input size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could use a better description. Something like "chunk manifests into x different chunks" and maybe in the logs add how many records there will be per manifest.


def chunk(
file_name: str,
chunk_size: int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh actually a fun idea would be adding another input, number of records in a manifest. So you're required to either give a number of chunks or number of records per chunks, do the math and create chunks according to that. If both are given, maybe do the math of wheteher or not those chunk and record numbers per chunk are compatible

if has_headers:
file_one_headers = get_headers(file_one)
file_two_headers = get_headers(file_two)
if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess i dont understand what is going on here.
if we have headers:

A = ["a", "b", "c"]
B = ["b", "c", "a"]

print(A == B) : False
print(sorted(A) == sorted(B)) :True

So i see you're moving the header, i think you should add logs notifying that the columns aren't arranged correctly and we're moving the columns.

if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
file_two_headers
):
list_two = move_columns(file_one_headers, file_two)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a sorted(A) != sorted(B)

else:
list_two = file_to_list(file_two, has_headers)

if not strict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you need to handle cases where filename and file_name are the same thing and this code should be smart enough to figure that out. Luckily there are functions in the sdk that already handles that. Check here. https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/tools/utils.py
Maybe right after we open up the files we standardize the header formats


logging.info(f"Taking intersection between {file_one} & {file_two}...")

list_one = file_to_list(file_one, has_headers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should validation for all manifests that the utils processes.

return [row for row in list_one if row in list_two]


# TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create a ticket for this TODO and link it here as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants