Bench #1697

Conversation
Code Review
This pull request adds a collection of benchmarking and testing scripts. My review focuses on improving code quality, fixing bugs, and increasing efficiency. Key findings include a critical bug in a time calculation, several instances of inefficient client creation within loops, and various opportunities for code cleanup such as removing unused imports and commented-out code. I've also suggested improvements for better script structure and more informative benchmark reporting.
I was unable to create individual review comments, so all of my feedback is collected below.
parallel_downloader.py (60)
There is a critical bug in the time calculation due to operator precedence. start_time / 10**9 is calculated before the subtraction. You need to wrap end_time - start_time in parentheses to get the correct duration.
f"\nFinished all download attempts for {num_objects} objects: took - {(end_time - start_time) / 10**9}s"
parallel_downloader.py (16)
Creating a new AsyncGrpcClient for each object download is inefficient as it involves setting up a new connection each time. The client should be created once per worker process and reused for all downloads within that process. Consider creating the client in download_worker and passing it to this function.
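A minimal sketch of the suggested restructuring (the AsyncGrpcClient constructor and the helper names are assumed from context, not taken verbatim from the PR):

    import asyncio

    # AsyncGrpcClient is assumed to be imported as in the PR's modules.

    def download_worker(object_names):
        # One event loop, and one client, per worker process.
        asyncio.run(_download_all(object_names))

    async def _download_all(object_names):
        client = AsyncGrpcClient()  # hypothetical constructor; built once, reused
        await asyncio.gather(
            *(download_object(client, name) for name in object_names)
        )

    async def download_object(client, object_name):
        # Receives the shared client instead of opening a new connection per object.
        ...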
threaded_downloader.py (14)
Creating a new AsyncGrpcClient for each download is inefficient. Since each download_worker runs in its own thread and creates its own event loop, the client should be created once per thread in download_worker and passed to download_object_async.
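Sketched under the same assumptions, but for the per-thread event loop this file uses:

    import asyncio

    def download_worker(object_names):
        # Each thread owns its own event loop; create the client once inside it.
        loop = asyncio.new_event_loop()
        try:
            loop.run_until_complete(_download_all(object_names))
        finally:
            loop.close()

    async def _download_all(object_names):
        client = AsyncGrpcClient()  # hypothetical; one client per thread
        for name in object_names:
            await download_object_async(client, name)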
tests/perf/microbenchmarks/test_reads.py (66)
The output_buffer is initialized here and then immediately re-initialized on line 69. This first initialization is redundant and can be removed.
parallel_uploader.py (17)
Creating a new AsyncGrpcClient for each object upload is inefficient. This creates a new connection for every task. The client should be instantiated once per worker process in upload_worker and passed to this function to be reused.
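The same once-per-worker pattern from the downloader sketches above applies here; condensed and equally hypothetical:

    async def _upload_all(object_names):
        client = AsyncGrpcClient()  # hypothetical; created once per worker process
        await asyncio.gather(
            *(upload_object(client, name) for name in object_names)
        )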
parallel_downloader.py (23)
The object size is hardcoded here. It would be better to define it as a constant at the top of the file (e.g., OBJECT_SIZE = 100 * 1024 * 1024) for better readability and maintainability.
parallel_uploader.py (31)
The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration.
async_tasks_downloader.py (65)
There is a typo in FINSHED. It should be FINISHED.
f"FINISHED: total bytes downloaded - {num_objects*OBJECT_SIZE} in time {(end_time - start_time) / (10**9)}s"
move_blob_test.py (3-9)
The script's logic is executed at the module level. It's a best practice to encapsulate the main logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere and improves reusability and testability.
def main():
    gcs = storage.Client()
    bucket = gcs.bucket("chandrasiri-us-west1-hns-soft-del")
    # print(bucket.name)
    blob = bucket.blob("test/blob.csv")
    blob.upload_from_string("")
    print("Uploaded blob:", blob.name)
    bucket.move_blob(blob, new_name="test/blob2.csv")

if __name__ == "__main__":
    main()
move_blob_test.py (4)
The bucket name is hardcoded. For better portability and reusability, consider defining it as a constant at the top of the file or passing it as a command-line argument.
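For the command-line route, a minimal argparse sketch (flag name and wiring are illustrative):

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description="move_blob smoke test")
        parser.add_argument(
            "--bucket",
            default="chandrasiri-us-west1-hns-soft-del",  # the currently hardcoded name
            help="GCS bucket to run the test against.",
        )
        return parser.parse_args()

    # main() would then call gcs.bucket(parse_args().bucket) instead of the literal.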
parallel_downloader.py (3)
The os module is imported but never used. It should be removed.
async_tasks_downloader.py (4)
The ThreadPoolExecutor is imported but not used in this file. It should be removed to keep the code clean.
async_tasks_downloade_mp.py (4)
The ThreadPoolExecutor is imported but never used in this file. It's good practice to remove unused imports to keep the code clean.
parallel_downloader.py (34)
The bucket name is hardcoded. Consider defining this as a constant at the top of the file to make it easier to change.
async_tasks_downloade_mp.py (83-84)
There's a typo in 'Throuput'. It should be 'Throughput'.
Additionally, the throughput calculation uses 10**-6 to convert from bytes to megabytes, which assumes 1 MB = 1,000,000 bytes. However, OBJECT_SIZE is defined using binary prefixes (1024*1024). For consistency, you should use (1024*1024) for the conversion to MiB/s.
f"Throughput: {num_object*OBJECT_SIZE /((end_time_proc - start_time_proc) / (10**9))/(1024*1024)} MiB/s"
async_tasks_downloade_mp.py (76)
The results variable is assigned but its value is never used. It should be removed to avoid confusion.
pool.starmap(async_runner, args)
async_tasks_downloader.py (36)
This docstring is empty. Please add a description of what the function does.
"""Creates and runs asyncio tasks to download a range of objects."""
random_number.py (12-20)
The script's logic is executed at the module level. It's a best practice to encapsulate this logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere.
def main():
    # Generate 1000 unique IDs
    # A set is the easiest way to guarantee uniqueness in the batch.
    request_ids = set()
    while len(request_ids) < 1000:
        request_ids.add(generate_random_64bit_int())
    # You can convert it to a list if needed
    id_list = list(request_ids)
    print(f"Generated {len(id_list)} unique 64-bit IDs.")
    print("First 5 IDs:", id_list[:5])

if __name__ == "__main__":
    main()
tests/perf/microbenchmarks/test_reads.py (16-21)
These modules (math, google.api_core.exceptions, google.cloud.storage.blob.Blob) are imported but not used in the file. They should be removed to keep the code clean.
tests/perf/microbenchmarks/test_reads.py (42)
The comment indicates the object size is 1 GiB, but 100 * (1024 ** 2) is 100 MiB. The comment should be corrected to avoid confusion.
OBJECT_SIZE = 100 * (1024 ** 2) # 100 MiB
async_tasks_downloade_mp.py (71-73)
These lines appear to be leftover debugging code. They should be removed before merging to keep the codebase clean.
tests/perf/microbenchmarks/test_reads.py (102-109)
This docstring appears to be a list of implementation notes rather than a description of the function's purpose. It should be updated to be a proper docstring that explains what my_setup does, its parameters, and what it returns.
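As a shape for the rewrite (Google-style; the signature, summary, and field descriptions below are placeholders to be filled in with what my_setup actually does):

    def my_setup(request):  # signature illustrative; keep the real one
        """One-line summary of what my_setup prepares for the read benchmarks.

        Args:
            request: What this parameter controls.

        Returns:
            The state handed to the benchmarked function (e.g., client,
            bucket, object names).
        """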
tests/perf/microbenchmarks/test_reads.py (210)
The benchmark summary prints 'N/A' for standard deviation. pytest-benchmark calculates this value and it's available in benchmark.stats['stddev']. It would be more informative to include this statistic in the report. Note that the other metrics are in MiB/s, so you may need to convert the time-based standard deviation or adjust the table header.
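One way this could look, assuming the dict-style access named above also works for the mean, and converting the time spread to MiB/s to first order so the column matches the other figures:

    mean_s = benchmark.stats["mean"]      # seconds
    stddev_s = benchmark.stats["stddev"]  # seconds
    # Throughput T = size/t, so to first order sigma_T ~= size * sigma_t / mean_t**2.
    stddev_mib_s = (OBJECT_SIZE / (1024 * 1024)) * stddev_s / mean_s**2
    print(f"stddev: {stddev_mib_s:.2f} MiB/s")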
tests/perf/microbenchmarks/test_reads.py (219-234)
This large block of commented-out code should be removed to improve readability and reduce clutter.
async_tasks_downloade_mp.py (38)
This docstring is empty. Please add a meaningful description of what the download_objects_pool function does, its parameters, and what it returns, or remove the docstring if it's not needed.
"""Downloads a pool of objects asynchronously within a single process."""
threaded_downloader.py (32)
The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration and maintenance.