Bench #1697

Conversation
Code Review
This pull request adds a collection of benchmarking and testing scripts. My review focuses on improving code quality, fixing bugs, and increasing efficiency. Key findings include a critical bug in a time calculation, several instances of inefficient client creation within loops, and various opportunities for code cleanup such as removing unused imports and commented-out code. I've also suggested improvements for better script structure and more informative benchmark reporting.
I was unable to create individual review comments, so all of my feedback is collected below.
parallel_downloader.py (60)
There is a critical bug in the time calculation due to operator precedence. start_time / 10**9 is calculated before the subtraction. You need to wrap end_time - start_time in parentheses to get the correct duration.
f"\nFinished all download attempts for {num_objects} objects: took - {(end_time - start_time) / 10**9}s"
parallel_downloader.py (16)
Creating a new AsyncGrpcClient for each object download is inefficient as it involves setting up a new connection each time. The client should be created once per worker process and reused for all downloads within that process. Consider creating the client in download_worker and passing it to this function.
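A minimal sketch of the suggested restructuring (the AsyncGrpcClient constructor and the helper names are assumed from context, not taken verbatim from the PR):

    import asyncio

    # AsyncGrpcClient is assumed to be imported as in the PR's modules.

    def download_worker(object_names):
        # One event loop, and one client, per worker process.
        asyncio.run(_download_all(object_names))

    async def _download_all(object_names):
        client = AsyncGrpcClient()  # hypothetical constructor; built once, reused
        await asyncio.gather(
            *(download_object(client, name) for name in object_names)
        )

    async def download_object(client, object_name):
        # Receives the shared client instead of opening a new connection per object.
        ...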
threaded_downloader.py (14)
Creating a new AsyncGrpcClient for each download is inefficient. Since each download_worker runs in its own thread and creates its own event loop, the client should be created once per thread in download_worker and passed to download_object_async.
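Sketched under the same assumptions, but for the per-thread event loop this file uses:

    import asyncio

    def download_worker(object_names):
        # Each thread owns its own event loop; create the client once inside it.
        loop = asyncio.new_event_loop()
        try:
            loop.run_until_complete(_download_all(object_names))
        finally:
            loop.close()

    async def _download_all(object_names):
        client = AsyncGrpcClient()  # hypothetical; one client per thread
        for name in object_names:
            await download_object_async(client, name)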
tests/perf/microbenchmarks/test_reads.py (66)
The output_buffer is initialized here and then immediately re-initialized on line 69. This first initialization is redundant and can be removed.
parallel_uploader.py (17)
Creating a new AsyncGrpcClient for each object upload is inefficient. This creates a new connection for every task. The client should be instantiated once per worker process in upload_worker and passed to this function to be reused.
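The same once-per-worker pattern from the downloader sketches above applies here; condensed and equally hypothetical:

    async def _upload_all(object_names):
        client = AsyncGrpcClient()  # hypothetical; created once per worker process
        await asyncio.gather(
            *(upload_object(client, name) for name in object_names)
        )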
parallel_downloader.py (23)
The object size is hardcoded here. It would be better to define it as a constant at the top of the file (e.g., OBJECT_SIZE = 100 * 1024 * 1024) for better readability and maintainability.
parallel_uploader.py (31)
The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration.
async_tasks_downloader.py (65)
There is a typo in FINSHED. It should be FINISHED.
f"FINISHED: total bytes downloaded - {num_objects*OBJECT_SIZE} in time {(end_time - start_time) / (10**9)}s"
move_blob_test.py (3-9)
The script's logic is executed at the module level. It's a best practice to encapsulate the main logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere and improves reusability and testability.
def main():
    gcs = storage.Client()
    bucket = gcs.bucket("chandrasiri-us-west1-hns-soft-del")
    # print(bucket.name)
    blob = bucket.blob("test/blob.csv")
    blob.upload_from_string("")
    print("Uploaded blob:", blob.name)
    bucket.move_blob(blob, new_name="test/blob2.csv")

if __name__ == "__main__":
    main()
move_blob_test.py (4)
The bucket name is hardcoded. For better portability and reusability, consider defining it as a constant at the top of the file or passing it as a command-line argument.
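For the command-line route, a minimal argparse sketch (flag name and wiring are illustrative):

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description="move_blob smoke test")
        parser.add_argument(
            "--bucket",
            default="chandrasiri-us-west1-hns-soft-del",  # the currently hardcoded name
            help="GCS bucket to run the test against.",
        )
        return parser.parse_args()

    # main() would then call gcs.bucket(parse_args().bucket) instead of the literal.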
parallel_downloader.py (3)
The os module is imported but never used. It should be removed.
async_tasks_downloader.py (4)
The ThreadPoolExecutor is imported but not used in this file. It should be removed to keep the code clean.
async_tasks_downloade_mp.py (4)
The ThreadPoolExecutor is imported but never used in this file. It's good practice to remove unused imports to keep the code clean.
parallel_downloader.py (34)
The bucket name is hardcoded. Consider defining this as a constant at the top of the file to make it easier to change.
async_tasks_downloade_mp.py (83-84)
There's a typo in 'Throuput'. It should be 'Throughput'.
Additionally, the throughput calculation uses 10**-6 to convert from bytes to megabytes, which assumes 1 MB = 1,000,000 bytes. However, OBJECT_SIZE is defined using binary prefixes (1024*1024). For consistency, you should use (1024*1024) for the conversion to MiB/s.
f"Throughput: {num_object*OBJECT_SIZE /((end_time_proc - start_time_proc) / (10**9))/(1024*1024)} MiB/s"
async_tasks_downloade_mp.py (76)
The results variable is assigned but its value is never used. It should be removed to avoid confusion.
pool.starmap(async_runner, args)
async_tasks_downloader.py (36)
This docstring is empty. Please add a description of what the function does.
"""Creates and runs asyncio tasks to download a range of objects."""
random_number.py (12-20)
The script's logic is executed at the module level. It's a best practice to encapsulate this logic within a main() function and call it under an if __name__ == "__main__": guard. This prevents the code from running when the module is imported elsewhere.
def main():
    # Generate 1000 unique IDs
    # A set is the easiest way to guarantee uniqueness in the batch.
    request_ids = set()
    while len(request_ids) < 1000:
        request_ids.add(generate_random_64bit_int())
    # You can convert it to a list if needed
    id_list = list(request_ids)
    print(f"Generated {len(id_list)} unique 64-bit IDs.")
    print("First 5 IDs:", id_list[:5])

if __name__ == "__main__":
    main()
tests/perf/microbenchmarks/test_reads.py (16-21)
These modules (math, google.api_core.exceptions, google.cloud.storage.blob.Blob) are imported but not used in the file. They should be removed to keep the code clean.
tests/perf/microbenchmarks/test_reads.py (42)
The comment indicates the object size is 1 GiB, but 100 * (1024 ** 2) is 100 MiB. The comment should be corrected to avoid confusion.
OBJECT_SIZE = 100 * (1024 ** 2) # 100 MiB
async_tasks_downloade_mp.py (71-73)
These lines appear to be leftover debugging code. They should be removed before merging to keep the codebase clean.
tests/perf/microbenchmarks/test_reads.py (102-109)
This docstring appears to be a list of implementation notes rather than a description of the function's purpose. It should be updated to be a proper docstring that explains what my_setup does, its parameters, and what it returns.
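As a shape for the rewrite (Google-style; the signature, summary, and field descriptions below are placeholders to be filled in with what my_setup actually does):

    def my_setup(request):  # signature illustrative; keep the real one
        """One-line summary of what my_setup prepares for the read benchmarks.

        Args:
            request: What this parameter controls.

        Returns:
            The state handed to the benchmarked function (e.g., client,
            bucket, object names).
        """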
tests/perf/microbenchmarks/test_reads.py (210)
The benchmark summary prints 'N/A' for standard deviation. pytest-benchmark calculates this value and it's available in benchmark.stats['stddev']. It would be more informative to include this statistic in the report. Note that the other metrics are in MiB/s, so you may need to convert the time-based standard deviation or adjust the table header.
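One way this could look, assuming the dict-style access named above also works for the mean, and converting the time spread to MiB/s to first order so the column matches the other figures:

    mean_s = benchmark.stats["mean"]      # seconds
    stddev_s = benchmark.stats["stddev"]  # seconds
    # Throughput T = size/t, so to first order sigma_T ~= size * sigma_t / mean_t**2.
    stddev_mib_s = (OBJECT_SIZE / (1024 * 1024)) * stddev_s / mean_s**2
    print(f"stddev: {stddev_mib_s:.2f} MiB/s")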
tests/perf/microbenchmarks/test_reads.py (219-234)
This large block of commented-out code should be removed to improve readability and reduce clutter.
async_tasks_downloade_mp.py (38)
This docstring is empty. Please add a meaningful description of what the download_objects_pool function does, its parameters, and what it returns, or remove the docstring if it's not needed.
"""Downloads a pool of objects asynchronously within a single process."""
threaded_downloader.py (32)
The bucket name is hardcoded. It's better to define this as a constant at the top of the file for easier configuration and maintenance.