Feat : Multiple download functionality #271
Open
Dhiren-Mhatre
wants to merge
13
commits into
uc-cdis:master
from
Dhiren-Mhatre:feat/multiple-download-performance-testing
eaaf94d
Feat : Multiple download functionality with performance testing
Dhiren-Mhatre 97c29eb
applied feedback
Dhiren-Mhatre fadbaac
removed timeout
Dhiren-Mhatre 4bfcd98
Merge branch 'master' into feat/multiple-download-performance-testing
Avantol13 3679b4b
addressed feedbacks
Dhiren-Mhatre f005a87
added unit tests
Dhiren-Mhatre 3dd6b69
Merge branch 'master' into feat/multiple-download-performance-testing
Avantol13 a78f9be
added docstrings
Dhiren-Mhatre a348aee
Merge branch 'master' into feat/multiple-download-performance-testing
Avantol13 970e77d
fixed test
Dhiren-Mhatre 2516e27
fixed tests
Dhiren-Mhatre 4b371e7
fixed tests
Dhiren-Mhatre ffbf404
version bumped and fixed tests
Dhiren-Mhatre
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,186 @@ | ||
| ## Asynchronous Multiple File Downloads | ||
|
|
||
| The Gen3 SDK provides `async_download_multiple`, an asynchronous download method for fetching large numbers of files with high throughput and bounded memory use. | ||
|
|
||
| ## Overview | ||
|
|
||
| The `async_download_multiple` method implements a hybrid architecture combining: | ||
|
|
||
| - **Multiprocessing**: Multiple Python subprocesses for CPU utilization | ||
| - **Asyncio**: High I/O concurrency within each process | ||
| - **Queue-based memory management**: Efficient handling of large file sets | ||
| - **Just-in-time presigned URL generation**: Optimized authentication flow | ||
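To illustrate the just-in-time idea from the list above, here is a minimal sketch in which each presigned URL is requested immediately before its file is fetched, rather than signing every GUID up front. The helpers `get_presigned_url` and `fetch`, and the `storage.example.invalid` host, are hypothetical stand-ins and not part of the Gen3 SDK; the event log only exists to make the ordering visible.

```python
import asyncio


async def get_presigned_url(guid: str, log: list) -> str:
    """Hypothetical stand-in for the call that exchanges a GUID for a signed URL."""
    log.append(("sign", guid))
    await asyncio.sleep(0)  # placeholder for the network round trip
    return f"https://storage.example.invalid/{guid}?sig=placeholder"


async def fetch(url: str, guid: str, log: list) -> bytes:
    """Hypothetical stand-in for the actual HTTP download."""
    log.append(("fetch", guid))
    await asyncio.sleep(0)
    return guid.encode()


async def download_all(guids, log: list) -> dict:
    """Sign each GUID right before fetching it, so no URL sits unused."""
    results = {}
    for guid in guids:
        url = await get_presigned_url(guid, log)
        results[guid] = await fetch(url, guid, log)
    return results
```

The loop is sequential only to keep the ordering easy to read; the same sign-then-fetch pairing can sit inside each concurrent worker.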
|
|
||
| ## Architecture | ||
|
|
||
| ### Concurrency Model | ||
|
|
||
| The implementation uses a three-tier architecture: | ||
|
|
||
| 1. **Producer Thread**: Feeds GUIDs to worker processes via bounded queues | ||
| 2. **Worker Processes**: Multiple Python subprocesses with asyncio event loops | ||
| 3. **Queue System**: Memory-efficient streaming of work items | ||
|
|
||
| ```text | ||
| # Architecture overview | ||
| Producer Thread (1) → Input Queue (configurable) → Worker Processes (configurable) → Output Queue (configurable) → Results (final) | ||
| ``` | ||
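The producer/queue/worker flow above can be sketched in a single process with `asyncio`. This is a simplified illustration, not the SDK's implementation: it leaves out the multiprocessing layer, and the names `producer`, `worker`, and `run_pipeline` are made up for the example. It shows how a bounded queue makes the producer wait when workers fall behind, so the number of pending items never exceeds the queue size.

```python
import asyncio


async def producer(guids, queue: asyncio.Queue, n_workers: int, stats: dict) -> None:
    """Feed GUIDs into the bounded queue; put() waits whenever the queue is full."""
    for guid in guids:
        await queue.put(guid)
        stats["peak_queue_size"] = max(stats["peak_queue_size"], queue.qsize())
    for _ in range(n_workers):
        await queue.put(None)  # one sentinel per worker signals "no more work"


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Consume GUIDs until a sentinel arrives."""
    while True:
        guid = await queue.get()
        if guid is None:
            break
        await asyncio.sleep(0)  # placeholder for the real download
        results.append(guid)


async def run_pipeline(guids, queue_size: int = 4, n_workers: int = 2):
    """Run one producer and several workers over a bounded queue."""
    queue = asyncio.Queue(maxsize=queue_size)
    results: list = []
    stats = {"peak_queue_size": 0}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    await producer(guids, queue, n_workers, stats)
    await asyncio.gather(*workers)
    return results, stats["peak_queue_size"]
```

Because `guids` can be any iterable, including a generator, the full list of work items never has to be held in memory at once.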
|
|
||
| ### Key Features | ||
|
|
||
| - **Memory Efficiency**: Bounded queues prevent memory explosion with large file sets | ||
| - **True Parallelism**: Multiprocessing bypasses Python GIL limitations | ||
| - **High Concurrency**: Configurable concurrent downloads per process | ||
| - **Resume Support**: Skip files that were already downloaded with the `--skip-completed` flag | ||
| - **Progress Tracking**: Real-time progress bars and detailed reporting | ||
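One way a resume check like the one behind `--skip-completed` could work is sketched below. The criteria the command actually uses are not described here, so treat this as an assumption: the hypothetical helper `is_already_downloaded` considers a file complete when it exists and, if an expected size is known, when its size on disk matches.

```python
from pathlib import Path
from typing import Optional


def is_already_downloaded(path: Path, expected_size: Optional[int] = None) -> bool:
    """Return True when the file exists and matches the expected size, if one is given.

    Assumption for illustration only: existence plus a size match is treated as
    "complete". A stricter check could also compare a checksum.
    """
    if not path.is_file():
        return False
    if expected_size is None:
        return True
    return path.stat().st_size == expected_size
```

A size comparison catches most truncated downloads cheaply; a checksum comparison would be slower but would also catch corrupted content.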
|
|
||
| ## Usage | ||
|
|
||
| ### Command Line Interface | ||
|
|
||
| Download multiple files using a manifest: | ||
|
|
||
| ```bash | ||
| gen3 --endpoint my-commons.org --auth credentials.json download-multiple \ | ||
| --manifest files.json \ | ||
| --download-path ./downloads \ | ||
| --max-concurrent-requests 10 \ | ||
| --filename-format original \ | ||
| --skip-completed \ | ||
| --no-prompt | ||
| ``` | ||
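The manifest format is not spelled out in this document. The sketch below assumes the layout commonly used by manifests exported from Gen3 data portals: a JSON array of objects, each with an `object_id` field holding the GUID. The helper names `write_manifest` and `read_guids` are hypothetical; check a manifest exported from your own commons before relying on this shape.

```python
import json
from pathlib import Path


def write_manifest(path, guids) -> None:
    """Write a manifest as a JSON array of {"object_id": ...} entries (assumed layout)."""
    entries = [{"object_id": guid} for guid in guids]
    Path(path).write_text(json.dumps(entries, indent=2))


def read_guids(path) -> list:
    """Read the GUIDs back from a manifest written in the assumed layout."""
    entries = json.loads(Path(path).read_text())
    return [entry["object_id"] for entry in entries]
```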
|
|
||
| ### Python API | ||
|
|
||
| The `async_download_multiple` method is available in the `Gen3File` class for programmatic use. Refer to the Python SDK documentation for the complete API reference. | ||
|
|
||
| ## Parameters | ||
|
|
||
| For detailed parameter information and current default values, run: | ||
|
|
||
| ```bash | ||
| gen3 download-multiple --help | ||
| ``` | ||
|
|
||
| The command supports various options for customizing download behavior, including concurrency settings, file naming strategies, and progress controls. | ||
|
|
||
| ## Performance Characteristics | ||
|
|
||
| ### Throughput Optimization | ||
|
|
||
| The method is optimized for high-throughput scenarios: | ||
|
|
||
| - **Concurrent Downloads**: Configurable number of simultaneous downloads | ||
| - **Memory Usage**: Bounded by queue sizes (typically < 100MB) | ||
| - **CPU Utilization**: Leverages multiple CPU cores | ||
| - **Network Efficiency**: Just-in-time presigned URL generation | ||
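A cap on simultaneous downloads, like the one set by `--max-concurrent-requests`, is commonly implemented with a semaphore. The sketch below is an illustration of that general technique rather than the SDK's code; `download_with_limit` is a hypothetical name and the short sleep stands in for a real transfer. It also records the peak number of downloads in flight so the limit can be observed.

```python
import asyncio


async def download_with_limit(guids, max_concurrent_requests: int = 3):
    """Run placeholder downloads with at most max_concurrent_requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    active = 0
    peak = 0

    async def download_one(guid: str) -> str:
        nonlocal active, peak
        async with semaphore:  # waits here once the limit is reached
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.001)  # placeholder for the real transfer
            active -= 1
            return guid

    done = await asyncio.gather(*(download_one(guid) for guid in guids))
    return list(done), peak
```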
|
|
||
| ### Scalability | ||
|
|
||
| Performance scales with: | ||
|
|
||
| - **File Count**: Linear time complexity with constant memory usage | ||
| - **File Size**: Independent of individual file sizes | ||
| - **Network Bandwidth**: Limited by available bandwidth and concurrent connections | ||
| - **System Resources**: Scales with available CPU cores and memory | ||
|
|
||
| ## Error Handling | ||
|
|
||
| ### Robust Error Recovery | ||
|
|
||
| The implementation includes comprehensive error handling: | ||
|
|
||
| - **Network Failures**: Automatic retry with exponential backoff | ||
| - **Authentication Errors**: Token refresh and retry | ||
| - **File System Errors**: Graceful handling of permission and space issues | ||
| - **Process Failures**: Automatic worker process restart | ||
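Retry with exponential backoff, as mentioned for network failures above, can be sketched as follows. The function name, the default attempt count, and the delay values are assumptions chosen for the example and do not reflect the SDK's actual settings. The `sleep` parameter is injectable so the delays can be inspected without waiting.

```python
import asyncio


async def retry_with_backoff(
    operation,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep=asyncio.sleep,
):
    """Await operation(), retrying on connection or timeout errors.

    The delay doubles after each failed attempt (0.5s, 1s, 2s, ...) and is
    capped at max_delay. The last failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(delay)
```

In practice a small random jitter is often added to each delay so that many clients do not retry at the same instant; it is omitted here to keep the delays predictable.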
|
|
||
| ### Result Reporting | ||
|
|
||
| The method returns a structured result object containing lists of succeeded, failed, and skipped downloads with detailed information about each operation. | ||
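The exact shape of the returned object is not shown in this document. As a rough illustration of a result with succeeded, failed, and skipped lists, the sketch below uses two hypothetical dataclasses, `DownloadOutcome` and `DownloadReport`; the real field names and types in the SDK may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DownloadOutcome:
    """One file's outcome (hypothetical shape, for illustration only)."""
    guid: str
    path: Optional[str] = None
    error: Optional[str] = None


@dataclass
class DownloadReport:
    """Groups outcomes into the three lists described above."""
    succeeded: List[DownloadOutcome] = field(default_factory=list)
    failed: List[DownloadOutcome] = field(default_factory=list)
    skipped: List[DownloadOutcome] = field(default_factory=list)

    def failed_guids(self) -> List[str]:
        """GUIDs that could be retried in a follow-up run."""
        return [outcome.guid for outcome in self.failed]

    def summary(self) -> str:
        return (
            f"{len(self.succeeded)} succeeded, "
            f"{len(self.failed)} failed, "
            f"{len(self.skipped)} skipped"
        )
```

Keeping failed GUIDs separate makes it straightforward to build a smaller manifest for a retry run.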
|
|
||
| ## Best Practices | ||
|
|
||
| ### Configuration Recommendations | ||
|
|
||
| For optimal performance, adjust the concurrency and process settings based on your specific use case: | ||
|
|
||
| - **Small files**: Use higher concurrent request limits | ||
| - **Large files**: Use lower concurrent request limits to avoid overwhelming the system | ||
| - **High-bandwidth networks**: Increase the number of worker processes | ||
| - **Limited memory**: Reduce queue sizes to manage memory usage | ||
|
|
||
|
|
||
| ## Comparison with Synchronous Downloads | ||
|
|
||
| ### Performance Advantages | ||
|
|
||
| | Metric | Synchronous | Asynchronous | | ||
| | ------------------ | ---------------------------- | ---------------------------- | | ||
| | Memory Usage | O(n) - grows with file count | O(1) - bounded by queue size | | ||
| | CPU Utilization | Single core | Multiple cores | | ||
| | Network Efficiency | Sequential | Parallel | | ||
| | Scalability | Limited by GIL | Scales with CPU cores | | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Common Issues | ||
|
|
||
| **Slow Downloads:** | ||
|
|
||
| - Check network bandwidth and server limits | ||
| - Reduce concurrent request limits if the server is overwhelmed | ||
|
|
||
| **Memory Issues:** | ||
|
|
||
| - Reduce queue sizes and batch sizes | ||
| - Lower the number of worker processes if system memory is limited | ||
| - Monitor system memory usage during downloads | ||
|
|
||
| **Authentication Errors:** | ||
|
|
||
| - Verify that the credentials file is valid and has not expired | ||
| - Check that the endpoint URL is correct | ||
| - Ensure you have the required permissions for the target files | ||
|
|
||
| **Process Failures:** | ||
|
|
||
| - Check system resources (CPU, memory, file descriptors) | ||
| - Verify network connectivity to Gen3 commons | ||
| - Review logs for specific error messages | ||
|
|
||
| ### Debugging | ||
|
|
||
| Enable verbose logging for detailed debugging: | ||
|
|
||
| ```bash | ||
| gen3 -vv --endpoint my-commons.org --auth credentials.json download-multiple \ | ||
| --manifest files.json \ | ||
| --download-path ./downloads | ||
| ``` | ||
|
|
||
| ## Examples | ||
|
|
||
| ### Basic Usage | ||
|
|
||
| ```bash | ||
| # Download files with default settings | ||
| gen3 --endpoint data.commons.io --auth creds.json download-multiple \ | ||
| --manifest my_files.json \ | ||
| --download-path ./data | ||
| ``` | ||
|
|
||
| ### High-Performance Configuration | ||
|
|
||
| ```bash | ||
| # Optimized for high-throughput downloads | ||
| gen3 --endpoint data.commons.io --auth creds.json download-multiple \ | ||
| --manifest large_dataset.json \ | ||
| --download-path ./large_downloads \ | ||
| --max-concurrent-requests 8 \ | ||
| --no-progress \ | ||
| --skip-completed | ||
| ``` | ||
|
|
||
| **Note**: The specific values shown in the examples (such as `--max-concurrent-requests 8`) are for demonstration only. For current parameter options and default values, always refer to the command line help: `gen3 download-multiple --help`. | ||