ProteomeXchange support #67
Conversation
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds ProteomeXchange (PX) raw-file download support to Files (XML normalization/parsing, FTP/HTTP download orchestration, resume/retries) and a new CLI command to trigger it; also renames the CLI/search option --filter to --filters.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User as CLI User
participant CLI as pridepy/pridepy.py
participant Files as pridepy/files/files.py
participant PX as ProteomeXchange
participant DL as Downloader (FTP/HTTP)
User->>CLI: download-px-raw-files --accession PXDxxxxxx --output-folder ./out
CLI->>Files: download_px_raw_files(px_id_or_url, output_folder, skip_if_downloaded_already)
Files->>Files: _normalize_px_xml_url(px_id_or_url)
Files->>PX: GET normalized XML endpoint
PX-->>Files: XML response
Files->>Files: _parse_px_xml_for_raw_file_urls(px_xml_url)
Files-->>Files: list of raw file URLs
Files->>Files: classify URLs (ftp:// vs http(s)://)
rect rgb(220, 240, 220)
Files->>DL: download_ftp_urls(ftp_urls,...)
DL->>DL: per-host grouping, resume, retries
DL-->>Files: ftp downloads complete
end
rect rgb(240, 240, 220)
Files->>DL: download_http_urls(http_urls,...)
DL->>DL: resume support, retries, progress
DL-->>Files: http downloads complete
end
Files-->>CLI: report success/failures
    CLI-->>User: Download complete / errors
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 3 checks passed.
Pull Request Overview
Adds ProteomeXchange (PX) dataset support: a new CLI to download raw files referenced by PX datasets and core logic to normalize PX URLs, parse PX XML, and download via FTP/HTTP with resume support.
- New CLI command: download-px-raw-files (PX accession/URL input, output folder, optional skip-existing).
- Core Files methods for PX: URL normalization, XML parsing for "Associated raw file URI", FTP/HTTP downloaders with resume/progress.
- Renames CLI option --filter to --filters for search_projects_by_keywords_and_filters.
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pridepy/pridepy.py | Adds the PX download CLI and renames --filter to --filters in the search projects command. |
| pridepy/files/files.py | Implements PX URL normalization, XML parsing, and HTTP/FTP download helpers, plus supporting imports. |
```python
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
    download_attempt += 1
    logging.error(
        f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
    )
    if download_attempt >= max_download_retries:
        logging.error(
```
Copilot AI commented on Oct 16, 2025:
ftplib is not imported as a module, but its names (error_temp, error_perm) are referenced via the ftplib namespace. Either add import ftplib at the top of the file (near other imports) or import the specific exceptions: from ftplib import error_temp, error_perm and then remove the ftplib. qualifier in the except clauses. This occurs in both exception handlers within download_ftp_urls.
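Either fix is a one-liner. A minimal sketch of the two import styles (standard-library names, shown here purely for illustration):

```python
# Option A: import the module, so ftplib.error_temp / ftplib.error_perm resolve
import ftplib

# Option B: import the specific exceptions and drop the ftplib. qualifier
from ftplib import error_temp, error_perm

try:
    raise error_temp("421 service not available")
except (error_temp, error_perm) as e:
    print(type(e).__name__, e)
```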
```python
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
    connection_attempt += 1
    logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
```
Copilot AI commented on Oct 16, 2025:
Same issue as above: ftplib is not imported as a module but is referenced here. Add import ftplib or import the specific exceptions (from ftplib import error_temp, error_perm) and update the clause accordingly.
| "-skip", | ||
| "--skip_if_downloaded_already", | ||
| required=False, | ||
| default=True, | ||
| help="Boolean to skip a file if it already exists", |
Copilot AI commented on Oct 16, 2025:
Click short options must be a single character (e.g., -s), so -skip will not work. Also, for a boolean flag, prefer is_flag=True and provide both forms. Suggest: @click.option('-s', '--skip-if-downloaded-already/--no-skip-if-downloaded-already', 'skip_if_downloaded_already', is_flag=True, default=True, help='Skip a file if it already exists').
| "-skip", | |
| "--skip_if_downloaded_already", | |
| required=False, | |
| default=True, | |
| help="Boolean to skip a file if it already exists", | |
| "-s", | |
| "--skip-if-downloaded-already/--no-skip-if-downloaded-already", | |
| "skip_if_downloaded_already", | |
| is_flag=True, | |
| default=True, | |
| help="Skip a file if it already exists", |
```python
required=False,
help="Parameters to filter the search results. The structure of the "
"filter is: field1==value1, field2==value2. Example "
```
Copilot AI commented on Oct 16, 2025:
The help text still refers to 'filter' (singular) but the option was renamed to --filters. Update to 'filters' in both sentences for consistency and clarity.
```python
return (
    f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{query}&outputMode=XML&test=no"
```
Copilot AI commented on Oct 16, 2025:
[nitpick] If query is empty, this builds a URL with ?&outputMode=XML.... Consider constructing the query string conditionally, e.g., qs = f'{query}&' if query else '' and then ...GetDataset?{qs}outputMode=XML&test=no.
Suggested change:

```diff
-return (
-    f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{query}&outputMode=XML&test=no"
+qs = f"{query}&" if query else ""
+return (
+    f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{qs}outputMode=XML&test=no"
```
| mode = "ab" | ||
| else: | ||
| resume_size = 0 | ||
| headers = {} | ||
| mode = "wb" | ||
|
|
||
| with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r: | ||
| r.raise_for_status() | ||
| total_size = int(r.headers.get("content-length", 0)) + resume_size | ||
| block_size = 1024 * 1024 | ||
| with tqdm( | ||
| total=total_size, | ||
| unit="B", | ||
| unit_scale=True, | ||
| desc=local_path, | ||
| initial=resume_size, |
Copilot AI commented on Oct 16, 2025:
[nitpick] When resuming, if the server ignores the Range header and returns 200 (full content), adding resume_size to content-length will overcount and may duplicate data. Guard by checking for r.status_code == 206 or a Content-Range header; only add resume_size when the response is partial. Otherwise, fall back to rewriting from the start.
| mode = "ab" | |
| else: | |
| resume_size = 0 | |
| headers = {} | |
| mode = "wb" | |
| with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r: | |
| r.raise_for_status() | |
| total_size = int(r.headers.get("content-length", 0)) + resume_size | |
| block_size = 1024 * 1024 | |
| with tqdm( | |
| total=total_size, | |
| unit="B", | |
| unit_scale=True, | |
| desc=local_path, | |
| initial=resume_size, | |
| else: | |
| resume_size = 0 | |
| headers = {} | |
| with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r: | |
| r.raise_for_status() | |
| # Check if server honored Range request | |
| is_partial = r.status_code == 206 or "content-range" in r.headers | |
| if is_partial: | |
| mode = "ab" | |
| total_size = int(r.headers.get("content-length", 0)) + resume_size | |
| initial = resume_size | |
| else: | |
| # Server ignored Range, start from scratch | |
| mode = "wb" | |
| resume_size = 0 | |
| total_size = int(r.headers.get("content-length", 0)) | |
| initial = 0 | |
| block_size = 1024 * 1024 | |
| with tqdm( | |
| total=total_size, | |
| unit="B", | |
| unit_scale=True, | |
| desc=local_path, | |
| initial=initial, |
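A standalone way to see what the guard checks, assuming a server at an illustrative URL; a 206 status (or a Content-Range header) is the only reliable signal that the Range request was honored:

```python
import requests

url = "https://example.org/big.raw"  # illustrative endpoint, not a real PX file
with requests.get(url, headers={"Range": "bytes=1024-"}, stream=True, timeout=(10, 60)) as r:
    r.raise_for_status()
    if r.status_code == 206 or "content-range" in r.headers:
        print("partial content; safe to append:", r.headers.get("Content-Range"))
    else:
        print("server ignored Range; restart the file from byte 0")
```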
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Actionable comments posted: 3
🧹 Nitpick comments (4)
.gitignore (1)

18-19: Remove redundant blank lines. These extra blank lines after `.qodo` are unnecessary and don't follow a consistent pattern with the rest of the file. Apply this diff:

```diff
 .qodo
-
-
 #Ignore cursor AI rules
```

pridepy/files/files.py (3)
853-951: FTP is insecure but may be required. FTP transmits credentials and data in plaintext. If possible, prefer SFTP/FTPS; however, if ProteomeXchange repositories only support FTP, document this limitation.
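If a given host supports FTPS, the standard library's `ftplib.FTP_TLS` is a near drop-in replacement for `FTP` (host name illustrative; whether PX mirrors offer FTPS must be checked per host):

```python
from ftplib import FTP_TLS

ftps = FTP_TLS("ftp.example.org", timeout=30)  # illustrative host
ftps.login()       # anonymous login, mirroring the plain-FTP path
ftps.prot_p()      # encrypt the data channel as well as the control channel
ftps.set_pasv(True)
ftps.quit()
```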
953-1000: Improve error handling in HTTP downloads. Similar to the FTP downloader:

- Blind exception (line 999): catching `Exception` is too broad.
- Use `logging.exception` (line 1000): include the traceback for better debugging.

Apply these fixes:

```diff
             logging.info(f"Successfully downloaded {local_path}")
-        except Exception as e:
-            logging.error(f"HTTP download failed for {url}: {str(e)}")
+        except Exception:
+            logging.exception(f"HTTP download failed for {url}")
```
770-797: Use query param merging to avoid duplicates. In `_normalize_px_xml_url` (pridepy/files/files.py, lines 790-793), the code appends `&outputMode=XML&test=no` onto the original query, risking duplicate or conflicting values. Instead, parse and merge parameters with `urllib.parse.parse_qs` and `urlencode`:

```python
from urllib.parse import urlparse, parse_qs, urlencode

parsed = urlparse(px_id_or_url)
params = parse_qs(parsed.query)
params['outputMode'] = ['XML']
params['test'] = ['no']
new_query = urlencode(params, doseq=True)
return f"{parsed.scheme}://{parsed.netloc}{parsed.path}?{new_query}"
```
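A quick check of the merge behavior on a URL that already carries a conflicting parameter (URL is illustrative):

```python
from urllib.parse import urlparse, parse_qs, urlencode

u = "https://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000001&outputMode=HTML"
parsed = urlparse(u)
params = parse_qs(parsed.query)
params["outputMode"] = ["XML"]  # overrides HTML instead of duplicating the key
params["test"] = ["no"]
print(urlencode(params, doseq=True))
# ID=PXD000001&outputMode=XML&test=no
```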
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .gitignore (1 hunk)
- pridepy/files/files.py (2 hunks)
- pridepy/pridepy.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pridepy/files/files.py (2)
- pridepy/util/api_handling.py (3): `Util` (18-155), `get_api_call` (26-37), `create_session_with_retries` (145-155)
- pridepy/pridepy.py (1): `download_px_raw_files` (281-285)

pridepy/pridepy.py (1)
- pridepy/files/files.py (2): `download_px_raw_files` (820-846), `Files` (50-1000)
🪛 Ruff (0.14.0)
pridepy/files/files.py
808-808: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents
(S314)
868-868: FTP-related functions are being called. FTP is considered insecure. Use SSH/SFTP/SCP or some other encrypted protocol.
(S321)
919-919: Do not catch blind exception: Exception
(BLE001)
929-931: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
930-930: Use explicit conversion flag
Replace with conversion flag
(RUF010)
933-935: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
937-937: Do not catch blind exception: Exception
(BLE001)
938-938: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
938-938: Use explicit conversion flag
Replace with conversion flag
(RUF010)
944-944: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
944-944: Use explicit conversion flag
Replace with conversion flag
(RUF010)
949-951: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
999-999: Do not catch blind exception: Exception
(BLE001)
1000-1000: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
1000-1000: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (5)
.gitignore (1)

20-22: LGTM! The ignore rules for Cursor AI and system files are appropriate.

pridepy/files/files.py (2)

799-818: LGTM with security note. The XML parsing logic correctly extracts "Associated raw file URI" values. However, ensure you apply the defusedxml fix mentioned in the earlier comment.

820-846: LGTM! The orchestration logic correctly separates FTP and HTTP URLs and delegates to the appropriate download methods.
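For reference, a minimal sketch of the scheme-based split the orchestration performs (URLs are illustrative):

```python
from urllib.parse import urlparse

urls = [
    "ftp://ftp.example.org/px/run1.raw",
    "https://repo.example.org/px/run2.raw",
]
ftp_urls = [u for u in urls if urlparse(u).scheme == "ftp"]
http_urls = [u for u in urls if urlparse(u).scheme in ("http", "https")]
print(ftp_urls, http_urls)
```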
pridepy/pridepy.py (2)

256-286: LGTM! The new CLI command correctly delegates to `Files.download_px_raw_files` with appropriate options and logging.

355-355: Document the breaking change from `--filter` to `--filters`. Renaming the CLI option from `--filter` to `--filters` is a breaking change that will affect existing users and scripts. Ensure this is documented in release notes or a changelog, and consider providing a deprecation period where both options are accepted. Alternatively, you could temporarily support both options with a deprecation warning:

```python
@click.option(
    "-f",
    "--filters",
    "--filter",
    "filters",
    required=False,
    help="Parameters to filter the search results...",
)
```

Then add a warning if the old option is detected (this requires custom click callback logic).
Also applies to: 399-399, 415-415
```python
from urllib.parse import urlparse
import xml.etree.ElementTree as ET
```
Use defusedxml to prevent XML attacks.

The standard xml.etree.ElementTree is vulnerable to XML bombs (billion laughs, quadratic blowup) and external entity expansion. ProteomeXchange XML comes from an external source and should be treated as untrusted. Install and use defusedxml:

```diff
 from urllib.parse import urlparse
-import xml.etree.ElementTree as ET
+import defusedxml.ElementTree as ET
```

Then add defusedxml to your project dependencies (e.g., requirements.txt or setup.py).
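A self-contained sketch of the safer parse; the element layout below is an assumption for illustration, though the `cvParam` name "Associated raw file URI" matches the value the PR's parser extracts:

```python
import defusedxml.ElementTree as ET  # pip install defusedxml

xml_text = """<ProteomeXchangeDataset>
  <DatasetFileList>
    <DatasetFile name="run1.raw">
      <cvParam name="Associated raw file URI"
               value="ftp://ftp.example.org/px/run1.raw"/>
    </DatasetFile>
  </DatasetFileList>
</ProteomeXchangeDataset>"""

root = ET.fromstring(xml_text)  # raises on entity-expansion tricks
urls = [
    el.get("value")
    for el in root.iter("cvParam")
    if el.get("name") == "Associated raw file URI"
]
print(urls)  # ['ftp://ftp.example.org/px/run1.raw']
```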
🤖 Prompt for AI Agents: In pridepy/files/files.py around lines 15 to 16, the code imports xml.etree.ElementTree, which is unsafe for untrusted XML; replace that import with defusedxml.ElementTree (e.g., from defusedxml import ElementTree as ET) so parsing uses the secure implementation, update any references to ET accordingly, add defusedxml to the project dependencies (requirements.txt or setup.py), and run tests to confirm no API mismatches.
```python
@staticmethod
def _local_path_for_url(download_url: str, output_folder: str) -> str:
    filename = os.path.basename(urlparse(download_url).path)
    return os.path.join(output_folder, filename)
```
Handle URLs with trailing slashes.

`os.path.basename()` returns an empty string if the URL path ends with a slash, which would result in an invalid local path. Add validation:

```diff
 @staticmethod
 def _local_path_for_url(download_url: str, output_folder: str) -> str:
     filename = os.path.basename(urlparse(download_url).path)
+    if not filename:
+        raise ValueError(f"Unable to extract filename from URL: {download_url}")
     return os.path.join(output_folder, filename)
```
🤖 Prompt for AI Agents: In pridepy/files/files.py around lines 848 to 851, the method can return an empty filename when the URL path ends with a slash; change it to strip trailing slashes and pick the last non-empty path segment as the filename, and if that still yields an empty string fall back to a safe name (e.g., netloc or a uuid-based name, optionally with a default extension). Also unquote and sanitize the chosen filename to remove unsafe characters before joining with output_folder so the resulting local path is always valid.
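One hedged way to implement the prompt; the sanitization regex and the uuid fallback are assumptions for illustration, not the PR's code:

```python
import os
import re
import uuid
from urllib.parse import urlparse, unquote

def _local_path_for_url(download_url: str, output_folder: str) -> str:
    # The last non-empty path segment survives trailing slashes
    segments = [s for s in urlparse(download_url).path.split("/") if s]
    filename = unquote(segments[-1]) if segments else f"download-{uuid.uuid4().hex}"
    filename = re.sub(r"[^\w.\-]", "_", filename)  # drop unsafe characters
    return os.path.join(output_folder, filename)

print(_local_path_for_url("ftp://ftp.example.org/px/data/", "./out"))  # ./out/data
print(_local_path_for_url("https://example.org/", "./out"))            # ./out/download-<hex>
```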
The `download_ftp_urls` implementation under review:

```python
@staticmethod
def download_ftp_urls(
    ftp_urls: List[str],
    output_folder: str,
    skip_if_downloaded_already: bool,
    max_connection_retries: int = 3,
    max_download_retries: int = 3,
) -> None:
    """
    Download a list of FTP URLs using a single connection, with retries and progress bars.
    """
    if not os.path.isdir(output_folder):
        os.makedirs(output_folder, exist_ok=True)

    def connect_ftp(host: str):
        ftp = FTP(host, timeout=30)
        ftp.login()
        ftp.set_pasv(True)
        logging.info(f"Connected to FTP host: {host}")
        return ftp

    # Group URLs by host to reuse connections efficiently
    host_to_paths: Dict[str, List[str]] = {}
    for url in ftp_urls:
        parsed = urlparse(url)
        host_to_paths.setdefault(parsed.hostname, []).append(parsed.path.lstrip("/"))

    for host, paths in host_to_paths.items():
        connection_attempt = 0
        while connection_attempt < max_connection_retries:
            try:
                ftp = connect_ftp(host)
                for ftp_path in paths:
                    try:
                        local_path = os.path.join(output_folder, os.path.basename(ftp_path))
                        if skip_if_downloaded_already and os.path.exists(local_path):
                            logging.info("Skipping download as file already exists")
                            continue

                        logging.info(f"Starting FTP download: {host}/{ftp_path}")
                        download_attempt = 0
                        while download_attempt < max_download_retries:
                            try:
                                total_size = ftp.size(ftp_path)
                                # Try to resume using REST if partial file exists
                                if os.path.exists(local_path):
                                    current_size = os.path.getsize(local_path)
                                    mode = "ab"
                                else:
                                    current_size = 0
                                    mode = "wb"

                                with open(local_path, mode) as f, tqdm(
                                    total=total_size,
                                    unit="B",
                                    unit_scale=True,
                                    desc=local_path,
                                    initial=current_size,
                                ) as pbar:

                                    def callback(data):
                                        f.write(data)
                                        pbar.update(len(data))

                                    if current_size:
                                        try:
                                            ftp.sendcmd(f"REST {current_size}")
                                        except Exception:
                                            # If REST not supported, fall back to full download
                                            current_size = 0
                                            f.seek(0)
                                            f.truncate()
                                    ftp.retrbinary(f"RETR {ftp_path}", callback)
                                logging.info(f"Successfully downloaded {local_path}")
                                break
                            except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
                                download_attempt += 1
                                logging.error(
                                    f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
                                )
                                if download_attempt >= max_download_retries:
                                    logging.error(
                                        f"Giving up on {local_path} after {max_download_retries} attempts."
                                    )
                                    break
                    except Exception as e:
                        logging.error(f"Unexpected error while processing FTP path {ftp_path}: {str(e)}")
                ftp.quit()
                logging.info(f"Disconnected from FTP host: {host}")
                break
            except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
                connection_attempt += 1
                logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
                if connection_attempt < max_connection_retries:
                    logging.info("Retrying connection...")
                    time.sleep(5)
                else:
                    logging.error(
                        f"Giving up after {max_connection_retries} failed connection attempts to {host}."
                    )
```
Fix resume logic and improve error handling.

Several issues:

1. Resume fallback bug (lines 916-923): if REST is not supported and the code falls back to a full download by truncating the file, the progress bar's `initial=current_size` is not reset to 0, causing an incorrect progress display.
2. Blind exception catching (lines 919, 937): catching `Exception` is too broad. Be specific about which failures to handle.
3. Use `logging.exception` (lines 929, 933, 938, 944, 949): in exception handlers, use `logging.exception()` instead of `logging.error()` to include the traceback.

Apply these fixes:

```diff
                                     if current_size:
                                         try:
                                             ftp.sendcmd(f"REST {current_size}")
-                                        except Exception:
+                                        except ftplib.error_perm:
                                             # If REST not supported, fall back to full download
                                             current_size = 0
                                             f.seek(0)
                                             f.truncate()
+                                            pbar.reset()  # Reset progress bar
                                     ftp.retrbinary(f"RETR {ftp_path}", callback)
```

```diff
                             except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
                                 download_attempt += 1
-                                logging.error(
-                                    f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
+                                logging.exception(
+                                    f"Download failed for {local_path} (attempt {download_attempt})"
                                 )
                                 if download_attempt >= max_download_retries:
-                                    logging.error(
-                                        f"Giving up on {local_path} after {max_download_retries} attempts."
+                                    logging.exception(
+                                        f"Giving up on {local_path} after {max_download_retries} attempts"
                                     )
```

```diff
-                    except Exception as e:
-                        logging.error(f"Unexpected error while processing FTP path {ftp_path}: {str(e)}")
+                    except Exception:
+                        logging.exception(f"Unexpected error while processing FTP path {ftp_path}")
```

```diff
             except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
                 connection_attempt += 1
-                logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
+                logging.exception(f"FTP connection failed (attempt {connection_attempt})")
                 if connection_attempt < max_connection_retries:
                     logging.info("Retrying connection...")
                     time.sleep(5)
                 else:
-                    logging.error(
-                        f"Giving up after {max_connection_retries} failed connection attempts to {host}."
+                    logging.exception(
+                        f"Giving up after {max_connection_retries} failed connection attempts to {host}"
                     )
```
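On the `pbar.reset()` line added above: tqdm exposes `reset(total=None)`, which rewinds the displayed count to zero. A tiny standalone check:

```python
from tqdm import tqdm

with tqdm(total=100, initial=40) as pbar:
    pbar.update(10)        # bar shows 50/100
    pbar.reset(total=100)  # bar rewinds to 0/100, matching a restarted download
    pbar.update(5)         # bar shows 5/100
```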
🤖 Prompt for AI Agents: In pridepy/files/files.py around lines 853-951, fix resume and error handling: when REST fails, reset current_size to 0, switch to write mode ('wb'), and recreate the tqdm progress bar (or reset its initial) so the progress display reflects a full download rather than the old initial value; replace the broad "except Exception" around per-path processing with specific exceptions (e.g., OSError, ftplib.all_errors, socket.timeout) so only expected I/O/FTP errors are caught; and replace logging.error(...) calls in all exception handlers at the indicated lines with logging.exception(...) to include tracebacks. Ensure the REST fallback truncates the file, sets mode='wb' and current_size=0 before re-creating the retrbinary callback/progress loop, and narrow all catch clauses to explicit FTP/IO/socket errors.
PR Type
Enhancement
Description
- Add ProteomeXchange dataset support for downloading raw files via XML parsing
- Implement FTP and HTTP download methods with resume capability and progress tracking
- Add CLI command `download-px-raw-files` for ProteomeXchange dataset downloads
- Fix parameter naming in search command (`filter` → `filters`)
File Walkthrough
files.py: ProteomeXchange file download support with FTP/HTTP handlers (pridepy/files/files.py)
- PX XML URL normalization and parsing of dataset metadata
- `download_px_raw_files` method supporting both FTP and HTTP protocols
- `download_ftp_urls` with connection pooling, retry logic, and resume support
- `download_http_urls` with resume capability and progress bars using tqdm

pridepy.py: CLI command for ProteomeXchange downloads and parameter fixes (pridepy/pridepy.py)
- New command `download-px-raw-files` for ProteomeXchange dataset downloads
- Renamed `filter` to `filters` in the search command parameter