
Conversation

@ypriverol
Contributor

@ypriverol ypriverol commented Oct 16, 2025

PR Type

Enhancement


Description

  • Add ProteomeXchange dataset support for downloading raw files via XML parsing

  • Implement FTP and HTTP download methods with resume capability and progress tracking

  • Add CLI command download-px-raw-files for ProteomeXchange dataset downloads (a usage sketch follows this list)

  • Fix parameter naming in search command (filter → filters)
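
For orientation, a minimal programmatic usage sketch of the new functionality. The import path and call follow the walkthrough and sequence diagram below (Files.download_px_raw_files(px_id_or_url, output_folder, skip_if_downloaded_already)); treat the exact signature as an assumption rather than the documented API.

from pridepy.files.files import Files

# Sketch: download all raw files referenced by a ProteomeXchange dataset.
files = Files()
files.download_px_raw_files(
    "PXD039236",  # PX accession or a full GetDataset URL
    "./out",      # output folder for the downloaded raw files
    True,         # skip files that already exist locally
)

The new download-px-raw-files CLI command is the command-line equivalent of this call.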


Diagram Walkthrough

flowchart LR
  PX["ProteomeXchange Dataset"] -- "parse XML" --> URLs["Extract Raw File URLs"]
  URLs -- "FTP URLs" --> FTP["FTP Download with Resume"]
  URLs -- "HTTP URLs" --> HTTP["HTTP Download with Resume"]
  FTP --> Output["Output Folder"]
  HTTP --> Output

File Walkthrough

Relevant files
Enhancement
files.py
ProteomeXchange file download support with FTP/HTTP handlers

pridepy/files/files.py

  • Add ProteomeXchange XML parsing to extract raw file URIs from dataset
    metadata
  • Implement download_px_raw_files method supporting both FTP and HTTP
    protocols
  • Add download_ftp_urls with connection pooling, retry logic, and resume
    support
  • Add download_http_urls with resume capability and progress bars using
    tqdm
+238/-0 
pridepy.py
CLI command for ProteomeXchange downloads and parameter fixes

pridepy/pridepy.py

  • Add new CLI command download-px-raw-files for ProteomeXchange dataset
    downloads
  • Fix parameter name from filter to filters in search command
  • Update help text and option aliases for ProteomeXchange accession
    parameter
+35/-3   

Summary by CodeRabbit

  • New Features

    • New CLI command to download ProteomeXchange dataset raw files with resume, progress and retry support for FTP and HTTP(S) sources.
  • Changes

    • Renamed CLI option --filter to --filters for search commands.
  • Chores

    • Updated repository ignore rules (e.g., system files) and bumped package version to 0.0.9.

@ypriverol ypriverol requested a review from Copilot October 16, 2025 11:31
@coderabbitai
Contributor

coderabbitai bot commented Oct 16, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds ProteomeXchange (PX) raw-file download support to Files (XML normalization/parsing, FTP/HTTP download orchestration, resume/retries) and a new CLI command to trigger it; also renames the CLI/search option --filter to --filters and bumps the package version.

Changes

Cohort / File(s): Summary

  • Git Configuration (.gitignore): Add ignore rules for .cursor/rules/codacy.mdc and .DS_Store.
  • PX Support & Downloading (pridepy/files/files.py): Add PX XML normalization (_normalize_px_xml_url), PX XML parsing for raw-file URIs (_parse_px_xml_for_raw_file_urls), local-path helper (_local_path_for_url), FTP downloader (download_ftp_urls), HTTP downloader (download_http_urls), and top-level orchestrator (download_px_raw_files). Includes resume/progress/retry logic and URL classification; a parsing sketch follows this list.
  • CLI & Search API (pridepy/pridepy.py): Add download-px-raw-files CLI command and download_px_raw_files call; rename --filter to --filters and update the search_projects_by_keywords_and_filters parameter to filters.
  • Project Metadata (pyproject.toml): Bump package version from 0.0.8 to 0.0.9.
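
For illustration, a minimal sketch of the XML parsing step listed above for pridepy/files/files.py. The name/value attribute layout carrying "Associated raw file URI" is an assumption about the PX GetDataset XML, the helper name mirrors the walkthrough rather than quoting the PR's code, and a review comment further down recommends defusedxml over the standard parser.

from typing import List

import requests
import xml.etree.ElementTree as ET  # a reviewer below suggests defusedxml.ElementTree instead

def parse_px_xml_for_raw_file_urls(px_xml_url: str) -> List[str]:
    """Collect 'Associated raw file URI' values from a PX GetDataset XML document."""
    response = requests.get(px_xml_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.text)
    urls: List[str] = []
    for elem in root.iter():
        if elem.get("name") == "Associated raw file URI" and elem.get("value"):
            urls.append(elem.get("value"))
    return urls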

Sequence Diagram(s)

sequenceDiagram
    participant User as CLI User
    participant CLI as pridepy/pridepy.py
    participant Files as pridepy/files/files.py
    participant PX as ProteomeXchange
    participant DL as Downloader (FTP/HTTP)

    User->>CLI: download-px-raw-files --accession PXDxxxxxx --output-folder ./out
    CLI->>Files: download_px_raw_files(px_id_or_url, output_folder, skip_if_downloaded_already)

    Files->>Files: _normalize_px_xml_url(px_id_or_url)
    Files->>PX: GET normalized XML endpoint
    PX-->>Files: XML response

    Files->>Files: _parse_px_xml_for_raw_file_urls(px_xml_url)
    Files-->>Files: list of raw file URLs

    Files->>Files: classify URLs (ftp:// vs http(s)://)

    rect rgb(220, 240, 220)
      Files->>DL: download_ftp_urls(ftp_urls,...)
      DL->>DL: per-host grouping, resume, retries
      DL-->>Files: ftp downloads complete
    end

    rect rgb(240, 240, 220)
      Files->>DL: download_http_urls(http_urls,...)
      DL->>DL: resume support, retries, progress
      DL-->>Files: http downloads complete
    end

    Files-->>CLI: report success/failures
    CLI-->>User: Download complete / errors

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • move to poetry #61 — Modifies pridepy/pridepy.py, adds download-related CLI commands and the filter → filters rename; strong overlap with CLI/search changes.
  • Search by key word #60 — Edits the same CLI/search API and parameter naming; likely intersects on search_projects_by_keywords_and_filters.
  • fix: removed dataegap and test searchby keyword #56 — Changes the search_projects_by_keywords_and_filters signature (filter→filters removal/adjustment); related to the same function-level refactor.

Poem

🐇 I hop through XML and stream each URL,
FTP and HTTP — down they fall.
Retries and resumes, progress on show,
Raw files gathered, row by row.
A rabbit cheers: downloads all! 🎉

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title “ProteomeXchange support” concisely captures the primary new functionality introduced by the pull request (adding ProteomeXchange dataset support for raw file downloads) without extraneous details or vague phrasing. It directly aligns with the core change and is clear to any team member reviewing the project history.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9e30c7 and 08d58c5.

📒 Files selected for processing (2)
  • pridepy/pridepy.py (4 hunks)
  • pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • pridepy/pridepy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: build


@qodo-code-review
Contributor

qodo-code-review bot commented Oct 16, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Incomplete download risk

Description: FTP resume is attempted with REST but falls back to full download without validating
downloaded size or integrity, which risks incomplete or corrupted files being treated as
successful.
files.py [905-926]

Referred Code
with open(local_path, mode) as f, tqdm(
    total=total_size,
    unit="B",
    unit_scale=True,
    desc=local_path,
    initial=current_size,
) as pbar:
    def callback(data):
        f.write(data)
        pbar.update(len(data))

    if current_size:
        try:
            ftp.sendcmd(f"REST {current_size}")
        except Exception:
            # If REST not supported, fall back to full download
            current_size = 0
            f.seek(0)
            f.truncate()
    ftp.retrbinary(f"RETR {ftp_path}", callback)
logging.info(f"Successfully downloaded {local_path}")


 ... (clipped 1 lines)
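
For context, one possible mitigation is a post-download size check. The helper below is a hypothetical sketch, not part of the PR; it only compares byte counts, since checksum validation would need a checksum source that this code does not fetch.

import os
from ftplib import FTP

def verify_ftp_download(ftp: FTP, ftp_path: str, local_path: str) -> None:
    """Compare the server-reported size with the bytes written locally."""
    expected = ftp.size(ftp_path)
    actual = os.path.getsize(local_path)
    if expected is not None and actual != expected:
        raise IOError(
            f"Incomplete download for {local_path}: got {actual} of {expected} bytes"
        )
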
Integrity not verified

Description: HTTP resume relies on Content-Length and appends data without validating final size or
checksum, allowing truncated or corrupted files if server omits length or connection
drops.
files.py [982-999]

Referred Code
    with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r:
        r.raise_for_status()
        total_size = int(r.headers.get("content-length", 0)) + resume_size
        block_size = 1024 * 1024
        with tqdm(
            total=total_size,
            unit="B",
            unit_scale=True,
            desc=local_path,
            initial=resume_size,
        ) as pbar:
            with open(local_path, mode) as f:
                for chunk in r.iter_content(chunk_size=block_size):
                    if chunk:
                        f.write(chunk)
                        pbar.update(len(chunk))
    logging.info(f"Successfully downloaded {local_path}")
except Exception as e:
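
One possible mitigation, sketched as a hypothetical helper (not part of the PR): validate the file after the stream completes against an expected size and, when available, a checksum.

import hashlib
import os
from typing import Optional

def verify_http_download(
    local_path: str,
    expected_size: Optional[int] = None,
    expected_sha256: Optional[str] = None,
) -> bool:
    """Return True when the downloaded file matches the expected size and checksum."""
    if expected_size is not None and os.path.getsize(local_path) != expected_size:
        return False
    if expected_sha256 is not None:
        digest = hashlib.sha256()
        with open(local_path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_sha256:
            return False
    return True
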
Ticket Compliance
🎫 No ticket provided
- [ ] Create ticket/issue

Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@ypriverol ypriverol requested a review from selvaebi October 16, 2025 11:33
@qodo-code-review
Contributor

qodo-code-review bot commented Oct 16, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category / Suggestion / Impact
Category: High-level
Unify download logic for all sources

Refactor the existing PRIDE download functions to use the new, more capable FTP
and HTTP download methods introduced for ProteomeXchange. This will unify the
download logic, reduce code duplication, and provide consistent features like
download resuming for all data sources.

Examples:

pridepy/files/files.py [854-951]
    def download_ftp_urls(
        ftp_urls: List[str],
        output_folder: str,
        skip_if_downloaded_already: bool,
        max_connection_retries: int = 3,
        max_download_retries: int = 3,
    ) -> None:
        """
        Download a list of FTP URLs using a single connection, with retries and progress bars.
        """

 ... (clipped 88 lines)
pridepy/files/files.py [143-149]
    def download_files_from_ftp(
        file_list_json,
        output_folder,
        skip_if_downloaded_already,
        max_connection_retries=3,
        max_download_retries=3,
    ):

Solution Walkthrough:

Before:

# pridepy/files/files.py

class Files:
    # Old FTP downloader for PRIDE data (no resume)
    @staticmethod
    def download_files_from_ftp(file_list_json, ...):
        # ... implementation without resume ...
        ftp.retrbinary(...)

    # Main dispatcher for PRIDE downloads
    @staticmethod
    def download_files(file_list_json, protocol="ftp", ...):
        if protocol == "ftp":
            Files.download_files_from_ftp(file_list_json, ...)
        # ...

    # New downloader for ProteomeXchange data
    def download_px_raw_files(self, px_id_or_url, ...):
        # ... gets urls ...
        self.download_ftp_urls(ftp_urls, ...) # Uses new FTP downloader

    # New FTP downloader with resume capability
    @staticmethod
    def download_ftp_urls(ftp_urls, ...):
        # ... implementation with resume using REST command ...
        ftp.retrbinary(...)

After:

# pridepy/files/files.py

class Files:
    # Old FTP downloader refactored to use the new one
    @staticmethod
    def download_files_from_ftp(file_list_json, ...):
        # 1. Convert file_list_json to a list of FTP URLs
        ftp_urls = [file["ftpDownloadLink"] for file in file_list_json]
        # 2. Call the unified, feature-rich downloader
        Files.download_ftp_urls(ftp_urls, ...)

    # Main dispatcher now implicitly uses the new downloader
    @staticmethod
    def download_files(file_list_json, protocol="ftp", ...):
        if protocol == "ftp":
            Files.download_files_from_ftp(file_list_json, ...)
        # ...

    # New downloader for ProteomeXchange data (unchanged)
    def download_px_raw_files(self, px_id_or_url, ...):
        # ...
        self.download_ftp_urls(ftp_urls, ...)

    # Unified FTP downloader with resume capability
    @staticmethod
    def download_ftp_urls(ftp_urls, ...):
        # ... implementation with resume using REST command ...
        ftp.retrbinary(...)
Suggestion importance[1-10]: 9


Why: This is an excellent architectural suggestion that correctly identifies duplicated download logic and proposes a valuable refactoring to unify FTP/HTTP downloads, which would improve maintainability and provide consistent features like resuming across the entire tool.

Impact: High
Category: Possible issue
Fix FTP resume file corruption

Fix a flaw in the FTP download resume logic to prevent file corruption. If the
REST command for resuming fails, ensure the file is correctly handled to start
the download from the beginning.

pridepy/files/files.py [916-924]

 if current_size:
     try:
         ftp.sendcmd(f"REST {current_size}")
-    except Exception:
-        # If REST not supported, fall back to full download
+        ftp.retrbinary(f"RETR {ftp_path}", callback)
+    except (ftplib.error_perm, ftplib.error_temp):
+        # If REST not supported or fails, fall back to full download
+        logging.warning(f"FTP resume failed for {local_path}. Starting download from beginning.")
         current_size = 0
         f.seek(0)
         f.truncate()
-ftp.retrbinary(f"RETR {ftp_path}", callback)
+        # Re-initialize pbar for full download
+        pbar.reset(total=total_size)
+        pbar.initial = 0
+        pbar.update(0)
+        ftp.retrbinary(f"RETR {ftp_path}", callback)
+else:
+    ftp.retrbinary(f"RETR {ftp_path}", callback)
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a critical bug in the FTP resume logic that could lead to file corruption. The proposed fix, while needing refinement, addresses the core issue of handling REST command failures.

Impact: High
Category: General
Improve CLI option clarity and consistency

Improve CLI clarity by renaming the --accession option for ProteomeXchange
downloads to --px-id with a -px alias, and remove the confusing "deprecated"
notice from the help text.

pridepy/pridepy.py [260-267]

 @click.option(
-    "-a",
-    "--accession",
-    "--px",
+    "-px",
+    "--px-id",
     "accession",
     required=True,
-    help="ProteomeXchange accession (e.g. PXD039236). --px is deprecated.",
+    help="ProteomeXchange accession or URL (e.g., PXD039236).",
 )
Suggestion importance[1-10]: 5


Why: The suggestion correctly points out that marking an option as deprecated in the same PR it's introduced is confusing. The proposed change to use --px-id improves the CLI's clarity and usability.

Impact: Low

Contributor

Copilot AI left a comment


Pull Request Overview

Adds ProteomeXchange (PX) dataset support: a new CLI to download raw files referenced by PX datasets and core logic to normalize PX URLs, parse PX XML, and download via FTP/HTTP with resume support.

  • New CLI command: download-px-raw-files (PX accession/URL input, output folder, optional skip-existing).
  • Core Files methods for PX: URL normalization, XML parsing for "Associated raw file URI", FTP/HTTP downloaders with resume/progress.
  • Renames CLI option --filter to --filters for search_projects_by_keywords_and_filters.

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

File Description
pridepy/pridepy.py Adds the PX download CLI and renames --filter to --filters in the search projects command.
pridepy/files/files.py Implements PX URL normalization, XML parsing, and HTTP/FTP download helpers, plus supporting imports.


Comment on lines +927 to +933
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
    download_attempt += 1
    logging.error(
        f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
    )
    if download_attempt >= max_download_retries:
        logging.error(

Copilot AI Oct 16, 2025


ftplib is not imported as a module, but its names (error_temp, error_perm) are referenced via the ftplib namespace. Either add import ftplib at the top of the file (near other imports) or import the specific exceptions: from ftplib import error_temp, error_perm and then remove the ftplib. qualifier in the except clauses. This occurs in both exception handlers within download_ftp_urls.
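
For illustration, the two import styles the comment describes; either resolves the missing ftplib reference (a sketch, not the PR's code):

# Option 1: import the module and keep the qualified names
import ftplib
# except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e: ...

# Option 2: import the specific exceptions and drop the qualifier
from ftplib import error_perm, error_temp
# except (socket.timeout, error_temp, error_perm) as e: ...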

Comment on lines +942 to +944
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
    connection_attempt += 1
    logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")

Copilot AI Oct 16, 2025


Same issue as above: ftplib is not imported as a module but is referenced here. Add import ftplib or import the specific exceptions (from ftplib import error_temp, error_perm) and update the clause accordingly.

Comment on lines +275 to +279
"-skip",
"--skip_if_downloaded_already",
required=False,
default=True,
help="Boolean to skip a file if it already exists",

Copilot AI Oct 16, 2025


Click short options must be a single character (e.g., -s), so -skip will not work. Also, for a boolean flag, prefer is_flag=True and provide both forms. Suggest: @click.option('-s', '--skip-if-downloaded-already/--no-skip-if-downloaded-already', 'skip_if_downloaded_already', is_flag=True, default=True, help='Skip a file if it already exists').

Suggested change

Before:
    "-skip",
    "--skip_if_downloaded_already",
    required=False,
    default=True,
    help="Boolean to skip a file if it already exists",

After:
    "-s",
    "--skip-if-downloaded-already/--no-skip-if-downloaded-already",
    "skip_if_downloaded_already",
    is_flag=True,
    default=True,
    help="Skip a file if it already exists",

Comment on lines 356 to 358
required=False,
help="Parameters to filter the search results. The structure of the "
"filter is: field1==value1, field2==value2. Example "

Copilot AI Oct 16, 2025


The help text still refers to 'filter' (singular) but the option was renamed to --filters. Update to 'filters' in both sentences for consistency and clarity.

Comment on lines +791 to +792
return (
    f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{query}&outputMode=XML&test=no"

Copilot AI Oct 16, 2025


[nitpick] If query is empty, this builds a URL with ?&outputMode=XML.... Consider constructing the query string conditionally, e.g., qs = f'{query}&' if query else '' and then ...GetDataset?{qs}outputMode=XML&test=no.

Suggested change

Before:
    return (
        f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{query}&outputMode=XML&test=no"

After:
    qs = f"{query}&" if query else ""
    return (
        f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{qs}outputMode=XML&test=no"

Comment on lines +976 to +991
mode = "ab"
else:
resume_size = 0
headers = {}
mode = "wb"

with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r:
r.raise_for_status()
total_size = int(r.headers.get("content-length", 0)) + resume_size
block_size = 1024 * 1024
with tqdm(
total=total_size,
unit="B",
unit_scale=True,
desc=local_path,
initial=resume_size,

Copilot AI Oct 16, 2025


[nitpick] When resuming, if the server ignores the Range header and returns 200 (full content), adding resume_size to content-length will overcount and may duplicate data. Guard by checking for r.status_code == 206 or a Content-Range header; only add resume_size when the response is partial. Otherwise, fall back to rewriting from the start.

Suggested change

Before:
    mode = "ab"
else:
    resume_size = 0
    headers = {}
    mode = "wb"
with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r:
    r.raise_for_status()
    total_size = int(r.headers.get("content-length", 0)) + resume_size
    block_size = 1024 * 1024
    with tqdm(
        total=total_size,
        unit="B",
        unit_scale=True,
        desc=local_path,
        initial=resume_size,

After:
else:
    resume_size = 0
    headers = {}
with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r:
    r.raise_for_status()
    # Check if server honored Range request
    is_partial = r.status_code == 206 or "content-range" in r.headers
    if is_partial:
        mode = "ab"
        total_size = int(r.headers.get("content-length", 0)) + resume_size
        initial = resume_size
    else:
        # Server ignored Range, start from scratch
        mode = "wb"
        resume_size = 0
        total_size = int(r.headers.get("content-length", 0))
        initial = 0
    block_size = 1024 * 1024
    with tqdm(
        total=total_size,
        unit="B",
        unit_scale=True,
        desc=local_path,
        initial=initial,

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
.gitignore (1)

18-19: Remove redundant blank lines.

These extra blank lines after .qodo are unnecessary and don't follow a consistent pattern with the rest of the file.

Apply this diff:

 .qodo
-

-
 #Ignore cursor AI rules
pridepy/files/files.py (3)

853-951: FTP is insecure but may be required.

FTP transmits credentials and data in plaintext. If possible, prefer SFTP/FTPS. However, if ProteomeXchange repositories only support FTP, document this limitation.
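
As an aside, and not something this PR implements, ftplib also ships FTP_TLS, so an explicit-FTPS variant of the connection helper could look roughly like this, assuming the server supports AUTH TLS:

from ftplib import FTP_TLS

def connect_ftps(host: str) -> FTP_TLS:
    """Sketch of an FTPS connection helper mirroring connect_ftp in this PR."""
    ftps = FTP_TLS(host, timeout=30)
    ftps.login()      # anonymous login, as in the plain-FTP path
    ftps.prot_p()     # protect the data channel with TLS
    ftps.set_pasv(True)
    return ftps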


953-1000: Improve error handling in HTTP downloads.

Similar to the FTP downloader:

  1. Blind exception (line 999): Catching Exception is too broad.
  2. Use logging.exception (line 1000): Include traceback for better debugging.

Apply these fixes:

             logging.info(f"Successfully downloaded {local_path}")
-        except Exception as e:
-            logging.error(f"HTTP download failed for {url}: {str(e)}")
+        except Exception:
+            logging.exception(f"HTTP download failed for {url}")

770-797: Use query param merging to avoid duplicates

In _normalize_px_xml_url (pridepy/files/files.py: L790–L793), the code appends &outputMode=XML&test=no onto the original query, risking duplicate or conflicting values. Instead, parse and merge parameters with urllib.parse.parse_qs and urlencode:

from urllib.parse import urlparse, parse_qs, urlencode

parsed = urlparse(px_id_or_url)
params = parse_qs(parsed.query)
params['outputMode'] = ['XML']
params['test'] = ['no']
new_query = urlencode(params, doseq=True)
return f"{parsed.scheme}://{parsed.netloc}{parsed.path}?{new_query}"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68cb4ac and c9e30c7.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • pridepy/files/files.py (2 hunks)
  • pridepy/pridepy.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pridepy/files/files.py (2)
pridepy/util/api_handling.py (3)
  • Util (18-155)
  • get_api_call (26-37)
  • create_session_with_retries (145-155)
pridepy/pridepy.py (1)
  • download_px_raw_files (281-285)
pridepy/pridepy.py (1)
pridepy/files/files.py (2)
  • download_px_raw_files (820-846)
  • Files (50-1000)
🪛 Ruff (0.14.0)
pridepy/files/files.py

808-808: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents

(S314)


868-868: FTP-related functions are being called. FTP is considered insecure. Use SSH/SFTP/SCP or some other encrypted protocol.

(S321)


919-919: Do not catch blind exception: Exception

(BLE001)


929-931: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


930-930: Use explicit conversion flag

Replace with conversion flag

(RUF010)


933-935: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


937-937: Do not catch blind exception: Exception

(BLE001)


938-938: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


938-938: Use explicit conversion flag

Replace with conversion flag

(RUF010)


944-944: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


944-944: Use explicit conversion flag

Replace with conversion flag

(RUF010)


949-951: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


999-999: Do not catch blind exception: Exception

(BLE001)


1000-1000: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1000-1000: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (5)
.gitignore (1)

20-22: LGTM!

The ignore rules for Cursor AI and system files are appropriate.

pridepy/files/files.py (2)

799-818: LGTM with security note.

The XML parsing logic correctly extracts "Associated raw file URI" values. However, ensure you apply the defusedxml fix mentioned in the earlier comment.


820-846: LGTM!

The orchestration logic correctly separates FTP and HTTP URLs and delegates to the appropriate download methods.

pridepy/pridepy.py (2)

256-286: LGTM!

The new CLI command correctly delegates to Files.download_px_raw_files with appropriate options and logging.


355-355: Document the breaking change from --filter to --filters.

Renaming the CLI option from --filter to --filters is a breaking change that will affect existing users and scripts. Ensure this is documented in release notes or a changelog, and consider providing a deprecation period where both options are accepted.

Alternatively, you could temporarily support both options with a deprecation warning:

@click.option(
    "-f",
    "--filters",
    "--filter",
    "filters",
    required=False,
    help="Parameters to filter the search results...",
)

Then add a warning if the old option is detected (this requires custom click callback logic).

Also applies to: 399-399, 415-415
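
To make that concrete, a minimal sketch of such a callback (hypothetical, not part of this PR; the sys.argv check is only a heuristic for detecting which spelling was typed):

import sys
import warnings

import click

def _warn_deprecated_filter(ctx, param, value):
    """Emit a deprecation warning when the old --filter spelling is used."""
    if value is not None and "--filter" in sys.argv:
        warnings.warn("--filter is deprecated; use --filters instead", DeprecationWarning)
    return value

@click.command()  # stand-in command for illustration only
@click.option(
    "-f",
    "--filters",
    "--filter",
    "filters",
    required=False,
    callback=_warn_deprecated_filter,
    help="Parameters to filter the search results.",
)
def search(filters):
    click.echo(filters)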

Comment on lines +15 to +16
from urllib.parse import urlparse
import xml.etree.ElementTree as ET
Contributor


⚠️ Potential issue | 🔴 Critical

Use defusedxml to prevent XML attacks.

The standard xml.etree.ElementTree is vulnerable to XML bomb attacks, billion laughs, quadratic blowup, and external entity expansion. ProteomeXchange XML comes from an external source and should be treated as untrusted.

Install and use defusedxml:

 from urllib.parse import urlparse
-import xml.etree.ElementTree as ET
+import defusedxml.ElementTree as ET

Then add defusedxml to your project dependencies (e.g., requirements.txt or setup.py).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

After:
from urllib.parse import urlparse
import defusedxml.ElementTree as ET
🤖 Prompt for AI Agents
In pridepy/files/files.py around lines 15 to 16, the code imports
xml.etree.ElementTree which is unsafe for untrusted XML; replace that import
with defusedxml.ElementTree (e.g., from defusedxml import ElementTree as ET) so
parsing uses the secure implementation, update any references to ET accordingly,
and add defusedxml to the project dependencies (requirements.txt or setup.py)
and run tests to confirm no API mismatches.

Comment on lines +848 to +851
@staticmethod
def _local_path_for_url(download_url: str, output_folder: str) -> str:
    filename = os.path.basename(urlparse(download_url).path)
    return os.path.join(output_folder, filename)
Contributor


⚠️ Potential issue | 🟡 Minor

Handle URLs with trailing slashes.

os.path.basename() returns an empty string if the URL path ends with a slash, which would result in an invalid local path.

Add validation:

 @staticmethod
 def _local_path_for_url(download_url: str, output_folder: str) -> str:
     filename = os.path.basename(urlparse(download_url).path)
+    if not filename:
+        raise ValueError(f"Unable to extract filename from URL: {download_url}")
     return os.path.join(output_folder, filename)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
@staticmethod
def _local_path_for_url(download_url: str, output_folder: str) -> str:
    filename = os.path.basename(urlparse(download_url).path)
    return os.path.join(output_folder, filename)

After:
@staticmethod
def _local_path_for_url(download_url: str, output_folder: str) -> str:
    filename = os.path.basename(urlparse(download_url).path)
    if not filename:
        raise ValueError(f"Unable to extract filename from URL: {download_url}")
    return os.path.join(output_folder, filename)
🤖 Prompt for AI Agents
In pridepy/files/files.py around lines 848 to 851, the method can return an
empty filename when the URL path ends with a slash; change it to strip trailing
slashes and pick the last non-empty path segment as the filename, and if that
still yields an empty string fall back to a safe name (e.g., netloc or a
uuid-based name, optionally with a default extension). Also unquote and sanitize
the chosen filename to remove unsafe characters before joining with
output_folder so the resulting local path is always valid.
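
A sketch of the fallback described in that prompt (hypothetical, not the PR's code):

import os
import uuid
from urllib.parse import unquote, urlparse

def local_path_for_url(download_url: str, output_folder: str) -> str:
    """Pick the last non-empty path segment, sanitize it, and fall back to a generated name."""
    path = urlparse(download_url).path.rstrip("/")
    filename = unquote(os.path.basename(path))
    # Keep only filesystem-safe characters
    filename = "".join(c for c in filename if c.isalnum() or c in "._-")
    if not filename:
        filename = f"download-{uuid.uuid4().hex}"  # last-resort name
    return os.path.join(output_folder, filename)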

Comment on lines +853 to +951
@staticmethod
def download_ftp_urls(
ftp_urls: List[str],
output_folder: str,
skip_if_downloaded_already: bool,
max_connection_retries: int = 3,
max_download_retries: int = 3,
) -> None:
"""
Download a list of FTP URLs using a single connection, with retries and progress bars.
"""
if not os.path.isdir(output_folder):
os.makedirs(output_folder, exist_ok=True)

def connect_ftp(host: str):
ftp = FTP(host, timeout=30)
ftp.login()
ftp.set_pasv(True)
logging.info(f"Connected to FTP host: {host}")
return ftp

# Group URLs by host to reuse connections efficiently
host_to_paths: Dict[str, List[str]] = {}
for url in ftp_urls:
parsed = urlparse(url)
host_to_paths.setdefault(parsed.hostname, []).append(parsed.path.lstrip("/"))

for host, paths in host_to_paths.items():
connection_attempt = 0
while connection_attempt < max_connection_retries:
try:
ftp = connect_ftp(host)
for ftp_path in paths:
try:
local_path = os.path.join(output_folder, os.path.basename(ftp_path))
if skip_if_downloaded_already and os.path.exists(local_path):
logging.info("Skipping download as file already exists")
continue

logging.info(f"Starting FTP download: {host}/{ftp_path}")
download_attempt = 0
while download_attempt < max_download_retries:
try:
total_size = ftp.size(ftp_path)
# Try to resume using REST if partial file exists
if os.path.exists(local_path):
current_size = os.path.getsize(local_path)
mode = "ab"
else:
current_size = 0
mode = "wb"

with open(local_path, mode) as f, tqdm(
total=total_size,
unit="B",
unit_scale=True,
desc=local_path,
initial=current_size,
) as pbar:
def callback(data):
f.write(data)
pbar.update(len(data))

if current_size:
try:
ftp.sendcmd(f"REST {current_size}")
except Exception:
# If REST not supported, fall back to full download
current_size = 0
f.seek(0)
f.truncate()
ftp.retrbinary(f"RETR {ftp_path}", callback)
logging.info(f"Successfully downloaded {local_path}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
download_attempt += 1
logging.error(
f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
)
if download_attempt >= max_download_retries:
logging.error(
f"Giving up on {local_path} after {max_download_retries} attempts."
)
break
except Exception as e:
logging.error(f"Unexpected error while processing FTP path {ftp_path}: {str(e)}")
ftp.quit()
logging.info(f"Disconnected from FTP host: {host}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
connection_attempt += 1
logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
if connection_attempt < max_connection_retries:
logging.info("Retrying connection...")
time.sleep(5)
else:
logging.error(
f"Giving up after {max_connection_retries} failed connection attempts to {host}."
)
Contributor


⚠️ Potential issue | 🟠 Major

Fix resume logic and improve error handling.

Several issues:

  1. Resume fallback bug (lines 916-923): If REST is not supported and the code falls back to a full download by truncating the file, the progress bar's initial=current_size is not reset to 0, causing incorrect progress display.

  2. Blind exception catching (lines 919, 937): Catching Exception is too broad. Be specific about what failures to handle.

  3. Use logging.exception (lines 929, 933, 938, 944, 949): In exception handlers, use logging.exception() instead of logging.error() to include the traceback.

Apply these fixes:

                                     if current_size:
                                         try:
                                             ftp.sendcmd(f"REST {current_size}")
-                                        except Exception:
+                                        except ftplib.error_perm:
                                             # If REST not supported, fall back to full download
                                             current_size = 0
                                             f.seek(0)
                                             f.truncate()
+                                            pbar.reset()  # Reset progress bar
                                         ftp.retrbinary(f"RETR {ftp_path}", callback)
                             except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
                                 download_attempt += 1
-                                logging.error(
-                                    f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
+                                logging.exception(
+                                    f"Download failed for {local_path} (attempt {download_attempt})"
                                 )
                                 if download_attempt >= max_download_retries:
-                                    logging.error(
-                                        f"Giving up on {local_path} after {max_download_retries} attempts."
+                                    logging.exception(
+                                        f"Giving up on {local_path} after {max_download_retries} attempts"
                                     )
-                    except Exception as e:
-                        logging.error(f"Unexpected error while processing FTP path {ftp_path}: {str(e)}")
+                    except Exception:
+                        logging.exception(f"Unexpected error while processing FTP path {ftp_path}")
             except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
                 connection_attempt += 1
-                logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
+                logging.exception(f"FTP connection failed (attempt {connection_attempt})")
                 if connection_attempt < max_connection_retries:
                     logging.info("Retrying connection...")
                     time.sleep(5)
                 else:
-                    logging.error(
-                        f"Giving up after {max_connection_retries} failed connection attempts to {host}."
+                    logging.exception(
+                        f"Giving up after {max_connection_retries} failed connection attempts to {host}"
                     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@staticmethod
def download_ftp_urls(
ftp_urls: List[str],
output_folder: str,
skip_if_downloaded_already: bool,
max_connection_retries: int = 3,
max_download_retries: int = 3,
) -> None:
"""
Download a list of FTP URLs using a single connection, with retries and progress bars.
"""
if not os.path.isdir(output_folder):
os.makedirs(output_folder, exist_ok=True)
def connect_ftp(host: str):
ftp = FTP(host, timeout=30)
ftp.login()
ftp.set_pasv(True)
logging.info(f"Connected to FTP host: {host}")
return ftp
# Group URLs by host to reuse connections efficiently
host_to_paths: Dict[str, List[str]] = {}
for url in ftp_urls:
parsed = urlparse(url)
host_to_paths.setdefault(parsed.hostname, []).append(parsed.path.lstrip("/"))
for host, paths in host_to_paths.items():
connection_attempt = 0
while connection_attempt < max_connection_retries:
try:
ftp = connect_ftp(host)
for ftp_path in paths:
try:
local_path = os.path.join(output_folder, os.path.basename(ftp_path))
if skip_if_downloaded_already and os.path.exists(local_path):
logging.info("Skipping download as file already exists")
continue
logging.info(f"Starting FTP download: {host}/{ftp_path}")
download_attempt = 0
while download_attempt < max_download_retries:
try:
total_size = ftp.size(ftp_path)
# Try to resume using REST if partial file exists
if os.path.exists(local_path):
current_size = os.path.getsize(local_path)
mode = "ab"
else:
current_size = 0
mode = "wb"
with open(local_path, mode) as f, tqdm(
total=total_size,
unit="B",
unit_scale=True,
desc=local_path,
initial=current_size,
) as pbar:
def callback(data):
f.write(data)
pbar.update(len(data))
if current_size:
try:
ftp.sendcmd(f"REST {current_size}")
except Exception:
# If REST not supported, fall back to full download
current_size = 0
f.seek(0)
f.truncate()
ftp.retrbinary(f"RETR {ftp_path}", callback)
logging.info(f"Successfully downloaded {local_path}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
download_attempt += 1
logging.error(
f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
)
if download_attempt >= max_download_retries:
logging.error(
f"Giving up on {local_path} after {max_download_retries} attempts."
)
break
except Exception as e:
logging.error(f"Unexpected error while processing FTP path {ftp_path}: {str(e)}")
ftp.quit()
logging.info(f"Disconnected from FTP host: {host}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
connection_attempt += 1
logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
if connection_attempt < max_connection_retries:
logging.info("Retrying connection...")
time.sleep(5)
else:
logging.error(
f"Giving up after {max_connection_retries} failed connection attempts to {host}."
)
@staticmethod
def download_ftp_urls(
ftp_urls: List[str],
output_folder: str,
skip_if_downloaded_already: bool,
max_connection_retries: int = 3,
max_download_retries: int = 3,
) -> None:
"""
Download a list of FTP URLs using a single connection, with retries and progress bars.
"""
if not os.path.isdir(output_folder):
os.makedirs(output_folder, exist_ok=True)
def connect_ftp(host: str):
ftp = FTP(host, timeout=30)
ftp.login()
ftp.set_pasv(True)
logging.info(f"Connected to FTP host: {host}")
return ftp
# Group URLs by host to reuse connections efficiently
host_to_paths: Dict[str, List[str]] = {}
for url in ftp_urls:
parsed = urlparse(url)
host_to_paths.setdefault(parsed.hostname, []).append(parsed.path.lstrip("/"))
for host, paths in host_to_paths.items():
connection_attempt = 0
while connection_attempt < max_connection_retries:
try:
ftp = connect_ftp(host)
for ftp_path in paths:
try:
local_path = os.path.join(output_folder, os.path.basename(ftp_path))
if skip_if_downloaded_already and os.path.exists(local_path):
logging.info("Skipping download as file already exists")
continue
logging.info(f"Starting FTP download: {host}/{ftp_path}")
download_attempt = 0
while download_attempt < max_download_retries:
try:
total_size = ftp.size(ftp_path)
# Try to resume using REST if partial file exists
if os.path.exists(local_path):
current_size = os.path.getsize(local_path)
mode = "ab"
else:
current_size = 0
mode = "wb"
with open(local_path, mode) as f, tqdm(
total=total_size,
unit="B",
unit_scale=True,
desc=local_path,
initial=current_size,
) as pbar:
def callback(data):
f.write(data)
pbar.update(len(data))
if current_size:
try:
ftp.sendcmd(f"REST {current_size}")
except ftplib.error_perm:
# If REST not supported, fall back to full download
current_size = 0
f.seek(0)
f.truncate()
pbar.reset() # Reset progress bar
ftp.retrbinary(f"RETR {ftp_path}", callback)
logging.info(f"Successfully downloaded {local_path}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
download_attempt += 1
logging.exception(
f"Download failed for {local_path} (attempt {download_attempt})"
)
if download_attempt >= max_download_retries:
logging.exception(
f"Giving up on {local_path} after {max_download_retries} attempts"
)
break
except Exception:
logging.exception(f"Unexpected error while processing FTP path {ftp_path}")
ftp.quit()
logging.info(f"Disconnected from FTP host: {host}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
connection_attempt += 1
logging.exception(f"FTP connection failed (attempt {connection_attempt})")
if connection_attempt < max_connection_retries:
logging.info("Retrying connection...")
time.sleep(5)
else:
logging.exception(
f"Giving up after {max_connection_retries} failed connection attempts to {host}"
)
🧰 Tools
🪛 Ruff (0.14.0)

868-868: FTP-related functions are being called. FTP is considered insecure. Use SSH/SFTP/SCP or some other encrypted protocol.

(S321)


919-919: Do not catch blind exception: Exception

(BLE001)


929-931: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


930-930: Use explicit conversion flag

Replace with conversion flag

(RUF010)


933-935: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


937-937: Do not catch blind exception: Exception

(BLE001)


938-938: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


938-938: Use explicit conversion flag

Replace with conversion flag

(RUF010)


944-944: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


944-944: Use explicit conversion flag

Replace with conversion flag

(RUF010)


949-951: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In pridepy/files/files.py around lines 853-951, fix resume and error handling:
when REST fails, reset current_size to 0, switch to write mode ('wb'), and
recreate the tqdm progress bar (or reset its initial) so the progress display
reflects a full download rather than the old initial value; replace the broad
"except Exception" around per-path processing with specific exceptions (e.g.,
OSError, ftplib.all_errors, socket.timeout) so only expected I/O/FTP errors are
caught; and replace logging.error(...) calls in all exception handlers at the
indicated lines with logging.exception(...) to include tracebacks. Ensure REST
fallback truncates the file, sets mode='wb' and current_size=0 before
re-creating the retrbinary callback/progress loop, and narrow all catch clauses
to explicit FTP/IO/socket errors.

@ypriverol ypriverol merged commit 03fd460 into master Oct 16, 2025
9 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Dec 12, 2025
