
Conversation

@ypriverol
Contributor

@ypriverol ypriverol commented Oct 16, 2025

PR Type

Enhancement


Description

  • Add ProteomeXchange dataset support for downloading raw files via XML parsing

  • Implement FTP and HTTP download methods with resume capability and progress tracking

  • Add CLI command download-px-raw-files for ProteomeXchange dataset downloads (a usage sketch follows this list)

  • Fix parameter naming in search command (filter → filters)
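
For orientation, a minimal programmatic usage sketch of the new functionality. The import path and call follow the walkthrough and sequence diagram below (Files.download_px_raw_files(px_id_or_url, output_folder, skip_if_downloaded_already)); treat the exact signature as an assumption rather than the documented API.

from pridepy.files.files import Files

# Sketch: download all raw files referenced by a ProteomeXchange dataset.
files = Files()
files.download_px_raw_files(
    "PXD039236",  # PX accession or a full GetDataset URL
    "./out",      # output folder for the downloaded raw files
    True,         # skip files that already exist locally
)

The new download-px-raw-files CLI command is the command-line equivalent of this call.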


Diagram Walkthrough

flowchart LR
  PX["ProteomeXchange Dataset"] -- "parse XML" --> URLs["Extract Raw File URLs"]
  URLs -- "FTP URLs" --> FTP["FTP Download with Resume"]
  URLs -- "HTTP URLs" --> HTTP["HTTP Download with Resume"]
  FTP --> Output["Output Folder"]
  HTTP --> Output

File Walkthrough

Relevant files
Enhancement
files.py
ProteomeXchange file download support with FTP/HTTP handlers

pridepy/files/files.py

  • Add ProteomeXchange XML parsing to extract raw file URIs from dataset
    metadata
  • Implement download_px_raw_files method supporting both FTP and HTTP
    protocols
  • Add download_ftp_urls with connection pooling, retry logic, and resume
    support
  • Add download_http_urls with resume capability and progress bars using
    tqdm
+238/-0 
pridepy.py
CLI command for ProteomeXchange downloads and parameter fixes

pridepy/pridepy.py

  • Add new CLI command download-px-raw-files for ProteomeXchange dataset
    downloads
  • Fix parameter name from filter to filters in search command
  • Update help text and option aliases for ProteomeXchange accession
    parameter
+35/-3   

Summary by CodeRabbit

  • New Features

    • New CLI command to download ProteomeXchange dataset raw files with resume, progress and retry support for FTP and HTTP(S) sources.
  • Changes

    • Renamed CLI option --filter to --filters for search commands.
  • Chores

    • Updated repository ignore rules (e.g., system files) and bumped package version to 0.0.9.

@ypriverol ypriverol requested a review from Copilot October 16, 2025 11:31
@coderabbitai
Contributor

coderabbitai bot commented Oct 16, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds ProteomeXchange (PX) raw-file download support to Files (XML normalization/parsing, FTP/HTTP download orchestration, resume/retries) and a new CLI command to trigger it; also renames the CLI/search option --filter to --filters and bumps the package version.

Changes

Cohort / File(s): Summary

  • Git Configuration (.gitignore): Add ignore rules for .cursor/rules/codacy.mdc and .DS_Store.
  • PX Support & Downloading (pridepy/files/files.py): Add PX XML normalization (_normalize_px_xml_url), PX XML parsing for raw-file URIs (_parse_px_xml_for_raw_file_urls), local-path helper (_local_path_for_url), FTP downloader (download_ftp_urls), HTTP downloader (download_http_urls), and top-level orchestrator (download_px_raw_files). Includes resume/progress/retry logic and URL classification; a parsing sketch follows this list.
  • CLI & Search API (pridepy/pridepy.py): Add download-px-raw-files CLI command and download_px_raw_files call; rename --filter to --filters and update the search_projects_by_keywords_and_filters parameter to filters.
  • Project Metadata (pyproject.toml): Bump package version from 0.0.8 to 0.0.9.
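
For illustration, a minimal sketch of the XML parsing step listed above for pridepy/files/files.py. The name/value attribute layout carrying "Associated raw file URI" is an assumption about the PX GetDataset XML, the helper name mirrors the walkthrough rather than quoting the PR's code, and a review comment further down recommends defusedxml over the standard parser.

from typing import List

import requests
import xml.etree.ElementTree as ET  # a reviewer below suggests defusedxml.ElementTree instead

def parse_px_xml_for_raw_file_urls(px_xml_url: str) -> List[str]:
    """Collect 'Associated raw file URI' values from a PX GetDataset XML document."""
    response = requests.get(px_xml_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.text)
    urls: List[str] = []
    for elem in root.iter():
        if elem.get("name") == "Associated raw file URI" and elem.get("value"):
            urls.append(elem.get("value"))
    return urls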

Sequence Diagram(s)

sequenceDiagram
    participant User as CLI User
    participant CLI as pridepy/pridepy.py
    participant Files as pridepy/files/files.py
    participant PX as ProteomeXchange
    participant DL as Downloader (FTP/HTTP)

    User->>CLI: download-px-raw-files --accession PXDxxxxxx --output-folder ./out
    CLI->>Files: download_px_raw_files(px_id_or_url, output_folder, skip_if_downloaded_already)

    Files->>Files: _normalize_px_xml_url(px_id_or_url)
    Files->>PX: GET normalized XML endpoint
    PX-->>Files: XML response

    Files->>Files: _parse_px_xml_for_raw_file_urls(px_xml_url)
    Files-->>Files: list of raw file URLs

    Files->>Files: classify URLs (ftp:// vs http(s)://)

    rect rgb(220, 240, 220)
      Files->>DL: download_ftp_urls(ftp_urls,...)
      DL->>DL: per-host grouping, resume, retries
      DL-->>Files: ftp downloads complete
    end

    rect rgb(240, 240, 220)
      Files->>DL: download_http_urls(http_urls,...)
      DL->>DL: resume support, retries, progress
      DL-->>Files: http downloads complete
    end

    Files-->>CLI: report success/failures
    CLI-->>User: Download complete / errors

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • move to poetry #61 — Modifies pridepy/pridepy.py, adds download-related CLI commands and the filter → filters rename; strong overlap with CLI/search changes.
  • Search by key word #60 — Edits the same CLI/search API and parameter naming; likely intersects on search_projects_by_keywords_and_filters.
  • fix: removed dataegap and test searchby keyword #56 — Changes the search_projects_by_keywords_and_filters signature (filter→filters removal/adjustment); related to the same function-level refactor.

Poem

🐇 I hop through XML and stream each URL,
FTP and HTTP — down they fall.
Retries and resumes, progress on show,
Raw files gathered, row by row.
A rabbit cheers: downloads all! 🎉

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title “ProteomeXchange support” concisely captures the primary new functionality introduced by the pull request (adding ProteomeXchange dataset support for raw file downloads) without extraneous details or vague phrasing. It directly aligns with the core change and is clear to any team member reviewing the project history.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9e30c7 and 08d58c5.

📒 Files selected for processing (2)
  • pridepy/pridepy.py (4 hunks)
  • pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • pridepy/pridepy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: build


@qodo-code-review
Contributor

qodo-code-review bot commented Oct 16, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Incomplete download risk

Description: FTP resume is attempted with REST but falls back to full download without validating
downloaded size or integrity, which risks incomplete or corrupted files being treated as
successful.
files.py [905-926]

Referred Code
with open(local_path, mode) as f, tqdm(
    total=total_size,
    unit="B",
    unit_scale=True,
    desc=local_path,
    initial=current_size,
) as pbar:
    def callback(data):
        f.write(data)
        pbar.update(len(data))

    if current_size:
        try:
            ftp.sendcmd(f"REST {current_size}")
        except Exception:
            # If REST not supported, fall back to full download
            current_size = 0
            f.seek(0)
            f.truncate()
    ftp.retrbinary(f"RETR {ftp_path}", callback)
logging.info(f"Successfully downloaded {local_path}")


 ... (clipped 1 lines)
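
For context, one possible mitigation is a post-download size check. The helper below is a hypothetical sketch, not part of the PR; it only compares byte counts, since checksum validation would need a checksum source that this code does not fetch.

import os
from ftplib import FTP

def verify_ftp_download(ftp: FTP, ftp_path: str, local_path: str) -> None:
    """Compare the server-reported size with the bytes written locally."""
    expected = ftp.size(ftp_path)
    actual = os.path.getsize(local_path)
    if expected is not None and actual != expected:
        raise IOError(
            f"Incomplete download for {local_path}: got {actual} of {expected} bytes"
        )
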
Integrity not verified

Description: HTTP resume relies on Content-Length and appends data without validating final size or
checksum, allowing truncated or corrupted files if server omits length or connection
drops.
files.py [982-999]

Referred Code
    with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r:
        r.raise_for_status()
        total_size = int(r.headers.get("content-length", 0)) + resume_size
        block_size = 1024 * 1024
        with tqdm(
            total=total_size,
            unit="B",
            unit_scale=True,
            desc=local_path,
            initial=resume_size,
        ) as pbar:
            with open(local_path, mode) as f:
                for chunk in r.iter_content(chunk_size=block_size):
                    if chunk:
                        f.write(chunk)
                        pbar.update(len(chunk))
    logging.info(f"Successfully downloaded {local_path}")
except Exception as e:
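
One possible mitigation, sketched as a hypothetical helper (not part of the PR): validate the file after the stream completes against an expected size and, when available, a checksum.

import hashlib
import os
from typing import Optional

def verify_http_download(
    local_path: str,
    expected_size: Optional[int] = None,
    expected_sha256: Optional[str] = None,
) -> bool:
    """Return True when the downloaded file matches the expected size and checksum."""
    if expected_size is not None and os.path.getsize(local_path) != expected_size:
        return False
    if expected_sha256 is not None:
        digest = hashlib.sha256()
        with open(local_path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_sha256:
            return False
    return True
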
Ticket Compliance
🎫 No ticket provided
- [ ] Create ticket/issue

Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@ypriverol ypriverol requested a review from selvaebi October 16, 2025 11:33
@qodo-code-review
Contributor

qodo-code-review bot commented Oct 16, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category / Suggestion / Impact
Category: High-level
Unify download logic for all sources

Refactor the existing PRIDE download functions to use the new, more capable FTP
and HTTP download methods introduced for ProteomeXchange. This will unify the
download logic, reduce code duplication, and provide consistent features like
download resuming for all data sources.

Examples:

pridepy/files/files.py [854-951]
    def download_ftp_urls(
        ftp_urls: List[str],
        output_folder: str,
        skip_if_downloaded_already: bool,
        max_connection_retries: int = 3,
        max_download_retries: int = 3,
    ) -> None:
        """
        Download a list of FTP URLs using a single connection, with retries and progress bars.
        """

 ... (clipped 88 lines)
pridepy/files/files.py [143-149]
    def download_files_from_ftp(
        file_list_json,
        output_folder,
        skip_if_downloaded_already,
        max_connection_retries=3,
        max_download_retries=3,
    ):

Solution Walkthrough:

Before:

# pridepy/files/files.py

class Files:
    # Old FTP downloader for PRIDE data (no resume)
    @staticmethod
    def download_files_from_ftp(file_list_json, ...):
        # ... implementation without resume ...
        ftp.retrbinary(...)

    # Main dispatcher for PRIDE downloads
    @staticmethod
    def download_files(file_list_json, protocol="ftp", ...):
        if protocol == "ftp":
            Files.download_files_from_ftp(file_list_json, ...)
        # ...

    # New downloader for ProteomeXchange data
    def download_px_raw_files(self, px_id_or_url, ...):
        # ... gets urls ...
        self.download_ftp_urls(ftp_urls, ...) # Uses new FTP downloader

    # New FTP downloader with resume capability
    @staticmethod
    def download_ftp_urls(ftp_urls, ...):
        # ... implementation with resume using REST command ...
        ftp.retrbinary(...)

After:

# pridepy/files/files.py

class Files:
    # Old FTP downloader refactored to use the new one
    @staticmethod
    def download_files_from_ftp(file_list_json, ...):
        # 1. Convert file_list_json to a list of FTP URLs
        ftp_urls = [file["ftpDownloadLink"] for file in file_list_json]
        # 2. Call the unified, feature-rich downloader
        Files.download_ftp_urls(ftp_urls, ...)

    # Main dispatcher now implicitly uses the new downloader
    @staticmethod
    def download_files(file_list_json, protocol="ftp", ...):
        if protocol == "ftp":
            Files.download_files_from_ftp(file_list_json, ...)
        # ...

    # New downloader for ProteomeXchange data (unchanged)
    def download_px_raw_files(self, px_id_or_url, ...):
        # ...
        self.download_ftp_urls(ftp_urls, ...)

    # Unified FTP downloader with resume capability
    @staticmethod
    def download_ftp_urls(ftp_urls, ...):
        # ... implementation with resume using REST command ...
        ftp.retrbinary(...)
Suggestion importance[1-10]: 9


Why: This is an excellent architectural suggestion that correctly identifies duplicated download logic and proposes a valuable refactoring to unify FTP/HTTP downloads, which would improve maintainability and provide consistent features like resuming across the entire tool.

Impact: High
Category: Possible issue
Fix FTP resume file corruption

Fix a flaw in the FTP download resume logic to prevent file corruption. If the
REST command for resuming fails, ensure the file is correctly handled to start
the download from the beginning.

pridepy/files/files.py [916-924]

 if current_size:
     try:
         ftp.sendcmd(f"REST {current_size}")
-    except Exception:
-        # If REST not supported, fall back to full download
+        ftp.retrbinary(f"RETR {ftp_path}", callback)
+    except (ftplib.error_perm, ftplib.error_temp):
+        # If REST not supported or fails, fall back to full download
+        logging.warning(f"FTP resume failed for {local_path}. Starting download from beginning.")
         current_size = 0
         f.seek(0)
         f.truncate()
-ftp.retrbinary(f"RETR {ftp_path}", callback)
+        # Re-initialize pbar for full download
+        pbar.reset(total=total_size)
+        pbar.initial = 0
+        pbar.update(0)
+        ftp.retrbinary(f"RETR {ftp_path}", callback)
+else:
+    ftp.retrbinary(f"RETR {ftp_path}", callback)
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a critical bug in the FTP resume logic that could lead to file corruption. The proposed fix, while needing refinement, addresses the core issue of handling REST command failures.

Impact: High
Category: General
Improve CLI option clarity and consistency

Improve CLI clarity by renaming the --accession option for ProteomeXchange
downloads to --px-id with a -px alias, and remove the confusing "deprecated"
notice from the help text.

pridepy/pridepy.py [260-267]

 @click.option(
-    "-a",
-    "--accession",
-    "--px",
+    "-px",
+    "--px-id",
     "accession",
     required=True,
-    help="ProteomeXchange accession (e.g. PXD039236). --px is deprecated.",
+    help="ProteomeXchange accession or URL (e.g., PXD039236).",
 )
Suggestion importance[1-10]: 5


Why: The suggestion correctly points out that marking an option as deprecated in the same PR it's introduced is confusing. The proposed change to use --px-id improves the CLI's clarity and usability.

Impact: Low

Contributor

Copilot AI left a comment


Pull Request Overview

Adds ProteomeXchange (PX) dataset support: a new CLI to download raw files referenced by PX datasets and core logic to normalize PX URLs, parse PX XML, and download via FTP/HTTP with resume support.

  • New CLI command: download-px-raw-files (PX accession/URL input, output folder, optional skip-existing).
  • Core Files methods for PX: URL normalization, XML parsing for "Associated raw file URI", FTP/HTTP downloaders with resume/progress.
  • Renames CLI option --filter to --filters for search_projects_by_keywords_and_filters.

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

File Description
pridepy/pridepy.py Adds the PX download CLI and renames --filter to --filters in the search projects command.
pridepy/files/files.py Implements PX URL normalization, XML parsing, and HTTP/FTP download helpers, plus supporting imports.


Comment on lines +927 to +933
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
    download_attempt += 1
    logging.error(
        f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
    )
    if download_attempt >= max_download_retries:
        logging.error(

Copilot AI Oct 16, 2025


ftplib is not imported as a module, but its names (error_temp, error_perm) are referenced via the ftplib namespace. Either add import ftplib at the top of the file (near other imports) or import the specific exceptions: from ftplib import error_temp, error_perm and then remove the ftplib. qualifier in the except clauses. This occurs in both exception handlers within download_ftp_urls.
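
For illustration, the two import styles the comment describes; either resolves the missing ftplib reference (a sketch, not the PR's code):

# Option 1: import the module and keep the qualified names
import ftplib
# except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e: ...

# Option 2: import the specific exceptions and drop the qualifier
from ftplib import error_perm, error_temp
# except (socket.timeout, error_temp, error_perm) as e: ...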

Comment on lines +942 to +944
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
    connection_attempt += 1
    logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")

Copilot AI Oct 16, 2025


Same issue as above: ftplib is not imported as a module but is referenced here. Add import ftplib or import the specific exceptions (from ftplib import error_temp, error_perm) and update the clause accordingly.

Comment on lines +275 to +279
"-skip",
"--skip_if_downloaded_already",
required=False,
default=True,
help="Boolean to skip a file if it already exists",

Copilot AI Oct 16, 2025


Click short options must be a single character (e.g., -s), so -skip will not work. Also, for a boolean flag, prefer is_flag=True and provide both forms. Suggest: @click.option('-s', '--skip-if-downloaded-already/--no-skip-if-downloaded-already', 'skip_if_downloaded_already', is_flag=True, default=True, help='Skip a file if it already exists').

Suggested change

Before:
    "-skip",
    "--skip_if_downloaded_already",
    required=False,
    default=True,
    help="Boolean to skip a file if it already exists",

After:
    "-s",
    "--skip-if-downloaded-already/--no-skip-if-downloaded-already",
    "skip_if_downloaded_already",
    is_flag=True,
    default=True,
    help="Skip a file if it already exists",

Comment on lines 356 to 358
required=False,
help="Parameters to filter the search results. The structure of the "
"filter is: field1==value1, field2==value2. Example "

Copilot AI Oct 16, 2025


The help text still refers to 'filter' (singular) but the option was renamed to --filters. Update to 'filters' in both sentences for consistency and clarity.

Comment on lines +791 to +792
return (
    f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{query}&outputMode=XML&test=no"

Copilot AI Oct 16, 2025


[nitpick] If query is empty, this builds a URL with ?&outputMode=XML.... Consider constructing the query string conditionally, e.g., qs = f'{query}&' if query else '' and then ...GetDataset?{qs}outputMode=XML&test=no.

Suggested change

Before:
    return (
        f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{query}&outputMode=XML&test=no"

After:
    qs = f"{query}&" if query else ""
    return (
        f"https://proteomecentral.proteomexchange.org/cgi/GetDataset?{qs}outputMode=XML&test=no"

Comment on lines +976 to +991
mode = "ab"
else:
resume_size = 0
headers = {}
mode = "wb"

with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r:
r.raise_for_status()
total_size = int(r.headers.get("content-length", 0)) + resume_size
block_size = 1024 * 1024
with tqdm(
total=total_size,
unit="B",
unit_scale=True,
desc=local_path,
initial=resume_size,

Copilot AI Oct 16, 2025


[nitpick] When resuming, if the server ignores the Range header and returns 200 (full content), adding resume_size to content-length will overcount and may duplicate data. Guard by checking for r.status_code == 206 or a Content-Range header; only add resume_size when the response is partial. Otherwise, fall back to rewriting from the start.

Suggested change

Before:
    mode = "ab"
else:
    resume_size = 0
    headers = {}
    mode = "wb"
with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r:
    r.raise_for_status()
    total_size = int(r.headers.get("content-length", 0)) + resume_size
    block_size = 1024 * 1024
    with tqdm(
        total=total_size,
        unit="B",
        unit_scale=True,
        desc=local_path,
        initial=resume_size,

After:
else:
    resume_size = 0
    headers = {}
with session.get(url, stream=True, headers=headers, timeout=(10, 60)) as r:
    r.raise_for_status()
    # Check if server honored Range request
    is_partial = r.status_code == 206 or "content-range" in r.headers
    if is_partial:
        mode = "ab"
        total_size = int(r.headers.get("content-length", 0)) + resume_size
        initial = resume_size
    else:
        # Server ignored Range, start from scratch
        mode = "wb"
        resume_size = 0
        total_size = int(r.headers.get("content-length", 0))
        initial = 0
    block_size = 1024 * 1024
    with tqdm(
        total=total_size,
        unit="B",
        unit_scale=True,
        desc=local_path,
        initial=initial,

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
.gitignore (1)

18-19: Remove redundant blank lines.

These extra blank lines after .qodo are unnecessary and don't follow a consistent pattern with the rest of the file.

Apply this diff:

 .qodo
-

-
 #Ignore cursor AI rules
pridepy/files/files.py (3)

853-951: FTP is insecure but may be required.

FTP transmits credentials and data in plaintext. If possible, prefer SFTP/FTPS. However, if ProteomeXchange repositories only support FTP, document this limitation.
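
As an aside, and not something this PR implements, ftplib also ships FTP_TLS, so an explicit-FTPS variant of the connection helper could look roughly like this, assuming the server supports AUTH TLS:

from ftplib import FTP_TLS

def connect_ftps(host: str) -> FTP_TLS:
    """Sketch of an FTPS connection helper mirroring connect_ftp in this PR."""
    ftps = FTP_TLS(host, timeout=30)
    ftps.login()      # anonymous login, as in the plain-FTP path
    ftps.prot_p()     # protect the data channel with TLS
    ftps.set_pasv(True)
    return ftps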


953-1000: Improve error handling in HTTP downloads.

Similar to the FTP downloader:

  1. Blind exception (line 999): Catching Exception is too broad.
  2. Use logging.exception (line 1000): Include traceback for better debugging.

Apply these fixes:

             logging.info(f"Successfully downloaded {local_path}")
-        except Exception as e:
-            logging.error(f"HTTP download failed for {url}: {str(e)}")
+        except Exception:
+            logging.exception(f"HTTP download failed for {url}")

770-797: Use query param merging to avoid duplicates

In _normalize_px_xml_url (pridepy/files/files.py: L790–L793), the code appends &outputMode=XML&test=no onto the original query, risking duplicate or conflicting values. Instead, parse and merge parameters with urllib.parse.parse_qs and urlencode:

from urllib.parse import urlparse, parse_qs, urlencode

parsed = urlparse(px_id_or_url)
params = parse_qs(parsed.query)
params['outputMode'] = ['XML']
params['test'] = ['no']
new_query = urlencode(params, doseq=True)
return f"{parsed.scheme}://{parsed.netloc}{parsed.path}?{new_query}"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68cb4ac and c9e30c7.

📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • pridepy/files/files.py (2 hunks)
  • pridepy/pridepy.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pridepy/files/files.py (2)
pridepy/util/api_handling.py (3)
  • Util (18-155)
  • get_api_call (26-37)
  • create_session_with_retries (145-155)
pridepy/pridepy.py (1)
  • download_px_raw_files (281-285)
pridepy/pridepy.py (1)
pridepy/files/files.py (2)
  • download_px_raw_files (820-846)
  • Files (50-1000)
🪛 Ruff (0.14.0)
pridepy/files/files.py

808-808: Using xml to parse untrusted data is known to be vulnerable to XML attacks; use defusedxml equivalents

(S314)


868-868: FTP-related functions are being called. FTP is considered insecure. Use SSH/SFTP/SCP or some other encrypted protocol.

(S321)


919-919: Do not catch blind exception: Exception

(BLE001)


929-931: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


930-930: Use explicit conversion flag

Replace with conversion flag

(RUF010)


933-935: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


937-937: Do not catch blind exception: Exception

(BLE001)


938-938: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


938-938: Use explicit conversion flag

Replace with conversion flag

(RUF010)


944-944: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


944-944: Use explicit conversion flag

Replace with conversion flag

(RUF010)


949-951: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


999-999: Do not catch blind exception: Exception

(BLE001)


1000-1000: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


1000-1000: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (5)
.gitignore (1)

20-22: LGTM!

The ignore rules for Cursor AI and system files are appropriate.

pridepy/files/files.py (2)

799-818: LGTM with security note.

The XML parsing logic correctly extracts "Associated raw file URI" values. However, ensure you apply the defusedxml fix mentioned in the earlier comment.


820-846: LGTM!

The orchestration logic correctly separates FTP and HTTP URLs and delegates to the appropriate download methods.

pridepy/pridepy.py (2)

256-286: LGTM!

The new CLI command correctly delegates to Files.download_px_raw_files with appropriate options and logging.


355-355: Document the breaking change from --filter to --filters.

Renaming the CLI option from --filter to --filters is a breaking change that will affect existing users and scripts. Ensure this is documented in release notes or a changelog, and consider providing a deprecation period where both options are accepted.

Alternatively, you could temporarily support both options with a deprecation warning:

@click.option(
    "-f",
    "--filters",
    "--filter",
    "filters",
    required=False,
    help="Parameters to filter the search results...",
)

Then add a warning if the old option is detected (this requires custom click callback logic).

Also applies to: 399-399, 415-415
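
To make that concrete, a minimal sketch of such a callback (hypothetical, not part of this PR; the sys.argv check is only a heuristic for detecting which spelling was typed):

import sys
import warnings

import click

def _warn_deprecated_filter(ctx, param, value):
    """Emit a deprecation warning when the old --filter spelling is used."""
    if value is not None and "--filter" in sys.argv:
        warnings.warn("--filter is deprecated; use --filters instead", DeprecationWarning)
    return value

@click.command()  # stand-in command for illustration only
@click.option(
    "-f",
    "--filters",
    "--filter",
    "filters",
    required=False,
    callback=_warn_deprecated_filter,
    help="Parameters to filter the search results.",
)
def search(filters):
    click.echo(filters)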

Comment on lines +15 to +16
from urllib.parse import urlparse
import xml.etree.ElementTree as ET
Contributor


⚠️ Potential issue | 🔴 Critical

Use defusedxml to prevent XML attacks.

The standard xml.etree.ElementTree is vulnerable to XML bomb attacks, billion laughs, quadratic blowup, and external entity expansion. ProteomeXchange XML comes from an external source and should be treated as untrusted.

Install and use defusedxml:

 from urllib.parse import urlparse
-import xml.etree.ElementTree as ET
+import defusedxml.ElementTree as ET

Then add defusedxml to your project dependencies (e.g., requirements.txt or setup.py).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

After:
from urllib.parse import urlparse
import defusedxml.ElementTree as ET
🤖 Prompt for AI Agents
In pridepy/files/files.py around lines 15 to 16, the code imports
xml.etree.ElementTree which is unsafe for untrusted XML; replace that import
with defusedxml.ElementTree (e.g., from defusedxml import ElementTree as ET) so
parsing uses the secure implementation, update any references to ET accordingly,
and add defusedxml to the project dependencies (requirements.txt or setup.py)
and run tests to confirm no API mismatches.

Comment on lines +848 to +851
@staticmethod
def _local_path_for_url(download_url: str, output_folder: str) -> str:
    filename = os.path.basename(urlparse(download_url).path)
    return os.path.join(output_folder, filename)
Contributor


⚠️ Potential issue | 🟡 Minor

Handle URLs with trailing slashes.

os.path.basename() returns an empty string if the URL path ends with a slash, which would result in an invalid local path.

Add validation:

 @staticmethod
 def _local_path_for_url(download_url: str, output_folder: str) -> str:
     filename = os.path.basename(urlparse(download_url).path)
+    if not filename:
+        raise ValueError(f"Unable to extract filename from URL: {download_url}")
     return os.path.join(output_folder, filename)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
@staticmethod
def _local_path_for_url(download_url: str, output_folder: str) -> str:
    filename = os.path.basename(urlparse(download_url).path)
    return os.path.join(output_folder, filename)

After:
@staticmethod
def _local_path_for_url(download_url: str, output_folder: str) -> str:
    filename = os.path.basename(urlparse(download_url).path)
    if not filename:
        raise ValueError(f"Unable to extract filename from URL: {download_url}")
    return os.path.join(output_folder, filename)
🤖 Prompt for AI Agents
In pridepy/files/files.py around lines 848 to 851, the method can return an
empty filename when the URL path ends with a slash; change it to strip trailing
slashes and pick the last non-empty path segment as the filename, and if that
still yields an empty string fall back to a safe name (e.g., netloc or a
uuid-based name, optionally with a default extension). Also unquote and sanitize
the chosen filename to remove unsafe characters before joining with
output_folder so the resulting local path is always valid.
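
A sketch of the fallback described in that prompt (hypothetical, not the PR's code):

import os
import uuid
from urllib.parse import unquote, urlparse

def local_path_for_url(download_url: str, output_folder: str) -> str:
    """Pick the last non-empty path segment, sanitize it, and fall back to a generated name."""
    path = urlparse(download_url).path.rstrip("/")
    filename = unquote(os.path.basename(path))
    # Keep only filesystem-safe characters
    filename = "".join(c for c in filename if c.isalnum() or c in "._-")
    if not filename:
        filename = f"download-{uuid.uuid4().hex}"  # last-resort name
    return os.path.join(output_folder, filename)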

Comment on lines +853 to +951
@staticmethod
def download_ftp_urls(
ftp_urls: List[str],
output_folder: str,
skip_if_downloaded_already: bool,
max_connection_retries: int = 3,
max_download_retries: int = 3,
) -> None:
"""
Download a list of FTP URLs using a single connection, with retries and progress bars.
"""
if not os.path.isdir(output_folder):
os.makedirs(output_folder, exist_ok=True)

def connect_ftp(host: str):
ftp = FTP(host, timeout=30)
ftp.login()
ftp.set_pasv(True)
logging.info(f"Connected to FTP host: {host}")
return ftp

# Group URLs by host to reuse connections efficiently
host_to_paths: Dict[str, List[str]] = {}
for url in ftp_urls:
parsed = urlparse(url)
host_to_paths.setdefault(parsed.hostname, []).append(parsed.path.lstrip("/"))

for host, paths in host_to_paths.items():
connection_attempt = 0
while connection_attempt < max_connection_retries:
try:
ftp = connect_ftp(host)
for ftp_path in paths:
try:
local_path = os.path.join(output_folder, os.path.basename(ftp_path))
if skip_if_downloaded_already and os.path.exists(local_path):
logging.info("Skipping download as file already exists")
continue

logging.info(f"Starting FTP download: {host}/{ftp_path}")
download_attempt = 0
while download_attempt < max_download_retries:
try:
total_size = ftp.size(ftp_path)
# Try to resume using REST if partial file exists
if os.path.exists(local_path):
current_size = os.path.getsize(local_path)
mode = "ab"
else:
current_size = 0
mode = "wb"

with open(local_path, mode) as f, tqdm(
total=total_size,
unit="B",
unit_scale=True,
desc=local_path,
initial=current_size,
) as pbar:
def callback(data):
f.write(data)
pbar.update(len(data))

if current_size:
try:
ftp.sendcmd(f"REST {current_size}")
except Exception:
# If REST not supported, fall back to full download
current_size = 0
f.seek(0)
f.truncate()
ftp.retrbinary(f"RETR {ftp_path}", callback)
logging.info(f"Successfully downloaded {local_path}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
download_attempt += 1
logging.error(
f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
)
if download_attempt >= max_download_retries:
logging.error(
f"Giving up on {local_path} after {max_download_retries} attempts."
)
break
except Exception as e:
logging.error(f"Unexpected error while processing FTP path {ftp_path}: {str(e)}")
ftp.quit()
logging.info(f"Disconnected from FTP host: {host}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
connection_attempt += 1
logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
if connection_attempt < max_connection_retries:
logging.info("Retrying connection...")
time.sleep(5)
else:
logging.error(
f"Giving up after {max_connection_retries} failed connection attempts to {host}."
)
Contributor


⚠️ Potential issue | 🟠 Major

Fix resume logic and improve error handling.

Several issues:

  1. Resume fallback bug (lines 916-923): If REST is not supported and the code falls back to a full download by truncating the file, the progress bar's initial=current_size is not reset to 0, causing incorrect progress display.

  2. Blind exception catching (lines 919, 937): Catching Exception is too broad. Be specific about what failures to handle.

  3. Use logging.exception (lines 929, 933, 938, 944, 949): In exception handlers, use logging.exception() instead of logging.error() to include the traceback.

Apply these fixes:

                                     if current_size:
                                         try:
                                             ftp.sendcmd(f"REST {current_size}")
-                                        except Exception:
+                                        except ftplib.error_perm:
                                             # If REST not supported, fall back to full download
                                             current_size = 0
                                             f.seek(0)
                                             f.truncate()
+                                            pbar.reset()  # Reset progress bar
                                         ftp.retrbinary(f"RETR {ftp_path}", callback)
                             except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
                                 download_attempt += 1
-                                logging.error(
-                                    f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
+                                logging.exception(
+                                    f"Download failed for {local_path} (attempt {download_attempt})"
                                 )
                                 if download_attempt >= max_download_retries:
-                                    logging.error(
-                                        f"Giving up on {local_path} after {max_download_retries} attempts."
+                                    logging.exception(
+                                        f"Giving up on {local_path} after {max_download_retries} attempts"
                                     )
-                    except Exception as e:
-                        logging.error(f"Unexpected error while processing FTP path {ftp_path}: {str(e)}")
+                    except Exception:
+                        logging.exception(f"Unexpected error while processing FTP path {ftp_path}")
             except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
                 connection_attempt += 1
-                logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
+                logging.exception(f"FTP connection failed (attempt {connection_attempt})")
                 if connection_attempt < max_connection_retries:
                     logging.info("Retrying connection...")
                     time.sleep(5)
                 else:
-                    logging.error(
-                        f"Giving up after {max_connection_retries} failed connection attempts to {host}."
+                    logging.exception(
+                        f"Giving up after {max_connection_retries} failed connection attempts to {host}"
                     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@staticmethod
def download_ftp_urls(
ftp_urls: List[str],
output_folder: str,
skip_if_downloaded_already: bool,
max_connection_retries: int = 3,
max_download_retries: int = 3,
) -> None:
"""
Download a list of FTP URLs using a single connection, with retries and progress bars.
"""
if not os.path.isdir(output_folder):
os.makedirs(output_folder, exist_ok=True)
def connect_ftp(host: str):
ftp = FTP(host, timeout=30)
ftp.login()
ftp.set_pasv(True)
logging.info(f"Connected to FTP host: {host}")
return ftp
# Group URLs by host to reuse connections efficiently
host_to_paths: Dict[str, List[str]] = {}
for url in ftp_urls:
parsed = urlparse(url)
host_to_paths.setdefault(parsed.hostname, []).append(parsed.path.lstrip("/"))
for host, paths in host_to_paths.items():
connection_attempt = 0
while connection_attempt < max_connection_retries:
try:
ftp = connect_ftp(host)
for ftp_path in paths:
try:
local_path = os.path.join(output_folder, os.path.basename(ftp_path))
if skip_if_downloaded_already and os.path.exists(local_path):
logging.info("Skipping download as file already exists")
continue
logging.info(f"Starting FTP download: {host}/{ftp_path}")
download_attempt = 0
while download_attempt < max_download_retries:
try:
total_size = ftp.size(ftp_path)
# Try to resume using REST if partial file exists
if os.path.exists(local_path):
current_size = os.path.getsize(local_path)
mode = "ab"
else:
current_size = 0
mode = "wb"
with open(local_path, mode) as f, tqdm(
total=total_size,
unit="B",
unit_scale=True,
desc=local_path,
initial=current_size,
) as pbar:
def callback(data):
f.write(data)
pbar.update(len(data))
if current_size:
try:
ftp.sendcmd(f"REST {current_size}")
except Exception:
# If REST not supported, fall back to full download
current_size = 0
f.seek(0)
f.truncate()
ftp.retrbinary(f"RETR {ftp_path}", callback)
logging.info(f"Successfully downloaded {local_path}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
download_attempt += 1
logging.error(
f"Download failed for {local_path} (attempt {download_attempt}): {str(e)}"
)
if download_attempt >= max_download_retries:
logging.error(
f"Giving up on {local_path} after {max_download_retries} attempts."
)
break
except Exception as e:
logging.error(f"Unexpected error while processing FTP path {ftp_path}: {str(e)}")
ftp.quit()
logging.info(f"Disconnected from FTP host: {host}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
connection_attempt += 1
logging.error(f"FTP connection failed (attempt {connection_attempt}): {str(e)}")
if connection_attempt < max_connection_retries:
logging.info("Retrying connection...")
time.sleep(5)
else:
logging.error(
f"Giving up after {max_connection_retries} failed connection attempts to {host}."
)
@staticmethod
def download_ftp_urls(
ftp_urls: List[str],
output_folder: str,
skip_if_downloaded_already: bool,
max_connection_retries: int = 3,
max_download_retries: int = 3,
) -> None:
"""
Download a list of FTP URLs using a single connection, with retries and progress bars.
"""
if not os.path.isdir(output_folder):
os.makedirs(output_folder, exist_ok=True)
def connect_ftp(host: str):
ftp = FTP(host, timeout=30)
ftp.login()
ftp.set_pasv(True)
logging.info(f"Connected to FTP host: {host}")
return ftp
# Group URLs by host to reuse connections efficiently
host_to_paths: Dict[str, List[str]] = {}
for url in ftp_urls:
parsed = urlparse(url)
host_to_paths.setdefault(parsed.hostname, []).append(parsed.path.lstrip("/"))
for host, paths in host_to_paths.items():
connection_attempt = 0
while connection_attempt < max_connection_retries:
try:
ftp = connect_ftp(host)
for ftp_path in paths:
try:
local_path = os.path.join(output_folder, os.path.basename(ftp_path))
if skip_if_downloaded_already and os.path.exists(local_path):
logging.info("Skipping download as file already exists")
continue
logging.info(f"Starting FTP download: {host}/{ftp_path}")
download_attempt = 0
while download_attempt < max_download_retries:
try:
total_size = ftp.size(ftp_path)
# Try to resume using REST if partial file exists
if os.path.exists(local_path):
current_size = os.path.getsize(local_path)
mode = "ab"
else:
current_size = 0
mode = "wb"
with open(local_path, mode) as f, tqdm(
total=total_size,
unit="B",
unit_scale=True,
desc=local_path,
initial=current_size,
) as pbar:
def callback(data):
f.write(data)
pbar.update(len(data))
if current_size:
try:
ftp.sendcmd(f"REST {current_size}")
except ftplib.error_perm:
# If REST not supported, fall back to full download
current_size = 0
f.seek(0)
f.truncate()
pbar.reset() # Reset progress bar
ftp.retrbinary(f"RETR {ftp_path}", callback)
logging.info(f"Successfully downloaded {local_path}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm) as e:
download_attempt += 1
logging.exception(
f"Download failed for {local_path} (attempt {download_attempt})"
)
if download_attempt >= max_download_retries:
logging.exception(
f"Giving up on {local_path} after {max_download_retries} attempts"
)
break
except Exception:
logging.exception(f"Unexpected error while processing FTP path {ftp_path}")
ftp.quit()
logging.info(f"Disconnected from FTP host: {host}")
break
except (socket.timeout, ftplib.error_temp, ftplib.error_perm, socket.error) as e:
connection_attempt += 1
logging.exception(f"FTP connection failed (attempt {connection_attempt})")
if connection_attempt < max_connection_retries:
logging.info("Retrying connection...")
time.sleep(5)
else:
logging.exception(
f"Giving up after {max_connection_retries} failed connection attempts to {host}"
)
🧰 Tools
🪛 Ruff (0.14.0)

868-868: FTP-related functions are being called. FTP is considered insecure. Use SSH/SFTP/SCP or some other encrypted protocol.

(S321)


919-919: Do not catch blind exception: Exception

(BLE001)


929-931: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


930-930: Use explicit conversion flag

Replace with conversion flag

(RUF010)


933-935: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


937-937: Do not catch blind exception: Exception

(BLE001)


938-938: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


938-938: Use explicit conversion flag

Replace with conversion flag

(RUF010)


944-944: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


944-944: Use explicit conversion flag

Replace with conversion flag

(RUF010)


949-951: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In pridepy/files/files.py around lines 853-951, fix resume and error handling:
when REST fails, reset current_size to 0, switch to write mode ('wb'), and
recreate the tqdm progress bar (or reset its initial) so the progress display
reflects a full download rather than the old initial value; replace the broad
"except Exception" around per-path processing with specific exceptions (e.g.,
OSError, ftplib.all_errors, socket.timeout) so only expected I/O/FTP errors are
caught; and replace logging.error(...) calls in all exception handlers at the
indicated lines with logging.exception(...) to include tracebacks. Ensure REST
fallback truncates the file, sets mode='wb' and current_size=0 before
re-creating the retrbinary callback/progress loop, and narrow all catch clauses
to explicit FTP/IO/socket errors.

@ypriverol ypriverol merged commit 03fd460 into master Oct 16, 2025
9 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Dec 12, 2025
