@adamziel adamziel commented Sep 18, 2025

🚧 Work in progress 🚧

Before this PR, media files were downloaded one at a time. Each download paused the insertion of posts into the database and had to be fully processed before the next download could start.

With this PR, up to 20 media files are stream-downloaded at the same time. The imported posts are processed simultaneously without blocking.

A part of WordPress/php-toolkit#138

Performance impact

I ran the a11y (accessibility) test data import 10 times with and without this PR. Here are results representative of a typical run:

With this PR – 10 seconds to import

parallel.downloads.mp4

Without this PR – 30 seconds to import

sequential.downloads.2.mp4

How does it work?

This PR uses the WordPress\HttpClient\Client class from php-toolkit. It supports:

  • PHP 7.2+ with no PHP extensions required
  • Streaming large files
  • Monitoring progress
  • Pausing and resuming downloads
  • Non-blocking downloads (stream_select tells us whether new bytes are available yet)
  • Enqueuing an arbitrary number of requests and processing them all at once
  • Adding new requests at any point in time
  • Handling failures on just the subset of failed requests
  • curl and fsockopen transports
  • HTTP cache
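
To illustrate the non-blocking item in the list above, here is a minimal sketch of polling several streams with `stream_select()` using only core PHP. It is not the `WordPress\HttpClient\Client` implementation. Local socket pairs stand in for HTTP connections so the example runs without network access (that substitution, the file names, and the payloads are assumptions made for illustration), and it assumes a Unix-like system where `STREAM_PF_UNIX` is available.

```php
<?php
// Sketch only: local socket pairs simulate three remote connections.
$readers = array(); // name => readable end
$writers = array(); // name => writable end (plays the role of the server)
$names   = array(); // resource id => name, to map ready streams back

foreach ( array( 'a.jpg', 'b.jpg', 'c.jpg' ) as $name ) {
	$pair = stream_socket_pair( STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP );
	stream_set_blocking( $pair[0], false );
	$readers[ $name ]         = $pair[0];
	$writers[ $name ]         = $pair[1];
	$names[ (int) $pair[0] ] = $name;
}

// Only two of the three "servers" have sent bytes so far.
fwrite( $writers['a.jpg'], 'bytes-of-a' );
fwrite( $writers['c.jpg'], 'bytes-of-c' );

// stream_select() rewrites $read in place, leaving only the streams
// that can be read without blocking. The timeout here is 0.2 seconds.
$read   = array_values( $readers );
$write  = null;
$except = null;
$ready  = stream_select( $read, $write, $except, 0, 200000 );

// Read only from the streams reported as ready; b.jpg is never touched,
// so the loop does not stall waiting for it.
$received = array();
foreach ( $read as $stream ) {
	$received[ $names[ (int) $stream ] ] = fread( $stream, 8192 );
}
```

A real client would repeat the select-and-read step in a loop and parse HTTP responses from the received bytes. The point of the sketch is only that a slow connection does not hold up the ones that already have data.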

Why not use the Requests class from WordPress core?

Different design goals:

  • WpOrg\Requests\Requests is a reasonably high-level tool that helps developers send one or more requests and wait until they all complete.
  • WordPress\HttpClient\Client is a low-level tool for streaming small data packets across multiple concurrent connections without blocking.

Remaining work

  • E2E tests for failure scenarios
  • Improve error handling to give user useful feedback on failure
  • Use the same file and permission validation logic as the original fetch_remote_file method
  • Make the download parallelization opt-in to a) maintain BC and b) give users a path forward in case of any unexpected errors
  • Compare performance with the previous approach
  • E2E tests for a known WXR file
  • Remove the Filesystem component from this PR
  • Revisit the data flow and structure of draining
  • Implement the concurrency limit
  • Filter request URLs through WordPress security filters
  • Remove the parts of php-toolkit we don't need for this PR

Follow-up work

  • Add a progress bar to inform the user of the currently downloaded file and its state
  • Revisit the url_remap concept, avoid the legacy UPDATE ... post_content = REPLACE() queries. Instead, use the structured URL rewrite.

Work in progress

Uses the streaming HTTPClient from WordPress/php-toolkit to parallelize
media downloads during the import, which speeds it up considerably.
Before this PR, the importer processed all downloads sequentially.
After this PR, it downloads up to 25 files in parallel.

How it works:

1. When an image is identified, the download is enqueued without blocking.
   The importer moves on to the next entity.
2. The downloader maintains a queue of async, concurrent downloads that
   can succeed, fail, redirect, be enqueued and dropped independently.
3. The queue is periodically drained. It's also drained before
   finishing.
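
The three steps above can be sketched roughly as follows. `DownloadQueue`, `fake_download()`, and every method name here are hypothetical and are not the importer's or php-toolkit's actual API. Each download is simulated by a generator that yields one chunk per tick, so the example runs without network access, and the limit of 2 is chosen only to keep the trace short.

```php
<?php
// Hypothetical sketch of the enqueue / periodic drain / final drain flow.
class DownloadQueue {
	private $max_concurrent;
	private $pending  = array(); // url => generator waiting for a free slot
	private $active   = array(); // url => generator currently in flight
	private $received = array(); // url => bytes received so far
	public $finished    = array(); // url => complete body
	public $peak_active = 0;       // highest number of simultaneous downloads

	public function __construct( $max_concurrent ) {
		$this->max_concurrent = $max_concurrent;
	}

	// Step 1: returns immediately, no bytes are transferred here.
	public function enqueue( $url, Generator $chunks ) {
		$this->pending[ $url ] = $chunks;
	}

	// Step 2: start downloads while slots are free, then advance every
	// active download by at most one chunk.
	public function tick() {
		while ( $this->pending && count( $this->active ) < $this->max_concurrent ) {
			reset( $this->pending );
			$url                    = key( $this->pending );
			$this->active[ $url ]   = $this->pending[ $url ];
			$this->received[ $url ] = '';
			unset( $this->pending[ $url ] );
		}
		$this->peak_active = max( $this->peak_active, count( $this->active ) );

		foreach ( $this->active as $url => $chunks ) {
			if ( $chunks->valid() ) {
				$this->received[ $url ] .= $chunks->current();
				$chunks->next();
			}
			if ( ! $chunks->valid() ) {
				$this->finished[ $url ] = $this->received[ $url ];
				unset( $this->active[ $url ] );
			}
		}
	}

	// Step 3: keep ticking until nothing is pending or in flight.
	public function drain() {
		while ( $this->pending || $this->active ) {
			$this->tick();
		}
	}
}

// Simulated download: yields the body in fixed-size chunks.
function fake_download( $body, $chunk_size = 4 ) {
	foreach ( str_split( $body, $chunk_size ) as $chunk ) {
		yield $chunk;
	}
}

$queue          = new DownloadQueue( 2 );
$imported_posts = array();
$attachments    = array( 'one' => 'aaaaaaaa', 'two' => 'bbbb', 'three' => 'cccccccccccc' );

foreach ( $attachments as $post => $body ) {
	$queue->enqueue( "https://example.com/$post.jpg", fake_download( $body ) );
	$imported_posts[] = $post; // the importer moves on to the next entity
	$queue->tick();            // periodic, partial drain
}
$queue->drain(); // final drain before finishing
```

Failures, redirects, and dropped requests are left out of this sketch. In the flow described above, each would be handled per request inside the drain step without affecting the other downloads.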

Remaining work:

* Configurable limits (timeout per request, max attachment size, max
  redirects, max concurrent requests, ...)
* Add test coverage that includes unhappy paths when assets are on
  non-existent servers, return different error codes, break transmission
  halfway through etc. All of these are covered in the original
  HTTPClient implementation, but let's also make sure the importer
  handles those scenarios in a useful way.
* Use WordPress URL validation utilities, similar to what
  wp_safe_remote_get does
* Consider DNS rebinding countermeasures and similar

@adamziel commented

There are a few more tasks to address before this can be merged, but the big picture is in place, so I'll open it up for reviews – cc @akirk @zaerl @brandonpayton @JanJakes

@adamziel adamziel marked this pull request as ready for review September 18, 2025 12:35