
Conversation

carloea2 (Contributor) commented Dec 11, 2025

What changes were proposed in this PR?

  • Backend (DatasetResource)

    • Partially reworked multipart upload API (a sketch of the call sequence follows this list):

      • POST /dataset/multipart-upload?type=init → creates a LakeFS multipart upload session, encodes uploadId|did|uid|filePathB64|physicalAddress into a server-side encrypted uploadToken, and returns it.
      • POST /dataset/multipart-upload/part?token=...&partNumber=... → streams a single part
      • POST /dataset/multipart-upload?type=finish|abort → completes or aborts the LakeFS multipart upload
    • Keeps existing access control and dataset permission checks enforced on all changed endpoints.

  • Frontend service (dataset.service.ts)

    • Main changes in multipartUpload(...):

      • Calls init to get the encrypted uploadToken.
      • Uploads file parts via /multipart-upload/part, streaming them concurrently.
  • Frontend component (dataset-detail.component.ts)

    • Uses the uploadToken to cancel/abort an in-progress upload.
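
As a quick illustration of how the three endpoints fit together, here is a hedged Scala sketch of the call sequence; MultipartUploadClient and its method names are hypothetical stand-ins for the routes above, not the actual service or frontend code:

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Hypothetical wrapper around the three routes; not code from this PR.
trait MultipartUploadClient {
  def init(datasetPath: String): String                                   // type=init → encrypted uploadToken
  def uploadPart(token: String, partNumber: Int, body: InputStream): Unit // /part?token=...&partNumber=...
  def finish(token: String): Unit                                         // type=finish
  def abort(token: String): Unit                                          // type=abort
}

object UploadFlowSketch {
  /** Init once, upload parts with bounded concurrency, then finish (or abort on failure). */
  def uploadAllParts(client: MultipartUploadClient, datasetPath: String,
                     parts: Seq[Array[Byte]], concurrency: Int): Unit = {
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(concurrency)) // caps in-flight parts
    val token = client.init(datasetPath)
    try {
      val uploads = parts.zipWithIndex.map { case (bytes, idx) =>
        Future(client.uploadPart(token, idx + 1, new ByteArrayInputStream(bytes))) // part numbers start at 1
      }
      Await.result(Future.sequence(uploads), Duration.Inf)
      client.finish(token)
    } catch {
      case e: Exception =>
        client.abort(token) // mirrors the cancel/abort path used by the frontend component
        throw e
    }
  }
}
```

The real implementation does this in TypeScript in dataset.service.ts; the sketch only shows the ordering of init → part uploads → finish/abort.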

Any related issues, documentation, discussions?

Closes #4110


How was this PR tested?

  • Manually uploaded large files via the dataset detail page (single and multiple), checked:

    • Progress, speed, and ETA updates.
    • Abort behavior (UI state).
    • Successful completion path (all parts COMPLETED, LakeFS object present, dataset version creation works, use of files in workflow is correct).
  • Unit tests are not yet included.

Was this PR authored or co-authored using generative AI tooling?

Yes, GPT was used for parts of the implementation.

carloea2 (Contributor, Author) commented Dec 11, 2025


Why I prefer backing multipart uploads with a DB

I am proposing to store multipart upload state in a DB instead of keeping everything fully stateless, for several reasons:

  1. Clear concurrency guarantees per (uploadId, partNumber)

    Both Huawei OBS and Amazon S3 allow multiple uploads to the same uploadId + partNumber and simply apply last-write-wins semantics. OBS is very explicit about concurrent streams: its UploadPart API states that if you repeatedly upload the same partNumber, the last upload overwrites the previous one; when there are multiple concurrent uploads for the same partNumber of the same object, the server applies a last-write-wins policy, and OBS recommends locking concurrent uploads of the same part on the client side to ensure data accuracy. S3’s multipart upload docs similarly say that if you upload a new part using a part number that was used before, the previously uploaded part is overwritten. In other words, neither backend prevents concurrent writers to the same part; they just pick a winner.

    With a DB row keyed by (upload_id, part_number) we can use row-level locking to enforce “only one active stream can write this part” in a way that is explicit and portable, regardless of which storage backend we are using. This is useful both for well-behaved clients (we can give strong semantics) and for protecting ourselves from obvious waste (e.g., the same upload repeatedly opening the same part). In a multi-node/Kubernetes deployment, in-memory locks or sticky sessions are fragile or operationally complex; a shared DB is the simplest coordination point we already have. Without a DB we would need to complicate the API (for example /uploadPart/{uploadToken}/{partNumber} plus client-side coordination and disabled concurrency on that endpoint) and we still would not have a single canonical place where we enforce this rule. Note that this is complementary to, not a replacement for, rate limiting and abuse protection at other layers. A minimal sketch of this row-level locking (and of the per-part bookkeeping from point 2) is included after this list.

  2. Efficient and well-scoped listing, aligned with S3 guidance

    If the client stores the raw uploadId and we rely purely on the S3-compatible APIs, completing an upload means calling ListParts or ListMultipartUploads and filtering:

    • ListMultipartUploads lists in-progress multipart uploads for the entire bucket, up to 1000 at a time.
    • ListParts lists the parts for a specific multipart upload, given its uploadId.

    In our lakeFS setup I observed that ListMultipartUploads behaves effectively repo-wide, and the prefix filtering we would like (for something like /dataset-1/...) does not give us the narrow, efficient listing we want. As more uploads accumulate under the same repo, listing gets slower and less predictable. On top of that, the S3 Developer Guide explicitly says that the multipart part listing is not supposed to be your source of truth for completing an upload and that you should maintain your own list of part numbers and ETags. A DB table is exactly that “own list”: an indexed, well-scoped record of parts per logical upload that we can query efficiently and deterministically, independent of how many uploads exist in the bucket or repo, and independent of quirks of the underlying S3/lakeFS implementation.

  3. Opaque upload tokens vs exposing uploadId (and the encrypted token alternative)

    I would like to avoid exposing the raw S3 uploadId directly to clients. It is a backend-specific identifier, and once clients depend on it we are coupled to a specific storage implementation and data model.

    There are two designs we could follow:

    • Encrypted token approach, client as temporary storage
      We could encode uploadId and any metadata (path, repo, limits, timestamps, etc.) into an encrypted and signed token, give that to the client, and let the client act as our temporary storage. On each request, the client sends the token back, we decrypt and verify it, and recover the uploadId. This works, but it has tradeoffs:

      • The token grows as we pack more metadata inside.
      • Any schema change or new metadata field requires token versioning and migration logic.
      • Server-side changes such as aborting an upload, tightening limits, or invalidating uploads become harder, because the server no longer owns the authoritative state and must handle multiple token versions that may be “in the wild”.
    • DB-backed opaque token (preferred)
      With a DB, we can generate a short-lived opaque upload token that simply references a row where we store the real uploadId plus metadata. The client never sees uploadId directly, and the token itself can stay small and stable. All evolution (new columns, extra constraints, per-user quotas, flags, counters) happens in the DB without changing the client contract. Invalidating or aborting an upload is also as simple as flipping state in the DB.

    Both approaches keep uploadId hidden, but the DB-backed option is simpler to reason about, more flexible as we add features, and easier to revoke or invalidate than encoding everything into an ever-growing token.

  4. Enforcing max file size or resource limits without hammering the DB

    If we want a hard limit on the total bytes an upload can consume, we need to enforce that while data is streaming, not only at the very end when we call CompleteMultipartUpload. The multipart upload docs already recommend maintaining our own mapping from part numbers to ETags for completion; that same per-upload metadata is a natural place to store limits.

    My preferred approach is:

    • The frontend tells us the intended total file size and desired number of parts.
    • The backend computes a per-part limit maxPartSize = fileSize / numParts (rounded up) and stores it in the DB as metadata for that upload.
    • When the client uploads part X, we require Content-Length and verify that it is less than or equal to maxPartSize for this upload before streaming to S3. We can also keep a local counter for the current request and immediately cut the stream once it exceeds maxPartSize, so the enforcement is truly “hard” from our perspective.

    This reduces DB work to O(number of parts) checks instead of O(number of chunks). For example, 1 GiB split into 10 parts of roughly 100 MiB each, streamed in 8 KiB chunks, would otherwise require over 130 000 DB addAndGet calls if we tried to maintain a global accumulator per chunk. Here we do at most 10 DB-backed checks, because we only need to validate each part once. Since maxPartSize and any retry counters live server-side in the DB (not in a client-controlled token), clients cannot bump their own limit or reset their own attempt counters. A minimal sketch of this per-part enforcement is also included after this list.

    If we do not have fileSize and numParts from the frontend we have strictly worse options:

    • Maintain a global per-upload byte counter and update it on every chunk: this is a very hot DB row and write-heavy.
    • Only update the counter per completed part: cheaper on the DB, but then we only detect a violation after the part has already consumed bandwidth and CPU.
    • Use a fixed global maxPartSize based on backend limits (for example 5 GiB S3 part size) and derive a maximum number of parts from an overall limit: this is coarse and either rejects otherwise valid uploads or still requires tracking remaining budget per upload in a more complex way.

    Also, if we enforce a global max part size equal to S3’s limit (5 GiB), then many realistic uploads will be restricted to a single part and therefore lose the benefit of concurrent part uploads entirely. With the DB-backed per-upload maxPartSize we can still enforce a hard overall limit while allowing the client to choose a higher degree of concurrency for objects that are much smaller than 5 GiB.

    Finally, since this metadata lives in the DB, we can track additional abuse-related signals per upload (e.g., how many times an upload has been started and cancelled right before completion) and cap retries or detect obviously wasteful patterns without continuously growing a token or round-tripping more and more encoded state to the client.

  5. Operational benefits and future features

    A DB row per upload also gives us:

    • Better observability: we can see which uploads exist, their status, size, parts, timestamps, owners, and retry history without scraping object store listings.
    • Easier cleanup: we can implement “abort incomplete uploads older than N hours or days” using our DB, instead of scanning all multipart uploads in the bucket and reverse-mapping them to our logical repos.
    • A natural place to attach future features such as per-user quotas, project-level limits, resumable uploads, server-driven cancellation, or finer-grained policies, all without changing the external client-visible API or token format.

    Rate limiting, WAF rules, and other perimeter protections are still necessary for dealing with truly malicious clients, but having upload state modeled explicitly in a DB gives us a clean, consistent substrate for correctness, per-upload policies, and evolution over time.
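
To make points 1 and 2 concrete, here is a minimal sketch of the row-level locking and of the per-part bookkeeping, using plain JDBC; the table and column names (dataset_upload_part, upload_id, part_number, etag, status) are hypothetical and the SQL uses MySQL syntax:

```scala
import java.sql.Connection

object UploadPartLock {

  // Hypothetical schema, not part of this PR (MySQL syntax):
  //   CREATE TABLE dataset_upload_part (
  //     upload_id   VARCHAR(64) NOT NULL,
  //     part_number INT         NOT NULL,
  //     etag        VARCHAR(128),
  //     status      VARCHAR(16) NOT NULL DEFAULT 'PENDING',
  //     PRIMARY KEY (upload_id, part_number)
  //   );

  /** Runs `uploadToStorage` while holding a row lock on (upload_id, part_number),
    * so only one active stream can write a given part at a time; the recorded
    * ETags double as our "own list" of parts for CompleteMultipartUpload. */
  def withPartLock(conn: Connection, uploadId: String, partNumber: Int)(
      uploadToStorage: => String // returns the ETag from the storage backend
  ): String = {
    conn.setAutoCommit(false)
    try {
      // Make sure the row exists so it can be locked.
      val upsert = conn.prepareStatement(
        "INSERT IGNORE INTO dataset_upload_part (upload_id, part_number) VALUES (?, ?)")
      upsert.setString(1, uploadId); upsert.setInt(2, partNumber); upsert.executeUpdate()

      // Row-level lock: a second stream for the same (upload_id, part_number) blocks here.
      val lock = conn.prepareStatement(
        "SELECT status FROM dataset_upload_part WHERE upload_id = ? AND part_number = ? FOR UPDATE")
      lock.setString(1, uploadId); lock.setInt(2, partNumber); lock.executeQuery()

      val eTag = uploadToStorage // stream the part to LakeFS/S3 while holding the lock

      // Record the ETag so completion can read our own parts table instead of ListParts.
      val done = conn.prepareStatement(
        "UPDATE dataset_upload_part SET etag = ?, status = 'COMPLETED' " +
          "WHERE upload_id = ? AND part_number = ?")
      done.setString(1, eTag); done.setString(2, uploadId); done.setInt(3, partNumber)
      done.executeUpdate()

      conn.commit()
      eTag
    } catch {
      case e: Exception =>
        conn.rollback()
        throw e
    }
  }
}
```

And a minimal sketch of the per-part size enforcement from point 4; the helper names and the 8 KiB buffer size are illustrative:

```scala
import java.io.{InputStream, OutputStream}

object PartSizeLimit {

  /** Computed once at init time and stored in the DB as per-upload metadata. */
  def maxPartSize(fileSize: Long, numParts: Int): Long =
    (fileSize + numParts - 1) / numParts // ceiling division, so equally sized parts are not rejected

  /** Rejects the part up front via Content-Length, then enforces the limit again
    * while streaming so a misbehaving client cannot exceed it. */
  def copyWithLimit(in: InputStream, out: OutputStream, contentLength: Long, limit: Long): Long = {
    require(contentLength <= limit, s"declared Content-Length $contentLength exceeds per-part limit $limit")
    val buf = new Array[Byte](8192)
    var total = 0L
    var read = in.read(buf)
    while (read != -1) {
      total += read
      if (total > limit)
        throw new IllegalStateException(s"part exceeded per-part limit of $limit bytes") // cut the stream
      out.write(buf, 0, read)
      read = in.read(buf)
    }
    total
  }
}
```

The limit check itself is O(1) per part; only reading maxPartSize (and any retry counters) for the upload touches the DB.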


@aicam @chenlica

chenlica (Contributor) commented:

@carloea2 Please schedule a meeting to discuss the details, then report the results here.

aicam (Contributor) left a review comment:

Overall, LGTM

)

val raw =
s"$Version|${payload.uploadId}|${payload.did}|${payload.uid}|$filePathB64|$physicalB64"
aicam (Contributor):

I think there are structured ways to use encryption instead of manually concatenating with |

carloea2 (Contributor, Author):

The concatenation and use of | is just for convenience. Would you prefer a JSON approach? The encryption will still operate on the raw characters of the JSON.

carloea2 (Contributor, Author):

I have changed it to JSON.
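
For reference, a minimal sketch of what the JSON-based encoding could look like before encryption, using Jackson with the Scala module; the payload fields mirror the ones in this PR, but the concrete types and helper names here are illustrative:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Field names mirror the token payload in this PR; the types are illustrative.
final case class UploadTokenPayload(version: Int, uploadId: String, did: Int,
                                    uid: Int, filePath: String, physicalAddress: String)

object UploadTokenJson {
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  // Serialize the payload to JSON before encrypting; no delimiter escaping is needed.
  def toJson(payload: UploadTokenPayload): String = mapper.writeValueAsString(payload)

  // Parse the decrypted JSON back into the payload; malformed input fails here.
  def fromJson(json: String): UploadTokenPayload =
    mapper.readValue(json, classOf[UploadTokenPayload])
}
```

With JSON there is no delimiter escaping, and adding or versioning fields becomes a plain schema change on the payload case class.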

* - malformed structure
* - unsupported version
*/
def decode(token: String): UploadTokenPayload = {
aicam (Contributor):

Same here: the decode function has too much manual work; I believe a library should already provide high-level functions for this.

carloea2 (Contributor, Author):

Even with JSON there will still be some manual work; how would you modularize this?

carloea2 (Contributor, Author):

The new version uses JSON.


val cipherText = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8))

val combined = new Array[Byte](iv.length + cipherText.length)
aicam (Contributor):

Add comments to explain which part of the algorithm each line performs and why we need it.

carloea2 (Contributor, Author):

Got it, thank you.

carloea2 (Contributor, Author):

Done
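
For readers of the thread, a hedged, commented sketch of the encrypt step around the excerpt above (standard javax.crypto usage; the GCM mode, IV/tag sizes, and key handling here are illustrative choices, not necessarily the exact code in this PR):

```scala
import java.nio.charset.StandardCharsets
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

object TokenCrypto {
  private val GcmIvLengthBytes = 12  // commonly recommended IV size for GCM
  private val GcmTagLengthBits = 128 // authentication tag size

  def encrypt(plain: String, key: SecretKeySpec): String = {
    // 1. Fresh random IV per token: GCM must never reuse an IV with the same key.
    val iv = new Array[Byte](GcmIvLengthBytes)
    new SecureRandom().nextBytes(iv)

    // 2. Encrypt + authenticate the JSON payload with AES-GCM.
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(GcmTagLengthBits, iv))
    val cipherText = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8))

    // 3. Prepend the IV to the ciphertext so decryption can recover it;
    //    the IV is not secret, it only has to be unique per encryption.
    val combined = new Array[Byte](iv.length + cipherText.length)
    System.arraycopy(iv, 0, combined, 0, iv.length)
    System.arraycopy(cipherText, 0, combined, iv.length, cipherText.length)

    // 4. Base64-encode so the token is safe to put in URLs and JSON.
    Base64.getUrlEncoder.withoutPadding.encodeToString(combined)
  }
}
```

Decryption reverses the steps: Base64-decode, split off the first 12 bytes as the IV, and call doFinal on the remainder; GCM then also verifies integrity.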

final case class MultipartUploadInfo(key: String, uploadId: String)

/** Minimal info about a completed part in an upload. */
final case class PartInfo(partNumber: Int, eTag: String)
aicam (Contributor):

Are these case classes used as type definitions?

carloea2 (Contributor, Author):

Yes, I only keep the info needed and discard the rest.

carloea2 (Contributor, Author) commented:

Thank you for your comments, Ali. I will take a look.

carloea2 (Contributor, Author) commented:

Following our discussion, we will go back to a DB-backed approach; see #4136.

carloea2 closed this on Dec 18, 2025.

Labels

common, frontend (Changes related to the frontend GUI service)

Linked issue

task(dataset): Redirect multipart upload through File Service

3 participants