Transformers.js V4: Native WebGPU EP, repo restructuring, and more! #1382
* ONNX Runtime improvements (experimental native webgpu; fix iOS) (#1231)
* customize the wasm paths
* update implementation
* allow using 'webgpu' in nodejs binding
* update version of onnxruntime-node
* Upgrade onnxruntime-web to same version as onnxruntime-node
* Update list of supported devices
---------
Co-authored-by: Joshua Lochner <26504141+xenova@users.noreply.github.com>
* customize the wasm paths (#1250)
* customize the wasm paths
* update implementation
* [internal] Add is_decoder option to session retrieval for preferred output location
* Update tests
* Formatting
* Bump ort versions
* Bump onnxruntime-node version
* Bump versions
* Bump ORT versions
* Bump versions
* Only check webgpu fp16 for non-node environments
* Fix
* Assume node supports webgpu
* Update ORT node support comment
* Relax test strictness
* Update conversion script versions
* Downgrade onnxslim
* cleanup
* Update package-lock.json
* Update onnxruntime versions
* Update post-build script
* Use built-in session release function
* Call garbage collection after each tokenizer test
* Do not double-throw error
* Fix race-condition in build process with file removal
* Update versions
* Bump jinja version
* [version] Update to 3.6.3
* Bump jinja version to support new features
* [version] Update to 3.6.3
* Add support for LFM2 models (#1367)
* Use prefix in lfm2 output location (#1369)
* Update package-lock.json
* Run `npm audit fix`
* Add special tokens in text-generation pipeline if tokenizer requires (#1370)
* Add special tokens in text-generation pipeline if tokenizer requires
* Fix logits processors tests
* Update bundles.test.js
* Update comment
* Formatting
* Add support for ModernBERT Decoder (#1371)
* Use from/to buffer instead of string Actually fixes #1343
* Add support for Voxtral (#1373)
* Support longform voxtral processing (#1375)
* [version] Update to 3.7.0
* Add support for Arcee (#1377)
* Optimize tensor.slice() (#1381)
* Optimize tensor.slice()

The performance of executing `tensor.slice()` is very poor, especially for the 'logits' tensor with large dimensions.

```
const logits = outputs.logits.slice(null, -1, null);
```

This is because the current implementation of the `slice` method manually iterates through each element and calculates its indices, which is very time consuming when the tensor shape is large. For cases like `slice(null, -1, null)`, where the slicing operation is contiguous along certain dimensions, it can be optimized into a bulk copy using `TypedArray.subarray()` and `TypedArray.set()`.

* nit
* Add a few more tensor slice unit tests
---------
Co-authored-by: Joshua Lochner <26504141+xenova@users.noreply.github.com>
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
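The bulk-copy idea from the `tensor.slice()` commit message can be illustrated with a minimal sketch. This is not the library's implementation: `sliceMiddleAxis` is a hypothetical helper that assumes a row-major 3D tensor of shape `[B, S, V]` and takes an explicit half-open range `[start, end)` along the middle axis. Because the last axis is kept whole, the selected rows of each batch form one contiguous run, so each batch can be copied with a single `subarray()` + `set()` call instead of element by element.

```javascript
// Hypothetical helper (not part of Transformers.js): slice a row-major
// [B, S, V] tensor along axis 1, keeping positions start..end-1.
function sliceMiddleAxis(data, dims, start, end) {
  const [B, S, V] = dims;
  if (start < 0 || end > S || start > end) {
    throw new RangeError(`Invalid range [${start}, ${end}) for axis of size ${S}`);
  }
  const kept = end - start;

  // Allocate the output with the same typed-array class as the input.
  const out = new data.constructor(B * kept * V);

  for (let b = 0; b < B; ++b) {
    // The kept rows of batch b are contiguous in memory.
    const srcOffset = (b * S + start) * V;
    // subarray() creates a view (no copy); set() performs the bulk copy.
    out.set(data.subarray(srcOffset, srcOffset + kept * V), b * kept * V);
  }
  return { data: out, dims: [B, kept, V] };
}

// Usage: keep only the last position along axis 1 of a [2, 3, 4] tensor.
const example = Float32Array.from({ length: 2 * 3 * 4 }, (_, i) => i);
const last = sliceMiddleAxis(example, [2, 3, 4], 2, 3);
console.log(last.dims); // [ 2, 1, 4 ]
```

The loop runs once per batch rather than once per element, which is where the saving comes from when `V` (for example, a vocabulary-sized logits axis) is large.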
No need to save the entire audio in memory
* suppress console.error while creating InferenceSession
* changed console suppress if not one of the misleading errors
* set default logSeverityLevel and also match the ONNX_WEB.env.logLevel
* indentation
* small fix
* some clean-up
* Apply suggestions from code review
Co-authored-by: Joshua Lochner <admin@xenova.com>
* added LOG_LEVELS to the top of the file
---------
Co-authored-by: Joshua Lochner <admin@xenova.com>
#1471)
* added wasm cache
* some refactoring of the hub.js and caching of the wasm factory
* fixed comment
* added string as cache return
* fixes after review
* Only return if match is found
* Return response even if cache doesn't exist

Don't throw error if we can't open cache or load file from cache, but we are able to make the request.
---------
Co-authored-by: Joshua Lochner <26504141+xenova@users.noreply.github.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Hi @xenova, the two benchmark figures in this PR show a pretty impressive speed improvement. Is this v4 vs v3? Or do we need to do something to the ONNX file to achieve that speedup? I tried the https://huggingface.co/Xenova/bge-small-zh-v1.5 model with the v4 branch + WebGPU FP16, but did not observe notable performance improvements over v3 at any batch size, so I am curious whether I missed some steps. E.g., should I run convert.py on the base BAAI/bge-small-zh-v1.5 model to make a new "optimized" version of it?
This is the official, long-awaited PR that introduces Transformers.js V4. Although it's currently still in draft mode, I'll be posting updates here for early review!
@huggingface/tokenizers library.
Qwen2.5-Coder-0.5B-Instruct does not work, but onnx-community/Qwen2.5-0.5B-Instruct does #1415
See benchmarks
https://huggingface.co/onnx-community/all-MiniLM-L6-v2-ONNX:
https://huggingface.co/onnx-community/bge-base-en-v1.5-ONNX:
./src/models/), grouped by model type -- models.js is getting pretty large!