fastBERTtokens: Tokenizing for BERT in parallel

Version 1.0.0 (1.43 KB) by Ralf Elsas
Splits your text into batches and tokenizes them in parallel, which provides a significant speed-up.
18 downloads
Updated 24 Feb 2023


Function to use the MATLAB BERT tokenizer in parallel
This function divides your text into batches and tokenizes the batches in parallel. Because the MATLAB tokenizer is very slow on large data when run on a single processor, this provides a significant speed-up. On an i7-10875H laptop with 8 logical units, tokenizing 76k sentences takes about 100 seconds.
Note that you must pass in the MATLAB BERT model, because different BERT models use different encodings for the special BERT tokens such as [SEP].
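
The batch-and-parallelize idea described above can be sketched roughly as follows. This is a minimal illustration and not the author's fastBERTtokens code: the helper name tokenizeInBatches, its arguments, and the even split into batches are all assumptions made for the example. The tokenizer is passed in as a function handle so that the model-specific encoding stays with whichever BERT model you supply; the handle is assumed to return a cell array with one entry per sentence.

```matlab
function tokens = tokenizeInBatches(str, encodeFcn, numBatches)
% TOKENIZEINBATCHES  Hypothetical sketch: split text into batches and
% tokenize the batches in parallel. Not the fastBERTtokens source.
%   str        - string array of sentences
%   encodeFcn  - function handle that maps a string array to a cell array
%                of token codes (assumed to come from your BERT model)
%   numBatches - desired number of batches, e.g. the number of workers

str = str(:);                                   % work with a column vector
n = numel(str);
numBatches = max(1, min(numBatches, n));        % never more batches than sentences
edges = round(linspace(0, n, numBatches + 1));  % batch boundaries

% Slice the input up front so each worker receives only its own batch.
batches = cell(numBatches, 1);
for k = 1:numBatches
    batches{k} = str(edges(k)+1:edges(k+1));
end

% Tokenize each batch independently. parfor uses a parallel pool when
% Parallel Computing Toolbox is available and runs serially otherwise.
results = cell(numBatches, 1);
parfor k = 1:numBatches
    out = encodeFcn(batches{k});
    results{k} = out(:);                        % column cell array per batch
end

% Concatenate in batch order, which preserves the original sentence order.
tokens = vertcat(results{:});
end
```

Assuming the Transformer Models package is installed and exposes the tokenizer as mdl.Tokenizer with an encode method, a call might look like tokenizeInBatches(sentences, @(s) encode(mdl.Tokenizer, s), 8). Passing the handle built from your own model is what keeps the special-token encodings consistent with that model.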

Cite As

Ralf Elsas (2024). fastBERTtokens: Tokenizing for BERT in parallel (https://www.mathworks.com/matlabcentral/fileexchange/125295-fastberttokens-tokenizing-for-bert-in-parallel), MATLAB Central File Exchange. Retrieved .

MATLAB Release Compatibility
Created with R2022b
Compatible with R2021a and later releases
Platform Compatibility
Windows macOS Linux
Acknowledgements

Inspired by: Transformer Models

Version Published Release Notes
1.0.0