Hi,
I have a task that uses seqio.TfdsDataSource as its source and a preprocessor pipeline whose final steps look like this: [..., seqio.preprocessors.tokenize, seqio.CacheDatasetPlaceholder(), seqio.preprocessors.append_eos_after_trim].
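For reference, the registration looks roughly like the sketch below (the task name, TFDS dataset, vocabulary path, and the earlier preprocessors are placeholders, not my actual definitions):

```python
import seqio

# Placeholder vocabulary; in practice this points at my real SentencePiece model.
vocabulary = seqio.SentencePieceVocabulary("/path/to/sentencepiece.model")

seqio.TaskRegistry.add(
    "my_task",  # placeholder task name
    source=seqio.TfdsDataSource(tfds_name="my_dataset/config:1.0.0"),
    preprocessors=[
        # ... earlier, task-specific preprocessors ...
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocabulary, add_eos=True),
        "targets": seqio.Feature(vocabulary=vocabulary, add_eos=True),
    },
)
```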
I have cached this task, so I know the maximum token lengths for both inputs and targets.
My question is: when training a model with t5.models.mesh_transformer_main using this task and providing gin bindings for utils.run.sequence_length, should I use the values I see in the cached stats, or should I add +1 to account for the EOS token? My goal is to avoid truncating data by specifying sequence lengths smaller than what my data requires.
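For concreteness, here are the two options I am weighing, written out in Python (the numbers are made up; assume the cached stats report maxima of 512 for inputs and 128 for targets):

```python
# Hypothetical maxima read from the cached stats files (made-up numbers).
cached_max = {"inputs": 512, "targets": 128}

# Option A: use the cached maxima directly as sequence_length.
sequence_length_a = dict(cached_max)

# Option B: add 1 per feature to leave room for the EOS token appended by
# seqio.preprocessors.append_eos_after_trim.
sequence_length_b = {k: v + 1 for k, v in cached_max.items()}

# Either dict would then be supplied via a gin binding, e.g.
# --gin_param="utils.run.sequence_length = {'inputs': 512, 'targets': 128}"
```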
(P.S.: I know this is also related to the t5 repository, but I opened the issue here because I think my question concerns the seqio.preprocessors.append_eos_after_trim function. If you think it would be more appropriate to open this issue in another repository, please let me know, and I can move it.)
Thanks in advance,
Marcos