
T5 fine-tuning special tokens #158

@tombosc

Description

Hello,

First of all, thank you all for your work.

I am struggling to understand how to fine-tune T5.

In #113, it is mentioned that there are 2 eos tokens (one for the encoder, one for the decoder). However, I can only see one eos token:

(Pdb) tokenizer
T5Tokenizer(name_or_path='Rostlab/prot_t5_xl_uniref50', vocab_size=28, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': [...]

#113 also references another answer, from #137, which is strange:

  • no pad token (problem, because then the first token is not modelled)
  • no eos token at all (problem in the decoder, because end of sequence token is not modelled)
  • the masked token embeddings have the same ID

There are many other T5 fine-tuning questions in the GitHub issues, I think because the instructions are not clear.

Combining these two contradictory sources, I think the correct way to do it would be (using the example "E V Q L V E S G A E"):

  • Input: E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>.
  • Label: <pad> E V Q L V E S G A E </s>.
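To make the proposed scheme concrete, here is a minimal sketch of how such input/label pairs could be built. This only illustrates my reading of the format above; the helper name, the choice of masked positions, and the use of full-sequence reconstruction as the label (rather than sentinel-delimited spans, as in the original T5 objective) are my assumptions, not something confirmed by the maintainers:

```python
# Hypothetical sketch of the span-corruption format proposed above.
# make_example and masked_positions are illustrative names; whether the
# label should be the full sequence (as here) or only the masked spans
# is exactly the open question of this issue.
def make_example(tokens, masked_positions):
    """Build (input, label) strings in the proposed format.

    tokens: list of residue tokens, e.g. ["E", "V", "Q", ...]
    masked_positions: set of indices to replace with <extra_id_N> sentinels.
    """
    inp, sentinel = [], 0
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            inp.append(f"<extra_id_{sentinel}>")
            sentinel += 1
        else:
            inp.append(tok)
    input_str = " ".join(inp) + " </s>"
    # <pad> doubles as the decoder start token in T5-style models.
    label_str = "<pad> " + " ".join(tokens) + " </s>"
    return input_str, label_str

seq = "E V Q L V E S G A E".split()
inp, lab = make_example(seq, masked_positions={2, 4, 8})
# inp == "E V <extra_id_0> L <extra_id_1> E S G <extra_id_2> E </s>"
# lab == "<pad> E V Q L V E S G A E </s>"
```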

Is that how the model was trained? If yes, it would be very helpful to put this on the Hugging Face Hub page.

edit: Another question: does the tokenizer include a post-processor? It seems not:

(Pdb) tokenizer.post_processor
*** AttributeError: 'T5Tokenizer' object has no attribute 'post_processor'

Does this mean all those extra tokens need to be added manually, before calling tokenizer()?
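For what it's worth, my understanding (from reading the transformers source, so treat it as an assumption) is that `post_processor` is a fast-tokenizer concept, while the slow T5Tokenizer appends the eos id itself in `build_inputs_with_special_tokens` whenever `add_special_tokens=True` (the default). A self-contained mirror of that behaviour, with made-up ids since loading the real tokenizer needs a download:

```python
# Sketch of the slow T5Tokenizer's eos handling, as I understand it from
# the transformers source. eos_id=1 is illustrative, not the real
# Rostlab/prot_t5_xl_uniref50 vocabulary id.
def build_inputs_with_special_tokens(token_ids, eos_id=1):
    # Append eos only if the sequence does not already end with it,
    # mirroring the slow tokenizer's _add_eos_if_not_present helper.
    if token_ids and token_ids[-1] == eos_id:
        return list(token_ids)
    return list(token_ids) + [eos_id]
```

If that reading is right, </s> would not need to be added by hand on the encoder side; only the <extra_id_N> sentinels would have to be placed in the text before calling tokenizer().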
