This repository was archived by the owner on Aug 6, 2025. It is now read-only.

Conversation

@cifkao (Contributor) commented Jun 20, 2018

This PR adds two scripts:

  • dump_sentences.py dumps all SentEval sentences to stdout.
  • eval_saved.py loads the saved sentences and the corresponding embeddings and runs SentEval on them.

This removes the need to run the encoding inside the batcher API, making it possible to separate encoding from evaluation. The reasons for doing this are:

  • It's tricky to run the embedding model in the same process as SentEval, especially if the model uses a different framework (e.g. TensorFlow) or if the machine has only one GPU. It's easier to do the encoding offline (separately from the evaluation).
  • You can run encoding on a large GPU and evaluation on a small GPU (possibly on a different machine) so that you don't waste resources.
  • You can save time by only encoding each sentence once. (SentEval has a lot of duplicate sentences.)

Example usage (perhaps this should be described in the README):

python examples/dump_sentences.py | sort -u >senteval.txt
...  # run your model to get the embeddings for senteval.txt and save them to emb.npy
python examples/eval_saved.py senteval.txt emb.npy
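The lookup step that eval_saved.py performs can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper names are hypothetical, and the batcher mimics SentEval's batcher signature (a list of tokenized sentences in, one embedding per sentence out) without importing senteval, using plain lists in place of the saved emb.npy array.

```python
# Sketch: map each deduplicated sentence to its precomputed embedding,
# then serve saved embeddings from a SentEval-style batcher instead of
# re-encoding. Helper names are hypothetical, not from the PR.

def load_lookup(sentences, embeddings):
    """Build a sentence -> embedding mapping.

    `sentences` are the deduplicated lines of senteval.txt, in the same
    order as the rows of the saved embedding array.
    """
    return dict(zip(sentences, embeddings))

def batcher(lookup, batch):
    """SentEval-style batcher: `batch` is a list of tokenized sentences.

    Returns the saved embedding for each sentence; no model is run here.
    """
    return [lookup[" ".join(tokens)] for tokens in batch]

# Toy data standing in for senteval.txt and emb.npy.
sentences = ["a cat sat", "the dog ran"]
embeddings = [[0.1, 0.2], [0.3, 0.4]]

lookup = load_lookup(sentences, embeddings)
print(batcher(lookup, [["the", "dog", "ran"]]))  # [[0.3, 0.4]]
```

Because the lookup key is the space-joined token sequence, each unique sentence only needs to be encoded once, no matter how many SentEval tasks reuse it.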

@aconneau
Contributor

Hi,
thanks for the PR, that's indeed an interesting feature to have as an example.
I will look at the code soon.
Thanks,
Alexis

@aconneau aconneau closed this Jun 27, 2018
@aconneau aconneau reopened this Jun 27, 2018
@aconneau
Contributor

Oops, I closed the task by accident. Just re-opened it.
