This repository was archived by the owner on Aug 6, 2025. It is now read-only.

Conversation

@cifkao (Contributor) commented Jun 20, 2018

This PR adds two scripts:

  • dump_sentences.py dumps all SentEval sentences to stdout.
  • eval_saved.py loads the saved sentences and the corresponding embeddings and runs SentEval on them.

This removes the need to run the encoding inside the batcher API, making it possible to separate encoding from evaluation. The reasons for doing this are:

  • It's tricky to run the embedding model in the same process as SentEval, especially if the model uses a different framework (e.g. TensorFlow) or if the machine has only one GPU. It's easier to do the encoding offline (separately from the evaluation).
  • You can run encoding on a large GPU and evaluation on a small GPU (possibly on a different machine) so that you don't waste resources.
  • You can save time by only encoding each sentence once. (SentEval has a lot of duplicate sentences.)

Example usage (perhaps this should be described in the README):

python examples/dump_sentences.py | sort -u >senteval.txt
...  # run your model to get the embeddings for senteval.txt and save them to emb.npy
python examples/eval_saved.py senteval.txt emb.npy
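The lookup step that eval_saved.py performs can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper names are hypothetical, and the batcher mimics SentEval's batcher signature (a list of tokenized sentences in, one embedding per sentence out) without importing senteval, using plain lists in place of the saved emb.npy array.

```python
# Sketch: map each deduplicated sentence to its precomputed embedding,
# then serve saved embeddings from a SentEval-style batcher instead of
# re-encoding. Helper names are hypothetical, not from the PR.

def load_lookup(sentences, embeddings):
    """Build a sentence -> embedding mapping.

    `sentences` are the deduplicated lines of senteval.txt, in the same
    order as the rows of the saved embedding array.
    """
    return dict(zip(sentences, embeddings))

def batcher(lookup, batch):
    """SentEval-style batcher: `batch` is a list of tokenized sentences.

    Returns the saved embedding for each sentence; no model is run here.
    """
    return [lookup[" ".join(tokens)] for tokens in batch]

# Toy data standing in for senteval.txt and emb.npy.
sentences = ["a cat sat", "the dog ran"]
embeddings = [[0.1, 0.2], [0.3, 0.4]]

lookup = load_lookup(sentences, embeddings)
print(batcher(lookup, [["the", "dog", "ran"]]))  # [[0.3, 0.4]]
```

Because the lookup key is the space-joined token sequence, each unique sentence only needs to be encoded once, no matter how many SentEval tasks reuse it.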

@aconneau
Contributor

Hi,
thanks for the PR, that's indeed an interesting feature to have as an example.
I will look at the code soon.
Thanks,
Alexis

@aconneau aconneau closed this Jun 27, 2018
@aconneau aconneau reopened this Jun 27, 2018
@aconneau
Contributor

Oops, I closed the task by accident. Just re-opened it.
