Building a baseline test system

Write a script that:

Loads the new NER model.

Runs it on previously unseen PDFs (or plain-text documents).

Compares the extracted entities with the reference annotations (from the validation/test set).

Outputs metrics (such as precision, recall, and F1 per entity type) and, optionally, visualizations such as a confusion matrix.
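
The steps above can be sketched as a small evaluation script. This is a minimal illustration under several assumptions, not a definitive implementation: entities are represented as (start, end, label) character spans, a prediction counts as correct only on an exact span-and-label match, and `toy_predict` is a hypothetical stand-in for the real NER model (a real model would be wrapped in a function with the same signature, and PDFs would first need to be converted to text). The names `evaluate`, `toy_predict`, and `examples` are invented for this example.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, label)


def toy_predict(text: str) -> List[Entity]:
    """Hypothetical stand-in for the NER model: a keyword lookup.

    It deliberately mislabels "Paris" so the metrics are not all perfect.
    """
    lexicon = {"Alice": "PERSON", "Acme": "ORG", "Paris": "ORG"}
    entities = []
    for word, label in lexicon.items():
        start = text.find(word)
        if start != -1:
            entities.append((start, start + len(word), label))
    return entities


def evaluate(
    predict: Callable[[str], List[Entity]],
    examples: Iterable[Tuple[str, List[Entity]]],
) -> Dict[str, Dict[str, float]]:
    """Compare predictions with reference annotations, per label.

    Assumes exact-match scoring: span boundaries and label must all agree.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for text, gold in examples:
        predicted, reference = set(predict(text)), set(gold)
        for _, _, label in predicted & reference:
            tp[label] += 1  # found and correct
        for _, _, label in predicted - reference:
            fp[label] += 1  # predicted but not in the reference
        for _, _, label in reference - predicted:
            fn[label] += 1  # in the reference but missed

    report = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": n_gold,
        }
    return report


# Tiny made-up reference set, for illustration only.
examples = [
    (
        "Alice works at Acme in Paris.",
        [(0, 5, "PERSON"), (15, 19, "ORG"), (23, 28, "LOC")],
    ),
]

report = evaluate(toy_predict, examples)
for label, scores in report.items():
    print(
        f"{label:8s} P={scores['precision']:.2f} "
        f"R={scores['recall']:.2f} F1={scores['f1']:.2f} "
        f"support={scores['support']}"
    )
```

Exact-match scoring is strict: a prediction with the right label but a slightly different boundary counts as both a false positive and a false negative. If partial overlaps should earn credit, the set comparison in `evaluate` is the place to relax. A label-by-label confusion matrix could be built from the same loop by pairing predicted and reference labels on spans that overlap.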