Managing Train/Develop Splits with the spaCy command line trainer

I am training an NER model using the python -m spacy train command line tool. I use gold.docs_to_json to convert my annotated documents to the JSON-serializable format.

The command line training tool uses both a training set and a development set. I'm not sure how much assistance the command line tools give me for managing train/dev splits.

  1. Is there a command line tool to create train/dev splits from a single set of data?
  2. Will the spaCy training command do cross-validation for me instead of making me create a dev set?
  3. When it comes time to train the production model on all the data, what do I use as the dev set?

I think the answer to both questions (1) and (2) is "no", but I want to double-check.
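If the answer to (1) is indeed "no", the split itself is small enough to do by hand. Below is a minimal sketch, assuming the annotated data is stored as a single JSON array with one entry per document (for example, one gold.docs_to_json result per document). The names split_train_dev and write_split, the 20% dev fraction, and the fixed seed are illustrative choices, not part of spaCy.

```python
import json
import random
from pathlib import Path


def split_train_dev(docs, dev_fraction=0.2, seed=0):
    """Shuffle a list of documents and return (train, dev) lists."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")
    if len(docs) < 2:
        raise ValueError("need at least two documents to make a split")
    shuffled = list(docs)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    # Keep at least one document on each side of the split.
    n_dev = min(len(shuffled) - 1, max(1, round(len(shuffled) * dev_fraction)))
    return shuffled[n_dev:], shuffled[:n_dev]


def write_split(input_path, train_path, dev_path, dev_fraction=0.2, seed=0):
    """Read one JSON array of documents and write train and dev JSON files."""
    # Assumption: the input file holds a JSON list with one item per document.
    docs = json.loads(Path(input_path).read_text(encoding="utf8"))
    train, dev = split_train_dev(docs, dev_fraction, seed)
    Path(train_path).write_text(json.dumps(train), encoding="utf8")
    Path(dev_path).write_text(json.dumps(dev), encoding="utf8")
    return len(train), len(dev)
```

The two output files could then be passed as the training and development paths to python -m spacy train. Splitting at the document level keeps every sentence of a document on the same side of the split.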

From experimenting, it appears that you always have to pass in a non-empty dev set, even when you are training a production model for a fixed number of iterations. For now I just pass in a copy of my training data, but that seems odd, so I'm wondering if there is some other procedure I'm missing.

The spaCy documentation on training mostly discusses writing your own iteration loops. I've written enough of those to be confident I could make any of the above work in my own code, but for these basic training operations I'd rather not write code and just use the command line tools for everything.

Vitalis answered 26/1, 2020 at 18:40
