I was curious whether it is possible to use transfer learning for text generation, i.e. to re-train/further pre-train an existing model on a specific kind of text.
For example, starting from a pre-trained BERT model and a small corpus of medical (or any "type" of) text, build a language model that is able to generate medical text. The assumption is that you do not have a huge amount of medical text, which is why you have to use transfer learning.
Putting it as a pipeline, I would describe this as (rough code sketch after the list):
- Using a pre-trained BERT tokenizer.
- Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
- Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
- Generating text that resembles the text within the small custom corpus.
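In Hugging Face terms I imagine it would look roughly like the sketch below (untested; `medical_corpus.txt` and the `new_domain_tokens` list are just placeholders for my own data):

```python
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Step 1: start from the pre-trained BERT tokenizer.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Step 2: add domain-specific tokens found in my corpus
# (placeholder list; I would build it from the medical text).
new_domain_tokens = ["angioplasty", "tachycardia"]
tokenizer.add_tokens(new_domain_tokens)

# Load the pre-trained model and resize its embedding matrix
# so the newly added tokens get trainable vectors.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))

# Step 3: re-train on the small custom corpus with the usual
# masked-language-modelling objective.
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-medical",
        num_train_epochs=3,
        per_device_train_batch_size=8,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()

# Save the adapted model and the combined tokenizer together.
trainer.save_model("bert-medical")
tokenizer.save_pretrained("bert-medical")
```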
Does this sound familiar? Is it possible with Hugging Face?
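For the last step, my understanding is that BERT is a masked language model, so "generation" would really be filling in masked slots rather than free left-to-right generation. Something like this sketch (again untested, re-using the placeholder `bert-medical` directory from above):

```python
from transformers import pipeline

# Load the fine-tuned model and tokenizer saved in the training sketch.
fill_mask = pipeline("fill-mask", model="bert-medical", tokenizer="bert-medical")

# With a BERT-style model, "generating" text means predicting masked tokens,
# which should now be biased towards the medical domain.
for prediction in fill_mask("The patient was treated with [MASK] for hypertension."):
    print(prediction["sequence"], prediction["score"])
```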
The `_clas` part is for the classification bit; your use case is exactly what fastai was designed for. – Homologize