ML.NET: how to retrain a text classification model with new data

I am pretty new to machine learning in general, and to Microsoft ML.NET in particular. What I am trying to do is to create a re-trainable model for text classification. Suppose I have an Article (for training) and an ArticlePrediction (for the classification):

public class Article
{
    public string Text { get; set; }
    public string Topic { get; set; }
}

public class ArticlePrediction
{
    public float[] Score { get; set; }
    public uint PredictedLabel { get; set; }
}

Following the documentation for re-trainable models and this GitHub issue, I concluded that I need two pipelines: a preparation pipeline and a training pipeline. I also need a separate intermediate data class for the "prepared" data in order to do the re-training:

public class ArticlePrepared : Article
{
    [VectorType(???)]
    public float[] Features { get; set; }
    public uint Label { get; set; }
}

The actual training of the model is trivial:

public static void Train(MLContext ctx, IDataView data)
{
    var prepPipeline = ctx.Transforms.Conversion.MapValueToKey("Label", "Topic")
        .Append(ctx.Transforms.Text.FeaturizeText("Features", "Text"));
    var trainPipeline = ctx.MulticlassClassification.Trainers
        .LbfgsMaximumEntropy("Label", "Features", historySize: 50, l1Regularization: 0.1f);

    var prepModel = prepPipeline.Fit(data);
    var prepData = prepModel.Transform(data);
    var trainModel = trainPipeline.Fit(prepData);

    ctx.Model.Save(prepModel, data.Schema, PreparationPipelinePath);
    ctx.Model.Save(trainModel, prepData.Schema, TrainingPipelinePath);
}

The retraining part is the one I am struggling with, and I now doubt whether this is even possible:

public static void Retrain(MLContext ctx, Article article)
{
    var prepModel = ctx.Model.Load(PreparationPipelinePath, out var _);
    var retrainModel = ctx.Model.Load(TrainingPipelinePath, out var _) as ISingleFeaturePredictionTransformer<object>;
    var modelParams = (MaximumEntropyModelParameters)retrainModel.Model;

    var prepData = prepModel.Transform(ctx.Data.LoadFromEnumerable(new[] { article }));

    var retrainedModel = ctx.MulticlassClassification.Trainers
        .LbfgsMaximumEntropy("Label", "Features", historySize: 50, l1Regularization: 0.1f)
        .Fit(prepData, modelParams); // boom!

    ctx.Model.Save(retrainedModel, prepData.Schema, TrainingPipelinePath);
}

The exception thrown is "No valid training instances found, all instances have missing features". I have a couple of questions:

1. It seems to me that each word in the text is converted to a model feature. This would mean that when I try to retrain the model with a new Article, the trained model does not have all the features of this new Article (because the new text that I would like to retrain the model with is different). Is this the reason for the exception I receive?
2. When featurizing text, I cannot tell in advance how many features the prepared ArticlePrepared class should have (i.e. the size passed to the VectorType attribute, or the length of the Features array property). Is it possible to work with a dynamic number of features? If you inspect the GitHub repo (link below), you'll see that the VectorType has a size of 131, but this is a hardcoded value taken from the already saved schema. Needless to say, hardcoding like this is not going to work in a real-world scenario.
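On the second question, one direction I am aware of but have not validated for this scenario is to read the vector length from the schema of the transformed data at runtime and pass it through a SchemaDefinition, instead of hardcoding it in the attribute. A minimal sketch, assuming the ML.NET 1.x API; FeatureSchema, GetVectorSize and WithVectorSize are hypothetical helper names, not part of ML.NET:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class FeatureSchema
{
    // Returns the length of a vector column as recorded in the schema.
    // VectorDataViewType.Size is 0 when the length is not known in advance.
    public static int GetVectorSize(IDataView data, string columnName)
    {
        DataViewSchema.Column column = data.Schema[columnName];
        if (column.Type is VectorDataViewType vectorType)
            return vectorType.Size;
        throw new InvalidOperationException(
            $"Column '{columnName}' is not a vector column.");
    }

    // Builds a schema definition for T in which the given float vector column
    // has the supplied length, so the class itself does not need a hardcoded size.
    public static SchemaDefinition WithVectorSize<T>(string columnName, int size)
    {
        SchemaDefinition definition = SchemaDefinition.Create(typeof(T));
        definition[columnName].ColumnType =
            new VectorDataViewType(NumberDataViewType.Single, size);
        return definition;
    }
}
```

With this, the size could be obtained as FeatureSchema.GetVectorSize(prepData, "Features") and the resulting definition passed as the schemaDefinition argument of ctx.Data.LoadFromEnumerable or ctx.Data.CreateEnumerable, with Features declared using a plain [VectorType] attribute. Whether that is enough for the retraining case is exactly what I am unsure about.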

I have created a GitHub repository that can be used to reproduce the problem.
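For anyone reproducing this, a small diagnostic that may help with the first question. As far as I understand, MapValueToKey maps a Topic it did not see during fitting to the missing key, which reads back as 0, and a trainer may skip such rows. Whether that explains the exception here is an assumption I have not confirmed. RetrainDiagnostics and CountMissingLabels are hypothetical names used only for this sketch:

```csharp
using System.Linq;
using Microsoft.ML;

public static class RetrainDiagnostics
{
    // Minimal row type for reading back the key-typed Label column.
    private sealed class LabelRow
    {
        public uint Label { get; set; }
    }

    // Counts rows whose Label is 0, i.e. rows whose Topic has no entry
    // in the key mapping learned when the preparation pipeline was fitted.
    public static int CountMissingLabels(MLContext ctx, IDataView prepData)
    {
        return ctx.Data
            .CreateEnumerable<LabelRow>(prepData, reuseRowObject: false)
            .Count(row => row.Label == 0);
    }
}
```

Calling RetrainDiagnostics.CountMissingLabels(ctx, prepData) just before the failing Fit call in Retrain would show whether the single new article ends up without a usable label.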

Is there a way to do what I am trying to do or am I going in a totally wrong direction? Any help or insights are appreciated.
