SBT: How to package an instance of a class as a JAR?
I have code which essentially looks like this:

class FoodTrainer(images: S3Path) { // data is >100GB file living in S3
  def train(): FoodClassifier       // Very expensive - takes ~5 hours!
}

class FoodClassifier {          // Light-weight API class
  def isHotDog(input: Image): Boolean
}

At JAR-assembly time (sbt assembly), I want to invoke val classifier = new FoodTrainer(s3Dir).train() and publish a JAR that makes the classifier instance instantly available to downstream library users.

What is the easiest way to do this? What are some established paradigms for it? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

How do I do this using sbt assembly without having to check a large model class or data file into my version control?

Wie answered 8/11, 2017 at 16:21 Comment(3)
Instances only exist at runtime. You need a way to store the state of the object and reload it at runtime. You might be able to get away with standard serialization, but you may want a more robust scheme such as some data file.Circus
Here's an idea: throw your model in a resource file that gets added into the jar assembly. I think all jars get distributed with your model if it's in that folder. Lmk how it goes, cheers!End
@puhlen: I know it's possible since many ML projects use this trick to distribute trained models e.g. nlp.stanford.edu/software/stanford-corenlp-models-current.jarWie

Okay I managed to do this:

  1. Split the project into 2 separate SBT sub-modules: food-trainer and food-model. The former is only invoked at build time to create the model and serialize it into the generated resources of the latter. The latter serves as a simple factory object that instantiates the model from its serialized form. Every downstream project depends only on this food-model submodule.

  2. The food-trainer module has the bulk of the code and a main method that can serialize the FoodModel:

    import java.io.{File, FileOutputStream, ObjectOutputStream}

    object FoodTrainer {
      def main(args: Array[String]): Unit = {
        val input = args(0)
        val outputDir = args(1)
        val model: FoodModel = new FoodTrainer(input).train()
        // ObjectOutputStream wraps an OutputStream, not a File
        val out = new ObjectOutputStream(new FileOutputStream(new File(outputDir, "model.bin")))
        try out.writeObject(model) finally out.close()
      }
    }
    
  3. Add a resource-generator task to your build.sbt that runs the food-trainer module at build time to produce the model:

    lazy val foodTrainer = (project in file("food-trainer"))

    lazy val foodModel = (project in file("food-model"))
      .dependsOn(foodTrainer)
      .settings(
        resourceGenerators in Compile += Def.task {
          val log = streams.value.log
          val dest = (resourceManaged in Compile).value
          IO.createDirectory(dest)
          runModuleMain(
            cmd = s"com.foo.bar.FoodTrainer $pathToImages ${dest.getAbsolutePath}",
            cp = (fullClasspath in Runtime in foodTrainer).value.files,
            log = log
          )
          Seq(dest / "model.bin")
        }.taskValue
      )

    def runModuleMain(cmd: String, cp: Seq[File], log: Logger): Unit = {
      log.info(s"Running $cmd")
      val opt = ForkOptions(bootJars = cp, outputStrategy = Some(LoggedOutput(log)))
      val res = Fork.scala(config = opt, arguments = cmd.split(' '))
      require(res == 0, s"$cmd exited with code $res")
    }
    
  4. Now in your food-model module, you have something like this:

    import java.io.ObjectInputStream

    object FoodModel {
      lazy val model: FoodModel = {
        val in = new ObjectInputStream(getClass.getResourceAsStream("/model.bin"))
        try in.readObject().asInstanceOf[FoodModel] finally in.close()
      }
    }
    

Every downstream project now depends only on food-model and simply uses FoodModel.model (see the usage sketch after this list). We get the following benefits:

  1. The model is loaded quickly at runtime from the JAR's packaged resources
  2. No need to train the model at runtime (very expensive)
  3. No need to check the model into your version control (again, the binary model is very big) - it is only packaged into your JAR
  4. No need to split FoodTrainer and FoodModel into their own JARs (which would give us the headache of deploying them internally) - instead we simply keep them in the same project but in different sub-modules, which get packaged into a single JAR.
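For illustration, downstream usage looks roughly like this (a sketch only; the com.foo.bar package name is assumed from the build snippet above, and Image / isHotDog come from the question's API):

    import com.foo.bar.FoodModel // package name assumed from the build.sbt sketch above

    object Downstream {
      // The first access to FoodModel.model deserializes the model from the
      // JAR's packaged resources; no training happens at runtime.
      def classify(img: Image): Boolean = FoodModel.model.isHotDog(img)
    }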
Wie answered 23/11, 2017 at 16:45 Comment(0)

You should serialize the data which results from training into its own file. You can then package this data file in your JAR. Your production code opens the file and reads it rather than running the training algorithm.
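A minimal sketch of this idea, assuming FoodClassifier (from the question) is java.io.Serializable; the file and resource names are illustrative:

    import java.io.{File, FileOutputStream, ObjectInputStream, ObjectOutputStream}

    object ModelIO {
      // Build-time step: train once and write the classifier to a file that
      // gets bundled into the JAR (e.g. under a managed-resources directory).
      def save(classifier: FoodClassifier, file: File): Unit = {
        val out = new ObjectOutputStream(new FileOutputStream(file))
        try out.writeObject(classifier) finally out.close()
      }

      // Run-time step: read the packaged file from the classpath instead of
      // re-running the expensive training algorithm.
      def load(resource: String = "/food-classifier.bin"): FoodClassifier = {
        val in = new ObjectInputStream(getClass.getResourceAsStream(resource))
        try in.readObject().asInstanceOf[FoodClassifier] finally in.close()
      }
    }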

Stud answered 8/11, 2017 at 18:17 Comment(7)
You mean serialize (kryo/ObjectOutputStream) the FoodClassifier model class, put it in resources and read it into a class at runtime? That can work but I would need to check in some binary blob into my codebase, and I don't always control the codebases of my downstream dependencies. Ideally I want to simply publish a JAR into my local artifact repository and let my downstream users simply add a dependency on the JAR.Wie
@Wie I mean to serialize the results of FoodTrainer.train() in some format that can be easily deserialized into whatever data structure is used in production.Stud
the results of FoodTrainer.train() is FoodClassifier. Again, my question is should I serialize the FoodClassifier and have it in my repo or is it better to override sbt-assembly somehow to do this for me so I don't have to check-in a massive binary blob and make a JARWie
@Wie More precisely, the results of FoodTrainer.train() is some data that you represent as FoodClasifier at run-time. You should serialize that data in some sensible format that is easily stored and read. As for how to manage the file after it is serialized, I do not have any immediate suggestions. I'll get back with you.Stud
I completely understand what you are saying but my question is I don't want to check-in the representation of FoodClassifier since the representation is quite complex so I would need to do Object serialization and I don't really want a 500MB blob in my git repo. What I want is to somehow override sbt assembly so it does this purely at the creation of JAR phase where it creates the representation, puts it in the JAR and cleans up so my git repo is nice and tidy.Wie
@Wie Understandably, you do not want to increase your git repo size by 500MB. I am unfamiliar with the Scala details here, so I cannot give any advice on those lines. My original answer is based on general programming methodologies.Stud
nothing really particular to Scala here since sbt is just a build tool. How would you do this in Java/Maven/Ant/Gradle etc??Wie

The steps are as follows.

  1. Generate the model during the resource-generation phase of the build.
  2. Serialize the contents of the model to a file in the managed resources folder:
    resourceGenerators in Compile += Def.task {
      val classifier = new FoodTrainer(s3Dir).train()
      val contents = FoodClassifier.serialize(classifier)
      val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
      IO.write(file, contents)
      Seq(file)
    }.taskValue
    
  3. The resource will be included in the jar file automatically and it won't appear in the source tree.
  4. To load the model, just add code that reads the resource and parses the model:
    object FoodClassifierModel {
      lazy val classifier: FoodClassifier = readResource("/mypackage/food-classifier.model")

      def readResource(resourceName: String): FoodClassifier = {
        val stream = getClass.getResourceAsStream(resourceName)
        val lines = scala.io.Source.fromInputStream(stream).getLines()
        val contents = lines.mkString("\n")
        FoodClassifier.parse(contents)
      }
    }

    object FoodClassifier {
      // Implementations omitted - use whatever text format fits your model
      def parse(content: String): FoodClassifier = ???
      def serialize(classifier: FoodClassifier): String = ???
    }
    

Of course, as your data is rather big, you'll need to use streaming serializers and parsers so as not to overload the Java heap. The above just shows how to package the resource at build time.
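As an illustration of the streaming idea, a model that is essentially a big matrix of doubles (as the question's comments suggest) can be written and read row by row rather than serialized as one object graph. A sketch only, assuming an Array[Array[Double]] representation:

    import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream, InputStream, OutputStream}

    object MatrixIO {
      // Write the matrix dimensions, then the values row by row, so only a
      // small buffer is ever held by the serializer itself.
      def write(matrix: Array[Array[Double]], os: OutputStream): Unit = {
        val out = new DataOutputStream(new BufferedOutputStream(os))
        try {
          out.writeInt(matrix.length)
          out.writeInt(if (matrix.isEmpty) 0 else matrix(0).length)
          for (row <- matrix; value <- row) out.writeDouble(value)
        } finally out.close()
      }

      // Read it back the same way, allocating one row at a time.
      def read(is: InputStream): Array[Array[Double]] = {
        val in = new DataInputStream(new BufferedInputStream(is))
        try {
          val rows = in.readInt()
          val cols = in.readInt()
          Array.fill(rows)(Array.fill(cols)(in.readDouble()))
        } finally in.close()
      }
    }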

See http://www.scala-sbt.org/1.x/docs/Howto-Generating-Files.html

Jaynejaynell answered 16/11, 2017 at 7:15 Comment(9)
The only problem now is that FoodTrainer is our main project and depends on dozens of other libraries and thousands of lines of code. Moving it into the SBT file means pulling large parts of the core module into the SBT. Is there a way I can make the compile task for this module depend on another module so I can separate these two out?Wie
You want to execute some logic during build time, so you need to declare that the build depends on your libraries (see scala-sbt.org/1.x/docs/…). Also note that sbt runs on Scala 2.10 (afaik), which might require your code to be compatible with 2.10. If your code is 2.11/2.12, then you may have to start a separate JVM (or OSGi container) with your trainer and serializer.Jaynejaynell
That doc tells me how to add external dependencies to sbt. But, how do I add an internal module (in this case FoodTrainer) as a util I can invoke from project/GenerateFoodModel.scala without pulling everything into project/??Wie
Well, I think, you cannot depend on the main project. Instead, you might divide your project into a few modules with dependencies between them. Then you can make sbt depend on the module with the model, and the rest of the project will depend on the resource jar and the model-jar.Jaynejaynell
My project already has multi-modules. I am asking how do I make sbt itself depend on one of the modules?Wie
Well, I don't think it's a straightforward thing to do. You might have to move some of your modules inside ./project or have two completely separate builds. In the latter case you'll have one project FoodModelGenerator with trainer and serializer that pushes artifacts to some repository. And in this project FoodClassifierModelProject you add that one at the build stage as a dependency. Or you might invoke generator/serializer via ProcessBuilder in a separate JVM (it'll also help with Scala minor-version incompatibility). And just include the generated serialized model as a resource.Jaynejaynell
It is actually fairly straight forward. See: https://mcmap.net/q/1622755/-how-to-make-a-sbt-task-depend-on-a-module-defined-in-the-same-sbt-projectWie
Thanks. That solution looks interesting and I guess it solves your problem. Though, it looks a bit like a hack. I usually prefer to keep a module as a whole thing and do not create "fine grained" dependencies on parts of the build process. This solves the immediate problem but might be not easy to comprehend/maintain in future. Also the code that is invoked will be running in a specific different environment (without some resources, for instance) that needs to be kept in mind.Jaynejaynell
I think it makes sense here - the food-trainer and food-model can be thought of as separate modules no?Wie

Here's an idea: throw your model in a resource folder that gets added into the jar assembly. I think all jars get distributed with your model if it's in that folder. Lmk how it goes, cheers!

Check this out for reading from resource:

https://www.mkyong.com/java/java-read-a-file-from-resources-folder/

It's in Java but you can still use the API from Scala.
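For reference, a rough Scala equivalent of that resource read (the resource path is illustrative):

    object ResourceExample {
      // Read a model file that was packaged under src/main/resources.
      def readModelText(): String = {
        val stream = getClass.getResourceAsStream("/food-classifier.model")
        try scala.io.Source.fromInputStream(stream).mkString
        finally stream.close()
      }
    }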

End answered 8/11, 2017 at 16:47 Comment(9)
You want me to add the data: File or the Model instance to resources? The former won't work since data is huge (>100GB) and latter won't work since I need it to be an actual Java instance.Wie
If its that big, you need to implement a streaming solution. Save it in S3 or some drive somewhere, and begin stream. That's another question all together.End
How are you loading your model into memory if its that big to begin with? Do you have more than 100GB in ram?End
Loading into s3 or memory is out of the question. I have >100 GB of training data which I am using to train my Model class. The Model class accepts streaming chunked data (4GB at a time) and takes many hours to build but is itself very small (it basically is a matrix of 10000x10000 doubles). So providing the training data as a local or remote resource is out of the question since each downstream dependency would have to spend time crunching the data. What I want to do is provide the trained Model class (which is only few MB) as an instance dependency to all downstream projects.Wie
I never told you to save the data... I was referring to your model. I'm not sure why you down voted my answer but it's the same thing as the question you voted up for. You save your file in resource, serialize and then use reflection for JIT compilation. Though normally models are just weights in your stats, so you create a hard class for your model and save the weights in your resource folder. Beyond that, I'm not sure how I can help you. Please reconsider the down vote, because I'm only trying to help.End
You did tell me to save the data here when you said "Save it in S3 or some drive somewhere, and begin stream." What did you refer to by "it" in that sentence?Wie
You insinuated your model was 100GB, not the training data. I'll quote myself... " How are you loading your model into memory if its that big to begin with? Do you have more than 100GB in ram?"End
If you read the first comment "You want me to add the data: File or the Model instance to resources? The former won't work since data is huge (>100GB) and latter won't work since I need it to be an actual Java instance." - its quite clear that I am saying my data is 100+ GB and not my model.Wie
Ok I managed to solve this problem finally (without incurring run-time training cost and without simply checking in the giant model into the resource folder). See my answer here: https://mcmap.net/q/1509487/-sbt-how-to-package-an-instance-of-a-class-as-a-jarWie