How can I avoid mutable variables in Scala when using ZipInputStreams and ZipOutpuStreams?
Asked Answered
C

5

17

I'm trying to read a zip file, check that it has some required files, and then write all valid files out to another zip file. The basic introduction to java.util.zip has a lot of Java-isms and I'd love to make my code more Scala-native. Specifically, I'd like to avoid the use of vars. Here's what I have:

val fos = new FileOutputStream("new.zip");
val zipOut = new ZipOutputStream(new BufferedOutputStream(fos));

while (zipIn.available == 1) {
  val entry = zipIn.getNextEntry
  if (entryIsValid(entry)) {
    zipOut.putNewEntry(new ZipEntry("subdir/" + entry.getName())
    // read data into the data Array
    var data = Array[Byte](1024)
    var count = zipIn.read(data, 0, 1024)
    while (count != -1) {
      zipOut.write(data, 0, count)
      count = zipIn.read(data, 0, 1024)
    }
  }
  zipIn.close
}
zipOut.close

I should add that I'm using Scala 2.7.7.

Cyperaceous answered 17/5, 2010 at 13:17 Comment(2)
Because I was being lazy and new Array[Byte] causes the compiler to complain about alternative constructors. I guess I should be using new ArrayBuffer[Byte].Cyperaceous
var data = new Array[Byte](1024)Vernacularize
F
35

dI don't think there's anything particularly wrong with using Java classes that are designed to work in imperative fashion in the fashion they were designed. Idiomatic Scala includes being able to use idiomatic Java as it was intended, even if the styles do clash a bit.

However, if you want--perhaps as an exercise, or perhaps because it does slightly clarify the logic--to do this in a more functional var-free way, you can do so. In 2.8, it's particularly nice, so even though you're using 2.7.7, I'll give a 2.8 answer.

First, we need to set up the problem, which you didn't entirely, but let's suppose we have something like this:

import java.io._
import java.util.zip._
import scala.collection.immutable.Stream

val fos = new FileOutputStream("new.zip")
val zipOut = new ZipOutputStream(new BufferedOutputStream(fos))
val zipIn = new ZipInputStream(new FileInputStream("old.zip"))
def entryIsValid(ze: ZipEntry) = !ze.isDirectory

Now, given this we want to copy the zip file. The trick we can use is the continually method in collection.immutable.Stream. What it does is perform a lazily-evaluated loop for you. You can then take and filter the results to terminate and process what you want. It's a handy pattern to use when you have something that you want to be an iterator, but it isn't. (If the item updates itself you can use .iterate in Iterable or Iterator--that's usually even better.) Here's the application to this case, used twice: once to get the entries, and once to read/write chunks of data:

val buffer = new Array[Byte](1024)
Stream.continually(zipIn.getNextEntry).
  takeWhile(_ != null).filter(entryIsValid).
  foreach(entry => {
    zipOut.putNextEntry(new ZipEntry("subdir/"+entry.getName))
    Stream.continually(zipIn.read(buffer)).takeWhile(_ != -1).
      foreach(count => zipOut.write(buffer,0,count))
  })
}
zipIn.close
zipOut.close

Pay close attention to the . at the end of some lines! I would normally write this on one long line, but it's nicer to have it wrap so you can see it all here.

Just in case it isn't clear, let's unpack one of the uses of continually.

Stream.continually(zipIn.read(buffer))

This asks to keep calling zipIn.read(buffer) for as many times as necessary, storing the integer that results.

.takeWhile(_ != -1)

This specifies how many times are necessary, returning a stream of indefinite length but which will quit when it hits a -1.

.foreach(count => zipOut.write(buffer,0,count))

This processes the stream, taking each item in turn (the count), and using it to write the buffer. This works in a slightly sneaky way, since you rely upon the fact that zipIn has just been called to get the next element of the stream--if you tried to do this again, not on a single pass through the stream, it would fail because buffer would be overwritten. But here it's okay.

So, there it is: a slightly more compact, possibly easier to understand, possibly less easy to understand method that is more functional (though there are still side-effects galore). In 2.7.7, in contrast, I would actually do it the Java way because Stream.continually isn't available, and the overhead of building a custom Iterator isn't worth it for this one case. (It would be worth it if I was going to do more zip file processing and could reuse the code, however.)


Edit: The looking-for-available-to-go-zero method is kind of flaky for detecting the end of the zip file. I think the "correct" way is to wait until you get a null back from getNextEntry. With that in mind, I've edited the previous code (there was a takeWhile(_ => zipIn.available==1) that is now a takeWhile(_ != null)) and provided a 2.7.7 iterator based version below (note how small the main loop is, once you get through the work of defining the iterators, which do admittedly use vars):

val buffer = new Array[Byte](1024)
class ZipIter(zis: ZipInputStream) extends Iterator[ZipEntry] {
  private var entry:ZipEntry = zis.getNextEntry
  private var cached = true
  private def cache { if (entry != null && !cached) {
    cached = true; entry = zis.getNextEntry
  }}
  def hasNext = { cache; entry != null }
  def next = {
    if (!cached) cache
    cached = false
    entry
  }
}
class DataIter(is: InputStream, ab: Array[Byte]) extends Iterator[(Int,Array[Byte])] {
  private var count = 0
  private var waiting = false
  def hasNext = { 
    if (!waiting && count != -1) { count = is.read(ab); waiting=true }
    count != -1
  }
  def next = { waiting=false; (count,ab) }
}
(new ZipIter(zipIn)).filter(entryIsValid).foreach(entry => {
  zipOut.putNextEntry(new ZipEntry("subdir/"+entry.getName))
  (new DataIter(zipIn,buffer)).foreach(cb => zipOut.write(cb._2,0,cb._1))
})
zipIn.close
zipOut.close
Fractious answered 17/5, 2010 at 15:20 Comment(4)
Thanks, Rex, that's a very nice answer.Cyperaceous
The last version here with ZipIter has a serious bug. Calling getNextEntry actually advances the stream pointer, so your entry is referring to a different thing than the stream is set to. E.g. if you have A.txt B.txt you'll get the entry for A.txt but read B.txt, then get the entry for B.txt and I think read nothing.Ajani
@BillBarksdale - Good catch! I'm fixing it.Fractious
couldn't work out the formatting for this comment, posted it as an answer instead :PAjani
S
2

Using scala2.8 and tail recursive call :

def copyZip(in: ZipInputStream, out: ZipOutputStream, bufferSize: Int = 1024) {
  val data = new Array[Byte](bufferSize)

  def copyEntry() {
    in getNextEntry match {
      case null =>
      case entry => {
        if (entryIsValid(entry)) {
          out.putNextEntry(new ZipEntry("subdir/" + entry.getName()))

          def copyData() {
            in read data match {
              case -1 =>
              case count => {
                out.write(data, 0, count)
                copyData()
              }
            }
          }
          copyData()
        }
        copyEntry()
      }
    }
  }
  copyEntry()
}
Shammer answered 17/5, 2010 at 14:34 Comment(3)
Thanks, that looks quite nice. Unfortunately I should have specified I'm still on 2.7.7.Cyperaceous
Also, how will this blow up in 2.7.7 versus 2.8? I'm not very conversant on the subject of tail recursion. Thanks.Cyperaceous
@Cyperaceous Scala 2.8 optimize tail call when possible so you will avoid stackoverflow. For an introduction to what is tail call i suggest you read this entry for example : blog.richdougherty.com/2009/04/…Shammer
F
2

I'd try something like this (yes, pretty much the same idea sblundy had):

Iterator.continually {
  val data = new Array[Byte](100)
  zipIn.read(data) match {
    case -1 => Array.empty[Byte]
    case 0  => new Array[Byte](101) // just to filter it out
    case n  => java.util.Arrays.copyOf(data, n)
  }
} filter (_.size != 101) takeWhile (_.nonEmpty)

It could be simplified like below, but I'm not very fond of it. I'd prefer for read not to be able to return 0...

Iterator.continually {
  val data = new Array[Byte](100)
  zipIn.read(data) match {
    case -1 => new Array[Byte](101)
    case n  => java.util.Arrays.copyOf(data, n)
  }
} takeWhile (_.size != 101)
Fibrinolysin answered 17/5, 2010 at 16:0 Comment(0)
A
2

Based on http://harrah.github.io/browse/samples/compiler/scala/tools/nsc/io/ZipArchive.scala.html:

private[io] class ZipEntryTraversableClass(in: InputStream) extends Traversable[ZipEntry] {
  val zis = new ZipInputStream(in)

  def foreach[U](f: ZipEntry => U) {
    @tailrec
    def loop(x: ZipEntry): Unit = if (x != null) {
      f(x)
      zis.closeEntry()
      loop(zis.getNextEntry())
    }
    loop(zis.getNextEntry())
  }

  def writeCurrentEntryTo(os: OutputStream) {
    IOUtils.copy(zis, os)
  }
}
Ajani answered 27/5, 2013 at 22:22 Comment(1)
This doesn't seem to let you easily get at the actual file contents though... what a painful interfaceAjani
H
1

Without tail-recursion, I'd avoid recursion. You would run the risk to a stack overflow. You could wrap zipIn.read(data) in an scala.BufferedIterator[Byte] and go from there.

Hutton answered 17/5, 2010 at 13:33 Comment(2)
Ok... So are you suggesting that there isn't any better approach?Cyperaceous
Sorry, took me a few minutes to think of something.Hutton

© 2022 - 2024 — McMap. All rights reserved.