Saving to database in stream pipeline
Asked Answered
S

2

12

As per the documentation on Oracle's website:

Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.

Does this include saving elements of the stream to a database?

Imagine the following (pseudo) code:

public SavedCar saveCar(Car car) {
  SavedCar savedCar = this.getDb().save(car);
  return savedCar;
}

public List<SavedCars> saveCars(List<Car> cars) {
  return cars.stream()
           .map(this::saveCar)
           .collect(Collectors.toList());
}

What are the unwanted effects opposed to this implementation:

public SavedCar saveCar(Car car) {
  SavedCar savedCar = this.getDb().save(car);
  return savedCar;
}

public List<SavedCars> saveCars(List<Car> cars) {
  List<SavedCars> savedCars = new ArrayList<>();
  for (Cat car : cars) {
    savedCars.add(this.saveCar(car));
  }
  return savedCars.
}
Sitnik answered 24/1, 2020 at 15:0 Comment(18)
yes, this is bad and under certain conditions you will be in pain.Kmeson
How so? What is the difference with writing this as a regular for-loop?Sitnik
Tough this would be obvious, if you use parallelStream you would certainly lose the transaction context.Bautista
A doubt around designing this code - Why does a method that writes to your database return and updated model? Could that not be separated? I mean mapping database objects to some other object in one phase and writing it to the database in another.Medius
@Naman, the code is just as an example. But imagine that id generation is handled by my JPA implementation, I want my returned SavedCars to include their new id. If I had to do this in 2 steps, it would require both a save and a fetch, instead of only a save in this case.Sitnik
I do't get it. You quote the documentation yourself and as a result of that I have given you an example (there could be more if you really want that) of exactly where this breaks because you would not follow that documentation and recommendation. what exactly are you looking for here?Kmeson
@Eugene, maybe I have missed some part of the documentation, but it only tells me not to do it, but it does not tell me why I should not do it.Sitnik
@Sitnik ok, and the fact that the entire map that I showed is not executed is not enough? ( A map that you rely on?). Things are even worse if you involve parallel into the picture with other examples.Kmeson
The fact that you tell me that the map will not be executed does not make it so. Not that I don't believe you, but please support your statements with the sources that prove it. Also, parallel is not part of this question.Sitnik
@Sitnik you sound confused. At the bottom of the answer there's a link to the bug stating that it is not executed. Also trying this with jdk-9 or above seems very little effort to do from your side. You either have a correct result using streams or you don't, thus parallel that I brought up; it's just easier to prove a point using that. As to why the documentation tells not to do it, is exactly because they can bring in some future improvements ( like in the example I have, that happened from java-8 to 9) that would be transparent to you IFF you would obey to the documentation.Kmeson
What is up with people telling people to try it yourself? This website's goal is to build up a shared knowledge, if everyone was told to test it themselves, this website would not be worth a lot, don't you think?Sitnik
I believe what you said in your answer, but I will not close this question until an answer can be accepted that could redirect a user to a credible source that explains them why something is a bad idea. The bug ticket that you supplied did not help either, because it only talks about count, without any intermediate operations. This one is better.Sitnik
@Sitnik sorry, imho you confuse yourself. There isn't much I can do about that. Even the ticket you mentioned has the exact same side-effects that are ignored, one is via peek vs map, which you fail to recognize.Kmeson
Hmmmm, I think we got off at the wrong foot, I recognize that peek and map are indeed intermediate stream operations that should not invoke side-effects (which I what my question is about), but the link you posted does not talk about intermediate side-effects, only the count terminal operation. No where in the ticket that was posted on your answer do they talk about map.Sitnik
Let us continue this discussion in chat.Kmeson
Letting the Stream aspects aside, there are two fundamental problems a) if one save(car) fails with an exception, some operation may have succeeded already, not visible in the result of this operation. b) performing a database operation for each object is a performance disaster. Every reasonable API has methods for storing all objects in one operation.Ragout
@Hoger, that is true, but this example code is only to explain my question.Sitnik
The documentation states that side effect are “in general, discouraged”. Then you are asking “what about this specific example”, but when you get a response noting the problems of the specific example, you say “but this is only an example”. So then, if your question is not about this specific example, what is your actual question? Do you really expect the official documentation to make a statement for each hypothetical use case, when it already made a general statement?Ragout
S
5

As per the documentation on Oracle's website [...]

That link is for Java 8. You may want to read the documentation for Java 9 (which came out in 2017) and later versions, as they are more explicit to this regard. Specifically:

A stream implementation is permitted significant latitude in optimizing the computation of the result. For example, a stream implementation is free to elide operations (or entire stages) from a stream pipeline -- and therefore elide invocation of behavioral parameters -- if it can prove that it would not affect the result of the computation. This means that side-effects of behavioral parameters may not always be executed and should not be relied upon, unless otherwise specified (such as by the terminal operations forEach and forEachOrdered). (For a specific example of such an optimization, see the API note documented on the count() operation. For more detail, see the side-effects section of the stream package documentation.)

Source: Java 9's Javadoc for the Stream interface.

And also the updated version of the doc you quoted:

Side-effects

Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.
If the behavioral parameters do have side-effects, unless explicitly stated, there are no guarantees as to:

  • the visibility of those side-effects to other threads;
  • that different operations on the "same" element within the same stream pipeline are executed in the same thread; and
  • that behavioral parameters are always invoked, since a stream implementation is free to elide operations (or entire stages) from a stream pipeline if it can prove that it would not affect the result of the computation.

The ordering of side-effects may be surprising. Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.

The eliding of side-effects may also be surprising. With the exception of terminal operations forEach and forEachOrdered, side-effects of behavioral parameters may not always be executed when the stream implementation can optimize away the execution of behavioral parameters without affecting the result of the computation. (For a specific example see the API note documented on the count operation.)

Source: Java 9's Javadoc for the java.util.stream package.

All emphasis mine.

As you can see, the current official documentation goes into more detail on the issues that you may encounter if you decide to use side-effects in your stream operations. It is also very clear on forEach and forEachOrdered being the only terminal operations where execution of side-effects is guaranteed (mind you, thread-safety issues still apply, as the official examples show).


That being said, and regarding your specific code, and said code only:

public List<SavedCars> saveCars(List<Car> cars) {
  return cars.stream()
           .map(this::saveCar)
           .collect(Collectors.toList());
}

I see no Streams-related issues with said code as-is.

  • The .map() step will be executed because .collect() (a mutable reduction operation, which is what the official doc recommends instead of things like .forEach(list::add)) relies on .map()'s output and, since this (i.e. saveCar()'s) output is different than its input, the stream cannot "prove that [eliding] it would not affect the result of the computation".
  • It is not a parallelStream() so it should not introduce any concurrency problems that didn't previously exist (of course, if someone added a .parallel() later then problems may arise —much like if someone decided to parallelize a for loop by firing up new threads for the inner computations).

That doesn't mean that the code in that example is Good Code™. The sequence .stream.map(::someSideEffect()).collect() as a way of performing side-effects operations for every item in a collection may look like more simple / short / elegant? than its for counterpart, and it sometimes may be. However, as Eugene, Holger and some others told you, there are better ways to approach this.
As a quick thought: the cost of firing up a Stream vs iterating a simple for is not negligible unless you have a lot of items, and if you have a lot of items then you: a) probably don't want to make a new DB access for each one, so a saveAll(List items) API would be better; and b) probably don't want to take the performance hit of processing a lot of items sequentially, so you would end up using parallelization and then a whole new set of problems arise.

Sisak answered 28/1, 2020 at 13:14 Comment(1)
See, this is the answer I have been looking for. A nice explanation with links to documentation that confirm the behaviour.Sitnik
K
6

The absolute easiest example is:

cars.stream()
    .map(this:saveCar)
    .count()

In this case, from java-9 and up, map will not be executed; since you do not need it to know the count, at all.

There are other multiple cases where side effects would cause you lots of pain; under certain conditions.

Kmeson answered 24/1, 2020 at 15:4 Comment(13)
Wow I did not know this. So java will not execute a Stream if its result is ignored?Sitnik
@Sitnik it will not execute the mapKmeson
I think count() would still be executed, but the implementation may skip intermediate steps if it can produce the result from the source (but there are a lot of implementation-level ifs)Sheetfed
But terminal-operations (like count()) are allowed to have side-effects. What if I implement my own terminal operation that saved to a database?Sitnik
@Sitnik that is an entirely different question and depends on the implementation; but yes, such things should be implemented in the terminal operation, like a custom Collector.Kmeson
@Sitnik would you implement it because you think it's a good idea to use a stream specifically to save things to a database?Jinni
Where can I find more information about this? Maybe I've missed it, but this information is not available in the main documentation for streams.Sitnik
@Sitnik about a custom Collector?Kmeson
@Kayaman, I greatly prefer the functional code style, it makes the code more compact and ordered. Is this such a bad thing?Sitnik
@Eugene, no I meant more information about this stream behavior, why sometimes the map would not be executed because Java knows it does not need to do so in order to get count.Sitnik
@Sitnik this is not going to be documented, anywhere. These are implementation details; but if you follow the documentation (like your side-effects portion) - you will not care about them do you?Kmeson
@Sitnik Java 8 features didn't turn Java into a functional language and Streams etc. aren't a replacement, they're an additional tool. Saving things into a database is a huge side effect, so you are trying to shoehorn something that doesn't fit just because you like the idea of functional programming. You might want to look at Clojure if you want to go all functional.Jinni
FWIW, what may make this side effect even worse is that it would actually work on old Java 8 versions but not on Java 9 or later. That particular optimization for sized streams was introduced in JDK-8067969Tomahawk
S
5

As per the documentation on Oracle's website [...]

That link is for Java 8. You may want to read the documentation for Java 9 (which came out in 2017) and later versions, as they are more explicit to this regard. Specifically:

A stream implementation is permitted significant latitude in optimizing the computation of the result. For example, a stream implementation is free to elide operations (or entire stages) from a stream pipeline -- and therefore elide invocation of behavioral parameters -- if it can prove that it would not affect the result of the computation. This means that side-effects of behavioral parameters may not always be executed and should not be relied upon, unless otherwise specified (such as by the terminal operations forEach and forEachOrdered). (For a specific example of such an optimization, see the API note documented on the count() operation. For more detail, see the side-effects section of the stream package documentation.)

Source: Java 9's Javadoc for the Stream interface.

And also the updated version of the doc you quoted:

Side-effects

Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.
If the behavioral parameters do have side-effects, unless explicitly stated, there are no guarantees as to:

  • the visibility of those side-effects to other threads;
  • that different operations on the "same" element within the same stream pipeline are executed in the same thread; and
  • that behavioral parameters are always invoked, since a stream implementation is free to elide operations (or entire stages) from a stream pipeline if it can prove that it would not affect the result of the computation.

The ordering of side-effects may be surprising. Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.

The eliding of side-effects may also be surprising. With the exception of terminal operations forEach and forEachOrdered, side-effects of behavioral parameters may not always be executed when the stream implementation can optimize away the execution of behavioral parameters without affecting the result of the computation. (For a specific example see the API note documented on the count operation.)

Source: Java 9's Javadoc for the java.util.stream package.

All emphasis mine.

As you can see, the current official documentation goes into more detail on the issues that you may encounter if you decide to use side-effects in your stream operations. It is also very clear on forEach and forEachOrdered being the only terminal operations where execution of side-effects is guaranteed (mind you, thread-safety issues still apply, as the official examples show).


That being said, and regarding your specific code, and said code only:

public List<SavedCars> saveCars(List<Car> cars) {
  return cars.stream()
           .map(this::saveCar)
           .collect(Collectors.toList());
}

I see no Streams-related issues with said code as-is.

  • The .map() step will be executed because .collect() (a mutable reduction operation, which is what the official doc recommends instead of things like .forEach(list::add)) relies on .map()'s output and, since this (i.e. saveCar()'s) output is different than its input, the stream cannot "prove that [eliding] it would not affect the result of the computation".
  • It is not a parallelStream() so it should not introduce any concurrency problems that didn't previously exist (of course, if someone added a .parallel() later then problems may arise —much like if someone decided to parallelize a for loop by firing up new threads for the inner computations).

That doesn't mean that the code in that example is Good Code™. The sequence .stream.map(::someSideEffect()).collect() as a way of performing side-effects operations for every item in a collection may look like more simple / short / elegant? than its for counterpart, and it sometimes may be. However, as Eugene, Holger and some others told you, there are better ways to approach this.
As a quick thought: the cost of firing up a Stream vs iterating a simple for is not negligible unless you have a lot of items, and if you have a lot of items then you: a) probably don't want to make a new DB access for each one, so a saveAll(List items) API would be better; and b) probably don't want to take the performance hit of processing a lot of items sequentially, so you would end up using parallelization and then a whole new set of problems arise.

Sisak answered 28/1, 2020 at 13:14 Comment(1)
See, this is the answer I have been looking for. A nice explanation with links to documentation that confirm the behaviour.Sitnik

© 2022 - 2024 — McMap. All rights reserved.