I'm looking for clarification on exactly how Apache Flink (1.6.0) handles events from a KeyedStream after the events have passed through a window and some operator (such as reduce() or process()) has been applied.
Assuming a single-node cluster, after an operator on a keyed windowed stream has been executed, is one left with exactly one DataStream or with exactly k DataStreams (where k is the number of unique values for the key)?
To illustrate, suppose I read events from some source, key them by some k, send the keyed events into a windowed stream, reduce, and then do pretty much anything else. Which of the two graphs below will actually be constructed?
Graph A
                      |--------------|
                      |    source    |
                      | (DataStream) |
                      |--------------|
                             |
                        [all events]
                             |
                             v
                      |--------------|
                      | key by( k )  |
                      | (KeyedStream)|
                      |--------------|
                       /     |     \
                    /        |        \
             [ k = 1 ]   [ k = 2 ]   [ k = 3 ]
              /              |              \
           /                 |                 \
         v                   v                   v
|------------------||------------------||------------------|
|  sliding window  ||  sliding window  ||  sliding window  |
| (WindowedStream) || (WindowedStream) || (WindowedStream) |
|------------------||------------------||------------------|
         |                   |                   |
     [ k = 1 ]           [ k = 2 ]           [ k = 3 ]
         |                   |                   |
         v                   v                   v
    |----------|        |----------|        |----------|
    |  reduce  |        |  reduce  |        |  reduce  |
    |----------|        |----------|        |----------|
         |                   |                   |
     [ k = 1 ]           [ k = 2 ]           [ k = 3 ]
         |                   |                   |
         v                   v                   v
  |--------------|    |--------------|    |--------------|
  |     foo      |    |     foo      |    |     foo      |
  | (DataStream) |    | (DataStream) |    | (DataStream) |
  |--------------|    |--------------|    |--------------|
Graph B
                      |--------------|
                      |    source    |
                      | (DataStream) |
                      |--------------|
                             |
                        [all events]
                             |
                             v
                      |--------------|
                      | key by( k )  |
                      | (KeyedStream)|
                      |--------------|
                       /     |     \
                    /        |        \
             [ k = 1 ]   [ k = 2 ]   [ k = 3 ]
              /              |              \
           /                 |                 \
         v                   v                   v
|------------------||------------------||------------------|
|  sliding window  ||  sliding window  ||  sliding window  |
| (WindowedStream) || (WindowedStream) || (WindowedStream) |
|------------------||------------------||------------------|
         |                   |                   |
     [ k = 1 ]           [ k = 2 ]           [ k = 3 ]
         |                   |                   |
         v                   v                   v
    |----------|        |----------|        |----------|
    |  reduce  |        |  reduce  |        |  reduce  |
    |----------|        |----------|        |----------|
          \                  |                  /
             \               |               /
                \            |            /
                   \         |         /
                      \      |      /
                         \   |   /
                            \|/
                      [all products]
                             |
                             v
                      |--------------|
                      |     foo      |
                      | (DataStream) |
                      |--------------|
Edit (2018-09-22)
Based on David's answer, I think I misunderstood how KeyedStreams work in combination with a window or other stream. Somehow I got the impression that a KeyedStream partitions the incoming stream by creating multiple streams behind the scenes, rather than grouping events by some value within the same stream.
I thought Flink was doing the equivalent of:
List<Foo> eventsForKey1 = ...;
List<Foo> eventsForKey2 = ...;
List<Foo> eventsForKey3 = ...;
...
List<Foo> eventsForKeyN = ...;
I now see that Flink is actually doing the equivalent of:
Map<Key, List<Foo>> events = ...;
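To make that second mental model concrete for myself, here is a rough plain-Java sketch. It is only an analogy and does not use Flink's API or reflect its actual internals; the class `KeyedReduceSketch`, the method `reducePerKey`, and the fields on `Foo` are all made up for illustration. It keeps one running reduced value per key in a single map (rather than a full list per key) and then emits every per-key result into one output collection, which is how I now read Graph B.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyedReduceSketch {

    /** Hypothetical event type: a key plus a value to be reduced. */
    static final class Foo {
        final int key;
        final int value;

        Foo(int key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Groups events by key inside ONE map and reduces each group by summing
     * values. All per-key results end up in a single output list, analogous
     * to a single DataStream carrying the results for every key.
     */
    static List<Foo> reducePerKey(List<Foo> events) {
        // LinkedHashMap keeps first-seen key order so the output is predictable.
        Map<Integer, Foo> perKey = new LinkedHashMap<>();
        for (Foo e : events) {
            // merge() applies the reduce function only when the key already exists.
            perKey.merge(e.key, e, (a, b) -> new Foo(a.key, a.value + b.value));
        }
        return new ArrayList<>(perKey.values());
    }

    public static void main(String[] args) {
        List<Foo> events = Arrays.asList(
                new Foo(1, 10), new Foo(2, 5), new Foo(1, 20), new Foo(3, 7));

        // One output collection, three reduced elements (one per key).
        for (Foo f : reducePerKey(events)) {
            System.out.println("k=" + f.key + " sum=" + f.value);
        }
    }
}
```

The point of the sketch is that the number of distinct keys changes how many entries the map holds, not how many collections (streams) exist downstream of the reduce.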