How can I split and re-join STDOUT from multiple processes?
I am working on a pipeline that has a few branch points that subsequently merge -- they look something like this:

         command2
        /        \
command1          command4
        \        /
         command3

Each command writes to STDOUT and accepts input via STDIN. STDOUT from command1 needs to be passed to both command2 and command3, which are run sequentially, and their output needs to be effectively concatenated and passed to command4. I initially thought that something like this would work:

$ command1 | (command2; command3) | command4

That doesn't work though, as only STDOUT from command2 is passed to command4, and when I remove command4 it's apparent that command3 isn't being passed the appropriate stream from command1 -- in other words, it's as if command2 is exhausting or consuming the stream. I get the same result with { command2 ; command3 ; } in the middle as well. So I figured I should be using 'tee' with process substitution, and tried this:

$ command1 | tee >(command2) | command3 | command4

But surprisingly that didn't work either -- it appears that the output of command1 and the output of command2 are piped into command3, which results in errors and only the output of command3 being piped into command4. I did find that the following gets the appropriate input and output to and from command2 and command3:

$ command1 | tee >(command2) >(command3) | command4

However, this streams the output of command1 to command4 as well, which leads to issues as command2 and command3 produce a different specification than command1. The solution I've arrived on seems hacky, but it does work:

$ command1 | tee >(command2) >(command3) > /dev/null | command4

That suppresses command1 passing its output to command4, while collecting STDOUT from command2 and command3. It works, but I feel like I'm missing a more obvious solution. Am I? I've read dozens of threads and haven't found a solution to this problem that works in my use case, nor have I seen an elaboration of the exact problem of splitting and re-joining streams (though I can't be the first one to deal with this). Should I just be using named pipes? I tried but had difficulty getting that working as well, so maybe that's another story for another thread. I'm using bash in RHEL5.8.
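For reference, the closest I got with named pipes was a sketch like the following, with date, wc, and cat -n standing in for my real command1 through command4 (the FIFO names are made up for the example). It does seem to behave for small branch outputs:

```shell
#!/bin/bash
# Named-pipe sketch: each branch writes to its own FIFO, and the final
# stage reads the FIFOs in order, so the two outputs never interleave.
# date / wc / cat -n stand in for command1..command4.
dir=$(mktemp -d)
mkfifo "$dir/out2" "$dir/out3"

# Split: tee copies command1's stream into both branches; the copy that
# would go straight down the pipeline is discarded.
date | tee >(wc > "$dir/out2") >(wc > "$dir/out3") > /dev/null &

# Re-join: read branch 2 to EOF, then branch 3, feeding command4.
cat "$dir/out2" "$dir/out3" | cat -n

wait
rm -rf "$dir"
```

Note that with large branch outputs the writer to the second FIFO can stall on pipe buffers before the reader gets to it, so I'm not sure this scales.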

Centennial answered 23/4, 2014 at 21:46
Looks like your question has a solution which works -- so are you asking for a different solution? That kind of split rarely appears in shell scripts, but it does appear frequently in specialized tools like Hadoop MapReduce -- I don't think you are going to find anything better as a bash pipeline.Carroll
@Carroll -- yeah, I'm wondering if there is a better solution. I'm not losing sleep over this as my solution appears to work, but I expect there is a solution which doesn't involve redirecting stdout to /dev/null, and I'm curious as to where I've erred, as it may be informative for me (or others) as I continue developing.Centennial

You can play around with file descriptors like this:

((date | tee >( wc >&3) | wc) 3>&1) | wc

or

((command1 | tee >( command2 >&3) | command3) 3>&1) | command4

To explain: tee >( wc >&3) writes the original data to stdout, while the inner wc writes its result to file descriptor 3. The outer 3>&1 then merges FD 3 back into stdout, so the output of both wc invocations reaches the trailing command.

HOWEVER, there is nothing in this pipeline (or in your own solution) that guarantees the output will not be mangled -- that is, incomplete lines from command2 could be interleaved with lines from command3. If that is a concern, you will need to do one of two things:

  1. Write your own tee program which internally uses popen and reads each branch line by line, only sending complete lines to stdout for command4 to read
  2. Write the output from command2 and command3 to a file and use cat to merge the data as input to command4
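A sketch of option 2, with date, wc, and cat -n standing in for command1 through command4 (the temp-file handling is my own choice, not part of the original pipeline):

```shell
#!/bin/bash
# Option 2 sketch: capture command1's output once, run each branch
# against the capture, then cat the results together for command4.
# Because each branch writes to its own regular file, the merged
# stream cannot be mangled and the order is fixed.
src=$(mktemp); out2=$(mktemp); out3=$(mktemp)
trap 'rm -f "$src" "$out2" "$out3"' EXIT

date > "$src"                  # command1, run once
wc  < "$src" > "$out2"         # command2
wc  < "$src" > "$out3"         # command3
cat "$out2" "$out3" | cat -n   # command4 sees both outputs, in order
```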
Carroll answered 23/4, 2014 at 23:10
Thank you -- this is exactly what I'm looking for. I also appreciate the note regarding the potential intercalation of the output streams. I wonder if there's a more elegant solution than to rewrite tee or use a file. Maybe I could use wait to hold command3 until after command2 has finished?Centennial
This solution seems to work on bash/ksh/zsh. Does anybody know how to make it work with /bin/static-sh (i.e. busybox)?Fibster

Please see also https://unix.stackexchange.com/questions/28503/how-can-i-send-stdout-to-multiple-commands. Among all the answers there, I found this one fits my need particularly well.

To expand a little on @Soren's answer:

$ ((date | tee >( wc >&3) | wc) 3>&1) | cat -n
     1         1       6      29
     2         1       6      29

You can also do it without tee, using a shell variable instead:

$ (z=$(date); (echo "$z"| wc ); (echo "$z"| wc) ) | cat -n
     1         1       6      29
     2         1       6      29

In my case, I applied this technique and wrote a much more complex script that runs under busybox.

Fibster answered 22/9, 2017 at 16:52

I believe your solution is good, and it uses tee as documented. The man page of tee says:

Copy standard input to each FILE, and also to standard output

Your files are process substitutions.

The copy on standard output is the one you don't want here, which is exactly what you removed by redirecting it to /dev/null.

Beano answered 17/2, 2019 at 7:38
