Future readers: I've tried answering each of OP's questions, and whenever I've deemed it necessary, I've divided the answer to a question into a few headers to make it a more structured and pleasant reading experience. Note that the answers to OP's questions require some foundational knowledge, and in cases where I think a full answer would deviate from the question at hand I've linked to other relevant SO questions/answers.
I do not understand what the | is doing. Is it simply the opposite of >?
No, | isn't the opposite of >. A pipeline, which is what | represents, is a set of commands set up to redirect their input/output into each other. For example, in the pipeline
processA | processB | processC
processA is set up to send its output into processB's input, and processB to send its output into processC's input; processC ultimately sends its output to the console.
In other words, a pipeline redirects the input and/or output of processes. As a convenience, the shell represents a pipeline with the metacharacter |; however, you can implement a pipeline yourself (which we do below). Similar to pipelining, the shell conveniently provides further I/O functionality via a few more metacharacters. For example:
- input redirection (i.e., <), where a process's standard input is set up to take input from a file instead of the default input, which is usually the keyboard, e.g., processA < file.
- output redirection (i.e., >), where a process's standard output is set up to send output into a file instead of the default output, which is usually the console, e.g., processA > file.
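To make the output-redirection bullet concrete, here is a minimal sketch of what processA > file could boil down to inside a single process: open the file, then use dup2() to make file descriptor 1 refer to it. The helper names redirect_stdout_to and restore_stdout are hypothetical, and a real shell does this in the child process just before exec rather than in itself.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Hypothetical helper: point this process's stdout at `path` (like `> path`).
// Returns a duplicate of the old stdout so it can be restored, or -1 on failure.
int redirect_stdout_to(const char *path) {
    assert(path != NULL);
    fflush(stdout);                    // flush text buffered for the old stdout
    int saved = dup(STDOUT_FILENO);    // remember where stdout pointed
    if (saved < 0) return -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { close(saved); return -1; }
    if (dup2(fd, STDOUT_FILENO) < 0) { close(fd); close(saved); return -1; }
    close(fd);                         // fd 1 now refers to the file
    return saved;
}

// Hypothetical helper: undo redirect_stdout_to().
void restore_stdout(int saved) {
    fflush(stdout);                    // push pending text into the file first
    dup2(saved, STDOUT_FILENO);
    close(saved);
}
```

Calling redirect_stdout_to("out.txt"), printing something with printf, and then calling restore_stdout with the returned descriptor should leave the printed text in out.txt instead of on the console. Input redirection works the same way with STDIN_FILENO and a file opened for reading.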
Is there a simple, or metaphorical explanation for what | does?
Yes, there is, and it comes from the person who first conceived the concept of a pipe! American computer scientist Doug McIlroy likened it to a "garden hose" in a note (emphasis mine):
- We should have some ways of coupling programs like garden hose--screw in another segment when it becomes when it becomes necessary to massage data in another way. This is the way of IO also.
It's worth mentioning this was some years before Unix was invented, and therefore the concept predates Unix.
Now on with the metaphor... Conventionally, a Unix pipe has been considered a half-duplex or unidirectional pipe, meaning that the data flows in one direction only. Thus, if you can imagine a unidirectional garden hose, with an end where water pushes through (let's call this the write end) and an end where water comes through (let's call this the read end), we can have this non-award-winning drawing (Drawn with tldraw):
In the drawing, water can only enter at the garden hose's write end, and it comes from what we call the water source. The water makes it to the garden hose's read end, where it is collected by what we call a water sink. With dmesg | less, dmesg's output is the data source, which is connected to the pipe's write end, and less's input is the data sink, which is connected to the pipe's read end. After setting up the pipe between dmesg and less, the data that dmesg produces flows through the pipe and into less.
A next-to-useless pipe
To create a pipe, we use the system call pipe(). To summarize the manpage (although you're encouraged to read and understand it), pipe() takes an array of two integers and places a file descriptor in each slot. The file descriptor at index 0 is the pipe's read end and the file descriptor at index 1 is the pipe's write end. The following trivial and mostly useless program shows that data flows from one end of the pipe to the other:
// useless-program.c
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
// we set up a pipe!
int pipefd[2];
pipe(pipefd);
// now pipefd[0] contains a file descriptor representing the pipe's read end
// and pipefd[1] contains a file descriptor representing the pipe's write end.
// we write a value on the pipe's write end, and display it on the console.
int sent_value = 100;
write(pipefd[1], &sent_value, sizeof(sent_value));
printf("Sent value: %d\n", sent_value);
// we read a value from the pipe's read end, and display it on the console.
int received_value;
read(pipefd[0], &received_value, sizeof(received_value));
printf("Received value: %d\n", received_value);
return 0;
}
Compiling and running it:
$ gcc useless-program.c && ./a.out
Sent value: 100
Received value: 100
Now you might argue this is quite the roundabout way to use a value within a program, and you'd be right. In fact, Stevens agrees with you in Advanced Programming in the Unix Environment, stating that "a pipe in a single process is next to useless". After all, pipes are a mechanism for inter-process communication, i.e., you need at least two processes.
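For contrast, here is a minimal sketch of the same value crossing a process boundary: the child writes into the pipe and the parent reads from it. The helper name send_through_child is hypothetical, and it relies on fork() and waitpid().

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: the child writes `value` into a pipe and the parent
// reads it back into *out. Returns 0 on success, -1 on failure.
int send_through_child(int value, int *out) {
    assert(out != NULL);
    int pipefd[2];
    if (pipe(pipefd) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { close(pipefd[0]); close(pipefd[1]); return -1; }

    if (pid == 0) {
        // child: it only writes, so it closes the read end first.
        close(pipefd[0]);
        ssize_t n = write(pipefd[1], &value, sizeof value);
        close(pipefd[1]);
        _exit(n == (ssize_t) sizeof value ? 0 : 1);
    }

    // parent: it only reads, so it closes the write end first.
    close(pipefd[1]);
    ssize_t n = read(pipefd[0], out, sizeof *out);
    close(pipefd[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return n == (ssize_t) sizeof *out ? 0 : -1;
}
```

Each process closes the end it doesn't use, which is the same discipline the later pipeline programs follow.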
What goes on when several pipes are used in a single line?
The short answer is that a pipeline with several pipes isn't any different from a pipeline with a single pipe, so let's start with that. But first, let's talk about processes.
Processes and standard streams
Abstractly, a process is a container for a running program. Thus, when you run the commands dmesg and less, two processes are spawned/forked to run these programs.
As a container, a process contains information/data such as the program/text, a program counter (PC), opened files, etc. to enable the process to carry out its job. We're specifically interested in the opened files. When you launch a command in the terminal, the shell creates a process to run the command, and one of the things the shell does is to open three files for that specific process:
- stdin, which stands for standard input and where the process can read data from.
- stdout, which stands for standard output and where the process can send normal data to.
- stderr, which stands for standard error and where the process can send error data to.
A process doesn't deal with the opened files directly, though. Instead, it deals with what's known as a file descriptor, which is a non-negative integer that represents the opened file. When the process wants to read from or write to a specific file, it uses this file descriptor to do so. I won't go into details about file descriptors (see here), but what you should know is:
- File descriptors 0, 1, and 2 are assigned to stdin, stdout, and stderr by default. Unless you close these file descriptors, the next file descriptor opened within a given process will be 3, the next one will be 4, and so on.
Takeaway: By default, any newly-created process has three opened files available to it, namely stdin, stdout, and stderr. Collectively, they're known as a process's standard streams. A process can have more open files during its lifetime, but only these first three have a special name and preassigned roles.
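The numbering rule can be checked with a small sketch. It leans on the POSIX guarantee that open() returns the lowest-numbered descriptor not currently in use; the helper name probe_descriptors is hypothetical and /dev/null is used only as a file that is always there to open.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: open /dev/null twice, close the first descriptor, and
// open once more. Stores the three descriptor numbers in fds[0..2].
// Returns 0 on success, -1 on failure.
int probe_descriptors(int fds[3]) {
    assert(fds != NULL);
    fds[0] = open("/dev/null", O_RDONLY);
    if (fds[0] < 0) return -1;
    fds[1] = open("/dev/null", O_RDONLY);
    if (fds[1] < 0) { close(fds[0]); return -1; }
    close(fds[0]);                          // frees the lower number
    fds[2] = open("/dev/null", O_RDONLY);   // reuses the lowest free number
    close(fds[1]);
    if (fds[2] < 0) return -1;
    close(fds[2]);
    return 0;
}
```

In a program started from a terminal with only the three standard streams open, the numbers would typically come out as 3, 4, and 3 again, but that depends on what the parent process left open, so treat it as the usual case rather than a guarantee.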
Implementing a pipeline with a single pipe
Now that we have some idea of a pipeline, are familiar with the pipe() system call, and know about the standard streams in a newly-created process, we can implement a pipeline. However, before that, you should be comfortable with the following system calls:
I've linked to both the manpages and relevant SO questions with some good answers. If you aren't familiar with these system calls, you're encouraged to familiarize yourself with them before moving forward. Alternatively, you can read onward, fill the gaps by reading the above links, and then come back for a second pass.
NOTE: On my machine, I ran dmesg and got:
dmesg: read kernel buffer failed: Operation not permitted
Thus, moving forward, I'll use a more down-to-earth command that most Unix users are familiar with, i.e., ls. Instead of dmesg | less, I'll use ls | less. It's the same principle, just a different command.
The shell in summary
A lot goes on in the shell but in summary the shell works as follows:
1. Shows you a prompt and then waits for you to type something into it. You then type a command (i.e., the name of an executable program, plus any arguments) into it. In this case, we type ls | less.
2. Parses the command and figures out what it should do. In this instance, it encounters the metacharacter |, which tells it the set of commands is a pipeline.
3. Calls fork() to create a new child process to run the command. In our case, it creates two child processes: one for ls and one for less.
4. Redirects input/output as necessary and as directed by the metacharacter used. In this case, we're using the | metacharacter, and the shell determines that the command to the left of | (i.e., ls) needs its output to be directed to the input of the command to the right (i.e., less). To make this communication possible, it uses a pipe created with pipe().
5. Calls some variant of exec() to run the command. Here, we'll be calling execlp to execute both ls and less.
6. Waits for the command to complete by calling wait(). When the child completes, the shell returns from wait() and prints out a prompt again, ready for your next command. Here it'd wait for both ls and less.
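Leaving the pipe aside for a moment, the fork, exec, and wait steps for a single command can be sketched as follows. The helper name run_command is hypothetical, and returning 127 when exec fails is only borrowed from the usual shell convention for "command not found".

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run argv[0] with its arguments the way a shell runs a
// simple command. Returns the command's exit status, or -1 on failure.
int run_command(char *const argv[]) {
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();                 // create the child process
    if (pid < 0) return -1;

    if (pid == 0) {
        execvp(argv[0], argv);          // replace the child's image with the command
        _exit(127);                     // only reached if exec failed
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)   // wait for the command to finish
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For instance, passing the array {"ls", "-l", NULL} would list the current directory and return the exit status of ls. Any redirection would be done in the child between fork() and execvp().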
C implementation of a pipeline with a single pipe
We'll implement what happens at step 3 and onward in the following program:
// ls-to-less-pipe.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
int main(int argc, char *argv[]) {
// the parent is forking two processes, and we store their process ID
// in these variables.
pid_t ls_pid, less_pid;
// we set up a pipe that will enable communication between the processes
// for `ls` and `less`.
int pipefd[2];
pipe(pipefd);
// the parent process creates the first child process for the `ls` command.
ls_pid = fork();
if (ls_pid == 0) {
// we assign this process's output to the pipe's write end, i.e.,
// instead of sending output to the screen, it sends it to the pipe's
// write end.
dup2(pipefd[1], STDOUT_FILENO);
// now this process's stdout refers to the pipe's write end too so we
// can close this descriptor.
close(pipefd[1]);
// this process doesn't use the pipe's read end, and thus we close this
// file descriptor.
close(pipefd[0]);
// replace process's current image with this new process image, i.e.,
// the ls command.
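if (execlp("ls", "ls", (char *) NULL) < 0) {
fprintf(stderr, "failed trying to execute the ls command\n");
// a failed exec is an error, so we report a non-zero exit status.
exit(EXIT_FAILURE);
}
}
else if (ls_pid < 0) {
fprintf(stderr, "failed forking ls process\n");
exit(EXIT_FAILURE);
}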
// the parent process creates the second child process for the `less` command.
less_pid = fork();
if (less_pid == 0) {
// we assign this process's input to the pipe's read end, i.e., instead
// of taking input from the keyboard, it takes it from the pipe's read end.
dup2(pipefd[0], STDIN_FILENO);
// now this process's stdin refers to the pipe's read end too so we
// can close this descriptor.
close(pipefd[0]);
// this process doesn't use the pipe's write end, and thus we close this
// file descriptor.
close(pipefd[1]);
// replace process's current image with this new process image, i.e., the less command.
if (execlp("less", "less", (char *) NULL) < 0) {
fprintf(stderr, "failed trying to execute the less command\n");
exit(EXIT_FAILURE);
}
}
else if (less_pid < 0) {
fprintf(stderr, "failed forking less process\n");
exit(EXIT_FAILURE);
}
// the parent process doesn't use the pipe so we close both ends. Closing the
// parent's copy of the write end is also required: `less` only sees EOF on
// its input once every write end of the pipe has been closed.
close(pipefd[0]);
close(pipefd[1]);
// the parent process waits for both child processes to finish their execution.
int ls_status, less_status;
pid_t ls_wpid = waitpid(ls_pid, &ls_status, 0);
pid_t less_wpid = waitpid(less_pid, &less_status, 0);
return 0;
}
Compiling and running it:
$ gcc ls-to-less-pipe.c && ./a.out
file1.txt
file2.txt
file3.txt
:
Therefore, we've indeed set up a pipe between ls and less, allowing ls to send its output to less, which is akin to, though not identical to, what the shell does when you run ls | less.
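One detail of the program above that is easy to get wrong is closing the unused write ends: a reader only sees end-of-file once every descriptor for the pipe's write end has been closed. The following single-process sketch illustrates that rule; the helper name drain_after_close is hypothetical, and the message is assumed to be small enough to fit in the pipe's buffer so that write() doesn't block.

```c
#include <assert.h>
#include <unistd.h>

// Hypothetical helper: write `len` bytes into a fresh pipe, close the write
// end, then drain the read end. Returns the number of bytes read before
// read() reported end-of-file, or -1 on error. `len` is assumed to be small.
ssize_t drain_after_close(const char *msg, size_t len) {
    assert(msg != NULL);
    int pipefd[2];
    if (pipe(pipefd) < 0) return -1;

    if (write(pipefd[1], msg, len) != (ssize_t) len) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    close(pipefd[1]);   // last write end closed: the reader can now reach EOF

    char buf[64];
    ssize_t total = 0, n;
    while ((n = read(pipefd[0], buf, sizeof buf)) > 0)
        total += n;     // read() returns 0 once the pipe is empty and closed
    close(pipefd[0]);
    return n < 0 ? -1 : total;
}
```

If the close(pipefd[1]) line were removed, the loop would block forever after the data had been consumed, which is what would happen to less if the parent kept its copy of the write end open.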
C implementation of a pipeline with multiple pipes
Let's say we have a pipeline for printing the top three authors based on the number of commits in a git repo:
git log --format='%an' | sort | uniq -c | sort -nr | head -n 3
This can be implemented as follows:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
int main(int argc, char *argv[]) {
pid_t git_pid, sort_pid1, sort_pid2, uniq_pid, head_pid;
int pipefd1[2];
int pipefd2[2];
int pipefd3[2];
int pipefd4[2];
pipe(pipefd1);
pipe(pipefd2);
pipe(pipefd3);
pipe(pipefd4);
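if ((git_pid = fork()) == 0) {
dup2(pipefd1[1], STDOUT_FILENO);
// close every pipe descriptor this child inherited; its stdout already
// refers to the write end of the first pipe.
close(pipefd1[0]);
close(pipefd1[1]);
close(pipefd2[0]);
close(pipefd2[1]);
close(pipefd3[0]);
close(pipefd3[1]);
close(pipefd4[0]);
close(pipefd4[1]);
if (execlp("git", "git", "log", "--format=%an", (char *) NULL) < 0) {
fprintf(stderr, "failed trying to execute the git command\n");
exit(EXIT_FAILURE);
}
}
else if (git_pid < 0) {
fprintf(stderr, "failed forking git process\n");
exit(EXIT_FAILURE);
}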
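if ((sort_pid1 = fork()) == 0) {
dup2(pipefd1[0], STDIN_FILENO);
dup2(pipefd2[1], STDOUT_FILENO);
// close every inherited pipe descriptor, including the pipes this
// command doesn't use.
close(pipefd1[0]);
close(pipefd1[1]);
close(pipefd2[0]);
close(pipefd2[1]);
close(pipefd3[0]);
close(pipefd3[1]);
close(pipefd4[0]);
close(pipefd4[1]);
if (execlp("sort", "sort", (char *) NULL) < 0) {
fprintf(stderr, "failed trying to execute the sort command\n");
exit(EXIT_FAILURE);
}
}
else if (sort_pid1 < 0) {
fprintf(stderr, "failed forking sort process\n");
exit(EXIT_FAILURE);
}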
if ((uniq_pid = fork()) == 0) {
dup2(pipefd2[0], STDIN_FILENO);
dup2(pipefd3[1], STDOUT_FILENO);
close(pipefd1[0]);
close(pipefd1[1]);
close(pipefd2[0]);
close(pipefd2[1]);
close(pipefd3[0]);
close(pipefd3[1]);
close(pipefd4[0]);
close(pipefd4[1]);
if (execlp("uniq", "uniq", "-c", (char *) NULL) < 0) {
fprintf(stderr, "failed trying to execute the uniq command\n");
exit(EXIT_FAILURE);
}
}
// check the pid of the uniq fork here, not the earlier sort fork.
else if (uniq_pid < 0) {
fprintf(stderr, "failed forking uniq process\n");
exit(EXIT_FAILURE);
}
if ((sort_pid2 = fork()) == 0) {
dup2(pipefd3[0], STDIN_FILENO);
dup2(pipefd4[1], STDOUT_FILENO);
close(pipefd1[0]);
close(pipefd1[1]);
close(pipefd2[0]);
close(pipefd2[1]);
close(pipefd3[0]);
close(pipefd3[1]);
close(pipefd4[0]);
close(pipefd4[1]);
if (execlp("sort", "sort", "-nr", (char *) NULL) < 0) {
fprintf(stderr, "failed trying to execute the sort command\n");
exit(EXIT_FAILURE);
}
}
else if (sort_pid2 < 0) {
fprintf(stderr, "failed forking sort process\n");
exit(EXIT_FAILURE);
}
if ((head_pid = fork()) == 0) {
dup2(pipefd4[0], STDIN_FILENO);
close(pipefd1[0]);
close(pipefd1[1]);
close(pipefd2[0]);
close(pipefd2[1]);
close(pipefd3[0]);
close(pipefd3[1]);
close(pipefd4[0]);
close(pipefd4[1]);
if (execlp("head", "head", "-n", "3", (char *) NULL) < 0) {
fprintf(stderr, "failed trying to execute the head command\n");
exit(EXIT_FAILURE);
}
}
else if (head_pid < 0) {
fprintf(stderr, "failed forking head process\n");
exit(EXIT_FAILURE);
}
close(pipefd1[0]);
close(pipefd1[1]);
close(pipefd2[0]);
close(pipefd2[1]);
close(pipefd3[0]);
close(pipefd3[1]);
close(pipefd4[0]);
close(pipefd4[1]);
int git_status, sort1_status, sort2_status, uniq_status, head_status;
pid_t git_wpid = waitpid(git_pid, &git_status, 0);
pid_t sort1_wpid = waitpid(sort_pid1, &sort1_status, 0);
pid_t sort2_wpid = waitpid(sort_pid2, &sort2_status, 0);
pid_t uniq_wpid = waitpid(uniq_pid, &uniq_status, 0);
pid_t head_wpid = waitpid(head_pid, &head_status, 0);
return 0;
}
Pictorially this looks as follows:
Is the behavior of pipes consistent everywhere it appears in a Bash script?
If by this question you're asking whether a pipeline always behaves as the Bash manual describes it, i.e., "a pipeline is a sequence of one or more commands separated by one of the control operators | or |&. The output of each command in the pipeline is connected via a pipe to the input of the next command. That is, each command reads the previous command’s output.", then yes.
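The quoted definition maps fairly directly onto a loop: for every command except the last, create a pipe, connect the command's output to it, and hand the pipe's read end to the next command. The sketch below is one possible way to write that for any number of commands. The helper name run_pipeline is hypothetical, it only covers | (not |&), and on an error partway through it simply returns without cleaning up children that were already started.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run cmds[0] | cmds[1] | ... | cmds[n-1], where each
// cmds[i] is a NULL-terminated argv array. Returns the exit status of the
// last command, or -1 on failure.
int run_pipeline(char *const *cmds[], int n) {
    assert(cmds != NULL && n > 0);
    int prev_read = -1;     // read end of the pipe feeding the current command
    pid_t last_pid = -1;

    for (int i = 0; i < n; i++) {
        int pipefd[2] = {-1, -1};
        int is_last = (i == n - 1);
        if (!is_last && pipe(pipefd) < 0) return -1;

        pid_t pid = fork();
        if (pid < 0) return -1;
        if (pid == 0) {
            if (prev_read != -1) {          // not the first command: read from the previous pipe
                dup2(prev_read, STDIN_FILENO);
                close(prev_read);
            }
            if (!is_last) {                 // not the last command: write into the new pipe
                dup2(pipefd[1], STDOUT_FILENO);
                close(pipefd[1]);
                close(pipefd[0]);
            }
            execvp(cmds[i][0], cmds[i]);
            _exit(127);                     // only reached if exec failed
        }

        // parent: drop the descriptors it no longer needs so readers can see EOF.
        if (prev_read != -1) close(prev_read);
        if (!is_last) close(pipefd[1]);
        prev_read = is_last ? -1 : pipefd[0];
        last_pid = pid;
    }

    // reap every child and remember the status of the last command.
    int status, last_status = -1;
    pid_t w;
    while ((w = wait(&status)) > 0)
        if (w == last_pid && WIFEXITED(status))
            last_status = WEXITSTATUS(status);
    return last_status;
}
```

Because the parent closes each pipe end as soon as the relevant child has been forked, a child never inherits descriptors for pipes it doesn't use, so the long runs of close() calls in the hand-written version aren't needed here.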