Having just read the paper (again), let me try to explain it.
Suppose gprof takes samples at 100 Hz, except not while the process is blocked on IO or otherwise not running. Each sample records the PC, which falls in some function, and the count of samples for that function is incremented.
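To make that concrete, here is a rough sketch of how a PC-sampling profiler can work on x86-64 Linux. This is not gprof's actual code; the bucket scheme and names are mine, and the register access is platform-specific, but the shape is the same: a profiling timer fires only while the process is consuming CPU, and the handler bumps a counter for the address the PC fell in.

```c
#define _GNU_SOURCE            /* for REG_RIP on glibc */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define NBUCKETS 65536
static unsigned long hist[NBUCKETS];   /* sample count per address bucket */

/* SIGPROF handler: grab the interrupted PC and bump its bucket.
   Mapping buckets back to function names happens later, offline,
   using the symbol table (that is gprof's job, not the handler's). */
static void on_sigprof(int sig, siginfo_t *si, void *ucv)
{
    (void)sig; (void)si;
    ucontext_t *uc = ucv;
    unsigned long pc = (unsigned long)uc->uc_mcontext.gregs[REG_RIP]; /* x86-64 Linux/glibc only */
    hist[(pc >> 4) % NBUCKETS]++;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGPROF, &sa, NULL);

    /* ITIMER_PROF counts only CPU time (user + system), so no samples
       arrive while the process is blocked on IO, which is the caveat above. */
    struct itimerval tv = { { 0, 10000 }, { 0, 10000 } };  /* 10 ms, i.e. 100 Hz */
    setitimer(ITIMER_PROF, &tv, NULL);

    volatile double x = 0;
    for (long i = 0; i < 200000000L; i++)   /* burn some CPU so samples accumulate */
        x += i * 1e-9;

    unsigned long total = 0;
    for (int i = 0; i < NBUCKETS; i++) total += hist[i];
    printf("collected %lu samples (~%lu ms of CPU)\n", total, total * 10);
    return 0;
}
```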
So, at the end of a run, if there were (say) 1000 samples, the total execution time (CPU only) was 10 seconds. If routine B accounted for, say, 500 of those samples, its total execution time was half the total, or 5 seconds. This is its self time, because the PC was actually in B for those samples; it does not include time spent in B's callees. It also does not tell how long B takes per call, on average; for that you need to know how many times B was called.
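Spelled out with the numbers from that paragraph (variable names are mine):

```c
#include <stdio.h>

int main(void)
{
    double sample_rate   = 100.0;   /* samples per second of CPU time */
    long   total_samples = 1000;
    long   samples_in_B  = 500;

    double total_cpu_time = total_samples / sample_rate;            /* 10 s */
    double self_time_B    = samples_in_B  / sample_rate;            /*  5 s */
    double fraction_B     = (double)samples_in_B / total_samples;   /* 0.5  */

    printf("total CPU time %.1f s, B self time %.1f s (%.0f%% of total)\n",
           total_cpu_time, self_time_B, fraction_B * 100.0);
    return 0;
}
```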
When the code is compiled with the -pg flag, a special call (traditionally mcount) is inserted in each routine's entry code. That call notices that routine B has been entered, and that it was called from a particular call site in routine A. There is a table, indexed by call site, in which that call is counted. So at the end, gprof can tell how many times B was called in total, and how many of those calls came from A.
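Conceptually the call-counting side looks something like the sketch below. The data structure and names are my simplification, not the real mcount/gmon internals, and I use function-name strings where the real thing records the caller's return address, but the idea is the same: each arc (call site, callee) gets its own counter.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ARCS 1024

struct arc {
    const char *from;        /* caller (a call-site PC in the real thing) */
    const char *to;          /* callee entry point                        */
    unsigned long count;     /* times this particular arc was traversed   */
};

static struct arc arcs[MAX_ARCS];
static int n_arcs;

/* What the hook inserted by -pg does, conceptually: find (or create)
   the arc for this call site / callee pair and bump its counter. */
static void count_call(const char *from, const char *to)
{
    for (int i = 0; i < n_arcs; i++)
        if (!strcmp(arcs[i].from, from) && !strcmp(arcs[i].to, to)) {
            arcs[i].count++;
            return;
        }
    arcs[n_arcs++] = (struct arc){ from, to, 1 };
}

/* Pretend A calls B three times and C calls B once. */
static void B(const char *caller) { count_call(caller, "B"); }
static void A(void) { for (int i = 0; i < 3; i++) B("A"); }
static void C(void) { B("C"); }

int main(void)
{
    A();
    C();
    for (int i = 0; i < n_arcs; i++)
        printf("%s -> %s : %lu calls\n", arcs[i].from, arcs[i].to, arcs[i].count);
    /* Summing per callee: B was called 4 times in total, 3 of them from A. */
    return 0;
}
```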
To get the average self time of B, its total self time is divided by the number of times it is called.
To get the total cumulative time (self plus callees) of routine A, gprof takes the self time of A and adds, for each subordinate routine B, the number of times A calls B times the average cumulative time of B. That total is then divided by the number of calls to A to get the average cumulative time of A.
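As a toy version of that arithmetic (my names, not gprof's, and it bakes in gprof's key assumption that a call to B costs the same on average no matter who made it):

```c
#include <stdio.h>

/* avg cumulative time of A =
   (self time of A + sum over callees of (calls from A to callee
                                          * callee's avg cumulative time))
   / number of calls to A */
static double avg_cumulative(double self_time_A,
                             unsigned long calls_to_A,
                             int n_callees,
                             const unsigned long calls_from_A[],
                             const double avg_cum_of_callee[])
{
    double total = self_time_A;
    for (int i = 0; i < n_callees; i++)
        total += calls_from_A[i] * avg_cum_of_callee[i];   /* charge callee time back to A */
    return total / calls_to_A;
}

int main(void)
{
    /* Made-up numbers: A was called 10 times, has 2 s of self time, and made
       500 calls to B, whose average cumulative time is 10 ms per call. */
    unsigned long calls_from_A[]   = { 500 };
    double        avg_cum_callee[] = { 0.010 };
    printf("average cumulative time of A: %.2f s\n",
           avg_cumulative(2.0, 10, 1, calls_from_A, avg_cum_callee)); /* (2 + 5) / 10 = 0.70 s */
    return 0;
}
```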
Sounds good, until recursion enters the picture, at which point it gets even more confusing.
It's all very clever, and as the authors point out, full of caveats.