There is a derivation here:
http://www.deepfriedbrainproject.com/2010/07/magical-formula-of-pert.html
In case the link goes dead, I'll provide a summary here.
So, taking a step back from the question for a moment, the goal here is to come up with a single mean (average) figure that we can say is the expected figure for any given 3 point estimate. That is to say, If I was to attempt the project X times, and add up all the costs of the project attempts for a total of $Y, then I expect the cost of one attempt to be $Y/X. Note that this number may or may not be the same as the mode (most likely) outcome, depending on the probability distribution.
An expected outcome is useful because we can do things like add up a whole list of expected outcomes to create an expected outcome for the project, even if we calculated each individual expected outcome differently.
A mode on the other hand, is not even necessarily unique per estimate, so that's one reason that it may be less useful than an expected outcome. For example, every number from 1-6 is the "most likely" for a dice roll, but 3.5 is the (only) expected average outcome.
The rationale/research behind a 3 point estimate is that in many (most?) real-world scenarios, these numbers can be more accurately/intuitively estimated by people than a single expected value:
- A pessimistic outcome (P)
- An optimistic outcome (O)
- The most likely outcome (M)
However, to convert these three numbers into an expected value we need a probability distribution that interpolates all the other (potentially infinite) possible outcomes beyond the 3 we produced.
The fact that we're even doing a 3-point estimate presumes that we don't have enough historical data to simply lookup/calculate the expected value for what we're about to do, so we probably don't know what the actual probability distribution for what we're estimating is.
The idea behind the PERT estimates is that if we don't know the actual curve, we can plug some sane defaults into a Beta distribution (which is basically just a curve we can customise into many different shapes) and use those defaults for every problem we might face. Of course, if we know the real distribution, or have reason to believe that default Beta distribution prescribed by PERT is wrong for the problem at hand, we should NOT use the PERT equations for our project.
The Beta distribution has two parameters A
and B
that set the shape of the left and right hand side of the curve respectively. Conveniently, we can calculate the mode, mean and standard deviation of a Beta distribution simply by knowing the minimum/maximum values of the curve, as well as A
and B
.
PERT sets A
and B
to the following for every project/estimate:
If M > (O + P) / 2
then A = 3 + √2
and B = 3 - √2
, otherwise the values of A
and B
are swapped.
Now, it just so happens that if you make that specific assumption about the shape of your Beta distribution, the following formulas are exactly true:
Mean (expected value) = (O + 4M + P) / 6
Standard deviation = (O - P) / 6
So, in summary
- The PERT formulas are not based on a normal distribution, they are based on a Beta distribution with a very specific shape
- If your project's probability distribution matches the PERT Beta distribution then the PERT formula are exactly correct, they are not approximations
- It is pretty unlikely that the specific curve chosen for PERT matches any given arbitrary project, and so the PERT formulas will be an approximation in practise
- If you don't know anything about the probability distribution of your estimate, you may as well leverage PERT as it's documented, understood by many people and relatively easy to use
- If you know something about the probability distribution of your estimate that suggests something about PERT is inappropriate (like the 4x weighting towards the mode), then don't use it, use whatever you think is appropriate instead
- The reason why you multiply by 4 to get the Mean (and not 5, 6, 7, etc.) is because the number 4 is tied to the shape of the underlying probability curve
- Of course, PERT could have been based off a Beta distribution that yields 5, 6, 7 or any other number when calculating the Mean, or even a normal distribution, or a uniform distribution, or pretty much any other probability curve, but I'd suggest that the question of why they chose the curve they did is out of scope for this answer and possibly quite open ended/subjective anyway