I'm not aware of an existing tool to do this, but I would suggest:
First, measure the stand-alone cost of including every header by itself. Make a list of all headers, and for each header, preprocess it. The simplest measure of the cost of that header is the number of lines that result from preprocessing. A possibly more accurate measure would be to count the occurrences of 'template', as processing template definitions seems to dominate compilation time in my experience. You could also count occurrences of 'inline', as I've seen large numbers of inline functions defined in headers be an issue too (but be aware that inline definitions of class methods don't necessarily use the keyword).
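As a rough sketch of that first step, the Python below computes the three counts from preprocessed text. The function names (preprocess, header_cost) are made up for illustration, and the preprocess helper assumes a GCC- or Clang-style driver that accepts -E and -x c++; adjust it for your toolchain. Skipping blank lines and # lines when counting is my own choice, so the totals reflect code rather than preprocessor bookkeeping.

```python
import re
import subprocess

def preprocess(header_path, compiler="g++", extra_args=()):
    """Run only the preprocessor on one file and return its output as text."""
    # Assumption: a GCC/Clang-style driver. -E stops after preprocessing and
    # -x c++ forces C++ mode even for files named *.h.
    cmd = [compiler, "-E", "-x", "c++", *extra_args, header_path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def header_cost(preprocessed_text):
    """Return simple cost metrics for already-preprocessed text."""
    # Ignore blank lines and the '#' marker lines the preprocessor emits.
    lines = [ln for ln in preprocessed_text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    return {
        "lines": len(lines),
        "template": sum(len(re.findall(r"\btemplate\b", ln)) for ln in lines),
        "inline": sum(len(re.findall(r"\binline\b", ln)) for ln in lines),
    }
```

Typical use would be something like header_cost(preprocess("foo.h", extra_args=["-Iinclude"])), where foo.h and the include path are placeholders for your own project.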
Next, measure the number of translation units (TUs) that include that header. For each main file of a TU (e.g., .cpp file), preprocess that file and gather the set of distinct headers that appear in the output (in the lines beginning with #, which name each file the preprocessor entered). Afterward, invert that to get a map from header to number of TUs that use it.
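A possible way to do the gathering and the inversion, assuming the preprocessed output of each TU is already available as a string. The names headers_in and tu_counts are hypothetical, and the regular expression assumes marker lines of the form # 12 "file.h" (as GCC and Clang emit) or #line 12 "file.h"; pseudo-files such as "<built-in>" are skipped.

```python
import re
from collections import defaultdict

# Assumed marker format: '# <number> "<file>" ...' or '#line <number> "<file>"'.
LINEMARKER = re.compile(r'^#\s*(?:line\s+)?\d+\s+"([^"]+)"')

def headers_in(preprocessed_text, main_file):
    """Return the set of distinct files named in marker lines, minus the TU itself."""
    found = set()
    for ln in preprocessed_text.splitlines():
        m = LINEMARKER.match(ln)
        if not m:
            continue
        name = m.group(1)
        # Skip the main file and pseudo-files like <built-in> or <command-line>.
        if name != main_file and not name.startswith("<"):
            found.add(name)
    return found

def tu_counts(headers_by_tu):
    """Invert {tu: set of headers} into {header: number of TUs that include it}."""
    counts = defaultdict(int)
    for headers in headers_by_tu.values():
        for h in headers:
            counts[h] += 1
    return dict(counts)
```

Note that paths are compared as plain strings here, so the same header reached through two different relative paths would be counted twice unless you normalize the names first.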
Finally, for each header, multiply its stand-alone cost by the number of TUs that include it. This is a measure of the cumulative effect of this header on total compilation time. Sort that list and go through it in descending order, moving private implementation details into the associated implementation file and trimming the public header accordingly.
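The multiplication and sort are short enough to show directly. This is only a sketch: rank_headers is a made-up name, and it assumes the two earlier steps produced one dictionary of per-header cost and one of per-header TU counts.

```python
def rank_headers(cost, tu_count):
    """Return (header, cost * TU count) pairs, most expensive first."""
    # A header that no TU includes gets a score of zero.
    scores = {h: cost[h] * tu_count.get(h, 0) for h in cost}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative numbers only: a big header used by few TUs versus a small
# header used almost everywhere.
example = rank_headers({"big.h": 12000, "small.h": 300},
                       {"big.h": 40, "small.h": 400})
```

With those made-up numbers, big.h scores 480000 and small.h scores 120000, so big.h would be the first candidate for trimming even though far fewer TUs include it.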
Now, the main issue with this or any such approach to measuring the benefit of private implementations is that you probably won't see much change at first. In the absence of engineering discipline to prevent it, there will usually be many headers that include many others, with lots of overlap. Consequently, after you optimize one heavily-used header, some other heavily-used header that includes almost as much will keep compilation times high. But once you break through the critical mass of commonly used headers that have many dependencies, optimizing most or all of them, compilation times should start to drop dramatically.
One way to focus the effort, so it's not so "pie in the sky", is to begin by selecting the single TU that takes the most time to compile, and work on optimizing only the headers that it depends on. Once you've significantly reduced the time for that TU, look again at the big picture. And if you can't significantly improve that one TU's compilation time through the private implementation technique, then that suggests you need to consider other approaches for that code base.
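To make that concrete, here is a minimal sketch of picking the slowest TU and listing its headers by stand-alone cost. Both function names are hypothetical. time_compile simply measures wall-clock time for whatever compile command you pass it, and focus_on_slowest assumes you already have per-TU timings, the per-TU header sets, and the per-header costs from the earlier steps.

```python
import subprocess
import time

def time_compile(cmd):
    """Return wall-clock seconds taken by one compile command (an argv list)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

def focus_on_slowest(seconds_by_tu, headers_by_tu, cost):
    """Return the slowest TU and its headers, most expensive header first."""
    slowest = max(seconds_by_tu, key=seconds_by_tu.get)
    ranked = sorted(headers_by_tu[slowest],
                    key=lambda h: cost.get(h, 0), reverse=True)
    return slowest, ranked
```

A single timing run can be noisy, so it is probably worth timing each TU a few times and taking the minimum or median before trusting the ordering.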