Short answer
WebVTT was built as an extension of SRT to add useful features that SRT lacks. In its most basic form, WebVTT looks nearly identical to SRT. However, WebVTT adds many optional ways to format the output and provide auxiliary information. The downside of the extra features is that WebVTT is harder for a player to support fully. Because SRT has very few features, most players support it. Also, SRT was developed by a relatively obscure group, whereas WebVTT is a W3C specification.
Long answer
WebVTT (Web Video Text Tracks) was originally called WebSRT. It was designed as an extension of SRT (SubRip). When none of WebVTT's additional features are used, a WebVTT file looks nearly identical to an SRT file. For example, a simple case in SRT might look like:
1
00:00:00,000 --> 00:00:02,000
The first cue text.

2
00:00:02,000 --> 00:00:04,000
The second cue text.
where the equivalent WebVTT might look like:
WEBVTT

00:00.000 --> 00:02.000
The first cue text.

00:02.000 --> 00:04.000
The second cue text.
WebVTT was created to add features that don't exist in SRT. For example, in WebVTT you can mark who the speaker of a cue is:
00:02.000 --> 00:04.000
<v Mary>Hi, I'm Mary!
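As a rough illustration of how a tool might read that speaker data, here is a minimal Python sketch. The function name `speakers` and the regular expression are my own assumptions for illustration, not part of any WebVTT library, and the pattern only covers the plain `<v Name>` and `<v.class Name>` forms.

```python
import re

# Matches a voice span start tag such as <v Mary> or <v.loud Mary>.
# The optional ".class" part and the captured name are the only pieces handled.
VOICE_TAG = re.compile(r"<v(?:\.[^\s>]+)?\s+([^>]+)>")


def speakers(payload: str) -> list[str]:
    """Return the speaker names found in a cue payload, in order of appearance."""
    return [name.strip() for name in VOICE_TAG.findall(payload)]


if __name__ == "__main__":
    print(speakers("<v Mary>Hi, I'm Mary!"))
```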
SRT doesn't officially include any font styling. Simple font styling is unofficially supported within SRT by many players, but since it's not officially part of the format, there are no guarantees. WebVTT does include font styling as part of the format. For simple font styling, the unofficial SRT and WebVTT again look similar:
<b>This text is bold.</b>
However, WebVTT offers many more kinds of font styling, and more ways of applying it, than SRT does. For example, WebVTT supports CSS style rules, much like the style sheets used with HTML. For the labeled speaker example above, you can style every cue spoken by a specific speaker in the file:
::cue(v[voice="Mary"]) { color: lime }
This can be useful when multiple speakers are talking over each other so that their speech can be differentiated in the subtitles.
As a quick sampling of other things WebVTT supports that SRT does not:
Positioning and sizing the cues within the viewport:
00:00:00.000 --> 00:00:04.000 position:10%,line-left align:left size:35%
I'm over here.
Ruby text for small characters above the normal ones (often used in East Asian languages to provide phonetic guides):
00:02.000 --> 00:04.000
<ruby>東京<rt>とうきょう</rt></ruby>
Cues subdivided in time:
00:00:00.000 --> 00:00:06.000
This <00:00:01.000>text <00:00:02.000>appears <00:00:03.000>over <00:00:04.000>5 <00:00:05.000>seconds.
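To make the timing of such a cue concrete, the following Python sketch splits a cue payload at its inline timestamps and reports when each piece of text becomes active. The helper names `to_seconds` and `timed_segments` are hypothetical, and the sketch assumes well-formed timestamps; it is not a full WebVTT parser.

```python
import re

# Matches an inline WebVTT timestamp tag such as <00:00:01.000>.
INLINE_TIME = re.compile(r"<((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})>")


def to_seconds(stamp: str) -> float:
    """Convert 'hh:mm:ss.ttt' or 'mm:ss.ttt' to seconds."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def timed_segments(cue_start: str, payload: str) -> list[tuple[float, str]]:
    """Return (reveal_time_in_seconds, text) pairs for a cue payload."""
    # re.split with one capturing group yields: text, stamp, text, stamp, ...
    pieces = INLINE_TIME.split(payload)
    # Text before the first inline timestamp is active from the cue start.
    segments = [(to_seconds(cue_start), pieces[0])]
    for stamp, text in zip(pieces[1::2], pieces[2::2]):
        segments.append((to_seconds(stamp), text))
    return segments


if __name__ == "__main__":
    payload = ("This <00:00:01.000>text <00:00:02.000>appears "
               "<00:00:03.000>over <00:00:04.000>5 <00:00:05.000>seconds.")
    for when, text in timed_segments("00:00:00.000", payload):
        print(when, repr(text))
```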
The advantage of SRT is that it's so simple that more players support it. Luckily, it's fairly easy to translate SRT to WebVTT and vice versa. Since WebVTT is basically a superset of SRT, converting from SRT to WebVTT only requires minor syntax changes: add the WEBVTT header line and use a period instead of a comma before the milliseconds in each timestamp. The numeric cue counters can be kept or dropped, since cue identifiers are optional in WebVTT. To go from WebVTT to SRT, you strip out the tags and settings for the extra features, then perform the reverse syntax changes. Of course, in this direction, you lose all the extra features WebVTT provided.
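The conversion in both directions can be sketched in a few lines of Python. This is a minimal sketch under some assumptions: cues are separated by blank lines, the function names `srt_to_vtt` and `vtt_to_srt` are made up for illustration, and the WebVTT-to-SRT direction strips every inline tag (including simple ones such as `<b>` that many SRT players would accept) and does not decode character references.

```python
import re

# Matches an SRT timestamp such as 00:00:02,000 (comma before milliseconds).
SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")
# Matches a WebVTT timestamp, with or without the optional hours field.
VTT_TIME = re.compile(r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})")
# Matches WebVTT inline tags such as <v Mary>, </v>, <b>, or <00:00:01.000>.
VTT_TAG = re.compile(r"<[^>]*>")


def srt_to_vtt(srt: str) -> str:
    """Convert simple SRT text to WebVTT (styling is left untouched)."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        # Drop the numeric cue counter; WebVTT does not require it.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # Only the timing line has its decimal separator changed.
        lines[0] = SRT_TIME.sub(r"\1.\2", lines[0])
        cues.append("\n".join(lines))
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


def _srt_timestamp(match: re.Match) -> str:
    # SRT always writes the hours field, so fill it in when WebVTT omits it.
    hours = match.group(1) or "00"
    return f"{hours}:{match.group(2)}:{match.group(3)},{match.group(4)}"


def vtt_to_srt(vtt: str) -> str:
    """Convert WebVTT text to SRT, discarding WebVTT-only features."""
    cues = []
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        # Blocks without a timing line (header, STYLE, NOTE) are skipped.
        idx = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if idx is None:
            continue
        # Keep only "start --> end", dropping cue settings such as size:35%.
        start, _, rest = lines[idx].partition("-->")
        end = rest.split()[0]
        timing = (f"{VTT_TIME.sub(_srt_timestamp, start.strip())} --> "
                  f"{VTT_TIME.sub(_srt_timestamp, end)}")
        text = [VTT_TAG.sub("", line) for line in lines[idx + 1:]]
        cues.append(f"{len(cues) + 1}\n{timing}\n" + "\n".join(text))
    return "\n\n".join(cues) + "\n"


if __name__ == "__main__":
    sample = "WEBVTT\n\n00:02.000 --> 00:04.000 align:left\n<v Mary>Hi, I'm Mary!\n"
    print(vtt_to_srt(sample))
```

For real files, a dedicated subtitle library or tool is likely a safer choice, since this sketch ignores edge cases such as cue text that consists only of digits.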
As noted above, SRT was developed by a relatively obscure group, whereas WebVTT is a W3C specification. As noted in the original question, it is more or less the "official" caption/subtitle format for HTML5.