The advice above about semantics still stands, but unfortunately you have almost no influence on how a screen reader actually announces the value.
It depends heavily on the language and the speech synthesizer used.
In my case, with JAWS 2019, the Eloquence synthesizer and French, "0:02" is read as "zéro heure deux" (literally "zero hours two"), while
"0:02:00" is read as "zéro heure deux zéro zéro" ("zero hours two zero zero"). VoiceOver on iOS 13.1.2 with the voice Audrey says the same.
The same JAWS 2019 with Eloquence in US English says "zero o two". As you can see, there are cases where the format isn't given any special interpretation.
However, I wouldn't say that changing to something more explicit like "0 hours 2 minutes", "0hrs2min" or "0h2m" is better.
Whether "hrs" is spoken as "hours" and "min" as "minutes" depends just as much on the language, synthesizer and user dictionary as the reading of "00:02" does.
So both can be OK in some situations and wrong in others.
Abbreviations like "min" for "minutes" can even create problems, because "min" can also be read as "minimum", for example. Remember that speech synthesizers don't attempt any semantic or contextual analysis.
The fully explicit "0 hours 2 minutes" isn't that much better either. I'm blind myself and I can tell you: sure, it's crystal clear, but in fact we are used to hearing "zero hours two" because "0:02" is written everywhere.
We are all used to these kinds of speech synthesizer quirks and basically we don't really mind.
In this particular case, we are smart enough to know, given the context, whether "zero o two" means two minutes or two seconds, and that "3hrs" or "3h" means 3 hours.
Nothing is perfect in all situations anyway, so stop scratching your head and take whatever format is the most appropriate for your application.
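If it helps, one way to make that choice cheap to change later is to build the displayed string in a single place. Here is a minimal sketch in TypeScript. The function name `formatDuration`, the `DurationStyle` labels and the exact wording are all names I made up for illustration, and pluralization ("1 hours") is deliberately ignored:

```typescript
// Hypothetical helper: names and style labels are assumptions for
// illustration, not part of any screen reader or library API.
type DurationStyle = "colon" | "short" | "long";

function formatDuration(totalMinutes: number, style: DurationStyle): string {
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;

  if (style === "colon") {
    // "0:02": minutes are zero-padded to two digits
    const padded = minutes < 10 ? "0" + minutes : String(minutes);
    return hours + ":" + padded;
  }
  if (style === "short") {
    // "0h2m": compact, but how it is spoken depends on the synthesizer
    return hours + "h" + minutes + "m";
  }
  // "0 hours 2 minutes": fully explicit (no singular/plural handling)
  return hours + " hours " + minutes + " minutes";
}
```

Calling `formatDuration(2, "colon")` gives "0:02", and switching the whole application to another style is then a one-word change. None of this changes how a given screen reader will pronounce the result.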