If you look at your browser's dom inspector you will see that erasing the content and adding the image leaves you with one element wrapping the image and caption, and therefore no-where else to set focus to continue typing. The quickinsert feature also fails to show as there are no block boundaries to trigger it.
Froala Editor offers a decent events API, and there is a workaround via the 'image.inserted' event which fires when an image element is added into the editor. The code below listens for this event and inserts a new para element immediately after Froala's wrapping elements around the image. When typed, your caption text is part of Froala's wrapper around the image, leaving this new para ready to accept focus and let you type into it.
$('#yourselector').on('froalaEditor.image.inserted', function (e, editor, $img, response) {
$img.after("<p>Type something here</p>"); // insert a new para or div here
});
Note the downside of the simple workaround is that you get extra content injected, but the benefit hopefully outweighs that for your use-case.
This is a plain JS solution which hopefully you can adapt for your environment.
I have previously failed to add JS snippets for Froala to SO so provide this codepen working example.