How to convert subtitle file to have only one sentence per subtitle?

I am trying to program a method to convert subtitle files, such that there is always just one sentence per subtitle.

My idea is the following:

  1. For each subtitle:

     1.1 Get the subtitle duration.

     1.2 Calculate the characters_per_second.

     1.3 Use this to store (inside dict_times_word_subtitle) the time it takes to speak word i.

  2. Extract the sentences from the entire text.

  3. For each sentence:

     3.1 Store (inside dict_times_sentences_subtitle) the time it takes to speak the sentence, computed from the durations of its individual words.

  4. Create a new srt file (subtitle file) which starts at the same time as the original srt file; the subtitle timings are then derived from the durations it takes to speak the sentences.

For now, I have written the following code:

#---------------------------------------------------------
import pysrt
import re
from datetime import datetime, date, time, timedelta
#---------------------------------------------------------

def convert_subtitle_one_sentence(file_name):
    
    sub = pysrt.open(file_name)   

    ### ----------------------------------------------------------------------
    ### Store Each Word and the Average Time it Takes to Say it in a dictionary
    ### ----------------------------------------------------------------------

    dict_times_word_subtitle = {}
    running_variable = 0
    for i in range(len(sub)):

        subtitle_text = sub[i].text
        subtitle_duration = (datetime.combine(date.min, sub[i].duration.to_time()) - datetime.min).total_seconds()

        # Compute characters per second
        characters_per_second = len(subtitle_text)/subtitle_duration

        # Store Each Word and the Average Time (seconds) it Takes to Say in a Dictionary 
        
        for j,word in enumerate(subtitle_text.split()):
            if j == len(subtitle_text.split())-1:
                time = len(word)/characters_per_second
            else:
                time = len(word+" ")/characters_per_second

            dict_times_word_subtitle[str(running_variable)] = [word, time]
            running_variable += 1

            
    ### ----------------------------------------------------------------------
    ### Store Each Sentence and the Average Time to Say it in a Dictionary
    ### ----------------------------------------------------------------------  

    total_number_of_words = len(dict_times_word_subtitle.keys())

    # Get the entire text
    entire_text = ""
    for i in range(total_number_of_words):
        entire_text += dict_times_word_subtitle[str(i)][0] +" "


    # Initialize the dictionary 
    dict_times_sentences_subtitle = {}

    # Loop through all found sentences 
    last_number_of_words = 0
    for i,sentence in enumerate(re.findall(r'([A-Z][^\.!?]*[\.!?])', entire_text)):

        number_of_words = len(sentence.split())

        # Compute the time it takes to speak the sentence
        time_sentence = 0
        for j in range(last_number_of_words, last_number_of_words + number_of_words):
            time_sentence += dict_times_word_subtitle[str(j)][1] 

        # Store the sentence together with the time it takes to say the sentence
        dict_times_sentences_subtitle[str(i)] = [sentence, round(time_sentence,3)]

        ## Update last number_of_words
        last_number_of_words += number_of_words

    # Check if there is a non-sentence remaining at the end
    if j < total_number_of_words:
        remaining_string = ""
        remaining_string_time = 0
        for k in range(j+1, total_number_of_words):
            remaining_string += dict_times_word_subtitle[str(k)][0] + " "
            remaining_string_time += dict_times_word_subtitle[str(k)][1]

        dict_times_sentences_subtitle[str(i+1)] = [remaining_string, remaining_string_time]

    ### ----------------------------------------------------------------------
    ### Create a new Subtitle file with only 1 sentence at a time
    ### ----------------------------------------------------------------------  

    # Initialize new srt file
    new_srt = pysrt.SubRipFile()

    # Loop through all sentences
    # get initial start time (seconds)
    # https://mcmap.net/q/467647/-convert-datetime-time-to-seconds
    start_time = (datetime.combine(date.min, sub[0].start.to_time()) - datetime.min).total_seconds()

    for i in range(len(dict_times_sentences_subtitle.keys())):


        sentence = dict_times_sentences_subtitle[str(i)][0]
        print(sentence)
        time_sentence = dict_times_sentences_subtitle[str(i)][1]
        print(time_sentence)
        item = pysrt.SubRipItem(
                        index=i,
                        start=pysrt.SubRipTime(seconds=start_time),
                        end=pysrt.SubRipTime(seconds=start_time+time_sentence),
                        text=sentence)

        new_srt.append(item)

        ## Update Start Time
        start_time += time_sentence

    new_srt.save(file_name)

The issue:

There are no error messages, but when I apply this to real subtitle files and then watch the video, the subtitles start out correctly aligned; as the video progresses, however, the error accumulates and the subtitles drift further and further from what is actually being said.

Example: The speaker has finished his talk, but the subtitles keep appearing.


Simple example to test

srt = """
1
00:00:13,100 --> 00:00:14,750
Dr. Martin Luther King, Jr.,

2
00:00:14,750 --> 00:00:18,636
in a 1968 speech where he reflects
upon the Civil Rights Movement,

3
00:00:18,636 --> 00:00:21,330
states, "In the end,

4
00:00:21,330 --> 00:00:24,413
we will remember not the words of our enemies

5
00:00:24,413 --> 00:00:27,280
but the silence of our friends."

6
00:00:27,280 --> 00:00:29,800
As a teacher, I've internalized this message.

"""

with open('test.srt', "w") as file:
    file.write(srt)
    
    
convert_subtitle_one_sentence("test.srt")

The output looks like this (yes, there is still some work to do on the sentence recognition part, e.g. handling "Dr."):

0
00:00:13,100 --> 00:00:13,336
Dr.

1
00:00:13,336 --> 00:00:14,750
Martin Luther King, Jr.

2
00:00:14,750 --> 00:00:23,514
Civil Rights Movement, states, "In the end, we will remember not the words of our enemies but the silence of our friends.

3
00:00:23,514 --> 00:00:26,175
As a teacher, I've internalized this message.

4
00:00:26,175 --> 00:00:29,859
our friends." As a teacher, I've internalized this message.

As you can see, the original last time stamp is 00:00:29,800 whereas in the output file it is 00:00:29,859. This might not seem like much at first, but the longer the video gets, the larger the difference becomes.
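
To quantify the drift, one can simply compare the last end times of the two files, for example (a minimal sketch; the file names are only examples and assume a copy of the original was saved before converting):

import pysrt

original = pysrt.open("test_original.srt")   # hypothetical copy kept before conversion
converted = pysrt.open("test.srt")           # file overwritten by convert_subtitle_one_sentence
print("original last end: ", original[-1].end)   # e.g. 00:00:29,800
print("converted last end:", converted[-1].end)  # e.g. 00:00:29,859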

The full sample video can be downloaded here: https://ufile.io/19nuvqb3

The full subtitle file: https://ufile.io/qracb7ai

Attention: The subtitle file will be overwritten, so you might want to store a copy under another name to be able to compare.

How it could be fixed:

The exact timing of the words that start or end an original subtitle is known. This could be used to cross-check and adjust the computed timings accordingly.

Edit

Here is code to create a dictionary which stores each character, its character_duration (averaged over the subtitle), and the original start or end time stamp, if one exists for that character.

sub = pysrt.open('video.srt')

running_variable = 0
dict_subtitle = {}

for i in range(len(sub)):

    # Extract Start Time Stamp
    timestamp_start = sub[i].start

    # Extract Text
    text = sub[i].text

    # Extract End Time Stamp
    timestamp_end = sub[i].end

    # Extract Characters per Second
    characters_per_second = sub[i].characters_per_second

    # Fill Dictionary
    # (normalize whitespace once, so the last-character test uses the same string
    # that is being iterated over; a character's duration is its length divided by cps)
    normalized_text = " ".join(text.split())
    for j, character in enumerate(normalized_text):
        character_duration = len(character)/characters_per_second
        dict_subtitle[str(running_variable)] = [character, character_duration, False, False]
        if j == 0: dict_subtitle[str(running_variable)] = [character, character_duration, timestamp_start, False]
        if j == len(normalized_text)-1: dict_subtitle[str(running_variable)] = [character, character_duration, False, timestamp_end]
        running_variable += 1
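
For illustration, here is a minimal sketch (the helper name and the char_indices argument are hypothetical, not part of the code above) of how that dictionary could be used to snap an estimated sentence end time back to a known original time stamp instead of letting the estimates accumulate:

def anchored_end_seconds(dict_subtitle, char_indices, estimated_end):
    # char_indices: the dictionary keys (as integers) of the characters that make up
    # one sentence, in order. If any of them carries an original end time stamp,
    # use the last such time stamp; otherwise fall back to the accumulated estimate.
    for idx in reversed(char_indices):
        timestamp_end = dict_subtitle[str(idx)][3]
        if timestamp_end:  # a pysrt.SubRipTime rather than False
            return timestamp_end.ordinal / 1000.0  # .ordinal is the time in milliseconds
    return estimated_end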

More videos to try

Here you may download more videos and their respective subtitle files: https://filebin.net/kwygjffdlfi62pjs

Edit 3

4
00:00:18,856 --> 00:00:25,904
Je rappelle la définition de ce qu'est un produit scalaire, <i>dot product</i> dans <i>Ⅎ</i>.

5
00:00:24,855 --> 00:00:30,431
Donc je prends deux vecteurs dans <i>Ⅎ</i> et je définis cette opération-là, linéaire, <i>u 
Ukase answered 17/5, 2019 at 14:54 Comment(4)
@tobias_k I reposted my question with a simple example. Please have a look. Thanks.Ukase
See my new edited answer. Hopefully it's the last version.Jaquelynjaquenetta
I've added to my first answer, you may find it useful, even if only to hack it into the accepted answer. It does help with that French maths subtitle file you gave as an example.Jaquelynjaquenetta
@Ukase : remember to also artificially trim off the end-times (by adding another dummy entry in between) if there are VERY long gap(s) between sentences, otherwise the subtitles would awkwardly remain on display during the whole non-verbal durationLucubration

I have re-coded to rely on the pysrt package, as requested, and a smidgen of re.
The idea is to build a dictionary based on start_times.

If the start time already exists in the dictionary, the new text is added to the entry for that time and its end_time is updated at the same time, so the end time advances with the text.

If no start time exists, it is simply a new dictionary entry.

The start time is only advanced once we know that a sentence has been completed.

So in essence, we start to build a sentence with a fixed start time. The sentence continues to be built, by adding more text and updating the end time, until the sentence finishes. Here we advance the start time using the current record, which we know to be a new sentence.

Sub-title entries with multiple sentences are broken up, with start and end times calculated using the pysrt characters_per_second value for the entire sub-title entry, before it was broken up.

Finally, a new sub-title file is written to disk from the entries in the dictionary.

Obviously, with only a single file to play with, I may well be missing some sub-title layout humps in the road, but at least it gives you a working starting point.

The code is commented throughout, so most things should be clear, as to how and why.

Edit: I have refined the checking for existing dictionary start times and altered the method used to decide whether a sentence has ended, i.e. the full stops are put back into the text after splitting.
The second video you mentioned has sub-titles that are slightly off to begin with; notice that there are no millisecond values at all.

The following code does a fair job on the second video and a good job on the first.

Edit 2: Added handling of contiguous full stops and removal of html <> tags.

Edit 3: It turns out that pysrt removes the html tags from the calculation for characters per second. I have now done so as well, which means that the <html> formatting can be retained within the sub-titles.

Edit 4: This version copes with full stops in mathematical and chemical formulae, IP numbers, etc., basically places where a full stop doesn't mark the end of a sentence. It also allows for sentences which end in ? and !

import pysrt
import re

abbreviations = ['Dr.','Mr.','Mrs.','Ms.','etc.','Jr.','e.g.'] # You get the idea!
abbrev_replace = ['Dr','Mr','Mrs','Ms','etc','Jr','eg']
subs = pysrt.open('new.srt')
subs_dict = {}          # Dictionary to accumulate new sub-titles (start_time:[end_time,sentence])
start_sentence = True   # Toggle this at the start and end of sentences

# regex to remove html tags from the character count
tags = re.compile(r'<.*?>')

# regex to split on ".", "?" or "!" ONLY if it is preceded by something else
# which is not a digit and is not a space. (Not perfect but close enough)
# Note: ? and ! can be an issue in some languages (e.g. French) where both ? and !
# are traditionally preceded by a space (" !" rather than "!").
end_of_sentence = re.compile(r'([^\s\0-9][\.\?\!])')

# End of sentence characters
eos_chars = set([".","?","!"])

for sub in subs:
    if start_sentence:
        start_time = sub.start
        start_sentence = False
    text = sub.text

    #Remove multiple full-stops e.g. "and ....."
    text = re.sub(r'\.+', '.', text)

    # Optional
    for idx, abr in enumerate(abbreviations):
        if abr in text:
            text = text.replace(abr,abbrev_replace[idx])
    # A test could also be made for initials in names i.e. John E. Rotten - showing my age there ;)

    multi = re.split(end_of_sentence,text.strip())
    cps = sub.characters_per_second

    # Test for a sub-title with multiple sentences
    if len(multi) > 1:
        # regex end_of_sentence breaks sentence start and sentence end into 2 parts
        # we need to put them back together again.
        # hence the odd range because the joined end part is then deleted
        for cnt in range(divmod(len(multi),2)[0]): # e.g. len=3 give 0 | 5 gives 0,1  | 7 gives 0,1,2
            multi[cnt] = multi[cnt] + multi[cnt+1]
            del multi[cnt+1]

        for part in multi:
            if not part: # Skip blank parts
                continue
            # Convert start time to seconds
            h,m,s,milli = re.split(':|,',str(start_time))
            s_time = (3600*int(h))+(60*int(m))+int(s)+(int(milli)/1000)

            # test for existing data
            try:
                existing_data = subs_dict[str(start_time)]
                end_time = str(existing_data[0])
                h,m,s,milli = re.split(':|,',str(existing_data[0]))
                e_time = (3600*int(h))+(60*int(m))+int(s)+(int(milli)/1000)
            except:
                existing_data = []
                e_time = s_time

            # End time is the start time or existing end time + the time taken to say the current words
            # based on the calculated number of characters per second
            # use regex "tags" to remove any html tags from the character count.

            e_time = e_time + len(tags.sub('',part)) / cps

            # Convert start to a timestamp
            s,milli = divmod(s_time,1)
            m,s = divmod(int(s),60)
            h,m = divmod(m,60)
            start_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))

            # Convert end to a timestamp
            s,milli = divmod(e_time,1)
            m,s = divmod(int(s),60)
            h,m = divmod(m,60)
            end_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))

            # if text already exists add the current text to the existing text
            # if not use the current text to write/rewrite the dictionary entry
            if existing_data:
                new_text = existing_data[1] + " " + part
            else:
                new_text = part
            subs_dict[str(start_time)] = [end_time,new_text]

            # if sentence ends re-set the current start time to the end time just calculated
            if any(x in eos_chars for x in part):
                start_sentence = True
                start_time = end_time
                print ("Split",start_time,"-->",end_time,)
                print (new_text)
                print('\n')
            else:
                start_sentence = False

    else:   # This is Not a multi-part sub-title

        end_time = str(sub.end)

        # Check for an existing dictionary entry for this start time
        try:
            existing_data = subs_dict[str(start_time)]
        except:
            existing_data = []

        # if it already exists add the current text to the existing text
        # if not use the current text
        if existing_data:
            new_text = existing_data[1] + " " + text
        else:
            new_text = text
        # Create or Update the dictionary entry for this start time
        # with the updated text and the current end time
        subs_dict[str(start_time)] = [end_time,new_text]

        if any(x in eos_chars for x in text):
            start_sentence = True
            print ("Single",start_time,"-->",end_time,)
            print (new_text)
            print('\n')
        else:
            start_sentence = False

# Generate the new sub-title file from the dictionary
idx=0
outfile = open('video_new.srt','w')
for key, text in subs_dict.items():
    idx+=1
    outfile.write(str(idx)+"\n")
    outfile.write(key+" --> "+text[0]+"\n")
    outfile.write(text[1]+"\n\n")
outfile.close()
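
To make the split-and-rejoin step easier to follow, here is a small standalone demonstration (a sketch, separate from the script above) of what the end_of_sentence regex and the rejoin loop do to a sample line:

import re

end_of_sentence = re.compile(r'([^\s\0-9][\.\?\!])')
sample = "But one year, I gave up speaking. I figured the most valuable thing"
multi = re.split(end_of_sentence, sample)
# multi == ['But one year, I gave up speakin', 'g.', ' I figured the most valuable thing']

# Re-join each captured two-character ending with the text that preceded it,
# exactly as the loop above does:
for cnt in range(divmod(len(multi), 2)[0]):
    multi[cnt] = multi[cnt] + multi[cnt + 1]
    del multi[cnt + 1]
# multi == ['But one year, I gave up speaking.', ' I figured the most valuable thing']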

The output after passing your video.srt file through the main script above is as follows:

1
00:00:13,100 --> 00:00:27,280
Dr Martin Luther King, Jr, in a 1968 speech where he reflects
upon the Civil Rights Movement, states, "In the end, we will remember not the words of our enemies but the silence of our friends."

2
00:00:27,280 --> 00:00:29,800
As a teacher, I've internalized this message.

3
00:00:29,800 --> 00:00:39,701
Every day, all around us, we see the consequences of silence manifest themselves in the form of discrimination, violence, genocide and war.

4
00:00:39,701 --> 00:00:46,178
In the classroom, I challenge my students to explore the silences in their own lives through poetry.

5
00:00:46,178 --> 00:00:54,740
We work together to fill those spaces, to recognize them, to name them, to understand that they don't
have to be sources of shame.

6
00:00:54,740 --> 00:01:14,408
In an effort to create a culture within my classroom where students feel safe sharing the intimacies of their own silences, I have four core principles posted on the board that sits in the front of my class, which every student signs
at the beginning of the year: read critically, write consciously, speak clearly, tell your truth.

7
00:01:14,408 --> 00:01:18,871
And I find myself thinking a lot about that last point, tell your truth.

8
00:01:18,871 --> 00:01:28,848
And I realized that if I was going to ask my students to speak up, I was going to have to tell my truth and be honest with them about the times where I failed to do so.

9
00:01:28,848 --> 00:01:44,479
So I tell them that growing up, as a kid in a Catholic family in New Orleans, during Lent I was always taught that the most meaningful thing one could do was to give something up, sacrifice something you typically indulge in to prove to God you understand his sanctity.

10
00:01:44,479 --> 00:01:50,183
I've given up soda, McDonald's, French fries, French kisses, and everything in between.

11
00:01:50,183 --> 00:01:54,071
But one year, I gave up speaking.

12
00:01:54,071 --> 00:02:03,286
I figured the most valuable thing I could sacrifice was my own voice, but it was like I hadn't realized that I had given that up a long time ago.

13
00:02:03,286 --> 00:02:23,167
I spent so much of my life telling people the things they wanted to hear instead of the things they needed to, told myself I wasn't meant to be anyone's conscience because I still had to figure out being my own, so sometimes I just wouldn't say anything, appeasing ignorance with my silence, unaware that validation doesn't need words to endorse its existence.

14
00:02:23,167 --> 00:02:29,000
When Christian was beat up for being gay, I put my hands in my pocket and walked with my head
down as if I didn't even notice.

15
00:02:29,000 --> 00:02:39,502
I couldn't use my locker for weeks
because the bolt on the lock reminded me of the one I had put on my lips when the homeless man on the corner looked at me with eyes up merely searching for an affirmation that he was worth seeing.

16
00:02:39,502 --> 00:02:43,170
I was more concerned with
touching the screen on my Apple than actually feeding him one.

17
00:02:43,170 --> 00:02:46,049
When the woman at the fundraising gala said "I'm so proud of you.

18
00:02:46,049 --> 00:02:53,699
It must be so hard teaching
those poor, unintelligent kids," I bit my lip, because apparently
we needed her money more than my students needed their dignity.

19
00:02:53,699 --> 00:03:02,878
We spend so much time listening to the things people are saying that we rarely pay attention to the things they don't.

20
00:03:02,878 --> 00:03:06,139
Silence is the residue of fear.

21
00:03:06,139 --> 00:03:09,615
It is feeling your flaws gut-wrench guillotine your tongue.

22
00:03:09,615 --> 00:03:13,429
It is the air retreating from your chest because it doesn't feel safe in your lungs.

23
00:03:13,429 --> 00:03:15,186
Silence is Rwandan genocide.

24
00:03:15,186 --> 00:03:16,423
 Silence is Katrina.

25
00:03:16,553 --> 00:03:19,661
It is what you hear when there
aren't enough body bags left.

26
00:03:19,661 --> 00:03:22,062
It is the sound after the noose is already tied.

27
00:03:22,062 --> 00:03:22,870
It is charring.

28
00:03:22,870 --> 00:03:23,620
 It is chains.

29
00:03:23,620 --> 00:03:24,543
 It is privilege.

30
00:03:24,543 --> 00:03:25,178
 It is pain.

31
00:03:25,409 --> 00:03:28,897
There is no time to pick your battles when your battles have already picked you.

32
00:03:28,897 --> 00:03:31,960
I will not let silence wrap itself around my indecision.

33
00:03:31,960 --> 00:03:36,287
I will tell Christian that he is a lion, a sanctuary of bravery and brilliance.

34
00:03:36,287 --> 00:03:42,340
I will ask that homeless man what his name is and how his day was, because sometimes all people want to be is human.

35
00:03:42,340 --> 00:03:51,665
I will tell that woman that my students can talk about transcendentalism like their last name was Thoreau, and just because you watched
one episode of "The Wire" doesn't mean you know anything about my kids.

36
00:03:51,665 --> 00:04:03,825
So this year, instead of giving something up, I will live every day as if there were a microphone tucked under my tongue, a stage on the underside of my inhibition.

37
00:04:03,825 --> 00:04:10,207
Because who has to have a soapbox when all you've ever needed is your voice?

38
00:04:10,207 --> 00:04:12,712
Thank you.

39
00:04:12,712 --> 00:00:00,000
(Applause)
Jaquelynjaquenetta answered 20/5, 2019 at 16:36 Comment(15)
Thank you very much for your answer! I just tested it on this video (filebin.net/kwygjffdlfi62pjs), but it seems to show the subtitles much too fast. I would be pleased if you could try it. Thanks! :)Ukase
@Ukase Well that was an unpleasant surprise. The sentence structure was all over the place. Hopefully the amended code does a better job. See my edited answer.Jaquelynjaquenetta
Thanks a lot for your effort! It's really much better now! I noticed just a tiny thing while I was testing subtitles in different languages: for some videos I get overlapping subtitles (please have a look at Edit 3). You can download the subtitle file associated with that result here: filebin.net/e8rzcu26m334vl3sUkase
@Ukase I have no idea whether that file is pre or post processing. If it's pre-processing it's garbage, if it's post processing it could be the french characters. To check, I'd need the original sub-title file.Jaquelynjaquenetta
Thanks! The file extract shown in Edit 3 is post-processing. You can download the original subtitle file (01.srt) here: filebin.net/e8rzcu26m334vl3sUkase
Are we talking about the same file? Can you tell me what is "utter rubbish" to you? Sure, there are some weird-looking HTML tags, but nothing too strange. Just to be sure, this is the file that I am talking about: filebin.net/e8rzcu26m334vl3s/01.srt?t=7smy9ql9Ukase
@Ukase Take that rubbish out and it's fine. I don't believe that it should be there. pysrt calculates cps for what it sees but the following time stamp was for a real time, not including that garbage.Jaquelynjaquenetta
Sorry, but can you please tell me what you mean by "that garbage"? I cannot take it out otherwise. Thank you very much.Ukase
Okay, I have found a really nasty subtitle file: ufile.io/jm1bs16v Do you see any way to deal with those annoying HTML tags? I need them in the final file, so deleting them is unfortunately not an option. Thank you so much for all your kind help!Ukase
@Ukase I do see a way! Remove them! It is just attempting to apply 2 hues of grey as the colour for the sub-titles. It is in no way part of the sub-title text itself. The real problem with that file, is that it does NOT have a single FULL_STOP in it.Jaquelynjaquenetta
Great answer !! Thank you so much for all your work !Ukase
this worked well for me at first, but now I'm running into incorrect subtitle timing issues. The outputted subs have timings that are wholly new and not just a subset of the input timings. I think the core problem is that this script, and pysrt, does a LOT of manual time munging. Like this line s_time = (3600*int(h))+(60*int(m))+int(s)+(int(milli)/1000). I'm re-writing this now to use datetime.timedelta objects which hopefully will make the timings more accurate.Durmast
@Durmast I trust you'll post the amended code and results in an answer.Jaquelynjaquenetta
@RolfofSaxony I gave up building on this script because pysrt represents time in that janky way instead of with datetimes. (though this other python srt lib does: github.com/cdown/srt ) I ended up modifying an mpv lua script that does sentence conversions, tweaked it to work as a general utility, and made other improvements. The downside is that it's lua...so not a valid answer to this question. But I've been very happy with it so far. Here's my script: gist.github.com/varenc/1b117487f78836aa6a25c74cae4fbbedDurmast
@Durmast For posted questions, we have to work with what we're given. If freed from such constraints, the world is your oyster! Nice to see that you published the code, for others to use or take inspiration from. :)Jaquelynjaquenetta

It may not be what you are after, but rather than calculate the times, why not take them directly from the subtitle file itself?
I mocked this up as an example. It isn't perfect by a long shot, but it may help.

import re

#Pre-process file to remove blank lines, line numbers and timestamp --> chars
with open('video.srt','r') as f:
    lines = f.readlines()
with open('video.tmp','w') as f:
    for line in lines:
        line = line.strip()
        if line.strip():
            if line.strip().isnumeric():
                continue
            else:
                line = line.replace(' --> ', ' ')
                line = line+" "
                f.write(line)

# Process pre-processed file
with open('video.tmp','r') as f:
    lines = f.readlines()

outfile = open('new_video.srt','w')
idx = 0

# Define the regex options we will need

#regex to look for the time stamps in each sentence using the first and last only
timestamps = re.compile(r'\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}')

#regex to remove html tags from length calculations
tags = re.compile(r'<.*?>')

#re.split('([^\s\0-9]\.)',a)
# This is to cope with text that contains mathematical, chemical formulae, ip addresses etc
# where "." does not mean full-stop (end of sentence)
# This is used to split on a "." only if it is NOT preceded by space or a number
# this should catch most things but will fail to split the sentence if it genuinely
# ends with a number followed by a full-stop.
end_of_sentence = re.compile(r'([^\s\0-9]\.)')

#sentences = str(lines).split('.')
sentences = re.split(end_of_sentence,str(lines))

# Because the sentences were split on "x." we now have to add that back,
# so we concatenate every other list item with the previous one.
idx = 0
joined =[]
while idx < (len(sentences) -1) :
    joined.append(sentences[idx]+sentences[idx+1])
    idx += 2
sentences = joined
idx = 0  # Reset the counter used to number the new subtitle entries

previous_timings =["00:00:00,000","00:00:00,000"]
previous_sentence = ""

#Dictionary of timestamps that will require post-processing
registry = {}

loop = 0
for sentence in sentences:
    print(sentence)
    timings = timestamps.findall(sentence)
    idx+=1
    outfile.write(str(idx)+"\n")
    if timings:
        #There are timestamps in the sentence
        previous_timings = timings
        loop = 0
        start_time = timings[0]
        end_time = timings[-1]
        # Revert list item to a string
        sentence = ''.join(sentence)
        # Remove timestamps from the text
        sentence = ''.join(re.sub(timestamps,' ', sentence))
        # Get rid of multiple spaces and \ characters
        sentence = '  '.join(sentence.split())
        sentence = sentence.replace('  ', ' ')
        sentence = sentence.replace("\\'", "'")
        previous_sentence = sentence
        print("Starts at", start_time)
        print(sentence)
        print("Ends at", end_time,'\n')
        outfile.write(start_time+" --> "+end_time+"\n")
        outfile.write(sentence+"\n\n")

    else:
        # There are no timestamps in the sentence therefore this must
        # be a separate sentence cut adrift from an existing timestamp
        # We will have to estimate its start and end times using data
        # from the last time stamp we know of
        start_time = previous_timings[0]
        reg_end_time = previous_timings[-1]

        # Convert timestamp to  seconds
        h,m,s,milli = re.split(':|,',start_time)
        s_time = (3600*int(h))+(60*int(m))+int(s)+(int(milli)/1000)

        # Guess the timing for the previous sentence and add it
        # but only for the first adrift sentence as the start time will be adjusted
        # This number may well vary depending on the cadence of the speaker
        if loop == 0:
            registry[reg_end_time] = reg_end_time
            #s_time += 0.06 * len(previous_sentence)
            s_time += 0.06 * len(tags.sub('',previous_sentence))
        # Guess the end time
        e_time = s_time + (0.06 * len(tags.sub('',previous_sentence)))

        # Convert start to a timestamp
        s,milli = divmod(s_time,1)
        m,s = divmod(int(s),60)
        h,m = divmod(m,60)
        start_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))

        # Convert end to a timestamp
        s,milli = divmod(e_time,1)
        m,s = divmod(int(s),60)
        h,m = divmod(m,60)
        end_time = "{:02d}:{:02d}:{:02d},{:03d}".format(h,m,s,round(milli*1000))

        #Register new end time for previous sentence
        if loop == 0:
            loop = 1
            registry[reg_end_time] = start_time

        print("Starts at", start_time)
        print(sentence)
        print("Ends at", end_time,'\n')
        outfile.write(start_time+" --> "+end_time+"\n")
        outfile.write(sentence+"\n\n")
        try:
            # re-set the previous start time in case the following sentence
            # was cut adrift from its time stamp as well
            previous_timings[0] = end_time
        except:
            pass
outfile.close()

#Post processing
if registry:
    outfile = open('new_video.srt','r')
    text = outfile.read()
    new_text = text
    # Run through registered end times and replace them
    # if not the video player will not display the subtitles
    # correctly because they overlap in time
    for key, end in registry.items():
        new_text = new_text.replace(key, end, 1)
        print("replacing", key, "with", end)
    outfile.close()
    outfile = open('new_video.srt','w')
    outfile.write(new_text)
    outfile.close()

Edit: Happily, I persevered with this code because I was intrigued by the problem.
Whilst I appreciate that it is hacky and doesn't use the pysrt subtitle module, just re, I believe that, in this instance, it does a fair job.
I have commented the edited code, so hopefully it will be clear what I am doing and why.
The regex is looking for timestamp patterns of the form 0:00:00,000 or 00:00:00,000, i.e.

\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}

1 or 2 digits, followed by 1 or 2 groups of a colon plus 2 digits, followed by a comma and 3 digits (the milliseconds)

If a concatenated sentence has multiple start and end times within it, for the whole sentence we only require the first, the sentence start time, and the last, the sentence end time. I hope that is clear.
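
As a concrete illustration (a small sketch using made-up text in the pre-processed one-line format), findall returns every timestamp inside the concatenated sentence and only the first and last are kept:

import re

timestamps = re.compile(r'\d{1,2}(?::\d{2}){1,2}(?:,)\d{3}')
sentence = ("00:00:14,750 00:00:18,636 in a 1968 speech where he reflects "
            "upon the Civil Rights Movement, 00:00:18,636 00:00:21,330 states")
timings = timestamps.findall(sentence)
# timings == ['00:00:14,750', '00:00:18,636', '00:00:18,636', '00:00:21,330']
start_time = timings[0]   # sentence start: '00:00:14,750'
end_time = timings[-1]    # sentence end:   '00:00:21,330'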

Edit 2: This version copes with full stops in mathematical and chemical formulae, IP numbers, etc., basically places where a full stop doesn't mark the end of a sentence.


Jaquelynjaquenetta answered 17/5, 2019 at 17:44 Comment(5)
Thank you very much for that answer. It seems to work pretty well so far. Would you mind explaining what you mean when you say that you look for the time stamps in each sentence using the first and last only? I am not quite understanding this regex part. Thanks.Ukase
See my edited answer. I was intrigued so pushed on. You'll note that numbers no longer disappear and, with post-processing, overlapping timestamps have been adjusted for multiple sentences in a single subtitle, using a dictionary of the troublesome timestamps. The timing factor may need to be a variable to account for different cadences depending on the speaker. Currently it is set just for this video.Jaquelynjaquenetta
Thank you very much for your update and your explanation! Let's suppose a sentence looks like this: "bla blaa blaa time_stamp_1 blu blu time_stamp_2 bl bl bl bl" --> you would say that this sentence starts at time_stamp_1 and ends at time_stamp_2. In that case the subtitle would appear slightly later than it should and also end slightly earlier than it should. Do you think that this could be improved using the characters per second count?Ukase
@Ukase Each sentence will either start with a timestamp (start --> end) followed by text, possibly with one or more further timestamp start/end pairs and more text, or have no timestamps at all. The latter case is where there was more than one sentence in a subtitle entry. It's these we have to calculate timestamps for and identify the first entry for post-processing. I chose to use a fixed characters per second count rather than attempt to calculate one. People don't speak at a fixed rate, so attempting to calculate a rate is as error prone as just guessing, in my opinion.Jaquelynjaquenetta
Thank you for your answer. I edited my question and added code to calculate the characters per second count for each subtitle. I still need to figure out how to incorporate it into your code. I would just like to test whether it makes any difference.Ukase
