parsing transcript .srt files into readable text

I have a video transcript SRT file with lines in conventional SRT format. Here's an example:

1
00:00:00,710 --> 00:00:03,220
Lorem ipsum dolor sit amet
consectetur, adipisicing elit.

2
00:00:03,220 --> 00:00:05,970
Dignissimos et quod laboriosam
iure magni expedita

3
00:00:05,970 --> 00:00:09,130
nisi, quis quaerat. Rem, facere!

I'm trying to use Python to read and parse this file, skip the lines made up of digit strings (e.g., skip '1' and '00:00:00,710 --> 00:00:03,220'), and then join the remaining lines of text so that they are presented in a readable format. Here's an example of the output I'm trying to generate:

Lorem ipsum dolor sit amet consectetur, adipisicing elit. Dignissimos et quod laboriosam iure magni expedita nisi, quis quaerat. Rem, facere!

Here's the code I've come up with so far:

def main():
    # Access folder in filesystem

    # After parsing content of file, move to next file

    # Declare variable empty list
    lineList = []

    # read file line by line
    file = open( "/Sample-SRT-File.srt", "r")
    lines = file.readlines()
    file.close()

    # look for patterns and parse

    # Remove blank lines from file
    lines = [i for i in lines if i[:-1]]

    # Discount first and second line of each segment using a match pattern
    for line in lines:
        line = line.strip()
        if isinstance(line[0], int) != False:

            # store all text into a list
            lineList.append(line)

    # for every item in the list that ends with '', '.', '?', or '!', append a space at end
    for line in lineList:
        line = line + ' '

    # Finish with list.join() to bring everything together
    text = ''.join(lineList)
    print(text)

main()

I'm pretty out of practice with my Python as is, but right now I'm wondering if the only way to effectively and reliably match the first and second lines of the segment for removal or skipping is to use a regular expression. Otherwise, this might be possible using the itertools library or some kind of function that would skip lines 1 & 2 as well as any blank line.

Anyone out there with the Python moves to help me overcome this?

Overmodest answered 28/6, 2018 at 0:14 Comment(3)
Use regex! You could do something like what's shown in this post: #12595551 and look for all lines matching this pattern (00:00:00,000 --> 00:00:00,000) - Narthex
Thanks for your input on this! When I was starting to come around to the idea of using a regex, I got started on reading over the syntax for Python's regex and my brain fogged over as it was pretty late in the day. If the pysrt print(sub) method doesn't work, I'll probably end up implementing a crazy regex pattern to take care of this matching problem once and for all. - Overmodest
If the blocks are always the same length and the line of interest is always the Nth one in the group, you don't need to use a regex to get them; there are simpler ways. - Meant

If you want to use regex to filter out the digit lines and empty lines, you can use this:

import re

def main():
    # read the file line by line; the with block closes it automatically
    with open("sample.srt", "r") as file:
        lines = file.readlines()

    # keep only lines that are not a counter, a timestamp, or blank
    parts = []
    for line in lines:
        if (re.search('^[0-9]+$', line) is None
                and re.search('^[0-9]{2}:[0-9]{2}:[0-9]{2}', line) is None
                and re.search('^$', line) is None):
            parts.append(line.rstrip('\n'))
    text = ' '.join(parts)
    print(text)

main()

This will output:

Lorem ipsum dolor sit amet consectetur, adipisicing elit. Dignissimos et quod laboriosam iure magni expedita nisi, quis quaerat. Rem, facere!
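
One caveat with line-by-line filtering is that a subtitle line consisting only of digits (a cue that just says "42", say) is still dropped. Below is a minimal sketch of a block-based alternative, assuming the file follows the usual SRT layout where blocks are separated by blank lines and the first two lines of each block are the counter and the timestamp. The function name srt_to_text is made up for this illustration, and the sketch is not a full SRT parser.

```python
import re

def srt_to_text(srt_source):
    """Return the cue text of an SRT string as a single line of prose."""
    # Normalize Windows line endings, then split on one or more blank lines.
    normalized = srt_source.replace("\r\n", "\n").strip()
    blocks = re.split(r"\n\s*\n", normalized)
    pieces = []
    for block in blocks:
        lines = block.split("\n")
        # Assumption: lines[0] is the counter and lines[1] the timestamp,
        # so everything from index 2 onward is subtitle text.
        pieces.extend(line.strip() for line in lines[2:] if line.strip())
    return " ".join(pieces)

sample = """1
00:00:00,710 --> 00:00:03,220
Lorem ipsum dolor sit amet
consectetur, adipisicing elit.

2
00:00:03,220 --> 00:00:05,970
Dignissimos et quod laboriosam
iure magni expedita

3
00:00:05,970 --> 00:00:09,130
nisi, quis quaerat. Rem, facere!
"""

print(srt_to_text(sample))
```

Because the position inside the block decides what gets skipped, text lines that start with (or consist entirely of) digits are kept. The trade-off is that a malformed block, such as one missing its counter, would lose a line of text.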
Inseminate answered 28/6, 2018 at 0:45 Comment(5)
This will also filter out subtitle lines that might happen to start with a digit. - Apartheid
Ah, nice pattern there! I kind of had a feeling this place would be brimming with answers. That looks like it will work with other files I'm working with, and thanks for including the validated text output. I'll look forward to putting this solution to the test as well. - Overmodest
@Rishav Thanks for pointing that out. I improved the regex by making it more specific. - Inseminate
Thanks @Inseminate and @Rishav for your help with this! I first tried making use of Rishav's proposed use of the pysrt library, however, after at least an hour and a half of unsuccessfully running into problems of my own trying to get a pipenv running to install packages like pysrt, I ran into too many problems and my own limitations working in the shell. After finally giving up on running pysrt, I was able to quickly get pgngp's solution working, I just had to add import re to the top as well obviously :). I've got a fully-functioning script now thanks to both your efforts! - Overmodest
@JamieStrausbaugh Great! If you think this answered your question, you can accept this answer. - Inseminate

I would just use a library like pysrt for parsing srt files. That should prove to be the most robust.

import pysrt
subs = pysrt.open("foo.srt")

for sub in subs:
    print(sub.text)
    print()

Output:

Lorem ipsum dolor sit amet
consectetur, adipisicing elit.

Dignissimos et quod laboriosam
iure magni expedita

nisi, quis quaerat. Rem, facere!
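
To get the single joined paragraph the question asks for, the text of each subtitle can be collapsed and joined. The following is a rough sketch: the helper name subs_to_paragraph is made up for this example, and it only assumes that each item exposes a .text attribute, as the pysrt items in the loop above do. Stand-in objects are used so the snippet runs without pysrt installed.

```python
from types import SimpleNamespace

def subs_to_paragraph(subs):
    """Join the .text of each subtitle into one space-separated paragraph."""
    # str.split() with no arguments splits on any whitespace, so the line
    # breaks inside a cue are folded into single spaces as well.
    return " ".join(" ".join(sub.text.split()) for sub in subs)

# Stand-ins for pysrt items (an assumption made for this demo only).
# With pysrt, the call would be subs_to_paragraph(pysrt.open("foo.srt")).
demo_subs = [
    SimpleNamespace(text="Lorem ipsum dolor sit amet\nconsectetur, adipisicing elit."),
    SimpleNamespace(text="Dignissimos et quod laboriosam\niure magni expedita"),
    SimpleNamespace(text="nisi, quis quaerat. Rem, facere!"),
]

print(subs_to_paragraph(demo_subs))
```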
Apartheid answered 28/6, 2018 at 0:23 Comment(2)
Thanks! I did find out about pysrt in my research into this problem, but wasn't sure if that exact functionality was actually supported. I will test this suggestion straight away when I'm back at my desk tomorrow. - Overmodest
I believe that this answer is more useful than the accepted answer. It also works fine; I have tested it myself. - Cesium

You can also use the srt module. It has no dependencies outside of the standard library, and in typical workflows it would be faster.

import srt 

with open('your_src_filepath') as f:
    subtitle_generator = srt.parse(f)
    subtitles = list(subtitle_generator)

Then, for some use cases like tokenization, you can use:

subtitles[0].content

Output (in my specific .srt file):

"What you guys don't understand is..."

For more information about the other methods, see the documentation.

If you want to reshape the whole file, iterate over the subtitle list and rewrite each subtitle's content in the format you need.

To consume it as a generator instead, you can use:

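import srt

# subfile_path is the path to your .srt file
with open(subfile_path) as f:
    subtitle_generator = srt.parse(f)

    # subtitles are parsed lazily, one per iteration
    for sub in subtitle_generator:
        print(sub)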

Output:

Subtitle(index=1, start=datetime.timedelta(seconds=2, microseconds=877000), end=datetime.timedelta(seconds=4, microseconds=294000), content="What you guys don't understand is...", proprietary='')
Subtitle(index=2, start=datetime.timedelta(seconds=4, microseconds=504000), end=datetime.timedelta(seconds=7, microseconds=548000), content='...for us, kissing is as important\nas any part of it.', proprietary='')
...
Gynandry answered 11/7, 2022 at 17:2 Comment(0)

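With Python 3 there is no need for extra imports:

text = ""
# file is the path to your .srt file
with open(file, 'r') as f:
    for line in f:
        line = line.strip()
        # skip blank lines and lines starting with a digit (counters, timestamps)
        if line and not line[0].isdigit():
            text += " " + line
text = text.lstrip()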
Boggart answered 5/8, 2021 at 15:55 Comment(0)

If you want to skip lines based on a specific list of prefixes, the following code lets you spell that list out as a tuple passed to startswith:

with open('foo.srt', 'r') as f:
    for line in f:
        # counter and timestamp lines start with a digit, so list all ten
        if not line.startswith(('0', '1', '2', '3', '4', '5', '6', '7', '8', '9')):
            # the line already ends with a newline, so don't add another
            print(line, end='')

This is a plain loop, though, so if you are worried about the speed of your program I would recommend the answer above that uses pysrt.

Erotica answered 28/6, 2018 at 1:27 Comment(0)
