I have a video transcript SRT file with lines in conventional SRT format. Here's an example:
1
00:00:00,710 --> 00:00:03,220
Lorem ipsum dolor sit amet
consectetur, adipisicing elit.

2
00:00:03,220 --> 00:00:05,970
Dignissimos et quod laboriosam
iure magni expedita

3
00:00:05,970 --> 00:00:09,130
nisi, quis quaerat. Rem, facere!
I'm trying to use Python to read and parse this file, skip the lines that hold only the segment number or the timestamps (e.g., skip '1' and '00:00:00,710 --> 00:00:03,220'), and then join the remaining lines of text so that they are presented in a readable format. Here's an example of the output I'm trying to generate:
Lorem ipsum dolor sit amet consectetur, adipisicing elit. Dignissimos et quod laboriosam iure magni expedita nisi, quis quaerat. Rem, facere!
Here's the code I've come up with so far:
def main():
    # Access folder in filesystem
    # After parsing content of file, move to next file
    # Declare variable empty list
    lineList = []
    # read file line by line
    file = open("/Sample-SRT-File.srt", "r")
    lines = file.readlines()
    file.close()
    # look for patterns and parse
    # Remove blank lines from file
    lines = [i for i in lines if i[:-1]]
    # Discount first and second line of each segment using a match pattern
    for line in lines:
        line = line.strip()
        if isinstance(line[0], int) != False:
            # store all text into a list
            lineList.append(line)
    # for every item in the list that ends with '', '.', '?', or '!', append a space at end
    for line in lineList:
        line = line + ' '
    # Finish with list.join() to bring everything together
    text = ''.join(lineList)
    print(text)

main()
I'm pretty out of practice with my Python as it is, but right now I'm wondering whether the only way to effectively and reliably match the first and second lines of each segment, so they can be removed or skipped, is to use a regular expression. Otherwise, this might be possible using the itertools library or some kind of function that would skip lines 1 & 2 of each segment as well as any blank line.
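For illustration, here is a rough sketch of what the regular-expression route might look like. It is only one possible approach, not a confirmed solution. The names `INDEX_RE`, `TIMING_RE`, and `srt_to_text` are made up for this example, and it assumes that a segment number is a line containing only digits and that a timestamp line has the `HH:MM:SS,mmm --> HH:MM:SS,mmm` shape shown above.

```python
import re

# Hypothetical patterns (assumptions, not part of any SRT library):
# a segment number is a line made up of digits only...
INDEX_RE = re.compile(r"^\d+$")
# ...and a timestamp line looks like "00:00:00,710 --> 00:00:03,220".
TIMING_RE = re.compile(
    r"^\d{2,}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2,}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Return the subtitle text of an SRT string joined by single spaces."""
    kept = []
    for raw in srt.splitlines():
        line = raw.strip()
        # Skip blank lines, segment numbers, and timestamp lines.
        if not line or INDEX_RE.match(line) or TIMING_RE.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


sample = """1
00:00:00,710 --> 00:00:03,220
Lorem ipsum dolor sit amet
consectetur, adipisicing elit.

2
00:00:03,220 --> 00:00:05,970
Dignissimos et quod laboriosam
iure magni expedita

3
00:00:05,970 --> 00:00:09,130
nisi, quis quaerat. Rem, facere!
"""

print(srt_to_text(sample))
```

One caveat with this sketch: a subtitle line whose text happens to be only digits (say, a caption reading "42") would be dropped along with the segment numbers.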
Anyone out there with the Python moves to help me overcome this?
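The itertools idea mentioned above could be sketched roughly as follows, without any regular expression. This assumes the file follows the usual SRT layout in which a blank line separates one segment from the next, so the first two lines of every block are always the number and the timestamps. The function name `srt_blocks_to_text` and the `demo` list are hypothetical, used only for this example.

```python
from itertools import groupby


def srt_blocks_to_text(lines):
    """Join subtitle text, assuming blank lines separate the segments."""
    stripped = (line.strip() for line in lines)
    text_lines = []
    # groupby clusters consecutive lines by whether they are non-empty,
    # so each group with content is one segment: number, timestamps, text.
    for has_content, block in groupby(stripped, key=bool):
        if has_content:
            # Drop the first two lines of the segment and keep the rest.
            text_lines.extend(list(block)[2:])
    return " ".join(text_lines)


# Lines as file.readlines() would return them (trailing newlines included).
demo = [
    "1\n",
    "00:00:00,710 --> 00:00:03,220\n",
    "Lorem ipsum dolor sit amet\n",
    "consectetur, adipisicing elit.\n",
    "\n",
    "2\n",
    "00:00:03,220 --> 00:00:05,970\n",
    "Dignissimos et quod laboriosam\n",
    "iure magni expedita\n",
    "\n",
    "3\n",
    "00:00:05,970 --> 00:00:09,130\n",
    "nisi, quis quaerat. Rem, facere!\n",
]

print(srt_blocks_to_text(demo))
```

Because this version relies on position rather than content, it keeps a caption that consists only of digits, but it would misbehave on a file where the blank separator lines are missing.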