parsing a .srt file with regex

Asked 12/5, 2014 at 23:13 Answered 9/10, 2021 at 23:0

I am writing a small script in Python, but since I am quite new I got stuck on one part: I need to get the timing and the text from a .srt file. For example, from

1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org

I need to get:

00:00:01,000 --> 00:00:04,074

and

Subtitles downloaded from www.OpenSubtitles.org.

I have already managed to write the regex for the timing, but I am stuck on the text. I've tried to use a look-behind that contains my timing regex:

( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+

but with no effect. Personally, I think that a look-behind is the right way to solve this, but I am not sure how to write it correctly. Can anyone help me? Thanks.

Hathorn answered 12/5, 2014 at 23:13 Comment(2)

try with: (\d\d:\d\d:\d\d,\d\d\d.+\d\d:\d\d:\d\d,\d\d\d)|(Subtitles downloaded from www.OpenSubtitles.org) – Connatural 12/5, 2014 at 23:24

Can you add another example of the subtitles, and use code (`) tags instead of quotes (>)? Also, can you show some of the python code that is using this regex? – Convalescence 12/5, 2014 at 23:25

Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:

an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line

... and repeat. Note the "one or more lines" part: you might have to capture 1, 2, or 20 lines of subtitle content after the time code.

So, just take advantage of the structure. In this way you can parse everything in just one pass, without needing to put more than one line into memory at a time and still keeping all the information for each subtitle together.

from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]

For example, using the example on the SRT doc page, I get:

res
Out[60]: 
[['1\n',
  '00:02:17,440 --> 00:02:20,375\n',
  "Senator, we're making\n",
  'our final approach into Coruscant.\n'],
 ['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]

And I could further transform that into a list of meaningful objects:

from collections import namedtuple

Subtitle = namedtuple('Subtitle', 'number start end content')

subs = []

for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

subs
Out[65]: 
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
 Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
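
To try the approach without a file on disk, the two steps above can be folded into one function. This is only a sketch: the name parse_srt and the in-memory sample are mine, not part of the answer, and it assumes the cue separator is always at least one blank line.

```python
from collections import namedtuple
from io import StringIO
from itertools import groupby

Subtitle = namedtuple('Subtitle', 'number start end content')

def parse_srt(lines):
    """Parse an iterable of SRT lines into a list of Subtitle tuples.

    parse_srt is a hypothetical helper name used for illustration.
    """
    subs = []
    # Consecutive non-blank lines form one cue; runs of blank lines are skipped.
    for nonblank, group in groupby(lines, lambda line: bool(line.strip())):
        if not nonblank:
            continue
        chunk = [line.strip() for line in group]
        if len(chunk) < 3:  # skip malformed cues that have no text
            continue
        number, timing, *content = chunk
        start, end = timing.split(' --> ')
        subs.append(Subtitle(number, start, end, content))
    return subs

# Any iterable of lines works: an open file, a list, or a StringIO as here.
sample = StringIO(
    "1\n"
    "00:02:17,440 --> 00:02:20,375\n"
    "Senator, we're making\n"
    "our final approach into Coruscant.\n"
    "\n"
    "2\n"
    "00:02:20,476 --> 00:02:22,501\n"
    "Very good, Lieutenant.\n"
)
subs = parse_srt(sample)
for sub in subs:
    print(sub.start, sub.end, ' '.join(sub.content))
```

Taking an iterable instead of a filename keeps the function easy to test and lets it consume a file lazily, one line at a time.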

Dymphia answered 12/5, 2014 at 23:32 Comment(0)

I disagree with @roippi. Regex is a very nice solution for text matching, and the regex for this problem is not tricky.

import re

# Read the whole file content (yoursrtfile is the path to your .srt file)
with open(yoursrtfile) as f:
    content = f.read()
# Find all results in content.
# The first (...) group retrieves the timing, \s+ matches the whitespace after it,
# and (.+) retrieves the text that follows, up to the end of that line.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print(result)
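
One caveat worth checking on your own files: `.` does not match a newline by default, so `(.+)` keeps only the first text line of a multi-line cue. The sketch below shows that on an in-memory sample and tries one possible variant that keeps the following non-empty lines too. The variant is my own assumption, not part of the answer, and it assumes `\n` line endings.

```python
import re

sample = (
    "1\n"
    "00:02:17,440 --> 00:02:20,375\n"
    "Senator, we're making\n"
    "our final approach into Coruscant.\n"
    "\n"
    "2\n"
    "00:02:20,476 --> 00:02:22,501\n"
    "Very good, Lieutenant.\n"
)

TIMING = r"\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+"

# Pattern from the answer: '.' stops at the newline, so only the
# first text line of each cue is captured.
first_line_only = re.findall(r"(" + TIMING + r")\s+(.+)", sample)

# Assumed variant: after the first text line, keep consuming
# "newline followed by a non-empty line" until a blank line is reached.
multi_line = re.findall(r"(" + TIMING + r")\s+(.+(?:\n.+)*)", sample)

for timing, text in multi_line:
    print(timing, text.split("\n"))
```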

Eldwun answered 12/5, 2014 at 23:36 Comment(0)

number:^[0-9]+$
Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
string: *[a-zA-Z]+*

Hope this helps.
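
A rough sketch of how the number and time patterns above could be applied line by line. The classify function is a hypothetical helper, and treating every other non-blank line as text (rather than matching letters only) is my assumption, so that punctuation and non-Latin text are not rejected.

```python
import re

# fullmatch() anchors at both ends, so ^ and $ are not needed here.
NUMBER = re.compile(r"[0-9]+")
TIME = re.compile(
    r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> "
    r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}"
)

def classify(line):
    """Return 'blank', 'time', 'number' or 'text' for one line of an SRT file."""
    line = line.strip()
    if not line:
        return "blank"
    if TIME.fullmatch(line):
        return "time"
    if NUMBER.fullmatch(line):
        return "number"
    # Anything else is assumed to be subtitle text.
    return "text"

for line in ["1", "00:02:17,440 --> 00:02:20,375", "Senator, we're making", ""]:
    print(repr(line), classify(line))
```

Note that a subtitle line made up only of digits would be reported as a number, so a real parser still needs to track its position inside the cue.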

Betz answered 26/1, 2017 at 16:52 Comment(0)

Thanks @roippi for this excellent parser. It helped me a lot in writing an srt-to-stl converter in fewer than 40 lines (in Python 2 though, as it has to fit into a larger project):

from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple

# prepare - adapt to your needs or use sys.argv
inputname = 'FR.srt'  
outputname = 'FR.stl'
stlheader = """
$FontName           = Arial
$FontSize           = 34
$HorzAlign          = Center
$VertAlign          = Bottom

"""
def converttime(sttime):
    "convert from srt time format (0...999) to stl one (0...25)"
    st = sttime.split(',')
    return "%s:%02d"%(st[0], round(25*float(st[1])  /1000))

# load
with open(inputname,'r') as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]

# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:]   # py 2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

# write
with open(outputname,'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n"%(converttime(sub.start), converttime(sub.end), "|".join(sub.content)) )
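
To make the time conversion concrete, here is converttime from the script above run on its own under Python 3, with a couple of worked values. The 25 in the formula is the frame rate the script assumes.

```python
def converttime(sttime):
    "convert from srt time format (0...999) to stl one (0...25)"
    st = sttime.split(',')
    return "%s:%02d" % (st[0], round(25 * float(st[1]) / 1000))

# 440 ms * 25 / 1000 = 11.0, so the frame field is 11
print(converttime('00:02:17,440'))
# 375 ms * 25 / 1000 = 9.375, which rounds down to 9
print(converttime('00:02:20,375'))
# 990 ms * 25 / 1000 = 24.75, which rounds up to 25
print(converttime('00:00:01,990'))
```

The last case shows that millisecond values near the top of the range produce a frame field of 25, as the docstring's "(0...25)" implies. Whether a given STL consumer accepts that, or expects it to roll over into the next second, is something I have not checked.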

Mekka answered 13/9, 2017 at 7:19 Comment(0)

None of the pure regex solutions above worked for real-life srt files.

Let's take a look at the following SRT-formatted text:

1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line

2
00:02:20,476 --> 00:02:22,501
as well as a single line

3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは

Take note that:

Text may contain Unicode characters.
Text can consist of several lines.
Every cue starts with an integer value and ends with a blank line; both Unix-style and Windows-style (CR/LF) line endings are accepted.

Here is the working regex:

\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))

https://regex101.com/r/qICmEM/1
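
For use from Python, here is a sketch of a closely related pattern. It is not the exact regex above: as far as I can tell, `[\r\n]` matches only a single character in Python's re module, and `.` also matches `\r`, so this variant spells out `\r?\n` for line endings and uses `[^\r\n]+` for text lines. The name CUE and the sample string are mine.

```python
import re

CUE = re.compile(
    r"(\d+)\r?\n"                                     # cue number
    r"(\d+:\d+:\d+,\d+) --> (\d+:\d+:\d+,\d+)\r?\n"   # start and end times
    r"((?:[^\r\n]+(?:\r?\n|$))+)"                     # one or more text lines
)

# The first cue uses Windows line endings, the others use Unix ones,
# and the last cue has no trailing newline.
sample = (
    "1\r\n"
    "00:02:17,440 --> 00:02:20,375\r\n"
    "Some multi lined text\r\n"
    "This is a second line\r\n"
    "\r\n"
    "2\n"
    "00:02:20,476 --> 00:02:22,501\n"
    "as well as a single line\n"
    "\n"
    "3\n"
    "00:03:20,476 --> 00:03:22,501\n"
    "should be able to parse unicoded text too\n"
    "こんにちは"
)

cues = [
    (m.group(1), m.group(2), m.group(3), m.group(4).splitlines())
    for m in CUE.finditer(sample)
]
for number, start, end, text in cues:
    print(number, start, end, text)
```

When reading from a file, opening it with an explicit encoding (for example encoding='utf-8') is usually what makes the Unicode text come through intact.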

Ommatidium answered 9/10, 2021 at 23:0 Comment(0)

for time:

pattern = r"(\d{2}:\d{2}:\d{2},\d{3}?.*)"

Beyer answered 4/12, 2015 at 10:23 Comment(0)
