Divide and Conquer
You're trying to parse too much with one simple expression, and that's not going to work very well. The best approach in this case is to divide the problem into smaller problems and solve each one separately. Then we can combine everything into one pattern.
Let's write some patterns for the data you want to extract.
Season/episode:
S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?
I used \p{Pd} instead of - to accommodate any dash type.
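To make the effect of \p{Pd} concrete, here is a minimal sketch that runs the season/episode pattern on a few strings. The `^` and `$` anchors and the sample strings are my additions for illustration; the pattern between the anchors is the one shown above.

```csharp
using System;
using System.Text.RegularExpressions;

// Season/episode pattern from above, anchored here so the whole string must match.
var episode = new Regex(@"^S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?$");

var samples = new[]
{
    "S02",                  // season only: matches
    "S08E08",               // single episode: matches
    "S01E13 - E14",         // ASCII hyphen-minus (U+002D): matches
    "S01E13 \u2013 E14",    // en dash (U+2013): matches
    "S01E13\u2014E14",      // em dash (U+2014), no surrounding spaces: matches
    "S01E13 \u2212 E14",    // minus sign (U+2212) is a math symbol, not Pd: no match
    "E14"                   // no season prefix: no match
};

foreach (var sample in samples)
    Console.WriteLine("{0,-16} {1}", sample, episode.IsMatch(sample));
```

Note that \p{Pd} only covers characters in the Unicode "dash punctuation" category, so look-alikes such as the minus sign would need to be added explicitly if your data contains them.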
Date:
\d{4}\.\d{1,2}\.\d{1,2}
Or...
(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4}
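As a quick check, the two date patterns can be joined with | and tried on their own. This is only a sketch: wrapping them in a single `date` group and the sample strings (including the lower-case one) are my assumptions for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

// The two date patterns from above, joined with | inside one named group.
var date = new Regex(@"
    (?<date>
        \d{4}\.\d{1,2}\.\d{1,2}
      |
        (?i:January|February|March|April|May|June|July|August|September|October|November|December)
        \s*\d{1,2},\s*\d{4}
    )", RegexOptions.IgnorePatternWhitespace);

var samples = new[]
{
    "The Soup 2015.05.22 [mp4]",                          // numeric date
    "Big Brother UK Live From The House (May 22, 2015)",  // month-name date
    "recorded on may 3,2015",                             // lower-case month, no space after the comma
    "500 Questions S01E03"                                // no date at all
};

foreach (var sample in samples)
{
    var match = date.Match(sample);
    Console.WriteLine("{0} -> {1}", sample,
        match.Success ? match.Groups["date"].Value : "(no date)");
}
```

The (?i:...) group makes only the month names case-insensitive, so the rest of the pattern keeps its normal case rules.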
Write a simple pattern for extra info:
.*?
(yeah, that's pretty generic)
We can also detect the show format like this:
\[.*?\]
You can add additional parts as required.
Now, we can put everything into one pattern, using group names to extract data:
^\s*
(?<name>.*?)
(?<info> \s+ (?:
(?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
|
(?<date>\d{4}\.\d{1,2}\.\d{1,2})
|
\(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
|
\[(?<format>.*?)\]
|
(?<extra>(?(info)|(?!)).*?)
))*
\s*$
Just ignore the info group (it's only there for the conditional in extra, so that extra doesn't consume what should be part of the show name). You can get multiple extra captures, so just concatenate them, putting a space between each part.
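Here is a small sketch of both points: joining the extra captures, and what the conditional buys you. `JoinExtras` is a hypothetical helper name, and the variant without the conditional exists only for comparison; neither is part of the pattern itself.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The combined pattern from above, unchanged apart from indentation.
const string pattern = @"
    ^\s*
    (?<name>.*?)
    (?<info> \s+ (?:
        (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
      |
        (?<date>\d{4}\.\d{1,2}\.\d{1,2})
      |
        \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
      |
        \[(?<format>.*?)\]
      |
        (?<extra>(?(info)|(?!)).*?)
    ))*
    \s*$";

var withConditional = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

// Same pattern with the conditional stripped out, for comparison only.
var withoutConditional = new Regex(
    pattern.Replace("(?(info)|(?!))", ""), RegexOptions.IgnorePatternWhitespace);

// Hypothetical helper: joins every capture of the 'extra' group with single spaces.
string JoinExtras(Match match) =>
    string.Join(" ", match.Groups["extra"].Captures.Cast<Capture>().Select(c => c.Value));

const string input = "Alaskan Bush People S02 Wild Times Special";

var guarded = withConditional.Match(input);
var unguarded = withoutConditional.Match(input);

// With the conditional: name='Alaskan Bush People' extra='Wild Times Special'
Console.WriteLine("with:    name='{0}' extra='{1}'",
    guarded.Groups["name"].Value, JoinExtras(guarded));

// Without it, extra can start right after the first word and eats the rest of the name.
Console.WriteLine("without: name='{0}' extra='{1}'",
    unguarded.Groups["name"].Value, JoinExtras(unguarded));
```

Because name is lazy, it stops at the first position where the rest of the pattern can match. The conditional forces that position to be a real episode, date, or format rather than an arbitrary word.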
Sample code:
using System;
using System.Text.RegularExpressions;

var inputData = new[]
{
    "Boyster S01E13 – E14",
    "Mysteries at the Museum S08E08",
    "Mysteries at the National Parks S01E07 – E08",
    "The Last Days Of… S01E06",
    "Born Naughty? S01E02",
    "Have I Got News For You S49E07",
    "Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]",
    "The Soup 2015.05.22 [mp4]",
    "Big Brother UK Live From The House (May 22, 2015)",
    "Alaskan Bush People S02 Wild Times Special",
    "500 Questions S01E03"
};

// IgnorePatternWhitespace lets the pattern span several lines.
var re = new Regex(@"
    ^\s*
    (?<name>.*?)
    (?<info> \s+ (?:
        (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
      |
        (?<date>\d{4}\.\d{1,2}\.\d{1,2})
      |
        \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
      |
        \[(?<format>.*?)\]
      |
        (?<extra>(?(info)|(?!)).*?)
    ))*
    \s*$
", RegexOptions.IgnorePatternWhitespace);

foreach (var input in inputData)
{
    Console.WriteLine();
    Console.WriteLine("--- {0} ---", input);

    var match = re.Match(input);
    if (!match.Success)
    {
        Console.WriteLine("FAIL");
        continue;
    }

    foreach (var groupName in re.GetGroupNames())
    {
        // Skip the whole match ("0") and the helper group "info".
        if (groupName == "0" || groupName == "info")
            continue;

        var group = match.Groups[groupName];
        if (!group.Success)
            continue;

        // A group inside a repeated construct can capture more than once.
        foreach (Capture capture in group.Captures)
            Console.WriteLine("{0}: '{1}'", groupName, capture.Value);
    }
}
And the output of this is...
--- Boyster S01E13 - E14 ---
name: 'Boyster'
episode: 'S01E13 - E14'
--- Mysteries at the Museum S08E08 ---
name: 'Mysteries at the Museum'
episode: 'S08E08'
--- Mysteries at the National Parks S01E07 - E08 ---
name: 'Mysteries at the National Parks'
episode: 'S01E07 - E08'
--- The Last Days Of… S01E06 ---
name: 'The Last Days Of…'
episode: 'S01E06'
--- Born Naughty? S01E02 ---
name: 'Born Naughty?'
episode: 'S01E02'
--- Have I Got News For You S49E07 ---
name: 'Have I Got News For You'
episode: 'S49E07'
--- Ellen 2015.05.22 Joseph Gordon Levitt [REPOST] ---
name: 'Ellen'
date: '2015.05.22'
format: 'REPOST'
extra: 'Joseph'
extra: 'Gordon'
extra: 'Levitt'
--- The Soup 2015.05.22 [mp4] ---
name: 'The Soup'
date: '2015.05.22'
format: 'mp4'
--- Big Brother UK Live From The House (May 22, 2015) ---
name: 'Big Brother UK Live From The House'
date: 'May 22, 2015'
--- Alaskan Bush People S02 Wild Times Special ---
name: 'Alaskan Bush People'
episode: 'S02'
extra: 'Wild'
extra: 'Times'
extra: 'Special'
--- 500 Questions S01E03 ---
name: '500 Questions'
episode: 'S01E03'
@"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})"
why did you write same pattern twice? – Juliennejuliet
.* matches zero characters..?? – Juliennejuliet