Regex pattern isn't matching certain show titles

I'm using a C# regex to parse show titles into their parts, but the results are unreliable.

The pattern I am using is as follows:

Regex r=new Regex( 
      @"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})",
      RegexOptions.IgnoreCase
);

The following test cases fail:


Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]
The Soup 2015.05.22 [mp4]
Big Brother UK Live From The House (May 22, 2015)

Should return

  • Show Name (eg, Ellen)
  • Date (eg, 2015.05.22)
  • Extra Info (eg, Joseph Gordon Levitt [REPOST])

Alaskan Bush People S02 Wild Times Special

Should return

  • Show Name (eg, Alaskan Bush People)
  • Season (eg, 02)
  • Extra Info (eg, Wild Times Special)

500 Questions S01E03

Should return

  • Show Name (eg, 500 Questions)
  • Season (eg, 01)
  • Episode (eg, 03)

Examples that work and return proper data

Boyster S01E13 – E14
Mysteries at the Museum S08E08
Mysteries at the National Parks S01E07 – E08
The Last Days Of… S01E06
Born Naughty? S01E02
Have I Got News For You S49E07

It seems that when the S or the E is missing, the pattern simply skips it and fills the season and episode slots with the first digits it finds in the string.
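
To make the symptom concrete, here is a minimal sketch that runs the pattern above against two of the failing titles. It assumes C# 9 or later (top-level statements), and the helper name Describe is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
var original = new Regex(
    @"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})",
    RegexOptions.IgnoreCase);

// Hypothetical helper: groups 1-3 are the name, "season" and "episode"
// slots of the first alternative.
string Describe(string title)
{
    var m = original.Match(title);
    return string.Format("name='{0}' season='{1}' episode='{2}'",
        m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value);
}

Console.WriteLine(Describe("500 Questions S01E03"));
Console.WriteLine(Describe("Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]"));
```

This prints name='' season='50' episode='0' for the first title and name='Ellen ' season='20' episode='15' for the second: because S and E are optional, the lazy name group stops at the first run of digits, whatever those digits mean.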

The pattern clearly needs more work to handle the varying strings above. Your assistance in this matter is much appreciated.

Librium answered 23/5, 2015 at 9:40 Comment(5)
@"(.*?)S?(\d{1,2})E?(\d{1,2})(.*)|(.*?)S?(\d{1,2})E?(\d{1,2})": why did you write the same pattern twice? - Juliennejuliet
It's not the same pattern. Notice that one ends with (.*) for any trailing chars, while the other does not. I found that if I stripped the (.*), strings with more chars after the episode number weren't being caught at all. - Librium
What I am saying is that the 2nd part is a subset of the first part, where .* matches zero characters. - Juliennejuliet
I would like you to rephrase your problem, as it seems you are trying to catch a multitude of patterns using wildcards and one single regexp. I would recommend that you show a proper example of exactly the input you are trying to match, and I also think you will need several patterns, and maybe several passes over the text, as the input is very varied. - Phantom
Avoid using '.*', which will consume everything up to the end of the line. You need more alternations (|) to handle dates. Use named groups to handle empty groups. Here are my fixes: @"(?'name'[^S]*)?S(?'season'\d{1,2})E?(?'episode'\d{1,2})?(?'end'[^$]*)|(?'name'[^S]*)?S(?'season'\d{1,2})E(?'episode'\d{1,2})" - Winfordwinfred

Divide and Conquer

You're trying to parse too much with one simple expression. That's not going to work very well. The best approach in this case is to divide the problem into smaller problems, and solve each one separately. Then, we can combine everything into one pattern later.

Let's write some patterns for the data you want to extract.

  • Season/episode:

    S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?
    

    I used \p{Pd} instead of - to accommodate any dash type.

  • Date:

    \d{4}\.\d{1,2}\.\d{1,2}
    

    Or...

    (?i:January|February|March|April|May|June|July|August|September|October|November|December)
    \s*\d{1,2},\s*\d{4}
    
  • Extra info:

    .*?
    

    (yeah, that's pretty generic)

  • We can also detect the show format like this:

    \[.*?\]
    
  • You can add additional parts as required.
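
Before combining them, each part can be tried on its own. A small sketch, written as C# 9+ top-level statements; the constant names are only illustrative and the test strings come from the question:

```csharp
using System;
using System.Text.RegularExpressions;

// The building blocks from the list above, each tried in isolation.
const string SeasonEpisode = @"S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?";
const string DottedDate = @"\d{4}\.\d{1,2}\.\d{1,2}";
const string LongDate =
    @"(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4}";
const string Format = @"\[.*?\]";

// \u2013 is an en dash; \p{Pd} matches it just as it matches a plain hyphen.
var episode = Regex.Match("Boyster S01E13 \u2013 E14", SeasonEpisode).Value;
var dotted = Regex.Match("The Soup 2015.05.22 [mp4]", DottedDate).Value;
var longDate = Regex.Match("Big Brother UK Live From The House (May 22, 2015)", LongDate).Value;
var format = Regex.Match("The Soup 2015.05.22 [mp4]", Format).Value;

Console.WriteLine(episode);   // S01E13 followed by the dash and E14
Console.WriteLine(dotted);    // 2015.05.22
Console.WriteLine(longDate);  // May 22, 2015
Console.WriteLine(format);    // [mp4]
```

Testing the parts separately like this makes it easier to see which one is at fault when a new title format shows up.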

Now, we can put everything into one pattern, using group names to extract data:

^\s*
(?<name>.*?)
(?<info> \s+ (?:
    (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
    |
    (?<date>\d{4}\.\d{1,2}\.\d{1,2})
    |
    \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
    |
    \[(?<format>.*?)\]
    |
    (?<extra>(?(info)|(?!)).*?)
))*
\s*$

Just ignore the info group: it exists only for the conditional in extra, (?(info)|(?!)), which lets extra match only after an earlier info item has been captured, so that extra doesn't consume what should be part of the show name. The extra group can capture several times (once per word), so concatenate its captures, putting a space between each part.

Sample code:

using System;
using System.Text.RegularExpressions;

var inputData = new[]
{
    "Boyster S01E13 – E14",
    "Mysteries at the Museum S08E08",
    "Mysteries at the National Parks S01E07 – E08",
    "The Last Days Of… S01E06",
    "Born Naughty? S01E02",
    "Have I Got News For You S49E07",
    "Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]",
    "The Soup 2015.05.22 [mp4]",
    "Big Brother UK Live From The House (May 22, 2015)",
    "Alaskan Bush People S02 Wild Times Special",
    "500 Questions S01E03"
};

var re = new Regex(@"
    ^\s*
    (?<name>.*?)
    (?<info> \s+ (?:
        (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
        |
        (?<date>\d{4}\.\d{1,2}\.\d{1,2})
        |
        \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
        |
        \[(?<format>.*?)\]
        |
        (?<extra>(?(info)|(?!)).*?)
    ))*
    \s*$
", RegexOptions.IgnorePatternWhitespace);

foreach (var input in inputData)
{
    Console.WriteLine();
    Console.WriteLine("--- {0} ---", input);

    var match = re.Match(input);
    if (!match.Success)
    {
        Console.WriteLine("FAIL");
        continue;
    }

    foreach (var groupName in re.GetGroupNames())
    {
        if (groupName == "0" || groupName == "info")
            continue;

        var group = match.Groups[groupName];
        if (!group.Success)
            continue;

        foreach (Capture capture in group.Captures)
            Console.WriteLine("{0}: '{1}'", groupName, capture.Value);
    }
}

And the output of this is...

--- Boyster S01E13 - E14 ---
name: 'Boyster'
episode: 'S01E13 - E14'

--- Mysteries at the Museum S08E08 ---
name: 'Mysteries at the Museum'
episode: 'S08E08'

--- Mysteries at the National Parks S01E07 - E08 ---
name: 'Mysteries at the National Parks'
episode: 'S01E07 - E08'

--- The Last Days Of… S01E06 ---
name: 'The Last Days Of…'
episode: 'S01E06'

--- Born Naughty? S01E02 ---
name: 'Born Naughty?'
episode: 'S01E02'

--- Have I Got News For You S49E07 ---
name: 'Have I Got News For You'
episode: 'S49E07'

--- Ellen 2015.05.22 Joseph Gordon Levitt [REPOST] ---
name: 'Ellen'
date: '2015.05.22'
format: 'REPOST'
extra: 'Joseph'
extra: 'Gordon'
extra: 'Levitt'

--- The Soup 2015.05.22 [mp4] ---
name: 'The Soup'
date: '2015.05.22'
format: 'mp4'

--- Big Brother UK Live From The House (May 22, 2015) ---
name: 'Big Brother UK Live From The House'
date: 'May 22, 2015'

--- Alaskan Bush People S02 Wild Times Special ---
name: 'Alaskan Bush People'
episode: 'S02'
extra: 'Wild'
extra: 'Times'
extra: 'Special'

--- 500 Questions S01E03 ---
name: '500 Questions'
episode: 'S01E03'
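
Since extra is captured once per word in the output above, the pieces still need to be joined back together. A minimal sketch of that step, assuming C# 9+ top-level statements and the same pattern as in the sample code; the helper name JoinExtra is made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as in the sample code above.
var re = new Regex(@"
    ^\s*
    (?<name>.*?)
    (?<info> \s+ (?:
        (?<episode>S\d+(?:E\d+(?:\s*\p{Pd}\s*E\d+)?)?)
        |
        (?<date>\d{4}\.\d{1,2}\.\d{1,2})
        |
        \(?(?<date>(?i:January|February|March|April|May|June|July|August|September|October|November|December)\s*\d{1,2},\s*\d{4})\)?
        |
        \[(?<format>.*?)\]
        |
        (?<extra>(?(info)|(?!)).*?)
    ))*
    \s*$
", RegexOptions.IgnorePatternWhitespace);

// Hypothetical helper: joins every capture of 'extra' with single spaces.
// Returns an empty string when there is no match or no extra info.
string JoinExtra(string input)
{
    var match = re.Match(input);
    return string.Join(" ",
        match.Groups["extra"].Captures.Cast<Capture>().Select(c => c.Value));
}

Console.WriteLine(JoinExtra("Alaskan Bush People S02 Wild Times Special"));
Console.WriteLine(JoinExtra("Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]"));
```

Going by the captures listed above, this gives "Wild Times Special" and "Joseph Gordon Levitt".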
Evelunn answered 23/5, 2015 at 12:24 Comment(2)
Yours returned what I needed based on the information in the question. I did encounter another one which perhaps you could resolve (and it would help me understand grouping better): Jimmy Fallon 2015 05 22 Sting and Kevin Connolly. I tried to add an option for this date format, but I'm not sure :) - Librium
Sure, you could just add: (?<date>\d{4}[ ]\d{1,2}[ ]\d{1,2}) or (?<date>\d{4}\s\d{1,2}\s\d{1,2}), or perhaps even change (?<date>\d{4}\.\d{1,2}\.\d{1,2}) to (?<date>\d{4}[. ]\d{1,2}[. ]\d{1,2}), but that last option would accept 2015 05.22 too. You choose the best variant. - Evelunn
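
The trade-off between these variants can be checked quickly. A small sketch, assuming C# 9+ top-level statements, with illustrative constant names, testing the date parts on their own rather than inside the full pattern:

```csharp
using System;
using System.Text.RegularExpressions;

// [ ] is used instead of a bare space because the full pattern is compiled
// with RegexOptions.IgnorePatternWhitespace, which drops unescaped spaces
// outside character classes.
const string SpacesOnly = @"\d{4}[ ]\d{1,2}[ ]\d{1,2}";
const string DotOrSpace = @"\d{4}[. ]\d{1,2}[. ]\d{1,2}";

foreach (var date in new[] { "2015 05 22", "2015.05.22", "2015 05.22" })
{
    Console.WriteLine("{0,-12} spaces-only: {1,-5} dot-or-space: {2}",
        date, Regex.IsMatch(date, SpacesOnly), Regex.IsMatch(date, DotOrSpace));
}
```

The spaces-only variant accepts just the first string, while the dot-or-space variant accepts all three, including the mixed 2015 05.22.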

Try this:

(?<name>.*?)(?:S(?<season>\d{1,2}))?(?:E(?<episode>\d{1,2}))?(?<date>\d{4}\.\d{2}\.\d{2})(?<extra>.*)?
Lowercase answered 23/5, 2015 at 11:53 Comment(1)
This method did not return proper results. Thank you for the attempt ;) - Librium
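
One likely reason, sketched below under the assumption of C# 9+ top-level statements: the date group in that pattern is not optional, so any title without a dotted date cannot match at all. The variable name attempt is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the answer above, unchanged. Note that (?<date>...) is
// mandatory, while season and episode are optional.
var attempt = new Regex(
    @"(?<name>.*?)(?:S(?<season>\d{1,2}))?(?:E(?<episode>\d{1,2}))?(?<date>\d{4}\.\d{2}\.\d{2})(?<extra>.*)?");

foreach (var title in new[]
{
    "500 Questions S01E03",
    "Alaskan Bush People S02 Wild Times Special",
    "Ellen 2015.05.22 Joseph Gordon Levitt [REPOST]"
})
{
    Console.WriteLine("{0}: {1}", title, attempt.IsMatch(title));
}
```

Only the Ellen title matches; the two season/episode titles contain no dotted date, so the whole match fails for them.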
