How to classify/categorize strings according to regular expression rules in Python

3

10

I am writing an ETL script in Python that gets data in CSV files, validates and sanitizes the data as well as categorizes or classifies each row according to some rules, and finally loads it into a postgresql database.

The data looks like this (simplified):

ColA, ColB, Timestamp, Timestamp, Journaltext, AmountA, AmountB

Each row is a financial transaction. What I want to do is to categorize or classify transactions based on some rules. The rules are basically regular expressions that match the text in Journaltext column.

So what I want to do is something like this:

transactions = []
for row in rows:
    t = Transaction(category=classify(row.journaltext))
    transactions.append(t)

I am not sure how to write the classify() function efficiently.

This is how the rules for classification work:

  • There are a number of categories (more can and will be added later)
  • Each category has a set of substrings or regular expressions. If the Journaltext of a transaction matches one of these expressions or contains one of these substrings, the transaction belongs to that category.
  • A transaction can only be in one category.
  • If a category, FOO, has the substrings 'foo' and 'Foo', and another category, BAR, has the substring 'football', then a transaction with Journaltext='food' must be put in category FOO, because it only matches FOO, but a transaction with Journaltext='footballs' must be placed in category BAR. I think this means that I have to put a priority or similar on each category.
  • If a transaction does not match any of the expressions, it is either None in category or will be put in a placeholder category called "UNKNOWN" or similar. This does not matter much.

OK. So how do I represent these categories and their corresponding rules in Python?

I would really appreciate your input, even if you cannot provide a full solution. Anything that points me in the right direction would be great. Thanks.

Deedeeann answered 8/3, 2012 at 19:45 Comment(1)
How big is your input (number of categories, terms per category, number of transactions, and average size of the text)? - Alienation
4

Without any kind of extra fluff:

categories = [
  ('cat1', ['foo']),
  ('cat2', ['football']),
  ('cat3', ['abc', 'aba', 'bca'])
]

def classify(text):
  for category, matches in categories:
    if any(match in text for match in matches):
      return category
  return None

In Python you can use the in operator to test whether one string is a substring of another. You could add a check such as isinstance(match, str) to tell whether a rule is a plain string or a compiled regular expression object, and handle each accordingly. How advanced it becomes is up to you.
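
A minimal sketch of that isinstance idea, assuming a rule may be either a plain substring or a compiled pattern. The `matches` helper and the sample rules are illustrative and not part of the answer above:

```python
import re

# Ordered by priority: earlier categories win. Sample rules are illustrative.
categories = [
    ('cat2', ['football']),
    ('cat1', ['foo', re.compile(r'\bFoo\w*')]),
]

def matches(rule, text):
    """Return True if a plain-string rule is a substring of text,
    or if a compiled-regex rule is found anywhere in text."""
    if isinstance(rule, str):
        return rule in text
    return rule.search(text) is not None

def classify(text):
    for category, rules in categories:
        if any(matches(rule, text) for rule in rules):
            return category
    return None
```

Here 'football' is listed before 'foo', so a text such as 'footballs' is not captured by the more general rule.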

Lyndes answered 8/3, 2012 at 19:54 Comment(6)
This seems elegant; however, it does not seem to work if a category has more than one substring. How do I do that? Say, cat3 has the substrings 'aba', 'abe', and 'bca'. - Deedeeann
@Deedeeann - Take a look at those adjustments - they allow multiple matches per category. Priority is determined by the order you put things into the main categories list. - Lyndes
@ervingsb: If you use regular expressions already, you could also adapt these to use alternation (abc|abe|bca), which simplifies the code and might result in better performance (depending on the regex implementation). - Ona
Also, a dict might be a bit less syntax clutter here, while providing the same functionality. - Ona
@NiklasB. - I chose a list over a dict primarily because it addresses the priority concept directly. With a dict you'd actually need to use an OrderedDict, or have some other container that assigned priority, because you don't have guaranteed iteration order when you call myDict.items(). - Lyndes
@g.d.d.c: One could also easily adapt the rules to be completely disjoint, but yes, if the ordering is relevant, a list of tuples is the best choice. - Ona
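
On the ordering point: since Python 3.7 the language guarantees that a plain dict iterates in insertion order, so on a modern interpreter a dict can carry the priority as well. A minimal sketch under that assumption (the category names and substrings are illustrative):

```python
# Insertion order doubles as priority (guaranteed for dict since Python 3.7).
categories = {
    'BAR': ['football'],
    'FOO': ['foo', 'Foo'],
}

def classify(text):
    for category, substrings in categories.items():
        if any(s in text for s in substrings):
            return category
    return None
```

On older interpreters the list of tuples remains the safer choice.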
3

What about this solution in pseudo-Python (the regexes are placeholders):

import re

def classify(journaltext):
    # Categories in descending priority; you have to give the full list here.
    prio_list = ["FOO", "BAR", "UPS"]
    # dictionary:
    # - key is the name of the category, must match the name in the above prio_list
    # - value is the regex that identifies the category (placeholder strings here)
    matchers = {"FOO": "the regex for FOO", "BAR": "the regex for BAR", "UPS": "..."}
    for category in prio_list:
        # re.search finds the pattern anywhere in the text;
        # re.match would only match at the start of the string.
        if re.search(matchers[category], journaltext):
            return category
    return "UNKNOWN"  # or you can "return None"

Features:

  • It has a prio_list, which holds all the categories in descending order of priority.
  • It tries the categories in the order of that list.
  • Each category is matched against its regex from the matchers dictionary, so the category names can be arbitrary.
  • The function returns the name of the first category that matches.
  • If nothing matches, you get your placeholder category name.

You can even read the prioritized category list and the regexes from a configuration file, but this is left as an exercise for the reader...
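
One possible sketch of that exercise, assuming the rules live in a JSON document holding a list of [category, regex] pairs in descending priority. The JSON layout, the `load_rules` helper, and the sample patterns are all hypothetical; the config text is inlined here instead of being read from a file so that the example is self-contained:

```python
import json
import re

# Hypothetical config content; in practice this would be read from a file.
CONFIG_TEXT = """
[
  ["BAR", "football"],
  ["FOO", "foo|Foo"]
]
"""

def load_rules(config_text):
    """Parse [category, regex] pairs, preserving their order as priority."""
    return [(name, re.compile(pattern)) for name, pattern in json.loads(config_text)]

def classify(journaltext, rules):
    for category, pattern in rules:
        if pattern.search(journaltext):
            return category
    return "UNKNOWN"

rules = load_rules(CONFIG_TEXT)
```

A JSON list is used instead of a JSON object so that the priority order is explicit in the file itself.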

Nob answered 8/3, 2012 at 19:55 Comment(2)
How do I support more than one substring/regex for category FOO? I cannot put more than one 'foo' key in the dict. - Deedeeann
You can put more than one substring together in a single regex: "(foo|bar)" matches strings that contain "foo" or "bar". And regexes can be case insensitive; see docs.python.org/howto/regex.html for a Python regex howto. - Verdie
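
A minimal sketch of the alternation idea from this comment, assuming each category is given as a list of literal substrings. The rule data is illustrative; `re.escape` keeps any special characters in the substrings from being read as regex syntax:

```python
import re

# Each category's substrings are escaped and joined with "|" into one pattern.
# re.IGNORECASE makes a single 'foo' entry cover 'foo', 'Foo', 'FOO', and so on.
rules = [
    ("BAR", ["football"]),
    ("FOO", ["foo"]),
]
compiled = [
    (name, re.compile("|".join(re.escape(s) for s in substrings), re.IGNORECASE))
    for name, substrings in rules
]

def classify(journaltext):
    for category, pattern in compiled:
        if pattern.search(journaltext):
            return category
    return None
```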
0

Named groups and the .groupdict() method can be used. An example:

import re
classify = re.compile(r"(?P<TWITTER>twitter|^t\.co$)|(?P<FACEBOOK>facebook|\bfb\b)", re.I)
matched = classify.search("FB")
classifications = {k for k, v in matched.groupdict().items() if v} if matched else set() 

If you hit a limit on the number of named groups (see "Python regular expressions with more than 100 groups?"), it's easy to break the pattern into several smaller ones and take the union of the results.
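
A rough sketch of that splitting approach, assuming the single pattern above is divided into two. The `patterns` list and the `classify_all` helper are hypothetical names:

```python
import re

# Hypothetical split of one large pattern into two smaller ones.
patterns = [
    re.compile(r"(?P<TWITTER>twitter|^t\.co$)", re.I),
    re.compile(r"(?P<FACEBOOK>facebook|\bfb\b)", re.I),
]

def classify_all(text):
    """Return the set of group names that matched in any of the patterns."""
    found = set()
    for pattern in patterns:
        matched = pattern.search(text)
        if matched:
            found |= {k for k, v in matched.groupdict().items() if v}
    return found
```

Splitting changes the behavior slightly: a single alternation reports only the branch of the first match it finds, whereas separate patterns can each report a hit for the same text.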

Suave answered 20/6, 2023 at 10:37 Comment(0)
