I am writing an ETL script in Python that reads data from CSV files, validates and sanitizes it, categorizes (classifies) each row according to a set of rules, and finally loads it into a PostgreSQL database.
The data looks like this (simplified):
ColA, ColB, Timestamp, Timestamp, Journaltext, AmountA, AmountB
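For reference, I currently read the file roughly like this (simplified; the file name and the field names in the namedtuple are just what I call them, not the real headers):

    import csv
    from collections import namedtuple

    # Field names are illustrative; the second Timestamp is renamed to avoid a clash.
    Row = namedtuple("Row", "col_a col_b timestamp1 timestamp2 journaltext amount_a amount_b")

    with open("transactions.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)                        # skip the header line
        rows = [Row(*line) for line in reader]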
Each row is a financial transaction. What I want to do is categorize or classify transactions based on some rules. The rules are basically regular expressions that are matched against the text in the Journaltext column.
So what I want to do is something like this:
    transactions = []
    for row in rows:
        t = Transaction(category=classify(row.journaltext))
        transactions.append(t)
I am not sure how to write the classify() function efficiently.
This is how the rules for classification work:
- There are a number of categories (more can and will be added later)
- Each category has a set of substrings or regular expressions; if the Journaltext of a transaction matches one of these expressions or contains one of these substrings, the transaction belongs to that category.
- A transaction can only be in one category
- If a category FOO has substrings 'foo' and 'Foo', and another category BAR has the substring 'football', then a transaction with Journaltext='food' must be put in category FOO, because it only matches FOO, but a transaction with Journaltext='footballs' must be placed in category BAR. I think this means that I have to put a priority or similar on each category (see the sketch after this list).
- If a transaction does not match any of the expressions, its category is either None or it goes into a placeholder category called "UNKNOWN" or similar. This does not matter much.
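To make the rules concrete, this is the sort of structure I have vaguely in mind, though I don't know whether it is a sensible way to do it (the category names and patterns below are just the examples from above):

    import re

    # Rules in priority order: BAR comes before FOO so that 'footballs'
    # matches BAR even though it also contains 'foo'.
    RULES = [
        ("BAR", [re.compile(r"football")]),
        ("FOO", [re.compile(r"foo"), re.compile(r"Foo")]),
    ]

    def classify(journaltext):
        # Return the first category whose patterns match; fall back to UNKNOWN.
        for category, patterns in RULES:
            if any(p.search(journaltext) for p in patterns):
                return category
        return "UNKNOWN"

    # classify("food")      -> "FOO"
    # classify("footballs") -> "BAR"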
Ok. So how do I represent these categories and the corresponding rules in Python?
I would really appreciate your input, even if you cannot provide a full solution. Anything that hints me in the right direction would be great. Thanks.