Ontology-based string classification
Asked Answered
C

4

8

I recently started working with ontologies and I am using Protege to build an ontology which I'd also like to use for automatically classifying strings. The following illustrates a very basic class hierarchy:

String
|_ AlphabeticString
   |_ CountryName
   |_ CityName
|_ AlphaNumericString
   |_ PrefixedNumericString
|_ NumericString

Eventually strings like Spain should be classified as CountryName or UE4564 would be a PrefixedNumericString.

However I am not sure how to model this knowledge. Would I have to first define if a character is alphabetic, numeric, etc. and then construct a word from the existing characters or is there a way to use Regexes? So far I only managed to classify strings based on an exact phrase like String and hasString value "UE4565".

Or would it be better to safe a regex for each class in the ontology and then classify the string in Java using those regexes?

Clasping answered 6/3, 2012 at 12:40 Comment(0)
C
6

An approach that might be appropriate here, especially if the ontology is large/complicated or might change in the future, and assuming that some errors are acceptable, is machine learning.

An outline of a process utilizing this approach might be:

  1. Define a feature set you can extract from each string, relating to your ontology (some examples below).
  2. Collect a "train set" of strings and their true matching categories.
  3. Extract features from each string, and train some machine-learning algorithm on this data.
  4. Use the trained model to classify new strings.
  5. Retrain or update your model as needed (e.g. when new categories are added).

To illustrate more concretely, here are some suggestions based on your ontology example.

Some boolean features that might be applicable: does the string matches a regexp (e.g the ones Qtax suggests); does the string exist in a prebuilt known city-names list; does it exist in a known country-names list; existence of uppercase letters; string length (not boolean), etc.

So if, for instance, you have a total of 8 features: match to the 4 regular expressions mentioned above; and the additional 4 suggested here, then "Spain" would be represented as (1,1,0,0,1,0,1,5) (matching the first 2 regular expressions but not the last two, is a city name but not a country name, has an uppercase letter and length is 5).

This set of feature will represent any given string.

to train and test a machine learning algorithm, you can use WEKA. I would start from rule or tree based algorithms, e.g. PART, RIDOR, JRIP or J48.

Then the trained models can be used via Weka either from within Java or as an external command line.

Obviously, the features I suggest have almost 1:1 match with your Ontology, but assuming your taxonomy is larger and more complex, this approach would probably be one of the best in terms of cost-effectiveness.

Choosy answered 13/3, 2012 at 14:32 Comment(1)
That was the best answer I've read in a while! In fact it was so good that I now want to try it out for myself. Thanks, etovLynwoodlynx
G
2

I don't know anything about Protege, but you can use regex to match most of those cases. The only problem would be differentiating between country and city name, I don't see how you could do that without a complete list of either one.

Here are some expressions that you could use:

  • AlphabeticString:

    ^[A-Za-z]+\z (ASCII) or ^\p{Alpha}+\z (Unicode)

  • AlphaNumericString:

    ^[A-Za-z0-9]+\z (ASCII) or ^\p{Alnum}+\z (Unicode)

  • PrefixedNumericString:

    ^[A-Za-z]+[0-9]+\z (ASCII) or ^\p{Alpha}+\p{N}+\z (Unicode)

  • NumericString:

    ^[0-9]+\z (ASCII) or ^\p{N}+\z (Unicode)

Gentlewoman answered 12/3, 2012 at 14:46 Comment(1)
A string could be both a city name and a country name (well, conceptually based on the facts given so far). An ontology does not need to have single inheritance.Fortalice
F
2

A particular string is an instance, so you'll need some code to make the basic assertions about the particular instance. That code itself might contain the use of regular expressions. Once you've got those assertions, you'll be able to use your ontology to reason about them.

The hard part is that you've got to decide what level you're going to model at. For example, are you going to talk about individual characters? You can, but it's not necessarily sensible. You've also got the challenge that arises from the fact that negative information is awkward (as the basic model of such models is intuitionistic, IIRC) which means (for example) that you'll know that a string contains a numeric character but not that it is purely numeric. Yes, you'd know that you don't have an assertion that the instance contains an alphabetic character, but you wouldn't know whether that's because the string doesn't have one or just because nobody's said so yet. This stuff is hard!

It's far easier to write an ontology if you know exactly what problems you intend to solve with it, as that allows you to at least have a go at working out what facts and relations you need to establish in the first place. After all, there's a whole world of possible things that could be said which are true but irrelevant (“if the sun has got his hat on, he'll be coming out to play”).

Fortalice answered 12/3, 2012 at 15:49 Comment(0)
A
1

Responding directly to your question, you start by checking whether a given token is numeric, alphanumeric or alphabetic (you can use regex here) and then you classify it as such. In general, the approach you're looking for is called generalization hierarchy of tokens or hierarchical feature selection (Google it). The basic idea is that you could treat each token as a separate element, but that's not the best approach since you can't cover them all [*]. Instead, you use common features among tokens (for example, 2000 and 1981 are distinct tokens but they share a common feature of being 4 digit numbers and possibly years). Then you have a class for four digit numbers, another for alphanumeric, and so on. This process of generalization helps you to simplify your classification approach.

Frequently, if you start with a string of tokens, you need to preprocess them (for example, remove punctuation or special symbols, remove words that are not relevant, stemming, etc). But maybe you can use some symbols (say, punctuation between cities and countries - e.g. Melbourne, Australia), so you assign that set of useful punctuation symbols to other symbol (#) and use that as a context (so the next time you find an unknown word next to a comma next to a known country, you can use that knowledge to assume that the unknown word is a city.

Anyway, that's the general idea behind classification using an ontology (based on a taxonomy of terms). You may also want to read about part-of-speech tagging.

By the way, if you only want to have 3 categories (numeric, alphanumeric, alphabetic), a viable option would be to use edit distance (what is more likely, that UA4E30 belongs to the alphanumeric or numeric category, considering that it doesn't correspond to the traditional format of prefixed numeric strings?). So, you assume a cost for each operation (insertion, deletion, subtitution) that transforms your unknown token into a known one.

Finally, although you said you're using Protege (which I haven't used) to build your ontology, you may want to look at WordNet.

[*] There are probabilistic approaches that help you to determine a probability for an unknown token, so the probability of such event is not zero. Usually, this is done in the context of Hidden Markov Models. Actually, this could be useful to improve the suggestion given by etov.

Aleron answered 16/3, 2012 at 18:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.