How does Apple find dates, times and addresses in emails?

6

133

In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink, and you can create an appointment or look at a map simply by tapping the link. It works not only for emails in English but for other languages as well. I love this feature and would like to understand how they do it.

The naive way to do this would be to have many regular expressions and run them all. However, this is not going to scale very well and will work only for a specific language, date format, etc. I think that Apple must be using some form of machine learning to extract entities (8:00PM, 8PM, 8:00, 0800, 20:00, 20h, 20h00, 2000 etc.).
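
To make the naive approach concrete, here is a minimal Python sketch of a single regular expression that covers the time formats listed above. The pattern, the `parse_time` name and the rule that bare digit runs only count when they are four digits long are all my own assumptions for illustration, not anything Apple is known to use.

```python
import re

# One pattern for: 8:00PM, 8PM, 8:00, 0800, 20:00, 20h, 20h00, 2000
TIME_RE = re.compile(
    r"""
    (?P<hour>[01]?\d|2[0-3])      # hour 0-23, optional leading zero
    (?:
        :(?P<m1>[0-5]\d)          # 8:00, 20:00
      | h(?P<m2>[0-5]\d)?         # 20h, 20h00
      | (?P<m3>[0-5]\d)           # 0800, 2000
    )?
    \s*(?P<ampm>[AaPp][Mm])?      # optional AM/PM suffix
    """,
    re.VERBOSE,
)


def parse_time(token):
    """Return (hour, minute) in 24-hour form, or None if token is not a time."""
    m = TIME_RE.fullmatch(token.strip())
    if m is None:
        return None
    text = m.group(0)
    # Assumption: a bare number is only treated as a time if it has 4 digits.
    if text.isdigit() and len(text) != 4:
        return None
    hour = int(m.group("hour"))
    minute = int(m.group("m1") or m.group("m2") or m.group("m3") or 0)
    ampm = m.group("ampm")
    if ampm:
        if not 1 <= hour <= 12:
            return None
        hour = hour % 12 + (12 if ampm.lower() == "pm" else 0)
    return hour, minute


for token in ["8:00PM", "8PM", "8:00", "0800", "20:00", "20h", "20h00", "2000"]:
    print(token, parse_time(token))
```

Every new format or language means another branch in the pattern, which is the scaling concern raised above.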

Any idea how Apple is able to extract entities so quickly in its email client? What machine learning algorithm would you apply to accomplish such a task?

Tetragon answered 15/2, 2012 at 14:12 Comment(2)
I also thought about this, especially the regex trick. I know they have a patent on it, so maybe you can try to search it. However, I would be very interested in it as well. +1Niela
Actually the regexp trick will probably catch 99% of cases with a very low error rate. And it is super fast when you optimize the regular expressions well. So I would not be surprised if it is indeed just a set of regular expressions.Misti
155

They likely use Information Extraction techniques for this.

Here is a demo of Stanford's SUTime tool:

http://nlp.stanford.edu:8080/sutime/process

You would extract attributes about n-grams (consecutive words) in a document:

  • numberOfLetters
  • numberOfSymbols
  • length
  • previousWord
  • nextWord
  • nextWordNumberOfSymbols
    ...

And then use a classification algorithm, and feed it positive and negative examples:

Observation  nLetters  nSymbols  length  prevWord   nextWord  isPartOfDate
"Feb."       3         1         4       "Wed"      "29th"    TRUE
"DEC"        3         0         3       "company"  "went"    FALSE
...

You might get away with 50 examples of each, but the more the merrier. The algorithm then learns from those examples and can apply what it learned to new examples that it hasn't seen before.
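
A minimal sketch of that pipeline in plain Python, assuming a categorical naive Bayes classifier with Laplace smoothing as the classification algorithm (my choice for illustration; the answer does not name one). The `features` function mirrors the columns of the table above; the four training sentences and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def features(tokens, i):
    """Describe tokens[i] using the attributes from the table above."""
    word = tokens[i]
    return {
        "nLetters": sum(c.isalpha() for c in word),
        "nSymbols": sum(not c.isalnum() for c in word),
        "length": len(word),
        "prevWord": tokens[i - 1].lower() if i > 0 else "<start>",
        "nextWord": tokens[i + 1].lower() if i + 1 < len(tokens) else "<end>",
    }


class NaiveBayes:
    """Categorical naive Bayes; every feature value is treated as a category."""

    def fit(self, rows, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.vocab = defaultdict(set)             # feature -> values seen in training
        for row, label in zip(rows, labels):
            for name, value in row.items():
                self.value_counts[(label, name)][value] += 1
                self.vocab[name].add(value)
        return self

    def predict(self, row):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for name, value in row.items():
                seen = self.value_counts[(label, name)][value]
                # Laplace smoothing; the extra +1 reserves mass for unseen values.
                score += math.log((seen + 1) / (count + len(self.vocab[name]) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training set: (sentence, index of the word, isPartOfDate).
TRAINING = [
    ("The meeting is on Wed Feb. 29th at noon", 5, True),
    ("See you on Mon Mar. 3rd", 4, True),
    ("The company DEC went public", 2, False),
    ("Her company IBM went bankrupt", 2, False),
]

rows = [features(sentence.split(), i) for sentence, i, _ in TRAINING]
labels = [label for _, _, label in TRAINING]
model = NaiveBayes().fit(rows, labels)

print(model.predict(features("Call me on Wed Jan. 29th".split(), 4)))
print(model.predict(features("Our company DEC went bust".split(), 2)))
```

With this toy data the unseen "Jan." is classified as part of a date because its shape and neighbours match the positive examples, while "DEC" after "company" is not. Four examples are far too few for real use; they only show the mechanics.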

It might learn rules such as

  • if previous word is only characters and maybe periods...
  • and current word is in "february", "mar.", "the" ...
  • and next word is in "twelfth", any_number ...
  • then is date
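
Written out by hand, a rule of that shape might look like the following Python sketch. The word lists and the `looks_like_date` name are hypothetical stand-ins; a trained classifier would encode something similar implicitly rather than as explicit sets.

```python
import re

# Hypothetical word lists standing in for what a model might have learned.
MONTH_LIKE = {"february", "feb.", "march", "mar.", "the"}
ORDINAL_WORDS = {
    "first", "second", "third", "fourth", "fifth", "sixth",
    "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
}


def looks_like_date(prev_word, word, next_word):
    """Apply the three conditions from the list above to one word and its neighbours."""
    # Previous word is only letters and maybe periods.
    prev_ok = re.fullmatch(r"[A-Za-z.]+", prev_word) is not None
    # Current word is one of the month-like words.
    word_ok = word.lower() in MONTH_LIKE
    # Next word is a spelled-out ordinal or a number such as "29" or "29th".
    next_lower = next_word.lower()
    next_ok = (
        next_lower in ORDINAL_WORDS
        or re.fullmatch(r"\d+(st|nd|rd|th)?", next_lower) is not None
    )
    return prev_ok and word_ok and next_ok


print(looks_like_date("Wed", "Feb.", "29th"))
print(looks_like_date("on", "the", "twelfth"))
print(looks_like_date("company", "DEC", "went"))
```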

Here is a decent video by a Google engineer on the subject

Heterogeneous answered 18/2, 2012 at 22:4 Comment(5)
el chief, in your opinion, what kind of model would be best for that? Bayesian?Tetragon
I am pretty sure such an approach won't perform better than, say, an F-measure of approx. 0.9. (Note, this is just a feeling, I may be wrong.) On the other hand, I'd expect the naive approach of encoding all common formats to perform way better (possibly 0.99+, given that the most frequent formats will never be missed) and to be faster both to implement and at runtime.Lecythus
@b.buchhold, maybe, but then you'd have to do the same amount of work for the next language, and the next language, whereas my solution is general.Heterogeneous
@Neil McGuigan, true. But you'd have to provide lots of training data for all those formats / languages which is much more work.Lecythus
@NeilMcGuigan thank you so much for this answer. I did what you described above, but I can't figure out how to train on this data or which algorithm to use. I can't use a decision tree since the attributes are not all of the same type.Gautea
117

That's a technology Apple actually developed a very long time ago called Apple Data Detectors. You can read more about it here:

http://www.miramontes.com/writing/add-cacm/

Essentially it parses the text and detects patterns that represent specific pieces of data, then applies OS-contextual actions to it. It's neat.
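
The detect-then-act idea can be sketched in a few lines of Python. This is only an illustration of the shape of such a system: regular expressions stand in for whatever pattern descriptions the real technology uses, and the detector table, the action names and the `detect` function are all hypothetical.

```python
import re
from typing import NamedTuple


class Detection(NamedTuple):
    kind: str       # what sort of data was found
    text: str       # the matched text
    start: int      # character offsets in the source text
    end: int
    actions: tuple  # contextual actions offered for this kind of data


# Hypothetical detector table: (kind, pattern, actions).
DETECTORS = [
    ("url", re.compile(r'https?://[^\s<>"]+'), ("Open in browser", "Add bookmark")),
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), ("Compose message", "Add to contacts")),
]


def detect(text):
    """Run every detector over the text and return the matches in reading order."""
    found = []
    for kind, pattern, actions in DETECTORS:
        for m in pattern.finditer(text):
            found.append(Detection(kind, m.group(0), m.start(), m.end(), actions))
    return sorted(found, key=lambda d: d.start)


for d in detect("Mail bob@example.com or see http://example.com/map today"):
    print(d.kind, repr(d.text), d.actions)
```

Keeping the patterns and the actions in one table means a new kind of data is added by appending a row, without touching the scanning code.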

Maryjomaryl answered 25/2, 2012 at 10:10 Comment(3)
This is the correct answer. Other answers may tell you how you could do it, but this one tells you how Apple does it.Bibliolatry
Could we have a little more detail in the write-up, though? Single-link entries don't add as much.Konstanze
Ah, so THIS is where all the hits on my website came from :) FWIW, I was the project lead on Apple Data Detectors back in the days of ATG; what I can add here is that this was an OS 8 and 9 technology only -- it never made the jump to OS X. There are obviously some similar things happening in OS X and iOS, and, while I'm not at Apple anymore and so can't really say, I wouldn't be surprised if the architecture is a bit different. Nevertheless, I expect some sort of grammar/parser system is still at the heart of it. Computers are fast these days, and simple grammars are pretty cheap.Garter
21

This is called temporal expression identification and parsing. Here are some Google searches to get you started:

https://www.google.com/#hl=en&safe=off&sclient=psy-ab&q=timebank+timeml+timex

https://www.google.com/#hl=en&safe=off&sclient=psy-ab&q=temporal+expression+tagger

Spence answered 15/2, 2012 at 21:12 Comment(1)
+1 for saying what the name of "identifying expressions that refer to time" is in some/much of the literatureHygrometry
7

One part of the puzzle could be the NSDataDetector class. It's used to recognize some standard types like phone numbers.

Unpin answered 24/2, 2012 at 13:12 Comment(2)
It seems the NSDataDetector class is the result of the effort Apple put into implementing this. The question is how does the class work internally?Housekeeper
It's in NSRegularExpression.h, so it seems quite possible that it is, as pointed out, just a set of regular expressions.Twylatwyman
3

I once wrote a parser to do this, using pyparsing. It's really very simple: you just need to handle all the different formats correctly, and there aren't that many. It only took a few hours to write and was pretty fast.

Wayne answered 25/2, 2012 at 10:42 Comment(1)
Extract from Miramontes: "It is not difficult to hardcode a recognizer for an atomic structure such as a URL, but substantial work is required to craft an architecture that opens up the process of creating complex structures."Cathexis
2

Apple has a patent on how they did it, "System and method for performing an action on a structure in computer data", and here's a story on this patent: apples-patent-on-nsdatadetector

Stephi answered 31/7, 2012 at 2:42 Comment(0)
