UIMA Ruta out-of-memory issue in Spark context

I'm running a UIMA application on Apache Spark. Millions of pages arrive in batches to be processed by UIMA Ruta for calculation. Sometimes I get an out-of-memory exception. The failure is not consistent: one run successfully processes 2000 pages, while another fails after 500 pages.

Application Log

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57)
        at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39)
        at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)
        at org.apache.uima.cas.impl.Heap.add(Heap.java:241)
        at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844)
        at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489)
        at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68)
        at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73)
        at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)

UIMA RUTA SCRIPT

WORDLIST EnglishStopWordList = 'stopWords.txt';
WORDLIST FiltersList = 'AnchorFilters.txt';
DECLARE Filters, EnglishStopWords;
DECLARE Anchors, SpanStart,SpanClose;

DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)};

DocumentAnnotation{-> MARKFAST(Filters, FiltersList)};

STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+";

DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)};
(SW | CW | CAP ) { -> MARK(Anchors, 1, 2)};
Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)};

(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};

Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)};
MixCharacterRegex -> Anchors;

"<Value>"  -> SpanStart;
"</Value>" -> SpanClose;

Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)};

SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};
Ilex answered 4/6, 2017 at 5:30 Comment(2)
Can you add a link to a (mock) document which causes such problems?Tegucigalpa
Which Ruta version are you using?Tegucigalpa

Normally, the reasons for high memory usage in UIMA Ruta can be found in RutaBasic (many annotations, coverage information) or in RuleMatch (inefficient rules, many rule element matches).

In your example, the problem seems to originate somewhere else. The stack trace indicates that the memory is used up by some disjunctive rule element, which requires creating new annotations to store the match information.

It seems that your version of UIMA Ruta is rather old, since the line numbers do not match the source I am looking at.

There are seven (!!!) calls of continueOwnMatch in the stack trace. I looked for a rule that could cause something like this but found none. This could be an old flaw that has been fixed in newer versions, or some preprocessing added additional CW/SW/CAP annotations.

As a first step, I would suggest two things:

  1. Update to UIMA Ruta 2.6.0
  2. Get rid of all disjunctive rule elements

The disjunctive rule elements are not really needed in your script. In general, they should not be used at all unless really required. I do not use them at all in production rules.

Instead of (SW | CW | CAP ) you can simply write W.

Instead of (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) you can write ANY{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))}.
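Applied to the last MARK rule of the script, the two substitutions might look like the following sketch. It assumes the default Ruta seeding, where SW, CW and CAP are subtypes of W. Note that a conditioned ANY can also accept tokens that are not SPECIAL, so the results should be compared against the original rule.

```
// Original: (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};
// Sketch: W replaces the first disjunction, a conditioned ANY replaces the second.
W ANY{OR(REGEXP("['\"-=()\\[\\]]"), IS(PM))} EnglishStopWords? {-> MARK(Anchors, 1, 3)};
```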

Using ANY as a matching condition can reduce the runtime performance. In this example, two rules instead of the rule element rewrite might be better, e.g., something like:

SPECIAL{REGEXP("['\"-=()\\[\\]]")} W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};
PM W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};

(Note: optional rule elements at the start of a rule are not actually optional if the rule contains no anchors.)
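If a leading element really has to stay optional, Ruta allows the rule element where matching starts to be marked with `@`. A minimal sketch for the variant with an optional first element, assuming the installed Ruta version supports manual anchors (untested against the asker's data):

```
// Matching starts at W (the element prefixed with @), so the element before it can be skipped.
ANY?{OR(REGEXP("['\"-=()\\[\\]]"), IS(PM))} @W ANY{OR(REGEXP("['\"-=()\\[\\]]"), IS(PM))} EnglishStopWords? {-> MARK(Anchors, 1, 4)};
```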

By the way, there is a lot of room for optimization in your rules. If I had to guess, I'd say you can get rid of at least half the rules and 90% of all created annotations, which would also considerably reduce the memory usage.

DISCLAIMER: I am a developer of UIMA Ruta

Tegucigalpa answered 8/6, 2017 at 19:54 Comment(9)
I tried changing the rules as you suggested, but performance degraded by 10-15%.Ilex
Ok, that's strange. Did you have some overlapping Anchors before? How do you evaluate the performance (=accuracy?)? The rewrite should not change the result.Tegucigalpa
Rewriting the rules gives me exactly the same results. By performance I mean the time taken to compute the anchors. I'm using Ruta in Spark on batches to extract anchors from pages, and previously this took less time. No doubt the rewrite may use less memory, but I don't have a benchmark for that yet.Ilex
One more thing: after increasing the executor memory I no longer get the out-of-memory exception. But since my hardware is limited, I'm looking for improvements on the Ruta side. Right now I don't have the bandwidth to upgrade the Ruta version, as it may give me different results or issues, but I do think rule rewriting and a version upgrade would boost performance.Ilex
Yes, there is much room for optimizing the rules. I'd guess it could be ten times faster. I'll adapt the answer to avoid the performance overhead.Tegucigalpa
Yes, I agree; I'm not facing the memory issue anymore. But how can I achieve a 10x speedup apart from rewriting the rules?Ilex
The comments on this question are not the right place to discuss speed optimizations. Ask on the UIMA user mailing list and provide an example document of representative size. I'll help you optimize it, but I am quite busy for the next 2 weeks.Tegucigalpa
Essentially, you need to reduce the number of matches and the usage of UIMA iterators. Do you need the Anchors annotations at all? Use an mtwl instead of the MARKFAST. Merge some rules, and move some checks to the mtwl, since you already need to check the complete document there. What is the output of your script? Data? Then you can make all rules dependent on the anchor of the last one and avoid a lot of matches.Tegucigalpa
Let me create some sample data, as the original data is sensitive. I will post it to the mailing list.Ilex
