Most efficient way to count occurrences in XQuery for multiple values

I have 1581 words that I need to look up in an XML corpus of Dutch (500 million words). The corpus itself is split across many databases. (You can read why here.) We use BaseX (version 7.9) as the server, which is queried with XQuery.

I want to find out how often each word occurs in the corpus with a neuter determiner (het) or a non-neuter determiner (de). I do this by looking for an XPath structure consisting of an NP (noun phrase) with two daughters: a determiner whose lemma is de or het, and a head, which is the word I am interested in.

Example structures for de

/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="accelerator"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="accountant"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="ace"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="acroniem"]]

Example structures for het

/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accelerator"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accountant"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="ace"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="acroniem"]]

In XQuery, I would then do it like so for each XPath structure:

count(for $node in db:open("mydatabase")/treebank/tree/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accelerator"]] return $node)

This works just fine, but it takes a long time. Each time, the same (thousands of) databases need to be opened sequentially, and this process is repeated for each word. My question is: is there a way to combine some of these queries? I have some ideas, but I'm not sure how to execute them. I'm also not sure how many arguments BaseX can deal with.

Merge de and het queries.

This is probably the most straightforward case. By doing so, I would at least cut the number of queries in half. But I do not know how to distinguish between the two when results are found. For instance, if I change my XPath code to:

... (@lemma="de" or @lemma="het") ...

I should find all cases, but how can I then distinguish between one and the other? In other words, if I use that XPath, I will get one number back from the count function in XQuery, but there is no way for me to know which matches are de and which are het.

The same idea can be applied to the word attribute near the end

Instead of executing a new query for each word, I can concatenate them as follows:

... (@word="accelerator" or @word="accountant" or @word="ace" or ...) ...

But again, how can I distinguish between these values? And can I put all 1581 values in one XPath? Can BaseX handle that?

A for loop with a list of words which would then return back the results in XML format for a lot of words (maybe all, if BaseX can handle that).

I am no expert in XQuery but in pseudo code I guess something like this is possible:

$wordlist = ['accelerator', 'accountant', 'ace', 'acroniem'];
$determinerlist = ['de', 'het'];
$db = 'mydatabase';
foreach ($wordlist as $word) {
  foreach ($determinerlist as $det) {
    count(for $node in db:open("'.$db.'")/treebank/tree/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="'.$det.'"] and node[@rel="hd" and @pt="n" and @word="'.$word.'"]] return $node);
  }
}

I'm not sure how to assign the count to array variables in XQuery, if that is possible, but XML output could look like this (but better variations are welcome of course):

<results>
  <result word="accelerator">
    <neuter>12</neuter>
    <nonneuter>3</nonneuter>
  </result>
  <result word="accountant">
    <neuter>4</neuter>
    <nonneuter>0</nonneuter>
  </result>
  <result word="ace">
    <neuter>14</neuter>
    <nonneuter>2</nonneuter>
  </result>
  <result word="acroniem">
    <neuter>3</neuter>
    <nonneuter>7</nonneuter>
  </result>
</results>

I could then process this in Perl, with regular expressions or XML::Twig, to get the values I need.

As you can see, the main issue is finding XQuery code that is efficient, given that I have 1581 words to look up in a huge corpus and that the number of databases to go through is large as well (thousands). For each database lookup, a new connection is made through Perl.

If you have any questions, please comment and I'll try to answer as well as possible.

In general, BaseX queries will be fastest (often blindingly fast) if you leverage an index rather than making your query traverse a trillion nodes. BaseX creates the TEXT, ATTRIBUTE and TOKEN indices for you by default, unless you have modified the default DB creation options. (BaseX also tries to rewrite queries to use the indices, although it does not always succeed.)

So, assuming that your databases have been built with an ATTRIBUTE index, you should be able to rewrite your query along these lines:

db:attribute('dbname', 'accelerator', 'word')/parent::*

The db:attribute function as used above will return, for database 'dbname', the parent element of any attribute with 'accelerator' as the value for @word. You can add as many predicates to this query as needed. Judging from your example, something like this:

(: index lookup on @word, then head noun -> its parent NP, which must also have a "het" determiner :)
db:attribute('dbname', 'accelerator', 'word')
  /parent::node[@rel="hd" and @pt="n"]
  /parent::node[@cat="np"]
    [node[@rel="det" and @pt="lid" and @lemma="het"]]
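Extending this to the whole word list, a minimal sketch could loop over the words and both determiners and emit the result format proposed in the question. The database name 'mydatabase' and the four-word list are placeholders taken from the question, and local:count-np is a hypothetical helper, not a BaseX function. The sketch assumes the head noun is a direct child of the NP and, unlike the /treebank/tree/node path in the question, it matches NPs at any depth in the tree.

```xquery
declare variable $db := 'mydatabase';  (: placeholder database name :)
declare variable $words := ('accelerator', 'accountant', 'ace', 'acroniem');

(: Hypothetical helper: counts NPs in $db whose head noun is $word
   and whose determiner has lemma $det, starting from the attribute index. :)
declare function local:count-np(
  $db   as xs:string,
  $word as xs:string,
  $det  as xs:string
) as xs:integer {
  count(
    db:attribute($db, $word, 'word')
      /parent::node[@rel = 'hd' and @pt = 'n']
      /parent::node[@cat = 'np']
        [node[@rel = 'det' and @pt = 'lid' and @lemma = $det]]
  )
};

<results>{
  for $word in $words
  return
    <result word="{ $word }">
      <neuter>{ local:count-np($db, $word, 'het') }</neuter>
      <nonneuter>{ local:count-np($db, $word, 'de') }</nonneuter>
    </result>
}</results>
```

This sends one request per database instead of one per word and determiner. To cover several databases in a single request, the helper could iterate over a sequence of database names (for instance from db:list()) and sum the counts, though whether that is faster on thousands of databases would need to be measured.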

Here is good documentation on BaseX's index functionality. I've used it to great effect to speed up querying of large (> 20 GB) databases.

http://docs.basex.org/wiki/Indexes
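On the question of merging the de and het lookups (and the word lookups) into one query while still telling the results apart, one possible approach is a single pass over the NPs with the XQuery 3.0 group by clause. This is only a sketch under assumptions: the database name and word list are placeholders from the question, it does not call the index functions explicitly, and I have not measured how it performs with all 1581 words.

```xquery
declare variable $db := 'mydatabase';  (: placeholder database name :)
declare variable $words := ('accelerator', 'accountant', 'ace', 'acroniem');

<results>{
  for $np in db:open($db)//node[@cat = 'np']
  (: "=" against a sequence is true if any item matches :)
  let $det := $np/node[@rel = 'det' and @pt = 'lid' and @lemma = ('de', 'het')]
  let $hd  := $np/node[@rel = 'hd' and @pt = 'n' and @word = $words]
  where $det and $hd
  let $word := string($hd[1]/@word)
  group by $word
  order by $word
  return
    (: after grouping, $np holds all NPs found for this word :)
    <result word="{ $word }">
      <neuter>{
        count($np[node[@rel = 'det' and @pt = 'lid' and @lemma = 'het']])
      }</neuter>
      <nonneuter>{
        count($np[node[@rel = 'det' and @pt = 'lid' and @lemma = 'de']])
      }</nonneuter>
    </result>
}</results>
```

Words that never occur with either determiner produce no result element here, so the consumer would have to treat a missing word as a count of zero.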
