Most efficient way to count occurrences in XQuery for multiple values

I have 1581 words that I need to look up in an XML corpus of Dutch (500 million words). The corpus itself is split up into many databases. (You can read why here.) We use BaseX as a server (version 7.9), which takes XQuery as its query language.

I want to find out how many times each word occurs in the corpus with a neuter determiner (het) or a non-neuter determiner (de). This is done by looking for an XPath structure that consists of an NP (noun phrase) with two daughters: a determiner whose lemma is de or het, and a head, which is the word I am interested in.

Example structures for de

/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="accelerator"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="accountant"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="ace"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="acroniem"]]

Example structures for het

/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accelerator"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accountant"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="ace"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="acroniem"]]

In XQuery, I would then do it like so, for each XPath structure:

count(for $node in db:open("mydatabase")/treebank/tree/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accelerator"]] return $node)

This works just fine, but it takes a long time. Each time, the same (thousands of) databases need to be opened sequentially, and this process is repeated for each word. My question is: isn't there a way to combine some of these queries? I have some ideas, but I'm not sure how to execute them, and I'm also not sure how many arguments BaseX can deal with.

  1. Merge de and het queries.

This is probably the most straightforward case. By doing so, I would at least halve the number of queries. But I do not know how to distinguish between the two when results are found. For instance, if I change my XPath code to:

... (@lemma="de" or @lemma="het") ...

I should find all cases, but how can I then distinguish between one and the other? In other words, if I use that XPath, I will get one number back from the count function in XQuery, but there is no way for me to know which are de and which are het.

  2. The same idea can be applied to the word attribute near the end.

Instead of executing a new query for each word, I can concatenate them as follows:

... (@word="accelerator" or @word="accountant" or @word="ace" or ...) ...

But again, how can I distinguish between these values? And can I put all 1581 values in one XPath? Can BaseX handle that?

  3. A for loop with a list of words which would then return the results in XML format for a lot of words (maybe all, if BaseX can handle that).

I am no expert in XQuery but in pseudo code I guess something like this is possible:

$wordlist = ['accelerator', 'accountant', 'ace', 'acroniem'];
$determinerlist = ['de', 'het'];
$db = 'mydatabase';
foreach ($wordlist as $word) {
  foreach ($determinerlist as $det) {
    count(for $node in db:open("'.$db.'")/treebank/tree/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="'.$det.'"] and node[@rel="hd" and @pt="n" and @word="'.$word.'"]] return $node);
  }
}

I'm not sure how to assign the count to array variables in XQuery, if that is possible, but XML output could look like this (but better variations are welcome of course):

<results>
  <result word="accelerator">
    <neuter>12</neuter>
    <nonneuter>3</nonneuter>
  </result>
  <result word="accountant">
    <neuter>4</neuter>
    <nonneuter>0</nonneuter>
  </result>
  <result word="ace">
    <neuter>14</neuter>
    <nonneuter>2</nonneuter>
  </result>
  <result word="acroniem">
    <neuter>3</neuter>
    <nonneuter>7</nonneuter>
  </result>
</results>

I could then process this with Perl, using regular expressions or XML::Twig, to get the values I need.

As you can see, the main issue is finding XQuery code that is efficient, given that I have 1581 words to look up in a huge corpus and that the number of databases to go through is large as well (thousands). For each database lookup, a new connection is made through Perl.

If you have any questions, please comment and I'll try to answer as well as possible.

Reisch answered 27/6, 2016 at 14:35 Comment(6)
Without (original) data, it is hard to give proper advice. For reasonable performance, using the right indices will be crucial. BaseX's query info pane in the GUI (or enabling query info for command-line queries) will help you understand when it uses which index. You can also enforce index usage. I'd go for querying the text index using index:texts (I don't think you need the full-text index here) for each word, then filter as needed and count the result. - Suffice
@JensErat Because other users are also running other queries, it is not possible to index to my own needs, unfortunately. - Reisch
An index is crucial for reasonable search performance. Creating the index will speed up all following queries by orders of magnitude (if used properly). Is there any reason against creating an index? Are you sure a text index has not already been created anyway? - Suffice
I think I misunderstood what you mean by indices, then. Can you link me to any low-level documentation about indices in BaseX? - Reisch
The BaseX wiki page on indexes contains a broad description of which indexes exist. There are some very basic indexes generated by default, and the query optimizer tries to use them whenever it is reasonable. For complex queries, it sometimes fails to realize that index usage would be reasonable. With very large amounts of data, using the index explicitly can sometimes speed up query times by orders of magnitude. - Suffice
Thank you for your replies thus far. I'll have to come back to you when I have some more time! - Reisch

In general, BaseX queries will be fastest (often blindingly fast) if you leverage an index rather than making your query traverse a trillion nodes. BaseX creates the TEXT, ATTRIBUTE and TOKEN indices for you by default, unless you have modified the default DB creation options. (BaseX also tries to rewrite queries to leverage the indices, although it is not always successful at this.)
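If you are unsure whether a given database was built with these indices, one way to check is to ask BaseX for the database properties. A minimal sketch, assuming 'mydatabase' is the placeholder name from the question; the exact layout of the output differs between BaseX versions, so just look for the entries that mention the text and attribute indexes:

```xquery
(: 'mydatabase' is a placeholder taken from the question. :)
(: Returns an XML summary of the database, including which indexes exist. :)
db:info('mydatabase')
```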

So assuming that your databases have been built with an ATTRIBUTE index, you should be able to re-write your query along these lines:

db:attribute('dbname', 'accelerator', 'word')/parent::*

The db:attribute function as used above will return, for database 'dbname', the parent element of any attribute with 'accelerator' as the value for @word. Obviously you can predicate this query as much as needed, something like this, judging from your previous example:

db:attribute('dbname', 'accelerator', 'word')
      (: keep @word only if its element is a head noun directly inside an NP with a "het" determiner :)
      [parent::node[@rel="hd" and @pt="n"]
        /parent::node[@cat="np"]
          [child::node[@rel="det" and @pt="lid" and @lemma="het"]]
      ]
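To get the per-word XML output sketched in the question from a single query per database, the index lookup can be wrapped in a loop over the word list. This is only a sketch under assumptions: 'mydatabase' and the four words are the placeholders from the question, the variable names $db, $words and $nps are mine, and the database is assumed to have an attribute index. Unlike the original /treebank/tree/node path, it matches NPs at any depth in the tree.

```xquery
(: Sketch only: database name and word list are placeholders from the question. :)
declare variable $db := 'mydatabase';
declare variable $words := ('accelerator', 'accountant', 'ace', 'acroniem');

<results>{
  for $word in $words
  (: Attribute index lookup on @word, then step up to the head noun
     and from there to the NP that directly contains it. :)
  let $nps := db:attribute($db, $word, 'word')
                /parent::node[@rel = 'hd' and @pt = 'n']
                /parent::node[@cat = 'np']
  return
    <result word="{ $word }">
      <neuter>{ count($nps[node[@rel = 'det' and @pt = 'lid' and @lemma = 'het']]) }</neuter>
      <nonneuter>{ count($nps[node[@rel = 'det' and @pt = 'lid' and @lemma = 'de']]) }</nonneuter>
    </result>
}</results>
```

Each word costs one index lookup, and the de and het counts are taken from the same set of NPs, so the determiner no longer needs its own query. I have not tested how this behaves with all 1581 words or across thousands of databases, so treat it as a starting point.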

Here is good documentation on BaseX's index functionality. I've used it to great effect to speed up querying of large (> 20 GB) databases.

http://docs.basex.org/wiki/Indexes

Elasmobranch answered 11/7, 2016 at 15:22 Comment(0)
