I have 1581 words that I need to look up in an XML corpus of Dutch (500 million words). The corpus itself is split up into many databases. (You can read why here.) We use BaseX as a server (version 7.9), which takes XQuery as input.
I want to find out how many times each word occurs in the corpus with a neuter determiner (het) or a non-neuter determiner (de). I do this by looking for an XPath structure that consists of an NP (noun phrase) with two daughters: a determiner whose lemma is de or het, and a head, which is the word I am interested in.
Example structures for de
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="accelerator"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="accountant"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="ace"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="de"] and node[@rel="hd" and @pt="n" and @word="acroniem"]]
Example structures for het
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accelerator"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accountant"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="ace"]]
/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="acroniem"]]
In XQuery, I would then do it like so, for each XPath structure:
count(for $node in db:open("mydatabase")/treebank/tree/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="het"] and node[@rel="hd" and @pt="n" and @word="accelerator"]] return $node)
This works just fine, but it takes a long time. Each time, the same (thousands of) databases need to be opened sequentially, and this process is repeated for each word. My question is: is there a way to combine some of these queries? I have some ideas, but I'm not sure how to execute them, and I'm also not sure how many arguments BaseX can deal with.
- Merge de and het queries.
This is probably the most straightforward case. By doing so, I at least cut the number of queries in half. But I do not know how to distinguish between the two when results are found. For instance, if I change my XPath code to:
... (@lemma="de" or @lemma="het") ...
I should find all cases, but how can I then distinguish between one and the other? In other words, if I use that XPath, I will get a single number back from the count function in XQuery, with no way for me to know how many are de and how many are het.
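One way the two determiners might be kept apart in a single pass is to group the matching NPs by the determiner's lemma instead of counting them all at once. This is only a minimal sketch, assuming the XQuery 3.0 `group by` clause is available in the BaseX version in use; the `<count>` element and its `lemma` attribute are names made up for illustration.

```xquery
(: One pass over the database for both determiners, then one count per lemma. :)
for $np in db:open("mydatabase")/treebank/tree/node[
    @cat = "np"
    and node[@rel = "det" and @pt = "lid" and @lemma = ("de", "het")]
    and node[@rel = "hd" and @pt = "n" and @word = "accelerator"]
  ]
(: lemma of the first matching determiner daughter, used as the grouping key :)
let $det := $np/node[@rel = "det" and @pt = "lid"
                     and @lemma = ("de", "het")][1]/@lemma/string()
group by $det
(: after grouping, $np holds every NP that shares this lemma :)
return <count lemma="{$det}">{ count($np) }</count>
```

A lemma with no matches produces no `<count>` element at all, so a missing element would have to be read as zero.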
- The same idea can be applied to the word attribute near the end
Instead of executing a new query for each word, I can concatenate them as follows:
... (@word="accelerator" or @word="accountant" or @word="ace" or ...) ...
But again, how can I distinguish between these values? And can I put all 1581 values in one XPath? Can BaseX handle that?
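Rather than a long chain of `or` clauses, the words could be held in a sequence: the general comparison `@word = $words` is true when the attribute equals any item of the sequence. The sketch below groups the hits by head word so the counts stay separate. It assumes XQuery 3.0 `group by`, and the `<count>` element is a hypothetical name. Whether a 1581-item sequence performs acceptably is something to measure rather than assume.

```xquery
(: Hypothetical short word list; the real one would hold all 1581 words. :)
let $words := ("accelerator", "accountant", "ace", "acroniem")
for $np in db:open("mydatabase")/treebank/tree/node[
    @cat = "np"
    and node[@rel = "det" and @pt = "lid" and @lemma = "het"]
    and node[@rel = "hd" and @pt = "n" and @word = $words]
  ]
(: the head word that matched, used as the grouping key :)
let $word := $np/node[@rel = "hd" and @pt = "n"
                      and @word = $words][1]/@word/string()
group by $word
return <count word="{$word}">{ count($np) }</count>
```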
- A for loop with a list of words which would then return back the results in XML format for a lot of words (maybe all, if BaseX can handle that).
I am no expert in XQuery, but in pseudocode I guess something like this is possible:
$wordlist = ['accelerator', 'accountant', 'ace', 'acroniem'];
$determinerlist = ['de', 'het'];
$db = 'mydatabase';
foreach ($wordlist as $word) {
    foreach ($determinerlist as $det) {
        count(for $node in db:open("'.$db.'")/treebank/tree/node[@cat="np" and node[@rel="det" and @pt="lid" and @lemma="'.$det.'"] and node[@rel="hd" and @pt="n" and @word="'.$word.'"]] return $node);
    }
}
I'm not sure how to assign the counts to array variables in XQuery, if that is possible at all, but the XML output could look like this (better variations are welcome, of course):
<results>
  <result word="accelerator">
    <neuter>12</neuter>
    <nonneuter>3</nonneuter>
  </result>
  <result word="accountant">
    <neuter>4</neuter>
    <nonneuter>0</nonneuter>
  </result>
  <result word="ace">
    <neuter>14</neuter>
    <nonneuter>2</nonneuter>
  </result>
  <result word="acroniem">
    <neuter>3</neuter>
    <nonneuter>7</nonneuter>
  </result>
</results>
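For reference, output of this shape could in principle be built directly in XQuery with a FLWOR expression and element constructors, with no array variables involved. This is a rough sketch, not a tuned solution: the word list is shortened to the four example words, and the query still evaluates the path once per word.

```xquery
(: Shortened, hypothetical word list; the database name is the one used above. :)
let $words := ("accelerator", "accountant", "ace", "acroniem")
let $db := db:open("mydatabase")
return
  <results>{
    for $word in $words
    (: all NPs headed by $word with either determiner :)
    let $nps := $db/treebank/tree/node[
        @cat = "np"
        and node[@rel = "det" and @pt = "lid" and @lemma = ("de", "het")]
        and node[@rel = "hd" and @pt = "n" and @word = $word]
      ]
    return
      <result word="{$word}">
        <neuter>{
          count($nps[node[@rel = "det" and @pt = "lid" and @lemma = "het"]])
        }</neuter>
        <nonneuter>{
          count($nps[node[@rel = "det" and @pt = "lid" and @lemma = "de"]])
        }</nonneuter>
      </result>
  }</results>
```

Unlike a grouped query, this variant emits a `<result>` element for every word, including words with zero hits for both determiners.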
I could then process this in Perl, with regular expressions or XML::Twig, to get the values I need.
As you can see, the main issue is finding XQuery code that is efficient, given that I have 1581 words to look up in a huge corpus and that the number of databases to go through is large as well (thousands). For each database lookup, a new connection is made through Perl.
If you have any questions, please comment and I'll try to answer as well as possible.
index:texts (I don't think you need the full text index here) for each word, then filter as needed and count the result. – Suffice