How to group wikipedia categories in python?

For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.

  • hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
  • enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
  • bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
  • perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
  • climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

As you can see, the first three concepts belong to the medical domain, whereas the remaining two are not medical terms.

More precisely, I want to divide my concepts into medical and non-medical. However, it is very difficult to do this using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are both in the medical domain, their categories are very different from each other.

Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to medical parent category)

I am currently using pymediawiki and pywikibot. However, I am not restricted to those two libraries and am happy to have solutions using other libraries as well.

EDIT

As suggested by @IlmariKaronen, I am also using the categories of categories, and the results I got are as follows (the small font next to each category shows the categories of that category). (image not shown)

However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.

Moreover, as pointed out by @IlmariKaronen, using WikiProject details could be promising. However, the Medicine WikiProject does not seem to cover all the medical terms. Therefore we would need to check other WikiProjects as well.

EDIT: My current code for extracting categories from Wikipedia concepts is as follows. This can be done using either pymediawiki or pywikibot.

  1. Using the library pymediawiki

    from mediawiki import MediaWiki

    # Create a client for the default (English) Wikipedia API
    wikipedia = MediaWiki()

    p = wikipedia.page('enzyme inhibitor')
    print(p.categories)
    
  2. Using the library pywikibot

    import pywikibot as pw
    
    site = pw.Site('en', 'wikipedia')
    
    print([
        cat.title()
        for cat in pw.Page(site, 'support-vector machine').categories()
        if 'hidden' not in cat.categoryinfo
    ])
    

The categories of categories can be obtained in the same way, as shown in the answer by @IlmariKaronen.

If you are looking for a longer list of concepts for testing, I have included more examples below.

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']

For a very long list please check the link below. https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

NOTE: I am not expecting the solution to work 100% of the time (if the proposed algorithm is able to detect many of the medical concepts, that is enough for me).

I am happy to provide more details if needed.

Ori answered 11/2, 2019 at 7:10 Comment(19)
The ones which are squarely medicine have e.g. ICD links, though that excludes the enzyme one.Oren
What coding have you tried?Nutpick
Quick question. What do you mean by medical category? Wikipedia has multiple medical categories. Are you looking for anything medical related? Or specifically practicing medicine?Typecast
@EdekiOkoh Thanks a lot for your comment. Yes, I am looking for anything medical related :)Ori
My biggest fear. NLP is probably the only real way to do it since it is extremely inefficient to make a dictionary of every medically related term on Wikipedia. I can write something up over the weekend.Typecast
Can you post your script so far? Would be easier to see how you are currently implementing your code rather than writing something from scratch just to have you tell me you aren't doing it that wayTypecast
@EdekiOkoh Sorry for the late response as I just checked my stackoverflow. Sure, I will update the question. Give me 10 minutes :)Ori
@EdekiOkoh I edited the question. Please let me know if you need any further details. Looking forward to hearing from you. Thank you :)Ori
paws-public.wmflabs.org/paws-public/User:Luitzen/Medicine.ipynbTevis
@StanislavKralin It seems like some of the words in my dataset are not in dbc, the issue maybe that I have preprocessed my data and all of them are lowercased and without symbols. Is there a way to get nearly equal concepts from dbc? For example given marine_oil, it returns the corresponding dbc concept of it without giving an error in the code? :)Ori
These dbc:-s are Wikipedia categories you have extracted. I've proposed to check if these categories have dbc:Medicine as an ancestral category. If more than, say, half of concept categories have dbc:Medicine as an ancestral category, you could consider this concept to be 'medical'.Tevis
@StanislavKralin Thanks a lot. It is clear to me now. Can you please tell me what is the two numbers in skos:broader{1,7}? :)Ori
These {m,n} are Virtuoso-specific extensions of SPARQL 1.1 property paths. You could also try "unqualified" skos:broader+ or skos:broader*.Tevis
@StanislavKralin Thank you very much for the suggestions. Do you mean something like this: sparql.setQuery(" ASK { dbc:Lipid_metabolism_disorders "unqualified" skos:broader+ dbc:Medicine } ")? Please correct me if I am wrong :)Ori
" ASK { dbc:Lipid_metabolism_disorders skos:broader+ dbc:Medicine } "Tevis
@StanislavKralin what does the + denote in skos:broader? :)Ori
w3.org/TR/sparql11-query/#propertypathsTevis
@StanislavKralin Does {1,7} mean that it recursively goes up to 7 hierarchical levels? I still do not understand what is meant by +. The link you mentioned above says: A path that connects the subject and object of the path by one or more matches of elt. What does this mean? :)Ori
This is SPARQL :-). + means one or more, * means zero or more. {1,7} means from one to seven hops, but supported only by Virtuoso SPARQL endpoint.Tevis

Solution Overview

Okay, I would approach the problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: in your binary case, predict the label agreed upon by more than 50% of the classifiers).

I'm thinking about the following approaches:

  • Active learning (an example approach is provided by me below)
  • MediaWiki backlinks, provided as an answer by @TavoGC
  • SPARQL ancestral categories, provided as a comment to your question by @Stanislav Kralin, and/or parent categories, provided by @Meena Nagarajan (those two could be an ensemble on their own based on their differences, but for that you would have to contact both creators and compare their results).

This way 2 out of 3 would have to agree that a certain concept is a medical one, which further reduces the chance of an error.
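The voting step itself is tiny. Here is a minimal sketch, assuming each of the three approaches has already produced a True/False label for a concept; the function name and the example votes are made up for illustration and are not taken from any of the answers.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


# Hypothetical labels for one concept from the three approaches:
# active learning, backlinks, ancestral categories.
votes_for_concept = [True, False, True]
is_medical = majority_vote(votes_for_concept)
print(is_medical)
```

With an odd number of voters there is never a tie, which is one reason to combine exactly three approaches.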

While we're at it, I would argue against the approach presented by @anand_v.singh in this answer, because:

  • the distance metric should not be Euclidean; cosine similarity is a much better metric (used by, e.g., spaCy), as it does not take into account the magnitude of the vectors (and it shouldn't, that's how word2vec or GloVe were trained)
  • many artificial clusters would be created, if I understood correctly, while we only need two: medicine and non-medicine. Furthermore, the centroid of the medicine cluster is not centered on medicine itself. This poses additional problems: say the centroid moves far away from medicine, then other words like computer or human (or any other that in your opinion does not fit into medicine) might get into the cluster.
  • it's hard to evaluate the results, even more so because the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE/similar for so many words would give us totally nonsensical results [yeah, I have tried it; PCA gets around 5% explained variance for your longer dataset, which is really, really low]).
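To make the first point concrete, here is a small sketch with made-up toy vectors (not real embeddings). It shows that scaling a vector leaves cosine similarity untouched while the Euclidean distance grows with the magnitude.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; ignores vector magnitude."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between a and b; sensitive to magnitude."""
    return float(np.linalg.norm(a - b))


a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a  # same direction, ten times the magnitude

cos = cosine_similarity(a, b)
dist = euclidean_distance(a, b)
print(f"cosine similarity: {cos:.3f}, euclidean distance: {dist:.3f}")
```

The two vectors point the same way, so the cosine similarity is 1 (up to rounding), yet the Euclidean distance is large.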

Based on the problems highlighted above, I have come up with a solution using active learning, which is a rather forgotten approach to such problems.

Active Learning approach

In this subset of machine learning, when we have a hard time coming up with an exact algorithm (like what it means for a term to be part of the medical category), we ask a human "expert" (who doesn't actually have to be an expert) to provide some answers.

Knowledge encoding

As anand_v.singh pointed out, word vectors are one of the most promising approaches, and I will use them here as well (differently though, and IMO in a much cleaner and easier fashion).

I'm not going to repeat his points in my answer, so I will add my two cents:

  • Do not use contextualized word embeddings, i.e. the currently available state of the art (e.g. BERT)
  • Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). This should be checked (and is checked in my code; there will be further discussion when the time comes), and you may use the embedding which has most of them present.
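The coverage check from the second point can be sketched independently of any particular embedding. The example below assumes you have already looked up one vector per concept and stored them in a dict; the helper name and the toy vectors are hypothetical.

```python
from typing import Dict, List

import numpy as np


def missing_concepts(vectors: Dict[str, np.ndarray]) -> List[str]:
    """Return the concepts whose vector is all zeros (no representation)."""
    return [name for name, vec in vectors.items() if not np.any(vec)]


# Toy vectors standing in for an embedding lookup (values are made up).
toy_vectors = {
    "menthol": np.array([0.2, -0.1, 0.4]),
    "vpb": np.zeros(3),
    "climate": np.array([0.0, 0.3, 0.0]),
}

missing = missing_concepts(toy_vectors)
coverage = 1 - len(missing) / len(toy_vectors)
print(f"missing: {missing}, coverage: {coverage:.2f}")
```

Running the same check against several candidate embeddings and keeping the one with the highest coverage is one way to act on the advice above.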

Measuring similarity using spaCy

This class measures similarity between medicine encoded as spaCy's GloVe word vector and every other concept.

class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

This code will return a number for each concept measuring how similar it is to centroid. Furthermore, it records indices of concepts missing their representation. It might be called like this:

import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)

You may substitute your data in place of concepts_new.txt.

Look at spacy.load and notice I have used en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot), and may work out of the box for your case. You have to download it separately after installing spaCy; more info is provided in the links above.

Additionally, you may want to use multiple centroid words, e.g. add words like disease or health and average their word vectors. I'm not sure whether that would positively affect your case, though.
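Averaging several centroid words could look roughly like the sketch below. It works on plain NumPy arrays with made-up 3-dimensional vectors; with spaCy you would presumably feed in something like nlp("medicine").vector for each word instead, which I have not wired up here.

```python
from typing import Sequence

import numpy as np


def average_centroid(vectors: Sequence[np.ndarray]) -> np.ndarray:
    """Element-wise mean of the centroid word vectors."""
    return np.mean(np.stack(vectors), axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-ins for the vectors of "medicine", "disease" and "health".
medicine = np.array([1.0, 0.0, 0.0])
disease = np.array([0.0, 1.0, 0.0])
health = np.array([1.0, 1.0, 0.0])

centroid = average_centroid([medicine, disease, health])

related = cosine_similarity(centroid, np.array([1.0, 1.0, 0.0]))
unrelated = cosine_similarity(centroid, np.array([0.0, 0.0, 1.0]))
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```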

Another possibility might be to use multiple centroids and calculate the similarity between each concept and each of the centroids. We may have a few thresholds in such a case; this is likely to remove some false positives, but may miss some terms which one could consider to be similar to medicine. Furthermore, it would complicate the case much more, but if your results are unsatisfactory you should consider the two options above (and only then; don't jump into this approach without previous thought).
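One way to read the multi-centroid idea is to require a concept to clear a separate threshold for every centroid. That is only my interpretation, sketched below with toy 2-dimensional vectors and arbitrary thresholds; the function names are hypothetical.

```python
from typing import Sequence

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def passes_all_centroids(
    concept: np.ndarray,
    centroids: Sequence[np.ndarray],
    thresholds: Sequence[float],
) -> bool:
    """True only if the concept is similar enough to every centroid."""
    return all(
        cosine_similarity(concept, centroid) >= threshold
        for centroid, threshold in zip(centroids, thresholds)
    )


# Toy centroids and thresholds (values are made up).
centroids = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
thresholds = [0.5, 0.5]

close_to_both = passes_all_centroids(np.array([1.0, 1.0]), centroids, thresholds)
close_to_one = passes_all_centroids(np.array([1.0, 0.0]), centroids, thresholds)
print(close_to_both, close_to_one)
```

Requiring all thresholds to pass is the strict variant, which matches the trade-off described above: fewer false positives, more missed terms.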

Now, we have a rough measure of concept's similarity. But what does it mean that a certain concept has 0.1 positive similarity to medicine? Is it a concept one should classify as medical? Or maybe that's too far away already?

Asking expert

To get a threshold (terms below it will be considered non-medical), it's easiest to ask a human to classify some of the concepts for us (and that's what active learning is about). Yeah, I know it's a really simple form of active learning, but I would consider it such anyway.

I have written a class with an sklearn-like interface that asks a human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

  • samples describes how many examples will be shown to an expert during each iteration (it is the maximum; fewer will be shown if some samples were already asked about or there are not enough of them to show).
  • step represents the drop of the threshold (we start at 1, meaning perfect similarity) in each iteration.
  • change_multiplier - if an expert answers that the concepts are not related (or mostly unrelated, as multiple of them are shown), step is multiplied by this floating point number. It is used to pinpoint the exact threshold between step changes at each iteration.
  • concepts are sorted based on their similarity (the more similar a concept is, the higher it is placed)

The function below asks the expert for an opinion and finds the optimal threshold based on their answers.

def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)
    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"

Example question looks like this:

Are those concepts related to medicine?

0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system

[y]es / [n]o / [any]quit y

... parsing an answer from expert:

# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as the current threshold is already not related to medicine
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step < self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False

And finally, the whole code of ActiveLearner, which finds the optimal similarity threshold according to the expert:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self

All in all, you would have to answer some questions manually, but this approach is way more accurate in my opinion.

Furthermore, you don't have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical term (should a batch of 40 medical samples and 10 non-medical samples still be considered medical?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 sample out of 50 is non-medical), I would consider the threshold to still be valid.
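If you want that decision to be explicit instead of a gut feeling, you could label each shown sample and compare the medical fraction against a cut-off. This is a minimal sketch; the helper and the 0.75 cut-off are assumptions for illustration and are not part of the classes above.

```python
from typing import Sequence


def batch_is_medical(labels: Sequence[bool], min_medical_fraction: float = 0.75) -> bool:
    """True when the share of medical samples reaches the chosen cut-off."""
    return sum(labels) / len(labels) >= min_medical_fraction


# 40 medical and 10 non-medical samples, as in the example above.
mixed_batch = [True] * 40 + [False] * 10
# A single outlier among 50 samples.
outlier_batch = [True] * 49 + [False]

print(batch_is_medical(mixed_batch), batch_is_medical(outlier_batch))
```

The result could then be mapped to the "y"/"n" answer the learner expects.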

Once again: this approach should be mixed with others in order to minimize the chance of a wrong classification.

Classifier

Once we obtain the threshold from the expert, classification is instantaneous. Here is a simple class for classification:

class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions

And for completeness, here is the final source code:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

After answering some questions, with threshold 0.1 (everything between [-1, 0.1) is considered non-medical, while [0.1, 1] is considered medical) I got the following results:

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True

As you can see, this approach is far from perfect, so the last section describes possible improvements:

Possible improvements

As mentioned in the beginning, mixing my approach with the other answers would probably rule out ideas like sport shoe belonging to medicine, and the active learning approach would be more of a deciding vote in case of a draw between the two heuristics mentioned above.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use multiple of them (either increasing or decreasing), let's say 0.1, 0.2, 0.3, 0.4, 0.5.

Let's say sport shoe gets, for each threshold, its respective True/False like this:

True True False False False,

Using majority voting, we would mark it non-medical by 3 votes to 2. Furthermore, a too strict threshold would be mitigated as well if the thresholds below it out-vote it (as in the case where the True/False votes look like this: True True True False False).
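The threshold ensemble can be sketched in a few lines. The similarity values below are hypothetical (I did not record the actual value for sport shoe); they are chosen only so that the votes match the two patterns discussed above.

```python
from typing import List, Sequence


def threshold_votes(similarity: float, thresholds: Sequence[float]) -> List[bool]:
    """One vote per threshold: True if the similarity reaches it."""
    return [similarity >= threshold for threshold in thresholds]


def majority(votes: Sequence[bool]) -> bool:
    """True when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]

# Hypothetical similarity for "sport shoe".
shoe_votes = threshold_votes(0.25, thresholds)
# Hypothetical similarity for a concept that clears three thresholds.
other_votes = threshold_votes(0.35, thresholds)

print(shoe_votes, majority(shoe_votes))
print(other_votes, majority(other_votes))
```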

A final possible improvement I came up with: in the code above I'm using the Doc vector, which is the mean of the word vectors making up the concept. Say one word is missing (its vector consists of zeros); in such a case, the concept would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation); in such a case you could average only those vectors which are different from zero.
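Averaging only the non-zero vectors might look like the sketch below. It operates on a plain array with one row per token and uses made-up values; how you extract the per-token vectors from your embedding is left out.

```python
import numpy as np


def mean_of_nonzero_rows(word_vectors: np.ndarray) -> np.ndarray:
    """Average the rows, skipping rows that consist only of zeros."""
    mask = np.any(word_vectors != 0, axis=1)
    if not mask.any():
        # No token has a representation at all.
        return np.zeros(word_vectors.shape[1])
    return word_vectors[mask].mean(axis=0)


# Three tokens, the middle one has no representation (toy values).
tokens = np.array([[0.4, 0.2], [0.0, 0.0], [0.2, 0.6]])

plain_mean = tokens.mean(axis=0)
masked_mean = mean_of_nonzero_rows(tokens)
print(plain_mean, masked_mean)
```

One thing worth checking on your own data: the plain and the masked mean differ only by a scale factor, so they point in the same direction. The masking therefore matters for magnitude-sensitive measures, while a pure cosine similarity should come out the same either way.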

I know this post is quite lengthy, so if you have any questions post them below.

Belated answered 19/2, 2019 at 19:49 Comment(8)
Wow. This is impressive. I am still reading your post. I will comment below if I got any questions :)Ori
Thank you very much for the great answer. So, if I understand you correctly, we are using a pretrained embedding, namely en_core_web_lg, and we take the concepts closer to medicine in the pretrained embedding as the potential medical terms? Please correct me if I am wrong :)Ori
Yes, we are using pretrained embeddings. Pretrained medicine word embedding should be considered a center point in N dimensional space. After that we calculate cosine similarity (say it's a metric of distance) of each concept to centroid (medicine). After obtaining that, we ask a human to obtain threshold (for example if concept's similarity to medicine is smaller than 0.1 consider it non-medical), that's what the script above does (you may want to tune it). It is far from foolproof, hence I advise using 3 methods, and classifying concepts based on majority of votes.Belated
Thank you for the comment. Is there any special reason why you used en_core_web_lg? I saw that in the link you have provided there are multiple other pretrained embeddings: spacy.io/usage/models :)Ori
Yes, it contains the biggest amount of pretrained vectors when compared to other english language models. Using it, your chance of getting no representation for less frequent words is smaller.Belated
Can you please send me a URL where I can read more details about en_core_web_lg? I would like to know what are the data they have used for training this embedding? Is it entire wikipedia data? :)Ori
Here is the specification; other information about the models can be found in their documentation. It was trained using Common Crawl data. The crawler goes through tons of newly created pages each month (240TB of uncompressed data), so it wasn't strictly trained on wikipedia (though wikipedia was definitely part of it). Furthermore, they have chosen the 1.1 million most frequent unique tokens and we cannot be sure how much of Common Crawl they have used; I'm unaware of that information at least.Belated
Let us continue this discussion in chat.Belated
S
8

"Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to medical parent category)"

MediaWiki categories are themselves wiki pages. A "parent category" is just a category which the "child" category page belongs to. So you can get the parent categories of a category in exactly the same way as you'd obtain the categories of any other wiki page.

For example, using pymediawiki:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
p = wikipedia.page('Category:Enzyme inhibitors')
parents = p.categories
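Repeating that lookup lets you walk up the category graph and check whether a broad category such as a medical one is reached within a few steps. The sketch below is only an illustration: `reaches_category`, `get_parents` and the toy parent table are hypothetical names and data, not real Wikipedia relationships. The lookup is passed in as a function so the search itself can be tried without network access.

```python
from collections import deque


def reaches_category(start, target, get_parents, max_depth=3):
    """Breadth-first search upwards from `start`, following parent categories.

    `get_parents(category)` must return the list of parent category names.
    The `seen` set guards against cycles in the category graph.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        category, depth = queue.popleft()
        if category == target:
            return True
        if depth == max_depth:
            continue  # do not climb any further from this category
        for parent in get_parents(category):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return False


# Hand-written stand-in for the wiki (assumed data, for illustration only)
toy_parents = {
    'Category:Enzyme inhibitors': ['Category:Enzymes', 'Category:Pharmacology'],
    'Category:Pharmacology': ['Category:Medicine'],
}
print(reaches_category('Category:Enzyme inhibitors', 'Category:Medicine',
                       lambda c: toy_parents.get(c, [])))  # True
```

For real data, `get_parents` would wrap the library call that returns a category page's own categories; whether the names carry the `Category:` prefix depends on the library, so that part may need adjusting. A small `max_depth` matters because broad categories become reachable from almost anything after enough steps.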
Shadowgraph answered 11/2, 2019 at 7:19 Comment(18)
But this doesn't immediately resolve the OP's question. Each category can belong to one or more categories, some of which might be unambiguously "Medicine"; but I have not resolved anything to the point where it would be easy to decide yes or no for any of the given examples.Oren
@tripleee: True. The OP wrote at the end of their question that they wanted to do this by looking at the parent categories, and asked how to find those, so I assumed that was their specific question. Whether that will actually help them solve their original problem of grouping categories, I can't really tell. (Another possible approach might be to look for a relevant wikiproject. Or maybe even try to apply some kind of a statistical clustering algorithm.)Shadowgraph
@IlmariKaronen thank you very much for the answer. I would like to know if there is a way to look the relevant wikiproject (as you have suggested) using python? I feel like that is a potential approach :)Ori
@Emi: Look at the (meta)categories associated with the wikiproject. For example, here's the category for all WikiProject Medicine articles. Note that it contains the talk pages for the articles, since that's where the associated template goes, but it's easy enough to get the article name from the talk page name (just remove the Talk: prefix).Shadowgraph
@IlmariKaronen thanks a lot for the details. Just wondering if there is a way to get all WikiProject Medicine articles. Is there any data dumps available? Moreover, I manually check if these article titles contain the concepts in my dataset. Unfortunately, some of them are not listed (even though they do have wikipedia articles). Do you know why this happens?Ori
@Emi: Without specific examples, I can't really say. Maybe they're associated with some other related wikiproject instead?Shadowgraph
@IlmariKaronen thank you for the comment. I meant terms such as marine oil, fish oil etc. I could not find them in the first link you shared. Is there any reason why that happens?Ori
@Emi: Looking at the talk page, it seems the "fish oil" article is considered to belong to the wikiprojects "food and drink", "fisheries and fishing", "alternative medicine" and "dietary supplements". The alternative medicine wikiproject, which I suspect you may be looking for, seems a bit bare-bones, but like all(?) wikiprojects, they do have a category for articles they're interested in.Shadowgraph
... Unfortunately, it seems they don't have a single huge category for all of their articles like the medicine wikiproject does, so you'll have to combine all the "by quality" categories together. Or, alternatively (pun not intended), you could just look directly for talk pages that transclude their template.Shadowgraph
@IlmariKaronen thank you very much for the details. Since I am not clear how to select which wikiprojects I need (as I have to look at multiple wikiprojects and there seem to be 1000s of them) and how to obtain these data (data dumps or programmatically), I started a bounty. Also, I mentioned your thoughts in the bounty. Please let me know your thoughts on my current issues. I look forward to hearing from you. Thank you once again :)Ori
Could you provide your whole dataset? I have a few ideas, but would like to test them with your data (or substantial part of it, having both negative (non-medicine) cases and the positive ones (related to medicine)).Belated
@SzymonMaszke Sure, I will update the question now with a longer dataset. Give me 10 minutes. Thank you very much once again :)Ori
@SzymonMaszke I have attached a longer list herewith: docs.google.com/document/d/… Please let me know if you need any further details. Thank you :)Ori
Yeah, readable file would be good. I'm unable to load it using ast.literal_eval as I did in the previous problem of yours. If you could fix it so I can load this data easily (maybe .csv format instead of this?) I would be grateful.Belated
@SzymonMaszke Sure, I will fix the issue now. Give me 10 mins :)Ori
@SzymonMaszke I have shared a new file herewith: drive.google.com/file/d/1-epGMXJkhpdUyWfHD5tpDOiwVp6E0Jnd/… I could load the data using import json concepts = json.load(open('concepts_new.txt')). Please kindly let me know if this does not work. Looking forward to hearing from you. Thank you once again :)Ori
Just noticed, wanted to make a comment under OP's question, my bad. Thanks, everything is fine now though.Belated
@SzymonMaszke Thanks a lot. Looking forward for your answer :) Please let me know if you need any further details. Thank you :)Ori
S
6

You could try to classify the wikipedia categories by the mediawiki links and backlinks returned for each category

import re
from mediawiki import MediaWiki

# TermFind reports whether any entry of termList contains the given term
def TermFind(term, termList):
    response = False
    for val in termList:
        # re.escape keeps characters such as '(' in the term from being read as regex syntax
        if re.search(re.escape(term), val):
            response = True
            break
    return response

# Find out whether both the links and the backlinks lists contain a given term
def BoundedTerm(wikiPage, term):
    aList = wikiPage.links
    bList = wikiPage.backlinks
    return TermFind(term, aList) and TermFind(term, bList)

termlist = []      # fill in the page titles you want to classify
term = 'medicine'  # the term to look for, e.g. biology, medicine or disease

container = []
wikipedia = MediaWiki()
for val in termlist:
    cpage = wikipedia.page(val)
    if BoundedTerm(cpage, term):
        container.append('medical')
    else:
        container.append('nonmedical')

The idea is to guess a term that is shared by most of the categories; I tried biology, medicine and disease with good results. Perhaps you can use multiple calls of BoundedTerm to make the classification, or a single call for multiple terms, and combine the results for the classification. Hope it helps.
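Combining several terms could be sketched as follows. This works on plain lists of link and backlink titles, so it can be tried without network access; the names `bounded_terms`, `classify` and `min_hits`, the case-insensitive matching, and the example titles are assumptions made for illustration.

```python
def bounded_terms(links, backlinks, terms):
    """Return the terms that occur in at least one link AND at least one backlink."""
    links = [title.lower() for title in links]
    backlinks = [title.lower() for title in backlinks]
    return [t for t in terms
            if any(t.lower() in title for title in links)
            and any(t.lower() in title for title in backlinks)]


def classify(links, backlinks, terms, min_hits=1):
    """Label a page 'medical' when at least `min_hits` terms are bounded."""
    hits = bounded_terms(links, backlinks, terms)
    return 'medical' if len(hits) >= min_hits else 'nonmedical'


terms = ['biology', 'medicine', 'disease']
# Made-up link and backlink titles for two pages
print(classify(['Medicine', 'Outline of medicine'], ['History of medicine'], terms))  # medical
print(classify(['Perth'], ['Australia'], terms))  # nonmedical
```

Raising `min_hits` makes the rule stricter, which may help if a single shared term produces too many false positives.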

Sprint answered 19/2, 2019 at 0:5 Comment(7)
Hi, thank you very much for the answer. However, I got the following error while running it. NameError: name 'wikipedia' is not defined Can you please tell me how to resolve this issue? :)Ori
Sorry, i edit the answer, i forget to add wikipedia = MediaWiki()Sprint
Thank you so much. That fixed the error. For the term vasodilation I get mediawiki.exceptions.DisambiguationError:. However, the term vasodilation is a valid Wikipedia page. Do you know why this happens? :)Ori
Sorry, I was unable to replicate the error, however a disambiguation error happens when the term could mean several things, for example (en.wikipedia.org/wiki/Raby). Also, vasodilation isn't present in the disambiguation index from wikipedia (en.wikipedia.org/wiki/…) perhaps is another therm raising the error. Hope it helpsSprint
Please let me know if you know an answer for this #55869694 Looking forward to hearing from you. Thank you :)Ori
@Emi Question Not found at the linkJemison
@Jemison sorry, I moved the question to open data opendata.stackexchange.com/questions/15206/… Please let me know if you know an answer for this. Thank you very much :)Ori
J
5

There is a concept of word vectors in NLP. By looking through massive volumes of text, it converts words to multi-dimensional vectors; the smaller the distance between two vectors, the greater the similarity between the words. The good thing is that many people have already generated such word vectors and made them available under very permissive licences. In your case you are working with Wikipedia, and vectors can be built from its text dump, available here http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Now these would be the most suited for this task since they contain most words from Wikipedia's corpora, but in case they are not suited for you, or are removed in the future, you can use one of the alternatives I list below. With that said, there is a better way to do this: pass the texts to TensorFlow's universal sentence encoder embed module, which does most of the heavy lifting for you; you can read more about that here. The reason I put it after the Wikipedia text dump is that I have heard people say these embeddings are a bit hard to work with when working with medical samples. This paper does propose a solution to tackle that, but I have never tried it, so I cannot be sure of its accuracy.

Now how you can use the word embeddings from tensorflow is simple, just do

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["Input Text here as"," List of strings"])
session.run(embeddings)

Since you might not be familiar with tensorflow, you might run into some trouble trying to run just this piece of code. Follow this link, where they describe completely how to use it; from there you should be able to easily modify it to your needs.

With that said, I would recommend first checking out TensorFlow's embed module and their pre-trained word embeddings. If they don't work for you, check out the Wikimedia link; if that also doesn't work, proceed to the concepts of the paper I have linked. Since this answer describes an NLP approach, it will not be 100% accurate, so keep that in mind before you proceed.

Glove Vectors https://nlp.stanford.edu/projects/glove/

Facebook's fast text: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Or this http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz

If you run into problems implementing this after following the colab tutorial, add your problem to the question and comment below; from there we can proceed further.

Edit Added code to cluster topics

In brief: rather than using word vectors, I am encoding their summary sentences.

file content.py

def AllTopics():
    topics = []  # list all your topics here; not included for space restrictions
    # iterate over every topic (range(len(topics)-1) would skip the last one)
    for topic in topics:
        yield topic

File summaryGenerator.py

import wikipedia
import pickle
from content import AllTopics

summary = []  # (topic, summary text) tuples
failed = []   # (topic, reason for failure) tuples
for topic in AllTopics():
    try:
        summary.append((topic, wikipedia.summary(topic)))
    except Exception as e:
        # store the message rather than the exception object so the list pickles cleanly
        failed.append((topic, str(e)))
with open("summary.txt", "wb") as fp:
    pickle.dump(summary, fp)
with open('failed.txt', 'wb') as fp:
    pickle.dump(failed, fp)

File SimilartiyCalculator.py

import pickle
import sys

import tensorflow as tf
import tensorflow_hub as hub
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy


try:
    with open("summary.txt", "rb") as fp:   # Unpickling
        summary = pickle.load(fp)
except Exception as e:
    print('Cannot load the summary file. Please make sure that it exists; if not, run summaryGenerator.py first.', e)
    sys.exit('Read the error message')

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

tf.logging.set_verbosity(tf.logging.ERROR)
# summary is a list of (topic, summary text) tuples
messages = [x[1] for x in summary]
labels = [x[0] for x in summary]
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # message_embeddings is a numpy.ndarray of shape (number of summaries, 512)
    message_embeddings = session.run(embed(messages))

X = message_embeddings
# Ward linkage works on Euclidean distances, which is the default metric
agl = AgglomerativeClustering(n_clusters=5, linkage='ward')
agl.fit(X)
# linkage expects the observations (or a condensed distance matrix), not a square distance matrix
Z = hierarchy.linkage(X, 'complete')
dendro = hierarchy.dendrogram(Z)
cluster_labels = agl.labels_

This is also hosted on GitHub at https://github.com/anandvsingh/WikipediaSimilarity, where you can find the similarity.txt file and other files. In my case I couldn't run it on all the topics, but I would urge you to run it on the full list of topics (directly clone the repository and run SummaryGenerator.py), and upload the similarity.txt via a pull request in case you don't get the expected result. If possible, also upload the message_embeddings in a csv file as topics and their embeddings.

Changes after edit 2: Switched the similarityGenerator to hierarchy-based (agglomerative) clustering. I would suggest keeping the title names at the bottom of the dendrogram; for that, look at the definition of dendrogram here. I verified the results by viewing some samples and they look quite good; you can change the n_clusters value to fine-tune your model. Note: this requires you to run the summary generator again. I think you should be able to take it from here: try a few values of n_clusters and see for which one all the medical terms are grouped together, then find the cluster_label for that cluster and you are done. Since here we group by summary, the clusters will be more accurate. If you run into any problems or don't understand something, comment below.
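Finding the cluster that holds the medical terms can be automated if you already know a few terms that are certainly medical. A minimal sketch, assuming `labels` and `cluster_labels` are parallel sequences as in the script above; the helper name `medical_cluster`, the seed terms and the cluster ids below are made up for illustration.

```python
from collections import Counter


def medical_cluster(labels, cluster_labels, seed_terms):
    """Return the cluster id that contains most of the known medical seed terms."""
    seeds = set(seed_terms)
    counts = Counter(c for name, c in zip(labels, cluster_labels) if name in seeds)
    if not counts:
        return None  # none of the seed terms were clustered
    return counts.most_common(1)[0][0]


# Hypothetical clustering result for five topics
labels = ['enzyme inhibitor', 'bypass surgery', 'perth', 'climate', 'hypertriglyceridemia']
cluster_labels = [2, 2, 0, 1, 2]

target = medical_cluster(labels, cluster_labels, ['enzyme inhibitor', 'bypass surgery'])
medical = [name for name, c in zip(labels, cluster_labels) if c == target]
print(target, medical)  # 2 ['enzyme inhibitor', 'bypass surgery', 'hypertriglyceridemia']
```

If the seed terms end up spread over several clusters for a given n_clusters, that is a hint to try a smaller value.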

Jemison answered 16/2, 2019 at 8:54 Comment(25)
According to opendata.stackexchange.com/q/13686/16193, the OP tries this approach: link.springer.com/article/10.1007/s10115-008-0152-4Tevis
@StanislavKralin I wasn't aware of that question on stats exchange, I think I can make a model that produces acceptable outputs, If I succeed I will update the answer accordinglyJemison
@Jemison thanks a lot. let me know if you find another solution :) I actually tried different ways like clustering etc. using a word2vec model. But it did not produce good results. Anyway, thank you for the detailed answer :) Let me know if you find another way of solving my issue :)Ori
@Emi I am going to attempt to solve the problem after the part in which you have extracted the text as list, and just want to group themJemison
@Jemison thanks a lot. If you want a more lengthy list I have attached one herewith: drive.google.com/file/d/1-epGMXJkhpdUyWfHD5tpDOiwVp6E0Jnd/viewOri
@Emi I am using the one that you have added as a google doc list in the questionJemison
I am working here, If anyone wants to join in, more than welcome github.com/anandvsingh/WikipediaSimilarity to do so. It will take time as summary scraping is going slowly.Jemison
@Emi I am running into some Internet speed bottlenecks; have a look at the edits and upload the results either to a google drive or directly to the GitHub Repository if it doesn't work :).Jemison
@Jemison wow, thank you so much. sure, I will have a look at your code. So if I understand you correctly, summary.txt is a list of concepts? :)Ori
@Emi summary.txt is a list of tuples where the first element in each tuple is the topic and the second is the summary of that topic displayed in it's Wikipedia page. Failed.txt is also a list of tuples where the first element is the topic and second is the reason why it failed.Jemison
@Jemison Thank you for the comment. So if I understand you correctly, I first need to run summaryGenerator.py and then SimilartiyCalculator.py? or do you have already got results for summaryGenerator.py?Ori
Let us continue this discussion in chat.Jemison
@Jemison It looks like the programme has run. I have attached the results herewith: drive.google.com/file/d/1OYoWT2D4r_TBWFzzbfSdJPko2bzE2-Hn/… :)Ori
@Emi it looks like you have run an earlier version of the program as in summary.txt it is supposed to be a list of tuples containing topics and their summary but in this it is only summary, I will run it on this dataset itself, and if it works, I will tell you to run it again to overcome thisJemison
@Jemison I am sorry, I did not notice the code change. I can rerun the results if that is more convenient for you? Looking forward to hearing from you :)Ori
@Emi Not really, this is enough for me to verify that it works, and then I would recommend you to do this yourself and just take care of what is the topic of the summary is.Jemison
@Jemison Did you get good results? Can you please update the answer with the results you got? Thank you :)Ori
@Emi the updates are now available here and on github as well, Read the changes after edit section at the bottom of the answer to understand what has changed and how to proceed.Jemison
I would also attach @szymonmaszke 's active learning part to it, to reduce the overhead you need to do, but I would still stick to summary text embeddings as used above rather than just word embeddings.Jemison
@Jemison Thank you very much for the great answer. Sure, I will go through your answer and read it. I will comment if there is anything that I am not clear about. Thank you so much once again. I really appreciate it. Please let me know if you get any further suggestions. Thank you very much once again :)Ori
@Jemison I see you are using whole summary of concepts instead of plain concepts names. That's definitely an improvement if OP has enough computational power (it could be used in active learning as well and could be easily mixed). I think this embedding would work much better than the one I have used in my answer, though I still don't think any form of clustering is the right approach and much harder to verify. I think wikipedia heuristics provided by other answers would be much more accurate than machine learning-like approaches, hence their mix seems reasonable.Belated
@SzymonMaszke The reason I don't trust MediaWiki backlinks is because this site exists sixdegreesofwikipedia.com, however I agree that verifying clusters might be hard for the OP, but having some words that you already know belong to your target cluster will help you find the target clusters, but yes compute power would be a challenge for this problem so I would recommend using Google Colab Notebook, I don't completely agree with you on the accuracy part, I have used same technique on movie descriptions and the results were quite satisfying, jury's still out on this data though:)Jemison
@Emi I am a bit skeptical about mixing the approaches of machine learning and heuristics, as in the combined result you might get the worst of both worlds, but I would encourage you to try that.Jemison
@Jemison Hi, just wondering if you know an answer for this. #54849786 Thank you very much :)Ori
@Jemison Please let me know if you know an answer for this #55869694 Looking forward to hearing from you. Thank you :)Ori
R
5

The wikipedia library is also a good bet to extract the categories from a given page, as wikipedia.WikipediaPage(page).categories returns a simple list. The library also lets you search multiple pages should they all have the same title.

In medicine there seem to be a lot of key roots and suffixes, so searching for key words may be a good way of finding medical terms.

import wikipedia

def categorySorter(targetCats, pagesToCheck, mainCategory):
    targetList = []
    nonTargetList = []
    targetCats = [i.lower() for i in targetCats]

    print('Sorting pages...')
    print('Sorted:', end=' ', flush=True)
    for page in pagesToCheck:

        e = openPage(page)

        def deepList(l):
            for item in l:
                if item[1] == 'SUBPAGE_ID':
                    deepList(item[2])
                else:
                    catComparator(item[0], item[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

        if e[1] == 'SUBPAGE_ID':
            deepList(e[2])
        else:
            catComparator(e[0], e[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

    print()
    print()
    print('Results:')
    print(mainCategory, ': ', targetList, sep='')
    print()
    print('Non-', mainCategory, ': ', nonTargetList, sep='')

def openPage(page):
    try:
        return [page, wikipedia.WikipediaPage(page).categories]
    except wikipedia.exceptions.PageError:
        # The page does not exist, so there are no categories to compare
        return [page, []]
    except wikipedia.exceptions.DisambiguationError as e:
        # Ambiguous title: collect the categories of every candidate page
        pageCategories = []
        for i in e.options:
            if '(disambiguation)' not in i:
                pageCategories.append(openPage(i))
        return [page, 'SUBPAGE_ID', pageCategories]

def catComparator(pageTitle, pageCategories, targetCats, targetList, nonTargetList, lastPage):

    # unhash to view the categories of each page
    #print(pageCategories)
    pageCategories = [i.lower() for i in pageCategories]

    # A page matches if a key word appears in its title or in one of its categories
    any_in = any(i in pageTitle.lower() for i in targetCats) or compareLists(targetCats, pageCategories)

    if any_in:
        targetList.append(pageTitle)
    else:
        nonTargetList.append(pageTitle)

    # Just prints a pretty list, you can comment out until next hash if desired
    if any_in:
        print(pageTitle, '(T)', end='', flush=True)
    else:
        print(pageTitle, '(F)',end='', flush=True)

    if pageTitle != lastPage:
        print(',', end=' ')
    # No more commenting

    return any_in

def compareLists (a, b):
    for i in a:
        for j in b:
            if i in j:
                return True
    return False

The code really just compares a list of key words and suffixes to the title of each page as well as to its categories to determine whether a page is medically related. It also looks at related pages/sub pages for the bigger topics, and determines if those are related as well. I am not well versed in medicine, so forgive the categories, but here is an example to tag onto the bottom:

medicalCategories = ['surgery', 'medic', 'disease', 'drugs', 'virus', 'bact', 'fung', 'pharma', 'cardio', 'pulmo', 'sensory', 'nerv', 'derma', 'protein', 'amino', 'unii', 'chlor', 'carcino', 'oxi', 'oxy', 'sis', 'disorder', 'enzyme', 'eine', 'sulf']
listOfPages = ['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
categorySorter(medicalCategories, listOfPages, 'Medical')

This example list gets ~70% of what should be on the list, at least to my knowledge.

Rhona answered 20/2, 2019 at 7:55 Comment(0)
E
4

The question appears a little unclear to me; it does not seem like a straightforward problem to solve and may require some NLP model. Also, the words concept and categories are used interchangeably. What I understand is that concepts such as enzyme inhibitor, bypass surgery and hypertriglyceridemia need to be grouped together as medical and the rest as non-medical. This problem will require more data than just the category names. A corpus is required to train an LDA model (for instance), where the entire text information is fed to the algorithm and it returns the most likely topics for each of the concepts.

https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/

Electromagnet answered 16/2, 2019 at 0:56 Comment(0)
