Text classification beyond the keyword dependency and inferring the actual meaning
I am trying to develop a text classifier that will classify a piece of text as Private or Public. Take medical or health information as an example domain. A typical classifier that I can think of treats keywords as the main distinguisher, right? But what about a scenario like the one below, where both pieces of text contain similar keywords but carry different meanings?

The following piece of text reveals someone's private (health) situation (the patient has cancer):

I've been to two clinics and my pcp. I've had an ultrasound only to be told it's a resolving cyst or a hematoma, but it's getting larger and starting to make my leg ache. The PCP said it can't be a cyst because it started out way too big and I swear I have NEVER injured my leg, not even a bump. I am now scared and afraid of cancer. I noticed a slightly uncomfortable sensation only when squatting down about 9 months ago. 3 months ago I went to squat down to put away laundry and it kinda hurt. The pain prompted me to examine my leg and that is when I noticed a lump at the bottom of my calf muscle and flexing only made it more noticeable. Eventually after four clinic visits, an ultrasound and one pcp the result seems to be positive and the mass is getting larger.
[Private] (Correct Classification)

The following piece of text is a comment from a doctor, which definitely does not reveal anyone's private health situation. It exposes the weaknesses of a typical classifier model:

Don’t be scared and do not assume anything bad as cancer. I have gone through several cases in my clinic and it seems familiar to me. As you mentioned it might be a cyst or a hematoma and it's getting larger, it must need some additional diagnosis such as biopsy. Having an ache in that area or the size of the lump does not really tells anything bad. You should visit specialized clinics few more times and go under some specific tests such as biopsy, CT scan, pcp and ultrasound before that lump become more larger.
[Private] (Which is the Wrong Classification. It should be [Public])

The second paragraph was classified as private by all of my current classifiers, for obvious reasons: similar keywords, valid word sequences, and the presence of subjects seem to confuse the classifier. Both pieces of content contain subjects like I and you (nouns, pronouns), etc. I have thought about everything from Word2Vec to Doc2Vec, from inferring meaning to semantic embeddings, but I can't come up with a solution approach that best suits this problem.

Any idea which way I should handle this classification problem? Thanks in advance.

Progress so Far:
The data I have collected is from a public source where patients/victims usually post their own situations and doctors/well-wishers reply to those. My assumption while crawling was that posts belong to my Private class and comments belong to my Public class. Altogether I started with 5K+5K posts/comments and got around 60% accuracy with a naive Bayes classifier without any major preprocessing. I will try a neural network soon. But before feeding anything into a classifier, I want to know how I can preprocess the data better to assign reasonable weights to either class for better distinction.

Kristikristian answered 4/3, 2019 at 22:0 Comment(5)
Could you highlight your current approach(es) and their drawbacks? More detail would be helpful in order not to repeat what you have already tried (with, from what I understand, unsatisfactory results). Things like models, architectures, representations used, training time, size of data; anything would help here.Floatfeed
The data I have collected is from a public source where patients/victims usually post their own situations and doctors/well-wishers reply to those. My assumption while crawling was that posts belong to my Private class and comments belong to my Public class. Altogether I started with 5K+5K posts/comments and got around 60% accuracy with a naive Bayes classifier without any major preprocessing. I will try a neural network soon. But before feeding anything into a classifier, I want to know how I can preprocess the data better to assign reasonable weights to either class for better distinction.Kristikristian
Please update your question instead of posting a comment; it will be more readable for everyone.Floatfeed
Truthfully, it will be hard to write something without more samples, since anything built for this specific scenario may fail for another. For example, putting more weight on the words I, me, and my vs. you could help differentiate in this case, because they more likely indicate a patient talking about their own medical history, which would be more likely to contain private information. But that can easily fail for another conversation. Moreover we do not knowMoia
Are you trying to discriminate between "topic starter" text and "non-author replies"? It is not clear what the distinction between public and private is.Rheta

If the data you posted is representative of the classes you're trying to distinguish, keyword-based features might not be the most effective. It looks like some terms that are often treated as stop words will be very good cues as to what is Private and what is Public.

You mention pronouns; I think that's still a good avenue forward. If you're using unigram/bag-of-words features, make sure your vectorizer is not removing them.

Doing a count of instances of first person pronouns (I, my, I've, mine) gives 13 for the Private case and 2 for the Public case.

The Public example has second-person pronouns (e.g. you) where the first example doesn't. So features based on counts or smoothed ratios of first- to second-person pronouns might be effective.
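A minimal sketch of such features in pure Python (the pronoun lists and the smoothing constant are illustrative choices, not a fixed recipe):

```python
import re

FIRST_PERSON = {"i", "i've", "i'm", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "you've", "you're", "your", "yours", "yourself"}

def pronoun_features(text, alpha=1.0):
    """Count first/second person pronouns and return a smoothed ratio."""
    tokens = re.findall(r"[a-z']+", text.lower())
    first = sum(t in FIRST_PERSON for t in tokens)
    second = sum(t in SECOND_PERSON for t in tokens)
    # additive smoothing keeps the ratio finite when a count is zero
    ratio = (first + alpha) / (second + alpha)
    return {"first": first, "second": second, "ratio": ratio}

print(pronoun_features("I swear I have never injured my leg."))
# {'first': 3, 'second': 0, 'ratio': 4.0}
```

These numbers can be appended to whatever feature vector the classifier already consumes.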

If you have syntactic structure or are keeping track of positional information through n-grams or a similar representation, then features involving first-person pronouns and your keywords may be effective.

Also, verb-initial sentence structures (Don't be ..., Having an...) are characteristic of second-person directed language and may show up more in the public than the private text.

One last speculative thought: the sentiment of the two passages is quite different, so if you have access to sentiment analysis, that might provide additional cues. I would expect the Public class to be more neutral than the Private class.

Plugging your Private example into the Watson Tone Analyzer demo gives this notable result:

{
  "sentence_id": 3,
  "text": "I am now scared and afraid of cancer.",
  "tones": [
    {
      "score": 0.991397,
      "tone_id": "fear",
      "tone_name": "Fear"
    }
  ]
},

The Public statement also contains a fear-tagged sentence, but it's not scored as highly, it's accompanied by other annotations, and it contains an explicit negation. So it might be worthwhile to leverage those as features as well.

"sentences_tone": [
    {
      "sentence_id": 0,
      "text": "Don’t be scared and do not assume anything bad as cancer.",
      "tones": [
        {
          "score": 0.874498,
          "tone_id": "fear",
          "tone_name": "Fear"
        },
        {
          "score": 0.786991,
          "tone_id": "tentative",
          "tone_name": "Tentative"
        },
        {
          "score": 0.653099,
          "tone_id": "analytical",
          "tone_name": "Analytical"
        }
      ]
    },
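If you go this route, the per-sentence tone annotations could be flattened into a few document-level numbers. A minimal pure-Python sketch (the field names follow the Tone Analyzer responses shown above; the choice of aggregates is illustrative):

```python
def tone_features(sentences_tone):
    """Aggregate per-sentence tone annotations into document-level features."""
    features = {"max_fear": 0.0, "n_tones": 0, "n_sentences": len(sentences_tone)}
    for sent in sentences_tone:
        for tone in sent.get("tones", []):
            features["n_tones"] += 1
            if tone["tone_id"] == "fear":
                features["max_fear"] = max(features["max_fear"], tone["score"])
    return features

# abbreviated version of the Public response above
public_response = [{
    "sentence_id": 0,
    "text": "Don't be scared and do not assume anything bad as cancer.",
    "tones": [
        {"score": 0.874498, "tone_id": "fear", "tone_name": "Fear"},
        {"score": 0.786991, "tone_id": "tentative", "tone_name": "Tentative"},
    ],
}]
print(tone_features(public_response))
# {'max_fear': 0.874498, 'n_tones': 2, 'n_sentences': 1}
```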
Mullens answered 13/3, 2019 at 13:24 Comment(0)

These suggestions are only vaguely described, as the whole process is task-specific. You may want to look at them and take some inspiration, though.

General tips

  • Start with simpler models (as you seem to be doing) and gradually increase their complexity if the results are unsatisfactory. You may want to try the well-known Random Forest and XGBoost before jumping to neural networks.

Data tips

A few quick points that might help you:

  • You don't have many data points. If possible, I would advise you to gather more data from the same (or at least a very similar) source/distribution; in my opinion that would help you the most.
  • Improve the representation of your data (more details below); this is the first- or second-best option.
  • You could try stemming/lemmatization (from nltk or spaCy), but I don't think it will help in this case; you might leave this one out.

Data representation

I assume your current representation is Bag of Words or TF-IDF. If you haven't tried the latter, I advise you to do so before delving into more complicated (or is it?) stuff. You could easily do it with sklearn's TfidfVectorizer.
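A minimal TF-IDF sketch (the two toy documents are placeholders for real posts/comments). Note that TfidfVectorizer's default token pattern drops one-character tokens, so the pronoun "I" would silently disappear; a relaxed `token_pattern` keeps it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "I am now scared and afraid of cancer.",        # private-style
    "Don't be scared, you should visit a clinic.",  # public-style
]
# default token_pattern requires 2+ word chars, which would drop "I";
# also keep stop words, since pronouns carry the signal for this task
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None)
X = vectorizer.fit_transform(docs)
print(X.shape)                          # (2, vocabulary size)
print("i" in vectorizer.vocabulary_)    # True
```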

If the results are unsatisfactory (and you have tried Random Forest/xgboost, or something similar like LightGBM from Microsoft), you should move on to semantic representation in my opinion.

Semantic representation

As you mentioned, there are representations created by the word2vec or Doc2Vec algorithms (I would leave out the second one; it probably will not help).

You may want to separate your examples into sentences and add a token like <eos> to represent the end of a sentence; it might help the neural network learn.
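A rough sketch of that preprocessing step (naive regex splitting; a real pipeline would use nltk's or spaCy's sentence segmentation instead):

```python
import re

def add_eos(text, eos="<eos>"):
    """Split text into sentences and append an end-of-sentence token to each."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    return " ".join(f"{s} {eos}" for s in sentences)

print(add_eos("Don't be scared. It might be a cyst."))
# Don't be scared. <eos> It might be a cyst. <eos>
```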

On the other hand, there are others which would probably be a better fit for your task, like BERT. This one is context-dependent, meaning the token I would be represented slightly differently depending on the words around it (and as this representation is trainable, it should fit your task well).

The Flair library offers a nice and intuitive approach to this problem if you wish to go with PyTorch. If you are on the TensorFlow side, there is TensorFlow Hub, which also has state-of-the-art embeddings for you to use easily.

Neural Networks

When it comes to neural networks, start with a simple recurrent classifier and use either a GRU or an LSTM cell (their exact semantics differ a bit depending on the framework of choice).
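A generic PyTorch sketch of such a recurrent classifier (all dimensions and the random token ids are illustrative; in practice the ids would come from your tokenizer/vocabulary):

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        emb = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        _, h = self.gru(emb)          # h: (1, batch, hidden_dim), last hidden state
        return self.fc(h.squeeze(0))  # (batch, num_classes) logits

model = GRUClassifier(vocab_size=1000)
logits = model(torch.randint(1, 1000, (4, 20)))  # 4 documents, 20 tokens each
print(logits.shape)  # torch.Size([4, 2])
```

Swapping `nn.GRU` for `nn.LSTM` only changes the hidden-state handling (LSTM returns a (hidden, cell) pair).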

If this approach is still unsatisfactory, you should look at attention networks, Hierarchical Attention Networks (one attention level per sentence, and another for the whole document), or convolution-based approaches.

These approaches will take you a while and span quite a few topics; some combination of them will probably work nicely for your task.

Floatfeed answered 7/3, 2019 at 22:18 Comment(0)

(1) Naive Bayes is indeed a weak classifier; I'd try an SVM. If you see improvement, then further gains can be achieved with a neural network (and perhaps deep learning).

(2) Feature engineering: use TF-IDF, and try other things (many people suggest Word2Vec, although I personally tried it and it did not improve results). You can also remove stop words.
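A minimal sketch of the TF-IDF + SVM combination in sklearn (the four toy documents stand in for the real posts/comments; `token_pattern` is relaxed so the one-letter pronoun "I" survives tokenization):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "I am scared and afraid of cancer.",          # private-style
    "My leg aches and I noticed a lump.",         # private-style
    "Don't be scared, you should see a doctor.",  # public-style
    "You should go for a biopsy and a CT scan.",  # public-style
]
labels = ["private", "private", "public", "public"]

clf = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["I have a lump on my calf."]))
```

With real data you would of course evaluate on a held-out split rather than the training texts.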

One thing to consider, since you give two anecdotes, is to measure objectively the level of agreement between human beings on the task. It is sometimes overlooked that two people given the same text can disagree on labels (someone might say that a specific document is private although it is public). Just a point to note: if, for example, the level of agreement is 65%, it will be very difficult to build an algorithm that is more accurate.
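Inter-annotator agreement is commonly measured with Cohen's kappa, which corrects raw agreement for chance. A self-contained sketch (the two label sequences are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement if both annotators labeled at random
    # with their observed class frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["private", "private", "public", "public", "private"]
b = ["private", "public", "public", "public", "private"]
print(round(cohens_kappa(a, b), 3))  # 0.615
```

Values near 1 indicate strong agreement; values near 0 mean the annotators agree no more than chance would predict, and the labeling guidelines need rework before any model can do better.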

Downatheel answered 12/3, 2019 at 7:2 Comment(0)
