I am trying to develop a text classifier that labels a piece of text as Private or Public. Take medical or health information as an example domain. A typical classifier, as far as I can tell, relies on keywords as the main distinguishing feature, right? What about a scenario like the one below, where two pieces of text contain similar keywords but carry different meanings?
The following text reveals someone's private (health) situation (the patient has cancer):
I've been to two clinics and my pcp. I've had an ultrasound only to be told it's a resolving cyst or a hematoma, but it's getting larger and starting to make my leg ache. The PCP said it can't be a cyst because it started out way too big and I swear I have NEVER injured my leg, not even a bump. I am now scared and afraid of cancer. I noticed a slightly uncomfortable sensation only when squatting down about 9 months ago. 3 months ago I went to squat down to put away laundry and it kinda hurt. The pain prompted me to examine my leg and that is when I noticed a lump at the bottom of my calf muscle and flexing only made it more noticeable. Eventually after four clinic visits, an ultrasound and one pcp the result seems to be positive and the mass is getting larger.
[Private] (Correct Classification)
The following text is a comment from a doctor, which definitely does not reveal the writer's own health situation. It exposes the weakness of a typical classifier model:
Don’t be scared and do not assume anything bad as cancer. I have gone through several cases in my clinic and it seems familiar to me. As you mentioned it might be a cyst or a hematoma and it's getting larger, it must need some additional diagnosis such as biopsy. Having an ache in that area or the size of the lump does not really tells anything bad. You should visit specialized clinics few more times and go under some specific tests such as biopsy, CT scan, pcp and ultrasound before that lump become more larger.
[Private] (Which is the Wrong Classification. It should be [Public])
The second paragraph was classified as private by all of my current classifiers, for an obvious reason: similar keywords, valid word sequences, and the presence of subjects seem to confuse the classifier. Both texts even contain subjects such as I and You (nouns, pronouns). I have considered everything from Word2Vec to Doc2Vec, and from inferring meaning to semantic embeddings, but I can't think of an approach that best suits this problem.
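To illustrate why keyword features alone cannot separate the two texts above, here is a minimal sketch that compares the medical keywords found in a shortened patient post and a shortened doctor reply. The keyword list `MEDICAL_TERMS` and the helper `keywords` are hypothetical and hand-picked for this illustration; they are not taken from any of the classifiers mentioned.

```python
import re

# Hypothetical, hand-picked keyword list (an assumption for this illustration).
MEDICAL_TERMS = {"clinic", "clinics", "pcp", "ultrasound", "cyst",
                 "hematoma", "cancer", "lump", "ache"}

def keywords(text, vocabulary):
    """Return the vocabulary terms that occur in `text` (case-insensitive)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens & vocabulary

# Shortened versions of the two example texts.
patient_post = ("I've had an ultrasound only to be told it's a resolving cyst "
                "or a hematoma. I am now scared and afraid of cancer.")
doctor_reply = ("Do not assume anything bad as cancer. It might be a cyst or a "
                "hematoma; go under tests such as biopsy and ultrasound.")

a = keywords(patient_post, MEDICAL_TERMS)
b = keywords(doctor_reply, MEDICAL_TERMS)

# Jaccard similarity of the two keyword sets: 1.0 means identical sets.
jaccard = len(a & b) / len(a | b)
print(sorted(a))
print(sorted(b))
print(jaccard)
```

With these snippets both texts yield the same keyword set, so a purely keyword-based feature vector cannot tell the patient's disclosure from the doctor's advice; any distinguishing signal has to come from somewhere else, such as who the statements are about.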
Any ideas on how I should approach this classification problem? Thanks in advance.
Progress so Far:
I collected the data from a public source where patients/victims usually post about their own situation and doctors/well-wishers reply to them. My assumption while crawling was that posts belong to my private class and comments belong to my public class. Altogether I started with 5K+5K posts/comments and got around 60% with a naive Bayes classifier without any major preprocessing. I will try a neural network soon. But before feeding the data into any classifier, I want to know how I can preprocess it better, so that each class gets reasonable weights and the two are easier to tell apart.
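For concreteness, a bag-of-words naive Bayes baseline of the kind described above could look like the sketch below. It is a from-scratch toy using only the standard library, with Laplace smoothing; the function names `tokenize`, `train_naive_bayes`, and `predict` and the four training sentences are invented for illustration, and this is not the exact setup that produced the 60% figure.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(samples):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    class_docs = Counter()   # number of documents per class
    word_counts = {}         # per-class token counts
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return {"class_docs": class_docs, "word_counts": word_counts, "vocab": vocab}

def predict(model, text):
    """Return the label with the highest log-posterior for `text`."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokenize(text):
            if tok in model["vocab"]:  # ignore words never seen in training
                # Laplace (add-one) smoothed log likelihood
                score += math.log((counts[tok] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data: posts are labelled private, replies public.
train = [
    ("I have a lump on my leg and I am scared", "private"),
    ("my ultrasound showed a cyst and my leg aches", "private"),
    ("you should visit a clinic and get a biopsy", "public"),
    ("do not worry, you should ask for an ultrasound", "public"),
]
model = train_naive_bayes(train)
print(predict(model, "my leg hurts and I am afraid"))
print(predict(model, "you should get a biopsy"))
```

On this toy data the first-person and second-person words ("my", "I" versus "you", "should") carry most of the signal, which suggests that preprocessing steps such as stop-word removal may discard exactly the tokens that separate the two classes.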