Keyword/keyphrase extraction from text [closed]

I am working on a project where I need to extract "technology related keywords/keyphrases" from text. For example, my text is:

"ABC Inc has been working on a project related to machine learning which makes use of the existing libraries for finding information from big data."

The extracted keywords/keyphrase should be: {machine learning, big data}.

My text documents are stored as BSON documents in MongoDB.

What are the best NLP libraries (with sufficient documentation and examples) for this task, and how would I use them?

Thanks!

Herald answered 13/3, 2018 at 18:28 Comment(0)

It looks like you need to narrow things down further than keywords/keyphrases and find the subject and object of each sentence. For subject/object recognition, I recommend the Stanford Parser or the Google Natural Language API, where you send a string and get a dependency tree in response.

You can test the Google API first to see if it works well with your corpus: https://cloud.google.com/natural-language/

The outcome here is a subject-predicate-object (SPO) triplet, where the predicate describes the relationship between the subject and the object. You'll need to write a script that traverses the dependency graph and parses out the triplet.
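
As a rough sketch of that traversal step, the snippet below pulls one triplet out of a parsed sentence. Everything about the input format is an assumption made for illustration: the `Token` class, the `extract_spo` helper, and the hand-written parse are hypothetical, and the labels (`nsubj`, `dobj`, `ROOT`) only follow common dependency-label conventions. Real parser output will use its own structures and label set, so treat this as a starting point rather than a finished extractor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    """Hypothetical token record; real parsers expose similar fields."""
    text: str
    dep: str   # dependency label, e.g. "nsubj", "dobj", "ROOT"
    head: int  # index of the head token (the root points to itself)


def extract_spo(tokens: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, predicate, object) around the root verb, or None."""
    root = next((i for i, t in enumerate(tokens) if t.dep == "ROOT"), None)
    if root is None:
        return None

    subj = obj = None
    for tok in tokens:
        if tok.head != root:
            continue  # only look at direct children of the root
        if tok.dep in ("nsubj", "nsubjpass") and subj is None:
            subj = tok.text
        elif tok.dep in ("dobj", "obj") and obj is None:
            obj = tok.text

    if subj is None or obj is None:
        return None
    return (subj, tokens[root].text, obj)


# Hand-written parse of "ABC uses libraries" (not produced by a real parser).
sentence = [
    Token("ABC", "nsubj", 1),
    Token("uses", "ROOT", 1),
    Token("libraries", "dobj", 1),
]
print(extract_spo(sentence))
```

This only looks at direct children of the root and returns single words. For multi-word subjects or objects such as "machine learning" you would also need to collect each child's own dependents (compounds, modifiers).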

Other packages: I use NLTK, spaCy, and TextBlob frequently. If the corpus is simple, generic, and straightforward, spaCy and TextBlob work well out of the box. If the corpus is highly customized, domain-specific, or messy (incorrect spelling or grammar, etc.), I use NLTK and spend more time customizing my text processing pipeline with scrubbing, lemmatizing, and so on. Whichever package you choose, you may want to add your own custom dictionary of technology-related keywords and keyphrases so that your parser can catch them.

NLTK Tutorial: http://www.nltk.org/book/

spaCy Quickstart: https://spacy.io/usage/

TextBlob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html
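
The custom dictionary idea mentioned above can be tried before committing to any of these packages. Here is a minimal sketch using only the standard library; the `find_known_phrases` helper and the `TECH_TERMS` list are made up for illustration, and a real vocabulary would be much larger and probably loaded from a file or a MongoDB collection.

```python
import re
from typing import Iterable, List


def find_known_phrases(text: str, vocabulary: Iterable[str]) -> List[str]:
    """Return vocabulary phrases found in text, in order of first appearance."""
    # Try longer phrases first so "machine learning" wins over a shorter overlap.
    phrases = sorted({p.lower() for p in vocabulary}, key=len, reverse=True)
    if not phrases:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    found: List[str] = []
    for match in pattern.finditer(text):
        phrase = match.group(0).lower()
        if phrase not in found:
            found.append(phrase)
    return found


# Illustrative vocabulary only; extend it with your own domain terms.
TECH_TERMS = ["machine learning", "big data", "deep learning"]

text = (
    "ABC Inc has been working on a project related to machine learning "
    "which makes use of the existing libraries for finding information "
    "from big data."
)
print(find_known_phrases(text, TECH_TERMS))
```

Note that the `\b` word boundaries assume every term starts and ends with a letter or digit, so entries like "C++" would need different handling.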

Maxentia answered 13/3, 2018 at 21:3 Comment(3)
Some extra pointers (as this SO question was one of the first results in my search for keyword extraction): check out TextRank and RAKE. A relevant applied use case: graphaware.com/neo4j/2017/10/03/… - Snooze
@JoppeGeluykens Do you know if these libraries work with non-English texts? - Aggressor
@Aggressor The algorithms should be language-independent, but you might want to use a list of stopwords specific to the target language. - Snooze
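
To show why the stopword list is the only language-specific part, here is a simplified RAKE-style scorer written from scratch (it is not the API of any RAKE package). It splits the text into candidate phrases at punctuation and stopwords, scores each word as degree/frequency, and sums the word scores per phrase. The `rake` function name and the tiny `STOPWORDS` set are assumptions for illustration; in practice you would plug in a full stopword list for the target language.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def rake(text: str, stopwords: Set[str]) -> List[Tuple[str, float]]:
    """Score candidate phrases RAKE-style; highest score first."""
    # 1. Split at punctuation, then cut each fragment at stopwords.
    phrases: List[List[str]] = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current: List[str] = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    # 2. Word frequency and degree (phrase length counts the word itself).
    freq: Dict[str, int] = defaultdict(int)
    degree: Dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # 3. Phrase score is the sum of degree/frequency over its words.
    scores: Dict[str, float] = {}
    for phrase in phrases:
        scores[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Tiny illustrative stopword list; swap in one for your target language.
STOPWORDS = {"is", "a", "for", "the", "of"}

ranked = rake("Machine learning is a tool for the analysis of big data.", STOPWORDS)
for phrase, score in ranked:
    print(f"{score:.1f}  {phrase}")
```

With this input the two-word phrases "machine learning" and "big data" score highest, because every word in a longer phrase gets a higher degree. Switching languages only means passing a different `stopwords` set.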
