English grammar for parsing in NLTK [closed]

Is there a ready-to-use English grammar that I can just load and use in NLTK? I've searched around for examples of parsing with NLTK, but it seems that I have to manually specify a grammar before parsing a sentence.

Thanks a lot!

Cluff answered 24/5, 2011 at 19:17 Comment(0)

You can take a look at pyStatParser, a simple Python statistical parser that returns NLTK parse trees. It comes with public treebanks and generates the grammar model only the first time you instantiate a Parser object (in about 8 seconds). It uses a CKY algorithm and parses average-length sentences (like the one below) in under a second.

>>> from stat_parser import Parser
>>> parser = Parser()
>>> print parser.parse("How can the net amount of entropy of the universe be massively decreased?")
(SBARQ
  (WHADVP (WRB how))
  (SQ
    (MD can)
    (NP
      (NP (DT the) (JJ net) (NN amount))
      (PP
        (IN of)
        (NP
          (NP (NNS entropy))
          (PP (IN of) (NP (DT the) (NN universe))))))
    (VP (VB be) (ADJP (RB massively) (VBN decreased))))
  (. ?))
Apprehensive answered 29/7, 2013 at 22:52 Comment(4)
For Python 3 users, there's a pull request to add Python 3 support here: github.com/emilmont/pyStatParser/pull/7. I only found out about that pull request after using the 2to3 tool to "manually" convert all the files from Python 2 to Python 3.Siler
To build the grammar model and run an example: python example.py with the default text hardcoded. Very easy to use and embeddable.Schiffman
I've issued these commands to be able to use pyStatParser: 2to3 --output-dir=stat_parser3 -W -n stat_parser, then rm -r stat_parser, then mv stat_parser3 stat_parser, then setup.py build and setup.py install, and it worked, thanks @ApprehensiveIndene
The library would parse "The Sun rises from the East" as - (SINV (NP (NP (DT the) (NNP Sun) (NNP rises)) (PP (IN from) (NP (DT the) (NNP East)))) (. .)) Shouldn't "rises" be a VP? How do we avoid interpreting "rises" as a proper noun?Kilogrammeter

My library, spaCy, provides a high-performance dependency parser.

Installation:

pip install spacy
python -m spacy.en.download all

Usage:

from spacy.en import English
nlp = English()
doc = nlp(u'A whole document.\nNo preprocessing require.   Robust to arbitrary formating.')
for sent in doc:
    for token in sent:
        if token.is_alpha:
            print token.orth_, token.tag_, token.head.lemma_

Choi et al. (2015) found spaCy to be the fastest dependency parser available. It processes over 13,000 sentences a second on a single thread. On the standard WSJ evaluation it scores 92.7%, over 1% more accurate than any of CoreNLP's models.

Golding answered 8/9, 2015 at 20:25 Comment(6)
thank you for this, I'm excited to check out spaCy. Is there a way to selectively import only the minimal amount of data necessary to parse your example sentence? Whenever I run spacy.en.download all it initiates a download that appears to be over 600 MB!Lucania
In addition, my empty 1GB RAM vagrant box doesn't seem to be able to handle the memory required by spaCy and faults with a MemoryError. I'm assuming it's loading the whole dataset into memory?Strontium
You can't load only the data necessary to parse one sentence, no; the assumed usage is that you'll parse arbitrary text. It does require 2-3 GB of memory per process. We expect the memory requirements to go down when we finish switching over to a neural network. In the meantime, we've added multi-threading support, so that you can amortise the memory requirement across multiple CPUs.Golding
Note that the correct usage is now for sent in doc.sents:Regress
@JamesKo API changed, use: import spacy, then nlp = spacy.load('en'), and then process your sentences as: doc = nlp(u'Your unprocessed document here')Wiring
It is now python -m spacy download enSennar
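
Putting the comments above together, here is a minimal sketch of the same loop against the current spaCy API (assuming the en_core_web_sm model has been installed with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('A whole document.\nNo preprocessing require.   Robust to arbitrary formating.')
for sent in doc.sents:  # doc.sents yields sentences; iterating doc directly yields tokens
    for token in sent:
        if token.is_alpha:
            print(token.orth_, token.tag_, token.head.lemma_)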

There are a few grammars in the nltk_data distribution. In your Python interpreter, issue nltk.download().
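
For example, a minimal sketch using the ATIS grammar from the large_grammars package (the sample sentence is illustrative; a hand-written CFG like this only covers its own vocabulary):

import nltk

nltk.download('large_grammars')                               # one-time fetch into nltk_data
grammar = nltk.data.load('grammars/large_grammars/atis.cfg')  # a large CFG for ATIS sentences
parser = nltk.ChartParser(grammar)
for tree in parser.parse('show me northwest flights to detroit .'.split()):
    print(tree)
    break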

Beebe answered 24/5, 2011 at 19:25 Comment(4)
Yes, but it's not sufficient for an arbitrary sentence. When I try some random sentence, it shows "Grammar does not cover some of the input words: ...." Am I doing it wrong? I want to get a parse tree of a sentence. Is this the right way to do it? Thanks!Cluff
@roboren: you could take the Penn treebank portion in nltk_data and derive a CFG from it by simply turning tree fragments (a node and its direct subnodes) into rules. But you probably won't find a "real" grammar unless you look into statistical parsing; no-one builds non-stochastic grammars anymore since they just don't work, except for very domain-specific applications.Beebe
Does nltk provide statistical parsing? Otherwise, I may want to switch to Stanford parser. Once again, thank you very much =)Cluff
Yes: nltk.googlecode.com/svn-history/r7492/trunk/doc/api/…. Not sure if you have to derive the grammar for this yourself, though.Beebe
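
The derivation described in the comment above is built into NLTK, and NLTK's ViterbiParser is its statistical (PCFG) parser. A sketch, assuming the 10% Penn Treebank sample from nltk_data is installed:

import nltk
from nltk.corpus import treebank

# nltk.download('treebank')  # one-time fetch of the Penn Treebank sample
productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()  # each node and its direct subnodes become a rule
grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)
parser = nltk.ViterbiParser(grammar)   # probabilistic parsing with the induced grammar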

There is a library called Pattern. It is quite fast and easy to use.

>>> from pattern.en import parse
>>>  
>>> s = 'The mobile web is more important than mobile apps.'
>>> s = parse(s, relations=True, lemmata=True)
>>> print s

'The/DT/B-NP/O/NP-SBJ-1/the mobile/JJ/I-NP/O/NP-SBJ-1/mobile' ... 
Contusion answered 25/7, 2014 at 21:42 Comment(1)
This is shallow parsing output (also called chunking). I'm not sure that's what OP is after.Denims

Use MaltParser. It comes with a pretrained English grammar, as well as models for some other languages, and it is a dependency parser rather than a simple bottom-up or top-down parser.

Just download MaltParser from http://www.maltparser.org/index.html and use it from NLTK like this:

import nltk
parser = nltk.parse.malt.MaltParser()
Smaragdine answered 8/8, 2012 at 6:47 Comment(1)
MaltParser looks good, but I wasn't able to get it working with nltk (it kept failing with the message "Couldn't find the MaltParser configuration file: malt_temp.mco"). The MaltParser itself I got working fine.Humbuggery
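
With recent NLTK versions, the MaltParser wrapper takes the parser directory and a pretrained model explicitly. A sketch, assuming MaltParser and the engmalt.linear-1.7.mco English model from maltparser.org have been unpacked locally (the paths and version numbers are illustrative):

import nltk

# Point these at your local MaltParser directory and downloaded model file.
mp = nltk.parse.malt.MaltParser('maltparser-1.7.2', 'engmalt.linear-1.7.mco')
graph = mp.parse_one('I shot an elephant in my pajamas .'.split())
print(graph.tree())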

I've tried NLTK, pyStatParser, and Pattern. IMHO Pattern is the best English parser introduced in the answers above, because it supports pip install and there is fancy documentation on the website (http://www.clips.ua.ac.be/pages/pattern-en). I couldn't find reasonable documentation for NLTK (and it gave me inaccurate results by default, and I couldn't find how to tune it). pyStatParser is much slower than described above in my environment (about one minute for initialization, and a couple of seconds to parse long sentences; maybe I didn't use it correctly).

Cramp answered 10/11, 2014 at 23:2 Comment(2)
Pattern doesn't seem to be doing parsing (as in, dependency parsing), only POS-tagging and maybe chunking. It's fairly normal for parsers to take a while on long sentences.Denims
@NikanaReklawyks exactly, the right nltk tool here is something like pyStatParser, which builds a PCFG, i.e. a Probabilistic Context-Free Grammar - cs.columbia.edu/~mcollins/courses/nlp2011/notes/pcfgs.pdfSchiffman

Did you try POS tagging in NLTK?

import nltk
from nltk import word_tokenize

text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)

The output is something like this:

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

I got this example from here: NLTK_chapter03

Rouse answered 24/10, 2017 at 18:14 Comment(0)

I found out that NLTK works well with the parser grammar developed by Stanford.

Syntax Parsing with Stanford CoreNLP and NLTK

It is very easy to start using Stanford CoreNLP and NLTK. All you need is a small amount of preparation; after that you can parse sentences with the following code:

from nltk.parse.corenlp import CoreNLPParser
parser = CoreNLPParser()
parse = next(parser.raw_parse("I put the book in the box on the table."))

Preparation:

  1. Download the Stanford CoreNLP Java models
  2. Run CoreNLPServer

You can use the following code to run the CoreNLPServer:

import os
from nltk.parse.corenlp import CoreNLPServer
# The server needs to know the location of the following files:
#   - stanford-corenlp-X.X.X.jar
#   - stanford-corenlp-X.X.X-models.jar
STANFORD = os.path.join("models", "stanford-corenlp-full-2018-02-27")
# Create the server
server = CoreNLPServer(
   os.path.join(STANFORD, "stanford-corenlp-3.9.1.jar"),
   os.path.join(STANFORD, "stanford-corenlp-3.9.1-models.jar"),    
)
# Start the server in the background
server.start()

Do not forget to stop the server by executing server.stop().
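
Once the server is running, you can also get dependency parses through the same interface. A sketch using nltk's CoreNLPDependencyParser (to_conll(4) prints word, tag, head, and relation columns):

from nltk.parse.corenlp import CoreNLPDependencyParser

dep_parser = CoreNLPDependencyParser()
parse, = dep_parser.raw_parse("I put the book in the box on the table.")
print(parse.to_conll(4))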

Leroylerwick answered 17/2, 2020 at 19:16 Comment(0)

SpaCy 2024

Iterating a Doc yields Tokens, so the nested loop from the spaCy answer above now fails with:

for token in sent:
TypeError: 'spacy.tokens.token.Token' object is not iterable

Solution:

from itertools import chain
from typing import Callable, Iterable, List

import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token

def alpha_tokens(text: str) -> List[Token]:  # illustrative wrapper name
    # Sam Redway ==> "You can now load the package via spacy.load('en_core_web_sm')"
    nlp: Language = spacy.load('en_core_web_sm')
    doc: Doc = nlp(text)
    tok: Iterable[Token] = chain(doc)  # flatten the Doc into a stream of Tokens
    f: Callable[[Token], bool] = lambda token: token.is_alpha
    return list(filter(f, tok))
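
Usage of the hypothetical wrapper above:

print(alpha_tokens('A whole document.\nNo preprocessing required.'))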
Edmea answered 23/6 at 8:30 Comment(0)
