Separate word lists for nouns, verbs, adjectives, etc

Word lists usually come as a single file that contains everything, but are there separately downloadable noun lists, verb lists, adjective lists, etc.?

I need them for English specifically.

Savoyard answered 18/2, 2010 at 13:42 Comment(0)

See Kevin's word lists, particularly the "Part Of Speech Database." You'll have to do some minimal text processing of your own to split the database into multiple files, but that can be done very easily with a few grep commands.
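
For example, here is a minimal sketch of that splitting step, assuming the database is the part-of-speech.txt file from that page and that each line holds a word, a tab, and its part-of-speech codes (N for noun, V for verb, A for adjective, and so on); awk is used here rather than grep because the codes sit in their own field:

# one output file per part-of-speech code (adjust codes and file names to the actual data)
awk -F '\t' '$2 ~ /N/ { print $1 }' part-of-speech.txt > nouns.txt
awk -F '\t' '$2 ~ /V/ { print $1 }' part-of-speech.txt > verbs.txt
awk -F '\t' '$2 ~ /A/ { print $1 }' part-of-speech.txt > adjectives.txt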

The license terms are available on the "readme" page.

Appassionato answered 18/2, 2010 at 15:42 Comment(0)

If you download just the database files from wordnet.princeton.edu/download/current-version, you can extract the words by running these commands:

egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.adj | cut -d ' ' -f 5 > conv.data.adj
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.adv | cut -d ' ' -f 5 > conv.data.adv
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.noun | cut -d ' ' -f 5 > conv.data.noun
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z_]*\s" data.verb | cut -d ' ' -f 5 > conv.data.verb

Or, if you only want single words (no underscores):

egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.adj | cut -d ' ' -f 5 > conv.data.adj
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.adv | cut -d ' ' -f 5 > conv.data.adv
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.noun | cut -d ' ' -f 5 > conv.data.noun
egrep -o "^[0-9]{8}\s[0-9]{2}\s[a-z]\s[0-9]{2}\s[a-zA-Z]*\s" data.verb | cut -d ' ' -f 5 > conv.data.verb
Pastis answered 11/12, 2014 at 4:9 Comment(5)
This doesn't seem to add much to what was said 4 years ago. – Silvern
Speak for yourself, this is exactly what I needed. Thanks Chilly! – Mclyman
Link is broken, think it should be: wordnet.princeton.edu/download/current-version – Procto
You da real MVP! – Biforked
Not sure of the cut command on Windows, so I did it in Notepad++. Search: ^[^a-z]*?[a-z][^a-z]*?([a-zA-Z]+).*?$ Replace: \1 – Clip

This is a highly ranked Google result, so I'm digging up this two-year-old question to provide a far better answer than the existing one.

The "Kevin's Word Lists" page provides old lists from the year 2000, based on WordNet 1.6.

You are far better off going to https://wordnet.princeton.edu/download/current-version and downloading WordNet 3.0 (the Database-only version) or whatever the latest version is when you're reading this.

Parsing it is very simple: just apply the regex /^(\S+?)[\s%]/ to grab every word, then replace all underscores ("_") in the results with spaces. Finally, dump your results to whatever storage format you want. You'll be given separate lists of adjectives, adverbs, nouns, and verbs, plus a special list called "senses" (very useless or very useful, depending on what you're doing) which relates to our senses of smell, sight, hearing, etc., i.e. words such as "shirt" or "pungent".
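
As a concrete rendering of that recipe, here is a sketch using the index.* files from the database-only download; their lines begin with the word itself, and the license header lines begin with a space, so those are filtered out first (the file names and layout here are assumptions, so check them against your download):

# first field of each index line is the word; underscores become spaces
for pos in adj adv noun verb; do
    grep -v '^ ' "index.$pos" | cut -d ' ' -f 1 | tr '_' ' ' > "words.$pos"
done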

Enjoy! Remember to include their copyright notice if you're using it in a project.

Monamonachal answered 15/8, 2012 at 16:31 Comment(4)
Which files do you use though? – Bragg
Note that WordNet 3.0 does not contain conjugations, e.g. if you search for the word "are" in the list of verbs, it will come up with nothing. Of course "be" is in there, so the verb is there, just not the conjugation. – Cajuput
The link is dead. – Peen
@Cajuput any recommendations for a list of conjugations? – Himself

As others have suggested, the WordNet database files are a great source for parts of speech. That said, the examples used to extract the words aren't entirely correct. Each line is actually a "synonym set" consisting of multiple synonyms and their definition. Around 30% of words appear only as synonyms, so simply extracting the first word misses a large amount of data.

The line format is pretty simple to parse (see search.c, function parse_synset), but if all you're interested in is the words, the relevant part of the line is formatted as:

NNNNNNNN NN a NN word N [word N ...]

These correspond to:

  • Byte offset within file (8 character integer)
  • File number (2 character integer)
  • Part of speech (1 character)
  • Number of words (2 characters, hex encoded)
  • N occurrences of...
    • Word with spaces replaced with underscores, optional comment in parentheses
    • Word lexical ID (a unique occurrence ID)

For example, from data.adj:

00004614 00 s 02 cut 0 shortened 0 001 & 00004412 a 0000 | with parts removed; "the drastically cut film"
  • Byte offset within the file is 4614
  • File number is 0
  • Part of speech is s, corresponding to adjective (wnutil.c, function getpos)
  • Number of words is 2
    • First word is cut with lexical ID 0
    • Second word is shortened with lexical ID 0

A short Perl script that simply dumps the words from the data.* files:

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    # If no 8-digit byte offset is present, skip this line
    if ( $line !~ /^[0-9]{8}\s/ ) { next; }
    chomp($line);

    my @tokens = split(/ /, $line);
    shift(@tokens); # Byte offset
    shift(@tokens); # File number
    shift(@tokens); # Part of speech

    my $word_count = hex(shift(@tokens));
    foreach ( 1 .. $word_count ) {
        my $word = shift(@tokens);
        $word =~ tr/_/ /;
        $word =~ s/\(.*\)//;
        print $word, "\n";

        shift(@tokens); # Lexical ID
    }
}

A gist of the above script can be found here.
A more robust parser which stays true to the original source can be found here.

Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE.

Posy answered 6/12, 2016 at 18:17 Comment(1)
Thank you so much for adding this useful answer to this older question. You have definitely made my life a lot easier. I'd upvote 99 more times if I could. – Derrickderriey

http://icon.shef.ac.uk/Moby/mpos.html

Each part-of-speech vocabulary entry consists of a word or phrase field followed by a field delimiter of × (ASCII 215) and the part-of-speech field, which is coded using the following ASCII symbols (case is significant):

Noun                            N
Plural                          p
Noun Phrase                     h
Verb (usu participle)           V
Verb (transitive)               t
Verb (intransitive)             i
Adjective                       A
Adverb                          v
Conjunction                     C
Preposition                     P
Interjection                    !
Pronoun                         r
Definite Article                D
Indefinite Article              I
Nominative                      o
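
A sketch of splitting such a file, assuming it is named mpos.txt (adjust to the actual file name in the archive), is Latin-1 encoded, and uses that single 0xD7 byte as the delimiter:

# re-encode so the delimiter becomes a plain UTF-8 character, then split on it
iconv -f ISO-8859-1 -t UTF-8 mpos.txt | awk -F '×' '$2 ~ /N/ { print $1 }' > moby-nouns.txt
iconv -f ISO-8859-1 -t UTF-8 mpos.txt | awk -F '×' '$2 ~ /V/ { print $1 }' > moby-verbs.txt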
Testa answered 19/9, 2015 at 13:39 Comment(3)
link is dead now – Massingill
@Łukasz D. Tulikowski Any chance you still have this data mpos.tar.Z? Just checked the archive but the download link was not archived web.archive.org/web/20170213044128/http://icon.shef.ac.uk/Moby/… web.archive.org/web/20170213044128/http://www.dcs.shef.ac.uk/… – Mcnary
Looks like it's there, along with others: ai1.ai.uga.edu/ftplib/natural-language/moby web.archive.org/web/20240414230916/https://ai1.ai.uga.edu/… – Mcnary
