ARFF for natural language processing

I'm trying to take a set of reviews and convert them into the ARFF format for use with WEKA. Unfortunately, either I completely misunderstand how the format works, or I'll need an attribute for every possible word, each with a presence indicator. Does anyone know a better way or, ideally, have a sample ARFF file?

Rabaul answered 28/5, 2011 at 14:19 Comment(0)
R
3

Took a while to work out, but with this input.arff:

@relation text_files

@attribute review string
@attribute sentiment {0, 1}

@data
"this is some text", 1
"this is some more text", 1
"different stuff", 0

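If the reviews start out as plain strings in a program, a file shaped like this one can be generated rather than typed by hand. This is only a minimal sketch: `arff_quote` and `reviews_to_arff` are made-up helper names, and the backslash escaping of quotes is my assumption about what Weka's ARFF reader accepts, so check it against your Weka version.

```python
def arff_quote(text):
    """Quote a string value for ARFF (assumed escaping: backslash before \\ and ")."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def reviews_to_arff(reviews, relation="text_files"):
    """Build ARFF text from (review_text, label) pairs, where label is 0 or 1."""
    lines = [
        "@relation " + relation,
        "",
        "@attribute review string",
        "@attribute sentiment {0, 1}",
        "",
        "@data",
    ]
    for text, label in reviews:
        # Collapse runs of whitespace (including newlines) so each
        # instance stays on a single line of the @data section.
        one_line = " ".join(text.split())
        lines.append(arff_quote(one_line) + ", " + str(label))
    return "\n".join(lines) + "\n"


reviews = [
    ("this is some text", 1),
    ("this is some more text", 1),
    ("different stuff", 0),
]
arff_text = reviews_to_arff(reviews)
print(arff_text)
```

Writing `arff_text` to `input.arff` gives a file with one string attribute and one class attribute, so there is no need to declare a column per word yourself.
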
And this command:

java -classpath "C:\\Program Files\\Weka-3-6\\weka.jar" weka.filters.unsupervised.attribute.StringToWordVector -i input.arff -o output.arff

The following is produced:

@relation 'text_files-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

@attribute sentiment {0,1}
@attribute different numeric
@attribute is numeric
@attribute more numeric
@attribute some numeric
@attribute stuff numeric
@attribute text numeric
@attribute this numeric

@data

{0 1,2 1,4 1,6 1,7 1}
{0 1,2 1,3 1,4 1,6 1,7 1}
{1 1,5 1}
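
To see where those rows come from, here is a rough Python re-creation of the transformation for this toy input. It is not Weka's implementation, just a sketch under some assumptions: tokens are split on whitespace only, words are sorted alphabetically, the class attribute is placed at index 0, and each word gets a presence flag of 1. `to_sparse_rows` is a name I made up.

```python
def to_sparse_rows(docs):
    """Turn (text, label) pairs into a sorted vocabulary and sparse ARFF-style rows."""
    # Vocabulary: every distinct whitespace-separated token, sorted alphabetically.
    vocab = sorted({word for text, _ in docs for word in text.split()})
    # Index 0 is reserved for the class attribute; words start at index 1.
    index = {word: i + 1 for i, word in enumerate(vocab)}
    rows = []
    for text, label in docs:
        pairs = {index[word]: 1 for word in text.split()}  # presence flag per word
        if label != 0:
            pairs[0] = label  # the sparse format leaves out zero values
        body = ",".join(f"{i} {v}" for i, v in sorted(pairs.items()))
        rows.append("{" + body + "}")
    return vocab, rows


docs = [
    ("this is some text", 1),
    ("this is some more text", 1),
    ("different stuff", 0),
]
vocab, rows = to_sparse_rows(docs)
print(vocab)
for row in rows:
    print(row)
```

For the three reviews above this prints the same word list and the same three sparse rows as the Weka output. Note that the third row has no entry for index 0 because its sentiment is the first nominal label, which the sparse format treats as the omitted default.
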
Rabaul answered 28/5, 2011 at 16:4 Comment(2)
Do you know what the pairs like 0 1, separated by commas, in {0 1,2 1,4 1,6 1,7 1} represent? I think this is different from the conventional .arff format. Have you had any luck getting meaningful results with WEKA? - Anaesthesia
This is quite an old post, but from what I remember, the first number in each pair is the zero-based @attribute index, and the second is the value of that attribute, here the word's occurrence count in the string (or just a 0/1 presence flag if the filter is not set to output counts). I think that to a certain extent it can mean what you want, as long as you understand what the results mean. - Rabaul
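
To make that reading of the pairs concrete, here is a small sketch that maps one sparse row back to attribute names. `decode_sparse_row` is a hypothetical helper, the attribute list is copied by hand from the output header, and the parsing is deliberately naive (it would not cope with quoted values that contain commas or spaces).

```python
def decode_sparse_row(row, attribute_names):
    """Map a sparse row like '{0 1,2 1}' to {attribute_name: value}."""
    inner = row.strip().strip("{}").strip()
    decoded = {}
    if inner:
        for pair in inner.split(","):
            idx, value = pair.split()  # "<attribute index> <value>"
            decoded[attribute_names[int(idx)]] = value
    return decoded


# Attribute order as it appears in the header of output.arff.
attribute_names = [
    "sentiment", "different", "is", "more", "some", "stuff", "text", "this",
]

first = decode_sparse_row("{0 1,2 1,4 1,6 1,7 1}", attribute_names)
third = decode_sparse_row("{1 1,5 1}", attribute_names)
print(first)
print(third)
```

Any attribute that does not appear in a row is implicitly 0, which is why the third row lists only "different" and "stuff".
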
M
4

If you store the reviews as plain text files, sorted into one folder per class (positive and negative in your case), you can use TextDirectoryLoader.

You can find it in Weka's KnowledgeFlow application, or run it from the command line. More info here: http://weka.wikispaces.com/ARFF+files+from+Text+Collections

Magree answered 29/5, 2011 at 9:35 Comment(1)
Is the format of the files one instance, say a review, per line in the txt files? - Rabaul
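
As far as I understand the loader, it treats each text file as one instance and uses the name of the sub-folder as the class label, so it is one review per file rather than one per line. A minimal sketch for laying reviews out that way is below; `write_review_folders`, the `reviews` folder name, and the `review_NNNN.txt` file naming are all made up for illustration.

```python
import tempfile
from pathlib import Path


def write_review_folders(reviews, root):
    """Write each (text, label) review to root/<label>/review_NNNN.txt."""
    root = Path(root)
    counts = {}
    written = []
    for text, label in reviews:
        class_dir = root / label  # one sub-folder per class
        class_dir.mkdir(parents=True, exist_ok=True)
        counts[label] = counts.get(label, 0) + 1
        path = class_dir / f"review_{counts[label]:04d}.txt"
        path.write_text(text, encoding="utf-8")  # one review per file
        written.append(path)
    return written


reviews = [
    ("this is some text", "positive"),
    ("this is some more text", "positive"),
    ("different stuff", "negative"),
]

# A temporary directory keeps the example self-contained; use a real path in practice.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_review_folders(reviews, Path(tmp) / "reviews")
    relative = sorted(p.relative_to(tmp).as_posix() for p in paths)
    print(relative)
```

The resulting folder can then be handed to the loader. From memory the command-line form is roughly `java weka.core.converters.TextDirectoryLoader -dir reviews > reviews.arff`, but treat that as an assumption and check the linked page for the exact options.
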