ARFF for natural language processing
I'm trying to take a set of reviews and convert them into the ARFF format for use with WEKA. Unfortunately, either I completely misunderstand how the format works, or I'll need an attribute for every possible word plus a presence indicator for each. Does anyone know a better way, or ideally have a sample ARFF file?
If you store the reviews as plain text files in separate folders, one folder per class (positive and negative in your case), you can use TextDirectoryLoader.
It is available in the KnowledgeFlow application in Weka or from the command line. More info here: http://weka.wikispaces.com/ARFF+files+from+Text+Collections
Is the format of the files one instance (say, a review) per line in the .txt files? – Rabaul
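On the layout question: as far as I know, TextDirectoryLoader treats each file as one instance and uses the name of the subfolder as the class label, so the usual layout is one review per file rather than one review per line. The sketch below writes a few reviews into that layout. The function name write_review_folders, the file naming scheme, and the sample labels are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_review_folders(reviews, root):
    """Write each (text, label) pair to <root>/<label>/review_NNNN.txt.

    Assumption: one file per review, one subfolder per class, which is
    the layout TextDirectoryLoader is documented to read.
    """
    root = Path(root)
    counts = {}
    for text, label in reviews:
        class_dir = root / label
        class_dir.mkdir(parents=True, exist_ok=True)
        n = counts.get(label, 0)
        counts[label] = n + 1
        (class_dir / f"review_{n:04d}.txt").write_text(text, encoding="utf-8")
    return counts


# Hypothetical sample data, written to a temporary folder for the demo.
reviews = [
    ("this is some text", "positive"),
    ("this is some more text", "positive"),
    ("different stuff", "negative"),
]
with tempfile.TemporaryDirectory() as tmp:
    counts = write_review_folders(reviews, tmp)
    created = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.txt"))

print(counts)
print(created)
```

Pointing the loader at the root folder (its -dir option, if I remember the command line correctly) should then produce an ARFF file with a string attribute for the text and a nominal class attribute.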
Took a while to work out, but with this input.arff:
@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"this is some text", 1
"this is some more text", 1
"different stuff", 0
And this command:
java -classpath "C:\\Program Files\\Weka-3-6\\weka.jar" weka.filters.unsupervised.attribute.StringToWordVector -i input.arff -o output.arff
The following is produced:
@relation 'text_files-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'
@attribute sentiment {0,1}
@attribute different numeric
@attribute is numeric
@attribute more numeric
@attribute some numeric
@attribute stuff numeric
@attribute text numeric
@attribute this numeric
@data
{0 1,2 1,4 1,6 1,7 1}
{0 1,2 1,3 1,4 1,6 1,7 1}
{1 1,5 1}
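For more than a handful of reviews, an input file like the input.arff in this answer can be generated by a script instead of by hand. This is only a sketch: the helper names quote_arff_string and reviews_to_arff are hypothetical, and it assumes that backslash-escaped double quotes are accepted inside quoted ARFF strings.

```python
def quote_arff_string(text):
    """Quote a value for a string attribute in an ARFF @data row."""
    # Assumption: backslash escapes are accepted inside quoted strings.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    # Keep every instance on a single line.
    escaped = escaped.replace("\r", " ").replace("\n", " ")
    return '"' + escaped + '"'


def reviews_to_arff(reviews, relation="text_files"):
    """Build ARFF text from (review_text, label) pairs."""
    labels = sorted({label for _, label in reviews})
    lines = [
        f"@relation {relation}",
        "@attribute review string",
        "@attribute sentiment {" + ", ".join(str(label) for label in labels) + "}",
        "@data",
    ]
    for text, label in reviews:
        lines.append(f"{quote_arff_string(text)}, {label}")
    return "\n".join(lines) + "\n"


reviews = [
    ("this is some text", 1),
    ("this is some more text", 1),
    ("different stuff", 0),
]
arff_text = reviews_to_arff(reviews)
print(arff_text)
```

Saving arff_text as input.arff gives a file that can be passed to the StringToWordVector command with -i.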
Do you know what the tuples like 0 1, separated by commas in {0 1,2 1,4 1,6 1,7 1}, represent? I think this is different from the conventional .arff format. Have you had any luck getting meaningful results with WEKA? – Anaesthesia
This is quite an old post, but from what I remember the first number in each pair is the zero-based index of the @attribute, and the second number is that attribute's value: with the default settings shown here, 1 if the word is present in the string (or the occurrence count, if word counts are enabled). Attributes whose value is 0 are simply left out of the row. I think that to a certain extent it can mean what you want, as long as you understand what the results mean. – Rabaul
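To make the pair notation concrete, here is a small sketch that expands the sparse rows from the output above back into word lists. The attribute order is copied from the @attribute lines of the output; the function names are made up, and the naive comma split assumes no quoted values containing commas.

```python
# Attribute order copied from the output file; index 0 is the class.
attributes = ["sentiment", "different", "is", "more", "some", "stuff", "text", "this"]


def parse_sparse_row(row):
    """Turn '{0 1,2 1}' into {0: '1', 2: '1'} (attribute index -> value)."""
    body = row.strip().strip("{}").strip()
    pairs = {}
    if not body:
        return pairs
    for item in body.split(","):
        index, value = item.split(None, 1)
        pairs[int(index)] = value.strip()
    return pairs


def describe(row):
    """Expand a sparse row into a full attribute -> value mapping."""
    present = parse_sparse_row(row)
    # Indices missing from a sparse row stand for 0. For the nominal
    # sentiment attribute that is its first declared value, also "0" here.
    return {name: present.get(i, "0") for i, name in enumerate(attributes)}


def words_in(row):
    """List the word attributes with a non-zero value in the row."""
    full = describe(row)
    return [name for name in attributes[1:] if full[name] != "0"]


for row in ["{0 1,2 1,4 1,6 1,7 1}", "{0 1,2 1,3 1,4 1,6 1,7 1}", "{1 1,5 1}"]:
    print(describe(row)["sentiment"], words_in(row))
```

The third row has no entry for index 0, which is why its sentiment comes out as 0 even though the row never mentions it.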