Organization of this question:
I. Background
II. The Problem/Question
III. Steps Taken to Make this Question Good
IV. Update: the output of head(x.path) and dput(x.path)
I. Background
I am customizing/adapting the e-mail classification code from the O'Reilly book "Machine Learning for Hackers" (Chapter 3). That code and its accompanying data can be found here: https://github.com/johnmyleswhite/ML_for_Hackers/tree/master/03-Classification
II. The Problem/Question
One of the main functions in that code is called get.msg()
. The original function is
get.msg <- function(path)
{
con <- file(path, open = "rt", encoding = "latin1")
text <- readLines(con)
# The message always begins after the first full line break
msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
close(con)
return(paste(msg, collapse = "\n"))
}
My data is different in a number of ways though, so I have to edit this quite a bit. My data is read in earlier from a relational DB, thus I don't have to read in and clean a text file. Instead, my email body data is the 18th column of a dataframe, which we can call x
. Here is my version of get.msg()
:
get.msg <- function(path) {
bodyvector <- path[!(is.na(path[,18]) | path[,18]==""), ]
return(paste(bodyvector))
}
Originally I referred to it as x$email
and this worked through most of the code, however in a later step the get.msg()
function was used on x.path
, where x.path
pointed to x
and was used within another function in combination with the paste()
function, as per the authors of the example code:
z.spam <- sapply(spam.docs, function(p) count.word(paste(x.path,p,sep = ""), "keyword"))
Here, the count.word()
function is a function containing get.msg()
. So, the paste()
function was causing problems because it caused x.path
to be considered an atomic array apparently, and gave the error that $ could not be used with an atomic array. As per an older StackOverflow Q&A, I changed the way I referred to the column to path[,18]
(which is evaluated as x.path[,18]
and therefore is the same as x[,18]
).
Then I did some checking to ensure that x.path[,18]
had the same information as x.path$email
, which it did. However, when I try to run the code I get an error message on get.msg(x.path)
, which is:
Error in path[,18] : incorrect number of dimensions.
I tried path[,'email']
, then path[18,]
and then just path
by itself and all three led to the same error. I tried path[[1]][[18]]
and that gave me a subscript out of bounds error.
Any thoughts?
III. Steps Taken to Make this Question Good
To avoid annoying anyone and getting any down votes, I confirmed that the topic was relevant to StackOverflow and I feel that it may be relevant to other people dealing with this or similar programming problems in the future. I also spent almost an hour researching this problem online and trying things in R to fix it.
There were plenty of references to this error message, however the causes seemed to be very diverse and completely unrelated (such as networking trouble, etc). Finally, I spent a significant amount of time editing this question to try to make it readable and properly formatted (I hope I did okay, I know it's a lot of information).
IV. The output of head()
and dput()
Some of you extremely helpful folks have requested to see the output of head(x.path)
or dput(x.path)
. I don't mind except that it's confidential company email data and I'll be out of a job and sued if I publish it. ;-)
I've pasted it here and replaced the real info with fake info. I hope this is okay. I tried to use dput()
at first and I can do so if you like but it was truly an overwhelming amount of data. Here's head(x.path)
:
> head(x.path)
[1] "c(\"Z12e3317e4b1jZbbajZ9Zdd6\", \"Z12e3317e4b1jZbbajZ99124\", \"Z12e331Ze4b1jZbbajZ996dd\", \"Z12e3319e4b1jZbbajZ9acb6\", \"Z12e3319e4b1jZbbajZ9ad3b\", \"Z12e3319e4b1jZbbajZ9adjd\", \"Z12e3319e4b1jZbbajZ9aebZ\", \"Z12e3319e4b1jZbbajZ9aj23\", \"Z12e3319e4b1jZbbajZ9b22b\", \"Z12e3319e4b1jZbbajZ9b42a\", \"Z12e3319e4b1jZbbajZ9b49a\", \"Z12e331ae4b1jZbbajZ9bZ11\", \"Z12e331ae4b1jZbbajZ9bZZ4\", \"Z12e331ae4b1jZbbajZ9c237\", \"Z12e331ae4b1jZbbajZ9c2e4\", \"Z12e331ae4b1jZbbajZ9c3bZ\", \"Z12e331ae4b1jZbbajZ9c3cZ\", \"Z12e331ae4b1jZbbajZ9cZ31\", \n\"Z12e331be4b1jZbbajZ9cddd\", \"Z12e331be4b1jZbbajZ9cja6\", \"Z12e331ce4b1jZbbajZ9da1j\", \"Z12e331de4b1jZbbajZ9e649\", \"Z12e331de4b1jZbbajZ9j669\", \"Z12e331de4b1jZbbajZ9jZZZ\", \"Z12e331ee4b1jZbbajZ9j944\", \"Z12e331ee4b1jZbbajZ9jcZa\", \"Z12e331ee4b1jZbbajZ9jd4c\", \"Z12e331ee4b1jZbbajZa11e2\", \"Z12e331ee4b1jZbbajZa1291\", \"Z12e331ee4b1jZbbajZa1344\", \"Z12e3311e4b1jZbbajZa1j73\", \"Z12e3311e4b1jZbbajZa1131\", \"Z12e3311e4b1jZbbajZa11Z6\", \"Z12e3311e4b1jZbbajZa124c\", \"Z12e3311e4b1jZbbajZa1Zbc\", \"Z12e3311e4b1jZbbajZa19a9\", \n\"Z12e3311e4b1jZbbajZa1ac2\", \"Z12e3311e4b1jZbbajZa1b79\", \"Z12e3311e4b1jZbbajZa1db2\", \"Z12e3311e4b1jZbbajZa1ejb\", \"Z12e3312e4b1jZbbajZa2333\", \"Z12e3312e4b1jZbbajZa23aZ\", \"Z12e3312e4b1jZbbajZa24bb\", \"Z12e3312e4b1jZbbajZa2Z79\", \"Z12e3312e4b1jZbbajZa2Zea\", \"Z12e3312e4b1jZbbajZa2ba9\", \"Z12e3312e4b1jZbbajZa2cZa\", \"Z12e3313e4b1jZbbajZa3bc1\", \"Z12e3313e4b1jZbbajZa3ca9\", \"Z12e3313e4b1jZbbajZa3e71\", \"Z12e3ajbe4b1j66Zbcja4eZc\", \"Z12e3ajbe4b1j66Zbcja4ja4\", \"Z12e3c79e4b1j66ZbcjaZc36\", \"Z12e3e1ce4b1j66Zbcja64bd\", \n\"Z12e4117e4b1j66Zbcja6Zj1\", \"Z12e41bae4b1j66Zbcja734Z\", \"Z12e4226e4b1j66Zbcja7b13\", \"Z12e4226e4b1j66Zbcja7cbZ\", \"Z12e4ajee4b1j66Zbcjaa916\", \"Z12e4e61e4b1j66Zbcjab1c2\", \"Z12e4e61e4b1j66Zbcjab2da\", \"Z12eZ226e4b1j66ZbcjacZea\", \"Z12e6141e4b1j66Zbcjb19Z9\", \"Z12e6141e4b1j66Zbcjb19jd\", \"Z12e61Z9e4b1j66Zbcjb1acb\", \"Z12e61Z9e4b1j66Zbcjb1acj\", \"Z12j9713e4b1j66Zbcjc34db\", \"Z12j9713e4b1j66Zbcjc3ZZa\", \"Z12j9713e4b1j66Zbcjc3Za7\", \"Z12j9713e4b1j66Zbcjc3Zd2\", \"Z12j9713e4b1j66Zbcjc36c2\", \"Z12j973ce4b1j66Zbcjc396b\"\n)"
[2] "c(\"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \n\"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\", \"Something\")"
[3] "c(61Z7, 674Z, Z462, 692, Z26, 1121, 1213, 1317, 21ZZ, 2Z9Z, 2711, 3612, 3717, 4774, 4Z93, Z117, Z113, Z197, Z77Z, 61Z3, Z16Z, 11771, 12923, 13374, 13Z93, 14277, 1446Z, 1Z3ZZ, 1ZZ16, 1Z993, 164Z2, 16664, 1711Z, 171Z6, 1Z6ZZ, 1Z921, 19211, 193ZZ, 19931, 21117, 21164, 21177, 21371, 21Z61, 21673, 22ZZ7, 23137, 2ZZ44, 26166, 26Z1Z, 173Z6, 17661, 21Z74, 23119, 232ZZ, 249Z3, 2ZZ31, 261Z9, 31211, 33414, 336Z6, 37941, 1743, 1Z61, 216Z, 2171, 1ZZ3, 2119, 21Z4, 2129, 2334, 2ZZZ)"
[4] "c(\"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \n\"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\", \"Booty\")"
[5] "c(Z6, 93Z, 1314, 3, 4, Z, 6, 7, 9, 11, 11, 13, 14, 2Z, 26, 27, 2Z, 29, 33, 34, ZZ, Z3, 122, 12Z, 133, 139, 142, 147, 1Z2, 1Z3, 16Z, 169, 171, 171, 219, 221, 221, 222, 22Z, 226, 244, 246, 247, 24Z, 249, 2637, 264, 2Z9, 292, 296, 49, Z1, 76, 93, 9Z, 112, 111, 114, 1Z7, 211, 214, 263, 6, 7, 11, 11, 11, 11, 12, 13, 14, 1Z)"
[6] "c(3Z11, 3Z11, 3Z11, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, 691Z, Z664, Z664, Z664, Z664, Z664, Z664, Z664, Z664, Z664, Z664, Z664, Z664, 66Z1, 66Z1, 66Z1, 66Z1, 4ZZ4, 4ZZ4, 4ZZ4, 4ZZ4, 4ZZ4, 4ZZ4)"
If this were to show you more then you'd see message bodies for [18].
path
needs to be a two-dimension object (e.g.data.frame
ormatrix
) so you can dopath[,18]
; yourx.path
is not. Just doclass(x.path)
and you should see that. – Fantastgetmsg
started out with the argumentpath
as a string that referred to the file that a single message was stored in, yes? But now the argumentpath
is meant to be what? A single row of your email data frame? The whole data frame? Is it one data frame for each email? I'm afraid you lost me somewhere along the way. – Houseman