creating a transitions matrix for markov Model

Asked 15/12, 2017 at 18:1 Answered 15/12, 2017 at 19:59

I need help on a topic related to markov chains and preprocessing of data. Suppose I have the following matrix relating individuals to states over time:

     ID Time1 Time2
1 14021     A     A
2 15031     B     A
3 16452     A     C

I would like to obtain, for this matrix, the state transition matrix: Hence, what is required is

and the same thing, but now weighted by the toal number of transitions from that state, i.e,

  A    B   C
A 0.5  0  0.5
B 1    0   0
C 0    0   0

(as there are two transitions leaving from state A). I know that the markovchain package has a functionality for doing this if one has a sequence, say AAABBAAABBCC, but not if data is set up like I have. Ideally a direct procedure would be great, but if there is some way of turning the data into a set of sequences that would work as well.

Atelectasis answered 15/12, 2017 at 18:1 Comment(3)

if you can make this c("B","A","A","C") out of your data.frame then you can use markovchain::createSequenceMatrix. – Modernism 15/12, 2017 at 18:44

Thanks. But not really sure how to do the sequence... – Atelectasis 15/12, 2017 at 19:53

Me neither. That's why I went the long way. – Modernism 15/12, 2017 at 19:53

An igraph approach, so using df from Joseph's answer:

library(igraph)

g <- graph_from_data_frame(df)

E(g)$weight = 1/degree(g, mode="out")[df$Time1] # get counts

as_adj(g, attr = "weight", sparse=FALSE) # output weighted adjacency matrix

    A B   C
A 0.5 0 0.5
B 1.0 0 0.0
C 0.0 0 0.0

Hypotrachelium answered 15/12, 2017 at 19:59 Comment(4)

@Atelectasis This is by far better answer. If I were you, I would accept this. – Modernism 15/12, 2017 at 20:4

@Arrebimbomalho, completely agree with Masoud! – Interpellant 15/12, 2017 at 20:12

@JosephWood I hope you understand that this is nothing personal. Just thought this answer is more elegant (of course, firstly, comparing to my answer). Cheers. – Modernism 15/12, 2017 at 20:15

@Masoud, not at all... I really meant that I agree that this a far better answer. I truly appreciate the attitude of putting the community first. – Interpellant 15/12, 2017 at 20:19

Here is another base R solution.

df <- data.frame(Time1 = c("A","B","A"), Time2 = c("A","A","C"), stringsAsFactors = FALSE)

myStates <- sort(unique(c(df$Time1, df$Time2)))
lenSt <- length(myStates)

currState <- match(df$Time1, myStates)
nextState <- match(df$Time2, myStates)
transMat <- matrix(0L, lenSt, lenSt)

transMat[cbind(currState, nextState)] <- 1L
transMat <- transMat/rowSums(transMat)
transMat[is.na(transMat)] <- 0

transMat
     [,1] [,2] [,3]
[1,]  0.5    0  0.5
[2,]  1.0    0  0.0
[3,]  0.0    0  0.0

Interpellant answered 15/12, 2017 at 18:32 Comment(2)

Thanks! The dataset I need to convert has literally millions of observations, hundreds of states and dozens of periods...probably not practical :( do you know of any automatic way of turning the dataframe into the df you have at the start of your code? that would be great – Atelectasis 15/12, 2017 at 19:54

@Arrebimbomalho, you could subset your data.frame to only include the columns labeled Time. I simply included the creation of the df, because you didn't provide it in your question. In the future, you could dput your data structures to help speed up the analysis for people trying to help. Other than subsetting, I'm not really sure I could help much further without seeing some more data. – Interpellant 15/12, 2017 at 20:8

An igraph approach, so using df from Joseph's answer:

library(igraph)

g <- graph_from_data_frame(df)

E(g)$weight = 1/degree(g, mode="out")[df$Time1] # get counts

as_adj(g, attr = "weight", sparse=FALSE) # output weighted adjacency matrix

    A B   C
A 0.5 0 0.5
B 1.0 0 0.0
C 0.0 0 0.0

Hypotrachelium answered 15/12, 2017 at 19:59 Comment(4)

@Atelectasis This is by far better answer. If I were you, I would accept this. – Modernism 15/12, 2017 at 20:4

@Arrebimbomalho, completely agree with Masoud! – Interpellant 15/12, 2017 at 20:12

@JosephWood I hope you understand that this is nothing personal. Just thought this answer is more elegant (of course, firstly, comparing to my answer). Cheers. – Modernism 15/12, 2017 at 20:15

@Masoud, not at all... I really meant that I agree that this a far better answer. I truly appreciate the attitude of putting the community first. – Interpellant 15/12, 2017 at 20:19

Definitely there's a better way. This is me doodling with loops on a lame Friday afternoon.

lvls <- sort(unique(unlist(df[,-1])))

dat <- matrix(0, nrow= length(lvls), ncol= length(lvls))

colnames(dat) <- lvls
rownames(dat) <- lvls

concat <- paste0(df[,2], df[,3])

for (i in 1:length(lvls)) {
  for (j in 1:length(lvls)) {
    dat[i,j] <- paste0(rownames(dat)[i], colnames(dat)[j])
  }
}

dat <- matrix(sapply(dat, function(x) length(grep(x, concat))), 
       nrow= length(lvls), ncol= length(lvls))

colnames(dat) <- lvls
rownames(dat) <- lvls

dat

##   A B C
## A 1 0 1
## B 1 0 0
## C 0 0 0

dat <- dat / rowSums(dat)
dat[is.na(dat)] <- 0

dat

##    A B   C
##A 0.5 0 0.5
##B 1.0 0 0.0
##C 0.0 0 0.0

Modernism answered 15/12, 2017 at 18:27 Comment(5)

Great! also, it's scalable, so it may be able to handle the job (the real matrix I need has millions of observations...) – Atelectasis 15/12, 2017 at 19:55

@Atelectasis those two loops that I have in the middle, would not that fast, but with let's say (1M rows * 100 columns) it won't get more than minutes to complete I guess. – Modernism 15/12, 2017 at 19:57

seems the first bit could be done with table ie df[c("Time1", "Time2")] <- lapply(df[c("Time1", "Time2")], factor, levels=c("A", "B", "C")) ; table(df) – Hypotrachelium 15/12, 2017 at 19:59

@Hypotrachelium attempt to make a table with >= 2^31 elements That would happen for large datasets tho. Right? – Modernism 15/12, 2017 at 20:1

Well only use table on the two columns - outout size will depend on the number of unique levels. – Hypotrachelium 15/12, 2017 at 20:2

Recommended topics

Hot tags