creating a transitions matrix for markov Model
Asked Answered
A

3

5

I need help on a topic related to markov chains and preprocessing of data. Suppose I have the following matrix relating individuals to states over time:

     ID Time1 Time2
1 14021     A     A
2 15031     B     A
3 16452     A     C

I would like to obtain, for this matrix, the state transition matrix: Hence, what is required is

  A  B  C
A 1  0  1
B 1  0  0
C 0  0  0

and the same thing, but now weighted by the toal number of transitions from that state, i.e,

  A    B   C
A 0.5  0  0.5
B 1    0   0
C 0    0   0

(as there are two transitions leaving from state A). I know that the markovchain package has a functionality for doing this if one has a sequence, say AAABBAAABBCC, but not if data is set up like I have. Ideally a direct procedure would be great, but if there is some way of turning the data into a set of sequences that would work as well.

Atelectasis answered 15/12, 2017 at 18:1 Comment(3)
if you can make this c("B","A","A","C") out of your data.frame then you can use markovchain::createSequenceMatrix.Modernism
Thanks. But not really sure how to do the sequence...Atelectasis
Me neither. That's why I went the long way.Modernism
H
3

An igraph approach, so using df from Joseph's answer:

library(igraph)

g <- graph_from_data_frame(df)

E(g)$weight = 1/degree(g, mode="out")[df$Time1] # get counts

as_adj(g, attr = "weight", sparse=FALSE) # output weighted adjacency matrix

    A B   C
A 0.5 0 0.5
B 1.0 0 0.0
C 0.0 0 0.0
Hypotrachelium answered 15/12, 2017 at 19:59 Comment(4)
@Atelectasis This is by far better answer. If I were you, I would accept this.Modernism
@Arrebimbomalho, completely agree with Masoud!Interpellant
@JosephWood I hope you understand that this is nothing personal. Just thought this answer is more elegant (of course, firstly, comparing to my answer). Cheers.Modernism
@Masoud, not at all... I really meant that I agree that this a far better answer. I truly appreciate the attitude of putting the community first.Interpellant
I
3

Here is another base R solution.

df <- data.frame(Time1 = c("A","B","A"), Time2 = c("A","A","C"), stringsAsFactors = FALSE)

myStates <- sort(unique(c(df$Time1, df$Time2)))
lenSt <- length(myStates)

currState <- match(df$Time1, myStates)
nextState <- match(df$Time2, myStates)
transMat <- matrix(0L, lenSt, lenSt)

transMat[cbind(currState, nextState)] <- 1L
transMat <- transMat/rowSums(transMat)
transMat[is.na(transMat)] <- 0

transMat
     [,1] [,2] [,3]
[1,]  0.5    0  0.5
[2,]  1.0    0  0.0
[3,]  0.0    0  0.0
Interpellant answered 15/12, 2017 at 18:32 Comment(2)
Thanks! The dataset I need to convert has literally millions of observations, hundreds of states and dozens of periods...probably not practical :( do you know of any automatic way of turning the dataframe into the df you have at the start of your code? that would be greatAtelectasis
@Arrebimbomalho, you could subset your data.frame to only include the columns labeled Time. I simply included the creation of the df, because you didn't provide it in your question. In the future, you could dput your data structures to help speed up the analysis for people trying to help. Other than subsetting, I'm not really sure I could help much further without seeing some more data.Interpellant
H
3

An igraph approach, so using df from Joseph's answer:

library(igraph)

g <- graph_from_data_frame(df)

E(g)$weight = 1/degree(g, mode="out")[df$Time1] # get counts

as_adj(g, attr = "weight", sparse=FALSE) # output weighted adjacency matrix

    A B   C
A 0.5 0 0.5
B 1.0 0 0.0
C 0.0 0 0.0
Hypotrachelium answered 15/12, 2017 at 19:59 Comment(4)
@Atelectasis This is by far better answer. If I were you, I would accept this.Modernism
@Arrebimbomalho, completely agree with Masoud!Interpellant
@JosephWood I hope you understand that this is nothing personal. Just thought this answer is more elegant (of course, firstly, comparing to my answer). Cheers.Modernism
@Masoud, not at all... I really meant that I agree that this a far better answer. I truly appreciate the attitude of putting the community first.Interpellant
M
2

Definitely there's a better way. This is me doodling with loops on a lame Friday afternoon.

lvls <- sort(unique(unlist(df[,-1])))

dat <- matrix(0, nrow= length(lvls), ncol= length(lvls))

colnames(dat) <- lvls
rownames(dat) <- lvls

concat <- paste0(df[,2], df[,3])

for (i in 1:length(lvls)) {
  for (j in 1:length(lvls)) {
    dat[i,j] <- paste0(rownames(dat)[i], colnames(dat)[j])
  }
}

dat <- matrix(sapply(dat, function(x) length(grep(x, concat))), 
       nrow= length(lvls), ncol= length(lvls))

colnames(dat) <- lvls
rownames(dat) <- lvls

dat

##   A B C
## A 1 0 1
## B 1 0 0
## C 0 0 0

dat <- dat / rowSums(dat)
dat[is.na(dat)] <- 0

dat

##    A B   C
##A 0.5 0 0.5
##B 1.0 0 0.0
##C 0.0 0 0.0
Modernism answered 15/12, 2017 at 18:27 Comment(5)
Great! also, it's scalable, so it may be able to handle the job (the real matrix I need has millions of observations...)Atelectasis
@Atelectasis those two loops that I have in the middle, would not that fast, but with let's say (1M rows * 100 columns) it won't get more than minutes to complete I guess.Modernism
seems the first bit could be done with table ie df[c("Time1", "Time2")] <- lapply(df[c("Time1", "Time2")], factor, levels=c("A", "B", "C")) ; table(df)Hypotrachelium
@Hypotrachelium attempt to make a table with >= 2^31 elements That would happen for large datasets tho. Right?Modernism
Well only use table on the two columns - outout size will depend on the number of unique levels.Hypotrachelium

© 2022 - 2024 — McMap. All rights reserved.