How to enter censored data into R's survival model?
I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.

The original subscriptions dataset looks like this..

id  start_date  end_date
1   2013-06-01  2013-08-25
2   2013-06-01  NA
3   2013-08-01  2013-09-12

Which I manipulate to look like this..

id  tenure_in_months status(1=cancelled, 0=active)
1   2                1
2   ?                0
3   1                1

..in order to feed the survival model:

obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)

What should I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today: should it be the tenure up until today, or should it be NA?

Shiver answered 23/9, 2013 at 11:56 Comment(1)
it should be up until the day you collected your data, I guess that's "today".Sachasachem
O
1

If a missing end date means that the subscription is still active, then you need to take the time until the current date as the censoring time.

NA won't work with the survival object. I think those cases will be omitted, which is not what you want, because these cases contain important information about survival.

SQL code to get the time till event (use in SELECT part of query)

DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months

BTW: I would use the difference in days for my analysis. It does not make sense to round the time off to months.
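For readers who would rather do this step in R than in SQL, here is a minimal sketch of the same idea. The collection_date value is an assumption made for illustration (the question does not say when the data was extracted), and the column names tenure_in_days and last_seen are hypothetical.

```r
# Hypothetical collection date; replace with the date your data was extracted.
collection_date <- as.Date("2013-10-01")

subscriptions <- data.frame(
  id         = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date   = as.Date(c("2013-08-25", NA, "2013-09-12"))
)

# status: 1 = cancelled (event observed), 0 = still active (right-censored)
subscriptions$status <- as.integer(!is.na(subscriptions$end_date))

# For active subscriptions, measure tenure up to the collection date
last_seen <- subscriptions$end_date
last_seen[is.na(last_seen)] <- collection_date

# Subtracting two Date vectors gives a difference in days
subscriptions$tenure_in_days <- as.numeric(last_seen - subscriptions$start_date)

print(subscriptions)
```

The resulting tenure_in_days and status columns can then be passed to Surv(time = tenure_in_days, event = status).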

Orpine answered 23/9, 2013 at 12:4 Comment(1)
Code is SQL. I thought you made the query yourself, so you would be able to adjust it.Orpine
K
10

First, I should say that I disagree with the previous answer. For a subscription still active today, the tenure should be treated neither as exactly the tenure up until today nor as NA. What do we know about those subscriptions? We know they have lasted up until today. In other words, although we don't know exactly how long their tenure_in_months will turn out to be, we know it is at least as long as their tenure up to today.

This situation is known as right censoring in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29

So your data would need to be translated from

id  start_date  end_date
1   2013-06-01  2013-08-25
2   2013-06-01  NA
3   2013-08-01  2013-09-12

to:

id  t1   t2    status(3=interval_censored)
1   2    2           3
2   3    NA          3
3   1    1           3

Then you will need to change your R surv object, from:

Surv(time=tenure_in_months, event=status, type="right")

to:

Surv(t1, t2, event=status, type="interval2")

See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of the computational details can be found at: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm

Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
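As a hedged illustration of the interval2 convention quoted above, the sketch below encodes the three subscriptions with NA standing in for infinity. The 4-month tenure for id 2 is an assumption (the observation date is not given in the question), and the call passes only the two time arguments, because Surv() rejects an event argument when type = "interval2".

```r
library(survival)

# Tenures in months; id 2 is assumed to be still active after 4 months.
t1 <- c(2, 4, 1)    # lower bound of each interval
t2 <- c(2, NA, 1)   # upper bound; NA stands for +infinity (right-censored)

# type = "interval2" takes exactly two time arguments and no event argument.
# Rows with t1 == t2 are treated as exact events, rows with t2 = NA as right-censored.
obj <- Surv(time = t1, time2 = t2, type = "interval2")
print(obj)
```

The censored observation is printed with a trailing +, the same way it appears with type = "right".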

Klara answered 23/9, 2013 at 15:44 Comment(3)
Ah, you are right I think. Right censor should mean event time in [time, +inf) but in R, I think, it is just like exact event on time when using type='interval'. Need interval2 then, see edit.Klara
I believe your entry for id 2 is quite wrong. You need to know when the "NA" was measured. The t1 entry should then be that time minus 2013-06-01.Sachasachem
For me, if I specify the argument type="interval2" and event=status the function returns an error. I think the correct syntax in combination with type=interval2 would be to have no event argument.Slipway
S
0

You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.

Otherwise, I believe your encoding of the data is correct. The status of 0 for id 2 indicates that it is right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
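To see what that status of 0 contributes, here is a small sketch that fits a Kaplan-Meier curve to the question's monthly table twice: once with the active subscription kept as right-censored, and once with it dropped. The tenure of 4 months for id 2 is an assumed value, since the collection date is not given.

```r
library(survival)

# Question's monthly table, with id 2's tenure filled in as an assumed 4 months.
subscriptions <- data.frame(
  id               = 1:3,
  tenure_in_months = c(2, 4, 1),
  status           = c(1, 0, 1)   # 1 = cancelled, 0 = still active (right-censored)
)

# Keep the active subscription as a right-censored observation
fit_all <- survfit(Surv(tenure_in_months, status) ~ 1, data = subscriptions)

# Drop the active subscription entirely, for comparison
fit_dropped <- survfit(Surv(tenure_in_months, status) ~ 1,
                       data = subset(subscriptions, status == 1))

print(summary(fit_all))
print(summary(fit_dropped))
```

With the censored row kept, the estimated survival after 2 months is 1/3; with it dropped, the estimate falls to 0, because the model no longer knows that one subscription outlived both cancellations.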

Sachasachem answered 22/7, 2015 at 10:12 Comment(0)
S
0

Your dataset consists of 3 observations and only one of them is right censored (2nd observation). As pointed out by @drevicko, it's unclear until which date this 2nd subject was observed. Let's assume this was until 2013-10-01 i.e. for 4 months without an event taking place.

There are 3 options for encoding data that contains only right censoring using survival::Surv().

library(survival)
dat <- data.frame(start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
          end_date = as.Date(c("2013-08-25", "2013-10-01", "2013-09-12")))
dat$t = as.numeric(difftime(dat$end_date, dat$start_date, units = "days"))/30.5
dat$event <- c(1,0,1)

## Option 1: "right"
Surv(time = dat$t, event = dat$event, type = "right")
#> [1] 2.786885  4.000000+ 1.377049

## Option 2: "interval"
Surv(time = dat$t, time2 = c(NA, NA, NA), event = dat$event, type = "interval")
#> [1] 2.786885  4.000000+ 1.377049

## Option 3: "interval2"
dat$t2 <- dat$t
dat$t2[dat$event == 0] <- Inf
Surv(time = dat$t, time2 = dat$t2, type = "interval2")
#> [1] 2.786885  4.000000+ 1.377049

Created on 2024-07-12 with reprex v2.1.0

I find it useful to have some examples of data and the corresponding argument values encoding the data with survival::Surv(). Censored observations have dashed lines to indicate the range in which the true observation could be.

Slipway answered 12/7 at 8:0 Comment(0)
