Efficient way to find manager's manager's id
Asked Answered
C

1

6

I have a database of employees, with their manager's id, in long format (one row per employee per month). I would like to add a column that contains their manager's manager's id (or the id of their skip level manager).

Here is a toy dataset:

id <- c(seq.int(1,11), seq.int(2,12))
mgr_id <- as.integer(c(NA, 1, 1, 2, 2, 2, 2, 3, 3, 5, 5,   #period 1
                      NA, 2, 5, 2, 5, 5, 3, 3, 5, 10, 10)) #period 2
period <- c(rep(1, 11), rep(2, 11))
left_company <- c(1, rep(0, 21))
joined_company <- c(rep(0, 21), 1) 

df <- data.frame(id, mgr_id, period, left_company, joined_company)

And here is a function I wrote that returns the expected results.

# finds the employee's manager in the correct period, and returns that manager's id

    get_mgr_mgr_id <- function(manager_id, period){
      mgr_mgr_id <- df$mgr_id[df$id == manager_id & df$period == period] 
      return(mgr_mgr_id[1])
    }

When I use the function with mapply, all is well. Note that employee 1 left the company, and they were replaced by employee 5, who was replaced by employee 10, who was replaced by employee 12, a new hire.

df$mgr_mgr_id <- mapply(get_mgr_mgr_id, df$mgr_id, df$period)

df
   id mgr_id period left joined mgr_mgr_id
1   1     NA      1    1      0         NA
2   2      1      1    0      0         NA
3   3      1      1    0      0         NA
4   4      2      1    0      0          1
5   5      2      1    0      0          1
6   6      2      1    0      0          1
7   7      2      1    0      0          1
8   8      3      1    0      0          1
9   9      3      1    0      0          1
10 10      5      1    0      0          2
11 11      5      1    0      0          2
12  2     NA      2    0      0         NA
13  3      2      2    0      0         NA
14  4      5      2    0      0          2
15  5      2      2    0      0         NA
16  6      5      2    0      0          2
17  7      5      2    0      0          2
18  8      3      2    0      0          2
19  9      3      2    0      0          2
20 10      5      2    0      0          2
21 11     10      2    0      0          5
22 12     10      2    0      1          5

My question: is there is a more efficient way to get this result? Currently it takes quite a long time to run even on 10,000 rows, and my dataset has closer to a million.

I'm also open to suggestions on a more general question title (possibly a version of this SQL question: Most efficient way to find something recursively in a table?)

Crapulent answered 29/8, 2016 at 16:16 Comment(0)
D
5

You can run a join with data.table. I'm not sure how much faster it will be:

library(data.table)
setDT(df)

df[, m2id := df[.(id = mgr_id, period = period), on=c("id", "period"), mgr_id]]

    id mgr_id period left_company joined_company m2id
 1:  1     NA      1            1              0   NA
 2:  2      1      1            0              0   NA
 3:  3      1      1            0              0   NA
 4:  4      2      1            0              0    1
 5:  5      2      1            0              0    1
 6:  6      2      1            0              0    1
 7:  7      2      1            0              0    1
 8:  8      3      1            0              0    1
 9:  9      3      1            0              0    1
10: 10      5      1            0              0    2
11: 11      5      1            0              0    2
12:  2     NA      2            0              0   NA
13:  3      2      2            0              0   NA
14:  4      5      2            0              0    2
15:  5      2      2            0              0   NA
16:  6      5      2            0              0    2
17:  7      5      2            0              0    2
18:  8      3      2            0              0    2
19:  9      3      2            0              0    2
20: 10      5      2            0              0    2
21: 11     10      2            0              0    5
22: 12     10      2            0              1    5
    id mgr_id period left_company joined_company m2id

How it works

The syntax for a join is x[i, on=, j]. It uses i and on to subset x and then returns j. The key point here is setting id = mgr_id in i so we're subsetting to the manager's rows.

The syntax for assigning a column is DT[, col_name := value]. In this case, the value comes from the join explained in the last paragraph.

Disabuse answered 29/8, 2016 at 16:49 Comment(9)
I edited my example so that mgr_id is declared as an integer.Crapulent
Using microbenchmark, on my machine it seems that the time taken for both methods (at least on this small dataset) are comparable. Perhaps OP can let us know how they compare on, say, 10000 rows?Casar
@WeihuangWong Good to know, thanks. I'm assuming you're testing on a larger example (more periods, maybe more employees). Oh never mind, just saw your edit.Disabuse
OP's solution is performing a vector scan on entire df for each row. This should obviously scale well. The fact the it also updates the original table (instead of returning an entirely new table after joining should make things only better on larger tables).Moreen
My solution took > 5 minutes for 10000 rows. @Disabuse 's answer took seconds for ~1,000,000. So +1, but for some (unrelated?) reason, the output doesn't end up being right all the time on my actual data. I'm looking into why that might be.Crapulent
I do get the following Warning message: In [.data.table(df, , :=(m2id, df[.(id = mgr_id, : Supplied 985493 items to be assigned to 985038 items of column 'm2id' (455 unused)Crapulent
@Crapulent I think that means that you have folks appearing multiple times in the same period. To investigate that possibility, you could do df[, .N, by=.(id, period)][N > 1L]. For more on this: stackoverflow.com/documentation/data.table/3785/…Disabuse
@Disabuse Thank you, you are correct. I modified your code slightly to get it to work: df[, .N, by=.(id, period)][N > 1L] (removed . in 2nd .N). Once I fix this and verify everything is working I will accept your answer.Crapulent
@Crapulent If you have legitimate reasons for having multiple records for folks and know that you want the top or bottom match, you can use mult="first" or "last" inside the join, documented in ?data.tableDisabuse

© 2022 - 2024 — McMap. All rights reserved.