Is there an R equivalent of Stata's 'order' command?

'order' in R behaves like 'sort' in Stata, not like Stata's 'order'. Here's an example dataset (only variable names listed):

v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18

and here's the output I expect:

v1 v2 v3 v4 v5 v7 v8 v9 v10 v11 v12 v17 v18 v13 v14 v15 v6 v16

In R, I have 2 ways:

data <- data[,c(1:5,7:12,17:18,13:15,6,16)]

OR

names <- c("v1", "v2", "v3", "v4", "v5", "v7", "v8", "v9", "v10", "v11", "v12",  "v17", "v18", "v13", "v14", "v15", "v6", "v16")
data <- data[names]

To get the same output in Stata, I may run 2 lines:

order v17 v18, before(v13)
order v6 v16, last

In the idealized data above, the variable names reveal their positions. In most real cases, though, variables have names like 'age' or 'gender' that carry no position information, and a dataset may have more than 50 of them. That is where the advantage of 'order' in Stata becomes obvious: we don't need to know the exact position of a variable, we just type its name:

order age, after(gender)

Is there a base function in R to deal with this issue or could I get a package? Thanks in advance.

tweetinfo <- data.frame(uid=1:50, mid=2:51, annotations=3:52, bmiddle_pic=4:53, created_at=5:54, favorited=6:55, geo=7:56, in_reply_to_screen_name=8:57, in_reply_to_status_id=9:58, in_reply_to_user_id=10:59, original_pic=11:60, reTweetId=12:61, reUserId=13:62, source=14:63, thumbnail_pic=15:64, truncated=16:65)
noretweetinfo <- data.frame(uid=21:50, mid=22:51, annotations=23:52, bmiddle_pic=24:53, created_at=25:54, favorited=26:55, geo=27:56, in_reply_to_screen_name=28:57, in_reply_to_status_id=29:58, in_reply_to_user_id=30:59, original_pic=31:60, reTweetId=32:61, reUserId=33:62, source=34:63, thumbnail_pic=35:64, truncated=36:65)
retweetinfo <- data.frame(uid=41:50, mid=42:51, annotations=43:52, bmiddle_pic=44:53, created_at=45:54, deleted=46:55, favorited=47:56, geo=48:57, in_reply_to_screen_name=49:58, in_reply_to_status_id=50:59, in_reply_to_user_id=51:60, original_pic=52:61, source=53:62, thumbnail_pic=54:63, truncated=55:64)
tweetinfo$type <- "ti"
noretweetinfo$type <- "nr"
retweetinfo$type <- "rt"
gtinfo <- rbind(tweetinfo, noretweetinfo)
gtinfo$deleted=""
gtinfo <- gtinfo[,c(1:16,18,17)]
retweetinfo <- transform(retweetinfo, reTweetId="", reUserId="")
retweetinfo <- retweetinfo[,c(1:5,7:12,17:18,13:15,6,16)]
gtinfo <- rbind(gtinfo, retweetinfo)
write.table(gtinfo, file="C:/gtinfo.txt", row.names=F, col.names=T, sep="\t", quote=F)
# rm(list=ls(all=T))
Baldric answered 22/9, 2012 at 14:59 Comment(8)
Why do you want to order columns? Normally one doesn't care about the order of columns (variables) in a data.frame, but only about the order of rows (observations).Darladarlan
...and even the order in the rows is often superfluous, except when observations have a clear order such as in a timeseries.Overarch
I have 3 datasets, 2 of which don't include v6 and the other doesn't include v17 & v18. I want to generate v16 to record the data origins and combine them together. I created the missing variables with null values in each and I want to export the output of rbind() into a txt file with the same variable order with dataset1&2, attaching v6 & v16(the origin) at the end.Baldric
Please ask that as a question with reproducible code. It can be done easily in a much better way.Darladarlan
@Darladarlan I've put the code at the bottom to simulate my situation.Baldric
Please read ?rbind. If the arguments to rbind are data.frames, columns are matched by name and not by position. There is no need to order them.Darladarlan
following up @Roland's comment: that means (I think) that the command retweetinfo <- retweetinfo[,c(1:5,7:12,17:18,13:15,6,16)] is completely unnecessary ...Musician
But I want to export the txt file in the exact order, is it still unnecessary?Baldric
W
3

Because I'm procrastinating and experimenting with different things, here's a function that I whipped up. Ultimately, it depends on append:

moveme <- function(invec, movecommand) {
  movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]], ",|\\s+"), 
                        function(x) x[x != ""])
  movelist <- lapply(movecommand, function(x) {
    Where <- x[which(x %in% c("before", "after", "first", "last")):length(x)]
    ToMove <- setdiff(x, Where)
    list(ToMove, Where)
  })
  myVec <- invec
  for (i in seq_along(movelist)) {
    temp <- setdiff(myVec, movelist[[i]][[1]])
    A <- movelist[[i]][[2]][1]
    if (A %in% c("before", "after")) {
      ba <- movelist[[i]][[2]][2]
      if (A == "before") {
        after <- match(ba, temp)-1
      } else if (A == "after") {
        after <- match(ba, temp)
      }    
    } else if (A == "first") {
      after <- 0
    } else if (A == "last") {
      after <- length(myVec)
    }
    myVec <- append(temp, values = movelist[[i]][[1]], after = after)
  }
  myVec
}

Here's some sample data representing the names of your dataset:

x <- paste0("v", 1:18)

Imagine now that we wanted "v17" and "v18" before "v3", "v6" and "v16" at the end, and "v5" at the beginning:

moveme(x, "v17, v18 before v3; v6, v16 last; v5 first")
#  [1] "v5"  "v1"  "v2"  "v17" "v18" "v3"  "v4"  "v7"  "v8"  "v9"  "v10" "v11" "v12"
# [14] "v13" "v14" "v15" "v6"  "v16"

So, the obvious usage would be, for a data.frame named "df":

df[moveme(names(df), "how you want to move the columns")]

And, for a data.table named "DT" (which, as @mnel points out, would be more memory efficient):

setcolorder(DT, moveme(names(DT), "how you want to move the columns"))

Note that compound moves are specified by semicolons.

The recognized moves are:

  • before (move the specified columns to before another named column)
  • after (move the specified columns to after another named column)
  • first (move the specified columns to the first position)
  • last (move the specified columns to the last position)
Weatherboard answered 24/8, 2013 at 16:29 Comment(0)
B
2

I get your problem. I now have code to offer:

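move <- function(data, variable, before) {
  m <- data[variable]
  r <- data[names(data) != variable]
  # Look up the target among the remaining columns, so the index is not
  # shifted when 'variable' originally sits to the left of 'before'.
  i <- match(before, names(r))
  pre <- r[seq_len(i - 1)]
  post <- r[i:ncol(r)]
  cbind(pre, m, post)
}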

# Example.
library(MASS)
data(painters)
str(painters)

# Move 'Expression' variable before 'Drawing' variable.
new <- move(painters,"Expression","Drawing")
View(new)
Bloodsucker answered 22/9, 2012 at 20:20 Comment(5)
It's a very innovative way of thinking, to divide the data into 3 parts. Right now it may not address multiple variable relocation, but we can go further in this way. Thank you very much.Baldric
Please be aware, that this approach is not efficient and should be avoided for large datasets or within loops.Darladarlan
@Darladarlan The mere principle of having to order variables is inefficient, but I have found it to be like variable names, something that you sometimes need to fix.Bloodsucker
@Baldric You can make the variable parameter of the function a vector of variables, if that's what you need: change the r variable to data[!(names(data) %in% variable)].Bloodsucker
@Bloodsucker No, what I meant is that your function is not efficient. Particularly, splitting data.frames and cbinding are inefficient operations that could be avoided here.Darladarlan
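The two suggestions in the comments above (accept several variables, and avoid splitting and cbind-ing) can be combined by reordering the column names only and indexing the data frame once. A minimal sketch, assuming a hypothetical helper name `move_cols` and a made-up example data frame; it is not the answer author's code:

```r
# Hypothetical helper (not from the answer): move one or more columns
# so that they sit immediately before the column named 'before'.
move_cols <- function(data, variables, before) {
  stopifnot(all(c(variables, before) %in% names(data)),
            !(before %in% variables))
  rest <- setdiff(names(data), variables)   # remaining columns, original order
  i <- match(before, rest)                  # target position among the rest
  new_names <- append(rest, variables, after = i - 1)
  data[new_names]                           # single indexing step, no cbind
}

# Made-up example data
df <- data.frame(id = 1:3, age = c(30, 25, 41),
                 hometown = c("A", "B", "C"), gender = c("f", "m", "f"),
                 stringsAsFactors = FALSE)
moved <- move_cols(df, c("gender", "hometown"), before = "age")
names(moved)
```

Because only a character vector of names is permuted, the data frame itself is subset a single time at the end.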
T
2

You could write your own function that does this.

The following will give you the new order for your column names, using syntax similar to Stata's:

  • where is a named list with 4 possibilities

    • list(last = T)
    • list(first = T)
    • list(before = x) where x is the variable name in question
    • list(after = x) where x is the variable name in question
  • sorted = T will sort var_list lexicographically (a combination of the alphabetic and sequential options of the Stata command)

The function works on the names only (you pass a data.frame object as data) and returns a reordered vector of names.

For example:

stata.order <- function(var_list, where, sorted = F, data) {
    all_names = names(data)
    # are all the variable names in
    check <- var_list %in% all_names
    if (any(!check)) {
        stop("Not all variables in var_list exist within  data")
    }
    if (names(where) == "before") {
        if (!(where %in% all_names)) {
            stop("before variable not in the data set")
        }
    }
    if (names(where) == "after") {
        if (!(where %in% all_names)) {
            stop("after variable not in the data set")
        }
    }

    if (sorted) {
        var_list <- sort(var_list)
    }
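    # match() keeps the order of var_list, so 'sorted' (and the order in
    # which the variables were listed) is actually respected
    where_in <- match(var_list, all_names)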
    full_list <- seq_along(data)
    others <- full_list[-c(where_in)]

    .nwhere <- names(where)
    if (!(.nwhere %in% c("last", "first", "before", "after"))) {
        stop("where must be a list of a named element first, last, before or after")
    }

    do_what <- switch(names(where),
        last = length(others),
        first = 0,
        before = which(all_names[others] == where) - 1,
        after = which(all_names[others] == where))

    new_order <- append(others, where_in, do_what)
    return(all_names[new_order])
}

tmp <- as.data.frame(matrix(1:100, ncol = 10))

stata.order(var_list = c("V2", "V5"), where = list(last = T), data = tmp)

##  [1] "V1"  "V3"  "V4"  "V6"  "V7"  "V8"  "V9"  "V10" "V2"  "V5" 

stata.order(var_list = c("V2", "V5"), where = list(first = T), data = tmp)

##  [1] "V2"  "V5"  "V1"  "V3"  "V4"  "V6"  "V7"  "V8"  "V9"  "V10"

stata.order(var_list = c("V2", "V5"), where = list(before = "V6"), data = tmp)

##  [1] "V1"  "V3"  "V4"  "V2"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"

stata.order(var_list = c("V2", "V5"), where = list(after = "V4"), data = tmp)

##  [1] "V1"  "V3"  "V4"  "V2"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"

# throws an error
stata.order(var_list = c("V2", "V5"), where = list(before = "v11"), data = tmp)

## Error: before variable not in the data set

If you want to do the reordering memory-efficiently (by reference, without copying), use data.table:

library(data.table)
DT <- data.table(tmp)
# sets by reference, no copying
setcolorder(DT, stata.order(var_list = c("V2", "V5"), where = list(after = "V4"), 
    data = DT))

DT

##     V1 V3 V4 V2 V5 V6 V7 V8 V9 V10
##  1:  1 21 31 11 41 51 61 71 81  91
##  2:  2 22 32 12 42 52 62 72 82  92
##  3:  3 23 33 13 43 53 63 73 83  93
##  4:  4 24 34 14 44 54 64 74 84  94
##  5:  5 25 35 15 45 55 65 75 85  95
##  6:  6 26 36 16 46 56 66 76 86  96
##  7:  7 27 37 17 47 57 67 77 87  97
##  8:  8 28 38 18 48 58 68 78 88  98
##  9:  9 29 39 19 49 59 69 79 89  99
## 10: 10 30 40 20 50 60 70 80 90 100
Taligrade answered 2/10, 2012 at 4:55 Comment(0)
D
1

The dplyr function relocate(), a verb introduced in dplyr 1.0.0, does exactly what you are looking for.

library(dplyr)

data %>% relocate(v17, v18, .before = v13)

data %>% relocate(v6, v16, .after = last_col())

data %>% relocate(age, .after = gender)
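As a sketch of how the two Stata commands from the question could be chained, assuming dplyr 1.0.0 or later is installed and using a made-up one-row data frame with columns v1 to v18:

```r
library(dplyr)

# Made-up data: one row, columns v1 ... v18
data <- as.data.frame(matrix(1:18, nrow = 1,
                             dimnames = list(NULL, paste0("v", 1:18))))

reordered <- data %>%
  relocate(v17, v18, .before = v13) %>%   # order v17 v18, before(v13)
  relocate(v6, v16, .after = last_col())  # order v6 v16, last

names(reordered)
```

The resulting names should match the expected order given in the question.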

Doy answered 20/4, 2020 at 7:0 Comment(0)
R
0

It is unclear what you would like to do, but your first sentence makes me assume you would like to sort the dataset.

Actually, there is a built-in order function, which returns the indices of the ordered sequence. Is this what you are looking for?

> x <- c(3,2,1)

> order(x)
[1] 3 2 1

> x[order(x)]
[1] 1 2 3
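For comparison, this is the sense in which R's order() corresponds to Stata's sort: it reorders observations (rows), not variables (columns). A small sketch on made-up data; the column names are illustrative only:

```r
# Made-up data frame
df <- data.frame(name = c("a", "b", "c"), age = c(30, 25, 41),
                 stringsAsFactors = FALSE)

# Sort rows by age, comparable to 'sort age' in Stata
sorted <- df[order(df$age), ]
sorted$name
```

The column order of `sorted` is unchanged; only the rows move.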
Reinertson answered 22/9, 2012 at 18:34 Comment(1)
Sorting the data is the last thing I want to do. "order" in Stata means something else, as anyone who has used it would understand.Baldric
D
0

This should give you the same file:

#snip
gtinfo <- rbind(tweetinfo, noretweetinfo)
gtinfo$deleted=""
retweetinfo <- transform(retweetinfo, reTweetId="", reUserId="")
gtinfo <- rbind(gtinfo, retweetinfo)
gtinfo <-gtinfo[,c(1:16,18,17)]
#snip

It is possible to implement a function like Stata's order command in R, but I don't think there is much demand for that.
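The reason no reordering is needed before rbind() is that, for data frames, columns are matched by name rather than by position (as noted in the comments on the question). A minimal sketch with made-up data frames:

```r
# Two made-up data frames with the same columns in a different order
a <- data.frame(x = 1:2, y = c("p", "q"), stringsAsFactors = FALSE)
b <- data.frame(y = c("r", "s"), x = 3:4, stringsAsFactors = FALSE)

combined <- rbind(a, b)   # columns of 'b' are matched to 'a' by name
names(combined)           # column order follows the first argument
combined$x
```

So a single reordering step on the final result is enough when a specific column order is wanted in the output file.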

Darladarlan answered 23/9, 2012 at 9:48 Comment(7)
Em, it's not a big problem for all and people who're interested may look into it.Baldric
@Baldric My point is that you were only interested in it, because you are still new to R and coming from Stata. I showed in my answer that you don't need to clutter your code with ordering. In fact, you only need to order once and that only because you want a specific order in your output file.Darladarlan
you're right, gtinfo <-gtinfo[,c(1:16,18,17)] at last is better than what I did with 2 lines like c(1:5,7:12,17:18,13:15,6,16). But you can't deny that there's no such base function in R to adjust the column order. I can't rbind it and tell my boss "See, the software orders it automatically and you'd better get used to it".Baldric
I do not understand. You can order it using base functionality as shown above. If you do not want to work with indices you can use column names too, possibly using subset.Darladarlan
data <- data[,c("A", "C", "B")] OR data <- data[,c(1,3,2)] OR data <- subset(data, select=c(1,3,2)), each of them could work indeed. But what if I get 50 or more columns? I have to type all the col.names or find the column number of the object and the destination, by hand.Baldric
If you need to order 50 or more columns you are doing something very strange. But you are free to define any function that you think you need. :)Darladarlan
...I have 50 columns and I want to move 'age' 'gender' after 'hometown'. Before the data<- data[,c(, *:, **)] stuff, I have 3 lines to run: which( colnames(data)=="age") // which( colnames(data)=="gender") // which(colnames(data)=="hometown"). I don't see the efficiency of R here.Baldric
