# Select rows with min value by group

Score: 32

Using DWin's solution, `tapply` can be avoided by using `ave`.

``````
df[ df$v1 == ave(df$v1, df$f, FUN=min), ]
``````
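
To see why this works: `ave` returns a vector as long as its input, with each element replaced by the result for its group. A minimal sketch on a small data frame (the data are an assumption, made up to reproduce the outputs shown elsewhere on this page):

```r
# Hypothetical example data (not taken from the question)
df <- data.frame(
  f  = c("a", "a", "b", "b", "c", "c", "c", "d", "d"),
  v1 = c(1.3, 10, 2, 10, 10, 1.1, 10, 3.1, 10),
  v2 = 1:9
)

# ave() broadcasts each group's minimum back to every row of that group
group_min <- ave(df$v1, df$f, FUN = min)

# Keep the rows whose value equals their own group's minimum
res <- df[df$v1 == group_min, ]
print(res)
```

Because the comparison is made row by row against the row's own group, all tied minima within a group are kept.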

This gives another speed-up, as shown below. Mind you, this also depends on the number of levels. I mention it because I notice that `ave` is far too often forgotten, although it is one of the more powerful functions in R.

``````
f <- rep(letters[1:20],10000)
v1 <- rnorm(20*10000)
v2 <- 1:(20*10000)
df <- data.frame(f,v1,v2)

> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
   user  system elapsed
   0.05    0.00    0.05

> system.time(df[ df$v1 %in% tapply(df$v1, df$f, min), ])
   user  system elapsed
   0.25    0.03    0.29

> system.time(lapply(split(df, df$f), FUN = function(x) {
+             vec <- which(x[3] == min(x[3]))
+             return(x[vec, ])
+         })
+  .... [TRUNCATED]
   user  system elapsed
   0.56    0.00    0.58

> system.time(df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
+ )
   user  system elapsed
   0.17    0.00    0.19

> system.time( ddply(df, .var = "f", .fun = function(x) {
+     return(subset(x, v1 %in% min(v1)))
+     }
+ )
+ )
   user  system elapsed
   0.28    0.00    0.28
``````

Score: 15

A `data.table` solution.

``````
library(data.table)
DT <- as.data.table(df)
DT[,.SD[which.min(v1)], by = f]

##   f  v1 v2
## 1: a 1.3  1
## 2: b 2.0  3
## 3: c 1.1  6
## 4: d 3.1  8
``````

Or, more efficiently

``````
DT[DT[, .I[which.min(v1)], by = f][['V1']]]
``````
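
The reason this is cheaper: `.I` holds the row numbers of `DT` for the current group, so the inner call only collects one integer per group and the full table is subset once. A sketch that spells out the two steps, using made-up data (an assumption, not from the question):

```r
library(data.table)

# Hypothetical example data
DT <- data.table(
  f  = c("a", "a", "b", "b", "c", "c", "c", "d", "d"),
  v1 = c(1.3, 10, 2, 10, 10, 1.1, 10, 3.1, 10),
  v2 = 1:9
)

# Step 1: for each group, the row number (within DT) of the minimum v1.
# The unnamed result column is called V1 by default.
idx <- DT[, .I[which.min(v1)], by = f]$V1

# Step 2: a single subset of the full table by those row numbers
res <- DT[idx]
print(res)
```

Since `which.min` returns a single position, this keeps one row per group even when the minimum is tied.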

# some benchmarking

``````
f <- rep(letters[1:20],100000)
v1 <- rnorm(20*100000)
v2 <- 1:(20*100000)
df <- data.frame(f,v1,v2)
DT <- as.data.table(df)

f1 <- function(){df2 <- df[order(df$f,df$v1),]
df2[!duplicated(df2$f),]}

f2 <- function(){df2 <- df[order(df$v1),]
df2[!duplicated(df2$f),]}

f3 <- function(){df[ df$v1 == ave(df$v1, df$f, FUN=min), ]}

f4 <- function(){DT[,.SD[which.min(v1)], by = f]}

f5 <- function(){DT[DT[,.I[which.min(v1)],by=f][['V1']]]}

library(microbenchmark)
microbenchmark(f1(),f2(),f3(),f4(), f5(),times = 5)
# Unit: milliseconds
# expr       min        lq    median        uq       max neval
# f1() 3254.6620 3265.4760 3286.5440 3411.4054 3475.4198     5
# f2() 1630.8572 1639.3472 1651.5422 1721.4670 1738.6684     5
# f3()  172.2639  174.0448  177.4985  179.9604  184.7365     5
# f4()  206.1837  209.8161  209.8584  210.4896  210.7893     5
# f5()  105.5960  106.5006  107.9486  109.7216  111.1286     5
``````

The `.I` approach is the winner (FR #2330 will hopefully make the more elegant `.SD` approach similarly fast once implemented).

Score: 8

With `plyr`, I'd use:

``````
library(plyr)

ddply(df, .variables = "f", .fun = function(x) {
  # keep the rows whose v1 equals the group minimum
  subset(x, v1 %in% min(v1))
})
``````

Give that a try and see if it returns what you want.

Score: 6

Another `tapply` solution, with no unnecessary scanning of the vector with `%in%`:

``````
df[tapply(1:nrow(df), df$f, function(i) i[which.min(df$v1[i])]), ]
``````

EDIT: This keeps only the first row in case of a tie.

EDIT2: Impressed by `ave`, I've made additional improvements:

``````
df[sapply(split(1:nrow(df), df$f), function(x) x[which.min(df$v1[x])]), ]
``````

On my machine (using Joris' benchmark data):

``````
> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
   user  system elapsed
  0.022   0.000   0.021
> system.time(df[sapply(split(1:nrow(df),df$f),function(x) x[which.min(df$v1[x])]),])
   user  system elapsed
  0.006   0.000   0.007
``````
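
To make the tie behaviour mentioned in the first EDIT concrete, here is a small sketch with made-up data (an assumption, not from the question) in which group `a` has two rows tied for the minimum:

```r
# Hypothetical data with a tie for the minimum in group "a"
df <- data.frame(
  f  = c("a", "a", "a", "b", "b"),
  v1 = c(1, 1, 5, 3, 2),
  v2 = 1:5
)

# which.min() returns only the first position of the minimum,
# so exactly one row per group is selected
first_only <- df[sapply(split(seq_len(nrow(df)), df$f),
                        function(i) i[which.min(df$v1[i])]), ]

# Comparing against the group minimum keeps every tied row
all_ties <- df[df$v1 == ave(df$v1, df$f, FUN = min), ]

print(first_only)
print(all_ties)
```

Which of the two is right depends on whether you want one row per group or every row that attains the minimum.
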
Score: 5

This is the dplyr way to filter for the minimum `v1` values by groups of `f`:

``````
library(dplyr)

df %>%
  group_by(f) %>%
  filter(v1 == min(v1))

#Source: local data frame [4 x 3]
#Groups: f
#
#  f  v1 v2
#1 a 1.3  1
#2 b 2.0  3
#3 c 1.1  6
#4 d 3.1  8
``````

In case of ties in `v1`, this would result in multiple rows per group of `f`. If you want to avoid that, you can use:

``````
df %>%
  group_by(f) %>%
  filter(rank(v1, ties.method = "first") == 1)
``````

This way, you'll only get the first row in case of ties. You could alternatively use `ties.method = "random"` or others as described in the help file.
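
A sketch of the difference on made-up data with a tie in group `a`. The last variant uses `slice_min()`, which is not part of the original answer and assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 tie for the minimum of group "a"
df <- data.frame(
  f  = c("a", "a", "a", "b", "b"),
  v1 = c(1, 1, 5, 3, 2),
  v2 = 1:5
)

# Keeps both tied rows in group "a"
all_ties <- df %>%
  group_by(f) %>%
  filter(v1 == min(v1)) %>%
  ungroup()

# Keeps only the first of the tied rows
first_only <- df %>%
  group_by(f) %>%
  filter(rank(v1, ties.method = "first") == 1) %>%
  ungroup()

# slice_min() exposes the same choice through its with_ties argument
also_one_per_group <- df %>%
  group_by(f) %>%
  slice_min(v1, n = 1, with_ties = FALSE) %>%
  ungroup()
```

The `ungroup()` calls are only there so that the results are plain tibbles; drop them if you want to keep working by group.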

Score: 4

Here's a `tapply` solution:

``````
> df[ df$v1 %in% tapply(df$v1, df$f, min), ]

  f  v1 v2
1 a 1.3  1
3 b 2.0  3
6 c 1.1  6
8 d 3.1  8
``````

In your example it only picks out one per group, but if there were ties this method would show them all. (As would Parker's and Luštrik's, I suspect.)
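
One caveat worth checking on your own data: `%in%` tests each value against the minima of all groups at once, so a non-minimal value that happens to equal another group's minimum is selected too. A small made-up example (an assumption, not from the question):

```r
# Hypothetical data: the value 2 is non-minimal in group "a"
# but is the minimum of group "b"
df <- data.frame(
  f  = c("a", "a", "b", "b"),
  v1 = c(1, 2, 2, 3),
  v2 = 1:4
)

mins <- tapply(df$v1, df$f, min)   # a = 1, b = 2

# Row 2 (group "a", v1 = 2) is selected because 2 is group "b"'s minimum
with_in <- df[df$v1 %in% mins, ]

# Comparing each row with its own group's minimum avoids this
per_group <- df[df$v1 == ave(df$v1, df$f, FUN = min), ]

print(with_in)
print(per_group)
```

With continuous data such as `rnorm` draws, such collisions are unlikely, but with integer or rounded values they can easily occur.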

Score: 2

I'm sorry, my thinking power is depleted, and this ugly solution is all I can come up with at almost 1 am.

``````
lapply(split(df, df$f), FUN = function(x) {
  vec <- which(x[3] == min(x[3]))
  return(x[vec, ])
})
``````
Score: 2

Another way is to use `order` and `!duplicated`, but you would only get the first row on ties.

``````
df2 <- df[order(df$f, df$v1), ]
df2[!duplicated(df2$f), ]

  f  v1 v2
1 a 1.3  1
3 b 2.0  3
6 c 1.1  6
8 d 3.1  8
``````

Timings

``````
f1 <- function(){df2 <- df[order(df$f,df$v1),]
df2[!duplicated(df2$f),]}

f2 <- function(){df2 <- df[order(df$v1),]
df2[!duplicated(df2$f),]}

f3 <- function(){df[ df$v1 == ave(df$v1, df$f, FUN=min), ]}

library(rbenchmark)
> benchmark(f1(),f2(),f3())
  test replications elapsed relative user.self sys.self user.child sys.child
1 f1()          100   38.16 7.040590     36.66     1.48         NA        NA
2 f2()          100   20.54 3.789668     19.30     1.23         NA        NA
3 f3()          100    5.42 1.000000      4.96     0.46         NA        NA
``````
Score: 2

Here is a solution with `by`

``````
do.call(rbind, unname(by(df, df$f, function(x) x[x$v1 == min(x$v1), ])))
##   f  v1 v2
## 1 a 1.3  1
## 3 b 2.0  3
## 6 c 1.1  6
## 8 d 3.1  8
``````

Score: 0

Using tidyverse

``````
df %>%
  arrange(v1) %>% # You can also do arrange(f, v1)
  distinct(f, .keep_all = TRUE)
``````

I also like the previous answer from @talat

``````
df %>%
  group_by(f) %>%
  filter(v1 == min(v1))
``````

but the first one avoids grouping and ungrouping.
