development

문자열 벡터 입력을 사용하여 dplyr에서 여러 열로 그룹화

big-blog 2020. 6. 16. 07:52
반응형

문자열 벡터 입력을 사용하여 dplyr에서 여러 열로 그룹화


plyr에 대한 이해를 dplyr으로 옮기려고하지만 여러 열로 그룹화하는 방법을 알 수 없습니다.

# make data with weird column names that can't be hard coded
data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

# plyr - works
ddply(data, columns, summarize, value=mean(value))

# dplyr - raises error
data %.%
  group_by(columns) %.%
  summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds

plyr 예제를 dplyr-esque 구문으로 변환하기 위해 무엇을 놓치고 있습니까?

편집 2017 : Dplyr이 업데이트되었으므로 더 간단한 솔루션을 사용할 수 있습니다. 현재 선택된 답변을 참조하십시오.


이 질문이 게시 된 이후 dplyr은 범위가 지정된 버전 group_by( documentation here )을 추가했습니다. 이를 통해 다음과 같은 기능을 사용할 수 있습니다 select.

data = data.frame(
    asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE 
##  27 

예제 질문의 결과는 예상대로입니다 (위의 plyr와 아래의 결과 비교 참조).

# A tibble: 9 x 3
# Groups:   asihckhdoydkhxiydfgfTgdsx [?]
  asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja       Value
                     <fctr>                    <fctr>       <dbl>
1                         A                         A  0.04095002
2                         A                         B  0.24943935
3                         A                         C -0.25783892
4                         B                         A  0.15161805
5                         B                         B  0.27189974
6                         B                         C  0.20858897
7                         C                         A  0.19502221
8                         C                         B  0.56837548
9                         C                         C -0.22682998

dplyr::summarize한 번에 하나의 그룹화 계층 만 제거하기 때문에 결과로 생성되는 티블에서 그룹화가 계속 진행됩니다 (나중에 서서히 사람들을 잡을 수 있음). 예기치 않은 그룹화 동작으로부터 완전히 안전 %>% ungroup하려면 요약 후 항상 파이프 라인에 추가 할 수 있습니다 .


코드를 완전히 작성하기 위해 Hadley의 답변에 대한 새로운 구문이 업데이트되었습니다.

library(dplyr)

df <-  data.frame(
    asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# Columns you want to group by
grp_cols <- names(df)[-3]

# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)

# Perform frequency counts
df %>%
    group_by_(.dots=dots) %>%
    summarise(n = n())

산출:

Source: local data frame [9 x 3]
Groups: asihckhdoydk

  asihckhdoydk a30mvxigxkgh  n
1            A            A 10
2            A            B 10
3            A            C 13
4            B            A 14
5            B            B 10
6            B            C 12
7            C            A  9
8            C            B 12
9            C            C 10

dplyr에서 이것에 대한 지원은 현재 매우 약합니다. 결국 구문은 다음과 같습니다.

df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))

그러나 아마도 한동안은 없을 것입니다 (모든 결과를 생각해야하기 때문에).

그 동안에는 regroup()기호 목록을 사용 하는을 사용할 수 있습니다 .

library(dplyr)

df <-  data.frame(
  asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

df %.%
  regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
  summarise(n = n())

If you have have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():

vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)

df %.% regroup(vars2) %.% summarise(n = n())

String specification of columns in dplyr are now supported through variants of the dplyr functions with names finishing in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.

The following snippet cleanly solves the problem that @sharoz originally posed (note the need to write out the .dots argument):

# Given data and columns from the OP

data %>%
    group_by_(.dots = columns) %>%
    summarise(Value = mean(value))

(Note that dplyr now uses the %>% operator, and %.% is deprecated).


Until dplyr has full support for string arguments, perhaps this gist is useful:

https://gist.github.com/skranz/9681509

It contains bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc that use string arguments. You can mix them with the normal dplyr functions. For example

cols = c("cyl","gear")
mtcars %.%
  s_group_by(cols) %.%  
  s_summarise("avdisp=mean(disp), max(disp)") %.%
  arrange(avdisp)

It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:

df %.%
    group_by(asdfgfTgdsx, asdfk30v0ja) %.%
    summarise(Value = mean(value))

> df %.%
+   group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+   summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx

  asdfgfTgdsx asdfk30v0ja        Value
1           A           C  0.046538002
2           C           B -0.286359899
3           B           A -0.305159419
4           C           A -0.004741504
5           B           B  0.520126476
6           C           C  0.086805492
7           B           C -0.052613078
8           A           A  0.368410146
9           A           B  0.088462212

where df was your data.

?group_by says:

 ...: variables to group by. All tbls accept variable names, some
      will also accept functons of variables. Duplicated groups
      will be silently dropped.

which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.

@Arun also mentions that you can do:

df %.%
    group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
    summarise(Value = mean(value))

But you can't pass in something that unevaluated is not a name of a variable in the data object.

I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.


data = data.frame(
  my.a = sample(LETTERS[1:3], 100, replace=TRUE),
  my.b = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))

One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:

library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>% 
  # 1. create quantized versions of base variables
  mutate_each(
    funs(Quantized = . > 0)
  ) %>% 
  # 2. group_by the indicator variables
  group_by_(
    .dots = grep("Quantized", names(.), value = TRUE)
    ) %>% 
  # 3. summarize the base variables
  summarize_each(
    funs(sum(., na.rm = TRUE)), contains("X_")
  )

This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.


General example on using the .dots argument as character vector input to the dplyr::group_by function :

iris %>% 
    group_by(.dots ="Species") %>% 
    summarise(meanpetallength = mean(Petal.Length))

Or without a hard coded name for the grouping variable (as asked by the OP):

iris %>% 
    group_by(.dots = names(iris)[5]) %>% 
    summarise_at("Petal.Length", mean)

With the example of the OP:

data %>% 
    group_by(.dots =names(data)[-3]) %>% 
    summarise_at("value", mean)

See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.

참고URL : https://stackoverflow.com/questions/21208801/group-by-multiple-columns-in-dplyr-using-string-vector-input

반응형