Is R's apply family more than syntactic sugar?


big-blog 2020. 6. 15. 07:53

... with regard to execution time and/or memory.

If this is not the case, please prove it with a code snippet. Note that speedup from vectorization does not count. The speedup must come from apply (tapply, sapply, ...) itself.

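The apply functions in R don't provide improved performance over other looping constructs (e.g. for). One exception to this is lapply, which can be a little faster because it does more of its work in C code than in R (see this question for an example).

But in general, the rule is that you should use an apply function for clarity, not for performance.

I would add to this that apply functions have no side effects, which is an important distinction when it comes to functional programming with R. This can be overridden by using assign or <<-, but that can be very dangerous. Side effects also make a program harder to understand, since a variable's state depends on the history.

Edit:

Just to emphasize this with a trivial example that recursively calculates the Fibonacci sequence: this could be run multiple times to get an accurate measure, but the point is that none of the methods performs significantly differently.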

> fibo <- function(n) {
+   if ( n < 2 ) n
+   else fibo(n-1) + fibo(n-2)
+ }
> system.time(for(i in 0:26) fibo(i))
   user  system elapsed 
   7.48    0.00    7.52 
> system.time(sapply(0:26, fibo))
   user  system elapsed 
   7.50    0.00    7.54 
> system.time(lapply(0:26, fibo))
   user  system elapsed 
   7.48    0.04    7.54 
> library(plyr)
> system.time(ldply(0:26, fibo))
   user  system elapsed 
   7.52    0.00    7.58
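
The timings above discard the loop's results. As a small sketch (not part of the original answer), the same comparison can be written so that every variant keeps its output, which makes it easy to confirm they compute the same values. The range 0:15 and the variable names are assumptions chosen so the code runs quickly.

```r
# Recursive Fibonacci, same definition as in the example above
fibo <- function(n) {
  if (n < 2) n else fibo(n - 1) + fibo(n - 2)
}

idx <- 0:15  # smaller range than above so this finishes quickly

# for loop that stores its results in a preallocated vector
res_for <- numeric(length(idx))
for (k in seq_along(idx)) res_for[k] <- fibo(idx[k])

# apply-family equivalents
res_sapply <- sapply(idx, fibo)
res_lapply <- unlist(lapply(idx, fibo))
```

Wrapping each variant in system.time() would reproduce the comparison with the results retained.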

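Edit 2:

Regarding the use of parallel packages for R (e.g. rpvm, rmpi, snow), these generally provide apply family functions (even the foreach package is essentially equivalent, despite the name). Here's a simple example of the sapply function in snow: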

library(snow)
cl <- makeSOCKcluster(c("localhost","localhost"))
parSapply(cl, 1:20, get("+"), 3)

This example uses a socket cluster, for which no additional software needs to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page). snow has the following apply functions:

parLapply(cl, x, fun, ...)
parSapply(cl, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
parApply(cl, X, MARGIN, FUN, ...)
parRapply(cl, x, fun, ...)
parCapply(cl, x, fun, ...)

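It makes sense that apply functions should be used for parallel execution, since they have no side effects. When you change a variable's value within a for loop, it is set globally. On the other hand, all apply functions can safely be used in parallel because changes are local to the function call (unless you try to use assign or <<-, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.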
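
For readers without snow installed, roughly the same call can be sketched with the parallel package that ships with R. This is an assumption swapped in for illustration, not what the answer used; the worker count of 2 and the anonymous function are likewise illustrative.

```r
library(parallel)  # bundled with R; used here as a stand-in for snow

cl <- makeCluster(2)  # two local worker processes (socket cluster)

# Same computation as the snow example: add 3 to each element of 1:20.
# The extra argument 3 is passed through to the function on each worker.
res <- parSapply(cl, 1:20, function(x, y) x + y, 3)

stopCluster(cl)  # release the workers when done
```

Because the function only uses its own arguments, each worker can evaluate it without seeing any state from the calling session.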

Edit:

Here's a trivial example to demonstrate the difference between for and *apply as far as side effects are concerned:

> df <- 1:10
> # *apply example
> lapply(2:3, function(i) df <- df * i)
> df
 [1]  1  2  3  4  5  6  7  8  9 10
> # for loop example
> for(i in 2:3) df <- df * i
> df
 [1]  6 12 18 24 30 36 42 48 54 60

Note how the df in the parent environment is altered by for but not by *apply.
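
As a minimal sketch of the caveat about <<-, the same lapply call can be made to modify the outer df. It reuses the df from the example above; nothing else is assumed.

```r
df <- 1:10

# <<- assigns to df in the enclosing environment instead of creating a
# local copy, so the change survives after each function call returns.
invisible(lapply(2:3, function(i) df <<- df * i))

df  # now matches the for loop result: 6, 12, ..., 60
```

This is exactly the kind of side effect that makes such code unsafe to run in parallel.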

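Sometimes the speedup can be substantial, for example when you have to nest for loops to get the average based on a grouping of more than one factor. Here are two approaches that give exactly the same result: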

set.seed(1)  # for reproducibility of the results

# The data
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=TRUE))
Z <- as.factor(sample(letters[1:10],100000,replace=TRUE))

# the function forloop that averages X over every combination of Y and Z
forloop <- function(x,y,z){
# These ones are for optimization, so the functions 
#levels() and length() don't have to be called more than once.
  ylev <- levels(y)
  zlev <- levels(z)
  n <- length(ylev)
  p <- length(zlev)

  out <- matrix(NA,ncol=p,nrow=n)
  for(i in 1:n){
      for(j in 1:p){
          out[i,j] <- (mean(x[y==ylev[i] & z==zlev[j]]))
      }
  }
  rownames(out) <- ylev
  colnames(out) <- zlev
  return(out)
}

# Used on the generated data
forloop(X,Y,Z)

# The same using tapply
tapply(X,list(Y,Z),mean)

Both give exactly the same result: a 5 x 10 matrix of the averages, with named rows and columns. But:

> system.time(forloop(X,Y,Z))
   user  system elapsed 
   0.94    0.02    0.95 

> system.time(tapply(X,list(Y,Z),mean))
   user  system elapsed 
   0.06    0.00    0.06

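There you go. What did I win? ;-)

... as I just wrote elsewhere, vapply is your friend! ... It's like sapply, but you also specify the return value type, which makes it much faster. (The foo and y used in the timings here are the ones defined in the example further down: foo adds 1 to its argument and y is numeric(1e6).)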

> system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
   user  system elapsed 
   3.54    0.00    3.53 
> system.time(z <- lapply(y, foo))
   user  system elapsed 
   2.89    0.00    2.91 
> system.time(z <- vapply(y, foo, numeric(1)))
   user  system elapsed 
   1.35    0.00    1.36
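
Apart from speed, the declared return type also acts as a check. The following is a small sketch of that behaviour; the inputs and variable names are made up for illustration.

```r
# vapply compares every result against the template (here: one double)
squares <- vapply(1:5, function(i) i^2, numeric(1))

# A result of the wrong type raises an error instead of silently
# changing the type of the output
bad <- tryCatch(
  vapply(1:5, function(i) letters[i], numeric(1)),
  error = function(e) "error"
)

# With empty input, sapply returns an empty list, while vapply still
# returns a vector of the declared type
empty_s <- sapply(integer(0), function(i) i^2)
empty_v <- vapply(integer(0), function(i) i^2, numeric(1))
```

Knowing the output type in advance is presumably also what lets vapply skip the simplification step that sapply performs afterwards.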

I've written elsewhere that an example like Shane's doesn't really stress the difference in performance among the various kinds of looping syntax because the time is all spent within the function rather than actually stressing the loop. Furthermore, the code unfairly compares a for loop with no memory with apply family functions that return a value. Here's a slightly different example that emphasizes the point.

foo <- function(x) {
   x <- x+1
 }
y <- numeric(1e6)
system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
#   user  system elapsed 
#  4.967   0.049   7.293 
system.time(z <- sapply(y, foo))
#   user  system elapsed 
#  5.256   0.134   7.965 
system.time(z <- lapply(y, foo))
#   user  system elapsed 
#  2.179   0.126   3.301

If you plan to save the result then apply family functions can be much more than syntactic sugar.

(the simple unlist of z is only 0.2s so the lapply is much faster. Initializing the z in the for loop is quite fast because I'm giving the average of the last 5 of 6 runs so moving that outside the system.time would hardly affect things)
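
One detail worth checking when reproducing this benchmark: y is all zeros, so for(i in y) z[i] <- foo(i) always indexes position 0, which R silently ignores, and the loop stores nothing. The sketch below is not from the original answer; it contrasts that with a loop over indices. The smaller vector length and the simplified foo are assumptions made to keep it quick.

```r
foo <- function(x) x + 1
y <- numeric(1e5)  # all zeros, smaller than above to keep this quick

# Looping over the values: i is always 0, and z[0] <- ... stores nothing
z_values <- numeric(length(y))
for (i in y) z_values[i] <- foo(i)

# Looping over the indices: each result lands in its own slot
z_index <- numeric(length(y))
for (i in seq_along(y)) z_index[i] <- foo(y[i])

# apply-family version that returns the same vector
z_vapply <- vapply(y, foo, numeric(1))
```

Timing the index-based loop against lapply or vapply gives a comparison in which both sides actually keep their results.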

One more thing to note though is that there is another reason to use apply family functions independent of their performance, clarity, or lack of side effects. A for loop typically promotes putting as much as possible within the loop. This is because each loop requires setup of variables to store information (among other possible operations). Apply statements tend to be biased the other way. Often times you want to perform multiple operations on your data, several of which can be vectorized but some might not be able to be. In R, unlike other languages, it is best to separate those operations out and run the ones that are not vectorized in an apply statement (or vectorized version of the function) and the ones that are vectorized as true vector operations. This often speeds up performance tremendously.
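
A minimal sketch of that separation, using made-up data: the median per group is the step that needs a loop, while the arithmetic afterwards is plain vector math. The grouping and the formula are purely illustrative.

```r
set.seed(1)
# 100 hypothetical groups of 10 values each
groups <- split(rnorm(1000), rep(1:100, each = 10))

# Everything inside the loop
out_loop <- numeric(length(groups))
for (i in seq_along(groups)) {
  out_loop[i] <- (median(groups[[i]]) * 2 + 1)^2
}

# Only the non-vectorized step in vapply, the rest as vector operations
med <- vapply(groups, median, numeric(1))
out_vec <- (med * 2 + 1)^2
```

Both versions give the same numbers; the second does the arithmetic once on the whole vector instead of once per group.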

Taking Joris Meys' example, where he replaces a traditional for loop with a handy R function, we can use it to show that writing code in a more R-friendly manner gives a similar speedup without the specialized function.

set.seed(1)  # for reproducibility of the results

# The data - copied from Joris Meys' answer
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=TRUE))
Z <- as.factor(sample(letters[1:10],100000,replace=TRUE))

# an R way to generate tapply functionality that is fast and 
# shows more general principles about fast R coding
YZ <- interaction(Y, Z)
XS <- split(X, YZ)
m <- vapply(XS, mean, numeric(1))
m <- matrix(m, nrow = length(levels(Y)))
rownames(m) <- levels(Y)
colnames(m) <- levels(Z)
m

This winds up being much faster than the for loop and just a little slower than the built-in, optimized tapply function. That is not because vapply is so much faster than for, but because it performs only one operation in each iteration of the loop. In this code everything else is vectorized. In Joris Meys' traditional for loop many (7?) operations occur in each iteration, and there's quite a bit of setup just for it to execute. Note also how much more compact this is than the for version.
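
To confirm that the split/vapply version agrees with tapply, a quick check can be sketched as follows. It uses a smaller sample than the example above (an assumption, for speed) and compares the values only, ignoring the dimension names.

```r
set.seed(1)
n <- 10000  # smaller than the example above
X <- rnorm(n)
Y <- as.factor(sample(letters[1:5], n, replace = TRUE))
Z <- as.factor(sample(letters[1:10], n, replace = TRUE))

# split/vapply version, as above; interaction() varies Y fastest, so
# filling the matrix column by column puts one Z level in each column
m <- vapply(split(X, interaction(Y, Z)), mean, numeric(1))
m <- matrix(m, nrow = nlevels(Y), dimnames = list(levels(Y), levels(Z)))

# built-in tapply for comparison
tp <- tapply(X, list(Y, Z), mean)
same <- isTRUE(all.equal(unname(m), unname(tp)))
```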

When applying functions over subsets of a vector, tapply can be noticeably faster than a for loop. Example:

df <- data.frame(id = rep(letters[1:10], 100000),
                 value = rnorm(1000000))

f1 <- function(x)
  tapply(x$value, x$id, sum)

f2 <- function(x){
  res <- 0
  for(i in seq_along(l <- unique(x$id)))
    res[i] <- sum(x$value[x$id == l[i]])
  names(res) <- l
  res
}            

library(microbenchmark)

> microbenchmark(f1(df), f2(df), times=100)
Unit: milliseconds
   expr      min       lq   median       uq      max neval
 f1(df) 28.02612 28.28589 28.46822 29.20458 32.54656   100
 f2(df) 38.02241 41.42277 41.80008 42.05954 45.94273   100

apply, however, in most situations doesn't provide any speed increase, and in some cases can even be a lot slower:

mat <- matrix(rnorm(1000000), nrow=1000)

f3 <- function(x)
  apply(x, 2, sum)

f4 <- function(x){
  res <- 0
  for(i in 1:ncol(x))
    res[i] <- sum(x[,i])
  res
}

> microbenchmark(f3(mat), f4(mat), times=100)
Unit: milliseconds
    expr      min       lq   median       uq      max neval
 f3(mat) 14.87594 15.44183 15.87897 17.93040 19.14975   100
 f4(mat) 12.01614 12.19718 12.40003 15.00919 40.59100   100

But for these situations we've got colSums and rowSums:

f5 <- function(x)
  colSums(x) 

> microbenchmark(f5(mat), times=100)
Unit: milliseconds
    expr      min       lq   median       uq      max neval
 f5(mat) 1.362388 1.405203 1.413702 1.434388 1.992909   100
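
As a quick sanity check, the dedicated functions return the same values as the apply calls they replace. The matrix below is a smaller, made-up stand-in for mat.

```r
set.seed(1)
small <- matrix(rnorm(1e4), nrow = 100)  # 100 x 100, illustrative only

col_apply <- apply(small, 2, sum)  # column sums via apply
col_fast  <- colSums(small)        # dedicated column-sum function

row_apply <- apply(small, 1, sum)  # row sums via apply
row_fast  <- rowSums(small)        # dedicated row-sum function
```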

Reference URL: https://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar
