# Dplyr summarize issues with list update

I'm doing some straightforward `%>% summarize() %>%` operations that are taking over two minutes, even though they're running on an Intel 8th-gen CPU at 3.8 GHz and the data has only ~700k observations in ~200k groups. Worse, the output tibble often isn't added to the R workspace, leading to script failures: either the tibble is missing, or recalculating it with updates didn't actually update the workspace to the new version of the tibble. One of the calls I've been timing with microbenchmark (the captured fragment is truncated):

``` r
#> 100 microbenchmark::microbenchmark(summarize(many_grps, n = n()),
```
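To give a sense of scale, here is a minimal sketch of the shape of the computation. The column names, group-id distribution, and summary functions are placeholders for illustration, not my actual data:

``` r
library(dplyr)

# Hypothetical stand-in for the real data: ~700k observations in ~200k groups.
set.seed(42)
many_grps <- tibble(
  grp = sample(2e5, 7e5, replace = TRUE),  # ~200k distinct group ids
  x   = rnorm(7e5)
) %>%
  group_by(grp)

# The shape of the pipeline that is slow on the real data
# (on my machine the real version takes minutes, not seconds):
system.time(
  smry <- many_grps %>%
    summarize(n_obs = n(), x_max = max(x)) %>%
    ungroup()
)
```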


Curiously, the first time summarize() is called it is noticeably more reliable than recalculations, so, at least so far, I can mostly mitigate the issue by restarting the R session and caching the summarized results in a data file. While something of a hassle, due to the two-minute calculation grind and the need to keep files synchronized, it's better than the alternative.
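In case it's useful to anyone else, the caching workaround is nothing fancy, just a file cache around the slow call. A sketch, reusing the hypothetical `many_grps` from the sketch above; the file and object names are placeholders:

``` r
library(dplyr)

# Compute the expensive summary once, cache it to disk, and reuse the
# cached copy after restarting the R session.
cache_file <- "smry_cache.rds"

if (file.exists(cache_file)) {
  smry <- readRDS(cache_file)
} else {
  smry <- many_grps %>%
    summarize(n_obs = n(), x_max = max(x))
  saveRDS(smry, cache_file)
}
```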
This sort of calculation isn't really amenable to a reprex even though the data size is small (the input tibble is <200 MB in memory), but if there are log files or other diagnostics I can collect, I'm happy to provide those. I can also make my data files and a fairly simple script available via email and cloud storage if someone wants to take a closer look.

---

dplyr's computational engine is showing its limitations for larger datasets, and it seems every time we improve performance in one place we either make it worse somewhere else or introduce a buggy edge case. For comparison, here is how the same summaries run through dtplyr's data.table backend:

``` r
#> Warning: Some expressions had a GC in every iteration so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression                                                 min  median itr/s…¹
#> 1 collect(summarize(dtplyr::lazy_dt(df_mny), y = n()))    36.8ms  43.7ms    18.7
#> 2 collect(summarize(dtplyr::lazy_dt(df_mny), y = max(x))) 58.7ms  63.2ms    13.5
#> 3 collect(summarize(dtplyr::lazy_dt(df_few), y = n()))    18.9ms  20.7ms    36.8
#> 4 collect(summarize(dtplyr::lazy_dt(df_few), y = max(x))) 33.9ms  35.5ms    28.0
#> # … with 2 more variables: mem_alloc, `gc/sec`, and abbreviated
#> #   variable name ¹`itr/sec`
#> # ℹ Use `colnames()` to see all variable names
```

Created on by the reprex package (v2.0.1). Platform: x86_64-apple-darwin19.5.0 (64-bit).
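For anyone who wants to poke at this, here is a sketch of a `bench::mark()` call that would produce a table of the shape above. The benchmarked expressions come from the output itself, but the construction and sizes of `df_mny` (many small groups) and `df_few` (fewer groups) are assumptions:

``` r
library(dplyr)
library(dtplyr)

# Assumed inputs; only the benchmarked expressions appear in the output above.
set.seed(1)
df_mny <- tibble(g = sample(2e5, 1e6, replace = TRUE), x = runif(1e6)) %>% group_by(g)
df_few <- tibble(g = sample(1e2, 1e6, replace = TRUE), x = runif(1e6)) %>% group_by(g)

bench::mark(
  collect(summarize(dtplyr::lazy_dt(df_mny), y = n())),
  collect(summarize(dtplyr::lazy_dt(df_mny), y = max(x))),
  collect(summarize(dtplyr::lazy_dt(df_few), y = n())),
  collect(summarize(dtplyr::lazy_dt(df_few), y = max(x))),
  check = FALSE  # the four expressions return different results
)
```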
I think we will probably need to fix this by reconsidering the built-in backend altogether, rather than patching it with more band-aids.

While we're certainly still generally thinking about performance in dplyr, tracking this specific issue isn't particularly useful for us.
