"Moreover, support for dplyr and data.table are on the way. "
Well, I can't really use it in my day-to-day work, since that almost always involves cleaning and munging via one of those two packages. And it's not like ggplot2 is where my R code is most delayed; by the time I'm plotting I'm usually working on aggregated data, or a much smaller analytical dataset that needs far less speed. My hang-ups are in the initial munging phases where the data is still very large, which often calls for data.table over dplyr due to the latter's much slower performance.
Yeah, data.table already provides a significant speedup over dplyr - so much so that dplyr's "better" syntax no longer makes sense once you have to deal with very large datasets. But maybe FastR can change that somewhat?
Wait, so the time difference when running your code is bigger than what you'd gain from working with a "better" syntax?
I spend hours cleaning up data and only have to run the code once (I normally save the output to a feather file and then work from a separate script from there).
I still believe that the 'tidyverse' is hands down the best thing that has happened to R and is the whole reason why R has grown so fast.
Sometimes it can take 12 or more hours to run the code on the millions of observations. There's also competition from other researchers for computational resources, which can mean I have to leave something running for hours because the server is heavily queried. My workflow also doesn't allow easy interruption of execution; sometimes a run has to complete, errors and all, before I can fix a mistake or change a parameter.
I would then say you're using the wrong tool for the problem? I can't imagine 12-hour runs. I would imagine Spark is a better bet - or is that not an option?
I dunno how large your data set is, but I just set up a 16-core Threadripper workstation for work with 32 GB RAM and a 1 TB M.2 SSD for approx. $2500. If it can regularly save you hours or days of waiting, getting something equivalent should be a no-brainer.
How large are we talking? I haven't had any problems with dplyr performance as long as my data fits in main memory. (I have 16GB, so that means single digit GB data frames at most - I realize that doesn't qualify as "very large".) It does slow down considerably for larger data sets, but I assumed that that was because it was hitting the pagefile.
In the event that the data doesn't fit into memory, it's better to preprocess w/ SQL at the data-store level. There hasn't been a case where I'd need to feed massive amounts of data into a ggplot2 visualization unaggregated.
FastR doesn't alter the semantics of R, so when dplyr copies a vector in GNU-R, FastR has to copy it too. However, FastR does use reference counting (not sure whether that's turned on in GNU-R 3.5.1 now), so it may avoid some unnecessary copies.
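For anyone unfamiliar with R's copy-on-modify semantics, here is a minimal base-R illustration (tracemem is built in; the printed addresses differ per session):

    x <- runif(1e6)
    tracemem(x)      # start reporting when x's data gets duplicated
    y <- x           # no copy yet: x and y share the same vector
    y[1] <- 0        # tracemem reports a copy here, y gets its own data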
Yeppo.
Also, for myself at least in the geospatial realm, I need raster, rgdal, sp, sf, and parallel. The primary allure of R (imo) is the thousands of packages that allow you to quickly and easily implement whatever you want to do. Combine those with data.table and parLapply, and you're off to the races.
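A rough sketch of that pattern for illustration - process_tile and the file layout are made up, but the parallel, raster, and data.table calls are the real APIs:

    library(parallel)

    files <- list.files("tiles", pattern = "\\.tif$", full.names = TRUE)

    # namespace-qualified calls so the PSOCK workers find the packages
    process_tile <- function(f) {
      r <- raster::raster(f)
      data.table::data.table(file = f, mean_val = raster::cellStats(r, mean))
    }

    cl <- makeCluster(detectCores() - 1)
    res <- data.table::rbindlist(parLapply(cl, files, process_tile))
    stopCluster(cl)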
Maybe 3-4 years ago there was a big push to speed up R by replacing the runtime; at least 3 competing replacements were talked about pretty actively. None of them achieved much mindshare. R trades runtime speed for dev speed, and we juice performance by writing the slow stuff in C++ and linking Intel's MKL. The RStudio folks are also making the low-level stuff faster and more consistent through the r-lib family of packages, which are awesome.
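The C++ escape hatch usually looks something like this toy Rcpp example (not taken from the article):

    library(Rcpp)

    # compile a small C++ function from R; Rcpp generates the glue code
    cppFunction('
      double sum_sq(NumericVector x) {
        double total = 0;
        for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
        return total;
      }')

    sum_sq(rnorm(1e6))   # callable like any other R function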
Big barriers to adoption here: not a truly drop-in replacement, R people have an aversion to Java (we've all spent hours debugging rJava; luckily most of those packages have been rewritten in C++ now), and nobody likes Oracle.
I think the best-case scenario here is that progress on FastR pushes the R-Core team to improve GNU-R.
I never fail to be amazed at all the work the RStudio et al. team do to push R towards the wonderful programming language/environment it could be, rather than what it has been.
This claim is made about a lot of things: Ruby, Python, etc. I think the important point is that there is no trade going on. It's just that these things are all slower / less efficient than they need to be.
Maybe that's true, but I think Julia is the first effort to prove that out in the numerical/statistical world, and while lovely the ecosystem is far behind because of how much newer it is.
JavaScript showed that dynamically typed languages can be JITted well. It is just hard, and we spread our efforts over so many languages that they don't all have the resources to do it.
oh for sure, but for Python/R the barrier to speed isn't any of their important productivity features (as far as I know) but just a high quality compiler/JIT
If I was Lord Of Computing I wouldn't let languages out of beta until they had a high quality compiler or JIT. Turns out I am not though.
There's also Microsoft R Open (https://mran.microsoft.com/download), which I've found is faster than out-of-the-box R since it supports better multi-threading of commands.
IIRC most of that is because they link Intel's MKL / a better BLAS; if you like Docker, the Rocker containers use the better BLAS, and I think adding MKL isn't too hard either.
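If you want to see which BLAS your installation actually uses, something like this works (I believe sessionInfo() has printed the BLAS/LAPACK paths since R 3.4):

    sessionInfo()                         # lists the BLAS/LAPACK libraries in use
    m <- matrix(rnorm(2000 * 2000), 2000)
    system.time(m %*% m)                  # crude check of how fast your BLAS really is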
I'm not even sure GNU-R is the most important comparison (although it is an important comparison). How does it compare to R with Intel MKL? How does it compare to other (faster) languages?
FastR also uses native BLAS and LAPACK libraries. It should be possible to link it with Intel MKL as well.
We didn't want to include a comparison to R-3.5.X, because FastR itself is based on the base library of 3.4.0, but the results for GNU-R 3.5.1 are almost the same as for R-3.4.0.
AFAIK ALTREP is not used that much yet inside GNU-R itself. They can now do efficient integer sequences (i.e. 1:1000 does not allocate 1000 integers unless necessary), which would save a little bit of memory in this example, but that's about it. FastR also plans to implement the ALTREP interface for packages. Internally, we've already been using things like compact sequences.
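You can see the compact representation for yourself on GNU-R 3.5+ (just an illustration of the ALTREP integer sequences, nothing FastR-specific):

    x <- 1:1e7              # in R >= 3.5 this is a compact ALTREP sequence,
                            # not 10 million materialized integers
    .Internal(inspect(x))   # prints the compact representation
    x[1] <- 0L              # writing to it forces allocation of the full vector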
There is also the xtensor initiative which aims to provide a unified backend for array / statistical computations in C++ and then makes it pretty easy to create bindings to all the data science languages (R, Julia and of course Python). Usually, going to C++ provides a pretty sizeable speedup.
This is very interesting! Have you gotten any buy-in from the wider R community, is anyone rewriting their packages atop xtensor? Does R 3.5 and ALTREP make such a transition any easier?
I actually can't tell, but it has not yet been significant. It takes quite a bit of time to really get a library like this started. So far we've mostly dealt with people who are using xtensor from C++ or bind it to Python.
We've mainly gone through RCpp for the R language, and that has been working great. I don't know about changes in R 3.5 or ALTREP. Is there something we should know/change for it?
At this point, the tidyverse packages probably cover >90% of my data analysis workflow, so it'd be great to see all of those compatible with FastR. I'd guess tidyr and dplyr would be the trickiest, and dplyr is already being worked on!
FastR can actually run all tests of the development version of dplyr with a simple patch. We're working on removing the need for that patch altogether.
data.table is a different beast, and we will probably provide and maintain a patched version for FastR. They do things like casting an internal R structure to a byte array and then memcpy-ing it into another R structure. This is very tricky to emulate if your data structures actually live on the Java side and you're handing out only handles to the native code.
The last graph is a bit hard to read with the log scale. It's a 10x improvement from GNU-R to FastR+rJava and another 10x with the native GraalVM interop.
I've actually tried porting some existing R applications that are currently run with RApache to Graal to try and get simpler deployment and better/more consistent operational support. Unfortunately at the time the gsub() function was broken, and that broke some of our core logic.
Hm... looks like the issue may have been fixed. I'll have to try again.
Functions in R are not referentially transparent, so replacing an argument with its value is not necessarily the same. That is a clear restriction on optimizations. If you would want to choose a restricted subset of R to speedup, then this would be a good candidate to cut out since the standard place to compile is at the function level (Numba, Cython, and Julia all do it at functions).
I'm not sure this is right; the NSE stuff tends to be at the shell, the user-facing API. The workhorse functions generally are referentially transparent, and writing pure functions is both natural and recommended in R. The slow parts are deeper than the NSE, so removing NSE wouldn't open up much room to optimize.
I suspect pass-by-value is a much bigger barrier to speed in R than non-standard evaluation.
Oh yes, I forgot about its pass-by-value. Removing pass-by-value is a double-edged sword though. I generally dislike it, but you have to admit that having everything pass-by-value is much simpler for a non-programmer. If you chop that out, then the "fast R subset" suddenly can act very differently. In order to really write efficient code you'd want to start making use of mutation in this fast part. That means throwing a macro on some array-based R code won't really be automatic: it would need a bit of a rewrite for full speed, but the rewritten version would be incompatible with pass-by-value semantics. This is quite an interesting and tough problem to solve. I think it might be better to keep things pass-by-value and try to optimize pure functions.
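To make the contrast concrete, a small base-R vs data.table sketch (base semantics copy on modification; data.table's := explicitly opts into mutation):

    df <- data.frame(x = 1:5)
    f <- function(d) { d$x <- d$x * 2; d }   # works on a copy
    f(df)
    df$x                                     # original is untouched: 1 2 3 4 5

    library(data.table)
    dt <- data.table(x = 1:5)
    dt[, x := x * 2]                         # := modifies dt in place, no copy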
That R is still around while not enjoying the wide array of benefits of general-purpose programming languages is impressive. It must truly have pluses that Python users don't even dream about.
E.g. can you quickly spin up a REST-like HTTP interface for your goods?
RStudio is pretty amazing for interactive statistical work. Also, a lot of open source developers tend to ignore Windows, but the less technical users are on Windows, so proper Windows support is a key win. CRAN has a very clean documentation system, and its setup for packages ensures that most things work on Windows (Windows CI is required). Also, R's non-standard evaluation and the associated metaprogramming are very integrated into the language, so you can build very intuitive APIs. Most users wouldn't know how to program what you just did, but that doesn't matter, since the workflow for the average R user is "package user", not "package developer". So while R does have quite a few downsides, there's a lot that other general-purpose programming languages can pull from it.
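For anyone who hasn't seen R's NSE, a tiny base-R taste of what I mean:

    # substitute() captures the unevaluated argument, which is how APIs like
    # subset() and library() accept bare names and expressions
    label_of <- function(x) deparse(substitute(x))
    label_of(height + weight)     # returns the string "height + weight"

    df <- data.frame(a = 1:5, b = 5:1)
    subset(df, a > 2)             # 'a > 2' is evaluated inside df, not in the global env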
> E.g. can you quickly spin up a REST-like HTTP interface for your goods?
On the contrary, it started life as a Bell Labs project called S, more or less a math/stats DSL. It was reimplemented in the GNU project as R, and R became one of many competing "stats packages" you may or may not be familiar with: SAS, Stata, SPSS, etc.
While it can be used for general-purpose programming, its main advantage is that it is still primarily a math, statistics, and data analysis DSL at heart. The concept of a "data frame" (which you are familiar with if you've used Pandas) as a data structure originated, as far as I can tell, in R. Data frames are built into the language, and the language offers custom syntax support for them.
Also, the standard library is full of high-quality statistics tools. Fitted model objects have handsome, human-readable string representations. The formula DSL is elegant and convenient. Manipulating data (replacing missing values, etc.) is easy and relatively concise. Math and linear algebra are similarly easy, and R links to BLAS so it's pretty fast. Plotting is built into the language and it's pretty intuitive, even if the defaults aren't that pretty. The language is also fully homoiconic and wildly dynamic, allowing you to introspect and modify pretty much any chunk of code.
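For readers who haven't touched R, a few lines of the standard library cover most of what's described above:

    df <- data.frame(x = rnorm(100))
    df$y <- 2 * df$x + rnorm(100)

    fit <- lm(y ~ x, data = df)   # the formula DSL
    summary(fit)                  # readable model summary
    plot(df$x, df$y)              # plotting is built in
    abline(fit)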
And all that's just in the standard library. The package ecosystem is downright enormous. You can write R packages in C/C++ just like in Python if you need something to go fast, aided by Rcpp. There's Shiny, a self-contained HTTP server for data-driven web applications. ggplot2 was a minor revolution in elegant data visualization. The tidyverse package collection was similarly mold-breaking by letting users write organic "data pipelines" instead of imperative code. Caret is at least as good as Scikit-learn for general-purpose machine learning. xts takes the pain out of time series manipulation and modeling. data.table can efficiently join and subset billion-row datasets in memory using indexes. The list goes on.
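And the "data pipeline" style mentioned above, as a quick dplyr sketch on a built-in dataset:

    library(dplyr)

    mtcars %>%
      group_by(cyl) %>%
      summarise(mean_mpg = mean(mpg)) %>%
      arrange(desc(mean_mpg))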
Long story short:
- domain-specific niceties
- batteries-included standard library that mimics features found in big monolithic stats packages
- has general-purpose programming capability
- extensible in C for speed
- built-in plotting that's not perfect but it's pretty good
- huge package ecosystem.
> Caret is at least as good as Scikit-learn for general-purpose machine learning
Oh how I wish this was true! Luckily RStudio hired the author of Caret to develop a family of smaller tidy modeling packages (https://github.com/tidymodels), and with recipes we're finally close to having something like sklearn's Pipelines, which IMO is one of the best parts of sklearn.
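For the curious, the recipes workflow looks roughly like this - a sketch from memory, so check the current docs for exact step and argument names:

    library(magrittr)   # for the pipe
    library(recipes)

    rec <- recipe(mpg ~ ., data = mtcars) %>%
      step_center(all_predictors()) %>%
      step_scale(all_predictors())

    prepped <- prep(rec, training = mtcars)   # estimate the centering/scaling
    bake(prepped, new_data = mtcars)          # apply it to (new) data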
True, the pipeline is a great feature. I haven't used tidymodels yet but it looks like the start of a great ecosystem. I do remember seeing Broom at a talk a couple years ago and thought it was a nice idea.
I only used Stata in school but that's how it turned out for me. "Why learn Stata, SAS, or SPSS when I can just use R?" It made no sense to me (and still doesn't, honestly).
> E.g. can you quickly spin up a REST-like HTTP interface for your goods?
With R? Why would you want to do that with R? R is not suitable as a web server. Maybe you can write a package for that using C. There are 13170 packages for R; in fact, 99% of R consists of packages. You don't sit down and write a web server in R.
I dunno, I was able to cobble together a time series forecast API using the plumber and forecast packages in an afternoon, which a product team was then able to work against to create demos for customers. Yeah, they’d probably eventually want to rewrite the API to be “production ready.” But on the other hand, for prototyping and getting to show something real to prospective customers? Pure dynamite.
Even then, if the stats being done in the background were hard to reimplement, I suppose plumber & R could still work with the right cloud / load balancing infra. Might end up being more expensive than it needs to be in the final iteration, but in the meantime money could be flowing in and customers gettin’ happy.
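Not my actual code, but a hypothetical plumber endpoint in that spirit is only a handful of lines:

    # forecast_api.R -- plumber turns annotated functions into HTTP endpoints
    library(forecast)

    #* Forecast the next h periods of a demo series
    #* @param h number of periods ahead
    #* @get /forecast
    function(h = 12) {
      fit <- auto.arima(AirPassengers)
      as.numeric(forecast(fit, h = as.integer(h))$mean)
    }

    # then, from the console:
    #   plumber::plumb("forecast_api.R")$run(port = 8000)
    #   curl "http://localhost:8000/forecast?h=6"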
The big pluses are the huge range of libraries that make developing analyses easier, faster, and more reproducible.
Python has some fine libraries, but it's leagues behind what's available in R.
I use R like I use bash for neuroimaging analysis: I rely on a whole lot of powerful/specialized tools (e.g. lmer in R, or the AFNI command-line programs for neuroimaging) whose inputs and outputs I link together into a pipeline using R/bash utilities.
Admittedly there are tools like nipype that use Python to create an interface for those different neuroimaging tools, but most of the time bash scripting works perfectly reasonably for this.
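The glue pattern is basically this sketch - the file layout and choice of AFNI tool are just placeholders:

    # drive external command-line tools from R and collect their output
    # (3dinfo is one AFNI utility; swap in whatever tool your pipeline needs)
    files <- list.files("subjects", pattern = "\\.nii$", full.names = TRUE)

    run_info <- function(f) {
      out <- system2("3dinfo", args = shQuote(f), stdout = TRUE)  # capture stdout
      data.frame(file = f, n_lines = length(out))
    }

    results <- do.call(rbind, lapply(files, run_info))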
The article makes mention that FastR supports GraalVM's polyglot mechanism. One possible option for your task is you do your data analysis with FastR and render it with Node on Graal.js or Sinatra on TruffleRuby. At first blush this might not sound all that different from CGI of yore, but the key thing is all Truffle-based languages can optimize with one another. So, when your web server endpoint gets hot, Truffle's PE can inline nodes from FastR and JIT the whole thing with Graal.
You get to use the best language for the task at hand and don't have to worry about performance penalties for doing so.
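If I'm remembering the FastR interop builtin correctly (eval.polyglot - worth double-checking against the GraalVM docs), the R side of that kind of mix looks roughly like:

    # only works on FastR under GraalVM, not on GNU-R
    x <- eval.polyglot("js", "21 * 2")   # evaluate JavaScript, get the value back in R
    x + 1                                # polyglot values cross languages without serialization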
In answer to your question -- my sense is that you can spin up super nice dashboards using Shiny, and those will be opinionated HTTP interfaces. If you want to combine the flexibility of a bona fide web framework with R Shiny dashboards, you're going to have a rough time. Shiny itself has a pretty rough HTTP implementation built in.
So I'd say the answer is yes, and you'll have a good time as long as you only need the HTTP interface to do certain things (responsive dashboards - and it does them well!).
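For reference, a minimal Shiny app really is just a few lines:

    library(shiny)

    ui <- fluidPage(
      sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
      plotOutput("hist")
    )

    server <- function(input, output) {
      output$hist <- renderPlot(hist(rnorm(input$n)))
    }

    shinyApp(ui, server)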
Web server implementations exist in R, but they don't have nearly the time / attention put into them that Python's do.
I've used plumber, and it's pretty easy to get started, though doesn't feel very polished. Handling multipart form data took some hackarounds with the underlying "Rook" package.
I'm curious to see how https://github.com/thomasp85/fiery performs and if anyone has used that. May be higher performance than plumber (re: concurrency) because I get the sense from the docs that it's closer to libuv.
Doing something like that is definitely possible; all the parts are there and work well. Shiny gives you a lot out of the box, is great for prototyping, and can be customized. I’ve been working on a less opinionated package that isn’t ready for anything, but it gives an idea of what would be possible:
Is there any information about how Graal+FastR are doing right now with respect to memory usage and warm-up speeds? Are these benchmarks for total wall time or just the post-warm-up speed?
There is a plot of warm-up curves for this specific example. Search for "To make the analysis of that benchmark complete, here is a plot with warm-up curves".
However, it is true that the warm-up and memory usage are something we need to improve. We're working on providing a native image [1] of FastR. With that, both the warm-up and memory usage should get close to GNU-R.