Using cache to avoid re-processing, improve UX, and quicken results in R

If you’ve never heard of cache (/kaSH/) before, Google it and you’ll quickly find that it is “a collection of items of the same type stored in a hidden or inaccessible place”. Basically, you have “something” stored “somewhere” so you can fetch it “sometime” later. If it sounds basic, that’s because it is (or at least can be)! This simple technique can come in quite handy when you are coding functions that take some time to gather and/or process the data you’re working with. In other words, think of those processes that take a while to run and that there’s really no need to re-run “every time”, because the outcome will be exactly the same. Re-processing cache-able stuff means unnecessarily spending time, computing power, and real energy.

Today I’ll show you how I use cache in R to accelerate results, avoid re-processing, and improve UX for my users using the `lares` library. Let’s look at a few functions that leverage cache and how you can start using them. To begin, let’s install the library:

install.packages("lares")

Where will these cache objects be saved?

By default, they’ll be saved in your tempdir(), which is a temporary directory you will always have available in your R sessions. Remember that these cache objects will be deleted once you close your R session. Alternatively, you can set an option to point the cache to any other directory: options("LARES_CACHE_DIR" = yourdir), where yourdir is a valid directory path. In this case, the files will NOT be erased when you close your R session. In both cases, the files will probably be hidden by default as they have no file-type.
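As a minimal sketch of the custom-directory option, the snippet below creates a folder and registers it. The folder name my_lares_cache is made up for illustration, and it is placed under tempdir() only so the demo leaves nothing behind; for a cache that survives the session you would pick a path outside of it.

```r
# Hypothetical folder name, used only for illustration. It lives under
# tempdir() so this demo cleans up after itself; choose a path outside
# tempdir() if the cache should survive the R session.
yourdir <- file.path(tempdir(), "my_lares_cache")
dir.create(yourdir, showWarnings = FALSE, recursive = TRUE)

# Note the plural: options() sets a global option, getOption() reads it back
options("LARES_CACHE_DIR" = yourdir)
current <- getOption("LARES_CACHE_DIR")
current
```

Setting the option back to NULL with options("LARES_CACHE_DIR" = NULL) removes it again, which should bring back the tempdir() default.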

Use cases: cache = faster

If you need to hold anything temporarily, use cache. When would you need that? When something is hard or expensive to get and you will use it again, when the platforms you query have rate limits or fees, or when you think you’ll save time by not re-running functions… I’ll share a few examples where you can use cache to improve your timings (and a cool trick).

– Reporting: you are running a daily report that extracts and gathers data from multiple platforms and sources (API, databases, Excel files…). You never close your RStudio session (like me) and use those results multiple times per day. Instead of saving a new file every day, checking if you already ran it today, or running the requests repeatedly, you can simply use a dynamic cache name containing today’s date, and worry about more important stuff. I use it in a couple of functions that scrape daily stocks data to avoid re-scraping when I test function changes.
– Long-running processing functions: you shared with your peers your scripts/library that produce a magical output in a couple of minutes, but they are not quite savvy with coding. One of your parameters changes a small detail that actually runs almost at the end of the whole process. Your user doesn’t care and runs the function N times until it’s perfect… but every run executes the whole code. Was it really necessary?
– Inter-functions data: you created a function that needs the output of another function and do not wish to create an object in your environment (for security or any other reason). You can pass that API token, timestamp, user input, or any data at all, seamlessly to any other function or process without revealing its content openly to the user. For example: you have an interactive session and asked the user something; you’ll probably need that for future processes and should not ask the user again (nor save it visibly into the environment as an object). Use cache!
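The reporting case could look roughly like the sketch below. get_daily_report() and its fake data are hypothetical stand-ins for your slow extraction code; only cache_exists(), cache_read(), and cache_write() come from lares. The date is converted with as.character() so it can be safely combined with other strings in the cache name.

```r
library(lares)

# Hypothetical stand-in for a slow extraction from APIs, databases, files...
get_daily_report <- function() {
  # Today's date is part of the cache name, so tomorrow a new cache is created
  cache_name <- c(as.character(Sys.Date()), "daily_report")
  if (cache_exists(cache_name)) {
    return(cache_read(cache_name, quiet = TRUE))
  }
  Sys.sleep(1) # pretend this is the expensive part
  report <- data.frame(source = c("api", "database"), rows = c(120, 4500))
  cache_write(report, base = cache_name, quiet = TRUE)
  report
}

first <- get_daily_report()  # no cache for today yet: extract and write it
second <- get_daily_report() # cache for today exists: read it instead
identical(first, second)
```

Because the name changes with the date, there is no need to track whether the report already ran today; an old cache is simply never matched again.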

Cache functions

– cache_write: write any R object as cache.
– cache_read: read cache file and assign its content to an R object.
And there are two more functions that may be useful too:
– cache_exists: check if a cache file already exists.
– cache_clear: (force) delete all cache files in your cache_dir.
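Put together, the four functions cover the whole life cycle of a cache file. The sketch below assumes the default cache directory and a made-up cache name, "demo". Keep in mind that cache_clear() removes every lares cache file in that directory, not only this one.

```r
library(lares)

cache_write(mtcars, base = "demo", quiet = TRUE) # save an object as cache
before <- cache_exists("demo")                   # is the cache file there?
cars <- cache_read("demo", quiet = TRUE)         # load it back into an object
cache_clear()                                    # delete all lares cache files
after <- cache_exists("demo")                    # check again after clearing

# as.logical() drops the attributes that cache_exists() attaches to its result
c(before = as.logical(before), after = as.logical(after))
```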

Function parameters explained:

– data: any R object you wish to cache. It can be literally anything you have in your R environment.
– base: Character vector. This will be the unique name for your cache file. You can pass a character vector with multiple elements that will be concatenated. For example: c(as.character(Sys.Date()), "results", names(myList)). All cache files will start with lares_cache_ automatically so they are easy to detect.
– cache_dir: Where do you want to save your cache files? By default they’ll be stored in tempdir(), but you can change it using this parameter manually or by setting a global option called "LARES_CACHE_DIR".
– ask: If the cache exists, interactively ask the user what to do: when reading, whether the cache should be used or ignored; when writing, whether it should be overwritten. Note that you can only ask for one cache file at a time because vectors are concatenated.
– quiet: Keep quiet? If not, messages with the back-end processes and results will be printed.

Dummy example

Let’s create a dummy list, save it using the list’s elements names, and then read it (without assigning it to any new object):

library(lares)
a = list(a=1, b=1:5, c=LETTERS[1:5])
cache_write(a, base = names(a))
cache_read(names(a))
$a
[1] 1
$b
[1] 1 2 3 4 5
$c
[1] "A" "B" "C" "D" "E"

Check out the name of the cache file created: lares_cache_a.b.c. You could have passed base = "a.b.c" and the outcome would be the same. You could also leave the base argument empty, in which case a lares_cache_temp name will be used, which is handy if you only use one cache file in your pipeline.
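A quick way to convince yourself of that equivalence, assuming the a.b.c naming shown above, is to write the cache with the vector and read it back with the single string:

```r
library(lares)

a <- list(a = 1, b = 1:5, c = LETTERS[1:5])
cache_write(a, base = names(a), quiet = TRUE)    # name built from c("a", "b", "c")
from_string <- cache_read("a.b.c", quiet = TRUE) # same cache, single string
identical(from_string, a)
```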

Now we can check that the cache file we just wrote really exists:

cache_exists("a.b.c")
[1] TRUE
attr(,"filename")
[1] "/var/folders/3w/x0bjsdv52v3b3kjbw87y3wfw0000gn/T//RtmpqM9gYw/lares_cache_a.b.c.RDS"
attr(,"base")
[1] "lares_cache_a.b.c"
attr(,"cache_dir")
[1] "/var/folders/3w/x0bjsdv52v3b3kjbw87y3wfw0000gn/T//RtmpqM9gYw"

So TRUE means “it does exist”, and there are a few attributes attached to it: the absolute path and filename, the base used, and the cache directory where it lives.

TIP: Boost your data gather/processing functions

If you are using functions that gather data or wrangle your results in a way that you actually have to wait, consider adding this (adapted) chunk of code to the beginning of your function. That way you can check if you have already run the function with those exact same inputs and skip re-processing everything.

cache_file = c(input$a, input$b, ...)
if (cache_exists(cache_file)) {
  mycache = cache_read(cache_file, quiet = TRUE)
  return(mycache)
}

And don’t forget to add the following to the end of your function to create your cache:

  ...
  cache_write(results, cache_file, quiet = TRUE)
  return(results)
}
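Putting both chunks together, a complete function might look like the sketch below. slow_summary() and its arguments are invented for illustration, and Sys.sleep() stands in for the real work; the cache calls use only the arguments described earlier.

```r
library(lares)

# Hypothetical long-running function: summarise n random draws
slow_summary <- function(n, seed = 123) {
  # Build the cache name from the inputs, so each combination gets its own file
  cache_file <- c("slow_summary", n, seed)
  if (cache_exists(cache_file)) {
    return(cache_read(cache_file, quiet = TRUE))
  }
  Sys.sleep(1) # placeholder for the expensive processing
  set.seed(seed)
  results <- summary(rnorm(n))
  cache_write(results, cache_file, quiet = TRUE)
  return(results)
}

t1 <- system.time(r1 <- slow_summary(1000))[["elapsed"]] # computes, then caches
t2 <- system.time(r2 <- slow_summary(1000))[["elapsed"]] # served from cache
c(first_run = t1, second_run = t2)
```

Only inputs that change the result belong in the cache name. If an input is left out, two different calls would share one cache file and the second would get the wrong answer.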

Feel free to contact me on LinkedIn with any questions, requests, or anything at all. I’m always happy to hear from users reading my posts and using my library to boost their everyday tasks.
