One of the big flaws of R is that the data you load into it are stored in memory (in RAM) and not on disk. When you are working on an analysis with large (big) data, the processing time of simple and more complex functions can become very long, or even crash your computer. This is where SQL comes in: it is a powerful language designed to work with (large) databases and to perform simple operations on them (like subsetting, sorting …). It is particularly useful for exploring very large datasets and formatting the data for further analysis. There are many programs for doing database management with SQL. I decided to start looking at MySQL since it has an R package and is rather easy to set up (one could also use PostgreSQL …). In this post I will show you, step by step, how to create a database in MySQL, how to upload data from R into it, and then how to run some queries to get a feel for the power of SQL. Before I start, note that the data.table package was developed to perform fast operations on big data (have a look here).
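To give a flavour of what data.table offers, here is a minimal, purely illustrative sketch; the file name sales.csv and the columns region, price and year are made up for this example:

library(data.table)
#fread() reads large text files much faster than read.csv()
dt <- fread("sales.csv")   #hypothetical file, for illustration only
#set a key for fast subsetting, then subset and aggregate in one call
setkey(dt, region)
dt[region == "north", .(mean_price = mean(price)), by = year]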
Create a database
First you need to download MySQL from this website, or from Synaptic for Ubuntu users. Then open a shell window (cmd for Windows users, a terminal for Linux) and type:
> mysql -p -u root
This will ask you for the password of the root user. If it worked you will see some text and a mysql> prompt appearing. Then, if you don’t want to bother with different users and their rights, you can directly create a database using:
mysql> CREATE DATABASE intro_to_sql;
That’s it, you created a database named intro_to_sql. At this point it is very important to remember that every time you are in the mysql shell you need to put a semi-colon (;) at the end of your statement, otherwise it doesn’t work. You can look at all the databases on your system using:
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| intro_to_sql       |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.00 sec)
Then we create a user with all rights on the intro_to_sql database:
mysql> GRANT ALL ON intro_to_sql.* TO 'user1'@'localhost' IDENTIFIED BY '12345';
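You can check that the new user was set up correctly (a quick sanity check; the exact output depends on your MySQL version):

mysql> SHOW GRANTS FOR 'user1'@'localhost';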
Once a database and a user have been created we don’t need the shell interface any more; everything else can be done from R.
Uploading datasets from R
We could create tables in the database directly from the shell interface, but let’s see how to transfer data from R into the database:
library(RMySQL)
#connect to the database
con <- dbConnect(MySQL(), user = 'user1', password = '12345', dbname = 'intro_to_sql')
#load some data
library(ggplot2)
data(diamonds)
#have a look at them
summary(diamonds)
#write the table into the database
dbWriteTable(con, "diamonds_data", diamonds)
#remove the dataset from R
rm(diamonds)

The call to summary(diamonds) prints:

     carat               cut        color       clarity          depth           table           price             x                y
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80   Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
                                    J: 2808   (Other): 2531
       z
 Min.   : 0.000
 1st Qu.: 2.910
 Median : 3.530
 Mean   : 3.539
 3rd Qu.: 4.040
 Max.   :31.800
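If you want to double-check that the transfer worked, you can list the tables and count the rows of the new table directly from R (a small sketch reusing the con object from above):

#list the tables of the intro_to_sql database
dbListTables(con)
#count the rows that were transferred
dbGetQuery(con, "select count(*) from diamonds_data")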
We now have one table named diamonds_data in our database.
Performing queries from R
Now that our intro_to_sql database has one table, we can start playing with some SQL queries from R:
#count the number of diamonds that cost more than 2000$
dbGetQuery(con, "select count(*) from diamonds_data where price > 2000")
#make a new data frame with the diamonds of color 'D' and a depth less than 60%
subs <- dbGetQuery(con, "select * from diamonds_data where color = 'D' AND depth < 60")
unique(subs$color)
#make a new data frame with only the columns x, y, z and order them by ascending x
subs <- dbGetQuery(con, "select x, y, z from diamonds_data order by x")
head(subs)
#from this data frame create a new variable which is the mean of x, y and z
subs$Mean <- apply(subs, 1, mean)
#write the results in a new table
dbWriteTable(con, "XYZMean", subs)
#check that it has been created
dbListTables(con)

The output looks like this:

  count(*)
1    29733

[1] "D"

  x    y z
1 0 6.62 0
2 0 0.00 0
3 0 0.00 0
4 0 0.00 0
5 0 0.00 0
6 0 0.00 0

[1] "XYZMean"       "diamonds_data"
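SQL can also aggregate the data before they ever reach R. Here is a hedged sketch, reusing the same connection and table, that lets MySQL compute the average price per cut and returns only the small summary table to R:

#average price and number of diamonds per cut, computed by MySQL
avg_price <- dbGetQuery(con, "select cut, avg(price) as mean_price, count(*) as n
                              from diamonds_data group by cut")
avg_price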
As you can see it is fairly easy to work with RMySQL, and there are many advantages to this tool: (i) all the power of SQL is at your command from within R, so it is easy to include it in your workflow (i.e. using the script window in RStudio …); (ii) there is no need to load big chunks of unprocessed data into R, you can use SQL to process the data efficiently (I did not talk about how to directly load a table into a MySQL database, have a look here).
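A last housekeeping note: when you are done with the database it is good practice to close the connection (assuming the con object created above):

#release the connection opened with dbConnect()
dbDisconnect(con)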
There are many helpful resources online about this topic; here are a few that I found interesting: a working guide to MySQL, two nice introductions to another SQL platform supported in R (SQLite), and a blog post about the issue of big data in R.