Follow my blog with Bloglovin

Friday, October 4, 2013

VoltDB with R : Real time Analysis

I my previous post, I discussed about using VoltDB as front end (OLTP and real time analysis with current data, handle velocity aspect of Big Data) of Big Data management and Vertica as backend (historical analysis and OLAP, handle volume aspect of Big Data) and integrate data flow using Talend DI (ETL).

As, R is most popular in statistical analysis (open source), it is a good idea to do analysis on current data using R. It can be used for machine learning, text mining, sentiment analysis or fancy visualization and tag clouds from documents etc. VoltDB does not provide client libraries for R. We have two ways to interact with VoltDB from R:
1. Using RJDBC package (we will use this way as using JDBC is not encouraged by VoltDB).
2. Interact with VoltDB over HTTP using its REST API. This is the best way to interact from a language for which client libraries are not available.

I am using Fedora 19 64-bit for this tutorial. Make changes as per your OS.

To install and setup VoltDB database look into my previous post sections specific to VoltDB.

Install R using command "yum install R" after logging as super user. I will download and install R and dependent libraries.
After installation is done use "R" command to start R console. R-studio can be used as IDE, but I am using console.
Install RCurl package for curl functionality in R:
> install.packages("RCurl")
Check output or successful installation. You need to have gcc and libxml2 installed on Fedora. If not, install them using yum.
Load installed library and declare as import:
> library(RCurl)
> require(RCurl)

Create and store request URLs and request data for insert and select
> request.url<-'http://localhost:8080/api/1.0/'
> insert.req<-c(Procedure='SaveOrder',Parameters='[1, 10, 102, "CE", 10, "IN", "Mumbai", 25]')
> select.req<-c(Procedure='@AdHoc',Parameters='["select * from orders"]')
> insert.result<-getForm(request.url,.params=insert.req)
> select.result<-getForm(request.url,.params=select.req)

Every line start with > need to be executed on R console line by line. For R syntax one can look at http://www.r-tutor.com or to learn interactively tryr.codeschool.com. A brief about these R Code:
<- is assignment operator and value from right expression is assigned to left variable. Values enclosed in ' or " are strings and c(...) function is used to create list. There are several optional parameters to getForm function for examining headers etc.

First argument, 1,  in Parameters of insert.req is primary key in table, so change id to make more inserts. For more on schema or VoltDB visit my previous post. "SaveOrder" is stored procedure created in previous post and "@AdHoc" is a system procedure provided to run any arbitrary query.

Now, examine insert.result and select.result containing data in JSON format.
insert.result
select.result
For select.result JSON result will be like:
{
    "status"1,
    "appstatus": -128,
    "statusstring": null,
    "appstatusstring": null,
    "exception": null,
    "results": [
        {
            "status": -128,
            "schema": [
                {
                    "name""ORDER_ID",
                    "type"5
                },
                {
                    "name""ITEM_ID",
                    "type"5
                },
                {
                    "name""PRODUCT_ID",
                    "type"5
                },
                {
                    "name""PRODUCT_CATEGORY",
                    "type"9
                },
                {
                    "name""USER_ID",
                    "type"5
                },
                {
                    "name""USER_COUNTRY",
                    "type"9
                },
                {
                    "name""USER_CITY",
                    "type"9
                },
                {
                    "name""USER_AGE",
                    "type"5
                }
            ],
            "data": [
                [
                    1,
                    10,
                    102,
                    "CE",
                    10,
                    "IN",
                    "Mumbai",
                    25
                ]
            ]
        }
    ]
}
The "data" key contains actual data of query, an array. To process this JSON and get data install package "RJSONIO" as "RCurl" was installed and loaded. After that, execute following on console:
>select.result<-fromJSON(select.result)

After this print select.result to see key for each data item. To get data part from JSON use
> select.result<-unlist(select.result$results[[1]]$data)

As data is list of list need to unlist it. This will provide array of result, as below:
[1] "1"      "10"     "102"    "CE"     "10"     "IN"     "Mumbai" "25"

This is a one dimensional array, which can be converted to two dimensional matrix and then data frame, as:
> dim(select.result)<-c(2,8)
> select.result
     [,1] [,2]  [,3] [,4]     [,5] [,6]  [,7] [,8]    
[1,] "1"  "102" "10" "Mumbai" "2"  "102" "10" "Mumbai"
[2,] "10" "CE"  "IN" "25"     "10" "CE"  "IN" "25"    
> as.data.frame(select.result)
  V1  V2 V3     V4 V5  V6 V7     V8
1  1 102 10 Mumbai  2 102 10 Mumbai
2 10  CE IN     25 10  CE IN     25

Names for columns can be set as (call this before converting to data frame) :
> dimnames(select.result) <- list(c(), c('a','b','c','d','e','f','g','h'))

Matrix and data frame are basic structure of any analysis. All algorithms or analysis function operate on these structures, so now the data is ready for analysis. 

This is for interactive analysis on console. These sequence can be written in a text file as R script and executed using source('<filepath>') in console. For example create a file GetData.R and paste this content:
GetData.R:
require(RCurl)
require(RJSONIO)

#This function prints JSON received from server and returns data #as matrix with column names and returned in schema
getData<-function(query){
    request.url<-'http://localhost:8080/api/1.0/';
    select.req<-c(Procedure='@AdHoc',Parameters=query);
    select.result<-getForm(request.url,.params=select.req);
    print(select.result);
    header<-c();
    select.result<-fromJSON(select.result);
    schema<-select.result$results[[1]]$schema;
    for(item in schema){
       header<-c(header, item$name);
    }
   
    select.result<-unlist(select.result$results[[1]]$data);
    dim(select.result)<-c(2,8);
    dimnames(select.result) <- list(c(),header);
    return(select.result);
}

Change working directory of R workspace where the file is created and execute, for example:
> setwd("/home/lalit/Documents/")
> source("GetData.R")
> source("GetData.R")
> getData('["select * from orders"]')
[1] "{\"status\":1,\"appstatus\":-128,\"statusstring\":null,\"appstatusstring\":null,\"exception\":null,\"results\":[{\"status\":-128,\"schema\":[{\"name\":\"ORDER_ID\",\"type\":5},{\"name\":\"ITEM_ID\",\"type\":5},{\"name\":\"PRODUCT_ID\",\"type\":5},{\"name\":\"PRODUCT_CATEGORY\",\"type\":9},{\"name\":\"USER_ID\",\"type\":5},{\"name\":\"USER_COUNTRY\",\"type\":9},{\"name\":\"USER_CITY\",\"type\":9},{\"name\":\"USER_AGE\",\"type\":5}],\"data\":[[1,10,102,\"CE\",10,\"IN\",\"Mumbai\",25],[2,10,102,\"CE\",10,\"IN\",\"Mumbai\",25]]}]}"
attr(,"Content-Type")
                              charset 
"application/json"            "utf-8" 
     ORDER_ID ITEM_ID PRODUCT_ID PRODUCT_CATEGORY USER_ID USER_COUNTRY
[1,] "1"      "102"   "10"       "Mumbai"         "2"     "102"       
[2,] "10"     "CE"    "IN"       "25"             "10"    "CE"        
     USER_CITY USER_AGE
[1,] "10"      "Mumbai"
[2,] "IN"      "25"   

Also it is possible to create a complete web application in R using Shiny. Its a R package and can be installed as any R package. So one can write R scripts to continuously, analyse it and  send result to UI to update a live dashboard, tag cloud for analyzing trends etc. Shiny supports WebSocket and uses Twitter Bootstrap like modern library and features to create full web application with very few lines of code.

Try a complete web application in R to analyse realtime data and show analysis result live or wait for my next post for Complete Web Application In R : Shiny.

Please, mail me at sylentprayer@gmail.com or add comments regarding any issue or question regarding discussed technologies.


Popular Posts