
Saturday, September 28, 2013

HP Vertica, VoltDB and Talend together to manage Big Data

This blog is about using best-of-breed database technologies and integrating them to sync data in near real time. VoltDB specializes in OLTP, while Vertica is an analytic platform specialized for OLAP.
Here, I will describe the minimal installation (single node, with K-safety of 0) required to get both databases running. Setting up a cluster with K-safety needs additional host machines and configuration. Since this post does not cover cluster configuration, K-safety or other settings required for production use, it is aimed not at DBAs but at developers, programmers and ETL engineers who want to install the databases on a local/development machine to test applications or ETL jobs.
Further posts will cover writing UDFs (user-defined functions) and other topics from a database developer's perspective. Feel free to request anything related to this.
OLTP and OLAP in traditional DBMS and NewSQL DBMS (VoltDB and Vertica)

This post is about emerging NewSQL DBMSs and how this paradigm shift outperforms traditional RDBMSs. I will discuss the OLAP database Vertica Analytic Platform and the high-performance OLTP RDBMS VoltDB. Both originate from database research pioneer Michael Stonebraker (Vertica was acquired by HP). The idea behind these two DBMSs is that traditional RDBMSs tend to provide an all-in-one (general-purpose, OLTP+OLAP) solution and are not designed for large volumes of data (Big Data); they are either not scalable or very complex to scale and maintain. OLTP and OLAP have different needs, so these two DBMSs were redesigned and developed from the ground up, each purpose-built for one of those needs. Both scale out horizontally: simply add more machines to the cluster to handle more load, increase K-safety, and so on.
To get all the functionality (high-volume transactions plus analytics with complex queries), both DBMSs need to be integrated. For example, daily business activity needs an OLTP database to populate (insert/update) data, while analyzing that data (historical analysis) with a BI (Business Intelligence) solution needs an OLAP database. The two have different requirements: OLTP needs a write-optimized DBMS and OLAP a read-optimized one. Data modeling also differs: in OLTP the schema is normalized (for example 3NF), while in OLAP it is a snowflake or star schema, where fact tables are surrounded by dimension tables (dimensional modeling).
Traditional DBMSs are general-purpose and the same DBMS was used for both. NewSQL databases evolved as purpose-built systems, optimized for a specific use.

To demonstrate this we will consider the following use case:
An eCommerce website (popular and hence high volume) stores placed orders and also needs to analyze user trends, for example by geographic location.

We will use the following technologies:
VoltDB [Open Source Edition, 3.5.0.1] for OLTP database. Size ~19MB
Vertica [Community Edition, 6.1.2] for OLAP database. Size ~80MB
Talend DI for ETL (synchronizing from VoltDB to Vertica). Size ~600MB

We will use Fedora 19 64-bit (x86_64 architecture; check the Linux architecture with the 'uname -m' command) and decent hardware resources.

Step #1: Install and configure VoltDB, Vertica and Talend
Detailed installation steps and fine-tuned configuration can be found in the corresponding installation guides or getting-started documents.

Install VoltDB:
To download, go to voltdb.com and register; a download link will be sent by email. Download VoltDB (voltdb-3.5.0.1.tar.gz) and extract it; the extracted directory is referred to as VOLTDB_HOME in this tutorial. Add VOLTDB_HOME/bin to $PATH (edit the ~/.bashrc file), for example as shown below. Installation done!
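A ~/.bashrc entry could look like the following (the extraction path here is only an example; use the directory where you actually extracted the archive):

export VOLTDB_HOME=$HOME/Softwares/voltdb-3.5.0.1
export PATH=$PATH:$VOLTDB_HOME/bin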

This installation contains documentation, a web console (localhost:8080/studio) and a JSON-based REST API for accessing VoltDB; the API is documented at http://voltdb.com/docs/UsingVoltDB/ProgLangJson.php
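Once a database is running (see Step #2), the JSON API can be tried with a plain HTTP call. This is only a sketch based on the documentation linked above; @AdHoc is VoltDB's system procedure for ad hoc SQL, and the exact parameter encoding is described in the docs:

curl -G 'http://localhost:8080/api/1.0/' \
  --data-urlencode 'Procedure=@AdHoc' \
  --data-urlencode 'Parameters=["select count(*) from orders"]'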

Install Vertica:
Go to my.vertica.com (register and login) and download Vertica for Fedora (vertica-6.1.2-0.x86_64.FC.rpm).
Prepare Linux for installation:
    Disable SELinux by editing the SELinux configuration file:
    In a terminal, log in as the root user, execute the command vi /etc/sysconfig/selinux and set “SELINUX=disabled” in this file.
Edit /etc/pam.d/su (vi /etc/pam.d/su) and add “session required pam_limits.so” if it is not already in the file.
Now install Vertica with “rpm -ivh vertica-6.1.2-0.x86_64.FC.rpm” (change to the directory where the rpm was downloaded before executing the command). You will be asked for a few inputs, such as the install path and the dbadmin user.
After installation, a new Linux user “dbadmin” will be created.
Switch to the “dbadmin” user to set up and create the database, with the command “su - dbadmin” in a terminal.

Install Talend DI:
Download Talend Open Studio for Data Integration v5.3.1 from http://www.talend.com/download/data-integration and extract the archive. The extracted folder is referred to as TALEND_HOME.

Step #2: Setup Databases for both DBMS
Data Modeling:
As this tutorial is meant to demonstrate the technologies, we will not focus on modeling concepts; instead we will have a single table in both databases and sync its data. You can learn data modeling at http://www.learndatamodeling.com

Orders table OLTP [VoltDB]:
order_id <primary key>
item_id <item identifier of items stored in some warehouse table>
product_id <identifier of a product in some product table>
product_category <category of product>
user_id <customer related information>
user_country
user_city
user_age
In a real application, product and user related information would be stored in separate tables, with references to those tables. For simplicity there is a single table here.
product_category will be the partition column, because both the application and the business can be separated along it, and it produces evenly distributed data. With separate product and users tables, they could be partitioned on product_category and user_country respectively, as they are expected to hold a lot of data; if the expected data is small, they can be kept as replicated tables. See more on partitioning and replication in the “UsingVoltDB” document and at http://voltdb.com/resources/volt-university/tutorials/section-1-4/

Orders table OLAP [Vertica]:
order_id <primary key>
product_id <identifier of a product in some product table>
product_category <category of product>
user_id <customer related information>
user_country
user_city
user_age
date_created <populated from ETL>
In a real application, product and user would be separate dimension tables and orders would be the fact table. The partitioning strategy would be the same. The columns differ between the two databases' tables, because not all OLTP data is required for analytics.

Create Database VoltDB and Configure for synchronization:
VoltDB compiles DDL and Java stored procedures into a single jar, called a catalog, which is deployed to VoltDB. We need to bundle transactions (data access logic) in the form of stored procedures written in Java. We can execute ad hoc queries, but they do not take advantage of VoltDB's architecture and will not be performant. We also configure incremental export of data, to sync with Vertica using ETL.
There is a deployment.xml configuration file to configure and enable features of the VoltDB database.

We need the following to compile and generate the database catalog:
DDL to create tables and export tables, register the stored procedures written in Java, and partition the tables and procedures.

sample.sql
--create table for storing orders data
--add partition column as part of primary key to guarantee uniqueness across all
--partitions in database and no unique constraint violation while repartitioning
create table orders (
order_id integer not null,
item_id integer,
product_id integer,
product_category varchar(30) not null,
user_id integer,
user_country varchar(20),
user_city varchar(20),
user_age integer,
primary key(order_id, product_category)
);

--table for exporting selected columns from the orders table
--only insert is allowed for export tables, as data is queued and
--fetched by an export client. This feature is for incremental sync
--of data to an external system
create table orders_export (
order_id integer not null,
product_id integer,
product_category varchar(30) not null,
user_id integer,
user_country varchar(20),
user_city varchar(20),
user_age integer
);

--VoltDB does not support auto increment; to implement it we keep a table
--that stores max+1 as the next value of an identifier field and query it in a
--stored procedure (partitioned rather than replicated, see below)
create table auto_increment(
table_name varchar(50) not null,
next_value integer,
primary key(table_name)
);

--Mark orders_export table as export only
EXPORT TABLE orders_export;

--Partition orders table. No need to partition export tables, as no data is stored in them.
partition table orders on column product_category;

--This is a small table and suited as a replicated table, but we need to write to it
--while getting and incrementing the next value for a table, so partition it on the primary key
partition table auto_increment on column table_name;

--register stored procedures written in Java
CREATE PROCEDURE FROM CLASS SaveOrder;
CREATE PROCEDURE FROM CLASS AutoIncrement;

--Partition stored procedures on the same column as the table and provide the parameter index
--of the partition column in the arguments passed to the procedure. By default it is expected as the first
--argument; in our procedure it is the 4th argument (index 3).
PARTITION PROCEDURE SaveOrder ON TABLE orders COLUMN product_category PARAMETER 3;
PARTITION PROCEDURE AutoIncrement ON TABLE auto_increment COLUMN table_name;

Deployment configuration file.

deployment.xml
<?xml version="1.0"?>
<deployment>
<!--
Single host local deployment
-->
<cluster hostcount="1" sitesperhost="2"/>

<!--
Directories for storing snapshots, export overflow and other files generated by VoltDB
-->
<paths>
<exportoverflow path="/home/lalit/Softwares/VoltDB/sample/export"/>
<snapshots path="/home/lalit/Softwares/VoltDB/sample/snapshots"/>
<voltdbroot path="/home/lalit/Softwares/VoltDB/sample/root"/>
</paths>

<!--
Enable the web console and REST API to interact with VoltDB. Apart from the JDBC driver and Java client,
VoltDB can be accessed using the JSON-based REST API to execute queries
-->
<httpd enabled="true">
<jsonapi enabled="true" />
</httpd>

<!--
VoltDB is an in-memory database and provides durability by writing data to files at regular intervals.
Also, before shutdown the database should be paused and saved to ensure all data is written to disk,
and on startup it should be restored.
This configuration saves snapshots to the path configured in <paths> every 5 minutes and
keeps the 3 most recent snapshots.
Snapshots save all data in tables except tables marked as export only.
-->
<snapshot prefix="sample" frequency="5m" retain="3"/>

<!--
This configuration enables the export functionality and uses the export-to-file export client to
write exported data to files. Other export clients are available, such as a JDBC client to write data
directly to another database and a Hadoop client to export data to Hadoop. One can write a custom export
client as needed.

Export is for integrating VoltDB with other systems. To export data we create tables marked
as export only. All inserts to export-only tables go to a queue and the export client fetches from the queue,
hence incremental export. On queue overflow, data is written to disk at the location specified in
<paths>.

Enabling the skipinternals option removes metadata like transaction id, partition id and creation timestamp
from the export, so only the table data is exported.
-->
<export enabled="true">
<onserver exportto="file">
<configuration>
<property name="type">csv</property>
<property name="nonce">sample</property>
<property name="period">15</property>
<property name="outdir">/home/lalit/Softwares/VoltDB/sample/export</property>
<property name="skipinternals">true</property>
<property name="with-schema">true</property>
</configuration>
</onserver>
</export>
</deployment>


Java classes for the stored procedures.

SaveOrder.java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;


public class SaveOrder extends VoltProcedure {

private final SQLStmt insert = new SQLStmt("insert into orders values (?, ?, ?, ?, ?, ?, ?, ?)");

private final SQLStmt export = new SQLStmt("insert into orders_export values (?, ?, ?, ?, ?, ?, ?)");

/**
* VoltDB procedures are subclasses of {@link VoltProcedure} and run implicitly in a transaction.
* @param orderId
* @param itemId
* @param productId
* @param productCategory
* @param userId
* @param userCountry
* @param userCity
* @param userAge
* @return
* @throws VoltAbortException
*/
public long run(int orderId, int itemId, int productId, String productCategory, int userId, String userCountry,
String userCity, int userAge) throws VoltAbortException{
//insert data into orders table and export table.
voltQueueSQL(insert, orderId, itemId, productId, productCategory, userId, userCountry, userCity, userAge);
voltQueueSQL(export, orderId, productId, productCategory, userId, userCountry, userCity, userAge);
voltExecuteSQL();
//procedures must return long, Long, VoltTable or VoltTable[], so return a value
return orderId;
}
}

AutoIncrement.java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;
import org.voltdb.VoltType;


public class AutoIncrement extends VoltProcedure {

private final SQLStmt autoIncrementSelect = new SQLStmt("select next_value from auto_increment where table_name=?");

private final SQLStmt autoIncrementUpdate = new SQLStmt("update auto_increment set next_value = ? where table_name=?");

private final SQLStmt autoIncrementInsert = new SQLStmt("insert into auto_increment values (?, ?)");

public long run(String table){
// Get the next value for the given table, defaulting to 1 if none is stored yet
voltQueueSQL(autoIncrementSelect, table);
VoltTable[] result = voltExecuteSQL();
Integer nextValueOrders = 1;
if(result.length>0 && result[0].getRowCount()>0){
nextValueOrders = (Integer) result[0].fetchRow(0).get(0, VoltType.INTEGER);
}

//update the auto increment table for the given table name
if(nextValueOrders>1){
voltQueueSQL(autoIncrementUpdate, nextValueOrders+1, table);
}else{
voltQueueSQL(autoIncrementInsert, table, nextValueOrders+1);
}
}
voltExecuteSQL();
return nextValueOrders;
}
}

Create a Java project in Eclipse and add the jars from the VOLTDB_HOME/voltdb and VOLTDB_HOME/lib folders. Also create a client that uses these procedures to insert data, to test the application.
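If you prefer compiling outside Eclipse, something along these lines should work, assuming VOLTDB_HOME is set as in Step #1 (the wildcards pull in all jars shipped under VOLTDB_HOME/voltdb and VOLTDB_HOME/lib):

javac -classpath "$VOLTDB_HOME/voltdb/*:$VOLTDB_HOME/lib/*" SaveOrder.java AutoIncrement.java Application.java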

Application.java
import java.io.IOException;

import org.voltdb.VoltTable;
import org.voltdb.VoltTableRow;
import org.voltdb.VoltType;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;
import org.voltdb.client.NoConnectionsException;
import org.voltdb.client.ProcCallException;


public class Application {

public static void main(String[] args) throws Exception{
org.voltdb.client.Client client = ClientFactory.createClient();
client.createConnection("localhost");
//TODO modify AutoIncrement procedure to accept int arg to set next value, to avoid
//calling this get-and-increment every time in bulk load.
int orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 1, 101, "CE", 1, "IN", "Mumbai", 25);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 2, 101, "CE", 2, "IN", "Mumbai", 25);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 3, 101, "CE", 3, "US", "New York", 34);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 4, 107, "APP", 4, "IN", "Mumbai", 25);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 5, 101, "CE", 5, "GB", "London", 23);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 6, 101, "CE", 6, "IN", "Mumbai", 25);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 7, 101, "CE", 7, "IN", "Mumbai", 25);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 8, 103, "APP", 8, "IN", "Mumbai", 25);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 9, 101, "CE", 9, "IN", "Mumbai", 25);
orderId = getNextValueForTable(client);
client.callProcedure("SaveOrder", orderId, 10, 102, "CE", 10, "IN", "Mumbai", 25);
client.drain();
client.close();
}
private static int getNextValueForTable(Client client) throws NoConnectionsException, IOException, ProcCallException{
ClientResponse response = client.callProcedure("AutoIncrement", "orders");
if(response.getStatus()!=ClientResponse.SUCCESS){
System.out.println("Failed t retrive words");
System.exit(-1);
}
VoltTable[] results = response.getResults();
if(results.length>0){
VoltTable result = results[0];
if(result.getRowCount()>0){
VoltTableRow row = result.fetchRow(0);
return ((Integer)row.get(0, VoltType.INTEGER)).intValue();
}
}
return 1;
}
}

Inserts could also be done using sqlcmd (a command line tool) or the REST API, but this approach was chosen to show how to develop a VoltDB client application. A procedure call may return values.
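For comparison, the same insert could be issued from sqlcmd with its exec directive; a sketch with made-up values (the 8 arguments follow the SaveOrder signature above):

exec SaveOrder 11, 11, 101, 'CE', 1, 'IN', 'Mumbai', 25;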

Compile and create catalog:
Create a directory named “sample” to keep all database-related files, and create the sample.sql and deployment.xml files from the previous section there. Also copy the compiled SaveOrder.class and AutoIncrement.class files from the Eclipse project's bin directory.
Open terminal and move to sample directory.

Execute the command voltdb compile --classpath="./" -o sample.jar sample.sql to generate the deployable catalog for the database. Check the output for any errors; if it succeeds, move to the next step.

Create the root and export folders configured in the deployment.xml <paths> element. Now deploy the database with the command voltdb create host localhost catalog sample.jar deployment deployment.xml
In another terminal, use the sqlcmd command to test the database, or open localhost:8080/studio in a browser.

Now execute the Application.java main method (in Eclipse) to insert data. Verify the data was inserted, and after a few minutes check the exported data in the export folder.

Execute this query from the studio in the browser or from sqlcmd and check the incremental export.
insert into orders_export values(11, 11, 101, 'CE', 1, 'IN', 'Mumbai', 25);

Create Database in Vertica:
For data modeling we will use the same schema as the export table in VoltDB, with an additional column populated by the ETL.
DDL for create database:
example.sql
--Only one database runs at a time in a Vertica cluster.
--Rather than databases we create schemas, but these differ
--from Oracle schemas as they are not associated with a user
create schema example;

create table orders (
order_id integer not null primary key,
product_id integer,
product_category varchar(30) not null,
user_id integer,
user_country varchar(20),
user_city varchar(20),
user_age integer,
create_date date
);

This script creates a schema and one table in that schema. Vertica is a column-oriented database and achieves its performance by means of partitioning and projections. Partitioning, K-safety and cluster scalability are similar to VoltDB. Projections are collections of columns stored with a specific ordering or grouping: query-specific projections are tuned for particular queries, and pre-join projections are pre-computed inner joins of tables. For each table there is a super projection, which is the collection of all its columns, and users can create additional query-specific or pre-join projections. This is similar to a materialized view in other DBMSs.
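Projections can also be written by hand. A minimal sketch of a query-specific projection for the per-country report might look like this (the projection name and column choice are only illustrative; a newly created projection must still be refreshed before it serves queries, see the Vertica docs):

create projection example.orders_by_country (order_id, product_category, user_country)
as select order_id, product_category, user_country
from example.orders
order by user_country
segmented by hash(order_id) all nodes;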
Partitioning and projections let database designers plan the design early. Vertica provides the Database Designer tool to create and deploy projections. It requires the DDL schema, sample data in the tables and sample queries to propose projections. So let us load data into the table and create a sample queries file to run the Database Designer.
The sample data must be similar to the actual data; for example, it should not duplicate the same record everywhere or be completely random. If a column has a set of roughly 20 possible values, generate a random number between 0 and 20 for it, and insert, say, 100K records.

Below is a Java class to insert records using JDBC. Create a new class in the Java project and add the jars from /opt/vertica/java/lib.
VerticaInsert.java
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;


public class VerticaInsert {

public static void main(String[] args) throws Exception{
Class.forName("com.vertica.jdbc.Driver");
Connection connection = DriverManager.getConnection("jdbc:vertica://localhost:5433/orders", "dbadmin", "lalit"); //database name and password as created via admintools below
connection.createStatement().executeUpdate("set search_path to example");
PreparedStatement statement = connection.prepareStatement("insert into orders values(?, ?, ?, ?, ?, ?, ?, ?)");
Random random = new Random();
String[] prodCats = {"CE", "APP", "ACC", "ELEC", "PNS", "FRN"};
String[] country = {"IN", "US", "GB", "SW", "AUS", "KOR", "JAP"};
String[] city = {"Mumbai", "Delhi", "London", "Tokyo", "Melbourne", "New York", "Kuala Lumpur"};
Date date = new Date(System.currentTimeMillis());
for(int i=0;i<100000;i++){
if(i%10000==0)date.setTime(date.getTime()-86400000L); //step back one day for every 10K rows
statement.setInt(1, i);
statement.setInt(2, random.nextInt(100));
statement.setString(3, prodCats[random.nextInt(6)]);
statement.setInt(4, random.nextInt(1000));
statement.setString(5, country[random.nextInt(7)]);
statement.setString(6, city[random.nextInt(7)]);
statement.setInt(7, random.nextInt(50));
statement.setDate(8, date);
statement.execute();
}
statement.close();
connection.close();
}
}

example_queries.sql
select product_category, count(*) from orders group by product_category;
select user_country, count(*) from orders group by user_country;

Add all complex, frequently used analytic queries here.
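Purely as illustration, queries like the following could be added so the designer also optimizes for country/category breakdowns and monthly trends:

select user_country, product_category, count(*) as order_count
from orders
group by user_country, product_category
order by order_count desc;

select date_trunc('month', create_date) as order_month, count(*)
from orders
group by date_trunc('month', create_date);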

Create Database:
Switch to the dbadmin Linux user to create and start the Vertica database. Open a new terminal and execute the command su - dbadmin
Create a new folder named “example” and move to it. This folder should be writable by dbadmin, so create it in that user's home directory using the terminal.

Execute the /opt/vertica/bin/admintools command to launch the admin tools. Select “Configuration Menu” and click OK.

Select “Create Database”, click OK, provide the database name “orders” and click OK. On the next screen provide a password (optional) or skip, and click OK.

Then select the host (only one host will be visible). Keep the paths as they are or provide the newly created directory; Vertica will store its data files there.

Select Yes to create the database and OK on the success message. On the next screen, press Cancel to exit the admin tools, then type the “vsql” command in the terminal to interact with Vertica from the command line. Provide the password if one was set when creating the database.
Execute the command \i example.sql at the vsql prompt to create the schema and table.
Exit vsql by typing \q.

Now run VerticaInsert.java from Eclipse to insert sample data for the Database Designer. This will take several minutes; JDBC batch updates can be used to improve speed (see the sketch below). The Database Designer can then analyze the schema, the nature of the data and the queries to be executed, and generate optimal partitions and projections.
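A minimal sketch of a batched version of the insert loop, reusing the connection assumptions from VerticaInsert.java above (the batch size of 1000 is arbitrary):

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Random;

public class VerticaBatchInsert {

public static void main(String[] args) throws Exception{
Class.forName("com.vertica.jdbc.Driver");
//same connection assumptions as VerticaInsert.java: database "orders", user dbadmin
Connection connection = DriverManager.getConnection("jdbc:vertica://localhost:5433/orders", "dbadmin", "lalit");
connection.createStatement().executeUpdate("set search_path to example");
PreparedStatement statement = connection.prepareStatement("insert into orders values(?, ?, ?, ?, ?, ?, ?, ?)");
Random random = new Random();
String[] prodCats = {"CE", "APP", "ACC"};
String[] country = {"IN", "US", "GB"};
String[] city = {"Mumbai", "New York", "London"};
Date date = new Date(System.currentTimeMillis());
for(int i=0;i<100000;i++){
statement.setInt(1, i);
statement.setInt(2, random.nextInt(100));
statement.setString(3, prodCats[random.nextInt(prodCats.length)]);
statement.setInt(4, random.nextInt(1000));
statement.setString(5, country[random.nextInt(country.length)]);
statement.setString(6, city[random.nextInt(city.length)]);
statement.setInt(7, random.nextInt(50));
statement.setDate(8, date);
statement.addBatch(); //queue the row instead of one round trip per insert
if(i%1000==999){
statement.executeBatch(); //send 1000 rows per round trip
}
}
statement.executeBatch(); //flush any remaining rows
statement.close();
connection.close();
}
}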

Using Database Designer:
This step is not mandatory for the tutorial, but it is important for designing an optimized Vertica database for a real application.
Again open the admin tools, select Configuration Menu -> Run Database Designer, select the “orders” database and provide the password.
Provide a folder path to store the output of the Database Designer and provide the design name “orders_design”.
Select Design Type as Comprehensive.
Select the schema “example” (the schema created earlier) and select all design options in the next window.
Provide the path to the sample queries file example_queries.sql and the K-safety value (0 for this single-node setup; Vertica officially supports K-safety up to 2).
Select the designer's design priority as Balanced and click OK. Proceed through the messages and exit the admin tools.

Database design and setup are done. Now we can check the database using vsql. Delete the sample data from the table so it can hold the real data synced by the ETL. To start the database, open admin tools -> Start Database and follow the wizard; to shut it down, use admin tools -> Stop Database.

Step #3: Configure ETL using Talend DI
Talend provides an Eclipse-based IDE with projects similar to an Eclipse workspace. Generally one Talend project (workspace) is created per ETL solution for manageability.
Start Talend by double-clicking the TOS_BD-linux-gtk-x86_64 executable in TALEND_HOME. The first time, Talend will ask you to create a project. Create a project and it will appear in the project list as selected. Click the Open button on the right side to start the IDE. Close the welcome window to start creating the ETL.
Right-click on Job Design in the Repository section on the left side and fill in the wizard. Provide the name “VoltDBToVerticaSync” and click Finish. An empty canvas will appear with a palette on the right side.
Now we are ready to create the ETL job. The ETL flow has 3 steps:
  • Pick the files exported by VoltDB and process them.
  • For each record in a file, map it to the table in Vertica (direct mapping, no transformation) and add the create_date column in the mapping as the current date.
  • Write this to Vertica in bulk, as even with a 15-30 minute export interval there will be a lot of data.
Drag and drop the following components from the palette onto the canvas of the created job:
  • tFileList to iterate over the directory where the files exported from VoltDB are stored. This component supports configurations like sorting files by name or modified date, and file-name pattern filters.
  • tFileInputDelimited to process the file row by row. We need to define a schema for the file: column names, types, positions etc. It supports advanced CSV options.
  • tMap to map inputs to outputs. This component has one row input, multiple lookup inputs and many outputs. We can perform joins between the row input and lookup tables; in our case the exported CSV has identifiers like product_id, user_id etc., so product or user tables could be looked up for other columns by those identifiers. The row input is processed row by row, but a lookup is loaded all at once before the row input is processed.
  • tVerticaOutput to write the output of the mapping to the Vertica database.

After dropping all components onto the canvas, align them in a row and connect them. To connect two components, right-click on one and select Row -> [Iterate | Main | New Output] as shown in the figure below.



After connecting, there will be an error in the tVerticaOutput component: “Module vertica-jdk5-6.0.0-0.jar is required”. To fix this, go to the Modules tab, scroll down to locate tVerticaOutput and click on it. Then create a copy of the Vertica Java connector jar /opt/vertica/java/lib/vertica-jdk5-6.1.2-0.jar and rename it to vertica-jdk5-6.0.0-0.jar. Click on the jar icon and select this jar.

Now configure each component:
To configure a component, click on it to select it and go to the “Component” view.

tFileList



Enter the path of the directory where VoltDB exports files, and the file-name pattern. Also, sort the files in ascending order by modified date so they are processed in the sequence in which they were created.

tFileInputDelimited



Specify the file name by typing “curr” and pressing Ctrl+Space to get suggestions, then select the suggestion shown in the image. Set the other file configuration as shown.

Configure schema for input file:



Click the button next to “Edit Schema” to open this dialog and add schema rows by clicking the '+' button in the dialog. Create each row in the specified order (as in the file) and provide the name, type etc.
Click OK when done and 'Yes' when prompted to propagate the schema.
Note: Create “product_category” instead of “prod_category” for the third column.

tMap
Double-click on tMap to open the mapper.
Click on the yellow header on the left to select all rows, then drag them to the 'orders' box on the right side and drop to map them.
Create the entry highlighted in blue at the bottom right by clicking the '+' button.
Click OK to save and close.



tVerticaOutput



As export tables are insert-only, updates must also be inserted into the export tables to be exported. In this example the procedure does not update, but if some procedure did, it would have to insert the updated data into the export tables. The target system (in our case Vertica) must then implement the sync in an insert-or-update manner to merge/update data rather than producing duplicates.
There are trade-offs between exporting the complete row and exporting only the modified columns in the update scenario: if export is enabled only after some time, earlier rows will have no data to update against, and exporting the complete row has the overhead of fetching it, since update procedures only carry the columns being changed. One option is to sync all historical data before enabling the ongoing sync.

Sync history data to the external system:
Another way to sync data to an external system is to query the actual tables; to do this incrementally, rows would need a timestamp and each poll would query the data after the previously polled timestamp. But as such a timestamp may not be suitable for partitioning, the query would be multi-partition and less performant. So we will use this approach only for querying all data in a one-time history sync.
To query all the data we can run a select query on the tables using the generic JDBC component in Talend, as there is no dedicated component for VoltDB; alternatively, HTTP request or REST components could be used with the REST API provided by VoltDB. We will use the generic JDBC component.
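For reference, the connection settings the JDBC component needs can be tested with a small standalone program; this is a sketch assuming VoltDB's JDBC driver (shipped in the voltdbclient jar) is on the classpath and the default client port 21212:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VoltDbJdbcCheck {

public static void main(String[] args) throws Exception{
//VoltDB ships a JDBC driver in the client jar; 21212 is the default client port
Class.forName("org.voltdb.jdbc.Driver");
Connection connection = DriverManager.getConnection("jdbc:voltdb://localhost:21212");
Statement statement = connection.createStatement();
ResultSet rs = statement.executeQuery("select * from orders");
while(rs.next()){
System.out.println(rs.getInt("order_id") + " " + rs.getString("product_category"));
}
rs.close();
statement.close();
connection.close();
}
}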

Create a new job “VoltDBToVerticaHistorySync”, drag and drop the tJDBCInput component and configure it:



To add the driver jars, click the '+' button and, at the right end of the added row, click the button to browse.



Select 'External modules', browse to “VOLTDB_HOME/voltdb/voltdbclient-3.5.0.1.jar” and click OK. This jar contains the JDBC driver for VoltDB. Also add “guava-12.0.1.jar” in the same way, except that it is an internal module, so rather than browsing we select it from the list.
Add tMap and tVerticaOutput as in the previous job. Now run the job in Talend to sync the one-time history data. We run this job in Talend because it is a one-time process; the other job will be scheduled to run at a regular interval.

Run job in Talend:



Also, do this to test the previous job.

Schedule the VoltDBToVerticaSync job using crontab:
Export the job by right-clicking on it in the Repository section as shown below:



Define the export settings (the defaults are sufficient; just specify or note the export path) and click Finish.



The archive will be created in TALEND_HOME with the job's name. Extract it to any suitable location. There will be a VoltDBToVerticaSync_run.sh in the VoltDBToVerticaSync folder. This script executes the job; add an entry for it in crontab to schedule it.
Open a terminal, execute the command crontab -e and make an entry in the vi editor:
*/15 * * * * <path to VoltDBToVerticaSync_run.sh>
to run at minutes 0, 15, 30 and 45 of every hour, or
5,20,35,50 * * * * <path to VoltDBToVerticaSync_run.sh>
for a specific-minutes entry.

After that, save with the “:wq” command. Done, the job is scheduled!

Now, every time an order is saved using the SaveOrder procedure in VoltDB, it is also queued for export; the export client fetches it and writes it to a file every 15 minutes. The Talend job then processes the file and writes it to Vertica.


Summary
We built a demonstration of a complete database ecosystem for an application, in which business data is stored in an OLTP system and an OLAP system stores data for historical analysis, with an ETL to sync live data from the OLTP to the OLAP system.
For this we used best-in-breed, purpose-built NewSQL DBMS technologies.


That's it! Feel free to post any doubts or issues faced while working through this tutorial, or questions regarding the technologies used.

