Wednesday, 25 February 2015

Analytics with BangDB




Introduction

One of the main goals of BangDB is to let users handle high volumes of data in an efficient and performant manner across various use cases. Features such as different table types, key types, multiple indexes, JSON support, and the ability to treat BangDB as a document database let users model data according to their requirements and give them flexibility in storing and retrieving data as needed. This means users can build their own custom applications on BangDB for data analysis.

The other approach is to provide fully baked native constructs: abstractions that can be used off the shelf to enable data analysis in different ways. These abstractions hide the complexity and expose simple APIs for storing and retrieving data for analysis. The built-in constructs thus free developers from worrying about data modelling, configuring db objects, processing the input, query methods, post-processing and so on, and let them enable the analysis simply by using an object of the appropriate type.


Native Constructs or Abstractions

BangDB 1.5 provides the following high-level constructs. The goal is to keep adding more abstractions and capabilities so that users find BangDB useful for many other analytical requirements.


1. Sliding Window 
2. Counting
3. TopK 



Sliding Window

In real-time analysis, we are interested in the most recent data and wish to analyse it accordingly. This differs from the typical hot/cold data concept, where older data could be hotter than recent data. Here we strictly want to work within a defined recent window.

BangDB provides Sliding Window as a type: the user defines the term 'recent' by providing a time range and then always works within that range as the window slides continuously.

To further ease development, BangDB also provides a sliding table concept: the user can simply create a table that always works on the recent data window as it slides continuously. Similar abstractions exist for counting and TopK.
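
To make the idea concrete, here is a minimal Python sketch of a time-based sliding window. It is not the BangDB API: the class name `SlidingWindow`, its methods `put` and `items`, and the explicit `now` argument are all hypothetical, chosen only to illustrate how entries older than the configured range drop out as time advances. It assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindow:
    """Illustrative only (not BangDB): keeps entries from the last `window_sec` seconds."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self._entries = deque()  # (timestamp, key, value), oldest first

    def _evict(self, now):
        # Drop everything that has slid out of the window.
        while self._entries and self._entries[0][0] <= now - self.window_sec:
            self._entries.popleft()

    def put(self, key, value, now):
        self._evict(now)
        self._entries.append((now, key, value))

    def items(self, now):
        self._evict(now)
        return [(k, v) for _, k, v in self._entries]


# 'recent' is defined as the last 60 seconds.
w = SlidingWindow(window_sec=60)
w.put("user1", "login", now=0)
w.put("user2", "search", now=30)
w.put("user1", "checkout", now=70)
recent = w.items(now=70)  # the entry written at t=0 is no longer visible
```

The caller never deletes anything explicitly; reads and writes simply see whatever is still inside the window.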

Counting

In almost all analytical work, counting is inevitable. Sometimes we need an exact total count, and sometimes an approximate count within an acceptable error margin is sufficient. In some cases we need a unique count, and in others a non-unique count. The count may run from the beginning or over a specified time window that keeps sliding. For such use cases, BangDB provides native constructs for counting.

Counting can be done in various ways using BangDB. For example, we can simply create an object of the Counting type and let it count uniquely forever. In some cases this is fine, but imagine a scenario where the user wants to count each entity uniquely: if the number of entities is large, the overhead of counting becomes very high. Let's say we have 100 M entities and we would like to count each of them. Even if we dedicate only 16 bytes to each entity, we need 1.6 GB of space (100 M x 16 bytes), and since we need to respond quickly we would like to keep as much of this in memory as possible. In such a scenario, if we are fine with not counting exactly and are ready to tolerate an error margin of, say, 0.05%, BangDB provides a construct that counts in the required fashion with only a few MB of overhead (less than 4-5 MB, as compared to a few GB). This is a probabilistic count based on the HyperLogLog concept.

All of this counting can also be done within a sliding window, and there are many configuration options for different use cases.
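
The following is a small, self-contained Python sketch of HyperLogLog, included only to show why the memory stays fixed regardless of how many distinct items are seen. It is not BangDB's implementation; the class `HyperLogLog`, the parameter `p`, and the use of SHA-1 as the hash are assumptions made for illustration. The standard error of HyperLogLog is roughly 1.04 / sqrt(m) for m registers, so a 0.05% target needs about (1.04 / 0.0005)^2, or roughly 4.3 million registers, which is in line with the few-MB figure above. The sketch uses a much smaller m = 2^14 (about 16 KB, roughly 0.8% standard error) to keep it quick to run.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative only (not BangDB): approximate distinct count in fixed memory."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = bytearray(self.m)   # one byte per register
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item):
        # 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64-p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        est = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            est = self.m * math.log(self.m / zeros)
        return int(round(est))


hll = HyperLogLog(p=14)
for i in range(100_000):
    hll.add(f"entity-{i}")
    hll.add(f"entity-{i}")   # duplicates do not change the estimate
estimate = hll.count()
```

With these settings the estimate should typically land within about one percent of 100,000, while the registers occupy 16,384 bytes no matter how many items are added.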

TopK

This is another important feature from an analytics perspective. TopK has been a topic of interest for many researchers and analysts and is therefore used in many places. BangDB provides a native construct for TopK.

TopK means keeping track of the top k items. These items could be anything, for example: the top 30 users with the most items in their cart, the top 20 products searched every 15 minutes, the top 10 queries run every hour, and so on. Using the BangDB TopK abstraction, the user can do TopK analysis using just the get and put APIs.
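
As a rough illustration of a put/get style TopK, here is a minimal Python sketch. The class `TopK` and the signatures of `put` and `get` are hypothetical and are not BangDB's API. It keeps an exact counter per item, which is the simplest approach; how BangDB tracks the top k internally is not described here.

```python
from collections import Counter


class TopK:
    """Illustrative only (not BangDB): exact top-k by accumulated score."""

    def __init__(self, k):
        self.k = k
        self._scores = Counter()

    def put(self, item, score=1):
        # Accumulate the score for this item.
        self._scores[item] += score

    def get(self):
        # Highest score first; ties are ordered by item name for a stable result.
        ranked = sorted(self._scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.k]


# Top 2 searched products from a small stream of search events.
top = TopK(k=2)
for product in ["phone", "laptop", "phone", "tv", "laptop", "phone", "camera"]:
    top.put(product)
top_products = top.get()
```

The application only calls `put` for each event and `get` when it needs the current ranking.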

TopK can again be done in an absolute manner or within a sliding window, with different settings.

These are available in BangDB as fully baked constructs and hence may be used directly. However, users can also enable other analytical capabilities using BangDB's different features. In the coming days, more such abstractions will be added for different analysis needs.

In the next blog post we will go into the details of these concepts and provide example code for each. To give a sense of their power: we can now build a Google Analytics kind of portal within an organisation, covering many more data points, in less than a few hours. We will demonstrate this in an upcoming blog post.