GaianDB: A lightweight Dynamic Distributed Federated Database based on Derby
David Vyvyan 1100006961 Tags:  scalability gaiandb database distributed performance 3 Comments 5,111 Views
I have been working with the Gaian Database recently to demonstrate its scalability.
In the tests I grew a database cluster with over a thousand Gaian Database nodes and measured the time it took to query across these thousand nodes, and fetch over a million rows of data. I also tested the impact on speed of executing multiple queries at the same time.
I will include more detailed postings on each of these three cases, but the high level results are as follows:
Query Time – We are able to query all 1000 nodes in about 1/8 second. The results show that the query time grows logarithmically - in other words as you add more and more databases, the increase in query time slows down, providing excellent scaling. The way that a Gaian Network is grown from individual nodes automatically ensures this behaviour.
Fetch Time – We are able to fetch 1 million rows of data in under 5 seconds. The fetch time is proportional to the amount of data returned so that if you fetch twice the data it takes twice as long regardless of which of the 1000 nodes the data resides in. The Gaian Database actively pre-fetches the data from all the nodes to achieve this scalability
Concurrent Queries – I injected queries from up to 40 nodes at the same time, the Gaian Database showed that it could handle these queries robustly with a modest increase in the query time due to running out of available processor time on our test platform.
There have been a number of changes to the Gaian code to achieve these results, a new release will be delivered to Alphaworks soon.
Check out the following link for a visualization of 1250 Gaian Database nodes in a network:http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/gaian-db-1250-nodes[Read More]
David Vyvyan 1100006961 Tags:  performance database distributed scalability 3 Comments 4,367 Views
We have recently been focusing on improving performance and scalability...Take a look at these visualisation graphs for GaianDB networks of up to 520 nodes! - These were running on 13 blades, each having 4 logical CPUs
The next release of GaianDB (1.02) expected this month will contain many optimisations. In particular, we have looked at minimising CPU and memory consumption of GaianDB nodes. We have also improved the node discovery and connection algorithm to minimise network diameter and hence query times across the network from any node.
For reference, simple query times across a network of 52 nodes were initially taking 89ms *before* optimisations...[Read More]
GaianDB running easily on Raspberry PI !
Today 10:47 AM
Good news! the GaianDB can run comfortably on the Raspberry Pi. Its only dependency is a Java VM and the Raspberry Pi distribution has OpenJDK available.
Here’s a quick setup guide:
- Get a Raspberry Pi & SD Card (4GB minimum as the operating system takes about 1.5GB of space). The GaianDB will only take 20MB including the empty database. so even 2GB SD card should work.
- Download & install Raspbian “Wheezy” from http://www.raspberrypi.org/downloads
- Install OpenJDK,
sudo apt-get install openjdk-6-jre
sudo apt-get install openjdk-6-jdk
(depending on whether you want a JRE or a JDK).
- Get the GaianDB zip file from this web site (GaianDB V1.5 is fine, and GaianDB V2.0 is not free , hence send an email to Patrick Dantressangle to get a fee-based version)
- Create a directory called GaianDB and unzip that file into the directory.
- Run ./launchGaianServer.sh to launche the GaianDB server. if there is other GDB in the same network, it will automatically detect them and connect to them.
-eventually run the dashboard to start interacting with GainaDB : ./dashboard.sh
It’s that easy!
The GaianDB team.
There have been several requests for standardisation of the SQL used by GaianDB to query its logical tables..
There is in fact a very simple way of getting around the problem using Derby views, e.g.CREATE VIEW V_LT0 AS SELECT * FROM NEW com.ibm.db2j.GaianTable('LT0') TNow, the logical table 'LT0' can be queried through the view 'V_LT0', i.e. the following 2 statements are equivalent:
SELECT * FROM NEW com.ibm.db2j.GaianTable('LT0') T
SELECT * FROM V_LT0
.. The only drawback is that you need a separate view for every combination of arguments used in conjunction with the logical table, e.g, to have a result that includes all columns of the logical table and also columns describing where in the network the data came from:CREATE VIEW V_LT0_WITH_PROVENANCE AS SELECT * FROM NEW com.ibm.db2j.GaianTable('LT0', 'with_provenance') TThen:SELECT * FROM V_LT0_WITH_PROVENANCE WHERE GDB_NODE LIKE 'TEMPERATURE_SENSOR_HOST%'[Read More]
Here are some examples of different queries in the Gaian Database. I hope that they are useful to Gaian Users.
GaianTable vs GaianQuery
There can be confusion between the use of GaianTable and GaianQuery classes to get retrieve data.
Both of these “queries” are propagated to all nodes in the Gaian Network and can return data from multiple nodes. The differences are as follows:
- with a GaianTable statement you specify a Logical Table and you get all data from data sources associated with that logical table, from any nodes.
- with a GaianQuery statement with specify a SQL expression as a subquery. The subquery is executed on every node that the query is propagated to, any resultant data is returned.
Return results from a single Logical table:
Here "PEOPLE" is the name of a distributed logical table.
Return results from a single Logical table using a view alias:
For convenience, GaianDB automatically maintains some system managed views to query a GaianDB logical table.
As before, "PEOPLE" is the name of a distributed logical table
For a logical table PEOPLE, the managed views are:
Beware, there can be performance overheads of using the view alias.
Join multiple logical tables
Find information about people and their locations
Find pairs of people in the same locations:
BEWARE - If you nest GaianTable statements inside GaianQueries, it is easy to unintentionally re-propagate a query.
For example, the following is NOT RECOMMENDED, it will issue a global query to all nodes, each node will re-issue a global sub-query, so in a 100 node network, 10000 queries will be evaluated:
Evaluate a database function of each node:
Return the current_timestamp on each node in the network
This queries data sources federated by logical table LT on each node and only gets the data that was generated in the last 10 minutes, relative to local current timestamps
Determine the quantity of data at each location (efficiently):
Count the number of rows in the LT0 table at each node:
Compare this with the following, which is equivalent but less efficient as it retrieves all data back to the querying node before counting entries:
Efficient Aggregate Functions:
As with the count example above, we can calculate a sub-aggregate at each node, avoiding passing all data over the network:
Compare this with the following, which is equivalent but less efficient:
Pass through create/insert/update/delete/stored procedure calls to distributed nodes
Create a Derby table of each distributed node:
Insert data into distributed table on a specific node (T61p:6415 here):
Drop distributed tables:
Select data from non-federated tables:
Here table1 is a derby table, not federated as a logical table:
Data will only be joined with other data from that one node
Delegate Query to another node
Here the join between person and location tables is run on a specific node "<BESTNODEID>" which may be the optimal node
Stored Procedure examples
List all defined stored procedures and functions
Example information procedures:
Show logical table definitions in the whole network.
Show loaded rdb connections in whole network.
Show loaded data sources in whole network.
Show connections/topology traversed by a flood query - with no access to data sources in the network.
Show all configuration properties and their values.
Filtering or sorting the results of stored procedure calls
If you want to filter values from these, the calls need to be contained in a GaianQuery. e,.g.
Example configuration procedures
Executing a function against local node:
Returns the measured data throughput of the node using the gaian gdb_throughput function
Executing a function against all gaian nodes:
Executing a function against specific gaian nodes:
For French speakers out there, an article has just been published on GaianDB in one of the most famous IT magazines in France...http://www.01informatique.fr/infrastructures-stockage-serveurs-116/ibm-federe-sgbdr-46118/[Read More]
Gaian Database version 1.03 is now available on Alphaworks:
This version includes:
Resilience Improvements: New dynamic parameters to control connection maintenance heartbeat and max concurrent connection attempts
More configurable and usable discovery parameters
Discovery Gateways: Ability to discover nodes using unicast messages forwarded through designated gateways
New APIs listnodes(), listflood(), addgateway(). listltmatches()
Excel Spreadsheet data source wrapper (VTI)
New configuration parameters defining permitted hosts, denied hosts and cluster IDs to restrict connectivity between nodes
Oracle compatibility fixes
Better handling of unreachable jdbc data sources and missing logical table columns in them
New control file for Flat File data sources, e.g. to set record delimiters
Here's the download link:
Version 1.5 includes:
Dashboard with tabs for: connection establishment, topology view, metrics monitoring and query processing
New discovery capability, controlled using new parameters (discovery via gateways and cluster membership control)
Access control to allow/disallow SQL API configuration and propagated writes
Improved connectivity resilience to network variability in terms of latency and bandwidth
New data source options for more flexible column mappings (mapping by position and quotable columns)
Ability to propagate 'pass-through' CRUD operations (create/insert/update/delete/call) in GaianQuery sub-queries
API: new and updated procedures & functions, e.g. listnodes, addgateway
New configuration capability to limit the number of inbound connection threads
System management of log files such that their disk space usage is limited
Scripts re-work for portability across unix and Mac OS X machines
Better handling of expensive JOIN queries with caching spillover to disk to limit memory consumption
Added ability to specify a node name when starting GaianDB
Memory leak fixes (e.g. regulated connection attempts) and performance improvements (e.g. extended predicate pushdown)
Re-worked physical table and column mappings flexibility and extended it such that they can now hold composite SQL expressions
Here are tips on using managed views, with_provenance and propagating functions using GaianQuery() ... New release GaianDB 1.02 coming up v. shortly! ...
David Vyvyan 1100006961 2,142 Views
1. I have added a managed default view for all logical tables queried with maxDepth=0. Use the suffix: "_0" to reference them, e.g. to query all sources federated locally by LT0:
select * from LT0_0
So the list of all managed views for LT0 are:
Propagated logical table: LT0
Just local (not propagated): LT0_0
Propagated, with provenance: LT0_P
Explain query: LT0_X
Explain to file graph.dot LT0_XF
2. To push query processing to each node: e.g. to make all nodes query their local LT0 and return their own timestamps and provenance information:
select * from new com.ibm.db2j.GaianQuery('select CURRENT_TIMESTAMP TS, LOCATION, MISC from LT0_0', 'with_provenance') Q
This makes every node query its own local federated sources for LT0 columns LOCATION and MISC and compute its own local timestamp. The provenance columns are added to the results on every node.
3. One way of propagating a function without needing a logical table is using a dummy system physical table:
select * from new com.ibm.db2j.GaianQuery('select current_timestamp ts from sysibm.sysdummy1', 'with_provenance') Q
4. All columns returned by the nested query inside a GaianQuery() must be named columns. Columns that are the result of a function should be aliased.
e.g. this is invalid:
e.g. this is invalid:
select * from new com.ibm.db2j.GaianQuery('select count(*) from LT0_0', 'with_provenance') Q
... we have to alias the result of count(*), e.g:
select * from new com.ibm.db2j.GaianQuery('select count(*) c from LT0_0', 'with_provenance') Q
Further, if the nested query contains a distributed table or query, it may only reference the provenance, explain or constant columns if these are renamed within the nested query. This is because otherwise they would be seen as potential duplicates with the columns defined by the outer GaianQuery().
e.g. this is invalid:
select * from new com.ibm.db2j.GaianQuery('select CURRENT_TIMESTAMP TS, GDB_NODE from LT0_P') Q
but this is OK:
select * from new com.ibm.db2j.GaianQuery('select CURRENT_TIMESTAMP TS, GDB_NODE NESTED_PROVENANCE from LT0_P') Q
David Vyvyan 1100006961 675 Views
GaianDB is available here in source and binary format:
Note this contains the last build from IBM with an IBM license. Support is available for other versions - contact myself or other authors for this.
To build it yourself, you will need to find the jars listed in the dependencies file (lib-prereqs-list-to-download.txt).
Many of the jars are used for peripheral capability, so if you can't find them you can also stub out the code that uses them.