Big data in motion
Back in around 1988, I decided it was time for me to learn about object-oriented programming, design, etc.
Learning C++ was not too bad but when it came to define problems in an object oriented way, I started to panic: Had I reached the limit of what I was able to learn? Had I been passed by technology? Was I now a dinosaur? (and you thought the title referred to something else...)
It turned out ok... I think :-)
Today, the rate of change in technology has been accelerating and does not seem to be slowing down. To make things worst for database people, we've been told for years that database are commodities and they are just a persistent storage. I may expand on that later but let's just say for now that I totally disagree.
This being said, why start a blog? For one, I want to communicate with the Informix community in a more continuous manner and on subjects that may not require a one-hour powerpoint presentation. I want to discuss any technologies that is remotely related to databases and I want to start a continuous dialog with the Informix community on any subject of interest.
This blog will be in part educational (I hope) and also a place to discuss business problems and potential approaches to solutions. I believe that DBAs are experts in optimizing database access. It is time to expand DBAs impact in the enterprise to improve data processing. There is no need for a DBA to become a programmer. It is an issue of getting involve in the analysis and design of new applications. For programmers using Informix, let's start talking about what you are trying to accomplish. We may be able to find a better approach specially considering the IDS new features and database extensibility.
Please comment on my blog entries and send me your questions and let's start talking![Read More]
The first entry went in almost without problems. I had to quickly add some formatting after I posted it to make it more readable. Over time, I'll figure out how to do this and make my formatting fancier.
I'm new to blogging. I'm also pretty new at reading blogs. With the new Web 2.0 technologies, there is no need to go to a blog every so often to find out if there is something new. So, in case you don't know, here's some information on dealing with blogs.
Blogs support a capability called syndication. This allows to create what's called a feed to warn people of changes. There are two main types of feeds: RSS and Atom. No need to know more about it for now. Just that you can use a feed URL to get the changes in a syndicated site.
I would suggest that you use a feed reader. Why? because there are many sites you may want to subscribe to. For example:
- Guy Bowerman's blog (http://www-128.ibm.com/developerworks/blogs/page/gbowerman)
- Madison Pruet's blog (http://www.ibm.com/developerworks/blogs/page/roundrep)
- Feeds form the IIUG site (http://www.iiug.org/rss/index.php)
- Informix Zone site (http://www.informix-zone.com/)
- and many more. . .
Take a look at the latest IIUG newsletter (insider #94) for more in the Informix Resource section.
Having to visit each site regularly to see if it has new stuff can be time consuming. Using a feed readeraggregates all those and lets you know what's new. That's the way to go!.
If you do a search on the web, you can find multiple feed readers. I did not want to spend too much time figuring it out so I downloaded a Windows-based open-source product that will do until I find or am told about something better. Check out: http://www.feedreader.com/
If you haven't done it already, set yourself up and stay informed on the latest entries.
That's it for the introduction to this blog. Now it's time to dive into Informix and Computing![Read More]
When I was in school I wanted to know why I had to learn something: Why learn about history? It’s about a bunch of dead people, often from far away. I would also ask: Why would I ever learn English. . .
I feel that the computer industry does not only forget about history but is quick to discard what has been done before. Just remember when object databases came out, the trade magazines where trumpeting the death of relational databases.
There is a disconnect between the object-oriented (OO) approach and the use of relational databases. This will be the subject of the next few entries. Lets start with an example:
An object person will look at the employees of a company and see managers, full-time employees, part-time employees and contractors. This will lead to the following model:
With the definition of the multiple types of employees, we can easily see that they will want multiple tables, one per defined object. Of course, for a database person, we see something like:
CREATE TABLE employee (
Empno int PRIMARY KEY,
mgrNo int ,
. . .
As you can see, we can already see that a "data access expert" can start some discussions with the OO architects and programmers.Don’t get me wrong. I like OO. I think it is a wonderful approach but just like anything it can be abused. See what you think of:http://csis.pace.edu/~bergin/patterns/ppoop.html[Read More]
JacquesRoy 120000A2MS 1,805 Views
I saw some interesting comments related to my blog entries. Hope you are reading them... the main subject is object to relational mapping (OMR). I'll get back to that soon. For now, I want to continue what I was talking about
Some people are really passionate about following a strict approach. This can cause problems with such things as encapsulation that insures that the implementation of the object is opaque. Look at it a little bit as being very strict about following the highest possible normal form. My point is that you have to be careful about not offending people in their approach. Learn about their methodology before jumping into a passionate presentation of your approach: Take them from where they are to where you want them to be slowly, watching for resistance where communication could break down.
Looking back at the employee definition presented in my previous blog entry, note the following: A manager can have multiple employees working for her. This leads to a representation where a manager object includes a collection of employee objects. This also leads to implementation performance problems where because all the objects were instantiated (created) it took a long time to create the object that included the collection of object. The concept of "lazy binding" was implemented to solve this. Basically, the object in a collection is not instantiated until it is accessed.
This is another area where database specialists can start a discussion to improve the overall performance. Now that I've set the premise, I'll cover it in more details next time.[Read More]
I was talking earlier about encapsulation and the collection of objects that can be found in another object. Let's look at another possibility:
A corporation has multiple regions, a region has multiple branches, a branch has multiple customers. To summarize:
Let's say that the customers are loans taken by different types of companies. To find out the average amount of the loans given out by each branch, the strict approach would be that each branch has a method (function) that does the following:customer_count = 0
total_loans = 0
for each customer
customer_count = customer_count + 1
total_loans = total_loans + customer.getLoanAmount()
end // for each customer
return(total_loans / customer_count)
We protect the encapsulation of customers by providing a method that returns the loan amount (getLoanAmount). The first problem we have relates to performance: All the customer objects for a branch need to be instantiated (created). That may require quite a bit of memory. The second performance problem is that each customer object instantiation requires one database call.
What about if we want to do this average at the region level instead of the branch level? Then, to preserve the encapsulation, we need to created additional methods to return totals and counts. I'll let you imagine the processing needed. On the performance side, we see that the number of objects instantiated and the number of database calls increase with the number of branches and customer objects processed.
If you can convince the architects and programmers to relax their encapsulation requirements, you could add one method at the branch level, one at the region level, and even possibly one at the corporation level to return the desired average. Considering the average for a region, the method would implement the one SQL statement looking like:SELECT AVG(loan) FROM customers
WHERE region_id = :region_num
GROUP BY region_id;
In this case, I don't instantiate all the customer (and branch) objects, saving processing and memory. It is pretty obvious that the performance of these requests will be greatly improved compared to the "strict" OO approach.
Having a method that uses the database to do the processing is one thing. What about more complex processing like the average risk taken by a branch on their loans?
IDS provides the ability to implement user-defined aggregates. It would be easy to implement the average risk function. The number of lines of code would be less than implementing it in the application and the performance would be better even if it was only because of the significant reduction in the volume of data transferred.
I hope that in the last few blog entries I gave you some things to think about to improve the overall performance of your systems. The bottom line is: get involved in the analysis and design phases of new projects. You can add a lot of value there.[Read More]
JacquesRoy 120000A2MS 1,256 Views
Since I mentioned user-defined aggregates (UDAs), in the previous entry, it as good a time as ever to cover this subject.
Everybody that knows a little bit about (object) relational databases knows about aggregate functions such as AVG, SUM, and so on. User-defined aggregates allow you to create your own functions to aggregate your data. The thing is it can do more.
I had a situation once where people had a complex stored procedure that included two foreach loop that were executing SELECT statements. The goal of the procedure was to take a bunch of geo shapes and merge them into a multi-polygon. This procedure had multiple issues:
1) It was issuing multiple SELECT statements making it less than optimal2) It solved one and only one problem due to the specificity of the SELECT statements. (with IDS 11.50, we could build the SQL statements dynamically in the stored procedure)3) It was complex (82 lines long)
I was able to convert the procedure into a 22 lines UDA (including 2 lines of comments and 6 blank lines). The UDA could then be inserted into an SQL statement that would decide how to group rows. The advantages of the UDA are:
1) The UDA is simpler (22 lines without loops or SQL statements vs. 82 lines)2) The UDA is more flexible since it gets the grouping from the SQL statement3) Better performance, more streamlined
UDAs include an initialization function and parameter. Since the initialization parameter can be of any type, we can pass a row type that can include multiple values providing the flexibility required for any initialization. Any business processing that takes a large number of values to get to an answer can be converted from a client application to a user-defined aggregate. The benefits include:
1) reduced network traffic, translating into increased performance2) Simpler code since all the code related to issuing SQL statements, cursor manipulations and error testing can be eliminated3) That specific business processing becomes available to all applications
Some people are really passionate about following a strict approach. This can cause problems with such things as encapsulation that insures that the implementation of the object is opaque. Look at it a little bit as being very strict about following the highest possible normal form. My point is that you have to be careful about not offending people in their approach. Learn about their methodology before jumping into a passionate presentation of your approach: Take them from where they are to where you want them to be slowly, watching for resistance where comminucation could break down.
Looking back at the employee definition presented in my previous blog entry, note the following: A manager can have multiple employees working for her. This lead to a representation where a manager object includes a collection of employee objects. This lead to implementation performance problems where because all the objects were instantiated (created) it took a long time to create the object that included the collection of object. The concept of "lazy binding" was implemented to solve this. Basically, the object in a collection is not instantiated until it is accessed.
This is another area where database specialists can start a discussion to improve the overall performance. Now that I've set the premise, I'll cover it in more details another time.[Read More]
JacquesRoy 120000A2MS 1,542 Views
I was reading the hibernate documentation the other day. I came across the following statements in chapter 11:
". . .you most likely should consider a Stored Procedure if you need mass data operations"
That makes the case for my previous blog entries. In our case, we should also consider user-defined functions and user-defined aggregates. Let's talk about stored procedure for a bit.
Over the years, I've seen sites where they had basically multiple copied of the same stored procedure. the only difference between the different copied was a variation on a specific SQL statement. With IDS 11.50, we now support dynamic SQL in stored procedures. This is a good time to take a look at these stored procedures and see if you could eliminate some redundancy. One benefit is that if you have to change the processing in the future, you'll have to modify only one procedure as opposed to multiple ones.
Of course, I would also look at the possibility of using user-defined functions or user-defined aggregates as illustrated in another blog entry. Let me know what you find out.[Read More]
JacquesRoy 120000A2MS 1,864 Views
If you're leaning against a wall, you have two choices to increase the distance between you and the wall. Either you push the wall away from you or you push yourself away from the wall.
Not long ago, I heard someone say that to increase performance, we should move the data to where the processing is. I did not want to get into an argument but I could have asked: What about moving the processing to where the data is?
This could be done looking back at user-defined aggregates as discussed in the previous blog entry.
So, which is the wall? Of course, the answer is: It depends. The key is to keep an open mind about where the processing should be. there could be many cases where putting the processing in the database is the right answer. Using user-defined aggregates could be the technology to use.[Read More]
JacquesRoy 120000A2MS 1,853 Views
I was working on some existing PHP code that accessed a database. I ran into a problem: When using the PDO->fetchObject() method, IDS was returning the column names in uppercase. Since that translated into attribute names in the returned object that was a problem because the application expected these attribute names to be in lower case. Being new to PDO, I came up with the following clever code (sorry about the formatting, I can't figure out how to indent):
$row = $conn->fetchObject();
I was pretty proud of myself for finding about the stdClass and how to add attributes to the class dynamically.
It turns out that there is a much easier way to do it. I only had to set an attribute on the connection:
I'm sure one day I'll be glad I learned about stdClass. For now, I was reminded about reading the relevant section of the documentation...[Read More]
One of the first applications I ever used (after lunar landing) was called Eliza. It was at a time where terminals wrote on paper and 110 baud transmission rate was state of the art. The program would start with the following statement:
I am Eliza. Tell me about your problem.
If you proceeded with a coherent conversation, you'd think you were talking to a real person.
I read, much later, that some people, after starting the dialog with Eliza, asked for privacy to continue their dialog.
Eliza was designed in 1966 (I did a google search on it). The IBM S/360 came out two years before. All that to say that there are a lot of amazing things that people have done with computers since the first computer was invented. Many people have a tendency to discard what has been done in the past in favor of the latest technology.
From time to time, I'll suggest some reading on technology in general. The first on I want to suggest is about programming. It first came out in 1986. The second edition came out in 2000:
Programming Pearls, second Edition
Hope you’ll enjoy it.[Read More]
JacquesRoy 120000A2MS 1,930 Views
Wikis are part of the technologies included in Web 2.0. They can be very useful for sharing information within an organization. Here at IBM wikis are becoming very popular.
MediaWiki is the PHP-based wiki engine that is used by wikipedia. In fact, it was originally written for Wikipedia. It is available as open-source from http://www.mediawiki.org. The current release is version 1.12.0.
It seems to me that IDS is particularly well suited to support a wiki. It is a high-performance database engine that is scalable and required little administration. Also, with the clustering capabilities of IDS 11, It provides a very scalable platform for a wiki.
Why mention all this? Because I created modules to support IDS as a database engine under MediaWiki. It requires IDS 11 since it also uses the Basic Text Search (BTS) DataBlade to provide searching capabilities. I still consider this "beta" code and it will require additional modifications but it works.
Any interest? Contact me at firstname.lastname@example.org[Read More]
JacquesRoy 120000A2MS 1,309 Views
With the work on Mediawiki I've done recently, I was wondering about the characteristics of web-based applications such as Mediawiki.
What I see there is a lot of reads and few writes. Looking at other types of web applications, we can see the same thing. Think about reservation systems. Someone goes on the web and does a search on flights and hotels. What comes back is a list of possible flights and a list of possible hotels. Out of the list of flights and hotels, maybe one flight and one hotel will be selected. I say maybe because many people may only be browsing. If a selection of possible 10 hotels is returned, that means that there is a 10 to 1 ratio of reads vs. writes. The ratio is most likely higher.
Web-based applications appear to be the perfect candidates to take advantage of the IDS continuous availability features. For a fast growing demand, in minutes, you can add additional machines to share the same database. The benefits are quick access to scalability, quick recovery in the case of a node failure, and disk savings compare to on-site replication.
If you have a successful web based application, this is something to look into.
JacquesRoy 120000A2MS 1,253 Views
IDS on the web
A question poped into my head: where do we find mention of IDS on the web, outside of the expected IBM, IIUG, informix-zone and other sites?
Here's a sample of what I found:
There is a lot more..Google says 197,000. Other entries could be very useful. I'll bring up more next time.
JacquesRoy 120000A2MS 1,561 Views
Continuing from the last entry...
Out of these 197,000 entries, here are some more that may be of interest:
I'm sure we can find a lot more of interesting sites. There is a lot about IDS on the web[Read More]
JacquesRoy 120000A2MS 1,601 Views
I was reading recently that scripting languages are becoming more and more popular. One of the benefits listed was that it can increase productivity significantly. That brings me back a long time when I used to argue that you could write a solution in APL so much faster than any of the traditional languages...
PHP is a popular scripting language that is used to create web solutions. Here are some information if you want or need to use it with IDS.
To use IDS with PHP, you need to add the PDO_INFORMIX interface to it. tis interface can be found in PECL:http://pecl.php.net/package/PDO_INFORMIX
The current version is 1.2.0 and came out in March 2008. You will to use the appropriate Informix CSDK to compile it for your platform.
Note that there is another PHP driver available for IDS 11: PDO_IBM. It is available from IBM in the IBM Data Server Client as a pre-built component. See:https://www14.software.ibm.com/webapp/iwm/web/reg/pick.do?lang=en_US&source=swg-idsc11
If you want to get going quickly, you can find Apache-PHP-PDO_INFORMIX ready to go with the Open Admin Tool for IDS (OAT). This product is available for IDS 11. You can find it at:https://www14.software.ibm.com/webapp/iwm/web/reg/download.do?lang=en_US&cp=UTF-8&S_PKG=dl&source=swg-informixfpd
You can also use the package from Zend: Zend Core for IBM at:http://www.zend.com/en/products/core/for-ibm
If you want to build the environment yourself, here are two articles that can help you figure out the process:
With this information, you should be able to get started quickly[Read More]