Was 22 billion rows Big Data? As I began the new data warehouse project, I knew that the data warehouse performance needed to be designed from the very first data element. The new project was very interesting with complex calculations, cross referencing reports and Big Data implications and big data warehouse performance challenges. Many standard reports that affected whole industries and government policies were depending on data warehouse performance for timely reports and 100% not even 99.999% accuracy of every bit of the Big Data.
Another consideration was the implementation deadline for this high performance Big Data data warehouse system was only 120 days away and wasn’t movable. All of these reasons made this Big Data system quite the challenge. Our small team of business users, SQL analysts and technical staff were committed and we were successful in completing the implementation using agile scrum development techniques. The data warehouse performance was so improved that some of the processes went from 37 hours to under 7 minutes and some reports come back in seconds. Complex calculations and report verification done during our data quality testing found bugs in the old system’s existing reports that had been trusted for over 20 years, so we knew everything was 100%. Our overall performance, 100% accuracy and overall quality implementation were a testament to the commitment of a winning team effort.
There were many success factors that drove the team, the design, development and successful implementation of this 22 billion row Big Data system. Three success factors drove the data warehouse performance.
First: Keep it simple. Yes I know everyone talks about keeping it simple in every development and system design. Designing it and doing it simply can be much harder. Instead of endless discussions and meetings, design the big picture and then see how all the reports, every type of interface and users interact with the warehouse. Data warehouse performance is achieved through computing power. Simple designs that leverage parallelism, that drive I/O and calculation computing power for quickly producing reports are key.
Second: Leverage all existing infrastructure and processes. The tight 120-day schedule from start to finish did not allow the team to have issues, only time delayed solutions. Since we used an agile scrum type of development methodology the agile scrum stories and the data warehouse performance of the Big Data reports were tested several times during the short development period. All used the same standard corporate testing procedures, documentation and approval processes that had been used for years. The team contacted everyone that needed to guarantee the data warehouse performance, got them involved early and got their requirements, feedback, improvements and eventual approval for all the components and implementation. Everyone knew the Big Data project schedule ahead of time so all infrastructure and processes were ready. Issues were worked on until solutions were approved and implemented.
Third: Don’t be intimidated by Big Data. I have been fortunate to design, develop, deploy and study data warehouse performance since back in 1988 when I implemented my first “Big Data” system for the financial firm E.F. Hutton. At least back then it was considered Big Data and it seems that all the other data warehouse systems I have dealt with over the many years have had bigger data then the previous one. Big Data will keep getting bigger and it is just another couple of zeros on the end of your table population and design considerations. Data warehouse performance is built on the same principles as many years ago. As I mentioned with the first success factor, design for both I/O and CPU parallelism in every single item of your system. Only then can your Big Data system have the data warehouse performance it needs to speed up your processes and reports from hours to just minutes or even seconds making everyone, especially your users, happy