It's official! My "blook" Inside System Storage - Volume I
is now available.
|This blog-based book, or “blook”, comprises the first twelve months of posts from this Inside System Storage blog,165 posts in all, from September 1, 2006 to August 31, 2007. Foreword by Jennifer Jones. 404 pages.|
- IT storage and storage networking concepts
- IBM strategy, hardware, software and services
- Disk systems, Tape systems, and storage networking
- Storage and infrastructure management software
- Second Life, Facebook, and other Web 2.0 platforms
- IBM’s many alliances, partners and competitors
- How IT storage impacts society and industry
You can choose between hardcover (with dust jacket) or paperback versions:
This is not the first time I've been published. I have authored articles for storage industry magazines, written large sections of IBM publications and manuals, submitted presentations and whitepapers to conference proceedings, and even had a short story published with illustrations by the famous cartoon writer[Ted Rall].
But I can say this is my first blook, and as far as I can tell, the first blook from IBM's many bloggers on DeveloperWorks, and the first blook about the IT storage industry.I got the idea when I saw [Lulu Publishing] run a "blook" contest. The Lulu Blooker Prize is the world's first literary prize devoted to "blooks"--books based on blogs or other websites, including webcomics. The [Lulu Blooker Blog] lists past year winners. Lulu is one of the new innovative "print-on-demand" publishers. Rather than printing hundredsor thousands of books in advance, as other publishers require, Lulu doesn't print them until you order them.
I considered cute titles like A Year of Living Dangerously, orAn Engineer in Marketing La-La land, or Around the World in 165 Posts, but settled on a title that matched closely the name of the blog.
In addition to my blog posts, I provide additional insights and behind-the-scenes commentary. If you go to the Luluwebsite above, you can preview an entire chapter in its entirety before purchase. I have added a hefty 56-page Glossary of Acronyms and Terms (GOAT) with over 900 storage-related terms defined, which also doubles as an index back to the post (or posts) that use or further explain each term.
So who might be interested in this blook?
- Business Partners and Sales Reps looking to give a nice gift to their best clients and colleagues
- Managers looking to reward early-tenure employees and retain the best talent
- IT specialists and technicians wanting a marketing perspective of the storage industry
- Mentors interested in providing motivation and encouragement to their proteges
- Educators looking to provide books for their classroom or library collection
- Authors looking to write a blook themselves, to see how to format and structure a finished product
- Marketing personnel that want to better understand Web 2.0, Second Life and social networking
- Analysts and journalists looking to understand how storage impacts the IT industry, and society overall
- College graduates and others interested in a career as a storage administrator
And yes, according to Lulu, if you order soon, you can have it by December 25.
technorati tags: IBM, blook, Volume I, Jennifer Jones, system, storage, strategy, hardware, software, services, disk, tape, networking, SAN, secondlife, Web2.0, facebook, Lulu, publishing, Blooker Prize, articles, magazines, proceedings, Ted Rall, insights, glossary, early-tenure, mentors, library, classroom, administrator, print, publish, on demand
Continuing my rant from Monday's post [Time for a New Laptop], I got my new laptop Wednesday afternoon. I was hoping the transition would be quick, but that was not the case. Here were my initial steps prior to connecting my two laptops together for the big file transfer:
- Document what my old workstation has
Back in 2007, I wrote a blog post on how to [Separate Programs from Data]. I have since added a Linux partition for dual-boot on my ThinkPad T60.
|/dev/sda1||26GB||NTFS||C:||Windows XP SP3 operating system and programs|
|/dev/sda2||12GB||ext3||/(root)||Red Hat Enterprise Linux 5.4|
|/dev/sda6||80GB||NTFS||D:||My Documents and other data|
I also created a spreadsheet of all my tools, utilities and applications. I combined and deduplicated the list from the following sources:
- Control Panel -> Add/Remove programs
- C:\Program Files
- Start -> Programs panels
- Program taskbar at bottom of screen
The last one was critical. Over the years, I have gotten in the habit of saving those ZIP or EXE files that self-install programs into a separate directory, D:/Install-Files, so that if I had to unintsall an application, due to conflicts or compatability issues, I could re-install it without having to download them again.
So, I have a total of 134 applications, which I have put into the following rough categories:
- AV - editing and manipulating audio, video or graphics
- Files - backup, copy or manipulate disks, files and file systems
- Browser - Internet Explorer, Firefox, Opera and Google Chrome
- Communications - Lotus Notes and Lotus Sametime
- Connect - programs to connect to different Web and Wi-Fi services
- Demo - programs I demonstrate to clients at briefings
- Drivers - attach or sync to external devices, cell phones, PDAs
- Games - not much here, the basic solitaire, mindsweeper and pinball
- Help Desk - programs to diagnose, test and gather system information
- Projects - special projects like Second Life or Lego Mindstorms
- Lookup - programs to lookup information, like American Airlines TravelDesk
- Meeting - I have FIVE different webinar conferencing tools
- Office - presentations, spreadsheets and documents
- Platform - Java, Adobe Air and other application runtime environments
- Player - do I really need SIXTEEN different audio/video players?
- Printer - print drivers and printer management software
- Scanners - programs that scan for viruses, malware and adware
- Tools - calculators, configurators, sizing tools, and estimators
- Uploaders - programs to upload photos or files to various Web services
- Backup my new workstation
My new ThinkPad T410 has a dual-core i5 64-bit Intel processor, so I burned a 64-bit version of [Clonezilla LiveCD] and booted the new system with that. The new system has the following configuration:
|/dev/sda1||320GB||NTFS||C:||Windows XP SP3 operating system, programs and data|
There were only 14.4GB of data, it took 10 minutes to backup to an external USB disk. I ran it twice: first, using the option to dump the entire disk, and the second to dump the selected partition. The results were roughly the same.
- Run Workstation Setup Wizard
The Workstation Setup Wizard asks for all the pertinent location information, time zone, userid/password, needed to complete the installation.
- Re-Partition Disk Drive
I burned a 64-bit version of [System Rescue CD] and ran [Gparted] to re-partition this disk into the following:
|/dev/sda1||40GB||NTFS||C:||Windows XP SP3 operating system and programs|
|/dev/sda2||15GB||ext3||/(root)||Ubuntu Desktop 10.04 LTS|
|/dev/sda6||245GB||NTFS||D:||My Documents and other data|
- Redefine Windows directory structure
I made two small changes to connect C: to D: drive.
- Changed "My Documents" to point to D:\Documents which will move the files over from C: to D: to accomodate its new target location. See [Microsoft procedure] for details.
- Edited C:\notes\notes.ini to point to D:\notes\data to store all the local replicas of my email and databases.
- Install Ubuntu Desktop 10.04 LTS
My plan is to run Windows and Linux guests through virtualization. I decided to try out Ubuntu Desktop 10.04 LTS, affectionately known as Lucid Lynx, which can support a variety of different virtualization tools, including KVM, VirtualBox-OSE and Xen. I have two identical 15GB partitions (sda2 and sda3) that I can use to hold two different systems, or one can be a subdirectory of the other. For now, I'll leave sda3 empty.
- Take another backup of my new workstation
I took a fresh new backup of paritions (sda1, sda2, sda6) with Clonezilla.
The next step involved a cross-over Ethernet cable, which I don't have. So that will have to wait until Thursday morning.
technorati tags: IBM, Lenovo, ThinkPad, T60, T410, Intel, Clonezilla, SysRescCD, Gparted, Windows, Ubuntu, Linux, Lucid, LTS
Wrapping up my week's theme of storage optimization, I thought I would help clarify the confusion between data reduction and storage efficiency. I have seen many articles and blog posts that either use these two terms interchangeably, as if they were synonyms for each other, or as if one is merely a subset of the other.
- Data Reduction is LOSSY
By "Lossy", I mean that reducing data is an irreversible process. Details are lost, but insight is gained. In his paper, [Data Reduction Techniques", Rajana Agarwal defines this simply:
"Data reduction techniques are applied where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets."
Data reduction has been around since the 18th century.
Take for example this histogram from [SearchSoftwareQuality.com]. We have reduced ninety individual student scores, and reduced them down to just five numbers, the counts in each range. This can provide for easier comprehension and comparison with other distributions.
The process is lossy. I cannot determine or re-create an individual student's score from these five histogram values.
This next example, complements of [Michael Hardy], represents another form of data reduction known as ["linear regression analysis"]. The idea is to take a large set of data points between two variables, the x axis along the horizontal and the y axis along the vertical, and find the best line that fits. Thus the data is reduced from many points to just two, slope(a) and intercept(b), resulting in an equation of y=ax+b.
The process is lossy. I cannot determine or re-create any original data point from this slope and intercept equation.
In this last example, from [Yahoo Finance], reduces millions of stock trades to a single point per day, typically closing price, to show the overall growth trend over the course of the past year.
The process is lossy. Even if I knew the low, high and closing price of a particular stock on a particular day, I would not be able to determine or re-create the actual price paid for individual trades that occurred.
- Storage Efficiency is LOSSLESS
By contrast, there are many IT methods that can be used to store data in ways that are more efficient, without losing any of the fine detail. Here are some examples:
- Thin Provisioning: Instead of storing 30GB of data on 100GB of disk capacity, you store it on 30GB of capacity. All of the data is still there, just none of the wasteful empty space.
- Space-efficient Copy: Instead of copying every block of data from source to destination, you copy over only those blocks that have changed since the copy began. The blocks not copied are still available on the source volume, so there is no need to duplicate this data.
- Archiving and Space Management: Data can be moved out of production databases and stored elsewhere on disk or tape. Enough XML metadata is carried along so that there is no loss in the fine detail of what each row and column represent.
- Data Deduplication: The idea is simple. Find large chunks of data that contain the same exact information as an existing chunk already stored, and merely set a pointer to avoid storing the duplicate copy. This can be done in-line as data is written, or as a post-process task when things are otherwise slow and idle.
When data deduplication first came out, some lawyers were concerned that this was a "lossy" approach, that somehow documents were coming back without some of their original contents. How else can you explain storing 25PB of data on only 1PB of disk?
(In some countries, companies must retain data in their original file formats, as there is concern that converting business documents to PDF or HTML would lose some critical "metadata" information such as modificatoin dates, authorship information, underlying formulae, and so on.)
Well, the concern applies only to those data deduplication methods that calculate a hash code or fingerprint, such as EMC Centera or EMC Data Domain. If the hash code of new incoming data matches the hash code of existing data, then the new data is discarded and assumed to be identical. This is rare, and I have only read of a few occurrences of unique data being discarded in the past five years. To ensure full integrity, IBM ProtecTIER data deduplication solution and IBM N series disk systems chose instead to do full byte-for-byte comparisons.
- Compression: There are both lossy and lossless compression techniques. The lossless Lempel-Ziv algorithm is the basis for LTO-DC algorithm used in IBM's Linear Tape Open [LTO] tape drives, the Streaming Lossless Data Compression (SLDC) algorithm used in IBM's [Enterprise-class TS1130] tape drives, and the Adaptive Lossless Data Compression (ALDC) used by the IBM Information Archive for its disk pool collections.
Last month, IBM announced that it was [acquiring Storwize. It's Random Access Compression Engine (RACE) is also a lossless compression algorithm based on Lempel-Ziv. As servers write files, Storwize compresses those files and passes them on to the destination NAS device. When files are read back, Storwize retrieves and decompresses the data back to its original form.
To read independent views on IBM's acquisition, read Lauren Whitehouse (ESG) post [Remote Another Chair, Chris Mellor (The Register) article [Storwize Swallowed], or Dave Raffo (SearchStorage.com) article [IBM buys primary data compression].
As with tape, the savings from compression can vary, typically from 20 to 80 percent. In other words, 10TB of primary data could take up from 2TB to 8TB of physical space. To estimate what savings you might achieve for your mix of data types, try out the free [Storwize Predictive Modeling Tool].
So why am I making a distinction on terminology here?
Data reduction is already a well-known concept among specific industries, like High-Performance Computing (HPC) and Business Analytics. IBM has the largest marketshare in supercomputers that do data reduction for all kinds of use cases, for scientific research, weather prediction, financial projections, and decision support systems. IBM has also recently acquired a lot of companies related to Business Analytics, such as Cognos, SPSS, CoreMetrics and Unica Corp. These use data reduction on large amounts of business and marketing data to help drive new sources of revenues, provide insight for new products and services, create more focused advertising campaigns, and help understand the marketplace better.
There are certainly enough methods of reducing the quantity of storage capacity consumed, like thin provisioning, data deduplication and compression, to warrant an "umbrella term" that refers to all of them generically. I would prefer we do not "overload" the existing phrase "data reduction" but rather come up with a new phrase, such as "storage efficiency" or "capacity optimization" to refer to this category of features.
IBM is certainly quite involved in both data reduction as well as storage efficiency. If any of my readers can suggest a better phrase, please comment below.
technorati tags: IBM, data reduction, storage efficiency, histogram, linear regression, thin provisioning, data deduplication, lossy, lossless, EMC, Centera, hash collisions, Information Archive, LTO, LTO-DC, SLDC, ALDC, compression, deduplication, Storwize, supercomputers, HPC, analytics
Are you tired of hearing about Cloud Computing without having any hands-on experience? Here's your chance. IBM has recently launched its IBM Development and Test Cloud beta. This gives you a "sandbox" to play in. Here's a few steps to get started:
- Go to the [IBM Developer & Test in the IBM Cloud beta] dashboard and register for an account. This can be your email address and your own password. You can watch the helpful videos.
- Generate a "key pair". There are two keys. A "public" key that will reside in the cloud, and a "private" key that you download to your personal computer. Don't lose this key.
- Request an IP address. This step is optional, but I went ahead and got a static IP, so I don't have to type in long hostnames like "vm353.developer.ihost.com".
- Request storage space. Again, this step is optional, but you can request a 50GB, 100GB and 200GB LUN. I picked a 200GB LUN. Note that each instance comes with some 10 to 30GB storage already. The advantage to a storage LUN is that it is persistent, and you can mount it to different instances.
- Start an "instance". An "instance" is a virtual machine, pre-installed with whatever software you chose from the "asset catalog". These are Linux images running under Red Hat Enterprise Virtualization (RHEV) which is based on Linux's kernel virtual machine (KVM). When you start an instance, you get to decide its size (small, medium, or large), whether to use your static IP address, and where to mount your storage LUN. On the examples below, I had each instance with a static IP and mounted the storage LUN to /media/storage subdirectory. The process takes a few minutes.
- Download some programs on your personal computer. I downloaded an SSH client called [PuTTY] and an [NX client from NoMachine].
So, now that you are ready to go, what instance should you pick from the catalog? Here are three examples to get you started:
- IBM WebSphere sMASH Application Builder
- Base OS server to run LAMP stack
Next, I decided to try out one of the base OS images. There are a lot of books on Linux, Apache, MySQL and PHP (LAMP) which represents nearly 70 percent of the web sites on the internet. This instance let's you install all the software from scratch. Between Red Hat and Novell SUSE distributions of Linux, Red Hat is focused on being the Hypervisor of choice, and SUSE is focusing on being the Guest OS of choice. Most of the images on the "asset catalog" are based on SLES 10 SP2. However, there was a base OS image of Red Hat Enterprise Linux (RHEL) 5.4, so I chose that.
To install software, you either have to find the appropriate RPM package, or download a tarball and compile from source. To try both methods out, I downloaded tarballs of Apache Web Server and PHP, and got the RPM packages for MySQL. If you just want to learn SQL, there are instances on the asset catalog with DB2 and DB2 Express-C already pre-installed. However, if you are already an expert in MySQL, or are following a tutorial or examples based on MySQL from a classroom textbook, or just want a development and test environment that matches what your company uses in production, then by all means install MySQL.
This is where my SSH client comes in handy. I am able to login to my instance and use "wget" to fetch the appropriate files. An alternative is to use "SCP" (also part of PuTTY) to do a secure copy from your personal computer up to the instance. You will need to do everything via command line interface, including editing files, so I found this [VI cheat sheet] useful. I copied all of the tarballs and RPMs on my storage LUN ( /media/storage ) so as not to have to download them again.
Compiling and configuring them is a different matter. By default, you login as an end user, "idcuser" (which stands for IBM Developer Cloud user). However, sometimes you need "root" level access. Use "sudo bash" to get into root level mode, and this allows you to put the files where they need to be. If you haven't done a configure/make/make install in awhile, here's your chance to relive those "glory days".
In the end, I was able to confirm that Apache, MySQL and PHP were all running correctly. I wrote a simple index.php that invoked phpinfo() to show all the settings were set correctly. I rebooted the instance to ensure that all of the services started at boot time.
- Rational Application Developer over VDI
This last example, I started an instance pre-installed with Rational Application Developer (RAD), which is a full Integrated Development Environment (IDE) for Java and J2EE applications. I used the "NX Client" to launch a virtual desktop image (VDI) which in this case was Gnome on SLES 10 SP2. You might want to increase the screen resolution on your personal computer so that the VDI does not take up the entire screen.
From this VDI, you can launch any of the programs, just as if it were your own personal computer. Launch RAD, and you get the familiar environment. I created a short Java program and launched it on the internal WebSphere Application Server test image to confirm it was working correctly.
If you are thinking, "This is too good to be true!" there is a small catch. The instances are only up and running for 7 days. After that, they go away, and you have to start up another one. This includes any files you had on the local disk drive. You have a few options to save your work:
- Copy the files you want to save to your storage LUN. This storage LUN appears persistent, and continues to exist after the instance goes away.
- Take an "image" of your "instance", a function provided in the IBM Developer and Test Cloud. If you start a project Monday morning, work on it all week, then on Friday afternoon, take an "image". This will shutdown your instance, and backup all of the files to your own personal "asset catalog" so that the next time you request an instance, you can chose that "image" as the starting point.
Another option is to request an "extension" which gives you another 7 days for that instance. You can request up to five unique instances running at the same time, so if you wanted to develop and test a multi-host application, perhaps one host that acts as the front-end web server, another host that does some kind of processing, and a third host that manages the database, this is all possible. As far as I can tell, you can do all the above from either a Windows, Mac or Linux personal computer.
Getting hands-on access to Cloud Computing really helps to understand this technology!
technorati tags: , IBM, Development, Test, Cloud, WebSphere, sMASH, WaveMaker, LAMP, Linux, Apache, MySQL, PHP, Rational, RAD, VDI, , RHEL, RHEV, KVM, SUSE, SLES, Novell, NX, PuTTY, SSH
Continuing on the [IBM Storage Launch of February 9], John Sing has offered to write the following guest post about the [announcement] of IBM Scale Out Network Attached Storage [IBM SONAS]. John and I have known each other for a while, traveled the world to work with clients and speak at conferences. He is an Executive IT Consultant on the SONAS team.
[John Sing profile]
Guest Post written by John Sing, IBM San Jose, California
What is IBM SONAS? It’s many things, so let’s start with this list:
- It’s IBM’s delivery of a productized, pre-packaged Scale Out NAS global virtual file server, delivered in a easy-to-use appliance
- IBM’s solution for large enterprise file-based storage requirements, where massive scale in capacity and extreme performance is required, especially for today’s modern analytics-based Competitive Advantage IT applications
- Scales to many petabytes of usable storage and billions of files in a single global namespace
- Provides integrated central management, central deployment of petabyte levels of storage
- Modular commercial-off-the-shelf [COTS] building blocks. I/O, storage, network capacity scale independently of each other. Up to 30 interface nodes and 60 storage nodes, in an IBM General Parallel File System [GPFS]-based cluster. Each 10Gb CEE interface node port is capable of streaming at 900 MB/sec
- Files are written in block-sized chunks, striped over as many multiple disk drives in parallel – aggregating throughput on a massive scale (both read and write), as well as providing auto-tuning, auto-balancing
- Functionality delivered via one program product, IBM SONAS Software, which provides all of above functions, along with clustered CIFS, NFS v2/v3 with session auto-failover, FTP, high availability, and more
IBM SONAS makes automated tiered storage achievable and realistic at petabyte levels:
- Integrated global policy engine file placement, physical management, replication, deletion, archival
- Integrated high performance parallel scan engine capable of identifying files at over 10 million files per minute per node
- Integrated parallel data movement engine to physically relocate the data within tiered storage
And we’re just scratching the surface. IBM has plans to deploy additional protocols, storage hardware options, and software features.
However, the real question of interest should be, “who really needs that much storage capacity and throughput horsepower?”
The answer may surprise you. IMHO, the answer is: almost any modern enterprise that intends to stay competitive. Hmmm…… Consider this: the reason that IT exists today is no longer to simply save cost (that may have been true 10 years ago). Everyone is reducing cost… but how much competitive advantage is purchased through “let’s cut our IT budget by 10% this year”?
Notice that in today’s world, there are (many) bright people out there, changing our world every day through New Intelligence Competitive Advantage analytics-based IT applications such as real time GPS traffic data, real time energy monitoring and redirection, real time video feed with analytics, text analytics, entity analytics, real time stream computing, image recognition applications, HDTV video on demand, etc. Think of how GPS industry, cell phone / Twitter / Facebook, iPhone and iPad applications, as examples, are creating whole new industries and markets almost overnight.
Then start asking yourself, “What's behind these Competitive Advantage IT applications – as they are the ones that are driving all my storage growth? Why do they need so much storage? What do those applications mean for my storage requirements?”
( Read more about Analytics, pages 14-48 of [New Intelligence
for a Smarter Planet - Driving Business Innovation with IBM Analytic Solutions] )
To be “real-time”, long-held IT paradigms are being broken every day. Things like “data proximity”: we can no longer can extract terabytes of data from production databases and load them to a data warehouse – where’s the “real-time” in that? Instead, today’s modern analytics-based applications demand:
- Multiple processes and servers (sometimes numbering in the 100s) simultaneously ….
- Running against hundreds of terabytes of data of live production data, streaming in from expanding number of smarter sensors, input devices, users
- Producing digital image-intensive results that must be programatically sent to an ever increasing number of mobile devices in geographically dispersed storage
- Requiring parallel performance levels, that used to be the domain only of High Performance Computing (HPC)
This is a major paradigm shift in storage – and that is the solution and storage capabilities that IBM SONAS is designed to address. And of course, you should be able to save significant cost through the SONAS global virtual file server consolidation and virtualization as well.
Certainly, this topic warrants more discussion. If you found it interesting, contact me, your local IBM Business Partner or IBM Storage rep to discuss Competitive Advantage IT applications and SONAS further.
technorati tags: , IBM, Scale-Out, NAS, SONAS, John Sing, San+Jose, COTS, Petabytes, CEE, CIFS, NFS, FTP, HPC, New Intelligence, Analytics, Competitive Advantage IT Applications