As promised, here is my first blog post on little endian, or "LE" as we call it. What better place to start than with a list of frequently asked questions (FAQs)? Hopefully, you'll find this helpful. Let me know if you have any questions I missed.
What is big endian and little endian, anyway?
In order to perform operations on data, computers routinely load and store bytes of data from and to memory, the network, and disk. This data management generally follows one of two schemes: little endian or big endian.
Imagine the number one hundred twenty three. When representing this number with numerals, we typically write it with the most significant digit first and the least significant digit last: 123. This is big endian. Mainframes and RISC architectures like POWER default to big endian when manipulating data.
Some microprocessor architectures store the numbers representing one hundred twenty three in reverse – the least significant digit first and the most significant digit last: 321. This is little endian. x86 architectures use little endian when storing data.
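As a quick illustration (a generic example, not from the original post), you can watch this happen on a live Linux system by writing four known bytes and letting od interpret them as a single 32-bit word:

$ printf '\x01\x02\x03\x04' | od -A n -t x4
 04030201

A little endian machine prints 04030201 because the least significant byte sits at the lowest address; a big endian machine prints 01020304.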
Why do people care about what endian mode their platform runs?
Most users do not care which endian mode their platform is using. They simply care about which applications are supported by their Linux operating systems. Only application providers care about endianness. For example:
A software developer who has code manipulating data through pointer casting or bitfields cannot simply recompile an application from one endian mode to another.
A user with large amounts of data stored to disk or exchanged among systems over network connections without consideration of endian schemes risks a range of application failures, from very subtle errors to complete failures.
A system accelerator programmer (GPU or FPGA) who needs to share memory with applications running in the system processor must share data in a pre-determined endianness for correct application functionality.
Why is Linux on Power transitioning from big endian to little endian?
The Power architecture is bi-endian in that it supports accessing data in both little endian and big endian modes. Although Power already has Linux distributions and supporting applications that run in big endian mode, the Linux application ecosystem for x86 platforms is much larger and Linux on x86 uses little endian mode. Numerous clients, software partners, and IBM’s own software developers have told us that porting their software to Power becomes simpler if the Linux environment on Power supports little endian mode, more closely matching the environment provided by Linux on x86. This new level of support will lower the barrier to entry for porting Linux on x86 software to Linux on Power.
Which Linux distributions will support little endian on Power?
So far, only Canonical’s Ubuntu Server 14.04 distribution supports little endian on Power. Plans are underway in the community distributions of Debian and openSUSE for little endian releases.
Additionally, SUSE has stated publicly that SLES 12 will be little endian when it becomes available. See SUSE Conversations for more information.
Red Hat has not yet publicly disclosed its plans around a little endian operating system. However, work to create a ppc64le architecture has started in the Fedora community.
Which Linux distributions will support big endian on Power?
It is IBM's understanding that Red Hat and SUSE will continue to support their existing big endian releases on Power for their full product lifecycles.
While SUSE has announced their plans to transition their distribution to little endian (see above), Red Hat has not disclosed anything. The newly available Red Hat Enterprise Linux 7 operates in big endian mode on Power. Specifics about the transition to little endian will be decided and disclosed by Red Hat.
What about Linux applications that have already been optimized for big endian on Power?
The existing PowerLinux application portfolio supports only big endian modes today. Open source applications have begun extending their support to little endian mode on Power Systems. Existing third party and IBM applications will likely migrate more slowly and deliberately. As such, Power hardware will support both endian modes for the foreseeable future so that existing Linux applications optimized for a big endian platform will continue to run unchanged while new applications optimized to little endian mode are added.
Can applications compiled for x86 (Windows or Linux) run without change on little endian Power?
Because the x86 and Power processors use different instruction set architectures (ISAs) – the binary instruction formats known to the processor – compiled applications will need at least a recompile on the new platform. Whether source code changes are required depends on how many optimizations have been made in the application source – such as the use of assembler language and any assumptions about page size or cache line size.
However, interpreted applications such as those in Java, perl, python, php, ruby and others should be capable of migrating with little to no change.
Does this transition affect application ecosystems for AIX or IBM i?
No, there will be no effect on AIX or IBM i application environments as a result of this change.
What if I want to run a mix of big endian and little endian applications on the same Power System?
Each Linux distribution will support a particular endian mode, little or big. Applications always certify to specific distributions. As such, endian mode decisions should be transparent to the end user. Customers should not have to consider endianness in their application choice.
If one requires different Linux distributions or the same distribution at different releases on a single server, then Power Systems virtualization (LPARs or VMs) allows customers to run applications supported by a big endian Linux distribution like RHEL6 as well as applications supported by a little endian distribution like Canonical’s Ubuntu Server at the same time. However, concurrent little endian and big endian support on the same server will not be available until a future date. See more details in the questions below.
Which POWER processors support little endian mode?
The POWER8 processor is the first processor to support little endian and big endian modes equivalently. Although previous generations of the POWER processors had basic little endian functionality, they did not fully implement the necessary instructions in such a way to enable enterprise operating system offerings.
Where can little endian distributions run on Power?
When IBM announced POWER8 in April 2014, little endian (LE) operating systems were initially supported as KVM guests. Further, KVM support was limited to only include all LE or all big endian (BE) guests. In coming releases, IBM expects to support concurrent LE and BE guests in KVM, as well as the support of LE guests on PowerVM.
Do POWER systems support the running of mixed environments of big and little endian operating systems?
The POWER8 processor supports mixing of big and little endian memory accesses at the core level, through the use of SPR (special purpose register) settings. While this could technically support the running of both big and little endian software threads, the complexity of implementing such a design point would be high. Therefore, IBM has elected to enable operating system versions as completely big endian or little endian by design.
The virtualization capabilities of the POWER platform have allowed for mixed environments of operating system levels and types. This same isolation mechanism applies to big and little endian operating systems. However, in implementing the initial releases of little endian, IBM has introduced some short-term limitations on where LE operating systems can run. Over time, these will be removed and both KVM and PowerVM will support concurrent mixing of LE and BE operating systems.
See the previous question for more information.
Does PowerVM support little endian operating systems?
While the POWER8 systems support little endian (LE) mode, IBM has not yet completed the software development and testing to enable LE operating systems on PowerVM. The outlook is that this function will be delivered around mid-2015. When this capability is delivered, PowerVM will support the mixing of both big endian (BE) and LE operating systems. This enablement will also enable the running of LE operating systems on the Power Integrated Facilities for Linux (IFLs).
Does PowerKVM support mixing of little endian and big endian operating systems?
Testing has not yet completed to enable the mixing of little endian (LE) and big endian (BE) guests for KVM. Until this completes, IBM supports guests of the same type – all LE or all BE.
IBM hopes to support mixing of guest types around mid-2015.
Can I run big endian applications on a little endian operating system or vice versa?
No, the operating system enablement only supports applications of the same type. As such, a little endian operating system (ppc64le or ppc64el) can only run little endian applications built for this software platform. Likewise, big endian operating systems (ppc64) only support software built for big endian.
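A simple way to check which flavor a given system is running (an illustrative example, not from the original FAQ):

$ uname -m
ppc64le

Little endian distributions report ppc64le (Debian-derived ports name the architecture ppc64el), while big endian distributions report ppc64.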
January 23, 2015 - Author's update
A couple of noteworthy activities have occurred since this blog was originally published.
A little endian (LE) version of RHEL 7.1 has been released in beta form. This announcement indicates that RHEL 7 updates will have both the existing big endian (BE) offering and a new LE offering. For more information about the beta, see the RHEL 7.1 beta announcement information. This means that all three Linux on Power distribution partners -- SUSE, Canonical, and now Red Hat -- have LE operating systems.
IBM PowerKVM now supports the mixture of BE and LE guests beginning with the 2.1.1 update in October 2014. This was a subtle change that is hard to find in documentation.
Support for LE operating systems on PowerVM continues to make progress toward delivery earlier rather than later this year. When this is delivered, the mixing of BE and LE logical partitions will be supported.
Additionally, the following question keeps being asked and needs its own FAQ:
Can I run x86 Linux applications on LE Linux on Power operating systems unchanged?
If your application was written in a dynamic language, it is highly portable and often migrates to BE and LE Linux on Power operating system environments without change. Examples include applications written in Java, php, perl, python, etc.
If your application was written in a compiled language like C/C++, it must be recompiled on Power for both the BE and LE operating systems. Applications migrating from x86 Linux onto an LE Linux operating system on Power will migrate without concern for data layout (endianness). Applications migrating onto BE operating systems need to be reviewed for consistent data access, especially if they will share data via disk or networking with LE systems.
The default iptables rules that come with most of the Enterprise Linux distributions (e.g. RHEL and SLES) prevent multicast IP packets from reaching client applications that have joined multicast groups. This article will explain how to configure or disable iptables so client multicast applications can receive multicast packets.
Disabling iptables
If your multicast client system doesn't need to be protected by a firewall, the easiest way to make a multicast application work is to disable iptables or any other firewall service that might be running. That can be done temporarily by flushing all the iptables rules with the following command:
# iptables -F
This command removes all the iptables rules; however, the rules will be reloaded on the next system reboot.
The iptables rules can also be prevented from being loaded during the system boot. On RHEL systems it can be done by disabling the iptables service:
# chkconfig iptables off
On SLES systems it can be done by disabling the firewall in the YaST interface, or by disabling the SuSEfirewall2_init and SuSEfirewall2_setup services:
# chkconfig SuSEfirewall2_init off
# chkconfig SuSEfirewall2_setup off
Configuring the iptables rules
If your multicast client system needs to have a firewall service running, the firewall will have to be configured to allow multicast packets. This section will show how to configure the iptables rules on a RHEL system.
The file /etc/sysconfig/iptables is the configuration file that contains the iptables rules that will be loaded during the iptables service start. By adding the following line to this file, iptables will allow all incoming multicast packets:
-A INPUT -m pkttype --pkt-type multicast -j ACCEPT
The order of the lines in the file is important, so the example rule above needs to be placed before the default rule that rejects all packets that don't fit in any of the previous rules. Following is an example of a complete configuration file:
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -m pkttype --pkt-type multicast -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
This example adds a rule to allow all incoming multicast packets. However, you might want to allow packets that arrive only to a specific network interface, from a certain range of source address or from a certain broadcast group. For more specific configurations, refer to the iptables man page.
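If you want to test the rule before editing the configuration file, a hedged sketch: insert it into the running rule set ahead of the final REJECT rule (the position 5 below matches the example file above; adjust it to your own rule ordering), then persist it on RHEL:

# iptables -I INPUT 5 -m pkttype --pkt-type multicast -j ACCEPT
# service iptables save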
zswap" is discussed, with some initial performance data provided to demonstrate the potential benefits for a system (partition or guest) which has constrained memory and is beginning to swap memory pages to disk. The technique improves the throughput of a system, while significantly reducing the disk I/O activity normally associated with page swapping. We also explore how zswap works in conjunction with the new compression accelerator feature of the POWER7+ processor to potentially improve the system throughput even more than software compression alone.
This article is a good example of the ongoing collaboration that occurs in the Linux open-source community. New implementations are proposed, discussed, debated, refined and updated across developers, community members, interested customers, and performance teams. Here on the PowerLinux technical community, we are working to highlight more of these examples of work-in-progress from the broader Linux community. These proposals are applicable to both x86 systems and Power systems, so examples shown below cover both realms.
What is zswap?
Zswap is a new lightweight backend framework that takes pages that are in the process of being swapped out and attempts to compress them and store them in a RAM-based memory pool. Aside from a small reserved portion intended for very low-memory situations, this zswap pool is not pre-allocated; it grows on demand, and its maximum size is user-configurable. Zswap leverages an existing frontend already in mainline called frontswap. The zswap/frontswap process intercepts the normal swap path before the page is actually swapped out, so the existing swap page selection algorithms are unchanged. Zswap also introduces key functionality that automatically evicts pages from the zswap pool to a swap device when the zswap pool is full. This prevents stale pages from filling up the pool.
The zswap patches have been submitted to the Linux Kernel Mailing List (lkml) for review; you can view them in this post.
Instructions for building a zswap-enabled kernel on a system installed with Fedora 17 can be found on this wiki.
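As a rough sketch, a zswap-enabled kernel build needs at least the following configuration options (names taken from the patch set and later mainline kernels; verify against your kernel version):

CONFIG_FRONTSWAP=y
CONFIG_ZSWAP=y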
What are the benefits?
When a page is compressed and stored in a RAM-based memory pool instead of actually being swapped out to a swap device, this results in a significant I/O reduction and in some cases can significantly improve workload performance. The same is true when a page is "swapped back in" - retrieving the desired page from the in-memory zswap pool and decompressing it can result in performance improvements and I/O reductions compared to actually retrieving the page from a swap device.
Using the SPECjbb2005 workload for our engineering tests, we gathered some performance data to show the benefits of zswap. SPECjbb2005 is a Java™ benchmark that evaluates server performance and calculates a throughput metric called "bops" (business operations per second). To find out more about this benchmark or see the latest official results, see the SPEC web site. Note that the following results are not tuned for optimal performance and should not be considered official benchmark results for the system, but rather results obtained for research purposes. We liked this benchmark for this use case because we could more carefully control the amount of active memory being used in increments.
The SPECjbb2005 workload ramps up a specified number of "warehouses", or units of stored data, during the run. The number of warehouses is a user-controlled setting that is configured depending on the number of threads available to the JVM. As the benchmark increases the number of warehouses throughout the run, the system utilization level increases. A bops score is reported for each warehouse run. For this work, we focused on the bops score from the warehouse that keeps the system about 50% utilized. We also increased the default runtime for each warehouse to 5 minutes since swapping can be bursty and a longer runtime helps to achieve more consistent results.
For these results, the system was assigned 2 cores, 10 GB of memory, and a 20 GB swap device. A single JVM was created for the SPECjbb2005 runs, using IBM Java. First, a baseline measurement was taken where normal swapping activity occurred, then a run with zswap enabled was measured to show the benefits of zswap. We gathered results on both a Power7+ system and an x86 system to observe the performance impacts on different architecture types. The mpstat, vmstat, and iostat profilers from the sysstat package were used to record CPU utilization, memory usage, and I/O statistics. We would recommend taking advantage of the lpcpu package to gather these data points.
To demonstrate the performance effects of swapping and compression, we started with a JVM heap size that could be covered by available memory, and then increased the JVM heap size in increments until we were well beyond the amount of free memory, which forced swapping and/or compression to occur. We recorded the throughput metric and swap rate at each data point to measure the impacts as the workload demanded more and more pages.
Setting up zswap
With the current implementation, zswap is enabled by a kernel boot parameter.
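Assuming the module parameter name used by later mainline kernels (the patch set under review may differ), that parameter is:

zswap.enabled=1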
We looked at several new in-kernel stats to determine the characteristics of compression during the run. The metrics used were as follows:
pool_pages - number of pages backing the compressed memory pool
reject_compress_poor - pages rejected due to poor compression policy (cumulative) (see the max_compressed_page_size sysfs attribute)
reject_zsmalloc_fail - pages rejected due to zsmalloc failure (cumulative)
reject_kmemcache_fail - pages rejected due to kmem failure (cumulative)
reject_tmppage_fail - pages rejected due to tmppage failure (cumulative)
reject_flush_attempted - flushes attempted after a rejected store (cumulative)
reject_flush_fail - flushes that failed after a rejected store (cumulative)
stored_pages - number of compressed pages stored in zswap
outstanding_flushes - the number of pages queued to be written back
flushed_pages - the number of pages written back from zswap to the swap device (cumulative)
saved_by_flush - the number of stores that succeeded after an initial failure due to reclaim by flushing pages to the swap device
pool_limit_hit - the zswap pool limit has been reached
failed_stores - how many store attempts have failed (cumulative)
loads - how many loads were attempted (all should succeed) (cumulative)
succ_stores - how many store attempts have succeeded (cumulative)
invalidates - how many invalidates were attempted (cumulative)
There are two user-configurable zswap attributes:
max_pool_percent - the maximum percentage of memory that the compressed pool can occupy
max_compressed_page_size - the maximum size of an acceptable compressed page. Any pages that do not compress to be less than or equal to this size will be rejected (i.e. sent to the actual swap device)
To observe performance and swapping behavior once the zswap pool becomes full, we set the max_pool_percent parameter to 20 - this means that zswap can use up to 20% of the 10GB of total memory.
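As an illustration of how that is set, assuming the sysfs path used by later mainline kernels (the patch set under review may expose the attribute differently):

# echo 20 > /sys/module/zswap/parameters/max_pool_percent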
The following graphs represent the SPECjbb2005 performance and swap rate for a run using the normal swapping mechanism.
Note that as "available" memory is used up around 10GB, the performance falls off very quickly (the Blue Line) and normal page swapping (the Red Line) to disk increases. The behavior is consistent both on Power7+ and x86 systems.
Power7+ baseline results:
x86 baseline results:
As you can see, performance dramatically decreased once the system started swapping and continued to level off as the JVM heap was increased.
The following graphs represent the SPECjbb2005 performance and swap rate for a run when zswap is enabled. In these cases, memory is now being compressed, which significantly reduces the need to go to disk for swapped pages. Performance of the workload (the blue line) still drops off but not as sharply, but more importantly the system load on I/O drops dramatically.
Power7+ with zswap compression:
x86 with zswap compression:
As you can see, the swap (I/O) rate was dramatically reduced. This is because most pages were compressed and stored in the zswap pool instead of swapped to disk, and taken from the zswap pool and decompressed instead of swapped in from disk when the page was requested again. The small amount of "real" swapping that occurred is due to the fact that some pages compressed poorly - which means they did not meet a user-defined max compressed page size - and were therefore swapped out to the disk, and/or stale pages were evicted from the zswap pool.
Looking at the zswap metrics for each run, we can calculate some interesting statistics from this set of runs - keep in mind that the base page size differs between Power (64K pages) and x86 (4K pages), which accounts for some of the different behavior. Also note that we set the max zswap pool size to 20% of total memory for these runs, as mentioned above - this max setting can be adjusted as needed. On Power, the average zswap compression ratio was 4.3; on x86, it was 3.6. For the Power runs, we saw entries for "pool_limit_hit" starting at the 17 GB data point. For the x86 runs, the pool limit was hit earlier - starting at the 15.5 GB data point. For the Power runs, the zswap pool stored at most 139,759 pages. For the x86 runs, the maximum number of stored pages was 1,914,720. This means all those pages were compressed and stored in the zswap pool, rather than being swapped out to disk, which results in the performance improvements seen here.
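For reference, since stored_pages and pool_pages are both counted in units of the base page size, the average compression ratio is simply stored_pages divided by pool_pages. A minimal sketch, assuming the stats are exported through debugfs as in later mainline kernels:

$ stored=$(cat /sys/kernel/debug/zswap/stored_pages)
$ pool=$(cat /sys/kernel/debug/zswap/pool_pages)
$ echo "scale=1; $stored / $pool" | bc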
POWER7+ hardware acceleration
The POWER7+ processor introduces new onboard hardware assist accelerators that offer memory compression and decompression capabilities, which can provide significant performance advantages over software compression. As an example, the system specifications for the IBM Flex System p260 and p460 Compute Nodes found here mention the "Memory Expansion acceleration" feature of the processor.
The current zswap implementation is designed to work with these hardware accelerators when they are available, allowing for either software compression or hardware compression. When a user enables zswap and the hardware accelerator, zswap simply passes the pages to be compressed or decompressed off to the accelerator instead of performing the work in software. Here we demonstrate the performance advantages that can result from leveraging the POWER7+ on-chip memory compression accelerator.
POWER7+ hardware compression results
Because the hardware accelerator speeds up compression, looking at the zswap metrics we observed that there were more store and load requests in a given amount of time, which filled up the zswap pool faster than a software compression run. Because of this behavior, we set the max_pool_percent parameter to 30 for the hardware compression runs - this means that zswap can use up to 30% of the 10GB of total memory.
The following graph represents the SPECjbb2005 performance and swap rate for a run when zswap and the POWER7+ hardware accelerator are enabled. In this case, memory is now being compressed in hardware instead of software, and this results in a significant performance improvement. Performance of the workload (the blue line) still drops off, but even less sharply than the zswap software compression case, and the system load on I/O still remains very low.
Power7+ hardware compression:
As you can see, the swap (I/O) rate was dramatically reduced. This is because most pages were compressed using the hardware accelerator and stored in the zswap pool instead of swapped to disk, and taken from the zswap pool and decompressed in the hardware accelerator instead of swapped in from disk when the page was requested again. The small amount of "real" swapping that occurred is due to the fact that some pages compressed poorly - which means they did not meet a user-defined max compressed page size - and were therefore swapped out to the disk, and/or stale pages were evicted from the zswap pool.
The following graphs show the performance comparison between normal swapping and zswap compression, and the POWER7+ graph also includes the hardware compression results, showing that the hardware accelerator provides even more performance advantages over software compression alone:
Power7+ performance comparison:
x86 performance comparison:
As you can see, this workload shows up to a 40% performance improvement in some cases after the heap size exceeds available memory when zswap is enabled, and the POWER7+ results show that the hardware accelerator can improve the performance by up to 60% in some cases compared to the baseline performance.
Swap (I/O) comparison
The following graphs show the swap rate comparison between normal swapping and zswap compression, and the POWER7+ graph includes the hardware compression results, showing that the hardware accelerator also reduces the swap rate dramatically. Swap rates are dramatically reduced on both architectures when zswap is enabled, including the POWER7+ hardware compression results.
Power7+ swap I/O comparison:
x86 swap I/O comparison:
The new zswap implementation can improve performance while reducing swap I/O, which can also have positive effects on other partitions that share the same I/O bus. The new POWER7+ on-chip memory compression accelerator can be leveraged to provide performance improvements while still keeping swap I/O very low.
It’s been an exciting year, and one that I’ll be looking back on with much satisfaction for many years to come. In the space of a year, we’ve created PowerAI, converting Deep Learning from a promise into a technology that’s ready to use. For many years, Artificial Neural Networks, the technology behind Deep Learning, have held out the promise of real cognitive computing. However, it has taken many innovations to realize that promise, including more sophisticated network architectures and system innovations such as programmable inference accelerators (invented in 1994) and, to accelerate neural network training, numeric accelerators (invented around 2000).
What a difference a year makes! This time last year we were debugging our future work horse for Deep Learning on Power, a PCIe-based multi-GPU enclosure. A few months earlier, we had concluded that we were running out of steam with our trusty Nvidia K80 numeric accelerators. These accelerators had been designed for HPC workloads in the national labs and were the best accelerators in the industry – but they were starting to run out of steam for the most sophisticated cognitive applications. The 16-GPU enclosures gave us the work horse to build a mature Deep Learning environment on Power that was ready for even the most advanced enterprise users.
But creating a stable hardware platform for Deep Learning users was just a part of the challenge. The Deep Learning systems in use in early 2016 originated from academic and industrial research labs and were built around technological prowess, rather than ease of use and deployment. While rapid innovation is a good model for incubating new technology, it did not address the needs of technology adopters. To fill this gap, we created the Power Machine Learning and Deep Learning software kit – a distro for cognitive applications which we first released in April 2016.
To accelerate progress in the advancement of Deep Learning, we also contributed to the community our enhancements that take advantage of the innovations in Power, and published build recipes for early adopters who want to build the very newest versions of DL frameworks from source. In parallel with our work on Deep Learning frameworks, we expanded and optimized the Power software ecosystem for Deep Learning: we engaged with the open source community to optimize and release mathematics libraries for Power, such as OpenBLAS, ATLAS, FFTW, and numpy; and IBM’s Toronto Lab released MASS, the mathematics acceleration subsystem for Linux, to accelerate Deep Learning applications. In addition, we incubated efforts to port and optimize the dynamic scripting languages used to develop cognitive applications, such as Julia, Lua and Python, for Power.
In parallel, we worked with our colleagues in the system architecture team to use our experiences with deep learning stacks to create the first enterprise server optimized for Deep Learning, which was announced in September. Building on our experience with up to 16 PCIe-connected GPUs, and with K40, K80 and M40 GPUs, we concluded that Power offered the best host architecture for a multi-GPU solution, and that a smaller number of stronger GPUs would offer significant scalability benefits. Drawing on the ongoing work on coherent processor/accelerator interfaces and the world’s fastest supercomputers, we also recognized the benefits of a coherent high-performance accelerator interconnect. With these guidelines as a basis, a system based on the POWER8+ with up to four of the newest Nvidia P100 GPUs and CPU/GPU NVLink offered an ideal design point.
But we’re not done – as we look ahead to 2017, we have many new exciting ideas to transform the industry with cognitive computing innovations! Let us know how cognitive computing will transform your business and let’s innovate together!
Dr. Michael Gschwind is Chief Engineer for Machine Learning and Deep Learning for IBM Systems where he leads the development of hardware/software integrated products for cognitive computing. During his career, Dr. Gschwind has been a technical leader for IBM’s key transformational initiatives, leading the development of the OpenPOWER Hardware Architecture as well as the software interfaces of the OpenPOWER Software Ecosystem. In previous assignments, he was a chief architect for Blue Gene, POWER8, POWER7, and Cell BE. Dr. Gschwind is a Fellow of the IEEE, an IBM Master Inventor and a Member of the IBM Academy of Technology.
Game theorists use the example of Prisoner’s Dilemma to show how two rational people might refuse to cooperate with each other, increasing the suffering on both sides. The same can be said for businesses. It’s easy for a company to refuse to be the first to share for fear of being betrayed by competitors. Historically, the possible innovation and growth rewards from collaboration didn’t outweigh the costs of potential betrayal.
But times have changed.
Now the costs of not cooperating far outweigh the potential benefits of openness. Business leaders are embracing a mantra of agile innovation and open partnerships. Technology is leveling the playing field. New competition can emerge overnight, disrupting entire industries. To innovate at the rate the market demands, companies must remove silos and adopt a more open approach, and it's already defining the future of business.
Open innovation is about collaboration and co-creation, which enables people to work and solve problems in new ways. It allows them to use technology to take existing knowledge and real insights and ensure that they are accessible to everyone who needs them. Open innovation relies on people working on open platforms, collectively seeking out the next advancement, not replicating the last one.
What an Open Innovation Leader Looks Like
Tesla recently announced they were making all their patents open source. While they know their competitors can leverage these developments, the benefits of open innovation outweigh that risk. In their words, "We will never stop innovating, that is what gives us the edge." Tesla is pushing the whole industry forward by challenging them to look for what’s next. The key is breaking down the internal and external barriers within an organization and enabling closer connections with customers. An open innovation leader creates a dialogue with unique thinkers, takes the chance to sit down with leaders from different verticals, and finds a new way to look at competitive data.
Here are some things open innovation leaders do to set themselves apart.
Collaborate & Co-Create
Job titles can be constricting. Everyone in an organization has unique ideas and skill sets outside of their day-to-day responsibilities. Successful open innovation brings diverse teams together for a broad mix of thoughts, inspirations, experiences and solutions. Occasionally this approach will yield projects and ideas that are fully shared and not owned by individuals. Creativity thrives when people can build things and share easily, when they have a space that allows for collaborative problem solving. Check out Airbnb’s new office space in Dublin for inspiration. The ultimate result of open collaboration and co-creation is the advantage that comes from being the first to develop and bring to market truly unique new products and solutions.
Keep Things Nimble
Nothing stifles innovation like rigid boundaries and drawn out production timelines. Open innovation is about throwing ideas and concepts into real environments, seeing how they stand up and getting people on board. Companies also need to look down the road and explore how they will adapt to industry changes and new technology. Having the latest tools may be out of reach for some companies, but shared R&D spaces exist to enable access without the full investment.
We’ve all heard the call of Silicon Valley telling us to fail fast. But in the majority of businesses, failure is still a huge deterrent. Environments are needed where failure is possible and where there is flexibility to test ideas without affecting core operations. Just look at the rise of FailCon, a conference founded in 2009 by technology leaders to draw attention to the important insights within business failure – now a global event. The direct feedback and learnings from failure are a powerful accelerator of innovation. Don’t be the boss that looks down on failure.
Put it to Work
Changing a corporate culture can seem nearly impossible. But it’s essential in adapting to a new landscape. A common way to implement open innovation is to create a spin-off innovation unit or skunkworks, often in a different location, and give it a mandate to take a fresh approach. Successful projects can then expand into other departments. But for a concept like this to work, it needs to be applied to everything and at every level. Open innovation is a constant search for opportunities and requires a willingness to ask questions. Start by embracing the curiosity inside your company.
This technical preview tutorial explains how users of IBM's latest POWER8-based scale-out Linux servers can try Ubuntu running non-virtualized. We show how Ubuntu can be installed directly on the OPAL firmware, and run as a single-image operating system directly on the system.
Ubuntu 14.04 is generally available today and fully supported as a PowerKVM guest on the IBM Power Systems shown below:
It is also possible to run Ubuntu 14.04 directly on these systems, which the development teams refer to as non-virtualized mode, or "bare metal". There is no PowerVM LPAR layer, and there is no PowerKVM hosting layer. This capability is available in the open-source communities, so over time new versions of other Linux distros are expected to be enabled for this support as well.
Note: If you are running Ubuntu 14.04 non-virtualized, you need to upgrade the kernel packages to get cpufreq support. The 3.13.0-32 level works.
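A hedged example of that upgrade, using the standard Ubuntu archive metapackage (package names may vary with your setup):

$ sudo apt-get update
$ sudo apt-get install linux-generic
$ uname -r    # after rebooting, confirm 3.13.0-32 or later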
The OPAL firmware referenced below is designed to allow a Linux operating system to run directly on the POWER8 system. Running directly on the system enables the operating system to be a KVM host, creating and controlling KVM guests. In the scenario described in this article, there is no KVM hosting, and the Ubuntu 14.04 operating system runs directly on the system.
Because the OPAL firmware enables the PowerKVM mode, the terminology used in selecting the firmware below is targeted at that mode. In practice, OPAL firmware enables a Linux operating system to run directly on the system, and running KVM in that operating system is not a requirement.
Technical Preview only at this time. The ability to run Ubuntu directly on the POWER8 Linux-only system is provided as-is, is not a supported configuration option at this time, and therefore is not for production use. This ability is provided as a technical preview only. If you encounter any problems running the non-virtualized technical preview, you can report your bugs against Ubuntu in Launchpad. Alternatively, you can always ask a question in the Forums here on the Community!
You are overwriting your PowerKVM install. These instructions replace (destroy) your existing installed PowerKVM host and all of its guests. The PowerKVM software can be re-installed at a later time, and your guests can be re-created.
Your system must have access to the external web for access to Canonical's netboot server - or you will need a DVD image downloaded and burned to a DVD. These instructions assume a netboot load.
1. In order to install Ubuntu 14.04 on the IBM Power system, the system needs to be set for KVM as the hypervisor mode. This step selects the OPAL firmware to be loaded. If the system was built with the PowerKVM configuration, you are ready to go; otherwise, you can configure it using the following steps:
Turn off the server, go to the server Advanced System Management (ASM) interface, and under System Configuration ⇒ Hypervisor Configuration, set the hypervisor mode to KVM (or OPAL) and choose an IPMI password.
2. Once the machine is in PowerKVM mode, you need to connect to the FSP using IPMI to get the machine console. Run the following IPMI commands:
Restart the machine
# ipmitool -I lanplus -H fsphostname -P password power cycle
Log into the machine console
# ipmitool -I lanplus -H fsphostname -P password sol activate
Once you run the last command, you will see the machine console, and everything you type will be sent to the machine. To exit the console, type ~. and note that ~? shows the help menu.
3. Once the machine is booted up, you will see the petitboot console, as shown below:
Petitboot is the bootloader for IBM Power machines configured with PowerKVM. From here, you can insert an Ubuntu DVD in the machine's DVD-ROM drive and boot from it. You can also boot from the network.
This document will explain how to install from the network.
4. In order to install from the network, you need to configure the system network in the 'System Configuration' menu entry. Once the network is configured, you can create a new entry in petitboot by pressing the letter 'n'. By creating a new entry, you will go into the Option Editor to configure the entry details, as shown:
Once you are editing the boot entry, you should choose the 'Specify paths/URLs manually' option, and you must provide the installer kernel and initrd. For Ubuntu, you should point it to the Canonical Ubuntu 14.04 netboot website.
In this example, I used version 14.04 with the following URLs:
You do not need to fill out the other entries if you just want to do a default installation. Once you finish the configuration, go back to the petitboot menu and boot the entry you just configured ("User Item 1").
Then boot that entry by pressing 'Enter', and the Ubuntu 14.04 installer will launch, as shown:
When you see this screen, select the language you want to use during the installation and proceed through normal Ubuntu 14.04 installation processes.
As you likely have heard, Arvind Krishna, IBM General Manager for Development and Manufacturing in the IBM Systems & Technology Group, announced that Power Systems would be supporting KVM. This is an exciting announcement for numerous reasons that I'll defer to another posting. For this blog entry, I thought I'd do a question/answer session based on common questions I've been asked in the past couple of weeks. However, before I do so, I need to remind you that these are our current thoughts at this time: things may change.
Q: When will KVM be available on Power?
A: The outlook for general availability is next year. However, IBM has already started releasing patches to various KVM communities to support the POWER platform.
Q: On what systems does IBM intend to support KVM?
A: IBM intends to initially support KVM on a limited set of models, targeted at the entry end of the server line. This strategy supports IBM's efforts to capture the fastest-growing market: x86 Linux servers in the 2-socket and smaller space.
Q: How does IBM plan to position KVM against PowerVM?
A: IBM remains committed to PowerVM being the premier enterprise virtualization software in the industry. With KVM on Power, IBM will be targeting x86 customers on entry servers but will offer both KVM and PowerVM to meet the varying virtualization needs of PowerLinux customers. KVM virtualization technology also represents an opportunity to simplify customers' virtualization infrastructure with a single hypervisor and management software across multiple platforms.
Q: What Linux versions from Red Hat and SUSE will provide KVM hosts support on Power?
A: The decision to provide KVM on PowerLinux will be made by Red Hat and SUSE. IBM will be working with them in the months to come and would welcome their support.
Q: What management and cloud software will support KVM on Power?
A: For KVM node management, IBM intends to work with multiple vendors, including Red Hat and SUSE to certify KVM on Power into their system management software offerings. Additionally, IBM plans to contribute any patches necessary to OpenStack to extend the KVM driver to Power. Using this foundation, additional IBM and third-party software should provide a diverse set of management software.
Q: What will software providers need to do to support KVM on Power?
A: Most software providers have become comfortable with some form of virtualization such as PowerVM, VMware, and KVM. Just as with applications on Linux, software providers should find that applications in the KVM environment behave similarly on x86 and Power platforms. As such, each vendor should assess any challenges KVM on Power would present.
Q: What operating systems will be supported as guests in KVM on Power?
A: Given that KVM is initially targeted to be released on Linux-only servers, only Linux is planned at this time. IBM plans to certify the latest updates of RHEL 6 and SLES 11 as KVM guests.
Q: How will KVM run on the Power Systems?
A: The design goal of KVM on Power is to be just another hardware platform supporting KVM. As such, KVM on Power will be true to the KVM design point of a KVM host image that supports one or more guests. PowerVM constructs such as the HMC, IVM, and VIOS will not exist in KVM. Management and virtualization will occur through the KVM host image.
Q: Will KVM run in a PowerVM logical partition (LPAR)?
A: While KVM supports a user-mode virtualization that can run on any Linux operating system, KVM on Power is being developed to run natively on the system, not nested in PowerVM. This is done to enable KVM to run optimally using the POWER processor Hypervisor Mode. As such, the system will make a decision very early in the boot process to run KVM or PowerVM. This is envisioned as a selectable option managed by the Service Processor (FSP).
Q: Will it be possible to migrate from KVM on Power to PowerVM or vice versa?
A: While the virtualization mode will be selectable on systems, the process of migrating between KVM and PowerVM will require additional steps, such that frequent migrations will be unlikely. However, in the case where a customer wishes to upgrade to PowerVM to acquire advanced virtualization capabilities, this migration should be supported. Steps to back up and restore the VM image will be required when migrating in either direction.
Q: Will AIX or IBM i run in KVM on Power?
A: Given that KVM initially runs on Linux-only platforms, support for non-Linux operating systems has not been planned at this time.
Q: Will Windows run in KVM on Power?
A: Windows does not run on Power Systems. As such, supporting it in a KVM guest VM will not work.
Hopefully, these questions were helpful to folks. As usual, follow-up questions/comments appreciated.
The Linux Trace Toolkit Next Generation (LTTng) is a toolkit for tracing and visualization of events produced by both the Linux kernel and applications (user-space). Version 2.x offers several improvements over the previous 1.x series, including:
Introduction of a new trace file format called CTF (Common Trace Format)
Tracing of user-space applications, beyond the default kernel events
New implementation of ring buffer algorithm
Able to attach context information to events
Building & installing
Note: In this section I will show how to build LTTng from source, although it is already delivered in some Linux distributions, for instance, in RPM packages for ppc64 in openSUSE Linux and Fedora 17. So you might want to use a stable version from your chosen distro or just build the latest yourself.
Its source code is versioned in different git trees, one for each component (see Table 1).
Table 1. LTTng source components
userspace-rcu - library that implements the RCU (Read-Copy-Update) mechanism in user-space
lttng-ust - user-space tracing library
lttng-tools - provides the main client that controls execution of LTTng
lttng-modules - kernel tracing modules; at the time this post is written, a kernel >= 2.6.38 is required to build lttng-modules
Some components rely on third-party libraries, so take a look at the README file in each component.
# Building and installing the liburcu library
$ cd userspace-rcu
$ ./bootstrap && ./configure && make
$ sudo make install
$ sudo ldconfig
# Building and installing lttng-ust
$ cd lttng-ust
$ ./bootstrap && ./configure && make
$ sudo make install
# Building and installing lttng-tools
$ cd lttng-tools
$ ./bootstrap && ./configure && make
$ sudo make install
# Building and installing lttng-modules
$ cd lttng-modules
$ make
$ sudo make modules_install
$ sudo depmod -a
There is a post-installation procedure that must be done in order to allow non-root users to (transparently) start the LTTng daemon for monitoring kernel events. These users must be added to the tracing group as shown below (on Fedora 17):
# Create group if it doesn't exist
$ sudo groupadd -r tracing
# Add <username> to the group
$ sudo usermod -aG tracing <username>
Managing a trace session
LTTng tracing relies on the concept of a session. Table 2 shows the commands that manage a session's lifecycle.
Table 2. Commands to manage a tracing session
create NAME - create a session called NAME. By default, tracing files are held in ~/lttng-traces, but this location may be redefined with the -o option
set-session NAME - switch between sessions, setting the current session to NAME
destroy NAME - destroy the session called NAME. The -a (or --all) option may be used to destroy all sessions
list [NAME] - show information regarding the session NAME, or list all sessions if NAME is omitted
Below is an example of a tracing session with LTTng, in which system behavior is monitored after booting up an instance of the Firefox browser.
$ sudo lttng-sessiond &
$ lttng create demo_session
Session demo_session created.
Traces will be written in /home/wainersm/lttng-traces/demo_session-20121030-233238
$ lttng start
Tracing started for session demo_session
$ firefox &
$ lttng stop
Waiting for data availability
Tracing stopped for session demo_session
$ lttng destroy demo_session
Session demo_session destroyed
Notice from the example above that lttng-sessiond (the daemon) is initialized with sudo (i.e. as root). The trace session may be started/stopped several times, allowing you to adjust some parameters (i.e. add/remove events and context information).
Managing events in a trace session
The tool is able to trace events emitted by the kernel and by applications, which are made available through several infrastructures such as Kprobes, Ftrace, tracepoints, and the processor PMU (Performance Monitoring Unit). Accordingly, lttng has a set of commands to manage the events to be monitored, as shown in Table 3.
Table 3. Commands to manage event tracing
list [-k] [-u] - list available kernel (-k) and user-space (-u) events
enable-event [options] - add events to the session. The [options] filter which events should be traced; for example, "-a -k --syscall" adds all kernel syscall events
disable-event [options] - remove events from the session. The [options] filter which events should be removed
add-context -t [type] - attach context information to an event. As of this writing, [type] may be pid, procname, prio, nice, vpid, tid, pthread_id, vtid, ppid, vppid, as well as available PMU events
The listing below shows how to display all the kernel events available for tracing.
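Using the list command from Table 3 (output omitted here):

$ lttng list -k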
If you missed my post last week on the IBM POWER8 Coherent Accelerator Processor Interface, or CAPI, you might want to read that first.
CAPI can also be leveraged as the basis for flash in-memory expansion. An innovative solution designed to reduce latency and speed up access to data, CAPI Flash creates a bigger memory store, which, as we know, is the driving force behind fast access to data.
While CAPI is designed to speed up workloads, CAPI Flash serves to significantly lower operational and deployment costs by creating a faster path to a larger store of memory. As such, CAPI Flash manages to reduce the number of servers required to accomplish the same workloads.
The elimination of I/O overhead, which is inherent to CAPI, also factors in as an essential benefit of CAPI Flash. With CAPI Flash, the number of instructions required to retrieve data is reduced from 20,000 to fewer than 500.
The IBM Data Engine for NoSQL uses CAPI Flash and software from Redis Labs to reduce operational costs by cutting through complexities and creating an entirely new level of efficiency for NoSQL in-memory databases running key value store (KVS) workloads. Most in-memory databases consist of a vast network of costly compute nodes, and since there’s only so much data that can be stored on a node, more data requires more nodes, fueling a steady rise in cost and complexity.
OpenPOWER partner Redis Labs and IBM have partnered on the Data Engine for NoSQL's implementation of CAPI Flash, which attaches flash arrays that put 40 TB of memory onto the POWER CPU – resulting in a level of acceleration that exceeds that of a fast hard drive. The impact on cost, speed, efficiency, and latency is significant. Key among the metrics:
A five-time increase in bandwidth on a per-thread basis,
A three-time reduction in deployment costs, and
A 24 to 1 reduction in nodes.
Concerning the reduction in nodes – which represents just one angle of the complexity that CAPI Flash manages to eliminate – imagine the difference between juggling 24 balls versus one ball that has all of the same functionality. That’s CAPI Flash.
Moving forward, possibly the most significant benefit of CAPI Flash is a concept known as “eventual coherence.”
A key challenge in almost all NoSQL environments is that when data is written to a node that is networked within a database, the data is not instantly available. The new data will not appear on the system until coherency is achieved among all of the nodes. If a node fails before it is replicated throughout the system, the data will be lost.
With CAPI Flash, data appears and is accessible on the system as it is entered. Even if a node fails, the data will remain on the flash.
One final point worth mentioning: while most conversations about CAPI Flash to date have focused on NoSQL, the work that we are doing is applicable to any application that has a large memory footprint.
Learn more about CAPI, and other unique advantages offered by Power Systems scale-out servers, in this paper written by Robert Frances Group, "The IBM Power Scale-out Advantage."
For more information on the IBM Data Engine for NoSQL, read the solution brief.
This month, the PowerLinux team is announcing the biggest technology change in PowerLinux servers since we launched, with the availability of our POWER7+ chips on the platform.
POWER7+ is more than just a speed bump on our POWER7 processors. Our hardware teams have worked hard to increase the flexibility of the platform, bringing balanced performance increases while keeping other factors like energy consumption at bay. Some examples:
We've doubled the memory capacity in servers like the 7R1 and 7R2. We've also doubled the number of virtual machines you can allocate to a single processor core. This means we've dramatically increased the system's flexibility when it comes to deploying virtualized workloads ... in many cases, this will eliminate memory as the gating factor, allowing users to drive utilization rates even higher and boost system efficiency.
We've reduced the feature size in the chips from 45 nm to 32 nm. This is not just a simple die shrink, though ... with every shrink, the chip team has to work even harder to ensure the computational and thermal stability of the chip while driving higher clock speeds. In PowerLinux servers like the 7R2, the new chips now top out at 4.2 GHz.
Because we have more available chip real estate now that we've shrunk the die size, we've bumped up the L3 cache from 4 MB to 10 MB. This significantly boosts performance in workloads that are memory dependent, like Java and big data applications.
Feature additions to POWER7+ allow us to improve chip reliability and boost energy savings. We've added self-healing capabilities and automatic processor reinitialization to increase system robustness, and we've introduced a new energy saving mode that saves 45% more energy than before when the processor is idle.
The new performance capabilities afforded by POWER7+ enable some pretty interesting possibilities when it comes to reducing costs. For example, we've found that people typically need just two dual-socket (16 core) PowerLinux 7R2s to do what it would take three dual-socket (16 core) Xeon servers to do. Given the already competitive pricing on the 7R2s, this means that you can potentially save north of 40% on your costs of acquisition by choosing PowerLinux.
These changes make PowerLinux an ideal platform for the most critical workloads your business runs today, like your customer-facing web applications or your ERP system. Customers like Kwik Fit (PDF) and IT Informatik (PDF) are already realizing the benefits. Click the links to read the case studies on these customers.
But PowerLinux is also a great platform for today's growth workloads, like development and deployment of mobile and web applications. To make it easier for businesses to create and launch these types of client experiences, we're introducing a new solution for WebSphere mobile and web applications that leverages the lightweight WebSphere Liberty Profile software. This is a light, easily reconfigured web app environment that makes it simple for developers to test and deploy applications.
As 2013 progresses, we'll continue to bring you more announcements that improve the PowerLinux platform's ability to reduce costs while improving efficiency, enabling new and growth workloads, and giving you a better overall experience.
In the category of learning something new on a regular basis, over the last week I discovered some commands on Linux running on Power systems which were new to me. Turns out "lparstat" has been implemented, and a colleague here in the LTC pointed out two commands "lscpu" and "lsblk" which I hadn't seen before.
# lparstat -i
Node Name : testsys
Partition Name : lpar1
Partition Number : 1
Type : Dedicated
Mode : Capped
Entitled Capacity : 16.0
Partition Group-ID : 32769
Online Virtual CPUs : 16
Maximum Virtual CPUs : 16
Minimum Virtual CPUs : 1
Online Memory : 130797952 kB
Minimum Memory : 256
Desired Variable Capacity Weight : 0
Minimum Capacity : 1.0
Maximum Capacity : 16.0
Capacity Increment : 1.0
Active Physical CPUs in system : 16
Active CPUs in Pool : 0
Maximum Capacity of Pool : 0.0
Entitled Capacity of Pool : 0
Unallocated Processor Capacity : 0
Physical CPU Percentage : 100
Unallocated Weight : 0
Memory Mode : Shared
Total I/O Memory Entitlement : 134754598912
Variable Memory Capacity Weight : 0
Memory Pool ID : 65535
Unallocated Variable Memory Capacity Weight : 0
Unallocated I/O Memory Entitlement : 0
Memory Group ID of LPAR : 32769
Desired Variable Capacity Weight : 0
lscpu is available, although the socket calculation doesn't match the terminology we typically use on POWER systems; I'll need to follow up on that. In our thinking, the two NUMA nodes are the sockets, and these processors have 8 cores per socket.
# lscpu
Byte Order: Big Endian
On-line CPU(s) list: 0-63
Thread(s) per core: 4
Core(s) per socket: 1
CPU socket(s): 16
NUMA node(s): 2
Hypervisor vendor: pHyp
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 10240K
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
lsblk is available. It provides another view of the block devices on a system.
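For example, on a hypothetical system with two disks (illustrative output only; your device names and sizes will differ):
# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 279.4G  0 disk
├─sda1   8:1    0     8M  0 part
└─sda2   8:2    0 279.4G  0 part /
sdb      8:16   0 279.4G  0 disk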
A wide range of applications in technical areas such as computational vision, chemistry, bioinformatics, molecular biology, engineering and financial analysis use heterogeneous computing systems with general purpose GPU (Graphics Processing Unit) hardware as their high performance platform of choice.
The recently launched IBM Power System S824L pairs the NVIDIA Tesla K40 GPU with the latest IBM POWER8 CPU, providing a unique platform for heterogeneous high performance computing.
The Power S824L system comes with up to 2 Tesla K40 GPU cards (based on the Kepler(TM) architecture), each able to deliver 4.29 teraflops of peak single-precision and 1.43 teraflops of peak double-precision floating-point performance. The Tesla K40 GPU features:
15 SMX (Streaming Multiprocessor) units
4 warp schedulers per SMX (a warp is a group of 32 parallel threads)
ALUs fully compliant with the IEEE 754-2008 standard
64 KB of configurable shared memory and L1 cache per multiprocessor
48 KB of read-only data cache per multiprocessor
1536 KB of L2 cache
12 GB of DRAM (GDDR5)
GPU Boost clocks
2880 CUDA cores in total (192 per multiprocessor)
Support for CUDA compute capability 3.5
Dynamic parallelism (the ability to launch nested CUDA kernels)
Hyper-Q (allows several CPU threads/processes to dispatch CUDA kernels concurrently)
C/C++ CUDA programming support for POWER8 was first introduced with CUDA Toolkit 5.5 for Ubuntu 14.10 ppc64le. As of this writing, version 7.0 is the latest CUDA Toolkit release, and it supports Ubuntu 14.04 ppc64le as well. The toolkit comes with the following tools and libraries that allow development of CUDA applications on Power:
NVCC (NVidia CUDA Compiler) - front-end compiler
CUDA GDB - command line GDB-based debugger
CUDA Memcheck - command line memory and race checker tool
nvprof - command line profiling tool
binary utilities - include cuobjdump and nvdisasm
POWER cross-compilation support (new in CUDA Toolkit 7.0)
GPU-accelerated libraries - many libraries and APIs, for example cuBLAS, cuFFT, cuSPARSE and Thrust
NSight Eclipse Edition - Eclipse-based Integrated Development Environment (IDE)
Because a CUDA application has portions of code that run on the host processor and portions that run on the GPU device, NVCC acts as a front-end compiler driver that simplifies the process of compiling C/C++ code. Either the distro's GCC or the IBM XL C/C++ compiler 13.1.1 (or newer) can be used as the back-end compiler, generating the objects that run on the host processor, while nvcc compiles the portions of code targeting the GPU device.
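As a minimal sketch (the source file name vecadd.cu is hypothetical), a typical compilation on ppc64le might look like the following. nvcc picks up the distro's GCC as the host compiler by default, or you can point it at another host compiler with the -ccbin option (assuming your CUDA release supports XL C/C++ as a host compiler; check the toolkit documentation):
# nvcc -O2 -o vecadd vecadd.cu
# nvcc -O2 -ccbin xlC -o vecadd vecadd.cu
The first command builds with GCC as the back-end compiler; the second selects the IBM XL C/C++ compiler instead.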
This tutorial explains how to create a RAID device on PowerLinux machines using an array of disks. This step-by-step tutorial covers identifying the disks, formatting them, combining them in a RAID array, creating a partition and, finally, creating a file system on that partition.
The PowerLinux machines support a RAID (Redundant Array of Independent Disks) card. A RAID card is a device that combines a set of physical disks into a logical unit to achieve better performance and more data redundancy. A RAID array can also be created by the operating system (known as software-based RAID), which consumes some of the machine's CPU cycles to manage and control the disk array. On the other hand, a RAID card, such as the one embedded in PowerLinux machines, offers hardware-based RAID, meaning that operations on the disk array are offloaded to the RAID card; no CPU cycles are spent managing the disks, which makes it more efficient than software-based RAID solutions.
The RAID adapters on PowerLinux machines support several different RAID protection levels. Depending on the protection level, you get different benefits, such as potentially higher data transfer rates, lower latency and data redundancy compared to a single large disk. You may also want to combine these benefits in the same disk array, which is also possible, depending on the RAID protection level.
Using RAID is usually a trade-off between disk space and redundancy: depending on the RAID protection level, part of the disk space is used to store redundant data and is therefore not available for general use. The space actually available to users varies from 50% to 100% of the total disk space (see the worked example after the list below).
The RAID protection levels supported by most PowerLinux RAID adapters are:
RAID 0: In this configuration, blocks of data are striped across the different disks in the array, so read/write operations can happen in parallel on the disks. There is no fault tolerance, i.e., if one disk fails, all the data is lost. This level usually improves data throughput.
Requires at least 1 disk. In a single-disk RAID 0 configuration, no striping occurs.
RAID 1: At this level, the data is written to 2 disks at the same time. Since both disks hold the same data, a read operation is served by the disk with the smaller latency. If one disk fails, all the data remains preserved on the other disk. Once the failed disk is replaced, the RAID is reconstructed. As expected, this level improves data redundancy.
Requires at least 2 disks. Note: The PowerLinux RAID adapters refer to this RAID level as RAID 10.
RAID 5: At this level, data and parity bits are spread across all the disks. If one disk fails, all the data is still available, since the original data can be reconstructed using parity data from the remaining disks. If more than one disk fails, the data is lost. (After a disk failure, operations may be slower, since data that was on the failed disk has to be reconstructed from parity on each access.) One disk's worth of capacity is consumed for the array's redundancy information.
Requires at least 3 disks.
RAID 6: The same as RAID 5, but up to 2 disks can fail and the data will still be preserved. Two disks' worth of capacity is consumed for the array's redundancy information.
Requires at least 4 disks.
RAID 10: This level combines the best concepts of RAID 1 and RAID 0. In RAID 10, the data is striped across one set of hard disks, and these disks are mirrored to another set of hard disks, so you get very good throughput as well as data redundancy.
Requires at least 2 disks for mirroring and striping. In a two-disk RAID 10 configuration, no striping occurs.
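As a worked example of the space trade-off mentioned earlier, consider four 1 TB disks: RAID 0 yields 4 TB of usable space (100%), RAID 10 yields 2 TB (50%), RAID 5 yields 3 TB (75%, one disk's worth consumed by parity), and RAID 6 yields 2 TB (50%, two disks' worth consumed by parity).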
RAID card on PowerLinux
On the 7R1 and 7R2, the PowerLinux machines come with an embedded RAID controller that supports up to 6 SAS or SSD disks and RAID levels 0, 5, 6 and 10. RAID 1 is also supported as a subset of RAID 10, since the RAID controller allows you to create a RAID 10 array with just two disks. In that case, since there are not enough disks to both mirror and stripe, the data is simply mirrored on both disks rather than striped, which is exactly what RAID 1 does. So the card also supports, in this form, RAID 1.
On Linux, the device is listed as a PCI-E device named "IBM Obsidian-E PCI-E SCSI controller".
To manage this controller, there is a set of tools in the package called iprutils that helps the system administrator create, configure and delete disks and arrays using the RAID controller.
The iprutils package provides the iprconfig application. iprconfig is the tool responsible for configuring the RAID devices on your machine, and it is the tool covered below.
The device driver
The device driver for the PowerLinux RAID controller is named ipr.ko. It is part of the Linux kernel and comes with all the Linux distros supported on PowerLinux, so it's recommended to always use the latest supported kernel version from your distro to get the best from your PowerLinux machine.
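As a quick sanity check (a minimal sketch; the exact strings in the output vary by system and distro), you can confirm that the controller is visible on the PCI-E bus and that the ipr driver is loaded:
# lspci | grep -i obsidian
# lsmod | grep ipr
# modinfo -d ipr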
Using the iprconfig tool
The iprconfig tool is very easy to use. It's a text-based (TUI) application that helps you list and configure the RAID controller and the disks on your system. iprconfig also lets you check the controller log and upgrade the card firmware. An example of the iprconfig screen can be seen in Figure 1.
From here on, a step-by-step tutorial shows how to format a set of disks, combine them into a RAID array, create a partition over this array and, then, create a file system over this partition. This is an easy process that should take less than 30 minutes to accomplish. For this tutorial, we are going to create a RAID 5 device, meaning that data and parity will be striped across the disks in the array.
You can also use iprconfig purely from the command line; in that case, you pass the parameters you want on the command line. For example, to see which RAID levels are supported on a controller (sg5), a command like the following can be used:
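The subcommand name below is an assumption on my part rather than something I have verified, so check the iprconfig man page or iprconfig --help for the exact subcommands your iprutils version provides:
# iprconfig -c query-supported-raid-levels sg5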
To use a disk as part of a disk array, it needs to be formatted specifically for RAID, also known as advanced function mode. If a disk is not formatted this way, you cannot add it to a RAID array. To format a disk in advanced function mode, follow these steps:
Launch the iprconfig tool on a console.
Select the menu Work with disk arrays (as shown in Figure 1)
To do so, press 2 and then Enter.
Select Format device for RAID function
To do so, press 5 and then Enter.
Then select the disks you want to format (all of those that will be part of the disk array) and continue, as shown in Figure 2.
To select the disks, use the up/down arrows and press 1 over the devices you want to select.
Wait until the disks are formatted, as shown in Figure 3. (It takes a few minutes for the disks to be formatted.)
Figure 3: Disks being formatted
Creating a RAID disk array
As explained above, to add a disk to a RAID array, the disk needs to be formatted in RAID mode. Once the disks are formatted in RAID mode, they become available to be added to a disk array, and the RAID device can be created. You can create as many RAID devices as you want, assigning each its own set of disks. Let's go through the process of creating an array device. When creating a RAID array, it is recommended to format the devices for RAID as described in the previous step and then create the disk array following the steps below without exiting the iprconfig tool; exiting the tool discards the knowledge that the disks have just been zeroed, and the array will take longer to initialize.
Launch the iprconfig tool.
Select Work with disk arrays, as shown in Figure 1.
Select Create a disk array.
Select the controller you want (you might have just one controller), as shown in Figure 4.
Select the disks that will be part of the array, as shown in Figure 5.
Press '1' over the disks you want to select.
Select the RAID type, as shown in Figure 6.
Go to the Protect Level field and press 'c' to change the RAID level.
Select the disks that will be part of that disk array.
It may take a while for the array to be created.
Figure 4: Selecting the RAID controller
Figure 6: Select the RAID type
Start using the RAID array
Once the RAID array is created, it becomes a block device like any other block device on the system. You can create a partition on the device, make a file system on it, and start using it. The next steps will help you create a partition and a file system on the array. In this tutorial, I will create just one partition using the whole array and format it with the EXT4 file system, as shown below:
Figure 8: Creating an EXT4 file system on the partition over the RAID device
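A minimal sketch of those steps from the shell, assuming the new array appeared as /dev/sda (a hypothetical name; substitute the block device the array actually received on your system):
# parted /dev/sda mklabel gpt
# parted /dev/sda mkpart primary ext4 0% 100%
# mkfs.ext4 /dev/sda1
# mkdir -p /mnt/raid
# mount /dev/sda1 /mnt/raid
The first two commands label the device and create a single partition spanning the whole array, mkfs.ext4 creates the file system, and mount makes it available under /mnt/raid.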
Once the file system is created on the partition, you can use it as a traditional file system, i.e., mount it on a directory and start using it. All RAID operations, such as striping, parity calculation and mirroring of the data, are offloaded to the RAID card and happen transparently.
Linux Journal just released their 2011 Readers’ Choice Awards. I am very pleased to share in this blog that IBM is the winner in the “Best Linux Server Vendor” category, for the second year in a row.
Every year, Linux Journal invites its readership to cast their vote for their favorite Linux vendor. This year, over 20 server vendors were nominated for the “Best Linux Server Vendor” award including Dell, HP and Sun Microsystems. The awards are announced in the December issue of the Linux Journal.
Shawn Powers, Linux Journal Editor, elaborated:
"IBM proved again this year that it is king of the server room. Based on value, reliability, compatibility and support, IBM beat the competition to be our readers' favorite server vendor.“
IBM's win in this category is a testament to IBM's long-standing commitment to Linux. Eleven years ago, IBM announced a $1 billion investment in Linux, taking the technology from a successful science project to a major force in business IT. Not only was this a turning point for Linux and the Linux community, it was also a pivotal moment in IBM's history. This investment was one of the first times IBM made a decision to embrace open source software and make it core to our business strategy.
Today that tradition continues. IBM is consistently among the top commercial contributors of Linux code as measured regularly by The Linux Foundation's "Who writes Linux" series. Linux also continues to be a fundamental component of IBM business -- embedded deeply in hardware, software, services and internal deployment.
This award is a testament to how capable IBM systems with Linux, such as Watson, really are. You may not be aware that Watson runs Linux on a scale-out Hadoop cluster of Power Systems, optimized for Jeopardy! with natural language technologies from IBM Research. With the Watson experience under our belts, we continue to invest in optimizing new solutions around Power Linux.
Only IBM has the Linux, IT, virtualization, industry, and solution integration expertise to design optimized solutions for our clients’ needs - whether they require emerging new applications like big data or lower operational costs for existing workloads. More than 3900 companies have moved to IBM Power-based systems from competitive platforms over the past five years because we understand the business requirement to maximize IT investments while reducing complexity, risk and costs. Existing clients are migrating or consolidating x86 servers to Power, bringing screaming performance, industry-leading RAS, and efficient virtualization to traditional Linux applications like Web and network infrastructure.
Learn more about what Power Linux and integrated virtualization can do for your business by downloading Edison Group's latest report about scaling out with PowerVM - and get up to 200 times better throughput performance than with VMware on Intel x86-based platforms.
Jean Staten Healy
Director, WW cross-IBM Linux and Open Virtualization
At the recent Hot Chips conference (hotchips.org), IBM presented some details on the upcoming POWER8 processor. Jeff Stuecheli presented some intriguing details on the next generation POWER microprocessor. The Register posted a summary of Jeff's presentation with many of the new features being discussed.
With the OpenPower Consortium announcement, KVM on Power plans, and a new microprocessor under development, there are plenty of things to anticipate and prepare for around Linux and POWER systems.
The Power Instruction Set Architecture has been re-published (Version 2.07) with many of the POWER8 features expected, allowing infrastructure products and teams to begin early preparations for the next generation microprocessor.
IBM's Advance Toolchain has just been updated with a new GCC version and the enabling of the new POWER8 instructions. New instructions can be coded and compiled with the Advance Toolchain (Version 7.0).
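As a hedged example (the /opt/at7.0 path follows the Advance Toolchain's usual installation convention but may differ on your system, and myapp.c is a hypothetical source file), targeting the new processor is a matter of compiling with the POWER8 CPU flags:
# /opt/at7.0/bin/gcc -O2 -mcpu=power8 -mtune=power8 -o myapp myapp.c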