Water-cooled copper way more efficient: IBM has unveiled the Power6-based "Hydro-Cluster" Power 575 supercomputer, which uses water-cooled copper plates above each microprocessor to remove heat. The system is five times faster than previous models, with a three-fold increase in energy efficiency. Estimates are that data centers using the system would need 80 percent fewer air-conditioning units and 40 percent less power. The Power 575 packs 448 Power6 cores in a single rack, with 3.5 terabytes of memory divided among 14 nodes. More on this at the May 28 ITherm conference, the international conference for scientific and engineering exploration of thermal, thermomechanical, and emerging technology issues associated with electronic devices, packages, and systems.
Cell Broadband Engine/Power Architecture notebook
This blog-based column looks at some of the more interesting problems and challenges posed recently in the Cell Broadband Engine Architecture forum.
Perif wants to know how to use ALF with multiple inputs: I'd like to use ALF, and I have a simple question about using different data inputs. I'd like to use two matrices as inputs, but I can't be sure that they are in contiguous memory on the host. I don't really understand how SPE applications can access those matrices, as in the matmul example of the ALF programming guide:
cnt = p_parm->h * p_parm->v / 4;
sa = (vector float *) p_input_buffer;
sb = sa + cnt;
My question is: how can I use multiple different input data sets (like tables or vectors) with ALF on the accelerator side? Do I need multiple input buffers?
[Perif]: Oh wait! I just got it: I needed to understand the function of buffers in ALF.
[Editor]: If you need to understand the basic function of buffers in ALF, try these resources:
Back to the question.
[donm]: Does that mean you are having no other problems? I'll throw this out in case you find it helpful:
Data need not be contiguous in the host memory. For example, say you are transferring a 32x32 portion of a larger matrix. Each row of 32 values will be contiguous, but then you have to skip some values to get down to subsequent rows. The answer? Make a DMA transfer list that does 32 transfers. ALF provides macros for the accelerator side to make this a bit easier -- you just have to calculate the offset to the beginning of each row.
The net result? You will have an input buffer with a 32x32 chunk of contiguous data now residing in the Local Store on the SPE.
If you are transferring multiple such matrices, then you are responsible for passing enough information in the parameter context buffer to "know" how much space each one takes up in the input buffer on the accelerator side. Hope this helps.
[Perif]: Thanks for your answer :) I have just one more question -- do you have an example of DMA transfer list creation and use on the accelerator side? If I have a 64x65 matrix and my buffer is defined as 4x4, how is the transfer handled by ALF for the last line? Do I have to define the size of my buffer to scale with the size of my data input (here, the matrix)?
[donm]: Some examples are found in /opt/cell/sdk/prototype/src/alf-examples-source.tar as well as the manual in /opt/cell/sdk/docs/lib/ALF_Prog_Guide_API_v3.0.pdf. Look for
Yes, your buffer on the SPU should be sized according to the size passed down from the PPU. I may be misinterpreting your question, though. If you have a 64x65 matrix (of single-precision floats?), then your buffer on the SPU should be the same size. Here you are transferring the entire matrix instead of doing it row-by-row, so it is even easier -- just one macro call to transfer the entire thing.
bmayer wants to know about compiling simple vector code: I am trying to make a simple test code to compute the
#include <massv.h>
#include <simdmath/expf4.h>
CC=ppu32-g++
SPUCC=spu-g++
OPTS=
SPUOPTS=-lmass_simd -lmass -lmassv -fno-rtti -fno-exceptions
[donm]: I think you may need
[IBM SDK Service Administrator]: I think if you're including simdmath/expf4.h you need to call it as
[bmayer]: Neither worked. The only thing I am doing is using a Makefile and typing
CC=ppu32-g++
SPUCC=spu-g++
OPTS=
SPUOPTS=-lmass_simd -lsimdmath -lmass -lmassv -fno-rtti -fno-exceptions
and the updated code (only the loop changed)
#include <massv.h>
#include <simdmath/expf4.h>
[IBM SDK Service Administrator]: Ah, I see it. You are including expf4.h but calling
JohnSchneider-MentorGraphics wants to know if anyone needs a good RTOS for Cell/B.E. platforms: Does anyone need a hard real-time OS for Cell? We are getting some interesting results when benchmarking Nucleus OS vs Linux on QS21s. http://www.mentor.com/products/embedded_software/cpu/cell-be.cfm
[jfenton]: Sounds interesting, it was only a matter of time before an RTOS was ported :) Are you able to publish your initial benchmarks?
[JohnSchneider-MentorGraphics]: Mentor Graphics has been working with the Cell group at IBM to confirm the benchmarks before we publish them; otherwise it would be a little one-sided. I've been mainly working on the LIBSPE2 aspect of the system (i.e., comparing loading times, DMA latencies as a function of size and direction, etc.) and on the pure Nucleus vs. Linux benchmarks for various POSIX context-switch operations (pthread, queue, semaphore, etc.).
If we have enough interest, we can hold a webinar on what we are doing in this space and the initial benchmarking results. If you are interested, either respond to this thread or send a private message.
[billcode]: For applications written using SDK 3.0, what level of effort is required to port? How similar are your CBE APIs?
[JohnSchneider-MentorGraphics]: We haven't touched the SPE side; therefore, it's all the same for the SPE. We have a solid pthread API for the PPE and an identical LIBSPE2 API. Nucleus has a different entry point from your typical Linux
Khushboo-Sancheti wants to know about running a SPE thread on a particular SPE: Is there a way to specify the SPE on which you want a particular context to be run?
[lowellns]: I think the closest you can get is using
[IBM SDK Service Administrator]: See /opt/cell/sdk/src/benchmarks/dma/dmabench.c for examples of using
[Khushboo-Sancheti]: So, if it is not possible to specify a particular SPE on which a thread should run, how do we use the pipelining and the specific-SPEs-for-specific-tasks programming model? I am essentially trying to start two threads on a particular SPE and see if I can implement application-initiated yielding between the two threads -- I understand this is not possible. Please confirm.
[lowellns]: I'm fairly certain that the SPEs do not have thread support. As in, you can't run multiple threads on one SPE. So pipelining is implemented by taking what would be one large job, breaking it up into intermediate stages, and creating an SPE program for each. You then load each program onto a different SPE, and data is passed through them (by way of communication and DMA).
gshi wants to know about getting benchmark code that can measure a Cell/B.E. processor's single and double precision FLOPS: Well, can I?
[IBM SDK Service Administrator]: If you extract demos_source.tar from /opt/cell/sdk/src you'll get the Matrix Multiplication workload, which can provide statistics like this:
# /opt/cell/sdk/src/demos/matrix_mul/matrix_mul -p -s 16 -m 8192
Initializing Arrays ... done
Running test ... done
Performance Statistics:
  number of SPEs     = 16
  execution time     = 3.02 seconds
  computation rate   = 364.05 GFlops/sec
  data transfer rate = 22.84 GBytes/sec
Distributed earthquake monitor project coming soon: UCRiverside's Quake-Catcher Network distributed seismic monitoring system (it could almost be called Earthquake@home) should soon be a reality -- it uses the accelerometers already installed in laptops (the ones that protect hard disk drives from sudden impacts) to monitor for seismic activity. There are about 300 (Mac-only) testers worldwide at this time -- plans are to spread the system to PC laptops and pluggable USB dongles by Summer 2008.
There are also other similar efforts in this area: UHouston and IBM researchers (in a consortium called the Mission-Oriented Seismic Research Program or MOSRP) are exploring the use of Cell/B.E. systems as a tool to re-think and refine seismic-processing algorithms; Panorama Technologies is busy optimizing popular seismic-imaging algorithms to run directly on the PS3.
There's a new solar material in town: DOE's National Renewable Energy Lab has brought copper indium gallium selenide, or CIGS, within reach of silicon's ability to capture energy from sunlight. The CIGS material now converts 19.9 percent of received sunlight into power; the previous CIGS mark was 19.5 percent, and silicon's record is 20.3 percent. CIGS technology is generally less expensive than versions relying on polysilicon. (This blog has covered a lot of the advances in solar energy resources.)
New CMOS eliminates quartz: Mobius Microsystems debuts the CMOS Harmonic Oscillator (CHO), technology that can do away with the need for quartz crystals in many applications. The company has integrated an oscillator onto an ordinary complementary metal oxide semiconductor (CMOS) chip, essentially eliminating a moving part from the IC. The company notes that the CHO technology isn't as accurate as quartz oscillators, but it includes some circuitry to compensate for that. Your application has to be able to tolerate timing inaccuracies of 100 parts per million (like the timing signals for the serial PCI Express peripheral bus and for USB devices such as serial hard disk drives, flat-panel displays, and printers).
JIT method debugs RTL memory from SW POV: UCSB/ARM researchers like co-simulation (which provides an integrated SoC design platform to eliminate most design errors at an early stage) -- it is faster than hardware-based simulation, where the communication and synchronization overhead between the hardware and software simulators can be huge. To make it work better, the researchers have developed a just-in-time shadow memory technique that allows better debugging of RTL memory from a software perspective. Key to the discovery: a message queue-based communication backplane alleviates communication overhead better than many of the alternatives.
On-chip coolant systems: Two Nextreme Thermal Solutions technologists are proposing a new approach for electronic thermal management that focuses on delivering appropriate cooling when and where it is needed using thin-film thermoelectric coolers (TF-TECs). In this article, they not only discuss issues surrounding the effective integration of this technology into the chip process, but they also talk about the need to quickly and accurately model the behavior of this new circuit element and what existing algorithms and potential hybrid approaches could solve this problem (including one that separates the computational ability of finite element analysis and increased mesh density, eliminating the increased density aspect that typically accompanies thermoelectric simulations). An excellent design article.
How long until I get my realistic VR?: Brookhaven lab's Michael McGuigan says that VR worlds indistinguishable from reality are just a few years away because the basic elements already exist:
That probably means still a ways off.
Toshiba's little smart remote robot: The Toshiba ApriPoko robot is a lovely little 11-inch bird robot that is really a voice-activated remote control incorporating "artificial intelligence" -- if you start changing TV channels, it wakes up and asks what you're doing. It memorizes the IR codes coming from your remote. After that, all you have to do is tell it what you want (from your TV).
Need toner? This printer will make it: The open source RepRap (Replicating Rapid-prototyper) printer can replicate and update itself by printing its own parts. It builds up layers of polylactic acid plastic (PLA, a bio-degradable polymer made from lactic acid) -- the difference between this concept and other 3D replicators in existence is that this machine is designed to be less expensive and the development team is giving away the design.
Hyper-entangled photons claim bit-encoding record: UIllinois researchers claim the new record for encoding information onto a photon -- 1.63 bits (out of a theoretical limit of 2 bits; the limit may eventually be raised by combining techniques and technologies). The previous record was 1.18. The scientists upped the limit by using hyper-entanglement.
Classic encoding calls for 1 bit per photon. Normal quantum entanglement is a quantum mechanical phenomenon in which the quantum states of two or more objects have to be described with reference to each other, even though the individual objects may be spatially separated. In hyper-entanglement, the photon's "wiggle" polarization and its "twisting" orbital angular momentum are both used as descriptions, offering more "entanglement" for the effort. In order to create dual-parameter hyperentanglement, spontaneous parametric down conversion is performed on a pair of nonlinear crystals -- you then change the polarization to transfer data from one photon to the other by applying birefringent phase shifts using liquid crystals.
How to build a stretchy chip: UIllinois researcher John Rogers says that integrating silicon processors and the human body (as in biomedical chips) is going to take a processor-producing method that allows for the chips to be bendable and stretchy. What he and others have built is an accordion-like structure of very thin silicon bonded to rubber, like this:
Bendable silicon chips.
Research details potential gotcha in nanotube use: CornellU, HarvardU, and Weizmann Institute physicists have discovered that electron spin in a carbon nanotube interacts with the electron's orbit, meaning researchers will have to change the way they read out spin. But they think they may have that problem solved: manipulate the spin by manipulating the orbit. In a carbon nanotube, free electrons don't orbit individual atoms as they do in ordinary materials; they orbit around the circumference of the tube. Physicists had thought that the four possible states of an electron (spin up or down, orbit clockwise or counterclockwise) were equivalent. However, this experiment shows that changing the orbit direction changes the spin and vice versa. This means that nanotube design rules may have to change a bit.
Advancing science and recouping costs: Many scientists cannot afford the cost of equipment and maintenance to develop new nanotechnology products. Well, they're in luck. Many established nano-labs have discovered a funding source for themselves -- rental fees to other developers. (And some even waive the fees if the work will be non-proprietary or if it fits a need they have.) So far, there is
Don't toss old technologies: This NY Times article chronicles old predictions (like, the last mainframe would die off by 1996) and paints a profile of technology attributes that allow technologies to survive the test of time:
Harvard business historian Richard Tedlow points out that when predicting the life of a technology, pure technical innovation is usually overestimated and business judgment and the weight of legacy is usually underestimated, but that the "rise and fall of technologies is mainly about business and not technological determinism." Another observer, John Steele Gordon, notes that technologies have to evolve to survive and says that there are similarities in the business and biological evolutionary systems. The article points to radio's adaptation at the rise of television and notes how IBM managed to adapt the mainframe to changing economic conditions.
Predictions of the death of silicon: With the previous article in mind, researchers at the recent Institute of Physics Condensed Matter and Material Physics conference think the silicon chip may be unable to sustain its recent pace of gains in power and speed. Estimates of its demise run as short as four years.
Laptops ... of ... tomorrow!: Here are some concept models of the laptops (and notebook considerations) of the far future of 2015:
And some broader trends in future laptopping:
Still, I bet I'll be using the laptop I'm using now (unless it has a built-in obsolescence destruction device I don't know about).
How high-level parallel programming model simplifies multicore design: Dr. Michael McCool, co-founder and chief scientist of RapidMind and a researcher at UWaterloo (and possessor of a great last name), wrote this article for EE Times detailing why parallelized software programming is a necessity for leveraging multicore performance (in essence, multicore processor performance continues to scale exponentially, but clock rates are no longer significantly scaling). An excellent thoughtpiece. (For more, Dr. McCool also wrote "Core partners, Part 1: Build high-performance apps for multicore processors" (developerWorks, May 2007), which shows how the RapidMind single-source development platform uses these principles to develop parallel Cell/B.E. applications.)