More efficient memory-management scheme could help enable chips with thousands of cores

In a modern, multicore chip, every core -- or processor -- has its own small memory cache, where it stores frequently used data. But the chip also has a larger, shared cache, which all the cores can access.

If one core tries to update data in the shared cache, other cores working on the same data need to know. So the shared cache keeps a directory of which cores have copies of which data. 

That directory takes up a significant chunk of memory: In a 64-core chip, it might be 12 percent of the shared cache. And that percentage will only increase with the core count. Envisioned chips with 128, 256, or even 1,000 cores will need a more efficient way of maintaining cache coherence.

At the International Conference on Parallel Architectures and Compilation Techniques in October, MIT researchers unveil the first fundamentally new approach to cache coherence in more than three decades. Whereas with existing techniques, the directory's memory allotment increases in direct proportion to the number of cores, with the new approach, it increases according to the logarithm of the number of cores.

In a 128-core chip, that means that the new technique would require only one-third as much memory as its predecessor. With Intel set to release a 72-core high-performance chip in the near future, that's a more than hypothetical advantage. But with a 256-core chip, the space savings rises to 80 percent, and with a 1,000-core chip, 96 percent.
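The scaling argument can be sketched numerically. The toy model below is illustrative only, not the paper's actual storage accounting: a classic full-map directory keeps one presence bit per core for every tracked cache line, so its per-entry cost grows linearly, while a logical timestamp wide enough to distinguish the cores grows only logarithmically.

```python
import math

def fullmap_directory_bits(cores):
    """One presence bit per core per tracked line: O(N) growth."""
    return cores

def timestamp_bits(cores):
    """Tardis-style logical timestamps: O(log N) growth.
    (Toy model; the real per-entry budget also includes constant terms,
    which is why the article's savings figures differ.)"""
    return math.ceil(math.log2(cores))

for n in (64, 128, 256, 1024):
    print(f"{n:5d} cores: full-map {fullmap_directory_bits(n):4d} bits "
          f"vs O(log N) {timestamp_bits(n):2d} bits per entry")
```

The gap between the two columns widens as the core count grows, which is the scaling behavior the article describes.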

When multiple cores are simply reading data stored at the same location, there's no problem. Conflicts arise only when one of the cores needs to update the shared data. With a directory system, the chip looks up which cores are working on that data and sends them messages invalidating their locally stored copies of it.

"Directories guarantee that when a write happens, no stale copies of the data exist," says Xiangyao Yu, an MIT graduate student in electrical engineering and computer science and first author on the new paper. "After this write happens, no read to the previous version should happen. So this write is ordered after all the previous reads in physical-time order."

Time travel

What Yu and his thesis advisor -- Srini Devadas, the Edwin Sibley Webster Professor in MIT's Department of Electrical Engineering and Computer Science -- realized was that the physical-time order of distributed computations doesn't really matter, so long as their logical-time order is preserved. That is, core A can keep working away on a piece of data that core B has since overwritten, provided that the rest of the system treats core A's work as having preceded core B's.

The ingenuity of Yu and Devadas' approach is in finding a simple and efficient means of enforcing a global logical-time ordering. "What we do is we just assign time stamps to each operation, and we make sure that all the operations follow that time stamp order," Yu says.

With Yu and Devadas' system, each core has its own counter, and each data item in memory has an associated counter, too. When a program launches, all the counters are set to zero. When a core reads a piece of data, it takes out a "lease" on it, meaning that it increments the data item's counter to, say, 10. As long as the core's internal counter doesn't exceed 10, its copy of the data is valid. (The particular numbers don't matter much; what matters is their relative value.)

When a core needs to overwrite the data, however, it takes "ownership" of it. Other cores can continue working on their locally stored copies of the data, but if they want to extend their leases, they have to coordinate with the data item's owner. The core that's doing the writing increments its internal counter to a value that's higher than the last value of the data item's counter.

Say, for instance, that cores A through D have all read the same data, setting their internal counters to 1 and incrementing the data's counter to 10. Core E needs to overwrite the data, so it takes ownership of it and sets its internal counter to 11. Its internal counter now designates it as operating at a later logical time than the other cores: They're way back at 1, and it's ahead at 11. The idea of leaping forward in time is what gives the system its name -- Tardis, after the time-traveling spaceship in the British science fiction series "Doctor Who."

Now, if core A tries to take out a new lease on the data, it will find it owned by core E, to which it sends a message. Core E writes the data back to the shared cache, and core A reads it, incrementing its internal counter to 11 or higher. 
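The lease-and-ownership mechanics described above can be sketched in a few lines. This is a loose illustration of the idea, not the published Tardis protocol; the class and field names are invented for the example.

```python
class Line:
    """A shared data item carrying Tardis-style timestamps."""
    def __init__(self):
        self.rts = 0      # lease expiry: latest logical time a reader may use its copy
        self.wts = 0      # logical time of the last write
        self.value = None
        self.owner = None

class Core:
    def __init__(self, name):
        self.name = name
        self.ts = 0       # the core's internal logical clock

    def read(self, line, lease=10):
        # A read is ordered after the last write in logical time...
        self.ts = max(self.ts, line.wts)
        # ...and extends the line's lease; the local copy stays valid
        # as long as this core's clock does not exceed line.rts.
        line.rts = max(line.rts, self.ts + lease)
        return line.value

    def write(self, line, value):
        # The writer jumps ahead of every outstanding lease, ordering
        # the write after all earlier reads in logical time.
        self.ts = max(self.ts + 1, line.rts + 1)
        line.wts = line.rts = self.ts
        line.value = value
        line.owner = self

# Re-creating the article's example:
x = Line()
readers = [Core(c) for c in "ABCD"]
for c in readers:
    c.read(x)          # the leases push x's counter up to 10
e = Core("E")
e.write(x, "new")      # E leaps ahead in logical time and takes ownership
print(e.ts, x.wts)     # 11 11
```

Note that the readers' internal clocks never move: they keep working at an earlier logical time, exactly the "time travel" the article describes.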

Unexplored potential

In addition to saving space in memory, Tardis also eliminates the need to broadcast invalidation messages to all the cores that are sharing a data item. In massively multicore chips, Yu says, this could lead to performance improvements as well. "We didn't see performance gains from that in these experiments," Yu says. "But that may depend on the benchmarks" -- the industry-standard programs on which Yu and Devadas tested Tardis. "They're highly optimized, so maybe they already removed this bottleneck," Yu says.

Fig. 1. Schematic of the electrically pumped active hybrid plasmonic waveguide and energy density distribution of the surface plasmon field.

Researchers from the Laboratory of Nanooptics and Plasmonics at the MIPT Center of Nanoscale Optoelectronics have developed a new method for optical communication on a chip, which could shrink optical and optoelectronic elements and increase computer performance several tens of times. In an article published in Optics Express, they propose a way to completely eliminate the energy losses of surface plasmons in optical devices.

"Surface plasmon polaritons have previously been proposed as information carriers for optical communication, but the problem is that the signal is rapidly attenuated as it propagates along plasmonic waveguides. Now, we have come very close to completely solving this problem. Our approach clears the way for the development of a new generation of high-performance optoelectronic chips," says Dmitry Fedyanin, who led the research.

Modern electronics uses electrons as information carriers, but they no longer meet contemporary requirements: standard copper wires and channels on chips cannot transfer information at speeds sufficient for modern microprocessors. This currently hinders the growth of microprocessor performance, so groundbreaking new technologies are required to maintain Moore's law.

Switching from electrical to optical pulses could solve the problem. The high frequency of light waves (hundreds of terahertz) allows more data to be transferred and processed, making higher performance possible. Fiber-optic technologies are widely used in communication networks, but the use of light in microprocessors and logic elements runs into the diffraction limit: waveguides and other optical elements cannot be made significantly smaller than the wavelength of the light. For the near-infrared radiation used in optical communications, that is on the order of a micrometer, far too large for contemporary electronics, whose logic elements measure tens of nanometers. “Optical electronics” can become competitive only if light is “compressed” to this scale.

Overcoming the diffraction limit is possible by switching from photons to surface plasmon polaritons -- collective excitations that emerge from the interaction between photons and electron oscillations at the boundary between a metal and an insulator. They are also called quasiparticles because their properties are quite similar to those of ordinary particles such as photons or electrons. Unlike three-dimensional light waves, surface plasmon polaritons are confined to the boundary between the two media, which makes it possible to move from conventional three-dimensional optics to two-dimensional optics.

"Roughly speaking, a photon occupies a certain volume in space, which is of the order of the light wavelength. We can “compress” it by transforming it into a surface plasmon polariton. Using this approach, we can improve the integration density and reduce the size of optical elements. Unfortunately, this brilliant solution has a flip side. For the surface plasmon polariton to exist, a metal -- or, more specifically, an electron gas in the metal -- is needed. This leads to excessively high Joule losses, similar to those that arise when current is passed through metal wires or resistors," says Dr. Fedyanin.


Fig. 2. Nanoscale plasmonic waveguides under the scanning electron microscope.

According to him, the surface plasmon energy drops a billion times over a distance of around one millimeter due to absorption in the metal, which effectively makes the practical use of surface plasmons pointless.
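To put that figure in more familiar units, a billion-fold power drop over a millimeter corresponds to 90 dB/mm of loss, or a 1/e propagation length of only about 50 micrometers. A quick back-of-envelope check:

```python
import math

# The figure quoted above: power drops by a factor of 1e9 over ~1 mm.
drop = 1e9
distance_mm = 1.0

loss_db_per_mm = 10 * math.log10(drop) / distance_mm          # 90 dB/mm
propagation_length_um = distance_mm * 1000 / math.log(drop)   # 1/e decay length

print(f"{loss_db_per_mm:.0f} dB/mm, 1/e length ~ {propagation_length_um:.0f} um")
```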

"Our idea is to compensate for the surface plasmon propagation losses by pumping extra energy into surface plasmon polaritons. It should also be noted that, if we want to integrate plasmonic waveguides on a chip, we can use only electrical pumping," explains the researcher.

Together with his colleagues Dmitry Svintsov and Aleksey Arsenin from the Laboratory of Nanooptics and Plasmonics, he has developed a new method of electrically pumping plasmonic waveguides based on a metal-insulator-semiconductor (MIS) structure and has carried out simulations of it. The results show that passing relatively weak pump currents through nanoscale plasmonic waveguides can fully compensate for the surface plasmon propagation losses. This means a signal can be transmitted over long distances (by chip standards) with no losses, while the integration density of such active plasmonic waveguides remains an order of magnitude higher than that of photonic waveguides.


Fig. 3. Operating principle of the proposed electrical pumping scheme.

"Working in optoelectronics, we always need to find a compromise between optical and electrical properties, whereas in plasmonics it is almost impossible, since the choice of metals is limited to three or four materials. The main advantage of the proposed pumping scheme is that it doesn't depend on the properties of the metal-semiconductor contact. For each semiconductor, we can find an appropriate insulator, which makes it possible to achieve the same efficiency level as in double-heterostructure lasers. At the same time, we are able to maintain the typical plasmonic structure size at a level of 100 nanometers," says Fedyanin.

The researchers note that their results still await experimental verification, but the key difficulty has been eliminated.

More information: D.A. Svintsov, A.V. Arsenin, D.Yu. Fedyanin, Full loss compensation in hybrid plasmonic waveguides under electrical pumping, Optics Express 23, 19358-19375 (2015).


New network design exploits cheap, power-efficient flash memory without sacrificing speed

Random-access memory, or RAM, is where computers like to store the data they're working on. A processor can retrieve data from RAM tens of thousands of times more rapidly than it can from the computer's disk drive.

But in the age of big data, data sets are often much too large to fit in a single computer's RAM. The data describing a single human genome would take up the RAM of somewhere between 40 and 100 typical computers.

Flash memory -- the type of memory used by most portable devices -- could provide an alternative to conventional RAM for big-data applications. It's about a tenth as expensive, and it consumes about a tenth as much power.

The problem is that it's also a tenth as fast. But at the International Symposium on Computer Architecture in June, MIT researchers presented a new system that, for several common big-data applications, should make servers using flash memory as efficient as those using conventional RAM, while preserving their power and cost savings.

The researchers also presented experimental evidence showing that, if the servers executing a distributed computation have to go to disk for data even 5 percent of the time, their performance falls to a level that's comparable with flash, anyway.

In other words, even without the researchers' new techniques for accelerating data retrieval from flash memory, 40 servers with 10 terabytes' worth of RAM couldn't handle a 10.5-terabyte computation any better than 20 servers with 20 terabytes' worth of flash memory, which would consume only a fraction as much power.
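That claim follows from simple expected-latency arithmetic. The sketch below uses round, assumed latency ratios (RAM = 1, flash = 10x, disk = 10,000x, loosely matching the ratios quoted in the article), not measured numbers from the paper:

```python
# Illustrative back-of-envelope only; latencies are assumed round numbers.
t_ram   = 1.0        # normalized RAM access time
t_flash = 10.0       # article: flash is "about a tenth as fast" as RAM
t_disk  = 10_000.0   # article: disk is "tens of thousands" of times slower

def avg_latency(miss_rate_to_disk):
    """Average access time when a fraction of requests fall through to disk."""
    return (1 - miss_rate_to_disk) * t_ram + miss_rate_to_disk * t_disk

print(avg_latency(0.05))   # ~500x RAM: the rare disk accesses dominate
print(t_flash)             # all-flash: a flat 10x RAM, with no disk tail
```

Even a 5 percent miss rate to disk drags the RAM cluster's average latency far below flash speed, which is why adding more RAM servers stops paying off.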

"This is not a replacement for DRAM [dynamic RAM] or anything like that," says Arvind, the Johnson Professor of Computer Science and Engineering at MIT, whose group performed the new work. "But there may be many applications that can take advantage of this new style of architecture. Which companies recognize: Everybody's experimenting with different aspects of flash. We're just trying to establish another point in the design space."

Joining Arvind on the new paper are Sang Woo Jun and Ming Liu, MIT graduate students in computer science and engineering and joint first authors; their fellow grad student Shuotao Xu; Sungjin Lee, a postdoc in Arvind's group; Myron King and Jamey Hicks, who did their PhDs with Arvind and were researchers at Quanta Computer when the new system was developed; and one of their colleagues from Quanta, John Ankcorn -- who is also an MIT alumnus.

Outsourced computation

The researchers were able to make a network of flash-based servers competitive with a network of RAM-based servers by moving a little computational power off of the servers and onto the chips that control the flash drives. By preprocessing some of the data on the flash drives before passing it back to the servers, those chips can make distributed supercomputation much more efficient. And since the preprocessing algorithms are wired into the chips, they dispense with the computational overhead associated with running an operating system, maintaining a file system, and the like.
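The benefit of preprocessing near the data can be sketched with a toy filter query. The functions below are purely illustrative and stand in for logic that, in the real system, is wired into the FPGA flash controllers:

```python
records = list(range(1000))   # stand-in for data resident on a flash drive

def server_side_filter(storage, pred):
    """Conventional path: move all the data to the host, then filter there."""
    moved = list(storage)                   # traffic ~ size of the whole dataset
    return [r for r in moved if pred(r)], len(moved)

def near_data_filter(storage, pred):
    """Accelerator path: filter inside the storage controller and
    move only the matching records to the host."""
    hits = [r for r in storage if pred(r)]
    return hits, len(hits)                  # traffic ~ size of the result

pred = lambda r: r % 100 == 0               # a selective predicate (1% match rate)
full, moved_a = server_side_filter(records, pred)
hits, moved_b = near_data_filter(records, pred)
print(moved_a, moved_b)                     # 1000 10
```

Both paths return the same result, but the near-data path ships two orders of magnitude less data, which is the efficiency the researchers exploit.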

With hardware contributed by some of their sponsors -- Quanta, Samsung, and Xilinx -- the researchers built a prototype network of 20 servers. Each server was connected to a field-programmable gate array, or FPGA, a kind of chip that can be reprogrammed to mimic different types of electrical circuits. Each FPGA, in turn, was connected to two half-terabyte -- or 500-gigabyte -- flash chips and to the two FPGAs nearest it in the server rack.

Because the FPGAs were connected to each other, they created a very fast network that allowed any server to retrieve data from any flash drive. They also controlled the flash drives, which is no simple task: The controllers that come with modern commercial flash drives have as many as eight different processors and a gigabyte of working memory.

Finally, the FPGAs also executed the algorithms that preprocessed the data stored on the flash drives. The researchers tested three such algorithms, geared to three popular big-data applications. One is image search, or trying to find matches for a sample image in a huge database. Another is an implementation of Google's PageRank algorithm, which assesses the importance of different Web pages that meet the same search criteria. And the third is an application called Memcached, which big, database-driven websites use to store frequently accessed information.

Chameleon clusters

FPGAs are about one-tenth as fast as purpose-built chips with hardwired circuits, but they're much faster than central processing units using software to perform the same computations. Ordinarily, either they're used to prototype new designs, or they're used in niche products whose sales volumes are too small to warrant the high cost of manufacturing purpose-built chips.

But the MIT and Quanta researchers' design suggests a new use for FPGAs: A host of applications could benefit from accelerators like the three the researchers designed. And since FPGAs are reprogrammable, they could be loaded with different accelerators, depending on the application. That could lead to distributed processing systems that lose little versatility while providing major savings in energy and cost.

This rendering depicts a new "plasmonic oxide material" that could make possible devices for optical communications that are at least 10 times faster than conventional technologies. (Purdue University image/Nathaniel Kinsey)

Researchers have created a new "plasmonic oxide material" that could make possible devices for optical communications that are at least 10 times faster than conventional technologies.

In optical communications, laser pulses are used to transmit information along fiber-optic cables for telephone service, the Internet and cable television.

Researchers at Purdue University have shown how an optical material made of aluminum-doped zinc oxide (AZO) is able to modulate – or change – how much light is reflected by 40 percent while requiring less power than other "all-optical" semiconductor devices.

"Low power is important because if you want to operate very fast - and we show the potential for up to a terahertz or more - then you need low energy dissipation," said doctoral student Nathaniel Kinsey. "Otherwise, your material would heat up and melt when you start pushing it really fast. All-optical means that unlike conventional technologies we don't use any electrical signals to control the system. Both the data stream and the control signals are optical pulses."

Being able to modulate the amount of light reflected is necessary for potential industrial applications such as data transmission.

"We can engineer the film to provide either a decrease or an increase in reflection, whatever is needed for the particular application," said Kinsey, working with a team of researchers led by Alexandra Boltasseva, an associate professor of electrical and computer engineering, and Vladimir M. Shalaev, scientific director of nanophotonics at Purdue's Birck Nanotechnology Center and a distinguished professor of electrical and computer engineering. "You can use either an increase or a decrease in the reflection to encode data. It just depends on what you are trying to do. This change in the reflection also results in a change in the transmission."

Findings were detailed in a research paper appearing in July in the journal Optica, published by the Optical Society of America.

The material has been shown to work in the near-infrared range of the spectrum, which is used in optical communications, and it is compatible with the complementary metal–oxide–semiconductor (CMOS) manufacturing process used to construct integrated circuits. Such a technology could bring devices that process high-speed optical communications.

The researchers have proposed creating an "all optical plasmonic modulator using CMOS-compatible materials," or an optical transistor.

In electronics, silicon-based transistors are critical building blocks that switch power and amplify signals. An optical transistor could perform a similar role for light instead of electricity, bringing far faster systems than now possible.

The Optica paper, featured on the cover of the journal, was authored by Kinsey, graduate students Clayton DeVault and Jongbum Kim; visiting scholar Marcello Ferrera from Heriot-Watt University in Edinburgh, Scotland; Shalaev and Boltasseva.

Exposing the material to pulsing laser light causes electrons to move from one energy level, called the valence band, to a higher energy level, called the conduction band. As the electrons move to the conduction band they leave behind "holes" in the valence band, and eventually the electrons recombine with these holes.

The switching speed of transistors is limited by how long it takes conventional semiconductors such as silicon to complete this cycle: light is absorbed, electrons are excited, holes are produced, and the electrons and holes recombine.

"So what we would like to do is drastically speed this up," Kinsey said.

This cycle takes about 350 femtoseconds to complete in the new AZO films, which is roughly 5,000 times faster than crystalline silicon and so fleeting that light travels only about 100 microns, or roughly the thickness of a sheet of paper, in that time.
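The distance figure is easy to verify from the cycle time and the speed of light:

```python
# Checking the figure above: how far light travels in 350 femtoseconds.
c = 299_792_458            # speed of light in vacuum, m/s
t = 350e-15                # AZO recombination cycle time, s

distance_um = c * t * 1e6  # convert meters to micrometers
print(f"{distance_um:.0f} um")   # ~105 um, about the thickness of a sheet of paper
```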

"We were surprised that it was this fast," Kinsey said.

The increase in speed could translate into devices at least 10 times faster than conventional silicon-based electronics.

The AZO films are said to be "epsilon-near-zero," meaning their permittivity -- and with it their refractive index -- is near zero, a quality found normally in metals and in new "metamaterials," which contain features, patterns or elements that enable unprecedented control of light by harnessing clouds of electrons called surface plasmons. Unlike natural materials, metamaterials can reduce the index of refraction to less than one or even less than zero. Refraction occurs as electromagnetic waves, including light, bend when passing from one material into another. Each material has its own refractive index, which describes how much light will bend in that particular material and defines how much the speed of light slows down while passing through it.

The pulsing laser light changes the AZO's index of refraction, which, in turn, modulates the amount of reflection and could make higher performance possible.

"If you are operating in the range where your refractive index is low then you can have an enhanced effect, so enhanced reflection change and enhanced transmission change," he said.

The researchers "doped" zinc oxide with aluminum, meaning the zinc oxide is impregnated with aluminum atoms to alter the material's optical properties. Doping the zinc oxide causes it to behave like a metal at certain wavelengths and like a dielectric at other wavelengths.

A new low-temperature fabrication process is critical to the material's properties and for its CMOS compatibility.

"For industrial applications you can't go to really high fabrication temperatures because that damages underlying material on the chip or device," Kinsey said. "An interesting thing about these materials is that by changing factors like the processing temperature you can drastically change the properties of the films. They can be metallic or they can be very much dielectric."

The AZO also makes it possible to "tune" the optical properties of metamaterials, an advance that could hasten their commercialization, Boltasseva said.

The ongoing research is based at Purdue's Birck Nanotechnology Center and is funded by the Air Force Office of Scientific Research, a Marie Curie Outgoing International Fellowship, the National Science Foundation, and the Office of Naval Research.

CAPTION This is a schematic of the "puckered honeycomb" crystal structure of black phosphorus. CREDIT Vahid Tayari/McGill University

New material could make it possible to pack more transistors on a chip, research suggests 

As scientists continue to hunt for a material that will make it possible to pack more transistors on a chip, new research from McGill University and Université de Montréal adds to evidence that black phosphorus could emerge as a strong candidate.

In a study published today in Nature Communications, the researchers report that when electrons move in a phosphorus transistor, they do so only in two dimensions. The finding suggests that black phosphorus could help engineers surmount one of the big challenges for future electronics: designing energy-efficient transistors.

"Transistors work more efficiently when they are thin, with electrons moving in only two dimensions," says Thomas Szkopek, an associate professor in McGill's Department of Electrical and Computer Engineering and senior author of the new study. "Nothing gets thinner than a single layer of atoms."

In 2004, physicists at the University of Manchester in the U.K. first isolated and explored the remarkable properties of graphene -- a one-atom-thick layer of carbon. Since then, scientists have rushed to investigate a range of other two-dimensional materials. One of those is black phosphorus, a form of phosphorus that is similar to graphite and can be separated easily into single atomic layers, known as phosphorene.

Phosphorene has sparked growing interest because it overcomes many of the challenges of using graphene in electronics. Unlike graphene, which acts like a metal, black phosphorus is a natural semiconductor: it can be readily switched on and off.

"To lower the operating voltage of transistors, and thereby reduce the heat they generate, we have to get closer and closer to designing the transistor at the atomic level," Szkopek says. "The toolbox of the future for transistor designers will require a variety of atomic-layered materials: an ideal semiconductor, an ideal metal, and an ideal dielectric. All three components must be optimized for a well designed transistor. Black phosphorus fills the semiconducting-material role."

The work resulted from a multidisciplinary collaboration among Szkopek's nanoelectronics research group, the nanoscience lab of McGill Physics Prof. Guillaume Gervais, and the nanostructures research group of Prof. Richard Martel in Université de Montréal's Department of Chemistry.

To examine how the electrons move in a phosphorus transistor, the researchers observed them under the influence of a magnetic field in experiments performed at the National High Magnetic Field Laboratory in Tallahassee, FL, the largest and highest-powered magnet laboratory in the world. This research "provides important insights into the fundamental physics that dictate the behavior of black phosphorus," says Tim Murphy, DC Field Facility Director at the Florida facility.

"What's surprising in these results is that the electrons are able to be pulled into a sheet of charge which is two-dimensional, even though they occupy a volume that is several atomic layers in thickness," Szkopek says. That finding is significant because it could potentially facilitate manufacturing the material -- though at this point "no one knows how to manufacture this material on a large scale." 

"There is a great emerging interest around the world in black phosphorus," Szkopek says. "We are still a long way from seeing atomic layer transistors in a commercial product, but we have now moved one step closer." 
