Wednesday, November 27, 2013

Programming a million core machine

I have just attended an excellent talk by Steve Furber, Professor of Computer Engineering at the University of Manchester, on the challenges of programming a million-core machine as part of the SpiNNaker project.

The SpiNNaker project has been in existence for around 15 years and has been attempting to answer two fundamental questions:
  • How does the brain do what it does? Can massively parallel computing accelerate our understanding of the brain?
  • How can our (increasing) understanding of the brain help us create more efficient, parallel and fault-tolerant computation?
The comparison of a parallel computer with a brain is not accidental, since brains share many of the required attributes: they are massively parallel, have vast numbers of interconnections, are extremely power-efficient, require only low-speed communications, are adaptable and fault-tolerant, and are capable of learning autonomously. The challenge for computing as Moore's law progresses is that there will eventually come a time when further increases in speed are not possible, and as processing speed has increased, energy efficiency has become an increasingly important characteristic to address. The future is therefore parallel, but the approach to handling this is far from clear. The SpiNNaker project was established to attempt to model a brain (around 1% of a human brain) using approximately a million mobile-phone chips with efficient asynchronous interconnections, whilst also examining how to develop efficient parallel applications.

The project is built on 3 core principles:
  • The topology is virtualised and is as generic as possible. The physical and logical connectivity are decoupled.
  • There is no global synchronisation between the processing elements.
  • Energy frugality, such that the cost of a processor is effectively zero (removing the need for load balancing) and the energy usage of each processor is minimised.
[As an aside, energy-efficient computing is of growing interest: for many systems, the energy required to complete a computation is now the key factor in operational cost.]
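
Returning to the first principle, the sketch below (hypothetical data structures of my own, not the SpiNNaker toolchain's) shows the idea of keeping a logical connectivity graph separate from its physical placement: the same logical vertices can be bound to entirely different chips and cores without the application logic changing.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical illustration: a logical vertex (e.g. a group of neurons)
     * knows nothing about where it runs; a separate placement table binds
     * it to a physical (x, y, core) location on the 2D mesh. Swapping the
     * table re-maps the application without touching its logic. */
    typedef struct { uint32_t vertex_id; } logical_vertex_t;
    typedef struct { uint8_t x, y, core; } physical_core_t;

    typedef struct {
        logical_vertex_t vertex;
        physical_core_t  placed_on;
    } placement_t;

    int main(void)
    {
        /* Two equally valid placements of the same two-vertex logical graph */
        placement_t plan_a[] = { { {0}, {0, 0, 1} }, { {1}, {0, 1, 5} } };
        placement_t plan_b[] = { { {0}, {3, 2, 9} }, { {1}, {3, 2, 10} } };

        for (int i = 0; i < 2; i++)
            printf("vertex %u -> chip (%u,%u) core %u\n",
                   (unsigned)plan_a[i].vertex.vertex_id,
                   (unsigned)plan_a[i].placed_on.x,
                   (unsigned)plan_a[i].placed_on.y,
                   (unsigned)plan_a[i].placed_on.core);

        (void)plan_b;   /* the alternative placement: same logical graph */
        return 0;
    }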

The SpiNNaker project has designed a node which contains two chips: one chip is used for processing and consists of 18 ARM processors (one hosts the operating system, 16 are used for application execution and one is a spare), and the other chip provides memory (SDRAM). The nodes are connected in a 2D mesh for simplicity and cost. 48 nodes are assembled onto a PCB, giving 864 processors per board. The processors support integer computation only. The major innovation in the design is the interconnect within a node and between nodes on a board: a simple packet-switched network is used to send very small packets around, and each node has a router which efficiently forwards packets either within the node or on to a neighbouring node. Ultimately, 24 PCBs are housed within a single 19” rack, and five racks are housed within a cabinet, so each cabinet holds 120 PCBs, which equates to 5,760 nodes or 103,680 processors. Ten cabinets would therefore provide over a million processors and would require around 10 kW. A host machine (running Linux) is connected via Ethernet to the cabinet (and optionally to each board).
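
The scale of the machine follows directly from that arithmetic; a few lines of code (using only the figures quoted above) make the multiplication explicit:

    #include <stdio.h>

    int main(void)
    {
        /* Figures as quoted in the talk */
        const int cores_per_node  = 18;    /* ARM processors per chip */
        const int nodes_per_board = 48;    /* nodes on one PCB        */
        const int boards_per_rack = 24;    /* PCBs in one 19" rack    */
        const int racks_per_cab   = 5;     /* racks per cabinet       */
        const int cabinets        = 10;

        int cores_per_board = cores_per_node * nodes_per_board;   /* 864     */
        int boards_per_cab  = boards_per_rack * racks_per_cab;    /* 120     */
        int nodes_per_cab   = boards_per_cab * nodes_per_board;   /* 5760    */
        int cores_per_cab   = nodes_per_cab * cores_per_node;     /* 103680  */
        long total_cores    = (long)cores_per_cab * cabinets;     /* 1036800 */

        printf("processors per board:   %d\n", cores_per_board);
        printf("processors per cabinet: %d\n", cores_per_cab);
        printf("processors in total:    %ld\n", total_cores);
        return 0;
    }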

Networking (and its efficiency) is the key challenge in emulating neurons. The approach taken by SpiNNaker is to capture a simple spike (representing a neuron firing) in a small packet (40 bits) and then multicast it around the machine; each neuron is allocated a unique identifier, giving a theoretical limit of around 4 billion neurons that can be modelled. By using a three-stage associative memory holding simple routing information, the destinations of each event can be determined. If the table contains no entry for a packet, it is simply passed straight through to the next router. This approach is ideally suited to a static, or very slowly changing, network. It struck me that this simple approach could be very useful for efficient communication across the internet, and perhaps for meeting the challenge of the 'Internet of Things'.
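
The routing mechanism can be pictured as a key/mask (ternary) lookup: each table entry matches a set of neuron identifiers and gives a bit-vector of outputs to copy the packet to, and a packet matching no entry is simply 'default routed' straight through. The code below is my own illustrative model of that behaviour, not the actual router hardware or its tables:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative multicast routing entry: a packet key (the source
     * neuron's identifier) matches an entry when (key & mask) == entry key.
     * The route is a bit-vector of output links / local cores to copy the
     * packet to. */
    typedef struct {
        uint32_t key;     /* value to match after masking         */
        uint32_t mask;    /* which bits of the packet key matter  */
        uint32_t route;   /* one bit per output link / local core */
    } mc_entry_t;

    static const mc_entry_t table[] = {
        { 0x00010000, 0xFFFF0000, 0x00000041 },  /* keys 0x0001xxxx -> links 0 and 6 */
        { 0x00020000, 0xFFFF0000, 0x00000002 },  /* keys 0x0002xxxx -> link 1        */
    };
    #define TABLE_SIZE (sizeof table / sizeof table[0])

    /* Returns the route for a packet, or default_route (straight through to
     * the opposite link) when no entry matches - the behaviour described in
     * the talk for unknown keys. */
    static uint32_t route_packet(uint32_t packet_key, uint32_t default_route)
    {
        for (size_t i = 0; i < TABLE_SIZE; i++)
            if ((packet_key & table[i].mask) == table[i].key)
                return table[i].route;
        return default_route;
    }

    int main(void)
    {
        printf("key 0x00010042 -> route 0x%08x\n",
               (unsigned)route_packet(0x00010042u, 0x00000008u));
        printf("key 0x00990001 -> route 0x%08x (default)\n",
               (unsigned)route_packet(0x00990001u, 0x00000008u));
        return 0;
    }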

Developing applications for SpiNNaker requires the problem to be split into two parts: one part describes the connectivity graph between nodes; the other follows the conventional compile/link/deploy cycle. Whilst the raw bandwidth is impressive (250 Gbps across 1,024 links), it is the packet throughput which is exceptional, at over 10 billion packets per second.

The programming approach uses an event-driven paradigm which discourages conventional single-threaded execution. Each node runs a single application, with the applications (written in C) communicating via an API with SARK (the SpiNNaker Application Runtime Kernel) hosted on the processor. The event model effectively maps onto interrupt handlers on the processor, with three key events handled by each application (a sketch follows the list below):
  • A new packet (highest priority)
  • A (DMA) memory transfer
  • A timer event (typically 1 millisecond)
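
The three handlers map naturally onto callbacks registered with the kernel. The sketch below is only illustrative: the function and constant names are made-up stand-ins rather than the real SARK API, and the kernel's event loop is faked so the example runs standalone, but the shape — register one handler per event type, then hand control to the kernel — is the point:

    #include <stdint.h>
    #include <stdio.h>

    /* Made-up stand-ins for the runtime kernel's callback registration and
     * event loop - not the real SARK API, just enough to show the shape.  */
    typedef void (*callback_t)(uint32_t arg0, uint32_t arg1);
    enum { EV_PACKET_RECEIVED, EV_DMA_DONE, EV_TIMER_TICK, EV_COUNT };

    static callback_t handlers[EV_COUNT];

    static void kernel_register_callback(int event, callback_t cb, int priority)
    {
        (void)priority;          /* priority ordering omitted in this sketch */
        handlers[event] = cb;
    }

    /* Highest priority: a multicast packet (a neural 'spike') has arrived. */
    static void on_packet(uint32_t key, uint32_t payload)
    { printf("spike: key=0x%08x payload=%u\n", (unsigned)key, (unsigned)payload); }

    /* A DMA transfer between the node's SDRAM and local memory completed.  */
    static void on_dma_done(uint32_t tag, uint32_t status)
    { printf("dma done: tag=%u status=%u\n", (unsigned)tag, (unsigned)status); }

    /* Periodic timer tick (typically 1 ms): advance the neuron state.      */
    static void on_timer(uint32_t tick, uint32_t unused)
    { (void)unused; printf("tick %u: update neurons\n", (unsigned)tick); }

    int main(void)
    {
        kernel_register_callback(EV_PACKET_RECEIVED, on_packet,   0);  /* highest */
        kernel_register_callback(EV_DMA_DONE,        on_dma_done, 1);
        kernel_register_callback(EV_TIMER_TICK,      on_timer,    2);

        /* In place of the kernel's event loop, fake a few events: */
        handlers[EV_TIMER_TICK](1, 0);
        handlers[EV_PACKET_RECEIVED](0x00010042u, 0);
        handlers[EV_DMA_DONE](7, 0);
        return 0;
    }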
As most applications for SpiNNaker have been models of the brain, most have been written in PyNN (a Python neural network description language), which is then translated into code that can be hosted on SpiNNaker. The efficiency of the interconnect means that brain simulations can now be executed in real time, a significant improvement over conventional supercomputing.

In conclusion, it is clear that whilst the focus has been on addressing the 'science' challenges, there are clear insights into future computing in terms of improved inter-processor connectivity, improved energy utilisation and a flexible platform. Whilst commercial exploitation has not been a major driving force for this project, I am confident that some of the approaches and ideas will find their way into mainstream computing, in much the same way that, 50 years ago, Manchester developed the paging algorithm which is now commonplace in all computing platforms.

The slides are available here.