
Integration Watch: A new twist on threads




November 15, 2008
In 2004, Intel asked me to co-author a book for Intel Press on “Programming with Hyper-Threading Technology.” The resulting work was a comprehensive survey of parallel programming tools and APIs, with lots of how-to code and specific advice for programming on Intel hyperthreading chips. In the discussion of resource allocation, I presented numbers from my own investigations that supported the common wisdom regarding the optimal number of threads to use—namely, one thread per execution pipeline.

On hyperthreaded processors, as on today’s Core Duo chips, that meant two threads per processor. Similarly, on quad-core chips, the number would be four threads per processor. Using fewer threads leaves some processor resources lying fallow while they wait for work; using more forces some threads to wait while others execute. So, a 1:1 mapping between threads and execution pipelines is just about perfect, both conceptually and in practice.

This ratio holds true if you’re writing parallel threads that keep the pipelines busy 100% of the time. In the dawning era of manycore processors, however, keeping many threads busy the whole time is a very difficult task. What to do?

The answer, surprisingly, is to use smaller workloads, broken up over many more threads than there are pipelines. The overarching concept is that the smaller chunks of work can be scheduled for execution whenever and however the OS scheduler dictates. If the scheduler is optimized for parallel work, it can load-balance very effectively with this approach, in a way that it could not with the older, 1:1 thread-to-pipeline assignments.

Consider a quad-core system. Under the classic approach, it would ideally support four threads of simultaneous execution. If one thread encountered a pause, however, its resources would lie unused until the pause ended. Another thread, working at 100%, could not offload work to the paused thread’s execution resources.

Now suppose the original workload had been broken up into more than four threads. Another thread could then be swapped in for the paused thread, and execution could continue apace. Situations in which there are more threads than pipelines can thus create the opportunity for load balancing.
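The quad-core scenario can be sketched as a small tick-by-tick simulation. This is only an illustrative model, not a description of any real OS scheduler: the function name `makespan`, the 400 units of total work, and the 50-tick pause are assumptions chosen to keep the arithmetic easy to follow.

```python
def makespan(pipelines, work, blocked_until):
    """Return the number of ticks needed to finish every thread.

    work[i]          -- compute ticks that thread i needs (hypothetical units)
    blocked_until[i] -- first tick at which thread i is runnable
                        (a value above 0 models a thread that is paused)
    """
    remaining = list(work)
    tick = 0
    while any(r > 0 for r in remaining):
        # Threads that still have work and are not paused at this tick.
        runnable = [i for i, r in enumerate(remaining)
                    if r > 0 and blocked_until[i] <= tick]
        # Each pipeline advances one runnable thread by one unit of work.
        for i in runnable[:pipelines]:
            remaining[i] -= 1
        tick += 1
    return tick


# Classic 1:1 mapping: four threads of 100 units each; thread 0 is paused
# for the first 50 ticks, so its pipeline sits idle during that time.
classic = makespan(4, [100] * 4, [50, 0, 0, 0])

# Same 400 units split into 16 threads of 25 units each; thread 0 is paused
# for the same 50 ticks, but other threads fill the pipeline meanwhile.
oversubscribed = makespan(4, [25] * 16, [50] + [0] * 15)

print(classic)         # 150
print(oversubscribed)  # 100
```

In this toy model the pause costs the classic layout 50 extra ticks, while the finer-grained layout hides it completely because there is always another runnable thread to swap in.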

Graphics processors are increasingly being used for parallel computation when heavy graphics processing is not needed. Both ATI and Nvidia provide software tools for offloading parallel programming work to GPUs. Nvidia’s software, Compute Unified Device Architecture (CUDA), just like ATI’s, works best when it has literally hundreds, if not thousands, of threads in flight. That lets the hardware optimize load balancing and agilely manage work queues to obtain the best possible performance. I described the Nvidia software in my May 1 column, “Parallel programming with CUDA.”

The upcoming Larrabee processor from Intel will take the same approach. The chip, built from many x86 cores, will handle GPU functions and general parallel workloads at the same time.

Intel’s move into this area is already having repercussions on the company’s well-regarded development tool offerings. The biggest impact is the new way of conceptualizing thread processing.

Intel proposes the new concepts as a progression of increasingly chunky execution blocks. A strand is the finest sequence of instructions; typically, in data-intensive apps, it’s the program work that can run on one SIMD (single instruction, multiple data) lane. A fiber is a software-managed context that runs 16 to 64 strands. A thread runs two to 10 fibers; a core runs one to four threads.
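Taking the quoted ranges at face value, a quick calculation shows how many strands a single core might be juggling at once. The constant and function names below are illustrative; only the numeric ranges come from the description above.

```python
# Ranges quoted in the text, as (low, high) pairs.
STRANDS_PER_FIBER = (16, 64)
FIBERS_PER_THREAD = (2, 10)
THREADS_PER_CORE = (1, 4)


def strands_per_core():
    """Multiply the low ends and the high ends of the three ranges."""
    low = STRANDS_PER_FIBER[0] * FIBERS_PER_THREAD[0] * THREADS_PER_CORE[0]
    high = STRANDS_PER_FIBER[1] * FIBERS_PER_THREAD[1] * THREADS_PER_CORE[1]
    return low, high


print(strands_per_core())  # (32, 2560)
```

So a single core could be handling anywhere from a few dozen to a couple of thousand strands, which is far more units of work than it has hardware threads.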

Dividing tasks into strands, fibers and threads helps sustain good performance because slow tasks can be interleaved with fast ones, and the scheduler can run them in whatever order keeps system resources busiest. Thus the new parallel model will be oriented toward small workload chunks, with less regard for threads and cores.

The approach provides an added benefit: portability of performance. If you write your app for four classic threads and your user runs it on a dual quad-core system, your code runs no faster because eight cores are available; it simply ignores the extra resources. Under the new approach, every additional execution resource brings additional lift to your application. That means users will be delighted to see how much faster your code runs when they upgrade their hardware.
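A back-of-the-envelope calculation makes the portability argument concrete. The model is deliberately idealized and rests on assumptions that are not in the original text: 400 units of total work, equal-sized chunks, pipelines that each run one chunk at a time in rounds, and zero scheduling overhead. The function name `ticks_needed` is hypothetical.

```python
import math


def ticks_needed(num_chunks, chunk_size, pipelines):
    """Idealized run time: chunks execute in rounds, one chunk per pipeline."""
    rounds = math.ceil(num_chunks / pipelines)
    return rounds * chunk_size


# 400 units of work, split two ways (illustrative numbers).
for pipelines in (4, 8, 16):
    classic = ticks_needed(4, 100, pipelines)  # four classic threads
    fine = ticks_needed(80, 5, pipelines)      # eighty small chunks
    print(pipelines, classic, fine)
# 4 100 100
# 8 100 50
# 16 100 25
```

Under these assumptions, the four-thread version takes 100 ticks no matter how many pipelines are available, while the finely chunked version speeds up with each hardware upgrade. Real systems pay some overhead per chunk, so actual gains would be smaller than this idealized figure.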

The details of how to program for this model are still being investigated. Clearly, there are caveats; likewise, there are situations where its indiscriminate use is not appropriate. Nonetheless, it is likely to emerge as the principal threading design in the manycore era. We thus would do well to familiarize ourselves with it, especially when designing application architecture.

Andrew Binstock is the principal analyst at Pacific Data Works. Read his blog at binstock.blogspot.com.

