Computing Reviews
Optimizing image processing on multi-core CPUs with Intel parallel programming technologies
Kim C., Kim J., Lee D. Multimedia Tools and Applications 68(2): 237-251, 2014. Type: Article
Date Reviewed: Jul 9 2015

A comparison of the use of serial, data, and task parallelism in image processing is presented in this paper. The first part of the paper describes the convolution operations that are used for edge detection during image processing. The rest of the paper compares the use of these three programming styles, together with a fourth program that combines both forms of parallelism.

The paper describes measurements made on three versions of an image (256x256, 512x512, and 1024x1024 points). Each image is a square matrix of floating-point numbers. The computation replaces each point with a weighted average of its neighbors. The weights are all integers from among -2, -1, 0, 1, and 2; thus, very little floating-point multiplication is needed. The same operations are programmed in four different ways: a serial program written in C++; a single instruction, multiple data (SIMD) program written using the SSE2 instructions on the Intel processor; a task-oriented program using the Threading Building Blocks (TBB) open-source library provided by Intel; and a combination of the SIMD and task-oriented styles. Unfortunately, the source code for these four programs is not provided, which matters because, as I argue below, the results are unrealistic.
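
Since the source is not given, here is a minimal sketch of what the serial C++ version might look like. The 3x3 Sobel-style kernel, the row-major layout, the border handling, and the names weightedSum and convolveSerial are my assumptions, chosen only to match the description of small integer weights:

    #include <cstddef>
    #include <vector>

    // Weighted sum of the 3x3 neighborhood around (x, y). The Sobel-style
    // weights are an assumption; the paper only says the weights are
    // integers in the range -2..2.
    inline float weightedSum(const std::vector<float>& src, std::size_t n,
                             std::size_t y, std::size_t x) {
        static const int k[3][3] = { { -1, 0, 1 },
                                     { -2, 0, 2 },
                                     { -1, 0, 1 } };
        float acc = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                acc += k[dy + 1][dx + 1] * src[(y + dy) * n + (x + dx)];
        return acc;
    }

    // Serial convolution over an n x n image, skipping the one-pixel border.
    void convolveSerial(const std::vector<float>& src, std::vector<float>& dst,
                        std::size_t n) {
        for (std::size_t y = 1; y + 1 < n; ++y)
            for (std::size_t x = 1; x + 1 < n; ++x)
                dst[y * n + x] = weightedSum(src, n, y, x);
    }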

The experiments are run on a quad-core Intel processor with hyperthreading, giving eight available virtual processors. Physically, there are only four cores, so the best speedup that can be gained from threading is a factor somewhere between 4 and 8. The SSE2 instructions allow the simultaneous processing of four floating-point operations, which is the best speedup that can be achieved with data parallelism; the cumulative speedup should therefore be no more than a factor of 32. The paper claims a larger speedup, which implies that there are other factors that are not being measured. Since the source code is not available, I must guess at the hidden factor; I am certain that it involves the processor cache.
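
To make the arithmetic explicit, the two sources of parallelism compose multiplicatively:

    maximum speedup <= (SIMD lanes) x (virtual processors) = 4 x 8 = 32

so any measured speedup above 32 has to come from something other than added parallelism.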

Here is the problem with this form of experiment. The experimenter should compare four programs, each optimized for its own style of programming. Optimization is not just setting compiler options to perform classical optimizations; the programs should be manually tuned to take advantage of the available hardware, in this case, the processor caches, data prefetching, and instruction scheduling. Each of the images can be stored in the secondary cache, so the program should be written to avoid reloading the cache. Furthermore, the inner loops should be unrolled to allow instruction scheduling.
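
A minimal sketch of the sort of hand tuning I mean, reusing the weightedSum helper from the serial sketch; the strip height of 64 rows and the unroll factor of 4 are illustrative choices, not measured values:

    #include <algorithm>

    // Process the image in horizontal strips that stay resident in the L2
    // cache, and unroll the inner loop by four to expose independent work
    // to the instruction scheduler.
    void convolveBlocked(const std::vector<float>& src, std::vector<float>& dst,
                         std::size_t n) {
        const std::size_t STRIP = 64;  // rows per strip; tune to the L2 size
        for (std::size_t y0 = 1; y0 + 1 < n; y0 += STRIP) {
            const std::size_t yEnd = std::min(y0 + STRIP, n - 1);
            for (std::size_t y = y0; y < yEnd; ++y) {
                std::size_t x = 1;
                for (; x + 4 < n; x += 4) {  // unrolled by 4
                    dst[y * n + x]     = weightedSum(src, n, y, x);
                    dst[y * n + x + 1] = weightedSum(src, n, y, x + 1);
                    dst[y * n + x + 2] = weightedSum(src, n, y, x + 2);
                    dst[y * n + x + 3] = weightedSum(src, n, y, x + 3);
                }
                for (; x + 1 < n; ++x)  // remainder columns
                    dst[y * n + x] = weightedSum(src, n, y, x);
            }
        }
    }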

The SSE2 version of the program should be optimized in the same fashion as the serial program. There are some hints that data prefetching was used; prefetching should be used in both the serial and SIMD versions of the program.
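
Again guessing at the missing code, an SSE2 version with explicit prefetching might look like the following sketch; the kernel, the prefetch distance, the unaligned loads, and the name convolveSSE2 are all assumptions on my part:

    #include <emmintrin.h>  // SSE2 intrinsics (includes the SSE prefetch)

    // Compute four adjacent output pixels per iteration using 4-wide
    // floating-point vectors, prefetching ahead on the lowest input row.
    void convolveSSE2(const std::vector<float>& src, std::vector<float>& dst,
                      std::size_t n) {
        static const float k[3][3] = { { -1, 0, 1 },
                                       { -2, 0, 2 },
                                       { -1, 0, 1 } };
        for (std::size_t y = 1; y + 1 < n; ++y) {
            const float* rows[3] = { &src[(y - 1) * n], &src[y * n],
                                     &src[(y + 1) * n] };
            std::size_t x = 1;
            for (; x + 4 < n; x += 4) {
                // Hint data 16 floats ahead into cache (distance assumed).
                _mm_prefetch(reinterpret_cast<const char*>(rows[2] + x + 16),
                             _MM_HINT_T0);
                __m128 acc = _mm_setzero_ps();
                for (int r = 0; r < 3; ++r)
                    for (int c = -1; c <= 1; ++c) {
                        __m128 v = _mm_loadu_ps(rows[r] + x + c);
                        __m128 w = _mm_set1_ps(k[r][c + 1]);
                        acc = _mm_add_ps(acc, _mm_mul_ps(v, w));
                    }
                _mm_storeu_ps(&dst[y * n + x], acc);
            }
            for (; x + 1 < n; ++x)  // scalar remainder, as in the serial sketch
                dst[y * n + x] = weightedSum(src, n, y, x);
        }
    }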

There are no hints about how the task-parallel version of the program was written. I would guess that the original serial program was modified to use the task-oriented features. In that case, the computation might be broken up into chunks small enough to avoid reloading the processor cache. This would explain the large speedups gained by combined task-parallel and SIMD execution.
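
A plausible reconstruction of the task-parallel version using TBB's parallel_for is sketched below; the row-band decomposition and the grain size of 16 are guesses, since the paper does not say how the work was divided:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    // Split the rows into bands and let the TBB scheduler assign bands to
    // threads. A small grain size keeps each task's working set cache-sized,
    // which is exactly the hidden effect suspected above.
    void convolveTBB(const std::vector<float>& src, std::vector<float>& dst,
                     std::size_t n) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(1, n - 1, /*grain size=*/16),
            [&](const tbb::blocked_range<std::size_t>& band) {
                for (std::size_t y = band.begin(); y != band.end(); ++y)
                    for (std::size_t x = 1; x + 1 < n; ++x)
                        dst[y * n + x] = weightedSum(src, n, y, x);
            });
    }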

The paper is useful as an introduction to these forms of programming for programmers and designers. Unfortunately, the results are unrealistic. I recommend that the authors rerun the experiments using the techniques described earlier. I would also run an additional program that simply loads the data into the processor in the same order that the optimized programs load it. I suspect that most of the execution time in the fully optimized program will be memory activity rather than computation.
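
As a concrete version of that additional measurement, a load-only baseline might look like this; the row-major traversal order is my assumption about how the optimized programs walk the data:

    // Touch every pixel in the same order as the convolution but do almost
    // no arithmetic, so the run time approximates pure memory traffic. The
    // volatile sink keeps the compiler from optimizing the loop away.
    float memoryBaseline(const std::vector<float>& src, std::size_t n) {
        float acc = 0.0f;
        for (std::size_t y = 1; y + 1 < n; ++y)
            for (std::size_t x = 1; x + 1 < n; ++x)
                acc += src[y * n + x];
        volatile float sink = acc;
        return sink;
    }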

Reviewer: Charles Morgan. Review #: CR143592 (1509-0818)
Parallel Processing (I.3.1)
Parallelism And Concurrency (F.1.2)
Single-Instruction-Stream, Multiple-Data-Stream Processors (SIMD) (C.1.2)
Threads (D.4.1)
Other reviews under "Parallel Processing":

Parallel volume rendering and data coherence
Corrie B., Mackerras P. Parallel rendering (Proceedings of the 1993 symposium, San Jose, California, United States, Oct 25-26, 1993) 23-26, 1993. Type: Proceedings
Jul 1 1994

Guest Editor’s Introduction: Parallel Processing for Computer Vision and Image Understanding
Choudhary A., Ranka S. Computer 25(2): 7-10, 1992. Type: Article
Jun 1 1993

Time-optimal, space-efficient single-scanner snapshots & multi-scanner snapshots using CAS
Fatourou P., Kallimanis N. Principles of distributed computing (Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, Portland, Oregon, Aug 12-15, 2007) 33-42, 2007. Type: Proceedings
Nov 15 2007
