Computing Reviews
Optimizing image processing on multi-core CPUs with Intel parallel programming technologies
Kim C., Kim J., Lee D. Multimedia Tools and Applications 68(2): 237-251, 2014. Type: Article
Date Reviewed: Jul 9 2015

A comparison of the use of serial, data, and task parallelism in image processing is presented in this paper. The first part of the paper describes the convolution operations that are used for edge detection during image processing. The rest of the paper compares the use of these three programming styles, together with a fourth program that combines both forms of parallelism.

The paper describes measurements made on three versions of an image (256x256, 512x512, and 1024x1024 points). Each image is a square matrix of floating-point numbers. The computation replaces each point with a weighted average of its neighbors. The weights are all integers from among -2, -1, 0, 1, and 2; thus, very little floating-point multiplication is needed. The same operations are programmed in four different ways: a serial program written in C++; a single instruction, multiple data (SIMD) program written using the SSE2 instructions on the Intel processor; a task-oriented program using the Threading Building Blocks (TBB) open-source library provided by Intel; and a combination of the SIMD and task-oriented styles. Unfortunately, the source code for these four programs is not provided, which matters because, as I argue below, the results are unrealistic.
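
Since the source is not given, here is a minimal sketch of what the serial C++ version might look like. The 3x3 Sobel-style kernel, the row-major layout, the border handling, and the names weightedSum and convolveSerial are my assumptions, chosen only to match the description of small integer weights:

    #include <cstddef>
    #include <vector>

    // Weighted sum of the 3x3 neighborhood around (x, y). The Sobel-style
    // weights are an assumption; the paper only says the weights are
    // integers in the range -2..2.
    inline float weightedSum(const std::vector<float>& src, std::size_t n,
                             std::size_t y, std::size_t x) {
        static const int k[3][3] = { { -1, 0, 1 },
                                     { -2, 0, 2 },
                                     { -1, 0, 1 } };
        float acc = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                acc += k[dy + 1][dx + 1] * src[(y + dy) * n + (x + dx)];
        return acc;
    }

    // Serial convolution over an n x n image, skipping the one-pixel border.
    void convolveSerial(const std::vector<float>& src, std::vector<float>& dst,
                        std::size_t n) {
        for (std::size_t y = 1; y + 1 < n; ++y)
            for (std::size_t x = 1; x + 1 < n; ++x)
                dst[y * n + x] = weightedSum(src, n, y, x);
    }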

The experiments are run on a quad-core Intel processor with hyperthreading, giving eight available virtual processors. Physically, there are only four cores, so the best speedup that can be gained from threading is a factor somewhere between 4 and 8. The SSE2 instructions allow the simultaneous processing of four floating-point operations, which is the best speedup that can be achieved with data parallelism; the cumulative speedup should therefore be no more than a factor of 32. The paper claims a larger speedup, which implies that there are other factors that are not being measured. Since the source code is not available, I must guess at the hidden factor; I am certain that it involves the processor cache.
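
To make the arithmetic explicit, the two sources of parallelism compose multiplicatively:

    maximum speedup <= (SIMD lanes) x (virtual processors) = 4 x 8 = 32

so any measured speedup above 32 has to come from something other than added parallelism.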

Here is the problem with this form of experiment. The experimenter should compare four programs, each optimized for its own style of programming. Optimization is not just setting compiler options to perform classical optimizations; the programs should be manually tuned to take advantage of the available hardware, in this case, the processor caches, data prefetching, and instruction scheduling. Each of the images can be stored in the secondary cache, so the program should be written to avoid reloading the cache. Furthermore, the inner loops should be unrolled to allow instruction scheduling.
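
A minimal sketch of the sort of hand tuning I mean, reusing the weightedSum helper from the serial sketch; the strip height of 64 rows and the unroll factor of 4 are illustrative choices, not measured values:

    #include <algorithm>

    // Process the image in horizontal strips that stay resident in the L2
    // cache, and unroll the inner loop by four to expose independent work
    // to the instruction scheduler.
    void convolveBlocked(const std::vector<float>& src, std::vector<float>& dst,
                         std::size_t n) {
        const std::size_t STRIP = 64;  // rows per strip; tune to the L2 size
        for (std::size_t y0 = 1; y0 + 1 < n; y0 += STRIP) {
            const std::size_t yEnd = std::min(y0 + STRIP, n - 1);
            for (std::size_t y = y0; y < yEnd; ++y) {
                std::size_t x = 1;
                for (; x + 4 < n; x += 4) {  // unrolled by 4
                    dst[y * n + x]     = weightedSum(src, n, y, x);
                    dst[y * n + x + 1] = weightedSum(src, n, y, x + 1);
                    dst[y * n + x + 2] = weightedSum(src, n, y, x + 2);
                    dst[y * n + x + 3] = weightedSum(src, n, y, x + 3);
                }
                for (; x + 1 < n; ++x)  // remainder columns
                    dst[y * n + x] = weightedSum(src, n, y, x);
            }
        }
    }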

The SSE2 version of the program should be optimized in the same fashion as the serial program. There are some hints that data prefetching was used; prefetching should be used in both the serial and SIMD versions of the program.
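
Again guessing at the missing code, an SSE2 version with explicit prefetching might look like the following sketch; the kernel, the prefetch distance, the unaligned loads, and the name convolveSSE2 are all assumptions on my part:

    #include <emmintrin.h>  // SSE2 intrinsics (includes the SSE prefetch)

    // Compute four adjacent output pixels per iteration using 4-wide
    // floating-point vectors, prefetching ahead on the lowest input row.
    void convolveSSE2(const std::vector<float>& src, std::vector<float>& dst,
                      std::size_t n) {
        static const float k[3][3] = { { -1, 0, 1 },
                                       { -2, 0, 2 },
                                       { -1, 0, 1 } };
        for (std::size_t y = 1; y + 1 < n; ++y) {
            const float* rows[3] = { &src[(y - 1) * n], &src[y * n],
                                     &src[(y + 1) * n] };
            std::size_t x = 1;
            for (; x + 4 < n; x += 4) {
                // Hint data 16 floats ahead into cache (distance assumed).
                _mm_prefetch(reinterpret_cast<const char*>(rows[2] + x + 16),
                             _MM_HINT_T0);
                __m128 acc = _mm_setzero_ps();
                for (int r = 0; r < 3; ++r)
                    for (int c = -1; c <= 1; ++c) {
                        __m128 v = _mm_loadu_ps(rows[r] + x + c);
                        __m128 w = _mm_set1_ps(k[r][c + 1]);
                        acc = _mm_add_ps(acc, _mm_mul_ps(v, w));
                    }
                _mm_storeu_ps(&dst[y * n + x], acc);
            }
            for (; x + 1 < n; ++x)  // scalar remainder, as in the serial sketch
                dst[y * n + x] = weightedSum(src, n, y, x);
        }
    }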

There are no hints about how the task-parallel version of the program was written. I would guess that the original serial program was modified to use the task-oriented features. In that case, the computation might be broken up into chunks small enough to avoid reloading the processor cache. This would explain the large speedups gained by combined task-parallel and SIMD execution.
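
A plausible reconstruction of the task-parallel version using TBB's parallel_for is sketched below; the row-band decomposition and the grain size of 16 are guesses, since the paper does not say how the work was divided:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    // Split the rows into bands and let the TBB scheduler assign bands to
    // threads. A small grain size keeps each task's working set cache-sized,
    // which is exactly the hidden effect suspected above.
    void convolveTBB(const std::vector<float>& src, std::vector<float>& dst,
                     std::size_t n) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(1, n - 1, /*grain size=*/16),
            [&](const tbb::blocked_range<std::size_t>& band) {
                for (std::size_t y = band.begin(); y != band.end(); ++y)
                    for (std::size_t x = 1; x + 1 < n; ++x)
                        dst[y * n + x] = weightedSum(src, n, y, x);
            });
    }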

The paper is useful as an introduction to these forms of programming for programmers and designers. Unfortunately, the results are unrealistic. I recommend that the authors rerun the experiments using the techniques described earlier. I would also run an additional program that simply loads the data into the processor in the same order that the optimized programs load it. I suspect that most of the execution time in the fully optimized program will be memory activity rather than computation.
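
As a concrete version of that additional measurement, a load-only baseline might look like this; the row-major traversal order is my assumption about how the optimized programs walk the data:

    // Touch every pixel in the same order as the convolution but do almost
    // no arithmetic, so the run time approximates pure memory traffic. The
    // volatile sink keeps the compiler from optimizing the loop away.
    float memoryBaseline(const std::vector<float>& src, std::size_t n) {
        float acc = 0.0f;
        for (std::size_t y = 1; y + 1 < n; ++y)
            for (std::size_t x = 1; x + 1 < n; ++x)
                acc += src[y * n + x];
        volatile float sink = acc;
        return sink;
    }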

Reviewer: Charles Morgan. Review #: CR143592 (1509-0818)
Parallel Processing (I.3.1)
Parallelism And Concurrency (F.1.2)
Single-Instruction-Stream, Multiple-Data-Stream Processors (SIMD) (C.1.2)
Threads (D.4.1)
Other reviews under "Parallel Processing":

Parallel volume rendering and data coherence
Corrie B., Mackerras P. Parallel rendering (Proceedings of the 1993 symposium, San Jose, California, United States, Oct 25-26, 1993) 23-26, 1993. Type: Proceedings
Jul 1 1994

Guest Editor’s Introduction: Parallel Processing for Computer Vision and Image Understanding
Choudhary A., Ranka S. Computer 25(2): 7-10, 1992. Type: Article
Jun 1 1993

Time-optimal, space-efficient single-scanner snapshots & multi-scanner snapshots using CAS
Fatourou P., Kallimanis N. Principles of distributed computing (Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, Portland, Oregon, Aug 12-15, 2007) 33-42, 2007. Type: Proceedings
Nov 15 2007
