The bulksynchronous parallel (BSP) paradigm provides an abstract model that separates the concerns of hardware architects and software designers, thereby bridging the two. Leslie Valiant, winner of the 2010 ACM Turing Award, presented BSP as such a bridging model [1].
First and foremost, the BSP model simplifies the analysis and development of scalable and portable parallel algorithms. Unlike the preceding PRAM model [2], which ignores communication, the BSP cost model captures a wide range of parallel computers within a single realistic framework, thus resulting in the design of truly optimal algorithms.
The BSP model outlined in Section 1 has three main elements: a framework for building wellstructured algorithms, a target architecture that is both simple and realistic, and a cost model that accurately predicts running time. As a theory, BSP has been successful and has led to many tools that had great impact in the areas of big data computing and highperformance computing (HPC). These practical tools are reviewed in Section 2. BSP will need to cope with several challenges in the coming years: everincreasing core counts, corresponding increases in latency, hardware variability, and nonuniform memory access. Section 3 discusses these. Section 4 introduces the multiBSP model [3] as an answer to those challenges.
1. Algorithm, architecture, and cost
A BSP algorithm consists of a sequence of supersteps that are executed concurrently by a number of processes. All processes execute the same program with the same superstep sequence, but each program may differentiate its instructions based on its unique process identity.
A BSP superstep contains both local computation and remote communication. Upon reaching a superstep, the algorithm designer may assume all previous supersteps have completed. The idea of bundling as many operations as possible in a single superstep leads to the term “bulk,” while the strict separation of one superstep from the next leads to the term “synchronous.”
A BSP computer consists of a set of p processors, each of which, for purposes of algorithm analysis, executes a single computation operation in time t. Each processor is equipped with local memory and an interconnect facility that enables interprocessor communication. Data in local memory are immediately accessible to a processor, whereas data residing in remote memories can only be accessed by communication. The network is optimized for fullduplex alltoall communication, during which processors can simultaneously receive and send one data word at constant cost g. A global synchronization facility furthermore checks whether a superstep has completed, and performs this check periodically: every l time units. A BSP computer is thus completely modeled by only four parameters, (p,t,g,l).
The BSP cost model provides a way of analyzing the supersteps of BSP algorithms to predict their performance when deployed on a BSP computer. The computation cost of a superstep is the maximum number of computation operations carried out by a processor, times t. The communication cost is the maximum number of data words sent or received by a processor, times g. The synchronization cost of a superstep is l. The cost of a complete BSP algorithm then becomes an expression of the form at+bg+cl, where (a,b,c) depend on the algorithm, (t,g,l) depend on the computer, while typically all of those parameters depend on p.
2. BSP in practice
The BSP model has evolved over the years, with pioneering work done in the 1990s by the Oxford Parallel group, led by Bill McColl. This resulted in the definition of the BSPlib standard interface in 1998 [4], accompanied by an implementation in the Oxford BSP toolset. This toolset demonstrated the opportunities that BSP offers for optimizing communication at runtime, without involvement of the programmer, for instance by automatically reordering communication requests or by automatically combining them [5].
Structured programming
Humans, inadequate at multitasking, often have similar difficulty in designing correct and efficient parallel algorithms. Imposing a structure on parallel programming, like BSP does via supersteps, leads to natural points of rest that ease thinking in parallel. The superstep counter furthermore provides us with a conceptual global clock. This imposed structure can be exploited for various purposes: proof of correctness, optimization of communication, checkpoints for fault tolerance, and perhaps even for energy minimization.
One great success of BSP is precisely this highly structured approach of writing parallel algorithms. Examples of successful programming paradigms exploiting BSPstructured parallel programming are: collectives in MPI [6, ch. 4], MapReduce [7], Hadoop [8], Hama [9], Pregel [10], Giraph [11], Spark [12], and Petuum [13]. These are all platforms that force a programmer to write in a superstep structure; other platforms, like MPI, can express more unstructured programs, but can of course express superstep structures as well.
Largescale graph computations form a prime application area of the BSP algorithmic approach. Google's Pregel system [10] and Facebook's opensource Giraph system for processing graphs [11] have both been inspired by the BSP model. In a computation by these systems, a single superstep has every vertex in the graph perform some local operations, after which every vertex communicates along its edges before starting a next superstep. In this fashion, Facebook managed to perform PageRank, label propagation, and kmeans clustering computations on graphs with up to 10^{12} vertices and 10^{15} edges [14]. Similarly, the BSP paradigm is especially well suited for machine learning [12,13].
Algorithm design
Using BSP, algorithm designers can formulate algorithms that are oblivious to the parameters of the BSP computer model. Algorithms that are provably optimal regardless of the precise values of (p,t,g,l) are socalled immortal algorithms: algorithms that need to be written just once, and remain forever optimal.
In the absence of a proof of optimality, the cost model allows derivation of the asymptotic behavior of the algorithm to determine its scalability. Reducing the BSP cost of algorithms leads to practices proven healthy over decades in HPC: reducing and balancing both computation and communication, while avoiding synchronization.
The BSP model also is ideally suited for teaching parallel computing. In our own experience at Utrecht University and KU Leuven, we used BSP as our parallel computing framework because it is transparent, easy to explain, and because of the limited size of the BSPlib interface, which eases practical implementation.
The BSPlib interface
The BSPlib interface [4] is the closest interpretation of the BSP paradigm available since it allows any possible communication pattern to be expressed while still imposing a clear superstep structure. This ensures that processes that receive data do not need to confirm receipt, which entails that all communication can be done in a onesided and unbuffered manner. Zerocopy direct remote memory access (DRMA) effects this by using put or get primitives, where local data is directly put into remote memory and remote data is directly retrieved into local memory, respectively.
This contrasts with twosided message passing, where a local send operation must be met by a matching remote receive operation; a practice which has been likened to the use of the gotostatement prior to the days of structured sequential programming [15,16]. Onesided communication furthermore ensures deadlocks cannot occur.
Apart from DRMA, BSPlib also incorporates bulksynchronous message passing as inspired by the Green BSP library [17]. Here, data is sent to a remote receiver queue in one superstep and all of the queues are emptied in the next superstep. An interface similar to BSPlib is the Paderborn University BSP library [18]. BSPonMPI, released by Wijnand Suijlen in 2006, allows BSPlib to run on all MPIsupported architectures. For functional programming, there exists bulk synchronous ML (BSML) [19], which is accompanied by a formal calculus [20].
The objectoriented MulticoreBSP for Java library [21] first explored the effectiveness, predictability, and scalability of BSP on the thenemerging sharedmemory multicore architectures. MulticoreBSP for C [22] proves that stateoftheart performance can be matched or exceeded while using the classical BSPlib, at least for the important kernels of sparse matrixvector multiplication and the fast Fourier transform. MulticoreBSP for C is an implementation of an updated BSPlib interface and contains only 23 primitives. This is much less than the 452 primitives of the mostused parallel programming interface in HPC: MPI3. This small size of the BSPlib interface makes full command of it attainable, and has repeatedly inspired students to write their own implementation for new architectures: MulticoreBSP for Java and the Parallella Epiphany BSP are two such examples.
3. Challenges to BSP
Recent trends in hardware architecture—such as a dramatic increase in the number of processor cores, larger variations in processor performance, more likely hardware failures, and deeper architectural hierarchies—pose new challenges to the BSP model. The Tianhe2, the current leader of the TOP500 supercomputing sites, possesses 3.12 million processor cores, many of which reside on coprocessors. This section outlines the problems such architectures cause for the BSP paradigm.
Latency
In contemporary computer hardware, synchronization usually is a collective barrier operation initiated over a network interconnect. It is not implemented as an automatic periodic mechanism in specialized hardware, as originally suggested for BSP by Valiant [1]. While the cost of synchronization can still be modeled via the latency parameter l, the number of processor cores in a massively parallel supercomputer has increased from hundreds to millions; synchronizing all of these is timeconsuming and should not be done lightheartedly [23, Way 5].
The use of a global barrier in a BSP system, however, is neither prescribed by the model, nor by BSPlib. If the communication pattern of a superstep is known beforehand, a BSP process can locally and without barriers determine when it is safe to proceed with the next superstep, without breaking the global consistency guaranteed by the BSP algorithm model. A wellknown example of a global operation that avoids global synchronization is the treebased allreduce operation [24]. Avoiding global barriers is furthermore a key feature of BSP systems like Pregel and Giraph, and has been incorporated in more generic BSPlib style frameworks as well [18,25].
Hardware variability
The BSP cost model does not consider the possibility of processors slowing down due to error corrections, power management, heat constraints, and so on. The timetocompletion of a parallel program may no longer be proportional to the work distributed to each individual process, and instead has become dependent on parameters outside the influence of the parallel program itself. If there are no hard guarantees on hardware variabilities, computation times become unpredictable, making the notion of an optimal parallel algorithm irrelevant.
While hardware has had stable bounds on variability, recent trends indicate that this is changing; on large scales, hardware failure has even become unavoidable. To deal with hardware variability, MapReduce [7] and Hadoop [8], for instance, use diskbased data replication and hashing to achieve both tailtolerance and faulttolerance. Such techniques are already commonplace in BSPstructured big data computing [7,10,26], and are becoming equally important to HPC.
Nonuniform memory access
Modern sharedmemory architectures exhibit increasingly nonuniform memory access (NUMA) times. This stems from a necessary increase in the number of compute cores per processor and from the high cost of providing a highperformance alltoall interconnect between those cores.
NUMA effects have also appeared in network interconnects. A traditional fattree interconnect such as that used by the Tianhe2 [27] naturally introduces a hierarchy of networks. The 2015 overview of HPC systems by Vetter [28] has six hierarchical interconnects versus two nonhierarchical ones, whereas the 2013 overview still had a nonhierarchical majority. Such architectural developments are contrary to the BSP computer model, which assumes a network optimized for uniform alltoall communication.
4. MultiBSP
To function as a realistic bridging model, BSP should account for NUMA effects. Valiant proposes the theoretical multiBSP model [3] to achieve this. It redefines the BSP computer model recursively: a multiBSP computer consists of either p other multiBSP computers, or of p regular processors. This leads to a tree hierarchy of interconnects that represents the parallel computer: each internal node represents a communication network while every node is equipped with local memory (for example, caches or main memory). All the compute power is attached to the leaves of the tree.
Since the network interconnect and the local memory sizes differ depending on the level in the architectural hierarchy, the number of parameters of the multiBSP computer becomes linear in the number of levels. A multiBSP algorithm must execute different types of supersteps for each level of the communication network, as illustrated in Fig. 1; the cost expression for a multiBSP algorithm thus also grows linearly in the number of levels.
Figure 1: A multiBSP algorithm for four processes, organized in a twolevel hierarchy. The horizontal axis represents time, from left to right, while the vertical axis represents the processes. The arrows denote interprocess communication. The vertical bars separate the computations and the communications of the different levels.
A current practical topic of investigation is the portable implementation of multiBSP algorithms. An effective multiBSP programming framework must automatically adapt the implemented algorithm to arbitrarily deep architectural hierarchies. Work in this direction is in progress at several research groups, including for MulticoreBSP and for BSML.
5. Outlook
BSP can play a role in unifying hybrid programming on arbitrary levels of hierarchy, providing an attractive alternative to today's mixed programming practice as often seen in HPC: for example, MPI for nodes, OpenMP within nodes, and CUDA for accelerators. Results obtained with modern BSPlib implementations show that bulk synchrony is neither at odds with efficiency, nor with scalability. Hardware variability and failure, furthermore, can be mitigated with the help of BSP as already proven by modern big data platforms. BSP thus can provide a unified practical framework that is suitable for both big data and HPC, providing resiliency without compromising efficiency: BSP has still untapped potential that can move both big data and HPC one “superstep” forward.

Created: Apr 10 2016
Last updated: Apr 10 2016




Web pages
BSP Worldwide: central web resource and mailing list
MulticoreBSP: updated BSPlib version with Java and C API for multicore computers
Epiphany BSP: BSPlib version for the Parallella board
BSML: Bulk Synchronous Parallel ML, for functional programming; also has a multiBSP variant, MultiML
BSPedupack: educational programs in BSPlib for dense LU decomposition, fast Fourier transform, sparse matrixvector multiplication
Architecture Independent Parallel Code: sorting, matrix multiplication
Articles
A bridging model for parallel computation Valiant, L. G. CACM
A bridging model for multicore computing Valiant, L. G. J. Computer and System Sciences
BSPlib: the BSP Programming Library Hill, J. M. D.; McColl, B.; Stefanescu, D. C.; Goudreau, M. W.; Lang, K.; Rao, S. B.; Suel, T.; Tsantilas, T.; Bisseling, R. H. Parallel Computing
MulticoreBSP for C: a highperformance library for sharedmemory parallel programming Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K. Int. J. Parallel Programming
Pregel: a system for largescale graph processing Malewicz, G.; Austern, M. H.; Bik, A. J. C.; Dehnert, J. C.; Horn, I.; Leiser, N.; Czajkowski, G. Proc. ACM SIGMOD 2010
Questions and answers about BSP Skillicorn, D. B.; Hill, J. M. D.; McColl, W. F. Scientific Programming
BSP (bulk synchronous parallelism) Tiskin, A. Encyc. Parallel Computing
Books
Parallel scientific computation: a structured approach using BSP and MPI Bisseling, R. H., 2004
Conferences and Workshops
International Symposium on HighLevel Parallel Programming and Applications (HLPP): annual conference on higherlevel, structured approaches that “offer many possible advantages and have a key role to play in the scalable exploitation of ubiquitous parallelism.”
SIAM Conference on Parallel Processing for Scientific Computing (SIAMPP): biannual conference organized by the SIAM Special Interest Group on Supercomputing. SIAM PP16 features a minisymposium on BSP.
Reviews
Parallel scientific computation Bisseling R., Oxford University Press, 2004.
HPC in big data age Cheptsov A. EuroMPI/ASIA 2014 (Proceedings of the 21st European MPI Users’ Group Meeting), Kyoto, Japan, Sep 912, 2014, p. 175180.
Design patterns percolating to parallel programming framework implementation Aldinucci M., Campa S., Danelutto M., Kilpatrick P., Torquati M. International Journal of Parallel Programming 42 (6), 2014, pp. 10121031.





1)  Valiant, L. G. A bridging model for parallel computation. CACM 33, 8 (1990), 103111.

2)  Fortune, S.; Wyllie, J. Parallelism in random access machines. In Proc. of STOC ’78 (1978), 114118.

3)  Valiant, L. G. A bridging model for multicore computing. Journal of Computer and System Sciences 77, 1 (2011), 154166.

4)  Hill, J. M. D.; McColl, B.; Stefanescu, D. C.; Goudreau, M. W.; Lang, K.; Rao, S. B.; Suel, T.; Tsantilas, T.; Bisseling, R. H. BSPlib: the BSP programming library. Parallel Computing 24, 14 (1998), 19471980.

5)  Hill, J. M. D.; Skillicorn D. B. Lessons learned from implementing BSP. Future Generation Computer Systems 13, 45 (1998), 327335.

6)  Snir, M.; Otto, S. W.; HussLederman, S.; Walker, D. W.; Dongarra, J. MPI—the complete reference: volume 1: the MPI core (2nd ed.). MIT Press, 1998.

7)  Dean, J.; Ghemawat, S. MapReduce: simplified data processing on large clusters. CACM 51, 1 (2008), 107113.

8)  White, T. Hadoop: the definitive guide (4th ed.). O’Reilly, 2015

9)  Seo, S.; Yoon, E. J.; Kim, J.; Jin, S.; Kim, J.S.; Maeng, S. Hama: an efficient matrix computation with the MapReduce framework. In Proc. of CloudCom ’10 (2010), 721726.

10)  Malewicz, G.; Austern, M. H.; Bik, A. J. C.; Dehnert, J. C.; Horn, I.; Leiser, N.; Czajkowski, G. Pregel: a system for largescale graph processing. In Proc. of SIGMOD ’10 (2010), 135146.

11)  Apache Giraph

12)  Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauley, M.; Franklin, M. J.; Shenker, S.; Stoica, I. Resilient distributed datasets: a faulttolerant abstraction for inmemory cluster computing. In Proc. of NSDI ’12 (2012), 1528.

13)  Xing, E. P.; Ho, Q.; Dai, W.; Kim, J. K.; Wei, J.; Lee, S.; Zheng, X.; Xie, P.; Kumar, A.; Yu, Y. Petuum: a new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1, 2 (2015), 4967.

14)  Ching, A.; Edunov, S.; Kabiljo, M.; Logothetis, D.; Muthukrishnan, S. One trillion edges: graph processing at Facebookscale. Proceedings of the VLDB Endowment 8, 12 (2015), 18041815.

15)  Bisseling, R. H. Parallel scientific computation: a structured approach using BSP and MPI. Oxford University Press, 2004.

16)  Gorlatch, S. Sendreceive considered harmful: myths and realities of message passing. ACM Transactions on Programming Languages and Systems 26, 1 (2004), 4756.

17)  Goudreau, M. W.; Lang, K.; Rao, S. B.; Suel, T.; Tsantilas, T. Portable and efficient parallel computing using the BSP model. IEEE Transactions on Computers 48, 7 (1999), 670689.

18)  Bonorden, O.; Juurlink, B.; von Otte, I.; Rieping, I. The Paderborn University BSP (PUB) library. Parallel Computing 29, 2 (2003), 187207.

19)  Bousdira, W.; Gava, F.; Gesbert, L.; Loulergue, F.; Petiot, G. Functional parallel programming with revised bulk synchronous parallel ML. In Proc. of ICNC ’10 (2010), 191196.

20)  Loulergue, F.; Hains, G.; Foisy, C. A calculus of functional BSP programs. Science of Computer Programming 37, 13 (2000), 253277.

21)  Yzelman, A. N.; Bisseling, R. H. An objectoriented bulk synchronous parallel library for multicore programming. Concurrency and Computation: Practice and Experience 24, 5 (2012), 533553.

22)  Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K. MulticoreBSP for C: a highperformance library for sharedmemory parallel programming. International Journal of Parallel Programming 42, 4 (2014), 619642.

23)  Yelick. K. A. Ten ways to waste a parallel computer. In Proc. of ISCA ’09 (2009). Slides of keynote talk.

24)  Chan, E.; Heimlich, M.; Purkayastha, A.; Van De Geijn, R. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19, 13 (2007), 17491783.

25)  Alpert, R. D.; Philbin J. F. cBSP: zerocost synchronization in a modified BSP model. Technical Report 97054. NEC Research Institute, 1997.

26)  Dean, J.; Barroso, L. A. The tail at scale. CACM 56, 2 (2013), 7480.

27)  Liao, X.; Xiao, L.; Yang, C.; Lu Y. MilkyWay2 supercomputer: system and application. Frontiers of Computer Science 8, 3 (2014), 345356.

28)  Vetter, J. S. (Ed.). Contemporary high performance computing: from petascale toward exascale, volume 2. CRC Press, 2015.


