Program and Network Properties: Conditions of Parallelism





















Conditions of Parallelism
The exploitation of parallelism in computing requires understanding the basic theory associated with it. Progress is needed in several areas:
o computation models for parallel computing;
o interprocessor communication in parallel architectures;
o integration of parallel systems into general environments.

Data and Resource Dependencies
Program segments cannot be executed in parallel unless they are independent. Independence comes in several forms. Data dependence: data modified by one segment must not be modified by another parallel segment.

Bus arbitration logic must deal with conflicting requests. A bus system has the lowest cost, and also the lowest bandwidth, of all dynamic connection schemes; many bus standards are available. In general, any input of a switch module can be connected to one or more of the outputs; however, multiple inputs may not be connected to the same output. Any multistage network is composed of a collection of a×b switch modules and fixed network modules. The a×b switch modules provide variable permutation or other reordering of the inputs, which are then further reordered by the fixed network modules.

A generic multistage network consists of a sequence of stages that alternate dynamic switches (with relatively small values of a and b) with static networks having larger numbers of inputs and outputs. The static networks are used to implement the interstage connections (ISC). A 2×2 switch has four legal settings: straight-through, crossover, upper broadcast (the upper input is sent to both outputs), and lower broadcast (the lower input is sent to both outputs); "no output" is a somewhat vacuous possibility as well. With four stages of eight 2×2 switches, and a static perfect shuffle for each of the four ISCs, a 16×16 Omega network can be constructed, but not all permutations are possible.
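To make the routing concrete, here is a small sketch (in C, with illustrative names) of the common destination-tag routing scheme for Omega networks: at each stage the message is first permuted by the perfect-shuffle ISC, and the 2×2 switch then forces the low-order bit of the port label to match the next destination bit, taken from the most significant bit downwards. This is a standard textbook scheme rather than something stated on the slides.

```c
/* Destination-tag routing through an N x N Omega network (N = 2^n).
   A minimal sketch: port labels are n-bit numbers, each inter-stage
   connection is a perfect shuffle (rotate-left of the label), and the
   2x2 switch at each stage forces the low-order bit to match the next
   destination bit (MSB first).  Function names are illustrative. */
#include <stdio.h>

#define N_BITS 4                      /* 16 x 16 network, 4 stages */
#define N      (1u << N_BITS)

static unsigned rotl(unsigned x)      /* perfect shuffle of a label */
{
    return ((x << 1) | (x >> (N_BITS - 1))) & (N - 1);
}

/* Print the port reached after every stage on the way from src to dst. */
static void omega_route(unsigned src, unsigned dst)
{
    unsigned cur = src;
    printf("route %u -> %u:", src, dst);
    for (int stage = 0; stage < N_BITS; stage++) {
        cur = rotl(cur);                              /* ISC: shuffle    */
        unsigned bit = (dst >> (N_BITS - 1 - stage)) & 1u;
        const char *setting = ((cur & 1u) == bit) ? "straight" : "cross";
        cur = (cur & ~1u) | bit;                      /* switch exchange */
        printf("  stage %d (%s) -> port %u", stage, setting, cur);
    }
    printf("\n");
}

int main(void)
{
    omega_route(5, 12);     /* example source/destination pair */
    omega_route(3, 3);
    return 0;
}
```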

A baseline network can be shown to be topologically equivalent to other networks (including Omega), and has a simple recursive generation procedure. An m×n crossbar network can be used to provide a constant-latency connection between devices; it can be thought of as a single-stage switch. Different types of devices can be connected, yielding different constraints on which switches can be enabled. With m processors and n memories, one processor may be able to generate requests for multiple memories in sequence; thus several switches might be set in the same row.

For m×m interprocessor communication, each PE is connected to both an input and an output of the crossbar; only one switch in each row and each column can be turned on simultaneously. Additional control processors are used to manage the crossbar itself.
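The row/column constraint above is easy to state as code. The sketch below (the function and array names are my own) checks a proposed set of m×m crossbar switch settings and rejects any configuration that closes more than one switch in a row or a column.

```c
/* Check that a proposed set of crossbar switch settings is legal for
   m x m interprocessor communication: at most one closed switch per
   row and per column.  A small sketch; the data layout is assumed. */
#include <stdbool.h>
#include <stdio.h>

#define M 8

/* settings[i][j] != 0 means the switch at row i, column j is closed. */
static bool crossbar_settings_legal(int settings[M][M])
{
    for (int i = 0; i < M; i++) {
        int row_count = 0, col_count = 0;
        for (int j = 0; j < M; j++) {
            row_count += settings[i][j] != 0;   /* switches closed in row i    */
            col_count += settings[j][i] != 0;   /* switches closed in column i */
        }
        if (row_count > 1 || col_count > 1)
            return false;
    }
    return true;
}

int main(void)
{
    int s[M][M] = {0};
    s[0][3] = 1;            /* PE 0 talks to PE 3 */
    s[1][3] = 1;            /* illegal: column 3 already in use */
    printf("%s\n", crossbar_settings_legal(s) ? "legal" : "conflict");
    return 0;
}
```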



Antidependence: S1 precedes S2, and the output of S2 overlaps the input to S1.

Data Dependence - 2
Unknown dependence arises when the dependence relation cannot be determined, for example when:
o the subscript of a variable is itself subscripted;
o the subscript does not contain the loop index variable;
o the subscript is nonlinear in the loop index variable.
Intersection of the input sets is allowed (two parallel segments may read the same data).

Solving the Mismatch Problems
o Develop compilation support.
o Redesign hardware for more efficient exploitation by compilers.
o Use large register files and sustained instruction pipelining.
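Alongside the antidependence defined above, the usual taxonomy also includes flow (true) dependence, where a later statement reads a value an earlier statement writes, and output dependence, where two statements write the same location. The minimal sketch below, with arbitrary variable names, shows one example of each.

```c
/* Tiny, self-contained examples of the three dependence types.
   Variable names are arbitrary illustrations. */
#include <stdio.h>

int main(void)
{
    int a, b = 1, c = 2, d, x, y = 3, z, p = 4, q = 5, r = 6, s = 7;

    /* Flow (true) dependence: S2 reads what S1 writes, so S2 must follow S1. */
    a = b + c;        /* S1 */
    d = a * 2;        /* S2 */

    /* Antidependence: S1 reads y, S2 then overwrites it; S2 must not run first. */
    x = y + 1;        /* S1 */
    y = 5;            /* S2 */

    /* Output dependence: both statements write z; the final value depends on order. */
    z = p + q;        /* S1 */
    z = r - s;        /* S2 */

    printf("%d %d %d %d %d\n", a, d, x, y, z);
    return 0;
}
```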

The Role of Compilers
Compilers are used to exploit hardware features to improve performance. The sizes of the program pieces that can be considered for parallel execution are roughly classified using the term granule size, or simply granularity.

Latency
Latency is the time required for communication between different subsystems in a computer. Memory latency, for example, is the time required by a processor to access memory. Computational granularity and communication latency are closely related.

Loop-level Parallelism
The most optimized program construct to execute on a parallel or vector machine; some loops are difficult to handle.

Procedure-level Parallelism
Medium-sized grain, usually no more than a few thousand instructions.

Subprogram-level Parallelism
Job-step level; a grain typically has thousands of instructions; medium- or coarse-grain level.
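Loop-level parallelism is easiest to see in a loop whose iterations are independent. The short sketch below, with made-up array names, contrasts such a loop with one that carries a dependence from iteration to iteration and therefore cannot be run in parallel directly.

```c
/* Loop-level parallelism: independent iterations can run on a vector
   or parallel machine; a loop-carried dependence serializes them.
   A minimal illustration with made-up array names. */
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    /* Parallelizable: iteration i touches only element i. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* Not directly parallelizable: iteration i needs the result of
       iteration i-1 (a loop-carried flow dependence). */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```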

Job or Program-Level Parallelism
Corresponds to execution of essentially independent jobs or programs on a parallel computer.

Communication Latency
Balancing granularity and latency can yield better performance.

Interprocessor Communication Latency
Needs to be minimized by the system designer; affected by signal delays and communication patterns.

Communication Patterns
Determined by the algorithms used and the architectural support provided. Patterns include permutations, broadcast, multicast, and conference. Trade-offs often exist between the granularity of parallelism and communication demand.

Grain Packing and Scheduling
Two questions:
o How can I partition a program into parallel pieces to yield the shortest execution time?
o What is the optimal size of parallel grains?
One approach to the problem is called grain packing. Some general scheduling goals:
o Schedule all fine-grain activities in a node to the same processor to minimize communication delays.
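As a rough illustration of the grain-packing trade-off, the toy model below compares the schedule length of running two grains on separate processors (the longer grain plus the communication delay between them) against packing them onto one processor (their computation times added, with no communication). The cost model, values, and names are assumptions for illustration, not taken from the slides.

```c
/* A toy illustration of the grain-packing trade-off: pack two grains
   onto one processor when the interprocessor communication delay
   between them outweighs the parallelism gained.  The cost model is
   an assumption, not from the slides. */
#include <stdbool.h>
#include <stdio.h>

/* Schedule length if the grains run on separate processors:
   the longer grain plus the communication delay between them. */
static double separate_time(double g1, double g2, double comm)
{
    return (g1 > g2 ? g1 : g2) + comm;
}

/* Schedule length if the grains are packed onto one processor:
   they run back to back, but no communication is needed. */
static double packed_time(double g1, double g2)
{
    return g1 + g2;
}

static bool should_pack(double g1, double g2, double comm)
{
    return packed_time(g1, g2) <= separate_time(g1, g2, comm);
}

int main(void)
{
    /* Grain sizes and communication delay in arbitrary time units. */
    printf("pack? %s\n", should_pack(10.0, 12.0, 15.0) ? "yes" : "no");
    printf("pack? %s\n", should_pack(10.0, 12.0, 2.0)  ? "yes" : "no");
    return 0;
}
```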

Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not always produce a shorter schedule (see figure 2).

Control Flow vs. Data Flow
Control flow machines use shared memory for instructions and data. In a dataflow machine, each datum is tagged with:
o the address of the instruction to which it belongs;
o the context in which the instruction is being executed.
Tagged tokens enter a PE through a local (pipelined) path, and can also be communicated to other PEs through the routing network.

A Dataflow Architecture - 2
Instruction addresses effectively replace the program counter in a control flow machine.

Reduction Machine Models
String-reduction model:
o each demander gets a separate copy of the expression string to evaluate;
o each reduction step has an operator and embedded references to demand the corresponding operands;
o each operator is suspended while its arguments are evaluated.
Graph-reduction model:
o the expression graph is reduced by evaluation of branches or subgraphs, possibly in parallel, with demanders given pointers to the results of reductions.

Permutations
For n objects there are n! permutations. The inverse perfect shuffle reverses the effect of the perfect shuffle.

Static Networks: Ring, Chordal Ring
Like a linear array, but the two end nodes are connected by an nth link; the ring can be uni- or bi-directional.
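The perfect shuffle mentioned under Permutations above is commonly defined, for 2^n objects, as a cyclic left shift of the n-bit object address; the inverse perfect shuffle is the corresponding cyclic right shift. The sketch below (names are illustrative) computes both and confirms that the inverse undoes the shuffle.

```c
/* Perfect shuffle and inverse perfect shuffle on n-bit node addresses.
   The perfect shuffle is a cyclic left shift of the address bits; the
   inverse is a cyclic right shift.  A small sketch with made-up names. */
#include <stdio.h>

#define N_BITS 3                       /* 8 objects: addresses 0..7 */
#define N      (1u << N_BITS)

static unsigned perfect_shuffle(unsigned x)
{
    return ((x << 1) | (x >> (N_BITS - 1))) & (N - 1);
}

static unsigned inverse_shuffle(unsigned x)
{
    return ((x >> 1) | ((x & 1u) << (N_BITS - 1))) & (N - 1);
}

int main(void)
{
    for (unsigned i = 0; i < N; i++)
        printf("%u -> shuffle %u -> inverse %u\n",
               i, perfect_shuffle(i), inverse_shuffle(perfect_shuffle(i)));
    return 0;
}
```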

Static Networks: Barrel Shifter
Like a ring, but with additional links between all pairs of nodes whose distance is a power of 2.

Static Networks: Tree
The balanced binary tree is scalable, since it has a constant maximum node degree.
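For the barrel shifter described above, node i is linked to (i ± 2^k) mod N for every power of 2 smaller than N, which works out to 2n - 1 neighbours per node when N = 2^n. The small sketch below (illustrative names) lists the neighbours of a few nodes.

```c
/* Neighbours of a node in an N-node barrel shifter (N = 2^n): node i
   is linked to (i +/- 2^k) mod N for every k < n.  A minimal sketch. */
#include <stdio.h>

#define N_BITS 4
#define N      (1 << N_BITS)           /* 16 nodes */

static void print_neighbours(int i)
{
    printf("node %2d:", i);
    for (int k = 0; k < N_BITS; k++) {
        int d = 1 << k;                               /* distance 2^k    */
        printf(" %d", (i + d) % N);                   /* forward link    */
        if (d != N - d)                               /* avoid duplicate */
            printf(" %d", (i - d + N) % N);           /* backward link   */
    }
    printf("\n");
}

int main(void)
{
    print_neighbours(0);
    print_neighbours(5);
    return 0;
}
```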

Static Networks: Fat Tree
A fat tree is a tree in which the number of edges between nodes increases closer to the root, similar to the way the thickness of limbs increases in a real tree as we get closer to the root.

Static Networks: Systolic Array
A systolic array is an arrangement of processing elements and communication links designed specifically to match the computation and communication requirements of a specific algorithm or class of algorithms.

Static Networks: Cube-Connected Cycles
k-cube-connected cycles (CCC) can be created from a k-cube by replacing each vertex of the k-dimensional hypercube with a ring of k nodes.

Static Networks: k-ary n-Cubes
Rings, meshes, tori, binary n-cubes, and Omega networks (to be discussed later) are topologically isomorphic to a family of k-ary n-cube networks. The cost of k-ary n-cubes is dominated by the amount of wire, not the number of switches.

Network Throughput
Network throughput is the number of messages a network can handle in a unit time interval.

Dynamic Connection Networks
Dynamic connection networks can implement all communication patterns based on program demands. In increasing order of cost and performance, these include:
o bus systems;
o multistage interconnection networks;
o crossbar switch networks.
Price can be attributed to the cost of wires, switches, arbiters, and connectors.

Dynamic Networks: Bus Systems
A bus system (contention bus, time-sharing bus) has a collection of wires and connectors shared by multiple modules (processors, memories, peripherals, etc.).

Dynamic Networks: Switch Modules
An a×b switch module has a inputs and b outputs. When only one-to-one mappings are allowed, the switch is called a crossbar switch.

Data parallelism can be illustrated by summing the contents of an array of N elements. For a single-core system, one thread would simply sum all of the elements.

For a dual-core system, however, thread A, running on core 0, could sum the first half of the elements while thread B, running on core 1, sums the second half. The two threads would thus be running in parallel on separate computing cores. Considering the same array again, an example of task parallelism might involve two threads, each performing a unique statistical operation on the array of elements.

Again, the threads are operating in parallel on separate computing cores, but each is performing a unique operation.
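Here is a minimal sketch of the data-parallel sum described above, using POSIX threads: two threads each sum half of a global array, and the main thread combines the partial results. The array size, structure layout, and function names are illustrative assumptions.

```c
/* Data-parallel array sum with two POSIX threads, each (ideally)
   running on a separate core.  Compile with: cc sum.c -pthread.
   All names here are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define N 1000

static int data[N];

struct range { int lo, hi; long sum; };

static void *partial_sum(void *arg)
{
    struct range *r = arg;
    r->sum = 0;
    for (int i = r->lo; i < r->hi; i++)   /* each thread touches only its half */
        r->sum += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = i;

    struct range a = { 0, N / 2, 0 };      /* thread A: first half  */
    struct range b = { N / 2, N, 0 };      /* thread B: second half */

    pthread_t ta, tb;
    pthread_create(&ta, NULL, partial_sum, &a);
    pthread_create(&tb, NULL, partial_sum, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    printf("total = %ld\n", a.sum + b.sum);
    return 0;
}
```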


