How to Avoid Herd: A Novel Stochastic Algorithm in Grid Scheduling

Qinghua Zheng 1,2, Haijun Yang 3 and Yuzhong Sun 1
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080
2 Graduate School of the Chinese Academy of Sciences, Beijing, 100049
3 School of Economics and Management, BEIHANG University
zhengqinghua@ncic.ac.cn, navy@buaa.edu.cn, yuzhongsun@ict.ac.cn

Abstract

Grid technologies promise to bring grid users high performance, so scheduling is becoming a crucial problem. Herd behavior is a common phenomenon that causes severe performance degradation in grid environments through poor scheduling decisions. In this paper, on the basis of the theoretical results of the homogeneous balls and bins model, we propose a novel stochastic algorithm to avoid herd behavior. Our experiments show that the multi-choice strategy, combined with the advantages of DHT, can reduce herd behavior in a large-scale sharing environment while providing better scheduling performance and incurring much less scheduling overhead than greedy algorithms. In the case of 1000 resources, the simulations show that, under heavy load (i.e., system utilization rate 0.5), the multi-choice algorithm reduces the number of incurred herds by a factor of 36, the average job waiting time by a factor of 8, and the average job turn-around time by 12% compared to the greedy algorithms.

1. Introduction

Scheduling is one of the key issues in grid computing, and various scheduling systems can be found in many grid projects, such as GrADS[1], Ninf[2], etc. Among all challenges in grid scheduling, "herd behavior" (meaning that "all tasks arriving in an information update interval go to a same subset of servers")[3], which causes imbalance and drastic contention on resources, is a critical one. The large scale of grid resources partitions the network into many autonomous sites, each making its own scheduling decisions. There is theoretical evidence that systems in which resource allocation is performed by many independent entities can exhibit herd behavior and thus suffer degraded performance[3]. Herd behavior arises when independent entities, with imperfect and stale system information, allocate multiple requests onto the same resource simultaneously. Here, "imperfect" means that an entity does not know the others' decisions when it performs resource allocation. This herd behavior can be found in supercomputer scheduling[4], where many SAs (Supercomputer AppLeS) scheduled jobs on one supercomputer. Each SA estimated the turn-around time of each request based on the current state of the supercomputer, and then forwarded the request with the smallest expected turn-around time to the supercomputer. The decision made by one SA affected the state of the system, and therefore the other SA instances, so the global behavior of the system emerged from the aggregate behavior of all SAs.

We can abstract the above system into the model shown in Fig.1, in which herd behavior might happen. In this model, jobs are first submitted to one of the scheduling instances (SIs) and then forwarded to one of the shared resources, allocated independently by that SI. When multiple SIs, with imperfect and stale information, perform the same resource allocation simultaneously, herd behavior happens, leaving some resources overloaded while others stay idle.

Fig.1. Job Scheduling Model (SI: Scheduling Instance. R: Resource). Jobs arrive at the SIs, which independently forward them to the shared resources R.
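To make the failure mode in Fig.1 concrete, the following toy simulation is a sketch of ours (the workload numbers are illustrative assumptions, not taken from the paper): several scheduling instances each greedily pick the least-loaded resource from the same stale load snapshot, so all of their jobs land on one resource, whereas a random choice spreads them out.

import random

def schedule(num_sis=8, num_resources=5, use_random=False, seed=0):
    """Toy model of Fig.1: each SI allocates one job using the same stale load snapshot."""
    rng = random.Random(seed)
    true_load = [rng.randint(0, 5) for _ in range(num_resources)]
    stale_snapshot = list(true_load)          # every SI sees the same stale view
    for _ in range(num_sis):
        if use_random:
            choice = rng.randrange(num_resources)               # randomness breaks symmetry
        else:
            choice = stale_snapshot.index(min(stale_snapshot))  # greedy on stale info
        true_load[choice] += 1                # decisions are invisible to the other SIs
    return true_load

print("greedy on stale info:", schedule())                  # one resource absorbs every job (herd)
print("randomized choice:   ", schedule(use_random=True))   # jobs spread over the resources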
This abstract model remains reasonable in a grid environment. The autonomy of each site and network latency make the information used for scheduling imperfect and stale, even more so than in the supercomputer environment. With imperfect and stale information, existing grid scheduling methods may cause herd behavior and thus degrade system performance. How to prevent herd behavior in grid scheduling is therefore an important problem for improving system performance.

The main contributions of this paper are as follows. 1) On the basis of the balls and bins model, we propose a novel stochastic algorithm, named the multi-choice algorithm, that reduces the herds incurred in grid scheduling by at least one order of magnitude. 2) By combining the advantages of DHT and data replication techniques, we bridge the gap between a theoretical result (the homogeneous balls and bins model) and a complex, realistic world (the heterogeneous distributed grid environment). 3) Simulations demonstrate that the multi-choice algorithm provides better scheduling performance while incurring much less overhead than conventional algorithms. In the case of 1000 resources, the simulations show that, under heavy load (i.e., system utilization rate 0.5), the multi-choice algorithm reduces the average job waiting time by a factor of 8 and the average job turn-around time by 12% compared to the greedy algorithms.

The rest of this paper is structured as follows. Section 2 compares our method to related work. Section 3 presents our scheme to prevent herd behavior. Section 4 describes our implementation and algorithms. Section 5 presents simulations supporting our claims. Finally, we summarize our conclusions in Section 6.

2. Related work

The goal of the Grid Application Development Software (GrADS) Project is to provide programming tools and an execution environment to ease program development for the Grid. Its scheduling procedure involves the following general steps: (i) identify a large number of sets of resources that might be good platforms for the application, (ii) use the application-specific mapper and performance model specified in the COPs (configurable object programs) to generate a data map and predict the execution time for those resource sets, and (iii) select the resource set that results in the best rank value (e.g. the smallest predicted execution time).

The goal of Ninf is to develop a platform for global scientific computing with computational resources distributed in a world-wide global network. One of its components, named the metaserver, provides the scheduler and resource monitor. The scheduler obtains the computing servers' load information from the resource information database and decides the most appropriate server depending on the characteristics of the requested function (usually the server which would yield the best response time, while maintaining reasonable overall system throughput).

Although the details of the above scheduling systems vary dramatically, they share one aspect that causes herd behavior: all of them are in nature greedy algorithms, selecting the best resource set according to some criterion. When jobs arrive simultaneously at multiple scheduling instances, the identical schedules taken by these scheduling instances may aggregate on the system, and herd behavior happens.

Condor[5] is a distributed batch system for sharing the workload of compute-intensive jobs in a pool of UNIX workstations connected by a network. In a pool, the Central Manager is responsible for job scheduling, finding matches between various ClassAds[6,7], in particular resource requests and resource offers, choosing the one with the highest Rank value (non-integer values are treated as zero) among compatible matches and breaking ties according to the provider's rank value[6]. To allow multiple Condor pools to share resources by sending jobs to each other, Condor employs a distributed, layered flocking mechanism, whose basis is formed by the Gateway machines, at least one in every participating pool[8]. These Gateway machines act as resource brokers between pools. A Gateway chooses at random a machine from the availability lists received from the other pools to which it is connected, and represents the machine to the Central Manager for job allocation. When a job is assigned to the Gateway, the Gateway scans the availability lists in random order until it encounters a machine satisfying the job requirements; it then sends the job to the pool to which this machine belongs. Although a DHT can be used to automatically organize Condor pools[9], the fashion of scheduling in a Condor flock remains unchanged. Scheduling in Condor does not cause herd behavior because: (1) it defines a scope (the Condor pool) within which only one scheduling instance is running, so scheduling in each scope does not affect the others; and (2) stochastic selection is used when scheduling in Condor flocking. Some disadvantages are obvious: a large scope raises concerns about a performance bottleneck and a single point of failure, while a small scope prevents certain optimizations due to the layered Condor flocking mechanism[8]. This scoping leads to a lack of efficiency and flexibility in sharing grid resources. Nevertheless, its stochastic approach is a key to preventing herd behavior.

Randomization has been widely noted to be useful for scheduling. In [10], it was pointed out that the parameters of performance models may exhibit a range of values because, in non-dedicated distributed systems, execution performance can vary widely as other users share resources. In [11], experience showed that introducing some degree of randomness into several of the heuristics for host selection can lead to better scheduling decisions in certain cases: when multiple task choices achieve, or are very close to, the required minimum or maximum of some value (e.g. minimum completion time, maximum sufferage, etc.), their idea is to pick one of these tasks at random rather than letting the choice be guided by the implementation of the heuristic. The paper[12] described a methodology that allows schedulers to take advantage of stochastic information (choosing a value from the range given by the stochastic prediction of performance) to improve application execution performance. In the next section, we present a novel stochastic algorithm based on the balls-and-bins model.
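As an illustration of the randomized tie-breaking idea from [11], the following sketch is ours and the tolerance parameter eps is an assumption, not a value from the cited work: among candidates whose score is within a small tolerance of the best, one is picked at random instead of always the first.

import random

def pick_with_random_tiebreak(candidates, score, eps=0.05, rng=random):
    """Pick, uniformly at random, a candidate whose score is within eps of the best.

    candidates: arbitrary items; score: item -> float, lower is better (e.g. completion time).
    """
    best = min(score(c) for c in candidates)
    near_best = [c for c in candidates if score(c) <= best * (1.0 + eps)]
    return rng.choice(near_best)

# Example: three hosts with almost identical predicted completion times.
hosts = [("h1", 100.0), ("h2", 101.0), ("h3", 150.0)]
print(pick_with_random_tiebreak(hosts, score=lambda h: h[1]))   # h1 or h2, never h3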
3. How to avoid herd behavior

Recall that herd behavior is caused by identical decisions aggregating on a system simultaneously, and that it is imperfect and stale information that leads multiple individuals to take the same decision. With perfect and up-to-date information (the current state of resources and all decisions made by others are known), each scheduling instance could properly use a greedy algorithm, making appropriate decisions to maximize system performance, and herd behavior would not take place. In a grid environment, however, the information used for scheduling is inevitably imperfect and stale. In this case, stochastic algorithms can be used instead of greedy algorithms to keep individuals from taking the same decision. In this paper, we introduce a stochastic algorithm based on the balls and bins model into grid scheduling. Grid scheduling cannot immediately benefit from the model, however, because the model has very strong constraints; we therefore propose several techniques to overcome these constraints.

3.1. Balls and bins model

The balls and bins model is a theoretical model for studying load balance. Here we introduce some of its results. Suppose that we sequentially place n balls into n boxes by putting each ball into a randomly chosen box. It is well known that when we are done, the fullest box has, with high probability, (1 + o(1)) ln n / ln ln n balls in it. Suppose instead that for each ball we choose d (d >= 2) boxes at random and place the ball into the one that is less full at the time of placement. Azar[13] showed that with high probability the fullest box then contains only ln ln n / ln d + Θ(1) balls, exponentially fewer than before. Further, Vöcking[14] found that choosing the bins in a non-uniform way results in better load balancing than choosing the bins uniformly. Mitzenmacher[3] demonstrated the importance of using randomness to break symmetry: for systems relying on imperfect information to make scheduling decisions, "deterministic" heuristics that choose the most lightly loaded server do poorly and significantly hurt performance (even with continuously updated but stale load information) because of herd behavior, but a small amount of randomness (e.g. a d-random strategy) suffices to break the symmetry and gives performance comparable to that achievable with perfect information.
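A small Monte Carlo sketch of the two placement rules above (our illustration; the values of n and the number of trials are arbitrary choices, not from the paper) makes the gap between the one-choice and two-choice maximum loads visible:

import random

def max_load(n, d, trials=5, seed=1):
    """Average maximum bin load after throwing n balls into n bins with d random choices."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        load = [0] * n
        for _ in range(n):
            picks = [rng.randrange(n) for _ in range(max(1, d))]
            best = min(picks, key=lambda b: load[b])   # least loaded of the d candidates
            load[best] += 1
        total += max(load)
    return total / trials

n = 100_000
print("1 choice :", max_load(n, 1))   # grows like ln n / ln ln n
print("2 choices:", max_load(n, 2))   # grows like ln ln n / ln 2 + O(1), exponentially smaller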
3.2. How to employ this model

This multiple-choice idea, i.e. first selecting a small subset of alternatives at random and then placing the ball into one of these alternatives, has several applications in randomized load balancing, data allocation, hashing, routing, PRAM simulation, and so on[15]. For example, it has recently been employed in DHT load balancing[16]. When we use such an algorithm for load balancing in a grid scheduling system, balls represent jobs or tasks, bins represent machines or servers, and the maximum load corresponds to the maximum number of jobs assigned to a single machine. However, there are three essential problems which we must tackle before employing this model.

Organization of grid resources. In the original theory, it is supposed that all n bins are known and that any bin can be found when placing a ball. In grid environments, we must realize this supposition by organizing all grid resources in resource management.

Constraints imposed by jobs on resources. In the original theory, there is no constraint on the choice of the d bins; that is, a ball can be placed into any one of the n bins if we do not care about load balance. In grid scheduling applications, some jobs have resource requirements (e.g. CPU clock rate, memory size, etc.), which limit the scope of the d choices. Hence, we should make appropriate resource choices for each job.

Probability of choosing a resource. In the original theory, all bins are homogeneous and each one is chosen with the same probability (1/n); it does not matter which bin carries the maximum load. In grid scheduling applications, resources are drastically heterogeneous and their capabilities (e.g. CPU, memory) differ greatly from one another. Therefore, we must carefully consider how to adjust the probability of choosing a resource, because it now matters which resource is allocated the maximum load.

We bridge these gaps in the following subsections.

3.2.1. Organization of grid resources. There are three typical ways of organizing resources, each found in a certain project.

Flat organization: the Condor flock. As stated in Section 2, sharing resources between Condor pools requires the pairwise connection of their Gateways. All of these Gateways sit in a single plane, each connecting directly to the others when flocking. The disadvantage of a flat organization is obvious: for global resource sharing, a fully connected network of Gateways is created, which might be unsuitable for a large-scale environment such as the grid because of the burden on these Gateways; otherwise, it lacks the efficiency to share a few rare resources among all users.

Hierarchical organization: MDS[17] of Globus[18]. In MDS, aggregate directories provide often specialized, VO-specific views of federated resources, services, and so on. Like a Condor pool, each aggregate directory defines a scope within which search operations take place, allowing users and other services to perform efficient discovery without scaling to large numbers of distributed information providers within a VO. Unlike the Condor flocking mechanism, MDS builds an aggregate directory hierarchically, as a directory of directories. One disadvantage of MDS is that each directory must be maintained by an organization. Another is that as more and more directories are added, with a root directory eventually consisting of a very large number of directories, and many searches are directed to the root directory, concerns about a performance bottleneck and a single point of failure arise.

P2P organization: SWORD[19] of PlanetLab[20]. SWORD is a scalable resource discovery service for wide-area distributed systems. The implementations of SWORD use the Bamboo DHT[21], but the approach generalizes to any DHT because only the key-based routing functionality of the DHT is used. One disadvantage of this decentralized DHT-based resource discovery infrastructure is the relatively poor performance of collecting information compared to a centralized architecture.

All three methods above have their advantages as well as disadvantages. Nevertheless, we adopt DHT-based resource management for the following reasons.

(1) The advantages of using DHT. A DHT inherits the quick search and load balance merits of a hash table, and its key properties include self-organization, decentralization, routing efficiency, robustness, and scalability. Moreover, as middleware with simple APIs, a DHT features simplicity and eases large-scale application development. Besides, when it is used as an infrastructure, applications automatically inherit the properties of the underlying DHT.

(2) The small overhead of the d-random strategy. As mentioned above, the disadvantage of using a DHT is the relatively poor performance of collecting information compared to a centralized architecture. However, our d-choice strategy submits only d queries on the DHT, each taking time O(log(n)). Hence, our query complexity is only O(d * log(n)), far less than collecting all information, which requires traversing the whole DHT and takes time O(n). Consequently, the powerful theoretical results of the d-choice strategy, combined with the quick search of the DHT, can produce a practical scheduling system which provides good performance even with imperfect information while incurring small overhead.

In this paper, we use Chord[22] to organize resources. Note that other DHTs can also be employed, and advances in DHT research can benefit our scheduling system too. The application using Chord is responsible for providing the desired authentication, caching, replication, and user-friendly naming of data; Chord's flat key space eases the implementation of these features. Chord provides support for just one operation: given a key, it maps the key onto a node. Depending on the application using Chord, that node might be responsible for storing a value associated with the key. Data location can easily be implemented on top of Chord by associating a key with each data item and storing the <key, data> pair at the node to which the key maps. As SWORD of PlanetLab has successfully designed and implemented a DHT-based resource discovery system that can answer multi-attribute range queries, we follow its design of key mapping in this part; in this way, our work can efficiently be employed on top of SWORD.

For simplicity, we use multiple Chord instances, one per resource attribute. For example, if a resource defines two attributes, <CPU, MEM>, we create two Chord rings, each corresponding to one attribute. Chord places no constraint on the structure of the keys it looks up: the Chord key space is flat. This gives applications a large amount of flexibility in how they map their own names to Chord keys. Fig. 2 shows how we map a measurement of one attribute to a DHT key. The length of the value bits is allocated according to the range of the attribute value. The random bits spread measurements of the same value among multiple DHT nodes, to load-balance attributes that take on a small number of values. In this way, we obtain an ordered DHT ring.

Fig. 2. Mapping a measured value to a DHT key (example: CPU = 86 MFLOPS maps to key 0x563A81D9, i.e. the value bits 0x56 followed by random bits).

Fig. 3 shows an example of how resources are organized for one attribute. When a resource joins the DHT, it is assigned a random node key and holds the key space between its predecessor's node key and its own. It is responsible for storing the data whose keys fall in its key space, answering queries whose keys fall in this space, and forwarding to other nodes the queries whose keys are beyond this space.

Fig. 3. Resource organization for the CPU attribute: resources A-D (CPU 86, 120, 220, 135 MFLOPS; CPU keys 0x563A81D9, 0x780209E1, 0xDC937C5A, 0x87103FF2) are stored on Chord nodes with node keys 0x16F92E11, 0x990663CB, 0x5517A339, 0xD5827FA3; each node keeps its predecessor, successor, finger table, and the list of resource entries it stores.
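The following sketch is ours, not code from the paper; it implements the key layout suggested by the example in Fig. 2, assuming a 32-bit key with 8 value bits and 24 random bits. In the paper the bit widths would be chosen from the attribute's value range.

import random

KEY_BITS = 32
VALUE_BITS = 8                       # assumed: attribute values fit in 0..255
RANDOM_BITS = KEY_BITS - VALUE_BITS

def attribute_to_key(value, rng=random):
    """Map a measured attribute value to a DHT key: value bits followed by random bits."""
    if not 0 <= value < (1 << VALUE_BITS):
        raise ValueError("value out of range for the chosen value-bit width")
    return (value << RANDOM_BITS) | rng.getrandbits(RANDOM_BITS)

key = attribute_to_key(86)           # CPU = 86 MFLOPS
print(hex(key))                      # e.g. 0x56xxxxxx, matching the 0x563A81D9 example in Fig. 2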
3.2.2. Constraints imposed by jobs on resources. Next, we focus on how to choose d resources that satisfy a job's requirements, taking Fig. 3 as an example. Suppose that a job requires a resource with CPU >= 100 (MFLOPS) and MEM >= 700 (MB). We query resources on the DHT instance corresponding to the CPU attribute. (Note that performing this query on multiple DHT instances simultaneously is a big challenge, because the intersection of the query results may be a null set.)

We first generate d queries, each with a random CPU_Query_Key in the interval [0x64000000, 0xFFFFFFFF]. All d queries are then launched on the corresponding DHT instance. With the help of the Chord routing algorithm, each query reaches the DHT node whose key space contains its CPU_Query_Key. That DHT node then checks its stored data for a valid resource satisfying the job requirements. If there is one, it sends the valid resource to the querying node for further use; if not, it alters the CPU_Query_Key to fit the key space of its successor and forwards the query to its successor. This process iterates until a valid resource is returned or the query reaches the first node in the DHT key space (the one having the minimum node key), for example Node A in Fig. 3. Therefore, all resources returned by the d queries, if not nil, satisfy the job requirements.

3.2.3. Probability of choosing a resource. From the point of view of load balance, the number of jobs assigned to a resource should be proportional to its capability when the system is moderately or heavily loaded, whereas when the system is lightly loaded, users favor fast machines for their jobs. To achieve both goals, intuitively, we should appropriately increase the probability of choosing high-capability resources. Two existing techniques serve this purpose: the virtual servers used for DHT load balance[23], and data replication.

In [23], each host acts as one or several virtual servers, each behaving as an independent peer in the DHT. The host's load is thus determined by the sum of its virtual servers' loads. By adjusting the number of working virtual servers (for example, making the number of virtual servers of a host proportional to its capacity), DHT load balance among all hosts can be achieved. From the viewpoint of our algorithm, however, one of its disadvantages is that it requires the DHT middleware itself to support this technique; some DHT middleware can, but certainly not all. In order to make our system run over different platforms with flexible parameter settings, we also adopt an application-layer method.

Data replication has been used extensively in wide-area distributed systems (content distribution networks). It aims at decreasing the access load on hot spots and increasing access QoS. Here, we instead use data replication to appropriately increase the access to high-capability resources. To preserve the constraints of jobs on resources, we place the replicas of a resource on DHT nodes whose node keys are not greater than the maximum key corresponding to the capability of that resource. Taking Fig. 3 as an example, suppose we create a replica for Resource C. The CPU_Key of this replica should lie in the interval [0x00000000, 0xDCFFFFFF], and hence the replica is stored on a DHT node whose node key is not greater than 0xDCFFFFFF. Therefore, any query reaching this DHT node can possibly obtain Resource C, because with high probability the CPU capability of Resource C is greater than that required by the query. The number of replicas of a resource, as well as the distribution of their keys, affects the probability of choosing this resource; as the number increases, the possibility of herd behavior increases too. How to determine and adjust the number of replicas and the distribution of their keys is an open issue in this scheduling application.

In the above, several techniques have been introduced to prevent herd behavior in grid scheduling. Firstly, the balls and bins model is a powerful tool for this problem. Secondly, we employ DHT and data replication to bridge the gap between the homogeneity of the model and the heterogeneity of the grid environment. We believe that the d-random strategy, combined with the quick search of the DHT, can provide a good system for grid job scheduling.

4. Our implementation

In this section, we describe our system architecture and the related algorithms. Each resource acts as a scheduling instance and independently performs job scheduling. Fig. 4 shows the modules in a DHT node.

Fig. 4. Modules in a DHT node (job proxy, scheduling, resource index and resource update modules, built on the Chord DHT module over TCP/IP).

When a resource joins the system, it first joins the Chord DHT (the algorithm for joining Chord can be found in [22]). Then it places a certain number of resource replicas on the DHT and updates these replicas periodically to keep them alive, using the algorithm shown in Fig. 5. The resource update module in Fig. 4 is responsible for collecting the information of the local resource and for generating and updating replicas. The resource index module uses a list to store the replicas which other resources place on the DHT.

The scheduling process includes the following six steps.

STEP 1: The job proxy module receives a job from a user, extracts the job's running requirements and performance model, and requests scheduling with these requirements and the performance model.

STEP 2: The scheduling module generates d queries according to the job requirements and places them on the DHT. The job requirements and the performance model are included in each of the d queries. Here, d is a system parameter that is predefined or dynamically adjusted according to the system utilization rate. The d query keys must not be less than a certain minimum value, to preserve the job constraints as stated in Section 3.2.2 (the key arithmetic is sketched after this list), and obey a uniform or non-uniform, independent or dependent distribution between this minimum value and the maximum key of the DHT key space.

STEP 3: With the help of the Chord routing algorithm, these queries reach their destination DHT nodes respectively. In the destination node, the query is processed by the resource index module.

STEP 4: The resource index module checks its stored resource list for a resource satisfying the job's running requirements. If there is none, the query is forwarded to the successor node (until it reaches the first node of the DHT key space, as stated in Section 3.2.2) and the process continues from STEP 3. If there are multiple resources satisfying the requirements, a competition algorithm is performed to return one of the valid resources. The valid resource is then returned to the source node of the query.

STEP 5: When a query result returns, we use one more hop to collect the up-to-date information of the resource indicated in the returned query result.

STEP 6: When all d queries have returned, the resource with the best rank among the d returned resources, according to the performance model, is chosen and reported to the job proxy module. The job proxy module then launches the job on that resource.
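The following is a sketch of ours showing the key arithmetic used in STEP 2 and in the replica placement of Section 3.2.3, under the same assumed 8/24-bit key split as above: a job requirement sets the lower bound of the query-key interval, and a resource's capability sets the upper bound of its replicas' keys.

import random

VALUE_BITS, RANDOM_BITS = 8, 24          # assumed key layout, as in the Fig. 2 example

def query_keys(min_cpu, d, rng=random):
    """d random query keys, all at least (min_cpu << RANDOM_BITS), as in STEP 2."""
    low = min_cpu << RANDOM_BITS          # CPU >= 100 gives low = 0x64000000
    high = (1 << (VALUE_BITS + RANDOM_BITS)) - 1
    return [rng.randint(low, high) for _ in range(d)]

def replica_key(capability, rng=random):
    """A replica key not exceeding the resource's capability range (Section 3.2.3)."""
    high = (capability << RANDOM_BITS) | ((1 << RANDOM_BITS) - 1)  # CPU 220 -> 0xDCFFFFFF
    return rng.randint(0, high)

print([hex(k) for k in query_keys(min_cpu=100, d=3)])
print(hex(replica_key(capability=220)))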
Fig. 5 lists the replica creation algorithm used by the resource update module.

FUNCTION Replica_Creation_Algorithm(r)
// Number of replicas = Capability / r, except for r = 0
BEGIN
    IF (r == 0) THEN RETURN;
    Local_Replica_List.set_nil();
    rep_num   = local_resource.capability DIV r;
    remainder = local_resource.capability MOD r;
    IF (RANDOM(r) <= remainder) THEN rep_num = rep_num + 1;
    FOR (i = 0; i < rep_num; i++)
    BEGIN
        NEW replica;
        replica.key      = HASH_REPLICA(local_resource.CPU);
        replica.resource = local_resource;
        PLACE_DHT_ITEM(replica);              // place the replica on the DHT
        Local_Replica_List.insert(replica);   // kept for later updates
    END
END

Fig. 5. Replica creation algorithm.

5. Experiment

A performance model, which can be expressed for example as a ClassAd, is widely used to rank candidate resources when scheduling. In this paper, we use only one resource attribute (CPU capability, expressed as an integer meaning, for example, the number of millions of floating-point operations processed per second) to simulate the performance of our scheduling system. Figueira[24] modeled the effect of contention on a single-processor machine, giving the performance model PM = CPU_capability / (1 + load), where load is the number of jobs being executed on the single-processor machine.

All algorithms use the one-more-hop, as mentioned in Section 4, to get up-to-date resource information and reduce the herd effect, which means that the information update interval T is set to zero. Nevertheless, the information collected with the one-more-hop is still "stale" due to network delays, as the experiment results show later.

5.1. Environment

First we introduce our simulation tools. BRITE[25] (the Boston university Representative Internet Topology gEnerator) is used to generate network topologies, and NS2[26] (the Network Simulator) is used to simulate network behavior.

5.1.1. BRITE. Based on an examination of the origin of the power laws in Internet topologies shown by empirical studies, Medina[25] built a topology generator named BRITE that does a fairly good job of reproducing some properties of Internet topologies. We use BRITE to generate a top-down network with 1000 nodes. These 1000 nodes act as 1000 resources, each scheduling its jobs independently. Using a normal distribution with mean 128 and standard deviation 32, we assign a stochastic value to the capability of each resource.

5.1.2. NS2. NS2 is a discrete event simulator targeted at networking research. It provides substantial support for simulation of TCP, routing, and multicast protocols over wired and wireless (local and satellite) networks. We use the release ns-allinone-2.28, and program and simulate our system on top of NS2. We run our experiments on a Dawning 4000A server with 20 nodes, each having dual 1.8 GHz AMD Opteron processors and 4.0 GB of memory. The OS is SuSE Linux Enterprise Server 9.0 with kernel 2.6.5-7.97-smp.

5.2. Data and analysis

In this paper, we assume that the job size is fixed and that the execution time of such a job on a resource with 128 MFLOPS capability is 30 minutes. The minimum CPU capability required by a job is a stochastic value uniformly distributed in [0, 200]. Moreover, we suppose that the arriving jobs form a Poisson process at every node. Note that ρ = 0.1 denotes that every node receives 0.2 jobs per hour; similarly, ρ = 0.3 denotes 0.6 jobs per hour, and ρ = 0.5 denotes one job per hour.

5.2.1. Algorithm comparison. We compare our algorithm with greedy algorithms, which schedule a job to the resource with the best rank among all resources according to the performance model, and which are used by many projects as stated in the related work. Our scheduling system is affected mainly by three factors: the number of choices (i.e. d), the capability distribution of resources, and the number of replicas together with the distribution of their keys. The replica policy is listed in Fig. 5; we use the parameter r to control the number of replicas that a resource generates, in order to study the relation between replicas and scheduling performance. All replica keys obey the uniform distribution. The d-choice algorithm is the six-step procedure of Section 4. The competition algorithm, used to return one of the valid resources from the resource list in a DHT node as mentioned in STEP 4, is listed in Fig. 6. In addition, in order to reduce herd behavior in the baseline, we also employ a stochastic greedy algorithm (Fig. 8).

Fig. 9.a shows the results of the common greedy algorithm as listed in Fig. 7. In this figure, almost all points congregate into three parallel bands. The greedy algorithm always chooses the fastest machine when scheduling; however, for most jobs (Job ID < 7500) there is a blank area at the bottom of the figure (turn-around time < 1100 sec.), because the nodes with high capability were assigned multiple jobs at the same time, which made the turn-around times of these jobs two, three, or even more times longer than expected. This figure demonstrates that the common greedy algorithm causes heavy herd behavior in a grid environment. Fig. 9.b shows the results of the stochastic greedy algorithm as listed in Fig. 8. In this figure, herding happens only occasionally (only a small number of points congregate around 2400 sec.). The results reveal that the stochastic method can greatly reduce herd occurrences and decrease the average turn-around time of jobs: compared with Fig. 9.a, the average turn-around time drops from 1969.76 sec. to 1311.37 sec., i.e. to 66.58% of the former. Thus, the improved greedy algorithm yields a dramatic improvement in performance. Fig. 9.c depicts the results of the d-choice algorithm. In this figure, we can hardly find any herd occurrences. This algorithm further reduces the average turn-around time from 1311.37 sec. to 1296.68 sec., with much lower overheads, as shown in Tab. 1 and Tab. 2. From the above results, we see that the d-choice algorithm is much better than the common greedy algorithm, and obtains a slight performance improvement over the stochastic greedy algorithm while incurring much lower overheads. Next, we investigate which factors affect our algorithm and how.

Fig. 9. Detailed turn-around time for each job when ρ = 0.1; the X-axis is the jobs in submission order. Panels: (a) the common greedy algorithm, (b) the stochastic greedy algorithm, (c) the d-choice algorithm with d = 15, r = 128.

FUNCTION Competition_Algorithm()
BEGIN
    FOR each resource satisfying the job requirements:
        calculate its PM according to the performance model;
    RETURN the resource having MAX{RANDOM(PM)};
END

Fig. 6. Competition algorithm used in the resource index module.

FUNCTION Greedy_Algorithm()
BEGIN
    collect all resources that satisfy the job requirements;
    calculate these resources' PMs according to the performance model;
    RETURN the resource having MAX{PM};
END

Fig. 7. Common greedy algorithm.

FUNCTION Stochastic_Greedy_Algorithm()
BEGIN
    collect all resources that satisfy the job requirements;
    calculate these resources' PMs according to the performance model;
    RETURN the resource having MAX{RANDOM(PM)};
END

Fig. 8. Stochastic greedy algorithm.
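For reference, the following is a compact executable rendering of the selection policies in Figs. 6-8 using the performance model PM = CPU_capability / (1 + load); it is a sketch of ours, not the NS2 code used in the paper, and it reads RANDOM(PM) as a uniform draw in [0, PM], which is one plausible interpretation of the pseudocode.

import random

def pm(resource):
    """Performance model of [24]: capability divided by (1 + current load)."""
    return resource["cpu"] / (1.0 + resource["load"])

def greedy(candidates):
    return max(candidates, key=pm)                                 # Fig. 7

def stochastic_greedy(candidates, rng=random):
    return max(candidates, key=lambda r: rng.uniform(0, pm(r)))    # Figs. 6 and 8

resources = [{"cpu": 220, "load": 3}, {"cpu": 135, "load": 0}, {"cpu": 86, "load": 0}]
print(greedy(resources))              # deterministic: every instance picks the same resource
print(stochastic_greedy(resources))   # randomized rank biased toward high PM breaks the symmetry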
5.2.2. Performance of the d-choice algorithm. In the following figures, which correspond to various parameter settings, we report the average turn-around time rather than the detailed turn-around time of each job, to make the results clearer.

Fig. 10 shows the average turn-around times under different parameter settings (parameters d and r). It is clear that the replica policy does improve the performance of the scheduling system, because every d-choice strategy with replicas (r < ∞) outperforms the one without replicas (r = ∞), even when the number of replicas is small (r = 128). On the other hand, as the parameter d increases, the average turn-around time steadily decreases. When d reaches 15, the d-choice strategy outperforms the stochastic greedy algorithm in all cases: in the heavily loaded case (ρ = 0.5), a maximum improvement of 12.71% is achieved by the d-choice strategy with d = 15 and r = 8, and even in the lightly loaded case (ρ = 0.1) an improvement of 1.07% is achieved with d = 15 and r = 128. The figures also show that, as d increases, the average turn-around time decreases only slowly. That is to say, the parameter d might be related to, for example, the number of resources, and we deduce that a very large d is of no further use in improving scheduling performance. Compared to d, the parameter r exerts less influence on performance. What surprised us most is that increasing the number of replicas has only a small impact on the job turn-around time, possibly because we use a stochastic algorithm to select a resource from the resource list in a DHT node. In any case, the existence of replicas does improve system performance.

Fig. 10. Average job turn-around time under different settings of d and r, for (a) ρ = 0.1, (b) ρ = 0.3 and (c) ρ = 0.5. The dashed line stands for the average job turn-around time of the stochastic greedy algorithm, which is not affected by the replica policy; dots under this dashed line indicate that the corresponding algorithms obtain better performance than the stochastic greedy algorithm.

Tab. 1 shows the number of query messages processed by the DHT. For the d-choice strategy, the number of query messages is about twice the theoretical number (d * log(n)) * NumOfJobs / n because, as analyzed above, the resource lists in some DHT nodes may be nil, which causes additional query messages to be forwarded along the DHT ring until a valid resource is found or the query message reaches the first node of the DHT key space. Nevertheless, it is less than that of the stochastic greedy algorithm, which collects all resources to perform scheduling. Most importantly, as Tab. 2 shows, the waiting (query) time of the d-choice strategy is far less than that of the stochastic greedy algorithm, only about 12% of the latter (2 sec. vs. 17 sec.). For a large-scale network, this advantage in both DHT query load and scheduling waiting (query) time will become even more significant. These two tables also show that the overheads of the algorithms are hardly affected by ρ, because they depend on the query processes rather than on the system load.

Tab. 1. Avg. query load of a DHT node
ρ                  0.1    0.3    0.5
Stochastic greedy  6050   6050   6059
d=6,  r=8          1324   1327   1322
d=9,  r=8          2047   2021   2023
d=12, r=8          2747   2725   2717
d=14, r=8          3196   3193   3184
d=15, r=8          3418   3427   3431

Tab. 2. Avg. waiting (query) time (sec.)
ρ                  0.1    0.3    0.5
Stochastic greedy  16.67  16.68  16.68
d=6,  r=8          1.33   1.33   1.33
d=9,  r=8          1.64   1.61   1.61
d=12, r=8          1.82   1.81   1.81
d=14, r=8          1.92   1.92   1.92
d=15, r=8          1.96   1.96   1.97

From Tab. 3 it can also be seen that herd behavior is effectively prevented by the d-choice strategy, because the number of its unexpected competitions stays stable as ρ increases, while the stochastic greedy algorithm suffers greatly from herding (its number of unexpected competitions grows from 249 to 1191). In fact, herd behavior is not a direct metric for evaluating a system, but it affects both the job turn-around time and its standard deviation. From Tab. 3, we can see that the number of unexpected competitions caused by the stochastic greedy algorithm is almost proportional to ρ. When the percentage of herd jobs (jobs with an unexpected competition) over all jobs is small (2.5% for ρ = 0.1), herding has very little impact on the average turn-around time; when the percentage increases to 11.9% (ρ = 0.5), however, the impact on the average turn-around time can be seen clearly in Fig. 10.c. On the other hand, Tab. 4 shows the standard deviation of the job turn-around times for the algorithms. The standard deviation indicates how widely the job turn-around times are dispersed; a system with a small standard deviation promises more stable performance than one with a large standard deviation. The results in Tab. 4 show that our algorithm is superior to the stochastic greedy algorithm.

Tab. 3. Number of unexpected competitions*
ρ                  0.1    0.3    0.5
Stochastic greedy  249    782    1191
d=6,  r=8          5      9      7
d=9,  r=8          17     19     17
d=12, r=8          21     24     31
d=14, r=8          29     30     36
d=15, r=8          29     36     33

* For a job, if the submission load (the resource load at the time the job is submitted to the resource for running) is greater than the query load (the load used to make the resource selection when scheduling), we increase the number of unexpected competitions by one and call this job a herd job.

Tab. 4. Standard deviation of job turn-around time
ρ                  0.1    0.3    0.5
Stochastic greedy  288    570    863
d=6,  r=8          318    473    906
d=9,  r=8          230    395    663
d=12, r=8          209    364    655
d=14, r=8          202    347    650
d=15, r=8          199    357    617

5.3. Discussion

Fig. 12 shows the cumulative distribution of jobs on resources. It can be seen that the stochastic greedy algorithm allocates all jobs to only a small fraction of the resources with high capability, which leads to a total of 249 unexpected competitions in the lightly loaded case and 1191 in the heavily loaded case, and thus degrades system performance. The d-choice algorithm, on the other hand, disperses jobs over a wide range of resources, and the smaller the parameter d is, the wider that range is.

Fig. 12. Cumulative allocation of jobs on resources for (a) ρ = 0.1, (b) ρ = 0.3 and (c) ρ = 0.5. It shows the number of jobs allocated to resources ordered by descending capability. For example, panel (a) shows that all jobs are allocated to resources whose aggregate capability is the first 21.28%, 52.34%, and 85.41% of the total resource capability for the stochastic greedy algorithm, the d-choice strategy with d = 15, r = 8, and the d-choice strategy with d = 6, r = 8, respectively.
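The following is a sketch of ours (not the paper's analysis code) showing how a cumulative-allocation statistic like the one summarized in the Fig. 12 caption can be computed from a simulation log: the fraction of total capability, taken in descending capability order, that received all of the jobs.

def used_capability_fraction(resources, jobs_per_resource):
    """Fraction of total capability covered by the busiest prefix of resources.

    resources: dict resource_id -> capability; jobs_per_resource: dict resource_id -> job count.
    Resources are scanned in descending capability order until every job is covered.
    """
    total_cap = sum(resources.values())
    remaining_jobs = sum(jobs_per_resource.values())
    used_cap = 0
    for rid in sorted(resources, key=resources.get, reverse=True):
        if remaining_jobs <= 0:
            break
        used_cap += resources[rid]
        remaining_jobs -= jobs_per_resource.get(rid, 0)
    return used_cap / total_cap

caps = {"a": 220, "b": 135, "c": 120, "d": 86}
jobs = {"a": 5, "b": 3}                                  # only the two fastest resources got jobs
print(round(used_capability_fraction(caps, jobs), 4))    # 0.6328, i.e. (220 + 135) / 561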
From the above experimental results, it is clear that scheduling jobs only onto high-capability resources can cause herd behavior and thus degrade system performance. On the other hand, spreading jobs too widely over heterogeneous resources can also degrade performance, because low-capability resources are allocated jobs while high-capability resources remain idle. It is therefore important to adjust the parameters d and r to trade off overburdening (herd behavior) against the insufficient use of high-capability resources. Our experiments show that d = 15 is, in all three cases (light, medium, and heavy load), a good tradeoff that both prevents herd behavior and provides high scheduling performance.

6. Conclusion

On the basis of the theoretical results of the homogeneous balls and bins model, we proposed a novel stochastic algorithm and employed it in the grid environment. Our analysis and experiments demonstrate that the multi-choice strategy, combined with the advantages of DHT, can prevent herd behavior in a large-scale sharing environment such as the grid, thereby providing better scheduling performance with much less overhead than the common greedy algorithm and the stochastic greedy algorithm. The balls and bins model has been intensively investigated in the past, and many of its powerful theoretical results can be used to further improve this system; better performance may therefore be achievable in the future.

Acknowledgement

We would like to thank the anonymous reviewers who gave us much valuable advice for revising this manuscript. This work was supported in part by the National Natural Science Foundation of China under grants 90412010 and 90412013, China's National Basic Research and Development 973 Program (2003CB317008, 2005CB321807), and the Theory Foundation of ICT (No. 20056130).

References

[1] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, D. Reed, L. Torczon, and R. Wolski, "The GrADS Project: Software Support for High-Level Grid Application Development," International Journal of High-Performance Computing Applications, 15(4), Winter 2001.
[2] M. Sato, H. Nakada, S. Sekiguchi, S. Matsuoka, U. Nagashima, and H. Takagi, "Ninf: A Network Based Information Library for a Global World-Wide Computing Infrastructure," Proc. of HPCN'97 (LNCS 1225), pp. 491-502, 1997.
[3] M. Mitzenmacher, "How Useful Is Old Information?" IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 1, pp. 6-20, Jan. 2000.
[4] W. Cirne and F. Berman, "When the Herd Is Smart: The Aggregate Behavior in the Selection of Job Request," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 2, pp. 181-192, Feb. 2003.
[5] M. J. Litzkow, M. Livny, and M. W. Mutka, "Condor - A Hunter of Idle Workstations," Proc. 8th International Conference on Distributed Computing Systems (ICDCS 1988), pp. 104-111, San Jose, CA, 1988.
[6] R. Raman, M. Livny, and M. Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing," HPDC-7, pp. 140-146, Chicago, IL, 1998.
[7] R. Raman, M. Livny, and M. Solomon, "Resource Management through Multilateral Matchmaking," HPDC-9, pp. 290-291, Pittsburgh, PA, 2000.
[8] D. H. J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters," Future Generation Computer Systems, 12, 1996.
[9] A. R. Butt, R. Zhang, and Y. C. Hu, "A Self-Organizing Flock of Condors," ACM/IEEE SuperComputing '03, 2003.
[10] D. Zagorodnov, F. Berman, and R. Wolski, "Application Scheduling on the Information Power Grid," International Journal of High-Performance Computing, 1998.
[11] H. Casanova, A. Legrand, D. Zagorodnov, et al., "Using Simulation to Evaluate Scheduling Heuristics for a Class of Applications in Grid Environments," Ecole Normale Superieure de Lyon, RR1999-46, 1999.
[12] J. M. Schopf and F. Berman, "Stochastic Scheduling," ACM/IEEE SuperComputing '99, 1999.
[13] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal, "Balanced Allocations," Proc. 26th Annual ACM Symposium on the Theory of Computing (STOC 94), pp. 593-602, 1994.
[14] B. Vöcking, "How Asymmetry Helps Load Balancing," Journal of the ACM, vol. 50, no. 4, pp. 568-589, 2003.
[15] B. Vöcking, "Symmetric vs. Asymmetric Multiple-Choice Algorithms (invited paper)," Proc. of the 2nd ARACNE Workshop, Aarhus, 2001, pp. 7-15.
[16] J. Ledlie and M. Seltzer, "Distributed, Secure Load Balancing with Skew, Heterogeneity, and Churn," IEEE INFOCOM 2005, March 2005.
[17] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman, "Grid Information Services for Distributed Resource Sharing," HPDC-10, August 2001.
[18] The Globus Alliance, http://www.globus.org/
[19] D. Oppenheimer, J. Albrecht, D. Patterson, and A. Vahdat, "Scalable Wide-Area Resource Discovery," UC Berkeley Technical Report UCB//CSD-04-1334, July 2004.
[20] PlanetLab, http://www.planet-lab.org/
[21] S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz, "Handling Churn in a DHT," Proc. of the USENIX Annual Technical Conference, 2004.
[22] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications," ACM SIGCOMM 2001, San Diego, CA, August 2001, pp. 149-160.
[23] P. B. Godfrey and I. Stoica, "Heterogeneity and Load Balance in Distributed Hash Tables," IEEE INFOCOM 2005, March 2005.
[24] S. M. Figueira and F. Berman, "Modeling the Effects of Contention on the Performance of Heterogeneous Applications," HPDC, 1996.
[25] A. Medina, A. Lakhina, I. Matta, and J. Byers, "BRITE: An Approach to Universal Topology Generation," Proc. of the International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS '01), Cincinnati, Ohio, Aug. 2001.
[26] The Network Simulator NS2, http://www.isi.edu/nsnam/ns/