Deadline-aware Job and Task Scheduling in Cloud Environment
Data-intensive jobs, e.g., financial analysis and log mining, need to process large amounts of data and are well suited to cloud environments, which provide computing and storage resources by connecting thousands of servers. A job splits a dataset into multiple data chunks, distributes the chunks to different machines, and then launches tasks to process the chunks on those machines. Multiple jobs may run simultaneously in a cloud computing system, and schedulers, e.g., FIFO, Fair, and Capacity, allocate resources among them. However, most of these schedulers do not take job deadlines into account. Users submit jobs to a cloud service provider and expect the provider to meet their Service Level Agreements (SLAs), e.g., job deadlines. To meet these deadlines, the service provider allocates different resources to different jobs. This dissertation provides job and task scheduling algorithms that meet job deadlines under different constraints.

The first part of this dissertation considers cloud right-sizing with execution deadlines and data locality constraints. Processing and analyzing data within given deadlines has become increasingly important for meeting SLAs. In addition, to improve data access efficiency and task throughput, data locality is often maximized by assigning tasks only to nodes that contain their input data. We present a novel framework, CRED, which jointly optimizes data placement and task scheduling in data centers, with the aim of minimizing the number of nodes needed while meeting users’ SLA requirements. We formulate CRED as an integer optimization problem and present a heuristic algorithm with provable performance guarantees to solve it. Competitive ratios of the proposed algorithm are quantified in closed form for arbitrary task parameters and cloud configurations.
We also extend our work to obtain a resilient solution, which allows successful recovery at run time from any single node failure and is guaranteed to meet both deadline and locality constraints.

The second part of this dissertation presents Chronos, a quantitative framework that optimizes speculative execution of tasks to offer guaranteed SLAs for meeting application deadlines. Chronos brings several speculative scheduling strategies together under a unifying optimization framework and defines a new metric, Probability of Completion before Deadlines (PoCD), to measure the probability that jobs meet their desired deadlines. We systematically analyze popular strategies, including Clone, Speculative-Restart, and Speculative-Resume, and quantify their PoCD in closed form. The result illuminates an important tradeoff between PoCD and the cost of speculative execution, measured by the total (virtual) machine time required under different strategies. We formulate an optimization problem that jointly optimizes PoCD and execution cost, and develop an algorithmic solution that is guaranteed to be optimal.

The third part of this dissertation presents LASER, a deep learning approach for speculative execution and replication of deadline-critical jobs. Machine learning has been successfully used to solve a large variety of classification and prediction problems. In particular, the deep neural network (DNN), consisting of multiple hidden layers of units between input and output layers, can provide more accurate regression (prediction) than traditional machine learning algorithms. We compare LASER with SRQuant, a speculative-resume strategy based on quantitative analysis. Both scheduling algorithms aim to improve PoCD and reduce the cost of speculative execution, as measured by the total (virtual) machine time.

The final part of this dissertation presents a new framework for scheduling machine learning (ML) jobs.
ML job scheduling introduces a new challenge: the tasks of a job start at the same time and complete at the same time. Our framework schedules jobs in a time-slotted manner. In each time slot, it employs Reinforcement Learning to determine whether jobs should be scheduled, and then schedules jobs one by one in an order obtained using Bayesian Optimization.
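
The tradeoff between PoCD and machine-time cost described for Chronos can be illustrated with a small sketch. This is not the dissertation's model or algorithm. It assumes a simplified clone-style setup in which every task runs a fixed number of independent copies, each copy's running time is Pareto distributed, a task finishes when its fastest copy finishes, and the remaining copies are stopped at that moment. All function names, parameters, and the choice of distribution are hypothetical. Under these assumptions the PoCD of a job with N tasks, r copies per task, minimum running time t_min, shape beta, and deadline D > t_min is (1 - (t_min / D)^(beta * r))^N, which the simulation below can be checked against.

```python
import random


def pocd_closed_form(n_tasks, copies, t_min, beta, deadline):
    """PoCD under the assumed model: each task runs `copies` independent
    attempts with Pareto(t_min, beta) running times, and the job meets its
    deadline only if every task has at least one attempt that finishes in time."""
    if deadline <= t_min:
        return 0.0
    miss_one_attempt = (t_min / deadline) ** beta  # P(single attempt > deadline)
    return (1.0 - miss_one_attempt ** copies) ** n_tasks


def simulate(n_tasks, copies, t_min, beta, deadline, trials=20000, seed=0):
    """Monte Carlo estimate of (PoCD, mean machine time per job).

    Assumption: when the fastest copy of a task finishes, the other copies
    are stopped, so the task consumes copies * fastest units of machine time."""
    rng = random.Random(seed)
    met = 0
    machine_time = 0.0
    for _ in range(trials):
        job_finish = 0.0
        for _ in range(n_tasks):
            # random.paretovariate(beta) has minimum 1, so scale by t_min.
            fastest = min(t_min * rng.paretovariate(beta) for _ in range(copies))
            job_finish = max(job_finish, fastest)
            machine_time += copies * fastest
        if job_finish <= deadline:
            met += 1
    return met / trials, machine_time / trials


if __name__ == "__main__":
    for r in (1, 2, 3):
        pocd, cost = simulate(n_tasks=10, copies=r, t_min=1.0, beta=3.0, deadline=3.0)
        exact = pocd_closed_form(10, r, 1.0, 3.0, 3.0)
        print(f"copies={r}  PoCD(sim)={pocd:.3f}  PoCD(exact)={exact:.3f}  cost={cost:.2f}")
```

In this toy setting, adding copies raises PoCD but also raises the total machine time, which is the kind of tradeoff a joint optimization of PoCD and execution cost has to balance.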