Fault Tolerance in Checkpointing Approach
Today there is demand for a highly secure virtual grid in which any resource of any cluster can be shared, even in the presence of a failure in the system. Grid computing is a distributed computing paradigm that differs from traditional distributed computing in that it addresses large-scale systems that span organizational boundaries. Reliability issues arise from the unreliable nature of the network infrastructure, in addition to the challenges of managing and scheduling grid applications. A failure may be caused by a link failure, a resource failure, or any other reason, and it must be tolerated so that the system keeps operating smoothly and accurately without interrupting the work in progress. Many techniques are used to detect and recover from these faults. A proper fault detector can avoid the loss caused by a system failure, and a reliable fault tolerance technique can avoid the system failure itself. Fault tolerance is therefore an important property for achieving reliability, availability, and quality of service (QoS). The fault tolerance mechanism used here sets task checkpoints based on the resource failure rate. In the event of a resource failure, the job is restarted from its last successful state on another grid resource, using a checkpoint file. Selecting optimal checkpoint intervals for an application is important to minimize its execution time in the presence of system failures. In case of resource failure, the fault index based rescheduling algorithm reschedules the job from the failed resource to another available resource with the lowest fault index value and executes the job from the most recently saved checkpoint. This ensures that the job is completed on time with increased throughput and helps make the grid environment reliable.
Grid computing refers to the aggregation of computing resources from multiple administrative domains to achieve a common goal. A grid can be considered a distributed system with non-interactive workloads that involve a large number of files. It is more common for a single grid to be used for a variety of purposes, although a grid may be dedicated to a specialized application. Grids are often built using general-purpose grid software libraries called middleware. The grid makes possible the sharing, selection, and aggregation of a wide variety of geographically distributed resources, including supercomputers, storage systems, data sources, and specialized devices belonging to different organizations. Managing these resources is an important part of the infrastructure in a grid computing environment.

Because resources are geographically distributed, fault tolerance is of fundamental importance for realizing the promising potential of computing grids. Moreover, the probability of resource failure is much greater than in traditional parallel computing, and a resource failure fatally affects task execution. Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults, which makes the system more reliable. The fault tolerance service is essential for meeting QoS requirements in grid computing and addresses different types of resource failures, including process failures, processor failures, and network failures. The checkpoint interval, or application health monitoring period, is one of the important parameters of a checkpointing system that provides fault tolerance. Smaller checkpoint intervals increase application execution overhead due to checkpointing, while longer checkpoint intervals increase failure recovery times.
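
The trade-off between checkpoint overhead and recovery time can be made concrete with a small sketch. The essay does not say how the interval is computed, so the code below substitutes Young's first-order approximation, a well-known rule of thumb that is not taken from this essay: the interval is roughly sqrt(2 x checkpoint cost x mean time between failures). The function names and the numbers are illustrative assumptions, and the model ignores restart time and assumes the checkpoint cost is small compared with the time between failures.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost: time needed to write one checkpoint (seconds)
    mtbf: mean time between failures of the resource (seconds)
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of run time that is lost.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework, about half an interval per failure
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # illustrative values only, not from the essay
    best = young_interval(cost, mtbf)
    for t in (250.0, best, 4000.0):
        print(f"interval={t:7.1f}s  overhead={overhead_fraction(t, cost, mtbf):.4f}")
```

With these assumed values, an interval that is too short wastes time on checkpoint writes and one that is too long wastes time on rework after a failure, and the estimate is lowest near the value returned by `young_interval`.
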
The optimal checkpoint intervals, those that lead to minimal application execution time in the event of a failure, should therefore be determined.

PROBLEMS:
1. If a failure occurs at one grid resource, the job is rescheduled to another resource, which ends up not satisfying the user's QoS requirements, i.e. the deadline. The reason is simple: because the job is rerun, it takes longer.
2. Some resources meet the deadline constraint but tend to exhibit faults in computational grid environments. In such a scenario, the grid scheduler still selects the same resource, simply because that resource promises to meet the requirements of the user's grid job. This ultimately compromises the user's QoS requirements in order to complete the job.
3. Even if there is a fault in the system, an ongoing task must be completed on time. Such a task is meaningless if it is not completed before its deadline. The major problem is therefore meeting deadlines in real time.
4. Availability: a real-time distributed system must keep end-to-end services available and be able to sustain outages or systematic attacks without impacting customers or operations.
5. Scalability: the system must be able to handle an increasing amount of work and to increase total throughput under increased load as resources are added.

REMEDIES:
An adaptive checkpointing fault tolerance approach is used to overcome the drawbacks listed above. In this approach, each resource maintains fault tolerance information. When a fault occurs, the resource updates its fault occurrence information. This information is then used when deciding which resources to allocate to a job. Checkpointing is one of the most popular techniques for providing fault tolerance on unreliable systems. It records a snapshot of the complete system state so that the application can be restarted after a crash occurs.
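
The snapshot-and-restart idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the mechanism of any particular grid middleware: the job is a toy summation, the state is a small dictionary saved with `pickle`, and the names `save_checkpoint`, `load_checkpoint`, `run_job`, and the `fail_at` parameter are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place,
    so a crash during the write cannot leave a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"next_step": 0, "total": 0}


def run_job(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical) simulates a resource failure at that step.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated resource failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        run_job(path, n_steps=100, interval=10, fail_at=57)
    except RuntimeError as err:
        print(err)
    print("resuming from step", load_checkpoint(path)["next_step"])
    print("result:", run_job(path, n_steps=100, interval=10))
```

In this run the simulated failure at step 57 loses only the work done since the checkpoint at step 50. The second call resumes from there and still produces the full sum, 4950.
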
Checkpoints can be stored on temporary or stable storage. However, the effectiveness of the mechanism depends strongly on the length of the checkpoint interval. Frequent checkpointing increases overhead, while lazy checkpointing can cause significant loss of computation. The choice of checkpoint interval size and checkpointing technique is therefore a complicated task and should be based on knowledge of the system as well as the application. Checkpoint recovery depends on the MTTR of the system. Typically, the state of an application is saved periodically to stable storage, usually a hard disk. After a crash, the application is restarted from the last checkpoint rather than from the beginning.

There are three checkpointing strategies: coordinated checkpointing, uncoordinated checkpointing, and communication-induced checkpointing.
1. In coordinated checkpointing, processes synchronize their checkpoints to ensure that their recorded states are consistent with each other, so that the overall combined state is also consistent.
2. In contrast, in uncoordinated checkpointing, processes schedule checkpoints independently at different times and do not take messages into account.
3. Communication-induced checkpointing attempts to coordinate only selected critical checkpoints.

CHECKPOINT MECHANISM:
A grid resource is a member of a grid and provides computing services to grid users. Grid users register with the grid's Grid Information Server (GIS) by specifying QoS requirements such as the deadline for completing execution, the number of processors, the operating system type, etc. The components used in the architecture are described below.

Scheduler - The scheduler is an important entity of the grid. It receives jobs from grid users. It selects the feasible resources for these jobs based on the information received from the GIS. It then generates job-to-resource mappings. When the schedule manager receives a grid job from a user, it obtains details of the available grid resources from the GIS.
It then passes the list of available resources to the entities of the MTTR scheduling strategy. The Matchmaker entity matches resources against job requirements. The Response Time Estimator entity estimates the response time of a job on each matched resource based on the transfer time, queue waiting time, and service time of the job. The Resource Selector selects the resource with the minimum response time. The Job Dispatcher dispatches jobs one by one to the checkpoint manager.

GIS - The GIS contains information about all available grid resources. It keeps resource details such as CPU speed, available memory, load, etc. All grid resources that join and leave the grid are monitored by the GIS. A scheduler consults the GIS for information about available grid resources whenever it has jobs to execute.

Checkpoint Manager - It receives the scheduled job from the scheduler and sets the checkpoint based on the failure rate of the resource on which the job is scheduled. It then submits the job to the resource. The checkpoint manager receives a job completion message or a job failure message from the grid resource and responds accordingly. If the job fails during execution, it is rescheduled from the last checkpoint instead of running from scratch.

Checkpoint Server - Job status is reported to the checkpoint server at each checkpoint set by the checkpoint manager. The checkpoint server records the job status and returns it on demand, i.e. when a job or resource fails. For a particular job, the checkpoint server discards the previous checkpoint result when a new checkpoint result is received.

Fault Index Manager - It maintains the fault index value of each resource, which indicates the failure rate of the resource. The fault index of a resource is incremented each time the resource fails to complete the job assigned to it.
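
The interaction between the Response Time Estimator, the Resource Selector, and the Fault Index Manager can be sketched as follows. This is a simplified illustration, not the essay's implementation: the `Resource` class, its field names, and the functions are hypothetical, and breaking fault index ties by response time is an added assumption that the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    transfer_time: float  # time to ship the job to the resource
    queue_wait: float     # time the job waits in the resource queue
    service_time: float   # time the resource needs to execute the job
    fault_index: int = 0  # incremented each time the resource fails a job

    def response_time(self) -> float:
        """Response time estimate: transfer + queue waiting + service time."""
        return self.transfer_time + self.queue_wait + self.service_time


def select_resource(resources):
    """Resource selector: pick the resource with the minimum response time."""
    return min(resources, key=lambda r: r.response_time())


def reschedule(resources, failed):
    """Pick the other resource with the lowest fault index.

    Ties are broken by response time (an assumption, not from the essay).
    """
    candidates = [r for r in resources if r is not failed]
    if not candidates:
        raise RuntimeError("no alternative resource available")
    return min(candidates, key=lambda r: (r.fault_index, r.response_time()))


def on_failure(resources, failed):
    """Fault index manager: record the failure, then choose a new resource."""
    failed.fault_index += 1
    return reschedule(resources, failed)


if __name__ == "__main__":
    grid = [
        Resource("r1", transfer_time=2, queue_wait=5, service_time=20),
        Resource("r2", transfer_time=1, queue_wait=3, service_time=30, fault_index=2),
        Resource("r3", transfer_time=4, queue_wait=10, service_time=25),
    ]
    first = select_resource(grid)
    print("initial choice:", first.name)
    print("after failure :", on_failure(grid, first).name)
```

In a fuller version, the job would resume on the newly chosen resource from the state held by the checkpoint server rather than from scratch.
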