The Sysadmin Notebook  

Service Operation

Supporting Business as Usual Activities

Service Operation is responsible for the day-to-day operation of IT Services. This involves carrying out activities to ensure that services are delivered to a specified agreed level and managing the technology used to deliver those services. In other words:

SFIA (Skills Framework for the Information Age) and RACI are tools that can be used to clarify roles and responsibilities in Service Operation. A RACI authority matrix is normally used during Service Design to assign processes and activities to defined roles. RACI is an acronym for the four levels of authority between an activity/process and a process role:

Responsible
responsible for carrying out the activity
Accountable
accountable for the activity. Only one person/role is assigned this authority and is referred to as the Process owner
Consulted
consulted regarding the progress of the activity
Informed
informed of progress of the activity
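
The rule that each activity has exactly one Accountable role can be checked mechanically. The following is a minimal Python sketch rather than an ITIL-prescribed format: the activities, role names and the `validate_raci` helper are hypothetical, and the extra check for at least one Responsible role is a common convention assumed here, not something stated above.

```python
from collections import Counter

# Hypothetical RACI matrix: activity -> {role: authority letter}.
# Activity and role names are illustrative only.
raci = {
    "Log incident": {
        "Service Desk Analyst": "R",
        "Incident Manager": "A",
        "User": "I",
    },
    "Approve workaround": {
        "Technical Management": "R",
        "Problem Manager": "A",
        "Service Desk Analyst": "C",
    },
}

def validate_raci(matrix):
    """Return a list of messages for activities that break the rules."""
    errors = []
    for activity, assignments in matrix.items():
        counts = Counter(assignments.values())
        # Rule from the text: only one role is Accountable.
        if counts["A"] != 1:
            errors.append(f"{activity}: expected one Accountable, found {counts['A']}")
        # Assumed convention: someone must actually carry out the activity.
        if counts["R"] < 1:
            errors.append(f"{activity}: no Responsible role assigned")
    return errors

print(validate_raci(raci))  # prints []
```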

Functions

Service Operations processes are usually delivered by four functions: Service Desk, Technical Management, Operations Management and Applications Management.

Service Desk

ITIL Best Practice demands a single point of contact for users in their communication with the IT Service Provider. A user is a person who uses a product or service. A customer is a person who negotiates for the provision of a product or service. The Service Desk is the main point of contact for a user with the IT Service Provider. Service Level Management is the main point of contact for a customer with the IT Service Provider.

The Service Desk must be aware of SLAs and their relation to user issues. The Service Desk acts on behalf of the user within the IT infrastructure, and the first duty of the Service Desk is to act as the user's friend. The Service Desk will be responsible for:

An effective Service Desk requires well-trained, motivated staff with good interpersonal skills. The systems and processes in use need to be well designed for recording and tracking incidents and for matching them to previous incidents. Technology will be used for call distribution and knowledge base access. Service Desk staff will need the appropriate level of technical competence to record, resolve and communicate issues.

The geographical positioning of the Service Desk may be one of:

Local
May lead to duplicating costs over multiple local Service Desks, adding overheads to maintain consistency between Service Desks and sharing of lessons learnt - this can be minimised by using centralised call-logging systems and CMDBs. Localised knowledge may be a key benefit of local Service Desks
Central
Loss of local knowledge and increased voice and data communications charges. Has the advantage of improving centralisation of information and knowledge and reduction in resource duplication
Virtual
Use of a single contact number, with calls routed by proximity or time of day
Follow the Sun
Allows use of cheaper daytime labour for out of hours support
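
Routing by time of day, as used by Virtual and Follow the Sun desks, can be sketched as a lookup against staffed hours. This is an illustration only: the desk names, the UTC windows and the `desk_for_hour` function are all assumptions.

```python
# Hypothetical desks with staffed hours in UTC (start inclusive, end exclusive).
DESKS = [
    ("Sydney", 22, 6),     # window wraps past midnight
    ("London", 6, 14),
    ("New York", 14, 22),
]

def desk_for_hour(hour_utc):
    """Return the name of the desk staffed at the given UTC hour (0-23)."""
    for name, start, end in DESKS:
        if start <= end:
            staffed = start <= hour_utc < end
        else:
            # The window crosses midnight, e.g. 22:00 to 06:00.
            staffed = hour_utc >= start or hour_utc < end
        if staffed:
            return name
    raise ValueError(f"no desk covers hour {hour_utc}")

print(desk_for_hour(3))   # prints Sydney
print(desk_for_hour(15))  # prints New York
```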

When a Service Desk is established, the various inputs must be anticipated and catered for: phone calls, email, faxes, web forms, voicemail and operational events. Automated responses can be configured for some of these, but should not detract from the Service Desk's responsibilities for monitoring and tracking issues.

Escalation Management is the process of moving an incident to the appropriate team for resolution. Functional escalation moves the issue to the relevant experts. Hierarchical escalation moves the issue up the management chain, either because the issue is very serious or because higher authority is needed to sanction additional resources. Explicit parameters are needed to govern hierarchical escalation.
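
The two escalation types can be made explicit in code. The sketch below is a hedged illustration: the `Incident` fields, the priority scale and the time limits in `HIERARCHICAL_LIMITS` are assumptions standing in for the explicit parameters an organisation would agree for itself.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    priority: int                       # assumed scale: 1 = most serious
    minutes_open: int
    desk_can_resolve: bool = False      # True if no expert team is needed
    needs_extra_resources: bool = False

# Assumed explicit parameters: maximum minutes an incident of each
# priority may stay open before it goes up the management chain.
HIERARCHICAL_LIMITS = {1: 30, 2: 120, 3: 480}

def escalations(incident):
    """Return the set of escalation types that apply to the incident."""
    result = set()
    if not incident.desk_can_resolve:
        result.add("functional")        # pass to the relevant experts
    limit = HIERARCHICAL_LIMITS.get(incident.priority)
    overdue = limit is not None and incident.minutes_open > limit
    if overdue or incident.needs_extra_resources:
        result.add("hierarchical")      # move up the management chain
    return result

print(sorted(escalations(Incident(priority=1, minutes_open=45))))
# prints ['functional', 'hierarchical']
```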

Setting the skill level of the Service Desk requires balancing the increased cost of skilled staff against the improved service provided. Achieving the optimum balance may require monitoring the percentage of calls closed by the Service Desk. The skill level may require a temporary uplift during Early Life Support (ELS) or major incidents. Improved processes can reduce the requirement for skilled staff and can include the use of automated scripts and better access to knowledge bases, change schedules and SLAs.
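
The percentage of calls closed by the Service Desk is a simple ratio. The sketch below assumes call counts are already available per week; the function name and the sample figures are invented for illustration.

```python
def first_line_resolution_rate(closed_by_desk, total_closed):
    """Percentage of closed calls that the Service Desk resolved itself."""
    if total_closed == 0:
        return 0.0   # no calls closed, so nothing to report
    return 100.0 * closed_by_desk / total_closed

# Hypothetical weekly figures: (calls closed by the desk, all calls closed).
weekly = {"week 1": (120, 200), "week 2": (150, 200)}
for week, (by_desk, total) in weekly.items():
    print(week, f"{first_line_resolution_rate(by_desk, total):.1f}%")
# prints:
# week 1 60.0%
# week 2 75.0%
```

Tracking this figure over time, rather than as a single snapshot, shows whether a change in staffing or process has moved the balance.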

The technology used by Service Desks will fall into two distinct classes:

Telephony
  • Automatic Call Distribution systems
  • Conference Calling
  • VOIP
Software
  • Intelligent Call-Logging Systems, that identify patterns and suggest solutions
  • CMDBs and Known Error Databases
  • Automatic Referral & Escalation
  • Tracking and Alerting tools

Total number of calls can be a misleading indicator of Service Desk performance, as the figure will be skewed during deployment of new or changed services. Preferred Service Desk metrics include:

Soft measures can also be used to measure Service Desk performance. These typically consist of surveys and interviews, conducted in person or via telephone, email or web forms, and can be directed at individuals or groups of users.

Technical Management

Technical Management provides the organisation with technical expertise and overall management of the IT Infrastructure technology. The objectives of Technical Management are:

Operations Management

IT Operations Management carries out the daily activities needed to manage the IT Infrastructure, according to defined performance standards.

Operations Management is sub-divided into Operations Control and Facilities Management. Operations Control ensures that routine operational tasks are carried out, such as job scheduling and backup and restore activities. Facilities Management is responsible for managing the buildings and environment that house CIs, and for logistical operations concerning CIs.

Applications Management

The application lifecycle closely matches the ITSM lifecycle, but is focussed on applications not services: Requirements, Design, Build and Deploy, Operate, Optimise. Application Management is responsible for applications through the whole Application Lifecycle.

Application Management is the custodian of the technical knowledge, skills and documentation related to the management of applications and is responsible for identifying and specifying application requirements. Application Management will provide resources to support applications through the ITSM lifecycle, and guidance to IT Operations on the on-going management of applications. Application Management objectives are:

Processes

Incident Management

An incident is

an unplanned interruption to an IT service or reduction in the quality of an IT service or a failure of a CI that has not yet impacted an IT service

The main goal of Incident Management is to restore normal service as quickly as possible and minimise the adverse impact of service incidents. It is not concerned with longer-term structural resolution of faults, which is the domain of Problem Management. Normal service operation is defined as a service that is within SLA limits. The incident lifecycle is:

  1. identification
  2. logging
  3. categorisation
  4. prioritisation
  5. initial diagnosis
  6. resolution and recovery
  7. incident closure
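
Because the lifecycle is a fixed sequence, it can be modelled as an ordered enumeration. This is a minimal sketch: the `Stage` and `next_stage` names are hypothetical, and real tooling would usually allow steps to be revisited.

```python
from enum import IntEnum

class Stage(IntEnum):
    """Incident lifecycle stages in the order listed above."""
    IDENTIFICATION = 1
    LOGGING = 2
    CATEGORISATION = 3
    PRIORITISATION = 4
    INITIAL_DIAGNOSIS = 5
    RESOLUTION_AND_RECOVERY = 6
    CLOSURE = 7

def next_stage(stage):
    """Return the stage that follows, or raise if the incident is closed."""
    if stage is Stage.CLOSURE:
        raise ValueError("incident is already closed")
    return Stage(stage + 1)

stage = Stage.IDENTIFICATION
while stage is not Stage.CLOSURE:
    stage = next_stage(stage)
print(stage.name)  # prints CLOSURE
```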

Recurring incidents may be raised with Problem Management via a Problem Record. Until a resolution is found, the recurring incident will be handled using an Incident Model, which details steps to be taken, staff involved, timescales, escalation procedures and evidence preservation activities. During the handling of an incident, Incident Management or the Service Desk will be responsible for ownership of the incident and monitoring and reporting progress.

Incident Management will need to prioritise incidents in terms of urgency and impact. Impact is the effect on the business, and urgency concerns the timescale in which the incident needs to be resolved. For incidents to be high priority, both the impact and the urgency need to be high. The priority defined for different incidents will need to be agreed across the organisation, and will also be defined in SLAs in terms of response times and resolution targets. Incident Management will have to meet targets set in SLAs for incident response times and resolution targets, and so will need to ensure that these targets are:

Incidents will need to be assigned to appropriate teams for resolution, with timescales set and thresholds monitored. Escalation procedures will need to be invoked where SLA targets are in danger of being missed. When a major incident occurs (high impact, high urgency and SLA threat), Problem Management staff need to be informed so that they can provide additional support to the Service Desk. Where a number of equal-priority incidents occur at the same time, it is best to first tackle those where suitable resources are available.
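
Deriving priority from impact and urgency is often done with a small matrix. The sketch below assumes three-point scales and the additive rule `impact + urgency - 1`, which is one common convention and not something the text prescribes; the resolution targets are invented placeholders for values that would come from the SLA.

```python
# Assumed three-point scales: 1 = high, 2 = medium, 3 = low.
def priority(impact, urgency):
    """Combine impact and urgency into a priority from 1 (highest) to 5."""
    for value in (impact, urgency):
        if value not in (1, 2, 3):
            raise ValueError("impact and urgency must be 1, 2 or 3")
    return impact + urgency - 1

# Hypothetical resolution targets in hours, as might be agreed in an SLA.
RESOLUTION_TARGET_HOURS = {1: 1, 2: 8, 3: 24, 4: 48, 5: 120}

p = priority(impact=1, urgency=1)
print(p, RESOLUTION_TARGET_HOURS[p])  # prints 1 1
```

With this rule an incident reaches the top priority only when both impact and urgency are high, which matches the requirement stated above.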

Key Performance Indicators for Incident Management are:

The Critical Success Factors for Incident Management are:

The Risks facing Incident Management are:

The Challenges facing Incident Management are:

Problem Management

Problems are defined as 'the unknown cause of one or more incidents'. Problem Management is responsible for preventing problems and resulting incidents from happening, eliminating recurring incidents and minimising the impact of incidents that cannot be prevented.

Problem Management must ensure that processes are in place to identify the root causes of problems. These processes will normally be carried out by technical staff, working closely with the Service Desk, Incident Management and Suppliers. Root Cause analysis techniques used in Problem Management include:

Chronological Analysis
Documenting all events in time order to identify related events
Kepner-Tregoe
Define the problem, describe the problem, establish possible causes, test the probable cause, verify the true cause
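Five Whys
Repeatedly asking 'why?' of each answer, typically around five times, until the underlying root cause is reached rather than a symptom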
Brainstorming
Gathering relevant technical personnel together to find potential causes of problem
Ishikawa Fishbone Tool
The problem is presented in diagrammatic form resembling a fish skeleton. The problem is represented as the head, and possible causes are represented as the main bones branching from the backbone. Each main bone can be subdivided, with each branch representing a more detailed potential cause for further investigation. Also known as 'Cause and Effect' analysis.

Resolutions must be implemented through controlled procedures. Information about problems, known errors and resolutions should be documented and available to other service areas via the SKMS.

Resources to deal with problems will be prioritised on a pain-factor basis, or the seriousness of the impact on the business. Tools used for defining priorities include:

Pain-Value Analysis
Assess impact by considering the number of people affected, the duration of downtime and the cost to the business
Pareto Analysis
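Problems are ranked by their contribution to the overall impact and the highest-ranking ones are tackled first, on the premise that a small share of causes accounts for most of the effect. Also known as the 80:20 rule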
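
A Pareto ranking can be computed directly from incident counts per cause. The following is a sketch under stated assumptions: the `pareto` function, the 80 percent threshold parameter and the sample counts are illustrative, not part of any standard tool.

```python
def pareto(counts, threshold_percent=80):
    """Return the top-ranked causes that together reach the threshold share."""
    total = sum(counts.values())
    # Rank causes from most to fewest incidents.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0
    for cause, count in ranked:
        # Stop once the causes already selected cover the threshold.
        if running * 100 >= threshold_percent * total:
            break
        selected.append(cause)
        running += count
    return selected

# Hypothetical incident counts per underlying cause.
incidents_by_cause = {"network": 50, "storage": 30, "printing": 12, "email": 8}
print(pareto(incidents_by_cause))  # prints ['network', 'storage']
```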

Problem Management will be responsible for conducting Major Problem Reviews after the resolution of a major problem. The review will document the problem, its causes and its resolution, identify strengths and weaknesses in the management of the problem and in the Problem Management process, and track third-party activities related to the problem.

The Problem Management process involves a standard set of control processes:

Detection
Either the result of proactive identification or escalation from Service Desk or Incident Management. The problem will need to be classified as new or recurring. A new problem occurs when no matches are found to existing problem records or Known Errors. Incident Matching is used to establish relationships to other incidents, and thus determine if the problem is new or recurring.
Logging
Problem Record is created and linked to associated records and Known Errors. Links will also be created to resultant RFCs in the CMS
Categorisation
Used to determine the allocation of appropriate resources and capabilities
Prioritisation
Dependent on the effort required and the urgency and impact of the problem
Investigation and Diagnosis
An iterative process using one of the Root Cause analysis techniques described earlier
Establish Workarounds
Implement temporary solution to mitigate the impact. Workarounds should be recorded in the problem record which is kept open. Workarounds may affect the prioritisation of the problem
Raising a Known Error Record
As soon as the diagnosis is complete, a Known Error Record is raised and placed in the Known Error Database. Known Error records may be raised earlier for information purposes if, for example, a root cause or workaround has not been fully confirmed
Problem Resolution
RFCs will be raised if a change in functionality is required. Problem Management is not responsible for implementing resolutions, only for identifying them
Closure
May occur automatically when a resolution to a Known Error is implemented, or may require a Post Implementation Review (PIR). A 'Closed pending PIR' status allows the effectiveness of the solution to be confirmed before final closure
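
The Detection step classifies a problem as new or recurring by matching it against existing records. A very simple keyword-overlap matcher is sketched below; the record ids, symptom keywords, `min_overlap` parameter and function names are all hypothetical, and real tools use far richer matching.

```python
# Hypothetical Known Error Database: record id -> set of symptom keywords.
KNOWN_ERRORS = {
    "KE-001": {"vpn", "timeout", "login"},
    "KE-002": {"printer", "queue", "stalled"},
}

def match_known_errors(symptoms, min_overlap=2):
    """Return ids of Known Errors sharing at least min_overlap symptoms."""
    symptoms = {s.lower() for s in symptoms}
    return sorted(
        ke_id
        for ke_id, keywords in KNOWN_ERRORS.items()
        if len(symptoms & keywords) >= min_overlap
    )

def classify(symptoms):
    """Classify a problem as 'recurring' if any Known Error matches, else 'new'."""
    return "recurring" if match_known_errors(symptoms) else "new"

print(classify(["VPN", "timeout"]))  # prints recurring
print(classify(["disk", "full"]))    # prints new
```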

Event Management

Events are defined as any 'change of state that has significance for the management of a CI or service'. Event Management is concerned with monitoring events and responding accordingly. Events can be monitored either actively or passively and will be categorised as informational, warnings or exceptions. Event Management only applies to Service Management tasks that are capable of automation. Alerts are used to automate Event Management and are configured to be generated by specific events or thresholds.
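
Threshold-based categorisation and alerting can be sketched in a few lines. The thresholds, CI names and function names below are assumptions chosen for illustration; they do not come from any particular monitoring tool.

```python
def classify_event(value, warning_at, exception_at):
    """Classify a monitored reading against two assumed thresholds."""
    if value >= exception_at:
        return "exception"
    if value >= warning_at:
        return "warning"
    return "informational"

def alerts(readings, warning_at=80, exception_at=95):
    """Yield (ci, category) for readings that should raise an alert."""
    for ci, value in readings.items():
        category = classify_event(value, warning_at, exception_at)
        if category != "informational":
            yield ci, category

# Hypothetical disk usage percentages per CI.
disk_usage = {"srv-db-01": 97, "srv-web-01": 85, "srv-app-01": 40}
print(list(alerts(disk_usage)))
# prints [('srv-db-01', 'exception'), ('srv-web-01', 'warning')]
```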

The objectives of Event Management are:

Request Fulfillment

Service Requests are characterised as low-risk, common, easy to model and low cost. Request Fulfillment is the process of dealing with service requests from users. Service Requests will fall into one of the following four categories:

Standard Changes
pre-approved and pre-authorised changes that are low-risk, relatively common and follow an established work pattern
Questions, Difficulties or Queries
normally arising through a lack of training or access to documentation
Standard Operational Requests
requests for disposable items, supplies or facilities
Complaints, Comments or Praise
tracking user or customer feedback, which will be of use for service improvement tasks

Request Fulfillment will use a Request Model to deal with service requests and most should be capable of some degree of automation. Service requests should be traceable - that is, the authorisation for the request should be easy to determine.

Access Management

Access Management is also referred to as Rights Management or Identity Management, and refers to the process of granting users the right to access a service and preventing unauthorised access. Access Management ensures that the Security Policy is followed and typical activities will include:

Security policies will identify which roles should be granted which rights, and the mechanisms or responsible parties for authorising access requests that are not defined in advance. Security policies will also define when rights should be revoked.
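
A role-to-rights mapping of this kind can be sketched as a lookup table. The roles, right names and the `handle_request` function below are hypothetical; the point is only that rights defined in advance are granted directly, while anything else is referred for authorisation.

```python
# Hypothetical security policy: role -> rights granted in advance.
POLICY = {
    "service_desk": {"incident:read", "incident:write"},
    "auditor": {"incident:read"},
}

def is_permitted(user_roles, right):
    """True if any of the user's roles grants the right under POLICY."""
    return any(right in POLICY.get(role, set()) for role in user_roles)

def handle_request(user_roles, right):
    """Grant rights defined in the policy; refer anything else for authorisation."""
    return "grant" if is_permitted(user_roles, right) else "refer to authoriser"

print(handle_request(["auditor"], "incident:read"))   # prints grant
print(handle_request(["auditor"], "incident:write"))  # prints refer to authoriser
```

Revocation then amounts to removing a role from the user or a right from the role, so that the same check fails afterwards.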