Service Operation
Supporting Business as Usual Activities
Service Operation is responsible for the day-to-day operation of IT Services. This involves carrying out activities to ensure that services are delivered to a specified agreed level and managing the technology used to deliver those services. In other words:
- Ensuring that customers achieve their goals
- Ensuring that components support the service function effectively
SFIA (Skills Framework for the Information Age) and RACI are tools that can be used to clarify roles and responsibilities in Service Operation. A RACI authority matrix is normally used during Service Design to assign processes and activities to defined roles. RACI is an acronym for the four levels of authority between an activity/process and a process role:
- Responsible
- responsible for carrying out the activity
- Accountable
- accountable for the activity. Only one person/role is assigned this authority and is referred to as the Process owner
- Consulted
- consulted regarding the progress of the activity
- Informed
- informed of progress of the activity
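As a rough illustration, a RACI matrix can be held as a mapping from each activity to the authority assigned to each role, and then checked against the single-Accountable rule described above. The activities and role assignments below are hypothetical, and the requirement for at least one Responsible role is an added assumption rather than something stated in these notes.

```python
from collections import Counter

# Hypothetical RACI matrix: activity -> {role: authority letter}.
raci = {
    "Log incident": {"Service Desk": "R", "Incident Manager": "A", "User": "I"},
    "Resolve incident": {
        "Technical Management": "R",
        "Incident Manager": "A",
        "Service Desk": "C",
        "User": "I",
    },
}


def raci_errors(matrix):
    """Return a message for every activity that breaks the checks below."""
    errors = []
    for activity, roles in matrix.items():
        counts = Counter(roles.values())
        # Rule from the text: only one role is Accountable for an activity.
        if counts["A"] != 1:
            errors.append(f"{activity}: needs exactly one Accountable role")
        # Assumption: somebody must actually carry out the activity.
        if counts["R"] < 1:
            errors.append(f"{activity}: needs at least one Responsible role")
    return errors


print(raci_errors(raci))
```

An empty list means every activity in the matrix passes both checks.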
Functions
Service Operation processes are usually delivered by four functions: Service Desk, Technical Management, Operations Management and Applications Management.
Service Desk
ITIL Best Practice demands a single point of contact for users in their communication with the IT Service Provider. A user is a person who uses a product or service. A customer is a person who negotiates for the provision of a product or service. The Service Desk is the main point of contact for a user with the IT Service Provider. Service Level Management is the main point of contact for a customer with the IT Service Provider.
The Service Desk must be aware of SLAs and their relation to user issues. The Service Desk acts on behalf of the user within the IT infrastructure, and the first duty of the Service Desk is to act as the user's friend. The Service Desk will be responsible for:
- monitoring progress on issues
- reporting back to the user on progress
- chasing assigned actions on behalf of the user
- monitoring SLAs on issues
An effective Service Desk requires well-trained, motivated staff with good interpersonal skills. Systems and processes in use will need to be well-designed for recording and tracking incidents and matching to previous incidents. Technology will be used for call distribution and knowledgebase access. Service Desk staff will need the appropriate level of technical competence to record, resolve and communicate issues.
The geographical positioning of the Service Desk may be one of:
- Local
- May lead to duplicating costs over multiple local Service Desks, adding overheads to maintain consistency between Service Desks and sharing of lessons learnt - this can be minimised by using centralised call-logging systems and CMDBs. Localised knowledge may be a key benefit of local Service Desks
- Central
- Loss of local knowledge and increased voice and data communications charges. Has the advantage of improving centralisation of information and knowledge and reduction in resource duplication
- Virtual
- Use of a single contact number, with calls routed by proximity or time of day
- Follow the Sun
- Allows use of cheaper daytime labour for out of hours support
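The time-of-day routing used by Virtual and Follow the Sun Service Desks could be sketched as below. The three desks and their UTC support windows are hypothetical values chosen only to show a window that wraps past midnight.

```python
from datetime import time

# Hypothetical desks with the UTC window in which each one takes calls.
DESKS = [
    ("Sydney", time(22, 0), time(6, 0)),    # window wraps past midnight
    ("London", time(6, 0), time(14, 0)),
    ("New York", time(14, 0), time(22, 0)),
]


def desk_for(utc_time):
    """Return the name of the desk whose window contains utc_time."""
    for name, start, end in DESKS:
        if start <= end:
            in_window = start <= utc_time < end
        else:
            # The window crosses midnight, so it is two ranges joined together.
            in_window = utc_time >= start or utc_time < end
        if in_window:
            return name
    raise ValueError("no desk covers this time")


print(desk_for(time(23, 30)))
```

The user still dials a single contact number; only the desk that answers changes with the time of day.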
When a Service Desk is established the various inputs must be anticipated and catered for: phone calls, email, faxes, web-forms, voicemail and operational events. Automated responses can be configured for some of these, but should not detract from the Service Desk's responsibilities for monitoring and tracking issues.
Escalation Management is the process of moving an incident to the appropriate team for resolution. Functional escalation moves the issue to the relevant experts. Hierarchical escalation moves the issue up the management chain, either because the issue is very serious or because higher authority is needed to sanction additional resources. Explicit parameters are needed to govern hierarchical escalation.
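A minimal sketch of the two escalation routes follows. The `Incident` fields and the `HIERARCHICAL_PRIORITY_LIMIT` parameter are hypothetical; the parameter stands in for the explicit rules that an organisation would agree for hierarchical escalation.

```python
from dataclasses import dataclass

# Hypothetical explicit parameter: incidents at this priority or more serious
# (1 is the most serious) always go up the management chain.
HIERARCHICAL_PRIORITY_LIMIT = 1


@dataclass
class Incident:
    priority: int                # 1 = most serious
    needs_specialist: bool       # the current team lacks the expertise
    needs_extra_resources: bool  # higher authority must sanction resources


def escalations_for(incident):
    """Return the escalation routes that apply to an incident."""
    routes = []
    if incident.needs_specialist:
        routes.append("functional")
    if (incident.priority <= HIERARCHICAL_PRIORITY_LIMIT
            or incident.needs_extra_resources):
        routes.append("hierarchical")
    return routes


print(escalations_for(Incident(priority=2, needs_specialist=True,
                               needs_extra_resources=True)))
```

The two routes are independent, so a single incident may need both at once.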
Setting the skill level of the Service Desk requires balancing the increased cost of skilled staff against the improved service provided. Achieving the optimum balance may require monitoring the percentage of calls closed by the Service Desk. The skill level may require a temporary uplift during Early Life Support (ELS) or major incidents. Improved processes can reduce the requirement for skilled staff and can include the use of automated scripts and better access to knowledge bases, change schedules and SLAs.
The technology used by Service Desks will fall into two distinct classes:
- Telephony
- Automatic Call Distribution systems
- Conference Calling
- VOIP
- Software
- Intelligent Call-Logging Systems, that identify patterns and suggest solutions
- CMDBs and Known Error Databases
- Automatic Referral & Escalation
- Tracking and Alerting tools
Total number of calls can be a misleading indicator of Service Desk performance, as the figure will be skewed during deployment of new or changed services. Preferred Service Desk metrics include:
- First-line call resolution rate
- Average first-line time
- Average escalation time
- Average cost per call (total cost divided by number of calls)
- Average review/resolution time
- Call breakdown
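Several of these metrics are simple ratios over the call log. The sketch below assumes a hypothetical log format with invented field names and figures, and computes the first-line resolution rate, the average first-line handling time and the average cost per call.

```python
# Hypothetical call log; field names and values are illustrative only.
calls = [
    {"resolved_first_line": True, "first_line_minutes": 6, "cost": 4.0},
    {"resolved_first_line": False, "first_line_minutes": 12, "cost": 15.0},
    {"resolved_first_line": True, "first_line_minutes": 9, "cost": 5.0},
    {"resolved_first_line": True, "first_line_minutes": 5, "cost": 4.0},
]


def service_desk_metrics(calls):
    """Compute three of the metrics listed above from a list of call records."""
    total = len(calls)
    resolved = sum(1 for call in calls if call["resolved_first_line"])
    return {
        "first_line_resolution_rate": resolved / total,
        "average_first_line_minutes":
            sum(call["first_line_minutes"] for call in calls) / total,
        "average_cost_per_call": sum(call["cost"] for call in calls) / total,
    }


print(service_desk_metrics(calls))
```

Because each figure is normalised by the number of calls, it is less sensitive than the raw call total to the surge that accompanies a new or changed service.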
Soft measures can also be used to measure Service Desk performance, and these will typically consist of surveys and interviews, conducted in person, via telephone, email or web-forms and can be directed at individuals or groups of users.
Technical Management
Technical Management provides the organisation with technical expertise and overall management of the IT Infrastructure technology. The objectives of Technical Management are:
- to support business processes through well-designed, resilient, cost-effective technical topology
- to support business processes through the use of technical skills to maintain technical infrastructure in optimal condition
- to support business processes through swift use of technical skills to diagnose and resolve technical failures
Operations Management
IT Operations Management carries out the daily activities needed to manage the IT Infrastructure, according to defined performance standards.
Operations Management will be sub-divided into Operations Control and Facilities Management. Operations Control ensures that routine operational tasks are carried out, such as job scheduling and backup and restore activities. Facilities Management will be responsible for managing the buildings and environment housing CIs and logistical operations concerning CIs.
Applications Management
The application lifecycle closely matches the ITSM lifecycle, but is focussed on applications not services: Requirements, Design, Build and Deploy, Operate, Optimise. Application Management is responsible for applications through the whole Application Lifecycle.
Application Management is the custodian of the technical knowledge, skills and documentation related to the management of applications and is responsible for identifying and specifying application requirements. Application Management will provide resources to support applications through the ITSM lifecycle, and guidance to IT Operations on the on-going management of applications. Application Management objectives are:
- ensure that applications are well-designed, resilient and cost-effective
- ensure required functionality is available to achieve required business outcomes
- ensure that adequate technical skills are available to maintain operational applications
- ensure skills exist to swiftly diagnose and resolve issues
Processes
Incident Management
An incident is an unplanned interruption to an IT service, a reduction in the quality of an IT service, or a failure of a CI that has not yet impacted an IT service.
The main goal of Incident Management is to restore normal service as quickly as possible and minimise the adverse impact of service incidents. It is not concerned with longer-term structural resolution of faults, which is the domain of Problem Management. Normal service operation is defined as a service that is within SLA limits. The incident lifecycle is:
- identification
- logging
- categorisation
- prioritisation
- initial diagnosis
- resolution and recovery
- incident closure
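The lifecycle can be treated as an ordered sequence of stages. The sketch below uses the stage names listed above; treating the order as strictly linear is a simplifying assumption, since in practice an incident may be escalated or revisit an earlier stage.

```python
# Stage names come from the lifecycle list; the strict linear order is an
# assumption made for this sketch.
STAGES = [
    "identification",
    "logging",
    "categorisation",
    "prioritisation",
    "initial diagnosis",
    "resolution and recovery",
    "incident closure",
]


def next_stage(current):
    """Return the stage after `current`, or None once the incident is closed."""
    position = STAGES.index(current)  # raises ValueError for an unknown stage
    if position + 1 < len(STAGES):
        return STAGES[position + 1]
    return None


print(next_stage("logging"))
```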
Recurring incidents may be raised with Problem Management via a Problem Record. Until a resolution is found, the recurring incident will be handled using an Incident Model, which details steps to be taken, staff involved, timescales, escalation procedures and evidence preservation activities. During the handling of an incident, Incident Management or the Service Desk will be responsible for ownership of the incident and monitoring and reporting progress.
Incident Management will need to prioritise incidents in terms of urgency and impact. Impact is the effect on the business and urgency concerns the timescale in which the incident needs to be resolved. For incidents to be high priority, both the impact and the urgency need to be high. The priority defined for different incidents will need to be agreed across the organisation, and will also be defined in SLAs in terms of response times and resolution targets. Incident Management will have to meet the targets set in SLAs for incident response and resolution times, and so will need to ensure that these targets are:
- measurable
- achievable
- traceable
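One common way to combine impact and urgency into a priority is a lookup matrix. The sketch below assumes a hypothetical three-level scale and a five-level priority code; the real levels and the response and resolution targets attached to each priority would be whatever the organisation agrees in its SLAs.

```python
# Hypothetical three-level scale; 1 is the most severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact, urgency):
    """Combine impact and urgency into a priority from 1 (top) to 5 (lowest)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


for impact in LEVELS:
    print(impact, [priority(impact, urgency) for urgency in LEVELS])
```

With this scheme an incident reaches priority 1 only when both impact and urgency are high, which matches the rule stated above.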
Incidents will need to be assigned to appropriate teams for resolution, and timescales will be set and thresholds monitored. Escalation procedures will need to be invoked where SLA targets are in danger of being missed. When a major incident occurs (high impact, high urgency and SLA threat), Problem Management staff need to be informed so that they can provide additional support to the Service Desk. Where a number of equal-priority incidents occur at the same time, it is best to tackle first those where suitable resources are available.
Key Performance Indicators for Incident Management are:
- total number of incidents
- size of incident backlog
- number and percentage of major incidents
- percentage of incidents handled within SLA, OLA and UC targets
- average cost of incidents
- percentage of incidents processed by the Service Desk alone
- number and percentage of incidents processed by the Service Desk staff
The Critical Success Factors for Incident Management are:
- resolve incidents quickly, minimising impact
- maintain quality of IT Service
- maintain user satisfaction with IT Service
- increase visibility and communication of incidents to business and IT Support staff
- ensure the use of standardised methods and procedures for prompt response, analysis, documentation, on-going management and reporting of incidents
The Risks facing Incident Management are:
- too many incidents to handle
- incident backlog
- lack of information to handle incidents
- mismatch of objectives due to poorly aligned OLAs or UCs
The Challenges facing Incident Management are:
- the ability to detect incidents as early as possible
- convincing all staff that all incidents must be logged
- making information available about Known Errors
- integration to CMS, to assist identification of relationships between CIs
- integration to SLM process to correctly prioritise incidents
- defining escalation procedures
Problem Management
Problems are defined as 'the unknown cause of one or more incidents'. Problem Management is responsible for preventing problems and resulting incidents from happening, eliminating recurring incidents and minimising the impact of incidents that cannot be prevented.
Problem Management must ensure that processes are in place to identify the root causes of problems. These processes will normally be carried out by technical staff, working closely with the Service Desk, Incident Management and Suppliers. Root Cause analysis techniques used in Problem Management include:
- Chronological Analysis
- Documenting all events in time order to identify related events
- Kepner-Tregoe
- Define the problem, describe the problem, establish possible causes, test the probable cause, verify the true cause
- Five Whys
- Repeatedly asking 'why?' of each answer, typically around five times, to work back from a symptom to its root cause
- Brainstorming
- Gathering relevant technical personnel together to find potential causes of problem
- Ishikawa Fishbone Tool
- Problem is presented in diagrammatic form resembling a fish skeleton. The problem is represented as the head, and possible causes are represented as the main bones branching from the backbone. Each main bone can be subdivided with each branch representing potential solutions for further investigation. Also known as 'Cause and Effect' analysis.
Resolutions must be implemented through controlled procedures. Information about problems, known errors and resolutions should be documented and available to other service areas via the SKMS.
Resources to deal with problems will be prioritised on a pain-factor basis, or the seriousness of the impact on the business. Tools used for defining priorities include:
- Pain-Value Analysis
- Assess impact by considering the number of people affected, duration of downtime and the cost to the business
- Pareto Analysis
- Problems are ranked and highest ranking ones are tackled first. Also known as the 80:20 rule
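The two tools can be combined: score each problem with a pain value, then use a Pareto cut to pick the few problems that account for most of the pain. The scoring formula, the example problems and the 80% share below are all hypothetical; a real pain-value formula would weight the factors as the business sees fit.

```python
def pain_value(users_affected, downtime_hours, cost_per_user_hour):
    """Hypothetical pain score: the business cost of the downtime."""
    return users_affected * downtime_hours * cost_per_user_hour


# Illustrative problems and figures only.
problems = {
    "Email outage": pain_value(200, 4, 10),
    "Printer queue stalls": pain_value(20, 1, 10),
    "VPN drops": pain_value(50, 4, 10),
    "Slow intranet search": pain_value(30, 1, 10),
}


def pareto_focus(scores, share=0.8):
    """Return the highest-ranked problems that together reach `share` of total pain."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    threshold = share * sum(scores.values())
    focus, running = [], 0
    for name, score in ranked:
        if running >= threshold:
            break
        focus.append(name)
        running += score
    return focus


print(pareto_focus(problems))
```

Here two of the four problems carry most of the total pain, so they would be resourced first.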
Problem Management will be responsible for conducting Major Problem Reviews after the resolution of a major problem. The review will document the problem, its causes and its resolution, and analyse any weaknesses in the Problem Management process. The review should also identify strengths and weaknesses in the management of the problem and track third-party activities related to the problem.
The Problem Management process involves a standard set of control processes:
- Detection
- Either the result of proactive identification or escalation from Service Desk or Incident Management. The problem will need to be classified as new or recurring. A new problem occurs when no matches are found to existing problem records or Known Errors. Incident Matching is used to establish relationships to other incidents, and thus determine if the problem is new or recurring.
- Logging
- Problem Record is created and linked to associated records and Known Errors. Links will also be created to resultant RFCs in the CMS
- Categorisation
- used to determine the allocation of appropriate resources and capabilities
- Prioritisation
- Dependent on the effort required and the urgency and impact of the problem
- Investigation and Diagnosis
- An iterative process using one of the Root Cause analysis techniques described earlier
- Establish Workarounds
- Implement temporary solution to mitigate the impact. Workarounds should be recorded in the problem record which is kept open. Workarounds may affect the prioritisation of the problem
- Raising a Known Error Record
- As soon as the diagnosis is complete, a Known Error Record is raised and placed in the Known Error Database. Known Error records may be raised earlier for information purposes if, for example, a root cause or workaround has not been fully confirmed
- Problem Resolution
- RFCs will be raised if a change in functionality is required. Problem Management is not responsible for implementing resolutions, only for identifying them
- Closure
- May occur automatically when a resolution to a Known Error is implemented, or may require a Post Implementation Review (PIR). A 'Closed pending PIR' status allows the effectiveness of the solution to be confirmed before final closure
Event Management
Events are defined as any 'change of state that has significance for the management of a CI or service'. Event Management is concerned with monitoring events and responding accordingly. Events can be monitored either actively or passively and will be categorised as informational, warnings or exceptions. Event Management only applies to Service Management tasks that are capable of automation. Alerts are used to automate Event Management and are configured to be generated by specific events or thresholds.
The objectives of Event Management are:
- detect all significant changes of state for a CI or service
- determine appropriate control action for events and communicate to appropriate functions
- provide the trigger for the execution of many service processes and operations management activities
- provide a basis for service assurance, reporting and improvement
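The categorisation of events as informational, warnings or exceptions is often driven by thresholds. The sketch below assumes a hypothetical disk-usage monitor with invented threshold values, and raises an alert for anything that is not purely informational.

```python
def classify_event(disk_used_percent, warn_at=80, fail_at=95):
    """Categorise a disk-usage reading; both thresholds are hypothetical."""
    if disk_used_percent >= fail_at:
        return "exception"
    if disk_used_percent >= warn_at:
        return "warning"
    return "informational"


def alert_needed(category):
    """Assumption: only warnings and exceptions generate an alert."""
    return category != "informational"


for reading in (42, 85, 97):
    category = classify_event(reading)
    print(reading, category, alert_needed(category))
```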
Request Fulfillment
Service Requests are characterised as low-risk, common, easy to model and low cost. Request Fulfillment is the process of dealing with service requests from users. Service Requests will fall into one of the following four categories:
- Standard Changes
- pre-approved and pre-authorised changes that are low-risk, relatively common and follow an established work pattern
- Questions, Difficulties or Queries
- normally arising through a lack of training or access to documentation
- Standard Operational Requests
- requests for disposable items, supplies or facilities
- Complaints, Comments or Praise
- tracking user or customer feedback, which will be of use for service improvement tasks
Request Fulfillment will use a Request Model to deal with service requests and most should be capable of some degree of automation. Service requests should be traceable - that is, the authorisation for the request should be easy to determine.
Access Management
Access Management is also referred to as Rights Management or Identity Management, and refers to the process of granting users the right to access a service and preventing unauthorised access. Access Management ensures that the Security Policy is followed and typical activities will include:
- Verification check: should this user be granted the requested right
- Identity check: is this user who they claim to be
- Provide rights
- Monitor use
- Remove or restrict access
- Password resets
Security policies will identify which roles should be granted which rights, and the mechanisms or responsible parties that authorise requests for access that are not defined in advance. Security policies will also define when rights should be revoked.
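The identity and verification checks can be sketched as a lookup against the rights that the security policy defines for each role. The roles, rights and return values below are hypothetical and stand in for whatever the organisation's Security Policy actually specifies.

```python
# Hypothetical security policy: the rights defined in advance for each role.
ROLE_RIGHTS = {
    "finance-clerk": {"ledger:read"},
    "finance-manager": {"ledger:read", "ledger:approve"},
}


def handle_access_request(identity_verified, role, requested_right):
    """Apply the identity check, then the verification check, to one request."""
    # Identity check: is this user who they claim to be?
    if not identity_verified:
        return "rejected"
    # Verification check: does the policy grant this right to the user's role?
    if requested_right in ROLE_RIGHTS.get(role, set()):
        return "granted"
    # Not defined in advance, so a responsible party must authorise it.
    return "referred for authorisation"


print(handle_access_request(True, "finance-clerk", "ledger:approve"))
```

Requests that the policy does not cover are referred rather than refused, reflecting the authorisation route described above.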
