Mail writing

Overview

A collection of example emails.

incident

Report problem occurred

Describe the problem together with how it is going to be resolved.

Title: incident: earth01 disk array controller is dead

The problem started at ~08:20 and was rectified by removing earth01 from production at ~08:50.

This is just a heads-up: the disk controller has malfunctioned, and we now only process calls on earth02.


Any faults before 08:50 should be attributed to the earth01 failure. After this time, we want error reports
as soon as customers report problems with web. earth02 is _not_ expected to cope with a full maximum load of calls.



Plan A is to make a virtual machine of earth01, restored from earth02 and/or backup.

Plan B is to try to recover the hardware earth01 runs on, pending PCHERO arrival at UNIVERSE hosting.


More info to follow.

-- 

Re.

Sungtae Kim

Report problem solved

Present the detailed root cause analysis and how the problem was handled.

Title: incident: earth01 was unavailable (Was: incident: earth01 disk array controller is dead)

Begin : 2016-08-16 08:17
End : 2016-08-16 11:30
Area : test_server/earth
Severity : Medium

Notes:

earth01 responded to ping and on the HTTP and SSH ports, but login was not possible.
The monitoring system flagged non-critical alarms, and a power cycle of the host was attempted.

Web services were randomly affected, since most queries were routed through earth02 (customer requests go out on a random selection between earth01 and earth02).
Queries failed periodically, up to 10% of the time, until earth01 was fully back in production.

timeline:
08:20 Problems with earth01 were detected.
08:40 A power cycle of earth01 was attempted.
09:10 Fault with the Smart Array controller was identified. A network engineer was dispatched to UNIVERSE to perform hands-on fault correction.
09:20 Alternative plan: rebuild earth01 as a virtual machine and recover it. ETA to complete: 4-6 hours.
10:15 Engineer on site.
10:40 Hardware was disassembled and vital components were reseated. After reboot the controller was working again.
Checking hardware for faults and testing software + OS until ~11:25.
11:30 earth was back in normal operation. Work on rebuilding earth01 as a VM was halted.

Root cause: the most apparent cause was a hung kernel on earth01 due to faults in the storage I/O subsystem:
Slot 0 HP Smart Array P410i Controller
1783-Slot 0 Drive Array Controller Failure!
[Command failure (cmd=11h, err=020h)]

Corrective measures:
a) Hardware has been disassembled and reassembled, thus reseating boards, disk, PSU and controller.

b) We are currently working on replacing EARTH and TEST_SERVER functionality entirely with MARS, so that the impact of this kind of fault will be resolved.

c) We are evaluating whether time invested in fully virtualizing earth01 would mitigate downtime if the fault should occur again.
However, this has to be weighed against b) being close to done within a short timespan: the customer issue that was blocking progress
is being resolved this week, so b) can now progress.
-o-o-o-

--

Re.

Sungtae Kim
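
The header fields of a "problem solved" report (begin, end, Area, Severity) can be filled in by a small script, which also makes it easy to state how long the outage lasted. The following is only a minimal sketch in Python. The function name incident_header and the extra Duration line are illustrative assumptions and are not part of the original mail.

```python
from datetime import datetime

# Hypothetical helper (not from the original mail): builds the header block
# of an incident report and derives the outage duration from begin and end.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def incident_header(begin: str, end: str, area: str, severity: str) -> str:
    started = datetime.strptime(begin, TIME_FORMAT)
    finished = datetime.strptime(end, TIME_FORMAT)
    if finished < started:
        raise ValueError("end must not be earlier than begin")

    # Whole minutes between begin and end, split into hours and minutes.
    total_minutes = int((finished - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)

    return "\n".join([
        f"Begin : {begin}",
        f"End : {end}",
        f"Duration : {hours}h {minutes:02d}m",  # assumed extra field
        f"Area : {area}",
        f"Severity : {severity}",
    ])


# Values taken from the example report above.
header = incident_header(
    "2016-08-16 08:17", "2016-08-16 11:30", "test_server/earth", "Medium"
)
print(header)
```

With the values from the example report, the Duration line reads 3h 13m.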
