<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://wiki.pchero21.com/index.php?action=history&amp;feed=atom&amp;title=Mail_writing</id>
	<title>Mail writing - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://wiki.pchero21.com/index.php?action=history&amp;feed=atom&amp;title=Mail_writing"/>
	<link rel="alternate" type="text/html" href="http://wiki.pchero21.com/index.php?title=Mail_writing&amp;action=history"/>
	<updated>2026-04-18T22:14:41Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.38.2</generator>
	<entry>
		<id>http://wiki.pchero21.com/index.php?title=Mail_writing&amp;diff=1508&amp;oldid=prev</id>
		<title>Pchero: Created page with &quot;== Overview == A collection of email examples.  == incident == === Report problem occurred === State both what the problem is and how it will be resolved. &lt;pre&gt; T...&quot;</title>
		<link rel="alternate" type="text/html" href="http://wiki.pchero21.com/index.php?title=Mail_writing&amp;diff=1508&amp;oldid=prev"/>
		<updated>2016-08-16T14:27:05Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;== Overview == A collection of email examples.  == incident == === Report problem occurred === State both what the problem is and how it will be resolved. &amp;lt;pre&amp;gt; T...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== Overview ==&lt;br /&gt;
A collection of email examples.&lt;br /&gt;
&lt;br /&gt;
== incident ==&lt;br /&gt;
=== Report problem occurred ===&lt;br /&gt;
State both what the problem is and how it will be resolved.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Title: incident: earth01 disk array controller is dead&lt;br /&gt;
&lt;br /&gt;
The problem started at ~08:20 and was mitigated by removing earth01 from production at ~08:50.&lt;br /&gt;
&lt;br /&gt;
This is just a heads-up: the disk controller has malfunctioned, and we now only process calls on earth02.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Any faults before 08:50 should be attributed to the earth01 failure. After this time, we want error reports &lt;br /&gt;
as soon as customers report problems with web services. earth02 is _not_ expected to cope with the full maximum load of calls.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Plan A is to build a virtual machine of earth01, restored from earth02 and/or backup.&lt;br /&gt;
&lt;br /&gt;
Plan B is to try to recover the hardware earth01 runs on, pending PCHERO's arrival at UNIVERSE hosting.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
More info to follow.&lt;br /&gt;
&lt;br /&gt;
-- &lt;br /&gt;
&lt;br /&gt;
Re.&lt;br /&gt;
&lt;br /&gt;
Sungtae Kim&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Report problem solved ===&lt;br /&gt;
Give a detailed root-cause analysis and describe how the problem was handled.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Title: incident: earth01 was unavailable (Was: incident: earth01 disk array controller is dead)&lt;br /&gt;
&lt;br /&gt;
Begin                : 2016-08-16 08:17   End: 2016-08-16 11:30&lt;br /&gt;
Area                 : test_server/earth&lt;br /&gt;
Severity             : Medium&lt;br /&gt;
&lt;br /&gt;
Notes:&lt;br /&gt;
&lt;br /&gt;
earth01 responded to ping and on the HTTP and SSH ports, but login was not possible. &lt;br /&gt;
The monitoring system flagged non-critical alarms, and a power cycle of the host was attempted.&lt;br /&gt;
&lt;br /&gt;
Web services were randomly affected since most queries were routed through earth02 (customer requests are sent to a random choice between earth01 and earth02). &lt;br /&gt;
Queries failed intermittently, up to 10% of the time, until earth01 was fully back in production.&lt;br /&gt;
&lt;br /&gt;
Timeline:&lt;br /&gt;
08:20 Problems with earth01 were detected.&lt;br /&gt;
08:40 A power cycle of earth01 was attempted.&lt;br /&gt;
09:10 Fault with the Smart Array controller was identified; a network engineer was dispatched to UNIVERSE to perform hands-on fault correction.&lt;br /&gt;
09:20 Alternative plan: rebuild earth01 as a virtual machine and recover it. ETA to complete: 4-6 hours.&lt;br /&gt;
10:15 Engineer on site.&lt;br /&gt;
10:40 Hardware disassembled and vital components reseated; after reboot the controller was working again.&lt;br /&gt;
Checked hardware for faults and tested software + OS until ~11:25.&lt;br /&gt;
11:30 earth was back in normal operation; work on rebuilding earth01 as a VM was halted.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Root cause: the most apparent cause was a hung kernel on earth01 due to faults in the storage I/O subsystem:&lt;br /&gt;
Slot 0  HP Smart Array P410i Controller&lt;br /&gt;
1783-Slot 0 Drive Array Controller Failure!&lt;br /&gt;
     [Command failure (cmd=11h, err=020h)]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Corrective measures:&lt;br /&gt;
a) The hardware has been disassembled and reassembled, thereby reseating the boards, disks, PSU and controller.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
b) We are currently working on replacing EARTH and TEST_SERVER functionality entirely with MARS, so that the impact of such a failure is eliminated.&lt;br /&gt;
&lt;br /&gt;
c) We are evaluating whether time invested in fully virtualizing earth01 would mitigate downtime should this occur again. &lt;br /&gt;
However, b) is close to being done within a short timespan: the customer issue that was blocking progress &lt;br /&gt;
is being resolved this week, so b) can now progress.&lt;br /&gt;
-o-o-o-&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
-- &lt;br /&gt;
&lt;br /&gt;
Re.&lt;br /&gt;
&lt;br /&gt;
Sungtae Kim&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:etc]]&lt;/div&gt;</summary>
		<author><name>Pchero</name></author>
	</entry>
</feed>