Last updated on Monday, May 8, 2023, at 20:37:23 GMT

Systems of all kinds create log data constantly and voluminously. In searching out the most compelling reasons to dig into and analyze such data, we compiled a list of seven reasons that usually drive such activity. In this blog post we tackle the first of those seven, which are:

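  1. System troubleshooting
  2. Security incident response
  3. Security troubleshooting
  4. Performance troubleshooting
  5. Understanding user behavior or activities
  6. Security policy compliance
  7. Audit or regulatory compliance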

Thus, today’s topic is Log Analysis for System Troubleshooting, just as it says in the title.

What Is System Troubleshooting?

In general, troubleshooting a system means trying to get from one or more symptoms of misbehavior to a root cause, and from there to a workable fix or workaround. System troubleshooting is most often invoked in response to observations or reports of something not working correctly (or at all), or in response to outright error or alert messages (often in concert with sounds to grab users’ attention). There’s a standard general approach to troubleshooting that’s always worth recalling (and following) whenever trouble rears its vexing head. Even when log analysis is part of the process, it’s wise to note and remember the standard sequence. Please notice that log analysis does not come in until the second step (or later) in the sequence!

CompTIA’s standard six-step troubleshooting sequence comes from its A+ Computer Technician certification training and materials, but it is worth knowing and applying wherever appropriate. The steps are:

  1. **Identify the problem:** Question the user (or make observations) and identify user (or other) changes to the system. Perform backups before making any changes.
  2. **Establish a theory of probable cause:** Find a likely reason for the problem, and remember to question obvious reasons. Avoid jumping to conclusions.
  3. **Test the theory to determine cause:** Take whatever steps are necessary to confirm or deny the probable cause. If the theory is confirmed, determine next steps to problem resolution. If the theory is disproved, return to Step 2. Multiple returns to Step 2 may mean a return to Step 1 is needed instead (the problem may have been mis-identified).
  4. **Plan, then act:** Establish a plan of action to resolve the problem, document the plan, then implement the planned solution.
  5. **Test and prevent:** Check and verify full or normal system functionality, and document results obtained. If the solution is verified, proceed to Step 6. If applicable, implement preventive measures to prevent a recurrence. If the solution is not verified or correct, return to Step 2.
  6. **Report:** Document findings, actions, and outcomes. If multiple passes through Steps 1-6 are needed, keep track of those activities to make sure you don’t get stuck in a loop where you repeat the same mistakes over and over again.

The most effective troubleshooting proceeds from a clear understanding of normal or expected system behavior, and from careful observation of what’s not working, missing, or otherwise abnormal or unexpected. Often, the “find a likely reason” element in Step 2 will come from a careful examination of system log data to see what kinds of errors, warnings, and alerts it may contain.

What Types of Log Data Help with System Troubleshooting?

The most obvious or helpful log data often comes from error messages or alerts, usually from system or application logs. Thus, for example, troubleshooting a USB problem on a Windows computer might turn to the Event Viewer with a look into the System log. On the other hand, issues related to logging in or failing to create a remote login session will be better sought in the Application log. For most problems, a quick trip to the Reliability Monitor can also be helpful, because it flags issues with both hardware and software aspects of systems. This proved helpful when chasing down a recent USB issue, for example.

Reliability Monitor details on the 8/10 hardware error indicate a USB hub failure occurred: just what we needed to know!

In most cases, you will have to formulate at least some idea of likely causes to know where to start looking for related or illuminating information. If your guesses turn out to be off the mark, your initial theory is probably off the mark, too.

Putting the Troubleshooting Puzzle Together

When it comes to formulating (and checking) theories about problem causation, there are many ways to look for clues. If you know roughly when an error occurred, you can use time information in event logs to zero in on the occurrence(s) of interest. It’s seldom necessary to go back more than a minute or two before the recorded or associated time to get a strong sense of root causes, and this really helps to limit how much log data needs inspection for subsequent analysis.
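To make the time-based narrowing concrete, here is a minimal Python sketch. It assumes the events have already been exported into simple records; the field names (`time`, `level`, `source`, `message`), the sample records, and the `events_near` helper are all made up for illustration and do not reflect any particular log format or tool.

```python
from datetime import datetime, timedelta

# Hypothetical sample records, invented for illustration only.
events = [
    {"time": datetime(2023, 8, 10, 14, 1, 5), "level": "Information",
     "source": "service-manager", "message": "Service started."},
    {"time": datetime(2023, 8, 10, 14, 3, 40), "level": "Warning",
     "source": "usbhub", "message": "Device reset requested."},
    {"time": datetime(2023, 8, 10, 14, 4, 10), "level": "Error",
     "source": "usbhub", "message": "Hub port stopped responding."},
    {"time": datetime(2023, 8, 10, 14, 9, 0), "level": "Information",
     "source": "service-manager", "message": "Service stopped."},
]


def events_near(events, observed_at,
                before=timedelta(minutes=2), after=timedelta(seconds=30)):
    """Return events recorded shortly before (and just after) the observed time."""
    start = observed_at - before
    end = observed_at + after
    return [e for e in events if start <= e["time"] <= end]


# Assumed time at which the problem was noticed.
observed_at = datetime(2023, 8, 10, 14, 4, 30)

for e in events_near(events, observed_at):
    print(e["time"].isoformat(), e["level"], e["source"], e["message"])
```

The two-minute default mirrors the rule of thumb above. Widening `before` is the obvious first adjustment if the window turns up nothing useful.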

The same goes for error codes or messages. In more or less the same ways you can scope filters down by timestamps, you can also look for specific codes or message text. Even in the absence of those incredibly informative details, you can filter events by their severity. Thus, for Windows events, that means you often need to check only the Error and Warning event levels to see the important stuff you’re probably after.
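A filter on severity, event ID, and message text might look like the Python sketch below. It works on in-memory records rather than a live event log. The `filter_events` helper, the field names, the event IDs, and the messages are hypothetical; the numeric levels follow the Windows convention (1 = Critical, 2 = Error, 3 = Warning, 4 = Information, 5 = Verbose).

```python
# Windows numeric event levels.
CRITICAL, ERROR, WARNING, INFORMATION, VERBOSE = 1, 2, 3, 4, 5

# Hypothetical sample records; IDs and messages are invented for illustration.
events = [
    {"level": INFORMATION, "event_id": 100, "message": "Service started."},
    {"level": WARNING, "event_id": 200, "message": "USB device reset requested."},
    {"level": ERROR, "event_id": 201, "message": "USB hub port stopped responding."},
    {"level": INFORMATION, "event_id": 300, "message": "USB device connected."},
]


def filter_events(events, levels=None, event_ids=None, text=None):
    """Keep events matching every criterion that is supplied (None means 'any')."""
    selected = []
    for e in events:
        if levels is not None and e["level"] not in levels:
            continue
        if event_ids is not None and e["event_id"] not in event_ids:
            continue
        if text is not None and text.lower() not in e["message"].lower():
            continue
        selected.append(e)
    return selected


# Severity only: the Error and Warning levels.
important = filter_events(events, levels={ERROR, WARNING})

# Severity combined with a case-insensitive message search.
hub_problems = filter_events(events, levels={ERROR, WARNING}, text="hub")

for e in important:
    print(e["level"], e["event_id"], e["message"])
```

Treating each criterion as optional keeps the helper usable whether you know the exact code, only a fragment of the message, or nothing beyond the severity.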

By performing various kinds of event correlation (based on time, error level, systems or applications involved, and so forth) you can limit your searches and focus on the things that are most likely to shed light on the problem at hand. You will usually also find sufficient detail to help you identify causes and, often, to determine fixes (replacing the failing USB hub, in our case) or workarounds (removing the failing USB hub, if a replacement isn’t handy) to keep things working properly.
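One simple way to approximate this kind of correlation is to count which sources produced Error or Warning events in the window just before the problem was noticed. The Python sketch below is only an illustration under assumed data: the `correlate` helper, the record fields, and the sample events are hypothetical, and real correlation tooling does considerably more.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical sample records, invented for illustration only.
events = [
    {"time": datetime(2023, 8, 10, 13, 50, 0), "level": "Error", "source": "app"},
    {"time": datetime(2023, 8, 10, 14, 3, 40), "level": "Warning", "source": "usbhub"},
    {"time": datetime(2023, 8, 10, 14, 4, 0), "level": "Warning", "source": "disk"},
    {"time": datetime(2023, 8, 10, 14, 4, 10), "level": "Error", "source": "usbhub"},
    {"time": datetime(2023, 8, 10, 14, 4, 20), "level": "Information", "source": "usbhub"},
]


def correlate(events, observed_at, window=timedelta(minutes=2),
              levels=("Error", "Warning")):
    """Rank sources by how many matching events they logged inside the window."""
    start = observed_at - window
    counts = Counter(
        e["source"]
        for e in events
        if start <= e["time"] <= observed_at and e["level"] in levels
    )
    # most_common() sorts by count, highest first.
    return counts.most_common()


# Assumed time at which the problem was noticed.
observed_at = datetime(2023, 8, 10, 14, 4, 30)

for source, count in correlate(events, observed_at):
    print(source, count)
```

The source at the top of the ranking is a candidate for the Step 2 theory, not a verdict. It still has to be tested as described in Step 3.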