System administration isn't an easy job, but it's manageable with the right tools, the right people, and the right set of rules to live by. Learning some rules brings order out of the often chaotic world of system administration.
Who better to use as a reference than the people who practice the fine art of system administration themselves? The SAGE-IE group actually published ten rules for System Administrators in this presentation, but we decided to take their top five and go into a bit of depth on each one for you.
The first rule, being a good citizen, is frequently overlooked and is a somewhat unusual entry for a best practices list, which makes its inclusion all the more compelling. Being a good citizen comes down to customer service. We don't often think of network users as our customers, but they are exactly that.
For example, do your users see System Administrators as enablers and business assets, or do they view them as roadblocks and sources of production delays? Your job is to serve your users by maintaining systems, providing security, performing tasks within specified guidelines, and responding quickly to requests. Additionally, you're expected to do all of those things while maintaining a professional demeanor with your users and your management.
The second rule concerns monitoring. Monitoring is more than simple UP/DOWN ping tests; it's comprehensive insight into your environment that covers CPU and memory usage, network traffic, capacity, and environmental measurements. When you begin monitoring, collect usage statistics for CPU, memory, disk, and network to establish a baseline of normal operating behavior that you can refer to in the future. You also need to calculate growth statistics on logfiles, databases, and user data so that you can predict future capacity needs.
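As a rough illustration of the growth calculation described above, here is a minimal Python sketch. It assumes you already record disk usage as (day, gigabytes used) samples and that growth is roughly linear; the function names and the sample figures are hypothetical and not taken from any particular monitoring tool.

```python
def daily_growth(samples):
    """Average growth in GB per day between the first and last sample.

    `samples` is a chronological list of (day_number, used_gb) tuples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to measure growth")
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed_days = last_day - first_day
    if elapsed_days <= 0:
        raise ValueError("samples must span a positive number of days")
    return (last_used - first_used) / elapsed_days


def days_until_full(samples, capacity_gb):
    """Estimate days until capacity is reached, assuming linear growth.

    Returns None when usage is flat or shrinking.
    """
    rate = daily_growth(samples)
    if rate <= 0:
        return None
    remaining_gb = capacity_gb - samples[-1][1]
    return remaining_gb / rate


if __name__ == "__main__":
    # Hypothetical baseline: usage measured on day 0, day 10, and day 30.
    baseline = [(0, 100.0), (10, 120.0), (30, 160.0)]
    print("growth per day (GB):", daily_growth(baseline))
    print("days until a 200 GB volume fills:", days_until_full(baseline, 200.0))
```

Real growth is rarely perfectly linear, so treat the estimate as a planning aid and recalculate it as new samples arrive.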
Gathering metrics, however, is only one aspect of monitoring. The other is alerting when those metrics fall outside of normal operating parameters. What happens when a filesystem fills up? Do you receive an alert at 85 percent capacity or does your system crash or experience a service outage due to a stopped process? Proactive alerting on system and service behavior is an essential part of your total datacenter picture.
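To make the 85 percent example concrete, the following Python sketch checks one filesystem against a threshold using the standard library's `shutil.disk_usage`. The threshold constant, function names, and message format are assumptions for illustration; a real deployment would send the message to whatever alerting channel you use rather than print it.

```python
import shutil

# Threshold taken from the example in the text; tune it per filesystem.
ALERT_THRESHOLD_PERCENT = 85.0


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def check_filesystem(path, threshold=ALERT_THRESHOLD_PERCENT):
    """Return an alert message if `path` is at or above the threshold, else None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    if pct >= threshold:
        return f"ALERT: {path} is {pct:.1f}% full (threshold {threshold:.0f}%)"
    return None


if __name__ == "__main__":
    message = check_filesystem(".")
    print(message or "OK: usage is below the alert threshold")
```

Run on a schedule, a check like this warns you before the filesystem fills up, instead of after a process has already stopped.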
The third rule or best practice is to 'Perform Disaster Recovery Planning'. Contrary to some beliefs, disaster recovery doesn't necessarily mean recovery from a major disaster that affects the entire datacenter. It means recovery from any disaster, even a single-system disaster. One question you might consider as you think about disaster recovery is 'How are you going to fix the problem once it occurs?' You might not have direct physical access to a failed system to help in its recovery. In that case, you'll have to rely on remote personnel working at the datacenter to recover a system that's experienced a hardware fault.
The other question to think about is 'Where will you be when a disaster occurs?' Disasters rarely occur at convenient times during working hours. They tend to happen while you're away from the office and away from your computer. How will you meet the mean time to restore (MTTR) and the SLA for the failed system or systems when you have no access to them?
It's not enough to simply prepare for disasters; you have to plan for their occurrence. No amount of redundancy, load balancing, or regular backups will prevent disasters from happening. How to recover from a disaster, from a single system to an entire computing environment, is what you have to think about and plan for. How you will connect to and recover those failed systems has to be part of the plan.
As challenging as it is, you must document standard procedures, connectivity information, regular maintenance tasks, and disaster recovery contingency plans.
Documentation is difficult because it requires the System Administrator to stop and move stepwise through each task, while thoughtfully documenting each procedure. It's time-consuming and labor-intensive to document thoroughly, take screenshots, describe procedures, and explain possible outcomes. If you don't have well-documented procedures, then you'd better have the contingency plan of always being close to a computer and a network.
As you can surmise, rules four and five are closely related to each other. Establish standard procedures and document them. Standard procedures help you maintain consistency and reproducibility in your computing environment. Creating and adhering to a set of standard procedures has the added effect of stabilizing your systems and services, which, in turn, stabilizes your company's overall productivity.
System Administrators created these five best practices for System Administrators to use as guidelines that lead to more stable work environments and higher productivity. They'll help streamline your work, assist other System Administrators in your group, and maintain your sanity when things break.