Systems Administration Best Practices

As a Data Systems Administrator in the U.S. Marine Corps Reserve and Field Implementation Specialist at United Airlines, I’ve learned that effective systems administration requires a methodical approach, continuous learning, and attention to detail.
Modern systems administration relies heavily on automation and scripting. Throughout this guide, I’ll demonstrate practical examples using Python for automation scripts, JavaScript for web-based monitoring tools, and Git for version control of configurations and scripts - three essential technologies in today’s IT infrastructure management.
Network Asset Management
Infrastructure Documentation
Proper documentation is the foundation of effective systems administration. Every network asset should be catalogued with detailed information.
# Document all network assets with proper inventory tracking
asset_id="NET-001"
asset_type="Server"
location="Data Center A"
responsible_admin="Jacob Tapfar"
ip_address="192.168.1.100"
last_maintenance="2025-07-15"
Network Monitoring
Implement comprehensive monitoring to detect issues before they impact operations:
# Basic network health check script
#!/bin/bash
ping -c 4 192.168.1.1 > /dev/null
if [ $? -eq 0 ]; then
echo "Gateway is responding"
else
echo "Gateway unreachable - investigate immediately"
fi
# Python network monitoring with alerts
import subprocess
import smtplib
from datetime import datetime
def check_network_health():
hosts = ['192.168.1.1', '8.8.8.8', 'google.com']
failed_hosts = []
for host in hosts:
result = subprocess.run(['ping', '-c', '1', host],
capture_output=True, text=True)
if result.returncode != 0:
failed_hosts.append(host)
if failed_hosts:
send_alert(f"Network issues detected: {', '.join(failed_hosts)}")
return False
return True
def send_alert(message):
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"[{timestamp}] ALERT: {message}")
# Add email notification logic here
Security Protocols
Access Control
Implement principle of least privilege across all systems:
# PowerShell example for user account management
New-LocalUser -Name "TempUser" -Description "Temporary contractor access"
Add-LocalGroupMember -Group "Remote Desktop Users" -Member "TempUser"
# Set account expiration
Set-LocalUser -Name "TempUser" -AccountExpires (Get-Date).AddDays(30)
Regular Security Audits
Conduct monthly security reviews:
- Review user access permissions
- Audit failed login attempts
- Check for unauthorized software installations
- Validate firewall rules and configurations
Disaster Recovery Planning
Backup Strategies
Follow the 3-2-1 backup rule:
- 3 copies of important data
- 2 different storage types
- 1 offsite backup
# Automated backup script example
#!/bin/bash
backup_date=$(date +%Y%m%d)
rsync -av /critical/data/ /backup/local/$backup_date/
# Sync to offsite location
rsync -av /backup/local/ backup-server:/remote/backups/
Testing Recovery Procedures
Regular disaster recovery testing is crucial:
- Monthly: Test file restoration from backups
- Quarterly: Full system recovery simulation
- Annually: Complete disaster scenario exercise
Performance Optimization
System Monitoring
Monitor key performance indicators:
# Linux system health check
echo "=== CPU Usage ==="
top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4"%"}'
echo "=== Memory Usage ==="
free -m | awk 'NR==2{printf "%.2f%%\n", $3*100/$2}'
echo "=== Disk Usage ==="
df -h | awk '$NF=="/"{printf "%s\n", $5}'
Capacity Planning
Proactive capacity planning prevents outages:
- Monitor storage growth trends
- Track network bandwidth utilization
- Plan hardware refresh cycles
- Document performance baselines
Automation and Scripting
Task Automation
Automate repetitive tasks to reduce human error:
# Windows scheduled task for log cleanup
$action = New-ScheduledTaskAction -Execute "PowerShell.exe" -Argument "-File C:\Scripts\CleanLogs.ps1"
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Sunday -At 2AM
Register-ScheduledTask -TaskName "WeeklyLogCleanup" -Action $action -Trigger $trigger
// JavaScript-based system dashboard for real-time monitoring
class SystemDashboard {
constructor() {
this.updateInterval = 5000; // 5 seconds
this.initDashboard();
}
async fetchSystemStats() {
try {
const response = await fetch('/api/system-stats');
const data = await response.json();
return data;
} catch (error) {
console.error('Failed to fetch system stats:', error);
return null;
}
}
updateDashboard(stats) {
document.getElementById('cpu-usage').textContent = `${stats.cpu}%`;
document.getElementById('memory-usage').textContent = `${stats.memory}%`;
document.getElementById('disk-usage').textContent = `${stats.disk}%`;
// Update status indicators
this.updateStatusIndicator('network', stats.networkStatus);
this.updateStatusIndicator('services', stats.servicesStatus);
}
updateStatusIndicator(elementId, status) {
const indicator = document.getElementById(elementId);
indicator.className = status === 'healthy' ? 'status-green' : 'status-red';
}
initDashboard() {
setInterval(async () => {
const stats = await this.fetchSystemStats();
if (stats) this.updateDashboard(stats);
}, this.updateInterval);
}
}
// Initialize dashboard on page load
document.addEventListener('DOMContentLoaded', () => {
new SystemDashboard();
});
Configuration Management
Use configuration management tools to maintain consistency:
- Document standard configurations
- Version control all scripts and configs
- Implement change approval processes
- Test changes in staging environments
# Git workflow for configuration management
# Initialize configuration repository
git init /etc/config-repo
cd /etc/config-repo
# Add configuration files to version control
git add nginx.conf
git add firewall-rules.txt
git add monitoring-config.yml
git commit -m "Initial configuration baseline"
# Create feature branch for changes
git checkout -b "update-nginx-ssl"
# Make configuration changes
vim nginx.conf
git add nginx.conf
git commit -m "Update SSL configuration for enhanced security"
# Push to remote for review
git push origin update-nginx-ssl
# Python script for automated configuration deployment
import subprocess
import os
from datetime import datetime
class ConfigDeployment:
def __init__(self, repo_path, config_dir):
self.repo_path = repo_path
self.config_dir = config_dir
def backup_current_config(self):
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_dir = f"/backup/config_{timestamp}"
subprocess.run(['cp', '-r', self.config_dir, backup_dir])
return backup_dir
def deploy_from_git(self, branch='main'):
# Pull latest changes
os.chdir(self.repo_path)
subprocess.run(['git', 'checkout', branch])
subprocess.run(['git', 'pull', 'origin', branch])
# Copy configurations
subprocess.run(['rsync', '-av', f"{self.repo_path}/",
self.config_dir])
# Restart services
self.restart_services()
def restart_services(self):
services = ['nginx', 'firewall', 'monitoring']
for service in services:
subprocess.run(['systemctl', 'restart', service])
print(f"Restarted {service}")
# Usage
deployer = ConfigDeployment('/etc/config-repo', '/etc/')
backup_path = deployer.backup_current_config()
deployer.deploy_from_git('production')
Best Practices Summary
- Documentation First: Document everything before implementing
- Security by Design: Build security into every process
- Monitor Everything: You can’t manage what you don’t measure
- Automate Routine Tasks: Focus human effort on complex problems
- Plan for Failure: Assume systems will fail and prepare accordingly
- Continuous Learning: Stay current with technology trends and threats
Lessons from Military and Aviation IT
Working in high-stakes environments has taught me valuable lessons:
Mission-Critical Mindset
- Zero tolerance for downtime in critical systems
- Redundancy planning for all single points of failure
- Clear escalation procedures for incident response
Attention to Detail
- Thorough documentation saves time during emergencies
- Standardized procedures ensure consistent results
- Regular training keeps skills sharp
Conclusion
Effective systems administration combines technical expertise with strong organizational skills. Whether managing military communications systems or airline operational networks, the fundamentals remain the same: plan thoroughly, monitor continuously, and always be prepared for the unexpected.
The key to success is building robust, well-documented systems that can withstand both planned maintenance and unexpected failures. By following these best practices and maintaining a disciplined approach to system management, you can ensure reliable operations that support your organization’s mission.
Jacob Tapfar is a Data Systems Administrator in the U.S. Marine Corps Reserve and Field Implementation Specialist at United Airlines. Through TapfarTech, he provides systems administration expertise and web solutions to small businesses and organizations.