Debian System Monitoring Playbook

A reusable, copy‑paste friendly checklist for diagnosing and stabilising high CPU/load on a Debian LAMP/LEMP stack with Apache/PHP‑FPM, Varnish, and MariaDB/MySQL.


0) Goals

  • Identify what is burning CPU (service → process → thread).
  • Determine why (CPU vs I/O wait vs virtualization steal vs network flood).
  • Capture enough evidence to fix the root cause (queries, endpoints, IPs).
  • Put tripwires in place so the next spike is easy to explain.

1) Fast triage (run in order)

# Overview
uptime

# Top snapshot (batch mode)
top -b -n1 | head -n 25

# Per-CPU breakdown (requires sysstat)
mpstat -P ALL 1 5

# Run queue / context switches / iowait
vmstat 1 5

Interpretation:

  • us% high → application (PHP, mysqld, etc.) is busy.
  • sy% high → kernel/interrupts (network/disk overhead, drivers).
  • wa% high → disk I/O bottleneck.
  • st% high → virtualization host contention (noisy neighbour).
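
The reading above can be turned into a tiny helper. This is a minimal sketch: the function name classify_cpu and the 10% floor are illustrative assumptions, not tuned values.

```shell
# classify_cpu US SY WA ST
# Prints which CPU bucket dominates, following the interpretation list above.
# The 10% floor is an arbitrary illustrative threshold.
classify_cpu() {
  label=quiet
  max=10
  for pair in "us:$1" "sy:$2" "wa:$3" "st:$4"; do
    name=${pair%%:*}
    val=${pair##*:}
    if [ "$val" -gt "$max" ]; then
      max=$val
      label=$name
    fi
  done
  case $label in
    us) echo "us: application code is busy" ;;
    sy) echo "sy: kernel or interrupt overhead" ;;
    wa) echo "wa: waiting on disk I/O" ;;
    st) echo "st: hypervisor is withholding CPU" ;;
    *)  echo "quiet: no bucket above 10%" ;;
  esac
}

# Possible wiring (assumes the classic 17-column procps vmstat layout,
# where us=13, sy=14, wa=16, st=17; verify against the header on your system):
# classify_cpu $(vmstat 1 2 | tail -n 1 | awk '{print $13, $14, $16, $17}')
```

The helper takes integers only, which matches what vmstat prints.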

Offenders: service → process → thread

# By cgroup (which service)
systemd-cgtop --iterations=3 --depth=2

# Top processes (ps %cpu is averaged over each process's lifetime; cross-check with the top snapshot above)
ps -eo pid,ppid,comm,%cpu,%mem,etime --sort=-%cpu | head -n 20

# Hot threads of the hottest process
TOPPID=$(ps -eo pid,comm,%cpu --sort=-%cpu | awk 'NR==2{print $1}')
top -H -p "$TOPPID"

2) Apache / PHP‑FPM focus

# See running Apache & PHP-FPM processes
pgrep -a apache2
pgrep -a php-fpm || pgrep -a php

# Apache MPM
apachectl -V | grep -i mpm

# mod_status (if enabled)
curl -s 'http://127.0.0.1/server-status?auto' | sed -n '1,120p'

# Hot URLs & IPs (last 500 lines)
tail -n 500 /var/log/apache2/your-vhost-access.log \
 | awk '{print $7}' | sort | uniq -c | sort -nr | head

tail -n 500 /var/log/apache2/your-vhost-access.log \
 | awk '{print $1}' | sort | uniq -c | sort -nr | head

# Hot user agents (quick UA census)
awk -F\" '{print $6}' /var/log/apache2/your-vhost-access.log | sort | uniq -c | sort -nr | head

Common hotspots: /wp-admin/admin-ajax.php, /wp-cron.php, /xmlrpc.php, feeds.
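
Counting URLs and IPs separately can hide a single client hammering a single endpoint. A minimal sketch that counts (IP, path) pairs with the query string stripped; top_ip_path is a hypothetical helper name, and it assumes the combined log format where field 1 is the client and field 7 is the request path.

```shell
# top_ip_path [N]
# Reads access-log lines on stdin and prints the N most frequent
# (client IP, path) pairs. Query strings are stripped so that
# /xmlrpc.php?a=1 and /xmlrpc.php?a=2 count as the same endpoint.
top_ip_path() {
  awk '{ path = $7; sub(/\?.*/, "", path); print $1, path }' \
    | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Possible usage:
# tail -n 5000 /var/log/apache2/your-vhost-access.log | top_ip_path 20
```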

Quick temporary mitigation (pick one; replace x.x.x.x):

# iptables
sudo iptables -I INPUT -s x.x.x.x -j DROP

# ufw
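sudo ufw insert 1 deny from x.x.x.x   # insert first: ufw stops at the first matching rule, so an appended deny can lose to an earlier allow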

3) MariaDB/MySQL focus

# Active queries
mysql -e "SHOW FULL PROCESSLIST\G"

# Slow query log status
mysql -e "SHOW VARIABLES LIKE 'slow_query_log%';"

# InnoDB status (locks)
mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,150p'

Signals:

  • Long-lived states such as Copying to tmp table, Creating sort index / Sorting result (a filesort), or Waiting for table metadata lock → missing indexes / bad plan / lock contention.
  • Kill by Id if necessary: mysql -e "KILL <Id>;"
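
To see only the threads worth killing, the process list can be filtered by age. A minimal sketch, assuming information_schema.PROCESSLIST is available (it is in MariaDB and in MySQL, although MySQL 8 marks it deprecated); long_query_sql is a hypothetical helper that only builds the SQL text.

```shell
# long_query_sql [SECONDS]
# Prints a query listing non-sleeping threads that have been in their
# current state for at least SECONDS (default 30), longest first.
long_query_sql() {
  min_seconds=${1:-30}
  case $min_seconds in
    ''|*[!0-9]*) echo "usage: long_query_sql SECONDS" >&2; return 1 ;;
  esac
  cat <<EOF
SELECT ID, USER, DB, TIME, STATE, LEFT(INFO, 120) AS QUERY
FROM information_schema.PROCESSLIST
WHERE COMMAND <> 'Sleep' AND TIME >= $min_seconds
ORDER BY TIME DESC;
EOF
}

# Possible usage:
# mysql -e "$(long_query_sql 30)"
```

Replication and other system threads can show large TIME values too, so check USER and COMMAND before killing anything.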

4) Varnish / traffic surge

# Hit/miss & connections
varnishstat -1 | grep -E 'MAIN.cache_(hit|miss)|sess_conn|client_req'

# Sample live requests (timeout ends the capture after 10s)
sudo timeout 10s varnishncsa | head

If client_req surges and misses increase → backend (PHP/MySQL) pressure.
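
A single hit-ratio number makes the "misses increase" signal easier to compare between runs. A minimal sketch that parses varnishstat -1 output on stdin; varnish_hit_ratio is a hypothetical helper name. The counters are cumulative since varnishd started, so the result is a lifetime average rather than a reading of the current spike.

```shell
# varnish_hit_ratio
# Reads `varnishstat -1` output on stdin and prints the cache hit ratio
# as a percentage of hits over hits plus misses.
varnish_hit_ratio() {
  LC_ALL=C awk '
    $1 == "MAIN.cache_hit"  { hit = $2 }
    $1 == "MAIN.cache_miss" { miss = $2 }
    END {
      if (hit + miss == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hit / (hit + miss)
    }'
}

# Possible usage:
# varnishstat -1 | varnish_hit_ratio
```

For a reading of the spike itself, take two samples a minute apart and compare the differences of the two counters.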


5) Network / DDoS quick checks

# Active web connections count (-H drops the header so it is not counted)
ss -Htn state established '( sport = :80 or sport = :443 )' | wc -l

# Top TCP states on 80/443 (sport is the local port for inbound traffic; state is column 1)
ss -Hnta '( sport = :80 or sport = :443 )' | awk '{print $1}' | sort | uniq -c | sort -nr | head
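
A connection count alone does not show whether a few clients hold most of the sockets. A minimal sketch that ranks peers by established connections; top_peers is a hypothetical helper name, and it assumes ss output produced with -H and a state filter, where the peer address is the fourth column.

```shell
# top_peers [N]
# Reads `ss -Hnt state established ...` output on stdin and prints the
# N peers holding the most connections. The trailing :port is stripped,
# which also works for bracketed IPv6 addresses.
top_peers() {
  awk '{ peer = $4; sub(/:[0-9]+$/, "", peer); print peer }' \
    | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Possible usage:
# ss -Hnt state established '( sport = :80 or sport = :443 )' | top_peers 20
```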

6) Disk I/O bottlenecks

# Extended iostat
iostat -xz 1 3      # high %util / r_await / w_await = disk bound (the first report is the average since boot)

# Per-process I/O (accumulated)
sudo iotop -aoP     # Ctrl+C to stop
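
The iostat column layout changes between sysstat versions, so picking %util by position is fragile. A minimal sketch that locates the column from the header instead; busy_disks and the default limit of 80 are illustrative assumptions.

```shell
# busy_disks [LIMIT]
# Reads `iostat -x` output on stdin and prints devices whose %util is at
# or above LIMIT (default 80). The %util column is found from the
# "Device" header line rather than from a fixed position.
busy_disks() {
  LC_ALL=C awk -v limit="${1:-80}" '
    /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    col && NF >= col && $col + 0 >= limit { print $1, $col }'
}

# Possible usage (-y skips the since-boot report):
# LC_ALL=C iostat -xzy 1 3 | busy_disks 80
```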

7) Deep dive on a hot PID

# What files/sockets are open?
sudo lsof -p <PID> | head

# What syscalls is it doing right now?
sudo strace -p <PID> -f -tt -s 200 -o /tmp/strace.<PID>.log
# Let it run ~10s (strace slows the traced process noticeably), Ctrl+C, then:
tail -n 40 /tmp/strace.<PID>.log

Clues: endless futex() → lock contention; heavy read()/openat() on large files → file I/O bound; repeated connect()/recvfrom() → network waits.
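
The last 40 lines of a trace can mislead, so a census of syscall names over the whole log helps confirm which clue applies. A minimal sketch, assuming the log format produced by the strace flags above (PID, timestamp, then the call); syscall_census is a hypothetical helper name.

```shell
# syscall_census FILE...
# Counts syscall names in an strace log and prints the ten most frequent.
# A syscall is taken to be the first field of the form name( on a line;
# "<... resumed>" continuation lines are skipped.
syscall_census() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[a-z_0-9]+\(/) { sub(/\(.*/, "", $i); print $i; break }
  }' "$@" | sort | uniq -c | sort -rn | head -n 10
}

# Possible usage:
# syscall_census /tmp/strace.<PID>.log
```

strace -c -p <PID> produces a similar summary directly, including time spent per syscall.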


8) Persistent telemetry (set once)

sudo apt-get update && sudo apt-get install -y sysstat atop iotop   # iostat, mpstat and sar ship in sysstat

# Enable sysstat (sar)
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat

# Enable atop logging to /var/log/atop/
sudo systemctl enable --now atop
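# Optional: log every 60s (the variable is LOGINTERVAL in current Debian packages and INTERVAL in older ones; check /etc/default/atop first)
sudo sed -i -E 's/^(LOG)?INTERVAL=.*/\1INTERVAL=60/' /etc/default/atop && sudo systemctl restart atop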

Review later:

sar -u -f /var/log/sysstat/sa$(date +%d)    # Debian stores sar data under /var/log/sysstat/
atopsar -c -b 15:00 -e 16:00    # CPU by time window

9) Status endpoints (local‑only)

Apache mod_status

sudo a2enmod status
cat <<'EOF' | sudo tee /etc/apache2/conf-available/status-local.conf
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
EOF
sudo a2enconf status-local && sudo systemctl reload apache2

PHP‑FPM pool status

In /etc/php/*/fpm/pool.d/www.conf:

pm.status_path = /fpm_status
ping.path = /ping

Restrict in the Apache vhost:

<LocationMatch "^/(fpm_status|ping)$">
    Require local
</LocationMatch>

Then:

curl -s 'http://127.0.0.1/server-status?auto' | sed -n '1,120p'
curl -s 'http://127.0.0.1/fpm_status?full' | sed -n '1,120p'
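
The machine-readable status output is easy to reduce to one worker-saturation figure. A minimal sketch that reads the BusyWorkers and IdleWorkers lines from server-status?auto; apache_busy_pct is a hypothetical helper name. The percentage is relative to workers currently spawned, not to the configured MaxRequestWorkers ceiling.

```shell
# apache_busy_pct
# Reads `server-status?auto` output on stdin and prints
# busy/total workers with the busy share as a percentage.
apache_busy_pct() {
  LC_ALL=C awk -F': ' '
    $1 == "BusyWorkers" { busy = $2 }
    $1 == "IdleWorkers" { idle = $2 }
    END {
      if (busy + idle == 0) { print "n/a"; exit }
      printf "%d/%d busy (%.0f%%)\n", busy, busy + idle, 100 * busy / (busy + idle)
    }'
}

# Possible usage:
# curl -s 'http://127.0.0.1/server-status?auto' | apache_busy_pct
```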

10) Enable slow query logging (persistent)

# Turn on now (non-persistent)
sudo mysql -e "SET GLOBAL slow_query_log=1; SET GLOBAL long_query_time=1;"

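# Persist settings in a drop-in file (the file name is arbitrary; re-running overwrites it instead of appending duplicates)
cat <<'EOF' | sudo tee /etc/mysql/mariadb.conf.d/99-slow-query.cnf
[mysqld]
slow_query_log = 1
long_query_time = 1
slow_query_log_file = /var/log/mysql/slow.log
EOF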

sudo systemctl restart mariadb

# Optional analysis tool
sudo apt-get install -y percona-toolkit
pt-query-digest /var/log/mysql/slow.log | less

11) Log analysis helpers

GoAccess (interactive + HTML report)

sudo apt-get install -y goaccess
# Interactive (auto-prompt for format)
goaccess /var/log/apache2/your-vhost-access.log -c

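# Current log plus rotated files (zcat -f passes uncompressed files through); output HTML.
# Note: anything under /var/www/html is publicly served unless you restrict it.
zcat -f /var/log/apache2/your-vhost-access.log* \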
 | goaccess -a --date-format='%d/%b/%Y' --time-format='%T' \
   --log-format=COMBINED -o /var/www/html/goa-report.html -

apachetop (live “top” for Apache)

sudo apt-get install -y apachetop
apachetop -f /var/log/apache2/your-vhost-access.log

lnav (smart log navigator)

sudo apt-get install -y lnav
lnav /var/log/apache2/your-vhost-access.log
# Inside lnav:
# :filter-in wp-login.php
# :filter-in admin-ajax.php

12) Spike capture bundle (one‑shot evidence)

Create /usr/local/bin/spike-capture.sh and make it executable.

sudo tee /usr/local/bin/spike-capture.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -u
TS=$(date +"%Y%m%d-%H%M%S")
OUT="/root/spike-$TS"
LOG="/var/log/apache2/your-vhost-access.log"   # <-- set this to the hottest vhost
{
  echo "=== uptime"; uptime
  echo; echo "=== mpstat"; mpstat -P ALL 1 3
  echo; echo "=== vmstat"; vmstat 1 3
  echo; echo "=== top"; top -b -n1 | head -n 40
  echo; echo "=== ps top cpu"; ps -eo pid,ppid,comm,%cpu,%mem,etime --sort=-%cpu | head -n 30
  echo; echo "=== net conns:80/443 (established)"; ss -Htn state established '( sport = :80 or sport = :443 )' | wc -l
  echo; echo "=== top URLs (last 5k lines)"; tail -n 5000 "$LOG" | awk '{print $7}' | sort | uniq -c | sort -nr | head -n 20
  echo; echo "=== top IPs (last 5k lines)"; tail -n 5000 "$LOG" | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 20
  echo; echo "=== php-fpm pools"; pgrep -a php-fpm || true
  echo; echo "=== mysql processlist"; mysql -e "SHOW FULL PROCESSLIST\\G" 2>/dev/null | head -n 200 || true
} > "$OUT.txt"
echo "Saved to $OUT.txt"
EOF

sudo chmod +x /usr/local/bin/spike-capture.sh

Usage:

sudo spike-capture.sh
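
To make the capture a real tripwire, it can be triggered automatically when load crosses a threshold. A minimal sketch: load_exceeds is a hypothetical helper, and using the CPU count as the threshold is an assumption to adjust for your workload.

```shell
# load_exceeds THRESHOLD LOAD
# Exit status 0 when LOAD is strictly above THRESHOLD, 1 otherwise.
# awk does the comparison because load averages are decimals and
# the shell's test builtin only compares integers.
load_exceeds() {
  awk -v t="$1" -v l="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# Possible wiring, e.g. in a script run from cron every minute:
# read -r load1 _ < /proc/loadavg
# if load_exceeds "$(nproc)" "$load1"; then
#   /usr/local/bin/spike-capture.sh
# fi
```

Each run of spike-capture.sh writes a new timestamped file, so a sustained spike produces one capture per cron run until load drops.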

13) Guardrails to prevent repeats

  • Logrotate: ensure /etc/logrotate.d/apache2 uses daily, maxsize 50M, compress (a plain size directive would override the daily schedule).
  • Rate limiting:
    • Simple: libapache2-mod-evasive (burst blocking).
    • Flexible: fail2ban jails for wp-login.php, xmlrpc.php, abusive UAs.
  • Caching: Verify Varnish pass rules for admin/login and cache rules for static assets; minimize backend hits.
  • PHP‑FPM tuning: ensure pm.max_children fits RAM & workload (avoid CPU thrashing and OOM kills).
  • MySQL tuning: adequate InnoDB buffer pool, tmp table size; add missing indexes from slow log insights.
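
For the PHP‑FPM item, a common rule of thumb is to divide the RAM you are willing to give PHP by the average worker size. A minimal sketch of that arithmetic; fpm_max_children is a hypothetical helper, the memory budget is your own assumption, and RSS overcounts memory shared between workers, so treat the result as a conservative starting point.

```shell
# fpm_max_children BUDGET_MB
# Reads one RSS value in kB per line on stdin (as printed by `ps -o rss=`)
# and prints the average worker size plus BUDGET_MB divided by it.
fpm_max_children() {
  LC_ALL=C awk -v budget_mb="$1" '
    $1 + 0 > 0 { sum += $1; n++ }
    END {
      if (n == 0) { print "no workers found"; exit 1 }
      avg_mb = sum / n / 1024
      printf "avg worker RSS: %.0f MB, suggested pm.max_children <= %d\n",
             avg_mb, int(budget_mb / avg_mb)
    }'
}

# Possible usage (the process name depends on the installed PHP version):
# ps -o rss= -C php-fpm8.2 | fpm_max_children 2048
```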

14) Quick references

  • High us% + PHP/Apache → hot endpoints or plugin/theme path; add cache/rate limits.
  • High us% + mysqld → expensive queries; capture with slow log and index.
  • High wa% → disk bound; tune DB, reduce log churn, consider faster storage.
  • High sy% with ksoftirqd → NIC/driver/interrupt pressure; ensure irqbalance is running.
  • High st% → host contention; escalate to VPS provider or resize.

Tip: Keep this playbook on the server (e.g., /root/monitoring-playbook.md) and keep spike-capture.sh ready. When a spike happens, run the script, then consult sections 2–6 based on what tops the charts.