Debian System Monitoring Playbook

A reusable, copy‑paste friendly checklist for diagnosing and stabilising high CPU/load on a Debian LAMP/LEMP stack with Apache/PHP‑FPM, Varnish, and MariaDB/MySQL.

0) Goals

Identify what is burning CPU (service → process → thread).
Determine why (CPU vs I/O wait vs virtualization steal vs network flood).
Capture enough evidence to fix the root cause (queries, endpoints, IPs).
Put tripwires in place so the next spike is easy to explain.

1) Fast triage (run in order)

# Overview
uptime

# Top snapshot (batch mode)
top -b -n1 | head -n 25

# Per-CPU breakdown (requires sysstat)
mpstat -P ALL 1 5

# Run queue / context switches / iowait
vmstat 1 5

Interpretation:

us% high → application (PHP, mysqld, etc.) is busy.
sy% high → kernel/interrupts (network/disk overhead, drivers).
wa% high → disk I/O bottleneck.
st% high → virtualization host contention (noisy neighbour).
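
The mapping above can be sketched as a small helper for scripts or alerts. The function name classify_cpu and the 10/20/30/50 thresholds are illustrative assumptions, not tuned values; inputs are the integer us, sy, wa and st percentages as printed by vmstat or top.

```shell
#!/usr/bin/env bash
# Classify one CPU sample. Arguments: us sy wa st (integer percentages).
# Thresholds are assumptions for illustration; adjust them to your baseline.
classify_cpu() {
  local us=$1 sy=$2 wa=$3 st=$4
  if   [ "$st" -ge 10 ]; then echo "steal: host contention"
  elif [ "$wa" -ge 20 ]; then echo "iowait: disk bound"
  elif [ "$sy" -ge 30 ]; then echo "system: kernel/interrupt overhead"
  elif [ "$us" -ge 50 ]; then echo "user: application busy"
  else echo "no dominant category"
  fi
}
```

Example: classify_cpu 85 5 2 0 reports the user category. Steal and iowait are checked first because they usually mean the application is waiting on something outside it.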

Offenders: service → process → thread

# By cgroup (which service)
systemd-cgtop --iterations=3 --depth=2

# Top processes
ps -eo pid,ppid,comm,%cpu,%mem,etime --sort=-%cpu | head -n 20

# Hot threads of the hottest process
TOPPID=$(ps -eo pid,comm,%cpu --sort=-%cpu | awk 'NR==2{print $1}')
top -H -p "$TOPPID"
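
ps computes %cpu as CPU time divided by the process's elapsed run time, so a long-lived process that only just started spinning may not rank first. A possible alternative, sketched below, picks the PID from the last frame of top's batch output instead. The helper name hottest_pid is an assumption, and it relies on top's default sort by %CPU.

```shell
#!/usr/bin/env bash
# Read `top -b` output on stdin and print the PID of the first task listed
# in the LAST frame (top sorts by %CPU by default).
hottest_pid() {
  awk '$1 == "PID" { want = 1; next }          # task header row of a frame
       want        { pid = $1; want = 0 }      # first task after the header
       END         { if (pid != "") print pid }'
}
```

Usage: TOPPID=$(top -b -n 2 -d 1 | hottest_pid), then ps -L -o tid,pcpu,comm -p "$TOPPID" for a non-interactive thread listing.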

2) Apache / PHP‑FPM focus

# See running Apache & PHP-FPM processes
pgrep -a apache2
pgrep -a php-fpm || pgrep -a php

# Apache MPM
apachectl -V | grep -i mpm

# mod_status (if enabled)
curl -s 'http://127.0.0.1/server-status?auto' | sed -n '1,120p'

# Hot URLs & IPs (last 500 lines)
tail -n 500 /var/log/apache2/your-vhost-access.log \
 | awk '{print $7}' | sort | uniq -c | sort -nr | head

tail -n 500 /var/log/apache2/your-vhost-access.log \
 | awk '{print $1}' | sort | uniq -c | sort -nr | head

# Hot user agents (quick UA census)
awk -F\" '{print $6}' /var/log/apache2/your-vhost-access.log | sort | uniq -c | sort -nr | head

Common hotspots: /wp-admin/admin-ajax.php, /wp-cron.php, /xmlrpc.php, feeds.
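
Query strings can split one hot endpoint (for example admin-ajax.php with different action values) across many rows. A minimal sketch that groups by path only, assuming the combined log format where the request path is field 7; the name top_paths is hypothetical.

```shell
#!/usr/bin/env bash
# Count requests per path from an access log on stdin, ignoring query strings.
# Optional argument: number of rows to print (default 10).
top_paths() {
  awk '{ path = $7; sub(/\?.*/, "", path); count[path]++ }
       END { for (p in count) print count[p], p }' \
    | sort -k1,1nr -k2 | head -n "${1:-10}"
}
```

Usage: tail -n 5000 /var/log/apache2/your-vhost-access.log | top_paths 20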

Quick temporary mitigation (pick one; replace x.x.x.x):

# iptables
sudo iptables -I INPUT -s x.x.x.x -j DROP

# ufw
sudo ufw deny from x.x.x.x

3) MariaDB/MySQL focus

# Active queries
mysql -e "SHOW FULL PROCESSLIST\G"

# Slow query log status
mysql -e "SHOW VARIABLES LIKE 'slow_query_log%';"

# InnoDB status (locks)
mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,150p'

Signals:

Long-lived states such as Copying to tmp table, Creating sort index (a filesort), or Waiting for table metadata lock → missing indexes / bad plan / lock contention.
Kill by Id if necessary: mysql -e "KILL <Id>;"
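
To find kill candidates without reading the whole process list, something like the following filter may help. It assumes the tab-separated batch output of SHOW FULL PROCESSLIST (columns Id, User, Host, db, Command, Time, State, Info); the name long_queries and the 10-second default are assumptions.

```shell
#!/usr/bin/env bash
# Read tab-separated SHOW FULL PROCESSLIST output on stdin and print
# Id, Time, State and Info for non-sleeping threads at or above a threshold.
# Argument: minimum seconds (default 10). Other non-Sleep commands,
# such as replication threads, pass through and need a manual look.
long_queries() {
  local min_seconds=${1:-10}
  awk -F'\t' -v min="$min_seconds" '
    NR == 1 { next }                                   # skip header row
    $5 != "Sleep" && $6 + 0 >= min {
      printf "%s\t%ss\t%s\t%s\n", $1, $6, $7, $8
    }'
}
```

Usage: mysql -B -e "SHOW FULL PROCESSLIST" | long_queries 5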

4) Varnish / traffic surge

# Hit/miss & connections
varnishstat -1 | grep -E 'MAIN\.(cache_(hit|miss)|sess_conn|client_req)'

# Sample live requests (stops after 10 lines or 10 seconds, whichever comes first)
sudo timeout 10s varnishncsa | head

If client_req surges and misses increase → backend (PHP/MySQL) pressure.
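
A rough hit ratio can be derived from the same counters. This sketch assumes the varnishstat -1 layout (counter name in column 1, value in column 2) and ignores passes; the helper name hit_ratio is hypothetical. The counters are cumulative since varnishd started, so compare two samples taken a minute apart to see the current trend.

```shell
#!/usr/bin/env bash
# Read `varnishstat -1` output on stdin and print hits / (hits + misses)
# as a percentage with one decimal, or "n/a" if both counters are zero.
hit_ratio() {
  awk '$1 == "MAIN.cache_hit"  { hit  = $2 }
       $1 == "MAIN.cache_miss" { miss = $2 }
       END { total = hit + miss
             if (total > 0) { printf "%.1f\n", 100 * hit / total } else { print "n/a" } }'
}
```

Usage: varnishstat -1 | hit_ratio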

5) Network / DDoS quick checks

# Active web connections count (-H drops the header so it is not counted)
ss -Htn state established '( sport = :80 or sport = :443 )' | wc -l

# Top TCP states on 80/443 (sport = this server's listening port; state is column 1)
ss -Hnta '( sport = :80 or sport = :443 )' | awk '{print $1}' | sort | uniq -c | sort -nr | head
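
To see whether a few clients hold most of the connections, the peer addresses can be tallied. A minimal sketch, assuming ss is run without -p so the peer address:port is the last column; the name top_peers is hypothetical.

```shell
#!/usr/bin/env bash
# Read `ss -Htn ...` output on stdin and count connections per remote address.
# Optional argument: number of rows to print (default 10).
top_peers() {
  awk 'NF { peer = $NF; sub(/:[0-9]+$/, "", peer); count[peer]++ }   # strip :port
       END { for (p in count) print count[p], p }' \
    | sort -k1,1nr -k2 | head -n "${1:-10}"
}
```

Usage: ss -Htn state established '( sport = :80 or sport = :443 )' | top_peers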

6) Disk I/O bottlenecks

# Extended iostat
iostat -xz 1 3      # high %util / await = disk bound

# Per-process I/O (accumulated)
sudo iotop -aoP     # Ctrl+C to stop

7) Deep dive on a hot PID

# What files/sockets are open?
sudo lsof -p <PID> | head

# What syscalls is it doing right now?
sudo strace -p <PID> -f -tt -s 200 -o /tmp/strace.<PID>.log
# Let it run ~10s, Ctrl+C, then:
tail -n 40 /tmp/strace.<PID>.log

Clues: endless futex() → lock contention; heavy read()/open() on large files; repeated connect()/recvfrom() → network waits.
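
Rather than eyeballing the tail of the log, a per-syscall count makes the dominant call obvious. This sketch assumes the log layout produced by -f -tt with -o (PID, timestamp, then the call); resumed calls, signals and exit lines are skipped. The name syscall_counts is hypothetical.

```shell
#!/usr/bin/env bash
# Read an strace log (written with -f -tt -o) on stdin and count calls
# per syscall name. Optional argument: number of rows to print (default 10).
syscall_counts() {
  awk '$3 ~ /^[a-z_0-9]+\(/ { name = $3; sub(/\(.*/, "", name); count[name]++ }
       END { for (s in count) print count[s], s }' \
    | sort -k1,1nr -k2 | head -n "${1:-10}"
}
```

Usage: syscall_counts < /tmp/strace.<PID>.log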

8) Persistent telemetry (set once)

sudo apt-get update && sudo apt-get install -y sysstat atop iotop   # iostat, mpstat and sar ship in sysstat

# Enable sysstat (sar)
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat

# Enable atop logging to /var/log/atop/
sudo systemctl enable --now atop
# Optional: log every 60s (the variable is LOGINTERVAL in recent atop packages, INTERVAL in older ones; check the file first)
sudo sed -i -E 's/^(LOG)?INTERVAL=.*/\1INTERVAL=60/' /etc/default/atop && sudo systemctl restart atop

Review later:

sar -u -f /var/log/sysstat/sa$(date +%d)   # Debian keeps sar data in /var/log/sysstat
atopsar -c -b 15:00 -e 16:00               # CPU by time window

9) Status endpoints (local‑only)

Apache mod_status

sudo a2enmod status
cat <<'EOF' | sudo tee /etc/apache2/conf-available/status-local.conf
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
EOF
sudo a2enconf status-local && sudo systemctl reload apache2

PHP‑FPM pool status

In /etc/php/*/fpm/pool.d/www.conf:

pm.status_path = /fpm_status
ping.path = /ping

Restrict in the Apache vhost:

<LocationMatch "^/(fpm_status|ping)$">
    Require local
</LocationMatch>

Then:

curl -s 'http://127.0.0.1/server-status?auto' | sed -n '1,120p'
curl -s 'http://127.0.0.1/fpm_status?full' | sed -n '1,120p'
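
For a one-line saturation check, the BusyWorkers and IdleWorkers fields of the machine-readable Apache status can be summarised. A minimal sketch; the name worker_usage is hypothetical, and the total counts only currently spawned workers, not the configured maximum.

```shell
#!/usr/bin/env bash
# Read `server-status?auto` output on stdin and print busy/total workers.
worker_usage() {
  awk -F': ' '$1 == "BusyWorkers" { busy = $2 }
              $1 == "IdleWorkers" { idle = $2 }
              END { total = busy + idle
                    if (total > 0) { printf "%d/%d busy (%d%%)\n", busy, total, 100 * busy / total }
                    else { print "no worker data" } }'
}
```

Usage: curl -s 'http://127.0.0.1/server-status?auto' | worker_usage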

10) Enable slow query logging (persistent)

# Turn on now (non-persistent)
sudo mysql -e "SET GLOBAL slow_query_log=1; SET GLOBAL long_query_time=1;"

# Persist settings (MariaDB path shown; run once, because this appends to the file)
sudo bash -lc 'cat >>/etc/mysql/mariadb.conf.d/50-server.cnf <<EOF
[mysqld]
slow_query_log = 1
long_query_time = 1
slow_query_log_file = /var/log/mysql/slow.log
EOF'

sudo systemctl restart mariadb

# Optional analysis tool
sudo apt-get install -y percona-toolkit
pt-query-digest /var/log/mysql/slow.log | less

11) Log analysis helpers

GoAccess (interactive + HTML report)

sudo apt-get install -y goaccess
# Interactive (auto-prompt for format)
goaccess /var/log/apache2/your-vhost-access.log -c

# Include rotated .gz and output HTML
zcat /var/log/apache2/your-vhost-access.log.*.gz \
 | goaccess -a --date-format='%d/%b/%Y' --time-format='%T' \
   --log-format=COMBINED -o /var/www/html/goa-report.html -

apachetop (live “top” for Apache)

sudo apt-get install -y apachetop
apachetop -f /var/log/apache2/your-vhost-access.log

lnav (smart log navigator)

sudo apt-get install -y lnav
lnav /var/log/apache2/your-vhost-access.log
# Inside lnav:
# :filter-in wp-login.php
# :filter-in admin-ajax.php

12) Spike capture bundle (one‑shot evidence)

Create /usr/local/bin/spike-capture.sh and make it executable.

sudo tee /usr/local/bin/spike-capture.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -u
TS=$(date +"%Y%m%d-%H%M%S")
OUT="/root/spike-$TS"
LOG="/var/log/apache2/your-vhost-access.log"   # <-- set this to the hottest vhost
mkdir -p "$(dirname "$OUT")"   # the report is written to $OUT.txt, so only the parent directory is needed
{
  echo "=== uptime"; uptime
  echo; echo "=== mpstat"; mpstat -P ALL 1 3
  echo; echo "=== vmstat"; vmstat 1 3
  echo; echo "=== top"; top -b -n1 | head -n 40
  echo; echo "=== ps top cpu"; ps -eo pid,ppid,comm,%cpu,%mem,etime --sort=-%cpu | head -n 30
  echo; echo "=== net conns:80/443 (established)"; ss -Htn state established '( sport = :80 or sport = :443 )' | wc -l
  echo; echo "=== top URLs (last 5k lines)"; tail -n 5000 "$LOG" | awk '{print $7}' | sort | uniq -c | sort -nr | head -n 20
  echo; echo "=== top IPs (last 5k lines)"; tail -n 5000 "$LOG" | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 20
  echo; echo "=== php-fpm pools"; pgrep -a php-fpm || true
  echo; echo "=== mysql processlist"; mysql -e "SHOW FULL PROCESSLIST\\G" 2>/dev/null | head -n 200 || true
} > "$OUT.txt"
echo "Saved to $OUT.txt"
EOF

sudo chmod +x /usr/local/bin/spike-capture.sh

Usage:

sudo spike-capture.sh
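
Spikes rarely wait for someone to be logged in, so the capture can be gated on load and run from cron. A minimal sketch of the gate; the name load_exceeds and the default factor of 1.5 times the core count are assumptions.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when the load average exceeds cores * factor.
# Arguments: load cores [factor]; factor defaults to 1.5 (an assumption).
load_exceeds() {
  local load=$1 cores=$2 factor=${3:-1.5}
  # awk handles the floating-point comparison that [ ] cannot.
  awk -v l="$load" -v c="$cores" -v f="$factor" 'BEGIN { exit !(l > c * f) }'
}
```

Usage: load_exceeds "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && /usr/local/bin/spike-capture.sh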

13) Guardrails to prevent repeats

Logrotate: ensure /etc/logrotate.d/apache2 uses daily, maxsize 50M, compress (a plain size directive would override the daily schedule).
Rate limiting:
- Simple: libapache2-mod-evasive (burst blocking).
- Flexible: fail2ban jails for wp-login.php, xmlrpc.php, abusive UAs.
Caching: Verify Varnish pass rules for admin/login and cache rules for static assets; minimize backend hits.
PHP‑FPM tuning: ensure pm.max_children fits RAM & workload (avoid CPU thrash and OOM kills).
MySQL tuning: adequate InnoDB buffer pool, tmp table size; add missing indexes from slow log insights.
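
A common starting point for pm.max_children is the memory you can spare for PHP divided by the average worker size. The sketch below shows only that arithmetic; the name fpm_max_children is hypothetical, and both inputs (in MB) are numbers you have to measure yourself.

```shell
#!/usr/bin/env bash
# Estimate pm.max_children as memory budget / average worker RSS (both in MB).
# This is a first guess to refine under real load, not a guarantee.
fpm_max_children() {
  local budget_mb=$1 avg_worker_mb=$2
  [ "$avg_worker_mb" -gt 0 ] || { echo "avg_worker_mb must be > 0" >&2; return 1; }
  echo $(( budget_mb / avg_worker_mb ))
}
```

Example: fpm_max_children 4096 64 suggests 64 workers. Leave headroom for MariaDB, Varnish and the page cache when choosing the budget.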

14) Quick references

High us% + PHP/Apache → hot endpoints or plugin/theme path; add cache/rate limits.
High us% + mysqld → expensive queries; capture with slow log and index.
High wa% → disk bound; tune DB, reduce log churn, consider faster storage.
High sy% with ksoftirqd → NIC/driver/interrupt pressure; ensure irqbalance is running.
High st% → host contention; escalate to VPS provider or resize.

Tip: Keep this playbook on the server (e.g., /root/monitoring-playbook.md) and keep spike-capture.sh ready. When a spike happens, run the script, then consult sections 2–6 based on what tops the charts.
