Debian System Monitoring Playbook
A reusable, copy‑paste friendly checklist for diagnosing and stabilising high CPU/load on a Debian LAMP/LEMP stack with Apache/PHP‑FPM, Varnish, and MariaDB/MySQL.
0) Goals
- Identify what is burning CPU (service → process → thread).
- Determine why (CPU vs I/O wait vs virtualization steal vs network flood).
- Capture enough evidence to fix the root cause (queries, endpoints, IPs).
- Put tripwires in place so the next spike is easy to explain.
1) Fast triage (run in order)
# Overview
uptime
# Top snapshot (batch mode)
top -b -n1 | head -n 25
# Per-CPU breakdown (requires sysstat)
mpstat -P ALL 1 5
# Run queue / context switches / iowait
vmstat 1 5
Interpretation:
- us% high → application (PHP, mysqld, etc.) is busy.
- sy% high → kernel/interrupts (network/disk overhead, drivers).
- wa% high → disk I/O bottleneck.
- st% high → virtualization host contention (noisy neighbour).
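When sysstat is not installed yet, the same breakdown can be approximated from two samples of the aggregate cpu line in /proc/stat. The sketch below is illustrative only: the function name cpu_delta is made up for this example, and folding nice into us and irq/softirq into sy is a simplifying assumption.

```shell
# cpu_delta LINE1 LINE2
# Takes two samples of the aggregate "cpu" line from /proc/stat and prints
# the share of time spent in each state between them.
# Field order: cpu user nice system idle iowait irq softirq steal ...
cpu_delta() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i; next }
    {
      total = 0
      for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
      if (total <= 0) { print "no CPU time elapsed between samples"; exit 1 }
      # Assumption: nice is folded into us, irq and softirq into sy.
      printf "us=%d sy=%d wa=%d st=%d\n",
        (d[2] + d[3]) * 100 / total,
        (d[4] + d[7] + d[8]) * 100 / total,
        d[6] * 100 / total,
        d[9] * 100 / total
    }'
}

# Usage on a live system (one-second window):
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   cpu_delta "$a" "$b"
```

The percentages are truncated to whole numbers, which is enough to tell which of the four cases above applies.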
Offenders: service → process → thread
# By cgroup (which service)
systemd-cgtop --iterations=3 --depth=2
# Top processes
ps -eo pid,ppid,comm,%cpu,%mem,etime --sort=-%cpu | head -n 20
# Hot threads of the hottest process
TOPPID=$(ps -eo pid,comm,%cpu --sort=-%cpu | awk 'NR==2{print $1}')
top -H -p "$TOPPID"
2) Apache / PHP‑FPM focus
# See running Apache & PHP-FPM processes
pgrep -a apache2
pgrep -a php-fpm || pgrep -a php
# Apache MPM
apachectl -V | grep -i mpm
# mod_status (if enabled)
curl -s 'http://127.0.0.1/server-status?auto' | sed -n '1,120p'
# Hot URLs & IPs (last 500 lines)
tail -n 500 /var/log/apache2/your-vhost-access.log \
| awk '{print $7}' | sort | uniq -c | sort -nr | head
tail -n 500 /var/log/apache2/your-vhost-access.log \
| awk '{print $1}' | sort | uniq -c | sort -nr | head
# Hot user agents (quick UA census)
awk -F\" '{print $6}' /var/log/apache2/your-vhost-access.log | sort | uniq -c | sort -nr | head
Common hotspots: /wp-admin/admin-ajax.php, /wp-cron.php, /xmlrpc.php, feeds.
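Once a hot endpoint is known, the next question is usually which clients are hitting it. A minimal sketch, assuming the combined log format (client IP in field 1, request path in field 7); the helper name ips_for_path is hypothetical.

```shell
# ips_for_path SUBSTRING
# Reads combined-format access log lines on stdin and prints "count ip"
# for requests whose path (field 7) contains SUBSTRING, busiest client first.
ips_for_path() {
  awk -v needle="$1" 'index($7, needle) { hits[$1]++ }
       END { for (ip in hits) print hits[ip], ip }' |
    sort -k1,1nr -k2,2
}

# Usage:
#   tail -n 5000 /var/log/apache2/your-vhost-access.log | ips_for_path /xmlrpc.php
```

The match is a plain substring test rather than a regex, so dots and slashes in the path need no escaping.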
Quick temporary mitigation (pick one; replace x.x.x.x):
# iptables
sudo iptables -I INPUT -s x.x.x.x -j DROP
# ufw
sudo ufw deny from x.x.x.x
3) MariaDB/MySQL focus
# Active queries
mysql -e "SHOW FULL PROCESSLIST\G"
# Slow query log status
mysql -e "SHOW VARIABLES LIKE 'slow_query_log%';"
# InnoDB status (locks)
mysql -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,150p'
Signals:
- Long states like Copying to tmp table, Filesort, Waiting for table metadata lock → missing indexes / bad plan / lock contention.
- Kill by Id if necessary:
mysql -e "KILL <Id>;"
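Before killing anything, it helps to narrow the process list to threads that have actually been running for a while. A rough sketch, assuming the standard SHOW PROCESSLIST column order (Id, User, Host, db, Command, Time, State, Info); the helper name long_queries and the 10-second cut-off in the usage line are illustrative choices.

```shell
# long_queries SECONDS
# Reads tab-separated processlist rows on stdin and prints "Id Time State"
# for threads that are not idle (Sleep) or internal (Daemon) and have been
# in their current state for at least SECONDS.
long_queries() {
  awk -F '\t' -v min="$1" '
    $5 != "Sleep" && $5 != "Daemon" && $6 + 0 >= min + 0 { print $1, $6, $7 }'
}

# Usage (batch mode, no header row):
#   mysql -B -N -e "SHOW FULL PROCESSLIST" | long_queries 10
```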
4) Varnish / traffic surge
# Hit/miss & connections
varnishstat -1 | grep -E 'MAIN.cache_(hit|miss)|sess_conn|client_req'
# Sample live requests (stops after 10 lines or 10 seconds, whichever comes first)
sudo timeout 10s varnishncsa | head
If client_req surges and misses increase → backend (PHP/MySQL) pressure.
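The raw hit and miss counters are easier to judge as a ratio. A small sketch, assuming the MAIN.cache_hit and MAIN.cache_miss counter names used in the grep above; varnish_hit_ratio is a hypothetical helper name, and requests that are passed to the backend without a cache lookup are not counted here.

```shell
# varnish_hit_ratio
# Reads `varnishstat -1` output on stdin and prints the cache hit ratio
# as a percentage of hits + misses.
varnish_hit_ratio() {
  awk '$1 == "MAIN.cache_hit"  { hit = $2 }
       $1 == "MAIN.cache_miss" { miss = $2 }
       END {
         if (hit + miss == 0) { print "no cache lookups recorded"; exit }
         printf "hit ratio: %.1f%%\n", hit * 100 / (hit + miss)
       }'
}

# Usage:
#   varnishstat -1 | varnish_hit_ratio
```

The counters are cumulative since Varnish started, so comparing two readings taken a minute apart says more about a spike than a single reading.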
5) Network / DDoS quick checks
# Active web connections count
ss -Htn state established '( sport = :80 or sport = :443 )' | wc -l
# Top TCP states on 80/443
ss -Hnta '( sport = :80 or sport = :443 )' | awk '{print $1}' | sort | uniq -c | sort -nr | head
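If the connection count looks abnormal, grouping connections by remote address shows whether a handful of clients is responsible. A minimal sketch; top_peers is a hypothetical helper, and it assumes the peer address is the last column of each row (true for ss output without process info).

```shell
# top_peers
# Reads `ss -Hnt` rows on stdin and prints "count address" for the remote
# side of each connection, most connections first.
top_peers() {
  # Take the peer column, strip the trailing :port, then count per address.
  awk '{ print $NF }' |
    sed -E 's/:[0-9]+$//' |
    sort | uniq -c | sort -k1,1nr -k2,2 |
    awk '{ print $1, $2 }'
}

# Usage:
#   ss -Hnt state established '( sport = :80 or sport = :443 )' | top_peers | head
```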
6) Disk I/O bottlenecks
# Extended iostat
iostat -xz 1 3 # high %util / await = disk bound
# Per-process I/O (accumulated)
sudo iotop -aoP # Ctrl+C to stop
7) Deep dive on a hot PID
# What files/sockets are open?
sudo lsof -p <PID> | head
# What syscalls is it doing right now?
sudo strace -p <PID> -f -tt -s 200 -o /tmp/strace.<PID>.log
# Let it run ~10s, Ctrl+C, then:
tail -n 40 /tmp/strace.<PID>.log
Clues: endless futex() → lock contention; heavy read()/open() on large files; repeated connect()/recvfrom() → network waits.
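Scanning the tail of a trace by eye only goes so far; a per-syscall count makes the dominant call obvious. A rough sketch that assumes the log layout produced by the strace flags above (optional PID, timestamp, then the call); syscall_histogram is a hypothetical helper name.

```shell
# syscall_histogram
# Reads an strace log (as written with -f -tt -o FILE) on stdin and prints
# "count syscall", most frequent first. Lines that do not start a call
# (resumed calls, signals, exit notices) are skipped.
syscall_histogram() {
  sed -nE 's/^([0-9]+ +)?[0-9:.]+ +([a-z_][a-z0-9_]*)\(.*$/\2/p' |
    sort | uniq -c | sort -k1,1nr -k2,2 |
    awk '{ print $1, $2 }'
}

# Usage:
#   syscall_histogram < /tmp/strace.<PID>.log | head
```

strace can also produce a summary itself with -c, at the cost of not keeping the individual calls.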
8) Persistent telemetry (set once)
# iostat, mpstat and sar all ship in the sysstat package
sudo apt-get update && sudo apt-get install -y sysstat atop iotop
# Enable sysstat (sar)
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat
# Enable atop logging to /var/log/atop/
sudo systemctl enable --now atop
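# Optional: log every 60s (the variable name depends on the atop version; LOGINTERVAL on recent packages, so check /etc/default/atop first)
sudo sed -i 's/^LOGINTERVAL=.*/LOGINTERVAL=60/' /etc/default/atop && sudo systemctl restart atop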
Review later:
sar -u -f /var/log/sysstat/sa$(date +%d) # Debian keeps sar data in /var/log/sysstat
atopsar -c -b 15:00 -e 16:00 # CPU by time window
9) Status endpoints (local‑only)
Apache mod_status
sudo a2enmod status
cat <<'EOF' | sudo tee /etc/apache2/conf-available/status-local.conf
<Location /server-status>
SetHandler server-status
Require local
</Location>
EOF
sudo a2enconf status-local && sudo systemctl reload apache2
PHP‑FPM pool status
In /etc/php/*/fpm/pool.d/www.conf:
pm.status_path = /fpm_status
ping.path = /ping
Restrict in the Apache vhost:
<LocationMatch "^/(fpm_status|ping)$">
Require local
</LocationMatch>
Then:
curl -s 'http://127.0.0.1/server-status?auto' | sed -n '1,120p'
curl -s 'http://127.0.0.1/fpm_status?full' | sed -n '1,120p'
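The two fields on the PHP-FPM status page that most directly indicate saturation are listen queue and max children reached. A minimal sketch that reads the plain-text status summary and flags either being non-zero; fpm_saturation is a hypothetical helper, and the field labels are assumed to match the default text output.

```shell
# fpm_saturation
# Reads the PHP-FPM status page (plain text) on stdin and reports whether
# the pool is keeping up with incoming requests.
fpm_saturation() {
  awk -F ': *' '
    $1 == "listen queue"         { queue  = $2 }
    $1 == "active processes"     { active = $2 }
    $1 == "total processes"      { total  = $2 }
    $1 == "max children reached" { maxed  = $2 }
    END {
      printf "active=%d total=%d queue=%d max_children_reached=%d\n",
        active, total, queue, maxed
      if (queue + 0 > 0 || maxed + 0 > 0) {
        print "WARN: requests queued or pm.max_children was hit"
      } else {
        print "OK: pool has headroom"
      }
    }'
}

# Usage:
#   curl -s 'http://127.0.0.1/fpm_status' | fpm_saturation
```

max children reached counts events since PHP-FPM started, so a non-zero value may describe an earlier spike rather than the current one.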
10) Enable slow query logging (persistent)
# Turn on now (non-persistent)
sudo mysql -e "SET GLOBAL slow_query_log=1; SET GLOBAL long_query_time=1;"
# Persist settings
sudo bash -lc 'cat >>/etc/mysql/mariadb.conf.d/50-server.cnf <<EOF
[mysqld]
slow_query_log = 1
long_query_time = 1
slow_query_log_file = /var/log/mysql/slow.log
EOF'
sudo systemctl restart mariadb
# Optional analysis tool
sudo apt-get install -y percona-toolkit
pt-query-digest /var/log/mysql/slow.log | less
11) Log analysis helpers
GoAccess (interactive + HTML report)
sudo apt-get install -y goaccess
# Interactive (auto-prompt for format)
goaccess /var/log/apache2/your-vhost-access.log -c
# Include current and rotated logs (zcat -f passes uncompressed files through) and output HTML
zcat -f /var/log/apache2/your-vhost-access.log* \
| goaccess -a --date-format='%d/%b/%Y' --time-format='%T' \
--log-format=COMBINED -o /var/www/html/goa-report.html -
apachetop (live “top” for Apache)
sudo apt-get install -y apachetop
apachetop -f /var/log/apache2/your-vhost-access.log
lnav (smart log navigator)
sudo apt-get install -y lnav
lnav /var/log/apache2/your-vhost-access.log
# Inside lnav:
# :filter-in wp-login.php
# :filter-in admin-ajax.php
12) Spike capture bundle (one‑shot evidence)
Create /usr/local/bin/spike-capture.sh and make it executable.
sudo tee /usr/local/bin/spike-capture.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -u
TS=$(date +"%Y%m%d-%H%M%S")
OUT="/root/spike-$TS"
LOG="/var/log/apache2/your-vhost-access.log" # <-- set this to the hottest vhost
{
echo "=== uptime"; uptime
echo; echo "=== mpstat"; mpstat -P ALL 1 3
echo; echo "=== vmstat"; vmstat 1 3
echo; echo "=== top"; top -b -n1 | head -n 40
echo; echo "=== ps top cpu"; ps -eo pid,ppid,comm,%cpu,%mem,etime --sort=-%cpu | head -n 30
echo; echo "=== net conns:80/443 (established)"; ss -Htn state established '( sport = :80 or sport = :443 )' | wc -l
echo; echo "=== top URLs (last 5k lines)"; tail -n 5000 "$LOG" | awk '{print $7}' | sort | uniq -c | sort -nr | head -n 20
echo; echo "=== top IPs (last 5k lines)"; tail -n 5000 "$LOG" | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 20
echo; echo "=== php-fpm pools"; pgrep -a php-fpm || true
echo; echo "=== mysql processlist"; mysql -e "SHOW FULL PROCESSLIST\\G" 2>/dev/null | head -n 200 || true
} > "$OUT.txt"
echo "Saved to $OUT.txt"
EOF
sudo chmod +x /usr/local/bin/spike-capture.sh
Usage:
sudo spike-capture.sh
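Spikes rarely wait for someone to be logged in, so the capture can also be triggered automatically when load crosses a limit. A sketch of such a tripwire; load_exceeds is a hypothetical helper, and the threshold of 8 in the usage lines is an assumption that should be set relative to the number of CPU cores.

```shell
# load_exceeds LOAD THRESHOLD
# Succeeds (exit status 0) when LOAD is strictly above THRESHOLD.
# awk does the comparison because load averages are fractional.
load_exceeds() {
  awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# Usage, e.g. from a root cron entry that runs every minute:
#   read -r load1 _ < /proc/loadavg
#   load_exceeds "$load1" 8 && /usr/local/bin/spike-capture.sh
```

Each run writes a new timestamped file, so a long spike produces one capture per minute; that is usually desirable, but it is worth pruning old captures afterwards.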
13) Guardrails to prevent repeats
- Logrotate: ensure /etc/logrotate.d/apache2 uses daily, size 50M, compress.
- Rate limiting:
  - Simple: libapache2-mod-evasive (burst blocking).
  - Flexible: fail2ban jails for wp-login.php, xmlrpc.php, abusive UAs.
- Caching: Verify Varnish pass rules for admin/login and cache rules for static assets; minimize backend hits.
- PHP‑FPM tuning: ensure pm.max_children fits RAM & workload (avoid CPU thrash/OOM).
- MySQL tuning: adequate InnoDB buffer pool, tmp table size; add missing indexes from slow log insights.
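For the pm.max_children item, a common rule of thumb is to divide the memory reserved for PHP by the average worker size. The sketch below only does that arithmetic; fpm_max_children is a hypothetical helper, the 4096 MB budget is an example figure, and the process name in the usage comment is an assumption (Debian typically names it php-fpmX.Y).

```shell
# fpm_max_children BUDGET_MB AVG_WORKER_MB
# Prints how many workers of AVG_WORKER_MB fit into BUDGET_MB (rounded down).
fpm_max_children() {
  if [ "$2" -le 0 ]; then
    echo "average worker size must be positive" >&2
    return 1
  fi
  echo $(( $1 / $2 ))
}

# Average resident size of the current workers, in MB:
#   ps -C php-fpm8.2 -o rss= | awk '{ s += $1; n++ } END { if (n) printf "%d\n", s / n / 1024 }'
# Then, with 4096 MB set aside for PHP:
#   fpm_max_children 4096 64
```

RSS counts shared memory such as the opcode cache once per worker, so the result tends to be a conservative estimate.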
14) Quick references
- High us% + PHP/Apache → hot endpoints or plugin/theme path; add cache/rate limits.
- High us% + mysqld → expensive queries; capture with slow log and index.
- High wa% → disk bound; tune DB, reduce log churn, consider faster storage.
- High sy% with ksoftirqd → NIC/driver/interrupt pressure; ensure irqbalance is running.
- High st% → host contention; escalate to VPS provider or resize.
Tip: Keep this playbook on the server (e.g., /root/monitoring-playbook.md) and keep spike-capture.sh ready. When a spike happens, run the script, then consult sections 2–6 based on what tops the charts.