Linux System Performance/Observability Tools
BPF (bpfcc-tools), BCC Tools, and Complete Observability Toolkit
This guide assumes you have a Linux server in place, with the required BPF/BCC packages installed for your Linux distribution.
BPF (eBPF) aka BCC Tools (bpfcc-tools):
- BPF, which originally stood for Berkeley Packet Filter, underpins the dynamic tracing tools for Linux systems.
- BPF was initially used to speed up tcpdump filter expressions; its modern, extended form is known as the extended Berkeley Packet Filter (eBPF).
- Its newer use is tracing: it provides the programmability behind the BPF Compiler Collection (BCC) and bpftrace front ends.
- Example: execsnoop and biosnoop are BCC tools.
- When facing a production performance crisis, these tools come in handy to trace and fix the issue. However, they require certain kernel config options to be enabled, such as CONFIG_FTRACE and CONFIG_BPF.
- Profiling tools typically require packages compiled with frame pointers or debug symbols to produce complete stack traces.
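Before reaching for the BCC tools, it helps to confirm those kernel options are actually set. A minimal sketch, assuming the common `/boot/config-$(uname -r)` layout (other distros may expose `/proc/config.gz` instead):

```shell
# Check whether the kernel options BCC/bpftrace rely on are enabled.
# Config file location varies by distro; this is the Debian/Ubuntu layout.
CONFIG="/boot/config-$(uname -r)"
if [ -r "$CONFIG" ]; then
    bpf_opts=$(grep -E '^CONFIG_(BPF|FTRACE)=' "$CONFIG")
else
    bpf_opts="kernel config not found at $CONFIG (try /proc/config.gz)"
fi
printf '%s\n' "$bpf_opts"
```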
Credits: Brendan Gregg
Here is a list of tools that come in handy when you need to fix issues in production or any other environment.
🚨 Production CRISIS Tools
#. Name - Provides
- procps - ps, vmstat, uptime, top
- util-linux - dmesg, lsblk, lscpu
- sysstat - iostat, mpstat, pidstat, sar
- iproute2 - ip, ss, nstat, tc
- numactl - numastat
- linux-tools-common, linux-tools-$(uname -r) - perf, turbostat
- bcc-tools (aka bpfcc-tools) - opensnoop, execsnoop, runqlat, runqlen, softirqs, hardirqs, ext4slower, ext4dist, tcptop, tcplife, biotop, biosnoop, biolatency, trace, argdist, funccount, stackcount, profile, etc.
- bpftrace - bpftrace, etc..
- perf-tools-unstable - ftrace version of opensnoop, execsnoop, iolatency, iosnoop, bitesize, kprobe, funccount
- trace-cmd - trace-cmd
- nicstat - nicstat
- ethtool - ethtool
- tiptop - tiptop (# apt install tiptop)
- msr-tools - rdmsr, wrmsr
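During a crisis there is no time to install packages, so it pays to know in advance which of these tools are already on the host. A quick inventory sketch using only `command -v` (the tool list here is a representative subset of the table above):

```shell
# Quick inventory: which of the crisis tools are already installed?
inventory=""
for t in ps vmstat top dmesg iostat mpstat sar ip ss perf bpftrace; do
    if command -v "$t" >/dev/null 2>&1; then
        inventory="$inventory
found:   $t"
    else
        inventory="$inventory
missing: $t"
    fi
done
printf '%s\n' "$inventory"
```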
🔧 Linux Application Debugging/Observability Tools
#. Tool Name - Description
- perf - CPU (Profiling | Flame Graphs), syscall tracing
- profile - CPU Profiling using timed sampling
- offcputime - Off-CPU profiling using Scheduler Tracing
- strace - Syscall Tracing
- execsnoop - New Process Tracing
- syscount - Syscall Counting
- bpftrace - Signal tracing, I/O profiling, Lock analysis
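As a small worked example of syscall counting from the table above: `strace -c` summarizes syscalls per command. This is a hedged sketch; strace is assumed installed (`apt install strace`), and the snippet falls back gracefully if it is not:

```shell
# Per-syscall counts for a short-lived command via strace -c.
# strace writes the -c summary to stderr, hence 2>&1.
if command -v strace >/dev/null 2>&1; then
    summary=$(strace -c -f true 2>&1)
else
    summary="strace not installed; try: apt install strace"
fi
printf '%s\n' "$summary" | head -n 8
```

Note that strace pauses the target on every syscall, so prefer syscount (BCC) on busy production processes.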
⚙️ Linux CPU Performance Debugging/Observability Tools
#. Tool Name - Description
- uptime - Load averages (1-, 5-, and 15-minute); for CPU pressure, see # cat /proc/pressure/cpu (10s, 60s & 300s averages)
- vmstat - Includes system-wide CPU Averages
- mpstat - Per-CPU Statistics
- sar - Historical Statistics
- ps - Process Status
- top - Monitor per-process/thread CPU usage
- pidstat - Per-process/thread CPU breakdowns
- time, ptime - time a command, with CPU breakdowns
- turbostat - Show CPU clock rate and other states
- showboost - Show CPU clock rate and turbo boost
- pmcarch - show high-level CPU cycle usage
- tlbstat - Summarize TLB Cycles
- perf - CPU profiling & PMC Analysis
- profile - Sample CPU Stack traces
- cpudist - Summarize on-CPU time
- runqlat - Summarize CPU run queue latency
- runqlen - Summarize CPU run queue length
- softirqs - Summarize soft Interrupt time
- hardirqs - Summarize hard Interrupt time
- bpftrace - Tracing programs for CPU analysis
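The per-CPU numbers that vmstat and mpstat report are derived from jiffy counters in /proc/stat. A minimal sketch of that derivation with awk (column layout: user, nice, system, idle, iowait, irq, softirq, steal, ...):

```shell
# Where mpstat-style per-CPU utilization comes from: /proc/stat counters.
cpu_busy=$(awk '/^cpu[0-9]/ {
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    busy = total - $5 - $6               # subtract idle ($5) and iowait ($6)
    printf "%s busy=%d total=%d\n", $1, busy, total
}' /proc/stat)
printf '%s\n' "$cpu_busy"
```

Utilization over an interval is the delta of busy divided by the delta of total between two samples, which is exactly what mpstat computes.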
💾 Linux MEMORY Performance Debugging/Observability Tools
#. Tool Name - Description
- vmstat - Virtual and Physical Memory statistics
Ex: root@ip-172-31-21-94:~# vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b  swpd  free  buff  cache    si  so    bi   bo    in  cs   us sy id wa st
 2  0     0   271    29   511      0   0    67  223    54  42    1  0 89  0 10
 0  0     0   271    29   511      0   0     0    0    40  27    0  0 99  0  1
 1  0     0   271    29   511      0   0     0    0    32  28    0  0 81  0 19
- PSI - Memory pressure stall information
Ex: root@ip-172-31-21-94:~# cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=730880
full avg10=0.00 avg60=0.00 avg300=0.00 total=649756
- swapon - Swap Device Usage
Ex: swapon
- sar - Historical Statistics
- slabtop - Kernel Slab Allocator Statistics
Ex: root@ip-172-31-21-94:~# slabtop -sc
 Active / Total Objects (% used)    : 245106 / 253954 (96.5%)
 Active / Total Slabs (% used)      : 6869 / 6869 (100.0%)
 Active / Total Caches (% used)     : 312 / 370 (84.3%)
 Active / Total Size (% used)       : 61201.99K / 63941.81K (95.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.25K / 10.12K
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 12339  12188  98%    1.16K    457       27     14624K ext4_inode_cache
  7975   7886  98%    0.62K    319       25      5104K inode_cache
 30772  30772 100%    0.14K   1099       28      4396K kernfs_node_cache
 22953  21881  95%    0.19K   1093       21      4372K dentry
- numastat - NUMA Statistics
Ex: root@ip-172-31-21-94:~# numastat
                          node0
numa_hit                1673622
numa_miss                     0
numa_foreign                  0
interleave_hit               77
local_node              1673622
other_node                    0
- ps - Process Status
Ex:
# ps aux
# ps -eo pid,pmem,vsz,rss,comm
- top - Monitor Per Process memory usage
Ex: # top -o %MEM
- pmap - Process address space statistics
Ex: pmap -x <pid>
- perf - Memory PMC and tracepoint analysis
Ex: Sample page faults (RSS growth) with stack traces system-wide, until Ctrl-C:
root@ip-172-31-21-94:~# perf record -e page-faults -a -g
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.135 MB perf.data (3 samples) ]
- drsnoop - Direct reclaim tracing (BCC tool)
- wss - Working set size estimation (measures the working set via page table entry (PTE) accessed bits)
- bpftrace - Tracing Programs for memory analysis (BPF based tracer)
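The memory totals that free, top, and vmstat display all come from /proc/meminfo, reported in kB. A short sketch that pulls the headline fields directly:

```shell
# The raw source behind free/vmstat memory columns: /proc/meminfo (kB).
mem_summary=$(awk '/^(MemTotal|MemFree|MemAvailable|Cached|SwapTotal):/ {
    printf "%-14s %10.1f MiB\n", $1, $2 / 1024
}' /proc/meminfo)
printf '%s\n' "$mem_summary"
```

MemAvailable is usually the field to watch: it estimates memory available for new workloads without swapping, unlike the often-misread MemFree.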
📁 File System Performance Debugging/Observability Tools
#. Tool Name - Description
- mount - List file system and their mount flags
- free - Cache capacity statistics
- top - Includes memory usage summary
- vmstat - Virtual memory statistics
- sar - Various statistics including historic
- slabtop - Kernel slab allocator statistics
- strace - System call tracing
- fatrace - Trace file system operations using fanotify
- latencytop - Show system-wide latency sources
- opensnoop - Trace files opened
- filetop - Top files by reads/writes (IOPS)
- cachestat - Page Cache Statistics
- ext4dist (xfs,zfs,btrfs,nfs) - Show ext4 operation latency distribution
- ext4slower (xfs,zfs,btrfs,nfs) - Show slow ext4 operations
- bpftrace - Custom file system tracing
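mount(8) itself reads /proc/mounts at runtime (columns: device, mountpoint, fstype, flags). A small sketch that filters it to a few common filesystem types; the type list is illustrative, not exhaustive:

```shell
# What mount(8) reads: /proc/mounts, filtered to common fstypes.
fs_list=$(awk '$3 ~ /^(proc|ext4|xfs|btrfs|tmpfs|overlay)$/ {
    printf "%-24s %-28s %s\n", $1, $2, $3
}' /proc/mounts)
printf '%s\n' "$fs_list"
```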
💿 Disk Performance Debugging/Observability Tools
#. Tool Name - Description
- iostat - Various per-disk statistics
- sar - Historical disk statistics
- PSI - Disk Pressure stall information
- pidstat - Disk I/O usage by process
- perf - Record Block I/O tracepoints
Ex:
# perf list 'block:*'
# perf record -e block:block_rq_issue -a -g sleep 10
# perf script --header
- biolatency - Summarize Disk I/O latency as histogram
- biosnoop - Trace disk I/O with PID and latency
- iotop, biotop - Top for disks: summarize disk I/O by process
- biostacks - Show disk I/O with Initialization Stacks
- blktrace - Disk I/O event tracing
- bpftrace - Custom Disk Tracing
Ex: Count block I/O tracepoint events:
# bpftrace -e 'tracepoint:block:* { @[probe] = count(); }'
- smartctl - Disk controller statistics (Self-Monitoring, Analysis & Reporting Technology)
Ex: Install and use:
# apt install smartmontools
# smartctl --all -d megaraid,0 /dev/xvda15
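# apt install smartmontools # smartctl --all -d megaraid,0 /dev/xvda15
The per-disk counters that iostat summarizes live in /proc/diskstats. A minimal sketch reading the completed-I/O counters directly (field 3 = device, field 4 = reads completed, field 8 = writes completed):

```shell
# Raw counters behind iostat: /proc/diskstats, skipping loop/ram devices.
disk_io=$(awk '$3 !~ /^(loop|ram)/ {
    printf "%-12s reads=%s writes=%s\n", $3, $4, $8
}' /proc/diskstats)
printf '%s\n' "$disk_io" | head -n 5
```

These are counters since boot; iostat derives rates by sampling twice and taking the delta.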
🌐 Network Performance Debugging/Observability Tools
#. Tool Name - Description
- ss - Socket statistics
Ex: root@ip-172-31-21-94:~# ss -tiepm
State Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
ESTAB 0      52     172.31.21.94:ssh    103.252.203.93:7237
      users:(("sshd",pid=1251,fd=4)) timer:(on,217ms,0) ino:7097 sk:5e <->
      skmem:(r0,rb2142943,t0,tb87040,f3148,w948,o0,bl0,d25) ts sack cubic
      rto:223 rtt:22.584/29.792 mss:1448 cwnd:10 bytes_sent:1513930
- ip - Network interface & route statistics
Ex: root@ip-172-31-21-94:~# ip -s link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    RX:  bytes packets errors dropped missed mcast
         31750     264      0       0      0     0
- ifconfig - Network interface statistics
- nstat - Network stack statistics
Ex: root@ip-172-31-21-94:~# nstat -s
#kernel
IpInReceives    19426   0.0
IpInDelivers    19426   0.0
IpOutRequests   15286   0.0
TcpActiveOpens     27   0.0
TcpInSegs       18280   0.0
TcpOutSegs      14237   0.0
- netstat - Various network stack & interface statistics
- sar - Historical statistics
Ex: root@ip-172-31-21-94:~# sar -n TCP 1
Linux 6.8.0-1015-aws (ip-172-31-21-94)  11/18/24  _x86_64_  (1 CPU)
14:47:33   active/s passive/s    iseg/s    oseg/s
14:47:34       0.00      0.00      2.00      0.00
14:47:35       0.00      0.00      1.00      1.00
Average:       0.00      0.00      1.25      0.75
- nicstat - Network interface throughput and utilization
Ex: root@ip-172-31-21-94:~# nicstat -z 1
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs  %Util    Sat
14:49:35       lo    0.00    0.00    0.02    0.02   122.0   122.0   0.00   0.00
14:49:35     eth0   19.70    0.27   14.33    1.88  1408.4   144.7   0.00   0.00
- ethtool - Network interface driver statistics
Ex: # ethtool -i eth0 (-i shows driver details; -k shows interface tunables)
# ethtool -k eth0
- tcplife - Trace TCP Session lifespans with connection details
Ex: # tcplife
- tcptop - Show TCP throughput by Host and Process
Ex: # tcptop
- tcpretrans - Trace TCP retransmits with address & TCP state
Ex: # tcpretrans
- bpftrace - TCP/IP Stack Tracing: connections, packets, drops, latency
Ex: Count socket accepts by PID and process name:
# bpftrace -e 't:syscalls:sys_enter_accept* { @[pid, comm] = count(); }'
List TCP tracepoints:
# bpftrace -l 't:tcp:*'
tracepoint:tcp:tcp_bad_csum
tracepoint:tcp:tcp_cong_state_set
tracepoint:tcp:tcp_destroy_sock
tracepoint:tcp:tcp_probe
tracepoint:tcp:tcp_retransmit_skb
tracepoint:tcp:tcp_send_reset
- tcpdump - Network packet sniffer
- wireshark - Graphical network packet inspection
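The per-interface byte and packet counters that ip -s link and nicstat summarize come from /proc/net/dev. A short sketch that extracts rx/tx bytes per interface (the colon after the interface name can be fused to the first counter, so it is split first):

```shell
# Interface counters behind ip -s link / nicstat: /proc/net/dev.
net_stats=$(awk 'NR > 2 {
    sub(/:/, " ")                        # split fused "iface:rx_bytes" field
    printf "%-10s rx_bytes=%s tx_bytes=%s\n", $1, $2, $10
}' /proc/net/dev)
printf '%s\n' "$net_stats"
```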
Benefits of knowing and using the BCC/eBPF (bpfcc-tools) tools:
- You can debug, identify, and fix issues within tight timelines.
- They provide dynamic tracing capabilities via BPF.
- You can pick the right tool for the right system resource.
- And many more.
Ref: Brendan Gregg Online resources and books (BPF Performance Tools, Systems Performance).