CPU and Memory Tuning
Monitor:
Process: an independent unit of execution
System resources: CPU time, storage space
OS: presents each process with a virtual machine
CPU:
Time: divided into slices
Cache: caches the data of the currently running program
Process switch: save the context, restore the context
Memory: linear addresses <-- physical addresses
Space: mapping
I/O:
kernel --> process
Process descriptor:
process metadata
doubly linked list
Linux: preemptive
System clock: hardware timer
tick: one clock tick
Time resolution:
100Hz (10 ms per tick)
1000Hz (1 ms per tick)
Clock interrupt: raised on every tick
A: 5ms,1ms
C:
Process classes:
Interactive processes (I/O-bound)
Batch processes (CPU-bound)
Real-time processes
CPU-bound: longer time slice, lower priority
I/O-bound: shorter time slice, higher priority
Linux priorities: priority
Real-time priority: 1-99; the smaller the number, the lower the priority
Static priority: 100-139; the smaller the number, the higher the priority; nice -20..19 maps to 100..139
nice 0 maps to 120
Real-time priorities are always higher than static priorities
nice value: adjusts the static priority
Scheduling classes:
Real-time processes:
SCHED_FIFO: First In, First Out
SCHED_RR: Round Robin
SCHED_OTHER: schedules processes with static priority 100-139
100-139: static priority = 120 + nice, e.g.
nice -10: 110
nice -5: 115
nice 0: 120
nice 10: 130
Dynamic priority:
dynamic priority = max(100, min(static priority - bonus + 5, 139))
bonus: 0-10
Example: static priority 110, bonus 10 -> max(100, min(110 - 10 + 5, 139)) = 105
Manually adjusting priorities:
100-139: nice
nice -n N COMMAND
renice -n N PID
chrt -p PID (show the current policy and priority)
1-99: real-time
chrt -f -p prio PID
chrt -r -p prio PID
chrt -f prio COMMAND
ps -e -o class,rtprio,pri,nice,cmd
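A minimal sketch tying these commands together, assuming a hypothetical CPU-bound program ./worker and PID 1234:
# nice -n 10 ./worker                                 # start with nice 10 (static priority 130)
# renice -n -5 1234                                   # raise the priority of a running process
# chrt -f -p 50 1234                                  # switch PID 1234 to SCHED_FIFO, rtprio 50
# ps -e -o pid,class,rtprio,ni,pri,cmd | grep worker  # verify the change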
CPU affinity: binding processes to specific CPUs
numastat
numactl
taskset: bind a process to a given CPU (or set of CPUs)
mask:
0x00000001 -> binary 0001 -> CPU 0
0x00000003 -> binary 0011 -> CPUs 0 and 1
0x00000005 -> binary 0101 -> CPUs 0 and 2
0x00000007 -> binary 0111 -> CPUs 0-2
# taskset -p mask pid
Example: bind PID 101 to CPUs 0 and 1 (mask 0x3):
# taskset -p 0x00000003 101
taskset -p -c 0-2,7 101
Interrupts should be bound to the non-isolated CPUs, so that the isolated CPUs never have to service interrupt handlers;
echo CPU_MASK > /proc/irq/<irq number>/smp_affinity
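A hedged example, assuming IRQ 45 belongs to the NIC and CPUs 0 and 1 are the non-isolated CPUs:
# cat /proc/irq/45/smp_affinity           # current mask
# echo 3 > /proc/irq/45/smp_affinity      # mask 0x3 = CPUs 0 and 1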
sar -w
shows the average number of context switches and the average rate of process creation;
CPU-related information:
sar -q
vmstat 1 5
mpstat 1 2
sar -P 0 1
iostat -c
dstat -c
/etc/grub.conf
isolcpus=cpu_number,...,cpu_number
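A hedged sketch of the kernel line in /etc/grub.conf, assuming a 4-CPU machine where CPUs 2 and 3 should be isolated (the kernel image path and other parameters are illustrative):
kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg-root isolcpus=2,3
After a reboot the scheduler will not place ordinary tasks on CPUs 2-3; bind latency-sensitive tasks there explicitly with taskset.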
slab allocator:
buddy system:
memcached:
MMU: Memory Management Unit
address translation
memory protection
Process: linear address --> physical address
Physical memory is divided into page frames
The process address space is divided into pages
TLB: Translation Lookaside Buffer
sar -R: shows memory allocation and release activity
dstat --vm: virtual memory statistics
Given this performance penalty, performance-sensitive applications should avoid regularly accessing remote memory in a NUMA topology system. The application should be set up so that it stays on a particular node and allocates memory from that node.
To do this, there are a few things that applications will need to know:
What is the topology of the system?
Where is the application currently executing?
Where is the closest memory bank?
CPU affinity is represented as a bitmask. The lowest-order bit corresponds to the first logical CPU, and the highest-order bit corresponds to the last logical CPU. These masks are typically given in hexadecimal, so that 0x00000001 represents processor 0, and 0x00000003 represents processors 0 and 1.
# taskset -p mask pid
To launch a process with a given affinity, run the following command, replacing mask with the mask of the processor or processors you want the process bound to, and program with the program, options, and arguments of the program you want to run.
# taskset mask -- program
Instead of specifying the processors as a bitmask, you can also use the -c option to provide a comma-delimited list of separate processors, or a range of processors, like so:
# taskset -c 0,5,7-9 -- myprogram
numactl can also set a persistent policy for shared memory segments or files, and set the CPU affinity and memory affinity of a process. It uses the /sys file system to determine system topology.
The /sys file system contains information about how CPUs, memory, and peripheral devices are connected via NUMA interconnects. Specifically, the /sys/devices/system/cpu directory contains information about how a system's CPUs are connected to one another. The /sys/devices/system/node directory contains information about the NUMA nodes in the system, and the relative distances between those nodes.
numactl allows you to bind an application to a particular core or NUMA node, and to allocate the memory associated with a core or set of cores to that application.
numastat displays memory statistics (such as allocation hits and misses) for processes and the operating system on a per-NUMA-node basis. By default, running numastat displays how many pages of memory are occupied by the following event categories for each node. Optimal CPU performance is indicated by low numa_miss and numa_foreign values.
numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management (and therefore system performance).
Depending on system workload, numad can provide benchmark performance improvements of up to 50%. To achieve these performance gains, numad periodically accesses information from the /proc file system to monitor available system resources on a per-node basis. The daemon then attempts to place significant processes on NUMA nodes that have sufficient aligned memory and CPU resources for optimum NUMA performance. Current thresholds for process management are at least 50% of one CPU and at least 300 MB of memory. numad attempts to maintain a resource utilization level, and rebalances allocations when necessary by moving processes between NUMA nodes.
To restrict numad management to a specific process, start it with the following options.
# numad -S 0 -p pid
-p pid
Adds the specified pid to an explicit inclusion list. The process specified will not be managed until it meets the numad process significance threshold.
-S mode
The -S parameter specifies the type of process scanning. Setting it to 0 as shown limits numad management to explicitly included processes.
To stop numad, run:
# numad -i 0
Stopping numad does not remove the changes it has made to improve NUMA affinity. If system use changes significantly, running numad again will adjust affinity to improve performance under the new conditions.
Using Valgrind to Profile Memory Usage
Profiling Memory Usage with Memcheck
Memcheck is the default Valgrind tool, and can be run with valgrind program, without specifying --tool=memcheck. It detects and reports on a number of memory errors that can be difficult to detect and diagnose, such as memory access that should not occur, the use of undefined or uninitialized values, incorrectly freed heap memory, overlapping pointers, and memory leaks. Programs run ten to thirty times more slowly with Memcheck than when run normally.
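A minimal invocation sketch, assuming a locally built binary ./myprog (the flags are standard Memcheck options):
# valgrind --leak-check=full --track-origins=yes ./myprog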
Profiling Cache Usage with Cachegrind
Cachegrind simulates your program's interaction with a machine's cache hierarchy and (optionally) branch predictor. It tracks usage of the simulated first-level instruction and data caches to detect poor code interaction with this level of cache; and the last-level cache, whether that is a second- or third-level cache, in order to track access to main memory. As such, programs run with Cachegrind run twenty to one hundred times slower than when run normally.
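A hedged example with the same assumed binary; cg_annotate is Cachegrind's companion tool for reading the output file:
# valgrind --tool=cachegrind ./myprog
# cg_annotate cachegrind.out.<pid>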
Profiling Heap and Stack Space with Massif
Massif measures the heap space used by a specified program; both the useful space, and any additional space allocated for book-keeping and alignment purposes. It can help you reduce the amount of memory used by your program, which can increase your program's speed, and reduce the likelihood that your program will exhaust the swap space of the machine on which it executes. Massif can also provide details about which parts of your program are responsible for allocating heap memory. Programs run with Massif run about twenty times more slowly than their normal execution speed.
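A hedged example, again assuming ./myprog; ms_print renders the heap snapshots as a graph:
# valgrind --tool=massif ./myprog
# ms_print massif.out.<pid>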
Capacity-related Kernel Tunables
1. Memory zones:
32-bit: ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM
64-bit: ZONE_DMA, ZONE_DMA32, ZONE_NORMAL
2. MMU:
32-bit two-level paging: 10-bit page directory index, 10-bit page table index, 12-bit offset (PTE)
3. TLB
HugePage (larger pages mean fewer TLB entries and fewer misses)
CPU
O(1) scheduler: priorities 100-139
SCHED_OTHER: now handled by CFS
1-99:
SCHED_FIFO
SCHED_RR
Dynamic priority:
sar -P ALL
mpstat
iostat -c
dstat -c
--top-cpu
top
sar -q
vmstat
uptime
Memory subsystem components:
slab allocator
buddy system
kswapd
pdflush
mmu
Virtualized environments:
PA --> HA --> MA (guest physical address --> hypervisor/host address --> machine address)
The hypervisor translates PA --> HA
GuestOS, host OS
Shadow page tables
Memory:
TLB: huge pages improve TLB hit rates and performance
Enable huge pages in /etc/sysctl.conf:
vm.nr_hugepages=n
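A hedged example, assuming the application needs roughly 1 GB of 2 MB huge pages (512 pages; the number is illustrative):
vm.nr_hugepages=512                   # in /etc/sysctl.conf
# sysctl -p                           # apply
# grep -i huge /proc/meminfo          # verify HugePages_Total / HugePages_Free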
strace:
strace COMMAND: show the system calls made by a command
strace -p PID: show the system calls of an already running process
-c: print only a summary;
-o FILE: write the trace to a file for later analysis;
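A short hedged example combining these options (the traced command, output path and PID are illustrative):
# strace -c ls /etc                        # syscall summary for a single command
# strace -o /tmp/nginx.trace -p 1234       # attach to a running PID and log to a file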
1. Reduce the overhead of small memory objects:
slab
2. Reduce the service time of slow subsystems:
use the buffer cache to cache file metadata;
use the page cache to cache disk I/O;
use shm for inter-process communication;
use the buffer cache, arp cache and connection tracking to improve network I/O performance;
Overcommit:
2,2,2,2 : 8 (e.g. four processes each committing 2G against 8G)
Overcommitting physical memory presumes swap is available:
commitments may exceed physical memory by a limited amount:
Swap
/proc/slabinfo
slabtop
vmstat -m
vfs_cache_pressure:
0: do not reclaim dentries and inodes;
1-99: prefer not to reclaim them;
100: reclaim them with the same preference as page cache and swap cache;
over 100: prefer to reclaim them;
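A hedged example of lowering the reclaim pressure for a workload that benefits from cached dentries/inodes (the value 50 is illustrative, not a recommendation):
# sysctl -w vm.vfs_cache_pressure=50
vm.vfs_cache_pressure = 50            # persist in /etc/sysctl.conf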
Tuning approach: define performance metrics, locate the bottleneck
Process management, CPU
Memory tuning
I/O tuning
File system tuning
Network subsystem tuning
Setting the /proc/sys/vm/panic_on_oom parameter to 0 instructs the kernel to call the oom_killer function when OOM occurs.
oom_adj
Defines a value from -16 to 15 that helps determine the oom_score of a process. The higher the oom_score value, the more likely the process is to be killed by the oom_killer.
-16 to 15: contributes to the calculation of oom_score
-17: disables the oom_killer for that process.
IPC management commands:
ipcs
ipcrm
shm:
shmmni: system-wide limit on the number of shared memory segments;
shmall: system-wide limit on the number of pages that may be allocated to shared memory;
shmmax: maximum size of a single shared memory segment;
messages:
msgmnb: maximum size of a single message queue, in bytes;
msgmni: system-wide limit on the number of message queues;
msgmax: maximum size of a single message, in bytes;
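A hedged /etc/sysctl.conf sketch for an IPC-heavy application; the values are illustrative placeholders, not recommendations:
kernel.shmmax = 4294967296            # 4 GB per shared memory segment
kernel.shmall = 1048576               # total pages usable for shared memory
kernel.msgmnb = 65536
kernel.msgmax = 65536
Check current limits with ipcs -l and remove stale segments with ipcrm.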
Manually flushing dirty pages and caches:
sync
echo s > /proc/sysrq-trigger
vm.dirty_background_ratio
percentage of total memory that may be dirty before background writeback starts
vm.dirty_ratio
percentage at which a single process generating dirty data is itself forced to write back
vm.dirty_expire_centisecs
how long (in 1/100 s) a dirty page may stay in memory before it must be written out
vm.dirty_writeback_centisecs
interval (in 1/100 s) at which the flusher threads wake up to write dirty data
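A hedged sysctl sketch for a write-heavy host with ample RAM; all numbers are illustrative:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 20
vm.dirty_expire_centisecs = 3000      # 30 s
vm.dirty_writeback_centisecs = 500    # 5 s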
vm.swappiness
Memory tuning
HugePage:TLB
IPC:
pdflush
slab
swap
oom
I/O, Filesystem, Network
Note that the I/O numbers reported by vmstat are aggregations of all I/O to all devices. Once you have determined that there may be a performance gap in the I/O subsystem, you can examine the problem more closely with iostat, which will break down the I/O reporting by device. You can also retrieve more detailed information, such as the average request size, the number of reads and writes per second, and the amount of I/O merging going on.
vmstat and dstat -r show overall I/O activity;
iostat shows the I/O activity of individual devices;
slice_idle = 0
quantum = 64
group_idle = 1
blktrace
blkparse
btt
fio
io-stress
iozone
iostat
ext3
ext4: 16TB
xfs:
mount -o nobarrier,noatime
ext3: noatime
data=ordered, journal, writeback
ext2, ext3
Tuning Considerations for File Systems
Formatting Options:
File system block size
Mount Options
Barrier: guarantees that metadata is written safely; can be disabled with nobarrier
Access Time (noatime)
Historically, when a file is read, the access time (atime) for that file must be updated in the inode metadata, which involves additional write I/O
Increased read-ahead support
# blockdev --getra device
# blockdev --setra N device
Ext4 is supported for a maximum file system size of 16 TB and a single file maximum size of 16TB. It also removes the 32000 sub-directory limit present in ext3.
Tuning Ext4:
1. Defer inode table initialization when formatting a large file system:
-E lazy_itable_init=1
# mkfs.ext4 -E lazy_itable_init=1 /dev/sda5
2. Disable Ext4's automatic fsync() behavior:
-o noauto_da_alloc
mount -o noauto_da_alloc
3. Lower journal I/O priority to the same level as data I/O:
-o journal_ioprio=n
valid values of n are 0-7, the default is 3;
Tuning xfs:
xfs is very stable and highly scalable. It is a 64-bit journaling file system and therefore supports very large files and file systems. On RHEL 6.4 its default format and mount options already perform close to optimally.
dd
iozone
bonnie++
I/O:
I/O scheduler: CFQ, deadline, NOOP
EXT4:
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_syncookies = 1
net.core.rmem_max = 12582912
net.core.rmem_default
net.core.netdev_max_backlog = 5000
net.core.wmem_max = 12582912
net.core.wmem_default
net.ipv4.tcp_rmem= 10240 87380 12582912
net.ipv4.tcp_wmem= 10240 87380 12582912
net.ipv4.tcp_tw_reuse=1
Set the max OS send buffer size (wmem) and receive buffer size (rmem) to 12 MB for queues on all protocols. In other words, set the amount of memory that is allocated for each TCP socket when it is opened or created while transferring files.
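A hedged note on applying the settings above: put them in /etc/sysctl.conf, load them with sysctl -p, then verify a value, for example:
# sysctl -p
# sysctl net.ipv4.tcp_rmem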
netstat -an
ss
lsof
ethtool
Systemtap
Oprofile
Valgrind
perf
perf stat
Task-clock-msecs: CPU utilization; a high value means the program spends most of its time on CPU computation rather than I/O.
Context-switches: the number of process switches that occurred while the program ran; frequent switching should be avoided.
Cache-misses: the program's overall cache behaviour; a high value means the program makes poor use of the cache.
CPU-migrations: how many times the process was migrated during its run, i.e. moved by the scheduler from one CPU to another.
Cycles: processor clock cycles; a single machine instruction may take several cycles.
Instructions: the number of machine instructions executed.
IPC: the Instructions/Cycles ratio; the higher the better, as it indicates the program makes full use of the processor.
Cache-references: the number of cache accesses.
Cache-misses: the number of cache misses.
By specifying the -e option you can change the default events counted by perf stat (events were described in the previous section and can be listed with perf list). If you already have tuning experience, you will probably use -e to look at the specific events you care about.
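A hedged example, assuming a hypothetical binary ./myprog:
# perf stat ./myprog
# perf stat -e cache-references,cache-misses,instructions,cycles ./myprog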
perf top
With perf stat you usually already have a tuning target in mind. At other times you merely notice that system performance has dropped for no obvious reason and have no idea which process has become the hog. You then need a top-like command that lists the suspicious processes so you can pick out the one worth a closer look, much like investigators scanning surveillance footage for the few people behaving oddly rather than questioning everyone on the street.
perf top displays the system's performance statistics in real time. It is mainly used to observe the current state of the whole system, for example to find the most time-consuming kernel functions or user processes.
Using perf record and reading the report
After top and stat you probably have a rough idea. For deeper analysis you need finer-grained information. Suppose you have concluded that the target program is compute-heavy, perhaps because some of its code is not written efficiently; faced with a long source file, which lines actually need changing? This is where perf record comes in: it records statistics at the level of individual functions, and perf report displays the results.
Your tuning effort should focus on the hot code with a high percentage. If a piece of code accounts for only 0.1% of the total run time, then even optimizing it down to a single machine instruction would improve overall performance by at most 0.1%.
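A hedged sketch of the record/report cycle, with ./myprog again as the assumed target:
# perf record -g ./myprog             # -g also records call graphs
# perf report                         # interactive breakdown by function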
Disk:
IO Scheduler:
CFQ
deadline
anticipatory
NOOP
/sys/block/<device>/queue/scheduler
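A hedged example of checking and switching the scheduler for an assumed device sda (the active scheduler is shown in brackets; the available list depends on the kernel):
# cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler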
Memory:
MMU
TLB
vm.swappiness={0..100}: how aggressively swap is used; default 60
overcommit_memory=2: strict overcommit accounting; commitments may not exceed the limit below
overcommit_ratio=50:
committable memory = swap + RAM * ratio
swap: 2G
RAM: 8G
committable memory = 2G + 8G * 50% = 2G + 4G = 6G
To make full use of physical memory:
1. make swap as large as RAM and set swappiness=0;
2. or set overcommit_memory=2, overcommit_ratio=100 and swappiness=0;
committable memory = swap + RAM
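A hedged /etc/sysctl.conf sketch for option 2 above; the values follow this note, not a general recommendation:
vm.swappiness = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 100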
tcp_max_tw_buckets: increase it
tw: the number of connections in the TIME_WAIT state
established --> tw (connections move from ESTABLISHED to TIME_WAIT)
IPC:
message
msgmni
msgmax
msgmnb
shm
shmall
shmmax
shmmni
Common commands:
sar, dstat, vmstat, mpstat, iostat, top, free, iotop, uptime, cat /proc/meminfo, ss, netstat, lsof, time, perf, strace
blktrace, blkparse, btt
dd, iozone, io-stress, fio