Memory Resource Controller(Memcg)  Implementation Memo.
Last Updated: 2010/2
Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).

Because VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed.

(*) Topics on API should be in Documentation/cgroups/memory.txt

0. How to record usage?
   2 objects are used.

   page_cgroup ... an object per page.
	Allocated at boot or memory hotplug. Freed at memory hot removal.

   swap_cgroup ... an entry per swp_entry.
	Allocated at swapon(). Freed at swapoff().

   The page_cgroup has a USED bit, so double counting against a page_cgroup
   never occurs. swap_cgroup is used only when a charged page is swapped out.

1. Charge

   A page/swp_entry may be charged (usage += PAGE_SIZE) at

	mem_cgroup_newpage_charge()
	  Called at new page fault and Copy-On-Write.

	mem_cgroup_try_charge_swapin()
	  Called at do_swap_page() (page fault on swap entry) and swapoff.
	  Followed by charge-commit-cancel protocol. (With swap accounting)
	  At commit, a charge recorded in swap_cgroup is removed.

	mem_cgroup_cache_charge()
	  Called at add_to_page_cache()

	mem_cgroup_cache_charge_swapin()
	  Called at shmem's swapin.

	mem_cgroup_prepare_migration()
	  Called before migration. "extra" charge is done and followed by
	  charge-commit-cancel protocol.
	  At commit, charge against oldpage or newpage will be committed.

2. Uncharge
   A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

	mem_cgroup_uncharge_page()
	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
	  to 0. If the page is SwapCache, uncharge is delayed until
	  mem_cgroup_uncharge_swapcache().

	mem_cgroup_uncharge_cache_page()
	  Called when a page-cache is deleted from radix-tree.
	  If the page is SwapCache, uncharge is delayed until
	  mem_cgroup_uncharge_swapcache().

	mem_cgroup_uncharge_swapcache()
	  Called when SwapCache is removed from radix-tree. The charge itself
	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
	  charge to swap occurs.)

	mem_cgroup_uncharge_swap()
	  Called when swp_entry's refcnt goes down to 0. A charge against swap
	  disappears.

	mem_cgroup_end_migration(old, new)
	  At success of migration old is uncharged (if necessary), a charge
	  to new page is committed. At failure, charge to old page is committed.

3. charge-commit-cancel
	In some cases, we can't know whether this "charge" is valid at charging
	time (because of races).
	To handle such cases, there are charge-commit-cancel functions.
		mem_cgroup_try_charge_XXX
		mem_cgroup_commit_charge_XXX
		mem_cgroup_cancel_charge_XXX
	These are used in swap-in and migration.

	At try_charge(), there are no flags to say "this page is charged".
	At this point, usage += PAGE_SIZE.

	At commit(), the function checks whether the page should be charged
	and either sets flags or avoids charging. (usage -= PAGE_SIZE)

	At cancel(), simply usage -= PAGE_SIZE.

In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

4. Anonymous
	An anonymous page is newly allocated at
		  - page fault into MAP_ANONYMOUS mapping.
		  - Copy-On-Write.
	It is charged right after it's allocated, before doing any page table
	related operations. Of course, it's uncharged when another page is used
	for the fault address.

	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
	and pages for ptes are freed one by one. (see mm/memory.c) Uncharges
	are done at page_remove_rmap() when page_mapcount() goes down to 0.

	Another page freeing is by page-reclaim (vmscan.c) and anonymous
	pages are swapped out. In this case, the page is marked as
	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
	as SwapCache().
	It's delayed until __delete_from_swap_cache().

	4.1 Swap-in.
	At swap-in, the page is taken from swap-cache. There are 2 cases.

	(a) If the SwapCache is newly allocated and read, it has no charges.
	(b) If the SwapCache has been mapped by processes, it has been
	    charged already.

	This swap-in is one of the most complicated pieces of work. In
	do_swap_page(), the following events occur when the pte is unchanged.

	(1) the page (SwapCache) is looked up.
	(2) lock_page()
	(3) try_charge_swapin()
	(4) reuse_swap_page() (may call delete_from_swap_cache())
	(5) commit_charge_swapin()
	(6) swap_free().

	Consider the following situations for example.

	(A) The page has not been charged before (2) and reuse_swap_page()
	    doesn't call delete_from_swap_cache().
	(B) The page has not been charged before (2) and reuse_swap_page()
	    calls delete_from_swap_cache().
	(C) The page has been charged before (2) and reuse_swap_page() doesn't
	    call delete_from_swap_cache().
	(D) The page has been charged before (2) and reuse_swap_page() calls
	    delete_from_swap_cache().

	memory.usage/memsw.usage changes to this page/swp_entry will be

	   Case          (A)      (B)      (C)      (D)
	   Event
	   Before (2)    0/ 1     0/ 1     1/ 1     1/ 1
	   ===========================================
	   (3)          +1/+1    +1/+1    +1/+1    +1/+1
	   (4)            -       0/ 0      -      -1/ 0
	   (5)           0/-1     0/ 0    -1/-1     0/ 0
	   (6)            -       0/-1      -       0/-1
	   ===========================================
	   Result        1/ 1     1/ 1     1/ 1     1/ 1

	In any case, charges to this page should be 1/ 1.

	4.2 Swap-out.
	At swap-out, typical state transition is below.

	(a) add to swap cache. (marked as SwapCache)
	    swp_entry's refcnt += 1.
	(b) fully unmapped.
	    swp_entry's refcnt += # of ptes.
	(c) write back to swap.
	(d) delete from swap cache.
	    (remove from SwapCache)
	    swp_entry's refcnt -= 1.


	At (b), the page is marked as SwapCache and not uncharged.
	At (d), the page is removed from SwapCache and a charge in page_cgroup
	is moved to swap_cgroup.

	Finally, at task exit,
	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
	Here, a charge in swap_cgroup disappears.

5. Page Cache
	Page Cache is charged at
	- add_to_page_cache_locked().

	uncharged at
	- __remove_from_page_cache().

	The logic is very clear. (About migration, see below)
	Note: __remove_from_page_cache() is called by remove_from_page_cache()
	and __remove_mapping().

6. Shmem(tmpfs) Page Cache
	Memcg's charge/uncharge have special handlers of shmem. The best way
	to understand shmem's page state transition is to read mm/shmem.c.
	But brief explanation of the behavior of memcg around shmem will be
	helpful to understand the logic.

	Shmem's page (just leaf page, not direct/indirect block) can be on
		- radix-tree of shmem's inode.
		- SwapCache.
		- Both on radix-tree and SwapCache. This happens at swap-in
		  and swap-out.

	It's charged when...
	- A new page is added to shmem's radix-tree.
	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
	It's uncharged when
	- A page is removed from radix-tree and not SwapCache.
	- When SwapCache is removed, a charge is moved to swap_cgroup.
	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
	  disappears.

7. Page Migration
	One of the most complicated functions is page-migration-handler.
	Memcg has 2 routines.
	Assume that we are migrating a page's contents
	from OLDPAGE to NEWPAGE.

	Usual migration logic is..
	(a) remove the page from LRU.
	(b) allocate NEWPAGE (migration target)
	(c) lock by lock_page().
	(d) unmap all mappings.
	(e-1) If necessary, replace entry in radix-tree.
	(e-2) move contents of a page.
	(f) map all mappings again.
	(g) pushback the page to LRU.
	(-) OLDPAGE will be freed.

	Before (g), memcg should complete all necessary charge/uncharge to
	NEWPAGE/OLDPAGE.

	The point is....
	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
	  try_to_unmap() drops all mapcount and the page will not be
	  SwapCache.

	- If OLDPAGE is SwapCache, charges will be kept at (g) because
	  __delete_from_swap_cache() isn't called at (e-1)

	- If OLDPAGE is page-cache, charges will be kept at (g) because
	  remove_from_page_cache() isn't called at (e-1)

	memcg provides following hooks.

	- mem_cgroup_prepare_migration(OLDPAGE)
	  Called after (b) to account a charge (usage += PAGE_SIZE) against
	  memcg which OLDPAGE belongs to.

	- mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
	  Called after (f) before (g).
	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
	  charged, a charge by prepare_migration() is automatically canceled.
	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.

	  Because zap_pte() (by exit or munmap) can be called during migration,
	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().

8. LRU
	Each memcg has its own private LRU. Now, its handling is under global
	VM's control (means that it's handled under global zone->lru_lock).
	Almost all routines around memcg's LRU are called by global LRU's
	list management functions under zone->lru_lock.

	A special function is mem_cgroup_isolate_pages().
	This scans memcg's private LRU and calls __isolate_lru_page() to extract
	a page from LRU.
	(By __isolate_lru_page(), the page is removed from both the global and
	 the private LRU.)


9. Typical Tests.

 Tests for racy cases.

 9.1 Small limit to memcg.
	When you test racy cases, it's a good test to set memcg's limit
	to be very small rather than GB. Many races were found in tests under
	xKB or xxMB limits.
	(Memory behavior under GB and memory behavior under MB show very
	different situations.)

 9.2 Shmem
	Historically, memcg's shmem handling was poor and we saw some amount
	of trouble here. This is because shmem is page-cache but can be
	SwapCache. Testing with shmem/tmpfs is always a good test.

 9.3 Migration
	For NUMA, migration is another special case. For easy testing, cpuset
	is useful. Following is a sample script to do migration.

	mount -t cgroup -o cpuset none /opt/cpuset

	mkdir /opt/cpuset/01
	echo 1 > /opt/cpuset/01/cpuset.cpus
	echo 0 > /opt/cpuset/01/cpuset.mems
	echo 1 > /opt/cpuset/01/cpuset.memory_migrate
	mkdir /opt/cpuset/02
	echo 1 > /opt/cpuset/02/cpuset.cpus
	echo 1 > /opt/cpuset/02/cpuset.mems
	echo 1 > /opt/cpuset/02/cpuset.memory_migrate

	In the above setup, when you move a task from 01 to 02, page migration
	from node 0 to node 1 will occur.
	Following is a script to migrate all under cpuset.
	--
	# G1/G2: source and destination cpusets (assumed here to be the
	# two groups created above).
	G1=/opt/cpuset/01
	G2=/opt/cpuset/02

	# Move every pid listed in $1 into the cpuset directory $2.
	move_task()
	{
	for pid in $1
	do
		/bin/echo $pid >$2/tasks 2>/dev/null
		echo -n $pid
		echo -n " "
	done
	echo END
	}

	G1_TASK=`cat ${G1}/tasks`
	G2_TASK=`cat ${G2}/tasks`
	move_task "${G1_TASK}" ${G2} &
	--
 9.4 Memory hotplug.
	Memory hotplug is another good test.
	To offline memory, do the following.
	# echo offline > /sys/devices/system/memory/memoryXXX/state
	(XXX is the place of memory)
	This is an easy way to test page migration, too.

 9.5 mkdir/rmdir
	When using hierarchy, mkdir/rmdir test should be done.
	Use tests like the following.

	echo 1 >/opt/cgroup/01/memory.use_hierarchy
	mkdir /opt/cgroup/01/child_a
	mkdir /opt/cgroup/01/child_b

	set limit to 01.
	add limit to 01/child_b
	run jobs under child_a and child_b

	create/delete following groups at random while jobs are running.
	/opt/cgroup/01/child_a/child_aa
	/opt/cgroup/01/child_b/child_bb
	/opt/cgroup/01/child_c

	running new jobs in new group is also good.

 9.6 Mount with other subsystems.
	Mounting with other subsystems is a good test because there is a
	race and lock dependency with other cgroup subsystems.

	example)
	# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

	and do task move, mkdir, rmdir etc...under this.

 9.7 swapoff.
	Besides the fact that swap management is one of the complicated parts
	of memcg, the call path of swap-in at swapoff is not the same as the
	usual swap-in path. It's worth testing explicitly.

	For example, a test like the following is good.
	(Shell-A)
	# mount -t cgroup none /cgroup -o memory
	# mkdir /cgroup/test
	# echo 40M > /cgroup/test/memory.limit_in_bytes
	# echo 0 > /cgroup/test/tasks
	Run malloc(100M) program under this. You'll see 60M of swaps.
	(Shell-B)
	# move all tasks in /cgroup/test to /cgroup
	# /sbin/swapoff -a
	# rmdir /cgroup/test
	# kill malloc task.

	Of course, tmpfs vs.
	swapoff should be tested, too.

 9.8 OOM-Killer
	Out-of-memory caused by memcg's limit will kill tasks under
	the memcg. When hierarchy is used, a task under hierarchy
	will be killed by the kernel.
	In this case, panic_on_oom shouldn't be invoked and tasks
	in other groups shouldn't be killed.

	It's not difficult to cause OOM under memcg as follows.
	Case A) when you can swapoff
	#swapoff -a
	#echo 50M > /memory.limit_in_bytes
	run 51M of malloc

	Case B) when you use mem+swap limitation.
	#echo 50M > memory.limit_in_bytes
	#echo 50M > memory.memsw.limit_in_bytes
	run 51M of malloc

 9.9 Move charges at task migration
	Charges associated with a task can be moved along with task migration.

	(Shell-A)
	#mkdir /cgroup/A
	#echo $$ >/cgroup/A/tasks
	run some programs which use some amount of memory in /cgroup/A.

	(Shell-B)
	#mkdir /cgroup/B
	#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
	#echo "pid of the program running in group A" >/cgroup/B/tasks

	You can see charges have been moved by reading *.usage_in_bytes or
	memory.stat of both A and B.
	See 8.2 of Documentation/cgroups/memory.txt to see what value should be
	written to move_charge_at_immigrate.

 9.10 Memory thresholds
	Memory controller implements memory thresholds using cgroups notification
	API. You can use Documentation/cgroups/cgroup_event_listener.c to test
	it.

	(Shell-A) Create cgroup and run event listener
	# mkdir /cgroup/A
	# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

	(Shell-B) Add task to cgroup and try to allocate and free memory
	# echo $$ >/cgroup/A/tasks
	# a="$(dd if=/dev/zero bs=1M count=10)"
	# a=

	You will see a message from cgroup_event_listener every time you cross
	the thresholds.

	Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

	It's a good idea to test the root cgroup as well.
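
	When checking the threshold test in 9.10 by hand, it can help to
	compare two samples of usage_in_bytes and predict whether an event
	should have fired. The following is a minimal sketch, not part of the
	kernel tree: to_bytes and crossed are hypothetical helper names, and the
	boundary rule (usage equal to the threshold counts as "above") is an
	assumption about the kernel's behavior.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking threshold tests by hand.

# to_bytes SIZE: convert "5M", "40K", "1G" or a plain number into bytes.
to_bytes()
{
	case "$1" in
	*[Kk]) echo $(( ${1%[Kk]} * 1024 )) ;;
	*[Mm]) echo $(( ${1%[Mm]} * 1024 * 1024 )) ;;
	*[Gg]) echo $(( ${1%[Gg]} * 1024 * 1024 * 1024 )) ;;
	*) echo "$1" ;;
	esac
}

# crossed PREV CUR THRESHOLD: succeed if usage moved across THRESHOLD,
# in either direction, between two samples (all values in bytes).
# Assumption: usage == THRESHOLD is treated as being above it.
crossed()
{
	if [ "$1" -lt "$3" ] && [ "$2" -ge "$3" ]; then
		return 0
	fi
	if [ "$1" -ge "$3" ] && [ "$2" -lt "$3" ]; then
		return 0
	fi
	return 1
}

# Example (needs the cgroup from 9.10, so it is left commented out):
#   prev=$(cat /cgroup/A/memory.usage_in_bytes)
#   ... allocate or free memory in Shell-B ...
#   cur=$(cat /cgroup/A/memory.usage_in_bytes)
#   crossed "$prev" "$cur" "$(to_bytes 5M)" && echo "expect an event"
```

	Two samples can miss a crossing that happened and reverted between
	them, so this only confirms events, it cannot rule them out.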