Per-task statistics interface
-----------------------------

Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener.
Using cpumasks allows the data received by one listener
to be limited and assists in flow control over the netlink interface and is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad | genlmsghdr  | taskstats payload |
    +----------+- - -+-------------+-------------------+


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
containing a u32 pid or tgid in the attribute payload.
The pid/tgid denotes
the task/process for which userspace wants statistics.

Commands to register/deregister interest in exit data from a set of cpus
consist of one attribute, of type
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
attribute payload. The cpumask is specified as an ascii string of
comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
in cpus before closing the listening socket, the kernel cleans up its interest
set over time. However, for the sake of efficiency, an explicit deregistration
is advisable.

2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates
a pid/tgid will be followed by some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits.
The payload consists of a
series of attributes of the following type:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries to get per-tgid data, the stats of all live threads in
the group are summed and added to the accumulated total for previously exited
threads of the same thread group.

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
compatibility is ensured by the version number within the
structure. Userspace will use only the fields of the struct that correspond
to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
interface to return them.
Since userspace processes each netlink attribute
independently, it can always ignore attributes whose type it does not
understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
each listener. In the extreme case, there could be one listener for each cpu.
Users may also consider setting the cpu affinity of the listener to the subset
of cpus to which it listens, especially if they are listening to just one cpu.

Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
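
As an illustration of spreading cpus over several listeners, the sketch
below divides cpus among a given number of listeners and renders each share
in the comma-separated range format described under Interface (e.g.
"1-3,5,7-8"). It is only a sketch: the helper names format_cpumask and
split_cpus are hypothetical and not part of taskstats or getdelays.c, and it
assumes cpus are numbered 0..ncpus-1 with no holes. Sending the resulting
string in a TASKSTATS_CMD_ATTR_REGISTER_CPUMASK attribute is not shown.

```python
from typing import List


def format_cpumask(cpus: List[int]) -> str:
    """Render cpu ids as comma-separated ranges, e.g. "1-3,5,7-8".

    Hypothetical helper; only the string format comes from the document.
    """
    ids = sorted(set(cpus))
    parts = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        # Extend the current range while the next id is consecutive.
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def split_cpus(ncpus: int, nlisteners: int) -> List[str]:
    """Return one cpumask string per listener.

    Assumes cpus are numbered 0..ncpus-1 (an assumption of this sketch).
    Each listener gets a contiguous share; shares differ by at most one cpu.
    """
    if ncpus < 1 or nlisteners < 1:
        raise ValueError("ncpus and nlisteners must be positive")
    # The extreme case in the text: at most one listener per cpu.
    nlisteners = min(nlisteners, ncpus)
    base, extra = divmod(ncpus, nlisteners)
    masks = []
    first = 0
    for n in range(nlisteners):
        count = base + (1 if n < extra else 0)
        masks.append(format_cpumask(list(range(first, first + count))))
        first += count
    return masks


if __name__ == "__main__":
    print(format_cpumask([1, 2, 3, 5, 7, 8]))  # 1-3,5,7-8
    print(split_cpus(8, 3))                    # ['0-2', '3-5', '6-7']
```

Each returned string would be the payload one listener registers with.
Contiguous shares keep the strings short, and they make it easy to pin a
listener to the same cpus it listens to.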