2.  Kernel primitives

     

     

      The facilities available to a UNIX user process are logically divided into two parts: kernel facilities directly implemented by UNIX code running in the operating system, and system facilities implemented either by the system, or in cooperation with a server process. The kernel facilities are described in this section.

      The facilities implemented in the kernel are those which define the UNIX virtual machine in which each process runs. Like many real machines, this virtual machine has memory management hardware, an interrupt facility, timers and counters. The UNIX virtual machine also allows access to files and other objects through a set of descriptors. Each descriptor resembles a device controller, and supports a set of operations. Like devices on real machines, some of which are internal to the machine and some of which are external, parts of the descriptor machinery are built into the operating system, while other parts are often implemented in server processes on other machines. The facilities provided through the descriptor machinery are described in section 2.

2.1.  Processes and protection

     

     

2.1.1.  Host and process identifiers

      Each UNIX host has associated with it a 32-bit host id, and a host name of up to 64 characters (as defined by MAXHOSTNAMELEN in <sys/param.h>). These are set (by a privileged user) and returned by the calls:

sethostid(hostid)
long hostid;

hostid = gethostid();
result long hostid;

sethostname(name, len)
char *name; int len;

len = gethostname(buf, buflen)
result int len; result char *buf; int buflen;
On each host runs a set of processes. Each process is largely independent of other processes, having its own protection domain, address space, timers, and an independent set of references to system or user implemented objects.

      Each process in a host is named by an integer called the process id. This number is in the range 1-30000 and is returned by the getpid routine:

pid = getpid();
result int pid;
On each UNIX host this identifier is guaranteed to be unique; in a multi-host environment, the (hostid, process id) pairs are guaranteed unique.
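
      As a minimal sketch of how the host name and process id might be combined into a label that identifies a process across hosts, the helper below formats both into a caller-supplied buffer. The name host_pid_label and the 256-byte name buffer are assumptions made for illustration; the sketch uses the host name rather than the numeric host id.

```c
#define _DEFAULT_SOURCE          /* expose gethostname() on glibc */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write "<hostname>:<pid>" into out (hypothetical helper).
 * Returns 0 on success, -1 if the name cannot be read or does not fit.
 */
int host_pid_label(char *out, size_t outlen)
{
    char host[256];              /* assumed to be at least MAXHOSTNAMELEN */
    int n;

    assert(out != NULL && outlen > 0);
    if (gethostname(host, sizeof host) != 0)
        return -1;
    host[sizeof host - 1] = '\0';    /* a truncated name may be unterminated */
    n = snprintf(out, outlen, "%s:%ld", host, (long)getpid());
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A caller would pass a buffer somewhat larger than the host name limit, for example char label[300], and treat a return of -1 as "label unavailable".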

2.1.2.  Process creation and termination

      A new process is created by making a logical duplicate of an existing process:

pid = fork();
result int pid;
The fork call returns twice, once in the parent process, where pid is the process identifier of the child, and once in the child process where pid is 0. The parent-child relationship induces a hierarchical structure on the set of processes in the system.

      A process may terminate by executing an exit call:

exit(status)
int status;
returning 8 bits of exit status to its parent.

      When a child process exits or terminates abnormally, the parent process receives information about any event which caused termination of the child process. A second call provides a non-blocking interface and may also be used to retrieve information about resources consumed by the process during its lifetime.

#include <sys/wait.h>

pid = wait(astatus);
result int pid; result union wait *astatus;

pid = wait3(astatus, options, arusage);
result int pid; result union wait *astatus;
int options; result struct rusage *arusage;

      A process can overlay itself with the memory image of another program, passing the newly loaded program a set of parameters, using the call:

execve(name, argv, envp)
char *name, **argv, **envp;
The specified name must be a file which is in a format recognized by the system, either a binary executable file or a file which causes the execution of a specified interpreter program to process its contents.
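
      The three calls above are usually combined: the parent forks, the child overlays itself with a new program, and the parent waits for the exit status. The following is a sketch under the assumption of a current POSIX system, where the status is a plain int examined with the WIFEXITED and WEXITSTATUS macros rather than a union wait. The helper name run_and_wait and the use of 127 for a failed execve (a shell convention) are illustrative choices, not part of the interface described here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at path in a child process and return its 8-bit exit
 * status. Returns 127 if execve failed in the child, and -1 if fork or
 * wait failed or the child was killed by a signal.
 */
int run_and_wait(const char *path, char *const argv[], char *const envp[])
{
    int status;
    pid_t pid, w;

    assert(path != NULL && argv != NULL);
    pid = fork();
    if (pid < 0)
        return -1;                 /* fork failed */
    if (pid == 0) {                /* child: fork returned 0 */
        execve(path, argv, envp);
        _exit(127);                /* reached only if execve failed */
    }
    /* parent: fork returned the child's pid; reap exactly that child */
    do {
        w = wait(&status);
    } while (w != pid && w != -1);
    if (w == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, calling run_and_wait with "/bin/sh" and the argument vector { "sh", "-c", "exit 7", NULL } would be expected to return 7.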

2.1.3.  User and group ids

      Each process in the system has associated with it two user-id's: a real user id and an effective user id, both 16 bit unsigned integers (type uid_t). Each process has a real accounting group id, an effective accounting group id, and a set of access group id's. The group id's are 16 bit unsigned integers (type gid_t). Each process may be in several different access groups, with the maximum concurrent number of access groups a system compilation parameter, the constant NGROUPS in the file <sys/param.h>, guaranteed to be at least 8.

      The real and effective user ids associated with a process are returned by:

ruid = getuid();
result uid_t ruid;

euid = geteuid();
result uid_t euid;
the real and effective accounting group ids by:
rgid = getgid();
result gid_t rgid;

egid = getegid();
result gid_t egid;
The access group id set is returned by a getgroups call*:
ngroups = getgroups(gidsetsize, gidset);
result int ngroups; int gidsetsize; result int gidset[gidsetsize];

      The user and group id's are assigned at login time using the setreuid, setregid, and setgroups calls:

setreuid(ruid, euid);
int ruid, euid;

setregid(rgid, egid);
int rgid, egid;

setgroups(gidsetsize, gidset)
int gidsetsize; int gidset[gidsetsize];
The setreuid call sets both the real and effective user-id's, while the setregid call sets both the real and effective accounting group id's. Unless the caller is the super-user, ruid must be equal to either the current real or effective user-id, and rgid equal to either the current real or effective accounting group id. The setgroups call is restricted to the super-user.

2.1.4.  Process groups

      Each process in the system is also normally associated with a process group. The group of processes in a process group is sometimes referred to as a job and manipulated by high-level system software (such as the shell). The current process group of a process is returned by the getpgrp call:

pgrp = getpgrp(pid);
result int pgrp; int pid;
When a process is in a specific process group it may receive software interrupts affecting the group, causing the group to suspend or resume execution or to be interrupted or terminated. In particular, a system terminal has a process group and only processes which are in the process group of the terminal may read from the terminal, allowing arbitration of terminals among several different jobs.

      The process group associated with a process may be changed by the setpgrp call:

setpgrp(pid, pgrp);
int pid, pgrp;
Newly created processes are assigned process id's distinct from all processes and process groups, and the same process group as their parent. A normal (unprivileged) process may set its process group equal to its process id. A privileged process may set the process group of any process to any value.

2.2.  Memory management**

     

     

2.2.1.  Text, data and stack

      Each process begins execution with three logical areas of memory called text, data and stack. The text area is read-only and shared, while the data and stack areas are private to the process. Both the data and stack areas may be extended and contracted on program request. The call

addr = sbrk(incr);
result caddr_t addr; int incr;
changes the size of the data area by incr bytes and returns the new end of the data area, while
addr = sstk(incr);
result caddr_t addr; int incr;
changes the size of the stack area. The stack area is also automatically extended as needed. On the VAX the text and data areas are adjacent in the P0 region, while the stack section is in the P1 region, and grows downward.

2.2.2.  Mapping pages

      The system supports sharing of data between processes by allowing pages to be mapped into memory. These mapped pages may be shared with other processes or private to the process. Protection and sharing options are defined in <sys/mman.h> as:

/* protections are chosen from these bits, or-ed together */
#define	PROT_READ	0x04	/* pages can be read */
#define	PROT_WRITE	0x02	/* pages can be written */
#define	PROT_EXEC	0x01	/* pages can be executed */
/* flags contain mapping type, sharing type and options */
/* mapping type; choose one */
#define MAP_FILE	0x0001	/* mapped from a file or device */
#define MAP_ANON	0x0002	/* allocated from memory, swap space */
#define MAP_TYPE	0x000f	/* mask for type field */
/* sharing types; choose one */
#define	MAP_SHARED	0x0010	/* share changes */
#define	MAP_PRIVATE	0x0000	/* changes are private */
/* other flags */
#define MAP_FIXED	0x0020	/* map addr must be exactly as requested */
#define MAP_INHERIT	0x0040	/* region is retained after exec */
#define MAP_HASSEMAPHORE	0x0080	/* region may contain semaphores */
#define MAP_NOPREALLOC	0x0100	/* do not preallocate space */
The cpu-dependent size of a page is returned by the getpagesize system call:
pagesize = getpagesize();
result int pagesize;

The call:

maddr = mmap(addr, len, prot, flags, fd, pos);
result caddr_t maddr; caddr_t addr; int *len, prot, flags, fd; off_t pos;
causes the pages starting at addr and continuing for at most len bytes to be mapped from the object represented by descriptor fd, starting at byte offset pos. The starting address of the region is returned; for the convenience of the system, it may differ from that supplied unless the MAP_FIXED flag is given, in which case the exact address will be used or the call will fail. The actual amount mapped is returned in len. The addr, len, and pos parameters must all be multiples of the pagesize. A successful mmap will delete any previous mapping in the allocated address range.

      The parameter prot specifies the accessibility of the mapped pages. The parameter flags specifies the type of object to be mapped, mapping options, and whether modifications made to this mapped copy of the page are to be kept private, or are to be shared with other references. Possible types include MAP_FILE, mapping a regular file or character-special device memory, and MAP_ANON, which maps memory not associated with any specific file. The file descriptor used for creating MAP_ANON regions is used only for naming, and may be given as -1 if no name is associated with the region.***

      The MAP_INHERIT flag allows a region to be inherited after an exec. The MAP_HASSEMAPHORE flag allows special handling for regions that may contain semaphores. The MAP_NOPREALLOC flag allows processes to allocate regions whose virtual address space, if fully allocated, would exceed the available memory plus swap resources. A process using such a region may receive a SIGSEGV signal if it page faults and resources are not available to service the request; typically it would free up some resources via munmap so that, when it returns from the signal handler, the page fault can be completed successfully.

      A facility is provided to synchronize a mapped region with the file it maps; the call

msync(addr, len);
caddr_t addr; int len;
writes any modified pages back to the filesystem and updates the file modification time. If len is 0, all modified pages within the region containing addr will be flushed; if len is non-zero, only the pages containing addr and len succeeding locations will be examined. Any required synchronization of memory caches will also take place at this time. Filesystem operations on a file that is mapped for shared modifications are unpredictable except after an msync.

      A mapping can be removed by the call

munmap(addr, len);
caddr_t addr; int len;
This call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references.
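
      A short sketch of a shared anonymous mapping may make the MAP_SHARED semantics concrete: a page mapped before a fork is visible to both parent and child, so a store by the child is seen by the parent. The sketch assumes the mmap interface of current POSIX systems, in which len is passed by value rather than through a pointer; the helper name share_int_with_child is hypothetical.

```c
#define _DEFAULT_SOURCE          /* expose MAP_ANON and getpagesize() on glibc */
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Share one anonymous page between parent and child: the child stores
 * value, and the parent reads it back after the child has exited.
 * Returns the value seen by the parent, or -1 on failure.
 */
int share_int_with_child(int value)
{
    size_t len = (size_t)getpagesize();
    int *shared, seen, status;
    pid_t pid;

    assert(value >= 0);            /* -1 is reserved for errors */
    shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_ANON | MAP_SHARED, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = -1;
    pid = fork();
    if (pid == 0) {
        *shared = value;           /* visible to the parent: MAP_SHARED */
        _exit(0);
    }
    if (pid < 0 || waitpid(pid, &status, 0) < 0) {
        munmap(shared, len);
        return -1;
    }
    seen = *shared;
    munmap(shared, len);           /* later references would now fault */
    return seen;
}
```

With MAP_PRIVATE in place of MAP_SHARED the child's store would stay private to the child, and the parent would still read the initial -1.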

2.2.3.  Page protection control

      A process can control the protection of pages using the call

mprotect(addr, len, prot);
caddr_t addr; int len, prot;
This call changes the specified pages to have protection prot. Not all implementations will guarantee protection on a page basis; the granularity of protection changes may be as large as an entire region.

2.2.4.  Giving and getting advice

      A process that has knowledge of its memory behavior may use the madvise call:

madvise(addr, len, behav);
caddr_t addr; int len, behav;
Behav describes expected behavior, as given in <sys/mman.h>:
#define	MADV_NORMAL	0	/* no further special treatment */
#define	MADV_RANDOM	1	/* expect random page references */
#define	MADV_SEQUENTIAL	2	/* expect sequential references */
#define	MADV_WILLNEED	3	/* will need these pages */
#define	MADV_DONTNEED	4	/* don't need these pages */
#define	MADV_SPACEAVAIL	5	/* insure that resources are reserved */
Finally, a process may obtain information about whether pages are core resident by using the call
mincore(addr, len, vec)
caddr_t addr; int len; result char *vec;
Here the current core residency of the pages is returned in the character array vec, with a value of 1 meaning that the page is in-core.

2.2.5.  Synchronization primitives

      Primitives are provided for synchronization using semaphores in shared memory. Semaphores must lie within a MAP_SHARED region with at least modes PROT_READ and PROT_WRITE. The MAP_HASSEMAPHORE flag must have been specified when the region was created. To acquire a lock a process calls:

value = mset(sem, wait)
result int value; semaphore *sem; int wait;
Mset indivisibly tests and sets the semaphore sem. If the previous value is zero, the process has acquired the lock and mset returns true immediately. Otherwise, if the wait flag is zero, failure is returned. If wait is true and the previous value is non-zero, mset relinquishes the processor until notified that it should retry.

To release a lock a process calls:

mclear(sem)
semaphore *sem;
Mclear indivisibly tests and clears the semaphore sem. If the ``WANT'' flag is zero in the previous value, mclear returns immediately. If the ``WANT'' flag is non-zero in the previous value, mclear arranges for waiting processes to retry before returning.

      Two routines provide services analogous to the kernel sleep and wakeup functions interpreted in the domain of shared memory. A process may relinquish the processor by calling msleep with a set semaphore:

msleep(sem)
semaphore *sem;
If the semaphore is still set when it is checked by the kernel, the process will be put in a sleeping state until some other process issues an mwakeup for the same semaphore within the region using the call:
mwakeup(sem)
semaphore *sem;
An mwakeup may awaken all sleepers on the semaphore, or may awaken only the next sleeper on a queue.

2.3.  Signals

     

     

     

2.3.1.  Overview

      The system defines a set of signals that may be delivered to a process. Signal delivery resembles the occurrence of a hardware interrupt: the signal is blocked from further occurrence, the current process context is saved, and a new one is built. A process may specify the handler to which a signal is delivered, or specify that the signal is to be blocked or ignored. A process may also specify that a default action is to be taken when signals occur.

      Some signals will cause a process to exit when they are not caught. This may be accompanied by creation of a core image file, containing the current memory image of the process for use in post-mortem debugging. A process may choose to have signals delivered on a special stack, so that sophisticated software stack manipulations are possible.

      All signals have the same priority. If multiple signals are pending simultaneously, the order in which they are delivered to a process is implementation specific. Signal routines execute with the signal that caused their invocation blocked, but other signals may yet occur. Mechanisms are provided whereby critical sections of code may protect themselves against the occurrence of specified signals.

2.3.2.  Signal types

      The signals defined by the system fall into one of five classes: hardware conditions, software conditions, input/output notification, process control, or resource control. The set of signals is defined in the file <signal.h>.

      Hardware signals are derived from exceptional conditions which may occur during execution. Such signals include SIGFPE representing floating point and other arithmetic exceptions, SIGILL for illegal instruction execution, SIGSEGV for addresses outside the currently assigned area of memory, and SIGBUS for accesses that violate memory protection constraints. Other, more cpu-specific hardware signals exist, such as those for the various customer-reserved instructions on the VAX (SIGIOT, SIGEMT, and SIGTRAP).

      Software signals reflect interrupts generated by user request: SIGINT for the normal interrupt signal; SIGQUIT for the more powerful quit signal, that normally causes a core image to be generated; SIGHUP and SIGTERM that cause graceful process termination, either because a user has ``hung up'', or by user or program request; and SIGKILL, a more powerful termination signal which a process cannot catch or ignore. Programs may define their own asynchronous events using SIGUSR1 and SIGUSR2. Other software signals (SIGALRM, SIGVTALRM, SIGPROF) indicate the expiration of interval timers.

      A process can request notification via a SIGIO signal when input or output is possible on a descriptor, or when a non-blocking operation completes. A process may request to receive a SIGURG signal when an urgent condition arises.

      A process may be stopped by a signal sent to it or the members of its process group. The SIGSTOP signal is a powerful stop signal, because it cannot be caught. Other stop signals SIGTSTP, SIGTTIN, and SIGTTOU are used when a user request, input request, or output request respectively is the reason for stopping the process. A SIGCONT signal is sent to a process when it is continued from a stopped state. Processes may receive notification with a SIGCHLD signal when a child process changes state, either by stopping or by terminating.

      Exceeding resource limits may cause signals to be generated. SIGXCPU occurs when a process nears its CPU time limit and SIGXFSZ warns that the limit on file size creation has been reached.

2.3.3.  Signal handlers

      A process has a handler associated with each signal. The handler controls the way the signal is delivered. The call

#include <signal.h>

struct sigvec {
	int	(*sv_handler)();
	int	sv_mask;
	int	sv_flags;
};

sigvec(signo, sv, osv)
int signo; struct sigvec *sv; result struct sigvec *osv;
assigns interrupt handler address sv_handler to signal signo. Each handler address specifies either an interrupt routine for the signal, that the signal is to be ignored, or that a default action (usually process termination) is to occur if the signal occurs. The constants SIG_IGN and SIG_DFL used as values for sv_handler cause ignoring or defaulting of a condition. The sv_mask value specifies the signal mask to be used when the handler is invoked; it implicitly includes the signal which invoked the handler. Signal masks include one bit for each signal; the mask for a signal signo is provided by the macro sigmask(signo), from <signal.h>. Sv_flags specifies whether system calls should be restarted if the signal handler returns and whether the handler should operate on the normal run-time stack or a special signal stack (see below). If osv is non-zero, the previous signal vector is returned.

      When a signal condition arises for a process, the signal is added to a set of signals pending for the process. If the signal is not currently blocked by the process then it will be delivered. The process of signal delivery adds the signal to be delivered and those signals specified in the associated signal handler's sv_mask to a set of those masked for the process, saves the current process context, and places the process in the context of the signal handling routine. The call is arranged so that if the signal handling routine exits normally the signal mask will be restored and the process will resume execution in the original context. If the process wishes to resume in a different context, then it must arrange to restore the signal mask itself.

      The mask of blocked signals is independent of handlers for signals. It delays signals from being delivered much as a raised hardware interrupt priority level delays hardware interrupts. Preventing an interrupt from occurring by changing the handler is analogous to disabling a device from further interrupts.

      The signal handling routine sv_handler is called by a C call of the form

(*sv_handler)(signo, code, scp);
int signo; long code; struct sigcontext *scp;
The signo gives the number of the signal that occurred, and the code, a word of information supplied by the hardware. The scp parameter is a pointer to a machine-dependent structure containing the information for restoring the context before the signal.
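
      On current POSIX systems the role of sigvec is played by sigaction, whose sa_handler, sa_mask and sa_flags fields correspond closely to sv_handler, sv_mask and sv_flags. The sketch below, written under that assumption, installs a handler, raises the signal once, and restores the previous disposition. The names record_signal and catch_once are hypothetical, and the one-argument handler form is used instead of the three-argument form shown above.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t last_signal = 0;

/* Handler: only records which signal arrived. */
static void record_signal(int signo)
{
    last_signal = signo;
}

/*
 * Install record_signal for signo, raise the signal once, then restore the
 * previous disposition. Returns the signal number the handler recorded,
 * or -1 if the handler could not be installed.
 */
int catch_once(int signo)
{
    struct sigaction sa, old;

    assert(signo != SIGKILL && signo != SIGSTOP);   /* cannot be caught */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = record_signal;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGUSR2);   /* also block SIGUSR2 in the handler */
    sa.sa_flags = SA_RESTART;          /* restart interrupted system calls */
    last_signal = 0;
    if (sigaction(signo, &sa, &old) != 0)
        return -1;
    raise(signo);                      /* handler runs before raise returns */
    sigaction(signo, &old, NULL);
    return (int)last_signal;
}
```

The handler does nothing but store to a volatile sig_atomic_t, which is the conservative choice for code that runs in signal context.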

2.3.4.  Sending signals

      A process can send a signal to another process or group of processes with the calls:

kill(pid, signo)
int pid, signo;

killpg(pgrp, signo)
int pgrp, signo;
Unless the process sending the signal is privileged, it must have the same effective user id as the process receiving the signal.

      Signals are also sent implicitly from a terminal device to the process group associated with the terminal when certain input characters are typed.
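
      As an illustration of kill, the sketch below creates a child that simply sleeps, sends it a signal, and reports which signal terminated it. It assumes a POSIX system with waitpid and the WIFSIGNALED and WTERMSIG macros; signal_child is a hypothetical helper, and signo must be a signal whose default action terminates the process, since the parent otherwise waits indefinitely.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a sleeping child, send it signo, and return the number of the
 * signal that terminated it (0 if it exited normally, -1 on failure).
 */
int signal_child(int signo)
{
    int status;
    pid_t pid;

    assert(signo > 0);                 /* signo 0 only probes; nothing is sent */
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();                   /* child: sleep until a signal arrives */
    }
    if (kill(pid, signo) != 0) {
        kill(pid, SIGKILL);            /* do not leave the child behind */
        waitpid(pid, &status, 0);
        return -1;
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Because parent and child share the same user id, the permission check described above succeeds without privilege.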

2.3.5.  Protecting critical sections

      To block a section of code against one or more signals, a sigblock call may be used to add a set of signals to the existing mask, returning the old mask:

oldmask = sigblock(mask);
result long oldmask; long mask;
The old mask can then be restored later with sigsetmask,
oldmask = sigsetmask(mask);
result long oldmask; long mask;
The sigblock call can be used to read the current mask by specifying an empty mask.

      It is possible to check conditions with some signals blocked, and then to pause until a signal arrives, with the given mask in effect for the duration of the wait and the previous mask restored afterwards, by using:

sigpause(mask);
long mask;
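
      The block and restore pattern can be sketched with sigprocmask, which on current POSIX systems covers the roles of both sigblock (SIG_BLOCK) and sigsetmask (SIG_SETMASK). This is an illustration under that assumption; deferred_delivery and note_delivery are hypothetical names. The signal raised inside the critical section stays pending and is delivered only when the old mask is restored.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t delivered = 0;

static void note_delivery(int signo)
{
    (void)signo;
    delivered = 1;
}

/*
 * Returns 1 if signo stayed pending while blocked and was delivered once
 * the old mask was restored, 0 otherwise, -1 on error.
 */
int deferred_delivery(int signo)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, pending;
    int held, ran;

    assert(signo != SIGKILL && signo != SIGSTOP);   /* cannot be blocked */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_delivery;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old_sa) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, signo);
    sigprocmask(SIG_BLOCK, &block, &old_mask);      /* enter critical section */
    delivered = 0;
    raise(signo);                                   /* becomes pending */
    sigpending(&pending);
    held = sigismember(&pending, signo) == 1 && !delivered;
    sigprocmask(SIG_SETMASK, &old_mask, NULL);      /* leave: signal delivered */
    ran = delivered;

    sigaction(signo, &old_sa, NULL);
    return held && ran;
}
```

The sketch assumes the caller did not already have signo blocked; if it did, restoring the old mask would leave the signal pending.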

2.3.6.  Signal stacks

      Applications that maintain complex or fixed size stacks can use the call

struct sigstack {
	caddr_t	ss_sp;
	int	ss_onstack;
};

sigstack(ss, oss)
struct sigstack *ss; result struct sigstack *oss;
to provide the system with a stack based at ss_sp for delivery of signals. The value ss_onstack indicates whether the process is currently on the signal stack, a notion maintained in software by the system.

      When a signal is to be delivered, the system checks whether the process is on a signal stack. If not, then the process is switched to the signal stack for delivery, with the return from the signal arranged to restore the previous stack.

      If the process wishes to take a non-local exit from the signal routine, or run code from the signal stack that uses a different stack, a sigstack call should be used to reset the signal stack.

2.4.  Timers

     

     

2.4.1.  Real time

      The system's notion of the current Greenwich time and the current time zone are set and returned by the calls:

#include <sys/time.h>

settimeofday(tp, tzp);
struct timeval *tp;
struct timezone *tzp;

gettimeofday(tp, tzp);
result struct timeval *tp;
result struct timezone *tzp;
where the structures are defined in <sys/time.h> as:
struct timeval {
	long	tv_sec;	/* seconds since Jan 1, 1970 */
	long	tv_usec;	/* and microseconds */
};

struct timezone {
	int	tz_minuteswest;	/* of Greenwich */
	int	tz_dsttime;	/* type of dst correction to apply */
};
The precision of the system clock is hardware dependent. Earlier versions of UNIX contained only a 1-second resolution version of this call, which remains as a library routine:
time(tvsec)
result long *tvsec;
returning only the tv_sec field from the gettimeofday call.
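
      A common use of gettimeofday is measuring elapsed time from two readings. The following sketch shows the arithmetic on the two timeval fields; the helper names elapsed_usec and read_clock are hypothetical, and the timezone argument is passed as a null pointer on the assumption that only the time is wanted.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Microseconds from *start to *end; negative if end precedes start. */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    assert(start != NULL && end != NULL);
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Read the current time into *tp; returns 0 on success. */
int read_clock(struct timeval *tp)
{
    assert(tp != NULL);
    return gettimeofday(tp, NULL);
}
```

The microsecond difference may be negative when the second reading has a smaller tv_usec; adding it to the scaled difference in seconds handles that case without an explicit borrow.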

2.4.2.  Interval time

      The system provides each process with three interval timers, defined in <sys/time.h>:

#define	ITIMER_REAL	0	/* real time intervals */
#define	ITIMER_VIRTUAL	1	/* virtual time intervals */
#define	ITIMER_PROF	2	/* user and system virtual time */
The ITIMER_REAL timer decrements in real time. It could be used by a library routine to maintain a wakeup service queue. A SIGALRM signal is delivered when this timer expires.

      The ITIMER_VIRTUAL timer decrements in process virtual time. It runs only when the process is executing. A SIGVTALRM signal is delivered when it expires.

      The ITIMER_PROF timer decrements both in process virtual time and when the system is running on behalf of the process. It is designed to be used by processes to statistically profile their execution. A SIGPROF signal is delivered when it expires.

      A timer value is defined by the itimerval structure:

struct itimerval {
	struct	timeval it_interval;	/* timer interval */
	struct	timeval it_value;	/* current value */
};
and a timer is set or read by the call:
getitimer(which, value);
int which; result struct itimerval *value;

setitimer(which, value, ovalue);
int which; struct itimerval *value; result struct itimerval *ovalue;
The third argument to setitimer specifies an optional structure to receive the previous contents of the interval timer. A timer can be disabled by specifying a timer value of 0.

      The system rounds argument timer intervals to be not less than the resolution of its clock. This clock resolution can be determined by loading a very small value into a timer and reading the timer back to see what value resulted.
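
      A one-shot use of ITIMER_REAL might look like the sketch below, which arms the timer and sleeps until SIGALRM has been delivered. It assumes the POSIX sigaction, sigprocmask and sigsuspend calls for the signal handling; wait_for_alarm and on_alarm are hypothetical names. SIGALRM is blocked while the timer is armed so that it cannot arrive before the process starts waiting.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t alarm_fired = 0;

static void on_alarm(int signo)
{
    (void)signo;
    alarm_fired = 1;
}

/*
 * Arm ITIMER_REAL for usec microseconds (one shot) and sleep until the
 * SIGALRM handler has run. Returns 1 on success, -1 on error.
 */
int wait_for_alarm(long usec)
{
    struct sigaction sa, old_sa;
    struct itimerval tv;
    sigset_t block, old_mask, wait_mask;
    int result = 1;

    assert(usec > 0 && usec < 1000000L);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGALRM);    /* mask in effect while sleeping */

    alarm_fired = 0;
    memset(&tv, 0, sizeof tv);         /* it_interval of 0: do not reload */
    tv.it_value.tv_usec = usec;
    if (setitimer(ITIMER_REAL, &tv, NULL) != 0)
        result = -1;
    else
        while (!alarm_fired)
            sigsuspend(&wait_mask);    /* atomically unblock and sleep */

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    return result;
}
```

Setting it_interval to a non-zero value instead would make the timer reload itself and deliver SIGALRM periodically.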

      The alarm system call of earlier versions of UNIX is provided as a library routine using the ITIMER_REAL timer. The process profiling facilities of earlier versions of UNIX remain because it is not always possible to guarantee the automatic restart of system calls after receipt of a signal. The profil call arranges for the kernel to begin gathering execution statistics for a process:

profil(buf, bufsize, offset, scale);
result char *buf; int bufsize, offset, scale;
This begins sampling of the program counter, with statistics maintained in the user-provided buffer.

2.5.  Descriptors

     

     

     

2.5.1.  The reference table

      Each process has access to resources through descriptors. Each descriptor is a handle allowing the process to reference objects such as files, devices and communications links.

      Rather than allowing processes direct access to descriptors, the system introduces a level of indirection, so that descriptors may be shared between processes. Each process has a descriptor reference table, containing pointers to the actual descriptors. The descriptors themselves thus have multiple references, and are reference counted by the system.

      Each process has a fixed size descriptor reference table, where the size is returned by the getdtablesize call:

nds = getdtablesize();
result int nds;
and guaranteed to be at least 20. The entries in the descriptor reference table are referred to by small integers; for example if there are 20 slots they are numbered 0 to 19.

2.5.2.  Descriptor properties

      Each descriptor has a logical set of properties maintained by the system and defined by its type. Each type supports a set of operations; some operations, such as reading and writing, are common to several abstractions, while others are unique. The generic operations applying to many of these types are described in section 2.1. Naming contexts, files and directories are described in section 2.2. Section 2.3 describes communications domains and sockets. Terminals and (structured and unstructured) devices are described in section 2.4.

2.5.3.  Managing descriptor references

      A duplicate of a descriptor reference may be made by doing

new = dup(old);
result int new; int old;
returning a copy of descriptor reference old indistinguishable from the original. The new chosen by the system will be the smallest unused descriptor reference slot. A copy of a descriptor reference may be made in a specific slot by doing
dup2(old, new);
int old, new;
The dup2 call causes the system to deallocate the descriptor reference currently occupying slot new, if any, replacing it with a reference to the same descriptor as old. This deallocation is also performed by:
close(old);
int old;
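
      Because descriptors are reference counted, closing one reference leaves the underlying object open as long as another reference exists. The sketch below illustrates this with a pipe, which is an assumption made for convenience (pipes are not described in this section); write_through_duplicate is a hypothetical helper.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Duplicate a pipe's write end, close the original reference, write one
 * byte through the duplicate, and return the byte read from the other end
 * (or -1 on failure).
 */
int write_through_duplicate(unsigned char byte)
{
    int fds[2], copy, result = -1;
    unsigned char in;

    if (pipe(fds) != 0)
        return -1;
    copy = dup(fds[1]);               /* lands in the smallest unused slot */
    close(fds[1]);                    /* drop the original reference */
    if (copy >= 0) {
        assert(copy != fds[0]);
        /* the pipe is still writable through the remaining reference */
        if (write(copy, &byte, 1) == 1 && read(fds[0], &in, 1) == 1)
            result = in;
        close(copy);
    }
    close(fds[0]);
    return result;
}
```

Shells rely on the same property when redirecting output: dup2 places a reference in slot 1, after which the original slot can be closed.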

2.5.4.  Multiplexing requests

      The system provides a standard way to do synchronous and asynchronous multiplexing of operations.

      Synchronous multiplexing is performed by using the select call to examine the state of multiple descriptors simultaneously, and to wait for state changes on those descriptors. Sets of descriptors of interest are specified as bit masks, as follows:

#include <sys/types.h>

nds = select(nd, in, out, except, tvp);
result int nds; int nd; result fd_set *in, *out, *except;
struct timeval *tvp;

FD_ZERO(&fdset);
FD_SET(fd, &fdset);
FD_CLR(fd, &fdset);
FD_ISSET(fd, &fdset);
int fd; fd_set fdset;
The select call examines the descriptors specified by the sets in, out and except, replacing the specified bit masks by the subsets that select true for input, output, and exceptional conditions respectively (nd indicates the number of file descriptors specified by the bit masks). If any descriptors meet these criteria, then the number of such descriptors is returned in nds and the bit masks are updated.

If none of the specified conditions is true, the operation waits for one of the conditions to arise, blocking at most the amount of time specified by tvp. If tvp is given as 0, the select waits indefinitely.
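
      The difference between a null tvp (wait indefinitely) and a tvp that points to a zero-valued timeval (return immediately) is easy to overlook. The following sketch uses the second form to poll a single descriptor; readable_now is a hypothetical helper, and <sys/select.h> is assumed to be where current systems declare select.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Poll fd for readability without blocking. Returns the value of select:
 * 1 if fd is readable, 0 if it is not, -1 on error.
 */
int readable_now(int fd)
{
    fd_set in;
    struct timeval zero = { 0, 0 };   /* zero timeout: return immediately */

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&in);
    FD_SET(fd, &in);
    /* nd must be one more than the highest descriptor in any set */
    return select(fd + 1, &in, NULL, NULL, &zero);
}
```

Since select overwrites the sets with the ready subsets, a caller that polls repeatedly has to rebuild them with FD_ZERO and FD_SET before each call, as this helper does.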

      Options affecting I/O on a descriptor may be read and set by the call:

dopt = fcntl(d, cmd, arg)
result int dopt; int d, cmd, arg;

/* interesting values for cmd */
#define	F_SETFL	3	/* set descriptor options */
#define	F_GETFL	4	/* get descriptor options */
#define	F_SETOWN	5	/* set descriptor owner (pid/pgrp) */
#define	F_GETOWN	6	/* get descriptor owner (pid/pgrp) */
The F_SETFL cmd may be used to set a descriptor in non-blocking I/O mode and/or enable signaling when I/O is possible. F_SETOWN may be used to specify a process or process group to be signaled when using the latter mode of operation or when urgent indications arise.

      Operations on non-blocking descriptors will either complete immediately, note an error EWOULDBLOCK, partially complete an input or output operation returning a partial count, or return an error EINPROGRESS noting that the requested operation is in progress. A descriptor which has signaling enabled will cause the specified process and/or process group to be signaled, with a SIGIO for input, output, or in-progress operation complete, or a SIGURG for exceptional conditions.

      For example, when writing to a terminal using non-blocking output, the system will accept only as much data as there is buffer space for and return; when making a connection on a socket, the operation may return indicating that the connection establishment is ``in progress''. The select facility can be used to determine when further output is possible on the terminal, or when the connection establishment attempt is complete.
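
      Placing a descriptor in non-blocking mode with F_SETFL can be sketched as below. The option flag O_NONBLOCK is the POSIX name and is an assumption here, since the text above does not name the flag; set_nonblocking and read_would_block are hypothetical helpers. On many systems EAGAIN and EWOULDBLOCK are the same value, so both are checked.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Enable non-blocking I/O on fd, keeping its other options. 0 on success. */
int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Attempt to read one byte from fd. Returns 1 if the read failed because
 * it would have blocked, 0 otherwise. Note that a byte is consumed when
 * one is available.
 */
int read_would_block(int fd)
{
    char c;

    return read(fd, &c, 1) == -1 &&
           (errno == EWOULDBLOCK || errno == EAGAIN);
}
```

Reading the current options first and or-ing in the new flag avoids clearing options that some other part of the program has set.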

2.5.5.  Descriptor wrapping.**

      A user process may build descriptors of a specified type by wrapping a communications channel with a system supplied protocol translator:

new = wrap(old, proto)
result int new; int old; struct dprop *proto;
Operations on the descriptor new are then translated by the system-provided protocol translator into requests on the underlying object old in a way defined by the protocol. The protocols supported by the kernel may vary from system to system and are described in the programmer's manual.

      Protocols may be based on communications multiplexing or a rights-passing style of handling multiple requests made on the same object. For instance, a protocol for implementing a file abstraction may or may not include locally generated ``read-ahead'' requests. A protocol that provides for read-ahead may provide higher performance but have a more difficult implementation.

      Another example is the terminal driving facilities. Normally a terminal is associated with a communications line, and the terminal type and standard terminal access protocol are wrapped around a synchronous communications line and given to the user. If a virtual terminal is required, the terminal driver can be wrapped around a communications link, the other end of which is held by a virtual terminal protocol interpreter.

2.6.  Resource controls

     

     

2.6.1.  Process priorities

      The system gives CPU scheduling priority to processes that have not used CPU time recently. This tends to favor interactive processes and processes that execute only for short periods. It is possible to determine the priority currently assigned to a process, process group, or the processes of a specified user, or to alter this priority using the calls:

#define	PRIO_PROCESS	0	/* process */
#define	PRIO_PGRP	1	/* process group */
#define	PRIO_USER	2	/* user id */

prio = getpriority(which, who);
result int prio; int which, who;

setpriority(which, who, prio);
int which, who, prio;
The value prio is in the range -20 to 20. The default priority is 0; numerically lower values cause more favorable execution. The getpriority call returns the highest priority (lowest numerical value) enjoyed by any of the specified processes. The setpriority call sets the priorities of all of the specified processes to the specified value. Only the super-user may lower a priority value, that is, make a process more favored.

2.6.2.  Resource utilization

      The resources used by a process are reported by the getrusage call, which fills in a structure defined in <sys/resource.h>:

#define	RUSAGE_SELF	0		/* usage by this process */
#define	RUSAGE_CHILDREN	-1		/* usage by all children */

getrusage(who, rusage)
int who; result struct rusage *rusage;

struct rusage {
	struct	timeval ru_utime;	/* user time used */
	struct	timeval ru_stime;	/* system time used */
	int	ru_maxrss;	/* maximum core resident set size: kbytes */
	int	ru_ixrss;	/* integral shared memory size (kbytes*sec) */
	int	ru_idrss;	/* unshared data memory size */
	int	ru_isrss;	/* unshared stack memory size */
	int	ru_minflt;	/* page-reclaims */
	int	ru_majflt;	/* page faults */
	int	ru_nswap;	/* swaps */
	int	ru_inblock;	/* block input operations */
	int	ru_oublock;	/* block output operations */
	int	ru_msgsnd;	/* messages sent */
	int	ru_msgrcv;	/* messages received */
	int	ru_nsignals;	/* signals received */
	int	ru_nvcsw;	/* voluntary context switches */
	int	ru_nivcsw;	/* involuntary context switches */
};
The who parameter specifies whose resource usage is to be returned. The resources used by the current process, or by all the terminated children of the current process may be requested.
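
      A short sketch of how the returned times might be used follows. It assumes the POSIX declaration of getrusage and a normalized struct timeval (tv_usec between 0 and 999999); the helper names tv_seconds and cpu_seconds are hypothetical.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a normalized timeval to seconds. */
double tv_seconds(struct timeval tv)
{
	assert(tv.tv_usec >= 0 && tv.tv_usec < 1000000);
	return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/*
 * Total CPU time (user plus system) charged so far to who, which is
 * RUSAGE_SELF or RUSAGE_CHILDREN.  Returns -1.0 if getrusage fails.
 */
double cpu_seconds(int who)
{
	struct rusage ru;

	if (getrusage(who, &ru) == -1)
		return -1.0;
	return tv_seconds(ru.ru_utime) + tv_seconds(ru.ru_stime);
}
```

      Calling cpu_seconds(RUSAGE_SELF) before and after a computation and subtracting gives the CPU time that computation consumed; cpu_seconds(RUSAGE_CHILDREN) covers only children that have terminated and been waited for.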

2.6.3.  Resource limits

      The resources of a process for which limits are controlled by the kernel are defined in <sys/resource.h>, and controlled by the getrlimit and setrlimit calls:

#define	RLIMIT_CPU	0	/* cpu time in milliseconds */
#define	RLIMIT_FSIZE	1	/* maximum file size */
#define	RLIMIT_DATA	2	/* maximum data segment size */
#define	RLIMIT_STACK	3	/* maximum stack segment size */
#define	RLIMIT_CORE	4	/* maximum core file size */
#define	RLIMIT_RSS	5	/* maximum resident set size */

#define	RLIM_NLIMITS	6

#define	RLIM_INFINITY	0x7fffffff

struct rlimit {
	int	rlim_cur;	/* current (soft) limit */
	int	rlim_max;	/* hard limit */
};

getrlimit(resource, rlp)
int resource; result struct rlimit *rlp;

setrlimit(resource, rlp)
int resource; struct rlimit *rlp;

      Only the super-user can raise the maximum limits. Other users may only alter rlim_cur within the range from 0 to rlim_max or (irreversibly) lower rlim_max.

2.7.  System operation support

     

     

      Unless noted otherwise, the calls in this section are permitted only to a privileged user.

2.7.1.  Bootstrap operations

      The call

mount(blkdev, dir, ronly);
char *blkdev, *dir; int ronly;
extends the UNIX name space. The mount call specifies a block device blkdev containing a UNIX file system to be made available starting at dir. If ronly is set then the file system is read-only; writes to the file system will not be permitted and access times will not be updated when files are referenced. Dir is normally a name in the root directory.

      The call

swapon(blkdev, size);
char *blkdev; int size;
specifies a device to be made available for paging and swapping.

     

2.7.2.  Shutdown operations

      The call

unmount(dir);
char *dir;
unmounts the file system mounted on dir. This call will succeed only if the file system is not currently being used.

      The call

sync();
schedules the input/output needed to clean all system buffer caches, that is, to write modified buffers back to disk. (This call does not require privileged status.)

      The call

reboot(how)
int how;
causes a machine halt or reboot. The call may request a reboot by specifying how as RB_AUTOBOOT, or that the machine be halted with RB_HALT. These constants are defined in <sys/reboot.h>.

2.7.3.  Accounting

      The system optionally keeps an accounting record in a file for each process that exits on the system. The format of this record is beyond the scope of this document. Accounting may be enabled, with records written to a named file, by the call

acct(path);
char *path;
If path is null, then accounting is disabled. Otherwise, the named file becomes the accounting file.