The internal structure of the network system is divided into three layers. These layers correspond to the services provided by the socket abstraction, those provided by the communication protocols, and those provided by the hardware interfaces. The communication protocols are normally layered into two or more individual cooperating layers, though they are collectively viewed in the system as one layer providing services supportive of the appropriate socket abstraction.
The following sections describe the properties of each layer in the system and the interfaces to which each must conform.
The socket layer deals with the interprocess communication facilities provided by the system. A socket is a bidirectional endpoint of communication which is ``typed'' by the semantics of communication it supports. The system calls described in the Berkeley Software Architecture Manual [Joy86] are used to manipulate sockets.
A socket consists of the following data structure:
    struct socket {
            short   so_type;                /* generic type */
            short   so_options;             /* from socket call */
            short   so_linger;              /* time to linger while closing */
            short   so_state;               /* internal state flags */
            caddr_t so_pcb;                 /* protocol control block */
            struct  protosw *so_proto;      /* protocol handle */
            struct  socket *so_head;        /* back pointer to accept socket */
            struct  socket *so_q0;          /* queue of partial connections */
            short   so_q0len;               /* partials on so_q0 */
            struct  socket *so_q;           /* queue of incoming connections */
            short   so_qlen;                /* number of connections on so_q */
            short   so_qlimit;              /* max number queued connections */
            struct  sockbuf so_rcv;         /* receive queue */
            struct  sockbuf so_snd;         /* send queue */
            short   so_timeo;               /* connection timeout */
            u_short so_error;               /* error affecting connection */
            u_short so_oobmark;             /* chars to oob mark */
            short   so_pgrp;                /* pgrp for signals */
    };
Each socket contains two data queues, so_rcv and so_snd, and a pointer to routines which provide supporting services. The type of the socket, so_type, is defined at socket creation time and used in selecting those services which are appropriate to support it. The supporting protocol is selected at socket creation time and recorded in the socket data structure for later use. Protocols are defined by a table of procedures, the protosw structure, which will be described in detail later. A pointer to a protocol-specific data structure, the ``protocol control block,'' is also present in the socket structure. Protocols control this data structure, which normally includes a back pointer to the parent socket structure to allow easy lookup when returning information to a user (for example, placing an error number in the so_error field). The other entries in the socket structure are used in queuing connection requests, validating user requests, storing socket characteristics (e.g. options supplied at the time a socket is created), and maintaining a socket's state.
Processes ``rendezvous at a socket'' in many instances. For instance, when a process wishes to extract data from a socket's receive queue and the queue is empty, or lacks sufficient data to satisfy the request, the process blocks, supplying the address of the receive queue as a ``wait channel'' to be used in notification. When data arrives for the process and is placed in the socket's queue, the blocked process is identified by the fact that it is waiting ``on the queue.''
A socket's state is defined from the following:
    #define SS_NOFDREF          0x001   /* no file table ref any more */
    #define SS_ISCONNECTED      0x002   /* socket connected to a peer */
    #define SS_ISCONNECTING     0x004   /* in process of connecting to peer */
    #define SS_ISDISCONNECTING  0x008   /* in process of disconnecting */
    #define SS_CANTSENDMORE     0x010   /* can't send more data to peer */
    #define SS_CANTRCVMORE      0x020   /* can't receive more data from peer */
    #define SS_RCVATMARK        0x040   /* at mark on input */
    #define SS_PRIV             0x080   /* privileged */
    #define SS_NBIO             0x100   /* non-blocking ops */
    #define SS_ASYNC            0x200   /* async i/o notify */
The state of a socket is manipulated both by the protocols and by the user (through system calls). When a socket is created, the state is defined based on the type of socket. It may change as control actions, such as connection establishment, are performed. It may also change according to the type of input/output the user wishes to perform, as indicated by options set with fcntl. ``Non-blocking'' I/O implies that a process should never be blocked to await resources. Instead, any call which would block returns prematurely with the error EWOULDBLOCK, or the service request may be partially fulfilled, e.g. a request for more data than is present.
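The non-blocking rule can be illustrated with a small user-space sketch. It is not the kernel's receive routine: the function rcv_decide, the rcv_action codes, and the choice to report end-of-file when SS_CANTRCVMORE is set are assumptions made for illustration. Only the two state bits are taken from the list above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* State bits copied from the socket state list. */
#define SS_CANTRCVMORE  0x020   /* can't receive more data from peer */
#define SS_NBIO         0x100   /* non-blocking ops */

/* Hypothetical outcome codes, used by this sketch only. */
enum rcv_action { RCV_COPY, RCV_EOF, RCV_SLEEP, RCV_FAIL };

/*
 * Decide what a receive request should do, given the socket state and
 * the number of bytes currently queued.  On RCV_FAIL, *errp holds the
 * error number; otherwise it is set to 0.
 */
enum rcv_action
rcv_decide(int so_state, unsigned queued, int *errp)
{
    assert(errp != NULL);
    *errp = 0;
    if (queued > 0)
        return RCV_COPY;        /* request may be partially fulfilled */
    if (so_state & SS_CANTRCVMORE)
        return RCV_EOF;         /* assumption: nothing more will arrive */
    if (so_state & SS_NBIO) {
        *errp = EWOULDBLOCK;    /* never block a non-blocking socket */
        return RCV_FAIL;
    }
    return RCV_SLEEP;           /* blocking socket: wait on the queue */
}
```

A caller would sleep on the receive queue only when RCV_SLEEP is returned; with SS_NBIO set and an empty queue it would hand EWOULDBLOCK back to the user instead.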
If a process requested ``asynchronous'' notification of events related to the socket, the SIGIO signal is posted to the process when such events occur. An event is a change in the socket's state; examples of such occurrences are: space becoming available in the send queue, new data available in the receive queue, connection establishment or disestablishment, etc.
A socket may be marked ``privileged'' if it was created by the super-user. Only privileged sockets may bind addresses in privileged portions of an address space or use ``raw'' sockets to access lower levels of the network.
A socket's data queue contains a pointer to the data stored in the queue and other entries related to the management of the data. The following structure defines a data queue:
    struct sockbuf {
            u_short sb_cc;          /* actual chars in buffer */
            u_short sb_hiwat;       /* max actual char count */
            u_short sb_mbcnt;       /* chars of mbufs used */
            u_short sb_mbmax;       /* max chars of mbufs to use */
            u_short sb_lowat;       /* low water mark */
            short   sb_timeo;       /* timeout */
            struct  mbuf *sb_mb;    /* the mbuf chain */
            struct  proc *sb_sel;   /* process selecting read/write */
            short   sb_flags;       /* flags, see below */
    };
Data is stored in a queue as a chain of mbufs. The actual count of data characters as well as high and low water marks are used by the protocols in controlling the flow of data. The amount of buffer space (characters of mbufs and associated data pages) is also recorded along with the limit on buffer allocation. The socket routines cooperate in implementing the flow control policy by blocking a process when it requests to send data and the high water mark has been reached, or when it requests to receive data and less than the low water mark is present (assuming non-blocking I/O has not been specified).
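The two limits described above can be combined into a single ``space remaining'' figure. The following is a minimal sketch, assuming a reduced model of the queue that keeps only the four accounting fields; the names sockbuf_model, sb_space and sb_send_would_block are hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced model of struct sockbuf: accounting fields only. */
struct sockbuf_model {
    unsigned short sb_cc;       /* actual chars in buffer */
    unsigned short sb_hiwat;    /* max actual char count */
    unsigned short sb_mbcnt;    /* chars of mbufs used */
    unsigned short sb_mbmax;    /* max chars of mbufs to use */
};

/*
 * Space left before either limit is reached: the smaller of the room
 * under the data high water mark and the room under the buffer
 * allocation limit.  The result is zero or negative when a limit has
 * been reached or exceeded.
 */
int
sb_space(const struct sockbuf_model *sb)
{
    int by_data, by_mbuf;

    assert(sb != NULL);
    by_data = (int)sb->sb_hiwat - (int)sb->sb_cc;
    by_mbuf = (int)sb->sb_mbmax - (int)sb->sb_mbcnt;
    return by_data < by_mbuf ? by_data : by_mbuf;
}

/* A blocking sender would be put to sleep when no space remains. */
int
sb_send_would_block(const struct sockbuf_model *sb)
{
    return sb_space(sb) <= 0;
}
```

With 1000 characters queued against a high water mark of 2048, and 1536 of 4096 buffer characters in use, the data limit is the tighter one and 1048 characters of space remain.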
When a socket is created, the supporting protocol ``reserves'' space for the send and receive queues of the socket. The limit on buffer allocation is set somewhat higher than the limit on data characters to account for the granularity of buffer allocation. The actual storage associated with a socket queue may fluctuate during a socket's lifetime, but it is assumed that this reservation will always allow a protocol to acquire enough memory to satisfy the high water marks.
The timeout and select values are manipulated by the socket routines in implementing various portions of the interprocess communications facilities and will not be described here.
Data queued at a socket is stored in one of two styles. Stream-oriented sockets queue data with no addresses, headers or record boundaries. The data are in mbufs linked through the m_next field. Buffers containing access rights may be present within the chain if the underlying protocol supports passage of access rights. Record-oriented sockets, including datagram sockets, queue data as a list of packets; the sections of packets are distinguished by the types of the mbufs containing them. The mbufs which comprise a record are linked through the m_next field; records are linked from the m_act field of the first mbuf of one packet to the first mbuf of the next. Each packet begins with an mbuf containing the ``from'' address if the protocol provides it, then any buffers containing access rights, and finally any buffers containing data. If a record contains no data, no data buffers are required unless neither address nor access rights are present.
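The two linkage fields of a record-oriented queue can be pictured with a short sketch. It assumes a stripped-down mbuf that holds only the link fields, a type tag and a length; the struct name mbuf_model and the tag values are hypothetical and chosen for this example, not copied from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tags for this sketch; the kernel's values may differ. */
enum { MT_DATA_X = 1, MT_SONAME_X = 2, MT_RIGHTS_X = 3 };

struct mbuf_model {
    struct mbuf_model *m_next;  /* next buffer within the same record */
    struct mbuf_model *m_act;   /* first buffer of the next record */
    int m_type;                 /* address, access rights, or data */
    int m_len;                  /* bytes held in this buffer */
};

/* Count the records on a queue by following m_act from record to record. */
int
count_records(const struct mbuf_model *m)
{
    int n = 0;

    for (; m != NULL; m = m->m_act)
        n++;
    return n;
}

/* Sum the data bytes of one record, skipping address and rights buffers. */
int
record_data_len(const struct mbuf_model *m)
{
    int len = 0;

    for (; m != NULL; m = m->m_next)
        if (m->m_type == MT_DATA_X)
            len += m->m_len;
    return len;
}
```

Only the first mbuf of each record carries a meaningful m_act pointer in this model, matching the description above that records are linked from the first mbuf of one packet to the first mbuf of the next.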
A socket queue has a number of flags used in synchronizing access to the data and in acquiring resources:
    #define SB_LOCK     0x01    /* lock on data queue (so_rcv only) */
    #define SB_WANT     0x02    /* someone is waiting to lock */
    #define SB_WAIT     0x04    /* someone is waiting for data/space */
    #define SB_SEL      0x08    /* buffer is selected */
    #define SB_COLL     0x10    /* collision selecting */
In dealing with connection-oriented sockets (e.g. SOCK_STREAM) the two ends are considered distinct. One end is termed active, and generates connection requests. The other end is called passive and accepts connection requests.
From the passive side, a socket is marked with SO_ACCEPTCONN when a listen call is made, creating two queues of sockets: so_q0 for connections in progress and so_q for connections already made and awaiting user acceptance. As a protocol is preparing incoming connections, it creates a socket structure queued on so_q0 by calling the routine sonewconn(). When the connection is established, the socket structure is then transferred to so_q, making it available for an accept.
If an SO_ACCEPTCONN socket is closed with sockets on either so_q0 or so_q, these sockets are dropped, with notification to the peers as appropriate.
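The movement of a socket from so_q0 to so_q can be modelled in a few lines. This is a sketch under stated assumptions, not the kernel's sonewconn(): the separate next link, the helper names, and the admission rule (refuse when the two queues together already hold so_qlimit sockets) are simplifications introduced here.

```c
#include <assert.h>
#include <stddef.h>

struct sock_model {
    struct sock_model *so_head; /* back pointer to accept socket */
    struct sock_model *so_q0;   /* partial connections (list head) */
    struct sock_model *so_q;    /* completed connections (list head) */
    struct sock_model *next;    /* hypothetical link, this sketch only */
    short so_q0len, so_qlen, so_qlimit;
};

/* Append so to the end of a singly linked list. */
static void
append(struct sock_model **list, struct sock_model *so)
{
    while (*list != NULL)
        list = &(*list)->next;
    so->next = NULL;
    *list = so;
}

/* Remove so from a list; returns 1 if it was found. */
static int
unlink_from(struct sock_model **list, struct sock_model *so)
{
    for (; *list != NULL; list = &(*list)->next)
        if (*list == so) {
            *list = so->next;
            so->next = NULL;
            return 1;
        }
    return 0;
}

/* Queue a new partial connection; returns 0 if the listener is full. */
int
sk_newconn(struct sock_model *head, struct sock_model *so)
{
    if (head->so_q0len + head->so_qlen >= head->so_qlimit)
        return 0;               /* assumed admission rule */
    so->so_head = head;
    append(&head->so_q0, so);
    head->so_q0len++;
    return 1;
}

/* Connection established: move the socket from so_q0 to so_q. */
void
sk_isconnected(struct sock_model *so)
{
    struct sock_model *head = so->so_head;

    assert(head != NULL);
    if (unlink_from(&head->so_q0, so)) {
        head->so_q0len--;
        append(&head->so_q, so);
        head->so_qlen++;
    }
}
```

An accept in this model would simply take the first socket from so_q and decrement so_qlen.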
Each socket is created in a communications domain, which usually implies both an addressing structure (address family) and a set of protocols which implement various socket types within the domain (protocol family). Each domain is defined by the following structure:
    struct domain {
            int     dom_family;             /* PF_xxx */
            char    *dom_name;
            int     (*dom_init)();          /* initialize domain data structures */
            int     (*dom_externalize)();   /* externalize access rights */
            int     (*dom_dispose)();       /* dispose of internalized rights */
            struct  protosw *dom_protosw, *dom_protoswNPROTOSW;
            struct  domain *dom_next;
    };
At boot time, each domain configured into the kernel is added to a linked list of domains. The initialization procedure of each domain is then called. After that time, the domain structure is used to locate protocols within the protocol family. It may also contain procedure references for externalization of access rights at the receiving socket and the disposal of access rights that are not received.
Protocols are described by a set of entry points and certain socket-visible characteristics, some of which are used in deciding which socket type(s) they may support.
An entry in the ``protocol switch'' table exists for each protocol module configured into the system. It has the following form:
    struct protosw {
            short   pr_type;                /* socket type used for */
            struct  domain *pr_domain;      /* domain protocol a member of */
            short   pr_protocol;            /* protocol number */
            short   pr_flags;               /* socket visible attributes */
    /* protocol-protocol hooks */
            int     (*pr_input)();          /* input to protocol (from below) */
            int     (*pr_output)();         /* output to protocol (from above) */
            int     (*pr_ctlinput)();       /* control input (from below) */
            int     (*pr_ctloutput)();      /* control output (from above) */
    /* user-protocol hook */
            int     (*pr_usrreq)();         /* user request */
    /* utility hooks */
            int     (*pr_init)();           /* initialization routine */
            int     (*pr_fasttimo)();       /* fast timeout (200ms) */
            int     (*pr_slowtimo)();       /* slow timeout (500ms) */
            int     (*pr_drain)();          /* flush any excess space possible */
    };
A protocol is called through the pr_init entry before any other. Thereafter it is called every 200 milliseconds through the pr_fasttimo entry and every 500 milliseconds through the pr_slowtimo entry for timer-based actions. The system will call the pr_drain entry if it is low on space; the protocol should respond by discarding any non-critical data.
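How the two timer entries might be driven can be sketched as follows. The sketch assumes a hypothetical 100 ms clock tick and walks a flat table of protocols; the names pf_tick, pf_fasttimo, pf_slowtimo and the demo handlers are invented for illustration, and the handlers return void here rather than int to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced protocol switch entry: timer hooks only. */
struct protosw_model {
    void (*pr_fasttimo)(void);  /* fast timeout (200ms) */
    void (*pr_slowtimo)(void);  /* slow timeout (500ms) */
};

/* Call the fast timeout of every protocol that supplies one. */
void
pf_fasttimo(const struct protosw_model *tab, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tab[i].pr_fasttimo != NULL)
            tab[i].pr_fasttimo();
}

/* Call the slow timeout of every protocol that supplies one. */
void
pf_slowtimo(const struct protosw_model *tab, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tab[i].pr_slowtimo != NULL)
            tab[i].pr_slowtimo();
}

/* Assumed 100 ms tick: every 2nd tick is 200 ms, every 5th is 500 ms. */
void
pf_tick(const struct protosw_model *tab, size_t n, unsigned long tick_no)
{
    if (tick_no % 2 == 0)
        pf_fasttimo(tab, n);
    if (tick_no % 5 == 0)
        pf_slowtimo(tab, n);
}

/* Demo protocol handlers that only count how often they are called. */
int demo_fast_calls, demo_slow_calls;
void demo_fasttimo(void) { demo_fast_calls++; }
void demo_slowtimo(void) { demo_slow_calls++; }
```

A protocol with no fast timer work leaves pr_fasttimo null and is skipped on the 200 ms pass.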
Protocols pass data between themselves as chains of mbufs using the pr_input and pr_output routines. Pr_input passes data up (towards the user) and pr_output passes it down (towards the network); control information passes up and down on pr_ctlinput and pr_ctloutput. The protocol is responsible for the space occupied by any of the arguments to these entries and must either pass it onward or dispose of it. (On output, the lowest level reached must free buffers storing the arguments; on input, the highest level is responsible for freeing buffers.)
The pr_usrreq routine interfaces protocols to the socket code and is described below.
The pr_flags field is constructed from the following values:
    #define PR_ATOMIC           0x01    /* exchange atomic messages only */
    #define PR_ADDR             0x02    /* addresses given with messages */
    #define PR_CONNREQUIRED     0x04    /* connection required by protocol */
    #define PR_WANTRCVD         0x08    /* want PRU_RCVD calls */
    #define PR_RIGHTS           0x10    /* passes capabilities */
When a socket is created, the socket routines scan the protocol table for the domain looking for an appropriate protocol to support the type of socket being created. The pr_type field contains one of the possible socket types (e.g. SOCK_STREAM), while the pr_domain is a back pointer to the domain structure. The pr_protocol field contains the protocol number of the protocol, normally a well-known value.
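The table scan performed at socket creation might look like the sketch below. It assumes reduced versions of the domain and protosw structures and a single lookup routine, find_proto, in which a protocol argument of 0 means ``any protocol of this type''; that convention and the function name are assumptions of this example rather than the kernel's interface.

```c
#include <assert.h>
#include <stddef.h>

struct domain_model;

/* Reduced protocol switch entry: identification fields only. */
struct protosw_model {
    short pr_type;                      /* socket type used for */
    struct domain_model *pr_domain;     /* domain protocol a member of */
    short pr_protocol;                  /* protocol number */
};

/* Reduced domain: family, protocol table bounds, list link. */
struct domain_model {
    int dom_family;                     /* PF_xxx */
    struct protosw_model *dom_protosw;  /* first table entry */
    struct protosw_model *dom_protoswNPROTOSW; /* one past the last */
    struct domain_model *dom_next;
};

/*
 * Find the domain for the family, then scan its protocol table for an
 * entry of the requested socket type.  protocol == 0 accepts the first
 * entry of that type (an assumption of this sketch).
 */
const struct protosw_model *
find_proto(struct domain_model *domains, int family, int type, int protocol)
{
    struct domain_model *dp;
    struct protosw_model *pr;

    for (dp = domains; dp != NULL; dp = dp->dom_next)
        if (dp->dom_family == family)
            break;
    if (dp == NULL)
        return NULL;            /* no such domain configured */
    for (pr = dp->dom_protosw; pr < dp->dom_protoswNPROTOSW; pr++)
        if (pr->pr_type == type &&
            (protocol == 0 || pr->pr_protocol == protocol))
            return pr;
    return NULL;                /* no protocol supports this type */
}
```

The pointer returned would be recorded in the socket's so_proto field for later use.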
Each network-interface configured into a system defines a path through which packets may be sent and received. Normally a hardware device is associated with this interface, though there is no requirement for this (for example, all systems have a software ``loopback'' interface used for debugging and performance analysis). In addition to manipulating the hardware device, an interface module is responsible for encapsulation and decapsulation of any link-layer header information required to deliver a message to its destination. The selection of which interface to use in delivering packets is a routing decision carried out at a higher level than the network-interface layer. An interface may have addresses in one or more address families. The address is set at boot time using an ioctl on a socket in the appropriate domain; this operation is implemented by the protocol family, after verifying the operation through the device ioctl entry.
An interface is defined by the following structures:
    struct ifnet {
            char    *if_name;               /* name, e.g. ``en'' or ``lo'' */
            short   if_unit;                /* sub-unit for lower level driver */
            short   if_mtu;                 /* maximum transmission unit */
            short   if_flags;               /* up/down, broadcast, etc. */
            short   if_timer;               /* time 'til if_watchdog called */
            struct  ifaddr *if_addrlist;    /* list of addresses of interface */
            struct  ifqueue if_snd;         /* output queue */
            int     (*if_init)();           /* init routine */
            int     (*if_output)();         /* output routine */
            int     (*if_ioctl)();          /* ioctl routine */
            int     (*if_reset)();          /* bus reset routine */
            int     (*if_watchdog)();       /* timer routine */
            int     if_ipackets;            /* packets received on interface */
            int     if_ierrors;             /* input errors on interface */
            int     if_opackets;            /* packets sent on interface */
            int     if_oerrors;             /* output errors on interface */
            int     if_collisions;          /* collisions on csma interfaces */
            struct  ifnet *if_next;
    };
    struct ifaddr {
            struct  sockaddr ifa_addr;      /* address of interface */
            union {
                    struct  sockaddr ifu_broadaddr;
                    struct  sockaddr ifu_dstaddr;
            } ifa_ifu;
            struct  ifnet *ifa_ifp;         /* back-pointer to interface */
            struct  ifaddr *ifa_next;       /* next address for interface */
    };
    #define ifa_broadaddr   ifa_ifu.ifu_broadaddr   /* broadcast address */
    #define ifa_dstaddr     ifa_ifu.ifu_dstaddr     /* other end of p-to-p link */
Each interface has a send queue and routines used for initialization, if_init, and output, if_output. If the interface resides on a system bus, the routine if_reset will be called after a bus reset has been performed. An interface may also specify a timer routine, if_watchdog; if if_timer is non-zero, it is decremented once per second until it reaches zero, at which time the watchdog routine is called.
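The watchdog countdown can be sketched as a routine run once per second over the interface list. This is a user-space model under stated assumptions: the names ifnet_model and if_slowtimo_model are hypothetical, and passing the unit number to the watchdog routine is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced interface structure: timer fields only. */
struct ifnet_model {
    short if_unit;                      /* sub-unit for lower level driver */
    short if_timer;                     /* time 'til if_watchdog called */
    void (*if_watchdog)(int unit);      /* timer routine (may be null) */
    struct ifnet_model *if_next;
};

/*
 * Called once per second.  A zero timer is idle and left alone; a
 * non-zero timer is decremented, and the watchdog routine runs at the
 * moment the count reaches zero.
 */
void
if_slowtimo_model(struct ifnet_model *ifnet)
{
    struct ifnet_model *ifp;

    for (ifp = ifnet; ifp != NULL; ifp = ifp->if_next) {
        if (ifp->if_timer == 0)
            continue;           /* timer not armed */
        if (--ifp->if_timer != 0)
            continue;           /* still counting down */
        if (ifp->if_watchdog != NULL)
            ifp->if_watchdog(ifp->if_unit);
    }
}

/* Demo watchdog that records which unit fired and how often. */
int demo_watchdog_calls;
int demo_watchdog_unit = -1;
void
demo_watchdog(int unit)
{
    demo_watchdog_calls++;
    demo_watchdog_unit = unit;
}
```

A driver would typically arm the timer by storing a number of seconds in if_timer when it starts an operation, and clear it when the operation completes; the watchdog then fires only if the operation stalls.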
The state of an interface and certain characteristics are stored in the if_flags field. The following values are possible:
    #define IFF_UP              0x1     /* interface is up */
    #define IFF_BROADCAST       0x2     /* broadcast is possible */
    #define IFF_DEBUG           0x4     /* turn on debugging */
    #define IFF_LOOPBACK        0x8     /* is a loopback net */
    #define IFF_POINTOPOINT     0x10    /* interface is point-to-point link */
    #define IFF_NOTRAILERS      0x20    /* avoid use of trailers */
    #define IFF_RUNNING         0x40    /* resources allocated */
    #define IFF_NOARP           0x80    /* no address resolution protocol */
Various statistics are also stored in the interface structure. These may be viewed by users using the netstat(1) program.
The interface address and flags may be set with the SIOCSIFADDR and SIOCSIFFLAGS ioctls. SIOCSIFADDR is used initially to define each interface's address; SIOCSIFFLAGS can be used to mark an interface down and perform site-specific configuration. The destination address of a point-to-point link is set with SIOCSIFDSTADDR. Corresponding operations exist to read each value. Protocol families may also support operations to set and read the broadcast address. In addition, the SIOCGIFCONF ioctl retrieves a list of interface names and addresses for all interfaces and protocols on the host.
All hardware-related interfaces currently reside on the UNIBUS. Consequently a common set of utility routines for dealing with the UNIBUS has been developed. Each UNIBUS interface utilizes a structure of the following form:
    struct ifubinfo {
            short   iff_uban;               /* uba number */
            short   iff_hlen;               /* local net header length */
            struct  uba_regs *iff_uba;      /* uba regs, in vm */
            short   iff_flags;              /* used during uballoc's */
    };
    struct ifrw {
            caddr_t ifrw_addr;              /* virt addr of header */
            short   ifrw_bdp;               /* unibus bdp */
            short   ifrw_flags;             /* type, etc. */
    #define IFRW_W  0x01                    /* is a transmit buffer */
            int     ifrw_info;              /* value from ubaalloc */
            int     ifrw_proto;             /* map register prototype */
            struct  pte *ifrw_mr;           /* base of map registers */
    };
    struct ifxmt {
            struct  ifrw ifrw;
            caddr_t ifw_base;               /* virt addr of buffer */
            struct  pte ifw_wmap[IF_MAXNUBAMR];     /* base pages for output */
            struct  mbuf *ifw_xtofree;      /* pages being dma'd out */
            short   ifw_xswapd;             /* mask of clusters swapped */
            short   ifw_nmr;                /* number of entries in wmap */
    };
    #define ifw_addr    ifrw.ifrw_addr
    #define ifw_bdp     ifrw.ifrw_bdp
    #define ifw_flags   ifrw.ifrw_flags
    #define ifw_info    ifrw.ifrw_info
    #define ifw_proto   ifrw.ifrw_proto
    #define ifw_mr      ifrw.ifrw_mr
    struct ifuba {
            struct  ifubinfo ifu_info;
            struct  ifrw ifu_r;
            struct  ifxmt ifu_xmt;
    };
    #define ifu_uban    ifu_info.iff_uban
    #define ifu_hlen    ifu_info.iff_hlen
    #define ifu_uba     ifu_info.iff_uba
    #define ifu_flags   ifu_info.iff_flags
    #define ifu_w       ifu_xmt.ifrw
    #define ifu_xtofree ifu_xmt.ifw_xtofree
The ifubinfo structure contains the general information needed to characterize the I/O-mapped buffers for the device. In addition, there is a structure describing each buffer, including UNIBUS resources held by the interface. Sufficient memory pages and bus map registers are allocated to each buffer upon initialization according to the maximum packet size and header length. The kernel virtual address of the buffer is held in ifrw_addr, and the map registers begin at ifrw_mr. UNIBUS map register ifrw_mr[-1] maps the local network header ending on a page boundary. UNIBUS data paths are reserved for read and for write, given by ifrw_bdp. The prototype of the map registers for read and for write is saved in ifrw_proto.
When write transfers are not at least half-full pages on page boundaries, the data are just copied into the pages mapped on the UNIBUS and the transfer is started. If a write transfer is at least half a page long and on a page boundary, UNIBUS page table entries are swapped to reference the pages, and then the initial pages are remapped from ifw_wmap when the transfer completes. The mbufs containing the mapped pages are placed on the ifw_xtofree queue to be freed after transmission.
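The choice between copying and remapping on output reduces to a simple test on the address and length of the data. The sketch below states that test for a single contiguous piece of data; the function name, the enum, and the page size passed as a parameter are assumptions of this example, and the real driver support code works on mbufs and map registers rather than on plain addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch. */
enum xmit_method { XMIT_COPY, XMIT_REMAP };

/*
 * Data that starts on a page boundary and is at least half a page
 * long is a candidate for page remapping; anything else is copied
 * into the pages already mapped for the device.
 */
enum xmit_method
choose_xmit_method(uintptr_t addr, size_t len, size_t pagesize)
{
    assert(pagesize != 0);
    if (addr % pagesize == 0 && len >= pagesize / 2)
        return XMIT_REMAP;
    return XMIT_COPY;
}
```

For a 1024-byte page, 600 bytes starting at a page-aligned address would be remapped, while 100 bytes at the same address, or any amount at an unaligned address, would be copied.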
When read transfers give at least half a page of data to be input, page frames are allocated from a network page list and traded with the pages already containing the data, mapping the allocated pages to replace the input pages for the next UNIBUS data input.
The following utility routines are available for use in writing network interface drivers; all use the structures described above.
if_ubaminit(ifubinfo, uban, hlen, nmr, ifr, nr, ifx, nx);
if_ubainit(ifuba, uban, hlen, nmr);
m = if_ubaget(ifubinfo, ifr, totlen, off0, ifp);
m = if_rubaget(ifuba, totlen, off0, ifp);
if_ubaput(ifubinfo, ifx, m);
if_wubaput(ifuba, m);