Building a TCP/IP Library: From Sockets to Protocol Stack
Overview
This article explains how to design and implement a TCP/IP library from low-level socket interfaces up through protocol stack components. It targets experienced systems programmers building network stacks for user-space applications or lightweight embedded systems. We’ll cover architecture, core modules, key algorithms, common pitfalls, testing, and performance tuning.
1. Goals and constraints
- Primary goal: provide a reliable, modular, and testable TCP/IP stack implementation suitable for user-space applications or constrained devices.
- Constraints: limited memory/CPU (embedded), portability across OSes, clean API for applications, clear separation between link, network, transport layers.
2. High-level architecture
Design a layered architecture mirroring the TCP/IP model:
- Link layer: device drivers, packet I/O, frame parsing
- Network layer: IPv4/IPv6 packet processing, routing, fragmentation
- Transport layer: UDP (simple), TCP (connection state machine, retransmission, congestion control)
- Socket API: BSD-like socket interface or simplified custom API for application use
- Utilities: ARP, ICMP, DNS resolver, timers, buffer management, packet queues
Use modular components with well-defined interfaces. Keep the core stack free of platform-specific code; isolate device/OS integration behind an adaptation layer.
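One way to picture the adaptation layer is as a small device interface that the core stack depends on, with each platform supplying its own implementation. The Python sketch below is illustrative only: `NetDevice`, `LoopbackDevice`, and `Stack` are hypothetical names, and the method set (`tx`, `mtu`, `hwaddr`, plus an `rx` callback) is an assumption about what such an interface might contain.

```python
from typing import Callable, List, Protocol


class NetDevice(Protocol):
    """What the core stack expects from any platform-specific device."""

    def tx(self, frame: bytes) -> None: ...
    def mtu(self) -> int: ...
    def hwaddr(self) -> bytes: ...


class LoopbackDevice:
    """Test double: every transmitted frame is handed straight back as received."""

    def __init__(self, rx: Callable[[bytes], None]):
        self._rx = rx  # callback into the stack for received frames

    def tx(self, frame: bytes) -> None:
        if len(frame) > self.mtu() + 14:  # payload limit plus Ethernet header
            raise ValueError("frame exceeds MTU")
        self._rx(frame)

    def mtu(self) -> int:
        return 1500

    def hwaddr(self) -> bytes:
        return b"\x02\x00\x00\x00\x00\x01"  # locally administered address


class Stack:
    """Core stack stub: knows only the NetDevice interface, not the platform."""

    def __init__(self) -> None:
        self.received: List[bytes] = []
        self.dev = None

    def attach(self, make_device: Callable[[Callable[[bytes], None]], NetDevice]) -> None:
        self.dev = make_device(self.on_frame)

    def on_frame(self, frame: bytes) -> None:
        self.received.append(frame)

    def send(self, frame: bytes) -> None:
        self.dev.tx(frame)
```

Swapping `LoopbackDevice` for a TUN/TAP or driver-backed class would then require no change to `Stack`, which is the point of keeping platform code behind the interface.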
3. Data structures and buffer management
- Packet buffers (pbuf): single buffer type supporting chained fragments to avoid copies. Include metadata: length, offset, protocol, reference count.
- Connection control blocks (TCB): per-TCP-connection state (snd_una, snd_nxt, rcv_nxt, cwnd, ssthresh, timers, retransmission queue, MSS).
- Routing table: prefix match structure (CIDR trie or linear table for embedded).
- Socket descriptors: map application handles to TCBs or UDP sockets and store per-socket options.
- Use ring buffers for device queues, and pass packets by reference (zero-copy) where possible.
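The chained, reference-counted packet buffer described above can be modelled in a few lines. This is a Python sketch under stated assumptions: `Pbuf`, `pbuf_ref`, `pbuf_free`, and the exact field names are hypothetical, and a C implementation would express the same fields as a struct with an explicit allocator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pbuf:
    """One segment of a packet; segments chain via `next` so data is never copied."""

    data: memoryview                 # view into backing storage
    offset: int = 0                  # start of valid payload within `data`
    length: int = 0                  # number of valid bytes in this segment
    protocol: int = 0                # e.g. IP protocol number, set by the parser
    refcount: int = 1
    next: Optional["Pbuf"] = None

    def total_length(self) -> int:
        """Sum of payload lengths across the whole chain."""
        total, seg = 0, self
        while seg is not None:
            total += seg.length
            seg = seg.next
        return total

    def pull(self, nbytes: int) -> None:
        """Strip a header of `nbytes` from the front by moving the offset."""
        if nbytes > self.length:
            raise ValueError("header spans segments; not handled in this sketch")
        self.offset += nbytes
        self.length -= nbytes

    def payload(self) -> memoryview:
        return self.data[self.offset:self.offset + self.length]


def pbuf_ref(p: Pbuf) -> Pbuf:
    """Take an extra reference, e.g. when queuing for retransmission."""
    p.refcount += 1
    return p


def pbuf_free(p: Pbuf) -> bool:
    """Drop one reference; return True when the buffer is actually released."""
    p.refcount -= 1
    return p.refcount == 0
```

Each layer calls `pull` to consume its own header, so the payload handed upward is a view into the original receive buffer rather than a copy.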
4. Link layer and packet I/O
- Implement an abstract NIC interface with callbacks: tx(packet), rx(packet), mtu(), hwaddr().
- For user-space, implement raw sockets or TUN/TAP adaptation. For embedded, connect to driver-specific send/receive.
- Frame parsing: detect EtherType, handle VLAN tags, pass IP packets to network layer, handle ARP locally.
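The frame-parsing step can be sketched as follows. The function names `parse_ethernet` and `dispatch` and the returned handler labels are hypothetical; the EtherType constants are the standard values. Only a single 802.1Q tag is handled here.

```python
import struct

ETH_P_IP = 0x0800
ETH_P_ARP = 0x0806
ETH_P_8021Q = 0x8100
ETH_P_IPV6 = 0x86DD


def parse_ethernet(frame: bytes):
    """Return (dst, src, ethertype, vlan_id, payload); vlan_id is None if untagged."""
    if len(frame) < 14:
        raise ValueError("frame shorter than Ethernet header")
    dst, src = frame[0:6], frame[6:12]
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    offset, vlan_id = 14, None
    if ethertype == ETH_P_8021Q:
        if len(frame) < 18:
            raise ValueError("truncated VLAN tag")
        # The tag carries 16 bits of control info, then the real EtherType.
        tci, ethertype = struct.unpack_from("!HH", frame, 14)
        vlan_id = tci & 0x0FFF
        offset = 18
    return dst, src, ethertype, vlan_id, frame[offset:]


def dispatch(frame: bytes) -> str:
    """Name the layer that would receive this frame (labels are illustrative)."""
    _, _, ethertype, _, _ = parse_ethernet(frame)
    handlers = {ETH_P_IP: "ipv4", ETH_P_IPV6: "ipv6", ETH_P_ARP: "arp"}
    return handlers.get(ethertype, "drop")
```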
5. IP layer (IPv4 focus)
- Parse and validate IP header (checksum, version, header length, total length, TTL).
- Routing lookup: determine the outgoing interface and next-hop IP address; the next hop's MAC address is then resolved through ARP.
- Fragmentation: for outgoing, fragment oversized packets according to MTU; for incoming, reassemble fragments using fragment queues and timeouts.
- ICMP handling: respond to echo requests, send unreachable/time-exceeded messages as needed.
- ARP integration: resolve MACs asynchronously; queue packets pending resolution, retry with timeouts.
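Header validation is the first of the steps above and is easy to get subtly wrong, so a small sketch may help. It uses the Internet checksum from RFC 1071; `validate_ipv4` and the shape of its return value are assumptions made for illustration. TTL is returned rather than checked, since decrementing and expiry apply on the forwarding path.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071: one's-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def validate_ipv4(packet: bytes) -> dict:
    """Parse the fixed header; raise ValueError if any check fails."""
    if len(packet) < 20:
        raise ValueError("shorter than minimum IPv4 header")
    ver_ihl, _tos, total_len, _ident, _frag, ttl, proto, _csum, src, dst = \
        struct.unpack_from("!BBHHHBBH4s4s", packet)
    version, ihl = ver_ihl >> 4, (ver_ihl & 0x0F) * 4
    if version != 4:
        raise ValueError("not IPv4")
    if ihl < 20 or ihl > len(packet):
        raise ValueError("bad header length")
    if total_len < ihl or total_len > len(packet):
        raise ValueError("bad total length")
    # Summing a header that includes a correct checksum yields zero.
    if internet_checksum(packet[:ihl]) != 0:
        raise ValueError("header checksum mismatch")
    return {"ihl": ihl, "ttl": ttl, "proto": proto, "src": src, "dst": dst,
            "payload": packet[ihl:total_len]}
```

Slicing the payload at `total_len` rather than at the end of the buffer discards link-layer padding, which short Ethernet frames commonly carry.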
6. UDP: stateless transport
- Map incoming UDP datagrams to sockets by local port and address.
- For send, construct the UDP header, compute the checksum (optional over IPv4, where a zero field means none was computed; mandatory over IPv6), and hand the packet to the IP layer.
- No retransmission; expose socket options for broadcast, multicast, and receive buffer sizing.
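A compact sketch of the send and receive paths for IPv4 follows. The checksum covers a pseudo-header (source address, destination address, protocol 17, UDP length) in addition to the datagram itself. `build_udp`, `demux`, and the dictionary-based socket table keyed by `(address, port)` are hypothetical, chosen only to keep the example short.

```python
import struct


def checksum16(data: bytes) -> int:
    """Internet checksum (RFC 1071) over `data`, zero-padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_udp(src_ip: bytes, dst_ip: bytes, sport: int, dport: int,
              payload: bytes) -> bytes:
    """Return a UDP datagram (header plus payload) with the checksum filled in."""
    length = 8 + len(payload)
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, length)
    header = struct.pack("!HHHH", sport, dport, length, 0)
    csum = checksum16(pseudo + header + payload)
    if csum == 0:
        csum = 0xFFFF  # zero on the wire means "no checksum" over IPv4
    return struct.pack("!HHHH", sport, dport, length, csum) + payload


def demux(sockets: dict, dst_ip: bytes, dport: int):
    """Prefer an exact (address, port) binding, then a wildcard bound to the port."""
    exact = sockets.get((dst_ip, dport))
    if exact is not None:
        return exact
    return sockets.get((None, dport))
```

A receiver verifies the datagram by running `checksum16` over the same pseudo-header followed by the received bytes and expecting zero.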
7. TCP fundamentals and state machine
- Implement TCP per RFC 793 (now consolidated in RFC 9293) with modern extensions: window scaling and timestamps (RFC 7323) and, optionally, selective acknowledgments (SACK, RFC 2018).
- State machine: CLOSED, LISTEN, SYN-SENT, SYN-RECEIVED, ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT.
- Three-way handshake, graceful close, and abort on errors.
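A table-driven design keeps the state machine auditable against the RFC diagram. The sketch below covers the handshake and graceful-close paths only; RST handling, sequence-number checks, and timers are deliberately omitted. The event names (`active_open`, `rcv_syn_ack`, and so on) and the `step`/`run` helpers are assumptions made for this example.

```python
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    LISTEN = "LISTEN"
    SYN_SENT = "SYN-SENT"
    SYN_RECEIVED = "SYN-RECEIVED"
    ESTABLISHED = "ESTABLISHED"
    FIN_WAIT_1 = "FIN-WAIT-1"
    FIN_WAIT_2 = "FIN-WAIT-2"
    CLOSE_WAIT = "CLOSE-WAIT"
    CLOSING = "CLOSING"
    LAST_ACK = "LAST-ACK"
    TIME_WAIT = "TIME-WAIT"


# (current state, event) -> (next state, segment to send or None)
TRANSITIONS = {
    (State.CLOSED, "passive_open"): (State.LISTEN, None),
    (State.CLOSED, "active_open"): (State.SYN_SENT, "SYN"),
    (State.LISTEN, "rcv_syn"): (State.SYN_RECEIVED, "SYN+ACK"),
    (State.SYN_SENT, "rcv_syn_ack"): (State.ESTABLISHED, "ACK"),
    (State.SYN_SENT, "rcv_syn"): (State.SYN_RECEIVED, "SYN+ACK"),  # simultaneous open
    (State.SYN_RECEIVED, "rcv_ack"): (State.ESTABLISHED, None),
    (State.ESTABLISHED, "close"): (State.FIN_WAIT_1, "FIN"),
    (State.ESTABLISHED, "rcv_fin"): (State.CLOSE_WAIT, "ACK"),
    (State.FIN_WAIT_1, "rcv_ack"): (State.FIN_WAIT_2, None),
    (State.FIN_WAIT_1, "rcv_fin"): (State.CLOSING, "ACK"),         # simultaneous close
    (State.FIN_WAIT_2, "rcv_fin"): (State.TIME_WAIT, "ACK"),
    (State.CLOSING, "rcv_ack"): (State.TIME_WAIT, None),
    (State.CLOSE_WAIT, "close"): (State.LAST_ACK, "FIN"),
    (State.LAST_ACK, "rcv_ack"): (State.CLOSED, None),
    (State.TIME_WAIT, "timeout"): (State.CLOSED, None),
}


def step(state: State, event: str):
    """Apply one event; raise ValueError if it is not valid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} not valid in {state.value}") from None


def run(state: State, events):
    """Feed a sequence of events; return the final state and segments sent."""
    sent = []
    for event in events:
        state, segment = step(state, event)
        if segment is not None:
            sent.append(segment)
    return state, sent
```

For instance, `run(State.CLOSED, ["active_open", "rcv_syn_ack"])` walks the client side of the three-way handshake and ends in `State.ESTABLISHED`.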
8. Reliable transmission: retransmit, timers, and queues
- Retransmission queue: store unacknowledged segments with send time and retransmit count.
- Timers:
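  - Retransmission timer (RTO) per connection using RTT estimation (Jacobson/Karels): track SRTT and RTTVAR, then set RTO = SRTT + 4*RTTVAR, clamped to a minimum (RFC 6298 recommends 1 second). Skip RTT samples from retransmitted segments (Karn's algorithm).
  - Delayed ACK timer.
  - Persist timer for zero-window probing.
  - TIME-WAIT timer (2*MSL).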
- On RTO expiry, retransmit earliest unacked segment and back off RTO exponentially (binary exponential backoff).
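The RTO computation and backoff can be sketched as a small estimator class. The smoothing constants (1/8 for SRTT, 1/4 for RTTVAR), the 1-second initial and minimum RTO, and the 60-second cap follow RFC 6298; the class name `RttEstimator` and its method names are hypothetical. Times are in seconds.

```python
class RttEstimator:
    """Jacobson/Karels RTT estimation with exponential backoff on timeout."""

    def __init__(self, min_rto: float = 1.0, max_rto: float = 60.0,
                 granularity: float = 0.001):
        self.srtt = None        # smoothed RTT, unset until the first sample
        self.rttvar = None      # RTT variation
        self.rto = 1.0          # initial RTO before any measurement
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.granularity = granularity  # clock granularity G

    def _set_rto(self, value: float) -> None:
        self.rto = min(self.max_rto, max(self.min_rto, value))

    def on_measurement(self, rtt: float) -> None:
        """Feed one RTT sample. Callers should not sample retransmitted segments."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        self._set_rto(self.srtt + max(self.granularity, 4 * self.rttvar))

    def on_timeout(self) -> None:
        """Binary exponential backoff, capped at max_rto."""
        self._set_rto(self.rto * 2)
```

Lowering `min_rto` below one second, as many deployed stacks do, is a tuning decision rather than something the estimator requires.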
9. Congestion control and flow control
- Flow control: advertise receiver window; support window scaling.
- Congestion control: implement TCP Reno or Cubic (Reno simpler). Basic Reno algorithm:
  - Slow start: cwnd grows by one MSS per ACK, roughly doubling each RTT, until it reaches ssthresh.
  - Congestion avoidance: cwnd increases by MSS*MSS/cwnd per ACK, about one MSS per RTT.
  - On loss (triple duplicate ACKs): retransmit the missing segment, set ssthresh = max(cwnd/2, 2*MSS) and cwnd = ssthresh + 3*MSS, then enter fast recovery.
  - On timeout: ssthresh = max(cwnd/2, 2*MSS), cwnd = MSS, enter slow start.
- Consider SACK and selective retransmit for higher performance over lossy links.
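The Reno rules listed above translate almost directly into code. This sketch tracks cwnd in bytes and makes several simplifying assumptions: the initial window is one MSS, ssthresh is derived from cwnd rather than from the bytes in flight (RFC 5681 uses FlightSize), and a floor of 2*MSS is applied to ssthresh as that RFC specifies. The class and method names are hypothetical.

```python
class Reno:
    """Simplified TCP Reno congestion control; all sizes are in bytes."""

    def __init__(self, mss: int, initial_ssthresh: int = 65535):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = initial_ssthresh
        self.dupacks = 0
        self.in_fast_recovery = False

    def _halve(self) -> None:
        self.ssthresh = max(self.cwnd // 2, 2 * self.mss)

    def on_new_ack(self) -> None:
        """An ACK that advances snd_una."""
        if self.in_fast_recovery:
            self.cwnd = self.ssthresh          # deflate the window and exit
            self.in_fast_recovery = False
        elif self.cwnd < self.ssthresh:
            self.cwnd += self.mss              # slow start
        else:
            self.cwnd += max(1, self.mss * self.mss // self.cwnd)  # avoidance
        self.dupacks = 0

    def on_dup_ack(self) -> bool:
        """Return True when the caller should fast-retransmit the missing segment."""
        self.dupacks += 1
        if self.in_fast_recovery:
            self.cwnd += self.mss              # inflate for each extra duplicate
            return False
        if self.dupacks == 3:
            self._halve()
            self.cwnd = self.ssthresh + 3 * self.mss
            self.in_fast_recovery = True
            return True
        return False

    def on_timeout(self) -> None:
        self._halve()
        self.cwnd = self.mss
        self.dupacks = 0
        self.in_fast_recovery = False
```

The amount a sender may have in flight is then the smaller of `cwnd` and the peer's advertised window, which is where congestion control and flow control meet.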
10. Path MTU discovery and MSS
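- Determine MSS during the SYN exchange: advertise the interface MTU minus the fixed IP and TCP header sizes (1500 - 20 - 20 = 1460 for IPv4 over Ethernet), and subtract the length of any TCP options when sizing each outgoing segment.
- Implement Path MTU Discovery (PMTUD) using ICMP “fragmentation needed” messages; fall back to packetization-layer PMTUD (RFC 4821) if ICMP is unreliable, for example when firewalls drop it.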
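The arithmetic behind MSS selection and the reaction to an ICMP "fragmentation needed" message can be sketched as below. The function names are hypothetical. The 68-byte floor and the plateau values (used when an older router reports a next-hop MTU of zero) are taken from RFC 1191; only part of that table is reproduced here.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20
MIN_IPV4_PMTU = 68
PLATEAUS = (4352, 2002, 1492, 1006, 508, 296, 68)  # subset of the RFC 1191 table


def advertised_mss(mtu: int, ipv6: bool = False) -> int:
    """MSS to place in the SYN: MTU minus the fixed IP and TCP headers."""
    return mtu - (IPV6_HEADER if ipv6 else IPV4_HEADER) - TCP_HEADER


def segment_payload(peer_mss: int, path_mtu: int, tcp_options_len: int,
                    ipv6: bool = False) -> int:
    """Largest payload for one segment, given the options it will carry."""
    return min(peer_mss, advertised_mss(path_mtu, ipv6)) - tcp_options_len


def on_frag_needed(current_pmtu: int, reported_mtu: int) -> int:
    """New path MTU estimate after an ICMP 'fragmentation needed' message."""
    if reported_mtu == 0:
        # No next-hop MTU supplied: step down to the next lower plateau.
        lower = [p for p in PLATEAUS if p < current_pmtu]
        reported_mtu = max(lower) if lower else MIN_IPV4_PMTU
    # Never raise the estimate from an ICMP error, and never go below the floor.
    return max(MIN_IPV4_PMTU, min(current_pmtu, reported_mtu))
```

With a 1500-byte MTU and the 12 bytes that a timestamp option typically occupies, `segment_payload(1460, 1500, 12)` gives 1448 bytes per segment.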
11. Socket API design
Provide a BSD-like API surface or a simplified variant:
- socket(), bind(), listen(), accept(), connect