The Docker Networking Model, Layers 1 and 2

Exploring Docker container networking: the virtualized physical and data link layers.

April 28, 2018

Part one in a series of posts on Docker networking

The default networking model used by Docker is a simple and familiar pattern: a multihomed host connecting an “internal” network to “external” ones, providing address translation as it forwards packets back and forth. In this case, the internal network is one or more virtualized segments connecting local containers, and the external networks are the Docker host’s other network connections. The Docker host will typically route packets between these networks.

The private internal network segment is built around a kernel ethernet bridge. Each container is “wired” to the Docker host’s bridge by one half of a veth(4) interface pair. The other half is placed in the container’s network namespace and manifests as its eth0 interface.

Our Test Setup

We’ll start up a container with a socket printer to facilitate further investigation.

Run a daemonized container with netcat(1)1 listening on 80/tcp.

jereme@buttercup $ docker run -d -p 8080:80 --name=test_ct debian nc -lkp 80
5d0b5875fcde3e67b02db0a356c129a34b02ad6ae70b6c076ef53845fbba3acb

In this simple topology we have a single container, test_ct, with a network interface, eth0, connected to the default Docker network, which is named bridge.
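
Before digging into the plumbing, it’s worth a quick sanity check that the published port actually reaches the listener. A minimal sketch (not a captured session; you should see your payload show up in the container’s logs, since nc writes whatever it receives to stdout):

$ echo hello | nc -w1 localhost 8080
$ docker logs test_ct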

Network Segment Design

Each of these networks is built around a kernel ethernet bridge. The default Docker network is named bridge and uses an ethernet bridge named docker0.

We can list out the currently defined Docker networks.

jereme@buttercup $ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
be324648dcae        bridge              bridge              local
7c1aff0016a4        host                host                local
70a1ccaba5b5        none                null                local

…and then inspect a particular network, dumping out all sorts of configuration details and the state of currently connected containers.

jereme@buttercup $ docker network inspect bridge
[
    {
        "Name": "bridge",
        "Id": "be324648dcae32e5b3f61ed5824e3c2cce1c249c26886b48adb5ca1c21719659",
        "Created": "2018-07-02T20:17:26.877137429-04:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.1/16",
                    "IPRange": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "5d0b5875fcde3e67b02db0a356c129a34b02ad6ae70b6c076ef53845fbba3acb": {
                "Name": "test_ct",
                "EndpointID": "804ccbc32cdec13d835d6eaf7719d3a3bae8798a47e3f04ff0d4b183281afd44",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]
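
The same pattern holds for user-defined networks: creating one with the bridge driver results in another kernel bridge alongside docker0. A quick sketch to illustrate (extra_net is just a throwaway name here; the host-side bridge will show up as br- followed by a prefix of the network ID):

$ docker network create --driver bridge extra_net
$ ip -o link list | grep br-
$ docker network rm extra_net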

Examining a Bridge

Using brctl(8), from bridge-utils, we can examine the bridge interface, docker0, which forms the base of the bridge Docker network.

Here we see the bridge has a single connected interface: veth8c9981e:

jereme@buttercup $ sudo brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.024278a263b6       no              veth8c9981e
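
If you don’t have bridge-utils handy, the same information is available from iproute2, either with bridge(8) or by filtering ip(8) on the master device (illustrative commands, not captured from the host above):

$ bridge link show
$ ip -o link list master docker0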

Bridges are simply network interfaces, and they can be managed like any other network link with ip(8), from iproute2.

jereme@buttercup $ ip link list dev docker0
50: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:78:a2:63:b6 brd ff:ff:ff:ff:ff:ff
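
Adding the -d (details) flag asks ip to print the bridge-specific attributes as well, things like STP state, ageing time, and forwarding delay (the exact fields you get back depend on your iproute2 and kernel versions):

$ ip -d link show dev docker0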

Together these two interfaces: docker0 and veth8c9981e, form a single broadcast domain. As we add additional containers to the bridge Docker network, we will see additional veth interfaces added to that bridge, expanding the network segment.
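
One way to watch that broadcast domain in action is to look at the bridge’s forwarding database, which learns each container’s MAC address as it sends frames. Either of the following should do (illustrative; once test_ct has sent some traffic you should find its 02:42:ac:11:00:02 address listed against the veth port):

$ sudo brctl showmacs docker0
$ sudo bridge fdb show br docker0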

The design is quite scalable. Here is a production Docker host with a few hundred containers on the docker0 bridge.

jereme@gnarly-carrot.nyc2 $ sudo brctl show docker0 | head
bridge name     bridge id               STP enabled     interfaces
docker0         8000.021c744cae6f       no              veth0032851
                                                        veth00ef9dd
                                                        veth01570e4
                                                        veth033485b
                                                        veth040a5dc
                                                        veth05e46cd
                                                        veth061f1f7
                                                        veth07251d6
                                                        veth072e807
jereme@gnarly-carrot.nyc2 $ sudo brctl show docker0 | grep -c veth
349

Connecting Containers to a Bridge

Containers are connected to a given Docker network’s ethernet bridge via a pair of interconnected veth(4) interfaces. These interface pairs can be thought of as the two ends of a tunnel. This is the crux of Docker container network connectivity.

We’ve seen the beginnings of this already: one half of a veth pair, like veth8c9981e above, remains in the Docker host’s network namespace where it’s connected to the specified bridge. The other half is placed into the network namespace of the container and manifests as its eth0 interface. In so doing, we establish the layer 2 path upon which the rest of the container’s network connectivity will be built.

Using our test_ct as an example, recall the connected interface, veth8c9981e, on bridge docker0.

jereme@buttercup $ sudo brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.024278a263b6       no              veth8c9981e

We can use ethtool(8) to get veth8c9981e’s peer_ifindex - the other end of the tunnel, so to speak.

jereme@buttercup $ sudo ethtool -S veth8c9981e
NIC statistics:
     peer_ifindex: 75

Looking through the list of interfaces on our Docker host, we don’t see an interface with index 75…

jereme@buttercup $ ip -o link list | cut -d : -f 1,2
1: lo
2: eth0
3: wlan0
50: docker0
74: tun0
76: veth8c9981e@if75

…but if we examine the interfaces in our test_ct container, we find the associated peer interface.

jereme@buttercup $ docker exec test_ct ip -o link list | cut -d : -f 1,2
1: lo
75: eth0@if76

Docker has handled the work of moving the peer interface into our container’s network namespace. In so doing, our container is now connected to our Docker network’s designated bridge interface.

Containers are processes running with isolated namespaces and other resource partitioning, like control groups. This is a simplification, but not untrue.

Accordingly, this nicely buttoned-up docker(1) invocation:

jereme@buttercup $ docker exec test_ct ip link show dev eth0
75: eth0@if76: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0

…is equivalent to using nsenter(1), from the util-linux package, to run ip in the net and mount namespaces of our container’s process.

jereme@buttercup $ ct_pid=$(docker inspect test_ct | jq .[].State.Pid)

jereme@buttercup $ ps $ct_pid
  PID TTY      STAT   TIME COMMAND
 7581 ?        Ss     0:00 nc -lkp 80

jereme@buttercup $ sudo nsenter --net --mount --target $ct_pid ip link show dev eth0
75: eth0@if76: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
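
A related trick: ip-netns(8) can operate on the container’s network namespace directly if you expose it under /var/run/netns, which Docker does not do by default. A sketch, reusing the $ct_pid captured above (remember to remove the symlink when you’re done):

$ sudo mkdir -p /var/run/netns
$ sudo ln -sfT /proc/$ct_pid/ns/net /var/run/netns/test_ct
$ sudo ip netns exec test_ct ip link show dev eth0
$ sudo rm /var/run/netns/test_ct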

Manually Adding a Second veth Interface to a Container

You can do this manually if you want to explore a bit deeper.

Here we create a new veth pair: ve_A and ve_B2, which are allocated indices 90 and 89 respectively.

jereme@buttercup $ sudo ip link add ve_A type veth peer name ve_B

jereme@buttercup $ ip -o link list | cut -d : -f 1,2
1: lo
2: eth0
3: wlan0
50: docker0
74: tun0
76: veth8c9981e@if75
89: ve_B@ve_A
90: ve_A@ve_B

We then move ve_B into test_ct’s network namespace:

jereme@buttercup $ sudo ip link set ve_B netns $(docker inspect test_ct | jq .[].State.Pid)

The other half of our veth pair, ve_A, remains in the Docker host’s namespace. Notice how ip now displays ve_A@ve_B as ve_A@if89 (89 being the peer interface’s index).

jereme@buttercup $ ip -o link list | cut -d : -f 1,2
1: lo
2: eth0
3: wlan0
50: docker0
74: tun0
76: veth8c9981e@if75
90: ve_A@if89

Looking back into our container, we see the newly added interface:

jereme@buttercup $ docker exec test_ct ip -o link list | cut -d : -f 1,2
1: lo
75: eth0@if76
89: ve_B@if90
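
As it stands, this second pair can’t carry any traffic: ve_A isn’t attached to a bridge, and both ends are down. A sketch of the remaining layer 2 plumbing, attaching ve_A to docker0 and bringing both ends up (reusing $ct_pid from earlier; addressing is a layer 3 concern, so we’ll leave ve_B without an IP for now):

$ sudo ip link set ve_A master docker0
$ sudo ip link set ve_A up
$ sudo nsenter --net --target $ct_pid ip link set ve_B up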

Regarding Interface Indices

From what we’ve seen so far it looks like interface indices are global to the kernel. After all, we created a veth pair, moved interface 89 into an existing container, and observed that it retained the index 89. However, I believe there is actually one interface index space per network namespace, and what we have are two distinct 89s, so to speak.

I’m glad the index value is retained when we move interfaces between namespaces, since this helps keep things simple, but I believe the index spaces are distinct. As an example, you can see each namespace has an index 1 for its loopback interface, lo.

jereme@buttercup $ ip link list dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

jereme@buttercup $ docker exec test_ct ip link list dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

I suppose it’s possible that there’s some special handling for lo, but I don’t believe that’s the case.
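
A demonstration that doesn’t involve lo at all: in a brand-new namespace, a freshly created interface is handed a low index like 2, even though the host namespace has long since moved past that. A quick sketch, assuming the dummy module is available (idxtest and dm0 are throwaway names):

$ sudo ip netns add idxtest
$ sudo ip netns exec idxtest ip link add dm0 type dummy
$ sudo ip netns exec idxtest ip -o link list | cut -d : -f 1,2
1: lo
2: dm0
$ sudo ip netns del idxtest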

Consider dev_new_index from net/core/dev.c.

/**
 *	dev_new_index	-	allocate an ifindex
 *	@net: the applicable net namespace
 *
 *	Returns a suitable unique value for a new device interface
 *	number.  The caller must hold the rtnl semaphore or the
 *	dev_base_lock to be sure it remains unique.
 */
static int dev_new_index(struct net *net)
{
	int ifindex = net->ifindex;
	for (;;) {
		if (++ifindex <= 0)
			ifindex = 1;
		if (!__dev_get_by_index(net, ifindex))
			return net->ifindex = ifindex;
	}
}

I get the impression, from lots of mailing list posts, that in the early days of namespaces interface indices were still global, but I don’t think this is the case anymore. Corrections here (as always) are most welcome!

Conclusion

So there you have it: some of the fundamentals of how interfaces are organized and bridged into networks to interconnect containers. In the next post in this series, we’ll move up the stack a bit and look at addressing and routing of traffic, between containers, and beyond the Docker host itself.


  1. You’ll want the OpenBSD flavor of netcat(1), which provides the -k flag.
  2. The interface names do not include the @{peer} part. That’s just a helpful detail provided by ip(8). You won’t see this in /sys/class/net.