2017-04-17

Link Aggregation on a LAN with TUN and Software Routing

Scary. It has been two years since I last wrote anything...

Background

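A while ago, on the boss's orders, I set up a Linux compute cluster for our lab. Until then, people in the lab ran their computations by logging in remotely to Windows machines in the server room, with no job scheduling at all. I was quite interested when I heard we were going to build a Linux cluster. (How did I end up doing ops again?)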

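However, the lab clearly had not bought its hardware with a cluster in mind. We started with just three compute servers and no storage server or network equipment. At first I wanted to add at least one server for login and storage, since the cluster topology was rather awkward without it. But my senior labmates and the boss preferred to build something from the existing hardware first and see how it went, so I pulled in an old desktop to serve as the login node. Given that, there was no point in talking the lab into spending upwards of ten thousand yuan on RDMA hardware either; a cheap gigabit switch would have to do.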

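So the cluster of this unnamed lab was built on that not-very-professional set of equipment. To squeeze every bit of performance out of it, I later tried many optimizations to the file system, the network, the software configuration, and so on. One of the ideas was: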

Every compute node in the cluster has four gigabit NICs. Can all of them be used together as one "four-gigabit" NIC to increase bandwidth?

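Warning: the author slacked off far too much in the computer networking course, and this post is full of non-professional wording. Use it with caution.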

Discussion

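Broadly speaking, any technique that uses several physical NICs at once to send and receive packets, in order to increase bandwidth (beyond that of a single NIC) and availability, is called link aggregation.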

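Link aggregation can be implemented at different layers of the network stack:

  • Physical layer (layer 1): if one Wi-Fi antenna gives a poor signal, fit two (I made this one up);
  • Data link layer (layer 2): Ethernet bonding on Linux;
  • Network layer (layer 3): manipulate routing so that packets are spread over several ports per packet, per connection, or per destination host, as in ECMP and the software-routing method described later;
  • ...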

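Each method has scenarios it suits, and the granularity of parallelism differs. Take a web server that has to hold connections with a large number of hosts around the world. Tweak the routing so that connections from some regions use the IP bound to port A and those from other regions use the IP bound to port B, and the load is balanced and the total bandwidth goes up, although a single connection still cannot exceed the bandwidth of one NIC.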

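What I care about more is communication between two machines on the LAN, and I want a single connection to exceed single-NIC bandwidth too. For example, when two machines in the cluster copy data over NFS or SSH, I would like the four aggregated NICs to reach a transfer rate of 500 MB/s.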
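The 500 MB/s target is simply the theoretical ceiling of four gigabit links. A minimal sketch of the arithmetic, ignoring protocol overhead (the function name is made up for illustration):

```shell
#!/bin/sh
# Theoretical ceiling for N aggregated NICs, ignoring protocol overhead.
aggregate_mbyte_per_s() {
    # $1 = number of NICs, $2 = Mbit/s per NIC; 8 bits per byte
    echo $(( $1 * $2 / 8 ))
}

# Four NICs at 1 Gbit/s (1000 Mbit/s) each, as in the cluster described here.
echo "$(aggregate_mbyte_per_s 4 1000) MB/s"   # prints "500 MB/s"
```

Real transfers will land somewhat below this figure because of Ethernet, IP, and TCP headers.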

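The network topology also needs to be stated: all hosts are connected through one unmanaged gigabit switch. Its backplane bandwidth is large enough that it will not be the bottleneck.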

Approaches

First, some notation. Assume we have three Linux hosts (HostA, HostB, HostC), each with two NICs, eth0 and eth1. The IPs bound to eth0 on the three hosts are 10.0.0.10/24, 10.0.0.20/24, and 10.0.0.30/24; the IPs bound to eth1 are 10.0.0.11/24, 10.0.0.21/24, and 10.0.0.31/24.

Ethernet Bonding

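Ethernet bonding is a Linux kernel feature that combines several physical NICs into one logical NIC. The IP is bound to the logical NIC, and the aggregation happens at the data link layer.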

There is plenty of material about it online, so I won't go into detail here.

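For the bonding modes aimed at higher throughput (balance-rr, for example), outgoing packets are spread over the physical ports, but receive-side load balancing (as done by balance-alb) works through ARP messages that bind the IP to different ports for different peers. Because of how ARP works, at any given moment the ARP table on a peer maps a given IP to only one port, so the packets that peer sends to this IP all pour into that one port (even though they may leave the peer through two ports). Bonding therefore cannot give me the link aggregation I need.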

ECMP and Software Routing

Equal-cost multi-path routing (ECMP), put simply, means configuring several "equal-cost" routes to the same destination.

For example, add an ECMP route with two next hops so that traffic to 10.0.0.0/24 can leave through either eth0 or eth1:

ip route add 10.0.0.0/24 nexthop dev eth0 nexthop dev eth1

Or, to reach the Internet through several default gateways:

ip route add default nexthop via <gateway1> dev eth0 nexthop via <gateway2> dev eth1

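The Linux kernel hashes a packet's source/destination IP and source/destination port (UDP/TCP) to decide which ECMP route to send it on (see the reference). Which fields actually go into the hash depends on the kernel version; on newer kernels it is controlled by the net.ipv4.fib_multipath_hash_policy sysctl, whose default covers the IP addresses only. Either way, the route is fixed at least for a given TCP/UDP four-tuple. The other problem is that ECMP only decides the route for outgoing packets; it cannot balance incoming ones. Balancing the receive side requires the sender to cooperate and send to both ports.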
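To make the "same flow, same route" property concrete, here is a toy model of flow-based next-hop selection. It is only a sketch: cksum stands in for the kernel's real hash, which is different, and the function name pick_nexthop is made up for illustration.

```shell
#!/bin/sh
# Toy model of flow-based ECMP selection. cksum is NOT the kernel's hash;
# it only illustrates that a fixed 4-tuple always maps to the same device.
pick_nexthop() {
    # $1 src IP, $2 dst IP, $3 src port, $4 dst port; remaining args: devices
    flow="$1,$2,$3,$4"
    shift 4
    h=$(printf '%s' "$flow" | cksum | cut -d ' ' -f 1)
    index=$(( h % $# ))     # $# is now the number of candidate devices
    shift "$index"
    echo "$1"
}

pick_nexthop 10.0.0.10 10.0.0.20 40000 22 eth0 eth1
pick_nexthop 10.0.0.10 10.0.0.20 40000 22 eth0 eth1   # same flow, same device
pick_nexthop 10.0.0.10 10.0.0.20 40001 22 eth0 eth1   # another flow may differ
```

Because the choice depends only on the flow's fields, a single connection can never use more than one device under this scheme, however many next hops are configured.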

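Since every port of every machine here sits on the same LAN and switch, where a packet ends up is decided directly by the ARP table. At any moment one IP maps to only one MAC, so incoming packets arrive on only one port. So with just the ECMP route above and no further configuration, the result so far is:

  • Sending can exceed single-NIC bandwidth, but only with several connections (so that their ECMP hashes differ)
  • Receiving on a single IP still comes in through one port and cannot exceed single-NIC bandwidth

So ECMP done this way still does not give me the link aggregation I need.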

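Improvement 1: Receive-Side Load Balancing

Actually, the earlier claim that "one IP maps to only one MAC at any moment" is not entirely accurate: each NIC has its own ARP table. For example, ip neigh show dev enp5s0 lists the ARP entries on enp5s0.

Why does that matter? Suppose different NICs hold different ARP entries: HostA's eth0 believes 10.0.0.20 is on HostB's eth0, HostA's eth1 believes 10.0.0.20 is on HostB's eth1, and so on, and likewise for HostB looking at HostA. Then receive-side load balancing comes along with send-side load balancing.

How can different NICs be given different ARP entries? The simplest way is... to buy a few more routers and isolate each group of NICs (say, the eth0 of every host) in its own broadcast domain. But static ARP entries can do the job too: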

arp -i eth0 -s 10.0.0.10 <MAC of HostA's eth0>
arp -i eth1 -s 10.0.0.10 <MAC of HostA's eth1>
arp -i eth0 -s 10.0.0.20 <MAC of HostB's eth0>
arp -i eth1 -s 10.0.0.20 <MAC of HostB's eth1>
arp -i eth0 -s 10.0.0.30 <MAC of HostC's eth0>
arp -i eth1 -s 10.0.0.30 <MAC of HostC's eth1>

It looks hard to maintain at first glance, but as long as every host has the same number of ports, all hosts can simply share one ARP table.
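Since the table is identical on every host, it can be generated from a single list. The following is a dry-run sketch that only prints the arp commands; the MAC addresses are made-up placeholders and the function name gen_static_arp is hypothetical.

```shell
#!/bin/sh
# Dry run: print the static ARP commands instead of executing them.
# The MAC addresses below are made-up placeholders.
# Format per line: <IP> <MAC of that host's eth0> <MAC of that host's eth1>
hosts='10.0.0.10 02:00:00:00:0a:00 02:00:00:00:0a:01
10.0.0.20 02:00:00:00:14:00 02:00:00:00:14:01
10.0.0.30 02:00:00:00:1e:00 02:00:00:00:1e:01'

gen_static_arp() {
    echo "$hosts" | while read -r ip mac0 mac1; do
        echo "arp -i eth0 -s $ip $mac0"
        echo "arp -i eth1 -s $ip $mac1"
    done
}

gen_static_arp
```

After reviewing the output, piping it to sh as root on each host would apply it.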

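Improvement 2: Per-Packet Load Balancing

The granularity of ECMP load balancing does not meet my needs. Per-packet load balancing is actually very easy: use iptables to mark outgoing packets with 1, 2, ..., N (N being the number of NICs), either at random or in turn, then set up policy routing that picks a route by mark.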

iptables -t mangle -A OUTPUT -d 10.0.0.0/24 -m statistic --mode nth --every 2 --packet 0 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT -d 10.0.0.0/24 -m statistic --mode nth --every 2 --packet 1 -j MARK --set-mark 2
ip rule add fwmark 1 table 10000
ip rule add fwmark 2 table 10001
ip route add 10.0.0.0/24 dev eth0 table 10000
ip route add 10.0.0.0/24 dev eth1 table 10001
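The two-NIC rules above follow a pattern that extends to N NICs. Here is a dry-run sketch that prints the commands rather than running them. The interface names eth0..eth(N-1) and the table numbers 10000+i are assumptions carried over from the two-NIC example, and the function name is made up.

```shell
#!/bin/sh
# Dry run: print the marking rules and policy routes for N NICs.
# Assumes interfaces eth0..eth(N-1) and routing tables 10000..10000+N-1.
gen_per_packet_rules() {
    n=$1        # number of NICs
    subnet=$2   # destination subnet to balance
    i=0
    while [ "$i" -lt "$n" ]; do
        mark=$(( i + 1 ))
        table=$(( 10000 + i ))
        # MARK does not stop rule traversal, so every rule sees every packet
        # and rule i marks the packets whose counter is i modulo N.
        echo "iptables -t mangle -A OUTPUT -d $subnet -m statistic --mode nth --every $n --packet $i -j MARK --set-mark $mark"
        echo "ip rule add fwmark $mark table $table"
        echo "ip route add $subnet dev eth$i table $table"
        i=$(( i + 1 ))
    done
}

gen_per_packet_rules 4 10.0.0.0/24
```

With N = 2 this reproduces the six commands above, just in a different order.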

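That is not all, though. The Linux kernel has a protection mechanism called reverse path filtering. In strict mode (the default on my Arch Linux), an incoming packet is dropped when the interface it arrives on is not the one the kernel would use to route back to its source address, which is exactly what happens once packets are spread over the NICs like this. The kernel parameter net.ipv4.conf.<interface>.rp_filter controls this behavior (see the reference):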

rp_filter - INTEGER
	0 - No source validation.
	1 - Strict mode as defined in RFC3704 Strict Reverse Path
	    Each incoming packet is tested against the FIB and if the interface
	    is not the best reverse path the packet check will fail.
	    By default failed packets are discarded.
	2 - Loose mode as defined in RFC3704 Loose Reverse Path
	    Each incoming packet's source address is also tested against the FIB
	    and if the source address is not reachable via any interface
	    the packet check will fail.

Setting it to 2, loose mode, is usually enough: the check then only requires the source IP to be reachable through some interface:

net.ipv4.conf.eth0.rp_filter=2
net.ipv4.conf.eth1.rp_filter=2
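One detail worth knowing when changing these values: for source validation the kernel uses the larger of net.ipv4.conf.all.rp_filter and the per-interface value. A small sketch of that rule (the function name is made up for illustration):

```shell
#!/bin/sh
# The kernel uses max(conf/all/rp_filter, conf/<iface>/rp_filter).
effective_rp_filter() {
    # $1 = value of net.ipv4.conf.all.rp_filter, $2 = per-interface value
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

effective_rp_filter 1 2   # all=strict, eth0=loose -> prints 2 (loose)
effective_rp_filter 1 0   # all=strict, eth0=off   -> prints 1 (still strict)
```

So raising an interface to 2 takes effect even when all is 1, whereas setting an interface to 0 would not switch the filter off.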

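Now load balancing works in both directions, receive and send, and it is per-packet. A test with iperf should show a single connection filling the bandwidth of several NICs.

At this point it no longer has much to do with ECMP; it is just an ordinary software-routing setup of iptables plus policy routing.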

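Simplifying Software Routing Management with TUN/TAP Devices

Above, binding a different static ARP table to each NIC sidestepped ARP broadcasts and allowed more elaborate software routing. Alternatively, add a layer of TUN/TAP devices (loosely speaking, a kind of "virtual NIC") and do real software routing, with no hand-written ARP table.

First create a TUN device on every machine and bind an arbitrary IP to it: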

ip tuntap add mode tun tun0
ip addr add 10.99.255.10/32 dev tun0    # HostA
ip addr add 10.99.255.20/32 dev tun0    # HostB
ip addr add 10.99.255.30/32 dev tun0    # HostC

Next, set up routing within the "virtual" network 10.99.255.[10,20,30]. This is done entirely in software, again by marking packets in turn plus policy routing:

iptables -t mangle -A OUTPUT -d 10.99.255.0/24 -m statistic --mode nth --every 2 --packet 0 -j MARK --set-mark 1
iptables -t mangle -A OUTPUT -d 10.99.255.0/24 -m statistic --mode nth --every 2 --packet 1 -j MARK --set-mark 2
ip rule add fwmark 1 table 10000
ip rule add fwmark 2 table 10001
ip route add 10.99.255.10 via 10.0.0.10 dev eth0 table 10000
ip route add 10.99.255.10 via 10.0.0.11 dev eth1 table 10001
ip route add 10.99.255.20 via 10.0.0.20 dev eth0 table 10000
ip route add 10.99.255.20 via 10.0.0.21 dev eth1 table 10001
ip route add 10.99.255.30 via 10.0.0.30 dev eth0 table 10000
ip route add 10.99.255.30 via 10.0.0.31 dev eth1 table 10001

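Compared with the previous method, the only difference the TUN makes is swapping the hand-written static ARP table for a hand-written routing table (an ARP table is essentially a kind of routing table too). Still, IP addresses you choose yourself look nicer than messy MAC addresses, and this method leaves the IPs bound to the physical NICs and their routes completely untouched.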

 

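With that, the link aggregation I wanted is implemented fairly elegantly, I think. To make it play nicely with Debian's ifupdown, I wrote an if-up script that sets up the aggregation (Gist). Put it in /etc/network/if-up.d and configure it like this: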

auto tun0
iface tun0 inet static
    address 10.99.255.10
    netmask 255.255.255.255
    mtu 9000
    fakenet 10.99.255.0/24
    table 10000 10001 10002 10003
    rspec 10.99.255.1:10.3.23.1:10.3.23.1:10.3.23.1 \
          10.99.255.10:10.99.0.10:10.99.1.10:10.99.2.10:10.3.23.10 \
          10.99.255.20:10.99.0.20:10.99.1.20:10.99.2.20:10.3.23.20 \
          10.99.255.30:10.99.0.30:10.99.1.30:10.99.2.30:10.3.23.30
    pre-up ip tuntap add mode tun tun0
    post-down ip tuntap del mode tun tun0
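The Gist itself is not reproduced here. As a rough idea of what such a script has to do, the sketch below turns rspec-style entries into per-table routes and prints them. The entry format (virtual IP followed by one next hop per table, colon-separated), the eth0, eth1, ... device naming, and the function name are all guesses for illustration, not the actual script.

```shell
#!/bin/sh
# Sketch only, not the author's script: turn rspec-style entries into routes.
# Assumed format: <virtual IP>:<next hop for 1st table>:<next hop for 2nd>...
# Assumed devices: eth0 for the 1st table, eth1 for the 2nd, and so on.
gen_tun_routes() {
    tables=$1   # space-separated routing table numbers
    shift
    for spec in "$@"; do
        vip=${spec%%:*}       # virtual (TUN) address
        hops=${spec#*:}       # remaining colon-separated next hops
        i=0
        for table in $tables; do
            hop=${hops%%:*}
            echo "ip route add $vip via $hop dev eth$i table $table"
            case $hops in
                *:*) hops=${hops#*:} ;;
                *)   break ;;   # no next hops left for this entry
            esac
            i=$(( i + 1 ))
        done
    done
}

gen_tun_routes "10000 10001" \
    10.99.255.10:10.0.0.10:10.0.0.11 \
    10.99.255.20:10.0.0.20:10.0.0.21
```

Fed with the two-NIC addresses used earlier in this post, it prints the routes for HostA and HostB from the TUN section.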

Category: Computers | Tags: Linux, Networking