linux

mirror of https://github.com/torvalds/linux synced 2024-10-03 09:48:02 +00:00

Author	SHA1	Message	Date
Daniel Borkmann	1cd4bc987a	vxlan: Fix regression when dropping packets due to invalid src addresses Commit `f58f45c1e5` ("vxlan: drop packets from invalid src-address") has recently been added to vxlan mainly in the context of source address snooping/learning so that when it is enabled, an entry in the FDB is not being created for an invalid address for the corresponding tunnel endpoint. Before commit `f58f45c1e5` vxlan was similarly behaving as geneve in that it passed through whichever macs were set in the L2 header. It turns out that this change in behavior breaks setups, for example, Cilium with netkit in L3 mode for Pods as well as tunnel mode has been passing before the change in `f58f45c1e5` for both vxlan and geneve. After mentioned change it is only passing for geneve as in case of vxlan packets are dropped due to vxlan_set_mac() returning false as source and destination macs are zero which for E/W traffic via tunnel is totally fine. Fix it by only opting into the is_valid_ether_addr() check in vxlan_set_mac() when in fact source address snooping/learning is actually enabled in vxlan. This is done by moving the check into vxlan_snoop(). With this change, the Cilium connectivity test suite passes again for both tunnel flavors. Fixes: `f58f45c1e5` ("vxlan: drop packets from invalid src-address") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Bauer <mail@david-bauer.net> Cc: Ido Schimmel <idosch@nvidia.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Bauer <mail@david-bauer.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-06-05 10:51:10 +01:00
Eric Dumazet	1eb2cded45	net: annotate writes on dev->mtu from ndo_change_mtu() Simon reported that ndo_change_mtu() methods were never updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted in commit `501a90c945` ("inet: protect against too small mtu values.") We read dev->mtu without holding RTNL in many places, with READ_ONCE() annotations. It is time to take care of ndo_change_mtu() methods to use corresponding WRITE_ONCE() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/ Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-05-07 16:19:14 -07:00
Eric Dumazet	9cf621bd5f	rtnetlink: allow rtnl_fill_link_netnsid() to run under RCU protection We want to be able to run rtnl_fill_ifinfo() under RCU protection instead of RTNL in the future. All rtnl_link_ops->get_link_net() methods already using dev_net() are ready. I added READ_ONCE() annotations on others. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-05-07 11:14:50 +02:00
Jakub Kicinski	e958da0ddb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c `66e13b615a` ("bpf: verifier: prevent userspace memory access") `d503a04f8b` ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-05-02 12:06:25 -07:00
Guillaume Nault	f778941913	vxlan: Pull inner IP header in vxlan_rcv(). Ensure the inner IP header is part of skb's linear data before reading its ECN bits. Otherwise we might read garbage. One symptom is the system erroneously logging errors like "vxlan: non-ECT from xxx.xxx.xxx.xxx with TOS=xxxx". Similar bugs have been fixed in geneve, ip_tunnel and ip6_tunnel (see commit `1ca1ba465e` ("geneve: make sure to pull inner header in geneve_rx()") for example). So let's reuse the same code structure for consistency. Maybe we'll can add a common helper in the future. Fixes: `d342894c5d` ("vxlan: virtual extensible lan") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/1239c8db54efec341dd6455c77e0380f58923a3c.1714495737.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-05-01 19:07:11 -07:00
Guillaume Nault	b22ea4ef4c	vxlan: Add missing VNI filter counter update in arp_reduce(). VXLAN stores per-VNI statistics using vxlan_vnifilter_count(). These statistics were not updated when arp_reduce() failed its pskb_may_pull() call. Use vxlan_vnifilter_count() to update the VNI counter when that happens. Fixes: `4095e0e132` ("drivers: vxlan: vnifilter: per vni stats") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-29 13:39:15 +01:00
Guillaume Nault	6dee402dab	vxlan: Fix racy device stats updates. VXLAN devices update their stats locklessly. Therefore these counters should either be stored in per-cpu data structures or the updates should be done using atomic increments. Since the net_device_core_stats infrastructure is already used in vxlan_rcv(), use it for the other rx_dropped and tx_dropped counter updates. Update the other counters atomically using DEV_STATS_INC(). Fixes: `d342894c5d` ("vxlan: virtual extensible lan") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-29 13:39:15 +01:00
Eric Dumazet	e8dfd42c17	ipv6: introduce dst_rt6_info() helper Instead of (struct rt6_info *)dst casts, we can use : #define dst_rt6_info(_ptr) \ container_of_const(_ptr, struct rt6_info, dst) Some places needed missing const qualifiers : ip6_confirm_neigh(), ipv6_anycast_destination(), ipv6_unicast_destination(), has_gateway() v2: added missing parts (David Ahern) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-29 13:32:01 +01:00
Jakub Kicinski	2bd87951de	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c `89884459a0` ("wifi: mac80211: fix idle calculation with multi-link") `87f5500285` ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()") https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/ net/unix/garbage.c `1971d13ffa` ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") `4090fa373f` ("af_unix: Replace garbage collection algorithm.") drivers/net/ethernet/ti/icssg/icssg_prueth.c drivers/net/ethernet/ti/icssg/icssg_common.c `4dcd0e83ea` ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") `e2dc7bfd67` ("net: ti: icssg-prueth: Move common functions into a separate file") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-25 12:41:37 -07:00
David Bauer	f58f45c1e5	vxlan: drop packets from invalid src-address The VXLAN driver currently does not check if the inner layer2 source-address is valid. In case source-address snooping/learning is enabled, a entry in the FDB for the invalid address is created with the layer3 address of the tunnel endpoint. If the frame happens to have a non-unicast address set, all this non-unicast traffic is subsequently not flooded to the tunnel network but sent to the learnt host in the FDB. To make matters worse, this FDB entry does not expire. Apply the same filtering for packets as it is done for bridges. This not only drops these invalid packets but avoids them from being learnt into the FDB. Fixes: `d342894c5d` ("vxlan: virtual extensible lan") Suggested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David Bauer <mail@david-bauer.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-19 12:54:33 +01:00
Alexander Lobakin	5832c4a77d	ip_tunnel: convert __be16 tunnel flags to bitmaps Historically, tunnel flags like TUNNEL_CSUM or TUNNEL_ERSPAN_OPT have been defined as __be16. Now all of those 16 bits are occupied and there's no more free space for new flags. It can't be simply switched to a bigger container with no adjustments to the values, since it's an explicit Endian storage, and on LE systems (__be16)0x0001 equals to (__be64)0x0001000000000000. We could probably define new 64-bit flags depending on the Endianness, i.e. (__be64)0x0001 on BE and (__be64)0x00010000... on LE, but that would introduce an Endianness dependency and spawn a ton of Sparse warnings. To mitigate them, all of those places which were adjusted with this change would be touched anyway, so why not define stuff properly if there's no choice. Define IP_TUNNEL__BIT counterparts as a bit number instead of the value already coded and a fistful of <16 <-> bitmap> converters and helpers. The two flags which have a different bit position are SIT_ISATAP_BIT and VTI_ISVTI_BIT, as they were defined not as __cpu_to_be16(), but as (__force __be16), i.e. had different positions on LE and BE. Now they both have strongly defined places. Change all __be16 fields which were used to store those flags, to IP_TUNNEL_DECLARE_FLAGS() -> DECLARE_BITMAP(__IP_TUNNEL_FLAG_NUM) -> unsigned long[1] for now, and replace all TUNNEL_ occurrences to their bitmap counterparts. Use the converters in the places which talk to the userspace, hardware (NFP) or other hosts (GRE header). The rest must explicitly use the new flags only. This must be done at once, otherwise there will be too many conversions throughout the code in the intermediate commits. Finally, disable the old __be16 flags for use in the kernel code (except for the two 'irregular' flags mentioned above), to prevent any accidental (mis)use of them. For the userspace, nothing is changed, only additions were made. Most noticeable bloat-o-meter difference (.text): vmlinux: 307/-1 (306) gre.ko: 62/0 (62) ip_gre.ko: 941/-217 (724) [] ip_tunnel.ko: 390/-900 (-510) [] ip_vti.ko: 138/0 (138) ip6_gre.ko: 534/-18 (516) [] ip6_tunnel.ko: 118/-10 (108) [] gre_flags_to_tnl_flags() grew, but still is inlined [] ip_tunnel_find() got uninlined, hence such decrease The average code size increase in non-extreme case is 100-200 bytes per module, mostly due to sizeof(long) > sizeof(__be16), as %__IP_TUNNEL_FLAG_NUM is less than %BITS_PER_LONG and the compilers are able to expand the majority of bitmap_() calls here into direct operations on scalars. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 10:49:28 +01:00
Breno Leitao	195f88c577	vxlan: Remove generic .ndo_get_stats64 Commit `3e2f544dd8` ("net: get stats64 if device if driver is configured") moved the callback to dev_get_tstats64() to net core, so, unless the driver is doing some custom stats collection, it does not need to set .ndo_get_stats64. Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64 function pointer. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com> Link: https://lore.kernel.org/r/20240311112437.3813987-2-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-11 16:01:06 -07:00
Breno Leitao	e28c5efc31	vxlan: Do not alloc tstats manually With commit `34d21de99c` ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done on net core instead of in this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now. Remove the allocation in the vxlan driver and leverage the network core allocation instead. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com> Link: https://lore.kernel.org/r/20240311112437.3813987-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-11 16:01:06 -07:00
Ricardo B. Marliere	c7170e7672	net: vxlan: constify the struct device_type usage Since commit `aed65af1cc` ("drivers: make device_type const"), the driver core can properly handle constant struct device_type. Move the vxlan_type variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-21 09:45:23 +00:00
Eric Dumazet	0bef512012	net: add netdev_lockdep_set_classes() to virtual drivers Based on a syzbot report, it appears many virtual drivers do not yet use netdev_lockdep_set_classes(), triggerring lockdep false positives. WARNING: possible recursive locking detected 6.8.0-rc4-next-20240212-syzkaller #0 Not tainted syz-executor.0/19016 is trying to acquire lock: ffff8880162cb298 (_xmit_ETHER#2){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] ffff8880162cb298 (_xmit_ETHER#2){+.-.}-{2:2}, at: __netif_tx_lock include/linux/netdevice.h:4452 [inline] ffff8880162cb298 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x1c4/0x5f0 net/sched/sch_generic.c:340 but task is already holding lock: ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: __netif_tx_lock include/linux/netdevice.h:4452 [inline] ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x1c4/0x5f0 net/sched/sch_generic.c:340 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(_xmit_ETHER#2); lock(_xmit_ETHER#2); * DEADLOCK * May be due to missing lock nesting notation 9 locks held by syz-executor.0/19016: #0: ffffffff8f385208 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:79 [inline] #0: ffffffff8f385208 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x82c/0x1040 net/core/rtnetlink.c:6603 #1: ffffc90000a08c00 ((&in_dev->mr_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0xc0/0x600 kernel/time/timer.c:1697 #2: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline] #2: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline] #2: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1360 net/ipv4/ip_output.c:228 #3: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: local_bh_disable include/linux/bottom_half.h:20 [inline] #3: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: rcu_read_lock_bh include/linux/rcupdate.h:802 [inline] #3: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x2c4/0x3b10 net/core/dev.c:4284 #4: ffff8880416e3258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: spin_trylock include/linux/spinlock.h:361 [inline] #4: ffff8880416e3258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: qdisc_run_begin include/net/sch_generic.h:195 [inline] #4: ffff8880416e3258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_xmit_skb net/core/dev.c:3771 [inline] #4: ffff8880416e3258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_queue_xmit+0x1262/0x3b10 net/core/dev.c:4325 #5: ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] #5: ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: __netif_tx_lock include/linux/netdevice.h:4452 [inline] #5: ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x1c4/0x5f0 net/sched/sch_generic.c:340 #6: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline] #6: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline] #6: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1360 net/ipv4/ip_output.c:228 #7: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: local_bh_disable include/linux/bottom_half.h:20 [inline] #7: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: rcu_read_lock_bh include/linux/rcupdate.h:802 [inline] #7: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x2c4/0x3b10 net/core/dev.c:4284 #8: ffff888014d9d258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: spin_trylock include/linux/spinlock.h:361 [inline] #8: ffff888014d9d258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: qdisc_run_begin include/net/sch_generic.h:195 [inline] #8: ffff888014d9d258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_xmit_skb net/core/dev.c:3771 [inline] #8: ffff888014d9d258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_queue_xmit+0x1262/0x3b10 net/core/dev.c:4325 stack backtrace: CPU: 1 PID: 19016 Comm: syz-executor.0 Not tainted 6.8.0-rc4-next-20240212-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 check_deadlock kernel/locking/lockdep.c:3062 [inline] validate_chain+0x15c1/0x58e0 kernel/locking/lockdep.c:3856 __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e4/0x530 kernel/locking/lockdep.c:5754 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:351 [inline] __netif_tx_lock include/linux/netdevice.h:4452 [inline] sch_direct_xmit+0x1c4/0x5f0 net/sched/sch_generic.c:340 __dev_xmit_skb net/core/dev.c:3784 [inline] __dev_queue_xmit+0x1912/0x3b10 net/core/dev.c:4325 neigh_output include/net/neighbour.h:542 [inline] ip_finish_output2+0xe66/0x1360 net/ipv4/ip_output.c:235 iptunnel_xmit+0x540/0x9b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x20ee/0x2960 net/ipv4/ip_tunnel.c:831 erspan_xmit+0x9de/0x1460 net/ipv4/ip_gre.c:720 __netdev_start_xmit include/linux/netdevice.h:4989 [inline] netdev_start_xmit include/linux/netdevice.h:5003 [inline] xmit_one net/core/dev.c:3555 [inline] dev_hard_start_xmit+0x242/0x770 net/core/dev.c:3571 sch_direct_xmit+0x2b6/0x5f0 net/sched/sch_generic.c:342 __dev_xmit_skb net/core/dev.c:3784 [inline] __dev_queue_xmit+0x1912/0x3b10 net/core/dev.c:4325 neigh_output include/net/neighbour.h:542 [inline] ip_finish_output2+0xe66/0x1360 net/ipv4/ip_output.c:235 igmpv3_send_cr net/ipv4/igmp.c:723 [inline] igmp_ifc_timer_expire+0xb71/0xd90 net/ipv4/igmp.c:813 call_timer_fn+0x17e/0x600 kernel/time/timer.c:1700 expire_timers kernel/time/timer.c:1751 [inline] __run_timers+0x621/0x830 kernel/time/timer.c:2038 run_timer_softirq+0x67/0xf0 kernel/time/timer.c:2051 __do_softirq+0x2bc/0x943 kernel/softirq.c:554 invoke_softirq kernel/softirq.c:428 [inline] __irq_exit_rcu+0xf2/0x1c0 kernel/softirq.c:633 irq_exit_rcu+0x9/0x30 kernel/softirq.c:645 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1076 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1076 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702 RIP: 0010:resched_offsets_ok kernel/sched/core.c:10127 [inline] RIP: 0010:__might_resched+0x16f/0x780 kernel/sched/core.c:10142 Code: 00 4c 89 e8 48 c1 e8 03 48 ba 00 00 00 00 00 fc ff df 48 89 44 24 38 0f b6 04 10 84 c0 0f 85 87 04 00 00 41 8b 45 00 c1 e0 08 <01> d8 44 39 e0 0f 85 d6 00 00 00 44 89 64 24 1c 48 8d bc 24 a0 00 RSP: 0018:ffffc9000ee069e0 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8880296a9e00 RDX: dffffc0000000000 RSI: ffff8880296a9e00 RDI: ffffffff8bfe8fa0 RBP: ffffc9000ee06b00 R08: ffffffff82326877 R09: 1ffff11002b5ad1b R10: dffffc0000000000 R11: ffffed1002b5ad1c R12: 0000000000000000 R13: ffff8880296aa23c R14: 000000000000062a R15: 1ffff92001dc0d44 down_write+0x19/0x50 kernel/locking/rwsem.c:1578 kernfs_activate fs/kernfs/dir.c:1403 [inline] kernfs_add_one+0x4af/0x8b0 fs/kernfs/dir.c:819 __kernfs_create_file+0x22e/0x2e0 fs/kernfs/file.c:1056 sysfs_add_file_mode_ns+0x24a/0x310 fs/sysfs/file.c:307 create_files fs/sysfs/group.c:64 [inline] internal_create_group+0x4f4/0xf20 fs/sysfs/group.c:152 internal_create_groups fs/sysfs/group.c:192 [inline] sysfs_create_groups+0x56/0x120 fs/sysfs/group.c:218 create_dir lib/kobject.c:78 [inline] kobject_add_internal+0x472/0x8d0 lib/kobject.c:240 kobject_add_varg lib/kobject.c:374 [inline] kobject_init_and_add+0x124/0x190 lib/kobject.c:457 netdev_queue_add_kobject net/core/net-sysfs.c:1706 [inline] netdev_queue_update_kobjects+0x1f3/0x480 net/core/net-sysfs.c:1758 register_queue_kobjects net/core/net-sysfs.c:1819 [inline] netdev_register_kobject+0x265/0x310 net/core/net-sysfs.c:2059 register_netdevice+0x1191/0x19c0 net/core/dev.c:10298 bond_newlink+0x3b/0x90 drivers/net/bonding/bond_netlink.c:576 rtnl_newlink_create net/core/rtnetlink.c:3506 [inline] __rtnl_newlink net/core/rtnetlink.c:3726 [inline] rtnl_newlink+0x158f/0x20a0 net/core/rtnetlink.c:3739 rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6606 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543 netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367 netlink_sendmsg+0xa3c/0xd70 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 __sys_sendto+0x3a4/0x4f0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2199 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6d/0x75 RIP: 0033:0x7fc3fa87fa9c Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240212140700.2795436-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 18:45:06 -08:00
Eric Dumazet	110d3047a3	vxlan: use exit_batch_rtnl() method exit_batch_rtnl() is called while RTNL is held, and devices to be unregistered can be queued in the dev_kill_list. This saves one rtnl_lock()/rtnl_unlock() pair per netns and one unregister_netdevice_many() call. v4: (Paolo feedback : https://netdev-3.bots.linux.dev/vmksft-net/results/453141/17-udpgro-fwd-sh/stdout ) - Changed vxlan_destroy_tunnels() to use vxlan_dellink() instead of unregister_netdevice_queue to propely remove devices from vn->vxlan_list. - vxlan_destroy_tunnels() can simply iterate one list (vn->vxlan_list) to find all devices in the most efficient way. - Moved sanity checks in a separate vxlan_exit_net() method. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://lore.kernel.org/r/20240206144313.2050392-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-07 18:55:11 -08:00
Ido Schimmel	4cde72fead	vxlan: mdb: Add MDB bulk deletion support Implement MDB bulk deletion support in the VXLAN driver, allowing MDB entries to be deleted in bulk according to provided parameters. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-12-20 11:27:21 +00:00
Alce Lafranque	c6e9dba3be	vxlan: add support for flowlabel inherit By default, VXLAN encapsulation over IPv6 sets the flow label to 0, with an option for a fixed value. This commits add the ability to inherit the flow label from the inner packet, like for other tunnel implementations. This enables devices using only L3 headers for ECMP to correctly balance VXLAN-encapsulated IPv6 packets. ``` $ ./ip/ip link add dummy1 type dummy $ ./ip/ip addr add 2001:db8::2/64 dev dummy1 $ ./ip/ip link set up dev dummy1 $ ./ip/ip link add vxlan1 type vxlan id 100 flowlabel inherit remote 2001:db8::1 local 2001:db8::2 $ ./ip/ip link set up dev vxlan1 $ ./ip/ip addr add 2001:db8:1::2/64 dev vxlan1 $ ./ip/ip link set arp off dev vxlan1 $ ping -q 2001:db8:1::1 & $ tshark -d udp.port==8472,vxlan -Vpni dummy1 -c1 [...] Internet Protocol Version 6, Src: 2001:db8::2, Dst: 2001:db8::1 0110 .... = Version: 6 .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT) .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0) .... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0) .... 1011 0001 1010 1111 1011 = Flow Label: 0xb1afb [...] Virtual eXtensible Local Area Network Flags: 0x0800, VXLAN Network ID (VNI) Group Policy ID: 0 VXLAN Network Identifier (VNI): 100 [...] Internet Protocol Version 6, Src: 2001:db8:1::2, Dst: 2001:db8:1::1 0110 .... = Version: 6 .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT) .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0) .... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0) .... 1011 0001 1010 1111 1011 = Flow Label: 0xb1afb ``` Signed-off-by: Alce Lafranque <alce@lafranque.net> Co-developed-by: Vincent Bernat <vincent@bernat.ch> Signed-off-by: Vincent Bernat <vincent@bernat.ch> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-11-16 22:33:31 +00:00
Benjamin Poirier	6d90b64256	vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size() This patch is basically a followup to commit `4e4b1798cc` ("vxlan: Add missing entries to vxlan_get_size()"). All of the attributes in vxlan_get_size() appear in the same order that they are filled in vxlan_fill_info() except for IFLA_VXLAN_PORT_RANGE. For consistency, move that entry to match its order and add a comment, like for all other entries. Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Link: https://lore.kernel.org/r/20231027184410.236671-1-bpoirier@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-10-27 15:36:38 -07:00
Ido Schimmel	32d9673e96	vxlan: mdb: Add MDB get support Implement support for MDB get operation by looking up a matching MDB entry, allocating the skb according to the entry's size and then filling in the response. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-27 10:51:42 +01:00
Ido Schimmel	14c32a46d9	vxlan: mdb: Factor out a helper for remote entry size calculation Currently, netlink notifications are sent for individual remote entries and not for the entire MDB entry itself. Subsequent patches are going to add MDB get support which will require the VXLAN driver to reply with an entire MDB entry. Therefore, as a preparation, factor out a helper to calculate the size of an individual remote entry. When determining the size of the reply this helper will be invoked for each remote entry in the MDB entry. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-27 10:51:41 +01:00
Ido Schimmel	ff97d2a956	vxlan: mdb: Adjust function arguments Adjust the function's arguments and rename it to allow it to be reused by future call sites that only have access to 'struct vxlan_mdb_entry_key', but not to 'struct vxlan_mdb_config'. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-27 10:51:41 +01:00
Jakub Kicinski	ea23fbd2a8	netlink: make range pointers in policies const struct nla_policy is usually constant itself, but unless we make the ranges inside constant we won't be able to make range structs const. The ranges are not modified by the core. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231025162204.132528-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-10-26 16:24:09 -07:00
Beniamino Galvani	2aceb896ee	vxlan: use generic function for tunnel IPv6 route lookup The route lookup can be done now via generic function udp_tunnel6_dst_lookup() to replace the custom implementation in vxlan6_get_route(). This is similar to what already done for IPv4 in commit `6f19b2c136` ("vxlan: use generic function for tunnel IPv4 route lookup"). Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-23 08:48:57 +01:00
Beniamino Galvani	6f19b2c136	vxlan: use generic function for tunnel IPv4 route lookup The route lookup can be done now via generic function udp_tunnel_dst_lookup() to replace the custom implementations in vxlan_get_route(). Note that this patch only touches IPv4, while IPv6 still uses vxlan6_get_route(). After IPv6 route lookup gets converted as well, vxlan_xmit_one() can be simplified by removing local variables that will be passed via "struct ip_tunnel_key", such as remote_ip, local_ip, flow_flags, label. Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Beniamino Galvani <b.galvani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-16 09:57:52 +01:00
Amit Cohen	2dcd22023c	vxlan: vxlan_core: Support FDB flushing by destination IP Add support for flush VXLAN FDB entries by destination IP. FDB entry is stored as {MAC, SRC_VNI} + remote. The destination IP is an attribute of the remote. For multicast entries, the VXLAN driver stores a linked list of remotes for a given key. In user space, each remote is represented as a separate entry, so when flush is sent with filter of 'destination IP', flush only the match remotes. In case that there are no additional remotes, destroy the entry. For example, the following are stored as one entry with several remotes: $ bridge fdb show dev vx10 00:00:00:00:00:00 dst 192.1.1.3 self permanent 00:00:00:00:00:00 dst 192.1.1.1 self permanent 00:00:00:00:00:00 dst 192.1.1.2 self permanent 00:00:00:00:00:00 dst 192.1.1.1 vni 1000 self permanent When user flush by destination IP x, only the relevant remotes will be flushed: $ bridge fdb flush dev vx10 dst 192.1.1.1 $ bridge fdb show dev vx10 00:00:00:00:00:00 dst 192.1.1.3 self permanent 00:00:00:00:00:00 dst 192.1.1.2 self permanent Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-13 10:00:32 +01:00
Amit Cohen	ac0db4ddd0	vxlan: vxlan_core: Support FDB flushing by destination port Add support for flush VXLAN FDB entries by destination port. FDB entry is stored as {MAC, SRC_VNI} + remote. The destination port is an attribute of the remote. For multicast entries, the VXLAN driver stores a linked list of remotes for a given key. In user space, each remote is represented as a separate entry, so when flush is sent with filter of 'destination port', flush only the match remotes. In case that there are no additional remotes, destroy the entry. For example, the following are stored as one entry with several remotes: $ bridge fdb show dev vx10 00:00:00:00:00:00 dst 192.1.1.1 port 1111 vni 2000 self permanent 00:00:00:00:00:00 dst 192.1.1.1 port 1111 vni 3000 self permanent 00:00:00:00:00:00 dst 192.1.1.1 port 2222 vni 2000 self permanent 00:00:00:00:00:00 dst 192.1.1.1 vni 3000 self permanent When user flush by port x, only the relevant remotes will be flushed: $ bridge fdb flush dev vx10 port 1111 $ bridge fdb show dev vx10 00:00:00:00:00:00 dst 192.1.1.1 port 2222 vni 2000 self permanent 00:00:00:00:00:00 dst 192.1.1.1 vni 3000 self permanent Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-13 10:00:31 +01:00
Amit Cohen	c499fccb71	vxlan: vxlan_core: Support FDB flushing by destination VNI Add support for flush VXLAN FDB entries by destination VNI. FDB entry is stored as {MAC, SRC_VNI} + remote. The destination VNI is an attribute of the remote. For multicast entries, the VXLAN driver stores a linked list of remotes for a given key. In user space, each remote is represented as a separate entry, so when flush is sent with filter of 'destination VNI', flush only the match remotes. In case that there are no additional remotes, destroy the entry. For example, the following are stored as one entry with several remotes: $ bridge fdb show dev vx10 00:00:00:00:00:00 dst 192.1.1.1 vni 3000 self permanent 00:00:00:00:00:00 dst 192.1.1.1 vni 4000 self permanent 00:00:00:00:00:00 dst 192.1.1.1 vni 2000 self permanent 00:00:00:00:00:00 dst 192.1.1.2 vni 2000 self permanent When user flush by VNI x, only the relevant remotes will be flushed: $ bridge fdb flush dev vx10 vni 2000 $ bridge fdb show dev vx10 00:00:00:00:00:00 dst 192.1.1.1 vni 3000 self permanent 00:00:00:00:00:00 dst 192.1.1.1 vni 4000 self permanent Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-13 10:00:31 +01:00
Amit Cohen	36c111233b	vxlan: vxlan_core: Support FDB flushing by nexthop ID Add support for flush VXLAN FDB entries by nexthop ID. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-13 10:00:31 +01:00
Amit Cohen	a0f89d5e68	vxlan: vxlan_core: Support FDB flushing by source VNI Add support for flush VXLAN FDB entries by source VNI. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-13 10:00:31 +01:00
Amit Cohen	d324eb9cec	vxlan: vxlan_core: Add support for FDB flush The merge commit `9271686937` ("Merge branch 'br-flush-filtering'") added support for FDB flushing in bridge driver only, the VXLAN driver does not support such flushing. Extend VXLAN driver to support FDB flushing. In this commit, add support for flushing with state and flags, which are the fields that supported in the bridge driver. Note that bridge driver supports 'NTF_USE' flag, but there is no point to support this flag for flushing as it is ignored when flags are stored. 'NTF_STICKY' is not relevant for VXLAN driver. 'NTF_ROUTER' is not supported in bridge driver for flush as it is not relevant for bridge, add it for VXLAN. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-13 10:00:31 +01:00
Amit Cohen	77b613efcc	vxlan: vxlan_core: Do not skip default entry in vxlan_flush() by default Currently, the function vxlan_flush() does not flush the default FDB entry (an entry with all_zeros_mac and default VNI), as it is deleted at vxlan_uninit(). When this function will be used for flushing FDB entries from user space, it will have to flush also the default entry in case that other parameters match (e.g., VNI, flags). Extend 'struct vxlan_fdb_flush_desc' to include an indication whether the default entry should be flushed or not. The default value (false) indicates to flush it, adjust all the existing callers to set '.ignore_default_entry' to true, so the current behavior will not be changed. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-13 10:00:30 +01:00
Amit Cohen	bfe36bf781	vxlan: vxlan_core: Make vxlan_flush() more generic for future use The function vxlan_flush() gets a boolean called 'do_all' and in case that it is false, it does not flush entries with state 'NUD_PERMANENT' or 'NUD_NOARP'. The following patches will add support for FDB flush with parameters from user space. Make the function more generic, so it can be used later. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-10-13 10:00:30 +01:00
Benjamin Poirier	4e4b1798cc	vxlan: Add missing entries to vxlan_get_size() There are some attributes added by vxlan_fill_info() which are not accounted for in vxlan_get_size(). Add them. I didn't find a way to trigger an actual problem from this miscalculation since there is usually extra space in netlink size calculations like if_nlmsg_size(); but maybe I just didn't search long enough. Fixes: `3511494ce2` ("vxlan: Group Policy extension") Fixes: `e1e5314de0` ("vxlan: implement GPE") Fixes: `0ace2ca89c` ("vxlan: Use checksum partial with remote checksum offload") Fixes: `f9c4bb0b24` ("vxlan: vni filtering support on collect metadata device") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-09-20 09:00:54 +01:00
Ido Schimmel	63c11dc2ca	vxlan: vnifilter: Use GFP_KERNEL instead of GFP_ATOMIC The function is not called from an atomic context so use GFP_KERNEL instead of GFP_ATOMIC. The allocation of the per-CPU stats is already performed with GFP_KERNEL. Tested using test_vxlan_vnifiltering.sh with CONFIG_DEBUG_ATOMIC_SLEEP. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230821141923.1889776-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-22 10:58:45 -07:00
Li Zetao	3c0930b491	vxlan: Use helper functions to update stats Use the helper functions dev_sw_netstats_rx_add() and dev_sw_netstats_tx_add() to update stats, which helps to provide code readability. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-14 08:06:24 +01:00
Jakub Kicinski	4d016ae42e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/intel/igc/igc_main.c `06b412589e` ("igc: Add lock to safeguard global Qbv variables") `d3750076d4` ("igc: Add TransmissionOverrun counter") drivers/net/ethernet/microsoft/mana/mana_en.c `a7dfeda6fd` ("net: mana: Fix MANA VF unload when hardware is unresponsive") `a9ca9f9cef` ("page_pool: split types and declarations from page_pool.h") `92272ec410` ("eth: add missing xdp.h includes in drivers") net/mptcp/protocol.h `511b90e392` ("mptcp: fix disconnect vs accept race") `b8dc6d6ce9` ("mptcp: fix rcv buffer auto-tuning") tools/testing/selftests/net/mptcp/mptcp_join.sh `c8c101ae39` ("selftests: mptcp: join: fix 'implicit EP' test") `03668c65d1` ("selftests: mptcp: join: rework detailed report") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-10 14:10:53 -07:00
Fedor Pchelkin	b1c936e9af	drivers: vxlan: vnifilter: free percpu vni stats on error path In case rhashtable_lookup_insert_fast() fails inside vxlan_vni_add(), the allocated percpu vni stats are not freed on the error path. Introduce vxlan_vni_free() which would work as a nice wrapper to free vxlan_vni_node resources properly. Found by Linux Verification Center (linuxtesting.org). Fixes: `4095e0e132` ("drivers: vxlan: vnifilter: per vni stats") Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-06 16:43:07 +01:00
Jakub Kicinski	014acf2668	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-07-27 15:22:46 -07:00
Jiri Benc	b0b672c4d0	vxlan: fix GRO with VXLAN-GPE In VXLAN-GPE, there may not be an Ethernet header following the VXLAN header. But in GRO, the vxlan driver calls eth_gro_receive unconditionally, which means the following header is incorrectly parsed as Ethernet. Introduce GPE specific GRO handling. For better performance, do not check for GPE during GRO but rather install a different set of functions at setup time. Fixes: `e1e5314de0` ("vxlan: implement GPE") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-24 10:47:09 +01:00
Jiri Benc	17a0a64448	vxlan: generalize vxlan_parse_gpe_hdr and remove unused args The vxlan_parse_gpe_hdr function extracts the next protocol value from the GPE header and marks GPE bits as parsed. In order to be used in the next patch, split the function into protocol extraction and bit marking. The bit marking is meaningful only in vxlan_rcv; move it directly there. Rename the function to vxlan_parse_gpe_proto to reflect what it now does. Remove unused arguments skb and vxflags. Move the function earlier in the file to allow it to be called from more places in the next patch. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-24 10:47:09 +01:00
Jiri Benc	94d166c531	vxlan: calculate correct header length for GPE VXLAN-GPE does not add an extra inner Ethernet header. Take that into account when calculating header length. This causes problems in skb_tunnel_check_pmtu, where incorrect PMTU is cached. In the collect_md mode (which is the only mode that VXLAN-GPE supports), there's no magic auto-setting of the tunnel interface MTU. It can't be, since the destination and thus the underlying interface may be different for each packet. So, the administrator is responsible for setting the correct tunnel interface MTU. Apparently, the administrators are capable enough to calculate that the maximum MTU for VXLAN-GPE is (their_lower_MTU - 36). They set the tunnel interface MTU to 1464. If you run a TCP stream over such interface, it's then segmented according to the MTU 1464, i.e. producing 1514 bytes frames. Which is okay, this still fits the lower MTU. However, skb_tunnel_check_pmtu (called from vxlan_xmit_one) uses 50 as the header size and thus incorrectly calculates the frame size to be 1528. This leads to ICMP too big message being generated (locally), PMTU of 1450 to be cached and the TCP stream to be resegmented. The fix is to use the correct actual header size, especially for skb_tunnel_check_pmtu calculation. Fixes: `e1e5314de0` ("vxlan: implement GPE") Signed-off-by: Jiri Benc <jbenc@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-24 09:37:32 +01:00
Ido Schimmel	d977e1c8e3	vxlan: Add support for nexthop ID metadata VXLAN FDB entries can point to FDB nexthop objects. Each such object includes the IP address(es) of remote VTEP(s) via which the target host is accessible. Example: # ip nexthop add id 1 via 192.0.2.1 fdb # ip nexthop add id 2 via 192.0.2.17 fdb # ip nexthop add id 1000 group 1/2 fdb # bridge fdb add 00:11:22:33:44:55 dev vx0 self static nhid 1000 src_vni 10020 This is useful for EVPN multihoming where a single host can be connected to multiple VTEPs. The source VTEP will calculate the flow hash of the skb and forward it towards the IP address of one of the VTEPs member in the nexthop group. There are cases where an external entity (e.g., the bridge driver) can provide not only the tunnel ID (i.e., VNI) of the skb, but also the ID of the nexthop object via which the skb should be forwarded. Therefore, in order to support such cases, when the VXLAN device is in external / collect metadata mode and the tunnel info attached to the skb is of bridge type, extract the nexthop ID from the tunnel info. If the ID is valid (i.e., non-zero), forward the skb via the nexthop object associated with the ID, as if the skb hit an FDB entry associated with this ID. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-07-19 10:53:48 +01:00
Vladimir Nikishkin	69474a8a58	net: vxlan: Add nolocalbypass option to vxlan. If a packet needs to be encapsulated towards a local destination IP, the packet will undergo a "local bypass" and be injected into the Rx path as if it was received by the target VXLAN device without undergoing encapsulation. If such a device does not exist, the packet will be dropped. There are scenarios where we do not want to perform such a bypass, but instead want the packet to be encapsulated and locally received by a user space program for post-processing. To that end, add a new VXLAN device attribute that controls whether a "local bypass" is performed or not. Default to performing a bypass to maintain existing behavior. Signed-off-by: Vladimir Nikishkin <vladimir@nikishkin.pw> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-05-13 17:02:33 +01:00
Gavin Li	c641e9279f	vxlan: Expose helper vxlan_build_gbp_hdr The function vxlan_build_gbp_hdr will be used by other modules to build gbp option in vxlan header according to gbp flags. Signed-off-by: Gavin Li <gavinl@nvidia.com> Reviewed-by: Gavi Teitz <gavi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-17 22:41:16 -07:00
Gavin Li	df5e87f16c	vxlan: Remove unused argument from vxlan_build_gbp_hdr( ) and vxlan_build_gpe_hdr( ) Remove unused argument (i.e. u32 vxflags) in vxlan_build_gbp_hdr( ) and vxlan_build_gpe_hdr( ) function arguments. Signed-off-by: Gavin Li <gavinl@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-17 22:41:16 -07:00
Ido Schimmel	08f876a7d7	vxlan: Enable MDB support Now that the VXLAN MDB control and data paths are in place we can expose the VXLAN MDB functionality to user space. Set the VXLAN MDB net device operations to the appropriate functions, thereby allowing the rtnetlink code to reach the VXLAN driver. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-03-17 08:05:50 +00:00
Ido Schimmel	0f83e69f44	vxlan: Add MDB data path support Integrate MDB support into the Tx path of the VXLAN driver, allowing it to selectively forward IP multicast traffic according to the matched MDB entry. If MDB entries are configured (i.e., 'VXLAN_F_MDB' is set) and the packet is an IP multicast packet, perform up to three different lookups according to the following priority: 1. For an (S, G) entry, using {Source VNI, Source IP, Destination IP}. 2. For a (, G) entry, using {Source VNI, Destination IP}. 3. For the catchall MDB entry (0.0.0.0 or ::), using the source VNI. The catchall MDB entry is similar to the catchall FDB entry (00:00:00:00:00:00) that is currently used to transmit BUM (broadcast, unknown unicast and multicast) traffic. However, unlike the catchall FDB entry, this entry is only used to transmit unregistered IP multicast traffic that is not link-local. Therefore, when configured, the catchall FDB entry will only transmit BULL (broadcast, unknown unicast, link-local multicast) traffic. The catchall MDB entry is useful in deployments where inter-subnet multicast forwarding is used and not all the VTEPs in a tenant domain are members in all the broadcast domains. In such deployments it is advantageous to transmit BULL (broadcast, unknown unicast and link-local multicast) and unregistered IP multicast traffic on different tunnels. If the same tunnel was used, a VTEP only interested in IP multicast traffic would also pull all the BULL traffic and drop it as it is not a member in the originating broadcast domain [1]. If the packet did not match an MDB entry (or if the packet is not an IP multicast packet), return it to the Tx path, allowing it to be forwarded according to the FDB. If the packet did match an MDB entry, forward it to the associated remote VTEPs. However, if the entry is a (, G) entry and the associated remote is in INCLUDE mode, then skip over it as the source IP is not in its source list (otherwise the packet would have matched on an (S, G) entry). Similarly, if the associated remote is marked as BLOCKED (can only be set on (S, G) entries), then skip over it as well as the remote is in EXCLUDE mode and the source IP is in its source list. [1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-2.6 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-03-17 08:05:50 +00:00
Ido Schimmel	bc6c6b013f	vxlan: mdb: Add an internal flag to indicate MDB usage Add an internal flag to indicate whether MDB entries are configured or not. Set the flag after installing the first MDB entry and clear it before deleting the last one. The flag will be consulted by the data path which will only perform an MDB lookup if the flag is set, thereby keeping the MDB overhead to a minimum when the MDB is not used. Another option would have been to use a static key, but it is global and not per-device, unlike the current approach. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-03-17 08:05:49 +00:00
Ido Schimmel	a3a48de5ea	vxlan: mdb: Add MDB control path support Implement MDB control path support, enabling the creation, deletion, replacement and dumping of MDB entries in a similar fashion to the bridge driver. Unlike the bridge driver, each entry stores a list of remote VTEPs to which matched packets need to be replicated to and not a list of bridge ports. The motivating use case is the installation of MDB entries by a user space control plane in response to received EVPN routes. As such, only allow permanent MDB entries to be installed and do not implement snooping functionality, avoiding a lot of unnecessary complexity. Since entries can only be modified by user space under RTNL, use RTNL as the write lock. Use RCU to ensure that MDB entries and remotes are not freed while being accessed from the data path during transmission. In terms of uAPI, reuse the existing MDB netlink interface, but add a few new attributes to request and response messages: * IP address of the destination VXLAN tunnel endpoint where the multicast receivers reside. * UDP destination port number to use to connect to the remote VXLAN tunnel endpoint. * VXLAN VNI Network Identifier to use to connect to the remote VXLAN tunnel endpoint. Required when Ingress Replication (IR) is used and the remote VTEP is not a member of originating broadcast domain (VLAN/VNI) [1]. * Source VNI Network Identifier the MDB entry belongs to. Used only when the VXLAN device is in external mode. * Interface index of the outgoing interface to reach the remote VXLAN tunnel endpoint. This is required when the underlay destination IP is multicast (P2MP), as the multicast routing tables are not consulted. All the new attributes are added under the 'MDBA_SET_ENTRY_ATTRS' nest which is strictly validated by the bridge driver, thereby automatically rejecting the new attributes. [1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-3.2.2 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-03-17 08:05:49 +00:00

1 2

81 commits