Functions supplied in ip6_udp_tunnel.c are only needed when IPV6 is
selected. When IPV6 is not selected, those functions are stubbed out
in udp_tunnel.h.
==================================================================
net/ipv6/ip6_udp_tunnel.c:15:5: error: redefinition of 'udp_sock_create6'
int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
In file included from net/ipv6/ip6_udp_tunnel.c:9:0:
include/net/udp_tunnel.h:36:19: note: previous definition of 'udp_sock_create6' was here
static inline int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
==================================================================
Fixes: fd384412e udp_tunnel: Seperate ipv6 functions into its own file
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added netlink handling of IP tunnel encapulation paramters, properly
adjust MTU for encapsulation. Added ip_tunnel_encap call to
ipip6_tunnel_xmit to actually perform FOU encapsulation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Want to be able to use these in foo-over-udp offloads, etc.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added a few more UDP tunnel APIs that can be shared by UDP based
tunnel protocol implementation. The main ones are highlighted below.
setup_udp_tunnel_sock() configures UDP listener socket for
receiving UDP encapsulated packets.
udp_tunnel_xmit_skb() and upd_tunnel6_xmit_skb() transmit skb
using UDP encapsulation.
udp_tunnel_sock_release() closes the UDP tunnel listener socket.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ip6_udp_tunnel.c for ipv6 UDP tunnel functions to avoid ifdefs
in udp_tunnel.c
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Input path of TCP do not currently uses TCP_SKB_CB(skb)->tcp_flags,
which is only used in output path.
tcp_recvmsg(), looks at tcp_hdr(skb)->syn for every skb found in receive queue,
and its unfortunate because this bit is located in a cache line right before
the payload.
We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags.
This patch does so, and avoids the cache line miss in tcp_recvmsg()
Following patches will
- allow a segment with FIN being coalesced in tcp_try_coalesce()
- simplify tcp_collapse() by not copying the headers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If IPv6 is explicitly disabled before the interface comes up,
it makes no sense to continue when it comes up, even just
print a message.
(I am not sure about other cases though, so I prefer not to touch)
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor out allocation and initialization and make
the refcount code more readable.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly the code is already protected by rtnl lock.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly the code is already protected by rtnl lock.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor out allocation and initialization and make
the refcount code more readable.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it accept inet6_dev, and rename it to __ipv6_dev_ac_inc()
to reflect this change.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just move rtnl lock up, so that the anycast list can be protected
by rtnl lock now.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These code is now protected by rtnl lock, rcu read lock
is useless now.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2abb7cdc0d ("udp: Add support for doing checksum unnecessary
conversion") caused napi_gro_cb structs with the "flush" field zero to
take the "udp_gro_receive" path rather than the "set flush to 1" path
that they would previously take. As a result I saw booting from an NFS
root hang shortly after starting userspace, with "server not
responding" messages.
This change to the handling of "flush == 0" packets appears to be
incidental to the goal of adding new code in the case where
skb_gro_checksum_validate_zero_check() returns zero. Based on that and
the fact that it breaks things, I'm assuming that it is unintentional.
Fixes: 2abb7cdc0d ("udp: Add support for doing checksum unnecessary conversion")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
nf-next pull request
The following patchset contains Netfilter/IPVS updates for your
net-next tree. Regarding nf_tables, most updates focus on consolidating
the NAT infrastructure and adding support for masquerading. More
specifically, they are:
1) use __u8 instead of u_int8_t in arptables header, from
Mike Frysinger.
2) Add support to match by skb->pkttype to the meta expression, from
Ana Rey.
3) Add support to match by cpu to the meta expression, also from
Ana Rey.
4) A smatch warning about IPSET_ATTR_MARKMASK validation, patch from
Vytas Dauksa.
5) Fix netnet and netportnet hash types the range support for IPv4,
from Sergey Popovich.
6) Fix missing-field-initializer warnings resolved, from Mark Rustad.
7) Dan Carperter reported possible integer overflows in ipset, from
Jozsef Kadlecsick.
8) Filter out accounting objects in nfacct by type, so you can
selectively reset quotas, from Alexey Perevalov.
9) Move specific NAT IPv4 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
10) Use the new NAT IPv4 functions from nft_chain_nat_ipv4.
11) Move specific NAT IPv6 functions to the core so x_tables and
nf_tables can share the same NAT IPv4 engine.
12) Use the new NAT IPv6 functions from nft_chain_nat_ipv6.
13) Refactor code to add nft_delrule(), which can be reused in the
enhancement of the NFT_MSG_DELTABLE to remove a table and its
content, from Arturo Borrero.
14) Add a helper function to unregister chain hooks, from
Arturo Borrero.
15) A cleanup to rename to nft_delrule_by_chain for consistency with
the new nft_*() functions, also from Arturo.
16) Add support to match devgroup to the meta expression, from Ana Rey.
17) Reduce stack usage for IPVS socket option, from Julian Anastasov.
18) Remove unnecessary textsearch state initialization in xt_string,
from Bojan Prtvar.
19) Add several helper functions to nf_tables, more work to prepare
the enhancement of NFT_MSG_DELTABLE, again from Arturo Borrero.
20) Enhance NFT_MSG_DELTABLE to delete a table and its content, from
Arturo Borrero.
21) Support NAT flags in the nat expression to indicate the flavour,
eg. random fully, from Arturo.
22) Add missing audit code to ebtables when replacing tables, from
Nicolas Dichtel.
23) Generalize the IPv4 masquerading code to allow its re-use from
nf_tables, from Arturo.
24) Generalize the IPv6 masquerading code, also from Arturo.
25) Add the new masq expression to support IPv4/IPv6 masquerading
from nf_tables, also from Arturo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ipv6_gro_receive and ipv6_gro_complete to sit_offload to
support GRO.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In TCP gro we check flush_id which is derived from the IP identifier.
In IPv4 gro path the flush_id is set with the expectation that every
matched packet increments IP identifier. In IPv6, the flush_id is
never set and thus is uinitialized. What's worse is that in IPv6
over IPv4 encapsulation, the IP identifier is taken from the outer
header which is currently not incremented on every packet for Linux
stack, so GRO in this case never matches packets (identifier is
not increasing).
This patch clears flush_id for every time for a matched packet in
IPv6 gro_receive. We need to do this each time to overwrite the
setting that would be done in IPv4 gro_receive per the outer
header in IPv6 over Ipv4 encapsulation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv6/udp_offload.c:159:5: warning: symbol 'udp6_gro_complete' was
not declared. Should it be static?
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 57c67ff4bd ("udp: additional GRO support")
Cc: Tom Herbert <therbert@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not used anywhere, so just remove these.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck reported high false sharing on dst refcount in tcp stack
when prequeue is used. prequeue is the mechanism used when a thread is
blocked in recvmsg()/read() on a TCP socket, using a blocking model
rather than select()/poll()/epoll() non blocking one.
We already try to use RCU in input path as much as possible, but we were
forced to take a refcount on the dst when skb escaped RCU protected
region. When/if the user thread runs on different cpu, dst_release()
will then touch dst refcount again.
Commit 093162553c (tcp: force a dst refcount when prequeue packet)
was an example of a race fix.
It turns out the only remaining usage of skb->dst for a packet stored
in a TCP socket prequeue is IP early demux.
We can add a logic to detect when IP early demux is probably going
to use skb->dst. Because we do an optimistic check rather than duplicate
existing logic, we need to guard inet_sk_rx_dst_set() and
inet6_sk_rx_dst_set() from using a NULL dst.
Many thanks to Alexander for providing a nice bug report, git bisection,
and reproducer.
Tested using Alexander script on a 40Gb NIC, 8 RX queues.
Hosts have 24 cores, 48 hyper threads.
echo 0 >/proc/sys/net/ipv4/tcp_autocorking
for i in `seq 0 47`
do
for j in `seq 0 2`
do
netperf -H $DEST -t TCP_STREAM -l 1000 \
-c -C -T $i,$i -P 0 -- \
-m 64 -s 64K -D &
done
done
Before patch : ~6Mpps and ~95% cpu usage on receiver
After patch : ~9Mpps and ~35% cpu usage on receiver.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net.ipv4.ip_nonlocal_bind sysctl was global to all network
namespaces. This patch allows to set a different value for each
network namespace.
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nft_masq expression is intended to perform NAT in the masquerade flavour.
We decided to have the masquerade functionality in a separated expression other
than nft_nat.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Let's refactor the code so we can reach the masquerade functionality
from outside the xt context (ie. nftables).
The patch includes the addition of an atomic counter to the masquerade
notifier: the stuff to be done by the notifier is the same for xt and
nftables. Therefore, only one notification handler is needed.
This factorization only involves IPv6; a similar patch exists to
handle IPv4.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the exported IPv6 NAT functions that are provided by the core. This
removes duplicated code so iptables and nft use the same NAT codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move the specific NAT IPv6 core functions that are called from the
hooks from ip6table_nat.c to nf_nat_l3proto_ipv6.c. This prepares the
ground to allow iptables and nft to use the same NAT engine code that
comes in a follow up patch.
This also renames nf_nat_ipv6_fn to nft_nat_ipv6_fn in
net/ipv6/netfilter/nft_chain_nat_ipv6.c to avoid a compilation breakage.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It is possible that the interface is already gone after joining
the list of anycast on this interface as we don't hold a refcount
for the device, in this case we are safe to ignore the error.
What's more important, for API compatibility we should not
change this behavior for applications even if it were correct.
Fixes: commit a9ed4a2986 ("ipv6: fix rtnl locking in setsockopt for anycast and multicast")
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP_SKB_CB(skb)->when has different meaning in output and input paths.
In output path, it contains a timestamp.
In input path, it contains an ISN, chosen by tcp_timewait_state_process()
Lets add a different name to ease code comprehension.
Note that 'when' field will disappear in following patch,
as skb_mstamp already contains timestamp, the anonymous
union will promptly disappear as well.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
addrconf_get_prefix_route() ensures to get the right route in the right table.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason to take a refcnt before deleting the peer address route.
It's done some lines below for the local prefix route because
inet6_ifa_finish_destroy() will release it at the end.
For the peer address route, we want to free it right now.
This bug has been introduced by commit
caeaba7900 ("ipv6: add support of peer address").
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling setsockopt with IPV6_JOIN_ANYCAST or IPV6_LEAVE_ANYCAST
triggers the assertion in addrconf_join_solict()/addrconf_leave_solict()
ipv6_sock_ac_join(), ipv6_sock_ac_drop(), ipv6_sock_ac_close() need to
take RTNL before calling ipv6_dev_ac_inc/dec. Same thing with
ipv6_sock_mc_join(), ipv6_sock_mc_drop(), ipv6_sock_mc_close() before
calling ipv6_dev_mc_inc/dec.
This patch moves ASSERT_RTNL() up a level in the call stack.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new sysctl_mld_qrv knob to configure the mldv1/v2 query
robustness variable. It specifies how many retransmit of unsolicited mld
retransmit should happen. Admins might want to tune this on lossy links.
Also reset mld state on interface down/up, so we pick up new sysctl
settings during interface up event.
IPv6 certification requests this knob to be available.
I didn't make this knob netns specific, as it is mostly a setting in a
physical environment and should be per host.
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
make defconfig reports:
warning: (NETFILTER_XT_TARGET_LOG) selects NF_LOG_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NETFILTER_ADVANCED)
Fixes: d79a61d netfilter: NETFILTER_XT_TARGET_LOG selects NF_LOG_*
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
pull request: Netfilter/IPVS fixes for net
The following patchset contains seven Netfilter fixes for your net
tree, they are:
1) Make the NAT infrastructure independent of x_tables, some users are
already starting to test nf_tables with NAT without enabling x_tables.
Without this patch for Kconfig, there's a superfluous dependency
between NAT and x_tables.
2) Allow to use 0 in the cgroup match, the kernel rejects with -EINVAL
with no good reason. From Daniel Borkmann.
3) Select CONFIG_NF_NAT from the nf_tables NAT expression, this also
resolves another NAT dependency with x_tables.
4) Use HAVE_JUMP_LABEL instead of CONFIG_JUMP_LABEL in the Netfilter hook
code as elsewhere in the kernel to resolve toolchain problems, from
Zhouyi Zhou.
5) Use iptunnel_handle_offloads() to set up tunnel encapsulation
depending on the offload capabilities, reported by Alex Gartrell
patch from Julian Anastasov.
6) Fix wrong family when registering the ip_vs_local_reply6() hook,
also from Julian.
7) Select the NF_LOG_* symbols from NETFILTER_XT_TARGET_LOG. Rafał
Miłecki reported that when jumping from 3.16 to 3.17-rc, his log
target is not selected anymore due to changes in the previous
development cycle to accomodate the full logging support for
nf_tables.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_error_queue is dequeued in four locations. All share the
exact same logic. Deduplicate.
Also collapse the two critical sections for dequeue (at the top of
the recv handler) and signal (at the bottom).
This moves signal generation for the next packet forward, which should
be harmless.
It also changes the behavior if the recv handler exits early with an
error. Previously, a signal for follow-up packets on the errqueue
would then not be scheduled. The new behavior, to always signal, is
arguably a bug fix.
For rxrpc, the change causes the same function to be called repeatedly
for each queued packet (because the recv handler == sk_error_report).
It is likely that all packets will fail for the same reason (e.g.,
memory exhaustion).
This code runs without sk_lock held, so it is not safe to trust that
sk->sk_err is immutable inbetween releasing q->lock and the subsequent
test. Introduce int err just to avoid this potential race.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for doing CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE
conversion in UDP tunneling path.
In the normal UDP path, we call skb_checksum_try_convert after locating
the UDP socket. The check is that checksum conversion is enabled for
the socket (new flag in UDP socket) and that checksum field is
non-zero.
In the UDP GRO path, we call skb_gro_checksum_try_convert after
checksum is validated and checksum field is non-zero. Since this is
already in GRO we assume that checksum conversion is always wanted.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.
Both objdump and diff -w show no differences.
This patch removes some blank lines between the end of a function
definition and the EXPORT_SYMBOL_GPL macro in order to prevent
checkpatch warning that EXPORT_SYMBOL must immediately follow
a function.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.
Both objdump and diff -w show no differences.
This patch addresses structure definitions, specifically it cleanses the brace
placement and replaces spaces with tabs in a few places.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes no changes to the logic of the code but simply addresses
coding style issues as detected by checkpatch.
Both objdump and diff -w show no differences.
A number of items are addressed in this patch:
* Multiple spaces converted to tabs
* Spaces before tabs removed.
* Spaces in pointer typing cleansed (char *)foo etc.
* Remove space after sizeof
* Ensure spacing around comparators such as if statements.
Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement GRO for UDPv6. Add UDP checksum verification in gro_receive
for both UDP4 and UDP6 calling skb_gro_checksum_validate_zero_check.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp[64]_gro_receive call skb_gro_checksum_validate to validate TCP
checksum in the gro context.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the NAT configs depend on iptables and ip6tables. However,
users should be capable of enabling NAT for nft without having to
switch on iptables.
Fix this by adding new specific IP_NF_NAT and IP6_NF_NAT config
switches for iptables and ip6tables NAT support. I have also moved
the original NF_NAT_IPV4 and NF_NAT_IPV6 configs out of the scope
of iptables to make them independent of it.
This patch also adds NETFILTER_XT_NAT which selects the xt_nat
combo that provides snat/dnat for iptables. We cannot use NF_NAT
anymore since nf_tables can select this.
Reported-by: Matteo Croce <technoboy85@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make sure we use the correct address-family-specific function for
handling MTU reductions from within tcp_release_cb().
Previously AF_INET6 sockets were incorrectly always using the IPv6
code path when sometimes they were handling IPv4 traffic and thus had
an IPv4 dst.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Fixes: 563d34d057 ("tcp: dont drop MTU reduction indications")
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As of 4fddbf5d78 ("sit: strictly restrict incoming traffic to tunnel link device"),
when looking up a tunnel, tunnel's underlying interface (t->parms.link)
is verified to match incoming traffic's ingress device.
However the comparison was incorrectly based on skb->dev->iflink.
Instead, dev->ifindex should be used, which correctly represents the
interface from which the IP stack hands the ipip6 packets.
This allows setting up sit tunnels bound to vlan interfaces (otherwise
incoming ipip6 traffic on the vlan interface was dropped due to
ipip6_tunnel_lookup match failure).
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>