Pull networking fixes from David Miller:
1) Don't use MMIO on certain iwlwifi devices otherwise we get a
firmware crash.
2) Don't corrupt the GRO lists of mac80211 contexts by doing sends via
timer interrupt, from Johannes Berg.
3) SKB tailroom is miscalculated in AP_VLAN crypto code, from Michal
Kazior.
4) Fix fw_status memory leak in iwlwifi, from Haim Dreyfuss.
5) Fix use after free in iwl_mvm_d0i3_enable_tx(), from Eliad Peller.
6) JIT'ing of large BPF programs is broken on x86, from Alexei
Starovoitov.
7) EMAC driver ethtool register dump size is miscalculated, from Ivan
Mikhaylov.
8) Fix PHY initial link mode when autonegotiation is disabled in
amd-xgbe, from Tom Lendacky.
9) Fix NULL deref on SOCK_DEAD socket in AF_UNIX and CAIF protocols,
from Mark Salyzyn.
10) credit_bytes not initialized properly in xen-netback, from Ross
Lagerwall.
11) Fallback from MSI-X to INTx interrupts not handled properly in mlx4
driver, fix from Benjamin Poirier.
12) Perform ->attach() after binding dev->qdisc in packet scheduler,
otherwise we can crash. From Cong WANG.
13) Don't clobber data in sctp_v4_map_v6(). From Jason Gunthorpe.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
sctp: Fix mangled IPv4 addresses on a IPv6 listening socket
net_sched: invoke ->attach() after setting dev->qdisc
xen-netfront: properly destroy queues when removing device
mlx4_core: Fix fallback from MSI-X to INTx
xen/netback: Properly initialize credit_bytes
net: netxen: correct sysfs bin attribute return code
tools: bpf_jit_disasm: fix segfault on disabled debugging log output
unix/caif: sk_socket can disappear when state is unlocked
amd-xgbe-phy: Fix initial mode when autoneg is disabled
net: dp83640: fix improper double spin locking.
net: dp83640: reinforce locking rules.
net: dp83640: fix broken calibration routine.
net: stmmac: create one debugfs dir per net-device
net/ibm/emac: fix size of emac dump memory areas
x86: bpf_jit: fix compilation of large bpf programs
net: phy: bcm7xxx: Fix 7425 PHY ID and flags
iwlwifi: mvm: avoid use-after-free on iwl_mvm_d0i3_enable_tx()
iwlwifi: mvm: clean net-detect info if device was reset during suspend
iwlwifi: mvm: take the UCODE_DOWN reference when resuming
iwlwifi: mvm: BT Coex - duplicate the command if sent ASYNC
...
After commit 07f4c90062 ("tcp/dccp: try to not exhaust
ip_local_port_range in connect()") it is advised to have an even number
of ports described in /proc/sys/net/ipv4/ip_local_port_range
This means start/end values should have a different parity.
Let's warn sysadmins of this, so that they can update their settings
if they want to.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__inet_hash_connect() does not use its third argument (port_offset)
if socket was already bound to a source port.
No need to perform useless but expensive md5 computations.
Reported-by: Crestez Dan Leonard <cdleonard@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For mq qdisc, we add per tx queue qdisc to root qdisc
for display purpose, however, that happens too early,
before the new dev->qdisc is finally set, this causes
q->list points to an old root qdisc which is going to be
freed right before assigning with a new one.
Fix this by moving ->attach() after setting dev->qdisc.
For the record, this fixes the following crash:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b
CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756
ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20
ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000
Call Trace:
[<ffffffff81a44e7f>] dump_stack+0x4c/0x65
[<ffffffff810790da>] warn_slowpath_common+0x9c/0xb6
[<ffffffff814e725b>] ? __list_del_entry+0x5a/0x98
[<ffffffff81079162>] warn_slowpath_fmt+0x46/0x48
[<ffffffff81820eb0>] ? dev_graft_qdisc+0x5e/0x6a
[<ffffffff814e725b>] __list_del_entry+0x5a/0x98
[<ffffffff814e72a7>] list_del+0xe/0x2d
[<ffffffff81822f05>] qdisc_list_del+0x1e/0x20
[<ffffffff81820cd1>] qdisc_destroy+0x30/0xd6
[<ffffffff81822676>] qdisc_graft+0x11d/0x243
[<ffffffff818233c1>] tc_get_qdisc+0x1a6/0x1d4
[<ffffffff810b5eaf>] ? mark_lock+0x2e/0x226
[<ffffffff817ff8f5>] rtnetlink_rcv_msg+0x181/0x194
[<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19
[<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19
[<ffffffff817ff774>] ? __rtnl_unlock+0x17/0x17
[<ffffffff81855dc6>] netlink_rcv_skb+0x4d/0x93
[<ffffffff817ff756>] rtnetlink_rcv+0x26/0x2d
[<ffffffff818544b2>] netlink_unicast+0xcb/0x150
[<ffffffff81161db9>] ? might_fault+0x59/0xa9
[<ffffffff81854f78>] netlink_sendmsg+0x4fa/0x51c
[<ffffffff817d6e09>] sock_sendmsg_nosec+0x12/0x1d
[<ffffffff817d8967>] sock_sendmsg+0x29/0x2e
[<ffffffff817d8cf3>] ___sys_sendmsg+0x1b4/0x23a
[<ffffffff8100a1b8>] ? native_sched_clock+0x35/0x37
[<ffffffff810a1d83>] ? sched_clock_local+0x12/0x72
[<ffffffff810a1fd4>] ? sched_clock_cpu+0x9e/0xb7
[<ffffffff810def2a>] ? current_kernel_time+0xe/0x32
[<ffffffff810b4bc5>] ? lock_release_holdtime.part.29+0x71/0x7f
[<ffffffff810ddebf>] ? read_seqcount_begin.constprop.27+0x5f/0x76
[<ffffffff810b6292>] ? trace_hardirqs_on_caller+0x17d/0x199
[<ffffffff811b14d5>] ? __fget_light+0x50/0x78
[<ffffffff817d9808>] __sys_sendmsg+0x42/0x60
[<ffffffff817d9838>] SyS_sendmsg+0x12/0x1c
[<ffffffff81a50e97>] system_call_fastpath+0x12/0x6f
---[ end trace ef29d3fb28e97ae7 ]---
For long term, we probably need to clean up the qdisc_graft() code
in case it hides other bugs like this.
Fixes: 95dc19299f ("pkt_sched: give visibility to mq slave qdiscs")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A long standing problem on busy servers is the tiny available TCP port
range (/proc/sys/net/ipv4/ip_local_port_range) and the default
sequential allocation of source ports in connect() system call.
If a host is having a lot of active TCP sessions, chances are
very high that all ports are in use by at least one flow,
and subsequent bind(0) attempts fail, or have to scan a big portion of
space to find a slot.
In this patch, I changed the starting point in __inet_hash_connect()
so that we try to favor even [1] ports, leaving odd ports for bind()
users.
We still perform a sequential search, so there is no guarantee, but
if connect() targets are very different, end result is we leave
more ports available to bind(), and we spread them all over the range,
lowering time for both connect() and bind() to find a slot.
This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
is even, ie if start/end values have different parity.
Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
32768 - 60999 (instead of 32768 - 61000)
There is no change on security aspects here, only some poor hashing
schemes could be eventually impacted by this change.
[1] : The odd/even property depends on ip_local_port_range values parity
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for setting the current cca ed level value over
nl802154.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds information about the current cca ed level when the phy
is dumped over nl802154.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We currently always send fragments without DF bit set.
Thus, given following setup:
mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
A R1 R2 B
Where R1 and R2 run linux with netfilter defragmentation/conntrack
enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
will respond with icmp too big error if one of these fragments exceeded
1400 bytes.
However, if R1 receives fragment sizes 1200 and 100, it would
forward the reassembled packet without refragmenting, i.e.
R2 will send an icmp error in response to a packet that was never sent,
citing mtu that the original sender never exceeded.
The other minor issue is that a refragmentation on R1 will conceal the
MTU of R2-B since refragmentation does not set DF bit on the fragments.
This modifies ip_fragment so that we track largest fragment size seen
both for DF and non-DF packets, and set frag_max_size to the largest
value.
If the DF fragment size is larger or equal to the non-df one, we will
consider the packet a path mtu probe:
We set DF bit on the reassembled skb and also tag it with a new IPCB flag
to force refragmentation even if skb fits outdev mtu.
We will also set DF bit on each fragment in this case.
Joint work with Hannes Frederic Sowa.
Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_skb_dst_mtu is small inline helper, but its called in several places.
before: 17061 44 0 17105 42d1 net/ipv4/ip_output.o
after: 16805 44 0 16849 41d1 net/ipv4/ip_output.o
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds transmission power setting support for IEEE-802.15.4
devices via nl802154.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We used to get this indirectly I supposed, but no longer do.
Either way, an explicit include should have been done in the
first place.
net/ipv4/fib_trie.c: In function '__node_free_rcu':
>> net/ipv4/fib_trie.c:293:3: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
vfree(n);
^
net/ipv4/fib_trie.c: In function 'tnode_alloc':
>> net/ipv4/fib_trie.c:312:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
return vzalloc(size);
^
>> net/ipv4/fib_trie.c:312:3: warning: return makes pointer from integer without a cast
cc1: some warnings being treated as errors
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By making sure sk->sk_gso_max_segs minimal value is one,
and sysctl_tcp_min_tso_segs minimal value is one as well,
tcp_tso_autosize() will return a non zero value.
We can then revert 843925f33f
("tcp: Do not apply TSO segment limit to non-TSO packets")
and save few cpu cycles in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
got a rare NULL pointer dereference in clear_bit
Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
----
v2: switch to sock_flag(sk, SOCK_DEAD) and added net/caif/caif_socket.c
v3: return -ECONNRESET in upstream caller of wait function for SOCK_DEAD
Signed-off-by: David S. Miller <davem@davemloft.net>
If tcp ehash table is constrained to a very small number of buckets
(eg boot parameter thash_entries=128), then we can crash if spinlock
array has more entries.
While we are at it, un-inline inet_ehash_locks_alloc() and make
following changes :
- Budget 2 cache lines per cpu worth of 'spinlocks'
- Try to kmalloc() the array to avoid extra TLB pressure.
(Most servers at Google allocate 8192 bytes for this hash table)
- Get rid of various #ifdef
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit dd3f9e70f5
("tipc: add packet sequence number at instant of transmission") we
made a change with the consequence that packets in the link backlog
queue don't contain valid sequence numbers.
However, when we create a link protocol message, we still use the
sequence number of the first packet in the backlog, if there is any,
as "next_sent" indicator in the message. This may entail unnecessary
retransissions or stale packet transmission when there is very low
traffic on the link.
This commit fixes this issue by only using the current value of
tipc_link::snd_nxt as indicator.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* AP_VLAN tailroom calculation fix, the bug leads to warnings
along with dropped packets
* NAPI context issue, calling napi_gro_receive() from a timer
(obviously) can lead to crashes
* remain-on-channel combining leads to dropped requests and not
being able to finish certain operations, so remove it
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVZB7ZAAoJEDBSmw7B7bqr/qkQAKW5RHFfo8TtjUB7pl/iWqc3
/gyym9tisy4hc7OOr8BDTQdUvqzY4fRMhuTAvjPLXCqdV2Isz1IetiogiJRMglNs
3v83QkmEI8vMQ3lP6Y+2Hnz9tw1zNaVXHGwIKPjk9YLrzsBV0AJoGqn8qo4OBfWl
JjkHLM6/0PVDy5UDF95cRyM6+L0XJdPdVS/YRLslp5Tda8fgTbH+dMwLnzjQZbIu
ZanFpAuRxx35g/Zg6vAsRhlva/zrucphteaiJGAa6a3NgH9Z4tDlGHRveHQOgNYt
xHKNcvOgegaFNcEY8ftKMcQ/RIVJjxXr6nPYnQyFnG0aAxYePysNx8TJaUX2dxq7
+O1RlYHwJpRRUmScSsDFDa/CmQcpUxgloUfCmkWS611g1LZFnpVOcXEZRJg/c9lm
hO6mH19OYlDiWeE3ZhKeYJNxmpWvPM4bxhswHcYfLG+vA93kLTYQ/xGi/0YfMKl+
+UCTPbdiXdRyYLzixiu/NWcKwWDH2pHAH1pjimH+r2266lQYs2Jsk8436uQKhZxI
D4l3ethujDcmMzO+ZTzLHjWbNdO2fC4R1LIF/Eg/sDe7g11dQItWYrAVddQFIkfb
/A5VV1DbGW33tpj4QUXfjuQ65I6rLOq4NbTY3j4/lPSiDkqIpEfwjzuc3/UZeuZl
Z8cu8Qxf7jdOfihbnUFR
=PS0g
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
We have three more fixes:
* AP_VLAN tailroom calculation fix, the bug leads to warnings
along with dropped packets
* NAPI context issue, calling napi_gro_receive() from a timer
(obviously) can lead to crashes
* remain-on-channel combining leads to dropped requests and not
being able to finish certain operations, so remove it
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix the handling to call cca mode setting. If the phy isn't
flag then the driver doesn't support this setting.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reported-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
mac802154_mlme_start_req() calls
ieee802154_mlme_ops(dev)->llsec->set_params() on the net_device
passed into it, however, this net_device will always be a mac802154
net_device, so just call mac802154_set_params() directly instead.
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
ieee802154_get_dev() only returns devices that have dev->type ==
ARPHRD_IEEE802154, therefore, there is no need to check this again
in raw_bind().
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In the past, 802.15.4 interfaces and 6LoWPAN interfaces used the
same dev->type (ARPHRD_IEEE802154), and 802.15.4 interfaces were
distinguished from 6LoWPAN interfaces by their differing dev->mtu.
6LoWPAN interfaces have their own ARPHRD type now, so there is no
longer any need to check dev->mtu to distinguish 802.15.4 devices
from 6LoWPAN devices.
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This allows us to create netdev tables that contain ingress chains. Use
skb_header_pointer() as we may see shared sk_buffs at this stage.
This change provides access to the existing nf_tables features from the ingress
hook.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the internal NFT_AF_NEEDS_DEV flag to indicate that you must
attach this table to a net_device.
This change is required by the follow up patch that introduces the new netdev
table.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The proper return code for trying to send a packet that exceeds the
outgoing interface's MTU is EMSGSIZE, not EINVAL, so patch ieee802154's
raw_sendmsg() to do the right thing. (Its dgram_sendmsg() was already
returning EMSGSIZE for this case.)
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
->ndo_do_ioctl() can be entered with the rtnl lock already held,
for example when sending a wext ioctl to a device (in which case
the rtnl lock is taken by wext_ioctl_dispatch()), but
mac802154_wpan_ioctl() currently unconditionally takes the rtnl
lock on entry, which can cause deadlocks.
To fix this, bail out of mac802154_wpan_ioctl() before taking the
rtnl lock if the ioctl cmd is not one of the cmds we implement.
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When we disconnect from the AP, drivers call cfg80211_disconnect().
This doesn't know whether the disconnection was initiated locally
or by the AP though, which can cause problems with the supplicant,
for example with WPS. This issue obviously doesn't show up with any
mac80211 based driver since mac80211 doesn't call this function.
Fix this by requiring drivers to indicate whether the disconnect is
locally generated or not. I've tried to update the drivers, but may
not have gotten the values correct, and some drivers may currently
not be able to report correct values. In case of doubt I left it at
false, which is the current behaviour.
For libertas, make adjustments as indicated by Dan Williams.
Reported-by: Matthieu Mauger <matthieux.mauger@intel.com>
Tested-by: Matthieu Mauger <matthieux.mauger@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
make C=2 CF=-D__CHECK_ENDIAN__ net/core/utils.o
...
net/core/utils.c:307:72: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:307:72: expected restricted __wsum [usertype] addend
net/core/utils.c:307:72: got restricted __be32 [usertype] from
net/core/utils.c:308:34: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:308:34: expected restricted __wsum [usertype] addend
net/core/utils.c:308:34: got restricted __be32 [usertype] to
net/core/utils.c:310:70: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:310:70: expected restricted __wsum [usertype] addend
net/core/utils.c:310:70: got restricted __be32 [usertype] from
net/core/utils.c:310:77: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:310:77: expected restricted __wsum [usertype] addend
net/core/utils.c:310:77: got restricted __be32 [usertype] to
net/core/utils.c:312:72: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:312:72: expected restricted __wsum [usertype] addend
net/core/utils.c:312:72: got restricted __be32 [usertype] from
net/core/utils.c:313:35: warning: incorrect type in argument 2 (different base types)
net/core/utils.c:313:35: expected restricted __wsum [usertype] addend
net/core/utils.c:313:35: got restricted __be32 [usertype] to
Note we can use csum_replace4() helper
Fixes: 58e3cac561 ("net: optimise inet_proto_csum_replace4()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
make C=2 CF=-D__CHECK_ENDIAN__ net/core/secure_seq.o
net/core/secure_seq.c:157:50: warning: restricted __be32 degrades to
integer
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few comments had minor typos. These are being fixed.
Signed-off-by: Florian Grandel <fgrandel@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Check in fdb_add_entry() if the source port should learn, similar
check is used in br_fdb_update.
Note that new fdb entries which are added manually or
as local ones are still permitted.
This patch has been tested by running traffic via a bridge port and
switching the port's state, also by manually adding/removing entries
from the bridge's fdb.
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/pktgen.c:2672:43: warning: incorrect type in assignment (different base types)
net/core/pktgen.c:2672:43: expected unsigned short [unsigned] [short] [usertype] <noident>
net/core/pktgen.c:2672:43: got restricted __be16 [usertype] protocol
Let's use proper struct ethhdr instead of hard coding everything.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_select_ident() returns a 32bit value in network order.
Fixes: 286c2349f6 ("ipv6: Clean up ipv6_select_ident() and ip6_fragment()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
API compliance scanning with coccinelle flagged:
./net/irda/timer.c:63:35-37: use of msecs_to_jiffies probably perferable
Converting milliseconds to jiffies by "val * HZ / 1000" technically
is not a clean solution as it does not handle all corner cases correctly.
By changing the conversion to use msecs_to_jiffies(val) conversion is
correct in all cases. Further the () around the arithmetic expression
was dropped.
Patch was compile tested for x86_64_defconfig + CONFIG_IRDA=m
Patch is against 4.1-rc4 (localversion-next is -next-20150522)
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Network managers like netifd (used in OpenWRT for instance) try to
configure interface options after creation but before setting the
interface up.
Unfortunately the sysfs / bridge currently only allows to configure the
hash_max and multicast_router options when the bridge interface is up.
But since br_multicast_init() doesn't start any timers and only sets
default values and initializes timers it should be save to reconfigure
the default values after that, before things actually get active after
the bridge is set up.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
since commit 6aafeef03b ("netfilter: push reasm skb through instead of
original frag skbs") we will end up sometimes re-fragmenting skbs
that we've reassembled.
ipv6 defrag preserves the original skbs using the skb frag list, i.e. as long
as the skb frag list is preserved there is no problem since we keep
original geometry of fragments intact.
However, in the rare case where the frag list is munged or skb
is linearized, we might send larger fragments than what we originally
received.
A router in the path might then send packet-too-big errors even if
sender never sent fragments exceeding the reported mtu:
mtu 1500 - 1500:1400 - 1400:1280 - 1280
A R1 R2 B
1 - A sends to B, fragment size 1400
2 - R2 sends pkttoobig error for 1280
3 - A sends to B, fragment size 1280
4 - R2 sends pkttoobig error for 1280 again because it sees fragments of size 1400.
make sure ip6_fragment always caps MTU at largest packet size seen
when defragmented skb is forwarded.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the patch
'ipv6: Only create RTF_CACHE routes after encountering pmtu exception',
we need to compensate the performance hit (bouncing dst->__refcnt).
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch breaks up ip6_rt_copy() into ip6_rt_copy_init() and
ip6_rt_cache_alloc().
In the later patch, we need to create a percpu rt6_info copy. Hence,
refactor the common rt6_info init codes to ip6_rt_copy_init().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch keeps track of the DST_NOCACHE routes in a list and replaces its
dev with loopback during the iface down/unregister event.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch always creates RTF_CACHE clone with DST_NOCACHE
when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to
the fl6->daddr.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Julian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The neighbor look-up used to depend on the rt6i_gateway (if
there is a gateway) or the rt6i_dst (if it is a RTF_CACHE clone)
as the nexthop address. Note that rt6i_dst is set to fl6->daddr
for the RTF_CACHE clone where fl6->daddr is the one used to do
the route look-up.
Now, we only create RTF_CACHE clone after encountering exception.
When doing the neighbor look-up with a route that is neither a gateway
nor a RTF_CACHE clone, the daddr in skb will be used as the nexthop.
In some cases, the daddr in skb is not the one used to do
the route look-up. One example is in ip_vs_dr_xmit_v6() where the
real nexthop server address is different from the one in the skb.
This patch is going to follow the IPv4 approach and ask the
ip6_pol_route() callers to set the FLOWI_FLAG_KNOWN_NH properly.
In the next patch, ip6_pol_route() will honor the FLOWI_FLAG_KNOWN_NH
and create a RTF_CACHE clone.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Julian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of doing the rt6->rt6i_node check whenever we need
to get the route's cookie. Refactor it into rt6_get_cookie().
It is a prep work to handle FLOWI_FLAG_KNOWN_NH and also
percpu rt6_info later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a RTF_CACHE routes only after encountering a pmtu
exception.
After ip6_rt_update_pmtu() has inserted the RTF_CACHE route to the fib6
tree, the rt->rt6i_node->fn_sernum is bumped which will fail the
ip6_dst_check() and trigger a relookup.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
A prep work for creating RTF_CACHE on exception only. After this
patch, the same condition (rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))
is checked twice. This redundancy will be removed in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst.
Also, rt6i_gateway is always set to the nexthop while the nexthop
could be a gateway or the rt6i_dst.addr.
After removing the rt6i_dst and rt6i_src dependency in the last patch,
we also need to stop the caller from depending on rt6i_gateway and
RTF_ANYCAST.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the assumptions that the returned rt is always
a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the
destination and source address. The dst and src can be recovered from
the calling site.
We may consider to rename (rt6i_dst, rt6i_src) to
(rt6i_key_dst, rt6i_key_src) later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the ipv6_select_ident() signature to return a
fragment id instead of taking a whole frag_hdr as a param to
only set the frag_hdr->identification.
It also cleans up ip6_fragment() to obtain the fragment id at the
beginning instead of using multiple "if" later to check fragment id
has been generated or not.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send icmp pmtu error if we find that the largest fragment of df-skb
exceeded the output path mtu.
The ip output path will still catch this later on but we can avoid the
forward/postrouting hook traversal by rejecting right away.
This is what ipv6 already does.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
unix_stream_recvmsg is refactored to unix_stream_read_generic in this
patch and enhanced to deal with pipe splicing. The refactoring is
inneglible, we mostly have to deal with a non-existing struct msghdr
argument.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.
AF_UNIX sockets don't use lock_sock/release_sock and thus we have to
use a callback to make the locking and unlocking configureable.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements sendpage support for AF_UNIX SOCK_STREAM
sockets. This is also required for a complete splice implementation.
The implementation is a bit tricky because we append to already existing
skbs and so have to hold unix_sk->readlock to protect the reading side
from either advancing UNIXCB.consumed or freeing the skb at the socket
receive tail.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull two Ceph fixes from Sage Weil:
"These fix an issue with the RBD notifications when there are topology
changes in the cluster"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"
libceph: request a new osdmap if lingering request maps to no osd
This patch removes the mib lock. The new locking mechanism is to protect
the mib values with the rtnl lock. Note that this isn't always necessary
if we have an interface up the most mib values are readonly (e.g.
address settings). With this behaviour we can remove locking in
hotpath like frame parsing completely. It depends on context if we need
to hold the rtnl lock or not, this makes the callbacks of
ieee802154_mlme_ops unnecessary because these callbacks hols always the
locks.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch will use atomic operations for sequence number incrementation
while MAC header generation. Upper layers like af_802154 or 6LoWPAN
could call this function in a parallel context while generating 802.15.4
MAC header before queuing into wpan interfaces transmit queue.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the pib lock which is now replaced by rtnl lock. The
new interface already use the rtnl lock only. Nevertheless this patch
will fix issues while using new and old interface at the same time.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch fixes an issue to set address configuration with ioctl.
Accessing the mib requires rtnl lock and the ndo_do_ioctl doesn't hold
the rtnl lock while this callback is called. This patch do that
manually.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reported-by: Matteo Petracca <matteo.petracca@sssup.it>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
Giving /proc/net/pktgen/pgctrl an invalid command just returns shell
success and prints a warning in dmesg. This is not very useful for
shell scripting, as it can only detect the error by parsing dmesg.
Instead return -EINVAL when the command is unknown, as this provides
userspace shell scripting a way of detecting this.
Also bump version tag to 2.75, because (1) reading /proc/net/pktgen/pgctrl
output this version number which would allow to detect this small
semantic change, and (2) because the pktgen version tag have not been
updated since 2010.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Too many spaces were introduced in commit 63adc6fb8a ("pktgen: cleanup
checkpatch warnings"), thus misaligning "src_min:" to other columns.
Fixes: 63adc6fb8a ("pktgen: cleanup checkpatch warnings")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When trying to configure the settings for PHY1, using commands
like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
modify other settings apart from the speed of the PHY1, in the
above case.
The ethtool seems to query the settings for PHY0, and use this
as the base to apply the new settings to the PHY1. This is
causing the other settings of the PHY 1 to be wrongly
configured.
The issue is caused by the '_ethtool_get_settings()' API, which
gets called because of the 'ETHTOOL_GSET' command, is clearing
the 'cmd' pointer (of type 'struct ethtool_cmd') by calling
memset. This clears all the parameters (if any) passed for the
'ETHTOOL_GSET' cmd. So the driver's callback is always invoked
with 'cmd->phy_address' as '0'.
The '_ethtool_get_settings()' is called from other files in the
'net/core'. So the fix is applied to the 'ethtool_get_settings()'
which is only called in the context of the 'ethtool'.
Signed-off-by: Arun Parameswaran <aparames@broadcom.com>
Reviewed-by: Ray Jui <rjui@broadcom.com>
Reviewed-by: Scott Branden <sbranden@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When more than a multicast address is present in a MLDv2 report, all but
the first address is ignored, because the code breaks out of the loop if
there has not been an error adding that address.
This has caused failures when two guests connected through the bridge
tried to communicate using IPv6. Neighbor discoveries would not be
transmitted to the other guest when both used a link-local address and a
static address.
This only happens when there is a MLDv2 querier in the network.
The fix will only break out of the loop when there is a failure adding a
multicast address.
The mdb before the patch:
dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
After the patch:
dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::fb temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
dev ovirtmgmt port bond0.86 grp ff02::d temp
dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
dev ovirtmgmt port bond0.86 grp ff02::16 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp
Fixes: 08b202b672 ("bridge br_multicast: IPv6 MLD support.")
Reported-by: Rik Theys <Rik.Theys@esat.kuleuven.be>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Tested-by: Rik Theys <Rik.Theys@esat.kuleuven.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
When replacing an IPv4 route, tb_id member of the new fib_alias
structure is not set in the replace code path so that the new route is
ignored.
Fixes: 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contain Netfilter fixes for your net tree, they are:
1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash.
This problem is due to wrong order in the per-net registration and netlink
socket events. Patch from Francesco Ruggeri.
2) Make sure that counters that userspace pass us are higher than 0 in all the
x_tables frontends. Discovered via Trinity, patch from Dave Jones.
3) Revert a patch for br_netfilter to rely on the conntrack status bits. This
breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_error does not check if in_dev is NULL before dereferencing it.
IThe following sequence of calls is possible:
CPU A CPU B
ip_rcv_finish
ip_route_input_noref()
ip_route_input_slow()
inetdev_destroy()
dst_input()
With the result that a network device can be destroyed while processing
an input packet.
A crash was triggered with only unicast packets in flight, and
forwarding enabled on the only network device. The error condition
was created by the removal of the network device.
As such it is likely the that error code was -EHOSTUNREACH, and the
action taken by ip_error (if in_dev had been accessible) would have
been to not increment any counters and to have tried and likely failed
to send an icmp error as the network device is going away.
Therefore handle this weird case by just dropping the packet if
!in_dev. It will result in dropping the packet sooner, and will not
result in an actual change of behavior.
Fixes: 251da41301 ("ipv4: Cache ip_error() routes even when not forwarding.")
Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
Tested-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
Signed-off-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This restored previous behaviour. If caller does not want ports to be
filled, we should not break.
Fixes: 06635a35d1 ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Taking socket spinlock in tcp_get_info() can deadlock, as
inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
while packet processing can use the reverse locking order.
We could avoid this locking for TCP_LISTEN states, but lockdep would
certainly get confused as all TCP sockets share same lockdep classes.
[ 523.722504] ======================================================
[ 523.728706] [ INFO: possible circular locking dependency detected ]
[ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
[ 523.739202] -------------------------------------------------------
[ 523.745474] ss/18032 is trying to acquire lock:
[ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
[ 523.758129]
[ 523.758129] but task is already holding lock:
[ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
[ 523.774661]
[ 523.774661] which lock already depends on the new lock.
[ 523.774661]
[ 523.782850]
[ 523.782850] the existing dependency chain (in reverse order) is:
[ 523.790326]
-> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
[ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270
[ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
[ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
[ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
[ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
[ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
[ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
[ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
[ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
[ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
[ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420
Lets use u64_sync infrastructure instead. As a bonus, 64bit
arches get optimized, as these are nop for them.
Fixes: 0df48c26d8 ("tcp: add tcpi_bytes_acked to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tracks the total number of inbound and outbound segments on a
TCP socket. One may use this number to have an idea on connection
quality when compared against the retransmissions.
RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut
These are a 32bit field each and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->segs_out was placed near tp->snd_nxt for good data
locality and minimal performance impact, while tp->segs_in was placed
near tp->bytes_received for the same reason.
Join work with Eric Dumazet.
Note that received SYN are accounted on the listener, but sent SYNACK
are not accounted.
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip -6 addr add dead::1/128 dev eth0
sleep 5
ip -6 route add default via dead::1/128
-> fails
ip -6 addr add dead::1/128 dev eth0
ip -6 route add default via dead::1/128
-> succeeds
reason is that if (nonsensensical) route above is added,
dead::1 is still subject to DAD, so the route lookup will
pick eth0 as outdev due to the prefix route that is added before
DAD work is started.
Add explicit test that checks if nexthop gateway is a local address.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1167969
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_csk_get_port() randomization effort tends to spread
sockets on all the available range (ip_local_port_range)
This is unfortunate because SO_REUSEADDR sockets have
less requirements than non SO_REUSEADDR ones.
If an application uses SO_REUSEADDR hint, it is to try to
allow source ports being shared.
So instead of picking a random port number in ip_local_port_range,
lets try first in first half of the range.
This gives more chances to use upper half of the range for the
sockets with strong requirements (not using SO_REUSEADDR)
Note this patch does not add a new sysctl, and only changes
the way we try to pick port number.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We no longer need bsocket atomic counter, as inet_csk_get_port()
calls bind_conflict() regardless of its value, after commit
2b05ad33e1 ("tcp: bind() fix autoselection to share ports")
This patch removes overhead of maintaining this counter and
double inet_csk_get_port() calls under pressure.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently rely on the setting of SOCK_NOSPACE in the write()
path to ensure that we wake up any epoll edge trigger waiters when
acks return to free space in the write queue. However, if we fail
to allocate even a single skb in the write queue, we could end up
waiting indefinitely.
Fix this by explicitly issuing a wakeup when we detect the condition
of an empty write queue and a return value of -EAGAIN. This allows
userspace to re-try as we expect this to be a temporary failure.
I've tested this approach by artificially making
sk_stream_alloc_skb() return NULL periodically. In that case,
epoll edge trigger waiters will hang indefinitely in epoll_wait()
without this patch.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vijay reported that a loop as simple as ...
while true; do
tc qdisc add dev foo root handle 1: prio
tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1
tc qdisc del dev foo root
rmmod cls_u32
done
... will panic the kernel. Moreover, he bisected the change
apparently introducing it to 78fd1d0ab0 ("netlink: Re-add
locking to netlink_lookup() and seq walker").
The removal of synchronize_net() from the netlink socket
triggering the qdisc to be removed, seems to have uncovered
an RCU resp. module reference count race from the tc API.
Given that RCU conversion was done after e341694e3e ("netlink:
Convert netlink_lookup() to use RCU protected hash table")
which added the synchronize_net() originally, occasion of
hitting the bug was less likely (not impossible though):
When qdiscs that i) support attaching classifiers and,
ii) have at least one of them attached, get deleted, they
invoke tcf_destroy_chain(), and thus call into ->destroy()
handler from a classifier module.
After RCU conversion, all classifier that have an internal
prio list, unlink them and initiate freeing via call_rcu()
deferral.
Meanhile, tcf_destroy() releases already reference to the
tp->ops->owner module before the queued RCU callback handler
has been invoked.
Subsequent rmmod on the classifier module is then not prevented
since all module references are already dropped.
By the time, the kernel invokes the RCU callback handler from
the module, that function address is then invalid.
One way to fix it would be to add an rcu_barrier() to
unregister_tcf_proto_ops() to wait for all pending call_rcu()s
to complete.
synchronize_rcu() is not appropriate as under heavy RCU
callback load, registered call_rcu()s could be deferred
longer than a grace period. In case we don't have any pending
call_rcu()s, the barrier is allowed to return immediately.
Since we came here via unregister_tcf_proto_ops(), there
are no users of a given classifier anymore. Further nested
call_rcu()s pointing into the module space are not being
done anywhere.
Only cls_bpf_delete_prog() may schedule a work item, to
unlock pages eventually, but that is not in the range/context
of cls_bpf anymore.
Fixes: 25d8c0d55f ("net: rcu-ify tcf_proto")
Fixes: 9888faefe1 ("net: sched: cls_basic use RCU")
Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce ifdef pollution slightly, no functional change. We can simply
remove the extra alternative definition of handle_ing() and nf_ingress().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.
It turns out there are other cases where we want to force memory
schedule :
tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[1] When entering NUD_PROBE state via neigh_update(), perhaps received
from userspace, correctly (re)initialize the probes count to zero.
This is useful for forcing revalidation of a neighbor (for example
if the host is attempting to do DNA [IPv4 4436, IPv6 6059]).
[2] Notify listeners when a neighbor goes into NUD_PROBE state.
By sending notifications on entry to NUD_PROBE state listeners get
more timely warnings of imminent connectivity issues.
The current notifications on entry to NUD_STALE have somewhat
limited usefulness: NUD_STALE is a perfectly normal state, as is
NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after
a neighbor reachability problem has been confirmed (typically after
three probes).
Signed-off-by: Erik Kline <ek@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we're now always including the high bits of the sequence number
in the IV generation process we need to ensure that they don't
contain crap.
This patch ensures that the high sequence bits are always zeroed
so that we don't leak random data into the IV.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This reverts commit ba9d114ec5.
.. which introduced a regression that prevented all lingering requests
requeued in kick_requests() from ever being sent to the OSDs, resulting
in a lot of missed notifies. In retrospect it's pretty obvious that
r_req_lru_item item in the case of lingering requests can be used not
only for notarget, but also for unsent linkage due to how tightly
actual map and enqueue operations are coupled in __map_request().
The assertion that was being silenced is taken care of in the previous
("libceph: request a new osdmap if lingering request maps to no osd")
commit: by always kicking homeless lingering requests we ensure that
none of them ends up on the notarget list outside of the critical
section guarded by request_mutex.
Cc: stable@vger.kernel.org # 3.18+, needs b049453221 "libceph: request a new osdmap if lingering request maps to no osd"
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This commit does two things. First, if there are any homeless
lingering requests, we now request a new osdmap even if the osdmap that
is being processed brought no changes, i.e. if a given lingering
request turned homeless in one of the previous epochs and remained
homeless in the current epoch. Not doing so leaves us with a stale
osdmap and as a result we may miss our window for reestablishing the
watch and lose notifies.
MON=1 OSD=1:
# cat linger-needmap.sh
#!/bin/bash
rbd create --size 1 test
DEV=$(rbd map test)
ceph osd out 0
rbd map dne/dne # obtain a new osdmap as a side effect (!)
sleep 1
ceph osd in 0
rbd resize --size 2 test
# rbd info test | grep size -> 2M
# blockdev --getsize $DEV -> 1M
N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
above is enough to make it miss that resize notify, but that is a
bug^Wlimitation of ceph watch/notify v1.
Second, homeless lingering requests are now kicked just like those
lingering requests whose mapping has changed. This is mainly to
recognize that a homeless lingering request makes no sense and to
preserve the invariant that a registered lingering request is not
sitting on any of r_req_lru_item lists. This spares us a WARN_ON,
which commit ba9d114ec5 ("libceph: clear r_req_lru_item in
__unregister_linger_request()") tried to fix the _wrong_ way.
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
When replacing an IPv6 multipath route with "ip route replace", i.e.
NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first
matching route without fixing its siblings, resulting in corrupted
siblings linked list; removing one of the siblings can then end in an
infinite loop.
IPv6 ECMP implementation is a bit different from IPv4 so that route
replacement cannot work in exactly the same way. This should be a
reasonable approximation:
1. If the new route is ECMP-able and there is a matching ECMP-able one
already, replace it and all its siblings (if any).
2. If the new route is ECMP-able and no matching ECMP-able route exists,
replace first matching non-ECMP-able (if any) or just add the new one.
3. If the new route is not ECMP-able, replace first matching
non-ECMP-able route (if any) or add the new route.
We also need to remove the NLM_F_REPLACE flag after replacing old
route(s) by first nexthop of an ECMP route so that each subsequent
nexthop does not replace previous one.
Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If adding a nexthop of an IPv6 multipath route fails, comment in
ip6_route_multipath() says we are going to delete all nexthops already
added. However, current implementation deletes even the routes it
hasn't even tried to add yet. For example, running
ip route add 1234:5678::/64 \
nexthop via fe80::aa dev dummy1 \
nexthop via fe80::bb dev dummy1 \
nexthop via fe80::cc dev dummy1
twice results in removing all routes first command added.
Limit the second (delete) run to nexthops that succeeded in the first
(add) run.
Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a station does a channel switch, it's not well defined what its TDLS
peers would do. Avoid a situation when the local side marks a potentially
disconnected peer as a TDLS peer.
Keeping peers connected through CSA is doubly problematic with the upcoming
TDLS WIDER-BW feature which allows peers to widen the BSS channel. The
new channel transitioned-to might not be compatible and would require
a re-negotiation anyway.
Make sure to disallow new TDLS link during CSA.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some splats I was seeing:
(a) WARNING: CPU: 1 PID: 0 at /devel/src/linux/net/mac80211/wep.c:102 ieee80211_wep_add_iv
(b) WARNING: CPU: 1 PID: 0 at /devel/src/linux/net/mac80211/wpa.c:73 ieee80211_tx_h_michael_mic_add
(c) WARNING: CPU: 3 PID: 0 at /devel/src/linux/net/mac80211/wpa.c:433 ieee80211_crypto_ccmp_encrypt
I've seen (a) and (b) with ath9k hw crypto and (c)
with ath9k sw crypto. All of them were related to
insufficient skb tailroom and I was able to
trigger these with ping6 program.
AP_VLANs may inherit crypto keys from parent AP.
This wasn't considered and yielded problems in
some setups resulting in inability to transmit
data because mac80211 wouldn't resize skbs when
necessary and subsequently drop some packets due
to insufficient tailroom.
For efficiency purposes don't inspect both AP_VLAN
and AP sdata looking for tailroom counter. Instead
update AP_VLAN tailroom counters whenever their
master AP tailroom counter changes.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Due to remain-on-channel scheduling delays, when we split an ROC
while coalescing, we'll usually get a picture like this:
existing ROC: |------------------|
current time: ^
new ROC: |------| |-------|
If the expected response frames are then transmitted by the peer
in the hole between the two fragments of the new ROC, we miss
them and the process (e.g. ANQP query) fails.
mac80211 expects that the window to miss something is small:
existing ROC: |------------------|
new ROC: |------||-------|
but that's normally not the case.
To avoid this problem, coalesce only if the new ROC's duration
is <= the remaining time on the existing one:
existing ROC: |------------------|
new ROC: |-----|
and never split a new one but schedule it afterwards instead:
existing ROC: |------------------|
new ROC: |-------------|
type=bugfix
bug=not-tracked
fixes=unknown
Reported-by: Matti Gottlieb <matti.gottlieb@intel.com>
Reviewed-by: EliadX Peller <eliad@wizery.com>
Reviewed-by: Matti Gottlieb <matti.gottlieb@intel.com>
Tested-by: Matti Gottlieb <matti.gottlieb@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Drivers with fast-xmit (e.g. ath10k) running in
AP_VLAN setups would fail to communicate with
connected 4addr stations.
The reason was when new station associates it
first goes into master AP interface. It is not
until later that a dedicated AP_VLAN is created
for it and the station itself is moved there.
After that Tx directed at the station should use
4addr header. However fast-xmit wasn't
recalculated and 3addr header remained to be used.
This in turn caused the connected 4addr stations
to drop packets coming from the AP until some
other event would cause fast-xmit to recalculate
for that station (which could never come).
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Use dev_pm_ops instead of the legacy suspend/resume callbacks for the wiphy
class suspend and resume operations.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Use dev_pm_ops instead of the legacy suspend/resume callbacks for the
rfkill class suspend and resume operations.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This reverts commit c055d5b03b.
There are two issues:
'dnat_took_place' made me think that this is related to
-j DNAT/MASQUERADE.
But thats only one part of the story. This is also relevant for SNAT
when we undo snat translation in reverse/reply direction.
Furthermore, I originally wanted to do this mainly to avoid
storing ipv6 addresses once we make DNAT/REDIRECT work
for ipv6 on bridges.
However, I forgot about SNPT/DNPT which is stateless.
So we can't escape storing address for ipv6 anyway. Might as
well do it for ipv4 too.
Reported-and-tested-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After improving setsockopt() coverage in trinity, I started triggering
vmalloc failures pretty reliably from this code path:
warn_alloc_failed+0xe9/0x140
__vmalloc_node_range+0x1be/0x270
vzalloc+0x4b/0x50
__do_replace+0x52/0x260 [ip_tables]
do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
nf_setsockopt+0x65/0x90
ip_setsockopt+0x61/0xa0
raw_setsockopt+0x16/0x60
sock_common_setsockopt+0x14/0x20
SyS_setsockopt+0x71/0xd0
It turns out we don't validate that the num_counters field in the
struct we pass in from userspace is initialized.
The same problem also exists in ebtables, arptables, ipv6, and the
compat variants.
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nfnetlink_{log,queue}_init() register the netlink callback nf*_rcv_nl_event
before registering the pernet_subsys, but the callback relies on data
structures allocated by pernet init functions.
When nfnetlink_{log,queue} is loaded, if a netlink message is received after
the netlink callback is registered but before the pernet_subsys is registered,
the kernel will panic in the sequence
nfulnl_rcv_nl_event
nfnl_log_pernet
net_generic
BUG_ON(id == 0) where id is nfnl_log_net_id.
The panic can be easily reproduced in 4.0.3 by:
while true ;do modprobe nfnetlink_log ; rmmod nfnetlink_log ; done &
while true ;do ip netns add dummy ; ip netns del dummy ; done &
This patch moves register_pernet_subsys to earlier in nfnetlink_log_init.
Notice that the BUG_ON hit in 4.0.3 was recently removed in 2591ffd308
["netns: remove BUG_ONs from net_generic()"].
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
My recent change here introduced a possible memory leak if the
driver registers an invalid cipher schemes. This won't really
happen in practice, but fix the leak nonetheless.
Fixes: e3a55b5399 ("mac80211: validate cipher scheme PN length better")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This work as a follow-up of commit f7b3bec6f5 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
problem that leads to dropped frames.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVWuR+AAoJEDBSmw7B7bqrgRYP/1A/g6osMT4DG/WecYleDlip
1De0c1S4rgHVI+/ZvK4JcyYbjHYwKhbXHBtsNHV+J9GehqXyWn0BJPkSnxZ3HdZV
M5CHSgtN2OBfHJ03OTpduvdNzKjpVOCf2PWKFnhJhDzYdfa9qh9kKDwRGeDcHvfc
++vVs+bMzjhnWj2y0TpEs1fQcd69MrR9Af2ptftOrusuVkDxShKrgY4xj1d+OVyC
FggUn/oj6/CgGVn8KV1hld+Cb1Tk1/D9uksXYZepHNo4qb0M8T8BBWIQCpdbK4Ge
qAG8w7/suLGqb8VU5k0jM4Uqbn5l9cm7PX1PQrxCdyFHMf3kojR8LgI33Xqm4d40
9HxnXLlDoaawTOiAJIG1HMEzawriWfxSly3hS1Q/B/FGo68C2KIg9h5/w98GNfIB
PNE41GopCwQlmhORGXxpzwf/jJ5mL9V6PjxUnKpsd/BlbUlKLmFnx7JABicLl/Ps
292l2yZR9Jrzaf8njmGoIyYb+AREvJF4zQu9rduiro56+rCvGvFJZ8xfwGsRvNTH
f/HILhW+GDPlJp4StCvKQxm0bWJ6feRiPCYr2JRViMQmp6hX6AMYfVcf7jcpIWio
uTB9FAW6XGPrltm+1IeyWICbiF0VpuqpPl8V3UTiDgIBdk4wqD5kr3HSW6zb1w9L
KJYFbpC9igqGk+fc3u4b
=DNh/
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
This has just a single fix, for a WEP tailroom check
problem that leads to dropped frames.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* LED throughput trigger was crashing
* fast-xmit wasn't treating QoS changes in IBSS correctly
* TDLS could use the wrong channel definition
* using a reserved channel context could use the wrong channel width
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVWuPvAAoJEDBSmw7B7bqr/A4P/0TqzkCC5L2qJvi3a6QNFxvf
s3riQMJ8WQeUxCRNgFNeeeNUAgSJn3hhiINGrjRmkwXXxYC4mbCwM0YXNT+WhSRL
/Kx4mRJr7u5ZU0olW+KRvIV5CyTsbr9zVnaraCh5NV43nT87ZVZRBKC9vz2UkSM5
AsN6fUvhWMGhhHoGGDqtjRBjve8Xs5iKiEcE1iQTzLOPnFP3dKtB1zKKiA0JCQs4
OjxkQ7uaF0T1IfkMFr0gyzgQi4A8iPoMKV3qcRIH/QZN5dpJ6DR1dgaU50CrzQ+R
JD9W09ifF9U8GnvQU/baJHKCxEvnQWO2XwlV4+mV6bXF1j5Ng4LRiXntIeu2d3T3
5JuvPV9cNJb8dSTzsYw+TRJg73hStlJCAjVMJ7hiOMQc1YCCY9Exrff0pWzJPJfE
NygIkMHXymcy66yL3b7DIIXro5jHNVGVoHq3vMB+W+/EcEDFN6L9LeCzUVo+oKjl
Qg4kC7VHDjcdt0f7Vgv2Cal76ZVfCZaq74QZV1cySF2sCiD27LnAAfoHVeMY979K
qBsCRqhkBlc7ntnstv6tGz9LfG8ro+Fv548HIUDG80capZl6N6FR6g+8hYIuvJwu
2abJq36bp/NAGDt43UofmtDxyZNyvoKmzcQKSdn2QpryGKQ3uRcDHG/I+WD0yWX9
4WFNEm86sXmfL/Eyu0lU
=iUFe
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
This just has a few fixes:
* LED throughput trigger was crashing
* fast-xmit wasn't treating QoS changes in IBSS correctly
* TDLS could use the wrong channel definition
* using a reserved channel context could use the wrong channel width
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After sending the new data packets to probe (step 2), F-RTO may
incorrectly send more probes if the next ACK advances SND_UNA and
does not sack new packet. However F-RTO RFC 5682 probes at most
once. This bug may cause sender to always send new data instead of
repairing holes, inducing longer HoL blocking on the receiver for
the application.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Undo based on TCP timestamps should only happen on ACKs that advance
SND_UNA, according to the Eifel algorithm in RFC 3522:
Section 3.2:
(4) If the value of the Timestamp Echo Reply field of the
acceptable ACK's Timestamps option is smaller than the
value of RetransmitTS, then proceed to step (5),
Section Terminology:
We use the term 'acceptable ACK' as defined in [RFC793]. That is an
ACK that acknowledges previously unacknowledged data.
This is because upon receiving an out-of-order packet, the receiver
returns the last timestamp that advances RCV_NXT, not the current
timestamp of the packet in the DUPACK. Without checking the flag,
the DUPACK will cause tcp_packet_delayed() to return true and
tcp_try_undo_loss() will revert cwnd reduction.
Note that we check the condition in CA_Recovery already by only
calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
tcp_try_undo_recovery() if snd_una crosses high_seq.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit <5cf3d46192fc> ("udp: Simplify__udp*_lib_mcast_deliver")
simplified the filter for incoming IPv6 multicast but removed
the check of the local socket address and the UDP destination
address.
This patch restores the filter to prevent sockets bound to a IPv6
multicast IP to receive other UDP traffic link unicast.
Signed-off-by: Henning Rogge <hrogge@gmail.com>
Fixes: 5cf3d46192 ("udp: Simplify__udp*_lib_mcast_deliver")
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>