Commit graph

549585 commits

Author SHA1 Message Date
Joe Stringer e754ec69ab openvswitch: Serialize nested ct actions if provided
If userspace provides a ct action with no nested mark or label, then the
storage for these fields is zeroed. Later when actions are requested,
such zeroed fields are serialized even though userspace didn't
originally specify them. Fix the behaviour by ensuring that no action is
serialized in this case, and reject actions where userspace attempts to
set these fields with mask=0. This should make netlink marshalling
consistent across deserialization/reserialization.

Reported-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:33:43 -07:00
Joe Stringer 4f0909ee3d openvswitch: Mark connections new when not confirmed.
New, related connections are marked as such as part of ovs_ct_lookup(),
but they are not marked as "new" if the commit flag is used. Make this
consistent by setting the "new" flag whenever !nf_ct_is_confirmed(ct).

Reported-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:33:40 -07:00
Joe Stringer 1d008a1df9 openvswitch: Clarify conntrack COMMIT behaviour
The presence of this attribute does not modify the ct_state for the
current packet, only future packets. Make this more clear in the header
definition.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:33:38 -07:00
Joe Stringer 9e384715e9 openvswitch: Reject ct_state masks for unknown bits
Currently, 0-bits are generated in ct_state where the bit position is
undefined, and matches are accepted on these bit-positions. If userspace
requests to match the 0-value for this bit then it may expect only a
subset of traffic to match this value, whereas currently all packets
will have this bit set to 0. Fix this by rejecting such masks.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:33:36 -07:00
Renato Westphal e2e8009ff7 tcp: remove improper preemption check in tcp_xmit_probe_skb()
Commit e520af48c7 introduced the following bug when setting the
TCP_REPAIR sockoption:

[ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164
[ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20
[ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1
[ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012
[ 2860.657054]  ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002
[ 2860.657058]  0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18
[ 2860.657062]  ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38
[ 2860.657065] Call Trace:
[ 2860.657072]  [<ffffffff8185d765>] dump_stack+0x4f/0x7b
[ 2860.657075]  [<ffffffff8146ed91>] check_preemption_disabled+0xe1/0xf0
[ 2860.657078]  [<ffffffff8146edd3>] __this_cpu_preempt_check+0x13/0x20
[ 2860.657082]  [<ffffffff817e0bc7>] tcp_xmit_probe_skb+0xc7/0x100
[ 2860.657085]  [<ffffffff817e1e2d>] tcp_send_window_probe+0x2d/0x30
[ 2860.657089]  [<ffffffff817d1d8c>] do_tcp_setsockopt.isra.29+0x74c/0x830
[ 2860.657093]  [<ffffffff817d1e9c>] tcp_setsockopt+0x2c/0x30
[ 2860.657097]  [<ffffffff81767b74>] sock_common_setsockopt+0x14/0x20
[ 2860.657100]  [<ffffffff817669e1>] SyS_setsockopt+0x71/0xc0
[ 2860.657104]  [<ffffffff81865172>] entry_SYSCALL_64_fastpath+0x16/0x75

Since tcp_xmit_probe_skb() can be called from process context, use
NET_INC_STATS() instead of NET_INC_STATS_BH().

Fixes: e520af48c7 ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters")
Signed-off-by: Renato Westphal <renatow@taghos.com.br>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:29:26 -07:00
David S. Miller 36a28b2116 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains four Netfilter fixes for net, they are:

1) Fix Kconfig dependencies of new nf_dup_ipv4 and nf_dup_ipv6.

2) Remove bogus test nh_scope in IPv4 rpfilter match that is breaking
   --accept-local, from Xin Long.

3) Wait for RCU grace period after dropping the pending packets in the
   nfqueue, from Florian Westphal.

4) Fix sleeping allocation while holding spin_lock_bh, from Nikolay Borisov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:26:17 -07:00
Arad, Ronen b1974ed05e netlink: Rightsize IFLA_AF_SPEC size calculation
if_nlmsg_size() overestimates the minimum allocation size of netlink
dump request (when called from rtnl_calcit()) or the size of the
message (when called from rtnl_getlink()). This is because
ext_filter_mask is not supported by rtnl_link_get_af_size() and
rtnl_link_get_size().

The over-estimation is significant when at least one netdev has many
VLANs configured (8 bytes for each configured VLAN).

This patch-set "rightsizes" the protocol specific attribute size
calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
and adding this a argument to get_link_af_size op in rtnl_af_ops.

Bridge module already used filtering aware sizing for notifications.
br_get_link_af_size_filtered() is consistent with the modified
get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
br_get_link_af_size() becomes unused and thus removed.

Signed-off-by: Ronen Arad <ronen.arad@intel.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:15:20 -07:00
Jon Paul Maloy e53567948f tipc: conditionally expand buffer headroom over udp tunnel
In commit d999297c3d ("tipc: reduce locking scope during packet reception")
we altered the packet retransmission function. Since then, when
restransmitting packets, we create a clone of the original buffer
using __pskb_copy(skb, MIN_H_SIZE), where MIN_H_SIZE is the size of
the area we want to have copied, but also the smallest possible TIPC
packet size. The value of MIN_H_SIZE is 24.

Unfortunately, __pskb_copy() also has the effect that the headroom
of the cloned buffer takes the size MIN_H_SIZE. This is too small
for carrying the packet over the UDP tunnel bearer, which requires
a minimum headroom of 28 bytes. A change to just use pskb_copy()
lets the clone inherit the original headroom of 80 bytes, but also
assumes that the copied data area is of at least that size, something
that is not always the case. So that is not a viable solution.

We now fix this by adding a check for sufficient headroom in the
transmit function of udp_media.c, and expanding it when necessary.

Fixes: commit d999297c3d ("tipc: reduce locking scope during packet reception")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:13:48 -07:00
Andreas Schwab 7a4264a925 net: cavium: change NET_VENDOR_CAVIUM to bool
CONFIG_NET_VENDOR_CAVIUM is only used to hide/show config options and to
include subdirectories in the build, so it doesn't make sense to make it
tristate.

Signed-off-by: Andreas Schwab <schwab@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:12:16 -07:00
Jon Paul Maloy 45c8b7b175 tipc: allow non-linear first fragment buffer
The current code for message reassembly is erroneously assuming that
the the first arriving fragment buffer always is linear, and then goes
ahead resetting the fragment list of that buffer in anticipation of
more arriving fragments.

However, if the buffer already happens to be non-linear, we will
inadvertently drop the already attached fragment list, and later
on trig a BUG() in __pskb_pull_tail().

We see this happen when running fragmented TIPC multicast across UDP,
something made possible since
commit d0f91938be ("tipc: add ip/udp media type")

We fix this by not resetting the fragment list when the buffer is non-
linear, and by initiatlizing our private fragment list tail pointer to
the tail of the existing fragment list.

Fixes: commit d0f91938be ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:08:24 -07:00
James Morse 1241365f1a openvswitch: Allocate memory for ovs internal device stats.
"openvswitch: Remove vport stats" removed the per-vport statistics, in
order to use the netdev's statistics fields.
"openvswitch: Fix ovs_vport_get_stats()" fixed the export of these stats
to user-space, by using the provided netdev_ops to collate them - but ovs
internal devices still use an unallocated dev->tstats field to count
packets, which are no longer exported by this api.

Allocate the dev->tstats field for ovs internal devices, and wire up
ndo_get_stats64 with the original implementation of
ovs_vport_get_stats().

On its own, "openvswitch: Fix ovs_vport_get_stats()" fixes the OOPs,
unmasking a full-on panic on arm64:

=============%<==============
[<ffffffbffc00ce4c>] internal_dev_recv+0xa8/0x170 [openvswitch]
[<ffffffbffc0008b4>] do_output.isra.31+0x60/0x19c [openvswitch]
[<ffffffbffc000bf8>] do_execute_actions+0x208/0x11c0 [openvswitch]
[<ffffffbffc001c78>] ovs_execute_actions+0xc8/0x238 [openvswitch]
[<ffffffbffc003dfc>] ovs_packet_cmd_execute+0x21c/0x288 [openvswitch]
[<ffffffc0005e8c5c>] genl_family_rcv_msg+0x1b0/0x310
[<ffffffc0005e8e60>] genl_rcv_msg+0xa4/0xe4
[<ffffffc0005e7ddc>] netlink_rcv_skb+0xb0/0xdc
[<ffffffc0005e8a94>] genl_rcv+0x38/0x50
[<ffffffc0005e76c0>] netlink_unicast+0x164/0x210
[<ffffffc0005e7b70>] netlink_sendmsg+0x304/0x368
[<ffffffc0005a21c0>] sock_sendmsg+0x30/0x4c
[SNIP]
Kernel panic - not syncing: Fatal exception in interrupt
=============%<==============

Fixes: 8c876639c9 ("openvswitch: Remove vport stats.")
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:06:36 -07:00
David Ahern f1900fb5ec net: Really fix vti6 with oif in dst lookups
6e28b00082 ("net: Fix vti use case with oif in dst lookups for IPv6")
is missing the checks on FLOWI_FLAG_SKIP_NH_OIF. Add them.

Fixes: 42a7b32b73 ("xfrm: Add oif to dst lookups")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:04:54 -07:00
Jon Paul Maloy 53387c4e22 tipc: extend broadcast link window size
The default fix broadcast window size is currently set to 20 packets.
This is a very low value, set at a time when we were still testing on
10 Mb/s hubs, and a change to it is long overdue.

Commit 7845989cb4 ("net: tipc: fix stall during bclink wakeup procedure")
revealed a problem with this low value. For messages of importance LOW,
the backlog queue limit will be  calculated to 30 packets, while a
single, maximum sized message of 66000 bytes, carried across a 1500 MTU
network consists of 46 packets.

This leads to the following scenario (among others leading to the same
situation):

1: Msg 1 of 46 packets is sent. 20 packets go to the transmit queue, 26
   packets to the backlog queue.
2: Msg 2 of 46 packets is attempted sent, but rejected because there is
   no more space in the backlog queue at this level. The sender is added
   to the wakeup queue with a "pending packets chain size" number of 46.
3: Some packets in the transmit queue are acked and released. We try to
   wake up the sender, but the pending size of 46 is bigger than the LOW
   wakeup limit of 30, so this doesn't happen.
5: Subsequent acks releases all the remaining buffers. Each time we test
   for the wakeup criteria and find that 46 still is larger than 30,
   even after both the transmit and the backlog queues are empty.
6: The sender is never woken up and given a chance to send its message.
   He is stuck.

We could now loosen the wakeup criteria (used by link_prepare_wakeup())
to become equal to the send criteria (used by tipc_link_xmit()), i.e.,
by ignoring the "pending packets chain size" value altogether, or we can
just increase the queue limits so that the criteria can be satisfied
anyway. There are good reasons (potentially multiple waiting senders) to
not opt for the former solution, so we choose the latter one.

This commit fixes the problem by giving the broadcast link window a
default value of 50 packets. We also introduce a new minimum link
window size BCLINK_MIN_WIN of 32, which is enough to always avoid the
described situation. Finally, in order to not break any existing users
which may set the window explicitly, we enforce that the window is set
to the new minimum value in case the user is trying to set it to
anything lower.

Fixes: 7845989cb4 ("net: tipc: fix stall during bclink wakeup procedure")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:02:17 -07:00
Johan Hedberg 17bc08f0d1 Bluetooth: Remove unnecessary hci_explicit_connect_lookup function
There's only one user of this helper which can be replaces with a call
to hci_pend_le_action_lookup() and a check for params->explicit_connect.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:58:23 +02:00
Johan Hedberg 1ede9868f6 Bluetooth: Remove redundant (and possibly wrong) flag clearing
There's no need to clear the HCI_CONN_ENCRYPT_PEND flag in
smp_failure. In fact, this may cause the encryption tracking to get
out of sync as this has nothing to do with HCI activity.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:57:03 +02:00
Johan Hedberg b5c2b6214c Bluetooth: Add hdev helper variable to hci_le_create_connection_cancel
The hci_le_create_connection_cancel() function needs to use the hdev
pointer in many places so add a variable for it to avoid the need to
dereference the hci_conn every time.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:45:43 +02:00
Johan Hedberg ec182f0397 Bluetooth: Remove unnecessary indentation in unpair_device()
Instead of doing all of the LE-specific handling in an else-branch in
unpair_device() create a 'done' label for the BR/EDR branch to jump to
and then remove the else-branch completely.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:40:21 +02:00
Johan Hedberg f5ad4ffceb Bluetooth: 6lowpan: Use hci_conn_hash_lookup_le() when possible
Use the new hci_conn_hash_lookup_le() API to look up LE connections.
This way we're guaranteed exact matches that also take into account
the address type.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:39:16 +02:00
Johan Hedberg 9d4c1cc15b Bluetooth: Use hci_conn_hash_lookup_le() when possible
Use the new hci_conn_hash_lookup_le() API to look up LE connections.
This way we're guaranteed exact matches that also take into account
the address type.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:38:22 +02:00
Johan Hedberg 1b51c7b6e8 Bluetooth: Add hci_conn_hash_lookup_le() helper function
Many of the existing LE connection lookups are forced to use
hci_conn_hash_lookup_ba() which doesn't take into account the address
type. What's worse, most of the users don't bother checking that the
returned address type matches what was wanted.

This patch adds a new helper API to look up LE connections based on
their address and address type, paving the way to have the
hci_conn_hash_lookup_ba() users converted to do more precise lookups.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:36:39 +02:00
Johan Hedberg 85813a7ec7 Bluetooth: Add le_addr_type() helper function
The mgmt code needs to convert from mgmt/L2CAP address types to HCI in
many places. Having a dedicated helper function for this simplifies
code by shortening it and removing unnecessary 'addr_type' variables.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:35:00 +02:00
Elad Raz 6ac311ae8b Adding switchdev ageing notification on port bridged
Configure ageing time to the HW for newly bridged device

CC: Scott Feldman <sfeldma@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Elad Raz <eladr@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:50:57 -07:00
Dan Carpenter 50010c2059 irda: precedence bug in irlmp_seq_hb_idx()
This is decrementing the pointer, instead of the value stored in the
pointer.  KASan detects it as an out of bounds reference.

Reported-by: "Berry Cheng 程君(成淼)" <chengmiao.cj@alibaba-inc.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:48:26 -07:00
Joe Jin ca88ea1247 xen-netfront: update num_queues to real created
Sometimes xennet_create_queues() may failed to created all requested
queues, we need to update num_queues to real created to avoid NULL
pointer dereference.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:45:39 -07:00
Gao feng f6a835bb04 vsock: fix missing cleanup when misc_register failed
reset transport and unlock if misc_register failed.

Signed-off-by: Gao feng <omarapazanadi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:41:06 -07:00
David S. Miller 10be15ff6c Merge branch 'mv643xx-fixes'
Philipp Kirchhofer says:

====================
net: mv643xx_eth: TSO TX data corruption fixes

as previously discussed [1] the mv643xx_eth driver has some
issues with data corruption when using TCP segmentation offload (TSO).

The following patch set improves this situation by fixing two data
corruption bugs in the TSO TX path.

Before applying the patches repeatedly accessing large files located on
a SMB share on my NSA325 NAS with TSO enabled resulted in different
hash sums, which confirmed that data corruption is happening during
file transfer. After applying the patches the hash sums were the same.

As this is my first patch submission please feel free to point out any
issues with the patch set.

[1] http://thread.gmane.org/gmane.linux.network/336530
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:36:51 -07:00
Philipp Kirchhofer 968200f322 net: mv643xx_eth: Defer writing the first TX descriptor when using TSO
To prevent a race between the TX DMA engine and the CPU the writing of the
first transmit descriptor must be deferred until all following descriptors
have been updated. The network card may otherwise start transmitting before
all packet descriptors are set up correctly, which leads to data corruption
or an aborted transmit operation.

This deferral is already done in the non-TSO TX path, implement it also in
the TSO TX path.

Signed-off-by: Philipp Kirchhofer <philipp@familie-kirchhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:36:41 -07:00
Philipp Kirchhofer 91986fd3d3 net: mv643xx_eth: Ensure proper data alignment in TSO TX path
The TX DMA engine requires that buffers with a size of 8 bytes or smaller
must be 64 bit aligned. This requirement may be violated when doing TSO,
as in this case larger skb frags can be broken up and transmitted in small
parts with then inappropriate alignment.

Fix this by checking for proper alignment before handing a buffer to the
DMA engine. If the data is misaligned realign it by copying it into the
TSO header data area.

Signed-off-by: Philipp Kirchhofer <philipp@familie-kirchhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:36:38 -07:00
David S. Miller eb9fae328f Merge branch 'tcp-rack'
Yuchung Cheng says:

====================
RACK loss detection

RACK (Recent ACK) loss recovery uses the notion of time instead of
packet sequence (FACK) or counts (dupthresh).

It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
limited transmit (new data packet) is sacked in recovery, then any
retransmission sent before that newly sacked packet was sent must have
been lost, since at least one round trip time has elapsed.

But that existing heuristic from tcp_mark_lost_retrans()
has several limitations:
  1) it can't detect tail drops since it depends on limited transmit
  2) it's disabled upon reordering (assumes no reordering)
  3) it's only enabled in fast recovery but not timeout recovery

RACK addresses these limitations with a core idea: an unacknowledged
packet P1 is deemed lost if a packet P2 that was sent later is is
s/acked, since at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when a later retransmission is
s/acked, while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering, similar to the delayed
Early Retransmit.

In the current patch set RACK is only a supplemental loss detection
and does not trigger fast recovery. However we are developing RACK
to replace or consolidate FACK/dupthresh, early retransmit, and
thin-dupack. These heuristics all implicitly bear the time notion.
For example, the delayed Early Retransmit is simply applying RACK
to trigger the fast recovery with small inflight.

RACK requires measuring the minimum RTT. Tracking a global min is less
robust due to traffic engineering pathing changes. Therefore it uses a
windowed filter by Kathleen Nichols. The min RTT can also be useful
for various other purposes like congestion control or stat monitoring.

This patch has been used on Google servers for well over 1 year. RACK
has also been implemented in the QUIC protocol. We are submitting an
IETF draft as well.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:59 -07:00
Yuchung Cheng 4f41b1c58a tcp: use RACK to detect losses
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:53 -07:00
Yuchung Cheng 659a8ad56f tcp: track the packet timings in RACK
This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:48 -07:00
Yuchung Cheng 625a5e109a tcp: skb_mstamp_after helper
a helper to prepare the first main RACK patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:46 -07:00
Yuchung Cheng 77c631273d tcp: add tcp_tsopt_ecr_before helper
a helper to prepare the main RACK patch

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:45 -07:00
Yuchung Cheng af82f4e848 tcp: remove tcp_mark_lost_retrans()
Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading the ack_seq field of
the skb control block.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:44 -07:00
Yuchung Cheng f672258391 tcp: track min RTT using windowed min-filter
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:43 -07:00
Yuchung Cheng 9e45a3e36b tcp: apply Kern's check on RTTs used for congestion control
Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.

Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:41 -07:00
David S. Miller f3c9f95056 Merge branch 'smsc-energy-detect'
Heiko Schocher says:

====================
net, phy, smsc: add posibility to disable energy detect mode

On some boards the energy enable detect mode leads in
trouble with some switches, so make the enabling of
this mode configurable through DT.
Therefore the property "smsc,disable-energy-detect" is
introduced.

Patch 1 introduces phy-handle support for the ti,cpsw
driver. This is needed now for the smsc phy.

Patch 2 adds the disable energy mode functionality
to the smsc phy

Changes in v2:
- add comments from Florian Fainelli
  - I did not change disable property name into enable
    because I fear to break existing behaviour
  - add smsc vendor prefix
  - remove CONFIG_OF and use __maybe_unused
  - introduce "phy-handle" ability into ti,cpsw
    driver, so I can remove bogus:
      if (!of_node && dev->parent->of_node)
          of_node = dev->parent->of_node;
    construct. Therefore new patch for the ti,cpsw
    driver is necessary.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 06:41:51 -07:00
Heiko Schocher d88ecb373b net: phy: smsc: disable energy detect mode
On some boards the energy enable detect mode leads in
trouble with some switches, so make the enabling of
this mode configurable through DT.

Signed-off-by: Heiko Schocher <hs@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 06:41:44 -07:00
Heiko Schocher 9e42f71526 drivers: net: cpsw: add phy-handle parsing
add the ability to parse "phy-handle". This
is needed for phys, which have a DT node, and
need to parse DT properties.

Signed-off-by: Heiko Schocher <hs@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 06:41:42 -07:00
Simon Arlott aebd99477f bcm63xx_enet: check 1000BASE-T advertisement configuration
If a gigabit ethernet PHY is connected to a fast ethernet MAC,
then it can detect 1000 support from the partner but not use it.

This results in a forced speed of 1000 and RX/TX failure.

Check for 1000BASE-T support and then check the advertisement
configuration before setting the MAC speed to 1000mbit.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 06:36:38 -07:00
David S. Miller c8fdc32491 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-10-19

This series contains updates to i40e and i40evf only.

Kiran adds a spinlock around code accessing VSI MAC filter list to
ensure that we are synchronizing access to the filter list, otherwise
we can end up with multiple accesses at the same time which can cause
the VSI MAC filter list to get in an unstable or corrupted state.

Jesse fixes overlong BIT defines, where the RSS enabling call were
mistakenly missed.  Also fixes a bug where the enable function was
enabling the interrupt twice while trying to update the two interrupt
throttle rate thresholds for Rx and Tx, while refactoring the IRQ
enable function to simplify reading the flow.  Addressed the high
CPU utilization of some small streaming workloads that the driver should
reduce CPU in.

Anjali fixes two X722 issues with respect to EEPROM checksum verify and
reading NVM version info.  Fixed where a mask value was accidentally
replaced with a bit mask causing Flow Director sideband to be broken.

Alex Duyck fixes areas of the drivers which run from hard interrupt
context or with interrupts already disabled in netpoll, so use
napi_schedule_irqoff() instead of napi_schedule().

Mitch fixes the VF drivers to not easily give up when it is not able
to communicate with the PF driver.

Carolyn fixes a problem where our tools MAC loopback test, after driver
unbind would fail because the hardware was configured for multiqueue and
unbind operation did not clear this configuration.  Also fixed a issue
where the NVMUpdate tool gets bad data from the PHY when using the PHY
NVM feature because of contention on the MDIO interface from getting
PHY capability calls from the driver during regular operations.

Catherine fixed an issue where we were checking if autoneg was allowed
to change before checking if autoneg was changing, these checks need to
be in the reverse order.

Jean Sacren fixes up an function header comment to align the kernel-docs
with the actual code.

v2: Cleaned up the use of spin_is_locked() in patch 1 based on feedback
    from David Miller, since it always evaluates to zero on uni-processor
    builds
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 06:29:56 -07:00
Johan Hedberg 8ce783dc5e Bluetooth: Fix missing hdev locking for LE scan cleanup
The hci_conn objects don't have a dedicated lock themselves but rely
on the caller to hold the hci_dev lock for most types of access. The
hci_conn_timeout() function has so far sent certain HCI commands based
on the hci_conn state which has been possible without holding the
hci_dev lock.

The recent changes to do LE scanning before connect attempts added
even more operations to hci_conn and hci_dev from hci_conn_timeout,
thereby exposing potential race conditions with the hci_dev and
hci_conn states.

As an example of such a race, here there's a timeout but an
l2cap_sock_connect() call manages to race with the cleanup routine:

[Oct21 08:14] l2cap_chan_timeout: chan ee4b12c0 state BT_CONNECT
[  +0.000004] l2cap_chan_close: chan ee4b12c0 state BT_CONNECT
[  +0.000002] l2cap_chan_del: chan ee4b12c0, conn f3141580, err 111, state BT_CONNECT
[  +0.000002] l2cap_sock_teardown_cb: chan ee4b12c0 state BT_CONNECT
[  +0.000005] l2cap_chan_put: chan ee4b12c0 orig refcnt 4
[  +0.000010] hci_conn_drop: hcon f53d56e0 orig refcnt 1
[  +0.000013] l2cap_chan_put: chan ee4b12c0 orig refcnt 3
[  +0.000063] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
[  +0.000049] hci_conn_params_del: addr ee:0d:30:09:53:1f (type 1)
[  +0.000002] hci_chan_list_flush: hcon f53d56e0
[  +0.000001] hci_chan_del: hci0 hcon f53d56e0 chan f4e7ccc0
[  +0.004528] l2cap_sock_create: sock e708fc00
[  +0.000023] l2cap_chan_create: chan ee4b1770
[  +0.000001] l2cap_chan_hold: chan ee4b1770 orig refcnt 1
[  +0.000002] l2cap_sock_init: sk ee4b3390
[  +0.000029] l2cap_sock_bind: sk ee4b3390
[  +0.000010] l2cap_sock_setsockopt: sk ee4b3390
[  +0.000037] l2cap_sock_connect: sk ee4b3390
[  +0.000002] l2cap_chan_connect: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f (type 2) psm 0x00
[  +0.000002] hci_get_route: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f
[  +0.000001] hci_dev_hold: hci0 orig refcnt 8
[  +0.000003] hci_conn_hold: hcon f53d56e0 orig refcnt 0

Above the l2cap_chan_connect() shouldn't have been able to reach the
hci_conn f53d56e0 anymore but since hci_conn_timeout didn't do proper
locking that's not the case. The end result is a reference to hci_conn
that's not in the conn_hash list, resulting in list corruption when
trying to remove it later:

[Oct21 08:15] l2cap_chan_timeout: chan ee4b1770 state BT_CONNECT
[  +0.000004] l2cap_chan_close: chan ee4b1770 state BT_CONNECT
[  +0.000003] l2cap_chan_del: chan ee4b1770, conn f3141580, err 111, state BT_CONNECT
[  +0.000001] l2cap_sock_teardown_cb: chan ee4b1770 state BT_CONNECT
[  +0.000005] l2cap_chan_put: chan ee4b1770 orig refcnt 4
[  +0.000002] hci_conn_drop: hcon f53d56e0 orig refcnt 1
[  +0.000015] l2cap_chan_put: chan ee4b1770 orig refcnt 3
[  +0.000038] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
[  +0.000003] hci_chan_list_flush: hcon f53d56e0
[  +0.000002] hci_conn_hash_del: hci0 hcon f53d56e0
[  +0.000001] ------------[ cut here ]------------
[  +0.000461] WARNING: CPU: 0 PID: 1782 at lib/list_debug.c:56 __list_del_entry+0x3f/0x71()
[  +0.000839] list_del corruption, f53d56e0->prev is LIST_POISON2 (00000200)

The necessary fix is unfortunately more complicated than just adding
hci_dev_lock/unlock calls to the hci_conn_timeout() call path.
Particularly, the hci_conn_del() API, which expects the hci_dev lock to
be held, performs a cancel_delayed_work_sync(&hcon->disc_work) which
would lead to a deadlock if the hci_conn_timeout() call path tries to
acquire the same lock.

This patch solves the problem by deferring the cleanup work to a
separate work callback. To protect against the hci_dev or hci_conn
going away meanwhile temporary references are taken with the help of
hci_dev_hold() and hci_conn_get().

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.3
2015-10-21 14:25:34 +02:00
Johannes Berg e5a9f8d046 mac80211: move station statistics into sub-structs
Group station statistics by where they're (mostly) updated
(TX, RX and TX-status) and group them into sub-structs of
the struct sta_info.

Also rename the variables since the grouping now makes it
obvious where they belong.

This makes it easier to identify where the statistics are
updated in the code, and thus easier to think about them.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21 10:08:22 +02:00
Johannes Berg 976bd9efda mac80211: move beacon_loss_count into ifmgd
There's little point in keeping (and even sending to userspace)
the beacon_loss_count value per station, since it can only apply
to the AP on a managed-mode connection. Move the value to ifmgd,
advertise it only in managed mode, and remove it from ethtool as
it's available through better interfaces.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21 10:08:21 +02:00
Johannes Berg 763aa27a29 mac80211: remove sta->last_ack_signal
This file only feeds a debugfs file that isn't very useful, so remove
it. If necessary, we can add other ways to get this information, for
example in the NL80211_CMD_PROBE_CLIENT response.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21 10:08:21 +02:00
Marcel Holtmann 213445b2b4 Bluetooth: btintel: Enable extra Intel vendor events
The Intel Bluetooth controllers can emit extra vendor specific events in
error conditions or for debugging purposes. To make the life easier for
engineers, enable them by default. When the vendor_diag options has been
enabled, then additional debug events are also enabled.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 07:34:11 +03:00
Marcel Holtmann e4c534bbac Bluetooth: btusb: Set manufacturer for Intel bootloader devices
For Intel bootloader devices, set the manufacturer information so that
it becomes possible to decode the boot process.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 07:32:12 +03:00
Marcel Holtmann 98a63aaf24 Bluetooth: Introduce driver specific post init callback
Some drivers might have to restore certain settings after the init
procedure has been completed. This driver callback allows them to hook
into that stage. This callback is run just before the controller is
declared as powered up.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 07:30:53 +03:00
Marcel Holtmann aee61f7aa8 Bluetooth: hci_uart: Provide initial manufacturer information
Provide an early indication about the manufacturer information so that
it can be forwarded into monitor channel.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 07:20:44 +03:00
Dean Jenkins 9f7378a9d6 Bluetooth: l2cap_disconnection_req priority over shutdown
There is a L2CAP protocol race between the local peer and
the remote peer demanding disconnection of the L2CAP link.

When L2CAP ERTM is used, l2cap_sock_shutdown() can be called
from userland to disconnect L2CAP. However, there can be a
delay introduced by waiting for ACKs. During this waiting
period, the remote peer may have sent a Disconnection Request.
Therefore, recheck the shutdown status of the socket
after waiting for ACKs because there is no need to do
further processing if the connection has gone.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Harish Jenny K N <harish_kandiga@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:26 +02:00