Commit graph

40139 commits

Author SHA1 Message Date
Shraddha Barke
70cf052d0c libceph: remove con argument in handle_reply()
Since handle_reply() does not use its con argument, remove it.

Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:36:47 +01:00
Trond Myklebust
ac3c860c75 NFS: NFSoRDMA Client Side Changes
In addition to a variety of bugfixes, these patches are mostly geared at
 enabling both swap and backchannel support to the NFS over RDMA client.
 
 Signed-off-by: Anna Schumake <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWN9tvAAoJENfLVL+wpUDrurkP/0exWvxZb0yAxOlquyh4tmUA
 ZO2rd+aap9iyaOPYGcWGd38x3WuvoecuaT/Eu+wRGkH89sF1LMSA+GUD7Ua/Ii7r
 5spQP6tVRVswr+cK53H3fbEpQE7NTuBJB4RjivmddmduMPy678FcMSg4wfMqGwmw
 bFuCG70bYkEboIe+jiqNOzy6+Dkkn6h4pLg8S89jGj4XeV7JF9l7Cr0OfxZVWxme
 YX1y9lyIMB/dKsD8o2TjhfeSQ1TtmWDS1rw7MurIF/pIlmvTfAoivZFfflrAbOC6
 vx/wWsswLKZPJ72QrXfnRErEI+8nea5mvBvgW2xQh1GywWQI5kzdvG3lVMmvjX3I
 g5X/e6oDaPAtBXuzundQP7vE3yYTGGH+C0rBoFRHR5ThuRZyNqQY0VphQ/nz+B6b
 m5loQaxKy+qDdNH0sTwaY3KUNoP4LHzMF+15g2nVIjKLZlG+7Yx8yJwhkKx4XXzn
 t8opIcLSNb6ehlQ/Vw3smhjc6NAXecg0jEeGkL1MV0Cqpk+Uyf1JFNyDL/nJkeI+
 3zlmVDIIbPCHz7gmqhlXCN6Ql6QttgGyt5mgW0f6Q1N0Miqix6DCywu9aaprLZPJ
 O+MOZaNa/6F0KSZpPTwqZ5i7nxrBu48r8OK0HDU7FOdJ1CZXd7y7TXrXnBVco4uu
 AXVsLy/tnjAlqOy07ibB
 =Ush5
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.4-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Side Changes

In addition to a variety of bugfixes, these patches are mostly geared at
enabling both swap and backchannel support to the NFS over RDMA client.

Signed-off-by: Anna Schumake <Anna.Schumaker@Netapp.com>
2015-11-02 17:09:24 -05:00
Matthias Schiffer
ec13ad1d70 ipv6: fix crash on ICMPv6 redirects with prohibited/blackholed source
There are other error values besides ip6_null_entry that can be returned by
ip6_route_redirect(): fib6_rule_action() can also result in
ip6_blk_hole_entry and ip6_prohibit_entry if such ip rules are installed.

Only checking for ip6_null_entry in rt6_do_redirect() causes ip6_ins_rt()
to be called with rt->rt6i_table == NULL in these cases, making the kernel
crash.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 16:30:15 -05:00
Chuck Lever
76566773a1 NFS: Enable client side NFSv4.1 backchannel to use other transports
Forechannel transports get their own "bc_up" method to create an
endpoint for the backchannel service.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Anna Schumaker: Add forward declaration of struct net to xprt.h]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 16:29:13 -05:00
Eric Dumazet
9e17f8a475 net: make skb_set_owner_w() more robust
skb_set_owner_w() is called from various places that assume
skb->sk always point to a full blown socket (as it changes
sk->sk_wmem_alloc)

We'd like to attach skb to request sockets, and in the future
to timewait sockets as well. For these kind of pseudo sockets,
we need to take a traditional refcount and use sock_edemux()
as the destructor.

It is now time to un-inline skb_set_owner_w(), being too big.

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 16:28:49 -05:00
Ido Schimmel
eca1e006cf bridge: vlan: Use rcu_dereference instead of rtnl_dereference
br_should_learn() is protected by RCU and not by RTNL, so use correct
flavor of nbp_vlan_group().

Fixes: 907b1e6e83 ("bridge: vlan: use proper rcu for the vlgrp
member")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 16:27:39 -05:00
Ani Sinha
44f49dd8b5 ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context.
Fixes the following kernel BUG :

BUG: using __this_cpu_add() in preemptible [00000000] code: bash/2758
caller is __this_cpu_preempt_check+0x13/0x15
CPU: 0 PID: 2758 Comm: bash Tainted: P           O   3.18.19 #2
 ffffffff8170eaca ffff880110d1b788 ffffffff81482b2a 0000000000000000
 0000000000000000 ffff880110d1b7b8 ffffffff812010ae ffff880007cab800
 ffff88001a060800 ffff88013a899108 ffff880108b84240 ffff880110d1b7c8
Call Trace:
[<ffffffff81482b2a>] dump_stack+0x52/0x80
[<ffffffff812010ae>] check_preemption_disabled+0xce/0xe1
[<ffffffff812010d4>] __this_cpu_preempt_check+0x13/0x15
[<ffffffff81419d60>] ipmr_queue_xmit+0x647/0x70c
[<ffffffff8141a154>] ip_mr_forward+0x32f/0x34e
[<ffffffff8141af76>] ip_mroute_setsockopt+0xe03/0x108c
[<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
[<ffffffff810e6974>] ? pollwake+0x4d/0x51
[<ffffffff81058ac0>] ? default_wake_function+0x0/0xf
[<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
[<ffffffff810613d9>] ? __wake_up_common+0x45/0x77
[<ffffffff81486ea9>] ? _raw_spin_unlock_irqrestore+0x1d/0x32
[<ffffffff810618bc>] ? __wake_up_sync_key+0x4a/0x53
[<ffffffff8139a519>] ? sock_def_readable+0x71/0x75
[<ffffffff813dd226>] do_ip_setsockopt+0x9d/0xb55
[<ffffffff81429818>] ? unix_seqpacket_sendmsg+0x3f/0x41
[<ffffffff813963fe>] ? sock_sendmsg+0x6d/0x86
[<ffffffff813959d4>] ? sockfd_lookup_light+0x12/0x5d
[<ffffffff8139650a>] ? SyS_sendto+0xf3/0x11b
[<ffffffff810d5738>] ? new_sync_read+0x82/0xaa
[<ffffffff813ddd19>] compat_ip_setsockopt+0x3b/0x99
[<ffffffff813fb24a>] compat_raw_setsockopt+0x11/0x32
[<ffffffff81399052>] compat_sock_common_setsockopt+0x18/0x1f
[<ffffffff813c4d05>] compat_SyS_setsockopt+0x1a9/0x1cf
[<ffffffff813c4149>] compat_SyS_socketcall+0x180/0x1e3
[<ffffffff81488ea1>] cstar_dispatch+0x7/0x1e

Signed-off-by: Ani Sinha <ani@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:57:12 -05:00
Ido Schimmel
ddd611d3ff bridge: vlan: Use correct flag name in comment
The flag used to indicate if a VLAN should be used for filtering - as
opposed to context only - on the bridge itself (e.g. br0) is called
'brentry' and not 'brvlan'.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:40:11 -05:00
Ido Schimmel
07bc588fc1 bridge: vlan: Prevent possible use-after-free
When adding a port to a bridge we initialize VLAN filtering on it. We do
not bail out in case an error occurred in nbp_vlan_init, as it can be
used as a non VLAN filtering bridge.

However, if VLAN filtering is required and an error occurred in
nbp_vlan_init, we should set vlgrp to NULL, so that VLAN filtering
functions (e.g. br_vlan_find, br_get_pvid) will know the struct is
invalid and will not try to access it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:40:10 -05:00
Eric Dumazet
ce1050089c tcp/dccp: fix ireq->pktopts race
IPv6 request sockets store a pointer to skb containing the SYN packet
to be able to transfer it to full blown socket when 3WHS is done
(ireq->pktopts -> np->pktoptions)

As explained in commit 5e0724d027 ("tcp/dccp: fix hashdance race for
passive sessions"), we must transfer the skb only if we won the
hashdance race, if multiple cpus receive the 'ack' packet completing
3WHS at the same time.

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:38:26 -05:00
santosh.shilimkar@oracle.com
7b5654349e RDS: convert bind hash table to re-sizable hashtable
To further improve the RDS connection scalabilty on massive systems
where number of sockets grows into tens of thousands  of sockets, there
is a need of larger bind hashtable. Pre-allocated 8K or 16K table is
not very flexible in terms of memory utilisation. The rhashtable
infrastructure gives us the flexibility to grow the hashtbable based
on use and also comes up with inbuilt efficient bucket(chain) handling.

Reviewed-by: David Miller <davem@davemloft.net>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:36:23 -05:00
Saurabh Sengar
d3ffaefa1b net: rds: changing the return type from int to void
as result of function rds_iw_flush_mr_pool is nowhere checked,
changing its return type from int to void.
also removing the unused variable rc as there is nothing to return

Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:35:19 -05:00
Paolo Abeni
9920e48b83 ipv4: use l4 hash for locally generated multipath flows
This patch changes how the multipath hash is computed for locally
generated flows: now the hash comprises l4 information.

This allows better utilization of the available paths when the existing
flows have the same source IP and the same destination IP: with l3 hash,
even when multiple connections are in place simultaneously, a single path
will be used, while with l4 hash we can use all the available paths.

v2 changes:
- use get_hash_from_flowi4() instead of implementing just another l4 hash
  function

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 14:38:43 -05:00
Chuck Lever
0f2e3bdab6 SUNRPC: Remove the TCP-only restriction in bc_svc_process()
Allow the use of other transport classes when handling a backward
direction RPC call.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
9468431962 svcrdma: Add backward direction service for RPC/RDMA transport
On NFSv4.1 mount points, the Linux NFS client uses this transport
endpoint to receive backward direction calls and route replies back
to the NFSv4.1 server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
63cae47005 xprtrdma: Handle incoming backward direction RPC calls
Introduce a code path in the rpcrdma_reply_handler() to catch
incoming backward direction RPC calls and route them to the ULP's
backchannel server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
83128a60ca xprtrdma: Add support for sending backward direction RPC replies
Backward direction RPC replies are sent via the client transport's
send_request method, the same way forward direction RPC calls are
sent.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
124fa17d3e xprtrdma: Pre-allocate Work Requests for backchannel
Pre-allocate extra send and receive Work Requests needed to handle
backchannel receives and sends.

The transport doesn't know how many extra WRs to pre-allocate until
the xprt_setup_backchannel() call, but that's long after the WRs are
allocated during forechannel setup.

So, use a fixed value for now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
f531a5dbc4 xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers
xprtrdma's backward direction send and receive buffers are the same
size as the forechannel's inline threshold, and must be pre-
registered.

The consumer has no control over which receive buffer the adapter
chooses to catch an incoming backwards-direction call. Any receive
buffer can be used for either a forward reply or a backward call.
Thus both types of RPC message must all be the same size.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
42e5c3e272 SUNRPC: Abstract backchannel operations
xprt_{setup,destroy}_backchannel() won't be adequate for RPC/RMDA
bi-direction. In particular, receive buffers have to be pre-
registered and posted in order to receive incoming backchannel
requests.

Add a virtual function call to allow the insertion of appropriate
backchannel setup and destruction methods for each transport.

In addition, freeing a backchannel request is a little different
for RPC/RDMA. Introduce an rpc_xprt_op to handle the difference.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
a5b027e189 xprtrdma: Saving IRQs no longer needed for rb_lock
Now that RPC replies are processed in a workqueue, there's no need
to disable IRQs when managing send and receive buffers. This saves
noticeable overhead per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
2da9ab3008 xprtrdma: Remove reply tasklet
Clean up: The reply tasklet is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
fe97b47cd6 xprtrdma: Use workqueue to process RPC/RDMA replies
The reply tasklet is fast, but it's single threaded. After reply
traffic saturates a single CPU, there's no more reply processing
capacity.

Replace the tasklet with a workqueue to spread reply handling across
all CPUs.  This also moves RPC/RDMA reply handling out of the soft
IRQ context and into a context that allows sleeps.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
1e465fd4ff xprtrdma: Replace send and receive arrays
The rb_send_bufs and rb_recv_bufs arrays are used to implement a
pair of stacks for keeping track of free rpcrdma_req and rpcrdma_rep
structs. Replace those arrays with free lists.

To allow more than 512 RPCs in-flight at once, each of these arrays
would be larger than a page (assuming 8-byte addresses and 4KB
pages). Allowing up to 64K in-flight RPCs (as TCP now does), each
buffer array would have to be 128 pages. That's an order-6
allocation. (Not that we're going there.)

A list is easier to expand dynamically. Instead of allocating a
larger array of pointers and copying the existing pointers to the
new array, simply append more buffers to each list.

This also makes it simpler to manage receive buffers that might
catch backwards-direction calls, or to post receive buffers in
bulk to amortize the overhead of ib_post_recv.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
b0e178a2d8 xprtrdma: Refactor reply handler error handling
Clean up: The error cases in rpcrdma_reply_handler() almost never
execute. Ensure the compiler places them out of the hot path.

No behavior change expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
4220a07264 xprtrdma: Prevent loss of completion signals
Commit 8301a2c047 ("xprtrdma: Limit work done by completion
handler") was supposed to prevent xprtrdma's upcall handlers from
starving other softIRQ work by letting them return to the provider
before all CQEs have been polled.

The logic assumes the provider will call the upcall handler again
immediately if the CQ is re-armed while there are still queued CQEs.

This assumption is invalid. The IBTA spec says that after a CQ is
armed, the hardware must interrupt only when a new CQE is inserted.
xprtrdma can't rely on the provider calling again, even though some
providers do.

Therefore, leaving CQEs on queue makes sense only when there is
another mechanism that ensures all remaining CQEs are consumed in a
timely fashion. xprtrdma does not have such a mechanism. If a CQE
remains queued, the transport can wait forever to send the next RPC.

Finally, move the wcs array back onto the stack to ensure that the
poll array is always local to the CPU where the completion upcall is
running.

Fixes: 8301a2c047 ("xprtrdma: Limit work done by completion ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
7b3d770c67 xprtrdma: Re-arm after missed events
ib_req_notify_cq(IB_CQ_REPORT_MISSED_EVENTS) returns a positive
value if WCs were added to a CQ after the last completion upcall
but before the CQ has been re-armed.

Commit 7f23f6f6e3 ("xprtrmda: Reduce lock contention in
completion handlers") assumed that when ib_req_notify_cq() returned
a positive RC, the CQ had also been successfully re-armed, making
it safe to return control to the provider without losing any
completion signals. That is an invalid assumption.

Change both completion handlers to continue polling while
ib_req_notify_cq() returns a positive value.

Fixes: 7f23f6f6e3 ("xprtrmda: Reduce lock contention in ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever
a045178887 xprtrdma: Enable swap-on-NFS/RDMA
After adding a swapfile on an NFS/RDMA mount and removing the
normal swap partition, I was able to push the NFS client well
into swap without any issue.

I forgot to swapoff the NFS file before rebooting. This pinned
the NFS mount and the IB core and provider, causing shutdown to
hang. I think this is expected and safe behavior. Probably
shutdown scripts should "swapoff -a" before unmounting any
filesystems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Steve Wise
8610586d82 xprtrdma: don't log warnings for flushed completions
Unsignaled send WRs can get flushed as part of normal unmount, so don't
log them as warnings.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Tina Ruchandani
1032a66871 Use 64-bit timekeeping
This patch changes the use of struct timespec in
dccp_probe to use struct timespec64 instead. timespec uses a 32-bit
seconds field which will overflow in the year 2038 and beyond. timespec64
uses a 64-bit seconds field. Note that the correctness of the code isn't
changed, since the original code only uses the timestamps to compute a
small elapsed interval. This patch is part of a larger attempt to remove
instances of 32-bit timekeeping structures (timespec, timeval, time_t)
from the kernel so it is easier to identify where the real 2038 issues
are.

Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 17:01:16 -05:00
Julian Anastasov
c9b3292eeb ipv4: update RTNH_F_LINKDOWN flag on UP event
When nexthop is part of multipath route we should clear the
LINKDOWN flag when link goes UP or when first address is added.
This is needed because we always set LINKDOWN flag when DEAD flag
was set but now on UP the nexthop is not dead anymore. Examples when
LINKDOWN bit can be forgotten when no NETDEV_CHANGE is delivered:

- link goes down (LINKDOWN is set), then link goes UP and device
shows carrier OK but LINKDOWN remains set

- last address is deleted (LINKDOWN is set), then address is
added and device shows carrier OK but LINKDOWN remains set

Steps to reproduce:
modprobe dummy
ifconfig dummy0 192.168.168.1 up

here add a multipath route where one nexthop is for dummy0:

ip route add 1.2.3.4 nexthop dummy0 nexthop SOME_OTHER_DEVICE
ifconfig dummy0 down
ifconfig dummy0 up

now ip route shows nexthop that is not dead. Now set the sysctl var:

echo 1 > /proc/sys/net/ipv4/conf/dummy0/ignore_routes_with_linkdown

now ip route will show a dead nexthop because the forgotten
RTNH_F_LINKDOWN is propagated as RTNH_F_DEAD.

Fixes: 8a3d03166f ("net: track link-status of ipv4 nexthops")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 16:57:39 -05:00
Julian Anastasov
4f823defdd ipv4: fix to not remove local route on link down
When fib_netdev_event calls fib_disable_ip on NETDEV_DOWN event
we should not delete the local routes if the local address
is still present. The confusion comes from the fact that both
fib_netdev_event and fib_inetaddr_event use the NETDEV_DOWN
constant. Fix it by returning back the variable 'force'.

Steps to reproduce:
modprobe dummy
ifconfig dummy0 192.168.168.1 up
ifconfig dummy0 down
ip route list table local | grep dummy | grep host
local 192.168.168.1 dev dummy0  proto kernel  scope host  src 192.168.168.1

Fixes: 8a3d03166f ("net: track link-status of ipv4 nexthops")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 16:57:39 -05:00
Vivien Didelot
76e398a627 net: dsa: use switchdev obj for VLAN add/del ops
Simplify DSA by pushing the switchdev objects for VLAN add and delete
operations down to its drivers. Currently only mv88e6xxx is affected.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 15:56:11 -05:00
Stefan Hajnoczi
ea3803c193 VSOCK: define VSOCK_SS_LISTEN once only
The SS_LISTEN socket state is defined by both af_vsock.c and
vmci_transport.c.  This is risky since the value could be changed in one
file and the other would be out of sync.

Rename from SS_LISTEN to VSOCK_SS_LISTEN since the constant is not part
of enum socket_state (SS_CONNECTED, ...).  This way it is clear that the
constant is vsock-specific.

The big text reflow in af_vsock.c was necessary to keep to the maximum
line length.  Text is unchanged except for s/SS_LISTEN/VSOCK_SS_LISTEN/.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 12:14:47 -05:00
Jon Paul Maloy
5cbb28a4bf tipc: linearize arriving NAME_DISTR and LINK_PROTO buffers
Testing of the new UDP bearer has revealed that reception of
NAME_DISTRIBUTOR, LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE
message buffers is not prepared for the case that those may be
non-linear.

We now linearize all such buffers before they are delivered up to the
generic reception layer.

In order for the commit to apply cleanly to 'net' and 'stable', we do
the change in the function tipc_udp_recv() for now. Later, we will post
a commit to 'net-next' moving the linearization to generic code, in
tipc_named_rcv() and tipc_link_proto_rcv().

Fixes: commit d0f91938be ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 12:04:29 -05:00
Hannes Frederic Sowa
405c92f7a5 ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment
CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one
of those warn about them once and handle them gracefully by recalculating
the checksum.

Fixes: commit 32dce968dd ("ipv6: Allow for partial checksums on non-ufo packets")
See-also: commit 72e843bb09 ("ipv6: ip6_fragment() should check CHECKSUM_PARTIAL")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 12:01:28 -05:00
Hannes Frederic Sowa
682b1a9d3f ipv6: no CHECKSUM_PARTIAL on MSG_MORE corked sockets
We cannot reliable calculate packet size on MSG_MORE corked sockets
and thus cannot decide if they are going to be fragmented later on,
so better not use CHECKSUM_PARTIAL in the first place.

The IPv6 code also intended to protect and not use CHECKSUM_PARTIAL in
the existence of IPv6 extension headers, but the condition was wrong. Fix
it up, too. Also the condition to check whether the packet fits into
one fragment was wrong and has been corrected.

Fixes: commit 32dce968dd ("ipv6: Allow for partial checksums on non-ufo packets")
See-also: commit 72e843bb09 ("ipv6: ip6_fragment() should check CHECKSUM_PARTIAL")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 12:01:27 -05:00
Hannes Frederic Sowa
dbd3393c56 ipv4: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment
CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one
of those warn about them once and handle them gracefully by recalculating
the checksum.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 12:01:27 -05:00
Hannes Frederic Sowa
d749c9cbff ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets
We cannot reliable calculate packet size on MSG_MORE corked sockets
and thus cannot decide if they are going to be fragmented later on,
so better not use CHECKSUM_PARTIAL in the first place.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 12:01:27 -05:00
David S. Miller
b75ec3af27 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-11-01 00:15:30 -04:00
David S. Miller
e7b63ff115 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2015-10-30

1) The flow cache is limited by the flow cache limit which
   depends on the number of cpus and the xfrm garbage collector
   threshold which is independent of the number of cpus. This
   leads to the fact that on systems with more than 16 cpus
   we hit the xfrm garbage collector limit and refuse new
   allocations, so new flows are dropped. On systems with 16
   or less cpus, we hit the flowcache limit. In this case, we
   shrink the flow cache instead of refusing new flows.

   We increase the xfrm garbage collector threshold to INT_MAX
   to get the same behaviour, independent of the number of cpus.

2) Fix some unaligned accesses on sparc systems.
   From Sowmini Varadhan.

3) Fix some header checks in _decode_session4. We may call
   pskb_may_pull with a negative value converted to unsigened
   int from pskb_may_pull. This can lead to incorrect policy
   lookups. We fix this by a check of the data pointer position
   before we call pskb_may_pull.

4) Reload skb header pointers after calling pskb_may_pull
   in _decode_session4 as this may change the pointers into
   the packet.

5) Add a missing statistic counter on inner mode errors.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30 20:51:56 +09:00
Scott Feldman
e258d919b1 switchdev: fix: pass correct obj size when deferring obj add
Fixes: 4d429c5dd ("switchdev: introduce possibility to defer obj_add/del")
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30 20:23:37 +09:00
Scott Feldman
3a7bde55a1 switchdev: fix: erasing too much of vlan obj when handling multiple vlan specs
When adding vlans with multiple IFLA_BRIDGE_VLAN_INFO attrs set in AFSPEC,
we would wipe the vlan obj struct after the first IFLA_BRIDGE_VLAN_INFO.
Fix this by only clearing what's necessary on each IFLA_BRIDGE_VLAN_INFO
iteration.

Fixes: 9e8f4a54 ("switchdev: push object ID back to object structure")
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30 20:23:35 +09:00
David S. Miller
740215ddb5 NFC 4.4 pull request
This is the NFC pull request for 4.4.
 
 It's a bit bigger than usual, the 3 main culprits being:
 
 - A new driver for Intel's Fields Peak NCI chipset. In order to
   support this chipset we had to export a few NCI routines and
   extend the driver NCI ops to not only support proprietary
   commands but also core ones.
 
 - Support for vendor commands for both STM drivers, st-nci
   and st21nfca. Those vendor commands allow to run factory tests
   through the NFC netlink interface.
 
 - New i2c and SPI support for the Marvell driver, together with
   firmware download support for this driver's core.
 
 Besides that we also have:
 
 - A few file renames in the STM drivers, to keep the naming
   consistent between drivers.
 
 - Some improvements and fixes on the NCI HCI layer, mostly to
   properly reach a secure element over a legacy HCI link.
 
 - A few fixes for the s3fwrn5 and trf7970a drivers.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWMGIaAAoJEIqAPN1PVmxKHjYP/3Q3Y4Vhvw3kTfDP3IlnAuH3
 XjBMGKPLu72MmtSk9jFOr5VuC76YtJzwf+4nKGJybu619NPKfxXN7r83bpsZV1Bk
 +0cS1RpjIQh92a0ElvX1muCFhgH7ax7zeqQ+29OSpA33e67/DlUcwxiqzF15cwWC
 Bk0pUv1FxMoNi5ZkG1JrRqrhx/Yqo1dw2HrnMbKVgwLtLzODBuzoGVKfydTo0b1j
 hkl30DPF3AYMxnwIml3tM8zT96b1LtD0Xgs1yF8IdrIJ+6YLn/6tnw1rUxnE8Ovo
 JFtvtS0OKjGTNFr1NhueG0i5td8TR4MAJKHh0Lz9ISIFWtNtsVjFfGuAeWZcQ/n/
 rQS7xOvzBCndNbe8PS9wBiNQAqLAH5/dzvKRwNboRttkpwIrNOgaBYj2LpuRzpfO
 p+ArwBryAorfxQVOIWl4knc59UsiPUKOK61uMTZ1sU7jCEvUNVChIm8EGRlMnpMQ
 ZFlBa2lqNdgz7ubKLofbnWLiCNY6r0E13MSLHZlJZX61IMjs13ojDeKMvitFBe+b
 1hDwbSWxIRB8xKbcsIA9bPUnEc16Syywz/Q4iAsE8Gy6l5J41MhA/q2QaO9WSrPE
 Leah53l5EwQRd55WjJkCkIKZwvCjIerkESfAS0oprELIYXaxs/1PbVl6C7VYZA4K
 A5tYLw2vS+tTK4Mgi/ym
 =aw/h
 -----END PGP SIGNATURE-----

Merge tag 'nfc-next-4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC 4.4 pull request

This is the NFC pull request for 4.4.

It's a bit bigger than usual, the 3 main culprits being:

- A new driver for Intel's Fields Peak NCI chipset. In order to
  support this chipset we had to export a few NCI routines and
  extend the driver NCI ops to not only support proprietary
  commands but also core ones.

- Support for vendor commands for both STM drivers, st-nci
  and st21nfca. Those vendor commands allow to run factory tests
  through the NFC netlink interface.

- New i2c and SPI support for the Marvell driver, together with
  firmware download support for this driver's core.

Besides that we also have:

- A few file renames in the STM drivers, to keep the naming
  consistent between drivers.

- Some improvements and fixes on the NCI HCI layer, mostly to
  properly reach a secure element over a legacy HCI link.

- A few fixes for the s3fwrn5 and trf7970a drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30 20:19:43 +09:00
David S. Miller
5bf8921116 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-10-28

Here are a some more Bluetooth patches for 4.4 which collected up during
the past week. The most important ones are from Kuba Pawlak for fixing
locking issues with SCO sockets. There's also a fix from Alexander Aring
for 6lowpan, a memleak fix from Julia Lawall for the btmrvl driver and
some cleanup patches from Marcel.

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30 19:41:10 +09:00
Alexander Duyck
b7b0b1d290 ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU
This change makes it so that we reinitialize the interface if the MTU is
increased back above IPV6_MIN_MTU and the interface is up.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30 18:11:07 +09:00
Ido Schimmel
741af0053b switchdev: Add support for flood control
Allow devices supporting this feature to control the flooding of unknown
unicast traffic, by making switchdev infrastructure propagate this setting
to the switch driver.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30 12:26:38 +09:00
Roopa Prabhu
b7af1472af bridge: set is_local and is_static before fdb entry is added to the fdb hashtable
Problem Description:
We can add fdbs pointing to the bridge with NULL ->dst but that has a
few race conditions because br_fdb_insert() is used which first creates
the fdb and then, after the fdb has been published/linked, sets
"is_local" to 1 and in that time frame if a packet arrives for that fdb
it may see it as non-local and either do a NULL ptr dereference in
br_forward() or attach the fdb to the port where it arrived, and later
br_fdb_insert() will make it local thus getting a wrong fdb entry.
Call chain br_handle_frame_finish() -> br_forward():
But in br_handle_frame_finish() in order to call br_forward() the dst
should not be local i.e. skb != NULL, whenever the dst is
found to be local skb is set to NULL so we can't forward it,
and here comes the problem since it's running only
with RCU when forwarding packets it can see the entry before "is_local"
is set to 1 and actually try to dereference NULL.
The main issue is that if someone sends a packet to the switch while
it's adding the entry which points to the bridge device, it may
dereference NULL ptr. This is needed now after we can add fdbs
pointing to the bridge.  This poses a problem for
br_fdb_update() as well, while someone's adding a bridge fdb, but
before it has is_local == 1, it might get moved to a port if it comes
as a source mac and then it may get its "is_local" set to 1

This patch changes fdb_create to take is_local and is_static as
arguments to set these values in the fdb entry before it is added to the
hash. Also adds null check for port in br_forward.

Fixes: 3741873b4f ("bridge: allow adding of fdb entries pointing to the bridge device")
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-30 12:13:05 +09:00
Hannes Frederic Sowa
89bc7848a9 ipv6: protect mtu calculation of wrap-around and infinite loop by rounding issues
Raw sockets with hdrincl enabled can insert ipv6 extension headers
right into the data stream. In case we need to fragment those packets,
we reparse the options header to find the place where we can insert
the fragment header. If the extension headers exceed the link's MTU we
actually cannot make progress in such a case.

Instead of ending up in broken arithmetic or rounding towards 0 and
entering an endless loop in ip6_fragment, just prevent those cases by
aborting early and signal -EMSGSIZE to user space.

This is the second version of the patch which doesn't use the
overflow_usub function, which got reverted for now.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-29 07:01:50 -07:00
Hannes Frederic Sowa
1e0d69a9cc Revert "Merge branch 'ipv6-overflow-arith'"
Linus dislikes these changes. To not hold up the net-merge let's revert
it for now and fix the bug like Linus suggested.

This reverts commit ec3661b422, reversing
changes made to c80dbe0461.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-29 07:01:48 -07:00
Sagi Grimberg
9ddc87374a RDS/IW: Convert to new memory registration API
Get rid of fast_reg page list and its construction.
Instead, just pass the RDS sg list to ib_map_mr_sg
and post the new ib_reg_wr.

This is done both for server IW RDMA_READ registration
and the client remote key registration.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 22:27:18 -04:00
Sagi Grimberg
412a15c0fe svcrdma: Port to new memory registration API
Instead of maintaining a fastreg page list, keep an sg table
and convert an array of pages to a sg list. Then call ib_map_mr_sg
and construct ib_reg_wr.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 22:27:18 -04:00
Sagi Grimberg
4143f34e01 xprtrdma: Port to new memory registration API
Instead of maintaining a fastreg page list, keep an sg table
and convert an array of pages to a sg list. Then call ib_map_mr_sg
and construct ib_reg_wr.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Selvin Xavier <selvin.xavier@avagotech.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 22:27:18 -04:00
Doug Ledford
63e8790d39 Merge branch 'wr-cleanup' into k.o/for-4.4 2015-10-28 22:23:34 -04:00
Doug Ledford
eb14ab3ba1 Merge branch 'wr-cleanup' of git://git.infradead.org/users/hch/rdma into wr-cleanup
Signed-off-by: Doug Ledford <dledford@redhat.com>

Conflicts:
	drivers/infiniband/ulp/isert/ib_isert.c - Commit 4366b19ca5
	(iser-target: Change the recv buffers posting logic) changed the
	logic in isert_put_datain() and had to be hand merged
2015-10-28 22:21:09 -04:00
Guy Shapiro
fa20105e09 IB/cma: Add support for network namespaces
Add support for network namespaces in the ib_cma module. This is
accomplished by:

1. Adding network namespace parameter for rdma_create_id. This parameter is
   used to populate the network namespace field in rdma_id_private.
   rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net inside
   of ib_cma, when listening on an ID and when looking for an ID for an
   incoming request.
3. Decrementing the reference count for the appropriate network namespace
   when calling rdma_destroy_id.

In order to preserve the current behavior init_net is passed when calling
from other modules.

Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 12:32:48 -04:00
Robert Dolca
f11631748e NFC: nci: non-static functions can not be inline
This fixes a build error that seems to be toochain
dependent (Not seen with gcc v5.1):

In file included from net/nfc/nci/rsp.c:36:0:
net/nfc/nci/rsp.c: In function ‘nci_rsp_packet’:
include/net/nfc/nci_core.h:355:12: error: inlining failed in call to
always_inline ‘nci_prop_rsp_packet’: function body not available
 inline int nci_prop_rsp_packet(struct nci_dev *ndev, __u16 opcode,

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-28 06:44:45 +01:00
Robert Shearman
cf4b24f002 mpls: reduce memory usage of routes
Nexthops for MPLS routes have a via address field sized for the
largest via address that is expected, which is 32 bytes. This means
that in the most common case of having ipv4 via addresses, 28 bytes of
memory more than required are used per nexthop. In the other common
case of an ipv6 nexthop then 16 bytes more than required are
used. With large numbers of MPLS routes this extra memory usage could
start to become significant.

To avoid allocating memory for a maximum length via address when not
all of it is required and to allow for ease of iterating over
nexthops, then the via addresses are changed to be stored in the same
memory block as the route and nexthops, but in an array after the end
of the array of nexthops. New accessors are provided to retrieve a
pointer to the via address.

To allow for O(1) access without having to store a pointer or offset
per nh, the via address for each nexthop is sized according to the
maximum via address for any nexthop in the route, which is stored in a
new route field, rt_max_alen, but this is in an existing hole in
struct mpls_route so it doesn't increase the size of the
structure. Each via address is ensured to be aligned to VIA_ALEN_ALIGN
to account for architectures that don't allow unaligned accesses.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:52:59 -07:00
Robert Shearman
b4e04fc735 mpls: fix forwarding using v4/v6 explicit null
Fill in the via address length for the predefined IPv4 and IPv6
explicit-null label routes.

Fixes: f8efb73c97 ("mpls: multipath route support")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:52:58 -07:00
Sowmini Varadhan
8ce675ff39 RDS-TCP: Recover correctly from pskb_pull()/pksb_trim() failure in rds_tcp_data_recv
Either of pskb_pull() or pskb_trim() may fail under low memory conditions.
If rds_tcp_data_recv() ignores such failures, the application will
receive corrupted data because the skb has not been correctly
carved to the RDS datagram size.

Avoid this by handling pskb_pull/pskb_trim failure in the same
manner as the skb_clone failure: bail out of rds_tcp_data_recv(), and
retry via the deferred call to rds_send_worker() that gets set up on
ENOMEM from rds_tcp_read_sock()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:46:34 -07:00
Florian Westphal
dbc3617f4c netfilter: nfnetlink: don't probe module if it exists
nfnetlink_bind request_module()s all the time as nfnetlink_get_subsys()
shifts the argument by 8 to obtain the subsys id.

So using type instead of type << 8 always returns NULL.

Fixes: 03292745b0 ("netlink: add nlk->netlink_bind hook for module auto-loading")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-28 03:40:50 +01:00
Hannes Frederic Sowa
080a270f5a sock: don't enable netstamp for af_unix sockets
netstamp_needed is toggled for all socket families if they request
timestamping. But some protocols don't need the lower-layer timestamping
code at all. This patch starts disabling it for af-unix.

E.g. systemd enables timestamping during boot-up on the journald af-unix
sockets, thus causing the system to globally enable timestamping in the
lower networking stack. Still, it is very probable that timestamping
gets activated, by e.g. dhclient or various NTP implementations.

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:39:14 -07:00
Joe Stringer
6f5cadee44 openvswitch: Fix skb leak using IPv6 defrag
nf_ct_frag6_gather() makes a clone of each skb passed to it, and if the
reassembly is successful, expects the caller to free all of the original
skbs using nf_ct_frag6_consume_orig(). This call was previously missing,
meaning that the original fragments were never freed (with the exception
of the last fragment to arrive).

Fix this by ensuring that all original fragments except for the last
fragment are freed via nf_ct_frag6_consume_orig(). The last fragment
will be morphed into the head, so it must not be freed yet. Furthermore,
retain the ->next pointer for the head after skb_morph().

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:32:18 -07:00
Joe Stringer
190b8ffbb7 ipv6: Export nf_ct_frag6_consume_orig()
This is needed in openvswitch to fix an skb leak in the next patch.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:32:17 -07:00
Joe Stringer
74c1661813 openvswitch: Fix double-free on ip_defrag() errors
If ip_defrag() returns an error other than -EINPROGRESS, then the skb is
freed. When handle_fragments() passes this back up to
do_execute_actions(), it will be freed again. Prevent this double free
by never freeing the skb in do_execute_actions() for errors returned by
ovs_ct_execute. Always free it in ovs_ct_execute() error paths instead.

Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:32:14 -07:00
Alexander Duyck
c2229fe143 fib_trie: leaf_walk_rcu should not compute key if key is less than pn->key
We were computing the child index in cases where the key value we were
looking for was actually less than the base key of the tnode.  As a result
we were getting incorrect index values that would cause us to skip over
some children.

To fix this I have added a test that will force us to use child index 0 if
the key we are looking for is less than the key of the current tnode.

Fixes: 8be33e955c ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf")
Reported-by: Brian Rak <brak@gameservers.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 18:14:51 -07:00
Alexander Aring
324e786ee3 bluetooth: 6lowpan: fix NOHZ: local_softirq_pending
Jukka reported about the following warning:

"NOHZ: local_softirq_pending 08"

I remember this warning and we had a similar issue when using workqueues
and calling netif_rx. See commit 5ff3fec ("mac802154: fix NOHZ
local_softirq_pending 08 warning").

This warning occurs when calling "netif_rx" inside the wrong context
(non softirq context). The net core api offers "netif_rx_ni" to call
netif_rx inside the correct softirq context.

Reported-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-27 09:53:36 +01:00
Munehisa Kamata
94f9cd8143 netfilter: nf_nat_redirect: add missing NULL pointer check
Commit 8b13eddfdf ("netfilter: refactor NAT
redirect IPv4 to use it from nf_tables") has introduced a trivial logic
change which can result in the following crash.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
IP: [<ffffffffa033002d>] nf_nat_redirect_ipv4+0x2d/0xa0 [nf_nat_redirect]
PGD 3ba662067 PUD 3ba661067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: ipv6(E) xt_REDIRECT(E) nf_nat_redirect(E) xt_tcpudp(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) ip_tables(E) x_tables(E) binfmt_misc(E) xfs(E) libcrc32c(E) evbug(E) evdev(E) psmouse(E) i2c_piix4(E) i2c_core(E) acpi_cpufreq(E) button(E) ext4(E) crc16(E) jbd2(E) mbcache(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
CPU: 0 PID: 2536 Comm: ip Tainted: G            E   4.1.7-15.23.amzn1.x86_64 #1
Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015
task: ffff8800eb438000 ti: ffff8803ba664000 task.ti: ffff8803ba664000
[...]
Call Trace:
 <IRQ>
 [<ffffffffa0334065>] redirect_tg4+0x15/0x20 [xt_REDIRECT]
 [<ffffffffa02e2e99>] ipt_do_table+0x2b9/0x5e1 [ip_tables]
 [<ffffffffa0328045>] iptable_nat_do_chain+0x25/0x30 [iptable_nat]
 [<ffffffffa031777d>] nf_nat_ipv4_fn+0x13d/0x1f0 [nf_nat_ipv4]
 [<ffffffffa0328020>] ? iptable_nat_ipv4_fn+0x20/0x20 [iptable_nat]
 [<ffffffffa031785e>] nf_nat_ipv4_in+0x2e/0x90 [nf_nat_ipv4]
 [<ffffffffa03280a5>] iptable_nat_ipv4_in+0x15/0x20 [iptable_nat]
 [<ffffffff81449137>] nf_iterate+0x57/0x80
 [<ffffffff814491f7>] nf_hook_slow+0x97/0x100
 [<ffffffff814504d4>] ip_rcv+0x314/0x400

unsigned int
nf_nat_redirect_ipv4(struct sk_buff *skb,
...
{
...
		rcu_read_lock();
		indev = __in_dev_get_rcu(skb->dev);
		if (indev != NULL) {
			ifa = indev->ifa_list;
			newdst = ifa->ifa_local; <---
		}
		rcu_read_unlock();
...
}

Before the commit, 'ifa' had been always checked before access. After the
commit, however, it could be accessed even if it's NULL. Interestingly,
this was once fixed in 2003.

http://marc.info/?l=netfilter-devel&m=106668497403047&w=2

In addition to the original one, we have seen the crash when packets that
need to be redirected somehow arrive on an interface which hasn't been
yet fully configured.

This change just reverts the logic to the old behavior to avoid the crash.

Fixes: 8b13eddfdf ("netfilter: refactor NAT redirect IPv4 to use it from nf_tables")
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-27 06:54:56 +01:00
emmanuel.grumbach@intel.com
8941faa161 net: tso: add support for IPv6
Adding IPv6 for the TSO helper API is trivial:
* Don't play with the id (which doesn't exist in IPv6)
* Correctly update the payload_len (don't include the
  length of the IP header itself)

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-26 22:24:22 -07:00
Eric Dumazet
7e3b6e7423 ipv6: gre: support SIT encapsulation
gre_gso_segment() chokes if SIT frames were aggregated by GRO engine.

Fixes: 61c1db7fae ("ipv6: sit: add GSO/TSO support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-26 22:01:18 -07:00
Kuba Pawlak
2c501cdd68 Bluetooth: Fix crash on fast disconnect of SCO
Fix a crash that may happen when a connection is closed before it was fully
established. Mapping conn->hcon was released by shutdown function, but it
is still referenced in (not yet finished) connection established handling
function.

[ 4635.254073] BUG: unable to handle kernel NULL pointer dereference at 00000013
[ 4635.262058] IP: [<c11659f0>] memcmp+0xe/0x25
[ 4635.266835] *pdpt = 0000000024190001 *pde = 0000000000000000
[ 4635.273261] Oops: 0000 [#1] PREEMPT SMP
[ 4635.277652] Modules linked in: evdev ecb vfat fat libcomposite usb2380 isofs zlib_inflate rfcomm(O) udc_core bnep(O) btusb(O) btbcm(O) btintel(O) bluetooth(O) cdc_acm arc4 uinput hid_mule
[ 4635.321761] Pid: 363, comm: kworker/u:2H Tainted: G           O 3.8.0-119.1-plk-adaptation-byt-ivi-brd #1
[ 4635.332642] EIP: 0060:[<c11659f0>] EFLAGS: 00010206 CPU: 0
[ 4635.338767] EIP is at memcmp+0xe/0x25
[ 4635.342852] EAX: e4720678 EBX: 00000000 ECX: 00000006 EDX: 00000013
[ 4635.349849] ESI: 00000000 EDI: fb85366c EBP: e40c7dc0 ESP: e40c7db4
[ 4635.356846]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 4635.362873] CR0: 8005003b CR2: 00000013 CR3: 24191000 CR4: 001007f0
[ 4635.369869] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 4635.376865] DR6: ffff0ff0 DR7: 00000400
[ 4635.381143] Process kworker/u:2H (pid: 363, ti=e40c6000 task=e40c5510 task.ti=e40c6000)
[ 4635.390080] Stack:
[ 4635.392319]  e4720400 00000000 fb85366c e40c7df4 fb842285 e40c7de2 fb853200 00000013
[ 4635.401003]  e3f101c4 e4720678 e3f101c0 e403be0a e40c7dfc e416a000 e403be0a fb85366c
[ 4635.409692]  e40c7e1c fb820186 020f6c00 e47c49ac e47c4008 00000000 e416a000 e47c402c
[ 4635.418380] Call Trace:
[ 4635.421153]  [<fb842285>] sco_connect_cfm+0xff/0x236 [bluetooth]
[ 4635.427893]  [<fb820186>] hci_sync_conn_complete_evt.clone.101+0x227/0x268 [bluetooth]
[ 4635.436758]  [<fb82370f>] hci_event_packet+0x1caa/0x21d3 [bluetooth]
[ 4635.443859]  [<c106231f>] ? trace_hardirqs_on+0xb/0xd
[ 4635.449502]  [<c1375b8a>] ? _raw_spin_unlock_irqrestore+0x42/0x59
[ 4635.456340]  [<fb814b67>] hci_rx_work+0xb9/0x350 [bluetooth]
[ 4635.462663]  [<c1039f1e>] ? process_one_work+0x17b/0x2e6
[ 4635.468596]  [<c1039f77>] process_one_work+0x1d4/0x2e6
[ 4635.474333]  [<c1039f1e>] ? process_one_work+0x17b/0x2e6
[ 4635.480294]  [<fb814aae>] ? hci_cmd_work+0xda/0xda [bluetooth]
[ 4635.486810]  [<c103a3fa>] worker_thread+0x171/0x20f
[ 4635.492257]  [<c10456c5>] ? complete+0x34/0x3e
[ 4635.497219]  [<c103ea06>] kthread+0x90/0x95
[ 4635.501888]  [<c103a289>] ? manage_workers+0x1df/0x1df
[ 4635.507628]  [<c1376537>] ret_from_kernel_thread+0x1b/0x28
[ 4635.513755]  [<c103e976>] ? __init_kthread_worker+0x42/0x42
[ 4635.519975] Code: 74 0d 3c 79 74 04 3c 59 75 0c c6 02 01 eb 03 c6 02 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 89 e5 57 56 53 31 db eb 0e 0f b6 34 18 <0f> b6 3c 1a 43 29 fe 75 07 49 85 c9 7f
[ 4635.541264] EIP: [<c11659f0>] memcmp+0xe/0x25 SS:ESP 0068:e40c7db4
[ 4635.548166] CR2: 0000000000000013
[ 4635.552177] ---[ end trace e05ce9b8ce6182f6 ]---

Signed-off-by: Kuba Pawlak <kubax.t.pawlak@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-27 06:00:07 +01:00
Bjørn Mork
4b3418fba0 ipv6: icmp: include addresses in debug messages
Messages like "icmp6_send: no reply to icmp error" are close
to useless. Adding source and destination addresses to provide
some more clue.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-26 21:59:42 -07:00
Vincent Cuissard
2bd832459a NFC: NCI: allow spi driver to choose transfer clock
In some cases low level drivers might want to update the
SPI transfer clock (e.g. during firmware download).

This patch adds this support. Without any modification the
driver will use the default SPI clock (from pdata or device tree).

Signed-off-by: Vincent Cuissard <cuissard@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-27 04:23:34 +01:00
Vincent Cuissard
fcd9d046fd NFC: NCI: move generic spi driver to a module
SPI driver should be a module.

Signed-off-by: Vincent Cuissard <cuissard@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-27 04:21:38 +01:00
Vincent Cuissard
e5629d2947 NFC: NCI: export nci_send_frame and nci_send_cmd function
Export nci_send_frame and nci_send_cmd symbols to allow drivers
to use it. This is needed for example if NCI is used during
firmware download phase.

Signed-off-by: Vincent Cuissard <cuissard@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-27 04:16:14 +01:00
Christophe Ricard
15d17170b4 NFC: st21nfca: Add support for proprietary commands
Add support for proprietary commands useful mainly
for factory testings.

Here is a list:

- FACTORY_MODE: Allow to set the driver into a mode where no
  secure element are activated. It does not consider any
  NFC_ATTR_VENDOR_DATA.
- HCI_CLEAR_ALL_PIPES: Allow to execute a HCI clear all pipes
  command. It does not consider any NFC_ATTR_VENDOR_DATA.
- HCI_DM_PUT_DATA: Allow to configure specific CLF registry as
  for example RF trimmings or low level drivers configurations
  (I2C, SPI, SWP).
- HCI_DM_UPDATE_AID: Allow to configure an AID routing into the
  CLF routing table following RF technology, CLF mode or protocol.
- HCI_DM_GET_INFO: Allow to retrieve CLF information.
- HCI_DM_GET_DATA: Allow to retrieve CLF configurable data such as
  low level drivers configurations or RF trimmings.
- HCI_DM_LOAD: Allow to load a firmware into the CLF. A complete
  packet can be more than 8KB.
- HCI_DM_RESET: Allow to run a CLF reset in order to "commit" CLF
  configuration changes without CLF power off.
- HCI_GET_PARAM: Allow to retrieve an HCI CLF parameter (for example
  the white list).
- HCI_DM_FIELD_GENERATOR: Allow to generate different kind of RF
  technology. When using this command to anti-collision is done.
- HCI_LOOPBACK: Allow to echo a command and test the Dh to CLF
  connectivity.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-27 04:00:24 +01:00
Christophe Ricard
064d004796 NFC: st-nci: Add few code style fixes
Add some few code style fixes.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-27 03:55:13 +01:00
Christophe Ricard
96d4581f0b NFC: netlink: Add mode parameter to deactivate_target functions
In order to manage in a better way the nci poll mode state machine,
add mode parameter to deactivate_target functions.
This way we can manage different target state.
mode parameter make sense only in nci core.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-27 03:55:12 +01:00
Marcel Holtmann
c4297e8f7f Bluetooth: Fix some obvious coding style issues in the SCO module
Lets fix this obvious coding style issues in the SCO module and bring it
in line with the rest of the Bluetooth subsystem.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-26 08:22:00 +02:00
Marcel Holtmann
05fcd4c4f1 Bluetooth: Replace hci_notify with hci_sock_dev_event
There is no point in wrapping hci_sock_dev_event around hci_notify. It
is an empty wrapper which adds no value. So remove it.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-26 08:21:47 +02:00
Marcel Holtmann
242c0ebd37 Bluetooth: Rename bt_cb()->req into bt_cb()->hci
The SKB context buffer for HCI request is really not just for requests,
information in their are preserved for the whole HCI layer. So it makes
more sense to actually rename it into bt_cb()->hci and also call it then
struct hci_ctrl.

In addition that allows moving the decoded opcode for outgoing packets
into that struct. So far it was just consuming valuable space from the
main shared items. And opcode are not valid for L2CAP packets.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-26 08:21:03 +02:00
Marcel Holtmann
d94a61040d Bluetooth: Remove unneeded parenthesis around MSG_OOB
There are two checks that are still using (MSG_OOB) instead of just
MSG_OOB and so lets just fix them.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-26 08:20:51 +02:00
Christophe Ricard
a1b0b94158 NFC: nci: Create pipe on specific gate in nci_hci_connect_gate
Some gates might need to have their pipes explicitly created.
Add a call to nci_hci_create_pipe in nci_hci_connect_gate for
every gate that is different than NCI_HCI_LINK_MGMT_GATE or
NCI_HCI_ADMIN_GATE.

In case of an error when opening a pipe, like in hci layer,
delete the pipe if it was created.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-26 06:53:13 +01:00
Christophe Ricard
8a49943f5b NFC: nci: Call nci_hci_clear_all_pipes at HCI initial activation.
When session_id is filled to 0xff, the pipe configuration is
probably incorrect and needs to be cleared.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-26 06:53:13 +01:00
Christophe Ricard
fa6fbadea5 NFC: nci: add nci_hci_clear_all_pipes functions
nci_hci_clear_all_pipes might be use full in some cases
for example after a firmware update.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-26 06:53:12 +01:00
Christophe Ricard
e65917b6d5 NFC: nci: extract pipe value using NCI_HCP_MSG_GET_PIPE
When receiving data in nci_hci_msg_rx_work, extract pipe
value using NCI_HCP_MSG_GET_PIPE macro.

Cc: stable@vger.kernel.org
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-26 06:53:12 +01:00
Christophe Ricard
d8cd37ed2f NFC: nci: Fix improper management of HCI return code
When sending HCI data over NCI, HCI return code is part
of the NCI data. In order to get correctly the HCI return
code, we assume the NCI communication is successful and
extract the return code for the nci_hci functions return code.

This is done because nci_to_errno does not match hci return
code value.

Cc: stable@vger.kernel.org
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-26 06:53:12 +01:00
Christophe Ricard
500c4ef022 NFC: nci: Fix incorrect data chaining when sending data
When sending HCI data over NCI, cmd information should be
present only on the first packet.
Each packet shall be specifically allocated and sent to the
NCI layer.

Cc: stable@vger.kernel.org
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-26 06:53:11 +01:00
Kuba Pawlak
1da5537ecc Bluetooth: Fix locking issue during fast SCO reconnection.
When SCO connection is requested and disconnected fast, there is a change
that sco_sock_shutdown is going to preempt thread started in sco_connect_cfm.
When this happens struct sock sk may be removed but a pointer to it is still
held in sco_conn_ready, where embedded spinlock is used. If it is used, but
struct sock has been removed, it will crash.

Block connection object, which will prevent struct sock from being removed
and give connection process chance to finish.

BUG: spinlock bad magic on CPU#0, kworker/u:2H/319
 lock: 0xe3e99434, .magic: f3000000, .owner: (���/0, .owner_cpu: -203804160
Pid: 319, comm: kworker/u:2H Tainted: G           O 3.8.0-115.1-plk-adaptation-byt-ivi-brd #1
Call Trace:
 [<c1155659>] ? do_raw_spin_lock+0x19/0xe9
 [<fb75354f>] ? sco_connect_cfm+0x92/0x236 [bluetooth]
 [<fb731dbc>] ? hci_sync_conn_complete_evt.clone.101+0x18b/0x1cb [bluetooth]
 [<fb734ee7>] ? hci_event_packet+0x1acd/0x21a6 [bluetooth]
 [<c1041095>] ? finish_task_switch+0x50/0x89
 [<c1349a2e>] ? __schedule+0x638/0x6b8
 [<fb727918>] ? hci_rx_work+0xb9/0x2b8 [bluetooth]
 [<c103760a>] ? queue_delayed_work_on+0x21/0x2a
 [<c1035df9>] ? process_one_work+0x157/0x21b
 [<fb72785f>] ? hci_cmd_work+0xef/0xef [bluetooth]
 [<c1036217>] ? worker_thread+0x16e/0x20a
 [<c10360a9>] ? manage_workers+0x1cf/0x1cf
 [<c103a0ef>] ? kthread+0x8d/0x92
 [<c134adf7>] ? ret_from_kernel_thread+0x1b/0x28
 [<c103a062>] ? __init_kthread_worker+0x24/0x24
BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<  (null)>]   (null)
*pdpt = 00000000244e1001 *pde = 0000000000000000
Oops: 0010 [#1] PREEMPT SMP
Modules linked in: evdev ecb rfcomm(O) libcomposite usb2380 udc_core bnep(O) btusb(O) btbcm(O) cdc_acm btintel(O) bluetooth(O) arc4 uinput hid_multitouch usbhid hid iwlmvm(O)e
Pid: 319, comm: kworker/u:2H Tainted: G           O 3.8.0-115.1-plk-adaptation-byt-ivi-brd #1
EIP: 0060:[<00000000>] EFLAGS: 00010246 CPU: 0
EIP is at 0x0
EAX: e3e99400 EBX: e3e99400 ECX: 00000100 EDX: 00000000
ESI: e3e99434 EDI: fb763ce0 EBP: e49b9e44 ESP: e49b9e14
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 00000000 CR3: 24444000 CR4: 001007f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process kworker/u:2H (pid: 319, ti=e49b8000 task=e4ab9030 task.ti=e49b8000)
Stack:
 fb75355b 00000246 fb763900 22222222 22222222 22222222 e3f94460 e3ca7c0a
 e49b9e4c e3f34c00 e3ca7c0a fb763ce0 e49b9e6c fb731dbc 02000246 e4cec85c
 e4cec008 00000000 e3f34c00 e4cec000 e3c2ce00 0000002c e49b9ed0 fb734ee7
Call Trace:
 [<fb75355b>] ? sco_connect_cfm+0x9e/0x236 [bluetooth]
 [<fb731dbc>] ? hci_sync_conn_complete_evt.clone.101+0x18b/0x1cb [bluetooth]
 [<fb734ee7>] ? hci_event_packet+0x1acd/0x21a6 [bluetooth]
 [<c1041095>] ? finish_task_switch+0x50/0x89
 [<c1349a2e>] ? __schedule+0x638/0x6b8
 [<fb727918>] ? hci_rx_work+0xb9/0x2b8 [bluetooth]
 [<c103760a>] ? queue_delayed_work_on+0x21/0x2a
 [<c1035df9>] ? process_one_work+0x157/0x21b
 [<fb72785f>] ? hci_cmd_work+0xef/0xef [bluetooth]
 [<c1036217>] ? worker_thread+0x16e/0x20a
 [<c10360a9>] ? manage_workers+0x1cf/0x1cf
 [<c103a0ef>] ? kthread+0x8d/0x92
 [<c134adf7>] ? ret_from_kernel_thread+0x1b/0x28
 [<c103a062>] ? __init_kthread_worker+0x24/0x24
Code:  Bad EIP value.
EIP: [<00000000>] 0x0 SS:ESP 0068:e49b9e14
CR2: 0000000000000000
---[ end trace 942a6577c0abd725 ]---

Signed-off-by: Kuba Pawlak <kubax.t.pawlak@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-25 21:06:39 +01:00
Kuba Pawlak
435c513369 Bluetooth: Fix locking issue on SCO disconnection
Thread handling SCO disconnection may get preempted in '__sco_sock_close'
after dropping a reference to hci_conn but before marking this as NULL
in associated struct sco_conn. When execution returs to this thread,
this connection will possibly be released, resulting in kernel crash

Lock connection before this point.

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<fb770ab9>] __sco_sock_close+0x194/0x1ff [bluetooth]
*pdpt = 0000000023da6001 *pde = 0000000000000000
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: evdev ecb rfcomm(O) libcomposite usb2380 udc_core bnep(O) btusb(O) btbcm(O) cdc_acm btintel(O) bluetooth(O) arc4 uinput hid_multitouch usbhid iwlmvm(O) hide
Pid: 984, comm: bluetooth Tainted: G           O 3.8.0-115.1-plk-adaptation-byt-ivi-brd #1
EIP: 0060:[<fb770ab9>] EFLAGS: 00010282 CPU: 2
EIP is at __sco_sock_close+0x194/0x1ff [bluetooth]
EAX: 00000000 EBX: e49d7600 ECX: ef1ec3c2 EDX: 000000c3
ESI: e4c12000 EDI: 00000000 EBP: ef1edf5c ESP: ef1edf4c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 00000000 CR3: 23da7000 CR4: 001007f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process bluetooth (pid: 984, ti=ef1ec000 task=e47f2550 task.ti=ef1ec000)
Stack:
 e4c120d0 e49d7600 00000000 08421a40 ef1edf70 fb770b7a 00000002 e8a4cc80
 08421a40 ef1ec000 c12966b1 00000001 00000000 0000000b 084954c8 c1296b6c
 0000001b 00000002 0000001b 00000002 00000000 00000002 b2524880 00000046
Call Trace:
 [<fb770b7a>] ? sco_sock_shutdown+0x56/0x95 [bluetooth]
 [<c12966b1>] ? sys_shutdown+0x37/0x53
 [<c1296b6c>] ? sys_socketcall+0x12e/0x1be
 [<c134ae7e>] ? sysenter_do_call+0x12/0x26
 [<c1340000>] ? ip_vs_control_net_cleanup+0x46/0xb1
Code: e8 90 6b 8c c5 f6 05 72 5d 78 fb 04 74 17 8b 46 08 50 56 68 0a fd 77 fb 68 60 5d 78 fb e8 68 95 9e c5 83 c4 10 8b 83 fc 01 00 00 <c7> 00 00 00 00 00 eb 32 ba 68 00 00 0b
EIP: [<fb770ab9>] __sco_sock_close+0x194/0x1ff [bluetooth] SS:ESP 0068:ef1edf4c
CR2: 0000000000000000
---[ end trace 47fa2f55a9544e69 ]---

Signed-off-by: Kuba Pawlak <kubax.t.pawlak@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-25 21:06:39 +01:00
Kuba Pawlak
75e34f5cf6 Bluetooth: Fix crash on SCO disconnect
When disconnecting audio from the phone's side, it may happen, that
a thread handling HCI message 'disconnection complete' will get preempted
in 'sco_conn_del' before calling 'sco_sock_kill', still holding a pointer
to struct sock sk. Interrupting thread started in 'sco_sock_shutdown' will
carry on releasing resources and will eventually release struct sock.
When execution goes back to first thread it will call sco_sock_kill using
now invalid pointer to already destroyed socket.

Fix is to grab a reference to the socket a release it after calling
'sco_sock_kill'.

[  166.358213] BUG: unable to handle kernel paging request at 7541203a
[  166.365228] IP: [<fb6e8bfb>] bt_sock_unlink+0x1a/0x38 [bluetooth]
[  166.372068] *pdpt = 0000000024b19001 *pde = 0000000000000000
[  166.378483] Oops: 0002 [#1] PREEMPT SMP
[  166.382871] Modules linked in: evdev ecb rfcomm(O) libcomposite usb2380 udc_core bnep(O) btusb(O) btbcm(O) btintel(O) cdc_acm bluetooth(O) arc4 uinput hid_multitouch iwlmvm(O) usbhid hide
[  166.424233] Pid: 338, comm: kworker/u:2H Tainted: G           O 3.8.0-115.1-plk-adaptation-byt-ivi-brd #1
[  166.435112] EIP: 0060:[<fb6e8bfb>] EFLAGS: 00010206 CPU: 0
[  166.441259] EIP is at bt_sock_unlink+0x1a/0x38 [bluetooth]
[  166.447382] EAX: 632e6563 EBX: e4bfc600 ECX: e466d4d3 EDX: 7541203a
[  166.454369] ESI: fb7278ac EDI: e4d52000 EBP: e4669e20 ESP: e4669e0c
[  166.461366]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[  166.467391] CR0: 8005003b CR2: 7541203a CR3: 24aba000 CR4: 001007f0
[  166.474387] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  166.481375] DR6: ffff0ff0 DR7: 00000400
[  166.485654] Process kworker/u:2H (pid: 338, ti=e4668000 task=e466e030 task.ti=e4668000)
[  166.494591] Stack:
[  166.496830]  e4bfc600 e4bfc600 fb715c28 e4717ee0 e4d52000 e4669e3c fb715cf3 e4bfc634
[  166.505518]  00000068 e4d52000 e4c32000 fb7277c0 e4669e6c fb6f2019 0000004a 00000216
[  166.514205]  e4660101 e4c32008 02000001 00000013 e4d52000 e4c32000 e3dc9240 00000005
[  166.522891] Call Trace:
[  166.525654]  [<fb715c28>] ? sco_sock_kill+0x73/0x9a [bluetooth]
[  166.532295]  [<fb715cf3>] ? sco_conn_del+0xa4/0xbf [bluetooth]
[  166.538836]  [<fb6f2019>] ? hci_disconn_complete_evt.clone.55+0x1bd/0x205 [bluetooth]
[  166.547609]  [<fb6f73d3>] ? hci_event_packet+0x297/0x223c [bluetooth]
[  166.554805]  [<c10416da>] ? dequeue_task+0xaf/0xb7
[  166.560154]  [<c1041095>] ? finish_task_switch+0x50/0x89
[  166.566086]  [<c1349a2e>] ? __schedule+0x638/0x6b8
[  166.571460]  [<fb6eb906>] ? hci_rx_work+0xb9/0x2b8 [bluetooth]
[  166.577975]  [<c1035df9>] ? process_one_work+0x157/0x21b
[  166.583933]  [<fb6eb84d>] ? hci_cmd_work+0xef/0xef [bluetooth]
[  166.590448]  [<c1036217>] ? worker_thread+0x16e/0x20a
[  166.596088]  [<c10360a9>] ? manage_workers+0x1cf/0x1cf
[  166.601826]  [<c103a0ef>] ? kthread+0x8d/0x92
[  166.606691]  [<c134adf7>] ? ret_from_kernel_thread+0x1b/0x28
[  166.613010]  [<c103a062>] ? __init_kthread_worker+0x24/0x24
[  166.619230] Code: 85 63 ff ff ff 31 db 8d 65 f4 89 d8 5b 5e 5f 5d c3 56 8d 70 04 53 89 f0 89 d3 e8 7e 17 c6 c5 8b 53 28 85 d2 74 1a 8b 43 24 85 c0 <89> 02 74 03 89 50 04 c7 43 28 00 00 00
[  166.640501] EIP: [<fb6e8bfb>] bt_sock_unlink+0x1a/0x38 [bluetooth] SS:ESP 0068:e4669e0c
[  166.649474] CR2: 000000007541203a
[  166.653420] ---[ end trace 0181ff2c9e42d51e ]---
[  166.658609] note: kworker/u:2H[338] exited with preempt_count 1

Signed-off-by: Kuba Pawlak <kubax.t.pawlak@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-25 21:06:39 +01:00
Robert Dolca
85b9ce9a21 NFC: nci: add nci_get_conn_info_by_id function
This functin takes as a parameter a pointer to the nci_dev
struct and the first byte from the values of the first domain
specific parameter that was used for the connection creation.

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-25 20:29:11 +01:00
Robert Dolca
caa575a86e NFC: nci: fix possible crash in nci_core_conn_create
If the number of destination speific parameters supplied is 0
the call will fail. If the first destination specific parameter
does not have a value, curr_id will be set to 0.

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-25 20:29:05 +01:00
Robert Dolca
22e4bd09c4 NFC: nci: rename nci_prop_ops to nci_driver_ops
Initially it was used to create hooks in the driver for
proprietary operations. Currently it is being used for hooks
for both proprietary and generic operations.

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-25 20:28:59 +01:00
Robert Dolca
0a97a3cba2 NFC: nci: Allow the driver to set handler for core nci ops
The driver may be required to act when some responses or
notifications arrive. For example the NCI core does not have a
handler for NCI_OP_CORE_GET_CONFIG_RSP. The NFCC can send a
config response that has to be read by the driver and the packet
may contain vendor specific data.

The Fields Peak driver needs to take certain actions when a reset
notification arrives (packet also not handled by the nfc core).

The driver handlers do not interfere with the core and they are
called after the core processes the packet.

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-25 19:12:57 +01:00
Robert Dolca
7bc4824ed5 NFC: nci: Introduce nci_core_cmd
This allows sending core commands from the driver. The driver
should be able to send NCI core commands like CORE_GET_CONFIG_CMD.

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-25 19:12:13 +01:00
Robert Dolca
e4dbd62528 NFC: nci: Do not call post_setup when setup fails
The driver should know that it can continue with post setup where
setup left off. Being able to execute post_setup when setup fails
may force the developer to keep this state in the driver.

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-25 19:11:58 +01:00
Robert Dolca
2663589ce6 NFC: nci: Add function to get max packet size for conn
FDP driver needs to send the firmware as regular packets
(not fragmented). The driver should have a way to
get the max packet size for a given connection.

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-25 19:11:42 +01:00
Robert Dolca
ea785c094d NFC: nci: Export nci data send API
For the firmware update the driver may use nci_send_data.

Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-25 19:11:35 +01:00
Eric Dumazet
1586a5877d af_unix: do not report POLLOUT on listeners
poll(POLLOUT) on a listener should not report fd is ready for
a write().

This would break some applications using poll() and pfd.events = -1,
as they would not block in poll()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alan Burlison <Alan.Burlison@oracle.com>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-25 06:37:45 -07:00
Wu Fengguang
742e038330 tipc: link_is_bc_sndlink() can be static
TO: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
CC: Jon Maloy <jon.maloy@ericsson.com>
CC: Ying Xue <ying.xue@windriver.com>
CC: tipc-discussion@lists.sourceforge.net
CC: linux-kernel@vger.kernel.org

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-25 06:31:52 -07:00
Jon Paul Maloy
2af5ae372a tipc: clean up unused code and structures
After the previous changes in this series, we can now remove some
unused code and structures, both in the broadcast, link aggregation
and link code.

There are no functional changes in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:47 -07:00
Jon Paul Maloy
c49a0a8439 tipc: ensure binding table initial distribution is sent via first link
Correct synchronization of the broadcast link at first contact between
two nodes is dependent on the assumption that the binding table "bulk"
update passes via the same link as the initial broadcast syncronization
message, i.e., via the first link that is established.

This is not guaranteed in the current implementation. If two link
come up very close to each other in time, the "bulk" may quite well
pass via the second link, and hence void the guarantee of a correct
initial synchronization before the broadcast link is opened.

This commit makes two small changes to strengthen this guarantee.

1) We let the second established link occupy slot 1 of the
   "active_links" array, while the first link will retain slot 0.
   (This is in reality a cosmetic change, we could just as well keep
    the current, opposite order)

2) We let the name distributor always use link selector/slot 0 when
   it sends it binding table updates.

The extra traffic bias on the first link caused by this change should
be negligible, since binding table updates constitutes a very small
fraction of the total traffic.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:46 -07:00
Jon Paul Maloy
c72fa872a2 tipc: eliminate link's reference to owner node
With the recent commit series, we have established a one-way dependency
between the link aggregation (struct tipc_node) instances and their
pertaining tipc_link instances. This has enabled quite significant code
and structure simplifications.

In this commit, we eliminate the field 'owner', which points to an
instance of struct tipc_node, from struct tipc_link, and replace it with
a pointer to struct net, which is the only external reference now needed
by a link instance.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:45 -07:00
Jon Paul Maloy
7214bcf875 tipc: eliminate redundant buffer cloning at transmission
Since all packet transmitters (link, bcast, discovery) are now sending
consumable buffer clones to the bearer layer, we can remove the
redundant buffer cloning that is perfomed in the lower level functions
tipc_l2_send_msg() and tipc_udp_send_msg().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:44 -07:00
Jon Paul Maloy
60852d6795 tipc: let neighbor discoverer tranmsit consumable buffers
The neighbor discovery function currently uses the function
tipc_bearer_send() for transmitting packets, assuming that the
sent buffers are not consumed by the called function.

We want to change this, in order to avoid unnecessary buffer cloning
elswhere in the code.

This commit introduces a new function tipc_bearer_skb() which consumes
the sent buffers, and let the discoverer functions use this new call
instead. The discoverer does now itself perform the cloning when
that is necessary.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:44 -07:00
Jon Paul Maloy
959e1781aa tipc: introduce jumbo frame support for broadcast
Until now, we have only been supporting a fix MTU size of 1500 bytes
for all broadcast media, irrespective of their actual capability.

We now make the broadcast MTU adaptable to the carrying media, i.e.,
we use the smallest MTU supported by any of the interfaces attached
to TIPC.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:40 -07:00
Jon Paul Maloy
b06b281e79 tipc: simplify bearer level broadcast
Until now, we have been keeping track of the exact set of broadcast
destinations though the help structure tipc_node_map. This leads us to
have to maintain a whole infrastructure for supporting this, including
a pseudo-bearer and a number of functions to manipulate both the bearers
and the node map correctly. Apart from the complexity, this approach is
also limiting, as struct tipc_node_map only can support cluster local
broadcast if we want to avoid it becoming excessively large. We want to
eliminate this limitation, in order to enable introduction of scoped
multicast in the future.

A closer analysis reveals that it is unnecessary maintaining this "full
set" overview; it is sufficient to keep a counter per bearer, indicating
how many nodes can be reached via this bearer at the moment. The protocol
is now robust enough to handle transitional discrepancies between the
nominal number of reachable destinations, as expected by the broadcast
protocol itself, and the number which is actually reachable at the
moment. The initial broadcast synchronization, in conjunction with the
retransmission mechanism, ensures that all packets will eventually be
acknowledged by the correct set of destinations.

This commit introduces these changes.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:39 -07:00
Jon Paul Maloy
5266698661 tipc: let broadcast packet reception use new link receive function
The code path for receiving broadcast packets is currently distinct
from the unicast path. This leads to unnecessary code and data
duplication, something that can be avoided with some effort.

We now introduce separate per-peer tipc_link instances for handling
broadcast packet reception. Each receive link keeps a pointer to the
common, single, broadcast link instance, and can hence handle release
and retransmission of send buffers as if they belonged to the own
instance.

Furthermore, we let each unicast link instance keep a reference to both
the pertaining broadcast receive link, and to the common send link.
This makes it possible for the unicast links to easily access data for
broadcast link synchronization, as well as for carrying acknowledges for
received broadcast packets.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:37 -07:00
Jon Paul Maloy
fd556f209a tipc: introduce capability bit for broadcast synchronization
Until now, we have tried to support both the newer, dedicated broadcast
synchronization mechanism along with the older, less safe, RESET_MSG/
ACTIVATE_MSG based one. The latter method has turned out to be a hazard
in a highly dynamic cluster, so we find it safer to disable it completely
when we find that the former mechanism is supported by the peer node.

For this purpose, we now introduce a new capabability bit,
TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast
syncronization is supported by the present node. The new bit is conveyed
between peers in the 'capabilities' field of neighbor discovery messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:35 -07:00
Jon Paul Maloy
2f56612457 tipc: let broadcast transmission use new link transmit function
This commit simplifies the broadcast link transmission function, by
leveraging previous changes to the link transmission function and the
broadcast transmission link life cycle.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:32 -07:00
Jon Paul Maloy
c1ab3f1dea tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.

The following changes are introduced here:

- A new counter ("ackers") in struct tipc_link, to indicate how many
  peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
  many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
  has acked at the moment, i.e., where in the transmission queue to
  start updating buffers when the next ack arrives. This is to avoid
  double acknowledgements from a peer, with inadvertent relase of
  packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
  from a given sequence number, instead of the first packet in the
  transmision queue. This is to minimize the number of retransmitted
  packets on the broadcast media.

When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:32 -07:00
Jon Paul Maloy
323019069e tipc: use explicit allocation of broadcast send link
The broadcast link instance (struct tipc_link) used for sending is
currently aggregated into struct tipc_bclink. This means that we cannot
use the regular tipc_link_create() function for initiating the link, but
do instead have to initiate numerous fields directly from the
bcast_init() function.

We want to reduce dependencies between the broadcast functionality
and the inner workings of tipc_link. In this commit, we introduce
a new function tipc_bclink_create() to link.c, and allocate the
instance of the link separately using this function.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:30 -07:00
Jon Paul Maloy
0e05498e9e tipc: make link implementation independent from struct tipc_bearer
In reality, the link implementation is already independent from
struct tipc_bearer, in that it doesn't store any reference to it.
However, we still pass on a pointer to a bearer instance in the
function tipc_link_create(), just to have it extract some
initialization information from it.

I later commits, we need to create instances of tipc_link without
having any associated struct tipc_bearer. To facilitate this, we
want to extract the initialization data already in the creator
function in node.c, before calling tipc_link_create(), and pass
this info on as individual parameters in the call.

This commit introduces this change.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:30 -07:00
Jon Paul Maloy
5fd9fd6351 tipc: create broadcast transmission link at namespace init
The broadcast transmission link is currently instantiated when the
network subsystem is started, i.e., on order from user space via netlink.

This forces the broadcast transmission code to do unnecessary tests for
the existence of the transmission link, as well in single mode node as
in network mode.

In this commit, we do instead create the link during initialization of
the name space, and remove it when it is stopped. The fact that the
transmission link now has a guaranteed longer life cycle than any of its
potential clients paves the way for further code simplifcations
and optimizations.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:27 -07:00
Jon Paul Maloy
0043550b0a tipc: move broadcast link lock to struct tipc_net
The broadcast lock will need to be acquired outside bcast.c in a later
commit. For this reason, we move the lock to struct tipc_net. Consistent
with the changes in the previous commit, we also introducee two new
functions tipc_bcast_lock() and tipc_bcast_unlock(). The code that is
currently using tipc_bclink_lock()/unlock() will be phased out during
the coming commits in this series.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:25 -07:00
Jon Paul Maloy
6beb19a62a tipc: move bcast definitions to bcast.c
Currently, a number of structure and function definitions related
to the broadcast functionality are unnecessarily exposed in the file
bcast.h. This obscures the fact that the external interface towards
the broadcast link in fact is very narrow, and causes unnecessary
recompilations of other files when anything changes in those
definitions.

In this commit, we move as many of those definitions as is currently
possible to the file bcast.c.

We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
since the name does not correctly describe the contents of this
struct, and will do so even less in the future, and because we want
to use the term 'link' more appropriately in the functionality
introduced later in this series.

Finally, we rename a couple of functions, such as tipc_bclink_xmit()
and others that will be kept in the future, to include the term 'bcast'
instead.

There are no functional changes in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:24 -07:00
David S. Miller
ba3e2084f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv6/xfrm6_output.c
	net/openvswitch/flow_netlink.c
	net/openvswitch/vport-gre.c
	net/openvswitch/vport-vxlan.c
	net/openvswitch/vport.c
	net/openvswitch/vport.h

The openvswitch conflicts were overlapping changes.  One was
the egress tunnel info fix in 'net' and the other was the
vport ->send() op simplification in 'net-next'.

The xfrm6_output.c conflicts was also a simplification
overlapping a bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:54:12 -07:00
David S. Miller
a72c9512bf Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-10-22

Here's probably the last bluetooth-next pull request for 4.4. Among
several other changes it contains the rest of the fixes & cleanups from
the Bluetooth UnplugFest (that didn't need to be hurried to 4.3).

 - Refactoring & cleanups to 6lowpan code
 - New USB ids for two Atheros controllers and BCM43142A0 from Broadcom
 - Fix (quirk) for broken Broadcom BCM2045 controllers
 - Support for latest Apple controllers
 - Improvements to the vendor diagnostic message support

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 05:13:16 -07:00
Neil Brown
778620364e sunrpc/cache: make cache flushing more reliable.
The caches used to store sunrpc authentication information can be
flushed by writing a timestamp to a file in /proc.

This timestamp has a one-second resolution and any entry in cache that
was last_refreshed *before* that time is treated as expired.

This is problematic as it is not possible to reliably flush the cache
without interrupting NFS service.
If the current time is written to the "flush" file, any entry that was
added since the current second started will still be treated as valid.
If one second beyond than the current time is written to the file
then no entries can be valid until the second ticks over.  This will
mean that no NFS request will be handled for up to 1 second.

To resolve this issue we make two changes:

1/ treat an entry as expired if the timestamp when it was last_refreshed
  is before *or the same as* the expiry time.  This means that current
  code which writes out the current time will now flush the cache
  reliably.

2/ when a new entry in added to the cache -  set the last_refresh timestamp
  to 1 second *beyond* the current flush time, when that not in the
  past.
  This ensures that newly added entries will always be valid.

Now that we have a very reliable way to flush the cache, and also
since we are using "since-boot" timestamps which are monotonic,
change cache_purge() to set the smallest future flush_time which
will work, and leave it there: don't revert to '1'.

Also disable the setting of the 'flush_time' far into the future.
That has never been useful and is now awkward as it would cause
last_refresh times to be strange.
Finally: if a request is made to set the 'flush_time' to the current
second, assume the intent is to flush the cache and advance it, if
necessary, to 1 second beyond the current 'flush_time' so that all
active entries will be deemed to be expired.

As part of this we need to add a 'cache_detail' arg to cache_init()
and cache_fresh_locked() so they can find the current ->flush_time.

Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Olaf Kirch <okir@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23 15:57:30 -04:00
Arnd Bergmann
cc6a7aab55 sunrpc: avoid warning in gss_key_timeout
The gss_key_timeout() function causes a harmless warning in some
configurations, e.g. ARM imx_v6_v7_defconfig with gcc-5.2, if the
compiler cannot figure out the state of the 'expire' variable across
an rcu_read_unlock():

net/sunrpc/auth_gss/auth_gss.c: In function 'gss_key_timeout':
net/sunrpc/auth_gss/auth_gss.c:1422:211: warning: 'expire' may be used uninitialized in this function [-Wmaybe-uninitialized]

To avoid this warning without adding a bogus initialization, this
rewrites the function so the comparison is done inside of the
critical section. As a side-effect, it also becomes slightly
easier to understand because the implementation now more closely
resembles the comment above it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: c5e6aecd03 ("sunrpc: fix RCU handling of gc_ctx field")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23 15:57:28 -04:00
Trond Myklebust
226453d8cf SUNRPC: Use MSG_SENDPAGE_NOTLAST when calling sendpage()
If we're sending more pages via kernel_sendpage(), then set
MSG_SENDPAGE_NOTLAST.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-23 15:57:27 -04:00
David S. Miller
bf7958607d Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-10-23

This series contains updates to i40e, i40evf, if_link, ixgbe and ixgbevf.

Anjali adds a workaround to drop any flow control frames from being
transmitted from any VSI, so that a malicious VF cannot send flow control
or PFC packets out on the wire.  Also fixed a bug in debugfs by grabbing
the filter list lock before adding or deleting a filter.

Akeem fixes an issue where we were unconditionally returning VEB bridge
mode before allowing LB in the add VSI routine, resolve by checking if
the bridge is actually in VEB mode first.

Mitch fixed an issue where the incorrect structure was being used for
VLAN filter list, which meant the VLAN filter list did not get
processed correctly and VLAN filters would not be re-enabled after any
kind of reset.

Helin fixed a problem of possibly getting inconsistent flow control
status after a PF reset.  The issue was requested_mode was being set
with a default value during probe, but the hardware state could be a
different value from this mode.

Carolyn fixed a problem where the driver output of the OEM version
string varied from the other tools.

Jean Sacren fixes up kernel documentation by fixing function header
comments to match actual variables used in the functions.  Also
cleaned up variable initialization, when the variable would be
over-written immediately.

Hiroshi Shimanoto provides three patches to add "trusted" VF by adding
netlink directives and an NDO entry.  Then implement these new controls
in ixgbe and ixgbevf.  This series has gone through several iterations
to address all the suggested community changes and concerns.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 06:58:09 -07:00
Robert Shearman
1c78efa831 mpls: flow-based multipath selection
Change the selection of a multipath route to use a flow-based
hash. This more suitable for traffic sensitive to reordering within a
flow (e.g. TCP, L2VPN) and whilst still allowing a good distribution
of traffic given enough flows.

Selection of the path for a multipath route is done using a hash of:
1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and
   including entropy label, whichever is first.
2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS
   payload, if present.

Naturally, a 5-tuple hash using L4 information in addition would be
possible and be better in some scenarios, but there is a tradeoff
between looking deeper into the packet to achieve good distribution,
and packet forwarding performance, and I have erred on the side of the
latter as the default.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 06:26:45 -07:00
Roopa Prabhu
f8efb73c97 mpls: multipath route support
This patch adds support for MPLS multipath routes.

Includes following changes to support multipath:
- splits struct mpls_route into 'struct mpls_route + struct mpls_nh'

- 'struct mpls_nh' represents a mpls nexthop label forwarding entry

- moves mpls route and nexthop structures into internal.h

- A mpls_route can point to multiple mpls_nh structs

- the nexthops are maintained as a array (similar to ipv4 fib)

- In the process of restructuring, this patch also consistently changes
  all labels to u8

- Adds support to parse/fill RTA_MULTIPATH netlink attribute for
multipath routes similar to ipv4/v6 fib

- In this patch, the multipath route nexthop selection algorithm
simply returns the first nexthop. It is replaced by a
hash based algorithm from Robert Shearman in the next patch

- mpls_route_update cleanup: remove 'dev' handling in mpls_route_update.
mpls_route_update though implemented to update based on dev, it was
never used that way. And the dev handling gets tricky with multiple
nexthops. Cannot match against any single nexthops dev. So, this patch
removes the unused 'dev' handling in mpls_route_update.

- dead route/path handling will be implemented in a subsequent patch

Example:

$ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \
                nexthop as 700 via inet 10.1.1.6 dev swp2 \
                nexthop as 800 via inet 40.1.1.2 dev swp3

$ip  -f mpls route show
100
        nexthop as to 200 via inet 10.1.1.2  dev swp1
        nexthop as to 700 via inet 10.1.1.6  dev swp2
        nexthop as to 800 via inet 40.1.1.2  dev swp3

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 06:26:42 -07:00
Li RongQing
ce9d9b8e5c net: sysctl: fix a kmemleak warning
the returned buffer of register_sysctl() is stored into net_header
variable, but net_header is not used after, and compiler maybe
optimise the variable out, and lead kmemleak reported the below warning

	comm "swapper/0", pid 1, jiffies 4294937448 (age 267.270s)
	hex dump (first 32 bytes):
	90 38 8b 01 c0 ff ff ff 00 00 00 00 01 00 00 00 .8..............
	01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
	backtrace:
	[<ffffffc00020f134>] create_object+0x10c/0x2a0
	[<ffffffc00070ff44>] kmemleak_alloc+0x54/0xa0
	[<ffffffc0001fe378>] __kmalloc+0x1f8/0x4f8
	[<ffffffc00028e984>] __register_sysctl_table+0x64/0x5a0
	[<ffffffc00028eef0>] register_sysctl+0x30/0x40
	[<ffffffc00099c304>] net_sysctl_init+0x20/0x58
	[<ffffffc000994dd8>] sock_init+0x10/0xb0
	[<ffffffc0000842e0>] do_one_initcall+0x90/0x1b8
	[<ffffffc000966bac>] kernel_init_freeable+0x218/0x2f0
	[<ffffffc00070ed6c>] kernel_init+0x1c/0xe8
	[<ffffffc000083bfc>] ret_from_fork+0xc/0x50
	[<ffffffffffffffff>] 0xffffffffffffffff <<end check kmemleak>>

Before fix, the objdump result on ARM64:
0000000000000000 <net_sysctl_init>:
   0:   a9be7bfd        stp     x29, x30, [sp,#-32]!
   4:   90000001        adrp    x1, 0 <net_sysctl_init>
   8:   90000000        adrp    x0, 0 <net_sysctl_init>
   c:   910003fd        mov     x29, sp
  10:   91000021        add     x1, x1, #0x0
  14:   91000000        add     x0, x0, #0x0
  18:   a90153f3        stp     x19, x20, [sp,#16]
  1c:   12800174        mov     w20, #0xfffffff4                // #-12
  20:   94000000        bl      0 <register_sysctl>
  24:   b4000120        cbz     x0, 48 <net_sysctl_init+0x48>
  28:   90000013        adrp    x19, 0 <net_sysctl_init>
  2c:   91000273        add     x19, x19, #0x0
  30:   9101a260        add     x0, x19, #0x68
  34:   94000000        bl      0 <register_pernet_subsys>
  38:   2a0003f4        mov     w20, w0
  3c:   35000060        cbnz    w0, 48 <net_sysctl_init+0x48>
  40:   aa1303e0        mov     x0, x19
  44:   94000000        bl      0 <register_sysctl_root>
  48:   2a1403e0        mov     w0, w20
  4c:   a94153f3        ldp     x19, x20, [sp,#16]
  50:   a8c27bfd        ldp     x29, x30, [sp],#32
  54:   d65f03c0        ret
After:
0000000000000000 <net_sysctl_init>:
   0:   a9bd7bfd        stp     x29, x30, [sp,#-48]!
   4:   90000000        adrp    x0, 0 <net_sysctl_init>
   8:   910003fd        mov     x29, sp
   c:   a90153f3        stp     x19, x20, [sp,#16]
  10:   90000013        adrp    x19, 0 <net_sysctl_init>
  14:   91000000        add     x0, x0, #0x0
  18:   91000273        add     x19, x19, #0x0
  1c:   f90013f5        str     x21, [sp,#32]
  20:   aa1303e1        mov     x1, x19
  24:   12800175        mov     w21, #0xfffffff4                // #-12
  28:   94000000        bl      0 <register_sysctl>
  2c:   f9002260        str     x0, [x19,#64]
  30:   b40001a0        cbz     x0, 64 <net_sysctl_init+0x64>
  34:   90000014        adrp    x20, 0 <net_sysctl_init>
  38:   91000294        add     x20, x20, #0x0
  3c:   9101a280        add     x0, x20, #0x68
  40:   94000000        bl      0 <register_pernet_subsys>
  44:   2a0003f5        mov     w21, w0
  48:   35000080        cbnz    w0, 58 <net_sysctl_init+0x58>
  4c:   aa1403e0        mov     x0, x20
  50:   94000000        bl      0 <register_sysctl_root>
  54:   14000004        b       64 <net_sysctl_init+0x64>
  58:   f9402260        ldr     x0, [x19,#64]
  5c:   94000000        bl      0 <unregister_sysctl_table>
  60:   f900227f        str     xzr, [x19,#64]
  64:   2a1503e0        mov     w0, w21
  68:   f94013f5        ldr     x21, [sp,#32]
  6c:   a94153f3        ldp     x19, x20, [sp,#16]
  70:   a8c37bfd        ldp     x29, x30, [sp],#48
  74:   d65f03c0        ret

Add the possible error handle to free the net_header to remove the
kmemleak warning

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 06:22:08 -07:00
Hiroshi Shimamoto
dd461d6aa8 if_link: Add control trust VF
Add netlink directives and ndo entry to trust VF user.

This controls the special permission of VF user.
The administrator will dedicatedly trust VF user to use some features
which impacts security and/or performance.

The administrator never turn it on unless VF user is fully trusted.

CC: Sy Jong Choi <sy.jong.choi@intel.com>
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-23 05:44:28 -07:00
Eric Dumazet
5e0724d027 tcp/dccp: fix hashdance race for passive sessions
Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.

Only one must win the race, otherwise corruption would occur.

To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.

If request socket was not found, we have to undo the child creation.

This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 05:42:21 -07:00
Li RongQing
f6b8dec998 af_key: fix two typos
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 03:05:19 -07:00
Paolo Abeni
7b1311807f ipv4: implement support for NOPREFIXROUTE ifa flag for ipv4 address
Currently adding a new ipv4 address always cause the creation of the
related network route, with default metric. When a host has multiple
interfaces on the same network, multiple routes with the same metric
are created.

If the userspace wants to set specific metric on each routes, i.e.
giving better metric to ethernet links in respect to Wi-Fi ones,
the network routes must be deleted and recreated, which is error-prone.

This patch implements the support for IFA_F_NOPREFIXROUTE for ipv4
address. When an address is added with such flag set, no associated
network route is created, no network route is deleted when
said IP is gone and it's up to the user space manage such route.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 02:54:54 -07:00
Hannes Frederic Sowa
b72a2b01b6 ipv6: protect mtu calculation of wrap-around and infinite loop by rounding issues
Raw sockets with hdrincl enabled can insert ipv6 extension headers
right into the data stream. In case we need to fragment those packets,
we reparse the options header to find the place where we can insert
the fragment header. If the extension headers exceed the link's MTU we
actually cannot make progress in such a case.

Instead of ending up in broken arithmetic or rounding towards 0 and
entering an endless loop in ip6_fragment, just prevent those cases by
aborting early and signal -EMSGSIZE to user space.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 02:49:36 -07:00
Andrew Shewmaker
c80dbe0461 tcp: allow dctcp alpha to drop to zero
If alpha is strictly reduced by alpha >> dctcp_shift_g and if alpha is less
than 1 << dctcp_shift_g, then alpha may never reach zero. For example,
given shift_g=4 and alpha=15, alpha >> dctcp_shift_g yields 0 and alpha
remains 15. The effect isn't noticeable in this case below cwnd=137, but
could gradually drive uncongested flows with leftover alpha down to
cwnd=137. A larger dctcp_shift_g would have a greater effect.

This change causes alpha=15 to drop to 0 instead of being decrementing by 1
as it would when alpha=16. However, it requires one less conditional to
implement since it doesn't have to guard against subtracting 1 from 0U. A
decay of 15 is not unreasonable since an equal or greater amount occurs at
alpha >= 240.

Signed-off-by: Andrew G. Shewmaker <agshew@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 02:46:52 -07:00
lucien
ab997ad408 ipv6: fix the incorrect return value of throw route
The error condition -EAGAIN, which is signaled by throw routes, tells
the rules framework to walk on searching for next matches. If the walk
ends and we stop walking the rules with the result of a throw route we
have to translate the error conditions to -ENETUNREACH.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-23 02:38:18 -07:00
Steffen Klassert
cb866e3298 xfrm: Increment statistic counter on inner mode error
Increment the LINUX_MIB_XFRMINSTATEMODEERROR statistic counter
to notify about dropped packets if we fail to fetch a inner mode.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-10-23 07:52:58 +02:00
Steffen Klassert
ea673a4d3a xfrm4: Reload skb header pointers after calling pskb_may_pull.
A call to pskb_may_pull may change the pointers into the packet,
so reload the pointers after the call.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-10-23 07:32:39 +02:00
Steffen Klassert
1a14f1e555 xfrm4: Fix header checks in _decode_session4.
We skip the header informations if the data pointer points
already behind the header in question for some protocols.
This is because we call pskb_may_pull with a negative value
converted to unsigened int from pskb_may_pull in this case.
Skipping the header informations can lead to incorrect policy
lookups, so fix it by a check of the data pointer position
before we call pskb_may_pull.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-10-23 07:31:23 +02:00
Sowmini Varadhan
e33d4f13d2 xfrm: Fix unaligned access to stats in copy_to_user_state()
On sparc, deleting established SAs (e.g., by restarting ipsec)
results in unaligned access messages via xfrm_del_sa ->
km_state_notify -> xfrm_send_state_notify().

Even though struct xfrm_usersa_info is aligned on 8-byte boundaries,
netlink attributes are fundamentally only 4 byte aligned, and this
cannot be changed for nla_data() that is passed up to userspace.
As a result, the put_unaligned() macro needs to be used to
set up potentially unaligned fields such as the xfrm_stats in
copy_to_user_state()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-10-23 06:49:29 +02:00
Pravin B Shelar
fc4099f172 openvswitch: Fix egress tunnel info.
While transitioning to netdev based vport we broke OVS
feature which allows user to retrieve tunnel packet egress
information for lwtunnel devices.  Following patch fixes it
by introducing ndo operation to get the tunnel egress info.
Same ndo operation can be used for lwtunnel devices and compat
ovs-tnl-vport devices. So after adding such device operation
we can remove similar operation from ovs-vport.

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 19:39:25 -07:00
Jorgen Hansen
8566b86ab9 VSOCK: Fix lockdep issue.
The recent fix for the vsock sock_put issue used the wrong
initializer for the transport spin_lock causing an issue when
running with lockdep checking.

Testing: Verified fix on kernel with lockdep enabled.

Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 18:26:29 -07:00
David S. Miller
199c655069 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2015-10-22

1) Fix IPsec pre-encap fragmentation for GSO packets.
   From Herbert Xu.

2) Fix some header checks in _decode_session6.
   We skip the header informations if the data pointer points
   already behind the header in question for some protocols.
   This is because we call pskb_may_pull with a negative value
   converted to unsigened int from pskb_may_pull in this case.
   Skipping the header informations can lead to incorrect policy
   lookups. From Mathias Krause.

3) Allow to change the replay threshold and expiry timer of a
   state without having to set other attributes like replay
   counter and byte lifetime. Changing these other attributes
   may break the SA. From Michael Rossberg.

4) Fix pmtu discovery for local generated packets.
   We may fail dispatch to the inner address family.
   As a reault, the local error handler is not called
   and the mtu value is not reported back to userspace.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 07:46:05 -07:00
Vivien Didelot
1a49a2fbf8 net: dsa: remove port_fdb_getnext
No driver implements port_fdb_getnext anymore, and port_fdb_dump is
preferred anyway, so remove this function from DSA.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 07:38:45 -07:00
Vivien Didelot
ea70ba9806 net: dsa: add port_fdb_dump function
Not all switch chips support a Get Next operation to iterate on its FDB.
So add a more simple port_fdb_dump function for them.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 07:38:35 -07:00
David Ahern
d46a9d678e net: ipv6: Dont add RT6_LOOKUP_F_IFACE flag if saddr set
741a11d9e4 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set")
adds the RT6_LOOKUP_F_IFACE flag to make device index mismatch fatal if
oif is given. Hajime reported that this change breaks the Mobile IPv6
use case that wants to force the message through one interface yet use
the source address from another interface. Handle this case by only
adding the flag if oif is set and saddr is not set.

Fixes: 741a11d9e4 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set")
Cc: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 07:36:19 -07:00
David S. Miller
e9829b9745 Here's another set of patches for the current cycle:
* I merged net-next back to avoid a conflict with the
  * cfg80211 scheduled scan API extensions
  * preparations for better scan result timestamping
  * regulatory cleanups
  * mac80211 statistics cleanups
  * a few other small cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJWJ6lbAAoJEDBSmw7B7bqraasP/Ryaa7zL10E+dOQtqBQHQeMe
 olbrCUtTYltr4nnuESzh5WPeIVZBQ0DIduoLLF0IDSPVwE/NrbpFUVIMHvJvr+s7
 rE9k8RB4P7BMTjf+mkDX1Od9kCKGkt4ezcyt/oNIsqM12SN9JQ99itwz6Mp94xCs
 XKsiXJRh9f/8Qwd/74qQq1Va3UfGAVuKO8WpUe/A7TYTla8ZY20pv1D8kQKQzrFg
 DwsMirjmHcUpobSjnPAAmZevRxdk6o0E+P7DYG172H2Tm8/EIMR/gYMnQeYW6HkA
 lfMMDfAGmNvyRm8v1iuBLodREP4kn4VbhMSZDtH7D6FYfmJh5fSeG09bSe51G5Xh
 zv/B8A1cCbWFqtQHp3wI6ml8VDyAhDc2Hvqb75KRn6FplIkEiszVP0y3cNHWiJVt
 Ix6Sysoa6kQDXEgR50APeLJ3VI+/mhXmvIila4jP9PKhO14SDHrCoRQO62Z0COJ7
 2E5Ir2KE8T+O9mSeuB7m8xD/t60HDd3q3tLZmH0Ps6xfxKf9y2hdZacbX4Hi5Mqk
 2XxXZYnhAXUqZmZhmG3ajnEiB4UGMt21R7dIqNTaQ9chOGBkHqIZxPm82XtNb13h
 yHILavGpUDT0z6OB2z8fxUcj4a4SrrK+aiIGh4iFpDR0Nu0IyZ5cPHXY2FfvJWmD
 ZO74RMEpBodYR8BsV4yP
 =uZ5N
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2015-10-21' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Here's another set of patches for the current cycle:
 * I merged net-next back to avoid a conflict with the
 * cfg80211 scheduled scan API extensions
 * preparations for better scan result timestamping
 * regulatory cleanups
 * mac80211 statistics cleanups
 * a few other small cleanups and fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 07:28:41 -07:00
Jorgen Hansen
4ef7ea9195 VSOCK: sock_put wasn't safe to call in interrupt context
In the vsock vmci_transport driver, sock_put wasn't safe to call
in interrupt context, since that may call the vsock destructor
which in turn calls several functions that should only be called
from process context. This change defers the callling of these
functions  to a worker thread. All these functions were
deallocation of resources related to the transport itself.

Furthermore, an unused callback was removed to simplify the
cleanup.

Multiple customers have been hitting this issue when using
VMware tools on vSphere 2015.

Also added a version to the vmci transport module (starting from
1.0.2.0-k since up until now it appears that this module was
sharing version with vsock that is currently at 1.0.1.0-k).

Reviewed-by: Aditya Asarwade <asarwade@vmware.com>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 07:21:05 -07:00
David Herrmann
47191d65b6 netlink: fix locking around NETLINK_LIST_MEMBERSHIPS
Currently, NETLINK_LIST_MEMBERSHIPS grabs the netlink table while copying
the membership state to user-space. However, grabing the netlink table is
effectively a write_lock_irq(), and as such we should not be triggering
page-faults in the critical section.

This can be easily reproduced by the following snippet:
    int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    void *p = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
    int r = getsockopt(s, 0x10e, 9, p, (void*)((char*)p + 4092));

This should work just fine, but currently triggers EFAULT and a possible
WARN_ON below handle_mm_fault().

Fix this by reducing locking of NETLINK_LIST_MEMBERSHIPS to a read-side
lock. The write-lock was overkill in the first place, and the read-lock
allows page-faults just fine.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 07:18:28 -07:00
Pravin B Shelar
aec1592474 openvswitch: Use dev_queue_xmit for vport send.
With use of lwtunnel, we can directly call dev_queue_xmit()
rather than calling netdev vport send operation.
Following change make tunnel vport code bit cleaner.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 06:46:16 -07:00
Pravin B Shelar
99e28f18e3 openvswitch: Fix incorrect type use.
Patch fixes following sparse warning.
net/openvswitch/flow_netlink.c:583:30: warning: incorrect type in assignment (different base types)
net/openvswitch/flow_netlink.c:583:30:    expected restricted __be16 [usertype] ipv4
net/openvswitch/flow_netlink.c:583:30:    got int

Fixes: 6b26ba3a7d ("openvswitch: netlink attributes for IPv6 tunneling")
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 06:46:13 -07:00
Eric Dumazet
45efccdbec netfilter: xt_TEE: fix NULL dereference
iptables -I INPUT ... -j TEE --gateway 10.1.2.3

<crash> because --oif was not specified

tee_tg_check() sets ->priv pointer to NULL in this case.

Fixes: bbde9fc182 ("netfilter: factor out packet duplication for IPv4/IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-22 12:55:25 +02:00
Marcel Holtmann
13972adc32 Bluetooth: Increase minor version of core module
With the addition of support for diagnostic feature, it makes sense to
increase the minor version of the Bluetooth core module.

The module version is not used anywhere, but it gives a nice extra
hint for debugging purposes.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-22 13:37:26 +03:00
Alexander Aring
aeedebff69 ieee802154: 6lowpan: fix memory leak
Looking at current situation of memory management in 6lowpan receive
function I detected some invalid handling. After calling
lowpan_invoke_rx_handlers we will do a kfree_skb and then NET_RX_DROP on
error handling. We don't do this before, also on
skb_share_check/skb_unshare which might manipulate the reference
counters.

After running some 'grep -r "dev_add_pack" net/' to look how others
packet-layer receive callbacks works I detected that every subsystem do
a kfree_skb, then NET_RX_DROP without calling skb functions which
might manipulate the skb reference counters. This is the reason why we
should do the same here like all others subsystems. I didn't find any
documentation how the packet-layer receive callbacks handle NET_RX_DROP
return values either.

This patch will add a kfree_skb, then NET_RX_DROP handling for the
"trivial checks", in case of skb_share_check/skb_unshare the kfree_skb
call will be done inside these functions.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-22 12:24:42 +02:00
Johan Hedberg
88d07feb09 Bluetooth: Make hci_disconnect() behave correctly for all states
There are a few places that don't explicitly check the connection
state before calling hci_disconnect(). To make this API do the right
thing take advantage of the new hci_abort_conn() API and also make
sure to only read the clock offset if we're really connected.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-22 11:37:22 +02:00
Johan Hedberg
89e0ccc882 Bluetooth: Take advantage of connection abort helpers
Convert the various places mapping connection state to
disconnect/cancel HCI command to use the new hci_abort_conn helper
API.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-22 11:37:22 +02:00
Johan Hedberg
dcc0f0d9ce Bluetooth: Introduce hci_req helper to abort a connection
There are several different places needing to make sure that a
connection gets disconnected or canceled. The exact action needed
depends on the connection state, so centralizing this logic can save
quite a lot of code duplication.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-22 11:37:22 +02:00
Johan Hedberg
c81d555a26 Bluetooth: Fix crash in SMP when unpairing
When unpairing the keys stored in hci_dev are removed. If SMP is
ongoing the SMP context will also have references to these keys, so
removing them from the hci_dev lists will make the pointers invalid.
This can result in the following type of crashes:

 BUG: unable to handle kernel paging request at 6b6b6b6b
 IP: [<c11f26be>] __list_del_entry+0x44/0x71
 *pde = 00000000
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in: hci_uart btqca btusb btintel btbcm btrtl hci_vhci rfcomm bluetooth_6lowpan bluetooth
 CPU: 0 PID: 723 Comm: kworker/u5:0 Not tainted 4.3.0-rc3+ #1379
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
 Workqueue: hci0 hci_rx_work [bluetooth]
 task: f19da940 ti: f1a94000 task.ti: f1a94000
 EIP: 0060:[<c11f26be>] EFLAGS: 00010202 CPU: 0
 EIP is at __list_del_entry+0x44/0x71
 EAX: c0088d20 EBX: f30fcac0 ECX: 6b6b6b6b EDX: 6b6b6b6b
 ESI: f4b60000 EDI: c0088d20 EBP: f1a95d90 ESP: f1a95d8c
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
 CR0: 8005003b CR2: 6b6b6b6b CR3: 319e5000 CR4: 00000690
 Stack:
  f30fcac0 f1a95db0 f82dc3e1 f1bfc000 00000000 c106524f f1bfc000 f30fd020
  f1a95dc0 f1a95dd0 f82dcbdb f1a95de0 f82dcbdb 00000067 f1bfc000 f30fd020
  f1a95de0 f1a95df0 f82d1126 00000067 f82d1126 00000006 f30fd020 f1bfc000
 Call Trace:
  [<f82dc3e1>] smp_chan_destroy+0x192/0x240 [bluetooth]
  [<c106524f>] ? trace_hardirqs_on_caller+0x14e/0x169
  [<f82dcbdb>] smp_teardown_cb+0x47/0x64 [bluetooth]
  [<f82dcbdb>] ? smp_teardown_cb+0x47/0x64 [bluetooth]
  [<f82d1126>] l2cap_chan_del+0x5d/0x14d [bluetooth]
  [<f82d1126>] ? l2cap_chan_del+0x5d/0x14d [bluetooth]
  [<f82d40ef>] l2cap_conn_del+0x109/0x17b [bluetooth]
  [<f82d40ef>] ? l2cap_conn_del+0x109/0x17b [bluetooth]
  [<f82c0205>] ? hci_event_packet+0x5b1/0x2092 [bluetooth]
  [<f82d41aa>] l2cap_disconn_cfm+0x49/0x50 [bluetooth]
  [<f82d41aa>] ? l2cap_disconn_cfm+0x49/0x50 [bluetooth]
  [<f82c0228>] hci_event_packet+0x5d4/0x2092 [bluetooth]
  [<c1332c16>] ? skb_release_data+0x6a/0x95
  [<f82ce5d4>] ? hci_send_to_monitor+0xe7/0xf4 [bluetooth]
  [<c1409708>] ? _raw_spin_unlock_irqrestore+0x44/0x57
  [<f82b3bb0>] hci_rx_work+0xf1/0x28b [bluetooth]
  [<f82b3bb0>] ? hci_rx_work+0xf1/0x28b [bluetooth]
  [<c10635a0>] ? __lock_is_held+0x2e/0x44
  [<c104772e>] process_one_work+0x232/0x432
  [<c1071ddc>] ? rcu_read_lock_sched_held+0x50/0x5a
  [<c104772e>] ? process_one_work+0x232/0x432
  [<c1047d48>] worker_thread+0x1b8/0x255
  [<c1047b90>] ? rescuer_thread+0x23c/0x23c
  [<c104bb71>] kthread+0x91/0x96
  [<c14096a7>] ? _raw_spin_unlock_irq+0x27/0x44
  [<c1409d61>] ret_from_kernel_thread+0x21/0x30
  [<c104bae0>] ? kthread_parkme+0x1e/0x1e

To solve the issue, introduce a new smp_cancel_pairing() API that can
be used to clean up the SMP state before touching the hci_dev lists.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-22 09:02:03 +02:00
Johan Hedberg
fc64361ac1 Bluetooth: Disable auto-connection parameters when unpairing
For connection parameters that are left around until a disconnection
we should at least clear any auto-connection properties. This way a
new Add Device call is required to re-set them after calling Unpair
Device.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-22 09:02:03 +02:00
Eric Dumazet
feec0cb3f2 ipv6: gro: support sit protocol
Tom Herbert added SIT support to GRO with commit
19424e052f ("sit: Add gro callbacks to sit_offload"),
later reverted by Herbert Xu.

The problem came because Tom patch was building GRO
packets without proper meta data : If packets were locally
delivered, we would not care.

But if packets needed to be forwarded, GSO engine was not
able to segment individual segments.

With the following patch, we correctly set skb->encapsulation
and inner network header. We also update gso_type.

Tested:

Server :
netserver
modprobe dummy
ifconfig dummy0 8.0.0.1 netmask 255.255.255.0 up
arp -s 8.0.0.100 4e:32:51:04:47:e5
iptables -I INPUT -s 10.246.7.151 -j TEE --gateway 8.0.0.100
ifconfig sixtofour0
sixtofour0 Link encap:IPv6-in-IPv4
          inet6 addr: 2002:af6:798::1/128 Scope:Global
          inet6 addr: 2002:af6:798::/128 Scope:Global
          UP RUNNING NOARP  MTU:1480  Metric:1
          RX packets:411169 errors:0 dropped:0 overruns:0 frame:0
          TX packets:409414 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20319631739 (20.3 GB)  TX bytes:29529556 (29.5 MB)

Client :
netperf -H 2002:af6:798::1 -l 1000 &

Checked on server traffic copied on dummy0 and verify segments were
properly rebuilt, with proper IP headers, TCP checksums...

tcpdump on eth0 shows proper GRO aggregation takes place.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:36:11 -07:00
Joe Stringer
e754ec69ab openvswitch: Serialize nested ct actions if provided
If userspace provides a ct action with no nested mark or label, then the
storage for these fields is zeroed. Later when actions are requested,
such zeroed fields are serialized even though userspace didn't
originally specify them. Fix the behaviour by ensuring that no action is
serialized in this case, and reject actions where userspace attempts to
set these fields with mask=0. This should make netlink marshalling
consistent across deserialization/reserialization.

Reported-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:33:43 -07:00
Joe Stringer
4f0909ee3d openvswitch: Mark connections new when not confirmed.
New, related connections are marked as such as part of ovs_ct_lookup(),
but they are not marked as "new" if the commit flag is used. Make this
consistent by setting the "new" flag whenever !nf_ct_is_confirmed(ct).

Reported-by: Jarno Rajahalme <jrajahalme@nicira.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:33:40 -07:00
Joe Stringer
9e384715e9 openvswitch: Reject ct_state masks for unknown bits
Currently, 0-bits are generated in ct_state where the bit position is
undefined, and matches are accepted on these bit-positions. If userspace
requests to match the 0-value for this bit then it may expect only a
subset of traffic to match this value, whereas currently all packets
will have this bit set to 0. Fix this by rejecting such masks.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:33:36 -07:00
Renato Westphal
e2e8009ff7 tcp: remove improper preemption check in tcp_xmit_probe_skb()
Commit e520af48c7 introduced the following bug when setting the
TCP_REPAIR sockoption:

[ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164
[ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20
[ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1
[ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012
[ 2860.657054]  ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002
[ 2860.657058]  0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18
[ 2860.657062]  ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38
[ 2860.657065] Call Trace:
[ 2860.657072]  [<ffffffff8185d765>] dump_stack+0x4f/0x7b
[ 2860.657075]  [<ffffffff8146ed91>] check_preemption_disabled+0xe1/0xf0
[ 2860.657078]  [<ffffffff8146edd3>] __this_cpu_preempt_check+0x13/0x20
[ 2860.657082]  [<ffffffff817e0bc7>] tcp_xmit_probe_skb+0xc7/0x100
[ 2860.657085]  [<ffffffff817e1e2d>] tcp_send_window_probe+0x2d/0x30
[ 2860.657089]  [<ffffffff817d1d8c>] do_tcp_setsockopt.isra.29+0x74c/0x830
[ 2860.657093]  [<ffffffff817d1e9c>] tcp_setsockopt+0x2c/0x30
[ 2860.657097]  [<ffffffff81767b74>] sock_common_setsockopt+0x14/0x20
[ 2860.657100]  [<ffffffff817669e1>] SyS_setsockopt+0x71/0xc0
[ 2860.657104]  [<ffffffff81865172>] entry_SYSCALL_64_fastpath+0x16/0x75

Since tcp_xmit_probe_skb() can be called from process context, use
NET_INC_STATS() instead of NET_INC_STATS_BH().

Fixes: e520af48c7 ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters")
Signed-off-by: Renato Westphal <renatow@taghos.com.br>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:29:26 -07:00
David S. Miller
36a28b2116 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains four Netfilter fixes for net, they are:

1) Fix Kconfig dependencies of new nf_dup_ipv4 and nf_dup_ipv6.

2) Remove bogus test nh_scope in IPv4 rpfilter match that is breaking
   --accept-local, from Xin Long.

3) Wait for RCU grace period after dropping the pending packets in the
   nfqueue, from Florian Westphal.

4) Fix sleeping allocation while holding spin_lock_bh, from Nikolay Borisov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:26:17 -07:00
Arad, Ronen
b1974ed05e netlink: Rightsize IFLA_AF_SPEC size calculation
if_nlmsg_size() overestimates the minimum allocation size of netlink
dump request (when called from rtnl_calcit()) or the size of the
message (when called from rtnl_getlink()). This is because
ext_filter_mask is not supported by rtnl_link_get_af_size() and
rtnl_link_get_size().

The over-estimation is significant when at least one netdev has many
VLANs configured (8 bytes for each configured VLAN).

This patch-set "rightsizes" the protocol specific attribute size
calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
and adding this a argument to get_link_af_size op in rtnl_af_ops.

Bridge module already used filtering aware sizing for notifications.
br_get_link_af_size_filtered() is consistent with the modified
get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
br_get_link_af_size() becomes unused and thus removed.

Signed-off-by: Ronen Arad <ronen.arad@intel.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:15:20 -07:00
Jon Paul Maloy
e53567948f tipc: conditionally expand buffer headroom over udp tunnel
In commit d999297c3d ("tipc: reduce locking scope during packet reception")
we altered the packet retransmission function. Since then, when
restransmitting packets, we create a clone of the original buffer
using __pskb_copy(skb, MIN_H_SIZE), where MIN_H_SIZE is the size of
the area we want to have copied, but also the smallest possible TIPC
packet size. The value of MIN_H_SIZE is 24.

Unfortunately, __pskb_copy() also has the effect that the headroom
of the cloned buffer takes the size MIN_H_SIZE. This is too small
for carrying the packet over the UDP tunnel bearer, which requires
a minimum headroom of 28 bytes. A change to just use pskb_copy()
lets the clone inherit the original headroom of 80 bytes, but also
assumes that the copied data area is of at least that size, something
that is not always the case. So that is not a viable solution.

We now fix this by adding a check for sufficient headroom in the
transmit function of udp_media.c, and expanding it when necessary.

Fixes: commit d999297c3d ("tipc: reduce locking scope during packet reception")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:13:48 -07:00
Jon Paul Maloy
45c8b7b175 tipc: allow non-linear first fragment buffer
The current code for message reassembly is erroneously assuming that
the the first arriving fragment buffer always is linear, and then goes
ahead resetting the fragment list of that buffer in anticipation of
more arriving fragments.

However, if the buffer already happens to be non-linear, we will
inadvertently drop the already attached fragment list, and later
on trig a BUG() in __pskb_pull_tail().

We see this happen when running fragmented TIPC multicast across UDP,
something made possible since
commit d0f91938be ("tipc: add ip/udp media type")

We fix this by not resetting the fragment list when the buffer is non-
linear, and by initiatlizing our private fragment list tail pointer to
the tail of the existing fragment list.

Fixes: commit d0f91938be ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:08:24 -07:00
James Morse
1241365f1a openvswitch: Allocate memory for ovs internal device stats.
"openvswitch: Remove vport stats" removed the per-vport statistics, in
order to use the netdev's statistics fields.
"openvswitch: Fix ovs_vport_get_stats()" fixed the export of these stats
to user-space, by using the provided netdev_ops to collate them - but ovs
internal devices still use an unallocated dev->tstats field to count
packets, which are no longer exported by this api.

Allocate the dev->tstats field for ovs internal devices, and wire up
ndo_get_stats64 with the original implementation of
ovs_vport_get_stats().

On its own, "openvswitch: Fix ovs_vport_get_stats()" fixes the OOPs,
unmasking a full-on panic on arm64:

=============%<==============
[<ffffffbffc00ce4c>] internal_dev_recv+0xa8/0x170 [openvswitch]
[<ffffffbffc0008b4>] do_output.isra.31+0x60/0x19c [openvswitch]
[<ffffffbffc000bf8>] do_execute_actions+0x208/0x11c0 [openvswitch]
[<ffffffbffc001c78>] ovs_execute_actions+0xc8/0x238 [openvswitch]
[<ffffffbffc003dfc>] ovs_packet_cmd_execute+0x21c/0x288 [openvswitch]
[<ffffffc0005e8c5c>] genl_family_rcv_msg+0x1b0/0x310
[<ffffffc0005e8e60>] genl_rcv_msg+0xa4/0xe4
[<ffffffc0005e7ddc>] netlink_rcv_skb+0xb0/0xdc
[<ffffffc0005e8a94>] genl_rcv+0x38/0x50
[<ffffffc0005e76c0>] netlink_unicast+0x164/0x210
[<ffffffc0005e7b70>] netlink_sendmsg+0x304/0x368
[<ffffffc0005a21c0>] sock_sendmsg+0x30/0x4c
[SNIP]
Kernel panic - not syncing: Fatal exception in interrupt
=============%<==============

Fixes: 8c876639c9 ("openvswitch: Remove vport stats.")
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:06:36 -07:00
David Ahern
f1900fb5ec net: Really fix vti6 with oif in dst lookups
6e28b00082 ("net: Fix vti use case with oif in dst lookups for IPv6")
is missing the checks on FLOWI_FLAG_SKIP_NH_OIF. Add them.

Fixes: 42a7b32b73 ("xfrm: Add oif to dst lookups")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:04:54 -07:00
Jon Paul Maloy
53387c4e22 tipc: extend broadcast link window size
The default fix broadcast window size is currently set to 20 packets.
This is a very low value, set at a time when we were still testing on
10 Mb/s hubs, and a change to it is long overdue.

Commit 7845989cb4 ("net: tipc: fix stall during bclink wakeup procedure")
revealed a problem with this low value. For messages of importance LOW,
the backlog queue limit will be  calculated to 30 packets, while a
single, maximum sized message of 66000 bytes, carried across a 1500 MTU
network consists of 46 packets.

This leads to the following scenario (among others leading to the same
situation):

1: Msg 1 of 46 packets is sent. 20 packets go to the transmit queue, 26
   packets to the backlog queue.
2: Msg 2 of 46 packets is attempted sent, but rejected because there is
   no more space in the backlog queue at this level. The sender is added
   to the wakeup queue with a "pending packets chain size" number of 46.
3: Some packets in the transmit queue are acked and released. We try to
   wake up the sender, but the pending size of 46 is bigger than the LOW
   wakeup limit of 30, so this doesn't happen.
5: Subsequent acks releases all the remaining buffers. Each time we test
   for the wakeup criteria and find that 46 still is larger than 30,
   even after both the transmit and the backlog queues are empty.
6: The sender is never woken up and given a chance to send its message.
   He is stuck.

We could now loosen the wakeup criteria (used by link_prepare_wakeup())
to become equal to the send criteria (used by tipc_link_xmit()), i.e.,
by ignoring the "pending packets chain size" value altogether, or we can
just increase the queue limits so that the criteria can be satisfied
anyway. There are good reasons (potentially multiple waiting senders) to
not opt for the former solution, so we choose the latter one.

This commit fixes the problem by giving the broadcast link window a
default value of 50 packets. We also introduce a new minimum link
window size BCLINK_MIN_WIN of 32, which is enough to always avoid the
described situation. Finally, in order to not break any existing users
which may set the window explicitly, we enforce that the window is set
to the new minimum value in case the user is trying to set it to
anything lower.

Fixes: 7845989cb4 ("net: tipc: fix stall during bclink wakeup procedure")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:02:17 -07:00
Doug Ledford
fc81a06965 Merge branch 'k.o/for-4.3-v1' into k.o/for-4.4
Pick up the late fixes from the 4.3 cycle so we have them in our
next branch.
2015-10-21 16:40:21 -04:00
Johan Hedberg
17bc08f0d1 Bluetooth: Remove unnecessary hci_explicit_connect_lookup function
There's only one user of this helper which can be replaces with a call
to hci_pend_le_action_lookup() and a check for params->explicit_connect.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:58:23 +02:00
Johan Hedberg
1ede9868f6 Bluetooth: Remove redundant (and possibly wrong) flag clearing
There's no need to clear the HCI_CONN_ENCRYPT_PEND flag in
smp_failure. In fact, this may cause the encryption tracking to get
out of sync as this has nothing to do with HCI activity.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:57:03 +02:00
Johan Hedberg
b5c2b6214c Bluetooth: Add hdev helper variable to hci_le_create_connection_cancel
The hci_le_create_connection_cancel() function needs to use the hdev
pointer in many places so add a variable for it to avoid the need to
dereference the hci_conn every time.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:45:43 +02:00
Johan Hedberg
ec182f0397 Bluetooth: Remove unnecessary indentation in unpair_device()
Instead of doing all of the LE-specific handling in an else-branch in
unpair_device() create a 'done' label for the BR/EDR branch to jump to
and then remove the else-branch completely.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:40:21 +02:00
Johan Hedberg
f5ad4ffceb Bluetooth: 6lowpan: Use hci_conn_hash_lookup_le() when possible
Use the new hci_conn_hash_lookup_le() API to look up LE connections.
This way we're guaranteed exact matches that also take into account
the address type.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:39:16 +02:00
Johan Hedberg
9d4c1cc15b Bluetooth: Use hci_conn_hash_lookup_le() when possible
Use the new hci_conn_hash_lookup_le() API to look up LE connections.
This way we're guaranteed exact matches that also take into account
the address type.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:38:22 +02:00
Johan Hedberg
85813a7ec7 Bluetooth: Add le_addr_type() helper function
The mgmt code needs to convert from mgmt/L2CAP address types to HCI in
many places. Having a dedicated helper function for this simplifies
code by shortening it and removing unnecessary 'addr_type' variables.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 18:35:00 +02:00
Elad Raz
6ac311ae8b Adding switchdev ageing notification on port bridged
Configure ageing time to the HW for newly bridged device

CC: Scott Feldman <sfeldma@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Elad Raz <eladr@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:50:57 -07:00
Dan Carpenter
50010c2059 irda: precedence bug in irlmp_seq_hb_idx()
This is decrementing the pointer, instead of the value stored in the
pointer.  KASan detects it as an out of bounds reference.

Reported-by: "Berry Cheng 程君(成淼)" <chengmiao.cj@alibaba-inc.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:48:26 -07:00
Gao feng
f6a835bb04 vsock: fix missing cleanup when misc_register failed
reset transport and unlock if misc_register failed.

Signed-off-by: Gao feng <omarapazanadi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:41:06 -07:00
David Howells
146aa8b145 KEYS: Merge the type-specific data with the payload data
Merge the type-specific data with the payload data into one four-word chunk
as it seems pointless to keep them separate.

Use user_key_payload() for accessing the payloads of overloaded
user-defined keys.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cifs@vger.kernel.org
cc: ecryptfs@vger.kernel.org
cc: linux-ext4@vger.kernel.org
cc: linux-f2fs-devel@lists.sourceforge.net
cc: linux-nfs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-ima-devel@lists.sourceforge.net
2015-10-21 15:18:36 +01:00
Yuchung Cheng
4f41b1c58a tcp: use RACK to detect losses
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.

tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.

The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.

Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.

We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:53 -07:00
Yuchung Cheng
659a8ad56f tcp: track the packet timings in RACK
This patch is the first half of the RACK loss recovery.

RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.

But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery

RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.

This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.

Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.

We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.

The second half is implemented in the next patch that marks packet
lost using RACK timestamp.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:48 -07:00
Yuchung Cheng
77c631273d tcp: add tcp_tsopt_ecr_before helper
a helper to prepare the main RACK patch

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:45 -07:00
Yuchung Cheng
af82f4e848 tcp: remove tcp_mark_lost_retrans()
Remove the existing lost retransmit detection because RACK subsumes
it completely. This also stops the overloading the ack_seq field of
the skb control block.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:44 -07:00
Yuchung Cheng
f672258391 tcp: track min RTT using windowed min-filter
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.

Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.

Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code

The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.

The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:43 -07:00
Yuchung Cheng
9e45a3e36b tcp: apply Kern's check on RTTs used for congestion control
Currently ca_seq_rtt_us does not use Kern's check. Fix that by
checking if any packet acked is a retransmit, for both RTT used
for RTT estimation and congestion control.

Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 07:00:41 -07:00
Johan Hedberg
8ce783dc5e Bluetooth: Fix missing hdev locking for LE scan cleanup
The hci_conn objects don't have a dedicated lock themselves but rely
on the caller to hold the hci_dev lock for most types of access. The
hci_conn_timeout() function has so far sent certain HCI commands based
on the hci_conn state which has been possible without holding the
hci_dev lock.

The recent changes to do LE scanning before connect attempts added
even more operations to hci_conn and hci_dev from hci_conn_timeout,
thereby exposing potential race conditions with the hci_dev and
hci_conn states.

As an example of such a race, here there's a timeout but an
l2cap_sock_connect() call manages to race with the cleanup routine:

[Oct21 08:14] l2cap_chan_timeout: chan ee4b12c0 state BT_CONNECT
[  +0.000004] l2cap_chan_close: chan ee4b12c0 state BT_CONNECT
[  +0.000002] l2cap_chan_del: chan ee4b12c0, conn f3141580, err 111, state BT_CONNECT
[  +0.000002] l2cap_sock_teardown_cb: chan ee4b12c0 state BT_CONNECT
[  +0.000005] l2cap_chan_put: chan ee4b12c0 orig refcnt 4
[  +0.000010] hci_conn_drop: hcon f53d56e0 orig refcnt 1
[  +0.000013] l2cap_chan_put: chan ee4b12c0 orig refcnt 3
[  +0.000063] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
[  +0.000049] hci_conn_params_del: addr ee:0d:30:09:53:1f (type 1)
[  +0.000002] hci_chan_list_flush: hcon f53d56e0
[  +0.000001] hci_chan_del: hci0 hcon f53d56e0 chan f4e7ccc0
[  +0.004528] l2cap_sock_create: sock e708fc00
[  +0.000023] l2cap_chan_create: chan ee4b1770
[  +0.000001] l2cap_chan_hold: chan ee4b1770 orig refcnt 1
[  +0.000002] l2cap_sock_init: sk ee4b3390
[  +0.000029] l2cap_sock_bind: sk ee4b3390
[  +0.000010] l2cap_sock_setsockopt: sk ee4b3390
[  +0.000037] l2cap_sock_connect: sk ee4b3390
[  +0.000002] l2cap_chan_connect: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f (type 2) psm 0x00
[  +0.000002] hci_get_route: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f
[  +0.000001] hci_dev_hold: hci0 orig refcnt 8
[  +0.000003] hci_conn_hold: hcon f53d56e0 orig refcnt 0

Above the l2cap_chan_connect() shouldn't have been able to reach the
hci_conn f53d56e0 anymore but since hci_conn_timeout didn't do proper
locking that's not the case. The end result is a reference to hci_conn
that's not in the conn_hash list, resulting in list corruption when
trying to remove it later:

[Oct21 08:15] l2cap_chan_timeout: chan ee4b1770 state BT_CONNECT
[  +0.000004] l2cap_chan_close: chan ee4b1770 state BT_CONNECT
[  +0.000003] l2cap_chan_del: chan ee4b1770, conn f3141580, err 111, state BT_CONNECT
[  +0.000001] l2cap_sock_teardown_cb: chan ee4b1770 state BT_CONNECT
[  +0.000005] l2cap_chan_put: chan ee4b1770 orig refcnt 4
[  +0.000002] hci_conn_drop: hcon f53d56e0 orig refcnt 1
[  +0.000015] l2cap_chan_put: chan ee4b1770 orig refcnt 3
[  +0.000038] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
[  +0.000003] hci_chan_list_flush: hcon f53d56e0
[  +0.000002] hci_conn_hash_del: hci0 hcon f53d56e0
[  +0.000001] ------------[ cut here ]------------
[  +0.000461] WARNING: CPU: 0 PID: 1782 at lib/list_debug.c:56 __list_del_entry+0x3f/0x71()
[  +0.000839] list_del corruption, f53d56e0->prev is LIST_POISON2 (00000200)

The necessary fix is unfortunately more complicated than just adding
hci_dev_lock/unlock calls to the hci_conn_timeout() call path.
Particularly, the hci_conn_del() API, which expects the hci_dev lock to
be held, performs a cancel_delayed_work_sync(&hcon->disc_work) which
would lead to a deadlock if the hci_conn_timeout() call path tries to
acquire the same lock.

This patch solves the problem by deferring the cleanup work to a
separate work callback. To protect against the hci_dev or hci_conn
going away meanwhile temporary references are taken with the help of
hci_dev_hold() and hci_conn_get().

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.3
2015-10-21 14:25:34 +02:00
Johannes Berg
e5a9f8d046 mac80211: move station statistics into sub-structs
Group station statistics by where they're (mostly) updated
(TX, RX and TX-status) and group them into sub-structs of
the struct sta_info.

Also rename the variables since the grouping now makes it
obvious where they belong.

This makes it easier to identify where the statistics are
updated in the code, and thus easier to think about them.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21 10:08:22 +02:00
Johannes Berg
976bd9efda mac80211: move beacon_loss_count into ifmgd
There's little point in keeping (and even sending to userspace)
the beacon_loss_count value per station, since it can only apply
to the AP on a managed-mode connection. Move the value to ifmgd,
advertise it only in managed mode, and remove it from ethtool as
it's available through better interfaces.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21 10:08:21 +02:00
Johannes Berg
763aa27a29 mac80211: remove sta->last_ack_signal
This file only feeds a debugfs file that isn't very useful, so remove
it. If necessary, we can add other ways to get this information, for
example in the NL80211_CMD_PROBE_CLIENT response.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21 10:08:21 +02:00
Marcel Holtmann
98a63aaf24 Bluetooth: Introduce driver specific post init callback
Some drivers might have to restore certain settings after the init
procedure has been completed. This driver callback allows them to hook
into that stage. This callback is run just before the controller is
declared as powered up.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 07:30:53 +03:00
Dean Jenkins
9f7378a9d6 Bluetooth: l2cap_disconnection_req priority over shutdown
There is a L2CAP protocol race between the local peer and
the remote peer demanding disconnection of the L2CAP link.

When L2CAP ERTM is used, l2cap_sock_shutdown() can be called
from userland to disconnect L2CAP. However, there can be a
delay introduced by waiting for ACKs. During this waiting
period, the remote peer may have sent a Disconnection Request.
Therefore, recheck the shutdown status of the socket
after waiting for ACKs because there is no need to do
further processing if the connection has gone.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Harish Jenny K N <harish_kandiga@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:26 +02:00
Dean Jenkins
04ba72e6b2 Bluetooth: Reorganize mutex lock in l2cap_sock_shutdown()
This commit reorganizes the mutex lock and is now
only protecting l2cap_chan_close(). This is now consistent
with other places where l2cap_chan_close() is called.

If a conn connection exists, call
mutex_lock(&conn->chan_lock) before calling l2cap_chan_close()
to ensure other L2CAP protocol operations do not interfere.

Note that the conn structure has to be protected from being
freed as it is possible for the connection to be disconnected
whilst the locks are not held. This solution allows the mutex
lock to be used even when the connection has just been
disconnected.

This commit also reduces the scope of chan locking.

The only place where chan locking is needed is the call to
l2cap_chan_close(chan, 0) which if necessary closes the channel.
Therefore, move the l2cap_chan_lock(chan) and
l2cap_chan_lock(chan) locking calls to around
l2cap_chan_close(chan, 0).

This allows __l2cap_wait_ack(sk, chan) to be called with no
chan locks being held so L2CAP messaging over the ACL link
can be done unimpaired.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Harish Jenny K N <harish_kandiga@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:26 +02:00
Dean Jenkins
e7456437c1 Bluetooth: Unwind l2cap_sock_shutdown()
l2cap_sock_shutdown() is designed to only action shutdown
of the channel when shutdown is not already in progress.
Therefore, reorganise the code flow by adding a goto
to jump to the end of function handling when shutdown is
already being actioned. This removes one level of code
indentation and make the code more readable.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Harish Jenny K N <harish_kandiga@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:26 +02:00
Alexander Aring
09bf420f10 6lowpan: put mcast compression in an own function
This patch moves the mcast compression algorithmn to an own function
like all other compression/decompression methods in iphc.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:25 +02:00
Alexander Aring
b5af9bdbfe 6lowpan: rework tc and flow label handling
This patch reworks the handling of compression/decompression of traffic
class and flow label handling. The current method is hard to understand,
also doesn't checks if we can read the buffer from skb length.

I tried to put the shifting operations into static inline functions and
comment each steps which I did there to make it hopefully somewhat more
readable. The big mess to deal with that is the that the ipv6 header
bring the order "DSCP + ECN" but iphc uses "ECN + DSCP". Additional the
DCSP + ECN bits are splitted in ipv6_hdr inside the priority and
flow_lbl[0] fields.

I tested these compressions by using fakelb 802.15.4 driver and
manipulate the tc and flow label fields manually in function
"__ip6_local_out" before the skb will be send to lower layers. Then I
looked up the tc and flow label fields in wireshark on a wpan and lowpan
interface.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:25 +02:00
Alexander Aring
c8a3e7eb98 6lowpan: iphc: change define values
This patch has the main goal to delete shift operations. Instead we
doing masks and equals afterwards. E.g. for the SAM evaluation we
masking only the SAM value which fits in iphc1 byte, then comparing with
all possible SAM values over a switch case statement. We will not
shifting the SAM value to somewhat readable anymore.
Additional this patch slighty change the naming style like RFC 6282,
e.g. TTL to HLIM and we will drop an errno now if CID flag is set,
because we don't support it.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:25 +02:00
Alexander Aring
028b2a8c16 6lowpan: remove lowpan_is_addr_broadcast
This macro is used at 802.15.4 6LoWPAN only and can be replaced by
memcmp with the interface broadcast address.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:25 +02:00
Alexander Aring
6350047eb8 6lowpan: move IPHC functionality defines
This patch removes the IPHC related defines for doing bit manipulation
from global 6lowpan header to the iphc file which should the only one
implementation which use these defines.

Also move next header compression defines to their nhc implementation.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:25 +02:00
Alexander Aring
607b0bd3f2 6lowpan: nhc: move iphc manipulation out of nhc
This patch moves the iphc setting of next header commpression bit inside
iphc functionality. Setting of IPHC bits should be happen at iphc.c file
only.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:25 +02:00
Alexander Aring
478208e3b9 6lowpan: remove lowpan_fetch_skb_u8
This patch removes the lowpan_fetch_skb_u8 function for getting the iphc
bytes. Instead we using the generic which has a len parameter to tell
the amount of bytes to fetch.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:25 +02:00
Alexander Aring
8911d7748c 6lowpan: cleanup lowpan_header_decompress
This patch changes the lowpan_header_decompress function by removing
inklayer related information from parameters. This is currently for
supporting short and extended address for iphc handling in 802154.
We don't support short address handling anyway right now, but there
exists already code for handling short addresses in
lowpan_header_decompress.

The address parameters are also changed to a void pointer, so 6LoWPAN
linklayer specific code can put complex structures as these parameters
and cast it again inside the generic code by evaluating linklayer type
before. The order is also changed by destination address at first and
then source address, which is the same like all others functions where
destination is always the first, memcpy, dev_hard_header,
lowpan_header_compress, etc.

This patch also moves the fetching of iphc values from 6LoWPAN linklayer
specific code into the generic branch.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:24 +02:00
Alexander Aring
a6f773891a 6lowpan: cleanup lowpan_header_compress
This patch changes the lowpan_header_compress function by removing
unused parameters like "len" and drop static value parameters of
protocol type. Instead we really check the protocol type inside inside
the skb structure. Also we drop the use of IEEE802154_ADDR_LEN which is
link-layer specific. Instead we using EUI64_ADDR_LEN which should always
the default case for now.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:24 +02:00
Alexander Aring
bf513fd6fc 6lowpan: introduce LOWPAN_IPHC_MAX_HC_BUF_LEN
This patch introduces the LOWPAN_IPHC_MAX_HC_BUF_LEN define which
represent the worst-case supported IPHC buffer length. It's used to
allocate the stack buffer space for creating the IPHC header.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:24 +02:00
Alexander Aring
cefdb801c8 bluetooth: 6lowpan: use lowpan dispatch helpers
This patch adds a check if the dataroom of skb contains a dispatch value
by checking if skb->len != 0. This patch also change the dispatch
evaluation by the recently introduced helpers for checking the common
6LoWPAN dispatch values for IPv6 and IPHC header.

There was also a forgotten else branch which should drop the packet if
no matching dispatch is available.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:24 +02:00
Alexander Aring
71cd2aa53d mac802154: llsec: use kzfree
This patch will use kzfree instead kfree for security related
information which can be offered by acccident.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:24 +02:00
Johan Hedberg
a6ad2a6b9c Bluetooth: Fix removing connection parameters when unpairing
The commit 89cbb0638e introduced support for deferred connection
parameter removal when unpairing by removing them only once an
existing connection gets disconnected. However, it failed to address
the scenario when we're *not* connected and do an unpair operation.

What makes things worse is that most user space BlueZ versions will
first issue a disconnect request and only then unpair, meaning the
buggy code will be triggered every time. This effectively causes the
kernel to resume scanning and reconnect to a device for which we've
removed all keys and GATT database information.

This patch fixes the issue by adding the missing call to the
hci_conn_params_del() function to a branch which handles the case of
no existing connection.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 3.19+
2015-10-21 00:49:24 +02:00
Marcel Holtmann
e131d74a3a Bluetooth: Add support setup stage internal notification event
Before the vendor specific setup stage is triggered call back into the
core to trigger an internal notification event. That event is used to
send an index update to the monitor interface. With that specific event
it is possible to update userspace with manufacturer information before
any HCI command has been executed. This is useful for early stage
debugging of vendor specific initialization sequences.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 00:49:23 +02:00
David Herrmann
660f0fc07d Bluetooth: hidp: fix device disconnect on idle timeout
The HIDP specs define an idle-timeout which automatically disconnects a
device. This has always been implemented in the HIDP layer and forced a
synchronous shutdown of the hidp-scheduler. This works just fine, but
lacks a forced disconnect on the underlying l2cap channels. This has been
broken since:

    commit 5205185d46
    Author: David Herrmann <dh.herrmann@gmail.com>
    Date:   Sat Apr 6 20:28:47 2013 +0200

        Bluetooth: hidp: remove old session-management

The old session-management always forced an l2cap error on the ctrl/intr
channels when shutting down. The new session-management skips this, as we
don't want to enforce channel policy on the caller. In other words, if
user-space removes an HIDP device, the underlying channels (which are
*owned* and *referenced* by user-space) are still left active. User-space
needs to call shutdown(2) or close(2) to release them.

Unfortunately, this does not work with idle-timeouts. There is no way to
signal user-space that the HIDP layer has been stopped. The API simply
does not support any event-passing except for poll(2). Hence, we restore
old behavior and force EUNATCH on the sockets if the HIDP layer is
disconnected due to idle-timeouts (behavior of explicit disconnects
remains unmodified). User-space can still call

    getsockopt(..., SO_ERROR, ...)

..to retrieve the EUNATCH error and clear sk_err. Hence, the channels can
still be re-used (which nobody does so far, though). Therefore, the API
still supports the new behavior, but with this patch it's also compatible
to the old implicit channel shutdown.

Cc: <stable@vger.kernel.org> # 3.10+
Reported-by: Mark Haun <haunma@keteu.org>
Reported-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:23 +02:00
Marcel Holtmann
7e995b9ead Bluetooth: Add new quirk for non-persistent diagnostic settings
If the diagnostic settings are not persistent over HCI Reset, then this
quirk can be used to tell the Bluetoth core about it. This will ensure
that the settings are programmed correctly when the controller is
powered up.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 00:49:22 +02:00
Johan Hedberg
cad20c2780 Bluetooth: Don't use remote address type to decide IRK persistency
There are LE devices on the market that start off by announcing their
public address and then once paired switch to using private address.
To be interoperable with such devices we should simply trust the fact
that we're receiving an IRK from them to indicate that they may use
private addresses in the future. Instead, simply tie the persistency
to the bonding/no-bonding information the same way as for LTKs and
CSRKs.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21 00:49:21 +02:00
Marcel Holtmann
581d6fd60f Bluetooth: Queue diagnostic messages together with HCI packets
Sending diagnostic messages directly to the monitor socket might cause
issues for devices processing their messages in interrupt context. So
instead of trying to directly forward them, queue them up with the other
HCI packets and lets them be processed by the sockets at the same time.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 00:49:21 +02:00
Marcel Holtmann
bb77543ebd Bluetooth: Restrict valid packet types via HCI_CHANNEL_RAW
When using the HCI_CHANNEL_RAW, restrict the packet types to valid ones
from the Bluetooth specification.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 00:49:21 +02:00
Marcel Holtmann
8cd4f58142 Bluetooth: Remove quirk for HCI_VENDOR_PKT filter handling
The HCI_VENDOR_PKT quirk was needed for BPA-100/105 devices that send
these messages. Now that there is support for proper diagnostic channel
this quirk is no longer needed.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-10-21 00:49:21 +02:00
David S. Miller
26440c835f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	net/ipv4/inet_connection_sock.c
	net/switchdev/switchdev.c

In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.

The other two conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-20 06:08:27 -07:00
Julia Lawall
9abebb8a10 NFC: delete null dereference
The exit label performs device_unlock(&dev->dev);, which will fail when dev
is NULL, and nfc_put_device(dev);, which is not useful when dev is NULL, so
just exit the function immediately.

Problem found using scripts/coccinelle/null/deref_null.cocci

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-10-19 20:10:37 +02:00
Linus Torvalds
1099f86044 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Account for extra headroom in ath9k driver, from Felix Fietkau.

 2) Fix OOPS in pppoe driver due to incorrect socket state transition,
    from Guillaume Nault.

 3) Kill memory leak in amd-xgbe debugfx, from Geliang Tang.

 4) Power management fixes for iwlwifi, from Johannes Berg.

 5) Fix races in reqsk_queue_unlink(), from Eric Dumazet.

 6) Fix dst_entry usage in ARP replies, from Jiri Benc.

 7) Cure OOPSes with SO_GET_FILTER, from Daniel Borkmann.

 8) Missing allocation failure check in amd-xgbe, from Tom Lendacky.

 9) Various resource allocation/freeing cures in DSA< from Neil
    Armstrong.

10) A series of bug fixes in the openvswitch conntrack support, from
    Joe Stringer.

11) Fix two cases (BPF and act_mirred) where we have to clean the sender
    cpu stored in the SKB before transmitting.  From WANG Cong and
    Alexei Starovoitov.

12) Disable VLAN filtering in promiscuous mode in mlx5 driver, from
    Achiad Shochat.

13) Older bnx2x chips cannot do 4-tuple UDP hashing, so prevent this
    configuration via ethtool.  From Yuval Mintz.

14) Don't call rt6_uncached_list_flush_dev() from rt6_ifdown() when
    'dev' is NULL, from Eric Biederman.

15) Prevent stalled link synchronization in tipc, from Jon Paul Maloy.

16) kcalloc() gstrings ethtool buffer before having driver fill it in,
    in order to prevent kernel memory leaking.  From Joe Perches.

17) Fix mixxing rt6_info initialization for blackhole routes, from
    Martin KaFai Lau.

18) Kill VLAN regression in via-rhine, from Andrej Ota.

19) Missing pfmemalloc check in sk_add_backlog(), from Eric Dumazet.

20) Fix spurious MSG_TRUNC signalling in netlink dumps, from Ronen Arad.

21) Scrube SKBs when pushing them between namespaces in openvswitch,
    from Joe Stringer.

22) bcmgenet enables link interrupts too early, fix from Florian
    Fainelli.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (92 commits)
  net: bcmgenet: Fix early link interrupt enabling
  tunnels: Don't require remote endpoint or ID during creation.
  openvswitch: Scrub skb between namespaces
  xen-netback: correctly check failed allocation
  net: asix: add support for the Billionton GUSB2AM-1G-B USB adapter
  netlink: Trim skb to alloc size to avoid MSG_TRUNC
  net: add pfmemalloc check in sk_add_backlog()
  via-rhine: fix VLAN receive handling regression.
  ipv6: Initialize rt6_info properly in ip6_blackhole_route()
  ipv6: Move common init code for rt6_info to a new function rt6_info_init()
  Bluetooth: Fix initializing conn_params in scan phase
  Bluetooth: Fix conn_params list update in hci_connect_le_scan_cleanup
  Bluetooth: Fix remove_device behavior for explicit connects
  Bluetooth: Fix LE reconnection logic
  Bluetooth: Fix reference counting for LE-scan based connections
  Bluetooth: Fix double scan updates
  mlxsw: core: Fix race condition in __mlxsw_emad_transmit
  tipc: move fragment importance field to new header position
  ethtool: Use kcalloc instead of kmalloc for ethtool_get_strings
  tipc: eliminate risk of stalled link synchronization
  ...
2015-10-19 09:55:40 -07:00
Steffen Klassert
ca064bd893 xfrm: Fix pmtu discovery for local generated packets.
Commit 044a832a77 ("xfrm: Fix local error reporting crash
with interfamily tunnels") moved the setting of skb->protocol
behind the last access of the inner mode family to fix an
interfamily crash. Unfortunately now skb->protocol might not
be set at all, so we fail dispatch to the inner address family.
As a reault, the local error handler is not called and the
mtu value is not reported back to userspace.

We fix this by setting skb->protocol on message size errors
before we call xfrm_local_error.

Fixes: 044a832a77 ("xfrm: Fix local error reporting crash with interfamily tunnels")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-10-19 10:30:05 +02:00
David S. Miller
371f1c7e0d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree. Most relevantly, updates for the nfnetlink_log to integrate with
conntrack, fixes for cttimeout and improvements for nf_queue core, they are:

1) Remove useless ifdef around static inline function in IPVS, from
   Eric W. Biederman.

2) Simplify the conntrack support for nfnetlink_queue: Merge
   nfnetlink_queue_ct.c file into nfnetlink_queue_core.c, then rename it back
   to nfnetlink_queue.c

3) Use y2038 safe timestamp from nfnetlink_queue.

4) Get rid of dead function definition in nf_conntrack, from Flavio
   Leitner.

5) Attach conntrack support for nfnetlink_log.c, from Ken-ichirou MATSUZAWA.
   This adds a new NETFILTER_NETLINK_GLUE_CT Kconfig switch that
   controls enabling both nfqueue and nflog integration with conntrack.
   The userspace application can request this via NFULNL_CFG_F_CONNTRACK
   configuration flag.

6) Remove unused netns variables in IPVS, from Eric W. Biederman and
   Simon Horman.

7) Don't put back the refcount on the cttimeout object from xt_CT on success.

8) Fix crash on cttimeout policy object removal. We have to flush out
   the cttimeout extension area of the conntrack not to refer to an unexisting
   object that was just removed.

9) Make sure rcu_callback completion before removing nfnetlink_cttimeout
   module removal.

10) Fix compilation warning in br_netfilter when no nf_defrag_ipv4 and
    nf_defrag_ipv6 are enabled. Patch from Arnd Bergmann.

11) Autoload ctnetlink dependencies when NFULNL_CFG_F_CONNTRACK is
    requested. Again from Ken-ichirou MATSUZAWA.

12) Don't use pointer to previous hook when reinjecting traffic via
    nf_queue with NF_REPEAT verdict since it may be already gone. This
    also avoids a deadloop if the userspace application keeps returning
    NF_REPEAT.

13) A bunch of cleanups for netfilter IPv4 and IPv6 code from Ian Morris.

14) Consolidate logger instance existence check in nfulnl_recv_config().

15) Fix broken atomicity when applying configuration updates to logger
    instances in nfnetlink_log.

16) Get rid of the .owner attribute in our hook object. We don't need
    this anymore since we're dropping pending packets that have escaped
    from the kernel when unremoving the hook. Patch from Florian Westphal.

17) Remove unnecessary rcu_read_lock() from nf_reinject code, we always
    assume RCU read side lock from .call_rcu in nfnetlink. Also from Florian.

18) Use static inline function instead of macros to define NF_HOOK() and
    NF_HOOK_COND() when no netfilter support in on, from Arnd Bergmann.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:48:34 -07:00
santosh.shilimkar@oracle.com
7b4b000951 RDS: fix rds-ping deadlock over TCP transport
Sowmini found hang with rds-ping while testing RDS over TCP. Its
a corner case and doesn't happen always. The issue is not reproducible
with IB transport. Its clear from below dump why we see it with RDS TCP.

 [<ffffffff8153b7e5>] do_tcp_setsockopt+0xb5/0x740
 [<ffffffff8153bec4>] tcp_setsockopt+0x24/0x30
 [<ffffffff814d57d4>] sock_common_setsockopt+0x14/0x20
 [<ffffffffa096071d>] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
 [<ffffffffa093b5f7>] rds_send_xmit+0xd7/0x740 [rds]
 [<ffffffffa093bda2>] rds_send_pong+0x142/0x180 [rds]
 [<ffffffffa0939d34>] rds_recv_incoming+0x274/0x330 [rds]
 [<ffffffff810815ae>] ? ttwu_queue+0x11e/0x130
 [<ffffffff814dcacd>] ? skb_copy_bits+0x6d/0x2c0
 [<ffffffffa0960350>] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
 [<ffffffff8153d836>] tcp_read_sock+0x96/0x1c0
 [<ffffffffa0960060>] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
 [<ffffffff814d6a90>] ? sock_def_write_space+0xa0/0xa0
 [<ffffffffa09604d1>] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
 [<ffffffff81545249>] tcp_data_queue+0x379/0x5b0
 [<ffffffffa0960cdb>] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
 [<ffffffff81547fd2>] tcp_rcv_established+0x2e2/0x6e0
 [<ffffffff81552602>] tcp_v4_do_rcv+0x122/0x220
 [<ffffffff81553627>] tcp_v4_rcv+0x867/0x880
 [<ffffffff8152e0b3>] ip_local_deliver_finish+0xa3/0x220

This happens because rds_send_xmit() chain wants to take
sock_lock which is already taken by tcp_v4_rcv() on its
way to rds_tcp_data_ready(). Commit db6526dcb5 ("RDS: use
rds_send_xmit() state instead of RDS_LL_SEND_FULL") which
was trying to opportunistically finish the send request
in same thread context.

But because of above recursive lock hang with RDS TCP,
the send work from rds_send_pong() needs to deferred to
worker to avoid lock up. Given RDS ping is more of connectivity
test than performance critical path, its should be ok even
for transport like IB.

Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by:  Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:45:55 -07:00
Eric Dumazet
dc6ef6be52 tcp: do not set queue_mapping on SYNACK
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.

We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.

Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:26:02 -07:00
Joe Stringer
740dbc2891 openvswitch: Scrub skb between namespaces
If OVS receives a packet from another namespace, then the packet should
be scrubbed. However, people have already begun to rely on the behaviour
that skb->mark is preserved across namespaces, so retain this one field.

This is mainly to address information leakage between namespaces when
using OVS internal ports, but by placing it in ovs_vport_receive() it is
more generally applicable, meaning it should not be overlooked if other
port types are allowed to be moved into namespaces in future.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:24:50 -07:00
David S. Miller
a5d6f7dd30 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2015-10-16

First of all, sorry for the late set of patches for the 4.3 cycle. We
just finished an intensive week of testing at the Bluetooth UnPlugFest
and discovered (and fixed) issues there. Unfortunately a few issues
affect 4.3-rc5 in a way that they break existing Bluetooth LE mouse and
keyboard support.

The regressions result from supporting LE privacy in conjunction with
scanning for Resolvable Private Addresses before connecting. A feature
that has been tested heavily (including automated unit tests), but sadly
some regressions slipped in. The UnPlugFest with its multitude of test
platforms is a good battle testing ground for uncovering every corner
case.

The patches in this pull request focus only on fixing the regressions in
4.3-rc5. The patches look a bit larger since we also added comments in
the critical sections of the fixes to improve clarity.

I would appreciate if we can get these regression fixes to Linus
quickly. Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:23:33 -07:00
Arad, Ronen
db65a3aaf2 netlink: Trim skb to alloc size to avoid MSG_TRUNC
netlink_dump() allocates skb based on the calculated min_dump_alloc or
a per socket max_recvmsg_len.
min_alloc_size is maximum space required for any single netdev
attributes as calculated by rtnl_calcit().
max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
It is capped at 16KiB.
The intention is to avoid small allocations and to minimize the number
of calls required to obtain dump information for all net devices.

netlink_dump packs as many small messages as could fit within an skb
that was sized for the largest single netdev information. The actual
space available within an skb is larger than what is requested. It could
be much larger and up to near 2x with align to next power of 2 approach.

Allowing netlink_dump to use all the space available within the
allocated skb increases the buffer size a user has to provide to avoid
truncaion (i.e. MSG_TRUNG flag set).

It was observed that with many VLANs configured on at least one netdev,
a larger buffer of near 64KiB was necessary to avoid "Message truncated"
error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
min_alloc_size was only little over 32KiB.

This patch trims skb to allocated size in order to allow the user to
avoid truncation with more reasonable buffer size.

Signed-off-by: Ronen Arad <ronen.arad@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 19:34:12 -07:00
Li RongQing
26fb342c73 ipconfig: send Client-identifier in DHCP requests
A dhcp server may provide parameters to a client from a pool of IP
addresses and using a shared rootfs, or provide a specific set of
parameters for a specific client, usually using the MAC address to
identify each client individually. The dhcp protocol also specifies
a client-id field which can be used to determine the correct
parameters to supply when no MAC address is available. There is
currently no way to tell the kernel to supply a specific client-id,
only the userspace dhcp clients support this feature, but this can
not be used when the network is needed before userspace is available
such as when the root filesystem is on NFS.

This patch is to be able to do something like "ip=dhcp,client_id_type,
client_id_value", as a kernel parameter to enable the kernel to
identify itself to the server.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 19:23:52 -07:00
Peter Hurley
fef062cbf2 tty: Remove ASYNC_CLOSING checks in open()/hangup() methods
Since at least before 2.6.30, tty drivers that do not drop the tty lock
while closing cannot observe ASYNC_CLOSING set while holding the
tty lock; this includes the tty driver's open() and hangup() methods,
since the tty core calls these methods holding the tty lock.

For these drivers, waiting for ASYNC_CLOSING to clear while opening
is not required, since this condition cannot occur. Similarly, even
when the open() method drops and reacquires the tty lock after
blocking, ASYNC_CLOSING cannot be set (again, for drivers that
do not drop the tty lock while closing).

Now that tty port drivers no longer drop the tty lock while closing
(since 'tty: Remove tty_wait_until_sent_from_close()'), the same
conditions apply: waiting for ASYNC_CLOSING to clear while opening
is not required, nor is re-checking ASYNC_CLOSING after dropping and
reacquiring the tty lock while blocking (eg., in *_block_til_ready()).

Note: The ASYNC_CLOSING flag state is still maintained since several
bitrotting drivers use it for (dubious) other purposes.

Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17 21:11:29 -07:00
Pablo Neira Ayuso
f0a0a978b6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
This merge resolves conflicts with 75aec9df3a ("bridge: Remove
br_nf_push_frag_xmit_sk") as part of Eric Biederman's effort to improve
netns support in the network stack that reached upstream via David's
net-next tree.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Conflicts:
	net/bridge/br_netfilter_hooks.c
2015-10-17 14:28:03 +02:00
Nikolay Borisov
00db674bed netfilter: ipset: Fix sleeping memory allocation in atomic context
Commit 00590fdd5b introduced RCU locking in list type and in
doing so introduced a memory allocation in list_set_add, which
is done in an atomic context, due to the fact that ipset rcu
list modifications are serialised with a spin lock. The reason
why we can't use a mutex is that in addition to modifying the
list with ipset commands, it's also being modified when a
particular ipset rule timeout expires aka garbage collection.
This gc is triggered from set_cleanup_entries, which in turn
is invoked from a timer thus requiring the lock to be bh-safe.

Concretely the following call chain can lead to "sleeping function
called in atomic context" splat:
call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL).
And since GFP_KERNEL allows initiating direct reclaim thus
potentially sleeping in the allocation path.

To fix the issue change the allocation type to GFP_ATOMIC, to
correctly reflect that it is occuring in an atomic context.

Fixes: 00590fdd5b ("netfilter: ipset: Introduce RCU locking in list type")
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-17 13:01:24 +02:00
Linus Torvalds
59bcce1216 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "Just two small items from Ilya:

  The first patch fixes the RBD readahead to grab full objects.  The
  second fixes the write ops to prevent undue promotion when a cache
  tier is configured on the server side"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: use writefull op for object size writes
  rbd: set max_sectors explicitly
2015-10-16 12:47:02 -07:00
Ian Morris
c8d71d08aa netfilter: ipv4: whitespace around operators
This patch cleanses whitespace around arithmetical operators.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:19:23 +02:00
Ian Morris
24cebe3f29 netfilter: ipv4: code indentation
Use tabs instead of spaces to indent code.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:19:15 +02:00
Ian Morris
6c28255b46 netfilter: ipv4: function definition layout
Use tabs instead of spaces to indent second line of parameters in
function definitions.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:19:10 +02:00
Ian Morris
27951a0168 netfilter: ipv4: ternary operator layout
Correct whitespace layout of ternary operators in the netfilter-ipv4
code.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:19:04 +02:00
Ian Morris
19f0a60201 netfilter: ipv4: label placement
Whitespace cleansing: Labels should not be indented.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 19:17:41 +02:00
Arnd Bergmann
008027c31d netfilter: turn NF_HOOK into an inline function
A recent change to the dst_output handling caused a new warning
when the call to NF_HOOK() is the only used of a local variable
passed as 'dev', and CONFIG_NETFILTER is disabled:

net/ipv6/ip6_output.c: In function 'ip6_output':
net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' [-Wunused-variable]

The reason for this is that the NF_HOOK macro in this case does
not reference the variable at all, and the call to dev_net(dev)
got removed from the ip6_output function. To avoid that warning now
and in the future, this changes the macro into an equivalent
inline function, which tells the compiler that the variable is
passed correctly but still unused.

The dn_forward function apparently had the same problem in
the past and added a local workaround that no longer works
with the inline function. In order to avoid a regression, we
have to also remove the #ifdef from decnet in the same patch.

Fixes: ede2059dba ("dst: Pass net into dst->output")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 18:45:36 +02:00
Florian Westphal
81b4325eba netfilter: nf_queue: remove rcu_read_lock calls
All verdict handlers make use of the nfnetlink .call_rcu callback
so rcu readlock is already held.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 18:22:41 +02:00
Florian Westphal
ed78d09d59 netfilter: make nf_queue_entry_get_refs return void
We don't care if module is being unloaded anymore since hook unregister
handling will destroy queue entries using that hook.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 18:22:23 +02:00
Florian Westphal
2ffbceb2b0 netfilter: remove hook owner refcounting
since commit 8405a8fff3 ("netfilter: nf_qeueue: Drop queue entries on
nf_unregister_hook") all pending queued entries are discarded.

So we can simply remove all of the owner handling -- when module is
removed it also needs to unregister all its hooks.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 18:21:39 +02:00
Ilya Dryomov
e30b7577bf rbd: use writefull op for object size writes
This covers only the simplest case - an object size sized write, but
it's still useful in tiering setups when EC is used for the base tier
as writefull op can be proxied, saving an object promotion.

Even though updating ceph_osdc_new_request() to allow writefull should
just be a matter of fixing an assert, I didn't do it because its only
user is cephfs.  All other sites were updated.

Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2015-10-16 16:49:01 +02:00
Jiri Pirko
573c7ba006 net: introduce pre-change upper device notifier
This newly introduced netdevice notifier is called before actual change
upper happens. That provides a possibility for notifier handlers to
know upper change will happen and react to it, including possibility to
forbid the change. That is valuable for drivers which can check if the
upper device linkage is supported and forbid that in case it is not.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 07:15:05 -07:00
David Ahern
51161aa98d net: Fix suspicious RCU usage in fib_rebalance
This command:
  ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2

generated this suspicious RCU usage message:

[ 63.249262]
[ 63.249939] ===============================
[ 63.251571] [ INFO: suspicious RCU usage. ]
[ 63.253250] 4.3.0-rc3+ #298 Not tainted
[ 63.254724] -------------------------------
[ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
[ 63.259450]
[ 63.259450] other info that might help us debug this:
[ 63.259450]
[ 63.262297]
[ 63.262297] rcu_scheduler_active = 1, debug_locks = 1
[ 63.264647] 1 lock held by ip/2870:
[ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14
[ 63.268858]
[ 63.268858] stack backtrace:
[ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298
[ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301
[ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000
[ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908
[ 63.283177] Call Trace:
[ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68
[ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103
[ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f
[ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127
[ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f
[ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc
[ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
[ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451
[ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43
[ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51
[ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195
[ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf
[ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14
[ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12
[ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90
[ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28
[ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f
[ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc
[ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d
[ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b
[ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c
[ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
[ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6
[ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54
[ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77
[ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b
[ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19
[ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f

It looks like all of the code paths to fib_rebalance are under rtnl.

Fixes: 0e884c78ee ("ipv4: L3 hash-based multipath")
Cc: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:57:55 -07:00
Eric Dumazet
ebb516af60 tcp/dccp: fix race at listener dismantle phase
Under stress, a close() on a listener can trigger the
WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()

We need to test if listener is still active before queueing
a child in inet_csk_reqsk_queue_add()

Create a common inet_child_forget() helper, and use it
from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:19 -07:00
Eric Dumazet
f03f2e154f tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper
Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
In many cases we also need to release reference on request socket,
so add a helper to do this, reducing code size and complexity.

Fixes: 4bdc3d6614 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:18 -07:00
Eric Dumazet
ef84d8ce5a Revert "inet: fix double request socket freeing"
This reverts commit c69736696c.

At the time of above commit, tcp_req_err() and dccp_req_err()
were dead code, as SYN_RECV request sockets were not yet in ehash table.

Real bug was fixed later in a different commit.

We need to revert to not leak a refcount on request socket.

inet_csk_reqsk_queue_drop_and_put() will be added
in following commit to make clean inet_csk_reqsk_queue_drop()
does not release the reference owned by caller.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:17 -07:00
Martin KaFai Lau
0a1f596200 ipv6: Initialize rt6_info properly in ip6_blackhole_route()
ip6_blackhole_route() does not initialize the newly allocated
rt6_info properly.  This patch:
1. Call rt6_info_init() to initialize rt6i_siblings and rt6i_uncached

2. The current rt->dst._metrics init code is incorrect:
   - 'rt->dst._metrics = ort->dst._metris' is not always safe
   - Not sure what dst_copy_metrics() is trying to do here
     considering ip6_rt_blackhole_cow_metrics() always returns
     NULL

   Fix:
   - Always do dst_copy_metrics()
   - Replace ip6_rt_blackhole_cow_metrics() with
     dst_cow_metrics_generic()

3. Mask out the RTF_PCPU bit from the newly allocated blackhole route.
   This bug triggers an oops (reported by Phil Sutter) in rt6_get_cookie().
   It is because RTF_PCPU is set while rt->dst.from is NULL.

Fixes: d52d3997f8 ("ipv6: Create percpu rt6_info")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Phil Sutter <phil@nwl.cc>
Tested-by: Phil Sutter <phil@nwl.cc>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Phil Sutter <phil@nwl.cc>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:39:16 -07:00
Martin KaFai Lau
ebfa45f0d9 ipv6: Move common init code for rt6_info to a new function rt6_info_init()
Introduce rt6_info_init() to do the common init work for
'struct rt6_info' (after calling dst_alloc).

It is a prep work to fix the rt6_info init logic in the
ip6_blackhole_route().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Phil Sutter <phil@nwl.cc>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:39:14 -07:00
Jakub Pawlowski
5157b8a503 Bluetooth: Fix initializing conn_params in scan phase
This patch makes sure that conn_params that were created just for
explicit_connect, will get properly deleted during cleanup.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Acked-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-16 09:24:41 +02:00
Johan Hedberg
9ad3e6ffe1 Bluetooth: Fix conn_params list update in hci_connect_le_scan_cleanup
After clearing the params->explicit_connect variable the parameters
may need to be either added back to the right list or potentially left
absent from both the le_reports and the le_conns lists.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-16 09:24:41 +02:00
Johan Hedberg
679d2b6f9d Bluetooth: Fix remove_device behavior for explicit connects
Devices undergoing an explicit connect should not have their
conn_params struct removed by the mgmt Remove Device command. This
patch fixes the necessary checks in the command handler to correct the
behavior.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-16 09:24:41 +02:00
Johan Hedberg
49c509220d Bluetooth: Fix LE reconnection logic
We can't use hci_explicit_connect_lookup() since that would only cover
explicit connections, leaving normal reconnections completely
untouched. Not using it in turn means leaving out entries in
pend_le_reports.

To fix this and simplify the logic move conn params from the reports
list to the pend_le_conns list for the duration of an explicit
connect. Once the connect is complete move the params back to the
pend_le_reports list. This also means that the explicit connect lookup
function only needs to look into the pend_le_conns list.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-16 09:24:41 +02:00