Commit graph

548650 commits

Author SHA1 Message Date
Greg Bowers 947570e800 i40e: Add support for non-willing Apps
Adds support for setting a new bit in the Set Local LLDP MIB AQ command
Type field.  When set to 1, the bit indicates to FW that Apps should be
treated as non-willing.  When 0, FW behaves as before.

Change-ID: I0d2101c1606c59c7188d3e6a0c7810e0f205233a
Signed-off-by: Greg Bowers <gregory.j.bowers@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:48:11 -07:00
Shannon Nelson 1cdfd88f2d i40e: priv flag for controlling VEB stats
Add an ethtool priv flag to enable and disable printing
the VEB statistics.

Change-ID: I7654054a3a73b08aa8310d94ee8fce6219107dd8
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:45:47 -07:00
Greg Rose d9d17cf74a i40e: Removed unused defines
Two defines that are not used are causing customer confusion - remove
them.

Change-ID: Icef0325aca8e0f4fcdfc519e026bdd375e791200
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:43:25 -07:00
Shannon Nelson 3c5c420535 i40e: remove read/write failed messages from nvmupdate
Allow the nvmupdate application to decide when a read or write error
should be exposed to the user.  Since the application needs to use
write probes to find the ReadOnly sections on a potentially unknown NVM
version in the HW and read probes to check the status of the last write,
some error messages are expected, but need not be shown to the users.
The driver can't tell ignorable errors from real ones, so it needs
to let the application make the decision.

Change-ID: I78fca8ab672bede11c10c820b83c26adfd536d03
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:41:00 -07:00
Jingjing Wu 4e68adfeb9 i40e/i40evf: Fix compile issue related to const string
Add const to functions that return strings that aren't going to be
modified. This addresses some reported compile complaints.
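
A minimal sketch of the pattern, not the driver's actual code; the enum and
function names below are hypothetical. Returning string literals through a
const char * return type avoids qualifier warnings, for example under GCC's
-Wwrite-strings:

```c
/* Hypothetical status codes, for illustration only. */
enum demo_status { DEMO_OK = 0, DEMO_ERR_PARAM = 1, DEMO_ERR_NVM = 2 };

/*
 * Returns a pointer to a string literal. The const return type tells
 * callers that the string must not be modified.
 */
const char *demo_status_str(enum demo_status s)
{
	switch (s) {
	case DEMO_OK:        return "OK";
	case DEMO_ERR_PARAM: return "ERR_PARAM";
	case DEMO_ERR_NVM:   return "ERR_NVM";
	}
	return "UNKNOWN";
}
```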

Change-ID: Ic56b1e814ab4d23a50480e7fdec652445f776ee8
Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:38:35 -07:00
Shannon Nelson 6dec101765 i40e: generate fewer startup messages
Cut down on the number of startup log entries by putting a couple behind
debug flags and combining a couple others into a single line.

Change-ID: I708089f086308f84d43f8b6f0e8a634a02d058fb
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:36:13 -07:00
Jesse Brandeburg 32b3e08fff drivers/net/intel: use napi_complete_done()
As per Eric Dumazet's previous patches:
(see commit (24d2e4a507) - tg3: use napi_complete_done())

Quoting verbatim:
Using napi_complete_done() instead of napi_complete() allows
us to use /sys/class/net/ethX/gro_flush_timeout

GRO layer can aggregate more packets if the flush is delayed a bit,
without having to set too big coalescing parameters that impact
latencies.
</end quote>

Tested
configuration: low latency via ethtool -C ethx adaptive-rx off
				rx-usecs 10 adaptive-tx off tx-usecs 15
workload: streaming rx using netperf TCP_MAERTS

igb:
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET : demo
...
Interim result:  941.48 10^6bits/s over 1.000 seconds ending at 1440193171.589

Alignment      Offset         Bytes    Bytes       Recvs   Bytes    Sends
Local  Remote  Local  Remote  Xfered   Per                 Per
Recv   Send    Recv   Send             Recv (avg)          Send (avg)
    8       8      0       0 1176930056  1475.36    797726   16384.00  71905

MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET : demo
...
Interim result:  941.49 10^6bits/s over 0.997 seconds ending at 1440193142.763

Alignment      Offset         Bytes    Bytes       Recvs   Bytes    Sends
Local  Remote  Local  Remote  Xfered   Per                 Per
Recv   Send    Recv   Send             Recv (avg)          Send (avg)
    8       8      0       0 1175182320  50476.00     23282   16384.00  71816

i40e:
Hard to test because the traffic is incoming so fast (24Gb/s) that GRO
always receives 87kB, even at the highest interrupt rate.

Other drivers were only compile tested.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:33:46 -07:00
Alexander Duyck 7709b4c1ff i40evf: Add support for netpoll
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:31:20 -07:00
Alexander Duyck 8b65035905 i40e/i40evf: Drop useless "IN_NETPOLL" flag
The code in i40e and i40evf is using an "IN_NETPOLL" flag that has never
added any value due to the fact that the Rx clean-up is handled in NAPI.
As such the flag was set, the queue was scheduled via NAPI, and then polled
from the netpoll controller and if any Rx packets were processed they were
processed in the wrong context.

In addition the flag itself just added an unneeded conditional to the
hot-path so it can safely be dropped and save us a few instructions.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:28:57 -07:00
Alexander Duyck c67caceb86 i40e/i40evf: Fix handling of napi budget
The polling routine for i40e was rounding up the budget for Rx cleanup to
1.  This is incorrect as the netpoll poll call is expecting no Rx to be
processed as the budget passed was 0.
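
A rough sketch of the budget handling described above, assuming the budget is
split evenly across Rx rings. The helper name is hypothetical and this is not
the driver's code:

```c
/* Hypothetical helper: split a NAPI-style budget across Rx rings. */
int rx_budget_per_ring(int budget, int num_rings)
{
	int per_ring;

	/* A budget of 0 (as passed by netpoll) means "do no Rx work". */
	if (budget <= 0 || num_rings <= 0)
		return 0;

	/* Otherwise every ring gets at least one packet's worth. */
	per_ring = budget / num_rings;
	return per_ring > 0 ? per_ring : 1;
}
```

The point is that rounding up to 1 is only applied once the budget is known to
be non-zero.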

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2015-10-16 04:26:33 -07:00
David Ahern 51161aa98d net: Fix suspicious RCU usage in fib_rebalance
This command:
  ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2

generated this suspicious RCU usage message:

[ 63.249262]
[ 63.249939] ===============================
[ 63.251571] [ INFO: suspicious RCU usage. ]
[ 63.253250] 4.3.0-rc3+ #298 Not tainted
[ 63.254724] -------------------------------
[ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
[ 63.259450]
[ 63.259450] other info that might help us debug this:
[ 63.259450]
[ 63.262297]
[ 63.262297] rcu_scheduler_active = 1, debug_locks = 1
[ 63.264647] 1 lock held by ip/2870:
[ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14
[ 63.268858]
[ 63.268858] stack backtrace:
[ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298
[ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301
[ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000
[ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908
[ 63.283177] Call Trace:
[ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68
[ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103
[ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f
[ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127
[ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f
[ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc
[ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
[ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451
[ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43
[ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51
[ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195
[ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf
[ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14
[ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12
[ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90
[ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28
[ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f
[ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc
[ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d
[ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b
[ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c
[ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
[ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6
[ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54
[ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77
[ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b
[ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19
[ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f

It looks like all of the code paths to fib_rebalance are under rtnl.

Fixes: 0e884c78ee ("ipv4: L3 hash-based multipath")
Cc: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:57:55 -07:00
Tom Herbert ac00737f4e bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_put
Currently, bpf_prog_uncharge_memlock is only called from __prog_put_rcu in
the bpf_prog_release path. We need to call it from bpf_prog_put as well to
get correct accounting.

Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:55:02 -07:00
David S. Miller a302afe980 Merge branch 'robust_listener'
Eric Dumazet says:

====================
tcp/dccp: make our listener code more robust

This patch series addresses request sockets leaks and listener dismantle
phase. This survives a stress test with listeners being added/removed
quite randomly.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:27 -07:00
Eric Dumazet ebb516af60 tcp/dccp: fix race at listener dismantle phase
Under stress, a close() on a listener can trigger the
WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()

We need to test if listener is still active before queueing
a child in inet_csk_reqsk_queue_add()

Create a common inet_child_forget() helper, and use it
from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:19 -07:00
Eric Dumazet f03f2e154f tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper
Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
In many cases we also need to release reference on request socket,
so add a helper to do this, reducing code size and complexity.

Fixes: 4bdc3d6614 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:18 -07:00
Eric Dumazet ef84d8ce5a Revert "inet: fix double request socket freeing"
This reverts commit c69736696c.

At the time of above commit, tcp_req_err() and dccp_req_err()
were dead code, as SYN_RECV request sockets were not yet in ehash table.

Real bug was fixed later in a different commit.

We need to revert to not leak a refcount on request socket.

inet_csk_reqsk_queue_drop_and_put() will be added
in the following commit to make clear that inet_csk_reqsk_queue_drop()
does not release the reference owned by the caller.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:52:17 -07:00
Ivan Vecera 47ea032533 drivers/net: get rid of unnecessary initializations in .get_drvinfo()
Many drivers needlessly initialize the n_priv_flags, n_stats, testinfo_len,
eedump_len & regdump_len fields in their .get_drvinfo() ethtool op.
This is not necessary, as these fields are filled in by ethtool_get_drvinfo().

v2: removed unused variable
v3: removed another unused variable

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:24:10 -07:00
David S. Miller ae23051820 Merge branch 'tipc-link-improvements'
Jon Maloy says:

====================
tipc: some link level code improvements

Extensive testing has revealed some weaknesses and non-optimal solutions
in the link level code.

This commit series addresses those issues.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:33 -07:00
Jon Paul Maloy c819930090 tipc: update node FSM when peer RESET message is received
The change made in the previous commit revealed a small flaw in the way
the node FSM is updated. When the function tipc_node_link_down() is
called for the last link to a node, we should check whether this was
caused by a local reset or by a received RESET message from the peer.
In the latter case, we can directly issue a PEER_LOST_CONTACT_EVT to
the node FSM, so that it is ready to re-establish contact. If this is
not done, the peer node will sometimes have to go through a second
establish cycle before the link becomes stable.

We fix this in this commit by conditionally issuing the mentioned
event in the function tipc_node_link_down(). We also move the LINK_RESET
FSM event away from the link_reset() function and into the caller
function, partially because it is easier to follow the code when state
changes are gathered at a limited number of locations, partially
because there will be cases in future commits where we don't want the
link to go RESET mode when link_reset() is called.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:23 -07:00
Jon Paul Maloy 282b3a0562 tipc: send out RESET immediately when link goes down
When a link is taken down because of a node local event, such as
disabling of a bearer or an interface, we currently leave it to the
peer node to discover the broken communication. The default time for
such failure discovery is 1.5-2 seconds.

If we instead allow the terminating link endpoint to send out a RESET
message at the moment it is reset, we can achieve the impression that
both endpoints are going down instantly. Since this is a very common
scenario, we find it worthwhile to make this small modification.

Apart from letting the link produce the said message, we also have to
ensure that the interface is able to transmit it before TIPC is
detached. We do this by performing the disabling of a bearer in three
steps:

1) Disable reception of TIPC packets from the interface in question.
2) Take down the links, while allowing them to send out a RESET message.
3) Disable transmission of TIPC packets on the interface.

Apart from this, we now have to react to the NETDEV_GOING_DOWN event,
instead of, as currently, the NETDEV_DOWN event, to ensure that such
transmission is possible during the teardown phase.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:22 -07:00
Jon Paul Maloy 73f646cec3 tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridge
the gap between the two contexts in a safe manner.

We have now discovered a weakness in the implementation of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.

We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potential callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.

Unfortunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.

 +------------------------------------+
 |RESET_EVT                           |
 |                                    |
 |                             +--------------+
 |           +-----------------|   SYNCHING   |-----------------+
 |           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
 |           |                  A            |                  |
 |           |                  |            |                  |
 |           |                  |            |                  |
 |           |                  |SYNCH_      |SYNCH_            |
 |           |                  |BEGIN_EVT   |END_EVT           |
 |           |                  |            |                  |
 |           V                  |            V                  V
 |    +-------------+          +--------------+          +------------+
 |    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
 |    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
 |           |        EVT        |    A         RESET_EVT       |
 |           |                   |    |                         |
 |           |  +----------------+    |                         |
 |  RESET_EVT|  |RESET_EVT            |                         |
 |           |  |                     |                         |
 |           |  |                     |ESTABLISH_EVT            |
 |           |  |  +-------------+    |                         |
 |           |  |  | RESET_EVT   |    |                         |
 |           |  |  |             |    |                         |
 |           V  V  V             |    |                         |
 |    +-------------+          +--------------+        RESET_EVT|
 +--->|    RESET    |--------->| ESTABLISHING |<----------------+
      +-------------+ PEER_    +--------------+
       |           A  RESET_EVT       |
       |           |                  |
       |           |                  |
       |FAILOVER_  |FAILOVER_         |FAILOVER_
       |BEGIN_EVT  |END_EVT           |BEGIN_EVT
       |           |                  |
       V           |                  |
      +-------------+                 |
      | FAILINGOVER |<----------------+
      +-------------+
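
The diagram can be transcribed into a transition function. The following is a
sketch based on one reading of the arrows above, with hypothetical enum names;
it is not the TIPC source. Events that the diagram does not draw for a state
are assumed here to leave the state unchanged:

```c
enum link_state {
	ST_RESET, ST_ESTABLISHING, ST_ESTABLISHED, ST_SYNCHING,
	ST_RESETTING, ST_PEER_RESET, ST_FAILINGOVER
};

enum link_evt {
	EVT_RESET, EVT_PEER_RESET, EVT_ESTABLISH, EVT_FAILURE,
	EVT_SYNCH_BEGIN, EVT_SYNCH_END, EVT_FAILOVER_BEGIN, EVT_FAILOVER_END
};

/* Returns the next state for (s, e) as read from the diagram. */
enum link_state link_fsm_next(enum link_state s, enum link_evt e)
{
	switch (s) {
	case ST_RESET:
		if (e == EVT_PEER_RESET)     return ST_ESTABLISHING;
		if (e == EVT_FAILOVER_BEGIN) return ST_FAILINGOVER;
		break;
	case ST_ESTABLISHING:
		if (e == EVT_ESTABLISH)      return ST_ESTABLISHED;
		/* The modification described in the text: back to RESET. */
		if (e == EVT_RESET)          return ST_RESET;
		if (e == EVT_FAILOVER_BEGIN) return ST_FAILINGOVER;
		break;
	case ST_ESTABLISHED:
		if (e == EVT_SYNCH_BEGIN)    return ST_SYNCHING;
		if (e == EVT_FAILURE)        return ST_RESETTING;
		if (e == EVT_PEER_RESET)     return ST_PEER_RESET;
		if (e == EVT_RESET)          return ST_RESET;
		break;
	case ST_SYNCHING:
		if (e == EVT_SYNCH_END)      return ST_ESTABLISHED;
		if (e == EVT_FAILURE)        return ST_RESETTING;
		if (e == EVT_PEER_RESET)     return ST_PEER_RESET;
		if (e == EVT_RESET)          return ST_RESET;
		break;
	case ST_RESETTING:
		if (e == EVT_RESET)          return ST_RESET;
		break;
	case ST_PEER_RESET:
		if (e == EVT_RESET)          return ST_ESTABLISHING;
		break;
	case ST_FAILINGOVER:
		if (e == EVT_FAILOVER_END)   return ST_RESET;
		break;
	}
	return s; /* assumption: undrawn events are ignored in this sketch */
}
```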

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:21 -07:00
Jon Paul Maloy 8306f99a51 tipc: disallow packet duplicates in link deferred queue
After the previous commits, we are guaranteed that no attempt will be
made to add packets of type LINK_PROTOCOL, or packets with illegal sequence
numbers, to the link deferred queue. This makes it possible to
make some simplifications to the sorting algorithm in the function
tipc_skb_queue_sorted().

We also alter the function so that it will drop packets if one with
the same sequence number is already present in the queue. This is
necessary because we have identified weird packet sequences, involving
duplicate packets, where a legitimate in-sequence packet may advance to
the head of the queue without being detected and de-queued.

Finally, we make this function outline, since it will now be called only
in exceptional cases.
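
A simplified sketch of a sorted insert that rejects duplicates, assuming a
singly linked list and 16-bit modular sequence numbers. The struct and
function names are hypothetical; the real tipc_skb_queue_sorted() works on
skb queues:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pkt {
	uint16_t seqno;
	struct pkt *next;
};

/* True if a strictly precedes b in 16-bit modular sequence space. */
static bool seq_less(uint16_t a, uint16_t b)
{
	return a != b && (uint16_t)(b - a) < 32768u;
}

/*
 * Insert p into the list at *head, kept in ascending sequence order.
 * Returns false and leaves the list untouched if a packet with the
 * same sequence number is already queued; the caller then drops p.
 */
bool defq_insert(struct pkt **head, struct pkt *p)
{
	struct pkt **pos = head;

	/* Skip every queued packet that precedes p. */
	while (*pos && seq_less((*pos)->seqno, p->seqno))
		pos = &(*pos)->next;

	if (*pos && (*pos)->seqno == p->seqno)
		return false;	/* duplicate */

	p->next = *pos;
	*pos = p;
	return true;
}
```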

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:21 -07:00
Jon Paul Maloy 81204c492b tipc: improve sequence number checking
The sequence number of an incoming packet is currently only checked
for less than, equality to, or bigger than the next expected number,
meaning that the receive window in practice becomes one half sequence
number cycle, or U16_MAX/2. This does not make sense, and may not even
be safe if there are extreme delays in the network. Any packet sent by
the peer during the ongoing cycle must belong inside his current send
window, or should otherwise be dropped if possible.

Since a link endpoint cannot know its peer's current send window, it
has to base this sanity check on a worst-case assumption, i.e., that
the peer is using a maximum sized window of 8191 packets. Using this
assumption, we now add a check that the sequence number is not bigger
than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
done, so that the receive window test is performed before the gap test.
This way, we are guaranteed that no packets with illegal sequence numbers
are ever added to the deferred queue.
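
The check order described above can be sketched as follows, assuming 16-bit
sequence numbers that wrap around. The names are hypothetical and the constant
stands in for TIPC_MAX_LINK_WIN; this is not the code from tipc_link_rcv():

```c
#include <stdint.h>

#define MAX_LINK_WIN 8191u	/* worst-case peer send window, per the text */

enum rx_verdict { RX_DELIVER, RX_DEFER, RX_DROP };

enum rx_verdict classify_seqno(uint16_t seqno, uint16_t next_expected)
{
	/* Forward distance from next_expected, modulo 2^16. */
	uint16_t ahead = (uint16_t)(seqno - next_expected);

	/*
	 * Receive window test first. Packets older than next_expected
	 * show up as a very large forward distance, so they fail here too.
	 */
	if (ahead > MAX_LINK_WIN)
		return RX_DROP;

	/* Gap test second: in window but not next in sequence. */
	if (ahead != 0)
		return RX_DEFER;

	return RX_DELIVER;
}
```

Because the window test runs before the gap test, only packets inside the
window can ever be deferred.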

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:20 -07:00
Jon Paul Maloy f9aa358a81 tipc: simplify tipc_link_rcv() reception loop
Currently, all packets received in tipc_link_rcv() are unconditionally
added to the packet deferred queue, whereafter that queue is walked and
all its buffers evaluated for delivery. This is non-optimal and
makes the queue sorting function unnecessarily complex.

This commit changes the loop so that an arrived packet is evaluated
first, and added to the deferred queue only when a sequence number gap
is discovered. A non-empty deferred queue is walked until it is empty
or until its head's sequence number doesn't fit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:19 -07:00
Jon Paul Maloy 9945e8043e tipc: limit usage of temporary skb list during packet reception
During packet reception, the function tipc_link_rcv() adds its accepted
packets to a temporary buffer queue, before finally splicing this queue
into the lock protected input queue that will be delivered up to the
socket layer. The purpose is to reduce potential contention on the input
queue lock. However, since the vast majority of packets arrive in
sequence, they will anyway be added one by one to the input queue, and
the use of the temporary queue becomes a sub-optimization.

The only case where this queue makes sense is when unpacking buffers
from a bundle packet; here we want to avoid dozens of small buffers
to be added individually to the lock-protected input queue in a tight
loop.

In this commit, we remove the general usage of the temporary queue,
and keep it only for the packet unbundling case.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:18 -07:00
Insu Yun 175f8d6746 mlx4: correctly check failed allocation
When allocation fails, mlx4_alloc_cmd_mailbox returns -ENOMEM.
Since there is no case that mlx4_alloc_cmd_mailbox returns NULL,
it needs to be checked by IS_ERR, not IS_ERR_OR_NULL
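
A userspace sketch of the error-pointer convention, for illustration only. The
helpers below imitate the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() under
lowercase names, and mailbox_alloc() is a hypothetical stand-in for an
allocator that never returns NULL:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long err) { return (void *)(intptr_t)err; }

/* Decode the errno value from an error pointer. */
static inline long ptr_err(const void *p) { return (long)(intptr_t)p; }

/* True if p lies in the top MAX_ERRNO addresses, i.e. encodes an error. */
static inline bool is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct mailbox { char buf[64]; };

/* Hypothetical allocator: failure is reported as err_ptr(-ENOMEM). */
struct mailbox *mailbox_alloc(bool simulate_failure)
{
	struct mailbox *m = simulate_failure ? NULL : calloc(1, sizeof(*m));

	if (!m)
		return err_ptr(-ENOMEM);
	return m;
}
```

With such an allocator, a NULL check alone would miss the failure, and a
combined NULL-or-error check tests for a case that cannot occur.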

Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:31:38 -07:00
Eric Dumazet e87eb4051e bonding: support encapsulated ipv6 TSO
If using a sixtofour device on top of a bonding device,
skb segmentation of TCP traffic is done right before calling
bonding xmit, because bonding only enables TSO for IPv4.

This patch improves single flow performance by about 120 % on my hosts,
because segmentation is deferred right before calling slave xmit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:29:28 -07:00
David S. Miller 181e4246b4 Merge branch 'mlxsw-cleanups'
Jiri Pirko says:

====================
mlxsw: Driver update, cleanups

This patchset contains various cleanups and improvements in mlxsw driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:28:03 -07:00
Ido Schimmel 5cd16d8c78 mlxsw: cmd: Update CONFIG_PROFILE command documentation
The meaning of certain parameters in the profile passed to the device
during initialization has changed, so update their documentation
accordingly.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:57 -07:00
Ido Schimmel 801bd3defb mlxsw: Add trap group for control packets
Previously, we trapped flooded and control packets using the same trap
group. This can cause flooded packets to overflow the PCI bus and
prevent control packets (e.g. STP, LACP) from getting to the CPU.

Solve this by splitting the RX trap group to RX and control, which allows
us to configure a policer on the first, thereby preventing it from
overflowing the PCI bus.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:56 -07:00
Ido Schimmel f24af33015 mlxsw: Simplify traps creation
The Host Trap Group Table (HTGT) register configures trap groups, which
are populated with trap IDs using the Host PacKet Trap (HPKT) register.
However, a trap ID can only be present inside one trap group (the last
configured).

Instead of passing both the trap group and ID for the function that
packs HPKT, pass only the trap ID and derive from it the trap group.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:55 -07:00
Jiri Pirko ebb7963f9b mlxsw: Introduce mlxsw_reg_spms_vid_pack helper and use it
Introduce separate helper for packing SPMS VIDs, as it can be used for
multiple VIDs and not only for one as previous SPMS pack function
provided.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:55 -07:00
Ido Schimmel fa6ad058bc mlxsw: reg: Adjust definition of enum mlxsw_reg_sfgc_type
Define max which would be needed later on.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:54 -07:00
Jiri Pirko 36b78e8aba mlxsw: reg: Remove extra space in SFGC ID define
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:53 -07:00
Jiri Pirko 3f0effd16b mlxsw: reg: Uppercase letters in register IDs
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:52 -07:00
Jiri Pirko 6cf9dc8b77 mlxsw: Use dev_level_ratelimited instead of net_ratelimit & dev_level
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:51 -07:00
Jiri Pirko 18ea54454e mlxsw: core: Do not use EMADs in mlxsw_emad_fini
Be symmetric with mlxsw_emad_init and don't use EMADs in mlxsw_emad_fini
cleanup function. Use command interface instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:51 -07:00
Jiri Pirko 3e2206da73 mlxsw: pci: Limit number of entries being sent in single MAP_FA cmd
Firmware accepts only limited number of mapping entries for MAP_FA
command. In order to prevent overflow, introduce a limit and in case the
number of entries is bigger, call MAP_FA multiple times.
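
The chunking can be sketched like this. The limit value, the callback type and
all names are assumptions made for the example, not the driver's definitions:

```c
#include <stddef.h>

#define MAP_MAX_ENTRIES 32	/* hypothetical per-command limit */

/* Hypothetical command callback: maps 'count' entries, 0 on success. */
typedef int (*map_cmd_fn)(const unsigned long *entries, size_t count,
			  void *ctx);

/* Issue the command as many times as needed, MAP_MAX_ENTRIES at a time. */
int map_in_chunks(const unsigned long *entries, size_t total,
		  map_cmd_fn cmd, void *ctx)
{
	size_t done = 0;

	while (done < total) {
		size_t n = total - done;
		int err;

		if (n > MAP_MAX_ENTRIES)
			n = MAP_MAX_ENTRIES;

		err = cmd(entries + done, n, ctx);
		if (err)
			return err;	/* stop at the first failing command */
		done += n;
	}
	return 0;
}

/* Demo callback: records the number of calls and the largest batch. */
struct map_stats { size_t calls; size_t max_batch; };

int record_cmd(const unsigned long *entries, size_t count, void *ctx)
{
	struct map_stats *st = ctx;

	(void)entries;
	st->calls++;
	if (count > st->max_batch)
		st->max_batch = count;
	return 0;
}
```

With a limit of 32, mapping 70 entries results in three commands carrying 32,
32 and 6 entries.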

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:49 -07:00
Jiri Pirko c85c3882ad mlxsw: pci: Remove MLXSW_PCI_RDQS/SDQS defines and checks
Remove strict number check of queues count as various ASICs have
different counts.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:48 -07:00
Jiri Pirko 424e1114af mlxsw: pci: Do not use MLXSW_PCI_SDQS_COUNT define
Use mlxsw_pci_sdq_count helper instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:48 -07:00
Jiri Pirko e4c870b1b4 mlxsw: pci: Use MLXSW_PCI_CQS_MAX instead of MLXSW_PCI_CQS_COUNT
The count of CQs can be different for various ASICs, so just define
maximal value and check for that.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:47 -07:00
Jiri Pirko ffe053285b mlxsw: switchx2: Use ETH_ALEN for mac address length
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:45 -07:00
Ido Schimmel 33a704a59b mlxsw: Remove multicast ID configuration
With respect to a firmware change, the Switch Multicast ID (SMID)
register is no longer needed, so the related configuration code can be
removed.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:27:45 -07:00
Lendacky, Thomas 96aec91148 amd-xgbe: Use system workqueue for device restart
A previous patch switched from using the system workqueue to the device
workqueue for various operations. During a device restart the device
workqueue is flushed so the restart cannot use this workqueue or else
a deadlock results.  Move the device restart back to using the system
workqueue.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 06:13:35 -07:00
David S. Miller 74661bee1f Merge branch 'switchdev-locking'
Jiri Pirko says:

====================
switchdev: change locking

This is something which I'm currently struggling with.
Callers of attr_set and obj_add/del often hold not only RTNL, but also
spinlock (bridge). So in that case, the driver implementing the op cannot sleep.

The way rocker is dealing with this now is just to invoke driver operation
and go out, without any checking or reporting of the operation status.

Since it would be nice to at least put a warning in case the operation fails,
it makes sense to do this in delayed work directly in switchdev core
instead of implementing this in separate drivers. And that is what this patchset
is introducing.

So from now on, the locking of switchdev mod ops is consistent. Caller either
holds rtnl mutex or in case it does not, caller sets defer flag, telling
switchdev core to process the op later, in deferred queue.

Function to force to process switchdev deferred ops can be called by op
caller in appropriate location, for example after it releases
spin lock, to force switchdev core to process pending ops.
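
A much simplified sketch of the defer-or-run-now idea, using a fixed-size
array instead of delayed work and no locking. All names are hypothetical and
none of this is the switchdev API:

```c
#include <stdbool.h>
#include <stddef.h>

#define DEFER_MAX 8

typedef int (*op_fn)(int arg);

struct deferred_op { op_fn fn; int arg; };

struct defer_queue {
	struct deferred_op ops[DEFER_MAX];
	size_t count;
};

/*
 * Run the op immediately when the caller may sleep (defer == false),
 * otherwise queue it. Returns the op's result, 0 if queued, or -1 if
 * the queue is full.
 */
int op_submit(struct defer_queue *q, op_fn fn, int arg, bool defer)
{
	if (!defer)
		return fn(arg);
	if (q->count == DEFER_MAX)
		return -1;
	q->ops[q->count].fn = fn;
	q->ops[q->count].arg = arg;
	q->count++;
	return 0;
}

/*
 * Process all pending ops in FIFO order and empty the queue. Returns
 * how many ops failed, so that the caller can at least warn.
 */
size_t defer_process(struct defer_queue *q)
{
	size_t i, failed = 0;

	for (i = 0; i < q->count; i++)
		if (q->ops[i].fn(q->ops[i].arg) != 0)
			failed++;
	q->count = 0;
	return failed;
}

/* Demo op: fails for negative arguments. */
int demo_op(int arg)
{
	return arg < 0 ? -1 : 0;
}
```

A caller that holds a spinlock would submit with defer set and call
defer_process() after releasing the lock.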

v1->v2:
- rebased on current net-next head (including Scott's ageing patchset)
v2->v3:
- fixed comment s/of/or/ typo suggested by Nik
v3->v4:
- the actual patchset is sent instead of different branch I sent in v3 :/
v4->v5:
- added patch to "const" attr param
- reworked deferred ops infrastructure (mainly patch number 1 and
  internal users (patch 3 and 5)) - resolves the issue pointed out
  by John
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 06:09:58 -07:00
Jiri Pirko 771acac2ff switchdev: assert rtnl mutex when going over lower netdevs
netdev_for_each_lower_dev has to be called with rtnl mutex held. So
better enforce it in switchdev functions.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 06:09:53 -07:00
Jiri Pirko d33eeb645d rocker: remove nowait from switchdev callbacks.
No need to avoid sleeping in switchdev callbacks now, as the switchdev
core allows it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 06:09:51 -07:00
Jiri Pirko 56607386e8 bridge: defer switchdev fdb del call in fdb_del_external_learn
Since a spinlock is held here, defer the switchdev operation. Also, ensure
that deferred switchdev ops are processed before the port master device
is unlinked.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 06:09:50 -07:00
Jiri Pirko 4d429c5ddc switchdev: introduce possibility to defer obj_add/del
Similar to the attr use case, the caller knows whether it is holding RTNL and
is in an atomic section. So let the caller decide the correct call variant.

This allows drivers to sleep inside their ops and wait for hw to get the
operation status. Then the status is propagated into switchdev core.
This avoids silent errors in drivers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 06:09:49 -07:00
Jiri Pirko 850d0cbc91 switchdev: remove pointers from switchdev objects
When an object is used in deferred work, we cannot use pointers in
switchdev object structures, because the memory they point to may already be
used by someone else. Instead, keep a local copy of the value.
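
The difference can be illustrated with two hypothetical structures; the names
and fields are assumptions for the example, not the actual switchdev
definitions:

```c
#include <string.h>

#define DEMO_ADDR_LEN 6

/* Pointer variant: only valid while the caller's buffer stays alive. */
struct fdb_obj_ptr {
	const unsigned char *addr;
	unsigned short vid;
};

/* Copy variant: safe to process later from deferred work. */
struct fdb_obj_copy {
	unsigned char addr[DEMO_ADDR_LEN];
	unsigned short vid;
};

/* Fill the copy variant by duplicating the caller's address bytes. */
void fdb_obj_fill(struct fdb_obj_copy *obj, const unsigned char *addr,
		  unsigned short vid)
{
	memcpy(obj->addr, addr, DEMO_ADDR_LEN);
	obj->vid = vid;
}
```

After fdb_obj_fill() returns, the caller may reuse or free its buffer without
affecting the queued object.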

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 06:09:49 -07:00