Commit graph

444428 commits

Author SHA1 Message Date
Michal Schmidt e5eca6d41f rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
When running RHEL6 userspace on a current upstream kernel, "ip link"
fails to show VF information.

The reason is a kernel<->userspace API change introduced by commit
88c5b5ce5c ("rtnetlink: Call nlmsg_parse() with correct header length"),
after which the kernel does not see iproute2's IFLA_EXT_MASK attribute
in the netlink request.

iproute2 adjusted for the API change in its commit 63338dca4513
("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter").

The problem has been noticed before:
http://marc.info/?l=linux-netdev&m=136692296022182&w=2
(Subject: Re: getting VF link info seems to be broken in 3.9-rc8)

We can do better than tell those with old userspace to upgrade. We can
recognize the old iproute2 in the kernel by checking the netlink message
length. Even when including the IFLA_EXT_MASK attribute, its netlink
message is shorter than struct ifinfomsg.

With this patch "ip link" shows VF information in both old and new
iproute2 versions.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:07:42 -07:00
Per Hurtig bef1909ee3 tcp: fixing TLP's FIN recovery
Fix to a problem observed when losing a FIN segment that does not
contain data.  In such situations, TLP is unable to recover from
*any* tail loss and instead adds at least PTO ms to the
retransmission process, i.e., RTO = RTO + PTO.

Signed-off-by: Per Hurtig <per.hurtig@kau.se>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:05:51 -07:00
David S. Miller fba0e1a3cf Merge branch 'fec'
Fugang Duan says:

====================
net: fec: Enable Software TSO to improve the tx performance

Add SG and software TSO support for FEC.
This feature allows to improve outbound throughput performance.
Tested on imx6dl sabresd board, running iperf tcp tests shows:
        * 82% improvement comparing with NO SG & TSO patch

$ ethtool -K eth0 sg on
$ ethtool -K eth0 tso on
[  3] local 10.192.242.108 port 35388 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   181 MBytes   506 Mbits/sec
* cpu loading is 30%

$ ethtool -K eth0 sg off
$ ethtool -K eth0 tso off
[  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec

FEC HW support IP header and TCP/UDP hw checksum, support multi buffer descriptor transfer
one frame, but don't support HW TSO. And imx6q/dl SOC FEC Gbps speed has HW bus Bandwidth
limitation (400Mbps ~ 700Mbps), imx6sx SOC FEC Gbps speed has no HW bandwidth limitation.

The patch set just enable TSO feature, which is done following the mv643xx_eth driver.

Test result analyze:
imx6dl sabresd board: there have 82% improvement, since imx6dl FEC HW has bandwidth limitation,
                      the performance with SW TSO is a milestone.

Addition test:
imx6sx sdb board:
upstream still don't support imx6sx due to some patches being upstream... they use same FEC IP.
Use the SW TSO patches test imx6sx sdb board in internal kernel tree:
No SW TSO patch: tx bandwidth 840Mbps, cpu loading is 100%.
SW TSO patch:    tx bandwidth 942Mbps, cpu loading is 65%.
It means the patch set have great improvement for imx6sx FEC performance.

V2:
* From Frank Li's suggestion:
	Change the API "fec_enet_txdesc_entry_free" name to "fec_enet_get_free_txdesc_num".
* Summary David Laight and Eric Dumazet's thoughts:
	RX BD entry number change to 256.
* From ezequiel's suggestion:
	Follow the latest TSO fixes from his solution to rework the queue stop/wake-up.
	Avoid unmapping the TSO header buffers.
* From Eric Dumazet's suggestion:
	Avoid more bytes copy, just copying the unaligned part of the payload into first
	descriptor. The suggestion will bring more complex for the driver, and imx6dl FEC
	DMA need 16 bytes alignment, but cpu loading is not problem that cpu loading is
	30%, the current performance is so better. Later chip like imx6sx Gigbit FEC DMA
	support byte alignment, so there don't exist memory copy. So, the V2 version drop
	the suggestion.
	Anyway, thanks for Eric's response and suggestion.

V3:
* From David Laight's feedback:
	Decide to drop RX BD entry number change for the SW TSO patch set.
	I will generate one separate patch to increase RX BDs entry for interrupt coalescing feature which
	will be supported in my later patch set.

V4:
* From David Laight's feedback:
	Remove the conditional in .fec_enet_get_bd_index().

V5:
* Patch #4 update:
  From David Laight's feedback:
	"expect fec_enet_get_free_txdesc_num() to return one less than it does currently."
	Change the function:
	Return space available, 0..size-1.  it always leave one free entry. Which is same as linux circ_buf.

Thanks for Eric and ezequiel's help and idea.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:02:08 -07:00
Nimrod Andy 79f339125e net: fec: Add software TSO support
Add software TSO support for FEC.
This feature allows to improve outbound throughput performance.

Tested on imx6dl sabresd board, running iperf tcp tests shows:
- 16.2% improvement comparing with FEC SG patch
- 82% improvement comparing with NO SG & TSO patch

$ ethtool -K eth0 tso on
$ iperf -c 10.192.242.167 -t 3 &
[  3] local 10.192.242.108 port 35388 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   181 MBytes   506 Mbits/sec

During the testing, CPU loading is 30%.
Since imx6dl FEC Bandwidth is limited to SOC system bus bandwidth, the
performance with SW TSO is a milestone.

CC: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Laight <David.Laight@ACULAB.COM>
CC: Li Frank <B20596@freescale.com>
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:01:57 -07:00
Nimrod Andy 6e909283cb net: fec: Add Scatter/gather support
Add Scatter/gather support for FEC.
This feature allows to improve outbound throughput performance.

Tested on imx6dl sabresd board:
Running iperf tests shows a 55.4% improvement.

$ ethtool -K eth0 sg off
$ iperf -c 10.192.242.167 -t 3 &
[  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec

$ ethtool -K eth0 sg on
$ iperf -c 10.192.242.167 -t 3 &
[  3] local 10.192.242.108 port 52617 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   154 MBytes   432 Mbits/sec

CC: Li Frank <B20596@freescale.com>
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:01:57 -07:00
Nimrod Andy 55d0218ae2 net: fec: Increase buffer descriptor entry number
In order to support SG, software TSO, let's increase BD entry number.

CC: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:01:57 -07:00
Nimrod Andy 09d1e541fd net: fec: Factorize feature setting
In order to enhance the code readable, let's factorize the
feature list.

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:01:57 -07:00
Nimrod Andy 96c50caa51 net: fec: Enable IP header hardware checksum
IP header checksum is calcalated by network layer in default.
To support software TSO, it is better to use HW calculate the
IP header checksum.

FEC hw checksum feature request the checksum field in frame
is zero, otherwise the calculative CRC is not correct.

For segmentated TCP packet, HW calculate the IP header checksum again,
it doesn't bring any impact. For SW TSO, HW calculated checksum bring
better performance.

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:01:57 -07:00
Nimrod Andy 61a4427b95 net: fec: Factorize the .xmit transmit function
Make the code more readable and easy to support other features like
SG, TSO, moving the common transmit function to one api.

And the patch also factorize the getting BD index to it own function.

CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:01:57 -07:00
Linus Lüssing 3993c4e159 bridge: fix compile error when compiling without IPv6 support
Some fields in "struct net_bridge" aren't available when compiling the
kernel without IPv6 support. Therefore adding a check/macro to skip the
complaining code sections in that case.

Introduced by 2cd4143192
("bridge: memorize and export selected IGMP/MLD querier port")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:00:24 -07:00
Linus Lüssing 6c03ee8bda bridge: fix smatch warning / potential null pointer dereference
"New smatch warnings:
  net/bridge/br_multicast.c:1368 br_ip6_multicast_query() error:
    we previously assumed 'group' could be null (see line 1349)"

In the rare (sort of broken) case of a query having a Maximum
Response Delay of zero, we could create a potential null pointer
dereference.

Fixing this by skipping the multicast specific MLD Query parsing again
if no multicast group address is available.

Introduced by dc4eb53a99
("bridge: adhere to querier election mechanism specified by RFCs")

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 11:00:24 -07:00
François Cachereul 179584388d via-rhine: fix full-duplex with autoneg disable
With some specific configuration (VT6105M on Soekris 5510 and depending
on the device at the other end), fragmented packets were not transmitted
when forcing 100 full-duplex with autoneg disable.

This fix now write full-duplex chips register when forcing full or
half-duplex not only when autoneg is enable.

Signed-off-by: François Cachereul <f.cachereul@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 10:31:10 -07:00
David S. Miller a4d3de0d5f Merge branch 'bnx2x'
Yuval Mintz says:

====================
bnx2x: Bug fixes patch series

This patch series contains various bug fixes - 2 link related fixes,
one sriov-related issue and an additional fix for a theoretical bug
on new boards.

Please consider applying these patches to `net'.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 10:28:49 -07:00
Ariel Elior f2cfa997ef bnx2x: Enlarge the dorq threshold for VFs
A malicious VF might try to starve the other VFs & PF by creating
contineous doorbell floods. In order to negate this, HW has a threshold of
doorbells per client, which will stop the client doorbells from arriving
if crossed.

The threshold currently configured for VFs is too low - under extreme traffic
scenarios, it's possible for a VF to reach the threshold and thus for its
fastpath to stop working.

Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 10:28:18 -07:00
Yuval Mintz b17b0ca164 bnx2x: Check for UNDI in uncommon branch
If L2FW utilized by the UNDI driver has the same version number as that
of the regular FW, a driver loading after UNDI and receiving an uncommon
answer from management will mistakenly assume the loaded FW matches its
own requirement and try to exist the flow via FLR.

Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 10:28:18 -07:00
Yaniv Rosner a2755be5b5 bnx2x: Fix 1G-baseT link
Set the phy access mode even in case of link-flap avoidance.

Signed-off-by: Yaniv Rosner <yaniv.rosner@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 10:28:18 -07:00
Yaniv Rosner dad91ee478 bnx2x: Fix link for KR with swapped polarity lane
This avoids clearing the RX polarity setting in KR mode when polarity lane
is swapped, as otherwise this will result in failed link.

Signed-off-by: Yaniv Rosner <yaniv.rosner@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 10:28:18 -07:00
Xufeng Zhang d3217b15a1 sctp: Fix sk_ack_backlog wrap-around problem
Consider the scenario:
For a TCP-style socket, while processing the COOKIE_ECHO chunk in
sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
a new association would be created in sctp_unpack_cookie(), but afterwards,
some processing maybe failed, and sctp_association_free() will be called to
free the previously allocated association, in sctp_association_free(),
sk_ack_backlog value is decremented for this socket, since the initial
value for sk_ack_backlog is 0, after the decrement, it will be 65535,
a wrap-around problem happens, and if we want to establish new associations
afterward in the same socket, ABORT would be triggered since sctp deem the
accept queue as full.
Fix this issue by only decrementing sk_ack_backlog for associations in
the endpoint's list.

Fix-suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12 10:27:14 -07:00
David S. Miller 902455e007 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/core/rtnetlink.c
	net/core/skbuff.c

Both conflicts were very simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 16:02:55 -07:00
Doug Ledford c5b4616087 net/core: Add VF link state control policy
Commit 1d8faf48c7 (net/core: Add VF link state control) added VF link state
control to the netlink VF nested structure, but failed to add a proper entry
for the new structure into the VF policy table.  Add the missing entry so
the table and the actual data copied into the netlink nested struct are in
sync.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:51:37 -07:00
Andy Fleming 39f33367e4 net/fsl: xgmac_mdio is dependent on OF_MDIO
Signed-off-by: Shruti Kanetkar <Shruti@Freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:50:59 -07:00
Shruti Kanetkar 55fd36419c net/fsl: Make xgmac_mdio read error message useful
Print the device address, the register number and the PHY ID for
which the MDIO read operation failed

Signed-off-by: Shruti Kanetkar <Shruti@Freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:50:59 -07:00
Florian Westphal 6e765a009a net_sched: drr: warn when qdisc is not work conserving
The DRR scheduler requires that items on the active list are work
conserving, i.e. do not hold on to skbs for throttling purposes, etc.
Attaching e.g. tbf renders DRR useless because all other classes on the
active list are delayed as well.

So, warn users that this configuration won't work as expected; we
already do this in couple of other qdiscs, see e.g.

commit b00355db3f
('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler')

The 'const' change is needed to avoid compiler warning ("discards 'const'
qualifier from pointer target type").

tested with:
drr_hier() {
        parent=$1
        classes=$2
        for i in  $(seq 1 $classes); do
                classid=$parent$(printf %x $i)
                tc class add dev eth0 parent $parent classid $classid drr
		tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit
        done
}
tc qdisc add dev eth0 root handle 1: drr
drr_hier 1: 32
tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:50:59 -07:00
David S. Miller f3591fd4c9 Merge branch 'inet_csums'
Tom Herbert says:

====================
net: Checksum offload changes - Part IV

I am working on overhauling RX checksum offload. Goals of this effort
are:

- Specify what exactly it means when driver returns CHECKSUM_UNNECESSARY
- Preserve CHECKSUM_COMPLETE through encapsulation layers
- Don't do skb_checksum more than once per packet
- Unify GRO and non-GRO csum verification as much as possible
- Unify the checksum functions (checksum_init)
- Simply code

What is in this fourth patch set:

- Preserve CHECKSUM_COMPLETE instead of changing it to
  CHECKSUM_UNNECESSARY. This allows correct reuse in validating multiple
  csums in a packet.
- When SW needs to compute the packet checksum, save it as
  CHECKSUM_COMPLETE. Also mark that checksum was compute by SW.
- Add skb_gro_postpull_rcsum to udp and vxlan to make GRO work with
  CHECKSUM_COMPLETE.

v2: Removed patch setting skb_encapsulation when validating checksum
    in tcp_gro_receive

Please review carefully and test if possible, mucking with basic
checksum functions is always a little precarious :-)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:46:17 -07:00
Tom Herbert 6bae1d4cc3 net: Add skb_gro_postpull_rcsum to udp and vxlan
Need to gro_postpull_rcsum for GRO to work with checksum complete.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:46:13 -07:00
Tom Herbert 7e3cead517 net: Save software checksum complete
In skb_checksum complete, if we need to compute the checksum for the
packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
Subsequent checksum verification can use this.

Also, added csum_complete_sw flag to distinguish between software and
hardware generated checksum complete, we should always be able to trust
the software computation.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:46:13 -07:00
Tom Herbert 5d0c2b95bc net: Preserve CHECKSUM_COMPLETE at validation
Currently when the first checksum in a packet is validated using
CHECKSUM_COMPLETE, ip_summed is overwritten to be CHECKSUM_UNNECESSARY
so that any subsequent checksums in the packet are not correctly
validated.

This patch adds csum_valid flag in sk_buff and uses that to indicate
validated checksum instead of setting CHECKSUM_UNNECESSARY. The bit
is set accordingly in the skb_checksum_validate_* functions. The flag
is checked in skb_checksum_complete, so that validation is communicated
between checksum_init and checksum_complete sequence in TCP and UDP.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:46:12 -07:00
David S. Miller 1054cc150c Merge branch 'qlcnic-next'
Shahed Shaikh says:

====================
This series contains an enhancement in the area of firmware minidump collection
and optimization of ring count validation function.

Please apply this series to net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:44:42 -07:00
Shahed Shaikh 038782d6d0 qlcnic: Update version to 5.3.60
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:44:29 -07:00
Shahed Shaikh 18e0d62533 qlcnic: Optimize ring count validations
- Check interrupt mode at the start of qlcnic_set_channels().
- Do not validate ring count if they are not going to change.

Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:44:29 -07:00
Shahed Shaikh 4da005cf1e qlcnic: Pre-allocate DMA buffer used for minidump collection
Pre-allocate the physically contiguous DMA buffer used for
minidump collection at driver load time, rather than at
run time, to minimize allocation failures. Driver will allocate
the buffer at load time if PEX DMA support capability is indicated
by the adapter.

Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:44:29 -07:00
Dmitry Popov efd0f11d85 ip_vti: fix sparse warnings for VTI_ISVTI
This patch fixes the following sparse warnings:

net/ipv4/ip_tunnel.c:245:53: warning: restricted __be16 degrades to integer
net/ipv4/ip_vti.c:321:19: warning: incorrect type in assignment (different base types)
net/ipv4/ip_vti.c:321:19:    expected restricted __be16 [addressable] [assigned] [usertype] i_flags
net/ipv4/ip_vti.c:321:19:    got int
net/ipv4/ip_vti.c:447:24: warning: incorrect type in assignment (different base types)
net/ipv4/ip_vti.c:447:24:    expected restricted __be16 [usertype] i_flags
net/ipv4/ip_vti.c:447:24:    got int

Since VTI_ISVTI is always used with ip_tunnel_parm->i_flags (which is __be16),
we can __force cast VTI_ISVTI to __be16 in header file.

Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:19 -07:00
Dan Carpenter 2f87208efb drivers: net: davinci_cpdma: double free on error
We recently change the kzalloc() to devm_kzalloc() so freeing "ctlr"
here could lead to a double free.

Fixes: e194312854 ('drivers: net: davinci_cpdma: Convert kzalloc() to devm_kzalloc().')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:19 -07:00
Dan Carpenter 8fc908c3c3 amd-xgbe: unwind on error in xgbe_mdio_register()
There is a typo here so we return directly instead of unwinding.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:19 -07:00
Varka Bhadram 0aaf43f51d mrf24j40: add device managed APIs
adds the device managed APIs so that no need worry about
freeing the resources.

Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:19 -07:00
stephen hemminger f647944995 ceph: remove bogus extern
Sparse complained about this bogus extern on definition of
a function.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:19 -07:00
Alexei Starovoitov 783e327b69 net: filter: document internal instruction encoding
This patch adds a description of eBPFs instruction encoding in order
to bring the documentation in line with the implementation.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:18 -07:00
Alexei Starovoitov e4ad403269 net: filter: mention eBPF terminology as well
Since the term eBPF is used anyway on mailing list discussions, lets
also document that in the main BPF documentation file and replace a
couple of occurrences with eBPF terminology to be more clear.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:18 -07:00
Eric Dumazet 9709674e68 ipv4: fix a race in ip4_datagram_release_cb()
Alexey gave a AddressSanitizer[1] report that finally gave a good hint
at where was the origin of various problems already reported by Dormando
in the past [2]

Problem comes from the fact that UDP can have a lockless TX path, and
concurrent threads can manipulate sk_dst_cache, while another thread,
is holding socket lock and calls __sk_dst_set() in
ip4_datagram_release_cb() (this was added in linux-3.8)

It seems that all we need to do is to use sk_dst_check() and
sk_dst_set() so that all the writers hold same spinlock
(sk->sk_dst_lock) to prevent corruptions.

TCP stack do not need this protection, as all sk_dst_cache writers hold
the socket lock.

[1]
https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel

AddressSanitizer: heap-use-after-free in ipv4_dst_check
Read of size 2 by thread T15453:
 [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
 [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
 [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
 [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
 [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Freed by thread T15455:
 [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
 [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
 [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Allocated by thread T15453:
 [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
 [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
 [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
__mkroute_output ./net/ipv4/route.c:1939
 [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
 [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
 [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

[2]
<4>[196727.311203] general protection fault: 0000 [#1] SMP
<4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
<4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
<4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
<4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
<4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
<4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
<4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
<4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
<4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
<4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
<4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[196727.311713] Stack:
<4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
<4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
<4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
<4>[196727.311885] Call Trace:
<4>[196727.311907]  <IRQ>
<4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
<4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
<4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
<4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
<4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
<4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
<4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
<4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
<4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
<4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
<4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
<4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
<4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
<4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
<4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
<4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
<4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
<4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
<4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
<4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
<4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
<4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
<4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
<4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
<4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
<4>[196727.312722]  <EOI>
<1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.313100]  RSP <ffff885effd23a70>
<4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
<0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt

Reported-by: Alexey Preobrazhensky <preobr@google.com>
Reported-by: dormando <dormando@rydia.ne>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 8141ed9fce ("ipv4: Add a socket release callback for datagram sockets")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:18 -07:00
Daniel Borkmann a101ccd141 net: filter: add test_bpf module under MAINTAINERS' networking section
Add lib/test_bpf.c entry to maintainers file under networking.
All changes were posted via netdev for review, so make sure
other people Cc it as well when they call get_maintainer.pl.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:18 -07:00
Octavian Purdila bad93e9d4e net: add __pskb_copy_fclone and pskb_copy_for_clone
There are several instances where a pskb_copy or __pskb_copy is
immediately followed by an skb_clone.

Add a couple of new functions to allow the copy skb to be allocated
from the fclone cache and thus speed up subsequent skb_clone calls.

Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <antonio@meshcoding.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Arvid Brodin <arvid.brodin@alten.se>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Allan Stephens <allan.stephens@windriver.com>
Cc: Andrew Hendry <andrew.hendry@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:38:02 -07:00
Jon Cooper daf37b556e sfc: PIO:Restrict to 64bit arch and use 64-bit writes.
Fixes:ee45fd92c739
("sfc: Use TX PIO for sufficiently small packets")

The linux net driver uses memcpy_toio() in order to copy into
the PIO buffers.
Even on a 64bit machine this causes 32bit accesses to a write-
combined memory region.
There are hardware limitations that mean that only 64bit
naturally aligned accesses are safe in all cases.
Due to being write-combined memory region two 32bit accesses
may be coalesced to form a 64bit non 64bit aligned access.
Solution was to open-code the memory copy routines using pointers
and to only enable PIO for x86_64 machines.

Not tested on platforms other than x86_64 because this patch
disables the PIO feature on other platforms.
Compile-tested on x86 to ensure that works.

The WARN_ON_ONCE() code in the previous version of this patch
has been moved into the internal sfc debug driver as the
assertion was unnecessary in the upstream kernel code.

This bug fix applies to v3.13 and v3.14 stable branches.

Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:36:21 -07:00
David S. Miller 1a0b20b257 Merge branch 'bridge-next'
Toshiaki Makita says:

====================
bridge: 802.1ad vlan protocol support

Currently bridge vlan filtering doesn't work fine with 802.1ad protocol.
Only if a bridge is configured without pvid, the bridge receives only
802.1ad tagged frames and no STP is used, it will work.
Otherwise:
- If pvid is configured, it can put only 802.1Q tags but cannot put 802.1ad
  tags.
- If 802.1Q and 802.1ad tagged frames arrive in mixture, it applies filtering
  regardless of their protocols.
- While an 802.1ad bridge should use another mac address for STP BPDU and
  should forward customer's BPDU frames, it can't.
Thus, we can't properly handle frames once 802.1ad is used.

Handling 802.1ad is useful if we want to allow stacked vlans to be used,
e.g., guest VMs wants to use vlan tags and the host also wants to segregate
guest's traffic from other guests' by vlan tags.

Here is the image describing how to configure a bridge to filter VMs traffic.

         +-------+p/u   +-----+  +---------+
 +----+  |       |------|vnet0|--|User A VM|
 |eth0|--|802.1ad|      +-----+  +---------+
 +----+  |bridge |p/u   +-----+  +---------+
         |       |------|vnet1|--|User B VM|
         +-------+      +-----+  +---------+
p/u: pvid/untagged

This patch set enables us to set vlan protocols per bridge.
This tries to implement a bridge like S-VLAN component in IEEE 802.1Q-2011
spec.

Note that there is another possible implementation that sets vlan protocols
per port. Some HW switches seem to take that approach.
However, I think per-bridge approach is better, because;
- I think the typical usage of an 802.1ad bridge is segregating 802.1Q tagged
  traffic (like what is described above), and this doesn't need the ability to
  be set protocols per port. Also, If a bridge has many ports and it supports
  per-port setting, we might have to make much more extra configurations to
  change protocols of all ports.

- I assume that the main perpose to set protocol per port is to assign S-VID
  according to C-VID, or to realize two logical bridges (one is an 802.1Q
  filtering bridge and the other is an 802.1ad filtering bridge) in one bridge.
  The former usually needs additional features such as vlan id mapping, and
  is likely to make bridge's code complicated. If a user wants, such enhanced
  features can be accomplished by a combination of multiple bridges, so it is
  not absolutely necessary to implement these features in a bridge itself.
  The latter is simply unnecessary because we can easily make two bridges of
  which one is an 802.1Q bridge and the other is an 802.1ad bridge.

Here is an example of the enhanced feature that we can realize by using
multiple bridges and veth interfaces. This way is documented in
IEEE 802.1Q-2011 clause 15.4 (C-tagged service interface).

 +----+  +-------+p/u         +------+  +----+  +--+
 |eth0|--|802.1ad|----veth----|802.1Q|--|vnet|--|VM|
 +----+  |bridge |----veth----|bridge|  +----+  +--+
         +-------+p/u         +------+
p/u: pvid/untagged

In this configuration, we can map C-VIDs to any S-VID.
For example;
 C-VID 10 and 20 to S-VID 100
 C-VID 30 to S-VID 110
This is achieved through the 802.1Q bridge that forwards C-tagged frames to
proper ports of the 802.1ad bridge.

Changes:
v1 -> v2:
- Make the way to forward bridge group addresses more generic by introducing
  new mask, group_fwd_mask_required.

RFC -> v1:
- Add S-TAG tx offload.
- Remove a fix around stacked vlan which has already been fixed.
- Take into account Bridge Group Addresses.
- Separate handling of protocol-mismatch from br_vlan_get_tag().
- Change the way to set vlan_proto from netlink to sysfs because no other
  existing configuration per bridge can be set by netlink.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:23:03 -07:00
Toshiaki Makita 204177f3f3 bridge: Support 802.1ad vlan filtering
This enables us to change the vlan protocol for vlan filtering.
We come to be able to filter frames on the basis of 802.1ad vlan tags
through a bridge.

This also changes br->group_addr if it has not been set by user.
This is needed for an 802.1ad bridge.
(See IEEE 802.1Q-2011 8.13.5.)

Furthermore, this sets br->group_fwd_mask_required so that an 802.1ad
bridge can forward the Nearest Customer Bridge group addresses except
for br->group_addr, which should be passed to higher layer.

To change the vlan protocol, write a protocol in sysfs:
# echo 0x88a8 > /sys/class/net/br0/bridge/vlan_protocol

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:22:53 -07:00
Toshiaki Makita f2808d226f bridge: Prepare for forwarding another bridge group addresses
If a bridge is an 802.1ad bridge, it must forward another bridge group
addresses (the Nearest Customer Bridge group addresses).
(For details, see IEEE 802.1Q-2011 8.6.3.)

As user might not want group_fwd_mask to be modified by enabling 802.1ad,
introduce a new mask, group_fwd_mask_required, which indicates addresses
the bridge wants to forward. This will be set by enabling 802.1ad.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:22:53 -07:00
Toshiaki Makita 8580e2117c bridge: Prepare for 802.1ad vlan filtering support
This enables a bridge to have vlan protocol informantion and allows vlan
tag manipulation (retrieve, insert and remove tags) according to the vlan
protocol.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:22:53 -07:00
Toshiaki Makita 1c5abb6c77 bridge: Add 802.1ad tx vlan acceleration
Bridge device doesn't need to embed S-tag into skb->data.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:22:53 -07:00
Arnd Bergmann e7b599d7c1 net: xen-netback: include linux/vmalloc.h again
commit e9ce7cb6b1 ("xen-netback: Factor queue-specific data into
queue struct") added a use of vzalloc/vfree to interface.c, but
removed the #include <linux/vmalloc.h> statement at the same time,
which causes this build error:

drivers/net/xen-netback/interface.c: In function 'xenvif_free':
drivers/net/xen-netback/interface.c:754:2: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
  vfree(vif->queues);
  ^
cc1: some warnings being treated as errors

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Andrew J. Bennieston <andrew.bennieston@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:19:28 -07:00
Jongsung Kim 71b9c4a838 net: phy: realtek: register/unregister multiple drivers properly
Using phy_drivers_register/_unregister functions is proper way to
handle multiple PHY drivers registration. For Realtek PHY drivers
module, it fixes incomplete current error-handlings up and adds
missed unregistration for the RTL8201CP driver.

Signed-off-by: Jongsung Kim <neidhard.kim@lge.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:17:27 -07:00
Yoshihiro Shimoda 1b72a0fc9c net: sh_eth: Fix timing of RACT setting in sh_eth_rx()
This patch fixes an issue that we cannot use nfs rootfs correctly
on r8a7790 when the command below runs on a host PC.

 $ sudo ping -f -l 8 $BOARD_IP_ADDR

Since the driver sets the RACT to 1 in the first while loop of
sh_eth_rx(), the controller accepts a next frame into the next RX
descriptor during the while loop. But, in the first while loop
doesn't allocate a next skb. So, this patch removes the RACT setting
in the first while loop of sh_eth_rx().

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:15:31 -07:00