For the short-term solution, lets fix bpf helper functions to use
skb->mac_header relative offsets instead of skb->data in order to
get the same eBPF programs with cls_bpf and act_bpf work on ingress
and egress qdisc path. We need to ensure that mac_header is set
before calling into programs. This is effectively the first option
from below referenced discussion.
More long term solution for LD_ABS|LD_IND instructions will be more
intrusive but also more beneficial than this, and implemented later
as it's too risky at this point in time.
I.e., we plan to look into the option of moving skb_pull() out of
eth_type_trans() and into netif_receive_skb() as has been suggested
as second option. Meanwhile, this solution ensures ingress can be
used with eBPF, too, and that we won't run into ABI troubles later.
For dealing with negative offsets inside eBPF helper functions,
we've implemented bpf_skb_clone_unwritable() to test for unwriteable
headers.
Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
Fixes: 608cd71a9c ("tc: bpf: generalize pedit action")
Fixes: 91bc4822c3 ("tc: bpf: add checksum helpers")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Al Viro says:
====================
netdev-related stuff in vfs.git
There are several commits sitting in vfs.git that probably ought to go in
via net-next.git. First of all, there's merge with vfs.git#iocb - that's
Christoph's aio rework, which has triggered conflicts with the ->sendmsg()
and ->recvmsg() patches a while ago. It's not so much Christoph's stuff
that ought to be in net-next, as (pretty simple) conflict resolution on merge.
The next chunk is switch to {compat_,}import_iovec/import_single_range - new
safer primitives for initializing iov_iter. The primitives themselves come
from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
conversion of net/socket.c syscalls belongs in net-next, IMO. Next there's
afs and rxrpc stuff from dhowells. And then there's sanitizing kernel_sendmsg
et.al. + missing inlined helper for "how much data is left in msg->msg_iter" -
this stuff is used in e.g. cifs stuff, but it belongs in net-next.
That pile is pullable from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem
I'll post the individual patches in there in followups; could you take a look
and tell if everything in there is OK with you?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.
Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.
Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when a ingress qdisc is being either
initialized or destructed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-04-11
This series contains updates to iflink, ixgbe and ixgbevf.
The entire set of changes come from Vlad Zolotarov to ultimately add
the ethtool ops to VF driver to allow querying the RSS indirection table
and RSS random key.
Currently we support only 82599 and x540 devices. On those devices, VFs
share the RSS redirection table and hash key with a PF. Letting the VF
query this information may introduce some security risks, therefore this
feature will be disabled by default.
The new netdev op allows a system administrator to change the default
behaviour with "ip link set" command. The relevant iproute2 patch has
already been sent and awaits for this series upstream.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With recent adoption of skc_cookie in struct sock_common,
struct tcp_timewait_sock size increased from 192 to 200 bytes
on 64bit arches. SLAB rounds then to 256 bytes.
It is time to drop SLAB_HWCACHE_ALIGN constraint for twsk_slab.
This saves about 12 MB of memory on typical configuration reaching
262144 timewait sockets, and has no noticeable impact on performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add configuration setting for drivers to allow/block an RSS Redirection
Table and a Hash Key querying for discrete VFs.
On some devices VF share the mentioned above information with PF and
querying it may adduce a theoretical security risk. We want to let a
system administrator to decide if he/she wants to take this risk or not.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
When an FDB entry is added or deleted the information about VLAN
is not passed to listening applications like 'bridge monitor fdb'.
With this patch VLAN ID is passed if it was set in the original
netlink message.
Also remove an unused bdev variable.
Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Which this patch, it's possible to dump the list of ids allocated for peer
netns.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, netns ids that are created and deleted are advertised into the
group RTNLGRP_NSID.
Because callers of rtnl_net_notifyid() already know the id of the peer, there is
no need to call __peernet2id() in rtnl_net_fill().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to initialize err, it will be overridden by the value of nlmsg_parse().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.
And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.
We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.
The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/cmd.c
net/core/fib_rules.c
net/ipv4/fib_frontend.c
The fib_rules.c and fib_frontend.c conflicts were locking adjustments
in 'net' overlapping addition and removal of code in 'net-next'.
The mlx4 conflict was a bug fix in 'net' happening in the same
place a constant was being replaced with a more suitable macro.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 608cd71a9c ("tc: bpf: generalize pedit action") has added the
possibility to mangle packet data to BPF programs in the tc pipeline.
This patch adds two helpers bpf_l3_csum_replace() and bpf_l4_csum_replace()
for fixing up the protocol checksums after the packet mangling.
It also adds 'flags' argument to bpf_skb_store_bytes() helper to avoid
unnecessary checksum recomputations when BPF programs adjusting l3/l4
checksums and documents all three helpers in uapi header.
Moreover, a sample program is added to show how BPF programs can make use
of the mangle and csum helpers.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.
ipv6 does not conform with this in three places:
1) ip6_fragment: we do consult ipv6_npinfo for frag_size
2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
loop the packet back to the local socket
3) ip6_skb_dst_mtu could query the settings from the user socket and
force a wrong MTU
Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.
Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the ability to read out the skb->priority from an eBPF
program, so that it can be taken into account from a tc filter
or action for the use-case where the priority is not being used
to directly override the filter classification in a qdisc, but
to tag traffic otherwise for the classifier; the priority can be
assigned from various places incl. user space, in future we may
also mangle it from an eBPF program.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First, let's explain the problem.
Suppose you have an ipip interface that stands in the netns foo and its link
part in the netns bar (so the netns bar has an nsid into the netns foo).
Now, you remove the netns bar:
- the bar nsid into the netns foo is removed
- the netns exit method of ipip is called, thus our ipip iface is removed:
=> a netlink message is built in the netns foo to advertise this deletion
=> this netlink message requests an nsid for bar, thus a new nsid is
allocated for bar and never removed.
This patch adds a check in peernet2id() so that an id cannot be allocated for
a netns which is currently destroyed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts
commit 4217291e59 ("netns: don't clear nsid too early on removal").
This is not the right fix, it introduces races.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to hold rtnl lock for fib_rules_unregister()
otherwise the following race could happen:
fib_rules_unregister(): fib_nl_delrule():
... ...
... ops = lookup_rules_ops();
list_del_rcu(&ops->list);
list_for_each_entry(ops->rules) {
fib_rules_cleanup_ops(ops); ...
list_del_rcu(); list_del_rcu();
}
Note, net->rules_mod_lock is actually not needed at all,
either upper layer netns code or rtnl lock guarantees
we are safe.
Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/asix_common.c
drivers/net/usb/sr9800.c
drivers/net/usb/usbnet.c
include/linux/usb/usbnet.h
net/ipv4/tcp_ipv4.c
net/ipv6/tcp_ipv6.c
The TCP conflicts were overlapping changes. In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.
With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.
Signed-off-by: David S. Miller <davem@davemloft.net>
Virtual interfaces are supposed to set an iflink value != of their ifindex.
It was not the case for some of them, like vxlan, bond or bridge.
Let's set iflink to 0 when dev->rtnl_link_ops is set.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that all users of iflink have the ndo_get_iflink handler available, it's
possible to remove this field.
By default, dev_get_iflink() returns the ifindex of the interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The goal of this patch is to prepare the removal of the iflink field. It
introduces a new ndo function, which will be implemented by virtual interfaces.
There is no functional change into this patch. All readers of iflink field
now call dev_get_iflink().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike other places, this function uses name "dev" for what should be
"orig_dev", which might be a bit confusing. So fix this.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
As there are a number of (especially virtual) devices that don't
need the multiple vlan check, introduce passthru_features_check() for
convenience.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
To allow drivers to handle the features check for multiple tags,
move the check to ndo_features_check().
As no drivers currently handle multiple tagged TSO, introduce
dflt_features_check() and call it if the driver does not have
ndo_features_check().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Separate the two checks for single vlan and multiple vlans in
netif_skb_features(). This allows us to move the check for multiple
vlans to another function later.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
existing TC action 'pedit' can munge any bits of the packet.
Generalize it for use in bpf programs attached as cls_bpf and act_bpf via
bpf_skb_store_bytes() helper function.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the current code, ids are removed too early.
Suppose you have an ipip interface that stands in the netns foo and its link
part in the netns bar (so the netns bar has an nsid into the netns foo).
Now, you remove the netns bar:
- the bar nsid into the netns foo is removed
- the netns exit method of ipip is called, thus our ipip iface is removed:
=> a netlink message is sent in the netns foo to advertise this deletion
=> this netlink message requests an nsid for bar, thus a new nsid is
allocated for bar and never removed.
We must remove nsids when we are sure that nobody will refer to netns currently
cleaned.
Fixes: 0c7aecd4bd ("netns: add rtnl cmd to add and get peer netns ids")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If vlan offloading takes place then vlan header is removed from frame
and its contents, both vlan_tci and vlan_proto, is available to user
space via TPACKET interface. However, only vlan_tci can be used in BPF
filters.
This commit introduces a new BPF extension. It makes possible to load
the value of vlan_proto (vlan TPID) to register A. Support for classic
BPF and eBPF is being added, analogous to skb->protocol.
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Michal Sekletar <msekleta@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
With dev group, we can change a batch of net devices,
so we should allow to delete them together too.
Group 0 is not allowed to be deleted since it is
the default group.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We implement the SO_SNDLOWAT etc not to be settable and return
ENOPROTOOPT per 1003.1g 7. Move the comment to appropriate
position and update the reference.
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.
We hold syn_wait_lock for such small sections, that it makes no sense to use
a read/write lock. A spin lock is simply faster.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->priority can be set for two purposes:
1) With respect to IP TOS field, which is computed by a mask.
Ususally used for priority qdisc's (pfifo, prio etc.), on TX
side (we only have ingress qdisc on RX side).
2) Used as a classid or flowid, works in the same way with tc
classid. What's more, this can even override the classid
of tc filters.
For case 1), it has been respected within its netns, I don't
see any point of keeping it for another netns, especially
when packets will be forwarded to Rx path (no matter from TX
path or RX path).
For case 2) we care, our applications run inside a netns,
and we classify the packets by our own filters outside,
If some application sets this priority, it could bypass
our filters, therefore clear it when moving out of a netns,
it makes no sense to bypass tc filters out of its netns.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We send unicast neighbor (ARP or NDP) solicitations ucast_probes
times in PROBE state. Zhu Yanjun reported that some implementation
does not reply against them and the entry will become FAILED, which
is undesirable.
We had been dealt with such nodes by sending multicast probes mcast_
solicit times after unicast probes in PROBE state. In 2003, I made
a change not to send them to improve compatibility with IPv6 NDP.
Let's introduce per-protocol per-interface sysctl knob "mcast_
reprobe" to configure the number of multicast (re)solicitation for
reconfirmation in PROBE state. The default is 0, since we have
been doing so for 10+ years.
Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
CC: Ulf Samuelsson <ulf.samuelsson@ericsson.com>
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to prepare eBPF support for tc action, we need to add
sched_act_type, so that the eBPF verifier is aware of what helper
function act_bpf may use, that it can load skb data and read out
currently available skb fields.
This is bascially analogous to 96be4325f4 ("ebpf: add sched_cls_type
and map it to sk_filter's verifier ops").
BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT need to be
separate since both will have a different set of functionality in
future (classifier vs action), thus we won't run into ABI troubles
when the point in time comes to diverge functionality from the
classifier.
The future plan for act_bpf would be that it will be able to write
into skb->data and alter selected fields mirrored in struct __sk_buff.
For an initial support, it's sufficient to map it to sk_filter_ops.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
net/core/sysctl_net_core.c
net/ipv4/inet_diag.c
The be_main.c conflict resolution was really tricky. The conflict
hunks generated by GIT were very unhelpful, to say the least. It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().
So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged. And this worked beautifully.
The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c249739579 ("bpf: allow BPF programs access 'protocol' and 'vlan_tci'
fields") has added support for accessing protocol, vlan_present and vlan_tci
into the skb offset map.
As referenced in the below discussion, accessing skb->protocol from an eBPF
program should be converted without handling endianess.
The reason for this is that an eBPF program could simply do a check more
naturally, by f.e. testing skb->protocol == htons(ETH_P_IP), where the LLVM
compiler resolves htons() against a constant automatically during compilation
time, as opposed to an otherwise needed run time conversion.
After all, the way of programming both from a user perspective differs quite
a lot, i.e. bpf_asm ["ld proto"] versus a C subset/LLVM.
Reference: https://patchwork.ozlabs.org/patch/450819/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_ack_backlog & sk_max_ack_backlog were 16bit fields, meaning
listen() backlog was limited to 65535.
It is time to increase the width to allow much bigger backlog,
if admins change /proc/sys/net/core/somaxconn &
/proc/sys/net/ipv4/tcp_max_syn_backlog default values.
Tested:
echo 5000000 >/proc/sys/net/core/somaxconn
echo 5000000 >/proc/sys/net/ipv4/tcp_max_syn_backlog
Ran a SYNFLOOD test against a listener using listen(fd, 5000000)
myhost~# grep request_sock_TCP /proc/slabinfo
request_sock_TCP 4185642 4411940 304 13 1 : tunables 54 27 8 : slabdata 339380 339380 0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.
This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.
SYNACK are sent in huge bursts, likely to cause severe drops anyway.
This model was OK 15 years ago when memory was very tight.
We now can afford to have a timer per request sock.
Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.
With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.
This is ~100 times more what we could achieve before this patch.
Later, we will get rid of the listener hash and use ehash instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a networking device is taken down that has a non-trivial number
of VLAN devices configured under it, we eat a full synchronize_net()
for every such VLAN device.
This is because of the call chain:
NETDEV_DOWN notifier
--> vlan_device_event()
--> dev_change_flags()
--> __dev_change_flags()
--> __dev_close()
--> __dev_close_many()
--> dev_deactivate_many()
--> synchronize_net()
This is kind of rediculous because we already have infrastructure for
batching doing operation X to a list of net devices so that we only
incur one sync.
So make use of that by exporting dev_close_many() and adjusting it's
interfaace so that the caller can fully manage the batch list. Use
this in vlan_device_event() and all the overhead goes away.
Reported-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to port id allow netdevices to specify port names and export
the name via sysfs. Drivers can implement the netdevice operation to
assist udev in having sane default names for the devices using the
rule:
$ cat /etc/udev/rules.d/80-net-setup-link.rules
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}!="",
NAME="$attr{phys_port_name}"
Use of phys_name versus phys_id was suggested-by Jiri Pirko.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a tx_maxrate attribute to the tx queue sysfs entry allowing
for max-rate limiting. Along with DCB-ETS and BQL this provides another
knob to tune queue performance. The limit units are Mbps.
By default it is disabled. To disable the rate limitation after it
has been set for a queue, it should be set to zero.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The listener field in struct tcp_request_sock is a pointer
back to the listener. We now have req->rsk_listener, so TCP
only needs one boolean and not a full pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
as a follow on to patch 70006af955 ("bpf: allow eBPF access skb fields")
this patch allows 'protocol' and 'vlan_tci' fields to be accessible
from extended BPF programs.
The usage of 'protocol', 'vlan_present' and 'vlan_tci' fields is the same as
corresponding SKF_AD_PROTOCOL, SKF_AD_VLAN_TAG_PRESENT and SKF_AD_VLAN_TAG
accesses in classic BPF.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Creating a kernel socket with sock_create_kern() happens in "init_net"
namespace, however, releasing it with sk_release_kernel() occurs in
the current namespace which may be different with "init_net" namespace.
Therefore, we should guarantee that the namespace in which a kernel
socket is created is same as the socket is created.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
reqsk_put() is the generic function that should be used
to release a refcount (and automatically call reqsk_free())
reqsk_free() might be called if refcount is known to be 0
or undefined.
refcnt is set to one in inet_csk_reqsk_queue_add()
As request socks are not yet in global ehash table,
I added temporary debugging checks in reqsk_put() and reqsk_free()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_edemux() is not used in fast path, and should
really call sock_gen_put() to save some code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
introduce user accessible mirror of in-kernel 'struct sk_buff':
struct __sk_buff {
__u32 len;
__u32 pkt_type;
__u32 mark;
__u32 queue_mapping;
};
bpf programs can do:
int bpf_prog(struct __sk_buff *skb)
{
__u32 var = skb->pkt_type;
which will be compiled to bpf assembler as:
dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)
bpf verifier will check validity of access and will convert it to:
dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
dst_reg &= 7
since skb->pkt_type is a bitfield.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the possibility to obtain raw_smp_processor_id() in
eBPF. Currently, this is only possible in classic BPF where commit
da2033c282 ("filter: add SKF_AD_RXHASH and SKF_AD_CPU") has added
facilities for this.
Perhaps most importantly, this would also allow us to track per CPU
statistics with eBPF maps, or to implement a poor-man's per CPU data
structure through eBPF maps.
Example function proto-type looks like:
u32 (*smp_processor_id)(void) = (void *)BPF_FUNC_get_smp_processor_id;
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work is similar to commit 4cd3675ebf ("filter: added BPF
random opcode") and adds a possibility for packet sampling in eBPF.
Currently, this is only possible in classic BPF and useful to
combine sampling with f.e. packet sockets, possible also with tc.
Example function proto-type looks like:
u32 (*prandom_u32)(void) = (void *)BPF_FUNC_get_prandom_u32;
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_edemux() & sock_gen_put() should be ready to cope with request socks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make proto_register() & proto_unregister() a bit nicer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hold_net and release_net were an idea that turned out to be useless.
The code has been disabled since 2008. Kill the code it is long past due.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Test that sk != NULL before reading sk->sk_tsflags.
Fixes: 49ca0d8bfa ("net-timestamp: no-payload option")
Reported-by: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John reported that my previous commit added a regression
on his router.
This is because sender_cpu & napi_id share a common location,
so get_xps_queue() can see garbage and perform an out of bound access.
We need to make sure sender_cpu is cleared before doing the transmit,
otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
this bug.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John <jw@nuclearfallout.net>
Bisected-by: John <jw@nuclearfallout.net>
Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Signed-off-by: David S. Miller <davem@davemloft.net>
A long standing problem in netlink socket dumps is the use
of kernel socket addresses as cookies.
1) It is a security concern.
2) Sockets can be reused quite quickly, so there is
no guarantee a cookie is used once and identify
a flow.
3) request sock, establish sock, and timewait socks
for a given flow have different cookies.
Part of our effort to bring better TCP statistics requires
to switch to a different allocator.
In this patch, I chose to use a per network namespace 64bit generator,
and to use it only in the case a socket needs to be dumped to netlink.
(This might be refined later if needed)
Note that I tried to carry cookies from request sock, to establish sock,
then timewait sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is meant to collapse local and main into one by converting
tb_data from an array to a pointer. Doing this allows us to point the
local table into the main while maintaining the same variables in the
table.
As such the tb_data was converted from an array to a pointer, and a new
array called data is added in order to still provide an object for tb_data
to point to.
In order to track the origin of the fib aliases a tb_id value was added in
a hole that existed on 64b systems. Using this we can also reverse the
merge in the event that custom FIB rules are enabled.
With this patch I am seeing an improvement of 20ns to 30ns for routing
lookups as long as custom rules are not enabled, with custom rules enabled
we fall back to split tables and the original behavior.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If rtnl_newlink() fails on it's call to dev_change_net_namespace(), we
have to make use of the ->dellink() method, if present, just like we
do when rtnl_configure_link() fails.
Fixes: 317f4810e4 ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_diag_check_cookie() second parameter is constant
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper function which allows getting the struct net_device pointer
associated with a given struct device_node pointer. This is useful for
instance for DSA Ethernet devices not backed by a platform_device, but a PCI
device.
Since we need to access net_class which is not accessible outside of
net/core/net-sysfs.c, this helper function is also added here and gated
with CONFIG_OF_NET.
Network devices initialized with SET_NETDEV_DEV() are also taken into
account by checking for dev->parent first and then falling back to
checking the device pointer within struct net_device.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove a little bit of unnecessary work when transmitting a packet with
neigh_packet_xmit. Use the neighbour table index not the address family
as a parameter.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers use copybreak to copy tiny frames into smaller skb,
and this smaller skb might not have skb->head_frag set for various
reasons.
skb_gro_receive() currently doesn't allow to aggregate the smaller skb
into the previous GRO packet if this GRO packet has at least 2 MSS in
it.
Following workload easily demonstrates the problem.
netperf -t TCP_RR -H target -- -r 3000,3000
(tcpdump shows one GRO packet with 2 MSS, plus one additional packet of
104 bytes that should have been appended.)
It turns out that we can remove code from skb_gro_receive(), because
commit 8a29111c7c ("net: gro: allow to build full sized skb") and its
followups removed the assumption that a GRO packet with a frag_list had
to have an empty head.
Removing this code allows the aggregation of the last (incomplete) frame
in some RPC workloads. Note that tcp_gro_receive() already takes care of
forcing a flush if necessary, including this case.
If we want to avoid using frag_list in the first place (in forwarding
workloads for example, as the outgoing NIC is generally not able to cope
with skbs having a frag_list), we need to address this separately.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
For MPLS I am building the code so that either the neighbour mac
address can be specified or we can have a next hop in ipv4 or ipv6.
The kind of next hop we have is indicated by the neighbour table
pointer. A neighbour table pointer of NULL is a link layer address.
A non-NULL neighbour table pointer indicates which neighbour table and
thus which address family the next hop address is in that we need to
look up.
The code either sends a packet directly or looks up the appropriate
neighbour table entry and sends the packet.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at the mpls code I found myself writing yet another
version of neigh_lookup_noref. We currently have __ipv4_lookup_noref
and __ipv6_lookup_noref.
So to make my work a little easier and to make it a smidge easier to
verify/maintain the mpls code in the future I stopped and wrote
___neigh_lookup_noref. Then I rewote __ipv4_lookup_noref and
__ipv6_lookup_noref in terms of this new function. I tested my new
version by verifying that the same code is generated in
ip_finish_output2 and ip6_finish_output2 where these functions are
inlined.
To get to ___neigh_lookup_noref I added a new neighbour cache table
function key_eq. So that the static size of the key would be
available.
I also added __neigh_lookup_noref for people who want to to lookup
a neighbour table entry quickly but don't know which neibhgour table
they are going to look up.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/rocker/rocker.c
The rocker commit was two overlapping changes, one to rename
the ->vport member to ->pport, and another making the bitmask
expression use '1ULL' instead of plain '1'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Having a dst helps a little bit for teql but is fundamentally
unnecessary and there are code paths where a dst is not available that
it would be nice to use the neighbour cache.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add protocol to neigh_tbl so that dst->ops->protocol is not needed
- Acquire the device from neigh->dev
This results in a neigh_hh_init that will cache the samve values
regardless of the packets flowing through it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only caller is now is ax25_neigh_construct so move
neigh_compat_output into ax25_ip.c make it static and rename it
ax25_neigh_output.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both sk_attach_filter() and sk_attach_bpf() are setting up sk_filter,
charging skmem and attaching it to the socket after we got the eBPF
prog up and ready. Lets refactor that into a common helper.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.
Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As part of an effort to move skb->dropcount to skb->cb[], use
a common function in order to set dropcount in struct sk_buff.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
is_gpl_compatible and prog_type should be moved directly into bpf_prog
as they stay immutable during bpf_prog's lifetime, are core attributes
and they can be locked as read-only later on via bpf_prog_select_runtime().
With a bit of rearranging, this also allows us to shrink bpf_prog_aux
to exactly 1 cacheline.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discussed recently and at netconf/netdev01, we want to prevent making
bpf_verifier_ops registration available for modules, but have them at a
controlled place inside the kernel instead.
The reason for this is, that out-of-tree modules can go crazy and define
and register any verfifier ops they want, doing all sorts of crap, even
bypassing available GPLed eBPF helper functions. We don't want to offer
such a shiny playground, of course, but keep strict control to ourselves
inside the core kernel.
This also encourages us to design eBPF user helpers carefully and
generically, so they can be shared among various subsystems using eBPF.
For the eBPF traffic classifier (cls_bpf), it's a good start to share
the same helper facilities as we currently do in eBPF for socket filters.
That way, we have BPF_PROG_TYPE_SCHED_CLS look like it's own type, thus
one day if there's a good reason to diverge the set of helper functions
from the set available to socket filters, we keep ABI compatibility.
In future, we could place all bpf_prog_type_list at a central place,
perhaps.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This gets rid of CONFIG_BPF_SYSCALL ifdefs in the socket filter code,
now that the BPF internal header can deal with it.
While going over it, I also changed eBPF related functions to a sk_filter
prefix to be more consistent with the rest of the file.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can move bpf_map_ops and bpf_verifier_ops and other structs into ro
section, bpf_map_type_list and bpf_prog_type_list into read mostly.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We did a failed attempt in the past to only use rcu in rtnl dump
operations (commit e67f88dd12 "net: dont hold rtnl mutex during
netlink dump callbacks")
Now that dumps are holding RTNL anyway, there is no need to also
use rcu locking, as it forbids any scheduling ability, like
GFP_KERNEL allocations that controlling path should use instead
of GFP_ATOMIC whenever possible.
This should fix following splat Cong Wang reported :
[ INFO: suspicious RCU usage. ]
3.19.0+ #805 Tainted: G W
include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by ip/771:
#0: (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c
#1: (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e
stack backtrace:
CPU: 3 PID: 771 Comm: ip Tainted: G W 3.19.0+ #805
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6
ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd
00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758
Call Trace:
[<ffffffff81a27457>] dump_stack+0x4c/0x65
[<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110
[<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47
[<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb
[<ffffffff8109e67d>] __might_sleep+0x78/0x80
[<ffffffff814b9b1f>] idr_alloc+0x45/0xd1
[<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
[<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101
[<ffffffff817c1383>] alloc_netid+0x61/0x69
[<ffffffff817c14c3>] __peernet2id+0x79/0x8d
[<ffffffff817c1ab7>] peernet2id+0x13/0x1f
[<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20
[<ffffffff810b17d9>] ? __lock_is_held+0x39/0x52
[<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213
[<ffffffff8182b9c2>] netlink_dump+0xef/0x26c
[<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5
[<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59
[<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51
[<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9
[<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d
[<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37
[<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72
[<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7
[<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
[<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58
[<ffffffff811ac556>] ? __fget_light+0x2d/0x52
[<ffffffff817b376f>] __sys_recvmsg+0x42/0x60
[<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 0c7aecd4bd ("netns: add rtnl cmd to add and get peer netns ids")
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When applicable verify that the caller has permisson to the underlying
network namespace for a newly created network device.
Similary checks exist for the network namespace a network device will
be created in.
Fixes: 317f4810e4 ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When applicable verify that the caller has permision to create a
network device in another network namespace. This check is already
present when moving a network device between network namespaces in
setlink so all that is needed is to duplicate that check in newlink.
This change almost backports cleanly, but there are context conflicts
as the code that follows was added in v4.0-rc1
Fixes: b51642f6d7 net: Enable a userns root rtnl calls that are safe for unprivilged users
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arrays (when not in a struct) "shall have a value greater than zero".
GCC complains when it's not the case here.
Fixes: ba7d49b1f0 ("rtnetlink: provide api for getting and setting slave info")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trying to use burst capability (aka xmit_more) on a virtual device
like bonding is not supported.
For example, skb might be queued multiple times on a qdisc, with
various list corruptions.
Fixes: 38b2cf2982 ("net: pktgen: packet bursting via skb->xmit_more")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although it is clear that textsearch state is intentionally passed to
skb_find_text() as uninitialized argument, it was never used by the
callers. Therefore, we can simplify skb_find_text() by making it
local variable.
Signed-off-by: Bojan Prtvar <prtvar.b@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"dev->ethtool_ops" and "ops" are the same, but we should use "ops"
everywhere to be consistent.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
colons are used as a separator in netdev device lookup in dev_ioctl.c
Specific functions are SIOCGIFTXQLEN SIOCETHTOOL SIOCSIFNAME
Signed-off-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit aafb3e98b2 (netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature
flag for switch device offloads) add a new feature without adding it to
netdev_features_strings array; this patch fixes this.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
from hard IRQ context.
Therefore, sock_dequeue_err_skb() needs to block hard irq or
corruptions or hangs can happen.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 364a9e9324 ("sock: deduplicate errqueue dequeue")
Fixes: cb820f8e4b ("net: Provide a generic socket error queue delivery method for Tx time stamps.")
Signed-off-by: David S. Miller <davem@davemloft.net>
The gnet_stats_copy_app() function gets called, more often than not, with its
second argument a pointer to an automatic variable in the caller's stack.
Therefore, to avoid copying garbage afterwards when calling
gnet_stats_finish_copy(), this data is better copied to a dynamically allocated
memory that gets freed after use.
[xiyou.wangcong@gmail.com: remove a useless kfree()]
Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Missing netlink attribute validation in nft_lookup, from Patrick
McHardy.
2) Restrict ipv6 partial checksum handling to UDP, since that's the
only case it works for. From Vlad Yasevich.
3) Clear out silly device table sentinal macros used by SSB and BCMA
drivers. From Joe Perches.
4) Make sure the remote checksum code never creates a situation where
the remote checksum is applied yet the tunneling metadata describing
the remote checksum transformation is still present. Otherwise an
external entity might see this and apply the checksum again. From
Tom Herbert.
5) Use msecs_to_jiffies() where applicable, from Nicholas Mc Guire.
6) Don't explicitly initialize timer struct fields, use setup_timer()
and mod_timer() instead. From Vaishali Thakkar.
7) Don't invoke tg3_halt() without the tp->lock held, from Jun'ichi
Nomura.
8) Missing __percpu annotation in ipvlan driver, from Eric Dumazet.
9) Don't potentially perform skb_get() on shared skbs, also from Eric
Dumazet.
10) Fix COW'ing of metrics for non-DST_HOST routes in ipv6, from Martin
KaFai Lau.
11) Fix merge resolution error between the iov_iter changes in vhost and
some bug fixes that occurred at the same time. From Jason Wang.
12) If rtnl_configure_link() fails we have to perform a call to
->dellink() before unregistering the device. From WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits)
net: dsa: Set valid phy interface type
rtnetlink: call ->dellink on failure when ->newlink exists
com20020-pci: add support for eae single card
vhost_net: fix wrong iter offset when setting number of buffers
net: spelling fixes
net/core: Fix warning while make xmldocs caused by dev.c
net: phy: micrel: disable NAND-tree for KSZ8021, KSZ8031, KSZ8051, KSZ8081
ipv6: fix ipv6_cow_metrics for non DST_HOST case
openvswitch: Fix key serialization.
r8152: restore hw settings
hso: fix rx parsing logic when skb allocation fails
tcp: make sure skb is not shared before using skb_get()
bridge: netfilter: Move sysctl-specific error code inside #ifdef
ipv6: fix possible deadlock in ip6_fl_purge / ip6_fl_gc
ipvlan: add a missing __percpu pcpu_stats
tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one()
bgmac: fix device initialization on Northstar SoCs (condition typo)
qlcnic: Delete existing multicast MAC list before adding new
net/mlx5_core: Fix configuration of log_uar_page_sz
sunvnet: don't change gso data on clones
...
Ignacy reported that when eth0 is down and add a vlan device
on top of it like:
ip link add link eth0 name eth0.1 up type vlan id 1
We will get a refcount leak:
unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2
The problem is when rtnl_configure_link() fails in rtnl_newlink(),
we simply call unregister_device(), but for stacked device like vlan,
we almost do nothing when we unregister the upper device, more work
is done when we unregister the lower device, so call its ->dellink().
Reported-by: Ignacy Gawedzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spelling errors caught by codespell.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix following warning wile make xmldocs.
Warning(.//net/core/dev.c:5345): No description found
for parameter 'bonding_info'
Warning(.//net/core/dev.c:5345): Excess function parameter
'netdev_bonding_info' description in 'netdev_bonding_info_change'
This warning starts to appear after following patch was added
into Linus's tree during merger period.
commit 61bd3857ff
net/core: Add event for a change in slave state
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds infrastructure so that remote checksum offload can
set CHECKSUM_PARTIAL instead of calling csum_partial and writing
the modfied checksum field.
Add skb_remcsum_adjust_partial function to set an skb for using
CHECKSUM_PARTIAL with remote checksum offload. Changed
skb_remcsum_process and skb_gro_remcsum_process to take a boolean
argument to indicate if checksum partial can be set or the
checksum needs to be modified using the normal algorithm.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure root user does not try something stupid.
Also make sure mask field in struct rps_sock_flow_table
does not share a cache line with the potentially often dirtied
flow table.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 567e4b7973 ("net: rfs: add hash collision detection")
Signed-off-by: David S. Miller <davem@davemloft.net>
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hotpluging a cpu might be rare, yet we have to use proper
handlers when taking over packets found in backlog queues.
dev_cpu_callback() runs from process context, thus we should
call netif_rx_ni() to properly invoke softirq handler.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ifla_vf_policy[] is wrong in advertising its individual member types as
NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
len member as *max* attribute length [0, len].
The issue is that when do_setvfinfo() is being called to set up a VF
through ndo handler, we could set corrupted data if the attribute length
is less than the size of the related structure itself.
The intent is exactly the opposite, namely to make sure to pass at least
data of minimum size of len.
Fixes: ebc08a6f47 ("rtnetlink: Add VF config code to rtnetlink")
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes two issues in UDP checksum computation in pktgen.
First, the pseudo-header uses the source and destination IP
addresses. Currently, the ports are used for IPv4.
Second, the UDP checksum covers both header and data. So we need to
generate the data earlier (move pktgen_finalize_skb up), and compute
the checksum for UDP header + data.
Fixes: c26bf4a513 ("pktgen: Add UDPCSUM flag to support UDP checksums")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
flow_cache_flush_task references a structure member flow_cache_gc_work
where it should reference flow_cache_flush_task instead.
Kernel panic occurs on kernels using IPsec during XFRM garbage
collection. The garbage collection interval can be shortened using the
following sysctl settings:
net.ipv4.xfrm4_gc_thresh=4
net.ipv6.xfrm6_gc_thresh=4
With the default settings, our productions servers crash approximately
once a week. With the settings above, they crash immediately.
Fixes: ca925cf153 ("flowcache: Make flow cache name space aware")
Reported-by: Tomáš Charvát <tc@excello.cz>
Tested-by: Jan Hejl <jh@excello.cz>
Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_adjacent_add_links() and netdev_adjacent_del_links()
are static.
queue->qdisc has __rcu annotation, need to use RCU_INIT_POINTER()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add event which provides an indication on a change in the state
of a bonding slave. The event handler should cast the pointer to the
appropriate type (struct netdev_bonding_info) in order to get the
full info about the slave.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races
(from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJU0NEaAAoJEDBSmw7B7bqr2DAP/3nbk6WB2Lz4Vwi8fh9C4y6X
4ZzN2v7NHaimC+Wpxg3wP2wCdX2VG3ZWwF3yLm6qgVHGnE35RFLxURen1eqNeW77
Yf5gvZ066nCGEQ8l8J6YK9vrLX4qp5c4lyE00bbxpZTA4Qq71SgTg+rmGC1be8uX
vacfaLwfDWffuOOnjBAPfanj7f4AQaUEY2uN1WkBFC7iEeOtPcWVkHAFVIeGjJfQ
vfgQJcwOjgWjYbwZdQQi7Aj+k1Yzda4pg1yEWn3CkZ8zyeCYGs1gk2ovoBgPgXBP
yT9ypIdFsN242VJvy7nkFnCKA8mhKyltMQ1Xjs0Q9lAxWdaq9U+iqt5cUn4jxVIG
T9Vi3PCbx/nVOqcfR81dBTQ3uDU5AyosPsmh2YTxi5lpRBrsjNY2FtaKE0sm2Om3
wRiSPOdPrXBeEnU0KssI9e6euXgS4JQV78Naq85OWZDd2yZ1fT5U2fi8y4drRGlz
rucbbobdVhQch5L4FStPz1uW5pNuJrhekXeZIE8MruTNg2A2oBAK3ApO7hxn68sE
RnbAnkxVLwgedC9042JF5eiS1PDIU46w4e782j/+XskVKMEakqd23iJycx3tmgHZ
cxDi/qKZ2RCE74YsT61o/9ErSVPvCfNPL3+CVov918jidQQg2WHfKm/jFlIxHIxk
4wBP7p2VGlENMgw/R8GJ
=wOaR
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Last round of updates for net-next:
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races
(from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
Signed-off-by: David S. Miller <davem@davemloft.net>
A typical qdisc setup is the following :
bond0 : bonding device, using HTB hierarchy
eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc
XPS allows to spread packets on specific tx queues, based on the cpu
doing the send.
Problem is that dequeues from bond0 qdisc can happen on random cpus,
due to the fact that qdisc_run() can dequeue a batch of packets.
CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0
CPUB -> dequeue packet P1 from bond0
enqueue packet on eth1/eth2
CPUC -> dequeue packet P2 from bond0
enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)
get_xps_queue() then might select wrong queue for P1, since current cpu
might be different than CPUA.
P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)
Effect of this bug is TCP reorders, and more generally not optimal
TX queue placement. (A victim bulk flow can be migrated to the wrong TX
queue for a while)
To fix this, we have to record sender cpu number the first time
dev_queue_xmit() is called for one tx skb.
We can union napi_id (used on receive path) and sender_cpu,
granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
this union idea)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
options SOF_TIMESTAMPING_OPT_TSONLY is set.
Add a sysctl that optionally drops looped timestamp with data. This
only affects processes without CAP_NET_RAW.
The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.
No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes
(v1 -> v2)
- test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
(rfc -> v1)
- document the sysctl in Documentation/sysctl/net.txt
- fix access control race: read .._OPT_TSONLY only once,
use same value for permission check and skb generation.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
timestamps, this loops timestamps on top of empty packets.
Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
cmsg reception (aside from timestamps) are no longer possible. This
works together with a follow on patch that allows administrators to
only allow tx timestamping if it does not loop payload or metadata.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes (rfc -> v1)
- add documentation
- remove unnecessary skb->len test (thanks to Richard Cochran)
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge flags are needed inside ndo_bridge_setlink/dellink handlers to
avoid another call to parse IFLA_AF_SPEC inside these handlers
This is used later in this series
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vlan_get_protocol() could not get network protocol if a skb has a 802.1ad
vlan tag or multiple vlans, which caused incorrect checksum calculation
in several drivers.
Fix vlan_get_protocol() to retrieve network protocol instead of incorrect
vlan protocol.
As the logic is the same as skb_network_protocol(), create a common helper
function __vlan_get_protocol() and call it from existing functions.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
When many pf_packet listeners are created on a lot of interfaces the
current implementation using global packet type lists scales poorly.
This patch adds per net_device packet type lists to fix this problem.
The patch was originally written by Eric Biederman for linux-2.6.29.
Tested on linux-3.16.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IFLA_LINK_NETNSID is used, the netdevice should be built in this link netns
and moved at the end to another netns (pointed by the socket netns or
IFLA_NET_NS_[PID|FD]).
Existing user of the newlink handler will use the netns argument (src_net) to
find a link netdevice or to check some other information into the link netns.
For example, to find a netdevice, two information are required: an ifindex
(usually from IFLA_LINK) and a netns (this link netns).
Note: when using IFLA_LINK_NETNSID and IFLA_NET_NS_[PID|FD], a user may create a
netdevice that stands in netnsX and with its link part in netnsY, by sending a
rtnl message from netnsZ.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported in: https://bugzilla.kernel.org/show_bug.cgi?id=92081
This patch avoids calling rtnl_notify if the device ndo_bridge_getlink
handler does not return any bytes in the skb.
Alternately, the skb->len check can be moved inside rtnl_notify.
For the bridge vlan case described in 92081, there is also a fix needed
in bridge driver to generate a proper notification. Will fix that in
subsequent patch.
v2: rebase patch on net tree
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/arm/boot/dts/imx6sx-sdb.dts
net/sched/cls_bpf.c
Two simple sets of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The flows are hashed on the sending node address, which allows us
to spread out the TIPC link processing to RPS enabled cores. There
is no point to include the destination address in the hash as that
will always be the same for all inbound links. We have experimented
with a 3-tuple hash over [srcnode, sport, dport], but this showed to
give slightly lower performance because of increased lock contention
when the same link was handled by multiple cores.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netlink FDB messages are sent in the link netns. The header of these messages
contains the ifindex (ndm_ifindex) of the netdevice, but this ifindex is
unusable in case of x-netns vxlan.
I named the new attribute NDA_NDM_IFINDEX_NETNSID, to avoid confusion with
NDA_IFINDEX.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an error occurs when the netdevice is moved to the link netns, a full cleanup
must be done.
Fixes: 317f4810e4 ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added new NL80211_ATTR_NETNS_FD which allows to
set namespace via nl80211 by fd.
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds the ability to create a netdevice in a specified netns and
then move it into the final netns. In fact, it allows to have a symetry between
get and set rtnl messages.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new attribute (IFLA_LINK_NETNSID) which contains the 'link'
netns id when this netns is different from the netns where the interface
stands (for example for x-net interfaces like ip tunnels).
With this attribute, it's possible to interpret correctly all advertised
information (like IFLA_LINK, etc.).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, a user can define an id for a peer netns by providing a FD or a
PID. These ids are local to the netns where it is added (ie valid only into this
netns).
The main function (ie the one exported to other module), peernet2id(), allows to
get the id of a peer netns. If no id has been assigned by the user, this
function allocates one.
These ids will be used in netlink messages to point to a peer netns, for example
in case of a x-netns interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 02dba4388d ("bridge: fix setlink/dellink notifications") removed usage of oflags in
both rtnl_bridge_setlink() and rtnl_bridge_dellink() methods. This patch removes this variable as it is no
longer needed.
Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 053c095a82 ("netlink: make nlmsg_end() and genlmsg_end()
void") didn't catch all of the cases where callers were breaking out
on the return value being equal to zero, which they no longer should
when zero means success.
Fix all such cases.
Reported-by: Marcel Holtmann <marcel@holtmann.org>
Reported-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.
This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to write
if (my_function(...))
/* error condition */
and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.
Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did
- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;
I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.
One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
problems with bridge getlink/setlink notifications today:
- bridge setlink generates two notifications to userspace
- one from the bridge driver
- one from rtnetlink.c (rtnl_bridge_notify)
- dellink generates one notification from rtnetlink.c. Which
means bridge setlink and dellink notifications are not
consistent
- Looking at the code it appears,
If both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF were set,
the size calculation in rtnl_bridge_notify can be wrong.
Example: if you set both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF
in a setlink request to rocker dev, rtnl_bridge_notify will
allocate skb for one set of bridge attributes, but,
both the bridge driver and rocker dev will try to add
attributes resulting in twice the number of attributes
being added to the skb. (rocker dev calls ndo_dflt_bridge_getlink)
There are multiple options:
1) Generate one notification including all attributes from master and self:
But, I don't think it will work, because both master and self may use
the same attributes/policy. Cannot pack the same set of attributes in a
single notification from both master and slave (duplicate attributes).
2) Generate one notification from master and the other notification from
self (This seems to be ideal):
For master: the master driver will send notification (bridge in this
example)
For self: the self driver will send notification (rocker in the above
example. It can use helpers from rtnetlink.c to do so. Like the
ndo_dflt_bridge_getlink api).
This patch implements 2) (leaving the 'rtnl_bridge_notify' around to be used
with 'self').
v1->v2 :
- rtnl_bridge_notify is now called only for self,
so, remove 'BRIDGE_FLAGS_SELF' check and cleanup a few things
- rtnl_bridge_dellink used to always send a RTM_NEWLINK msg
earlier. So, I have changed the notification from br_dellink to
go as RTM_NEWLINK
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
softnet_data.input_pkt_queue is protected by a spinlock that
we must hold when transferring packets from victim queue to an active
one. This is because other cpus could still be trying to enqueue packets
into victim queue.
A second problem is that when we transfert the NAPI poll_list from
victim to current cpu, we absolutely need to special case the percpu
backlog, because we do not want to add complex locking to protect
process_queue : Only owner cpu is allowed to manipulate it, unless cpu
is offline.
Based on initial patch from Prasad Sodagudi & Subash Abhinov
Kasiviswanathan.
This version is better because we do not slow down packet processing,
only make migration safer.
Reported-by: Prasad Sodagudi <psodagud@codeaurora.org>
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting base_reachable_time or base_reachable_time_ms on a
specific interface through sysctl or netlink, the reachable_time
value is not updated.
This means that neighbour entries will continue to be updated using the
old value until it is recomputed in neigh_period_work (which
recomputes the value every 300*HZ).
On systems with HZ equal to 1000 for instance, it means 5mins before
the change is effective.
This patch changes this behavior by recomputing reachable_time after
each set on base_reachable_time or base_reachable_time_ms.
The new value will become effective the next time the neighbour's timer
is triggered.
Changes are made in two places: the netlink code for set and the sysctl
handling code. For sysctl, I use a proc_handler. The ipv6 network
code does provide its own handler but it already refreshes
reachable_time correctly so it's not an issue.
Any other user of neighbour which provide its own handlers must
refresh reachable_time.
Signed-off-by: Jean-Francois Remy <jeff@melix.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same macros are used for rx as well. So rename it.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_alloc_rx_queues() uses kcalloc() to allocate memory
for "struct netdev_queue *_rx" array.
If we are doing large rx queue allocation kcalloc() might
fail, so this patch does a fallback to vzalloc().
Similar implementation is done for tx queue allocation in
netif_alloc_netdev_queues().
We avoid failure of high order memory allocation
with the help of vzalloc(), this allows us to do large
rx and tx queue allocation which in turn helps us to
increase the number of queues in tun.
As vmalloc() adds overhead on a critical network path,
__GFP_REPEAT flag is used with kzalloc() to do this fallback
only when really needed.
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Gibson <dgibson@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends the ethtool plugin module eeprom API to support cards
whose phy support is delegated to a separate driver.
The handlers for ETHTOOL_GMODULEINFO and ETHTOOL_GMODULEEEPROM call the
module_info and module_eeprom functions if the phy driver provides them;
otherwise the handlers call the equivalent ethtool_ops functions provided
by network drivers with built-in phy support.
Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
control metric to be set up and dumped back to user space.
While the internal representation of RTAX_CC_ALGO is handled as a u32
key, we avoided to expose this implementation detail to user space, thus
instead, we chose the netlink attribute that is being exchanged between
user space to be the actual congestion control algorithm name, similarly
as in the setsockopt(2) API in order to allow for maximum flexibility,
even for 3rd party modules.
It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
it should have been stored in RTAX_FEATURES instead, we first thought
about reusing it for the congestion control key, but it brings more
complications and/or confusion than worth it.
Joint work with Florian Westphal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add checking whether the call to ndo_dflt_fdb_dump is needed.
It is not expected to call ndo_dflt_fdb_dump unconditionally
by some drivers (i.e. qlcnic or macvlan) that defines
own ndo_fdb_dump. Other drivers define own ndo_fdb_dump
and don't want ndo_dflt_fdb_dump to be called at all.
At the same time it is desirable to call the default dump
function on a bridge device.
Fix attributes that are passed to dev->netdev_ops->ndo_fdb_dump.
Add extra checking in br_fdb_dump to avoid duplicate entries
as now filter_dev can be NULL.
Following tests for filtering have been performed before
the change and after the patch was applied to make sure
they are the same and it doesn't break the filtering algorithm.
[root@localhost ~]# cd /root/iproute2-3.18.0/bridge
[root@localhost bridge]# modprobe dummy
[root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0
[root@localhost bridge]# brctl addbr br0
[root@localhost bridge]# brctl addif br0 dummy0
[root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04
[root@localhost bridge]# # show all
[root@localhost bridge]# ./bridge fdb show
33:33:00:00:00:01 dev p2p1 self permanent
01:00:5e:00:00:01 dev p2p1 self permanent
33:33:ff:ac:ce:32 dev p2p1 self permanent
33:33:00:00:02:02 dev p2p1 self permanent
01:00:5e:00:00:fb dev p2p1 self permanent
33:33:00:00:00:01 dev p7p1 self permanent
01:00:5e:00:00:01 dev p7p1 self permanent
33:33:ff:79:50:53 dev p7p1 self permanent
33:33:00:00:02:02 dev p7p1 self permanent
01:00:5e:00:00:fb dev p7p1 self permanent
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by bridge
[root@localhost bridge]# ./bridge fdb show br br0
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by port
[root@localhost bridge]# ./bridge fdb show brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]# # filter by port + bridge
[root@localhost bridge]# ./bridge fdb show br br0 brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]#
Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not needed, only four cases:
- kfree_skb (or one of its aliases).
Don't need to zero, memory will be freed.
- kfree_skb_partial and head was stolen: memory will be freed.
- skb_morph: The skb header fields (including tc ones) will be
copied over from the 'to-be-morphed' skb right after
skb_release_head_state returns.
- skb_segment: Same as before, all the skb header
fields are copied over from the original skb right away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
GSO isn't the only offload feature with restrictions that
potentially can't be expressed with the current features mechanism.
Checksum is another although it's a general issue that could in
theory apply to anything. Even if it may be possible to
implement these restrictions in other ways, it can result in
duplicate code or inefficient per-packet behavior.
This generalizes ndo_gso_check so that drivers can remove any
features that don't make sense for a given packet, similar to
netif_skb_features(). It also converts existing driver
restrictions to the new format, completing the work that was
done to support tunnel protocols since the issues apply to
checksums as well.
By actually removing features from the set that are used to do
offloading, it solves another problem with the existing
interface. In these cases, GSO would run with the original set
of features and not do anything because it appears that
segmentation is not required.
CC: Tom Herbert <therbert@google.com>
CC: Joe Stringer <joestringer@nicira.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Tom Herbert <therbert@google.com>
Fixes: 04ffcb255f ("net: Add ndo_gso_check")
Tested-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using VXLAN tunnels and a sky2 device, I have experienced
checksum failures of the following type:
[ 4297.761899] eth0: hw csum failure
[...]
[ 4297.765223] Call Trace:
[ 4297.765224] <IRQ> [<ffffffff8172f026>] dump_stack+0x46/0x58
[ 4297.765235] [<ffffffff8162ba52>] netdev_rx_csum_fault+0x42/0x50
[ 4297.765238] [<ffffffff8161c1a0>] ? skb_push+0x40/0x40
[ 4297.765240] [<ffffffff8162325c>] __skb_checksum_complete+0xbc/0xd0
[ 4297.765243] [<ffffffff8168c602>] tcp_v4_rcv+0x2e2/0x950
[ 4297.765246] [<ffffffff81666ca0>] ? ip_rcv_finish+0x360/0x360
These are reliably reproduced in a network topology of:
container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch
When VXLAN encapsulated traffic is received from a similarly
configured peer, the above warning is generated in the receive
processing of the encapsulated packet. Note that the warning is
associated with the container eth0.
The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
because the packet is an encapsulated Ethernet frame, the checksum
generated by the hardware includes the inner protocol and Ethernet
headers.
The receive code is careful to update the skb->csum, except in
__dev_forward_skb, as called by dev_forward_skb. __dev_forward_skb
calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
to skip over the Ethernet header, but does not update skb->csum when
doing so.
This patch resolves the problem by adding a call to
skb_postpull_rcsum to update the skb->csum after the call to
eth_type_trans.
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_scrub_packet() is called when a packet switches between a context
such as between underlay and overlay, between namespaces, or between
L3 subnets.
While we already scrub the packet mark, connection tracking entry,
and cached destination, the security mark/context is left intact.
It seems wrong to inherit the security context of a packet when going
from overlay to underlay or across forwarding paths.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When vlan tags are stacked, it is very likely that the outer tag is stored
in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
Currently netif_skb_features() first looks at skb->protocol even if there
is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
encapsulated by the inner vlan instead of the inner vlan protocol.
This allows GSO packets to be passed to HW and they end up being
corrupted.
Fixes: 58e998c6d2 ("offloading: Force software GSO for multiple vlan tags.")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes MPLS GSO for case when mpls is compiled as kernel module.
Fixes: 0d89d2035f ("MPLS: Add limited GSO support").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch rearranges the loop in net_rx_action to reduce the
amount of jumping back and forth when reading the code.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should only perform the softnet_break check after we have polled
at least one device in net_rx_action. Otherwise a zero or negative
setting of netdev_budget can lock up the whole system.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit d75b1ade56 (net: less
interrupt masking in NAPI) required drivers to leave poll_list
empty if the entire budget is consumed.
We have already had two broken drivers so let's add a check for
this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a new function napi_poll and moves the napi
polling code from net_rx_action into it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>