Commit graph

637 commits

Author SHA1 Message Date
Jeremy Cline c8e8cd579b net: socket: fix potential spectre v1 gadget in socketcall
'call' is a user-controlled value, so sanitize the array index after the
bounds check to avoid speculating past the bounds of the 'nargs' array.

Found with the help of Smatch:

net/socket.c:2508 __do_sys_socketcall() warn: potential spectre issue
'nargs' [r] (local cap)

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-28 22:43:30 -07:00
Al Viro d93aa9d82a new wrapper: alloc_file_pseudo()
takes inode, vfsmount, name, O_... flags and file_operations and
either returns a new struct file (in which case inode reference we
held is consumed) or returns ERR_PTR(), in which case no refcounts
are altered.

converted aio_private_file() and sock_alloc_file() to it

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:23 -04:00
Al Viro c9c554f214 alloc_file(): switch to passing O_... flags instead of FMODE_... mode
... so that it could set both ->f_flags and ->f_mode, without callers
having to set ->f_flags manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:02:57 -04:00
Christoph Hellwig e88958e636 net: handle NULL ->poll gracefully
The big aio poll revert broke various network protocols that don't
implement ->poll as a patch in the aio poll serie removed sock_no_poll
and made the common code handle this case.

Reported-by: syzbot+57727883dbad76db2ef0@syzkaller.appspotmail.com
Reported-by: syzbot+cdb0d3176b53d35ad454@syzkaller.appspotmail.com
Reported-by: syzbot+2c7e8f74f8b2571c87e8@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: a11e1d432b ("Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-29 06:51:51 -07:00
Linus Torvalds a11e1d432b Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL
The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 10:40:47 -07:00
Cong Wang 6d8c50dcb0 socket: close race condition between sock_close() and sockfs_setattr()
fchownat() doesn't even hold refcnt of fd until it figures out
fd is really needed (otherwise is ignored) and releases it after
it resolves the path. This means sock_close() could race with
sockfs_setattr(), which leads to a NULL pointer dereference
since typically we set sock->sk to NULL in ->release().

As pointed out by Al, this is unique to sockfs. So we can fix this
in socket layer by acquiring inode_lock in sock_close() and
checking against NULL in sockfs_setattr().

sock_release() is called in many places, only the sock_close()
path matters here. And fortunately, this should not affect normal
sock_close() as it is only called when the last fd refcnt is gone.
It only affects sock_close() with a parallel sockfs_setattr() in
progress, which is not common.

Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Reported-by: shankarapailoor <shankarapailoor@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-10 12:25:53 -07:00
Linus Torvalds 10b1eb7d8c Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security system updates from James Morris:

 - incorporate new socketpair() hook into LSM and wire up the SELinux
   and Smack modules. From David Herrmann:

     "The idea is to allow SO_PEERSEC to be called on AF_UNIX sockets
      created via socketpair(2), and return the same information as if
      you emulated socketpair(2) via a temporary listener socket.

      Right now SO_PEERSEC will return the unlabeled credentials for a
      socketpair, rather than the actual credentials of the creating
      process."

 - remove the unused security_settime LSM hook (Sargun Dhillon).

 - remove some stack allocated arrays from the keys code (Tycho
   Andersen)

* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  dh key: get rid of stack allocated array for zeroes
  dh key: get rid of stack allocated array
  big key: get rid of stack array allocation
  smack: provide socketpair callback
  selinux: provide socketpair callback
  net: hook socketpair() into LSM
  security: add hook for socketpair()
  security: remove security_settime
2018-06-06 16:15:56 -07:00
Christoph Hellwig 1525242310 net: add support for ->poll_mask in proto_ops
The socket file operations still implement ->poll until all protocols are
switched over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
Christoph Hellwig 3cafb37633 net: refactor socket_poll
Factor out two busy poll related helpers for late reuse, and remove
a command that isn't very helpful, especially with the __poll_t
annotations in place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-26 09:16:44 +02:00
David Herrmann d47cd9450d net: hook socketpair() into LSM
Use the newly created LSM-hook for socketpair(). The default hook
return-value is 0, so behavior stays the same unless LSMs start using
this hook.

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: James Morris <james.morris@microsoft.com>
2018-05-04 12:48:54 -07:00
Linus Torvalds 672a9c1069 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  kfifo: fix inaccurate comment
  tools/thermal: tmon: fix for segfault
  net: Spelling s/stucture/structure/
  edd: don't spam log if no EDD information is present
  Documentation: Fix early-microcode.txt references after file rename
  tracing: Block comments should align the * on each line
  treewide: Fix typos in printk
  GenWQE: Fix a typo in two comments
  treewide: Align function definition open/close braces
2018-04-05 11:56:35 -07:00
Linus Torvalds 5bb053bef8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Support offloading wireless authentication to userspace via
    NL80211_CMD_EXTERNAL_AUTH, from Srinivas Dasari.

 2) A lot of work on network namespace setup/teardown from Kirill Tkhai.
    Setup and cleanup of namespaces now all run asynchronously and thus
    performance is significantly increased.

 3) Add rx/tx timestamping support to mv88e6xxx driver, from Brandon
    Streiff.

 4) Support zerocopy on RDS sockets, from Sowmini Varadhan.

 5) Use denser instruction encoding in x86 eBPF JIT, from Daniel
    Borkmann.

 6) Support hw offload of vlan filtering in mvpp2 dreiver, from Maxime
    Chevallier.

 7) Support grafting of child qdiscs in mlxsw driver, from Nogah
    Frankel.

 8) Add packet forwarding tests to selftests, from Ido Schimmel.

 9) Deal with sub-optimal GSO packets better in BBR congestion control,
    from Eric Dumazet.

10) Support 5-tuple hashing in ipv6 multipath routing, from David Ahern.

11) Add path MTU tests to selftests, from Stefano Brivio.

12) Various bits of IPSEC offloading support for mlx5, from Aviad
    Yehezkel, Yossi Kuperman, and Saeed Mahameed.

13) Support RSS spreading on ntuple filters in SFC driver, from Edward
    Cree.

14) Lots of sockmap work from John Fastabend. Applications can use eBPF
    to filter sendmsg and sendpage operations.

15) In-kernel receive TLS support, from Dave Watson.

16) Add XDP support to ixgbevf, this is significant because it should
    allow optimized XDP usage in various cloud environments. From Tony
    Nguyen.

17) Add new Intel E800 series "ice" ethernet driver, from Anirudh
    Venkataramanan et al.

18) IP fragmentation match offload support in nfp driver, from Pieter
    Jansen van Vuuren.

19) Support XDP redirect in i40e driver, from Björn Töpel.

20) Add BPF_RAW_TRACEPOINT program type for accessing the arguments of
    tracepoints in their raw form, from Alexei Starovoitov.

21) Lots of striding RQ improvements to mlx5 driver with many
    performance improvements, from Tariq Toukan.

22) Use rhashtable for inet frag reassembly, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1678 commits)
  net: mvneta: improve suspend/resume
  net: mvneta: split rxq/txq init and txq deinit into SW and HW parts
  ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh
  net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
  net: bgmac: Correctly annotate register space
  route: check sysctl_fib_multipath_use_neigh earlier than hash
  fix typo in command value in drivers/net/phy/mdio-bitbang.
  sky2: Increase D3 delay to sky2 stops working after suspend
  net/mlx5e: Set EQE based as default TX interrupt moderation mode
  ibmvnic: Disable irqs before exiting reset from closed state
  net: sched: do not emit messages while holding spinlock
  vlan: also check phy_driver ts_info for vlan's real device
  Bluetooth: Mark expected switch fall-throughs
  Bluetooth: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for BTUSB_QCA_ROME
  Bluetooth: btrsi: remove unused including <linux/version.h>
  Bluetooth: hci_bcm: Remove DMI quirk for the MINIX Z83-4
  sh_eth: kill useless check in __sh_eth_get_regs()
  sh_eth: add sh_eth_cpu_data::no_xdfar flag
  ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()
  ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()
  ...
2018-04-03 14:04:18 -07:00
Linus Torvalds 642e7fd233 Merge branch 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
 "System calls are interaction points between userspace and the kernel.
  Therefore, system call functions such as sys_xyzzy() or
  compat_sys_xyzzy() should only be called from userspace via the
  syscall table, but not from elsewhere in the kernel.

  At least on 64-bit x86, it will likely be a hard requirement from
  v4.17 onwards to not call system call functions in the kernel: It is
  better to use use a different calling convention for system calls
  there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
  which then hands processing over to the actual syscall function. This
  means that only those parameters which are actually needed for a
  specific syscall are passed on during syscall entry, instead of
  filling in six CPU registers with random user space content all the
  time (which may cause serious trouble down the call chain). Those
  x86-specific patches will be pushed through the x86 tree in the near
  future.

  Moreover, rules on how data may be accessed may differ between kernel
  data and user data. This is another reason why calling sys_xyzzy() is
  generally a bad idea, and -- at most -- acceptable in arch-specific
  code.

  This patchset removes all in-kernel calls to syscall functions in the
  kernel with the exception of arch/. On top of this, it cleans up the
  three places where many syscalls are referenced or prototyped, namely
  kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"

* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
  bpf: whitelist all syscalls for error injection
  kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
  kernel/sys_ni: sort cond_syscall() entries
  syscalls/x86: auto-create compat_sys_*() prototypes
  syscalls: sort syscall prototypes in include/linux/compat.h
  net: remove compat_sys_*() prototypes from net/compat.h
  syscalls: sort syscall prototypes in include/linux/syscalls.h
  kexec: move sys_kexec_load() prototype to syscalls.h
  x86/sigreturn: use SYSCALL_DEFINE0
  x86: fix sys_sigreturn() return type to be long, not unsigned long
  x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
  mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
  mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
  mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
  fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
  fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
  fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
  fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
  kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
  kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
  ...
2018-04-02 21:22:12 -07:00
Dominik Brodowski d27e9afc64 net: socket: replace call to sys_recv() with __sys_recvfrom()
sys_recv() merely expands the parameters to __sys_recvfrom() by NULL and
NULL. Open-code this in the two places which used sys_recv() as a wrapper
to __sys_recvfrom().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:16 +02:00
Dominik Brodowski f3bf896b1d net: socket: replace calls to sys_send() with __sys_sendto()
sys_send() merely expands the parameters to __sys_sendto() by NULL and 0.
Open-code this in the two places which used sys_send() as a wrapper to
__sys_sendto().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:16 +02:00
Dominik Brodowski e1834a329d net: socket: move check for forbid_cmsg_compat to __sys_...msg()
The non-compat codepaths for sys_...msg() verify that MSG_CMSG_COMPAT
is not set. By moving this check to the __sys_...msg() functions
(and making it dependent on a static flag passed to this function), we
can call the __sys...msg() functions instead of the syscall functions
in all cases. __sys_recvmmsg() does not need this trickery, as the
check is handled within the do_sys_recvmmsg() function internal to
net/socket.c.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:15 +02:00
Dominik Brodowski 1255e26906 net: socket: add do_sys_recvmmsg() helper; remove in-kernel call to syscall
Using the net-internal helper do_sys_recvmmsg() allows us to avoid the
internal calls to the sys_getsockopt() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:14 +02:00
Dominik Brodowski 13a2d70e2b net: socket: add __sys_getsockopt() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_getsockopt() allows us to avoid the
internal calls to the sys_getsockopt() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:13 +02:00
Dominik Brodowski cc36dca0df net: socket: add __sys_setsockopt() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_setsockopt() allows us to avoid the
internal calls to the sys_setsockopt() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:13 +02:00
Dominik Brodowski 005a1aeac4 net: socket: add __sys_shutdown() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_shutdown() allows us to avoid the
internal calls to the sys_shutdown() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:12 +02:00
Dominik Brodowski 6debc8d834 net: socket: add __sys_socketpair() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_socketpair() allows us to avoid the
internal calls to the sys_socketpair() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:11 +02:00
Dominik Brodowski b21c8f838a net: socket: add __sys_getpeername() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_getpeername() allows us to avoid the
internal calls to the sys_getpeername() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:10 +02:00
Dominik Brodowski 8882a107b3 net: socket: add __sys_getsockname() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_getsockname() allows us to avoid the
internal calls to the sys_getsockname() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:09 +02:00
Dominik Brodowski 25e290eed9 net: socket: add __sys_listen() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_listen() allows us to avoid the
internal calls to the sys_listen() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:09 +02:00
Dominik Brodowski 1387c2c2f9 net: socket: add __sys_connect() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_connect() allows us to avoid the
internal calls to the sys_connect() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:08 +02:00
Dominik Brodowski a87d35d87a net: socket: add __sys_bind() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_bind() allows us to avoid the
internal calls to the sys_bind() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:07 +02:00
Dominik Brodowski 9d6a15c3f2 net: socket: add __sys_socket() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_socket() allows us to avoid the
internal calls to the sys_socket() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:06 +02:00
Dominik Brodowski 4541e80560 net: socket: add __sys_accept4() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_accept4() allows us to avoid the
internal calls to the sys_accept4() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:06 +02:00
Dominik Brodowski 211b634b7f net: socket: add __sys_sendto() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_sendto() allows us to avoid the
internal calls to the sys_sendto() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:05 +02:00
Dominik Brodowski 7a09e1eb9c net: socket: add __sys_recvfrom() helper; remove in-kernel call to syscall
Using the net-internal helper __sys_recvfrom() allows us to avoid the
internal calls to the sys_recvfrom() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:04 +02:00
Geert Uytterhoeven b903036aad net: Spelling s/stucture/structure/
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-03-27 09:51:23 +02:00
David S. Miller 03fe2debbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fun set of conflict resolutions here...

For the mac80211 stuff, these were fortunately just parallel
adds.  Trivially resolved.

In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.

In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.

The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.

The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:

====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch.  This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
            drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
            (IB/mlx5: Fix cleanup order on unload) added to for-rc and
            commit b5ca15ad7e (IB/mlx5: Add proper representors support)
            add as part of the devel cycle both needed to modify the
            init/de-init functions used by mlx5.  To support the new
            representors, the new functions added by the cleanup patch
            needed to be made non-static, and the init/de-init list
            added by the representors patch needed to be modified to
            match the init/de-init list changes made by the cleanup
            patch.
    Updates:
            drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
            prototypes added by representors patch to reflect new function
            names as changed by cleanup patch
            drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
            stage list to match new order from cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 11:31:58 -04:00
Xin Long bf2ae2e4bf sock_diag: request _diag module only when the family or proto has been registered
Now when using 'ss' in iproute, kernel would try to load all _diag
modules, which also causes corresponding family and proto modules
to be loaded as well due to module dependencies.

Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
would be loaded.

For example:

  $ lsmod|grep sctp
  $ ss
  $ lsmod|grep sctp
  sctp_diag              16384  0
  sctp                  323584  5 sctp_diag
  inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
  libcrc32c              16384  3 nf_conntrack,nf_nat,sctp

As these family and proto modules are loaded unintentionally, it
could cause some problems, like:

- Some debug tools use 'ss' to collect the socket info, which loads all
  those diag and family and protocol modules. It's noisy for identifying
  issues.

- Users usually expect to drop sctp init packet silently when they
  have no sense of sctp protocol instead of sending abort back.

- It wastes resources (especially with multiple netns), and SCTP module
  can't be unloaded once it's loaded.

...

In short, it's really inappropriate to have these family and proto
modules loaded unexpectedly when just doing debugging with inet_diag.

This patch is to introduce sock_load_diag_module() where it loads
the _diag module only when it's corresponding family or proto has
been already registered.

Note that we can't just load _diag module without the family or
proto loaded, as some symbols used in _diag module are from the
family or proto module.

v1->v2:
  - move inet proto check to inet_diag to avoid a compiling err.
v2->v3:
  - define sock_load_diag_module in sock.c and export one symbol
    only.
  - improve the changelog.

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12 11:03:42 -04:00
Soheil Hassas Yeganeh 7797dc4141 socket: skip checking sk_err for recvmmsg(MSG_ERRQUEUE)
recvmmsg does not call ___sys_recvmsg when sk_err is set.
That is fine for normal reads but, for MSG_ERRQUEUE, recvmmsg
should always call ___sys_recvmsg regardless of sk->sk_err to
be able to clear error queue. Otherwise, users are not able to
drain the error queue using recvmmsg.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 21:27:49 -05:00
Alexey Dobriyan 08009a7602 net: make kmem caches as __ro_after_init
All kmem caches aren't reallocated once set up.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26 15:11:48 -05:00
Kirill Tkhai d8d211a2a0 net: Make extern and export get_net_ns()
This function will be used to obtain net of tun device.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-15 15:34:42 -05:00
David Ahern e92bad5034 net: Remove atalk header from socket.c
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 11:55:33 -05:00
Denys Vlasenko 9b2c45d479 net: make getname() functions return length rather than use int* parameter
Changes since v1:
Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data     bss      dec     hex filename
30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
30108109 2633612  873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-12 14:15:04 -05:00
Linus Torvalds b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Linus Torvalds 168fe32a07 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro:
 "This introduces a __bitwise type for POLL### bitmap, and propagates
  the annotations through the tree. Most of that stuff is as simple as
  'make ->poll() instances return __poll_t and do the same to local
  variables used to hold the future return value'.

  Some of the obvious brainos found in process are fixed (e.g. POLLIN
  misspelled as POLL_IN). At that point the amount of sparse warnings is
  low and most of them are for genuine bugs - e.g. ->poll() instance
  deciding to return -EINVAL instead of a bitmap. I hadn't touched those
  in this series - it's large enough as it is.

  Another problem it has caught was eventpoll() ABI mess; select.c and
  eventpoll.c assumed that corresponding POLL### and EPOLL### were
  equal. That's true for some, but not all of them - EPOLL### are
  arch-independent, but POLL### are not.

  The last commit in this series separates userland POLL### values from
  the (now arch-independent) kernel-side ones, converting between them
  in the few places where they are copied to/from userland. AFAICS, this
  is the least disruptive fix preserving poll(2) ABI and making epoll()
  work on all architectures.

  As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
  it will trigger only on what would've triggered EPOLLWRBAND on other
  architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
  at all on sparc. With this patch they should work consistently on all
  architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  make kernel-side POLL... arch-independent
  eventpoll: no need to mask the result of epi_item_poll() again
  eventpoll: constify struct epoll_event pointers
  debugging printk in sg_poll() uses %x to print POLL... bitmap
  annotate poll(2) guts
  9p: untangle ->poll() mess
  ->si_band gets POLL... bitmap stored into a user-visible long field
  ring_buffer_poll_wait() return value used as return value of ->poll()
  the rest of drivers/*: annotate ->poll() instances
  media: annotate ->poll() instances
  fs: annotate ->poll() instances
  ipc, kernel, mm: annotate ->poll() instances
  net: annotate ->poll() instances
  apparmor: annotate ->poll() instances
  tomoyo: annotate ->poll() instances
  sound: annotate ->poll() instances
  acpi: annotate ->poll() instances
  crypto: annotate ->poll() instances
  block: annotate ->poll() instances
  x86: annotate ->poll() instances
  ...
2018-01-30 17:58:07 -08:00
Al Viro 5c59e564e4 kill kernel_sock_ioctl()
no users since 2014

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24 19:13:45 -05:00
Al Viro 44c02a2c3d dev_ioctl(): move copyin/copyout to callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24 19:13:45 -05:00
Al Viro b1b0c24506 lift handling of SIOCIW... out of dev_ioctl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24 19:13:45 -05:00
Al Viro 4cf808e7ac kill dev_ifname32()
same story...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24 19:13:45 -05:00
Al Viro f92d4fc953 kill bond_ioctl()
Same story as with dev_ifsioc(), except that the last cases with non-trivial
conversions had been taken out in 2013...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24 19:13:45 -05:00
Al Viro bf4405737f kill dev_ifsioc()
Once upon a time net/socket.c:dev_ifsioc() used to handle SIOCSHWTSTAMP and
SIOCSIFMAP.  These have different native and compat layout, so the format
conversion had been needed.  In 2009 these two cases had been taken out,
turning the rest into a convoluted way to calling sock_do_ioctl().  We copy
compat structure into native one, call sock_do_ioctl() on that and copy
the result back for the in/out ioctls.  No layout transformation anywhere,
so we might as well just call sock_do_ioctl() and skip all the headache with
copying.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24 19:13:45 -05:00
Al Viro 36fd633ec9 net: separate SIOCGIFCONF handling from dev_ioctl()
Only two of dev_ioctl() callers may pass SIOCGIFCONF to it.
Separating that codepath from the rest of dev_ioctl() allows both
to simplify dev_ioctl() itself (all other cases work with struct ifreq *)
*and* seriously simplify the compat side of that beast: all it takes
is passing to inet_gifconf() an extra argument - the size of individual
records (sizeof(struct ifreq) or sizeof(struct compat_ifreq)).  With
dev_ifconf() called directly from sock_do_ioctl()/compat_dev_ifconf()
that's easy to arrange.

As the result, compat side of SIOCGIFCONF doesn't need any
allocations, copy_in_user() back and forth, etc.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24 19:13:45 -05:00
Daniel Borkmann fa9dd599b4 bpf: get rid of pure_initcall dependency to enable jits
Having a pure_initcall() callback just to permanently enable BPF
JITs under CONFIG_BPF_JIT_ALWAYS_ON is unnecessary and could leave
a small race window in future where JIT is still disabled on boot.
Since we know about the setting at compilation time anyway, just
initialize it properly there. Also consolidate all the individual
bpf_jit_enable variables into a single one and move them under one
location. Moreover, don't allow for setting unspecified garbage
values on them.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-19 18:37:00 -08:00
David S. Miller 19d28fbd30 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
BPF alignment tests got a conflict because the registers
are output as Rn_w instead of just Rn in net-next, and
in net a fixup for a testcase prohibits logical operations
on pointers before using them.

Also, we should attempt to patch BPF call args if JIT always on is
enabled.  Instead, if we fail to JIT the subprogs we should pass
an error back up and fail immediately.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-11 22:13:42 -05:00
Linus Torvalds cbd0a6a2cc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs regression fix from Al Viro/

Fix a leak in socket() introduced by commit 8e1611e235 ("make
sock_alloc_file() do sock_release() on failures").

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Fix a leak in socket(2) when we fail to allocate a file descriptor.
2018-01-10 17:55:42 -08:00
Al Viro ce4bb04cae Fix a leak in socket(2) when we fail to allocate a file descriptor.
Got broken by "make sock_alloc_file() do sock_release() on failures" -
cleanup after sock_map_fd() failure got pulled all the way into
sock_alloc_file(), but it used to serve the case when sock_map_fd()
failed *before* getting to sock_alloc_file() as well, and that got
lost.  Trivial to fix, fortunately.

Fixes: 8e1611e235 (make sock_alloc_file() do sock_release() on failures)
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-10 18:47:05 -05:00
Alexei Starovoitov 290af86629 bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.

A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."

To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64

The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden

v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)

v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
  It will be sent when the trees are merged back to net-next

Considered doing:
  int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-09 22:25:26 +01:00
Tonghao Zhang 648845ab7e sock: Move the socket inuse to namespace.
In some case, we want to know how many sockets are in use in
different _net_ namespaces. It's a key resource metric.

This patch add a member in struct netns_core. This is a counter
for socket-inuse in the _net_ namespace. The patch will add/sub
counter in the sk_alloc, sk_clone_lock and __sk_free.

This patch will not counter the socket created in kernel.
It's not very useful for userspace to know how many kernel
sockets we created.

The main reasons for doing this are that:

1. When linux calls the 'do_exit' for process to exit, the functions
'exit_task_namespaces' and 'exit_task_work' will be called sequentially.
'exit_task_namespaces' may have destroyed the _net_ namespace, but
'sock_release' called in 'exit_task_work' may use the _net_ namespace
if we counter the socket-inuse in sock_release.

2. socket and sock are in pair. More important, sock holds the _net_
namespace. We counter the socket-inuse in sock, for avoiding holding
_net_ namespace again in socket. It's a easy way to maintain the code.

Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 09:58:14 -05:00
Al Viro 8e1611e235 make sock_alloc_file() do sock_release() on failures
This changes calling conventions (and simplifies the hell out
the callers).  New rules: once struct socket had been passed
to sock_alloc_file(), it's been consumed either by struct file
or by sock_release() done by sock_alloc_file().  Either way
the caller should not do sock_release() after that point.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 18:39:29 -05:00
Al Viro 016a266bdf socketpair(): allocate descriptors first
simplifies failure exits considerably...

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 18:39:28 -05:00
Al Viro ade994f4f6 net: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:04 -05:00
Al Viro e6c8adca20 anntotate the places where ->poll() return values go
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:19:53 -05:00
Levin, Alexander (Sasha Levin) 4950276672 kmemcheck: remove annotations
Patch series "kmemcheck: kill kmemcheck", v2.

As discussed at LSF/MM, kill kmemcheck.

KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow).  KASan is already upstream.

We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).

The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.

Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.

This patch (of 4):

Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
  Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
John Fastabend db5980d804 net: fixes for skb_send_sock
A couple fixes to new skb_send_sock infrastructure. However, no users
currently exist for this code (adding user in next handful of patches)
so it should not be possible to trigger a panic with existing in-kernel
code.

Fixes: 306b13eb3c ("proto_ops: Add locked held versions of sendmsg and sendpage")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 11:27:52 -07:00
Tom Herbert 306b13eb3c proto_ops: Add locked held versions of sendmsg and sendpage
Add new proto_ops sendmsg_locked and sendpage_locked that can be
called when the socket lock is already held. Correspondingly, add
kernel_sendmsg_locked and kernel_sendpage_locked as front end
functions.

These functions will be used in zero proxy so that we can take
the socket lock in a ULP sendmsg/sendpage and then directly call the
backend transport proto_ops functions.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 15:26:18 -07:00
David S. Miller 29fda25a2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 10:07:50 -07:00
stephen hemminger 614d79c09e socket: fix set not used warning
The variable owned_by_user is always set, but only used
when kernel is configured with LOCKDEP enabled.

Get rid of the warning by moving the code to put the call
to owned_by_user into the the rcu_protected call.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-25 12:31:37 -07:00
Paolo Abeni 864d966424 net/socket: fix type in assignment and trim long line
The commit ffb07550c7 ("copy_msghdr_from_user(): get rid of
field-by-field copyin") introduce a new sparse warning:

net/socket.c:1919:27: warning: incorrect type in assignment (different address spaces)
net/socket.c:1919:27:    expected void *msg_control
net/socket.c:1919:27:    got void [noderef] <asn:1>*[addressable] msg_control

and a line above 80 chars, let's fix them

Fixes: ffb07550c7 ("copy_msghdr_from_user(): get rid of field-by-field copyin")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 14:17:01 -07:00
Linus Torvalds 2173bd0631 Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull network field-by-field copy-in updates from Al Viro:
 "This part of the misc compat queue was held back for review from
  networking folks and since davem has jus ACKed those..."

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  get_compat_bpf_fprog(): don't copyin field-by-field
  get_compat_msghdr(): get rid of field-by-field copyin
  copy_msghdr_from_user(): get rid of field-by-field copyin
2017-07-15 11:06:17 -07:00
Linus Torvalds 3bad2f1c67 Merge branch 'work.misc-set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc user access cleanups from Al Viro:
 "The first pile is assorted getting rid of cargo-culted access_ok(),
  cargo-culted set_fs() and field-by-field copyouts.

  The same description applies to a lot of stuff in other branches -
  this is just the stuff that didn't fit into a more specific topical
  branch"

* 'work.misc-set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Switch flock copyin/copyout primitives to copy_{from,to}_user()
  fs/fcntl: return -ESRCH in f_setown when pid/pgid can't be found
  fs/fcntl: f_setown, avoid undefined behaviour
  fs/fcntl: f_setown, allow returning error
  lpfc debugfs: get rid of pointless access_ok()
  adb: get rid of pointless access_ok()
  isdn: get rid of pointless access_ok()
  compat statfs: switch to copy_to_user()
  fs/locks: don't mess with the address limit in compat_fcntl64
  nfsd_readlink(): switch to vfs_get_link()
  drbd: ->sendpage() never needed set_fs()
  fs/locks: pass kernel struct flock to fcntl_getlk/setlk
  fs: locks: Fix some troubles at kernel-doc comments
2017-07-05 13:13:32 -07:00
Al Viro ffb07550c7 copy_msghdr_from_user(): get rid of field-by-field copyin
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-04 13:14:33 -04:00
Jiri Slaby 393cc3f511 fs/fcntl: f_setown, allow returning error
Allow f_setown to return an error value. We will fail in the next patch
with EINVAL for bad input to f_setown, so tile the path for the later
patch.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-14 08:46:36 -04:00
Rosen, Rami 241c4667fc net: socket: fix a typo in sockfd_lookup().
This patch fixes a typo in sockfd_lookup() in net/socket.c.

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22 12:14:04 -04:00
Miroslav Lichvar b50a5c70ff net: allow simultaneous SW and HW transmit timestamping
Add SOF_TIMESTAMPING_OPT_TX_SWHW option to allow an outgoing packet to
be looped to the socket's error queue with a software timestamp even
when a hardware transmit timestamp is expected to be provided by the
driver.

Applications using this option will receive two separate messages from
the error queue, one with a software timestamp and the other with a
hardware timestamp. As the hardware timestamp is saved to the shared skb
info, which may happen before the first message with software timestamp
is received by the application, the hardware timestamp is copied to the
SCM_TIMESTAMPING control message only when the skb has no software
timestamp or it is an incoming packet.

While changing sw_tx_timestamp(), inline it in skb_tx_timestamp() as
there are no other users.

CC: Richard Cochran <richardcochran@gmail.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:37:32 -04:00
Miroslav Lichvar aad9c8c470 net: add new control message for incoming HW-timestamped packets
Add SOF_TIMESTAMPING_OPT_PKTINFO option to request a new control message
for incoming packets with hardware timestamps. It contains the index of
the real interface which received the packet and the length of the
packet at layer 2.

The index is useful with bonding, bridges and other interfaces, where
IP_PKTINFO doesn't allow applications to determine which PHC made the
timestamp. With the L2 length (and link speed) it is possible to
transpose preamble timestamps to trailer timestamps, which are used in
the NTP protocol.

While this information could be provided by two new socket options
independently from timestamping, it doesn't look like they would be very
useful. With this option any performance impact is limited to hardware
timestamping.

Use dev_get_by_napi_id() to get the device and its index. On kernels
with disabled CONFIG_NET_RX_BUSY_POLL or drivers not using NAPI, a zero
index will be returned in the control message.

CC: Richard Cochran <richardcochran@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:37:32 -04:00
R. Parameswaran 57240d0078 l2tp: device MTU setup, tunnel socket needs a lock
The MTU overhead calculation in L2TP device set-up
merged via commit b784e7ebfc
needs to be adjusted to lock the tunnel socket while
referencing the sub-data structures to derive the
socket's IP overhead.

Reported-by: Guillaume Nault <g.nault@alphalink.fr>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: R. Parameswaran <rparames@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17 13:01:48 -04:00
R. Parameswaran 113c307593 New kernel function to get IP overhead on a socket.
A new function, kernel_sock_ip_overhead(), is provided
to calculate the cumulative overhead imposed by the IP
Header and IP options, if any, on a socket's payload.
The new function returns an overhead of zero for sockets
that do not belong to the IPv4 or IPv6 address families.
This is used in the L2TP code path to compute the
total outer IP overhead on the L2TP tunnel socket when
calculating the default MTU for Ethernet pseudowires.

Signed-off-by: R. Parameswaran <rparames@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 13:43:31 -07:00
Soheil Hassas Yeganeh 4ef1b28694 tcp: mark skbs with SCM_TIMESTAMPING_OPT_STATS
SOF_TIMESTAMPING_OPT_STATS can be enabled and disabled
while packets are collected on the error queue.
So, checking SOF_TIMESTAMPING_OPT_STATS in sk->sk_tsflags
is not enough to safely assume that the skb contains
OPT_STATS data.

Add a bit in sock_exterr_skb to indicate whether the
skb contains opt_stats data.

Fixes: 1c885808e4 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 18:44:17 -07:00
Soheil Hassas Yeganeh 8605330aac tcp: fix SCM_TIMESTAMPING_OPT_STATS for normal skbs
__sock_recv_timestamp can be called for both normal skbs (for
receive timestamps) and for skbs on the error queue (for transmit
timestamps).

Commit 1c885808e4
(tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING)
assumes any skb passed to __sock_recv_timestamp are from
the error queue, containing OPT_STATS in the content of the skb.
This results in accessing invalid memory or generating junk
data.

To fix this, set skb->pkt_type to PACKET_OUTGOING for packets
on the error queue. This is safe because on the receive path
on local sockets skb->pkt_type is never set to PACKET_OUTGOING.
With that, copy OPT_STATS from a packet, only if its pkt_type
is PACKET_OUTGOING.

Fixes: 1c885808e4 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: JongHwan Kim <zzoru007@gmail.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-21 18:44:17 -07:00
David Howells cdfbabfb2f net: Work around lockdep limitation in sockets that use sockets
Lockdep issues a circular dependency warning when AFS issues an operation
through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

The theory lockdep comes up with is as follows:

 (1) If the pagefault handler decides it needs to read pages from AFS, it
     calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
     creating a call requires the socket lock:

	mmap_sem must be taken before sk_lock-AF_RXRPC

 (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
     binds the underlying UDP socket whilst holding its socket lock.
     inet_bind() takes its own socket lock:

	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

 (3) Reading from a TCP socket into a userspace buffer might cause a fault
     and thus cause the kernel to take the mmap_sem, but the TCP socket is
     locked whilst doing this:

	sk_lock-AF_INET must be taken before mmap_sem

However, lockdep's theory is wrong in this instance because it deals only
with lock classes and not individual locks.  The AF_INET lock in (2) isn't
really equivalent to the AF_INET lock in (3) as the former deals with a
socket entirely internal to the kernel that never sees userspace.  This is
a limitation in the design of lockdep.

Fix the general case by:

 (1) Double up all the locking keys used in sockets so that one set are
     used if the socket is created by userspace and the other set is used
     if the socket is created by the kernel.

 (2) Store the kern parameter passed to sk_alloc() in a variable in the
     sock struct (sk_kern_sock).  This informs sock_lock_init(),
     sock_init_data() and sk_clone_lock() as to the lock keys to be used.

     Note that the child created by sk_clone_lock() inherits the parent's
     kern setting.

 (3) Add a 'kern' parameter to ->accept() that is analogous to the one
     passed in to ->create() that distinguishes whether kernel_accept() or
     sys_accept4() was the caller and can be passed to sk_alloc().

     Note that a lot of accept functions merely dequeue an already
     allocated socket.  I haven't touched these as the new socket already
     exists before we get the parameter.

     Note also that there are a couple of places where I've made the accepted
     socket unconditionally kernel-based:

	irda_accept()
	rds_rcp_accept_one()
	tcp_accept_from_sock()

     because they follow a sock_create_kern() and accept off of that.

Whilst creating this, I noticed that lustre and ocfs don't create sockets
through sock_create_kern() and thus they aren't marked as for-kernel,
though they appear to be internal.  I wonder if these should do that so
that they use the new set of lock keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:23:27 -08:00
Alexander Potapenko 9f138fa609 net: initialize msg.msg_flags in recvfrom
KMSAN reports a use of uninitialized memory in put_cmsg() because
msg.msg_flags in recvfrom haven't been initialized properly.
The flag values don't affect the result on this path, but it's still a
good idea to initialize them explicitly.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 17:21:21 -08:00
Maxime Jayat e623a9e9de net: socket: fix recvmmsg not returning error from sock_error
Commit 34b88a68f2 ("net: Fix use after free in the recvmmsg exit path"),
changed the exit path of recvmmsg to always return the datagrams
variable and modified the error paths to set the variable to the error
code returned by recvmsg if necessary.

However in the case sock_error returned an error, the error code was
then ignored, and recvmmsg returned 0.

Change the error path of recvmmsg to correctly return the error code
of sock_error.

The bug was triggered by using recvmmsg on a CAN interface which was
not up. Linux 4.6 and later return 0 in this case while earlier
releases returned -ENETDOWN.

Fixes: 34b88a68f2 ("net: Fix use after free in the recvmmsg exit path")
Signed-off-by: Maxime Jayat <maxime.jayat@mobile-devices.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21 13:35:25 -05:00
David S. Miller 02ac5d1487 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two AF_* families adding entries to the lockdep tables
at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 14:43:39 -05:00
Tobias Klauser dc647ec88e net: socket: Make unnecessarily global sockfs_setattr() static
Make sockfs_setattr() static as it is not used outside of net/socket.c

This fixes the following GCC warning:
net/socket.c:534:5: warning: no previous prototype for ‘sockfs_setattr’ [-Wmissing-prototypes]

Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-10 11:29:50 -05:00
yuan linyu 1e9116327e net: change init_inodecache() return void
sock_init() call it but not check it's return value,
so change it to void return and add an internal BUG_ON() check.

Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09 12:05:29 -05:00
David S. Miller 76eb75be79 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-05 11:03:07 -05:00
David S. Miller ac4340fc3c net: Assert at build time the assumptions we make about the CMSG header.
It must always be the case that CMSG_ALIGN(sizeof(hdr)) == sizeof(hdr).

Otherwise there are missing adjustments in the various calculations
that parse and build these things.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-04 13:24:19 -05:00
Eric Biggers e1a3a60a2e net: socket: don't set sk_uid to garbage value in ->setattr()
->setattr() was recently implemented for socket files to sync the socket
inode's uid to the new 'sk_uid' member of struct sock.  It does this by
copying over the ia_uid member of struct iattr.  However, ia_uid is
actually only valid when ATTR_UID is set in ia_valid, indicating that
the uid is being changed, e.g. by chown.  Other metadata operations such
as chmod or utimes leave ia_uid uninitialized.  Therefore, sk_uid could
be set to a "garbage" value from the stack.

Fix this by only copying the uid over when ATTR_UID is set.

Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-01 11:53:34 -05:00
Thomas Gleixner 2456e85535 ktime: Get rid of the union
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.

Get rid of the union and just keep ktime_t as simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Linus Torvalds 7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Amit Kushwaha fa1bd57a63 net: socket: removed an unnecessary newline
This patch removes a newline which was added
in socket.c file in net-next

Signed-off-by: Amit Kushwaha <kushwaha.a@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-10 17:27:07 -05:00
Amit Kushwaha 846cc1231a net: socket: preferred __aligned(size) for control buffer
This patch cleanup checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))

Signed-off-by: Amit Kushwaha <kushwaha.a@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 18:20:46 -05:00
Francis Yan 1c885808e4 tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING
This patch exports the sender chronograph stats via the socket
SO_TIMESTAMPING channel. Currently we can instrument how long a
particular application unit of data was queued in TCP by tracking
SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
these sender chronograph stats exported simultaneously along with
these timestamps allow further breaking down the various sender
limitation.  For example, a video server can tell if a particular
chunk of video on a connection takes a long time to deliver because
TCP was experiencing small receive window. It is not possible to
tell before this patch without packet traces.

To prepare these stats, the user needs to set
SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
while requesting other SOF_TIMESTAMPING TX timestamps. When the
timestamps are available in the error queue, the stats are returned
in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 10:04:25 -05:00
David S. Miller f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Andreas Gruenbacher 4a59015372 xattr: Fix setting security xattrs on sockfs
The IOP_XATTR flag is set on sockfs because sockfs supports getting the
"system.sockprotoname" xattr.  Since commit 6c6ef9f2, this flag is checked for
setxattr support as well.  This is wrong on sockfs because security xattr
support there is supposed to be provided by security_inode_setsecurity.  The
smack security module relies on socket labels (xattrs).

Fix this by adding a security xattr handler on sockfs that returns
-EAGAIN, and by checking for -EAGAIN in setxattr.

We cannot simply check for -EOPNOTSUPP in setxattr because there are
filesystems that neither have direct security xattr support nor support
via security_inode_setsecurity.  A more proper fix might be to move the
call to security_inode_setsecurity into sockfs, but it's not clear to me
if that is safe: we would end up calling security_inode_post_setxattr after
that as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-11-17 00:00:23 -05:00
David S. Miller bb598c1b8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of bug fixes in 'net' overlapping other changes in
'net-next-.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15 10:54:36 -05:00
Soheil Hassas Yeganeh 3023898b7d sock: fix sendmmsg for partial sendmsg
Do not send the next message in sendmmsg for partial sendmsg
invocations.

sendmmsg assumes that it can continue sending the next message
when the return value of the individual sendmsg invocations
is positive. It results in corrupting the data for TCP,
SCTP, and UNIX streams.

For example, sendmmsg([["abcd"], ["efgh"]]) can result in a stream
of "aefgh" if the first sendmsg invocation sends only the first
byte while the second sendmsg goes through.

Datagram sockets either send the entire datagram or fail, so
this patch affects only sockets of type SOCK_STREAM and
SOCK_SEQPACKET.

Fixes: 228e548e60 ("net: Add sendmmsg socket system call")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09 13:18:12 -05:00
Lorenzo Colitti 86741ec254 net: core: Add a UID field to struct sock.
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.

Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().

Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:

1. Whenever sk_socket is non-null: sk_uid is the same as the UID
   in sk_socket, i.e., matches the return value of sock_i_uid.
   Specifically, the UID is set when userspace calls socket(),
   fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
   - For a socket that no longer has a sk_socket because
     userspace has called close(): the previous UID.
   - For a cloned socket (e.g., an incoming connection that is
     established but on which userspace has not yet called
     accept): the UID of the socket it was cloned from.
   - For a socket that has never had an sk_socket: UID 0 inside
     the user namespace corresponding to the network namespace
     the socket belongs to.

Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04 14:45:22 -04:00
Andrey Vagin c62cce2cae net: add an ioctl to get a socket network namespace
Each socket operates in a network namespace where it has been created,
so if we want to dump and restore a socket, we have to know its network
namespace.

We have a socket_diag to get information about sockets, it doesn't
report sockets which are not bound or connected.

This patch introduces a new socket ioctl, which is called SIOCGSKNS
and used to get a file descriptor for a socket network namespace.

A task must have CAP_NET_ADMIN in a target network namespace to
use this ioctl.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-31 10:56:36 -04:00
Andreas Gruenbacher fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Andreas Gruenbacher bba0bd31b1 sockfs: Get rid of getxattr iop
If we allow pseudo-filesystems created with mount_pseudo to have xattr
handlers, we can replace sockfs_getxattr with a sockfs_xattr_get handler
to use the xattr handler name parsing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Andreas Gruenbacher 971df15bd5 sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
The standard return value for unsupported attribute names is
-EOPNOTSUPP, as opposed to undefined but supported attributes
(-ENODATA).

Also, fail for attribute names like "system.sockprotonameXXX" and
simplify the code a bit.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-06 22:17:38 -04:00
Deepa Dinamani 766b9f928b fs: poll/select/recvmmsg: use timespec64 for timeout events
struct timespec is not y2038 safe.  Even though timespec might be
sufficient to represent timeouts, use struct timespec64 here as the plan
is to get rid of all timespec reference in the kernel.

The patch transitions the common functions: poll_select_set_timeout()
and select_estimate_accuracy() to use timespec64.  And, all the syscalls
that use these functions are transitioned in the same patch.

The restart block parameters for poll uses monotonic time.  Use
timespec64 here as well to assign timeout value.  This parameter in the
restart block need not change because this only holds the monotonic
timestamp at which timeout should occur.  And, unsigned long data type
should be big enough for this timestamp.

The system call interfaces will be handled in a separate series.

Compat interfaces need not change as timespec64 is an alias to struct
timespec on a 64 bit system.

Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Linus Torvalds a7fd20d1c4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support SPI based w5100 devices, from Akinobu Mita.

   2) Partial Segmentation Offload, from Alexander Duyck.

   3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

   4) Allow cls_flower stats offload, from Amir Vadai.

   5) Implement bpf blinding, from Daniel Borkmann.

   6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
      actually using FASYNC these atomics are superfluous.  From Eric
      Dumazet.

   7) Run TCP more preemptibly, also from Eric Dumazet.

   8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
      driver, from Gal Pressman.

   9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

  10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

  11) Support tunneling offloads in qed, from Manish Chopra.

  12) aRFS offloading in mlx5e, from Maor Gottlieb.

  13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
      Leitner.

  14) Add MSG_EOR support to TCP, this allows controlling packet
      coalescing on application record boundaries for more accurate
      socket timestamp sampling.  From Martin KaFai Lau.

  15) Fix alignment of 64-bit netlink attributes across the board, from
      Nicolas Dichtel.

  16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

  17) Several conversions of drivers to ethtool ksettings, from Philippe
      Reynes.

  18) Checksum neutral ILA in ipv6, from Tom Herbert.

  19) Factorize all of the various marvell dsa drivers into one, from
      Vivien Didelot

  20) Add VF support to qed driver, from Yuval Mintz"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
  Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
  Revert "phy dp83867: Make rgmii parameters optional"
  r8169: default to 64-bit DMA on recent PCIe chips
  phy dp83867: Make rgmii parameters optional
  phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
  bpf: arm64: remove callee-save registers use for tmp registers
  asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
  switchdev: pass pointer to fib_info instead of copy
  net_sched: close another race condition in tcf_mirred_release()
  tipc: fix nametable publication field in nl compat
  drivers: net: Don't print unpopulated net_device name
  qed: add support for dcbx.
  ravb: Add missing free_irq() calls to ravb_close()
  qed: Remove a stray tab
  net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
  net: ethernet: fec-mpc52xx: use phydev from struct net_device
  bpf, doc: fix typo on bpf_asm descriptions
  stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
  net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
  net: ethernet: fs-enet: use phydev from struct net_device
  ...
2016-05-17 16:26:30 -07:00
Soheil Hassas Yeganeh 0a2cf20c3f tcp: remove SKBTX_ACK_TSTAMP since it is redundant
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.

Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.

Note that this frees one bit in shinfo->tx_flags.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:06:10 -04:00
David S. Miller 6c61403dae Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-04-14 00:39:15 -04:00
Al Viro ce23e64013 ->getxattr(): pass dentry and inode as separate arguments
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-11 00:48:00 -04:00
Hannes Frederic Sowa 1e1d04e678 net: introduce lockdep_is_held and update various places to use it
The socket is either locked if we hold the slock spin_lock for
lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned
!= 0). Check for this and at the same time improve that the current
thread/cpu is really holding the lock.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:44:14 -04:00
Soheil Hassas Yeganeh c14ac9451c sock: enable timestamping using control messages
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 15:50:30 -04:00
Al Viro 2da62906b1 [net] drop 'size' argument of sock_recvmsg()
all callers have it equal to msg_data_left(msg).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-28 13:57:51 -04:00
Arnaldo Carvalho de Melo 34b88a68f2 net: Fix use after free in the recvmmsg exit path
The syzkaller fuzzer hit the following use-after-free:

  Call Trace:
   [<ffffffff8175ea0e>] __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:295
   [<ffffffff851cc31a>] __sys_recvmmsg+0x6fa/0x7f0 net/socket.c:2261
   [<     inline     >] SYSC_recvmmsg net/socket.c:2281
   [<ffffffff851cc57f>] SyS_recvmmsg+0x16f/0x180 net/socket.c:2270
   [<ffffffff86332bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
  arch/x86/entry/entry_64.S:185

And, as Dmitry rightly assessed, that is because we can drop the
reference and then touch it when the underlying recvmsg calls return
some packets and then hit an error, which will make recvmmsg to set
sock->sk->sk_err, oops, fix it.

Reported-and-Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Fixes: a2e2725541 ("net: Introduce recvmmsg socket syscall")
http://lkml.kernel.org/r/20160122211644.GC2470@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 12:41:49 -04:00
liping.zhang f3c986908c net: socket: use pr_info_once to tip the obsolete usage of PF_PACKET
There is no need to use the static variable here, pr_info_once is more
concise.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13 22:37:50 -04:00
Tom Herbert f092276d85 net: Add MSG_BATCH flag
Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet, they are sent individually.
sendmmsg is updated so that each contained message except for the
last one is marked as MSG_BATCH.

MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 16:36:13 -05:00
Tom Herbert 28a94d8fb3 net: Allow MSG_EOR in each msghdr of sendmmsg
This patch allows setting MSG_EOR in each individual msghdr passed
in sendmmsg. This allows a sendmmsg to send multiple messages when
using SOCK_SEQPACKET.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 16:36:13 -05:00
Tom Herbert f4a00aacdb net: Make sock_alloc exportable
Export it for cases where we want to create sockets by hand.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 16:36:13 -05:00
Vladimir Davydov 5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Eric Dumazet a78cb84c62 net: add scheduling point in recvmmsg/sendmmsg
Applications often have to reduce number of datagrams
they receive or send per system call to avoid starvation problems.

Really the kernel should take care of this by using cond_resched(),
so that applications can experiment bigger batch sizes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:56:29 -05:00
Nicolai Stange 574aab1e02 net, socket, socket_wq: fix missing initialization of flags
Commit ceb5d58b21 ("net: fix sock_wake_async() rcu protection") from
the current 4.4 release cycle introduced a new flags member in
struct socket_wq and moved SOCKWQ_ASYNC_NOSPACE and SOCKWQ_ASYNC_WAITDATA
from struct socket's flags member into that new place.

Unfortunately, the new flags field is never initialized properly, at least
not for the struct socket_wq instance created in sock_alloc_inode().

One particular issue I encountered because of this is that my GNU Emacs
failed to draw anything on my desktop -- i.e. what I got is a transparent
window, including the title bar. Bisection lead to the commit mentioned
above and further investigation by means of strace told me that Emacs
is indeed speaking to my Xorg through an O_ASYNC AF_UNIX socket. This is
reproducible 100% of times and the fact that properly initializing the
struct socket_wq ->flags fixes the issue leads me to the conclusion that
somehow SOCKWQ_ASYNC_WAITDATA got set in the uninitialized ->flags,
preventing my Emacs from receiving any SIGIO's due to data becoming
available and it got stuck.

Make sock_alloc_inode() set the newly created struct socket_wq's ->flags
member to zero.

Fixes: ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-30 16:38:01 -05:00
tadeusz.struk@intel.com 130ed5d105 net: fix uninitialized variable issue
msg_iocb needs to be initialized on the recv/recvfrom path.
Otherwise afalg will wrongly interpret it as an async call.

Cc: stable@vger.kernel.org
Reported-by: Harald Freudenberger <freude@linux.vnet.ibm.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 15:46:48 -05:00
Eric Dumazet ceb5d58b21 net: fix sock_wake_async() rcu protection
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
triggering a fault in sock_wake_async() when async IO is requested.

Said program stressed af_unix sockets, but the issue is generic
and should be addressed in core networking stack.

The problem is that by the time sock_wake_async() is called,
we should not access the @flags field of 'struct socket',
as the inode containing this socket might be freed without
further notice, and without RCU grace period.

We already maintain an RCU protected structure, "struct socket_wq"
so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
is the safe route.

It also reduces number of cache lines needing dirtying, so might
provide a performance improvement anyway.

In followup patches, we might move remaining flags (SOCK_NOSPACE,
SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
being mostly read and let it being shared between cpus.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 15:45:05 -05:00
Eric Dumazet 9cd3e072b0 net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
This patch is a cleanup to make following patch easier to
review.

Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()

To ease backports, we rename both constants.

Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 15:45:05 -05:00
Viresh Kumar b5ffe63442 net: Drop unlikely before IS_ERR(_OR_NULL)
IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there
is no need to do that again from its callers. Drop it.

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-09-29 15:15:40 +02:00
Eric W. Biederman eeb1bd5c40 net: Add a struct net parameter to sock_create_kern
This is long overdue, and is part of cleaning up how we allocate kernel
sockets that don't reference count struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:50:17 -04:00
Eric W. Biederman 140e807da1 tun: Utilize the normal socket network namespace refcounting.
There is no need for tun to do the weird network namespace refcounting.
The existing network namespace refcounting in tfile has almost exactly
the same lifetime.  So rewrite the code to use the struct sock network
namespace refcounting and remove the unnecessary hand rolled network
namespace refcounting and the unncesary tfile->net.

This change allows the tun code to directly call sock_put bypassing
sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.

Remove the now unncessary tun_release so that if anything tries to use
the sock_release code path the kernel will oops, and let us know about
the bug.

The macvtap code already uses it's internal socket this way.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:50:16 -04:00
David Howells c5ef603528 VFS: net/: d_inode() annotations
socket inodes and sunrpc filesystems - inodes owned by that code

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:56 -04:00
Al Viro 5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro 01e97e6517 new helper: msg_data_left()
convert open-coded instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 15:53:35 -04:00
Al Viro d8725c86ae get rid of the size argument of sock_sendmsg()
it's equal to iov_iter_count(&msg->msg_iter) in all cases

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 15:27:37 -04:00
Al Viro 6aa248145a switch kernel_sendmsg() and kernel_recvmsg() to iov_iter_kvec()
For kernel_sendmsg() that eliminates the need to play with setfs();
for kernel_recvmsg() it does *not* - a couple of callers are using
it with non-NULL ->msg_control, which would be treated as userland
address on recvmsg side of things.

In all cases we are really setting a kvec-backed iov_iter, though.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:02:34 -04:00
Al Viro da18428498 net: switch importing msghdr from userland to {compat_,}import_iovec()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:02:26 -04:00
Al Viro 602bd0e90e net: switch sendto() and recvfrom() to import_single_range()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:02:21 -04:00
Al Viro 237dae8890 Merge branch 'iocb' into for-davem
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:01:38 -04:00
tadeusz.struk@intel.com 0345f93138 net: socket: add support for async operations
Add support for async operations.

Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:41:36 -04:00
David S. Miller 0fa74a4be4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:51:09 -04:00
Al Viro 4de930efc2 net: validate the range we feed to iov_iter_init() in sys_sendto/sys_recvfrom
Cc: stable@vger.kernel.org # v3.19
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:38:06 -04:00
Christoph Hellwig 599bd19bdc fs: don't allow to complete sync iocbs through aio_complete
The AIO interface is fairly complex because it tries to allow
filesystems to always work async and then wakeup a synchronous
caller through aio_complete.  It turns out that basically no one
was doing this to avoid the complexity and context switches,
and we've already fixed up the remaining users and can now
get rid of this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-13 12:10:22 -04:00
Christoph Hellwig 66ee59af63 fs: remove ki_nbytes
There is no need to pass the total request length in the kiocb, as
we already get passed in through the iov_iter argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-12 23:50:23 -04:00
Ying Xue 1b78414047 net: Remove iocb argument from sendmsg and recvmsg
After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 13:06:31 -05:00
Eyal Birger 744d5a3e9f net: move skb->dropcount to skb->cb[]
Commit 977750076d ("af_packet: add interframe drop cmsg (v6)")
unionized skb->mark and skb->dropcount in order to allow recording
of the socket drop count while maintaining struct sk_buff size.

skb->dropcount was introduced since there was no available room
in skb->cb[] in packet sockets. However, its introduction led to
the inability to export skb->mark, or any other aliased field to
userspace if so desired.

Moving the dropcount metric to skb->cb[] eliminates this problem
at the expense of 4 bytes less in skb->cb[] for protocol families
using it.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 00:19:30 -05:00
Al Viro 8ae5e030f3 net: switch sockets to ->read_iter/->write_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 01:34:15 -05:00
Al Viro 6d65233020 net/socket.c: fold do_sock_{read,write} into callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04 01:34:15 -05:00
Christoph Hellwig 7cc0566268 net: remove sock_iocb
The sock_iocb structure is allocate on stack for each read/write-like
operation on sockets, and contains various fields of which only the
embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
Get rid of the sock_iocb and put a msghdr directly on the stack and pass
the scm_cookie explicitly to netlink_mmap_sendmsg.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-28 23:15:07 -08:00
David S. Miller 95f873f2ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/arm/boot/dts/imx6sx-sdb.dts
	net/sched/cls_bpf.c

Two simple sets of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 16:59:56 -08:00
Christoph Hellwig 06539d3071 net: don't OOPS on socket aio
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 12:25:33 -08:00
Nicolas Dichtel 66c1a12c65 socket: use ki_nbytes instead of iov_length()
This field already contains the length of the iovec, no need to calculate it
again.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-17 23:58:37 -05:00
Nicolas Dichtel 7eb35b1483 socket: use iov_length()
Better to use available helpers.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15 13:56:15 -05:00
Al Viro e3bb504efd [regression] chunk lost from bd9b51
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-19 07:13:21 -05:00
Linus Torvalds 603ba7e41b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile #2 from Al Viro:
 "Next pile (and there'll be one or two more).

  The large piece in this one is getting rid of /proc/*/ns/* weirdness;
  among other things, it allows to (finally) make nameidata completely
  opaque outside of fs/namei.c, making for easier further cleanups in
  there"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  coda_venus_readdir(): use file_inode()
  fs/namei.c: fold link_path_walk() call into path_init()
  path_init(): don't bother with LOOKUP_PARENT in argument
  fs/namei.c: new helper (path_cleanup())
  path_init(): store the "base" pointer to file in nameidata itself
  make default ->i_fop have ->open() fail with ENXIO
  make nameidata completely opaque outside of fs/namei.c
  kill proc_ns completely
  take the targets of /proc/*/ns/* symlinks to separate fs
  bury struct proc_ns in fs/proc
  copy address of proc_ns_ops into ns_common
  new helpers: ns_alloc_inum/ns_free_inum
  make proc_ns_operations work with struct ns_common * instead of void *
  switch the rest of proc_ns_operations to working with &...->ns
  netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
  make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
  common object embedded into various struct ....ns
2014-12-16 15:53:03 -08:00
Al Viro bd9b51e79c make default ->i_fop have ->open() fail with ENXIO
As it is, default ->i_fop has NULL ->open() (along with all other methods).
The only case where it matters is reopening (via procfs symlink) a file that
didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
to something sane (default would fail on read/write/ioctl/etc.).

	Unfortunately, such case exists - alloc_file() users, especially
anon_get_file() ones.  There we have tons of opened files of very different
kinds sharing the same inode.  As the result, attempt to reopen those via
procfs succeeds and you get a descriptor you can't do anything with.

	Moreover, in case of sockets we set ->i_fop that will only be used
on such reopen attempts - and put a failing ->open() into it to make sure
those do not succeed.

	It would be simpler to put such ->open() into default ->i_fop and leave
it unchanged both for anon inode (as we do anyway) and for socket ones.  Result:
	* everything going through do_dentry_open() works as it used to
	* sock_no_open() kludge is gone
	* attempts to reopen anon-inode files fail as they really ought to
	* ditto for aio_private_file()
	* ditto for perfmon - this one actually tried to imitate sock_no_open()
trick, but failed to set ->i_fop, so in the current tree reopens succeed and
yield completely useless descriptor.  Intent clearly had been to fail with
-ENXIO on such reopens; now it actually does.
	* everything else that used alloc_file() keeps working - it has ->i_fop
set for its inodes anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10 21:32:15 -05:00
Al Viro c0371da604 put iov_iter into msghdr
Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-09 16:29:03 -05:00
Gu Zheng 0cf00c6f36 net/socket.c : introduce helper function do_sock_sendmsg to replace reduplicate code
Introduce helper function do_sock_sendmsg() to simplify sock_sendmsg{_nosec},
and replace reduplicate code.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 15:24:26 -05:00
Al Viro 08adb7dabd fold verify_iovec() into copy_msghdr_from_user()
... and do the same on the compat side of things.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 16:23:49 -05:00
Al Viro 0844932009 {compat_,}verify_iovec(): switch to generic copying of iovecs
use {compat_,}rw_copy_check_uvector().  As the result, we are
guaranteed that all iovecs seen in ->msg_iov by ->sendmsg()
and ->recvmsg() will pass access_ok().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 16:23:16 -05:00
Al Viro 666547ff59 separate kernel- and userland-side msghdr
Kernel-side struct msghdr is (currently) using the same layout as
userland one, but it's not a one-to-one copy - even without considering
32bit compat issues, we have msg_iov, msg_name and msg_control copied
to kernel[1].  It's fairly localized, so we get away with a few functions
where that knowledge is needed (and we could shrink that set even
more).  Pretty much everything deals with the kernel-side variant and
the few places that want userland one just use a bunch of force-casts
to paper over the differences.

The thing is, kernel-side definition of struct msghdr is *not* exposed
in include/uapi - libc doesn't see it, etc.  So we can add struct user_msghdr,
with proper annotations and let the few places that ever deal with those
beasts use it for userland pointers.  Saner typechecking aside, that will
allow to change the layout of kernel-side msghdr - e.g. replace
msg_iov/msg_iovlen there with struct iov_iter, getting rid of the need
to modify the iovec as we copy data to/from it, etc.

We could introduce kernel_msghdr instead, but that would create much more
noise - the absolute majority of the instances would need to have the
type switched to kernel_msghdr and definition of struct msghdr in
include/linux/socket.h is not going to be seen by userland anyway.

This commit just introduces user_msghdr and switches the few places that
are dealing with userland-side msghdr to it.

[1] actually, it's even trickier than that - we copy msg_control for
sendmsg, but keep the userland address on recvmsg.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 16:22:59 -05:00
Linus Torvalds ef4a48c513 File locking related changes for v3.18 (pile #1)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUNZK4AAoJEAAOaEEZVoIVI08P/iM7eaIVRnqaqtWw/JBzxiba
 EMDlJYUBSlv6lYk9s8RJT4bMmcmGAKSYzVAHSoPahzNcqTDdFLeDTLGxJ8uKBbjf
 d1qRRdH1yZHGUzCvJq3mEendjfXn435Y3YburUxjLfmzrzW7EbMvndiQsS5dhAm9
 PEZ+wrKF/zFL7LuXa1YznYrbqOD/GRsJAXGEWc3kNwfS9avephVG/RI3GtpI2PJj
 RY1mf8P7+WOlrShYoEuUo5aqs01MnU70LbqGHzY8/QKH+Cb0SOkCHZPZyClpiA+G
 MMJ+o2XWcif3BZYz+dobwz/FpNZ0Bar102xvm2E8fqByr/T20JFjzooTKsQ+PtCk
 DetQptrU2gtyZDKtInJUQSDPrs4cvA13TW+OEB1tT8rKBnmyEbY3/TxBpBTB9E6j
 eb/V3iuWnywR3iE+yyvx24Qe7Pov6deM31s46+Vj+GQDuWmAUJXemhfzPtZiYpMT
 exMXTyDS3j+W+kKqHblfU5f+Bh1eYGpG2m43wJVMLXKV7NwDf8nVV+Wea962ga+w
 BAM3ia4JRVgRWJBPsnre3lvGT5kKPyfTZsoG+kOfRxiorus2OABoK+SIZBZ+c65V
 Xh8VH5p3qyCUBOynXlHJWFqYWe2wH0LfbPrwe9dQwTwON51WF082EMG5zxTG0Ymf
 J2z9Shz68zu0ok8cuSlo
 =Hhee
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux

Pull file locking related changes from Jeff Layton:
 "This release is a little more busy for file locking changes than the
  last:

   - a set of patches from Kinglong Mee to fix the lockowner handling in
     knfsd
   - a pile of cleanups to the internal file lease API.  This should get
     us a bit closer to allowing for setlease methods that can block.

  There are some dependencies between mine and Bruce's trees this cycle,
  and I based my tree on top of the requisite patches in Bruce's tree"

* tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux: (26 commits)
  locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKING
  locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
  locks: set fl_owner for leases to filp instead of current->files
  locks: give lm_break a return value
  locks: __break_lease cleanup in preparation of allowing direct removal of leases
  locks: remove i_have_this_lease check from __break_lease
  locks: move freeing of leases outside of i_lock
  locks: move i_lock acquisition into generic_*_lease handlers
  locks: define a lm_setup handler for leases
  locks: plumb a "priv" pointer into the setlease routines
  nfsd: don't keep a pointer to the lease in nfs4_file
  locks: clean up vfs_setlease kerneldoc comments
  locks: generic_delete_lease doesn't need a file_lock at all
  nfsd: fix potential lease memory leak in nfs4_setlease
  locks: close potential race in lease_get_mtime
  security: make security_file_set_fowner, f_setown and __f_setown void return
  locks: consolidate "nolease" routines
  locks: remove lock_may_read and lock_may_write
  lockd: rip out deferred lock handling from testlock codepath
  NFSD: Get reference of lockowner when coping file_lock
  ...
2014-10-11 13:21:34 -04:00
David S. Miller 1f6d80358d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/mips/net/bpf_jit.c
	drivers/net/can/flexcan.c

Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 12:09:27 -04:00
Ani Sinha 6a2a2b3ae0 net:socket: set msg_namelen to 0 if msg_name is passed as NULL in msghdr struct from userland.
Linux manpage for recvmsg and sendmsg calls does not explicitly mention setting msg_namelen to 0 when
msg_name passed set as NULL. When developers don't set msg_namelen member in msghdr, it might contain garbage
value which will fail the validation check and sendmsg and recvmsg calls from kernel will return EINVAL. This will
break old binaries and any code for which there is no access to source code.
To fix this, we set msg_namelen to 0 when msg_name is passed as NULL from userland.

Signed-off-by: Ani Sinha <ani@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 17:35:46 -07:00
Willem de Bruijn 67cc0d4077 net-timestamp: optimize sock_tx_timestamp default path
Few packets have timestamping enabled. Exit sock_tx_timestamp quickly
in this common case.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 17:34:41 -07:00
Jeff Layton e0b93eddfe security: make security_file_set_fowner, f_setown and __f_setown void return
security_file_set_fowner always returns 0, so make it f_setown and
__f_setown void return functions and fix up the error handling in the
callers.

Cc: linux-security-module@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-09 16:01:36 -04:00
Masanari Iida e793c0f70e net: treewide: Fix typo found in DocBook/networking.xml
This patch fix spelling typo found in DocBook/networking.xml.
It is because the neworking.xml is generated from comments
in the source, I have to fix typo in comments within the source.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:35:28 -07:00
Willem de Bruijn c199105d15 net-timestamp: only report sw timestamp if reporting bit is set
The timestamping API has separate bits for generating and reporting
timestamps. A software timestamp should only be reported for a packet
when the packet has the relevant generation flag (SKBTX_..) set
and the socket has reporting bit SOF_TIMESTAMPING_SOFTWARE set.

The second check was accidentally removed. Reinstitute the original
behavior.

Tested:
  Without this patch, Documentation/networking/txtimestamp reports
  timestamps regardless of whether SOF_TIMESTAMPING_SOFTWARE is set.
  After the patch, it only reports them when the flag is set.

Fixes: f24b9be595 ("net-timestamp: extend SCM_TIMESTAMPING ancillary data struct")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 15:02:43 -07:00
Eric Dumazet 140c55d4b5 net-timestamp: sock_tx_timestamp() fix
sock_tx_timestamp() should not ignore initial *tx_flags value, as TCP
stack can store SKBTX_SHARED_FRAG in it.

Also first argument (struct sock *) can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 12:38:07 -07:00
Willem de Bruijn e1c8a607b2 net-timestamp: ACK timestamp for bytestreams
Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
in the send() call is acknowledged. It implements the feature for TCP.

The timestamp is generated when the TCP socket cumulative ACK is moved
beyond the tracked seqno for the first time. The feature ignores SACK
and FACK, because those acknowledge the specific byte, but not
necessarily the entire contents of the buffer up to that byte.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn e7fd288538 net-timestamp: SCHED timestamp on entering packet scheduler
Kernel transmit latency is often incurred in the packet scheduler.
Introduce a new timestamp on transmission just before entering the
scheduler. When data travels through multiple devices (bonding,
tunneling, ...) each device will export an individual timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn b9f40e21ef net-timestamp: move timestamp flags out of sk_flags
sk_flags is reaching its limit. New timestamping options will not fit.
Move all of them into a new field sk->sk_tsflags.

Added benefit is that this removes boilerplate code to convert between
SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.

SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
timestamp logic (netstamp_needed). That can be simplified and this
last key removed, but will leave that for a separate patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
though that scatters tstamp fields throughout the struct.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn f24b9be595 net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
Applications that request kernel tx timestamps with SO_TIMESTAMPING
read timestamps as recvmsg() ancillary data. The response is defined
implicitly as timespec[3].

1) define struct scm_timestamping explicitly and

2) add support for new tstamp types. On tx, scm_timestamping always
   accompanies a sock_extended_err. Define previously unused field
   ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
   define the existing behavior.

The reception path is not modified. On rx, no struct similar to
sock_extended_err is passed along with SCM_TIMESTAMPING.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:53 -07:00
Willem de Bruijn 4d276eb6a4 net: remove deprecated syststamp timestamp
The SO_TIMESTAMPING API defines three types of timestamps: software,
hardware in raw format (hwtstamp) and hardware converted to system
format (syststamp). The last has been deprecated in favor of combining
hwtstamp with a PTP clock driver. There are no active users in the
kernel.

The option was device driver dependent. If set, but without hardware
support, the correct behavior is to return zero in the relevant field
in the SCM_TIMESTAMPING ancillary message. Without device drivers
implementing the option, this field is effectively always zero.

Remove the internal plumbing to dissuage new drivers from implementing
the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
avoid breaking existing applications that request the timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-29 11:39:50 -07:00
Jan Glauber b7c0ddf5f2 net: use SYSCALL_DEFINEx for sys_recv
Make sys_recv a first class citizen by using the SYSCALL_DEFINEx
macro. Besides being cleaner this will also generate meta data
for the system call so tracing tools like ftrace or LTTng can
resolve this system call.

Signed-off-by: Jan Glauber <jan.glauber@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-16 15:15:05 -04:00
Daniel Borkmann 408eccce32 net: ptp: move PTP classifier in its own file
This commit fixes a build error reported by Fengguang, that is
triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:

  ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!

The fix is to introduce its own file for the PTP BPF classifier,
so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
it independently from each other. IXP4xx driver on ARM needs to
select it as well since it does not seem to select PTP_1588_CLOCK
or similar that would pull it in automatically.

This also allows for hiding all of the internals of the BPF PTP
program inside that file, and only exporting relevant API bits
to drivers.

This patch also adds a kdoc documentation of ptp_classify_raw()
API to make it clear that it can return PTP_CLASS_* defines. Also,
the BPF program has been translated into bpf_asm code, so that it
can be more easily read and altered (extensively documented in [1]).

In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
tools, so the commented program can simply be translated via
`./bpf_asm -c prog` where prog is a file that contains the
commented code. This makes it easily readable/verifiable and when
there's a need to change something, jump offsets etc do not need
to be replaced manually which can be very error prone. Instead,
a newly translated version via bpf_asm can simply replace the old
code. I have checked opcode diffs before/after and it's the very
same filter.

  [1] Documentation/networking/filter.txt

Fixes: 164d8c6665 ("net: ptp: do not reimplement PTP/BPF classifier")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-01 16:43:18 -04:00
David S. Miller 85dcce7a73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/r8152.c
	drivers/net/xen-netback/netback.c

Both the r8152 and netback conflicts were simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14 22:31:55 -04:00
Linus Torvalds 53611c0ce9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "I know this is a bit more than you want to see, and I've told the
  wireless folks under no uncertain terms that they must severely scale
  back the extent of the fixes they are submitting this late in the
  game.

  Anyways:

   1) vmxnet3's netpoll doesn't perform the equivalent of an ISR, which
      is the correct implementation, like it should.  Instead it does
      something like a NAPI poll operation.  This leads to crashes.

      From Neil Horman and Arnd Bergmann.

   2) Segmentation of SKBs requires proper socket orphaning of the
      fragments, otherwise we might access stale state released by the
      release callbacks.

      This is a 5 patch fix, but the initial patches are giving
      variables and such significantly clearer names such that the
      actual fix itself at the end looks trivial.

      From Michael S.  Tsirkin.

   3) TCP control block release can deadlock if invoked from a timer on
      an already "owned" socket.  Fix from Eric Dumazet.

   4) In the bridge multicast code, we must validate that the
      destination address of general queries is the link local all-nodes
      multicast address.  From Linus Lüssing.

   5) The x86 BPF JIT support for negative offsets puts the parameter
      for the helper function call in the wrong register.  Fix from
      Alexei Starovoitov.

   6) The descriptor type used for RTL_GIGA_MAC_VER_17 chips in the
      r8169 driver is incorrect.  Fix from Hayes Wang.

   7) The xen-netback driver tests skb_shinfo(skb)->gso_type bits to see
      if a packet is a GSO frame, but that's not the correct test.  It
      should use skb_is_gso(skb) instead.  Fix from Wei Liu.

   8) Negative msg->msg_namelen values should generate an error, from
      Matthew Leach.

   9) at86rf230 can deadlock because it takes the same lock from it's
      ISR and it's hard_start_xmit method, without disabling interrupts
      in the latter.  Fix from Alexander Aring.

  10) The FEC driver's restart doesn't perform operations in the correct
      order, so promiscuous settings can get lost.  Fix from Stefan
      Wahren.

  11) Fix SKB leak in SCTP cookie handling, from Daniel Borkmann.

  12) Reference count and memory leak fixes in TIPC from Ying Xue and
      Erik Hugne.

  13) Forced eviction in inet_frag_evictor() must strictly make sure all
      frags are deleted, otherwise module unload (f.e.  6lowpan) can
      crash.  Fix from Florian Westphal.

  14) Remove assumptions in AF_UNIX's use of csum_partial() (which it
      uses as a hash function), which breaks on PowerPC.  From Anton
      Blanchard.

      The main gist of the issue is that csum_partial() is defined only
      as a value that, once folded (f.e.  via csum_fold()) produces a
      correct 16-bit checksum.  It is legitimate, therefore, for
      csum_partial() to produce two different 32-bit values over the
      same data if their respective alignments are different.

  15) Fix endiannes bug in MAC address handling of ibmveth driver, also
      from Anton Blanchard.

  16) Error checks for ipv6 exthdrs offload registration are reversed,
      from Anton Nayshtut.

  17) Externally triggered ipv6 addrconf routes should count against the
      garbage collection threshold.  Fix from Sabrina Dubroca.

  18) The PCI shutdown handler added to the bnx2 driver can wedge the
      chip if it was not brought up earlier already, which in particular
      causes the firmware to shut down the PHY.  Fix from Michael Chan.

  19) Adjust the sanity WARN_ON_ONCE() in qdisc_list_add() because as
      currently coded it can and does trigger in legitimate situations.
      From Eric Dumazet.

  20) BNA driver fails to build on ARM because of a too large udelay()
      call, fix from Ben Hutchings.

  21) Fair-Queue qdisc holds locks during GFP_KERNEL allocations, fix
      from Eric Dumazet.

  22) The vlan passthrough ops added in the previous release causes a
      regression in source MAC address setting of outgoing headers in
      some circumstances.  Fix from Peter Boström"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (70 commits)
  ipv6: Avoid unnecessary temporary addresses being generated
  eth: fec: Fix lost promiscuous mode after reconnecting cable
  bonding: set correct vlan id for alb xmit path
  at86rf230: fix lockdep splats
  net/mlx4_en: Deregister multicast vxlan steering rules when going down
  vmxnet3: fix building without CONFIG_PCI_MSI
  MAINTAINERS: add networking selftests to NETWORKING
  net: socket: error on a negative msg_namelen
  MAINTAINERS: Add tools/net to NETWORKING [GENERAL]
  packet: doc: Spelling s/than/that/
  net/mlx4_core: Load the IB driver when the device supports IBoE
  net/mlx4_en: Handle vxlan steering rules for mac address changes
  net/mlx4_core: Fix wrong dump of the vxlan offloads device capability
  xen-netback: use skb_is_gso in xenvif_start_xmit
  r8169: fix the incorrect tx descriptor version
  tools/net/Makefile: Define PACKAGE to fix build problems
  x86: bpf_jit: support negative offsets
  bridge: multicast: enable snooping on general queries only
  bridge: multicast: add sanity check for general query destination
  tcp: tcp_release_cb() should release socket ownership
  ...
2014-03-13 20:38:36 -07:00
Matthew Leach dbb490b965 net: socket: error on a negative msg_namelen
When copying in a struct msghdr from the user, if the user has set the
msg_namelen parameter to a negative value it gets clamped to a valid
size due to a comparison between signed and unsigned values.

Ensure the syscall errors when the user passes in a negative value.

Signed-off-by: Matthew Leach <matthew.leach@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 16:29:24 -04:00
Al Viro 00e188ef6a sockfd_lookup_light(): switch to fdget^W^Waway from fget_light
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:41 -04:00
Yang Yingliang 3410f22ea9 socket: replace some printk with pr_*
Prefer pr_*(...) to printk(KERN_* ...).

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 18:15:10 -05:00
Yann Droneaud d73aa2867f net: handle error more gracefully in socketpair()
This patch makes socketpair() use error paths which do not
rely on heavy-weight call to sys_close(): it's better to try
to push the file descriptor to userspace before installing
the socket file to the file descriptor, so that errors are
catched earlier and being easier to handle.

Using sys_close() seems to be the exception, while writing the
file descriptor before installing it look like it's more or less
the norm: eg. except for code used in init/, error handling
involve fput() and put_unused_fd(), but not sys_close().

This make socketpair() usage of sys_close() quite unusual.
So it deserves to be replaced by the common pattern relying on
fput() and put_unused_fd() just like, for example, the one used
in pipe(2) or recvmsg(2).

Three distinct error paths are still needed since calling
fput() on file structure returned by sock_alloc_file() will
implicitly call sock_release() on the associated socket
structure.

Cc: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=1385979146-13825-1-git-send-email-ydroneaud@opteya.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:24:13 -05:00
David S. Miller 426e1fa31e Merge branch 'siocghwtstamp' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next
Ben Hutchings says:

====================
SIOCGHWTSTAMP ioctl

1. Add the SIOCGHWTSTAMP ioctl and update the timestamping
documentation.
2. Implement SIOCGHWTSTAMP in most drivers that support SIOCSHWTSTAMP.
3. Add a test program to exercise SIOC{G,S}HWTSTAMP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-05 19:45:14 -05:00
Dan Carpenter db31c55a6f net: clamp ->msg_namelen instead of returning an error
If kmsg->msg_namelen > sizeof(struct sockaddr_storage) then in the
original code that would lead to memory corruption in the kernel if you
had audit configured.  If you didn't have audit configured it was
harmless.

There are some programs such as beta versions of Ruby which use too
large of a buffer and returning an error code breaks them.  We should
clamp the ->msg_namelen value instead.

Fixes: 1661bf364a ("net: heap overflow in __audit_sockaddr()")
Reported-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Tested-by: Eric Wong <normalperson@yhbt.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-29 16:12:52 -05:00
Hannes Frederic Sowa 68c6beb373 net: add BUG_ON if kernel advertises msg_namelen > sizeof(struct sockaddr_storage)
In that case it is probable that kernel code overwrote part of the
stack. So we should bail out loudly here.

The BUG_ON may be removed in future if we are sure all protocols are
conformant.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-20 21:52:30 -05:00
Hannes Frederic Sowa f3d3342602 net: rework recvmsg handler msg_name and msg_namelen logic
This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.

This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.

Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.

Also document these changes in include/linux/net.h as suggested by David
Miller.

Changes since RFC:

Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.

With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
	msg->msg_name = NULL
".

This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.

Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.

Cc: David Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-20 21:52:30 -05:00
Ben Hutchings fd468c74bd net_tstamp: Add SIOCGHWTSTAMP ioctl to match SIOCSHWTSTAMP
SIOCSHWTSTAMP returns the real configuration to the application
using it, but there is currently no way for any other
application to find out the configuration non-destructively.
Add a new ioctl for this, making it unprivileged.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2013-11-19 19:07:21 +00:00
Ben Hutchings 590d4693fb net/compat: Merge multiple implementations of ifreq::ifr_data conversion
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2013-11-18 23:50:13 +00:00
Ben Hutchings 417c3522b3 net/compat: Fix minor information leak in siocdevprivate_ioctl()
We don't need to check that ifr_data itself is a valid user pointer,
but we should check &ifr_data is.  Thankfully the copy of ifr_name is
checked, so this can only leak a few bytes from immediately above the
user address limit.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2013-11-18 23:50:12 +00:00
Dan Carpenter 1661bf364a net: heap overflow in __audit_sockaddr()
We need to cap ->msg_namelen or it leads to a buffer overflow when we
to the memcpy() in __audit_sockaddr().  It requires CAP_AUDIT_CONTROL to
exploit this bug.

The call tree is:
___sys_recvmsg()
  move_addr_to_user()
    audit_sockaddr()
      __audit_sockaddr()

Reported-by: Jüri Aedla <juri.aedla@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 16:05:14 -04:00
Linus Torvalds 9bf12df31f Merge git://git.kvack.org/~bcrl/aio-next
Pull aio changes from Ben LaHaise:
 "First off, sorry for this pull request being late in the merge window.
  Al had raised a couple of concerns about 2 items in the series below.
  I addressed the first issue (the race introduced by Gu's use of
  mm_populate()), but he has not provided any further details on how he
  wants to rework the anon_inode.c changes (which were sent out months
  ago but have yet to be commented on).

  The bulk of the changes have been sitting in the -next tree for a few
  months, with all the issues raised being addressed"

* git://git.kvack.org/~bcrl/aio-next: (22 commits)
  aio: rcu_read_lock protection for new rcu_dereference calls
  aio: fix race in ring buffer page lookup introduced by page migration support
  aio: fix rcu sparse warnings introduced by ioctx table lookup patch
  aio: remove unnecessary debugging from aio_free_ring()
  aio: table lookup: verify ctx pointer
  staging/lustre: kiocb->ki_left is removed
  aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
  aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
  aio: convert the ioctx list to table lookup v3
  aio: double aio_max_nr in calculations
  aio: Kill ki_dtor
  aio: Kill ki_users
  aio: Kill unneeded kiocb members
  aio: Kill aio_rw_vect_retry()
  aio: Don't use ctx->tail unnecessarily
  aio: io_cancel() no longer returns the io_event
  aio: percpu ioctx refcount
  aio: percpu reqs_available
  aio: reqs_active -> reqs_available
  aio: fix build when migration is disabled
  ...
2013-09-13 10:55:58 -07:00
Mathieu Desnoyers 3ddc5b46a8 kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user()
I found the following pattern that leads in to interesting findings:

  grep -r "ret.*|=.*__put_user" *
  grep -r "ret.*|=.*__get_user" *
  grep -r "ret.*|=.*__copy" *

The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat,
since those appear in compat code, we could probably expect the kernel
addresses not to be reachable in the lower 32-bit range, so I think they
might not be exploitable.

For the "__get_user" cases, I don't think those are exploitable: the worse
that can happen is that the kernel will copy kernel memory into in-kernel
buffers, and will fail immediately afterward.

The alpha csum_partial_copy_from_user() seems to be missing the
access_ok() check entirely.  The fix is inspired from x86.  This could
lead to information leak on alpha.  I also noticed that many architectures
map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
wonder if the latter is performing the access checks on every
architectures.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:18 -07:00
Cong Wang e0d1095ae3 net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL
Eliezer renames several *ll_poll to *busy_poll, but forgets
CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too.

Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01 15:11:17 -07:00
Kent Overstreet d29c445b63 aio: Kill ki_dtor
sock_aio_dtor() is dead code - and stuff that does need to do cleanup
can simply do it before calling aio_complete().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 11:53:12 -04:00
Kent Overstreet 73a7075e3f aio: Kill aio_rw_vect_retry()
This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 11:53:12 -04:00
Eliezer Tamir 64b0dc517e net: rename busy poll socket op and globals
Rename LL_SO to BUSY_POLL_SO
Rename sysctl_net_ll_{read,poll} to sysctl_busy_{read,poll}
Fix up users of these variables.
Fix documentation for sysctl.

a patch for the socket.7  man page will follow separately,
because of limitations of my mail setup.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Eliezer Tamir 076bb0c82a net: rename include/net/ll_poll.h to include/net/busy_poll.h
Rename the file and correct all the places where it is included.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 17:08:27 -07:00
Eliezer Tamir cbf55001b2 net: rename low latency sockets functions to busy poll
Rename functions in include/net/ll_poll.h to busy wait.
Clarify documentation about expected power use increase.
Rename POLL_LL to POLL_BUSY_LOOP.
Add need_resched() testing to poll/select busy loops.

Note, that in select and poll can_busy_poll is dynamic and is
updated continuously to reflect the existence of supported
sockets with valid queue information.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-08 19:25:45 -07:00
Eliezer Tamir 2d48d67fa8 net: poll/select low latency socket support
select/poll busy-poll support.

Split sysctl value into two separate ones, one for read and one for poll.
updated Documentation/sysctl/net.txt

Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
sk_poll_ll if possible. sock_poll sets this flag in its return value
to indicate to select/poll when a socket that can busy poll is found.

When poll/select have nothing to report, call the low-level
sock_poll again until we are out of time or we find something.

Once the system call finds something, it stops setting POLL_LL, so it can
return the result to the user ASAP.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:35:52 -07:00
Eliezer Tamir dafcc4380d net: add socket option for low latency polling
adds a socket option for low latency polling.
This allows overriding the global sysctl value with a per-socket one.
Unexport sysctl_net_ll_poll since for now it's not needed in modules.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:48:14 -07:00
Eliezer Tamir eb6db62282 net: change sysctl_net_ll_poll into an unsigned int
There is no reason for sysctl_net_ll_poll to be an unsigned long.
Change it into an unsigned int.
Fix the proc handler.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-17 15:48:13 -07:00
Eliezer Tamir 0602129286 net: add low latency socket poll
Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:35 -07:00
David S. Miller 93a306aef5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge 'net' into 'net-next' to get the MSG_CMSG_COMPAT
regression fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06 23:39:26 -07:00
Andy Lutomirski a7526eb5d0 net: Unbreak compat_sys_{send,recv}msg
I broke them in this commit:

    commit 1be374a051
    Author: Andy Lutomirski <luto@amacapital.net>
    Date:   Wed May 22 14:07:44 2013 -0700

        net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msg

This patch adds __sys_sendmsg and __sys_sendmsg as common helpers that accept
MSG_CMSG_COMPAT and blocks MSG_CMSG_COMPAT at the syscall entrypoints.  It
also reverts some unnecessary checks in sys_socketcall.

Apparently I was suffering from underscore blindness the first time around.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Tested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06 11:52:14 -07:00
David S. Miller 143554ace8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Conflicts:
	net/netfilter/nf_log.c

The conflict in nf_log.c is that in 'net' we added CONFIG_PROC_FS
protection around foo_proc_entry() calls to fix a build failure,
whereas in Pablo's tree a guard if() test around a call is
remove_proc_entry() was removed.  Trivially resolved.

Pablo Neira Ayuso says:

====================
The following patchset contains the first batch of
Netfilter/IPVS updates for your net-next tree, they are:

* Three patches with improvements and code refactorization
  for nfnetlink_queue, from Florian Westphal.

* FTP helper now parses replies without brackets, as RFC1123
  recommends, from Jeff Mahoney.

* Rise a warning to tell everyone about ULOG deprecation,
  NFLOG has been already in the kernel tree for long time
  and supersedes the old logging over netlink stub, from
  myself.

* Don't panic if we fail to load netfilter core framework,
  just bail out instead, from myself.

* Add cond_resched_rcu, used by IPVS to allow rescheduling
  while walking over big hashtables, from Simon Horman.

* Change type of IPVS sysctl_sync_qlen_max sysctl to avoid
  possible overflow, from Zhang Yanfei.

* Use strlcpy instead of strncpy to skip zeroing of already
  initialized area to write the extension names in ebtables,
  from Chen Gang.

* Use already existing per-cpu notrack object from xt_CT,
  from Eric Dumazet.

* Save explicit socket lookup in xt_socket now that we have
  early demux, also from Eric Dumazet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-06 01:03:06 -07:00
Andy Lutomirski 1be374a051 net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msg
To: linux-kernel@vger.kernel.org
Cc: x86@kernel.org, trinity@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>, netdev@vger.kernel.org, "David S.
	Miller" <davem@davemloft.net>
Subject: [PATCH 5/5] net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msg

MSG_CMSG_COMPAT is (AFAIK) not intended to be part of the API --
it's a hack that steals a bit to indicate to other networking code
that a compat entry was used.  So don't allow it from a non-compat
syscall.

This prevents an oops when running this code:

int main()
{
	int s;
	struct sockaddr_in addr;
	struct msghdr *hdr;

	char *highpage = mmap((void*)(TASK_SIZE_MAX - 4096), 4096,
	                      PROT_READ | PROT_WRITE,
	                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (highpage == MAP_FAILED)
		err(1, "mmap");

	s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (s == -1)
		err(1, "socket");

        addr.sin_family = AF_INET;
        addr.sin_port = htons(1);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (connect(s, (struct sockaddr*)&addr, sizeof(addr)) != 0)
		err(1, "connect");

	void *evil = highpage + 4096 - COMPAT_MSGHDR_SIZE;
	printf("Evil address is %p\n", evil);

	if (syscall(__NR_sendmmsg, s, evil, 1, MSG_CMSG_COMPAT) < 0)
		err(1, "sendmmsg");

	return 0;
}

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28 23:55:41 -07:00
Pablo Neira Ayuso 6d11cfdba5 netfilter: don't panic on error while walking through the init path
Don't panic if we hit an error while adding the nf_log or pernet
netfilter support, just bail out.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
2013-05-23 14:22:30 +02:00
Linus Torvalds c4cc75c332 Merge git://git.infradead.org/users/eparis/audit
Pull audit changes from Eric Paris:
 "Al used to send pull requests every couple of years but he told me to
  just start pushing them to you directly.

  Our touching outside of core audit code is pretty straight forward.  A
  couple of interface changes which hit net/.  A simple argument bug
  calling audit functions in namei.c and the removal of some assembly
  branch prediction code on ppc"

* git://git.infradead.org/users/eparis/audit: (31 commits)
  audit: fix message spacing printing auid
  Revert "audit: move kaudit thread start from auditd registration to kaudit init"
  audit: vfs: fix audit_inode call in O_CREAT case of do_last
  audit: Make testing for a valid loginuid explicit.
  audit: fix event coverage of AUDIT_ANOM_LINK
  audit: use spin_lock in audit_receive_msg to process tty logging
  audit: do not needlessly take a lock in tty_audit_exit
  audit: do not needlessly take a spinlock in copy_signal
  audit: add an option to control logging of passwords with pam_tty_audit
  audit: use spin_lock_irqsave/restore in audit tty code
  helper for some session id stuff
  audit: use a consistent audit helper to log lsm information
  audit: push loginuid and sessionid processing down
  audit: stop pushing loginid, uid, sessionid as arguments
  audit: remove the old depricated kernel interface
  audit: make validity checking generic
  audit: allow checking the type of audit message in the user filter
  audit: fix build break when AUDIT_DEBUG == 2
  audit: remove duplicate export of audit_enabled
  Audit: do not print error when LSMs disabled
  ...
2013-05-11 14:29:11 -07:00
Linus Torvalds 20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Al Viro 0e2bcaae83 sock_close() couldn't have been called with NULL inode since at least 2.1.early
... if not since 0.99 or so.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-29 15:41:43 -04:00
Daniel Borkmann 6e94d1ef37 net: socket: move ktime2ts to ktime header api
Currently, ktime2ts is a small helper function that is only used in
net/socket.c. Move this helper into the ktime API as a small inline
function, so that i) it's maintained together with ktime routines,
and ii) also other files can make use of it. The function is named
ktime_to_timespec_cond() and placed into the generic part of ktime,
since we internally make use of ktime_to_timespec(). ktime_to_timespec()
itself does not check the ktime variable for zero, hence, we name
this function ktime_to_timespec_cond() for only a conditional
conversion, and adapt its users to it.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 16:39:13 -04:00
Daniel Borkmann bf84a01063 net: sock: make sock_tx_timestamp void
Currently, sock_tx_timestamp() always returns 0. The comment that
describes the sock_tx_timestamp() function wrongly says that it
returns an error when an invalid argument is passed (from commit
20d4947353, ``net: socket infrastructure for SO_TIMESTAMPING'').
Make the function void, so that we can also remove all the unneeded
if conditions that check for such a _non-existant_ error case in the
output path.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-14 15:41:49 -04:00
Chen Gang 2950fa9d32 kernel: audit: beautify code, for extern function, better to check its parameters by itself
__audit_socketcall is an extern function.
  better to check its parameters by itself.

    also can return error code, when fail (find invalid parameters).
    also use macro instead of real hard code number
    also give related comments for it.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
[eparis: fix the return value when !CONFIG_AUDIT]
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-10 13:31:12 -04:00
Linus Torvalds d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Al Viro 21d206819a get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26 02:46:11 -05:00
Anatol Pomozov 39b6525274 fs: Preserve error code in get_empty_filp(), part 2
Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
 - not enough memory for file structures
 - operation is not allowed
 - user is over its limit

Currently the function returns NULL in all cases and we loose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.

Return error through pointer. Change all callers to preserve this error code.

[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:32 -05:00
Stephen Hemminger 954b124453 ethtool: fix sparse warning
Fixes sparse complaints about dropping __user in casts.
 warning: cast removes address space of expression

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-11 14:16:26 -05:00
Paul Gortmaker a786a7c0ad wanrouter: completely decouple obsolete code from kernel.
The original suggestion to delete wanrouter started earlier
with the mainline commit f0d1b3c2bc
("net/wanrouter: Deprecate and schedule for removal") in May 2012.

More importantly, Dan Carpenter found[1] that the driver had a
fundamental breakage introduced back in 2008, with commit
7be6065b39 ("netdevice wanrouter: Convert directly reference of
netdev->priv").  So we know with certainty that the code hasn't been
used by anyone willing to at least take the effort to send an e-mail
report of breakage for at least 4 years.

This commit does a decouple of the wanrouter subsystem, by going
after the Makefile/Kconfig and similar files, so that these mainline
files that we are keeping do not have the big wanrouter file/driver
deletion commit tied into their history.

Once this commit is in place, we then can remove the obsolete cyclomx
drivers and similar that have a dependency on CONFIG_WAN_ROUTER_DRIVERS.

[1] http://www.spinics.net/lists/netdev/msg218670.html

Originally-by: Joe Perches <joe@perches.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-01-31 19:20:33 -05:00
Daniel Wagner 6a328d8c6f cgroup: net_cls: Rework update socket logic
The cgroup logic part of net_cls is very similar as the one in
net_prio. Let's stream line the net_cls logic with the net_prio one.

The net_prio update logic was changed by following commit (note there
were some changes necessary later on)

commit 406a3c638c
Author: John Fastabend <john.r.fastabend@intel.com>
Date:   Fri Jul 20 10:39:25 2012 +0000

    net: netprio_cgroup: rework update socket logic

    Instead of updating the sk_cgrp_prioidx struct field on every send
    this only updates the field when a task is moved via cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread
    to be managed. For example in the iscsi case today a user can
    put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set but as soon as data
    is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
    is updated with the kernel worker threads value which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via control group infrastructure. This allows
    the users to manage sockets that may be used with other threads.

Since classid is now updated when the task is moved between the
cgroups, we don't have to call sock_update_classid() from various
places to ensure we always using the latest classid value.

[v2: Use iterate_fd() instead of open coding]

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc:  Li Zefan <lizefan@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Joe Perches <joe@perches.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <cgroups@vger.kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-26 03:40:51 -04:00
Daniel Wagner fd9a08a7b8 cgroup: net_cls: Pass in task to sock_update_classid()
sock_update_classid() assumes that the update operation always are
applied on the current task. sock_update_classid() needs to know on
which tasks to work on in order to be able to migrate task between
cgroups using the struct cgroup_subsys attach() callback.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Joe Perches <joe@perches.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <cgroups@vger.kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-26 03:40:50 -04:00
Linus Torvalds aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Eric Dumazet e2bcabec6e net: remove sk_init() helper
It seems sk_init() has no value today and even does strange things :

# grep . /proc/sys/net/core/?mem_*
/proc/sys/net/core/rmem_default:212992
/proc/sys/net/core/rmem_max:131071
/proc/sys/net/core/wmem_default:212992
/proc/sys/net/core/wmem_max:131071

We can remove it completely.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:42:00 -04:00
Al Viro 56b31d1c9f unexport sock_map_fd(), switch to sock_alloc_file()
Both modular callers of sock_map_fd() had been buggy; sctp one leaks
descriptor and file if copy_to_user() fails, 9p one shouldn't be
exposing file in the descriptor table at all.

Switch both to sock_alloc_file(), export it, unexport sock_map_fd() and
make it static.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:50 -04:00
Al Viro 2840763051 take descriptor handling from sock_alloc_file() to callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:49 -04:00
David S. Miller b48b63a1f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/netfilter/nfnetlink_log.c
	net/netfilter/xt_LOG.c

Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.

Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-15 11:43:53 -04:00
Mikulas Patocka ed6fe9d614 Fix order of arguments to compat_put_time[spec|val]
Commit 644595f896 ("compat: Handle COMPAT_USE_64BIT_TIME in
net/socket.c") introduced a bug where the helper functions to take
either a 64-bit or compat time[spec|val] got the arguments in the wrong
order, passing the kernel stack pointer off as a user pointer (and vice
versa).

Because of the user address range check, that in turn then causes an
EFAULT due to the user pointer range checking failing for the kernel
address.  Incorrectly resuling in a failed system call for 32-bit
processes with a 64-bit kernel.

On odder architectures like HP-PA (with separate user/kernel address
spaces), it can be used read kernel memory.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-05 18:34:13 -07:00
Masatake YAMATO 600e177920 net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries
lsof reports some of socket descriptors as "can't identify protocol" like:

    [yamato@localhost]/tmp% sudo lsof | grep dbus | grep iden
    dbus-daem   652          dbus    6u     sock ... 17812 can't identify protocol
    dbus-daem   652          dbus   34u     sock ... 24689 can't identify protocol
    dbus-daem   652          dbus   42u     sock ... 24739 can't identify protocol
    dbus-daem   652          dbus   48u     sock ... 22329 can't identify protocol
    ...

lsof cannot resolve the protocol used in a socket because procfs
doesn't provide the map between inode number on sockfs and protocol
type of the socket.

For improving the situation this patch adds an extended attribute named
'system.sockprotoname' in which the protocol name for
/proc/PID/fd/SOCKET is stored. So lsof can know the protocol for a
given /proc/PID/fd/SOCKET with getxattr system call.

A few weeks ago I submitted a patch for the same purpose. The patch
was introduced /proc/net/sockfs which enumerates inodes and protocols
of all sockets alive on a system. However, it was rejected because (1)
a global lock was needed, and (2) the layout of struct socket was
changed with the patch.

This patch doesn't use any global lock; and doesn't change the layout
of any structs.

In this patch, a protocol name is stored to dentry->d_name of sockfs
when new socket is associated with a file descriptor. Before this
patch dentry->d_name was not used; it was just filled with empty
string. lsof may use an extended attribute named
'system.sockprotoname' to retrieve the value of dentry->d_name.

It is nice if we can see the protocol name with ls -l
/proc/PID/fd. However, "socket:[#INODE]", the name format returned
from sockfs_dname() was already defined. To keep the compatibility
between kernel and user land, the extended attribute is used to
prepare the value of dentry->d_name.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-04 15:52:13 -04:00
Mathias Krause 43da5f2e0d net: fix info leak in compat dev_ifconf()
The implementation of dev_ifconf() for the compat ioctl interface uses
an intermediate ifc structure allocated in userland for the duration of
the syscall. Though, it fails to initialize the padding bytes inserted
for alignment and that for leaks four bytes of kernel stack. Add an
explicit memset(0) before filling the structure to avoid the info leak.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-15 21:36:31 -07:00
John Fastabend 406a3c638c net: netprio_cgroup: rework update socket logic
Instead of updating the sk_cgrp_prioidx struct field on every send
this only updates the field when a task is moved via cgroup
infrastructure.

This allows sockets that may be used by a kernel worker thread
to be managed. For example in the iscsi case today a user can
put iscsid in a netprio cgroup and control traffic will be sent
with the correct sk_cgrp_prioidx value set but as soon as data
is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
is updated with the kernel worker threads value which is the
default case.

It seems more correct to only update the field when the user
explicitly sets it via control group infrastructure. This allows
the users to manage sockets that may be used with other threads.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 12:44:01 -07:00
Mikulas Patocka b09e786bd1 tun: fix a crash bug and a memory leak
This patch fixes a crash
tun_chr_close -> netdev_run_todo -> tun_free_netdev -> sk_release_kernel ->
sock_release -> iput(SOCK_INODE(sock))
introduced by commit 1ab5ecb90c

The problem is that this socket is embedded in struct tun_struct, it has
no inode, iput is called on invalid inode, which modifies invalid memory
and optionally causes a crash.

sock_release also decrements sockets_in_use, this causes a bug that
"sockets: used" field in /proc/*/net/sockstat keeps on decreasing when
creating and closing tun devices.

This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs
sock_release to not free the inode and not decrement sockets_in_use,
fixing both memory corruption and sockets_in_use underflow.

It should be backported to 3.3 an 3.4 stabke.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 11:21:06 -07:00
Linus Torvalds f5c101892f Merge branch 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "Contains Alex Shi's three patches to remove percpu_xxx() which overlap
  with this_cpu_xxx().  There shouldn't be any functional change."

* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: remove percpu_xxx() functions
  x86: replace percpu_xxx funcs with this_cpu_xxx
  net: replace percpu_xxx funcs with this_cpu_xxx or __this_cpu_xxx
2012-05-22 17:37:47 -07:00
Joe Perches e87cc4728f net: Convert net_ratelimit uses to net_<level>_ratelimited
Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15 13:45:03 -04:00
Alex Shi 19e8d69c54 net: replace percpu_xxx funcs with this_cpu_xxx or __this_cpu_xxx
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace
them for further code clean up.

And in preempt safe scenario, __this_cpu_xxx funcs may has a bit
better performance since __this_cpu_xxx has no redundant
preempt_enable/preempt_disable on some architectures.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-14 14:15:31 -07:00
Eric Dumazet a74e910618 net: change big iov allocations
iov of more than 8 entries are allocated in sendmsg()/recvmsg() through
sock_kmalloc()

As these allocations are temporary only and small enough, it makes sense
to use plain kmalloc() and avoid sk_omem_alloc atomic overhead.

Slightly changed fast path to be even faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mike Waychison <mikew@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 16:24:20 -04:00
Eric W. Biederman 2ca794e5e8 net sysctl: Initialize the network sysctls sooner to avoid problems.
If the netfilter code is modified to use register_net_sysctl_table the
kernel fails to boot because the per net sysctl infrasturce is not setup
soon enough.  So to avoid races call net_sysctl_init from sock_init().

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20 21:21:16 -04:00
Eric Dumazet 95c9617472 net: cleanup unsigned to unsigned int
Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:44:40 -04:00
Eric Dumazet 35f9c09fe9 tcp: tcp_sendpages() should call tcp_push() once
commit 2f53384424 (tcp: allow splice() to build full TSO packets) added
a regression for splice() calls using SPLICE_F_MORE.

We need to call tcp_flush() at the end of the last page processed in
tcp_sendpages(), or else transmits can be deferred and future sends
stall.

Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but
with different semantic.

For all sendpage() providers, its a transparent change. Only
sock_sendpage() and tcp_sendpages() can differentiate the two different
flags provided by pipe_to_sendpage()

Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-05 19:04:27 -04:00
Linus Torvalds a591afc01d Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x32 support for x86-64 from Ingo Molnar:
 "This tree introduces the X32 binary format and execution mode for x86:
  32-bit data space binaries using 64-bit instructions and 64-bit kernel
  syscalls.

  This allows applications whose working set fits into a 32 bits address
  space to make use of 64-bit instructions while using a 32-bit address
  space with shorter pointers, more compressed data structures, etc."

Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  x32: Fix alignment fail in struct compat_siginfo
  x32: Fix stupid ia32/x32 inversion in the siginfo format
  x32: Add ptrace for x32
  x32: Switch to a 64-bit clock_t
  x32: Provide separate is_ia32_task() and is_x32_task() predicates
  x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
  x86/x32: Fix the binutils auto-detect
  x32: Warn and disable rather than error if binutils too old
  x32: Only clear TIF_X32 flag once
  x32: Make sure TS_COMPAT is cleared for x32 tasks
  fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
  fs: Fix close_on_exec pointer in alloc_fdtable
  x32: Drop non-__vdso weak symbols from the x32 VDSO
  x32: Fix coding style violations in the x32 VDSO code
  x32: Add x32 VDSO support
  x32: Allow x32 to be configured
  x32: If configured, add x32 system calls to system call tables
  x32: Handle process creation
  x32: Signal-related system calls
  x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
  ...
2012-03-29 18:12:23 -07:00
Maciej Żenczykowski 43db362d3a net: get rid of some pointless casts to sockaddr
The following 4 functions:
  move_addr_to_kernel
  move_addr_to_user
  verify_iovec
  verify_compat_iovec
are always effectively called with a sockaddr_storage.

Make this explicit by changing their signature.

This removes a large number of casts from sockaddr_storage to sockaddr.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-11 19:11:22 -07:00
H. Peter Anvin 644595f896 compat: Handle COMPAT_USE_64BIT_TIME in net/socket.c
Use helper functions aware of COMPAT_USE_64BIT_TIME to write struct
timeval and struct timespec to userspace in net/socket.c.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:48:48 -08:00
Eric Dumazet cf778b00e9 net: reintroduce missing rcu_assign_pointer() calls
commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).

We miss needed barriers, even on x86, when y is not NULL.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 12:26:56 -08:00
Linus Torvalds 9753dfe19a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1958 commits)
  net: pack skb_shared_info more efficiently
  net_sched: red: split red_parms into parms and vars
  net_sched: sfq: extend limits
  cnic: Improve error recovery on bnx2x devices
  cnic: Re-init dev->stats_addr after chip reset
  net_sched: Bug in netem reordering
  bna: fix sparse warnings/errors
  bna: make ethtool_ops and strings const
  xgmac: cleanups
  net: make ethtool_ops const
  vmxnet3" make ethtool ops const
  xen-netback: make ops structs const
  virtio_net: Pass gfp flags when allocating rx buffers.
  ixgbe: FCoE: Add support for ndo_get_fcoe_hbainfo() call
  netdev: FCoE: Add new ndo_get_fcoe_hbainfo() call
  igb: reset PHY after recovering from PHY power down
  igb: add basic runtime PM support
  igb: Add support for byte queue limits.
  e1000: cleanup CE4100 MDIO registers access
  e1000: unmap ce4100_gbe_mdio_base_virt in e1000_remove
  ...
2012-01-06 17:22:09 -08:00
Linus Torvalds 07d106d0a3 vfs: fix up ENOIOCTLCMD error handling
We're doing some odd things there, which already messes up various users
(see the net/socket.c code that this removes), and it was going to add
yet more crud to the block layer because of the incorrect error code
translation.

ENOIOCTLCMD is not an error return that should be returned to user mode
from the "ioctl()" system call, but it should *not* be translated as
EINVAL ("Invalid argument").  It should be translated as ENOTTY
("Inappropriate ioctl for device").

That EINVAL confusion has apparently so permeated some code that the
block layer actually checks for it, which is sad.  We continue to do so
for now, but add a big comment about how wrong that is, and we should
remove it entirely eventually.  In the meantime, this tries to keep the
changes localized to just the EINVAL -> ENOTTY fix, and removing code
that makes it harder to do the right thing.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-05 15:40:12 -08:00
Ben Hutchings 55664f324c ethtool: Allow drivers to select RX NFC rule locations
Define special location values for RX NFC that request the driver to
select the actual rule location.  This allows for implementation on
devices that use hash-based filter lookup, whereas currently the API is
more suited to devices with TCAM lookup or linear search.

In ethtool_set_rxnfc() and the compat wrapper ethtool_ioctl(), copy
the structure back to user-space after insertion so that the actual
location is returned.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-04 14:09:10 -05:00
Neil Horman 5bc1421e34 net: add network priority cgroup infrastructure (v4)
This patch adds in the infrastructure code to create the network priority
cgroup.  The cgroup, in addition to the standard processes file creates two
control files:

1) prioidx - This is a read-only file that exports the index of this cgroup.
This is a value that is both arbitrary and unique to a cgroup in this subsystem,
and is used to index the per-device priority map

2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
<name:priority> where name is the name of a network interface and priority is
indicates the priority assigned to frames egresessing on the named interface and
originating from a pid in this cgroup

This cgroup allows for skb priority to be set prior to a root qdisc getting
selected. This is benenficial for DCB enabled systems, in that it allows for any
application to use dcb configured priorities so without application modification

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-22 15:22:23 -05:00
Johannes Berg 6e3e939f3b net: add wireless TX status socket option
The 802.1X EAPOL handshake hostapd does requires
knowing whether the frame was ack'ed by the peer.
Currently, we fudge this pretty badly by not even
transmitting the frame as a normal data frame but
injecting it with radiotap and getting the status
out of radiotap monitor as well. This is rather
complex, confuses users (mon.wlan0 presence) and
doesn't work with all hardware.

To get rid of that hack, introduce a real wifi TX
status option for data frame transmissions.

This works similar to the existing TX timestamping
in that it reflects the SKB back to the socket's
error queue with a SCM_WIFI_STATUS cmsg that has
an int indicating ACK status (0/1).

Since it is possible that at some point we will
want to have TX timestamping and wifi status in a
single errqueue SKB (there's little point in not
doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
to SO_EE_ORIGIN_TXSTATUS which can collect more
than just the timestamp; keep the old constant
as an alias of course. Currently the internal APIs
don't make that possible, but it wouldn't be hard
to split them up in a way that makes it possible.

Thanks to Neil Horman for helping me figure out
the functions that add the control messages.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-11-09 16:01:02 -05:00
David S. Miller 8decf86879 Merge branch 'master' of github.com:davem330/net
Conflicts:
	MAINTAINERS
	drivers/net/Kconfig
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/wireless/iwlwifi/iwl-pci.c
	drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
	drivers/net/wireless/rt2x00/rt2800usb.c
	drivers/net/wireless/wl12xx/main.c
2011-09-22 03:23:13 -04:00
Mathieu Desnoyers bc909d9ddb sendmmsg/sendmsg: fix unsafe user pointer access
Dereferencing a user pointer directly from kernel-space without going
through the copy_from_user family of functions is a bad idea. Two of
such usages can be found in the sendmsg code path called from sendmmsg,
added by

commit c71d8ebe7a upstream.
commit 5b47b8038f in the 3.0-stable tree.

Usages are performed through memcmp() and memcpy() directly. Fix those
by using the already copied msg_sys structure instead of the __user *msg
structure. Note that msg_sys can be set to NULL by verify_compat_iovec()
or verify_iovec(), which requires additional NULL pointer checks.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@ev0ke.net>
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
CC: Anton Blanchard <anton@samba.org>
CC: David S. Miller <davem@davemloft.net>
CC: stable <stable@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-24 19:45:03 -07:00
David S. Miller 19fd61785a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-08-07 23:20:26 -07:00
Tetsuo Handa c71d8ebe7a net: Fix security_socket_sendmsg() bypass problem.
The sendmmsg() introduced by commit 228e548e "net: Add sendmmsg socket system
call" is capable of sending to multiple different destination addresses.

SMACK is using destination's address for checking sendmsg() permission.
However, security_socket_sendmsg() is called for only once even if multiple
different destination addresses are passed to sendmmsg().

Therefore, we need to call security_socket_sendmsg() for each destination
address rather than only the first destination address.

Since calling security_socket_sendmsg() every time when only single destination
address was passed to sendmmsg() is a waste of time, omit calling
security_socket_sendmsg() unless destination address of previous datagram and
that of current datagram differs.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Anton Blanchard <anton@samba.org>
Cc: stable <stable@kernel.org> [3.0+]
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-05 03:31:03 -07:00
Anton Blanchard 98382f419f net: Cap number of elements for sendmmsg
To limit the amount of time we can spend in sendmmsg, cap the
number of elements to UIO_MAXIOV (currently 1024).

For error handling an application using sendmmsg needs to retry at
the first unsent message, so capping is simpler and requires less
application logic than returning EINVAL.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable <stable@kernel.org> [3.0+]
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-05 03:31:03 -07:00
Anton Blanchard 728ffb86f1 net: sendmmsg should only return an error if no messages were sent
sendmmsg uses a similar error return strategy as recvmmsg but it
turns out to be a confusing way to communicate errors.

The current code stores the error code away and returns it on the next
sendmmsg call. This means a call with completely valid arguments could
get an error from a previous call.

Change things so we only return an error if no datagrams could be sent.
If less than the requested number of messages were sent, the application
must retry starting at the first failed one and if the problem is
persistent the error will be returned.

This matches the behaviour of other syscalls like read/write - it
is not an error if less than the requested number of elements are sent.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable <stable@kernel.org> [3.0+]
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-05 03:31:02 -07:00
Stephen Hemminger a9b3cd7f32 rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER
When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.

Convert all rcu_assign_pointer of NULL value.

//smpl
@@ expression P; @@

- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)

// </smpl>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-02 04:29:23 -07:00
Linus Torvalds d5eab9152a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
  tg3: Remove 5719 jumbo frames and TSO blocks
  tg3: Break larger frags into 4k chunks for 5719
  tg3: Add tx BD budgeting code
  tg3: Consolidate code that calls tg3_tx_set_bd()
  tg3: Add partial fragment unmapping code
  tg3: Generalize tg3_skb_error_unmap()
  tg3: Remove short DMA check for 1st fragment
  tg3: Simplify tx bd assignments
  tg3: Reintroduce tg3_tx_ring_info
  ASIX: Use only 11 bits of header for data size
  ASIX: Simplify condition in rx_fixup()
  Fix cdc-phonet build
  bonding: reduce noise during init
  bonding: fix string comparison errors
  net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared
  net: add IFF_SKB_TX_SHARED flag to priv_flags
  net: sock_sendmsg_nosec() is static
  forcedeth: fix vlans
  gianfar: fix bug caused by 87c288c6e9
  gro: Only reset frag0 when skb can be pulled
  ...
2011-07-28 05:58:19 -07:00
Eric Dumazet 894dc24ce7 net: sock_sendmsg_nosec() is static
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-27 22:39:30 -07:00
Eric Dumazet a209dfc7b0 vfs: dont chain pipe/anon/socket on superblock s_inodes list
Workloads using pipes and sockets hit inode_sb_list_lock contention.

superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-26 12:57:09 -04:00
Linus Torvalds 06f4e926d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
  macvlan: fix panic if lowerdev in a bond
  tg3: Add braces around 5906 workaround.
  tg3: Fix NETIF_F_LOOPBACK error
  macvlan: remove one synchronize_rcu() call
  networking: NET_CLS_ROUTE4 depends on INET
  irda: Fix error propagation in ircomm_lmp_connect_response()
  irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
  irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
  be2net: Kill set but unused variable 'req' in lancer_fw_download()
  irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
  atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
  rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
  pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
  isdn: capi: Use pr_debug() instead of ifdefs.
  tg3: Update version to 3.119
  tg3: Apply rx_discards fix to 5719/5720
  ...

Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
2011-05-20 13:43:21 -07:00
David S. Miller 9cbc94eabb Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/vmxnet3/vmxnet3_ethtool.c
	net/core/dev.c
2011-05-17 17:33:11 -04:00
Anton Blanchard b9eb8b8752 net: recvmmsg: Strip MSG_WAITFORONE when calling recvmsg
recvmmsg fails on a raw socket with EINVAL. The reason for this is
packet_recvmsg checks the incoming flags:

        err = -EINVAL;
        if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT|MSG_ERRQUEUE))
                goto out;

This patch strips out MSG_WAITFORONE when calling recvmmsg which
fixes the issue.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org [2.6.34+]
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-17 15:38:57 -04:00
Lai Jiangshan 6184522024 net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu()
The rcu callback wq_free_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(wq_free_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:51:10 -07:00
Anton Blanchard 228e548e60 net: Add sendmmsg socket system call
This patch adds a multiple message send syscall and is the send
version of the existing recvmmsg syscall. This is heavily
based on the patch by Arnaldo that added recvmmsg.

I wrote a microbenchmark to test the performance gains of using
this new syscall:

http://ozlabs.org/~anton/junkcode/sendmmsg_test.c

The test was run on a ppc64 box with a 10 Gbit network card. The
benchmark can send both UDP and RAW ethernet packets.

64B UDP

batch   pkts/sec
1       804570
2       872800 (+ 8 %)
4       916556 (+14 %)
8       939712 (+17 %)
16      952688 (+18 %)
32      956448 (+19 %)
64      964800 (+20 %)

64B raw socket

batch   pkts/sec
1       1201449
2       1350028 (+12 %)
4       1461416 (+22 %)
8       1513080 (+26 %)
16      1541216 (+28 %)
32      1553440 (+29 %)
64      1557888 (+30 %)

We see a 20% improvement in throughput on UDP send and 30%
on raw socket send.

[ Add sparc syscall entries. -DaveM ]

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-05 11:10:14 -07:00
David S. Miller 1c01a80cfe Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/smsc911x.c
2011-04-11 13:44:25 -07:00