linux/kernel/bpf
John Fastabend e9db4ef6bf bpf: sockhash fix omitted bucket lock in sock_close
First, the sk_callback_lock was being used to protect both the
sock callback hooks and the psock->maps list. This got overly
convoluted after the addition of sockhash (in sockmap it made
some sense because maps and callbacks were tightly coupled), so
let's split out a specific lock for maps and only use the callback
lock for its intended purpose. This fixes a couple of cases where
we missed using the maps lock when it was in fact needed. It also
makes the code easier to follow because the locking can now sit
closer to the code it is serializing.
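
A minimal sketch of the split, assuming a psock structure along the
lines of sockmap.c (field and helper names are illustrative, not
necessarily the exact patch):

  struct smap_psock_map_entry {
      struct list_head list;
      struct sock **entry;          /* map slot this sock occupies */
  };

  struct smap_psock {
      struct sock *sock;
      refcount_t refcnt;
      struct list_head maps;        /* smap_psock_map_entry list   */
      spinlock_t maps_lock;         /* protects only psock->maps   */
      /* ... */
  };

  static void smap_list_remove(struct smap_psock *psock,
                               struct sock **entry)
  {
      struct smap_psock_map_entry *e, *tmp;

      /* maps_lock, not sk_callback_lock, serializes the list */
      spin_lock_bh(&psock->maps_lock);
      list_for_each_entry_safe(e, tmp, &psock->maps, list) {
          if (e->entry == entry) {
              list_del(&e->list);
              kfree(e);
          }
      }
      spin_unlock_bh(&psock->maps_lock);
  }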

Next, in sock_hash_delete_elem() the pattern was as follows,

  sock_hash_delete_elem()
     [...]
     spin_lock(bucket_lock)
     l = lookup_elem_raw()
     if (l)
        hlist_del_rcu()
        write_lock(sk_callback_lock)
         .... destroy psock ...
        write_unlock(sk_callback_lock)
     spin_unlock(bucket_lock)

The ordering is necessary because we only know the {p}sock after
dereferencing the hash table, which we can't do unless we have the
bucket lock held. Once we have the bucket lock and the psock element,
it is deleted from the hashmap to ensure any other path doing a lookup
will fail. Finally, the refcnt is decremented and, if zero, the psock
is destroyed.
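
A hedged sketch of that final refcnt step, using the standard
refcount_dec_and_test() idiom (the helper names follow sockmap.c but
the bodies are abbreviated and illustrative):

  static void smap_release_sock(struct smap_psock *psock,
                                struct sock *sock)
  {
      /* destroy only on the last put; a parallel path that already
       * dropped its reference cannot reach the destroy twice */
      if (refcount_dec_and_test(&psock->refcnt)) {
          smap_stop_sock(psock, sock);  /* unhook sock callbacks */
          smap_destroy_psock(psock);    /* illustrative destroy  */
      }
  }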

In parallel with the above (or when free'ing the map), a tcp close
event may trigger tcp_close(), which at the moment omits the bucket
lock altogether (oops!). The flow looks like this,

  bpf_tcp_close()
     [...]
     write_lock(sk_callback_lock)
     for each psock->maps // list of maps this sock is part of
         hlist_del_rcu(ref_hash_node);
         .... destroy psock ...
     write_unlock(sk_callback_lock)

Obviously, and as demonstrated by syzbot, this is broken because
we can have multiple threads deleting entries via hlist_del_rcu().

To fix this we might be tempted to wrap the hlist operation in a
bucket lock, but that would create a lock inversion problem. In
summary, to follow the locking rules the psock's maps list needs
the sk_callback_lock (after this patch, maps_lock), but we need the
bucket lock to do the hlist_del_rcu.
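
Schematically, the ABBA ordering a naive fix would create between
the two locks (comment-only sketch):

  /* delete path (BPF API):       close path, naively wrapped:
   *
   *   bucket_lock()                maps_lock()
   *     maps_lock()                  bucket_lock()  <-- ABBA deadlock
   *     maps_unlock()                bucket_unlock()
   *   bucket_unlock()              maps_unlock()
   */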

To resolve the lock inversion problem, pop the head of the maps list
repeatedly and remove the reference until no more are left. If a
delete happens in parallel from the BPF API that is OK as well,
because it will do a similar action: look up the sock in the map/hash,
delete it from the map/hash, and dec the refcnt. We check for this
case before doing a destroy on the psock to ensure we don't have two
threads tearing down a psock. The new logic is as follows (a sketch
of the pop helper follows the listing),

  bpf_tcp_close()
  e = psock_map_pop(psock->maps) // done with map lock
  bucket_lock() // lock hash list bucket
  l = lookup_elem_raw(head, hash, key, key_size);
  if (l) {
     // only get here if the element was not already removed
     hlist_del_rcu()
     ... destroy psock...
  }
  bucket_unlock()
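
A hedged sketch of the pop helper, reusing the illustrative
smap_psock_map_entry from above (names follow the pseudocode; the
exact patch may differ):

  static struct smap_psock_map_entry *
  psock_map_pop(struct smap_psock *psock)
  {
      struct smap_psock_map_entry *e = NULL;

      /* only maps_lock is taken here; no bucket lock is held, so
       * the ABBA inversion sketched above cannot occur */
      spin_lock_bh(&psock->maps_lock);
      if (!list_empty(&psock->maps)) {
          e = list_first_entry(&psock->maps,
                               struct smap_psock_map_entry, list);
          list_del(&e->list);
      }
      spin_unlock_bh(&psock->maps_lock);
      return e;
  }

  /* bpf_tcp_close() then drains the list: */
  while ((e = psock_map_pop(psock)) != NULL) {
      /* take the bucket lock, lookup_elem_raw(), and only if the
       * element is still present do hlist_del_rcu() + release */
  }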

And finally, for all the above to work, add the missing locking
around map operations noted above. Then add RCU annotations and use
rcu_dereference()/rcu_assign_pointer() to manage values relying on
RCU, so that the object is not free'd from sock_hash_free() while it
is being referenced in bpf_tcp_close().
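
A hedged sketch of the RCU annotation pattern being described,
building on the illustrative entry struct above (field names are
assumptions, not the exact patch):

  struct smap_psock_map_entry {
      struct list_head list;
      struct sock __rcu **entry;          /* now __rcu annotated  */
      struct htab_elem __rcu *hash_link;  /* elem in the sockhash */
  };

  /* writer side, with the bucket lock held: */
  rcu_assign_pointer(e->hash_link, l_new);

  /* reader side, under rcu_read_lock() in bpf_tcp_close(): */
  l = rcu_dereference(e->hash_link);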

Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-01 01:21:32 +02:00
arraymap.c bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info 2018-05-23 12:03:32 +02:00
bpf_lru_list.c bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4 2017-04-17 13:55:52 -04:00
bpf_lru_list.h bpf: Only set node->ref = 1 if it has not been set 2017-09-01 09:57:39 -07:00
btf.c treewide: kvzalloc() -> kvcalloc() 2018-06-12 16:19:22 -07:00
cgroup.c bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF 2018-06-26 11:28:38 +02:00
core.c bpf: undo prog rejection on read-only lock failure 2018-06-29 10:47:35 -07:00
cpumap.c xdp: introduce xdp_return_frame_rx_napi 2018-05-24 18:36:15 -07:00
devmap.c xdp: Fix handling of devmap in generic XDP 2018-06-15 23:47:15 +02:00
disasm.c bpf: Remove struct bpf_verifier_env argument from print_bpf_insn 2018-03-23 17:38:57 +01:00
disasm.h bpf: Remove struct bpf_verifier_env argument from print_bpf_insn 2018-03-23 17:38:57 +01:00
hashtab.c bpf: avoid retpoline for lookup/update/delete calls on maps 2018-06-03 07:45:37 -07:00
helpers.c bpf: implement bpf_get_current_cgroup_id() helper 2018-06-03 18:22:41 -07:00
inode.c bpf: implement dummy fops for bpf objects 2018-06-08 10:58:48 -07:00
lpm_trie.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
Makefile bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP 2018-05-03 15:55:24 -07:00
map_in_map.c bpf: Add syscall lookup support for fd array and htab 2017-06-29 13:13:25 -04:00
map_in_map.h bpf: Add syscall lookup support for fd array and htab 2017-06-29 13:13:25 -04:00
offload.c bpf: offload: allow offloaded programs to use perf event arrays 2018-05-04 23:41:03 +02:00
percpu_freelist.c bpf: fix lockdep splat 2017-11-15 19:46:32 +09:00
percpu_freelist.h bpf: introduce percpu_freelist 2016-03-08 15:28:31 -05:00
sockmap.c bpf: sockhash fix omitted bucket lock in sock_close 2018-07-01 01:21:32 +02:00
stackmap.c bpf: avoid -Wmaybe-uninitialized warning 2018-05-28 17:40:59 +02:00
syscall.c bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF 2018-06-26 11:28:38 +02:00
tnum.c bpf/verifier: improve register value range tracking with ARSH 2018-04-29 08:45:53 -07:00
verifier.c treewide: Use array_size() in vzalloc() 2018-06-12 16:19:22 -07:00
xskmap.c xsk: clean up SPDX headers 2018-05-18 16:07:02 +02:00