Merge branch 'bpf-tcp-header-opts'

Martin KaFai Lau says:

====================
The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up the opportunity for
a faster turnaround time in testing/releasing new congestion control
ideas to production environments.

The same flexibility can be extended to writing TCP header options.
It is not uncommon that people want to test new TCP header options
to improve TCP performance.  Another use case is data-centers,
which have a more controlled environment and more flexibility in
putting header options on internal traffic only.

This patch set introduces the necessary BPF logic and API to
allow a bpf program to write and parse header options.

There are also some changes to TCP, mostly to provide the needed
sk and skb info to the bpf program so it can make decisions.

Patch 9 is the main patch and has more details on the API and design.

The set includes an example which sends the max delay ack in
a BPF TCP header option, so that the receiving side can
then adjust its RTO accordingly.
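
As a rough sketch of that flow (a hedged illustration only: the option
layout, the experimental kind 254 / magic 0xeB9F, the 40ms/10ms numbers,
and the prog name below are assumptions made for this cover letter, not
the actual selftest shipped in the set):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #ifndef SOL_TCP
    #define SOL_TCP 6
    #endif

    SEC("sockops")
    int delack_hdr_opt(struct bpf_sock_ops *skops)
    {
        /* kind 254, kind-len 6, 2-byte magic 0xeB9F, 2-byte max delack (ms) */
        __u8 wr_opt[6] = { 254, 6, 0xeB, 0x9F, 0, 40 };
        /* search key: kind-len 4 = kind + len + the 2-byte magic */
        __u8 rd_opt[6] = { 254, 4, 0xeB, 0x9F, 0, 0 };
        int rto_min_us;

        switch (skops->op) {
        case BPF_SOCK_OPS_TCP_LISTEN_CB:
        case BPF_SOCK_OPS_TCP_CONNECT_CB:
            /* opt in to the write and parse callbacks */
            bpf_sock_ops_cb_flags_set(skops,
                BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG |
                BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG);
            break;
        case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
            /* reserve room before the TCP header is laid out */
            bpf_reserve_hdr_opt(skops, sizeof(wr_opt), 0);
            break;
        case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
            /* the kernel rejects a duplicate of an already written option */
            bpf_store_hdr_opt(skops, wr_opt, sizeof(wr_opt), 0);
            break;
        case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
            /* the peer's max delack arrived; keep our RTO above it */
            if (bpf_load_hdr_opt(skops, rd_opt, sizeof(rd_opt), 0) > 0) {
                rto_min_us = (rd_opt[5] + 10) * 1000;
                bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN,
                               &rto_min_us, sizeof(rto_min_us));
            }
            break;
        }
        return 1;
    }

    char _license[] SEC("license") = "GPL";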

v5:
- Move some of the comments from the git commit messages to the UAPI bpf.h
  in patch 9.

- Some variable cleanup in the tests (patch 11).

v4:
- Since bpf-next is currently closed, tag the set with RFC to keep the
  review cadence

- Separate the tcp changes into their own patches (5, 6, 7).  It is a bit
  tricky since most of the tcp changes are to call out to the bpf prog to
  write and parse the header.  The write and parse callouts were
  modularized into a few bpf_skops_* functions in v3.

  This revision (v4) tries to move those bpf_skops_* functions into separate
  TCP patches.  However, they will be half implemented to highlight
  the changes to the TCP stack, mainly:
    - when the bpf prog will be called in the TCP stack and
    - what information needs to be pumped through the TCP stack to the
      actual bpf prog callsite.

  The bpf_skops_* functions will be fully implemented in patch 9 together
  with other bpf pieces.

- Use struct_size() in patch 1 (Eric)

- Add saw_unknown to struct tcp_options_received in patch 4 (Eric)

v3:
- Add kdoc for tcp_make_synack (Jakub Kicinski)
- Add BPF_WRITE_HDR_TCP_CURRENT_MSS and BPF_WRITE_HDR_TCP_SYNACK_COOKIE
  in bpf.h to give a clearer meaning to sock_ops->args[0] when
  writing header option.
- Rename BPF_SOCK_OPS_PARSE_UNKWN_HDR_OPT_CB_FLAG
  to     BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG

v2:
- Instead of limiting the bpf prog to writing the experimental
  option (kind:254, magic:0xeB9F), this revision allows the bpf prog to
  write any TCP header option through the bpf_store_hdr_opt() helper.
  That allows different bpf-progs to write their own
  options, and the helper guarantees there is no duplication.

- Add the bpf_load_hdr_opt() helper to search for a particular option by
  kind.  Some of the get_syn logic is refactored into bpf_sock_ops_get_syn().

- Since the bpf prog is no longer limited to option (254, 0xeB9F),
  the TCP_SKB_CB(skb)->bpf_hdr_opt_off is no longer needed.
  Instead, when there is any option the kernel cannot recognize,
  the bpf prog will be called if the
  BPF_SOCK_OPS_PARSE_UNKWN_HDR_OPT_CB_FLAG is set.
  [ The "unknown_opt" is learned in tcp_parse_options() in patch 4. ]

- Add BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.
  If this flag is set, the bpf-prog will be called
  on every tcp packet received at an established sk.
  It is useful to ensure a previously written header option is
  received by the peer.
  e.g. The later test uses this on the active side during syncookie
  mode (a related sketch follows this changelog).

- The test_tcp_hdr_options.c is adjusted accordingly
  to test writing both experimental and regular TCP header options.

- The test_misc_tcp_hdr_options.c is added mainly to
  test different cases of the new helpers.

- Break up the TCP_BPF_RTO_MIN and TCP_BPF_DELACK_MAX into
  two patches.

- Directly store the tcp_hdrlen in "struct saved_syn" instead of
  going back to the tcp header to obtain it via "th->doff * 4".

- Add a new optval (==2) for setsockopt(TCP_SAVE_SYN) such
  that it will also store the mac header (patch 9).
====================
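
For the syncookie case discussed in the v2 notes above, a companion
sketch (same illustrative option, includes and license as in the
earlier sketch; it assumes matching space was reserved during
BPF_SOCK_OPS_HDR_OPT_LEN_CB):

    SEC("sockops")
    int echo_syn_opt(struct bpf_sock_ops *skops)
    {
        __u8 search[6] = { 254, 4, 0xeB, 0x9F, 0, 0 };

        if (skops->op != BPF_SOCK_OPS_WRITE_HDR_OPT_CB)
            return 1;

        /* read the peer's option straight out of the just-received
         * (or earlier saved) SYN; works even under syncookie mode
         */
        if (bpf_load_hdr_opt(skops, search, sizeof(search),
                             BPF_LOAD_HDR_OPT_TCP_SYN) > 0)
            /* found: search[] now holds the whole option;
             * echo it back in the SYNACK
             */
            bpf_store_hdr_opt(skops, search, sizeof(search), 0);

        return 1;
    }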

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov 2020-08-24 14:35:01 -07:00
commit 890f4365e4
22 changed files with 3198 additions and 62 deletions


@@ -279,6 +279,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) \
BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
/* The SOCK_OPS"_SK" macro should be used when sock_ops->sk is not a
* fullsock and its parent fullsock cannot be traced by
* sk_to_full_sk().
*
* e.g. sock_ops->sk is a request_sock and it is under syncookie mode.
* Its listener-sk is not attached to the rsk_listener.
* In this case, the caller holds the listener-sk (unlocked),
* set its sock_ops->sk to req_sk, and call this SOCK_OPS"_SK" with
* the listener-sk such that the cgroup-bpf-progs of the
* listener-sk will be run.
*
* Regardless of syncookie mode or not,
* calling bpf_setsockopt on listener-sk will not make sense anyway,
* so passing 'sock_ops->sk == req_sk' to the bpf prog is appropriate here.
*/
#define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk) \
({ \
int __ret = 0; \
if (cgroup_bpf_enabled) \
__ret = __cgroup_bpf_run_filter_sock_ops(sk, \
sock_ops, \
BPF_CGROUP_SOCK_OPS); \
__ret; \
})
#define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) \
({ \
int __ret = 0; \


@@ -1236,13 +1236,17 @@ struct bpf_sock_addr_kern {
struct bpf_sock_ops_kern {
struct sock *sk;
u32 op;
union {
u32 args[4];
u32 reply;
u32 replylong[4];
};
u32 is_fullsock;
struct sk_buff *syn_skb;
struct sk_buff *skb;
void *skb_data_end;
u8 op;
u8 is_fullsock;
u8 remaining_opt_len;
u64 temp; /* temp and everything after is not
* initialized to 0 before calling
* the BPF program. New fields that


@@ -92,6 +92,8 @@ struct tcp_options_received {
smc_ok : 1, /* SMC seen on SYN packet */
snd_wscale : 4, /* Window scaling received from sender */
rcv_wscale : 4; /* Window scaling to send to receiver */
u8 saw_unknown:1, /* Received unknown option */
unused:7;
u8 num_sacks; /* Number of SACK blocks */
u16 user_mss; /* mss requested by user in ioctl */
u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
@@ -237,14 +239,13 @@ struct tcp_sock {
repair : 1,
frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */
u8 repair_queue;
u8 syn_data:1, /* SYN includes data */
u8 save_syn:2, /* Save headers of SYN packet */
syn_data:1, /* SYN includes data */
syn_fastopen:1, /* SYN includes Fast Open option */
syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
syn_fastopen_ch:1, /* Active TFO re-enabling probe */
syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
save_syn:1, /* Save headers of SYN packet */
is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
syn_smc:1; /* SYN includes SMC */
is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
u32 tlp_high_seq; /* snd_nxt at the time of TLP */
u32 tcp_tx_delay; /* delay (in usec) added to TX packets */
@@ -391,6 +392,9 @@ struct tcp_sock {
#if IS_ENABLED(CONFIG_MPTCP)
bool is_mptcp;
#endif
#if IS_ENABLED(CONFIG_SMC)
bool syn_smc; /* SYN includes SMC */
#endif
#ifdef CONFIG_TCP_MD5SIG
/* TCP AF-Specific parts; only used by MD5 Signature support so far */
@@ -406,7 +410,7 @@ struct tcp_sock {
* socket. Used to retransmit SYNACKs etc.
*/
struct request_sock __rcu *fastopen_rsk;
u32 *saved_syn;
struct saved_syn *saved_syn;
};
enum tsq_enum {
@@ -484,6 +488,12 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp)
tp->saved_syn = NULL;
}
static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn)
{
return saved_syn->mac_hdrlen + saved_syn->network_hdrlen +
saved_syn->tcp_hdrlen;
}
struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk,
const struct sk_buff *orig_skb);


@@ -86,6 +86,8 @@ struct inet_connection_sock {
struct timer_list icsk_retransmit_timer;
struct timer_list icsk_delack_timer;
__u32 icsk_rto;
__u32 icsk_rto_min;
__u32 icsk_delack_max;
__u32 icsk_pmtu_cookie;
const struct tcp_congestion_ops *icsk_ca_ops;
const struct inet_connection_sock_af_ops *icsk_af_ops;


@@ -41,6 +41,13 @@ struct request_sock_ops {
int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
struct saved_syn {
u32 mac_hdrlen;
u32 network_hdrlen;
u32 tcp_hdrlen;
u8 data[];
};
/* struct request_sock - mini sock to represent a connection request
*/
struct request_sock {
@@ -60,7 +67,7 @@ struct request_sock {
struct timer_list rsk_timer;
const struct request_sock_ops *rsk_ops;
struct sock *sk;
u32 *saved_syn;
struct saved_syn *saved_syn;
u32 secid;
u32 peer_secid;
};


@@ -394,7 +394,7 @@ void tcp_metrics_init(void);
bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
void tcp_close(struct sock *sk, long timeout);
void tcp_init_sock(struct sock *sk);
void tcp_init_transfer(struct sock *sk, int bpf_op);
void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb);
__poll_t tcp_poll(struct file *file, struct socket *sock,
struct poll_table_struct *wait);
int tcp_getsockopt(struct sock *sk, int level, int optname,
@@ -455,7 +455,8 @@ enum tcp_synack_type {
struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
struct request_sock *req,
struct tcp_fastopen_cookie *foc,
enum tcp_synack_type synack_type);
enum tcp_synack_type synack_type,
struct sk_buff *syn_skb);
int tcp_disconnect(struct sock *sk, int flags);
void tcp_finish_connect(struct sock *sk, struct sk_buff *skb);
@@ -699,7 +700,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
static inline u32 tcp_rto_min(struct sock *sk)
{
const struct dst_entry *dst = __sk_dst_get(sk);
u32 rto_min = TCP_RTO_MIN;
u32 rto_min = inet_csk(sk)->icsk_rto_min;
if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
@@ -2035,7 +2036,8 @@ struct tcp_request_sock_ops {
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
struct flowi *fl, struct request_sock *req,
struct tcp_fastopen_cookie *foc,
enum tcp_synack_type synack_type);
enum tcp_synack_type synack_type,
struct sk_buff *syn_skb);
};
extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
@@ -2233,6 +2235,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
struct msghdr *msg, int len, int flags);
#endif /* CONFIG_NET_SOCK_MSG */
#ifdef CONFIG_CGROUP_BPF
/* Copy the listen sk's HDR_OPT_CB flags to its child.
*
* During 3-Way-HandShake, the synack is usually sent from
* the listen sk with the HDR_OPT_CB flags set so that
* bpf-prog will be called to write the BPF hdr option.
*
* In fastopen, the child sk is used to send synack instead
* of the listen sk. Thus, inheriting the HDR_OPT_CB flags
* from the listen sk gives the bpf-prog a chance to write
* BPF hdr option in the synack pkt during fastopen.
*
* Both fastopen and non-fastopen child will inherit the
* HDR_OPT_CB flags to keep the bpf-prog having a consistent
* behavior when deciding to clear this cb flags (or not)
* during the PASSIVE_ESTABLISHED_CB.
*
* In the future, other cb flags could be inherited here also.
*/
static inline void bpf_skops_init_child(const struct sock *sk,
struct sock *child)
{
tcp_sk(child)->bpf_sock_ops_cb_flags =
tcp_sk(sk)->bpf_sock_ops_cb_flags &
(BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG |
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
}
static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
struct sk_buff *skb,
unsigned int end_offset)
{
skops->skb = skb;
skops->skb_data_end = skb->data + end_offset;
}
#else
static inline void bpf_skops_init_child(const struct sock *sk,
struct sock *child)
{
}
static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
struct sk_buff *skb,
unsigned int end_offset)
{
}
#endif
/* Call BPF_SOCK_OPS program that returns an int. If the return value
* is < 0, then the BPF op failed (for example if the loaded BPF
* program does not support the chosen operation or there is no BPF


@@ -3395,6 +3395,120 @@ union bpf_attr {
* A non-negative value equal to or less than *size* on success,
* or a negative error in case of failure.
*
* long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags)
* Description
* Load header option. Support reading a particular TCP header
* option for bpf program (BPF_PROG_TYPE_SOCK_OPS).
*
* If *flags* is 0, it will search the option from the
* sock_ops->skb_data. The comment in "struct bpf_sock_ops"
* has details on what skb_data contains under different
* sock_ops->op.
*
* The first byte of the *searchby_res* specifies the
* kind that it wants to search.
*
* If the searching kind is an experimental kind
* (i.e. 253 or 254 according to RFC6994), it also
* needs to specify the "magic" which is either
* 2 bytes or 4 bytes. It then also needs to
* specify the size of the magic by using
* the 2nd byte which is "kind-length" of a TCP
* header option and the "kind-length" also
* includes the first 2 bytes "kind" and "kind-length"
* itself as a normal TCP header option also does.
*
* For example, to search experimental kind 254 with
* 2 byte magic 0xeB9F, the searchby_res should be
* [ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ].
*
* To search for the standard window scale option (3),
* the searchby_res should be [ 3, 0, 0, .... 0 ].
* Note, kind-length must be 0 for regular option.
*
* Searching for No-Op (0) and End-of-Option-List (1) are
* not supported.
*
* *len* must be at least 2 bytes which is the minimal size
* of a header option.
*
* Supported flags:
* * **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the
* saved_syn packet or the just-received syn packet.
*
* Return
* >0 when found, the header option is copied to *searchby_res*.
* The return value is the total length copied.
*
* **-EINVAL** If param is invalid
*
* **-ENOMSG** The option is not found
*
* **-ENOENT** No syn packet available when
* **BPF_LOAD_HDR_OPT_TCP_SYN** is used
*
* **-ENOSPC** Not enough space. Only *len* number of
* bytes are copied.
*
* **-EFAULT** Cannot parse the header options in the packet
*
* **-EPERM** This helper cannot be used under the
* current sock_ops->op.
*
* long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags)
* Description
* Store header option. The data will be copied
* from buffer *from* with length *len* to the TCP header.
*
* The buffer *from* should have the whole option that
* includes the kind, kind-length, and the actual
* option data. The *len* must be at least kind-length
* long. The kind-length does not have to be 4 byte
* aligned. The kernel will take care of the padding
* and setting the 4 bytes aligned value to th->doff.
*
* This helper will check for duplicated option
* by searching the same option in the outgoing skb.
*
* This helper can only be called during
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*
* Return
* 0 on success, or negative error in case of failure:
*
* **-EINVAL** If param is invalid
*
* **-ENOSPC** Not enough space in the header.
* Nothing has been written
*
* **-EEXIST** The option has already existed
*
* **-EFAULT** Cannot parse the existing header options
*
* **-EPERM** This helper cannot be used under the
* current sock_ops->op.
*
* long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
* Description
* Reserve *len* bytes for the bpf header option. The
* space will be used by bpf_store_hdr_opt() later in
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*
* If bpf_reserve_hdr_opt() is called multiple times,
* the total number of bytes will be reserved.
*
* This helper can only be called during
* BPF_SOCK_OPS_HDR_OPT_LEN_CB.
*
* Return
* 0 on success, or negative error in case of failure:
*
* **-EINVAL** if param is invalid
*
* **-ENOSPC** Not enough space in the header.
*
* **-EPERM** This helper cannot be used under the
* current sock_ops->op.
*/
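
/* For illustration, a hedged sketch of the searchby_res layout just
 * described (assumed sockops prog skeleton; includes omitted):
 */
SEC("sockops")
int find_wscale(struct bpf_sock_ops *skops)
{
    /* window scale: kind 3, kind-len 0 for a regular option */
    __u8 ws[4] = { 3, 0, 0, 0 };
    __u8 peer_wscale = 0;

    /* search the saved/just-received SYN instead of skb_data */
    if (bpf_load_hdr_opt(skops, ws, sizeof(ws),
                         BPF_LOAD_HDR_OPT_TCP_SYN) == 3)
        peer_wscale = ws[2]; /* copied back as [kind 3, len 3, shift] */

    return 1;
}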
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -3539,6 +3653,9 @@ union bpf_attr {
FN(skc_to_tcp_request_sock), \
FN(skc_to_udp6_sock), \
FN(get_task_stack), \
FN(load_hdr_opt), \
FN(store_hdr_opt), \
FN(reserve_hdr_opt),
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4165,6 +4282,36 @@ struct bpf_sock_ops {
__u64 bytes_received;
__u64 bytes_acked;
__bpf_md_ptr(struct bpf_sock *, sk);
/* [skb_data, skb_data_end) covers the whole TCP header.
*
* BPF_SOCK_OPS_PARSE_HDR_OPT_CB: The packet received
* BPF_SOCK_OPS_HDR_OPT_LEN_CB: Not useful because the
* header has not been written.
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB: The header and options have
* been written so far.
* BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: The SYNACK that concludes
* the 3WHS.
* BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
* the 3WHS.
*
* bpf_load_hdr_opt() can also be used to read a particular option.
*/
__bpf_md_ptr(void *, skb_data);
__bpf_md_ptr(void *, skb_data_end);
__u32 skb_len; /* The total length of a packet.
* It includes the header, options,
* and payload.
*/
__u32 skb_tcp_flags; /* tcp_flags of the header. It provides
* an easy way to check for tcp_flags
* without parsing skb_data.
*
* In particular, the skb_tcp_flags
* will still be available in
* BPF_SOCK_OPS_HDR_OPT_LEN_CB even though
* the outgoing header has not
* been written yet.
*/
};
/* Definitions for bpf_sock_ops_cb_flags */
@@ -4173,8 +4320,51 @@ enum {
BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1),
BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2),
BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3),
/* Call bpf for all received TCP headers. The bpf prog will be
* called under sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB
*
* Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB
* for the header option related helpers that will be useful
* to the bpf programs.
*
* It could be used at the client/active side (i.e. connect() side)
* when the server told it that the server was in syncookie
* mode and required the active side to resend the bpf-written
* options. The active side can keep writing the bpf-options until
* it received a valid packet from the server side to confirm
* the earlier packet (and options) has been received. The later
* example patch is using it like this at the active side when the
* server is in syncookie mode.
*
* The bpf prog will usually turn this off in the common cases.
*/
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4),
/* Call bpf when kernel has received a header option that
* the kernel cannot handle. The bpf prog will be called under
* sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB.
*
* Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB
* for the header option related helpers that will be useful
* to the bpf programs.
*/
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5),
/* Call bpf when the kernel is writing header options for the
* outgoing packet. The bpf prog will first be called
* to reserve space in a skb under
* sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB. Then
* the bpf prog will be called to write the header option(s)
* under sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*
* Please refer to the comment in BPF_SOCK_OPS_HDR_OPT_LEN_CB
* and BPF_SOCK_OPS_WRITE_HDR_OPT_CB for the header option
* related helpers that will be useful to the bpf programs.
*
* The kernel gets its chance to reserve space and write
* options first before the BPF program does.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
/* Mask of all currently supported cb flags */
BPF_SOCK_OPS_ALL_CB_FLAGS = 0xF,
BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
};
/* List of known BPF sock_ops operators.
@@ -4230,6 +4420,63 @@ enum {
*/
BPF_SOCK_OPS_RTT_CB, /* Called on every RTT.
*/
BPF_SOCK_OPS_PARSE_HDR_OPT_CB, /* Parse the header option.
* It will be called to handle
* the packets received at
* an already established
* connection.
*
* sock_ops->skb_data:
* Referring to the received skb.
* It covers the TCP header only.
*
* bpf_load_hdr_opt() can also
* be used to search for a
* particular option.
*/
BPF_SOCK_OPS_HDR_OPT_LEN_CB, /* Reserve space for writing the
* header option later in
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
* Arg1: bool want_cookie. (in
* writing SYNACK only)
*
* sock_ops->skb_data:
* Not available because no header has
* been written yet.
*
* sock_ops->skb_tcp_flags:
* The tcp_flags of the
* outgoing skb. (e.g. SYN, ACK, FIN).
*
* bpf_reserve_hdr_opt() should
* be used to reserve space.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB, /* Write the header options
* Arg1: bool want_cookie. (in
* writing SYNACK only)
*
* sock_ops->skb_data:
* Referring to the outgoing skb.
* It covers the TCP header
* that has already been written
* by the kernel and the
* earlier bpf-progs.
*
* sock_ops->skb_tcp_flags:
* The tcp_flags of the outgoing
* skb. (e.g. SYN, ACK, FIN).
*
* bpf_store_hdr_opt() should
* be used to write the
* option.
*
* bpf_load_hdr_opt() can also
* be used to search for a
* particular option that
* has already been written
* by the kernel or the
* earlier bpf-progs.
*/
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -4257,6 +4504,63 @@ enum {
enum {
TCP_BPF_IW = 1001, /* Set TCP initial congestion window */
TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */
TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */
TCP_BPF_RTO_MIN = 1004, /* Min delay ack in usecs */
/* Copy the SYN pkt to optval
*
* BPF_PROG_TYPE_SOCK_OPS only. It is similar to the
* bpf_getsockopt(TCP_SAVED_SYN), but it is not limited
* to getting it from the saved_syn. It can either get the
* syn packet from:
*
* 1. the just-received SYN packet (only available when writing the
* SYNACK). It will be useful when it is not necessary to
* save the SYN packet for later use. It is also the only way
* to get the SYN during syncookie mode because the syn
* packet cannot be saved during syncookie.
*
* OR
*
* 2. the earlier saved syn which was done by
* bpf_setsockopt(TCP_SAVE_SYN).
*
* The bpf_getsockopt(TCP_BPF_SYN*) option will hide where the
* SYN packet is obtained.
*
* If the bpf-prog does not need the IP[46] header, the
* bpf-prog can avoid parsing the IP header by using
* TCP_BPF_SYN. Otherwise, the bpf-prog can get both
* IP[46] and TCP header by using TCP_BPF_SYN_IP.
*
* >0: Total number of bytes copied
* -ENOSPC: Not enough space in optval. Only optlen number of
* bytes is copied.
* -ENOENT: The SYN skb is not available now and the earlier SYN pkt
* is not saved by setsockopt(TCP_SAVE_SYN).
*/
TCP_BPF_SYN = 1005, /* Copy the TCP header */
TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
};
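
/* For illustration, a hedged sketch of copying the SYN from a sockops
 * prog (the buffer size is an assumption; includes omitted):
 */
SEC("sockops")
int read_syn(struct bpf_sock_ops *skops)
{
    char syn[256];
    int len;

    /* available while writing the SYNACK, even in syncookie mode */
    len = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP,
                         syn, sizeof(syn));
    /* len > 0: syn[] now holds the SYN's IP[46] + TCP header;
     * len == -ENOENT: no syn skb here and nothing was saved earlier
     */
    return len > 0;
}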
enum {
BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0),
};
/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*/
enum {
BPF_WRITE_HDR_TCP_CURRENT_MSS = 1, /* Kernel is finding the
* total option spaces
* required for an established
* sk in order to calculate the
* MSS. No skb is actually
* sent.
*/
BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2, /* Kernel is in syncookie mode
* when sending a SYN.
*/
};
struct bpf_perf_event_value {


@@ -4459,6 +4459,7 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
} else {
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
unsigned long timeout;
if (optlen != sizeof(int))
return -EINVAL;
@@ -4480,6 +4481,20 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
tp->snd_ssthresh = val;
}
break;
case TCP_BPF_DELACK_MAX:
timeout = usecs_to_jiffies(val);
if (timeout > TCP_DELACK_MAX ||
timeout < TCP_TIMEOUT_MIN)
return -EINVAL;
inet_csk(sk)->icsk_delack_max = timeout;
break;
case TCP_BPF_RTO_MIN:
timeout = usecs_to_jiffies(val);
if (timeout > TCP_RTO_MIN ||
timeout < TCP_TIMEOUT_MIN)
return -EINVAL;
inet_csk(sk)->icsk_rto_min = timeout;
break;
case TCP_SAVE_SYN:
if (val < 0 || val > 1)
ret = -EINVAL;
@@ -4550,9 +4565,9 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname,
tp = tcp_sk(sk);
if (optlen <= 0 || !tp->saved_syn ||
optlen > tp->saved_syn[0])
optlen > tcp_saved_syn_len(tp->saved_syn))
goto err_clear;
memcpy(optval, tp->saved_syn + 1, optlen);
memcpy(optval, tp->saved_syn->data, optlen);
break;
default:
goto err_clear;
@@ -4654,9 +4669,99 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
.arg5_type = ARG_CONST_SIZE,
};
static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
int optname, const u8 **start)
{
struct sk_buff *syn_skb = bpf_sock->syn_skb;
const u8 *hdr_start;
int ret;
if (syn_skb) {
/* sk is a request_sock here */
if (optname == TCP_BPF_SYN) {
hdr_start = syn_skb->data;
ret = tcp_hdrlen(syn_skb);
} else if (optname == TCP_BPF_SYN_IP) {
hdr_start = skb_network_header(syn_skb);
ret = skb_network_header_len(syn_skb) +
tcp_hdrlen(syn_skb);
} else {
/* optname == TCP_BPF_SYN_MAC */
hdr_start = skb_mac_header(syn_skb);
ret = skb_mac_header_len(syn_skb) +
skb_network_header_len(syn_skb) +
tcp_hdrlen(syn_skb);
}
} else {
struct sock *sk = bpf_sock->sk;
struct saved_syn *saved_syn;
if (sk->sk_state == TCP_NEW_SYN_RECV)
/* synack retransmit. bpf_sock->syn_skb will
* not be available. It has to resort to
* saved_syn (if it is saved).
*/
saved_syn = inet_reqsk(sk)->saved_syn;
else
saved_syn = tcp_sk(sk)->saved_syn;
if (!saved_syn)
return -ENOENT;
if (optname == TCP_BPF_SYN) {
hdr_start = saved_syn->data +
saved_syn->mac_hdrlen +
saved_syn->network_hdrlen;
ret = saved_syn->tcp_hdrlen;
} else if (optname == TCP_BPF_SYN_IP) {
hdr_start = saved_syn->data +
saved_syn->mac_hdrlen;
ret = saved_syn->network_hdrlen +
saved_syn->tcp_hdrlen;
} else {
/* optname == TCP_BPF_SYN_MAC */
/* TCP_SAVE_SYN may not have saved the mac hdr */
if (!saved_syn->mac_hdrlen)
return -ENOENT;
hdr_start = saved_syn->data;
ret = saved_syn->mac_hdrlen +
saved_syn->network_hdrlen +
saved_syn->tcp_hdrlen;
}
}
*start = hdr_start;
return ret;
}
BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
int, level, int, optname, char *, optval, int, optlen)
{
if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
int ret, copy_len = 0;
const u8 *start;
ret = bpf_sock_ops_get_syn(bpf_sock, optname, &start);
if (ret > 0) {
copy_len = ret;
if (optlen < copy_len) {
copy_len = optlen;
ret = -ENOSPC;
}
memcpy(optval, start, copy_len);
}
/* Zero out unused buffer at the end */
memset(optval + copy_len, 0, optlen - copy_len);
return ret;
}
return _bpf_getsockopt(bpf_sock->sk, level, optname, optval, optlen);
}
@@ -6150,6 +6255,232 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
.arg3_type = ARG_ANYTHING,
};
static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend,
u8 search_kind, const u8 *magic,
u8 magic_len, bool *eol)
{
u8 kind, kind_len;
*eol = false;
while (op < opend) {
kind = op[0];
if (kind == TCPOPT_EOL) {
*eol = true;
return ERR_PTR(-ENOMSG);
} else if (kind == TCPOPT_NOP) {
op++;
continue;
}
if (opend - op < 2 || opend - op < op[1] || op[1] < 2)
/* Something is wrong in the received header.
* Follow the TCP stack's tcp_parse_options()
* and just bail here.
*/
return ERR_PTR(-EFAULT);
kind_len = op[1];
if (search_kind == kind) {
if (!magic_len)
return op;
if (magic_len > kind_len - 2)
return ERR_PTR(-ENOMSG);
if (!memcmp(&op[2], magic, magic_len))
return op;
}
op += kind_len;
}
return ERR_PTR(-ENOMSG);
}
BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
void *, search_res, u32, len, u64, flags)
{
bool eol, load_syn = flags & BPF_LOAD_HDR_OPT_TCP_SYN;
const u8 *op, *opend, *magic, *search = search_res;
u8 search_kind, search_len, copy_len, magic_len;
int ret;
/* 2 byte is the minimal option len except TCPOPT_NOP and
* TCPOPT_EOL which are useless for the bpf prog to learn
* and this helper disallows loading them also.
*/
if (len < 2 || flags & ~BPF_LOAD_HDR_OPT_TCP_SYN)
return -EINVAL;
search_kind = search[0];
search_len = search[1];
if (search_len > len || search_kind == TCPOPT_NOP ||
search_kind == TCPOPT_EOL)
return -EINVAL;
if (search_kind == TCPOPT_EXP || search_kind == 253) {
/* 16 or 32 bit magic. +2 for kind and kind length */
if (search_len != 4 && search_len != 6)
return -EINVAL;
magic = &search[2];
magic_len = search_len - 2;
} else {
if (search_len)
return -EINVAL;
magic = NULL;
magic_len = 0;
}
if (load_syn) {
ret = bpf_sock_ops_get_syn(bpf_sock, TCP_BPF_SYN, &op);
if (ret < 0)
return ret;
opend = op + ret;
op += sizeof(struct tcphdr);
} else {
if (!bpf_sock->skb ||
bpf_sock->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB)
/* This bpf_sock->op cannot call this helper */
return -EPERM;
opend = bpf_sock->skb_data_end;
op = bpf_sock->skb->data + sizeof(struct tcphdr);
}
op = bpf_search_tcp_opt(op, opend, search_kind, magic, magic_len,
&eol);
if (IS_ERR(op))
return PTR_ERR(op);
copy_len = op[1];
ret = copy_len;
if (copy_len > len) {
ret = -ENOSPC;
copy_len = len;
}
memcpy(search_res, op, copy_len);
return ret;
}
static const struct bpf_func_proto bpf_sock_ops_load_hdr_opt_proto = {
.func = bpf_sock_ops_load_hdr_opt,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_MEM,
.arg3_type = ARG_CONST_SIZE,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_sock_ops_store_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
const void *, from, u32, len, u64, flags)
{
u8 new_kind, new_kind_len, magic_len = 0, *opend;
const u8 *op, *new_op, *magic = NULL;
struct sk_buff *skb;
bool eol;
if (bpf_sock->op != BPF_SOCK_OPS_WRITE_HDR_OPT_CB)
return -EPERM;
if (len < 2 || flags)
return -EINVAL;
new_op = from;
new_kind = new_op[0];
new_kind_len = new_op[1];
if (new_kind_len > len || new_kind == TCPOPT_NOP ||
new_kind == TCPOPT_EOL)
return -EINVAL;
if (new_kind_len > bpf_sock->remaining_opt_len)
return -ENOSPC;
/* 253 is another experimental kind */
if (new_kind == TCPOPT_EXP || new_kind == 253) {
if (new_kind_len < 4)
return -EINVAL;
/* Match for the 2 byte magic also.
* RFC 6994: the magic could be 2 or 4 bytes.
* Hence, matching by 2 byte only is on the
* conservative side but it is the right
* thing to do for the 'search-for-duplication'
* purpose.
*/
magic = &new_op[2];
magic_len = 2;
}
/* Check for duplication */
skb = bpf_sock->skb;
op = skb->data + sizeof(struct tcphdr);
opend = bpf_sock->skb_data_end;
op = bpf_search_tcp_opt(op, opend, new_kind, magic, magic_len,
&eol);
if (!IS_ERR(op))
return -EEXIST;
if (PTR_ERR(op) != -ENOMSG)
return PTR_ERR(op);
if (eol)
/* The option has been ended. Treat it as no more
* header option can be written.
*/
return -ENOSPC;
/* No duplication found. Store the header option. */
memcpy(opend, from, new_kind_len);
bpf_sock->remaining_opt_len -= new_kind_len;
bpf_sock->skb_data_end += new_kind_len;
return 0;
}
static const struct bpf_func_proto bpf_sock_ops_store_hdr_opt_proto = {
.func = bpf_sock_ops_store_hdr_opt,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_MEM,
.arg3_type = ARG_CONST_SIZE,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_3(bpf_sock_ops_reserve_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
u32, len, u64, flags)
{
if (bpf_sock->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB)
return -EPERM;
if (flags || len < 2)
return -EINVAL;
if (len > bpf_sock->remaining_opt_len)
return -ENOSPC;
bpf_sock->remaining_opt_len -= len;
return 0;
}
static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = {
.func = bpf_sock_ops_reserve_hdr_opt,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_ANYTHING,
.arg3_type = ARG_ANYTHING,
};
#endif /* CONFIG_INET */
bool bpf_helper_changes_pkt_data(void *func)
@@ -6178,6 +6509,9 @@ bool bpf_helper_changes_pkt_data(void *func)
func == bpf_lwt_seg6_store_bytes ||
func == bpf_lwt_seg6_adjust_srh ||
func == bpf_lwt_seg6_action ||
#endif
#ifdef CONFIG_INET
func == bpf_sock_ops_store_hdr_opt ||
#endif
func == bpf_lwt_in_push_encap ||
func == bpf_lwt_xmit_push_encap)
@@ -6550,6 +6884,12 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_sk_storage_delete:
return &bpf_sk_storage_delete_proto;
#ifdef CONFIG_INET
case BPF_FUNC_load_hdr_opt:
return &bpf_sock_ops_load_hdr_opt_proto;
case BPF_FUNC_store_hdr_opt:
return &bpf_sock_ops_store_hdr_opt_proto;
case BPF_FUNC_reserve_hdr_opt:
return &bpf_sock_ops_reserve_hdr_opt_proto;
case BPF_FUNC_tcp_sock:
return &bpf_tcp_sock_proto;
#endif /* CONFIG_INET */
@@ -7349,6 +7689,20 @@ static bool sock_ops_is_valid_access(int off, int size,
return false;
info->reg_type = PTR_TO_SOCKET_OR_NULL;
break;
case offsetof(struct bpf_sock_ops, skb_data):
if (size != sizeof(__u64))
return false;
info->reg_type = PTR_TO_PACKET;
break;
case offsetof(struct bpf_sock_ops, skb_data_end):
if (size != sizeof(__u64))
return false;
info->reg_type = PTR_TO_PACKET_END;
break;
case offsetof(struct bpf_sock_ops, skb_tcp_flags):
bpf_ctx_record_field_size(info, size_default);
return bpf_ctx_narrow_access_ok(off, size,
size_default);
default:
if (size != size_default)
return false;
@@ -8450,17 +8804,22 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
return insn - insn_buf;
switch (si->off) {
case offsetof(struct bpf_sock_ops, op) ...
case offsetof(struct bpf_sock_ops, op):
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
op),
si->dst_reg, si->src_reg,
offsetof(struct bpf_sock_ops_kern, op));
break;
case offsetof(struct bpf_sock_ops, replylong[0]) ...
offsetof(struct bpf_sock_ops, replylong[3]):
BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, op) !=
sizeof_field(struct bpf_sock_ops_kern, op));
BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, reply) !=
sizeof_field(struct bpf_sock_ops_kern, reply));
BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, replylong) !=
sizeof_field(struct bpf_sock_ops_kern, replylong));
off = si->off;
off -= offsetof(struct bpf_sock_ops, op);
off += offsetof(struct bpf_sock_ops_kern, op);
off -= offsetof(struct bpf_sock_ops, replylong[0]);
off += offsetof(struct bpf_sock_ops_kern, replylong[0]);
if (type == BPF_WRITE)
*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
off);
@@ -8681,6 +9040,49 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
case offsetof(struct bpf_sock_ops, sk):
SOCK_OPS_GET_SK();
break;
case offsetof(struct bpf_sock_ops, skb_data_end):
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
skb_data_end),
si->dst_reg, si->src_reg,
offsetof(struct bpf_sock_ops_kern,
skb_data_end));
break;
case offsetof(struct bpf_sock_ops, skb_data):
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
skb),
si->dst_reg, si->src_reg,
offsetof(struct bpf_sock_ops_kern,
skb));
*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data),
si->dst_reg, si->dst_reg,
offsetof(struct sk_buff, data));
break;
case offsetof(struct bpf_sock_ops, skb_len):
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
skb),
si->dst_reg, si->src_reg,
offsetof(struct bpf_sock_ops_kern,
skb));
*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, len),
si->dst_reg, si->dst_reg,
offsetof(struct sk_buff, len));
break;
case offsetof(struct bpf_sock_ops, skb_tcp_flags):
off = offsetof(struct sk_buff, cb);
off += offsetof(struct tcp_skb_cb, tcp_flags);
*target_size = sizeof_field(struct tcp_skb_cb, tcp_flags);
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
skb),
si->dst_reg, si->src_reg,
offsetof(struct bpf_sock_ops_kern,
skb));
*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_skb_cb,
tcp_flags),
si->dst_reg, si->dst_reg, off);
break;
}
return insn - insn_buf;
}


@@ -418,6 +418,8 @@ void tcp_init_sock(struct sock *sk)
INIT_LIST_HEAD(&tp->tsorted_sent_queue);
icsk->icsk_rto = TCP_TIMEOUT_INIT;
icsk->icsk_rto_min = TCP_RTO_MIN;
icsk->icsk_delack_max = TCP_DELACK_MAX;
tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
@@ -2685,6 +2687,8 @@ int tcp_disconnect(struct sock *sk, int flags)
icsk->icsk_backoff = 0;
icsk->icsk_probes_out = 0;
icsk->icsk_rto = TCP_TIMEOUT_INIT;
icsk->icsk_rto_min = TCP_RTO_MIN;
icsk->icsk_delack_max = TCP_DELACK_MAX;
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd = TCP_INIT_CWND;
tp->snd_cwnd_cnt = 0;
@@ -3207,7 +3211,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
break;
case TCP_SAVE_SYN:
if (val < 0 || val > 1)
/* 0: disable, 1: enable, 2: start from ether_header */
if (val < 0 || val > 2)
err = -EINVAL;
else
tp->save_syn = val;
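
/* A hedged userspace sketch of the new optval (assumed fd and
 * includes; error handling omitted):
 */
static void save_syn_with_mac(int fd)
{
    int val = 2; /* 0: disable, 1: save from IP header, 2: from ether_header */

    setsockopt(fd, IPPROTO_TCP, TCP_SAVE_SYN, &val, sizeof(val));
}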
@@ -3788,20 +3793,21 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
lock_sock(sk);
if (tp->saved_syn) {
if (len < tp->saved_syn[0]) {
if (put_user(tp->saved_syn[0], optlen)) {
if (len < tcp_saved_syn_len(tp->saved_syn)) {
if (put_user(tcp_saved_syn_len(tp->saved_syn),
optlen)) {
release_sock(sk);
return -EFAULT;
}
release_sock(sk);
return -EINVAL;
}
len = tp->saved_syn[0];
len = tcp_saved_syn_len(tp->saved_syn);
if (put_user(len, optlen)) {
release_sock(sk);
return -EFAULT;
}
if (copy_to_user(optval, tp->saved_syn + 1, len)) {
if (copy_to_user(optval, tp->saved_syn->data, len)) {
release_sock(sk);
return -EFAULT;
}


@@ -295,7 +295,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
refcount_set(&req->rsk_refcnt, 2);
/* Now finish processing the fastopen child socket. */
tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, skb);
tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;


@@ -138,6 +138,69 @@ void clean_acked_data_flush(void)
EXPORT_SYMBOL_GPL(clean_acked_data_flush);
#endif
#ifdef CONFIG_CGROUP_BPF
static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
{
bool unknown_opt = tcp_sk(sk)->rx_opt.saw_unknown &&
BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG);
bool parse_all_opt = BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
struct bpf_sock_ops_kern sock_ops;
if (likely(!unknown_opt && !parse_all_opt))
return;
/* The skb will be handled in the
* bpf_skops_established() or
* bpf_skops_write_hdr_opt().
*/
switch (sk->sk_state) {
case TCP_SYN_RECV:
case TCP_SYN_SENT:
case TCP_LISTEN:
return;
}
sock_owned_by_me(sk);
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
sock_ops.is_fullsock = 1;
sock_ops.sk = sk;
bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
}
static void bpf_skops_established(struct sock *sk, int bpf_op,
struct sk_buff *skb)
{
struct bpf_sock_ops_kern sock_ops;
sock_owned_by_me(sk);
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = bpf_op;
sock_ops.is_fullsock = 1;
sock_ops.sk = sk;
/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
if (skb)
bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
}
#else
static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
{
}
static void bpf_skops_established(struct sock *sk, int bpf_op,
struct sk_buff *skb)
{
}
#endif
static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
unsigned int len)
{
@@ -3801,7 +3864,7 @@ static void tcp_parse_fastopen_option(int len, const unsigned char *cookie,
foc->exp = exp_opt;
}
static void smc_parse_options(const struct tcphdr *th,
static bool smc_parse_options(const struct tcphdr *th,
struct tcp_options_received *opt_rx,
const unsigned char *ptr,
int opsize)
@@ -3810,10 +3873,13 @@ static void smc_parse_options(const struct tcphdr *th,
if (static_branch_unlikely(&tcp_have_smc)) {
if (th->syn && !(opsize & 1) &&
opsize >= TCPOLEN_EXP_SMC_BASE &&
get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC)
get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC) {
opt_rx->smc_ok = 1;
return true;
}
}
#endif
return false;
}
/* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped
@@ -3874,6 +3940,7 @@ void tcp_parse_options(const struct net *net,
ptr = (const unsigned char *)(th + 1);
opt_rx->saw_tstamp = 0;
opt_rx->saw_unknown = 0;
while (length > 0) {
int opcode = *ptr++;
@@ -3964,15 +4031,21 @@ void tcp_parse_options(const struct net *net,
*/
if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE &&
get_unaligned_be16(ptr) ==
TCPOPT_FASTOPEN_MAGIC)
TCPOPT_FASTOPEN_MAGIC) {
tcp_parse_fastopen_option(opsize -
TCPOLEN_EXP_FASTOPEN_BASE,
ptr + 2, th->syn, foc, true);
else
smc_parse_options(th, opt_rx, ptr,
opsize);
break;
}
if (smc_parse_options(th, opt_rx, ptr, opsize))
break;
opt_rx->saw_unknown = 1;
break;
default:
opt_rx->saw_unknown = 1;
}
ptr += opsize-2;
length -= opsize;
@@ -5590,6 +5663,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
goto discard;
}
bpf_skops_parse_hdr(sk, skb);
return true;
discard:
@@ -5798,7 +5873,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
}
EXPORT_SYMBOL(tcp_rcv_established);
void tcp_init_transfer(struct sock *sk, int bpf_op)
void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
@@ -5819,7 +5894,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk));
tp->snd_cwnd_stamp = tcp_jiffies32;
tcp_call_bpf(sk, bpf_op, 0, NULL);
bpf_skops_established(sk, bpf_op, skb);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
}
@@ -5838,7 +5913,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
sk_mark_napi_id(sk, skb);
}
tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB, skb);
/* Prevent spurious tcp_cwnd_restart() on first data
* packet.
@@ -6310,7 +6385,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
} else {
tcp_try_undo_spurious_syn(sk);
tp->retrans_stamp = 0;
tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB,
skb);
WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
}
smp_mb();
@@ -6599,13 +6675,27 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
{
if (tcp_sk(sk)->save_syn) {
u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb);
u32 *copy;
struct saved_syn *saved_syn;
u32 mac_hdrlen;
void *base;
copy = kmalloc(len + sizeof(u32), GFP_ATOMIC);
if (copy) {
copy[0] = len;
memcpy(&copy[1], skb_network_header(skb), len);
req->saved_syn = copy;
if (tcp_sk(sk)->save_syn == 2) { /* Save full header. */
base = skb_mac_header(skb);
mac_hdrlen = skb_mac_header_len(skb);
len += mac_hdrlen;
} else {
base = skb_network_header(skb);
mac_hdrlen = 0;
}
saved_syn = kmalloc(struct_size(saved_syn, data, len),
GFP_ATOMIC);
if (saved_syn) {
saved_syn->mac_hdrlen = mac_hdrlen;
saved_syn->network_hdrlen = skb_network_header_len(skb);
saved_syn->tcp_hdrlen = tcp_hdrlen(skb);
memcpy(saved_syn->data, base, len);
req->saved_syn = saved_syn;
}
}
}
@@ -6752,7 +6842,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
}
if (fastopen_sk) {
af_ops->send_synack(fastopen_sk, dst, &fl, req,
&foc, TCP_SYNACK_FASTOPEN);
&foc, TCP_SYNACK_FASTOPEN, skb);
/* Add the child socket directly into the accept queue */
if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) {
reqsk_fastopen_remove(fastopen_sk, req, false);
@@ -6770,7 +6860,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
tcp_timeout_init((struct sock *)req));
af_ops->send_synack(sk, dst, &fl, req, &foc,
!want_cookie ? TCP_SYNACK_NORMAL :
TCP_SYNACK_COOKIE);
TCP_SYNACK_COOKIE,
skb);
if (want_cookie) {
reqsk_free(req);
return 0;


@@ -965,7 +965,8 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
struct flowi *fl,
struct request_sock *req,
struct tcp_fastopen_cookie *foc,
enum tcp_synack_type synack_type)
enum tcp_synack_type synack_type,
struct sk_buff *syn_skb)
{
const struct inet_request_sock *ireq = inet_rsk(req);
struct flowi4 fl4;
@@ -976,7 +977,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
return -1;
skb = tcp_make_synack(sk, dst, req, foc, synack_type);
skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb);
if (skb) {
__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);


@@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
newtp->fastopen_req = NULL;
RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
bpf_skops_init_child(sk, newsk);
tcp_bpf_clone(sk, newsk);
__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);


@@ -438,6 +438,8 @@ struct tcp_out_options {
u8 ws; /* window scale, 0 to disable */
u8 num_sack_blocks; /* number of SACK blocks to include */
u8 hash_size; /* bytes in hash_location */
u8 bpf_opt_len; /* length of BPF hdr option */
__u8 *hash_location; /* temporary pointer, overloaded */
__u32 tsval, tsecr; /* need to include OPTION_TS */
struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
@@ -452,6 +453,145 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
#endif
}
#ifdef CONFIG_CGROUP_BPF
static int bpf_skops_write_hdr_opt_arg0(struct sk_buff *skb,
enum tcp_synack_type synack_type)
{
if (unlikely(!skb))
return BPF_WRITE_HDR_TCP_CURRENT_MSS;
if (unlikely(synack_type == TCP_SYNACK_COOKIE))
return BPF_WRITE_HDR_TCP_SYNACK_COOKIE;
return 0;
}
/* req, syn_skb and synack_type are used when writing synack */
static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
struct sk_buff *syn_skb,
enum tcp_synack_type synack_type,
struct tcp_out_options *opts,
unsigned int *remaining)
{
struct bpf_sock_ops_kern sock_ops;
int err;
if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
!*remaining)
return;
/* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
/* init sock_ops */
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = BPF_SOCK_OPS_HDR_OPT_LEN_CB;
if (req) {
/* The listen "sk" cannot be passed here because
* it is not locked. It would not make too much
* sense to do bpf_setsockopt(listen_sk) based
* on individual connection request also.
*
* Thus, "req" is passed here and the cgroup-bpf-progs
* of the listen "sk" will be run.
*
* "req" is also used here for fastopen even the "sk" here is
* a fullsock "child" sk. It is to keep the behavior
* consistent between fastopen and non-fastopen on
* the bpf programming side.
*/
sock_ops.sk = (struct sock *)req;
sock_ops.syn_skb = syn_skb;
} else {
sock_owned_by_me(sk);
sock_ops.is_fullsock = 1;
sock_ops.sk = sk;
}
sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
sock_ops.remaining_opt_len = *remaining;
/* tcp_current_mss() does not pass a skb */
if (skb)
bpf_skops_init_skb(&sock_ops, skb, 0);
err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
if (err || sock_ops.remaining_opt_len == *remaining)
return;
opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
/* round up to 4 bytes */
opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
*remaining -= opts->bpf_opt_len;
}
static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
struct sk_buff *syn_skb,
enum tcp_synack_type synack_type,
struct tcp_out_options *opts)
{
u8 first_opt_off, nr_written, max_opt_len = opts->bpf_opt_len;
struct bpf_sock_ops_kern sock_ops;
int err;
if (likely(!max_opt_len))
return;
memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
sock_ops.op = BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
if (req) {
sock_ops.sk = (struct sock *)req;
sock_ops.syn_skb = syn_skb;
} else {
sock_owned_by_me(sk);
sock_ops.is_fullsock = 1;
sock_ops.sk = sk;
}
sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
sock_ops.remaining_opt_len = max_opt_len;
first_opt_off = tcp_hdrlen(skb) - max_opt_len;
bpf_skops_init_skb(&sock_ops, skb, first_opt_off);
err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
if (err)
nr_written = 0;
else
nr_written = max_opt_len - sock_ops.remaining_opt_len;
if (nr_written < max_opt_len)
memset(skb->data + first_opt_off + nr_written, TCPOPT_NOP,
max_opt_len - nr_written);
}
#else
static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
struct sk_buff *syn_skb,
enum tcp_synack_type synack_type,
struct tcp_out_options *opts,
unsigned int *remaining)
{
}
static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
struct sk_buff *syn_skb,
enum tcp_synack_type synack_type,
struct tcp_out_options *opts)
{
}
#endif
/* Write previously computed TCP options to the packet.
*
* Beware: Something in the Internet is very sensitive to the ordering of
@@ -691,6 +831,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
}
}
bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
return MAX_TCP_OPTION_SPACE - remaining;
}
@@ -701,7 +843,8 @@ static unsigned int tcp_synack_options(const struct sock *sk,
struct tcp_out_options *opts,
const struct tcp_md5sig_key *md5,
struct tcp_fastopen_cookie *foc,
enum tcp_synack_type synack_type)
enum tcp_synack_type synack_type,
struct sk_buff *syn_skb)
{
struct inet_request_sock *ireq = inet_rsk(req);
unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -758,6 +901,9 @@ static unsigned int tcp_synack_options(const struct sock *sk,
smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb,
synack_type, opts, &remaining);
return MAX_TCP_OPTION_SPACE - remaining;
}
@@ -826,6 +972,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
}
if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
size = MAX_TCP_OPTION_SPACE - remaining;
}
return size;
}
@@ -1213,6 +1368,9 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
}
#endif
/* BPF prog is the last one writing header option */
bpf_skops_write_hdr_opt(sk, skb, NULL, NULL, 0, &opts);
INDIRECT_CALL_INET(icsk->icsk_af_ops->send_check,
tcp_v6_send_check, tcp_v4_send_check,
sk, skb);
@@ -3336,20 +3494,20 @@ int tcp_send_synack(struct sock *sk)
}
/**
* tcp_make_synack - Prepare a SYN-ACK.
* sk: listener socket
* dst: dst entry attached to the SYNACK
* req: request_sock pointer
* foc: cookie for tcp fast open
* synack_type: Type of synback to prepare
*
* Allocate one skb and build a SYNACK packet.
* @dst is consumed : Caller should not use it again.
* tcp_make_synack - Allocate one skb and build a SYNACK packet.
* @sk: listener socket
* @dst: dst entry attached to the SYNACK. It is consumed and caller
* should not use it again.
* @req: request_sock pointer
* @foc: cookie for tcp fast open
* @synack_type: Type of synack to prepare
* @syn_skb: SYN packet just received. It could be NULL for rtx case.
*/
struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
struct request_sock *req,
struct tcp_fastopen_cookie *foc,
enum tcp_synack_type synack_type)
enum tcp_synack_type synack_type,
struct sk_buff *syn_skb)
{
struct inet_request_sock *ireq = inet_rsk(req);
const struct tcp_sock *tp = tcp_sk(sk);
@@ -3408,8 +3566,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
#endif
skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
/* bpf program will be interested in the tcp_flags */
TCP_SKB_CB(skb)->tcp_flags = TCPHDR_SYN | TCPHDR_ACK;
tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
foc, synack_type) + sizeof(*th);
foc, synack_type,
syn_skb) + sizeof(*th);
skb_push(skb, tcp_header_size);
skb_reset_transport_header(skb);
@@ -3441,6 +3602,9 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
rcu_read_unlock();
#endif
bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
synack_type, &opts);
skb->skb_mstamp_ns = now;
tcp_add_tx_delay(skb, tp);
@@ -3741,6 +3905,8 @@ void tcp_send_delayed_ack(struct sock *sk)
ato = min(ato, max_ato);
}
ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max);
/* Stay within the limit we were given */
timeout = jiffies + ato;
@@ -3934,7 +4100,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
int res;
tcp_rsk(req)->txhash = net_tx_rndhash();
res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL);
res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL,
NULL);
if (!res) {
__TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);


@@ -501,7 +501,8 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
struct flowi *fl,
struct request_sock *req,
struct tcp_fastopen_cookie *foc,
enum tcp_synack_type synack_type)
enum tcp_synack_type synack_type,
struct sk_buff *syn_skb)
{
struct inet_request_sock *ireq = inet_rsk(req);
struct ipv6_pinfo *np = tcp_inet6_sk(sk);
@@ -515,7 +516,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
IPPROTO_TCP)) == NULL)
goto done;
skb = tcp_make_synack(sk, dst, req, foc, synack_type);
skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb);
if (skb) {
__tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr,


@@ -3395,6 +3395,120 @@ union bpf_attr {
* A non-negative value equal to or less than *size* on success,
* or a negative error in case of failure.
*
* long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags)
* Description
* Load header option. Support reading a particular TCP header
* option for bpf program (BPF_PROG_TYPE_SOCK_OPS).
*
* If *flags* is 0, it will search the option from the
* sock_ops->skb_data. The comment in "struct bpf_sock_ops"
* has details on what skb_data contains under different
* sock_ops->op.
*
* The first byte of the *searchby_res* specifies the
* kind that it wants to search.
*
* If the searching kind is an experimental kind
* (i.e. 253 or 254 according to RFC6994), it also
* needs to specify the "magic" which is either
* 2 bytes or 4 bytes. It then also needs to
* specify the size of the magic by using
* the 2nd byte which is "kind-length" of a TCP
* header option and the "kind-length" also
* includes the first 2 bytes "kind" and "kind-length"
* itself as a normal TCP header option also does.
*
* For example, to search experimental kind 254 with
* 2 byte magic 0xeB9F, the searchby_res should be
* [ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ].
*
* To search for the standard window scale option (3),
* the searchby_res should be [ 3, 0, 0, .... 0 ].
* Note, kind-length must be 0 for regular option.
*
* Searching for No-Op (0) and End-of-Option-List (1) are
* not supported.
*
* *len* must be at least 2 bytes which is the minimal size
* of a header option.
*
* Supported flags:
* * **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the
* saved_syn packet or the just-received syn packet.
*
* Return
* >0 when found, the header option is copied to *searchby_res*.
* The return value is the total length copied.
*
* **-EINVAL** If param is invalid
*
* **-ENOMSG** The option is not found
*
* **-ENOENT** No syn packet available when
* **BPF_LOAD_HDR_OPT_TCP_SYN** is used
*
* **-ENOSPC** Not enough space. Only *len* number of
* bytes are copied.
*
* **-EFAULT** Cannot parse the header options in the packet
*
* **-EPERM** This helper cannot be used under the
* current sock_ops->op.
*
* long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags)
* Description
* Store header option. The data will be copied
* from buffer *from* with length *len* to the TCP header.
*
* The buffer *from* should have the whole option that
* includes the kind, kind-length, and the actual
* option data. The *len* must be at least kind-length
* long. The kind-length does not have to be 4 byte
* aligned. The kernel will take care of the padding
* and setting the 4 bytes aligned value to th->doff.
*
* This helper will check for duplicated option
* by searching the same option in the outgoing skb.
*
* This helper can only be called during
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*
* Return
* 0 on success, or negative error in case of failure:
*
* **-EINVAL** If param is invalid
*
* **-ENOSPC** Not enough space in the header.
* Nothing has been written
*
* **-EEXIST** The option has already existed
*
* **-EFAULT** Cannot parse the existing header options
*
* **-EPERM** This helper cannot be used under the
* current sock_ops->op.
*
* long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
* Description
* Reserve *len* bytes for the bpf header option. The
* space will be used by bpf_store_hdr_opt() later in
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*
* If bpf_reserve_hdr_opt() is called multiple times,
* the requested lengths accumulate and the total is
* reserved.
*
* This helper can only be called during
* BPF_SOCK_OPS_HDR_OPT_LEN_CB.
*
* Return
* 0 on success, or negative error in case of failure:
*
* **-EINVAL** if param is invalid
*
* **-ENOSPC** Not enough space in the header.
*
* **-EPERM** This helper cannot be used under the
* current sock_ops->op.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
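As an aside on the *searchby_res* layout described above, here is a minimal sketch of a sock_ops fragment performing the two searches from the doc. The function name search_opts() is made up for the example and is not part of this set:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: search the two options described in the
 * bpf_load_hdr_opt() doc above.  Assumes it is called from a
 * BPF_PROG_TYPE_SOCK_OPS prog at an op where the helper is
 * allowed (e.g. BPF_SOCK_OPS_PARSE_HDR_OPT_CB).
 */
static int search_opts(struct bpf_sock_ops *skops)
{
    /* Experimental kind 254, kind-len 4, 2-byte magic 0xeB9F:
     * [ 254, 4, 0xeB, 0x9F, 0, 0, ... 0 ]
     */
    __u8 exp[8] = { 254, 4, 0xeB, 0x9F };
    /* Regular window scale option (3): kind-len must be 0 */
    __u8 wscale[8] = { 3, 0 };
    int exp_len, ws_len;

    exp_len = bpf_load_hdr_opt(skops, exp, sizeof(exp), 0);
    /* On success, exp_len bytes (kind, kind-len, magic and
     * data) have been copied back into exp[].
     */
    ws_len = bpf_load_hdr_opt(skops, wscale, sizeof(wscale), 0);

    return exp_len > 0 && ws_len > 0;
}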
@@ -3539,6 +3653,9 @@ union bpf_attr {
FN(skc_to_tcp_request_sock), \
FN(skc_to_udp6_sock), \
FN(get_task_stack), \
FN(load_hdr_opt), \
FN(store_hdr_opt), \
FN(reserve_hdr_opt),
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4165,6 +4282,36 @@ struct bpf_sock_ops {
__u64 bytes_received;
__u64 bytes_acked;
__bpf_md_ptr(struct bpf_sock *, sk);
/* [skb_data, skb_data_end) covers the whole TCP header.
*
* BPF_SOCK_OPS_PARSE_HDR_OPT_CB: The packet received
* BPF_SOCK_OPS_HDR_OPT_LEN_CB: Not useful because the
* header has not been written.
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB: The header and options have
* been written so far.
* BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: The SYNACK that concludes
* the 3WHS.
* BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
* the 3WHS.
*
* bpf_load_hdr_opt() can also be used to read a particular option.
*/
__bpf_md_ptr(void *, skb_data);
__bpf_md_ptr(void *, skb_data_end);
__u32 skb_len; /* The total length of a packet.
* It includes the header, options,
* and payload.
*/
__u32 skb_tcp_flags; /* tcp_flags of the header. It provides
* an easy way to check for tcp_flags
* without parsing skb_data.
*
* In particular, the skb_tcp_flags
* will still be available in
* BPF_SOCK_OPS_HDR_OPT_LEN_CB even though
* the outgoing header has not
* been written yet.
*/
};
/* Definitions for bpf_sock_ops_cb_flags */
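To make the [skb_data, skb_data_end) contract concrete, a small hedged fragment (mirroring check_active_hdr_in() in the tests below) showing the bounds check the verifier requires before the TCP header can be dereferenced; is_pure_ack() is a made-up name:

#include <linux/bpf.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: classify the current skb from a sock_ops prog. */
static int is_pure_ack(struct bpf_sock_ops *skops)
{
    struct tcphdr *th = skops->skb_data;

    /* Mandatory bounds check before dereferencing th */
    if (th + 1 > skops->skb_data_end)
        return 0;

    /* Pure ACK: no FIN and no payload beyond the header */
    return th->ack && !th->fin && (th->doff << 2) == skops->skb_len;
}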
@@ -4173,8 +4320,51 @@
BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1),
BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2),
BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3),
/* Call bpf for all received TCP headers. The bpf prog will be
* called under sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB
*
* Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB
* for the header option related helpers that will be useful
* to the bpf programs.
*
* It can be used at the client/active side (i.e. connect() side)
* when the server has told it that the server is in syncookie
* mode and requires the active side to resend the bpf-written
* options. The active side can keep writing the bpf-options until
* it receives a valid packet from the server side confirming that
* the earlier packet (and options) has been received. The later
* example patch uses it like this at the active side when the
* server is in syncookie mode.
*
* The bpf prog will usually turn this off in the common case.
*/
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4),
/* Call bpf when kernel has received a header option that
* the kernel cannot handle. The bpf prog will be called under
* sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB.
*
* Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB
* for the header option related helpers that will be useful
* to the bpf programs.
*/
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5),
/* Call bpf when the kernel is writing header options for the
* outgoing packet. The bpf prog will first be called
* to reserve space in a skb under
* sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB. Then
* the bpf prog will be called to write the header option(s)
* under sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*
* Please refer to the comment in BPF_SOCK_OPS_HDR_OPT_LEN_CB
* and BPF_SOCK_OPS_WRITE_HDR_OPT_CB for the header option
* related helpers that will be useful to the bpf programs.
*
* The kernel gets its chance to reserve space and write
* options first before the BPF program does.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
/* Mask of all currently supported cb flags */
BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
};
/* List of known BPF sock_ops operators.
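Note that bpf_sock_ops_cb_flags_set() replaces the whole flag set, so a prog normally ORs the new bit into the current skops->bpf_sock_ops_cb_flags (as set_hdr_cb_flags() in the test header below does). A hedged fragment that also relies on the helper's documented return value (the bits that could not be set) to detect kernel support; enable_write_hdr_cb() is a made-up name:

#include <linux/bpf.h>
#include <stdbool.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: enable the write-header callback, keeping the
 * flags that are already set, and report whether the kernel
 * accepted the new flag.
 */
static bool enable_write_hdr_cb(struct bpf_sock_ops *skops)
{
    long unset;

    unset = bpf_sock_ops_cb_flags_set(skops,
                                      skops->bpf_sock_ops_cb_flags |
                                      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
    /* 0 means all requested flags were set */
    return unset == 0;
}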
@@ -4230,6 +4420,63 @@
*/
BPF_SOCK_OPS_RTT_CB, /* Called on every RTT.
*/
BPF_SOCK_OPS_PARSE_HDR_OPT_CB, /* Parse the header option.
* It will be called to handle
* the packets received at
* an already established
* connection.
*
* sock_ops->skb_data:
* Referring to the received skb.
* It covers the TCP header only.
*
* bpf_load_hdr_opt() can also
* be used to search for a
* particular option.
*/
BPF_SOCK_OPS_HDR_OPT_LEN_CB, /* Reserve space for writing the
* header option later in
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
* Arg1: bool want_cookie. (when
* writing a SYNACK only)
*
* sock_ops->skb_data:
* Not available because no header has
* been written yet.
*
* sock_ops->skb_tcp_flags:
* The tcp_flags of the
* outgoing skb. (e.g. SYN, ACK, FIN).
*
* bpf_reserve_hdr_opt() should
* be used to reserve space.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB, /* Write the header options
* Arg1: bool want_cookie. (when
* writing a SYNACK only)
*
* sock_ops->skb_data:
* Referring to the outgoing skb.
* It covers the TCP header
* that has already been written
* by the kernel and the
* earlier bpf-progs.
*
* sock_ops->skb_tcp_flags:
* The tcp_flags of the outgoing
* skb. (e.g. SYN, ACK, FIN).
*
* bpf_store_hdr_opt() should
* be used to write the
* option.
*
* bpf_load_hdr_opt() can also
* be used to search for a
* particular option that
* has already been written
* by the kernel or the
* earlier bpf-progs.
*/
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
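Tying the three new ops together, a minimal hedged sketch of a complete sockops prog: opt in to the callbacks, reserve at BPF_SOCK_OPS_HDR_OPT_LEN_CB, write at BPF_SOCK_OPS_WRITE_HDR_OPT_CB, and search at BPF_SOCK_OPS_PARSE_HDR_OPT_CB. Kind 0xB9 and the option payload are arbitrary placeholders, as in the tests later in this set:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sockops")
int hdr_opt_sketch(struct bpf_sock_ops *skops)
{
    __u8 opt[4] = { 0xB9, 4, 0xfa, 0xce };

    switch (skops->op) {
    case BPF_SOCK_OPS_TCP_CONNECT_CB:
    case BPF_SOCK_OPS_TCP_LISTEN_CB:
        /* Opt in to the new callbacks */
        bpf_sock_ops_cb_flags_set(skops,
                                  skops->bpf_sock_ops_cb_flags |
                                  BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
                                  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
        break;
    case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
        bpf_reserve_hdr_opt(skops, sizeof(opt), 0);
        break;
    case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
        bpf_store_hdr_opt(skops, opt, sizeof(opt), 0);
        break;
    case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
        opt[1] = 0; /* kind-len must be 0 when searching */
        bpf_load_hdr_opt(skops, opt, sizeof(opt), 0);
        break;
    }
    return 1;
}

char _license[] SEC("license") = "GPL";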
@@ -4257,6 +4504,63 @@
enum {
TCP_BPF_IW = 1001, /* Set TCP initial congestion window */
TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */
TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */
TCP_BPF_RTO_MIN = 1004, /* Min RTO in usecs */
/* Copy the SYN pkt to optval
*
* BPF_PROG_TYPE_SOCK_OPS only. It is similar to the
* bpf_getsockopt(TCP_SAVED_SYN) but it is not limited
* to getting from the saved_syn. It can either get the
* syn packet from:
*
* 1. the just-received SYN packet (only available when writing the
* SYNACK). It is useful when it is not necessary to
* save the SYN packet for later use. It is also the only way
* to get the SYN during syncookie mode because the SYN
* packet cannot be saved during syncookie.
*
* OR
*
* 2. the SYN saved earlier by
* bpf_setsockopt(TCP_SAVE_SYN).
*
* The bpf_getsockopt(TCP_BPF_SYN*) option hides where the
* SYN packet was obtained from.
*
* If the bpf-prog does not need the IP[46] header, the
* bpf-prog can avoid parsing the IP header by using
* TCP_BPF_SYN. Otherwise, the bpf-prog can get both
* IP[46] and TCP header by using TCP_BPF_SYN_IP.
*
* >0: Total number of bytes copied
* -ENOSPC: Not enough space in optval. Only optlen number of
* bytes are copied.
* -ENOENT: The SYN skb is not available now and the earlier SYN pkt
* was not saved by setsockopt(TCP_SAVE_SYN).
*/
TCP_BPF_SYN = 1005, /* Copy the TCP header */
TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
};
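A hedged fragment using the probe-then-read pattern from __check_active_hdr_in() in the misc test below: an undersized optval returns -ENOSPC but still copies optlen bytes, while a full-sized buffer returns the total length. peek_syn_sport() is a made-up name; the buffer size mirrors the test's IPv6 (40) + max TCP header (60):

#include <linux/bpf.h>
#include <linux/ipv6.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef SOL_TCP
#define SOL_TCP 6
#endif

/* Sketch only: fetch the SYN's IPv6 + TCP headers from a sock_ops
 * prog, e.g. while writing the SYNACK.
 */
static int peek_syn_sport(struct bpf_sock_ops *skops)
{
    __u8 buf[100]; /* IPv6 (40) + max TCP hdr (60) */
    struct tcphdr *th;
    int ret;

    ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP,
                         buf, sizeof(buf));
    if (ret < 0) /* e.g. -ENOENT: no SYN is available */
        return ret;
    if (ret < (int)(sizeof(struct ipv6hdr) + sizeof(struct tcphdr)))
        return -1; /* not an IPv6 SYN */

    th = (struct tcphdr *)(buf + sizeof(struct ipv6hdr));
    return bpf_ntohs(th->source);
}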
enum {
BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0),
};
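And the corresponding flag usage with bpf_load_hdr_opt(): a hedged fragment (opt_was_in_syn() is a made-up name) that searches a regular option in the saved or just-received SYN instead of the current skb:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: was option `kind` present in the SYN? */
static int opt_was_in_syn(struct bpf_sock_ops *skops, __u8 kind)
{
    __u8 buf[8] = {};

    buf[0] = kind; /* kind-len stays 0 for a regular option */
    return bpf_load_hdr_opt(skops, buf, sizeof(buf),
                            BPF_LOAD_HDR_OPT_TCP_SYN) > 0;
}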
/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*/
enum {
BPF_WRITE_HDR_TCP_CURRENT_MSS = 1, /* Kernel is finding the
* total option spaces
* required for an established
* sk in order to calculate the
* MSS. No skb is actually
* sent.
*/
BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2, /* Kernel is in syncookie mode
* when sending a SYNACK.
*/
};
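A hedged fragment on args[0], mirroring skops_current_mss() and current_mss_opt_len() in the test below; sketch_opt_len() is a made-up name and the reserved lengths are arbitrary for the example:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: called at BPF_SOCK_OPS_HDR_OPT_LEN_CB. */
static int sketch_opt_len(struct bpf_sock_ops *skops)
{
    if (skops->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS)
        /* The kernel is only calculating the MSS; reserve the
         * worst case so the MSS accounts for the option.
         */
        return bpf_reserve_hdr_opt(skops, 12, 0);

    /* A real skb is being sent; reserve what will be written */
    return bpf_reserve_hdr_opt(skops, 4, 0);
}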
struct bpf_perf_event_value {


@@ -104,6 +104,43 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
return -1;
}
int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
int timeout_ms)
{
struct sockaddr_storage addr;
socklen_t addrlen = sizeof(addr);
struct sockaddr_in *addr_in;
int fd, ret;
if (getsockname(server_fd, (struct sockaddr *)&addr, &addrlen)) {
log_err("Failed to get server addr");
return -1;
}
addr_in = (struct sockaddr_in *)&addr;
fd = socket(addr_in->sin_family, SOCK_STREAM, 0);
if (fd < 0) {
log_err("Failed to create client socket");
return -1;
}
if (settimeo(fd, timeout_ms))
goto error_close;
ret = sendto(fd, data, data_len, MSG_FASTOPEN, (struct sockaddr *)&addr,
addrlen);
if (ret != data_len) {
log_err("sendto(data, %u) != %d\n", data_len, ret);
goto error_close;
}
return fd;
error_close:
save_errno_close(fd);
return -1;
}
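A hedged user-space usage sketch of the new helper; fastopen_smoke() is a made-up name, server_fd would come from start_server(), and (as in the test below) the tcp_fastopen sysctl must allow the client and server sides, e.g. the test writes "1543":

#include <unistd.h>
#include "network_helpers.h"

/* Sketch only: the data rides on the SYN and would be read from
 * the fd accepted on server_fd.
 */
static int fastopen_smoke(int server_fd)
{
    const char msg[] = "hi";
    int fd;

    fd = fastopen_connect(server_fd, msg, sizeof(msg), 1000);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}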
static int connect_fd_to_addr(int fd,
const struct sockaddr_storage *addr,
socklen_t addrlen)


@@ -37,6 +37,8 @@ int start_server(int family, int type, const char *addr, __u16 port,
int timeout_ms);
int connect_to_fd(int server_fd, int timeout_ms);
int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
int timeout_ms);
int make_sockaddr(int family, const char *addr_str, __u16 port,
struct sockaddr_storage *addr, socklen_t *len);


@@ -0,0 +1,622 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2020 Facebook */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <linux/compiler.h>
#include "test_progs.h"
#include "cgroup_helpers.h"
#include "network_helpers.h"
#include "test_tcp_hdr_options.h"
#include "test_tcp_hdr_options.skel.h"
#include "test_misc_tcp_hdr_options.skel.h"
#define LO_ADDR6 "::eB9F"
#define CG_NAME "/tcpbpf-hdr-opt-test"
struct bpf_test_option exp_passive_estab_in;
struct bpf_test_option exp_active_estab_in;
struct bpf_test_option exp_passive_fin_in;
struct bpf_test_option exp_active_fin_in;
struct hdr_stg exp_passive_hdr_stg;
struct hdr_stg exp_active_hdr_stg = { .active = true, };
static struct test_misc_tcp_hdr_options *misc_skel;
static struct test_tcp_hdr_options *skel;
static int lport_linum_map_fd;
static int hdr_stg_map_fd;
static __u32 duration;
static int cg_fd;
struct sk_fds {
int srv_fd;
int passive_fd;
int active_fd;
int passive_lport;
int active_lport;
};
static int add_lo_addr(void)
{
char ip_addr_cmd[256];
int cmdlen;
cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd),
"ip -6 addr add %s/128 dev lo scope host",
LO_ADDR6);
if (CHECK(cmdlen >= sizeof(ip_addr_cmd), "compile ip cmd",
"failed to add host addr %s to lo. ip cmdlen is too long\n",
LO_ADDR6))
return -1;
if (CHECK(system(ip_addr_cmd), "run ip cmd",
"failed to add host addr %s to lo\n", LO_ADDR6))
return -1;
return 0;
}
static int create_netns(void)
{
if (CHECK(unshare(CLONE_NEWNET), "create netns",
"unshare(CLONE_NEWNET): %s (%d)",
strerror(errno), errno))
return -1;
if (CHECK(system("ip link set dev lo up"), "run ip cmd",
"failed to bring lo link up\n"))
return -1;
if (add_lo_addr())
return -1;
return 0;
}
static int write_sysctl(const char *sysctl, const char *value)
{
int fd, err, len;
fd = open(sysctl, O_WRONLY);
if (CHECK(fd == -1, "open sysctl", "open(%s): %s (%d)\n",
sysctl, strerror(errno), errno))
return -1;
len = strlen(value);
err = write(fd, value, len);
close(fd);
if (CHECK(err != len, "write sysctl",
"write(%s, %s): err:%d %s (%d)\n",
sysctl, value, err, strerror(errno), errno))
return -1;
return 0;
}
static void print_hdr_stg(const struct hdr_stg *hdr_stg, const char *prefix)
{
fprintf(stderr, "%s{active:%u, resend_syn:%u, syncookie:%u, fastopen:%u}\n",
prefix ? : "", hdr_stg->active, hdr_stg->resend_syn,
hdr_stg->syncookie, hdr_stg->fastopen);
}
static void print_option(const struct bpf_test_option *opt, const char *prefix)
{
fprintf(stderr, "%s{flags:0x%x, max_delack_ms:%u, rand:0x%x}\n",
prefix ? : "", opt->flags, opt->max_delack_ms, opt->rand);
}
static void sk_fds_close(struct sk_fds *sk_fds)
{
close(sk_fds->srv_fd);
close(sk_fds->passive_fd);
close(sk_fds->active_fd);
}
static int sk_fds_shutdown(struct sk_fds *sk_fds)
{
int ret, abyte;
shutdown(sk_fds->active_fd, SHUT_WR);
ret = read(sk_fds->passive_fd, &abyte, sizeof(abyte));
if (CHECK(ret != 0, "read-after-shutdown(passive_fd):",
"ret:%d %s (%d)\n",
ret, strerror(errno), errno))
return -1;
shutdown(sk_fds->passive_fd, SHUT_WR);
ret = read(sk_fds->active_fd, &abyte, sizeof(abyte));
if (CHECK(ret != 0, "read-after-shutdown(active_fd):",
"ret:%d %s (%d)\n",
ret, strerror(errno), errno))
return -1;
return 0;
}
static int sk_fds_connect(struct sk_fds *sk_fds, bool fast_open)
{
const char fast[] = "FAST!!!";
struct sockaddr_in6 addr6;
socklen_t len;
sk_fds->srv_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0);
if (CHECK(sk_fds->srv_fd == -1, "start_server", "%s (%d)\n",
strerror(errno), errno))
goto error;
if (fast_open)
sk_fds->active_fd = fastopen_connect(sk_fds->srv_fd, fast,
sizeof(fast), 0);
else
sk_fds->active_fd = connect_to_fd(sk_fds->srv_fd, 0);
if (CHECK_FAIL(sk_fds->active_fd == -1)) {
close(sk_fds->srv_fd);
goto error;
}
len = sizeof(addr6);
if (CHECK(getsockname(sk_fds->srv_fd, (struct sockaddr *)&addr6,
&len), "getsockname(srv_fd)", "%s (%d)\n",
strerror(errno), errno))
goto error_close;
sk_fds->passive_lport = ntohs(addr6.sin6_port);
len = sizeof(addr6);
if (CHECK(getsockname(sk_fds->active_fd, (struct sockaddr *)&addr6,
&len), "getsockname(active_fd)", "%s (%d)\n",
strerror(errno), errno))
goto error_close;
sk_fds->active_lport = ntohs(addr6.sin6_port);
sk_fds->passive_fd = accept(sk_fds->srv_fd, NULL, 0);
if (CHECK(sk_fds->passive_fd == -1, "accept(srv_fd)", "%s (%d)\n",
strerror(errno), errno))
goto error_close;
if (fast_open) {
char bytes_in[sizeof(fast)];
int ret;
ret = read(sk_fds->passive_fd, bytes_in, sizeof(bytes_in));
if (CHECK(ret != sizeof(fast), "read fastopen syn data",
"expected=%lu actual=%d\n", sizeof(fast), ret)) {
close(sk_fds->passive_fd);
goto error_close;
}
}
return 0;
error_close:
close(sk_fds->active_fd);
close(sk_fds->srv_fd);
error:
memset(sk_fds, -1, sizeof(*sk_fds));
return -1;
}
static int check_hdr_opt(const struct bpf_test_option *exp,
const struct bpf_test_option *act,
const char *hdr_desc)
{
if (CHECK(memcmp(exp, act, sizeof(*exp)),
"expected-vs-actual", "unexpected %s\n", hdr_desc)) {
print_option(exp, "expected: ");
print_option(act, " actual: ");
return -1;
}
return 0;
}
static int check_hdr_stg(const struct hdr_stg *exp, int fd,
const char *stg_desc)
{
struct hdr_stg act;
if (CHECK(bpf_map_lookup_elem(hdr_stg_map_fd, &fd, &act),
"map_lookup(hdr_stg_map_fd)", "%s %s (%d)\n",
stg_desc, strerror(errno), errno))
return -1;
if (CHECK(memcmp(exp, &act, sizeof(*exp)),
"expected-vs-actual", "unexpected %s\n", stg_desc)) {
print_hdr_stg(exp, "expected: ");
print_hdr_stg(&act, " actual: ");
return -1;
}
return 0;
}
static int check_error_linum(const struct sk_fds *sk_fds)
{
unsigned int nr_errors = 0;
struct linum_err linum_err;
int lport;
lport = sk_fds->passive_lport;
if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum_err)) {
fprintf(stderr,
"bpf prog error out at lport:passive(%d), linum:%u err:%d\n",
lport, linum_err.linum, linum_err.err);
nr_errors++;
}
lport = sk_fds->active_lport;
if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum_err)) {
fprintf(stderr,
"bpf prog error out at lport:active(%d), linum:%u err:%d\n",
lport, linum_err.linum, linum_err.err);
nr_errors++;
}
return nr_errors;
}
static void check_hdr_and_close_fds(struct sk_fds *sk_fds)
{
if (sk_fds_shutdown(sk_fds))
goto check_linum;
if (check_hdr_stg(&exp_passive_hdr_stg, sk_fds->passive_fd,
"passive_hdr_stg"))
goto check_linum;
if (check_hdr_stg(&exp_active_hdr_stg, sk_fds->active_fd,
"active_hdr_stg"))
goto check_linum;
if (check_hdr_opt(&exp_passive_estab_in, &skel->bss->passive_estab_in,
"passive_estab_in"))
goto check_linum;
if (check_hdr_opt(&exp_active_estab_in, &skel->bss->active_estab_in,
"active_estab_in"))
goto check_linum;
if (check_hdr_opt(&exp_passive_fin_in, &skel->bss->passive_fin_in,
"passive_fin_in"))
goto check_linum;
check_hdr_opt(&exp_active_fin_in, &skel->bss->active_fin_in,
"active_fin_in");
check_linum:
CHECK_FAIL(check_error_linum(sk_fds));
sk_fds_close(sk_fds);
}
static void prepare_out(void)
{
skel->bss->active_syn_out = exp_passive_estab_in;
skel->bss->passive_synack_out = exp_active_estab_in;
skel->bss->active_fin_out = exp_passive_fin_in;
skel->bss->passive_fin_out = exp_active_fin_in;
}
static void reset_test(void)
{
size_t optsize = sizeof(struct bpf_test_option);
int lport, err;
memset(&skel->bss->passive_synack_out, 0, optsize);
memset(&skel->bss->passive_fin_out, 0, optsize);
memset(&skel->bss->passive_estab_in, 0, optsize);
memset(&skel->bss->passive_fin_in, 0, optsize);
memset(&skel->bss->active_syn_out, 0, optsize);
memset(&skel->bss->active_fin_out, 0, optsize);
memset(&skel->bss->active_estab_in, 0, optsize);
memset(&skel->bss->active_fin_in, 0, optsize);
skel->data->test_kind = TCPOPT_EXP;
skel->data->test_magic = 0xeB9F;
memset(&exp_passive_estab_in, 0, optsize);
memset(&exp_active_estab_in, 0, optsize);
memset(&exp_passive_fin_in, 0, optsize);
memset(&exp_active_fin_in, 0, optsize);
memset(&exp_passive_hdr_stg, 0, sizeof(exp_passive_hdr_stg));
memset(&exp_active_hdr_stg, 0, sizeof(exp_active_hdr_stg));
exp_active_hdr_stg.active = true;
err = bpf_map_get_next_key(lport_linum_map_fd, NULL, &lport);
while (!err) {
bpf_map_delete_elem(lport_linum_map_fd, &lport);
err = bpf_map_get_next_key(lport_linum_map_fd, &lport, &lport);
}
}
static void fastopen_estab(void)
{
struct bpf_link *link;
struct sk_fds sk_fds;
hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
exp_passive_estab_in.rand = 0xfa;
exp_passive_estab_in.max_delack_ms = 11;
exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
exp_active_estab_in.rand = 0xce;
exp_active_estab_in.max_delack_ms = 22;
exp_passive_hdr_stg.fastopen = true;
prepare_out();
/* Allow fastopen without fastopen cookie */
if (write_sysctl("/proc/sys/net/ipv4/tcp_fastopen", "1543"))
return;
link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
PTR_ERR(link)))
return;
if (sk_fds_connect(&sk_fds, true)) {
bpf_link__destroy(link);
return;
}
check_hdr_and_close_fds(&sk_fds);
bpf_link__destroy(link);
}
static void syncookie_estab(void)
{
struct bpf_link *link;
struct sk_fds sk_fds;
hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
exp_passive_estab_in.rand = 0xfa;
exp_passive_estab_in.max_delack_ms = 11;
exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS |
OPTION_F_RESEND;
exp_active_estab_in.rand = 0xce;
exp_active_estab_in.max_delack_ms = 22;
exp_passive_hdr_stg.syncookie = true;
exp_active_hdr_stg.resend_syn = true;
prepare_out();
/* Clear the RESEND to ensure the bpf prog can learn
* want_cookie and set the RESEND by itself.
*/
skel->bss->passive_synack_out.flags &= ~OPTION_F_RESEND;
/* Enforce syncookie mode */
if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "2"))
return;
link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
PTR_ERR(link)))
return;
if (sk_fds_connect(&sk_fds, false)) {
bpf_link__destroy(link);
return;
}
check_hdr_and_close_fds(&sk_fds);
bpf_link__destroy(link);
}
static void fin(void)
{
struct bpf_link *link;
struct sk_fds sk_fds;
hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
exp_passive_fin_in.flags = OPTION_F_RAND;
exp_passive_fin_in.rand = 0xfa;
exp_active_fin_in.flags = OPTION_F_RAND;
exp_active_fin_in.rand = 0xce;
prepare_out();
if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
return;
link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
PTR_ERR(link)))
return;
if (sk_fds_connect(&sk_fds, false)) {
bpf_link__destroy(link);
return;
}
check_hdr_and_close_fds(&sk_fds);
bpf_link__destroy(link);
}
static void __simple_estab(bool exprm)
{
struct bpf_link *link;
struct sk_fds sk_fds;
hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
exp_passive_estab_in.rand = 0xfa;
exp_passive_estab_in.max_delack_ms = 11;
exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
exp_active_estab_in.rand = 0xce;
exp_active_estab_in.max_delack_ms = 22;
prepare_out();
if (!exprm) {
skel->data->test_kind = 0xB9;
skel->data->test_magic = 0;
}
if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
return;
link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
PTR_ERR(link)))
return;
if (sk_fds_connect(&sk_fds, false)) {
bpf_link__destroy(link);
return;
}
check_hdr_and_close_fds(&sk_fds);
bpf_link__destroy(link);
}
static void no_exprm_estab(void)
{
__simple_estab(false);
}
static void simple_estab(void)
{
__simple_estab(true);
}
static void misc(void)
{
const char send_msg[] = "MISC!!!";
char recv_msg[sizeof(send_msg)];
const unsigned int nr_data = 2;
struct bpf_link *link;
struct sk_fds sk_fds;
int i, ret;
lport_linum_map_fd = bpf_map__fd(misc_skel->maps.lport_linum_map);
if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
return;
link = bpf_program__attach_cgroup(misc_skel->progs.misc_estab, cg_fd);
if (CHECK(IS_ERR(link), "attach_cgroup(misc_estab)", "err: %ld\n",
PTR_ERR(link)))
return;
if (sk_fds_connect(&sk_fds, false)) {
bpf_link__destroy(link);
return;
}
for (i = 0; i < nr_data; i++) {
/* MSG_EOR to ensure skb will not be combined */
ret = send(sk_fds.active_fd, send_msg, sizeof(send_msg),
MSG_EOR);
if (CHECK(ret != sizeof(send_msg), "send(msg)", "ret:%d\n",
ret))
goto check_linum;
ret = read(sk_fds.passive_fd, recv_msg, sizeof(recv_msg));
if (CHECK(ret != sizeof(send_msg), "read(msg)", "ret:%d\n",
ret))
goto check_linum;
}
if (sk_fds_shutdown(&sk_fds))
goto check_linum;
CHECK(misc_skel->bss->nr_syn != 1, "unexpected nr_syn",
"expected (1) != actual (%u)\n",
misc_skel->bss->nr_syn);
CHECK(misc_skel->bss->nr_data != nr_data, "unexpected nr_data",
"expected (%u) != actual (%u)\n",
nr_data, misc_skel->bss->nr_data);
/* The last ACK may have been delayed, so it is either 1 or 2. */
CHECK(misc_skel->bss->nr_pure_ack != 1 &&
misc_skel->bss->nr_pure_ack != 2,
"unexpected nr_pure_ack",
"expected (1 or 2) != actual (%u)\n",
misc_skel->bss->nr_pure_ack);
CHECK(misc_skel->bss->nr_fin != 1, "unexpected nr_fin",
"expected (1) != actual (%u)\n",
misc_skel->bss->nr_fin);
check_linum:
CHECK_FAIL(check_error_linum(&sk_fds));
sk_fds_close(&sk_fds);
bpf_link__destroy(link);
}
struct test {
const char *desc;
void (*run)(void);
};
#define DEF_TEST(name) { #name, name }
static struct test tests[] = {
DEF_TEST(simple_estab),
DEF_TEST(no_exprm_estab),
DEF_TEST(syncookie_estab),
DEF_TEST(fastopen_estab),
DEF_TEST(fin),
DEF_TEST(misc),
};
void test_tcp_hdr_options(void)
{
int i;
skel = test_tcp_hdr_options__open_and_load();
if (CHECK(!skel, "open and load skel", "failed"))
return;
misc_skel = test_misc_tcp_hdr_options__open_and_load();
if (CHECK(!misc_skel, "open and load misc test skel", "failed"))
goto skel_destroy;
cg_fd = test__join_cgroup(CG_NAME);
if (CHECK_FAIL(cg_fd < 0))
goto skel_destroy;
for (i = 0; i < ARRAY_SIZE(tests); i++) {
if (!test__start_subtest(tests[i].desc))
continue;
if (create_netns())
break;
tests[i].run();
reset_test();
}
close(cg_fd);
skel_destroy:
test_misc_tcp_hdr_options__destroy(misc_skel);
test_tcp_hdr_options__destroy(skel);
}


@@ -0,0 +1,325 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2020 Facebook */
#include <stddef.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/ipv6.h>
#include <linux/tcp.h>
#include <linux/socket.h>
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define BPF_PROG_TEST_TCP_HDR_OPTIONS
#include "test_tcp_hdr_options.h"
__u16 last_addr16_n = __bpf_htons(0xeB9F);
__u16 active_lport_n = 0;
__u16 active_lport_h = 0;
__u16 passive_lport_n = 0;
__u16 passive_lport_h = 0;
/* Counters of packets received at the passive side */
unsigned int nr_pure_ack = 0;
unsigned int nr_data = 0;
unsigned int nr_syn = 0;
unsigned int nr_fin = 0;
/* Check the header received from the active side */
static int __check_active_hdr_in(struct bpf_sock_ops *skops, bool check_syn)
{
union {
struct tcphdr th;
struct ipv6hdr ip6;
struct tcp_exprm_opt exprm_opt;
struct tcp_opt reg_opt;
__u8 data[100]; /* IPv6 (40) + Max TCP hdr (60) */
} hdr = {};
__u64 load_flags = check_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0;
struct tcphdr *pth;
int ret;
hdr.reg_opt.kind = 0xB9;
/* The option is 4 bytes long instead of 2 bytes */
ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, 2, load_flags);
if (ret != -ENOSPC)
RET_CG_ERR(ret);
/* Test searching magic with regular kind */
hdr.reg_opt.len = 4;
ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt),
load_flags);
if (ret != -EINVAL)
RET_CG_ERR(ret);
hdr.reg_opt.len = 0;
ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt),
load_flags);
if (ret != 4 || hdr.reg_opt.len != 4 || hdr.reg_opt.kind != 0xB9 ||
hdr.reg_opt.data[0] != 0xfa || hdr.reg_opt.data[1] != 0xce)
RET_CG_ERR(ret);
/* Test searching experimental option with invalid kind length */
hdr.exprm_opt.kind = TCPOPT_EXP;
hdr.exprm_opt.len = 5;
hdr.exprm_opt.magic = 0;
ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
load_flags);
if (ret != -EINVAL)
RET_CG_ERR(ret);
/* Test searching experimental option with 0 magic value */
hdr.exprm_opt.len = 4;
ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
load_flags);
if (ret != -ENOMSG)
RET_CG_ERR(ret);
hdr.exprm_opt.magic = __bpf_htons(0xeB9F);
ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
load_flags);
if (ret != 4 || hdr.exprm_opt.len != 4 ||
hdr.exprm_opt.kind != TCPOPT_EXP ||
hdr.exprm_opt.magic != __bpf_htons(0xeB9F))
RET_CG_ERR(ret);
if (!check_syn)
return CG_OK;
/* Test loading from skops->syn_skb if sk_state == TCP_NEW_SYN_RECV
*
* Test loading from tp->saved_syn for other sk_state.
*/
ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr.ip6,
sizeof(hdr.ip6));
if (ret != -ENOSPC)
RET_CG_ERR(ret);
if (hdr.ip6.saddr.s6_addr16[7] != last_addr16_n ||
hdr.ip6.daddr.s6_addr16[7] != last_addr16_n)
RET_CG_ERR(0);
ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr, sizeof(hdr));
if (ret < 0)
RET_CG_ERR(ret);
pth = (struct tcphdr *)(&hdr.ip6 + 1);
if (pth->dest != passive_lport_n || pth->source != active_lport_n)
RET_CG_ERR(0);
ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN, &hdr, sizeof(hdr));
if (ret < 0)
RET_CG_ERR(ret);
if (hdr.th.dest != passive_lport_n || hdr.th.source != active_lport_n)
RET_CG_ERR(0);
return CG_OK;
}
static int check_active_syn_in(struct bpf_sock_ops *skops)
{
return __check_active_hdr_in(skops, true);
}
static int check_active_hdr_in(struct bpf_sock_ops *skops)
{
struct tcphdr *th;
if (__check_active_hdr_in(skops, false) == CG_ERR)
return CG_ERR;
th = skops->skb_data;
if (th + 1 > skops->skb_data_end)
RET_CG_ERR(0);
if (tcp_hdrlen(th) < skops->skb_len)
nr_data++;
if (th->fin)
nr_fin++;
if (th->ack && !th->fin && tcp_hdrlen(th) == skops->skb_len)
nr_pure_ack++;
return CG_OK;
}
static int active_opt_len(struct bpf_sock_ops *skops)
{
int err;
/* Reserve more than enough to allow the -EEXIST test in
* write_active_opt().
*/
err = bpf_reserve_hdr_opt(skops, 12, 0);
if (err)
RET_CG_ERR(err);
return CG_OK;
}
static int write_active_opt(struct bpf_sock_ops *skops)
{
struct tcp_exprm_opt exprm_opt = {};
struct tcp_opt win_scale_opt = {};
struct tcp_opt reg_opt = {};
struct tcphdr *th;
int err, ret;
exprm_opt.kind = TCPOPT_EXP;
exprm_opt.len = 4;
exprm_opt.magic = __bpf_htons(0xeB9F);
reg_opt.kind = 0xB9;
reg_opt.len = 4;
reg_opt.data[0] = 0xfa;
reg_opt.data[1] = 0xce;
win_scale_opt.kind = TCPOPT_WINDOW;
err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
if (err)
RET_CG_ERR(err);
/* Store the same exprm option */
err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
if (err != -EEXIST)
RET_CG_ERR(err);
err = bpf_store_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
if (err)
RET_CG_ERR(err);
err = bpf_store_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
if (err != -EEXIST)
RET_CG_ERR(err);
/* Check the option has been written and can be searched */
ret = bpf_load_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
if (ret != 4 || exprm_opt.len != 4 || exprm_opt.kind != TCPOPT_EXP ||
exprm_opt.magic != __bpf_htons(0xeB9F))
RET_CG_ERR(ret);
reg_opt.len = 0;
ret = bpf_load_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
if (ret != 4 || reg_opt.len != 4 || reg_opt.kind != 0xB9 ||
reg_opt.data[0] != 0xfa || reg_opt.data[1] != 0xce)
RET_CG_ERR(ret);
th = skops->skb_data;
if (th + 1 > skops->skb_data_end)
RET_CG_ERR(0);
if (th->syn) {
active_lport_h = skops->local_port;
active_lport_n = th->source;
/* Search the win scale option written by the kernel
* in the SYN packet.
*/
ret = bpf_load_hdr_opt(skops, &win_scale_opt,
sizeof(win_scale_opt), 0);
if (ret != 3 || win_scale_opt.len != 3 ||
win_scale_opt.kind != TCPOPT_WINDOW)
RET_CG_ERR(ret);
/* Write the win scale option that the kernel
* has already written.
*/
err = bpf_store_hdr_opt(skops, &win_scale_opt,
sizeof(win_scale_opt), 0);
if (err != -EEXIST)
RET_CG_ERR(err);
}
return CG_OK;
}
static int handle_hdr_opt_len(struct bpf_sock_ops *skops)
{
__u8 tcp_flags = skops_tcp_flags(skops);
if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
/* Check the SYN from bpf_sock_ops_kern->syn_skb */
return check_active_syn_in(skops);
/* Passive side should have cleared the write hdr cb by now */
if (skops->local_port == passive_lport_h)
RET_CG_ERR(0);
return active_opt_len(skops);
}
static int handle_write_hdr_opt(struct bpf_sock_ops *skops)
{
if (skops->local_port == passive_lport_h)
RET_CG_ERR(0);
return write_active_opt(skops);
}
static int handle_parse_hdr(struct bpf_sock_ops *skops)
{
/* Passive side is not writing any non-standard/unknown
* option, so the active side should never be called.
*/
if (skops->local_port == active_lport_h)
RET_CG_ERR(0);
return check_active_hdr_in(skops);
}
static int handle_passive_estab(struct bpf_sock_ops *skops)
{
int err;
/* No more write hdr cb */
bpf_sock_ops_cb_flags_set(skops,
skops->bpf_sock_ops_cb_flags &
~BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
/* Recheck the SYN but check the tp->saved_syn this time */
err = check_active_syn_in(skops);
if (err == CG_ERR)
return err;
nr_syn++;
/* The ack has header option written by the active side also */
return check_active_hdr_in(skops);
}
SEC("sockops/misc_estab")
int misc_estab(struct bpf_sock_ops *skops)
{
int true_val = 1;
switch (skops->op) {
case BPF_SOCK_OPS_TCP_LISTEN_CB:
passive_lport_h = skops->local_port;
passive_lport_n = __bpf_htons(passive_lport_h);
bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
&true_val, sizeof(true_val));
set_hdr_cb_flags(skops);
break;
case BPF_SOCK_OPS_TCP_CONNECT_CB:
set_hdr_cb_flags(skops);
break;
case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
return handle_parse_hdr(skops);
case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
return handle_hdr_opt_len(skops);
case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
return handle_write_hdr_opt(skops);
case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
return handle_passive_estab(skops);
}
return CG_OK;
}
char _license[] SEC("license") = "GPL";


@@ -0,0 +1,623 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2020 Facebook */
#include <stddef.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/tcp.h>
#include <linux/socket.h>
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#define BPF_PROG_TEST_TCP_HDR_OPTIONS
#include "test_tcp_hdr_options.h"
#ifndef sizeof_field
#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
#endif
__u8 test_kind = TCPOPT_EXP;
__u16 test_magic = 0xeB9F;
struct bpf_test_option passive_synack_out = {};
struct bpf_test_option passive_fin_out = {};
struct bpf_test_option passive_estab_in = {};
struct bpf_test_option passive_fin_in = {};
struct bpf_test_option active_syn_out = {};
struct bpf_test_option active_fin_out = {};
struct bpf_test_option active_estab_in = {};
struct bpf_test_option active_fin_in = {};
struct {
__uint(type, BPF_MAP_TYPE_SK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
__type(key, int);
__type(value, struct hdr_stg);
} hdr_stg_map SEC(".maps");
static bool skops_want_cookie(const struct bpf_sock_ops *skops)
{
return skops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE;
}
static bool skops_current_mss(const struct bpf_sock_ops *skops)
{
return skops->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS;
}
static __u8 option_total_len(__u8 flags)
{
__u8 i, len = 1; /* +1 for flags */
if (!flags)
return 0;
/* RESEND bit does not use a byte */
for (i = OPTION_RESEND + 1; i < __NR_OPTION_FLAGS; i++)
len += !!TEST_OPTION_FLAGS(flags, i);
if (test_kind == TCPOPT_EXP)
return len + TCP_BPF_EXPOPT_BASE_LEN;
else
return len + 2; /* +1 kind, +1 kind-len */
}
static void write_test_option(const struct bpf_test_option *test_opt,
__u8 *data)
{
__u8 offset = 0;
data[offset++] = test_opt->flags;
if (TEST_OPTION_FLAGS(test_opt->flags, OPTION_MAX_DELACK_MS))
data[offset++] = test_opt->max_delack_ms;
if (TEST_OPTION_FLAGS(test_opt->flags, OPTION_RAND))
data[offset++] = test_opt->rand;
}
static int store_option(struct bpf_sock_ops *skops,
const struct bpf_test_option *test_opt)
{
union {
struct tcp_exprm_opt exprm;
struct tcp_opt regular;
} write_opt;
int err;
if (test_kind == TCPOPT_EXP) {
write_opt.exprm.kind = TCPOPT_EXP;
write_opt.exprm.len = option_total_len(test_opt->flags);
write_opt.exprm.magic = __bpf_htons(test_magic);
write_opt.exprm.data32 = 0;
write_test_option(test_opt, write_opt.exprm.data);
err = bpf_store_hdr_opt(skops, &write_opt.exprm,
sizeof(write_opt.exprm), 0);
} else {
write_opt.regular.kind = test_kind;
write_opt.regular.len = option_total_len(test_opt->flags);
write_opt.regular.data32 = 0;
write_test_option(test_opt, write_opt.regular.data);
err = bpf_store_hdr_opt(skops, &write_opt.regular,
sizeof(write_opt.regular), 0);
}
if (err)
RET_CG_ERR(err);
return CG_OK;
}
static int parse_test_option(struct bpf_test_option *opt, const __u8 *start)
{
opt->flags = *start++;
if (TEST_OPTION_FLAGS(opt->flags, OPTION_MAX_DELACK_MS))
opt->max_delack_ms = *start++;
if (TEST_OPTION_FLAGS(opt->flags, OPTION_RAND))
opt->rand = *start++;
return 0;
}
static int load_option(struct bpf_sock_ops *skops,
struct bpf_test_option *test_opt, bool from_syn)
{
union {
struct tcp_exprm_opt exprm;
struct tcp_opt regular;
} search_opt;
int ret, load_flags = from_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0;
if (test_kind == TCPOPT_EXP) {
search_opt.exprm.kind = TCPOPT_EXP;
search_opt.exprm.len = 4;
search_opt.exprm.magic = __bpf_htons(test_magic);
search_opt.exprm.data32 = 0;
ret = bpf_load_hdr_opt(skops, &search_opt.exprm,
sizeof(search_opt.exprm), load_flags);
if (ret < 0)
return ret;
return parse_test_option(test_opt, search_opt.exprm.data);
} else {
search_opt.regular.kind = test_kind;
search_opt.regular.len = 0;
search_opt.regular.data32 = 0;
ret = bpf_load_hdr_opt(skops, &search_opt.regular,
sizeof(search_opt.regular), load_flags);
if (ret < 0)
return ret;
return parse_test_option(test_opt, search_opt.regular.data);
}
}
static int synack_opt_len(struct bpf_sock_ops *skops)
{
struct bpf_test_option test_opt = {};
__u8 optlen;
int err;
if (!passive_synack_out.flags)
return CG_OK;
err = load_option(skops, &test_opt, true);
/* bpf_test_option is not found */
if (err == -ENOMSG)
return CG_OK;
if (err)
RET_CG_ERR(err);
optlen = option_total_len(passive_synack_out.flags);
if (optlen) {
err = bpf_reserve_hdr_opt(skops, optlen, 0);
if (err)
RET_CG_ERR(err);
}
return CG_OK;
}
static int write_synack_opt(struct bpf_sock_ops *skops)
{
struct bpf_test_option opt;
if (!passive_synack_out.flags)
/* We should not even be called since no header
* space has been reserved.
*/
RET_CG_ERR(0);
opt = passive_synack_out;
if (skops_want_cookie(skops))
SET_OPTION_FLAGS(opt.flags, OPTION_RESEND);
return store_option(skops, &opt);
}
static int syn_opt_len(struct bpf_sock_ops *skops)
{
__u8 optlen;
int err;
if (!active_syn_out.flags)
return CG_OK;
optlen = option_total_len(active_syn_out.flags);
if (optlen) {
err = bpf_reserve_hdr_opt(skops, optlen, 0);
if (err)
RET_CG_ERR(err);
}
return CG_OK;
}
static int write_syn_opt(struct bpf_sock_ops *skops)
{
if (!active_syn_out.flags)
RET_CG_ERR(0);
return store_option(skops, &active_syn_out);
}
static int fin_opt_len(struct bpf_sock_ops *skops)
{
struct bpf_test_option *opt;
struct hdr_stg *hdr_stg;
__u8 optlen;
int err;
if (!skops->sk)
RET_CG_ERR(0);
hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
if (!hdr_stg)
RET_CG_ERR(0);
if (hdr_stg->active)
opt = &active_fin_out;
else
opt = &passive_fin_out;
optlen = option_total_len(opt->flags);
if (optlen) {
err = bpf_reserve_hdr_opt(skops, optlen, 0);
if (err)
RET_CG_ERR(err);
}
return CG_OK;
}
static int write_fin_opt(struct bpf_sock_ops *skops)
{
struct bpf_test_option *opt;
struct hdr_stg *hdr_stg;
if (!skops->sk)
RET_CG_ERR(0);
hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
if (!hdr_stg)
RET_CG_ERR(0);
if (hdr_stg->active)
opt = &active_fin_out;
else
opt = &passive_fin_out;
if (!opt->flags)
RET_CG_ERR(0);
return store_option(skops, opt);
}
static int resend_in_ack(struct bpf_sock_ops *skops)
{
struct hdr_stg *hdr_stg;
if (!skops->sk)
return -1;
hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
if (!hdr_stg)
return -1;
return !!hdr_stg->resend_syn;
}
static int nodata_opt_len(struct bpf_sock_ops *skops)
{
int resend;
resend = resend_in_ack(skops);
if (resend < 0)
RET_CG_ERR(0);
if (resend)
return syn_opt_len(skops);
return CG_OK;
}
static int write_nodata_opt(struct bpf_sock_ops *skops)
{
int resend;
resend = resend_in_ack(skops);
if (resend < 0)
RET_CG_ERR(0);
if (resend)
return write_syn_opt(skops);
return CG_OK;
}
static int data_opt_len(struct bpf_sock_ops *skops)
{
/* Same as the nodata version. Mostly to show
* an example usage of skops->skb_len.
*/
return nodata_opt_len(skops);
}
static int write_data_opt(struct bpf_sock_ops *skops)
{
return write_nodata_opt(skops);
}
static int current_mss_opt_len(struct bpf_sock_ops *skops)
{
/* Reserve maximum that may be needed */
int err;
err = bpf_reserve_hdr_opt(skops, option_total_len(OPTION_MASK), 0);
if (err)
RET_CG_ERR(err);
return CG_OK;
}
static int handle_hdr_opt_len(struct bpf_sock_ops *skops)
{
__u8 tcp_flags = skops_tcp_flags(skops);
if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
return synack_opt_len(skops);
if (tcp_flags & TCPHDR_SYN)
return syn_opt_len(skops);
if (tcp_flags & TCPHDR_FIN)
return fin_opt_len(skops);
if (skops_current_mss(skops))
/* The kernel is calculating the MSS */
return current_mss_opt_len(skops);
if (skops->skb_len)
return data_opt_len(skops);
return nodata_opt_len(skops);
}
static int handle_write_hdr_opt(struct bpf_sock_ops *skops)
{
__u8 tcp_flags = skops_tcp_flags(skops);
struct tcphdr *th;
if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
return write_synack_opt(skops);
if (tcp_flags & TCPHDR_SYN)
return write_syn_opt(skops);
if (tcp_flags & TCPHDR_FIN)
return write_fin_opt(skops);
th = skops->skb_data;
if (th + 1 > skops->skb_data_end)
RET_CG_ERR(0);
if (skops->skb_len > tcp_hdrlen(th))
return write_data_opt(skops);
return write_nodata_opt(skops);
}
static int set_delack_max(struct bpf_sock_ops *skops, __u8 max_delack_ms)
{
__u32 max_delack_us = max_delack_ms * 1000;
return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX,
&max_delack_us, sizeof(max_delack_us));
}
static int set_rto_min(struct bpf_sock_ops *skops, __u8 peer_max_delack_ms)
{
__u32 min_rto_us = peer_max_delack_ms * 1000;
return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN, &min_rto_us,
sizeof(min_rto_us));
}
static int handle_active_estab(struct bpf_sock_ops *skops)
{
struct hdr_stg init_stg = {
.active = true,
};
int err;
err = load_option(skops, &active_estab_in, false);
if (err && err != -ENOMSG)
RET_CG_ERR(err);
init_stg.resend_syn = TEST_OPTION_FLAGS(active_estab_in.flags,
OPTION_RESEND);
if (!skops->sk || !bpf_sk_storage_get(&hdr_stg_map, skops->sk,
&init_stg,
BPF_SK_STORAGE_GET_F_CREATE))
RET_CG_ERR(0);
if (init_stg.resend_syn)
/* Don't clear the write_hdr cb now because
* the ACK may get lost and retransmit may
* be needed.
*
* PARSE_ALL_HDR cb flag is set to learn if this
* resend_syn option has been received by the peer.
*
* The header option will be resent until a valid
* packet is received at handle_parse_hdr()
* and all hdr cb flags will be cleared in
* handle_parse_hdr().
*/
set_parse_all_hdr_cb_flags(skops);
else if (!active_fin_out.flags)
/* No options will be written from now */
clear_hdr_cb_flags(skops);
if (active_syn_out.max_delack_ms) {
err = set_delack_max(skops, active_syn_out.max_delack_ms);
if (err)
RET_CG_ERR(err);
}
if (active_estab_in.max_delack_ms) {
err = set_rto_min(skops, active_estab_in.max_delack_ms);
if (err)
RET_CG_ERR(err);
}
return CG_OK;
}
static int handle_passive_estab(struct bpf_sock_ops *skops)
{
struct hdr_stg init_stg = {};
struct tcphdr *th;
int err;
err = load_option(skops, &passive_estab_in, true);
if (err == -ENOENT) {
/* saved_syn is not found. It was in syncookie mode.
* We have asked the active side to resend the options
* in ACK, so try to find the bpf_test_option from ACK now.
*/
err = load_option(skops, &passive_estab_in, false);
init_stg.syncookie = true;
}
/* ENOMSG: The bpf_test_option is not found, which is fine.
* Bail out now for all other errors.
*/
if (err && err != -ENOMSG)
RET_CG_ERR(err);
th = skops->skb_data;
if (th + 1 > skops->skb_data_end)
RET_CG_ERR(0);
if (th->syn) {
/* Fastopen */
/* Cannot clear cb_flags to stop write_hdr cb.
* synack is not sent yet for fast open.
* Even if it was, the synack may need to be retransmitted.
*
* PARSE_ALL_HDR cb flag is set to learn
* if synack has reached the peer.
* All cb_flags will be cleared in handle_parse_hdr().
*/
set_parse_all_hdr_cb_flags(skops);
init_stg.fastopen = true;
} else if (!passive_fin_out.flags) {
/* No options will be written from now */
clear_hdr_cb_flags(skops);
}
if (!skops->sk ||
!bpf_sk_storage_get(&hdr_stg_map, skops->sk, &init_stg,
BPF_SK_STORAGE_GET_F_CREATE))
RET_CG_ERR(0);
if (passive_synack_out.max_delack_ms) {
err = set_delack_max(skops, passive_synack_out.max_delack_ms);
if (err)
RET_CG_ERR(err);
}
if (passive_estab_in.max_delack_ms) {
err = set_rto_min(skops, passive_estab_in.max_delack_ms);
if (err)
RET_CG_ERR(err);
}
return CG_OK;
}
static int handle_parse_hdr(struct bpf_sock_ops *skops)
{
struct hdr_stg *hdr_stg;
struct tcphdr *th;
if (!skops->sk)
RET_CG_ERR(0);
th = skops->skb_data;
if (th + 1 > skops->skb_data_end)
RET_CG_ERR(0);
hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
if (!hdr_stg)
RET_CG_ERR(0);
if (hdr_stg->resend_syn || hdr_stg->fastopen)
/* The PARSE_ALL_HDR cb flag was turned on
* to ensure that the previously written
* options have reached the peer.
* Those previously written options include:
* - Active side: resend_syn in ACK during syncookie
* or
* - Passive side: SYNACK during fastopen
*
* A valid packet has been received here after
* the 3WHS, so the PARSE_ALL_HDR cb flag
* can be cleared now.
*/
clear_parse_all_hdr_cb_flags(skops);
if (hdr_stg->resend_syn && !active_fin_out.flags)
/* Active side resent the syn option in ACK
* because the server was in syncookie mode.
* A valid packet has been received, so
* clear header cb flags if there is no
* more option to send.
*/
clear_hdr_cb_flags(skops);
if (hdr_stg->fastopen && !passive_fin_out.flags)
/* Passive side was in fastopen.
* A valid packet has been received, so
* the SYNACK has reached the peer.
* Clear header cb flags if there is no more
* option to send.
*/
clear_hdr_cb_flags(skops);
if (th->fin) {
struct bpf_test_option *fin_opt;
int err;
if (hdr_stg->active)
fin_opt = &active_fin_in;
else
fin_opt = &passive_fin_in;
err = load_option(skops, fin_opt, false);
if (err && err != -ENOMSG)
RET_CG_ERR(err);
}
return CG_OK;
}
SEC("sockops/estab")
int estab(struct bpf_sock_ops *skops)
{
int true_val = 1;
switch (skops->op) {
case BPF_SOCK_OPS_TCP_LISTEN_CB:
bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
&true_val, sizeof(true_val));
set_hdr_cb_flags(skops);
break;
case BPF_SOCK_OPS_TCP_CONNECT_CB:
set_hdr_cb_flags(skops);
break;
case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
return handle_parse_hdr(skops);
case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
return handle_hdr_opt_len(skops);
case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
return handle_write_hdr_opt(skops);
case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
return handle_passive_estab(skops);
case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
return handle_active_estab(skops);
}
return CG_OK;
}
char _license[] SEC("license") = "GPL";


@@ -0,0 +1,151 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2020 Facebook */
#ifndef _TEST_TCP_HDR_OPTIONS_H
#define _TEST_TCP_HDR_OPTIONS_H
struct bpf_test_option {
__u8 flags;
__u8 max_delack_ms;
__u8 rand;
} __attribute__((packed));
enum {
OPTION_RESEND,
OPTION_MAX_DELACK_MS,
OPTION_RAND,
__NR_OPTION_FLAGS,
};
#define OPTION_F_RESEND (1 << OPTION_RESEND)
#define OPTION_F_MAX_DELACK_MS (1 << OPTION_MAX_DELACK_MS)
#define OPTION_F_RAND (1 << OPTION_RAND)
#define OPTION_MASK ((1 << __NR_OPTION_FLAGS) - 1)
#define TEST_OPTION_FLAGS(flags, option) (1 & ((flags) >> (option)))
#define SET_OPTION_FLAGS(flags, option) ((flags) |= (1 << (option)))
/* Store in bpf_sk_storage */
struct hdr_stg {
bool active;
bool resend_syn; /* active side only */
bool syncookie; /* passive side only */
bool fastopen; /* passive side only */
};
struct linum_err {
unsigned int linum;
int err;
};
#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02
#define TCPHDR_RST 0x04
#define TCPHDR_PSH 0x08
#define TCPHDR_ACK 0x10
#define TCPHDR_URG 0x20
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
#define TCPHDR_SYNACK (TCPHDR_SYN | TCPHDR_ACK)
#define TCPOPT_EOL 0
#define TCPOPT_NOP 1
#define TCPOPT_WINDOW 3
#define TCPOPT_EXP 254
#define TCP_BPF_EXPOPT_BASE_LEN 4
#define MAX_TCP_HDR_LEN 60
#define MAX_TCP_OPTION_SPACE 40
#ifdef BPF_PROG_TEST_TCP_HDR_OPTIONS
#define CG_OK 1
#define CG_ERR 0
#ifndef SOL_TCP
#define SOL_TCP 6
#endif
struct tcp_exprm_opt {
__u8 kind;
__u8 len;
__u16 magic;
union {
__u8 data[4];
__u32 data32;
};
} __attribute__((packed));
struct tcp_opt {
__u8 kind;
__u8 len;
union {
__u8 data[4];
__u32 data32;
};
} __attribute__((packed));
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 2);
__type(key, int);
__type(value, struct linum_err);
} lport_linum_map SEC(".maps");
static inline unsigned int tcp_hdrlen(const struct tcphdr *th)
{
return th->doff << 2;
}
static inline __u8 skops_tcp_flags(const struct bpf_sock_ops *skops)
{
return skops->skb_tcp_flags;
}
static inline void clear_hdr_cb_flags(struct bpf_sock_ops *skops)
{
bpf_sock_ops_cb_flags_set(skops,
skops->bpf_sock_ops_cb_flags &
~(BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG));
}
static inline void set_hdr_cb_flags(struct bpf_sock_ops *skops)
{
bpf_sock_ops_cb_flags_set(skops,
skops->bpf_sock_ops_cb_flags |
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
}
static inline void
clear_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops)
{
bpf_sock_ops_cb_flags_set(skops,
skops->bpf_sock_ops_cb_flags &
~BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
}
static inline void
set_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops)
{
bpf_sock_ops_cb_flags_set(skops,
skops->bpf_sock_ops_cb_flags |
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
}
#define RET_CG_ERR(__err) ({ \
struct linum_err __linum_err; \
int __lport; \
\
__linum_err.linum = __LINE__; \
__linum_err.err = __err; \
__lport = skops->local_port; \
bpf_map_update_elem(&lport_linum_map, &__lport, &__linum_err, BPF_NOEXIST); \
clear_hdr_cb_flags(skops); \
clear_parse_all_hdr_cb_flags(skops); \
return CG_ERR; \
})
#endif /* BPF_PROG_TEST_TCP_HDR_OPTIONS */
#endif /* _TEST_TCP_HDR_OPTIONS_H */