Skip to content
Snippets Groups Projects
  1. May 23, 2024
    • Paolo Abeni's avatar
      net: relax socket state check at accept time. · 26afda78
      Paolo Abeni authored
      
      Christoph reported the following splat:
      
      WARNING: CPU: 1 PID: 772 at net/ipv4/af_inet.c:761 __inet_accept+0x1f4/0x4a0
      Modules linked in:
      CPU: 1 PID: 772 Comm: syz-executor510 Not tainted 6.9.0-rc7-g7da7119fe22b #56
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      RIP: 0010:__inet_accept+0x1f4/0x4a0 net/ipv4/af_inet.c:759
      Code: 04 38 84 c0 0f 85 87 00 00 00 41 c7 04 24 03 00 00 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 ec b7 da fd <0f> 0b e9 7f fe ff ff e8 e0 b7 da fd 0f 0b e9 fe fe ff ff 89 d9 80
      RSP: 0018:ffffc90000c2fc58 EFLAGS: 00010293
      RAX: ffffffff836bdd14 RBX: 0000000000000000 RCX: ffff888104668000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: dffffc0000000000 R08: ffffffff836bdb89 R09: fffff52000185f64
      R10: dffffc0000000000 R11: fffff52000185f64 R12: dffffc0000000000
      R13: 1ffff92000185f98 R14: ffff88810754d880 R15: ffff8881007b7800
      FS:  000000001c772880(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fb9fcf2e178 CR3: 00000001045d2002 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       inet_accept+0x138/0x1d0 net/ipv4/af_inet.c:786
       do_accept+0x435/0x620 net/socket.c:1929
       __sys_accept4_file net/socket.c:1969 [inline]
       __sys_accept4+0x9b/0x110 net/socket.c:1999
       __do_sys_accept net/socket.c:2016 [inline]
       __se_sys_accept net/socket.c:2013 [inline]
       __x64_sys_accept+0x7d/0x90 net/socket.c:2013
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x58/0x100 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
      RIP: 0033:0x4315f9
      Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 ab b4 fd ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffdb26d9c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b
      RAX: ffffffffffffffda RBX: 0000000000400300 RCX: 00000000004315f9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00000000006e1018 R08: 0000000000400300 R09: 0000000000400300
      R10: 0000000000400300 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000040cdf0 R14: 000000000040ce80 R15: 0000000000000055
       </TASK>
      
      The reproducer invokes shutdown() before entering the listener status.
      After commit 94062790 ("tcp: defer shutdown(SEND_SHUTDOWN) for
      TCP_SYN_RECV sockets"), the above causes the child to reach the accept
      syscall in FIN_WAIT1 status.
      
      Eric noted we can relax the existing assertion in __inet_accept()
      
      Reported-by: default avatarChristoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/490
      
      
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: 94062790 ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets")
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/23ab880a44d8cfd967e84de8b93dbf48848e3d8c.1716299669.git.pabeni@redhat.com
      
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      26afda78
    • Jason Xing's avatar
      tcp: remove 64 KByte limit for initial tp->rcv_wnd value · 378979e9
      Jason Xing authored
      
      Recently, we had some servers upgraded to the latest kernel and noticed
      the indicator from the user side showed worse results than before. It is
      caused by the limitation of tp->rcv_wnd.
      
      In 2018 commit a337531b ("tcp: up initial rmem to 128KB and SYN rwin
      to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
      CDN teams would not benefit from this change because they cannot have a
      large window to receive a big packet, which will be slowed down especially
      in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's
      the side effect for the latency/time-sensitive users.
      
      To avoid future confusion, current change doesn't affect the initial
      receive window on the wire in a SYN or SYN+ACK packet which are set within
      65535 bytes according to RFC 7323 also due to the limit in
      __tcp_transmit_skb():
      
          th->window      = htons(min(tp->rcv_wnd, 65535U));
      
      In one word, __tcp_transmit_skb() already ensures that constraint is
      respected, no matter how large tp->rcv_wnd is. The change doesn't violate
      RFC.
      
      Let me provide one example if with or without the patch:
      Before:
      client   --- SYN: rwindow=65535 ---> server
      client   <--- SYN+ACK: rwindow=65535 ----  server
      client   --- ACK: rwindow=65536 ---> server
      Note: for the last ACK, the calculation is 512 << 7.
      
      After:
      client   --- SYN: rwindow=65535 ---> server
      client   <--- SYN+ACK: rwindow=65535 ----  server
      client   --- ACK: rwindow=175232 ---> server
      Note: I use the following command to make it work:
      ip route change default via [ip] dev eth0 metric 100 initrwnd 120
      For the last ACK, the calculation is 1369 << 7.
      
      When we apply such a patch, having a large rcv_wnd if the user tweak this
      knob can help transfer data more rapidly and save some rtts.
      
      Fixes: a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      Signed-off-by: default avatarJason Xing <kernelxing@tencent.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.com
      
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      378979e9
    • Dae R. Jeong's avatar
      tls: fix missing memory barrier in tls_init · 91e61dd7
      Dae R. Jeong authored
      
      In tls_init(), a write memory barrier is missing, and store-store
      reordering may cause NULL dereference in tls_{setsockopt,getsockopt}.
      
      CPU0                               CPU1
      -----                              -----
      // In tls_init()
      // In tls_ctx_create()
      ctx = kzalloc()
      ctx->sk_proto = READ_ONCE(sk->sk_prot) -(1)
      
      // In update_sk_prot()
      WRITE_ONCE(sk->sk_prot, tls_prots)     -(2)
      
                                         // In sock_common_setsockopt()
                                         READ_ONCE(sk->sk_prot)->setsockopt()
      
                                         // In tls_{setsockopt,getsockopt}()
                                         ctx->sk_proto->setsockopt()    -(3)
      
      In the above scenario, when (1) and (2) are reordered, (3) can observe
      the NULL value of ctx->sk_proto, causing NULL dereference.
      
      To fix it, we rely on rcu_assign_pointer() which implies the release
      barrier semantic. By moving rcu_assign_pointer() after ctx->sk_proto is
      initialized, we can ensure that ctx->sk_proto are visible when
      changing sk->sk_prot.
      
      Fixes: d5bee737 ("net/tls: Annotate access to sk_prot with READ_ONCE/WRITE_ONCE")
      Signed-off-by: default avatarYewon Choi <woni9911@gmail.com>
      Signed-off-by: default avatarDae R. Jeong <threeearcat@gmail.com>
      Link: https://lore.kernel.org/netdev/ZU4OJG56g2V9z_H7@dragonet/T/
      Link: https://lore.kernel.org/r/Zkx4vjSFp0mfpjQ2@libra05
      
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      91e61dd7
  2. May 21, 2024
    • Aaron Conole's avatar
      openvswitch: Set the skbuff pkt_type for proper pmtud support. · 30a92c9e
      Aaron Conole authored
      
      Open vSwitch is originally intended to switch at layer 2, only dealing with
      Ethernet frames.  With the introduction of l3 tunnels support, it crossed
      into the realm of needing to care a bit about some routing details when
      making forwarding decisions.  If an oversized packet would need to be
      fragmented during this forwarding decision, there is a chance for pmtu
      to get involved and generate a routing exception.  This is gated by the
      skbuff->pkt_type field.
      
      When a flow is already loaded into the openvswitch module this field is
      set up and transitioned properly as a packet moves from one port to
      another.  In the case that a packet execute is invoked after a flow is
      newly installed this field is not properly initialized.  This causes the
      pmtud mechanism to omit sending the required exception messages across
      the tunnel boundary and a second attempt needs to be made to make sure
      that the routing exception is properly setup.  To fix this, we set the
      outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
      to the openvswitch module via a port device or packet command.
      
      Even for bridge ports as users, the pkt_type needs to be reset when
      doing the transmit as the packet is truly outgoing and routing needs
      to get involved post packet transformations, in the case of
      VXLAN/GENEVE/udp-tunnel packets.  In general, the pkt_type on output
      gets ignored, since we go straight to the driver, but in the case of
      tunnel ports they go through IP routing layer.
      
      This issue is periodically encountered in complex setups, such as large
      openshift deployments, where multiple sets of tunnel traversal occurs.
      A way to recreate this is with the ovn-heater project that can setup
      a networking environment which mimics such large deployments.  We need
      larger environments for this because we need to ensure that flow
      misses occur.  In these environment, without this patch, we can see:
      
        ./ovn_cluster.sh start
        podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
        podman exec ovn-chassis-1 ip netns exec sw01p1 ip r flush cache
        podman exec ovn-chassis-1 ip netns exec sw01p1 \
               ping 21.0.0.3 -M do -s 1300 -c2
        PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
        From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)
      
        --- 21.0.0.3 ping statistics ---
        ...
      
      Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
      sent into the server.
      
      With this patch, setting the pkt_type, we see the following:
      
        podman exec ovn-chassis-1 ip netns exec sw01p1 \
               ping 21.0.0.3 -M do -s 1300 -c2
        PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
        From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
        ping: local error: message too long, mtu=1222
      
        --- 21.0.0.3 ping statistics ---
        ...
      
      In this case, the first ping request receives the FRAG_NEEDED message and
      a local routing exception is created.
      
      Tested-by: default avatarJaime Caamano <jcaamano@redhat.com>
      Reported-at: https://issues.redhat.com/browse/FDP-164
      
      
      Fixes: 58264848 ("openvswitch: Add vxlan tunneling support.")
      Signed-off-by: default avatarAaron Conole <aconole@redhat.com>
      Acked-by: default avatarEelco Chaudron <echaudro@redhat.com>
      Link: https://lore.kernel.org/r/20240516200941.16152-1-aconole@redhat.com
      
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      30a92c9e
    • Michal Luczaj's avatar
      af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS · 041933a1
      Michal Luczaj authored
      
      GC attempts to explicitly drop oob_skb's reference before purging the hit
      list.
      
      The problem is with embryos: kfree_skb(u->oob_skb) is never called on an
      embryo socket.
      
      The python script below [0] sends a listener's fd to its embryo as OOB
      data.  While GC does collect the embryo's queue, it fails to drop the OOB
      skb's refcount.  The skb which was in embryo's receive queue stays as
      unix_sk(sk)->oob_skb and keeps the listener's refcount [1].
      
      Tell GC to dispose embryo's oob_skb.
      
      [0]:
      from array import array
      from socket import *
      
      addr = '\x00unix-oob'
      lis = socket(AF_UNIX, SOCK_STREAM)
      lis.bind(addr)
      lis.listen(1)
      
      s = socket(AF_UNIX, SOCK_STREAM)
      s.connect(addr)
      scm = (SOL_SOCKET, SCM_RIGHTS, array('i', [lis.fileno()]))
      s.sendmsg([b'x'], [scm], MSG_OOB)
      lis.close()
      
      [1]
      $ grep unix-oob /proc/net/unix
      $ ./unix-oob.py
      $ grep unix-oob /proc/net/unix
      0000000000000000: 00000002 00000000 00000000 0001 02     0 @unix-oob
      0000000000000000: 00000002 00000000 00010000 0001 01  6072 @unix-oob
      
      Fixes: 4090fa37 ("af_unix: Replace garbage collection algorithm.")
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      Reviewed-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      041933a1
    • Kuniyuki Iwashima's avatar
      tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). · 3ebc46ca
      Kuniyuki Iwashima authored
      
      In dctcp_update_alpha(), we use a module parameter dctcp_shift_g
      as follows:
      
        alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g);
        ...
        delivered_ce <<= (10 - dctcp_shift_g);
      
      It seems syzkaller started fuzzing module parameters and triggered
      shift-out-of-bounds [0] by setting 100 to dctcp_shift_g:
      
        memcpy((void*)0x20000080,
               "/sys/module/tcp_dctcp/parameters/dctcp_shift_g\000", 47);
        res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x20000080ul,
                      /*flags=*/2ul, /*mode=*/0ul);
        memcpy((void*)0x20000000, "100\000", 4);
        syscall(__NR_write, /*fd=*/r[0], /*val=*/0x20000000ul, /*len=*/4ul);
      
      Let's limit the max value of dctcp_shift_g by param_set_uint_minmax().
      
      With this patch:
      
        # echo 10 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g
        # cat /sys/module/tcp_dctcp/parameters/dctcp_shift_g
        10
        # echo 11 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g
        -bash: echo: write error: Invalid argument
      
      [0]:
      UBSAN: shift-out-of-bounds in net/ipv4/tcp_dctcp.c:143:12
      shift exponent 100 is too large for 32-bit type 'u32' (aka 'unsigned int')
      CPU: 0 PID: 8083 Comm: syz-executor345 Not tainted 6.9.0-05151-g1b294a1f3561 #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x201/0x300 lib/dump_stack.c:114
       ubsan_epilogue lib/ubsan.c:231 [inline]
       __ubsan_handle_shift_out_of_bounds+0x346/0x3a0 lib/ubsan.c:468
       dctcp_update_alpha+0x540/0x570 net/ipv4/tcp_dctcp.c:143
       tcp_in_ack_event net/ipv4/tcp_input.c:3802 [inline]
       tcp_ack+0x17b1/0x3bc0 net/ipv4/tcp_input.c:3948
       tcp_rcv_state_process+0x57a/0x2290 net/ipv4/tcp_input.c:6711
       tcp_v4_do_rcv+0x764/0xc40 net/ipv4/tcp_ipv4.c:1937
       sk_backlog_rcv include/net/sock.h:1106 [inline]
       __release_sock+0x20f/0x350 net/core/sock.c:2983
       release_sock+0x61/0x1f0 net/core/sock.c:3549
       mptcp_subflow_shutdown+0x3d0/0x620 net/mptcp/protocol.c:2907
       mptcp_check_send_data_fin+0x225/0x410 net/mptcp/protocol.c:2976
       __mptcp_close+0x238/0xad0 net/mptcp/protocol.c:3072
       mptcp_close+0x2a/0x1a0 net/mptcp/protocol.c:3127
       inet_release+0x190/0x1f0 net/ipv4/af_inet.c:437
       __sock_release net/socket.c:659 [inline]
       sock_close+0xc0/0x240 net/socket.c:1421
       __fput+0x41b/0x890 fs/file_table.c:422
       task_work_run+0x23b/0x300 kernel/task_work.c:180
       exit_task_work include/linux/task_work.h:38 [inline]
       do_exit+0x9c8/0x2540 kernel/exit.c:878
       do_group_exit+0x201/0x2b0 kernel/exit.c:1027
       __do_sys_exit_group kernel/exit.c:1038 [inline]
       __se_sys_exit_group kernel/exit.c:1036 [inline]
       __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xe4/0x240 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x67/0x6f
      RIP: 0033:0x7f6c2b5005b6
      Code: Unable to access opcode bytes at 0x7f6c2b50058c.
      RSP: 002b:00007ffe883eb948 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 00007f6c2b5862f0 RCX: 00007f6c2b5005b6
      RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
      RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffffc0
      R10: 0000000000000006 R11: 0000000000000246 R12: 00007f6c2b5862f0
      R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Reported-by: default avatarYue Sun <samsun1006219@gmail.com>
      Reported-by: default avatarxingwei lee <xrivendell7@gmail.com>
      Closes: https://lore.kernel.org/netdev/CAEkJfYNJM=cw-8x7_Vmj1J6uYVCWMbbvD=EFmDPVBGpTsqOxEA@mail.gmail.com/
      
      
      Fixes: e3118e83 ("net: tcp: add DCTCP congestion control algorithm")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240517091626.32772-1-kuniyu@amazon.com
      
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      3ebc46ca
    • Hangbin Liu's avatar
      ipv6: sr: fix memleak in seg6_hmac_init_algo · efb9f4f1
      Hangbin Liu authored
      
      seg6_hmac_init_algo returns without cleaning up the previous allocations
      if one fails, so it's going to leak all that memory and the crypto tfms.
      
      Update seg6_hmac_exit to only free the memory when allocated, so we can
      reuse the code directly.
      
      Fixes: bf355b8d ("ipv6: sr: add core files for SR HMAC support")
      Reported-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Closes: https://lore.kernel.org/netdev/Zj3bh-gE7eT6V6aH@hog/
      
      
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Reviewed-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/20240517005435.2600277-1-liuhangbin@gmail.com
      
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      efb9f4f1
    • Kuniyuki Iwashima's avatar
      af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. · 9841991a
      Kuniyuki Iwashima authored
      Billy Jheng Bing-Jhong reported a race between __unix_gc() and
      queue_oob().
      
      __unix_gc() tries to garbage-collect close()d inflight sockets,
      and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC
      will drop the reference and set NULL to it locklessly.
      
      However, the peer socket still can send MSG_OOB message and
      queue_oob() can update unix_sk(sk)->oob_skb concurrently, leading
      NULL pointer dereference. [0]
      
      To fix the issue, let's update unix_sk(sk)->oob_skb under the
      sk_receive_queue's lock and take it everywhere we touch oob_skb.
      
      Note that we defer kfree_skb() in manage_oob() to silence lockdep
      false-positive (See [1]).
      
      [0]:
      BUG: kernel NULL pointer dereference, address: 0000000000000008
       PF: supervisor write access in kernel mode
       PF: error_code(0x0002) - not-present page
      PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0
      Oops: 0002 [#1] PREEMPT SMP PTI
      CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579b864 #110
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      Workqueue: events delayed_fput
      RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847)
      Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc
      RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002
      RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9
      RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00
      RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001
      R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00
      R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80
      FS:  0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0
      PKRU: 55555554
      Call Trace:
       <TASK>
       unix_release_sock (net/unix/af_unix.c:654)
       unix_release (net/unix/af_unix.c:1050)
       __sock_release (net/socket.c:660)
       sock_close (net/socket.c:1423)
       __fput (fs/file_table.c:423)
       delayed_fput (fs/file_table.c:444 (discriminator 3))
       process_one_work (kernel/workqueue.c:3259)
       worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416)
       kthread (kernel/kthread.c:388)
       ret_from_fork (arch/x86/kernel/process.c:153)
       ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
       </TASK>
      Modules linked in:
      CR2: 0000000000000008
      
      Link: https://lore.kernel.org/netdev/a00d3993-c461-43f2-be6d-07259c98509a@rbox.co/
      
       [1]
      Fixes: 1279f9d9 ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.")
      Reported-by: default avatarBilly Jheng Bing-Jhong <billy@starlabs.sg>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240516134835.8332-1-kuniyu@amazon.com
      
      
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      9841991a
  3. May 20, 2024
  4. May 17, 2024
    • Tom Parkin's avatar
      l2tp: fix ICMP error handling for UDP-encap sockets · 6e828dc6
      Tom Parkin authored
      
      Since commit a36e185e
      ("udp: Handle ICMP errors for tunnels with same destination port on both endpoints")
      UDP's handling of ICMP errors has allowed for UDP-encap tunnels to
      determine socket associations in scenarios where the UDP hash lookup
      could not.
      
      Subsequently, commit d26796ae
      ("udp: check udp sock encap_type in __udp_lib_err")
      subtly tweaked the approach such that UDP ICMP error handling would be
      skipped for any UDP socket which has encapsulation enabled.
      
      In the case of L2TP tunnel sockets using UDP-encap, this latter
      modification effectively broke ICMP error reporting for the L2TP
      control plane.
      
      To a degree this isn't catastrophic inasmuch as the L2TP control
      protocol defines a reliable transport on top of the underlying packet
      switching network which will eventually detect errors and time out.
      
      However, paying attention to the ICMP error reporting allows for more
      timely detection of errors in L2TP userspace, and aids in debugging
      connectivity issues.
      
      Reinstate ICMP error handling for UDP encap L2TP tunnels:
      
       * implement struct udp_tunnel_sock_cfg .encap_err_rcv in order to allow
         the L2TP code to handle ICMP errors;
      
       * only implement error-handling for tunnels which have a managed
         socket: unmanaged tunnels using a kernel socket have no userspace to
         report errors back to;
      
       * flag the error on the socket, which allows for userspace to get an
         error such as -ECONNREFUSED back from sendmsg/recvmsg;
      
       * pass the error into ip[v6]_icmp_error() which allows for userspace to
         get extended error information via. MSG_ERRQUEUE.
      
      Fixes: d26796ae ("udp: check udp sock encap_type in __udp_lib_err")
      Signed-off-by: default avatarTom Parkin <tparkin@katalix.com>
      Link: https://lore.kernel.org/r/20240513172248.623261-1-tparkin@katalix.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6e828dc6
    • Eric Dumazet's avatar
      af_packet: do not call packet_read_pending() from tpacket_destruct_skb() · 581073f6
      Eric Dumazet authored
      
      trafgen performance considerably sank on hosts with many cores
      after the blamed commit.
      
      packet_read_pending() is very expensive, and calling it
      in af_packet fast path defeats Daniel intent in commit
      b0138408 ("packet: use percpu mmap tx frame pending refcount")
      
      tpacket_destruct_skb() makes room for one packet, we can immediately
      wakeup a producer, no need to completely drain the tx ring.
      
      Fixes: 89ed5b51 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20240515163358.4105915-1-edumazet@google.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      581073f6
    • Eric Dumazet's avatar
      netrom: fix possible dead-lock in nr_rt_ioctl() · e03e7f20
      Eric Dumazet authored
      
      syzbot loves netrom, and found a possible deadlock in nr_rt_ioctl [1]
      
      Make sure we always acquire nr_node_list_lock before nr_node_lock(nr_node)
      
      [1]
      WARNING: possible circular locking dependency detected
      6.9.0-rc7-syzkaller-02147-g654de42f3fc6 #0 Not tainted
      ------------------------------------------------------
      syz-executor350/5129 is trying to acquire lock:
       ffff8880186e2070 (&nr_node->node_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
       ffff8880186e2070 (&nr_node->node_lock){+...}-{2:2}, at: nr_node_lock include/net/netrom.h:152 [inline]
       ffff8880186e2070 (&nr_node->node_lock){+...}-{2:2}, at: nr_dec_obs net/netrom/nr_route.c:464 [inline]
       ffff8880186e2070 (&nr_node->node_lock){+...}-{2:2}, at: nr_rt_ioctl+0x1bb/0x1090 net/netrom/nr_route.c:697
      
      but task is already holding lock:
       ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
       ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: nr_dec_obs net/netrom/nr_route.c:462 [inline]
       ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: nr_rt_ioctl+0x10a/0x1090 net/netrom/nr_route.c:697
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (nr_node_list_lock){+...}-{2:2}:
              lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
              __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
              _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
              spin_lock_bh include/linux/spinlock.h:356 [inline]
              nr_remove_node net/netrom/nr_route.c:299 [inline]
              nr_del_node+0x4b4/0x820 net/netrom/nr_route.c:355
              nr_rt_ioctl+0xa95/0x1090 net/netrom/nr_route.c:683
              sock_do_ioctl+0x158/0x460 net/socket.c:1222
              sock_ioctl+0x629/0x8e0 net/socket.c:1341
              vfs_ioctl fs/ioctl.c:51 [inline]
              __do_sys_ioctl fs/ioctl.c:904 [inline]
              __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:890
              do_syscall_x64 arch/x86/entry/common.c:52 [inline]
              do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
             entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      -> #0 (&nr_node->node_lock){+...}-{2:2}:
              check_prev_add kernel/locking/lockdep.c:3134 [inline]
              check_prevs_add kernel/locking/lockdep.c:3253 [inline]
              validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869
              __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
              lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
              __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
              _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
              spin_lock_bh include/linux/spinlock.h:356 [inline]
              nr_node_lock include/net/netrom.h:152 [inline]
              nr_dec_obs net/netrom/nr_route.c:464 [inline]
              nr_rt_ioctl+0x1bb/0x1090 net/netrom/nr_route.c:697
              sock_do_ioctl+0x158/0x460 net/socket.c:1222
              sock_ioctl+0x629/0x8e0 net/socket.c:1341
              vfs_ioctl fs/ioctl.c:51 [inline]
              __do_sys_ioctl fs/ioctl.c:904 [inline]
              __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:890
              do_syscall_x64 arch/x86/entry/common.c:52 [inline]
              do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
             entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(nr_node_list_lock);
                                     lock(&nr_node->node_lock);
                                     lock(nr_node_list_lock);
        lock(&nr_node->node_lock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor350/5129:
        #0: ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
        #0: ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: nr_dec_obs net/netrom/nr_route.c:462 [inline]
        #0: ffffffff8f7053b8 (nr_node_list_lock){+...}-{2:2}, at: nr_rt_ioctl+0x10a/0x1090 net/netrom/nr_route.c:697
      
      stack backtrace:
      CPU: 0 PID: 5129 Comm: syz-executor350 Not tainted 6.9.0-rc7-syzkaller-02147-g654de42f3fc6 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
        check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
        check_prev_add kernel/locking/lockdep.c:3134 [inline]
        check_prevs_add kernel/locking/lockdep.c:3253 [inline]
        validate_chain+0x18cb/0x58e0 kernel/locking/lockdep.c:3869
        __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
        lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
        __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
        _raw_spin_lock_bh+0x35/0x50 kernel/locking/spinlock.c:178
        spin_lock_bh include/linux/spinlock.h:356 [inline]
        nr_node_lock include/net/netrom.h:152 [inline]
        nr_dec_obs net/netrom/nr_route.c:464 [inline]
        nr_rt_ioctl+0x1bb/0x1090 net/netrom/nr_route.c:697
        sock_do_ioctl+0x158/0x460 net/socket.c:1222
        sock_ioctl+0x629/0x8e0 net/socket.c:1341
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:904 [inline]
        __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:890
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240515142934.3708038-1-edumazet@google.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e03e7f20
    • xu xin's avatar
      net/ipv6: Fix route deleting failure when metric equals 0 · bb487272
      xu xin authored
      
      Problem
      =========
      After commit 67f69513 ("ipv6: Move setting default metric for routes"),
      we noticed that the logic of assigning the default value of fc_metirc
      changed in the ioctl process. That is, when users use ioctl(fd, SIOCADDRT,
      rt) with a non-zero metric to add a route,  then they may fail to delete a
      route with passing in a metric value of 0 to the kernel by ioctl(fd,
      SIOCDELRT, rt). But iproute can succeed in deleting it.
      
      As a reference, when using iproute tools by netlink to delete routes with
      a metric parameter equals 0, like the command as follows:
      
      	ip -6 route del fe80::/64 via fe81::5054:ff:fe11:3451 dev eth0 metric 0
      
      the user can still succeed in deleting the route entry with the smallest
      metric.
      
      Root Reason
      ===========
      After commit 67f69513 ("ipv6: Move setting default metric for routes"),
      When ioctl() pass in SIOCDELRT with a zero metric, rtmsg_to_fib6_config()
      will set a defalut value (1024) to cfg->fc_metric in kernel, and in
      ip6_route_del() and the line 4074 at net/ipv3/route.c, it will check by
      
      	if (cfg->fc_metric && cfg->fc_metric != rt->fib6_metric)
      		continue;
      
      and the condition is true and skip the later procedure (deleting route)
      because cfg->fc_metric != rt->fib6_metric. But before that commit,
      cfg->fc_metric is still zero there, so the condition is false and it
      will do the following procedure (deleting).
      
      Solution
      ========
      In order to keep a consistent behaviour across netlink() and ioctl(), we
      should allow to delete a route with a metric value of 0. So we only do
      the default setting of fc_metric in route adding.
      
      CC: stable@vger.kernel.org # 5.4+
      Fixes: 67f69513 ("ipv6: Move setting default metric for routes")
      Co-developed-by: default avatarFan Yu <fan.yu9@zte.com.cn>
      Signed-off-by: default avatarFan Yu <fan.yu9@zte.com.cn>
      Signed-off-by: default avatarxu xin <xu.xin16@zte.com.cn>
      Reviewed-by: default avatarDavid Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20240514201102055dD2Ba45qKbLlUMxu_DTHP@zte.com.cn
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      bb487272
  5. May 16, 2024
  6. May 15, 2024
    • Nikolay Aleksandrov's avatar
      net: bridge: mst: fix vlan use-after-free · 3a7c1661
      Nikolay Aleksandrov authored
      
      syzbot reported a suspicious rcu usage[1] in bridge's mst code. While
      fixing it I noticed that nothing prevents a vlan to be freed while
      walking the list from the same path (br forward delay timer). Fix the rcu
      usage and also make sure we are not accessing freed memory by making
      br_mst_vlan_set_state use rcu read lock.
      
      [1]
       WARNING: suspicious RCU usage
       6.9.0-rc6-syzkaller #0 Not tainted
       -----------------------------
       net/bridge/br_private.h:1599 suspicious rcu_dereference_protected() usage!
       ...
       stack backtrace:
       CPU: 1 PID: 8017 Comm: syz-executor.1 Not tainted 6.9.0-rc6-syzkaller #0
       Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
       Call Trace:
        <IRQ>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
        lockdep_rcu_suspicious+0x221/0x340 kernel/locking/lockdep.c:6712
        nbp_vlan_group net/bridge/br_private.h:1599 [inline]
        br_mst_set_state+0x1ea/0x650 net/bridge/br_mst.c:105
        br_set_state+0x28a/0x7b0 net/bridge/br_stp.c:47
        br_forward_delay_timer_expired+0x176/0x440 net/bridge/br_stp_timer.c:88
        call_timer_fn+0x18e/0x650 kernel/time/timer.c:1793
        expire_timers kernel/time/timer.c:1844 [inline]
        __run_timers kernel/time/timer.c:2418 [inline]
        __run_timer_base+0x66a/0x8e0 kernel/time/timer.c:2429
        run_timer_base kernel/time/timer.c:2438 [inline]
        run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2448
        __do_softirq+0x2c6/0x980 kernel/softirq.c:554
        invoke_softirq kernel/softirq.c:428 [inline]
        __irq_exit_rcu+0xf2/0x1c0 kernel/softirq.c:633
        irq_exit_rcu+0x9/0x30 kernel/softirq.c:645
        instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1043 [inline]
        sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1043
        </IRQ>
        <TASK>
       asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
       RIP: 0010:lock_acquire+0x264/0x550 kernel/locking/lockdep.c:5758
       Code: 2b 00 74 08 4c 89 f7 e8 ba d1 84 00 f6 44 24 61 02 0f 85 85 01 00 00 41 f7 c7 00 02 00 00 74 01 fb 48 c7 44 24 40 0e 36 e0 45 <4b> c7 44 25 00 00 00 00 00 43 c7 44 25 09 00 00 00 00 43 c7 44 25
       RSP: 0018:ffffc90013657100 EFLAGS: 00000206
       RAX: 0000000000000001 RBX: 1ffff920026cae2c RCX: 0000000000000001
       RDX: dffffc0000000000 RSI: ffffffff8bcaca00 RDI: ffffffff8c1eaa60
       RBP: ffffc90013657260 R08: ffffffff92efe507 R09: 1ffffffff25dfca0
       R10: dffffc0000000000 R11: fffffbfff25dfca1 R12: 1ffff920026cae28
       R13: dffffc0000000000 R14: ffffc90013657160 R15: 0000000000000246
      
      Fixes: ec7328b5 ("net: bridge: mst: Multiple Spanning Tree (MST) mode")
      Reported-by: default avatar <syzbot+fa04eb8a56fd923fc5d8@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=fa04eb8a56fd923fc5d8
      
      
      Signed-off-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3a7c1661
    • Nikolay Aleksandrov's avatar
      net: bridge: xmit: make sure we have at least eth header len bytes · 8bd67ebb
      Nikolay Aleksandrov authored
      
      syzbot triggered an uninit value[1] error in bridge device's xmit path
      by sending a short (less than ETH_HLEN bytes) skb. To fix it check if
      we can actually pull that amount instead of assuming.
      
      Tested with dropwatch:
       drop at: br_dev_xmit+0xb93/0x12d0 [bridge] (0xffffffffc06739b3)
       origin: software
       timestamp: Mon May 13 11:31:53 2024 778214037 nsec
       protocol: 0x88a8
       length: 2
       original length: 2
       drop reason: PKT_TOO_SMALL
      
      [1]
      BUG: KMSAN: uninit-value in br_dev_xmit+0x61d/0x1cb0 net/bridge/br_device.c:65
       br_dev_xmit+0x61d/0x1cb0 net/bridge/br_device.c:65
       __netdev_start_xmit include/linux/netdevice.h:4903 [inline]
       netdev_start_xmit include/linux/netdevice.h:4917 [inline]
       xmit_one net/core/dev.c:3531 [inline]
       dev_hard_start_xmit+0x247/0xa20 net/core/dev.c:3547
       __dev_queue_xmit+0x34db/0x5350 net/core/dev.c:4341
       dev_queue_xmit include/linux/netdevice.h:3091 [inline]
       __bpf_tx_skb net/core/filter.c:2136 [inline]
       __bpf_redirect_common net/core/filter.c:2180 [inline]
       __bpf_redirect+0x14a6/0x1620 net/core/filter.c:2187
       ____bpf_clone_redirect net/core/filter.c:2460 [inline]
       bpf_clone_redirect+0x328/0x470 net/core/filter.c:2432
       ___bpf_prog_run+0x13fe/0xe0f0 kernel/bpf/core.c:1997
       __bpf_prog_run512+0xb5/0xe0 kernel/bpf/core.c:2238
       bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline]
       __bpf_prog_run include/linux/filter.h:657 [inline]
       bpf_prog_run include/linux/filter.h:664 [inline]
       bpf_test_run+0x499/0xc30 net/bpf/test_run.c:425
       bpf_prog_test_run_skb+0x14ea/0x1f20 net/bpf/test_run.c:1058
       bpf_prog_test_run+0x6b7/0xad0 kernel/bpf/syscall.c:4269
       __sys_bpf+0x6aa/0xd90 kernel/bpf/syscall.c:5678
       __do_sys_bpf kernel/bpf/syscall.c:5767 [inline]
       __se_sys_bpf kernel/bpf/syscall.c:5765 [inline]
       __x64_sys_bpf+0xa0/0xe0 kernel/bpf/syscall.c:5765
       x64_sys_call+0x96b/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:322
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatar <syzbot+a63a1f6a062033cf0f40@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=a63a1f6a062033cf0f40
      
      
      Signed-off-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8bd67ebb
  7. May 14, 2024
Loading