  1. Feb 01, 2024
    • mptcp: fix data re-injection from stale subflow · b6c620dc
      Paolo Abeni authored
      When the MPTCP PM detects that a subflow is stale, the packet
      scheduler must re-inject all the MPTCP-level unacked data. To avoid
      acquiring unneeded locks, it first tries to check if any unacked data
      is present at all in the RTX queue, but that check is currently
      broken, as it uses a TCP-specific helper on an MPTCP socket.
      
      Funnily enough fuzzers and static checkers are happy, as the accessed
      memory still belongs to the mptcp_sock struct, and even from a
      functional perspective the recovery completed successfully, as
      the short-cut test always failed.
      
      A recent unrelated TCP change - commit d5fed5ad ("tcp: reorganize
      tcp_sock fast path variables") - exposed the issue, as the tcp field
      reorganization makes the mptcp code always skip the re-injection.
      
      Fix the issue by dropping the bogus call: we are on a slow path, the early
      optimization proved once again to be evil.
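The failure mode described above, a helper written for one socket type being applied to another, can be illustrated outside the kernel. The following is a user-space sketch under stated assumptions: `model_tcp_sock`, `model_mptcp_sock`, and both helpers are hypothetical names, and the layouts are invented for illustration; the real kernel structures differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layouts; the real kernel structures are different. */
struct model_tcp_sock {
	int state;
	int rtx_pending;	/* what the TCP-only helper intends to read */
};

struct model_mptcp_sock {
	int state;
	int unrelated_flag;	/* sits at the offset of rtx_pending above */
	int rtx_pending;	/* MPTCP-level state lives elsewhere */
};

/*
 * TCP-only helper: reads an int at the offset where model_tcp_sock
 * keeps rtx_pending. memcpy() keeps the model well defined in C.
 */
static bool model_tcp_rtx_empty(const void *sk)
{
	int pending;

	memcpy(&pending,
	       (const char *)sk + offsetof(struct model_tcp_sock, rtx_pending),
	       sizeof(pending));
	return pending == 0;
}

/* Type-correct check for the MPTCP model. */
static bool model_mptcp_rtx_empty(const struct model_mptcp_sock *msk)
{
	return msk->rtx_pending == 0;
}
```

In this model the wrong helper never leaves the MPTCP object, so memory checkers stay quiet, yet its answer depends on an unrelated field. Rearranging the TCP layout can therefore silently flip the result, which is the kind of breakage the commit message reports.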
      
      Fixes: 1e1d9d6f ("mptcp: handle pending data on closed subflow")
      Cc: stable@vger.kernel.org
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/468
      
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://lore.kernel.org/r/20240131-upstream-net-20240131-mptcp-ci-issues-v1-1-4c1c11e571ff@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: fix lockdep positive in sk_diag_dump_icons() · 4d322dce
      Eric Dumazet authored
      
      syzbot reported a lockdep splat [1].
      
      Blamed commit hinted about the possible lockdep
      violation, and code used unix_state_lock_nested()
      in an attempt to silence lockdep.
      
      It is not sufficient, because unix_state_lock_nested()
      is already used from unix_state_double_lock().
      
      We need to use a separate subclass.
      
      This patch adds a distinct enumeration to make things
      more explicit.
      
      Also use swap() in unix_state_double_lock() as a clean up.
      
      v2: add a missing inline keyword to unix_state_lock_nested()
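The two ideas in this message, a dedicated lock subclass per nesting context and address-ordered double locking with a swap, can be sketched in user space. This is a simplified model, not the kernel code: the enumerator names and the `model_*` helpers are hypothetical, and a plain flag stands in for a spinlock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subclass names; the patch's actual enumerators may differ. */
enum model_lock_class {
	MODEL_LOCK_NORMAL,
	MODEL_LOCK_SECOND,	/* second lock of a double lock */
	MODEL_LOCK_DIAG,	/* nested lock taken while dumping diagnostics */
};

struct model_sock {
	int locked;
	enum model_lock_class held_as;
};

/* Record which subclass the lock was taken with, as lockdep would. */
static void model_lock(struct model_sock *s, enum model_lock_class c)
{
	assert(!s->locked);
	s->locked = 1;
	s->held_as = c;
}

/* Take both locks in address order so two callers cannot deadlock (ABBA). */
static void model_double_lock(struct model_sock *a, struct model_sock *b)
{
	if (a == b) {
		model_lock(a, MODEL_LOCK_NORMAL);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct model_sock *tmp = a;	/* stands in for the kernel's swap() */

		a = b;
		b = tmp;
	}
	model_lock(a, MODEL_LOCK_NORMAL);
	model_lock(b, MODEL_LOCK_SECOND);
}
```

Keeping a separate enumerator for the diagnostic path means a lock validator can tell that nesting apart from the double-lock nesting, instead of seeing both as the same subclass.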
      
      [1]
      WARNING: possible circular locking dependency detected
      6.8.0-rc1-syzkaller-00356-g8a696a29c690 #0 Not tainted
      
      syz-executor.1/2542 is trying to acquire lock:
       ffff88808b5df9e8 (rlock-AF_UNIX){+.+.}-{2:2}, at: skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863
      
      but task is already holding lock:
       ffff88808b5dfe70 (&u->lock/1){+.+.}-{2:2}, at: unix_dgram_sendmsg+0xfc7/0x2200 net/unix/af_unix.c:2089
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&u->lock/1){+.+.}-{2:2}:
              lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
              _raw_spin_lock_nested+0x31/0x40 kernel/locking/spinlock.c:378
              sk_diag_dump_icons net/unix/diag.c:87 [inline]
              sk_diag_fill+0x6ea/0xfe0 net/unix/diag.c:157
              sk_diag_dump net/unix/diag.c:196 [inline]
              unix_diag_dump+0x3e9/0x630 net/unix/diag.c:220
              netlink_dump+0x5c1/0xcd0 net/netlink/af_netlink.c:2264
              __netlink_dump_start+0x5d7/0x780 net/netlink/af_netlink.c:2370
              netlink_dump_start include/linux/netlink.h:338 [inline]
              unix_diag_handler_dump+0x1c3/0x8f0 net/unix/diag.c:319
             sock_diag_rcv_msg+0xe3/0x400
              netlink_rcv_skb+0x1df/0x430 net/netlink/af_netlink.c:2543
              sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280
              netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
              netlink_unicast+0x7e6/0x980 net/netlink/af_netlink.c:1367
              netlink_sendmsg+0xa37/0xd70 net/netlink/af_netlink.c:1908
              sock_sendmsg_nosec net/socket.c:730 [inline]
              __sock_sendmsg net/socket.c:745 [inline]
              sock_write_iter+0x39a/0x520 net/socket.c:1160
              call_write_iter include/linux/fs.h:2085 [inline]
              new_sync_write fs/read_write.c:497 [inline]
              vfs_write+0xa74/0xca0 fs/read_write.c:590
              ksys_write+0x1a0/0x2c0 fs/read_write.c:643
              do_syscall_x64 arch/x86/entry/common.c:52 [inline]
              do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
             entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      -> #0 (rlock-AF_UNIX){+.+.}-{2:2}:
              check_prev_add kernel/locking/lockdep.c:3134 [inline]
              check_prevs_add kernel/locking/lockdep.c:3253 [inline]
              validate_chain+0x1909/0x5ab0 kernel/locking/lockdep.c:3869
              __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
              lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
              __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
              _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
              skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863
              unix_dgram_sendmsg+0x15d9/0x2200 net/unix/af_unix.c:2112
              sock_sendmsg_nosec net/socket.c:730 [inline]
              __sock_sendmsg net/socket.c:745 [inline]
              ____sys_sendmsg+0x592/0x890 net/socket.c:2584
              ___sys_sendmsg net/socket.c:2638 [inline]
              __sys_sendmmsg+0x3b2/0x730 net/socket.c:2724
              __do_sys_sendmmsg net/socket.c:2753 [inline]
              __se_sys_sendmmsg net/socket.c:2750 [inline]
              __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2750
              do_syscall_x64 arch/x86/entry/common.c:52 [inline]
              do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
             entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&u->lock/1);
                                     lock(rlock-AF_UNIX);
                                     lock(&u->lock/1);
        lock(rlock-AF_UNIX);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor.1/2542:
        #0: ffff88808b5dfe70 (&u->lock/1){+.+.}-{2:2}, at: unix_dgram_sendmsg+0xfc7/0x2200 net/unix/af_unix.c:2089
      
      stack backtrace:
      CPU: 1 PID: 2542 Comm: syz-executor.1 Not tainted 6.8.0-rc1-syzkaller-00356-g8a696a29c690 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
        check_noncircular+0x366/0x490 kernel/locking/lockdep.c:2187
        check_prev_add kernel/locking/lockdep.c:3134 [inline]
        check_prevs_add kernel/locking/lockdep.c:3253 [inline]
        validate_chain+0x1909/0x5ab0 kernel/locking/lockdep.c:3869
        __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
        lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
        _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
        skb_queue_tail+0x36/0x120 net/core/skbuff.c:3863
        unix_dgram_sendmsg+0x15d9/0x2200 net/unix/af_unix.c:2112
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg net/socket.c:745 [inline]
        ____sys_sendmsg+0x592/0x890 net/socket.c:2584
        ___sys_sendmsg net/socket.c:2638 [inline]
        __sys_sendmmsg+0x3b2/0x730 net/socket.c:2724
        __do_sys_sendmmsg net/socket.c:2753 [inline]
        __se_sys_sendmmsg net/socket.c:2750 [inline]
        __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2750
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      RIP: 0033:0x7f26d887cda9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f26d95a60c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00007f26d89abf80 RCX: 00007f26d887cda9
      RDX: 000000000000003e RSI: 00000000200bd000 RDI: 0000000000000004
      RBP: 00007f26d88c947a R08: 0000000000000000 R09: 0000000000000000
      R10: 00000000000008c0 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000000b R14: 00007f26d89abf80 R15: 00007ffcfe081a68
      
      Fixes: 2aac7a2c ("unix_diag: Pending connections IDs NLA")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240130184235.1620738-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. Jan 31, 2024
  3. Jan 30, 2024
    • llc: call sock_orphan() at release time · aa2b2eb3
      Eric Dumazet authored
      
      syzbot reported an interesting trace [1] caused by a stale sk->sk_wq
      pointer in a closed llc socket.
      
      In commit ff7b11aa ("net: socket: set sock->sk to NULL after
      calling proto_ops::release()") Eric Biggers hinted that some protocols
      are missing a sock_orphan(); we need to perform a full audit.
      
      In net-next, I plan to clear sock->sk from sock_orphan() and
      amend Eric's patch to add a warning.
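Why a missing orphan step leaves a stale wait-queue pointer behind can be shown with a small user-space model. This is a sketch under stated assumptions: all `model_*` names are hypothetical, and only the pointer-clearing aspect is modeled, not locking or socket flags.

```c
#include <assert.h>
#include <stddef.h>

struct model_wq {
	int waiters;
};

struct model_socket;

struct model_sock {
	struct model_socket *sk_socket;	/* back-pointer to the owning socket */
	struct model_wq *sk_wq;		/* wait queue embedded in that socket */
};

struct model_socket {
	struct model_wq wq;
	struct model_sock *sk;
};

/* Detach sk from its socket so late callbacks see NULL, not stale memory. */
static void model_sock_orphan(struct model_sock *sk)
{
	sk->sk_socket = NULL;
	sk->sk_wq = NULL;
}

/* Release path: orphan before the socket and its wait queue go away. */
static void model_release(struct model_socket *sock)
{
	if (sock->sk)
		model_sock_orphan(sock->sk);
}

/* A late write-space callback must tolerate an orphaned sk. */
static int model_write_space(const struct model_sock *sk)
{
	return sk->sk_wq ? sk->sk_wq->waiters : 0;
}
```

Without the orphan call in the release path, `sk_wq` would keep pointing into the freed socket object, and a callback running later (for example from transmit completion) would read freed memory.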
      
      [1]
       BUG: KASAN: slab-use-after-free in list_empty include/linux/list.h:373 [inline]
       BUG: KASAN: slab-use-after-free in waitqueue_active include/linux/wait.h:127 [inline]
       BUG: KASAN: slab-use-after-free in sock_def_write_space_wfree net/core/sock.c:3384 [inline]
       BUG: KASAN: slab-use-after-free in sock_wfree+0x9a8/0x9d0 net/core/sock.c:2468
      Read of size 8 at addr ffff88802f4fc880 by task ksoftirqd/1/27
      
      CPU: 1 PID: 27 Comm: ksoftirqd/1 Not tainted 6.8.0-rc1-syzkaller-00049-g6098d87eaf31 #0
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106
        print_address_description mm/kasan/report.c:377 [inline]
        print_report+0xc4/0x620 mm/kasan/report.c:488
        kasan_report+0xda/0x110 mm/kasan/report.c:601
        list_empty include/linux/list.h:373 [inline]
        waitqueue_active include/linux/wait.h:127 [inline]
        sock_def_write_space_wfree net/core/sock.c:3384 [inline]
        sock_wfree+0x9a8/0x9d0 net/core/sock.c:2468
        skb_release_head_state+0xa3/0x2b0 net/core/skbuff.c:1080
        skb_release_all net/core/skbuff.c:1092 [inline]
        napi_consume_skb+0x119/0x2b0 net/core/skbuff.c:1404
        e1000_unmap_and_free_tx_resource+0x144/0x200 drivers/net/ethernet/intel/e1000/e1000_main.c:1970
        e1000_clean_tx_irq drivers/net/ethernet/intel/e1000/e1000_main.c:3860 [inline]
        e1000_clean+0x4a1/0x26e0 drivers/net/ethernet/intel/e1000/e1000_main.c:3801
        __napi_poll.constprop.0+0xb4/0x540 net/core/dev.c:6576
        napi_poll net/core/dev.c:6645 [inline]
        net_rx_action+0x956/0xe90 net/core/dev.c:6778
        __do_softirq+0x21a/0x8de kernel/softirq.c:553
        run_ksoftirqd kernel/softirq.c:921 [inline]
        run_ksoftirqd+0x31/0x60 kernel/softirq.c:913
        smpboot_thread_fn+0x660/0xa10 kernel/smpboot.c:164
        kthread+0x2c6/0x3a0 kernel/kthread.c:388
        ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
        ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
       </TASK>
      
      Allocated by task 5167:
        kasan_save_stack+0x33/0x50 mm/kasan/common.c:47
        kasan_save_track+0x14/0x30 mm/kasan/common.c:68
        unpoison_slab_object mm/kasan/common.c:314 [inline]
        __kasan_slab_alloc+0x81/0x90 mm/kasan/common.c:340
        kasan_slab_alloc include/linux/kasan.h:201 [inline]
        slab_post_alloc_hook mm/slub.c:3813 [inline]
        slab_alloc_node mm/slub.c:3860 [inline]
        kmem_cache_alloc_lru+0x142/0x6f0 mm/slub.c:3879
        alloc_inode_sb include/linux/fs.h:3019 [inline]
        sock_alloc_inode+0x25/0x1c0 net/socket.c:308
        alloc_inode+0x5d/0x220 fs/inode.c:260
        new_inode_pseudo+0x16/0x80 fs/inode.c:1005
        sock_alloc+0x40/0x270 net/socket.c:634
        __sock_create+0xbc/0x800 net/socket.c:1535
        sock_create net/socket.c:1622 [inline]
        __sys_socket_create net/socket.c:1659 [inline]
        __sys_socket+0x14c/0x260 net/socket.c:1706
        __do_sys_socket net/socket.c:1720 [inline]
        __se_sys_socket net/socket.c:1718 [inline]
        __x64_sys_socket+0x72/0xb0 net/socket.c:1718
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xd3/0x250 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Freed by task 0:
        kasan_save_stack+0x33/0x50 mm/kasan/common.c:47
        kasan_save_track+0x14/0x30 mm/kasan/common.c:68
        kasan_save_free_info+0x3f/0x60 mm/kasan/generic.c:640
        poison_slab_object mm/kasan/common.c:241 [inline]
        __kasan_slab_free+0x121/0x1b0 mm/kasan/common.c:257
        kasan_slab_free include/linux/kasan.h:184 [inline]
        slab_free_hook mm/slub.c:2121 [inline]
        slab_free mm/slub.c:4299 [inline]
        kmem_cache_free+0x129/0x350 mm/slub.c:4363
        i_callback+0x43/0x70 fs/inode.c:249
        rcu_do_batch kernel/rcu/tree.c:2158 [inline]
        rcu_core+0x819/0x1680 kernel/rcu/tree.c:2433
        __do_softirq+0x21a/0x8de kernel/softirq.c:553
      
      Last potentially related work creation:
        kasan_save_stack+0x33/0x50 mm/kasan/common.c:47
        __kasan_record_aux_stack+0xba/0x100 mm/kasan/generic.c:586
        __call_rcu_common.constprop.0+0x9a/0x7b0 kernel/rcu/tree.c:2683
        destroy_inode+0x129/0x1b0 fs/inode.c:315
        iput_final fs/inode.c:1739 [inline]
        iput.part.0+0x560/0x7b0 fs/inode.c:1765
        iput+0x5c/0x80 fs/inode.c:1755
        dentry_unlink_inode+0x292/0x430 fs/dcache.c:400
        __dentry_kill+0x1ca/0x5f0 fs/dcache.c:603
        dput.part.0+0x4ac/0x9a0 fs/dcache.c:845
        dput+0x1f/0x30 fs/dcache.c:835
        __fput+0x3b9/0xb70 fs/file_table.c:384
        task_work_run+0x14d/0x240 kernel/task_work.c:180
        exit_task_work include/linux/task_work.h:38 [inline]
        do_exit+0xa8a/0x2ad0 kernel/exit.c:871
        do_group_exit+0xd4/0x2a0 kernel/exit.c:1020
        __do_sys_exit_group kernel/exit.c:1031 [inline]
        __se_sys_exit_group kernel/exit.c:1029 [inline]
        __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1029
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xd3/0x250 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      The buggy address belongs to the object at ffff88802f4fc800
       which belongs to the cache sock_inode_cache of size 1408
      The buggy address is located 128 bytes inside of
       freed 1408-byte region [ffff88802f4fc800, ffff88802f4fcd80)
      
      The buggy address belongs to the physical page:
      page:ffffea0000bd3e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2f4f8
      head:ffffea0000bd3e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
      anon flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      page_type: 0xffffffff()
      raw: 00fff00000000840 ffff888013b06b40 0000000000000000 0000000000000001
      raw: 0000000000000000 0000000080150015 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 3, migratetype Reclaimable, gfp_mask 0xd20d0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_RECLAIMABLE), pid 4956, tgid 4956 (sshd), ts 31423924727, free_ts 0
        set_page_owner include/linux/page_owner.h:31 [inline]
        post_alloc_hook+0x2d0/0x350 mm/page_alloc.c:1533
        prep_new_page mm/page_alloc.c:1540 [inline]
        get_page_from_freelist+0xa28/0x3780 mm/page_alloc.c:3311
        __alloc_pages+0x22f/0x2440 mm/page_alloc.c:4567
        __alloc_pages_node include/linux/gfp.h:238 [inline]
        alloc_pages_node include/linux/gfp.h:261 [inline]
        alloc_slab_page mm/slub.c:2190 [inline]
        allocate_slab mm/slub.c:2354 [inline]
        new_slab+0xcc/0x3a0 mm/slub.c:2407
        ___slab_alloc+0x4af/0x19a0 mm/slub.c:3540
        __slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3625
        __slab_alloc_node mm/slub.c:3678 [inline]
        slab_alloc_node mm/slub.c:3850 [inline]
        kmem_cache_alloc_lru+0x379/0x6f0 mm/slub.c:3879
        alloc_inode_sb include/linux/fs.h:3019 [inline]
        sock_alloc_inode+0x25/0x1c0 net/socket.c:308
        alloc_inode+0x5d/0x220 fs/inode.c:260
        new_inode_pseudo+0x16/0x80 fs/inode.c:1005
        sock_alloc+0x40/0x270 net/socket.c:634
        __sock_create+0xbc/0x800 net/socket.c:1535
        sock_create net/socket.c:1622 [inline]
        __sys_socket_create net/socket.c:1659 [inline]
        __sys_socket+0x14c/0x260 net/socket.c:1706
        __do_sys_socket net/socket.c:1720 [inline]
        __se_sys_socket net/socket.c:1718 [inline]
        __x64_sys_socket+0x72/0xb0 net/socket.c:1718
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xd3/0x250 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      page_owner free stack trace missing
      
      Memory state around the buggy address:
       ffff88802f4fc780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff88802f4fc800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88802f4fc880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                         ^
       ffff88802f4fc900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88802f4fc980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 43815482 ("net: sock_def_readable() and friends RCU conversion")
      Reported-and-tested-by: syzbot+32b89eaa102b372ff76d@syzkaller.appspotmail.com
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240126165532.3396702-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • ipv6: Ensure natural alignment of const ipv6 loopback and router addresses · 60365049
      Helge Deller authored
      
      On a parisc64 kernel I sometimes notice this kernel warning:
      Kernel unaligned access to 0x40ff8814 at ndisc_send_skb+0xc0/0x4d8
      
      The address 0x40ff8814 points to the in6addr_linklocal_allrouters
      variable and the warning simply means that some ipv6 function tries to
      read a 64-bit word directly from the not-64-bit aligned
      in6addr_linklocal_allrouters variable.
      
      Unaligned accesses are non-critical as the architecture or exception
      handlers usually will fix it up at runtime. Nevertheless it may trigger
      a performance penalty for some architectures. For details read the
      "unaligned-memory-access" kernel documentation.
      
      The patch below ensures that the ipv6 loopback and router addresses will
      always be naturally aligned. This prevents the unaligned accesses for
      all architectures.
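The alignment issue can be reproduced in portable C. The sketch below is a user-space illustration, assuming C11 `alignas` and choosing `sizeof(long)` as the target alignment; `model_in6_addr`, `model_loopback`, and `model_is_long_aligned()` are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* A 16-byte address made of bytes: nothing forces word alignment. */
struct model_in6_addr {
	uint8_t bytes[16];
};

/* Request word alignment explicitly for the constant object. */
static alignas(sizeof(long)) const struct model_in6_addr model_loopback = {
	.bytes = { [15] = 1 },	/* ::1 */
};

/* Returns 1 when p can be read as a long without an unaligned access. */
static int model_is_long_aligned(const void *p)
{
	return ((uintptr_t)p % sizeof(long)) == 0;
}
```

With the explicit specifier, code that compares or copies the address one machine word at a time no longer depends on where the linker happens to place the object.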
      
      Signed-off-by: Helge Deller <deller@gmx.de>
      Fixes: 034dfc5d ("ipv6: export in6addr_loopback to modules")
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/ZbNuFM1bFqoH-UoY@p100
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  4. Jan 29, 2024
    • NFSv4.1: Assign the right value for initval and retries for rpc timeout · ccbca118
      Samasth Norway Ananda authored
      
      Make sure the rpc timeout was assigned with the correct value for
      initial timeout and max number of retries.
      
      Fixes: 57331a59 ("NFSv4.1: Use the nfs_client's rpc timeouts for backchannel")
      Signed-off-by: Samasth Norway Ananda <samasth.norway.ananda@oracle.com>
      Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • tcp: add sanity checks to rx zerocopy · 577e4432
      Eric Dumazet authored
      TCP rx zerocopy intent is to map pages initially allocated
      from NIC drivers, not pages owned by a fs.
      
      This patch adds to can_map_frag() these additional checks:
      
      - Page must not be a compound one.
      - page->mapping must be NULL.
      
      This fixes the panic reported by ZhangPeng.
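The two added conditions can be expressed as a small predicate. The following is a user-space sketch, not the kernel function: `model_page` and `model_can_map_page()` are hypothetical stand-ins, and only the two checks named in the message are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct model_page {
	bool compound;		/* part of a multi-page (compound) allocation */
	const void *mapping;	/* non-NULL when a filesystem owns the page */
};

/* Only the two checks named in the commit message are modeled. */
static bool model_can_map_page(const struct model_page *page)
{
	if (page->compound)
		return false;
	if (page->mapping != NULL)
		return false;
	return true;
}
```

A page that came from a file through sendfile() would carry a non-NULL mapping in this model and be rejected, which corresponds to the loopback scenario syzbot constructed.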
      
      syzbot was able to loopback packets built with sendfile(),
      mapping pages owned by an ext4 file to TCP rx zerocopy.
      
      r3 = socket$inet_tcp(0x2, 0x1, 0x0)
      mmap(&(0x7f0000ff9000/0x4000)=nil, 0x4000, 0x0, 0x12, r3, 0x0)
      r4 = socket$inet_tcp(0x2, 0x1, 0x0)
      bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e24, @multicast1}, 0x10)
      connect$inet(r4, &(0x7f00000006c0)={0x2, 0x4e24, @empty}, 0x10)
      r5 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00',
          0x181e42, 0x0)
      fallocate(r5, 0x0, 0x0, 0x85b8)
      sendfile(r4, r5, 0x0, 0x8ba0)
      getsockopt$inet_tcp_TCP_ZEROCOPY_RECEIVE(r4, 0x6, 0x23,
          &(0x7f00000001c0)={&(0x7f0000ffb000/0x3000)=nil, 0x3000, 0x0, 0x0, 0x0,
          0x0, 0x0, 0x0, 0x0}, &(0x7f0000000440)=0x40)
      r6 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00',
          0x181e42, 0x0)
      
      Fixes: 93ab6cc6 ("tcp: implement mmap() for zero copy receive")
      Link: https://lore.kernel.org/netdev/5106a58e-04da-372a-b836-9d3d0bd2507b@huawei.com/T/
      Reported-and-bisected-by: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: linux-mm@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfc: nci: free rx_data_reassembly skb on NCI device cleanup · bfb007ae
      Fedor Pchelkin authored
      
      rx_data_reassembly skb is stored during NCI data exchange for processing
      fragmented packets. It is dropped only when the last fragment is processed
      or when an NTF packet with NCI_OP_RF_DEACTIVATE_NTF opcode is received.
      However, the NCI device may be deallocated before that, which leads to an
      skb leak.
      
      As by design the rx_data_reassembly skb is bound to the NCI device and
      nothing prevents the device from being freed before the skb is processed
      in some way and cleaned, free it on the NCI device cleanup.
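The ownership rule described here, a pending reassembly buffer that must be released together with its device, can be sketched in user space. This is a model under stated assumptions: `model_nci_dev` and its helpers are hypothetical, and a heap buffer stands in for the skb.

```c
#include <assert.h>
#include <stdlib.h>

struct model_nci_dev {
	unsigned char *rx_data_reassembly;	/* partially reassembled payload */
};

/* Store a fragment buffer while waiting for the remaining fragments. */
static int model_store_fragment(struct model_nci_dev *dev, size_t len)
{
	free(dev->rx_data_reassembly);
	dev->rx_data_reassembly = malloc(len);
	return dev->rx_data_reassembly ? 0 : -1;
}

/* Device cleanup: drop any pending buffer so it cannot leak. */
static void model_dev_cleanup(struct model_nci_dev *dev)
{
	free(dev->rx_data_reassembly);	/* free(NULL) is a no-op */
	dev->rx_data_reassembly = NULL;
}
```

Resetting the pointer after freeing makes the cleanup safe to call even when no fragment is pending or when it runs more than once.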
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Fixes: 6a2968aa ("NFC: basic NCI protocol implementation")
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+6b7c68d9c21e4ee4251b@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/lkml/000000000000f43987060043da7b@google.com/
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hsr: remove WARN_ONCE() in send_hsr_supervision_frame() · 37e8c97e
      Nikita Zhandarovich authored
      
      Syzkaller reported [1] hitting a warning after failing to allocate
      resources for skb in hsr_init_skb(). Since a WARN_ONCE() call will
      not help much in this case, it might be prudent to switch to
      netdev_warn_once(). At the very least it will suppress syzkaller
      reports such as [1].
      
      Just in case, use netdev_warn_once() in send_prp_supervision_frame()
      for similar reasons.
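The "report once, then stay quiet" behaviour that both macros share can be sketched in user space. This is an illustrative model only: `model_warn_once()` and `model_send_supervision_frame()` are hypothetical names, and the sketch does not reproduce what distinguishes a kernel WARN from a plain log message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Print msg the first time only; return true if it was printed. */
static bool model_warn_once(bool *already_warned, const char *msg)
{
	if (*already_warned)
		return false;
	*already_warned = true;
	fprintf(stderr, "warning: %s\n", msg);
	return true;
}

/* Hypothetical sender: an allocation failure is logged, not treated as a bug. */
static int model_send_supervision_frame(bool alloc_ok)
{
	static bool warned;

	if (!alloc_ok) {
		model_warn_once(&warned, "Could not send supervision frame");
		return -1;
	}
	return 0;
}
```

An allocation failure is an expected runtime condition, so a one-time log line is usually enough; the caller still sees the error through the return value.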
      
      [1]
      HSR: Could not send supervision frame
      WARNING: CPU: 1 PID: 85 at net/hsr/hsr_device.c:294 send_hsr_supervision_frame+0x60a/0x810 net/hsr/hsr_device.c:294
      RIP: 0010:send_hsr_supervision_frame+0x60a/0x810 net/hsr/hsr_device.c:294
      ...
      Call Trace:
       <IRQ>
       hsr_announce+0x114/0x370 net/hsr/hsr_device.c:382
       call_timer_fn+0x193/0x590 kernel/time/timer.c:1700
       expire_timers kernel/time/timer.c:1751 [inline]
       __run_timers+0x764/0xb20 kernel/time/timer.c:2022
       run_timer_softirq+0x58/0xd0 kernel/time/timer.c:2035
       __do_softirq+0x21a/0x8de kernel/softirq.c:553
       invoke_softirq kernel/softirq.c:427 [inline]
       __irq_exit_rcu kernel/softirq.c:632 [inline]
       irq_exit_rcu+0xb7/0x120 kernel/softirq.c:644
       sysvec_apic_timer_interrupt+0x95/0xb0 arch/x86/kernel/apic/apic.c:1076
       </IRQ>
       <TASK>
       asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:649
      ...
      
      This issue is also found in older kernels (at least up to 5.10).
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+3ae0a3f42c84074b7c8e@syzkaller.appspotmail.com
      Fixes: 121c33b0 ("net: hsr: introduce common code for skb initialization")
      Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. Jan 27, 2024
    • batman-adv: mcast: fix memory leak on deleting a batman-adv interface · 0a186b49
      Linus Lüssing authored
      
      The batman-adv multicast tracker TVLV handler is registered for the
      new batman-adv multicast packet type upon creating a batman-adv interface,
      but not unregistered again upon the interface's deletion, leading to a
      memory leak.
      
      Fix this memory leak by calling the corresponding TVLV handler unregister
      routine for the multicast tracker TVLV upon batman-adv interface
      deletion.
      
      Fixes: 07afe1ba ("batman-adv: mcast: implement multicast packet reception and forwarding")
      Reported-and-tested-by: syzbot+ebe64cc5950868e77358@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/all/000000000000beadc4060f0cbc23@google.com/
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
    • batman-adv: mcast: fix mcast packet type counter on timeouted nodes · 59f7ea70
      Linus Lüssing authored
      
      When a node which does not have the new batman-adv multicast packet type
      capability vanishes, the corresponding global counter is erroneously not
      reduced on other nodes. This in turn leads to the mesh never switching
      back to sending with the new multicast packet type.
      
      Fix this by reducing the corresponding counter when such a node times out.
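The counter bookkeeping can be sketched as follows. This is a user-space model under stated assumptions: `model_mesh`, `model_node`, and the field and function names are hypothetical and do not mirror the batman-adv data structures.

```c
#include <assert.h>
#include <stdbool.h>

struct model_mesh {
	int num_no_mc_ptype;	/* nodes lacking the new packet type capability */
};

struct model_node {
	bool has_mc_ptype;
};

/* A newly seen node without the capability bumps the global counter. */
static void model_node_add(struct model_mesh *m, const struct model_node *n)
{
	if (!n->has_mc_ptype)
		m->num_no_mc_ptype++;
}

/* On timeout, undo the node's contribution to the global counter. */
static void model_node_timeout(struct model_mesh *m, const struct model_node *n)
{
	if (!n->has_mc_ptype)
		m->num_no_mc_ptype--;
}

/* The new packet type is usable only when every known node supports it. */
static bool model_can_use_mc_ptype(const struct model_mesh *m)
{
	return m->num_no_mc_ptype == 0;
}
```

If the timeout path skipped the decrement, the counter would stay above zero after the legacy node left, and the mesh in this model would never return to the new packet type.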
      
      Fixes: 90039133 ("batman-adv: mcast: implement multicast packet generation")
      Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
    • ipmr: fix kernel panic when forwarding mcast packets · e622502c
      Nicolas Dichtel authored
      
      The stacktrace was:
      [   86.305548] BUG: kernel NULL pointer dereference, address: 0000000000000092
      [   86.306815] #PF: supervisor read access in kernel mode
      [   86.307717] #PF: error_code(0x0000) - not-present page
      [   86.308624] PGD 0 P4D 0
      [   86.309091] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [   86.309883] CPU: 2 PID: 3139 Comm: pimd Tainted: G     U             6.8.0-6wind-knet #1
      [   86.311027] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
      [   86.312728] RIP: 0010:ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985)
      [   86.313399] Code: f9 1f 0f 87 85 03 00 00 48 8d 04 5b 48 8d 04 83 49 8d 44 c5 00 48 8b 40 70 48 39 c2 0f 84 d9 00 00 00 49 8b 46 58 48 83 e0 fe <80> b8 92 00 00 00 00 0f 84 55 ff ff ff 49 83 47 38 01 45 85 e4 0f
      [   86.316565] RSP: 0018:ffffad21c0583ae0 EFLAGS: 00010246
      [   86.317497] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [   86.318596] RDX: ffff9559cb46c000 RSI: 0000000000000000 RDI: 0000000000000000
      [   86.319627] RBP: ffffad21c0583b30 R08: 0000000000000000 R09: 0000000000000000
      [   86.320650] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
      [   86.321672] R13: ffff9559c093a000 R14: ffff9559cc00b800 R15: ffff9559c09c1d80
      [   86.322873] FS:  00007f85db661980(0000) GS:ffff955a79d00000(0000) knlGS:0000000000000000
      [   86.324291] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   86.325314] CR2: 0000000000000092 CR3: 000000002f13a000 CR4: 0000000000350ef0
      [   86.326589] Call Trace:
      [   86.327036]  <TASK>
      [   86.327434] ? show_regs (/build/work/knet/arch/x86/kernel/dumpstack.c:479)
      [   86.328049] ? __die (/build/work/knet/arch/x86/kernel/dumpstack.c:421 /build/work/knet/arch/x86/kernel/dumpstack.c:434)
      [   86.328508] ? page_fault_oops (/build/work/knet/arch/x86/mm/fault.c:707)
      [   86.329107] ? do_user_addr_fault (/build/work/knet/arch/x86/mm/fault.c:1264)
      [   86.329756] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
      [   86.330350] ? __irq_work_queue_local (/build/work/knet/kernel/irq_work.c:111 (discriminator 1))
      [   86.331013] ? exc_page_fault (/build/work/knet/./arch/x86/include/asm/paravirt.h:693 /build/work/knet/arch/x86/mm/fault.c:1515 /build/work/knet/arch/x86/mm/fault.c:1563)
      [   86.331702] ? asm_exc_page_fault (/build/work/knet/./arch/x86/include/asm/idtentry.h:570)
      [   86.332468] ? ip_mr_forward (/build/work/knet/net/ipv4/ipmr.c:1985)
      [   86.333183] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
      [   86.333920] ipmr_mfc_add (/build/work/knet/./include/linux/rcupdate.h:782 /build/work/knet/net/ipv4/ipmr.c:1009 /build/work/knet/net/ipv4/ipmr.c:1273)
      [   86.334583] ? __pfx_ipmr_hash_cmp (/build/work/knet/net/ipv4/ipmr.c:363)
      [   86.335357] ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470)
      [   86.336135] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
      [   86.336854] ? ip_mroute_setsockopt (/build/work/knet/net/ipv4/ipmr.c:1470)
      [   86.337679] do_ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:944)
      [   86.338408] ? __pfx_unix_stream_read_actor (/build/work/knet/net/unix/af_unix.c:2862)
      [   86.339232] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
      [   86.339809] ? aa_sk_perm (/build/work/knet/security/apparmor/include/cred.h:153 /build/work/knet/security/apparmor/net.c:181)
      [   86.340342] ip_setsockopt (/build/work/knet/net/ipv4/ip_sockglue.c:1415)
      [   86.340859] raw_setsockopt (/build/work/knet/net/ipv4/raw.c:836)
      [   86.341408] ? security_socket_setsockopt (/build/work/knet/security/security.c:4561 (discriminator 13))
      [   86.342116] sock_common_setsockopt (/build/work/knet/net/core/sock.c:3716)
      [   86.342747] do_sock_setsockopt (/build/work/knet/net/socket.c:2313)
      [   86.343363] __sys_setsockopt (/build/work/knet/./include/linux/file.h:32 /build/work/knet/net/socket.c:2336)
      [   86.344020] __x64_sys_setsockopt (/build/work/knet/net/socket.c:2340)
      [   86.344766] do_syscall_64 (/build/work/knet/arch/x86/entry/common.c:52 /build/work/knet/arch/x86/entry/common.c:83)
      [   86.345433] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
      [   86.346161] ? syscall_exit_work (/build/work/knet/./include/linux/audit.h:357 /build/work/knet/kernel/entry/common.c:160)
      [   86.346938] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
      [   86.347657] ? syscall_exit_to_user_mode (/build/work/knet/kernel/entry/common.c:215)
      [   86.348538] ? srso_return_thunk (/build/work/knet/arch/x86/lib/retpoline.S:223)
      [   86.349262] ? do_syscall_64 (/build/work/knet/./arch/x86/include/asm/cpufeature.h:171 /build/work/knet/arch/x86/entry/common.c:98)
      [   86.349971] entry_SYSCALL_64_after_hwframe (/build/work/knet/arch/x86/entry/entry_64.S:129)
      
      The original packet in ipmr_cache_report() may be queued and then forwarded
      with ip_mr_forward(). The latter function assumes that the skb dst is set.
      
      Since the commit below, the skb dst is dropped by ipv4_pktinfo_prepare(),
      which causes the oops.
      
      Fixes: bb740365 ("ipmr: support IP_PKTINFO on cache report IGMP msg")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240125141847.1931933-1-nicolas.dichtel@6wind.com
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e622502c
  6. Jan 26, 2024
    • Eric Dumazet's avatar
      ip6_tunnel: make sure to pull inner header in __ip6_tnl_rcv() · 8d975c15
      Eric Dumazet authored
      
      syzbot found __ip6_tnl_rcv() could access uninitialized data [1].
      
      Call pskb_inet_may_pull() to fix this, and initialize ipv6h
      variable after this call as it can change skb->head.
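      
      As a rough user-space analogy (not kernel code), the sketch below shows
      why a pointer into the header has to be derived after the pull call
      rather than before it: the call may move the underlying buffer. The names
      struct demo_buf, demo_may_pull() and demo_read_inner() are hypothetical and
      exist only for this illustration; realloc() stands in for whatever
      reallocation the real helper may perform.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb: a heap buffer plus its valid length. */
struct demo_buf {
    unsigned char *head;
    size_t len;
};

/*
 * Hypothetical analogue of a "may pull" helper: make sure at least `need`
 * bytes are addressable from head. Growing the buffer may move it, so any
 * pointer derived from the old head is stale afterwards.
 * Returns 1 on success, 0 on allocation failure.
 */
static int demo_may_pull(struct demo_buf *b, size_t need)
{
    unsigned char *p;

    if (need <= b->len)
        return 1;
    p = realloc(b->head, need);
    if (!p)
        return 0;
    memset(p + b->len, 0, need - b->len); /* newly exposed bytes are defined */
    b->head = p;
    b->len = need;
    return 1;
}

/* Read the byte at `off`, deriving the header pointer only after the pull. */
static int demo_read_inner(struct demo_buf *b, size_t off, unsigned char *out)
{
    const unsigned char *hdr;

    if (!demo_may_pull(b, off + 1))
        return 0;
    hdr = b->head; /* derive after the call, never before */
    *out = hdr[off];
    return 1;
}
```

      Caching b->head in a local variable before demo_may_pull() would leave
      that local pointing at freed memory whenever the buffer moves, which is
      the class of bug the ordering in this fix avoids.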
      
      [1]
       BUG: KMSAN: uninit-value in __INET_ECN_decapsulate include/net/inet_ecn.h:253 [inline]
       BUG: KMSAN: uninit-value in INET_ECN_decapsulate include/net/inet_ecn.h:275 [inline]
       BUG: KMSAN: uninit-value in IP6_ECN_decapsulate+0x7df/0x1e50 include/net/inet_ecn.h:321
        __INET_ECN_decapsulate include/net/inet_ecn.h:253 [inline]
        INET_ECN_decapsulate include/net/inet_ecn.h:275 [inline]
        IP6_ECN_decapsulate+0x7df/0x1e50 include/net/inet_ecn.h:321
        ip6ip6_dscp_ecn_decapsulate+0x178/0x1b0 net/ipv6/ip6_tunnel.c:727
        __ip6_tnl_rcv+0xd4e/0x1590 net/ipv6/ip6_tunnel.c:845
        ip6_tnl_rcv+0xce/0x100 net/ipv6/ip6_tunnel.c:888
       gre_rcv+0x143f/0x1870
        ip6_protocol_deliver_rcu+0xda6/0x2a60 net/ipv6/ip6_input.c:438
        ip6_input_finish net/ipv6/ip6_input.c:483 [inline]
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ip6_input+0x15d/0x430 net/ipv6/ip6_input.c:492
        ip6_mc_input+0xa7e/0xc80 net/ipv6/ip6_input.c:586
        dst_input include/net/dst.h:461 [inline]
        ip6_rcv_finish+0x5db/0x870 net/ipv6/ip6_input.c:79
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ipv6_rcv+0xda/0x390 net/ipv6/ip6_input.c:310
        __netif_receive_skb_one_core net/core/dev.c:5532 [inline]
        __netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5646
        netif_receive_skb_internal net/core/dev.c:5732 [inline]
        netif_receive_skb+0x58/0x660 net/core/dev.c:5791
        tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555
        tun_get_user+0x53af/0x66d0 drivers/net/tun.c:2002
        tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
        call_write_iter include/linux/fs.h:2084 [inline]
        new_sync_write fs/read_write.c:497 [inline]
        vfs_write+0x786/0x1200 fs/read_write.c:590
        ksys_write+0x20f/0x4c0 fs/read_write.c:643
        __do_sys_write fs/read_write.c:655 [inline]
        __se_sys_write fs/read_write.c:652 [inline]
        __x64_sys_write+0x93/0xd0 fs/read_write.c:652
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0x6d/0x140 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Uninit was created at:
        slab_post_alloc_hook+0x129/0xa70 mm/slab.h:768
        slab_alloc_node mm/slub.c:3478 [inline]
        kmem_cache_alloc_node+0x5e9/0xb10 mm/slub.c:3523
        kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:560
        __alloc_skb+0x318/0x740 net/core/skbuff.c:651
        alloc_skb include/linux/skbuff.h:1286 [inline]
        alloc_skb_with_frags+0xc8/0xbd0 net/core/skbuff.c:6334
        sock_alloc_send_pskb+0xa80/0xbf0 net/core/sock.c:2787
        tun_alloc_skb drivers/net/tun.c:1531 [inline]
        tun_get_user+0x1e8a/0x66d0 drivers/net/tun.c:1846
        tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
        call_write_iter include/linux/fs.h:2084 [inline]
        new_sync_write fs/read_write.c:497 [inline]
        vfs_write+0x786/0x1200 fs/read_write.c:590
        ksys_write+0x20f/0x4c0 fs/read_write.c:643
        __do_sys_write fs/read_write.c:655 [inline]
        __se_sys_write fs/read_write.c:652 [inline]
        __x64_sys_write+0x93/0xd0 fs/read_write.c:652
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0x6d/0x140 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      CPU: 0 PID: 5034 Comm: syz-executor331 Not tainted 6.7.0-syzkaller-00562-g9f8413c4a66f #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      
      Fixes: 0d3c703a ("ipv6: Cleanup IPv6 tunnel receive path")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240125170557.2663942-1-edumazet@google.com
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8d975c15
    • Wen Gu's avatar
      net/smc: fix incorrect SMC-D link group matching logic · c3dfcdb6
      Wen Gu authored
      
      The logic to determine if SMC-D link group matches is incorrect. The
      correct logic should be that it only returns true when the GID is the
      same, and the SMC-D device is the same and the extended GID is the same
      (in the case of virtual ISM).
      
      It can be fixed by adding brackets around the conditional (or ternary)
      operator expression. But for better readability and maintainability, it
      has been changed to an if-else statement.
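      
      The precedence pitfall described above can be sketched in isolation. In C
      the conditional operator binds more loosely than &&, so an unbracketed
      ternary at the end of an && chain takes the whole chain as its condition.
      The struct and function names below are hypothetical simplifications, not
      the real SMC data structures or the actual patched code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified inputs; not the real SMC structures. */
struct demo_match {
    uint64_t gid, peer_gid;         /* GID comparison */
    int dev, peer_dev;              /* device comparison */
    bool is_virtual;                /* virtual ISM device? */
    uint64_t gid_ext, peer_gid_ext; /* extended GID comparison */
};

/*
 * Buggy shape: ?: binds looser than &&, so this parses as
 * (gid_eq && dev_eq && is_virtual) ? ext_eq : true
 * and yields true whenever any part of the chain is false.
 */
static bool demo_match_buggy(const struct demo_match *m)
{
    return m->gid == m->peer_gid && m->dev == m->peer_dev &&
           m->is_virtual ? (m->gid_ext == m->peer_gid_ext) : true;
}

/* if-else shape: every condition must hold; the extended GID only matters
 * for a virtual device. */
static bool demo_match_fixed(const struct demo_match *m)
{
    if (m->gid != m->peer_gid || m->dev != m->peer_dev)
        return false;
    if (m->is_virtual && m->gid_ext != m->peer_gid_ext)
        return false;
    return true;
}
```

      With mismatching GIDs on a non-virtual device, the buggy form reports a
      match while the if-else form does not.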
      
      Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
      Closes: https://lore.kernel.org/r/13579588-eb9d-4626-a063-c0b77ed80f11@linux.ibm.com
      Fixes: b40584d1 ("net/smc: compatible with 128-bits extended GID of virtual ISM device")
      Link: https://lore.kernel.org/r/13579588-eb9d-4626-a063-c0b77ed80f11@linux.ibm.com
      
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
      Link: https://lore.kernel.org/r/20240125123916.77928-1-guwen@linux.alibaba.com
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c3dfcdb6
  7. Jan 25, 2024
    • Maciej Fijalkowski's avatar
      xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL · fbadd83a
      Maciej Fijalkowski authored
      
      The XSK ZC Rx path calculates the size of the data that will be posted to
      the XSK Rx queue by subtracting xdp_buff::data from xdp_buff::data_end.
      
      In bpf_xdp_frags_increase_tail(), when the underlying memory type of
      xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add the offset to data_end in the
      tail fragment, so that user space can later take into account the number
      of bytes added by the XDP program.
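      
      A minimal sketch of the length bookkeeping, assuming a buffer described
      only by a start and an end pointer. struct demo_frag and the two helpers
      are hypothetical; they are not the kernel's struct xdp_buff or its
      helpers. The point is that a consumer deriving the length from the two
      pointers only sees a tail increase if data_end itself is advanced.

```c
#include <stddef.h>

/* Hypothetical two-pointer view of a buffer; not struct xdp_buff. */
struct demo_frag {
    unsigned char *data;
    unsigned char *data_end;
};

/* Length as a consumer would compute it: end minus start. */
static size_t demo_frag_len(const struct demo_frag *f)
{
    return (size_t)(f->data_end - f->data);
}

/*
 * Grow the tail by `offset` bytes if the backing storage has room.
 * Returns 0 on success, -1 if the request does not fit.
 */
static int demo_grow_tail(struct demo_frag *f,
                          const unsigned char *storage_end, size_t offset)
{
    if ((size_t)(storage_end - f->data_end) < offset)
        return -1;
    f->data_end += offset; /* without this, demo_frag_len() stays unchanged */
    return 0;
}
```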
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-10-maciej.fijalkowski@intel.com
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fbadd83a
    • Maciej Fijalkowski's avatar
      xsk: fix usage of multi-buffer BPF helpers for ZC XDP · c5114710
      Maciej Fijalkowski authored
      
      Currently, when a packet is shrunk via bpf_xdp_adjust_tail() and the memory
      type is set to MEM_TYPE_XSK_BUFF_POOL, a NULL pointer dereference happens:
      
      [1136314.192256] BUG: kernel NULL pointer dereference, address: 0000000000000034
      [1136314.203943] #PF: supervisor read access in kernel mode
      [1136314.213768] #PF: error_code(0x0000) - not-present page
      [1136314.223550] PGD 0 P4D 0
      [1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
      [1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
      [1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
      [1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
      [1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
      [1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX: 0000000000000000
      [1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9003168c000
      [1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09: 0000000000010000
      [1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12: 0000000000000001
      [1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15: 0000000000000001
      [1136314.373298] FS:  00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000) knlGS:0000000000000000
      [1136314.386105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4: 00000000007706f0
      [1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [1136314.431890] PKRU: 55555554
      [1136314.439143] Call Trace:
      [1136314.446058]  <IRQ>
      [1136314.452465]  ? __die+0x20/0x70
      [1136314.459881]  ? page_fault_oops+0x15b/0x440
      [1136314.468305]  ? exc_page_fault+0x6a/0x150
      [1136314.476491]  ? asm_exc_page_fault+0x22/0x30
      [1136314.484927]  ? __xdp_return+0x6c/0x210
      [1136314.492863]  bpf_xdp_adjust_tail+0x155/0x1d0
      [1136314.501269]  bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
      [1136314.511263]  ice_clean_rx_irq_zc+0x206/0xc60 [ice]
      [1136314.520222]  ? ice_xmit_zc+0x6e/0x150 [ice]
      [1136314.528506]  ice_napi_poll+0x467/0x670 [ice]
      [1136314.536858]  ? ttwu_do_activate.constprop.0+0x8f/0x1a0
      [1136314.546010]  __napi_poll+0x29/0x1b0
      [1136314.553462]  net_rx_action+0x133/0x270
      [1136314.561619]  __do_softirq+0xbe/0x28e
      [1136314.569303]  do_softirq+0x3f/0x60
      
      This comes from a __xdp_return() call with the xdp_buff argument passed as
      NULL; that argument is supposed to be consumed by the xsk_buff_free() call.
      
      To address this properly, in the ZC case, the node that represents the
      frag being removed has to be pulled out of xskb_list. Introduce
      appropriate xsk helpers to do such a node operation and use them
      accordingly within bpf_xdp_adjust_tail().
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-4-maciej.fijalkowski@intel.com
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c5114710
    • Maciej Fijalkowski's avatar
      xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags · f7f6aa8e
      Maciej Fijalkowski authored
      
      XDP multi-buffer support introduced the XDP_FLAGS_HAS_FRAGS flag, which
      drivers use to notify the data path whether an xdp_buff contains fragments
      or not. The data path looks up this flag on the first buffer, which
      occupies the linear part of the xdp_buff, so drivers only modify it there.
      This is sufficient for SKB and XDP_DRV modes, as usually the xdp_buff is
      allocated on the stack or resides within the struct representing the
      driver's queue, and fragments are carried via skb_frag_t structs. In other
      words, we are dealing with only one xdp_buff.
      
      ZC mode, though, relies on a list of xdp_buff structs that is carried via
      xsk_buff_pool::xskb_list, so the ZC data path has to make sure that
      fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
      xsk_buff_free() could misbehave when executed against an xdp_buff that
      carries a frag with the XDP_FLAGS_HAS_FRAGS flag set. Such a scenario can
      occur when the supplied XDP program calls bpf_xdp_adjust_tail() with a
      negative offset that in turn releases the tail fragment from a
      multi-buffer frame.
      
      Calling xsk_buff_free() on a tail fragment with XDP_FLAGS_HAS_FRAGS set
      would release all the nodes from xskb_list that were produced by the
      driver before XDP program execution, which is not what is intended:
      only the tail fragment should be deleted from xskb_list and then put
      onto xsk_buff_pool::free_list. Such a multi-buffer frame will never
      make it up to user space, so from the AF_XDP application's point of view
      there would be no traffic running; however, because free_list constantly
      receives new nodes, the driver will be able to feed the HW Rx queue with
      recycled buffers. The bottom line is that instead of being redirected to
      user space, traffic would be continuously dropped.
      
      To fix this, let us clear the mentioned flag on xsk_buff_pool side
      during xdp_buff initialization, which is what should have been done
      right from the start of XSK multi-buffer support.
      
      Fixes: 1bbc04de ("ice: xsk: add RX multi-buffer support")
      Fixes: 1c9ba9c1 ("i40e: xsk: add RX multi-buffer support")
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f7f6aa8e
    • Maciej Fijalkowski's avatar
      xsk: recycle buffer in case Rx queue was full · 26900989
      Maciej Fijalkowski authored
      
      Add the missing xsk_buff_free() call for the case where __xsk_rcv_zc()
      fails to produce a descriptor to the XSK Rx queue.
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-2-maciej.fijalkowski@intel.com
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      26900989
  8. Jan 24, 2024
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: validate NFPROTO_* family · d0009eff
      Pablo Neira Ayuso authored
      
      Several expressions explicitly refer to NF_INET_* hook definitions
      from expr->ops->validate, however, family is not validated.
      
      Bail out with EOPNOTSUPP in case they are used from unsupported
      families.
      
      Fixes: 0ca743a5 ("netfilter: nf_tables: add compatibility layer for x_tables")
      Fixes: a3c90f7a ("netfilter: nf_tables: flow offload expression")
      Fixes: 2fa84193 ("netfilter: nf_tables: introduce routing expression")
      Fixes: 554ced0a ("netfilter: nf_tables: add support for native socket matching")
      Fixes: ad49d86e ("netfilter: nf_tables: Add synproxy support")
      Fixes: 4ed8eb65 ("netfilter: nf_tables: Add native tproxy support")
      Fixes: 6c472602 ("netfilter: nf_tables: add xfrm expression")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d0009eff
    • Florian Westphal's avatar
      netfilter: nf_tables: reject QUEUE/DROP verdict parameters · f342de4e
      Florian Westphal authored
      
      This reverts commit e0abdadc.
      
      core.c:nf_hook_slow assumes that the upper 16 bits of NF_DROP
      verdicts contain a valid errno, i.e. -EPERM, -EHOSTUNREACH or similar,
      or 0.
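      
      A minimal sketch of the encoding assumed here: an errno packed into the
      upper 16 bits of a drop verdict and recovered by negating those bits. The
      DEMO_ constants and demo_ functions are hypothetical and only model the
      layout described in this message; they are not the kernel's macros. The
      validity check accepts a decoded value of 0 or a negative errno and
      rejects a positive one.

```c
#include <stdint.h>

#define DEMO_NF_DROP      0u          /* hypothetical verdict code */
#define DEMO_VERDICT_MASK 0x000000ffu /* low bits carry the verdict */
#define DEMO_ERR_SHIFT    16          /* upper 16 bits carry the errno */

/* Pack a non-positive errno (e.g. -13) into the upper 16 bits. */
static uint32_t demo_drop_err(int err)
{
    return ((uint32_t)(-err) << DEMO_ERR_SHIFT) | DEMO_NF_DROP;
}

/* Recover the errno: sign-extend the upper 16 bits, then negate. */
static int demo_drop_geterr(uint32_t verdict)
{
    int32_t hi = (int32_t)(verdict >> DEMO_ERR_SHIFT);

    if (hi >= 0x8000)
        hi -= 0x10000; /* portable sign extension of a 16-bit field */
    return -hi;
}

/* Accept a drop verdict only if it decodes to 0 or a negative errno. */
static int demo_drop_verdict_ok(uint32_t verdict)
{
    if ((verdict & DEMO_VERDICT_MASK) != DEMO_NF_DROP)
        return 0;
    return demo_drop_geterr(verdict) <= 0;
}
```

      In this model a verdict of 0xffff0000 decodes to +1, so
      demo_drop_verdict_ok() refuses it, while a verdict built from -13
      round-trips to -13 and is accepted.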
      
      Due to the reverted commit, it's possible to provide a positive
      value, e.g. NF_ACCEPT (1), which results in use-after-free.
      
      It's not clear to me why this commit was made.
      
      NF_QUEUE is not used by nftables; "queue" rules in nftables
      will result in use of "nft_queue" expression.
      
      If we later need to allow specifying errno values from userspace
      (do not know why), this has to call NF_DROP_GETERR and check that
      "err <= 0" holds true.
      
      Fixes: e0abdadc ("netfilter: nf_tables: accept QUEUE/DROP verdict parameters")
      Cc: stable@vger.kernel.org
      Reported-by: Notselwyn <notselwyn@pwning.tech>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f342de4e
    • Florian Westphal's avatar
      netfilter: nf_tables: restrict anonymous set and map names to 16 bytes · b462579b
      Florian Westphal authored
      
      nftables has two types of sets/maps, one where userspace defines the
      name, and anonymous sets/maps, where userspace defines a template name.
      
      For the latter, the kernel requires the presence of exactly one "%d".
      nftables uses "__set%d" and "__map%d" for this.  The kernel will
      expand the format specifier and replace it with the smallest unused
      number.
      
      As-is, userspace could define a template name that moves the set name
      past the 256-byte upper limit (post-expansion).
      
      I don't see how this could be a problem, but I would prefer if userspace
      cannot do this, so add a limit of 16 bytes for the '%d' template name.
      
      16 bytes is the old total upper limit for set names that existed when
      nf_tables was merged initially.
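      
      A minimal sketch of such a template check, assuming the 16-byte limit
      includes the terminating NUL (that detail is an assumption, as are the
      names DEMO_ANON_MAXLEN and demo_anon_name_ok()). It accepts a name only
      if it is short enough and contains exactly one '%', immediately followed
      by 'd'. It is not the kernel's actual validation code.

```c
#include <string.h>

/* Limit taken from the text above; counting the NUL is an assumption. */
#define DEMO_ANON_MAXLEN 16

/* Return 1 if `name` is an acceptable anonymous-set template, else 0. */
static int demo_anon_name_ok(const char *name)
{
    const char *p = strchr(name, '%');

    if (strlen(name) >= DEMO_ANON_MAXLEN)
        return 0; /* template too long */
    if (!p || p[1] != 'd')
        return 0; /* no "%d" at the first '%' */
    if (strchr(p + 1, '%'))
        return 0; /* more than one format specifier */
    return 1;
}
```

      The templates "__set%d" and "__map%d" mentioned above pass this check; a
      name with no "%d", with two specifiers, or longer than the limit does
      not.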
      
      Fixes: 38745490 ("netfilter: nf_tables: Allow set names of up to 255 chars")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b462579b
    • Florian Westphal's avatar
      netfilter: nft_limit: reject configurations that cause integer overflow · c9d9eb9c
      Florian Westphal authored
      
      Reject bogus configs where the internal token counter wraps around.
      This only occurs with very large requests, such as 17gbyte/s.
      
      It's better to reject this than to apply an incorrect rate limit.
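      
      A minimal sketch of rejecting a wrapping computation instead of silently
      truncating it, using plain 64-bit arithmetic. The helper names are
      hypothetical, and which quantities the real code multiplies or adds is
      not shown here; only the overflow test itself is illustrated.

```c
#include <stdint.h>

/* Compute a * b into *out; return 0 on success, -1 if it would wrap. */
static int demo_mul_u64_checked(uint64_t a, uint64_t b, uint64_t *out)
{
    if (a != 0 && b > UINT64_MAX / a)
        return -1;
    *out = a * b;
    return 0;
}

/* Compute a + b into *out; return 0 on success, -1 if it would wrap. */
static int demo_add_u64_checked(uint64_t a, uint64_t b, uint64_t *out)
{
    if (b > UINT64_MAX - a)
        return -1;
    *out = a + b;
    return 0;
}
```

      A caller would turn a -1 result into a configuration error rather than
      continue with a wrapped token count.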
      
      Fixes: d2168e84 ("netfilter: nft_limit: add per-byte limiting")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c9d9eb9c
    • Pablo Neira Ayuso's avatar
      netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain · 01acb2e8
      Pablo Neira Ayuso authored
      
      Remove netdevice from inet/ingress basechain in case NETDEV_UNREGISTER
      event is reported, otherwise a stale reference to netdevice remains in
      the hook list.
      
      Fixes: 60a3815d ("netfilter: add inet ingress support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      01acb2e8
    • Ido Schimmel's avatar
      net/sched: flower: Fix chain template offload · 32f2a0af
      Ido Schimmel authored
      
      When a qdisc is deleted from a net device the stack instructs the
      underlying driver to remove its flow offload callback from the
      associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack
      then continues to replay the removal of the filters in the block for
      this driver by iterating over the chains in the block and invoking the
      'reoffload' operation of the classifier being used. In turn, the
      classifier in its 'reoffload' operation prepares and emits a
      'FLOW_CLS_DESTROY' command for each filter.
      
      However, the stack does not do the same for chain templates and the
      underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when
      a qdisc is deleted. This results in a memory leak [1] which can be
      reproduced using [2].
      
      Fix by introducing a 'tmplt_reoffload' operation and have the stack
      invoke it with the appropriate arguments as part of the replay.
      Implement the operation in the sole classifier that supports chain
      templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}'
      command based on whether a flow offload callback is being bound to a
      filter block or being unbound from one.
      
      As far as I can tell, the issue has existed since the cited commit, which
      reordered tcf_block_offload_unbind() before tcf_block_flush_all_chains()
      in __tcf_block_put(). The order cannot be reversed as the filter block
      is expected to be freed after flushing all the chains.
      
      [1]
      unreferenced object 0xffff888107e28800 (size 2048):
        comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
        hex dump (first 32 bytes):
          b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff  ..|......[......
          01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff  ................
        backtrace:
          [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
          [<ffffffff81ab374e>] __kmalloc+0x4e/0x90
          [<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0
          [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
          [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
          [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
          [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
          [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
          [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
          [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
          [<ffffffff83ac6270>] netlink_unicast+0x540/0x820
          [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
          [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80
          [<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0
          [<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0
          [<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0
      unreferenced object 0xffff88816d2c0400 (size 1024):
        comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
        hex dump (first 32 bytes):
          40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00  @.......W.8.....
          10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff  ..,m......,m....
        backtrace:
          [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
          [<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90
          [<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0
          [<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460
          [<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0
          [<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0
          [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
          [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
          [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
          [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
          [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
          [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
          [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
          [<ffffffff83ac6270>] netlink_unicast+0x540/0x820
          [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
          [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80
      
      [2]
       # tc qdisc add dev swp1 clsact
       # tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32
       # tc qdisc del dev swp1 clsact
       # devlink dev reload pci/0000:06:00.0
      
      Fixes: bbf73830 ("net: sched: traverse chains in block with tcf_get_next_chain()")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32f2a0af
  9. Jan 23, 2024
    • Zhengchao Shao's avatar
      ipv6: init the accept_queue's spinlocks in inet6_create · 435e202d
      Zhengchao Shao authored
      
      In commit 198bc90e ("tcp: make sure init the accept_queue's spinlocks
      once"), the spinlocks of accept_queue are initialized only when a socket is
      created in the inet4 scenario. The locks are not initialized when a socket
      is created in the inet6 scenario. The kernel reports the following error:
      INFO: trying to register non-static key.
      The code is fine but needs lockdep annotation, or maybe
      you didn't initialize this object before use?
      turning off the locking correctness validator.
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      Call Trace:
      <TASK>
      	dump_stack_lvl (lib/dump_stack.c:107)
      	register_lock_class (kernel/locking/lockdep.c:1289)
      	__lock_acquire (kernel/locking/lockdep.c:5015)
      	lock_acquire.part.0 (kernel/locking/lockdep.c:5756)
      	_raw_spin_lock_bh (kernel/locking/spinlock.c:178)
      	inet_csk_listen_stop (net/ipv4/inet_connection_sock.c:1386)
      	tcp_disconnect (net/ipv4/tcp.c:2981)
      	inet_shutdown (net/ipv4/af_inet.c:935)
      	__sys_shutdown (./include/linux/file.h:32 net/socket.c:2438)
      	__x64_sys_shutdown (net/socket.c:2445)
      	do_syscall_64 (arch/x86/entry/common.c:52)
      	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      RIP: 0033:0x7f52ecd05a3d
      Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48
      RSP: 002b:00007f52ecf5dde8 EFLAGS: 00000293 ORIG_RAX: 0000000000000030
      RAX: ffffffffffffffda RBX: 00007f52ecf5e640 RCX: 00007f52ecd05a3d
      RDX: 00007f52ecc8b188 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00007f52ecf5de20 R08: 00007ffdae45c69f R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000293 R12: 00007f52ecf5e640
      R13: 0000000000000000 R14: 00007f52ecc8b060 R15: 00007ffdae45c6e0
      
      Fixes: 198bc90e ("tcp: make sure init the accept_queue's spinlocks once")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240122102001.2851701-1-shaozhengchao@huawei.com
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      435e202d
    • Zhengchao Shao's avatar
      netlink: fix potential sleeping issue in mqueue_flush_file · 234ec0b6
      Zhengchao Shao authored
      
      I analyzed the potential sleeping issue in the following sequence:
      Thread A                                Thread B
      ...                                     netlink_create  //ref = 1
      do_mq_notify                            ...
        sock = netlink_getsockbyfilp          ...     //ref = 2
        info->notify_sock = sock;             ...
      ...                                     netlink_sendmsg
      ...                                       skb = netlink_alloc_large_skb  //skb->head is vmalloced
      ...                                       netlink_unicast
      ...                                         sk = netlink_getsockbyportid //ref = 3
      ...                                         netlink_sendskb
      ...                                           __netlink_sendskb
      ...                                             skb_queue_tail //put skb to sk_receive_queue
      ...                                         sock_put //ref = 2
      ...                                     ...
      ...                                     netlink_release
      ...                                       deferred_put_nlk_sk //ref = 1
      mqueue_flush_file
        spin_lock
        remove_notification
          netlink_sendskb
            sock_put  //ref = 0
              sk_free
                ...
                __sk_destruct
                  netlink_sock_destruct
                    skb_queue_purge  //get skb from sk_receive_queue
                      ...
                      __skb_queue_purge_reason
                        kfree_skb_reason
                          __kfree_skb
                          ...
                          skb_release_all
                            skb_release_head_state
                              netlink_skb_destructor
                                vfree(skb->head)  //sleeping while holding spinlock
      
      In netlink_sendmsg, if the memory pointed to by skb->head is allocated by
      vmalloc and the skb is put on the sk_receive_queue queue, the skb is not
      freed. When the mqueue executes flush, the sleeping bug will occur. Use
      vfree_atomic instead of vfree in netlink_skb_destructor to solve the issue.
      
      Fixes: c05cdb1b ("netlink: allow large data transfers from user-space")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Link: https://lore.kernel.org/r/20240122011807.2110357-1-shaozhengchao@huawei.com
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      234ec0b6
    • Salvatore Dipietro's avatar
      tcp: Add memory barrier to tcp_push() · 7267e8dc
      Salvatore Dipietro authored
      
      On CPUs with weak memory models, reads and updates performed by tcp_push
      to the sk variables can get reordered leaving the socket throttled when
      it should not. The tasklet running tcp_wfree() may also not observe the
      memory updates in time and will skip flushing any packets throttled by
      tcp_push(), delaying the sending. This can pathologically cause 40ms
      extra latency due to bad interactions with delayed acks.
      
      Adding a memory barrier in tcp_push removes the bug, similarly to the
      previous commit bf06200e ("tcp: tsq: fix nonagle handling").
      smp_mb__after_atomic() is used to avoid incurring unnecessary overhead
      on x86, which is not affected.
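      
      A user-space sketch of the ordering problem, written with C11 atomics
      rather than kernel primitives. One side sets a flag and then reads a
      counter; the other side updates the counter and then reads the flag.
      With a full fence between the write and the read on both sides, at least
      one side observes the other's update. struct demo_sock, DEMO_THROTTLED
      and both functions are hypothetical, and this is only one plausible
      reading of the interaction described above, not the patched kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_THROTTLED 1u /* hypothetical flag bit */

struct demo_sock {
    atomic_uint flags; /* stands in for the throttle flag word */
    atomic_uint wmem;  /* stands in for bytes still queued for TX */
};

/*
 * Sender side: mark the socket throttled, then re-check the queued bytes.
 * The fence orders the flag update before the counter load.
 * Returns true if the send should be deferred.
 */
static bool demo_push_should_defer(struct demo_sock *sk, unsigned int truesize)
{
    atomic_fetch_or_explicit(&sk->flags, DEMO_THROTTLED,
                             memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&sk->wmem, memory_order_relaxed) > truesize;
}

/*
 * Completion side: release bytes, then test and clear the throttle flag.
 * Returns true if a deferred send has to be flushed.
 */
static bool demo_tx_done_should_flush(struct demo_sock *sk, unsigned int bytes)
{
    atomic_fetch_sub_explicit(&sk->wmem, bytes, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_fetch_and_explicit(&sk->flags, ~DEMO_THROTTLED,
                                     memory_order_relaxed) & DEMO_THROTTLED;
}
```

      Without the fences, a weakly ordered CPU could let the sender read a
      stale counter while the completion side reads a stale flag, so the data
      would be deferred and nobody would flush it.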
      
      Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
      22.04 and Apache Tomcat 9.0.83 running the basic servlet below:
      
      import java.io.IOException;
      import java.io.OutputStreamWriter;
      import java.io.PrintWriter;
      import javax.servlet.ServletException;
      import javax.servlet.http.HttpServlet;
      import javax.servlet.http.HttpServletRequest;
      import javax.servlet.http.HttpServletResponse;
      
      public class HelloWorldServlet extends HttpServlet {
          @Override
          protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
              response.setContentType("text/html;charset=utf-8");
              OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
              String s = "a".repeat(3096);
              osw.write(s,0,s.length());
              osw.flush();
          }
      }
      
      Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
      c6i.8xlarge instance. Before the patch, an additional 40ms of latency is
      observed at P99.99 and above; with the patch, the extra latency disappears.
      
      No patch and tcp_autocorking=1
      ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
        ...
       50.000%    0.91ms
       75.000%    1.13ms
       90.000%    1.46ms
       99.000%    1.74ms
       99.900%    1.89ms
       99.990%   41.95ms  <<< 40+ ms extra latency
       99.999%   48.32ms
      100.000%   48.96ms
      
      With patch and tcp_autocorking=1
      ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
        ...
       50.000%    0.90ms
       75.000%    1.13ms
       90.000%    1.45ms
       99.000%    1.72ms
       99.900%    1.83ms
       99.990%    2.11ms  <<< no 40+ ms extra latency
       99.999%    2.53ms
      100.000%    2.62ms
      
      The patch has also been tested on x86 (m7i.2xlarge instance), which is
      not affected by this issue; the patch does not introduce any additional
      delay there.
      
      Fixes: 7aa5470c ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
      Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
      
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      7267e8dc
  10. Jan 22, 2024
  11. Jan 21, 2024
  12. Jan 20, 2024
    • llc: Drop support for ETH_P_TR_802_2. · e3f9bed9
      Kuniyuki Iwashima authored
      
      syzbot reported an uninit-value bug below. [0]
      
      llc supports ETH_P_802_2 (0x0004) and used to support ETH_P_TR_802_2
      (0x0011), and syzbot abused the latter to trigger the bug.
      
        write$tun(r0, &(0x7f0000000040)={@val={0x0, 0x11}, @val, @mpls={[], @llc={@snap={0xaa, 0x1, ')', "90e5dd"}}}}, 0x16)
      
      llc_conn_handler() initialises local variables {saddr,daddr}.mac
      based on skb in llc_pdu_decode_sa()/llc_pdu_decode_da() and passes
      them to __llc_lookup().
      
      However, the initialisation is done only when skb->protocol is
      htons(ETH_P_802_2), otherwise, __llc_lookup_established() and
      __llc_lookup_listener() will read garbage.
      
      The missing initialisation existed prior to commit 211ed865
      ("net: delete all instances of special processing for token ring").
      
      It removed the part to kick out the token ring stuff but forgot to
      close the door allowing ETH_P_TR_802_2 packets to sneak into llc_rcv().
      
      Let's remove llc_tr_packet_type and complete the deprecation.
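
The failure mode can be sketched in a few lines of user-space C. This is an illustration under assumptions, not the kernel code: all `model_*` names are hypothetical, the protocol values are the ones quoted above in host byte order, and the early return only models the effect of the fix (such frames no longer reach the lookup), whereas the actual change removes llc_tr_packet_type.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_P_802_2    0x0004 /* value quoted in the commit message */
#define MODEL_P_TR_802_2 0x0011 /* value quoted in the commit message */
#define MODEL_MAC_LEN    6

struct model_frame {
    uint16_t protocol;
    uint8_t  src_mac[MODEL_MAC_LEN];
};

/* Fills sa only for 802.2 frames, like the decode helpers described above.
 * For any other protocol, sa is left untouched (uninitialised in the caller). */
static void model_decode_sa(const struct model_frame *f, uint8_t *sa)
{
    if (f->protocol == MODEL_P_802_2)
        memcpy(sa, f->src_mac, MODEL_MAC_LEN);
}

/* Returns 1 on match, 0 on mismatch, -1 when the frame is rejected
 * before the local address buffer is ever read. */
static int model_lookup(const struct model_frame *f, const uint8_t *want)
{
    uint8_t sa[MODEL_MAC_LEN]; /* deliberately not zero-initialised */

    if (f->protocol != MODEL_P_802_2)
        return -1; /* without this gate, memcmp() below would read garbage */

    model_decode_sa(f, sa);
    return memcmp(sa, want, MODEL_MAC_LEN) == 0;
}
```

Dropping the gate in `model_lookup()` reproduces the pattern the report describes: a frame tagged `MODEL_P_TR_802_2` skips the copy, and the comparison then reads an uninitialised stack buffer.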
      
      [0]:
      BUG: KMSAN: uninit-value in __llc_lookup_established+0xe9d/0xf90
       __llc_lookup_established+0xe9d/0xf90
       __llc_lookup net/llc/llc_conn.c:611 [inline]
       llc_conn_handler+0x4bd/0x1360 net/llc/llc_conn.c:791
       llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206
       __netif_receive_skb_one_core net/core/dev.c:5527 [inline]
       __netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5641
       netif_receive_skb_internal net/core/dev.c:5727 [inline]
       netif_receive_skb+0x58/0x660 net/core/dev.c:5786
       tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555
       tun_get_user+0x53af/0x66d0 drivers/net/tun.c:2002
       tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
       call_write_iter include/linux/fs.h:2020 [inline]
       new_sync_write fs/read_write.c:491 [inline]
       vfs_write+0x8ef/0x1490 fs/read_write.c:584
       ksys_write+0x20f/0x4c0 fs/read_write.c:637
       __do_sys_write fs/read_write.c:649 [inline]
       __se_sys_write fs/read_write.c:646 [inline]
       __x64_sys_write+0x93/0xd0 fs/read_write.c:646
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Local variable daddr created at:
       llc_conn_handler+0x53/0x1360 net/llc/llc_conn.c:783
       llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206
      
      CPU: 1 PID: 5004 Comm: syz-executor994 Not tainted 6.6.0-syzkaller-14500-g1c41041124bd #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
      
      Fixes: 211ed865 ("net: delete all instances of special processing for token ring")
      Reported-by: <syzbot+b5ad66046b913bc04c6f@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=b5ad66046b913bc04c6f
      
      
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240119015515.61898-1-kuniyu@amazon.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e3f9bed9
    • llc: make llc_ui_sendmsg() more robust against bonding changes · dad555c8
      Eric Dumazet authored
      
      syzbot was able to trick llc_ui_sendmsg() into allocating an skb with no
      headroom and then trying to push 14 bytes of Ethernet header. [1]
      
      Like some others, llc_ui_sendmsg() releases the socket lock before
      calling sock_alloc_send_skb().
      Then it acquires it again, but does not redo all the sanity checks
      that were performed.
      
      This fix:
      
      - Uses LL_RESERVED_SPACE() to reserve space.
      - Checks all conditions again after the socket lock is reacquired.
      - Does not account for the Ethernet header in the MTU limit.
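
Why missing headroom is fatal can be shown with a toy buffer model. In report [1], `data` (…6ff2) ends up 14 bytes below `head` (…7000), which is exactly the 14-byte Ethernet header pushed into headroom that was never reserved. The sketch below is a user-space illustration only; the `model_*` names and the fixed 256-byte buffer are assumptions, and a constant 14 is used where the kernel would derive the reservation from the device via LL_RESERVED_SPACE().

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ETH_HLEN 14  /* Ethernet header size, as in the report */
#define MODEL_BUF_SIZE 256 /* hypothetical buffer size */

struct model_skb {
    unsigned char buf[MODEL_BUF_SIZE];
    size_t data; /* offset of the first used byte; equals the headroom */
    size_t len;  /* bytes in use starting at data */
};

/* Reserve headroom up front; fails if it does not fit in the buffer. */
static bool model_skb_init(struct model_skb *skb, size_t headroom)
{
    if (headroom > MODEL_BUF_SIZE)
        return false;
    skb->data = headroom;
    skb->len = 0;
    return true;
}

static size_t model_headroom(const struct model_skb *skb)
{
    return skb->data;
}

/* Prepend n bytes. Returns false where the kernel would instead
 * hit skb_under_panic(), because data would move below the buffer start. */
static bool model_skb_push(struct model_skb *skb, size_t n)
{
    if (n > model_headroom(skb))
        return false;
    skb->data -= n;
    skb->len += n;
    return true;
}
```

An skb initialised with zero headroom rejects the 14-byte push, while one initialised with `MODEL_ETH_HLEN` bytes of headroom accepts it and is left with no headroom.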
      
      [1]
      
      skbuff: skb_under_panic: text:ffff800088baa334 len:1514 put:14 head:ffff0000c9c37000 data:ffff0000c9c36ff2 tail:0x5dc end:0x6c0 dev:bond0
      
       kernel BUG at net/core/skbuff.c:193 !
      Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 0 PID: 6875 Comm: syz-executor.0 Not tainted 6.7.0-rc8-syzkaller-00101-g0802e17d9aca-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
       pc : skb_panic net/core/skbuff.c:189 [inline]
       pc : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
       lr : skb_panic net/core/skbuff.c:189 [inline]
       lr : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
      sp : ffff800096f97000
      x29: ffff800096f97010 x28: ffff80008cc8d668 x27: dfff800000000000
      x26: ffff0000cb970c90 x25: 00000000000005dc x24: ffff0000c9c36ff2
      x23: ffff0000c9c37000 x22: 00000000000005ea x21: 00000000000006c0
      x20: 000000000000000e x19: ffff800088baa334 x18: 1fffe000368261ce
      x17: ffff80008e4ed000 x16: ffff80008a8310f8 x15: 0000000000000001
      x14: 1ffff00012df2d58 x13: 0000000000000000 x12: 0000000000000000
      x11: 0000000000000001 x10: 0000000000ff0100 x9 : e28a51f1087e8400
      x8 : e28a51f1087e8400 x7 : ffff80008028f8d0 x6 : 0000000000000000
      x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff800082b78714
      x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000089
      Call trace:
        skb_panic net/core/skbuff.c:189 [inline]
        skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
        skb_push+0xf0/0x108 net/core/skbuff.c:2451
        eth_header+0x44/0x1f8 net/ethernet/eth.c:83
        dev_hard_header include/linux/netdevice.h:3188 [inline]
        llc_mac_hdr_init+0x110/0x17c net/llc/llc_output.c:33
        llc_sap_action_send_xid_c+0x170/0x344 net/llc/llc_s_ac.c:85
        llc_exec_sap_trans_actions net/llc/llc_sap.c:153 [inline]
        llc_sap_next_state net/llc/llc_sap.c:182 [inline]
        llc_sap_state_process+0x1ec/0x774 net/llc/llc_sap.c:209
        llc_build_and_send_xid_pkt+0x12c/0x1c0 net/llc/llc_sap.c:270
        llc_ui_sendmsg+0x7bc/0xb1c net/llc/af_llc.c:997
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg net/socket.c:745 [inline]
        sock_sendmsg+0x194/0x274 net/socket.c:767
        splice_to_socket+0x7cc/0xd58 fs/splice.c:881
        do_splice_from fs/splice.c:933 [inline]
        direct_splice_actor+0xe4/0x1c0 fs/splice.c:1142
        splice_direct_to_actor+0x2a0/0x7e4 fs/splice.c:1088
        do_splice_direct+0x20c/0x348 fs/splice.c:1194
        do_sendfile+0x4bc/0xc70 fs/read_write.c:1254
        __do_sys_sendfile64 fs/read_write.c:1322 [inline]
        __se_sys_sendfile64 fs/read_write.c:1308 [inline]
        __arm64_sys_sendfile64+0x160/0x3b4 fs/read_write.c:1308
        __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
        invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51
        el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136
        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155
        el0_svc+0x54/0x158 arch/arm64/kernel/entry-common.c:678
        el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696
        el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:595
      Code: aa1803e6 aa1903e7 a90023f5 94792f6a (d4210000)
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-and-tested-by: <syzbot+2a7024e9502df538e8ef@syzkaller.appspotmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240118183625.4007013-1-edumazet@google.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dad555c8