  1. Feb 27, 2024
  2. Feb 22, 2024
  3. Feb 21, 2024
    • netfilter: nf_tables: use kzalloc for hook allocation · 195e5f88
      Florian Westphal authored
      
      KMSAN reports an uninitialized variable when registering the hook,
         reg->hook_ops_type == NF_HOOK_OP_BPF)
              ~~~~~~~~~~~ undefined
      
      This is a small structure, just use kzalloc to make sure this
      won't happen again when new fields get added to nf_hook_ops.
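
      A minimal userspace sketch of the idea, assuming calloc() as a stand-in
      for kzalloc(). The struct and enum names are hypothetical and are not
      the kernel's nf_hook_ops definition. Fields that the caller never
      assigns, including ones added later, read as zero instead of garbage:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a hook descriptor; not the kernel's nf_hook_ops. */
enum hook_ops_type { HOOK_OP_UNDEFINED = 0, HOOK_OP_NF_TABLES, HOOK_OP_BPF };

struct hook_ops {
	void *priv;
	int hooknum;
	enum hook_ops_type hook_ops_type; /* added later, easy to forget */
};

/* calloc() plays the role of kzalloc(): every field starts out as zero. */
struct hook_ops *hook_ops_alloc(int hooknum)
{
	struct hook_ops *ops = calloc(1, sizeof(*ops));

	if (!ops)
		return NULL;
	ops->hooknum = hooknum; /* only the fields we know about are set */
	return ops;
}
```

      With a plain malloc() the comparison against the BPF hook type would
      read an indeterminate value, which is what KMSAN flags.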
      
      Fixes: 7b4b2fa3 ("netfilter: annotate nf_tables base hook ops")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      195e5f88
    • netfilter: nf_tables: register hooks last when adding new chain/flowtable · d472e985
      Pablo Neira Ayuso authored
      
      Register hooks last when adding a chain/flowtable to ensure that packets do
      not walk over data structures that are being released in the error path
      without waiting for the RCU grace period.
      
      Fixes: 91c7b38d ("netfilter: nf_tables: use new transaction infrastructure to handle chain")
      Fixes: 3b49e2e9 ("netfilter: nf_tables: add flow table netlink frontend")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d472e985
    • netfilter: nft_flow_offload: release dst in case direct xmit path is used · 8762785f
      Pablo Neira Ayuso authored
      
      The direct xmit path does not use the dst, since it calls dev_queue_xmit()
      to send packets, so release it with dst_release().
      
      kmemleak reports:
      
      unreferenced object 0xffff88814f440900 (size 184):
        comm "softirq", pid 0, jiffies 4294951896
        hex dump (first 32 bytes):
          00 60 5b 04 81 88 ff ff 00 e6 e8 82 ff ff ff ff  .`[.............
          21 0b 50 82 ff ff ff ff 00 00 00 00 00 00 00 00  !.P.............
        backtrace (crc cb2bf5d6):
          [<000000003ee17107>] kmem_cache_alloc+0x286/0x340
          [<0000000021a5de2c>] dst_alloc+0x43/0xb0
          [<00000000f0671159>] rt_dst_alloc+0x2e/0x190
          [<00000000fe5092c9>] __mkroute_output+0x244/0x980
          [<000000005fb96fb0>] ip_route_output_flow+0xc0/0x160
          [<0000000045367433>] nf_ip_route+0xf/0x30
          [<0000000085da1d8e>] nf_route+0x2d/0x60
          [<00000000d1ecd1cb>] nft_flow_route+0x171/0x6a0 [nft_flow_offload]
          [<00000000d9b2fb60>] nft_flow_offload_eval+0x4e8/0x700 [nft_flow_offload]
          [<000000009f447dbb>] expr_call_ops_eval+0x53/0x330 [nf_tables]
          [<00000000072e1be6>] nft_do_chain+0x17c/0x840 [nf_tables]
          [<00000000d0551029>] nft_do_chain_inet+0xa1/0x210 [nf_tables]
          [<0000000097c9d5c6>] nf_hook_slow+0x5b/0x160
          [<0000000005eccab1>] ip_forward+0x8b6/0x9b0
          [<00000000553a269b>] ip_rcv+0x221/0x230
          [<00000000412872e5>] __netif_receive_skb_one_core+0xfe/0x110
      
      Fixes: fa502c86 ("netfilter: flowtable: simplify route logic")
      Reported-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      8762785f
    • netfilter: nft_flow_offload: reset dst in route object after setting up flow · 9e0f0430
      Pablo Neira Ayuso authored
      
      dst is transferred to the flow object, route object does not own it
      anymore.  Reset dst in route object, otherwise if flow_offload_add()
      fails, error path releases dst twice, leading to a refcount underflow.
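
      The ownership rule can be sketched in plain C as follows. This is an
      illustration under stated assumptions, not the nft_flow_offload code:
      struct dst, struct route, struct flow and the helper names are all
      hypothetical, and a bare integer stands in for the real refcount.

```c
#include <stddef.h>

/* Hypothetical refcounted object standing in for a dst entry. */
struct dst { int refcnt; };

struct route { struct dst *dst; }; /* temporary lookup result */
struct flow  { struct dst *dst; }; /* long-lived owner */

static void dst_put(struct dst *d)
{
	if (d)
		d->refcnt--;
}

/* Move the reference from the route to the flow and clear the source. */
void flow_take_dst(struct flow *flow, struct route *route)
{
	flow->dst = route->dst;
	route->dst = NULL; /* the route no longer owns the reference */
}

/* Error path: each object releases only what it still owns. */
void release_on_error(struct flow *flow, struct route *route)
{
	dst_put(flow->dst);
	flow->dst = NULL;
	dst_put(route->dst); /* no-op after the transfer */
	route->dst = NULL;
}
```

      Without the reset in flow_take_dst(), release_on_error() would drop the
      same reference twice and the count would go below zero.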
      
      Fixes: a3c90f7a ("netfilter: nf_tables: flow offload expression")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      9e0f0430
    • netfilter: nf_tables: set dormant flag on hook register failure · bccebf64
      Florian Westphal authored
      
      We need to set the dormant flag again if we fail to register
      the hooks.
      
      During memory pressure hook registration can fail and we end up
      with a table marked as active but no registered hooks.
      
      On table/base chain deletion, nf_tables will attempt to unregister
      the hook again which yields a warn splat from the nftables core.
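
      A small model of the state handling, assuming a hypothetical flag bit
      and a hypothetical register_hooks() helper whose failure can be
      simulated. It is a sketch of the rollback idea only, not the nf_tables
      implementation:

```c
#include <stdbool.h>

#define TABLE_F_DORMANT 0x1u /* hypothetical flag bit */

struct table {
	unsigned int flags;
	int hooks_registered;
};

/* Hypothetical registration step that can fail, e.g. under memory pressure. */
static int register_hooks(struct table *t, bool simulate_failure)
{
	if (simulate_failure)
		return -1;
	t->hooks_registered = 1;
	return 0;
}

/* Wake a dormant table; restore the flag if hook registration fails. */
int table_wake(struct table *t, bool simulate_failure)
{
	int err;

	t->flags &= ~TABLE_F_DORMANT;
	err = register_hooks(t, simulate_failure);
	if (err)
		t->flags |= TABLE_F_DORMANT; /* keep flag and hook state in sync */
	return err;
}
```

      Keeping the flag and the hook state consistent means a later deletion
      does not try to unregister hooks that were never registered.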
      
      Reported-and-tested-by: <syzbot+de4025c006ec68ac56fc@syzkaller.appspotmail.com>
      Fixes: 179d9ba5 ("netfilter: nf_tables: fix table flag updates")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      bccebf64
    • tls: don't skip over different type records from the rx_list · ec823bf3
      Sabrina Dubroca authored
      
      If we queue 3 records:
       - record 1, type DATA
       - record 2, some other type
       - record 3, type DATA
      and do a recv(PEEK), the rx_list will contain the first two records.
      
      The next large recv will walk through the rx_list and copy data from
      record 1, then stop because record 2 is a different type. Since we
      haven't filled up our buffer, we will process the next available
      record. It's also DATA, so we can merge it with the current read.
      
      We shouldn't do that, since there was a record in between that we
      ignored.
      
      Add a flag to let process_rx_list inform tls_sw_recvmsg that it had
      more data available.
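
      The scenario above can be modelled in a few lines of C. This is a toy
      model under simplifying assumptions (whole records only, the queue only
      contributes DATA records); the function and type names are hypothetical
      and do not match the real tls_sw code:

```c
#include <stdbool.h>
#include <stddef.h>

enum rec_type { REC_DATA, REC_OTHER };

struct record { enum rec_type type; size_t len; };

/*
 * Consume leading records of one type from rx_list (whole records only, at
 * most `want` bytes). Sets *more when records were left behind on the list.
 */
static size_t process_rx_list(const struct record *rx, size_t n_rx,
			      size_t want, bool *more)
{
	size_t copied = 0, i = 0;

	while (i < n_rx && rx[i].type == rx[0].type &&
	       copied + rx[i].len <= want) {
		copied += rx[i].len;
		i++;
	}
	*more = i < n_rx;
	return copied;
}

/* Returns the number of bytes one recv() call delivers in this model. */
size_t model_recv(const struct record *rx, size_t n_rx,
		  const struct record *queue, size_t n_queue, size_t want)
{
	bool more = false;
	size_t copied = process_rx_list(rx, n_rx, want, &more);
	size_t i;

	if (more)
		return copied; /* do not skip over what is still on rx_list */

	for (i = 0; i < n_queue && queue[i].type == REC_DATA &&
	     copied + queue[i].len <= want; i++)
		copied += queue[i].len;
	return copied;
}
```

      With rx_list holding a DATA record followed by a record of another
      type, the model returns only the first record even though the queue
      holds more DATA.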
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/f00c0c0afa080c60f016df1471158c1caf983c34.1708007371.git.sd@queasysnail.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ec823bf3
    • tls: stop recv() if initial process_rx_list gave us non-DATA · fdfbaec5
      Sabrina Dubroca authored
      
      If we have a non-DATA record on the rx_list and another record of the
      same type still on the queue, we will end up merging them:
       - process_rx_list copies the non-DATA record
       - we start the loop and process the first available record since it's
         of the same type
       - we break out of the loop since the record was not DATA
      
      Just check the record type and jump to the end in case process_rx_list
      did some work.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/bd31449e43bd4b6ff546f5c51cf958c31c511deb.1708007371.git.sd@queasysnail.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fdfbaec5
    • tls: break out of main loop when PEEK gets a non-data record · 10f41d07
      Sabrina Dubroca authored
      
      PEEK needs to leave decrypted records on the rx_list so that we can
      receive them later on, so it jumps back into the async code that
      queues the skb. Unfortunately that makes us skip the
      TLS_RECORD_TYPE_DATA check at the bottom of the main loop, so if two
      records of the same (non-DATA) type are queued, we end up merging
      them.
      
      Add the same record type check, and mark it unlikely so as not to penalize
      the async fastpath. Async decrypt only applies to data records, so this
      check is only needed for PEEK.
      
      process_rx_list also has similar issues.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/3df2eef4fdae720c55e69472b5bea668772b45a2.1708007371.git.sd@queasysnail.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      10f41d07
    • bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready() · 4cd12c60
      Shigeru Yoshida authored
      
      syzbot reported the following NULL pointer dereference issue [1]:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        [...]
        RIP: 0010:0x0
        [...]
        Call Trace:
         <TASK>
         sk_psock_verdict_data_ready+0x232/0x340 net/core/skmsg.c:1230
         unix_stream_sendmsg+0x9b4/0x1230 net/unix/af_unix.c:2293
         sock_sendmsg_nosec net/socket.c:730 [inline]
         __sock_sendmsg+0x221/0x270 net/socket.c:745
         ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
         ___sys_sendmsg net/socket.c:2638 [inline]
         __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
         do_syscall_64+0xf9/0x240
         entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      If sk_psock_verdict_data_ready() and sk_psock_stop_verdict() are called
      concurrently, psock->saved_data_ready can be NULL, causing the above issue.
      
      This patch fixes this issue by calling the appropriate data ready function
      using the sk_psock_data_ready() helper and protecting it from concurrency
      with sk->sk_callback_lock.
      
      Fixes: 6df7f764 ("bpf, sockmap: Wake up polling after data copy")
      Reported-by: <syzbot+fd7b34375c1c8ce29c93@syzkaller.appspotmail.com>
      Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: <syzbot+fd7b34375c1c8ce29c93@syzkaller.appspotmail.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=fd7b34375c1c8ce29c93 [1]
      Link: https://lore.kernel.org/bpf/20240218150933.6004-1-syoshida@redhat.com
      4cd12c60
    • af_unix: Drop oob_skb ref before purging queue in GC. · aa82ac51
      Kuniyuki Iwashima authored
      
      syzbot reported another hung task in __unix_gc().  [0]

      The current while loop assumes that all of the remaining candidates
      have oob_skb and that calling kfree_skb(oob_skb) releases the remaining
      candidates.

      However, I missed the case where oob_skb has a self-referencing fd and
      another fd, and the latter sk is placed before the former in the
      candidate list.  Then the while loop never makes progress, resulting
      in the hung task.
      
      __unix_gc() has the same loop just before purging the collected skb,
      so we can call kfree_skb(oob_skb) there and let __skb_queue_purge()
      release all inflight sockets.
      
      [0]:
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 2784 Comm: kworker/u4:8 Not tainted 6.8.0-rc4-syzkaller-01028-g71b605d32017 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Workqueue: events_unbound __unix_gc
      RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x70 kernel/kcov.c:200
      Code: 89 fb e8 23 00 00 00 48 8b 3d 84 f5 1a 0c 48 89 de 5b e9 43 26 57 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <f3> 0f 1e fa 48 8b 04 24 65 48 8b 0d 90 52 70 7e 65 8b 15 91 52 70
      RSP: 0018:ffffc9000a17fa78 EFLAGS: 00000287
      RAX: ffffffff8a0a6108 RBX: ffff88802b6c2640 RCX: ffff88802c0b3b80
      RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
      RBP: ffffc9000a17fbf0 R08: ffffffff89383f1d R09: 1ffff1100ee5ff84
      R10: dffffc0000000000 R11: ffffed100ee5ff85 R12: 1ffff110056d84ee
      R13: ffffc9000a17fae0 R14: 0000000000000000 R15: ffffffff8f47b840
      FS:  0000000000000000(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffef5687ff8 CR3: 0000000029b34000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <NMI>
       </NMI>
       <TASK>
       __unix_gc+0xe69/0xf40 net/unix/garbage.c:343
       process_one_work kernel/workqueue.c:2633 [inline]
       process_scheduled_works+0x913/0x1420 kernel/workqueue.c:2706
       worker_thread+0xa5f/0x1000 kernel/workqueue.c:2787
       kthread+0x2ef/0x390 kernel/kthread.c:388
       ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:242
       </TASK>
      
      Reported-and-tested-by: <syzbot+ecab4d36f920c3574bf9@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=ecab4d36f920c3574bf9
      Fixes: 25236c91 ("af_unix: Fix task hung while purging oob_skb in GC.")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa82ac51
    • net: implement lockless setsockopt(SO_PEEK_OFF) · 56667da7
      Eric Dumazet authored
      
      syzbot reported a lockdep violation [1] involving af_unix
      support of SO_PEEK_OFF.
      
      Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket
      sk_peek_off field), there is really no point in enforcing thread
      safety for it in the kernel.
      
      After this patch :
      
      - setsockopt(SO_PEEK_OFF) no longer acquires the socket lock.
      
      - skb_consume_udp() no longer has to acquire the socket lock.
      
      - af_unix no longer needs a special version of sk_set_peek_off(),
        because it does not lock u->iolock anymore.
      
      As a followup, we could turn prot->set_peek_off into a boolean
      and avoid an indirect call, since we always use sk_set_peek_off().
      
      [1]
      
      WARNING: possible circular locking dependency detected
      6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted
      
      syz-executor.2/30025 is trying to acquire lock:
       ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
      
      but task is already holding lock:
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}:
              lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
              lock_sock_nested+0x48/0x100 net/core/sock.c:3524
              lock_sock include/net/sock.h:1691 [inline]
              __unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415
              sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046
              ____sys_recvmsg+0x3c0/0x470 net/socket.c:2801
              ___sys_recvmsg net/socket.c:2845 [inline]
              do_recvmmsg+0x474/0xae0 net/socket.c:2939
              __sys_recvmmsg net/socket.c:3018 [inline]
              __do_sys_recvmmsg net/socket.c:3041 [inline]
              __se_sys_recvmmsg net/socket.c:3034 [inline]
              __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
             do_syscall_64+0xf9/0x240
             entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      -> #0 (&u->iolock){+.+.}-{3:3}:
              check_prev_add kernel/locking/lockdep.c:3134 [inline]
              check_prevs_add kernel/locking/lockdep.c:3253 [inline]
              validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
              __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
              lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
              __mutex_lock_common kernel/locking/mutex.c:608 [inline]
              __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
              unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
             sk_setsockopt+0x207e/0x3360
              do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
              __sys_setsockopt+0x1ad/0x250 net/socket.c:2334
              __do_sys_setsockopt net/socket.c:2343 [inline]
              __se_sys_setsockopt net/socket.c:2340 [inline]
              __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
             do_syscall_64+0xf9/0x240
             entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sk_lock-AF_UNIX);
                                     lock(&u->iolock);
                                     lock(sk_lock-AF_UNIX);
        lock(&u->iolock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor.2/30025:
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193
      
      stack backtrace:
      CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
        check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
        check_prev_add kernel/locking/lockdep.c:3134 [inline]
        check_prevs_add kernel/locking/lockdep.c:3253 [inline]
        validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
        __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
        lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
        __mutex_lock_common kernel/locking/mutex.c:608 [inline]
        __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
        unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
       sk_setsockopt+0x207e/0x3360
        do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
        __sys_setsockopt+0x1ad/0x250 net/socket.c:2334
        __do_sys_setsockopt net/socket.c:2343 [inline]
        __se_sys_setsockopt net/socket.c:2340 [inline]
        __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      RIP: 0033:0x7f78a1c7dda9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9
      RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006
      RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000
      R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8
      
      Fixes: 859051dd ("bpf: Implement cgroup sockaddr hooks for unix sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Daan De Meyer <daan.j.demeyer@gmail.com>
      Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
      Cc: Martin KaFai Lau <martin.lau@kernel.org>
      Cc: David Ahern <dsahern@kernel.org>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56667da7
  4. Feb 20, 2024
    • arp: Prevent overflow in arp_req_get(). · a7d60277
      Kuniyuki Iwashima authored
      
      syzkaller reported an overflowing write in arp_req_get(). [0]

      When ioctl(SIOCGARP) is issued, arp_req_get() looks up a neighbour
      entry and copies neigh->ha to struct arpreq.arp_ha.sa_data.
      
      The arp_ha here is struct sockaddr, not struct sockaddr_storage, so
      the sa_data buffer is just 14 bytes.
      
      In the splat below, 2 bytes overflow into the next int field,
      arp_flags.  We initialise that field just after the memcpy(), so it's
      not a problem.
      
      However, when dev->addr_len is greater than 22 (e.g. MAX_ADDR_LEN),
      arp_netmask is overwritten, which could be set as htonl(0xFFFFFFFFUL)
      in arp_ioctl() before calling arp_req_get().
      
      To avoid the overflow, let's limit the max length of memcpy().
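
      A sketch of the clamping idea in userspace C, assuming the standard
      struct sockaddr from <sys/socket.h>. The helper name copy_hw_addr() is
      hypothetical and this is not the code in net/ipv4/arp.c:

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Copy a hardware address of addr_len bytes into sa->sa_data, never writing
 * past the end of the field. Returns the number of bytes copied.
 */
size_t copy_hw_addr(struct sockaddr *sa, const unsigned char *ha,
		    size_t addr_len)
{
	size_t len = addr_len;

	if (len > sizeof(sa->sa_data))
		len = sizeof(sa->sa_data); /* clamp to the destination field */
	memcpy(sa->sa_data, ha, len);
	return len;
}
```

      A 32-byte source address is truncated to the size of sa_data, so the
      fields that follow the sockaddr in an enclosing struct stay untouched.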
      
      Note that commit b5f0de6d ("net: dev: Convert sa_data to flexible
      array in struct sockaddr") just silenced syzkaller.
      
      [0]:
      memcpy: detected field-spanning write (size 16) of single field "r->arp_ha.sa_data" at net/ipv4/arp.c:1128 (size 14)
      WARNING: CPU: 0 PID: 144638 at net/ipv4/arp.c:1128 arp_req_get+0x411/0x4a0 net/ipv4/arp.c:1128
      Modules linked in:
      CPU: 0 PID: 144638 Comm: syz-executor.4 Not tainted 6.1.74 #31
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
      RIP: 0010:arp_req_get+0x411/0x4a0 net/ipv4/arp.c:1128
      Code: fd ff ff e8 41 42 de fb b9 0e 00 00 00 4c 89 fe 48 c7 c2 20 6d ab 87 48 c7 c7 80 6d ab 87 c6 05 25 af 72 04 01 e8 5f 8d ad fb <0f> 0b e9 6c fd ff ff e8 13 42 de fb be 03 00 00 00 4c 89 e7 e8 a6
      RSP: 0018:ffffc900050b7998 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: ffff88803a815000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff8641a44a RDI: 0000000000000001
      RBP: ffffc900050b7a98 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 203a7970636d656d R12: ffff888039c54000
      R13: 1ffff92000a16f37 R14: ffff88803a815084 R15: 0000000000000010
      FS:  00007f172bf306c0(0000) GS:ffff88805aa00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f172b3569f0 CR3: 0000000057f12005 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       arp_ioctl+0x33f/0x4b0 net/ipv4/arp.c:1261
       inet_ioctl+0x314/0x3a0 net/ipv4/af_inet.c:981
       sock_do_ioctl+0xdf/0x260 net/socket.c:1204
       sock_ioctl+0x3ef/0x650 net/socket.c:1321
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:870 [inline]
       __se_sys_ioctl fs/ioctl.c:856 [inline]
       __x64_sys_ioctl+0x18e/0x220 fs/ioctl.c:856
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x37/0x90 arch/x86/entry/common.c:81
       entry_SYSCALL_64_after_hwframe+0x64/0xce
      RIP: 0033:0x7f172b262b8d
      Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f172bf300b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00007f172b3abf80 RCX: 00007f172b262b8d
      RDX: 0000000020000000 RSI: 0000000000008954 RDI: 0000000000000003
      RBP: 00007f172b2d3493 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000000b R14: 00007f172b3abf80 R15: 00007f172bf10000
       </TASK>
      
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Reported-by: Bjoern Doebel <doebel@amazon.de>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240215230516.31330-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      a7d60277
    • devlink: fix possible use-after-free and memory leaks in devlink_init() · def689fc
      Vasiliy Kovalev authored
      
      The pernet operations structure for the subsystem must be registered
      before registering the generic netlink family.
      
      Unregister in case of unsuccessful registration.
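
      The ordering and unwinding can be sketched as below. All names are
      hypothetical stand-ins and the failure is simulated; this shows the
      general pattern, not the devlink_init() code:

```c
#include <stdbool.h>

/* Hypothetical registration state; the names mirror the steps in the text. */
struct subsys_state {
	bool pernet_registered;
	bool family_registered;
};

static int register_pernet(struct subsys_state *s)
{
	s->pernet_registered = true;
	return 0;
}

static void unregister_pernet(struct subsys_state *s)
{
	s->pernet_registered = false;
}

static int register_family(struct subsys_state *s, bool simulate_failure)
{
	if (simulate_failure)
		return -1;
	s->family_registered = true;
	return 0;
}

/* Per-netns state first, user-visible family second, unwind on failure. */
int subsys_init(struct subsys_state *s, bool simulate_failure)
{
	int err;

	err = register_pernet(s);
	if (err)
		return err;

	err = register_family(s, simulate_failure);
	if (err)
		goto err_unreg_pernet;
	return 0;

err_unreg_pernet:
	unregister_pernet(s);
	return err;
}
```

      Registering the user-visible family last means requests cannot arrive
      before the per-namespace state they depend on exists.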
      
      Fixes: 687125b5 ("devlink: split out core code")
      Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
      Link: https://lore.kernel.org/r/20240215203400.29976-1-kovalev@altlinux.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      def689fc
    • ipv6: sr: fix possible use-after-free and null-ptr-deref · 5559cea2
      Vasiliy Kovalev authored
      
      The pernet operations structure for the subsystem must be registered
      before registering the generic netlink family.
      
      Fixes: 915d7e5e ("ipv6: sr: add code base for control plane support of SR-IPv6")
      Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
      Link: https://lore.kernel.org/r/20240215202717.29815-1-kovalev@altlinux.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      5559cea2
  5. Feb 18, 2024
  6. Feb 16, 2024
    • net/sched: act_mirred: don't override retval if we already lost the skb · 166c2c8a
      Jakub Kicinski authored
      
      If we're redirecting the skb and haven't called tcf_mirred_forward()
      yet, we need to tell the core to drop the skb by setting the retcode
      to SHOT. If we have called tcf_mirred_forward(), however, the skb
      is out of our hands and returning SHOT will lead to UaF.

      Move the retval override to the error paths which actually need it.
      
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Fixes: e5cf1baf ("act_mirred: use TC_ACT_REINSERT when possible")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      166c2c8a
    • net/sched: act_mirred: use the backlog for mirred ingress · 52f671db
      Jakub Kicinski authored
      
      The test Davide added in commit ca22da2f ("act_mirred: use the backlog
      for nested calls to mirred ingress") hangs our testing VMs every 10 or so
      runs, with the familiar tcp_v4_rcv -> tcp_v4_rcv deadlock reported by
      lockdep.
      
      The problem as previously described by Davide (see Link) is that
      if we reverse flow of traffic with the redirect (egress -> ingress)
      we may reach the same socket which generated the packet. And we may
      still be holding its socket lock. The common solution to such deadlocks
      is to put the packet in the Rx backlog, rather than run the Rx path
      inline. Do that for all egress -> ingress reversals, not just once
      we started to nest mirred calls.
      
      In the past there was a concern that the backlog indirection will
      lead to loss of error reporting / less accurate stats. But the current
      workaround does not seem to address the issue.
      
      Fixes: 53592b36 ("net/sched: act_mirred: Implement ingress actions")
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Suggested-by: Davide Caratti <dcaratti@redhat.com>
      Link: https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52f671db
    • dccp/tcp: Unhash sk from ehash for tb2 alloc failure after check_established(). · 66b60b0c
      Kuniyuki Iwashima authored
      
      syzkaller reported a warning [0] in inet_csk_destroy_sock() with no
      repro.
      
        WARN_ON(inet_sk(sk)->inet_num && !inet_csk(sk)->icsk_bind_hash);
      
      However, syzkaller's log hinted that connect() failed just before
      the warning due to FAULT_INJECTION.  [1]
      
      When connect() is called for an unbound socket, we search for an
      available ephemeral port.  If a bhash bucket exists for the port, we
      call __inet_check_established() or __inet6_check_established() to check
      if the bucket is reusable.
      
      If reusable, we add the socket into ehash and set inet_sk(sk)->inet_num.
      
      Later, we look up the corresponding bhash2 bucket and try to allocate
      it if it does not exist.
      
      Although it rarely occurs in real use, if the allocation fails, we must
      revert the changes made by check_established().  Otherwise, an unconnected
      socket could illegally occupy an ehash entry.

      Note that we do not put tw back into ehash because sk might have
      already responded to a packet for tw and it would be better to free
      tw earlier under such memory pressure.
      
      [0]:
      WARNING: CPU: 0 PID: 350830 at net/ipv4/inet_connection_sock.c:1193 inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
      Modules linked in:
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
      Code: 41 5c 41 5d 41 5e e9 2d 4a 3d fd e8 28 4a 3d fd 48 89 ef e8 f0 cd 7d ff 5b 5d 41 5c 41 5d 41 5e e9 13 4a 3d fd e8 0e 4a 3d fd <0f> 0b e9 61 fe ff ff e8 02 4a 3d fd 4c 89 e7 be 03 00 00 00 e8 05
      RSP: 0018:ffffc9000b21fd38 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000009e78 RCX: ffffffff840bae40
      RDX: ffff88806e46c600 RSI: ffffffff840bb012 RDI: ffff88811755cca8
      RBP: ffff88811755c880 R08: 0000000000000003 R09: 0000000000000000
      R10: 0000000000009e78 R11: 0000000000000000 R12: ffff88811755c8e0
      R13: ffff88811755c892 R14: ffff88811755c918 R15: 0000000000000000
      FS:  00007f03e5243800(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b32f21000 CR3: 0000000112ffe001 CR4: 0000000000770ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
       dccp_close (net/dccp/proto.c:1078)
       inet_release (net/ipv4/af_inet.c:434)
       __sock_release (net/socket.c:660)
       sock_close (net/socket.c:1423)
       __fput (fs/file_table.c:377)
       __fput_sync (fs/file_table.c:462)
       __x64_sys_close (fs/open.c:1557 fs/open.c:1539 fs/open.c:1539)
       do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      RIP: 0033:0x7f03e53852bb
      Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 43 c9 f5 ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 a1 c9 f5 ff 8b 44
      RSP: 002b:00000000005dfba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
      RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f03e53852bb
      RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000003
      RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000167c
      R10: 0000000008a79680 R11: 0000000000000293 R12: 00007f03e4e43000
      R13: 00007f03e4e43170 R14: 00007f03e4e43178 R15: 00007f03e4e43170
       </TASK>
      
      [1]:
      FAULT_INJECTION: forcing a failure.
      name failslab, interval 1, probability 0, space 0, times 0
      CPU: 0 PID: 350833 Comm: syz-executor.1 Not tainted 6.7.0-12272-g2121c43f88f5 #9
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
       should_fail_ex (lib/fault-inject.c:52 lib/fault-inject.c:153)
       should_failslab (mm/slub.c:3748)
       kmem_cache_alloc (mm/slub.c:3763 mm/slub.c:3842 mm/slub.c:3867)
       inet_bind2_bucket_create (net/ipv4/inet_hashtables.c:135)
       __inet_hash_connect (net/ipv4/inet_hashtables.c:1100)
       dccp_v4_connect (net/dccp/ipv4.c:116)
       __inet_stream_connect (net/ipv4/af_inet.c:676)
       inet_stream_connect (net/ipv4/af_inet.c:747)
       __sys_connect_file (net/socket.c:2048 (discriminator 2))
       __sys_connect (net/socket.c:2065)
       __x64_sys_connect (net/socket.c:2072)
       do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      RIP: 0033:0x7f03e5284e5d
      Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
      RSP: 002b:00007f03e4641cc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 00000000004bbf80 RCX: 00007f03e5284e5d
      RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 00000000004bbf80 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
      R13: 000000000000000b R14: 00007f03e52e5530 R15: 0000000000000000
       </TASK>
      
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Fixes: 28044fc1 ("net: Add a bhash2 table hashed by port and address")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66b60b0c
    • Tobias Waldekranz's avatar
      net: bridge: switchdev: Ensure deferred event delivery on unoffload · f7a70d65
      Tobias Waldekranz authored
      
      When unoffloading a device, it is important to ensure that all
      relevant deferred events are delivered to it before it disassociates
      itself from the bridge.
      
      Before this change, this was true for the normal case when a device
      maps 1:1 to a net_bridge_port, i.e.
      
         br0
         /
      swp0
      
      When swp0 leaves br0, the call to switchdev_deferred_process() in
      del_nbp() makes sure to process any outstanding events while the
      device is still associated with the bridge.
      
      In the case when the association is indirect though, i.e. when the
      device is attached to the bridge via an intermediate device, like a
      LAG...
      
          br0
          /
        lag0
        /
      swp0
      
      ...then detaching swp0 from lag0 does not cause any net_bridge_port to
      be deleted, so there was no guarantee that all events had been
      processed before the device disassociated itself from the bridge.
      
      Fix this by always synchronously processing all deferred events before
      signaling completion of unoffloading back to the driver.
      
      Fixes: 4e51bf44 ("net: bridge: move the switchdev object replay helpers to "push" mode")
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7a70d65
    • Tobias Waldekranz's avatar
      net: bridge: switchdev: Skip MDB replays of deferred events on offload · dc489f86
      Tobias Waldekranz authored
      
      Before this change, generation of the list of MDB events to replay
      would race against the creation of new group memberships, either from
      the IGMP/MLD snooping logic or from user configuration.
      
      While new memberships are immediately visible to walkers of
      br->mdb_list, the notification of their existence to switchdev event
      subscribers is deferred until a later point in time. So if a replay
      list was generated during a time that overlapped with such a window,
      it would also contain a replay of the not-yet-delivered event.
      
      The driver would thus receive two copies of what the bridge internally
      considered to be one single event. On destruction of the bridge, only
      a single membership deletion event was therefore sent. As a
      consequence of this, drivers which reference count memberships (at
      least DSA), would be left with orphan groups in their hardware
      database when the bridge was destroyed.
      
      This is only an issue when replaying additions. While deletion events
      may still be pending on the deferred queue, they will already have
      been removed from br->mdb_list, so no duplicates can be generated in
      that scenario.
      
      To a user this meant that old group memberships, from a bridge in
      which a port was previously attached, could be reanimated (in
      hardware) when the port joined a new bridge, without the new bridge's
      knowledge.
      
      For example, on an mv88e6xxx system, create a snooping bridge and
      immediately add a port to it:
      
          root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
          > ip link set dev x3 up master br0
      
      And then destroy the bridge:
      
          root@infix-06-0b-00:~$ ip link del dev br0
          root@infix-06-0b-00:~$ mvls atu
          ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
          DEV:0 Marvell 88E6393X
          33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
          33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
          ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
          root@infix-06-0b-00:~$
      
      The two IPv6 groups remain in the hardware database because the
      port (x3) is notified of the host's membership twice: once via the
      original event and once via a replay. Since only a single delete
      notification is sent, the count remains at 1 when the bridge is
      destroyed.
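
The reference-count imbalance described above can be sketched in a few lines of C. This is a minimal illustration, not driver code: `struct hw_group`, `hw_group_add()` and `hw_group_del()` are hypothetical names, and a real driver such as DSA tracks considerably more state.

```c
#include <assert.h>

/* Hypothetical per-group entry in a driver's hardware database. */
struct hw_group {
    int refcount; /* add notifications not yet matched by a delete */
};

/* Called for every membership addition the driver is notified of. */
static void hw_group_add(struct hw_group *g)
{
    g->refcount++;
}

/*
 * Called for every membership deletion.
 * Returns 1 when the last reference is dropped and the entry can be
 * removed from hardware, 0 when the entry must stay.
 */
static int hw_group_del(struct hw_group *g)
{
    assert(g->refcount > 0);
    g->refcount--;
    return g->refcount == 0;
}
```

If the original event and its replay both reach `hw_group_add()` but only one deletion follows, `hw_group_del()` returns 0 and the entry stays in hardware with a count of 1.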
      
      Then add the same port (or another port belonging to the same hardware
      domain) to a new bridge, this time with snooping disabled:
      
          root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \
          > ip link set dev x3 up master br1
      
      All multicast, including the two IPv6 groups from br0, should now be
      flooded, according to the policy of br1. But instead the old
      memberships are still active in the hardware database, causing the
      switch to only forward traffic to those groups towards the CPU (port
      0).
      
      Eliminate the race in two steps:
      
      1. Grab the write-side lock of the MDB while generating the replay
         list.
      
      This prevents new memberships from showing up while we are generating
      the replay list. But it leaves the scenario in which a deferred event
      was already generated, but not delivered, before we grabbed the
      lock. Therefore:
      
      2. Make sure that no deferred version of a replay event is already
         enqueued to the switchdev deferred queue, before adding it to the
         replay list, when replaying additions.
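
Step 2 can be sketched as a filter over the MDB entries. This is a simplified model under stated assumptions: `struct mdb_event`, `is_deferred()` and `build_replay()` are hypothetical names, events are reduced to a MAC address and a VLAN ID, and the locking from step 1 is not modeled. The actual bridge code works on its own list and object types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified event: a group MAC address plus a VLAN ID. */
struct mdb_event {
    unsigned char addr[6];
    unsigned short vid;
};

/* Two events match when both the address and the VLAN ID are equal. */
static bool event_equal(const struct mdb_event *a, const struct mdb_event *b)
{
    return a->vid == b->vid &&
           memcmp(a->addr, b->addr, sizeof(a->addr)) == 0;
}

/* Is an equivalent event already waiting on the deferred queue? */
static bool is_deferred(const struct mdb_event *queue, size_t n,
                        const struct mdb_event *ev)
{
    for (size_t i = 0; i < n; i++)
        if (event_equal(&queue[i], ev))
            return true;
    return false;
}

/*
 * Copy into `out` every MDB entry that has no pending deferred twin.
 * Returns the number of entries written to `out`.
 */
static size_t build_replay(const struct mdb_event *mdb, size_t n_mdb,
                           const struct mdb_event *deferred, size_t n_deferred,
                           struct mdb_event *out, size_t out_cap)
{
    size_t n_out = 0;

    assert(out_cap >= n_mdb);
    for (size_t i = 0; i < n_mdb; i++) {
        if (is_deferred(deferred, n_deferred, &mdb[i]))
            continue; /* the deferred copy will reach the driver later */
        out[n_out++] = mdb[i];
    }
    return n_out;
}
```

An entry that is skipped here is still delivered exactly once, through the deferred queue, so the driver sees one addition per membership.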
      
      Fixes: 4f2673b3 ("net: bridge: add helper to replay port and host-joined mdb entries")
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc489f86
    • Alexander Gordeev's avatar
      net/iucv: fix the allocation size of iucv_path_table array · b4ea9b6a
      Alexander Gordeev authored
      
      iucv_path_table is a dynamically allocated array of pointers to
      struct iucv_path items. Yet, its size is calculated as if it was
      an array of struct iucv_path items.
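
The size mismatch can be illustrated with a small userspace sketch. `struct path_item` is a hypothetical stand-in for `struct iucv_path` (the real layout differs), and `calloc()` stands in for the kernel allocator; only the `sizeof` reasoning is the point.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct iucv_path; the real layout differs. */
struct path_item {
    unsigned short pathid;
    unsigned char flags;
    void *private_data;
    char padding[40];
};

/* Bytes needed for a table of `n` pointers to items. */
static size_t table_bytes_pointers(size_t n)
{
    return n * sizeof(struct path_item *);
}

/* Bytes computed when the element is mistaken for the item itself. */
static size_t table_bytes_items(size_t n)
{
    return n * sizeof(struct path_item);
}

/*
 * Allocate a zeroed table of `n` pointers. Using sizeof(*table) keeps
 * the element size tied to the declared type of the table.
 */
static struct path_item **alloc_path_table(size_t n)
{
    struct path_item **table;

    assert(n > 0);
    table = calloc(n, sizeof(*table));
    return table;
}
```

Sizing by the item rather than the pointer over-allocates here, so the mistake wastes memory rather than overflowing the table.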
      
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4ea9b6a
  7. Feb 15, 2024
  8. Feb 14, 2024
    • Felix Fietkau's avatar
      netfilter: nf_tables: fix bidirectional offload regression · 84443741
      Felix Fietkau authored
      
      Commit 8f84780b ("netfilter: flowtable: allow unidirectional rules")
      made unidirectional flow offload possible, while completely ignoring (and
      breaking) bidirectional flow offload for nftables.
      Add the missing flag that was left out as an exercise for the reader :)
      
      Cc: Vlad Buslov <vladbu@nvidia.com>
      Fixes: 8f84780b ("netfilter: flowtable: allow unidirectional rules")
      Reported-by: Daniel Golle <daniel@makrotopia.org>
      Signed-off-by: Felix Fietkau <nbd@nbd.name>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      84443741
    • Kyle Swenson's avatar
      netfilter: nat: restore default DNAT behavior · 0f1ae282
      Kyle Swenson authored
      
      When a DNAT rule is configured via iptables with different port ranges,
      
          iptables -t nat -A PREROUTING -p tcp -d 10.0.0.2 -m tcp --dport 32000:32010 \
            -j DNAT --to-destination 192.168.0.10:21000-21010
      
      we seem to be DNATing to some random port on the LAN side. While this is
      expected if --random is passed to the iptables command, it is not
      expected without passing --random.  The expected behavior (and the
      observed behavior prior to the commit in the "Fixes" tag) is the traffic
      will be DNAT'd to 192.168.0.10:21000 unless there is a tuple collision
      with that destination.  In that case, we expect the traffic to be
      instead DNAT'd to 192.168.0.10:21001, and so forth until the end of
      the range.
      
      This patch intends to restore the behavior observed prior to the "Fixes"
      tag.
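
The expected non-random behavior can be sketched as a linear scan from the start of the range. This is an illustration of the described semantics only, not the kernel's NAT code: `pick_port_sequential()` and the `port_in_use_fn` callback are hypothetical, and the sample predicates stand in for a real tuple-collision check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: returns true if mapping to `port` would collide. */
typedef bool (*port_in_use_fn)(uint16_t port);

/*
 * Pick the first free port in [min, max], starting at min.
 * Returns 0 if every port in the range collides.
 */
static uint16_t pick_port_sequential(uint16_t min, uint16_t max,
                                     port_in_use_fn in_use)
{
    assert(min != 0 && min <= max);
    /* A 32-bit counter avoids wrap-around when max is 65535. */
    for (uint32_t p = min; p <= max; p++) {
        if (!in_use((uint16_t)p))
            return (uint16_t)p;
    }
    return 0;
}

/* Sample predicates for illustration. */
static bool none_in_use(uint16_t port)
{
    (void)port;
    return false;
}

static bool first_taken(uint16_t port)
{
    return port == 21000;
}

static bool all_in_use(uint16_t port)
{
    (void)port;
    return true;
}
```

With the range 21000-21010, `none_in_use` selects 21000 and `first_taken` selects 21001, matching the behavior the commit message describes for rules without --random.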
      
      Fixes: 6ed5943f ("netfilter: nat: remove l4 protocol port rovers")
      Signed-off-by: Kyle Swenson <kyle.swenson@est.tech>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0f1ae282