  1. Jul 12, 2023
  2. Jul 05, 2023
    • netfilter: conntrack: don't fold port numbers into addresses before hashing · eaf9e719
      Florian Westphal authored
      
      Originally this used jhash2() over tuple and folded the zone id,
      the pernet hash value, destination port and l4 protocol number into the
      32bit seed value.
      
      When the switch to siphash was done, I used an on-stack temporary
      buffer to build a suitable key to be hashed via siphash().
      
      But this showed up as performance regression, so I got rid of
      the temporary copy and collected to-be-hashed data in 4 u64 variables.
      
      This makes it easy to build tuples that produce the same hash, which isn't
      desirable even though chain lengths are limited.
      
      Switch back to plain siphash, but just like with jhash2(), take advantage
      of the fact that most of to-be-hashed data is already in a suitable order.
      
      Use an empty struct as annotation in 'struct nf_conntrack_tuple' to mark
      the last member that can be used as hash input.
      
      The only remaining data that isn't present in the tuple structure are the
      zone identifier and the pernet hash: fold those into the key.
      
      Fixes: d2c806ab ("netfilter: conntrack: use siphash_4u64")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      eaf9e719
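
The hashing scheme described in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions: `toy_tuple`, `toy_key`, `toy_keyed_hash()` and `toy_hash_tuple()` are hypothetical names, the struct layout is far simpler than the real 'struct nf_conntrack_tuple', and a keyed FNV-1a loop stands in for siphash (it is not cryptographically strong). What it tries to show is the shape of the idea: hash the contiguous prefix of the struct up to an empty marker member, and fold the zone id and per-netns value into the key rather than into the data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct nf_conntrack_tuple. */
struct toy_tuple {
	uint32_t src_addr;
	uint32_t dst_addr;
	uint16_t src_port;
	uint16_t dst_port;
	uint8_t  protonum;
	uint8_t  pad[3];       /* explicit padding: every hashed byte is defined */
	struct { } hash_end;   /* zero-sized marker (GNU C extension) */
	uint8_t  dir;          /* lies after the marker, so it is never hashed */
};

/* Hypothetical 128-bit key, loosely modelled on a siphash key. */
struct toy_key {
	uint64_t k0, k1;
};

/* Keyed FNV-1a: an assumption standing in for siphash(), NOT a secure PRF. */
static uint64_t toy_keyed_hash(const void *data, size_t len,
			       const struct toy_key *key)
{
	const unsigned char *p = data;
	uint64_t h = 0xcbf29ce484222325ULL ^ key->k0;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h ^ key->k1;
}

/* Fold zone id and per-netns mix into a copy of the key, then hash the
 * tuple bytes that precede the marker in one pass. */
static uint64_t toy_hash_tuple(const struct toy_tuple *t, uint32_t zone_id,
			       uint32_t net_mix, struct toy_key base)
{
	base.k0 ^= zone_id;
	base.k1 ^= net_mix;
	return toy_keyed_hash(t, offsetof(struct toy_tuple, hash_end), &base);
}
```

Callers should zero the tuple (for example with `memset()`) before filling it in, so that the hashed prefix never contains indeterminate bytes. Two tuples that differ only in `dir` then hash identically, while a change to any member before the marker, or to the zone id, changes the result.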
    • netfilter: nf_tables: report use refcount overflow · 1689f259
      Pablo Neira Ayuso authored
      
      Overflow use refcount checks are not complete.
      
      Add helper function to deal with object reference counter tracking.
      Report -EMFILE in case UINT_MAX is reached.
      
      nft_use_dec() splats in case that reference counter underflows,
      which should not ever happen.
      
      Add nft_use_inc_restore() and nft_use_dec_restore() which are used
      to restore reference counter from error and abort paths.
      
      Use u32 in nft_flowtable and nft_object since helper functions cannot
      work on bitfields.
      
      Remove the few early incomplete checks now that the helper functions
      are in place and used to check for refcount overflow.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      1689f259
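
The counter helpers described in the commit above might look roughly like the following userspace sketch. The names `toy_use_inc()`, `toy_use_dec()` and `toy_bind_object()` are hypothetical, and `fprintf()` stands in for the kernel's warning splat. Unlike a real refcount underflow, this sketch simply refuses to wrap the counter.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Increment the use counter unless it is already saturated. */
static bool toy_use_inc(uint32_t *use)
{
	if (*use == UINT32_MAX)
		return false;
	(*use)++;
	return true;
}

/* Decrement the use counter; an underflow should never happen, so it is
 * reported loudly and the counter is left at zero (sketch-only choice). */
static void toy_use_dec(uint32_t *use)
{
	if (*use == 0) {
		fprintf(stderr, "use counter underflow\n");
		return;
	}
	(*use)--;
}

/* Hypothetical caller: take a reference or report -EMFILE on overflow. */
static int toy_bind_object(uint32_t *use)
{
	if (!toy_use_inc(use))
		return -EMFILE;
	return 0;
}
```

Keeping the counter as a plain `uint32_t` rather than a bitfield is what allows the helpers to take a pointer to it, which matches the reason the commit gives for switching nft_flowtable and nft_object to u32.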
  3. Jun 29, 2023
  4. Jun 27, 2023
    • net: scm: introduce and use scm_recv_unix helper · a9c49cc2
      Alexander Mikhalitsyn authored
      Recently, our friends from the bluetooth subsystem reported [1] that after
      commit 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD") the scm_recv()
      helper became unusable in kernel modules (because it uses the unexported
      pidfd_prepare() API).
      
      We were aware of this issue and worked around it the hard way
      with commit 97154bcf ("af_unix: Kconfig: make CONFIG_UNIX bool").
      
      But recently new functionality was added by commit
      817efd3cad74 ("Bluetooth: hci_sock: Forward credentials to monitor")
      and after that bluetooth can't be compiled as a kernel module.
      
      After some discussion in [1] we decided to split scm_recv() into
      two helpers: scm_recv_unix(), which supports SCM_PIDFD and is used for
      unix sockets, and scm_recv(), which is exactly the same as it was before
      commit 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD").
      
      Link: https://lore.kernel.org/lkml/CAJqdLrpFcga4n7wxBhsFqPQiN8PKFVr6U10fKcJ9W7AcZn+o6Q@mail.gmail.com/ [1]
      Fixes: 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD")
      Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230627174314.67688-3-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a9c49cc2
    • af_unix: Skip SCM_PIDFD if scm->pid is NULL. · 603fc57a
      Kuniyuki Iwashima authored
      
      syzkaller hit a WARN_ON_ONCE(!scm->pid) in scm_pidfd_recv().
      
      In unix_stream_read_generic(), if there is no skb in the queue, we could
      bail out of the do-while loop without calling scm_set_cred():
      
        1. No skb in the queue
        2. sk is non-blocking
             or
           shutdown(sk, RCV_SHUTDOWN) is called concurrently
             or
           peer calls close()
      
      If the socket is configured with SO_PASSPIDFD, scm_pidfd_recv() would
      populate cmsg with garbage, emitting the warning.
      
      Let's skip SCM_PIDFD if scm->pid is NULL in scm_pidfd_recv().
      
      Note that another way would be to skip calling scm_recv() in such cases, but
      this caused a regression resulting in commit 9d797ee2 ("Revert "af_unix:
      Call scm_recv() only after scm_set_cred()."").
      
      WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_pidfd_recv include/net/scm.h:138 [inline]
      WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177
      Modules linked in:
      CPU: 1 PID: 3245 Comm: syz-executor.1 Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:scm_pidfd_recv include/net/scm.h:138 [inline]
      RIP: 0010:scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177
      Code: 67 fd e9 55 fd ff ff e8 4a 70 67 fd e9 7f fd ff ff e8 40 70 67 fd e9 3e fb ff ff e8 36 70 67 fd e9 02 fd ff ff e8 8c 3a 20 fd <0f> 0b e9 fe fb ff ff e8 50 70 67 fd e9 2e f9 ff ff e8 46 70 67 fd
      RSP: 0018:ffffc90009af7660 EFLAGS: 00010216
      RAX: 00000000000000a1 RBX: ffff888041e58a80 RCX: ffffc90003852000
      RDX: 0000000000040000 RSI: ffffffff842675b4 RDI: 0000000000000007
      RBP: ffffc90009af7810 R08: 0000000000000007 R09: 0000000000000013
      R10: 00000000000000f8 R11: 0000000000000001 R12: ffffc90009af7db0
      R13: 0000000000000000 R14: ffff888041e58a88 R15: 1ffff9200135eecc
      FS:  00007f6b7113f640(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b7111de38 CR3: 0000000012a6e002 CR4: 0000000000770ee0
      PKRU: 55555554
      Call Trace:
       <TASK>
       unix_stream_read_generic+0x5fe/0x1f50 net/unix/af_unix.c:2830
       unix_stream_recvmsg+0x194/0x1c0 net/unix/af_unix.c:2880
       sock_recvmsg_nosec net/socket.c:1019 [inline]
       sock_recvmsg+0x188/0x1d0 net/socket.c:1040
       ____sys_recvmsg+0x210/0x610 net/socket.c:2712
       ___sys_recvmsg+0xff/0x190 net/socket.c:2754
       do_recvmmsg+0x25d/0x6c0 net/socket.c:2848
       __sys_recvmmsg net/socket.c:2927 [inline]
       __do_sys_recvmmsg net/socket.c:2950 [inline]
       __se_sys_recvmmsg net/socket.c:2943 [inline]
       __x64_sys_recvmmsg+0x224/0x290 net/socket.c:2943
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3f/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      RIP: 0033:0x7f6b71da2e5d
      Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
      RSP: 002b:00007f6b7113ecc8 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
      RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007f6b71da2e5d
      RDX: 0000000000000007 RSI: 0000000020006600 RDI: 000000000000000b
      RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000120 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000006e R14: 00007f6b71e03530 R15: 0000000000000000
       </TASK>
      
      Fixes: 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230627174314.67688-2-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      603fc57a
    • netlink: Add __sock_i_ino() for __netlink_diag_dump(). · 25a9c8a4
      Kuniyuki Iwashima authored
      
      syzbot reported a warning in __local_bh_enable_ip(). [0]
      
      Commit 8d61f926 ("netlink: fix potential deadlock in
      netlink_set_err()") converted read_lock(&nl_table_lock) to
      read_lock_irqsave() in __netlink_diag_dump() to prevent a deadlock.
      
      However, __netlink_diag_dump() calls sock_i_ino() that uses
      read_lock_bh() and read_unlock_bh().  If CONFIG_TRACE_IRQFLAGS=y,
      read_unlock_bh() finally enables IRQ even though it should stay
      disabled until the following read_unlock_irqrestore().
      
      Using read_lock() in sock_i_ino() would trigger a lockdep splat
      in another place that was fixed in commit f064af1e ("net: fix
      a lockdep splat"), so let's add __sock_i_ino() that would be safe
      to use under BH disabled.
      
      [0]:
      WARNING: CPU: 0 PID: 5012 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
      Modules linked in:
      CPU: 0 PID: 5012 Comm: syz-executor487 Not tainted 6.4.0-rc7-syzkaller-00202-g6f68fc395f49 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
      RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
      Code: 45 bf 01 00 00 00 e8 91 5b 0a 00 e8 3c 15 3d 00 fb 65 8b 05 ec e9 b5 7e 85 c0 74 58 5b 5d c3 65 8b 05 b2 b6 b4 7e 85 c0 75 a2 <0f> 0b eb 9e e8 89 15 3d 00 eb 9f 48 89 ef e8 6f 49 18 00 eb a8 0f
      RSP: 0018:ffffc90003a1f3d0 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff1cf5996
      RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffff8805c6f3
      RBP: ffffffff8805c6f3 R08: 0000000000000001 R09: ffff8880152b03a3
      R10: ffffed1002a56074 R11: 0000000000000005 R12: 00000000000073e4
      R13: dffffc0000000000 R14: 0000000000000002 R15: 0000000000000000
      FS:  0000555556726300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000045ad50 CR3: 000000007c646000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       sock_i_ino+0x83/0xa0 net/core/sock.c:2559
       __netlink_diag_dump+0x45c/0x790 net/netlink/diag.c:171
       netlink_diag_dump+0xd6/0x230 net/netlink/diag.c:207
       netlink_dump+0x570/0xc50 net/netlink/af_netlink.c:2269
       __netlink_dump_start+0x64b/0x910 net/netlink/af_netlink.c:2374
       netlink_dump_start include/linux/netlink.h:329 [inline]
       netlink_diag_handler_dump+0x1ae/0x250 net/netlink/diag.c:238
       __sock_diag_cmd net/core/sock_diag.c:238 [inline]
       sock_diag_rcv_msg+0x31e/0x440 net/core/sock_diag.c:269
       netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2547
       sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280
       netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
       netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
       netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1914
       sock_sendmsg_nosec net/socket.c:724 [inline]
       sock_sendmsg+0xde/0x190 net/socket.c:747
       ____sys_sendmsg+0x71c/0x900 net/socket.c:2503
       ___sys_sendmsg+0x110/0x1b0 net/socket.c:2557
       __sys_sendmsg+0xf7/0x1c0 net/socket.c:2586
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f5303aaabb9
      Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffc7506e548 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5303aaabb9
      RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
      RBP: 00007f5303a6ed60 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f5303a6edf0
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       </TASK>
      
      Fixes: 8d61f926 ("netlink: fix potential deadlock in netlink_set_err()")
      Reported-by: <syzbot+5da61cf6a9bc1902d422@syzkaller.appspotmail.com>
      Link: https://syzkaller.appspot.com/bug?extid=5da61cf6a9bc1902d422
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20230626164313.52528-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      25a9c8a4
    • net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses · d06f925f
      Vladimir Oltean authored
      
      When using the felix driver (the only one which supports UC filtering
      and MC filtering) as a DSA master for a random other DSA switch, one can
      see the following stack trace when the downstream switch ports join a
      VLAN-aware bridge:
      
      =============================
      WARNING: suspicious RCU usage
      -----------------------------
      net/8021q/vlan_core.c:238 suspicious rcu_dereference_protected() usage!
      
      stack backtrace:
      Workqueue: dsa_ordered dsa_slave_switchdev_event_work
      Call trace:
       lockdep_rcu_suspicious+0x170/0x210
       vlan_for_each+0x8c/0x188
       dsa_slave_sync_uc+0x128/0x178
       __hw_addr_sync_dev+0x138/0x158
       dsa_slave_set_rx_mode+0x58/0x70
       __dev_set_rx_mode+0x88/0xa8
       dev_uc_add+0x74/0xa0
       dsa_port_bridge_host_fdb_add+0xec/0x180
       dsa_slave_switchdev_event_work+0x7c/0x1c8
       process_one_work+0x290/0x568
      
      What it's saying is that vlan_for_each() expects rtnl_lock() context and
      it's not getting it, when it's called from the DSA master's ndo_set_rx_mode().
      
      The caller of that - dsa_slave_set_rx_mode() - is the slave DSA
      interface's dsa_port_bridge_host_fdb_add() which comes from the deferred
      dsa_slave_switchdev_event_work().
      
      We went to great lengths to avoid the rtnl_lock() context in that call
      path in commit 0faf890f ("net: dsa: drop rtnl_lock from
      dsa_slave_switchdev_event_work"), and calling rtnl_lock() is simply not
      an option due to the possibility of deadlocking when calling
      dsa_flush_workqueue() from the call paths that do hold rtnl_lock() -
      basically all of them.
      
      So, when the DSA master calls vlan_for_each() from its ndo_set_rx_mode(),
      the state of the 8021q driver on this device is really not protected
      from concurrent access by anything.
      
      Looking at net/8021q/, I don't think that vlan_info->vid_list was
      particularly designed with RCU traversal in mind, so introducing an RCU
      read-side form of vlan_for_each() - vlan_for_each_rcu() - won't be so
      easy, and it also wouldn't be exactly what we need anyway.
      
      In general I believe that the solution isn't in net/8021q/ anyway;
      vlan_for_each() is not cut out for this task. DSA doesn't need rtnl_lock()
      to be held per se - since it's not a netdev state change that we're
      blocking, but rather, just concurrent additions/removals to a VLAN list.
      We don't even need sleepable context - the callback of vlan_for_each()
      just schedules deferred work.
      
      The proposed escape is to remove the dependency on vlan_for_each() and
      to open-code a non-sleepable, rtnl-free alternative to that, based on
      copies of the VLAN list modified from .ndo_vlan_rx_add_vid() and
      .ndo_vlan_rx_kill_vid().
      
      Fixes: 64fdc5f3 ("net: dsa: sync unicast and multicast addresses for VLAN filters too")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230626154402.3154454-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d06f925f
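
The "private copy of the VLAN list" idea from the commit above can be sketched in userspace C. Everything here is a hypothetical illustration: `toy_port`, `toy_vlan_add_vid()`, `toy_vlan_kill_vid()` and `toy_for_each_vlan()` are invented names, with the add and kill functions playing the role of the .ndo_vlan_rx_add_vid() and .ndo_vlan_rx_kill_vid() hooks. The sketch does no locking at all, which a real implementation would need in order to guard against concurrent additions and removals.

```c
#include <stdint.h>
#include <stdlib.h>

/* One entry of the port's private VLAN list (hypothetical layout). */
struct toy_vlan {
	uint16_t vid;
	struct toy_vlan *next;
};

struct toy_port {
	struct toy_vlan *vlans;
};

/* Record a VID; adding an already-known VID is a no-op. */
static int toy_vlan_add_vid(struct toy_port *p, uint16_t vid)
{
	struct toy_vlan *v;

	for (v = p->vlans; v; v = v->next)
		if (v->vid == vid)
			return 0;

	v = malloc(sizeof(*v));
	if (!v)
		return -1;
	v->vid = vid;
	v->next = p->vlans;
	p->vlans = v;
	return 0;
}

/* Forget a VID if it is present. */
static void toy_vlan_kill_vid(struct toy_port *p, uint16_t vid)
{
	struct toy_vlan **pp = &p->vlans;

	while (*pp) {
		if ((*pp)->vid == vid) {
			struct toy_vlan *dead = *pp;

			*pp = dead->next;
			free(dead);
			return;
		}
		pp = &(*pp)->next;
	}
}

/* Walk the private list; stops early if the callback returns non-zero. */
static int toy_for_each_vlan(const struct toy_port *p,
			     int (*cb)(uint16_t vid, void *arg), void *arg)
{
	for (const struct toy_vlan *v = p->vlans; v; v = v->next) {
		int err = cb(v->vid, arg);

		if (err)
			return err;
	}
	return 0;
}

/* Example callback: count the VIDs visited. */
static int toy_count_cb(uint16_t vid, void *arg)
{
	(void)vid;
	(*(int *)arg)++;
	return 0;
}
```

Because the walk only touches the port's own list, it does not depend on the 8021q driver's state, which is the property the commit message is after.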
  5. Jun 26, 2023
  6. Jun 24, 2023
  7. Jun 21, 2023
  8. Jun 20, 2023
    • netfilter: nf_tables: reject unbound anonymous set before commit phase · 938154b9
      Pablo Neira Ayuso authored
      
      Add a new list to track set transaction and to check for unbound
      anonymous sets before entering the commit phase.
      
      Bail out at the end of the transaction handling if an anonymous set
      remains unbound.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      938154b9
    • netfilter: nf_tables: drop map element references from preparation phase · 628bd3e4
      Pablo Neira Ayuso authored
      
      set .destroy callback releases the references to other objects in maps.
      This is very late and it results in spurious EBUSY errors. Drop refcount
      from the preparation phase instead, update set backend not to drop
      reference counter from set .destroy path.
      
      Exceptions: NFT_TRANS_PREPARE_ERROR does not require to drop the
      reference counter because the transaction abort path releases the map
      references for each element since the set is unbound. The abort path
      also deals with releasing reference counter for new elements added to
      unbound sets.
      
      Fixes: 59105446 ("netfilter: nf_tables: revisit chain/object refcounting from elements")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      628bd3e4
    • netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain · 26b5a571
      Pablo Neira Ayuso authored
      
      Add a new state to deal with rule expressions deactivation from the
      newrule error path, otherwise the anonymous set remains in the list in
      inactive state for the next generation. Mark the set/chain transaction
      as unbound so the abort path releases this object, set it as inactive in
      the next generation so it is not reachable anymore from this transaction
      and reference counter is dropped.
      
      Fixes: 1240eb93 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      26b5a571
    • netfilter: nf_tables: fix chain binding transaction logic · 4bedf9ee
      Pablo Neira Ayuso authored
      
      Add bound flag to rule and chain transactions as in 6a0a8d10
      ("netfilter: nf_tables: use-after-free in failing rule with bound set")
      to skip them in case that the chain is already bound from the abort
      path.
      
      This patch fixes an imbalance in the chain use refcnt that triggers a
      WARN_ON on the table and chain destroy path.
      
      This patch also disallows nested chain bindings, which is not
      supported from userspace.
      
      The logic to deal with chain binding in nft_data_hold() and
      nft_data_release() is not correct. The NFT_TRANS_PREPARE state needs a
      special handling in case a chain is bound but next expressions in the
      same rule fail to initialize as described by 1240eb93 ("netfilter:
      nf_tables: incorrect error path handling with NFT_MSG_NEWRULE").
      
      The chain is left bound if rule construction fails, so the objects
      stored in this chain (and the chain itself) are released by the
      transaction records from the abort path, follow up patch ("netfilter:
      nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain")
      completes this error handling.
      
      When deleting an existing rule, chain bound flag is set off so the
      rule expression .destroy path releases the objects.
      
      Fixes: d0e2c7de ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      4bedf9ee
    • net: dsa: introduce preferred_default_local_cpu_port and use on MT7530 · b79d7c14
      Vladimir Oltean authored
      
      Since the introduction of the OF bindings, DSA has always had a policy that
      in case multiple CPU ports are present in the device tree, the numerically
      smallest one is always chosen.
      
      The MT7530 switch family, except the switch on the MT7988 SoC, has 2 CPU
      ports, 5 and 6, where port 6 is preferable on the MT7531BE switch because
      it has higher bandwidth.
      
      The MT7530 driver developers had 3 options:
      - to modify DSA when the MT7531 switch support was introduced, such as to
        prefer the better port
      - to declare both CPU ports in device trees as CPU ports, and live with the
        sub-optimal performance resulting from not preferring the better port
      - to declare just port 6 in the device tree as a CPU port
      
      Of course they chose the path of least resistance (3rd option), kicking the
      can down the road. The hardware description in the device tree is supposed
      to be stable - developers are not supposed to adopt the strategy of
      piecemeal hardware description, where the device tree is updated in
      lockstep with the features that the kernel currently supports.
      
      Now, as a result of the fact that they did that, any attempts to modify the
      device tree and describe both CPU ports as CPU ports would make DSA change
      its default selection from port 6 to 5, effectively resulting in a
      performance degradation visible to users with the MT7531BE switch as can be
      seen below.
      
      Without preferring port 6:
      
      [ ID][Role] Interval           Transfer     Bitrate         Retr
      [  5][TX-C]   0.00-20.00  sec   374 MBytes   157 Mbits/sec  734    sender
      [  5][TX-C]   0.00-20.00  sec   373 MBytes   156 Mbits/sec    receiver
      [  7][RX-C]   0.00-20.00  sec  1.81 GBytes   778 Mbits/sec    0    sender
      [  7][RX-C]   0.00-20.00  sec  1.81 GBytes   777 Mbits/sec    receiver
      
      With preferring port 6:
      
      [ ID][Role] Interval           Transfer     Bitrate         Retr
      [  5][TX-C]   0.00-20.00  sec  1.99 GBytes   856 Mbits/sec  273    sender
      [  5][TX-C]   0.00-20.00  sec  1.99 GBytes   855 Mbits/sec    receiver
      [  7][RX-C]   0.00-20.00  sec  1.72 GBytes   737 Mbits/sec   15    sender
      [  7][RX-C]   0.00-20.00  sec  1.71 GBytes   736 Mbits/sec    receiver
      
      Using one port for WAN and the other ports for LAN is a very popular use
      case which is what this test emulates.
      
      As such, this change proposes that we retroactively modify stable kernels
      (which don't support the modification of the CPU port assignments, so as to
      let user space fix the problem and restore the throughput) to keep the
      mt7530 driver preferring port 6 even with device trees where the hardware
      is more fully described.
      
      Fixes: c288575f ("net: dsa: mt7530: Add the support of MT7531 switch")
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b79d7c14
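
The selection policy discussed in the commit above can be sketched as a small userspace C function. The names `toy_first_cpu_port()`, `toy_default_cpu_port()` and `TOY_NO_PORT` are hypothetical, and the bitmask representation of "which ports are described as CPU ports" is an assumption made for the sketch, not the kernel's data structure.

```c
#include <stdint.h>

#define TOY_NO_PORT (-1)

/* Historical policy: pick the numerically smallest CPU port. */
static int toy_first_cpu_port(uint32_t cpu_port_mask)
{
	for (int port = 0; port < 32; port++)
		if (cpu_port_mask & (UINT32_C(1) << port))
			return port;
	return TOY_NO_PORT;
}

/* New policy: honour the driver's preferred port when it is actually
 * described as a CPU port, otherwise fall back to the smallest one. */
static int toy_default_cpu_port(uint32_t cpu_port_mask, int preferred)
{
	if (preferred >= 0 && preferred < 32 &&
	    (cpu_port_mask & (UINT32_C(1) << preferred)))
		return preferred;
	return toy_first_cpu_port(cpu_port_mask);
}
```

With both ports 5 and 6 set in the mask, passing `TOY_NO_PORT` as the preference yields port 5, while passing 6 yields port 6. A device tree that describes only port 5 still gets port 5 even when 6 is preferred.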
  9. Jun 19, 2023
  10. Jun 18, 2023
    • tcp: Use per-vma locking for receive zerocopy · 7a7f0946
      Arjun Roy authored
      
      Per-VMA locking allows us to lock a struct vm_area_struct without
      taking the process-wide mmap lock in read mode.
      
      Consider a process workload where the mmap lock is taken constantly in
      write mode. In this scenario, all zerocopy receives are periodically
      blocked during that period of time - though in principle, the memory
      ranges being used by TCP are not touched by the operations that need
      the mmap write lock. This results in performance degradation.
      
      Now consider another workload where the mmap lock is never taken in
      write mode, but there are many TCP connections using receive zerocopy
      that are concurrently receiving. These connections all take the mmap
      lock in read mode, but this does induce a lot of contention and atomic
      ops for this process-wide lock. This results in additional CPU
      overhead caused by contending on the cache line for this lock.
      
      However, with per-vma locking, both of these problems can be avoided.
      
      As a test, I ran an RPC-style request/response workload with 4KB
      payloads and receive zerocopy enabled, with 100 simultaneous TCP
      connections. I measured perf cycles within the
      find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
      without per-vma locking enabled.
      
      When using process-wide mmap semaphore read locking, about 1% of
      measured perf cycles were within this path. With per-VMA locking, this
      value dropped to about 0.45%.
      
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7a7f0946
  11. Jun 17, 2023
    • tcp: enforce receive buffer memory limits by allowing the tcp window to shrink · b650d953
      mfreemon@cloudflare.com authored
      Under certain circumstances, the tcp receive buffer memory limit
      set by autotuning (sk_rcvbuf) is increased due to incoming data
      packets as a result of the window not closing when it should be.
      This can result in the receive buffer growing all the way up to
      tcp_rmem[2], even for tcp sessions with a low BDP.
      
      To reproduce:  Connect a TCP session with the receiver doing
      nothing and the sender sending small packets (an infinite loop
      of socket send() with 4 bytes of payload with a sleep of 1 ms
      in between each send()).  This will cause the tcp receive buffer
      to grow all the way up to tcp_rmem[2].
      
      As a result, a host can have individual tcp sessions with receive
      buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
      limits, causing the host to go into tcp memory pressure mode.
      
      The fundamental issue is the relationship between the granularity
      of the window scaling factor and the number of bytes ACKed back
      to the sender.  This problem has previously been identified in
      RFC 7323, appendix F [1].
      
      The Linux kernel currently adheres to never shrinking the window.
      
      In addition to the overallocation of memory mentioned above, the
      current behavior is functionally incorrect, because once tcp_rmem[2]
      is reached when no remediations remain (i.e. tcp collapse fails to
      free up any more memory and there are no packets to prune from the
      out-of-order queue), the receiver will drop in-window packets
      resulting in retransmissions and an eventual timeout of the tcp
      session.  A receive buffer full condition should instead result
      in a zero window and an indefinite wait.
      
      In practice, this problem is largely hidden for most flows.  It
      is not applicable to mice flows.  Elephant flows can send data
      fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
      triggering a zero window.
      
      But this problem does show up for other types of flows.  Examples
      are websockets and other types of flows that send small amounts of
      data spaced apart slightly in time.  In these cases, we directly
      encounter the problem described in [1].
      
      RFC 7323, section 2.4 [2], says there are instances when a retracted
      window can be offered, and that TCP implementations MUST ensure
      that they handle a shrinking window, as specified in RFC 1122,
      section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
      management have made clear that the sender must accept a shrunk window
      from the receiver, including RFC 793 [4] and RFC 1323 [5].
      
      This patch implements the functionality to shrink the tcp window
      when necessary to keep the right edge within the memory limit set by
      autotuning (sk_rcvbuf).  This new functionality is enabled with
      the new sysctl: net.ipv4.tcp_shrink_window
      
      Additional information can be found at:
      https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
      
      [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
      [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
      [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
      [4] https://www.rfc-editor.org/rfc/rfc793
      [5] https://www.rfc-editor.org/rfc/rfc1323
      
      
      
      Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b650d953
    • net: sched: Remove unused qdisc_l2t() · e16ad981
      YueHaibing authored
      
      This is unused since the switch to psched_l2t_ns().
      
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230615124810.34020-1-yuehaibing@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e16ad981
  12. Jun 16, 2023
    • net: ioctl: Use kernel memory on protocol ioctl callbacks · e1d001fa
      Breno Leitao authored
      
      Most of the ioctls to net protocols operate directly on the userspace
      argument (arg), usually doing get_user()/put_user() directly in the
      ioctl callback.  This is not flexible, because it is hard to reuse these
      functions without passing userspace buffers.
      
      Change the "struct proto" ioctls to avoid touching userspace memory and
      operate on kernel buffers, i.e., all protocols' ioctl callbacks are
      adapted to operate on kernel memory rather than on userspace (so, no
      more {put,get}_user() and friends being called in the ioctl callback).
      
      This changes the "struct proto" ioctl format in the following way:
      
          int                     (*ioctl)(struct sock *sk, int cmd,
      -                                        unsigned long arg);
      +                                        int *karg);
      
      (Important to say that this patch does not touch the "struct proto_ops"
      protocols)
      
      So, the "karg" argument, which is passed to the ioctl callback, is a
      pointer to kernel space memory (allocated inside a function wrapper).
      This buffer (karg) may contain an input argument (copied from userspace in
      a prep function) and it might return a value/buffer, which is copied
      back to userspace if necessary. There is no one-size-fits-all format
      (that is why I am using 'may' above), but basically, there are three types
      of ioctls:
      
      1) Does not read from userspace, returns a result to userspace
      2) Reads an input parameter from userspace, and does not return anything
        to userspace
      3) Reads an input from userspace, and returns a buffer to userspace.
      
      The default case (1) (where no input parameter is given, and an "int" is
      returned to userspace) encompasses more than 90% of the cases, but there
      are two other exceptions. Here is a list of exceptions:
      
      * Protocol RAW:
         * cmd = SIOCGETVIFCNT:
           * input and output = struct sioc_vif_req
         * cmd = SIOCGETSGCNT
           * input and output = struct sioc_sg_req
         * Explanation: for the SIOCGETVIFCNT case, userspace passes the input
           argument, which is struct sioc_vif_req. Then the callback populates
           the struct, which is copied back to userspace.
      
      * Protocol RAW6:
         * cmd = SIOCGETMIFCNT_IN6
           * input and output = struct sioc_mif_req6
         * cmd = SIOCGETSGCNT_IN6
           * input and output = struct sioc_sg_req6
      
      * Protocol PHONET:
        * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
           * input int (4 bytes)
        * Nothing is copied back to userspace.
      
      For the exception cases, the function sock_sk_ioctl_inout() will
      copy the userspace input into kernel space, and copy the result back
      to userspace.
      
      The wrapper that prepares the buffer and puts the buffer back to userspace
      is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the caller now
      calls sk_ioctl(), which will handle all cases.
      
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e1d001fa
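
The default case (1) from the commit above, where the callback only fills in an "int" that a wrapper then hands back to the caller, can be sketched in userspace C. All names here (`toy_proto_ioctl()`, `toy_sk_ioctl()`, `TOY_CMD_GET_QUEUED`) are hypothetical, `memcpy()` stands in for put_user(), and `-EINVAL` is used for unknown commands purely as a placeholder error code.

```c
#include <errno.h>
#include <string.h>

/* Callback type after the change: it sees only a kernel-side int. */
typedef int (*toy_proto_ioctl_t)(int cmd, int *karg);

enum { TOY_CMD_GET_QUEUED = 1 };

/* Hypothetical protocol callback: never touches the caller's buffer. */
static int toy_proto_ioctl(int cmd, int *karg)
{
	switch (cmd) {
	case TOY_CMD_GET_QUEUED:
		*karg = 42;	/* pretend 42 bytes are queued */
		return 0;
	default:
		return -EINVAL;	/* placeholder for "unknown command" */
	}
}

/* Wrapper: owns the kernel-side buffer and is the only place that copies
 * the result out to the caller-supplied ("userspace") pointer. */
static int toy_sk_ioctl(toy_proto_ioctl_t ioctl, int cmd, void *user_arg)
{
	int karg = 0;
	int ret = ioctl(cmd, &karg);

	if (ret)
		return ret;
	memcpy(user_arg, &karg, sizeof(karg));	/* stand-in for put_user() */
	return 0;
}
```

Because the callback works on a plain `int *`, it can be called from code that has no userspace buffer at hand, which is the reuse problem the commit message describes. On error the wrapper leaves the caller's buffer untouched.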
  13. Jun 15, 2023
  14. Jun 14, 2023