- Jul 12, 2023
-
-
Pedro Tammela authored
Eric Dumazet says[1]: ------- Speaking of psched_mtu(), I see that net/sched/sch_pie.c is using it without holding RTNL, so dev->mtu can be changed underneath. KCSAN could issue a warning. ------- Annotate dev->mtu with READ_ONCE() so KCSAN don't issue a warning. [1] https://lore.kernel.org/all/CANn89iJoJO5VtaJ-2=_d2aOQhb0Xw8iBT_Cxqp2HyuS-zj6azw@mail.gmail.com/ v1 -> v2: Fix commit message Fixes: d4b36210 ("net: pkt_sched: PIE AQM scheme") Suggested-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Pedro Tammela <pctammela@mojatatu.com> Reviewed-by:
Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230711021634.561598-1-pctammela@mojatatu.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jul 05, 2023
-
-
Florian Westphal authored
Originally this used jhash2() over tuple and folded the zone id, the pernet hash value, destination port and l4 protocol number into the 32bit seed value. When the switch to siphash was done, I used an on-stack temporary buffer to build a suitable key to be hashed via siphash(). But this showed up as performance regression, so I got rid of the temporary copy and collected to-be-hashed data in 4 u64 variables. This makes it easy to build tuples that produce the same hash, which isn't desirable even though chain lengths are limited. Switch back to plain siphash, but just like with jhash2(), take advantage of the fact that most of to-be-hashed data is already in a suitable order. Use an empty struct as annotation in 'struct nf_conntrack_tuple' to mark last member that can be used as hash input. The only remaining data that isn't present in the tuple structure are the zone identifier and the pernet hash: fold those into the key. Fixes: d2c806ab ("netfilter: conntrack: use siphash_4u64") Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
Overflow use refcount checks are not complete. Add helper function to deal with object reference counter tracking. Report -EMFILE in case UINT_MAX is reached. nft_use_dec() splats in case that reference counter underflows, which should not ever happen. Add nft_use_inc_restore() and nft_use_dec_restore() which are used to restore reference counter from error and abort paths. Use u32 in nft_flowtable and nft_object since helper functions cannot work on bitfields. Remove the few early incomplete checks now that the helper functions are in place and used to check for refcount overflow. Fixes: 96518518 ("netfilter: add nftables") Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
- Jun 29, 2023
-
-
Luiz Augusto von Dentz authored
This rework sync_interval to be sync_factor as having sync_interval in the order of seconds is sometimes not disarable. Wit sync_factor the application can tell how many SDU intervals it wants to send an announcement with PA, the EA interval is set to 2 times that so a factor of 24 of BIG SDU interval of 10ms would look like the following: < HCI Command: LE Set Extended Advertising Parameters (0x08|0x0036) plen 25 Handle: 0x01 Properties: 0x0000 Min advertising interval: 480.000 msec (0x0300) Max advertising interval: 480.000 msec (0x0300) Channel map: 37, 38, 39 (0x07) Own address type: Random (0x01) Peer address type: Public (0x00) Peer address: 00:00:00:00:00:00 (OUI 00-00-00) Filter policy: Allow Scan Request from Any, Allow Connect Request from Any (0x00) TX power: Host has no preference (0x7f) Primary PHY: LE 1M (0x01) Secondary max skip: 0x00 Secondary PHY: LE 2M (0x02) SID: 0x00 Scan request notifications: Disabled (0x00) < HCI Command: LE Set Periodic Advertising Parameters (0x08|0x003e) plen 7 Handle: 1 Min interval: 240.00 msec (0x00c0) Max interval: 240.00 msec (0x00c0) Properties: 0x0000 Signed-off-by:
Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Luiz Augusto von Dentz authored
When receiving a scan response there is no way to know if the remote device is connectable or not, so when it cannot be merged don't make any assumption and instead just mark it with a new flag defined as MGMT_DEV_FOUND_SCAN_RSP so userspace can tell it is a standalone SCAN_RSP. Link: https://lore.kernel.org/linux-bluetooth/CABBYNZ+CYMsDSPTxBn09Js3BcdC-x7vZFfyLJ3ppZGGwJKmUTw@mail.gmail.com/ Fixes: c70a7e4c ("Bluetooth: Add support for Not Connectable flag for Device Found events") Signed-off-by:
Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jun 27, 2023
-
-
Alexander Mikhalitsyn authored
Recently, our friends from bluetooth subsystem reported [1] that after commit 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD") scm_recv() helper become unusable in kernel modules (because it uses unexported pidfd_prepare() API). We were aware of this issue and workarounded it in a hard way by commit 97154bcf ("af_unix: Kconfig: make CONFIG_UNIX bool"). But recently a new functionality was added in the scope of commit 817efd3cad74 ("Bluetooth: hci_sock: Forward credentials to monitor") and after that bluetooth can't be compiled as a kernel module. After some discussion in [1] we decided to split scm_recv() into two helpers, one won't support SCM_PIDFD (used for unix sockets), and another one will be completely the same as it was before commit 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD"). Link: https://lore.kernel.org/lkml/CAJqdLrpFcga4n7wxBhsFqPQiN8PKFVr6U10fKcJ9W7AcZn+o6Q@mail.gmail.com/ [1] Fixes: 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD") Signed-off-by:
Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by:
Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230627174314.67688-3-kuniyu@amazon.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
syzkaller hit a WARN_ON_ONCE(!scm->pid) in scm_pidfd_recv(). In unix_stream_read_generic(), if there is no skb in the queue, we could bail out the do-while loop without calling scm_set_cred(): 1. No skb in the queue 2. sk is non-blocking or shutdown(sk, RCV_SHUTDOWN) is called concurrently or peer calls close() If the socket is configured with SO_PASSPIDFD, scm_pidfd_recv() would populate cmsg with garbage emitting the warning. Let's skip SCM_PIDFD if scm->pid is NULL in scm_pidfd_recv(). Note another way would be skip calling scm_recv() in such cases, but this caused a regression resulting in commit 9d797ee2 ("Revert "af_unix: Call scm_recv() only after scm_set_cred().""). WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_pidfd_recv include/net/scm.h:138 [inline] WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177 Modules linked in: CPU: 1 PID: 3245 Comm: syz-executor.1 Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:scm_pidfd_recv include/net/scm.h:138 [inline] RIP: 0010:scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177 Code: 67 fd e9 55 fd ff ff e8 4a 70 67 fd e9 7f fd ff ff e8 40 70 67 fd e9 3e fb ff ff e8 36 70 67 fd e9 02 fd ff ff e8 8c 3a 20 fd <0f> 0b e9 fe fb ff ff e8 50 70 67 fd e9 2e f9 ff ff e8 46 70 67 fd RSP: 0018:ffffc90009af7660 EFLAGS: 00010216 RAX: 00000000000000a1 RBX: ffff888041e58a80 RCX: ffffc90003852000 RDX: 0000000000040000 RSI: ffffffff842675b4 RDI: 0000000000000007 RBP: ffffc90009af7810 R08: 0000000000000007 R09: 0000000000000013 R10: 00000000000000f8 R11: 0000000000000001 R12: ffffc90009af7db0 R13: 0000000000000000 R14: ffff888041e58a88 R15: 1ffff9200135eecc FS: 00007f6b7113f640(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6b7111de38 CR3: 0000000012a6e002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> unix_stream_read_generic+0x5fe/0x1f50 net/unix/af_unix.c:2830 unix_stream_recvmsg+0x194/0x1c0 net/unix/af_unix.c:2880 sock_recvmsg_nosec net/socket.c:1019 [inline] sock_recvmsg+0x188/0x1d0 net/socket.c:1040 ____sys_recvmsg+0x210/0x610 net/socket.c:2712 ___sys_recvmsg+0xff/0x190 net/socket.c:2754 do_recvmmsg+0x25d/0x6c0 net/socket.c:2848 __sys_recvmmsg net/socket.c:2927 [inline] __do_sys_recvmmsg net/socket.c:2950 [inline] __se_sys_recvmmsg net/socket.c:2943 [inline] __x64_sys_recvmmsg+0x224/0x290 net/socket.c:2943 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3f/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f6b71da2e5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007f6b7113ecc8 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007f6b71da2e5d RDX: 0000000000000007 RSI: 0000000020006600 RDI: 000000000000000b RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000120 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f6b71e03530 R15: 0000000000000000 </TASK> Fixes: 5e2ff670 ("scm: add SO_PASSPIDFD and SCM_PIDFD") Reported-by:
syzkaller <syzkaller@googlegroups.com> Signed-off-by:
Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230627174314.67688-2-kuniyu@amazon.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
syzbot reported a warning in __local_bh_enable_ip(). [0] Commit 8d61f926 ("netlink: fix potential deadlock in netlink_set_err()") converted read_lock(&nl_table_lock) to read_lock_irqsave() in __netlink_diag_dump() to prevent a deadlock. However, __netlink_diag_dump() calls sock_i_ino() that uses read_lock_bh() and read_unlock_bh(). If CONFIG_TRACE_IRQFLAGS=y, read_unlock_bh() finally enables IRQ even though it should stay disabled until the following read_unlock_irqrestore(). Using read_lock() in sock_i_ino() would trigger a lockdep splat in another place that was fixed in commit f064af1e ("net: fix a lockdep splat"), so let's add __sock_i_ino() that would be safe to use under BH disabled. [0]: WARNING: CPU: 0 PID: 5012 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 Modules linked in: CPU: 0 PID: 5012 Comm: syz-executor487 Not tainted 6.4.0-rc7-syzkaller-00202-g6f68fc395f49 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 Code: 45 bf 01 00 00 00 e8 91 5b 0a 00 e8 3c 15 3d 00 fb 65 8b 05 ec e9 b5 7e 85 c0 74 58 5b 5d c3 65 8b 05 b2 b6 b4 7e 85 c0 75 a2 <0f> 0b eb 9e e8 89 15 3d 00 eb 9f 48 89 ef e8 6f 49 18 00 eb a8 0f RSP: 0018:ffffc90003a1f3d0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff1cf5996 RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffff8805c6f3 RBP: ffffffff8805c6f3 R08: 0000000000000001 R09: ffff8880152b03a3 R10: ffffed1002a56074 R11: 0000000000000005 R12: 00000000000073e4 R13: dffffc0000000000 R14: 0000000000000002 R15: 0000000000000000 FS: 0000555556726300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000045ad50 CR3: 000000007c646000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> sock_i_ino+0x83/0xa0 net/core/sock.c:2559 __netlink_diag_dump+0x45c/0x790 net/netlink/diag.c:171 netlink_diag_dump+0xd6/0x230 net/netlink/diag.c:207 netlink_dump+0x570/0xc50 net/netlink/af_netlink.c:2269 __netlink_dump_start+0x64b/0x910 net/netlink/af_netlink.c:2374 netlink_dump_start include/linux/netlink.h:329 [inline] netlink_diag_handler_dump+0x1ae/0x250 net/netlink/diag.c:238 __sock_diag_cmd net/core/sock_diag.c:238 [inline] sock_diag_rcv_msg+0x31e/0x440 net/core/sock_diag.c:269 netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2547 sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1914 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg+0xde/0x190 net/socket.c:747 ____sys_sendmsg+0x71c/0x900 net/socket.c:2503 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2557 __sys_sendmsg+0xf7/0x1c0 net/socket.c:2586 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f5303aaabb9 Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc7506e548 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5303aaabb9 RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003 RBP: 00007f5303a6ed60 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f5303a6edf0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Fixes: 8d61f926 ("netlink: fix potential deadlock in netlink_set_err()") Reported-by:
<syzbot+5da61cf6a9bc1902d422@syzkaller.appspotmail.com> Link: https://syzkaller.appspot.com/bug?extid=5da61cf6a9bc1902d422 Suggested-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by:
Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230626164313.52528-1-kuniyu@amazon.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
When using the felix driver (the only one which supports UC filtering and MC filtering) as a DSA master for a random other DSA switch, one can see the following stack trace when the downstream switch ports join a VLAN-aware bridge: ============================= WARNING: suspicious RCU usage ----------------------------- net/8021q/vlan_core.c:238 suspicious rcu_dereference_protected() usage! stack backtrace: Workqueue: dsa_ordered dsa_slave_switchdev_event_work Call trace: lockdep_rcu_suspicious+0x170/0x210 vlan_for_each+0x8c/0x188 dsa_slave_sync_uc+0x128/0x178 __hw_addr_sync_dev+0x138/0x158 dsa_slave_set_rx_mode+0x58/0x70 __dev_set_rx_mode+0x88/0xa8 dev_uc_add+0x74/0xa0 dsa_port_bridge_host_fdb_add+0xec/0x180 dsa_slave_switchdev_event_work+0x7c/0x1c8 process_one_work+0x290/0x568 What it's saying is that vlan_for_each() expects rtnl_lock() context and it's not getting it, when it's called from the DSA master's ndo_set_rx_mode(). The caller of that - dsa_slave_set_rx_mode() - is the slave DSA interface's dsa_port_bridge_host_fdb_add() which comes from the deferred dsa_slave_switchdev_event_work(). We went to great lengths to avoid the rtnl_lock() context in that call path in commit 0faf890f ("net: dsa: drop rtnl_lock from dsa_slave_switchdev_event_work"), and calling rtnl_lock() is simply not an option due to the possibility of deadlocking when calling dsa_flush_workqueue() from the call paths that do hold rtnl_lock() - basically all of them. So, when the DSA master calls vlan_for_each() from its ndo_set_rx_mode(), the state of the 8021q driver on this device is really not protected from concurrent access by anything. Looking at net/8021q/, I don't think that vlan_info->vid_list was particularly designed with RCU traversal in mind, so introducing an RCU read-side form of vlan_for_each() - vlan_for_each_rcu() - won't be so easy, and it also wouldn't be exactly what we need anyway. In general I believe that the solution isn't in net/8021q/ anyway; vlan_for_each() is not cut out for this task. DSA doesn't need rtnl_lock() to be held per se - since it's not a netdev state change that we're blocking, but rather, just concurrent additions/removals to a VLAN list. We don't even need sleepable context - the callback of vlan_for_each() just schedules deferred work. The proposed escape is to remove the dependency on vlan_for_each() and to open-code a non-sleepable, rtnl-free alternative to that, based on copies of the VLAN list modified from .ndo_vlan_rx_add_vid() and .ndo_vlan_rx_kill_vid(). Fixes: 64fdc5f3 ("net: dsa: sync unicast and multicast addresses for VLAN filters too") Signed-off-by:
Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230626154402.3154454-1-vladimir.oltean@nxp.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jun 26, 2023
-
-
Florian Westphal authored
Now that set->nelems is always updated permit update of the sets max size. Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
- Jun 24, 2023
-
-
David Howells authored
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through. Signed-off-by:
David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jun 21, 2023
-
-
Ilan Peer authored
Retrieve the Power Spectral Density (PSD) value from RNR AP information entry and store it so it could be used by the drivers. PSD value is explained in Section 9.4.2.170 of Draft P802.11Revme_D2.0. Signed-off-by:
Ilan Peer <ilan.peer@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230619161906.067ded2b8fc3.I9f407ab5800cbb07045a0537a513012960ced740@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Johannes Berg authored
We shouldn't refer to CPTCFG_, that's for backports, in mainline that's just CONFIG_. Fix it. Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Christophe JAILLET authored
Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of 'struct mctp_route' from 72 to 64 bytes. It saves a few bytes of memory and is more cache-line friendly. Signed-off-by:
Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by:
Simon Horman <simon.horman@corigine.com> Acked-by:
Jeremy Kerr <jk@codeconstruct.com.au> Reviewed-by:
Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/393ad1a5aef0aa28d839eeb3d7477da0e0eeb0b0.1687080803.git.christophe.jaillet@wanadoo.fr Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jun 20, 2023
-
-
Pablo Neira Ayuso authored
Add a new list to track set transaction and to check for unbound anonymous sets before entering the commit phase. Bail out at the end of the transaction handling if an anonymous set remains unbound. Fixes: 96518518 ("netfilter: add nftables") Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
set .destroy callback releases the references to other objects in maps. This is very late and it results in spurious EBUSY errors. Drop refcount from the preparation phase instead, update set backend not to drop reference counter from set .destroy path. Exceptions: NFT_TRANS_PREPARE_ERROR does not require to drop the reference counter because the transaction abort path releases the map references for each element since the set is unbound. The abort path also deals with releasing reference counter for new elements added to unbound sets. Fixes: 59105446 ("netfilter: nf_tables: revisit chain/object refcounting from elements") Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
Add a new state to deal with rule expressions deactivation from the newrule error path, otherwise the anonymous set remains in the list in inactive state for the next generation. Mark the set/chain transaction as unbound so the abort path releases this object, set it as inactive in the next generation so it is not reachable anymore from this transaction and reference counter is dropped. Fixes: 1240eb93 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE") Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
Add bound flag to rule and chain transactions as in 6a0a8d10 ("netfilter: nf_tables: use-after-free in failing rule with bound set") to skip them in case that the chain is already bound from the abort path. This patch fixes an imbalance in the chain use refcnt that triggers a WARN_ON on the table and chain destroy path. This patch also disallows nested chain bindings, which is not supported from userspace. The logic to deal with chain binding in nft_data_hold() and nft_data_release() is not correct. The NFT_TRANS_PREPARE state needs a special handling in case a chain is bound but next expressions in the same rule fail to initialize as described by 1240eb93 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE"). The chain is left bound if rule construction fails, so the objects stored in this chain (and the chain itself) are released by the transaction records from the abort path, follow up patch ("netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain") completes this error handling. When deleting an existing rule, chain bound flag is set off so the rule expression .destroy path releases the objects. Fixes: d0e2c7de ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Vladimir Oltean authored
Since the introduction of the OF bindings, DSA has always had a policy that in case multiple CPU ports are present in the device tree, the numerically smallest one is always chosen. The MT7530 switch family, except the switch on the MT7988 SoC, has 2 CPU ports, 5 and 6, where port 6 is preferable on the MT7531BE switch because it has higher bandwidth. The MT7530 driver developers had 3 options: - to modify DSA when the MT7531 switch support was introduced, such as to prefer the better port - to declare both CPU ports in device trees as CPU ports, and live with the sub-optimal performance resulting from not preferring the better port - to declare just port 6 in the device tree as a CPU port Of course they chose the path of least resistance (3rd option), kicking the can down the road. The hardware description in the device tree is supposed to be stable - developers are not supposed to adopt the strategy of piecemeal hardware description, where the device tree is updated in lockstep with the features that the kernel currently supports. Now, as a result of the fact that they did that, any attempts to modify the device tree and describe both CPU ports as CPU ports would make DSA change its default selection from port 6 to 5, effectively resulting in a performance degradation visible to users with the MT7531BE switch as can be seen below. Without preferring port 6: [ ID][Role] Interval Transfer Bitrate Retr [ 5][TX-C] 0.00-20.00 sec 374 MBytes 157 Mbits/sec 734 sender [ 5][TX-C] 0.00-20.00 sec 373 MBytes 156 Mbits/sec receiver [ 7][RX-C] 0.00-20.00 sec 1.81 GBytes 778 Mbits/sec 0 sender [ 7][RX-C] 0.00-20.00 sec 1.81 GBytes 777 Mbits/sec receiver With preferring port 6: [ ID][Role] Interval Transfer Bitrate Retr [ 5][TX-C] 0.00-20.00 sec 1.99 GBytes 856 Mbits/sec 273 sender [ 5][TX-C] 0.00-20.00 sec 1.99 GBytes 855 Mbits/sec receiver [ 7][RX-C] 0.00-20.00 sec 1.72 GBytes 737 Mbits/sec 15 sender [ 7][RX-C] 0.00-20.00 sec 1.71 GBytes 736 Mbits/sec receiver Using one port for WAN and the other ports for LAN is a very popular use case which is what this test emulates. As such, this change proposes that we retroactively modify stable kernels (which don't support the modification of the CPU port assignments, so as to let user space fix the problem and restore the throughput) to keep the mt7530 driver preferring port 6 even with device trees where the hardware is more fully described. Fixes: c288575f ("net: dsa: mt7530: Add the support of MT7531 switch") Signed-off-by:
Vladimir Oltean <olteanv@gmail.com> Signed-off-by:
Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by:
Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by:
Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jun 19, 2023
-
-
Kuniyuki Iwashima authored
As Eric Dumazet pointed out [0], ipv6_rthdr_rcv() pulls these data - Segment Routing Header : 8 - Hdr Ext Len : skb_transport_header(skb)[1] << 3 needed by ipv6_rpl_srh_rcv(). We can remove pskb_may_pull() and replace pskb_pull() with skb_pull() in ipv6_rpl_srh_rcv(). Link: https://lore.kernel.org/netdev/CANn89iLboLwLrHXeHJucAqBkEL_S0rJFog68t7wwwXO-aNf5Mg@mail.gmail.com/ [0] Signed-off-by:
Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
YueHaibing authored
commit f2f16758 ("xsk: Remove unused xsk_buff_discard") left behind this, remove it. Signed-off-by:
YueHaibing <yuehaibing@huawei.com> Signed-off-by:
Daniel Borkmann <daniel@iogearbox.net> Reviewed-by:
Simon Horman <simon.horman@corigine.com> Acked-by:
Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230616062800.30780-1-yuehaibing@huawei.com
-
Veerendranath Jakkam authored
STA MLD setup links may get removed if AP MLD remove the corresponding affiliated APs with Multi-Link reconfiguration as described in P802.11be_D3.0, section 35.3.6.2.2 Removing affiliated APs. Currently, there is no support to notify such operation to cfg80211 and userspace. Add support for the drivers to indicate STA MLD setup links removal to cfg80211 and notify the same to userspace. Upon receiving such indication from the driver, clear the MLO links information of the removed links in the WDEV. Signed-off-by:
Veerendranath Jakkam <quic_vjakkam@quicinc.com> Link: https://lore.kernel.org/r/20230317142153.237900-1-quic_vjakkam@quicinc.com [rename function and attribute, fix kernel-doc] Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Johannes Berg authored
Since regulatory disconnect was added, OCB and NAN interface types were added, which made it completely unusable for any driver that allowed OCB/NAN. Add OCB/NAN (though NAN doesn't do anything, we don't have any info) and also remove all the logic that opts out, so it won't be broken again if/when new interface types are added. Fixes: 6e0bd6c3 ("cfg80211: 802.11p OCB mode handling") Fixes: cb3b7d87 ("cfg80211: add start / stop NAN commands") Signed-off-by:
Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20230616222844.2794d1625a26.I8e78a3789a29e6149447b3139df724a6f1b46fc3@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Benjamin Berg authored
This is already needed within mac80211 and support is also needed by cfg80211 to parse ML elements. Signed-off-by:
Benjamin Berg <benjamin.berg@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230616094949.29c3ebeed10d.I009c049289dd0162c2e858ed8b68d2875a672ed6@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Benjamin Berg authored
This new function is called from within the inform_bss(_frame)_data functions in order for the driver to update data that it is tracking. Signed-off-by:
Benjamin Berg <benjamin.berg@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230616094949.8d7781b0f965.I80041183072b75c081996a1a5a230b34aff5c668@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Mukesh Sisodiya authored
For multi-link operation(MLO) TDLS management frames need to be transmitted on a specific link. The TDLS setup request will add BSSID along with peer address and userspace will pass the link-id based on BSSID value to the driver(or mac80211). Signed-off-by:
Mukesh Sisodiya <mukesh.sisodiya@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230616094948.cb3d87c22812.Ia3d15ac4a9a182145bf2d418bcb3ddf4539cd0a7@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Ilan Peer authored
When the association is complete, do not configure disabled links, and track them as part of the interface data. Signed-off-by:
Ilan Peer <ilan.peer@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230608163202.c194fabeb81a.Iaefdef5ba0492afe9a5ede14c68060a4af36e444@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
- Jun 18, 2023
-
-
Arjun Roy authored
Per-VMA locking allows us to lock a struct vm_area_struct without taking the process-wide mmap lock in read mode. Consider a process workload where the mmap lock is taken constantly in write mode. In this scenario, all zerocopy receives are periodically blocked during that period of time - though in principle, the memory ranges being used by TCP are not touched by the operations that need the mmap write lock. This results in performance degradation. Now consider another workload where the mmap lock is never taken in write mode, but there are many TCP connections using receive zerocopy that are concurrently receiving. These connections all take the mmap lock in read mode, but this does induce a lot of contention and atomic ops for this process-wide lock. This results in additional CPU overhead caused by contending on the cache line for this lock. However, with per-vma locking, both of these problems can be avoided. As a test, I ran an RPC-style request/response workload with 4KB payloads and receive zerocopy enabled, with 100 simultaneous TCP connections. I measured perf cycles within the find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and without per-vma locking enabled. When using process-wide mmap semaphore read locking, about 1% of measured perf cycles were within this path. With per-VMA locking, this value dropped to about 0.45%. Signed-off-by:
Arjun Roy <arjunroy@google.com> Reviewed-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jun 17, 2023
-
-
mfreemon@cloudflare.com authored
Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to incoming data packets as a result of the window not closing when it should be. This can result in the receive buffer growing all the way up to tcp_rmem[2], even for tcp sessions with a low BDP. To reproduce: Connect a TCP session with the receiver doing nothing and the sender sending small packets (an infinite loop of socket send() with 4 bytes of payload with a sleep of 1 ms in between each send()). This will cause the tcp receive buffer to grow all the way up to tcp_rmem[2]. As a result, a host can have individual tcp sessions with receive buffers of size tcp_rmem[2], and the host itself can reach tcp_mem limits, causing the host to go into tcp memory pressure mode. The fundamental issue is the relationship between the granularity of the window scaling factor and the number of byte ACKed back to the sender. This problem has previously been identified in RFC 7323, appendix F [1]. The Linux kernel currently adheres to never shrinking the window. In addition to the overallocation of memory mentioned above, the current behavior is functionally incorrect, because once tcp_rmem[2] is reached when no remediations remain (i.e. tcp collapse fails to free up any more memory and there are no packets to prune from the out-of-order queue), the receiver will drop in-window packets resulting in retransmissions and an eventual timeout of the tcp session. A receive buffer full condition should instead result in a zero window and an indefinite wait. In practice, this problem is largely hidden for most flows. It is not applicable to mice flows. Elephant flows can send data fast enough to "overrun" the sk_rcvbuf limit (in a single ACK), triggering a zero window. But this problem does show up for other types of flows. Examples are websockets and other type of flows that send small amounts of data spaced apart slightly in time. In these cases, we directly encounter the problem described in [1]. RFC 7323, section 2.4 [2], says there are instances when a retracted window can be offered, and that TCP implementations MUST ensure that they handle a shrinking window, as specified in RFC 1122, section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window management have made clear that sender must accept a shrunk window from the receiver, including RFC 793 [4] and RFC 1323 [5]. This patch implements the functionality to shrink the tcp window when necessary to keep the right edge within the memory limit by autotuning (sk_rcvbuf). This new functionality is enabled with the new sysctl: net.ipv4.tcp_shrink_window Additional information can be found at: https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/ [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4 [3] https://www.rfc-editor.org/rfc/rfc1122#page-91 [4] https://www.rfc-editor.org/rfc/rfc793 [5] https://www.rfc-editor.org/rfc/rfc1323 Signed-off-by:
Mike Freemon <mfreemon@cloudflare.com> Reviewed-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
YueHaibing authored
This is unused since switch to psched_l2t_ns(). Signed-off-by:
YueHaibing <yuehaibing@huawei.com> Reviewed-by:
Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230615124810.34020-1-yuehaibing@huawei.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jun 16, 2023
-
-
Breno Leitao authored
Most of the ioctls to net protocols operates directly on userspace argument (arg). Usually doing get_user()/put_user() directly in the ioctl callback. This is not flexible, because it is hard to reuse these functions without passing userspace buffers. Change the "struct proto" ioctls to avoid touching userspace memory and operate on kernel buffers, i.e., all protocol's ioctl callbacks is adapted to operate on a kernel memory other than on userspace (so, no more {put,get}_user() and friends being called in the ioctl callback). This changes the "struct proto" ioctl format in the following way: int (*ioctl)(struct sock *sk, int cmd, - unsigned long arg); + int *karg); (Important to say that this patch does not touch the "struct proto_ops" protocols) So, the "karg" argument, which is passed to the ioctl callback, is a pointer allocated to kernel space memory (inside a function wrapper). This buffer (karg) may contain input argument (copied from userspace in a prep function) and it might return a value/buffer, which is copied back to userspace if necessary. There is not one-size-fits-all format (that is I am using 'may' above), but basically, there are three type of ioctls: 1) Do not read from userspace, returns a result to userspace 2) Read an input parameter from userspace, and does not return anything to userspace 3) Read an input from userspace, and return a buffer to userspace. The default case (1) (where no input parameter is given, and an "int" is returned to userspace) encompasses more than 90% of the cases, but there are two other exceptions. Here is a list of exceptions: * Protocol RAW: * cmd = SIOCGETVIFCNT: * input and output = struct sioc_vif_req * cmd = SIOCGETSGCNT * input and output = struct sioc_sg_req * Explanation: for the SIOCGETVIFCNT case, userspace passes the input argument, which is struct sioc_vif_req. Then the callback populates the struct, which is copied back to userspace. * Protocol RAW6: * cmd = SIOCGETMIFCNT_IN6 * input and output = struct sioc_mif_req6 * cmd = SIOCGETSGCNT_IN6 * input and output = struct sioc_sg_req6 * Protocol PHONET: * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE * input int (4 bytes) * Nothing is copied back to userspace. For the exception cases, functions sock_sk_ioctl_inout() will copy the userspace input, and copy it back to kernel space. The wrapper that prepare the buffer and put the buffer back to user is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the callee now calls sk_ioctl(), which will handle all cases. Signed-off-by:
Breno Leitao <leitao@debian.org> Reviewed-by:
Willem de Bruijn <willemb@google.com> Reviewed-by:
David Ahern <dsahern@kernel.org> Reviewed-by:
Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jun 15, 2023
-
-
Jakub Kicinski authored
All callers of tls_is_sk_tx_device_offloaded() currently do an equivalent of: if (skb->sk && tls_is_skb_tx_device_offloaded(skb->sk)) Have the helper accept skb and do the skb->sk check locally. Two drivers have local static inlines with similar wrappers already. While at it change the ifdef condition to TLS_DEVICE. Only TLS_DEVICE selects SOCK_VALIDATE_XMIT, so the two are equivalent. This makes removing the duplicated IS_ENABLED() check in funeth more obviously correct. Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Acked-by:
Maxim Mikityanskiy <maxtram95@gmail.com> Reviewed-by:
Simon Horman <simon.horman@corigine.com> Acked-by:
Tariq Toukan <tariqt@nvidia.com> Acked-by:
Dimitris Michailidis <dmichail@fungible.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jun 14, 2023
-
-
Johannes Berg authored
Support new firmware that can validate the validate bits in sniffer mode, and advertise that fact and the result of the checks in the U-SIG radiotap field. Signed-off-by:
Johannes Berg <johannes.berg@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230613155501.c20480aa1171.Icc0d077dae01d662ccb948823e196aa9c5c87976@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Ilan Peer authored
Instead clarify the cases where link ID == 0 is intended for an AP STA that is not part of an AP MLD. Signed-off-by:
Ilan Peer <ilan.peer@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230611121219.77236a2e26ad.I8193ca8e236c9eb015870471f77a7d5134da3156@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Ilan Peer authored
An AP part of an AP MLD might be temporarily disabled, and might be enabled later. Such a link should be included in the association exchange, but should not be used until enabled. Extend the NL80211_CMD_ASSOCIATE to also indicate disabled links. Signed-off-by:
Ilan Peer <ilan.peer@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230608163202.c4c61ee4c4a5.I784ef4a0d619fc9120514b5615458fbef3b3684a@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Ilan Peer authored
As a preparation to support disabled/dormant links, add the following function: - ieee80211_vif_usable_links(): returns the bitmap of the links that can be activated. Use this function in all the places that the bitmap of the usable links is needed. - ieee80211_vif_is_mld(): returns true iff the vif is an MLD. Use this function in all the places where an indication that the connection is a MLD is needed. Signed-off-by:
Ilan Peer <ilan.peer@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230608163202.86e3351da1fc.If6fe3a339fda2019f13f57ff768ecffb711b710a@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Miri Korenblit authored
There are cases in which we don't want the user to override the smps mode, e.g. when SMPS should be disabled due to EMLSR. Add a driver flag to disable SMPS overriding and don't override if it is set. Signed-off-by:
Miri Korenblit <miriam.rachel.korenblit@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230608163202.ef129e80556c.I74a298fdc86b87074c95228d3916739de1400597@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Johannes Berg authored
There's quite a bit of code accessing sband iftype data (HE, HE 6 GHz, EHT) and we always need to remember to use the ieee80211_vif_type_p2p() helper. Add new helpers to directly get it from the sband/vif rather than having to call ieee80211_vif_type_p2p(). Convert most code with the following spatch: @@ expression vif, sband; @@ -ieee80211_get_he_iftype_cap(sband, ieee80211_vif_type_p2p(vif)) +ieee80211_get_he_iftype_cap_vif(sband, vif) @@ expression vif, sband; @@ -ieee80211_get_eht_iftype_cap(sband, ieee80211_vif_type_p2p(vif)) +ieee80211_get_eht_iftype_cap_vif(sband, vif) @@ expression vif, sband; @@ -ieee80211_get_he_6ghz_capa(sband, ieee80211_vif_type_p2p(vif)) +ieee80211_get_he_6ghz_capa_vif(sband, vif) Signed-off-by:
Johannes Berg <johannes.berg@intel.com> Signed-off-by:
Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230604120651.db099f49e764.Ie892966c49e22c7b7ee1073bc684f142debfdc84@changeid Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Gilad Itzkovitch authored
Increase the size of S1G rate_info flags to support S1G and add flags for new S1G MCS and the supported bandwidths. Also, include S1G rate information to netlink STA rate message. Lastly, add rate calculation function for S1G MCS. Signed-off-by:
Gilad Itzkovitch <gilad.itzkovitch@morsemicro.com> Link: https://lore.kernel.org/r/20230518000723.991912-1-gilad.itzkovitch@morsemicro.com Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Peilin Ye authored
mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for clsact Qdiscs and miniq_egress. Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER requests (thanks Hillf Danton for the hint), when replacing ingress or clsact Qdiscs, for example, the old Qdisc ("@old") could access the same miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"), causing race conditions [1] including a use-after-free bug in mini_qdisc_pair_swap() reported by syzbot: BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901 ... Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319 print_report mm/kasan/report.c:430 [inline] kasan_report+0x11c/0x130 mm/kasan/report.c:536 mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 tcf_chain_head_change_item net/sched/cls_api.c:495 [inline] tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509 tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline] tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline] tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266 ... @old and @new should not affect each other. In other words, @old should never modify miniq_{in,e}gress after @new, and @new should not update @old's RCU state. Fixing without changing sch_api.c turned out to be difficult (please refer to Closes: for discussions). Instead, make sure @new's first call always happen after @old's last call (in {ingress,clsact}_destroy()) has finished: In qdisc_graft(), return -EBUSY if @old has any ongoing filter requests, and call qdisc_destroy() for @old before grafting @new. Introduce qdisc_refcount_dec_if_one() as the counterpart of qdisc_refcount_inc_nz() used for filter requests. Introduce a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check, just like qdisc_put() etc. Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs". [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under TC_H_ROOT (no longer possible after commit c7cfbd11 ("net/sched: sch_ingress: Only create under TC_H_INGRESS")) on eth0 that has 8 transmission queues: Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then adds a flower filter X to A. Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and b2) to replace A, then adds a flower filter Y to B. Thread 1 A's refcnt Thread 2 RTM_NEWQDISC (A, RTNL-locked) qdisc_create(A) 1 qdisc_graft(A) 9 RTM_NEWTFILTER (X, RTNL-unlocked) __tcf_qdisc_find(A) 10 tcf_chain0_head_change(A) mini_qdisc_pair_swap(A) (1st) | | RTM_NEWQDISC (B, RTNL-locked) RCU sync 2 qdisc_graft(B) | 1 notify_and_destroy(A) | tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-unlocked) qdisc_destroy(A) tcf_chain0_head_change(B) tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd) mini_qdisc_pair_swap(A) (3rd) | ... ... Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets on eth0 will not find filter Y in sch_handle_ingress(). This is just one of the possible consequences of concurrently accessing miniq_{in,e}gress pointers. Fixes: 7a096d57 ("net: sched: ingress: set 'unlocked' flag for Qdisc ops") Fixes: 87f37392 ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops") Reported-by:
<syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com> Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/ Cc: Hillf Danton <hdanton@sina.com> Cc: Vlad Buslov <vladbu@mellanox.com> Signed-off-by:
Peilin Ye <peilin.ye@bytedance.com> Acked-by:
Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-