TL;DR
A developer built a Linux kernel module that makes Thunderbolt ports emulate InfiniBand devices, enabling high-speed RDMA communication between consumer mini PCs. If it holds up, the approach could bring distributed AI training and inference to home hardware.
A developer has created a Linux kernel module that lets the Thunderbolt 4 and USB4 ports on AMD mini PCs emulate InfiniBand devices, enabling high-speed RDMA communication between machines at home. The work could let consumer hardware handle AI training and inference workloads that usually depend on enterprise networking. In principle, smart home gadgets could also benefit from this kind of high-speed data transfer.
The project implements experimental RDMA-over-USB4 between two AMD mini PCs, specifically 128GB Strix Halo models, with reported bidirectional throughput of approximately 95 Gb/s and latency of around 7 microseconds. The developer reports that this setup supports tensor-parallel inference and Fully Sharded Data Parallel (FSDP) workloads. The examples given are a MiniMax-M2.7 inference run that exceeds the capacity of a single machine, and a Gemma 3 27B LoRA FSDP step that took just over two minutes, down from over 21 minutes on Ethernet.
This was achieved with a custom Linux kernel module that makes Thunderbolt ports appear as InfiniBand devices, using RDMA (Remote Direct Memory Access) for rapid data exchange. The setup reportedly sustains around 48 Gb/s per direction, or about 95 Gb/s in aggregate, well above standard Ethernet and soft-RoCE configurations on Thunderbolt networks. One-way latency is reported at about 7 microseconds, versus 28 to 65 microseconds for traditional Ethernet and Thunderbolt-based networking.
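Because the module is described as making Thunderbolt ports appear as InfiniBand devices, one plausible way to check that it loaded is to look where Linux normally exposes RDMA devices, under /sys/class/infiniband. The sketch below is an illustration, not the developer's tooling: it assumes the module registers through the kernel's standard RDMA subsystem, and the function names are hypothetical. The article does not say what device name the module uses, so the code simply lists whatever is present.

```python
from pathlib import Path


def _read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to its ports' reported state and rate.

    Assumption: the device is registered with the kernel RDMA core and
    therefore shows up under /sys/class/infiniband like any other adapter.
    """
    root = Path(sysfs_root)
    devices = {}
    if not root.is_dir():
        return devices  # no RDMA devices registered, or not a Linux host
    for dev in sorted(root.iterdir()):
        ports = {}
        ports_dir = dev / "ports"
        if ports_dir.is_dir():
            for port in sorted(ports_dir.iterdir()):
                ports[port.name] = {
                    "state": _read_attr(port / "state"),
                    "rate": _read_attr(port / "rate"),
                }
        devices[dev.name] = ports
    return devices


if __name__ == "__main__":
    found = list_rdma_devices()
    if not found:
        print("No RDMA devices found")
    for name, ports in found.items():
        for port, info in ports.items():
            print(f"{name} port {port}: state={info['state']} rate={info['rate']}")
```

On a machine with no RDMA-capable device the function returns an empty dictionary, which makes it safe to run before and after loading an experimental module.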
Why It Matters
The project suggests that high-performance, low-latency RDMA communication is achievable on consumer hardware through Thunderbolt ports, which could open up AI training and inference to people without costly enterprise networking gear. If it proves scalable and stable, the approach could let hobbyists and small labs run distributed AI workloads at home, reducing reliance on cloud services and expensive data center infrastructure.
Background
High-speed RDMA networks like InfiniBand have traditionally been confined to enterprise data centers and supercomputers because of their specialized hardware and complex setup. Recent efforts have explored RDMA over Converged Ethernet (RoCE) and soft-RoCE implementations, but these fall short on throughput and latency. The developer's work builds on ongoing research into making RDMA more accessible, using the USB4 and Thunderbolt interfaces common on consumer PCs. The project is experimental: the developer notes that it is research code with potential false assumptions and sharp edges, and that it is not intended for production use.
“This is experimental research code, most of it AI-generated, and it loads experimental kernel modules on machines I was willing to crash repeatedly.”
— the developer behind the project
“We built experimental RDMA-over-USB4 for 128GB Strix Halo mini PCs. It lets two consumer boxes talk fast enough to run tensor-parallel inference and FSDP workloads across both machines.”
— the developer
What Remains Unclear
It remains unclear how stable and scalable this solution is for long-term or production use, and whether it can be widely adopted across different hardware configurations. Further testing and development are needed to determine its practical viability.
What’s Next
The next steps involve refining the kernel modules for stability, testing across more hardware setups, and exploring potential integration into consumer operating systems. Broader community engagement and peer review are likely to follow to assess feasibility for wider use.
Key Questions
Can this setup be used for commercial or production AI workloads?
Currently, no. The project is experimental and not intended for production use. Stability and scalability are still under evaluation.
What hardware is required to replicate this setup?
At minimum, two AMD mini PCs with Thunderbolt 4 or USB4 ports, and the developer’s custom Linux kernel modules. Hardware compatibility and driver support are still being tested.
Does this mean consumers can build their own high-speed networks at home?
Potentially, yes, but current implementations are experimental. Widespread adoption will require further development and stability improvements.
How does this compare to traditional Ethernet or Wi-Fi for AI workloads?
According to the developer, RDMA over Thunderbolt offers much lower latency (about 7 microseconds) and higher throughput (about 95 Gb/s bidirectional) than Ethernet or Wi-Fi.
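The figures quoted in this article can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers the developer reported; the function names and the 6 GB payload are hypothetical, chosen to illustrate the formulas, and the transfer-time estimate ignores protocol overhead.

```python
# Figures as reported in the article (developer's claims, not independently verified).
PER_DIRECTION_GBPS = 48.0
REPORTED_AGGREGATE_GBPS = 95.0
RDMA_LATENCY_US = 7.0
OTHER_LATENCY_US = (28.0, 65.0)
ETHERNET_STEP_MIN = 21.0  # "over 21 minutes", so this is a lower bound
RDMA_STEP_MIN = 2.0       # "just over two minutes", also a lower bound


def latency_advantage(other_us, rdma_us=RDMA_LATENCY_US):
    """How many times lower the RDMA latency is than another setup's latency."""
    return other_us / rdma_us


def transfer_seconds(size_gigabytes, link_gbps=PER_DIRECTION_GBPS):
    """Ideal one-way transfer time in seconds, ignoring overhead and latency."""
    return size_gigabytes * 8 / link_gbps


if __name__ == "__main__":
    print(f"2 x per-direction rate: {2 * PER_DIRECTION_GBPS:.0f} Gb/s "
          f"(reported aggregate: {REPORTED_AGGREGATE_GBPS:.0f} Gb/s)")
    low, high = (latency_advantage(x) for x in OTHER_LATENCY_US)
    print(f"Latency advantage: {low:.1f}x to {high:.1f}x")
    print(f"Rough step speedup: {ETHERNET_STEP_MIN / RDMA_STEP_MIN:.1f}x")
    # Hypothetical payload size, used only to show how the formula behaves.
    print(f"6 GB one way: {transfer_seconds(6.0):.1f} s")
```

Taken at face value, the reported numbers work out to a latency gap of roughly 4x to 9x and a training-step speedup of about tenfold. Both step times are quoted only as "over" a value, so the speedup is an approximation.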
Source: Hacker News