VOPD3 will let compilers pair instructions more leniently.
The next generation of Radeon GPUs from AMD is expected to be a significant upgrade over RDNA 4, and one of the issues Team Red seems to be tackling is dual issue execution.
That's the GPU's ability to execute two instructions in the same cycle — AMD's cards have had this feature since RDNA 3, but strict pairing rules meant that compilers couldn't always take advantage of it, limiting theoretical peak performance.
A new LLVM patch now suggests that AMD will be solving this on RDNA 5.
Coelacanth's Dream, a Linux-focused outlet, examined the new changes and found that they reference gfx13, which is derived from gfx130, aka RDNA 5.
AMD is apparently adding a new instruction format called "VOPD3" that is designed to better interface with the dual issue VALU (Vector Arithmetic Logic Unit; shader unit).
It should be more lenient, making it easier for the compiler to use dual issue execution.
On a technical level, the existing system, known as VOPD, largely only worked with simpler 2-operand instructions, which made it harder for compilers to schedule compatible instruction pairs.
VOPD3 will expand this to 3-operand instructions, so it would be able to support operations like fused multiply-add (FMA).
In fact, V_FMA_F32 support was added in this very patch, which is how we can infer the format will arrive with RDNA 5.
This would allow dual issue execution to happen more often, leading to a potentially massive increase in FP32 throughput (in some cases).
Shader units will spend fewer cycles sitting idle and more cycles doing useful work, getting more out of each instruction.
This could help in demanding scenarios, such as rendering, since game engines will be able to optimize for the dual issue VALU.
Reducing the number of cases where pairing fails due to restrictions is a key step to making the hardware more efficient without brute-forcing IPC uplifts through silicon.
FMA instructions are also important for neural rendering, so upscaling and frame-generation tech could get a boost here as well. Even if the hardware itself is no faster, dual issue execution improves efficiency regardless.
You can check out the Coelacanth's Dream article linked above if you're interested in more specifics, but be warned that it's very dense.
Moreover, RDNA 5 is still a ways out at this point, and more consumer-facing upgrades, like higher core counts, would certainly be easier to market.
Still, seeing a GPU reach its advertised FP32 throughput more easily and more consistently is a big architectural win.
Hassam Nasir is a die-hard hardware enthusiast with years of experience as a tech editor and writer, focusing on detailed CPU comparisons and general hardware news.
When he’s not working, you’ll find him bending tubes for his ever-evolving custom water-loop gaming rig or benchmarking the latest CPUs and GPUs just for fun.
Faiakes said: This sounds like a driver issue. Couldn't this apply to RDNA 4?
Alpha_Lyrae said: In RT workloads, dual-issue came under vector register pressure, as RT tends to eat up VGPRs and RDNA3/4 had to secure two separate VGPRs for the dual-issue instruction, else the hardware would simply refuse to launch it. There's a chance AMD also moved to SIMD64 with pseudo-SIMD32 support.
-Fran- said: Anything about fixing the chiplet design with RDNA5? Regards.
bit_user said: Why would they go back to Wave64?
Alpha_Lyrae said: Wave64 is still supported. There's a lot of latency hiding and deep parallel work queues in modern GPUs, so in isolation, 4x throughput per SIMD64 (vs SIMD16) and 2-4 SIMD64s per CU sounds great (CDNA lacks gfx engines, so the CU can be physically wider than RDNA's). In practice, you still need to ensure you can fully fill wavefront queues by not being locally resource limited. Dynamic allocation of registers is a good start, but there's more to be done on that front.
Source: This article was originally published by Tom's Hardware.