Mihnea Balta wrote: Reusing execution units while a thread is waiting for data to arrive from memory pays off. HT didn't do much back when it was first introduced in Pentium 4 chips because the memory was fast enough, but latency has improved very little since then, while execution units have become significantly faster. That's why HT works today.
The same principle is used in lots of current architectures. GPUs do it too, but in a slightly different way (they run many threads in lockstep, so that hundreds of cycles pass before a thread needs to advance to the next instruction after a fetch, giving the memory time to transfer the data). The Larrabee 1 design has 4-way hyperthreading in its cores to hide L1 cache misses, and you're supposed to also run up to 10 soft fibers on each HT context to make good use of it. The Xbox 360 has 3 cores with HT. The PPC chip in the Cell is hyperthreaded. HT "cores" are not marketing cores.
Hi Mihnea,
I know, but I didn't expect Maxwell to be memory-transfer limited. DDR has, by the way, solved a lot, and most apps don't come anywhere near the theoretical DDR3 bandwidth. I have always made a point of going for minimal latency rather than raw transfer speed, but the net effect on rendering engines has not been spectacular.
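To make that latency-versus-bandwidth point concrete, here is a rough micro-benchmark sketch (all sizes and names are my own choices, nothing engine-specific): a sequential pass that the hardware prefetcher can stream at close to peak DDR3 bandwidth, next to a dependent pointer chase where every load waits out nearly the full DRAM latency.

```cpp
// Sketch only: contrast a bandwidth-bound streaming pass with a latency-bound
// pointer chase. Buffer sizes and iteration counts are arbitrary illustrations.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = std::size_t(1) << 24;   // 16M elements, far bigger than any cache

    // Bandwidth-bound: a sequential sum; the prefetcher streams memory near peak rate.
    std::vector<std::uint64_t> data(N, 1);
    auto t0 = std::chrono::steady_clock::now();
    std::uint64_t sum = std::accumulate(data.begin(), data.end(), std::uint64_t{0});
    auto t1 = std::chrono::steady_clock::now();

    // Latency-bound: follow a single-cycle random permutation (Sattolo's algorithm),
    // so every load depends on the previous one and pays close to the full DRAM latency.
    std::vector<std::uint32_t> next(N);
    std::iota(next.begin(), next.end(), 0u);
    std::mt19937 rng(42);
    for (std::size_t i = N - 1; i > 0; --i)
        std::swap(next[i], next[rng() % i]);      // j in [0, i-1] keeps it one big cycle
    std::uint32_t idx = 0;
    auto t2 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i) idx = next[idx];
    auto t3 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    std::printf("stream: %.1f ms   chase: %.1f ms   (checksum %llu %u)\n",
                ms(t0, t1), ms(t2, t3), (unsigned long long)sum, idx);
}
```

The interesting thing is the ratio between the two times, not the absolute values: the first loop is what the DDR3 bandwidth figure describes, the second has only one outstanding load at a time, so latency is all that matters.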
I call HT 'cores' marketing cores because they are now deliberately used to cloud people's perception. I know why they are there and why Intel introduced them, and that made good sense; I think I explained this in an earlier post. For well-optimised, compute-intensive tasks the benefit of modern HT is nowhere near that of real cores. You understand that, I understand that, but lots of people (as you can see earlier in this thread) confuse a chip with x physical cores, each a full computing unit, with a chip whose advertised core count includes HT 'cores'.
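For anyone following along who wants to see this on their own machine, here is a minimal, Linux-only sketch of mine: pin a well-optimised compute-bound kernel to two hyperthread siblings and then to two separate physical cores and compare the wall time. The logical CPU ids 0/1 (siblings) and 0/2 (separate cores) are assumptions about the topology; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list before trusting them.

```cpp
// Linux-only sketch: a compute-bound kernel on two HT siblings vs. two physical cores.
// CPU ids below are assumptions; verify the sibling pairs for your machine first.
// Build with: g++ -O2 -pthread ht_scaling.cpp
#include <pthread.h>
#include <sched.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Purely arithmetic work with a serial dependency and no memory traffic: there is
// little idle issue bandwidth for a sibling to borrow, so HT should help far less
// here than a second physical core does.
static void burn(int cpu, long iters) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    volatile double sink = 0.0;
    double x = 1.0;
    for (long i = 0; i < iters; ++i) x = x * 1.0000001 + 1e-9;
    sink = x;                       // keep the loop from being optimised away
    (void)sink;
}

static double timed_pair(int cpuA, int cpuB, long iters) {
    auto start = std::chrono::steady_clock::now();
    std::thread a(burn, cpuA, iters);
    std::thread b(burn, cpuB, iters);
    a.join();
    b.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    const long iters = 400000000L;
    std::printf("assumed HT siblings (0,1): %.2f s\n", timed_pair(0, 1, iters));
    std::printf("assumed two cores   (0,2): %.2f s\n", timed_pair(0, 2, iters));
}
```

If the kernel were memory-bound instead, the gap between the two runs would narrow, which is exactly the latency-hiding effect Mihnea describes.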
No renderer I have been involved with gets anywhere near a 90-95% speedup, so that is still a mystery to me. L1 misses can't cause this.
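For what it's worth, here is a back-of-envelope model (my own simplification, not a measurement) of why those numbers look odd to me: if a single thread keeps a core's issue slots busy a fraction u of the time, the best a second hyperthread can do is fill the idle 1 - u, so

    ideal HT speedup ≈ min(2, 1/u)

A 1.90-1.95x speedup would therefore need u around 0.51-0.53, i.e. the core sitting idle about half the time on one thread. For a well-optimised, mostly cache-resident inner loop that seems implausible, which is why L1 misses alone can't explain such figures.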