By dmeyer
#349883
From the results above it looks like this may have to do with some sort of performance hit on the 83xx chips, which is a different core than the 24xx series.

I've got some 61xx Magny-Cours machines I can test this on, but they are not up at this time. They'll be back on in a couple of weeks.
#349896
First of all - thank you very much for your effort.

In the "Football" Test you see the influence of the "R2" option. The differences in time are:

System A: 3m51s - faster
System B: 2m19s - faster
System C: 6m09s - faster

In the "Bobeck" Test the differences are:

System A: 2m27s - slower
System B: 2m45s - faster
System C: 17m57s - faster

My personal conclusion is that the differences on the quad-socket (8354) system are quite big in both cases (not as big as on my system, but noticeable). The AMD two-socket system (2427) doesn't suffer, and its differences are comparable with the Intel system. Something must cause this "sensitivity" on systems with more than two sockets. It would now be interesting to see whether this can be reproduced on an Intel quad-socket system...

Interestingly, the single-socket system with the 2600K showed the least difference (regarding the "R2" tests; see the 4th post of this thread).
By nachob
#349906
These new AMD Bulldozer modules have two cores that share some resources internally: the early pipeline stages (e.g. instruction fetch, decode), the FPUs, and the L2 cache. In the specific R2 material, as Juan explained, there are some pow functions that would make one core wait until the other has finished, which would explain why the performance is almost half of what is expected.

As Juan also said before, it would be very useful to make a comparison using only 1 thread on the machines; that will probably show the expected performance when there is no sharing of resources inside the modules.

nachob
#349967
...I did some further testing of CPU core scaling in Maxwell with the two scenes ("Bobeck" and "Football") that cause so much difference in the render times on my AMD system. I shortened the whole thing by limiting the SL to 12.00 for the "Football" scene and 8.00 for the "Bobeck" scene. The differences regarding the materials are the same as in the tests before:


Image

The scene with the "R2" option disabled scales with the use of more cores. The scene with the "R2" option enabled scales up to 12 CPU cores (03m56s). The result with 16 CPU cores is worse (04m19s), and it gets even slower with more cores than that... :shock:

Image

"Football" (1024x768 - SL15) using 24 CPU Cores:
"R2" enabled : 14m31s / 787.9 - slower than 12 Cores!!! :roll:
"R2" disabled: 6m15s / 1825.17

"Football" (1024x768 - SL15) using 12 CPU Cores:
"R2" enabled : 13m11s / 868.19
"R2" disabled: 8m14s / 1386.63




Image

Using only one CPU core, the result with the background material S30 / R30 is slower than the result with the background material S95 / R5. But you can also see that CPU scaling with S30 / R30 is much better on this system than with S95 / R5. With 8 CPU cores the times are nearly the same (07m12s vs 07m15s). With 12 cores or more, the less shiny material (S30 / R30) results in a much faster render...

Image

"Bobeck" (1280x960 - SL16) using 24 CPU Cores:
Background Material - "Shininess"=95% / "Roughness"=5%: 2h05m25s / 214.18 - "as fast as" 12 Cores
Background Material - "Shininess"=30% / "Roughness"=30%: 1h08m51s / 390.03 - scales quite well with the additional 12 CPU Cores

"Bobeck" (1280x960 - SL16) using 12 CPU Cores:
Background Material - "Shininess"=95% / "Roughness"=5%: 2h08m06s / 209.66
Background Material - "Shininess"=30% / "Roughness"=30%: 2h01m04s / 221.84


As you can see, the render times in Maxwell on this quad-socket system get much worse under certain circumstances (material properties, "shiny" materials?) when using more than 12 cores (in this case more than two CPU sockets, at six cores per socket). If you avoid certain material properties, CPU scaling works quite well...

Again, the question is whether there is something like a "workaround" for that... :wink:
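
For anyone who wants to check the scaling numerically, here is a minimal sketch that turns the 12-core and 24-core timings quoted above into a speedup and a parallel efficiency. The helper names (`to_seconds`, `speedup`, `report`) and the `TIMINGS` table are made up for this illustration and are not part of Maxwell; the times are copied from the post.

```python
import re

def to_seconds(stamp: str) -> int:
    """Convert a render time such as '2h05m25s' or '14m31s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", stamp)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised time: {stamp!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

# (scene, variant) -> (time on 12 cores, time on 24 cores), copied from the post.
TIMINGS = {
    ("Football", "R2 enabled"): ("13m11s", "14m31s"),
    ("Football", "R2 disabled"): ("8m14s", "6m15s"),
    ("Bobeck", "S95 / R5"): ("2h08m06s", "2h05m25s"),
    ("Bobeck", "S30 / R30"): ("2h01m04s", "1h08m51s"),
}

def speedup(t12: str, t24: str) -> float:
    """Speedup when going from 12 to 24 cores (2.0 would be ideal)."""
    return to_seconds(t12) / to_seconds(t24)

def report():
    """Return (scene, variant, speedup, efficiency) for every entry."""
    rows = []
    for (scene, variant), (t12, t24) in TIMINGS.items():
        s = speedup(t12, t24)
        rows.append((scene, variant, s, s / 2.0))  # efficiency relative to ideal 2x
    return rows

if __name__ == "__main__":
    for scene, variant, s, eff in report():
        print(f"{scene:9s} {variant:12s} speedup {s:.2f}x  efficiency {eff:.0%}")
```

A speedup below 1.0 means the render got slower with the extra 12 cores. With the figures above, that is the case for "Football" with R2 enabled (about 0.91x), while "Bobeck" with S95 / R5 barely moves (about 1.02x) and S30 / R30 reaches about 1.76x.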


As you can see, Intel doesn't seem to have this kind of problem with the dual-socket systems, and the CPU scales perfectly with the number of cores:

Image

Image
#349970
As you can see, Intel doesn't seem to have this kind of problem with the dual-socket systems, and the CPU scales perfectly with the number of cores:
...but the impact of the R2 setting is still huge! This is nearly 25%!!! :shock: I wasn't aware of that... thanks for the detailed testing!
#349984
zdeno, thanks again :) ...

As you can see, I'm quite an average user of Maxwell. My skills aren't very good and I still have a lot to learn about this program. I have followed the development of Maxwell since 2005 and started working with it in 2008.
The main purpose of acquiring the Opteron system (starting with 4x 8360 SE, followed by 4x 8393 SE) was to use it for Maxwell. In 2009 the architecture of the AMD Opteron system was far better than what Intel could offer (in that setting). One of the main advantages is the very fast memory system. Compared with the (old) Intel server architecture (e.g. systems with more than two CPU sockets), AMD systems scaled not only with the CPUs but with CPU and memory together. That means the performance of the whole memory system also improves with each CPU (plus associated memory) you add to the same system. You can see this in several applications. This definitely applies to the K9 and K10 architectures with Socket F (K10 = Barcelona - 4C - 83xx - 2MB / Shanghai - 4C - 83xx - 6MB cache / Istanbul - 6C - 84xx). I don't know how this behaves with the successors (Socket G34).

Maxwell benefits a lot from a system with a "fast memory connection". You can see this if you run Benchwell on a Xeon DP system with the EVGA SR-2, where you can adjust the "Uncore frequency" of the CPUs:

Image

Benchwell:
4.20 GHz / 1.80 GHz Uncore - 3m03s / 3491.75
3.40 GHz / 4.00 GHz Uncore - 3m01s / 3530.01 - ! faster ! :wink:

4.20 GHz / 4.00 GHz Uncore - 2m29s / 4280.58 - 34s faster than with 9x Uncore!

Even a Xeon DP system with 3.40 GHz and 4.00 GHz Uncore is faster than the same system with 4.20 GHz and an Uncore frequency of 1.80 GHz. In this case I would say Maxwell gains more from the Uncore frequency than from the actual CPU frequency.
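
The arithmetic behind this comparison can be sketched as follows. This is only an illustration: the `RUNS` table and the `gain` helper are hypothetical names, and the numbers are the three Benchwell results listed above (render time converted to seconds, plus the benchmark score).

```python
# (core GHz, uncore GHz) -> (render time in seconds, Benchwell score), from the post.
RUNS = {
    (4.20, 1.80): (3 * 60 + 3, 3491.75),
    (3.40, 4.00): (3 * 60 + 1, 3530.01),
    (4.20, 4.00): (2 * 60 + 29, 4280.58),
}

def gain(slow_key, fast_key):
    """Return (time speedup, score ratio) going from one configuration to another."""
    t_slow, score_slow = RUNS[slow_key]
    t_fast, score_fast = RUNS[fast_key]
    return t_slow / t_fast, score_fast / score_slow

# Uncore raised from 1.80 to 4.00 GHz while the core stays at 4.20 GHz.
uncore_gain = gain((4.20, 1.80), (4.20, 4.00))
# Core raised from 3.40 to 4.20 GHz while the uncore stays at 4.00 GHz.
core_gain = gain((3.40, 4.00), (4.20, 4.00))

if __name__ == "__main__":
    print(f"uncore 1.80 -> 4.00 GHz: {uncore_gain[0]:.2f}x time, {uncore_gain[1]:.2f}x score")
    print(f"core   3.40 -> 4.20 GHz: {core_gain[0]:.2f}x time, {core_gain[1]:.2f}x score")
```

On these three runs, each step gives a gain of roughly 1.2x, so the Uncore frequency has an effect of the same order as the core clock. Three data points on one machine are of course not enough to generalise.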

So my personal goal in this thread is to find out why the Opteron system loses so much of its potential performance (which it does reach in other applications, as I showed you before).

If somebody has any idea, I would be very happy!
Last edited by arch3990 on Sun Dec 04, 2011 10:31 am, edited 1 time in total.
#350019
zdeno wrote:...but what about these CPU temperatures?
I did some further testing regarding the subject of the CPU temperatures:

Bobeck - Background S95 / R5 after 54% of Rendering - (1h10m54s)

Image



Bobeck - Background S30 / R30 after 53% of Rendering - (31m59s)

Image

I think you can see that the temperatures should be no problem. The difference in temperatures between the two scenes ("S95 / R5" seems to run "cooler") brings me back to these points:

juan wrote: Very interesting tests, thanks. When R2 is enabled there are no additional memory allocations, just a few more calls to the pow(float, float) function. It is surprising that performance changes so dramatically in that test. More tests would be needed to confirm that AMD's pow() is remarkably slower than Intel's, but so far it looks that way. Besides this, tests can be done using just one thread, to check whether it has something to do with sharing issues in the processor unit.
and
nachob wrote: These new AMD Bulldozer modules have two cores that share some resources internally: the early pipeline stages (e.g. instruction fetch, decode), the FPUs, and the L2 cache. In the specific R2 material, as Juan explained, there are some pow functions that would make one core wait until the other has finished, which would explain why the performance is almost half of what is expected.

As Juan also said before, it would be very useful to make a comparison using only 1 thread on the machines; that will probably show the expected performance when there is no sharing of resources inside the modules.

nachob
...but here we do not have any "R2" in the materials. The difference is just S30/R30 vs. S95/R5 in the background material, and looking at the scene with almost twice the render time (S95/R5) but lower CPU temperatures, you might assume the same thing as with the "R2" materials: Maxwell is not using, or is not able to use, the full potential of the CPUs in this case...

Conclusion of the temperature tests:

I think the CPUs do not get too hot. The CPUs get a little warmer (CPU0: +1° / CPU1: +3° / CPU2: +5° / CPU3: +2°) while rendering the scene (S30/R30) that is much faster (by 56min and 34sec), so it could mean that the CPUs are "more in use" rendering this scene. On the other hand, the scene with S95/R5 runs "cooler" on the CPUs, or perhaps doesn't / "is not able to" fully use the whole "rendering" potential of the CPUs.

Some questions - and I hope I get answers :wink: :

- Do I have to avoid some materials, or certain features / options of materials, to get adequate render times on this system?
- Is an AMD System (with more than two sockets?) not a good choice as a render machine for Maxwell?
- ( Do I have to sell this system and only buy Intel? :lol: )
By zdeno
#350030
arch3990
thanks again for the answer; it really looks like temperature is not an issue

I am waiting for answers from the NL team too. This is a very interesting case.

But you have to be very patient.
When the topic is a bug, the time it takes to get any sign of a response (if you get one at all) is 20 times longer than when you point out some good feature ;)
#350262
zdeno
we are "in the same boat" regarding this topic, but I think we are not the only ones. There may be others who use or are willing to use an AMD MP platform.
The new lineup of AMD server CPUs has arrived (62xx), and they are quite powerful and very "cheap" for the performance:

Image

source:
http://www.tecchannel.de/server/prozess ... bulldozer/

e.g.:
4x AMD Opteron 6274, 16x 2.20GHz, Sockel-G34, boxed (OS6274WKTGGGUWOF) - (4x 600 Euro) + Tyan S8812, AMD SR5690 (quad Sockel-G34, quad PC3-10667R reg ECC DDR3) (S8812WGM3NR) - (700 Euro)
3100 Euro is not so much for 64 Rendering Cores :)