Everything related to Maxwell network rendering systems.
By zparrish
#385501
Basic Info:
I have a test scene that I've been rendering as a "Cooperative Job" to SL 18. There are 5 rendernodes assigned to the job.

Issue 1: Premature merge requests
The auto-preview / merge request appears to kick in early on (around SL 4, just as the docs mention). However, the first few attempts actually fail and report something like:
Code:
2015-02-19T20:01:14Z WARNING process_period(): No preview for cooperative job
2015-02-19T20:01:14Z  NOTICE  Starting merge (1, 0)
2015-02-19T20:01:14Z  NOTICE  enqueue_output(): End of output. Cmd has finished
2015-02-19T20:01:15Z  NOTICE  Merge (1, 0) finished. Output /{shared}\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:15Z WARNING Mergeinfo was for a preview!
2015-02-19T20:01:15Z WARNING mxi_output path Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:15Z WARNING Asking for preview without mxi at Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:21Z  NOTICE  Starting merge (1, 0)
2015-02-19T20:01:21Z  NOTICE  enqueue_output(): End of output. Cmd has finished
2015-02-19T20:01:22Z  NOTICE  Merge (1, 0) finished. Output /{shared}\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:22Z WARNING Mergeinfo was for a preview!
2015-02-19T20:01:22Z WARNING mxi_output path Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:22Z WARNING Asking for preview without mxi at Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:39Z  NOTICE  Starting merge (1, 0)
2015-02-19T20:01:39Z  NOTICE  enqueue_output(): End of output. Cmd has finished
2015-02-19T20:01:39Z  NOTICE  Merge (1, 0) finished. Output /{shared}\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:39Z WARNING Mergeinfo was for a preview!
2015-02-19T20:01:39Z WARNING mxi_output path Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:39Z WARNING Asking for preview without mxi at Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:48Z  NOTICE  Copying c:\users\mktgdev\appdata\local\temp\mktgCGI-RN13\SCR-50_Linen_Nectar13.mxi to Y:\Maxwell Shared\mktgCGI-RN13\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:49Z  NOTICE  Copying c:\users\mktgdev\appdata\local\temp\mktgCGI-RN13\SCR-50_Linen_Nectar13.png to Y:\Maxwell Shared\mktgCGI-RN13\SCR-50_Linen_Nectar13.png
2015-02-19T20:01:49Z WARNING process_period(): No preview for cooperative job
2015-02-19T20:01:49Z  NOTICE  Starting merge (1, 0)
2015-02-19T20:01:55Z  NOTICE  enqueue_output(): End of output. Cmd has finished
2015-02-19T20:01:56Z  NOTICE  Merge (1, 0) finished. Output /{shared}\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:56Z WARNING Mergeinfo was for a preview!
2015-02-19T20:01:56Z WARNING mxi_output path Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:57Z WARNING preview.sampling_level 11.5350871278
Based on the log of each rendernode and by directly observing the share directory, I can see the merge request happen before any of the nodes have even sent their MXI files back to the share. It looks like once the first of the rendernodes reaches about local SL 10, each node copies its MXI file to the share. The next time the preview / merge request is issued, it successfully merges the MXI files and generates the preview.

It would be nice if the manager knew to wait to request the preview until all of the nodes had saved a copy to the share. If there is a simple way to do that, it would create a little less overhead. I didn't see a log entry from the manager telling the nodes to start the copy (unless it's part of the merge request). Is the copy process actually initiated by the rendernodes themselves? If so, that would explain why the manager requests the preview anyway.
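If the copies really are initiated by the nodes, one workaround on the manager side would be to poll the share and only launch a merge once every assigned node's MXI is present and no longer growing. A minimal sketch in Python, assuming hypothetical paths and node names (this is not Maxwell's actual API, just the idea):

Code:
import os
import time

def all_mxis_present(share_dir, node_names, scene_name):
    """Return True once every node's MXI copy exists and has a stable size.

    share_dir, node_names and scene_name are illustrative; the real manager
    would already know the per-node output paths.
    """
    for node in node_names:
        path = os.path.join(share_dir, node, scene_name + ".mxi")
        if not os.path.isfile(path):
            return False
        # Guard against a copy that is still in progress: require the size
        # to be unchanged across a short interval.
        size_before = os.path.getsize(path)
        time.sleep(1.0)
        if os.path.getsize(path) != size_before:
            return False
    return True

def wait_then_merge(share_dir, node_names, scene_name, merge_fn, poll_s=10):
    while not all_mxis_present(share_dir, node_names, scene_name):
        time.sleep(poll_s)
    merge_fn()  # only now is it safe to launch mximerge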

Issue 2: Random MXI Merge Crashes
The incremental "mximerge" was randomly crashing on me. When this happened, the respective rendernode would kind of pause and wouldn't actually talk to the manager. Even if I issued a "Stop" from the monitor, the rendernode host would just sit there waiting for me to tell it to close the mximerge process. Once closed, the render node would reopen communications with the manager and continue with it's original render instructions. However, the maxwell.exe process it originally spawned is somewhat of a zombie process. It won't "Stop" with instructions issued through the monitor. I have to manually kill the process. I haven't tested what happens if I let the rendernode continue rendering to it's target SL. Do cooperative rendernodes know to stop if they individually reach the target SL? I know they receive the "Stop" request from the manager once the manager determines that the merged SL has met the target SL.

Oddly enough, once I set the minimum verbosity of all components back to "Notice", the random crash seems to have stopped. I was running the verbosity at "Info" and even "Debug" when I experienced the crash. I started questioning whether all of that data and the many individual requests were clogging up the I/O, and whether somewhere along the line a status data frame got lost or packets were dropped. I know the asynchronous nature of these tools is supposed to prevent that, but I couldn't help pondering the possibility.

The other possibility that crossed my mind is that the mximerge request was issued while one or more of the nodes were still copying their local MXI files to the share. If that happened, then (hypothetically) mximerge would be trying to merge partial, incomplete MXI files. When the nodes copy their MXI files to the share, do they write to a temporary file on the share and then rename it, or do they populate the MXI file container directly?
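For what it's worth, the usual way to avoid merging a half-copied file is exactly that temp-then-rename pattern: copy to a temporary name on the share and rename once the copy completes, since a rename on the same volume is effectively atomic. A generic sketch of the pattern (not Maxwell's actual code):

Code:
import os
import shutil

def copy_atomically(local_mxi, share_dir):
    """Copy to a temp name first, then rename, so a merge that scans the
    share never sees a partially written .mxi."""
    final_path = os.path.join(share_dir, os.path.basename(local_mxi))
    tmp_path = final_path + ".part"
    shutil.copy2(local_mxi, tmp_path)
    os.replace(tmp_path, final_path)  # atomic on the same filesystem
    return final_path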

Issue 3: Autonomy
Some of the things I really like about the merge process happening automatically are:
  • An incremental, merged MXI file gets saved. The old system waited until the collective SL matched the target SL, and then the manager told every node to send its file and merged them together. If the job crashed in the middle, you would either lose all of your progress, or you would have to connect to each rendernode and manually copy / merge the local MXI files. This new system makes it possible to at least not have to "start from scratch".
  • You get a preview without having to manually request it.
Here are some things that I think may cause issues with it being fully automated:
  • It creates additional overhead by requiring a machine that's already rendering to wait for a bit and merge the MXI files. This hinders overall performance for cooperative jobs.
  • The machine that does the merging is already running a maxwell.exe process. If that process is consuming a large amount of RAM, and each MXI file is large enough, the rendernode host may crash due to insufficient resources. I haven't seen this happen yet (or developed a test for it), so it is possible that you guys programmed a fail safe in there for that.
Would it be possible to add some controls for the merge autonomy? Things like being able to disable it, or even pick a specific host to handle the merges, would be really nice. Maybe even have a checkbox that forces the manager host to handle the merge so it never disrupts a rendernode host?
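Purely to illustrate what such controls could look like: none of these option names exist in Maxwell today, they are hypothetical, but something along these lines would cover the cases above.

Code:
# Hypothetical cooperative-job settings; the option names are invented
# here only to illustrate the controls being requested.
merge_settings = {
    "auto_merge": True,        # allow disabling periodic merges entirely
    "merge_host": "manager",   # "manager", "any", or a specific node name
    "merge_interval_sl": 1.0,  # how often (in SL) to attempt a merge
}

def pick_merge_host(settings, nodes, manager_host):
    """Decide which machine runs mximerge for this merge cycle."""
    if not settings["auto_merge"]:
        return None
    if settings["merge_host"] == "manager":
        return manager_host
    if settings["merge_host"] == "any":
        return nodes[0] if nodes else manager_host
    return settings["merge_host"]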


Thanks!
By pablo
#385751
Thank you for your thorough report. As always, it is very useful.

As you accurately reported, sometimes the merge process starts before the first MXI copy is in place. This has been fixed in 0.7.12. The long-running merge process is also something to tackle ... it happens more on Windows, I guess because of the OS scheduler. My guess is that the render process consumes all CPU resources -- as it should, by the way -- and there is little left for the merge when it runs alongside it. On the last merge (the one after maxwell finishes) it all goes well, but the intermediate ones are sometimes so slow that the next merge request arrives before the first one finishes.

To fix this I plan to stop the maxwell process on the machine that does the merge, restoring it afterwards.
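From the node side, one way to do that is to suspend the render process for the duration of the merge and resume it afterwards, for example with psutil. This is just a sketch of the idea, not Maxwell's internal mechanism:

Code:
import subprocess
import psutil

def merge_with_render_paused(maxwell_pid, merge_cmd):
    """Suspend the local maxwell.exe while mximerge runs, then resume it.

    maxwell_pid and merge_cmd are illustrative; the node agent would already
    track its own render process and merge command line.
    """
    render = psutil.Process(maxwell_pid)
    render.suspend()                    # free the CPU for the merge
    try:
        subprocess.run(merge_cmd, check=True)
    finally:
        render.resume()                 # always restore the render afterwards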

The crashes you report were not actually crashes, but 'fake' disconnections from the jobman, again accurately diagnosed. They were related to clogged I/O from the rendernode -- not from the logs, but from the preview image itself. The preview image was sent (before 0.7.12) over the same channel as the rendernode's status, and with big images this took long enough for the job manager to consider the rendernode disconnected.

The design idea behind this was to avoid writing files for the previews, but in the end it has proven to do more harm than good.

Version 0.7.12 fixes this by storing thumbnail and preview images as files that are served over HTTP instead of passing through the job manager.
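That matches the standard pattern of writing the preview to disk and letting a small HTTP server hand it out on a separate channel. In Python that can be as small as the following generic sketch (the directory and port are illustrative, this is not the 0.7.12 implementation):

Code:
import functools
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler

# Serve everything under the preview directory on port 8080 so the monitor
# can fetch thumbnails / previews without going through the job manager's
# status channel.
handler = functools.partial(SimpleHTTPRequestHandler,
                            directory="/var/maxwell/previews")
ThreadingHTTPServer(("", 8080), handler).serve_forever()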

About the merge autonomy, there will be options in the future to prevent these merges altogether, or to choose when they will be done. Actually, the ability to use all contributions from rendernodes regardless of their disconnections does not depend on any periodic merge, just on copying the mxi files periodically to the share -- without merging them -- and merging them all at the end. This will require a small change in how the job manager views the rendernodes, making it consider a rendernode the same for publishing, but 'different on each reconnection' for rendering. That, and using the hostname to identify the nodes (a user reported that his rendernode stopped contributing to the render after a DHCP address change), are changes that are also in the queue.
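For the hostname change, the idea is simply that the node reports an identity that survives an address change, something along the lines of this illustrative snippet (not the actual jobman code):

Code:
import socket

def node_identity():
    """Identify a rendernode by hostname rather than by its (possibly
    DHCP-reassigned) IP address, so a reconnect after an address change
    is still recognised as the same contributor."""
    return socket.gethostname().lower()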

As to the merge process disturbing the render, I prefer to consider it the other way round and, as mentioned, stop the render while merging. But being able to choose which node will do the merge is a very easy option to add, so I'll write it down as well.

Thank you again,
Pablo