- Thu Feb 19, 2015 11:39 pm
#385501
Basic Info:
I have a test scene that I've been rendering as a "Cooperative Job" to SL 18. There are 5 rendernodes assigned to the job.
Issue 1: Premature merge requests
The auto-preview / merge request appears to kick in early on (around SL 4, just like the docs mention). However, the first few attempts actually fail and log something like:
Code:
2015-02-19T20:01:14Z WARNING process_period(): No preview for cooperative job
2015-02-19T20:01:14Z NOTICE Starting merge (1, 0)
2015-02-19T20:01:14Z NOTICE enqueue_output(): End of output. Cmd has finished
2015-02-19T20:01:15Z NOTICE Merge (1, 0) finished. Output /{shared}\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:15Z WARNING Mergeinfo was for a preview!
2015-02-19T20:01:15Z WARNING mxi_output path Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:15Z WARNING Asking for preview without mxi at Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:21Z NOTICE Starting merge (1, 0)
2015-02-19T20:01:21Z NOTICE enqueue_output(): End of output. Cmd has finished
2015-02-19T20:01:22Z NOTICE Merge (1, 0) finished. Output /{shared}\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:22Z WARNING Mergeinfo was for a preview!
2015-02-19T20:01:22Z WARNING mxi_output path Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:22Z WARNING Asking for preview without mxi at Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:39Z NOTICE Starting merge (1, 0)
2015-02-19T20:01:39Z NOTICE enqueue_output(): End of output. Cmd has finished
2015-02-19T20:01:39Z NOTICE Merge (1, 0) finished. Output /{shared}\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:39Z WARNING Mergeinfo was for a preview!
2015-02-19T20:01:39Z WARNING mxi_output path Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:39Z WARNING Asking for preview without mxi at Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:48Z NOTICE Copying c:\users\mktgdev\appdata\local\temp\mktgCGI-RN13\SCR-50_Linen_Nectar13.mxi to Y:\Maxwell Shared\mktgCGI-RN13\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:49Z NOTICE Copying c:\users\mktgdev\appdata\local\temp\mktgCGI-RN13\SCR-50_Linen_Nectar13.png to Y:\Maxwell Shared\mktgCGI-RN13\SCR-50_Linen_Nectar13.png
2015-02-19T20:01:49Z WARNING process_period(): No preview for cooperative job
2015-02-19T20:01:49Z NOTICE Starting merge (1, 0)
2015-02-19T20:01:55Z NOTICE enqueue_output(): End of output. Cmd has finished
2015-02-19T20:01:56Z NOTICE Merge (1, 0) finished. Output /{shared}\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:56Z WARNING Mergeinfo was for a preview!
2015-02-19T20:01:56Z WARNING mxi_output path Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:57Z WARNING preview.sampling_level 11.5350871278
Based on the log of each rendernode and directly observing the share directory, I can see the merge request happen before any of the nodes have even sent their MXI files back to the share. It looks like when the first of the rendernodes gets to about local SL 10, each node copies its MXI file to the share. The next time the preview / merge request is issued, it successfully merges the MXI files and generates the preview.
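To line up the failed preview requests against the first copy to the share, a log like the one above could be summarized with a short script. This is only a rough sketch of my own: the function names are made up, and it assumes every line follows the `timestamp LEVEL message` layout seen in the excerpt.

```python
import re
from datetime import datetime

# Timestamp, level, message: the layout of the log lines quoted above.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) (\w+) (.*)$")


def parse(log_text):
    """Yield (timestamp, level, message) for every line that matches."""
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
            yield ts, m.group(2), m.group(3)


def summarize(log_text):
    """Count failed preview requests and find the first copy to the share."""
    failed = 0
    first_copy = None
    for ts, _level, msg in parse(log_text):
        if msg.startswith("Asking for preview without mxi"):
            failed += 1
        elif msg.startswith("Copying ") and first_copy is None:
            first_copy = ts
    return failed, first_copy


# Three lines taken from the excerpt above (raw string keeps the backslashes).
SAMPLE = r"""
2015-02-19T20:01:15Z WARNING Asking for preview without mxi at Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:22Z WARNING Asking for preview without mxi at Y:\Maxwell Shared\SCR-50_Linen_Nectar13.mxi
2015-02-19T20:01:48Z NOTICE Copying c:\users\mktgdev\appdata\local\temp\mktgCGI-RN13\SCR-50_Linen_Nectar13.mxi to Y:\Maxwell Shared\mktgCGI-RN13\SCR-50_Linen_Nectar13.mxi
"""

if __name__ == "__main__":
    failed, first_copy = summarize(SAMPLE)
    print(failed, "failed preview requests before first copy at", first_copy)
```

Run against the full excerpt, this kind of summary makes it easy to see that every failed request is logged before the first "Copying" line.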
It would be nice if the manager knew to hold off on requesting the preview until all of the nodes had saved a copy to the share. If there is a simple way to do that, it would cut out a little overhead. I didn't see a log entry from the manager telling the nodes to start the copy (unless it's part of the merge request). Is the copy process actually initiated by the rendernodes themselves? If so, that would explain why the manager requests the preview anyway.
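To illustrate the kind of check I have in mind, here is a minimal sketch. It is not how the manager works internally (I don't know that); it just assumes the share layout visible in my log, where each node writes to its own subfolder of the share. The function names and the node names in the usage note are hypothetical.

```python
from pathlib import Path


def nodes_ready(share_dir, node_names, mxi_name):
    """Split nodes into those with a non-empty MXI on the share and those without.

    Assumes the layout <share>/<node name>/<mxi name>, as seen in the log.
    """
    share = Path(share_dir)
    ready, missing = [], []
    for node in node_names:
        path = share / node / mxi_name
        if path.is_file() and path.stat().st_size > 0:
            ready.append(node)
        else:
            missing.append(node)
    return ready, missing


def should_request_merge(share_dir, node_names, mxi_name):
    """Only ask for a merge / preview once every node has published a file."""
    _ready, missing = nodes_ready(share_dir, node_names, mxi_name)
    return not missing
```

For example, `should_request_merge(r"Y:\Maxwell Shared", ["node-a", "node-b"], "SCR-50_Linen_Nectar13.mxi")` would stay `False` until both (made-up) node folders contain the file. A file-exists check alone can't tell whether a copy is still in progress, so it would only avoid the "no MXI yet" case.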
Issue 2: Random MXI Merge Crashes
The incremental "mximerge" was randomly crashing on me. When this happened, the affected rendernode would stall and stop talking to the manager. Even if I issued a "Stop" from the monitor, the rendernode host would just sit there waiting for me to tell it to close the mximerge process. Once that was closed, the rendernode would reopen communications with the manager and continue with its original render instructions. However, the maxwell.exe process it originally spawned ends up as something of a zombie process. It won't "Stop" with instructions issued through the monitor, so I have to kill the process manually. I haven't tested what happens if I let the rendernode continue rendering to its target SL. Do cooperative rendernodes know to stop if they individually reach the target SL? I know they receive the "Stop" request from the manager once the manager determines that the merged SL has met the target SL.
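One general way to keep a hung child process from blocking its parent is to run it with a timeout. The sketch below is only an illustration of that idea using Python's standard library; I have no idea how the rendernode actually launches mximerge, so the command here is a stand-in and `run_with_timeout` is a name I made up.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it had to be killed.

    subprocess.run() kills the child and waits for it when the timeout
    expires, so the caller is never left blocked on a hung process.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return result.returncode


if __name__ == "__main__":
    # Stand-in for a merge command that hangs: a child that sleeps too long.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, timeout_s=1))
```

With something like this, a stalled merge would be reported as a failure after the timeout instead of leaving the node waiting on it indefinitely.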
Oddly enough, I set my minimum verbosity of all components back to "Notice" and the random crash seems to have stopped. I was running the verbosity level at "Info" and even "Debug" when I experienced the crash. I started questioning whether or not all of that data and the many individual requests were clogging up the I/O and somewhere along the line a status data frame got lost or packets were being dropped. I know that the asynchronous nature of these tools is supposed to prevent that, but I couldn't help pondering this possibility.
The other possibility that crossed my mind is that the mximerge request was issued while one or more of the nodes were still copying their local MXI files to the share. If that happened, then (hypothetically) mximerge would be trying to merge partial, incomplete MXI files. When the nodes copy their MXI files to the share, do they write to a temporary file on the share and then rename it, or do they populate the final MXI file directly?
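For reference, the "write to a temp file, then rename" pattern I'm asking about looks roughly like this. It's a generic sketch, not a claim about what the rendernodes do; `publish_atomically` is a name I made up. The rename is atomic on a local POSIX filesystem, but on a Windows network share that depends on the file server, so treat it as an assumption.

```python
import os
import shutil
import tempfile


def publish_atomically(src, dst):
    """Copy src to dst so that readers never see a half-written dst.

    The data is first copied to a temporary file in the destination
    directory, then renamed over the final name in a single step.
    """
    dst_dir = os.path.dirname(os.path.abspath(dst))
    os.makedirs(dst_dir, exist_ok=True)
    # Create the temp file next to dst so the rename stays on one volume.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, suffix=".part")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)  # replaces dst if it already exists
    except BaseException:
        # Clean up the partial temp file if the copy or rename failed.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

If the copy were done this way, a merge that starts mid-copy would see either the previous complete file or no file at all, never a partial one.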
Issue 3: Autonomy
Some of the things that I really like about the merge process happening automatically are:
- An incremental, merged MXI file gets saved. The old system waited until the collective SL matched the target SL, and then the manager told every node to send its file and merged them together. If the job crashed in the middle, you would either lose all of your progress, or you would have to connect to each rendernode and manually copy / merge the local MXI files. With the new system, you at least don't have to "start from scratch".
- You get a preview without having to manually request it.
Some of the things that concern me are:
- It creates additional overhead by requiring a machine that's already rendering to pause for a bit and merge the MXI files. This hinders overall performance for cooperative jobs.
- The machine that does the merging is already running a maxwell.exe process. If that process is consuming a large amount of RAM, and each MXI file is large enough, the rendernode host may crash due to insufficient resources. I haven't seen this happen yet (or developed a test for it), so it is possible that you guys programmed a fail-safe in there for that.
Thanks!
Regards,
Zack Parrish
-
Maxwell - 4.2.0.3
Maxwell 4 | 3ds Max - 4.2.4
336 capable Maxwell threads!
-
Workstation:
Dual E5-2680v3, 64GB, Quadro K5200
48 threads (HT) @ 139.2GHz
-
Render Farm:
288 threads (HT) @ 835.2GHz