Everything related to Maxwell network rendering systems.
#382328
Just before the end of 5-hour renders, one of the render nodes fairly often crashes: never in the first hour, always just a few minutes before the render is finished :(

From the verbose log:

[17/August/2014 00:55:51] Message from render process: time_update 17363
[17/August/2014 00:55:51] Message from render process: new_sampling_level_reached 16.157475
[17/August/2014 00:55:51] [17/August/2014 00:55:51] [INFO]: Message to render node: time_update 17363
[17/August/2014 00:55:51] [17/August/2014 00:55:51] [INFO]: Message to render node: new_sampling_level_reached 16.157475
[17/August/2014 00:56:00] The remote host closed the connection . Code: 1
[17/August/2014 00:56:00] ERROR: Error in rendering process. The process crashed some time after starting successfully.
[17/August/2014 00:56:00] ERROR: Render process crashed!

[17/August/2014 00:56:00] Connecting to render process: Binding to port: 45463
[17/August/2014 00:56:00] TCP message from manager received.
[17/August/2014 00:56:00] Message from manager: cpuid
[17/August/2014 00:56:00] 7501
[17/August/2014 00:56:00] TCP message from manager received.
[17/August/2014 00:56:00] New job order received

The crashed node restarts automatically, but at SL 0 rather than at the last SL reached (the last MXI written to disk had a higher SL), so the restarted render ends up at a lower SL than the render had already reached just before the crash.

Questions:

1. How does one analyse the cause of the crash? There is nothing in the node's Windows Event Viewer, and the node is running cool and fine.
2. Is there some way to resume the render at a higher SL instead of from SL 0, using the MXI found in the monitor's temp folder?
3. Is there any other file/trick to keep the crashed node from automatically picking up at SL 0, and instead have the entire job stopped so one can do a regular resume, which would at least yield the highest SL reached just before the crash?

Thank you very much!

#382330
And another crash... with the render node automatically restarting on the job at SL 0, while the other nodes just continue rendering the job.

From the verbose log:

[17/August/2014 14:18:12] Message from render process: time_update 12076
[17/August/2014 14:18:12] Message from render process: new_sampling_level_reached 15.093441
[17/August/2014 14:18:12] [17/August/2014 14:18:12] [INFO]: Message to render node: time_update 12076
[17/August/2014 14:18:12] [17/August/2014 14:18:12] [INFO]: Message to render node: new_sampling_level_reached 15.093441
[17/August/2014 14:18:16] The remote host closed the connection . Code: 1
[17/August/2014 14:18:16] ERROR: Error in rendering process. The process crashed some time after starting successfully.
[17/August/2014 14:18:16] ERROR: Render process crashed!

[17/August/2014 14:18:16] Connecting to render process: Binding to port: 45463
[17/August/2014 14:18:16] TCP message from manager received.
[17/August/2014 14:18:16] Message from manager: cpuid
[17/August/2014 14:18:16] 15821
[17/August/2014 14:18:16] TCP message from manager received.
[17/August/2014 14:18:16] New job order received

And, again, the MXI/image was written shortly before the crash:

[17/August/2014 14:04:33] Message from render process: start_writing_mxi
[17/August/2014 14:04:33] [INFO]: Start writing MXI file...
[17/August/2014 14:04:33] [17/August/2014 14:04:33] [INFO]: Message to render node: start_writing_mxi
[17/August/2014 14:04:39] Message from render process: end_writing_mxi
[17/August/2014 14:04:39] [INFO]: End writing MXI file...
[17/August/2014 14:04:39] [17/August/2014 14:04:39] [INFO]: Message to render node: end_writing_mxi
[17/August/2014 14:04:39] [17/August/2014 14:04:39] [INFO]: MXI successfully renamed.

[17/August/2014 14:04:41] Message from render process: time_update 11264
[17/August/2014 14:04:41] Message from render process: new_sampling_level_reached 14.921569
[17/August/2014 14:04:41] [17/August/2014 14:04:41] [INFO]: Message to render node: time_update 11264
[17/August/2014 14:04:41] [17/August/2014 14:04:41] [INFO]: Message to render node: new_sampling_level_reached 14.921569
[17/August/2014 14:04:41] Message from render process: start_writing_image
[17/August/2014 14:04:41] [INFO]: Start writing image file...
[17/August/2014 14:04:41] [17/August/2014 14:04:41] [INFO]: Message to render node: start_writing_image
[17/August/2014 14:04:46] Message from render process: end_writing_image
[17/August/2014 14:04:46] Message from render process: time_update 11276
[17/August/2014 14:04:46] Message from render process: new_sampling_level_reached 14.92157
[17/August/2014 14:04:46] Message from render process: render_info rendering...8 percent
[17/August/2014 14:04:46] [INFO]: End writing image file...
[17/August/2014 14:04:46] [17/August/2014 14:04:46] [INFO]: Message to render node: end_writing_image
[17/August/2014 14:04:46] [17/August/2014 14:04:46] [INFO]: Benchmark of 288.217. Time: 3h07m56s. SL of 14.92
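
For anyone wanting to pull the same figures out of a long verbose log, here is a minimal sketch in Python. It is not part of Maxwell: the function name `summarize_crashes` and the returned field names are made up for illustration, and it assumes the log lines look exactly like the excerpts in this thread (a `[dd/Month/yyyy hh:mm:ss]` prefix with English month names). It only reports the last SL and the last completed MXI write seen before each crash; it does not diagnose the cause.

```python
import re
from datetime import datetime

# Leading "[17/August/2014 00:55:51]" stamp followed by the rest of the line.
LINE_RE = re.compile(r"^\[(\d{1,2}/[A-Za-z]+/\d{4} \d{2}:\d{2}:\d{2})\]\s*(.*)$")
# Only messages coming from the render process are counted, so the echoed
# "Message to render node" lines are ignored.
SL_RE = re.compile(r"^Message from render process: new_sampling_level_reached ([\d.]+)")


def summarize_crashes(log_text):
    """Return one dict per 'Render process crashed!' line (hypothetical helper)."""
    last_sl = None    # last sampling level reported before the crash
    last_mxi = None   # time of the last completed MXI write before the crash
    crashes = []
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # blank lines or anything without a timestamp
        # %B expects English month names in the default (C) locale.
        stamp = datetime.strptime(m.group(1), "%d/%B/%Y %H:%M:%S")
        body = m.group(2)
        sl = SL_RE.match(body)
        if sl:
            last_sl = float(sl.group(1))
        elif body.startswith("Message from render process: end_writing_mxi"):
            last_mxi = stamp
        elif "Render process crashed!" in body:
            crashes.append(
                {"crash_time": stamp, "last_sl": last_sl, "last_mxi_written": last_mxi}
            )
            # The node restarts from scratch, so tracking starts over too.
            last_sl = None
            last_mxi = None
    return crashes


if __name__ == "__main__":
    import sys

    # Usage (file name is up to you): python summarize_crashes.py node_verbose.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for crash in summarize_crashes(fh.read()):
            print(crash)
```

Run against a node's verbose log, this prints one line per crash, which makes it easier to compare the SL that was lost with the SL of the last MXI that actually reached the disk.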

What does that ominous "Code: 1" mean? How can one identify the cause of the crash? All nodes have 16GB memory, firewalls turned off, the scene is super simple, no other tasks are run in parallel on the nodes, no overheating, no messages in Windows' Event Viewer...

It is really unfortunate that when a node crashes, the last written MXI is not salvaged somewhere so one can pick up the job from there; instead, the crashed node restarts from SL 0, losing a lot of time in the process. If the pre-crash MXI could be salvaged as suggested, it would also be good for the next job in the queue to be started instead of the crashed node automatically restarting. As it stands, if a node crashes again (and maybe again), one could come back Monday morning to find the same job still rendering, instead of having at least 5-6 finished and 2-3 semi-finished (from node crashes) renders.