- Tue Aug 26, 2014 3:24 pm
#382522
Who crashed and why?
From the manager's log:
[26/August/2014 13:55:33] Render node: PC1 Job ID: 1 sl: 13.3298
[26/August/2014 13:55:33] Node: PC1: node_status_changed: render_crashed <--- where and why?
[26/August/2014 13:55:33] ERROR: Rendering process has crashed in node: PC1: 192.168.2.2
[26/August/2014 13:55:33] ##### Try to work #####
[26/August/2014 13:55:33] ##### There is a cooperative job running that accepts more nodes #####
[26/August/2014 13:55:33] ##### Processing pending jobs #####
[26/August/2014 13:55:33] ##### Job 1 #####
[26/August/2014 13:55:33] ### Job type: cooperative ###
[26/August/2014 13:55:33] ### Assigned to: any node available ###
[26/August/2014 13:55:33] Processing job already running, probably adding more nodes to a coop job
[26/August/2014 13:55:33] Sending dependencies info. Num of dependencies: 46
From the render node's log:
[26/August/2014 13:55:26] [26/August/2014 13:55:26] [INFO]: Message to render node: time_update 1488
[26/August/2014 13:55:26] [26/August/2014 13:55:26] [INFO]: Message to render node: new_sampling_level_reached 13.329779
[26/August/2014 13:55:30] The remote host closed the connection . Code: 1 <--- remote host?
[26/August/2014 13:55:30] ERROR: Error in rendering process. The process crashed some time after starting successfully.
[26/August/2014 13:55:30] ERROR: Render process crashed!
[26/August/2014 13:55:30] Connecting to render process: Binding to port: 45463
[26/August/2014 13:55:37] TCP message from manager received.
[26/August/2014 13:55:37] Message from manager: cpuid
[26/August/2014 13:55:37] 12633
[26/August/2014 13:55:37] New job order received
The crashed node (but was it really the node, or perhaps the network, the manager, or the licensing PC?) picks itself up and automatically rejoins the render, which adds time to the ongoing job and, in turn, delays the next pending job.
It would be good if the last "good" MXI of the crashing node were saved and the next pending job started; that way one could merge it manually later. As it stands, the crashed node restarts its contribution at SL 0, which, if the crash happens just before rendering is done, effectively doubles the specified render time. And cooperative renders always seem to crash just before they're done.
What is the right method to find out the root cause of a crash?
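In the meantime, one starting point is to correlate ERROR entries across the node and manager logs by timestamp. A minimal sketch, assuming only the `[DD/Month/YYYY HH:MM:SS]` log format shown above (the function names and the five-second window are my own choices, not anything from the software):

```python
import re
from datetime import datetime

# Matches the [26/August/2014 13:55:30]-style prefix used in both logs.
TS_RE = re.compile(r"\[(\d{2}/\w+/\d{4} \d{2}:\d{2}:\d{2})\]")

def parse_ts(line):
    """Extract the first bracketed timestamp from a log line, or None."""
    m = TS_RE.search(line)
    if not m:
        return None
    return datetime.strptime(m.group(1), "%d/%B/%Y %H:%M:%S")

def first_error(lines):
    """Return (timestamp, line) of the first line containing 'ERROR'."""
    for line in lines:
        if "ERROR" in line:
            return parse_ts(line), line.rstrip()
    return None, None

def context_around(lines, ts, window=5):
    """All lines whose timestamp is within `window` seconds of ts."""
    out = []
    for line in lines:
        t = parse_ts(line)
        if t is not None and abs((t - ts).total_seconds()) <= window:
            out.append(line.rstrip())
    return out
```

Feeding both log files through this and printing the manager-log context around the node's first ERROR at least shows which side reported the failure first, even if it can't say why.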
Thanks!