Yeah, there's a lot of geek jargon in there for sure! Convoluted is one of many adjectives that could be used to describe it.
I haven't troubleshot the hardware yet for a number of reasons. I'm pretty sure the hardware is fine. I actually run CoreTemp on all of the nodes to monitor any CPU overheating and none of them have reported anything unusual. I know that doesn't look at the other thermal sensors, but the chassis design is actually really effective at airflow and the fans on these things are outrageously loud. Plus, they are all housed in an IDF that's commercially cooled, so they have access to nearly perfect ambient air conditions. The nodes themselves are all embedded in (2) of these SuperMicro chassis (4 nodes per chassis):
https://www.supermicro.com/products/sys ... TP-HTR.cfm
There is a slight difference between the two chassis since we ordered them about a year apart. I discovered this when doing a low-level copy of one of the SSDs from the old chassis to the new and the NICs didn't work at all! I had to reinstall the Intel driver for them to work properly. They must have upgraded the NICs to a slightly newer version. Even so, the problem I'm seeing happens randomly to a handful of nodes and doesn't appear to be isolated to either chassis. Every other component is identical; the motherboards (excluding the embedded NICs), the RAM modules, the SSDs, even the CPUs are exactly the same.
I also haven't tried the standalone renderer yet to see if the problem persists. When I watch a node as the crash happens, it always happens immediately after Voxelization is complete and the render process begins. When looking at the MR log, it doesn't log that Voxelization ever completed; it just restarts. It's never crashed on me on my workstation when testing, so I assumed it had to be a network issue. With that being said, you brought up the possibility of this being scene-specific. At first, I dismissed this as well since this is happening on any one of the variants of a scene that I'm working on. However, to reduce disk space, I took every static object in the scene and made it one big MXS reference. That export was done by an older version of Maxwell back in June of 2015. All of the materials would also have been saved from that older version.
In light of that, I opened up and resaved all 67 of my MXM files, as well as opened the MXS referenced file and resaved it. The problem still persists.
I've also changed the settings of each node to force a Low priority in case there was contention between processes affecting response times. The only perceivable difference I can see now is that some of the nodes (again randomly) will render as if they are only using a single thread. Their speed and benchmark values plummeted for some reason, but it's random and only affects 1 to 3 nodes at a time.
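For what it's worth, one way to check whether a slow node was actually launched with the full thread count is to read the flags back from the command line that Maxwell echoes into its own log at startup. Below is a rough Python sketch of that idea. The function name `parse_maxwell_flags` and the shortened sample command line are my own for illustration, and I'm assuming `-threads:` and `-p:` correspond to thread count and priority because they match what I set on the nodes.

```python
import re


def parse_maxwell_flags(command_line, names=("threads", "p")):
    """Return the values of selected -name:value flags found in a command line.

    Hypothetical helper, not part of Maxwell. It only does text matching on
    the command line echoed in the log.
    """
    flags = {}
    for name in names:
        # Require whitespace (or start of string) before the dash so that
        # "-t:" does not match inside "-threads:" or "-time:".
        match = re.search(r"(?:^|\s)-" + re.escape(name) + r":(\S+)", command_line)
        if match:
            flags[name] = match.group(1)
    return flags


# Shortened sample based on the command line seen in the node log.
SAMPLE_CMD = "maxwell.exe -nowait -noimage -s:18 -t:9999 -threads:32 -p:low -time:50000 -rs:1"

print(parse_maxwell_flags(SAMPLE_CMD))
```

If a node that renders slowly still shows the expected `-threads:` value, the slowdown is happening after launch rather than in how the manager starts the job.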
It also occurred to me that the node logs may not be as complete as a log generated by Maxwell Render itself, so I modified the settings for MR on each node to save a "Debug Mode" level log. When one of them has crashed enough to be killed by the manager, the last batch of lines in the MR log file are always the same:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<START LOG>
maxwell:[14/June/2016 14:39:45] [INFO]: Checking Data
maxwell:[14/June/2016 14:39:45] [INFO]: Loading Bitmaps & Preprocessing Data
maxwell:[14/June/2016 14:39:45] [DEBUG]: Preprocessing data.
maxwell:[14/June/2016 14:39:45] [DEBUG]: Preprocessing scene modifiers.
maxwell:[14/June/2016 14:39:45] [INFO]: Loading geometry extensions.
maxwell:[14/June/2016 14:39:45] [DEBUG]: Preprocessing MXS references.
maxwell:[14/June/2016 14:39:45] [INFO]: Loading MXS references...
maxwell:[14/June/2016 14:39:46] [INFO]: MXS references loaded successfully
maxwell:[14/June/2016 14:39:46] [DEBUG]: Pretessellating meshes with displacement.
maxwell:[14/June/2016 14:39:46] [DEBUG]: Checking spot/IES lights.
maxwell:[14/June/2016 14:39:46] [DEBUG]: Initializing data.
maxwell:[14/June/2016 14:39:46] [DEBUG]: Preprocessing geometry.
maxwell:[14/June/2016 14:39:47] [DEBUG]: Preprocessing materials.
maxwell:[14/June/2016 14:39:53] [DEBUG]: Preprocessing additional parameters.
maxwell:[14/June/2016 14:39:53] [DEBUG]: Initializing render engine.
maxwell:[14/June/2016 14:39:53] [WARNING]: DO NOT SAVE IMAGE FILE flag enabled. Image file will not be saved
maxwell:[14/June/2016 14:39:53] [DEBUG]: Initializing multilight data.
maxwell:[14/June/2016 14:39:53] [DEBUG]: Multilight enabled.
maxwell:[14/June/2016 14:39:53] [INFO]: Starting voxelization.
maxwell:[INFO]:
maxwell:[INFO]: Scene Info:
maxwell:[INFO]: Image output: C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN16\mxoDAF2.png
maxwell:[INFO]: MXI output: C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN16\Rx_7004_Happy-2.mxi
maxwell:[INFO]:
<************************* Nothing else *************************>
<END LOG>
If I look at the log of a node that crashes but then tries to render again, you can see the overlap in the log file in the exact same place every time:
<START LOG>
maxwell:[14/June/2016 14:39:42] [INFO]: Checking Data
maxwell:[14/June/2016 14:39:42] [INFO]: Loading Bitmaps & Preprocessing Data
maxwell:[14/June/2016 14:39:42] [DEBUG]: Preprocessing data.
maxwell:[14/June/2016 14:39:42] [DEBUG]: Preprocessing scene modifiers.
maxwell:[14/June/2016 14:39:42] [INFO]: Loading geometry extensions.
maxwell:[14/June/2016 14:39:42] [DEBUG]: Preprocessing MXS references.
maxwell:[14/June/2016 14:39:42] [INFO]: Loading MXS references...
maxwell:[14/June/2016 14:39:43] [INFO]: MXS references loaded successfully
maxwell:[14/June/2016 14:39:43] [DEBUG]: Pretessellating meshes with displacement.
maxwell:[14/June/2016 14:39:43] [DEBUG]: Checking spot/IES lights.
maxwell:[14/June/2016 14:39:43] [DEBUG]: Initializing data.
maxwell:[14/June/2016 14:39:44] [DEBUG]: Preprocessing geometry.
maxwell:[14/June/2016 14:39:44] [DEBUG]: Preprocessing materials.
maxwell:[14/June/2016 14:39:50] [DEBUG]: Preprocessing additional parameters.
maxwell:[14/June/2016 14:39:51] [DEBUG]: Initializing render engine.
maxwell:[14/June/2016 14:39:51] [WARNING]: DO NOT SAVE IMAGE FILE flag enabled. Image file will not be saved
maxwell:[14/June/2016 14:39:51] [DEBUG]: Initializing multilight data.
maxwell:[14/June/2016 14:39:51] [DEBUG]: Multilight enabled.
maxwell:[14/June/2016 14:39:51] [INFO]: Starting voxelization.
maxwell:[INFO]:
maxwell:[INFO]: Scene Info:
maxwell:[INFO]: Image output: C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN11\mxoDAF2.png
maxwell:[INFO]: MXI output: C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN11\Rx_7004_Happy-2.mxi
maxwell:[INFO]:
<************************* Process Starts Over Here *************************>
maxwell:[INFO]: Installing error mode handler
maxwell:[INFO]:
maxwell:[INFO]:
maxwell:[INFO]: MAXWELL RENDER (M~R). Engine version 3.2.1.4
maxwell:[INFO]: (C) 2004-2015. Licensed to Next Limit Technologies
maxwell:[INFO]:
maxwell:[INFO]: C:\Program Files\Next Limit\Maxwell 3\maxwell.exe -nowait -mxs:Y:\CS Products\General Cubicle\scenes\Fabric Renders\Variants\Main Camera\Rx_7004_Happy.mxs -o:C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN11\mxoDAF2.png -mxi:C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN11\Rx_7004_Happy-2.mxi -noimage -s:18 -t:9999 -mintime:0 -node -nodeport:45463 -renameoutput -idcpu:876 -slupdate:10 -v:4 -depth:8 -res:1200x1080 -camera:Camera.Main-SHIFTED -motionblur:off -displacement:off -dispersion:off -extrasamplingenabled:off -noimage:on -nomxi:off -ml:intensity -layers:0 -channel:alpha,off,16,png -channel:shadow,off,16,png -channel:object,off,16,png -channel:material,off,16,png -channel:motion,off,16,png -channel:zbuffer,off,16,png -channel:roughness,off,16,png -channel:fresnel,off,16,png -channel:normals,off,16,png -channel:position,off,16,png -channel:deep,off,32,exr -channel:uv,off,16,png -channel:alpha_custom,off,16,png -channel:reflectance,off,16,png -normalsspace:world -positionspace:world -zmin:0 -zmax:1 -deeptype:alpha -deepmindistance:0.2 -deepmaxsamples:20 -threads:32 -p:low -nomxi:off -time:50000 -rs:1
maxwell:[INFO]:
maxwell:[INFO]:
maxwell:[INFO]: LICENSE INFO
<END LOG>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
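Since the pattern in these logs is so consistent, it could be checked across all nodes with a small script instead of by eye. Here is a minimal Python sketch, assuming (based only on the excerpts above) that a crash-and-restart shows up as the "Installing error mode handler" banner appearing after "Starting voxelization." with no timestamped line in between. The function name and the sample text are hypothetical; the real logs may contain other lines I haven't seen.

```python
import re

VOXEL_START = "Starting voxelization."
RESTART_BANNER = "Installing error mode handler"

# Timestamped lines look like: maxwell:[14/June/2016 14:39:51] [INFO]: ...
TIMESTAMPED = re.compile(r"^maxwell:\[\d{1,2}/\w+/\d{4} \d{2}:\d{2}:\d{2}\]")


def count_voxelization_restarts(log_text):
    """Count restarts that occur right after voxelization starts."""
    pending = False  # True while voxelization has started and nothing newer is logged
    restarts = 0
    for raw in log_text.splitlines():
        line = raw.strip()
        if VOXEL_START in line:
            pending = True
        elif RESTART_BANNER in line:
            if pending:
                restarts += 1
            pending = False
        elif TIMESTAMPED.match(line):
            # Assumption: any later timestamped line means the render moved on.
            pending = False
    return restarts


SAMPLE = """\
maxwell:[14/June/2016 14:39:51] [DEBUG]: Multilight enabled.
maxwell:[14/June/2016 14:39:51] [INFO]: Starting voxelization.
maxwell:[INFO]:
maxwell:[INFO]: Scene Info:
maxwell:[INFO]:
maxwell:[INFO]: Installing error mode handler
maxwell:[INFO]: MAXWELL RENDER (M~R). Engine version 3.2.1.4
"""

print(count_voxelization_restarts(SAMPLE))
```

Running this over each node's log file would give a per-node restart count, which makes it easier to see whether the crashes really are spread evenly across both chassis.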
I did ultimately try a completely different scene to see if this may be scene specific and I think it may be. I figured what better test than a scene that all Maxwell users should be familiar with, Benchwell. Based on some info I got about extracting the scene data (
http://www.maxwellrender.com/forum/view ... 89#p391008), I decided to queue it up as a network render and I had no issues at all! That definitely narrows down the culprit, but it's also maddeningly unhelpful since the scene I'm working on is so complex that the thought of rebuilding it somehow is a bit overwhelming. Any suggestions for rebuilding a scene in the least manual way? What about one that's crashing during and/or after the Voxelization stage? How many corruptions could exist by simply importing the MXS reference file into a blank MXS?
I did just think of a wishlist item though. A tool that purges out unallocated scene assets and then rebuilds any Maxwell-related data (MXS and MXM data namely) against the currently installed Maxwell core and its SDK. Although not very catchy, it could be called something like "Purgewell", or maybe even "Fixwell".
Thanks Mihai!