failures - unknown reason - Maxwell Render

failures - unknown reason#391267

By greengreen - Sun Jun 05, 2016 9:15 am

Can anyone tell me why I keep getting these failures and the Maxwell Node closes on the farm computer?
In this case, I restarted the node, and it failed a couple times but then started rendering.
Another node didn't report a failure; it just did nothing after it said "new job order received". I can reset the node in the Monitor and it starts working, but only after reporting that it failed twice.

From the Network Monitor:
[05/June/2016 00:09:23] Scene data sent successfully.
[05/June/2016 00:09:23] Start scene data transfer
[05/June/2016 00:09:23] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1
[05/June/2016 00:09:23]
[05/June/2016 00:09:23] ERROR: Render node Farm-1 failed too many times and will be killed.
[05/June/2016 00:09:23]
[05/June/2016 00:09:24] Scene data sent successfully.
[05/June/2016 00:09:24] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1
[05/June/2016 00:09:24]
[05/June/2016 00:09:24] ERROR: Render node Farm-1 failed too many times and will be killed.
[05/June/2016 00:09:24]
[05/June/2016 00:09:25] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1
[05/June/2016 00:09:25]
[05/June/2016 00:09:25] ERROR: Render node Farm-1 failed too many times and will be killed.
[05/June/2016 00:09:25]
[05/June/2016 00:09:25] ERROR: Render node Farm-1 failed too many times and will be killed.
[05/June/2016 00:09:25]

From the Network Manager:
[05/June/2016 00:09:23] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1
[05/June/2016 00:09:23] ERROR: Render node Farm-1 failed too many times and will be killed.
[05/June/2016 00:09:24] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1
[05/June/2016 00:09:24] ERROR: Render node Farm-1 failed too many times and will be killed.
[05/June/2016 00:09:25] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1
[05/June/2016 00:09:25] ERROR: Render node Farm-1 failed too many times and will be killed.
[05/June/2016 00:09:25] Warning: Render node disconnected: Farm-1
[05/June/2016 00:09:25] ERROR: Render node Farm-1 failed too many times and will be killed.
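When several nodes are involved, it can help to tally these failure lines per node instead of reading the log by eye. Here is a minimal sketch in Python, assuming the Manager log keeps the exact wording quoted above ("has failed in node", or "has crashed in node" as seen later in this thread). The function name and sample list are made up for illustration.

```python
import re
from collections import Counter

# Wording copied from the Manager log lines quoted in this thread.
FAILED = re.compile(
    r"^\[(?P<ts>[^\]]+)\] ERROR: Rendering process has (?:failed|crashed) "
    r"in node: (?P<node>[^:]+): (?P<ip>\S+)"
)

def count_failures(lines):
    """Return a Counter mapping node name -> number of failure lines."""
    counts = Counter()
    for line in lines:
        match = FAILED.match(line.strip())
        if match:
            counts[match.group("node")] += 1
    return counts

sample = [
    "[05/June/2016 00:09:23] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1",
    "[05/June/2016 00:09:23] ERROR: Render node Farm-1 failed too many times and will be killed.",
    "[05/June/2016 00:09:24] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1",
    "[05/June/2016 00:09:24] ERROR: Render node Farm-1 failed too many times and will be killed.",
    "[05/June/2016 00:09:25] ERROR: Rendering process has failed in node: Farm-1: 192.168.5.1",
]

print(dict(count_failures(sample)))  # {'Farm-1': 3}
```

Feeding it the lines of a saved log file (for example via `open(path).readlines()`) would show at a glance whether the same nodes keep failing.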

Re: failures - unknown reason#391269

By Mihai - Sun Jun 05, 2016 5:48 pm

I'd test whether the scene renders OK in standalone mode on that node. Maybe it's running out of RAM, or there's another problem that isn't related to the network. It looks like the render process itself crashes.

Maxwellzone.com - tutorials, training and other goodies related to Maxwell Render
Youtube Maxwell channel

Re: failures - unknown reason#391270

By greengreen - Sun Jun 05, 2016 6:34 pm

thanks, I'll give it a try

Re: failures - unknown reason#391271

By greengreen - Sun Jun 05, 2016 7:06 pm

I tried running it on the farm computer. Maybe it's because this MXM material is missing. It's weird that it fails and then sometimes runs, though. It seemed to fail immediately when running from the farm.

Re: failures - unknown reason#391306

By zparrish - Thu Jun 09, 2016 2:48 am

I can vouch for something very, very similar. I've gone as far as making sure all of my farm machines and the file server they pull from have all of the Windows updates. I began to think it was the switch in the rack, but I'm starting to think it may actually be a software issue.

I've noticed that it randomly picks the node(s) that act up after rebooting. After it's all been rebooted, the problematic node(s) stay consistent until I reboot everything again, and then it randomly picks new troublemakers. I also noticed that the Maxwell log file gets deleted on the machines that are causing problems, which makes it even stranger.

I checked the Windows logs and there's nothing in there to help me determine the issue. It's really odd, and I'm glad I'm not the only one seeing something like this.

I'm running MR 3.2.1.4 on Win 7 x64 Enterprise. The only box using a different OS is the file server, which is Win Server 2008 R2 w/ SP1.

Regards,
Zack Parrish
-
Maxwell - 4.2.0.3
Maxwell 4 | 3ds Max - 4.2.4
336 capable Maxwell threads!
-
Workstation:
Dual E5-2680v3, 64GB, Quadro K5200
48 threads (HT) @ 139.2GHz
-
Render Farm:
288 threads (HT) @ 835.2GHz

Re: failures - unknown reason#391364

By zparrish - Tue Jun 14, 2016 3:27 pm

I was finally able to get the log information from a node that crashed. Here's the last 13 lines. I believe the green line that's in bold has something to do with it.

Log from node mktgCGI-RN12: 192.168.2.41
[14/June/2016 08:26:10] Message from render process: render_info voxelization done.
[14/June/2016 08:26:11] [14/June/2016 08:26:10] [INFO]: End Voxelization
[14/June/2016 08:26:11] [14/June/2016 08:26:10] [INFO]: Voxelization done.
[14/June/2016 08:26:11] Message from render process: render_started
[14/June/2016 08:26:11] The remote host closed the connection . Code: 1
[14/June/2016 08:26:11] [14/June/2016 08:26:11] [INFO]: Start Rendering
[14/June/2016 08:26:11] [INFO]:
[14/June/2016 08:26:11] ERROR: Error in rendering process. The process crashed some time after starting successfully.
[14/June/2016 08:26:11] ERROR: Render process crashed!
[14/June/2016 08:26:11] Connecting to render process: Binding to port: 45463
[14/June/2016 08:26:11] TCP message from manager received.
[14/June/2016 08:26:11] Message from manager: kill_node
[14/June/2016 08:26:11]

Now here's the strange part: when I looked for the corresponding log entry on the Manager machine, its timestamp was off by 4 whole seconds!

Log from manager mun-d020: 192.168.2.39
[14/June/2016 08:26:07] Node: mktgCGI-RN16: render_info: voxelization done.
[14/June/2016 08:26:07] Node: mktgCGI-RN15: render_info: starting render
[14/June/2016 08:26:07] Node: mktgCGI-RN12: node_status_changed: render_crashed
[14/June/2016 08:26:07] ERROR: Rendering process has crashed in node: mktgCGI-RN12: 192.168.2.41
[14/June/2016 08:26:07] ERROR: Render node mktgCGI-RN12 failed too many times and will be killed.
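To put a number on the offset between the two logs, the bracketed timestamps can be parsed and subtracted. This is only a sketch: it assumes the `[day/MonthName/year hh:mm:ss]` format shown in the excerpts above and an English month name (Python's `%B` depends on the active locale). The helper name is made up for illustration.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in both the node and the Manager log.
STAMP = re.compile(r"^\[(\d{1,2}/[A-Za-z]+/\d{4} \d{2}:\d{2}:\d{2})\]")

def first_stamp(line):
    """Parse the leading [dd/Month/yyyy hh:mm:ss] stamp, or return None."""
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d/%B/%Y %H:%M:%S")

# The crash as logged by the node and by the Manager (lines from above).
node_line = "[14/June/2016 08:26:11] ERROR: Render process crashed!"
manager_line = "[14/June/2016 08:26:07] Node: mktgCGI-RN12: node_status_changed: render_crashed"

skew = first_stamp(node_line) - first_stamp(manager_line)
print(skew.total_seconds())  # 4.0
```

A constant offset across many matching events would point at clock drift; an offset that only shows up around crashes would fit delayed log writes better.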

To give some background into the farm's setup, there are (10) total machines running Maxwell and (1) file server for distributing assets. The manager (192.168.2.39), my primary workstation (192.168.2.32), and the file server (192.168.2.X) are all connected to our Corporate domain via a different LAN and thus have their system clocks synchronized by our domain controller. The other (8) servers are not on the domain at all and subsequently synchronize their clocks with the default NTP provided by the Windows 7 installation (I think it's time.windows.com). All of these devices pull their render assets from the domain connected file server.

So here's the question:

How dependent is the overall network rendering process on the system clock of each device within the farm?

Between pulling a Maxwell license, downloading assets, writing to log files, sending and receiving update information, and then writing the resulting image file(s) back to the file server, there's an awful lot of chatter happening. I didn't think the file server really cared about the system clocks, but maybe it does.

When a node downloads assets for rendering, where does it store them?

I could verify whether the assets are actually intact if I knew where it cached them. Perhaps a renderable asset gets corrupted as the result of (9) or so systems with different system clocks pulling files from the same file server. If a corrupted, referenced MXM file or image file were cached on one or more of the systems, I could see that tanking the render process for said systems. I think the MXS files are intact since it can read the setup information correctly (unless that's being passed directly from the manager and the MXS local settings are ignored).
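Once the cache location is known, one way to check the copies is to hash every asset on the file server and compare it with the copy on the node. Below is a rough Python sketch of that idea. The folder layout and the `*.mxm` / `*.mxs` patterns are assumptions for illustration, and the demo uses throwaway temp folders in place of the real share and cache.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large assets are not read into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_mismatches(reference_dir, cached_dir, patterns=("*.mxm", "*.mxs")):
    """Return relative paths whose cached copy is missing or differs."""
    reference_dir, cached_dir = Path(reference_dir), Path(cached_dir)
    mismatches = []
    for pattern in patterns:
        for ref in sorted(reference_dir.rglob(pattern)):
            rel = ref.relative_to(reference_dir)
            cached = cached_dir / rel
            if not cached.is_file() or sha256_of(ref) != sha256_of(cached):
                mismatches.append(rel.as_posix())
    return mismatches

# Demo: temp folders stand in for the file server share and a node's cache.
with tempfile.TemporaryDirectory() as server, tempfile.TemporaryDirectory() as cache:
    Path(server, "wood.mxm").write_bytes(b"material data")
    Path(cache, "wood.mxm").write_bytes(b"material data")
    Path(server, "fabric.mxm").write_bytes(b"material data")
    Path(cache, "fabric.mxm").write_bytes(b"truncated")
    demo_mismatches = find_mismatches(server, cache)

print(demo_mismatches)  # ['fabric.mxm']
```

An empty list on every node would make file corruption an unlikely cause.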

It's all just a guess, but I thought I would mention this in case it may be related. Thanks!

Regards,
Zack Parrish

Re: failures - unknown reason#391365

By Mihai - Tue Jun 14, 2016 5:04 pm

As far as I know, if you don't use "Send dependencies" when adding a job, the nodes get the assets directly from the networked path, and the render process on each node loads them into memory. They don't store them locally. If you do use Send dependencies, the Manager will take care of sending all the assets to each node, and the nodes will store these files in their system's temp folder (click File>Open temp folder to see what's in that node's temp folder).

It is always recommended to use network paths for files and assets and not use Send dependencies.

Maybe that log discrepancy between time noted in Manager vs Node has to do with the fact that the node was using its CPU heavily before crashing and so writing to the log was delayed a bit compared to the Manager.

Re: failures - unknown reason#391366

By zparrish - Tue Jun 14, 2016 5:32 pm

I actually went through the process of manually configuring the (8) non-domain servers to sync their clocks to our domain's PDC. I decided to do that after physically verifying that the clocks were out of sync by several seconds. Then I rebooted the whole farm and issued a render request, but the problem persisted. Ironically, I had (5) servers fail, which is more than I've ever had act up at one time. Then I tried using the "Send Dependencies" option, but unfortunately the problem still persisted for (3) of the nodes.

I think the problem resides elsewhere, but I'm not sure what else to check at this point. The networking and drive mapping should all be working correctly. I created a startup script for each node that maps the drives and validates the connection by attempting to write to the mapped paths. All of that is functioning as expected.
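For anyone who wants a similar check, the write validation can be approximated like this. It is a sketch of the idea in Python rather than the actual startup script, and it assumes the drives are already mapped; `Y:\` is the drive letter that appears in the render command line in this thread, and the demo probes the local temp folder instead.

```python
import os
import tempfile

def can_write(directory):
    """Try to create, write and delete a small probe file in `directory`."""
    try:
        fd, probe = tempfile.mkstemp(prefix="mxprobe_", dir=directory)
        with os.fdopen(fd, "w") as handle:
            handle.write("ok")
        os.remove(probe)
        return True
    except OSError:
        # Covers a missing path, a dropped mapping, or denied permissions.
        return False

# On a node this list would hold the mapped paths, e.g. ["Y:\\"].
# The local temp folder is used here so the sketch runs anywhere.
paths_to_check = [tempfile.gettempdir()]
results = {path: can_write(path) for path in paths_to_check}
print(results)
```

Running it again right after a node fails (not only at startup) would show whether the mapping is still alive at the moment of the crash.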

Perhaps it has something to do with the setup of the file server. It's a bit convoluted, so brace yourself for this description.

The file server is actually a VM using Windows Server 2008 R2. That VM is managed by our VMware vSphere / vCenter and is hosted by a box that is part of a large list of managed hosts. The storage space used by the file server is hosted by a Synology server, which is using 2TB VMDK virtual disks that are virtually connected to the file server using an iSCSI / LUN connection managed by the vSphere hosts layer. That means that the VMDK LUNs appear as local disks to the file server. It has no idea that they are virtual disks on another physical box.

As complex as that is, I don't think it's related to this issue. I didn't seem to have these issues until I upgraded from Maxwell 3.2.1.2 to Maxwell 3.2.1.4. We've been running on this file server setup for quite some time.

Regards,
Zack Parrish

Re: failures - unknown reason#391370

By Mihai - Tue Jun 14, 2016 7:52 pm

Do you have the sysadmin's dictionary to go along with that post?

Btw, personally I did notice the problem with .14 that resuming previous coop jobs is crashing the Monitor.

I would start looking more towards the hardware in those nodes. First of all, did you try a few standalone renders, for a few hours, to rule out any RAM or overheating issues? If that's done, does the process always crash at a certain point? With a certain scene?
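A standalone test on a node could be scripted roughly as below. This is a sketch, not an official recipe: the executable path and the flag spellings (`-nowait`, `-mxs:`, `-o:`, `-mxi:`, `-s:`, `-t:`, `-threads:`) are copied from the command line that the node writes to its debug log in this thread. Reading `-s` as sampling level and `-t` as a time limit is an assumption, and the network-only flags (`-node`, `-nodeport`) are left out on purpose.

```python
import subprocess

# Path as it appears in the node's debug log; adjust per install.
MAXWELL_EXE = r"C:\Program Files\Next Limit\Maxwell 3\maxwell.exe"

def standalone_command(mxs, image_out, mxi_out, threads=32, sampling=18, minutes=9999):
    """Build the argument list for a local (non-network) render."""
    return [
        MAXWELL_EXE,
        "-nowait",
        f"-mxs:{mxs}",
        f"-o:{image_out}",
        f"-mxi:{mxi_out}",
        f"-s:{sampling}",      # assumed: sampling level
        f"-t:{minutes}",       # assumed: time limit
        f"-threads:{threads}",
    ]

def run_standalone(cmd):
    """Run the renderer and return its exit code; non-zero suggests a failure."""
    return subprocess.run(cmd).returncode

# Hypothetical file names, only to show the resulting command.
cmd = standalone_command("scene.mxs", "out.png", "out.mxi")
print(" ".join(cmd))
```

If the same scene also dies at the same point when started this way, the network layer is probably not the culprit.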

Re: failures - unknown reason#391372

By zparrish - Tue Jun 14, 2016 9:43 pm

Yeah, there's a lot of geek jargon in there for sure! Convoluted is one of many adjectives that could be used to describe it.

I haven't troubleshot the hardware yet for a number of reasons. I'm pretty sure the hardware is fine. I actually run CoreTemp on all of the nodes to monitor any CPU overheating and none of them have reported anything unusual. I know that doesn't look at the other thermal sensors, but the chassis design is actually really effective at airflow and the fans on these things are outrageously loud. Plus, they are all housed in an IDF that's commercially cooled, so they have access to nearly perfect ambient air conditions. The nodes themselves are all embedded in (2) of these SuperMicro chassis (4 nodes per chassis):

https://www.supermicro.com/products/sys ... TP-HTR.cfm

There is a slight difference between the two chassis since we ordered them about a year apart. I discovered this when doing a low level copy of one of the SSDs from the old chassis to the new and the NICs didn't work at all! I had to reinstall the Intel driver for them to work properly. They must have upgraded the NICs to a slightly newer version, but the problem I'm seeing will randomly happen to a handful of nodes and it doesn't appear to be isolated to either chassis. Every other component is identical; the motherboards (excluding the embedded NICs), the RAM modules, the SSDs, even the CPUs are exactly the same.

I also haven't tried the standalone renderer yet to see if the problem persists. When I watch the node as the crash happens, it always happens immediately after Voxelization is complete and the render process begins. The MR log doesn't record that Voxelization ever completed; it just restarts. It's never crashed on my workstation when testing, so I assumed it had to be a network issue. With that being said, you brought up the possibility of this being scene specific. At first I dismissed this as well, since it happens on every variant of the scene that I'm working on. However, to reduce disk space, I took every static object in the scene and made it one big MXS reference. That export was done by an older version of Maxwell back in June of 2015. All of the materials would also have been saved from that older version.

In light of that, I opened and resaved all 67 of my MXM files, and also opened the MXS reference file and resaved it. The problem still persists.

I've also changed the settings of each node to force a Low priority, in case there was an issue between processes and potential response times. The only perceivable difference now is that some of the nodes (again randomly) will render as if they are only using a single thread. Their speed and benchmark values plummeted for some reason, but it's randomly on 1 to 3 nodes.

It also occurred to me that the node logs may not be as complete as a log generated by Maxwell Render itself, so I modified the settings for MR on each node to save a "Debug Mode" level log. When one of them has crashed enough to be killed by the manager, the last batch of lines in the MR log file are always the same:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

<START LOG>
maxwell:[14/June/2016 14:39:45] [INFO]: Checking Data
maxwell:[14/June/2016 14:39:45] [INFO]: Loading Bitmaps & Preprocessing Data
maxwell:[14/June/2016 14:39:45] [DEBUG]: Preprocessing data.
maxwell:[14/June/2016 14:39:45] [DEBUG]: Preprocessing scene modifiers.
maxwell:[14/June/2016 14:39:45] [INFO]: Loading geometry extensions.
maxwell:[14/June/2016 14:39:45] [DEBUG]: Preprocessing MXS references.
maxwell:[14/June/2016 14:39:45] [INFO]: Loading MXS references...
maxwell:[14/June/2016 14:39:46] [INFO]: MXS references loaded successfully
maxwell:[14/June/2016 14:39:46] [DEBUG]: Pretessellating meshes with displacement.
maxwell:[14/June/2016 14:39:46] [DEBUG]: Checking spot/IES lights.
maxwell:[14/June/2016 14:39:46] [DEBUG]: Initializing data.
maxwell:[14/June/2016 14:39:46] [DEBUG]: Preprocessing geometry.
maxwell:[14/June/2016 14:39:47] [DEBUG]: Preprocessing materials.
maxwell:[14/June/2016 14:39:53] [DEBUG]: Preprocessing additional parameters.
maxwell:[14/June/2016 14:39:53] [DEBUG]: Initializing render engine.
maxwell:[14/June/2016 14:39:53] [WARNING]: DO NOT SAVE IMAGE FILE flag enabled. Image file will not be saved
maxwell:[14/June/2016 14:39:53] [DEBUG]: Initializing multilight data.
maxwell:[14/June/2016 14:39:53] [DEBUG]: Multilight enabled.
maxwell:[14/June/2016 14:39:53] [INFO]: Starting voxelization.
maxwell:[INFO]:
maxwell:[INFO]: Scene Info:
maxwell:[INFO]: Image output: C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN16\mxoDAF2.png
maxwell:[INFO]: MXI output: C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN16\Rx_7004_Happy-2.mxi
maxwell:[INFO]:
<************************* Nothing else *************************>
<END LOG>

If I look at the log of a node that crashes but then tries to render again, you can see the overlap in the log file in the exact same place every time:

<START LOG>
maxwell:[14/June/2016 14:39:42] [INFO]: Checking Data
maxwell:[14/June/2016 14:39:42] [INFO]: Loading Bitmaps & Preprocessing Data
maxwell:[14/June/2016 14:39:42] [DEBUG]: Preprocessing data.
maxwell:[14/June/2016 14:39:42] [DEBUG]: Preprocessing scene modifiers.
maxwell:[14/June/2016 14:39:42] [INFO]: Loading geometry extensions.
maxwell:[14/June/2016 14:39:42] [DEBUG]: Preprocessing MXS references.
maxwell:[14/June/2016 14:39:42] [INFO]: Loading MXS references...
maxwell:[14/June/2016 14:39:43] [INFO]: MXS references loaded successfully
maxwell:[14/June/2016 14:39:43] [DEBUG]: Pretessellating meshes with displacement.
maxwell:[14/June/2016 14:39:43] [DEBUG]: Checking spot/IES lights.
maxwell:[14/June/2016 14:39:43] [DEBUG]: Initializing data.
maxwell:[14/June/2016 14:39:44] [DEBUG]: Preprocessing geometry.
maxwell:[14/June/2016 14:39:44] [DEBUG]: Preprocessing materials.
maxwell:[14/June/2016 14:39:50] [DEBUG]: Preprocessing additional parameters.
maxwell:[14/June/2016 14:39:51] [DEBUG]: Initializing render engine.
maxwell:[14/June/2016 14:39:51] [WARNING]: DO NOT SAVE IMAGE FILE flag enabled. Image file will not be saved
maxwell:[14/June/2016 14:39:51] [DEBUG]: Initializing multilight data.
maxwell:[14/June/2016 14:39:51] [DEBUG]: Multilight enabled.
maxwell:[14/June/2016 14:39:51] [INFO]: Starting voxelization.
maxwell:[INFO]:
maxwell:[INFO]: Scene Info:
maxwell:[INFO]: Image output: C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN11\mxoDAF2.png
maxwell:[INFO]: MXI output: C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN11\Rx_7004_Happy-2.mxi
maxwell:[INFO]:
<************************* Process Starts Over Here *************************>
maxwell:[INFO]: Installing error mode handler
maxwell:[INFO]:
maxwell:[INFO]:
maxwell:[INFO]: MAXWELL RENDER (M~R). Engine version 3.2.1.4
maxwell:[INFO]: (C) 2004-2015. Licensed to Next Limit Technologies
maxwell:[INFO]:
maxwell:[INFO]: C:\Program Files\Next Limit\Maxwell 3\maxwell.exe -nowait -mxs:Y:\CS Products\General Cubicle\scenes\Fabric Renders\Variants\Main Camera\Rx_7004_Happy.mxs -o:C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN11\mxoDAF2.png -mxi:C:\Users\mktgdev\AppData\Local\Temp\mxnetwork\rendernode_mktgCGI-RN11\Rx_7004_Happy-2.mxi -noimage -s:18 -t:9999 -mintime:0 -node -nodeport:45463 -renameoutput -idcpu:876 -slupdate:10 -v:4 -depth:8 -res:1200x1080 -camera:Camera.Main-SHIFTED -motionblur:off -displacement:off -dispersion:off -extrasamplingenabled:off -noimage:on -nomxi:off -ml:intensity -layers:0 -channel:alpha,off,16,png -channel:shadow,off,16,png -channel:object,off,16,png -channel:material,off,16,png -channel:motion,off,16,png -channel:zbuffer,off,16,png -channel:roughness,off,16,png -channel:fresnel,off,16,png -channel:normals,off,16,png -channel:position,off,16,png -channel:deep,off,32,exr -channel:uv,off,16,png -channel:alpha_custom,off,16,png -channel:reflectance,off,16,png -normalsspace:world -positionspace:world -zmin:0 -zmax:1 -deeptype:alpha -deepmindistance:0.2 -deepmaxsamples:20 -threads:32 -p:low -nomxi:off -time:50000 -rs:1
maxwell:[INFO]:
maxwell:[INFO]:
maxwell:[INFO]: LICENSE INFO
<END LOG>

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
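Spotting the restart point by hand gets tedious across many nodes, so here is a small Python sketch that reports the last timestamped stage written before each fresh process start. It assumes, based only on the excerpts above, that a new process begins its output with the "Installing error mode handler" line and that stage lines look like `maxwell:[timestamp] [LEVEL]: message`. The function name is made up for illustration.

```python
import re

# First line a freshly started render process writes in the log above.
START_MARKER = "Installing error mode handler"

# Timestamped stage lines, e.g.
# maxwell:[14/June/2016 14:39:51] [INFO]: Starting voxelization.
STAGE = re.compile(
    r"^maxwell:\[(?P<ts>\d{1,2}/\w+/\d{4} [\d:]+)\] \[(?P<level>[A-Z]+)\]: (?P<msg>.+)$"
)

def last_stage_before_each_start(lines):
    """Return (timestamp, message) of the last stage seen before every start
    marker; None means the marker came before any timestamped line."""
    results = []
    last = None
    for line in lines:
        if START_MARKER in line:
            results.append(last)
            last = None
            continue
        match = STAGE.match(line.strip())
        if match:
            last = (match.group("ts"), match.group("msg"))
    return results

sample = [
    "maxwell:[14/June/2016 14:39:51] [DEBUG]: Multilight enabled.",
    "maxwell:[14/June/2016 14:39:51] [INFO]: Starting voxelization.",
    "maxwell:[INFO]:",
    "maxwell:[INFO]: Scene Info:",
    "maxwell:[INFO]: Installing error mode handler",
    "maxwell:[INFO]: MAXWELL RENDER (M~R). Engine version 3.2.1.4",
]

print(last_stage_before_each_start(sample))
# [('14/June/2016 14:39:51', 'Starting voxelization.')]
```

If every restart on every node reports the same last stage, that is decent evidence the crash is tied to one step of scene processing rather than to a particular machine.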

I did ultimately try a completely different scene to see if this may be scene specific, and I think it may be. I figured what better test than a scene that all Maxwell users should be familiar with, Benchwell. Based on some info I got about extracting the scene data (http://www.maxwellrender.com/forum/view ... 89#p391008), I decided to queue it up as a network render and I had no issues at all! That definitely narrows down the culprit, but it's also maddeningly unhelpful, since the scene I'm working on is so complex that the thought of rebuilding it somehow is a bit overwhelming. Any suggestions for rebuilding a scene in the least manual way? What about one that's crashing during and/or after the Voxelization stage? How many corruptions could exist by simply importing the MXS reference file into a blank MXS?

I did just think of a wishlist item though: a tool that purges unallocated scene assets and then rebuilds any Maxwell-related data (MXS and MXM data namely) against the currently installed Maxwell core and its SDK. Although not very catchy, it could be called something like "Purgewell", or maybe even "Fixwell".

Thanks Mihai!

Regards,
Zack Parrish

Re: failures - unknown reason#391419

By zparrish - Mon Jun 20, 2016 7:44 pm

I think I found the issue on the scene that's been giving me problems. I had quite a few MXM references in the base scene, which was converted into an MXS reference. Maxwell didn't really care for that apparently. In 3ds Max, I checked the "Embed Material in Scene" for all of the MXM references and the network rendering issues went away!

I knew that you couldn't do MXS references inside of MXS references, but having MXM references inside an MXS reference didn't really occur to me to be a bad idea. At any rate, that solved it for me. Thanks!

Regards,
Zack Parrish

Re: failures - unknown reason#391420

By Mihai - Mon Jun 20, 2016 7:48 pm

Interesting....thanks for letting us know!
