- Fri Jun 13, 2014 7:11 pm
#381250
Ok, so here's the overall setup and what happened:
Setup:
(1) manager
(6) render nodes
The manager and one of the render nodes are safely installed in an IT-controlled IDF with an enterprise-level UPS. The other 5 render nodes are regular desktops with no UPS attached. 4 of the 5 render nodes are set up to auto-boot on power failure, auto-log in to Windows 7, and launch the Render Node app via a shortcut ("C:\Program Files\Next Limit\Maxwell 3\mxnetwork.exe" -node:<Manager IP>); see the startup script sketch after the setup notes. Those same nodes also have their Windows firewalls disabled.
There is an active Cooperative job being network rendered.
As far as networking is concerned, it's all gigabit with two switches and one router. The IDF has a 3Com switch, and the other 5 nodes are all connected to a basic D-Link switch. There's a gigabit line connecting the two switches. The router only serves to host a DHCP server for other devices on the network and is connected to the D-Link switch via its own internal gigabit ports. None of the P2P traffic between nodes goes through the router; that's all host > switch > switch > host and back.
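For reference, here's a minimal sketch of the kind of startup batch file those auto-booting nodes could run after login (the path matches a default Maxwell 3 install as above; <Manager IP> stays a placeholder you'd replace with your manager's actual address):

    @echo off
    rem Launch the Maxwell Render Node and point it at the manager.
    rem Replace <Manager IP> with the real manager address before using this.
    start "" "C:\Program Files\Next Limit\Maxwell 3\mxnetwork.exe" -node:<Manager IP>

Dropping that in the Startup folder (shell:startup) gets a node back on the farm after a power failure without anyone touching it.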
What happened:
We had a random power grid flash that knocked down the non-UPS-connected nodes. When they all came back online, not all of them successfully reintegrated into the existing Cooperative task. Some didn't start rendering at all, while others didn't properly report their SL and times to the manager, preventing the monitor from correctly reporting the status of the task. I stopped the task, waited for all nodes to declare as "finished", and then resumed the task.
One node in particular was spitting out errors about not being able to connect using the established TCP/IP port (which happened to be 45463). This node has its firewall disabled. I then started looking at network-level firewalls and port priorities, but there was nothing in there blocking port 45463, plus the port was already working on all the other nodes. Also, the only network-level firewall we use is on the router, which is bypassed by all of this anyway.
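If you want to double-check the local firewall on a node before going down that road, the stock Windows command is:

    netsh advfirewall show allprofiles state

All three profiles should report OFF on a node that's supposed to have its firewall disabled.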
Then I started digging through netstat output to see if there was an existing process using that port on the faulty node. Again, there was nothing there. After scratching my head for a bit, I looked back through the Windows process list and noticed a maxwell.exe still running on that machine. I thought that was odd, as the Render Node app didn't report any existing mxnetwork.exe or maxwell.exe sessions running. I force-closed the maxwell.exe process, shut down the mxnetwork.exe process, and restarted the Render Node.
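For anyone retracing this, these are the standard Windows commands for that kind of check (45463 was just the port in my case; substitute whatever port your node reports):

    netstat -ano | findstr :45463
    netstat -anob
    tasklist | findstr /i "maxwell mxnetwork"

The -o flag adds the owning PID to each netstat line, and -b (which needs an elevated prompt) maps each connection to its executable. tasklist (or Task Manager) is where the leftover maxwell.exe showed up here.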
Voilà, it worked! So basically, mxnetwork.exe launched maxwell.exe initially but never properly closed it when I stopped/resumed the render task. I don't know whether this was one of the nodes that didn't report its SL and/or time. As far as I know, maxwell.exe doesn't normally consume the ports used by mxnetwork.exe. It will use ports for standard functions like network file retrieval, but it doesn't overlap the port range of mxnetwork.exe. Somehow, the residual maxwell.exe process was holding port 45463, which prevented mxnetwork.exe from reusing that port.
The really strange part is that this never even showed up in netstat. Anyway, it's fixed. I just wanted to post this in case anyone else can't figure out why a render node fails to reuse the port it was just using: look for a rogue maxwell.exe process and kill it, then restart your node (commands below).
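In short, the recovery sequence on the affected node boils down to this (same caveat as above: the path assumes a default Maxwell 3 install, and <Manager IP> is a placeholder):

    taskkill /f /im maxwell.exe
    taskkill /f /im mxnetwork.exe
    "C:\Program Files\Next Limit\Maxwell 3\mxnetwork.exe" -node:<Manager IP>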
This may also be some kind of bug between mxnetwork.exe and the maxwell.exe process it launches.
Thanks!
Regards,
Zack Parrish
-
Maxwell - 4.2.0.3
Maxwell 4 | 3ds Max - 4.2.4
336 capable Maxwell threads!
-
Workstation:
Dual E5-2680v3, 64GB, Quadro K5200
48 threads (HT) @ 139.2GHz
-
Render Farm:
288 threads (HT) @ 835.2GHz