- Fri Jun 13, 2014 7:11 pm
#381250
Ok, so here's the overall setup and what happened:
Setup:
(1) manager
(6) render nodes
The manager and one of the render nodes are safely installed in an IT-controlled IDF with an enterprise-level UPS. The other 5 render nodes are regular desktops with no UPS attached. 4 of the 5 render nodes are set up to auto-boot on power failure, auto-log in to Windows 7, and launch the Render Node app via a shortcut ("C:\Program Files\Next Limit\Maxwell 3\mxnetwork.exe" -node:<Manager IP>); see the startup script sketch after the setup notes. Those same nodes also have their Windows firewalls disabled.
There is an active Cooperative job being network rendered.
As far as networking is concerned, it's all gigabit with two switches and one router. The IDF has a 3Com switch, and the other 5 nodes are all connected to a basic D-Link switch. There's a gigabit line connecting the two switches. The router only serves to host a DHCP server for other devices on the network and is connected to the D-Link switch via its own internal gigabit ports. None of the P2P traffic between nodes goes through the router; that's all host > switch > switch > host and back.
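For reference, here's a minimal sketch of the kind of startup batch file those auto-booting nodes could run after login (the path matches a default Maxwell 3 install as above; <Manager IP> stays a placeholder you'd replace with your manager's actual address):

    @echo off
    rem Launch the Maxwell Render Node and point it at the manager.
    rem Replace <Manager IP> with the real manager address before using this.
    start "" "C:\Program Files\Next Limit\Maxwell 3\mxnetwork.exe" -node:<Manager IP>

Dropping that in the Startup folder (shell:startup) gets a node back on the farm after a power failure without anyone touching it.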
What happened:
We had a random power grid flash that knocked down the non-UPS-connected nodes. When they all came back online, not all of them successfully reintegrated into the existing Cooperative task. Some didn't start rendering at all, while others didn't properly report their SL and times to the manager, preventing the monitor from correctly reporting the status of the task. I stopped the task, waited for all nodes to declare as "finished", and then resumed the task.
One node in particular was spitting out errors about not being able to connect using the established TCP/IP port (which happened to be 45463). This node has its firewall disabled. I then started looking at network-level firewalls and port priorities, but there was nothing in there blocking port 45463, plus the port was already working on all the other nodes. Also, the only network-level firewall we use is on the router, which is bypassed by all of this anyway.
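If you want to double-check the local firewall on a node before going down that road, the stock Windows command is:

    netsh advfirewall show allprofiles state

All three profiles should report OFF on a node that's supposed to have its firewall disabled.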
Then I started digging through netstat output to see if there was an existing process using that port on the faulty node. Again, there was nothing there. After scratching my head for a bit, I looked back through the Windows process list and noticed a maxwell.exe still running on that machine. I thought that was odd, as the Render Node app didn't report any existing mxnetwork.exe or maxwell.exe sessions running. I force-closed the maxwell.exe process, shut down the mxnetwork.exe process, and restarted the Render Node.
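For anyone retracing this, these are the standard Windows commands for that kind of check (45463 was just the port in my case; substitute whatever port your node reports):

    netstat -ano | findstr :45463
    netstat -anob
    tasklist | findstr /i "maxwell mxnetwork"

The -o flag adds the owning PID to each netstat line, and -b (which needs an elevated prompt) maps each connection to its executable. tasklist (or Task Manager) is where the leftover maxwell.exe showed up here.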
Voilà, it worked! So basically, mxnetwork.exe launched maxwell.exe initially but never properly closed it when I stopped/resumed the render task. I don't know whether this was one of the nodes that didn't report its SL and/or time. As far as I know, maxwell.exe doesn't normally consume the ports used by mxnetwork.exe. It will use ports for standard functions like network file retrieval, but it doesn't overlap the port range of mxnetwork.exe. Somehow, the residual maxwell.exe process was holding port 45463, which prevented mxnetwork.exe from reusing that port.
The really strange part is that this never even showed up in netstat. Anyway, it's fixed. I just wanted to post this in case anyone else can't figure out why a render node fails to reuse the port it was just using: look for a rogue maxwell.exe process and kill it, then restart your node (commands below).
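In short, the recovery sequence on the affected node boils down to this (same caveat as above: the path assumes a default Maxwell 3 install, and <Manager IP> is a placeholder):

    taskkill /f /im maxwell.exe
    taskkill /f /im mxnetwork.exe
    "C:\Program Files\Next Limit\Maxwell 3\mxnetwork.exe" -node:<Manager IP>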
This may also be some kind of bug between mxnetwork.exe and the maxwell.exe process it launches.
Thanks!
Regards,
Zack Parrish
-
Maxwell - 4.2.0.3
Maxwell 4 | 3ds Max - 4.2.4
336 capable Maxwell threads!
-
Workstation:
Dual E5-2680v3, 64GB, Quadro K5200
48 threads (HT) @ 139.2GHz
-
Render Farm:
288 threads (HT) @ 835.2GHz