
"Jittery" performance with large cluster


"Jittery" performance with large cluster

cpapado
Hello everyone,

we've been running Equalizer for a while on a large visualization cluster. Our architecture has 18 machines, each with 4 GPUs. The Equalizer configuration file defines 72 nodes, each with one pipe, one window and one channel (the pipe is assigned to the corresponding GPU and the OpenGL context is created there). Furthermore, for each GPU we define one canvas and one segment, so we end up with 72 nodes and 72 canvases.
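
For illustration, one such per-GPU entry looks roughly like this (hostnames, device numbers and channel names below are placeholders, and the real file just repeats the pattern 72 times):

    server
    {
        config
        {
            node
            {
                connection { hostname "render01" }    # physical host
                pipe
                {
                    device 0                          # GPU index on that host
                    window
                    {
                        viewport [ 0 0 1920 1080 ]
                        channel { name "channel-r01-g0" }
                    }
                }
            }
            # ...71 more node/pipe/window/channel blocks, one per GPU...

            layout { view{ }}
            canvas
            {
                layout 0
                wall{ }
                segment { channel "channel-r01-g0" }
            }
            # ...and, at the moment, one canvas+segment per GPU as well
        }
    }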

What I am observing is best described as jittering or frame stuttering, happening about every second. The frame rate drops from hundreds of FPS down to 1-2 very briefly and then recovers, only for it to happen again soon thereafter. This is not related to rendering complexity (it happens with very simple scenes and also with the various eq samples).

I did some profiling on the AppNode driving the cluster in order to narrow down the source of the issue. I am seeing hotspots in co::LocalNode::_runReceiverThread (38.17% of all samples). In particular, a good chunk of time is spent in co::LocalNode::_handleData (26.6% of all CPU time), plus approximately 12.7% in the call to ConnectionSet::select() within the same function (_runReceiverThread). The second hotspot I've noticed is in ServerThread::run(), more specifically in _cmdStartFrame() (roughly 25% of CPU time spent there).

Our application is relatively simple, with a basic distributed object for application state (a few KB in size). This object gets committed 2-3 times during a single event/frame loop.
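
For context, the object is essentially a plain co::Object holding a small blob of state, along the lines of this sketch (class and field names are made up for illustration, not our real ones):

    // Illustrative sketch only -- not our real class.
    class AppState : public co::Object
    {
    public:
        eq::Matrix4f cameraTransform;
        uint32_t     activeModel;

    protected:
        virtual ChangeType getChangeType() const { return INSTANCE; }

        virtual void getInstanceData( co::DataOStream& os )
            { os << cameraTransform << activeModel; }

        virtual void applyInstanceData( co::DataIStream& is )
            { is >> cameraTransform >> activeModel; }
    };

    // In the event/frame loop on the app node, roughly:
    //     ...update the fields above...
    //     const eq::uint128_t version = appState.commit();  // 2-3 times per frame
    //     config->startFrame( version );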

I've tried a number of things to work around this (a rough sketch of where these knobs live in the config follows the list):
-Forcing swap sync off throughout the cluster
-Trying different pipe threading modes
-Setting up RSP (which seems to work but makes no difference)
-Playing around with swap barriers
-Disabling statistics collection
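
Roughly where those knobs live in our config, for reference -- the attribute names below are quoted from memory, so please treat them as assumptions rather than verified syntax:

    # window-level swap sync hint (name from memory)
    window
    {
        attributes { hint_swapsync OFF }
        # ...
    }

    # per-node thread model (also from memory)
    node
    {
        attributes { thread_model ASYNC }    # also tried DRAW_SYNC / LOCAL_SYNC
        # ...
    }

    # the swap barrier sits in the output compound
    compound
    {
        swapbarrier{ }
        # ...
    }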

None of the above made any significant difference in terms of performance.

My next step is to simplify the Equalizer configuration by fixing the mess I currently have with the 72 canvases and actually using 4 canvases (one per wall) with properly defined segments. Meanwhile, I'd love to get people's input on the above!

Thank you all,

Harris


Re: "Jittery" performance with large cluster

ROHN Carsten
Hey Harris,

From your description, I think you're hitting one of the major bottlenecks of Collage: reading from many connections happens in a single thread. We see this issue mostly in an HPC context, where we have up to 150 nodes rendering for one master. One thread is simply too slow to handle that many connections.

We even see connections "starving", because the reader never gets a chance to select the connections at the "back" of the connection set. This is another thing to be done for Collage: handle connections fairly on the reader side, to avoid that starvation. Every connection should have the same chance of being selected; right now the first connections are effectively prioritized.
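
Conceptually, "fair" just means rotating the point where the scan starts instead of always probing the set front to back. A toy sketch of the idea -- not actual Collage code, and the hasData() readiness check is made up:

    #include <cstddef>
    #include <vector>

    // Toy illustration of round-robin fairness -- NOT Collage code.
    struct Readable { bool hasData() const; };   // stand-in for a connection

    static std::size_t roundRobinStart = 0;

    Readable* pickReady( std::vector< Readable* >& connections )
    {
        const std::size_t n = connections.size();
        for( std::size_t i = 0; i < n; ++i )
        {
            // start the scan where the last one left off, not at index 0
            Readable* candidate = connections[ ( roundRobinStart + i ) % n ];
            if( candidate->hasData( ))
            {
                roundRobinStart = ( roundRobinStart + i + 1 ) % n;
                return candidate;
            }
        }
        return 0;   // nothing ready right now
    }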

If you are desperate, you can try our multiple-read-threads implementation, which improves reading throughput. Have a look at https://github.com/rttag/Collage/commits/master (the commits from Nov 18th to 28th, maybe also the one from Dec 18th, and definitely the one from Jan 3rd). Unfortunately, this Collage is not entirely up to date, so you may have to merge a bit. Also, we don't use this code in production yet, so there might be bugs. But in first tests we saw a huge performance gain in scalability scenarios even with a low number of clients, as well as improvements in many other areas.

This topic is very interesting for us as well, please keep me updated.

Regards,
Carsten

P.S.: Maybe you don't even have a software issue. We had a similar problem at a customer (twice by now). After quite some analysis we identified the switch as one of the problems; the other was a virus scanner with "intrusion prevention" that was deep-scanning all packets and thereby stalling network traffic. But I guess you're not running Windows, are you?






Re: "Jittery" performance with large cluster

Stefan Eilemann
Hi all,

On 9. Jul 2014, at 10:48, "ROHN Carsten [via Software]" <[hidden email]> wrote:

> From your description, I think you're getting a problem with one of the major bottlenecks of collage: reading from many connections happens in one thread. We see this issue mostly in HPC context, where we have up to 150 nodes rendering for one master. One thread is just too slow to handle so many connections.

I'm not sure. Since it stalls heavily only every now and then, I would lean towards a different problem that interrupts the otherwise normal ~100 fps processing, something like garbage collection in Java.


> We even have the effect of connections "starving", because the reader doesn't even have the chance to select the connections in the "back" of the connection set. This is another thing to be done for Collage: handle connections fairly on the reader side, to avoid the starving effect.

This is implemented on Linux, but not yet on Windows (see https://github.com/Eyescale/Collage/issues/38).

> This topic is very interesting for us as well, please keep me updated.

Indeed :)

> P.S.: Maybe you're not even having a software issue. We had a similar problem at a customer (twice by now). After quite some analysis we identified the switch as one of the problems, and a virus scan with "intrusion prevention" was deep scanning all the packets and therefor stalling network traffic. But I guess, you're not running windows, are you?

Yes, as said above, I would lean towards this direction of investigation.

> What I am observing is best described as jittering or frame stuttering, happening about every 1 second. The frame rate will drop from 100s of FPS down to 1-2 very briefly and then recover, only to happen soon there after.
> This is not related to rendering complexity (it happens with very simple scenes and also with the various eq samples).
>
> I did some profiling on the AppNode driving the cluster in order to narrow down the source of the issue. I am noticing hotspots in co::LocalNode::_runReceiverThread (38.17% of all samples). In particular, there seems to be a bunch of time spent within co::LocalNode::_handleData (26.6% of all CPU time) and approximately 12.7% for the call to
> ConnectionSet::select() within the same function (_runReceiverThread). The second hotspot that I've noticed is in the ServerThread::run() function and more specifically in _cmdStartFrame() (roughly 25% of CPU time spent there).

I don't think basic profiling will find the issue. Can you somehow get a profile when the stall happens, to see what is going on there?

The times look about right. The receiver thread plus handleData is likely your bottleneck, as Carsten said above. With 70 clients they get quite a bit of load, but since you are normally running at >60 FPS I would not worry yet.

The startFrame cost is also not surprising: no optimization has been done on that code at all, since so far it has never shown up as a bottleneck. The whole traversal and task-generation code can, however, be optimized by caching information once the hotspots are identified.

> Meanwhile, I'd love to get people's input on the above!

I would really like to find out what the cause of the stall is. Normally you seem to be fine, so a one-second pause suggests a special cause, such as an external influence or a strange code path.


HTH,

Stefan.



Re: "Jittery" performance with large cluster

cpapado
In reply to this post by ROHN Carsten
Thank you both for the very informative replies.

I did a bunch more investigation today. To preface things: yes, we are running Windows on the rendering cluster. I did try to run a heterogeneous cluster (Mac app node with Windows render nodes) but I couldn't get it to work.

Carsten, I did actually consider an underlying hardware or "Windows" issue. Our render-node machines are not running antivirus software and don't have local firewalls. The switch connecting them is a Dell PowerEdge, and I fiddled with it as well: I disabled QoS and a weird "Green Ethernet" power-saving mode, and neither of those tweaks made a difference. I took a look at the TCP connections on the server with TCPView and didn't notice anything awkward (such as connections dropping). It is also worth noting that we have a secondary application which uses a home-brew clustered rendering framework (albeit much simpler than Equalizer), and it does not exhibit this problem, so I am inclined to say the root cause is not our network infrastructure. The only piece of evidence from this part of the investigation was a noticeable drop in network traffic going in/out of the AppNode when the "spikes" or "jitters" take place.

Additionally, I tried various configuration "styles" for our cluster and noticed that the underlying canvas complexity doesn't matter. I also tried "scaling" the cluster size and found that the jittering appears once roughly 6 nodes have been added, and then gets progressively worse as the cluster grows.

Stefan, the profiling numbers I reported in the first post were actually from one of the moments during which the application "stalled". It is actually pretty interesting: the CPU utilization graph from the profiler showed spikes at intervals similar to the stalls, so I grabbed statistics from one of those "spike" intervals and reported them :). Also, as I mentioned, I am seeing similar behavior (and profiles) with the built-in Equalizer samples (e.g. eqPly).

I'll take a shot at compiling the version of Collage that Carsten just pointed to and let you guys know how it goes. Stefan, I also noticed that in the Issue #38 link you mention something about paying attention to more than 64 connection threads on Windows. What does that pertain to?

Thank you both again for the comments and please let me know if you have any more insights. I'll report our progress :).

Re: "Jittery" performance with large cluster

Stefan Eilemann
Hi,

On 9. Jul 2014, at 23:30, cpapado <[hidden email]> wrote:

> Stefan, the profiling numbers that I reported in the first post were
> actually from one of the moments during which the application "stalled". It
> is actually pretty interesting, the CPU utilization graph from the profiling
> showed spikes at similar intervals to the stalls so I just grabbed some
> statistics from one of those "spike" time intervals and reported them :).

That's good information. Can you do a bit more drill-down on this trace? I see two options:

1) There is some regular expensive call, e.g., BufferCache::compact, being run
2) For some reason there is a spike in co::Commands, e.g., caused by an "event storm"

> I'll give a shot at compiling the version of Collage that Carsten just
> pointed too and let you guys know how it goes. Stefan, I'm also noticing
> that in the Issue #38 link you mention something about paying attention to >
> 64 connection threads in windows? What does that pertain to?

WaitForMultipleObjectsEx takes at most 64 handles. Once ConnectionSet reaches this limit, it delegates up to 63 Connections to a sub-ConnectionSet running in its own thread and adds a single handle for that thread to the master set. You'll find this in the code easily, but it obviously complicates the implementation of #38.
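
For illustration, the constraint that forces this design is simply the following -- a minimal sketch of the Win32 call, not the actual ConnectionSet code:

    // Minimal sketch of the Win32 limit, not ConnectionSet itself.
    #include <windows.h>
    #include <vector>

    DWORD waitOnHandles( const std::vector< HANDLE >& handles )
    {
        // MAXIMUM_WAIT_OBJECTS is 64: passing more handles makes the call fail.
        // Hence the delegation: park up to 63 connections behind one event
        // handle driven by a helper thread, and wait on that single handle in
        // the master set instead.
        if( handles.empty() || handles.size() > MAXIMUM_WAIT_OBJECTS )
            return WAIT_FAILED;

        return WaitForMultipleObjectsEx( DWORD( handles.size( )), &handles[0],
                                         FALSE /*waitAll*/, INFINITE,
                                         TRUE /*alertable*/ );
    }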



HTH,

Stefan.



Re: "Jittery" performance with large cluster

cpapado
Thank you again for the response, Stefan. Drilling all the way down the traces, I'm noticing that the majority of CPU time is spent in NtDeviceIoControlFile (~23%) and NtWaitForMultipleObjects (~14%). All the calls to NtDeviceIoControlFile stem from co::SocketConnection::write(), and the calls to NtWaitForMultipleObjects all stem from co::ConnectionSet::select(). I'm not seeing anything else in the trace that would indicate an expensive out-of-place Collage call happening regularly. I can also upload one of the Visual Studio profiler traces if you want to take a closer look.

Is there a way to aggregate and dump Collage statistics to a file? Maybe that will provide some more insight.

I also checked whether our software was accidentally pulling in some debug-mode DLLs, but that was not the case. Finally, I tried the Collage edits that Carsten suggested by cherry-picking the relevant commits on top of the Collage version we use (1.1.0), but I guess I messed something up because it loops infinitely during initialization (and yes, I did recompile EQ and our software against it as well).

Thanks again :)

-Harris

Re: "Jittery" performance with large cluster

Stefan Eilemann

On 10. Jul 2014, at 22:32, cpapado <[hidden email]> wrote:

> I can also upload one
> of the Visual studio profiler traces if you want to take a closer look.

No, since I don't have an installation at hand.

> Is there a way to aggregate and dump Collage statistics to a file? Maybe
> that will provide some more insight.

Not really; I haven't needed to do an analysis like that. As a first step, I would log every time you're blocked longer than 100 ms in ConnectionSet::select or spend more than 100 ms in LocalNode::handleData. I would guess it's one of the two.
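
Something as crude as this around the two call sites would already tell you a lot -- a sketch using std::chrono and std::cerr, where the wrapped call is a placeholder for whichever of the two you instrument:

    // Crude timing guard -- a sketch, not a patch. Wrap the suspect call and
    // log anything that blocks longer than 100 ms.
    #include <chrono>
    #include <iostream>

    const auto t0 = std::chrono::steady_clock::now();

    connectionSet.select();   // placeholder: or the LocalNode handleData path

    const auto ms = std::chrono::duration_cast< std::chrono::milliseconds >(
                        std::chrono::steady_clock::now() - t0 ).count();
    if( ms > 100 )
        std::cerr << "blocked for " << ms << " ms" << std::endl;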

Of course, if you want to create something more elaborate I'm sure the community would appreciate it.


HTH,

Stefan.


