windows & linux build of application & bullet behave very differently, but why

drleviathan
Posts: 849
Joined: Tue Sep 30, 2014 6:03 pm
Location: San Francisco

Re: windows & linux build of application & bullet behave very differently, but why

Post by drleviathan »

I wonder if you're getting NaN injection. When one of the objects in the simulation gets a NaN in one of its values, that NaN can propagate to other objects and cause very strange and varied failures. This happens because NaN + number = NaN, and every comparison against NaN evaluates to false (except !=, which is always true), so logic flow wanders down unexpected pathways.
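
To make the NaN behavior described above concrete, here is a minimal standalone C++ sketch. The function names are made up for illustration and have nothing to do with Bullet. It shows how a range test and its apparent opposite can both come out false for a NaN input.

```cpp
#include <cmath>
#include <limits>

// Returns true when x lies inside [lo, hi]. For a NaN input both
// comparisons are false, so the result is false.
inline bool insideRange(float x, float lo, float hi) {
    return x >= lo && x <= hi;
}

// Written as the "opposite" test. For a NaN input both comparisons are
// false again, so this is ALSO false: NaN is neither inside nor outside.
inline bool outsideRange(float x, float lo, float hi) {
    return x < lo || x > hi;
}

// Arithmetic propagates NaN: if x is NaN, the sum is NaN as well.
inline float addOffset(float x, float offset) {
    return x + offset;
}
```

Code that assumes "not inside" means "outside" (or the reverse) is exactly where the unexpected pathways come from.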

Maybe try sanity checking all of your object transforms for NaN.
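
A sanity check along these lines could look like the sketch below. It assumes the transform has already been flattened into a plain array of scalars (for example the 16 values written by btTransform::getOpenGLMatrix()). The helper names allFinite and firstNonFinite are hypothetical, not part of Bullet.

```cpp
#include <cmath>
#include <cstddef>

// Returns true only if every one of the `count` values is finite,
// i.e. neither NaN nor +/-infinity.
template <typename Scalar>
bool allFinite(const Scalar* values, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (!std::isfinite(values[i])) {
            return false;
        }
    }
    return true;
}

// Returns the index of the first non-finite value, or -1 if all values
// are finite. Useful for logging which component went bad.
template <typename Scalar>
std::ptrdiff_t firstNonFinite(const Scalar* values, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (!std::isfinite(values[i])) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

Running the same check both where transforms are written and where they are read helps narrow down whether the NaN comes from the application or from inside the simulation step.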
ellie
Posts: 5
Joined: Wed Mar 11, 2020 11:08 pm

Re: windows & linux build of application & bullet behave very differently, but why

Post by ellie »

Just as a refresher for anyone dropping in (since the top post is missing) I asked about objects magically freezing but only in my Windows build:

https://wobble.ninja/temp/test_windows_omgwat.mp4 (look at how stuff starts hovering in the air! Correct behavior would be this: https://wobble.ninja/temp/test_linux_works.mp4 )

Indeed it looks like your very helpful guess was correct @drleviathan and it's a NaN:

https://i.imgur.com/njTCxJF.png (image, linked externally because phpBB3 is otherwise unhappy)

I wrapped all the places where I get the transform for rendering with sanity checks, and eventually (a few seconds in) it breaks, producing NaN out of bullet's getWorldTransform(). However, I also wrapped all the places where I ever set the transform and all of those are fine. So eventually, bullet just produces NaN, but again, only for my Windows build. It's built with `-O0` in the main program and for bullet I used `config=release64` in premake, so I'm not aware of any unsafe optimizations I explicitly enabled. I do use setLocalScale on some shapes, but I added in asserts that it's always > 0 and these all were fine too.

Is there any obvious reason why this would happen when I only set not-NaN transforms? Does bullet have a "check for NaN at every step" debug mode that could catch this earlier?
drleviathan
Posts: 849
Joined: Tue Sep 30, 2014 6:03 pm
Location: San Francisco

Re: windows & linux build of application & bullet behave very differently, but why

Post by drleviathan »

AFAIK this isn't a common problem on Windows for other users, so my intuition tells me: if this is a Real Bug then you've found an obscure recipe for it. Which makes me wonder: what are you doing that is odd or uncommon? I don't have many ideas as to what it would be without more information, but here are three:

(1) You didn't bypass the default fixedTimeStep strategy of stepSimulation() did you? In other words: you didn't supply a value of zero to the maxSubSteps (second, optional) argument of stepSimulation()? Alternatively, you aren't using a very very small value for fixedTimeStep (third, optional argument)?

(2) Are you using an odd/complicated constraint configuration? If you are using constraints, what types and what two objects do they connect?

(3) Are you using any non-convex shapes like btBvhTriangleMeshShape? If so, perhaps you should sanity-check the triangles you are supplying there, or temporarily hack in some convex shapes to see if the problem goes away.
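
For readers unfamiliar with the arguments mentioned in point (1), the following is a simplified model of a fixed-timestep accumulator with a sub-step cap. It is a sketch of the general technique for the maxSubSteps > 0 case only, not Bullet's actual code. FixedStepClock and its members are hypothetical names.

```cpp
#include <algorithm>

// Simplified model of fixed-timestep stepping (illustrative only).
struct FixedStepClock {
    double fixedTimeStep;  // length of one internal step, in seconds
    double leftover;       // time received but not yet simulated

    explicit FixedStepClock(double step) : fixedTimeStep(step), leftover(0.0) {}

    // Feeds `timeStep` seconds of wall time into the clock and returns how
    // many internal steps should run now, capped at `maxSubSteps`.
    int advance(double timeStep, int maxSubSteps) {
        leftover += timeStep;
        int due = 0;
        if (leftover >= fixedTimeStep) {
            due = static_cast<int>(leftover / fixedTimeStep);
            leftover -= due * fixedTimeStep;
        }
        // Steps beyond the cap are dropped, so the simulation falls behind
        // wall time instead of spending unbounded CPU time catching up.
        return std::min(due, maxSubSteps);
    }
};
```

In this model a very small fixedTimeStep makes `due` large on every call, so the cap is hit constantly and simulated time runs slower than real time.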
ellie
Posts: 5
Joined: Wed Mar 11, 2020 11:08 pm

Re: windows & linux build of application & bullet behave very differently, but why

Post by ellie »

I forgot to mention this, but I'm compiling bullet from the commit tagged with "2.89". I can't think of much that is uncommon, but here are more details on the questions you listed:

1.) This is my current update loop, with PHYSICS_STEP_MS set to 16, so I run it at 62.5FPS:

    uint64_t now = datetime_Ticks();
    while (physicsticks < now) {
        didstep++;
        const btScalar seconds = ((double)PHYSICS_STEP_MS) * 0.001;
        bulletworld->stepSimulation(seconds, 1, seconds);
        cworld->needAABBUpdate = 0;  // Not a bullet thing, I use this to remember to recalculate all AABBs before I sweep when I add objects to the scene in the same frame
        physicsticks += PHYSICS_STEP_MS;
        stepped_dt += PHYSICS_STEP_MS;
        if (stepCallback)
            stepCallback(cworld, PHYSICS_STEP_MS);
    }

(I'm aware that bullet already does fixed stepping itself, and that this is better than my loop, which currently has no strategy to avoid getting stuck if it "falls behind". I was planning to look into that more in-depth later.)
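
One possible way to keep a loop like this from getting stuck when it falls behind is to cap the number of steps per frame and drop the remaining backlog. The sketch below is only an illustration of that idea, not something from the original code. planCatchUp, CatchUpPlan and maxStepsPerFrame are hypothetical names.

```cpp
#include <cstdint>

// Result of planning one frame's worth of physics steps.
struct CatchUpPlan {
    int steps;               // number of fixed steps to run this frame
    std::uint64_t newTicks;  // value to store back into the tick counter
};

// Decides how many fixed steps of `stepMs` milliseconds to run so that the
// physics clock catches up with `now`, but never more than
// `maxStepsPerFrame` in one go.
inline CatchUpPlan planCatchUp(std::uint64_t physicsTicks, std::uint64_t now,
                               std::uint64_t stepMs, int maxStepsPerFrame) {
    CatchUpPlan plan = {0, physicsTicks};
    while (plan.newTicks < now && plan.steps < maxStepsPerFrame) {
        plan.newTicks += stepMs;
        ++plan.steps;
    }
    // Still behind after hitting the cap: drop the backlog so the next
    // frame starts from the current time instead of spiralling.
    if (plan.newTicks < now) {
        plan.newTicks = now;
    }
    return plan;
}
```

Dropping the backlog makes the simulation slow down visibly under heavy load, which is usually preferable to a frame that never finishes.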

2.) I just use boxes, capsules, and static meshes right now, with no constraints. However, I do set a non-1.0 scale on many things, I have Ccd enabled on all items, and I change the scale of shapes that are in use by bodies, or warp bodies after their creation. I use this code to reset them when I do that:

    bulletworld->getPairCache()->cleanProxyFromPairs(
        obj->rigidbody->getBroadphaseProxy(), bulletworld->getDispatcher()
    );
    bulletworld->updateSingleAabb(obj->rigidbody);
    ...
    obj->rigidbody->activate();

From my research I had the impression that this should be mostly safe to do, apart from possible velocity glitches, but maybe I was wrong...?

3.) I do indeed use a triangle mesh for the static level geometry. I just added some additional sanity checks to my wrapper through which all of the polygons are passing and it appears to be all in order:

    assert(pos1x != pos2x || pos1y != pos2y || pos1z != pos2z);
    assert(pos1x != pos3x || pos1y != pos3y || pos1z != pos3z);
    assert(pos3x != pos2x || pos3y != pos2y || pos3z != pos2z);
    assert(
        !isNaN(pos1x) && !isNaN(pos1y) && !isNaN(pos1z) &&
        !isNaN(pos2x) && !isNaN(pos2y) && !isNaN(pos2z) &&
        !isNaN(pos3x) && !isNaN(pos3y) && !isNaN(pos3z)
    );
I suppose I could try to produce a more minimal example, but since I also got some Lua in there that could take me a while. But since we're running out of obvious guesses that might be my best option to find out more, right? (I'm open to more theories as to what could be the cause though!)
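
The asserts above catch coincident vertices and NaN coordinates, but a triangle whose three distinct vertices lie on one line would still pass. A possible extra check, sketched here with a hypothetical Vec3 type and a hypothetical minArea threshold, is to test the triangle's area instead:

```cpp
#include <cmath>

// Minimal 3D point type for this sketch (not Bullet's btVector3).
struct Vec3 {
    double x, y, z;
};

// Area of triangle (a, b, c): half the length of the cross product of
// two edge vectors.
inline double triangleArea(const Vec3& a, const Vec3& b, const Vec3& c) {
    const double ux = b.x - a.x, uy = b.y - a.y, uz = b.z - a.z;
    const double vx = c.x - a.x, vy = c.y - a.y, vz = c.z - a.z;
    const double cx = uy * vz - uz * vy;
    const double cy = uz * vx - ux * vz;
    const double cz = ux * vy - uy * vx;
    return 0.5 * std::sqrt(cx * cx + cy * cy + cz * cz);
}

// True if the triangle has (almost) no area. Written as !(area > minArea)
// so that a NaN area is also reported as degenerate.
inline bool isDegenerate(const Vec3& a, const Vec3& b, const Vec3& c,
                         double minArea) {
    return !(triangleArea(a, b, c) > minArea);
}
```

A suitable minArea depends on the scale of the level geometry, so the value has to be picked per project.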
ellie
Posts: 5
Joined: Wed Mar 11, 2020 11:08 pm

Re: windows & linux build of application & bullet behave very differently, but why

Post by ellie »

Ok, so I debugged this some more:

This issue only happens when I call setCcdMotionThreshold and setCcdSweptSphereRadius. Otherwise it magically goes away.

Edit: it also seems to be related to object scale, see next post. I removed the unrelated guesses I previously had here
Last edited by ellie on Fri Mar 13, 2020 10:03 am, edited 1 time in total.
ellie
Posts: 5
Joined: Wed Mar 11, 2020 11:08 pm

Re: windows & linux build of application & bullet behave very differently, but why

Post by ellie »

Ok, so I just fixed an application bug of mine where I hadn't scaled up/down the Ccd sizes accordingly when setting setLocalScale() on the rigidbody's collision shape. This made it less common to happen, but the NaN still happens. So it seems to me like there is possibly some corner case in the Ccd code that breaks under less fortunate numeric conditions and then sets a coordinate of the world transform to NaN, which is a bit of a problem.
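
The post does not show how the Ccd sizes were rescaled, so the following is only one plausible sketch. It multiplies values chosen for the unscaled shape by the smallest scale component, on the assumption that the swept sphere should never end up larger than the scaled shape. CcdSettings and scaleCcdSettings are hypothetical names. The results would then be passed to setCcdMotionThreshold() and setCcdSweptSphereRadius().

```cpp
#include <algorithm>
#include <cmath>

// Ccd values for one body, chosen for the shape at scale 1.0.
struct CcdSettings {
    double motionThreshold;
    double sweptSphereRadius;
};

// Rescales Ccd values to match a local scale of (sx, sy, sz).
// Assumption: using the smallest absolute scale component is the
// conservative choice for non-uniform scales.
inline CcdSettings scaleCcdSettings(const CcdSettings& base,
                                    double sx, double sy, double sz) {
    const double s = std::min({std::fabs(sx), std::fabs(sy), std::fabs(sz)});
    return CcdSettings{base.motionThreshold * s, base.sweptSphereRadius * s};
}
```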

Edit: I just managed to get the NaN on Linux too, so it's not a platform-dependent problem. The way to trigger it seems to be the following combination, applied to all such boxes and capsules: a setCcdSweptSphereRadius value that is close to (but still definitely above) the collision margin set on the object; a setLocalScale() that puts all of the outer dimensions of a btBoxShape or btCapsuleShape near to, but still above, the margin once multiplied in; and a setCcdMotionThreshold that is below the margin (which I assumed should also be safe, since it's just the motion threshold, not the object size). The margin I used was 0.01. So anything small enough to get close to the collision margin seems to cause bullet to spew NaN eventually, and on the Windows build it just happens a little quicker for some reason.
ellie
Posts: 5
Joined: Wed Mar 11, 2020 11:08 pm

Re: windows & linux build of application & bullet behave very differently, but why

Post by ellie »

I investigated it more: it appears the problem is that velocities just get too high, especially with Ccd enabled, when slightly larger and heavier items and slightly smaller items smash down onto a triangle mesh under unfortunate conditions. Sadly, this appears to be avoidable only by very strictly avoiding objects of too different sizes.

However, simply hard-clamping the velocities appears to do WONDERS in my tests! Therefore I made a ticket & patch hoping for a compile time option: https://github.com/bulletphysics/bullet3/issues/2668 - because I think this would make Ccd way more usable in more complex scenarios, since this NaN thing is really quite a problem when it occurs.
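
As an illustration of what hard-clamping a velocity can mean, here is a standalone sketch. It is not the patch from the linked ticket. Velocity, clampSpeed and maxSpeed are hypothetical, and resetting non-finite input to zero is a design choice of this sketch. With Bullet, the values would come from getLinearVelocity() and be written back with setLinearVelocity().

```cpp
#include <cmath>

// Minimal velocity type for this sketch (not Bullet's btVector3).
struct Velocity {
    double x, y, z;
};

// Limits the magnitude of v to maxSpeed while keeping its direction.
inline Velocity clampSpeed(const Velocity& v, double maxSpeed) {
    const double speedSq = v.x * v.x + v.y * v.y + v.z * v.z;
    // NaN or infinite input (including overflow of the squared length)
    // cannot be rescaled meaningfully, so reset it to rest.
    if (!std::isfinite(speedSq)) {
        return Velocity{0.0, 0.0, 0.0};
    }
    // Already within the limit: return unchanged.
    if (speedSq <= maxSpeed * maxSpeed) {
        return v;
    }
    // Too fast: scale all components by the same factor.
    const double k = maxSpeed / std::sqrt(speedSq);
    return Velocity{v.x * k, v.y * k, v.z * k};
}
```

Applying such a clamp once per simulation step bounds how far a body can travel in one step, which is what keeps the extreme values from building up.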