THIRD PARTY HELL
When you work with third party software, you usually do it because you want to use the software "as-is", without having to fix it up or figure out what it is doing internally. It should be the perfect black box. But, as we often find in our profession (programming), that is not always the case. This is why I decided to write up my bad experience with the scripting language Lua and my use of it in a commercial project that I was working on. Mostly, I am writing this so that anyone who types this line ( n = (*clvalue(L->base - 1)->c.f)(L); /* do the actual call */ ) into a search engine finds more information about it than I did :)
To start things off properly: I think Lua is a very cool scripting language. It is fast, small and very customizable, and its lexical proximity to C/C++ makes it an ideal candidate for a scripting language to supplement a C/C++ code base. I have used it in two projects that I have worked on (Far Cry and 50 Cent: Bulletproof) and I am very satisfied with it overall. But, as with anything else, Lua has its little quirks that need to be ironed out. I should point out that the last version I know to have the problem I will be talking about here is Lua 5.0. I have since heard that the particular piece of code that was crashing in my implementation has been pretty much rewritten for Lua 5.1.
The problem started very late in the development of 50 Cent: Bulletproof. The game would crash randomly after some time playing, but always with the same call stack, deep inside Lua. The crash was in the luaD_precall function, specifically in the (soon to be) dreaded line:
n = (*clvalue(L->base - 1)->c.f)(L); /* do the actual call */
The crash happened because the value *clvalue(L->base - 1) evaluated to 0. Believe me, it was not fun to dig through Lua's nested macros (who still codes like that :)). For people not very versed in Lua, the crash meant that the closure for the C function that Lua was trying to call could not be found. Now this sounds serious - and it is. Ideally, if the function did not exist, Lua would detect that condition when it tried to look the function up and output an error message - something along the lines of "trying to call a nil value". But that didn't happen.
And even if it did - it would not have helped. When I analyzed the game's debug output, I noticed a lot of "Attempt to call a nil value" Lua errors. Usually, I do not give much importance to these errors, since I consider Lua capable of recovering from them (if it can detect them and output an error message). But they seemed too frequent, and since the game took about 10-15 minutes to crash, it occurred to me that a lot of these errors would accumulate over that time. At the suggestion of a coworker, I rigged the code to trigger an "Attempt to call a nil value" error every frame. Sure enough, with this modification in the code, the game crashed 3 seconds after it started.
Why does the code crash there? I don't know. I just know that all these (supposedly intercepted) calls to nil functions somehow accumulate and corrupt the VM. So if you are seeing this crash, regardless of how infrequent it is, chances are that you are accumulating errors. In ideal circumstances this should not result in a crash - if Lua is able to detect the situation, why isn't it able to recover from it safely?
I should say that I was using a very conservative garbage collection threshold. The game was running on the PS2, and the garbage collector was a nightmare, particularly with the number of tables that we had. Garbage collection was triggered when memory use reached half of the allocated Lua heap - which was about 700KB. Maybe that had something to do with it, although I would think Lua would run out of memory before it crashed in such an unlikely place.
The moral of the story? Get rid of all Lua error messages that happen during execution. Once this crash was fixed (by checking that each function existed before calling it, so that nil was never called), the game became incredibly stable, with a record run of 314 hours of operation with only one crash. And that crash was not Lua related :)
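To make the "check before you call" idea concrete, here is a minimal stand-alone sketch in plain C. It is only an analogy, not the code from the game and not Lua's API: the slot table, the names OnUpdate and OnDamage, and the helpers find_function and call_if_exists are all made up for illustration, with a NULL function pointer playing the role of a nil value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a script-visible function:
   takes no arguments and returns an int result. */
typedef int (*script_fn)(void);

/* One named slot; fn == NULL plays the role of a nil value. */
typedef struct {
    const char *name;
    script_fn   fn;
} slot;

int on_update(void) { return 42; }

/* Hypothetical global table: "OnUpdate" is defined,
   "OnDamage" was declared but never assigned. */
slot globals[] = {
    { "OnUpdate", on_update },
    { "OnDamage", NULL },
};

/* Counts calls that were skipped because the target was missing. */
unsigned skipped_calls = 0;

/* Look a function up by name; returns NULL if the name is
   unknown or the slot is empty. */
script_fn find_function(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < sizeof globals / sizeof globals[0]; ++i) {
        if (strcmp(globals[i].name, name) == 0)
            return globals[i].fn;
    }
    return NULL;
}

/* Call the named function only if it exists.
   Returns 1 and stores the result in *out on success;
   returns 0 and leaves *out untouched if there is nothing to call. */
int call_if_exists(const char *name, int *out) {
    script_fn fn = find_function(name);
    if (fn == NULL) {
        ++skipped_calls;
        fprintf(stderr, "skipping call to missing function '%s'\n", name);
        return 0;
    }
    *out = fn();
    return 1;
}
```

With this in place, call_if_exists("OnUpdate", &result) runs the function, while call_if_exists("OnDamage", &result) just logs and returns 0, so the error path is never entered at all. In real Lua code the same guard would presumably be a type(f) == "function" test on the script side, or a lua_isfunction check before lua_pcall on the C side.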