Wednesday, June 22, 2011

Weltmeisterschaft

I'm off for the next while to see the women's football world cup in Germany, so I'm unlikely to be posting much, and if I do, it probably won't be technical. See you all after the final.

Interesting Bugs

There are lots of interesting ways for software to go wrong. Here's one recent one that we uncovered at Cincom. And I should say that by "we" I mostly mean Tom Robinson, from the Store group.

I first noticed this bug in the middle of a demo in Frankfurt. I was showing some interesting Glorp and StoreGlorp capabilities. I'd run an interesting expression in a Store workbook, inspect the result, then run another expression and get an error. Force a reconnect, and it was fine again, but it occured several times.

Travis Griggs noticed an even odder manifestion of it in trying out some Store expressions. He wrote an expression to find the head of the 7.9 trunk for a particular package or bundle, and it ran fine, returning the expected result. Then he ran it to find the head of the 7.8 trunk, and it returned the exact same package as before, from the 7.9 trunk. Running it again produced the right answer.

Investigation revealed nothing obvious going on at the Glorp caching level. The database appeared to be genuinely returning results that were clearly incorrect for the query that was issued. Sometimes they were obviously for a different query. You might ask for a StorePackage and get back a single column that was obviously for the tw_databaseidentifier. This only happened when using Postgresql. And it apparently only happened when using the Store Workbook. Normal Store operations never showed this.

After much examination, Tom tracked down the cause, an interaction of several causes. First, when you open an inspector, there may be code specific to the type of object being inspected. In particular, the inspector shows the icon for things. The icon for packages shows differently depending if it's the version that's resident in the image or not. Asking that, especially on a completely new connection, could end up doing a database query. That's not really great, but should still work. But...

The inspectors try to be robust. If you write a printString that goes into infinite recursion, raises an exception, or otherwise doesn't return, it tries to stop it and just print an indication that it didn't work. So one of the mechanisms there is a timeout. If it takes too long, it just terminates the process. And in addition...

The Postgresql driver is written purely in Smalltalk. So there's Smalltalk code that manages a socket and the communication on it. This is in contrast to most other database drivers that call out to a C library, which may be just communicating on a socket underneath, but the protocol and details are hidden.

So, what happens is that we issue a query, and open an inspector on the resulting object(s). The inspector is the reason that this didn't affect normal Store operations. The inspector wants to know if this is the current version in the image, so it issues a query to the database. That query will run in a separate Glorp session, but because we're trying to be careful of resources, it'll re-use the underlying database connection. And there's more setup code than is really necessary that runs for each new session (or did, up until more recent builds). If the database is Postgresql, and isn't very close by on a fast machine, the inspector will time out before it gets the answer, so it will terminate the process. The database driver doesn't clean up properly when the process is terminated, so the previous results are left in a buffer. The next query that's issued on that connection will start looking for results, and get the results of the previous query out of the buffer. If the shape of those results matches what we're expecting, we'll just get a wrong answer. If the shape doesn't match, we'll get a confusing error. For example, if it's running the initial setup query to get the tw_databaseidentifier, we might see the result of that.

Fortunately, this only affects operations that use inspectors, so it has no effect on normal Store operations. And databases other than Postgresql keep the individual query results separate by themselves, so that won't happen. But it's a nice example of some very interesting interactions that normally don't show up on their own.