This article about a system outage at Protx, an on-line payment processing company, made for interesting reading.
In short, they did a system upgrade at 6am last Wednesday and it started to grind to an effective halt as business built up during the day. In a letter to its customers Protx stated:
"It was established that one table in the database had not been indexed properly and that this was the primary cause for the failure,"
The Register states
"Protx said the issue had not cropped up during testing."
This amused me because it is exactly the type of problem I had to deal with in the early 1990s at a retail operation. The front office systems were linked by a CTOS network to a pair of Dell Unix servers which acted as a gateway between CTOS and TCP/IP. The Dell boxes communicated with an IBM RS/6000 which ran the back office application on an Oracle database. My area of responsibility was the Dell Unix boxes.
When I arrived at the office I was told that there was a problem with the network: shop transactions were timing out and I needed to fix the network ASAP. So I got to work.
The first thing I needed to do was replicate the problem (easy - every purchase transaction was failing) and look through the logs and status data to see what was happening. I quickly established that the problem was 'in' the IBM RS/6000 where the back office application ran.
A front end transaction would go across the CTOS network into one of the two Dell boxes, be translated and then sent by TCP/IP to the RS/6000, where the back office application received and processed it. Unfortunately a response was not sent before the CTOS network layer timed out and returned an error to the front end application. It was this error that made management decide that the network was at fault.
I tried to explain that the network time-out error was a symptom of the back end application not completing its processing and sending a response. Management were having none of it.
So, I started to prove the fault was isolated. By running through all the front office transactions I was able to show that only the purchase transaction was failing. Network traffic for other operations such as stock checks worked fine. The data would come across the CTOS network, be translated in the Dells, arrive and be processed by the back office application, and the results would go back across the network to the front office systems. All this worked, so the basic network worked. It was just the purchase transaction that failed, and as far as the network systems were concerned one transaction was no different from any other.
I collected lots of figures to show that everything on the network side worked for the purchase transaction. Once I'd exhausted my patience I started looking into the back office application even though that was not my area of responsibility.
It quickly transpired that some database maintenance had been performed the night before the failure. My eyes got very wide! What had been done, I asked? A database restructuring was the reply. My eyes got even wider!
What exactly had happened during this restructuring, I asked. The contents of the tables had been dumped to text files, the tables had been recreated using SQL scripts supplied by the software vendor, and then the tables had been reloaded from the text files.
I asked if this procedure had ever been done before. The answer was no. I was bug-eyed by this time.
I immediately got hold of the SQL scripts and looked at them. I also looked at the relative sizes of the text files. It was easy to see which table was important: the transaction logging table, which was big and had only a single index on it, for the primary key.
What were the chances that the primary key index was not much use when processing purchase transactions, resulting in a table scan - i.e. slowly reading every record in the table rather than quickly going to the required one(s) using an index? Good enough to be worth trying something, I thought.
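Purely for illustration - the real table and column names are long gone, so these are invented - the shape of the problem would have been something like this. The purchase transaction runs a query along the lines of:

    -- hypothetical transaction logging table and search columns
    SELECT *
      FROM txn_log
     WHERE store_id = :store_id    -- which shop
       AND txn_date = :txn_date;   -- today's trading date

but the only index on txn_log is the one backing its primary key (say txn_id), so the database has no option but to read every row of that very large table for each purchase that comes in.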
I looked at the table definition and decided on some likely indices based on years of experience and gut instinct, and the database staff created some extra indices.
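Again these are illustrative rather than the actual statements used, but the fix amounted to something like:

    -- hypothetical secondary index covering the columns the purchase
    -- transaction actually searches on
    CREATE INDEX txn_log_store_date_idx
        ON txn_log (store_id, txn_date);

letting the purchase transaction's lookups use an index range scan instead of a full table scan.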
Lo and behold, the purchase transaction started working quickly again and the network time-out errors stopped.
Lo and behold!