How to get into a good mess

Over the past weekend the online shop of SpaceCat stopped working.

That was the result of some unfortunate decisions and the fortunate situation of being featured. Anyway, that should never have happened, but I re-learned a few lessons.

Problem #1: The shop is our own implementation

I always argue against reinventing the wheel, but in this case we did. Had we used a third-party system such as TapJoy or ScoreLoop (both already integrated), this would never have happened, partly because their code is better tested, partly because it relies on their servers.

Why did we do it?

Because we wanted a shop that was not bound to Android Market, so that we could distribute the game over other channels such as Amazon AppStore. We integrated PayPal for that. And we had to implement per-user purchase tracking, which is done for you when you use Android Market IAB.

ScoreLoop did not support it. We could have done it with TapJoy, but the initial release did not include that library, since this was originally designed for Chalk Ball.

In summary, that was the result of legacy decisions that could have been changed later on, but we did not want to throw our work away.

Was it worth it?

Hell no! PayPal is a nightmare from the accounting point of view (not entering other discussions about PayPal here) and in the end we did not deliver SpaceCat through other channels, so 100% not worth it.

Problem #2: The shop was hosted on a shared server

Yes, we just put it on a server we used for hosting other websites. At the time it looked like a good idea, and it was not like we were making too many requests. After all, users don’t open the shop that often; they mostly play, and that does not require the server.

Why did we do it?

Because we did not have a dedicated server out there and setting one up for this sounded like overkill.

Was it worth it?

Yes, until the moment SpaceCat was featured. We saved the cost of a dedicated server for 6 months, while the game was not popular enough.

We should have prepared the migration at the very moment SpaceCat was featured, but honestly I did not see this coming.

Problem #3: The URL of the API was hardcoded

Yes it was. What else do you want me to say?

It required an update of the game to get it fixed. Had this happened to an iOS game, we might have been unable to fix it for days.

Why did we do it?

Good question. Because I did not think of it at the moment. No excuse.

Was it worth it?

No, it could have been a much worse scenario. I am glad it only required an update and that the continuous integration system was properly set up, so we could ship it quickly.
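One way to avoid shipping a fixed URL is to look it up at startup and keep the compiled-in value only as a last resort. The following is a minimal sketch of that idea, not what SpaceCat does: the `shop_api_url` key, the example URLs and the `resolve_api_url` helper are all made up for illustration, and the fetch function is passed in so the lookup can be tested without a network.

```python
import json
from typing import Callable, Optional

# Hypothetical default: this URL does not come from the real game.
DEFAULT_API_URL = "https://shop.example.com/api"


def resolve_api_url(fetch_config: Callable[[], str],
                    cached_url: Optional[str] = None) -> str:
    """Return the shop API base URL.

    Order of preference: remote config, last cached value,
    then the default compiled into the app.
    """
    try:
        config = json.loads(fetch_config())
        if isinstance(config, dict):
            url = config.get("shop_api_url")  # hypothetical config key
            if isinstance(url, str) and url.startswith("https://"):
                return url
    except (OSError, ValueError):
        # Network failure or malformed JSON: fall through to the fallbacks.
        pass
    return cached_url or DEFAULT_API_URL


if __name__ == "__main__":
    # Simulate a config document that points the game at a new server.
    remote = lambda: '{"shop_api_url": "https://new-shop.example.com/api"}'
    print(resolve_api_url(remote))
```

With something like this in place, moving the shop to another host becomes a change to one small config file instead of a new release of the game.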

What happened exactly?

Once SpaceCat got featured, we started having around 50,000 new users a day, and 50% of them opened the shop. So after a week, the database had grown from 50,000 to 300,000 rows.

In addition to that, we started having more than 4 requests per second hitting that table. While this should not be that bad, it was too much for a shared server and the account was suspended.
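As a quick back-of-the-envelope check, the figures above can be turned into daily numbers. This assumes one new row per user who opens the shop, which is my simplification. Since the inputs are rounded, the result only roughly matches the growth we actually saw, so treat it as an order-of-magnitude estimate.

```python
# Figures quoted in the post; all of them are rounded estimates.
new_users_per_day = 50_000
shop_open_rate = 0.5
requests_per_second = 4

# Assumption: each user who opens the shop adds exactly one row.
new_rows_per_day = int(new_users_per_day * shop_open_rate)
new_rows_per_week = new_rows_per_day * 7
requests_per_day = requests_per_second * 24 * 60 * 60

print(new_rows_per_day)   # 25000
print(new_rows_per_week)  # 175000
print(requests_per_day)   # 345600
```

Even at these rough numbers, that is several hundred thousand requests a day against a table growing by tens of thousands of rows daily, which is more than a shared hosting account is meant to handle.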

The fix

The solution was to set up dedicated hosting so we have total control and can install tools other than MySQL in the future if required.

Once the host was delivered (and the company did a very good job by completing it in less than 5 hours), installing the shop code and migrating the data took less than an hour.

Then, we had to reconfigure SpaceCat and publish an update on Android Market. I’m glad to have Jenkins in place.

All in all, the system was down for only 12 hours, which is not that bad.

Even more

In addition to that, once the dedicated server was up, I noticed that queries on the table were slow even though the server was not heavily loaded. I looked again at everything and noticed that it was missing one index.

How did that happen? I don’t know, I was sure I had it in place. Probably because we lacked a proper deploy method and that index was only on the dev environment. Probably because of the lack of stress tests; until we reached 300,000 entries and 4 requests per second the table was not performing that badly and we never noticed it. This is, again, a problem of implementing your own solution.
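A missing index like this can be caught by asking the database for its query plan, and avoided by creating indexes in a deploy step that is safe to run on every environment. The sketch below illustrates both ideas under loud assumptions: it uses SQLite from the Python standard library as a stand-in for our MySQL server, and the `purchases` table, its columns and the index name are invented for the example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def ensure_indexes(conn):
    """Idempotent deploy step: creates the index only if it is missing."""
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_purchases_user ON purchases (user_id)"
    )


# Hypothetical schema, standing in for the real shop table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, user_id TEXT, item TEXT)"
)
conn.executemany(
    "INSERT INTO purchases (user_id, item) VALUES (?, ?)",
    ((f"user{i}", "coins") for i in range(10_000)),
)

lookup = "SELECT item FROM purchases WHERE user_id = ?"

before = query_plan(conn, lookup, ("user42",))
ensure_indexes(conn)
ensure_indexes(conn)  # running it again changes nothing
after = query_plan(conn, lookup, ("user42",))

print(before)  # a full table scan
print(after)   # a search that uses idx_purchases_user
```

In MySQL the equivalent check is `EXPLAIN` on the same query. Running a step like `ensure_indexes` as part of every deploy would have made the dev and production schemas match.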

Now the system is up and working, but it has given me too many headaches over the past days. So, again: if you are going to create an app that uses a service and it has the potential to be used by many people, use a third-party solution if one is available.