10. February 2011 14:26
By
Alice
In
Echo Bazaar | Techier than usual
Hi all
We moved some things around internally at the weekend. We took measures to ensure players shouldn't be affected, but it looks like a small minority are still suffering. We understand, however, that it doesn't help to be told you're a minority! Given time, things should settle down for you too, but meanwhile here are some things that may help.
EDIT! This is now fixed. See http://blog.failbettergames.com/post/A-cup-of-the-good-stuff.aspx
26. January 2011 03:09
By
Alexis
In
Echo Bazaar | Techier than usual
[EDIT: ooh good. We're better now.]
God, it was slow last night, wasn't it?[1] And we've had some outages. You've noticed. What's going on?
We've moved to a new hosting service. That's the thing about server moves: you know they're going to cause more pain and swelling than you expect, but even so you don't expect *that* much pain and swelling. Like Hofstadter's Law but with more blood and error messages. At the same time, we're seeing higher than usual player growth (which is part of the reason we moved), and we're having to work hard to keep ahead of it.
Anyway, you'll have noticed it's much quicker today than yesterday. This should be a trend, and the longer outages should be behind us. Should. We're getting there: just a little further. Sorry about that. All shall be well.
[1] UK night. US daytime. Which is pretty much the problem. :-)
26. November 2010 18:08
By
Alexis
In
Echo Bazaar | Techier than usual
Why is Fate purchase unavailable?
The service we use to take payments has been up and down like a wallaby on a seesaw since mid-morning, returning error messages.
Because of Thanksgiving / Black Friday?
We assume so.
What's Black Friday?
It's a US thing; Google it. Unless you're in the US, in which case laugh at our foreign ignorance.
Never mind that, when's it back up?
We haven't heard back from our payment provider yet. They're probably having a worse day than we are. Also they're on PST, also they may be on holiday. Anyway we'll put it back up when we're sure it's solid. We don't want players using a service that might be misbehaving.
I came on this morning ready to buy half-price Fate and now I can't!!!!
Don't worry, we'll extend the Fate sale when it's back up.
Are you going to change payment providers?
No idea yet, we'll see what their explanation is. They've generally been very good. But we don't get paid when we can't sell Fate, so we need a service we can rely on.
You folks must be having a really bad day.
That's startup life for you.
How can I cheer you up?
Buy more Fate when it's back up. Or stop to be kind to a cat. We like cats.
10. May 2010 14:51
By
Alexis
In
Techier than usual
I know last week's authentication problems frustrated a lot of people. (One of them was me.) We initially limited ourselves to the usual bland 'We know there are issues and we're looking into it' on Twitter: communicating details takes time, it provokes kibitzing, it can reveal things about our setup that I'd rather keep quiet, and if I'm doing something stupid, I don't necessarily want people to know. :-) But when we went into a bit more detail, people seemed to like it. So for the hell of it, here's an abridged account of what went on over the last four days: a peek into the engine room.
1. Get reports that some users are having problems authenticating.
2. Get enough reports that I become convinced that it's not just the usual occasional Twitter hiccups.
3. Try to reproduce the issues with several test accounts. No dice.
4. RDP into our servers in the US and see if I can reproduce from there, in case it's my location that's unaffected. Nope.
5. Scour the Web, api.twitter.com, the Twitter stream and so forth for references to similar problems. No luck.
6. Still can't reproduce. Lots of people complaining now. Problem seems to be becoming more widespread.
7. Realise the functionality that normally emails error messages to me has been down for about 72 hours. Oops. Footle about with it for a quarter-hour before giving up for now (the logs still work).
8. Add some diagnostics to see what's failing. There's a JSON deserialization error in Tweetsharp (the Twitter API library we use[1]) that I don't understand.
9. Download the latest version of Tweetsharp since we're about two versions behind. Run smoke tests and do an emergency deploy. This is a shot in the dark, but has fixed issues twice before when we hit Twitter problems. (What happened on those previous occasions was, Twitter trailed their changes, the Tweetsharp committers followed through, then Twitter made the changes once everyone had caught up...but we hadn't caught up.) No luck.
10. Discover my wife can reproduce. Er, that is, my wife can recreate the authentication problem.
11. Get sidetracked by a hypothesis that the problem only affects users with spaces in their display name.
12. Dig into root cause using wife's account. Fail to get anywhere. It's failing for my test accounts too, now, though, which is... sort of reassuring.
13. Talk to another Tweetsharp-using dev with similar problems who thought he was going mad. Start a thread on the Tweetsharp site. Tweetsharp co-ordinator Jason Diller responds almost immediately.
14. Doh! moment when I realise that one of the threads on Twitter devtalk *is* actually referring to this problem. Twitter accidentally introduced a change: the user/status call is returning an additional user object containing just the ID, inside the status tag, which is itself inside the main user object. This is why the deserialization error made no sense to me: it's this tiny overlooked user object that it can't deserialize, not the main one.
16. So Twitter have acknowledged it as a bug, we're not just going mad, which is nice.
17. But doesn't do us any immediate good.
18. Jason the Tweetsharp guy checks in a workaround to their repo, just two hours after we first reported the problem. Bloody hell, go Tweetsharp.
19. But by this point Twitter are saying the problem is fixed and we just need to wait for their cache to clear.
20. And [specific problem excised for reasons of extreme dullness] means I can't build the trunk version of Tweetsharp anyway.
21. Did I mention that we're now around 11pm, my time (BST)?
22. I tinker a bit more then go to bed in the fond belief that Twitter will have made everything good by morning.
23. Overnight, various US timezone types keep being unable to auth and indicating sorrow.
25. Various people are saying the original issue is not fixed yet on the Twitter thread. Twitter says, Real Soon Now.
26. I add a quick and ugly patch (a regex that takes the extraneous user object out of the returned JSON).
27. Which works.
28. But a few people are still complaining they can't tweet content.
29. Twitter assures everyone on devtalk that the original issue really is fixed.
30. There are still problems with tweeting content. This is a bigger deal than it might immediately seem. (i) It's part of the implicit contract with players that we give actions for tweeting content, so if there's any risk of players tweeting content and not being rewarded with actions, we take that seriously. That doesn't seem to be happening, but (ii) the lack of tweets is already having a visible effect on our growth numbers.
31. Guess what? I can't reproduce the tweet problem.
32. I follow what look like relevant logged errors down into the Tweetsharp source code, can't understand the error I'm getting, can't reproduce locally or live.
33. Maybe it's my dodgy regex hack fix? I don't see how but it seemed to start about the same time, and I didn't like the hack...
34. Check devtalk. Original issue really is really fixed, Twitter says.
35. I remove the regex hack from the live site and wait to see if people are still reporting problems.
36. I find out very quickly that the original issue is not in fact fixed.
37. I find out after a delay that the tweeting issue is not fixed by removing the hack.
38. Gah!
39. Put the regex hack back.
40. Go off and work on the migration to the new servers for a bit, hoping I'll have an idea about the tweeting issue by the time I come back. One of the (brand new) servers suddenly becomes unresponsive. Have a long argument with my hosting provider.
41. Take a few hours off. It's Saturday by now, dammit.
42. Add some more logging code to the site that actually tells me what the content was that someone failed to tweet. Leave it overnight. Ask players reporting bugs for more specific info (the key bit is: which bit of content did you fail with?).
43. Thank Christ, there do seem to be specific bits of content that cause the functionality to break. Further investigation shows anything with a carriage return or a tab character left in (they were pasted in from various text editing tools) is breaking. But they've worked for months!
44. Hypothesis: the newest version of Tweetsharp, which involved some underlying rewrites, now breaks on these whitespace characters (not a problem for most people, because who'd put a tab in a tweet?).
45. Run a db script that strips out all the tabs and carriage returns.
46. Problem solved.
47. Have a double espresso.
48. Change daughter's nappy.
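For anyone who wants to picture steps 14 and 26, here is a minimal sketch of the shape of the problem and one way to work around it. It is in Python for brevity, not the C# that Tweetsharp targets. The field names (`screen_name`, `status`, `user`) and the sample payload are assumptions for illustration, not Twitter's exact response. The patch described above was a regex over the raw JSON; this sketch parses the JSON and drops the nested key instead, which is a different technique and not the code that actually ran.

```python
import json


def strip_nested_user(raw_json: str) -> str:
    """Return the JSON with any 'user' object inside 'status' removed.

    Assumed shape: a main user object containing a 'status' object,
    which (because of the bug) contains a stray 'user' object holding
    just an ID.
    """
    payload = json.loads(raw_json)
    status = payload.get("status")
    if isinstance(status, dict):
        # Drop the stray object if present; do nothing otherwise.
        status.pop("user", None)
    return json.dumps(payload)


# Hypothetical response: main user -> status -> extra user with only an ID.
raw = (
    '{"id": 1, "screen_name": "example", '
    '"status": {"id": 99, "text": "hello", "user": {"id": 1}}}'
)
cleaned = json.loads(strip_nested_user(raw))
```

Either approach has the same aim: hand the deserializer a payload in the shape it expected before the change, and leave everything else untouched.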
Lessons learnt:
ONE! Intermittent bugs are a bitch. OK, I knew that already. But I should have enlisted the aid of my lovely players in getting more details sooner.
TWO! Deploying newer versions of key libraries is dancing for rain at the best of times, and sometimes it means you get struck by lightning. Even if it seems like a good idea at half eight in the evening.
THREE! Add logging code sooner, not later.
FOUR! Open source projects with active co-ordinators are good.
FIVE! Which I knew: detailed bug reports are very, very useful. If you've sent one (especially if you included the exact time and timezone of the problem, the browser in use, and your current user name) consider yourself the recipient of a big beaming smile.
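Lesson THREE and steps 42 to 45 can be sketched together. The following is a rough illustration in Python, not the site's actual code: `send` is a hypothetical stand-in for whatever call posts the tweet, and `content_id` is an assumed identifier for the piece of content. Logging the text with `%r` makes hidden tabs and carriage returns visible in the log. Replacing them with spaces, instead of deleting them outright, is a choice made here so that neighbouring words don't run together.

```python
import logging
import re

logger = logging.getLogger("tweet_content")


def clean_tweet_text(text: str) -> str:
    """Swap tabs and carriage returns for spaces, then tidy the result."""
    spaced = re.sub(r"[\t\r]", " ", text)
    # Collapse any runs of spaces the swap created and trim the ends.
    return re.sub(r" {2,}", " ", spaced).strip()


def send_with_logging(send, content_id, text: str) -> bool:
    """Try to send; on failure, log exactly which content broke."""
    try:
        send(text)
        return True
    except Exception:
        # %r shows control characters as \t and \r instead of hiding them.
        logger.exception("tweet failed: content_id=%s text=%r", content_id, text)
        return False
```

With something like this in place from the start, the first failed tweet would have pointed at the offending content, without needing a night of extra logging first.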
[1] and would thoroughly recommend for all you people out there in C# land