(or, open data is not always 100% open)
Five months ago I wrote a Twitter bot called @whensmybus. It took me a fortnight to code up and test the first version, which was pretty simple to begin with – it took a single bus number, and a geotag from a Tweet, and worked out when the next bus would be for you. And then people started using it, and really liking it. And because they liked it, they found ways of making it better (curse those users!). So I had to add in features like understanding English placenames, being able to specify a destination, handling direct messages and multiple bus routes, and tolerating the many, many ways people have found to break or confuse it. This took up a lot of my time. And it was fun, to be honest.
At the same time, those bloody users also asked me when I was going to do a version for the Tube. But I was too busy adding the features for @whensmybus, and that’s one reason why it took me five months to write its counterpart, @whensmytube, which I launched last week. But there’s a stack of other reasons why it took so long. It didn’t seem too difficult to begin with. Just like with buses, Transport for London have made their Tube departure data openly available (via a system called TrackerNet), as well as the locations of all their stations. It would be pretty simple to do the same for Tube data as it would for bus data, right?
So, for anyone interested in open data, software development, or just with a lay interest in why software doesn’t get new features quickly, here’s a run-down of why:
1. The Tube data isn’t complete
TfL helpfully provide details of all their Tube stations in a format called KML, from which it’s reasonably easy to extract the names and locations of every station. Well, they say “all”. That’s a bit of a lie. The file hasn’t been updated in a while; according to it, the East London Line is still part of the Tube network, and Heathrow Terminal 5 and Wood Lane stations don’t exist; neither do the stations on the new Woolwich Arsenal and Stratford International branches of the DLR. This has been griped about by other developers, but no update has been forthcoming. So it took time to do the ballache task of manually writing the data that hadn’t been included in the first place.
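To give a flavour of that extraction step, here is a minimal sketch in Python. The sample KML, the `parse_stations` helper and the `MISSING_STATIONS` patch table are all invented for this example; the real TfL file may be laid out differently, and the coordinates are approximate placeholders rather than authoritative values.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Tiny stand-in for the real file (assumed layout; the TfL file may differ).
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name> Acton Town </name>
      <Point><coordinates>-0.2801,51.5028,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

# Hand-written entries for stations the file leaves out
# (illustrative, approximate coordinates).
MISSING_STATIONS = {
    "Wood Lane": (51.5098, -0.2243),
}


def parse_stations(kml_text):
    """Return {station name: (latitude, longitude)} from KML text."""
    root = ET.fromstring(kml_text)
    stations = {}
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS).strip()
        coords = placemark.findtext(
            "kml:Point/kml:coordinates", default="", namespaces=KML_NS
        ).strip()
        if not name or not coords:
            continue  # skip placemarks that are not simple points
        # KML stores coordinates as longitude,latitude[,altitude]
        lon, lat = coords.split(",")[:2]
        stations[name] = (float(lat), float(lon))
    return stations


stations = parse_stations(SAMPLE_KML)
stations.update(MISSING_STATIONS)  # patch in the hand-written stations
```

Keeping the hand-written stations in their own table means they can be dropped again if the official file is ever updated.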
To make things more annoying, certain stations are left out of the TrackerNet system. If you want live updates from Chesham, Preston Road, or anywhere between Latimer Road and Goldhawk Road on the Hammersmith & City, you’re plain out of luck. Sorry, this is TfL’s fault and not mine. These omissions aren’t mentioned anywhere in the system documentation, either.
2. The Tube data isn’t built for passengers
To be fair to TfL, they do say what the TrackerNet service was meant for – it is built on their internal systems and was intended for non-critical monitoring of the service by their staff, and there is a disclaimer saying so. The public version is useful, but unlike its bus counterpart there is a lot of data in there that is not meant for public consumption. If anything, it’s too useful, as it contains irrelevant information such as:
- Trains that are out of service or “Specials”
- Trains that are parked in sidings
- Trains on National Rail systems, like Chiltern Railways, that run over Tube lines
- Data on whether a train is scheduled to go to a depot after its journey
- Trains that don’t know what their final destination is yet, and are just labelled “Unknown”
And none of these special cases are documented in the system. So I had to spend a lot of time working out these odd edge cases and filtering out the chaff. And the code is by no means complete – I have to wait until a new piece of irrelevant information shows up before I can filter it, because TfL don’t provide a list of possible values anywhere. This is annoying – so much so that I have even taken the step of submitting a Freedom of Information request to find out all the possible destinations a train can be given on the system, but I’m still waiting on it.
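The filtering described above might look something like the following sketch. It assumes each train has already been parsed into a dictionary with a `destination` key; that shape, the function names and the exact strings in the blocklists are my own illustration, not TrackerNet’s documented format.

```python
# Destinations seen so far that are no use to a passenger. This list is
# built up by observation, so it is certainly incomplete.
IGNORED_DESTINATIONS = {"unknown", "special", "out of service"}
IGNORED_KEYWORDS = ("sidings",)


def is_useful_to_passenger(train):
    """Return True if this train is worth telling a passenger about."""
    destination = train.get("destination", "").strip().lower()
    if not destination or destination in IGNORED_DESTINATIONS:
        return False
    return not any(keyword in destination for keyword in IGNORED_KEYWORDS)


def filter_trains(trains):
    """Keep only the trains a passenger could actually catch."""
    return [train for train in trains if is_useful_to_passenger(train)]


# Hypothetical sample data, made up for this example.
sample = [
    {"destination": "Cockfosters"},
    {"destination": "Unknown"},
    {"destination": "Special"},
    {"destination": "Example Road Sidings"},
]
passenger_trains = filter_trains(sample)
```

Because the blocklists live in one place, adding a newly discovered oddity is a one-line change.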
The documentation also falls down when it comes to reuse. For example, each station has a name (e.g. “King’s Cross St. Pancras”) and a code (e.g. “KXX”). Because spellings can vary, it’s easier to use the three-letter code when doing lookups, for consistency. But the list of codes, and the station names they correspond to, were locked in a bunch of tables in a write-protected PDF, so I couldn’t simply copy them out to build a code-to-station-name lookup table. In the end, I’m glad that someone had already done the hard work for me, rather than my having to manually type them out.
On top of that, the system uses terminology more suited to insiders. For example, most stations have platforms labelled Eastbound/Westbound or Northbound/Southbound, which is fine. But the Circle Line and the Central Line’s Hainault Loop have designations “Inner Rail” and “Outer Rail”. And then to make my life even worse, some edge cases like White City and Edgware Road have platforms that take trains in both directions. This is confusing as hell, and so I had to spend a bit of time dealing with these cases and converting them to more familiar terms, or degrading gracefully.
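Here is a rough sketch of that kind of clean-up. The raw label format (“Westbound - Platform 1” and so on) and the `describe_platform` helper are assumptions for the sake of the example; the idea is simply to keep the terms a passenger knows and quietly drop the ones they don’t.

```python
import re

FAMILIAR_DIRECTIONS = ("Northbound", "Southbound", "Eastbound", "Westbound")


def describe_platform(raw_label):
    """Reduce a raw platform label to terms a passenger would recognise."""
    lowered = raw_label.lower()
    found = [d for d in FAMILIAR_DIRECTIONS if d.lower() in lowered]
    # Only trust the direction if exactly one is named; platforms serving
    # both directions, or "Inner Rail"/"Outer Rail", fall through.
    direction = found[0] if len(found) == 1 else None
    match = re.search(r"platform\s+(\d+)", raw_label, re.IGNORECASE)
    platform = f"Platform {match.group(1)}" if match else None
    # Degrade gracefully: say what we can, or nothing at all.
    return " ".join(part for part in (direction, platform) if part)
```

With these made-up labels, `describe_platform("Inner Rail - Platform 2")` gives just `"Platform 2"`, and a label with nothing recognisable in it gives an empty string, which the caller can leave out of the reply.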
This is a pain, but worth it. As far as I’m aware, no other Tube live data app (including TfL’s website, or the otherwise excellent iPhone app by Malcolm Barclay, which I regard as the gold standard of useful transport apps) takes this amount of care in cleaning up the output presented to the user.
3. Humans are marvellous, ambiguous, inconsistent creatures
And then on top of that there are the usual complications of ambiguity. There are 40,000 bus stops in London, and typically you search for one by the surrounding area or the road it’s on, because you don’t know its exact name, and the app can look up roughly where you are, and give an approximate answer. But there are fewer than 300 Tube stations, and so you’re more likely to know the name of the exact one you want. Even so, there are variations in spelling and usage. Typically, a user is more likely to ask for “Kings Cross” than the full name “King’s Cross St. Pancras” – punctuation and all. This all needs dealing with gracefully and without fuss.
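One way of handling this, sketched below, is to normalise both the user’s text and the official names, then try an exact match, a unique prefix match and finally a fuzzy match using the standard library’s `difflib`. Only the “KXX” code comes from the example above; the other codes, the helper names and the 0.8 cutoff are assumptions for illustration and would need checking against TfL’s list.

```python
import difflib
import re

# Only "KXX" is taken from the text; the other codes are assumed.
STATION_CODES = {
    "King’s Cross St. Pancras": "KXX",
    "Earl’s Court": "ECT",
    "Kingsbury": "KBY",
}


def normalise(name):
    """Lower-case a name and strip punctuation and stray whitespace."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"['’.]", "", name)        # drop apostrophes and full stops
    name = re.sub(r"[^a-z0-9]+", " ", name)  # anything else becomes a space
    return " ".join(name.split())


def find_station(query, stations=STATION_CODES):
    """Return the official station name for a user's query, or None."""
    wanted = normalise(query)
    if not wanted:
        return None
    by_normalised = {normalise(name): name for name in stations}
    if wanted in by_normalised:
        return by_normalised[wanted]
    # "Kings Cross" should find "King's Cross St. Pancras", but only if
    # no other station starts the same way.
    prefixed = [name for key, name in by_normalised.items() if key.startswith(wanted)]
    if len(prefixed) == 1:
        return prefixed[0]
    close = difflib.get_close_matches(wanted, list(by_normalised), n=1, cutoff=0.8)
    return by_normalised[close[0]] if close else None
```

With this table, `find_station("Kings Cross")` returns the full King’s Cross St. Pancras name, whose code is then a plain dictionary lookup, while an ambiguous query such as “Kings” returns `None` so the bot can ask the user to be more specific.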
4. Despite all my work, it’s still in beta
There’s plenty @whensmytube doesn’t yet do. It only accepts requests “from” a station and doesn’t yet accept filtering by destination “to”. This is because, unlike bus routes, most Tube lines are not linear (and some even have loops). Calculating this is tricky, and TfL don’t provide an open network graph of the Tube (i.e. one telling us which station connects to which), and I haven’t yet had the time to manually write one.
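For what it’s worth, here is a sketch of what such a hand-written graph could look like, using a small stretch of the Northern Line around the Camden Town junction. None of this is in @whensmytube; the function names are hypothetical, and using the shortest path as a stand-in for the route a train actually takes is an assumption that would break down on lines with loops.

```python
from collections import deque

# Which station connects to which, for a small part of one line.
LINE_GRAPH = {
    "Camden Town": ["Chalk Farm", "Kentish Town"],
    "Chalk Farm": ["Camden Town", "Belsize Park"],
    "Belsize Park": ["Chalk Farm"],
    "Kentish Town": ["Camden Town", "Tufnell Park"],
    "Tufnell Park": ["Kentish Town"],
}


def shortest_path(graph, start, end):
    """Breadth-first search; returns the list of stations, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        station = queue.popleft()
        if station == end:
            path = []
            while station is not None:
                path.append(station)
                station = previous[station]
            return path[::-1]
        for neighbour in graph.get(station, []):
            if neighbour not in previous:
                previous[neighbour] = station
                queue.append(neighbour)
    return None


def train_calls_at(graph, origin, terminus, wanted):
    """Guess whether a train from origin to terminus passes through wanted."""
    path = shortest_path(graph, origin, terminus)
    return path is not None and wanted in path
```

From Camden Town, a train terminating at Belsize Park passes through Chalk Farm, but one terminating at Tufnell Park does not, which is exactly the distinction a “to” filter would need to make.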
5. But I’m still glad I did it
Despite all my problems with wrangling TfL’s data, I’m still pleased with the resulting app. Not least because, hey, it shipped, and that’s to be proud of in its own right. But more because everything I learned from it has kept me keen, and it’s had some pleasant side effects. The refactoring of the code I had to do has made @whensmybus a better product, and all the learnings of how to deal with the Tube network meant I was able to code and release a sister product, @whensmyDLR, with only a few days’ extra coding. Not bad.
But here are some quick conclusions from wrangling with this beast for the past five months:
- Open data is not the same as useful data. If it’s badly-annotated, or incomplete, then an open data project is not as useful. Releasing an API to the public is a great thing, but please don’t just leave it at that; make sure the data is as clean as possible, and updates are made to it when needed.
- Open documentation is as important as open data. It’s great having that data, but unless there’s documentation in an open format on how that data should be interpreted & parsed, it’s a struggle. All the features should be documented and all possible data values provided.
- Make your code as modular as possible. If you’re having to deal with dirty or incomplete datasets, or undocumented features, break your code up into as modular a form as you can get away with. The string cleaning function or convenience data structure you wrote once will almost certainly need to be used again for something else down the line, and in any case they shouldn’t clutter your core code.
- In the end, it’s worth it. Or, ignore all my moaning. Yes, it can be a pain, and annoying to deal with cleaning up or even writing your own data to go along with it; but in the end, a cleanly-coded, working product you can look on with pride is its own reward.
- Thank you, TfL. Despite all my bitching above, I’m still really grateful that TfL have opened their datasets, even if there are flaws in how they’re distributed and documented. Better something than nothing at all – so thank you, guys, and please keep doing more. Thank you.