Charles Arthur had a nice post at the weekend entitled: If I had one piece of advice to a journalist starting out now, it would be: learn to code. As any modern journalist is able to Google around for facts, Charles tells any budding journo to set themselves above and beyond the normal set of “IT skills”; being able to get a more powerful grip on data is now becoming part of what a journalist should know:
None of which is saying you shouldn?t be talking to your sources, and questioning what you?re told, and trying to find other means of finding stuff out from people. But nowadays, computers are a sort of primary source too. You?ve got to learn to interrogate them effectively – and quote them meaningfully – too.
It’s great advice – playing with data and getting a feel for how to get the best out of it not only helps you find new things out, but also helps open your mind up to a more healthy appreciation of data. It allows you to explore the possibilities of data as well as its flaws, when it can be trusted and when it should be taken with a pinch of salt. And it’s things like this that contribute towards a sense of joyful skepticism that any self-respecting geek should possess (and you thought it was just about watching every episode of Battlestar Galactica).
I gave up programming as a full-time career more than three years ago but have still kept my hand in programming since, either for fun or to make work quicker and easier. Working in the digital and social media PR sector isn’t just about going to the pub (truth be told, it’s actually about going to very expensive pubs) but also about dealing with vast quantities of information – so you can see how programming can help. Making tasks faster is part of it, but the programming mindset is equally if not more important: it has taught me skills such as looking to optimise and make things quicker, filtering noise from the signal, reusing what you have to save effort in the future, and not being surprised by the unexpected.
So I’m going to say what Charles Arthur said, but bigger. If you work in any information industry, or are thinking about a career in it, learn to code. And by code I don’t mean learn something hardcore like Java or C++, or even learn a full programming language (as you’ll see below). But it means getting above the usual abstractions you see – your web browser, Word, Excel – and getting involved at a deeper level, get to appreciate what the data it is you’re reading and realise it’s not just something to look at.
So where and what would I recommend getting started with programming? From my own weary geek’s viewpoint, here’s six ways of getting into it – three of which really aren’t strictly programming at all:
Regular expressions. I cannot begin to think how many times these have bailed me out of an otherwise unrecoverable situation. Regular expressions are ways of finding and replacing text that are much more powerful than the bog standard. For example, you might want to get all the telephone numbers or postcodes, out of a document, but they are all different so a simple search wouldn’t be able to do it, so you have to do it by hand (and might miss one out).
A regular expression on the other hand can say “find me any group of eleven digits that begin with a 0, and either match the patern 0xx xxxx xxxx, or 0xxxx xxxxxx” – and bingo, you have all your phone numbers. Get clever and you can even tell it to not worry about whether it’s a space or a hyphen in between the groups of numbers. Be careful – they can get complicated, so build them up slowly and step by step – and they can do unexpected things, so always back stuff up.
CSV. Many people work with Excel spreadsheets and while it is great for tabulating data it isn’t a very portable format. Often you want to copy data in or out of Excel into other applications and it ends up being a horrible mess of numbers separated by spaces and tabs that you have to re-align yourself. CSV (comma-separated values) is the very boring but portable way of getting data in and out of Excel – it just consists of text with no styles, with commas to mark in between each column.
CSV looks like shit but it makes up for it by being able to be extremely portable and lightweight. Combined with regular expressions above and you’re able to take the useful data out of a horrible mess, replace everything between it with commas, and you can now import it straight into a spreadsheet. Or vice-versa – extracting numbers out of the spreadsheet and allowing other apps to play with it (like I did with the general election map)
Yahoo! Pipes. I am still waiting for Yahoo to piss this one up against the wall like they have done with Technorati and Flickr. So much of the web already runs on RSS (Really Simple Syndication) – streams of links and articles – that being able to manipulate them like this is a real boon. Yahoo! Pipes takes RSS feeds and allows you to merge them together, filter them, cross-reference them and more. When I was looking for a job last year I used a series of Pipes to pull feeds from various job websites, filter out the kinds of jobs I didn’t like, and then remove the duplicates so I wasn’t wasting my time – all delivered to Google Reader for easy perusal, as and when they came in. The interface is as reasonably usable as you could expect and has led to some really useful apps being created.
Python. The biggie and the one I use the most. Python‘s strengths lie in its simplicity – it’s quite simple and human-friendly and runs pretty much on anything. It also has a sensible structure and organisation, which teaches you to code well and clearly. Finally, the vast libraries available mean you can play with pretty much any data format, such such as BeautifulSoup, which allows you extract data from webpages easily. Python’s one drawback is that it falls down on its relatively poor documentation & tutorials, with some honourable exceptions such as Mark Pilgrim, so do hunt around and don’t let the technicalese put you off.
Finally, PHP gets an honourable mention – a easy enough language to learn and used widely, but with so many evolutions and a complicated past the language is a mess, and it teaches several bad programming habits.
I wouldn’t recommend doing all six at once, or even ever, nor would I set expectations too high. In some respects, it’s not even about the code or the results you get – it’s as much about the philosophy and understanding it brings with it: that data is not a static thing but ours to play with, making us able to create wonderful new things or change society for the better.