So you want to be a geek…

23 January 2009

Charles Arthur had a nice post at the weekend entitled: If I had one piece of advice to a journalist starting out now, it would be: learn to code. As any modern journalist is able to Google around for facts, Charles tells any budding journo to set themselves above and beyond the normal set of “IT skills”; being able to get a more powerful grip on data is now becoming part of what a journalist should know:

None of which is saying you shouldn’t be talking to your sources, and questioning what you’re told, and trying to find other means of finding stuff out from people. But nowadays, computers are a sort of primary source too. You’ve got to learn to interrogate them effectively – and quote them meaningfully – too.

It’s great advice – playing with data and getting a feel for how to get the best out of it not only helps you find new things out, but also helps open your mind up to a more healthy appreciation of data. It allows you to explore the possibilities of data as well as its flaws, when it can be trusted and when it should be taken with a pinch of salt. And it’s things like this that contribute towards a sense of joyful skepticism that any self-respecting geek should possess (and you thought it was just about watching every episode of Battlestar Galactica).

I gave up programming as a full-time career more than three years ago but have still kept my hand in programming since, either for fun or to make work quicker and easier. Working in the digital and social media PR sector isn’t just about going to the pub (truth be told, it’s actually about going to very expensive pubs) but also about dealing with vast quantities of information – so you can see how programming can help. Making tasks faster is part of it, but the programming mindset is equally if not more important: it has taught me skills such as looking to optimise and make things quicker, filtering noise from the signal, reusing what you have to save effort in the future, and not being surprised by the unexpected.

So I’m going to say what Charles Arthur said, but bigger. If you work in any information industry, or are thinking about a career in it, learn to code. And by code I don’t mean learn something hardcore like Java or C++, or even learn a full programming language (as you’ll see below). But it does mean getting past the usual abstractions you see – your web browser, Word, Excel – and getting involved at a deeper level: getting to appreciate what the data you’re reading actually is, and realising it’s not just something to look at.

So where would I recommend getting started with programming, and with what? From my own weary geek’s viewpoint, here are six ways of getting into it – three of which really aren’t strictly programming at all:

Regular expressions. I cannot begin to think how many times these have bailed me out of an otherwise unrecoverable situation. Regular expressions are ways of finding and replacing text that are much more powerful than the bog-standard search. For example, you might want to get all the telephone numbers, or postcodes, out of a document, but they are all different, so a simple search wouldn’t be able to do it and you would have to do it by hand (and might miss one out).

A regular expression, on the other hand, can say “find me any group of eleven digits that begin with a 0, and either match the pattern 0xx xxxx xxxx, or 0xxxx xxxxxx” – and bingo, you have all your phone numbers. Get clever and you can even tell it to not worry about whether it’s a space or a hyphen in between the groups of numbers. Be careful – they can get complicated, so build them up slowly and step by step – and they can do unexpected things, so always back stuff up.
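To make that concrete, here is a minimal sketch in Python of the phone-number search described above. The pattern is only my reading of “eleven digits beginning with 0, grouped as 0xx xxxx xxxx or 0xxxx xxxxxx, with a space or a hyphen between the groups”; the name `find_phone_numbers` and the sample text (with made-up numbers) are purely illustrative, and real-world phone numbers come in more shapes than this.

```python
import re

# Assumed pattern: eleven digits starting with 0, in one of two groupings,
# with either a space or a hyphen between the groups.
PHONE_RE = re.compile(
    r"\b0\d{2}[ -]\d{4}[ -]\d{4}\b"  # 0xx xxxx xxxx
    r"|\b0\d{4}[ -]\d{6}\b"          # 0xxxx xxxxxx
)


def find_phone_numbers(text):
    """Return every phone-number-shaped string in text, in order of appearance."""
    return PHONE_RE.findall(text)


# Made-up sample text for illustration only.
sample = (
    "Call the newsdesk on 020 7946 0123, or the press office "
    "on 01632-960456. Our reference number is 12345."
)
print(find_phone_numbers(sample))
```

Note how the pattern is built up from two small alternatives rather than one clever monster, which is the “slowly and step by step” approach in practice.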

CSV. Many people work with Excel spreadsheets, and while Excel is great for tabulating data, its files aren’t a very portable format. Often you want to copy data in or out of Excel into other applications and it ends up being a horrible mess of numbers separated by spaces and tabs that you have to re-align yourself. CSV (comma-separated values) is the very boring but portable way of getting data in and out of Excel – it just consists of text with no styles, with commas marking the boundary between each column.

CSV looks like shit but it makes up for it by being extremely portable and lightweight. Combine it with the regular expressions above and you’re able to take the useful data out of a horrible mess, replace everything between it with commas, and import it straight into a spreadsheet. Or vice versa – extracting numbers out of the spreadsheet and allowing other apps to play with them (like I did with the general election map).
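As a rough illustration of the mess-to-CSV trick, the sketch below uses a regular expression to pick the columns out of unevenly spaced text and Python’s standard `csv` module to write them back out with commas. The data, the column names and the function name `messy_to_csv` are all invented for the example; your own mess will need its own pattern.

```python
import csv
import io
import re

# Invented sample: columns separated by an uneven mix of spaces and tabs.
messy = (
    "North Ward   12\t3400\n"
    "South Ward\t7   2150\n"
    "East Ward  21 \t 9875\n"
)

# Assumed row shape: a name (which may contain single spaces), then two numbers.
ROW_RE = re.compile(r"^(.+?)\s+(\d+)\s+(\d+)\s*$")


def messy_to_csv(text):
    """Turn whitespace-aligned rows into CSV text, skipping lines that don't fit."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ward", "seats", "votes"])  # hypothetical column names
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            writer.writerow(match.groups())
    return out.getvalue()


print(messy_to_csv(messy))
```

Letting the `csv` module do the writing, rather than gluing commas in by hand, means any value that itself contains a comma gets quoted properly.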

Yahoo! Pipes. I am still waiting for Yahoo to piss this one up against the wall like they have done with Technorati and Flickr. So much of the web already runs on RSS (Really Simple Syndication) – streams of links and articles – that being able to manipulate them like this is a real boon. Yahoo! Pipes takes RSS feeds and allows you to merge them together, filter them, cross-reference them and more. When I was looking for a job last year I used a series of Pipes to pull feeds from various job websites, filter out the kinds of jobs I didn’t like, and then remove the duplicates so I wasn’t wasting my time – all delivered to Google Reader for easy perusal, as and when they came in. The interface is as reasonably usable as you could expect and has led to some really useful apps being created.
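Pipes does all this with boxes and wires rather than code, but the underlying idea (merge feeds, filter out what you don’t want, drop the duplicates) is simple enough to sketch in Python. This is not how Pipes itself works under the bonnet; the feeds, the example.com links and the function names below are all made up, and a real version would fetch the feeds over the web rather than read them from strings.

```python
import xml.etree.ElementTree as ET

# Two invented RSS feeds, held as strings so the example needs no network access.
FEED_A = """<rss version="2.0"><channel>
<item><title>Web developer, London</title><link>http://example.com/jobs/1</link></item>
<item><title>Telesales executive</title><link>http://example.com/jobs/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<item><title>Web developer, London</title><link>http://example.com/jobs/1</link></item>
<item><title>Python programmer, Leeds</title><link>http://example.com/jobs/3</link></item>
</channel></rss>"""


def parse_items(rss_xml):
    """Yield (title, link) pairs from an RSS 2.0 document held in a string."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        yield item.findtext("title", default=""), item.findtext("link", default="")


def merge_filter_dedupe(feeds, unwanted_words):
    """Merge feeds, skip titles containing unwanted words, drop repeated links."""
    seen_links = set()
    results = []
    for rss_xml in feeds:
        for title, link in parse_items(rss_xml):
            if any(word in title.lower() for word in unwanted_words):
                continue  # filtered out
            if link in seen_links:
                continue  # duplicate of something already kept
            seen_links.add(link)
            results.append((title, link))
    return results


for title, link in merge_filter_dedupe([FEED_A, FEED_B], ["telesales"]):
    print(title, link)
```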

JavaScript. JavaScript is an indispensable part of the web, although originally it seemed destined for little more than launching popups and stupid messages in your status bar. Now virtually every page is interactive in some way, and JavaScript’s true power is being exploited. One of the most obvious ways of getting it to work for you is Greasemonkey, which allows you to add scripts to change the behaviour of what appears on your screen – such as making Google Reader more readable, getting rid of Xeni Jardin, or (with the help of regular expressions) making postcodes turn into links to maps.

Python. The biggie and the one I use the most. Python’s strengths lie in its simplicity – it’s human-friendly and runs on pretty much anything. It also has a sensible structure and organisation, which teaches you to code well and clearly. Finally, the vast range of libraries available means you can play with pretty much any data format; BeautifulSoup, for example, allows you to extract data from webpages easily. Python’s one drawback is its relatively poor documentation & tutorials, with some honourable exceptions such as Mark Pilgrim, so do hunt around and don’t let the technicalese put you off.
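To give a flavour of pulling data out of a webpage, here is a small sketch that collects every link and its text from a scrap of HTML. To keep it runnable with nothing extra installed, it uses Python’s built-in `html.parser` module as a stand-in rather than BeautifulSoup itself, which does the same job with rather less ceremony. The class name `LinkCollector`, the function `extract_links` and the sample HTML are all invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs for every <a href="..."> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the link we are currently inside, if any
        self._text = []    # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (text, href) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample page for illustration only.
page = (
    '<p>Read <a href="http://example.com/a">the <b>first</b> story</a> and '
    '<a href="http://example.com/b">the second</a>.</p>'
    '<a name="top">not a real link</a>'
)
print(extract_links(page))
```

Anchors without an `href`, like the last one in the sample, are deliberately ignored.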

Finally, PHP gets an honourable mention – an easy enough language to learn and widely used, but with so many evolutions and a complicated past the language is a mess, and it teaches several bad programming habits.

I wouldn’t recommend doing all six at once, or even ever, nor would I set expectations too high. In some respects, it’s not even about the code or the results you get – it’s as much about the philosophy and understanding it brings with it: that data is not a static thing but ours to play with, making us able to create wonderful new things or change society for the better.


11 Responses

The other “Geeky” thing I would recommend is… Take a GCSE or A-Level in statistics. Then learn to use Excel*.

I’ve been blown away by the number of sales / PR / journalists who’ve shown me a graph purporting to show that PRODUCT X is the next iPod, cures cancer, and makes more money than anything else in their portfolio.

A cursory look at the data – even _sorting_ the data – shows a completely different picture. It doesn’t take much to learn about standard deviation or any of the other basic statistical tools. The rewards, however, are outstanding. I’ve punctured myths, showed where real problems lie and been able to give credit to groups who would otherwise be overlooked.

Terence
*Or OpenOffice Calc. Or something similar.

Um, you propose learning JS and PHP. I agree (I’m learning both). But, um, wouldn’t a knowledge of HTML be pretty essential first for both of those?

Hell, every blogger should learn how to read and replicate HTML, at least at a basic level. So many can’t even be bothered to learn how to code an anchor, let alone anything useful. But yeah, stats and data manipulation should be basic skills for journalists. If they can’t look at the dataset themself and spot obvious flaws, then they can’t do the job.

I gave up IT for journalism in 1996, gave that up for an academic career in 2004, and I still get people asking me if I can help them with their computers. (The really annoying thing is, I usually can.)

But… having given up IT for journalism 13 years ago, I know not this PHP or Python of which you speak. (Well, obviously I’ve heard of them…) I tend to agree more with the beginning and the very end of the post (and Terence’s comment, in particular) than with the middle – I think it’s an attitude rather than a specific skillset. It’s looking at the data (or the press release) and thinking

a) what’s this really saying?
b) how can I get hold of the data?
c) when I’ve answered b), how can I play with it to answer a)?

The idea that there’s a difference between raw data and aggregate data, and that the raw data’s malleable – that to me is the real lightbulb moment. The more I think about it the more I lean towards Terence’s point – number-crunching and IT skillz aren’t the same thing & don’t necessarily go together (um… do they?).

Incidentally, the last time I approached Charles Arthur for journalistic advice, his advice was that if I wanted to sell stuff I should go out and talk to people and find things out – any idiot with a blog can write a column. This was probably sound advice. (I decided to stick with academia.)

amoebic vodka

Journalists realising that correlation != causation would be a pretty good first step on the data thing. Though I suspect the post is targeted at journalists who’ve already got that.

Good post Chris.

As a semi-geeky PR person, I’m still embarrassed by my lack of coding skills. Sure, I can mess around with a little bit of HTML and can get by amending Wikipedia pages, but I wish I knew a bit about PHP and Javascript too (I’ve never been gripped by Yahoo Pipes, despite having a couple of plays though).

Unfortunately, most PRs – even those that claim to be social media experts – aren’t that hot on the tech side. Maybe it’s something that should feature on PR, marketing and business degree courses…?

I couldn’t agree with you more on the general principle, but I disagree regarding regular expressions. They are evilly seductive. True enough, I use them to solve simple problems every day, but I’m too often tempted to use them for complicated problems. Then the words of Jamie Zawinski echo mockingly around my head…

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

I don’t think journalists need to learn how to code, any more than they need to learn how to make bread to be able to write about food or the baking industry. What they *do* need to know is not to take digital data at face-value, *understand* the digital experience instead of poking at it from the sidelines, have an understanding of how things are developed and indeed, be able to talk to geeks/programmers on their own level. Whether they need to learn how to program is a different matter.

And what happened to the days when BASIC and HTML was all you needed? *sigh*

First they came for the code monkeys, and I did nothing, because I was not a code monkey.
Then they came for the twitterers, and I did nothing, because I did not have twitter.
Then they came for the hackers, and I did nothing, because I do not hack.

And then they came for me, and I defeated and bamboozled them with my use of Oxford commas and a correct use of the Queen’s English.

Hey, we all need a skill and sometimes it’s better to build on your strengths and make them even better than building up weaknesses that will only ever be slightly less worserer than they were before.

But interesting thoughts all the same, Sir.

[...] various other blogs: Chris Applegate, who describes himself as a “wannabe polymath” chipped in with recommendations to get involved with regular expressions, comma-separated variables …. And Tom Armitage – who *IS* a programmer, even if he hotly denies it – suggested thinking like a [...]

[...] So you want to be a geek? Some interesting pointers here. Something I really need to look into in more detail. (tags: advice blog post chrisapplegate qwghlm geek programming data journalism coding) [...]

[...] code, because it’s a collection of really useful resources. For what it’s worth, I wrote a blog post nearly three years ago on things to get started on – though if I wrote it today I [...]