It’s only a temporary affair but OpenAustralia is not getting updated with the latest speeches in the House of Representatives and the Senate.
Why? Well, let me explain. It’s been a tumultuous few weeks behind the scenes here. If you use OpenAustralia you’re probably blissfully unaware of some changes that have taken place at the official online home of the Hansard at aph.gov.au which have caused us a great deal of grief.
Several weeks ago, a new system for accessing the Hansard at aph.gov.au was made live and the old system was immediately switched off. We had some warning that this was going to happen. Also, we were told by a person at the Department of Parliamentary Services (DPS) that the old system would be kept online for about a month after the switchover. Unfortunately, this isn’t what actually happened.
After the switchover nothing worked for us. Our parser that scrapes all the Hansard information depended very tightly on how the information was structured and everything had changed! So, nothing worked.
Many conversations ensued with the DPS, imploring them to turn on the old system again and at least give us some grace period to try to rewrite our parser to work with the new parlinfo search. Thankfully, after a few days they agreed to put the old system back up for a short period of time.
That allowed OpenAustralia to keep on working for a little while.
Then, for me, the fun truly started. I was faced with a new system that bore only a passing resemblance to the old one. The way that the Hansard was split into multiple pages had changed; the structure of the HTML markup had changed; the metadata associated with the pages had changed – everything had changed! Worse still, I soon discovered that there were some absolutely fundamental problems. Information was missing, such as whether a particular page is “procedural text”. Most pages were not valid XHTML: a typical page, when put through an HTML validator, came up with over 600 errors. I also discovered some instances where the text was in the wrong order, and even some where several sections of text from different places had been combined into one section.
Somehow I tried to work my way around each of these problems. I battled away at this for a few weeks making very slow and painful progress.
Then, I heard murmurings from the DPS that another solution might be coming. What might this be?
Three days ago, Friday last week, they added a new link to Hansard pages that allows you to download an XML file. This XML file is the underlying data that until now has only been used internally within the DPS. It is what comes out of the “Hansard Production System”, which is the people and systems that annotate and record the Hansard, and it is what goes into the web system. So, it has all the information required to truly make sense of the Hansard.
I had asked for access to the XML data in November of last year when I started working on what became OpenAustralia. I never heard anything back. Also, during phone calls with DPS I brought it up again but I never expected it to get anywhere. It turned out that at the same time Jason Wilson from GetUp’s Project Democracy had been asking for the same thing. So, huge thanks go to Jason Wilson and his team at GetUp for helping to get DPS to give us the Hansard XML data.
I dropped everything and have spent every waking moment since then working on rewriting the parser to work from the XML file. I’ve made good progress. Now, it’s Monday, but I don’t realistically think that it’s going to be anywhere near ready by tomorrow when the first of the Hansard from this most recent parliamentary day will appear.
So, please be patient while we fix this. We’ll do everything we can to make it as quick as possible.
And, of course, we’ll keep you posted.
5 Comments
Gah! Well at least they won’t be changing the XML format dramatically without having to change their own internal systems too.
Many thanks for all your hard work. I’m sorry to hear you had to waste all that time, why can’t they get their act together? Wonderful stuff.
Ditto. It’s really great to see this site up and much respect for your efforts.
I am seaching for some idea to write in my blog… somehow come to your blog. best of luck. Eugene
Everything’s up and running again! See http://www.openaustralia.org/news/archives/2008/11/03/government_websi