Openness in data formats

Me and Tantek

Tantek wrote this thought-provoking entry about data formats and openness, which I can't help both agreeing and disagreeing with. So first, his entry.

  1. ASCII is dependable. Project Gutenberg insists on publishing their e-books as plain ASCII text as Mark Pilgrim noted, and their reasons are solid.
  2. Compatible XHTML is now also dependable. In the 15+ years since its public introduction, I believe that HTML has established itself sufficiently prominently worldwide that I feel quite comfortable declaring that HTML will be accepted to be as reliable as ASCII in coming years. In particular, authoring what I like to call Compatible XHTML, that is, valid XHTML 1.0 strict that conforms to Appendix C, is IMHO the way to author HTML that will have longevity as good as ASCII. Note that files in most file systems have no sense of “MIME-type”, thus the winged-mythological-creatures-on-the-head-of-a-pin style arguments about text/html vs. application/xhtml+xml that are often used to discredit either HTML or XHTML (or both) are irrelevant for the most common case of keeping archives of files in file systems.
  3. Plain old XML (POX) formats in the long run are no better than proprietary binary formats. XML, both in technology and as a “technical culture” is too biased towards Tower of Babel outcomes. I've spoken on this many times, but in short, the culture surrounding XML, especially the unquestioned faith in namespaces and misplaced assumed requirement thereof, leads to (has already led to) Tower of Babel style interoperability failures. As this is a cultural bias (whether intentional or not) built into the very foundations of XML, I don't think it can be saved. There may be a few XML formats that survive and converge sufficiently to be dependable (maybe RSS, maybe Atom), but for now XHTML is IMHO the only long-term reliable XML format, and that has more to do with it being based on HTML than it being XML.
  4. Formats that are smaller (e.g. define fewer terms) tend to be more reliable.
  5. Formats that are simpler (e.g. define fewer restrictions/rules for publishers) tend to be more reliable.
  6. Formats that are more compatible with existing reliable formats tend to be more reliable, e.g. HTML worked well with existing systems that supported “plain text” (AKA ASCII)
  7. Formats that are easier to use, i.e. publish, and more immediately useful, rapidly become widely adopted, and thus become reliable as a breadth of software and services catches up with a breadth of published data in those formats.

The microformats principles were based on these observations. Now this doesn't mean I think microformats will replace existing reliable formats. Not at all. For example, I feel quite confident storing files in the following formats:

  • ASCII / “plain text” / .txt / (UTF8 only if necessary)
  • mbox
  • (X)HTML
  • JPEG
  • PNG
  • WAV
  • MP3
  • MPEG

So my take on Tantek's thoughts.

Plain old XML (POX) formats in the long run are no better than proprietary binary formats. I take issue with this one. I understand what Tantek is getting at, but I would say plain XML without a schema isn't leaning towards the Tower of Babel. And as Tantek already mentioned, RSS and Atom are pretty close to the non-Tower-of-Babel direction; I would also add FOAF and OPML to the list. I would love for SVG to be included too, but alas it's not.

Formats that are smaller (e.g. define fewer terms) tend to be more reliable. Good point, and it's exactly why formats should be broken down into modules, the way XHTML and SVG were through Modularization.

My list of formats is slightly different too.

  • XHTML (Unicode)
  • XML (Unicode)
  • JPEG
  • PNG
  • MP3 audio
  • MPEG-4 video
  • WAVE
  • SVG


Blojsom 3.0 adds database storage and an even stronger API

My favourite blogging server, Blojsom, is shifting to database storage for its next version. David Czarnecki, the owner of the open source project, outlined its very active history.

  • 01/29/2003 – blojsom project was registered on SourceForge and development was started.
  • 02/02/2003 – blojsom 1.0 was officially released. 18 releases were made in the 1.x cycle.
  • 09/10/2003 – blojsom 2.0 was officially released.
  • 06/28/2004 – Apple officially announces Tiger Server wherein blojsom is bundled as Weblog Server.
  • 03/14/2006 – blojsom 2.30 was officially released. 30 releases have been made in the 2.x cycle.

I remember running Blojsom betas; I think I started at Blojsom 0.7, when it could only handle one blog at a time. Then Blojsom 2.x came around and gave the whole project a real boost, because it could easily handle many blogs under one install. I think the record is still 25,000 blogs, by some university in Australia. During the 1.x life of Blojsom, lots of plugins were developed and Blojsom was seriously deconstructed by the guys at HP research labs as part of their semantic blogging project. It's one of the things I loved about Blojsom: its nod towards something bigger than simply blogging. Jon Udell did a talk about controlling our own data at ETech recently, and one of the snippets I heard was that he would run XPath searches over his blog to pull out certain things. It's a step beyond tagging, but something Blojsom has had for quite some time (Q3 2003, actually). Blojsom also has some other great stuff going for it, like LDAP support!

Anyway, it's an awesome blogging server and I believe Blojsom 3.0 will be better than WordPress. It's outgrown its roots in Blosxom, which I believe is now struggling to stay around, and outgrown all the other Java solutions like Roller and SnipSnap. Being Java-based will keep it out of the mainstream, because most people have a LAMP setup with their hosting provider, but otherwise Blojsom 3.0 would be an even bigger deal. In any case, here are more details on Blojsom 3.0 from David's post:

The first major change has been in the way blojsom is “wired” together. I've rewritten blojsom to use Spring for its dependency injection and bean management. There were aspects of the blojsom 2.x codebase that were more “patchwork” with respect to how certain components used or referenced other components.

The second major change has been in the datastore. I don't necessarily think I've exhausted all that can be done using the filesystem as a content database, but I've been feeling like there's a lot of development energy going into making relations between data in the filesystem that can be expressed very easily using a relational database.

In blojsom 3.0, I've settled on using a relational database for the datastore. I'm using Hibernate as the ORM library to manage the data. This means goodbye to all the .properties files for configuration! It was fun while it lasted. The templates and themes are still stored on the filesystem, but I'd envision also storing the template data within the database as well. I've already prototyped use of the Velocity database template loader. I imagine removing any filesystem dependency will allow blojsom to be used in a clustered environment more easily.

Ultimately I think this will allow blojsom to scale much more than I think it can using the filesystem as a content database. I don't believe there are any esoteric relationships among the data in blojsom as to require a full-time DBA to manage an installation of blojsom.

The last major change has been in evolving blojsom's API.

For a while now there have been aspects of the API that were a throwback to needing certain data or referring to elements a certain way. I just wanted a more self-documenting and less redundant API.

For example, I've renamed the BlojsomPlugin interface to Plugin. I felt that having the org.blojsom.plugin package was declarative enough, but that keeping BlojsomPlugin was too redundant. None of the APIs have gone away, they're just simpler and more straightforward.

The long and short of it is that you can do all of the things in blojsom 3.0 that were done in previous releases of blojsom. There are a few more components and plugins to migrate to 3.0, but I'm happy with how far things have come in such a short time given the scope of the changes.

You're more than welcome to start playing with blojsom 3.0 right now. All that you need to do after setting up your database is to add a blog and a user for that blog and you'll be able to login through the administration console.

If any of this interests you, feel free to participate on the blojsom-developers mailing list.

Being hosted with Hub.org, it would be wrong for me not to choose PostgreSQL for my database backend. I would love to try other storage backends like an XML database, but I can't really experiment with this blog till I've tested it fully. Maybe there will be a way to run one blog on a database and another on the filesystem or an XML database? That would be great. If worst comes to worst, I will just run another copy of Blojsom for testing purposes.
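For anyone who hasn't played with an ORM before, here is a rough sketch of the kind of thing David is describing: a mapped entity class plus a small store that persists it through Hibernate. The class and field names are my own invention for illustration, not blojsom's actual code, and the SessionFactory is assumed to have been wired up elsewhere (by Spring, in blojsom's case).

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

// A hypothetical blog-entry entity; the names are mine, not blojsom's.
@Entity
public class BlogEntry {

    @Id
    @GeneratedValue
    private Long id;            // primary key, generated by the database

    private String title;
    private String body;

    protected BlogEntry() { }   // Hibernate needs a no-arg constructor

    public BlogEntry(String title, String body) {
        this.title = title;
        this.body = body;
    }
}

// Persisting an entry once a SessionFactory has been wired up elsewhere.
class EntryStore {

    private final SessionFactory sessionFactory;

    EntryStore(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;   // injected, not read from a .properties file
    }

    void save(BlogEntry entry) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        session.save(entry);    // Hibernate generates the SQL; none lives in the application
        tx.commit();
        session.close();
    }
}

The nice side effect is that the choice of backend (PostgreSQL in my case) becomes mostly a matter of the Hibernate dialect and JDBC connection settings, rather than anything in the application code.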


Semantically changing cubicgarden

This page is XHTML 1.1 valid

It's been all of about a week since I wrote anything. I've been quite busy, but I've actually been working on this blog. I've changed the structure of the pages, which does cause some problems for those of you using Internet Explorer, but most of you are reading via RSS/Atom so it's low on my list of fixes. I've also finally sorted out most of the issues with why the site didn't validate. As you can see, it now validates. This won't always be the case, due to that well-talked-about entity problem with copied and pasted URLs.

I'm also going to try and use microformats more than I have in the past. I've not dumped OPML for outlining, but I like XOXO and am actively looking for an application which supports it for quick editing. In the past I was using JOE (Java Outline Editor), which is great because it allows you to run Python scripts which can do many things, but it hasn't had many updates of late. So can anyone suggest a XOXO editor besides the JavaScript one? If not, there are XSLs to convert between OPML and XOXO, so I'm not that worried.
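For what it's worth, driving one of those stylesheets doesn't take much. Here is roughly what the glue looks like in Java using the standard javax.xml.transform API; the file names (opml2xoxo.xsl and so on) are just placeholders rather than any particular published stylesheet.

import java.io.File;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Apply an OPML-to-XOXO stylesheet to an outline file.
// The file names are placeholders; point them at your own outline and stylesheet.
public class Opml2Xoxo {

    public static void main(String[] args) throws Exception {
        File stylesheet = new File("opml2xoxo.xsl");
        File outline = new File("outline.opml");
        File result = new File("outline.xhtml");

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet));
        transformer.transform(new StreamSource(outline), new StreamResult(result));
    }
}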


Tim Berners-Lee Semantic web lecture

Tim Berners-Lee in Oxford

After the mad panic trying to get the train up to Oxford (the Trainline machine at work wasn't working), we arrived at the Oxford University venue well before the start time and picked a great spot for the lecture. Tim Berners-Lee was good to see live; you could tell he is certainly no Steve Jobs. He was more like Bill Gates: a little uneasy with public speaking, but happy to talk about his vision and his work towards it. Rather than explain every aspect of the talk, it's best I point you towards Tim's S5 presentation, a webcast (coming soon), this blog and my notes. I've also added my photos from the lecture to Flickr.

So generally I'm even more sure that the semantic web is happening, but within certain domains. Will the semantic web happen across the whole web? Doubtful at best. Recent developments in Web 2.0 have really pushed the web towards a richer semantic web, but away from top-down ontologies and rules.

Oh, and believe it or not, Miles and I were quoted in the New Statesman blog.


Live Clipboard from Microsoft

Before I've even had the chance to play with Microsoft's Simple Sharing Extensions, Ray Ozzie has shared a prototype they have been playing with internally. It's called Live Clipboard, and it's basically a clipboard for the semantic web.

It's a JavaScript-based solution which works in most browsers, like Internet Explorer and Firefox. It stores data on the page as actual XML data trees, which can be copied and pasted without having to select the text content. It's a difficult concept to explain, but luckily Ray's got tons of screencasts to show how it works. The interesting thing is that Live Clipboard works not only in the browser but also on the desktop. Thanks to 25hours a day for the ETech trip report which alerted me to Live Clipboard in my RSS reader today.

Honestly, when I first read the post I did think this would be perfect as a Firefox extension or even a Greasemonkey script, but you would miss out on the desktop side of things. I'll be interested to know how flexible Live Clipboard is. For example, will it read all types of microformats? How about FOAF and XFN? Hmm, I wonder if you could do something between a Firefox extension and a Yahoo! Widget?


An XSL transformation mindset

Someone asks on Metafilter.

When you imagine XSLT transformations happening in your mind's eye, what does it look like?

It's a really good question, and it opens up a whole range of thinking about the differences in people's thought processes. So first, Jeff talks about the question.

This is a very powerful question to ask, because ancient, procedurally oriented developers like me sometimes have trouble following the non-linear, pattern-driven processing that takes place when an XSLT template is applied to a tree of XML elements. In fact I have noticed that non-developers sometimes have an easier time with XSLT than do experienced developers, because they don't try as hard to figure out what is happening beneath the covers.

I would kind of agree with that statement. There's something about XSL and XML which just makes sense in my head. I'm not from a traditional software or computer science background, so I still find it weird to be called a programmer by some of my peers. John wrote this fantastic comment.

My first project with XSLT a few years back was to actually generate XSLT *from* XML and XSLT, and it forced me to break my ideas of how it worked. When I finally got the whole “it happens all at once” approach, it started to make sense. However, every programmer that I've brought on board to an XSLT project since has had trouble getting out of the procedural thinking, and that ends up being the biggest source of their mistakes.

Unfortunately, like MagicEye images, some people just aren't able to unfocus their minds in the right way to really grok XSLT beyond the simplest examples.

I have heard programmers compare XSLT to Prolog and even Lisp. I'm not sure how true this is, but it's certain that you can't approach XSLT in a conventional way. Recursion is one of those things which seems to drive people mad, and in XSL there's a lot of recursion and declaration, which seems to fit the way I think. I've always wanted to create an SVG of an XSLT process, so you could see in lines and boxes which templates are being called and add some kind of dimension to XSL. I'm sure it's not that hard; even my experiments with transforming Cocoon's sitemap file into SVG didn't require too much work. Talking of recursion, someone posted this nice animated GIF of how it all works. There's no doubt that XSL requires a different mindset, and habits from a procedural language like Java or Perl will be more of a hindrance than an advantage.
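The way I sometimes explain it to procedural programmers is that apply-templates is really a recursive walk over the tree where the processor, not you, decides which rule fires for each node. Below is a deliberately procedural caricature of that in Java; the element names are made up, and the whole point of XSLT is that you only declare the right-hand side of this dispatch and never write the walk yourself.

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// A procedural caricature of what an XSLT processor does with apply-templates:
// visit a node, pick the rule that matches it, and recurse into the children.
public class TemplateWalk {

    static void applyTemplates(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element) node;
            // In XSLT this dispatch is the match="..." attribute on each template;
            // these element names are invented for illustration.
            switch (element.getTagName()) {
                case "section":
                    System.out.println("<div class=\"section\">");
                    applyChildren(element);
                    System.out.println("</div>");
                    break;
                case "title":
                    System.out.println("<h2>" + element.getTextContent().trim() + "</h2>");
                    break;
                default:
                    applyChildren(element);   // the built-in "keep walking" rule
            }
        }
    }

    static void applyChildren(Node node) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            applyTemplates(children.item(i));   // recursion is the engine of the whole thing
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><section><title>Formats</title></section></doc>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        applyTemplates(doc.getDocumentElement());
    }
}

In real XSLT each of those cases would just be its own template with a match attribute, and the walk itself disappears.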

I posted this question to a few of the XSL developers I know and got a variety of answers. In my own mind, I see lots of lines and trees which get broken down into branches.


Tagging which way? How about my way?

Story telling fest

Looking through my “to read at some point in the future” tagged category in Great News, I found this useful summary of the current problem with tagging online. Tag formats: Can’t we all just get along? covers the main tagging applications online and shows the confusion between space-separated keywords and the comma-separated method.

So where do I fall on this issue? Well, although I use Flickr and Del.icio.us almost every day, I think they could both benefit from using commas to separate tags. All the latest tagging services I've used have gone with commas, because they make a lot more sense. As Victor says in the comments:

commas are faster than quotes.

as i see it (in my own experience) tags can be annoying if you don’t really care about them when you have to enter them. Usually you care about them later on, when you cannot find what you’re looking for. but they’re still a(nother) time-consuming task.

i’d use fast, thus i’d use commas.

The only thing which puts me off commas is the language issue: some languages use commas for other things. There was a suggestion to use semicolons, but I feel that would go down like listening to your iPod in a church service. Other solutions I've seen around the web include auto-sensing spaces or commas, and the Amazon box-model type thing, which I personally think sucks because it takes too long to fill the boxes in. I wonder why no one's written a Greasemonkey script to let people pick a method which gets translated across all tagging services, so I can type commas into Flickr and it just translates them into spaces for me. Yeah, it's very lazyweb stuff, but as FataL points out, this can't be that hard.

Computer now smart enough to parse them all:
south asia, africa = [south asia] [africa]
“south asia” africa = [south asia] [africa]
‘south asia’ africa = [south asia] [africa]
(south asia) africa = [south asia] [africa]
south asia – africa = [south asia] [africa]
It’s not so hard to program all this I believe.
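He's right that it isn't hard. Just to prove the point to myself, here is a rough sketch of that kind of forgiving parser in Java. It's my own stab at the rules in FataL's examples, not any service's actual code: quoted or bracketed phrases become single tags, and whatever is left gets split on commas or on a dash used as a separator.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A forgiving tag parser along the lines FataL describes: quoted or bracketed
// phrases become single tags, and the rest is split on commas or dashes.
public class TagParser {

    // "south asia", 'south asia' or (south asia) captured as one phrase.
    private static final Pattern GROUP =
            Pattern.compile("\"([^\"]+)\"|'([^']+)'|\\(([^)]+)\\)");

    public static List<String> parse(String input) {
        List<String> tags = new ArrayList<>();
        StringBuffer rest = new StringBuffer();

        // Pull out the explicitly grouped phrases first.
        Matcher m = GROUP.matcher(input);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    tags.add(m.group(g).trim());
                }
            }
            m.appendReplacement(rest, ",");   // leave a separator where the phrase was
        }
        m.appendTail(rest);

        // Whatever is left splits on commas, or on a hyphen/en dash used as a separator.
        for (String piece : rest.toString().split(",|\\s[-\u2013]\\s")) {
            String tag = piece.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(parse("south asia, africa"));         // [south asia, africa]
        System.out.println(parse("\"south asia\" africa"));      // [south asia, africa]
        System.out.println(parse("(south asia) africa"));        // [south asia, africa]
        System.out.println(parse("south asia \u2013 africa"));   // [south asia, africa]
    }
}

Wrapping that sort of thing in a Greasemonkey script, so each service sees whichever separator it expects, is exactly the lazyweb request above.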


Firefox 1.5 now out but with limited SVG support

Firefox 1.5 released

Firefox 1.5 is released, hooray! And it's the same as Firefox 1.5 RC3, which I've been using for a while now, hooray again… But it doesn't have full support for the SVG 1.1 Full, Tiny or Basic profiles. This is a crying shame, but it still marks another step forward for SVG on the desktop. The version with full SVG support is still in development and should be available in Firefox 3, according to SVG news. At least SVG is doing much better in the mobile space: almost 100 phones and counting.

If you want to see what's possible with Firefox 1.5 and SVG, do check out the Canvas painter demos which are popping up everywhere. Vladimir has a link to the best ones.


All your bases belong to Google

This entry by Greg at Blogdigger, titled Someone set up us The Bomb, is excellent. I honestly hadn't really looked into Google Base, because the idea of marking up my data just for Google gives me the creeps, but Greg's angle gives me an even creepier feeling.

In an effort to push things in the proper direction, a small group of individuals and companies began working on ways to structure information, in an attempt to prevent SDL (Semantic Data Loss) and create better search in the process. The history here goes back quite a bit, so I'll skip to the end, which is often called datablogging, microformats and/or structured blogging, all of which attempt to make the process of capturing the meaning of content easier both for the producer and the consumer. Things were moving along nicely in that direction; Google Base, however sends a proverbial “Make your time” to all those services, since Google Base essentially allows content producers to explicitly tell Google what all those little bits of data mean and how to interpret them.

Greg is right, but this is the dilemma: Google is offering a solution for putting large amounts of structured data online, while datablogging hasn't gone that far and microformats, for as much as I love them, are still an afterthought when blogging. I mean, I'm an XML guy and I usually write the text, add the basic links, etc., then some tags and maybe trackbacks. The adding of microformats usually comes afterwards; imagine what most people do.

We really need to start adding microformats to blogging applications, and soon.
