The problem with Language RSS

Shoshannah Forbes and I have been emailing back and forth about RSS adoption in right-to-left languages like Hebrew and Arabic. It fits so closely with what I'm going to say tomorrow at XTECH (where I happen to be right now, actually) that it's almost uncanny. I asked Shoshannah if I could blog her reply to my question about her RSS feeds. Basically, her RSS feeds include the HTML attribute dir to indicate the direction of the text, which makes them invalid and may break quite a few of the RSS readers out there. Anyhow, here is the email, complete with my agreements and additional comments. Please remember that, as usual, these comments are my own views and not those of my employer, BBC World Service.

Shoshannah Forbes wrote:

> The problem I am facing is simple:
> If I use valid RSS with no dir=rtl, then 99% of the RSS readers will display the text block as LTR, with punctuation, digits and English in the wrong locations, making the whole thing unreadable.
> When adding dir=rtl, at least I can get about 50% of the RSS readers to display the post body properly (titles are still a mess).

Agreed, but I feel there are two ways of looking at the problem. From your point of view it makes sense to include dir="rtl", because very few software developers are going to change their code to take this into consideration. We (the BBC World Service) have the might to speak to developers and get them to change their code. Even if we do not do it for ourselves, we owe it to our audience (my own feelings).
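
For readers who haven't seen the workaround, here is a rough sketch in Python of what a feed generator might emit. It assumes the dir attribute rides on an HTML wrapper inside the item's description; the email doesn't say exactly where Shoshannah's feeds put it, and the function name `rtl_item` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def rtl_item(title: str, body_text: str) -> ET.Element:
    """Build an RSS <item> whose description is HTML wrapped in dir="rtl"."""
    item = ET.Element("item")
    # The title is plain text, so there is nowhere to hang a dir attribute.
    ET.SubElement(item, "title").text = title
    # Assumption: the body is sent as HTML and wrapped in a right-to-left div.
    html_body = f'<div dir="rtl">{escape(body_text)}</div>'
    ET.SubElement(item, "description").text = html_body
    return item


item = rtl_item("שלום", "שלום עולם! RSS 2.0")
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

Note that the title has no equivalent hook, which fits with the titles still being a mess in readers that do honour the dir in the body.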

> I don't use unicode control characters for a few reasons:
> * They are a real pain to input: it is like entering the control characters for CR/LF or a <font> tag manually (but worse), and there are just too many places to enter them.

Yep, totally agree.

> * Most keyboard layouts do not have a direct way to enter them.

Yeah, we're using virtual keyboards for some languages and they're a nightmare!

> * They make a mess of the text- they are only used for the RSS, and unneeded for the editing or the html display, and can produce unexpected results when entered into the text.

Yep, agreed

> * There are many clients that incorrectly display them as visible characters in the text.

Yeah, it's a shame, and that will change, but it's too much trouble at the moment.

> * They make the text much more difficult to edit: if you change the text, you need to go back and change them as well. And since they are invisible, you get an awful lot of trial and error.

Indeed! You really need to understand them to edit with them. This would require extra training for our language services.

> * They force me to use explicit directionality, which complicates things and makes the text less portable.

Yeah, there is an idea of reuse throughout our language services. This is tricky already; who knows how much trickier it would be if the text carried Unicode directional characters too.

> * My web app that creates the RSS from my HTML does not know how to add them automatically.

Yep, I know my blogging app (Blojsom) supports Unicode directionality IF I put the characters in at the start, but then we're back to the editor problem of virtual keyboards and sticking in hidden characters! The same is true of the BBC World Service systems. We use XSL with Saxon, so if the characters are there, they should (not tested by myself) pass through to the RSS.
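
To make the "add them automatically" point concrete, here is a minimal sketch of what a feed generator would have to do: wrap each text block in RIGHT-TO-LEFT EMBEDDING (U+202B) and POP DIRECTIONAL FORMATTING (U+202C) on the way out, and strip the controls again before the text goes back into an editor. The function names are hypothetical; this is not how Blojsom or the World Service systems work.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Invisible bidi controls: the two marks, the embeddings/overrides, the isolates.
BIDI_CONTROLS = set(
    "\u200e\u200f"                  # LRM, RLM
    "\u202a\u202b\u202c\u202d\u202e"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"      # LRI, RLI, FSI, PDI
)


def wrap_rtl(text: str) -> str:
    """Wrap text in an explicit right-to-left embedding (only once)."""
    if text.startswith(RLE) and text.endswith(PDF):
        return text
    return f"{RLE}{text}{PDF}"


def strip_bidi_controls(text: str) -> str:
    """Remove every invisible bidi control, e.g. before re-editing the text."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


wrapped = wrap_rtl("שלום, RSS 2.0!")
print(repr(wrapped))
print(repr(strip_bidi_controls(wrapped)))
```

Even this toy version shows the trade-off: the wrapper is easy to add for a whole block, but stripping is all-or-nothing and would also throw away any control an author placed deliberately inside the text.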

> * Since they are rarely used in other contexts, I can't focus on the content when writing, and have to start thinking more closely about the presentation.

Yeah, indeed! Our language services are already busy as hell; Unicode directionality would just add a level of complexity on top of an already stressful job.

> * Moving from me to other users: most Hebrew/Arabic users don't know about them, and don't want to know. You try to explain to your mother that when she is writing in her weblog, she can't write in her usual manner, but has to enter these strange codes in a foreign language which have complicated rules (I have seen many pros get confused by these characters, so I don't expect laypeople to understand them).

Right on the nail! One of my points for tomorrow is that Unicode directionality is too damn difficult and very confusing! I expect some will challenge me about this tomorrow, and honestly I will just admit that if it's too difficult for me, it's even more difficult for others. Plus we should be making things easier for people, not harder. The barrier to entry should be at a level where your mum or my mum could use it and write it.

> * It doesn't scale: think about an Israeli blog hosting service. They want to offer RSS feeds for all the blogs, with minimum work for the users. Relying on unicode control characters just doesn't do it.

Yeah, plus from the Israeli blog hosting point of view, you want to get people going quickly and easily, not put them off with complex editing. It's the reason why Blogger does so well: three steps and you've got your own blog.

> * Since they are complex, it is difficult to create a GUI for entering them (unlike general RTL/LTR controls, which are available everywhere).

Yeah, it almost needs to be just like the direction attribute in HTML. I'm suggesting an attribute like this for RSS tomorrow.

> Not having the dir attribute in RSS gets rid of some markup- in favor of lower level much more complex control characters. A bad deal, IMO, and one which is a major cause for the problems when dealing with Hebrew/Arabic RSS.

Indeed, it was an ideal solution, but real-world use is too painful.

> I think that the root of the problem is that bidi is part presentation and part structure. And since even in the best of cases (for example, the automatic bidi control in recent Qt or GTK applications on Linux) there are still many, many cases that can *not* be covered reliably by the display algorithms of the software, I tend to think that for practical purposes, bidi is more structure than presentation.

Yeah, agreed. There's a lot of push to put bidi information in CSS rather than HTML, which is correct if you see bidi as presentation.
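
A small illustration of why automatic detection can't cover every case. The sketch below is a simplified version of the "first strong character" idea used by the Unicode Bidirectional Algorithm to guess a paragraph's base direction; the function name `base_direction` and the `default` parameter are made up for this example, and real display engines do a good deal more.

```python
import unicodedata


def base_direction(text: str, default: str = "ltr") -> str:
    """Guess direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew is R, Arabic letters are AL
            return "rtl"
        if bidi_class == "L":          # Latin and other left-to-right letters
            return "ltr"
    # Only digits, spaces or punctuation: nothing to go on.
    return default


print(base_direction("שלום world"))   # Hebrew first
print(base_direction("2005: שלום"))   # digits and punctuation are skipped
print(base_direction("RSS שלום"))     # Latin first, although the post is Hebrew
```

The last call is the interesting one: a Hebrew sentence that happens to open with an English word gets guessed as left-to-right, which is exactly the kind of case where the author needs a way to state the direction as structure.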

> I sure wish there was a way in RSS to tell the client “this element is RTL” or “this area is LTR” without resorting to HTML hacks. But at the moment, those hacks are the only practical tool I have to get at least *some* of the readers out there to display the text properly (more like “mostly properly”).

I feel your pain, and I'm not even writing my own content in a right-to-left language! It's such a shame that HTML hacks are the only way we can move forward on this. The crux of my presentation and paper is that developers and content providers need to work much closer together, and that the RSS specification needs to make full use of attributes like xml:lang and maybe some other kind of attribute for direction.
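
As a rough idea of what that could look like, here is a Python sketch that puts xml:lang and a direction attribute on an RSS item. xml:lang is a standard XML attribute, but the `dir` attribute on `<item>` is purely hypothetical: it is not part of any RSS specification, and the helper name `make_item` is invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree spells xml:lang using the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_item(title: str, lang: str, direction: str) -> ET.Element:
    """Build an <item> carrying language and (hypothetical) direction."""
    item = ET.Element("item")
    item.set(XML_LANG, lang)
    # Hypothetical attribute: no RSS spec defines dir on <item>.
    item.set("dir", direction)
    ET.SubElement(item, "title").text = title
    return item


item = make_item("שלום עולם", "he", "rtl")
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

The point of the sketch is only that the information would live in markup the feed software writes once, not in invisible characters the author has to type.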

Author: Ianforrester

Senior firestarter at BBC R&D, emergent technology expert and serial social geek event organiser. Can be found at cubicgarden@mas.to, cubicgarden@twit.social and cubicgarden@blacktwitter.io